Build Impala from Scratch

This article shows how to build Cloudera Impala from scratch: compiling it, linking it, and running it.


My Environment:

CentOS 6.5

No network proxy (building behind a proxy causes many download failures)


Impala Commit (Version):

Change-Id: I3b5cefb4d7193045fc6fc5e94766589c2299b5b1

commit d90f3f3fd1a134578b1860be1b2f41a57a8d8896 (parent ee40ba2)


 

 

Get Impala source:

git clone https://github.com/cloudera/impala

 

 

1. Install tools:

sudo yum install boost-test boost-program-options libevent-devel automake libtool flex bison gcc-c++ openssl-devel \
make cmake doxygen.x86_64 glib-devel boost-devel python-devel bzip2-devel svn cyrus-sasl-devel \
wget git unzip

 

 

2. Uninstall the old Boost library that ships with CentOS 6.5 (yum remove boost), and install Boost 1.46.1.

For example:

export BOOST_ROOT=/usr/local/boost_1_46_1

 

cd boost_1_46_1/

./bjam threading=multi --layout=tagged

./bjam threading=multi --layout=tagged install
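If bjam is not already installed, it can be built from the unpacked Boost source first (a minimal sketch, assuming the boost_1_46_1 tarball has already been downloaded and extracted):

cd boost_1_46_1/
./bootstrap.sh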

 

 

3. Install LLVM 3.3:

 

wget http://llvm.org/releases/3.3/llvm-3.3.src.tar.gz

tar xvzf llvm-3.3.src.tar.gz

cd llvm-3.3.src/tools

svn co http://llvm.org/svn/llvm-project/cfe/tags/RELEASE_33/final/ clang

cd ../projects

svn co http://llvm.org/svn/llvm-project/compiler-rt/tags/RELEASE_33/final/ compiler-rt

cd ..

./configure --with-pic

make -j4 REQUIRES_RTTI=1

sudo make install

 

 

4. Set up the JDK path (environment variables in /etc/bashrc or ~/.bash_profile).
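For example, a minimal sketch of the relevant lines (the JDK install path below is an assumption; point JAVA_HOME at your own JDK):

export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH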

 

 

5. Install Maven 3.0.4:

 

wget http://www.fightrice.com/mirrors/apache/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz

tar xvf apache-maven-3.0.4-bin.tar.gz && sudo mv apache-maven-3.0.4 /usr/local

 

Update ~/.bashrc and add the environment variables:

 

export M2_HOME=/usr/local/apache-maven-3.0.4

export M2=$M2_HOME/bin

export PATH=$M2:$PATH

 

source ~/.bashrc

mvn -version

 

 

6. Check the paths in bin/set-classpath.sh and build Impala:

 

cd $IMPALA_HOME

./buildall.sh

 

 

7. If there are linker errors such as "cannot find -lboost_date_time", change the flag to -lboost_date_time-mt in the generated Makefiles.

Errors for other Boost libraries are similar: switch them all to the "-mt" variants or upgrade the library; these errors are easy to look up on Google.
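One way to apply such a rename across the generated build files is a search-and-replace (a sketch; the be/ directory and the exact flag are examples, so review the matches before running it):

# rewrite the plain Boost library flag to its -mt variant in the generated Makefiles
grep -rl "lboost_date_time" be/ | xargs sed -i 's/-lboost_date_time\b/-lboost_date_time-mt/g'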

 

 

8. easy_install prettytable

easy_install thrift

 

(9. Build the third-party libraries; this appears to happen automatically in newer Impala versions.)

 

(10. If the build asks for it, download setuptools-5.1.zip and install it; see the sketch below.)
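A minimal sketch of installing setuptools from the zip (assuming the archive has already been downloaded):

unzip setuptools-5.1.zip
cd setuptools-5.1
sudo python setup.py install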

 


 

Configure and start Impala (the hostnames, IP addresses, and paths below should be customized for your machine):

 

1. The /etc/hosts:

 

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4

::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

172.16.24.132   master

 

 

2. Configure and run HDFS/Hadoop:

 

(1) Create the Hadoop data directory (in this case, file:///home/yc/hdfs).

(2) Create the /var/run/hadoop-hdfs directory. (A sketch of steps (1) and (2) follows the path below.)

(3) Configure the Hadoop XML files in the following directory:

 

$IMPALA_HOME/thirdparty/hadoop-2.0.0-cdh4.5.0/etc/hadoop
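A minimal sketch of steps (1) and (2), assuming the daemons run as the current user (adjust ownership and permissions to your setup):

mkdir -p /home/yc/hdfs
sudo mkdir -p /var/run/hadoop-hdfs
sudo chown $USER /var/run/hadoop-hdfs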

 

 

The configured hdfs-site.xml should look like the following; pay attention to the "dn.50010" socket path, whose parent directory was created in step (2).

 

hdfs-site.xml

========================================================================

<configuration>

 

<property>

<name>dfs.client.read.shortcircuit</name>

<value>true</value>

</property>

 

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

 

<property>

<name>dfs.datanode.hdfs-blocks-metadata.enabled</name>

<value>true</value>

</property>

 

<property>

<name>dfs.datanode.data.dir</name>

<value>file:///home/yc/hdfs</value>

</property>

 

<property>

<name>dfs.client.use.legacy.blockreader.local</name>

<value>false</value>

</property>

 

<property>

<name>dfs.datanode.data.dir.perm</name>

<value>750</value>

</property>

 

<property>

<name>dfs.block.local-path-access.user</name>

<value>root</value>

</property>

 

<property>

<name>dfs.client.file-block-storage-locations.timeout</name>

<value>5000</value>

</property>

 

<property>

<name>dfs.domain.socket.path</name>

<value>/var/run/hadoop-hdfs/dn.50010</value>

</property>

 

<property>

<name>dfs.client.file-block-storage-locations.timeout.millis</name>

<value>10000</value>

</property>

 

</configuration>

========================================================================

 

 

core-site.xml

========================================================================

<configuration>

 

<property>

<name>hadoop.native.lib</name>

<value>true</value>

<description>Should native hadoop libraries, if present, be used.</description>

</property>

 

<property>

<name>fs.default.name</name>

<value>hdfs://master:9000</value>

</property>

 

<property>

<name>dfs.client.read.shortcircuit</name>

<value>true</value>

</property>

 

<property>

<name>dfs.client.use.legacy.blockreader.local</name>

<value>false</value>

</property>

 

<property>

<name>dfs.client.read.shortcircuit.skip.checksum</name>

<value>false</value>

</property>

 

<property>

<name>hadoop.tmp.dir</name>

<value>/home/yc/hdfs/tmp</value>

<description>A base for other temporary directories.</description>

</property>

 

</configuration>

========================================================================

 

 

 

 

 

yarn-site.xml

========================================================================

<?xml version="1.0"?>

<configuration>

<!-- Site specific YARN configuration properties -->

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

<property>

<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

<value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

</configuration>

========================================================================

 

 

mapred-site.xml:

========================================================================

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

</configuration>

========================================================================

 

 

hive-site.xml:

========================================================================

<configuration>

</configuration>

========================================================================

 

 

 

Copy core-site.xml, hdfs-site.xml, and hive-site.xml into $IMPALA_HOME/conf.
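For example (a sketch, assuming the three files sit in the current directory after editing):

mkdir -p $IMPALA_HOME/conf
cp core-site.xml hdfs-site.xml hive-site.xml $IMPALA_HOME/conf/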

 

 

Configure $IMPALA_HOME/bin/set-classpath.sh as follows:

 

========================================================================

CLASSPATH=\
$IMPALA_HOME/conf:\
$IMPALA_HOME/fe/src/test/resources:\
$IMPALA_HOME/fe/target/classes:\
$IMPALA_HOME/fe/target/dependency:\
$IMPALA_HOME/fe/target/test-classes:

for jar in `ls ${IMPALA_HOME}/fe/target/dependency/*.jar`; do
CLASSPATH=${CLASSPATH}:$jar
done

export CLASSPATH

========================================================================

 

 

 

Format HDFS:

$IMPALA_HOME/thirdparty/hadoop-2.0.0-cdh4.5.0/bin/hdfs namenode -format

 

$IMPALA_HOME/thirdparty/hadoop-2.0.0-cdh4.5.0/sbin/start-all.sh

 

Optionally, run a DataNode in the foreground:

./bin/hdfs datanode

 

Run the "jps" command; you should see something like the following (the numbers will differ):

 

1404

54375 NameNode

54646 NodeManager

558 SecondaryNameNode

54663 ResourceManager

6545 Jps

384 DataNode

54695 NodeManager

54727 NodeManager

2061

65490 NameNode

 

 

Create directories in HDFS:

 

$HADOOP_HOME/bin/hdfs dfs -mkdir  /tmp

$HADOOP_HOME/bin/hdfs dfs -mkdir  /user

$HADOOP_HOME/bin/hdfs dfs -mkdir  /user/impala

$HADOOP_HOME/bin/hdfs dfs -mkdir  /user/impala/tab1

 

 

Put data into HDFS:

 

$HADOOP_HOME/bin/hdfs dfs -put ./tab1.csv /user/impala/tab1
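If you do not have a sample file yet, a minimal sketch (the file name matches the command above, but its column layout here is purely an illustrative assumption):

cat > tab1.csv <<'EOF'
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
EOF
$HADOOP_HOME/bin/hdfs dfs -put ./tab1.csv /user/impala/tab1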

 


 

3. Start Impala Daemons:

 

cd $IMPALA_HOME

 

./be/build/debug/statestore/statestored

 

./bin/start-impalad.sh

 

./bin/start-catalogd.sh

 

(There is no need to start Hive.)

 

Start Impala Shell:

 

./bin/impala-shell.sh

 

(Note: there is also no need to start $IMPALA_HOME/thirdparty/hive-0.10.0-cdh4.5.0/bin/hiveserver2.)
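To verify that the daemons are up, a quick sanity check from another terminal (a sketch; the impalad debug web port below is an assumption about the default build settings):

ps aux | grep -E 'impalad|statestored|catalogd' | grep -v grep
curl -s http://localhost:25000/ | head    # impalad debug web UI, if enabled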

 


 

References:

Building an older Impala version: https://github.com/tomdz/impala

http://hi.baidu.com/huareal/item/52be8401cf349729a1312d66

http://hi.baidu.com/huareal/item/d651821043df5cfa86ad4eff

http://www.blogjava.net/ivanwan/archive/2006/05/18.html

Impala HBase Interaction

A summary of how HBase and Impala connect and interact.

 

 

(1) A very simple example of interaction between HBase and Impala.

 

 

  1. Configure HBase as usual (edit hbase-site.xml and start the daemons; use "jps" to check that they are running).

 

 

  2. Create a table in HBase using the following commands in the HBase shell (command: "hbase shell"):

 

——————————————————————————

create 'a','ints'
enable 'a'

——————————————————————————

 

This creates a table keyed by the row key (key_ID) with a single column family called "ints".
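To have a row to query later, data can also be inserted from the shell (a sketch; the row key and value are arbitrary examples, and the hbase launcher is assumed to be on your PATH):

echo "put 'a', '1', 'ints:int_col', '42'" | hbase shell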

 

 

  3. In the Hive shell (command: "hive"), type the following command:

 

——————————————————————————

CREATE EXTERNAL TABLE a (
id int,
int_col int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
":key,ints:int_col"
)
TBLPROPERTIES("hbase.table.name" = "a");

——————————————————————————

 

 

This query creates an external Hive table backed by the HBase table, mapping the row key to the int column id and the "ints" column family to int_col.

This approach treats the row key as an int; it can also be treated as a string.

For details, see:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/Installing-and-Using-Impala/ciiu_impala_hbase.html

 

 

  4. In the Impala shell (command: "impala-shell"), first invalidate the metadata so that Impala picks up the newly created table:

 

——————————————————————————

INVALIDATE METADATA a;

——————————————————————————

 

Now we can run SQL queries from the Impala shell; a couple of examples are sketched below.
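A minimal sketch of such queries, run non-interactively (the inserted values are arbitrary examples):

impala-shell -q "select * from a limit 10;"
impala-shell -q "insert into a values (1, 42);"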

 

Warning:

Because this is an external table, dropping it from Impala or Hive does not touch the underlying HBase table at all.

 

 

 

 

 

 

 

 

 

 

(2) A more complex example of HBase and Impala interaction with multiple data types.

 

For Impala data type references, see:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_datatypes.html?scroll=float_unique_1

 

  1. Create a table in HBase using the following commands in the HBase shell (command: "hbase shell"):

 

——————————————————————————

create 'hbasealltypessmall', 'bools', 'ints', 'floats', 'strings'
enable 'hbasealltypessmall'
quit

——————————————————————————

 

 

  2. In the Hive shell (command: "hive"), type the following command:

 

——————————————————————————

CREATE EXTERNAL TABLE hbasestringids (
id string,
bool_col boolean,
tinyint_col tinyint,
smallint_col smallint,
int_col int,
bigint_col bigint,
float_col float,
double_col double,
date_string_col string,
string_col string,
timestamp_col timestamp)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
":key,bools:bool_col,ints:tinyint_col,ints:smallint_col,ints:int_col,ints:bigint_col,floats:float_col,floats:double_col,strings:date_string_col,strings:string_col,strings:timestamp_col"
)
TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall");

——————————————————————————

 

This query treats the row key id as a string and maps the 5 HBase column families to 11 Hive columns.

For example, tinyint_col, smallint_col, int_col, and bigint_col all map into the ints column family.

 

 

  3. In the Impala shell (command: "impala-shell"), first invalidate the metadata so that Impala picks up the newly created table:

 

——————————————————————————

INVALIDATE METADATA hbasestringids;

——————————————————————————

 

Now we can run SQL queries from the Impala shell.

 

 

 

References and other materials:

 

http://mapredit.blogspot.com/2013/05/query-hbase-tables-with-impala.html

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_impala_hbase.html

http://yanbohappy.sinaapp.com/?tag=hbase

http://doc.mapr.com/display/MapR/Working+with+Impala

Impala "insert" Performance

  1. Inserted 100001 rows in 7.28s

 

Table format:

(key_ID, int)

 

For example:

insert into a values (0,126),(1,173),(2,568),(3,164),(4,593),(5,788),(6,924),(7,206),(8,359),(9,690),(10,987),(11,231),(12,817),(13,122),(14,373),(15,177),(16,156),(17,256),(18,203),(19,38);

 

When the number of rows in a single statement reaches 1,000,000, parsing the query becomes very slow.

 

 

 

  2. Inserted 10000 rows in 14.11s

 

Table format:  (11 columns)

(string, boolean, double, float, bigint, int, smallint, tinyint, string, string, timestamp)

 

For example:

insert into hbasestringids values ('0',true,0.8,1.5,123456789101112,12345678,12345,12,'aaa','abc','1985-09-25 17:45:30.005'),('1',false,6.77777,1.1111,7654321121212,87654321,21345,123,'bbb','dcba','1986-10-25 17:45:30.005');

 

When the number of rows in a single statement reaches 100,000, the client crashes.

Splitting the statement into several smaller INSERTs works around this; a sketch follows.
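A minimal sketch of such batching from the shell (the table, generated values, and batch size are assumptions for illustration only):

# issue 10 INSERT statements of 1,000 rows each instead of one 10,000-row statement
for batch in $(seq 0 9); do
  values=$(seq $((batch*1000)) $((batch*1000+999)) | awk '{printf "(%d,%d),", $1, $1*7%1000}' | sed 's/,$//')
  impala-shell -q "insert into a values $values"
done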

 

 

 

  3. Insert with subselect:

 

Inserted 10000 rows in 1.47s

Inserted 20000 rows in 2.26s

 

There is no problem with this method for 100000 rows.

 

Query:

insert into default.b (id, string_col, int_col, double_col, bool_col, timestamp_col) select id, string_col, int_col, double_col, bool_col, timestamp_col from default.hbasestringids limit 20000;

 

Tables default.b and default.hbasestringids have the same schema:

 

+-----------------+-----------+---------+
| name            | type      | comment |
+-----------------+-----------+---------+
| id              | string    |         |
| bool_col        | boolean   |         |
| double_col      | double    |         |
| float_col       | float     |         |
| bigint_col      | bigint    |         |
| int_col         | int       |         |
| smallint_col    | smallint  |         |
| tinyint_col     | tinyint   |         |
| date_string_col | string    |         |
| string_col      | string    |         |
| timestamp_col   | timestamp |         |
+-----------------+-----------+---------+

 

 

 

 

 

 

 

 

 

System Configuration:

 

Architecture:          x86_64

CPU op-mode(s):        32-bit, 64-bit

Byte Order:            Little Endian

CPU(s):                8

On-line CPU(s) list:   0-7

Thread(s) per core:    2

Core(s) per socket:    4

Socket(s):             1

NUMA node(s):          1

Vendor ID:             GenuineIntel

CPU family:            6

Model:                 42

Stepping:              7

CPU MHz:               1600.000

BogoMIPS:              6785.08

Virtualization:        VT-x

L1d cache:             32K

L1i cache:             32K

L2 cache:              256K

L3 cache:              8192K

NUMA node0 CPU(s):     0-7

 

Physical Memory:

16G

 

Disk:

 

ATA device, with non-removable media

Model Number:       ST31500341AS

Serial Number:      9VS551EJ

Firmware Revision:  CC4G

Transport:          Serial

Standards:

Used: unknown (minor revision code 0x0029)

Supported: 8 7 6 5

Likely used: 8

Configuration:

Logical         max     current

cylinders       16383   16383

heads           16      16

sectors/track   63      63

CHS current addressable sectors:   16514064

LBA    user addressable sectors:  268435455

LBA48  user addressable sectors: 2930277168

Logical/Physical Sector size:           512 bytes

device size with M = 1024*1024:     1430799 MBytes

device size with M = 1000*1000:     1500301 MBytes (1500 GB)

cache/buffer size  = unknown

Nominal Media Rotation Rate: 7200

 

OS Version:

Linux version 2.6.32-431.el6.x86_64 (mockbuild@c6b8.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Fri Nov 22 03:15:09 UTC 2013

Cloudera Impala SQL Query Availability

What query statements can be used in Impala?

 

In general, the Impala shell has the following commands available:

 

connect   exit     history  profile  select  shell  unset  values   with

describe  explain  insert   quit     set     show   use    version

alter  create  desc  drop  help  load

 

 

What is available in SQL:

SELECT, INSERT, DROP, CREATE, DESCRIBE, ALTER (VIEW)

 

 

What is unavailable in SQL:

DELETE, UPDATE

 

 

Their detailed usage is documented on this page:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/1.2.4/Installing-and-Using-Impala/ciiu_langref_sql.html?scroll=describe_unique_1