Build Impala from Scratch

This article shows how to build Cloudera Impala from scratch: compiling it, linking it, and running it.


My Environment:

CentOS 6.5

No network proxy (building behind a proxy causes many download failures)


Impala Commit (Version):

Change-Id: I3b5cefb4d7193045fc6fc5e94766589c2299b5b1

commit d90f3f3fd1a134578b1860be1b2f41a57a8d8896 (parent ee40ba2)


 

 

Get Impala source:

git clone https://github.com/cloudera/impala

 

 

1. Install tools:

sudo yum install boost-test boost-program-options libevent-devel automake libtool flex bison gcc-c++ openssl-devel \
make cmake doxygen.x86_64 glib-devel boost-devel python-devel bzip2-devel svn cyrus-sasl-devel \
wget git unzip

 

 

2. Uninstall the old Boost library that ships with CentOS 6.5 (yum remove boost), and install Boost 1.46.1.

For example:

export BOOST_ROOT=/usr/local/boost_1_46_1

 

cd boost_1_46_1/

./bjam threading=multi --layout=tagged

./bjam threading=multi --layout=tagged install
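If bjam is not already installed, it can be built from the unpacked Boost source first (a minimal sketch, assuming the boost_1_46_1 tarball has already been downloaded and extracted):

cd boost_1_46_1/
./bootstrap.sh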

 

 

3. Install LLVM 3.3:

 

wget http://llvm.org/releases/3.3/llvm-3.3.src.tar.gz

tar xvzf llvm-3.3.src.tar.gz

cd llvm-3.3.src/tools

svn co http://llvm.org/svn/llvm-project/cfe/tags/RELEASE_33/final/ clang

cd ../projects

svn co http://llvm.org/svn/llvm-project/compiler-rt/tags/RELEASE_33/final/ compiler-rt

cd ..

./configure --with-pic

make -j4 REQUIRES_RTTI=1

sudo make install

 

 

4. Set up the JDK path (environment variables in /etc/bashrc or ~/.bash_profile).
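For example, a minimal sketch of the relevant lines (the JDK install path below is an assumption; point JAVA_HOME at your own JDK):

export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH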

 

 

5. Install Maven 3.0.4:

 

wget http://www.fightrice.com/mirrors/apache/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz

tar xvf apache-maven-3.0.4-bin.tar.gz && sudo mv apache-maven-3.0.4 /usr/local

 

Update ~/.bashrc and add the environment variables:

 

export M2_HOME=/usr/local/apache-maven-3.0.4

export M2=$M2_HOME/bin

export PATH=$M2:$PATH

 

source ~/.bashrc

mvn -version

 

 

6. Check the paths in bin/set-classpath.sh and build Impala:

 

cd $IMPALA_HOME

./buildall.sh

 

 

7. If there are linker errors such as "cannot find -lboost_date_time", change the flag to -lboost_date_time-mt in the generated Makefiles.

Errors for other Boost libraries are similar: switch them all to the "-mt" variants or upgrade the library; these errors are easy to look up on Google.
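One way to apply such a rename across the generated build files is a search-and-replace (a sketch; the be/ directory and the exact flag are examples, so review the matches before running it):

# rewrite the plain Boost library flag to its -mt variant in the generated Makefiles
grep -rl "lboost_date_time" be/ | xargs sed -i 's/-lboost_date_time\b/-lboost_date_time-mt/g'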

 

 

8. easy_install prettytable

easy_install thrift

 

(9. Build the third-party libraries; this appears to happen automatically in newer Impala versions.)

 

(10. If the build asks for it, download setuptools-5.1.zip and install it; see the sketch below.)
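A minimal sketch of installing setuptools from the zip (assuming the archive has already been downloaded):

unzip setuptools-5.1.zip
cd setuptools-5.1
sudo python setup.py install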

 


 

Configure and start Impala (the hostnames, IP addresses, and paths below should be customized for your machine):

 

1. The /etc/hosts:

 

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4

::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

172.16.24.132   master

 

 

2. Configure and run HDFS/Hadoop:

 

(1) Create the Hadoop data directory (in this case, file:///home/yc/hdfs).

(2) Create the /var/run/hadoop-hdfs directory. (A sketch of steps (1) and (2) follows the path below.)

(3) Configure the Hadoop XML files in the following directory:

 

$IMPALA_HOME/thirdparty/hadoop-2.0.0-cdh4.5.0/etc/hadoop
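A minimal sketch of steps (1) and (2), assuming the daemons run as the current user (adjust ownership and permissions to your setup):

mkdir -p /home/yc/hdfs
sudo mkdir -p /var/run/hadoop-hdfs
sudo chown $USER /var/run/hadoop-hdfs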

 

 

The configured hdfs-site.xml should look like the following; pay attention to the "dn.50010" socket path, whose parent directory was created in step (2).

 

hdfs-site.xml

========================================================================

<configuration>

 

<property>

<name>dfs.client.read.shortcircuit</name>

<value>true</value>

</property>

 

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

 

<property>

<name>dfs.datanode.hdfs-blocks-metadata.enabled</name>

<value>true</value>

</property>

 

<property>

<name>dfs.datanode.data.dir</name>

<value>file:///home/yc/hdfs</value>

</property>

 

<property>

<name>dfs.client.use.legacy.blockreader.local</name>

<value>false</value>

</property>

 

<property>

<name>dfs.datanode.data.dir.perm</name>

<value>750</value>

</property>

 

<property>

<name>dfs.block.local-path-access.user</name>

<value>root</value>

</property>

 

<property>

<name>dfs.client.file-block-storage-locations.timeout</name>

<value>5000</value>

</property>

 

<property>

<name>dfs.domain.socket.path</name>

<value>/var/run/hadoop-hdfs/dn.50010</value>

</property>

 

<property>

<name>dfs.client.file-block-storage-locations.timeout.millis</name>

<value>10000</value>

</property>

 

</configuration>

========================================================================

 

 

core-site.xml

========================================================================

<configuration>

 

<property>

<name>hadoop.native.lib</name>

<value>true</value>

<description>Should native hadoop libraries, if present, be used.</description>

</property>

 

<property>

<name>fs.default.name</name>

<value>hdfs://master:9000</value>

</property>

 

<property>

<name>dfs.client.read.shortcircuit</name>

<value>true</value>

</property>

 

<property>

<name>dfs.client.use.legacy.blockreader.local</name>

<value>false</value>

</property>

 

<property>

<name>dfs.client.read.shortcircuit.skip.checksum</name>

<value>false</value>

</property>

 

<property>

<name>hadoop.tmp.dir</name>

<value>/home/yc/hdfs/tmp</value>

<description>A base for other temporary directories.</description>

</property>

 

</configuration>

========================================================================

 

 

 

 

 

yarn-site.xml

========================================================================

<?xml version="1.0"?>

<configuration>

<!-- Site specific YARN configuration properties -->

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

<property>

<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

<value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

</configuration>

========================================================================

 

 

mapred-site.xml:

========================================================================

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

</configuration>

========================================================================

 

 

hive-site.xml:

========================================================================

<configuration>

</configuration>

========================================================================

 

 

 

Copy core-site.xml, hdfs-site.xml, and hive-site.xml into $IMPALA_HOME/conf.
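For example (a sketch, assuming the three files sit in the current directory after editing):

mkdir -p $IMPALA_HOME/conf
cp core-site.xml hdfs-site.xml hive-site.xml $IMPALA_HOME/conf/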

 

 

Configure $IMPALA_HOME/bin/set-classpath.sh as follows:

 

========================================================================

CLASSPATH=\
$IMPALA_HOME/conf:\
$IMPALA_HOME/fe/src/test/resources:\
$IMPALA_HOME/fe/target/classes:\
$IMPALA_HOME/fe/target/dependency:\
$IMPALA_HOME/fe/target/test-classes:

for jar in `ls ${IMPALA_HOME}/fe/target/dependency/*.jar`; do
CLASSPATH=${CLASSPATH}:$jar
done

export CLASSPATH

========================================================================

 

 

 

Format HDFS:

$IMPALA_HOME/thirdparty/hadoop-2.0.0-cdh4.5.0/bin/hdfs namenode -format

 

$IMPALA_HOME/thirdparty/hadoop-2.0.0-cdh4.5.0/sbin/start-all.sh

 

Optionally, run a DataNode in the foreground:

./bin/hdfs datanode

 

Run the "jps" command; you should see something like the following (the numbers will differ):

 

1404

54375 NameNode

54646 NodeManager

558 SecondaryNameNode

54663 ResourceManager

6545 Jps

384 DataNode

54695 NodeManager

54727 NodeManager

2061

65490 NameNode

 

 

Create directories in HDFS:

 

$HADOOP_HOME/bin/hdfs dfs -mkdir  /tmp

$HADOOP_HOME/bin/hdfs dfs -mkdir  /user

$HADOOP_HOME/bin/hdfs dfs -mkdir  /user/impala

$HADOOP_HOME/bin/hdfs dfs -mkdir  /user/impala/tab1

 

 

Put data into HDFS:

 

$HADOOP_HOME/bin/hdfs dfs -put ./tab1.csv /user/impala/tab1
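If you do not have a sample file yet, a minimal sketch (the file name matches the command above, but its column layout here is purely an illustrative assumption):

cat > tab1.csv <<'EOF'
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
EOF
$HADOOP_HOME/bin/hdfs dfs -put ./tab1.csv /user/impala/tab1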

 


 

3. Start Impala Daemons:

 

cd $IMPALA_HOME

 

./be/build/debug/statestore/statestored

 

./bin/start-impalad.sh

 

./bin/start-catalogd.sh

 

(There is no need to start Hive.)

 

Start Impala Shell:

 

./bin/impala-shell.sh

 

(Note: there is also no need to start $IMPALA_HOME/thirdparty/hive-0.10.0-cdh4.5.0/bin/hiveserver2.)
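To verify that the daemons are up, a quick sanity check from another terminal (a sketch; the impalad debug web port below is an assumption about the default build settings):

ps aux | grep -E 'impalad|statestored|catalogd' | grep -v grep
curl -s http://localhost:25000/ | head    # impalad debug web UI, if enabled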

 


 

References:

Building an older Impala version: https://github.com/tomdz/impala

http://hi.baidu.com/huareal/item/52be8401cf349729a1312d66

http://hi.baidu.com/huareal/item/d651821043df5cfa86ad4eff

http://www.blogjava.net/ivanwan/archive/2006/05/18.html

Impala HBase Interaction

A summary of how HBase and Impala connect and interact.

 

 

(1) A very simple example of interaction between HBase and Impala.

 

 

  1. Configure HBase as usual (edit hbase-site.xml and start the daemons; use "jps" to check that they are running).

 

 

  2. Create a table in HBase using the following commands in the HBase shell (command: "hbase shell"):

 

——————————————————————————

create 'a','ints'
enable 'a'

——————————————————————————

 

This creates a table keyed by the row key (key_ID) with a single column family called "ints".
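To have a row to query later, data can also be inserted from the shell (a sketch; the row key and value are arbitrary examples, and the hbase launcher is assumed to be on your PATH):

echo "put 'a', '1', 'ints:int_col', '42'" | hbase shell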

 

 

  3. In the Hive shell (command: "hive"), type the following command:

 

——————————————————————————

CREATE EXTERNAL TABLE a (
id int,
int_col int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
":key,ints:int_col"
)
TBLPROPERTIES("hbase.table.name" = "a");

——————————————————————————

 

 

This query creates an external Hive table backed by the HBase table, mapping the row key to the int column id and the "ints" column family to int_col.

This approach treats the row key as an int; it can also be treated as a string.

For details, see:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/Installing-and-Using-Impala/ciiu_impala_hbase.html

 

 

  4. In the Impala shell (command: "impala-shell"), first invalidate the metadata so that Impala picks up the newly created table:

 

——————————————————————————

INVALIDATE METADATA a;

——————————————————————————

 

Now we can run SQL queries from the Impala shell; a couple of examples are sketched below.
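A minimal sketch of such queries, run non-interactively (the inserted values are arbitrary examples):

impala-shell -q "select * from a limit 10;"
impala-shell -q "insert into a values (1, 42);"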

 

Warning:

Because this is an external table, dropping it from Impala or Hive does not touch the underlying HBase table at all.

 

 

 

 

 

 

 

 

 

 

(2) A more complex example of HBase and Impala interaction with multiple data types.

 

For Impala data type references, see:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_datatypes.html?scroll=float_unique_1

 

  1. Create a table in HBase using the following commands in the HBase shell (command: "hbase shell"):

 

——————————————————————————

create 'hbasealltypessmall', 'bools', 'ints', 'floats', 'strings'
enable 'hbasealltypessmall'
quit

——————————————————————————

 

 

  2. In the Hive shell (command: "hive"), type the following command:

 

——————————————————————————

CREATE EXTERNAL TABLE hbasestringids (
id string,
bool_col boolean,
tinyint_col tinyint,
smallint_col smallint,
int_col int,
bigint_col bigint,
float_col float,
double_col double,
date_string_col string,
string_col string,
timestamp_col timestamp)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
":key,bools:bool_col,ints:tinyint_col,ints:smallint_col,ints:int_col,ints:bigint_col,floats:float_col,floats:double_col,strings:date_string_col,strings:string_col,strings:timestamp_col"
)
TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall");

——————————————————————————

 

This query treats the row key id as a string and maps the 5 HBase column families to 11 Hive columns.

For example, tinyint_col, smallint_col, int_col, and bigint_col all map into the ints column family.

 

 

  3. In the Impala shell (command: "impala-shell"), first invalidate the metadata so that Impala picks up the newly created table:

 

——————————————————————————

INVALIDATE METADATA hbasestringids;

——————————————————————————

 

Now we can run SQL queries from the Impala shell.

 

 

 

References and other materials:

 

http://mapredit.blogspot.com/2013/05/query-hbase-tables-with-impala.html

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_impala_hbase.html

http://yanbohappy.sinaapp.com/?tag=hbase

http://doc.mapr.com/display/MapR/Working+with+Impala

Impala "insert" Performance

  1. Inserted 100001 rows in 7.28s

 

Table format:

(key_ID, int)

 

For example:

insert into a values (0,126),(1,173),(2,568),(3,164),(4,593),(5,788),(6,924),(7,206),(8,359),(9,690),(10,987),(11,231),(12,817),(13,122),(14,373),(15,177),(16,156),(17,256),(18,203),(19,38);

 

When the number of rows in a single statement reaches 1,000,000, parsing the query becomes very slow.

 

 

 

  2. Inserted 10000 rows in 14.11s

 

Table format:  (11 columns)

(string, boolean, double, float, bigint, int, smallint, tinyint, string, string, timestamp)

 

For example:

insert into hbasestringids values ('0',true,0.8,1.5,123456789101112,12345678,12345,12,'aaa','abc','1985-09-25 17:45:30.005'),('1',false,6.77777,1.1111,7654321121212,87654321,21345,123,'bbb','dcba','1986-10-25 17:45:30.005');

 

When the number of rows in a single statement reaches 100,000, the client crashes.

Splitting the statement into several smaller INSERTs works around this; a sketch follows.
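A minimal sketch of such batching from the shell (the table, generated values, and batch size are assumptions for illustration only):

# issue 10 INSERT statements of 1,000 rows each instead of one 10,000-row statement
for batch in $(seq 0 9); do
  values=$(seq $((batch*1000)) $((batch*1000+999)) | awk '{printf "(%d,%d),", $1, $1*7%1000}' | sed 's/,$//')
  impala-shell -q "insert into a values $values"
done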

 

 

 

  3. Insert with subselect:

 

Inserted 10000 rows in 1.47s

Inserted 20000 rows in 2.26s

 

There is no problem with this method for 100000 rows.

 

Query:

insert into default.b (id, string_col, int_col, double_col, bool_col, timestamp_col) select id, string_col, int_col, double_col, bool_col, timestamp_col from default.hbasestringids limit 20000;

 

Tables default.b and default.hbasestringids have the same schema:

 

+-----------------+-----------+---------+
| name            | type      | comment |
+-----------------+-----------+---------+
| id              | string    |         |
| bool_col        | boolean   |         |
| double_col      | double    |         |
| float_col       | float     |         |
| bigint_col      | bigint    |         |
| int_col         | int       |         |
| smallint_col    | smallint  |         |
| tinyint_col     | tinyint   |         |
| date_string_col | string    |         |
| string_col      | string    |         |
| timestamp_col   | timestamp |         |
+-----------------+-----------+---------+

 

 

 

 

 

 

 

 

 

System Configuration:

 

Architecture:          x86_64

CPU op-mode(s):        32-bit, 64-bit

Byte Order:            Little Endian

CPU(s):                8

On-line CPU(s) list:   0-7

Thread(s) per core:    2

Core(s) per socket:    4

Socket(s):             1

NUMA node(s):          1

Vendor ID:             GenuineIntel

CPU family:            6

Model:                 42

Stepping:              7

CPU MHz:               1600.000

BogoMIPS:              6785.08

Virtualization:        VT-x

L1d cache:             32K

L1i cache:             32K

L2 cache:              256K

L3 cache:              8192K

NUMA node0 CPU(s):     0-7

 

Physical Memory:

16G

 

Disk:

 

ATA device, with non-removable media

Model Number:       ST31500341AS

Serial Number:      9VS551EJ

Firmware Revision:  CC4G

Transport:          Serial

Standards:

Used: unknown (minor revision code 0x0029)

Supported: 8 7 6 5

Likely used: 8

Configuration:

Logical         max     current

cylinders       16383   16383

heads           16      16

sectors/track   63      63

CHS current addressable sectors:   16514064

LBA    user addressable sectors:  268435455

LBA48  user addressable sectors: 2930277168

Logical/Physical Sector size:           512 bytes

device size with M = 1024*1024:     1430799 MBytes

device size with M = 1000*1000:     1500301 MBytes (1500 GB)

cache/buffer size  = unknown

Nominal Media Rotation Rate: 7200

 

OS Version:

Linux version 2.6.32-431.el6.x86_64 (mockbuild@c6b8.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Fri Nov 22 03:15:09 UTC 2013

Cloudera Impala SQL Query Availability

What query statements can be used in Impala?

 

In general, the Impala shell has the following commands available:

 

connect   exit     history  profile  select  shell  unset  values   with

describe  explain  insert   quit     set     show   use    version

alter  create  desc  drop  help  load

 

 

What is available in SQL:

SELECT, INSERT, DROP, CREATE, DESCRIBE, ALTER (VIEW)

 

 

What is unavailable in SQL:

DELETE, UPDATE

 

 

Their detailed usage is documented on this page:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/1.2.4/Installing-and-Using-Impala/ciiu_langref_sql.html?scroll=describe_unique_1