This tutorial provides a quick introduction to using CarbonData.
We suggest you first go through all of the examples below to understand how to create a table, load data into it, and query it.
- Download a packaged release of Spark 1.5.0 or later
- Configure the Hive Metastore to use MySQL (searching for "mysql hive metastore" will turn up setup guides) and copy the mysql-connector-java jar to ${SPARK_HOME}/lib
- Download Thrift, rename the executable to thrift, and add it to your PATH.
- Download the Apache CarbonData source code and build it
$ git clone https://github.com/apache/incubator-carbondata.git carbondata
$ cd carbondata
$ mvn clean install -DskipTests
$ cp assembly/target/scala-2.10/carbondata_*.jar ${SPARK_HOME}/lib
$ mkdir ${SPARK_HOME}/carbondata
$ cp -r processing/carbonplugins ${SPARK_HOME}/carbondata
- Run the Spark shell
$ cd ${SPARK_HOME}
$ carbondata_jar=./lib/$(ls -1 lib |grep "^carbondata_.*\.jar$")
$ mysql_jar=./lib/$(ls -1 lib |grep "^mysql.*\.jar$")
$ ./bin/spark-shell --master local --jars ${carbondata_jar},${mysql_jar}
- Create a CarbonContext instance
import org.apache.spark.sql.CarbonContext
import java.io.File
import org.apache.hadoop.hive.conf.HiveConf
// Location where CarbonData will store table data (HDFS or local path)
val storePath = "hdfs://hacluster/Opt/CarbonStore"
val cc = new CarbonContext(sc, storePath)
// Point CarbonData at the Kettle plugins copied into ${SPARK_HOME}/carbondata earlier
cc.setConf("carbon.kettle.home","./carbondata/carbonplugins")
// Use a local directory as the Hive metastore warehouse
val metadata = new File("").getCanonicalPath + "/carbondata/metadata"
cc.setConf("hive.metastore.warehouse.dir", metadata)
// Disable Hive's file-format check so CarbonData's own format is accepted
cc.setConf(HiveConf.ConfVars.HIVECHECKFILEFORMAT.varname, "false")
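As a quick sanity check that the context can reach the metastore, you can list the known tables (the list will simply be empty on a fresh setup):
cc.sql("show tables").show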
Note: storePath can be an HDFS path or a local path; it is the location where table data is stored.
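For example, a purely local setup could point storePath at a directory on the local filesystem (the path below is only an illustration, not from the original guide):
val storePath = "/tmp/carbondata/store" // hypothetical local directory
val cc = new CarbonContext(sc, storePath)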
- Create a table
cc.sql("create table if not exists table1 (id string, name string, city string, age Int) STORED BY 'org.apache.carbondata.format'")
- Create a sample.csv file in the ${SPARK_HOME}/carbondata directory
cd ${SPARK_HOME}/carbondata
cat > sample.csv << EOF
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF
- Load data into table1 from the Spark shell
val dataFilePath = new File("").getCanonicalPath + "/carbondata/sample.csv"
cc.sql(s"load data inpath '$dataFilePath' into table table1")
Note: CarbonData also supports the LOAD DATA LOCAL INPATH 'folder_path' INTO TABLE [db_name.]table_name OPTIONS(property_name=property_value, ...)
syntax, but LOCAL currently has no special meaning in CarbonData; it is kept only to stay aligned with Hive syntax. dataFilePath
can also be an HDFS path, for example val dataFilePath = "hdfs://hacluster/carbondata/sample.csv"
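For example, a load that names the CSV delimiter and the header columns explicitly might look like this sketch (the DELIMITER and FILEHEADER option keys are common CarbonData load options, but treat them as assumptions to verify against your build):
cc.sql(s"load data inpath '$dataFilePath' into table table1 options('DELIMITER'=',', 'FILEHEADER'='id,name,city,age')")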
- Query data from table1
cc.sql("select * from table1").show
cc.sql("select city, avg(age), sum(age) from table1 group by city").show