This tutorial provides a quick introduction to using CarbonData.
We suggest first going through all the examples to understand how to create a table, load data, and run queries.
- Download a packaged release of Spark 1.5.0 or later
- Configure the Hive Metastore to use MySQL (search for "mysql hive metastore" to find instructions) and move the mysql-connector-java jar to ${SPARK_HOME}/lib
- Download Thrift, rename the executable to thrift, and add it to your PATH.
- Download Apache CarbonData code and build it
$ git clone carbondata
$ cd carbondata
$ mvn clean install -DskipTests
$ cp assembly/target/scala-2.10/carbondata_*.jar ${SPARK_HOME}/lib
$ mkdir ${SPARK_HOME}/carbondata
$ cp -r processing/carbonplugins ${SPARK_HOME}/carbondata
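The prerequisites above assume that several command-line tools are already installed. As a rough sketch (not part of the CarbonData distribution), the following reports which of them are on the PATH. The helper name `have_cmd` and the tool list are assumptions inferred from the commands used in this tutorial.

```shell
# have_cmd is a hypothetical helper: it succeeds if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# The tool list is an assumption based on the commands this tutorial runs.
missing=""
for tool in git mvn thrift; do
  if have_cmd "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
    missing="$missing $tool"
  fi
done

# Summarize anything that still needs to be installed or added to PATH.
if [ -n "$missing" ]; then
  echo "Install or add to PATH:$missing"
fi
```

The check only reports. It does not stop the shell, so it can be run safely at any point in the setup.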
- Run spark shell
$ cd ${SPARK_HOME}
$ carbondata_jar=./lib/$(ls -1 lib |grep "^carbondata_.*\.jar$")
$ mysql_jar=./lib/$(ls -1 lib |grep "^mysql.*\.jar$")
$ ./bin/spark-shell --master local --jars ${carbondata_jar},${mysql_jar}
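The `ls | grep` lookups above yield a bare `./lib/` path if no matching jar exists, and spark-shell then fails with a less obvious error. One possible way to fail earlier is sketched below. `find_jar` is a hypothetical helper written for this illustration; it is not provided by Spark or CarbonData.

```shell
# find_jar DIR PREFIX
# Hypothetical helper: prints the first regular file matching DIR/PREFIX*.jar,
# or prints a message to stderr and returns non-zero if there is none.
find_jar() {
  fj_dir="$1"
  fj_prefix="$2"
  for fj_candidate in "$fj_dir"/"$fj_prefix"*.jar; do
    # If the glob matched nothing it stays literal, so the -f test rejects it.
    if [ -f "$fj_candidate" ]; then
      printf '%s\n' "$fj_candidate"
      return 0
    fi
  done
  echo "no ${fj_prefix}*.jar found in ${fj_dir}" >&2
  return 1
}
```

With such a helper, the two variables could be set as `carbondata_jar=$(find_jar ./lib carbondata_) || exit 1` and `mysql_jar=$(find_jar ./lib mysql) || exit 1` before launching spark-shell.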
- Create CarbonContext instance
import java.io.File
import org.apache.spark.sql.CarbonContext
import org.apache.hadoop.hive.conf.HiveConf
val storePath = "hdfs://hacluster/Opt/CarbonStore"
val cc = new CarbonContext(sc, storePath)
val metadata = new File("").getCanonicalPath + "/carbondata/metadata"
cc.setConf("hive.metastore.warehouse.dir", metadata)
cc.setConf(HiveConf.ConfVars.HIVECHECKFILEFORMAT.varname, "false")
Note: storePath can be an HDFS path or a local path; it is the location where table data is stored.
- Create table
cc.sql("create table if not exists table1 (id string, name string, city string, age Int) STORED BY 'org.apache.carbondata.format'")
- Create a sample.csv file in the ${SPARK_HOME}/carbondata directory
$ cd ${SPARK_HOME}/carbondata
$ cat > sample.csv << EOF
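The heredoc contents are not shown here. As an illustration only, a file matching the table1 schema (id, name, city, age) might be created as follows. The rows are made-up sample values, and the header line assumes that the loader takes column names from the first line of the CSV.

```shell
# Assumption: run from the directory chosen for the sample file,
# e.g. ${SPARK_HOME}/carbondata.
# The rows below are hypothetical; the column order mirrors table1.
cat > sample.csv << EOF
id,name,city,age
1,alice,paris,30
2,bob,paris,25
3,carol,berlin,41
EOF

# Sanity check: every line should have exactly 4 comma-separated fields.
if awk -F',' 'NF != 4 { bad = 1 } END { exit bad + 0 }' sample.csv; then
  echo "sample.csv looks well-formed"
else
  echo "sample.csv has a line with the wrong number of fields" >&2
fi
```

Keeping the header names identical to the table's column names avoids having to map columns explicitly at load time.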
- Load data to table1 in spark shell
val dataFilePath = new File("").getCanonicalPath + "/carbondata/sample.csv"
cc.sql(s"load data inpath '$dataFilePath' into table table1")
Note: CarbonData also supports the syntax LOAD DATA LOCAL INPATH 'folder_path' INTO TABLE [db_name.]table_name OPTIONS(property_name=property_value, ...), but LOCAL currently has no effect in CarbonData; it is kept only to align with Hive syntax.
dataFilePath can also be an HDFS path, for example val dataFilePath = "hdfs://hacluster//carbondata/sample.csv"
- Query data from table1
cc.sql("select * from table1").show
cc.sql("select city, avg(age), sum(age) from table1 group by city").show