
Mongo to HDFS/Hive migration


This page is obsolete

We keep it for reference.


This document summarizes the steps needed to migrate data from MongoDB storage into HDFS and then query it with Hive. First, we need to inject data into MongoDB. This can be done either with pymongo directly (a minimal sketch is shown below) or with the testClient tool from WMArchive. The latter assumes you have WMArchive set up as a service; it reads a local file and replicates it N times into MongoDB. Here is the command you need to run in your shell:

testClient.py --json=./test/data/fwjr_processing.json --ntimes=100
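If you prefer to use pymongo directly, a minimal sketch could look like the one below. The MongoDB URI and the fwjr/db database/collection names are taken from the mongo2avro log output shown further down; adjust them to your deployment.

# minimal pymongo sketch: replicate a local FWJR JSON document N times into MongoDB
# (URI, database and collection names are assumptions; adjust them to your setup)
import json
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:8230')
coll = client['fwjr']['db']

with open('test/data/fwjr_processing.json') as istream:
    doc = json.load(istream)

# insert_many adds an _id to each document, so insert independent copies
coll.insert_many([dict(doc) for _ in range(100)])
print("documents in collection:", coll.count_documents({}))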

The testClient.py command above inserts 100 copies of the given JSON file into MongoDB via the WMArchive data-service. Next we'll create an avro file containing all of these documents. For that purpose we'll use the mongo2avro tool:

# create output area where avro files will be stored
mkdir -p tmp/data

# create migration area where accumulated avro files will be transferred once they reach a given threshold
mkdir -p tmp/migrate

# run migration script. It takes all records from MongoDB and puts them into a single avro file in the tmp/data area
# once a file reaches the threshold (256MB by default, adjustable via the --thr option) the avro file is
# moved into the migration area
mongo2avro --mongo=mongodb://localhost:8230 --schema=test/data/fwjr_processing.avsc --odir=./tmp/data --mdir=./tmp/migrate --verbose

# here is an example of the output of the aforementioned command
[03/Feb/2016:13:16:59] 1454523419.0 MongoStorage  mongodb://localhost:8230
[03/Feb/2016:13:16:59] 1454523419.0 MongoStorage  Collection(Database(MongoClient(host=['localhost:8230'], document_class=dict, tz_aware=False, connect=True, w=1), u'fwjr'), u'db')
[03/Feb/2016:13:16:59] 1454523419.0 AvroStorage  avroio:test/data/fwjr_processing.avsc
[03/Feb/2016:13:17:05] 1454523425.0 mongo2avro  1000 docs ./tmp/data/20160203_181659.avro 7.2MB (7196498 bytes) RSS:66.3MB
[03/Feb/2016:13:17:05] 1454523425.0 mongo2avro  wrote 1000 docs ./tmp/data/20160203_181659.avro 7.2MB (7211757 bytes)

In the end we'll have a tmp/data/20160203_181659.avro file containing all 1000 records.
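As a quick sanity check you can read the file back and count its records, e.g. with the fastavro package (not part of WMArchive, just one possible Avro reader):

# optional sanity check: count the records in the produced avro file
# (fastavro is only one possible Avro reader; any other will do)
from fastavro import reader

with open('tmp/data/20160203_181659.avro', 'rb') as istream:
    nrecords = sum(1 for _ in reader(istream))
print("number of records:", nrecords)  # expect 1000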

This data can be migrated into HDFS either with the standard hadoop tools, e.g.

hadoop fs -mkdir /path/hdfs/data
hadoop fs -put tmp/data/20160203_181659.avro /path/hdfs/data

or via the migrate2hdfs tool.
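For completeness, the same upload can be scripted; the sketch below simply shells out to the hadoop CLI, so it only assumes that hadoop is on your PATH and that the paths match the example above:

# scripted equivalent of the hadoop fs commands above
# (assumes the hadoop CLI is on PATH; paths are the example placeholders)
import subprocess

subprocess.check_call(['hadoop', 'fs', '-mkdir', '-p', '/path/hdfs/data'])
subprocess.check_call(['hadoop', 'fs', '-put',
                       'tmp/data/20160203_181659.avro', '/path/hdfs/data'])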

HDFS and Hive

Once you have migrated your data to HDFS, we can use Hive to query it. Here is our initial setup:

# on HDFS we have a data area
hadoop fs -ls /path/hdfs/data

# generate avro schema file
json2avsc --fin=fwjr.json --fout=fwjr.avsc

# put our avro schema file onto HDFS
hadoop fs -put fwjr.avsc /path/hdfs/
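Before moving on you may want to verify that the generated schema file is well-formed; the snippet below only assumes that json2avsc produced a top-level Avro record schema in plain JSON:

# optional check: the generated .avsc file is plain JSON describing an Avro record
import json

with open('fwjr.avsc') as istream:
    schema = json.load(istream)
print(schema.get('type'), schema.get('name'))  # expect something like: record fwjr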

Now we are ready to start hive:

shell# hive
# create new database
hive> create database cms_wm_archive location '/path/hdfs/arch.db';

# use our database
hive> use cms_wm_archive;

# create external table pointed to our data files
hive> create external table wmarchive
  ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.avro.AvroSerDe"
  STORED AS INPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat"
  OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat"
  LOCATION "/path/hdfs/data/"
  TBLPROPERTIES("avro.schema.url"="hdfs:///path/hdfs/fwjr.avsc");

# query our data
hive> select jobid from wmarchive;

Voila! It works!
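The same query can also be issued programmatically, e.g. with the PyHive package; this assumes a HiveServer2 endpoint on localhost:10000, which is not part of the setup described above:

# optional: run the same query through HiveServer2 with PyHive
# (host and port are assumptions; adjust them to your Hive deployment)
from pyhive import hive

conn = hive.Connection(host='localhost', port=10000, database='cms_wm_archive')
cursor = conn.cursor()
cursor.execute('SELECT jobid FROM wmarchive LIMIT 10')
for (jobid,) in cursor.fetchall():
    print(jobid)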