Script Format
- InputFormat: com.thinkaurelius.faunus.formats.script.ScriptInputFormat
- OutputFormat: com.thinkaurelius.faunus.formats.script.ScriptOutputFormat
ScriptInputFormat and ScriptOutputFormat take an arbitrary Gremlin script and use that script to read or write FaunusVertex objects, respectively. They can be considered the most general InputFormat/OutputFormat possible, in that Faunus defers all reading/writing to the user-provided script.
The data below is an adjacency-list representation of an untyped, directed graph. The first line reads, “vertex 0 has no outgoing edges.” The second line reads, “vertex 1 has outgoing edges to vertices 4, 3, 2, and 0.”
0:
1:4,3,2,0
2:5,3,1
3:11,6,1,2
4:
5:
6:
7:8,9,10,11,1
8:
9:
10:
11:6
There is no corresponding InputFormat that can parse this particular file (or some adjacency-list variant of it). As such, ScriptInputFormat can be used. With ScriptInputFormat, a Gremlin-Groovy script is stored in HDFS and leveraged by each mapper in the Faunus job. The Gremlin-Groovy script must have the following method defined:
def void read(FaunusVertex vertex, String line) { ... }
An appropriate read()
for the above adjacency list file is:
def void read(FaunusVertex vertex, String line) {
    def parts = line.split(':')
    vertex.reuse(Long.valueOf(parts[0]))
    if (parts.length == 2) {
        parts[1].split(',').each {
            vertex.addEdge(Direction.OUT, 'linkedTo', Long.valueOf(it))
        }
    }
}
Note that, to avoid object creation overhead, the previous vertex is provided to the next parse. The vertex can be “reused” with the FaunusVertex.reuse(long id) method, which wipes all previous data.
The above files are provided with the Faunus distribution and can be used from the Gremlin REPL.
gremlin> hdfs.copyFromLocal('data/InputIds.groovy','InputIds.groovy')
==>null
gremlin> hdfs.copyFromLocal('data/graph-of-the-gods.id','graph-of-the-gods.id')
==>null
gremlin> hdfs.ls()
==>rw-r--r-- marko supergroup 429 InputIds.groovy
==>rw-r--r-- marko supergroup 69 graph-of-the-gods.id
gremlin> g = FaunusFactory.open('bin/script-input.properties')
==>faunusgraph[scriptinputformat->graphsonoutputformat]
gremlin> g.getProperties()
==>faunus.output.location.overwrite=true
==>faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat
==>faunus.input.location=graph-of-the-gods.id
==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.script.ScriptInputFormat
==>faunus.output.location=output
==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
==>faunus.input.script.file=InputIds.groovy
gremlin> g._
13/04/09 11:17:25 WARN mapreduce.FaunusCompiler: Using developer reference to target/faunus-0.3.0-SNAPSHOT-job.jar
13/04/09 11:17:25 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
13/04/09 11:17:25 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.IdentityMap.Map]
...
gremlin> hdfs.head('output')
==>{"_id":0}
==>{"_id":1,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":4},{"_label":"linkedTo","_id":-1,"_inV":3},{"_label":"linkedTo","_id":-1,"_inV":2},{"_label":"linkedTo","_id":-1,"_inV":0}]}
==>{"_id":2,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":5},{"_label":"linkedTo","_id":-1,"_inV":3},{"_label":"linkedTo","_id":-1,"_inV":1}]}
==>{"_id":3,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":11},{"_label":"linkedTo","_id":-1,"_inV":6},{"_label":"linkedTo","_id":-1,"_inV":1},{"_label":"linkedTo","_id":-1,"_inV":2}]}
==>{"_id":4}
==>{"_id":5}
==>{"_id":6}
==>{"_id":7,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":8},{"_label":"linkedTo","_id":-1,"_inV":9},{"_label":"linkedTo","_id":-1,"_inV":10},{"_label":"linkedTo","_id":-1,"_inV":11},{"_label":"linkedTo","_id":-1,"_inV":1}]}
==>{"_id":8}
==>{"_id":9}
==>{"_id":10}
==>{"_id":11,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":6}]}
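For reference, the g.getProperties() output above implies that bin/script-input.properties contains entries along these lines (the ordering within the file is not shown by getProperties() and is assumed here):

faunus.graph.input.format=com.thinkaurelius.faunus.formats.script.ScriptInputFormat
faunus.input.location=graph-of-the-gods.id
faunus.input.script.file=InputIds.groovy
faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat
faunus.output.location=output
faunus.output.location.overwrite=true
faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

The key properties for ScriptInputFormat are faunus.graph.input.format (the format class itself) and faunus.input.script.file (the HDFS location of the Gremlin-Groovy script to load into each mapper).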
The principle above can also be used for writing a <NullWritable,FaunusVertex>
stream back to a file in HDFS.
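A write() counterpart for ScriptOutputFormat can be sketched as follows. This is only a sketch: it assumes the required script signature is def void write(FaunusVertex vertex, DataOutput output), and it emits the same adjacency-list format shown at the top of this page; consult the example scripts shipped with the Faunus distribution for the exact contract.

def void write(FaunusVertex vertex, DataOutput output) {
    // Emit "id:" followed by a comma-separated list of out-neighbor ids.
    // Assumes the write() signature above; verify against the distribution.
    output.writeUTF(vertex.getId().toString() + ':')
    def itty = vertex.getEdges(Direction.OUT).iterator()
    while (itty.hasNext()) {
        output.writeUTF(itty.next().getVertex(Direction.IN).getId().toString())
        if (itty.hasNext())
            output.writeUTF(',')
    }
    output.writeUTF('\n')
}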
To be continued…