
Script Format

okram edited this page Apr 9, 2013 · 16 revisions

  • InputFormat: com.thinkaurelius.faunus.formats.script.ScriptInputFormat
  • OutputFormat: com.thinkaurelius.faunus.formats.script.ScriptOutputFormat

ScriptInputFormat and ScriptOutputFormat take an arbitrary Gremlin script and use it to read or write FaunusVertex objects, respectively. These can be considered the most general InputFormat/OutputFormat possible, in that Faunus delegates all reading and writing to the user-provided script.

Script InputFormat Support

The data below is an adjacency-list representation of an untyped, directed graph. The first line reads, “vertex 0 has no outgoing edges.” The second line reads, “vertex 1 has outgoing edges to vertices 4, 3, 2, and 0.”

0:
1:4,3,2,0
2:5,3,1
3:11,6,1,2
4:
5:
6:
7:8,9,10,11,1
8:
9:
10:
11:6

There is no existing InputFormat that can parse this particular file (or some adjacency-list variant of it). In such cases, ScriptInputFormat can be used: a Gremlin-Groovy script is stored in HDFS and executed by each mapper in the Faunus job. The script must define the following method:

def void read(FaunusVertex vertex, String line) { ... }

An appropriate read() for the above adjacency list file is:

def void read(FaunusVertex vertex, String line) {
    def parts = line.split(':')
    // reuse the incoming vertex, assigning it the id before the colon
    vertex.reuse(Long.valueOf(parts[0]))
    if (parts.length == 2) {
        // each comma-separated id after the colon becomes an outgoing edge
        parts[1].split(',').each {
            vertex.addEdge(Direction.OUT, 'linkedTo', Long.valueOf(it))
        }
    }
}

Note that, to avoid object-creation overhead, the vertex from the previous parse is handed to the next one. The vertex can be “reused” with the FaunusVertex.reuse(long id) method, which assigns the new id and wipes all previously set data.
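For line formats that also carry vertex data, the same hook can populate properties. The following is a hypothetical sketch: the three-column `id:name:out-ids` line format and the `name` property are invented for illustration and are not part of the sample file above.

```groovy
// Hypothetical variant for lines of the form "id:name:comma-separated-out-ids",
// e.g. "3:hercules:11,6,1,2". The name column is an invented example field.
def void read(FaunusVertex vertex, String line) {
    def parts = line.split(':')
    vertex.reuse(Long.valueOf(parts[0]))   // wipe and re-id the reused vertex
    vertex.setProperty('name', parts[1])   // store the extra column as a vertex property
    if (parts.length == 3) {
        parts[2].split(',').each {
            vertex.addEdge(Direction.OUT, 'linkedTo', Long.valueOf(it))
        }
    }
}
```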

The files above are provided with the Faunus distribution and can be used from the Gremlin REPL.

gremlin> hdfs.copyFromLocal('data/InputIds.groovy','InputIds.groovy')
==>null
gremlin> hdfs.copyFromLocal('data/graph-of-the-gods.id','graph-of-the-gods.id')
==>null
gremlin> hdfs.ls()
==>rw-r--r-- marko supergroup 429 InputIds.groovy
==>rw-r--r-- marko supergroup 69 graph-of-the-gods.id
gremlin> g = FaunusFactory.open('bin/script-input.properties')
==>faunusgraph[scriptinputformat->graphsonoutputformat]
gremlin> g.getProperties()
==>faunus.output.location.overwrite=true
==>faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat
==>faunus.input.location=graph-of-the-gods.id
==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.script.ScriptInputFormat
==>faunus.output.location=output
==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
==>faunus.input.script.file=InputIds.groovy
gremlin> g._
13/04/09 11:17:25 WARN mapreduce.FaunusCompiler: Using developer reference to target/faunus-0.3.0-SNAPSHOT-job.jar
13/04/09 11:17:25 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
13/04/09 11:17:25 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.IdentityMap.Map]
...
gremlin> hdfs.head('output')
==>{"_id":0}
==>{"_id":1,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":4},{"_label":"linkedTo","_id":-1,"_inV":3},{"_label":"linkedTo","_id":-1,"_inV":2},{"_label":"linkedTo","_id":-1,"_inV":0}]}
==>{"_id":2,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":5},{"_label":"linkedTo","_id":-1,"_inV":3},{"_label":"linkedTo","_id":-1,"_inV":1}]}
==>{"_id":3,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":11},{"_label":"linkedTo","_id":-1,"_inV":6},{"_label":"linkedTo","_id":-1,"_inV":1},{"_label":"linkedTo","_id":-1,"_inV":2}]}
==>{"_id":4}
==>{"_id":5}
==>{"_id":6}
==>{"_id":7,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":8},{"_label":"linkedTo","_id":-1,"_inV":9},{"_label":"linkedTo","_id":-1,"_inV":10},{"_label":"linkedTo","_id":-1,"_inV":11},{"_label":"linkedTo","_id":-1,"_inV":1}]}
==>{"_id":8}
==>{"_id":9}
==>{"_id":10}
==>{"_id":11,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":6}]}

Script OutputFormat Support

The same principle applies in reverse: a user-provided script can write the <NullWritable,FaunusVertex> stream back to a file in HDFS.
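One would expect ScriptOutputFormat to require a mirror-image hook of read(). The sketch below emits the adjacency-list format used earlier; note that the method signature here is an assumption, not the confirmed API, and should be checked against the Faunus source.

```groovy
// Assumed hook signature -- verify against the Faunus source before relying on it.
// Emits one adjacency-list line per vertex, matching the input format shown above.
def void write(FaunusVertex vertex, DataOutput output) {
    // collect the ids of the vertices on the other end of each outgoing edge
    def ids = vertex.getEdges(Direction.OUT).collect { it.getVertex(Direction.IN).getId() }
    output.writeBytes(vertex.getId().toString() + ':' + ids.join(',') + '\n')
}
```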

To be continued…
