SUIM

SUIM (Spark for Unstructured Information) provides a thin abstraction layer for UIMA on top of Spark. SUIM leverages Spark's resilient distributed datasets (RDDs) to run UIMA pipelines built with uimaFIT; the pipelines are distributed across the nodes of a cluster and operated on in parallel [1].

SUIM also allows you to run analytical pipelines on the resulting (or intermediate) CASes to execute further text analytics or machine learning algorithms.

Examples

Count buildings

Using the RoomNumberAnnotator from the UIMA tutorial, count the annotated room numbers per building:

    // Load the type system description from descriptors on the classpath.
    val typeSystem = TypeSystemDescriptionFactory.createTypeSystemDescription()
    val params = Seq(FileSystemCollectionReader.PARAM_INPUTDIR, "data")
    // Build an RDD of serialized CASes from the UIMA collection reader.
    val rdd = makeRDD(createCollectionReader(classOf[FileSystemCollectionReader], params: _*), sc)
    val rnum = createEngineDescription(classOf[RoomNumberAnnotator])
    // Run the annotator on each CAS and pull out the RoomNumber annotations.
    val rooms = rdd.map(process(_, rnum)).flatMap(scas => JCasUtil.select(scas.jcas, classOf[RoomNumber]))
    // Count room numbers per building and print the totals.
    val counts = rooms.map(room => room.getBuilding()).map((_, 1)).reduceByKey(_ + _)
    counts.foreach(println(_))
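
Note that foreach(println(_)) runs on the executors, so on a cluster the output lands in the executor logs rather than on the driver. For a small result set, the usual Spark idiom is to collect to the driver first:

    // Bring the (small) result to the driver before printing.
    counts.collect().foreach(println(_))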

If the collection is too large to fit in memory, or you already have a collection of SCASes, use an HDFS-backed RDD:

    val rdd = sequenceFile(createCollectionReader(classOf[FileSystemCollectionReader], params: _*),
      "hdfs://localhost:9000/documents", sc)

Tokenize and count words with DKPro Core

Use DKPro Core [2] for tokenization and Spark for token-level analytics.

    // Load the type system description from descriptors on the classpath.
    val typeSystem = TypeSystemDescriptionFactory.createTypeSystemDescription()
    // Read every .txt file under "data" as an English document.
    val rdd = makeRDD(createCollectionReader(classOf[TextReader],
      ResourceCollectionReaderBase.PARAM_SOURCE_LOCATION, "data",
      ResourceCollectionReaderBase.PARAM_LANGUAGE, "en",
      ResourceCollectionReaderBase.PARAM_PATTERNS, Array("[+]*.txt")), sc)
    val seg = createPrimitiveDescription(classOf[BreakIteratorSegmenter])
    // Segment each CAS and pull out the Token annotations.
    val tokens = rdd.map(process(_, seg)).flatMap(scas => JCasUtil.select(scas.jcas, classOf[Token]))
    // Count the surviving tokens and sort by ascending frequency.
    val counts = tokens.map(token => token.getCoveredText())
      .filter(filter(_))
      .map((_, 1)).reduceByKey(_ + _)
      .map(pair => (pair._2, pair._1)).sortByKey(true)
    counts.foreach(println(_))
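
The filter helper applied to the token text above is not defined in this snippet. A minimal sketch of one possible implementation, a hypothetical predicate that keeps only alphabetic tokens longer than one character:

    // Hypothetical token filter: drop punctuation, digits, and one-character tokens.
    def filter(text: String): Boolean =
      text.length > 1 && text.forall(_.isLetter)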

Common Tasks

To build:

mvn compile

To run:

mvn scala:run

To test:

mvn test

To create a standalone jar with dependencies:

mvn package
java -jar target/spark-uima-tools-0.0.1-SNAPSHOT-jar-with-dependencies.jar
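
On a Spark installation that ships the spark-submit launcher, the assembled jar can also be submitted to a cluster directly. A sketch; the main class name below is a placeholder, not taken from this repository:

spark-submit --class com.example.suim.Main \
  --master local[*] \
  target/spark-uima-tools-0.0.1-SNAPSHOT-jar-with-dependencies.jar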

References
