SUIM

SUIM (Spark for Unstructured Information) provides a thin abstraction layer for UIMA on top of Spark. SUIM leverages Spark's resilient distributed datasets (RDDs) to run UIMA pipelines built with uimaFIT; the pipelines are distributed across the nodes of a cluster and operated on in parallel [1].

SUIM also allows you to run analytical pipelines on the resulting (or intermediate) CASes to execute further text analytics or machine learning algorithms.

Examples

Count buildings

Using the RoomNumberAnnotator from the UIMA tutorial, count the annotated room numbers per building:

    // Load the type system description from descriptors on the classpath.
    val typeSystem = TypeSystemDescriptionFactory.createTypeSystemDescription()
    val params = Seq(FileSystemCollectionReader.PARAM_INPUTDIR, "data")
    // Build an RDD of serialized CASes from the UIMA collection reader.
    val rdd = makeRDD(createCollectionReader(classOf[FileSystemCollectionReader], params: _*), sc)
    val rnum = createEngineDescription(classOf[RoomNumberAnnotator])
    // Run the annotator on each CAS and pull out the RoomNumber annotations.
    val rooms = rdd.map(process(_, rnum)).flatMap(scas => JCasUtil.select(scas.jcas, classOf[RoomNumber]))
    // Count room numbers per building and print the totals.
    val counts = rooms.map(room => room.getBuilding()).map((_, 1)).reduceByKey(_ + _)
    counts.foreach(println(_))
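
Note that foreach(println(_)) runs on the executors, so on a cluster the output lands in the executor logs rather than on the driver. For a small result set, the usual Spark idiom is to collect to the driver first:

    // Bring the (small) result to the driver before printing.
    counts.collect().foreach(println(_))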

If the collection is too large to fit in memory, or you already have a collection of SCASes, use an HDFS-backed RDD:

    val rdd = sequenceFile(createCollectionReader(classOf[FileSystemCollectionReader], params: _*),
      "hdfs://localhost:9000/documents", sc)

Tokenize and count words with DKPro Core

Use DKPro Core [2] for tokenization and Spark for token-level analytics.

    // Load the type system description from descriptors on the classpath.
    val typeSystem = TypeSystemDescriptionFactory.createTypeSystemDescription()
    // Read every .txt file under "data" as an English document.
    val rdd = makeRDD(createCollectionReader(classOf[TextReader],
      ResourceCollectionReaderBase.PARAM_SOURCE_LOCATION, "data",
      ResourceCollectionReaderBase.PARAM_LANGUAGE, "en",
      ResourceCollectionReaderBase.PARAM_PATTERNS, Array("[+]*.txt")), sc)
    val seg = createPrimitiveDescription(classOf[BreakIteratorSegmenter])
    // Segment each CAS and pull out the Token annotations.
    val tokens = rdd.map(process(_, seg)).flatMap(scas => JCasUtil.select(scas.jcas, classOf[Token]))
    // Count the surviving tokens and sort by ascending frequency.
    val counts = tokens.map(token => token.getCoveredText())
      .filter(filter(_))
      .map((_, 1)).reduceByKey(_ + _)
      .map(pair => (pair._2, pair._1)).sortByKey(true)
    counts.foreach(println(_))
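
The filter helper applied to the token text above is not defined in this snippet. A minimal sketch of one possible implementation, a hypothetical predicate that keeps only alphabetic tokens longer than one character:

    // Hypothetical token filter: drop punctuation, digits, and one-character tokens.
    def filter(text: String): Boolean =
      text.length > 1 && text.forall(_.isLetter)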

Common Tasks

To build:

mvn compile

To run:

mvn scala:run

To test:

mvn test

To create a standalone jar with dependencies:

mvn package
java -jar target/spark-uima-tools-0.0.1-SNAPSHOT-jar-with-dependencies.jar
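
On a Spark installation that ships the spark-submit launcher, the assembled jar can also be submitted to a cluster directly. A sketch; the main class name below is a placeholder, not taken from this repository:

spark-submit --class com.example.suim.Main \
  --master local[*] \
  target/spark-uima-tools-0.0.1-SNAPSHOT-jar-with-dependencies.jar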

References
