Architecture
The insights portion of the Polar Deep Insights project contains 2 major components:
- Insight Generator
- Insight Visualizer
The insight generator is a Python library that provides an interface for extracting entities, locations, file metadata, and measurements from documents.
The library interfaces with the following context extraction tools to extract each type of meta information.
Tool | Type |
---|---|
Apache Tika | content and file-metadata |
Stanford CoreNLP / NER | dates and locations |
Python regex | entities |
Grobid Quantities | measurements |
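To illustrate the "Python regex" row of the table, here is a minimal sketch of regex-based entity matching. The `ENTITY_TERMS` list and the `find_entities` helper are hypothetical names introduced for this example; the patterns the insight generator actually uses may differ.

```python
import re
from collections import Counter

# Hypothetical terms; a real list would come from your own ontology-of-interest.
ENTITY_TERMS = ["sea ice", "permafrost", "glacier"]

# One case-insensitive pattern that matches any term on word boundaries.
ENTITY_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in ENTITY_TERMS) + r")\b",
    re.IGNORECASE,
)


def find_entities(text):
    """Return a Counter of lower-cased entity mentions found in text."""
    return Counter(m.group(1).lower() for m in ENTITY_PATTERN.finditer(text))
```

Counting mentions per term, rather than returning a flat list, keeps the result small for long documents and is easy to aggregate across files.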
Set up the extraction libraries as described here. Ensure that each one is running on the port given in those instructions.
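Before running the generator, it can help to confirm that each extraction service is reachable. The sketch below only checks that something accepts TCP connections on a port. The service names and port numbers in `SERVICES` are placeholder assumptions, so replace them with the values from the setup instructions.

```python
import socket

# Placeholder host/port pairs (assumptions); use the ports from the setup guide.
SERVICES = {
    "tika": ("localhost", 9998),
    "corenlp": ("localhost", 9000),
    "grobid-quantities": ("localhost", 8060),
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services):
    """Map each service name to True/False depending on whether its port is open."""
    return {name: is_listening(host, port) for name, (host, port) in services.items()}
```

Calling `check_services(SERVICES)` returns a dictionary such as `{"tika": True, ...}`. A `False` entry means the port is closed, not necessarily that the service is misconfigured.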
cd ./insight-generator
Given a root path as an argument, the main.py script recurses down the directory tree, extracts the above-mentioned meta information from each file, and saves the extracted contents to a local file.
The extraction library also works with files stored on S3/HDFS, provided they are mounted onto the local file system.
# Syntax
python main.py [ ROOT-PATH ] [ OUTPUT-PATH ]
# Example
python main.py "/tmp/dump" entities.txt
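The traversal described above can be sketched with the standard library alone. This is not the code in main.py: `extract_stub` is a hypothetical stand-in for the real extraction step, and the one-JSON-record-per-line output format is an assumption made for the example.

```python
import json
import os


def extract_stub(path):
    # Hypothetical stand-in for the real extractor: records path and size only.
    return {"path": path, "size": os.path.getsize(path)}


def walk_and_save(root_path, output_path, extract=extract_stub):
    """Walk root_path recursively and write one JSON record per file to output_path.

    Returns the number of files processed. Keep output_path outside root_path
    so the output file is not picked up by the walk itself.
    """
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for dirpath, _dirnames, filenames in os.walk(root_path):
            for name in sorted(filenames):
                record = extract(os.path.join(dirpath, name))
                out.write(json.dumps(record) + "\n")
                count += 1
    return count
```

Writing each record as soon as it is extracted keeps memory use flat on large directory trees and leaves a partial result on disk if a later file fails.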
Users can build custom implementations to handle the extracted meta information.
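from extractors.base import InformationExtractor
from util.dir_tree import DirTreeTraverser

BASE_PATH = "/tmp/dump"  # root directory to traverse (example value)

def customProcessor(metaInfo):
    # Do something with the extracted meta information
    pass

def process(PATH):
    # Extract meta information from one file and pass it to the custom processor
    mI = InformationExtractor(PATH).extract()
    customProcessor(mI)

# Call process() on every file under BASE_PATH
DirTreeTraverser(BASE_PATH).iterateAndPerform(process)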
Users can also run the extract.py script as a standalone meta information extractor for a single file.
# Syntax
python extract.py [ FILE-PATH ]
# Example
python extract.py /tmp/dump/test.html | json_pp
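Because the example pipes the output to json_pp, the script presumably prints a JSON document to standard output. Under that assumption, the result can be consumed from Python as sketched below. `run_extractor` is a hypothetical helper, and the code needs Python 3.7 or later for `capture_output`.

```python
import json
import subprocess
import sys


def run_extractor(command):
    """Run a command that prints JSON to stdout and return the parsed result.

    Raises subprocess.CalledProcessError if the command exits with a non-zero
    status, and json.JSONDecodeError if the output is not valid JSON.
    """
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


# Example call (assumes extract.py is in the current working directory):
# meta = run_extractor([sys.executable, "extract.py", "/tmp/dump/test.html"])
```

Passing the command as a list avoids shell quoting problems with file paths that contain spaces.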
The insight visualizer is an AngularJS single-page application (SPA) that facilitates building an 'ontology-of-interest' using the concept editor interface.
Users can then explore the information extracted by the insight generator module through the [query interface](Guide).
The query interface is extensible. Each visualization tab is an AngularJS component. Users who wish to add their own visualization elements can define custom components as follows.
<!-- Custom visualization component -->
<div polar-analytics-my-custom-visualization data-filters="filters"></div>
// Custom component controller
$scope.data = Document.query($FilterParser($scope.filters), {
// Custom elastic search aggregate query
});
// Build your visualization with the aggregated data from Elastic Search set on the controller scope.
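For readers unfamiliar with Elasticsearch aggregations, the sketch below shows the general shape of a terms aggregation request body. It is written in Python, as a plain dictionary, for consistency with the insight generator examples. The helper name, the `top_values` aggregation name, and any field names passed in are hypothetical; the fields available depend on how the extracted documents were indexed.

```python
def build_terms_aggregation(field, filters=None, size=10):
    """Build an Elasticsearch request body for a terms aggregation over `field`.

    `filters` is an optional dict of exact-match field/value pairs that
    restrict which documents are aggregated.
    """
    filter_clauses = [{"term": {key: value}} for key, value in (filters or {}).items()]
    return {
        "size": 0,  # return only aggregation buckets, no document hits
        "query": {"bool": {"filter": filter_clauses}},
        "aggs": {
            "top_values": {"terms": {"field": field, "size": size}},
        },
    }
```

A custom component would supply a query of this kind and render the returned buckets, for example as a bar chart of the most frequent values.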
Open an issue with regard to setup or contributions.
Information Retrieval and Data Science (IRDS) research group, University of Southern California.