Extraction and linking #89

bolandka · 2016-05-02T23:29:50Z

much more efficient context extraction using lucene's highlighter package
more efficient pattern searching using LucenePatternSearcher; added tokenization algorithm and use of WhitespaceAnalyzer
textualReferences now consist of whole sentences instead of a fixed-size context (Context size #5)
number of words used in patterns is adjustable and independent of context size
further generalized punctuation and years in patterns for higher recall
optimized lucene queries for better performance
FrequencyBasedBootstrapping now rejects only patterns that are a special case of a more general, already accepted pattern; equally general patterns are accepted
fixed ReliabilityBasedBootstrapping (fix computation of reliability scores (InfolisPattern.getReliable) #15, fix reliability-based bootstrapping (ReliabilityBasedBootstrapping) #16)
added startpage parameter for text extraction to enable ignoring title pages
updated libraries (java, lucene, ...)
adapted code to changes in the dara repository
filter by resourceType: dataset when querying dara (ReferenceResolving: better filter for dara results #75)
creation of temporary files in tokenizer and bibliographyExtractor if no output directories are specified
posting of intermediate results to temporary datastores (objects have to be posted to the datastore but cannot be deleted #87)
added class to extract dois from texts (add class to extract DOIs #80)
added a very basic importer for springer a++ documents
added package for importing and exporting annotated data (some features still missing)
some more optimizations, fixes and cleaning (e.g. extractBib property in Bootstrapping ignored when input files are txt instead of pdf #79, NPE in ApplyPatternAndResolve if called with queryServices variant #77, NullPointerException when trying to retrieve reliability of QueryService #53)
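
For readers unfamiliar with the DOI extraction item above (#80), here is a minimal, self-contained sketch of the general idea. It is not the class added in this PR: the class name `DoiExtractorSketch`, the method `extractDois`, the regular expression, and the trailing-punctuation heuristic are all assumptions made for illustration, using only the JDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the DOI extractor contained in this PR.
final class DoiExtractorSketch {

    // Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a suffix
    // made of characters commonly seen in DOIs. Real DOIs may use other characters.
    private static final Pattern DOI =
            Pattern.compile("\\b10\\.\\d{4,9}/[-._;()/:A-Za-z0-9]+");

    // Returns the distinct DOI-like strings found in the text, in order of first occurrence.
    static List<String> extractDois(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher matcher = DOI.matcher(text);
        while (matcher.find()) {
            // Heuristic: drop trailing punctuation that usually belongs to the
            // surrounding sentence rather than to the DOI itself.
            String doi = matcher.group().replaceAll("[.,;:)]+$", "");
            found.add(doi);
        }
        return new ArrayList<>(found);
    }
}
```

The trailing-punctuation stripping trades a little precision for recall: a DOI that really ends in `)` or `.` would be shortened, so a production extractor would need a more careful rule.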


bolandka and others added 30 commits February 21, 2016 12:48

Merge branch 'master' into extractionAndLinking

90d2d92

Added package for importing and exporting annotated data

e81cf74

added tokenizer algorithm

173d30d

updated stanford package to new version, set sourceCompatibility to 1.8 and added mallet dependency

2a5c630

Merge remote-tracking branch 'origin/master' into extractionAndLinking

3d17ba7

code update: use new version of lucene; use WhitespaceAnalyzer from now on

fb37a2d

fixed escaping of lucene queries

166d1a2

contexts now have variable size (#5); more efficient extraction; adapted extraction of contexts to tokenized input text; inserted some tokenizer markup to stopword list

f7a531c

adapted test to new data in dara repository

ff2a80f

adapted test to new data in dara repository

1ab1d13

added abstract tokenizer class with implementing classes using stanfordNLP or openNLP libraries

1ed386e

fixed stopwords for stanfordNLP tokens

2e478eb

minor addition to debug log

52acc1d

set openNLP tokenizer test to ignore until download script for required model files is added

547b2a8

lucene queries now use wildcards to find collocations of words connected by punctuation

0afeca4

generalized patterns for years and numbers

f7c6649

fix: entityLinks are not in fixed order

8346fdd

added startpage parameter for textExtractor: useful for extracting text from pdfs that contain title pages

4952471

integrated tokenization into workflow

949c97d

analogous specification of output directories for tokenizer and bibliographyExtractor

2173bbf

adapted execute method to altered getTokenizedSentences method

6901f28

fixed tokenization on demand for bootstrapping / CLI

394189b

Tokenizer and BibliographyExtractor now create temporary output files if no output directory is specified

59496db

adapted tests to new data in dara

bf31bfc

renamed pattern.txt -> patterns.txt

b614a24

added usage of parameters for tokenization

7ae840a

added required tokenize parameter

ae87693

set tokenize to false for text extraction and to true for all other algorithms that will be run on the extracted files as default

9a860b8

extraction of contexts adapted to tokenization; use timeout for limitedTimeMatcher defined in RegexUtils; treat years as stopwords

65cebc5

pattern induction for flexible window size and adapted to tokenized input and different lucene analyzer

9ace508

bolandka and others added 28 commits April 11, 2016 12:31

start page parameter will no longer be ignored in --convert-to-text mode

2340685

optimised generation of lucene queries: no leading and trailing wildcards

9235107

fix

2af487d

fixed log messages

855b0d2

ignore tokens inserted by TokenizerStanford when checking for uppercase characters

0a4ee14

refined regex

b283766

basic importer for Springer A++ xml files

f7eea41

using lucene highlighter to extract contexts instead of RegexSearcher

c4a5dfa

improved context extraction using PostingsHighlighter

8f33703

added progress updates and documentation

a1a1666

removed superfluous persistExecution() calls

d86f204

added usage of temporary datastores for intermediate results; removed some redundant posting of data

fc75f47

LuceneSearcher reads index only once and then performs all searches; patterns and contexts created when searching for a seed now stored only temporarily; use of custom PostingsHighlighter relying on tokenized input instead of applying lucene's sentence splitting

595b938

added extractBib as param for Bootstrapping

d175fdb

removed matchingFiles parameter - now outputFiles are used instead

ff12c23

removed minimal from pattern - what was stored as minimal is now stored as patternRegex, old patternRegex is not needed anymore due to extracting contexts using a lucene highlighter

4f919b8

normalized punctuation in regex; added splitting of (some) compounds to tokenizer; removed empty quotations in regex

2e73fab

fix + changed log messages

a331c72

prohibit acceptance of patterns that are a special case of a more general already accepted pattern (as before) but allow acceptance of equally general patterns

a9f1f2f

added constraint: referenced term must not be a stopword

cd7fdaa

textual references are now also created when left or right context do not contain any words

0da9782

added class to extract DOIs from texts (#80); adapted RegexSearcher to latest changes in the process

df61b17

queryService is now posted in FederatedSearcher if only class was given; queryService uri for searchResult is set in solrQS; QS reliability score is used in reliability score calculation in SearchResultLinker (#53)

08f44c8

fix and example json for doiExtractor

2df1bf0

fixed search-candidate mode

dfed67e

fixed computation of reliability scores (#15) and ReliabilityBasedBootstrapping (#16); post intermediate results to temporary data store (#87)

3201ca1

Merge branch 'extractionAndLinking' into extractionAndLinking

3d82465

number of words used for creating patterns can now be adjusted (independent of context size)

d3d507d

kba merged commit d3d507d into master May 3, 2016

kba deleted the extractionAndLinking branch May 4, 2016 13:26
