Extraction and linking #89
Merged
bolandka (Member) commented on May 2, 2016
- much more efficient context extraction using Lucene's highlighter package
- more efficient pattern searching using LucenePatternSearcher; added a tokenization algorithm and whitespaceAnalyzer
- textualReferences consist of whole sentences instead of fixed context size (Context size #5)
- number of words used in patterns is adjustable and independent of context size
- further generalized punctuation and years in patterns for higher recall
- optimized Lucene queries for better performance
- FrequencyBasedBootstrapping now prohibits only the generation of less general patterns, not of equally general patterns
- fixed ReliabilityBasedBootstrapping (fix computation of reliability scores (InfolisPattern.getReliable) #15, fix reliability-based bootstrapping (ReliabilityBasedBootstrapping) #16)
- added a startpage parameter for text extraction so that title pages can be ignored
- updated libraries (Java, Lucene, ..)
- adapted code to changes in the dara repository
- filtering resourceType: dataset when querying dara (ReferenceResolving: better filter for dara results #75)
- the tokenizer and bibliographyExtractor create temporary files if no output directories are specified
- intermediate results are posted to temporary datastores (objects have to be posted to the datastore but cannot be deleted #87)
- added a class to extract DOIs from texts (add class to extract DOIs #80)
- added a very basic importer for Springer A++ documents
- added a package for importing and exporting annotated data (some features still missing)
- some more optimizations, fixes and cleanup (e.g. extractBib property in Bootstrapping ignored when input files are txt instead of pdf #79, NPE in ApplyPatternAndResolve if called with queryServices variant #77, NullPointerException when trying to retrieve reliability of QueryService #53)
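
To illustrate the DOI extraction item in the list above (#80), here is a minimal sketch of how DOIs might be pulled out of plain text with a regular expression. It is not the class added in this pull request: the class name `DoiExtractorSketch`, the method `extractDois` and the pattern itself are assumptions made for illustration, and the pattern is a common heuristic rather than a complete DOI grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the DOI extractor shipped with this pull request.
class DoiExtractorSketch {

    // Heuristic (assumption): "10.", a 4-9 digit registrant code, "/",
    // then a suffix running up to whitespace, a quote or an angle bracket.
    private static final Pattern DOI =
            Pattern.compile("\\b10\\.\\d{4,9}/[^\\s\"<>]+");

    // Returns the distinct DOI candidates in order of first appearance.
    static List<String> extractDois(String text) {
        List<String> dois = new ArrayList<>();
        Matcher matcher = DOI.matcher(text);
        while (matcher.find()) {
            // Trailing punctuation usually belongs to the sentence, not the DOI.
            String doi = matcher.group().replaceAll("[.,;:)\\]]+$", "");
            if (!dois.contains(doi)) {
                dois.add(doi);
            }
        }
        return dois;
    }

    public static void main(String[] args) {
        String text = "See doi:10.4232/1.11059, also https://doi.org/10.1000/xyz123.";
        System.out.println(extractDois(text));
    }
}
```

Stripping trailing punctuation trades a little precision for recall: a DOI that genuinely ends in `.` or `)` would be cut short, so a real extractor may want to validate candidates against a resolver.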
….8 and added mallet dependency
…ted extraction of contexts to tokenized input text; inserted some tokenizer markup to stopword list
…ordNLP or openNLP libraries
…ed model files is added
…ted by punctuation
…xt from pdfs that contain title pages
… if no output directory is specified
…lgorithms that will be run on the extracted files as default
…dTimeMatcher defined in RegexUtils; treat years as stopwords
…nput and different lucene analyzer
… some redundant posting of data
…patterns and contexts created when searching for a seed now stored only temporarily; use of custom PostingsHighlighter relying on tokenized input instead of applying lucene's sentence splitting
…ed as patternRegex, old patternRegex is not needed anymore due to extracting contexts using a lucene highlighter
…to tokenizer; removed empty quotations in regex
…eral already accepted pattern (as before) but allow acceptance of equally general patterns
… not contain any words
…o latest changes in the process
…en; queryService uri for searchResult is set in solrQS; QS reliability score is used in reliability score calculation in SearchResultLinker (#53)
…endent of context size)
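
The generalization of years in patterns mentioned in the change list can be pictured as replacing concrete years with a placeholder before patterns are built, so that a context seen with one year also matches other years. The sketch below assumes a hypothetical placeholder `<YEAR>` and a hypothetical regular expression; the expressions actually defined in RegexUtils are not shown in this pull request description and may differ.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the project's own year regex lives in RegexUtils.
class YearGeneralizationSketch {

    // Assumption: a year is a four-digit number starting with 1 or 2,
    // optionally followed by a range such as 1990-1995 or 1996/98.
    private static final Pattern YEAR = Pattern.compile(
            "\\b[12]\\d{3}(?:\\s*[-/]\\s*(?:[12]\\d{3}|\\d{2}))?\\b");

    // Placeholder token chosen for illustration only.
    static final String YEAR_PLACEHOLDER = "<YEAR>";

    // Replaces every year or year range in the context with the placeholder.
    static String generalizeYears(String context) {
        return YEAR.matcher(context).replaceAll(YEAR_PLACEHOLDER);
    }

    public static void main(String[] args) {
        System.out.println(generalizeYears("based on the ALLBUS 1996/98 data"));
    }
}
```

Collapsing a range into a single placeholder keeps the generated pattern short, at the cost of no longer distinguishing single years from ranges.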