Merge pull request #24 from mromanello/1.7.x

version 1.7.1
mromanello · Nov 5, 2019 · 0712f55
2 parents 0e9c5bf + 658b8f3
commit 0712f55
Showing 77 changed files with 609,819 additions and 2,548 deletions.
diff --git a/.codecov.yml b/.codecov.yml
diff --git a/.gitignore b/.gitignore
@@ -7,6 +7,7 @@ __pycache__/
 
 # pytest
 .cache
+.pytest_cache
 
 # venv
 .env
@@ -16,13 +17,25 @@ env/
 *~
 
 # project specific
-citation_extractor/data/pickles/
+# citation_extractor/data/pickles/
 citation_extractor/output/
 
 # pypi
 build/
+*.egg-info/
 dist/
 
-.coverage
-*.egg-info/
+
+# Pycharm
+.idea
+
+# Misc
+.coverage*
 \#*#
+
+# Sphinx
+docs/_build/
+
+# misc
+.tox
+all_in_one.iob
diff --git a/.python-version b/.python-version
@@ -0,0 +1 @@
+2.7.13
diff --git a/.travis.yml b/.travis.yml
@@ -14,7 +14,8 @@ install:
   - pip install -r requirements_dev.txt
   - pip install .
 # command to run tests
-script: pytest -vv --cov=citation_extractor --ignore=tests/test_eval.py
-#script: travis_wait 60 pytest -s -vv --cov=citation_extractor
+#script: pytest -vv --cov=citation_extractor tests/test_Utils.py tests/test_ned.py
+script: pytest -vv --cov=citation_extractor
+#script: pytest -vv --cov=citation_extractor --ignore=tests/test_eval.py
 after_success:
   - codecov
diff --git a/CHANGES.md b/CHANGES.md
@@ -1,3 +1,9 @@
+### 1.7.x
+
+- added library documentation
+- MLCitationMatcher by [@mfilippo](http://github.com/mfilippo/)
+- started to move away from brat standoff format as the default output
+
 ### 31.01.2018 1.6.x @mr56k
 
 - removed the library `CRFPP` as a dependency, and replaced with the `sklearn`- compatible `sklearn-crfsuite`.

diff --git a/NOTES.md b/NOTES.md
@@ -1,9 +1,19 @@
-where I left: try to provide the module with minimum data and directory structure necessary to run some tests. 
+where I left: try to provide the module with minimum data and directory structure necessary to run some tests.
 
 ## Tests
 
 * [testing good practices](http://pytest.org/latest/goodpractises.html)
 
+## For nicer CLIs
+
+```python
+from termcolor import colored
+colored('test', 'red')
+print(colored('test', 'red'))
+print(colored('✓', 'red'))
+print(colored('✓', 'green'))
+```
+
 ## Distributing the package
 
 * see <http://pythonhosted.org/setuptools/setuptools.html>
@@ -18,24 +28,26 @@ then SciPy, then scikit-learn
 
 ## Notes to implement Supervised Relation Detection
 
-* working with many languages makes it more comlicated to work with syntactic features as chunkers do not exist for all the languages we considered ()
+* working with many languages makes it more complicated to work with
+syntactic features as chunkers do not exist for all the languages we considered ()
 
 * the training set should contain both positive and negative examples; to create a negative example out of a positive relation, e.g. "rel(arg1,arg2)" is enough to invert it, "rel(arg2,arg1)"
 
     class= (scope_pos | scope_neg)
 
-## Notes to improve the Named Entity Disambiguation
-
-### Code
+## `ML CitationMatcher`
 
-* improve the logging
-* test that the code can be parallelised
+cfr [this thread in SO](https://stackoverflow.com/questions/15111408/how-does-sklearn-svm-svcs-function-predict-proba-work-internally)
 
-### Logic
+to output a probability for each SVM classification, pass `probability=True`
+when constructing the classifier:
 
-* instead of disambiguating relations first and then entities
-* try to do that by following the sequence of the document
-* get all the annotations for a given document, ordered as they appear...
-* ... then proceed to disambiguate each annotation, using the annotation type to call appropriate function/method
-* this way, neighbouring entity mentions can be used to help with the disambiguation of relations
+```python
+self._classifier = svm.SVC(
+    kernel='linear',
+    C=C,
+    cache_size=cache_size
+)
+```
 
+return the probabilities from `citation_extractor.ned.ml::predict()`
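
Review note on the `ML CitationMatcher` section of `NOTES.md`: in scikit-learn the `SVC` constructor flag is spelled `probability=True`, and per-class probabilities are then read with `predict_proba()`. The block below is a minimal sketch of that idea, assuming scikit-learn and NumPy are installed. The toy data and the `C` and `cache_size` values are illustrative assumptions, not the values used by `MLCitationMatcher`.

```python
import numpy as np
from sklearn import svm

# Toy, roughly separable two-class data (an assumption for illustration only).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(-2.0, 1.0, (20, 2)),  # class 0
    rng.normal(2.0, 1.0, (20, 2)),   # class 1
])
y = np.array([0] * 20 + [1] * 20)

# probability=True makes SVC fit an extra calibration step (Platt scaling,
# using internal cross-validation), so training is slower than without it.
classifier = svm.SVC(
    kernel='linear',
    C=1.0,            # illustrative value
    cache_size=200,   # illustrative value
    probability=True,
    random_state=0,
)
classifier.fit(X, y)

# One row per sample, one column per class, ordered as classifier.classes_.
probabilities = classifier.predict_proba(X)
print(probabilities[:3])
```

Because the calibration is fitted separately from the decision function, the class with the highest probability may occasionally differ from the label returned by `predict()`, which matters if `predict()` is changed to return probabilities.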
diff --git a/Pipfile b/Pipfile
@@ -24,8 +24,8 @@ sklearn-crfsuite = "*"
 scikit-learn = "*"
 jellyfish = "*"
 stop-words = "*"
-antlr_python_runtime = {file = "http://www.antlr3.org/download/Python/antlr_python_runtime-3.1.3.tar.gz"}
-treetagger = {git = "https://github.com/mromanello/treetagger-python.git"}
+antlr_python_runtime = {file = "http://www.antlr3.org/download/Python/antlr_python_runtime-3.1.3.tar.gz#egg=antlr_python_runtime-3.1.3"}
+treetagger = {git = "https://github.com/mromanello/treetagger-python.git#egg=treetagger-1.0.1"}
 
 
 [requires]

diff --git a/README.md b/README.md
@@ -3,8 +3,8 @@
 ## Status
 
 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.35470.svg)](https://doi.org/10.5281/zenodo.35470)
-[![Build Status](https://travis-ci.org/mromanello/CitationExtractor.svg?branch=master)](https://travis-ci.org/mromanello/CitationExtractor)
-[![codecov](https://codecov.io/gh/mromanello/CitationExtractor/branch/master/graph/badge.svg)](https://codecov.io/gh/mromanello/CitationExtractor)
+[![Build Status](https://travis-ci.org/mromanello/CitationExtractor.svg?branch=ml-matcher)](https://travis-ci.org/mromanello/CitationExtractor)
+[![codecov](https://codecov.io/gh/mromanello/CitationExtractor/branch/ml-matcher/graph/badge.svg)](https://codecov.io/gh/mromanello/CitationExtractor/branch/ml-matcher)
 
 ## Installation
 

diff --git a/TODO.md b/TODO.md
@@ -1,3 +1,23 @@
+## Up next
+
+- [ ] revise `pipeline` module:
+  - rationale: serialize to JSON as default
+  - remove dependency on `brat` code (`conll2standoff`)
+  - update tests
+- [ ] update TreeTagger installation script
+  - and provide a version for Mac OS
+
+
+## integration of ML-Matcher into the codebase
+
+* [x] evaluation of `MLCitationMatcher` (via `tests/test_eval.py`)
+* [x] parallelise train/disambiguate/feature extraction with `dask`
+* [x] write test `FeatureExtractor.extract_nil` (mr)
+* [x] write tests for `ned.candidates.CandidatesGenerator` (mr)
+* [x] write documentation for feature functions (mf)
+* [x] implement `MLCitationMatcher.train`
+* [x] implement `MLCitationMatcher.classify`
+
 ## Next steps
 
 * [ ] improve the code quality/style
@@ -13,8 +33,9 @@
 
 ## Code Refactoring
 
+* [ ] remove obsolete functions from `pipeline`
 * to streamline installation, try to remove local dependencies:
-	* [ ] add `pysuffix` to the codebase => `Utils.pysuffix` (or os)
+	* [ ] add `pysuffix` to the codebase => `Utils.pysuffix` (or so)
 
 * [ ] change the `LookupDictionary` in `Utils.FastDict` so that it gets the data directly from the Knowledge Base instead of the static file (**needs tests**)
 
@@ -31,6 +52,7 @@
 
 ## Testing
 
+* [ ] rewrite tests for `pipeline` module
 
 * write tests for:
     * [x] creating and running a citation extractor