Commit

Merge pull request #24 from mromanello/1.7.x
version 1.7.1
Matteo Romanello authored Nov 5, 2019
2 parents 0e9c5bf + 658b8f3 commit 0712f55
Showing 77 changed files with 609,819 additions and 2,548 deletions.
3 changes: 0 additions & 3 deletions .codecov.yml

This file was deleted.

19 changes: 16 additions & 3 deletions .gitignore
@@ -7,6 +7,7 @@ __pycache__/

# pytest
.cache
.pytest_cache

# venv
.env
@@ -16,13 +17,25 @@ env/
*~

# project specific
citation_extractor/data/pickles/
# citation_extractor/data/pickles/
citation_extractor/output/

# pypi
build/
*.egg-info/
dist/

.coverage
*.egg-info/

# Pycharm
.idea

# Misc
.coverage*
\#*#

# Sphinx
docs/_build/

# misc
.tox
all_in_one.iob
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
2.7.13
5 changes: 3 additions & 2 deletions .travis.yml
@@ -14,7 +14,8 @@ install:
- pip install -r requirements_dev.txt
- pip install .
# command to run tests
script: pytest -vv --cov=citation_extractor --ignore=tests/test_eval.py
#script: travis_wait 60 pytest -s -vv --cov=citation_extractor
#script: pytest -vv --cov=citation_extractor tests/test_Utils.py tests/test_ned.py
script: pytest -vv --cov=citation_extractor
#script: pytest -vv --cov=citation_extractor --ignore=tests/test_eval.py
after_success:
- codecov
6 changes: 6 additions & 0 deletions CHANGES.md
@@ -1,3 +1,9 @@
### 1.7.x

- added library documentation
- MLCitationMatcher by [@mfilippo](http://github.com/mfilippo/)
- started to move away from brat standoff format as the default output

### 31.01.2018 1.6.x @mr56k

- removed the library `CRFPP` as a dependency, and replaced with the `sklearn`-compatible `sklearn-crfsuite`.
38 changes: 25 additions & 13 deletions NOTES.md
@@ -1,9 +1,19 @@
where I left: try to provide the module with minimum data and directory structure necessary to run some tests.
where I left: try to provide the module with minimum data and directory structure necessary to run some tests.

## Tests

* [testing good practices](http://pytest.org/latest/goodpractises.html)

## For nicer CLIs

```python
from termcolor import colored
colored('test', 'red')
print(colored('test', 'red'))
print(colored('', 'red'))
print(colored('', 'green'))
```

## Distributing the package

* see <http://pythonhosted.org/setuptools/setuptools.html>
@@ -18,24 +28,26 @@ then SciPy, then scikit-learn

## Notes to implement Supervised Relation Detection

* working with many languages makes it more comlicated to work with syntactic features as chunkers do not exist for all the languages we considered ()
* working with many languages makes it more complicated to work with
syntactic features as chunkers do not exist for all the languages we considered ()

* the training set should contain both positive and negative examples; to create a negative example out of a positive relation, e.g. "rel(arg1,arg2)", it is enough to invert its arguments: "rel(arg2,arg1)"

class= (scope_pos | scope_neg)
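
The inversion trick described above can be sketched roughly as follows. This is only an illustration under assumptions: the `Relation` tuple and the `make_training_pairs` helper are hypothetical names that do not exist in the codebase, and the sample arguments are made up.

```python
from collections import namedtuple

# Hypothetical container for one labelled training instance.
Relation = namedtuple("Relation", ["arg1", "arg2", "label"])


def make_training_pairs(positive_relations):
    """Emit each positive (arg1, arg2) pair as `scope_pos`
    and its inverted counterpart (arg2, arg1) as `scope_neg`."""
    examples = []
    for arg1, arg2 in positive_relations:
        examples.append(Relation(arg1, arg2, "scope_pos"))
        examples.append(Relation(arg2, arg1, "scope_neg"))
    return examples


# Made-up sample: an entity mention and the scope it relates to.
examples = make_training_pairs([("Hom. Il.", "1.1-10")])
for example in examples:
    print(example)
```

Every positive relation yields exactly one negative example this way, so the two classes stay balanced.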

## Notes to improve the Named Entity Disambiguation

### Code
## `ML CitationMatcher`

* improve the logging
* test that the code can be parallelised
cfr [this thread in SO](https://stackoverflow.com/questions/15111408/how-does-sklearn-svm-svcs-function-predict-proba-work-internally)

### Logic
to output a probability for each classification by the SVM, pass `probability=True`
when instantiating `svm.SVC`

* instead of disambiguating relations first and then entities
* try to do that by following the sequence of the document
* get all the annotations for a given document, ordered as they appear...
* ... then proceed to disambiguate each annotation, using the annotation type to call appropriate function/method
* this way, neighbouring entity mentions can be used to help with the disambiguation of relations
```python
self._classifier = svm.SVC(
    kernel='linear',
    C=C,
    cache_size=cache_size
)
```

return the probabilities from `citation_extractor.ned.ml::predict()`
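
A minimal sketch of the idea in this note, assuming scikit-learn is installed. Note that the constructor argument in scikit-learn is spelled `probability`; the toy data, the values of `C` and `cache_size`, and the variable names are assumptions made for illustration and are not taken from `citation_extractor.ned.ml`.

```python
import numpy as np
from sklearn import svm

# Toy, well-separated data: 20 points per class (purely illustrative).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(-2.0, 0.5, size=(20, 2)),
    rng.normal(2.0, 0.5, size=(20, 2)),
])
y = np.array([0] * 20 + [1] * 20)

# probability=True makes the fitted model expose predict_proba().
classifier = svm.SVC(
    kernel='linear',
    C=1.0,            # assumed value
    cache_size=200,   # assumed value (MB)
    probability=True,
    random_state=0,
)
classifier.fit(X, y)

# One row per sample, one column per class, in the order of classifier.classes_.
probabilities = classifier.predict_proba(np.array([[-2.0, -2.0], [2.0, 2.0]]))
print(classifier.classes_)
print(probabilities)
```

Enabling `probability=True` makes fitting slower, because the probabilities are calibrated with an internal cross-validation, and on small datasets the resulting scores may not always agree with `predict()`.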
4 changes: 2 additions & 2 deletions Pipfile
@@ -24,8 +24,8 @@ sklearn-crfsuite = "*"
scikit-learn = "*"
jellyfish = "*"
stop-words = "*"
antlr_python_runtime = {file = "http://www.antlr3.org/download/Python/antlr_python_runtime-3.1.3.tar.gz"}
treetagger = {git = "https://github.com/mromanello/treetagger-python.git"}
antlr_python_runtime = {file = "http://www.antlr3.org/download/Python/antlr_python_runtime-3.1.3.tar.gz#egg=antlr_python_runtime-3.1.3"}
treetagger = {git = "https://github.com/mromanello/treetagger-python.git#egg=treetagger-1.0.1"}


[requires]
4 changes: 2 additions & 2 deletions README.md
@@ -3,8 +3,8 @@
## Status

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.35470.svg)](https://doi.org/10.5281/zenodo.35470)
[![Build Status](https://travis-ci.org/mromanello/CitationExtractor.svg?branch=master)](https://travis-ci.org/mromanello/CitationExtractor)
[![codecov](https://codecov.io/gh/mromanello/CitationExtractor/branch/master/graph/badge.svg)](https://codecov.io/gh/mromanello/CitationExtractor)
[![Build Status](https://travis-ci.org/mromanello/CitationExtractor.svg?branch=ml-matcher)](https://travis-ci.org/mromanello/CitationExtractor)
[![codecov](https://codecov.io/gh/mromanello/CitationExtractor/branch/ml-matcher/graph/badge.svg)](https://codecov.io/gh/mromanello/CitationExtractor/branch/ml-matcher)

## Installation

24 changes: 23 additions & 1 deletion TODO.md
@@ -1,3 +1,23 @@
## Up next

- [ ] revise `pipeline` module:
    - rationale: serialize to JSON as default
    - remove dependency on `brat` code (`conll2standoff`)
    - update tests
- [ ] update TreeTagger installation script
    - and provide a version for Mac OS


## integration of ML-Matcher into the codebase

* [x] evaluation of `MLCitationMatcher` (via `tests/test_eval.py`)
* [x] parallelise train/disambiguate/feature extraction with `dask`
* [x] write test `FeatureExtractor.extract_nil` (mr)
* [x] write tests for `ned.candidates.CandidatesGenerator` (mr)
* [x] write documentation for feature functions (mf)
* [x] implement `MLCitationMatcher.train`
* [x] implement `MLCitationMatcher.classify`

## Next steps

* [ ] improve the code quality/style
@@ -13,8 +33,9 @@

## Code Refactoring

* [ ] remove obsolete functions from `pipeline`
* to streamline installation, try to remove local dependencies:
* [ ] add `pysuffix` to the codebase => `Utils.pysuffix` (or os)
* [ ] add `pysuffix` to the codebase => `Utils.pysuffix` (or so)

* [ ] change the `LookupDictionary` in `Utils.FastDict` so that it gets the data directly from the Knowledge Base instead of the static file (**needs tests**)

@@ -31,6 +52,7 @@

## Testing

* [ ] rewrite tests for `pipeline` module

* write tests for:
* [x] creating and running a citation extractor