Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #16 from mromanello/v1.4.x
V1.4.x
Showing 37,525 changed files with 956,286 additions and 2,472 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
coverage: | ||
ignore: | ||
- "citation_extractor/settings/" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# MacOS Specific | ||
.DS_Store | ||
|
||
# Python | ||
__pycache__/ | ||
*.py[cod] | ||
|
||
# pytest | ||
.cache | ||
|
||
# venv | ||
.env | ||
env/ | ||
|
||
# Vim tmp files | ||
*~ | ||
|
||
# project specific | ||
citation_extractor/data/pickles/ | ||
|
||
# pypi | ||
build/ | ||
dist/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
env: | ||
- TREETAGGER_HOME=/home/$USER/tree-tagger/cmd/ | ||
language: python | ||
python: | ||
- "2.7" | ||
# command to install dependencies | ||
before_install: | ||
- sudo apt-get update --fix-missing | ||
- sudo apt-get install gfortran libopenblas-dev liblapack-dev | ||
- sudo apt-get remove automake | ||
install: | ||
- ./install_treetagger.sh | ||
- sudo -H ./install_dependencies.sh | ||
- sudo chmod 777 -R crfpp | ||
- cd crfpp/ | ||
- export C_INCLUDE_PATH=/usr/local/include/:${C_INCLUDE_PATH} | ||
- export CPLUS_INCLUDE_PATH=/usr/local/include/:${CPLUS_INCLUDE_PATH} | ||
- pip install -e python | ||
- cd | ||
- git clone https://github.com/mromanello/hucit_kb.git | ||
- cd hucit_kb | ||
- pip install -r requirements.txt | ||
- pip install . | ||
- sudo -H ./install_3stores.sh | ||
- pip install http://www.antlr3.org/download/Python/antlr_python_runtime-3.1.3.tar.gz https://github.com/mromanello/pyCTS/archive/master.zip citation_parser | ||
- cd $TRAVIS_BUILD_DIR | ||
- pip install -e lib/ | ||
- pip install -r requirements.txt | ||
- pip install -r requirements_dev.txt | ||
- pip install . | ||
# command to run tests | ||
script: pytest -vv --cov=citation_extractor --ignore=tests/test_eval.py | ||
#script: travis_wait 60 pytest -s -vv --cov=citation_extractor | ||
after_success: | ||
- codecov |
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
author: Matteo Romanello, <[email protected]> | ||
|
||
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.35470.svg)](https://doi.org/10.5281/zenodo.35470) | ||
[![Build Status](https://travis-ci.org/mromanello/CitationExtractor.svg?branch=master)](https://travis-ci.org/mromanello/CitationExtractor) | ||
[![codecov](https://codecov.io/gh/mromanello/CitationExtractor/branch/master/graph/badge.svg)](https://codecov.io/gh/mromanello/CitationExtractor) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,38 +1,58 @@ | ||
* re-organise the logging | ||
* in `process.preproc_document` replace `guess_language` with `langid` library as it seems way more accurate (!!) | ||
* https://docs.python.org/2/library/pkgutil.html#pkgutil.get_data | ||
* `get_resource_filename` and `resource_isdir()` | ||
## Next steps | ||
|
||
* create evaluation `py.tests` for NER, RelEX and (as soon as possible) NED | ||
- k-fold cross evaluation | ||
- this way evaluations can be run every time, e.g., a feature extraction function is changed or introduced | ||
- write results to disk so that they can be inspected e.g. via brat | ||
- for RelEx: compare rule-based and ML-based extraction | ||
* create some stats about the training/test corpus | ||
- number of entities by class | ||
- number of relations | ||
- number of tokens | ||
- language distribution of documents | ||
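The corpus statistics listed above could be gathered along these lines. This is only a sketch: the document layout (a dict with `lang`, `tokens`, `entities` and `relations` keys) and the function name `corpus_stats` are assumptions made for illustration, not the project's actual data model.

```python
from collections import Counter


def corpus_stats(documents):
    """Summarise a corpus given as a list of dicts (hypothetical layout)."""
    entities_by_class = Counter()
    languages = Counter()
    n_relations = 0
    n_tokens = 0
    for doc in documents:
        # Each entity is assumed to be a (label, surface_text) pair.
        entities_by_class.update(label for label, _ in doc["entities"])
        languages[doc["lang"]] += 1
        n_relations += len(doc["relations"])
        n_tokens += len(doc["tokens"])
    return {
        "entities_by_class": dict(entities_by_class),
        "relations": n_relations,
        "tokens": n_tokens,
        "languages": dict(languages),
    }
```

Returning plain dicts keeps the result easy to dump to JSON or print in a test report.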
|
||
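For the k-fold cross evaluation mentioned under "Next steps", a dependency-free split might look like the sketch below. The helper name `k_fold_splits` is hypothetical, and the round-robin fold assignment is just one simple choice; a library splitter could be swapped in instead.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    items = list(items)
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    # Round-robin assignment: item i goes to fold i % k.
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [item
                 for j, fold in enumerate(folds) if j != i
                 for item in fold]
        yield train, test


# Illustrative usage with placeholder document identifiers.
for train_docs, test_docs in k_fold_splits(["doc%d" % n for n in range(10)], 5):
    assert not set(train_docs) & set(test_docs)
```

An evaluation test could train on `train_docs`, score on `test_docs`, and write the per-fold results to disk for later inspection.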
## Code Refactoring | ||
|
||
~~* remove obsolete bits from module `process`~~ | ||
* rename `process` -> `pipeline` | ||
* move active learning classes to a separate module | ||
* in the `settings.base_settings` replace absolute paths with use of `pkg_resources`: | ||
* to streamline installation, try to remove local dependencies: | ||
* add `pysuffix` to the codebase => `Utils.pysuffix` (or os) | ||
|
||
* change the `LookupDictionary` in `Utils.FastDict` so that it gets the data directly from the Knowledge Base instead of the static file (**needs tests**) | ||
|
||
pkg_resources.resource_filename('citation_extractor','data/authors.csv') | ||
- put author names into a dictionary, assuring that the keys are unique | ||
- this code uses the new KB, not the one in `citation_extractor.ned` | ||
|
||
flat_author_names = {"%s$$n%i"%(author.get_urn(), i+1):name[1] | ||
for author in kb.get_authors() | ||
for i,name in enumerate(author.get_names()) | ||
if author.get_urn() is not None} | ||
|
||
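To see what the `flat_author_names` comprehension above produces, here is a self-contained sketch. `StubAuthor` is a hypothetical stand-in for the Knowledge Base author objects; the only things taken from the snippet are the `get_urn()` and `get_names()` calls and the use of `name[1]`, from which the `(language, name)` tuple layout is inferred. The URN is illustrative.

```python
class StubAuthor(object):
    """Hypothetical stand-in for a Knowledge Base author record."""

    def __init__(self, urn, names):
        self._urn = urn
        self._names = names  # assumed layout: list of (language, name)

    def get_urn(self):
        return self._urn

    def get_names(self):
        return self._names


def flatten_author_names(authors):
    """Map '<urn>$$n<index>' keys to name strings, skipping URN-less authors."""
    return {
        "%s$$n%i" % (author.get_urn(), i + 1): name[1]
        for author in authors
        if author.get_urn() is not None
        for i, name in enumerate(author.get_names())
    }


authors = [
    StubAuthor("urn:cts:greekLit:tlg0012", [("en", "Homer"), ("la", "Homerus")]),
    StubAuthor(None, [("en", "Anonymous")]),
]
flat_names = flatten_author_names(authors)
```

Because each key combines the URN with a running index, keys stay unique even when an author has several name variants.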
* include training/test data in the `data` directory | ||
* `CRFSuite` instead of `CRF++`: <http://sklearn-crfsuite.readthedocs.org/en/latest/> (and combine with <http://www.nltk.org/api/nltk.classify.html>) | ||
* to try to make the `crfpp_wrap.CRF_Classifier` pickleable: | ||
|
||
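    def __getstate__(self): | ||
        # Store only the logger's name, since logger objects may not be picklable. | ||
        d = self.__dict__.copy() | ||
        if 'logger' in d: | ||
            d['logger'] = d['logger'].name | ||
        return d | ||
    def __setstate__(self, d): | ||
        # Rebuild the logger from its stored name (requires `import logging`). | ||
        if 'logger' in d: | ||
            d['logger'] = logging.getLogger(d['logger']) | ||
        self.__dict__.update(d) | ||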
* move `crfpp_templates` to the `data` directory | ||
* re-organise the logging | ||
|
||
* ~~in `process.preproc_document` replace `guess_language` with `langid` library as it seems way more accurate (!!)~~ | ||
* ~~move active learning classes from `Utils.aph_corpus` to a separate module~~ | ||
~~* remove obsolete bits from module `process`~~ | ||
~~* rename `process` -> `pipeline`~~ | ||
~~* in the `settings.base_settings` replace absolute paths with use of `pkg_resources`:~~ | ||
* ~~include training/test data in the `data` directory~~ | ||
~~* to try to make the `crfpp_wrap.CRF_Classifier` pickleable~~ | ||
|
||
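The `__getstate__`/`__setstate__` idea noted in the list above can be tried in isolation. The sketch below uses a hypothetical `PicklableClassifier` class, not the real `crfpp_wrap.CRF_Classifier`, and assumes the only problematic attribute is the logger. On Python 3.7 and later, loggers can be pickled directly, so this workaround mainly matters for older interpreters such as the 2.7 targeted by the Travis configuration.

```python
import logging
import pickle


class PicklableClassifier(object):
    """Hypothetical class illustrating the logger-by-name pickling trick."""

    def __init__(self, model_name):
        self.model_name = model_name
        self.logger = logging.getLogger("demo.classifier")

    def __getstate__(self):
        # Copy the instance state and replace the logger with its name.
        state = self.__dict__.copy()
        if "logger" in state:
            state["logger"] = state["logger"].name
        return state

    def __setstate__(self, state):
        # Look the logger up again by name when unpickling.
        if "logger" in state:
            state["logger"] = logging.getLogger(state["logger"])
        self.__dict__.update(state)


restored = pickle.loads(pickle.dumps(PicklableClassifier("crf")))
```

A round trip like this is also a natural basis for the "can be pickled" test mentioned in the Testing section.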
### Refactoring CitationParser | ||
|
||
* ~create a new module `ned.py` and move here:~ | ||
~- `CitationMatcher` (now in `citation_parser`)~ | ||
~- `KnowledgeBase` (now in `citation_parser`)~ | ||
~- in the longer term, also move the `CitationParser` and the `antlr` grammar files~ | ||
|
||
## Testing | ||
|
||
* use py.test [docs](http://pytest.org/latest/pytest.pdf) | ||
* what to test | ||
* creating and running a citation extractor | ||
* test whether the `citation_extractor` can be pickled | ||
|
||
* write tests for: | ||
* ~~creating and running a citation extractor~~ | ||
* ~~test whether the `citation_extractor` can be pickled~~ | ||
* use of the several classifiers (not only CRF) i.e. scikitlearnadapter | ||
* test that the ActiveLearner still works | ||
* ~~use py.test [docs](http://pytest.org/latest/pytest.pdf)~~ | ||
|
||
|
This file was deleted.
This file was deleted.