Commit
Merge pull request #16 from mromanello/v1.4.x
V1.4.x
mromanello authored Jun 28, 2017
2 parents c147389 + 784c08a commit a53b9c7
Showing 37,525 changed files with 956,286 additions and 2,472 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
3 changes: 3 additions & 0 deletions .codecov.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
coverage:
  ignore:
    - "citation_extractor/settings/"
23 changes: 23 additions & 0 deletions .gitignore
@@ -0,0 +1,23 @@
# MacOS Specific
.DS_Store

# Python
__pycache__/
*.py[cod]

# pytest
.cache

# venv
.env
env/

# Vim tmp files
*~

# project specific
citation_extractor/data/pickles/

# pypi
build/
dist/
35 changes: 35 additions & 0 deletions .travis.yml
@@ -0,0 +1,35 @@
env:
- TREETAGGER_HOME=/home/$USER/tree-tagger/cmd/
language: python
python:
- "2.7"
# command to install dependencies
before_install:
- sudo apt-get update --fix-missing
- sudo apt-get install gfortran libopenblas-dev liblapack-dev
- sudo apt-get remove automake
install:
- ./install_treetagger.sh
- sudo -H ./install_dependencies.sh
- sudo chmod 777 -R crfpp
- cd crfpp/
- export C_INCLUDE_PATH=/usr/local/include/:${C_INCLUDE_PATH}
- export CPLUS_INCLUDE_PATH=/usr/local/include/:${CPLUS_INCLUDE_PATH}
- pip install -e python
- cd
- git clone https://github.com/mromanello/hucit_kb.git
- cd hucit_kb
- pip install -r requirements.txt
- pip install .
- sudo -H ./install_3stores.sh
- pip install http://www.antlr3.org/download/Python/antlr_python_runtime-3.1.3.tar.gz https://github.com/mromanello/pyCTS/archive/master.zip citation_parser
- cd $TRAVIS_BUILD_DIR
- pip install -e lib/
- pip install -r requirements.txt
- pip install -r requirements_dev.txt
- pip install .
# command to run tests
script: pytest -vv --cov=citation_extractor --ignore=tests/test_eval.py
#script: travis_wait 60 pytest -s -vv --cov=citation_extractor
after_success:
- codecov
52 changes: 0 additions & 52 deletions INSTALL.md

This file was deleted.

96 changes: 16 additions & 80 deletions NOTES.md
100644 → 100755
@@ -2,18 +2,15 @@ where I left: try to provide the module with minimum data and directory structur

## Tests

* use `py.test` to run the tests
* combine standalone tests and doctests, depending on the context
* [testing good practices](http://pytest.org/latest/goodpractises.html)
* <https://pytest.org/latest/getting-started.html>

## Distributing the package

* see <http://pythonhosted.org/setuptools/setuptools.html>

## Installation problems:

to install SciPy on Ubuntu one needs:

    sudo apt-get install gfortran libopenblas-dev liblapack-dev

@@ -27,79 +24,18 @@ then SciPy, then scikit-learn

class= (scope_pos | scope_neg)

    def prepare_for_training(doc_id, basedir):
        """
        result = [
            [
                {
                    "arg1_entity": "AAUTHOR",
                    "arg2_entity": "REFSCOPE",
                    "concent": "AAUTHORREFSCOPE",
                },
                'scope_pos'
            ],
            [
                {
                    "arg1_entity": "REFSCOPE",
                    "arg2_entity": "AAUTHOR",
                    "concent": "REFSCOPEAAUTHOR",
                },
                'scope_neg'
            ]
        ]
        """
        instances = []
        entities, relations = read_ann_file(doc_id, basedir)
        for arg1, arg2 in relations:
            # fulltext is assumed to be in scope (e.g. read alongside the .ann file)
            instances.append((extract_relation_features(arg1, arg2, entities, fulltext), 'scope_pos'))
            instances.append((extract_relation_features(arg2, arg1, entities, fulltext), 'scope_neg'))
        return instances

    def extract_relation_features(arg1, arg2, entities, fulltext):
        """
        the following features should be extracted:
            Arg1_entity: AAUTHOR
            Arg2_entity: REFSCOPE
            ConcEnt: AAUTHORREFSCOPE
            WordsBtw: 0
            EntBtw: 0
            Thuc.=True (bow_arg1)
            1.8=True (bow_arg2)
            word_before_arg1
            word_after_arg1
            word_before_arg2
            word_after_arg2
        """
        pass

    class relation_extractor:
        def __init__(self, classifier, train_dirs):
            """
            todo
            """
            doc_ids = [(file.replace(".ann", ""), dir) for dir in train_dirs
                       for file in glob.glob("%s*.ann" % dir)]
            training_instances = [prepare_for_training(doc_id, base_dir)
                                  for doc_id, base_dir in doc_ids]
            self.classifier = classifier
            self.classifier.train(training_instances)

        def extract(self, entities, fulltext):
            """
            todo
            """
            relations = []
            for arg1, arg2 in itertools.combinations(entities, 2):
                feature_set = extract_relation_features(arg1, arg2, entities, fulltext)
                label = self.classifier.classify(feature_set)
                if label == "scope_pos":
                    relations.append((arg1, arg2, label))
            return relations

* when detecting relations it is necessary to compare all pairs of entities
* to find all unique pairs (combinations) in a list with python:

    import itertools
    my_list = [1, 2, 3, 4]
    for p in itertools.combinations(my_list, 2):
        print p

## Notes to improve the Named Entity Disambiguation

### Code

* improve the logging
* test that the code can be parallelised

### Logic

* instead of disambiguating relations first and then entities
* try to do that by following the sequence of the document
* get all the annotations for a given document, ordered as they appear...
* ... then proceed to disambiguate each annotation, using the annotation type to call appropriate function/method
* this way, neighbouring entity mentions can be used to help with the disambiguation of relations
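
A minimal sketch of this document-order strategy (the two `disambiguate_*` helpers and the toy lexicon below are hypothetical placeholders, not the package's actual API):

```python
def disambiguate_entity(surface, context):
    # hypothetical stand-in: look the surface form up in a toy lexicon
    lexicon = {"Thuc.": "urn:cts:greekLit:tlg0003"}
    return lexicon.get(surface)

def disambiguate_relation(scope, context):
    # hypothetical stand-in: attach the scope to the most recent
    # successfully disambiguated entity mention
    for surface, urn in reversed(context):
        if urn is not None:
            return "%s:%s" % (urn, scope)
    return None

def disambiguate_document(annotations):
    """Process annotations in document order, so that earlier decisions
    (accumulated in `context`) can inform later ones."""
    context, results = [], []
    for ann_type, surface in annotations:
        if ann_type == "entity":
            urn = disambiguate_entity(surface, context)
        else:
            urn = disambiguate_relation(surface, context)
        context.append((surface, urn))
        results.append(urn)
    return results

print(disambiguate_document([("entity", "Thuc."), ("scope", "1.8")]))
# → ['urn:cts:greekLit:tlg0003', 'urn:cts:greekLit:tlg0003:1.8']
```

Here the relation ("1.8") is resolved using the neighbouring entity mention, which is exactly the benefit the sequential approach is meant to bring.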

2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
author: Matteo Romanello, <[email protected]>

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.35470.svg)](https://doi.org/10.5281/zenodo.35470)
[![Build Status](https://travis-ci.org/mromanello/CitationExtractor.svg?branch=master)](https://travis-ci.org/mromanello/CitationExtractor)
[![codecov](https://codecov.io/gh/mromanello/CitationExtractor/branch/master/graph/badge.svg)](https://codecov.io/gh/mromanello/CitationExtractor)

70 changes: 45 additions & 25 deletions TODO.md
100644 → 100755
@@ -1,38 +1,58 @@
* re-organise the logging
* in `process.preproc_document` replace `guess_language` with `langid` library as it seems way more accurate (!!)
* https://docs.python.org/2/library/pkgutil.html#pkgutil.get_data
* `get_resource_filename` and `resource_isdir()`
## Next steps

* create evaluation `py.tests` for NER, RelEx and (as soon as possible) NED
- k-fold cross-validation
- this way evaluations can be run every time e.g. a feature extraction function is changed/introduced
- write results to disk so that they can be inspected e.g. via brat
- for RelEx: compare rule-based and ML-based extraction
* create some stats about the training/test corpus
- number of entities by class
- number of relations
- number of tokens
- language distribution of documents
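
The fold-splitting part of such a k-fold evaluation can be sketched with the standard library alone (this is a minimal illustration; wiring it up to the actual NER/RelEx scorers is left out):

```python
import random

def k_fold_splits(doc_ids, k=5, seed=42):
    """Shuffle the document ids and yield (train, test) pairs, one per fold."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        yield train, test

# every document ends up in exactly one test fold
docs = ["doc%d" % n for n in range(10)]
tested = [d for _, test in k_fold_splits(docs, k=5) for d in test]
assert sorted(tested) == sorted(docs)
```

Seeding the shuffle keeps the folds reproducible across runs, which matters when comparing feature-extraction changes.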

## Code Refactoring

* ~~remove obsolete bits from module `process`~~
* rename `process` -> `pipeline`
* move active learning classes to a separate module
* in the `settings.base_settings` replace absolute paths with use of `pkg_resources`:
* to streamline installation, try to remove local dependencies:
* add `pysuffix` to the codebase => `Utils.pysuffix` (or os)

* change the `LookupDictionary` in `Utils.FastDict` so that it gets the data directly from the Knowledge Base instead of the static file (**needs tests**)

    pkg_resources.resource_filename('citation_extractor', 'data/authors.csv')
- put author names into a dictionary, assuring that the keys are unique
- this code uses the new KB, not the one in `citation_extractor.ned`

    flat_author_names = {"%s$$n%i" % (author.get_urn(), i+1): name[1]
                         for author in kb.get_authors()
                         for i, name in enumerate(author.get_names())
                         if author.get_urn() is not None}

* include training/test data in the `data` directory
* `CRFSuite` instead of `CRF++`: <http://sklearn-crfsuite.readthedocs.org/en/latest/> (and combine with <http://www.nltk.org/api/nltk.classify.html>)
* to try to make the `crfpp_wrap.CRF_Classifier` pickleable:

    def __getstate__(self):
        d = self.__dict__.copy()
        if 'logger' in d.keys():
            d['logger'] = d['logger'].name
        return d

    def __setstate__(self, d):
        if 'logger' in d.keys():
            d['logger'] = logging.getLogger(d['logger'])
        self.__dict__.update(d)
* move `crfpp_templates` to the `data` directory
* re-organise the logging

* ~~in `process.preproc_document` replace `guess_language` with `langid` library as it seems way more accurate (!!)~~
* ~~move active learning classes from `Utils.aph_corpus` to a separate module~~
* ~~remove obsolete bits from module `process`~~
* ~~rename `process` -> `pipeline`~~
* ~~in the `settings.base_settings` replace absolute paths with use of `pkg_resources`:~~
* ~~include training/test data in the `data` directory~~
* ~~to try to make the `crfpp_wrap.CRF_Classifier` pickleable~~
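
A self-contained illustration of the `__getstate__`/`__setstate__` approach sketched earlier in this TODO (the `Classifier` class here is a made-up stand-in, not the actual `crfpp_wrap.CRF_Classifier`):

```python
import logging
import pickle

class Classifier(object):
    """Stand-in for a classifier that holds a logger attribute."""
    def __init__(self, name):
        self.name = name
        self.logger = logging.getLogger(name)

    def __getstate__(self):
        d = self.__dict__.copy()
        if 'logger' in d:
            d['logger'] = d['logger'].name  # store only the logger's name
        return d

    def __setstate__(self, d):
        if 'logger' in d:
            d['logger'] = logging.getLogger(d['logger'])  # re-attach by name
        self.__dict__.update(d)

clf = pickle.loads(pickle.dumps(Classifier("crf")))
assert clf.logger.name == "crf"
```

The round-trip works because only the logger's name is serialized and the live logger object is looked up again on unpickling.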

### Refactoring CitationParser

* ~~create a new module `ned.py` and move here:~~
- ~~`CitationMatcher` (now in `citation_parser`)~~
- ~~`KnowledgeBase` (now in `citation_parser`)~~
- ~~in the longer-term move also the `CitationParser` and the `antlr` grammar files~~

## Testing

* use py.test [doku](http://pytest.org/latest/pytest.pdf)
* what to test
* creating and running a citation extractor
* test whether the `citation_extractor` can be pickled

* write tests for:
* ~~creating and running a citation extractor~~
* ~~test whether the `citation_extractor` can be pickled~~
* use of several classifiers (not only CRF), i.e. the `scikitlearnadapter`
* test that the ActiveLearner still works
* ~~use py.test [doku](http://pytest.org/latest/pytest.pdf)~~


16 changes: 0 additions & 16 deletions citation_extractor/Tests/test_FeatureExtractor.py

This file was deleted.

17 changes: 0 additions & 17 deletions citation_extractor/Tests/test_jstor.py

This file was deleted.

