Commit
Merge pull request #16 from mromanello/v1.4.x
V1.4.x
mromanello authored Jun 28, 2017
2 parents c147389 + 784c08a commit a53b9c7
Showing 37,525 changed files with 956,286 additions and 2,472 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
3 changes: 3 additions & 0 deletions .codecov.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
coverage:
  ignore:
    - "citation_extractor/settings/"
23 changes: 23 additions & 0 deletions .gitignore
@@ -0,0 +1,23 @@
# MacOS Specific
.DS_Store

# Python
__pycache__/
*.py[cod]

# pytest
.cache

# venv
.env
env/

# Vim tmp files
*~

# project specific
citation_extractor/data/pickles/

# pypi
build/
dist/
35 changes: 35 additions & 0 deletions .travis.yml
@@ -0,0 +1,35 @@
env:
- TREETAGGER_HOME=/home/$USER/tree-tagger/cmd/
language: python
python:
- "2.7"
# command to install dependencies
before_install:
- sudo apt-get update --fix-missing
- sudo apt-get install gfortran libopenblas-dev liblapack-dev
- sudo apt-get remove automake
install:
- ./install_treetagger.sh
- sudo -H ./install_dependencies.sh
- sudo chmod 777 -R crfpp
- cd crfpp/
- export C_INCLUDE_PATH=/usr/local/include/:${C_INCLUDE_PATH}
- export CPLUS_INCLUDE_PATH=/usr/local/include/:${CPLUS_INCLUDE_PATH}
- pip install -e python
- cd
- git clone https://github.com/mromanello/hucit_kb.git
- cd hucit_kb
- pip install -r requirements.txt
- pip install .
- sudo -H ./install_3stores.sh
- pip install http://www.antlr3.org/download/Python/antlr_python_runtime-3.1.3.tar.gz https://github.com/mromanello/pyCTS/archive/master.zip citation_parser
- cd $TRAVIS_BUILD_DIR
- pip install -e lib/
- pip install -r requirements.txt
- pip install -r requirements_dev.txt
- pip install .
# command to run tests
script: pytest -vv --cov=citation_extractor --ignore=tests/test_eval.py
#script: travis_wait 60 pytest -s -vv --cov=citation_extractor
after_success:
- codecov
52 changes: 0 additions & 52 deletions INSTALL.md

This file was deleted.

96 changes: 16 additions & 80 deletions NOTES.md
100644 → 100755
@@ -2,18 +2,15 @@ where I left: try to provide the module with minimum data and directory structur

## Tests

* use `py.test` to run the tests
* combine standalone tests and doctests, depending on the context
* [testing good practices](http://pytest.org/latest/goodpractises.html)
* <https://pytest.org/latest/getting-started.html>

## Distributing the package

* see <http://pythonhosted.org/setuptools/setuptools.html>

## Installation problems:

to install SciPy on Ubuntu one needs:

    sudo apt-get install gfortran libopenblas-dev liblapack-dev

@@ -27,79 +24,18 @@ then SciPy, then scikit-learn

class= (scope_pos | scope_neg)

    def prepare_for_training(doc_id, basedir):
        """
        result = [
            [
                {
                    "arg1_entity": "AAUTHOR",
                    "arg2_entity": "REFSCOPE",
                    "concent": "AAUTHORREFSCOPE",
                },
                'scope_pos'
            ],
            [
                {
                    "arg1_entity": "REFSCOPE",
                    "arg2_entity": "AAUTHOR",
                    "concent": "REFSCOPEAAUTHOR",
                },
                'scope_neg'
            ]
        ]
        """
        instances = []
        entities, relations = read_ann_file(doc_id, basedir)
        for arg1, arg2 in relations:
            # fulltext is assumed to be in scope (e.g. read alongside the .ann file)
            instances.append((extract_relation_features(arg1, arg2, entities, fulltext), 'scope_pos'))
            instances.append((extract_relation_features(arg2, arg1, entities, fulltext), 'scope_neg'))
        return instances

    def extract_relation_features(arg1, arg2, entities, fulltext):
        """
        the following features should be extracted:
            Arg1_entity: AAUTHOR
            Arg2_entity: REFSCOPE
            ConcEnt: AAUTHORREFSCOPE
            WordsBtw: 0
            EntBtw: 0
            Thuc.=True (bow_arg1)
            1.8=True (bow_arg2)
            word_before_arg1
            word_after_arg1
            word_before_arg2
            word_after_arg2
        """
        pass

    class relation_extractor:
        def __init__(self, classifier, train_dirs):
            """
            todo
            """
            doc_ids = [(file.replace(".ann", ""), dir) for dir in train_dirs
                       for file in glob.glob("%s*.ann" % dir)]
            training_instances = [prepare_for_training(doc_id, base_dir)
                                  for doc_id, base_dir in doc_ids]
            self.classifier = classifier
            self.classifier.train(training_instances)

        def extract(self, entities, fulltext):
            """
            todo
            """
            relations = []
            for arg1, arg2 in itertools.combinations(entities, 2):
                feature_set = extract_relation_features(arg1, arg2, entities, fulltext)
                label = self.classifier.classify(feature_set)
                if label == "scope_pos":
                    relations.append((arg1, arg2, label))
            return relations

* when detecting relations it is necessary to compare all pairs of entities
* to find all unique pairs (combinations) in a list with python:

    import itertools
    my_list = [1, 2, 3, 4]
    for p in itertools.combinations(my_list, 2):
        print p

## Notes to improve the Named Entity Disambiguation

### Code

* improve the logging
* test that the code can be parallelised

### Logic

* instead of disambiguating relations first and then entities
* try to do that by following the sequence of the document
* get all the annotations for a given document, ordered as they appear...
* ... then proceed to disambiguate each annotation, using the annotation type to call appropriate function/method
* this way, neighbouring entity mentions can be used to help with the disambiguation of relations
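
A minimal sketch of this document-order strategy (the two `disambiguate_*` helpers and the toy lexicon below are hypothetical placeholders, not the package's actual API):

```python
def disambiguate_entity(surface, context):
    # hypothetical stand-in: look the surface form up in a toy lexicon
    lexicon = {"Thuc.": "urn:cts:greekLit:tlg0003"}
    return lexicon.get(surface)

def disambiguate_relation(scope, context):
    # hypothetical stand-in: attach the scope to the most recent
    # successfully disambiguated entity mention
    for surface, urn in reversed(context):
        if urn is not None:
            return "%s:%s" % (urn, scope)
    return None

def disambiguate_document(annotations):
    """Process annotations in document order, so that earlier decisions
    (accumulated in `context`) can inform later ones."""
    context, results = [], []
    for ann_type, surface in annotations:
        if ann_type == "entity":
            urn = disambiguate_entity(surface, context)
        else:
            urn = disambiguate_relation(surface, context)
        context.append((surface, urn))
        results.append(urn)
    return results

print(disambiguate_document([("entity", "Thuc."), ("scope", "1.8")]))
# → ['urn:cts:greekLit:tlg0003', 'urn:cts:greekLit:tlg0003:1.8']
```

Here the relation ("1.8") is resolved using the neighbouring entity mention, which is exactly the benefit the sequential approach is meant to bring.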

2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
author: Matteo Romanello, <[email protected]>

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.35470.svg)](https://doi.org/10.5281/zenodo.35470)
[![Build Status](https://travis-ci.org/mromanello/CitationExtractor.svg?branch=master)](https://travis-ci.org/mromanello/CitationExtractor)
[![codecov](https://codecov.io/gh/mromanello/CitationExtractor/branch/master/graph/badge.svg)](https://codecov.io/gh/mromanello/CitationExtractor)

70 changes: 45 additions & 25 deletions TODO.md
100644 → 100755
@@ -1,38 +1,58 @@
* re-organise the logging
* in `process.preproc_document` replace `guess_language` with `langid` library as it seems way more accurate (!!)
* https://docs.python.org/2/library/pkgutil.html#pkgutil.get_data
* `get_resource_filename` and `resource_isdir()`
## Next steps

* create evaluation `py.tests` for NER, RelEx and (as soon as possible) NED
- k-fold cross-validation
- this way evaluations can be run every time e.g. a feature extraction function is changed/introduced
- write results to disk so that they can be inspected e.g. via brat
- for RelEx: compare rule-based and ML-based extraction
* create some stats about the training/test corpus
- number of entities by class
- number of relations
- number of tokens
- language distribution of documents
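
The fold-splitting part of such a k-fold evaluation can be sketched with the standard library alone (this is a minimal illustration; wiring it up to the actual NER/RelEx scorers is left out):

```python
import random

def k_fold_splits(doc_ids, k=5, seed=42):
    """Shuffle the document ids and yield (train, test) pairs, one per fold."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        yield train, test

# every document ends up in exactly one test fold
docs = ["doc%d" % n for n in range(10)]
tested = [d for _, test in k_fold_splits(docs, k=5) for d in test]
assert sorted(tested) == sorted(docs)
```

Seeding the shuffle keeps the folds reproducible across runs, which matters when comparing feature-extraction changes.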

## Code Refactoring

* ~~remove obsolete bits from module `process`~~
* rename `process` -> `pipeline`
* move active learning classes to a separate module
* in the `settings.base_settings` replace absolute paths with use of `pkg_resources`:
* to streamline installation, try to remove local dependencies:
* add `pysuffix` to the codebase => `Utils.pysuffix` (or os)

* change the `LookupDictionary` in `Utils.FastDict` so that it gets the data directly from the Knowledge Base instead of the static file (**needs tests**)

    pkg_resources.resource_filename('citation_extractor', 'data/authors.csv')
- put author names into a dictionary, assuring that the keys are unique
- this code uses the new KB, not the one in `citation_extractor.ned`

    flat_author_names = {"%s$$n%i" % (author.get_urn(), i+1): name[1]
                         for author in kb.get_authors()
                         for i, name in enumerate(author.get_names())
                         if author.get_urn() is not None}

* include training/test data in the `data` directory
* `CRFSuite` instead of `CRF++`: <http://sklearn-crfsuite.readthedocs.org/en/latest/> (and combine with <http://www.nltk.org/api/nltk.classify.html>)
* to try to make the `crfpp_wrap.CRF_Classifier` pickleable:

    def __getstate__(self):
        d = self.__dict__.copy()
        if 'logger' in d.keys():
            d['logger'] = d['logger'].name
        return d

    def __setstate__(self, d):
        if 'logger' in d.keys():
            d['logger'] = logging.getLogger(d['logger'])
        self.__dict__.update(d)
* move `crfpp_templates` to the `data` directory
* re-organise the logging

* ~~in `process.preproc_document` replace `guess_language` with `langid` library as it seems way more accurate (!!)~~
* ~~move active learning classes from `Utils.aph_corpus` to a separate module~~
* ~~remove obsolete bits from module `process`~~
* ~~rename `process` -> `pipeline`~~
* ~~in the `settings.base_settings` replace absolute paths with use of `pkg_resources`:~~
* ~~include training/test data in the `data` directory~~
* ~~to try to make the `crfpp_wrap.CRF_Classifier` pickleable~~
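
A self-contained illustration of the `__getstate__`/`__setstate__` approach sketched earlier in this TODO (the `Classifier` class here is a made-up stand-in, not the actual `crfpp_wrap.CRF_Classifier`):

```python
import logging
import pickle

class Classifier(object):
    """Stand-in for a classifier that holds a logger attribute."""
    def __init__(self, name):
        self.name = name
        self.logger = logging.getLogger(name)

    def __getstate__(self):
        d = self.__dict__.copy()
        if 'logger' in d:
            d['logger'] = d['logger'].name  # store only the logger's name
        return d

    def __setstate__(self, d):
        if 'logger' in d:
            d['logger'] = logging.getLogger(d['logger'])  # re-attach by name
        self.__dict__.update(d)

clf = pickle.loads(pickle.dumps(Classifier("crf")))
assert clf.logger.name == "crf"
```

The round-trip works because only the logger's name is serialized and the live logger object is looked up again on unpickling.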

### Refactoring CitationParser

* ~~create a new module `ned.py` and move here:~~
- ~~`CitationMatcher` (now in `citation_parser`)~~
- ~~`KnowledgeBase` (now in `citation_parser`)~~
- ~~in the longer-term move also the `CitationParser` and the `antlr` grammar files~~

## Testing

* use py.test [doku](http://pytest.org/latest/pytest.pdf)
* what to test
* creating and running a citation extractor
* test whether the `citation_extractor` can be pickled

* write tests for:
* ~~creating and running a citation extractor~~
* ~~test whether the `citation_extractor` can be pickled~~
* use of several classifiers (not only CRF), i.e. the `scikitlearnadapter`
* test that the ActiveLearner still works
* ~~use py.test [doku](http://pytest.org/latest/pytest.pdf)~~


16 changes: 0 additions & 16 deletions citation_extractor/Tests/test_FeatureExtractor.py

This file was deleted.

17 changes: 0 additions & 17 deletions citation_extractor/Tests/test_jstor.py

This file was deleted.

