Natural language processing support for Pandas dataframes.
Text Extensions for Pandas turns Pandas DataFrames into a universal data structure for representing intermediate data in all phases of your NLP application development workflow.
Web site: https://ibm.biz/text-extensions-for-pandas
API docs: https://text-extensions-for-pandas.readthedocs.io/
- Connect features with regions of a document
- Visualize the internal data of your NLP application
- Analyze the accuracy of your models
- Combine the results of multiple models
- Represent BERT embeddings in a Pandas series
- Store logits and other feature vectors in a Pandas series
- Store an entire time series in each cell of a Pandas series
- SpaCy
- Transformers
- IBM Watson Natural Language Understanding
- IBM Watson Discovery Table Understanding
Looking for the model training code from our CoNLL-2020 paper, "Identifying Incorrect Labels in the CoNLL-2003 Corpus"? See the notebooks in this directory.
The associated data set is here.
This library requires Python 3.7+, Pandas, and Numpy.
To install the latest release, just run:
pip install text-extensions-for-pandas
Depending on your use case, you may also need the following additional packages:
- spacy (for SpaCy support)
- transformers (for transformer-based embeddings and BERT tokenization)
- ibm_watson (for IBM Watson support)
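To see which of the optional packages above are missing from your environment, something like the following sketch may help. It uses only the Python standard library. The helper name find_missing_optional_packages is hypothetical and not part of this library, and the sketch assumes each package's import name matches the name listed above.

```python
import importlib.util

# Optional packages listed above, keyed by import name (assumed to match the
# names given in this README), with the feature each one enables.
OPTIONAL_PACKAGES = {
    "spacy": "SpaCy support",
    "transformers": "transformer-based embeddings and BERT tokenization",
    "ibm_watson": "IBM Watson support",
}


def find_missing_optional_packages(packages=OPTIONAL_PACKAGES):
    """Return the entries of `packages` that cannot be located for import.

    Hypothetical helper for illustration; not part of text_extensions_for_pandas.
    """
    missing = {}
    for name, purpose in packages.items():
        try:
            # find_spec locates a top-level module without importing it.
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            missing[name] = purpose
    return missing


if __name__ == "__main__":
    for name, purpose in find_missing_optional_packages().items():
        print(f"{name} is not installed (needed for {purpose})")
```

The script prints nothing when all three packages are available.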
Alternatively, you can install the package from conda-forge into a conda environment with:
conda install --channel=conda-forge text_extensions_for_pandas
If you'd like to try out the very latest version of our code, you can install directly from the head of the master branch:
pip install git+https://github.com/CODAIT/text-extensions-for-pandas
You can also directly import our package from your local copy of the text_extensions_for_pandas source tree. Just add the root of your local copy of this repository to the front of sys.path.
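As a minimal sketch of the sys.path approach, the snippet below puts a directory at the front of sys.path so that Python searches it before any installed copy. The function name prepend_to_sys_path is hypothetical, and the usage at the bottom assumes you run the script from the root of your local copy of the repository.

```python
import sys
from pathlib import Path


def prepend_to_sys_path(repo_root):
    """Put `repo_root` at the front of sys.path and return the resolved path.

    Hypothetical helper for illustration; not part of text_extensions_for_pandas.
    """
    root = str(Path(repo_root).expanduser().resolve())
    # Drop any existing entries so the path is not listed twice.
    while root in sys.path:
        sys.path.remove(root)
    sys.path.insert(0, root)
    return root


if __name__ == "__main__":
    # Assumption: the current working directory is the root of your checkout.
    prepend_to_sys_path(Path.cwd())
    # A later `import text_extensions_for_pandas` would then resolve to the
    # local source tree, provided the library's dependencies are installed.
```

Placing the path at the front rather than appending it matters when a released version is also installed, because Python uses the first match it finds.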
For examples of how to use the library, take a look at the example notebooks in this directory. You can try out these notebooks on Binder by navigating to https://mybinder.org/v2/gh/frreiss/tep-fred/branch-binder?urlpath=lab/tree/notebooks
To run the notebooks on your local machine, follow these steps:
- Install Anaconda or Miniconda.
- Check out a copy of this repository.
- Use the script env.sh to set up an Anaconda environment for running the code in this repository.
- Type jupyter lab from the root of your local source tree to start a JupyterLab environment.
- Navigate to the notebooks directory and choose any of the notebooks there.
API documentation can be found at https://text-extensions-for-pandas.readthedocs.io/en/latest/
- text_extensions_for_pandas: Source code for the text_extensions_for_pandas module.
- env.sh: Script to create a conda environment named pd capable of running the notebooks and test cases in this project
- generate_docs.sh: Script to build the API documentation
- api_docs: Configuration files for generate_docs.sh
- binder: Configuration files for running notebooks on Binder
- config: Configuration files for env.sh
- docs: Project web site
- notebooks: example notebooks
- resources: various input files used by our example notebooks
- test_data: data files for regression tests. The tests themselves are located adjacent to the library code files.
- tutorials: Detailed tutorials on using Text Extensions for Pandas to cover complex end-to-end NLP use cases (work in progress).
This project is an IBM open source project. We are developing the code in the open under the Apache License, and we welcome contributions from both inside and outside IBM.
To contribute, just open a GitHub issue or submit a pull request. Be sure to include a copy of the Developer's Certificate of Origin 1.1 along with your pull request.
Before building the code in this repository, we recommend that you use the provided script env.sh to set up a consistent build environment:
$ ./env.sh --env_name myenv
$ conda activate myenv
(replace myenv with your choice of environment name).
To run tests, navigate to the root of your local copy and run:
pytest text_extensions_for_pandas
To build pip and source code packages:
python setup.py sdist bdist_wheel
(outputs go into ./dist).
To build API documentation, run:
./generate_docs.sh