o p t i m u s

A text processing pipeline for turning unstructured text data into hierarchical datasets.

What does Optimus do?

The Data Science Campus has been exploring how to process unlabelled list data that is collected manually in an uncontrolled fashion with no supplementary information to allow aggregation of data. Please note that this project is intended to work on short descriptions, of no more than around 10 words. For longer text descriptions you may need to fork the repository and optimise some of the metrics.
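As a rough illustration of the length guidance above, the following sketch separates short descriptions from longer ones before they are passed to the pipeline. The helper name split_by_length and the hard limit of 10 words are assumptions made for this example; Optimus itself does not provide this function.

```python
def split_by_length(descriptions, max_words=10):
    """Separate short descriptions from ones that exceed max_words."""
    short, long = [], []
    for text in descriptions:
        # str.split() with no arguments splits on any run of whitespace.
        target = short if len(text.split()) <= max_words else long
        target.append(text)
    return short, long


short, long = split_by_length([
    "red apples 1kg",
    "assorted fruit and vegetables delivered weekly to the main site kitchen stores",
])
print(len(short), "short,", len(long), "long")
```

Entries that land in the long list are the ones for which the metrics may need tuning.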

For further information on the methodology please read our blog.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Documentation on the methods utilised and how Optimus functions is pending. This README will be updated to include links to this material once it is made available.

Prerequisites

You will need the following tools in order to be able to set up and use optimus:

  • A modern macOS or Linux installation (Windows is not supported; you are on your own if you try it there)
  • curl
  • zsh
  • Python 3.6 or later
  • git

First, clone this git repository:

git clone https://github.com/datasciencecampus/optimus.git

Within the repo is a file named setup.zsh, a command line tool that installs everything else you need. For help using it, invoke the script as:

. setup.zsh -h

This script downloads the fastText Wikipedia word embeddings model and places it in the optimus directory. If your project lives elsewhere and you are not working in the optimus directory directly, we recommend using this script to download the model and then moving it into your working directory.

Quick Start example

A quick start example script called example.py, in the root directory, demonstrates how to use the pipeline. The final dataset is written to optimus_results.csv, also in the root directory.

A graphical UI for running Optimus

To make the tool more accessible, a web app based UI was developed. This user interface lets you process data without writing any Python code.

If this is something that interests you please read this README.md file for more info.

How to use the python module

Importing Optimus

Import Optimus into Python either as the whole module

  • import optimus

or by importing the Optimus class

  • from optimus import Optimus

Customise settings for Optimus

The pipeline is configured with a config.json file in the following format:

  {
    "data":"location/to/data.csv",
    "model":"location/to/wiki.en.bin"
    ...
  }

After creating a config.json file, the location can be passed when creating an instance of Optimus:

o = Optimus(config_path='path/to/config.json', ...)
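A config file like the one above can also be generated from Python. The sketch below writes only the two keys shown in this README ("data" and "model") with placeholder paths; the helper name write_config is an assumption for illustration, and any further keys your setup needs are not covered here.

```python
import json
import tempfile
from pathlib import Path


def write_config(path, data, model):
    """Write a minimal config file containing only the two documented keys."""
    settings = {"data": str(data), "model": str(model)}
    Path(path).write_text(json.dumps(settings, indent=2))
    return settings


# A temporary directory keeps the sketch from overwriting a real config.json.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    written = write_config(config_path, "location/to/data.csv", "location/to/wiki.en.bin")
    # Reading the file back confirms that it parses as valid JSON.
    loaded = json.loads(config_path.read_text())

print(loaded)
```

In practice you would write the file somewhere permanent and pass that location as config_path.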

Further settings can be added on an ad hoc basis and will overwrite any previous settings. To do so, pass valid keyword arguments to the Optimus class on construction, like so:

o = Optimus(
      config_path='path/to/config.json',
      data="path/to/new_data.csv",
      cutoff=6,
      ...
  )

Optimus has a default settings file to fall back on if none of this is provided. However, relying on the default settings alone might cause issues, mainly because the paths to the data and models specified in the default settings are unlikely to be correct on your machine.

The file etc/config.json stores the default arguments used by Optimus. Please do not edit this file.

Shortened reference:

  1. obj = Optimus() -> Uses default settings
  2. obj = Optimus(config_path='path/to/user/config.json') -> Uses custom config file
  3. obj = Optimus(distance=10, stepsize=2, cutoff=16, ...) -> Replaces specific parameter values defined in the config file
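The precedence described in this section (defaults first, then the user config file, then keyword arguments) can be pictured with the sketch below. It is not the Optimus implementation: resolve_settings is a hypothetical helper and the default values are invented for illustration, not taken from etc/config.json.

```python
import json
import tempfile
from pathlib import Path


def resolve_settings(defaults, config_path=None, **overrides):
    """Merge settings so that later sources win: defaults, config file, keyword arguments."""
    settings = dict(defaults)
    if config_path is not None:
        settings.update(json.loads(Path(config_path).read_text()))
    settings.update(overrides)
    return settings


# Illustrative default values only; they are not the contents of etc/config.json.
defaults = {"data": "default.csv", "cutoff": 10, "stepsize": 1}

with tempfile.TemporaryDirectory() as tmp:
    user_config = Path(tmp) / "config.json"
    user_config.write_text(json.dumps({"data": "mine.csv", "cutoff": 8}))
    resolved = resolve_settings(defaults, config_path=user_config, cutoff=6)

print(resolved)
```

Here cutoff comes from the keyword argument, data from the user config file, and stepsize from the defaults.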

Running the code & getting outputs

Optimus takes pandas.core.series.Series objects as input. To run a configured Optimus object on a series, call the object with the series as its argument. For example, for a pandas series called text:

from optimus import Optimus

# text is an existing pandas Series of short descriptions
o = Optimus()
results = o(text)

NOTE: If no data is passed to the Optimus object, the data defined in the config file will be used.

Additional arguments to Optimus:
  • save_csv: an optional keyword argument. Setting save_csv=True forces Optimus to save the output DataFrame, which includes the labels from every iteration, to labelled.csv in the working directory.

  • full: if you only need a DataFrame returned rather than saved, set full=True to receive the DataFrame containing the mapped labels.

  • verbose: a boolean that controls how much is printed to the console as the code runs. Some output is still printed when verbose=False so that you can follow the progress of the processing.

Managing Memory

The fastText model is large and requires a sizeable amount of RAM. Each instance of Optimus loads its own fastText model on the first processing call: it checks whether a model has already been loaded and, if not, performs an ft.load_model() operation. Once it is loaded, subsequent runs on the same instance of Optimus should not reload the model.
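The load-on-first-call behaviour can be sketched as follows. This is a simplified stand-in rather than the Optimus source: LazyModelHolder and fake_loader are hypothetical names, and the fake loader replaces the expensive fastText load so the example runs without a model file.

```python
class LazyModelHolder:
    """Loads a model on the first call and reuses it afterwards."""

    def __init__(self, model_path, loader):
        self.model_path = model_path
        self._loader = loader  # any callable that turns a path into a model
        self._model = None
        self.load_count = 0

    def __call__(self, text):
        if self._model is None:
            # Only the first call pays the cost of loading.
            self._model = self._loader(self.model_path)
            self.load_count += 1
        return self._model(text)


def fake_loader(path):
    """Stand-in for an expensive model load; returns a function that counts words."""
    return lambda text: len(text.split())


holder = LazyModelHolder("path/to/model.bin", fake_loader)
first = holder("red apples")
second = holder("green pears 1kg")
print(holder.load_count)
```

Because the model is cached per instance, creating several instances would load several copies, which is why reusing one instance matters for memory.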

Replacing models and freeing memory

The Optimus object has a replace_model method, which provides a way to control the memory usage of the Optimus object. It lets a user replace the loaded model with a new one, or simply remove the loaded model from the Optimus object.

The method takes a string or a fastText loaded model and assigns it to the Optimus object. If no model parameter is passed, the method will simply delete and garbage collect the existing loaded model.

o = Optimus(*args, **kwargs)  # construct with your usual settings
output = o(some_data)

# Load from a path
o.replace_model('string/path/to/model')

# Provide an already loaded model
o.replace_model(fastText.load_model('string/path/to/model'))

# Delete the existing model in the Optimus object
o.replace_model()
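The three call styles above can be modelled with a small dispatch on the argument type. The sketch below is an assumption about how such a method might be structured, not the actual Optimus code; ModelSlot and its loader argument are hypothetical, and a plain dictionary stands in for a loaded fastText model.

```python
import gc


class ModelSlot:
    """Holds at most one model and supports the three replace_model call styles."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical callable that turns a path into a model
        self.model = None

    def replace_model(self, model=None):
        # Drop the reference to the old model first so its memory can be reclaimed.
        self.model = None
        gc.collect()
        if isinstance(model, str):
            self.model = self._loader(model)  # a string is treated as a path
        elif model is not None:
            self.model = model  # anything else is assumed to be a loaded model


slot = ModelSlot(loader=lambda path: {"loaded_from": path})
slot.replace_model("string/path/to/model")
from_path = slot.model
slot.replace_model({"loaded_from": "memory"})
from_object = slot.model
slot.replace_model()
after_delete = slot.model
```

Releasing the old model before loading from a path means that, in this sketch, at most one model loaded by the slot is held at a time.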

Embedding plot functions

This pipeline comes with an embedding visualiser module. Its functions take a pandas series of text entries and a fastText model, use the model to embed the strings in an n-dimensional space, and then reduce the embeddings to 2 dimensions using t-SNE.

The result is plotted and exported to a fully interactive 'embedding_plot.html' file.

import pandas as pd
from lib.emplot import plot

series = pd.Series(['string1', ..., 'string2'])
plot(series=series,
     model='path/to/model.bin',
     output_path='output_vectors.csv')

Working with large datasets

Ward linkage is computationally expensive. The process needs to calculate a pairwise distance matrix for all of the embedded vectors, and the memory this consumes is of order $n^2$ for $n$ data points. Given that the fastText embedding models are already gigabytes in size, this can become a problem.
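To get a feel for the quadratic growth, the sketch below estimates the memory needed for the pairwise distances alone. It assumes one 8-byte float per unordered pair of points (a condensed distance matrix); the helper name is hypothetical, and the real footprint depends on how the clustering library stores its intermediate results.

```python
def condensed_distance_bytes(n_points, bytes_per_value=8):
    """Bytes needed to store one distance per unordered pair of points."""
    n_pairs = n_points * (n_points - 1) // 2
    return n_pairs * bytes_per_value


for n in (1_000, 10_000, 100_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"{n:>7} points: {gib:.3f} GiB")
```

Under these assumptions, 100,000 points already need roughly 37 GiB for the distances, before the embedding model is counted.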

Where the data starts to push the boundaries of the memory available to the process, we currently recommend sampling your data points, using optimus to categorise the sampled points, and then using (for example) a k-nearest-neighbours (kNN) classifier to 'smear' the generated labels across the nearby points.

Example code to do this is provided in the sampling/ directory. The program performs a simple random sample of the content of your list and then embeds these words before using the approach outlined above to generate labels for the out of sample words. This approach is naive, but can provide a starting point for more complex sampling mechanisms such as the use of apricot.
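The sample-then-propagate idea can be sketched in a few lines. This is not the code in the sampling/ directory: it uses toy 2-D points in place of fastText embeddings, takes the labels for the sample from the cluster each toy point was drawn from rather than from Optimus, and implements a small kNN vote by hand. All names here are hypothetical.

```python
import math
import random
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_label(point, labelled, k=3):
    """Return the majority label among the k labelled points nearest to point."""
    nearest = sorted(labelled, key=lambda item: euclidean(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D vectors stand in for embeddings of short descriptions.
rng = random.Random(0)
fruit = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(20)]
tools = [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(20)]
points = fruit + tools

# Step 1: simple random sample. It is deliberately larger than one cluster
# so that both clusters are always represented in this toy example.
sample_idx = set(rng.sample(range(len(points)), 24))

# Step 2: label the sample. Optimus would do this on real data; here the
# label is simply the cluster the toy point was drawn from.
labelled = [(points[i], "fruit" if i < 20 else "tools") for i in sample_idx]

# Step 3: spread the labels to the out-of-sample points with kNN.
predicted = {
    i: knn_label(points[i], labelled)
    for i in range(len(points))
    if i not in sample_idx
}
print(Counter(predicted.values()))
```

On real data the sample would be a much smaller fraction of the list, and the quality of the propagated labels depends on how well the sample covers the embedding space.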

Authors / Contributors

Data Science Campus - Office for National Statistics

  • Steven Hopkins
  • Gareth Clews
  • Arturas Eidukas
  • Lucy Gwilliam

Department for the Environment, Food and Rural Affairs

  • Tom Hopkinson

License

This project is licensed under the MIT License - see the LICENSE.md file for details

References

Bag of Tricks for Efficient Text Classification

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}
