Fix document count, improve readmes and help messages
mwydmuch committed Nov 29, 2018
1 parent 0a99510 commit 9e3af30
Showing 13 changed files with 280 additions and 159 deletions.
100 changes: 63 additions & 37 deletions README.md
@@ -2,81 +2,107 @@

extremeText is an extension of the [fastText](https://github.com/facebookresearch/fastText) library for multi-label classification, including extreme cases with hundreds of thousands or millions of labels.

extremeText implements:

* Probabilistic Labels Tree (PLT) loss for extreme multi-label classification with top-down hierarchical clustering (k-means) for tree building,
* sigmoid loss for multi-label classification,
* L2 regularization and FOBOS update for all losses,
* ensemble of loss layers with bagging,
* calculation of hidden (document) vector as a weighted average of the word vectors,
* calculation of TF-IDF weights for words.

## Installation

### Building executable

extremeText, like fastText, can be built as an executable using Make (recommended) or CMake:

```
$ git clone https://github.com/mwydmuch/extremeText.git
$ cd extremeText
(optional) $ cmake .
$ make
```

This will produce object files for all the classes as well as the main binary `extremetext`.

### Python package

The easiest way to get extremeText is to use [pip](https://pip.pypa.io/en/stable/):

```
$ pip install extremetext
```

Installing on MacOS may require setting `MACOSX_DEPLOYMENT_TARGET=10.9` first:
```
$ export MACOSX_DEPLOYMENT_TARGET=10.9
$ pip install extremetext
```

The latest version of extremeText can be built from sources using pip or, alternatively, setuptools:

```
$ git clone https://github.com/mwydmuch/extremeText.git
$ cd extremeText
$ pip install .
(or) $ python setup.py install
```

Now you can import this library with:

```
import extremeText
```
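
If you prefer the Python package over the CLI, the snippet below is a minimal sketch; it assumes the binding mirrors fastText's `train_supervised`/`predict` interface and that the new extremeText options (e.g. `loss`, `ensemble`) are exposed as keyword arguments. File names and values are placeholders:

```
import extremeText

# Train a multi-label model with the PLT loss (placeholder file name and values).
model = extremeText.train_supervised(input="train.txt", loss="plt", ensemble=3)

# Top-5 labels with their probabilities for a single document (assumed fastText-style API).
labels, probs = model.predict("example document text", k=5)
print(labels, probs)
```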

## Usage

extremeText adds new options to the fastText `supervised` command:

```
$ ./extremetext supervised
New losses for multi-label classification:
-loss sigmoid
-loss plt (Probabilistic Labels Tree)
With the following optional arguments:
General:
-l2 L2 regularization (default = 0)
-fobos use FOBOS update
-tfidfWeights calculate TF-IDF weights for words
-wordsWeights read word weights from file (format: <word>:<weights>)
-weight document weight prefix (default = __weight__; format: <weight prefix>:<document weight>)
-tag tags prefix (default = __tag__); tags are ignored words that are output with the prediction
-addEosToken add EOS token at the end of document (default = 0)
-eosWeight weight of EOS token (default = 1.0)
PLT (Probabilistic Labels Tree):
-treeType type of PLT: complete, huffman, kmeans (default = kmeans)
-arity arity of PLT (default = 2)
-maxLeaves maximum number of leaves (labels) in one internal node of PLT (default = 100)
-kMeansEps stopping criteria for k-means clustering (default = 0.001)
Ensemble:
-ensemble size of the ensemble (default = 1)
-bagging bagging ratio (default = 1.0)
```
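
For illustration, a hypothetical training call using the PLT loss; the file names and hyperparameter values below are placeholders, and `-input`/`-output` follow the standard fastText `supervised` options:

```
$ ./extremetext supervised -input train.txt -output model -loss plt -treeType kmeans -ensemble 3 -l2 0.003
```

Training data is expected in the usual fastText format: one document per line, with labels marked by the `__label__` prefix (the default).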

extremeText also adds new commands and makes others work in parallel:
```
$ ./extremetext predict[-prob] <model> <test-data> [<k>] [<th>] [<output>] [<thread>]
$ ./extremetext get-prob <model> <input> [<th>] [<output>] [<thread>]
```
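
For example, the call below is a sketch (file names are placeholders) that scores a test set with the top 5 labels per document, a probability threshold of 0.0, writes results to a file, and uses 4 threads:

```
$ ./extremetext predict-prob model.bin test.txt 5 0.0 predictions.txt 4
```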

## Reference

Please cite the work below if using this code for extreme classification.

M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, K. Dembczyński, [*A no-regret generalization of hierarchical softmax to extreme multi-label classification*](https://arxiv.org/abs/1810.11671)

## TODO
* Merge with the latest changes from fastText.
* Rewrite vanilla fastText losses as extremeText loss layers to support all new features.

---

36 changes: 25 additions & 11 deletions python/README.md
@@ -2,6 +2,15 @@

[extremeText](https://github.com/mwydmuch/extremeText) is an extension of the [fastText](https://github.com/facebookresearch/fastText) library for multi-label classification, including extreme cases with hundreds of thousands or millions of labels.

[extremeText](https://github.com/mwydmuch/extremeText) implements:

* Probabilistic Labels Tree (PLT) loss for extreme multi-label classification with top-down hierarchical clustering (k-means) for tree building,
* sigmoid loss for multi-label classification,
* L2 regularization and FOBOS update for all losses,
* ensemble of loss layers with bagging,
* calculation of hidden (document) vector as a weighted average of the word vectors,
* calculation of TF-IDF weights for words.

## Requirements

[extremeText](https://github.com/mwydmuch/extremeText) builds on modern Mac OS and Linux distributions.
@@ -18,26 +27,25 @@ You will need:

## Installing extremeText

The easiest way to get [extremeText](https://github.com/mwydmuch/extremeText) is to use [pip](https://pip.pypa.io/en/stable/):

```
$ pip install extremetext
```

Installing on MacOS may require setting `MACOSX_DEPLOYMENT_TARGET=10.9` first:
```
$ export MACOSX_DEPLOYMENT_TARGET=10.9
$ pip install extremetext
```

The latest version of [extremeText](https://github.com/mwydmuch/extremeText) can be built from sources using pip or, alternatively, setuptools:

```
$ git clone https://github.com/mwydmuch/extremeText.git
$ cd extremeText
$ pip install .
(or) $ python setup.py install
```

Now you can import this library with:
@@ -54,7 +62,7 @@ We recommend you look at the [examples within the doc folder](https://github.com

As with any package you can get help on any Python function using the help function.

For example:

```
>>> import extremeText
@@ -84,7 +92,7 @@ FUNCTIONS

## IMPORTANT: Preprocessing data / encoding conventions

In general it is important to properly preprocess your data. Example scripts in the [root folder](https://github.com/mwydmuch/extremeText/extremeText) do this.

extremeText, like fastText, assumes UTF-8 encoded text. All text must be [unicode for Python2](https://docs.python.org/2/library/functions.html#unicode) and [str for Python3](https://docs.python.org/3.5/library/stdtypes.html#textseq). The passed text will be [encoded as UTF-8 by pybind11](https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions) before being passed to the extremeText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using [iconv](https://en.wikipedia.org/wiki/Iconv).
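
As a minimal illustration (assuming Python 3 and a placeholder file name), reading training text explicitly as UTF-8 ensures the library receives proper `str` objects:

```
# Read UTF-8 encoded text so every line handed to extremeText is a Python 3 str.
with open("train.txt", encoding="utf-8") as f:  # placeholder file name
    lines = [line.rstrip("\n") for line in f]
```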

@@ -100,3 +108,9 @@ extremeText will tokenize (split text into pieces) based on the following ASCII
The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX\_LINE\_SIZE constant as defined in the [Dictionary header](https://github.com/mwydmuch/extremeText/blob/master/src/dictionary.h). This means if you have text that is not separated by newlines, such as the [fil9 dataset](http://mattmahoney.net/dc/textdata), it will be broken into chunks of MAX\_LINE\_SIZE tokens and the EOS token will not be appended.

The length of a token is the number of UTF-8 characters, determined by considering the [leading two bits of a byte](https://en.wikipedia.org/wiki/UTF-8#Description) to identify [subsequent bytes of a multi-byte sequence](https://github.com/mwydmuch/extremeText/blob/master/src/dictionary.cc). Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the [Dictionary header](https://github.com/mwydmuch/extremeText/blob/master/src/dictionary.h)) is considered a character and will not be broken into subwords.

## Reference

Please cite the work below if using this package for extreme classification.

M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, K. Dembczyński, [A no-regret generalization of hierarchical softmax to extreme multi-label classification](https://arxiv.org/abs/1810.11671)
95 changes: 62 additions & 33 deletions python/README.rst
@@ -1,51 +1,68 @@
extremeText
===========

`extremeText <https://github.com/mwydmuch/extremeText>`__ is an
extension of the `fastText <https://github.com/facebookresearch/fastText>`__
library for multi-label classification, including extreme cases with
hundreds of thousands or millions of labels.

`extremeText <https://github.com/mwydmuch/extremeText>`__ implements:

- Probabilistic Labels Tree (PLT) loss for extreme multi-label
classification with top-down hierarchical clustering (k-means) for
tree building,
- sigmoid loss for multi-label classification,
- L2 regularization and FOBOS update for all losses,
- ensemble of loss layers with bagging,
- calculation of hidden (document) vector as a weighted average of the
word vectors,
- calculation of TF-IDF weights for words.

Requirements
------------

`extremeText <https://github.com/mwydmuch/extremeText>`__ builds on
modern Mac OS and Linux distributions. Since it uses C++11 features, it
requires a compiler with good C++11 support. These include:

- (gcc-4.8 or newer) or (clang-3.3 or newer)

You will need:

- `Python <https://www.python.org/>`__ version 2.7 or >=3.4
- `NumPy <http://www.numpy.org/>`__ &
`SciPy <https://www.scipy.org/>`__
- `pybind11 <https://github.com/pybind/pybind11>`__

Installing extremeText
----------------------

The easiest way to get
`extremeText <https://github.com/mwydmuch/extremeText>`__ is to use
`pip <https://pip.pypa.io/en/stable/>`__.

::

$ pip install extremetext

Installing on MacOS may require setting
``MACOSX_DEPLOYMENT_TARGET=10.9`` first:

::

$ export MACOSX_DEPLOYMENT_TARGET=10.9
$ pip install extremetext

The latest version of
`extremeText <https://github.com/mwydmuch/extremeText>`__ can be built
from sources using pip or, alternatively, setuptools.

::

$ git clone https://github.com/mwydmuch/extremeText.git
$ cd extremeText
$ pip install .
(or) $ python setup.py install

Now you can import this library with:

@@ -56,18 +73,19 @@ Now you can import this library with:
Examples
--------

In general it is assumed that the reader already has good knowledge of
fastText/extremeText. For this consider the main
`README <https://github.com/mwydmuch/extremeText/blob/master/README.md>`__
and `the tutorials on fastText
website <https://fasttext.cc/docs/en/supervised-tutorial.html>`__.

We recommend you look at the `examples within the doc
folder <https://github.com/mwydmuch/extremeText/tree/master/python/doc/examples>`__.

As with any package you can get help on any Python function using the
help function.

For example:

::

@@ -98,23 +116,25 @@ For example
IMPORTANT: Preprocessing data / encoding conventions
-----------------------------------------------------

In general it is important to properly preprocess your data. Example
scripts in the `root
folder <https://github.com/mwydmuch/extremeText/extremeText>`__ do this.

extremeText, like fastText, assumes UTF-8 encoded text. All text must be
`unicode for
Python2 <https://docs.python.org/2/library/functions.html#unicode>`__
and `str for
Python3 <https://docs.python.org/3.5/library/stdtypes.html#textseq>`__.
The passed text will be `encoded as UTF-8 by
pybind11 <https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions>`__
before being passed to the extremeText C++ library. This means it is
important to use UTF-8 encoded text when building a model. On Unix-like
systems you can convert text using
`iconv <https://en.wikipedia.org/wiki/Iconv>`__.

extremeText will tokenize (split text into pieces) based on the
following ASCII characters (bytes). In particular, it is not aware of
UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word
boundaries into one of the following symbols as appropriate.

- space
@@ -144,3 +164,12 @@ maximum length of subwords. Further, the EOS token (as specified in the
`Dictionary
header <https://github.com/mwydmuch/extremeText/blob/master/src/dictionary.h>`__)
is considered a character and will not be broken into subwords.

Reference
---------

Please cite the work below if using this package for extreme classification.

M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, K. Dembczyński,
`A no-regret generalization of hierarchical softmax to extreme
multi-label classification <https://arxiv.org/abs/1810.11671>`__
2 changes: 1 addition & 1 deletion python/doc/examples/bin_to_vec.py
@@ -19,7 +19,7 @@

if __name__ == "__main__":
parser = argparse.ArgumentParser(
description=("Print fasttext .vec file to stdout from .bin file")
description=("Print fasttext/extremetext .vec file to stdout from .bin file")
)
parser.add_argument(
"model",