Rework examples (#316)
* Move examples to RST files
* Fixes #269 by setting batch_size to a placeholder if a dataset is not set in the Iterator
* Fixes an iterator issue where the whole dataset was copied when shuffled (this caused slowdowns when iterating over HF dataset wrappers as they are disk backed and the copy made them concrete)
* Rename BasicVectorStorage -> WordVectors

Co-authored-by: mariosasko <[email protected]>
mttk and mariosasko authored Apr 2, 2021
1 parent 68b2f03 commit 34513a4
Showing 42 changed files with 1,454 additions and 2,260 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/ci.yml
@@ -38,18 +38,18 @@ jobs:
pip install .[quality]
- name: Check black compliance
run: |
black --check --line-length 90 --target-version py36 podium tests examples
black --check --line-length 90 --target-version py36 podium tests
- name: Check isort compliance
run: |
isort --check-only podium tests examples
isort --check-only podium tests
- name: Check docformatter compliance
run: |
docformatter podium tests examples --check --recursive \
docformatter podium tests --check --recursive \
--wrap-descriptions 80 --wrap-summaries 80 \
--pre-summary-newline --make-summary-multi-line
- name: Check flake8 compliance
run: |
flake8 podium tests examples
flake8 podium tests
build_and_test:
runs-on: ${{ matrix.os }}
14 changes: 7 additions & 7 deletions Makefile
@@ -3,19 +3,19 @@
# Check code quality
quality:
@echo Checking code and doc quality.
black --check --line-length 90 --target-version py36 podium tests examples
isort --check-only podium tests examples
docformatter podium tests examples --check --recursive \
black --check --line-length 90 --target-version py36 podium tests
isort --check-only podium tests
docformatter podium tests --check --recursive \
--wrap-descriptions 80 --wrap-summaries 80 \
--pre-summary-newline --make-summary-multi-line
flake8 podium tests examples
flake8 podium tests

# Enforce code quality in source
style:
@echo Applying code and doc style changes.
black --line-length 90 --target-version py36 podium tests examples
isort podium tests examples
docformatter podium tests examples -i --recursive \
black --line-length 90 --target-version py36 podium tests
isort podium tests
docformatter podium tests -i --recursive \
--wrap-descriptions 80 --wrap-summaries 80 \
--pre-summary-newline --make-summary-multi-line

31 changes: 19 additions & 12 deletions README.md
@@ -11,7 +11,7 @@ The main source of inspiration for Podium is an old version of [torchtext](https
### Contents

- [Installation](#installation)
- [Usage examples](#usage-examples)
- [Usage examples](#usage)
- [Contributing](#contributing)
- [Versioning](#versioning)
- [Authors](#authors)
@@ -56,13 +56,13 @@ SST({
name: text,
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 16284})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 16284})
}),
LabelField({
name: label,
keep_raw: False,
is_target: True,
vocab: Vocab({specials: (), eager: False, finalized: True, size: 2})
vocab: Vocab({specials: (), eager: False, is_finalized: True, size: 2})
})
]
})
@@ -94,7 +94,7 @@ HFDatasetConverter({
name: 'text',
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 280619})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 280619})
}),
LabelField({
name: 'label',
@@ -105,7 +105,7 @@ HFDatasetConverter({
})
```

Load your own dataset from a standardized tabular format (e.g. `csv`, `tsv`, `jsonl`):
Load your own dataset from a standardized tabular format (e.g. `csv`, `tsv`, `jsonl`, ...):

```python
>>> from podium.datasets import TabularDataset
@@ -121,24 +121,27 @@ TabularDataset({
fields: [
Field({
name: 'premise',
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 19})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 15})
}),
Field({
name: 'hypothesis',
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 19})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 6})
}),
LabelField({
name: 'label',
keep_raw: False,
is_target: True,
vocab: Vocab({specials: (), eager: False, finalized: True, size: 1})
vocab: Vocab({specials: (), eager: False, is_finalized: True, size: 1})
})
]
})
```

Or define your own `Dataset` subclass (tutorial coming soon).
Also check our documentation to see how you can load a dataset from [Pandas](https://pandas.pydata.org/), the CoNLL format, or define your own `Dataset` subclass (tutorial coming soon).

### Define your preprocessing

@@ -151,6 +154,7 @@ We wrap dataset pre-processing in customizable `Field` classes. Each `Field` has
>>> label = LabelField(name='label')
>>> fields = {'text': text, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> sst_train.finalize_fields()
>>> print(vocab)
Vocab({specials: ('<UNK>', '<PAD>'), eager: True, finalized: True, size: 5000})
```
@@ -175,6 +179,7 @@ You could decide to lowercase all the characters and filter out all non-alphanum
>>> text.add_posttokenize_hook(filter_alnum)
>>> fields = {'text': text, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> sst_train.finalize_fields()
>>> print(sst_train[222])
Example({
text: (None, ['a', 'slick', 'engrossing', 'melodrama']),
@@ -201,19 +206,21 @@ A common use-case is to incorporate existing components of pretrained language m
... numericalizer=tokenizer.convert_tokens_to_ids)
>>> fields = {'text': subword_field, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> # No need to finalize since we're not using a vocab!
>>> print(sst_train[222])
Example({
subword: (None, ['a', 'slick', ',', 'eng', '##ross', '##ing', 'mel', '##od', '##rama', '.']),label: (None, 'positive')
subword: (None, ['a', 'slick', ',', 'eng', '##ross', '##ing', 'mel', '##od', '##rama', '.']),
label: (None, 'positive')
})
```

For a more interactive introduction, check out the quickstart on Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/takelab/podium/blob/master/docs/source/notebooks/quickstart.ipynb)

More complex examples can be found in our [examples folder](./examples).
Full usage examples can be found in our [docs](https://takelab.fer.hr/podium/examples).

## Contributing

We welcome contributions! To learn more about making a contribution to Podium, please see our [Contribution page](CONTRIBUTING.md).
We welcome contributions! To learn more about making a contribution to Podium, please see our [Contribution page](CONTRIBUTING.md) and our [Roadmap](Roadmap.md).

## Versioning

65 changes: 65 additions & 0 deletions Roadmap.md
@@ -0,0 +1,65 @@
# Roadmap

If you are interested in making a contribution to Podium, this page outlines some changes we are planning to focus on in the near future. Feel free to propose improvements and modifications either via [discussions](https://github.com/TakeLab/podium/discussions) or by raising an [issue](https://github.com/TakeLab/podium/issues).

Order does not reflect importance.

## Major changes

- Dynamic application of Fields
  - Right now, the whole dataset needs to be reloaded after every change to a Field. The goal of this change is to allow users to replace or update a Field in a Dataset. The Dataset should be aware of this change (e.g. by keeping a hash of the Field object) and, when it happens, recompute all the necessary data for that Field.

The current pattern is:
```python
# Load a dataset
fields = {'text':text, 'label':label}
dataset = load_dataset(fields=fields)

# Decide to change something with one of the Fields
text = Field(..., tokenizer=some_different_tokenizer)
fields = {'text': text, 'label': label}
# Potentially expensive dataset loading is required again
dataset = load_dataset(fields=fields)
```
Dataset instances should instead detect changes in a Field and recompute values (Vocabs) for the ones that changed.
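The detection mechanism could be sketched as follows. This is a minimal, hypothetical illustration of the idea only, not the Podium API: `SimpleDataset`, the `fingerprint` method, and the toy frozen-dataclass `Field` are all invented names standing in for the real classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    # Toy stand-in for podium's Field; `tokenizer` is a string tag here
    # instead of a callable, purely to keep the sketch hashable.
    name: str
    tokenizer: str = "split"

    def fingerprint(self) -> int:
        # Anything that affects preprocessing goes into the fingerprint.
        return hash((self.name, self.tokenizer))


class SimpleDataset:
    def __init__(self, raw, fields):
        self.raw = raw  # list of dicts: raw column values per example
        self.fields = {f.name: f for f in fields}
        self._fingerprints = {}
        self._columns = {}
        self._refresh()

    def _refresh(self):
        # Recompute only the columns whose Field fingerprint changed.
        for name, f in self.fields.items():
            fp = f.fingerprint()
            if self._fingerprints.get(name) != fp:
                self._columns[name] = [self._process(f, r[name]) for r in self.raw]
                self._fingerprints[name] = fp

    @staticmethod
    def _process(f, value):
        # Stand-in for the real tokenization pipeline.
        if f.tokenizer == "split":
            return value.lower().split()
        return list(value.lower())

    def update_field(self, new_field):
        # Swap a Field in place; only its column is recomputed.
        self.fields[new_field.name] = new_field
        self._refresh()
```

With this scheme, calling `update_field` with a Field whose fingerprint is unchanged is a no-op, so the expensive reload in the snippet above only happens for the columns that actually changed.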

- Parallelization
- For data preprocessing (apply Fields in parallel)
- For data loading

- Conditional processing in Fields
- Handle cases where the values computed in one Field are dependent on values computed in another Field

- Experimental pipeline
- `podium.experimental`, wrappers for model framework agnostic training & serving
- Low priority

## Minor changes

- Populate hooks & preprocessing utilities
- Lowercase, truncate, extract POS, ...
- Populate pretrained vectors
- Word2vec
- Interface with e.g. gensim
- Improve Dataset coverage
- Data wrappers / abstract loaders for other source libraries and input formats
- BucketIterator modifications
- Simplify setting the sort key (e.g., in the basic case where the batch should be sorted according to the length of a single Field, accept a Field name as the argument)
- Improve HF/datasets integration
- Better automatic Field inference from features
- Cover additional feature datatypes (e.g., image data)
- Cleaner API?
- Centralized and intuitive download script
- Low priority as most data loading is delegated to hf/datasets
- Add a Mask token for MLM (can be handled with posttokenization hooks right now, but not ideal)
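As a rough illustration of the BucketIterator sort-key simplification mentioned above, an adapter could accept either a callable or a Field name. `make_sort_key` is a hypothetical helper, not part of Podium, and it assumes each example maps field names to token lists:

```python
def make_sort_key(sort_key):
    """Accept a callable sort key as-is, or turn a Field name into one."""
    if callable(sort_key):
        return sort_key
    # Basic case from the roadmap: sort by the length of a single Field.
    return lambda example: len(example[sort_key])


# Usage sketch: batches sorted by the length of the 'text' field.
examples = [
    {"text": ["a", "b", "c"]},
    {"text": ["a"]},
    {"text": ["a", "b"]},
]
ordered = sorted(examples, key=make_sort_key("text"))
```

A callable passed through unchanged keeps the current, fully general behaviour available.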

## Documentation

- Examples
- Language modeling
- Tensorflow model
- Various task types
- Chapters
- Handling datasets with missing tokens
- Loading data from pandas / porting data to pandas
- Loading CoNLL datasets
- Implementing your own dataset subclass
9 changes: 3 additions & 6 deletions docs/source/_static/js/custom.js
@@ -16,6 +16,8 @@ const hasNotebook = [
"advanced",
"preprocessing",
"walkthrough",
"examples/tfidf_example",
"examples/pytorch_rnn_example"
]

function addIcon() {
@@ -49,12 +51,7 @@ function addGithubButton() {
}

function addColabLink() {
if (location.toString().indexOf("package_reference") !== -1) {
return;
}

const parts = location.toString().split('/');
const pageName = parts[parts.length - 1].split(".")[0];
const pageName = location.protocol === "file:" ? location.pathname.split("/html/")[1].split(".")[0] : location.pathname.split("/podium/")[1].split(".")[0]

if (hasNotebook.includes(pageName)) {
const colabLink = `<a href="https://colab.research.google.com/github/TakeLab/podium/blob/master/docs/source/notebooks/${pageName}.ipynb">
8 changes: 4 additions & 4 deletions docs/source/advanced.rst
@@ -448,18 +448,18 @@ The ``bucket_sort_key`` function defines how the instances in the dataset should
For Iterator, padding = 148141 out of 281696 = 52.588961149608096%
For BucketIterator, padding = 2125 out of 135680 = 1.5661851415094339%
As we can see, the difference between using a regular Iterator and a BucketIterator is massive. Not only do we reduce the amount of padding, we have reduced the total amount of tokens processed by about 50%. The SST dataset, however, is a relatively small dataset so this experiment might be a bit biased. Let's take a look at the same statistics for the :class:`podium.datasets.impl.IMDB` dataset. After changing the highligted data loading line in the first snippet to:
As we can see, the difference between using a regular Iterator and a BucketIterator is massive. Not only do we reduce the amount of padding, we have reduced the total amount of tokens processed by about 50%. The SST dataset, however, is a relatively small dataset so this experiment might be a bit biased. Let's take a look at the same statistics for the :class:`podium.datasets.impl.IMDB` dataset. After changing the data loading line in the first snippet to:

.. code-block:: rest
train, test = IMDB.get_dataset_splits(fields=fields)
>>> train, test = IMDB.get_dataset_splits(fields=fields)
And re-running the code, we obtain the following, still significant improvement:

.. code-block:: rest
For Iterator, padding = 13569936 out of 19414616 = 69.89546432440385%
For BucketIterator, padding = 259800 out of 6104480 = 4.255890755641758%
For Iterator, padding = 13569936 out of 19414616 = 69.89%
For BucketIterator, padding = 259800 out of 6104480 = 4.25%
Generally, using bucketing when iterating over your NLP dataset is preferred and will save you quite a bit of processing time.
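For reference, padding percentages like the ones quoted in this file can be reproduced with a few lines of Python. `padding_stats` below is an illustrative helper, not something from the Podium docs: each batch is padded to its longest sequence, and pad positions are counted against the total.

```python
def padding_stats(batches):
    """Count pad positions when each batch is padded to its longest sequence."""
    padding = total = 0
    for batch in batches:
        max_len = max(len(seq) for seq in batch)
        total += max_len * len(batch)
        padding += sum(max_len - len(seq) for seq in batch)
    return padding, total


# Toy sequences of lengths 2, 9, 3, 8, 1, 10, in batches of three.
seqs = [[0] * n for n in (2, 9, 3, 8, 1, 10)]
unsorted_batches = [seqs[:3], seqs[3:]]
bucketed = sorted(seqs, key=len)  # what a BucketIterator effectively does
bucketed_batches = [bucketed[:3], bucketed[3:]]
# padding_stats(unsorted_batches) -> (24, 57); padding_stats(bucketed_batches) -> (6, 39)
```

Even on this toy example, sorting by length before batching cuts both the padding and the total number of positions processed, which is the effect measured above.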
