Rework examples #316

Merged
merged 31 commits into from
Apr 2, 2021
Commits
31b2792
Fix references in colabs
mariosasko Mar 31, 2021
c7f23d6
Rework examples
mttk Mar 31, 2021
a6d88a8
Merge
mttk Mar 31, 2021
ecd9c26
Finalize tfidf example
mttk Mar 31, 2021
f6726e4
Fix shuffling cost
mttk Apr 1, 2021
de5d083
Fix shuffling cost
mttk Apr 1, 2021
9123789
Fix shuffling cost+
mttk Apr 1, 2021
78757a3
Fix shuffling cost+
mttk Apr 1, 2021
1bd5887
Fix shuffling cost+
mttk Apr 1, 2021
7fec7f6
Merge branch 'master' into examples_rework
mttk Apr 1, 2021
6c9b67b
merge
mttk Apr 1, 2021
2ed6b95
Finalize pytorch rnn example
mttk Apr 1, 2021
0183c38
Finalize examples
mttk Apr 1, 2021
0c04847
Move examples back to examples folder
mttk Apr 1, 2021
ab3f504
Merge branch 'examples_rework' of github.com:TakeLab/podium into exam…
mariosasko Apr 1, 2021
9f77e5a
Add examples directory to notebooks
mttk Apr 1, 2021
73ab602
Remove debug print
mttk Apr 1, 2021
7b78c81
Fix tfidf example, improve notebooks
mariosasko Apr 1, 2021
e5cb8c6
Fix conflict
mariosasko Apr 1, 2021
0ce4f20
Move examples notebooks to notebooks/examples
mariosasko Apr 1, 2021
23b44d1
Fix JS colab condition
mariosasko Apr 1, 2021
3827d1b
Merge branch 'master' into examples_rework
mttk Apr 1, 2021
dcb6b9e
Comments
mttk Apr 1, 2021
63bed39
Delete examples (the camera ready ones are migrated into docs)
mttk Apr 1, 2021
6ff3314
Remove examples dir from commands
mttk Apr 1, 2021
715859d
Remove examples dir from action
mttk Apr 1, 2021
cf26d20
Update readme outputs
mttk Apr 1, 2021
d5a37e7
Add roadmap
mttk Apr 2, 2021
7436d30
Add roadmap
mttk Apr 2, 2021
a3cd42e
Add roadmap
mttk Apr 2, 2021
10bf1cf
Polish, comments, rename BasicVectorStorage to WordVectors
mttk Apr 2, 2021
8 changes: 4 additions & 4 deletions .github/workflows/ci.yml
@@ -38,18 +38,18 @@ jobs:
pip install .[quality]
- name: Check black compliance
run: |
black --check --line-length 90 --target-version py36 podium tests examples
black --check --line-length 90 --target-version py36 podium tests
- name: Check isort compliance
run: |
isort --check-only podium tests examples
isort --check-only podium tests
- name: Check docformatter compliance
run: |
docformatter podium tests examples --check --recursive \
docformatter podium tests --check --recursive \
--wrap-descriptions 80 --wrap-summaries 80 \
--pre-summary-newline --make-summary-multi-line
- name: Check flake8 compliance
run: |
flake8 podium tests examples
flake8 podium tests

build_and_test:
runs-on: ${{ matrix.os }}
14 changes: 7 additions & 7 deletions Makefile
@@ -3,19 +3,19 @@
# Check code quality
quality:
@echo Checking code and doc quality.
black --check --line-length 90 --target-version py36 podium tests examples
isort --check-only podium tests examples
docformatter podium tests examples --check --recursive \
black --check --line-length 90 --target-version py36 podium tests
isort --check-only podium tests
docformatter podium tests --check --recursive \
--wrap-descriptions 80 --wrap-summaries 80 \
--pre-summary-newline --make-summary-multi-line
flake8 podium tests examples
flake8 podium tests

# Enforce code quality in source
style:
@echo Applying code and doc style changes.
black --line-length 90 --target-version py36 podium tests examples
isort podium tests examples
docformatter podium tests examples -i --recursive \
black --line-length 90 --target-version py36 podium tests
isort podium tests
docformatter podium tests -i --recursive \
--wrap-descriptions 80 --wrap-summaries 80 \
--pre-summary-newline --make-summary-multi-line

31 changes: 19 additions & 12 deletions README.md
@@ -11,7 +11,7 @@ The main source of inspiration for Podium is an old version of [torchtext](https
### Contents

- [Installation](#installation)
- [Usage examples](#usage-examples)
- [Usage examples](#usage)
- [Contributing](#contributing)
- [Versioning](#versioning)
- [Authors](#authors)
@@ -56,13 +56,13 @@ SST({
name: text,
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 16284})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 16284})
}),
LabelField({
name: label,
keep_raw: False,
is_target: True,
vocab: Vocab({specials: (), eager: False, finalized: True, size: 2})
vocab: Vocab({specials: (), eager: False, is_finalized: True, size: 2})
})
]
})
@@ -94,7 +94,7 @@ HFDatasetConverter({
name: 'text',
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 280619})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 280619})
}),
LabelField({
name: 'label',
@@ -105,7 +105,7 @@ HFDatasetConverter({
})
```

Load your own dataset from a standardized tabular format (e.g. `csv`, `tsv`, `jsonl`):
Load your own dataset from a standardized tabular format (e.g. `csv`, `tsv`, `jsonl`, ...):

```python
>>> from podium.datasets import TabularDataset
@@ -121,24 +121,27 @@ TabularDataset({
fields: [
Field({
name: 'premise',
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 19})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 15})
}),
Field({
name: 'hypothesis',
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 19})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 6})
}),
LabelField({
name: 'label',
keep_raw: False,
is_target: True,
vocab: Vocab({specials: (), eager: False, finalized: True, size: 1})
vocab: Vocab({specials: (), eager: False, is_finalized: True, size: 1})
})
]
})
```

Or define your own `Dataset` subclass (tutorial coming soon).
Also check our documentation to see how you can load a dataset from [Pandas](https://pandas.pydata.org/), the CoNLL format, or define your own `Dataset` subclass (tutorial coming soon).

### Define your preprocessing

@@ -151,6 +154,7 @@ We wrap dataset pre-processing in customizable `Field` classes. Each `Field` has
>>> label = LabelField(name='label')
>>> fields = {'text': text, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> sst_train.finalize_fields()
>>> print(vocab)
Vocab({specials: ('<UNK>', '<PAD>'), eager: True, finalized: True, size: 5000})
```
@@ -175,6 +179,7 @@ You could decide to lowercase all the characters and filter out all non-alphanum
>>> text.add_posttokenize_hook(filter_alnum)
>>> fields = {'text': text, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> sst_train.finalize_fields()
>>> print(sst_train[222])
Example({
text: (None, ['a', 'slick', 'engrossing', 'melodrama']),
@@ -201,19 +206,21 @@ A common use-case is to incorporate existing components of pretrained language m
... numericalizer=tokenizer.convert_tokens_to_ids)
>>> fields = {'text': subword_field, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> # No need to finalize since we're not using a vocab!
>>> print(sst_train[222])
Example({
subword: (None, ['a', 'slick', ',', 'eng', '##ross', '##ing', 'mel', '##od', '##rama', '.']),label: (None, 'positive')
subword: (None, ['a', 'slick', ',', 'eng', '##ross', '##ing', 'mel', '##od', '##rama', '.']),
label: (None, 'positive')
})
```

For a more interactive introduction, check out the quickstart on Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/takelab/podium/blob/master/docs/source/notebooks/quickstart.ipynb)

More complex examples can be found in our [examples folder](./examples).
Full usage examples can be found in our [docs](https://takelab.fer.hr/podium/examples).

## Contributing

We welcome contributions! To learn more about making a contribution to Podium, please see our [Contribution page](CONTRIBUTING.md).
We welcome contributions! To learn more about making a contribution to Podium, please see our [Contribution page](CONTRIBUTING.md) and our [Roadmap](Roadmap.md).

## Versioning

65 changes: 65 additions & 0 deletions Roadmap.md
@@ -0,0 +1,65 @@
# Roadmap

If you are interested in making a contribution to Podium, this page outlines some changes we are planning to focus on in the near future. Feel free to propose improvements and modifications either via [discussions](https://github.com/TakeLab/podium/discussions) or by raising an [issue](https://github.com/TakeLab/podium/issues).

Order does not reflect importance.

## Major changes

- Dynamic application of Fields
- Right now, for every change in Fields the dataset needs to be reloaded. The goal of this change would be to allow users to replace or update a Field in a Dataset. The Dataset should be aware of this change (e.g. by keeping a hash of the Field object) and if it happens, recompute all the necessary data for that Field.

The current pattern is:
```python
# Load a dataset
fields = {'text':text, 'label':label}
dataset = load_dataset(fields=fields)

# Decide to change something with one of the Fields
text = Field(..., tokenizer=some_different_tokenizer)
fields = {'text': text, 'label': label}

# Potentially expensive dataset loading is required again
dataset = load_dataset(fields=fields)
```
Dataset instances should instead detect changes in a Field and recompute values (Vocabs) for the ones that changed.
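
A rough sketch of what this could look like from the user's side, continuing the snippet above (the `replace_field` method and the change detection behind it are hypothetical, not part of the current Podium API):
```python
# Load the dataset once
fields = {'text': text, 'label': label}
dataset = load_dataset(fields=fields)

# Later, reconfigure one of the Fields
new_text = Field('text', tokenizer=some_different_tokenizer)

# Hypothetical API: the Dataset detects that the Field changed (e.g. via its hash)
# and recomputes only that column and its Vocab instead of reloading everything
dataset.replace_field(new_text)
```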

- Parallelization
- For data preprocessing (apply Fields in parallel)
- For data loading

- Conditional processing in Fields
- Handle cases where the values computed in one Field are dependent on values computed in another Field

- Experimental pipeline
- `podium.experimental`: wrappers for framework-agnostic model training & serving
- Low priority

## Minor changes

- Populate hooks & preprocessing utilities
- Lowercase, truncate, extract POS, ... (see the hook sketch after this list)
- Populate pretrained vectors
- Word2vec
- Interface with e.g. gensim
- Improve Dataset coverage
- Data wrappers / abstract loaders for other source libraries and input formats
- BucketIterator modifications
- Simplify setting the sort key (e.g., in the basic case where the batch should be sorted according to the length of a single Field, accept a Field name as the argument)
- Improve HF/datasets integration
- Better automatic Field inference from features
- Cover additional feature datatypes (e.g., image data)
- Cleaner API?
- Centralized and intuitive download script
- Low priority as most data loading is delegated to hf/datasets
- Add a Mask token for MLM (can be handled with posttokenization hooks right now, but not ideal)
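
As a rough illustration of the hook utilities listed above, here is a sketch built on the post-tokenization hook mechanism already used in the README. The top-level imports and the `(raw, tokenized)` hook signature are assumptions based on the quickstart, and `lowercase` / `truncate` are hypothetical helpers rather than existing Podium utilities:

```python
from podium import Field, Vocab  # assumed top-level imports, as in the quickstart

def lowercase(raw, tokenized):
    # candidate utility hook: lowercase every token
    return raw, [token.lower() for token in tokenized]

def truncate(max_length=200):
    # candidate hook factory: keep only the first max_length tokens
    def hook(raw, tokenized):
        return raw, tokenized[:max_length]
    return hook

text = Field(name='text', numericalizer=Vocab())
text.add_posttokenize_hook(lowercase)
text.add_posttokenize_hook(truncate(200))
```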

## Documentation

- Examples
- Language modeling
- Tensorflow model
- Various task types
- Chapters
- Handling datasets with missing tokens
- Loading data from pandas / porting data to pandas
- Loading CoNLL datasets
- Implementing your own dataset subclass
9 changes: 3 additions & 6 deletions docs/source/_static/js/custom.js
@@ -16,6 +16,8 @@ const hasNotebook = [
"advanced",
"preprocessing",
"walkthrough",
"examples/tfidf_example",
"examples/pytorch_rnn_example"
]

function addIcon() {
@@ -49,12 +51,7 @@ function addGithubButton() {
}

function addColabLink() {
if (location.toString().indexOf("package_reference") !== -1) {
return;
}

const parts = location.toString().split('/');
const pageName = parts[parts.length - 1].split(".")[0];
const pageName = location.protocol === "file:" ? location.pathname.split("/html/")[1].split(".")[0] : location.pathname.split("/podium/")[1].split(".")[0]

if (hasNotebook.includes(pageName)) {
const colabLink = `<a href="https://colab.research.google.com/github/TakeLab/podium/blob/master/docs/source/notebooks/${pageName}.ipynb">
8 changes: 4 additions & 4 deletions docs/source/advanced.rst
@@ -448,18 +448,18 @@ The ``bucket_sort_key`` function defines how the instances in the dataset should
For Iterator, padding = 148141 out of 281696 = 52.588961149608096%
For BucketIterator, padding = 2125 out of 135680 = 1.5661851415094339%

As we can see, the difference between using a regular Iterator and a BucketIterator is massive. Not only do we reduce the amount of padding, we have reduced the total amount of tokens processed by about 50%. The SST dataset, however, is a relatively small dataset so this experiment might be a bit biased. Let's take a look at the same statistics for the :class:`podium.datasets.impl.IMDB` dataset. After changing the highligted data loading line in the first snippet to:
As we can see, the difference between using a regular Iterator and a BucketIterator is massive. Not only do we reduce the amount of padding, we have reduced the total amount of tokens processed by about 50%. The SST dataset, however, is a relatively small dataset so this experiment might be a bit biased. Let's take a look at the same statistics for the :class:`podium.datasets.impl.IMDB` dataset. After changing the data loading line in the first snippet to:

.. code-block:: rest

train, test = IMDB.get_dataset_splits(fields=fields)
>>> train, test = IMDB.get_dataset_splits(fields=fields)

And re-running the code, we obtain the following, still significant improvement:

.. code-block:: rest

For Iterator, padding = 13569936 out of 19414616 = 69.89546432440385%
For BucketIterator, padding = 259800 out of 6104480 = 4.255890755641758%
For Iterator, padding = 13569936 out of 19414616 = 69.89%
For BucketIterator, padding = 259800 out of 6104480 = 4.25%

Generally, using bucketing when iterating over your NLP dataset is preferred and will save you quite a bit of processing time.
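
To recap the setup, a minimal bucketing sketch could look roughly as follows; the import path, the ``bucket_sort_key`` argument, and attribute-style instance access are assumed from the snippets above rather than verified against the current API:

.. code-block:: python

    >>> from podium.datasets.iterator import BucketIterator  # import path assumed
    >>> # Sort by tokenized text length so each batch groups similarly long instances
    >>> def instance_length(instance):
    ...     raw, tokenized = instance.text  # (raw, tokenized) pair, attribute access assumed
    ...     return len(tokenized)
    >>> bucket_iter = BucketIterator(train, batch_size=32, bucket_sort_key=instance_length)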
