Rework examples (#316)
* Move examples to RST files
* Fixes #269 by setting batch_size to a placeholder if a dataset is not set in the Iterator
* Fixes an iterator issue where the whole dataset was copied when shuffled (this caused slowdowns when iterating over HF dataset wrappers as they are disk backed and the copy made them concrete)
* Rename BasicVectorStorage -> WordVectors

Co-authored-by: mariosasko <[email protected]>
mttk and mariosasko authored Apr 2, 2021
1 parent 68b2f03 commit 34513a4
Showing 42 changed files with 1,454 additions and 2,260 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/ci.yml
@@ -38,18 +38,18 @@ jobs:
pip install .[quality]
- name: Check black compliance
run: |
black --check --line-length 90 --target-version py36 podium tests examples
black --check --line-length 90 --target-version py36 podium tests
- name: Check isort compliance
run: |
isort --check-only podium tests examples
isort --check-only podium tests
- name: Check docformatter compliance
run: |
docformatter podium tests examples --check --recursive \
docformatter podium tests --check --recursive \
--wrap-descriptions 80 --wrap-summaries 80 \
--pre-summary-newline --make-summary-multi-line
- name: Check flake8 compliance
run: |
flake8 podium tests examples
flake8 podium tests
build_and_test:
runs-on: ${{ matrix.os }}
14 changes: 7 additions & 7 deletions Makefile
@@ -3,19 +3,19 @@
# Check code quality
quality:
@echo Checking code and doc quality.
black --check --line-length 90 --target-version py36 podium tests examples
isort --check-only podium tests examples
docformatter podium tests examples --check --recursive \
black --check --line-length 90 --target-version py36 podium tests
isort --check-only podium tests
docformatter podium tests --check --recursive \
--wrap-descriptions 80 --wrap-summaries 80 \
--pre-summary-newline --make-summary-multi-line
flake8 podium tests examples
flake8 podium tests

# Enforce code quality in source
style:
@echo Applying code and doc style changes.
black --line-length 90 --target-version py36 podium tests examples
isort podium tests examples
docformatter podium tests examples -i --recursive \
black --line-length 90 --target-version py36 podium tests
isort podium tests
docformatter podium tests -i --recursive \
--wrap-descriptions 80 --wrap-summaries 80 \
--pre-summary-newline --make-summary-multi-line

31 changes: 19 additions & 12 deletions README.md
@@ -11,7 +11,7 @@ The main source of inspiration for Podium is an old version of [torchtext](https
### Contents

- [Installation](#installation)
- [Usage examples](#usage-examples)
- [Usage examples](#usage)
- [Contributing](#contributing)
- [Versioning](#versioning)
- [Authors](#authors)
@@ -56,13 +56,13 @@ SST({
name: text,
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 16284})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 16284})
}),
LabelField({
name: label,
keep_raw: False,
is_target: True,
vocab: Vocab({specials: (), eager: False, finalized: True, size: 2})
vocab: Vocab({specials: (), eager: False, is_finalized: True, size: 2})
})
]
})
@@ -94,7 +94,7 @@ HFDatasetConverter({
name: 'text',
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 280619})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 280619})
}),
LabelField({
name: 'label',
@@ -105,7 +105,7 @@ HFDatasetConverter({
})
```

Load your own dataset from a standardized tabular format (e.g. `csv`, `tsv`, `jsonl`):
Load your own dataset from a standardized tabular format (e.g. `csv`, `tsv`, `jsonl`, ...):

```python
>>> from podium.datasets import TabularDataset
@@ -121,24 +121,27 @@ TabularDataset({
fields: [
Field({
name: 'premise',
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 19})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 15})
}),
Field({
name: 'hypothesis',
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 19})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 6})
}),
LabelField({
name: 'label',
keep_raw: False,
is_target: True,
vocab: Vocab({specials: (), eager: False, finalized: True, size: 1})
vocab: Vocab({specials: (), eager: False, is_finalized: True, size: 1})
})
]
})
```

Or define your own `Dataset` subclass (tutorial coming soon).
Also check our documentation to see how you can load a dataset from [Pandas](https://pandas.pydata.org/), the CoNLL format, or define your own `Dataset` subclass (tutorial coming soon).

### Define your preprocessing

@@ -151,6 +154,7 @@ We wrap dataset pre-processing in customizable `Field` classes. Each `Field` has
>>> label = LabelField(name='label')
>>> fields = {'text': text, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> sst_train.finalize_fields()
>>> print(vocab)
Vocab({specials: ('<UNK>', '<PAD>'), eager: True, finalized: True, size: 5000})
```
@@ -175,6 +179,7 @@ You could decide to lowercase all the characters and filter out all non-alphanum
>>> text.add_posttokenize_hook(filter_alnum)
>>> fields = {'text': text, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> sst_train.finalize_fields()
>>> print(sst_train[222])
Example({
text: (None, ['a', 'slick', 'engrossing', 'melodrama']),
@@ -201,19 +206,21 @@ A common use-case is to incorporate existing components of pretrained language m
... numericalizer=tokenizer.convert_tokens_to_ids)
>>> fields = {'text': subword_field, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> # No need to finalize since we're not using a vocab!
>>> print(sst_train[222])
Example({
subword: (None, ['a', 'slick', ',', 'eng', '##ross', '##ing', 'mel', '##od', '##rama', '.']),label: (None, 'positive')
subword: (None, ['a', 'slick', ',', 'eng', '##ross', '##ing', 'mel', '##od', '##rama', '.']),
label: (None, 'positive')
})
```

For a more interactive introduction, check out the quickstart on Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/takelab/podium/blob/master/docs/source/notebooks/quickstart.ipynb)

More complex examples can be found in our [examples folder](./examples).
Full usage examples can be found in our [docs](https://takelab.fer.hr/podium/examples).

## Contributing

We welcome contributions! To learn more about making a contribution to Podium, please see our [Contribution page](CONTRIBUTING.md).
We welcome contributions! To learn more about making a contribution to Podium, please see our [Contribution page](CONTRIBUTING.md) and our [Roadmap](Roadmap.md).

## Versioning

65 changes: 65 additions & 0 deletions Roadmap.md
@@ -0,0 +1,65 @@
# Roadmap

If you are interested in making a contribution to Podium, this page outlines some changes we are planning to focus on in the near future. Feel free to propose improvements and modifications either via [discussions](https://github.com/TakeLab/podium/discussions) or by raising an [issue](https://github.com/TakeLab/podium/issues).

Order does not reflect importance.

## Major changes

- Dynamic application of Fields
  - Right now, the whole dataset needs to be reloaded after every change to a Field. The goal of this change is to allow users to replace or update a Field in a Dataset. The Dataset should be aware of this change (e.g. by keeping a hash of the Field object) and, when it happens, recompute all the necessary data for that Field.

The current pattern is:
```python
# Load a dataset
fields = {'text':text, 'label':label}
dataset = load_dataset(fields=fields)

# Decide to change something with one of the Fields
text = Field(..., tokenizer=some_different_tokenizer)
fields = {'text': text, 'label': label}
# Potentially expensive dataset loading is required again
dataset = load_dataset(fields=fields)
```
Dataset instances should instead detect changes in a Field and recompute values (Vocabs) for the ones that changed.
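The detection mechanism could be sketched as follows. This is a minimal, hypothetical illustration of the idea only, not the Podium API: `SimpleDataset`, the `fingerprint` method, and the toy frozen-dataclass `Field` are all invented names standing in for the real classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    # Toy stand-in for podium's Field; `tokenizer` is a string tag here
    # instead of a callable, purely to keep the sketch hashable.
    name: str
    tokenizer: str = "split"

    def fingerprint(self) -> int:
        # Anything that affects preprocessing goes into the fingerprint.
        return hash((self.name, self.tokenizer))


class SimpleDataset:
    def __init__(self, raw, fields):
        self.raw = raw  # list of dicts: raw column values per example
        self.fields = {f.name: f for f in fields}
        self._fingerprints = {}
        self._columns = {}
        self._refresh()

    def _refresh(self):
        # Recompute only the columns whose Field fingerprint changed.
        for name, f in self.fields.items():
            fp = f.fingerprint()
            if self._fingerprints.get(name) != fp:
                self._columns[name] = [self._process(f, r[name]) for r in self.raw]
                self._fingerprints[name] = fp

    @staticmethod
    def _process(f, value):
        # Stand-in for the real tokenization pipeline.
        if f.tokenizer == "split":
            return value.lower().split()
        return list(value.lower())

    def update_field(self, new_field):
        # Swap a Field in place; only its column is recomputed.
        self.fields[new_field.name] = new_field
        self._refresh()
```

With this scheme, calling `update_field` with a Field whose fingerprint is unchanged is a no-op, so the expensive reload in the snippet above only happens for the columns that actually changed.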

- Parallelization
- For data preprocessing (apply Fields in parallel)
- For data loading

- Conditional processing in Fields
- Handle cases where the values computed in one Field are dependent on values computed in another Field

- Experimental pipeline
- `podium.experimental`, wrappers for model framework agnostic training & serving
- Low priority

## Minor changes

- Populate hooks & preprocessing utilities
- Lowercase, truncate, extract POS, ...
- Populate pretrained vectors
- Word2vec
- Interface with e.g. gensim
- Improve Dataset coverage
- Data wrappers / abstract loaders for other source libraries and input formats
- BucketIterator modifications
- Simplify setting the sort key (e.g., in the basic case where the batch should be sorted according to the length of a single Field, accept a Field name as the argument)
- Improve HF/datasets integration
- Better automatic Field inference from features
- Cover additional feature datatypes (e.g., image data)
- Cleaner API?
- Centralized and intuitive download script
- Low priority as most data loading is delegated to hf/datasets
- Add a Mask token for MLM (can be handled with posttokenization hooks right now, but not ideal)
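As a rough illustration of the BucketIterator sort-key simplification mentioned above, an adapter could accept either a callable or a Field name. `make_sort_key` is a hypothetical helper, not part of Podium, and it assumes each example maps field names to token lists:

```python
def make_sort_key(sort_key):
    """Accept a callable sort key as-is, or turn a Field name into one."""
    if callable(sort_key):
        return sort_key
    # Basic case from the roadmap: sort by the length of a single Field.
    return lambda example: len(example[sort_key])


# Usage sketch: batches sorted by the length of the 'text' field.
examples = [
    {"text": ["a", "b", "c"]},
    {"text": ["a"]},
    {"text": ["a", "b"]},
]
ordered = sorted(examples, key=make_sort_key("text"))
```

A callable passed through unchanged keeps the current, fully general behaviour available.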

## Documentation

- Examples
- Language modeling
- Tensorflow model
- Various task types
- Chapters
- Handling datasets with missing tokens
- Loading data from pandas / porting data to pandas
- Loading CoNLL datasets
- Implementing your own dataset subclass
9 changes: 3 additions & 6 deletions docs/source/_static/js/custom.js
@@ -16,6 +16,8 @@ const hasNotebook = [
"advanced",
"preprocessing",
"walkthrough",
"examples/tfidf_example",
"examples/pytorch_rnn_example"
]

function addIcon() {
@@ -49,12 +51,7 @@ function addGithubButton() {
}

function addColabLink() {
if (location.toString().indexOf("package_reference") !== -1) {
return;
}

const parts = location.toString().split('/');
const pageName = parts[parts.length - 1].split(".")[0];
const pageName = location.protocol === "file:" ? location.pathname.split("/html/")[1].split(".")[0] : location.pathname.split("/podium/")[1].split(".")[0]

if (hasNotebook.includes(pageName)) {
const colabLink = `<a href="https://colab.research.google.com/github/TakeLab/podium/blob/master/docs/source/notebooks/${pageName}.ipynb">
8 changes: 4 additions & 4 deletions docs/source/advanced.rst
@@ -448,18 +448,18 @@ The ``bucket_sort_key`` function defines how the instances in the dataset should
For Iterator, padding = 148141 out of 281696 = 52.588961149608096%
For BucketIterator, padding = 2125 out of 135680 = 1.5661851415094339%
As we can see, the difference between using a regular Iterator and a BucketIterator is massive. Not only do we reduce the amount of padding, we have reduced the total amount of tokens processed by about 50%. The SST dataset, however, is a relatively small dataset so this experiment might be a bit biased. Let's take a look at the same statistics for the :class:`podium.datasets.impl.IMDB` dataset. After changing the highligted data loading line in the first snippet to:
As we can see, the difference between using a regular Iterator and a BucketIterator is massive. Not only do we reduce the amount of padding, we have reduced the total amount of tokens processed by about 50%. The SST dataset, however, is a relatively small dataset so this experiment might be a bit biased. Let's take a look at the same statistics for the :class:`podium.datasets.impl.IMDB` dataset. After changing the data loading line in the first snippet to:

.. code-block:: rest
train, test = IMDB.get_dataset_splits(fields=fields)
>>> train, test = IMDB.get_dataset_splits(fields=fields)
And re-running the code, we obtain the following, still significant improvement:

.. code-block:: rest
For Iterator, padding = 13569936 out of 19414616 = 69.89546432440385%
For BucketIterator, padding = 259800 out of 6104480 = 4.255890755641758%
For Iterator, padding = 13569936 out of 19414616 = 69.89%
For BucketIterator, padding = 259800 out of 6104480 = 4.25%
Generally, using bucketing when iterating over your NLP dataset is preferred and will save you quite a bit of processing time.
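For reference, padding percentages like the ones quoted in this file can be reproduced with a few lines of Python. `padding_stats` below is an illustrative helper, not something from the Podium docs: each batch is padded to its longest sequence, and pad positions are counted against the total.

```python
def padding_stats(batches):
    """Count pad positions when each batch is padded to its longest sequence."""
    padding = total = 0
    for batch in batches:
        max_len = max(len(seq) for seq in batch)
        total += max_len * len(batch)
        padding += sum(max_len - len(seq) for seq in batch)
    return padding, total


# Toy sequences of lengths 2, 9, 3, 8, 1, 10, in batches of three.
seqs = [[0] * n for n in (2, 9, 3, 8, 1, 10)]
unsorted_batches = [seqs[:3], seqs[3:]]
bucketed = sorted(seqs, key=len)  # what a BucketIterator effectively does
bucketed_batches = [bucketed[:3], bucketed[3:]]
# padding_stats(unsorted_batches) -> (24, 57); padding_stats(bucketed_batches) -> (6, 39)
```

Even on this toy example, sorting by length before batching cuts both the padding and the total number of positions processed, which is the effect measured above.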
