Rework examples #316

Merged
merged 31 commits into from
Apr 2, 2021
Commits
31b2792
Fix references in colabs
mariosasko Mar 31, 2021
c7f23d6
Rework examples
mttk Mar 31, 2021
a6d88a8
Merge
mttk Mar 31, 2021
ecd9c26
Finalize tfidf example
mttk Mar 31, 2021
f6726e4
Fix shuffling cost
mttk Apr 1, 2021
de5d083
Fix shuffling cost
mttk Apr 1, 2021
9123789
Fix shuffling cost+
mttk Apr 1, 2021
78757a3
Fix shuffling cost+
mttk Apr 1, 2021
1bd5887
Fix shuffling cost+
mttk Apr 1, 2021
7fec7f6
Merge branch 'master' into examples_rework
mttk Apr 1, 2021
6c9b67b
merge
mttk Apr 1, 2021
2ed6b95
Finalize pytorch rnn example
mttk Apr 1, 2021
0183c38
Finalize examples
mttk Apr 1, 2021
0c04847
Move examples back to examples folder
mttk Apr 1, 2021
ab3f504
Merge branch 'examples_rework' of github.com:TakeLab/podium into exam…
mariosasko Apr 1, 2021
9f77e5a
Add examples directory to notebooks
mttk Apr 1, 2021
73ab602
Remove debug print
mttk Apr 1, 2021
7b78c81
Fix tfidf example, improve notebooks
mariosasko Apr 1, 2021
e5cb8c6
Fix conflict
mariosasko Apr 1, 2021
0ce4f20
Move examples notebooks to notebooks/examples
mariosasko Apr 1, 2021
23b44d1
Fix JS colab condition
mariosasko Apr 1, 2021
3827d1b
Merge branch 'master' into examples_rework
mttk Apr 1, 2021
dcb6b9e
Comments
mttk Apr 1, 2021
63bed39
Delete examples (the camera ready ones are migrated into docs)
mttk Apr 1, 2021
6ff3314
Remove examples dir from commands
mttk Apr 1, 2021
715859d
Remove examples dir from action
mttk Apr 1, 2021
cf26d20
Update readme outputs
mttk Apr 1, 2021
d5a37e7
Add roadmap
mttk Apr 2, 2021
7436d30
Add roadmap
mttk Apr 2, 2021
a3cd42e
Add roadmap
mttk Apr 2, 2021
10bf1cf
Polish, comments, rename BasicVectorStorage to WordVectors
mttk Apr 2, 2021
8 changes: 4 additions & 4 deletions .github/workflows/ci.yml
@@ -38,18 +38,18 @@ jobs:
pip install .[quality]
- name: Check black compliance
run: |
black --check --line-length 90 --target-version py36 podium tests examples
black --check --line-length 90 --target-version py36 podium tests
- name: Check isort compliance
run: |
isort --check-only podium tests examples
isort --check-only podium tests
- name: Check docformatter compliance
run: |
docformatter podium tests examples --check --recursive \
docformatter podium tests --check --recursive \
--wrap-descriptions 80 --wrap-summaries 80 \
--pre-summary-newline --make-summary-multi-line
- name: Check flake8 compliance
run: |
flake8 podium tests examples
flake8 podium tests

build_and_test:
runs-on: ${{ matrix.os }}
14 changes: 7 additions & 7 deletions Makefile
@@ -3,19 +3,19 @@
# Check code quality
quality:
@echo Checking code and doc quality.
black --check --line-length 90 --target-version py36 podium tests examples
isort --check-only podium tests examples
docformatter podium tests examples --check --recursive \
black --check --line-length 90 --target-version py36 podium tests
isort --check-only podium tests
docformatter podium tests --check --recursive \
--wrap-descriptions 80 --wrap-summaries 80 \
--pre-summary-newline --make-summary-multi-line
flake8 podium tests examples
flake8 podium tests

# Enforce code quality in source
style:
@echo Applying code and doc style changes.
black --line-length 90 --target-version py36 podium tests examples
isort podium tests examples
docformatter podium tests examples -i --recursive \
black --line-length 90 --target-version py36 podium tests
isort podium tests
docformatter podium tests -i --recursive \
--wrap-descriptions 80 --wrap-summaries 80 \
--pre-summary-newline --make-summary-multi-line

31 changes: 19 additions & 12 deletions README.md
@@ -11,7 +11,7 @@ The main source of inspiration for Podium is an old version of [torchtext](https
### Contents

- [Installation](#installation)
- [Usage examples](#usage-examples)
- [Usage examples](#usage)
- [Contributing](#contributing)
- [Versioning](#versioning)
- [Authors](#authors)
@@ -56,13 +56,13 @@ SST({
name: text,
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 16284})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 16284})
}),
LabelField({
name: label,
keep_raw: False,
is_target: True,
vocab: Vocab({specials: (), eager: False, finalized: True, size: 2})
vocab: Vocab({specials: (), eager: False, is_finalized: True, size: 2})
})
]
})
@@ -94,7 +94,7 @@ HFDatasetConverter({
name: 'text',
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 280619})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 280619})
}),
LabelField({
name: 'label',
@@ -105,7 +105,7 @@ HFDatasetConverter({
})
```

Load your own dataset from a standardized tabular format (e.g. `csv`, `tsv`, `jsonl`):
Load your own dataset from a standardized tabular format (e.g. `csv`, `tsv`, `jsonl`, ...):

```python
>>> from podium.datasets import TabularDataset
@@ -121,24 +121,27 @@ TabularDataset({
fields: [
Field({
name: 'premise',
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 19})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 15})
}),
Field({
name: 'hypothesis',
keep_raw: False,
is_target: False,
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, finalized: True, size: 19})
vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 6})
}),
LabelField({
name: 'label',
keep_raw: False,
is_target: True,
vocab: Vocab({specials: (), eager: False, finalized: True, size: 1})
vocab: Vocab({specials: (), eager: False, is_finalized: True, size: 1})
})
]
})
```

Or define your own `Dataset` subclass (tutorial coming soon).
Also check our documentation to see how you can load a dataset from [Pandas](https://pandas.pydata.org/), the CoNLL format, or define your own `Dataset` subclass (tutorial coming soon).

### Define your preprocessing

@@ -151,6 +154,7 @@ We wrap dataset pre-processing in customizable `Field` classes. Each `Field` has
>>> label = LabelField(name='label')
>>> fields = {'text': text, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> sst_train.finalize_fields()
>>> print(vocab)
Vocab({specials: ('<UNK>', '<PAD>'), eager: True, finalized: True, size: 5000})
```
@@ -175,6 +179,7 @@ You could decide to lowercase all the characters and filter out all non-alphanum
>>> text.add_posttokenize_hook(filter_alnum)
>>> fields = {'text': text, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> sst_train.finalize_fields()
>>> print(sst_train[222])
Example({
text: (None, ['a', 'slick', 'engrossing', 'melodrama']),
@@ -201,19 +206,21 @@ A common use-case is to incorporate existing components of pretrained language m
... numericalizer=tokenizer.convert_tokens_to_ids)
>>> fields = {'text': subword_field, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> # No need to finalize since we're not using a vocab!
>>> print(sst_train[222])
Example({
subword: (None, ['a', 'slick', ',', 'eng', '##ross', '##ing', 'mel', '##od', '##rama', '.']),label: (None, 'positive')
subword: (None, ['a', 'slick', ',', 'eng', '##ross', '##ing', 'mel', '##od', '##rama', '.']),
label: (None, 'positive')
})
```

For a more interactive introduction, check out the quickstart on Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/takelab/podium/blob/master/docs/source/notebooks/quickstart.ipynb)

More complex examples can be found in our [examples folder](./examples).
Full usage examples can be found in our [docs](https://takelab.fer.hr/podium/examples).

## Contributing

We welcome contributions! To learn more about making a contribution to Podium, please see our [Contribution page](CONTRIBUTING.md).
We welcome contributions! To learn more about making a contribution to Podium, please see our [Contribution page](CONTRIBUTING.md) and our [Roadmap](Roadmap.md).

## Versioning

65 changes: 65 additions & 0 deletions Roadmap.md
@@ -0,0 +1,65 @@
# Roadmap

If you are interested in making a contribution to Podium, this page outlines some changes we are planning to focus on in the near future. Feel free to propose improvements and modifications either via [discussions](https://github.com/TakeLab/podium/discussions) or by raising an [issue](https://github.com/TakeLab/podium/issues).

Order does not reflect importance.

## Major changes

- Dynamic application of Fields
- Right now, for every change in Fields the dataset needs to be reloaded. The goal of this change would be to allow users to replace or update a Field in a Dataset. The Dataset should be aware of this change (e.g. by keeping a hash of the Field object) and if it happens, recompute all the necessary data for that Field.

The current pattern is:
```python
# Load a dataset
fields = {'text':text, 'label':label}
dataset = load_dataset(fields=fields)

# Decide to change something with one of the Fields
text = Field(..., tokenizer=some_different_tokenizer)
fields = {'text': text, 'label': label}

# Potentially expensive dataset loading is required again
dataset = load_dataset(fields=fields)
```
Dataset instances should instead detect changes in a Field and recompute values (Vocabs) for the ones that changed.
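
A rough sketch of what this could look like from the user's side, continuing the snippet above (the `replace_field` method and the change detection behind it are hypothetical, not part of the current Podium API):
```python
# Load the dataset once
fields = {'text': text, 'label': label}
dataset = load_dataset(fields=fields)

# Later, reconfigure one of the Fields
new_text = Field('text', tokenizer=some_different_tokenizer)

# Hypothetical API: the Dataset detects that the Field changed (e.g. via its hash)
# and recomputes only that column and its Vocab instead of reloading everything
dataset.replace_field(new_text)
```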

- Parallelization
- For data preprocessing (apply Fields in parallel)
- For data loading

- Conditional processing in Fields
- Handle cases where the values computed in one Field are dependent on values computed in another Field

- Experimental pipeline
- `podium.experimental`: wrappers for framework-agnostic model training & serving
- Low priority

## Minor changes

- Populate hooks & preprocessing utilities
- Lowercase, truncate, extract POS, ... (see the hook sketch after this list)
- Populate pretrained vectors
- Word2vec
- Interface with e.g. gensim
- Improve Dataset coverage
- Data wrappers / abstract loaders for other source libraries and input formats
- BucketIterator modifications
- Simplify setting the sort key (e.g., in the basic case where the batch should be sorted according to the length of a single Field, accept a Field name as the argument)
- Improve HF/datasets integration
- Better automatic Field inference from features
- Cover additional feature datatypes (e.g., image data)
- Cleaner API?
- Centralized and intuitive download script
- Low priority as most data loading is delegated to hf/datasets
- Add a Mask token for MLM (can be handled with posttokenization hooks right now, but not ideal)
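
As a rough illustration of the hook utilities listed above, here is a sketch built on the post-tokenization hook mechanism already used in the README. The top-level imports and the `(raw, tokenized)` hook signature are assumptions based on the quickstart, and `lowercase` / `truncate` are hypothetical helpers rather than existing Podium utilities:

```python
from podium import Field, Vocab  # assumed top-level imports, as in the quickstart

def lowercase(raw, tokenized):
    # candidate utility hook: lowercase every token
    return raw, [token.lower() for token in tokenized]

def truncate(max_length=200):
    # candidate hook factory: keep only the first max_length tokens
    def hook(raw, tokenized):
        return raw, tokenized[:max_length]
    return hook

text = Field(name='text', numericalizer=Vocab())
text.add_posttokenize_hook(lowercase)
text.add_posttokenize_hook(truncate(200))
```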

## Documentation

- Examples
- Language modeling
- Tensorflow model
- Various task types
- Chapters
- Handling datasets with missing tokens
- Loading data from pandas / porting data to pandas
- Loading CoNLL datasets
- Implementing your own dataset subclass
9 changes: 3 additions & 6 deletions docs/source/_static/js/custom.js
@@ -16,6 +16,8 @@ const hasNotebook = [
"advanced",
"preprocessing",
"walkthrough",
"examples/tfidf_example",
"examples/pytorch_rnn_example"
]

function addIcon() {
@@ -49,12 +51,7 @@ function addGithubButton() {
}

function addColabLink() {
if (location.toString().indexOf("package_reference") !== -1) {
return;
}

const parts = location.toString().split('/');
const pageName = parts[parts.length - 1].split(".")[0];
const pageName = location.protocol === "file:" ? location.pathname.split("/html/")[1].split(".")[0] : location.pathname.split("/podium/")[1].split(".")[0]

if (hasNotebook.includes(pageName)) {
const colabLink = `<a href="https://colab.research.google.com/github/TakeLab/podium/blob/master/docs/source/notebooks/${pageName}.ipynb">
8 changes: 4 additions & 4 deletions docs/source/advanced.rst
@@ -448,18 +448,18 @@ The ``bucket_sort_key`` function defines how the instances in the dataset should
For Iterator, padding = 148141 out of 281696 = 52.588961149608096%
For BucketIterator, padding = 2125 out of 135680 = 1.5661851415094339%

As we can see, the difference between using a regular Iterator and a BucketIterator is massive. Not only do we reduce the amount of padding, we have reduced the total amount of tokens processed by about 50%. The SST dataset, however, is a relatively small dataset so this experiment might be a bit biased. Let's take a look at the same statistics for the :class:`podium.datasets.impl.IMDB` dataset. After changing the highligted data loading line in the first snippet to:
As we can see, the difference between using a regular Iterator and a BucketIterator is massive. Not only do we reduce the amount of padding, we have reduced the total amount of tokens processed by about 50%. The SST dataset, however, is a relatively small dataset so this experiment might be a bit biased. Let's take a look at the same statistics for the :class:`podium.datasets.impl.IMDB` dataset. After changing the data loading line in the first snippet to:

.. code-block:: rest

train, test = IMDB.get_dataset_splits(fields=fields)
>>> train, test = IMDB.get_dataset_splits(fields=fields)

And re-running the code, we obtain the following, still significant improvement:

.. code-block:: rest

For Iterator, padding = 13569936 out of 19414616 = 69.89546432440385%
For BucketIterator, padding = 259800 out of 6104480 = 4.255890755641758%
For Iterator, padding = 13569936 out of 19414616 = 69.89%
For BucketIterator, padding = 259800 out of 6104480 = 4.25%

Generally, using bucketing when iterating over your NLP dataset is preferred and will save you quite a bit of processing time.
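
To recap the setup, a minimal bucketing sketch could look roughly as follows; the import path, the ``bucket_sort_key`` argument, and attribute-style instance access are assumed from the snippets above rather than verified against the current API:

.. code-block:: python

    >>> from podium.datasets.iterator import BucketIterator  # import path assumed
    >>> # Sort by tokenized text length so each batch groups similarly long instances
    >>> def instance_length(instance):
    ...     raw, tokenized = instance.text  # (raw, tokenized) pair, attribute access assumed
    ...     return len(tokenized)
    >>> bucket_iter = BucketIterator(train, batch_size=32, bucket_sort_key=instance_length)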
