v0.2.0
Summary
- Documentation live on keras.io.
- Added two tokenizers: `ByteTokenizer` and `UnicodeCharacterTokenizer`.
- Added a `Perplexity` metric.
- Added three layers: `TokenAndPositionEmbedding`, `MLMMaskGenerator` and `MLMHead`.
- Contributing guides and roadmap.
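For context on the new `Perplexity` metric: perplexity is the standard language-model measure, defined as the exponential of the mean negative log-likelihood of the ground-truth tokens. This is a minimal pure-Python sketch of that math, not the KerasNLP API (the released metric is a `keras.metrics.Metric` that operates on logits or probabilities):

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence, given the probability the model
    assigned to each ground-truth token: exp of the mean negative
    log-likelihood. Lower is better; 1.0 means a perfect model."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns probability 0.25 to every correct token is
# as "confused" as a uniform 4-way guess, so its perplexity is ~4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

The same definition underlies the metric regardless of framework; only the reduction over batches and masking of padding tokens differ.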
What's Changed
- Add Byte Tokenizer by @abheesht17 in #80
- Fixing rank 1 outputs for WordPieceTokenizer by @aflah02 in #92
- Add tokenizer accessors to the base class by @mattdangerw in #89
- Fix word piece attributes by @mattdangerw in #97
- Small fix: change assertEquals to assertEqual by @chenmoneygithub in #103
- Added a Learning Rate Schedule for the BERT Example by @Stealth-py in #96
- Add Perplexity Metric by @abheesht17 in #68
- Use the black profile for isort by @mattdangerw in #117
- Update README with release information by @mattdangerw in #118
- Add a class to generate LM masks by @chenmoneygithub in #61
- Add docstring testing by @mattdangerw in #116
- Fix broken docstring in MLMMaskGenerator by @chenmoneygithub in #121
- Adding a UnicodeCharacterTokenizer by @aflah02 in #100
- Added Class by @adhadse in #91
- Fix bert example so it is runnable by @mattdangerw in #123
- Fix the issue that MLMMaskGenerator does not work in graph mode by @chenmoneygithub in #131
- Actually use layer norm epsilon in encoder/decoder by @mattdangerw in #133
- Whitelisted formatting and lint check targets by @adhadse in #126
- Updated CONTRIBUTING.md for setup of venv and standard pip install by @adhadse in #127
- Fix mask propagation of transformer layers by @chenmoneygithub in #139
- Fix masking for TokenAndPositionEmbedding by @mattdangerw in #140
- Fixed no oov token error in vocab for WordPieceTokenizer by @adhadse in #136
- Add a MLMHead layer by @mattdangerw in #132
- Bump version for 0.2.0 dev release by @mattdangerw in #142
- Added WSL setup text to CONTRIBUTING.md by @adhadse in #144
- Add attribution for the BERT modeling code by @mattdangerw in #151
- Remove preprocessing subdir by @mattdangerw in #150
- Word piece arg change by @mattdangerw in #148
- Rename max_length to sequence_length by @mattdangerw in #149
- Don't accept a string dtype for unicode tokenizer by @mattdangerw in #147
- Adding Utility to Detokenize as list of Strings to Tokenizer Base Class by @aflah02 in #124
- Fixed Import Error by @aflah02 in #161
- Added KerasTuner Hyper-Parameter Search for the BERT fine-tuning script. by @Stealth-py in #143
- Docstring updates for upcoming doc publish by @mattdangerw in #146
- version bump for 0.2.0.dev2 pre-release by @mattdangerw in #165
- Added a vocabulary_size argument to UnicodeCharacterTokenizer by @aflah02 in #163
- Simplified utility to preview a tfrecord by @mattdangerw in #168
- Update BERT example's README with data downloading instructions by @chenmoneygithub in #169
- Add a call to repeat during pretraining by @mattdangerw in #172
- Add an integration test matching our quick start by @mattdangerw in #162
- Modify README of bert example by @chenmoneygithub in #174
- Fix the finetuning script's loss and metric config by @chenmoneygithub in #176
- Minor improvements to the position embedding docs by @mattdangerw in #180
- Update docs for upcoming 0.2.0 release by @mattdangerw in #158
- Restore accidentally deleted line from README by @mattdangerw in #185
- Bump version for 0.2.0 release by @mattdangerw in #186
- Pre release fix by @mattdangerw in #187
New Contributors
- @Stealth-py made their first contribution in #96
- @adhadse made their first contribution in #91
Full Changelog: v0.1.1...v0.2.0