Feature/encoder decoder dq restructure #766
Closed
Changes from all commits · 15 commits
6e31cb3  First attempt at re-architecting Seq2seq to have a seperate EncoderDe…
ac6fb6b  Merge branch 'main' into feature/EncoderDecoder_DQ_Restructure
2749577  Merge branch 'main' into feature/EncoderDecoder_DQ_Restructure
75b6fbc  merge conflicts fix
a34192d  (elboy3) Adding seq2seq subfolder, updating comments, linting and formatting
3a9e7bd  Updated tests to be passing. Left comments for comments and potential…
46ee172  Formatting
e0fd7b9  fix breaking test
24a3a5a  Merge branch 'main' into feature/EncoderDecoder_DQ_Restructure
1cd8a70  Rename to seq2seq_base and updated hf watch to get config from get_da…
7356f09  formatting
43e4cab  Working on changes
18ac8c1  some thing swith jon
8c113cd  (elboy3) remove import
83ec3f5  (elboy3) Merge branch 'main' into feature/EncoderDecoder_DQ_Restructure
Empty file.
dataquality/loggers/data_logger/seq2seq/encoder_decoder.py (146 additions, 0 deletions)
@@ -0,0 +1,146 @@
from typing import Optional

from vaex.dataframe import DataFrame

from dataquality.loggers.data_logger.base_data_logger import (
    MetasType,
)
from dataquality.loggers.data_logger.seq2seq.seq2seq_base import Seq2SeqDataLogger
from dataquality.loggers.logger_config.seq2seq.encoder_decoder import (
    EncoderDecoderLoggerConfig,
    encoder_decoder_logger_config,
)
from dataquality.schemas.seq2seq import Seq2SeqInputCols as C
from dataquality.utils.seq2seq.offsets import (
    align_tokens_to_character_spans,
    get_cutoff_from_saved_offsets,
    get_cutoff_from_truncated_tokenization,
)


class EncoderDecoderDataLogger(Seq2SeqDataLogger):
    """Seq2Seq data logger for EncoderDecoder models.

    Logging input data for EncoderDecoder models requires:
    1. tokenizer: This must be an instance of PreTrainedTokenizerFast from
        huggingface (e.g. T5TokenizerFast or GPT2TokenizerFast). Your tokenizer
        should have an `.is_fast` property that returns True if it's a fast
        tokenizer. The class must implement the `encode`, `decode`, and
        `encode_plus` methods.

        You can set your tokenizer via the seq2seq `watch(..., tok, ...)` function
        imported from `dataquality.integrations.seq2seq.hf`.
    2. A two-column (i.e. completion) dataset (pandas/huggingface/etc.) with string
        'text' (model <Input> / <Instruction> / <Prompt>, ...) and 'label' (model
        <Target> / <Completion> / ...) columns, plus a data sample id column.
        Ex: the BillSum dataset, with `text` as the <Input> and `summary` as the
        <Label>:
            id  text                            summary
            0   SECTION 1. LIABILITY ...        Shields a business entity ...
            1   SECTION 1. SHORT TITLE.\n\n ... Human Rights Information Act ...
            2   SECTION 1. SHORT TITLE.\n\n ... Jackie Robinson Commemorative Coin ...
            3   SECTION 1. NONRECOGNITION ...   Amends the Internal Revenue Code to ...
            4   SECTION 1. SHORT TITLE.\n\n ... Native American Energy Act - (Sec. 3...

        You can log your dataset via the `dq.log_dataset` function, passing in the
        column mapping as necessary for `text`, `label`, and `id`:
            `dq.log_dataset(ds, text="text", label="summary", id="id")`

    Putting it all together:
        import dataquality as dq
        from dataquality.integrations.seq2seq.hf import watch
        from datasets import load_dataset
        from transformers import T5TokenizerFast

        tokenizer = T5TokenizerFast.from_pretrained("t5-small")
        ds = load_dataset("billsum")
        # Add an `id` column to each dataset split, using the row index
        ds = ds.map(lambda x, idx: {"id": idx}, with_indices=True)
        dq.init("seq2seq")
        # See `watch` for additional input parameters
        watch(
            ...,
            tokenizer,
            ...,
        )
        dq.log_dataset(ds["train"], label="summary", split="train")

    NOTE: We assume that the tokenizer you provide is the same tokenizer used for
    training. This must be true in order to align inputs and outputs correctly.
    Ensure all necessary properties (like `add_eos_token`) are set before you set
    the tokenizer, so that tokenization here matches the tokenization used during
    training.

    NOTE 2: Unlike EncoderOnly models, EncoderDecoder models explicitly separate
    the processing of the <Input> and <Target> data. Therefore, we do not need any
    additional information to isolate / extract information on the <Target> data.
    """

    __logger_name__ = "encoder_decoder"
    logger_config: EncoderDecoderLoggerConfig = encoder_decoder_logger_config
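    # Serialization format per logged artifact folder (assumed reading: "emb" =
    # embeddings and "prob" = probabilities as HDF5, "data" = the logged input
    # samples as Arrow)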
    DATA_FOLDER_EXTENSION = {"emb": "hdf5", "prob": "hdf5", "data": "arrow"}

    def __init__(self, meta: Optional[MetasType] = None) -> None:
        super().__init__(meta)

    def validate_and_format(self) -> None:
        """Format Encoder-Decoder data.

        Tokenize self.labels, using the user's `max_target_tokens`. From
        the tokenized outputs, generate the corresponding token alignments
        (i.e. label_offsets and label_positions).

        Save the tokenized labels for each sample as `id_to_tokens`. This
        is essential during model logging for extracting GT token label
        information.

        Note: the parent Seq2SeqDataLogger.validate_and_format() handles
        common data type validation.
        """
        super().validate_and_format()
        # The parent class ensures the tokenizer is set
        encoded_data = self.logger_config.tokenizer(  # type: ignore
            self.labels,
            return_offsets_mapping=True,
            max_length=self.logger_config.max_target_tokens,
            truncation=True,
        )
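        # For fast tokenizers, `encoded_data` carries a per-token
        # `offset_mapping`: one (start_char, end_char) span per token into the
        # original label string, which the alignment step below consumes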
        tokenized_labels = encoded_data["input_ids"]
        aligned_data = align_tokens_to_character_spans(encoded_data["offset_mapping"])
        self.token_label_offsets = aligned_data.token_label_offsets
        self.token_label_positions = aligned_data.token_label_positions

        id_to_tokens = dict(zip(self.ids, tokenized_labels))
        self.logger_config.id_to_tokens[self.token_map_key].update(id_to_tokens)

    @classmethod
    def calculate_cutoffs(cls, df: DataFrame) -> DataFrame:
        """Calculate the cutoff index for the input and target strings.

        When using Encoder-Decoder models, the input AND target tokens are
        truncated based on the respective Encoder (input) / Decoder (target)
        max_lengths, or on user-specified max_lengths (note: these may differ
        between the Encoder and the Decoder).

        The model only "sees"/processes the tokens that remain after truncation;
        for example, if max_length=512 for the Encoder, the model will only
        process the first 512 tokens of the Input and ignore the rest, no matter
        how long the Input is.

        This function adds two columns to the df:
            - 'input_cutoff': the position of the last character in the input.
            - 'target_cutoff': the position of the last character in the target.
        """
        # Error checking
        super().calculate_cutoffs(df)

        # TODO we may be able to take advantage of shared code with Decoder
        tokenizer = cls.logger_config.tokenizer
        max_input_length = cls.logger_config.max_input_tokens
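        # Re-tokenize the raw input text with truncation to find, per row, the
        # position of the last input character the model actually sees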
        df[C.input_cutoff.value] = get_cutoff_from_truncated_tokenization(
            df, C.text, tokenizer, max_input_length
        )

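        # Token offsets for the targets were already computed and saved during
        # validate_and_format, so the target cutoff can be read from them
        # without re-tokenizing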
        target_offsets_colname = C.token_label_offsets
        if target_offsets_colname in df.get_column_names():
            df[C.target_cutoff.value] = get_cutoff_from_saved_offsets(
                df, target_offsets_colname
            )

        return df
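
To make the offset and cutoff machinery above concrete, here is a minimal standalone sketch. It is an illustration under assumptions, not part of the PR: it requires `transformers` and a downloadable `t5-small` checkpoint, and it uses plain Python in place of the library helpers `align_tokens_to_character_spans` and `get_cutoff_from_truncated_tokenization`:

    from transformers import T5TokenizerFast

    tokenizer = T5TokenizerFast.from_pretrained("t5-small")
    text = "Shields a business entity from liability under state law."

    # Tokenize the way validate_and_format does: truncate and keep character offsets
    encoded = tokenizer(
        text,
        return_offsets_mapping=True,
        max_length=8,
        truncation=True,
    )

    # Each token maps to a (start, end) character span in `text`; special tokens
    # (e.g. T5's trailing </s>) map to (0, 0)
    for token_id, (start, end) in zip(encoded["input_ids"], encoded["offset_mapping"]):
        print(token_id, repr(text[start:end]))

    # The input "cutoff" is the furthest character covered by any surviving token,
    # i.e. the last character of `text` the truncated model input still includes
    input_cutoff = max(end for _, end in encoded["offset_mapping"])
    print(text[:input_cutoff])

The logger does the equivalent work per row of the DataFrame, saving the spans so that model logging can later map token-level information back to character positions in the original text.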
Review comment: to get the config, we could call the get data logger config helper in this fn.
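
For context, a rough sketch of what that suggestion might look like inside `watch`. Everything here is hypothetical: the helper's exact name is truncated in commit 1cd8a70's message ("get_da…"), so `get_data_logger_config` below is assumed for illustration and is not the confirmed dataquality API:

    from transformers import PreTrainedTokenizerFast

    from dataquality.loggers.data_logger.seq2seq.encoder_decoder import (
        EncoderDecoderDataLogger,
    )


    def get_data_logger_config():
        # Hypothetical helper (name assumed): look up the active data logger and
        # return its config, so `watch` need not import a specific config object
        return EncoderDecoderDataLogger.logger_config


    def watch(tokenizer: PreTrainedTokenizerFast) -> None:
        # Sketch of the suggested change: fetch the config via the helper, then
        # attach the user's tokenizer to it
        config = get_data_logger_config()
        config.tokenizer = tokenizer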