Feature/encoder decoder dq restructure #766

elboy3 · 2023-10-04T18:57:10Z

Please do not create a pull request without creating an issue first.

Changes need to be discussed before proceeding, pull requests submitted without linked issues may be rejected.

Please provide enough information so that others can review your pull request. You can skip this if you're fixing a typo – it happens.

I have added tests to tests to cover my changes.
I have updated docs/, if necessary.
I have updated the README.md, if necessary.

What existing issue does this pull request close?

Put closes #issue-number in this pull request's description to auto-close the issue that this fixes.

How are these changes tested?

This pull request includes automated tests for the code it touches and those tests are described below. If no tests are included, reasons why must be provided below.

These changes are tested with [...]

Demonstration

Demonstrate your contribution.

For example, what are the exact commands you ran and their output, related screenshots, screen-recordings, test runs, anything that can showcase.

Provide additional context.

Provide as much relevant context as you like.

elboy3 · 2023-10-06T20:37:08Z

dataquality/schemas/task_type.py

@@ -14,7 +14,8 @@ class TaskType(str, Enum):
    object_detection = "object_detection"
    semantic_segmentation = "semantic_segmentation"
    prompt_evaluation = "prompt_evaluation"
-    seq2seq = "seq2seq"
+    seq2seq = "seq2seq"  # TODO Remove


we'll need to check with rodrigo if he uses "seq2seq" at all on the UI as well. and we'll have to make some api / runners / rungalileo changes too.

depending on how much seq2seq is hard coded elsewhere, we should either just rename it to encoder decoder OR use a new task type 10 as encoder decoder and just kinda deprecate 8 seq2seq

Okay! Yeah would love some guidance here.

elboy3 · 2023-10-06T20:38:00Z

dataquality/loggers/model_logger/seq2seq/seq2seq.py

@@ -58,8 +58,12 @@ def token_map_key(self) -> str:
            return self.inference_name
        return str(self.split)

+    @abstractmethod


i don't think you need this decorator since it's mainly for the base class (definition in BaseGalileoModelLogger)

elboy3 · 2023-10-06T20:46:08Z

dataquality/loggers/model_logger/seq2seq/encoder_decoder.py

+class EncoderDecoderModelLogger(Seq2SeqModelLogger):
+    # TODO Add in API so we can use encoder_decoder
+    # __logger_name__ = "encoder_decoder"
+    __logger_name__ = "seq2seq"


yup, we'll have to change this to encoder_decoder

elboy3 · 2023-10-06T20:46:45Z

dataquality/loggers/model_logger/__init__.py

    tabular_classification,
    text_classification,
    text_multi_label,
    text_ner,
 )
 from dataquality.loggers.model_logger.base_model_logger import BaseGalileoModelLogger
+from dataquality.loggers.model_logger.seq2seq import encoder_decoder, seq2seq


we should rename the seq2seq file (not folder) to seq2seq_base

Okay love that idea!

elboy3 · 2023-10-06T20:47:35Z

dataquality/loggers/data_logger/seq2seq/seq2seq.py

@@ -295,32 +284,31 @@ def separate_dataframe(
        return BaseLoggerDataFrames(prob=prob, emb=emb, data=data_df)

    @classmethod
+    @abstractmethod


i'm pretty sure, the abstract method should only go in the base and should not be implemented, that's how i've always seen it used. lmk if you think otherwise we can search a bit deeper into recommended usage!

Pasting in some thoughts from GPT-4

On the topic of can you have partial implementations within an abstractmethod here are its thoughts:

GPT-4 talking here

Yes, you can provide a partial implementation in an abstract method and still require subclasses to override it. This can be useful when there's some shared logic that you want to include in the base class, but you still want to ensure that subclasses provide their own specific implementations.

Here's how you can do this:

Define the abstract method in the base class with the @abstractmethod decorator.

Within the abstract method, provide the shared logic.

In the subclasses, you can use the super() function to call the base class's implementation before or after adding the subclass-specific logic.
Here's an example to illustrate:

    from abc import ABC, abstractmethod

    class Animal(ABC):
        @abstractmethod
        def speak(self):
            print("The animal makes a sound...")

    class Dog(Animal):
        def speak(self):
            super().speak()  # Call the base class's implementation
            print("Woof!")

    class Cat(Animal):
        def speak(self):
            super().speak()  # Call the base class's implementation
            print("Meow!")

In this example:

Animal is an abstract base class with an abstract method speak that has a partial implementation.

Both Dog and Cat subclasses override the speak method. They use super().speak() to call the base class's implementation and then add their own specific logic.

On the topic of should abstractmethod just be in the base-class here were its thoughts:

GPT-4 talking here

It's perfectly acceptable and sometimes even desirable to have abstract methods in subclasses (i.e., in intermediate layers of a class hierarchy). This practice can be useful in various scenarios:

Refined Abstraction: Sometimes, the base class is a very general abstraction, and intermediate subclasses refine this abstraction but are still too general to provide a concrete implementation for certain methods. These intermediate classes can introduce new abstract methods that are relevant to their refined abstraction.

Hierarchical Design: As you develop a class hierarchy, you may realize that certain behaviors are shared among a subset of subclasses, but not all. In such cases, it's beneficial to introduce an intermediate abstract class that encapsulates these shared behaviors as abstract methods.

Flexibility: Introducing abstract methods in intermediate classes provides flexibility. As the design evolves, you can ensure that specific subclasses adhere to certain interfaces without affecting unrelated subclasses.

Here's an example to illustrate:

    from abc import ABC, abstractmethod

    class Vehicle(ABC):
        @abstractmethod
        def move(self):
            pass

    class LandVehicle(Vehicle):
        @abstractmethod
        def start_engine(self):
            pass

    class Car(LandVehicle):
        def move(self):
            print("The car drives.")

        def start_engine(self):
            print("Car engine started.")

    class Boat(Vehicle):
        def move(self):
            print("The boat sails.")

In this design, the intermediate abstract class LandVehicle provides a refined abstraction for vehicles that move on land. By introducing the start_engine abstract method, we ensure that any concrete subclass of LandVehicle implements this behavior.

I think that what GPT-4 says makes sense to me. I particularly like the points about hierarchical design and Refined Abstraction. I think here for example, calculate_cutoffs is very specific to just Seq2Seq classes!
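For reference in this thread, here is a minimal sketch of what `@abstractmethod` actually enforces when it sits on an intermediate class. The class names and the `can_instantiate` helper are hypothetical stand-ins, not the dataquality loggers. It assumes the hierarchy derives from `abc.ABC`; without that metaclass the decorator enforces nothing.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseLogger(ABC):
    """Hypothetical top-level base with one abstract method."""

    @abstractmethod
    def validate(self) -> str:
        ...


class Seq2SeqBase(BaseLogger):
    """Hypothetical intermediate class: implements validate, adds a new abstract hook."""

    def validate(self) -> str:
        return "validated"

    # When stacking, @classmethod goes outermost and @abstractmethod innermost.
    @classmethod
    @abstractmethod
    def calculate_cutoffs(cls, values: List[int]) -> List[int]:
        ...


class EncoderDecoder(Seq2SeqBase):
    """Concrete subclass: must override every remaining abstract method."""

    @classmethod
    def calculate_cutoffs(cls, values: List[int]) -> List[int]:
        return sorted(values)


def can_instantiate(cls: type) -> bool:
    """Return False if Python refuses to instantiate cls because it is abstract."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

In this sketch `Seq2SeqBase` cannot be instantiated even though it implements `validate`, because it introduces its own abstract method. Whether that constraint is wanted here is the open question in this thread.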

elboy3 · 2023-10-06T20:49:11Z

dataquality/integrations/seq2seq/hf.py

+    if task_type == task_type.seq2seq:  # TODO Change to encoder_decoder
+        return encoder_decoder_logger_config
+
+    # TODO Change to encoder_decoder
+    raise GalileoException(
+        "Galileo's seq2seq watch method is only supported for seq2seq"
+    )


i think we can just use the get current task type helpers, since they will have already initialized the project with dq.init and we will have the task type stored in the config file

Look for other instances of where we do get_data_logger().logger_config

Okay 👌 yes this seems helpful!

elboy3

Generally looks great! We'll just need to discuss task type switch and how to best handle that without breaking things, then we're good to go

codecov-commenter · 2023-10-13T20:44:54Z

Codecov Report

Merging #766 (43e4cab) into main (8f0f7c3) will decrease coverage by 0.01%.
Report is 1 commits behind head on main.
The diff coverage is 99.03%.

@@            Coverage Diff             @@
##             main     #766      +/-   ##
==========================================
- Coverage   87.72%   87.72%   -0.01%     
==========================================
  Files         184      187       +3     
  Lines       15097    15139      +42     
==========================================
+ Hits        13244    13280      +36     
- Misses       1853     1859       +6

Files	Coverage Δ
...ity/loggers/data_logger/seq2seq/encoder_decoder.py	`100.00% <100.00%> (ø)`
...uality/loggers/data_logger/seq2seq/seq2seq_base.py	`68.21% <100.00%> (ø)`
...y/loggers/logger_config/seq2seq/encoder_decoder.py	`100.00% <100.00%> (ø)`
...lity/loggers/logger_config/seq2seq/seq2seq_base.py	`100.00% <ø> (ø)`
...ty/loggers/model_logger/seq2seq/encoder_decoder.py	`100.00% <100.00%> (ø)`
...ality/loggers/model_logger/seq2seq/seq2seq_base.py	`92.64% <100.00%> (ø)`
dataquality/schemas/task_type.py	`100.00% <100.00%> (ø)`
tests/loggers/test_seq2seq.py	`100.00% <100.00%> (ø)`
tests/utils/test_seq2seq_offset.py	`100.00% <100.00%> (ø)`
tests/utils/test_seq2seq_utils.py	`100.00% <100.00%> (ø)`
... and 1 more

... and 2 files with indirect coverage changes

elboy3 · 2023-10-17T17:15:04Z

dataquality/integrations/seq2seq/hf.py



 @check_noop
 def set_tokenizer(
    tokenizer: PreTrainedTokenizerFast,
+    logger_config: Union[EncoderDecoderLoggerConfig],


i wouldn't have logger config as a param for this cause this is technically a user facing fn and we wouldn't expect them to pass in a logger config

elboy3 · 2023-10-17T17:15:49Z

dataquality/integrations/seq2seq/hf.py

    assert isinstance(
        tokenizer, PreTrainedTokenizerFast
    ), "Tokenizer must be an instance of PreTrainedTokenizerFast"
    assert getattr(tokenizer, "is_fast", False), "Tokenizer must be a fast tokenizer"
    for attr in ["encode", "decode", "encode_plus", "padding_side"]:
        assert hasattr(tokenizer, attr), f"Tokenizer must support `{attr}`"
-    seq2seq_logger_config.tokenizer = tokenizer
+    logger_config.tokenizer = tokenizer


to get config we could call the get data logger config helper in this fn
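A rough sketch of the suggestion, assuming a registry keyed by the task type stored at init time. `LoggerConfig`, `_CONFIGS`, `_CURRENT_TASK_TYPE`, `get_current_logger_config` and `DummyTokenizer` are all hypothetical names for illustration; the real code would go through the existing data logger helpers instead.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class LoggerConfig:
    """Hypothetical stand-in for a seq2seq logger config."""

    tokenizer: Optional[Any] = None
    max_input_tokens: Optional[int] = None


# Hypothetical registry and "current task type"; in the real project these
# would come from the config written when the project is initialized.
_CONFIGS: Dict[str, LoggerConfig] = {"seq2seq": LoggerConfig()}
_CURRENT_TASK_TYPE = "seq2seq"


def get_current_logger_config() -> LoggerConfig:
    """Look up the config for the task type chosen at init time."""
    if _CURRENT_TASK_TYPE not in _CONFIGS:
        raise ValueError(f"No logger config for task type {_CURRENT_TASK_TYPE!r}")
    return _CONFIGS[_CURRENT_TASK_TYPE]


def set_tokenizer(tokenizer: Any, max_input_tokens: Optional[int] = None) -> None:
    """User-facing function: no logger_config parameter is exposed."""
    for attr in ("encode", "decode"):
        assert hasattr(tokenizer, attr), f"Tokenizer must support `{attr}`"
    config = get_current_logger_config()  # resolved internally, not passed in
    config.tokenizer = tokenizer
    config.max_input_tokens = max_input_tokens


class DummyTokenizer:
    """Minimal object with the attributes checked above, for demonstration."""

    def encode(self, text: str) -> List[int]:
        return [len(text)]

    def decode(self, ids: List[int]) -> str:
        return "x" * ids[0]
```

With this shape the caller only writes `set_tokenizer(tokenizer, max_input_tokens=...)`, and the config lookup stays an internal detail.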

elboy3 · 2023-10-17T17:16:31Z

dataquality/integrations/seq2seq/hf.py

    assert isinstance(
        model, PreTrainedModel
    ), "model must be an instance of transformers PreTrainedModel"
    assert model.can_generate(), "model must contain a `generate` method for seq2seq"

-    set_tokenizer(tokenizer, max_input_tokens, max_target_tokens)
+    set_tokenizer(tokenizer, logger_config, max_input_tokens, max_target_tokens)


elboy3 · 2023-10-17T17:17:39Z

dataquality/loggers/data_logger/seq2seq/encoder_decoder.py

+    """
+
+    # TODO Change to encoder_decoder after updating API
+    __logger_name__ = "seq2seq"  # encoder_decoder


agreed it should be encoder_decoder

elboy3 · 2023-10-17T17:18:13Z

dataquality/loggers/data_logger/seq2seq/encoder_decoder.py

+        common data type validation.
+        """
+        super().validate_and_format()
+        # TODO: question type checking does not work in super()


what do you mean? we can look into this together

elboy3 · 2023-10-17T17:19:16Z

dataquality/loggers/data_logger/seq2seq/seq2seq_base.py

@@ -96,7 +83,13 @@ def token_map_key(self) -> str:
            return self.inference_name
        return str(self.split)

+    @abstractmethod


i think we keep the fn here but just remove the abstractmethod decorator

elboy3 · 2023-10-17T17:20:29Z

dataquality/loggers/data_logger/seq2seq/seq2seq_base.py

@@ -295,32 +277,31 @@ def separate_dataframe(
        return BaseLoggerDataFrames(prob=prob, emb=emb, data=data_df)

    @classmethod
+    @abstractmethod


same, i think let's just have a base classmethod that does some logic, and if the subclasses want to call super() and do extra or want to override they can, but let's not mandate that the subclasses have to override this fn
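A small sketch of that alternative, with hypothetical class names and made-up cutoff logic (this is not the dataquality implementation): the base classmethod carries a working default, one subclass inherits it untouched, and another extends it through `super()`.

```python
from typing import List


class Seq2SeqBase:
    """Hypothetical base class with a concrete, non-abstract classmethod."""

    @classmethod
    def calculate_cutoffs(cls, offsets: List[int], max_tokens: int) -> List[int]:
        # Shared default logic: keep only offsets inside the token budget.
        return [o for o in offsets if o < max_tokens]


class EncoderDecoder(Seq2SeqBase):
    """Inherits calculate_cutoffs unchanged; no override is required."""


class DecoderOnly(Seq2SeqBase):
    """Optionally extends the base logic instead of replacing it."""

    @classmethod
    def calculate_cutoffs(cls, offsets: List[int], max_tokens: int) -> List[int]:
        kept = super().calculate_cutoffs(offsets, max_tokens)
        # Illustrative extra step only: drop the first kept offset.
        return kept[1:]
```

The trade-off against `@abstractmethod` is that nothing forces a new subclass to think about this method, so the default has to be safe for every subclass.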

elboy3 · 2023-10-17T17:21:43Z

dataquality/loggers/logger_config/seq2seq/encoder_decoder.py

+    # TODO Add comment
+    # This currently is purely a wrapper!
+    pass


Suggested change

-    # TODO Add comment
-    # This currently is purely a wrapper!
-    pass
+    """Logger config for Encoder Decoder
+
+    This logger currently has same fields as the base class
+    """

something like this is fine ^ also you don't need the "pass"
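On the "pass" point: a docstring is itself a valid class body, so a subclass that only carries a docstring needs no extra statement. A minimal sketch, where the base class name and its field are hypothetical:

```python
class Seq2SeqLoggerConfig:
    """Hypothetical base config with a single field."""

    max_input_tokens: int = 512


class EncoderDecoderLoggerConfig(Seq2SeqLoggerConfig):
    """Logger config for Encoder Decoder.

    Currently has the same fields as the base class.
    """
```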

elboy3 · 2023-10-17T17:22:33Z

dataquality/loggers/model_logger/seq2seq/encoder_decoder.py

+        logprobs = self.convert_logits_to_logprobs(self.logits)
+        (
+            self.token_logprobs,
+            self.top_logprobs,
+        ) = self.process_logprobs(
+            self.ids, logprobs  # type: ignore


this is stuff that won't happen in decoder only?

elboy3 · 2023-10-17T17:22:46Z

dataquality/loggers/model_logger/seq2seq/seq2seq_base.py

@@ -58,8 +58,12 @@ def token_map_key(self) -> str:
            return self.inference_name
        return str(self.split)

+    @abstractmethod


👋 remove!

elboy3 · 2023-10-17T17:23:56Z

dataquality/schemas/task_type.py

+            8: TaskType.seq2seq,  # TODO Remove
+            # 8: TaskType.encoder_decoder,  # TODO add on API side


i'm personally now thinking we should actually keep seq2seq and just call it deprecated, and then add 10 as encoder_decoder and 11 as decoder_only ..... it seems too hard to move everything from seq2seq to encoder_decoder across all repos
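A sketch of what that could look like, assuming the IDs proposed in this comment (10 and 11 are not confirmed API values). `TASK_TYPE_BY_ID` and `task_type_from_id` are hypothetical names, and the deprecation warning is one possible way to flag the old type, not something already in the codebase.

```python
import warnings
from enum import Enum
from typing import Dict


class TaskType(str, Enum):
    seq2seq = "seq2seq"  # kept for backwards compatibility, treated as deprecated
    encoder_decoder = "encoder_decoder"
    decoder_only = "decoder_only"


# ID 8 stays mapped to seq2seq; 10 and 11 follow the proposal above.
TASK_TYPE_BY_ID: Dict[int, TaskType] = {
    8: TaskType.seq2seq,
    10: TaskType.encoder_decoder,
    11: TaskType.decoder_only,
}


def task_type_from_id(task_type_id: int) -> TaskType:
    """Resolve an ID to a TaskType, warning when the deprecated type is used."""
    task_type = TASK_TYPE_BY_ID[task_type_id]
    if task_type is TaskType.seq2seq:
        warnings.warn(
            "TaskType.seq2seq is deprecated; use encoder_decoder or decoder_only",
            DeprecationWarning,
            stacklevel=2,
        )
    return task_type
```

Keeping ID 8 mapped means existing runs and other repos keep resolving, while new code can move to the two new types at its own pace.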

elboy3

generally looking great! let's continue to pair

elboy3 · 2023-10-17T18:14:46Z

Closing so @jonathangomesselman can finish cleaning comments and then create his own PR!

Jonathan Gomes Selman and others added 8 commits October 3, 2023 18:10

First attempt at re-architecting Seq2seq to have a seperate EncoderDe…

6e31cb3

…coder sub_class

Merge branch 'main' into feature/EncoderDecoder_DQ_Restructure

ac6fb6b

Merge branch 'main' into feature/EncoderDecoder_DQ_Restructure

2749577

merge conflicts fix

75b6fbc

Adding seq2seq subfolder, updating comments, linting and formatting

a34192d

Updated tests to be passing. Left comments for comments and potential…

3a9e7bd

… test fixes

Formatting

46ee172

fix breaking test

e0fd7b9

elboy3 commented Oct 6, 2023

Jonathan Gomes Selman added 3 commits October 13, 2023 11:12

Merge branch 'main' into feature/EncoderDecoder_DQ_Restructure

24a3a5a

Rename to seq2seq_base and updated hf watch to get config from get_da…

1cd8a70

…talogger()

formatting

7356f09

Working on changes

43e4cab

elboy3 commented Oct 17, 2023

elboy3 and others added 3 commits October 17, 2023 14:07

some thing swith jon

18ac8c1

remove import

8c113cd

Merge branch 'main' into feature/EncoderDecoder_DQ_Restructure

83ec3f5

elboy3 closed this Oct 17, 2023
