-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hotfix for Filterblocks #72
Changes from all commits
6406d5e
507e325
bada90e
d5fc59d
1af5b74
face918
7d33095
ed7d74b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,25 +11,48 @@ | |
|
||
class FilterByValueBlock(Block): | ||
def __init__( | ||
self, filter_column, filter_value, operation, convert_dtype=None, **batch_kwargs | ||
self, | ||
filter_column, | ||
filter_value, | ||
operation, | ||
valid_values, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Something that would help a lot here is adding some documentation of these parameters that explains how they're all to be used. That would, I hope, help some of the discussions we're having trying to understand the changes. Can you add a docstring to this method? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. #76 Sounds good, lets address this in a follow up There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, one thing that helps is ...
so, it's like:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. and the confusion is "why do we need the second bullet again?" right? is it that actually you really only need one or the other? or what's the example for when you need them both together? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ... or put another way, two types of filtering? could they be collapsed into one, where it's always a set, and you might just have only one element in that set? |
||
convert_dtype=None, | ||
**batch_kwargs, | ||
) -> None: | ||
super().__init__(block_name=self.__class__.__name__) | ||
self.value = filter_value | ||
self.column_name = filter_column | ||
self.operation = operation | ||
self.valid_values = valid_values | ||
self.convert_dtype = convert_dtype | ||
self.num_procs = batch_kwargs.get("num_procs", 1) | ||
|
||
def _fiter_invalid_values(self, samples): | ||
samples = samples.filter( | ||
lambda x: x[self.column_name] in self.valid_values, | ||
num_proc=self.num_procs, | ||
) | ||
return samples | ||
|
||
def _convert_dtype(self, sample): | ||
try: | ||
sample[self.column_name] = self.convert_dtype(sample[self.column_name]) | ||
except ValueError as e: | ||
logger.error( | ||
"Error converting dtype: %s, filling with None to be filtered later", e | ||
) | ||
sample[self.column_name] = None | ||
shivchander marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return sample | ||
|
||
def generate(self, samples) -> Dataset: | ||
if self.convert_dtype: | ||
samples = samples.map( | ||
lambda x: { | ||
**x, | ||
self.column_name: self.convert_dtype(x[self.column_name]), | ||
}, | ||
self._convert_dtype, | ||
num_proc=self.num_procs, | ||
) | ||
|
||
samples = self._fiter_invalid_values(samples) | ||
|
||
return samples.filter( | ||
lambda x: self.operation(x[self.column_name], self.value), | ||
num_proc=self.num_procs, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The filter currently only accepts
judgement=YES
... so adding is a no-op?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We saw that the judgement was sometimes getting populated with values besides "YES" and "NO" and that was breaking the code. Hence it was important to add a set of valid values or expected values from the evaluation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change for introducing valid_values is not shortsighted for the current scope and flows. This is useful in making the filter blocks more generic. For instance if you need to filter the outputs from a block using the
in
operation to check if the generated output belongs in a list of strings.As a general note and practice, language model output are not deterministic, restricting with these explicit valid_values makes it more generic and robust
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, if this is an "we don't need it now, but we think we'll need it later" argument - I'll refer again to the YAGNI principle :)
But my "we don't need it now" conclusion might be a misunderstanding. Here's pseudo-code showing my understanding:
Am I missing something?
(If I'm understanding correctly, I think it's important we avoid adding additional unnecessary complexity like this, because it's confusing enough already)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm with @markmc -- that pseudocode matches my understanding, and so I have the same "can you help me understand why we need it?" question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will close this PR and reintroduce the retaining valid values as its own block (since we still need it). Thanks for isolating the bug fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@russellb it's a bugfix as Aakanksha already explained. Without having a valid_values filter, the code was breaking. I don't know the specifics but clearly it's offered as a bug fix, and Aakanksha tested it independently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, @mairin, but I've looked at this long enough that that just isn't correct. I fully believe there may be a use case for it and assume further discussion will help clarify that, but at least as used in the existing pipelines within this PR, the filter part is not fixing anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shivchander
Sounds good. It looks like you can keep it one block, but just change the value the current block takes to be a set instead of a single value. Then both use cases can be served by the same block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shivchander here's what I had in mind to just augment the existing block -- 6c1f3ee
though note I've only tested via that unit test so far