
Hotfix for Filterblocks #72

Closed · wants to merge 8 commits
7 changes: 4 additions & 3 deletions scripts/test_freeform_skills.py
@@ -3,9 +3,10 @@
from openai import OpenAI

# First Party
from src.instructlab.sdg import SDG
from src.instructlab.sdg.default_flows import SynthSkillsFlow
from src.instructlab.sdg.pipeline import Pipeline
# pylint: disable=import-error, no-name-in-module
from instructlab.sdg import SDG
from instructlab.sdg.default_flows import SynthSkillsFlow
from instructlab.sdg.pipeline import Pipeline

# for vLLM endpoints, the api_key remains "EMPTY"
openai_api_key = "EMPTY"
7 changes: 4 additions & 3 deletions scripts/test_grounded_skills.py
@@ -3,9 +3,10 @@
from openai import OpenAI

# First Party
from src.instructlab.sdg import SDG
from src.instructlab.sdg.default_flows import SynthGroundedSkillsFlow
from src.instructlab.sdg.pipeline import Pipeline
# pylint: disable=import-error, no-name-in-module
from instructlab.sdg import SDG
from instructlab.sdg.default_flows import SynthGroundedSkillsFlow
from instructlab.sdg.pipeline import Pipeline

# for vLLM endpoints, the api_key remains "EMPTY"
openai_api_key = "EMPTY"
12 changes: 5 additions & 7 deletions scripts/test_knowledge.py
@@ -1,16 +1,14 @@
# Standard
import operator

# Third Party
from datasets import Dataset
from openai import OpenAI

# First Party
from src.instructlab.sdg import SDG
from src.instructlab.sdg.default_flows import MMLUBenchFlow, SynthKnowledgeFlow
from src.instructlab.sdg.pipeline import Pipeline
# pylint: disable=import-error, no-name-in-module
from instructlab.sdg import SDG
from instructlab.sdg.default_flows import MMLUBenchFlow, SynthKnowledgeFlow
from instructlab.sdg.pipeline import Pipeline

# Please don't add your vLLM endpoint key here
# for vLLM endpoints, the api_key remains "EMPTY"
openai_api_key = "EMPTY"
openai_api_base = "Add model endpoint here"

30 changes: 19 additions & 11 deletions src/instructlab/sdg/default_flows.py
@@ -194,6 +194,7 @@ def get_flow(self) -> list:
"block_name": "filter_faithfulness",
"filter_column": "judgment",
"filter_value": "YES",
"valid_values": ["YES", "NO"],
Contributor:

The filter currently only accepts judgment=YES ... so adding this is a no-op?

Member:

We saw that the judgment column was sometimes populated with values besides "YES" and "NO", and that was breaking the code. Hence it was important to add a set of valid (expected) values from the evaluation.

Member Author:

This change introducing valid_values is not just for the current scope and flows; it is useful for making the filter blocks more generic. For instance, you may need to filter the outputs of a block using the in operation to check whether the generated output belongs to a list of strings.

As a general note and practice, language model outputs are not deterministic; restricting them to an explicit set of valid_values makes the block more generic and robust.

Contributor:

Ok, if this is a "we don't need it now, but we think we'll need it later" argument - I'll refer again to the YAGNI principle :)

But my "we don't need it now" conclusion might be a misunderstanding. Here's pseudo-code showing my understanding:

    def filter_sample(sample):
        value = sample["judgment"]
        if value not in ["YES", "NO"]:
            return False
        if value != "YES":
            return False
        return True

Am I missing something?

(If I'm understanding correctly, I think it's important we avoid adding additional unnecessary complexity like this, because it's confusing enough already)
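A runnable version of that pseudocode (a stand-alone sketch for illustration, not the actual FilterByValueBlock code) shows why the membership check is redundant for this particular configuration:

```python
import operator

# Configuration as used by the filter_faithfulness block in this PR.
VALID_VALUES = ["YES", "NO"]
FILTER_VALUE = "YES"

def filter_sample(sample):
    value = sample["judgment"]
    # Step 1: drop anything outside the expected label set.
    if value not in VALID_VALUES:
        return False
    # Step 2: keep only rows matching the filter value.
    return operator.eq(value, FILTER_VALUE)

samples = [{"judgment": "YES"}, {"judgment": "NO"}, {"judgment": "MAYBE"}]
kept = [s for s in samples if filter_sample(s)]
# Step 1 never changes the outcome here: "MAYBE" would already fail
# the equality check in step 2, so only "YES" survives either way.
print(kept)  # [{'judgment': 'YES'}]
```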

Member:

I'm with @markmc -- that pseudocode matches my understanding, and so I have the same "can you help me understand why we need it?" question

Member Author:

I will close this PR and reintroduce retaining valid values as its own block (since we still need it). Thanks for isolating the bug fix.

Member:

@russellb it's a bugfix, as Aakanksha already explained. Without a valid_values filter, the code was breaking. I don't know the specifics, but it's clearly offered as a bug fix, and Aakanksha tested it independently.

Member:

I'm sorry, @mairin, but I've looked at this long enough that that just isn't correct. I fully believe there may be a use case for it and assume further discussion will help clarify that, but at least as used in the existing pipelines within this PR, the filter part is not fixing anything.

Member:

@shivchander

I will close this PR and reintroduce retaining valid values as its own block (since we still need it).

Sounds good. It looks like you can keep it as one block, but just change the value the current block takes to be a set instead of a single value. Then both use cases can be served by the same block.

Member:

@shivchander here's what I had in mind to just augment the existing block -- 6c1f3ee

though note I've only tested via that unit test so far
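One way to read the "set instead of a single value" suggestion is the sketch below. This is an illustration of the idea only, not the contents of commit 6c1f3ee:

```python
import operator

def make_filter(filter_value, operation):
    """Build a per-value predicate for a hypothetical unified block.

    A collection filter_value is treated as membership; a scalar is
    compared via the given operation.
    """
    if isinstance(filter_value, (list, set, tuple)):
        return lambda v: v in filter_value
    return lambda v: operation(v, filter_value)

# Existing use case: keep rows whose column equals "YES".
keep_yes = make_filter("YES", operator.eq)
# New use case: keep rows whose column is any expected value.
keep_valid = make_filter({"YES", "NO"}, operator.eq)

print(keep_yes("YES"), keep_yes("MAYBE"))     # True False
print(keep_valid("NO"), keep_valid("MAYBE"))  # True False
```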

"operation": operator.eq,
"batch_kwargs": {
"num_procs": 8,
@@ -226,9 +227,10 @@ def get_flow(self) -> list:
"block_config": {
"block_name": "filter_relevancy",
"filter_column": "score",
"filter_value": 2.0,
"filter_value": 2,
"valid_values": [0, 1, 2],
"operation": operator.eq,
"convert_dtype": float,
"convert_dtype": int,
"batch_kwargs": {
"num_procs": 8,
},
@@ -260,7 +262,8 @@ def get_flow(self) -> list:
"block_config": {
"block_name": "filter_verify_question",
"filter_column": "rating",
"filter_value": 1.0,
"filter_value": 1,
"valid_values": [0, 1],
"operation": operator.eq,
"convert_dtype": float,
"batch_kwargs": {
@@ -312,9 +315,10 @@ def get_flow(self) -> list:
"block_config": {
"block_name": "filter_questions",
"filter_column": "score",
"filter_value": 1.0,
"filter_value": 1,
"valid_values": [0, 1],
"operation": operator.eq,
"convert_dtype": float,
"convert_dtype": int,
"batch_kwargs": {
"num_procs": 8,
},
@@ -356,9 +360,10 @@ def get_flow(self) -> list:
"block_config": {
"block_name": "filter_qa_pair",
"filter_column": "score",
"filter_value": 2.0,
"filter_value": 2,
"valid_values": [1, 2, 3],
"operation": operator.ge,
"convert_dtype": float,
"convert_dtype": int,
"batch_kwargs": {
"num_procs": 8,
},
@@ -406,6 +411,7 @@ def get_flow(self) -> list:
"output_cols": ["question"],
"batch_kwargs": {
"num_procs": 8,
"num_samples": 3,
"batched": self.batched,
},
},
@@ -432,9 +438,10 @@ def get_flow(self) -> list:
"block_config": {
"block_name": "filter_grounded_questions",
"filter_column": "score",
"filter_value": 1.0,
"filter_value": 1,
"valid_values": [0, 1],
"operation": operator.eq,
"convert_dtype": float,
"convert_dtype": int,
"batch_kwargs": {
"num_procs": 8,
},
@@ -476,9 +483,10 @@ def get_flow(self) -> list:
"block_config": {
"block_name": "filter_grounded_qa_pair",
"filter_column": "score",
"filter_value": 2.0,
"filter_value": 2,
"valid_values": [1, 2, 3],
"operation": operator.ge,
"convert_dtype": float,
"convert_dtype": int,
"batch_kwargs": {
"num_procs": 8,
},
33 changes: 28 additions & 5 deletions src/instructlab/sdg/filterblock.py
@@ -11,25 +11,48 @@

class FilterByValueBlock(Block):
def __init__(
self, filter_column, filter_value, operation, convert_dtype=None, **batch_kwargs
self,
filter_column,
filter_value,
operation,
valid_values,
Member:

Something that would help a lot here is adding some documentation of these parameters that explains how they're all to be used. That would, I hope, help some of the discussions we're having trying to understand the changes. Can you add a docstring to this method?

Contributor:

Sounds good, let's address this in a follow-up: #76

Contributor (@markmc, Jul 3, 2024):

Yeah, one thing that helps is ... filter_column, filter_value, and operation go together

filter_column and valid_values go together too

so, it's like:

  • filter on filter_column,
  • first (and optionally) by checking that filter_column contains one of valid_values
  • second by comparing the value in the column matches filter_value using the given operation
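A docstring along those lines might look like the following (the wording is a suggestion for the requested documentation, not code from the repository):

```python
class FilterByValueBlock:
    def __init__(self, filter_column, filter_value, operation,
                 valid_values, convert_dtype=None, **batch_kwargs):
        """Filter a dataset on `filter_column` in two stages.

        filter_column: name of the column to filter on.
        valid_values: rows whose column value is not in this
            collection are dropped first (guards against unexpected
            model output).
        filter_value / operation: remaining rows are kept when
            operation(row[filter_column], filter_value) is truthy.
        convert_dtype: optional callable applied to the column value
            before filtering; conversion failures become None.
        batch_kwargs: batching options, e.g. num_procs.
        """
        self.column_name = filter_column
        self.value = filter_value
        self.operation = operation
        self.valid_values = valid_values
        self.convert_dtype = convert_dtype
        self.num_procs = batch_kwargs.get("num_procs", 1)
```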

Member:

and the confusion is "why do we need the second bullet again?" right?

is it that actually you really only need one or the other?

or what's the example for when you need them both together?

Member:

... or put another way, two types of filtering?

could they be collapsed into one, where it's always a set, and you might just have only one element in that set?

convert_dtype=None,
**batch_kwargs,
) -> None:
super().__init__(block_name=self.__class__.__name__)
self.value = filter_value
self.column_name = filter_column
self.operation = operation
self.valid_values = valid_values
self.convert_dtype = convert_dtype
self.num_procs = batch_kwargs.get("num_procs", 1)

def _filter_invalid_values(self, samples):
samples = samples.filter(
lambda x: x[self.column_name] in self.valid_values,
num_proc=self.num_procs,
)
return samples

def _convert_dtype(self, sample):
try:
sample[self.column_name] = self.convert_dtype(sample[self.column_name])
except ValueError as e:
logger.error(
"Error converting dtype: %s, filling with None to be filtered later", e
)
sample[self.column_name] = None
return sample

def generate(self, samples) -> Dataset:
if self.convert_dtype:
samples = samples.map(
lambda x: {
**x,
self.column_name: self.convert_dtype(x[self.column_name]),
},
self._convert_dtype,
num_proc=self.num_procs,
)

samples = self._filter_invalid_values(samples)

return samples.filter(
lambda x: self.operation(x[self.column_name], self.value),
num_proc=self.num_procs,
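The _convert_dtype change above replaces a hard failure with None-filling so that bad rows get filtered rather than crashing the pipeline. A simplified stand-in for that behavior, using plain lists instead of a Hugging Face Dataset (illustrative sketch, not repository code):

```python
def convert_or_none(value, convert_dtype):
    # Mirror the PR's _convert_dtype: conversion failures become
    # None so later filter stages drop the row instead of crashing.
    try:
        return convert_dtype(value)
    except ValueError:
        return None

raw_scores = ["2", "1", "not-a-number"]
converted = [convert_or_none(v, int) for v in raw_scores]
print(converted)  # [2, 1, None]

# The valid_values stage then drops the None row, and the
# operation stage (operator.ge against 2) keeps scores >= 2.
kept = [v for v in converted if v in [1, 2, 3] and v >= 2]
print(kept)  # [2]
```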
4 changes: 3 additions & 1 deletion src/instructlab/sdg/pipeline.py
@@ -46,7 +46,7 @@ def generate(self, dataset) -> Dataset:
block = block_type(**block_config)

logger.info("Running block: %s", block_config["block_name"])
logger.info(dataset)
logger.info("Input: %s", dataset)

dataset = block.generate(dataset, **gen_kwargs)

@@ -57,4 +57,6 @@ def generate(self, dataset) -> Dataset:
if drop_duplicates_cols:
dataset = self._drop_duplicates(dataset, cols=drop_duplicates_cols)

logger.info("Output: %s\n\n", dataset)

return dataset
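The logging changes above bracket each block's run with its input and output datasets. The overall pipeline pattern can be sketched as follows (a minimal stand-in using plain lists and callables, not the actual Pipeline class):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

def run_pipeline(dataset, blocks):
    # Each block transforms the dataset; log input and output
    # around every block, as the diff above now does.
    for block in blocks:
        logger.info("Input: %s", dataset)
        dataset = block(dataset)
        logger.info("Output: %s", dataset)
    return dataset

result = run_pipeline(
    [1, 2, 3, 4],
    [lambda d: [x * 2 for x in d],      # a "generate" step
     lambda d: [x for x in d if x > 4]] # a "filter" step
)
print(result)  # [6, 8]
```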