-
-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add config for optional parameters in a chat message #2260
base: main
Are you sure you want to change the base?
feat: add config for optional parameters in a chat message #2260
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor nits, but overall looks like a good improvement to chat templates.
fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
Why restrict the properties passed into the chat template in the first place?
The initial code was written when it was mainly role/content. I expanded it recently for tool_calling datasets, so this is a welcome change.
Couldn't we just pass the entire message in and if the template doesn't use a particular property they just become no-ops?
This could be an alternative, but I think the current solution works fine?
assert not all( | ||
label == IGNORE_TOKEN_ID for label in turn_labels | ||
), "Expected assistant message content to be trained on" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert not all( | |
label == IGNORE_TOKEN_ID for label in turn_labels | |
), "Expected assistant message content to be trained on" | |
assert all( | |
label != IGNORE_TOKEN_ID for label in turn_labels | |
), "Expected assistant message content to be trained on" |
I think this is the condition we want?
Edit: This test may fail until you set train_on_eos: turn
.
docs/config.qmd
Outdated
# Fields that will be passed to the chat template if present in the message. (Optional[List[str]], default: None) | ||
optional_message_fields: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we say we include tool_calls
, name
... by default based on your implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to go either way. Right now we include them as hardcoded, but I'd be happy to refactor it to just make them the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the downside of making it the default is that if someone wants to add an additional field, they -- at that point -- have to remember to add back the defaults as well.
It is why my slight preference is just to pass the entire message into the chat template and let the template itself decide what it cares about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other option... Would be to parse out all of the arguments from the jinja template and pass in any/all of those. Jinja provides a pretty easy way to extract the variables from the AST (per documentation, I haven't tried it yet).
We could even warn if the template is expecting something it isn't being passed which might be a nice quality of life addition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, should this be a List[str]
(current)
or a List[Dict[str, str]]
which would allow us to map alternate field names to the field in the jinja template?
The latter more closely matches the meaning of the other args like message_field_content
, message_field_role
I just updated this to be a little more thoughtful / robust...
|
message_field_role="role", | ||
message_field_content="content", | ||
message_property_mappings={ | ||
"from": "role", | ||
"value": "content", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be better to invert the property mappings so that the key is role
and content
. Because we always know we need to expect those keys, and the mapping is the field in the dataset that we should expect to find them.
Just for the sake of completeness and being able to verify the previous tests in CI and ensure backwards compatibility, It might be good to keep these tests with message_field_role
/message_field_content
and add a new test suite for message_property_mappings, even if only for a single model architecture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These specific tests are hard to keep as they were, because the signature of the function is no longer expecting message_field_role/message_field_content
Instead we are shuttling message_field_role/message_field_content
into message_property_mappings
as part of the config validation.
I have added a test to the model to make sure they are indeed making it into the mappings.
Here's the stack trace locally of the broken test in CI currently
and the dataset configuration that is working on main but is broken in the tests chat_template: llama3
datasets:
- path: mlabonne/FineTome-100k
type: chat_template
split: train[:20%]
field_messages: conversations
message_field_role: from
message_field_content: value |
Still looking into the e2e failure (if this push didn't fix it). |
I tracked down the problem with the failing e2e test. I'm sure there are some further nits, feedback which I'm happy to incorporate. Some other thoughts below that explain some of the changes and then maybe a broader topic that we can take off of this thread if there is any interest. -- I will say that the one challenge I had is there seems to be a lack of consistency as to if the config that is getting passed around to all of the various classes/functions is supposed to be the Many of the tests are obviously just constructing a It feels like the config is the "special sauce" in some ways for the project. Instead of having to write 1000 lines of boilerplate code to get something running, just craft a fairly simple (I'm probably being a little generous here) config file and you're on your way. I know a lot of the complexity of the project lies in how to deal with that config file, but it feels like the config itself is something of a second class citizen insofar that the Maybe I'm not seeing the forest from the trees here (and Python is certainly not my bailiwick) , but I'd be interested in helping tighten things up around this if there is any interest in it. |
@@ -311,6 +311,9 @@ class KTODataset(BaseModel): | |||
revision: Optional[str] = None | |||
|
|||
|
|||
DatasetConfig = Union[SFTDataset, DPODataset, KTODataset] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you intend to incorporate this into the line far below?
datasets: Optional[conlist(Union[SFTDataset, DPODataset, KTODataset], min_length=1)] = None # type: ignore
@NJordan72 Thanks for your thoughts and insights. We agree that the docs are definitely lacking and we are on a mission this first few months of the year to improve in that area. The evolution of the project started with a basic DefaultDict with naive validation functions written against it. Only last year did we convert the validation to use pydantic, hence the sort of mismatch of using pydantic up front and then the conversion to a dict after that. We're definitely open to making the pydantic model the common language through the project, but in places there will still be a mismatch when we're using the lower level dataclasses that we're extending from HF transformers. I'd love your thoughts on how you might approach this. |
validate_config(cfg) | ||
normalize_config(cfg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a good call to add the validation in this test as it surfaces a new bug with parsing the chat_dataset. It seems we don't do this consistently across the rest of the tests, and seems like we should add this to more of them to make sure the change for the chat_dataset class doesn't affect other logic.
self._chat_template_msg_variables = self.get_chat_template_msg_variables( | ||
chat_template | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, messages_array_name
refers to the key of the List[dict]
. Should we change the signature so that field_messages
is passed to the Prompter as well? To allow passing messages_array_name=field_messages
?
if "message_field_role" in data: | ||
if ( | ||
"role" in data["message_property_mappings"] | ||
and data["message_property_mappings"]["role"] | ||
!= data["message_field_role"] | ||
): | ||
raise ValueError( | ||
f"Conflicting message role fields: message_field_role='{data['message_field_role']}' " | ||
f"conflicts with message_property_mappings.role='{data['message_property_mappings']['role']}'" | ||
) | ||
data["message_property_mappings"]["role"] = ( | ||
data["message_field_role"] or "role" | ||
) | ||
elif "role" not in data["message_property_mappings"]: | ||
data["message_property_mappings"]["role"] = "role" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After setting the property, should we drop the old fields? (Same for content below)
Description
Currently only role/content properties are passed into the chat template (with the exception of some hardcoded optional params that are model specific). This PR creates a configuration property to add additional optional parameters.
Motivation and Context
How has this been tested?