
feat: add config for optional parameters in a chat message #2260

Open · wants to merge 19 commits into base: main

Conversation

NJordan72
Contributor

Description

Currently only role/content properties are passed into the chat template (with the exception of some hardcoded optional params that are model specific). This PR creates a configuration property to add additional optional parameters.

Motivation and Context

  • Chat templates are not restricted to using role/content properties and we should allow users, especially those creating their own custom chat templates to pass in anything via the message.
  • OPEN QUESTION: Why restrict the properties passed into the chat template in the first place? Couldn't we just pass the entire message in and if the template doesn't use a particular property they just become no-ops?
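
For illustration, a hypothetical dataset config using such an option might look like this (the option name matches this PR's initial proposal; the dataset path and field list are placeholders, not defaults):

```yaml
# Hypothetical sketch: extra message fields to forward to the chat template.
datasets:
  - path: my/custom-dataset
    type: chat_template
    optional_message_fields:
      - tool_calls
      - name
```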

How has this been tested?

  • Added unit test

Collaborator

@winglian winglian left a comment

A few minor nits, but overall looks like a good improvement to chat templates.

@winglian winglian requested a review from NanoCode012 January 14, 2025 16:30
@NJordan72 NJordan72 marked this pull request as ready for review January 14, 2025 19:58
@NJordan72
Contributor Author

A few minor nits, but overall looks like a good improvement to chat templates.

fixed.

Collaborator

@NanoCode012 NanoCode012 left a comment

Thanks for the PR.

Why restrict the properties passed into the chat template in the first place?

The initial code was written when it was mainly role/content. I expanded it recently for tool_calling datasets, so this is a welcome change.

Couldn't we just pass the entire message in and if the template doesn't use a particular property they just become no-ops?

This could be an alternative, but I think the current solution works fine?

docs/dataset-formats/conversation.qmd (resolved)
Comment on lines 989 to 991
assert not all(
    label == IGNORE_TOKEN_ID for label in turn_labels
), "Expected assistant message content to be trained on"
Collaborator

@NanoCode012 NanoCode012 Jan 15, 2025

Suggested change
- assert not all(
-     label == IGNORE_TOKEN_ID for label in turn_labels
- ), "Expected assistant message content to be trained on"
+ assert all(
+     label != IGNORE_TOKEN_ID for label in turn_labels
+ ), "Expected assistant message content to be trained on"

I think this is the condition we want?

Edit: This test may fail until you set train_on_eos: turn.
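
To make the distinction concrete, here is a small standalone sketch (the label lists are made up for illustration) contrasting the two assertions:

```python
# Illustrative only: contrasts the original and suggested assertions.
IGNORE_TOKEN_ID = -100

partially_masked = [-100, 42, 43]  # only some labels are trained on
fully_trained = [42, 43, 44]       # every label is trained on

# Original check: passes as long as at least one label is trained on.
assert not all(label == IGNORE_TOKEN_ID for label in partially_masked)

# Suggested check: stricter, requires every label to be trained on,
# so it would fail for the partially masked turn.
assert all(label != IGNORE_TOKEN_ID for label in fully_trained)
assert not all(label != IGNORE_TOKEN_ID for label in partially_masked)
```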

docs/config.qmd (outdated)
Comment on lines 140 to 141
# Fields that will be passed to the chat template if present in the message. (Optional[List[str]], default: None)
optional_message_fields:
Collaborator

Should we say we include tool_calls, name... by default based on your implementation?

Contributor Author

Happy to go either way. Right now we include them as hardcoded, but I'd be happy to refactor it to just make them the default.

Contributor Author

I guess the downside of making it the default is that if someone wants to add an additional field, they -- at that point -- have to remember to add back the defaults as well.

It is why my slight preference is just to pass the entire message into the chat template and let the template itself decide what it cares about.

Contributor Author

The other option would be to parse out all of the arguments from the jinja template and pass in any/all of those. Jinja provides a pretty easy way to extract the variables from the AST (per the documentation; I haven't tried it yet).

We could even warn if the template is expecting something it isn't being passed which might be a nice quality of life addition.
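
This appears feasible with Jinja's built-in introspection. A minimal sketch, assuming `jinja2` is installed (the template fragment below is hypothetical, not an actual chat template):

```python
from jinja2 import Environment, meta, nodes

env = Environment()

# Hypothetical chat template fragment reading several message properties.
template_source = (
    "{% for message in messages %}"
    "{{ message.role }}: {{ message.content }} {{ message.tool_call_id }}"
    "{% endfor %}"
)

ast = env.parse(template_source)

# Top-level variables the template expects to be passed in.
print(meta.find_undeclared_variables(ast))  # {'messages'}

# Attribute accesses like `message.role` appear as Getattr nodes, so the
# per-message properties can be collected by walking the parsed AST.
props = {
    node.attr
    for node in ast.find_all(nodes.Getattr)
    if isinstance(node.node, nodes.Name) and node.node.name == "message"
}
print(sorted(props))  # ['content', 'role', 'tool_call_id']
```

Comparing `props` against the keys actually present in the messages would give the "template expects something it isn't being passed" warning mentioned above.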

Collaborator

Also, should this be a List[str] (current), or a List[Dict[str, str]], which would allow us to map alternate field names to the field in the jinja template?

The latter more closely matches the meaning of the other args like message_field_content and message_field_role.

@NJordan72
Contributor Author

I just updated this to be a little more thoughtful / robust...

  • I removed the optional_message_fields option and replaced it with a message_property_mappings option.
  • message_property_mappings is a more generalized version of the message_field_role and message_field_content configuration options. Those continue to work and are folded into message_property_mappings if provided. The idea is that you can map any input property to any message property using this configuration.
  • I then updated chat_template.py to use the new mappings configuration property to transform the message.
  • I added a JinjaTemplateAnalyzer that introspects the jinja template to figure out which properties a template expects (this is generalized, but it can also look at the properties specifically for the messages, which is what we care about here).
  • We then remove any properties from the transformed message that the template does not need/expect.
  • I've taken out the special handling of tool_call_id, etc., as it is no longer needed and is handled by the introspection of the template itself.
  • I've added test coverage for all of the new functionality, and all of the old tests continue to pass.

Comment on lines 77 to 82
message_field_role="role",
message_field_content="content",
message_property_mappings={
    "from": "role",
    "value": "content",
},
Collaborator

I think it might be better to invert the property mappings so that the keys are role and content, because we always know to expect those keys, and the mapped value is the field in the dataset where we should expect to find them.

Just for the sake of completeness and being able to verify the previous tests in CI and ensure backwards compatibility, it might be good to keep these tests with message_field_role/message_field_content and add a new test suite for message_property_mappings, even if only for a single model architecture.
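
For illustration, a hypothetical sketch of the inverted form (the dataset path is a placeholder):

```yaml
# Hypothetical inverted mapping: keys are the canonical message
# properties, values are the dataset fields they are read from.
datasets:
  - path: some/dataset
    type: chat_template
    message_property_mappings:
      role: from
      content: value
```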

Contributor Author

on it!

Contributor Author

These specific tests are hard to keep as they were, because the signature of the function no longer expects message_field_role/message_field_content.

Instead we are shuttling message_field_role/message_field_content into message_property_mappings as part of the config validation.

I have added a test to the model to make sure they are indeed making it into the mappings.
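
That shuttling could be sketched as a plain-dict helper like the following (the function name and exact error wording are illustrative, not the PR's actual code):

```python
def merge_legacy_message_fields(data: dict) -> dict:
    """Fold legacy message_field_role/message_field_content into
    message_property_mappings, raising on conflicting values.
    Hypothetical sketch of the validation described above."""
    mappings = data.setdefault("message_property_mappings", {})
    for legacy_key, prop in (
        ("message_field_role", "role"),
        ("message_field_content", "content"),
    ):
        if legacy_key in data:
            legacy_value = data[legacy_key]
            # A legacy field that disagrees with an explicit mapping is an error.
            if prop in mappings and mappings[prop] != legacy_value:
                raise ValueError(
                    f"Conflicting {prop} fields: {legacy_key}='{legacy_value}' "
                    f"conflicts with message_property_mappings.{prop}="
                    f"'{mappings[prop]}'"
                )
            mappings[prop] = legacy_value or prop
        elif prop not in mappings:
            # Neither legacy field nor mapping given: fall back to the default.
            mappings[prop] = prop
    return data
```

For example, `merge_legacy_message_fields({"message_field_role": "from"})` would yield a config whose `message_property_mappings` is `{"role": "from", "content": "content"}`.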

@winglian
Collaborator

Here's the local stack trace for the test that's currently broken in CI:

  File "/Users/wing/Projects/ml/axolotl/src/axolotl/prompt_strategies/chat_template.py", line 275, in tokenize_prompt
    input_ids = self.prompter.build_prompt(turns)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wing/Projects/ml/axolotl/src/axolotl/prompt_strategies/chat_template.py", line 84, in build_prompt
    return self.tokenizer.apply_chat_template(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wing/.pyenv/versions/3.11.10/envs/axolotl-3.11/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1683, in apply_chat_template
    rendered_chat = compiled_template.render(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wing/.pyenv/versions/3.11.10/envs/axolotl-3.11/lib/python3.11/site-packages/jinja2/environment.py", line 1304, in render
    self.environment.handle_exception()
  File "/Users/wing/.pyenv/versions/3.11.10/envs/axolotl-3.11/lib/python3.11/site-packages/jinja2/environment.py", line 939, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 1, in top-level template code
jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'role'

And here's the dataset configuration that works on main but is broken in the tests:

chat_template: llama3
datasets:
  - path: mlabonne/FineTome-100k
    type: chat_template
    split: train[:20%]
    field_messages: conversations
    message_field_role: from
    message_field_content: value

@NJordan72
Contributor Author

Still looking into the e2e failure (if this push didn't fix it).

@NJordan72
Contributor Author

I tracked down the problem with the failing e2e test. I'm sure there are further nits and feedback, which I'm happy to incorporate. Some other thoughts below that explain some of the changes, and then a broader topic we could take off this thread if there is any interest.

--

I will say that the one challenge I had is that there seems to be a lack of consistency as to whether the config being passed around to all of the various classes/functions is supposed to be the Dict representation of the Pydantic model or the model itself.

Many of the tests construct a Dict that represents the config, which makes sense for their purposes, but it feels like it would be nicer if the Pydantic model were the lingua franca across the entire project.

The config is, in some ways, the "special sauce" of the project: instead of having to write 1000 lines of boilerplate code to get something running, you just craft a fairly simple (I'm probably being a little generous here) config file and you're on your way. I know a lot of the project's complexity lies in how it deals with that config file, but the config itself feels like something of a second-class citizen, insofar as its Dict representation is used in a bunch of places and parts of the config are under-documented and somewhat unclear.

Maybe I'm not seeing the forest for the trees here (and Python is certainly not my bailiwick), but I'd be interested in helping tighten things up around this if there is any interest.

@@ -311,6 +311,9 @@ class KTODataset(BaseModel):
revision: Optional[str] = None


DatasetConfig = Union[SFTDataset, DPODataset, KTODataset]
Collaborator

Did you intend to incorporate this into the line far below?

datasets: Optional[conlist(Union[SFTDataset, DPODataset, KTODataset], min_length=1)] = None  # type: ignore

@winglian
Collaborator

@NJordan72 Thanks for your thoughts and insights. We agree that the docs are definitely lacking, and we are on a mission these first few months of the year to improve in that area. The project started with a basic DefaultDict and naive validation functions written against it; only last year did we convert the validation to use Pydantic, hence the mismatch of using Pydantic up front and then converting to a dict after that. We're definitely open to making the Pydantic model the common language throughout the project, though in places there will still be a mismatch where we use the lower-level dataclasses we extend from HF transformers. I'd love your thoughts on how you might approach this.

Comment on lines 79 to 80
validate_config(cfg)
normalize_config(cfg)
Collaborator

This is a good call to add the validation in this test, as it surfaces a new bug with parsing the chat_dataset. It seems we don't do this consistently across the rest of the tests, and we should probably add it to more of them to make sure the change to the chat_dataset class doesn't affect other logic.

Comment on lines +47 to +49
self._chat_template_msg_variables = self.get_chat_template_msg_variables(
    chat_template
)
Collaborator

If I understand correctly, messages_array_name refers to the key of the List[dict]. Should we change the signature so that field_messages is passed to the Prompter as well, to allow passing messages_array_name=field_messages?

Comment on lines +214 to +228
if "message_field_role" in data:
    if (
        "role" in data["message_property_mappings"]
        and data["message_property_mappings"]["role"]
        != data["message_field_role"]
    ):
        raise ValueError(
            f"Conflicting message role fields: message_field_role='{data['message_field_role']}' "
            f"conflicts with message_property_mappings.role='{data['message_property_mappings']['role']}'"
        )
    data["message_property_mappings"]["role"] = (
        data["message_field_role"] or "role"
    )
elif "role" not in data["message_property_mappings"]:
    data["message_property_mappings"]["role"] = "role"
Collaborator

After setting the property, should we drop the old fields? (Same for content below)
