Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drake application yaml logic handles !!binary tag #22318

Merged

Conversation

SeanCurtis-TRI
Copy link
Contributor

@SeanCurtis-TRI SeanCurtis-TRI commented Dec 16, 2024

Adds the ability for Drake to treat the yaml !!binary tag in a consistent and coherent manner.


This change is Reviewable

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+@jwnimmer-tri Here's the promised read functionality. Ignore R1, it's still floating on top of the path/FileSource serialization. R2 has the python stuff and R3 has the C++ side.

Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @SeanCurtis-TRI)

@SeanCurtis-TRI SeanCurtis-TRI force-pushed the PR_yaml_byte_string_v2 branch 4 times, most recently from 2af36ef to 5eda807 Compare December 17, 2024 17:03
Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll post my review of the test cases now. Big picture, it seems like the C++ and Python tests lack parity -- details inline below.

For my review of the actual implementation changes, I think my feedback will make the most sense when accompanied by a patch, which I'm working on now and will share a bit later.

Reviewed 1 of 2 files at r2, 2 of 7 files at r3, 8 of 9 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: 12 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):
This PR only seems to have reading. Usually we add both reading and writing in the same PR, as a way to help cross-check that both sides of the story are mutually compatible.

I suppose if its an easier PR train we can add reading and then PR the writing half immediately after (while backfilling any glitches in reading that show up), but we don't want to delay that very long.


common/yaml/test/example_structs.h line 70 at r5 (raw file):

  }

  std::vector<std::byte> value{std::byte(0), std::byte(1), std::byte(2)};

nit This should patch the Python test value, i.e., deadbeef. (Or Python could change to match this.)


common/yaml/test/example_structs.h line 92 at r5 (raw file):

}

struct AllScalarsStruct {

nit Don't we need a std::vector<std::byte> here, both for completeness and to match the Python tests?


common/yaml/yaml_node.h line 174 at r4 (raw file):

  static constexpr std::string_view kTagStr{"tag:yaml.org,2002:str"};

  // https://yaml.org/spec/1.2.2/#generic-string

nit Wrong documentation link


bindings/pydrake/common/test/yaml_typed_test.py line 38 at r5 (raw file):

@dc.dataclass
class IntStruct:

See the comment about 10 lines up.

We need IntStruct on the C++ side now, too.


bindings/pydrake/common/test/yaml_typed_test.py line 321 at r5 (raw file):

        for value,  error_msg in cases:
            data = f"value: !!binary {value}"
            with self.assertRaisesRegex(yaml.constructor.ConstructorError,

nit This is overly coupling our test case to the implementation. We don't care about the type, we only care that the message is sensible.

(Therefore also remember to remove the import yaml.)

Suggestion:

Exception

bindings/pydrake/common/test/yaml_typed_test.py line 354 at r5 (raw file):

                self.assertEqual(x.value, b'')

        # Using !!binary and assigning it to non-bytes should throw.

Do these test cases exist on the C++ side? I didn't immediately find them.

Our goal is to keep the Python and C++ test case inputs as close alignment as possible. It's okay to diverge in the expected outcome, but the input panel should be roughly the same, declared in the same order, with the same phrasing, data, comments, etc. As naive of a cross-language porting as possible.


bindings/pydrake/common/test/yaml_typed_test.py line 307 at r4 (raw file):

            ("!!binary |\n  A3Rlc3Rfc3RyAw==", b"\x03test_str\x03"),
            ("!!binary |\n  A3Rlc3Rf\n  c3RyAw====", b"\x03test_str\x03"),
            ("!!binary", b''),

nit Double quotes for consistency?

In general throughout this method, probably the single vs double quotes could be made more uniform. I'll note a couple more instance inline, but not try to ding all of them.

Suggestion:

b""

bindings/pydrake/common/test/yaml_typed_test.py line 318 at r4 (raw file):

            ("A3Rfc3RyAw=", "Incorrect padding"),
            ("A3Rfc*RyAw==", "Invalid base64-encoded string")]
        for value,  error_msg in cases:

nit Disallowed styleguide abbreviation "msg". (I don't think we have any sodium salt of glutamic acid in our yaml parse.)

(Also nit too much whitespace in the middle.)

Suggestion:

for value, error_regex in cases:

bindings/pydrake/common/test/yaml_typed_test.py line 346 at r4 (raw file):

            ("true", "Expected.*bytes.*bool"),
        ]
        for value, error_message in cases:

BTW

Suggestion:

error_regex

bindings/pydrake/common/test/yaml_typed_test.py line 355 at r4 (raw file):

        # Using !!binary and assigning it to non-bytes should throw.
        cases = [
            (b'.inf', FloatStruct),

BTW Prior tests cases in this method used double quotes for b"stuff" input. Might as well stay consistent here too?


tools/workspace/yaml_cpp_internal/repository.bzl line 8 at r4 (raw file):

    github_archive(
        name = name,
        # local_repository_override = "/home/seancurtis/code/yaml-cpp",

nit Stray line

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 13 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @SeanCurtis-TRI)


common/yaml/yaml_read_archive.h line 1 at r5 (raw file):

#pragma once

See the head commit here for my suggestions:

https://github.com/jwnimmer-tri/drake/commits/yaml-binary-fixups/

Big picture, we want to keep the complexity in the cc file and we must use ReportError for errors (so that we get line numbers).

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one way in writing binary strings differ between C++ and python. It's apparent in the acceptance strings in the two corresponding tests. Python always formats it as:

value: !!binary |
  bacdasldkfj

Whereas C++ prefers:

value: !!binary bacdasldkfj

Reading on both sides have been shown to accept both formats, so I'm choosing not to worry about the fact that they get formatted slightly differently on writiing.

Reviewable status: 1 unresolved discussion, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

This PR only seems to have reading. Usually we add both reading and writing in the same PR, as a way to help cross-check that both sides of the story are mutually compatible.

I suppose if its an easier PR train we can add reading and then PR the writing half immediately after (while backfilling any glitches in reading that show up), but we don't want to delay that very long.

I thought I had been clear about its state. Apparently not. I'd intentionally deferred adding writing to the PR until we were in agreement on the reading semantics.

Writing has now been added.


bindings/pydrake/common/test/yaml_typed_test.py line 318 at r4 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

nit Disallowed styleguide abbreviation "msg". (I don't think we have any sodium salt of glutamic acid in our yaml parse.)

(Also nit too much whitespace in the middle.)

But you gotta love that dash of umami...


bindings/pydrake/common/test/yaml_typed_test.py line 38 at r5 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

See the comment about 10 lines up.

We need IntStruct on the C++ side now, too.

Sadly, reviewable isn't good at providing relevant context for "comment about 10 lines up" as there are no other comments in this file.

However, the request for the IntStruct is clear.

(FTR, I hadn't bothered because of the difference between the yaml parsing logic in python vs C++. Python does all of the type conversion before we get a hold of the result, whereas C++ we do it ourselves. So, I felt the DoubleStruct provided coverage. I still believe it does. However, in the name of keeping the two tests in parallel -- in case our C++ yaml parser changes its behavior, it makes sense to provide the redundant coverage now).


bindings/pydrake/common/test/yaml_typed_test.py line 354 at r5 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Do these test cases exist on the C++ side? I didn't immediately find them.

Our goal is to keep the Python and C++ test case inputs as close alignment as possible. It's okay to diverge in the expected outcome, but the input panel should be roughly the same, declared in the same order, with the same phrasing, data, comments, etc. As naive of a cross-language porting as possible.

As alluded to above re: IntStruct, I felt what got tested here got tested there, albeit in a more compact form. However, for the reasons indicated above, I've expanded the C++ test to be a more literal translation of the python.

@SeanCurtis-TRI
Copy link
Contributor Author

Binary runs face first into writing json. I'm going to look into it.

@jwnimmer-tri
Copy link
Collaborator

I think it would be okay to leave that as future work, and have the json writer throw (maybe with a change-detector expect-throws test to guard us). The AllScalarsStruct could have a way to opt-out of the binary data (e.g., a bool template argument, or a non-serialized bool member field, which skipped it during Serialize).

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, wrong thread.

Anyway -- checkpoint on final reviews of all test code.

Reviewed 7 of 11 files at r6, all commit messages.
Reviewable status: 8 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

Binary runs face first into writing json. I'm going to look into it.

I think it would be okay to leave that as future work, and have the json writer throw (maybe with a change-detector expect-throws test to guard us). The AllScalarsStruct could have a way to opt-out of the binary data (e.g., a bool template argument, or a non-serialized bool member field, which skipped it during Serialize).


bindings/pydrake/common/test/yaml_typed_test.py line 38 at r5 (raw file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

Sadly, reviewable isn't good at providing relevant context for "comment about 10 lines up" as there are no other comments in this file.

However, the request for the IntStruct is clear.

(FTR, I hadn't bothered because of the difference between the yaml parsing logic in python vs C++. Python does all of the type conversion before we get a hold of the result, whereas C++ we do it ourselves. So, I felt the DoubleStruct provided coverage. I still believe it does. However, in the name of keeping the two tests in parallel -- in case our C++ yaml parser changes its behavior, it makes sense to provide the redundant coverage now).

Oops. I meant 10 lines up in the original file:

# To provide test coverage for all of the special cases of YAML loading, we'll
# define some dataclasses. These classes mimic
#  drake/common/yaml/test/example_structs.h
# and should be roughly kept in sync with the definitions in that file.

Basically the same idea as other threads -- that the C++ and Python tests should very strongly imitate each other. To the extent they differ at all, it should be for meaningful reasons that highlight an actual difference in behavior or capabilities between the two.

Perhaps this comment paragraph (and its twin on the C++ side) need to be expanded?


common/yaml/test/example_structs.h line 52 at r6 (raw file):

// A value used in the test data below to include a default (placeholder) value
// when initializing struct data members.
constexpr double kNominalInt = -1;

typo

Suggestion:

int

common/yaml/test/yaml_write_archive_test.cc line 74 at r6 (raw file):

TEST_F(YamlWriteArchiveTest, Bytes) {
  const auto test = [](const std::string& value, const std::string& expected) {
    const auto* data = reinterpret_cast<const std::byte*>(value.c_str());

BTW Using c_str() implies that we care about the trailing NIL, but in this case it is not relevant. Saying data() would better align with our intentions.

Suggestion:

value.data()

bindings/pydrake/common/test/yaml_typed_test.py line 319 at r6 (raw file):

            ("A3Rfc3RyAw=", "Incorrect padding"),
            ("A3Rfc*RyAw==", "Invalid base64-encoded string")]
        for value,  error_regex in cases:

nit Whitespace

Suggestion:

value, error_regex

bindings/pydrake/common/test/yaml_typed_test.py line 330 at r6 (raw file):

            ("!!str 1234", "Expected.*bytes.*str"),
            # Int.
            ("12", "Expected.*bytes.*int"),

nit Match C++ input exactly.

FYI it would also be okay by me to test both "12" and "!!int 12" as separate cases (in both languages), if you think that's better.

Suggestion:

"!!int 12"

bindings/pydrake/common/test/yaml_typed_test.py line 334 at r6 (raw file):

            # Pyyaml defect: 0o3 should be an int.
            ("0o3", "Expected.*bytes.*str"),
            # Pyyaml defect: 00:03 should be an int (value of 3).

BTW I am okay with leaving this comment alone, but as background the full answer for whether this is a "defect" depends on which version of the YAML specification pyyaml is claiming to implement. The specified regexes for inferring data types are different across yaml versions.


bindings/pydrake/common/test/yaml_typed_test.py line 337 at r6 (raw file):

            ("00:03", "Expected.*bytes.*str"),
            # Float.
            ("1234.5", "Expected.*bytes.*float"),

nit Match C++ input exactly.

Suggestion:

            ("1234.5", "Expected.*bytes.*float"),
            ("!!float 1234.5", "Expected.*bytes.*float"),

bindings/pydrake/common/test/yaml_typed_test.py line 383 at r6 (raw file):

        x = yaml_load_typed(schema=AllScalarsStruct, data=data, **options)
        self.assertEqual(x.some_bool, True)
        self.assertEqual(x.some_bytes, b'test string')

This test case does not match the C++ test case. Both should use the same yaml document as input.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: feature but still should fix some open discussions prior to assigning platform.

Reviewed 1 of 2 files at r2, 1 of 7 files at r3, 4 of 11 files at r6.
Reviewable status: 11 unresolved discussions, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):
Working

At least in Python (haven't checked C++ yet), the handling of Optional[bytes] and Union[..., bytes, ...] does not work correctly. I'm trying to play around a bit to see if I have advice on whether to try to fix it or else declare it to be out of scope. Stay tuned.


tools/workspace/yaml_cpp_internal/patches/upstream/b64_decode_failure_is_empty.patch line 6 at r6 (raw file):

  - The input has an invalid character.

However, it doesn't have the proper number of encoding characters (a multiple of

typo

Suggestion:

if the input

common/yaml/yaml_write_archive.cc line 124 at r6 (raw file):

  } else {
    emitted_tag = node_tag;
  }

Take note of the comment quoted here. It talks about "json schema" and explains how the data type will be correctly implied by the plain string because we are careful to emit text which the "json schema" regexes will notice when processing the plain string -- and therefore we elect not to emit the !!foo marker so that we don't clutter up the document.

Those conditions do not hold for !!binary -- it is not part of the json schema, and it does not have any regexs that would match against a plain scalar.

Therefore, the right way to handle binary is as a distinct else-if branch. See sample code below.

Suggestion:

  if ((node_tag == internal::Node::kTagNull) ||
      (node_tag == internal::Node::kTagBool) ||
      (node_tag == internal::Node::kTagInt) ||
      (node_tag == internal::Node::kTagFloat) ||
      (node_tag == internal::Node::kTagStr)) {
    // In most cases we don't need to emit the "JSON Schema" tags for YAML data,
    // because they are implied by default. However, YamlWriteArchive on variant
    // types sometimes marks the tag as important.
    if (node.IsTagImportant()) {
      // The `internal::Node::kTagFoo` all look like "tag:yaml.org,2002:foo".
      // We only want the "foo" part (after the second colon).
      emitted_tag = std::string("!!");
      emitted_tag.append(node_tag.substr(18));
    }
  } else if (node_tag == internal::Node::kTagBinary) {
    // Use the more compact "secondary tag" spelling.
    emitted_tag = "!!binary";
  } else {
    emitted_tag = node_tag;
  }

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 11 unresolved discussions, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Working

At least in Python (haven't checked C++ yet), the handling of Optional[bytes] and Union[..., bytes, ...] does not work correctly. I'm trying to play around a bit to see if I have advice on whether to try to fix it or else declare it to be out of scope. Stay tuned.

Here is a start on the fix (only for python). See the new head commit:

https://github.com/jwnimmer-tri/drake/commits/yaml-binary-fixups/

Do we want to try to finish this approach?

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 11 unresolved discussions, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Here is a start on the fix (only for python). See the new head commit:

https://github.com/jwnimmer-tri/drake/commits/yaml-binary-fixups/

Do we want to try to finish this approach?

(The essential idea is that in yaml_load_typed, our control flow must be solely dictated by the schema, not the yaml document. We can't validate the yaml value right at the start of the function and expect to be able to decide anything, we need to flow through the conditions that probe against the schema to decide which yaml values are allowed.)

@SeanCurtis-TRI
Copy link
Contributor Author

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

(The essential idea is that in yaml_load_typed, our control flow must be solely dictated by the schema, not the yaml document. We can't validate the yaml value right at the start of the function and expect to be able to decide anything, we need to flow through the conditions that probe against the schema to decide which yaml values are allowed.)

I took your commit and extended it for C++. I was explicitly looking at the optional to make sure it was happy when I ran out of time. Hopefully, I'll get a chance to push something new this morning, but my main priority is to get other PRs through platform as efficiently as possible.

@SeanCurtis-TRI SeanCurtis-TRI force-pushed the PR_yaml_byte_string_v2 branch 2 times, most recently from 00f1181 to 7053485 Compare December 20, 2024 18:36
Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I've got things updated. I have a third commit that adds coverage for "optional[bytes]". With all previous changes, they all simply seem to work. It's not clear to me if those additional tests (and the related infrastructure) is worthwhile.

There are various other "typed tests" in yaml_typed_test.py that I didn't explore. Should I? I glanced and they seemed to be testing a higher level of abstraction so i think I'm safe.

Anyhoooo....things seem to currently "work".

Reviewable status: 2 unresolved discussions, needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @SeanCurtis-TRI)


common/yaml/yaml_write_archive.cc line 124 at r6 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Take note of the comment quoted here. It talks about "json schema" and explains how the data type will be correctly implied by the plain string because we are careful to emit text which the "json schema" regexes will notice when processing the plain string -- and therefore we elect not to emit the !!foo marker so that we don't clutter up the document.

Those conditions do not hold for !!binary -- it is not part of the json schema, and it does not have any regexs that would match against a plain scalar.

Therefore, the right way to handle binary is as a distinct else-if branch. See sample code below.

This whole thing felt a bit murky to me.

However, as I follow that vector, it seem obvious that shoving the kBinary into JsonSchemaTag is likewise misguided. kStr is acceptable because the json schema inherits from the failsafe schema. But the !!binary tag isn't part of that.

So, it seems to me that it would be better to use the Node::SetTag() overload that simply takes the string instead of extending the enumeration in a misleading direction. (This did necessitate a change to VisitScalar().)

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me if those additional tests (and the related infrastructure) is worthwhile.

I hear you on this, but I think for parsers (and yaml type checking in particular) having a costly battery of tests pays off in the long term. We used to not have so many, and it was often painful. In this case since we've identified that JSON primitives vs YAML primitives require different logic, I think the tests will be helpful. If we ever add a second kind of YAML-only primitive, that would be a point where we might lean on !!binary to catch certain kinds of bugs instead of writing a full complement of new tests for the new YAML primitive.

In any case, I'll take a final look at the test suite from scratch, to offer any final thoughts.

Reviewed 12 of 12 files at r7, all commit messages.
Reviewable status: 3 unresolved discussions, needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @SeanCurtis-TRI)


common/yaml/yaml_write_archive.cc line 124 at r6 (raw file):

... kBinary into JsonSchemaTag is ... misguided.

Ah, good observation. I agree.

For !!binary, I am fine with calling SetTag with a std::string. However, FYI another approach would be to rename the enum to cover the wider domain (e.g., all yaml primitives), or to introduce a second enum for yaml-only primitives and overload SetTag on that new enum.

In any case, this thread is resolved.


a discussion (no related file):
Working

One more pass over all of the tests, from scratch.


common/yaml/yaml_write_archive.h line 239 at r7 (raw file):

    if constexpr (std::is_same_v<T, std::string>) {
      text = value;
      tag = internal::Node::kTagStr;

Whatever we do for !!binary tag, we shouldn't use SetTag(std::string) here for all of the json tags. It's important to keep using the compact (enum) representation in the common case, for performance.

Without writing it on paper to check, my best guess is that the tactic in the snippet below might be the smoothest? Up to you though, I'll let you play around and decide.

// Exactly one of these will be set by the if-else chain.
std::optional<JsonSchemaTag> json_tag;
std::string full_tag;
...
if (json_tag.has_value()) {
  scalar.SetTag(*json_tag);
} else {
  scalar.SetTag(std::move(full_tag));
}

common/yaml/yaml_read_archive.h line 356 at r7 (raw file):

    } else if constexpr (std::is_same_v<T, std::string>) {
      return tag == internal::Node::kTagStr;
    } else if constexpr (std::is_same_v<T, std::vector<std::byte>>) {

BTW The comment atop this is-else chain says "JSON schema tags". Per the other thread, possibly we want to amend that slightly for precision, since !!binary isn't part of the JSON schema.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ready for platform, modulo open discussions.

Reviewable status: 6 unresolved discussions, needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Working

One more pass over all of the tests, from scratch.

Done


bindings/pydrake/common/test/yaml_typed_test.py line 1056 at r7 (raw file):
nit Ditto from the discussion above:

Saying vector<byte> in the comment here is confusing, since that is not a thing in Python ...


a discussion (no related file):
Please review and pull in SeanCurtis-TRI#13 for some important changes.


common/yaml/test/yaml_read_archive_test.cc line 183 at r7 (raw file):

  // Using !!binary on a schema whose type is bytes.
  test("!!binary A3Rlc3Rfc3RyAw==", "\x03test_str\x03");
  // Note: The number of spaces is critical to producing proper formatted yaml.

This threw me off for a couple minutes -- the comment says spacing is critical, but then (at first glance) the spacing in the Python copy of these test cases is different?!

The difference is because in C++ our document is doc: value: !!binary but in Python it's just value: !!binary, so indeed the indentation should differ by two.

We should either explain that in a comment or (probably better) change the C++ implementation of LoadSingleValue to replace "\n" with "\n " when it wraps the value, so that the C++ test case call site can drop the two extra spaces and match Python (and make more sense on the page to see only the two spaces in the test case definition -- four spaces here is weird until you figure out the trick).


bindings/pydrake/common/test/yaml_typed_test.py line 541 at r7 (raw file):

    def test_read_optional_bytes(self):
        """Smoke test for compatibility for the odd scalar: vector<byte>.

nit Saying vector<byte> in the comment here is confusing, since that is not a thing in Python.

For a different phrasing, we could mention that binary is a yaml scalar but not a json scalar, so we check it specifically. Or we could mention that in C++ the vector<byte> is awkward and we're just repeating the test case here to keep the Python tests sync'd up.

@jwnimmer-tri jwnimmer-tri added the release notes: feature This pull request contains a new feature label Dec 30, 2024
@SeanCurtis-TRI SeanCurtis-TRI force-pushed the PR_yaml_byte_string_v2 branch 2 times, most recently from 5671887 to 262383a Compare January 2, 2025 17:36
Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 unresolved discussions, needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits) (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Please review and pull in SeanCurtis-TRI#13 for some important changes.

Done.


common/yaml/yaml_write_archive.h line 239 at r7 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Whatever we do for !!binary tag, we shouldn't use SetTag(std::string) here for all of the json tags. It's important to keep using the compact (enum) representation in the common case, for performance.

Without writing it on paper to check, my best guess is that the tactic in the snippet below might be the smoothest? Up to you though, I'll let you play around and decide.

// Exactly one of these will be set by the if-else chain.
std::optional<JsonSchemaTag> json_tag;
std::string full_tag;
...
if (json_tag.has_value()) {
  scalar.SetTag(*json_tag);
} else {
  scalar.SetTag(std::move(full_tag));
}

I tried that suggestion and a variant-based version. I'm presenting the variant version to solicit your thoughts. It's not an obvious win from a LOC perspective. But it does foster use of contemporary C++.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 6 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits) (waiting on @SeanCurtis-TRI)


common/yaml/yaml_write_archive.h line 239 at r7 (raw file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

I tried that suggestion and a variant-based version. I'm presenting the variant version to solicit your thoughts. It's not an obvious win from a LOC perspective. But it does foster use of contemporary C++.

I'll split the difference -- how about ditching the std::monostate, have it be just the two valid types (defaulting to JsonSchemaTag as the first type).

@@ -234,7 +234,7 @@ class YamlWriteArchive final {
     const T& value = *nvp.value();
     std::string text;
     // Tag must be set as either json enum, or full tag string.
-    std::variant<std::monostate, JsonSchemaTag, std::string> tag;
+    std::variant<JsonSchemaTag, std::string> tag;
     if constexpr (std::is_same_v<T, std::string>) {
       text = value;
       tag = JsonSchemaTag::kStr;
@@ -264,12 +264,7 @@ class YamlWriteArchive final {
     internal::Node scalar = internal::Node::MakeScalar(std::move(text));
     std::visit(
         [&scalar](auto&& arg) {
-          using ArgType = std::remove_cvref_t<decltype(arg)>;
-          if constexpr (std::is_same_v<ArgType, std::monostate>) {
-            DRAKE_UNREACHABLE();
-          } else {
-            scalar.SetTag(std::move(arg));
-          }
+          scalar.SetTag(std::move(arg));
         },
         tag);
     root_.Add(nvp.name(), std::move(scalar));

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits) (waiting on @SeanCurtis-TRI)


common/yaml/yaml_write_archive.h line 239 at r7 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

I'll split the difference -- how about ditching the std::monostate, have it be just the two valid types (defaulting to JsonSchemaTag as the first type).

@@ -234,7 +234,7 @@ class YamlWriteArchive final {
     const T& value = *nvp.value();
     std::string text;
     // Tag must be set as either json enum, or full tag string.
-    std::variant<std::monostate, JsonSchemaTag, std::string> tag;
+    std::variant<JsonSchemaTag, std::string> tag;
     if constexpr (std::is_same_v<T, std::string>) {
       text = value;
       tag = JsonSchemaTag::kStr;
@@ -264,12 +264,7 @@ class YamlWriteArchive final {
     internal::Node scalar = internal::Node::MakeScalar(std::move(text));
     std::visit(
         [&scalar](auto&& arg) {
-          using ArgType = std::remove_cvref_t<decltype(arg)>;
-          if constexpr (std::is_same_v<ArgType, std::monostate>) {
-            DRAKE_UNREACHABLE();
-          } else {
-            scalar.SetTag(std::move(arg));
-          }
+          scalar.SetTag(std::move(arg));
         },
         tag);
     root_.Add(nvp.name(), std::move(scalar));

I'd considered that. I didn't like the fact that the code would compile if a tag wasn't actually set. :( It's the whole "at least one of the values will be set" guarantees. Obviously, you don't feel the ability to detect the difference between default-initialized and assigned-to is insufficiently large to justify the extra machinery.

Admittedly, it's not like we expect an influx of new scalar types, so the odds that a scalar type gets added without assigning a tag is slim to none. So, I guess a comment and placing the responsibility on future reviewers will suffice.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits) (waiting on @SeanCurtis-TRI)


common/yaml/yaml_write_archive.h line 239 at r7 (raw file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

I'd considered that. I didn't like the fact that the code would compile if a tag wasn't actually set. :( It's the whole "at least one of the values will be set" guarantees. Obviously, you don't feel the ability to detect the difference between default-initialized and assigned-to is insufficiently large to justify the extra machinery.

Admittedly, it's not like we expect an influx of new scalar types, so the odds that a scalar type gets added without assigning a tag is slim to none. So, I guess a comment and placing the responsibility on future reviewers will suffice.

Right. That's how the old code used to have it, too: uninitialized by default. I don't think the UNREACHABLE is buying is anything -- either the test cases cover the situation where the tag was missed (in which case the test will pass/fail to catch mistakes due to bad yaml output), or they don't in which case the UNREACHABLE will at best leak onto users.

@SeanCurtis-TRI SeanCurtis-TRI force-pushed the PR_yaml_byte_string_v2 branch from 262383a to 7be1055 Compare January 2, 2025 19:01
Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: needs at least two assigned reviewers


common/yaml/yaml_write_archive.h line 239 at r7 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Right. That's how the old code used to have it, too: uninitialized by default. I don't think the UNREACHABLE is buying is anything -- either the test cases cover the situation where the tag was missed (in which case the test will pass/fail to catch mistakes due to bad yaml output), or they don't in which case the UNREACHABLE will at best leak onto users.

Fair enough.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r10, all commit messages.
Reviewable status: needs at least two assigned reviewers

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+@ggould-tri for platform review (as Thursday's and Monday's are already involved).

Reviewable status: LGTM missing from assignee ggould-tri(platform)

Copy link
Contributor

@ggould-tri ggould-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 9 files at r4, 2 of 11 files at r6, 6 of 12 files at r7, 4 of 6 files at r8, 1 of 1 files at r9, 1 of 1 files at r10, all commit messages.
Reviewable status: 1 unresolved discussion


common/yaml/test/example_structs.h line 98 at r10 (raw file):

// Sugar that copies a std::string to a std::vector<std::byte>, to make it
// easier to set or check a BytesStruct::value.

minor: You should mention the semantics around nulls here, for clarity.

Suggestion:

// Sugar that copies a std::string to a std::vector<std::byte>, to make it
// easier to set or check a BytesStruct::value.
// NOTE:  The input _may_ include internal null characters, and the
// returned vector includes those but _does not_ include any null terminator.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion


common/yaml/test/example_structs.h line 98 at r10 (raw file):

Previously, ggould-tri wrote…

minor: You should mention the semantics around nulls here, for clarity.

Commenting is good.

An additional way to reinforce that idea would be to change the argument to string_view instead of string. I'm not sure why I used string in the first place.

@SeanCurtis-TRI SeanCurtis-TRI force-pushed the PR_yaml_byte_string_v2 branch from 7be1055 to 6f83bd7 Compare January 6, 2025 20:11
Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all discussions resolved, LGTM from assignees ggould-tri(platform),jwnimmer-tri(platform)


common/yaml/test/example_structs.h line 98 at r10 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Commenting is good.

An additional way to reinforce that idea would be to change the argument to string_view instead of string. I'm not sure why I used string in the first place.

I've changed to string_view and added the note.

However, std::string doesn't have a null terminator. I think the real subtlety is how the std::string gets constructed. A string literal which contains the null byte gets terminated at that point and the resulting std::string will include all of the bytes up to that point. But once we hit this function, there is no null terminator. Or do I have the wrong mental model?

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all discussions resolved, LGTM from assignees ggould-tri(platform),jwnimmer-tri(platform)


common/yaml/test/example_structs.h line 98 at r10 (raw file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

I've changed to string_view and added the note.

However, std::string doesn't have a null terminator. I think the real subtlety is how the std::string gets constructed. A string literal which contains the null byte gets terminated at that point and the resulting std::string will include all of the bytes up to that point. But once we hit this function, there is no null terminator. Or do I have the wrong mental model?

std::string promises that if you read one past the size(), you'll find a nil byte. (Thus, the c_str() function.)

@ggould-tri
Copy link
Contributor

common/yaml/test/example_structs.h line 98 at r10 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

std::string promises that if you read one past the size(), you'll find a nil byte. (Thus, the c_str() function.)

In particular post-C++13 c_str() and data() are identical and thus both guaranteed null-terminated, restoring the C invariant that all strings are null-terminated. So it is at least potentially an open question whether the binary representation of such a string contains the null, hence the need to declare.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r11, all commit messages.
Reviewable status: :shipit: complete! all discussions resolved, LGTM from assignees ggould-tri(platform),jwnimmer-tri(platform)

@jwnimmer-tri jwnimmer-tri merged commit 3e1e18a into RobotLocomotion:master Jan 6, 2025
8 of 9 checks passed
@SeanCurtis-TRI SeanCurtis-TRI deleted the PR_yaml_byte_string_v2 branch January 6, 2025 21:26
BetsyMcPhail added a commit that referenced this pull request Jan 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release notes: feature This pull request contains a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants