ICU-22834 MF2: make tests compliant with schema and update spec tests #3063

Merged: 1 commit into unicode-org:main on Sep 18, 2024

Contributor

@catamorphism catamorphism commented Jul 22, 2024

This PR changes the code in the ICU4C and ICU4J tests that reads data-driven tests in JSON format to follow the test schema in the conformance repo, as well as updating the spec tests to the version in the spec repo.

Due to the changed spec tests, some non-test code changes were necessary to get the tests to pass:

  • ICU4C: Fixed number literal parsing in Number::format (this wasn't being done according to spec before, and previously there were no tests to expose that)
  • ICU4C: Fixed parsing of .input declarations, which had assumed the annotation can't be a reserved annotation
  • ICU4C: Fixed bug where .i was a parse error rather than an unsupported keyword
  • ICU4C: Make parsing of text and escape characters consistent with spec
  • ICU4C: Allow markup with space after the initial '{'
  • ICU4C: Check for duplicate variant errors (which were recently added to the spec; ICU4J already checked for these errors)
  • ICU4J: Allow trailing whitespace after complex messages (recent spec change)
  • ICU4J: Treat whitespace after .input as optional
  • ICU4J: Don't format unannotated number literals as numbers
  • Both: Allow leading whitespace before complex messages (also a recent spec change)
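
As an illustration of the duplicate variant check mentioned in the list above, here is a minimal sketch, not the ICU4C implementation. It assumes each variant has already been reduced to its list of keys as plain strings; `KeyList`, `hasDuplicateVariant`, and `checkDuplicateVariantExamples` are hypothetical names. It ignores details a real checker would need, such as telling a quoted literal apart from the catch-all key.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical representation: one variant == its ordered list of keys.
using KeyList = std::vector<std::string>;

// Returns true if any two variants carry exactly the same key list.
inline bool hasDuplicateVariant(const std::vector<KeyList>& variants) {
    std::set<KeyList> seen;
    for (const auto& keys : variants) {
        // insert().second is false when an equal key list is already present
        if (!seen.insert(keys).second) {
            return true;
        }
    }
    return false;
}

// Usage example: the second "one" variant is a duplicate.
inline void checkDuplicateVariantExamples() {
    std::vector<KeyList> ok = {KeyList{"one"}, KeyList{"*"}};
    std::vector<KeyList> bad = {KeyList{"one"}, KeyList{"*"}, KeyList{"one"}};
    assert(!hasDuplicateVariant(ok));
    assert(hasDuplicateVariant(bad));
}
```

Using an ordered set of key lists keeps the check short; a production checker would more likely report which variant is duplicated rather than a bare boolean.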

Note that some manual changes to the spec tests are necessary; ICU4J currently returns best-effort output rather than throwing exceptions for resolution errors, so tests where a resolution error (e.g. unknown-function or unsupported-statement) is expected need to be annotated with an ignoreJava property. This can be removed once the error handling story has been resolved (see MFWG issue 782).

The more interesting changes needed to parse the test files according to the schema include:

  • Updating error names according to schema
  • Updating how test params are specified (as an array of objects with name and value properties, rather than as an object)
  • Eliminating the srcs property to represent multi-line messages, and instead, allowing src to be either a single string or array of strings (consistent with this schema change)
  • Handling default test properties
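
The two shape changes in the list above (params as an array of name/value objects, and src as either a string or an array of strings) can be sketched as follows. This is a hedged illustration rather than the actual test reader: `normalizeSrc`, `Param`, and `paramsToMap` are hypothetical names, param values are simplified to strings, and joining the array elements with no separator is an assumption about the schema, not something stated in this PR.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// src given as a single string: use it unchanged.
inline std::string normalizeSrc(const std::string& src) {
    return src;
}

// src given as an array of strings.
// Assumption: the parts are concatenated with no separator.
inline std::string normalizeSrc(const std::vector<std::string>& parts) {
    std::string joined;
    for (const auto& part : parts) {
        joined += part;
    }
    return joined;
}

// Hypothetical decoded form of one entry in the params array.
struct Param {
    std::string name;
    std::string value;  // simplified: real values need not be strings
};

// Converts the array form into a name -> value lookup table.
inline std::map<std::string, std::string> paramsToMap(const std::vector<Param>& params) {
    std::map<std::string, std::string> out;
    for (const auto& p : params) {
        out[p.name] = p.value;  // a repeated name overwrites the earlier entry
    }
    return out;
}

// Usage example covering both src forms and one param.
inline void checkTestShapeExamples() {
    assert(normalizeSrc(std::string("{{hi}}")) == "{{hi}}");
    assert(normalizeSrc(std::vector<std::string>{".local $x = {1}\n", "{{{$x}}}"}) ==
           ".local $x = {1}\n{{{$x}}}");
    std::vector<Param> params = {Param{"count", "3"}};
    assert(paramsToMap(params).at("count") == "3");
}
```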

This also made it possible to move all the test filenames into one file in the ICU4J tests (CoreTest.java) and delete the others, since tests are now handled uniformly; analogously, extra methods in the ICU4C test reader were removed.

Finally, the non-spec tests themselves needed to change; the main changes were changing the format of the params property (mentioned above) and changing errors to expErrors. In addition, one file (icu-test-selectors.json) needed to be "expanded out" because the schema doesn't support the shared and variations properties.

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22834
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@catamorphism catamorphism changed the title DRAFT: MF2: make tests compliant with schema ICU-22834 DRAFT: MF2: make tests compliant with schema Jul 22, 2024
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/messageformat2_parser.h is different
  • testdata/message2/reserved-syntax.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • testdata/message2/reserved-syntax.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java is different
  • icu4j/main/core/src/main/java/com/ibm/icu/message2/StringUtils.java is now changed in the branch
  • testdata/message2/reserved-syntax-2.json is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/config/dist.mk is no longer changed in the branch
  • icu4c/source/i18n/messageformat2_data_model.cpp is no longer changed in the branch
  • icu4c/source/i18n/messageformat2_function_registry.cpp is different
  • icu4c/source/i18n/messageformat2_parser.cpp is different
  • icu4c/source/i18n/messageformat2_parser.h is no longer changed in the branch
  • icu4c/source/i18n/messageformat2_serializer.cpp is no longer changed in the branch
  • icu4c/source/i18n/unicode/messageformat2_data_model.h is no longer changed in the branch
  • icu4c/source/test/intltest/intltest.cpp is no longer changed in the branch
  • icu4c/source/test/intltest/intltest.h is no longer changed in the branch
  • icu4c/source/test/testdata/message2/duplicate-declarations.json is no longer changed in the branch
  • icu4c/source/test/testdata/message2/icu4j/icu-parser-tests.json is no longer changed in the branch
  • icu4c/source/test/testdata/message2/invalid-options.json is no longer changed in the branch
  • icu4c/source/test/testdata/message2/markup.json is no longer changed in the branch
  • icu4c/source/test/testdata/message2/more-data-model-errors.json is no longer changed in the branch
  • icu4c/source/test/testdata/message2/README.txt is no longer changed in the branch
  • icu4c/source/test/testdata/message2/reserved-syntax.json is no longer changed in the branch
  • icu4c/source/test/testdata/message2/resolution-errors.json is no longer changed in the branch
  • icu4c/source/test/testdata/message2/runtime-errors.json is no longer changed in the branch
  • icu4c/source/test/testdata/message2/spec/data-model-errors.json is no longer changed in the branch
  • icu4c/source/test/testdata/message2/spec/syntax-errors.json is no longer changed in the branch
  • icu4c/source/test/testdata/message2/spec/test-core.json is no longer changed in the branch
  • icu4c/source/test/testdata/message2/spec/test-functions.json is no longer changed in the branch
  • icu4c/source/test/testdata/message2/tricky-declarations.json is no longer changed in the branch
  • icu4c/source/test/testdata/message2/valid-tests.json is no longer changed in the branch
  • icu4j/main/common_tests/src/test/java/com/ibm/icu/dev/test/message2/Mf2FeaturesTest.java is no longer changed in the branch
  • icu4j/main/core/src/main/java/com/ibm/icu/message2/InputSource.java is no longer changed in the branch
  • icu4j/main/core/src/main/java/com/ibm/icu/message2/MFDataModelFormatter.java is no longer changed in the branch
  • icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java is different
  • icu4j/main/core/src/main/java/com/ibm/icu/message2/StringUtils.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/CoreTest.java is different
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/CustomFormatterPersonTest.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/DataModelErrorsTest.java is different
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/FunctionsTest.java is different
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/IcuFunctionsTest.java is different
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/MessageFormat2Test.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/SerializationTest.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/SyntaxErrorsTest.java is different
  • icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/data-model-errors.json is no longer changed in the branch
  • icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/icu-parser-tests.json is no longer changed in the branch
  • icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/icu-test-previous-release.json is no longer changed in the branch
  • icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/icu-test-selectors.json is no longer changed in the branch
  • icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/syntax-errors.json is no longer changed in the branch
  • icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/test-core.json is no longer changed in the branch
  • icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/test-functions.json is no longer changed in the branch
  • icu4j/pom.xml is no longer changed in the branch
  • icu4j/releases_tools/github_release.sh is no longer changed in the branch
  • testdata/message2/icu-test-functions.json is different
  • testdata/message2/more-syntax-errors.json is now changed in the branch
  • testdata/message2/reserved-syntax-2.json is no longer changed in the branch
  • testdata/message2/reserved-syntax.json is different
  • testdata/message2/spec/test-functions.json is now changed in the branch
  • testdata/message2/syntax-errors-diagnostics.json is different
  • testdata/message2/valid-tests.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism catamorphism changed the title ICU-22834 DRAFT: MF2: make tests compliant with schema ICU-22834 DRAFT: MF2: make tests compliant with schema and update spec tests Aug 8, 2024
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/Sources.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/CoreTest.java is different
  • testdata/message2/icu-test-functions-multiline.json is no longer changed in the branch
  • testdata/message2/icu-test-functions.json is different
  • testdata/message2/more-functions-multiline.json is no longer changed in the branch
  • testdata/message2/more-functions.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism catamorphism changed the title ICU-22834 DRAFT: MF2: make tests compliant with schema and update spec tests ICU-22834 MF2: make tests compliant with schema and update spec tests Aug 8, 2024
@catamorphism catamorphism marked this pull request as ready for review August 8, 2024 21:41
@catamorphism
Contributor Author

I can't request reviews, but: @markusicu @srl295 @echeran - and @mihnita might want to take a look at the Java changes.

@catamorphism
Contributor Author

I added another commit to this PR to update the spec tests again, which meant I had to pull in some more commits that were initially in #3092 (draft PR) in order to make the tests pass.

@catamorphism
Contributor Author

Also, I'll be on vacation from now until September 9, so I won't see any review comments until I return.

@markusicu markusicu requested a review from mihnita August 15, 2024 16:22
@srl295 srl295 self-requested a review August 16, 2024 15:00
Comment on lines 459 to 468
default: {
// Should be unreachable
return 0;
Member

Suggested change
- default: {
- // Should be unreachable
- return 0;
+ default: {
+ // Should be unreachable
+ U_ASSERT(isDigit(c));
+ return 0;

should this handle other numbering systems though?

Member

and not to bikeshed on this part but there is ufmt_digitvalue() and you could ignore values <0 or ≥10 (it parses hex)

Contributor Author

> should this handle other numbering systems though?

No, because it's operating on a Number Operand as defined in the function registry spec; it's not a localized number.

> and not to bikeshed on this part but there is ufmt_digitvalue() and you could ignore values <0 or ≥10 (it parses hex)

I tried just now to use that, but I realized that ufmt_digitvalue() is part of the io library and io depends on i18n, so that would be a circular dependency. I guess it could be moved into i18n, but I'm hesitant to do a refactor like that just to avoid duplicating a relatively small function. What do you think?

Member

I didn't realize it was in io, good call. fine as is then.
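
For readers following this thread, a minimal sketch of the kind of small local helper being discussed is shown below. It is not the code under review: `asciiDigitValue` and `checkDigitValueExamples` are hypothetical names. It accepts only the ASCII digits 0 through 9, which matches the point made above that the operand is not a localized number, and it returns -1 for everything else so the caller can decide how to report the error.

```cpp
#include <cassert>

// Hypothetical helper: value of an ASCII digit, or -1 for any other code unit.
// Digits from other numbering systems are deliberately rejected.
inline int asciiDigitValue(char16_t c) {
    return (c >= u'0' && c <= u'9') ? static_cast<int>(c - u'0') : -1;
}

// Usage example.
inline void checkDigitValueExamples() {
    assert(asciiDigitValue(u'7') == 7);
    assert(asciiDigitValue(u'x') == -1);
    assert(asciiDigitValue(u'\u0663') == -1);  // ARABIC-INDIC DIGIT THREE
}
```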

srl295
srl295 previously approved these changes Aug 24, 2024
icu4c/source/i18n/messageformat2_macros.h (outdated, resolved)
icu4c/source/i18n/messageformat2_parser.cpp (outdated, resolved)
@srl295
Member

srl295 commented Aug 24, 2024

As a follow-on, I'd move the number parsing into ufmt_cmn.cpp.

Member

@srl295 srl295 left a comment

I think we should be more cautious about the defines in the public header, otherwise looking good.

@mihnita
Contributor

mihnita commented Sep 10, 2024

There is an email thread on how to organize the test data. The idea is to have some kind of shared folder in the root of the repo, with separate sub-folders for ICU-only test data and test data from CLDR.

That is because some of the JSON files here are from CLDR "as is" (or at least they should be), and some are ICU-only, shared between C++ and Java, but not from CLDR.

The CLDR test files are copied to ICU by an ant script.


If you don't know what thread I'm talking about (with vacation and all it is understandable if it was lost), tell me and I'll ping you by email with a link.

Contributor

@mihnita mihnita left a comment

Thank you.

Am I wrong, or was there another PR that was kind of similar, but not really?
One that I reviewed.
I can't find it now.

@catamorphism
Contributor Author

catamorphism commented Sep 11, 2024

@mihnita:

> There is an email thread on how to organize the test data. Where the idea is to have some kind of shared folder in the root of the repo, with separate sub-folders for testdata ICU only and testdata from cldr.

Do you think it would be OK to handle that in a separate PR?

@catamorphism
Contributor Author

@mihnita:

> Am I wrong, or was there another PR that was kind of similar, but not really?

Yes, there's a draft PR that I intend as a follow-up to this one: #3092

@catamorphism
Contributor Author

Rebasing now, since I think all the line comments have been addressed.

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/common/unicode/char16ptr.h is no longer changed in the branch
  • icu4c/source/common/unicode/platform.h is no longer changed in the branch
  • icu4c/source/common/unicode/unistr.h is no longer changed in the branch
  • icu4c/source/common/uniset_props.cpp is no longer changed in the branch
  • icu4c/source/common/unistr.cpp is no longer changed in the branch
  • icu4c/source/i18n/messageformat2_function_registry.cpp is different
  • icu4c/source/i18n/number_decimalquantity.cpp is no longer changed in the branch
  • icu4c/source/test/intltest/ustrtest.cpp is no longer changed in the branch
  • icu4c/source/test/intltest/ustrtest.h is no longer changed in the branch
  • icu4j/main/core/src/main/java/com/ibm/icu/message2/MFDataModelFormatter.java is different
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/IcuFunctionsTest.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

srl295
srl295 previously approved these changes Sep 11, 2024
Member

@srl295 srl295 left a comment

Update LGTM

@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism
Contributor Author

@echeran It looks like the "enforce-all-checks" check failed, but that was the only check that failed. I think this is the same issue described in the icu-team email thread from yesterday?

@srl295
Member

srl295 commented Sep 12, 2024

> @echeran It looks like the "enforce-all-checks" check failed, but that was the only check that failed. I think this is the same issue described in the icu-team email thread from yesterday?

retriggered, clean now. The ironic thing is that I was wondering if CLDR needs one of the same enforce-all-checks but now i'm not so sure!

mihnita
mihnita previously approved these changes Sep 16, 2024
Contributor

@mihnita mihnita left a comment

Thank you.

Contributor

@echeran echeran left a comment

Looks good except I have a couple of questions / suggestions. Otherwise, I'm happy to rubber stamp since @mihnita is on board.

icu4c/source/i18n/messageformat2_function_registry.cpp (outdated, resolved)
icu4c/source/i18n/messageformat2_function_registry.cpp (outdated, resolved)
@catamorphism catamorphism dismissed stale reviews from mihnita and srl295 via 2c18c92 September 16, 2024 21:31
return result;
}

static double parseNumberLiteral(const FormattedPlaceholder& input, UErrorCode& errorCode) {
Contributor

Suggestions from @sffc

  1. Use function from double-conversion.h
  2. Use function from number_decimalquantity.h
  3. Use function from util.h called parseNumber to replace the helper methods that parse the integer and fractional portions.

Either option 1 or 2 should be sufficient to do what we need here. Option 1 gives a double, option 2 gives a DecimalQuantity. At least option 1 should be able to handle sign, decimal point, and exponent, and probably option 2. Option 3 is helpful but doesn't replace as much code as the other 2 options.

Contributor Author

@catamorphism catamorphism Sep 17, 2024

I tried both (1) and (2), and the problem is that there are some strings that both functions accept, which are not valid number literals according to the MF2 grammar: for example, 01 (leading zero with no decimal point), +1 (leading '+'), .1 (decimal point with no leading 0), and 1. (trailing decimal point).

Since these cases are relatively simple to check for, I just added a check before calling into StringToDouble (option 1) - see 176a232.

I didn't like (3) since calling parseNumber to parse the decimal part still means having to do math to add the two parts together, so it doesn't seem to simplify the code that much.

Let me know what you think.

Contributor

Looks great! Option 1 seemed the most likely & directly applicable.
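
To make the pre-check described in this thread concrete, here is a minimal sketch of a validator for the MF2 number-literal shape. It is not the code from 176a232: `isNumberLiteral` and `checkNumberLiteralExamples` are hypothetical names, and the grammar summarized in the comment is an assumption based on the MF2 syntax as understood here. A string that passes this check could then be handed to a more permissive parser such as StringToDouble.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if `s` has the shape
//   ["-"] ("0" / nonzero-digit digits*) ["." digits+] [("e"/"E") ["-"/"+"] digits+]
inline bool isNumberLiteral(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    auto isDigitAt = [&](std::size_t k) { return k < n && s[k] >= '0' && s[k] <= '9'; };

    if (i < n && s[i] == '-') ++i;       // optional minus; a leading '+' is rejected below
    if (!isDigitAt(i)) return false;     // rejects "", "+1", and ".1"
    if (s[i] == '0') {
        ++i;                             // a leading zero must stand alone
    } else {
        while (isDigitAt(i)) ++i;
    }
    if (i < n && s[i] == '.') {
        ++i;
        if (!isDigitAt(i)) return false; // rejects "1."
        while (isDigitAt(i)) ++i;
    }
    if (i < n && (s[i] == 'e' || s[i] == 'E')) {
        ++i;
        if (i < n && (s[i] == '-' || s[i] == '+')) ++i;
        if (!isDigitAt(i)) return false; // exponent needs at least one digit
        while (isDigitAt(i)) ++i;
    }
    return i == n;                       // "01" fails here: the '1' is left over
}

// Usage example: the four problem cases listed in the comment above are rejected.
inline void checkNumberLiteralExamples() {
    assert(isNumberLiteral("1.5"));
    assert(!isNumberLiteral("01"));
    assert(!isNumberLiteral("+1"));
    assert(!isNumberLiteral(".1"));
    assert(!isNumberLiteral("1."));
}
```

Checking the shape first and delegating the numeric conversion keeps the hand-written part limited to syntax, which is the trade-off the thread settles on.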

Contributor

@echeran echeran left a comment

LGTM!

After you squash, I'm happy to reapprove, and also to click the merge button on your behalf if need be.

This also updates the spec tests from the current version of the MFWG
repository and removes some duplicate tests.
Spec tests now reflect the message-format-wg repo as of
unicode-org/message-format-wg@5612f3b

It also updates both the ICU4C and ICU4J parsers to follow the
current test schema in the conformance repository.

This includes adding code to both parsers to allow `src` to be
either a single string or an array of strings (per
unicode-org/conformance#255),
and eliminating `srcs` in tests.

It also includes other changes to make updated spec tests pass:

ICU4C: Allow trailing whitespace for complex messages, due to spec change
ICU4C: Parse number literals correctly in Number::format
ICU4J: Allow trailing whitespace after complex body, per spec change
ICU4C: Fix bug that was assuming an .input variable can't have a reserved annotation
ICU4C: Fix bug where unsupported '.i' was parsed as an '.input'
ICU4C/ICU4J: Handle markup with space after the initial left curly brace
ICU4C: Check for duplicate variant errors
ICU4C/ICU4J: Handle leading whitespace in complex messages
ICU4J: Treat whitespace after .input keyword as optional
ICU4J: Don't format unannotated number literals as numbers
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism
Contributor Author

@echeran Squashed -- I think you don't need to reapprove since nothing changed, but if you could click the merge button, that would be great!

@mihnita mihnita merged commit 747d5ee into unicode-org:main Sep 18, 2024
100 checks passed
@mihnita
Contributor

mihnita commented Sep 18, 2024

Thank you!
M.
