ICU-22834 MF2: make tests compliant with schema and update spec tests #3063

catamorphism · 2024-07-22T20:13:19Z

This PR changes the code in the ICU4C and ICU4J tests that reads data-driven tests in JSON format to follow the test schema in the conformance repo, as well as updating the spec tests to the version in the spec repo.

Due to the changed spec tests, some non-test code changes were necessary to get the tests to pass:

ICU4C: Fixed number literal parsing in Number::format (this wasn't being done according to spec before, and previously there were no tests to expose that)
ICU4C: Fixed parsing of .input declarations, which had assumed the annotation can't be a reserved annotation
ICU4C: Fixed bug where .i was a parse error rather than an unsupported keyword
ICU4C: Make parsing of text and escape characters consistent with spec
ICU4C: Allow markup with space after the initial '{'
ICU4C: Check for duplicate variant errors (which were recently added to the spec; ICU4J already checked for these errors)
ICU4J: Allow trailing whitespace after complex messages (recent spec change)
ICU4J: Treat whitespace after .input as optional
ICU4J: Don't format unannotated number literals as numbers
Both: Allow leading whitespace before complex messages (also a recent spec change)
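
As an illustration of the duplicate-variant check mentioned in the list above, here is a minimal sketch. It is not the ICU4C code: `Variant` and `hasDuplicateVariant` are hypothetical names, and keys are compared as plain strings, which is a simplifying assumption (a real implementation has to follow the spec's rules for comparing keys).

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed variant: its list of keys
// (with "*" as the catch-all key) plus the pattern text.
struct Variant {
    std::vector<std::string> keys;
    std::string pattern;
};

// Returns true if any two variants share exactly the same key list.
// Assumption: keys are compared as plain strings.
bool hasDuplicateVariant(const std::vector<Variant>& variants) {
    std::set<std::vector<std::string>> seen;
    for (const Variant& v : variants) {
        // insert().second is false when this key list was already present
        if (!seen.insert(v.keys).second) {
            return true;
        }
    }
    return false;
}
```

With this shape, a matcher whose variants are keyed `one`, `one`, `*` would be flagged, while `one`, `*` would not.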

Note that some manual changes to the spec tests are necessary; ICU4J currently returns best-effort output rather than throwing exceptions for resolution errors, so tests where a resolution error (e.g. unknown-function or unsupported-statement) is expected need to be annotated with an ignoreJava property. This can be removed once the error handling story has been resolved (see MFWG issue 782).

The more interesting changes needed to parse the test files according to the schema include:

Updating error names according to schema
Updating how test params are specified (as an array of objects with name and value properties, rather than as an object)
Eliminating the srcs property to represent multi-line messages, and instead, allowing src to be either a single string or array of strings (consistent with this schema change)
Handling default test properties
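
To make the src and params changes above concrete, here is a rough sketch of how a test reader could normalize them once the JSON has been parsed. All names (`flattenSrc`, `Param`, `toArgumentMap`) are hypothetical and not taken from the ICU test readers; joining array elements with no separator is an assumption, and parameter values are simplified to strings.

```cpp
#include <map>
#include <string>
#include <vector>

// "src" given as a single string: use it as-is.
std::string flattenSrc(const std::string& src) {
    return src;
}

// "src" given as an array of strings: combine into one message string.
// Assumption: the parts are concatenated with no separator.
std::string flattenSrc(const std::vector<std::string>& src) {
    std::string joined;
    for (const std::string& part : src) {
        joined += part;
    }
    return joined;
}

// Hypothetical form of one entry in the "params" array.
struct Param {
    std::string name;
    std::string value;  // simplified: real values are not always strings
};

// Converts the array-of-objects form into a name -> value lookup table.
std::map<std::string, std::string> toArgumentMap(const std::vector<Param>& params) {
    std::map<std::string, std::string> args;
    for (const Param& p : params) {
        args[p.name] = p.value;  // a later duplicate name overwrites an earlier one
    }
    return args;
}
```

Either form of src ends up as one string for the parser, so the rest of the test runner does not need to know which form a test file used.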

This also made it possible to move all the test filenames into one file in the ICU4J tests (CoreTest.java) and delete the others, since tests are now handled uniformly; analogously, extra methods in the ICU4C test reader were removed.

Finally, the non-spec tests themselves needed to be changed; the main changes were changing the format of the params property (mentioned above) and renaming errors to expErrors. In addition, one file (icu-test-selectors.json) needed to be "expanded out" because the schema doesn't support the shared and variations properties.

Checklist

Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22834
Required: The PR title must be prefixed with a JIRA Issue number.
Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
Required: Each commit message must be prefixed with a JIRA Issue number.
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

jira-pull-request-webhook · 2024-08-05T20:22:39Z

Notice: the branch changed across the force-push!

icu4c/source/i18n/messageformat2_parser.h is different
testdata/message2/reserved-syntax.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-08-05T20:25:30Z

Notice: the branch changed across the force-push!

testdata/message2/reserved-syntax.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-08-05T20:54:41Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-08-07T20:20:16Z

Notice: the branch changed across the force-push!

icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java is different
icu4j/main/core/src/main/java/com/ibm/icu/message2/StringUtils.java is now changed in the branch
testdata/message2/reserved-syntax-2.json is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-08-08T20:38:19Z

Notice: the branch changed across the force-push!

icu4c/source/config/dist.mk is no longer changed in the branch
icu4c/source/i18n/messageformat2_data_model.cpp is no longer changed in the branch
icu4c/source/i18n/messageformat2_function_registry.cpp is different
icu4c/source/i18n/messageformat2_parser.cpp is different
icu4c/source/i18n/messageformat2_parser.h is no longer changed in the branch
icu4c/source/i18n/messageformat2_serializer.cpp is no longer changed in the branch
icu4c/source/i18n/unicode/messageformat2_data_model.h is no longer changed in the branch
icu4c/source/test/intltest/intltest.cpp is no longer changed in the branch
icu4c/source/test/intltest/intltest.h is no longer changed in the branch
icu4c/source/test/testdata/message2/duplicate-declarations.json is no longer changed in the branch
icu4c/source/test/testdata/message2/icu4j/icu-parser-tests.json is no longer changed in the branch
icu4c/source/test/testdata/message2/invalid-options.json is no longer changed in the branch
icu4c/source/test/testdata/message2/markup.json is no longer changed in the branch
icu4c/source/test/testdata/message2/more-data-model-errors.json is no longer changed in the branch
icu4c/source/test/testdata/message2/README.txt is no longer changed in the branch
icu4c/source/test/testdata/message2/reserved-syntax.json is no longer changed in the branch
icu4c/source/test/testdata/message2/resolution-errors.json is no longer changed in the branch
icu4c/source/test/testdata/message2/runtime-errors.json is no longer changed in the branch
icu4c/source/test/testdata/message2/spec/data-model-errors.json is no longer changed in the branch
icu4c/source/test/testdata/message2/spec/syntax-errors.json is no longer changed in the branch
icu4c/source/test/testdata/message2/spec/test-core.json is no longer changed in the branch
icu4c/source/test/testdata/message2/spec/test-functions.json is no longer changed in the branch
icu4c/source/test/testdata/message2/tricky-declarations.json is no longer changed in the branch
icu4c/source/test/testdata/message2/valid-tests.json is no longer changed in the branch
icu4j/main/common_tests/src/test/java/com/ibm/icu/dev/test/message2/Mf2FeaturesTest.java is no longer changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/message2/InputSource.java is no longer changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/message2/MFDataModelFormatter.java is no longer changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java is different
icu4j/main/core/src/main/java/com/ibm/icu/message2/StringUtils.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/CoreTest.java is different
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/CustomFormatterPersonTest.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/DataModelErrorsTest.java is different
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/FunctionsTest.java is different
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/IcuFunctionsTest.java is different
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/MessageFormat2Test.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/SerializationTest.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/SyntaxErrorsTest.java is different
icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/data-model-errors.json is no longer changed in the branch
icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/icu-parser-tests.json is no longer changed in the branch
icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/icu-test-previous-release.json is no longer changed in the branch
icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/icu-test-selectors.json is no longer changed in the branch
icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/syntax-errors.json is no longer changed in the branch
icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/test-core.json is no longer changed in the branch
icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/message2/test-functions.json is no longer changed in the branch
icu4j/pom.xml is no longer changed in the branch
icu4j/releases_tools/github_release.sh is no longer changed in the branch
testdata/message2/icu-test-functions.json is different
testdata/message2/more-syntax-errors.json is now changed in the branch
testdata/message2/reserved-syntax-2.json is no longer changed in the branch
testdata/message2/reserved-syntax.json is different
testdata/message2/spec/test-functions.json is now changed in the branch
testdata/message2/syntax-errors-diagnostics.json is different
testdata/message2/valid-tests.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-08-08T21:17:35Z

Notice: the branch changed across the force-push!

icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/Sources.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-08-08T21:32:01Z

Notice: the branch changed across the force-push!

icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/CoreTest.java is different
testdata/message2/icu-test-functions-multiline.json is no longer changed in the branch
testdata/message2/icu-test-functions.json is different
testdata/message2/more-functions-multiline.json is no longer changed in the branch
testdata/message2/more-functions.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

catamorphism · 2024-08-08T21:42:47Z

I can't request reviews, but: @markusicu @srl295 @echeran - and @mihnita might want to take a look at the Java changes.

catamorphism · 2024-08-13T23:31:01Z

I added another commit to this PR to update the spec tests again, which meant I had to pull in some more commits that were initially in #3092 (draft PR) in order to make the tests pass.

catamorphism · 2024-08-13T23:38:22Z

Also, I'll be on vacation from now until September 9, so I won't see any review comments until I return.

srl295 · 2024-08-24T17:36:06Z

icu4c/source/i18n/messageformat2_function_registry.cpp

+    default: {
+        // Should be unreachable
+        return 0;


Suggested change

-    default: {
-        // Should be unreachable
-        return 0;
+    default: {
+        // Should be unreachable
+        U_ASSERT(isDigit(c));
+        return 0;

should this handle other numbering systems though?

and not to bikeshed on this part but there is ufmt_digitvalue() and you could ignore values <0 or ≥10 (it parses hex)

> should this handle other numbering systems though?

No, because it's operating on a Number Operand as defined in the function registry spec; it's not a localized number.

> and not to bikeshed on this part but there is ufmt_digitvalue() and you could ignore values <0 or ≥10 (it parses hex)

I tried just now to use that, but I realized that ufmt_digitvalue() is part of the io library and io depends on i18n, so that would be a circular dependency. I guess it could be moved into i18n, but I'm hesitant to do a refactor like that just to avoid duplicating a relatively small function. What do you think?

I didn't realize it was in io, good call. fine as is then.
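
For reference, the kind of small helper being discussed here could look roughly like the sketch below. This is not the code from the PR: `isAsciiDigit` and `digitValue` are hypothetical names, and the standard `assert` stands in for `U_ASSERT`. It only handles ASCII digits, matching the point above that the operand is not a localized number.

```cpp
#include <cassert>

// True for the ASCII digits '0'..'9' only; other numbering systems are
// deliberately not handled because the input is not a localized number.
inline bool isAsciiDigit(char16_t c) {
    return c >= u'0' && c <= u'9';
}

// Maps an ASCII digit to its numeric value.
inline int digitValue(char16_t c) {
    assert(isAsciiDigit(c));  // stands in for U_ASSERT(isDigit(c))
    // Returning 0 for anything else mirrors the "should be unreachable" default.
    return isAsciiDigit(c) ? c - u'0' : 0;
}
```

Keeping a helper like this local avoids pulling in anything from the io library, at the cost of a few duplicated lines.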

srl295 · 2024-08-24T17:40:07Z

icu4c/source/i18n/messageformat2_function_registry.cpp

+    default: {
+        // Should be unreachable
+        return 0;


and not to bikeshed on this part but there is ufmt_digitvalue() and you could ignore values <0 or ≥10 (it parses hex)

icu4c/source/i18n/messageformat2_macros.h

icu4c/source/i18n/messageformat2_parser.cpp

srl295 · 2024-08-24T17:52:43Z

as a followon i'd move the number parsing into ufmt_cmn.cpp

icu4c/source/i18n/messageformat2_macros.h

srl295

I think we should be more cautious about the defines in the public header, otherwise looking good.

icu4c/source/i18n/messageformat2_macros.h

icu4c/source/i18n/messageformat2_parser.cpp

mihnita · 2024-09-10T20:51:41Z

There is an email thread on how to organize the test data.
Where the idea is to have some kind of shared folder in the root of the repo, with separate sub-folders for testdata ICU only and testdata from cldr.

Because some of the json files here are from cldr "as is" (or at least they should be).
And some are ICU only, shared between C++/Java, but not from cldr.

The cldr test files are copied to icu by an ant script.

If you don't know what thread I'm talking about (with vacation and all it is understandable if it was lost), tell me and I'll ping you by email with a link.

mihnita

Thank you.

Am I wrong, or was there another PR that was kind of similar, but not really?
That I reviewed.
I can't find it now.

icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java

icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/Sources.java

catamorphism · 2024-09-11T00:33:56Z

@mihnita:

> There is an email thread on how to organize the test data. Where the idea is to have some kind of shared folder in the root of the repo, with separate sub-folders for testdata ICU only and testdata from cldr.

Do you think it would be OK to handle that in a separate PR?

catamorphism · 2024-09-11T00:35:11Z

@mihnita:

> Am I wrong, or was there another PR that was kind of similar, but not really?

Yes, there's a draft PR that I intend as a follow-up to this one: #3092

catamorphism · 2024-09-11T01:11:02Z

Rebasing now, since I think all the line comments have been addressed.

jira-pull-request-webhook · 2024-09-11T01:11:28Z

Notice: the branch changed across the force-push!

icu4c/source/common/unicode/char16ptr.h is no longer changed in the branch
icu4c/source/common/unicode/platform.h is no longer changed in the branch
icu4c/source/common/unicode/unistr.h is no longer changed in the branch
icu4c/source/common/uniset_props.cpp is no longer changed in the branch
icu4c/source/common/unistr.cpp is no longer changed in the branch
icu4c/source/i18n/messageformat2_function_registry.cpp is different
icu4c/source/i18n/number_decimalquantity.cpp is no longer changed in the branch
icu4c/source/test/intltest/ustrtest.cpp is no longer changed in the branch
icu4c/source/test/intltest/ustrtest.h is no longer changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/message2/MFDataModelFormatter.java is different
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/IcuFunctionsTest.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

srl295

Update LGTM

jira-pull-request-webhook · 2024-09-11T20:58:59Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

catamorphism · 2024-09-12T16:01:01Z

@echeran It looks like the "enforce-all-checks" check failed, but that was the only check that failed. I think this is the same issue described in the icu-team email thread from yesterday?

srl295 · 2024-09-12T16:04:40Z

> @echeran It looks like the "enforce-all-checks" check failed, but that was the only check that failed. I think this is the same issue described in the icu-team email thread from yesterday?

retriggered, clean now. The ironic thing is that I was wondering if CLDR needs one of the same enforce-all-checks but now i'm not so sure!

mihnita

Thank you.

echeran

Looks good except I have a couple of questions / suggestions. Otherwise, I'm happy to rubber stamp since @mihnita is on board.

icu4c/source/i18n/messageformat2_function_registry.cpp

echeran · 2024-09-16T21:50:38Z

icu4c/source/i18n/messageformat2_function_registry.cpp

+    return result;
+}
+
+static double parseNumberLiteral(const FormattedPlaceholder& input, UErrorCode& errorCode) {


Suggestions from @sffc

1. Use function from double-conversion.h
2. Use function from number_decimalquantity.h
3. Use function from util.h called parseNumber to replace the helper methods that parse the integer and fractional portions.

Either option 1 or 2 should be sufficient to do what we need here. Option 1 gives a double, option 2 gives a DecimalQuantity. At least option 1 should be able to handle sign, decimal point, and exponent, and probably option 2. Option 3 is helpful but doesn't replace as much code as the other 2 options.

I tried both (1) and (2), and the problem is that there are some strings that both functions accept, which are not valid number literals according to the MF2 grammar: for example, 01 (leading zero with no decimal point), +1 (leading '+'), .1 (decimal point with no leading 0), and 1. (trailing decimal point).

Since these cases are relatively simple to check for, I just added a check before calling into StringToDouble (option 1) - see 176a232.

I didn't like (3) since calling parseNumber to parse the decimal part still means having to do math to add the two parts together, so it doesn't seem to simplify the code that much.

Let me know what you think.
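
To illustrate the pre-check described above, here is a self-contained sketch. It is not the code from 176a232: `isValidNumberLiteral` and `parseNumberLiteral` are hypothetical names operating on `std::string`, and `std::strtod` is used as a stand-in for the double-conversion call, purely so the example compiles without ICU. The grammar in the comment is my reading of the MF2 number-literal rule.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Checks a string against the MF2 number-literal shape:
//   ["-"] ("0" / nonzero-digit *digit) ["." 1*digit] [("e"/"E") ["-"/"+"] 1*digit]
bool isValidNumberLiteral(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    auto isDigit = [&](std::size_t k) { return k < n && s[k] >= '0' && s[k] <= '9'; };

    if (i < n && s[i] == '-') ++i;      // optional minus; a leading '+' is not allowed
    if (!isDigit(i)) return false;      // rejects ".1", "+1" and the empty string
    if (s[i] == '0') {
        ++i;                            // a leading zero must stand alone
    } else {
        while (isDigit(i)) ++i;
    }
    if (i < n && s[i] == '.') {
        ++i;
        if (!isDigit(i)) return false;  // rejects "1."
        while (isDigit(i)) ++i;
    }
    if (i < n && (s[i] == 'e' || s[i] == 'E')) {
        ++i;
        if (i < n && (s[i] == '-' || s[i] == '+')) ++i;
        if (!isDigit(i)) return false;
        while (isDigit(i)) ++i;
    }
    return i == n;                      // rejects "01": the '1' is left over
}

// Validates first, then delegates the actual conversion.
// Assumption: std::strtod (in the default "C" locale) stands in for the
// double-conversion function used in ICU.
bool parseNumberLiteral(const std::string& s, double& out) {
    if (!isValidNumberLiteral(s)) return false;
    out = std::strtod(s.c_str(), nullptr);
    return true;
}
```

The point of the split is that the permissive converter never sees a string the grammar would reject, so its extra leniency does not matter.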

Looks great! Option 1 seemed the most likely & directly applicable.

echeran

LGTM!

After you squash, I'm happy to reapprove, and also to click the merge button on your behalf if need be.

echeran · 2024-09-17T22:45:37Z

icu4c/source/i18n/messageformat2_function_registry.cpp

+    return result;
+}
+
+static double parseNumberLiteral(const FormattedPlaceholder& input, UErrorCode& errorCode) {


Looks great! Option 1 seemed the most likely & directly applicable.

This also updates the spec tests from the current version of the MFWG repository and removes some duplicate tests. Spec tests now reflect the message-format-wg repo as of unicode-org/message-format-wg@5612f3b.

It also updates both the ICU4C and ICU4J parsers to follow the current test schema in the conformance repository. This includes adding code to both parsers to allow `src` to be either a single string or an array of strings (per unicode-org/conformance#255), and eliminating `srcs` in tests.

It also includes other changes to make updated spec tests pass:

ICU4C: Allow trailing whitespace for complex messages, due to spec change
ICU4C: Parse number literals correctly in Number::format
ICU4J: Allow trailing whitespace after complex body, per spec change
ICU4C: Fix bug that was assuming an .input variable can't have a reserved annotation
ICU4C: Fix bug where unsupported '.i' was parsed as an '.input'
ICU4C/ICU4J: Handle markup with space after the initial left curly brace
ICU4C: Check for duplicate variant errors
ICU4C/ICU4J: Handle leading whitespace in complex messages
ICU4J: Treat whitespace after .input keyword as optional
ICU4J: Don't format unannotated number literals as numbers

jira-pull-request-webhook · 2024-09-17T23:12:14Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

catamorphism · 2024-09-17T23:44:47Z

@echeran Squashed -- I think you don't need to reapprove since nothing changed, but if you could click the merge button, that would be great!

mihnita · 2024-09-18T14:46:37Z

Thank you!
M.

catamorphism changed the title ~~DRAFT: MF2: make tests compliant with schema~~ ICU-22834 DRAFT: MF2: make tests compliant with schema Jul 22, 2024

catamorphism force-pushed the mf2-test-schema branch from 2e5394d to c00e4b5 Compare August 5, 2024 20:22

catamorphism force-pushed the mf2-test-schema branch from c00e4b5 to af96c7a Compare August 5, 2024 20:25

catamorphism force-pushed the mf2-test-schema branch from d8a016d to 825cdf9 Compare August 5, 2024 20:54

catamorphism force-pushed the mf2-test-schema branch from 96e081f to c3ea7a7 Compare August 7, 2024 20:20

catamorphism force-pushed the mf2-test-schema branch from c3ea7a7 to ca1b786 Compare August 8, 2024 20:38

catamorphism changed the title ~~ICU-22834 DRAFT: MF2: make tests compliant with schema~~ ICU-22834 DRAFT: MF2: make tests compliant with schema and update spec tests Aug 8, 2024

catamorphism force-pushed the mf2-test-schema branch from ca1b786 to cf91e74 Compare August 8, 2024 21:17

catamorphism force-pushed the mf2-test-schema branch from cf91e74 to 612387f Compare August 8, 2024 21:31

catamorphism changed the title ~~ICU-22834 DRAFT: MF2: make tests compliant with schema and update spec tests~~ ICU-22834 MF2: make tests compliant with schema and update spec tests Aug 8, 2024

catamorphism marked this pull request as ready for review August 8, 2024 21:41

catamorphism mentioned this pull request Aug 8, 2024

ICU-22898 MF2: fix various parser bugs and add more tests #3092

Merged

7 tasks

markusicu assigned echeran Aug 15, 2024

markusicu requested a review from mihnita August 15, 2024 16:22

srl295 self-requested a review August 16, 2024 15:00

srl295 reviewed Aug 24, 2024

srl295 previously approved these changes Aug 24, 2024

srl295 requested a review from echeran August 24, 2024 17:53

catamorphism dismissed srl295’s stale review via 5531140 September 9, 2024 20:26

srl295 reviewed Sep 10, 2024

icu4c/source/i18n/messageformat2_macros.h (outdated)

srl295 reviewed Sep 10, 2024

icu4c/source/i18n/messageformat2_macros.h (outdated)

srl295 reviewed Sep 10, 2024

icu4c/source/i18n/messageformat2_parser.cpp (outdated)

mihnita reviewed Sep 10, 2024

catamorphism force-pushed the mf2-test-schema branch from 9bd9556 to d2edf89 Compare September 11, 2024 01:11

srl295 previously approved these changes Sep 11, 2024

catamorphism force-pushed the mf2-test-schema branch from d2edf89 to 1f99262 Compare September 11, 2024 20:58

This was referenced Sep 14, 2024

ICU-22890 Add test to show lone surrogate cause infinity loop #3166

Closed

ICU-22890 MF2: Add lone surrogate test to parser #3167

Merged

mihnita previously approved these changes Sep 16, 2024

echeran reviewed Sep 16, 2024

icu4c/source/i18n/messageformat2_function_registry.cpp (outdated)

icu4c/source/i18n/messageformat2_function_registry.cpp (outdated)

catamorphism dismissed stale reviews from mihnita and srl295 via 2c18c92 September 16, 2024 21:31

echeran reviewed Sep 16, 2024

echeran approved these changes Sep 17, 2024

catamorphism force-pushed the mf2-test-schema branch from cdc7238 to 6441f29 Compare September 17, 2024 23:12

mihnita merged commit 747d5ee into unicode-org:main Sep 18, 2024
100 checks passed
