Implement full Unicode 16.0.0 extended grapheme breaking. #719

Open. Wants to merge 5 commits into base: main.

lrhn (Member) commented Nov 7, 2024

Includes rule GB9c (the Indic Conjunct Break rule).

This change has a significant cost in size since the information needed per character no longer fits in 4 bits. The base table is therefore twice as big (one byte per entry rather than half of that).
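To make the size cost concrete, here is a minimal sketch of the difference between a 4-bit and an 8-bit lookup. The function names and the flat layout are assumptions made for illustration only; the package's generated tables may be organized differently.

```dart
import 'dart:typed_data';

/// Hypothetical 4-bit table: two category values share each byte.
/// Even indices use the low nibble, odd indices the high nibble.
int lookupNibble(Uint8List packed, int index) {
  final byte = packed[index >> 1];
  return (index & 1) == 0 ? byte & 0x0F : byte >> 4;
}

/// Hypothetical 8-bit table: one category value per byte.
int lookupByte(Uint8List table, int index) => table[index];

/// Packs values in the range 0..15 two per byte.
/// Throws if a value needs more than 4 bits, which is the situation
/// that forces the move to one byte per entry.
Uint8List packNibbles(List<int> values) {
  final packed = Uint8List((values.length + 1) >> 1);
  for (var i = 0; i < values.length; i++) {
    final v = values[i];
    if (v < 0 || v > 15) {
      throw ArgumentError.value(v, 'values[$i]', 'does not fit in 4 bits');
    }
    packed[i >> 1] |= (i & 1) == 0 ? v : v << 4;
  }
  return packed;
}
```

With this layout, n entries take about n/2 bytes when packed and n bytes when stored one per byte, which matches the doubling described above.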

The number of states in the state automatons has also increased slightly, but in comparison that's a negligible change.

Tests have been made more thorough. They now cover not only the tests provided by the Unicode Consortium, but also variants of those tests that use representative characters for each character category, both inside and outside the BMP, to check that surrogate pair decoding works correctly.
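As background for the BMP versus non-BMP variants, the following sketch shows how a code point maps to UTF-16 code units and back. The helper names `toUtf16` and `decodeAt` are hypothetical and are not taken from the package or its tests.

```dart
/// Encodes [codePoint] as UTF-16 code units.
/// BMP code points take one unit; all others take a surrogate pair.
List<int> toUtf16(int codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw RangeError.range(codePoint, 0, 0x10FFFF, 'codePoint');
  }
  if (codePoint <= 0xFFFF) return [codePoint];
  final offset = codePoint - 0x10000;
  return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)];
}

/// Decodes the code point starting at [index] in [units].
/// A lead surrogate followed by a tail surrogate is combined;
/// anything else, including a lone surrogate, is returned as-is.
int decodeAt(List<int> units, int index) {
  final unit = units[index];
  if ((unit & 0xFC00) == 0xD800 && index + 1 < units.length) {
    final next = units[index + 1];
    if ((next & 0xFC00) == 0xDC00) {
      return 0x10000 + ((unit & 0x3FF) << 10) + (next & 0x3FF);
    }
  }
  return unit;
}
```

A test variant that swaps a BMP character for a non-BMP character of the same category exercises the two-unit path through the decoder while expecting the same break positions relative to characters.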

Tests also check that the generated automatons are minimal, in that no state is unreachable and no two states are indistinguishable.
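The two minimality properties can be sketched as follows. This is not the PR's test code: the automaton representation (`transitions[state][input]` giving the target state, and an `outputs` list giving each state's observable behavior) is an assumption chosen to keep the example small.

```dart
/// Returns the states that cannot be reached from [start].
Set<int> unreachableStates(List<List<int>> transitions, int start) {
  final seen = <int>{start};
  final pending = [start];
  while (pending.isNotEmpty) {
    final state = pending.removeLast();
    for (final target in transitions[state]) {
      if (seen.add(target)) pending.add(target);
    }
  }
  return {
    for (var s = 0; s < transitions.length; s++)
      if (!seen.contains(s)) s
  };
}

/// Groups states into classes that no input sequence can tell apart,
/// using repeated partition refinement.
List<List<int>> equivalenceClasses(
    List<List<int>> transitions, List<int> outputs) {
  // Start with states grouped by their observable output only.
  var classOf = List<int>.of(outputs);
  var classCount = classOf.toSet().length;
  while (true) {
    // A state's signature is its class plus the classes of its targets.
    final ids = <String, int>{};
    final next = [
      for (var s = 0; s < transitions.length; s++)
        ids.putIfAbsent(
            '${classOf[s]}:${transitions[s].map((t) => classOf[t]).join(',')}',
            () => ids.length)
    ];
    final stable = ids.length == classCount;
    classOf = next;
    classCount = ids.length;
    if (stable) break; // No class was split, so the partition is final.
  }
  final classes = <int, List<int>>{};
  for (var s = 0; s < classOf.length; s++) {
    (classes[classOf[s]] ??= []).add(s);
  }
  return classes.values.toList();
}
```

Under these assumptions, an automaton is minimal when `unreachableStates` returns an empty set and every class returned by `equivalenceClasses` contains exactly one state.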


github-actions bot commented Nov 7, 2024

Package publishing

| Package | Version | Status | Publish tag (post-merge) |
| --- | --- | --- | --- |
| package:characters | 1.4.0 | ready to publish | characters-v1.4.0 |
| package:args | 2.6.1-wip | WIP (no publish necessary) | |
| package:async | 2.12.0 | already published at pub.dev | |
| package:collection | 1.19.1-wip | WIP (no publish necessary) | |
| package:convert | 3.1.2 | already published at pub.dev | |
| package:crypto | 3.0.6 | already published at pub.dev | |
| package:fixnum | 1.1.1 | already published at pub.dev | |
| package:logging | 1.3.0 | already published at pub.dev | |
| package:os_detect | 2.0.3-wip | WIP (no publish necessary) | |
| package:path | 1.9.1 | already published at pub.dev | |
| package:platform | 3.1.6 | already published at pub.dev | |
| package:typed_data | 1.4.0 | already published at pub.dev | |

Documentation at https://github.com/dart-lang/ecosystem/wiki/Publishing-automation.


github-actions bot commented Nov 7, 2024

PR Health

Breaking changes ✔️
| Package | Change | Current Version | New Version | Needed Version | Looking good? |
| --- | --- | --- | --- | --- | --- |
| characters | None | 1.4.0 | 1.4.0 | 1.4.0 | ✔️ |
Coverage ⚠️
| File | Coverage |
| --- | --- |
| pkgs/characters/lib/src/characters_impl.dart | 💚 89 % |
| pkgs/characters/lib/src/grapheme_clusters/breaks.dart | 💚 98 % |
| pkgs/characters/lib/src/grapheme_clusters/constants.dart | 💔 Not covered |
| pkgs/characters/lib/src/grapheme_clusters/table.dart | 💚 100 % |
| pkgs/characters/tool/bin/generate_tables.dart | 💔 Not covered |
| pkgs/characters/tool/bin/generate_tests.dart | 💔 Not covered |
| pkgs/characters/tool/generate.dart | 💔 Not covered |
| pkgs/characters/tool/src/atsp.dart | 💔 Not covered |
| pkgs/characters/tool/src/automaton_builder.dart | 💔 Not covered |
| pkgs/characters/tool/src/data_files.dart | 💔 Not covered |
| pkgs/characters/tool/src/debug_names.dart | 💚 15 % |
| pkgs/characters/tool/src/graph.dart | 💔 Not covered |
| pkgs/characters/tool/src/grapheme_category_loader.dart | 💔 Not covered |
| pkgs/characters/tool/src/list_overlap.dart | 💔 Not covered |
| pkgs/characters/tool/src/shared.dart | 💔 Not covered |

This check for test coverage is informational (issues shown here will not fail the PR).

This check can be disabled by tagging the PR with skip-coverage-check.

API leaks ✔️

The following packages contain symbols visible in the public API, but not exported by the library. Export these symbols or remove them from your publicly visible API.

Package Leaked API symbols
License Headers ⚠️
// Copyright (c) 2024, the Dart project authors. Please see the AUTHORS file
// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.
Files
pkgs/characters/lib/src/grapheme_clusters/breaks.dart

All source files should start with a license header.

This check can be disabled by tagging the PR with skip-license-check.

Until `// dart format off` starts working.

lrhn commented Nov 7, 2024

The health check is wrong. The changelog is correct, since the version wasn't changed and the existing changelog didn't list the missing part that is now implemented.

natebosch (Member) left a comment

Out of these comments my primary concern is the return statements in loops in tests.

Should some of those be continue instead of return?
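The concern can be illustrated with a small sketch. The function names and the empty-string condition are hypothetical and do not come from the PR's tests; they only show how `return` and `continue` differ inside a loop over test inputs.

```dart
/// Counts inputs that were checked, but `return` ends the whole
/// function at the first skipped input, so later inputs never run.
int checkedWithReturn(List<String> inputs) {
  var checked = 0;
  for (final input in inputs) {
    if (input.isEmpty) return checked; // Exits the loop and the function.
    checked++;
  }
  return checked;
}

/// Same loop, but `continue` skips only the current input.
int checkedWithContinue(List<String> inputs) {
  var checked = 0;
  for (final input in inputs) {
    if (input.isEmpty) continue; // Moves on to the next input.
    checked++;
  }
  return checked;
}
```

For the inputs `['a', '', 'b', 'c']`, the first function checks one input and the second checks three. In a test body, the first form would silently skip every case after the early exit.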

Comment on lines +306 to +309
expect(
  eqClasses.where((l) => l.length != 1).toList(),
  isEmpty,
);

[nit] Prefer keeping the semantics of the expectation in the matchers themselves.

Suggested change
- expect(
-   eqClasses.where((l) => l.length != 1).toList(),
-   isEmpty,
- );
+ expect(
+   eqClasses,
+   every(hasLength(1)),
+ );

var nextEqClasses = nextEq.classes;
if (nextEqClasses.length == eqClasses.length) break;
if (nextEqClasses.length == states.length) {
print("Backwards states distinguishable in $r steps");

We shouldn't include print statements in tests.

I'm also worried about having the conditional logic in a test. Why do we not know in advance whether the test will fall through this case or not?

Comment on lines +250 to +252
if (unreachableStates.isEmpty) {
print("Backward states reachable in $step steps");
return;

Why don't we know in advance whether this branch will get hit?

return result;
}

int get maxWeight => _table.reduce((a, b) => a >= b ? a : b);

[nit] I cannot find where this is used. Can it be removed?

If it cannot be removed, consider adding `import 'dart:math' as math;` and using `math.max`.

Suggested change
- int get maxWeight => _table.reduce((a, b) => a >= b ? a : b);
+ int get maxWeight => _table.reduce(math.max);
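A standalone sketch of the two forms, written as top-level functions over a plain list rather than as the getter on the PR's graph class (the function names are hypothetical):

```dart
import 'dart:math' as math;

/// Maximum via the `dart:math` tear-off, as in the suggestion.
int maxWeight(List<int> table) => table.reduce(math.max);

/// Maximum via an explicit comparison, as in the original line.
int maxWeightManual(List<int> table) =>
    table.reduce((a, b) => a >= b ? a : b);
```

Both return the same value for a non-empty list. Both throw a `StateError` on an empty list, because `reduce` requires at least one element.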

}

/// Creates a new graph without the last vertex (or last [count] vertices).
Graph removeLastVertex([int count = 1]) {

I cannot find where this is used. Can it be removed?

lrhn (Member, Author) replied Nov 8, 2024

It probably can. I tried a different, simpler optimization strategy that used this.
The result was sufficiently worse that I dropped it again.

Well spotted!

}

/// Swaps the positions of [vertex1] and [vertex2], updating the weight table.
void swap(int vertex1, int vertex2) {

I cannot find where this is used. Can it be removed?

lrhn (Member, Author) replied

Same.

/// The index need not be at a grapheme cluster boundary.
/// Uniquely to this function, the index need not be at a grapheme cluster
/// boundary. That means there may be need for look-behind to find a character
/// where the exact state is known.
int nextBreak(String text, int start, int end, int index) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] This method is going from <50 lines to >150 lines. Consider whether any pieces can be pulled out into composed methods.

lrhn (Member, Author) replied Nov 8, 2024

I'm considering whether it should be a state machine. It probably can be. It could be part of the forwards automaton, with extra states above the normal states, since the normal states are where it should end up after figuring out what it needs from the look-behind.

I'm generally not moving things into separate functions if they would need to return more than one value, to avoid risking extra allocations.

lrhn commented Nov 8, 2024

I think I broke isGraphemeClusterBoundary. Have to fix that too.
... and fixed. That was a silly bug.
