-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement full Unicode 16.0.0 extended grapheme breaking. #719
base: main
Are you sure you want to change the base?
Conversation
Includes rule GB9c (Indict Conjunt Break based). This change has a significant cost in size since the information needed per character no longer fits in 4 bits. The base table is therefore twice as big (one byte per entry rather than half of that). The number of states in the state automatons have also increased slightly, but in comparison that's a negligible change. Tests have been made more thorough, testing not only the Unicode Consortium provided tests, but also variants of those with representative characters for each category of character that either in or not-in the BMP, to test that surrogate pair decoding works correctly. Test also check that the created automatons are minimal, in that no state is unreachable and no two states are indistinguishable.
Package publishing
Documentation at https://github.com/dart-lang/ecosystem/wiki/Publishing-automation. |
PR HealthBreaking changes ✔️
Coverage
|
File | Coverage |
---|---|
pkgs/characters/lib/src/characters_impl.dart | 💚 89 % |
pkgs/characters/lib/src/grapheme_clusters/breaks.dart | 💚 98 % |
pkgs/characters/lib/src/grapheme_clusters/constants.dart | 💔 Not covered |
pkgs/characters/lib/src/grapheme_clusters/table.dart | 💚 100 % |
pkgs/characters/tool/bin/generate_tables.dart | 💔 Not covered |
pkgs/characters/tool/bin/generate_tests.dart | 💔 Not covered |
pkgs/characters/tool/generate.dart | 💔 Not covered |
pkgs/characters/tool/src/atsp.dart | 💔 Not covered |
pkgs/characters/tool/src/automaton_builder.dart | 💔 Not covered |
pkgs/characters/tool/src/data_files.dart | 💔 Not covered |
pkgs/characters/tool/src/debug_names.dart | 💚 15 % |
pkgs/characters/tool/src/graph.dart | 💔 Not covered |
pkgs/characters/tool/src/grapheme_category_loader.dart | 💔 Not covered |
pkgs/characters/tool/src/list_overlap.dart | 💔 Not covered |
pkgs/characters/tool/src/shared.dart | 💔 Not covered |
This check for test coverage is informational (issues shown here will not fail the PR).
This check can be disabled by tagging the PR with skip-coverage-check
.
API leaks ✔️
The following packages contain symbols visible in the public API, but not exported by the library. Export these symbols or remove them from your publicly visible API.
Package | Leaked API symbols |
---|
License Headers ⚠️
// Copyright (c) 2024, the Dart project authors. Please see the AUTHORS file
// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.
Files |
---|
pkgs/characters/lib/src/grapheme_clusters/breaks.dart |
All source files should start with a license header.
This check can be disabled by tagging the PR with skip-license-check
.
Health check is wrong. The changelog is correct since the version wasn't changed, and the existing changelog didn't list missing part that is now implemented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of these comments my primary concern is the return
statements in loops in tests.
Should some of those be continue
instead of return
?
expect( | ||
eqClasses.where((l) => l.length != 1).toList(), | ||
isEmpty, | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] prefer maintaining the semantics of the expectation in the matchers used.
expect( | |
eqClasses.where((l) => l.length != 1).toList(), | |
isEmpty, | |
); | |
expect( | |
eqClasses, | |
every(hasLength(1)), | |
); |
var nextEqClasses = nextEq.classes; | ||
if (nextEqClasses.length == eqClasses.length) break; | ||
if (nextEqClasses.length == states.length) { | ||
print("Backwards states distinguishable in $r steps"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't include print statements in tests.
I'm also worried about having the conditional logic in a test. Why do we not know in advance whether the test will fall through this case or not?
if (unreachableStates.isEmpty) { | ||
print("Backward states reachable in $step steps"); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we know in advance whether this branch will get hit?
return result; | ||
} | ||
|
||
int get maxWeight => _table.reduce((a, b) => a >= b ? a : b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] I cannot find where this is used. Can it be removed?
If it cannot be removed, consider adding import 'dart:math' as math;
int get maxWeight => _table.reduce((a, b) => a >= b ? a : b); | |
int get maxWeight => _table.reduce(math.max); |
pkgs/characters/tool/src/graph.dart
Outdated
} | ||
|
||
/// Creates a new graph without the last vertex (or last [count] vertices). | ||
Graph removeLastVertex([int count = 1]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot find where this is used. Can it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably can. I had a different, simpler, optimization strategy that I tried, which used this.
The result was enough worse that I dropped it again.
Well spotted!
pkgs/characters/tool/src/graph.dart
Outdated
} | ||
|
||
/// Swaps the positions of [vertex1] and [vertex2], updating the weight table. | ||
void swap(int vertex1, int vertex2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot find where this is used. Can it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same.
/// The index need not be at a grapheme cluster boundary. | ||
/// Uniquely to this function, the index need not be at a grapheme cluster | ||
/// boundary. That means there may be need for look-behind to find a character | ||
/// where the exact state is known. | ||
int nextBreak(String text, int start, int end, int index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] This method is going from <50 lines to >150 lines. Consider whether any pieces can be pulled out into composed methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm considering whether it should be a state machine. It probably can be. Maybe part of the forwards automaton, with more states above the normal states, since that's the normal states it should end up in after figuring out what it needs from the look-behind.
I'm generally not moving things into separate functions if they would need to return more than one value, to avoid risking extra allocations.
I think I broke |
Includes rule GB9c (Indic Conjunt Break rule).
This change has a significant cost in size since the information needed per character no longer fits in 4 bits. The base table is therefore twice as big (one byte per entry rather than half of that).
The number of states in the state automatons have also increased slightly, but in comparison that's a negligible change.
Tests have been made more thorough, testing not only the Unicode Consortium provided tests, but also variants of those with representative characters for each category of character that either in or not-in the BMP, to test that surrogate pair decoding works correctly.
Test also check that the created automatons are minimal, in that no state is unreachable and no two states are indistinguishable.