This issue is applicable to most languages with conjunct forms that involve a virama.
Many scripts descended from Brahmi indicate clusters of consonant sounds by merging or stacking the glyphs of the consonants involved in one way or another. These scripts are abugidas, and each consonant character represents a consonant sound and an inherent vowel sound. The merging of glyphs indicates that the inherent vowel sound is dropped between the consonants. In Unicode text, this merging is usually accomplished using a special character between the consonants, which is typically called a virama or 'vowel-killer'.
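To make the encoding concrete, here is a minimal sketch that lists the code points behind one conjunct. The Devanagari sequence KA + VIRAMA + SSA is chosen purely as an illustration, and `describe` is a hypothetical helper name; only Python's standard `unicodedata` module is used.

```python
import unicodedata

# Illustrative conjunct: DEVANAGARI KA + VIRAMA + SSA, usually rendered as one merged glyph.
conjunct = "\u0915\u094D\u0937"

def describe(text):
    """Return (code point, Unicode character name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(conjunct):
    print(code_point, name)
```

The middle character is the virama: it has no visible form of its own in the conjunct, but it is what tells the renderer to merge the two consonants.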
When operations such as line breaking, cursor movement, vertical text rendering, deletion, and hyphenation are applied to the text, these conjuncts must not be split apart. (Line-break opportunities in these scripts usually occur at inter-word spaces, but conjuncts should still be kept together when a very long word doesn't fit on a line, when the CSS `word-break` property is set to `break-all`, or when the CSS `line-break` property is set to `anywhere`.)
A grapheme is a user-perceived unit of text. Text operations that use graphemes as a unit of text include line breaking, forward deletion, cursor movement and selection, character counts, text spacing, text insertion, justification, case conversions, and sorting. The Unicode Standard uses generalised rules to define 'grapheme clusters', which approximate the likely grapheme boundaries in a writing system.
The GAP
The Unicode concept of grapheme cluster, up to and including Unicode 15.0, fails to represent syllabic conjuncts (plus vowels, etc.) in scripts such as Bengali, Devanagari, and Gujarati. This means that editing operations, line-breaking algorithms, vertical text layout, and similar processes are likely to break text at the wrong point.
Conjuncts are not kept together because those segmentation rules allow a grapheme cluster boundary immediately after the virama, so the consonant that follows starts a new cluster.
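The effect can be sketched in a few lines of Python. This is a deliberately simplified approximation of the older behaviour, not the UAX #29 algorithm: it assumes that only combining marks (general categories Mn, Mc, Me) extend a cluster, and `legacy_clusters` is a hypothetical name used for illustration.

```python
import unicodedata

def legacy_clusters(text):
    """Approximate pre-15.1 clustering: a combining mark joins the previous
    cluster; every other character starts a new one."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and is_mark:
            clusters[-1] += ch   # the virama (category Mn) attaches to the first consonant
        else:
            clusters.append(ch)  # the consonant after the virama starts a new cluster
    return clusters

# DEVANAGARI KA + VIRAMA + SSA: one conjunct on screen, two clusters here.
print(legacy_clusters("\u0915\u094D\u0937"))
```

For this input the sketch yields two clusters, KA + VIRAMA and then SSA, so an operation that works cluster by cluster can separate the two halves of the conjunct.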
CSS uses the concept of 'typographic character unit', rather than grapheme cluster, in its specs, with the explanation that these cases are beyond the scope of the grapheme cluster concept and that implementations should provide appropriate support.
Priority
The impact of incorrectly segmenting text containing conjuncts is significant: it affects the correct handling of editing operations, line-breaking algorithms, vertical text, etc. This issue has a priority of Basic.
Action taken
Discussions took place in the Unicode Script Ad Hoc committee, and an initial proposal was made by Norbert Lindenberg that would form the basis for gradual deployment of changes for a number of scripts.
Unicode 15.1 introduced an initial set of changes to Unicode® Standard Annex #29, Unicode Text Segmentation, that treat a consonant after a virama as a continuation of the grapheme cluster for certain scripts. The scripts affected by this change are those whose virama has the property value Indic_Conjunct_Break (InCB)=Linker: currently Bengali, Devanagari, Gujarati, Oriya, Telugu, and Malayalam. (The problem remains for several other scripts; more are expected to be addressed in Unicode 17.)
As long as applications support the latest rules for grapheme clusters, those scripts should keep conjuncts together.
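As a rough illustration of the Unicode 15.1 change, the sketch below adds a conjunct rule to a simplified clusterer: a consonant joins the current cluster when it follows a consonant, then any mix of extenders and linkers that includes at least one linker. The property table is a small hand-written Devanagari subset assumed for this example (real implementations read Indic_Conjunct_Break from the Unicode Character Database), and `incb` and `clusters_with_conjuncts` are hypothetical names.

```python
import unicodedata

# Hand-written subset of InCB values for Devanagari, for illustration only.
LINKERS = {0x094D}           # DEVANAGARI SIGN VIRAMA
EXTENDERS = {0x093C, 0x200D}  # DEVANAGARI SIGN NUKTA, ZERO WIDTH JOINER

def incb(ch):
    cp = ord(ch)
    if cp in LINKERS:
        return "Linker"
    if 0x0915 <= cp <= 0x0939:  # DEVANAGARI LETTER KA..HA
        return "Consonant"
    if cp in EXTENDERS:
        return "Extend"
    return "None"

def clusters_with_conjuncts(text):
    """Simplified clustering with a conjunct rule:
    Consonant [Extend Linker]* Linker [Extend Linker]* x Consonant."""
    clusters = []
    in_run = False       # inside "Consonant [Extend Linker]*"
    seen_linker = False  # that run contains at least one Linker
    for ch in text:
        kind = incb(ch)
        is_mark = unicodedata.category(ch).startswith("M") or ch == "\u200D"
        continues_conjunct = kind == "Consonant" and in_run and seen_linker
        if clusters and (is_mark or continues_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        # Update the conjunct state for the next character.
        if kind == "Consonant":
            in_run, seen_linker = True, False
        elif kind == "Linker" and in_run:
            seen_linker = True
        elif kind == "Extend" and in_run:
            pass  # extenders keep the run alive without changing it
        else:
            in_run = seen_linker = False
    return clusters

# KA + VIRAMA + SSA now stays in a single cluster.
print(clusters_with_conjuncts("\u0915\u094D\u0937"))
```

Under these assumptions the conjunct KA + VIRAMA + SSA comes back as one cluster, while a consonant that follows a vowel sign still starts a new cluster.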
Outcomes
The latest versions of the Gecko, Blink, and WebKit engines support the new rules for grapheme clusters for Bengali, Devanagari, and Gujarati.
The first comment in this issue contains text that will automatically appear in one or more gap-analysis documents as a subsection with the same title as this issue. Any edits made to that comment will be immediately available in the Editor's draft of the document. Proposals for changes or discussion of the content can be made by adding comments below this point.
r12a changed the title from "Grapheme clusters fail to represent syllabic conjuncts" to "Grapheme clusters fail to represent syllabic conjuncts in north Indian scripts" on May 18, 2021
r12a changed the title from "Grapheme clusters fail to represent syllabic conjuncts in north Indian scripts" to "Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, Gujarati, Oriya, Telugu, and Malayalam" on Feb 7, 2025
r12a changed the title from "Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, Gujarati, Oriya, Telugu, and Malayalam" to "Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, and Gujarati" on Feb 7, 2025
r12a changed the title from "Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, and Gujarati" to "Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, and Gujarati — FIXED !" on Feb 7, 2025
My proposal wasn't about the initial 6 scripts (those came from Google/CLDR), but about an additional 14 scripts. It looks like the changes I proposed will go into Unicode 17.
hi @NorbertLindenberg. Yes, i didn't intend to imply that your proposal matched what was implemented in Unicode 15.1. I tweaked the wording a bit to maybe make that clearer.