Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, and Gujarati — FIXED ! #87

r12a · 2020-02-05T05:47:36Z

This issue is applicable to most languages with conjunct forms that involve a virama.

Many scripts descended from Brahmi indicate clusters of consonant sounds by merging or stacking the glyphs of the consonants involved in one way or another. These scripts are abugidas, and each consonant character represents a consonant sound and an inherent vowel sound. The merging of glyphs indicates that the inherent vowel sound is dropped between the consonants. In Unicode text, this merging is usually accomplished using a special character between the consonants, which is typically called a virama or 'vowel-killer'.
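As a small illustration of the encoding described above, the following sketch lists the code points behind one Devanagari conjunct. The choice of the conjunct (KA + VIRAMA + SSA) and the helper name `describe` are assumptions made for this example; only the standard `unicodedata` module is used.

```python
import unicodedata

# Devanagari conjunct: KA + VIRAMA + SSA, normally rendered as one merged glyph.
conjunct = "\u0915\u094D\u0937"

def describe(text):
    """Return 'U+XXXX NAME' for each code point in text (illustrative helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for line in describe(conjunct):
    print(line)
```

The virama sits between the two consonants in the stored text even though the rendered conjunct shows no separate mark for it.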

When operations such as line breaking, cursor movement, vertical text rendering, deletion, and hyphenation are applied to the text, these conjuncts must not be split apart. (Line-break opportunities in these scripts usually occur at inter-word spaces, but when a very long word doesn't fit on a line, or the CSS word-break property is set to break-all, or the CSS line-break property is set to anywhere, conjuncts should be kept together.)

A grapheme is a user-perceived unit of text. Text operations that use graphemes as a unit of text include line-breaking, forwards deletion, cursor movement & selection, character counts, text spacing, text insertion, justification, case conversions, and sorting. The Unicode Standard uses generalised rules to define 'grapheme clusters', which approximate the likely grapheme boundaries in a writing system.

The GAP

The Unicode concept of grapheme cluster, up to and including Unicode 15.0, fails to represent syllabic conjuncts (plus vowels, etc.) in scripts such as Bengali, Devanagari, and Gujarati. This means that editing operations, line-breaking algorithms, vertical text layout, and similar features are likely to break text at the wrong point.

The reason conjuncts are not kept together is that, under those segmentation rules, the virama attaches to the preceding consonant and a new grapheme cluster starts at the consonant that follows it.
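The split can be sketched with a deliberately rough approximation of the older behaviour. This is not the full UAX #29 rule set: the function name `legacy_clusters` is hypothetical, and "combining mark" (general category M*) is used here as a stand-in for the Extend and SpacingMark classes.

```python
import unicodedata

def legacy_clusters(text):
    """Rough approximation of pre-15.1 clustering (an assumption for this
    sketch): a combining mark attaches to the preceding cluster, and every
    other character starts a new cluster."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # virama and vowel signs join the previous cluster
        else:
            clusters.append(ch)  # a consonant always starts a new cluster
    return clusters

word = "\u0915\u094D\u0937\u093F"  # KA, VIRAMA, SSA, VOWEL SIGN I
print(len(legacy_clusters(word)))  # 2: the conjunct is cut after the virama
```

Under this approximation the word is divided into KA + VIRAMA and SSA + VOWEL SIGN I, which is the break inside the conjunct that this issue describes.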

CSS uses the concept of 'typographic character unit', rather than grapheme cluster, in its specs, with the explanation that these cases are beyond the scope of the grapheme cluster concept and that implementations should provide appropriate support.

More: Typographic character units in complex scripts

Priority

Incorrectly segmenting text that contains conjuncts has a significant impact on editing operations, line-breaking algorithms, vertical text, and similar features. This issue has a priority of Basic.

Tests

Action taken

Discussions took place in the Unicode Script Ad Hoc committee, and an initial proposal was made by Norbert Lindenberg that would form the basis for gradual deployment of changes for a number of scripts.

Unicode 15.1 introduced an initial set of changes to Unicode® Standard Annex #29, Unicode Text Segmentation, which recognise a consonant after a virama as a continuation of the grapheme cluster for certain scripts. The scripts affected by this change are those that have a character with Indic_Conjunct_Break (InCB)=Linker: currently Bengali, Devanagari, Gujarati, Oriya, Telugu, and Malayalam. (The problem remains for several other scripts, and more will be addressed for Unicode 17.)

As long as applications support the latest rules for grapheme clusters, those scripts should keep conjuncts together.
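A minimal sketch of the idea behind the newer rule, for Devanagari only: a consonant does not start a new cluster when it follows a consonant and a linker (virama), optionally with extending characters in between. The character sets below are small, assumed stand-ins for the real Indic_Conjunct_Break property data, and the function name is hypothetical; a real implementation should rely on a library that tracks the current Unicode data.

```python
import unicodedata

# Assumed, deliberately tiny stand-ins for Indic_Conjunct_Break values.
LINKER = {"\u094D"}                                      # DEVANAGARI SIGN VIRAMA
CONSONANTS = {chr(cp) for cp in range(0x0915, 0x093A)}   # KA..HA only
INCB_EXTEND = {"\u093C", "\u200D"}                       # NUKTA, ZERO WIDTH JOINER

def clusters_with_conjuncts(text):
    """Approximate clustering that keeps consonant + virama + consonant together."""
    clusters = []
    state = "none"  # "none", "consonant" (seen a consonant), "linked" (then a virama)
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M") or ch == "\u200D"
        joins_conjunct = ch in CONSONANTS and state == "linked"
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        # Track progress through: consonant, (extend | linker)*, linker, consonant
        if ch in CONSONANTS:
            state = "consonant"
        elif ch in LINKER and state != "none":
            state = "linked"
        elif ch in INCB_EXTEND and state != "none":
            pass  # extending characters do not interrupt the sequence
        else:
            state = "none"
    return clusters

word = "\u0915\u094D\u0937\u093F"  # KA, VIRAMA, SSA, VOWEL SIGN I
print(len(clusters_with_conjuncts(word)))  # 1: the conjunct stays in one cluster
```

The state variable is what distinguishes this from a simple "marks attach to the previous character" approach: the consonant after the virama is only absorbed when a consonant and linker have actually been seen.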

Outcomes

The latest versions of the Gecko, Blink, and WebKit engines support the new rules for grapheme clusters for Bengali, Devanagari, and Gujarati.

r12a · 2020-02-05T05:47:50Z

The first comment in this issue contains text that will automatically appear in one or more gap-analysis documents as a subsection with the same title as this issue. Any edits made to that comment will be immediately available in the Editor's draft of the document. Proposals for changes or discussion of the content can be made by adding comments below this point.

Relevant gap analysis documents include:
Bengali • Gujarati • Devanagari

xfq · 2025-02-17T02:27:03Z

Should 'Gujurati' be 'Gujarati'? Or is it an alternative spelling?

r12a · 2025-02-17T06:19:28Z

Fixed.

NorbertLindenberg · 2025-04-15T21:20:34Z

My proposal wasn't about the initial 6 scripts (those came from Google/CLDR), but about an additional 14 scripts. It looks like the changes I proposed will go into Unicode 17.

r12a · 2025-04-17T10:19:27Z

hi @NorbertLindenberg. Yes, i didn't intend to imply that your proposal matched what was implemented in Unicode 15.1. I tweaked the wording a bit to maybe make that clearer.

Good news about Unicode 17!

r12a added i:segmentation Grapheme/word segmentation & selection gap p:basic doc:deva labels Feb 5, 2020

r12a changed the title ~~Grapheme clusters fail to represent syllabic conjuncts~~ Grapheme clusters fail to represent syllabic conjuncts in north Indian scripts May 18, 2021

r12a added doc:beng doc:gujr x:beng x:deva x:gujr labels May 18, 2021

This was referenced May 18, 2021

Grapheme clusters fail to represent syllabic conjuncts #104

Closed

Grapheme clusters fail to represent syllabic conjuncts #62

Closed

r12a added l:hi Hindi, Devanagari script l:bn Bengali language & script l:gu Gujurati language & script labels May 1, 2024

r12a added this to Gap-analysis pipeline Jun 20, 2024

r12a moved this to Issue identified, needing investigation in Gap-analysis pipeline Jun 20, 2024

r12a added s:gujr Gurajati script s:beng Bengali script s:deva Devanagari script labels Jul 2, 2024

1ec5 mentioned this issue Aug 23, 2024

Render complex text, variant forms, emoji, etc. 1ec5/maplibre-gl-js#1

Draft

r12a changed the title ~~Grapheme clusters fail to represent syllabic conjuncts in north Indian scripts~~ Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, Gujarati, Oriya, Telugu, and Malayalam Feb 7, 2025

r12a removed x:beng x:deva x:gujr labels Feb 7, 2025

r12a changed the title ~~Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, Gujarati, Oriya, Telugu, and Malayalam~~ Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, and Gujarati Feb 7, 2025

r12a moved this from Issue identified, needing investigation to Fixed in Gap-analysis pipeline Feb 7, 2025

r12a changed the title ~~Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, and Gujarati~~ Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, and Gujarati — FIXED ! Feb 7, 2025

r12a removed the p:basic label Feb 7, 2025

r12a added the p:ok label Feb 7, 2025
