Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, and Gujarati — FIXED ! #87

Open
r12a opened this issue Feb 5, 2020 · 5 comments
Labels
doc:beng doc:deva doc:gujr gap i:segmentation Grapheme/word segmentation & selection l:bn Bengali language & script l:gu Gujurati language & script l:hi Hindi, Devanagari script p:ok s:beng Bengali script s:deva Devanagari script s:gujr Gurajati script

Comments

@r12a
Contributor

r12a commented Feb 5, 2020

This issue is applicable to most languages with conjunct forms that involve a virama.

Many scripts descended from Brahmi indicate clusters of consonant sounds by merging or stacking the glyphs of the consonants involved in one way or another. These scripts are abugidas, and each consonant character represents a consonant sound and an inherent vowel sound. The merging of glyphs indicates that the inherent vowel sound is dropped between the consonants. In Unicode text, this merging is usually accomplished using a special character between the consonants, which is typically called a virama or 'vowel-killer'.
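
As a small illustration of the encoding described above, the following Python sketch lists the code points behind one Devanagari conjunct, क्ष. The sample string is chosen here for illustration and does not come from the issue itself; only the standard-library `unicodedata` module is used.

```python
import unicodedata

# Illustrative conjunct (an assumption, not from the issue): KA + VIRAMA + SSA,
# which fonts normally render as the single merged glyph क्ष.
conjunct = "\u0915\u094D\u0937"

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(conjunct):
    print(code_point, name)
```

The string contains three code points even though a reader sees a single unit, and the middle one is the virama that signals the dropped inherent vowel.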

When operations such as line breaking, cursor movement, vertical text rendering, deletion, hyphenation, etc. are applied to the text, these conjuncts must not be split apart. (Line-break opportunities in these scripts usually occur at inter-word spaces, but when a very long word doesn't fit on a line, or the CSS word-break property is set to break-all, or the CSS line-break property is set to anywhere, conjuncts should still be kept together.)

A grapheme is a user-perceived unit of text. Text operations that use graphemes as a unit of text include line-breaking, forwards deletion, cursor movement & selection, character counts, text spacing, text insertion, justification, case conversions, and sorting. The Unicode Standard uses generalised rules to define 'grapheme clusters', which approximate the likely grapheme boundaries in a writing system.

The GAP

The Unicode concept of grapheme cluster, up to and including Unicode 15.0, fails to represent syllabic conjuncts (plus vowels, etc.) in scripts such as Bengali, Devanagari, and Gujarati. This means that various editing operations, line-breaking algorithms, vertical text layout, etc. are likely to break text at the wrong point.

The reason conjuncts are not kept together is that segmentation rules in Unicode start a new grapheme cluster after the virama.
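
The effect can be sketched with a deliberately simplified segmenter. This is not the full algorithm from the Unicode segmentation rules (it ignores Hangul, emoji, ZWJ, and control characters, among other things); it only assumes that a cluster is a base character followed by any combining marks, which is enough to show where the older rules put the boundary. The function name `legacy_clusters` is hypothetical.

```python
import unicodedata

def legacy_clusters(text):
    """Rough approximation of the older behaviour: a cluster is a base
    character plus any following combining marks (categories Mn, Mc, Me)."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and is_mark:
            clusters[-1] += ch   # marks, including the virama, attach to the base
        else:
            clusters.append(ch)  # any other character starts a new cluster
    return clusters

word = "\u0915\u094D\u0937"  # क्ष, chosen for illustration
print(legacy_clusters(word))
```

Under this approximation the virama stays with the first consonant, but the second consonant opens a new cluster, so the conjunct is split in two.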

CSS uses the concept of 'typographic character unit', rather than grapheme cluster, in its specs, with the explanation that these cases are beyond the scope of the grapheme cluster concept and that implementations should provide appropriate support.

Priority

The impact of incorrectly segmenting text containing conjuncts is significant, affecting the correct handling of editing operations, line breaking algorithms, vertical text, etc. This is an issue with the priority of Basic.

Tests

Action taken

Discussions took place in the Unicode Script Ad Hoc committee, and an initial proposal was made by Norbert Lindenberg that would form the basis for gradual deployment of changes for a number of scripts.

Unicode 15.1 introduced an initial set of changes to Unicode® Standard Annex #29, Unicode Text Segmentation, that recognised consonants after a virama as a continuation of the grapheme cluster for certain scripts. The scripts affected by this change are those with Indic_Conjunct_Break (InCB)=Linker. Those scripts are currently Bengali, Devanagari, Gujarati, Oriya, Telugu, and Malayalam. (The problem remains for several other scripts, and more will be addressed in Unicode 17.)

As long as applications support the latest rules for grapheme clusters, those scripts should keep conjuncts together.
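
A minimal sketch of the idea behind the newer rule follows: a consonant does not start a new cluster when the preceding consonant has been followed by a linker (virama). The character sets below are hand-written assumptions covering a subset of Devanagari only; a real implementation would read the Indic_Conjunct_Break property from the Unicode Character Database, and the function name `conjunct_clusters` is hypothetical.

```python
import unicodedata

# Assumed, hand-written subset for illustration (Devanagari only).
CONSONANTS = {chr(cp) for cp in range(0x0915, 0x093A)}  # KA..HA
LINKERS = {"\u094D"}                                     # DEVANAGARI SIGN VIRAMA

def conjunct_clusters(text):
    """Simplified segmentation that keeps consonant + virama + consonant together."""
    clusters = []
    after_consonant = False  # still inside a consonant-plus-marks run
    linker_seen = False      # a virama has followed that consonant
    for ch in text:
        category = unicodedata.category(ch)
        if ch in CONSONANTS:
            if clusters and after_consonant and linker_seen:
                clusters[-1] += ch           # continue the conjunct
            else:
                clusters.append(ch)          # start a new cluster
            after_consonant, linker_seen = True, False
        elif ch in LINKERS:
            if clusters:
                clusters[-1] += ch
            else:
                clusters.append(ch)
            linker_seen = after_consonant
        elif category.startswith("M"):
            if clusters:
                clusters[-1] += ch
            else:
                clusters.append(ch)
            if category != "Mn":
                after_consonant = False      # a spacing vowel sign ends the conjunct run
        else:
            clusters.append(ch)
            after_consonant = linker_seen = False
    return clusters

print(conjunct_clusters("\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"))  # क्षत्रिय
```

For the sample word क्षत्रिय this yields three clusters (क्ष, त्रि, य), whereas two consonants with no virama between them still fall into separate clusters.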

Outcomes

The latest versions of the Gecko, Blink, and WebKit engines support the new rules for grapheme clusters for Bengali, Devanagari, and Gujarati.

@r12a r12a added i:segmentation Grapheme/word segmentation & selection gap p:basic doc:deva labels Feb 5, 2020
@r12a
Contributor Author

r12a commented Feb 5, 2020

The first comment in this issue contains text that will automatically appear in one or more gap-analysis documents as a subsection with the same title as this issue. Any edits made to that comment will be immediately available in the Editor's draft of the document. Proposals for changes or discussion of the content can be made by adding comments below this point.

Relevant gap analysis documents include:
Bengali, Gujarati, Devanagari

@r12a r12a changed the title from "Grapheme clusters fail to represent syllabic conjuncts" to "Grapheme clusters fail to represent syllabic conjuncts in north Indian scripts" May 18, 2021
@r12a r12a added l:hi Hindi, Devanagari script l:bn Bengali language & script l:gu Gujurati language & script labels May 1, 2024
@r12a r12a moved this to Issue identified, needing investigation in Gap-analysis pipeline Jun 20, 2024
@r12a r12a added s:gujr Gurajati script s:beng Bengali script s:deva Devanagari script labels Jul 2, 2024
@r12a r12a changed the title from "Grapheme clusters fail to represent syllabic conjuncts in north Indian scripts" to "Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, Gujarati, Oriya, Telugu, and Malayalam" Feb 7, 2025
@r12a r12a changed the title from "Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, Gujarati, Oriya, Telugu, and Malayalam" to "Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, and Gujarati" Feb 7, 2025
@r12a r12a moved this from Issue identified, needing investigation to Fixed in Gap-analysis pipeline Feb 7, 2025
@r12a r12a changed the title from "Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, and Gujarati" to "Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, and Gujarati — FIXED !" Feb 7, 2025
@r12a r12a removed the p:basic label Feb 7, 2025
@r12a r12a added the p:ok label Feb 7, 2025
@xfq
Member

xfq commented Feb 17, 2025

Should 'Gujurati' be 'Gujarati'? Or is it an alternative spelling?

@r12a
Contributor Author

r12a commented Feb 17, 2025

Fixed.

@NorbertLindenberg

My proposal wasn't about the initial 6 scripts (those came from Google/CLDR), but about an additional 14 scripts. It looks like the changes I proposed will go into Unicode 17.

@r12a
Contributor Author

r12a commented Apr 17, 2025

Hi @NorbertLindenberg. Yes, I didn't intend to imply that your proposal matched what was implemented in Unicode 15.1. I tweaked the wording a bit to make that clearer.

Good news about Unicode 17!


3 participants