Support for Arabic text #18
Comments
Hi, glad you're finding the editor useful so far. The main challenges that I can see to supporting Arabic script lie in the design decision behind SPEEDy to enclose characters in single SPANs. This was done to leverage the browser's built-in NodeList structure which resembles a linked list. While the text direction can be set on the DIV that contains the character SPANs, this would apply to the entire text and would not allow portions of text to be in Arabic script. Additionally, it also means that SPEEDy currently does not support text blocks, but essentially treats the whole editor space as one text block. The only way around this that I can see is a fundamental rewrite of the engine to allow both DIVs and SPANs to be stored, but aside from the problem of mapping the caret to a character object (which is non-trivial in itself), it also poses questions about how the text will be serialised: that is, what constitutes the raw text stream output when non-character elements (like DIVs, TABLEs, etc.) are present in the editor? I am very open to any ideas on how to tackle this problem. For the time being I've decided to work within the limitations as in most other respects the editor works for standoff annotation. If you have any thoughts on a solution to this design issue, I would be happy to take them on board ... |
Thanks for your reply. I think it would be OK just to be able to set the text direction for the entire text, as long as the character joining problem could be fixed. I think the current design with one |
If the trick mentioned in the Stack Overflow answer works I think it would be possible to extend SPEEDy to support Arabic script. It will require some careful rewriting around character insertion and deletion to cater for the ligature directive symbols, but it seems like it should be possible. Some means of identifying a text as Arabic script would be necessary when it is being loaded, so perhaps a meta-data standoff property (i.e., one without a start or end index) could serve this purpose... |
I reprogrammed the editor to insert a zero-width joining character span between all regular character spans and unfortunately this had no effect on the ligaturing of the Arabic characters. It may be that this only works if the ZWJ is inside the same span as the text character, rather than wrapped in its own span. The example on Stack Overflow is a little ambiguous as it shows a text node adjacent to a span rather than two spans. I will see if I can append the ZWJ to the text content of the span and see what happens. |
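The two placements discussed in this comment can be sketched as follows. This is only an illustration under stated assumptions, not SPEEDy's actual code: `charSpan`, `separateZwjSpans` and `inlineZwjSpans` are hypothetical names, the spans are built as plain strings rather than DOM nodes, and a ZWJ is naively added after every character.

```javascript
// Zero-width joiner (U+200D).
const ZWJ = '\u200D';

// Hypothetical helper: markup for one character span.
// When joinInside is true, the ZWJ is appended to the span's own text content.
function charSpan(ch, joinInside) {
  const text = joinInside ? ch + ZWJ : ch;
  return '<span>' + text + '</span>';
}

// The placement that had no effect: a ZWJ wrapped in its own span
// between every pair of regular character spans.
function separateZwjSpans(word) {
  return Array.from(word)
    .map(ch => charSpan(ch, false))
    .join(charSpan(ZWJ, false));
}

// The placement to try next: the ZWJ inside the same span as the character.
function inlineZwjSpans(word) {
  return Array.from(word)
    .map(ch => charSpan(ch, true))
    .join('');
}
```

Comparing the output of the two functions for the same word in a browser would show whether the ZWJ has to share a span with the letter it joins.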
I have got the ZWJ characters to work inside the editor, but the RTL orientation is mixing up the order of the words. Not sure if this is a result of using the ZWJ characters, or if I am misinterpreting it ... |
That’s great news, thank you!
The first word you type should start at the right margin, then word 2 should be to the left of word 1, word 3 should be to the left of word 2, etc., like this: 3 2 1 Could you make a branch that I could try? |
Hi, I've now checked in some code that appears to fix the Arabic ligature rendering issue, and hopefully implements RTL correctly. I noticed that the text seemed misaligned in RTL for Western texts, and my concern is that the CSS annotation might be interfering in some way. In any case, the best way to test the Arabic script feature out is to select 'arabic.json' from the File drop down list and click 'Load'. This not only loads a sample text but also reloads the editor with the required RTL and character interpolation settings. Currently the editor needs to be reloaded to carry this out, as character interpolation (i.e., ZWJ) needs to be configured up front. I should be able to refactor this soon to allow more dynamic switching, but for now this works. I also created a branch called 'arabic-script' for you to play in. |
Hi, I've finally had a chance to try this out. It's definitely progress, thank you! Looking at the result in the editor, I now realise that the use of ZWJ is a little more complex than I thought: it's needed sometimes before the character, sometimes after the character, and sometimes both, but this depends on the character and on its position in the word. To make this easier, I'm writing a JavaScript function to determine where to add ZWJ. About
I think it would be fine to have an RTL button to change the document direction. Joining has to work in both cases, though. I think the easiest way to do this would be to apply the ZWJ logic to any character with a Unicode code point in an Arabic range. A function like this should do it (using the ranges from https://en.wikipedia.org/wiki/Arabic_script_in_Unicode):
function isArabicChar(char) {
    var codePoint = char.codePointAt(0);
    return (codePoint >= 0x0600 && codePoint <= 0x06FF) ||
        (codePoint >= 0x0750 && codePoint <= 0x077F) ||
        (codePoint >= 0x08A0 && codePoint <= 0x08FF) ||
        (codePoint >= 0xFB50 && codePoint <= 0xFDFF) ||
        (codePoint >= 0xFE70 && codePoint <= 0xFEFF) ||
        (codePoint >= 0x10E60 && codePoint <= 0x10E7F) ||
        (codePoint >= 0x1EC70 && codePoint <= 0x1ECBF) ||
        (codePoint >= 0x1EE00 && codePoint <= 0x1EEFF);
}
I'm thinking that maybe I should write a little library with these functions, which could then be used in SPEEDy. |
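To illustrate why a ZWJ is needed sometimes before a character, sometimes after it, and sometimes on both sides, here is a rough sketch of position-dependent placement. It is an assumption-laden simplification, not the function described in this thread: `addZwj`, `joinsForward` and `joinsBackward` are hypothetical names, the list of letters that join only to the preceding letter is deliberately incomplete, and diacritics are ignored.

```javascript
const ZWJ = '\u200D';

// Assumption: a small, incomplete set of letters that join only to the
// preceding letter: alef and its hamza/madda forms, dal, thal, reh, zain,
// waw, waw with hamza, teh marbuta.
const RIGHT_JOINING = new Set(Array.from(
  '\u0627\u0623\u0625\u0622\u062F\u0630\u0631\u0632\u0648\u0624\u0629'));

// Hamza on its own (U+0621) joins on neither side.
const NON_JOINING = new Set(['\u0621']);

// Only the basic Arabic letters U+0621..U+064A are handled in this sketch.
function isBasicArabicLetter(ch) {
  const cp = ch.codePointAt(0);
  return cp >= 0x0621 && cp <= 0x064A;
}

// Can this letter connect to the letter that follows it?
function joinsForward(ch) {
  return isBasicArabicLetter(ch) && !RIGHT_JOINING.has(ch) && !NON_JOINING.has(ch);
}

// Can this letter connect to the letter that precedes it?
function joinsBackward(ch) {
  return isBasicArabicLetter(ch) && !NON_JOINING.has(ch);
}

// letters: array of single characters in logical (typing) order.
// Returns one string per character, with ZWJ added where a join is expected.
function addZwj(letters) {
  return letters.map((ch, i) => {
    const prev = letters[i - 1];
    const next = letters[i + 1];
    const before = prev !== undefined && joinsForward(prev) && joinsBackward(ch);
    const after = next !== undefined && joinsForward(ch) && joinsBackward(next);
    return (before ? ZWJ : '') + ch + (after ? ZWJ : '');
  });
}
```

In this sketch a letter only receives a ZWJ on a side where both it and its neighbour are able to join, which is why the result depends on the character and on its position in the word.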
Also, Arabic has diacritics as separate Unicode characters. For example, here's a string with two characters: the letter U+0644 followed by the diacritic U+064F, which appears above the letter: لُ A diacritic (called a 'nonspacing character' in Unicode) has to be rendered above or below the letter. As far as I can tell, the only way to make this work is to put them in the same <span>. This means that it won't be possible to annotate just the diacritic with standoff. I think that's OK: the annotation can be attached to the letter. But it does mean that there needs to be a way for a |
I made a little library that does most of the work: https://github.com/dhlab-basel/arabic-shaping It provides functions for splitting a string into an array of character groups, with each group containing at most one letter and its diacritics, and for adding any ZWJ characters that are needed to each group. Then you can wrap each group in a <span> element. Please let me know if you can use it or if it needs anything else. |
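The grouping step described above can be sketched roughly like this. It is not the arabic-shaping library's code or API: `splitIntoGroups` is a hypothetical name, and the sketch assumes that any character in the Unicode "Mark" category counts as a diacritic belonging to the character before it.

```javascript
// Split a string into character groups: each group holds one base
// character followed by any combining marks (diacritics) that come after it.
function splitIntoGroups(text) {
  const groups = [];
  // for...of iterates by code point, so characters outside the BMP stay whole.
  for (const ch of text) {
    const isMark = /\p{M}/u.test(ch);
    if (isMark && groups.length > 0) {
      // Attach the diacritic to the group of the preceding character.
      groups[groups.length - 1] += ch;
    } else {
      groups.push(ch);
    }
  }
  return groups;
}
```

Each resulting group would then be wrapped in its own span, so a letter and its diacritics always share one element.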
Hi,
I just wanted to let you know that I haven't forgotten about this issue,
but that I'm currently rewriting some core aspects of the editor to be more
flexible around the management of characters and text blocks (SPANs vs
DIVs) and around characters that are visible in the output text stream and
those that are excluded (but editable). I have some ideas how to best
integrate your functions into the pipeline, as well, so that this becomes a
more generalisable solution for other character sets.
Regards,
Iian
|
OK, great, thanks for letting me know. Please also let me know if there’s anything I can do to help. |
Yes. My idea is that you can construct each For each character entered, you can call:
For each character group, you can call Maybe it's clearer to illustrate this step by step. Suppose we start with an empty text. The user starts by typing an Arabic letter, which we can represent in this illustration as
Next the user types a diacritic, which we can represent here as
Now the user types another Arabic letter.
Now that we have two groups, we must add any necessary ZWJ characters to make them join together correctly. We call
So now our two groups look like this:
These two groups should now be joined correctly. Now the user types a third letter, so we start group 3. To join groups 2 and 3, we have to redo the ZWJ in group 2, as well as adding ZWJ to group 3. We call
In short:
Does this seem workable? |
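The typing sequence walked through above might be modelled like this. The function names of the shaping library are not shown in this thread, so nothing here is its API: `typeChar` is a hypothetical helper, groups are assumed to hold raw characters without ZWJ, and the return value only says which groups would need their ZWJ recomputed.

```javascript
// groups: array of strings, one per character group, in logical order.
// Appends the typed character and returns the indices of the groups
// whose ZWJ should be recomputed afterwards.
function typeChar(groups, ch) {
  const isDiacritic = /\p{M}/u.test(ch);
  if (isDiacritic && groups.length > 0) {
    // A diacritic joins the group of the letter typed before it.
    groups[groups.length - 1] += ch;
    return [groups.length - 1];
  }
  // A letter starts a new group.
  groups.push(ch);
  const last = groups.length - 1;
  // The previous group must be redone too, because it now has a neighbour.
  return last > 0 ? [last - 1, last] : [last];
}
```

Typing a letter, a diacritic, and then two more letters would report groups `[0]`, `[0]`, `[0, 1]` and `[1, 2]` in turn, matching the "redo group 2, add to group 3" step in the walkthrough.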
(Edited above comment after simplifying functions a bit.) |
Hi, This is a great explanation, but I have a few more questions, I'm afraid. The situation with the editor is that the user's cursor can be anywhere in a text; they could be typing the first letter of the text, or at the end of the text, or inserting a letter somewhere between. Alternatively, they could be deleting a character, or a whole range of characters at once. I am trying to determine what input exactly I would need to pass to your functions in these various circumstances, and the best way of getting that input. For example, as soon as someone inserts a character I am able to output the SPAN that wraps that character, along with the previous and next siblings (in some cases these will be NULL). Is that sufficient material for your shaping code to work with, do you think, or would you recommend some other parameters? And if it is sufficient, can you suggest the procedure I should follow to generate the char-groups from those inputs? Thanks, |
Hi Iian,
Yes, that's fine. Whenever a character is inserted/changed, you have to update:
If a
To update the ZWJ of a For example, suppose we have these groups:
The user inserts group
We call:
Then the user deletes the
We call:
I think the following should work. For each character typed, if
Then update ZWJ as described above. Does that make sense? Thanks for your patience, |
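For insertion and deletion at an arbitrary position, the set of groups to update could be worked out along these lines. This is a guess at the shape of the procedure, based on the "previous and next siblings" idea in this exchange; `groupsToRefreshAfterInsert` and `groupsToRefreshAfterDelete` are hypothetical names and are not part of the shaping library.

```javascript
// After inserting a group at `index`, the new group and both of its
// neighbours may need their ZWJ recomputed.
// groupCount is the number of groups after the insertion.
function groupsToRefreshAfterInsert(groupCount, index) {
  return [index - 1, index, index + 1]
    .filter(i => i >= 0 && i < groupCount);
}

// After deleting the group that was at `index`, the groups on either side
// of the gap have become neighbours and may need their ZWJ recomputed.
// groupCount is the number of groups after the deletion.
function groupsToRefreshAfterDelete(groupCount, index) {
  return [index - 1, index]
    .filter(i => i >= 0 && i < groupCount);
}
```

The filter drops the missing sibling at either end of the text, which corresponds to the NULL previous or next sibling mentioned earlier.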
Hi Ben, Thanks for the clarification, it has really helped. I've now added an 'onCharacterAdded' handler to SPEEDy to allow the client code to access the text stream, and I've attempted to implement the add part of your algorithm above. Would you mind loading the demo page when you can and trying to paste in some Arabic text? From my end some portions of the text look correct, while others are off. There's probably something simple I'm missing with my implementation ... Best regards, PS. Keep in mind that my hookup is very basic, even around the assumptions it makes about the next and previous elements. But I wanted to start with the simple case first. |
Hi Ben,
Just a quick question that occurred to me: would your Arabic shaping
solution be reworkable to other non-Latin alphabets, and Unicode rendering
in general?
Cheers,
Iian
|
Thanks so much, I’m looking forward to trying this later today.
I think the basic principle should be the same. I believe the WebKit shaping bug affects Indic scripts as well. I thought about trying to implement a general-purpose solution for all affected scripts, but Arabic is the only one of the affected languages that I actually know, so I can see whether it’s rendered correctly. Once this works for Arabic, I’d be glad to refactor it to make it work for more scripts, with the help of someone who knows another affected language. |
Hi Ben,
I've now added the hook to your code to the loading function in SPEEDy,
which means that you can also test out the Arabic script shaping by loading
the Arabic text from the 'File' drop-down list in the editor demo.
Best regards,
Iian
|
Incidentally, I noticed there are a lot of similarities between the
original Arabic text (from Google translate) and its rendering with the
interface to your Arabic shaping script. When I copy and paste the content
of the editor panel into Google translate it becomes clear that some
characters are dropped ... but hard to know how reliable this test is, as
I'm copying directly from the content of the editor panel which hasn't been
pre-treated for removal of ZWJCs. Still, it is clear that there *is* a
difference between the original text and the rendered text.
Regards,
Iian
|
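Hi Iian, I've just tried this. We're getting closer! 🙂 But I see two problems.
First, you need to pull the latest version of my shaping code from GitHub, because the procedures I described above won't work with the slightly older version you have. I think this is why the text in arabic.json isn't rendered correctly when you load it from a file. I've also just now changed my example text to the one in arabic.json, so you can compare these correct <span> elements with the ones produced by the editor: https://github.com/dhlab-basel/arabic-shaping/blob/master/correct-example.html
The second problem is that when I type Arabic text (one character at a time) into an empty editor window, the characters are inserted backwards; the order of the <span> elements is the reverse of what it should be. For example, the first word of your sample text is بدأ ("began"). If I type that word into the editor (I type first ب, then د, then أ), the editor inserts each new character before the previous one, generating this HTML:
<span style="position: relative;">أ</span>
<span style="position: relative;">د</span>
<span style="position: relative;">ب</span>
But that's the word أدب ("literature"). 🙂 Are you trying to insert the <span> elements in reverse order for RTL? If so, that's not how it works: the "logical ordering" of the characters is the same for LTR and RTL. The browser handles the "visual ordering". Here's an explanation: https://www.w3.org/International/questions/qa-visual-vs-logical
Have a good weekend, Ben |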
Hi Ben,
I've just updated SPEEDy with the latest version of your code (sorry for
not spotting that earlier!). I also changed "arabic.json" to use the same
input text in your "main.js". I've also removed the RTL styling on the
editor element, which will hopefully help with the character insertion
issue (if not I can put it back in). I can confirm that I don't attempt to
reverse the SPAN insertion order for RTL text.
Unfortunately, I am rather flying blind in this process as I can't read
Arabic and it's hard for me to spot the difference between the incorrect
rendering of text and differences resulting from page flow, such as font
size, panel width, etc.
Would you mind checking things over when you have a chance? Once the
rendering for the added text is confirmed, I can start working on the
handler for removing text.
Best regards,
Iian
|
No problem, and don’t worry, I know it’s difficult to do this without knowing the language, but we’ll get there. I’ll try this again on Monday. |
OK, the good news is that when I load The only problem I see with the rendering of the sample text is that sometimes a line is broken in the middle of a word, like this:
When I start with an empty editor and start typing in Arabic, the characters are still added in reverse order. To type the first word of the sample text, first I type the letter ب, and I see:
Notice that the cursor is positioned to the left of the letter, as it should be. But now I type the letter د, and I see:
The د should have appeared to the left of the ب, but has instead appeared to the right of it. Again, the cursor appears to the left of the د, as it should. Now I type the letter أ, and again it's added on the right instead of on the left:
Also, now the cursor is still to the left of the د where it was before. If I use the Chrome developer tools to look at the generated HTML, I see:
You can see that the characters have been inserted in reverse order. Since I typed ب, then د, then أ, I should have got this instead:
<div data-role="editor" spellcheck="false" contenteditable="true" class="editor">
<span style="position: relative;">ب</span>
<span style="position: relative;">د</span>
<span style="position: relative;">أ</span>
</div>
Maybe for testing, you can try an Arabic keyboard layout and just type those three letters yourself. On macOS (using the "Arabic - PC" layout), Linux, or Windows, you can find them here: On a QWERTY keyboard:
If you type them in that order, the resulting |
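The expected behaviour, where characters are stored in logical (typing) order and the browser is left to handle visual order, can be sketched as below. This is an illustration only, not SPEEDy's implementation: `insertAtCaret` and `render` are hypothetical names, the editor state is reduced to an array of characters plus a caret index, and the `dir="rtl"` attribute on the container is an assumption about how the base direction would be set.

```javascript
// Insert a character at the caret and move the caret past it.
// state: { chars: string[], caret: number }, both in logical order.
function insertAtCaret(state, ch) {
  const chars = state.chars.slice();
  chars.splice(state.caret, 0, ch);
  // Advancing the caret is what keeps the next character after this one.
  // If the caret stayed put, each new character would land before the
  // previous one and the word would come out reversed.
  return { chars: chars, caret: state.caret + 1 };
}

// Emit the spans in logical order; no reversal is done for RTL text.
function render(chars) {
  const spans = chars.map(
    ch => '<span style="position: relative;">' + ch + '</span>');
  return '<div dir="rtl">' + spans.join('') + '</div>';
}
```

Typing ب, then د, then أ through `insertAtCaret` from an empty state leaves the characters in that same order in the markup, which is the order shown in the corrected HTML above.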
Hi again, is there any chance you might have some time for this? |
Hi Benjamin,
I'd like to fix this issue, but to an extent it feels like I am working in
the dark as I can't read Arabic script. Is there a chance we can discuss
this over a Skype call? We might get to the bottom of it more quickly.
My email address is: [email protected]
Best regards,
Iian
|
This is really an impressive standoff editor, and I'm looking forward to exploring it more thoroughly. Just as an initial test, I tried typing the Arabic word اختبار ("test") in the demo, and it looks like this:
[image: screenshot 2019-02-05 at 18 40 39]
RTL text would need to be represented using the HTML dir attribute (see Structural markup and right-to-left text in HTML, https://www.w3.org/International/questions/qa-html-dir). Also, Arabic letters have to be joined together. Putting each character in a separate <span> seems to break the joining of letters, but I've read that this can be solved by adding a zero-width joiner Unicode character (https://stackoverflow.com/a/7069789). How easy do you think it would be to do this? I would be glad to try to help, with some guidance.