17. UNICODE, UTF-8 and EMOJIS

Computers process information digitally as bits. Bits are grouped into bytes, and bytes make up various data types. Most of these types are different ways of representing a number; for example, the characters in a string are represented as numbers behind the scenes.

In our project, ordinary Unicode text is supported out of the box by the Go language platform. However, Llama models can generate emojis as sequences of byte tokens, and for these the built-in support might not be enough.

This project can also be compiled for and run on multiple operating systems, such as Windows, Linux, and macOS. The application provides a CLI (command line interface). Linux and macOS terminals can generally render Unicode emojis, but on Windows some terminals render them only partially and some cannot render them at all.

Because we don't want the user (especially on Windows) to see only a series of question marks instead of emoji characters, we have added the ability to detect known emojis and code points. With it, our project can print the known names of emojis and code points on the console, so the user can read the name even when the emoji glyph cannot be displayed.

17.1. UNICODE

In the early days of digital computing, in the early 1960s, the American Standard Code for Information Interchange (ASCII) was defined to represent the commonly used characters of English as 7-bit numbers in the range 0-127, each of which fits in one byte.

Over time, computing became widespread and globalized. The need to support languages other than English emerged, and the variety of characters grew. This problem was partially solved with language-specific encodings and character tables, which typically use the values 128-255 left free by ASCII to represent additional characters in a single byte. These tables are also known as code pages.

As time passed, the ASCII standard and code pages became insufficient. A character encoding method employing a multi-byte representation was needed.

In the late 1980s, a group of individuals from Xerox and Apple started to investigate creating a universal character set. Then, in the early 1990s, the Unicode Consortium (Unicode, Inc.) was founded and The Unicode Standard was born.

From Wikipedia: Unicode text is processed and stored as binary data using one of several encodings, which define how to translate the standard's abstracted codes for characters into sequences of bytes. The Unicode Standard itself defines three encodings: UTF-8, UTF-16, and UTF-32, though several others exist. Of these, UTF-8 is the most widely used by a large margin, in part due to its backwards-compatibility with ASCII.

In Unicode, characters are represented as code points (in the Go language, code points are called runes).

Sources:

17.2. UTF-8

Now, with Unicode, we have a character set consisting of code points that may not fit in a single byte. We need an encoding method. There are UTF-8, UTF-16, UTF-32, and others. Nowadays, UTF-8 is the most commonly used one.

UTF-8 is a variable-length character encoding standard. It represents each code point with one to four bytes, depending on its value. The 128 ASCII characters are encoded as a single byte with the same value as in ASCII, which makes UTF-8 backward compatible with the ASCII standard.

For more detail, see the source videos listed below.

Sources:

17.3. Emojis

Emojis are Unicode characters that are rendered as special pictograms (glyphs). Most emojis are encoded with multiple bytes, and several emojis can be combined with the Zero-width joiner (ZWJ) to form what is displayed as a single emoji.

Sources:

17.3.1. Emoji Samples

The following samples are used in the TestSimulatedEmojiOutputMultipleCompositeEmojis unit test function in cmd/main_test.go:

| Link | Icon | Name | Unicode | Bytes |
|---|---|---|---|---|
| | "" | ZWJ: ZERO WIDTH JOINER | \U0000200D | (0xE2 0x80 0x8D) |
| iEmoji Link | 👨 | :man: | \U0001F468 | (0xF0 0x9F 0x91 0xA8) |
| iEmoji Link | 👨 | :man: + :ZWJ: | \U0001F468 + \U0000200D | (0xF0 0x9F 0x91 0xA8), (0xE2 0x80 0x8D) |
| iEmoji Link, iEmoji Link | 👨‍👩 = 👨 + :ZWJ: + 👩 | :man: + :ZWJ: + :woman: | \U0001F468 + \U0000200D + \U0001F469 | (0xF0 0x9F 0x91 0xA8), (0xE2 0x80 0x8D), (0xF0 0x9F 0x91 0xA9) |
| | 👨‍👩‍👧 = 👨 + :ZWJ: + 👩 + :ZWJ: + 👧 | :family_man_woman_girl: = :man: + :ZWJ: + :woman: + :ZWJ: + :girl: | \U0001F468 + \U0000200D + \U0001F469 + \U0000200D + \U0001F467 | (0xF0 0x9F 0x91 0xA8), (0xE2 0x80 0x8D), (0xF0 0x9F 0x91 0xA9), (0xE2 0x80 0x8D), (0xF0 0x9F 0x91 0xA7) |
| iEmoji Link | 👨‍👩‍👧‍👦 = 👨 + :ZWJ: + 👩 + :ZWJ: + 👧 + :ZWJ: + 👦 | :family_man_woman_girl_boy: = :man: + :ZWJ: + :woman: + :ZWJ: + :girl: + :ZWJ: + :boy: | \U0001F468 + \U0000200D + \U0001F469 + \U0000200D + \U0001F467 + \U0000200D + \U0001F466 | (0xF0 0x9F 0x91 0xA8), (0xE2 0x80 0x8D), (0xF0 0x9F 0x91 0xA9), (0xE2 0x80 0x8D), (0xF0 0x9F 0x91 0xA7), (0xE2 0x80 0x8D), (0xF0 0x9F 0x91 0xA6) |

17.3.2. Llama Emoji Generation

Large Language Models like Llama, and most other NLP (natural language processing) systems, use tokenization to represent words or word pieces, and the generative ones produce their output token by token. However, their tokenizers have a limited number of items in their vocabulary, and these vocabularies mostly don't include emojis.

Llama's tokenizer model supports emojis by means of byte-type tokens. For example, if the Llama model wants to generate the "👨" (:man:) emoji, it generates this emoji byte by byte.

In our example, the "👨" (:man:) emoji is encoded in UTF-8 with 4 bytes: 0xF0, 0x9F, 0x91, 0xA8. The Llama model first generates the "<0xF0>" byte token, then "<0x9F>", "<0x91>", and "<0xA8>", in that order. After each byte token is generated, our project's InferenceEngine.TokenToString(...) method uses utf8.Valid(...) to check whether enough byte tokens have been generated to represent an emoji, as follows.

If the newly generated token is a byte-type token, it is appended to the decodingContext.waitingBytes array.

If utf8.Valid(...) returns true, this method converts the waiting bytes to a rune, then to a string. If it returns false, in other words, if the waiting bytes do not form a valid UTF-8 byte sequence (this includes an unfinished sequence that may be completed by upcoming bytes), the application waits for the next token to be generated.

In addition, when the waiting byte sequence is valid UTF-8, the inference.processEmoji(...) function checks whether the sequence represents a known emoji or code point with a human-readable name.

This detection process seems simple at first sight, but emojis composed of multiple emojis joined with the Zero-width joiner make it harder and more complex. Our project handles these cases as well.

From src/inference/tokenize.go:

func (ie *InferenceEngine) TokenToString(tokenId model.TokenId, decodingContext *generationDecodingContext) (token model.TokenPiece, resultString string, addedToWaiting bool) {
    ...
    if token.IsByte {
        // Byte tokens are buffered until they form a valid UTF-8 sequence.
        if decodingContext.waitingBytes == nil {
            decodingContext.waitingBytes = make([]byte, 0)
        }
        decodingContext.waitingBytes = append(decodingContext.waitingBytes, token.ByteFallback...)
        if utf8.Valid(decodingContext.waitingBytes) {
            // Decode the completed rune and remove its bytes from the buffer.
            r, rsize := utf8.DecodeRune(decodingContext.waitingBytes)
            decodingContext.waitingBytes = decodingContext.waitingBytes[rsize:]
            // Hand the rune to processEmoji, which deals with known emoji and code point names.
            resultString += processEmoji(decodingContext, r)
        } else {
            // Not (yet) valid UTF-8: wait for the next byte token.
            addedToWaiting = true
        }
        return
    } else {
        ...
    }
}
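The body of processEmoji is not shown in this excerpt. As a rough illustration of the general idea of mapping code points to readable names, the sketch below uses a tiny hypothetical lookup table. The names knownNames and describe, and the table contents, are assumptions made for this example and do not reflect the project's actual data or logic.

```go
package main

import (
	"fmt"
	"strings"
)

// knownNames is a tiny, hypothetical lookup table used only for this sketch.
var knownNames = map[rune]string{
	0x200D:  "ZWJ",
	0x1F468: "man",
	0x1F469: "woman",
	0x1F467: "girl",
}

// describe lists the code points of s, using a readable name where one is
// known and the U+XXXX notation otherwise.
func describe(s string) string {
	var parts []string
	for _, r := range s {
		if name, ok := knownNames[r]; ok {
			parts = append(parts, ":"+name+":")
		} else {
			parts = append(parts, fmt.Sprintf("%U", r))
		}
	}
	return strings.Join(parts, " + ")
}

func main() {
	couple := "\U0001F468\u200D\U0001F469"
	fmt.Println(describe(couple)) // :man: + :ZWJ: + :woman:
}
```

A per-rune lookup like this cannot name a composite such as :family_man_woman_girl: on its own; recognizing composites requires tracking state across several runes, which is the extra complexity mentioned above.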