
Further optimize the compose code #8

Open
harendra-kumar opened this issue Nov 19, 2016 · 4 comments


harendra-kumar commented Nov 19, 2016

The decompose code is well optimized, but compose still has a lot of room for optimization. Though its performance is close to that of utf8proc, which we were using earlier, it is still far from ICU's compose performance.

Currently we are not using the quick check properties of the Unicode character database. We could explore using them to speed up the case when the string is already in composed or almost-composed form. Some related links:

@Bodigrim (Collaborator) commented:

Just random thoughts:

  1. Even if we perform full decomposition + full composition, I would expect NFC times to be roughly 2x of NFD. But it is more than 5x for some languages (German, English), which hopefully hints at a missed optimization opportunity.

  2. My understanding is that faster composition algorithms require more flexible access to the text, incompatible with the streaming approach. It sounds like they involve speculative writes with the possibility of backtracking once we encounter a combining (or composable) character.

@harendra-kumar (Member, Author) commented:

I think we can go a long way even with full decomposition and composition. I wrote composeChar just to complete the library, without any thought to performance: I just copied the code from decomposition and somehow made it work.

  1. The recursive call to composeChar in decomposeAll is pretty expensive; we can possibly remove it. It's there to go through the Hangul path in case the next char is Hangul. You can try this to find the impact:
                _ -> do
                    -- XXX this recursive call here hurts performance
                    -- We can make the hangul composition a separate function
                    -- and call that or reorder here based on the type of char
                    -- (i', st', rbuf', jb') <- composeChar mode arr i st rbuf jb x
                    (i', st', rbuf') <- reorder marr i st rbuf ch
                    decomposeAll arr i' st' rbuf' jb xs
  2. The reorder function is perhaps not ideal; it can possibly be made much faster. You can find the cost of just the decomposition in composeChar by defining composePair as:

                composePair c _ = Just c

It is pretty expensive even with this null definition of composePair.
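The comment in the snippet above suggests making Hangul composition a separate function. It is a small piece of arithmetic, the standard algorithm from the Unicode standard (Chapter 3 / UAX #15); the sketch below uses a hypothetical function name and is not the library's actual code:

```haskell
import Data.Char (chr, ord)

-- Constants from the Unicode standard's Hangul syllable algorithm.
sBase, lBase, vBase, tBase, lCount, vCount, tCount, sCount :: Int
sBase = 0xAC00; lBase = 0x1100; vBase = 0x1161; tBase = 0x11A7
lCount = 19; vCount = 21; tCount = 28; sCount = 11172

-- Compose a leading consonant (L) with a vowel (V) into an LV syllable,
-- or an LV syllable with a trailing consonant (T) into an LVT syllable.
composeHangulPair :: Char -> Char -> Maybe Char
composeHangulPair c1 c2
  -- L + V -> LV syllable
  | li >= 0 && li < lCount && vi >= 0 && vi < vCount =
      Just (chr (sBase + (li * vCount + vi) * tCount))
  -- LV + T -> LVT syllable (an LV syllable has a T index of 0)
  | si >= 0 && si < sCount && si `mod` tCount == 0 && ti > 0 && ti < tCount =
      Just (chr (ord c1 + ti))
  | otherwise = Nothing
  where
    li = ord c1 - lBase
    vi = ord c2 - vBase
    si = ord c1 - sBase
    ti = ord c2 - tBase
```

Since it is pure index arithmetic with no table lookups, calling it only when the char is in the Hangul range should be cheap.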

I guess quick check based normalization can be performed as we are writing to an array. It should roughly go like this: we look up the current char in the quick check table; if the character is NO or MAYBE, or is in the composition exclusion table, then we backtrack to the last starter. We have access to the whole array to which we have already written the previous chars, so we should be able to backtrack in that array easily.
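A minimal sketch of that scan, on lists for clarity. The tiny NFC_QC table and the starter test are hypothetical stand-ins for data generated from the UCD, and the list split stands in for moving the write position back in the destination array:

```haskell
import qualified Data.Map.Strict as Map

data QC = QCYes | QCNo | QCMaybe deriving (Eq, Show)

-- Tiny hypothetical sample of the NFC_QC table; the real table would be
-- generated from DerivedNormalizationProps.txt.
nfcQC :: Map.Map Char QC
nfcQC = Map.fromList [('\x0300', QCMaybe), ('\x0340', QCNo)]

lookupQC :: Char -> QC
lookupQC c = Map.findWithDefault QCYes c nfcQC

-- Hypothetical starter test; the real one checks combining class 0.
isStarter :: Char -> Bool
isStarter c = c < '\x0300' || c > '\x036F'

-- Split the input into a prefix that is certainly already NFC and a
-- remainder (beginning at the last starter) that needs the full
-- compose pass.
splitAtUnsafe :: String -> (String, String)
splitAtUnsafe = go [] []
  where
    -- safe: confirmed output (reversed); pend: chars since the last
    -- starter (reversed)
    go safe pend [] = (reverse (pend ++ safe), [])
    go safe pend (c:cs)
      | lookupQC c /= QCYes = (reverse safe, reverse pend ++ (c:cs))
      | isStarter c         = go (pend ++ safe) [c] cs
      | otherwise           = go safe (c:pend) cs
```

For text that is already NFC (the common case) the whole input ends up in the "safe" prefix and the expensive decompose/reorder/compose machinery is never entered.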

We will have to generate the quick check properties file from the UCD in the ucd2haskell script. The data is in http://www.unicode.org/Public/UCD/latest/ucd/DerivedNormalizationProps.txt .
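Extracting the NFC_QC entries from that file is straightforward; a sketch of the kind of parser ucd2haskell would need (the Entry type and names are illustrative, not the script's actual ones):

```haskell
import Numeric (readHex)

-- One NFC_QC record: an inclusive code point range and its value
-- ('N' for No, 'M' for Maybe).
data Entry = Entry { lo :: Int, hi :: Int, value :: Char }
  deriving (Eq, Show)

-- Line format (simplified):
--   <codepoint>[..<codepoint>] ; <property>; <value> # comment
parseLine :: String -> Maybe Entry
parseLine line =
  case splitOn ';' (takeWhile (/= '#') line) of
    [cps, prop, val] | trim prop == "NFC_QC" ->
        let (a, b) = range (trim cps)
        in Just (Entry a b (head (trim val)))
    _ -> Nothing
  where
    trim = dropWhile (== ' ') . reverse . dropWhile (== ' ') . reverse
    range s = case break (== '.') s of
      (x, '.':'.':y) -> (hex x, hex y)  -- "0340..0341"
      (x, _)         -> (hex x, hex x)  -- single code point
    hex s = case readHex s of [(n, _)] -> n; _ -> 0
    splitOn c s = case break (== c) s of
      (x, _:rest) -> x : splitOn c rest
      (x, [])     -> [x]
```

Comment lines, blank lines, and other properties in the same file all fall through to Nothing, so mapping parseLine over the file's lines and collecting the Justs yields exactly the NFC_QC table.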

In addition to the links in the issue description above, the normalization forms doc has a lot of useful information, especially these sections:

@harendra-kumar (Member, Author) commented:

Another thing: composePair is very inefficient because the patterns are not sorted, so matching becomes a linear search. We can sort them on the first argument and then on the second argument, and then I hope GHC would do the right thing.
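One way to make the ordering explicit, rather than hoping the pattern-match compiler exploits it, is to pack the pair into a single Int key and do an ordered lookup. This is only an illustration of the idea, with a tiny hypothetical sample table instead of the generated one:

```haskell
import qualified Data.Map.Strict as Map
import Data.Char (ord)

-- Pack both characters into one Int so a single comparison drives
-- the search; 0x110000 is the number of Unicode code points.
packPair :: Char -> Char -> Int
packPair c1 c2 = ord c1 * 0x110000 + ord c2

-- Hypothetical two-entry sample; the real table is generated from
-- the UCD and is kept sorted by construction inside the Map.
compositions :: Map.Map Int Char
compositions = Map.fromList
  [ (packPair 'A' '\x0300', '\x00C0')  -- A + combining grave -> À
  , (packPair 'E' '\x0301', '\x00C9')  -- E + combining acute -> É
  ]

composePair :: Char -> Char -> Maybe Char
composePair c1 c2 = Map.lookup (packPair c1 c2) compositions
```

A flat sorted array with binary search (or the two-level lookup-table scheme used elsewhere in the library) would likely be more cache-friendly than a Map of boxed nodes; the Map just keeps the sketch short.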


Bodigrim commented May 1, 2020

I do not think that composePair is a linear search; profiling assigns only 2% of the runtime to it. My hypothesis is that it is already a binary search, which is fast per se but spoils CPU caches.
