
Performance of decodeUtf8 on non-ascii text #1

Open
andrewthad opened this issue Mar 7, 2018 · 5 comments

Comments

@andrewthad
Owner

The performance of decodeUtf8 is excellent on ASCII text, since the check can be vectorized to operate on a full machine word at a time, and nearly all branch predictions are correct. However, for non-ASCII text, I haven't put much effort into optimizing it. It could never compete with the decoding of ASCII text, but it could probably be much better than it is now.
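The word-at-a-time ASCII check mentioned above can be sketched roughly like this (the names here are illustrative, not the library's actual internals): a chunk of eight bytes is all ASCII exactly when no byte has its high bit set, so a single mask-and-compare on a 64-bit word tests eight bytes at once.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word64)

-- One high bit per packed byte.
highBitMask :: Word64
highBitMask = 0x8080808080808080

-- True when all eight bytes packed into the word are ASCII (< 0x80).
-- The fast path can walk the buffer a Word64 at a time with this test
-- and only fall back to byte-by-byte decoding when it fails.
allAscii64 :: Word64 -> Bool
allAscii64 w = w .&. highBitMask == 0
```

Because the branch almost always goes the same way on ASCII-heavy input, the predictor stays accurate, which is what makes the fast path cheap.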

@chessai
Collaborator

chessai commented Mar 30, 2018

Do you have any ideas for how it could be optimised? Why can you not reach the efficiency of decodeUtf8 from Data.Text?

I think it can be achieved with sheer willpower alone.

@andrewthad
Owner Author

andrewthad commented Mar 30, 2018

It could certainly match the performance of decodeUtf8 from Data.Text. Right now, I have some bounds checks that are redundant. It may be possible to eliminate them. Alternatively, it may improve things to make the code more concise. I think we could instead have a single helper function that handles two-byte, three-byte, and four-byte characters.
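The consolidation idea — one helper covering two-, three-, and four-byte sequences instead of three near-identical branches — could look something like this sketch (not the library's actual code; validation of the continuation tags is assumed to have happened already):

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Decode one multi-byte UTF-8 sequence. The caller passes the payload
-- mask for the leading byte (0x1F, 0x0F, or 0x07 for 2-, 3-, or 4-byte
-- sequences) and the continuation bytes; each continuation contributes
-- its low six bits. No validation is done here.
decodeMulti :: Word8 -> Word8 -> [Word8] -> Char
decodeMulti leadMask lead conts =
  chr (foldl step (fromIntegral (lead .&. leadMask)) conts)
  where
    step acc b = (acc `shiftL` 6) .|. fromIntegral (b .&. 0x3F)
```

For example, the two-byte sequence 0xC3 0xA9 decodes to U+00E9 ('é'), and the three-byte sequence 0xE2 0x82 0xAC decodes to U+20AC ('€'). Whether collapsing the branches this way actually helps would depend on how well GHC unrolls the fold versus the specialized per-length code.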

@chessai
Collaborator

chessai commented Mar 30, 2018

Is it possible to write something like 'isUtf8' (which we know can be made relatively efficient), and if that function returns true, make a pass over the text?

By the way, the willpower thing was a joke.

@chessai
Collaborator

chessai commented Mar 30, 2018

We have to perform the check for utf8 and then decode the character anyway, so maybe that would be an OK solution.

@andrewthad
Owner Author

Actually, that's already what it does. It just passes over the Bytes and checks to see if it is UTF-8. So, it's zero-copy unless there are disallowed code points present. In that case, we have to clean them up, which requires allocating a new bytearray.
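A simplified sketch of the single validation pass the zero-copy path relies on (names here are hypothetical, and this version skips the overlong-encoding and surrogate checks a real validator needs): if the scan succeeds, the decoded text can share the input bytes; only on failure does a cleanup copy get allocated.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Simplified structural UTF-8 check: leading byte determines how many
-- continuation bytes (tagged 10xxxxxx) must follow.
validUtf8 :: [Word8] -> Bool
validUtf8 [] = True
validUtf8 (b : bs)
  | b < 0x80  = validUtf8 bs   -- ASCII
  | b < 0xC2  = False          -- stray continuation or overlong lead
  | b < 0xE0  = conts 1 bs     -- two-byte sequence
  | b < 0xF0  = conts 2 bs     -- three-byte sequence
  | b < 0xF5  = conts 3 bs     -- four-byte sequence
  | otherwise = False
  where
    conts n rest =
      let (cs, rest') = splitAt n rest
      in  length cs == n && all isCont cs && validUtf8 rest'
    isCont c = c .&. 0xC0 == 0x80
```

The key point is that this pass only reads; allocation happens solely on the error path, so well-formed input stays zero-copy.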
