
Offset encoding in LSIF dumps #34

Open

dbaeumer opened this issue Apr 12, 2019 · 1 comment

dbaeumer (Member) commented Apr 12, 2019

The LSP defines that character offsets within a line (e.g. the character property of a Position) are indices into the line encoded as UTF-16. Strings are represented in memory as UTF-16 in programming languages like Java, JavaScript and .NET. However, newer languages use UTF-8 as the in-memory string encoding. This causes some friction in the LSP (see the illustration after the list below), mainly because:

  • implementers of clients and servers are not aware of this
  • there is no way right now to detect that the client and server are using different encodings.
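
To make the divergence concrete, here is a minimal TypeScript illustration (not tied to any particular client or server; `TextEncoder` is a standard web/Node API) showing how the same position yields different offsets under the two encodings:

```ts
// '😀' (U+1F600) occupies 2 UTF-16 code units but 4 UTF-8 bytes.
const line = "a😀b";

// Offset of 'b' counted in UTF-16 code units (what LSP mandates today):
const utf16Offset = line.indexOf("b"); // 3 = 1 ('a') + 2 (emoji)

// Offset of 'b' counted in UTF-8 bytes:
const utf8Offset = new TextEncoder().encode(line.slice(0, utf16Offset)).length; // 5 = 1 + 4

console.log(utf16Offset, utf8Offset); // 3 5
```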

There is a push in LSP right now to improve this (no concrete action has been decided yet), but we should try to avoid the same problem in LSIF. So I would like to discuss possible options:

  1. We pick one encoding that is supported in LSIF. This doesn't have to be UTF-16, but to my knowledge there is no best pick: UTF-32 would be great since it is a fixed-width encoding, but almost no programming language supports it, so it would require conversion on all sides; UTF-16 is bad for newer programming languages, UTF-8 for more traditional ones. Byte offsets would need to be defined relative to an encoding as well, since we render characters in the UI, not bytes. So no actual benefit.
  2. We allow creating dumps with different encodings and store the chosen encoding in the metadata. This means that someone needs to convert if client and server don't speak the same encoding. The tendency in LSP is currently to make servers do the conversion, since they usually have access to the files. A web interface, for example, would have a hard time doing the conversion for a find-all-references result, since it would need to fetch the content of every file mentioned in the result. Note that this also requires the file content to be part of the dump.
  3. We allow storing ranges in n encodings (practically UTF-8 and UTF-16). This doesn't mean that every range is duplicated: as long as all characters on the line before the character offset are below ASCII 127, the offsets are the same. So we could attach the converted ranges somehow to the original range (see the conversion sketch after this list).
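
A minimal sketch of the observation in option 3 (the function name is illustrative, not part of LSIF; it assumes the offset falls on a character boundary):

```ts
// If every character on the line before the offset is ASCII, the UTF-8 and
// UTF-16 offsets coincide and no conversion is needed.
function utf16ToUtf8Offset(lineText: string, utf16Offset: number): number {
  const prefix = lineText.slice(0, utf16Offset);
  if (/^[\x00-\x7F]*$/.test(prefix)) {
    return utf16Offset; // fast path: pure-ASCII prefix, offsets are identical
  }
  // Slow path: re-encode the prefix and count its UTF-8 bytes.
  return new TextEncoder().encode(prefix).length;
}
```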

I tend to go with a combination of 2. and 3.: we store the encodings with the dump. The first encoding is the primary encoding the ranges are in. All other encodings are secondary, and their values can be reached from the primary range via an edge like utf-8 or utf-16 (sketched below).
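
A hypothetical sketch of what that could look like in a dump, written here as TypeScript object literals (the positionEncodings property and the utf-8 edge label are invented for illustration; the current spec defines neither). The offsets reuse the "a😀b" example above: the range around 'b' is [3, 4) in UTF-16 code units and [5, 6) in UTF-8 bytes.

```ts
const dump = [
  // metaData lists all encodings; the first is primary and is what plain
  // range vertices are expressed in.
  { id: 1, type: "vertex", label: "metaData", positionEncodings: ["utf-16", "utf-8"] },
  // Primary range: character offsets in UTF-16 code units.
  { id: 10, type: "vertex", label: "range", start: { line: 0, character: 3 }, end: { line: 0, character: 4 } },
  // Secondary range: the same text span expressed in UTF-8 byte offsets.
  { id: 11, type: "vertex", label: "range", start: { line: 0, character: 5 }, end: { line: 0, character: 6 } },
  // Hypothetical edge named after the secondary encoding.
  { id: 12, type: "edge", label: "utf-8", outV: 10, inV: 11 },
];
```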

LukaszMendakiewicz commented

Do we believe 3 is worth it? It would mean more data needs to be stored, and LSIF exporters would need to keep track of the encodings on a given line. Depending on how the data providers for such exporters are layered, it could even mean re-encoding every line that has a salient position to check whether the result differs between UTF-8 and UTF-16 -- this would be a noticeable perf hit, given that there are LSIF items for almost every line of code.
FWIW our C++ language service is migrating all internal representations to UTF-8 and this language is not usually classified as "newer" :).

My preference currently would be to allow selecting either UTF-8 or UTF-16 encoding per dump, with the LSIF metadata specifying which one is used. So basically your option 2, but with the further restriction to those two encodings.

The downside you identify for this approach is the web-interface example that would need to do on-the-fly conversion (statistically in 50% of cases) -- however, conversion in such an interface could be done lazily, i.e. only when the results are actually needed, so the perf hit would be minimal.
Consider a web interface that uses UTF-16 internally while the LSIF file it loads is UTF-8 based.
This means that, e.g., for every hover request a single conversion from UTF-16 to UTF-8 is needed to locate the appropriate position in the graph, on a line buffer that is already loaded (since it is being presented) anyway. For "batch" requests like FAR (find all references), I also don't see a problem: a tabular list of results can be limited to just a line number, which should be sufficient for users, and any time a single result is selected to jump to (or even just for visualization à la VS Code), the actual text line needs to be loaded anyway and can be encoding-converted at that point.
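
A sketch of the reverse conversion such a client would run lazily once the line text is loaded (names are illustrative, not an existing API; assumes the byte offset falls on a character boundary):

```ts
// Convert a UTF-8 byte offset from the dump into a UTF-16 column for the
// editor: decode the prefix up to the byte offset and count its code units.
function utf8ToUtf16Offset(lineText: string, utf8Offset: number): number {
  const bytes = new TextEncoder().encode(lineText);
  const prefix = new TextDecoder().decode(bytes.subarray(0, utf8Offset));
  return prefix.length;
}
```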

Are there scenarios where character offsets would be needed on the LSIF-consuming side without wanting to load the actual source files?

@dbaeumer dbaeumer added the feature-request Request for new features or functionality label Oct 19, 2021
@dbaeumer dbaeumer added this to the Backlog milestone Oct 19, 2021