add BOM detection, language guessing, and multi-byte char counts #865
Conversation
@stricklandrbls, I thought about that. We can just take the last bit, say
Also, I'm not sure what I need to edit within this text file I created using Greek characters. The BOM is accurately detected in the data editor, but the language is not. I then used the data editor to input the BOM magic number; the data editor detects the BOM correctly, but the language is 'unknown'. There is another file I created, again with Greek characters, where the BOM detection is correct but the language is detected as Korean.
BOM & Editor Encoding DiscrepancyWhen the BOM detection occurs, if it matches the available UTF encodings in the data editor (UTF-8 and UTF-16LE) then the editor's Edit Encoding setting should be changed to match. Currently it remains on Latin-1. BOM Description Mismatches w/ ProfilerIf the user edits the BOM bytes of a file, then opens the profiler to re-detect the BOM, those changes are not reflecting the in File Metrics panel. This persists even after the profiler display has closed. |
@stricklandrbls, the BOM in the file metrics is computed at session creation time. The BOM in the profiler is computed on demand and will therefore always reflect current changes. We can force a BOM sync at profile time to get agreement, but I think this is a case where the juice isn't worth the squeeze. Instead, I just won't show it in the File Metrics header at all, but we can use the BOM we get at session creation time to set the default encoding. That will be a nice touch. What do you think?
Force-pushed from 59152d0 to 2c6cb3a
@stricklandrbls, proposed changes made.
I tested the BOM and language detection functionality against the Omega Edit test files. The data editor displayed the correct BOM and language, so there must have been a binary data issue with the test files I created earlier. I'm still seeing some odd behavior with the data editor and the Omega Edit server, and I'm currently testing to see whether it's local to this branch only.
+1
src/dataEditor/dataEditorClient.ts
Outdated
data.language = createSessionResponse.hasLanguage()
  ? (createSessionResponse.getLanguage() as string)
  : 'unknown'
assert(data.language.length > 0, 'language is not set')
@mbeckerle, here is where the Data Editor is getting the detected language. This happens at session creation time and is detected by the Ωedit™ server (see: https://github.com/ctc-oss/omega-edit/blob/main/server/scala/serv/src/main/scala/com/ctc/omega_edit/grpc/EditorService.scala#L124). When the session is created in the extension, it sends a message to the WebView front-end with the details for the display.
Ok, so this code isn't actually doing any language detection. It's just asking the server for the language information. Same for BOM, and char counts.
Is there any way to request this information for a region within a file, or are these global/whole-file only? I would like to be able to select a region in the data and ask for the character set and human language and a few easy stats to be computed on that region.
In general, data files tend to be created by combining data having different characteristics. Anything you can learn at the whole-file level you should be able to learn about a specified sub-region of the data.
In this patch so far, you can profile a segment via the data profiler. The profiler will detect the BOM at the beginning of the file and do the character counts in the desired segment on demand. It doesn't currently re-calculate the language or the file-type, however, as these are done at session creation time right now (ref: https://github.com/ctc-oss/omega-edit/blob/main/server/scala/serv/src/main/scala/com/ctc/omega_edit/grpc/EditorService.scala#L115).
Language detection on a segment ought to be fairly straightforward to add to the data profiler. We can stop doing it at session creation and do it on demand with the profiler.
For content-type I think we just leave that as-is (done only at session creation time). If the content-type changes (maybe because we opened an empty file and are creating a file from whole cloth), I can add a refresh widget to refresh the content type based on file updates.
In summary, I agree that we ought to be able to do language detection on a segment and discontinue doing it at session creation time, and for content-type, I think a refresh mechanism will be the best bet.
To support these changes, I'll need to update Ωedit™ and cut a new release for the server-side support.
Some renaming and improved doc strings/comments would be helpful. Those are the only changes I'm requesting.
You can take the rest of this comment with a grain of salt, perhaps TL;DR.
I am not a UI programmer.
This code feels like mostly "glue logic" to me, and that's always something I like to see generated or minimized, not written by hand. It strikes me that this code consists of a structure definition of byteOrderMark and other metadata, and that the only actual "code" in it is establishing the subscriptions on changes to that structure for the visual presentation. The rest could all be generated from the structure.
Seems to me there is a pattern here for how to interact (controller) with the omega edit server (model), which has a bunch of details associated with it being an asynchronous server. Added to that is how to interact with the VSCode presentation layer (view). All of this should ideally be ruthlessly uniform, generated from declarative specs of what the model contains and of which UI components subscribe to and create changes of those model parts.
What I cannot see in this PR, which is after all, a small part of the system, is any separation of those pieces that makes any sense to me. It all feels jumbled, and snarled with the details of GPB for talking to the omega edit server, and HTTP-style post for talking to ... the UI?
But that's perhaps because I'm not working with the larger code body - I'm looking at trees and bushes, so can't see the forest.
Maybe I just need to add a model change that shows up visually on the display, to force myself to learn better how this all hangs together. (A suggestion on such a feature is welcome.)
This is, btw, just my usual reaction whenever I see the detail and intricacy required of UI-related code (though the omega edit server adds to the complexity in this case), and as I am a back-end/algorithm developer, you can take my comments with a grain of salt.
The promise of reactive UI "kit" was, I always hoped, to make this sort of thing far more declarative than it has been historically.
let data = {
  byteOrderMark: '',
  changeCount: 0,
  computedFileSize: 0,
These things need documentation. This is ts code, but they really need the equivalent of javadoc.
What are the enum values for your byteOrderMark?
The computedFileSize needs explanation. What is the difference vs. diskFileSize?
The vast bulk of data in the world will not have byte order marks at all. So is this just reporting whether the data starts with FEFF? or FFFE? and if not it is "unknown"? Is this byteOrderMark an enum? What are the possible values?
What is changeCount?
Explain that language is a guess at the human language being used, represented as .... what kind of identifier?
What is "type"?
What is undoCount? Pending undos available for undoing, or a count of how many were applied?
We use these things all over the place in the Data Editor. Maybe we actually need a glossary. Much of it is defined here but we'll need the glossary to live in the wiki on this project as the Ωedit™ stuff is just a subset of the variables used in the Data Editor.
ByteOrderMark: Here it's one of None, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE (enum defined here: https://github.com/ctc-oss/omega-edit/blob/main/core/src/include/omega_edit/fwd_defs.h#L76, turned from the enum into the string the server sends to the extension here: https://github.com/ctc-oss/omega-edit/blob/main/core/src/lib/utility.c#L319). We use it to set the default encoding setting in the Settings header, and we also display it in the File Metrics header.
ChangeCount: The number of applied changes (edit transactions).
ComputedFileSize: The file size with respect to the applied changes (can differ from the DiskFileSize).
Type: The detected Content-Type as determined by Ωedit™ (which in turn delegates content detection to Apache Tika).
UndoCount: The count of undos available for undoing.
We can create another issue for publishing a glossary.
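For illustration, such a glossary could also live as doc comments right on the metrics structure. The interface below is only a sketch based on the descriptions above, not the extension's actual types:

/** Sketch only -- field names mirror the glossary above, not the actual extension code. */
interface SessionMetrics {
  /** Detected BOM: 'none', 'UTF-8', 'UTF-16LE', 'UTF-16BE', 'UTF-32LE', or 'UTF-32BE'. */
  byteOrderMark: string
  /** Number of applied changes (edit transactions) in the session. */
  changeCount: number
  /** File size with the applied changes taken into account; may differ from diskFileSize. */
  computedFileSize: number
  /** Size of the file as it currently exists on disk. */
  diskFileSize: number
  /** Guessed human language as an ISO-639-1 code (e.g. 'el', 'ar'), or 'unknown'. */
  language: string
  /** Content-Type detected by the server (which delegates to Apache Tika). */
  type: string
  /** Count of undos available for undoing. */
  undoCount: number
}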
await this.panel.webview.postMessage({
  command: MessageCommand.profile,
  data: {
    startOffset: startOffset,
    length: length,
    byteProfile: byteProfile,
    numAscii: numAscii(byteProfile),
    characterCount: {
      byteOrderMark: characterCount.getByteOrderMark(),
why does a thing named characterCount have a getter named "getByteOrderMark"? This feels mis-named. It's not a characterCount, it's a whole block of metadata fields.
The BOM is related to the character counts as it informs how the bytes have been decoded into characters. Would characterCountInfo or characterCountData be better for the name?
computedFileSize: this.fileSize,
diskFileSize: this.fileSize,
fileName: this.fileToEdit,
computedFileSize: fileSize,
What's the difference between a computed file size and disk file size? Is this due to pending operations which might enlarge or shrink the data? Best to answer this with a comment where computedFileSize is defined in the code explaining what is meant by "computed".
That's correct. A glossary will help.
@@ -491,6 +512,16 @@ limitations under the License.
>{((numAscii / sum) * 100).toFixed(2)}</span
>
</label>
</div> | |||
<hr /> | |||
<div class="char-count"> |
Should not be named char-count. It may have originally been only that, but once you start extending it to hold other metadata it should be renamed.
What other metadata are you thinking would be added? I think these character count fields are final, at least as far as Unicode is concerned.
@@ -71,6 +72,9 @@ limitations under the License.
if ('type' in msg.data.data) {
  $fileMetrics.type = msg.data.data.type
?? msg.data.data ?? Try harder to name things with something useful for understanding what is being carried. msg.data makes sense. Why is the thing inside the data still just called "data"?
I agree, but this is a VS Code (legacy?) messaging thing. Not much we can do about this one.
$: $bytesPerRow = $displayRadix === RADIX_OPTIONS.Binary ? 8 : 16

window.addEventListener('message', (msg) => {
All this hand-crafted code feels like boilerplate that could be generated from one declarative line, e.g.:
byteOrderMark enum BE, LE
There are a bunch of things the omega edit server computes that, from the model-view-controller perspective, are part of the model of the data. They need to be moved over to the client side and made available to the windows that subscribe to changes of those values.
All of that feels like it should be generated code to me. There are dozens of subtle lines of hand-crafted code here that should be boringly uniform.
That should let you generate all the code to marshal this for all the places that have to exchange it, and for subscribing to changes of that information in a window that is listening for it.
Ωedit™ provides the model, the dataEditorClient is the controller, and all the Svelte stuff is the view. The dataEditorClient is doing quite a lot of taking data from the Ωedit™ client and populating VS Code messages for sending to the WebView. It's also taking the messages from the WebView and translating actions into Ωedit™ client calls.
We may be able to just take the ProtocolBuffer reply messages straight from the client and convert them (JSON-ify) in the controller and ship them to the view and perhaps we ought to in many cases, but I think that is outside the scope of this PR. Perhaps we create another issue for looking into slimming down the controller by just marshaling messages between the model and the view and vice-versa.
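As a rough sketch of that slimmer-controller idea (the function name is hypothetical, but google-protobuf messages do expose a toObject() accessor and the VS Code Webview exposes postMessage):

// Hypothetical glue: forward a server reply to the view with no per-field copying.
import type { Webview } from 'vscode'

function forwardToView(
  webview: Webview,
  command: string,
  reply: { toObject(): object } // any google-protobuf generated message
): Thenable<boolean> {
  // JSON-ify the reply and let the view pick it up by command name.
  return webview.postMessage({ command, data: reply.toObject() })
}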
{
  if ('byteOrderMark' in msg.data.data) {
    const { byteOrderMark } = msg.data.data
    if (byteOrderMark === 'UTF-8') $editorEncoding = 'utf-8'
I don't understand how a byte-order mark indicates UTF-8. The char code is FEFF. If that is encoded as 3 bytes of UTF-8, that's a good indication of the charset being UTF-8, but it means nothing about byte order since UTF-8 has no byte order.
FE FF is for UTF-16BE, and EF BB BF is the UTF-8 BOM. UTF-8 can typically be assumed, but it can also be explicit with the UTF-8 BOM (though you're right that byte order isn't even a thing in UTF-8 as it is in UTF-16 and UTF-32).
Anyway, setting the editor encoding to match the detected BOM by default I felt was a nice touch, and that's what's going on here.
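For reference, a minimal sketch of how detection from the leading bytes can work (the actual detection is done in Ωedit™ core, not in the extension; this is illustrative only):

// Sketch only -- the real detection lives in Ωedit™ core (core/src/lib/utility.c).
// Longer signatures are checked first: FF FE 00 00 (UTF-32LE) begins like FF FE (UTF-16LE).
function detectBOM(bytes: Uint8Array): string {
  const startsWith = (sig: number[]) =>
    bytes.length >= sig.length && sig.every((b, i) => bytes[i] === b)
  if (startsWith([0x00, 0x00, 0xfe, 0xff])) return 'UTF-32BE'
  if (startsWith([0xff, 0xfe, 0x00, 0x00])) return 'UTF-32LE'
  if (startsWith([0xef, 0xbb, 0xbf])) return 'UTF-8'
  if (startsWith([0xfe, 0xff])) return 'UTF-16BE'
  if (startsWith([0xff, 0xfe])) return 'UTF-16LE'
  return 'none'
}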
@mbeckerle, your overall criticism of the controller portion is fair. Neither @stricklandrbls nor I are "UI developers/designers", but we are computer scientists, willing to tackle any layer in the stack, and confident that whatever we develop will work. We've read Design Patterns; we've read Scott Meyers, Herb Sutter, Martin Fowler, and the like. We know about code smells, we run linters, profilers, formatters, bounds checkers, and documentation generators (for libraries at least), and we use BDD whenever we can (UI development has proven to be a challenge here).

But what's worked best in my career is beelining to "working first". Set aside the academics, optimizations, code poetry, and endless design meetings; just get the thing up and working, and then keep it working (swapping out the aircraft wings while in flight). At the end of the day, this is an application for users, and they don't care if the code is Shakespearean or bubblegum and duct tape, as long as it works. That's not to say there isn't room for great design, there is, but that comes over time as the full application takes shape and we can prioritize the pain points and address them along the way. Since UI development isn't something Robert and I have a ton of experience in (TypeScript and Svelte we learned on this project), experience told us that we're not going to nail the design right off the bat, but we do have road map items and a rough schedule that we intend to stick to.
Pulled down the PR and ran it in a Windows environment. The change works as described.
Force-pushed from a999d8b to 75bf9d5
…r counts to the profiler
Force-pushed from 75bf9d5 to 9e6be18
This adds Byte Order Mark (BOM) detection and language guessing to the file metrics header. It also adds character count data to the data profiler.
Above we see Arabic (ar in ISO-639-1) detected and no BOM.
Above we see Greek (el in ISO-639-1) detected and the UTF-16LE BOM.
Here we see the ISO-639-1 code expanded to the detected language.
The last set of figures are related to multi-byte character counts.
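As an aside, expanding an ISO-639-1 code to a display name can be done in the WebView with the built-in Intl.DisplayNames API; the snippet below is one way to do it, not necessarily what this PR ships:

// Sketch: turn an ISO-639-1 code such as 'el' or 'ar' into a human-readable name.
const languageNames = new Intl.DisplayNames(['en'], { type: 'language' })

function expandLanguage(code: string): string {
  // The guessed language may come back as 'unknown', which Intl cannot resolve.
  if (code === 'unknown') return 'unknown'
  return languageNames.of(code) ?? code
}

// expandLanguage('el') -> 'Greek', expandLanguage('ar') -> 'Arabic'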