add BOM detection, language guessing, and multi-byte char counts #865
Conversation
@stricklandrbls, I thought about that. We can just take the last bit, say
Also, I'm not sure what I need to edit within this text file I created using Greek characters. The BOM is accurately detected in the data editor, but the language is not. I then used the data editor to input the BOM magic number; the data editor detects the BOM correctly, but the language is 'unknown'. There is another file I created, again with Greek characters, where the BOM detection is correct but the language is detected as Korean.
BOM & Editor Encoding DiscrepancyWhen the BOM detection occurs, if it matches the available UTF encodings in the data editor (UTF-8 and UTF-16LE) then the editor's Edit Encoding setting should be changed to match. Currently it remains on Latin-1. BOM Description Mismatches w/ ProfilerIf the user edits the BOM bytes of a file, then opens the profiler to re-detect the BOM, those changes are not reflecting the in File Metrics panel. This persists even after the profiler display has closed. |
@stricklandrbls, the BOM in the file metrics is computed at session creation time. The BOM in the profiler is computed on demand and will therefore always reflect current changes. We can force a BOM sync at profile time to get agreement, but I think this is a case where the juice isn't worth the squeeze. Instead, I just won't show it in the File Metrics header at all, but we can use the BOM we get at session creation time to set the default encoding. That will be a nice touch. What do you think?
Force-pushed from 59152d0 to 2c6cb3a
@stricklandrbls, proposed changes made.
I tested the BOM and language detection functionality against the Omega Edit test files. The data editor displayed the correct BOM and language, so there must have been a binary data issue with the test files I created earlier. I'm still seeing some odd behavior with the data editor and the Omega Edit server, and I'm currently testing to see whether it's local to this branch only.
+1
src/dataEditor/dataEditorClient.ts
Outdated
data.language = createSessionResponse.hasLanguage()
  ? (createSessionResponse.getLanguage() as string)
  : 'unknown'
assert(data.language.length > 0, 'language is not set')
@mbeckerle, here is where the Data Editor is getting the detected language. This happens at session creation time and is detected by the Ωedit™ server (see: https://github.com/ctc-oss/omega-edit/blob/main/server/scala/serv/src/main/scala/com/ctc/omega_edit/grpc/EditorService.scala#L124). When the session is created in the extension, it sends a message to the WebView front-end with the details for the display.
Ok, so this code isn't actually doing any language detection. It's just asking the server for the language information. Same for BOM, and char counts.
Is there any way to request this information for a region within a file, or are these global/whole-file only? I would like to be able to select a region in the data and ask for the character set and human language and a few easy stats to be computed on that region.
In general, data files tend to be created by combining data having different characteristics. Anything you can learn at the whole-file level you should be able to learn about a specified sub-region of the data.
In this patch so far, you can profile a segment via the data profiler. The profiler will detect the BOM at the beginning of the file and do the character counts in the desired segment on demand. It doesn't currently re-calculate the language or the file-type, however, as these are done at session creation time right now (ref: https://github.com/ctc-oss/omega-edit/blob/main/server/scala/serv/src/main/scala/com/ctc/omega_edit/grpc/EditorService.scala#L115).
Language detection on a segment ought to be fairly straightforward to add to the data profiler. We can stop doing it at session creation and do it on demand with the profiler.
For content-type I think we just leave that as-is (done only at session creation time). If the content-type changes (maybe because we opened an empty file and are creating a file from whole cloth), I can add a refresh widget to refresh the content type based on file updates.
In summary, I agree that we ought to be able to do language detection on a segment and discontinue doing it at session creation time, and for content-type, I think a refresh mechanism will be the best bet.
To support these changes, I'll need to update Ωedit™ and cut a new release for the server-side support.
Some renaming and improved doc strings/comments would be helpful. Those are the only changes I'm requesting.
You can take the rest of this comment with a grain of salt, perhaps TL;DR.
I am not a UI programmer.
This code feels like mostly "glue logic" to me, and that's always something I like to see generated or minimized, not written by hand. It strikes me that this code consists of a structure definition of byteOrderMark and other metadata, and that the only actual "code" in it is establishing the subscriptions on changes to that structure for the visual presentation. The rest could all be generated from the structure.
Seems to me there is a pattern here for how to interact (controller) with the omega edit server (model), which has a bunch of details associated with it being an asynchronous server. Added to that is how to interact with the VSCode presentation layer (view). All of this should ideally be ruthlessly uniform, generated from declarative specs of what the model contains and of which UI components subscribe to and create changes of those model parts.
What I cannot see in this PR, which is after all, a small part of the system, is any separation of those pieces that makes any sense to me. It all feels jumbled, and snarled with the details of GPB for talking to the omega edit server, and HTTP-style post for talking to ... the UI?
But that's perhaps because I'm not working with the larger code body - I'm looking at trees and bushes, so can't see the forest.
Maybe I just need to add a model change that shows up visually on the display, to force myself to learn better how this all hangs together. (A suggestion on such a feature is welcome.)
This is, btw, just my usual reaction whenever I see the detail and intricacy required of UI-related code (though the omega edit server adds to the complexity in this case), and as I am a back-end/algorithm developer, you can take my comments with a grain of salt.
The promise of reactive UI "kit" was, I always hoped, to make this sort of thing far more declarative than it has been historically.
let data = {
  byteOrderMark: '',
  changeCount: 0,
  computedFileSize: 0,
These things need documentation. This is ts code, but they really need the equivalent of javadoc.
What are the enum values for your byteOrderMark?
The computedFileSize needs explanation. What is the difference vs. diskFileSize?
The vast bulk of data in the world will not have byte order marks at all. So is this just reporting whether the data starts with FEFF? or FFFE? and if not it is "unknown"? Is this byteOrderMark an enum? What are the possible values?
What is changeCount?
Explain that language is a guess at the human language being used, represented as .... what kind of identifier?
What is "type"?
What is undoCount? Pending undos available for undoing, or a count of how many were applied?
We use these things all over the place in the Data Editor. Maybe we actually need a glossary. Much of it is defined here but we'll need the glossary to live in the wiki on this project as the Ωedit™ stuff is just a subset of the variables used in the Data Editor.
ByteOrderMark: Here it's one of None, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE (enum defined here: https://github.com/ctc-oss/omega-edit/blob/main/core/src/include/omega_edit/fwd_defs.h#L76, turned from the enum into the string the server sends to the extension here: https://github.com/ctc-oss/omega-edit/blob/main/core/src/lib/utility.c#L319). We use it to set the default encoding setting in the Settings header, and we also display it in the File Metrics header.
ChangeCount: The number of applied changes (edit transactions).
ComputedFileSize: The file size with respect to the applied changes (can differ from the DiskFileSize).
Type: The detected Content-Type as determined by Ωedit™ (which in turn delegates content detection to Apache Tika).
UndoCount: The count of undos available for undoing.
We can create another issue for publishing a glossary.
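For illustration, such a glossary could also live as doc comments right on the metrics structure. The interface below is only a sketch based on the descriptions above, not the extension's actual types:

/** Sketch only -- field names mirror the glossary above, not the actual extension code. */
interface SessionMetrics {
  /** Detected BOM: 'none', 'UTF-8', 'UTF-16LE', 'UTF-16BE', 'UTF-32LE', or 'UTF-32BE'. */
  byteOrderMark: string
  /** Number of applied changes (edit transactions) in the session. */
  changeCount: number
  /** File size with the applied changes taken into account; may differ from diskFileSize. */
  computedFileSize: number
  /** Size of the file as it currently exists on disk. */
  diskFileSize: number
  /** Guessed human language as an ISO-639-1 code (e.g. 'el', 'ar'), or 'unknown'. */
  language: string
  /** Content-Type detected by the server (which delegates to Apache Tika). */
  type: string
  /** Count of undos available for undoing. */
  undoCount: number
}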
await this.panel.webview.postMessage({
  command: MessageCommand.profile,
  data: {
    startOffset: startOffset,
    length: length,
    byteProfile: byteProfile,
    numAscii: numAscii(byteProfile),
    characterCount: {
      byteOrderMark: characterCount.getByteOrderMark(),
why does a thing named characterCount have a getter named "getByteOrderMark"? This feels mis-named. It's not a characterCount, it's a whole block of metadata fields.
The BOM is related to the character counts as it informs how the bytes have been decoded into characters. Would characterCountInfo or characterCountData be better for the name?
computedFileSize: this.fileSize,
diskFileSize: this.fileSize,
fileName: this.fileToEdit,
computedFileSize: fileSize,
What's the difference between a computed file size and disk file size? Is this due to pending operations which might enlarge or shrink the data? Best to answer this with a comment where computedFileSize is defined in the code explaining what is meant by "computed".
That's correct. A glossary will help.
@@ -491,6 +512,16 @@ limitations under the License.
>{((numAscii / sum) * 100).toFixed(2)}</span
>
</label>
</div> | |||
<hr /> | |||
<div class="char-count"> |
Should not be named char-count. It may have originally been only that, but once you start extending it to hold other metadata it should be renamed.
What other metadata are you thinking would be added? I think these character count fields are final, at least as far as Unicode is concerned.
@@ -71,6 +72,9 @@ limitations under the License.
if ('type' in msg.data.data) {
  $fileMetrics.type = msg.data.data.type
?? msg.data.data ?? Try harder to name things with something useful for understanding what is being carried. msg.data makes sense. Why is the thing inside the data still just called "data"?
I agree, but this is a VS Code (legacy?) messaging thing. Not much we can do about this one.
$: $bytesPerRow = $displayRadix === RADIX_OPTIONS.Binary ? 8 : 16

window.addEventListener('message', (msg) => {
All this hand-crafted code feels like boilerplate that could be generated from one declarative line, e.g.:
byteOrderMark enum BE, LE
There are a bunch of things the omega edit server computes that, from the model-view-controller perspective, are part of the model of the data. They need to be moved over to the client side and made available to the windows that subscribe to changes of those values.
All of that feels like it should be generated code to me. There are dozens of subtle lines of hand-crafted code here that should be boringly uniform.
That should let you generate all the code to marshal this for all the places that have to exchange it, and for subscribing to changes of that information in a window that is listening for it.
Ωedit™ provides the model, the dataEditorClient is the controller, and all the Svelte stuff is the view. The dataEditorClient is doing quite a lot of taking data from the Ωedit™ client and populating VS Code messages for sending to the WebView. It's also taking the messages from the WebView and translating actions into Ωedit™ client calls.
We may be able to just take the ProtocolBuffer reply messages straight from the client and convert them (JSON-ify) in the controller and ship them to the view and perhaps we ought to in many cases, but I think that is outside the scope of this PR. Perhaps we create another issue for looking into slimming down the controller by just marshaling messages between the model and the view and vice-versa.
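As a rough sketch of that slimmer-controller idea (the function name is hypothetical, but google-protobuf messages do expose a toObject() accessor and the VS Code Webview exposes postMessage):

// Hypothetical glue: forward a server reply to the view with no per-field copying.
import type { Webview } from 'vscode'

function forwardToView(
  webview: Webview,
  command: string,
  reply: { toObject(): object } // any google-protobuf generated message
): Thenable<boolean> {
  // JSON-ify the reply and let the view pick it up by command name.
  return webview.postMessage({ command, data: reply.toObject() })
}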
{
  if ('byteOrderMark' in msg.data.data) {
    const { byteOrderMark } = msg.data.data
    if (byteOrderMark === 'UTF-8') $editorEncoding = 'utf-8'
I don't understand how a byte-order mark indicates UTF-8. The char code is FEFF. If that is encoded as 3 bytes of UTF-8, that's a good indication of the charset being UTF-8, but it means nothing about byte order since UTF-8 has no byte order.
FE FF is for UTF-16BE, and EF BB BF is the UTF-8 BOM. UTF-8 can typically be assumed, but it can also be explicit with the UTF-8 BOM (though you're right that byte order isn't even a thing in UTF-8 as it is in UTF-16 and UTF-32).
Anyway, setting the editor encoding to match the detected BOM by default I felt was a nice touch, and that's what's going on here.
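For reference, a minimal sketch of how detection from the leading bytes can work (the actual detection is done in Ωedit™ core, not in the extension; this is illustrative only):

// Sketch only -- the real detection lives in Ωedit™ core (core/src/lib/utility.c).
// Longer signatures are checked first: FF FE 00 00 (UTF-32LE) begins like FF FE (UTF-16LE).
function detectBOM(bytes: Uint8Array): string {
  const startsWith = (sig: number[]) =>
    bytes.length >= sig.length && sig.every((b, i) => bytes[i] === b)
  if (startsWith([0x00, 0x00, 0xfe, 0xff])) return 'UTF-32BE'
  if (startsWith([0xff, 0xfe, 0x00, 0x00])) return 'UTF-32LE'
  if (startsWith([0xef, 0xbb, 0xbf])) return 'UTF-8'
  if (startsWith([0xfe, 0xff])) return 'UTF-16BE'
  if (startsWith([0xff, 0xfe])) return 'UTF-16LE'
  return 'none'
}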
@mbeckerle, your overall criticism of the controller portion is fair. Neither @stricklandrbls nor I are "UI developers/designers", but we are computer scientists, willing to tackle any layer in the stack, and confident that whatever we develop will work. We've read Design Patterns; we've read Scott Meyers, Herb Sutter, Martin Fowler, and the like. We know about code smells, we run linters, profilers, formatters, bounds checkers, and documentation generators (for libraries at least), and we use BDD whenever we can (UI development has proven to be a challenge here).

But what's worked best in my career is beelining to "working first". Set aside the academics, optimizations, code poetry, and endless design meetings; just get the thing up and working, and then keep it working (swapping out the aircraft wings while in flight). At the end of the day, this is an application for users, and they don't care if the code is Shakespearean or bubblegum and duct tape, as long as it works. That's not to say there isn't room for great design, there is, but that comes over time as the full application takes shape and we can prioritize the pain points and address them along the way. Since UI development isn't something Robert and I have a ton of experience in (TypeScript and Svelte we learned on this project), experience told us that we're not going to nail the design right off the bat, but we do have road map items and a rough schedule that we intend to stick to.
Pulled down the PR and ran it in a Windows environment. The change works as described.
Force-pushed from a999d8b to 75bf9d5
…r counts to the profiler
Force-pushed from 75bf9d5 to 9e6be18
This adds Byte Order Mark (BOM) detection and language guessing to the file metrics header. It also adds character count data to the data profiler.
Above we see Arabic (ar in ISO-639-1) detected and no BOM.
Above we see Greek (el in ISO-639-1) detected and the UTF-16LE BOM.
Here we see the ISO-639-1 code expanded to the detected language.
The last set of figures are related to multi-byte character counts.
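As an aside, expanding an ISO-639-1 code to a display name can be done in the WebView with the built-in Intl.DisplayNames API; the snippet below is one way to do it, not necessarily what this PR ships:

// Sketch: turn an ISO-639-1 code such as 'el' or 'ar' into a human-readable name.
const languageNames = new Intl.DisplayNames(['en'], { type: 'language' })

function expandLanguage(code: string): string {
  // The guessed language may come back as 'unknown', which Intl cannot resolve.
  if (code === 'unknown') return 'unknown'
  return languageNames.of(code) ?? code
}

// expandLanguage('el') -> 'Greek', expandLanguage('ar') -> 'Arabic'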