feature: implement EngineData parsing #43
Conversation
First, thank you for working on the PR.
I'll be honest: I have been building an EngineData parser myself. I began working on it around the 27th of July, a couple of days after you volunteered to write the PR.
Please understand that I had no intention of undermining your hard work. I had been planning to attack the problem myself for a long time, and your initial suggestion motivated me to make one last attempt. I apologize for not telling you this sooner. I didn't want to stop you in case my solution failed, and I did not anticipate that you would submit a PR in such a short time.
Correctness
I found two bugs in your parser; please refer to my other comments. The first is trivial to fix and not critical, since it doesn't affect well-formed EngineData documents.
However, the second bug (Unicode decoding) is a blocker. Because our use case involves Unicode text (CJK characters), we need to decode Unicode strings correctly. (My parser appears to handle them correctly, but I'd be grateful for bug reports.)
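For context, EngineData string literals are UTF-16 with a byte-order mark (big-endian in the files I've seen). A minimal big-endian decoder might look like the following sketch; the function name and the BOM assumption are mine, not taken from either parser:

```typescript
// Sketch of UTF-16BE decoding for EngineData string literals (assumption:
// strings carry a big-endian BOM, 0xFE 0xFF). Surrogate pairs fall out
// naturally from concatenating the two code units in order.
function decodeUtf16Be(bytes: Uint8Array): string {
  let start = 0;
  // Skip a big-endian BOM if present
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    start = 2;
  }
  let out = "";
  for (let i = start; i + 1 < bytes.length; i += 2) {
    // Big-endian: high byte first
    out += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
  }
  return out;
}
```

A decoder that assumes one byte per character (or little-endian order) is exactly the kind of bug that only shows up on CJK input, which is why this one is a blocker.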
Performance
I ran some quick benchmarks on my code as well as yours (with slight modifications), and the results currently favor mine by 5-60% on different browsers:
- Chrome 104.0
- Firefox 103.0.1
- Safari 15.6
(My benchmarks were done on a MacBook Pro 13-inch / M1 / 2020)
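The comparisons above were of the usual micro-benchmark shape; a minimal harness sketch for reproducing them (illustrative only, not the actual benchmark code; `fn` stands in for either parser invoked on the same EngineData buffer):

```typescript
// Illustrative micro-benchmark harness. Returns average ms per iteration.
function bench(label: string, fn: () => void, iterations = 1_000): number {
  // Warm up so the JIT settles before we measure
  for (let i = 0; i < 50; i++) fn();
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  const msPerIter = (performance.now() - start) / iterations;
  console.log(`${label}: ${msPerIter.toFixed(4)} ms/iter`);
  return msPerIter;
}
```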
Integration
My parser is almost done and only needs to be integrated into the rest of the library. If you wish to work on your parser's performance, I'm ready to wait for further submissions. Otherwise, we could reuse your code (e.g. `Layer.ts`) to integrate my parser into `@webtoon/psd`.
```ts
  it = this.advance();
}
switch (it.type) {
  case TokenType.Name:
```
`TokenType.Name` should not be parsed as a value here. It causes `<< /abc /abc >>` to be parsed as `{ abc: "abc" }`.
Are you sure about that interpretation? At least in the PDF spec, `/`-prefixed sequences are called "Name Objects" and are:

> an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0). Uniquely defined means that any two name objects made up of the same sequence of characters denote the same object.
A dictionary, on the other hand:

> is an associative table containing pairs of objects, known as the dictionary's entries. The first element of each entry is the key and the second element is the value. The key shall be a name (unlike dictionary keys in PostScript, which may be objects of any type). The value may be any kind of object, including another dictionary.
So names are in no way tied to dictionaries, and as such the sequence `<< /abc /abc >>` should mean "a dictionary containing one entry, whose key is the Name `abc` and whose value is the Name `abc`", with both key and value referencing the same object (although IIRC not many implementations care about that second part). There's even an example of a dict with name values in the PDF spec:
```
<< /Type /Example
   /Subtype /DictionaryExample
   /Version 0.01
   /IntegerItem 12
   /StringItem (a string)
   /Subdictionary << /Item1 0.4
                     /Item2 true
                     /LastItem (not!)
                     /VeryLastItem (OK)
                  >>
>>
```
See PDF spec - section 7.3.5 (Name objects) and 7.3.7 (Dictionary objects).
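Under that reading, a value-position parser would accept a Name token as a first-class value. A rough sketch, where the token model and all names are hypothetical rather than taken from this PR:

```typescript
// Hypothetical token model; only the Name/String cases matter here.
enum TokenType { Name, String, DictStart, DictEnd }
interface Token { type: TokenType; value: string }

// Wrapping names in their own type lets `<< /abc /abc >>` round-trip as
// a key plus a *name* value, rather than the ambiguous { abc: "abc" }.
class NameValue {
  constructor(public readonly name: string) {}
}

function parseValue(token: Token): unknown {
  switch (token.type) {
    case TokenType.Name:
      // Names are legal values per PDF spec sections 7.3.5 / 7.3.7
      return new NameValue(token.value);
    case TokenType.String:
      return token.value;
    default:
      throw new Error(`unexpected token in value position: ${token.type}`);
  }
}
```

Whether a distinct wrapper type is worth the API surface is a separate question; the point is only that Name in value position is not automatically an error.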
That is interesting. I did not know that EngineData is based on the PDF format. Have you observed any PSD file that actually uses `/name` as anything other than a dictionary key? Personally, I have not.
I doubt there's any official mention anywhere, but seeing how both are Adobe formats, and how Adobe likes to push PDF into places where it has to put unstructured data, it would IMO make sense. That's why I based the lexer and parser on the PDF spec :)
```ts
  0x3e, // >
]);
expect(() => parseEngineData(data)).toThrowError(
  MissingEngineDataProperties
```
With the parser fixed, this test case should throw `UnexpectedEngineDataToken` instead of `MissingEngineDataProperties`.
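In test terms, the adjusted expectation would look roughly like the following. The error class and the toy parser fragment below are local stand-ins for illustration; the real implementations live in the library:

```typescript
// Local stand-in for the library's error class, for illustration only.
class UnexpectedEngineDataToken extends Error {}

// Toy parser fragment: input that opens with `>` has no dictionary at all,
// so the lexer-level error should win over a missing-properties error.
function parseEngineData(data: Uint8Array): Record<string, unknown> {
  if (data.length === 0 || data[0] === 0x3e /* ">" */) {
    throw new UnexpectedEngineDataToken("unexpected token: >");
  }
  return {};
}
```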
No hard feelings, but next time you could mention that you are willing to give a feature one more try; I could have implemented something else in the same time, and the library would've benefited from that :)

As for this PR: I implemented a fix for the CJK parsing bug and added a test for it. I also migrated the parser to be stack-based, to make optimizations a little easier (flamegraphs can now properly merge similar invocations, because the stack looks largely the same). That said, I couldn't find any meaningful optimizations (no more than ~5% on Node.js). I suspect your hand-written lexer simply behaves better than what is generated from the

As for next steps: I'm literally indifferent as to which implementation ends up on master, as long as one does :) As you are the maintainer, the choice is yours (and your colleagues'). The code in this PR is also yours to use as you see fit, e.g. for integrating into your library, if that speeds up development and such is the choice. Just let me know ;)

One final question: would it make sense to rewrite this part in Rust? It looks like parsing EngineData might take a large portion of total design parsing time.
Thank you. I handled the situation poorly, and will not make the same mistake again.

I doubt it would be appropriate to convert the EngineData parser to Rust/WebAssembly. Decoding layer images is by far the most time- and memory-intensive task (hundreds of milliseconds per layer). Parsing an EngineData document with JS takes < 1 ms when optimized, so perf gains from Rust + WebAssembly would be barely noticeable.
Ran some benchmarks; it looks like for the overall file-opening process your parser provides a significant speedup:

[benchmark table not preserved in this transcript]

@pastelmind Maybe the simplest way forward is to just merge this PR and then replace my implementation with yours?
- array instead of Generator
- heavily optimize the `string()` function in the lexer:
  1. `slice` copies the buffer,
  2. creating a decoder is costly,
  3. memoize results for each character instead of a lookup
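The per-character memoization presumably amounts to caching the one-character strings, so that neither `slice` nor a fresh `TextDecoder` is needed per byte. A sketch under that assumption (names are mine, not from the commit):

```typescript
// Cache of single-character strings, indexed by byte value; avoids both
// buffer slicing and repeated TextDecoder construction for ASCII-range input.
const charCache: string[] = new Array(256);

function byteToChar(byte: number): string {
  let ch = charCache[byte];
  if (ch === undefined) {
    ch = String.fromCharCode(byte); // computed once per distinct byte
    charCache[byte] = ch;
  }
  return ch;
}
```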
@pastelmind I had some spare time and managed to find pretty substantial optimizations; right now the results are:

[benchmark table not preserved in this transcript]
> it doesn't improve performance
I was sick for most of the week and only got around to reviewing your PR just now.
- Awesome work on optimization! It seems that avoiding unnecessary memory allocations greatly boosted your lexer's performance.
- Could you clarify "memoize results for each character instead of lookup" in cd40e08? I don't see any memoizing code.
- `Cursor` is becoming somewhat burdened. We could split it into two variants: the regular `Cursor` for parsing standard structures, and a specialized `EngineDataCursor` exclusively for the lexer. This isn't needed now, though.

I will accept the PR for now.
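The split floated above could be as small as a shared base class plus a lexer-only subclass. The class names mirror this thread, but the shape is hypothetical:

```typescript
// Minimal shared cursor over a byte buffer.
class Cursor {
  constructor(protected data: Uint8Array, public position = 0) {}
  get done(): boolean { return this.position >= this.data.length; }
  peek(): number { return this.data[this.position]; }
  advance(): number { return this.data[this.position++]; }
}

// Lexer-only conveniences live in the subclass, so the regular Cursor
// stays lean for standard PSD structures.
class EngineDataCursor extends Cursor {
  // Skip EngineData whitespace (space, tab, CR, LF)
  skipWhitespace(): void {
    while (!this.done && [0x20, 0x09, 0x0d, 0x0a].includes(this.peek())) {
      this.position++;
    }
  }
}
```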
If you are referring to using an Array instead of a Generator, then not really; I did it just to have a clean demarcation line when benchmarking with 0x. This change can be easily reverted if we ever start to run into memory issues :)

PS. Hope you feel better now :)
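The Array-vs-Generator change being discussed is roughly this trade: materialize all tokens up front (cleaner profiles, more memory) versus yield them lazily (less memory, noisier stacks). A schematic, with a stand-in token type:

```typescript
type Tok = number; // stand-in for the real token type

// Lazy: tokens are produced one at a time as the parser pulls them.
function* lexLazy(bytes: Uint8Array): Generator<Tok> {
  for (const b of bytes) yield b;
}

// Eager: the whole token list exists before parsing starts, which keeps
// benchmark flamegraphs cleanly separated between lexing and parsing.
function lexEager(bytes: Uint8Array): Tok[] {
  return [...bytes];
}
```

Both produce the same token stream, which is why the change is easy to revert if memory ever becomes a concern.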
@pastelmind are we waiting on someone / something? Dunno what the process is, but I can't merge it on my own ;)
fixes #6
This feature should be more or less finished; during the code review I'll run it on sample designs to verify that parsing works consistently.