-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Benchmark comparing DOM load and path extractor #8
Conversation
README.md
Outdated
more details. | ||
|
||
To execute the benchmarks run: `gradle --no-daemon jmh`, requires an internet connection as it downloads the data set. | ||
Results bellow, higher is better. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bellow -> below
final IonReader reader = newReader(inputStream); | ||
final IonWriter writer = newBinaryWriter(binaryOut) | ||
) { | ||
// all data is in the `dataset` key as a list, only write out that field to keep the extractor smaller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on this? It looks like the binary and text data has different structure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text data is created by writing the binary data as Ion text so they are the same, see line 89.
Both are different than the original dataset
. The original dataset is something like:
{
<some meta field>: <value>,
<other meta field>: <value>,
"dataset": [ <item1>, <item2> ]
}
the bulk of the data is in the dataset
struct field, that code is only picking the contents of the dataset and writing them out, e.g. <item1><item2>
.
I'm doing this to make the search paths less verbose but not really work it if it caused confusion. Will change the benchmark to work with the data as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The binary and text are the same as bytesText
is created by writing out bytesBinary
as IonText.
I'm extracting the dataset
field from the original dataset for the benchmark test data as the bulk of the data is there, the rest is just some metadata. But thinking more on it it's probably not worth it as can cause confusion, better to have the benchmark to work on the original dataset, will change that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see that now, thanks. I'm fine with it as-is.
/** | ||
* Benchmarks comparing the PathExtractor with fully materializing the DOM. | ||
*/ | ||
public class PathExtractorBenchmark { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to make this pluggable for different data sets as well, since the performance of the path extractor is highly dependent on the characteristics of the data. It would be good to eventually provide results for several data sets along with a description of the data and what was skipped. This doesn't block the initial release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, opened #9 to track this
``` | ||
|
||
Using the path extractor has equivalent performance for both text and binary when fully materializing the document and | ||
can give significant performance improvements when partially materializing binary documents. This happens due to Ion's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
See README.md changes for description of benchmarks
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.