Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark comparing DOM load and path extractor #8

Merged
merged 2 commits into from
Oct 31, 2018
Merged

Conversation

raganhan
Copy link
Contributor

See README.md changes for description of benchmarks

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

README.md Outdated
more details.

To execute the benchmarks run: `gradle --no-daemon jmh`, requires an internet connection as it downloads the data set.
Results bellow, higher is better.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bellow -> below

final IonReader reader = newReader(inputStream);
final IonWriter writer = newBinaryWriter(binaryOut)
) {
// all data is in the `dataset` key as a list, only write out that field to keep the extractor smaller
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on this? It looks like the binary and text data has different structure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text data is created by writing the binary data as Ion text so they are the same, see line 89.

Both are different than the original dataset. The original dataset is something like:

{
  <some meta field>: <value>,
  <other meta field>: <value>,
  "dataset": [ <item1>, <item2> ] 
}

the bulk of the data is in the dataset struct field, that code is only picking the contents of the dataset and writing them out, e.g. <item1><item2>.

I'm doing this to make the search paths less verbose but not really work it if it caused confusion. Will change the benchmark to work with the data as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The binary and text are the same as bytesText is created by writing out bytesBinary as IonText.

I'm extracting the dataset field from the original dataset for the benchmark test data as the bulk of the data is there, the rest is just some metadata. But thinking more on it it's probably not worth it as can cause confusion, better to have the benchmark to work on the original dataset, will change that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see that now, thanks. I'm fine with it as-is.

/**
* Benchmarks comparing the PathExtractor with fully materializing the DOM.
*/
public class PathExtractorBenchmark {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to make this pluggable for different data sets as well, since the performance of the path extractor is highly dependent on the characteristics of the data. It would be good to eventually provide results for several data sets along with a description of the data and what was skipped. This doesn't block the initial release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, opened #9 to track this

```

Using the path extractor has equivalent performance for both text and binary when fully materializing the document and
can give significant performance improvements when partially materializing binary documents. This happens due to Ion's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@raganhan raganhan merged commit 36adff2 into master Oct 31, 2018
@raganhan raganhan deleted the benchmarks-6 branch October 31, 2018 22:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants