
Your test duration measurement is inaccurate #143

Closed
dead-claudia opened this issue Oct 18, 2024 · 10 comments
Labels
question Further information is requested

Comments

@dead-claudia

Related: #65

Performance measurement is unfortunately not as simple as performance.now() in browsers. Further, operating systems can and do sometimes have resolution limits of their own.

Here are some considerations that need to be addressed, in general:

The benchmarks currently naively use start and end performance.now()/process.hrtime.bigint() calls. The precision issues can give you bad data, but they're not insurmountable:

  • Any test run that completes in less than half of the minimum timer precision is statistically indistinguishable from a test where the JIT optimized everything out. You could compare that to a null test, count the difference in timer ticks, and use that to build your time estimate, but even then, you'd need a massive number of runs to get much confidence from that. And for tests like ours where we need to measure frame times, in browsers like Brave and Safari where timer resolution is 1ms, this could require a very long time for some tests. (This isn't theoretical - one of our render tests in this PR ran 147169 times at an average of 0.03 ms/run or ~33k ops/sec, and that many frames at 60 FPS would come out to about 41 minutes.)
  • If the test completes in more than half a unit of precision but less than 1.5 units, you could use similar tricks to gain more confidence, but you'd still need a decent number of runs if you're only measuring spans. You could get away with roughly half as many runs, since you have a nearly 100% chance of at least one tick happening in the code block.
  • Even the act of measurement entails overhead (on the order of nanoseconds), so a null test is needed to measure anything faster than about 100k ops/sec accurately. (A minimal sketch of that idea follows below.)
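
A minimal sketch of that null-test idea, assuming only performance.now(); the measure helper is illustrative, not an existing API:

// Estimate per-measurement overhead with a "null test": time an empty
// function, then subtract that cost from real measurements.
function measure(fn, iterations) {
  const start = performance.now()
  for (let i = 0; i < iterations; i++) fn()
  return (performance.now() - start) / iterations
}

// The null test's "duration" is mostly call/timer overhead plus quantization noise.
const nullCost = measure(() => {}, 1e6)

// Subtract it to approximate the benchmark body's own cost.
const rawCost = measure(() => JSON.parse('{"a":1}'), 1e5)
const adjustedCost = rawCost - nullCost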

It doesn't appear your benchmark execution code currently takes any of this into account, hence this issue.

@jerome-benoit
Copy link
Collaborator

jerome-benoit commented Oct 19, 2024

That's why the API:

  • allows defining a custom timestamping function to fit environment needs, such as
    • () => $.agent.monotonicNow()
    • () => $262.agent.monotonicNow()
    • ...
  • allows defining the benchmark behavior, such as minimum benchmark iterations and time
  • integrates JIT deoptimization
  • integrates advanced statistics to refine accuracy via the API tunables
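
For example, a rough sketch of plugging in a custom timestamping source (the now option and hrtimeNow helper names are taken from tinybench's documentation; treat the exact shapes as assumptions and check the docs):

import { Bench, hrtimeNow } from 'tinybench'

// In Node, prefer a process.hrtime-based timestamp source...
const bench = new Bench({ now: hrtimeNow, time: 500 })

// ...or, in a $262-style host, something like:
// const bench = new Bench({ now: () => $262.agent.monotonicNow() })

bench.add('JSON.parse', () => JSON.parse('{"a":1}'))
await bench.run()
console.table(bench.table())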

There's no one-size-fits-all benchmarking design in the JS ecosystem, only APIs that allow refining the benchmark accordingly while offering sane defaults.

What is actually missing in the tinybench API to adapt it to browsers?

@dead-claudia
Author

dead-claudia commented Oct 19, 2024

@jerome-benoit I didn't realize you also used confidence intervals - that means the jitter isn't of much concern.

But for the rest, in short, duration is inaccurate if the code executes faster than timer resolution:

let start = now()
fn()
let end = now()
let duration = end - start
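// if fn() completes within a single timer tick, duration can read as 0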

The concrete fix is this:

  1. Run multiple fn() iterations instead of just one, repeating until you sufficiently exceed the clock's granularity. (I chose 15x, but that was just a random guesstimate that turned out to be good enough for my needs.) The time per run becomes duration / iterations. (See the sketch after this list.)
    • Unfortunately, this would be a breaking change if you want to avoid measuring beforeEach and afterEach. If you want to ensure accuracy while still doing one iteration per sample (to not break people), you'll need to alert people of this benchmarking limitation so they're aware, and possibly display a warning when it measures a zero-duration sample.
  2. Use the iteration count as a weight for your statistical analysis, including your confidence analysis (for when to stop looping).
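
A minimal sketch of that batching idea, assuming performance.now(); the measureBatched helper, the granularity value, and the 15x target are my own illustration, not an existing API:

// Repeat fn() until the measured span comfortably exceeds the clock's
// granularity, then report time per run plus a weight for aggregation.
function measureBatched(fn, granularity = 0.1 /* ms, e.g. a 100us clock */) {
  let iterations = 1
  for (;;) {
    const start = performance.now()
    for (let i = 0; i < iterations; i++) fn()
    const duration = performance.now() - start
    if (duration >= 15 * granularity) {
      // The iteration count becomes this sample's weight in later statistics.
      return { timePerRun: duration / iterations, weight: iterations }
    }
    iterations *= 2
  }
}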

Changing the now() function is not a valid workaround - this issue is inherent to the granularity of that function's result. Not even adding a durationSince(timestamp) is sufficient.

@dead-claudia
Author

On a related note, maybe they should've called it performance.subtle.now(), not performance.now(), to align with crypto.subtle. Both seem to have a class of bugs that neither result in errors nor return data that can be easily validated through automated means. 😅

@jerome-benoit
Collaborator

jerome-benoit commented Oct 19, 2024

The concrete fix is this:

  1. Run multiple fn() iterations instead of just one, repeating until you sufficiently exceed the clock's granularity. (I chose 15x, but that was just a random guesstimate that turned out to be good enough for my needs.) The time per run becomes duration / iterations.

Benchmark warmup at the Bench and Task level is supported, and it has sane defaults for most cases that can be tuned.
The case where the needed resolution is finer than the JS runtime's timestamping resolution is not solved in any way by an incorrect measurement methodology, such as measuring the latency of repeated runs of the benchmark function and using the average as the latency of one run:

  • using an average as a primary source for statistical analysis introduces non-anecdotal bias, cf. below
  • a correct benchmark methodology must not modify the original experiment; it's a golden rule
  • Unfortunately, to avoid measuring beforeEach and afterEach, it'd be a breaking change.

Only the latency of the benchmarking function execution with JIT deoptimization is measured in tinybench.

If you want to ensure accuracy while just doing one (to not break people), you'll need to alert people of this benchmarking limitation so they're aware, and possibly display a warning if it measures a zero-duration sample.

A correct benchmark methodology means not modifying the experiment being timed. For example, timing a runner over 500m is not the same as repeatedly measuring the time and distance of one step and using the average of those measurements as the basis for timing his 500m course. It's utterly wrong in so many ways ...

I've seen benchmarking tools such as mitata using a similar, totally flawed methodology. Tinybench will never go down that path, as we care about using an unbiased measurement methodology. That's why I forked mitata into tatami-ng, because the maintainer was not inclined to accept external contributions about it. I'm now pushing the relevant bits of that fork to tinybench, and they will show up in version 3.x.x.

  1. Use the iteration count as a weight for your statistical analysis, including your confidence analysis (for when to stop looping).

Tinybench is meant to be a lean library using state-of-the-art benchmarking methods and advanced statistics. The analysis of those statistics, such as determining whether the margin of error is acceptable, whether the median absolute deviation is acceptable, ..., and more globally the statistical significance of the result, will not be part of tinybench. It's up to the user to analyze them and eventually automate the detection of anomalies in the measurements.
The analysis of the statistical indicators can be documented, and more state-of-the-art statistical indicators can be added (z-score, IQR, ...) - we take PRs.
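
For illustration, roughly what that user-side analysis could look like over a plain array of collected sample durations (a sketch only, not tinybench's API):

// Compute a few indicators (mean, standard deviation, IQR, z-score outliers)
// over an array of sample durations in milliseconds.
function analyze(samples) {
  const n = samples.length
  const mean = samples.reduce((a, b) => a + b, 0) / n
  const sd = Math.sqrt(samples.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1))
  const sorted = [...samples].sort((a, b) => a - b)
  const quantile = (p) => sorted[Math.floor(p * (n - 1))] // crude, for illustration
  const iqr = quantile(0.75) - quantile(0.25)
  // Flag samples more than 3 standard deviations from the mean.
  const outliers = samples.filter((s) => Math.abs((s - mean) / sd) > 3)
  return { mean, sd, iqr, outliers }
}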

Changing the now() function is not a valid workaround - this issue is inherent to the granularity of that function's result. Not even adding a durationSince(timestamp) is sufficient.

The analysis of the result is meant to tell whether a measurement is correct or not: for example, the presence of a lot of zero measurements will make the margin of error go high for latency => the results cannot be trusted.

And using a totally flawed benchmarking methodology (and opening a wide door to the premature optimization disease) as a workaround for too coarse a timestamping resolution in the JS runtime is not an acceptable solution. The root cause must be fixed: not offering an optional high-resolution timer mode in a JS runtime is considered a bug nowadays. And browsers can be started with a high-resolution timer for benchmarking purposes.

So I repeat: what is actually missing in tinybench to run accurate benchmarks using state-of-the-art methodology in browsers?

@jerome-benoit added the question (Further information is requested) label Oct 19, 2024
@dead-claudia
Author

@jerome-benoit Just letting you know I plan to respond with precise numbers, just I've got a bunch of math (mostly stats and calculus) to work out.

May turn this into a whitepaper later for others to reference.

@jerome-benoit
Collaborator

jerome-benoit commented Oct 27, 2024

@jerome-benoit Just letting you know I plan to respond with precise numbers, just I've got a bunch of math (mostly stats and calculus) to work out.

Precise numbers on how a flawed benchmarking methodology can give unbiased results? ;-)

  1. Averaging an average is a well-known thing to avoid in statistics, documented in all the books. Even with the introduction of weights.
  2. Altering the experiment being benchmarked violates the ultimate rule of measurement theory.

If you plan to write down that violating these two points will make a benchmark more accurate, the proof of the contrary was settled years ago: the Kolmogorov-Nagumo publications on the generalization of the averaging concept - the f-mean has the correct properties if and only if f is injective -> a mean (weighted or not) is not injective -> using a mean as the f in an f-mean does not behave correctly.
The inaccuracy of timestamping is a known issue in measurement theory and is already handled by the statistical analysis of the sample of measurements.
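
For reference, the Kolmogorov-Nagumo quasi-arithmetic (f-)mean referenced here is conventionally defined, for a continuous and strictly monotone f, as:

M_f(x_1, \dots, x_n) = f^{-1}\!\left(\frac{1}{n}\sum_{i=1}^{n} f(x_i)\right)

with the ordinary arithmetic mean recovered as the special case f(x) = x.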

@dead-claudia
Author

@jerome-benoit Just letting you know I plan to respond with precise numbers, just I've got a bunch of math (mostly stats and calculus) to work out.

Precise numbers on how a flawed benchmarking methodology can give unbiased results? ;-)

The hard numbers I'm working on are around why your benchmarking methodology is flawed. Mine is no more perfect, but I'd have to do (a lot) more research to figure out how to correctly account for measurement error in general.

  1. Averaging an average is a well-known thing to avoid in statistics, documented in all the books. Even with the introduction of weights.

These aren't simple averages. They're population-weighted averages. https://en.wikipedia.org/wiki/Weighted_arithmetic_mean lists precisely the method I'm using as a valid use, just in a different context (average test grade).

And to be clear, when it comes to performance, there are three main statistics that matter most: the arithmetic mean, the upper end of the confidence interval, and (for real-time cases) the max value.

In the context I'm currently working with, I'm dealing with a soft real-time process where a certain function call must complete in under 5 microseconds on average and preferably under 2 microseconds. Yes, microseconds, not milliseconds. And yes, this obviously sits well below even the minimum DOMHighResTimestamp resolution allowed by spec.

  1. Altering the experiment being benchmarked violates the ultimate rule of measurement theory.

Obviously, that is a concern. In my current (not-yet-public) benchmarking code, I prepend a null test for this exact reason, and I make sure to print both the raw stats (which include that overhead) and stats adjusted using the numbers from the null test.

The adjusted stats displayed for those are admittedly probably biased and obviously not perfect, but fixing that would require a bunch of math that I don't currently plan to do in the near term. (I have more pressing priorities.)

If you plan to write down that violating these two points will make a benchmark more accurate, the proof of the contrary was settled years ago: the Kolmogorov-Nagumo publications on the generalization of the averaging concept - the f-mean has the correct properties if and only if f is injective -> a mean (weighted or not) is not injective -> using a mean as the f in an f-mean does not behave correctly.

The ultimate mean I'm returning isn't of the samples themselves, but of the (potentially unmeasurable) durations of each iteration plus the small but statistically significant measurement overhead.

Thing is, you don't need to know all the values to have a mean. Simply knowing how many runs there were and how long all runs collectively took is sufficient. Since durations just add up with the passage of time, you only need to measure the whole span to know how long all the runs collectively took.

And non-weighted means (what these intermediate means are) with known population sizes can be correctly merged using population-weighted means. (You obviously can't do this with unknown populations. But the population size here is known.)
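
Concretely, that merge is just a population-weighted combination of the batch totals. A sketch (the mergeBatchMeans helper is illustrative):

// Merge per-batch (non-weighted) means with known batch sizes into one
// overall mean via a population-weighted average.
function mergeBatchMeans(batches /* [{ mean, count }, ...] */) {
  let totalTime = 0
  let totalRuns = 0
  for (const { mean, count } of batches) {
    totalTime += mean * count // recover each batch's total duration
    totalRuns += count
  }
  return totalTime / totalRuns // identical to the mean over all individual runs
}

// e.g. mergeBatchMeans([{ mean: 0.03, count: 1000 }, { mean: 0.05, count: 250 }])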

There are caveats, like the lack of a true min/max and a lack of a true median. But that goes without saying.

The inaccuracy of timestamping is a known issue in measurement theory and is already handled by the statistical analysis of the sample of measurements.

This is the part I'm working on, so I won't address it here right now.

@jerome-benoit
Collaborator

jerome-benoit commented Oct 28, 2024

Thing is, you don't need to know all the values to have a mean. Simply knowing how many runs there were and how long all runs collectively took is sufficient. Since durations just add up with the passage of time, you only need to measure the whole span to know how long all the runs collectively took.

You still do not seem to really understand why it's plain wrong as a measurement method, why any weighted average used as the primary source for a sample of measurements is just biased by definition and can't be used as a trusted source.

Basically you're building a sample made only of values imputed using the mean. That has been proven for years to be non-anecdotally biased: the introduction of Rubin's rules, methods for comparing the distributions of samples, ...
And you boldly claim that a sample of incorrectly imputed values, known to introduce bias, can be used as a trusted measurement source for further analysis ... and fail to see the contradiction in terms.

And non-weighted means (what these intermediate means are) with known population sizes can be correctly merged using population-weighted means. (You obviously can't do this with unknown populations. But the population size here is known.)

No, you can't. Or you have to prove that Kolmogorov-Nagumo were wrong on the generalization of the concept of averaging.

Since the beginning you are making incorrect statements on:

  • the way latency is measured in tinybench
  • the statistics computed in tinybench
  • the granularity supported by high-resolution timers in JS runtimes

And you continue with plainly wrong statements about well-known, mathematically proven results in the field of statistics.
No interesting outcomes are to be expected, then.

I see no point in continuing such a discussion.

@dead-claudia closed this as not planned Oct 28, 2024
@dead-claudia
Author

dead-claudia commented Oct 28, 2024

Okay, closing. I'll agree to disagree, in particular on these two key points.

  • Kolmogorov-Nagumo f-means somehow impacting the averaging of time delta samples (this was my key point in my previous comment)
  • Weighted averages not applying to aggregate time delta samples

If I write up a whitepaper about this and get it peer reviewed, I may consider sharing it here. But aside from that, I'm ending the discussion. I doubt that even providing hard stats on how naively using this code produces skewed statistics, or providing mathematical proofs that either of the two above bullets is false, would change your opinion on the matter, so continuing it would be fruitless.

let start = performance.now()
// do stuff
let end = performance.now()
let duration = end - start

My calculations came out to about a 70% chance of undercounting a 20us interval and 30% chance of counting it as zero given Chrome's 100us granularity + 100us random jitter, by the way. I just hadn't yet either 1. calculated the expected mean value (specifically Exp({Mean(S) for all S in possible sample sets})) or 2. confirmed precisely how Chrome generates the value (which from a cursory code search appears to diverge from what I used to math out, but only slightly). But the two stats I have calculated suggest a very possible skew towards zero.
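
For what it's worth, the kind of simulation one can use to sanity-check numbers like these looks roughly like the sketch below. The flooring model is a simplification of Chrome's actual clamping and deliberately omits the jitter term, so its output won't match the figures above exactly:

// Monte Carlo sketch: how often a 20us interval is undercounted or reads
// as 0 under a clock floored to 100us ticks (jitter not modeled).
const GRANULARITY = 0.1 // ms (100us)
const INTERVAL = 0.02   // ms (20us)
const quantize = (t) => Math.floor(t / GRANULARITY) * GRANULARITY

let zeros = 0, under = 0
const TRIALS = 1_000_000
for (let i = 0; i < TRIALS; i++) {
  const start = Math.random() * GRANULARITY // random phase within a tick
  const measured = quantize(start + INTERVAL) - quantize(start)
  if (measured === 0) zeros++
  if (measured < INTERVAL) under++
}
console.log({ zeroRate: zeros / TRIALS, underRate: under / TRIALS })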

I still plan to complete these calculations anyway, as you weren't the only considered audience for them.

@jerome-benoit
Collaborator

jerome-benoit commented Oct 28, 2024

I doubt that even providing hard stats on how naively using this code produces skewed statistics, or providing mathematical proofs that either of the two above bullets is false, would change your opinion on the matter, so continuing it would be fruitless.

And the reason is simple: no mathematical proof of their fallacy exists, and no counterexamples exist. If you had one to share, it would probably end up being a candidate for the Fields Medal ...
But one can give you tons of counterexamples showing that averaging an average is biased, weighted or not; that doing measurements over a sleep interval in an async code execution design is unreliable by definition, and a totally incorrect methodology for measuring the timestamping accuracy of a JS runtime, or in general, ... (and the pile of technical and mathematical nonsense continues)
