

Improve the performance of computing sin and cos functions #234

Open · wants to merge 4 commits into melodic-devel from feature/improve-sin-cos-performance
Conversation

jcmonteiro (Contributor)

Once again, this PR affects quite a lot of files, but the change is actually not that big; besides, it is optional. 🤓

TL;DR

For me, this PR saves 40% CPU without any noticeable impact on convergence. The maximum absolute error between the analytic sin and cos functions and the approximations is 0.003.

Why

In the application I work on, the major performance bottleneck we faced when using TEB was computing three simple functions: sin, cos, and pow. Initially that came as a surprise to me, but it does make sense, since these functions are used all over the place to compute the graph edges.

Summary

There are basically two ways of speeding up the computation of trigonometric functions: look-up tables and mathematical approximations. I did not benchmark the alternatives, but I am confident the approximation I used is more than enough to eliminate the performance bottleneck caused by the trigonometric functions.

Comparison

As an example, this is the typical flame graph I get without the trigonometric approximations (sorry, but I had to blur the image where proprietary code is called).
[Image: flame graph without the trigonometric approximations (teb_before)]

And this one is with approximations for sin and cos.
[Image: flame graph with the sin and cos approximations (teb_sincos)]

So, for me, the improvement was roughly 40% less CPU usage.

Parameters

A new parameter, use_sin_cos_approximation, is added under a new tab called "Performance". It toggles the approximations and defaults to false.

Formula

It is a pretty simple approximation: I take a second-order approximation on the interval from 0 to 45 degrees and then use trigonometric identities to replicate the result in the other quadrants. For comparison, these are the analytic functions and their approximations.
[Image: analytic sin/cos vs. their approximations]

And these are the absolute errors.
[Image: absolute approximation errors]

@jcmonteiro (Contributor, Author)

The description was already too long, so I decided to add this as a comment. If you find this PR useful, I have one other change where I limit the cost exponent to be an integer in the [1, 4] range. Doing so, I get an extra 10% performance improvement, for a total of 50%.

This is the flamegraph with the changes to sin, cos, and pow.
[Image: flame graph with the changes to sin, cos, and pow (teb_all)]

@amakarow (Contributor)

This is awesome, thanks a lot. We will test this soon.

@jcmonteiro (Contributor, Author)

jcmonteiro commented Oct 23, 2020

I just noticed that there are some other places that use sin and cos that were outside my use case. I'll update the PR to include them later.

@jcmonteiro jcmonteiro force-pushed the feature/improve-sin-cos-performance branch from 6cd9cc7 to abbdbee Compare November 9, 2020 18:37
@jcmonteiro (Contributor, Author)

jcmonteiro commented Nov 9, 2020

Sorry for taking so long to get back to this PR. I have exhaustively tested the changes and decided to enable the sin and cos approximations only for computing the distance to footprint models, hence the force-push to the branch. As it turns out, essentially all of the performance bottleneck is in these computations, and enabling the approximations in the edges hindered convergence in some edge cases (no pun intended).

@RainerKuemmerle (Contributor)

Just commenting here as a user of TEB.
IMHO it would be nice to keep the same API as the sin/cos provided by math.h/cmath. This would also limit the changes in other parts of the code and would be easier to maintain than a function writing the result into a reference.
Moreover, can one limit code duplication between sin/cos by exploiting the phase shift of pi/2 between sine and cosine?
How would the approximation suggested here perform compared with others, e.g. https://github.com/kennyalive/fast-sine-cosine/blob/master/src/main.cpp (a random one found via Google)?

@jcmonteiro (Contributor, Author)

IMHO it would be nice to keep the same API as the sin/cos provided by math.h/cmath. This would also limit the changes in other parts of the code and would be easier to maintain than a function writing the result into a reference.

Yes and no. There is a benefit in computing sin and cos at the same time, so I take them by reference. I then used the same interface for computing only the sine, just for consistency.

Moreover, can one limit code duplication between sin/cos by exploiting the phase shift of pi/2 between sine and cosine?

What do you have in mind?

How would the approximation suggested here perform compared with others, e.g. https://github.com/kennyalive/fast-sine-cosine/blob/master/src/main.cpp (a random one found via Google)?

I have not benchmarked others, since the change cleared the performance bottleneck altogether.

@RainerKuemmerle (Contributor)

IMHO it would be nice to keep the same API as the sin/cos provided by math.h/cmath. This would also limit the changes in other parts of the code and would be easier to maintain than a function writing the result into a reference.

Yes and no. There is a benefit in computing sin and cos at the same time, so I take them by reference. I then used the same interface for computing only the sine, just for consistency.

To me, this API is not comfortable and is very error-prone.

Moreover, can one limit code duplication between sin/cos by exploiting the phase shift of pi/2 between sine and cosine?

What do you have in mind?

[Image: the phase-shift identity cos(x) = sin(x + π/2)]
But maybe this does not apply to your approach. It is just that, if you have them as functions returning the sine/cosine, one can save some lines of code.

How would the approximation suggested here perform compared with others, e.g. https://github.com/kennyalive/fast-sine-cosine/blob/master/src/main.cpp (a random one found via Google)?

I have not benchmarked others, since the change cleared the performance bottleneck altogether.

I am just reasoning about which approximation to use. For instance, the implementation linked above seems very compact, also with respect to the number of operations performed. But I am not sure about its performance.

@jcmonteiro (Contributor, Author)

I am just reasoning about which approximation to use. For instance, the implementation linked above seems very compact, also with respect to the number of operations performed. But I am not sure about its performance.

It is really compact; thanks for the reference, btw. I have tested the precision and it is better: the maximum error is 0.001 with the approximation you suggested, compared to 0.003 with the one I implemented. Besides, its error is zero at zero, which I assume is desirable for numerical conditioning. I'll compare the performance just for the sake of it, but it should be equally good (if not better).

@jcmonteiro (Contributor, Author)

I have changed the approximation method to the one suggested by @RainerKuemmerle. Although the maximum absolute approximation error is smaller, I noticed a more pronounced convergence deterioration when using the approximation in the edges. Since I have disabled that and the PR uses the approximations only for distance calculations, I think it doesn't really matter.

Besides, the API is the same as the one in the standard library, as pointed out by @RainerKuemmerle.

@amakarow (Contributor)

amakarow commented Jan 4, 2021

Have you experienced any overhead when using your wrapper for std::sin and std::cos compared to the plain implementation?

@jcmonteiro (Contributor, Author)

The overhead is unnoticeable, given that the other methods called while solving the optimization are much more expensive. Profiling shows no difference either.

@corot corot self-requested a review November 26, 2021 11:59
@jcmonteiro (Contributor, Author)

I forgot about this one for a while. Would this PR still be interesting? Should I consider modifications to improve it (besides resolving the conflicts)?

@corot (Collaborator)

corot commented Nov 29, 2021

I forgot about this one for a while. Would this PR still be interesting? Should I consider modifications to improve it (besides resolving the conflicts)?

If it delivers the performance improvement you describe, then yes, it definitely is interesting.
Can you first resolve the conflicts, so I can give it a try and assess the performance gain?

@jcmonteiro jcmonteiro force-pushed the feature/improve-sin-cos-performance branch from 9dd56fe to 6dd3639 Compare November 29, 2021 10:26
@@ -376,6 +383,9 @@ class TebConfig
recovery.oscillation_recovery_min_duration = 10;
recovery.oscillation_filter_duration = 10;

// Recovery
Review comment (Collaborator):

Suggested change:
- // Recovery
+ // Performance

Comment on lines +420 to +426
grp_performance.add(
"use_sin_cos_approximation",
bool_t,
0,
"Use sin and cos approximations to improve performance. The maximum absolute error for these approximations is 1e-3.",
False
)
Review comment (Collaborator):
keep same format as for other params

@corot (Collaborator)

corot commented Nov 30, 2021

I didn't notice any speedup in the computeVelocityCommands method. These are the times for 100 consecutive calls without and with fast sin/cos on the same route:

std:
0.425189
0.716500
0.751879
0.404824

fast:
0.426448
0.613000
0.681991
0.406975

Nor did I notice any reduction in CPU usage, but I measured that with top, which is obviously not a sound method.

Where does the speedup get reflected?
Also, there are still many usages of std::sin/cos; I wonder what the speedup could be if we replaced all of them with the fast versions. What about making SIN / COS macros that use either the std or the fast version depending on a compilation flag and using them all along?

PS: sorry, I broke the PR again with the latest merge 😞

@jcmonteiro (Contributor, Author)

I didn't notice any speedup in the computeVelocityCommands method. These are the times for 100 consecutive calls without and with fast sin/cos on the same route:

std: 0.425189 0.716500 0.751879 0.404824

fast: 0.426448 0.613000 0.681991 0.406975

Nor did I notice any reduction in CPU usage, but I measured that with top, which is obviously not a sound method.

If I remember correctly, computeVelocityCommands was called thousands of times per second in my application because we handled every single cell in the costmap as an obstacle (we did not use costmap converters).

Where does the speedup get reflected? Also, there are still many usages of std::sin/cos; I wonder what the speedup could be if we replaced all of them with the fast versions. What about making SIN / COS macros that use either the std or the fast version depending on a compilation flag and using them all along?

I have left the changes out of the edge calculations because adding them there caused the optimization to oscillate. The macro is also an idea, but it would

  1. prevent changing the setting at runtime, and
  2. make it impossible to use the feature for those not compiling from source.

PS: sorry, I broke the PR again with the latest merge 😞

That's alright =). I will see if I can make the changes ASAP. Full disclaimer: I am no longer compiling or running the code, since my company uses ROS 2 now.
