precompute sin/cos to improve performance #341

Open. VRichardJP wants to merge 3 commits into melodic-devel.

VRichardJP

As discussed in #340

Collaborator

@corot corot left a comment

Looks interesting, but just checking the time spent in the computeCmdVel method, I get worse results than with the current melodic-devel:

0.393857
0.652801
0.827784
0.435621

With PR on same route:

0.593492
1.004650
1.046142
0.564472

Admittedly, it is not a very sophisticated test, but I get consistent results.
Can you try to compare the time spent in the computeCmdVel method?
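
A measurement of this kind can be sketched as follows. This is only an illustration, not the code used for the numbers above: `ScopedTimer`, `timeCall` and `meanSeconds` are hypothetical names that do not exist in teb_local_planner, and `std::chrono::steady_clock` is assumed as the clock.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper: when it goes out of scope, it appends the elapsed
// wall time of the enclosing scope (in seconds) to a vector of samples.
class ScopedTimer {
public:
  explicit ScopedTimer(std::vector<double>& samples)
      : samples_(samples), start_(std::chrono::steady_clock::now()) {}
  ~ScopedTimer() {
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start_;
    samples_.push_back(elapsed.count());
  }

private:
  std::vector<double>& samples_;
  std::chrono::steady_clock::time_point start_;
};

// Times one call of any callable, e.g. a lambda wrapping computeCmdVel().
template <typename F>
void timeCall(F&& f, std::vector<double>& samples) {
  ScopedTimer timer(samples);
  f();
}

// Mean of the recorded samples, in seconds.
double meanSeconds(const std::vector<double>& samples) {
  assert(!samples.empty());
  double sum = 0.0;
  for (double s : samples) sum += s;
  return sum / samples.size();
}
```

Running the same route on both branches and comparing the per-call samples (or their mean) gives figures comparable to the ones listed above.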

include/teb_local_planner/pose_se2.h (two review threads, outdated and resolved)
@VRichardJP
Author

VRichardJP commented Dec 15, 2021

@corot I don't include cmd_vel_ in my benchmark, so I overlooked it. I guess the issue here is that when a PoseSE2 is copied, sin/cos get recalculated over and over instead of being copied. For example, with the current code, something like:

pose1.theta() = pose2.theta();

is equivalent to:

pose1.theta()._theta = pose2.theta()._theta;
pose1.theta().update_sincos();

Maybe adding a simple copy constructor would fix the performance in your benchmark:

Theta(const Theta& other) {
    _rad = other._rad;
    _sin = other._sin;
    _cos = other._cos;
}
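
To make the difference between copying the cached values and recomputing them concrete, here is a minimal self-contained sketch. It is not the PR's actual class: this `Theta`, its member names and the `trig_evals` counter are assumptions made for illustration only.

```cpp
#include <cmath>

// Hypothetical angle wrapper that caches sin/cos next to the angle.
class Theta {
public:
  // Counts how often sin/cos are actually evaluated (illustration only).
  static int trig_evals;

  // Implicit double -> Theta conversion; pays for sin/cos up front.
  Theta(double rad = 0.0) : rad_(rad) { update_sincos(); }

  // Copying takes the cached values along instead of recomputing them.
  Theta(const Theta& other) = default;
  Theta& operator=(const Theta& other) = default;

  // Assigning a raw angle has to refresh the cache.
  Theta& operator=(double rad) {
    rad_ = rad;
    update_sincos();
    return *this;
  }

  // Implicit Theta -> double conversion.
  operator double() const { return rad_; }
  double sin() const { return sin_; }
  double cos() const { return cos_; }

private:
  void update_sincos() {
    sin_ = std::sin(rad_);
    cos_ = std::cos(rad_);
    ++trig_evals;
  }
  double rad_, sin_, cos_;
};

int Theta::trig_evals = 0;
```

In this sketch, `c = a` between two `Theta` objects leaves `trig_evals` unchanged, while `c = static_cast<double>(a)` goes through the `double` overload and evaluates sin/cos again.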

That being said, I don't really like keeping the implicit Theta->double and double->Theta conversions, as constructing a Theta comes with an upfront cost. I am considering 2 ways to tackle this:

  1. Keep implicit conversion but make sin/cos evaluation lazy. This is really easy to implement and has little impact on the rest of the code, but it comes with a small runtime cost. For example:
const double& sin() {
    if (!_init) {
        update_sincos();
        _init = true;
    }
    return _sin;
}
  2. Make conversion explicit both ways and modify the rest of the teb code accordingly. There is no runtime cost, but it adds complexity to the code. There may also be cases where sin/cos values are not necessary and could be left undefined, but that could be a source of bugs in the future.
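
A fuller sketch of option 1, under stated assumptions: `LazyTheta`, `ensure_sincos` and the `evals` counter are hypothetical names, and the cache is marked `mutable` so that the accessors can stay `const`. It is meant to show the idea, not the code in this PR.

```cpp
#include <cmath>

// Hypothetical lazily evaluated variant: sin/cos are computed on first use
// and the cache is invalidated whenever the angle changes.
class LazyTheta {
public:
  // Implicit double -> LazyTheta conversion; no trig call happens here.
  LazyTheta(double rad = 0.0) : rad_(rad) {}

  LazyTheta& operator=(double rad) {
    rad_ = rad;
    valid_ = false;  // cached sin/cos no longer match the angle
    return *this;
  }

  operator double() const { return rad_; }

  double sin() const { ensure_sincos(); return sin_; }
  double cos() const { ensure_sincos(); return cos_; }

  // Number of times sin/cos were actually evaluated for this object.
  int evals() const { return evals_; }

private:
  void ensure_sincos() const {
    if (!valid_) {  // the small runtime cost: one branch per access
      sin_ = std::sin(rad_);
      cos_ = std::cos(rad_);
      valid_ = true;
      ++evals_;
    }
  }

  double rad_;
  mutable double sin_ = 0.0, cos_ = 1.0;
  mutable bool valid_ = false;
  mutable int evals_ = 0;
};
```

With this layout, a `LazyTheta` that is only ever read back as a `double` never pays for a trig call, which is the case the implicit conversions make hard to rule out.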

What do you think?

@VRichardJP
Author

VRichardJP commented Dec 15, 2021

I have tested both approaches 1 and 2 in my setup. The difference in performance is negligible on my machine: implementation 1 is ~1% slower than implementation 2, while being way simpler and less prone to bugs.

EDIT: With google perftools I observe that the lazy sincos version reduces the total number of sincos calls by two thirds compared to the original branch. This bears out my initial observation that sin/cos computation is largely duplicated.

@VRichardJP VRichardJP requested a review from corot December 15, 2021 07:38
Collaborator

@corot corot left a comment

Much better now, but I still get times slightly above melodic-devel. Maybe the sin/cos implementation already does some caching? 🤔 (I compile in release, so I assume the std implementation is very efficient.)

0.449948
0.750811
0.723817
0.433760

With PR on same route:

0.477178
0.915454
0.731955
0.506457

Did you measure the time on computeCmdVel?
Or how do you evaluate the performance gain?

@VRichardJP
Author

VRichardJP commented Dec 16, 2021

Hi, I'm a bit confused to see it does not improve things on your side =/

To benchmark the changes I am using a private fork of melodic-devel, but apart from a few extra options and weights, the functionality and performance should be the same. I measure performance using test_optim_node: I start the node, load a scenario (a custom set of obstacles and start/stop poses), run N loops of CB_mainCycle, then stop. I have more than 200 scenarios. Although the time I measure includes some ROS setup and initialization, with my PR I get a clean +10% performance gain on the whole benchmark, which means the gain on the plan() function itself is even bigger.

For example, I generated a call graph with google-perftools for one of the scenarios, without my changes. I built in RelWithDebInfo mode with SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,--no-as-needed,-lprofiler,--as-needed"), ran with CPUPROFILE_FREQUENCY=1000 CPUPROFILE=/tmp/teb.profile roslaunch <...>, and visualized with google-pprof --web install/teb_local_planner/lib/teb_local_planner/test_optim_node /tmp/teb.profile:
[image: google-pprof call graph of test_optim_node without the PR]

You can see at the bottom of the picture that test_optim_node spends 34% of its time in _sincos.

This is the same graph but with the PR:
[image: google-pprof call graph of test_optim_node with the PR]

The time spent in _sincos has dropped to 12% because eval_sincos catches many duplicates. For some reason, you can see that TwoCirclesRobotFootprint::calculateDistance still manages to call _sincos directly and does not go through eval_sincos. I am wondering why this happens; maybe it is a side effect of compiler optimization.

From your measurements I understand that my PR has no effect on your machine. I see 2 possible causes:

  • _sincos calls are replaced by a table lookup after optimization. In that case, my PR is absolutely useless and just adds a useless if and extra data to copy. What CPU/compiler are you using? On my side, I have an Intel i7-9700 @ 3.00GHz and use the gcc 7.5.0 compiler.
  • teb parameters and/or the scenario: my test world is quite big (generated paths contain around 300 points), and my vehicle is car-like (non-holonomic) with a two_circles footprint. From my understanding of teb, the world/vehicle/tuning only influences the number of sin/cos calls, which are duplicated in all cases. So it should not matter.

@VRichardJP VRichardJP requested a review from corot December 19, 2021 08:01
@corot
Collaborator

corot commented Dec 22, 2021

> Hi, I'm a bit confused to see it does not improve things on your side =/

Me too! I tried to rule out some possibilities:

  • I'm using a circular footprint model. I tried a similar test with a polygon footprint model, but I got similar results: no improvement.
  • I tried compiling in Debug, but again I got similar results: no improvement.

What's that test_optim_node? I can give it a try.
Could you also test just recording execution time for computeCmdVel calls? I use this simple code:

time.diff.zip

@VRichardJP
Author

This is the callback I use for benchmarking purposes:

// Planning loop
void CB_mainCycle(const ros::TimerEvent& e)
{
    planner->plan(PoseSE2(-4,0,0), PoseSE2(4,0,0)); // hardcoded start and goal for testing purposes
}

@RainerKuemmerle
Contributor

Some edges compute their Jacobian numerically. This might also explain why you observe a lot of calls to sin/cos in general. For those edges, caching does not help when evaluating with respect to theta, only with respect to x/y. For edges with an analytic Jacobian, the overhead of caching might have a negative impact. Benchmarking computeError/linearizeOplus per edge might lead to interesting insights.
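
The point about numeric Jacobians can be illustrated with a toy example. Everything here is hypothetical (`CachedPose`, `toyError`, `numericJacobian`); it is not one of teb's edges and not g2o's actual numeric differentiation, only a central-difference sketch that counts sin/cos evaluations.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical pose with sin/cos of theta cached at construction.
struct CachedPose {
  static int trig_evals;  // counts sin/cos evaluations (illustration only)
  double x, y, theta, s, c;

  CachedPose(double x_, double y_, double theta_)
      : x(x_), y(y_), theta(theta_), s(std::sin(theta_)), c(std::cos(theta_)) {
    ++trig_evals;
  }

  // Shift the position only: the cached sin/cos stay valid.
  CachedPose shifted(double dx, double dy) const {
    CachedPose p(*this);  // plain copy, no trig evaluation
    p.x += dx;
    p.y += dy;
    return p;
  }
};
int CachedPose::trig_evals = 0;

// Toy error term: x coordinate of a point one unit ahead of the pose.
double toyError(const CachedPose& p) { return p.x + p.c; }

// Central-difference Jacobian of toyError with respect to (x, y, theta).
std::array<double, 3> numericJacobian(const CachedPose& p, double eps) {
  assert(eps > 0.0);
  std::array<double, 3> J{};
  // Perturbing x or y reuses the cached sin/cos of the unperturbed theta.
  J[0] = (toyError(p.shifted(eps, 0)) - toyError(p.shifted(-eps, 0))) / (2 * eps);
  J[1] = (toyError(p.shifted(0, eps)) - toyError(p.shifted(0, -eps))) / (2 * eps);
  // Perturbing theta invalidates the cache: both poses need fresh sin/cos.
  const CachedPose plus(p.x, p.y, p.theta + eps);
  const CachedPose minus(p.x, p.y, p.theta - eps);
  J[2] = (toyError(plus) - toyError(minus)) / (2 * eps);
  return J;
}
```

Of the six perturbed evaluations in this sketch, only the two that perturb theta evaluate sin/cos; the four x/y evaluations reuse the cached values.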
