-
-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed improvements for discussion #4138
Conversation
no real difference in FPS but code is faster. also 160bytes smaller, meaning it is actually faster
tested and working, also tested video
its not faster but cleaner (and uses less flash)
-calculations for virtual strips are done on each call, which is unnecessary. moved them into the if statement.
uses less flash so it should be faster (did not notice any FPS difference though) also cleaned code in ColorFromPaletteWLED (it is not faster, same amount of code)
…alette inlining getMappedPixelIndex gets rid of function entry instructions (hopefully) so it should be faster. also added the 'multi color math' trick to color_add function (it will not make much difference but code shrinks by a few bytes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent!
There are some formatting issues (spaces) but that's minor.
wled00/FX_fcn.cpp
Outdated
@@ -1816,7 +1820,7 @@ bool WS2812FX::deserializeMap(uint8_t n) { | |||
return (customMappingSize > 0); | |||
} | |||
|
|||
uint16_t IRAM_ATTR WS2812FX::getMappedPixelIndex(uint16_t index) const { | |||
__attribute__ ((always_inline)) inline uint16_t IRAM_ATTR WS2812FX::getMappedPixelIndex(uint16_t index) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non obligatory: I would prefer __attribute__
at the end but [[...]]
in front.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly have no Idea what the precompiler instructions mean in detail, I copied this from what they use in fastled...
the idea behind this is to get rid of the 'function entry' instructions that are added when a function is called. When I added the inline
flash size increased by a few bytes, telling me that it is actually inlined. Since this short function is only called from two places and is called A LOT this may be faster. I have no way to check (would need a proper debugger that shows assembly instructions being executed line by line).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I've recently learned these are compiler attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 not sure if always_inline plays well with IRAM_ATTR .... the first tells the compiler to always inline the function, the latter says "put the function into IRAM" which means that a real function is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is what I was wondering too, this is just a suggestion, i.e. to inline this for speed but how to tell the compiler to inline it to the functions that are in ram... not sure how it will do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my understanding:
inline
is a hint/suggestion to the compiler. So it might get inlined, or not.__attribute__((always_inline))
is a directive. So the compiler must inline this function, no matter if its efficient or not.
If you want to optimize function calls, its sometime useful to add __attribute__((pure))
or __attribute__((const))
to the function declaration. But only do this after double-checking that the code is actually "pure" (no side-effects) or "const" (solely depends on arguments). I did this in the MoonModules fork, but honestly it does not give you more than 1 or 2 fps even if you apply it to lots of functions.
one more thing I was thinking about: |
I had the same idea! I expect that using flat temporary render buffer of the virtual() sizes, then applying all mapping operations as a post-processing step, would yield a substantial average-case performance improvement. That said, if you try to do mirror and transpose during the render, I think you'll give up most of the benefits as every write still ends up being a pile of variable lookups and conditionals instead of just straight local pointer arithmetic -- I think it'd be better to perform all mapping operations after the render. We'd also get some advantages with code layout and IRAM caching, as each FX function wouldn't need to make so many external function calls. The big tradeoff is RAM usage. The catch is that the temporary buffer needs contiguous RAM. If it's held allocated, we lose the ability to share that memory (in highly constrained systems) with other tasks like the web server... but if it's not, we risk not being able to get a big enough buffer when it's needed for a render. There may also be systems where there just isn't enough RAM to hold two copies of every pixel state at once. That said, I'd still love to see a PR that implements a temporary render buffer and pulls the mapping operation out to a separate pass after the FX function call. I think the performance and code size improvements are likely to be compelling, even if it reduces the maximum LED count that can be supported on memory constrained hardware. |
How are buffers currently? What is the global buffer vs no global buffer? |
Let's start at the bottom and make our way up. NeoPixelBus (in 2.7.9) uses 3 buffer: 2 render buffers and 1 transmit buffer. Render buffers are used in "double buffer" fashion. Transmit buffer may be small, but render buffers are of size LED count * pixel size where pixel size can be 24bits up to 64bits. (2.8.0 or later may change that as @Makuna is working on optimisations). Different hardware (RMT, I2S, ...) may have transmit buffers of different size. WLED creates its own global buffer of size LED count * 32bit (if so selected in settings) for its Segment does not create any buffers of its own (though in the past we had This (lossy reconstruction) is why mirroring, reversing and transposing have to be done in the Newer versions of NPB are going to do brightness scaling only during transmission of data to LEDs so And then, there is a CCT support. NOTE: Note the difference between 'GetPixelColor() |
Thank you for detailing this, much clearer now. I was just looking at 'soap' FX as an example that uses
A solution could be to replace the global buffer with individual segment buffers that are not mapped nor brightness adjusted but do that in one go when it is transferred to NeoPixelBus buffer (saving a lot of if statements and back and forth mapping). Would that be a good way to try and solve it or is there something fundamentally flawed with this approach? |
I am not going to rewrite all of the effects. 😄
This was used in first iteration of 0.14 while we used |
true, lookup is fast.
yes, the main difference being the color correction, right? If I disable global buffer, colors get worse but speed stays (almost) the same.
Do you mind to quickly elaborate what problems these were? |
- there already is a method to calculate the table on the fly, there is no need to store it in flash, it can just be calculated at bootup (or cfg change)
Overlapping segments. |
- gamma correction only where needed - paletteIndex should be uint8_t (it is only used as that) note: integrating the new `ColorFromPaletteWLED()` into this would require a whole lot of code rewrite and would result in more color conversions from 32bit to CRGB. It would be really useful only if CRGB is replaced with native 32bit colors.
wled00/FX_fcn.cpp
Outdated
|
||
unsigned paletteIndex = i; | ||
uint8_t paletteIndex = i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is only used as an uint8_t down the line and not involved in any calculation.
edit:
correction: I have index as an unsigned in the ColorFromPaletteWLED()
and do a byte()
cast there anyway to prevent overflows. so it may be better to revert back to unsigned for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. ATM. what about in the future?
if it is unsigned, compiler will trim it down when passing to a function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok reverted to unsigned locally. saves 6 bytes of code. I thought compiler would do better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
XTensa doesn't have non-word arithmetic instructions -- any operations done on a 8-bit or 16-bit types have to be performed by upcasting, operating, and downcasting afterwards to restrict the output range. At least here, int
or unsigned
is almost always less code (and slightly faster).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
XTensa doesn't have non-word arithmetic instructions -- any operations done on a 8-bit or 16-bit types have to be performed by upcasting, operating, and downcasting afterwards to restrict the output range. At least here,
int
orunsigned
is almost always less code (and slightly faster).
in general: absolutely yes. But if a function is performed purely in 8bit (like most of the fastled stuff) sometimes the compiler optimizes in a way that makes it actually faster (smaller code) when using uint8_t
, don't ask me why exactly, I have been playing with types to improve code quite a bit and sometimes its just not really clear what is happening.
though so ;) but glad it is 'only that' |
It is not. wait for "blending styles" |
future improvements to buffers aside: should i squash this draft and make a PR? any changes required before I do that? |
Please, and do address points discussed. |
@blazoncek one more general question: |
One thing to keep in mind regarding existing effects is that most were ported from FastLED type effects. Some taken directly from FastLED, some from places like soulmate.com. FastLED operated on its CRGB (or CHSV) and so most effects disregarded the W channel. I guess this was the reason to introduce "Automatic White Calculation" for strips like SK6812 where white channel could be used to reduce power draw. Any new effect should, if possible, take into account white channel as well IMO though most will really not benefit at all. It all comes down to the "artist" how he/she envisions an effect on RGB and RGBW strips. I do not know if stripping W from calculations is wise or not without prior notice to users. |
Agreed. These are just ideas. |
I would avoid CRGB. Why? Because WLED uses NeoPixelBus and not FastLED (well, palettes are exception). NeoPixelBus has its own classes comparable to CRGB but also extended to include CCT (or WW & CW) information which does not exist in FastLED. Unfortunately the world is not as simple as that. CRGB is very common and a lot of people know what it represents. To top that some people want to control white channel independently of effects. |
as long as we keep the commit history, this PR could be a collection of code improvements. |
- removing WS2812FX::setMode() - removing WS2812FX::setColor() - removing floating point in transition - color handling modification in set.cpp - replaced uint8_t with unsigned in function parameters - inlined WS2812FX::isUpdating() - (MAY BE BREAKING) alexa & smartnest update
- changes to `setPixelColorXY` give an extra FPS, some checks and the loops are only done when needed, additional function call is still faster (force inlining it gives negligible speed boost but eats more flash) - commented out the unused `boxBlur` function - code size improvemnts (also faster) in `moveX()` and `moveY()` by only copying whats required and avoiding code duplications - consolidated the `blur()` functions by enabling asymmetrical blur2D() to replace `blurRow` and `blurCol` - compiler warning fixes (explicit unsigned casts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few thoughts.
wled00/FX_2Dfcn.cpp
Outdated
else strip.setPixelColorXY(start + width() - xX - 1, startY + yY, col); | ||
unsigned groupLen = groupLength(); | ||
|
||
if(groupLen > 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reduces the processing if group length is 1 by removing:
- four assignments (
yY
&&xX
,W
&H
)- hidden two subtractions and a comparison (
W
&H
)
- hidden two subtractions and a comparison (
- two multiplications
- two comparisons (
j
&g
,uint8_t
!) - two comparisons (
yY>=H
&&xX>=W
) - two increments (
xX
&yY
)
For each setPixelColorXY()
call.
What may seem irrelevant with low number of pixels actually produces a noticeable slowdown on large pixel count.
} | ||
} | ||
|
||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it is unused ATM but it produces quite different blur and I put this function in intentionally.
I would not like it removed.
uint32_t newPxCol[vW]; | ||
int newDelta; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see the point in modifying move()
functions, especially with convoluted math that is hard to follow.
How much speed does it gain?
wled00/FX_2Dfcn.cpp
Outdated
@@ -676,7 +652,10 @@ void Segment::drawCharacter(unsigned char chr, int16_t x, int16_t y, uint8_t w, | |||
case 60: bits = pgm_read_byte_near(&console_font_5x12[(chr * h) + i]); break; // 5x12 font | |||
default: return; | |||
} | |||
col = ColorFromPalette(grad, (i+1)*255/h, 255, NOBLEND); | |||
uint32_t col = ColorFromPaletteWLED(grad, (i+1)*255/h, 255, NOBLEND); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a "shadow" of the col
from a few linea above.
FYI I am now running these improvements on my test set up (8266,32,S2,C3) for about a week and see no pitfalls. Code is smaller and faster. |
As I am moving away from WLED (at least for a while if not permanently) I would urge you to implement this PR ASAP as it benefits WLED tremendously. |
I can rebase and merge any time. It would require another beta release as there are many changes in this PR that are not well tested. If we do that #4181 should go along with it though. |
- correctly clear segment spacing change - renamed Segment::setUp() to Segment::setGeometry() - removed WS2812FX::setSegment() - removed obsolete/unfunctional word clock usermod .cpp file
|
||
stateChanged = true; // send UDP/WS broadcast | ||
|
||
if (stop) fill(BLACK); // turn old segment range off (clears pixels if changing spacing) | ||
if (stop || spc != spacing || m12 != map1D2D) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does "if (stop)" do the same as isActive()
, or is there a special reason to avoid isActive() here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDK, it is an old code. Perhaps if you shorten a segment. Most likely on second thought.
_vHeight = virtualHeight(); | ||
_vLength = virtualLength(); | ||
_segBri = currentBri(); | ||
fill(BLACK); // turn old segment range off or clears pixels if changing spacing (requires _vWidth/_vHeight/_vLength/_segBri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use beginDraw()
, instead of updating all cached values explicitly? It would make the code easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
beginDraw()
does a lot of other things too which are unnecessary.
I did consider calling it, but decided against and added comment.
- code is a bit cleaner and faster as well - chaning array access to pointer access in bus_manager makes it a few instructions faster - changed getNumberOfPins and getNumberOfChannels to return 32bit values, saving the unnecessary 8bit conversion
FYI: I just ran a final test before merging this. Overall speed improvement is around 20%! |
Added a lot of improvements to core functions making them slightly faster and saving some flash.
I tested all functions to work the same as before. I tested speed by looking at FPS and it's hard to say how much faster it actually is but I see some improvement, maybe 1%.
The inline of getMappedPixelIndex() is just an idea, not sure that is correct nor do I know if it improves anything.
Did not test on ESP8266.