Make fix_outline always produce correct visual output without aliasing or gaps #483
base: master
Conversation
```c
else if (info->effect_type == EF_NONE || info->effect_timing > 0)
    info->filter.alpha_secondary = info->filter.alpha_primary;
else
    info->filter.alpha_primary = info->filter.alpha_secondary;
```
For \k and \ko, I’m setting `alpha_primary` and `alpha_secondary` to the same number in the cache key. This means that the bitmap will be re-rendered when the karaoke time arrives (which happens at most once), but on the other hand this can avoid unnecessary re-renders (arbitrarily many) when the other alpha value changes. This may or may not be a desirable trade-off.
libass/ass_render.c
Outdated
```c
    copy_bitmap(render_priv->engine, &v->bm_o_k, &v->bm_o);
    fix_outline(&v->bm, &v->bm_o_k, k->filter.alpha_secondary, blurred_o);
}
fix_outline(&v->bm, &v->bm_o, k->filter.alpha_primary, blurred_o);
```
Note: I’m not a huge fan of the code duplication here.
Force-pushed from 29f3b36 to 09d4d2f.
Also, after this PR, this code from #371 (comment):

```c
if ((flags & FILTER_NONZERO_BORDER &&
     info->a_pre_fade[0] == 0 &&
     info->a_pre_fade[1] == 0 &&
     info->fade == 0) ||
    ...)
    flags |= FILTER_FILL_IN_BORDER;
```

technically starts to harm quality if \bord is small (smaller than a pixel), although I haven’t done any tests and wouldn’t be surprised if the difference was unnoticeable.
And yes, as mentioned on IRC, this PR does have the issue that it requires divisions. Many divisions. Slow.™ It’s not obvious how to alleviate this, although to me it’s also not necessarily obvious that this is actually a problem. |
Force-pushed from 711f798 to ccb6b73.
After a long discussion on IRC and some performance tests by @MrSmile, I’ve replaced the integer division code with float division, because it’s actually faster, with an extra slight twist to try and assist autovectorization (which is entirely impossible for integer division on x86). I’ve updated the first commit message accordingly. I’ve also used this opportunity to reduce noisy changes between the two commits. I’ve also added a code comment where I previously posted a GitHub comment. |
libass/ass_bitmap.c
Outdated
```c
// The cost (which we accept) with IEEE float32 is that just two out
// of the 33 million possible operand tuples round in the "wrong"
// direction after evaluating to almost exactly an integer and a half.
const float over65025 = nextafterf(65025, FLT_MAX);
```
I ultimately decided to just go with `nextafterf` as the universally safest choice (with no assumptions about the floating-point format). The drawbacks I considered are:

- It may be absent on some older systems. But we make extensive use of `lrint` and I invoke `lround` once in karaoke code, and I’d like to hope that libraries that contain those functions also have `nextafterf`.
- It’s not necessarily evaluated as a compile-time constant. But I’ve hoisted it out of the pixel-wise loop, so it’s now a single call per bitmap, which shouldn’t hurt.

I tried to emulate this particular `nextafterf` call through basic arithmetic, but couldn’t make it work without assuming things about the floating-point format. Which is likely very safe in practice, but if I can avoid it at no significant cost, I’d rather avoid it.
I think the main problem is outputting 4 bitmaps instead of 2 for \kf. Also a bunch of nitpicks.
```diff
@@ -30,7 +30,7 @@ typedef struct cache Cache;
 // cache values

 typedef struct {
-    Bitmap bm, bm_o, bm_s;
+    Bitmap bm, bm_o, bm_o_k, bm_s;
```
I think a simple symmetric `bm_k` is better.
I don’t disagree that this naming is odd, but will it not be a concern that `bm_k` doesn’t quite explain what it actually is? Like, bitmap of karaoke what? (It’s specifically the outline, like `bm_o`, but exclusively for karaoke.) I suppose it should be fine with the comment edited…
```diff
@@ -106,6 +106,8 @@ START(filter, filter_desc)
 GENERIC(int, be)
 GENERIC(int, blur)
 VECTOR(shadow)
+GENERIC(uint8_t, alpha_primary)
+GENERIC(uint8_t, alpha_secondary)
```
`alpha_*` correspond to `c[0]` and `c[1]`. I think it's better to keep 0 and 1 literally instead of confusing off-by-one `primary` and `secondary`. Either in the form of `alpha[2]` (but I think the cache macro doesn't support arrays) or `alpha0`/`alpha1` or even `a0`/`a1`.
“Primary” and “secondary” come from the ASS style field names `PrimaryColour` and `SecondaryColour`, which denote their meanings. But sure, `alpha0` and `alpha1` could work too. At one point I also considered `alpha_left` and `alpha_right`.
libass/ass_render.c
Outdated
```diff
@@ -789,6 +789,15 @@ static ASS_Image *render_text(ASS_Renderer *render_priv)
         if ((info->effect_type == EF_KARAOKE_KO)
             && (info->effect_timing <= 0)) {
             // do nothing
+        } else if ((info->effect_type == EF_KARAOKE_KF) && info->bm_o_k) {
+            tail =
+                render_glyph(render_priv, info->bm_o, info->x, info->y,
```
I think it's better to rewrite `render_glyph()` to produce one output bitmap instead of two.
```c
Bitmap *bm_s_source = &v->bm_o;
bool share_fix_outline =
    !(flags & FILTER_FILL_IN_BORDER) && !(flags & FILTER_FILL_IN_SHADOW) &&
    (k->filter.alpha_primary == 0xFF || k->filter.alpha_secondary == 0xFF);
```
Maybe it's better to put `0xFF - alpha` into `filter`? There are checks for `0xFF` only, and alpha gets reversed inside of `fix_outline()` too.
We use the ASS-native alpha scale, where `255` means transparent and `0` means opaque, absolutely everywhere, I believe, so I think flipping it here would be confusing. For example, compare this to the code that originally decides whether to set `FILTER_FILL_IN_SHADOW`.
```diff
@@ -112,7 +112,7 @@ typedef struct {
     BitmapRef *bitmaps;

     int x, y;
-    Bitmap *bm, *bm_o, *bm_s;          // glyphs, outline, shadow bitmaps
+    Bitmap *bm, *bm_o, *bm_o_k, *bm_s; // glyphs, outline, shadow bitmaps
```
Comment should be updated too.
```diff
@@ -60,24 +60,25 @@
 #endif

-void ass_synth_blur(const BitmapEngine *engine, Bitmap *bm,
+bool ass_synth_blur(const BitmapEngine *engine, Bitmap *bm,
```
I think it's better not to change this function and to handle the blurriness check outside of it. At least the threshold here (0.001) is arbitrary, in contrast with the blur quantization, which is based on mathematical considerations.
What I really want to know is what the actual bitmap contains by the time I get to `fix_outline`: whether it’s proper coverage values, which allow me to use the perfect formula, or approximate and nonmonotonic signal-processed samples, which necessitate the “border is already pixelated” formula. I don’t care so much (or at all) about whether blur was requested in the script, but rather about whether it was actually applied and applied successfully. Even in the rare case where we try to apply blur and fail due to lack of memory or something, I can still apply the perfect formula to the result, although it probably doesn’t really matter.
(Also, Gaussian blur already returns a `bool` status and we’ve been ignoring it all this time.)
The current version leads to a “use after free” in ART's karaoke tests: https://github.com/TheOneric/libass-regression-tests-tmp/runs/2319224287?check_suite_focus=true#step:10:85 (for some reason GitHub requires one to be logged in to see any logs)

To recreate, enable at least ASAN; ART uses `-fsanitize=address -fsanitize=undefined -fsanitize=float-cast-overflow -fno-sanitize-recover=all`.
Fixed the memory bug. (Good catch!) By the way, 216-vertical-2252.png in ART will need to be regenerated: it exhibits pretty bad aliasing that is fixed by this PR. (But it’s close enough that it earns it a “BAD” success rather than an outright failure: that’s kinda nice.) Haven’t redone
My biggest question here is what this does to performance, since it potentially runs on quite a lot of data. In practice, are compilers smart enough to extract the If it's fine on those fronts, I'm 👍 on this.
Back when this was discussed on IRC, MrSmile ran tests and saw autovectorization kick in on amd64, which I think implies hoisting of the Quoting the IRC log with the exact numbers below. @MrSmile Do you remember what the 10s number quoted at the beginning was? Looking at the log now, I’m not sure what the baseline number is: 10s or 16s? … Oh, was it perhaps 10s vectorized and 16s non-vectorized? Does the old/current code vectorize at all?
If I'm not mistaken |
Tentative bottom line, assuming that 10s/16s was resp. vectorized/unvectorized:
Yes, 10s for -Ofast and 16s for -O2.
I'm planning to create an assembly version when this PR lands, so it's not a strict necessity. On the other hand, I usually build libass with -Ofast and haven't encountered problems so far; maybe it's worth increasing the default optimization level, especially for non-x86 architectures.
Oh man, I’ve just rediscovered that this was in my very first batch of patches that was my first attempt to contribute to libass 😂 Even before other things that I thought came earlier. And even listed as the very first item. And it still isn’t finished. |
Update on this: since GCC 12 autovectorization is part of |
Whenever a glyph has a border, the original glyph's own ASS_Image is applied on top of the border's ASS_Image. Let's mathematically devise such alpha values for the border image that the resulting colors exactly match (modulo rounding) the colors produced by a hypothetical ASS_Image that contains both the glyph and the border.

Let a_i be the opacity of a single pixel within an ASS_Image, where i is either g for glyph or o for outline (border). Further, let A_i be the opacity and c_i the color of the entire ASS_Image. This commit only considers cases when no blur is applied, so that the border is a solid, expanded copy of the glyph. Blur will be addressed in a separate commit. Think of a_i as the _area_ covered by the glyph/border, measured in pixels. We have 0 <= a_g <= a_o <= 1. We also have 0 <= A_g, A_o <= 1.

We need a background color. Let's denote it as c_b. The background color includes any effects from applying the shadow of the glyph. The final color we want to achieve is:

* on the area covered by the original glyph, A_g c_g + (1 - A_g) c_b;
* on the area covered by the border, A_o c_o + (1 - A_o) c_b;
* on the remaining area, c_b.

Averaging the color over the pixel's area gives:

    a_g (A_g c_g + (1 - A_g) c_b)
    + (a_o - a_g) (A_o c_o + (1 - A_o) c_b)
    + (1 - a_o) c_b

The color we actually get after blending the border with the background and then blending the original glyph with the result is:

    (A_g a_g) c_g + (1 - A_g a_g) ((A_o a_o) c_o + (1 - A_o a_o) c_b)

We want to replace a_o in the actual color's expression with a value that makes it match the expected color. Let's call this value a_o'. Equating both expressions and solving for a_o' gives:

    A_o (c_o - c_b) = 0, and a_o' does not matter; or
    A_g a_g = 1, and a_o' does not matter; or
    a_o' = (a_o - a_g) / (1 - A_g a_g)

Use the solution formula from the third case to calculate a_o', as the first two cases allow a_o' to be anything.
The calculation itself is performed and rounded (half towards +infinity) using floating-point division, because this is actually faster than integer division and because this is vectorizable on x86 (massively improving the speed further).

The only issue is if the 1 - A_g a_g denominator should be zero. A_g a_g can only equal 1 if A_g = a_g = 1. Since a_g <= a_o <= 1, a_o must also be 1. As it happens, libass already handles this case separately: as a precaution, it checks that a_g < a_o and sets a_o' to zero if this condition fails to hold. Retain this check for the numerator, still as a precaution. Unfortunately, using this check for the denominator or the whole formula seems to break automatic vectorization of the loop, so use a trick: instead of 1 - A_g a_g, use nextafter(1) - A_g a_g and accept the very rare and minor precision loss. This is just enough to avoid division by zero and allows the loop to be autovectorized. Since in this case a_o' can be any value, this does not adversely affect our equation solving. Moreover, if A_g a_g = 1, then the numerator must be 0, so we will, in fact, produce the same value (0) as we do now with the a_g < a_o check.

The formula requires knowledge of A_g. To enable this, the opacity of the original glyph is passed to fix_outline and (via CompositeHashKey) to ass_composite_construct. Generated composite bitmaps depend on it, so composites with different primary/secondary color alpha (depending on karaoke) are now distinguished by the composite cache.

NB: this commit produces phantom, fully transparent ASS_Images that duplicate parts of border images when \kf is in effect. This may be a problem if any consumer assigns any meaning to fully transparent ASS_Images.
Blurred alphas are not coverage values. The area that is not covered by the original glyph is a mix of the border and the background, governed by the blurred border opacities. To get an exact result, we would need to integrate the blurred opacities on fragments of pixels that exclude the original glyph (or, equivalently, on fragments that are covered by it, as these fragments are complementary). This is too hard, and in fact blur itself already employs approximations to avoid integration even on whole pixels.

Give up on exact values when blur is applied, and treat each pixel of the blurred border image as a uniformly flat translucent square. Note that we can in this case get non-zero opacities even when the glyph's coverage is larger than the border's uncorrected opacity, invalidating our current o[x] > g[x] sanity check.

Writing and solving equations like before, we want:

    a_g (A_g c_g + (1 - A_g) c_b)
    + (1 - a_g) ((A_o a_o) c_o + (1 - A_o a_o) c_b)

and we obtain:

    A_o (c_o - c_b) = 0, and a_o' does not matter; or
    A_g a_g = 1, and a_o' does not matter; or
    a_o' = (a_o - a_g a_o) / (1 - A_g a_g)

Handle the first case like the third, and handle the second case by setting a_o' = 0.

The observable difference for non-blurred bitmaps was not large in my tests, but the code is simple, so the exact non-blurred calculations are retained.
Rebased on master and addressed two nits. (More substantial changes still pending.) Meanwhile, we’ve run performance tests on several other CPUs. Full times of AMD Ryzen 7 5800H (Zen 3; released in early 2021), GCC 12.2.0 (Rev10, Built by MSYS2 project), Windows 11 (x86_64-w64-mingw32, UCRT): For this test, I recompiled only
Despite vectorization being on by default in GCC 12, it doesn’t actually show any benefit here at Yeah, turns out On another note, neither GCC nor Clang (according to Godbolt) ever try using 32-by-16-bit integer division instructions in this code on x86; instead, they use 64-by-32-bit division. There’s a chance that they’d be faster, although I’m guessing (without knowing anything) maybe not on recent CPUs. M1 Max (AArch64/ARMv8.5-A), Apple clang version 14.0.0 (clang-1400.0.29.202), arm64-apple-darwin22.3.0 (tests by @rcombs):
Elsewhere on the Internet, microbenchmarks have shown ARM64 to have slower float div than int div:
It is possible that these benchmarks are setup incorrectly per se, or perhaps they’re just not directly applicable to our use case because we do additional arithmetic here besides the division itself (even if not a lot)—or perhaps even because of the need to work around division by zero in one way or another, which adds instructions in the integer case but comes for free with floats.
On the other hand, we should probably have hand-written assembly code for this to avoid depending on finicky compiler optimizations that have an effect this big. And at that point, maybe it would be wiser, actually, to keep the C code based on pure integers? Or not? We haven’t tried LUTs here (as @MrSmile did last time); maybe we should. |
Actually it's not a boolean parameter: there are blurs small enough to fall in between:

```c
unsigned num = 255 * FFMAX(o[x] - g[x], 0);
num = 255 * num + (o[x] * (255 - g[x]) - num) * blurred;
```

If there's no significant performance impact, I would prefer the continuous version instead of two discrete ones (it's also one asm function instead of two).
It's executed on the same hardware anyway, so mostly instructions size matter in this case.
There's a tendency to vectorize only float divisions, so I think generic float version would be better. I leave here my old table version if anybody wants to experiment with it. |
What would be the meaning of this parameter, and how would you compute it? I don’t know how to assign meaning to anything other than a Boolean.
Old CPUs have distinctly lower timings for shorter-operand variants in Agner Fog’s tables. Recent CPUs, though, have them mostly the same (for all lengths except 128-div-64) or even a little worse for the 32-div-16 variant than for 64-div-32; maybe that’s why GCC and Clang aren’t using it. |
It has just occurred to me that there’s another opportunity to explore here. Most of the pixels have either Code here. This does prevent any kind of vectorization, but it is almost as fast with integers as the vectorized float version and faster than current unvectorized
It would be "effective blurriness". I'm going to do some math to find a good approximation for it depending on blur radius, but we can use only 0 and 255 initially. Simplified lerp formula:

```c
unsigned num = FFMAX(65025 * (o[x] - g[x]) + blurred * (255 - o[x]) * g[x], 0);
```

Not sure that the more complicated source is worth the gain. I think it's better to postpone all optimizations till the assembly implementation. At the current stage we should decide on a good reference algorithm to simplify future assembly testing. Do the float and int branches produce exactly the same results?
```c
float den = over65025 - alpha_g * g[x];
o[x] = num / den + 0.5f;
```
You can match the integer version perfectly, at negligible perf impact, like this:
```diff
-float den = over65025 - alpha_g * g[x];
-o[x] = num / den + 0.5f;
+float den = 65025 - alpha_g * g[x];
+o[x] = FFMIN(num / den, 255.f) + 0.5f;
```
This assumes we have IEEE… uh… no, does this even work?

If `den` is 0, then `num` is 0, too. With IEEE float, `num / den` is NaN, and this `FFMIN` returns NaN (if I’m reading its definition correctly—as much as it doesn’t expect NaNs as inputs in the first place), and the conversion to integer upon assignment is UB by the standard (and not-zero in practice: mind, the exact result value doesn’t even matter if the ASS_Images are overlaid one on top of the other as this code assumes, but this doesn’t match the integer code).

TL;DR: isn’t the FFMIN useless due to NaN?
> and not-zero in practice

Actually, it probably is zero, at least on x86? The conversion probably goes through int32, giving `INT32_MIN` on x86, and then truncated to int8, giving zero?
In that case, this is actually a valid trick to use in hand-written assembler code that could target any C implementation that explicitly guards against division by zero.
But in C itself, no, I don’t think this is allowed. In fact, §6.5.5 Multiplicative operators explicitly declares any division by zero UB in any case.
Actually, instead of FFMIN we could use

```c
float den = 65025 - alpha_g * g[x];
o[x] = num / (den + 1e-30f) + 0.5f;
```

This should have a significantly lower impact than over65025.
Or we could do `float divres = (num / den); o[x] = (divres >= 0.f ? divres : 0.f) + 0.5f;`.

Are we aware of any real targets we care about that don't provide IEEE floats? I'm perfectly comfortable having those as a requirement.
I don’t know, but I’m not a specialist in other platforms. My only worthless contribution is that IIRC GPUs don’t (or didn’t), but I’d be surprised if libass could be compiled for a GPU.

But for starters, `-ffast-math`, which e. g. MrSmile uses for his own builds, is a common way to throw NaN explicitly out the window. It probably wouldn’t affect this line simply because the most trivial thing for the compiler to do is exactly the thing we find most desirable, but I’m not a fan of “probably usually maybe correct but explicitly unguaranteed” code. This function doesn’t seem worth bumping libass’s requirements to include IEEE compliance, and it might well harm us in other places (I assume MrSmile uses `-ffast-math` not for nothing).

For another thing, neither GCC nor Clang (nor MSVC) actually fully support Annex F nor define `__STDC_IEC_559__` (although some platforms have system headers that define it instead—which isn’t correct), so this remains technically undefined behaviour. Indeed, Clang even used to complain about it with `-fsanitize=undefined`. But again, it’s true that, in practice, their optimizers do ensure 0/0=NaN, except when this is disabled via `-ffast-math` etc. (and so Clang’s UBSan excludes this check by default since Clang 9). [17000] [19535]
den + 1e-30f
I like this idea (if we stick to dividing every pixel).
> This should have a significantly lower impact than over65025.

Remind me: what is the impact of over65025 in this context? The `nextafterf` call itself? But it happens only once, whereas `+ 1e-30f` would have to happen in each iteration.
iirc the call to `nextafterf` was very expensive, while the constant adds are quite cheap? But also, the `nextafterf` doesn't quite match the int implementation in all cases, while the add does.
> Remind me: what is the impact of over65025 in this context?

Result change. Basically `+ 1e-30` only matters for `den == 0`, while `over65025` can shift values even in the nonzero case.
I’ve rerun some tests in a slightly more stable environment. I ran each test 30 times on the Ryzen. Most of the results still annoyingly failed normality tests, so take my significance tests with a
Note how integer division yet again continues to appear slower than floating-point division even when no vectorization is in play.
This is the part I don’t understand: what is the physical meaning of this? what is the reference value you’re going to try approximating? how? If it’s just “what subjectively looks good”, in all honesty I expect the current
I disagree. Surely we want the fastest code, not the simplest? We’ve already added the float stuff only for this reason, which IMO makes it more complicated: just look at the size of that comment explaining By the way, if it seems scarily complicated due to the And given just how fast it is, maybe we don’t actually need any hand-written assembler code at all? Then again, I suppose you could hand-write vectorized assembly that also performs this check against the whole vector: if it’s all zeros or all 255s, no division is necessary. That might be faster still.
No, not with the All pure-integer variants do match each other, including the conditional-division code I proposed in my last comment, and so does the conditional-division code using Speaking of rounding, it probably isn’t even optimal to always round to the nearest integer to begin with: ideally we’d round to the nearby integer that gives the more accurate value for the final colour after both images are blended together. But I haven’t tried doing the maths for this. On another note, @rcombs reported another speedup (77% total time, or 1.3× total speed) on the M1 Max by doing all computations in float16 in the range 0‥1. Of course, this implies some precision loss, and given that we’re working with just 8 bits of precision to begin with, I’m slightly reluctant to accept it, but it’s probably entirely fine. |
On the M1:
fyi, gcc has a |
FWIW I’ve now tried and measured this. With a simple 8-pixel handling loop (using AVX2 intrinsics) and with the added conditional skip on “all eight bytes = 0 or all eight bytes = 255”, the time has gone down to 10.4 s (vs 11.3–11.7 s for the unconditional vectorized code). And I won’t be surprised if it can be written better and faster still by someone who is better at SIMD assembly. |
It's mostly due to the synthetic nature of the sample. I suspect that in real-world cases with letters it would actually be worse (for small letters due to never hitting the uniform 0/255 case, and for large letters due to branch mispredictions). Actually, even the scalar 0/255 variant can be slower on real-world cases; better to test it first.
Summarising 32bit ARM tests, mostly already discussed on IRC: On all 32bit ARM except Since we already concluded a FPU is required to use libass at reasonable performance anyway, we probably don't need to be concerned about performance on software-float CPUs. For all tests:
32bit tests
Config options:
clang doesn't accept . Average Timings on the Cortex-A73 have a run-to-run variance of <30ms. For the A53 it’s a bit more but not enough to change the relative order between int and float. Cortex-A53 running a 32bit (RaspberryPi 3 B+) @ 1.4GHz:
The results have been omitted, but even lemire.me:
As in previous benchmarks, the results from the simple testcase don't replicate the ordering of actual Cortex-A73 running 32bit @ 1.8GHz
Notes:
64bit testFor comparison, the same Cortex-A73 running an aarch64 build:
Cortex-A73 aarch64 @ 1.8GHz
I didn't record disassembly for those binaries, but tests on Godbolt suggests no autovectorisation of the float div occured. Perhaps the better float performance is due to fundamental differences between 32bit and 64bit ARM. lemire.me:
Here too, the simple div-only-function benchmark from lemire does not correlate to the performance of the actual int/float-based

Conclusion?

Cortex-A53’s FPU performs surprisingly poorly, falling behind even software integer division. A more recent Cortex-A73 fares better in 32bit mode, beating non-native int and almost matching native int. In 64bit mode float again outperforms (native) int on the A73 (no 64bit tests easily possible atm for the A53).

Compilers apparently won't use NEON for autovectorisation by default due to incompatibilities with IEEE fp. Ordering in simple float/int div benchmarks doesn't map to the same ordering in actual float/int-based

Cortex-A53 constitutes the first instance we've seen where

As it can go either way, this probably doesn't help in deciding on whether to use integers or float for the C implementation.
Now also results for a genuine 32-bit-only CPU (Cortex-A9, ARMv7-a+fp+neon+sec+mp, no fp16), though I doubt this could handle video playback and non-trivial subs at 720p. Average timings for libass tests vary by about 50ms between sets of runs. Within a run set, sometimes a few runs took 100-250ms longer, presumably due to some service process(es) competing for CPU. perf settings:
|
Not directly related to the patch itself, but mentioning here since all discussion about future default compiler flags also happened in this PR. With
To be fair, I haven't seen Full logs with timings of individual functions for |
Ultimately, this just means this function is an even stronger candidate for explicit vectorization in assembly. |
@MrSmile Is this still something you’re willing to look into—is it worth implementing this into the code (C and assembly) now? I’m guessing it’s an additional minor slowdown, so it would be a bit of a waste to add it and not use it; but on the other hand, modifying the several implementations across C & assembly later also wouldn’t be too fun. |
I think for now it would be better to use blur radius (well, we have two, so some average value like For 1D case the exact solution would be something like this. UPD: switching between |
I’ve ported my ancient `alpha` branch onto current master and fixed karaoke, so this should now really solve the problem for good. This would supersede and close #480. It would fix #76 and fix #473.
My only question is whether this PR’s current approaches to caching and producing ASS_Bitmaps are right.
An alternative approach I considered was to include `effect_timing` in the cache key and have `fix_outline` take `effect_timing` and both alphas and internally process the bitmap in two parts like karaoke does it. This would have required no changes to `render_text` and avoided phantom ASS_Images. The big drawback would be that \kf syllables would effectively never be cached (except for very slowly-moving \kf).

Ideally, combined glyph/fill bitmaps would be cached separately from border and shadow bitmaps:

- `\1a` and/or `\2a` would share (and avoid recomputing) their glyph bitmaps.
- `\be` or `\blur` or fill-in flags.

To get rid of the phantom transparent ASS_Images that `render_text` now outputs for borders with \kf, `render_glyph` could skip images whose alpha is 255 (fully transparent). This could slightly help other cases, too, such as when the script has an explicit 255 alpha. On the other hand, if there is any client out there that wants to do something nontrivial with the ASS_Images it receives such as draw bounding boxes around them, it would then miss these images entirely, which seems undesirable. Alternatively, `render_glyph` could get an extra argument pointing to a second source bitmap, which would work well but require extra (easy) code changes in `render_text`.