Skip to content

Latest commit

 

History

History
148 lines (97 loc) · 2.99 KB

CHANGES.SIMD.rst

File metadata and controls

148 lines (97 loc) · 2.99 KB

Changelog (Pillow-SIMD)

9.5.0.post2 & 9.4.0.post2 & 9.3.0.post2 & 9.2.0.post2 & 9.1.1.post2 & 9.0.0.post3

  • Fixed color ovrflow in LUT processing

9.0.0.post1

  • Fixed possible overflow in LUT processing
  • Restored compatibility with Visual C Compiler

7.0.0.post4

  • Filter: fixed wrong offset handling for 3x3 single-band version

7.0.0.post3

  • ColorLUT: fixed potential access violation, up to 2x faster

7.0.0.post2

  • ColorLUT: SSE4 & AVX2

7.0.0.post1 & 6.2.2.post1 & 6.1.0.post1 & 6.0.0.post2

  • Bands: access violation in getband in some environments

7.0.0.post0

  • Reduce: SSE4

6.0.0.post1

  • GCC 9.0+: fixed unaligned read for _**_cvtepu8_epi32 functions.

6.0.0.post0 and 5.3.0.post1

  • Resampling: Correct max coefficient calculation. Some rare combinations of initial and requested sizes lead to black lines.

4.3.0.post0

  • Float-based filters, single-band: 3x3 SSE4, 5x5 SSE4
  • Float-based filters, multi-band: 3x3 SSE4 & AVX2, 5x5 SSE4
  • Int-based filters, multi-band: 3x3 SSE4 & AVX2, 5x5 SSE4 & AVX2
  • Box blur: fast path for radius < 1
  • Alpha composite: fast div approximation
  • Color conversion: RGB to L SSE4, fast div in RGBa to RGBA
  • Resampling: optimized coefficients loading
  • Split and get_channel: SSE4

3.4.1.post1

  • Critical memory error for some combinations of source/destination sizes is fixed.

3.4.1.post0

  • A lot of optimizations in resampling including 16-bit intermediate color representation and heavy unrolling.

3.3.2.post0

  • Maintenance release

3.3.0.post2

  • Fixed error in RGBa -> RGBA conversion

3.3.0.post1

Alpha compositing

  • SSE4 and AVX2 fixed-point full loading implementation. Up to 4.6x faster.

3.3.0.post0

Resampling

  • SSE4 and AVX2 fixed-point full loading horizontal pass.
  • SSE4 and AVX2 fixed-point full loading vertical pass.

Conversion

  • RGBA -> RGBa SSE4 and AVX2 fixed-point full loading implementations. Up to 2.6x faster.
  • RGBa -> RGBA AVX2 implementation using gather instructions. Up to 5x faster.

3.2.0.post3

Resampling

  • SSE4 and AVX2 float full loading horizontal pass.
  • SSE4 float full loading vertical pass.

3.2.0.post2

Resampling

  • SSE4 and AVX2 float full loading horizontal pass.
  • SSE4 float per-pixel loading vertical pass.

2.9.0.post1

Resampling

  • SSE4 and AVX2 float per-pixel loading horizontal pass.
  • SSE4 float per-pixel loading vertical pass.
  • SSE4: Up to 2x for downscaling. Up to 3.5x for upscaling.
  • AVX2: Up to 2.7x for downscaling. Up to 3.5x for upscaling.

Box blur

  • Simple SSE4 fixed-point implementations with per-pixel loading.
  • Up to 2.1x faster.