Use libjpeg-turbo for all Lossless JPEG bit depths #105

Open: wants to merge 5 commits into base: master

SimonSegerblomRex

@SimonSegerblomRex SimonSegerblomRex commented Jun 24, 2024

Enabled by the solution to libjpeg-turbo/libjpeg-turbo#768
(Planned to be included in libjpeg-turbo release 3.1.0.)

There are still some Lossless JPEG encoded images that libjpeg-turbo refuses to decode; see the discussions in libjpeg-turbo/libjpeg-turbo#586 and libjpeg-turbo/libjpeg-turbo#765.

Note to self while testing with local copy of libjpeg-turbo:
Put this in the customize_build function used by setup.py:

libjpeg_turbo_path = <path to libjpeg-turbo>
EXTENSIONS['jpeg8']['sources'] = []
EXTENSIONS['jpeg8']['include_dirs'] = [libjpeg_turbo_path + "/src"]  # moved to src in the dev branch
EXTENSIONS['jpeg8']['library_dirs'] = [libjpeg_turbo_path]

and make sure to set

export LD_LIBRARY_PATH=<libjpeg_turbo_path>

before running any python script importing imagecodecs.
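
The assignments in the note above could be wrapped in a small helper. This is only a sketch under assumptions: `EXTENSIONS` is taken to be a dict of per-extension build settings, as the snippet implies, and the function name `use_local_libjpeg_turbo`, the stand-in dict, and the example path are hypothetical.

```python
def use_local_libjpeg_turbo(extensions, libjpeg_turbo_path):
    """Point the jpeg8 extension at a local libjpeg-turbo checkout.

    `extensions` is assumed to map extension names to dicts of build
    settings, mirroring EXTENSIONS in the note above.
    """
    jpeg8 = extensions["jpeg8"]
    # Same three assignments as in the note above.
    jpeg8["sources"] = []
    jpeg8["include_dirs"] = [libjpeg_turbo_path + "/src"]  # headers live in src/ on the dev branch
    jpeg8["library_dirs"] = [libjpeg_turbo_path]
    return extensions


# Stand-in for the real EXTENSIONS dict (hypothetical contents).
example = {"jpeg8": {"sources": ["x.c"], "include_dirs": [], "library_dirs": []}}
use_local_libjpeg_turbo(example, "/opt/libjpeg-turbo")
```

LD_LIBRARY_PATH still has to be exported separately, as described above.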

@SimonSegerblomRex
Author

This is WIP and will stay as a draft pull request until there's an official libjpeg-turbo release that includes the changes necessary.

@cgohlke
Owner

cgohlke commented Jun 24, 2024

Thanks. I am aware of the ongoing work in libjpeg-turbo. Note that the JPEG codec in imagecodecs switches to the LJPEG codec for bit-depths not supported by libjpeg-turbo.

@SimonSegerblomRex
Author

SimonSegerblomRex commented Jun 24, 2024

Note that the JPEG codec in imagecodecs switches to the LJPEG codec for bit-depths not supported by libjpeg-turbo.

Yes, ljpeg_decode seems to work fine and will still be needed as backup in jpeg_decode for images that libjpeg-turbo refuses to decode due to the issues discussed in libjpeg-turbo/libjpeg-turbo#586 and libjpeg-turbo/libjpeg-turbo#765. ljpeg_encode shouldn't be needed any longer though.
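
The backup arrangement described above (try one decoder first, fall back to another when it refuses the stream) can be sketched generically. This is not how imagecodecs implements the switch; `decode_with_fallback` and the two stub decoders are hypothetical names used only for illustration.

```python
def decode_with_fallback(data, primary, fallback, errors=(Exception,)):
    """Try `primary(data)`; if it raises one of `errors`, use `fallback(data)`."""
    try:
        return primary(data)
    except errors:
        return fallback(data)


# Stub decoders for illustration only (not the imagecodecs functions).
def strict_decoder(data):
    if data.startswith(b"bad"):
        raise ValueError("Bogus Huffman table definition")
    return ("strict", data)


def lenient_decoder(data):
    return ("lenient", data)


print(decode_with_fallback(b"ok", strict_decoder, lenient_decoder))
print(decode_with_fallback(b"bad table", strict_decoder, lenient_decoder))
```

Restricting `errors` to the exception type the primary decoder actually raises avoids hiding unrelated failures behind the fallback.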

@@ -109,7 +109,7 @@
 - `libheif <https://github.com/strukturag/libheif>`_ 1.17.6
 (`libde265 <https://github.com/strukturag/libde265>`_ 1.0.15,
 `x265 <https://bitbucket.org/multicoreware/x265_git/src/master/>`_ 3.6)
-- `libjpeg-turbo <https://github.com/libjpeg-turbo/libjpeg-turbo>`_ 3.0.3
+- `libjpeg-turbo <https://github.com/libjpeg-turbo/libjpeg-turbo>`_ 6ec8e41f50e5a83fe078732cbf0360272165ed45
Author

This is the latest sha1 from the dev branch. No official release tag.

@SimonSegerblomRex
Author

SimonSegerblomRex commented Jun 25, 2024

I tested this with a 16-bit Lossless JPEG file as input:

import sys

from imagecodecs import imread, jpeg8_decode, jpeg8_encode
from numpy.testing import assert_array_equal

filename = sys.argv[1]

image = imread(filename)
if image.ndim > 2:
    image = image[..., 0].copy()  # copy to fix strides

for bit_depth in range(16, 1, -1):
    print(bit_depth)
    if bit_depth <= 8 and image.itemsize > 1:
        # FIXME: Should this really be necessary?
        image = image.astype("u1")
    enc = jpeg8_encode(
        image,
        lossless=True,
        predictor=1,
        bitspersample=bit_depth,
    )
    dec = jpeg8_decode(enc)
    assert_array_equal(image, dec)
    image <<= 1

It works, but the case with bit-depth <= 8 in a uint16 array should be handled in a better way.

EDIT: Fixed this with the check here.

And improve error handling.
@@ -141,7 +141,7 @@ def jpeg8_encode(
 (src.dtype == numpy.uint8 or src.dtype == numpy.uint16)
 and src.ndim in {2, 3}
 # src.nbytes <= 2147483647 and # limit to 2 GB
-and samples in {1, 3, 4}
+and samples in {1, 2, 3, 4}
Author

Seems to work as expected with 2 components:

import sys

from imagecodecs import imread, jpeg8_decode, jpeg8_encode
from numpy.testing import assert_array_equal

filename = sys.argv[1]

image = imread(filename)
enc = jpeg8_encode(
    image,
    lossless=True,
    predictor=1,
    bitspersample=16,
)
dec = jpeg8_decode(enc)
assert_array_equal(image, dec)

Using this input file: 16

These files were created by a Lossless JPEG
encoder (implemented by me...) that contained
a bug that caused the largest Huffman code to
contain all ones. There is no problem decoding
these images (even for libjpeg-turbo, see the
discussion in libjpeg-turbo/libjpeg-turbo#765),
but libjpeg-turbo refuses to do so since they
are not valid according to the JPEG spec.
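
To illustrate the all-ones condition described above, here is a minimal sketch that walks the canonical JPEG Huffman code assignment from the 16 code-length counts of a table and reports whether any assigned code would consist only of 1 bits. The function name `has_all_ones_code` is hypothetical, and this is not the check libjpeg-turbo itself performs.

```python
def has_all_ones_code(bits):
    """Return True if a JPEG Huffman table assigns an all-ones code.

    `bits` holds 16 counts: bits[i] is the number of codes of length i + 1.
    Codes are assigned canonically: consecutive values within one length,
    then a left shift when moving on to the next length. A code at or beyond
    the all-ones value for its length (including an overfull table) counts.
    """
    if len(bits) != 16:
        raise ValueError("expected 16 code-length counts")
    code = 0
    for length, count in enumerate(bits, start=1):
        for _ in range(count):
            if code >= (1 << length) - 1:
                return True
            code += 1
        code <<= 1
    return False


print(has_all_ones_code([1, 1] + [0] * 14))  # codes 0 and 10
print(has_all_ones_code([1, 2] + [0] * 14))  # codes 0, 10 and 11
```

In the second call the last two-bit code is `11`, which is the situation the broken encoder produced for its longest code.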

Now I recreated them using jpeg8_encode.

dng0.ljp was not a valid file.
@SimonSegerblomRex
Author

SimonSegerblomRex commented Jun 26, 2024

(I replaced the broken dng*.ljp files that were created using my broken Lossless JPEG encoder.)

I did a quick benchmark comparing jpeg8_decode and ljpeg_decode. jpeg8_decode is about 40% faster using this input: Pentax-K-1-DNG-extracted.jpg (3696x4950, 2 components). (Note: Pentax DNG files are the only images I've found in the wild that are hit by this problem, so you need that patch to get past the "Bogus Huffman table definition" error.)
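
A comparison like the one above could be timed along these lines. This is a sketch, not the benchmark that was actually run: `best_time` and `relative_speedup` are hypothetical helpers, and "percent faster" is taken here to mean the reduction in run time relative to the slower decoder, which is only one possible definition.

```python
import timeit


def best_time(func, data, repeat=5, number=1):
    """Return the best time in seconds for one call of `func(data)`."""
    timings = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    return min(timings) / number


def relative_speedup(slow_seconds, fast_seconds):
    """Percentage by which the fast run time undercuts the slow run time."""
    return 100.0 * (slow_seconds - fast_seconds) / slow_seconds


# Stand-in workload; in practice `func` would be a decoder and `data`
# the encoded bytes read from a file.
elapsed = best_time(sorted, list(range(1000)))
```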

Everything seems to work as expected now, but I guess we should wait for an official libjpeg-turbo tag.

@SimonSegerblomRex
Author

SimonSegerblomRex commented Jun 26, 2024

I found this source containing a lot of Lossless JPEG files (embedded in DICOM files). A quick test shows that libjpeg-turbo and lj92 produce slightly different results for some of them, e.g., gdcm-JPEG-LossLessThoravision.dcm. BitsPerSample is 15 and in the decoded arrays there are values as high as 65520 for lj92 and 65535 for libjpeg-turbo... something weird is going on here (even considering that the decoded values are probably supposed to be reinterpreted as signed values or something). Do you have any input regarding this file @malaterre? EDIT: Solved by using gdcmraw to extract the JPEG file. Now this file behaves as expected both with lj92 and libjpeg-turbo.
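
The range observation above can be checked mechanically. The sketch below assumes unsigned, right-aligned samples, so a 15-bit sample can be at most 2**15 - 1 = 32767; the helper names are hypothetical.

```python
def max_value_for_bits(bits):
    """Largest unsigned value representable in `bits` bits."""
    return (1 << bits) - 1


def out_of_range(values, bits):
    """Return the values that do not fit in `bits` unsigned bits."""
    limit = max_value_for_bits(bits)
    return [v for v in values if v > limit]


print(max_value_for_bits(15))
print(out_of_range([100, 65520, 65535], 15))
```

Both 65520 and 65535 exceed the 15-bit limit, which is why the decoded arrays looked suspicious.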

@malaterre

I found this source containing a lot of Lossless JPEG files (embedded in DICOM files). A quick test shows that libjpeg-turbo and lj92 produce slightly different results for some of them, e.g., gdcm-JPEG-LossLessThoravision.dcm. BitsPerSample is 15 and in the decoded arrays there are values as high as 65520 for lj92 and 65535 for libjpeg-turbo... something weird is going on here (even considering that the decoded values are probably supposed to be reinterpreted as signed values or something). Do you have any input regarding this file @malaterre?

@SimonSegerblomRex What do you get if you use thorfdbg/libjpeg ?

@SimonSegerblomRex
Author

SimonSegerblomRex commented Jun 26, 2024

@SimonSegerblomRex What do you get if you use thorfdbg/libjpeg ?

With thorfdbg/libjpeg I get:

reading a JPEG file failed - error -1038 - invalid stream, found invalid huffman code in entropy coded segment

and that's probably the right thing. The images decoded by lj92 and libjpeg-turbo are completely broken, so they would have been better off failing as well than trying to decode garbage.

@SimonSegerblomRex
Author

SimonSegerblomRex commented Jun 26, 2024

I found that lj92 fails to decode MARCONI_MxTWin-12-MONO2-JpegLossless-ZeroLengthSQ.dcm (just 0s out) while libjpeg-turbo decodes it without issues 👍 EDIT: After extracting and repairing the JPEG file using gdcmraw, it also decodes as expected with lj92.

@malaterre

@SimonSegerblomRex What do you get if you use thorfdbg/libjpeg ?

With thorfdbg/libjpeg I get:

reading a JPEG file failed - error -1038 - invalid stream, found invalid huffman code in entropy coded segment

and that's probably the right thing. The images decoded by lj92 and libjpeg-turbo are completely broken, so they would have been better off failing as well than trying to decode garbage.

What kind of command did you use ?

% gdcmraw gdcm-JPEG-LossLessThoravision.dcm  /tmp/bla.jpg
% jpeg /tmp/bla.jpg /tmp/bla.pgm
jpeg Copyright (C) 2012-2018 Thomas Richter, University of Stuttgart
and Accusoft

For license conditions, see README.license for details.


0 bytes memory not yet released.

15905134 bytes maximal required.

4197 allocations performed.

@SimonSegerblomRex
Author

SimonSegerblomRex commented Jun 26, 2024

EDIT: Using the output from gdcmraw (that's actually not part of the DICOM file) I get the same output using all three decoders 👍

First I just used this script to extract the JPEG file:

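import re
import struct
import sys

# JPEG markers as big-endian 16-bit values.
SOI = struct.pack(">H", 0xFFD8)  # start of image
SOF3 = struct.pack(">H", 0xFFC3)  # start of frame, lossless (Huffman)
EOI = struct.pack(">H", 0xFFD9)  # end of image

with open(sys.argv[1], "rb") as f:
    data = f.read()

# The lookahead lets candidate streams overlap. Each match runs from an SOI,
# through the first following SOF3, to the first EOI after that.
pattern = (
    b"(?=(" + re.escape(SOI) + b".*?" + re.escape(SOF3) + b".+?" + re.escape(EOI) + b"))"
)
for i, match in enumerate(re.finditer(pattern, data, re.S)):
    with open(f"{i}.jpg", "wb") as out:
        print(i)
        out.write(match.group(1))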

It seems like gdcmraw does some magic to repair the broken file.
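
When comparing extracted streams like these, it can help to read what the stream itself declares. The following is a minimal sketch, not part of the PR, that reads the sample precision, image size and component count from the first SOF3 segment. It uses a naive byte search and assumes the marker bytes do not occur earlier by coincidence; `parse_sof3` and the synthetic sample are hypothetical.

```python
import struct

SOF3 = b"\xff\xc3"  # start of frame, lossless (Huffman)


def parse_sof3(data):
    """Return (precision, height, width, components) from the first SOF3.

    Returns None if no complete SOF3 header is found.
    """
    pos = data.find(SOF3)
    if pos < 0 or pos + 10 > len(data):
        return None
    # Segment length, sample precision, lines, samples per line, components.
    _length, precision, height, width, components = struct.unpack_from(
        ">HBHHB", data, pos + 2
    )
    return precision, height, width, components


# Synthetic header for illustration: SOI followed by an SOF3 segment
# declaring 16-bit samples, 2 lines, 3 samples per line, 1 component.
sample = b"\xff\xd8" + SOF3 + struct.pack(">HBHHB", 11, 16, 2, 3, 1) + b"\x01\x11\x00"
print(parse_sof3(sample))
```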

@SimonSegerblomRex
Author

SimonSegerblomRex commented Jun 27, 2024

This is ready for code review (but there's still no new libjpeg-turbo release or tag).

@SimonSegerblomRex SimonSegerblomRex marked this pull request as ready for review June 27, 2024 13:27
@cgohlke cgohlke added the enhancement New feature or request label Jun 27, 2024
@cgohlke
Owner

cgohlke commented Jun 27, 2024

Thank you. I will review this when libjpeg-turbo 3.1 is released.
