Allow upload of PDFs to SIPI #1581

Closed
gfoo opened this issue Dec 10, 2019 · 17 comments
@gfoo

gfoo commented Dec 10, 2019

Are there any rules that determine whether the Sipi upload route accepts a PDF?

For example, I get an error with this file: Programme_Colloque_Bertrand.pdf

When I run the binary directly, the PDF seems fine:

$ docker run --rm -v $(pwd):/image daschswiss/knora-sipi:v10.1.1 -f "/image/Programme_Colloque_Bertrand.pdf" out-sipi.pdf
pdf has 2 pages

But with the upload route, the call hangs:

curl -s -X POST -F filename=@Programme_Colloque_Bertrand.pdf http://localhost:1024/upload?token=xxxx
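
To tell a hung upload apart from a slow one, the same request can be run with a hard time limit. The following is only a sketch in Python: the helper names and the 30-second default are assumptions for illustration, while `--max-time` and exit code 28 (operation timed out) are standard curl behaviour.

```python
import subprocess


def build_upload_command(pdf_path, url, max_seconds=30):
    """Return the curl argv for the upload request, with a hard time limit."""
    return [
        "curl", "-s",
        "--max-time", str(max_seconds),  # curl aborts after this many seconds
        "-X", "POST",
        "-F", f"filename=@{pdf_path}",
        url,
    ]


def upload_hangs(pdf_path, url, max_seconds=30):
    """Return True if curl gave up because the time limit was reached."""
    result = subprocess.run(
        build_upload_command(pdf_path, url, max_seconds),
        capture_output=True,
        text=True,
    )
    # curl exits with code 28 when the operation times out.
    return result.returncode == 28
```

For the file above, this would be called as `upload_hangs("Programme_Colloque_Bertrand.pdf", "http://localhost:1024/upload?token=xxxx")`.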
@gfoo
Author

gfoo commented Dec 10, 2019

According to the CINES website cited by @mrivoal, this PDF is considered archivable (see the Validation tab).

I used their tools to try to fix potential errors in the PDF (Correction PDF tab):

Both fixed files are now accepted by Sipi.

@gfoo changed the title from "upoad PDF file rules" to "upload PDF file rules" Dec 10, 2019
@gfoo
Author

gfoo commented Dec 10, 2019

We also used Acrobat to convert the file to PDF/A, but Sipi does not accept it.

Programme_Colloque_Bertrand-AcrobatPDFa-FIX.pdf

@gfoo
Author

gfoo commented Jan 9, 2020

@lrosenth Any news about that?

1/ Is there a test in Sipi that rejects particular PDFs? At a minimum, leaving PDF quality aside, I cannot import all the PDFs through the route http://localhost:1024/upload?token=xxxx

2/ Is there a well-known way to convert about 1000 PDFs? In this post I used PDFtk (not so easy to install on Linux/macOS, though I think it is possible), and Ghostscript does not do the job correctly for all our PDFs.

The Ghostscript solution seems more flexible to me. I used this command line (found somewhere...):
gs -dPDFA -dBATCH -dNOPAUSE -sColorConversionStrategy=UseDeviceIndependentColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=2 -sOutputFile=output.pdf input.pdf

Do you have any advice on these parameters?

(I'm going to prepare some scripts to test all of this and provide you with all our PDFs.)
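
A batch run over a folder of PDFs could be scripted roughly as below. This is a sketch only: the Ghostscript flags are copied unchanged from the command above, with no claim that they yield valid PDF/A or files that Sipi accepts, and the function names and directory layout are assumptions.

```python
import subprocess
from pathlib import Path

# Flags copied verbatim from the command line quoted in this thread.
GS_FLAGS = [
    "-dPDFA",
    "-dBATCH",
    "-dNOPAUSE",
    "-sColorConversionStrategy=UseDeviceIndependentColor",
    "-sProcessColorModel=DeviceCMYK",
    "-sDEVICE=pdfwrite",
    "-dPDFACompatibilityPolicy=2",
]


def gs_command(src, dst):
    """Return the Ghostscript argv that converts src into dst."""
    return ["gs", *GS_FLAGS, f"-sOutputFile={dst}", str(src)]


def convert_all(in_dir, out_dir):
    """Convert every *.pdf in in_dir; return the names of files that failed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(in_dir).glob("*.pdf")):
        result = subprocess.run(gs_command(src, out / src.name), capture_output=True)
        if result.returncode != 0:
            failed.append(src.name)  # keep going, report failures at the end
    return failed
```

Collecting the failures instead of stopping at the first one makes it easier to see which of the roughly 1000 files Ghostscript cannot handle.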

@subotic
Collaborator

subotic commented Jan 23, 2020

I think this is the wrong repository for this issue. The docker image of sipi you use contains a completely different set of scripts that are custom to Knora.

The upload script that you are using is this one: https://github.com/dasch-swiss/knora-api/blob/develop/sipi/scripts/upload.lua.

@subotic transferred this issue from dasch-swiss/sipi Jan 23, 2020
@subotic
Collaborator

subotic commented Jan 23, 2020

I've moved the issue to the knora-api repo.

@subotic assigned benjamingeer and unassigned lrosenth Jan 23, 2020
@subotic added the "enhancement" label (improve existing code or new feature) Jan 23, 2020
@subotic changed the title from "upload PDF file rules" to "Allow upload of PDFs" Jan 23, 2020
@subotic changed the title from "Allow upload of PDFs" to "Allow upload of PDFs to SIPI" Jan 23, 2020
@benjamingeer

> I've moved the issue to the knora-api repo.

@subotic Why? I thought this issue was about getting Sipi to validate PDF files.

@subotic
Collaborator

subotic commented Jan 23, 2020

Isn't it about a problem with uploading PDFs? Since this logic is implemented in a lua script, it could probably be solved in this repo.

@benjamingeer

Which logic? The Lua script just calls Sipi’s C++ functions to process the file. If Sipi can’t parse a particular PDF file, there’s nothing the Lua script can do about it.

@lrosenth
Contributor

lrosenth commented Jan 24, 2020

OK, I had a look at it. It requires at least some major changes in

  • knora-api/sipi/scripts/upload.lua
  • knora-api/sipi/scripts/store.lua

Currently all uploaded files are converted to a Sipi image object and then to a JPEG2000 file. In order to cope with PDFs (and also .ttl, .xml, etc.) we have to build in a switch which checks the MIME type and either

  • performs the conversion as before if the input MIME type is TIFF, PNG, JPEG, J2K, etc., or
  • otherwise just copies the file to the proper place (tmp and then images/project/...)
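
The switch described above can be illustrated as follows. This is a Python sketch for illustration only (the real scripts are Lua), and it is not the code in upload.lua: the MIME type set and the action names are assumptions.

```python
# Assumed set of MIME types that go through the image pipeline.
IMAGE_MIME_TYPES = {"image/tiff", "image/png", "image/jpeg", "image/jp2"}


def choose_action(mime_type):
    """Decide what to do with an uploaded file based on its MIME type."""
    if mime_type in IMAGE_MIME_TYPES:
        return "convert_to_jpeg2000"  # existing image conversion path
    return "copy_as_is"  # PDFs and other non-image files are stored unchanged
```

With this rule, `choose_action("application/pdf")` yields `"copy_as_is"`, so a PDF would never reach the image conversion step.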

As a reminder, PDFs in the images folder can be served

  1. as a PDF download (the complete PDF) if the URL looks like this: http://sipi-server/images/file.pdf
  2. as an image, using the full IIIF URL with the page number added: http://sipi-server/images/file.pdf@pagenum/full/full/0/default.jpg where pagenum is an integer (0, 1, 2)
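
The two URL forms can be built like this. It is a small Python sketch that follows the patterns quoted above; the function names are hypothetical, and `sipi-server` is the placeholder host used in this comment.

```python
def pdf_download_url(server, filename):
    """URL that serves the complete PDF as a download."""
    return f"http://{server}/images/{filename}"


def pdf_page_image_url(server, filename, pagenum):
    """IIIF-style URL that serves one page of the PDF as a JPEG image."""
    return f"http://{server}/images/{filename}@{pagenum}/full/full/0/default.jpg"
```

For example, `pdf_page_image_url("sipi-server", "file.pdf", 1)` gives `http://sipi-server/images/file.pdf@1/full/full/0/default.jpg`.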

@benjamingeer

> Currently all upload files are being converted to a sipi image object and then converted to a JPEG2000 file.

No, upload.lua checks whether the file is an image. If not, it doesn't do any conversion:

https://github.com/dasch-swiss/knora-api/blob/develop/sipi/scripts/upload.lua#L116

@benjamingeer

Support for uploading PDF files was added in #1206.

@benjamingeer

These changes were also announced on Discuss DaSCH:

https://discuss.dasch.swiss/t/support-non-image-files-in-knora-api-v2/33/2

@gfoo
Author

gfoo commented Jan 24, 2020

@benjamingeer did you try to reproduce the bug? I found it on 10 Dec 2019:

curl -s -X POST -F filename=@Programme_Colloque_Bertrand.pdf http://localhost:1024/upload?token=xxxx

@benjamingeer

@gfoo I haven't tried to reproduce it, but if it works with one PDF file and not with another PDF file, I think the problem has to be in the C++ code in Sipi.

@mrivoal

mrivoal commented Feb 3, 2020

@lrosenth (or whoever this concerns), any news regarding this issue? We still have a lot to fix before we can release Lumieres.Lausanne, but if we can't import PDFs, we can't release anything.

@benjamingeer

I've reproduced this and can confirm that it depends on the content of the PDF file. Moved to dasch-swiss/sipi#319.

@gfoo
Author

gfoo commented Feb 4, 2020

@benjamingeer thanks for the test
