Pre-receive Hook: Wait for file - software duplex #440

raumneun · 2021-01-25T08:47:41Z

Hi there,
this is more a question than an issue - maybe a feature request. In any case, thank you for the great work with paperless-ng!
I was using a self-cooked minimalistic paperless web-app which used folders (because it was easier to implement) successfully for quite some time now but decided to switch to paperless-ng because it is... well.. better in just about every way :)
With one exception to my usual workflow:
Our scanner has an ADF, but unfortunately scans only single-sided. So in my web app, I used a Python script that combines two PDFs, one with the fronts (odd page numbers) and one with the backs (even page numbers), by interleaving the pages.
Obviously, this would not work with paperless-ng out of the box, as the consumer would consume the first PDF before the second arrived.
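
For illustration, the interleaving step could look roughly like the sketch below. It is not the original script: the function name `interleave_pages` and the `backs_reversed` flag are made up for this example, and it works on any sequence of page objects. With a PDF library such as pypdf, the sequences would be the `pages` of two `PdfReader` objects and the result would be written out via `PdfWriter.add_page`.

```python
from itertools import chain
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def interleave_pages(fronts: Sequence[T], backs: Sequence[T],
                     backs_reversed: bool = False) -> List[T]:
    """Return pages in reading order: front 1, back 1, front 2, back 2, ...

    backs_reversed=True is an assumption for the case where the stack is
    flipped and fed again, so the last back side gets scanned first.
    """
    if len(fronts) != len(backs):
        raise ValueError("front and back scans must have the same page count")
    ordered_backs = list(reversed(backs)) if backs_reversed else list(backs)
    # zip pairs each front with its back; chain flattens the pairs.
    return list(chain.from_iterable(zip(fronts, ordered_backs)))


if __name__ == "__main__":
    print(interleave_pages(["p1", "p3", "p5"], ["p2", "p4", "p6"]))
    # ['p1', 'p2', 'p3', 'p4', 'p5', 'p6']
```

Raising on a page-count mismatch is a deliberate choice here: a missing back page would otherwise silently shift every later page.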

So I was wondering whether it would be possible to write a pre-consumption script that, when a file whose name contains the word "front" appears in the consumption directory, waits in a loop (with a timeout of some sort) until the second file arrives, then combines the two and exits, so that consumption carries on as usual.
To prevent the second file from being consumed by a different worker, the pre-consumption script should also check for the word "back" and ignore that file altogether.
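
A minimal sketch of just the waiting step, assuming a hypothetical naming rule in which `scan_front.pdf` pairs with `scan_back.pdf`. The names `partner_of` and `wait_for_partner` and the default timeout are invented for this example; how the script is invoked, and whether it can stop consumption, is outside this sketch.

```python
import time
from pathlib import Path
from typing import Optional


def partner_of(front: Path) -> Path:
    """Hypothetical naming rule: 'scan_front.pdf' pairs with 'scan_back.pdf'."""
    return front.with_name(front.name.replace("front", "back"))


def wait_for_partner(front: Path, timeout: float = 300.0,
                     poll: float = 1.0) -> Optional[Path]:
    """Poll until the matching 'back' file exists; return None on timeout."""
    back = partner_of(front)
    # monotonic() is unaffected by wall-clock adjustments during the wait.
    deadline = time.monotonic() + timeout
    while True:
        if back.exists():
            return back
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

Note that the existence check alone does not tell you the back file has been fully written; the caller would still need to wait for the file to be complete before merging.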

Do you see any problem with that approach? Or even better: do you have a better idea to solve this elegantly? Obviously, I could do all of this outside of paperless-ng with another service, but I would very much like to keep it all in one place, as this is much easier to back up and transfer to other systems than having yet another service running.

thanks!

Cheers
Max

jonaswinkler · 2021-01-25T09:37:53Z

Hello! Comments below.

> Hi there,
> this is more a question than an issue - maybe a feature request. In any case, thank you for the great work with paperless-ng!
> I was using a self-cooked minimalistic paperless web-app which used folders (because it was easier to implement) successfully for quite some time now but decided to switch to paperless-ng because it is... well.. better in just about every way :)

Thank you :)

> With one exception to my usual workflow:
> Our scanner has an ADF, but unfortunately scans only single-sided. So in my web-app, I used a python script that combines two PDFs from the front (odd page numbers) and backs (even page numbers) by interleaving the pages.

I'd check the software of the scanner. I could imagine that it has some pseudo-duplex option (scan, flip the pages, scan again, and the software does the splicing).

> Obviously, this would not work with paperless-ng out of the box as the consumer would consume the first pdf before the second arrived.

Correct.

> So I was wondering whether it would be possible to write a pre-consumption script that, when a file whose name contains the word "front" appears in the consumption directory, waits in a loop (with a timeout of some sort) until the second file arrives, then combines the two and exits, so that consumption carries on as usual.
> To prevent the second file from being consumed by a different worker, the pre-consumption script should also check for the word "back" and ignore that file altogether.
>
> Do you see any problem with that approach?

The only obvious issue I see is that the pre- and post-consumption scripts cannot cancel the consumption. Paperless does not evaluate the exit code of these scripts and continues anyway. However, this is an easy fix. Apart from that, this might be possible, but it won't be very elegant. You will also get lots of errors in the logs, because you'll regularly have to cancel consumption for either the front or the back documents.

> Or even better: do you have a better idea to solve this elegantly? Obviously, I could do all of this outside of paperless-ng with another service, but I would very much like to keep it all in one place, as this is much easier to back up and transfer to other systems than having yet another service running.

I'd probably search for scanner software that does this for you. If it's a network scanner, well, that's not an option.
Doing this in paperless is pretty difficult. The general assumption for the document consumption pipeline is that one file from any source will result in exactly one document in paperless. Trying to implement any sort of file merging or splitting breaks that assumption, and would require me to make lots of fundamental changes.
I was actually thinking about "Split single PDF into multiple ones using a delimiter page" (#317) when I read this. My comment over there also describes some issues related to this. The use case in that issue is different, but I'd propose the same solution as over there. The idea is as follows:
1. Make a script that watches a folder just as paperless does.
2. That script may implement any kind of custom file merging/splitting logic you want.
3. The script would put processed files into the consumption folder of paperless (or use the REST API), and simply copy files that don't need any merging / splitting.
4. (Optional) Make this a reusable Docker image that can be easily integrated into any Docker Compose setup (see the referenced issue).
This is an additional service, but I think it might be a good solution nonetheless.
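
As a rough sketch of steps 1 to 3, a single pass over the watched folder might look like the following. Everything specific here is an assumption for illustration: the `_front` / `_back` file name suffixes, the function name `process_once`, and the `merge` callback (which would hold the actual PDF interleaving). It moves pass-through files rather than copying them, and it does not check whether a file is still being written.

```python
import shutil
from pathlib import Path
from typing import Callable, List

FRONT, BACK = "_front", "_back"  # hypothetical file name suffixes


def process_once(watch_dir: Path, consume_dir: Path,
                 merge: Callable[[Path, Path, Path], None]) -> List[Path]:
    """Handle every file currently in watch_dir; return the files delivered."""
    delivered = []
    for path in sorted(watch_dir.iterdir()):
        if not path.is_file() or path.stem.endswith(BACK):
            continue  # backs are handled together with their front
        if path.stem.endswith(FRONT):
            base = path.stem[: -len(FRONT)]
            back = path.with_name(base + BACK + path.suffix)
            if not back.exists():
                continue  # pair incomplete, try again on the next pass
            target = consume_dir / (base + path.suffix)
            merge(path, back, target)  # custom merging logic goes here
            path.unlink()
            back.unlink()
        else:
            target = consume_dir / path.name
            shutil.move(str(path), str(target))  # no merging needed
        delivered.append(target)
    return delivered
```

A service would call `process_once` in a loop (or from a file system event handler). A back file whose front never arrives stays in the watched folder, so some cleanup or timeout policy would still be needed.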


Philmo67 · 2021-01-25T10:16:21Z

Maybe try NAPS2?

raumneun · 2021-01-26T07:22:09Z

Thanks for your comments.
I was not aware that the consumption scripts do not cancel consumption when they return a non-zero exit code. Maybe this would be a good idea anyway?
I do agree that generating lots of errors in the log is not very elegant, and also that this should not be a core feature of paperless-ng: as you state in the documentation, it expects files to work with regardless of where they come from and should not handle preprocessing of any sort.

I am going for a completely headless design (well, at least for the scanning part), so using the scanner software is not really an option. Imho, this whole thing works only if digitizing the documents is as easy and fast as possible - otherwise I just don't do it ;)
I will go the route you described and create a Python watchdog script for another folder (say, consume-duplex) that does the magic of combining the PDFs and moves the result to the consumption directory. I'll try to build a Docker image from that which could be integrated into the docker-compose setup for paperless-ng. If this is successful, I'll publish it on Docker Hub so everyone interested may use it.

jonaswinkler · 2021-01-26T12:08:13Z

You might want to look at src/documents/management/commands/document_consumer.py for details on how to watch a folder for changes with inotify / watchdog.

With inotify, you can listen to CLOSE_WRITE events, but that's not available on every file system.
With watchdog, you get a notification the instant a file is created, and you'll have to wait until the file is complete by periodically checking its size / time modified.
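
The "wait until the file is complete" step could be sketched like this. It is a heuristic under stated assumptions, not the code from document_consumer.py: the function name `wait_until_stable` and all default values are invented for this example, and a file is treated as complete once its size and modification time stop changing across a few consecutive polls.

```python
import os
import time


def wait_until_stable(path: str, interval: float = 1.0,
                      stable_checks: int = 2, timeout: float = 60.0) -> bool:
    """Return True once size and mtime are unchanged for `stable_checks` polls."""
    deadline = time.monotonic() + timeout
    last = None
    unchanged = 0
    while time.monotonic() < deadline:
        try:
            st = os.stat(path)
            current = (st.st_size, st.st_mtime_ns)
        except FileNotFoundError:
            current = None  # file vanished or not there yet; keep waiting
        if current is not None and current == last:
            unchanged += 1
            if unchanged >= stable_checks:
                return True
        else:
            unchanged = 0  # something changed, start counting again
        last = current
        time.sleep(interval)
    return False
```

A scanner that pauses longer than `interval * stable_checks` in the middle of an upload would fool this check, so the values need tuning for the actual device.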

jonaswinkler closed this as completed Jan 27, 2021

Repository owner locked and limited conversation to collaborators Jan 27, 2021

This issue was moved to a discussion.
