[Feature Request] Split and merge documents #335

ghost · 2021-01-13T18:15:49Z

As a user, I would like to merge different scans into one document.

Example: I scan the front and back of an ID card, and the two sides upload as separate documents into paperless. I would like to be able to merge the 2 documents into one.

jonaswinkler · 2021-01-13T18:37:26Z

Hi, welcome to GitHub!

As of right now, I don't have any plans to support editing PDF documents. If you really need that, it might be worth giving Papermerge a shot, they do have some editing tools over there. Although I don't know if they support document merging specifically.

This would also be a very big change, since

- I need a new UI to select and reorder documents for merging
- I need support from the web server to do the actual editing
- Paperless stores exactly one original file for each document and never modifies that. If we want to merge two documents, we have to modify the original files. This is something I want to avoid.
- What if the user wants to merge an Office .docx and a PDF document? We could simply disable that, but since all documents are displayed as PDF documents in paperless, some users might get confused about why certain documents can be merged, and some cannot.

Not going to happen (anytime soon).

shamoon · 2021-01-13T19:09:18Z

Agree, feels kinda out of the scope of this app IMHO, and so many tools can do this, even native PDF / image viewers...

Philmo67 · 2021-01-13T21:51:06Z

Just curious: what tools are you using for merging/splitting/rotating PDF documents?
I tried PDFsam Basic, pdfarranger, and even NAPS2 directly after scanning, but I don't find these tools user-friendly enough for these tasks.

jonaswinkler · 2021-01-13T22:08:14Z

PDFArranger gets the job done, and has everything I need. Apart from that, I usually use gscan2pdf for scanning, and discard unwanted pages with that, or scan multiple pages into a single document. No further editing needed, usually.

Matthias84 · 2021-01-17T20:46:39Z

Same request here. I feed scanned TIFFs from an ADF document scanner. Of course I could do manual preprocessing outside of paperless, but for non-tech users it might be interesting to get a preprocessing inbox where you could merge, remove, reorder pages, turn pages, ... before they are fully processed within paperless-ng? But yes, I see the point that it's a lot of work to support it for all the different (multi-page) formats ... 🤔

Zocker1999NET · 2021-01-24T13:59:24Z

In #426 I had a similar idea (I did not find this existing issue at the time) about how to implement this:

How could this be implemented on the UI:

Select the documents you want:

Click on a "Combine" button

What happens in the background:

Combine the original documents (not the archived versions!), for example using ImageMagick: convert "$@" pdf:-

Delete all old entries of the selected documents

Reprocess the new document as it was simply placed into the consume directory

Known issues with this implementation:

The original source files possibly cannot be handled currently, so they may be lost. Possible workaround: before combining the originals into a PDF document, pack them together into a zip/tar archive, store that as the "original document", and enable paperless to work with zip/tar archives if possible.

Will most likely not support formats that ImageMagick does not support, like Office documents; however, it should be able to combine JPEGs/PNGs/PDFs/TIFFs. Possible workaround: before combining them with ImageMagick, convert each file that ImageMagick does not support to a PDF, reusing the existing conversion strategies.

The TAR/ZIP approach would prevent paperless-ng from losing the old original documents while, as I see it, making this "in theory relatively simple" to support for all existing documents (I haven't seen the code of this project yet).
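
The zip workaround could be sketched roughly as follows. This is only an illustration using the Python standard library: the function names `bundle_originals` and `unbundle_originals` are hypothetical and nothing like them is known to exist in paperless. It shows that the source files can be packed into one archive, keeping their merge order, and recovered byte for byte.

```python
import io
import zipfile


def bundle_originals(files: list[tuple[str, bytes]]) -> bytes:
    """Pack (filename, content) pairs into one in-memory zip archive.

    Entries are prefixed with their position so that the merge order
    survives and duplicate filenames cannot collide.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for position, (name, content) in enumerate(files, start=1):
            archive.writestr(f"{position:03d}_{name}", content)
    return buffer.getvalue()


def unbundle_originals(bundle: bytes) -> list[tuple[str, bytes]]:
    """Return the stored (archive name, content) pairs in merge order."""
    with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
        return [(name, archive.read(name)) for name in sorted(archive.namelist())]
```

The resulting bytes could then be stored as the single "original" of the merged document, which keeps the one-file-per-document assumption intact.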

jhass · 2021-02-09T15:44:17Z

I wonder where the need for merging generally comes from.

For me it's because my printer's scan to mail function can't put more than 4-5 pages into a single document. I wonder if it's comparable for most of you. If so, we might not need (much) UI for this or even change much about the one document == one file paradigm. Instead we could allow defining rules for merging at consumption time.

For mail this is "easy", we could have a mail rule to merge all attachments in a mail into one document for example (convert to PDF if needed, sort by name, then merge PDFs).

For manual upload it could be a checkbox or a separate button "upload selected files as one".

API wise that could be a new endpoint, "upload multiple as one", to be integrated into any frontends that want to support this.

The most tricky bit is for the folder drop, since there's no rule system for that yet as far as I noticed. One could imagine something like the mail filtering system though, based on filenames: match all files with a pattern, do an action on that set of files. The only action at first would be merge of course.

So, to summarize I wonder if "merge at consumption" solves the needs of most people here already.

Zocker1999NET · 2021-02-09T16:37:36Z

@jhass Your ideas on how to implement this feature seem great.

To answer your question, my printer may have such a feature, but because I'm currently scanning over 600 documents to store even older documents on the computer, I decided not to use such features: I wanted to scan this many pages using a feed scanner without thinking about them in the first place. I wanted to do the sorting/merging of documents only digitally, but create an index over my offline documents by sorting them by an incrementing scan id they get after scanning. I think this is much easier in my case.

jhass · 2021-02-09T17:05:59Z

Not too different a story here, just maybe a little less to go through, and I'm also throwing out stuff where I feel fine retaining only the digital copy, so I'm going by the recommended ASN system :) I'm just running paperless-ng on a server, so using the mail function of my feed scanner rather than some scan tool on a PC is easiest. Now my pain is that I have multiple documents for the same ASN 😅, otherwise I'm not too worried about it; it's indeed easy to find the "other parts" by date or title.

jonaswinkler · 2021-02-09T17:57:46Z

Chiming in here and sharing some further ideas and comments. I'm pretty busy right now and don't have all that much time except for critical stuff.

> For mail this is "easy", we could have a mail rule to merge all attachments in a mail into one document for example (convert to PDF if needed, sort by name, then merge PDFs).

This would certainly be possible. However, does your scanner actually support sending multiple scanned files in one mail? Also, I'd like to have the merging logic available to all users, not just users who use the mail functionality.

> The most tricky bit is for the folder drop, since there's no rule system for that yet as far as I noticed. One could imagine something like the mail filtering system though, based on filenames: match all files with a pattern, do an action on that set of files. The only action at first would be merge of course.

This is in fact the most tricky part. Once paperless detects new files in the consumption folder, it sends them to the task queue for processing immediately. How does paperless detect when the last document of a batch has arrived? I don't think there's a good solution here.

If we do this, I'd like to have the merging functionality available to everyone, and the consumption folder is still the most commonly used way to upload documents.

Therefore, some ideas on what could actually work for everyone, given the current architecture.

Paperless will always add one file as one document, and one document has exactly one original file (+ optional archived file). That will not change. The implications and required changes are just too big.
We can add merging support to the UI in a non-intrusive way:
- On the document detail page, have an additional action in the top right corner "Edit / Add to edit / Merge with others".
- This would add the document to an "Edit tool / Merge tool", a link to this tool would appear below "Documents" in the sidebar.
- More documents can be added to that tool in the same way.
- More documents could also be added by selection on the list (as pictured above)
- The merge tool would allow reordering and saving the combined document as a new document.
- The merge tool would also have an option to delete the selected document after merging was successful.
- The merge tool only appears when documents are selected for merging.
We need API support for that.
- A new endpoint "/api/merge" that accepts a "merge plan" on POST, which is essentially the ordered list of documents.
- We already have python dependencies in paperless that deal with pdf editing.
- Since paperless also accepts non-PDF documents (images, text, office, ...): Use the original documents if PDF, or the archived versions when original is not a PDF (issue warning). Issue error when no PDF is available.
- API combines original documents, invokes OCRmyPDF on the merged document again to get the combined archived version, and saves all that as a new document.
- API optionally deletes all documents in the merge plan.
- Then there's also metadata. We could either redo metadata detection on the merged document just as we do with new documents, or simply take the metadata from the first document.
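
The rule for choosing which file of each document goes into the merge (original if it is a PDF, archived version with a warning otherwise, error if neither exists) could be sketched like this. The `Document` class, its field names, `MergeError` and `resolve_sources` are hypothetical stand-ins for illustration, not the actual paperless models or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    """Hypothetical stand-in for a paperless document record."""
    id: int
    original_path: str
    mime_type: str
    archive_path: Optional[str] = None  # archived PDF version, if one exists


class MergeError(Exception):
    """Raised when a document in the merge plan has no usable PDF."""


def resolve_sources(plan: list[Document]) -> tuple[list[str], list[str]]:
    """Pick one PDF per document, in plan order; return (paths, warnings)."""
    paths: list[str] = []
    warnings: list[str] = []
    for doc in plan:
        if doc.mime_type == "application/pdf":
            # The original is already a PDF, so it can be merged directly.
            paths.append(doc.original_path)
        elif doc.archive_path is not None:
            # Fall back to the archived PDF and tell the user about it.
            paths.append(doc.archive_path)
            warnings.append(
                f"Document {doc.id}: original is not a PDF, using archived version"
            )
        else:
            raise MergeError(f"Document {doc.id}: no PDF version available")
    return paths, warnings
```

The returned paths would then be handed to whatever performs the actual PDF concatenation.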

After that works, we could also think about adding support for selecting individual pages from the selected documents. (This is something I'd find useful as well, since I've got lots of documents with empty pages that my scanner detected as not empty)

The most critical part is making the backend work, so that should be the focus.

How does that sound? This all is also very isolated functionality and can be added without affecting anything else. If someone wants to take a stab at that, I can give some more detailed instructions on how to do it.

jhass · 2021-02-09T21:22:14Z

> However, does your scanner actually support sending multiple scanned files in one mail? Also, I'd like to have the merging logic available to all users, not just users who use the mail functionality.

No, it does not. It does the opposite: splitting the document into multiple PDFs inside one mail (I think it's one mail, I actually never checked 😅) if it gets too big.

Yes, a merging tool that just creates a new document sounds like a great idea 👍 :)

As a third alternative, I think something like a meta-document, which groups several documents in a defined order while keeping the members untouched as their own entries, could also already help a lot of use cases and might be a little bit less effort.

jonaswinkler · 2021-02-09T21:40:32Z

> and might be a little bit less effort.

We'd still need some UI to define these meta documents, which is about the same as the one for a merge tool. We also need support in the back end for that, documents now have ordered child documents? Also, many components of paperless have to take this new data structure into account (search index should not return documents that are part of a meta document, import+export, metadata matching should use the concatenated content of all documents in a meta document, ...)

Compared to building a feature that uses already existing data structures and abides by already defined contracts (and in doing so is compatible with all existing features), this is actually a lot more work.

And then there's also the test suite. Changing features requires changing associated test cases. Adding a new isolated feature just requires new test cases for that feature.

jhass · 2021-02-10T07:20:56Z

I didn't mean it to be that invasive, child documents could appear normally still and meta documents could not appear in full text search etc, child documents would "just" provide a quick link to go to the meta document they're part of.

I felt building the merge background task could potentially prove to be quite the rabbit hole 😅

jonaswinkler · 2021-02-10T09:56:02Z

> I didn't mean it to be that invasive, child documents could appear normally still and meta documents could not appear in full text search etc, child documents would "just" provide a quick link to go to the meta document they're part of.

I want to do things properly :)

> I felt building the merge background task could potentially prove to be quite the rabbit hole 😅

Actually not, use pikepdf to produce a new pdf document, submit that to the consumer just as we do with other new documents, and optionally delete some documents when done. The consumer will take care of the rest.

The actual merging and editing is very straightforward (https://pikepdf.readthedocs.io/en/latest/topics/pages.html).

jonaswinkler · 2021-02-12T19:14:17Z

I just put a new API endpoint together, and the actual merging process on the server side is straightforward. Reordering documents, keeping only selected pages, that's all simple. The merged document will appear as a new document to paperless, with notifications and all that.

Now I need to get this implemented properly and figure out how the UI is supposed to work.

shamoon · 2021-02-12T19:26:22Z

Very cool! Didn’t even realize you were actively working on this. Let me know if / when / where I can help, have some UI ideas

jonaswinkler · 2021-02-12T20:09:27Z

Well, I don't exactly communicate what I'm actively working on, that's true.

If you want to work out a UI for this, go for it.

I've also got some UI ideas, not sure if they align with yours, here goes.

- Add some button to the document detail page to the top right ("Mark for merge" / "Add to merge" or similar)
- Add a button to the bulk editor that does the same, but with multiple documents
- When documents are selected for merge, a new sidebar link "merge tool" or similar will appear, which in turn opens the merge editor. I feel that when selecting multiple documents for merge, that should open immediately.
- The merge editor has two columns. The left has some sort of list of the selected documents, and changing the order should be possible with drag and drop. Probably some cards with the thumbnail, but not as tall as the large cards.
- The cards may also have some field to select specific pages (maybe a text field that accepts a string such as "1,3-6,8")
- The right column will show a preview. Generating the merged PDF is actually pretty fast, the download will probably take more time than the merge itself. This preview could be reloaded in real time (debounced).
- And well, some buttons to start the process.
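
A page-selection string such as "1,3-6,8" could be expanded along these lines. This is a minimal sketch; the function name `parse_page_selection` and the choices it makes (an empty string selects every page, out-of-range pages are rejected) are assumptions rather than anything decided in this thread.

```python
def parse_page_selection(spec: str, page_count: int) -> list[int]:
    """Expand a string such as "1,3-6,8" into 1-based page numbers.

    Order is preserved as written, and an empty string selects every page.
    Raises ValueError for malformed input or pages outside 1..page_count.
    """
    if not spec.strip():
        return list(range(1, page_count + 1))
    pages: list[int] = []
    for part in spec.split(","):
        # "3-6" -> ("3", "-", "6"); a single page "8" -> ("8", "", "")
        first, sep, last = part.strip().partition("-")
        start = int(first)
        end = int(last) if sep else start
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"invalid page range: {part.strip()!r}")
        pages.extend(range(start, end + 1))
    return pages
```

For an eight-page document, `parse_page_selection("1,3-6,8", 8)` yields pages 1, 3, 4, 5, 6 and 8.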

These are just ideas. If you got better ideas, go for it, while keeping the following in mind:

- I'd like to keep the UI simple as a whole. Therefore, be sparse with adding buttons that are always visible, even when the tool is not used.
- I can't provide images for individual pages of a document. Just the thumbnail and the PDF itself.

The API will essentially accept an ordered list of document ids, and for each document id an optional page range. I don't have the details down yet. As long as the UI is able to provide that, we're good. It will be possible to specify the same document twice, in case you want to add pages from a document somewhere in the middle of another document.

The API will also have an option to download the resulting document as a preview without actually adding it to paperless.
And some flags to optionally delete source documents on success.
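
Since the details of the request were not settled at this point, the following is only a guess at how such a merge plan could be represented and flattened into the page order of the resulting document. `MergeStep` and `expand_plan` are hypothetical names, and the page counts are passed in as a plain dictionary for the sake of the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergeStep:
    """One entry of a hypothetical merge plan."""
    document_id: int
    pages: Optional[list[int]] = None  # 1-based page numbers; None means all pages


def expand_plan(steps: list[MergeStep], page_counts: dict[int, int]) -> list[tuple[int, int]]:
    """Flatten a merge plan into the (document_id, page) sequence of the result."""
    output: list[tuple[int, int]] = []
    for step in steps:
        total = page_counts[step.document_id]
        pages = step.pages if step.pages is not None else list(range(1, total + 1))
        for page in pages:
            if not 1 <= page <= total:
                raise ValueError(f"document {step.document_id} has no page {page}")
            output.append((step.document_id, page))
    return output


# Document 8 is inserted between page 2 and page 3 of document 7,
# which is why document 7 appears twice in the plan.
plan = [MergeStep(7, [1, 2]), MergeStep(8), MergeStep(7, [3])]
order = expand_plan(plan, {7: 3, 8: 2})
```

Because each step names its own document, listing the same document twice needs no special handling.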

shamoon · 2021-02-12T20:17:39Z

Yea, that's pretty similar to what I imagined. And I agree: as this will be a not-every-day and probably even not-every-user kinda tool, the button shouldn't be too prominent / take up too much space; maybe inside a menu or something. And yep, exactly what I was thinking about getting there from document detail or bulk edit, and it opens a modal with the UI.

As for the actual UI, definitely agree on visual drag + drop; preview does sound cool too. And then when the user is done and hits "Save", does it create a new document? And what about metadata? I'm sure we'll have to figure out lots of stuff once we dig in. Mobile might be a challenge, etc.

jonaswinkler · 2021-02-12T20:26:43Z

> and it opens a modal with the UI.

Not necessarily a modal, I think this should be a full page view. You may want to go from the merge tool to the list again, and add more documents.

> And then when the user is done and hits "Save", does it create a new document?

It will create a new document. Maybe a checkbox that will cause the source files to be removed on success.

> And what about metadata?

Options for either keeping info from the first document (which should be most representative for the resulting file), or run it through the matching algorithms again.

> Mobile might be a challenge, etc.

It's okay to have certain functionality not available on mobile.

shamoon · 2021-02-12T20:40:42Z

Hmm, just now this makes me think about whether it will be frustrating if the actual merge UI has no way to add documents, like a “picker” of some kind. Like if you added 2 docs but realized you need a third, you’d have to go back to the list, find the other, highlight it and add it. A little odd. Then again maybe people will mostly be merging two docs so it’s no big deal?

Just kinda asking / thinking out loud. I personally haven’t needed this so I’m trying to put myself in the mindset of a user of this

jonaswinkler · 2021-02-12T20:46:12Z

> Hmm, just now this makes me think about whether it will be frustrating if the actual merge UI has no way to add documents, like a “picker” of some kind.

The picker would essentially be a list view (with filtering) as well, or something similar, and we already have that.

shamoon · 2021-02-12T21:02:12Z

> The picker would essentially be a list view (with filtering) as well, or something similar, and we already have that.

Yea.

Should be a fun challenge, LMK when I can start playing with it.

ffleischer · 2021-03-11T21:26:11Z

While you're currently working on this topic, would it make sense to directly consider some kind of Staple functionality?

Use case:
I've got a small mobile document scanner (Doxie GO) which can scan multiple pages but only one-sided. So I need to scan the other sides separately. Their Windows companion app has this Staple functionality.

It generates from 2 documents (D1 & D2) with multiple pages (D1P1, D1P2, .... & D2P1, D2P2, ....) a new document with (D1P1, D2P1, D1P2, D2P2,.....). This mitigates the single side scan a bit.
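
The staple order described here is a plain interleave of the two page sequences. A minimal sketch, assuming pages are represented as list items and that leftover pages of the longer document are simply appended (the comment does not say what should happen when the page counts differ); the name `staple` is only borrowed from the feature description.

```python
from itertools import chain, zip_longest

_MISSING = object()  # sentinel for positions where one document has run out of pages


def staple(fronts: list, backs: list) -> list:
    """Interleave two page sequences: D1P1, D2P1, D1P2, D2P2, ...

    If one document is longer, its leftover pages are appended at the end.
    """
    pairs = zip_longest(fronts, backs, fillvalue=_MISSING)
    return [page for page in chain.from_iterable(pairs) if page is not _MISSING]
```

In terms of the proposed merge API this would amount to a merge plan that alternates single pages from the two source documents.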

jonaswinkler · 2021-04-02T14:42:43Z

> What browser are you using btw?

Chromium.

> When the separator is the first or last item it should remove it. I can't reproduce the bug you're seeing where it's sending [], are there specific steps to recreate this? That's what lines 31-35 in split-merge.service.ts should prevent

I manually dragged a separator to the end after splitting.

> I'll see about the document chooser using the view, it's pretty basic right now

I think it's okay that way.

shamoon · 2021-04-02T14:58:15Z

Ok these small things should be addressed, I'll tackle the rest when I have some more time later and reply directly to your code review. Thanks

rklasen · 2021-05-11T18:23:02Z

Any updates on this? This seems like a fantastic feature.

jonaswinkler · 2021-05-11T21:13:22Z

I just need to find some time to address a few remaining issues.

steffe · 2021-06-30T16:07:21Z

First of all, a big thank you, amazing software! I had been searching for an easy-to-use paperless tool for a long time, and now I have found it.
I'm eagerly awaiting this feature, especially the split part. If you need some testers, let me know. Best, Stefan

GoldenBatt88 · 2021-08-20T14:15:24Z

New to Paperless, just saw this project last week, and I'm very happy with this piece of software, it's great! Just picked up a cheap scanner today, but I also ran into the problem that it makes 2 PDFs for a double-sided letter, with no option to merge on the scanner itself... Glad to see there is already a solution being worked on. If you need more people to test, I'm also willing to test some documents here.

mhupfauer · 2021-09-08T16:31:18Z

Is there an approximate ETA on the availability of this feature on the stable main branch?

nikor30 · 2021-10-26T13:29:54Z

Hi, yes, also highly wanted here :-)

olvier · 2021-11-09T23:20:08Z

+1 :)

eSportler89 · 2021-12-08T13:42:10Z

+1 :)

rutgr · 2021-12-08T18:37:23Z

+1 :)

cardinalfan1 · 2021-12-11T19:21:10Z

I’d also like to see this functionality. Especially the ability to merge a pdf with another one in reverse-alternate order. (For those of us who have non-duplex scanners who end up with alternate order with the second pdf being reversed). Right now doing it with pdfsam but would be great to have this built in.
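
The reverse-alternate order can be expressed as reversing the second scan and then interleaving. A small sketch, assuming both scans contain the same number of pages; the function name is made up for this example.

```python
def collate_reverse_alternate(fronts: list, backs_reversed: list) -> list:
    """Merge front pages with back pages that were scanned in reverse order.

    fronts holds pages 1, 3, 5, ...; backs_reversed holds the back sides as
    they come out after flipping the whole stack, i.e. last back side first.
    """
    if len(fronts) != len(backs_reversed):
        raise ValueError("both scans must contain the same number of pages")
    merged = []
    for front, back in zip(fronts, reversed(backs_reversed)):
        merged.extend((front, back))
    return merged
```

For a three-sheet letter, fronts `[1, 3, 5]` and backs `[6, 4, 2]` collate to pages 1 through 6 in reading order.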

MBadberg · 2021-12-18T23:05:14Z

Hm, lol. The first function I missed after uploading a few documents: merging :-D

So here I will give a +1

lucasmenno · 2021-12-23T23:59:48Z

+1 also from my end :)
Got a WiFi multifunction device. Scanning with bash directly to the input folder.
Merging documents inside paperless-ng would be awesome :3

buttercheetah · 2022-01-04T06:22:57Z

+1 In my opinion, it's the only thing missing from this program.

Duval23 · 2022-01-22T08:04:06Z

I do not urgently need this feature, but I can give a use case where it comes in handy.

I need to create an account of charges. I used to directly scan to paperless and have not prepared any other 'scan-option'. As new bills occur regularly, they could be scanned to paperless-ng and then appended to the existing document containing all the bills. So this feature helps to keep 'work in progress documents' up to date.

For now, I scan all the bills to paperless and once the document can be finished, I download all bills and merge them locally. Not a big deal, to be honest. But maybe this use case will help for the documentation.

henfri · 2022-02-06T18:52:45Z

For some of you, this may help:
#457 (comment)
Not anywhere close to the .mov above, of course. But if your scanner is a Brother ADS it should do the trick.

Jaykob · 2022-02-11T16:51:48Z

First of all, thanks for this great piece of software!

I also miss this feature after using paperless for one hour, as my scanner doesn’t easily support scanning multiple pages to a network share. Had to use a RPi with saned and scanbd to listen for the scanner’s hardware buttons to be pressed. So I’m quite happy that it works after all.

Seeing that it’s been 9 months or so without progress, I’m asking myself where and how we could help to get this thing finished?
I don’t want to sound offensive - just trying to offer help for a feature that many people (including me) would surely expect to be available in a project like this.

henfri · 2022-02-11T18:12:43Z

See here:
#1599

Jaykob · 2022-02-12T07:26:31Z

Thanks for the heads up!
Hope Jonas is OK 🙏

maugsburger · 2022-04-24T11:24:55Z

May I leave one more idea: Use OCR and a special page (with a unique text, like, PAPERLESS-SPLIT-PAGE-PAPERLESS-SPLIT-PAGE on it), to automatically split a scan there (and remove the separator page).

My scanner is really slow on loading the scan preset, but quite fast on the actual scanning. Therefore, I would love to add a bunch of documents at once, but get them split automatically.
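
Given the OCR text of each page, the splitting step could look like the sketch below. It assumes the text is already available as one string per page and uses a hypothetical function name; how reliably OCR recognises the marker on a real scan is a separate question.

```python
SEPARATOR_TEXT = "PAPERLESS-SPLIT-PAGE"  # marker text taken from the idea above


def split_on_separators(page_texts: list[str], marker: str = SEPARATOR_TEXT) -> list[list[int]]:
    """Group 0-based page indices into documents, dropping separator pages."""
    documents: list[list[int]] = []
    current: list[int] = []
    for index, text in enumerate(page_texts):
        if marker in text:
            # A separator closes the current document and is itself discarded.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(index)
    if current:
        documents.append(current)
    return documents
```

Leading, trailing and consecutive separator pages produce no empty documents, because a group is only emitted when it contains at least one page.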

smseidl · 2022-04-25T20:38:07Z

> May I leave one more idea: Use OCR and a special page (with a unique text, like, PAPERLESS-SPLIT-PAGE-PAPERLESS-SPLIT-PAGE on it), to automatically split a scan there (and remove the separator page).
>
> My scanner is really slow on loading the scan preset, but quite fast on the actual scanning. Therefore, I would love to add a bunch of documents at once, but get them split automatically.

@maugsburger - Paperless-NGX just released version 1.7.0, which includes the ability to use a PATCHT barcode separator page for splitting a scan into separate documents. It's disabled by default, but if you read through the documentation, it's quite easy to turn on.

clauschristianude · 2022-07-22T13:50:05Z

Hi,

for me it's the opposite: over the year I scan bills into single documents, and in the new year I would like to select all bills of the last year and merge them into one big file, so I can handle them better.

c-c

jonaswinkler added the feature request New feature or request label Jan 24, 2021

jonaswinkler mentioned this issue Jan 24, 2021

Feature Request: Combine documents #426

Closed

shamoon mentioned this issue Jan 25, 2021

Pre-receive Hook: Wait for file - software duplex #440

Closed

jonaswinkler changed the title ~~merge pictures or documents~~ [Feature Request] Split and merge documents Feb 12, 2021

jonaswinkler pinned this issue Feb 12, 2021

jonaswinkler added a commit that referenced this issue Mar 11, 2021

added a very crude and largely untested API endpoint for document merging #335

1e131f0

jonaswinkler added a commit that referenced this issue Mar 11, 2021

Angular interfaces for split+merge API support #335

5e096bc

ark- mentioned this issue May 31, 2021

Feature Request: Rotation in App #228

Closed

henfri mentioned this issue Feb 11, 2022

[Other] How to continue... the project seems unmaintained now #1599

Open

shamoon mentioned this issue Feb 16, 2022

[Feature] Add rotate to Bulk Edit paperless-ngx/paperless-ngx#59

Closed

shamoon referenced this issue in paperless-ngx/paperless-ngx Feb 22, 2022

added a very crude and largely untested API endpoint for document merging #335

fe727e6

shamoon referenced this issue in paperless-ngx/paperless-ngx Feb 22, 2022

Angular interfaces for split+merge API support #335

0e52453

shamoon referenced this issue in paperless-ngx/paperless-ngx Feb 22, 2022

initial gui #335

c9b60cb

czyzlukasz mentioned this issue Dec 16, 2023

Add option to merge multiple files into single PDF when bulk downloading paperless-ngx/paperless-ngx#5002

Closed
