Automatically organize folders with potentially huge amounts of unorganized ebooks. This is a Python port of organize-ebooks.sh from ebook-tools written in shell by na--.
This is done by renaming the files with proper names and moving them to other folders. The new names are obtained based on the ISBNs found in the ebook files. These ISBNs are extracted by using progressively more complex methods (from searching the filename to OCR-ing the given file) depending on the user's specified options.
⭐ Other related Python projects based on ebook-tools:
- convert-to-txt: convert documents (pdf, djvu, epub, word) to txt
- find-isbns: find ISBNs from ebooks (pdf, djvu, epub) or any string given as input to the script
- ocr: run OCR on documents (pdf, djvu, and images)
- split-ebooks-into-folders: split the supplied ebook files into folders with consecutive names
- interactive-organizer: interactively and manually check the ebook files that were organized by organize_ebooks.
You can ignore this section and go straight to pulling the Docker image, which contains all the required dependencies and the Python package organize_ebooks already installed. This section mostly documents how I set up my system when porting the shell script organize-ebooks.sh et al. to Python.
This is the environment on which the Python package organize_ebooks was developed and tested:
- Platform: macOS
- Python: version 3.7
- p7zip: for ISBN searching in ebooks that are in archives.
- Tesseract: for running OCR on books. Version 4 gives better results.
  ⚠️ OCR is a slow, resource-intensive process. Hence, by default only the first 7 and last 3 pages are OCR-ed through the option --ocr-only-first-last-pages. More info at Script options.
- Ghostscript: gs converts pdf to png (useful for OCR).
- textutil or catdoc: for converting doc to txt.
  NOTE: On macOS, you don't need catdoc since the system has the built-in textutil command-line tool, which converts any txt, html, rtf, rtfd, doc, docx, wordml, odt, or webarchive file.
- DjVuLibre:
  - it includes ddjvu for converting djvu to tif image (useful for OCR), and djvused to get the number of pages from a djvu document
  - it includes djvutxt for converting djvu to txt
  ⚠️ To access the djvu command line utilities and their documentation, you must set the shell variables PATH and MANPATH appropriately. This can be achieved by invoking a convenient shell script hidden inside the application bundle:
  $ eval `/Applications/DjView.app/Contents/setpath.sh`
  Ref.: ReadMe from DjVuLibre
  You need to softlink djvutxt in /usr/local/bin (or add it to $PATH).
- poppler:
  - it includes pdftotext for converting pdf to txt
  - it includes pdfinfo to get the number of pages from a pdf document if mdls (macOS) is not found.
ℹ️ epub is converted to txt by using unzip -c {input_file}
Optionally:
- calibre:
  - for fetching metadata from online sources
    Versions 2.84 and above are preferred because of their ability to manually specify from which specific online source we want to fetch metadata. For earlier versions you have to set ISBN_METADATA_FETCH_ORDER and ORGANIZE_WITHOUT_ISBN_SOURCES to empty strings.
  - for getting an ebook's metadata with ebook-meta in order to search it for ISBNs
  - for converting {pdf, djvu, epub, msword} to txt (for ISBN searching) by using calibre's ebook-convert
    ⚠️ ebook-convert is slower than the other conversion tools (textutil, catdoc, pdftotext, djvutxt)
- Optionally, poppler, catdoc and DjVuLibre can be installed to convert .pdf, .doc and .djvu files respectively to .txt faster than calibre does.
- Optionally, the Goodreads and WorldCat xISBN calibre plugins can be installed for better metadata fetching.
⭐ If you only install calibre among these dependencies, you can still have a functioning program that will organize ebook collections:
- fetching metadata from online sources will work: by default calibre comes with Amazon and Google sources among others
- conversion to txt will work: calibre's own ebook-convert tool will be used. However, accuracy and performance will be affected as explained in the list of dependencies above.
ℹ️ It is recommended to install the Python package organize_ebooks with Docker because the Docker container has all of its many dependencies already installed along with the package itself.
Pull the Docker image from hub.docker.com:
docker pull raul23/organize:latest
Run the Docker container:
docker run -it -v /host/input/folder:/unorganized-books raul23/organize:latest
ℹ️
- /host/input/folder is a directory within your OS that can contain all the ebooks to be organized and is mounted as /unorganized-books within the Docker container.
- You can use the -v option multiple times to mount several host folders within the container, e.g.:
  docker run -it -v /host/input/folder:/unorganized-books -v /host/output/folder:/output-folder raul23/organize:latest
- raul23/organize:latest is the name of the image upon which the Docker container will be created.
Now that you are within the Docker container, you can run the Python script organize_ebooks with the desired options:
user:~$ organize_ebooks /unorganized-books/
ℹ️
- This basic command instructs the script organize_ebooks to organize the ebooks within /unorganized-books/ and to save the renamed ebooks within the working directory, which is the default location of the -o option (output folder).
- When you log in as user (non-root) within the Docker container, your working directory is /ebook-tools.
ℹ️
- The layers of the Docker image can be checked in detail at the project's Docker repo, where you can find the commands used in the Dockerfile for installing all the dependencies in the base OS (Ubuntu 18.04).
- This Python-based Docker image is derived from the project ebook-tools (shell scripts by na--), which you can find on Docker Hub. One of the main differences is the base OS: Ubuntu 18.04 for this image versus Debian for ebook-tools.
The Docker image for this project contains the following components:
- Ubuntu 18.04: the base system of the Docker image
- All the dependencies (required and optional) needed for supporting all the features (e.g. OCR, document conversion to text) offered by the package organize_ebooks:
  - Python 3.6.9 along with setuptools and wheel
  - p7zip: 7z
  - Tesseract
  - Ghostscript: gs
  - catdoc
  - DjVuLibre: ddjvu, djvused, djvutxt
  - Poppler: pdftotext and pdfinfo
  - calibre: ebook-convert, ebook-meta, calibre's metadata plugins (including Goodreads and WorldCat xISBN)
    The Goodreads plugin (goodreads.zip) is from this forum post (by a calibre Developer) (2022-12-23): mobileread.com
  - unzip
- The Python package organize_ebooks is installed. You can call the corresponding script with any of the options:
  user:~$ organize_ebooks /unorganized-books/
- The Python package interactive_organizer is installed. You can call the corresponding script with any of the options:
  user:~$ interactive_organizer /uncertain/
- user: a user named user is created with UID 1000. user doesn't have root privileges within the Docker container. Thus you can't, among other things, install packages with apt-get install.
You can ignore this section and go straight to pulling the Docker image, which contains all the required dependencies and the Python package organize_ebooks already installed. This section is for installing the bleeding-edge version of the Python package organize_ebooks after you have installed the many dependencies yourself.
After you have installed the dependencies, you can then install the development (bleeding-edge) version of the package organize_ebooks:
pip install git+https://github.com/raul23/organize-ebooks#egg=organize-ebooks
NOTE: the development version has the latest features
Test installation
- Test your installation by importing organize_ebooks and printing its version:
  python -c "import organize_ebooks; print(organize_ebooks.__version__)"
- You can also test that you have access to the organize_ebooks.py script by showing the program's version:
  organize_ebooks --version
To uninstall the development version of the package organize_ebooks:
pip uninstall organize_ebooks
To display the list of options of the script organize_ebooks.py along with their descriptions:
$ organize_ebooks -h

usage: organize_ebooks [OPTIONS] {folder_to_organize}

Automatically organize folders with potentially huge amounts of unorganized ebooks.
This is done by renaming the files with proper names and moving them to other folders.

This script is based on the great ebook-tools written in shell by na-- (See https://github.com/na--/ebook-tools).

General options:
  -h, --help
      Show this help message and exit.
  -v, --version
      Show program's version number and exit.
  -q, --quiet
      Enable quiet mode, i.e. nothing will be printed.
  --verbose
      Print various debugging information, e.g. print traceback when there is an exception.
  -d, --dry-run
      If this is enabled, no file rename/move/symlink/etc. operations will actually be executed.
  -s, --symlink-only
      Instead of moving the ebook files, create symbolic links to them.
  -k, --keep-metadata
      Do not delete the gathered metadata for the organized ebooks, instead save it in an accompanying file together with each renamed book. It is very useful for semi-automatic verification of the organized files for additional verification, indexing or processing at a later date.
  -r, --reverse
      If this is enabled, the files will be sorted in reverse (i.e. descending) order. By default, they are sorted in ascending order.
  --log-level {debug,info,warning,error}
      Set logging level. (default: info)
  --log-format {console,only_msg,simple}
      Set logging formatter. (default: only_msg)

Convert-to-txt options:
  --djvu {djvutxt,ebook-convert}
      Set the conversion method for djvu documents. (default: djvutxt)
  --epub {epubtxt,ebook-convert}
      Set the conversion method for epub documents. (default: epubtxt)
  --msword {catdoc,textutil,ebook-convert}
      Set the conversion method for msword documents. (default: textutil)
  --pdf {pdftotext,ebook-convert}
      Set the conversion method for pdf documents. (default: pdftotext)

Options related to extracting ISBNS from files and finding metadata by ISBN:
  --max-isbns NUMBER
      Maximum number of ISBNs to try when fetching metadata from online sources by ISBNs. (default: 5)
  -i, --isbn-regex ISBN_REGEX
      This is the regular expression used to match ISBN-like numbers in the supplied books. (default: (?<![0-9])(-?9-?7[789]-?)?((-?[0-9]-?){9}[0-9xX])(?![0-9]))
  --isbn-blacklist-regex REGEX
      Any ISBNs that were matched by the ISBN_REGEX above and pass the ISBN validation algorithm are normalized and passed through this regular expression. Any ISBNs that successfully match against it are discarded. The idea is to ignore technically valid but probably wrong numbers like 0123456789, 0000000000, 1111111111, etc.. (default: ^(0123456789|([0-9xX])\2{9})$)
  --isbn-direct-files REGEX
      This is a regular expression that is matched against the MIME type of the searched files. Matching files are searched directly for ISBNs, without converting or OCR-ing them to .txt first. (default: ^text/(plain|xml|html)$)
  --isbn-ignored-files REGEX
      This is a regular expression that is matched against the MIME type of the searched files. Matching files are not searched for ISBNs beyond their filename. By default, it tries to ignore .gif and .svg images, audio, video and executable files and fonts. (default: ^(image/(gif|svg.+)|application/(x-shockwave-flash|CDFV2|vnd.ms-opentype|x-font-ttf|x-dosexec|vnd.ms-excel|x-java-applet)|audio/.+|video/.+)$)
  --reorder-files LINES [LINES ...]
      These options specify if and how we should reorder the ebook text before searching for ISBNs in it. By default, the first 400 lines of the text are searched as they are, then the last 50 are searched in reverse and finally the remainder in the middle. This reordering is done to improve the odds that the first found ISBNs in a book text actually belong to that book (ex. from the copyright section or the back cover), instead of being random ISBNs mentioned in the middle of the book. No part of the text is searched twice, even if these regions overlap. Set it to `False` to disable the functionality or `first_lines last_lines` to enable it with the specified values. (default: 400 50)
  --irs, --isbn-return-separator SEPARATOR
      This specifies the separator that will be used when returning any found ISBNs. (default: ' - ')
  -m, ---metadata-fetch-order METADATA_SOURCE [METADATA_SOURCE ...]
      This option allows you to specify the online metadata sources and order in which the subcommands will try searching in them for books by their ISBN. The actual search is done by calibre's `fetch-ebook-metadata` command-line application, so any custom calibre metadata plugins can also be used. To see the currently available options, run `fetch-ebook-metadata --help` and check the description for the `--allowed-plugin` option. If you use Calibre versions that are older than 2.84, it's required to manually set this option to an empty string. (default: ['Goodreads', 'Google', 'Amazon.com', 'ISBNDB', 'WorldCat xISBN', 'OZON.ru'])

OCR options:
  --ocr, --ocr-enabled {always,true,false}
      Whether to enable OCR for .pdf, .djvu and image files. It is disabled by default. (default: false)
  --ocrop, --ocr-only-first-last-pages PAGES PAGES
      Value 'n m' instructs the script to convert only the first n and last m pages when OCR-ing ebooks. (default: 7 3)

Organize options:
  --skip-archives
      Skip all archives (e.g. zip, 7z) except epub files.
  -c, --corruption-check {check_only,true,false}
      `check_only`: do not organize or rename files, just check them for corruption (ex. zero-filled files, corrupt archives or broken .pdf files). `true`: check corruption and organize/rename files. `false`: skip corruption check. This option is useful with the `output-folder-corrupt` option. (default: true)
  -t, --tested-archive-extensions REGEX
      A regular expression that specifies which file extensions will be tested with `7z t` for corruption. (default: ^(7z|bz2|chm|arj|cab|gz|tgz|gzip|zip|rar|xz|tar|epub|docx|odt|ods|cbr|cbz|maff|iso)$)
  --owi, --organize-without-isbn
      Specify whether the script will try to organize ebooks if there were no ISBN found in the book or if no metadata was found online with the retrieved ISBNs. If enabled, the script will first try to use calibre's `ebook-meta` command-line tool to extract the author and title metadata from the ebook file. The script will try searching the online metadata sources (`organize-without-isbn-sources`) by the extracted author & title and just by title. If there is no useful metadata or nothing is found online, the script will try to use the filename for searching.
  --owis, --organize-without-isbn-sources METADATA_SOURCE [METADATA_SOURCE ...]
      This option allows you to specify the online metadata sources in which the script will try searching for books by non-ISBN metadata (i.e. author and title). The actual search is done by calibre's `fetch-ebook-metadata` command-line application, so any custom calibre metadata plugins can also be used. To see the currently available options, run `fetch-ebook-metadata --help` and check the description for the `--allowed-plugin` option. Because Calibre versions older than 2.84 don't support the `--allowed-plugin` option, if you want to use such an old Calibre version you should manually set `organize_without_isbn_sources` to an empty string. (default: ['Goodreads', 'Google', 'Amazon.com'])
  -w, --without-isbn-ignore REGEX
      This is a regular expression that is matched against lowercase filenames. All files that do not contain ISBNs are matched against it and matching files are ignored by the script, even if `organize-without-isbn` is true. The default value is calibrated to match most periodicals (magazines, newspapers, etc.) so the script can ignore them. (default: complex default value, see the README)
  --pamphlet-included-files REGEX
      This is a regular expression that is matched against lowercase filenames. All files that do not contain ISBNs and do not match `without-isbn-ignore` are matched against it and matching files are considered pamphlets by default. They are moved to `output_folder_pamphlets` if set, otherwise they are ignored. (default: \.(png|jpg|jpeg|gif|bmp|svg|csv|pptx?)$)
  --pamphlet-excluded-files REGEX
      This is a regular expression that is matched against lowercase filenames. If files do not contain ISBNs and match against it, they are NOT considered as pamphlets, even if they have a small size or number of pages. (default: \.(chm|epub|cbr|cbz|mobi|lit|pdb)$)
  --pamphlet-max-pdf-pages PAGES
      .pdf files that do not contain valid ISBNs and have a lower number pages than this are considered pamplets/non-ebook documents. (default: 50)
  --pamphlet-max-filesize-kib SIZE
      Other files that do not contain valid ISBNs and are below this size in KiBs are considered pamplets/non-ebook documents. (default: 250)

Input/Output options:
  folder_to_organize
      Folder containing the ebook files that need to be organized.
  -o, --output-folder PATH
      The folder where ebooks that were renamed based on the ISBN metadata will be moved to. (default: /Users/test/PycharmProjects/testing/organize/test_installation)
  --ofu, --output-folder-uncertain PATH
      If `organize-without-isbn` is enabled, this is the folder to which all ebooks that were renamed based on non-ISBN metadata will be moved to. (default: None)
  --ofc, --output-folder-corrupt PATH
      If specified, corrupt files will be moved to this folder. (default: None)
  --ofp, --output-folder-pamphlets PATH
      If specified, pamphlets will be moved to this folder. (default: None)
  --oft, --output-filename-template TEMPLATE
      This specifies how the filenames of the organized files will look. It is a bash string that is evaluated so it can be very flexible (and also potentially unsafe). (default: ${d[AUTHORS]// & /, } - ${d[SERIES]:+[${d[SERIES]}] - }${d[TITLE]/:/ -}${d[PUBLISHED]:+ (${d[PUBLISHED]%-*})}${d[ISBN]:+ [${d[ISBN]}]}.${d[EXT]})
  --ome, --output-metadata-extension EXTENSION
      If `keep-metadata` is enabled, this is the extension of the additional metadata file that is saved next to each newly renamed file. (default: meta)
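To make the ISBN-related defaults from the help output more concrete, here is a minimal Python sketch that applies the default --isbn-regex, validates each candidate with the standard ISBN-10/ISBN-13 checksums, and then filters the result with the default --isbn-blacklist-regex. The function names are hypothetical and the package's own find_isbns() may well work differently; only the two regular expressions are taken from the help text.

```python
import re

# Default values copied from the help output (--isbn-regex, --isbn-blacklist-regex).
ISBN_REGEX = r"(?<![0-9])(-?9-?7[789]-?)?((-?[0-9]-?){9}[0-9xX])(?![0-9])"
ISBN_BLACKLIST_REGEX = r"^(0123456789|([0-9xX])\2{9})$"


def is_valid_isbn10(isbn):
    """Standard ISBN-10 checksum: weights 10..1, total divisible by 11."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char in "xX":
            if position != 9:  # 'X' (value 10) is only allowed as check digit
                return False
            value = 10
        else:
            value = int(char)
        total += (10 - position) * value
    return total % 11 == 0


def is_valid_isbn13(isbn):
    """Standard ISBN-13 checksum: alternating weights 1 and 3, total divisible by 10."""
    if len(isbn) != 13 or not isbn.isdigit():
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(isbn))
    return total % 10 == 0


def find_valid_isbns(text):
    """Hypothetical helper: return normalized, valid, non-blacklisted ISBNs in order."""
    found = []
    for match in re.finditer(ISBN_REGEX, text):
        # Normalize by dropping hyphens and anything that is not a digit or X.
        candidate = re.sub(r"[^0-9xX]", "", match.group(0))
        valid = is_valid_isbn10(candidate) or is_valid_isbn13(candidate)
        blacklisted = re.match(ISBN_BLACKLIST_REGEX, candidate) is not None
        if valid and not blacklisted and candidate not in found:
            found.append(candidate)
    return found


sample = "ISBN 978-0-00-728842-7 and 0262002744, not 0000000000 or 1234567890"
print(find_valid_isbns(sample))
```

In this sample, 0000000000 passes the checksum but is dropped by the blacklist, while 1234567890 fails the checksum.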
- --keep-metadata: as stated in its description above, the metadata files that are created alongside the renamed ebook files are useful for the script interactive_organizer, which uses them for various post-processing tasks such as showing the differences between the old and new filenames.
- --log-level: if it is set to the logging level warning, only the documents that were skipped (e.g. the file is an image) or failed (e.g. corrupted file) will be shown on the terminal.
- --max-isbns: especially when organizing epub files (they can contain many files since they are archives), many valid ISBNs can be found and thus the fetching of metadata from online sources might take longer than usual. By limiting the number of ISBNs to check, the script can run faster by not being bogged down by testing lots of ISBNs. Usually the first ISBN found is the correct one, since it appears in the very first pages of the document, which is the most likely place to find it (the script searches for ISBNs in the first pages, then at the end, and finally in the middle of the file).
- --skip-archives: by default all archives (e.g. 7z, zip) are searched for ISBNs, which means that they will be decompressed and each extracted file will be recursively searched for ISBNs. With this flag, you can skip these archives (except epub documents) when organizing your ebooks.
- --corruption-check: the corruption check with pdfinfo can be very sensitive, flagging some PDF files as corrupted even though they can be opened without problems:
  Syntax Error: Dictionary key must be a name object
  Syntax Error: Couldn't find trailer dictionary
  Thus by setting this option to 'false', you can skip any corruption check (whether by pdfinfo or 7z). By default, the corruption check is enabled. Also, if you set it to 'check_only', only the corruption check will be performed, i.e. no organization or renaming of ebooks will be done.
- The choices for --ocr are {always, true, false}:
  - 'always': If the conversion to text was successful but no ISBNs were found, then OCR is run on the document. Also, if the conversion failed (e.g. its content is empty or doesn't contain any text), then OCR is applied to the document.
  - 'true': OCR is applied to the document only if the conversion to text failed.
  - 'false': No OCR is applied after the conversion to text.
- --owi, --organize-without-isbn: if no ISBNs could be found within the document, the document can still be organized based on its author and/or title or filename by calling calibre's fetch-ebook-metadata command-line application, which fetches metadata from online metadata sources (by default they are 'Goodreads', 'Google', 'Amazon.com'). These ebooks are then saved under the user-specified uncertain folder (--ofu, --output-folder-uncertain).
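The search order mentioned for --max-isbns (first pages, then the end, then the middle) corresponds to the --reorder-files option. A minimal sketch of that reordering, assuming the text has already been split into lines and that "searched in reverse" means the last lines are visited from the bottom up; the function name reorder_lines is hypothetical and not taken from the package:

```python
def reorder_lines(lines, first_lines=400, last_lines=50):
    """Return lines in the order head, reversed tail, then middle.

    No line appears twice, even when the head and tail regions overlap.
    """
    total = len(lines)
    head_end = min(first_lines, total)
    # The tail can never start before the end of the head.
    tail_start = max(total - last_lines, head_end)
    head = lines[:head_end]
    tail = lines[tail_start:][::-1]
    middle = lines[head_end:tail_start]
    return head + tail + middle


# Small example with 10 "lines", keeping the first 3 and the last 2.
print(reorder_lines(list(range(10)), first_lines=3, last_lines=2))
```

With the defaults (400 and 50), an ISBN printed on the copyright page or the back cover would be met before any ISBN cited in the body of the book.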
At bare minimum, the script organize_ebooks requires an input folder containing the ebooks to organize. Thus, the following is one of the most basic commands you can provide to the script:
organize_ebooks ~/ebooks/input_folder/
The ebooks in the input folder will be searched for ISBNs. The script tries to find ISBN numbers in the given ebook file by using progressively more "expensive" tactics (as stated in lib.sh from ebook-tools).
These are the steps, in order, followed by the organize_ebooks script when searching for ISBNs in a given ebook (as soon as ISBNs are found, the script returns them):
- The first location it tries to find ISBNs is the filename.
- Then it checks the contents directly if it is a text file.
- The next place that is searched for ISBNs is the file metadata by calling calibre's ebook-meta.
- The file is decompressed with 7z if it is an archive and the extracted files are recursively searched for ISBNs (epubs are excluded from this step even though they are basically zipped HTML files as explained in epub and archives).
- The file is converted to txt and its text content is searched for ISBNs.
- If OCR is enabled (through the --ocr option), the file is OCR-ed and the resultant text content is searched for ISBNs.
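The "cheapest step first, stop at the first hit" idea behind these steps can be sketched as follows. This is only an illustration: the step functions, the simplified ISBN pattern and the name search_progressively are hypothetical, and every step except the filename search is a placeholder that finds nothing.

```python
import os
import re

# Simplified ISBN-like pattern, used only for this illustration.
ISBN_LIKE = re.compile(r"(?<![0-9])(?:97[89])?[0-9]{9}[0-9xX](?![0-9])")


def isbns_in_filename(path):
    """Cheapest step: look for ISBN-like numbers in the filename itself."""
    return ISBN_LIKE.findall(os.path.basename(path))


def placeholder_step(path):
    """Hypothetical stand-in for the more expensive steps (not implemented here)."""
    return []


# Steps ordered from cheapest to most expensive, mirroring the list above.
STEPS = [
    ("filename", isbns_in_filename),
    ("text content", placeholder_step),
    ("ebook-meta metadata", placeholder_step),
    ("archive extraction", placeholder_step),
    ("conversion to txt", placeholder_step),
    ("OCR", placeholder_step),
]


def search_progressively(path, steps=STEPS):
    """Run each step in order and return as soon as one of them finds ISBNs."""
    for name, step in steps:
        isbns = step(path)
        if isbns:
            return name, isbns
    return None, []


print(search_progressively("Cory Doctorow - Little Brother (2008) [9780007288427].epub"))
```

Because the loop returns on the first non-empty result, the expensive steps (conversion, OCR) only run for files where the cheaper ones found nothing.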
organize_ebooks ~/input_folder/ -o ~/output_folder/ --ofc ~/corrupt/ --ofu ~/uncertain/ --owi
ℹ️
- --ofu, --output-folder-uncertain: this folder will contain any document that could be identified based on non-ISBN metadata (e.g. title) from online sources (e.g. Goodreads). However, this folder is only used along with the flag --owi (next option explained).
- --owi, --organize-without-isbn: this flag instructs the script to fetch metadata from online sources in case no ISBN could be found in an ebook. The filename or the author and/or title are used for fetching metadata about the book.
By default (see the --oft option), this is the bash string used as template when naming ebooks:
${d[AUTHORS]// & /, } - ${d[SERIES]:+[${d[SERIES]}] - }${d[TITLE]/:/ -}${d[PUBLISHED]:+ (${d[PUBLISHED]%-*})}${d[ISBN]:+ [${d[ISBN]}]}.${d[EXT]}
For example, it produces the following filenames:
Cory Doctorow - Little Brother (2008) [9780007288427]
Eric von Hippel - Democratizing Innovation (2005) [0262002744]
Steve Jones - Almost Like a Whale - The Origin of Species Updated (2000) [9780385409858].html
If you want to add other data to the filenames such as the publisher and languages, here is how you can modify this bash string:
${d[AUTHORS]// & /, } - ${d[SERIES]:+[${d[SERIES]}] - }${d[TITLE]/: -} (${d[PUBLISHER]:+${d[PUBLISHER]}}, ${d[PUBLISHED]:+${d[PUBLISHED]%%-*}})${d[ISBN]:+ [${d[ISBN]}]}${d[LANGUAGES]:+ [${d[LANGUAGES]}]}.${d[EXT]}
Here is an example of a filename that is generated based on this modified bash string:
Cory Doctorow - With a Little Help (CorDoc-Company, Limited, 2010) [9780557943050] [eng].epub
This is how you would call the script organize_ebooks with this modified string (--oft, --output-filename-template option):
organize_ebooks ~/input -o ~/output/ --oft '${d[AUTHORS]// & /, } - ${d[SERIES]:+[${d[SERIES]}] - }${d[TITLE]/: -} (${d[PUBLISHER]:+${d[PUBLISHER]}}, ${d[PUBLISHED]:+${d[PUBLISHED]%%-*}})${d[ISBN]:+ [${d[ISBN]}]}${d[LANGUAGES]:+ [${d[LANGUAGES]}]}.${d[EXT]}'
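To see what the default template amounts to, here is a rough Python emulation that reproduces the shape of the example filenames shown earlier in this section. It is a sketch under assumptions, not the package's implementation: the function name render_filename and the sample metadata values are made up for illustration, and keeping only the year of the publication date is inferred from the example filenames rather than from the bash expansion rules.

```python
def render_filename(d):
    """Build 'AUTHORS - [SERIES] - TITLE (YEAR) [ISBN].EXT' from a metadata dict."""
    # Authors separated by ' & ' are joined with ', ' instead.
    authors = d.get("AUTHORS", "").replace(" & ", ", ")
    # Optional parts are only emitted when the field is present and non-empty.
    series = "[{}] - ".format(d["SERIES"]) if d.get("SERIES") else ""
    # Only the first colon of the title is replaced by ' -'.
    title = d.get("TITLE", "").replace(":", " -", 1)
    # Assumption: keep only the year, as in the example filenames.
    published = " ({})".format(d["PUBLISHED"].split("-")[0]) if d.get("PUBLISHED") else ""
    isbn = " [{}]".format(d["ISBN"]) if d.get("ISBN") else ""
    return "{} - {}{}{}{}.{}".format(authors, series, title, published, isbn, d.get("EXT", ""))


# Hypothetical metadata for one of the example books.
metadata = {
    "AUTHORS": "Steve Jones",
    "TITLE": "Almost Like a Whale: The Origin of Species Updated",
    "PUBLISHED": "2000-01-01",
    "ISBN": "9780385409858",
    "EXT": "html",
}
print(render_filename(metadata))
```

Fields such as the series, publication date and ISBN simply disappear from the filename when they are missing, which is what the `:+` parts of the bash template are for.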
To organize a collection of documents (ebooks, pamphlets) through the script organize_ebooks.py:
organize_ebooks ~/input_folder/ -o ~/output_folder/ --ofp ~/pamphlets/
ℹ️ Explaining the command
- I only specify the input and two output folders and thus ignore corrupted files (--ofc not used) and ebooks without ISBNs (--ofu and --owi not used). These ignored files will just be skipped.
- Also, books made up of images will be skipped since OCR was not chosen (--ocr is set to 'false' by default).
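The effect of the three --ocr values described in the script options can be summarized as a small decision function. This is a sketch of the documented behaviour only; the function name and its boolean parameters are hypothetical.

```python
def should_run_ocr(ocr_enabled, conversion_succeeded, isbns_found):
    """Decide whether OCR is applied after the conversion-to-text step."""
    if ocr_enabled == "always":
        # OCR when the conversion failed, or when it worked but gave no ISBNs.
        return (not conversion_succeeded) or (not isbns_found)
    if ocr_enabled == "true":
        # OCR only when the conversion to text failed.
        return not conversion_succeeded
    if ocr_enabled == "false":
        return False
    raise ValueError("ocr_enabled must be 'always', 'true' or 'false'")


# A scanned book: conversion produced no text, OCR left at its default value.
print(should_run_ocr("false", conversion_succeeded=False, isbns_found=False))
```

With the default value 'false', a book made up of images never reaches OCR, which is why such books are skipped by the command above.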
Let's say we have this folder containing assorted documents:
To organize this collection of documents (ebooks, pamphlets) through the Python API (i.e. organize_ebooks package):
from organize_ebooks.lib import organizer

# Organize the input folder; each kind of document goes to its own output folder
retcode = organizer.organize('/Users/test/ebooks/input_folder/',
                             output_folder='/Users/test/ebooks/output_folder',
                             output_folder_corrupt='/Users/test/ebooks/corrupt/',
                             output_folder_pamphlets='/Users/test/ebooks/pamphlets/',
                             output_folder_uncertain='/Users/test/ebooks/uncertain/',
                             organize_without_isbn=True,
                             keep_metadata=True)
ℹ️ Explaining the parameters of the function organize()
- The first parameter to organize() is the input folder containing the documents to organize.
- output_folder: this is the folder where every ebook whose ISBN could be retrieved will be saved and renamed with a proper name. Thus the program is highly confident that these ebooks are correctly labeled based on the found ISBNs.
- output_folder_corrupt: any document that was checked (with pdfinfo) and found to be corrupted will be saved in this folder.
- output_folder_pamphlets: this is the folder that will contain any documents without valid ISBNs (e.g. HTML pages) that satisfy certain criteria for pamphlets (such as small size and low number of pages).
- output_folder_uncertain: this folder will contain any documents that could be identified based on non-ISBN metadata (e.g. title) from online sources (e.g. Goodreads). However, this folder is only used if the flag organize_without_isbn (next option explained) is set to True.
- organize_without_isbn: If True, this flag instructs the script to fetch metadata from online sources in case no ISBN could be found in ebooks.
- keep_metadata: If True, a metadata file will be saved alongside the renamed ebooks in the output folder. Also, documents that were identified as corrupted will be saved along with a metadata file that will contain info about the detected corruption.
- If everything went well with the organization of documents, organize() will return 0 (success). Otherwise, retcode will be 1 (failure).
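The pamphlet criteria mentioned for output_folder_pamphlets can be sketched with the defaults listed in the script options (--pamphlet-included-files, --pamphlet-excluded-files, --pamphlet-max-pdf-pages, --pamphlet-max-filesize-kib). The order of the checks and the function name is_pamphlet are assumptions made for this illustration; the sketch also assumes the file has already been found to contain no valid ISBN.

```python
import re

# Default values copied from the help output.
PAMPHLET_INCLUDED = re.compile(r"\.(png|jpg|jpeg|gif|bmp|svg|csv|pptx?)$")
PAMPHLET_EXCLUDED = re.compile(r"\.(chm|epub|cbr|cbz|mobi|lit|pdb)$")


def is_pamphlet(filename, size_kib, pdf_pages=None,
                max_pdf_pages=50, max_filesize_kib=250):
    """Classify a file without valid ISBNs as pamphlet (True) or not (False)."""
    name = filename.lower()  # the regexes are matched against lowercase filenames
    if PAMPHLET_INCLUDED.search(name):
        return True
    if PAMPHLET_EXCLUDED.search(name):
        return False
    if name.endswith(".pdf") and pdf_pages is not None:
        # PDF files are judged by their number of pages.
        return pdf_pages < max_pdf_pages
    # Any other file is judged by its size in KiB.
    return size_kib < max_filesize_kib


print(is_pamphlet("flyer.pdf", size_kib=9000, pdf_pages=4))
```

Files classified this way end up in output_folder_pamphlets when that folder is given, and are ignored otherwise.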
Sample output:
Contents of the different folders after the organization:
By default when using the API, the loggers are disabled. If you want to enable them, call the function setup_log() (with the desired log level in all caps) at the beginning of your code before the function organize():
from organize_ebooks.lib import organizer, setup_log

# Enable logging before organizing (log level in all caps)
setup_log(logging_level='INFO')
retcode = organizer.organize('/Users/test/ebooks/input_folder/',
                             output_folder='/Users/test/ebooks/output_folder',
                             output_folder_corrupt='/Users/test/ebooks/corrupt/',
                             output_folder_pamphlets='/Users/test/ebooks/pamphlets/',
                             output_folder_uncertain='/Users/test/ebooks/uncertain/',
                             organize_without_isbn=True,
                             keep_metadata=True)
Sample output:
Having multiple metadata sources can slow down the organization of ebooks.
- By default, we have for metadata-fetch-order: ['Goodreads', 'Google', 'Amazon.com', 'ISBNDB', 'WorldCat xISBN', 'OZON.ru']
- By default, we have for organize-without-isbn-sources: ['Goodreads', 'Google', 'Amazon.com']
I usually get results from Google and Goodreads.
Books that are sometimes skipped for insufficient information from filename/ISBN or wrong filename/ISBN:
- Solution manuals
- Obscure and/or non-English books
- Very old books without any ISBN
- A book with an invalid ISBN from the get-go: only found two such books so far (French math books)
- Books with an invalid ISBN because, when converting them to text for extracting their ISBNs, an extra number was added to the ISBN (and not at the end but in the middle of it), which made it invalid
  For the moment, I don't know what to do about this case
- Books whose ISBNs couldn't be extracted because the conversion to text (with or without OCR) was not clean, i.e. it added extra characters (not necessarily numbers) such as '·' or 'uf73' between the numbers of the ISBN, which "broke" the regex
  Solution: I had to modify find_isbns() to take into account these annoying "artifacts" from the conversion procedure
- Books that require OCR: obviously, they are skipped if I didn't enable OCR with the option --ocr-enabled (by default it is set to 'false')
I was trying to build a Docker image based on ebooktools/scripts, which contains all the necessary dependencies (e.g. calibre, Tesseract) for a Debian system, and I was going to add the Python package organize_ebooks. However, I couldn't build an image from the base OS debian:sid-slim as specified in its Dockerfile:
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY
Thus, I created an image from scratch starting with ubuntu:18.04 that I was trying to push to hub.docker.com, but I was always getting the error requested access to the resource is denied (see solution).
When searching for ISBNs, the Python script organize_ebooks doesn't decompress epub files with 7z because it would be a very slow operation: 7z decompresses archives and recursively scans the contents, which can be many files within an epub file. Then you would have to search for ISBNs in each of the extracted files, which would increase the running time of the script.
Instead, epub files are decompressed with unzip -c, which extracts files to stdout/screen, and then the output is redirected to a temporary text file. This text file is then searched for ISBNs. Hence the searching for ISBNs is quicker when applying unzip to epub files than with 7z.
Also, the reason for using unzip is to make the conversion of epub files to text quicker and more accurate than calibre's ebook-convert.
ℹ️ epubs are basically zipped HTML files
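Since an epub is a zip archive, the effect of unzip -c (dumping the contents of every member into a single stream of text) can be approximated in pure Python with the standard zipfile module. This is only an analogy to illustrate the idea, not what the script does: the script calls unzip itself, and the function name epub_to_text and the in-memory sample archive below are made up for the example.

```python
import io
import zipfile


def epub_to_text(epub_file):
    """Concatenate the contents of every file inside an epub (zip) archive."""
    chunks = []
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            # Undecodable bytes (e.g. from images) are replaced, not fatal.
            chunks.append(archive.read(name).decode("utf-8", errors="replace"))
    return "\n".join(chunks)


# Build a tiny epub-like archive in memory to try the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("OEBPS/copyright.xhtml", "<p>ISBN 978-0-00-728842-7</p>")
    archive.writestr("OEBPS/chapter1.xhtml", "<p>Chapter 1</p>")
buffer.seek(0)

text = epub_to_text(buffer)
print(text)
```

The resulting text can then be searched for ISBNs in one pass, instead of scanning each extracted file separately as would happen with 7z.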
These are the files that are supported for conversion to txt and the corresponding conversion tools used:
Files supported | Conversion tool #1 | Conversion tool #2 | Conversion tool #3 |
---|---|---|---|
pdf | pdftotext | ebook-convert (calibre) | |
djvu | djvutxt | ebook-convert (calibre) | |
epub | epubtxt | ebook-convert (calibre) | |
docx (Word 2007) | ebook-convert (calibre) | | |
doc (Word 97) | textutil (macOS) | catdoc | ebook-convert (calibre) |
rtf | ebook-convert (calibre) | | |
ℹ️ Some explanations about the table
- epubtxt is a fancy way to say unzip.
- By default, ebook-convert (calibre) is always used as a last resort when other methods already exist since it is slower than the other conversion tools.
For comparison, here are the times taken to completely convert a 154-page PDF document to txt for both supported conversion methods:
- pdftotext: 4.27s
- ebook-convert (calibre): 80.91s
ℹ️ If you are having trouble pushing your docker image to hub.docker.com with an old macOS, here is what worked for me
I was trying to push to hub.docker.com but I was getting the error requested access to the resource is denied.
I tried everything that was suggested on various forums: checking that I named my image and repo correctly, making sure I was logged in before pushing, making sure that I was not pushing to a private repo or to docker.io/library/, making sure that my Docker client was running, and so on.
I was finally able to push the Docker image to hub.docker.com by installing Ubuntu 22.04 in a virtual machine since I was finally convinced that my very old macOS wasn't compatible with Docker anymore 😞. Also my Docker version was way too old and the latest Docker requires newer versions of macOS. The only docker operation I was not able to accomplish (as far as I know) with my old macOS was docker push.
👉 SOLUTION: if you tried everything under the sun to fix the push problem but you still couldn't solve it, then the solution is to finally accept that your old macOS (or any other OS) is the cause and you should try Docker on a newer system. Since I didn't want to install a newer version of macOS (I don't want to break my current programs and I don't think my system is able to support it), I opted for installing Docker with Ubuntu 22.04 under a virtual machine.
What I found strange though was that on my old macOS, when I logged out from Docker, I got the following message:
Not logged in to https://index.docker.io/v1/
However on Ubuntu 22.04, this is what I get when I log out from Docker (and this is what I see from other people using Docker):
Removing login credentials for https://index.docker.io/v1/
Maybe on the old macOS I was not correctly authenticated (even though I got the message Login Succeeded) and thus I couldn't do the docker push.