Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support accent-insensitive search using SQLean unicode sqlite3 extension #3199

Merged
merged 7 commits into from
Jul 28, 2024

Conversation

mikiher
Copy link
Contributor

@mikiher mikiher commented Jul 27, 2024

This fixes #2678.

This PR has two main parts:

1. Refactoring and modifications in BinaryManager to support new capabilities

  • Ability to download binaries from multiple sources.
  • Ability to download binaries from GitHub release assets
  • Support for downloading libraries

These BinaryManager changes allowed me to:

  • move away from downloading ffmpeg and ffprobe from ffbinaries.com, replacing it with the ffbinaries-prebuilt GitHub repository, which contains the same executables.
  • download the SQLean unicode library from the sqlean Github repository

I also made BinaryManager run for all platforms, so it's able to download the unicode library (which is needed for the second part). Up until now it was only running for dev and Windows platforms, but I believe it's safe to run it on all supported platforms, with no changes or side-effects (if ffmpeg and ffprobe already exist on the system and are identified using the findRequiredBinaries() method, those will be used and nothing will be downloaded).

As a reminder, if BinaryManager is not able to find a binary it needs, it downloads it and puts it in one of two locations:

  • mainInstallDir: gloabl.appRoot (or, in the case of a pkg-ed binary, in the directory where that binary is located)
  • altInstallDir: (if mainInstallDir is not writable) the Audiobookshelf config directory (which should always be writable)

I believe the BinaryManager is a good mechanism to make sure all required binaries are available (whether they're obtained externally or by BinaryManager itself), and it removes the hassle of deploying those binaries in the various supported platforms.

2. Support for accent-insensitive search by using functions from the SQLean unicode sqlite3 extension

  • The extension is downloaded by BinaryManager (like ffmpeg, it can also be added externally and passed though the SQLEAN_UNICODE_PATH environment variable)
  • It is loaded into sqlite3 at Database init time
  • During searches, the query is normalized by calling the functions above, and is matched against the fields that require accent-insensitive matching (and which are normalized in the same way).
  • It looks like using the normalized match doesn't affect query performance in any perceptible way - I benchmarked on a library containing ~4000 books and db query execution time seems to remain roughly unchanged (2-3 ms per query).

Note regarding pkg-ed binaries: the unicode extension cannot be packaged into a pkg-ed binary, since it loaded by the sqlite3 native code, which doesn't have access to the pkg virtual file system, so it has to be downloaded as a dependency for pkg-ed binaries as well.

@mikiher mikiher marked this pull request as ready for review July 27, 2024 20:43
@advplyr
Copy link
Owner

advplyr commented Jul 28, 2024

I tested with the debian pkg on ubuntu 23 and it is working. I removed the ffmpeg install from the debian preinstall script. It will be much better to have all the install methods using the same ffmpeg source.

I'll mention in the release notes that the old location of the ffmpeg/ffprobe binaries at /usr/lib/audiobookshelf-ffmpeg will no longer be used. The version of ffmpeg that was being installed previously (at https://johnvansickle.com/ffmpeg/) won't match so unless someone manually swapped out those binaries they will just remain unused. I don't think we should remove those automatically.

Thanks for figuring all that out with sqlite extensions. Great stuff!

@advplyr advplyr merged commit 6183001 into advplyr:master Jul 28, 2024
4 checks passed
@mikiher mikiher deleted the unaccent branch July 29, 2024 08:49
@advplyr
Copy link
Owner

advplyr commented Aug 5, 2024

I realize that the docker images are going to download the binaries on every image start since they aren't getting mapped outside the container. Also the Dockerfile is already including ffmpeg/ffprobe but the latest version.

We can either:

  1. Update the Dockerfile to include unicode.so so it gets bundled in the image
  2. Remove ffmpeg from the Dockerfile and update the BinaryManager to download the binaries to the ConfigPath that we know is mapped outside the container.
  3. A hybrid approach where we only download the unicode.so in BinaryManager for docker.

It is nice to have the binaries already included in the docker image but it is also helpful to have everything use the same method of getting binaries.

I'm leaning towards 2 unless there is a better reason to package the binaries

@calvinbui
Copy link

I'm leaning towards 2 unless there is a better reason to package the binaries

Option 1 is the accepted practice, unless the file is optional.

@mikiher
Copy link
Contributor Author

mikiher commented Aug 6, 2024

I am obviously somewaht partial here, but let me try to list the pros and cons for 2 as I see them.

Pros:

  1. Full control over which versions of the binaries are being used, and where they're obtained from (for example, in our current Dockerfile, we have apk add ffmpeg, which installs ffmpeg 6.1.1 which I believe you intentionally removed from the valid versions list at some point)
  2. No dependency on setup, which is complex and different for each system
  3. Still retains the ability to override externally by setting the right environment variables

Cons:

  1. Somewhat complicating server startup procedure, and in case it needs to download, it lengthens it
  2. Might fail due to networking issues (though this can also happen during setup)
  3. Setup scripts are still required, so we now have two setup mechanisms, one external to the code and one internal

Also, If we go with option 1, I think we still need to obtain it from github in Dockerfile - sqlean is not in the apk repository, and I don't think we want to keep a copy of it in our repository (same goes for the other build scripts).

@advplyr
Copy link
Owner

advplyr commented Aug 8, 2024

I haven't gotten to look into this much yet but there have been a lot of reports of sqlite db corruption and the server becoming unresponsive #3251. I hope to be able to look into this later this afternoon (it's morning for me) but let me know if you have any thoughts

@mikiher
Copy link
Contributor Author

mikiher commented Aug 8, 2024 via email

@advplyr
Copy link
Owner

advplyr commented Aug 8, 2024

The other possibility I can think of is the switch to using MemoryStore for sessions

@mikiher
Copy link
Contributor Author

mikiher commented Aug 8, 2024 via email

@advplyr
Copy link
Owner

advplyr commented Aug 9, 2024

I was able to reproduce the SQLITE_CORRUPT reported in #3241

A user on Discord sent me their corrupt DB and narrowed it down to crashing when the title of specific books are updated. It doesn't seem like it is related to the actual title ("Console Wars") but just so happens that edit triggers the SQLITE_CORRUPT.

When not loading the unicode extension the db did not crash so I'm putting out a patch now with the extension disabled.

I couldn't find anything useful to track down the root cause of the crash. The sqlite db used in testing is only 2MB so I can send it to you if you want to test it.

@advplyr
Copy link
Owner

advplyr commented Aug 9, 2024

Oh and also the other issue happening with the server hanging up was the new MemoryStore that I patched out in v2.12.2.
There were 2 issues happening at the same time.

@mikiher
Copy link
Contributor Author

mikiher commented Aug 10, 2024

Thanks for tracking this, and sorry for causing this mess. Yes, please send me the db and the edit that caused the corrupt error.

So, do you think the db was corrupted before the edit, or was the edit causing the corruption?

Also, please send me the log up to the crash.

Which system was this running on?

@mikiher
Copy link
Contributor Author

mikiher commented Aug 10, 2024

After reading #3241 again, and the dicussion on discord, and re-reading the extension docs, and looking at the extension source code, I think I might have some idea of what's been happening...

I initially thought that the extension only added a couple of functions which I only used in SELECT queries, but then I read your comment that this happens specifically when updating the title of some book, and something clicked. I went back to the extension documentation, and saw the following line, which at the time I didn't think was relevant to what I was trying to achieve:

Tries to override the default NOCASE case-insensitive collation sequence to support UTF-8 characters (available in SQLite CLI and C API only).

But now that you mentioned the title, I rememberd that the books table had an index on title with COLLATE NOCASE

What I now suspect is that the new NOCASE collation doesn't work well (a euphemism for crashes) on indices built with the original collation. The issue probably manifests itself when you have some/many titles with with non-ascii characters, where indexing would likely be different due to the new collation.

And so when you try to do things that change the title column (e.g. setting a different title, deleting a book, etc.) , SQLite crashes when it tries to update the index. This might happen even when the title you modify doesn't itself contain non-ascii characters, depending on how the index update is implemented.

Looking at the bug, I also note that the original crash reported happened when trying to delete a series record, which also has a COLLATE NOCASE index on name

If all of the assumptions above are true, the solution would be to reindex the COLLATE NOCASE indices when first loading the extension, which will likely solve the issue. In fact, after doing that, for those columns that have NOCASE indices, we can just use unmodified LIKE expressions (without normalizing the query or the column) in searches, and they should just work.

In the meantime. while the extension is disabled, It would probably be a good idea to reindex all COLLATE NOCASE indices, in the next release, in case they were corrupted (this can be done by running the query REINDEX NOCASE)

I'm waiting for you to send me the db so I can run some tests on it to verify the above.

@advplyr
Copy link
Owner

advplyr commented Aug 10, 2024

Ah yeah that makes sense. I just sent you the broken db on discord.

const StreamZip = require('../nodeStreamZip')
const { finished } = require('stream/promises')

var API_URL = 'https://ffbinaries.com/api/v1'
Copy link
Contributor

@devnoname120 devnoname120 Aug 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikiher @advplyr Switching from https://ffbinaries.com/api/v1 to https://api.github.com/repos/${this.owner}/${this.repo}/releases/tags/${releaseTag} is risky. Unauthenticated GitHub's API calls are severely rate-limited and can cause the server to fail launching. See this issue: #3266

Copy link
Owner

@advplyr advplyr Aug 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rate limit is pretty strict.
https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28

Probably best to revert back to ffbinaries or we could host the binaries ourselves.

Docker builds should also not be installing ffmpeg/ffprobe since they are packaged in the image. Only reason they are is because the image includes ffmpeg/ffprobe v6.1 but the BinaryManager is expecting 5.1.

I updated the BinaryManager to 5.1 because of an error with ffprobe on 6.1 that I was able to reproduce on Windows #2689

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO for Docker the dependencies should always be bundled inside. This is IMO one of the main points of using a Docker image: make running containers more predictable.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree. I'm testing ffmpeg 5.1 in the Dockerfile now

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ffprobe 6.1 issue with those mp3 files mentioned in #2689 is specific to Windows in my testing. Since it is not easy to include an older version of a package with alpine linux I think I'll just add a skip for the BinaryManager for docker installs for now.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted to note, though, that even in the current docker setup, binaries aren't really bundled, but installed through apk (which probably doesn't have the rate-limiting issue, but is still downloaded over the network).

This only happens when the image is built though. Once it is built the users system doesn't run it.

Docker build can stay with ffmpeg/ffprobe v6.1. It is the Windows built that had the issue with 6.1.
It would be nice to have them all use the same version but it is not easy to install older alpine linux packages that I could see so I was thinking of just having the binary manager skipped for docker.

Copy link
Contributor Author

@mikiher mikiher Aug 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for speaking out of ignorance.
Then I concede that BinaryManager is not suitable for docker installations (at least not as part of server init).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set it up to skip the binary manager if the source is "docker" that is set in index.js. That should do it

Copy link
Contributor

@devnoname120 devnoname120 Aug 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikiher I apologize for the tone of my first comment, I reworded it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, I'm using untone() so I'm tone-insensitive...😁

@mikiher
Copy link
Contributor Author

mikiher commented Aug 10, 2024

Ah yeah that makes sense. I just sent you the broken db on discord.

Thanks @advplyr. Do you also have the log of the SQLite failure? on which system were you able to reproduce it?

@advplyr
Copy link
Owner

advplyr commented Aug 10, 2024

It is the same SQLITE_CORRUPT error in #3241. I reproduced it on Windows running locally.

First I opened the db and NULLed the root user password so I could login. Once logging into root with no password I search for "Console Wars". There are 2 books in the books library starting with "Console Wars" and one book in the audiobook library. If you edit the title in Abs of any of those books it will crash with SQLITE_CORRUPT.

@mikiher
Copy link
Contributor Author

mikiher commented Aug 11, 2024

OK, I can confirm that was able to reproduce the crash locally using the DB you sent.

I can also confirm that after adding the REINDEX NOCASE query right after loading the extension, the crash doesn't happen anymore, and I was not able to cause it to crash in any other way. I think this mostly confirms my suspicions on the root cause of the crash. I will work on a fix.

@MaximUltimatum
Copy link

I'm not sure if it's too late to weigh in on the discussion about the ffmpeg binaries as a non-contributor (thank you for all the work you folks do) is that I'd prefer they be packaged with the container image. I always lock my dependencies to a version at work, and in this case I had to downgrade after my image failed to start due to the unauthenticated rate-limiting after upgrade

@mikiher
Copy link
Contributor Author

mikiher commented Aug 14, 2024

This is already done. On Docker we no longer try to download binaries at runtime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enhancement]: Diacritic-insensitive search
5 participants