Use SQLite for parts database #99

Open · wants to merge 3 commits into master
Conversation

@flaviut (Contributor) commented Oct 1, 2023

The intent here is to use this with https://github.com/phiresky/sql.js-httpvfs. There are major missing pieces:

  • the Python SQLite on my system doesn't have the FTS5 extension available, so while sql.js-httpvfs theoretically supports full-text search, it isn't usable here
  • the sql.js-httpvfs VFS needs to be modified to download the index file, map between the index & the gzipped data, and use the Compression Streams API to decompress each chunk (a rough sketch of that step follows below)
  • no work was done on the frontend portion of jlcparts
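
For reference, a minimal sketch of that chunk-decompression step in the browser, assuming a hypothetical index entry that records each page's byte offset and compressed length inside the gzipped parts file (the names and layout here are illustrative, not the actual jlcparts format):

```ts
// Hypothetical shape of one entry in the downloaded index file.
interface ChunkIndexEntry {
  offset: number;           // byte offset of the compressed chunk in the parts file
  compressedLength: number; // compressed size of the chunk in bytes
}

// Fetch one compressed chunk with an HTTP range request and inflate it using
// the Compression Streams API (which handles gzip/deflate; zstd would need a
// WASM decoder instead).
async function fetchPage(url: string, entry: ChunkIndexEntry): Promise<Uint8Array> {
  const end = entry.offset + entry.compressedLength - 1;
  const res = await fetch(url, {
    headers: { Range: `bytes=${entry.offset}-${end}` },
  });
  if (!res.ok || !res.body) {
    throw new Error(`range request failed: ${res.status}`);
  }
  const decompressed = res.body.pipeThrough(new DecompressionStream("gzip"));
  return new Uint8Array(await new Response(decompressed).arrayBuffer());
}
```

Each chunk has to be an independent gzip member for this to work, which matches the per-page compression scheme benchmarked below.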

Quick benchmarks:

| algo | page size (KiB) | zstd dict (KiB) | compressed (MiB) | uncompressed (MiB) | ratio (%) | cpu (s) | compression cpu (s) |
|------|-----------------|-----------------|------------------|--------------------|-----------|---------|---------------------|
| zstd | 1  | 8   | 111 | 537 | 21 | 99   | 45   |
| zstd | 2  | 8   | 96  | 566 | 17 | 82   | 28   |
| zstd | 4  | 8   | 74  | 634 | 12 | 74   | 20   |
| zstd | 8  | 8   | 54  | 545 | 10 | 64   | 10   |
| zstd | 16 | 8   | 42  | 513 | 8  | 62   | 8    |
| zstd | 1  | 16  | 104 | 537 | 19 | 130  | 76   |
| zstd | 2  | 16  | 89  | 566 | 16 | 98   | 44   |
| zstd | 4  | 16  | 71  | 634 | 11 | 81   | 27   |
| zstd | 8  | 16  | 51  | 545 | 9  | 68   | 14   |
| zstd | 16 | 16  | 41  | 513 | 8  | 62   | 8    |
| zstd | 1  | 32  | 98  | 537 | 18 | 205  | 151  |
| zstd | 2  | 32  | 80  | 566 | 14 | 139  | 85   |
| zstd | 4  | 32  | 68  | 634 | 11 | 103  | 49   |
| zstd | 8  | 32  | 49  | 545 | 9  | 78   | 24   |
| zstd | 16 | 32  | 40  | 513 | 8  | 67   | 13   |
| zstd | 1  | 64  | 87  | 537 | 16 | 313  | 259  |
| zstd | 2  | 64  | 73  | 566 | 13 | 202  | 148  |
| zstd | 4  | 64  | 59  | 634 | 9  | 136  | 82   |
| zstd | 8  | 64  | 46  | 545 | 8  | 93   | 39   |
| zstd | 16 | 64  | 38  | 513 | 7  | 75   | 21   |
| zstd | 1  | 128 | 85  | 537 | 16 | 2696 | 2642 |
| zstd | 2  | 128 | 71  | 566 | 13 | 1452 | 1398 |
| zstd | 4  | 128 | 58  | 634 | 9  | 765  | 711  |
| zstd | 8  | 128 | 46  | 545 | 8  | 392  | 338  |
| zstd | 16 | 128 | 38  | 513 | 7  | 206  | 152  |
| zstd | 1  | 256 | 84  | 537 | 16 | 4318 | 4264 |
| zstd | 2  | 256 | 70  | 566 | 12 | 2101 | 2047 |
| zstd | 4  | 256 | 58  | 634 | 9  | 1115 | 1061 |
| zstd | 8  | 256 | 51  | 545 | 9  | 841  | 787  |
| zstd | 16 | 256 | 55  | 513 | 11 | 60   | 6    |
| none | 1  |     |     | 537 |    | 54   | 0    |
| none | 2  |     |     | 566 |    | 53   | 0    |
| none | 4  |     |     | 634 |    | 54   | 0    |
| none | 8  |     |     | 545 |    | 53   | 0    |
| none | 16 |     |     | 513 |    | 54   | 0    |
| gzip | 1  |     | 198 | 537 | 37 | 83   | 29   |
| gzip | 2  |     | 164 | 566 | 29 | 69   | 15   |
| gzip | 4  |     | 122 | 634 | 19 | 66   | 12   |
| gzip | 8  |     | 78  | 545 | 14 | 61   | 7    |
| gzip | 16 |     | 56  | 513 | 11 | 62   | 8    |

note: the CPU & time numbers aren't very meaningful yet, because I have not tested read performance.

See #37

Don't commit:

- cached parts database
- IntelliJ IDE files
@flaviut (Contributor, Author) commented Oct 2, 2023

I've updated the above with some more data.

Generated with this test script:
#!/bin/zsh

# Benchmark zstd at several dictionary sizes and page sizes
for dict_size in 8192 16384 32768 65536 131072 262144
do
    for page_size in 1024 2048 4096 8192 16384
    do
        time jlcparts buildtables --jobs 16 --ignoreoldstock 30 --pagesize "$page_size" --compression-algorithm zstd --dictsize "$dict_size" cache.sqlite3 web/public/data > /dev/null
        sqlite_size=$(ls -lh web/public/data/index.sqlite3 | awk '{print $5}')
        index_size=$(ls -lh web/public/data/index.sqlite3.zstindex | awk '{print $5}')
        parts_size=$(ls -lh web/public/data/index.sqlite3.zstpart | awk '{print $5}')
        echo "zstd, dict_size=${dict_size} page_size=${page_size}: (sqlite3=${sqlite_size}, index=${index_size}, parts=${parts_size})"
    done
done

# Benchmark gzip at the same page sizes (no dictionary)
for page_size in 1024 2048 4096 8192 16384
do
    time jlcparts buildtables --jobs 16 --ignoreoldstock 30 --pagesize "$page_size" --compression-algorithm gzip cache.sqlite3 web/public/data > /dev/null
    sqlite_size=$(ls -lh web/public/data/index.sqlite3 | awk '{print $5}')
    index_size=$(ls -lh web/public/data/index.sqlite3.gzindex | awk '{print $5}')
    parts_size=$(ls -lh web/public/data/index.sqlite3.gzpart | awk '{print $5}')
    echo "gzip, page_size=${page_size}: (sqlite3=${sqlite_size}, index=${index_size}, parts=${parts_size})"
done

# Baseline: no compression
for page_size in 1024 2048 4096 8192 16384
do
    time jlcparts buildtables --jobs 16 --ignoreoldstock 30 --pagesize "$page_size" --compression-algorithm none cache.sqlite3 web/public/data > /dev/null
    sqlite_size=$(ls -lh web/public/data/index.sqlite3 | awk '{print $5}')
    echo "none, page_size=${page_size}: (sqlite3=${sqlite_size})"
done

data:

  • algo=zstd, dict_size=8192, page_size=1024, sqlite3=537M, index=4.2M, parts=107M, user=98.74s, system=44.37s, cpu=490%, total=29.195
  • algo=zstd, dict_size=8192, page_size=2048, sqlite3=566M, index=2.3M, parts=94M, user=81.53s, system=38.07s, cpu=460%, total=25.964
  • algo=zstd, dict_size=8192, page_size=4096, sqlite3=634M, index=1.3M, parts=73M, user=73.70s, system=33.54s, cpu=424%, total=25.261
  • algo=zstd, dict_size=8192, page_size=8192, sqlite3=545M, index=553K, parts=53M, user=64.36s, system=29.37s, cpu=432%, total=21.670
  • algo=zstd, dict_size=8192, page_size=16384, sqlite3=513M, index=265K, parts=42M, user=61.74s, system=27.98s, cpu=421%, total=21.286
  • algo=zstd, dict_size=16384, page_size=1024, sqlite3=537M, index=4.3M, parts=100M, user=130.27s, system=46.43s, cpu=585%, total=30.160
  • algo=zstd, dict_size=16384, page_size=2048, sqlite3=566M, index=2.3M, parts=87M, user=97.57s, system=38.20s, cpu=519%, total=26.115
  • algo=zstd, dict_size=16384, page_size=4096, sqlite3=634M, index=1.3M, parts=70M, user=81.39s, system=33.74s, cpu=481%, total=23.913
  • algo=zstd, dict_size=16384, page_size=8192, sqlite3=545M, index=561K, parts=50M, user=67.96s, system=29.74s, cpu=448%, total=21.759
  • algo=zstd, dict_size=16384, page_size=16384, sqlite3=513M, index=273K, parts=41M, user=62.47s, system=27.97s, cpu=434%, total=20.835
  • algo=zstd, dict_size=32768, page_size=1024, sqlite3=537M, index=4.3M, parts=94M, user=204.76s, system=51.62s, cpu=727%, total=35.258
  • algo=zstd, dict_size=32768, page_size=2048, sqlite3=566M, index=2.3M, parts=78M, user=138.59s, system=40.91s, cpu=618%, total=29.011
  • algo=zstd, dict_size=32768, page_size=4096, sqlite3=634M, index=1.3M, parts=67M, user=103.12s, system=35.89s, cpu=541%, total=25.678
  • algo=zstd, dict_size=32768, page_size=8192, sqlite3=545M, index=577K, parts=48M, user=78.08s, system=30.13s, cpu=486%, total=22.225
  • algo=zstd, dict_size=32768, page_size=16384, sqlite3=513M, index=289K, parts=40M, user=67.19s, system=28.27s, cpu=448%, total=21.289
  • algo=zstd, dict_size=65536, page_size=1024, sqlite3=537M, index=4.3M, parts=83M, user=312.52s, system=55.79s, cpu=865%, total=42.561
  • algo=zstd, dict_size=65536, page_size=2048, sqlite3=566M, index=2.3M, parts=71M, user=201.91s, system=43.34s, cpu=733%, total=33.439
  • algo=zstd, dict_size=65536, page_size=4096, sqlite3=634M, index=1.4M, parts=58M, user=135.94s, system=36.79s, cpu=624%, total=27.670
  • algo=zstd, dict_size=65536, page_size=8192, sqlite3=545M, index=609K, parts=45M, user=92.84s, system=31.08s, cpu=537%, total=23.035
  • algo=zstd, dict_size=65536, page_size=16384, sqlite3=513M, index=321K, parts=38M, user=74.90s, system=28.70s, cpu=480%, total=21.552
  • algo=zstd, dict_size=131072, page_size=1024, sqlite3=537M, index=4.4M, parts=81M, user=2696.38s, system=118.30s, cpu=1388%, total=3:22.65
  • algo=zstd, dict_size=131072, page_size=2048, sqlite3=566M, index=2.4M, parts=69M, user=1451.82s, system=77.68s, cpu=1297%, total=1:57.84
  • algo=zstd, dict_size=131072, page_size=4096, sqlite3=634M, index=1.4M, parts=57M, user=765.11s, system=55.65s, cpu=1173%, total=1:09.96
  • algo=zstd, dict_size=131072, page_size=8192, sqlite3=545M, index=673K, parts=45M, user=392.49s, system=38.96s, cpu=997%, total=43.261
  • algo=zstd, dict_size=131072, page_size=16384, sqlite3=513M, index=385K, parts=38M, user=205.56s, system=34.54s, cpu=795%, total=30.165
  • algo=zstd, dict_size=262144, page_size=1024, sqlite3=537M, index=4.5M, parts=79M, user=4317.83s, system=129.94s, cpu=1389%, total=5:20.08
  • algo=zstd, dict_size=262144, page_size=2048, sqlite3=566M, index=2.5M, parts=67M, user=2100.99s, system=80.78s, cpu=1330%, total=2:43.97
  • algo=zstd, dict_size=262144, page_size=4096, sqlite3=634M, index=1.5M, parts=56M, user=1114.82s, system=58.12s, cpu=1193%, total=1:38.27
  • algo=zstd, dict_size=262144, page_size=8192, sqlite3=545M, index=791K, parts=50M, user=840.95s, system=44.49s, cpu=1171%, total=1:15.61
  • algo=zstd, dict_size=262144, page_size=16384, sqlite3=513M, index=257K, parts=55M, user=59.91s, system=28.42s, cpu=416%, total=21.187
  • algo=none, page_size=1024, sqlite3=537M, user=54.11s, system=28.21s, cpu=357%, total=23.002
  • algo=none, page_size=2048, sqlite3=566M, user=53.32s, system=27.39s, cpu=371%, total=21.723
  • algo=none, page_size=4096, sqlite3=634M, user=53.80s, system=27.15s, cpu=377%, total=21.421
  • algo=none, page_size=8192, sqlite3=545M, user=53.45s, system=25.91s, cpu=397%, total=19.955
  • algo=none, page_size=16384, sqlite3=513M, user=53.69s, system=25.61s, cpu=406%, total=19.505
  • algo=gzip, page_size=1024, sqlite3=537M, index=4.2M, parts=194M, user=83.26s, system=38.45s, cpu=242%, total=50.192
  • algo=gzip, page_size=2048, sqlite3=566M, index=2.3M, parts=162M, user=69.16s, system=30.83s, cpu=279%, total=35.818
  • algo=gzip, page_size=4096, sqlite3=634M, index=1.3M, parts=121M, user=65.55s, system=29.80s, cpu=298%, total=31.991
  • algo=gzip, page_size=8192, sqlite3=545M, index=545K, parts=77M, user=60.83s, system=27.76s, cpu=324%, total=27.329
  • algo=gzip, page_size=16384, sqlite3=513M, index=257K, parts=56M, user=62.02s, system=27.45s, cpu=321%, total=27.828
| algo | cpu | dict_size (bytes) | index (MiB) | page_size (bytes) | parts (MiB) | sqlite3 (MiB) | system (s) | total (s) | user (s) | page size (KiB) | zstd dict (KiB) |
|------|------|---------|-----|-------|-----|-----|-----|-----|------|----|-----|
| zstd | 490%  | 8192   | 4.2 | 1024  | 107 | 537 | 44  | 29  | 99   | 1  | 8   |
| zstd | 460%  | 8192   | 2.3 | 2048  | 94  | 566 | 38  | 26  | 82   | 2  | 8   |
| zstd | 424%  | 8192   | 1.3 | 4096  | 73  | 634 | 34  | 25  | 74   | 4  | 8   |
| zstd | 432%  | 8192   | 0.5 | 8192  | 53  | 545 | 29  | 22  | 64   | 8  | 8   |
| zstd | 421%  | 8192   | 0.3 | 16384 | 42  | 513 | 28  | 21  | 62   | 16 | 8   |
| zstd | 585%  | 16384  | 4.3 | 1024  | 100 | 537 | 46  | 30  | 130  | 1  | 16  |
| zstd | 519%  | 16384  | 2.3 | 2048  | 87  | 566 | 38  | 26  | 98   | 2  | 16  |
| zstd | 481%  | 16384  | 1.3 | 4096  | 70  | 634 | 34  | 24  | 81   | 4  | 16  |
| zstd | 448%  | 16384  | 0.5 | 8192  | 50  | 545 | 30  | 22  | 68   | 8  | 16  |
| zstd | 434%  | 16384  | 0.3 | 16384 | 41  | 513 | 28  | 21  | 62   | 16 | 16  |
| zstd | 727%  | 32768  | 4.3 | 1024  | 94  | 537 | 52  | 35  | 205  | 1  | 32  |
| zstd | 618%  | 32768  | 2.3 | 2048  | 78  | 566 | 41  | 29  | 139  | 2  | 32  |
| zstd | 541%  | 32768  | 1.3 | 4096  | 67  | 634 | 36  | 26  | 103  | 4  | 32  |
| zstd | 486%  | 32768  | 0.6 | 8192  | 48  | 545 | 30  | 22  | 78   | 8  | 32  |
| zstd | 448%  | 32768  | 0.3 | 16384 | 40  | 513 | 28  | 21  | 67   | 16 | 32  |
| zstd | 865%  | 65536  | 4.3 | 1024  | 83  | 537 | 56  | 43  | 313  | 1  | 64  |
| zstd | 733%  | 65536  | 2.3 | 2048  | 71  | 566 | 43  | 33  | 202  | 2  | 64  |
| zstd | 624%  | 65536  | 1.4 | 4096  | 58  | 634 | 37  | 28  | 136  | 4  | 64  |
| zstd | 537%  | 65536  | 0.6 | 8192  | 45  | 545 | 31  | 23  | 93   | 8  | 64  |
| zstd | 480%  | 65536  | 0.3 | 16384 | 38  | 513 | 29  | 22  | 75   | 16 | 64  |
| zstd | 1388% | 131072 | 4.4 | 1024  | 81  | 537 | 118 | 203 | 2696 | 1  | 128 |
| zstd | 1297% | 131072 | 2.4 | 2048  | 69  | 566 | 78  | 118 | 1452 | 2  | 128 |
| zstd | 1173% | 131072 | 1.4 | 4096  | 57  | 634 | 56  | 70  | 765  | 4  | 128 |
| zstd | 997%  | 131072 | 0.7 | 8192  | 45  | 545 | 39  | 43  | 392  | 8  | 128 |
| zstd | 795%  | 131072 | 0.4 | 16384 | 38  | 513 | 35  | 30  | 206  | 16 | 128 |
| zstd | 1389% | 262144 | 4.5 | 1024  | 79  | 537 | 130 | 320 | 4318 | 1  | 256 |
| zstd | 1330% | 262144 | 2.5 | 2048  | 67  | 566 | 81  | 164 | 2101 | 2  | 256 |
| zstd | 1193% | 262144 | 1.5 | 4096  | 56  | 634 | 58  | 98  | 1115 | 4  | 256 |
| zstd | 1171% | 262144 | 0.8 | 8192  | 50  | 545 | 44  | 76  | 841  | 8  | 256 |
| zstd | 416%  | 262144 | 0.3 | 16384 | 55  | 513 | 28  | 21  | 60   | 16 | 256 |
| none | 357%  |        |     | 1024  |     | 537 | 28  | 23  | 54   | 1  |     |
| none | 371%  |        |     | 2048  |     | 566 | 27  | 22  | 53   | 2  |     |
| none | 377%  |        |     | 4096  |     | 634 | 27  | 21  | 54   | 4  |     |
| none | 397%  |        |     | 8192  |     | 545 | 26  | 20  | 53   | 8  |     |
| none | 406%  |        |     | 16384 |     | 513 | 26  | 20  | 54   | 16 |     |
| gzip | 242%  |        | 4.2 | 1024  | 194 | 537 | 38  | 50  | 83   | 1  |     |
| gzip | 279%  |        | 2.3 | 2048  | 162 | 566 | 31  | 36  | 69   | 2  |     |
| gzip | 298%  |        | 1.3 | 4096  | 121 | 634 | 30  | 32  | 66   | 4  |     |
| gzip | 324%  |        | 0.5 | 8192  | 77  | 545 | 28  | 27  | 61   | 8  |     |
| gzip | 321%  |        | 0.3 | 16384 | 56  | 513 | 27  | 28  | 62   | 16 |     |

NOTE: these times are compression times, not decompression times.

Seems to me like zstd with a 64 KiB dict & 8 KiB pages is likely the best tradeoff here. zstd has good decompression speed across a range of compression levels: https://github.com/facebook/zstd#benchmarks

@yaqwsx (Owner) commented Oct 3, 2023

Thank you for your input. I really appreciate it. I thought about using SQLite for a while, but I never found the time (and motivation) to do it. Let me share my findings:

  • Originally, I wanted to serve the SQLite database over HTTPVFS. After doing several tests I abandoned the idea - most users use the full-text feature, and it is very easy to run into one of the worst-case complexity scenarios, which leads to fetching a non-trivial part of the database while using the site. Thus, I think a better solution is to store a compressed database in local storage and use SQLite over a compressed filesystem. The usage of the site will remain roughly the same - the user will be notified about a new database which they can fetch. The fetching will be fairly quick, as we just need to download the image and do nothing else (96% of the current update time is spent updating entries in IndexedDB).
  • Next, I was thinking about a suitable DB schema. Careful design can save a lot of database size. While the representation of the categories is trivial, I struggled with a suitable design for the attributes, as they are non-uniform. I haven't found a design that I am satisfied with. I would have to dig up my old notes to give you more details.
  • I want to connect this huge change with a rewrite of the frontend into React functional components and a refactoring. This project was my first project in React, and the frontend is quite messy. I would also like to encode the current search query into the URL so users can use bookmarks as shortcuts (see the sketch after this list).
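
As a small illustration of the last point, a minimal sketch of round-tripping search state through the URL with standard browser APIs (the parameter names and the `SearchState` shape are made up for the example):

```ts
// Hypothetical search state; the real app would have more fields.
interface SearchState {
  category?: string;
  query?: string;
}

// Write the current search state into the URL without reloading the page.
function saveStateToUrl(state: SearchState): void {
  const params = new URLSearchParams();
  if (state.category) params.set("category", state.category);
  if (state.query) params.set("q", state.query);
  const url = `${window.location.pathname}?${params.toString()}`;
  window.history.replaceState(null, "", url);
}

// Restore the search state from the URL, e.g. when a bookmarked link is opened.
function loadStateFromUrl(): SearchState {
  const params = new URLSearchParams(window.location.search);
  return {
    category: params.get("category") ?? undefined,
    query: params.get("q") ?? undefined,
  };
}
```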

@flaviut (Contributor, Author) commented Oct 3, 2023

> The usage of the site will remain roughly the same - the user will be notified about a new database which they can fetch

I find this part of the site particularly frustrating. I don't like having to constantly refresh all the data and think about it; I'd rather that just be handled seamlessly in the background.

> most users use the full-text feature, and it is very easy to run into one of the worst-case complexity scenarios, which leads to fetching a non-trivial part of the database

Interesting. The full-text feature definitely needs to be investigated further. I don't know about SQLite, but PostgreSQL has things like covering indexes and index-only scans. I wonder if it'd be possible to get something similar in SQLite FTS.

I'm not familiar with the existing code, but there are also plenty of knobs around tokenizing the query, stop words, throttling delay, limiting result counts, etc.
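
For instance, a minimal sketch of two of those knobs, debouncing keystrokes and capping result counts, assuming a hypothetical `runQuery` helper that executes SQL against the remote database and a hypothetical `parts_fts` FTS5 table:

```ts
// Hypothetical function provided elsewhere (e.g. by the sql.js-httpvfs layer).
declare function runQuery(sql: string, params: unknown[]): Promise<unknown[]>;

// Turn raw user input into an FTS5 prefix query: `res 10k` -> `"res"* "10k"*`.
function toFtsQuery(input: string): string {
  return input
    .trim()
    .split(/\s+/)
    .filter((t) => t.length > 0)
    .map((t) => `"${t.replace(/"/g, '""')}"*`)
    .join(" ");
}

// Debounce searches so a query only runs once typing pauses for 300 ms,
// and cap the number of returned rows.
let timer: ReturnType<typeof setTimeout> | undefined;
function searchDebounced(input: string, onResults: (rows: unknown[]) => void): void {
  if (timer !== undefined) clearTimeout(timer);
  timer = setTimeout(async () => {
    const rows = await runQuery(
      "SELECT rowid FROM parts_fts WHERE parts_fts MATCH ? LIMIT 50",
      [toFtsQuery(input)]
    );
    onResults(rows);
  }, 300);
}
```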

If you have your old notes, I'd be interested in reading them.

> Thus, I think a better solution is to store a compressed database in local storage and use SQLite over a compressed filesystem. The fetching will be fairly quick, as we just need to download the image and do nothing else (96% of the current update time is spent updating entries in IndexedDB).

I'm not opposed to this, but I don't see why it couldn't be both: fetch chunks only as they are needed, and cache them locally. In fact, I wonder if web browsers already handle this kind of caching internally these days.
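
To make that concrete, here is a minimal sketch of per-chunk caching with the browser Cache API, assuming a hypothetical `fetchChunk` function that does the ranged download and decompression (each chunk is keyed by a synthetic URL, since `cache.match` does not interpret Range headers by itself):

```ts
// Hypothetical function that downloads and decompresses one chunk.
declare function fetchChunk(chunkIndex: number): Promise<Uint8Array>;

async function getChunkCached(chunkIndex: number): Promise<Uint8Array> {
  const cache = await caches.open("jlcparts-db-chunks-v1");
  const key = `/__chunk_cache__/${chunkIndex}`;

  // Return the locally cached copy if we already fetched this chunk.
  const hit = await cache.match(key);
  if (hit) {
    return new Uint8Array(await hit.arrayBuffer());
  }

  // Otherwise fetch it, store a copy of the bytes, and return it.
  const data = await fetchChunk(chunkIndex);
  await cache.put(key, new Response(data.slice()));
  return data;
}
```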

> Next, I was thinking about a suitable DB schema. Careful design can save a lot of database size. While the representation of the categories is trivial, I struggled with a suitable design for the attributes, as they are non-uniform.

I stuck with the existing IndexedDB schema design in my testing. But yes, there is plenty to look at in terms of structuring the data so it can be queried easily.

> I want to connect this huge change with a rewrite of the frontend into React functional components and a refactoring.

I've often felt this same temptation to combine two major changes into one 😃. I've found it ends up more motivating and more efficient for me to make changes in smaller chunks, even if at times it seems like I'm doing work that I will soon replace.

@phiresky commented Nov 9, 2023

There are many config options for full-text search with SQLite. I don't remember if I wrote my research about this down somewhere, but you can reduce the FTS size by >90% by setting detail=none and making the table contentless (https://www.sqlite.org/fts5.html#the_detail_option). It does significantly reduce the power of the queries you can run, though.
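
A minimal sketch of what that could look like with sql.js, assuming a build with FTS5 enabled (the table and column names are illustrative):

```ts
import initSqlJs from "sql.js";

async function buildSmallFtsIndex(): Promise<void> {
  const SQL = await initSqlJs();
  const db = new SQL.Database();

  // Hypothetical main table holding the part data.
  db.run("CREATE TABLE parts (id INTEGER PRIMARY KEY, description TEXT)");

  // Contentless (content='') FTS5 table with detail=none: much smaller on disk,
  // but no phrase or proximity queries, and matched text must be fetched from
  // the parts table via the rowid.
  db.run(
    "CREATE VIRTUAL TABLE parts_fts USING fts5(description, content='', detail=none)"
  );

  db.run("INSERT INTO parts(id, description) VALUES (1, '0603 10k resistor')");
  db.run("INSERT INTO parts_fts(rowid, description) VALUES (1, '0603 10k resistor')");

  // The FTS table only yields rowids; join back to `parts` for the details.
  const result = db.exec(
    "SELECT p.id, p.description FROM parts_fts f JOIN parts p ON p.id = f.rowid WHERE parts_fts MATCH 'resistor'"
  );
  console.log(result);
}
```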

Also note that SQLite is very easy to compile, and there are also drop-in Python packages to get a newer and more complete SQLite into Python.

There's one powerful statically hosted full-text search engine I know of that scales to a terabyte of data, called Summa, but for an index with a compressed size of 50 MB it's probably not worth it.

There are a few other JS libraries for building a keyword / full-text search index that you serialize to JSON and download in full (at a smaller size), and which you could then use in combination with SQLite or something else to fetch the full data dynamically. One example (I think) is https://github.com/nextapps-de/flexsearch
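
A minimal sketch of that pattern, assuming FlexSearch's 0.7-style Index API (the ids and fields are illustrative; in practice the index would be built offline, serialized, and loaded in the browser):

```ts
import FlexSearch from "flexsearch";

// Build a small in-memory keyword index; the ids point back into the SQLite data.
const index = new FlexSearch.Index({ tokenize: "forward" });

index.add(1, "0603 10k resistor 1% thick film");
index.add(2, "0805 100n ceramic capacitor X7R");

// Returns the ids of matching parts; the full rows would then be fetched
// from SQLite (e.g. over sql.js-httpvfs) using those ids.
const ids = index.search("10k res", 20);
console.log(ids);
```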

@phiresky commented Nov 9, 2023

Also, just as a note if you have too much free time: if you download the whole DB anyway, you can alternatively create a minimal DB without indexes and without FTS, download that, and create the indexes and FTS search locally, trading bandwidth for local compute.
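
A minimal sketch of that approach with sql.js, assuming the downloaded file contains a hypothetical `parts` table and an FTS5-enabled build:

```ts
import initSqlJs from "sql.js";

// Download a minimal database (no indexes, no FTS) and build both client-side.
async function loadAndIndex(url: string) {
  const SQL = await initSqlJs();
  const bytes = new Uint8Array(await (await fetch(url)).arrayBuffer());
  const db = new SQL.Database(bytes);

  // Ordinary indexes for common filter columns (column names are illustrative).
  db.run("CREATE INDEX IF NOT EXISTS idx_parts_category ON parts(category_id)");
  db.run("CREATE INDEX IF NOT EXISTS idx_parts_stock ON parts(stock)");

  // Full-text index over the shipped data, built locally in the browser.
  db.run(
    "CREATE VIRTUAL TABLE IF NOT EXISTS parts_fts USING fts5(description, content='parts', content_rowid='id')"
  );
  db.run("INSERT INTO parts_fts(rowid, description) SELECT id, description FROM parts");

  return db;
}
```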

@yaqwsx force-pushed the master branch 2 times, most recently from 3dc9b5b to bb8f331 on January 7, 2024.