Use SQLite for parts database #99

Open · wants to merge 3 commits into master
Conversation

@flaviut (Contributor) commented Oct 1, 2023

The intent here is to use this with https://github.com/phiresky/sql.js-httpvfs. There are major missing pieces:

  • the Python SQLite on my system doesn't have the FTS5 extension available, so while sql.js-httpvfs theoretically supports full-text search, it isn't usable here
  • the sql.js-httpvfs VFS needs to be modified to download the index file, map between the index & the gzipped data, and use the Compression Streams API to decompress each chunk (a rough sketch of that step follows below)
  • no work was done on the frontend portion of jlcparts
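
For reference, a minimal sketch of that chunk-decompression step in the browser, assuming a hypothetical index entry that records each page's byte offset and compressed length inside the gzipped parts file (the names and layout here are illustrative, not the actual jlcparts format):

```ts
// Hypothetical shape of one entry in the downloaded index file.
interface ChunkIndexEntry {
  offset: number;           // byte offset of the compressed chunk in the parts file
  compressedLength: number; // compressed size of the chunk in bytes
}

// Fetch one compressed chunk with an HTTP range request and inflate it using
// the Compression Streams API (which handles gzip/deflate; zstd would need a
// WASM decoder instead).
async function fetchPage(url: string, entry: ChunkIndexEntry): Promise<Uint8Array> {
  const end = entry.offset + entry.compressedLength - 1;
  const res = await fetch(url, {
    headers: { Range: `bytes=${entry.offset}-${end}` },
  });
  if (!res.ok || !res.body) {
    throw new Error(`range request failed: ${res.status}`);
  }
  const decompressed = res.body.pipeThrough(new DecompressionStream("gzip"));
  return new Uint8Array(await new Response(decompressed).arrayBuffer());
}
```

Each chunk has to be an independent gzip member for this to work, which matches the per-page compression scheme benchmarked below.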

Quick benchmarks:

| algo | page size (KiB) | zstd dict (KiB) | compressed (MiB) | uncompressed (MiB) | ratio (%) | cpu (s) | compression cpu (s) |
|------|-----------------|-----------------|------------------|--------------------|-----------|---------|---------------------|
| zstd | 1  | 8   | 111 | 537 | 21 | 99   | 45   |
| zstd | 2  | 8   | 96  | 566 | 17 | 82   | 28   |
| zstd | 4  | 8   | 74  | 634 | 12 | 74   | 20   |
| zstd | 8  | 8   | 54  | 545 | 10 | 64   | 10   |
| zstd | 16 | 8   | 42  | 513 | 8  | 62   | 8    |
| zstd | 1  | 16  | 104 | 537 | 19 | 130  | 76   |
| zstd | 2  | 16  | 89  | 566 | 16 | 98   | 44   |
| zstd | 4  | 16  | 71  | 634 | 11 | 81   | 27   |
| zstd | 8  | 16  | 51  | 545 | 9  | 68   | 14   |
| zstd | 16 | 16  | 41  | 513 | 8  | 62   | 8    |
| zstd | 1  | 32  | 98  | 537 | 18 | 205  | 151  |
| zstd | 2  | 32  | 80  | 566 | 14 | 139  | 85   |
| zstd | 4  | 32  | 68  | 634 | 11 | 103  | 49   |
| zstd | 8  | 32  | 49  | 545 | 9  | 78   | 24   |
| zstd | 16 | 32  | 40  | 513 | 8  | 67   | 13   |
| zstd | 1  | 64  | 87  | 537 | 16 | 313  | 259  |
| zstd | 2  | 64  | 73  | 566 | 13 | 202  | 148  |
| zstd | 4  | 64  | 59  | 634 | 9  | 136  | 82   |
| zstd | 8  | 64  | 46  | 545 | 8  | 93   | 39   |
| zstd | 16 | 64  | 38  | 513 | 7  | 75   | 21   |
| zstd | 1  | 128 | 85  | 537 | 16 | 2696 | 2642 |
| zstd | 2  | 128 | 71  | 566 | 13 | 1452 | 1398 |
| zstd | 4  | 128 | 58  | 634 | 9  | 765  | 711  |
| zstd | 8  | 128 | 46  | 545 | 8  | 392  | 338  |
| zstd | 16 | 128 | 38  | 513 | 7  | 206  | 152  |
| zstd | 1  | 256 | 84  | 537 | 16 | 4318 | 4264 |
| zstd | 2  | 256 | 70  | 566 | 12 | 2101 | 2047 |
| zstd | 4  | 256 | 58  | 634 | 9  | 1115 | 1061 |
| zstd | 8  | 256 | 51  | 545 | 9  | 841  | 787  |
| zstd | 16 | 256 | 55  | 513 | 11 | 60   | 6    |
| none | 1  |     |     | 537 |    | 54   | 0    |
| none | 2  |     |     | 566 |    | 53   | 0    |
| none | 4  |     |     | 634 |    | 54   | 0    |
| none | 8  |     |     | 545 |    | 53   | 0    |
| none | 16 |     |     | 513 |    | 54   | 0    |
| gzip | 1  |     | 198 | 537 | 37 | 83   | 29   |
| gzip | 2  |     | 164 | 566 | 29 | 69   | 15   |
| gzip | 4  |     | 122 | 634 | 19 | 66   | 12   |
| gzip | 8  |     | 78  | 545 | 14 | 61   | 7    |
| gzip | 16 |     | 56  | 513 | 11 | 62   | 8    |

note: the CPU & time numbers aren't very meaningful yet, because I have not tested read performance.

See #37

Don't commit:

- cached parts database
- IntelliJ IDE files
@flaviut (Contributor, Author) commented Oct 2, 2023

I've updated the above with some more data.

Generated with this test script:
#!/bin/zsh

# Benchmark zstd at several dictionary sizes and page sizes
for dict_size in 8192 16384 32768 65536 131072 262144
do
    for page_size in 1024 2048 4096 8192 16384
    do
        time jlcparts buildtables --jobs 16 --ignoreoldstock 30 --pagesize "$page_size" --compression-algorithm zstd --dictsize "$dict_size" cache.sqlite3 web/public/data > /dev/null
        sqlite_size=$(ls -lh web/public/data/index.sqlite3 | awk '{print $5}')
        index_size=$(ls -lh web/public/data/index.sqlite3.zstindex | awk '{print $5}')
        parts_size=$(ls -lh web/public/data/index.sqlite3.zstpart | awk '{print $5}')
        echo "zstd, dict_size=${dict_size} page_size=${page_size}: (sqlite3=${sqlite_size}, index=${index_size}, parts=${parts_size})"
    done
done

# Benchmark gzip at the same page sizes (no dictionary)
for page_size in 1024 2048 4096 8192 16384
do
    time jlcparts buildtables --jobs 16 --ignoreoldstock 30 --pagesize "$page_size" --compression-algorithm gzip cache.sqlite3 web/public/data > /dev/null
    sqlite_size=$(ls -lh web/public/data/index.sqlite3 | awk '{print $5}')
    index_size=$(ls -lh web/public/data/index.sqlite3.gzindex | awk '{print $5}')
    parts_size=$(ls -lh web/public/data/index.sqlite3.gzpart | awk '{print $5}')
    echo "gzip, page_size=${page_size}: (sqlite3=${sqlite_size}, index=${index_size}, parts=${parts_size})"
done

# Baseline: no compression
for page_size in 1024 2048 4096 8192 16384
do
    time jlcparts buildtables --jobs 16 --ignoreoldstock 30 --pagesize "$page_size" --compression-algorithm none cache.sqlite3 web/public/data > /dev/null
    sqlite_size=$(ls -lh web/public/data/index.sqlite3 | awk '{print $5}')
    echo "none, page_size=${page_size}: (sqlite3=${sqlite_size})"
done

data:

  • algo=zstd, dict_size=8192, page_size=1024, sqlite3=537M, index=4.2M, parts=107M, user=98.74s, system=44.37s, cpu=490%, total=29.195
  • algo=zstd, dict_size=8192, page_size=2048, sqlite3=566M, index=2.3M, parts=94M, user=81.53s, system=38.07s, cpu=460%, total=25.964
  • algo=zstd, dict_size=8192, page_size=4096, sqlite3=634M, index=1.3M, parts=73M, user=73.70s, system=33.54s, cpu=424%, total=25.261
  • algo=zstd, dict_size=8192, page_size=8192, sqlite3=545M, index=553K, parts=53M, user=64.36s, system=29.37s, cpu=432%, total=21.670
  • algo=zstd, dict_size=8192, page_size=16384, sqlite3=513M, index=265K, parts=42M, user=61.74s, system=27.98s, cpu=421%, total=21.286
  • algo=zstd, dict_size=16384, page_size=1024, sqlite3=537M, index=4.3M, parts=100M, user=130.27s, system=46.43s, cpu=585%, total=30.160
  • algo=zstd, dict_size=16384, page_size=2048, sqlite3=566M, index=2.3M, parts=87M, user=97.57s, system=38.20s, cpu=519%, total=26.115
  • algo=zstd, dict_size=16384, page_size=4096, sqlite3=634M, index=1.3M, parts=70M, user=81.39s, system=33.74s, cpu=481%, total=23.913
  • algo=zstd, dict_size=16384, page_size=8192, sqlite3=545M, index=561K, parts=50M, user=67.96s, system=29.74s, cpu=448%, total=21.759
  • algo=zstd, dict_size=16384, page_size=16384, sqlite3=513M, index=273K, parts=41M, user=62.47s, system=27.97s, cpu=434%, total=20.835
  • algo=zstd, dict_size=32768, page_size=1024, sqlite3=537M, index=4.3M, parts=94M, user=204.76s, system=51.62s, cpu=727%, total=35.258
  • algo=zstd, dict_size=32768, page_size=2048, sqlite3=566M, index=2.3M, parts=78M, user=138.59s, system=40.91s, cpu=618%, total=29.011
  • algo=zstd, dict_size=32768, page_size=4096, sqlite3=634M, index=1.3M, parts=67M, user=103.12s, system=35.89s, cpu=541%, total=25.678
  • algo=zstd, dict_size=32768, page_size=8192, sqlite3=545M, index=577K, parts=48M, user=78.08s, system=30.13s, cpu=486%, total=22.225
  • algo=zstd, dict_size=32768, page_size=16384, sqlite3=513M, index=289K, parts=40M, user=67.19s, system=28.27s, cpu=448%, total=21.289
  • algo=zstd, dict_size=65536, page_size=1024, sqlite3=537M, index=4.3M, parts=83M, user=312.52s, system=55.79s, cpu=865%, total=42.561
  • algo=zstd, dict_size=65536, page_size=2048, sqlite3=566M, index=2.3M, parts=71M, user=201.91s, system=43.34s, cpu=733%, total=33.439
  • algo=zstd, dict_size=65536, page_size=4096, sqlite3=634M, index=1.4M, parts=58M, user=135.94s, system=36.79s, cpu=624%, total=27.670
  • algo=zstd, dict_size=65536, page_size=8192, sqlite3=545M, index=609K, parts=45M, user=92.84s, system=31.08s, cpu=537%, total=23.035
  • algo=zstd, dict_size=65536, page_size=16384, sqlite3=513M, index=321K, parts=38M, user=74.90s, system=28.70s, cpu=480%, total=21.552
  • algo=zstd, dict_size=131072, page_size=1024, sqlite3=537M, index=4.4M, parts=81M, user=2696.38s, system=118.30s, cpu=1388%, total=3:22.65
  • algo=zstd, dict_size=131072, page_size=2048, sqlite3=566M, index=2.4M, parts=69M, user=1451.82s, system=77.68s, cpu=1297%, total=1:57.84
  • algo=zstd, dict_size=131072, page_size=4096, sqlite3=634M, index=1.4M, parts=57M, user=765.11s, system=55.65s, cpu=1173%, total=1:09.96
  • algo=zstd, dict_size=131072, page_size=8192, sqlite3=545M, index=673K, parts=45M, user=392.49s, system=38.96s, cpu=997%, total=43.261
  • algo=zstd, dict_size=131072, page_size=16384, sqlite3=513M, index=385K, parts=38M, user=205.56s, system=34.54s, cpu=795%, total=30.165
  • algo=zstd, dict_size=262144, page_size=1024, sqlite3=537M, index=4.5M, parts=79M, user=4317.83s, system=129.94s, cpu=1389%, total=5:20.08
  • algo=zstd, dict_size=262144, page_size=2048, sqlite3=566M, index=2.5M, parts=67M, user=2100.99s, system=80.78s, cpu=1330%, total=2:43.97
  • algo=zstd, dict_size=262144, page_size=4096, sqlite3=634M, index=1.5M, parts=56M, user=1114.82s, system=58.12s, cpu=1193%, total=1:38.27
  • algo=zstd, dict_size=262144, page_size=8192, sqlite3=545M, index=791K, parts=50M, user=840.95s, system=44.49s, cpu=1171%, total=1:15.61
  • algo=zstd, dict_size=262144, page_size=16384, sqlite3=513M, index=257K, parts=55M, user=59.91s, system=28.42s, cpu=416%, total=21.187
  • algo=none, page_size=1024, sqlite3=537M, user=54.11s, system=28.21s, cpu=357%, total=23.002
  • algo=none, page_size=2048, sqlite3=566M, user=53.32s, system=27.39s, cpu=371%, total=21.723
  • algo=none, page_size=4096, sqlite3=634M, user=53.80s, system=27.15s, cpu=377%, total=21.421
  • algo=none, page_size=8192, sqlite3=545M, user=53.45s, system=25.91s, cpu=397%, total=19.955
  • algo=none, page_size=16384, sqlite3=513M, user=53.69s, system=25.61s, cpu=406%, total=19.505
  • algo=gzip, page_size=1024, sqlite3=537M, index=4.2M, parts=194M, user=83.26s, system=38.45s, cpu=242%, total=50.192
  • algo=gzip, page_size=2048, sqlite3=566M, index=2.3M, parts=162M, user=69.16s, system=30.83s, cpu=279%, total=35.818
  • algo=gzip, page_size=4096, sqlite3=634M, index=1.3M, parts=121M, user=65.55s, system=29.80s, cpu=298%, total=31.991
  • algo=gzip, page_size=8192, sqlite3=545M, index=545K, parts=77M, user=60.83s, system=27.76s, cpu=324%, total=27.329
  • algo=gzip, page_size=16384, sqlite3=513M, index=257K, parts=56M, user=62.02s, system=27.45s, cpu=321%, total=27.828
| algo | cpu | dict_size (bytes) | index (MiB) | page_size (bytes) | parts (MiB) | sqlite3 (MiB) | system (s) | total (s) | user (s) | page size (KiB) | zstd dict (KiB) |
|------|------|---------|-----|-------|-----|-----|-----|-----|------|----|-----|
| zstd | 490%  | 8192   | 4.2 | 1024  | 107 | 537 | 44  | 29  | 99   | 1  | 8   |
| zstd | 460%  | 8192   | 2.3 | 2048  | 94  | 566 | 38  | 26  | 82   | 2  | 8   |
| zstd | 424%  | 8192   | 1.3 | 4096  | 73  | 634 | 34  | 25  | 74   | 4  | 8   |
| zstd | 432%  | 8192   | 0.5 | 8192  | 53  | 545 | 29  | 22  | 64   | 8  | 8   |
| zstd | 421%  | 8192   | 0.3 | 16384 | 42  | 513 | 28  | 21  | 62   | 16 | 8   |
| zstd | 585%  | 16384  | 4.3 | 1024  | 100 | 537 | 46  | 30  | 130  | 1  | 16  |
| zstd | 519%  | 16384  | 2.3 | 2048  | 87  | 566 | 38  | 26  | 98   | 2  | 16  |
| zstd | 481%  | 16384  | 1.3 | 4096  | 70  | 634 | 34  | 24  | 81   | 4  | 16  |
| zstd | 448%  | 16384  | 0.5 | 8192  | 50  | 545 | 30  | 22  | 68   | 8  | 16  |
| zstd | 434%  | 16384  | 0.3 | 16384 | 41  | 513 | 28  | 21  | 62   | 16 | 16  |
| zstd | 727%  | 32768  | 4.3 | 1024  | 94  | 537 | 52  | 35  | 205  | 1  | 32  |
| zstd | 618%  | 32768  | 2.3 | 2048  | 78  | 566 | 41  | 29  | 139  | 2  | 32  |
| zstd | 541%  | 32768  | 1.3 | 4096  | 67  | 634 | 36  | 26  | 103  | 4  | 32  |
| zstd | 486%  | 32768  | 0.6 | 8192  | 48  | 545 | 30  | 22  | 78   | 8  | 32  |
| zstd | 448%  | 32768  | 0.3 | 16384 | 40  | 513 | 28  | 21  | 67   | 16 | 32  |
| zstd | 865%  | 65536  | 4.3 | 1024  | 83  | 537 | 56  | 43  | 313  | 1  | 64  |
| zstd | 733%  | 65536  | 2.3 | 2048  | 71  | 566 | 43  | 33  | 202  | 2  | 64  |
| zstd | 624%  | 65536  | 1.4 | 4096  | 58  | 634 | 37  | 28  | 136  | 4  | 64  |
| zstd | 537%  | 65536  | 0.6 | 8192  | 45  | 545 | 31  | 23  | 93   | 8  | 64  |
| zstd | 480%  | 65536  | 0.3 | 16384 | 38  | 513 | 29  | 22  | 75   | 16 | 64  |
| zstd | 1388% | 131072 | 4.4 | 1024  | 81  | 537 | 118 | 203 | 2696 | 1  | 128 |
| zstd | 1297% | 131072 | 2.4 | 2048  | 69  | 566 | 78  | 118 | 1452 | 2  | 128 |
| zstd | 1173% | 131072 | 1.4 | 4096  | 57  | 634 | 56  | 70  | 765  | 4  | 128 |
| zstd | 997%  | 131072 | 0.7 | 8192  | 45  | 545 | 39  | 43  | 392  | 8  | 128 |
| zstd | 795%  | 131072 | 0.4 | 16384 | 38  | 513 | 35  | 30  | 206  | 16 | 128 |
| zstd | 1389% | 262144 | 4.5 | 1024  | 79  | 537 | 130 | 320 | 4318 | 1  | 256 |
| zstd | 1330% | 262144 | 2.5 | 2048  | 67  | 566 | 81  | 164 | 2101 | 2  | 256 |
| zstd | 1193% | 262144 | 1.5 | 4096  | 56  | 634 | 58  | 98  | 1115 | 4  | 256 |
| zstd | 1171% | 262144 | 0.8 | 8192  | 50  | 545 | 44  | 76  | 841  | 8  | 256 |
| zstd | 416%  | 262144 | 0.3 | 16384 | 55  | 513 | 28  | 21  | 60   | 16 | 256 |
| none | 357%  |        |     | 1024  |     | 537 | 28  | 23  | 54   | 1  |     |
| none | 371%  |        |     | 2048  |     | 566 | 27  | 22  | 53   | 2  |     |
| none | 377%  |        |     | 4096  |     | 634 | 27  | 21  | 54   | 4  |     |
| none | 397%  |        |     | 8192  |     | 545 | 26  | 20  | 53   | 8  |     |
| none | 406%  |        |     | 16384 |     | 513 | 26  | 20  | 54   | 16 |     |
| gzip | 242%  |        | 4.2 | 1024  | 194 | 537 | 38  | 50  | 83   | 1  |     |
| gzip | 279%  |        | 2.3 | 2048  | 162 | 566 | 31  | 36  | 69   | 2  |     |
| gzip | 298%  |        | 1.3 | 4096  | 121 | 634 | 30  | 32  | 66   | 4  |     |
| gzip | 324%  |        | 0.5 | 8192  | 77  | 545 | 28  | 27  | 61   | 8  |     |
| gzip | 321%  |        | 0.3 | 16384 | 56  | 513 | 27  | 28  | 62   | 16 |     |

NOTE: these times are compression times, not decompression times.

Seems to me like zstd with a 64 KiB dict & 8 KiB pages is likely the best tradeoff here. zstd has good decompression speed across a range of compression levels: https://github.com/facebook/zstd#benchmarks

@yaqwsx (Owner) commented Oct 3, 2023

Thank you for your input. I really appreciate it. I thought about using SQLite for a while, but I never found the time (and motivation) to do it. Let me share my findings:

  • Originally, I wanted to serve the SQLite database over HTTPVFS. After doing several tests I abandoned the idea - most users use the full-text feature, and it is very easy to run into one of the worst-case complexity scenarios, which leads to fetching a non-trivial part of the database while using the site. Thus, I think a better solution is to store a compressed database in local storage and use SQLite over a compressed filesystem. The usage of the site will remain roughly the same - the user will be notified about a new database which they can fetch. The fetching will be fairly quick, as we just need to download the image and do nothing else (96% of the current update time is spent updating entries in IndexedDB).
  • Next, I was thinking about a suitable DB schema. Careful design can save a lot of database size. While the representation of the categories is trivial, I struggled with a suitable design for the attributes, as they are non-uniform. I haven't found a design that I am satisfied with. I would have to dig up my old notes to give you more details.
  • I want to connect this huge change with a rewrite of the frontend into React functional components and a refactoring. This project was my first project in React, and the frontend is quite messy. I would also like to encode the current search query into the URL so users can use bookmarks as shortcuts (see the sketch after this list).
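
As a small illustration of the last point, a minimal sketch of round-tripping search state through the URL with standard browser APIs (the parameter names and the `SearchState` shape are made up for the example):

```ts
// Hypothetical search state; the real app would have more fields.
interface SearchState {
  category?: string;
  query?: string;
}

// Write the current search state into the URL without reloading the page.
function saveStateToUrl(state: SearchState): void {
  const params = new URLSearchParams();
  if (state.category) params.set("category", state.category);
  if (state.query) params.set("q", state.query);
  const url = `${window.location.pathname}?${params.toString()}`;
  window.history.replaceState(null, "", url);
}

// Restore the search state from the URL, e.g. when a bookmarked link is opened.
function loadStateFromUrl(): SearchState {
  const params = new URLSearchParams(window.location.search);
  return {
    category: params.get("category") ?? undefined,
    query: params.get("q") ?? undefined,
  };
}
```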

@flaviut (Contributor, Author) commented Oct 3, 2023

> The usage of the site will remain roughly the same - the user will be notified about a new database which they can fetch

I find this part of the site particularly frustrating. I don't like having to constantly refresh all the data and think about it; I'd rather that just be handled seamlessly in the background.

> most users use the full-text feature, and it is very easy to run into one of the worst-case complexity scenarios, which leads to fetching a non-trivial part of the database

Interesting. The full-text feature definitely needs to be investigated further. I don't know about SQLite, but PostgreSQL has things like covering indexes and index-only scans. I wonder if it'd be possible to get something similar in SQLite FTS.

I'm not familiar with the existing code, but there are also plenty of knobs around tokenizing the query, stop words, throttling delay, limiting result counts, etc.
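
For instance, a minimal sketch of two of those knobs, debouncing keystrokes and capping result counts, assuming a hypothetical `runQuery` helper that executes SQL against the remote database and a hypothetical `parts_fts` FTS5 table:

```ts
// Hypothetical function provided elsewhere (e.g. by the sql.js-httpvfs layer).
declare function runQuery(sql: string, params: unknown[]): Promise<unknown[]>;

// Turn raw user input into an FTS5 prefix query: `res 10k` -> `"res"* "10k"*`.
function toFtsQuery(input: string): string {
  return input
    .trim()
    .split(/\s+/)
    .filter((t) => t.length > 0)
    .map((t) => `"${t.replace(/"/g, '""')}"*`)
    .join(" ");
}

// Debounce searches so a query only runs once typing pauses for 300 ms,
// and cap the number of returned rows.
let timer: ReturnType<typeof setTimeout> | undefined;
function searchDebounced(input: string, onResults: (rows: unknown[]) => void): void {
  if (timer !== undefined) clearTimeout(timer);
  timer = setTimeout(async () => {
    const rows = await runQuery(
      "SELECT rowid FROM parts_fts WHERE parts_fts MATCH ? LIMIT 50",
      [toFtsQuery(input)]
    );
    onResults(rows);
  }, 300);
}
```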

If you have your old notes, I'd be interested in reading them.

> Thus, I think a better solution is to store a compressed database in local storage and use SQLite over a compressed filesystem. The fetching will be fairly quick, as we just need to download the image and do nothing else (96% of the current update time is spent updating entries in IndexedDB).

I'm not opposed to this, but I don't see why it couldn't be both: fetch chunks only as they are needed, and cache them locally. In fact, I wonder if web browsers already handle this kind of caching internally these days.
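
To make that concrete, here is a minimal sketch of per-chunk caching with the browser Cache API, assuming a hypothetical `fetchChunk` function that does the ranged download and decompression (each chunk is keyed by a synthetic URL, since `cache.match` does not interpret Range headers by itself):

```ts
// Hypothetical function that downloads and decompresses one chunk.
declare function fetchChunk(chunkIndex: number): Promise<Uint8Array>;

async function getChunkCached(chunkIndex: number): Promise<Uint8Array> {
  const cache = await caches.open("jlcparts-db-chunks-v1");
  const key = `/__chunk_cache__/${chunkIndex}`;

  // Return the locally cached copy if we already fetched this chunk.
  const hit = await cache.match(key);
  if (hit) {
    return new Uint8Array(await hit.arrayBuffer());
  }

  // Otherwise fetch it, store a copy of the bytes, and return it.
  const data = await fetchChunk(chunkIndex);
  await cache.put(key, new Response(data.slice()));
  return data;
}
```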

> Next, I was thinking about a suitable DB schema. Careful design can save a lot of database size. While the representation of the categories is trivial, I struggled with a suitable design for the attributes, as they are non-uniform.

I stuck with the existing IndexedDB schema design in my testing. But yes, there is plenty to look at in terms of structuring the data so it can be queried easily.

> I want to connect this huge change with a rewrite of the frontend into React functional components and a refactoring.

I've often felt this same temptation to combine two major changes into one 😃. I've found it ends up more motivating and more efficient for me to make changes in smaller chunks, even if at times it seems like I'm doing work that I will soon replace.

@phiresky commented Nov 9, 2023

There are many config options for full-text search with SQLite. I don't remember if I wrote my research about this down somewhere, but you can reduce the FTS size by >90% by setting detail=none and making the table contentless (https://www.sqlite.org/fts5.html#the_detail_option). It does significantly reduce the power of the queries you can run, though.
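
A minimal sketch of what that could look like with sql.js, assuming a build with FTS5 enabled (the table and column names are illustrative):

```ts
import initSqlJs from "sql.js";

async function buildSmallFtsIndex(): Promise<void> {
  const SQL = await initSqlJs();
  const db = new SQL.Database();

  // Hypothetical main table holding the part data.
  db.run("CREATE TABLE parts (id INTEGER PRIMARY KEY, description TEXT)");

  // Contentless (content='') FTS5 table with detail=none: much smaller on disk,
  // but no phrase or proximity queries, and matched text must be fetched from
  // the parts table via the rowid.
  db.run(
    "CREATE VIRTUAL TABLE parts_fts USING fts5(description, content='', detail=none)"
  );

  db.run("INSERT INTO parts(id, description) VALUES (1, '0603 10k resistor')");
  db.run("INSERT INTO parts_fts(rowid, description) VALUES (1, '0603 10k resistor')");

  // The FTS table only yields rowids; join back to `parts` for the details.
  const result = db.exec(
    "SELECT p.id, p.description FROM parts_fts f JOIN parts p ON p.id = f.rowid WHERE parts_fts MATCH 'resistor'"
  );
  console.log(result);
}
```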

Also note that SQLite is very easy to compile, and there are also drop-in Python packages to get a newer and more complete SQLite into Python.

There's one powerful statically hosted full-text search engine I know of that scales to a terabyte of data, called Summa, but for an index with a compressed size of 50 MB it's probably not worth it.

There are a few other JS libraries for building a keyword / full-text search index that you serialize to JSON and download in full (at a smaller size), and which you could then use in combination with SQLite or something else to fetch the full data dynamically. One example (I think) is https://github.com/nextapps-de/flexsearch
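
A minimal sketch of that pattern, assuming FlexSearch's 0.7-style Index API (the ids and fields are illustrative; in practice the index would be built offline, serialized, and loaded in the browser):

```ts
import FlexSearch from "flexsearch";

// Build a small in-memory keyword index; the ids point back into the SQLite data.
const index = new FlexSearch.Index({ tokenize: "forward" });

index.add(1, "0603 10k resistor 1% thick film");
index.add(2, "0805 100n ceramic capacitor X7R");

// Returns the ids of matching parts; the full rows would then be fetched
// from SQLite (e.g. over sql.js-httpvfs) using those ids.
const ids = index.search("10k res", 20);
console.log(ids);
```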

@phiresky commented Nov 9, 2023

Also, just as a note if you have too much free time: if you download the whole DB anyway, you can alternatively create a minimal DB without indexes and without FTS, download that, and create the indexes and FTS search locally, trading bandwidth for local compute.
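
A minimal sketch of that approach with sql.js, assuming the downloaded file contains a hypothetical `parts` table and an FTS5-enabled build:

```ts
import initSqlJs from "sql.js";

// Download a minimal database (no indexes, no FTS) and build both client-side.
async function loadAndIndex(url: string) {
  const SQL = await initSqlJs();
  const bytes = new Uint8Array(await (await fetch(url)).arrayBuffer());
  const db = new SQL.Database(bytes);

  // Ordinary indexes for common filter columns (column names are illustrative).
  db.run("CREATE INDEX IF NOT EXISTS idx_parts_category ON parts(category_id)");
  db.run("CREATE INDEX IF NOT EXISTS idx_parts_stock ON parts(stock)");

  // Full-text index over the shipped data, built locally in the browser.
  db.run(
    "CREATE VIRTUAL TABLE IF NOT EXISTS parts_fts USING fts5(description, content='parts', content_rowid='id')"
  );
  db.run("INSERT INTO parts_fts(rowid, description) SELECT id, description FROM parts");

  return db;
}
```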

@yaqwsx force-pushed the master branch 2 times, most recently from 3dc9b5b to bb8f331 on January 7, 2024.