
Replace string DB fields with more appropriate data types #2932

Open · DerEwige opened this issue Oct 19, 2024 · 12 comments

@DerEwige (Contributor)

For example, you chose to store UUIDs in text fields, which makes them use 36 bytes instead of the 16 bytes the built-in UUID data type would use. If you revamped to use BIGINT where possible, you would go down to 8 bytes.

This would shrink the database (and the indexes in particular) and make the whole DB a lot faster.
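For a concrete sense of the difference, here is an illustrative PostgreSQL snippet (not eclair's actual schema) comparing the per-value storage of the three representations:

SELECT pg_column_size('123e4567-e89b-12d3-a456-426614174000'::text)  AS text_bytes,   -- 37 (36 chars + 1-byte header)
       pg_column_size('123e4567-e89b-12d3-a456-426614174000'::uuid)  AS uuid_bytes,   -- 16
       pg_column_size(1234567890123456789::bigint)                   AS bigint_bytes; -- 8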

@t-bast (Member) commented Nov 22, 2024

That choice was made because we currently favor usability/readability (when browsing the DB using standard DB tools) over performance. We've found that the performance hit wasn't noticeable, so we haven't changed this. We can change this in the future when performance starts becoming a real bottleneck, but that isn't a priority at all right now. Also, this should be motivated by reproducible benchmarks to ensure that the performance gain is worth the DB migration complexity.

@DerEwige (Contributor, Author)

There is only one table that is really a problem right now:

htlc_infos

At 250 million entries it reached a size of 100 GB, and the index is more than 25 GB (which is the main problem). Because of this, the index no longer fits into my memory, which causes huge performance problems.

Especially if a compaction is running while éclair is deleting rows from that table, the node suffers heavy performance loss.

This is really the only table that causes problems and could benefit from some optimization.

[screenshot: table and index sizes]

@t-bast (Member) commented Nov 22, 2024

Gotcha, this is indeed the table that grows the most as HTLCs are processed. It gets better with splicing, which lets us remove old entries after a splice, but it's still the table that will grow the most.

It's a bit annoying to update, because if we want to change the channel_id field, we also have to change the channels DB, since we reference it as a foreign key. Changing the payment_hash field would be trivial though. We can consider doing this if it's scoped to only those tables.

@DerEwige (Contributor, Author)

For channel_id you could replace it with a field called channel_nr, which could just be a BIGINT, and then have a separate table where you map channel_id to channel_nr.

This makes the correlation a bit more complex (it needs two steps or a join), but it could reduce the size of the table (space on disk) and of the index by a lot.
(Just some ideas; see the sketch below.)
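A minimal sketch of that idea, with hypothetical table and column names (this is not eclair's actual schema):

-- Mapping table: one 8-byte surrogate key per channel
CREATE TABLE channel_numbers (
    channel_nr BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    channel_id TEXT NOT NULL UNIQUE
);

-- htlc_infos rows would then carry channel_nr instead of the textual
-- channel_id, and a join recovers the original id:
SELECT h.*, c.channel_id
FROM htlc_infos h
JOIN channel_numbers c USING (channel_nr);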

I'll try to do some performance tests on that with 50m+ inserts and deletes in the next few weeks and post the results here once I'm done.

@DerEwige (Contributor, Author)

I ran some small tests and I think I found the main issue and a possible solution.

You are using a composite index on channel_id and commitment_number.
I looked into the codebase and, in theory, that is the best solution for the kind of queries you are making.
(Two separate indexes would incur extra CPU cost when running those queries.)

But the index gets so big that it consumes 30-50% of the total table space.
This becomes problematic once your table grows to several GB in size.

Composite index vs. two separate indexes
For small tables the difference is negligible, but it grows with larger tables.
Once the table gets so big that the index no longer fits into memory, the benefit of the composite index is far outweighed by the extra I/O needed to read the index from disk.
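For context, a composite index is ideal when queries filter on both columns at once; a hypothetical query shape (not eclair's exact code) that a single composite index lookup serves:

SELECT payment_hash
FROM htlc_infos
WHERE channel_id = $1 AND commitment_number = $2;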

Proposed optimization
Changing payment_hash from TEXT to BYTEA with a length-32 constraint and splitting the composite index into two separate indexes resulted in the following:

Decrease of total table size by more than 50%
Decrease of index size by more than 85%

Small 100k-row table before optimization:
36 MB total
17 MB index

Small 100k-row table after optimization:
17 MB total
2.2 MB index

For my table this would mean a reduction of the index from 25 GB to about 3.5 GB.
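A minimal sketch of the payment_hash change, assuming the TEXT column stores hex-encoded hashes (the constraint name is illustrative):

ALTER TABLE htlc_infos
    ALTER COLUMN payment_hash TYPE BYTEA USING decode(payment_hash, 'hex');
ALTER TABLE htlc_infos
    ADD CONSTRAINT payment_hash_len CHECK (octet_length(payment_hash) = 32);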

@DerEwige (Contributor, Author)

I have analyzed it some more:

You can use this SQL statement on your PG instance to verify my findings:

SELECT relname AS table_name,
       schemaname AS schema_name,
       coalesce(seq_scan, 0) + coalesce(idx_scan, 0) AS total_selects,
       n_tup_ins AS inserts,
       n_tup_upd AS updates,
       n_tup_del AS deletes
FROM pg_stat_all_tables
WHERE schemaname = 'local'
ORDER BY (coalesce(seq_scan, 0) + coalesce(idx_scan, 0) + n_tup_ins + n_tup_upd + n_tup_del) DESC;

[screenshot of query results]

It is basically only inserts and deletes on htlc_infos.
It is rarely selected from (I guess only on startup and on a few other occasions).

So, turning the compound index into two separate indexes would not hurt performance at all, even when the compound index fits into memory.

The least invasive version would be to leave channel_id and payment_hash untouched and just drop the current compound index and create two separate indexes on channel_id and commitment_number.

This would require no changes to the code (for accessing the DB) and only a minimal DB migration.

This would shrink the total DB by about 45% and the index by about 85%.

@t-bast (Member) commented Nov 25, 2024

Thanks for the investigation!

The least invasive version would be to leave channel_id and payment_hash untouched and just drop the current compound index and create two separate indexes on channel_id and commitment_number.

Sounds acceptable, do you want to open a PR for that?

This would require no changes to the code (for accessing the DB) and only a minimal DB migration.

Is the DB migration really minimal? Re-building two indices on the htlc_infos table requires reading the whole table, doesn't it? We'd need to check how long that takes on a large node such as ours.

@DerEwige (Contributor, Author)

Sounds acceptable, do you want to open a PR for that?

Once the investigation is done, I would like to leave the implementation to you guys. My Scala is really lacking.

Is the DB migration really minimal? Re-building two indices on the htlc_infos table requires reading the whole table, doesn't it? We'd need to check how long that takes on a large node such as ours.

I guess every change we make (changing the data type of a field, changing an index) would require reading the whole table, so there is no cheap solution here.

You are right, rebuilding the index will take some time.

It might be better to add the new indexes before dropping the old one.
This would temporarily increase the table size by about 10%, but building the new indexes should be much faster, as their columns are included in the old index (I assume PG would make use of this).
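If locking during that step is a concern, PostgreSQL can also build the new indexes without blocking writes; a sketch with illustrative index names (note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block):

CREATE INDEX CONCURRENTLY htlc_infos_idx1 ON htlc_infos (channel_id);
CREATE INDEX CONCURRENTLY htlc_infos_idx2 ON htlc_infos (commitment_number);
DROP INDEX IF EXISTS htlc_infos_idx;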

How big is your DB right now? I've managed to bring mine down to about 65 million entries by replacing a lot of older channels during 2 sat/vB times.

I could either make a copy of my DB or do a mockup DB of the same size to test the durations.
I'll post the results once I have them.

@t-bast (Member) commented Nov 25, 2024

How big is your DB right now? I've managed to bring mine down to about 65 million entries by replacing a lot of older channels during 2 sat/vB times.

We "only* have 21 million rows, it's smaller than yours! I'm curious to know what the migration time is to re-create the index in that case.

@DerEwige (Contributor, Author)

OK, I will do a mockup DB with 20m entries and test it; it should take me about 2 hours to do multiple runs.
(Inserting random data into a DB, even at 400k entries per batch, takes some time.)
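For reference, one quick way to produce that kind of mock data in PostgreSQL is generate_series; a sketch matching the illustrative test layout below (50 channels, 400k commits each, batched by adjusting the series bounds):

INSERT INTO test_table_1 (channel_id, commitment_number, payment_hash)
SELECT 'channel-' || (n % 50), n / 50, md5(n::text)
FROM generate_series(1, 400000) AS n;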

@DerEwige (Contributor, Author)

I did four runs in total.

TL;DR: about 90 s to create both new indexes.

The DB server used:
VPS: t4g.medium (2 vCPU, 4 GB memory)
Storage: 3200 IOPS, 250 MB/s throughput
Synchronous streaming replication to a secondary server up and running

(I assume you have a somewhat stronger system.)

My live node uses the same DB server, so the test did not have exclusive access to it but shared its resources.
I created a mockup table in a separate DB.
I used the same CREATE statements for the table and index as you do.

Because of this setup there was quite some variation in the duration of index creation between runs.

I created 50 channels with 400'000 commits on each channel.
(I could have randomized this, but it should not matter for the test.)

I’ll post the slowest run here:
After inserting 20m entries:

total size: 7106 MB
index size: 3471 MB
Insert duration: 571 seconds

Index creation duration: 94.4 seconds
Index drop duration: 0.12 seconds

Size after creating the new indexes and dropping the old index:

total size: 3917 MB
index size: 282 MB

Commands used to create and drop the indexes:

CREATE INDEX htlc_infos_idx1 ON test_table_1(channel_id);
CREATE INDEX htlc_infos_idx2 ON test_table_1(commitment_number);

DROP INDEX IF EXISTS htlc_infos_idx;

You can use this SELECT to check your sizes:

SELECT pg_size_pretty(pg_total_relation_size('test_table_1')) AS total_size,
       pg_size_pretty(pg_indexes_size('test_table_1')) AS index_size,
       pg_size_pretty(pg_relation_size('test_table_1')) AS data_size;

(Note: this shows the size on disk, not the live data size. If your table was bigger in the past and you never ran a VACUUM FULL, it will show a larger size.)
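If that applies, a VACUUM FULL rewrites the table and returns the dead space to the OS; note that it takes an exclusive lock on the table for the duration (schema name as used in the stats query above):

VACUUM FULL local.htlc_infos;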

t-bast added a commit that referenced this issue Nov 25, 2024
We were previously using a composite index on both the `channel_id` and
the `commitment_number` in the `htlc_infos` table. This table is used to
store historical HTLC information to be able to publish penalty txs when
a malicious peer publishes a revoked commitment. This table is usually
the largest DB table on nodes that relay a lot of payments, because it
grows with the number of HTLCs received and sent (and only shrinks when
channels are closed or spliced).

Using a composite index makes sense, but leads to increased memory usage
compared to separate indices, thus reducing performance because this is
a table on which we write a lot, but almost never read (only when we
detect a revoked force-close). The read performance doesn't seem to be
negatively impacted anyway when splitting the indices, and the memory
usage is greatly improved.

The migration may take some time depending on the size of the table,
but we cannot get around this.

Thanks to @DerEwige for the performance investigation that led to
this change (see #2932 for more details).
@t-bast (Member) commented Nov 25, 2024

Thanks for investigating this, I made the change in #2946.
