
Goal: run load testing #441

Closed
27 tasks done
zolotokrylin opened this issue Aug 6, 2024 · 64 comments
@zolotokrylin
Contributor

zolotokrylin commented Aug 6, 2024

TSN Load Testing - Spec


@zolotokrylin @MicBun @rsoury

I do think some kind of stress test is needed periodically. It might be time-consuming and slow to run by hand, but it helps us discover issues that might not surface during regular tests.

How do you feel about making stress tests and benchmarks part of another goal, more specific to these? Take this as a suggestion: although it's also important for the system's reliability, the architecture and planning of this kind of testing may be complex enough to deserve its own set of objectives.

Then, we keep this one more focused on simple logic + observability. And the other would be about answering: "how much pressure does our system support, and how does it behave?"

Originally posted by @outerlook in #415 (comment)

cc: @rsoury


Blocker

Issues

Optimization issues

TSN-SDK Task

Other issue

@zolotokrylin zolotokrylin changed the title from "@zolotokrylin @MicBun @rsoury" to "Goal: execute load testing on the load" on Aug 6, 2024
@zolotokrylin zolotokrylin changed the title from "Goal: execute load testing on the load" to "Goal: run load testing" on Aug 6, 2024
@zolotokrylin
Contributor Author

@outerlook @rsoury, please suggest a simple and cost-effective action plan here.
Let's find out our system limits ASAP and create Goals to address them, as suggested above.

@rsoury

rsoury commented Aug 7, 2024

@brennanjl - Do you have any benchmarking and pen-test tooling that you can provide? What's your take on the following?

@outerlook - Proposing a simple pen-test suite that:

  1. Deploys a dev version of the network multi-regionally - use the proposed Production capacity - i.e. 8 GB RAM, 2 vCPU, and 200 GB of storage for each Node.
  2. If we do not have IaC to support this, we really should. I believe Truflation should operate a minimum of 3 Nodes in Production - post TSN launch.

Determines:

  • Max Transactions Per Second - each transaction is created using a dynamically created wallet
    • Transactions include Stream Creation
    • Composed Stream Creation
  • Max Queries Per Second - use arbitrary filters on queries to try to break it.
  • Max Composition Depth
  • Max Storage Limit? - 1000 transactions a second with 1 MB transaction sizes will fill up the DB quite fast.
  • Max Transaction Limit

Seed data with mock data using a Faker lib - https://github.com/go-faker/faker
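A rough sketch of what that seeding step could look like, assuming Go and go-faker; the `StreamRecord` shape and the 1,000-record batch are placeholders for illustration, not the actual TSN-SDK types:

```go
package main

import (
	"fmt"

	"github.com/go-faker/faker/v4"
)

// StreamRecord is a hypothetical mock payload for an insert_records call;
// the real TSN-SDK types will differ.
type StreamRecord struct {
	StreamID string  `faker:"uuid_hyphenated"`
	Date     string  `faker:"date"`
	Value    float64 `faker:"amount"`
}

func main() {
	records := make([]StreamRecord, 0, 1000)
	for i := 0; i < 1000; i++ {
		var r StreamRecord
		if err := faker.FakeData(&r); err != nil {
			panic(err)
		}
		records = append(records, r)
	}
	// In the load test these would be submitted via the SDK / RPC instead of printed.
	fmt.Printf("seeded %d mock records, e.g. %+v\n", len(records), records[0])
}
```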

Any additional recommendations on how to determine load are welcome.

@brennanjl
Collaborator

Our team just published metrics today here. Keep in mind there is a lot Kwil can still optimize; this is just how it is right now.

This should answer:

  • Max Transactions Per Second
  • Max Transaction Limit

For max storage limit, it is totally dependent on your schemas. There are some hard postgres limits, but we will much sooner hit slow queries due to data size.

For max queries per second, I would consider this a non-issue / essentially infinite. It is very easy to spin up read replicas to read more data, so it is horizontally scalable.

Finally, for max composition depth, this is currently a Postgres limit, but I am unsure what it is. In v0.9, we can enforce more controls in this though, due to our interpreter usage, so I am curious if there is an "ideal value" here.

Since so much of this is dependent on your specific schema, indexes, and procedures, I would recommend performing benchmarking yourself (particularly for query speed / stream composition for reads). The above Docsend link should provide adequate information for node distribution + its impact on write performance, but it doesn't really cover reads. The reason I mention this is that the TSN-specific benchmarks shouldn't need to run more than one node, as that shouldn't affect the benchmarking of queries.

@outerlook
Contributor

outerlook commented Aug 7, 2024

@rsoury

Deploys a dev version of the network multi-regionally - use the proposed Production capacity - i.e. 8 GB RAM, 2 vCPU, and 200 GB of storage for each Node.
If we do not have IaC to support this, we really should. I believe Truflation should operate a minimum of 3 Nodes in Production - post TSN launch.

We do have IaC for a single region and could adapt it for multi-region. However:

@brennanjl

The reason I mention this is because the TSN specific benchmarks shouldn't need to run more than one node

This makes sense, as querying is dependent only on the queried node right now, correct?

Max Transactions Per Second - each transaction is created using a dynamically created wallet

  • Transactions include Stream Creation
  • Composed Stream Creation

@rsoury Can we use the Kwil team's results for this?

I assume so, because the bottleneck for the majority of transactions lies in their size and the consensus performed over them, rather than in the step of committing to the underlying Postgres, which varies with the logic.
Of course, we could deploy contracts with complex and recursive database inserts, but those are usually special cases that we wouldn't need to benchmark.

We have simple writes such as insert_records, which will be used frequently, but the underlying database operation is simple.

@outerlook
Contributor

outerlook commented Aug 7, 2024

@rsoury

Max Composition Depth

@brennanjl

Finally, for max composition depth, this is currently a Postgres limit, but I am unsure what it is.

Assuming the Postgres limit is effectively infinite, for the current implementation we should try to relate composition depth vs. date range (or queried data points?) vs. query execution time, since we can arbitrarily increase the query timeout to whatever seems reasonable and treat that as the limit.

Note: the worst-case scenario is when every ancestor stream is private because more operations are performed per stream on a query.

@rsoury

rsoury commented Aug 8, 2024

@brennanjl - Very cool report!

Questions:

  1. Do these reported numbers increase when the number of nodes on the network decreases?
    55 Nodes is more than a full network on many Cosmos chains (50 max).
  2. Is it safe to assume that resolutions on Kwil Oracles are essentially transactions - and that these figures on TPS and throughput can be applied to this aspect of Kwil Networks?

(Fun theoretical/research) Considering BFT is the limiting factor, I wonder what these numbers would look like if Sei Consensus or Sui were used with pruned features,
i.e. a general (bigger) L1 consensus network and an L2 without consensus to index transactions (Kwil Oracles) + SQL Contracts.
It'd also solve #199 by default.
I know this would be a rewrite - so don't take it too seriously.

@rsoury

rsoury commented Aug 8, 2024

@outerlook - Re

Max Composition Depth

We can push this back, especially considering we're in complete control of depth, as Truflation is the only analyst/stream-creator.
I believe this report suffices to understand our limits.

All third-party analysis and stream creation should be performed on a Staging/Testnet -- not on the Production TSN.

Only once we've performed our own stress test, and/or used metrics from Staging/Testnet produced by third parties, can we expose Stream creation to third parties.

@zolotokrylin - Please determine the course of action based on this advice.

@rsoury

rsoury commented Aug 8, 2024

Side note: Postgres performance can be limited heavily based on bad/complex queries. I’m sure composition depth plays a role in this.

@zolotokrylin
Contributor Author

zolotokrylin commented Aug 8, 2024

@rsoury, sorry, what's the question?

We should run a typical "read requests" load test. Let's keep write requests out of this scope.
Let's move this into the spec doc if there are more considerations and details to be discussed.

@rsoury

rsoury commented Aug 8, 2024

@zolotokrylin - Yep, you've basically answered.

The question was: given that write performance has been established sufficiently, what are the next steps?

Note that "read requests" can be load tested, but as per Brennan's comment, they're essentially correlated to the scale of a single instance Postgres DB + Golang App.
If we need more scale, we can just horizontally scale the gateways, and load balance between Validator Nodes that we operate internally.
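Not the actual TSN gateway setup, but a minimal sketch of that idea in Go: round-robin read traffic across a few internally operated nodes. The node hostnames and ports are made-up placeholders.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	// Hypothetical RPC endpoints of validator nodes we operate internally.
	targets := []*url.URL{
		mustParse("http://node-1.internal:8484"),
		mustParse("http://node-2.internal:8484"),
		mustParse("http://node-3.internal:8484"),
	}

	var counter uint64
	proxy := &httputil.ReverseProxy{
		Director: func(r *http.Request) {
			// Round-robin each incoming read request to the next node.
			t := targets[atomic.AddUint64(&counter, 1)%uint64(len(targets))]
			r.URL.Scheme = t.Scheme
			r.URL.Host = t.Host
			r.Host = t.Host
		},
	}

	log.Fatal(http.ListenAndServe(":8080", proxy))
}
```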

@zolotokrylin
Contributor Author

Gateways are the same as RPC service providers, right?

@outerlook
Contributor

outerlook commented Aug 8, 2024

About read load testing

as @rsoury said

Postgres performance can be limited heavily based on bad/complex queries. I’m sure composition depth plays a role in this.

We already faced some issues with it:

In short, we had query timeouts (5 seconds) when fetching YoY values for month ranges, but it worked for a week.

Why is it complex?
  • getting records for a range of days has some loops inside the query to fill the gaps
  • querying a composed index is recursive, considering the depth of the composition
  • there are permission checks for every stream, querying the metadata table for:
    • the stream's read visibility
    • the stream's owner (if private)
    • permission for a wallet to read it (if private)
    • the stream's compose visibility
    • permission for a subsequent stream to compose from it (if private)

As a bandaid, our solution was just to increase the timeout limit to 60 seconds. But I think we would benefit from knowing more precisely how depth influences the query time.

Also, I don't think testing this would be too complex if Kwil's testing framework can be a good representation of actual no-roundtrip query times, as it just uses Postgres directly (right, @brennanjl?).

In that case, we could get values for:

  • visibility: public or private
  • procedures: get_record, get_index, get_index_change

Getting the responses for:

| queried days \ depth | 1 | 10 | 100 | 1K | 10K | 100K | 1M |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 |  |  |  |  |  |  |  |
| 7 |  |  |  |  |  |  |  |
| 30 |  |  |  |  |  |  |  |
| 90 |  |  |  |  |  |  |  |
| 180 |  |  |  |  |  |  |  |
| 365 |  |  |  |  |  |  |  |

We could even do better at optimizing our queries if we had this table.
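A minimal sketch of how that matrix could be filled in, assuming Go and a hypothetical `runQuery` helper that issues the read (get_record / get_index / get_index_change) against a node; everything here is illustrative, not the actual test harness:

```go
package main

import (
	"fmt"
	"time"
)

// runQuery is a placeholder for the actual call into the TSN contracts
// with the given procedure, visibility, date range, and composition depth.
func runQuery(procedure, visibility string, days, depth int) error {
	// ... issue the read against a node with these parameters ...
	return nil
}

func main() {
	procedures := []string{"get_record", "get_index", "get_index_change"}
	visibilities := []string{"public", "private"}
	daysAxis := []int{1, 7, 30, 90, 180, 365}
	depthAxis := []int{1, 10, 100, 1000, 10000, 100000, 1000000}

	// Time every (procedure, visibility, days, depth) combination to fill the table above.
	for _, proc := range procedures {
		for _, vis := range visibilities {
			for _, days := range daysAxis {
				for _, depth := range depthAxis {
					start := time.Now()
					if err := runQuery(proc, vis, days, depth); err != nil {
						fmt.Printf("%s/%s days=%d depth=%d: error: %v\n", proc, vis, days, depth, err)
						continue
					}
					fmt.Printf("%s/%s days=%d depth=%d: %v\n", proc, vis, days, depth, time.Since(start))
				}
			}
		}
	}
}
```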

@rsoury

rsoury commented Aug 8, 2024

@zolotokrylin

Gateways are same as RPC service providers, right?

Yes.

I agree with @outerlook, it's wise to proceed with composition depth measurements.
I didn't consider the severity of the response times.

We could even be better at optimizing our queries if we have this table

Yes, but we cannot optimise queries that external parties deploy.

@brennanjl - Are YoY queries slow because they wait for the next block? Or do they use the last block where consensus time is established?
If the latter, it's likely an unoptimised query issue.
It's not a major issue now but may become one.
A solution could be a SQL optimiser...

Another bandaid solution is stuffing the entire DB in memory purely for faster queries.

@brennanjl
Collaborator

Do these reported numbers increase when the number of nodes on the network decreases?
55 Nodes is more than a full network on many Cosmos chains (50 max).

Yes. We haven't measured the performance impact of running different numbers of nodes with proper tests, but CometBFT's tests show this, and we have a lot of anecdotal evidence as well.

Is it safe to assume that resolutions on Kwil Oracles are essentially transactions - and that these figures on TPS and throughput can be applied to this aspect of Kwil Networks?

Not really. An overview of how it works can be found here, but essentially, there is one tx per validator that votes, and that single tx contains the information for all resolutions they are voting on. However, the data for the event itself only goes through consensus once, since all other nodes voting in favor simply submit a hash of the event data.

(Fun theoretical/research) Considering BFT is the limiting factor, I wonder what these numbers would look like where Sei Consensus or Sui is used with pruned features.

I have actually spoken with a lot of the teams building these protocols. I'm pretty keen to try them because it would obviously open up a lot of opportunities for Kwil, but consensus engines aren't something I want to get fancy with (at least yet). They are very hard to get right, and we don't have an assurance that any of these protocols will be successful long term. We have significantly lower dependency risk with Comet, so we have chosen it for now.

@brennanjl
Collaborator

@brennanjl - Are YoY queries slow because they wait for the next block? Or do they use the last block where consensus time is established?
If the latter, it's likely an unoptimised query issue.
It's not a major issue now but may become one.

It is slow due to some sort of logic being executed against the DB (all processing happens outside of consensus). There are two areas where we can improve:

  1. Kuneiform code: I think there are much more efficient ways to write the Kuneiform that powers streams. In particular, the code that aggregates data from many streams is very inefficient, because it performs a lot of extra operations in order to be "generalizable". @MicBun has mentioned this, and asked to be able to construct SQL queries ad hoc during execution in order to fix it (which we are working on). If you had a Kuneiform code-gen tool (which would be very easy to make for this case) that generated stream code, you could just generate optimal queries for each specific stream. All you need to know is the number of children it has (a rough sketch of the code-gen idea follows this list).
  2. Kwil's SQL determinism: Kwil's SQL enforces determinism for all queries. While we generally want to keep this for reads as well, one area we can make optional is the default ordering enforcement. I don't actually know how much this will help, because TSN already orders the results anyway, but it could be worth a shot.
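Purely illustrative, not real Kuneiform: a tiny Go `text/template` code-gen sketch that, given the known children of one composed stream, emits a query specialized to exactly those children instead of a generalized recursive one. The stream/table names and weights are made up.

```go
package main

import (
	"os"
	"text/template"
)

// The template unions the child streams of a single composed stream and
// aggregates by date; a real tool would emit valid Kuneiform procedures.
var tmpl = template.Must(template.New("stream").Parse(`-- generated for {{.Parent}}
SELECT date_value, SUM(value * weight) AS value
FROM (
{{- range $i, $c := .Children}}
  {{if $i}}UNION ALL
  {{end}}SELECT date_value, value, {{$c.Weight}} AS weight FROM {{$c.Table}}
{{- end}}
) children
GROUP BY date_value;
`))

type Child struct {
	Table  string
	Weight float64
}

func main() {
	data := struct {
		Parent   string
		Children []Child
	}{
		Parent: "composed_stream_cpi_us", // hypothetical stream name
		Children: []Child{
			{Table: "stream_food", Weight: 0.14},
			{Table: "stream_housing", Weight: 0.42},
			{Table: "stream_transport", Weight: 0.16},
		},
	}
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		panic(err)
	}
}
```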

A solution could be a SQL optimiser...

A SQL optimizer probably wouldn't help here. I can take a scan through the current queries, but Postgres already has a really good optimizer. I think the issue is more likely the Kuneiform logic, and the relatively restrictive number of things you can do in Kuneiform.

This weekend, I will take a look at the procedures to see if there are any basic ways to optimize. I can also explore a code-gen tool for generating optimal stream code.

On a slightly longer timeline, Kwil will have more control over Kuneiform execution soon as we move to an interpreter (starting work on it soon). This will make it much easier for us to add features that could allow you to create more optimal streams.

@markholdex
Collaborator

Hey @brennanjl, did you check the procedures for ways to optimize?

@markholdex
Collaborator

@rsoury I know we are still curious about how we can improve query time and what can influence it. But other than that, it feels like it's clear to everyone where we can improve and where it wouldn't make sense. Thus, do we need to create a short spec around the read-requests load testing? If yes, let's do it. It will be easier for everybody on the team to pick it up.

@rsoury

rsoury commented Aug 14, 2024

@markholdex - Sure thing, we can do a small spec on this.

I believe @outerlook has essentially outlined what's involved here #441 (comment)

However, we can formalise this into a set of experiments to conduct.

@markholdex
Collaborator

@rsoury can you get started on it and then we all contribute?

@rsoury

rsoury commented Aug 15, 2024

@markholdex - Spec established and added to the GitHub Issue - #441 (comment)

@outerlook - Please complete with Action Items more specific to the benchmark requirements outlined

@brennanjl
Collaborator

Hey @brennanjl, did you check the procedures for ways to optimize?

@markholdex apologies, I have not had a chance to do this yet. I have been really behind, and dropped the ball on it. Getting on it now.

@markholdex
Collaborator

@brennanjl totally get it. Please do let me know when we can expect to hear back from you. Thank you!

@markholdex
Collaborator

@outerlook what is the status here? I can see we still have unresolved Problems. What is the ETA?

@MicBun are you able to help here?

@MicBun
Contributor

MicBun commented Sep 9, 2024

Hi @markholdex, about the unresolved problems:

These two do not require a PR; instead, the work is done on a server (AWS).
Edit: I already did the contract upgrade on our server, so it is resolved now. It excludes the changes in the ongoing PR.

I think it is better to skip this improvement, as changing the type from text to int has minimal impact and we would need to adjust all of the data types too.

The node upgrade is dependent on this PR, which is still in progress.

@outerlook
Contributor

Hey, team. Sorry about the delay with #530. I finally got #534 into review to fix it. It had weird errors that only happened after 2 hours of running tests.

I'm rerunning the tests to get the complete results and will share them here. The neat result of this goal is that query times were improved a lot, and we can get reliable query times for the specified procedures. They are also now predictable (as the days/streams queried grow, the query time is pretty much linear).

Our load tests are simple and won't show other spec limitations like memory usage, etc., but with the experience from running them, I'd say we should continue to recommend at least t3.medium specs for node usage: not for CPU, but because of memory issues I had while running tests.

We could test a much smaller number of streams at once than originally spec'ed (800 vs. 1M) due to testing constraints, but the results are linear enough to extrapolate consistently to get query numbers (though maybe not the memory constraints).

Another limitation uncovered is a maximum composition depth of 180, due to the call stack size limitation in the default Postgres configuration. We could tweak our servers to support larger values, or even implement some network-wide configuration, but I guess it's not worth it right now because we're very far from this depth, as streams usually grow a lot horizontally first (our streams are at most 4 levels deep).
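For reference, the relevant knob is presumably Postgres's `max_stack_depth` (the setting behind "stack depth limit exceeded" errors). A small sketch, assuming direct Postgres access with the lib/pq driver and a placeholder DSN, just to inspect the current value; raising it would be done in postgresql.conf and is bounded by the OS stack limit (`ulimit -s`):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder DSN; point it at the node's Postgres instance.
	db, err := sql.Open("postgres", "postgres://user:pass@localhost:5432/kwild?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var depth string
	// max_stack_depth bounds how deeply nested recursive calls can go (default is typically 2MB).
	if err := db.QueryRow("SHOW max_stack_depth").Scan(&depth); err != nil {
		log.Fatal(err)
	}
	fmt.Println("max_stack_depth =", depth)
}
```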

@outerlook
Contributor

outerlook commented Sep 9, 2024

Next steps after #534 gets merged:

However, some are more async and could move along with other goals.

@outerlook
Contributor

outerlook commented Sep 9, 2024

Results

The complete markdown can be seen here -> Open

Important tables that give the overview (values in milliseconds):

t3.large - get_index_change - Public

| queried days \ qty streams | 50 | 100 | 200 | 400 |
| --- | --- | --- | --- | --- |
| 1 | 52 | 122 | 197 | 359 |
| 7 | 48 | 93 | 205 | 395 |
| 30 | 106 | 183 | 302 | 630 |
| 365 | 534 | 1239 | 2037 | 4384 |

t3.medium - get_index_change - Public

| queried days \ qty streams | 50 | 100 | 200 | 400 |
| --- | --- | --- | --- | --- |
| 1 | 48 | 99 | 327 | 317 |
| 7 | 47 | 114 | 176 | 474 |
| 30 | 98 | 165 | 362 | 789 |
| 365 | 635 | 1039 | 2292 | 4540 |

t3.small - get_index_change - Public

| queried days \ qty streams | 50 | 100 | 200 | 400 |
| --- | --- | --- | --- | --- |
| 1 | 73 | 125 | 223 | 487 |
| 7 | 58 | 133 | 255 | 513 |
| 30 | 173 | 201 | 328 | 780 |
| 365 | 769 | 1242 | 2940 | 5069 |

I believe we can conclude that, although we should keep the 60-second timeout just in case, our 365-day YoY query over ~200 streams in CPI US can now be answered in roughly 2 seconds on t3.medium.

There is a more thorough analysis under #534 that proves these values are linear

Comparing private and public results: although private visibility makes queries slower, we could see that it is not significantly slower.

@outerlook
Contributor

outerlook commented Sep 10, 2024

Hey guys, we had some issues with this upgrade. @zolotokrylin @markholdex

The TL;DR of what I want to know is whether we continue and take until Friday to push it (with a less performant solution that doesn't break), or put it on hold and solve truflation/website#788, which, if I understood correctly, is a business priority.

What improves? Query times

More details
  • the current optimizations depend on features done at Kwil that I initially thought weren't breaking changes, but they are (array assignments are now allowed and weren't before)
  • because of this, we should update the contracts not to depend on this optimization and accept slightly worse performance (but still better than the initial state)

@outerlook
Contributor

outerlook commented Sep 10, 2024

Another note: we will be able to use the complete solution only after the next Kwil version, when preview gets in sync with main at Kwil and we choose to support it from TSN by upgrading the network.

@MicBun
Contributor

MicBun commented Sep 10, 2024

Hi @outerlook, I forgot to mention: the contract on our server is already using the non-array assignment.
#536
I upgraded the contract just before your PR that uses array assignment.

when preview gets in sync with main at Kwil and we choose to support it from TSN by upgrading the network

Is there any ETA on this? If it will most likely be ready in the next few days, I think it is better to leave the current contract as it is (the most updated and optimized one).

@outerlook
Contributor

outerlook commented Sep 10, 2024

when preview gets in sync with main at Kwil and we choose to support it from TSN by upgrading the network

Is there any ETA on this?

I think they will start the beta in the next few days; @brennanjl can answer this better. However, it might be some weeks in beta until it's stable, and there may also be time needed for us to coordinate the upgrade. I don't think it's that immediate.

@MicBun

I upgraded the contract just before your PR that uses array assignment

If that's the only thing that changed, then #550 contracts should be the same as what was deployed. Is that correct?

@MicBun
Contributor

MicBun commented Sep 10, 2024

If that's the only thing that changed, then #550 contracts should be the same as what was deployed. Is that correct?

Yes, it is correct.

@outerlook
Contributor

outerlook commented Sep 10, 2024

Then, this goal can be considered done after:

I'm running new tests to get the results, but for 4-branch categories the difference should not be so big; only for flat categories is it bigger (where the cost of adding a new stream to a composed stream is higher).

@brennanjl
Collaborator

I think they will start the beta in the next few days; @brennanjl can answer this better. However, it might be some weeks in beta until it's stable, and there may also be time needed for us to coordinate the upgrade. I don't think it's that immediate.

To confirm, beta got shipped today. That means that preview will converge with a production-ready branch in two weeks (assuming no critical bugs), as documented by our release cycle here. This will include a number of other additions on top of what is on preview, including peer filtering, RPC security, and better processes for future coordinated upgrades (to make it easier for Truflation to make these breaking changes in the future).

@outerlook
Contributor

Thank you, Brennan!

I'm closing this issue as everything planned is shipped, and everything is normalized again. If anyone thinks something is pending, please reopen!

@zolotokrylin
Contributor Author

zolotokrylin commented Sep 11, 2024

@outerlook is this the final report: #441 (comment) ?

@outerlook
Contributor

outerlook commented Sep 11, 2024

I left a complete report generating yesterday. It timed out after 6 hours (before the reverts it took 1h30m).

I'll leave another one running today with a 48-hour time limit so we know how long the new one will take. I'll leave the goal open until we have it.

@outerlook
Contributor

outerlook commented Sep 13, 2024

Results took 13 hours this time

available here: https://us-east-2.console.aws.amazon.com/s3/object/benchmark-results-benchmark-staging-stack?region=us-east-2&bucketType=general&prefix=reports/2024-09-12T13%3A45%3A44.092Z.md

Note that in the previous implementation (with the pending optimizations) the branching factor didn't matter. However, for now it does. That's why I'll also break the results down by branching factor here:

Branching factor = 8

It's a relatively common factor, 8 children per stream, but only a guess (I think this varies from 2 to 10).

t3.large - get_index_change - Public

| queried days \ qty streams | 50 | 100 | 200 | 400 |
| --- | --- | --- | --- | --- |
| 1 | 52 | 111 | 219 | 433 |
| 7 | 83 | 227 | 337 | 693 |
| 30 | 223 | 433 | 864 | 1916 |
| 365 | 2520 | 4662 | 9616 | 19406 |

t3.medium - get_index_change - Public

| queried days \ qty streams | 50 | 100 | 200 | 400 |
| --- | --- | --- | --- | --- |
| 1 | 82 | 151 | 306 | 428 |
| 7 | 106 | 234 | 438 | 728 |
| 30 | 291 | 595 | 951 | 1873 |
| 365 | 2586 | 5308 | 10608 | 21220 |

t3.small - get_index_change - Private

| queried days \ qty streams | 50 | 100 | 200 | 400 |
| --- | --- | --- | --- | --- |
| 1 | 68 | 137 | 240 | 539 |
| 7 | 94 | 221 | 381 | 959 |
| 30 | 255 | 633 | 1157 | 2799 |
| 365 | 2500 | 5636 | 10078 | 22827 |

Branching factor = infinity

As if all streams of a category were directly below the root; this helps us know the cost of new streams per parent.

Note that the numbers grow a lot with new streams.

t3.large - get_index_change - Public

| queried days \ qty streams | 50 | 100 | 200 | 400 |
| --- | --- | --- | --- | --- |
| 1 | 139 | 446 | 2119 | 11870 |
| 7 | 295 | 1790 | 8551 | 48612 |
| 30 | 1206 | 5878 | 31756 | 189672 |
| 365 | 13613 | 69843 | 373267 | 2387596 |

Yes, the last one took ~2.4K seconds ≈ 40 min.

t3.medium - get_index_change - Public

| queried days \ qty streams | 50 | 100 | 200 | 400 |
| --- | --- | --- | --- | --- |
| 1 | 132 | 573 | 2795 | 14454 |
| 7 | 412 | 2027 | 9552 | 58411 |
| 30 | 1153 | 6258 | 34644 | 229682 |
| 365 | 15890 | 80349 | 428752 | 2666942 |

t3.small - get_index_change - Public

| queried days \ qty streams | 50 | 100 | 200 | 400 |
| --- | --- | --- | --- | --- |
| 1 | 159 | 482 | 2175 | 12918 |
| 7 | 414 | 2161 | 8703 | 56300 |
| 30 | 1135 | 6120 | 35256 | 215169 |
| 365 | 15004 | 76803 | 418376 | 2579786 |

(How can the small instance perform better than the medium one in this case? Since the results are relatively close, I'm not sure it's worth investigating the discrepancy.)


Conclusion:

The remaining optimization is important, but although the numbers for flat trees are big, it isn't a critical situation, because we don't currently need categories that flat in production. It should still be addressed in the future, though, because these numbers make such a flat category almost prohibitive.

@rsoury

rsoury commented Sep 15, 2024

@outerlook - Considering that vertically scaling the Node linearly impacts performance, should we bump the primary Node used for handling queries?
Each instance you've used here has 2 vCPUs.
Do 4 vCPUs help?

@outerlook
Contributor

Do 4 vCPUs help?

We can certainly answer this by testing t3.xlarge too, and I'll run it.

I have a weak theory that Postgres doesn't split the load across threads for our single query, so I don't believe it would cut times in half unless the CPUs themselves are very different here. I do think it could improve things a lot for concurrent queries, though.

Let's see what happens.
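A throwaway sketch (the endpoint URL is a placeholder, and a real read would hit the actual query RPC) of how we could compare instance sizes on concurrent throughput rather than single-query latency:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const workers = 16
	const duration = 30 * time.Second
	// Hypothetical read endpoint of the node under test.
	url := "http://node-under-test:8484/api/v1/query"

	var completed uint64
	deadline := time.Now().Add(duration)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				resp, err := http.Get(url)
				if err != nil {
					continue // count only successful round trips
				}
				resp.Body.Close()
				atomic.AddUint64(&completed, 1)
			}
		}()
	}
	wg.Wait()

	fmt.Printf("%d requests in %s (%.1f req/s)\n",
		completed, duration, float64(completed)/duration.Seconds())
}
```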

@markholdex
Collaborator

@outerlook when will the results with the new machine become available?

@outerlook
Contributor

I'll start it in the next few minutes. However, if it shows no difference, we'd better not commit to it (unless we spot a difference here).
It currently takes 13 hours to run it all, so I will probably post some values here tomorrow.

@markholdex
Collaborator

@outerlook any updates here?

@outerlook
Contributor

It errored in the upload process after 11 hours. I'll try running it again to see if it was a one-time issue. Sorry about the delay; I'm treating this one as a lower-priority task compared to recent events.

17T11:42:00.871Z_t3.xlarge.csv An error occurred (ExpiredToken) when calling the PutObject operation: The provided token has expired.\nfailed to run commands: exit status 1

@outerlook
Contributor

outerlook commented Sep 20, 2024

Now for t3.xlarge:

Branching factor = 8

t3.xlarge - get_index_change - Public

| queried days \ qty streams | 50 | 100 | 200 | 400 |
| --- | --- | --- | --- | --- |
| 1 | 58 | 119 | 216 | 441 |
| 7 | 86 | 175 | 328 | 720 |
| 30 | 224 | 431 | 864 | 1935 |
| 365 | 2216 | 4358 | 8837 | 18402 |

Branching factor = infinity

t3.xlarge - get_index_change - Public

| queried days \ qty streams | 50 | 100 | 200 | 400 |
| --- | --- | --- | --- | --- |
| 1 | 120 | 442 | 1947 | 11252 |
| 7 | 398 | 1539 | 7661 | 46209 |
| 30 | 1331 | 5746 | 29429 | 181908 |
| 365 | 13119 | 64697 | 341347 | 2163804 |

Conclusion: faster than the others, as expected, but only by a small percentage (~5%).
