Performance best practices when integrating OpenFGA with a GraphQL API #202
-
We are currently evaluating OpenFGA as a centralised authorization service. I really like what you've done here so far 💪 However, when running some performance tests, we had some findings that made us unsure whether this is the right solution for us, and I am trying to find out whether (a) we are using OpenFGA in the intended way, and if so, whether (b) the results are as expected, or (c) we are doing something wrong. I don't have much experience with other centralised AuthZ systems, hence little to compare it to.

Setup

My test setup is currently fairly simple. It consists of:
Auth model & Tuples
Valid check request
Results, per request (using DDosify to test):
(These times were measured in the NodeJS service. NodeJS itself did not seem to be the bottleneck. I also did some smoke tests checking the added latency of our network: it takes ~2-3ms to establish a TCP connection, so I don't think we can (solely) blame our network. 🙂) I expected OpenFGA to add some overhead, since each check needs to execute an HTTP call. But the above metrics raise a few questions:
Any advice is appreciated. 🙏 Thanks in advance.
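For reference, the "valid check request" mentioned in the setup above would look roughly like the minimal sketch below, using the @openfga/sdk client from a Node.js service. The store id, model id and tuple values are placeholders, not the actual ones from this test setup.

```ts
import { OpenFgaClient } from '@openfga/sdk';

// Placeholder configuration; the real store/model ids from this setup are not shown here.
const fga = new OpenFgaClient({
  apiUrl: 'http://openfga:8080',      // assumed in-cluster address of the OpenFGA HTTP API
  storeId: '<store-id>',
  authorizationModelId: '<model-id>', // pinning a model id avoids resolving the latest model on every call
});

// A single authorization check; each call is one HTTP round trip to OpenFGA.
const { allowed } = await fga.check({
  user: 'user:anne',
  relation: 'viewer',
  object: 'document:roadmap',
});

if (!allowed) {
  throw new Error('Forbidden');
}
```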
-
Hi @patricknick, thanks for all the context you provided us! A few follow-up questions:
BTW, we'll be publishing an OpenFGA version in the next few days that will include caching improvements, which will improve your results (openfga/openfga#891).
-
@patricknick would you be OK having a quick call with the team so we can pair on it? If so, please send me an email at [email protected] and we can schedule it.
-
@patricknick Just to clarify, your initial issue mentions high-latency Check calls for Checks of the form
Are the numbers and graphs you are reporting measuring a single Check call of the form initially called out in your first issue, or are these numbers reporting Check latency for the various Checks involved in the project and attribute checks? If the latter, what do the Checks look like for those attributes, and do you have some sample tuples exhibiting the kinds of tuples involved in those attribute checks?

When I look at the pgAdmin screenshot you provided, it doesn't make sense that the number of active connections is nearly 0 throughout the duration of the time series. If you have active OpenFGA queries in flight, then there should always be a pretty decent number of active connections. How was this diagram produced?

As general guidance, I don't recommend using the postgresql subchart to manage your Postgres instance for OpenFGA. A more representative benchmark for OpenFGA would be against a more production-ready database deployment; the default values of the Helm chart are mostly just to get you started. However, some things you may try out of the box before doing anything further are to:
-
OK, thanks for the clarification. That makes more sense. Good news here as well: the upcoming
Let me know what you find 😄 I'm glad the suggested changes helped a lot! Those graphs look much better and visually appear to be behaving as we'd hope. What you don't want to see is a lot of "churning" of active and idle connections, and in this case we don't, which is great!
You're always going to see some bursts, but so long as the p99 is well within reason you should feel some confidence. The bursts are most likely coming from connection contention at the database layer. If you have
Good question! Our upcoming release
-
@patricknick would you mind summarizing the state of the issue? Also, if you can draft a short writeup explaining how to use OpenFGA with GraphQL, the community would be extremely grateful :) Thanks!
-
First of all, thanks to @jon-whit and @aaguiarz for taking the time and for all the help and insights! 🙏 I've tried to summarise everything I learned so far.

Summary of this discussion

Newest performance metrics

Thanks to @jon-whit's suggestions and the newest version 1.3.1 of OpenFGA, we managed to achieve quite a significant performance improvement.
Note: We achieved these performance metrics with the Helm-deployed Postgres database. However, this DB is intended for developer environments and is not production-ready. A production-ready DB, scaled appropriately, might achieve even better numbers.

How to fine-tune OpenFGA performance

Generally, the performance of OpenFGA depends on a few factors.
There are several ways to fine-tune OpenFGA performance:
Best practices with GraphQL

With GraphQL, there are additional complexities that need to be taken into account. Generally, one needs to look out for:
Depending on these variables, a single GraphQL request can result in many authorisation checks (with any centralised authorisation service, not only in combination with OpenFGA). Thus, you just end up multiplying the aforementioned "challenges" of OpenFGA. (Note: It is a bit unfair to blame GraphQL here. GraphQL allows you to request many resources in a single request; if you requested the same resources using REST, you would probably end up with the same challenges.) There are several options for organising authorisation in combination with GraphQL; one common pattern is sketched below.
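As one illustration (a hedged sketch, not necessarily the approach used in this discussion): wrapping OpenFGA checks in a per-request DataLoader deduplicates identical checks within a single GraphQL query and fires distinct checks concurrently instead of one by one per resolver. The ids and object types below are hypothetical.

```ts
import DataLoader from 'dataloader';
import { OpenFgaClient } from '@openfga/sdk';

// Placeholder configuration; ids and object types are illustrative only.
const fga = new OpenFgaClient({
  apiUrl: 'http://openfga:8080',
  storeId: '<store-id>',
  authorizationModelId: '<model-id>',
});

interface CheckKey {
  user: string;     // e.g. "user:anne"
  relation: string; // e.g. "viewer"
  object: string;   // e.g. "project:123"
}

// One loader per incoming GraphQL request: identical checks within the same
// query are served from DataLoader's cache, and distinct checks are sent to
// OpenFGA concurrently rather than sequentially in each resolver.
export function createCheckLoader() {
  return new DataLoader<CheckKey, boolean, string>(
    async (keys) =>
      Promise.all(
        keys.map((key) => fga.check(key).then((res) => res.allowed ?? false)),
      ),
    { cacheKeyFn: (k) => `${k.user}|${k.relation}|${k.object}` },
  );
}

// In a resolver (the loader is created once per request, e.g. in the GraphQL context):
//
//   const allowed = await context.checkLoader.load({
//     user: `user:${context.userId}`,
//     relation: 'viewer',
//     object: `project:${project.id}`,
//   });
//   if (!allowed) throw new Error('Forbidden');
```

Whether the loader issues one check per key (as above) or uses a batch endpoint is an implementation detail; the main point is deduplicating and parallelising the checks that a single GraphQL request triggers.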
-
Thanks a lot for the extremely detailed answer, @patricknick!!