Performance best practices when integrating OpenFGA with a GraphQL API #202
-
We are currently evaluating OpenFGA as a centralised authorization service. I really like what you've done here so far 💪 However, when running some performance tests, we had some findings that made us unsure whether this is the right solution for us, and I am trying to find out whether (a) we are using OpenFGA in the intended way, and if so, whether (b) the results are as expected, or (c) we are doing something wrong. I don't have much experience with other centralised AuthZ systems, hence little to compare it to.

Setup

My test setup is currently fairly simple. It consists of:
Auth model & Tuples
Valid check request
Results, per request (using DDosify to test):
(These times were measured in the NodeJS service. NodeJS itself did not seem to be the bottleneck. I also did some smoke tests checking the added latency of our network: it takes ~2-3ms to establish a TCP connection, so I don't think we can (solely) blame our network. 🙂) I expected OpenFGA to add some overhead, since each check needs to execute an HTTP call. But the above metrics raise a few questions:
Any advice is appreciated. 🙏 Thanks in advance.
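For reference, the "valid check request" mentioned in the setup above would look roughly like the minimal sketch below, using the @openfga/sdk client from a Node.js service. The store id, model id and tuple values are placeholders, not the actual ones from this test setup.

```ts
import { OpenFgaClient } from '@openfga/sdk';

// Placeholder configuration; the real store/model ids from this setup are not shown here.
const fga = new OpenFgaClient({
  apiUrl: 'http://openfga:8080',      // assumed in-cluster address of the OpenFGA HTTP API
  storeId: '<store-id>',
  authorizationModelId: '<model-id>', // pinning a model id avoids resolving the latest model on every call
});

// A single authorization check; each call is one HTTP round trip to OpenFGA.
const { allowed } = await fga.check({
  user: 'user:anne',
  relation: 'viewer',
  object: 'document:roadmap',
});

if (!allowed) {
  throw new Error('Forbidden');
}
```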
-
Hi @patricknick, thanks for all the context you provided us! A few follow-up questions:
BTW, we'll be publishing an OpenFGA version in the next few days that will include caching improvements, which will improve your results (openfga/openfga#891).
-
@patricknick would you be OK having a quick call with the team so we can pair on it? If so, please send me an email at [email protected] and we can schedule it.
-
@patricknick Just to clarify, your initial issue mentions high-latency Check calls for Checks of the form
Are the numbers and graphs you are reporting measuring a single Check call of the form initially called out in your first issue, or are these numbers reporting Check latency for the various Checks involved in the project and attribute checks? If the latter, what do the Checks look like for those attributes, and do you have some sample tuples exhibiting the kinds of tuples involved in those attribute checks?

When I look at the pgAdmin screenshot you provided, it doesn't make sense that the number of active connections is nearly 0 throughout the duration of the time series. If you have active OpenFGA queries in flight, then there should always be a pretty decent number of active connections. How was this diagram produced?

As general guidance, I don't recommend using the postgresql subchart to manage your Postgres instance for OpenFGA. A more representative benchmark for OpenFGA would be against a more production-ready database deployment; the default values of the Helm chart are mostly just to get you started. However, some things you may try out of the box before doing anything further are to:
-
OK, thanks for the clarification. That makes more sense. Good news here as well: the upcoming
Let me know what you find 😄 I'm glad the suggested changes helped a lot! Those graphs look much better and visually appear to be behaving as we'd hope. What you don't want to see is a lot of "churning" of active and idle connections, and in this case we don't, which is great!
You're always going to see some bursts, but so long as the p99 is well within reason you should feel some confidence. The bursts are most likely coming from connection contention at the database layer. If you have
Good question! Our upcoming release
-
@patricknick would you mind summarizing the state of the issue? Also, if you can draft a short writeup explaining how to use OpenFGA with GraphQL, the community would be extremely grateful :) Thanks!
-
First of all, thanks to @jon-whit and @aaguiarz for taking the time and for all the help and insights! 🙏 I've tried to summarise everything I learned so far.

Summary of this discussion

Newest performance metrics

Thanks to @jon-whit's suggestions and the newest version 1.3.1 of OpenFGA, we managed to achieve quite a significant performance improvement.
Note: We achieved these performance metrics with the Helm-deployed Postgres database. However, this DB is intended for developer environments and is not production-ready. A production-ready DB, scaled appropriately, might achieve even better numbers.

How to fine-tune OpenFGA performance

Generally, the performance of OpenFGA depends on a few factors.
There are several ways to fine-tune OpenFGA performance:
Best practices with GraphQL

With GraphQL, there are additional complexities that need to be taken into account. Generally, one needs to look out for:
Depending on these variables, a single GraphQL request can result in many authorisation checks (with any centralised authorisation service, not only in combination with OpenFGA). Thus, you just end up multiplying the aforementioned "challenges" of OpenFGA. (Note: It is a bit unfair to blame GraphQL here. GraphQL allows you to request many resources in a single request; if you requested the same resources using REST, you would probably end up with the same challenges.) There are several options for organising authorisation in combination with GraphQL; one common pattern is sketched below.
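As one illustration (a hedged sketch, not necessarily the approach used in this discussion): wrapping OpenFGA checks in a per-request DataLoader deduplicates identical checks within a single GraphQL query and fires distinct checks concurrently instead of one by one per resolver. The ids and object types below are hypothetical.

```ts
import DataLoader from 'dataloader';
import { OpenFgaClient } from '@openfga/sdk';

// Placeholder configuration; ids and object types are illustrative only.
const fga = new OpenFgaClient({
  apiUrl: 'http://openfga:8080',
  storeId: '<store-id>',
  authorizationModelId: '<model-id>',
});

interface CheckKey {
  user: string;     // e.g. "user:anne"
  relation: string; // e.g. "viewer"
  object: string;   // e.g. "project:123"
}

// One loader per incoming GraphQL request: identical checks within the same
// query are served from DataLoader's cache, and distinct checks are sent to
// OpenFGA concurrently rather than sequentially in each resolver.
export function createCheckLoader() {
  return new DataLoader<CheckKey, boolean, string>(
    async (keys) =>
      Promise.all(
        keys.map((key) => fga.check(key).then((res) => res.allowed ?? false)),
      ),
    { cacheKeyFn: (k) => `${k.user}|${k.relation}|${k.object}` },
  );
}

// In a resolver (the loader is created once per request, e.g. in the GraphQL context):
//
//   const allowed = await context.checkLoader.load({
//     user: `user:${context.userId}`,
//     relation: 'viewer',
//     object: `project:${project.id}`,
//   });
//   if (!allowed) throw new Error('Forbidden');
```

Whether the loader issues one check per key (as above) or uses a batch endpoint is an implementation detail; the main point is deduplicating and parallelising the checks that a single GraphQL request triggers.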
-
Thanks a lot for the extremely detailed answer, @patricknick!!