Implement a new, experimental variant of LookupResources as LookupResources2 #1905
Conversation
41b5468 to b4b801f
Moved to draft due to steelthread discovering a pagination bug.
There is a unit test error:
…ources2 This implementation should be much faster for intersections, exclusions and caveats due to early tree shearing and check hints
Fixed
…dispatched resources when checking those already received from another dispatch Adds some parallelism back into LR2
Also adds additional testing to ensure check hints are used in LR2
Some early comments, still making my way through the PR
}

// For each entrypoint, load the necessary data and re-dispatch if a subproblem was found.
return withParallelizedStreamingIterableInCursor(ctx, ci, entrypoints, parentStream, crr.concurrencyLimit,
Not a blocker, since this warrants a bigger refactor, but I thought it was worth calling this out again given what I observed under load: concurrency limits are not very useful if we don't keep any state around about the available "capacity" of the process. E.g. in non-clustered mode this leads to large spikes in goroutines, because every local dispatch gets the same limit; in clustered mode it leads to the same situation, albeit lessened because there is more capacity. Ultimately this affects tail latencies, as the scheduler scrambles to handle all the scheduling spikes with whatever GOMAXPROCS it has defined. With sufficient load, and all the dispatching happening through wide relations, this can easily overload a cluster.
We could add it as a tracking field to the dispatch (since we're changing it here anyway) and decrease the counter on each dispatch
let's do that as a follow-up; we may want to generalize it for all APIs, and this could be the foundation for what the paper describes as "These safeguards include fine-grained cost accounting, quotas, and throttling". We could, e.g., limit clients by the number of dispatches.
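A minimal sketch of what that follow-up could look like, assuming a hypothetical per-request dispatch budget threaded through the dispatch metadata. The names and the counter-based approach are illustrative, not part of the current dispatch proto:

package dispatch

import (
	"errors"
	"sync/atomic"
)

// ErrBudgetExhausted is returned once a request has used up its dispatch allowance.
var ErrBudgetExhausted = errors.New("dispatch budget exhausted for this request")

// Budget is a hypothetical per-request counter of remaining dispatches.
// It would be created once per incoming API call and decremented on every
// dispatch (local or remote) issued on behalf of that call.
type Budget struct {
	remaining atomic.Int64
}

// NewBudget returns a budget allowing at most max dispatches.
func NewBudget(max int64) *Budget {
	b := &Budget{}
	b.remaining.Store(max)
	return b
}

// Spend consumes one unit of budget, returning an error once the allowance
// is exhausted so the caller can fail fast instead of fanning out more work.
func (b *Budget) Spend() error {
	if b.remaining.Add(-1) < 0 {
		return ErrBudgetExhausted
	}
	return nil
}

Carried in the dispatch metadata, this would let downstream servers see the remaining allowance for the overall request rather than each applying the same fixed per-dispatch limit.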
// If all the incoming edges are caveated, then the entire status has to be marked as a check
// is required. Otherwise, if there is at least *one* non-caveated incoming edge, then we can
// return the existing status as a short-circuit for those non-caveated found subjects.
if allCaveated {
	resources = append(resources, &v1.PossibleResource{
		ResourceId: resourceID,
		ForSubjectIds: subjectIDs,
		MissingContextParams: missingContextParameters.AsSlice(),
	})
} else {
	resources = append(resources, &v1.PossibleResource{
		ResourceId: resourceID,
		ForSubjectIds: nonCaveatedSubjectIDs,
	})
}
Why are we OK with returning a partial list here?
Let's say we do an LR call, but by mistake we missed some important caveat arguments. Now we are returning "this is the list of resources the subject has access to". The client has no way to know they missed arguments that could have augmented that list.
As the comment describes: if we find a path for the resource that is not caveated, then the result is by definition present, so we don't care about the caveated paths to that same resource.
Missed arguments don't matter if at least one of the paths to the resource has no caveats.
internal/caveats/run.go
Outdated
currentResult = syntheticResult{
	value: true,
	contextValues: map[string]any{},
	exprString: "",
	missingContextParams: []string{},
}
Suggested change:
-	currentResult = syntheticResult{
-		value: true,
-		contextValues: map[string]any{},
-		exprString: "",
-		missingContextParams: []string{},
-	}
+	currentResult.value = true
Can't; the type of the result is not a syntheticResult
Then:
var currentResult ExpressionResult = syntheticResult{
	value: cop.Op == core.CaveatOperation_AND,
	contextValues: map[string]any{},
	exprString: "",
	missingContextParams: []string{},
}
internal/caveats/run.go
Outdated
var contextValues map[string]any
var exprStringPieces []string

var currentResult ExpressionResult = syntheticResult{
	value: false,
	contextValues: map[string]any{},
Initializing syntheticResult.contextValues with an empty map does an unnecessary heap allocation. combineMaps does take care of a nil first argument when merging what has been found, which adds one extra unnecessary allocation.
I did some benchmarks, and combineMaps can be made faster and do one less allocation in the case where one of the arguments is nil or empty.
Given the caveat is being executed for each one of the tuples coming out of ReverseQueryRelationships via redispatchOrReportOverDatabaseQuery, I thought this could lead to N unnecessary allocations in the critical path.
func combineMaps(first map[string]any, second map[string]any) map[string]any {
if first == nil || len(first) == 0 {
return maps.Clone(second)
} else if second == nil || len(second) == 0 {
return maps.Clone(first)
}
cloned := maps.Clone(first)
maps.Copy(cloned, second)
return cloned
}
No need to clone them if they aren't changing
please write a benchmark like I did, and make sure it reduces allocations and CPU time for:
- first is nil, second is not
- first has elements, second does not
My code above moved the needle from 3 allocations to 2 and made it 50% faster
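For reference, a benchmark along these lines (the case names and map sizes are illustrative) would exercise the nil/empty fast paths and report allocations:

package caveats

import (
	"maps"
	"testing"
)

// combineMaps as proposed above: the fast paths avoid cloning when one side is nil or empty.
func combineMaps(first map[string]any, second map[string]any) map[string]any {
	if first == nil || len(first) == 0 {
		return maps.Clone(second)
	} else if second == nil || len(second) == 0 {
		return maps.Clone(first)
	}
	cloned := maps.Clone(first)
	maps.Copy(cloned, second)
	return cloned
}

func BenchmarkCombineMaps(b *testing.B) {
	filled := map[string]any{"a": 1, "b": 2, "c": 3}
	cases := []struct {
		name   string
		first  map[string]any
		second map[string]any
	}{
		{"first nil, second filled", nil, filled},
		{"first filled, second empty", filled, map[string]any{}},
		{"both filled", filled, map[string]any{"d": 4}},
	}
	for _, tc := range cases {
		b.Run(tc.name, func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				_ = combineMaps(tc.first, tc.second)
			}
		})
	}
}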
Since I now don't clone at all unless necessary, it should be one allocation fewer, even
checkHint, err := hints.HintForEntrypoint(
	rdc.entrypoint,
	resource.Resource.ResourceId,
	rdc.parentRequest.TerminalSubject,
	&v1.ResourceCheckResult{
		Membership: v1.ResourceCheckResult_MEMBER,
	})
This seems like a potentially large payload to send over dispatch. By default it would be 100 elements but we've discussed introducing a flag, and we've observed better results as that number increases.
Why does the hint need to be represented as N hint elements, instead of a more compact/normalized representation? Each one of the resources has the same entrypoint and the same terminal subject. The result is N protos with N copies of the same subject and object namespaces and relations. The only difference is the objectID, and that is also going to be duplicated in the ComputeBulkCheck dispatch operation.
Could we normalize this information, and if necessary, have the client-side derive these N hints?
Why does the hint need to be represented as N hint elements, instead of a more compact/normalized representation?
Because they are isolated hints
Each one of the resources has the same entrypoint and the same terminal subject.
Here, but that may not apply in the future. This is a generalized hint system.
Could we normalize this information, and if necessary, have the client-side derive these N hints?
Yes, I could have each hint support taking in multiple resource IDs, but that would require more complicated code, so it's a tradeoff
Because they are isolated hints
I don't understand what this means. Please elaborate.
Yes, I could have each hint support taking in multiple resource IDs, but that would require more complicated code, so it's a tradeoff
My tests indicate that under an LR2 workload, SpiceDB is spending 32% of CPU time in GC (21-23% with my optimizations in #1989). That's between a fourth and a third of CPU time spent collecting garbage (❗), and the vast majority of it is proto allocations.
That CPU time that goes to GC means less CPU time for the massive goroutine fan out LR2 tends to do with certain data shapes and sizes. In turn, it increases tail latencies.
My advice is to make the tradeoff and reduce allocations; the impact cannot be ignored and is globally an issue across all API endpoints. This is very relevant for deployments that have low CPU requests. Feel free to do it in a follow-up PR so this one does not stay open any longer.
Okay
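For that follow-up, a rough sketch of the compact form discussed above: a hypothetical hint shape that carries the shared data once plus a list of resource IDs, expanded back into per-resource hints on the receiving side. These types are stand-ins, not the actual dispatch protos:

package hints

// batchedCheckHint is a hypothetical, normalized form of a check hint:
// one copy of the shared relation/subject data plus the resource IDs it
// applies to, rather than N near-identical hint protos on the wire.
type batchedCheckHint struct {
	ResourceType    string
	Relation        string
	TerminalSubject string
	ResourceIDs     []string
	Membership      string
}

// perResourceHint mirrors today's one-hint-per-resource shape.
type perResourceHint struct {
	ResourceType    string
	Relation        string
	TerminalSubject string
	ResourceID      string
	Membership      string
}

// expand derives the N per-resource hints from the compact form on the
// receiving side, keeping the dispatched payload small.
func (b batchedCheckHint) expand() []perResourceHint {
	expanded := make([]perResourceHint, 0, len(b.ResourceIDs))
	for _, id := range b.ResourceIDs {
		expanded = append(expanded, perResourceHint{
			ResourceType:    b.ResourceType,
			Relation:        b.Relation,
			TerminalSubject: b.TerminalSubject,
			ResourceID:      id,
			Membership:      b.Membership,
		})
	}
	return expanded
}

The wire payload then grows by one string per extra resource instead of one full hint proto, which should cut most of the duplicated proto allocations noted above.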
internal/graph/check.go
Outdated
if len(req.CheckHints) > 0 {
	filteredResourcesIdsSet := mapz.NewSet(filteredResourcesIds...)
	for _, checkHint := range req.CheckHints {
		resourceID, ok := hints.AsCheckHintForComputedUserset(checkHint, req.ResourceRelation, req.Subject)
Hints can also be issued for tupleset-to-userset; why can we assume here that it's a hint for computed usersets?
checkInternal is going to loop over the CheckHints twice:
- once to filter the resources to dispatch
- once more during combination
You could return the result from the hint alongside the resourceID, build a map like the one combineWithComputedHints accepts, and call the latter instead. This means you don't have to iterate again over all the hints. This adds up, especially with wide relationships that lead to many batches of 100 elements.
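A rough sketch of that single-pass shape, using stand-in types and a hypothetical helper that also returns the hinted result; the real code would use the dispatch protos and the hints package:

package graph

// checkHint and resourceCheckResult are stand-ins for the real proto types;
// asHintWithResult stands in for AsCheckHintForComputedUserset extended to
// also return the hinted check result.
type checkHint struct{}
type resourceCheckResult struct{}

func asHintWithResult(h checkHint) (resourceID string, result resourceCheckResult, ok bool) {
	// A real implementation would match the hint against the relation/subject.
	return "", resourceCheckResult{}, false
}

// filterAndCollectHints walks the hints exactly once: it removes hinted
// resource IDs from the set still to be dispatched and, in the same pass,
// builds the map that the combination step accepts.
func filterAndCollectHints(hints []checkHint, toDispatch map[string]struct{}) map[string]resourceCheckResult {
	hinted := make(map[string]resourceCheckResult, len(hints))
	for _, h := range hints {
		resourceID, result, ok := asHintWithResult(h)
		if !ok {
			continue
		}
		delete(toDispatch, resourceID)
		hinted[resourceID] = result
	}
	return hinted
}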
why can we assume here it's a hint for computed usersets?
Because this is processing for the relation itself, and not the arrow.
You could return the result from the hint along side the resourceID...
Except when there aren't any hints; then we're doing work we don't need to do. It's also capping at maybe 1000 elements right now, which means the overhead is likely minimal. I can combine if you like, but it makes the code less readable for a very small improvement
Except when there aren't any hints, then we're doing work we don't need to do.
What work, creating an empty map?
It's also capping at maybe 1000 elements right now, which means the overhead is likely minimal. I can combine if you like, but it makes the code less readable for a very small improvement
How is it making it less readable? You already have a method that accepts a map.
Looked at as one call with 1000 elements, this wouldn't be much overhead. The issue is that a graph with many elements can lead to many, many dispatches, and I think every little bit counts.
internal/graph/check.go
Outdated
}

if req.OriginalRelationName != "" {
	resourceID, ok = hints.AsCheckHintForComputedUserset(checkHint, &core.RelationReference{
Please do not allocate a core.RelationReference here. Instead you can create a new AsCheckHintForComputedUserset method that receives the strings directly, and which the method with proto args can reuse.
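Roughly along these lines, with stand-in types since the real signatures live in the hints package and the dispatch protos:

package hints

// relationRef and checkHint are stand-ins for the proto types used by the
// real hints package; the point is the layering, not the exact fields.
type relationRef struct {
	Namespace string
	Relation  string
}

type checkHint struct {
	namespace, relation, resourceID string
}

// asCheckHintForComputedUsersetStrings does the matching on plain strings,
// so hot paths that already have the strings in hand avoid building a
// relation-reference struct (or proto) just to ask the question.
func asCheckHintForComputedUsersetStrings(hint checkHint, namespace, relation string) (string, bool) {
	if hint.namespace != namespace || hint.relation != relation {
		return "", false
	}
	return hint.resourceID, true
}

// asCheckHintForComputedUserset keeps the original, proto-shaped signature
// as a thin wrapper over the string-based variant for callers that prefer it.
func asCheckHintForComputedUserset(hint checkHint, rel *relationRef) (string, bool) {
	return asCheckHintForComputedUsersetStrings(hint, rel.Namespace, rel.Relation)
}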
I hate breaking nicer interfaces for optimization but oh well
You could have maintained the interface that used the proto, and just built it on the new signature
Yeah, but then it would have been used in exactly one location, so not worth the overhead IMO
LGTM, I've left some additional feedback, but we can address it in a follow-up
This implementation should be much faster for intersections, exclusions and caveats due to early tree shearing and check hints.
This change has three major components: