feat/scale-out: support for proxying GQL queries #2588

Draft: wants to merge 1 commit into main


vrajashkr (Contributor)

What type of PR is this?
feature

Which issue does this PR fix:
Towards #2434

What does this PR do / Why do we need it:
Previously, only dist-spec APIs were supported for scale-out: in a shared storage environment, the metadata is shared, so any instance can correctly respond to GQL queries because all the data is available to it.

In a local scale-out cluster deployment, the metadata store, in addition to the file storage, is isolated to each member of the cluster. As a result, GQL queries must also be proxied for UI and client requests to work as expected.

This change introduces a new GQL proxy and a handler for the GlobalSearch request that proxies the request to all members and collates their responses into a single response for the client.
Support for other GQL queries is pending.

Testing done on this change:
Unit Tests added.

Will this break upgrades or downgrades?
No, there shouldn't be any impact on upgrades or downgrades.

Does this PR introduce any user-facing change?:

A zot scale-out cluster deployed without shared storage is now supported. Queries from the UI and other clients are proxied across the cluster members to fetch the data for each request.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


Signed-off-by: Vishwas Rajashekar <[email protected]>
vrajashkr (Contributor Author)

vrajashkr commented Aug 4, 2024

There's still quite a bit of work to be done on this change. Sharing an early draft for review on the approach and handling.

I'll address the non-TODO style comments in the next commit.

return nil, err
}

resp, err := httpClient.Do(fwdRequest)

Check failure (Code scanning / CodeQL): Uncontrolled data used in network request (Critical). The URL of this request depends on a user-provided value.
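One common way to address this kind of finding is to never build the forwarding URL from request data directly, and instead resolve the target against the statically configured list of cluster members, with the path chosen by the server. A minimal sketch, assuming the member list comes from configuration; `memberAllowlist`, `newMemberAllowlist`, and `forwardURL` are hypothetical names, not zot's API, and whether this satisfies CodeQL's sanitizer model is not verified here.

```go
package main

import (
	"fmt"
	"net/url"
)

// memberAllowlist maps "scheme://host:port" to the configured member URL.
// Hypothetical helper for illustration.
type memberAllowlist map[string]*url.URL

func newMemberAllowlist(members []string) (memberAllowlist, error) {
	allow := memberAllowlist{}

	for _, m := range members {
		u, err := url.Parse(m)
		if err != nil {
			return nil, err
		}

		allow[u.Scheme+"://"+u.Host] = u
	}

	return allow, nil
}

// forwardURL builds the proxy target only if the requested member is one of
// the configured members; the path is fixed by the caller, never by the client.
func (a memberAllowlist) forwardURL(member, path string) (string, error) {
	u, err := url.Parse(member)
	if err != nil {
		return "", err
	}

	base, ok := a[u.Scheme+"://"+u.Host]
	if !ok {
		return "", fmt.Errorf("%q is not a configured cluster member", member)
	}

	target := *base // copy so the allowlist entry is not mutated
	target.Path = path
	target.RawQuery = ""

	return target.String(), nil
}

func main() {
	allow, err := newMemberAllowlist([]string{"http://10.0.0.1:8080", "http://10.0.0.2:8080"})
	if err != nil {
		panic(err)
	}

	fmt.Println(allow.forwardURL("http://10.0.0.2:8080", "/search"))
	fmt.Println(allow.forwardURL("http://evil.example", "/search"))
}
```

The URL that reaches `httpClient.Do` is then always derived from configuration; client input only selects which configured entry is used.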
@@ -13,6 +11,7 @@ import (
"zotregistry.dev/zot/pkg/api/constants"
"zotregistry.dev/zot/pkg/cluster"
"zotregistry.dev/zot/pkg/common"
"zotregistry.dev/zot/pkg/proxy"
Contributor

What is this split between pkg/api/proxy.go and pkg/proxy/proxy.go?

Contributor Author

During development, there was a circular import between the api package and the new GQL proxy in the extensions. To break it, I moved the generic proxy logic into its own package, which is now called from both the api package and the GQL proxy package.

That said, I do agree that the file naming could potentially be better.

StarCount int `json:"starCount"`
DownloadCount int `json:"downloadCount"`
NewestImage ImageSummary `json:"newestImage"`
Name string `json:"Name"` //nolint:tagliatelle // graphQL schema
Contributor

Why was this change needed?

Contributor Author

Previously, the GQL server handled serialization of the response data. Now that the proxy handler responds on behalf of the GQL server (after proxying), standard Go JSON serialization is used instead.

The GQL schema uses keys that start with an uppercase letter, but the JSON tag convention enforced by the linter (tagliatelle) is camelCase, hence the nolint.

Example:

{
    "errors": [
        {
            "message": "unable to run vulnerability scan on tag v0.0.19.231225-squashfs in repo machine/bootkit/rootfs: error: image 'machine/bootkit/rootfs@sha256:9efb1b9dd349e3fc5fa2dd658354a395a28ac7e391c85a72b39384da7a5ec7a1' scanning is not supported for given image media type",
            "path": [
                "GlobalSearch"
            ]
        },
        {
            "message": "unable to run vulnerability scan in repo machine/bootkit/rootfs: manifest digest: sha256:9efb1b9dd349e3fc5fa2dd658354a395a28ac7e391c85a72b39384da7a5ec7a1, error: image 'machine/bootkit/rootfs@sha256:9efb1b9dd349e3fc5fa2dd658354a395a28ac7e391c85a72b39384da7a5ec7a1' scanning is not supported for given image media type",
            "path": [
                "GlobalSearch"
            ]
        }
    ],
    "data": {
        "GlobalSearch": {
            "Page": {
                "TotalCount": 37,
                "ItemCount": 3
            },
            "Repos": [
                {
                    "Name": "machine/bootkit/rootfs",
                    "LastUpdated": "2023-12-25T15:31:59.110429376Z",
                    "Size": "319472431",
                    "Platforms": [
                        {
                            "Os": "linux",
                            "Arch": "amd64"
                        }
                    ],
                    "IsStarred": false,
                    "IsBookmarked": false,
                    "NewestImage": {
                        "Tag": "v0.0.19.231225-squashfs",
                        "Vulnerabilities": {
                            "MaxSeverity": "",
                            "Count": 0
                        },
                        "Description": "A minimal bootable root filesystem",
                        "IsSigned": false,
                        "SignatureInfo": [],
                        "Licenses": "GPLv2 and others",
                        "Vendor": "project-machine",
                        "Labels": ""
                    },
                    "StarCount": 0,
                    "DownloadCount": 5285
                },
                {
                    "Name": "c3/ubuntu/base-amd64",
                    "LastUpdated": "2024-03-01T00:46:16.186838886Z",
                    "Size": "273201849",
                    "Platforms": [
                        {
                            "Os": "linux",
                            "Arch": "amd64"
                        }
                    ],
                    "IsStarred": false,
                    "IsBookmarked": false,
                    "NewestImage": {
                        "Tag": "jammy",
                        "Vulnerabilities": {
                            "MaxSeverity": "",
                            "Count": 0
                        },
                        "Description": "base is a minimal glibc-based Linux system",
                        "IsSigned": false,
                        "SignatureInfo": [],
                        "Licenses": "GPL-2.0-or-later",
                        "Vendor": "Cisco Systems, Inc.",
                        "Labels": ""
                    },
                    "StarCount": 0,
                    "DownloadCount": 260
                },
                {
                    "Name": "tools/busybox",
                    "LastUpdated": "2022-10-04T18:22:45.289257759Z",
                    "Size": "773920",
                    "Platforms": [
                        {
                            "Os": "linux",
                            "Arch": "amd64"
                        }
                    ],
                    "IsStarred": false,
                    "IsBookmarked": false,
                    "NewestImage": {
                        "Tag": "1.34.1",
                        "Vulnerabilities": {
                            "MaxSeverity": "",
                            "Count": 0
                        },
                        "Description": "",
                        "IsSigned": false,
                        "SignatureInfo": [],
                        "Licenses": "",
                        "Vendor": "",
                        "Labels": ""
                    },
                    "StarCount": 0,
                    "DownloadCount": 112
                }
            ]
        }
    }
}
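To illustrate why the tag matters: `encoding/json` uses the tag name verbatim as the object key when marshaling, so only `json:"Name"` reproduces the uppercase keys shown above. A minimal sketch with two hypothetical structs (not zot's types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// camelCase tags, the style the linter normally expects.
type repoCamel struct {
	Name          string `json:"name"`
	DownloadCount int    `json:"downloadCount"`
}

// Tags matching the GQL schema's field names.
type repoSchema struct {
	Name          string `json:"Name"`          //nolint:tagliatelle // graphQL schema
	DownloadCount int    `json:"DownloadCount"` //nolint:tagliatelle // graphQL schema
}

func mustMarshal(v any) string {
	out, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}

	return string(out)
}

func main() {
	fmt.Println(mustMarshal(repoCamel{Name: "tools/busybox", DownloadCount: 112}))
	// {"name":"tools/busybox","downloadCount":112}
	fmt.Println(mustMarshal(repoSchema{Name: "tools/busybox", DownloadCount: 112}))
	// {"Name":"tools/busybox","DownloadCount":112}
}
```

Decoding is more forgiving: `json.Unmarshal` also accepts case-insensitive key matches, so either struct can read the schema's payload; only the encoded output differs.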

}

proxyBody, err := io.ReadAll(proxyResponse.Body)
proxyResponse.Body.Close()
Contributor

Close it right here, or defer this?

Contributor Author

Since the body is fully consumed by ReadAll, there's no need to keep it open until the function returns, so we can close it right away.

}

// aggregate errors
collatedResult.Errors = append(collatedResult.Errors, proxyRespData.Errors...)
Contributor

So we now have a situation where we may have a good result mixed with errors. This could be problematic or confusing.

Contributor Author

I agree, though a response that contains errors is not necessarily a failed response. For example, here's a snippet from zothub:

{
    "errors": [
        {
            "message": "unable to run vulnerability scan on tag v0.0.19.231225-squashfs in repo machine/bootkit/rootfs: error: image 'machine/bootkit/rootfs@sha256:9efb1b9dd349e3fc5fa2dd658354a395a28ac7e391c85a72b39384da7a5ec7a1' scanning is not supported for given image media type",
            "path": [
                "GlobalSearch"
            ]
        },
        {
            "message": "unable to run vulnerability scan in repo machine/bootkit/rootfs: manifest digest: sha256:9efb1b9dd349e3fc5fa2dd658354a395a28ac7e391c85a72b39384da7a5ec7a1, error: image 'machine/bootkit/rootfs@sha256:9efb1b9dd349e3fc5fa2dd658354a395a28ac7e391c85a72b39384da7a5ec7a1' scanning is not supported for given image media type",
            "path": [
                "GlobalSearch"
            ]
        }
    ],
    "data": {
        "GlobalSearch": {
            "Page": {
                "TotalCount": 37,
                "ItemCount": 3
            },
            "Repos": [
                {
                    "Name": "machine/bootkit/rootfs",
                    "LastUpdated": "2023-12-25T15:31:59.110429376Z",
                    "Size": "319472431",
                    "Platforms": [
                        {
                            "Os": "linux",
                            "Arch": "amd64"
                        }
                    ],
                    "IsStarred": false,
                    "IsBookmarked": false,
                    "NewestImage": {
                        "Tag": "v0.0.19.231225-squashfs",
                        "Vulnerabilities": {
                            "MaxSeverity": "",
                            "Count": 0
                        },
                        "Description": "A minimal bootable root filesystem",
                        "IsSigned": false,
                        "SignatureInfo": [],
                        "Licenses": "GPLv2 and others",
                        "Vendor": "project-machine",
                        "Labels": ""
                    },
                    "StarCount": 0,
                    "DownloadCount": 5285
                },

The data is valid, but there are some errors.
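The GraphQL specification allows a response to carry both `data` and `errors` (a partial result), so appending each member's errors next to the merged data keeps the collated payload well-formed. A minimal sketch of that collation, assuming a heavily simplified payload where only `Repos` and the `Page` counts are merged; all types and the `collate` function are hypothetical and much smaller than `GlobalSearchResultResp`. Transport or decoding failures are reported as additional GraphQL errors, so one unreachable member does not fail the whole query.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified, hypothetical payload types; uppercase tags mirror the GQL schema.
type gqlError struct {
	Message string   `json:"message"`
	Path    []string `json:"path,omitempty"`
}

type repo struct {
	Name string `json:"Name"`
}

type page struct {
	TotalCount int `json:"TotalCount"`
	ItemCount  int `json:"ItemCount"`
}

type globalSearch struct {
	Page  page   `json:"Page"`
	Repos []repo `json:"Repos"`
}

type searchData struct {
	GlobalSearch globalSearch `json:"GlobalSearch"`
}

type searchResp struct {
	Errors []gqlError `json:"errors,omitempty"`
	Data   searchData `json:"data"`
}

// memberBody is one member's raw reply, or the transport error that prevented it.
type memberBody struct {
	Member string
	Body   []byte
	Err    error
}

// collate merges member replies: repos are concatenated, counts are summed,
// and every error (GraphQL, transport, or decoding) is kept.
func collate(replies []memberBody) searchResp {
	out := searchResp{Data: searchData{GlobalSearch: globalSearch{Repos: []repo{}}}}

	for _, r := range replies {
		if r.Err != nil {
			out.Errors = append(out.Errors, gqlError{Message: fmt.Sprintf("member %s: %v", r.Member, r.Err)})

			continue
		}

		var resp searchResp
		if err := json.Unmarshal(r.Body, &resp); err != nil {
			out.Errors = append(out.Errors, gqlError{Message: fmt.Sprintf("member %s: invalid response: %v", r.Member, err)})

			continue
		}

		// A member's field errors do not invalidate its data (partial result).
		out.Errors = append(out.Errors, resp.Errors...)

		gs := &out.Data.GlobalSearch
		gs.Repos = append(gs.Repos, resp.Data.GlobalSearch.Repos...)
		gs.Page.TotalCount += resp.Data.GlobalSearch.Page.TotalCount
		gs.Page.ItemCount += resp.Data.GlobalSearch.Page.ItemCount
	}

	return out
}

func main() {
	a := `{"errors":[{"message":"scanning is not supported","path":["GlobalSearch"]}],
	       "data":{"GlobalSearch":{"Page":{"TotalCount":1,"ItemCount":1},"Repos":[{"Name":"tools/busybox"}]}}}`
	b := `{"data":{"GlobalSearch":{"Page":{"TotalCount":1,"ItemCount":1},"Repos":[{"Name":"c3/ubuntu/base-amd64"}]}}}`

	out := collate([]memberBody{
		{Member: "a", Body: []byte(a)},
		{Member: "b", Body: []byte(b)},
		{Member: "c", Err: fmt.Errorf("connection refused")},
	})

	enc, err := json.MarshalIndent(out, "", "  ")
	if err != nil {
		panic(err)
	}

	fmt.Println(string(enc))
}
```

Concatenating repos and summing counts is only correct when members hold disjoint repositories, and this sketch ignores the client's requested page size and sort order; those would still need handling after collation.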

rchincha (Contributor)

rchincha commented Aug 5, 2024

Overall the patch is not super-complicated.

@@ -214,14 +214,14 @@ type GlobalSearchResultResp struct {
}

type GlobalSearchResult struct {
Contributor

andaaron, Aug 6, 2024

Don't these property renames mean we break backwards compatibility with older zots? Do we care about that?

Contributor Author

From what I gathered, GlobalSearch and other such payloads that are part of the GQL schema are handled entirely by the GQL server.

Ideally, this implementation should send the same payload as the GQL server, just aggregated.

Not sure if I fully understood your question, though.

Contributor

So you mean this code did not produce the same payload as the GQL server before the change?
In that case, we should definitely fix this issue.

vrajashkr (Contributor Author)

Update on this:
As an alternative, we looked into putting the fan-out and proxy logic inside the resolver code, using closures to hold the request/response data.

However, the resolver has a signature like the following (pkg/extensions/search/resolver.go):

func globalSearch(ctx context.Context, query string, metaDB mTypes.MetaDB, filter *gql_generated.Filter,
	requestedPage *gql_generated.PageInput, cveInfo cveinfo.CveInfo, log log.Logger, //nolint:unparam
) (*gql_generated.PaginatedReposResult, []*gql_generated.ImageSummary, []*gql_generated.LayerSummary, error,
) {

This function appears to be called by the generated resolver inside the GQL server, where we have no control over the behaviour.

Next items:

  1. Explore the current GQL server to see if there is any way to expose the request/response data.
  2. Explore whether the current GQL server can support proxying on its own.
  3. See if there are any other projects out there trying to proxy GQL requests.
