
MGMT-17997: Resource server doesn't start on deployment #101

Conversation

@irinamihai (Collaborator) commented May 31, 2024

Description:

  • add backend-token-file parameter to the resource-server
  • update the operator resourceServer to not assume the open-cluster-management namespace for ACM and dynamically obtain the searchAPI
  • restructure the O2IMS CRD to have a separate structure that holds the configuration for each server
    • each server structure contains a new ServerConfig struct meant to hold parameters common to all the servers
  • add the parameters used to start the servers to the CR status

The operator was tested locally using a personal quay image that included the resource server changes.
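As a rough illustration of the last bullet (recording the start parameters in the CR status), here is a minimal sketch with illustrative type and field names — an assumption based on this thread, not the exact PR diff:

// UsedServerConfig captures the arguments each server was started with.
// The names here are hypothetical; see the PR diff for the actual layout.
type UsedServerConfig struct {
        //+optional
        MetadataServerUsedConfig []string `json:"metadataServerUsedConfig,omitempty"`
        //+optional
        ResourceServerUsedConfig []string `json:"resourceServerUsedConfig,omitempty"`
}

// The CR status would then embed it:
type ORANO2IMSStatus struct {
        //+optional
        UsedServerConfig UsedServerConfig `json:"usedServerConfig,omitempty"`
}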

@irinamihai requested review from jhernand and danielerez, May 31, 2024 02:16
@openshift-ci-robot commented May 31, 2024

@irinamihai: This pull request references MGMT-17997 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

Description:

  • add backend-token-file parameter to the resource-server
  • update the operator to dynamically:
    • determine the search API from the IngressHost
    • use the backend-token if specified in the ORANO2IMS CR and, if not, use the backend-token-file
  • ToDo:
    • add conditions for the resource server
    • better operator error management

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@irinamihai (Collaborator Author)

/hold

@openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on May 31, 2024
@openshift-ci-robot commented May 31, 2024

@irinamihai: This pull request references MGMT-17997 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

Description:

  • add backend-token-file parameter to the resource-server
  • update the operator to dynamically:
    • determine the search API from the IngressHost
    • use the backend-token if specified in the ORANO2IMS CR and, if not, use the backend-token-file
  • ToDo:
    • add conditions for the resource server
    • better operator error management
    • update unit tests


@openshift-ci-robot commented May 31, 2024

@irinamihai: This pull request references MGMT-17997 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

Description:

  • add backend-token-file parameter to the resource-server
  • update the operator to dynamically:
    • determine the search API from the IngressHost
    • use the backend-token if specified in the ORANO2IMS CR and, if not, use the backend-token-file
  • ToDo:
    • add conditions for the resource server
    • better operator error management
    • update unit tests & fix lint


@irinamihai force-pushed the MGMT-17997-resource-server branch from ab40072 to c1bfbcc on May 31, 2024 22:38
)
nextReconcile = ctrl.Result{RequeueAfter: 30 * time.Second}
}

Collaborator Author

Extra line

@openshift-ci-robot commented May 31, 2024

@irinamihai: This pull request references MGMT-17997 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

Description:

  • add backend-token-file parameter to the resource-server
  • update the operator to dynamically:
    • determine the search API from the IngressHost
    • use the backend-token if specified in the ORANO2IMS CR and, if not, use the backend-token-file
  • ToDo:
    • update unit tests & fix lint


@irinamihai force-pushed the MGMT-17997-resource-server branch 2 times, most recently from 5c8e510 to 953a6b2 on June 3, 2024 22:36
@openshift-ci-robot commented Jun 3, 2024

@irinamihai: This pull request references MGMT-17997 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

Description:

  • add backend-token-file parameter to the resource-server
  • update the operator resourceServer to not assume the
    open-cluster-management namespace for ACM and dynamically
    obtain the searchAPI
  • ToDo:
    • add separate CRD structure for the ResourceServer & update the
      code accordingly
    • update CRD conditions to account for different error scenarios
    • update tests
    • fix lint


//+optional
BackendToken string `json:"backendToken,omitempty"`
}

Collaborator

Can we use a structure like this for all the other servers as well, instead of the boolean that we use today? I'd suggest that we create a base structure containing only the Enabled boolean, and then create structs specific to each server that embed it. For example:

type ServerConfig struct {
        // Enabled indicates if the server should be started.
        //
        // +kubebuilder:default=true
        Enabled bool `json:"enabled"`
}

// ResourceServerConfig contains the configuration for the resource server.
type ResourceServerConfig struct {
        ServerConfig `json:",inline"`

        //+optional
        BackendURL string `json:"backendURL,omitempty"`
        //+optional
        BackendToken string `json:"backendToken,omitempty"`
}

// MetadataServerConfig contains the configuration for the metadata server.
type MetadataServerConfig struct {
        ServerConfig `json:",inline"`
}

// Same for all the other servers.

This will give us room to add other common parameters in the future. For example, at some point I would like to add logging settings:

type ServerConfig struct {
        // Enabled indicates if the server should be started.
        //
        // +kubebuilder:default=true
        Enabled bool `json:"enabled"`

        // LogLevel indicates the log level.
        //
        // +kubebuilder:default=info
        LogLevel string `json:"logLevel"`
}

Collaborator Author

Yes, that was my plan as well. Initially I was thinking of leaving the common settings directly in the spec, but we can have another structure if you prefer.

Collaborator Author

Done.

"strings"
)

func processBackendToken(
Collaborator

If we are going to extract this logic to a common place I'd suggest we do it for both adding the flags and getting them, something similar to what we do for the network listening flags. For example, we could have a function to add the flags to a flag set:

// AddTokenFlags adds to the given flagset the flags needed to configure a token ...
func AddTokenFlags(set *pflag.FlagSet, name string) {
         ...
}

And then a function to get the value of the token from a flag set:

// GetTokenFlag gets the value of a token flag ...
func GetTokenFlag(set *pflag.FlagSet, name string) (result string, err error) {
        ...
}

That way if we ever need to change how we add or get these flags it will all be in the same place.

Collaborator Author

Updated. @jhernand , kindly confirm this is what you had in mind.

Collaborator

Looks good. I think it can be even better if we pass the pflag.FlagSet object to the GetTokenFlag function because then that function can also extract the values of the flags from that flag set.
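Putting the two suggestions together, here is a minimal sketch of the helper pair, assuming pflag and the backend-token / backend-token-file flag names used elsewhere in this PR; the bodies are illustrative, not the merged code:

import (
        "fmt"
        "os"

        "github.com/spf13/pflag"
)

// AddTokenFlags adds to the given flag set the flags needed to configure a
// backend token, either directly or through a file.
func AddTokenFlags(set *pflag.FlagSet) {
        set.String("backend-token", "", "Token used to authenticate to the backend.")
        set.String("backend-token-file", "", "File containing the token used to authenticate to the backend.")
}

// GetTokenFlag extracts the token from the given flag set, falling back to
// reading the token file when a literal token isn't provided.
func GetTokenFlag(set *pflag.FlagSet) (string, error) {
        token, err := set.GetString("backend-token")
        if err != nil {
                return "", err
        }
        tokenFile, err := set.GetString("backend-token-file")
        if err != nil {
                return "", err
        }
        switch {
        case token != "" && tokenFile != "":
                return "", fmt.Errorf(
                        "flags 'backend-token' and 'backend-token-file' are incompatible, " +
                                "only one of them may be provided")
        case token != "":
                return token, nil
        case tokenFile != "":
                data, err := os.ReadFile(tokenFile)
                if err != nil {
                        return "", fmt.Errorf("failed to read backend token file '%s': %w", tokenFile, err)
                }
                return string(data), nil
        default:
                return "", fmt.Errorf(
                        "a token must be provided via either the 'backend-token' or " +
                                "'backend-token-file' flag")
        }
}

With this shape, callers only hand over the flag set, and any change to how the token is resolved stays in one place.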

slog.String("!token", backendToken),
slog.String("token_file", backendTokenFile),
)
return "", errors.New("backendToken and backendTokenFile both provided")
Collaborator

If we are going to report flag-related issues by returning errors, then I think those errors should be more explicit about what caused them, closer to what we write to the log. For example:

fmt.Errorf(
        "backend token flag '%s' and token file flag '%s' have both been provided, "+
                "but they are incompatible",
        backendTokenFlagName, backendTokenFileFlagName,
)

Collaborator Author

Done

slog.String("file", backendTokenFile),
slog.String("error", err.Error()),
)
return "", errors.New("failed to read backend token file")
Collaborator

Here I think we should include in the error message the name of the file and the original error:

fmt.Errorf(
        "failed to read backend token file '%s': %w",
        backendTokenFile, err,
)

// Check that we have a token:
if backendToken == "" {
logger.ErrorContext(ctx, "Backend token or token file must be provided")
return "", errors.New("no token provided")
Collaborator

Try to include in the error messages the names of the flags that should be used.
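For instance, following the pattern of the earlier suggestion (the flag-name constants are assumed from the snippet above):

fmt.Errorf(
        "a backend token must be provided via either the '%s' flag or the '%s' flag",
        backendTokenFlagName, backendTokenFileFlagName,
)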

@@ -115,6 +115,17 @@ func (t *reconcilerTask) run(ctx context.Context) (nextReconcile ctrl.Result, er
// Set the default reconcile time to 5 minutes.
nextReconcile = ctrl.Result{RequeueAfter: 5 * time.Minute}

// Validate the CR.
Collaborator

Would it be possible to do this validation with a webhook so that we reject the object creation instead of failing to reconcile it? Not in this pull request, but consider it for the future.

Collaborator Author

I will look into it.
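For reference, a heavily hedged sketch of the webhook idea using controller-runtime's webhook.Validator interface — the exact interface shape varies across controller-runtime versions, and the checked fields here are hypothetical examples, not fields from this PR:

// ValidateCreate rejects invalid objects at admission time instead of at
// reconcile time.
func (r *ORANO2IMS) ValidateCreate() error {
        // Hypothetical rule: a literal token and a token file are incompatible.
        if r.Spec.ResourceServerConfig.BackendToken != "" &&
                r.Spec.ResourceServerConfig.BackendTokenFile != "" {
                return fmt.Errorf("backendToken and backendTokenFile are incompatible")
        }
        return nil
}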

return "", fmt.Errorf("multiclusterengine labels does not contain the installer.namespace key")
}
return acmNamespace.(string), nil
}
Collaborator

It may not be worth it here, but when you need to access information nested inside an unstructured object like this, in other places we use the jq library. It could be something like this:

// Create the jq tool:
jqTool := ...

// Extract the namespace:
var namespace string
err = jqTool.Evaluate(
        `.metadata.labels["installer.namespace"]`,
        &namespace,
)
if err != nil {
        ....
}

As I said, this is probably not worth it for this case, but consider it if the location of the nested information is complicated or may change in the future.

return searchAPI, nil
}

func BuildServerContainerArgs(ctx context.Context, c client.Client, orano2ims *oranv1alpha1.ORANO2IMS,
Collaborator

The number of parameters of this function is growing, as usually happens. At some point we should consider making it a method of some type, so that some of the parameters become fields of that type and we avoid adding more parameters. I just want to avoid things like this, which happen over time (from real code):

clusterApi = NewManager(getDefaultConfig(), common.GetTestLog(), db, commontesting.GetDummyNotificationStream(ctrl), nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, false)

Collaborator Author

Agreed. Noted.
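A sketch of that refactor direction, with illustrative names rather than code from this PR:

// serverArgsBuilder carries the dependencies that BuildServerContainerArgs
// currently receives as parameters, so call sites stop accumulating arguments.
type serverArgsBuilder struct {
        client    client.Client
        orano2ims *oranv1alpha1.ORANO2IMS
}

func (b *serverArgsBuilder) buildResourceServerArgs(ctx context.Context) ([]string, error) {
        args := []string{"start", "resource-server"}
        // Flags derived from b.orano2ims.Spec.ResourceServerConfig would be
        // appended here instead of being threaded through parameters.
        return args, nil
}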

@@ -502,9 +512,6 @@ func (c *ResourceServerCommand) generateSearchApiUrl(backendURL string) (string,
// Split URL address
hostArr := strings.Split(u.Host, ".")

// Replace with search API prefix
hostArr[0] = searchApiUrlPrefix
Collaborator

Will it work when running the resource-server manually (i.e. without the operator)?
I mean, IIUC, after this change the search-api prefix is added only when using the operator.
We probably need to add another flag to specify the prefix or the namespace, e.g. --search-api-namespace?

Collaborator Author

The search-api is the backendURL of the resource-server, which would be provided in full when running manually.
This is how I've been running the resource server:
./oran-o2ims start resource-server --log-level=debug --log-file=stdout --api-listener-address=localhost:8002 --cloud-id=123 --backend-url="${BACKEND_URL}" --backend-token-file=./token
where BACKEND_URL is https://search-api-open-cluster-management.apps.lab.karmalabs.corp. There is no need to override with the search-api prefix.
@danielerez , do you agree?

When running with the operator, the operator will try to dynamically construct the full search-api URL.

Collaborator

Oh, so maybe we should just rename the BACKEND_URL in launch.json to something like SEARCH_API_URL.
I.e. the current code was added in order to avoid changing the URL when debugging a different server. It's not crucial of course, but it would make running the server locally easier.
Anyway, can you please update the readme with details about the behaviour (to note that the 'search-api' prefix is needed in the URL)?

Collaborator Author

I left everything with BACKEND_URL to keep it consistent. Maybe I can add some extra explanation in the CRD about the resource server requiring the search API.

Collaborator

Yeah, sounds good. Also update the readme file to emphasize that the prefix is required.

@irinamihai force-pushed the MGMT-17997-resource-server branch 3 times, most recently from 77df5de to 7217dc9 on June 5, 2024 02:10
func GetTokenFlag(
ctx context.Context,
backendToken string, backendTokenFile string,
logger *slog.Logger) (string, error) {
Collaborator

Can we pass the *pflag.FlagSet object here, and also have the code that extracts the values of the flags in this function?

. "github.com/onsi/gomega"
)

var _ = Describe("Flags", func() {
Collaborator

Great to have these tests, thanks!

Level: slog.LevelDebug,
}
handler := slog.NewJSONHandler(GinkgoWriter, options)
logger := slog.New(handler)
Collaborator

This should probably go inside a BeforeSuite block. It won't make a difference in this case, but in general if you put it here it will be executed when Ginkgo is reading the test suite, even if it isn't going to execute it. On the other hand the BeforeSuite block is executed only when the tests are going to be executed.
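For example, a sketch of moving that setup into a BeforeSuite block:

var logger *slog.Logger

var _ = BeforeSuite(func() {
        options := &slog.HandlerOptions{
                Level: slog.LevelDebug,
        }
        handler := slog.NewJSONHandler(GinkgoWriter, options)
        logger = slog.New(handler)
})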

@irinamihai force-pushed the MGMT-17997-resource-server branch from 7217dc9 to 099d569 on June 5, 2024 23:38
@openshift-ci-robot commented Jun 5, 2024

@irinamihai: This pull request references MGMT-17997 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

Description:

  • add backend-token-file parameter to the resource-server
  • update the operator resourceServer to not assume the
    open-cluster-management namespace for ACM and dynamically
    obtain the searchAPI
  • restructure the O2IMS CRD to have a separate structure that holds
    the configuration for each server
    • each server structure contains a new ServerConfig struct
      meant to hold parameters common to all the servers

ToDo:

  • add more unit tests


"--backend-url=${env:BACKEND_URL}",
"--backend-token=${env:BACKEND_TOKEN}"
]
"--backend-url=https://search-api-open-cluster-management.apps.lab.karmalabs.corp",
Collaborator

nit: remember to remove url/token

Collaborator Author

The resource server will still need these parameters; only the O2IMS CR will have them optional.
I will clean up this file though.

Description:
- add backend-token-file parameter to the resource-server
- update the operator resourceServer to not assume the
  open-cluster-management namespace for ACM and dynamically
  obtain the searchAPI
- restructure the O2IMS CRD to have a separate structure that holds
  the configuration for each server
  * each server structure contains a new ServerConfig struct
    meant to hold parameters common to all the servers
- add the parameters used to start the servers in the CR status
@irinamihai force-pushed the MGMT-17997-resource-server branch from 099d569 to ef838fd on June 6, 2024 22:04
@openshift-ci-robot commented Jun 6, 2024

@irinamihai: This pull request references MGMT-17997 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

Description:

  • add backend-token-file parameter to the resource-server
  • update the operator resourceServer to not assume the open-cluster-management namespace for ACM and dynamically obtain the searchAPI
  • restructure the O2IMS CRD to have a separate structure that holds the configuration for each server
    • each server structure contains a new ServerConfig struct meant to hold parameters common to all the servers
  • add the parameters used to start the servers to the CR status


// Add the token arg:
result = append(
result,
GetBackendTokenArg(orano2ims.Spec.ResourceServerConfig.BackendToken))
@irinamihai (Collaborator Author) commented Jun 6, 2024

Requests to the resource server didn't work for me when the default token file was used ("/run/secrets/kubernetes.io/serviceaccount/token").
@danielerez @jhernand , do you know why that would be?
This doesn't have to block the merge as the resource server token can be included in the spec.

Collaborator

What do you mean when you say that it doesn't work? What error does it write to the log? If you get an error then we may have a problem in how we pass that token to the server, because in theory any service account has permission to use the ACM search API. You can find more details about how that is checked here:

https://github.com/stolostron/search-v2-api/blob/main/pkg/rbac/authnMiddleware.go

If there is no error, but you don't get the expected results, then it will probably be because the service account that you are using doesn't have the right permissions. The ACM search API uses the same permissions as the rest of the cluster. You can find more details about how that is implemented here:

https://github.com/stolostron/search-v2-api/blob/main/pkg/rbac/authzMiddleware.go

If I understand correctly we create a service account for each service. For the resource server it is named resource-server, I think. That service account won't have any permissions. For example, when you send a request like this:

GET /o2ims-infrastructureInventory/v1/resourcePools

Our resource server will try to use the search API to retrieve the set of ManagedClusters (this is the mapping we currently do: a resource pool per ManagedCluster, @danielerez correct me if I am wrong) and will receive nothing because the resource-server account doesn't have permissions to get ManagedCluster objects.

In the documentation here @danielerez suggests using the oauth-apiserver-sa service account from the openshift-oauth-apiserver namespace. That works because that service account is the one used by the ACM search API itself, and it has all the permissions. Instead of that, I think that our operator should add to our resource-server service account all the required permissions, creating a role and a role binding for that, as sketched below. I think that role will need read access to ManagedClusters and Hosts. @danielerez please clarify with @irinamihai what the required permissions are.

If my analysis is right, that work ^ looks complicated enough to deserve a separate pull request. Please merge this first, and then open another one for that.
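A hedged sketch of the role the operator could create for that — the exact resources the search API checks were still being clarified below, and the role name is hypothetical:

role := &rbacv1.ClusterRole{
        ObjectMeta: metav1.ObjectMeta{
                // Hypothetical name for the resource server's search access role.
                Name: "oran-o2ims-resource-server-search-access",
        },
        Rules: []rbacv1.PolicyRule{{
                // Read access to ManagedClusters, per the analysis above; other
                // kinds (see the follow-up comment) may be needed as well.
                APIGroups: []string{"cluster.open-cluster-management.io"},
                Resources: []string{"managedclusters"},
                Verbs:     []string{"get", "list", "watch"},
        }},
}

A matching ClusterRoleBinding would then tie the role to the resource-server service account.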

Collaborator Author

Thank you for all the info, @jhernand. I apologize, I should have provided more details.
It doesn't work in the sense that an empty list is always returned by the server. I did multiple runs and I noticed this is the behavior when the token is not correct.
I will look more into it and discuss with @danielerez for the future PR you've suggested.

I will merge this since we can specify the backendToken for the resourceServer separately.

Collaborator

So about the access permissions: in the resource server we query two kinds of objects, Cluster and Node.
Then I guess it could indeed be enough to have read access for these, but we will have to check whether anything else is required by the search API.
Regarding the empty list with an invalid token, it does look like a bug; I would expect a StatusUnauthorized (401) error in such a scenario. Will have to look into it.

@irinamihai requested review from jhernand and danielerez June 6, 2024 22:10
@openshift-ci-robot commented Jun 6, 2024

@irinamihai: This pull request references MGMT-17997 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

Description:

  • add backend-token-file parameter to the resource-server
  • update the operator resourceServer to not assume the open-cluster-management namespace for ACM and dynamically obtain the searchAPI
  • restructure the O2IMS CRD to have a separate structure that holds the configuration for each server
    • each server structure contains a new ServerConfig struct meant to hold parameters common to all the servers
  • add the parameters used to start the servers to the CR status

The operator was tested locally using a personal quay image that included the resource server changes.


domain := ".apps" + ingressSplit[len(ingressSplit)-1]

// The searchAPI is obtained from the "search-api" string and the ACM namespace.
searchAPI := "https://" + "search-api-" + acmNamespace + domain
Collaborator

For a different pull request, but note that as we will be running inside the same cluster as the search API, we can use the search-search-api service directly instead of creating a new route that points to it. So the URL could be:

https://search-search-api.open-cluster-management.svc.cluster.local:4010

That can simplify things and reduce the number of hops to get to the backend.
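If that direction is taken, the operator-side construction could become something like this sketch, reusing the acmNamespace value from the code above:

// Hypothetical: point the resource server at the in-cluster service
// instead of building a route-based URL from the ingress host.
searchAPI := fmt.Sprintf(
        "https://search-search-api.%s.svc.cluster.local:4010",
        acmNamespace,
)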

@jhernand commented Jun 7, 2024

/approve
/lgtm

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Jun 7, 2024
openshift-ci bot commented Jun 7, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jhernand

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jun 7, 2024
@irinamihai (Collaborator Author)

/unhold

@openshift-ci bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Jun 7, 2024
@openshift-merge-bot merged commit ec8d768 into openshift-kni:main on Jun 7, 2024
8 checks passed