MGMT-17997: Resource server doesn't start on deployment #101
Conversation
@irinamihai: This pull request references MGMT-17997 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/hold
Force-pushed from ab40072 to c1bfbcc.
```go
)
	nextReconcile = ctrl.Result{RequeueAfter: 30 * time.Second}
}
```
Extra line
@irinamihai: This pull request references MGMT-17997 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
5c8e510
to
953a6b2
Compare
@irinamihai: This pull request references MGMT-17997 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
```go
	//+optional
	BackendToken string `json:"backendToken,omitempty"`
}
```
Can we use a structure like this for all the other servers as well, instead of the boolean that we use today? I'd suggest that we create a base structure containing only the Enabled boolean, and then create structs specific to each server that embed it. For example:
```go
type ServerConfig struct {
	// Enabled indicates if the server should be started.
	//
	// +kubebuilder:default=true
	Enabled bool `json:"enabled"`
}

// ResourceServerConfig contains the configuration for the resource server.
type ResourceServerConfig struct {
	ServerConfig `json:",inline"`

	//+optional
	BackendURL string `json:"backendURL,omitempty"`

	//+optional
	BackendToken string `json:"backendToken,omitempty"`
}

// MetadataServerConfig contains the configuration for the metadata server.
type MetadataServerConfig struct {
	ServerConfig `json:",inline"`
}

// Same for all the other servers.
```
This will give us room to add other common parameters in the future. For example, at some point I would like to add logging settings:
```go
type ServerConfig struct {
	// Enabled indicates if the server should be started.
	//
	// +kubebuilder:default=true
	Enabled bool `json:"enabled"`

	// LogLevel indicates the log level.
	//
	// +kubebuilder:default=info
	LogLevel string `json:"logLevel"`
}
```
Yes, that was my plan as well. Initially I was thinking to leave the common settings directly in spec, but we can have another structure if you prefer.
Done.
internal/cmd/server/common.go
Outdated
```go
	"strings"
)

func processBackendToken(
```
If we are going to extract this logic to a common place I'd suggest we do it for both adding the flags and getting them, something similar to what we do for the network listening flags. For example, we could have a function to add the flags to a flag set:
```go
// AddTokenFlags adds to the given flag set the flags needed to configure a token ...
func AddTokenFlags(set *pflag.FlagSet, name string) {
	...
}
```
And then a function to get the value of a token flag from a flag set:
```go
// GetTokenFlag gets the value of a token flag ...
func GetTokenFlag(set *pflag.FlagSet, name string) (result string, err error) {
	...
}
```
That way if we ever need to change how we add or get these flags it will all be in the same place.
Updated. @jhernand , kindly confirm this is what you had in mind.
Looks good. I think it can be even better if we pass the `pflag.FlagSet` object to the `GetTokenFlag` function, because then that function can also extract the values of the flags from that flag set.
internal/cmd/server/common.go
Outdated
```go
	slog.String("!token", backendToken),
	slog.String("token_file", backendTokenFile),
)
return "", errors.New("backendToken and backendTokenFile both provided")
```
If we are going to report flag-related issues by returning errors, then I think those errors should be more explicit about what caused them, more similar to what we are writing to the log. For example:
```go
fmt.Errorf(
	"backend token flag '%s' and token file flag '%s' have both been provided, "+
		"but they are incompatible",
	backendTokenFlagName, backendTokenFileFlagName,
)
```
Done
internal/cmd/server/common.go
Outdated
```go
	slog.String("file", backendTokenFile),
	slog.String("error", err.Error()),
)
return "", errors.New("failed to read backend token file")
```
Here I think we should include in the error message the name of the file and the original error:
```go
fmt.Errorf(
	"failed to read backend token file '%s': %w",
	backendTokenFile, err,
)
```
internal/cmd/server/common.go
Outdated
```go
// Check that we have a token:
if backendToken == "" {
	logger.ErrorContext(ctx, "Backend token or token file must be provided")
	return "", errors.New("no token provided")
```
Try to include in the error messages the names of the flags that should be used.
```go
@@ -115,6 +115,17 @@ func (t *reconcilerTask) run(ctx context.Context) (nextReconcile ctrl.Result, er
	// Set the default reconcile time to 5 minutes.
	nextReconcile = ctrl.Result{RequeueAfter: 5 * time.Minute}

	// Validate the CR.
```
Would it be possible to do this validation with a webhook so that we reject the object creation instead of failing to reconcile it? Not in this pull request, but consider it for the future.
I will look into it.
```go
		return "", fmt.Errorf("multiclusterengine labels does not contain the installer.namespace key")
	}
	return acmNamespace.(string), nil
}
```
It may not be worth it here, but when you need to access information nested inside an unstructured object like this, we are using the `jq` library in other places. It could be something like this:
```go
// Create the jq tool:
jqTool := ...

// Extract the namespace:
var namespace string
err = jqTool.Evaluate(
	`.metadata.labels["installer.namespace"]`,
	&namespace,
)
if err != nil {
	....
}
```
As I said, this is probably not worth it for this case, but consider it in the future if the location of the nested information is complicated or likely to change.
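The project's internal jq tool is not shown here, but the idea it replaces can be sketched with a hypothetical stdlib-only helper: centralise the nested lookup in one function instead of chaining type assertions inline at every call site. Everything below (the `lookupString` name and its behaviour) is an illustration, not the repository's API.

```go
package main

import "fmt"

// lookupString walks the given key path through nested map[string]any values
// and returns the string found at the end. It is a hypothetical stand-in for
// a jq-style evaluator, shown only to illustrate centralising nested lookups.
func lookupString(obj map[string]any, keys ...string) (string, error) {
	current := any(obj)
	for _, key := range keys {
		m, ok := current.(map[string]any)
		if !ok {
			return "", fmt.Errorf("expected an object before key '%s'", key)
		}
		current, ok = m[key]
		if !ok {
			return "", fmt.Errorf("key '%s' not found", key)
		}
	}
	s, ok := current.(string)
	if !ok {
		return "", fmt.Errorf("value at path is not a string")
	}
	return s, nil
}

func main() {
	// Shape mirrors the labels of an unstructured MultiClusterEngine object.
	obj := map[string]any{
		"metadata": map[string]any{
			"labels": map[string]any{
				"installer.namespace": "open-cluster-management",
			},
		},
	}
	ns, err := lookupString(obj, "metadata", "labels", "installer.namespace")
	fmt.Println(ns, err)
}
```

With a helper like this (or the jq expression above), only one place knows how deeply the value is nested.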
internal/controllers/utils/utils.go
Outdated
```go
	return searchAPI, nil
}

func BuildServerContainerArgs(ctx context.Context, c client.Client, orano2ims *oranv1alpha1.ORANO2IMS,
```
The number of parameters of this function is growing, as usually happens. At some point we should consider making it a method of some type, so that some of the parameters become fields of the type and we avoid adding more parameters. I just want to avoid things like this, which happen over time (from real code):
```go
clusterApi = NewManager(getDefaultConfig(), common.GetTestLog(), db, commontesting.GetDummyNotificationStream(ctrl), nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, false)
```
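A hypothetical sketch of that refactor: shared inputs become fields of a type and the build step becomes a method, so a new input later means adding a field rather than another positional parameter. The type and field names below are invented for illustration, not taken from the repository.

```go
package main

import "fmt"

// serverArgsBuilder holds the inputs that would otherwise be passed as an
// ever-growing parameter list to a free function.
type serverArgsBuilder struct {
	cloudID    string
	backendURL string
	logLevel   string
}

// Build assembles the container arguments from the builder's fields.
func (b *serverArgsBuilder) Build() []string {
	return []string{
		"start", "resource-server",
		"--cloud-id=" + b.cloudID,
		"--backend-url=" + b.backendURL,
		"--log-level=" + b.logLevel,
	}
}

func main() {
	builder := &serverArgsBuilder{
		cloudID:    "123",
		backendURL: "https://search-api.example.com",
		logLevel:   "debug",
	}
	fmt.Println(builder.Build())
}
```

Call sites then name each input explicitly, which avoids the long runs of `nil` arguments shown in the real-code example above.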
Agreed. Noted.
```go
@@ -502,9 +512,6 @@ func (c *ResourceServerCommand) generateSearchApiUrl(backendURL string) (string,
	// Split URL address
	hostArr := strings.Split(u.Host, ".")

	// Replace with search API prefix
	hostArr[0] = searchApiUrlPrefix
```
Will it work when running the resource-server manually (i.e. without the operator)? I mean, IIUC, after this change the search-api prefix is added only by the operator. We probably need to add another flag to specify the prefix or the namespace, e.g. `--search-api-namespace`?
The search-api is the backendURL of the resource-server, which would be provided in full when run manually. This is how I've been running the resource server:

```shell
./oran-o2ims start resource-server --log-level=debug --log-file=stdout --api-listener-address=localhost:8002 --cloud-id=123 --backend-url="${BACKEND_URL}" --backend-token-file=./token
```

where BACKEND_URL is https://search-api-open-cluster-management.apps.lab.karmalabs.corp. There is no need to override with the search-api prefix. @danielerez, do you agree?
When running with the operator, the operator will try to dynamically construct the full search-api URL.
Oh, so maybe we should just rename BACKEND_URL in launch.json to something like SEARCH_API_URL. I.e. the current code was added in order to avoid changing the URL when debugging a different server. It's not crucial, of course, but it should make running the server locally easier.
Anyway, can you please update the readme with details about the behaviour (to note that the 'search-api' prefix is needed in the URL)?
I left everything with BACKEND_URL to keep it consistent. Maybe I can add some extra explanation in the CRD about the resource server requiring the search API.
Yeah, sounds good. Also in the readme file, to emphasize that the prefix is required.
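The prefix replacement being discussed can be sketched as follows. The function name echoes `generateSearchApiUrl` from the diff above, but this is a simplified illustration, not the repository's exact code: it swaps the first DNS label of the backend host for `search-api` and keeps the rest of the domain.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// generateSearchAPIURL replaces the first DNS label of the backend host with
// the "search-api" prefix, preserving the scheme and remaining domain. This
// mirrors the hostArr[0] = searchApiUrlPrefix logic shown in the diff, in
// simplified form.
func generateSearchAPIURL(backendURL string) (string, error) {
	u, err := url.Parse(backendURL)
	if err != nil {
		return "", fmt.Errorf("failed to parse backend URL '%s': %w", backendURL, err)
	}
	hostArr := strings.Split(u.Host, ".")
	if len(hostArr) < 2 {
		return "", fmt.Errorf("backend URL host '%s' has no domain to preserve", u.Host)
	}
	// Replace the first label with the search API prefix:
	hostArr[0] = "search-api"
	u.Host = strings.Join(hostArr, ".")
	return u.String(), nil
}

func main() {
	out, err := generateSearchAPIURL("https://console-openshift-console.apps.lab.karmalabs.corp")
	fmt.Println(out, err)
}
```

When the server is run manually with a full search-api URL in `--backend-url`, this rewriting step is unnecessary, which is the point of the thread above.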
Force-pushed from 77df5de to 7217dc9.
internal/cmd/server/flags.go
Outdated
```go
func GetTokenFlag(
	ctx context.Context,
	backendToken string, backendTokenFile string,
	logger *slog.Logger) (string, error) {
```
Can we pass here the `pflag.FlagSet` object, and also have here the code that extracts the values of the flags?
```go
	. "github.com/onsi/gomega"
)

var _ = Describe("Flags", func() {
```
Great to have these tests, thanks!
internal/cmd/server/flags_test.go
Outdated
```go
	Level: slog.LevelDebug,
}
handler := slog.NewJSONHandler(GinkgoWriter, options)
logger := slog.New(handler)
```
This should probably go inside a `BeforeSuite` block. It won't make a difference in this case, but in general if you put it here it will be executed when Ginkgo is reading the test suite, even if it isn't going to execute it. On the other hand, the `BeforeSuite` block is executed only when the tests are going to be executed.
Force-pushed from 7217dc9 to 099d569.
.vscode/launch.json
Outdated
```json
	"--backend-url=${env:BACKEND_URL}",
	"--backend-token=${env:BACKEND_TOKEN}"
]
	"--backend-url=https://search-api-open-cluster-management.apps.lab.karmalabs.corp",
```
nit: remember to remove url/token
The resource server will still need these parameters; only the O2IMS CR will have them as optional. I will clean up this file, though.
Description:
- add backend-token-file parameter to the resource-server
- update the operator resourceServer to not assume the open-cluster-management namespace for ACM and dynamically obtain the searchAPI
- restructure the O2IMS CRD to have a separate structure that holds the configuration for each server
  - each server structure contains a new ServerConfig struct meant to hold parameters common to all the servers
- add the parameters used to start the servers in the CR status
Force-pushed from 099d569 to ef838fd.
```go
// Add the token arg:
result = append(
	result,
	GetBackendTokenArg(orano2ims.Spec.ResourceServerConfig.BackendToken))
```
Requests to the resource server didn't work for me when the default token file was used ("/run/secrets/kubernetes.io/serviceaccount/token").
@danielerez @jhernand , do you know why that would be?
This doesn't have to block the merge as the resource server token can be included in the spec.
What do you mean when you say that it doesn't work? What error does it write to the log? If you get an error then we may have a problem in how we pass that token to the server, because in theory any service account has permission to use the ACM search API. You can find more details about how that is checked here:
https://github.com/stolostron/search-v2-api/blob/main/pkg/rbac/authnMiddleware.go
If there is no error, but you don't get the expected results, then it will probably be because the service account that you are using doesn't have the right permissions. The ACM search API uses the same permissions than the rest of the cluster. You can find more details about how that is implemented here:
https://github.com/stolostron/search-v2-api/blob/main/pkg/rbac/authzMiddleware.go
If I understand correctly we create a service account for each service. For the resource server it is named `resource-server`, I think. That service account won't have any permissions. For example, when you send a request like this:

GET /o2ims-infrastructureInventory/v1/resourcePools

our resource server will try to use the search API to retrieve the set of `ManagedClusters` (this is the mapping we currently do: a resource pool per `ManagedCluster`, @danielerez correct me if I am wrong) and will receive nothing, because the `resource-server` account doesn't have permissions to get `ManagedCluster` objects.

In the documentation here @danielerez suggests using the `oauth-apiserver-sa` service account from the `openshift-oauth-apiserver` namespace. That works because that service account is the one used by the ACM search API itself, and it has all the permissions. Instead of that, I think that our operator should add all the required permissions to our `resource-server` service account, creating a role and a role binding for that. I think that role will need read access to `ManagedClusters` and `Hosts`. @danielerez please clarify with @irinamihai what the required permissions are.

If my analysis is right, that work looks complicated enough to deserve a separate pull request. Please merge this first, and then open another one for that.
Thank you for all the info, @jhernand. I apologize, I should have provided more details.
It doesn't work in the sense that an empty list is always returned by the server. I did multiple runs and I noticed this is the behavior when the token is not correct.
I will look more into it and discuss with @danielerez for the future PR you've suggested.
I will merge this since we can specify the backendToken for the resourceServer separately.
So about the access permissions, in the resource server we query two kinds of objects: Cluster and Node.
Then I guess it could indeed be enough to have read access for these, but will have to check if anything else is maybe required by the search API.
Regarding the empty list when using an invalid token, it does look like a bug. I would expect a StatusUnauthorized (401) error in such a scenario. Will have to look into it.
```go
domain := ".apps" + ingressSplit[len(ingressSplit)-1]

// The searchAPI is obtained from the "search-api" string and the ACM namespace.
searchAPI := "https://" + "search-api-" + acmNamespace + domain
```
For a different pull request, but note that as we will be running inside the same cluster as the search API, we can use the `search-search-api` service directly instead of creating a new route that points to it. So the URL could be:

https://search-search-api.open-cluster-management.svc.cluster.local:4010

That can simplify things and reduce the number of hops to get to the backend.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jhernand. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/unhold
Description:
The operator was tested locally using a personal quay image that included the resource server changes.