Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Adjust the alicloud metrics exporter and add RDS performance metrics #15563

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ZPascal
Copy link
Contributor

@ZPascal ZPascal commented Jun 25, 2024

Summary

The PR adds generic advanced and RDS performance metrics support

Checklist

  • No AI generated code was used in this PR
  • Build a dev release and test it
  • Add new unit tests
  • Adapt the documentation

Output

Screenshot 2024-06-25 at 16 12 22 1

Related issues

superseeds #15238

@ZPascal ZPascal force-pushed the adjust-the-alicloud-metrics-exporter branch from 9f75e88 to d4bed9f Compare June 25, 2024 14:05
@powersj
Copy link
Contributor

powersj commented Jun 26, 2024

@ZPascal I see this is a draft, are you actually ready for a review once the lint issue is resolved?

@ZPascal ZPascal force-pushed the adjust-the-alicloud-metrics-exporter branch from e2b3c16 to d632ccc Compare June 27, 2024 13:21
@ZPascal
Copy link
Contributor Author

ZPascal commented Jun 27, 2024

@ZPascal I see this is a draft, are you actually ready for a review once the lint issue is resolved?

@powersj I think so. I still have some rework to do, e.g. new tests or documentation, but I think we can do it in parallel. In general, the feature offers the possibility to manage performance metrics. These metrics are not part of the CMS. I'm currently thinking about renaming the plugin. What do you think of this idea?

@ZPascal ZPascal force-pushed the adjust-the-alicloud-metrics-exporter branch from c0b4842 to a25731c Compare June 27, 2024 20:02
@ZPascal ZPascal changed the title Adjust the alicloud metrics exporter and add RDS performance metrics feat: Adjust the alicloud metrics exporter and add RDS performance metrics Jun 27, 2024
@telegraf-tiger telegraf-tiger bot added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Jun 27, 2024
@powersj
Copy link
Contributor

powersj commented Jun 28, 2024

These metrics are not part of the CMS. I'm currently thinking about renaming the plugin. What do you think of this idea?

Check my understanding: CMS is the Cloud Monitor service, while the RDS is the relational database service? A user might not want one or the other.

Renaming it in the Telegraf config would be a breaking change and we do not want to do that.

What I would suggest doing is to add a config option that sets what metrics are captured, something like:

## Aliyun Metrics
## Specified which metrics to capture from Aliyun, choose from:
##  * cms - Cloud Monitor service
##  * rds - Relational Database service
# metrics = ["cms"]

A user can then add in rds if they want, or even use only rds. I suspect there is a possibility where we would want additional metrics from other services down the road. The other option is a 2nd plugin just for RDS, but this looks like we can keep them together given the wide overlap of the codebase.

In terms of docs, we would want to clarify at the top that this collects more than just CMS.

@powersj powersj added the waiting for response waiting for response from contributor label Jul 15, 2024
@ZPascal
Copy link
Contributor Author

ZPascal commented Jul 17, 2024

Hi @powersj, thank you for the answer.

Check my understanding: CMS is the Cloud Monitor service, while the RDS is the relational database service? A user might not want one or the other.

The default values are collected by CMS and as an optional feature, you can enable the advanced monitoring metrics like AWS to get more details like IOPS usage.

Renaming it in the Telegraf config would be a breaking change and we do not want to do that.

I fully agree with that.

What I would suggest doing is to add a config option that sets what metrics are captured, something like:

Aliyun Metrics

Specified which metrics to capture from Aliyun, choose from:

* cms - Cloud Monitor service

* rds - Relational Database service

metrics = ["cms"]

A user can then add in rds if they want, or even use only rds. I suspect there is a possibility where we would want additional metrics from other services down the road. The other option is a 2nd plugin just for RDS, but this looks like we can keep them together given the wide overlap of the codebase.

I would like to build up the whole topic in a generic way and also offer the possibility to consume further extended metrics of the services in the future. For this reason, I have introduced a generic parameter that provides the basic function for the plugin and in the second step activates the function for the RDS service.

I can only report on our current case, and we need both. The CMS metrics and the performance metrics. If someone doesn't need the CMS metrics, they don't need to forward or extract them.

The other option is a 2nd plugin just for RDS, but this looks like we can keep them together given the wide overlap of the codebase.

I think we can handle it, in one generic solution to extract metrics from the Alicloud API.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jul 17, 2024
@powersj
Copy link
Contributor

powersj commented Jul 17, 2024

I have introduced a generic parameter that provides the basic function for the plugin and in the second step activates the function for the RDS service.

We discourage boolean parameters. As a plugin gets more and more features it means the user has more and more toggles that they need to switch on or off and complicates the configuration greatly. We prefer a single config option that is an array that specifies what features to toggle on or off.

@ZPascal
Copy link
Contributor Author

ZPascal commented Jul 18, 2024

I have introduced a generic parameter that provides the basic function for the plugin and in the second step activates the function for the RDS service.

We discourage boolean parameters. As a plugin gets more and more features it means the user has more and more toggles that they need to switch on or off and complicates the configuration greatly. We prefer a single config option that is an array that specifies what features to toggle on or off.

@powersj Thank you for the update. I'll adapt the code base.

@powersj powersj added the waiting for response waiting for response from contributor label Jul 24, 2024
@telegraf-tiger
Copy link
Contributor

telegraf-tiger bot commented Aug 8, 2024

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Forums or provide additional details in this issue and reqeust that it be re-opened. Thank you!

@telegraf-tiger telegraf-tiger bot closed this Aug 8, 2024
@powersj
Copy link
Contributor

powersj commented Aug 8, 2024

I'm going to re-open as @ZPascal I think you've been going through and updating the PRs and we were nearly done with this one.

@powersj powersj reopened this Aug 8, 2024
@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Aug 8, 2024
@powersj powersj added the waiting for response waiting for response from contributor label Aug 8, 2024
@ZPascal ZPascal force-pushed the adjust-the-alicloud-metrics-exporter branch from f7f9386 to bc31acc Compare August 11, 2024 12:40
@ZPascal ZPascal marked this pull request as ready for review August 12, 2024 04:52
Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update, some initial comments

Comment on lines +142 to +146
## Specified which metric service to capture from Aliyun
## * cms - Cloud Monitor service (default settings)
## * rds - Relational Database service
# service = 'rds'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a service + metric_services? Shouldn't this be removed in favor of the metric_services above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric_service is the general trigger point to initialize the RDS client on a technical level and to activate the function in general. The service parameter defines the corresponding extended metrics and marks them. These metrics are not available at the standard level (CMS). This means at the API level that you will not get a result for the endpoint if you call it with the wrong metrics.
It is not marked as an error, but we log it as debug information. If this behavior is ok, I can also remove the service parameter and we intentionally call the API endpoint with the wrong information.

Copy link
Contributor

@powersj powersj Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the scenario at play with these settings then.

Why would I enable "cms" metric_services and not "cms" service? (This naming is a headache) :) Likewise, why would I enable "rds" metric_services and not "rds" service?

You mentioned extended metrics. We are not collecting these today for cms right?

Should this be a follow-on PR and this PR instead be focused only on adding the new RDS service?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @powersj,

I'm not sure I understand the scenario at play with these settings then.
Why would I enable "cms" metric_services and not "cms" service? (This naming is a headache) :) Likewise, why would I enable "rds" metric_services and not "rds" service?

The corresponding parameters are linked to each other.

The metric_service parameter is the generic option and generally activates the option. It initiates the separate client that processes the request to the dedicated API endpoint.

The service parameter at the metric level is the trigger point for calling the API. In the current implementation, it is the RDS service of Alicloud and extracts the separately specified values.

If we don't have a switch for the metric level, we are intentionally calling the RDS API in the wrong context to get metrics e.g. we expect CMS default metrics, thus burdening the Alicloud rate limit.

You mentioned extended metrics. We are not collecting these today for cms right?
Should this be a follow-on PR and this PR instead be focused only on adding the new RDS service?

The CMS service and the dedicated APIs of the services themselves are not compatible and are separate endpoints. When I talk about extended metrics, I'm talking about the dedicated RDS API endpoint that this PR is already using.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @powersj, is there any ambiguity regarding the functionality or can I proceed to write new tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZPascal please do not expose technical details to users. As you said, the two parameters are linked and as such a nightmare to maintain and even communicate to the user. Please let the user define the metric and internally map to which client to initialize and use!

plugins/inputs/aliyuncms/README.md Outdated Show resolved Hide resolved
plugins/inputs/aliyuncms/README.md Outdated Show resolved Hide resolved
plugins/inputs/aliyuncms/aliyuncms.go Outdated Show resolved Hide resolved
plugins/inputs/aliyuncms/aliyuncms.go Outdated Show resolved Hide resolved
plugins/inputs/aliyuncms/aliyuncms.go Outdated Show resolved Hide resolved
resp := new(rds.DescribeDBInstancePerformanceResponse)

switch request.Key {
//TODO Adapt the Tests and the Mock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's left here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is my quality aspect to write new unit tests for the delivered new function. I am currently working on the mocks and will further customize the tests today.

plugins/inputs/aliyuncms/discovery.go Outdated Show resolved Hide resolved
@powersj powersj removed the waiting for response waiting for response from contributor label Aug 13, 2024
@powersj powersj added the waiting for response waiting for response from contributor label Aug 13, 2024
@ZPascal ZPascal force-pushed the adjust-the-alicloud-metrics-exporter branch 2 times, most recently from 5bd146b to 6f80f8e Compare August 15, 2024 07:25
@telegraf-tiger
Copy link
Contributor

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Forums or provide additional details in this issue and reqeust that it be re-opened. Thank you!

@telegraf-tiger telegraf-tiger bot closed this Aug 27, 2024
@ZPascal
Copy link
Contributor Author

ZPascal commented Aug 27, 2024

@powersj @srebhan The discussion is still ongoing, but the bot has closed the PR. Can you please open the PR again?

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Aug 27, 2024
@srebhan
Copy link
Member

srebhan commented Aug 29, 2024

Sorry @ZPascal. I'll reopen and try to read into the discussion tomorrow!

@srebhan srebhan reopened this Aug 29, 2024
@srebhan srebhan linked an issue Aug 29, 2024 that may be closed by this pull request
Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @ZPascal for taking so long to look at your great contribution! I do have some comments in the code, with the most important being to remove the redundant metric_service/service settings. Just use one (e.g. metric) setting and internally map to which service you have to get this from.

Regarding renaming the plugin: Please don't do this in the present PR but split this out to a separate one. What you can do is to rename the plugin directory and struct to "aliyun" and then register the plugin twice, once with the new and once with the old name like here...

Comment on lines +142 to +146
## Specified which metric service to capture from Aliyun
## * cms - Cloud Monitor service (default settings)
## * rds - Relational Database service
# service = 'rds'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZPascal please do not expose technical details to users. As you said, the two parameters are linked and as such a nightmare to maintain and even communicate to the user. Please let the user define the metric and internally map to which client to initialize and use!

Comment on lines +9 to +11
"github.com/jmespath/go-jmespath"
"reflect"
"slices"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep the grouping into built-in, 3rd party and telegraf internal imports which the groups being separated by an empty line!

Comment on lines -17 to +20
"github.com/jmespath/go-jmespath"

"github.com/aliyun/alibaba-cloud-sdk-go/services/rds"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, the SDK is not an internal import so please move it to the group above.

@@ -27,8 +29,8 @@ import (
var sampleConfig string

type (
// AliyunCMS is aliyun cms config info.
AliyunCMS struct {
// AliyunMetrics is aliyun cms config info.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment adds no info at all. Please remove instead of making it worse. ;-)

Comment on lines -112 to +122
func (*AliyunCMS) SampleConfig() string {
func (*AliyunMetrics) SampleConfig() string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please keep the struct name as changing it might make things more confusing...

//Check metric services
if len(s.MetricServices) == 0 {
s.MetricServices = []string{"cms"}
s.Log.Info("'metric_services' is not set. Metrics will be queried from the cms service")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not log anything for filling in default values. This is confusion and will make users explicitly set the values in their configs to silence this with no benefit.

Comment on lines 302 to 366
for _, instanceID := range metric.requestDimensions {
req := rds.CreateDescribeDBInstancePerformanceRequest()
req.DBInstanceId = instanceID["instanceId"]
req.Key = metricName
startTime := s.windowStart.UTC()
req.StartTime = fmt.Sprintf("%d-%02d-%02dT%02d:%02dZ", startTime.Year(), startTime.Month(),
startTime.Day(), startTime.Hour(), startTime.Minute())
endTime := s.windowEnd.UTC()
req.EndTime = fmt.Sprintf("%d-%02d-%02dT%02d:%02dZ", endTime.Year(), endTime.Month(),
endTime.Day(), endTime.Hour(), endTime.Minute())
req.RegionId = region

resp, err := s.rdsClient.DescribeDBInstancePerformance(req)

if err != nil {
return fmt.Errorf("failed to get the database instance performance metrics: %w", err)
}
if resp.GetHttpStatus() != 200 {
s.Log.Errorf("failed to get the database instance performance metrics: %v", resp.BaseResponse.GetHttpContentString())
break
}

for _, performanceKey := range resp.PerformanceKeys.PerformanceKey {
for _, performanceValue := range performanceKey.Values.PerformanceValue {
parsedTime, err := time.Parse(time.RFC3339, performanceValue.Date)
if err != nil {
return fmt.Errorf("failed to parse response performance time datapoints: %w", err)
}

if strings.Contains(performanceValue.Value, "&") {
performanceKeys := strings.Split(performanceKey.ValueFormat, "&")
performanceValues := strings.Split(performanceValue.Value, "&")

for i, value := range performanceValues {
valueAsFloat, err := strconv.ParseFloat(value, 32)
if err != nil {
return fmt.Errorf("failed to convert the performance value string to an float: %w", err)
}
datapoints = append(datapoints,
map[string]interface{}{
"instanceId": instanceID["instanceId"],
performanceKeys[i]: valueAsFloat,
"timestamp": parsedTime.Unix(),
})
}
} else {
valueAsFloat, err := strconv.ParseFloat(performanceValue.Value, 32)
if err != nil {
return fmt.Errorf("failed to convert the performance value string to an float: %w", err)
}
datapoints = append(datapoints,
map[string]interface{}{
"instanceId": instanceID["instanceId"],
performanceKey.ValueFormat: valueAsFloat,
"timestamp": parsedTime.Unix(),
})
}
}
}

if len(datapoints) == 0 {
s.Log.Debugf("No rds performance metrics returned from RDS, response msg: %s", resp.GetHttpContentString())
break
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OMG! Please move this to an own function and try to reduce nesting!

Comment on lines +422 to +426
if reflect.TypeOf(value).String() == "int64" {
datapointTime = value.(int64)
} else {
datapointTime = int64(value.(float64)) / 1000
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use type assertions here instead of reflection!

@@ -39,7 +39,7 @@ type discoveryTool struct {

respRootKey string //Root key in JSON response where to look for discovery data
respObjectIDKey string //Key in element of array under root key, that stores object ID
//for ,majority of cases it would be InstanceId, for OSS it is BucketName. This key is also used in dimension filtering// )
//for ,the majority of cases it would be InstanceId, for OSS it is BucketName. This key is also used in dimension filtering// )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get what this comment wants to tell me and where this relates to...

@@ -63,15 +63,13 @@ func getRPCReqFromDiscoveryRequest(req discoveryRequest) (*requests.RpcRequest,
}

ptrV := reflect.Indirect(reflect.ValueOf(req))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please move formatting cleanup to an own PR to ease reviews! It's hard if functional changes and pure style changes are mixed and some files are only style changes...

@srebhan srebhan added the waiting for response waiting for response from contributor label Sep 9, 2024
@ZPascal ZPascal force-pushed the adjust-the-alicloud-metrics-exporter branch from ea43c13 to 44b89fc Compare September 14, 2024 10:53
@telegraf-tiger
Copy link
Contributor

Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip.
Downloads for additional architectures and packages are available below.

⚠️ This pull request increases the Telegraf binary size by 4.83 % for linux amd64 (new size: 273.5 MB, nightly size 260.8 MB)

📦 Click here to get additional PR build artifacts

Artifact URLs

DEB RPM TAR GZ ZIP
amd64.deb aarch64.rpm darwin_amd64.tar.gz windows_amd64.zip
arm64.deb armel.rpm darwin_arm64.tar.gz windows_arm64.zip
armel.deb armv6hl.rpm freebsd_amd64.tar.gz windows_i386.zip
armhf.deb i386.rpm freebsd_armv7.tar.gz
i386.deb ppc64le.rpm freebsd_i386.tar.gz
mips.deb riscv64.rpm linux_amd64.tar.gz
mipsel.deb s390x.rpm linux_arm64.tar.gz
ppc64el.deb x86_64.rpm linux_armel.tar.gz
riscv64.deb linux_armhf.tar.gz
s390x.deb linux_i386.tar.gz
linux_mips.tar.gz
linux_mipsel.tar.gz
linux_ppc64le.tar.gz
linux_riscv64.tar.gz
linux_s390x.tar.gz

@telegraf-tiger
Copy link
Contributor

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Forums or provide additional details in this issue and reqeust that it be re-opened. Thank you!

@telegraf-tiger telegraf-tiger bot closed this Sep 23, 2024
@srebhan
Copy link
Member

srebhan commented Sep 30, 2024

Unfortunately, there was no progress in the PR for some time. Please reopen the PR if you want to continue working on the matter.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Sep 30, 2024
@ZPascal
Copy link
Contributor Author

ZPascal commented Oct 1, 2024

Hi @srebhan, I'm still working on the corresponding feature in general, but I need a bit more time to handle the corresponding case and rewrite the functionality in general. I'll reopen a new PR if the feature is ready.

@srebhan srebhan reopened this Oct 1, 2024
@srebhan
Copy link
Member

srebhan commented Oct 1, 2024

I reopened the PR to be sure we keep track...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants