feat: Adjust the alicloud metrics exporter and add RDS performance metrics #15563

ZPascal · 2024-06-25T13:42:57Z

Summary

This PR adds support for generic advanced metrics and RDS performance metrics.

Checklist

No AI generated code was used in this PR
Build a dev release and test it
Add new unit tests
Adapt the documentation

Output

Related issues

supersedes #15238

powersj · 2024-06-26T19:25:18Z

@ZPascal I see this is a draft, are you actually ready for a review once the lint issue is resolved?

ZPascal · 2024-06-27T19:56:30Z

@ZPascal I see this is a draft, are you actually ready for a review once the lint issue is resolved?

@powersj I think so. I still have some rework to do, e.g. new tests or documentation, but I think we can do it in parallel. In general, the feature offers the possibility to manage performance metrics. These metrics are not part of the CMS. I'm currently thinking about renaming the plugin. What do you think of this idea?

powersj · 2024-06-28T19:21:18Z

These metrics are not part of the CMS. I'm currently thinking about renaming the plugin. What do you think of this idea?

Check my understanding: CMS is the Cloud Monitor service, while the RDS is the relational database service? A user might not want one or the other.

Renaming it in the Telegraf config would be a breaking change and we do not want to do that.

What I would suggest doing is to add a config option that sets what metrics are captured, something like:

## Aliyun Metrics
## Specify which metrics to capture from Aliyun, choose from:
##  * cms - Cloud Monitor service
##  * rds - Relational Database service
# metrics = ["cms"]

A user can then add in rds if they want, or even use only rds. I suspect there is a possibility where we would want additional metrics from other services down the road. The other option is a 2nd plugin just for RDS, but this looks like we can keep them together given the wide overlap of the codebase.

In terms of docs, we would want to clarify at the top that this collects more than just CMS.
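
As a rough sketch of how such an array option could be checked at startup: the function below rejects unknown entries and drops duplicates. The names `supportedServices` and `validateServices` are illustrative assumptions and are not taken from the PR.

```go
package main

import (
	"fmt"
	"slices"
)

// supportedServices mirrors the two values from the suggested option.
// The variable name is an assumption made for this sketch.
var supportedServices = []string{"cms", "rds"}

// validateServices rejects unknown entries and removes duplicates while
// keeping the order given in the configuration.
func validateServices(configured []string) ([]string, error) {
	enabled := make([]string, 0, len(configured))
	for _, name := range configured {
		if !slices.Contains(supportedServices, name) {
			return nil, fmt.Errorf("unknown service %q, choose from %v", name, supportedServices)
		}
		if !slices.Contains(enabled, name) {
			enabled = append(enabled, name)
		}
	}
	return enabled, nil
}

func main() {
	enabled, err := validateServices([]string{"cms", "rds"})
	fmt.Println(enabled, err)
}
```

With a single list, adding another service later only means extending the list of accepted values instead of introducing a new toggle.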

ZPascal · 2024-07-17T08:47:53Z

Hi @powersj, thank you for the answer.

Check my understanding: CMS is the Cloud Monitor service, while the RDS is the relational database service? A user might not want one or the other.

The default values are collected by CMS. As an optional feature, similar to AWS, you can enable the advanced monitoring metrics to get more details such as IOPS usage.

Renaming it in the Telegraf config would be a breaking change and we do not want to do that.

I fully agree with that.

What I would suggest doing is to add a config option that sets what metrics are captured, something like:

## Aliyun Metrics
## Specify which metrics to capture from Aliyun, choose from:
##  * cms - Cloud Monitor service
##  * rds - Relational Database service
# metrics = ["cms"]

A user can then add in rds if they want, or even use only rds. I suspect there is a possibility where we would want additional metrics from other services down the road. The other option is a 2nd plugin just for RDS, but this looks like we can keep them together given the wide overlap of the codebase.

I would like to build up the whole topic in a generic way and also offer the possibility to consume further extended metrics of the services in the future. For this reason, I have introduced a generic parameter that provides the basic function for the plugin and in the second step activates the function for the RDS service.

I can only report on our current case, and we need both. The CMS metrics and the performance metrics. If someone doesn't need the CMS metrics, they don't need to forward or extract them.

The other option is a 2nd plugin just for RDS, but this looks like we can keep them together given the wide overlap of the codebase.

I think we can handle it in one generic solution that extracts metrics from the Alicloud API.

powersj · 2024-07-17T13:59:59Z

I have introduced a generic parameter that provides the basic function for the plugin and in the second step activates the function for the RDS service.

We discourage boolean parameters. As a plugin gets more and more features it means the user has more and more toggles that they need to switch on or off and complicates the configuration greatly. We prefer a single config option that is an array that specifies what features to toggle on or off.

ZPascal · 2024-07-18T12:06:19Z

I have introduced a generic parameter that provides the basic function for the plugin and in the second step activates the function for the RDS service.

We discourage boolean parameters. As a plugin gets more and more features it means the user has more and more toggles that they need to switch on or off and complicates the configuration greatly. We prefer a single config option that is an array that specifies what features to toggle on or off.

@powersj Thank you for the update. I'll adapt the code base.

telegraf-tiger · 2024-08-08T18:09:43Z

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Forums or provide additional details in this issue and request that it be re-opened. Thank you!

powersj · 2024-08-08T18:11:24Z

I'm going to re-open as @ZPascal I think you've been going through and updating the PRs and we were nearly done with this one.

powersj

Thanks for the update, some initial comments

powersj · 2024-08-13T14:18:49Z

plugins/inputs/aliyuncms/README.md

+    ## Specified which metric service to capture from Aliyun
+    ##  * cms - Cloud Monitor service (default settings)
+    ##  * rds - Relational Database service
+    # service = 'rds'
+


Why is there a service + metric_services? Shouldn't this be removed in favor of the metric_services above?

The metric_service is the general trigger point to initialize the RDS client on a technical level and to activate the function in general. The service parameter defines the corresponding extended metrics and marks them. These metrics are not available at the standard level (CMS). This means at the API level that you will not get a result for the endpoint if you call it with the wrong metrics.
It is not marked as an error, but we log it as debug information. If this behavior is ok, I can also remove the service parameter and we intentionally call the API endpoint with the wrong information.

I'm not sure I understand the scenario at play with these settings then.

Why would I enable "cms" metric_services and not "cms" service? (This naming is a headache) :) Likewise, why would I enable "rds" metric_services and not "rds" service?

You mentioned extended metrics. We are not collecting these today for cms right?

Should this be a follow-on PR and this PR instead be focused only on adding the new RDS service?

Hi @powersj,

I'm not sure I understand the scenario at play with these settings then.
Why would I enable "cms" metric_services and not "cms" service? (This naming is a headache) :) Likewise, why would I enable "rds" metric_services and not "rds" service?

The corresponding parameters are linked to each other.

The metric_service parameter is the generic option and generally activates the option. It initiates the separate client that processes the request to the dedicated API endpoint.

The service parameter at the metric level is the trigger point for calling the API. In the current implementation, it is the RDS service of Alicloud and extracts the separately specified values.

If we don't have a switch for the metric level, we are intentionally calling the RDS API in the wrong context to get metrics e.g. we expect CMS default metrics, thus burdening the Alicloud rate limit.

You mentioned extended metrics. We are not collecting these today for cms right?
Should this be a follow-on PR and this PR instead be focused only on adding the new RDS service?

The CMS service and the dedicated APIs of the services themselves are not compatible and are separate endpoints. When I talk about extended metrics, I'm talking about the dedicated RDS API endpoint that this PR is already using.

Hi @powersj, is there any ambiguity regarding the functionality or can I proceed to write new tests?

@ZPascal please do not expose technical details to users. As you said, the two parameters are linked and as such a nightmare to maintain and even communicate to the user. Please let the user define the metric and internally map to which client to initialize and use!

powersj · 2024-08-13T14:23:28Z

plugins/inputs/aliyuncms/aliyuncms_test.go

+	resp := new(rds.DescribeDBInstancePerformanceResponse)
+
+	switch request.Key {
+	//TODO Adapt the Tests and the Mock


What's left here?

I hold myself to writing new unit tests for newly delivered functionality. I am currently working on the mocks and will further customize the tests today.

telegraf-tiger · 2024-08-27T18:09:41Z

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Forums or provide additional details in this issue and request that it be re-opened. Thank you!

ZPascal · 2024-08-27T21:35:46Z

@powersj @srebhan The discussion is still ongoing, but the bot has closed the PR. Can you please open the PR again?

srebhan · 2024-08-29T19:57:07Z

Sorry @ZPascal. I'll reopen and try to read into the discussion tomorrow!

srebhan

Sorry @ZPascal for taking so long to look at your great contribution! I do have some comments in the code, with the most important being to remove the redundant metric_service/service settings. Just use one (e.g. metric) setting and internally map to which service you have to get this from.
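
One possible shape for such an internal mapping is sketched below. The lookup table `rdsPerformanceKeys`, its example entries and the helper names are hypothetical; how the plugin would actually tell RDS performance keys apart from CMS metrics is not settled in this thread.

```go
package main

import "fmt"

type service string

const (
	serviceCMS service = "cms"
	serviceRDS service = "rds"
)

// rdsPerformanceKeys is a hypothetical lookup table of metric names that
// are only served by the dedicated RDS endpoint. The entries are examples.
var rdsPerformanceKeys = map[string]bool{
	"MySQL_IOPS":   true,
	"MySQL_QPSTPS": true,
}

// serviceFor decides which client has to serve a metric name, so the user
// only configures the metric and never the client.
func serviceFor(metricName string) service {
	if rdsPerformanceKeys[metricName] {
		return serviceRDS
	}
	return serviceCMS
}

// requiredServices returns the set of services needed for the configured
// metric names, so only those clients have to be initialized.
func requiredServices(metricNames []string) map[service]bool {
	required := make(map[service]bool)
	for _, name := range metricNames {
		required[serviceFor(name)] = true
	}
	return required
}

func main() {
	fmt.Println(requiredServices([]string{"CpuUsage", "MySQL_IOPS"}))
}
```

Because the decision is made per metric name, the RDS endpoint is never called for a CMS metric, which also addresses the rate-limit concern raised earlier.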

Regarding renaming the plugin: Please don't do this in the present PR but split this out to a separate one. What you can do is to rename the plugin directory and struct to "aliyun" and then register the plugin twice, once with the new and once with the old name like here...
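
The double registration can be pictured with the self-contained sketch below. `Input`, `registry` and `add` are local stand-ins written for this example so that it compiles on its own; in Telegraf the registration would go through `inputs.Add` from the `plugins/inputs` package, and the struct name `Aliyun` is an assumption.

```go
package main

import "fmt"

// Input and registry are local stand-ins for Telegraf's input interface
// and plugin registry, reduced to what this sketch needs.
type Input interface {
	SampleConfig() string
}

var registry = map[string]func() Input{}

// add mimics registering a plugin creator under a name.
func add(name string, creator func() Input) {
	registry[name] = creator
}

// Aliyun is a hypothetical name for the renamed plugin struct.
type Aliyun struct{}

func (*Aliyun) SampleConfig() string { return "# sample config" }

func newPlugin() Input { return &Aliyun{} }

func init() {
	add("aliyun", newPlugin)    // new name
	add("aliyuncms", newPlugin) // old name, so existing configs keep loading
}

func main() {
	for _, name := range []string{"aliyun", "aliyuncms"} {
		_, ok := registry[name]
		fmt.Println(name, ok)
	}
}
```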

srebhan · 2024-08-30T06:50:12Z

plugins/inputs/aliyuncms/README.md

+    ## Specified which metric service to capture from Aliyun
+    ##  * cms - Cloud Monitor service (default settings)
+    ##  * rds - Relational Database service
+    # service = 'rds'
+


@ZPascal please do not expose technical details to users. As you said, the two parameters are linked and as such a nightmare to maintain and even communicate to the user. Please let the user define the metric and internally map to which client to initialize and use!

srebhan · 2024-08-30T06:50:44Z

plugins/inputs/aliyuncms/aliyuncms.go

+	"github.com/jmespath/go-jmespath"
+	"reflect"
+	"slices"


Please keep the imports grouped into built-in, 3rd party and telegraf internal imports, with the groups separated by an empty line!

srebhan · 2024-08-30T06:51:15Z

plugins/inputs/aliyuncms/aliyuncms.go

-	"github.com/jmespath/go-jmespath"
-
+	"github.com/aliyun/alibaba-cloud-sdk-go/services/rds"


Same as above, the SDK is not an internal import so please move it to the group above.

srebhan · 2024-08-30T06:52:00Z

plugins/inputs/aliyuncms/aliyuncms.go

@@ -27,8 +29,8 @@ import (
 var sampleConfig string

 type (
-	// AliyunCMS is aliyun cms config info.
-	AliyunCMS struct {
+	// AliyunMetrics is aliyun cms config info.


This comment adds no info at all. Please remove instead of making it worse. ;-)

srebhan · 2024-08-30T06:53:20Z

plugins/inputs/aliyuncms/aliyuncms.go

-func (*AliyunCMS) SampleConfig() string {
+func (*AliyunMetrics) SampleConfig() string {


Can we please keep the struct name as changing it might make things more confusing...

srebhan · 2024-08-30T06:54:42Z

plugins/inputs/aliyuncms/aliyuncms.go

+	//Check metric services
+	if len(s.MetricServices) == 0 {
+		s.MetricServices = []string{"cms"}
+		s.Log.Info("'metric_services' is not set. Metrics will be queried from the cms service")


Please do not log anything for filling in default values. This is confusing and will make users explicitly set the values in their configs to silence this with no benefit.
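
A minimal sketch of the requested behavior, assuming a reduced `plugin` struct that only carries the `MetricServices` field from the diff: the default is filled in during `Init` without any log output.

```go
package main

import "fmt"

// plugin is a reduced stand-in for the plugin struct in the diff.
type plugin struct {
	MetricServices []string
}

// Init fills in the default silently; an unset option is not worth a log
// line because nothing unexpected has happened.
func (p *plugin) Init() error {
	if len(p.MetricServices) == 0 {
		p.MetricServices = []string{"cms"}
	}
	return nil
}

func main() {
	p := &plugin{}
	if err := p.Init(); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(p.MetricServices)
}
```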

srebhan · 2024-08-30T06:55:57Z

plugins/inputs/aliyuncms/aliyuncms.go

+				for _, instanceID := range metric.requestDimensions {
+					req := rds.CreateDescribeDBInstancePerformanceRequest()
+					req.DBInstanceId = instanceID["instanceId"]
+					req.Key = metricName
+					startTime := s.windowStart.UTC()
+					req.StartTime = fmt.Sprintf("%d-%02d-%02dT%02d:%02dZ", startTime.Year(), startTime.Month(),
+						startTime.Day(), startTime.Hour(), startTime.Minute())
+					endTime := s.windowEnd.UTC()
+					req.EndTime = fmt.Sprintf("%d-%02d-%02dT%02d:%02dZ", endTime.Year(), endTime.Month(),
+						endTime.Day(), endTime.Hour(), endTime.Minute())
+					req.RegionId = region
+
+					resp, err := s.rdsClient.DescribeDBInstancePerformance(req)
+
+					if err != nil {
+						return fmt.Errorf("failed to get the database instance performance metrics: %w", err)
+					}
+					if resp.GetHttpStatus() != 200 {
+						s.Log.Errorf("failed to get the database instance performance metrics: %v", resp.BaseResponse.GetHttpContentString())
+						break
+					}
+
+					for _, performanceKey := range resp.PerformanceKeys.PerformanceKey {
+						for _, performanceValue := range performanceKey.Values.PerformanceValue {
+							parsedTime, err := time.Parse(time.RFC3339, performanceValue.Date)
+							if err != nil {
+								return fmt.Errorf("failed to parse response performance time datapoints: %w", err)
+							}
+
+							if strings.Contains(performanceValue.Value, "&") {
+								performanceKeys := strings.Split(performanceKey.ValueFormat, "&")
+								performanceValues := strings.Split(performanceValue.Value, "&")
+
+								for i, value := range performanceValues {
+									valueAsFloat, err := strconv.ParseFloat(value, 32)
+									if err != nil {
+										return fmt.Errorf("failed to convert the performance value string to an float: %w", err)
+									}
+									datapoints = append(datapoints,
+										map[string]interface{}{
+											"instanceId":       instanceID["instanceId"],
+											performanceKeys[i]: valueAsFloat,
+											"timestamp":        parsedTime.Unix(),
+										})
+								}
+							} else {
+								valueAsFloat, err := strconv.ParseFloat(performanceValue.Value, 32)
+								if err != nil {
+									return fmt.Errorf("failed to convert the performance value string to an float: %w", err)
+								}
+								datapoints = append(datapoints,
+									map[string]interface{}{
+										"instanceId":               instanceID["instanceId"],
+										performanceKey.ValueFormat: valueAsFloat,
+										"timestamp":                parsedTime.Unix(),
+									})
+							}
+						}
+					}
+
+					if len(datapoints) == 0 {
+						s.Log.Debugf("No rds performance metrics returned from RDS, response msg: %s", resp.GetHttpContentString())
+						break
+					}
+				}


OMG! Please move this into its own function and try to reduce nesting!
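
One way the value handling could be pulled out is sketched below. The helper names and the example field names are assumptions, not code from the PR. Since `strings.Split` returns a single element when no `&` is present, the two branches of the diff collapse into one loop. The sketch also parses with 64 bits instead of 32 and adds a length check that the diff does not have.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// formatWindow renders a timestamp as UTC with minute precision, the same
// layout the diff assembles by hand with fmt.Sprintf.
func formatWindow(t time.Time) string {
	return t.UTC().Format("2006-01-02T15:04Z")
}

// parsePerformanceValue splits an "&"-separated value string and pairs
// each entry with the field name at the same position in valueFormat.
func parsePerformanceValue(instanceID, valueFormat, value, date string) ([]map[string]interface{}, error) {
	ts, err := time.Parse(time.RFC3339, date)
	if err != nil {
		return nil, fmt.Errorf("parsing timestamp %q failed: %w", date, err)
	}

	keys := strings.Split(valueFormat, "&")
	values := strings.Split(value, "&")
	if len(keys) != len(values) {
		return nil, fmt.Errorf("got %d keys but %d values", len(keys), len(values))
	}

	datapoints := make([]map[string]interface{}, 0, len(values))
	for i, raw := range values {
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return nil, fmt.Errorf("converting value %q failed: %w", raw, err)
		}
		datapoints = append(datapoints, map[string]interface{}{
			"instanceId": instanceID,
			keys[i]:      v,
			"timestamp":  ts.Unix(),
		})
	}
	return datapoints, nil
}

func main() {
	fmt.Println(formatWindow(time.Date(2024, 8, 30, 6, 55, 0, 0, time.UTC)))
	// "io" and "throughput" are made-up field names for illustration.
	dps, err := parsePerformanceValue("rm-example", "io&throughput", "12.5&3", "2024-08-30T06:55:00Z")
	fmt.Println(dps, err)
}
```

The request loop would then only build the request, call the client and append whatever the helper returns, which removes several nesting levels.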

srebhan · 2024-08-30T06:56:50Z

plugins/inputs/aliyuncms/aliyuncms.go

+						if reflect.TypeOf(value).String() == "int64" {
+							datapointTime = value.(int64)
+						} else {
+							datapointTime = int64(value.(float64)) / 1000
+						}


Please use type assertions here instead of reflection!
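
A type switch along the following lines would cover the two cases from the diff without `reflect`. The function name `datapointSeconds` and the error for unexpected types are assumptions added for this sketch; the division by 1000 for `float64` values is taken over from the diff.

```go
package main

import "fmt"

// datapointSeconds converts a decoded timestamp to Unix seconds. As in the
// diff, int64 values are used as they are and float64 values are divided
// by 1000.
func datapointSeconds(value interface{}) (int64, error) {
	switch v := value.(type) {
	case int64:
		return v, nil
	case float64:
		return int64(v) / 1000, nil
	default:
		return 0, fmt.Errorf("unexpected timestamp type %T", value)
	}
}

func main() {
	fmt.Println(datapointSeconds(int64(1725000900)))
	fmt.Println(datapointSeconds(float64(1725000900000)))
	fmt.Println(datapointSeconds("not a timestamp"))
}
```

Unlike the unchecked assertion `value.(float64)` in the diff, the default branch turns an unexpected type into an error instead of a panic.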

srebhan · 2024-08-30T06:58:45Z

plugins/inputs/aliyuncms/discovery.go

@@ -39,7 +39,7 @@ type discoveryTool struct {

 	respRootKey     string //Root key in JSON response where to look for discovery data
 	respObjectIDKey string //Key in element of array under root key, that stores object ID
-	//for ,majority of cases it would be InstanceId, for OSS it is BucketName. This key is also used in dimension filtering// )
+	//for ,the majority of cases it would be InstanceId, for OSS it is BucketName. This key is also used in dimension filtering// )


I don't get what this comment wants to tell me and where this relates to...

srebhan · 2024-08-30T07:00:12Z

plugins/inputs/aliyuncms/discovery.go

@@ -63,15 +63,13 @@ func getRPCReqFromDiscoveryRequest(req discoveryRequest) (*requests.RpcRequest,
 	}

 	ptrV := reflect.Indirect(reflect.ValueOf(req))
-


Could you please move formatting cleanup into a separate PR to ease reviews! It's hard if functional changes and pure style changes are mixed and some files are only style changes...

telegraf-tiger · 2024-09-14T11:17:31Z

Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip.
Downloads for additional architectures and packages are available below.

⚠️ This pull request increases the Telegraf binary size by 4.83 % for linux amd64 (new size: 273.5 MB, nightly size 260.8 MB)

telegraf-tiger · 2024-09-23T18:09:44Z

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Forums or provide additional details in this issue and request that it be re-opened. Thank you!

srebhan · 2024-09-30T17:04:13Z

Unfortunately, there was no progress in the PR for some time. Please reopen the PR if you want to continue working on the matter.

ZPascal · 2024-10-01T08:56:57Z

Hi @srebhan, I'm still working on the corresponding feature in general, but I need a bit more time to handle the corresponding case and rewrite the functionality in general. I'll reopen a new PR if the feature is ready.

srebhan · 2024-10-01T10:16:15Z

I reopened the PR to be sure we keep track...

ZPascal force-pushed the adjust-the-alicloud-metrics-exporter branch from 9f75e88 to d4bed9f on June 25, 2024 14:05

ZPascal force-pushed the adjust-the-alicloud-metrics-exporter branch from e2b3c16 to d632ccc on June 27, 2024 13:21

ZPascal force-pushed the adjust-the-alicloud-metrics-exporter branch from c0b4842 to a25731c on June 27, 2024 20:02

ZPascal changed the title ~~Adjust the alicloud metrics exporter and add RDS performance metrics~~ feat: Adjust the alicloud metrics exporter and add RDS performance metrics Jun 27, 2024

telegraf-tiger bot added the "feat" label (Improvement on an existing feature such as adding a new setting/mode to an existing plugin) Jun 27, 2024

powersj added the "waiting for response" label Jul 15, 2024

telegraf-tiger bot removed the "waiting for response" label Jul 17, 2024

powersj added the "waiting for response" label Jul 24, 2024

telegraf-tiger bot closed this Aug 8, 2024

powersj reopened this Aug 8, 2024

telegraf-tiger bot removed the "waiting for response" label Aug 8, 2024

powersj added the "waiting for response" label Aug 8, 2024

ZPascal force-pushed the adjust-the-alicloud-metrics-exporter branch from f7f9386 to bc31acc on August 11, 2024 12:40

ZPascal marked this pull request as ready for review August 12, 2024 04:52

powersj reviewed Aug 13, 2024

powersj removed the "waiting for response" label Aug 13, 2024

srebhan assigned powersj Aug 13, 2024

powersj added the "waiting for response" label Aug 13, 2024

ZPascal force-pushed the adjust-the-alicloud-metrics-exporter branch 2 times, most recently from 5bd146b to 6f80f8e on August 15, 2024 07:25

powersj assigned srebhan Aug 20, 2024

srebhan unassigned powersj Aug 21, 2024

telegraf-tiger bot closed this Aug 27, 2024

telegraf-tiger bot removed the "waiting for response" label Aug 27, 2024

srebhan reopened this Aug 29, 2024

srebhan linked an issue Aug 29, 2024 that may be closed by this pull request

[gGNMI] incorrect metric processing #15792

Open

srebhan removed a link to an issue Aug 29, 2024

[gGNMI] incorrect metric processing #15792

Open

srebhan reviewed Aug 30, 2024

srebhan added the "waiting for response" label Sep 9, 2024

ZPascal added 2 commits September 14, 2024 12:53

feat: Add the RDS performance and generic metrics option

77b7461

fix: Ajust the tests

44b89fc

ZPascal force-pushed the adjust-the-alicloud-metrics-exporter branch from ea43c13 to 44b89fc on September 14, 2024 10:53

telegraf-tiger bot closed this Sep 23, 2024

telegraf-tiger bot removed the "waiting for response" label Sep 30, 2024

srebhan reopened this Oct 1, 2024

ZPascal added 2 commits October 9, 2024 06:40

WIP

fdba61b

WIP

c2f7b64