-
Notifications
You must be signed in to change notification settings - Fork 36
stop historical detector; support return AD task in get detector API #359
stop historical detector; support return AD task in get detector API #359
Conversation
Codecov Report
@@ Coverage Diff @@
## master #359 +/- ##
============================================
+ Coverage 78.65% 78.94% +0.28%
- Complexity 2465 2635 +170
============================================
Files 224 243 +19
Lines 10995 11658 +663
Branches 943 1002 +59
============================================
+ Hits 8648 9203 +555
- Misses 1907 1981 +74
- Partials 440 474 +34
Flags with carried forward coverage won't be shown. Click here to find out more. |
eb8f2c0
to
4128228
Compare
4128228
to
57ab219
Compare
@@ -56,6 +63,7 @@ public ADTaskCacheManager(Settings settings, ClusterService clusterService, Memo | |||
clusterService.getClusterSettings().addSettingsUpdateConsumer(MAX_BATCH_TASK_PER_NODE, it -> maxAdBatchTaskPerNode = it); | |||
taskCaches = new ConcurrentHashMap<>(); | |||
this.memoryTracker = memoryTracker; | |||
this.detectors = Sets.newConcurrentHashSet(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use this field to record all detectors which running on the coordinating node to resolve race condition. If user starts multiple tasks for the same detector, we will put the first task in cache and record the detector id in this field. For other tasks, we will check if the detector id exists in this set, then we find the detector id exists that means there is already one task running for this detector, so we will reject other tasks.
Will add more comments here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the checking logic in another PR? don't see the related codes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can check this method ADTaskCacheManager#put(String detectorId)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. It will throw DuplicateTaskException for duplicated tasks. Thanks.
Usually any new state will add some complexity for concurrency control/failure mode into the codes. I don't see any issue for this change now. pls check if there is some edge case from e2e p.o.v. such as: for some case, the detector id can't be removed which causes the customer can't start new task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion, will test more edge cases together with frontend changes.
taskCaches.remove(taskId); | ||
List<ADBatchTaskCache> taskCaches = getBatchTaskCacheByDetectorId(detectorId); | ||
if (taskCaches.isEmpty()) { | ||
detectors.remove(detectorId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the concurrency control is added in "put" method. why not for "remove"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By adding concurrency control in "put" method, we make sure only one detector id exists in detectors set, so no need to add concurrency control when we remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious: is there only 1 detectors
object in memory across all the nodes? Or, is detectors
object in memory of master node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If user start historical detector, the request will be forwarded to coordinating node(hash on detector id) first. We will put detectors on coordinating node. When we start detector, will check if the detector exists on coordinating node or not. If exists, will reject the request to avoid running multiple tasks for one detector.
src/main/java/com/amazon/opendistroforelasticsearch/ad/task/ADTaskManager.java
Outdated
Show resolved
Hide resolved
@@ -0,0 +1,24 @@ | |||
/* | |||
* Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2021? Have you figured out how to configure spotless for the year in copyright header
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check this comment #355 (comment) , paste the comment here
./gradlew spotlessApply will reset copyright year as 2020 as configured in file spotless.license.java.
To make this PR clean, will send out a separate PR to update this license file and apply to all files. Will replace 2020 with $YEAR, then spotless can fill as current year automatically.
src/main/java/com/amazon/opendistroforelasticsearch/ad/transport/ADTaskProfileAction.java
Outdated
Show resolved
Hide resolved
} | ||
}, e -> { | ||
if (e instanceof IndexNotFoundException) { | ||
function.accept(Optional.empty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need this line? looks like to me function is not being executed if Optional.empty()
is given
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Different functions may handle empty task in different way, for example get detector with task will return empty task with detector(https://github.com/opendistro-for-elasticsearch/anomaly-detection/pull/359/files#diff-a30fb43728874ee34d76805a90385e2774d58bc7fc05bccc9d1fbd6e2a64a67fR210)
taskCaches.remove(taskId); | ||
List<ADBatchTaskCache> taskCaches = getBatchTaskCacheByDetectorId(detectorId); | ||
if (taskCaches.isEmpty()) { | ||
detectors.remove(detectorId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious: is there only 1 detectors
object in memory across all the nodes? Or, is detectors
object in memory of master node?
src/main/java/com/amazon/opendistroforelasticsearch/ad/model/ADTaskProfile.java
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/ad/task/ADBatchTaskCache.java
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/ad/task/ADTaskManager.java
Outdated
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/ad/task/ADTaskManager.java
Outdated
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/ad/task/ADTaskManager.java
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/ad/transport/ADCancelTaskResponse.java
Outdated
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/ad/transport/ADTaskProfileRequest.java
Outdated
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/ad/transport/ADTaskProfileResponse.java
Outdated
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorPlugin.java
Show resolved
Hide resolved
...in/java/com/amazon/opendistroforelasticsearch/ad/transport/ForwardADTaskTransportAction.java
Show resolved
Hide resolved
} | ||
} | ||
|
||
protected void cleanDetectorCache(ADTask adTask, TransportService transportService, AnomalyDetectorFunction function) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you help me understand the function? The title is clean detector cache. Could you point where you cleaned detector cache? Seems what you did are: 1) forward to coordinating node the stop task action; 2) no matter you succeed or fail, you invoke the input function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method will be called at two places:
- Check
ADBatchTaskRunner#internalBatchTaskListener
, if the task finished normally or failed, we will call this function to clean detector cache on coordinating node. - Check
ADTaskManager#resetTaskStateAsStopped
, if the task state is out of sync such as worker node crashed, then we need to clean detector cache on coordinating node and reset the task state.
src/main/java/com/amazon/opendistroforelasticsearch/ad/task/ADTaskManager.java
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/ad/task/ADTaskManager.java
Outdated
Show resolved
Hide resolved
...va/com/amazon/opendistroforelasticsearch/ad/transport/GetAnomalyDetectorTransportAction.java
Show resolved
Hide resolved
*/ | ||
public ADTaskCancellationState cancelLocalTaskByDetectorId(String detectorId, String reason, String userName) { | ||
ADTaskCancellationState cancellationState = adTaskCacheManager.cancelByDetectorId(detectorId, reason, userName); | ||
logger.debug("Cancelled AD task for detector: " + detectorId + ", state: " + cancellationState); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to add reason
and userName
to log, looks like such info is not logged anywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the reason is static message as we don't allow user to fill why they need to stop historical detector on Kibana. But it may help in future when we enable filling cancellation reason. As this is debug level log, sounds not bad to add more info, will add userName and reason in log.
} else if (detector.getDetectorId() == null) { | ||
validationException = addValidationError(CommonErrorMessages.AD_ID_MISSING_MSG, validationException); | ||
} | ||
if (adTaskAction == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to add such check for User
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
User could be null here. If user is null, we will set tasks's startedBy
field as null, check TaskManager#createNewADTask
,
String userName = user == null ? null : user.getName();
Issue #, if available:
Description of changes:
In last PR #355 we added starting historical detector change. As historical detector may run for a long time, we should allow user to kill running historical detector before its task done to avoid consuming unnecessary resource if user think no need to run the historical detector any more for example user find they configured wrong feature.
This PR's main change:
task
URL param in get detector API, if user set it as true, will return latest AD task together with detector.Test
./gradlew build
./gradlew integTest -PnumNodes=3
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.