Skip to content

Commit

Permalink
[core] Add call_site to Tasks and Actors. (#48920)
Browse files Browse the repository at this point in the history
Adds a new field `string call_site` to TaskSpec. It's populated from
language frontend (`_raylet.pyx`), if
`RAY_record_task_actor_creation_sites` is set to `true`. The field is
propagated to `TaskEvent` and `ActorTableData` and through Dashboard
APIs.

Users can use `ray list task --detail` as well as the Web UI. It works
for Tasks, Actors (creation) and Actor Methods.

The flag `RAY_record_task_actor_creation_sites` is disabled by default,
user can enable it if they want; so by default there should be no
performance costs.


Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: angelinalg <[email protected]>
  • Loading branch information
rynewang and angelinalg authored Dec 19, 2024
1 parent bbd21ed commit 53d2145
Show file tree
Hide file tree
Showing 30 changed files with 288 additions and 13 deletions.
4 changes: 3 additions & 1 deletion cpp/src/ray/runtime/task/local_mode_task_submitter.cc
Original file line number Diff line number Diff line change
Expand Up @@ -67,7 +67,9 @@ ObjectID LocalModeTaskSubmitter::Submit(InvocationSpec &invocation,
required_placement_resources,
"",
/*depth=*/0,
local_mode_ray_tuntime_.GetCurrentTaskId());
local_mode_ray_tuntime_.GetCurrentTaskId(),
// Stacktrace is not available in local mode.
/*call_site=*/"");
if (invocation.task_type == TaskType::NORMAL_TASK) {
} else if (invocation.task_type == TaskType::ACTOR_CREATION_TASK) {
invocation.actor_id = local_mode_ray_tuntime_.GetNextActorID();
Expand Down
27 changes: 24 additions & 3 deletions cpp/src/ray/runtime/task/native_task_submitter.cc
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
#include <ray/api/ray_exception.h>

#include "../abstract_ray_runtime.h"
#include "ray/common/ray_config.h"

namespace ray {
namespace internal {
Expand Down Expand Up @@ -68,6 +69,13 @@ ObjectID NativeTaskSubmitter::Submit(InvocationSpec &invocation,
options.serialized_runtime_env_info = call_options.serialized_runtime_env_info;
options.generator_backpressure_num_objects = -1;
std::vector<rpc::ObjectReference> return_refs;

std::string call_site;
if (::RayConfig::instance().record_task_actor_creation_sites()) {
std::stringstream ss;
ss << ray::StackTrace();
call_site = ss.str();
}
if (invocation.task_type == TaskType::ACTOR_TASK) {
// NOTE: Ray CPP doesn't support per-method max_retries and retry_exceptions
const auto native_actor_handle = core_worker.GetActorHandle(invocation.actor_id);
Expand All @@ -80,6 +88,7 @@ ObjectID NativeTaskSubmitter::Submit(InvocationSpec &invocation,
max_retries,
/*retry_exceptions=*/false,
/*serialized_retry_exception_allowlist=*/"",
call_site,
return_refs);
if (!status.ok()) {
return ObjectID::Nil();
Expand All @@ -103,7 +112,9 @@ ObjectID NativeTaskSubmitter::Submit(InvocationSpec &invocation,
1,
false,
scheduling_strategy,
"");
"",
/*serialized_retry_exception_allowlist=*/"",
call_site);
}
return ObjectID::FromBinary(return_refs[0].object_id());
}
Expand All @@ -130,6 +141,12 @@ ActorID NativeTaskSubmitter::CreateActor(InvocationSpec &invocation,
bundle_id.second);
placement_group_scheduling_strategy->set_placement_group_capture_child_tasks(false);
}
std::string call_site;
if (::RayConfig::instance().record_task_actor_creation_sites()) {
std::stringstream ss;
ss << ray::StackTrace();
call_site = ss.str();
}
ray::core::ActorCreationOptions actor_options{
create_options.max_restarts,
/*max_task_retries=*/0,
Expand All @@ -144,8 +161,12 @@ ActorID NativeTaskSubmitter::CreateActor(InvocationSpec &invocation,
scheduling_strategy,
create_options.serialized_runtime_env_info};
ActorID actor_id;
auto status = core_worker.CreateActor(
BuildRayFunction(invocation), invocation.args, actor_options, "", &actor_id);
auto status = core_worker.CreateActor(BuildRayFunction(invocation),
invocation.args,
actor_options,
/*extension_data=*/"",
call_site,
&actor_id);
if (!status.ok()) {
throw RayException("Create actor error");
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ General Debugging
=======================

Distributed applications are more powerful yet complicated than non-distributed ones. Some of Ray's behavior might catch
users off guard while there may be sound arguments for these design choices.
users off guard while there may be sound arguments for these design choices.

This page lists some common issues users may run into. In particular, users think of Ray as running on their local machine, and
while this is sometimes true, this leads to a lot of issues.
Expand Down Expand Up @@ -225,3 +225,55 @@ This document discusses some common problems that people run into when using Ray
as well as some known problems. If you encounter other problems, `let us know`_.

.. _`let us know`: https://github.com/ray-project/ray/issues

Capture task and actor call sites
---------------------------------

Ray can optionally capture and display the stacktrace of where your code invokes tasks, creates actors or invokes actor tasks. This feature can help with debugging and understanding the execution flow of your application.

To enable call site capture, set the environment variable ``RAY_record_task_actor_creation_sites=true``. When enabled:

- Ray captures the stacktrace when creating tasks, actors or calling actor methods
- The call site stacktrace is visible in:
- Ray Dashboard UI under the task details and actor details pages
- ``ray list task --detail`` CLI command output
- State API responses

Note that stacktrace capture is disabled by default to avoid any performance overhead. Only enable it when needed for debugging purposes.

Example:

.. testcode::

import ray

# Enable stacktrace capture
ray.init(runtime_env={"env_vars": {"RAY_record_task_actor_creation_sites": "true"}})

@ray.remote
def my_task():
return 42

# Capture the stacktrace upon task invocation.
future = my_task.remote()
result = ray.get(future)

@ray.remote
class Counter:
def __init__(self):
self.value = 0

def increment(self):
self.value += 1
return self.value

# Capture stacktrace upon actor creation.
counter = Counter.remote()

# Capture stacktrace upon method invocation.
counter.increment.remote()


The stacktrace shows the exact line numbers and call stack where the task was invoked, actor was created and methods were invoked.

This feature is currently only supported for Python and C++ tasks.
23 changes: 21 additions & 2 deletions python/ray/_raylet.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -3726,6 +3726,7 @@ cdef class CoreWorker:
c_string serialized_retry_exception_allowlist
CTaskID current_c_task_id
TaskID current_task = self.get_current_task_id()
c_string call_site

self.python_scheduling_strategy_to_c(
scheduling_strategy, &c_scheduling_strategy)
Expand All @@ -3734,6 +3735,10 @@ cdef class CoreWorker:
retry_exception_allowlist,
function_descriptor)

if RayConfig.instance().record_task_actor_creation_sites():
# TODO(ryw): unify with get_py_stack used by record_ref_creation_sites.
call_site = ''.join(traceback.format_stack())

with self.profile_event(b"submit_task"):
prepare_resources(resources, &c_resources)
prepare_labels(labels, &c_labels)
Expand Down Expand Up @@ -3761,6 +3766,7 @@ cdef class CoreWorker:
c_scheduling_strategy,
debugger_breakpoint,
serialized_retry_exception_allowlist,
call_site,
current_c_task_id,
)

Expand Down Expand Up @@ -3810,10 +3816,15 @@ cdef class CoreWorker:
c_vector[CObjectID] incremented_put_arg_ids
optional[c_bool] is_detached_optional = nullopt
unordered_map[c_string, c_string] c_labels
c_string call_site

self.python_scheduling_strategy_to_c(
scheduling_strategy, &c_scheduling_strategy)

if RayConfig.instance().record_task_actor_creation_sites():
# TODO(ryw): unify with get_py_stack used by record_ref_creation_sites.
call_site = ''.join(traceback.format_stack())

with self.profile_event(b"submit_task"):
prepare_resources(resources, &c_resources)
prepare_resources(placement_resources, &c_placement_resources)
Expand Down Expand Up @@ -3849,7 +3860,9 @@ cdef class CoreWorker:
enable_task_events,
c_labels),
extension_data,
&c_actor_id)
call_site,
&c_actor_id,
)

# These arguments were serialized and put into the local object
# store during task submission. The backend increments their local
Expand Down Expand Up @@ -3959,11 +3972,15 @@ cdef class CoreWorker:
c_string serialized_retry_exception_allowlist
c_string serialized_runtime_env = b"{}"
unordered_map[c_string, c_string] c_labels
c_string call_site

serialized_retry_exception_allowlist = serialize_retry_exception_allowlist(
retry_exception_allowlist,
function_descriptor)

if RayConfig.instance().record_task_actor_creation_sites():
call_site = ''.join(traceback.format_stack())

with self.profile_event(b"submit_task"):
if num_method_cpus > 0:
c_resources[b"CPU"] = num_method_cpus
Expand Down Expand Up @@ -3992,8 +4009,10 @@ cdef class CoreWorker:
max_retries,
retry_exceptions,
serialized_retry_exception_allowlist,
call_site,
return_refs,
current_c_task_id)
current_c_task_id,
)
# These arguments were serialized and put into the local object
# store during task submission. The backend increments their local
# ref count initially to ensure that they remain in scope until we
Expand Down
15 changes: 15 additions & 0 deletions python/ray/dashboard/client/src/pages/actor/ActorDetail.tsx
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
import { Box } from "@mui/material";
import React from "react";
import { Outlet } from "react-router-dom";
import { CodeDialogButton } from "../../common/CodeDialogButton";
import { CollapsibleSection } from "../../common/CollapsibleSection";
import { DurationText } from "../../common/DurationText";
import { formatDateFromTimeMs } from "../../common/formatUtils";
Expand Down Expand Up @@ -205,6 +206,20 @@ const ActorDetailPage = () => {
</div>
),
},
{
label: "Call site",
content: (
<Box display="inline-block">
<CodeDialogButton
title="Call site"
code={
actorDetail.callSite ||
'Call site not recorded. To enable, set environment variable "RAY_record_task_actor_creation_sites" to "true".'
}
/>
</Box>
),
},
]}
/>
<CollapsibleSection title="Logs" startExpanded>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ const MOCK_ACTORS: { [actorId: string]: Actor } = {
requiredResources: {},
placementGroupId: "123",
reprName: ",",
callSite: "",
},
ACTOR_2: {
actorId: "ACTOR_2",
Expand All @@ -42,6 +43,7 @@ const MOCK_ACTORS: { [actorId: string]: Actor } = {
requiredResources: {},
placementGroupId: "123",
reprName: ",",
callSite: "",
},
ACTOR_3: {
actorId: "ACTOR_3",
Expand All @@ -63,6 +65,7 @@ const MOCK_ACTORS: { [actorId: string]: Actor } = {
requiredResources: {},
placementGroupId: "123",
reprName: ",",
callSite: "",
},
ACTOR_4: {
actorId: "ACTOR_4",
Expand All @@ -84,6 +87,7 @@ const MOCK_ACTORS: { [actorId: string]: Actor } = {
requiredResources: {},
placementGroupId: "123",
reprName: ",",
callSite: "",
},
ACTOR_5: {
actorId: "ACTOR_5",
Expand All @@ -105,6 +109,7 @@ const MOCK_ACTORS: { [actorId: string]: Actor } = {
requiredResources: {},
placementGroupId: "123",
reprName: ",",
callSite: "",
},
};

Expand Down
20 changes: 19 additions & 1 deletion python/ray/dashboard/client/src/pages/task/TaskPage.tsx
Original file line number Diff line number Diff line change
@@ -1,7 +1,10 @@
import { Box, Typography } from "@mui/material";
import React from "react";
import { useParams } from "react-router-dom";
import { CodeDialogButtonWithPreview } from "../../common/CodeDialogButton";
import {
CodeDialogButton,
CodeDialogButtonWithPreview,
} from "../../common/CodeDialogButton";
import { CollapsibleSection } from "../../common/CollapsibleSection";
import { DurationText } from "../../common/DurationText";
import { formatDateFromTimeMs } from "../../common/formatUtils";
Expand Down Expand Up @@ -86,6 +89,7 @@ const TaskPageContents = ({
job_id,
func_or_class_name,
name,
call_site,
} = task;
const isTaskActive = task.state === "RUNNING" && task.worker_id;

Expand Down Expand Up @@ -242,6 +246,20 @@ const TaskPageContents = ({
label: "",
content: undefined,
},
{
label: "Call site",
content: (
<Box display="inline-block">
<CodeDialogButton
title="Call site"
code={
call_site ||
'Call site not recorded. To enable, set environment variable "RAY_record_task_actor_creation_sites" to "true".'
}
/>
</Box>
),
},
]}
/>
<CollapsibleSection title="Logs" startExpanded>
Expand Down
1 change: 1 addition & 0 deletions python/ray/dashboard/client/src/type/actor.ts
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ export type Actor = {
};
exitDetail: string;
reprName: string;
callSite?: string | undefined;
};

export type ActorDetail = {
Expand Down
1 change: 1 addition & 0 deletions python/ray/dashboard/client/src/type/task.ts
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,7 @@ export type Task = {
error_type: string | null;
error_message: string | null;
task_log_info: { [key: string]: string | null | number };
call_site: string | null;
};

export type ProfilingData = {
Expand Down
1 change: 1 addition & 0 deletions python/ray/dashboard/modules/actor/actor_head.py
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,7 @@ def actor_table_data_to_dict(message):
"endTime",
"reprName",
"placementGroupId",
"callSite",
}
light_message = {k: v for (k, v) in orig_message.items() if k in fields}
light_message["actorClass"] = orig_message["className"]
Expand Down
6 changes: 5 additions & 1 deletion python/ray/includes/libcoreworker.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -127,12 +127,15 @@ cdef extern from "ray/core_worker/core_worker.h" nogil:
const CSchedulingStrategy &scheduling_strategy,
c_string debugger_breakpoint,
c_string serialized_retry_exception_allowlist,
c_string call_site,
const CTaskID current_task_id)
CRayStatus CreateActor(
const CRayFunction &function,
const c_vector[unique_ptr[CTaskArg]] &args,
const CActorCreationOptions &options,
const c_string &extension_data, CActorID *actor_id)
const c_string &extension_data,
c_string call_site,
CActorID *actor_id)
CRayStatus CreatePlacementGroup(
const CPlacementGroupCreationOptions &options,
CPlacementGroupID *placement_group_id)
Expand All @@ -147,6 +150,7 @@ cdef extern from "ray/core_worker/core_worker.h" nogil:
int max_retries,
c_bool retry_exceptions,
c_string serialized_retry_exception_allowlist,
c_string call_site,
c_vector[CObjectReference] &task_returns,
const CTaskID current_task_id)
CRayStatus KillActor(
Expand Down
2 changes: 2 additions & 0 deletions python/ray/includes/ray_config.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -100,3 +100,5 @@ cdef extern from "ray/common/ray_config.h" nogil:
int gcs_rpc_server_reconnect_timeout_s() const

int maximum_gcs_destroyed_actor_cached_count() const

c_bool record_task_actor_creation_sites() const
Loading

0 comments on commit 53d2145

Please sign in to comment.