@rountree rountree commented Oct 3, 2025

Fixes #61

debug_printf2("Propogating spindle environment by copying it to new envp list\n");
for (cur = (char **) envp; *cur; cur++, orig_size++);
- new_size = orig_size + 10;
+ new_size = orig_size + 20;

Should be 'new_size = orig_size + 9'. Eight environment variables get propagated, plus one slot for the NULL.
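
A minimal sketch of that sizing, assuming the count of eight propagated variables given above. `SPINDLE_ENV_COUNT` and `alloc_envp_copy()` are hypothetical names used for illustration, not identifiers from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical constant: number of variables appended to the copied list. */
#define SPINDLE_ENV_COUNT 8

/*
 * Copy the pointers of a NULL-terminated envp into a new array with room for
 * SPINDLE_ENV_COUNT more entries plus the terminating NULL.
 * Returns NULL on allocation failure; *new_size receives the slot count.
 */
static char **alloc_envp_copy(char *const envp[], size_t *new_size)
{
    assert(envp != NULL && new_size != NULL);

    size_t orig_size = 0;
    while (envp[orig_size] != NULL)
        orig_size++;

    /* original entries + 8 new entries + 1 NULL == orig_size + 9 */
    *new_size = orig_size + SPINDLE_ENV_COUNT + 1;

    char **newp = malloc(*new_size * sizeof(*newp));
    if (newp == NULL)
        return NULL;

    memcpy(newp, envp, orig_size * sizeof(*newp));
    for (size_t i = orig_size; i < *new_size; i++)
        newp[i] = NULL; /* unused slots and the terminator */
    return newp;
}
```

Deriving the size from a named count keeps the allocation in step with the number of variables if that number changes later.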

cachepath = chosen_realized_cachepath;
chosen_parsed_cachepath = chosen_parsed_cachepath;
chosen_symbolic_cachepath = chosen_symbolic_cachepath;
}

I'm a bit confused by set_intercept_readlink_cachepath() and set_should_intercept_cachepath(). It seems we're keeping multiple copies of the paths in different static variables, then calling each of these functions to set a different copy.

Multiple variables holding copies of the same information are a source of bugs. Let's consolidate the variables holding paths in the clients into a single instance.

Or am I missing something where these are different?
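
To make the suggestion concrete, here is a rough sketch of a single shared instance. The struct, the setter, and the getter are hypothetical; only the three field names mirror the chosen_realized_cachepath, chosen_parsed_cachepath, and chosen_symbolic_cachepath variables in the diff, and their exact meanings are assumed from those names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container: one copy of each form of the chosen cache path. */
typedef struct {
    const char *realized; /* mirrors chosen_realized_cachepath */
    const char *parsed;   /* mirrors chosen_parsed_cachepath */
    const char *symbolic; /* mirrors chosen_symbolic_cachepath */
} cachepath_set_t;

/* The only place the client stores these paths. */
static cachepath_set_t chosen_paths;

/* Store all three forms once; the caller keeps ownership of the strings. */
static void set_chosen_cachepaths(const char *realized,
                                  const char *parsed,
                                  const char *symbolic)
{
    assert(realized != NULL && parsed != NULL && symbolic != NULL);
    chosen_paths.realized = realized;
    chosen_paths.parsed = parsed;
    chosen_paths.symbolic = symbolic;
}

/* Read-only access for code that would otherwise keep its own static copy. */
static const cachepath_set_t *get_chosen_cachepaths(void)
{
    return &chosen_paths;
}
```

With this shape the intercept code would read from the one instance through the getter rather than being handed its own copy by a separate setter.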

int exit_readys_recvd;
ldcs_dist_model_t dist_model;
ldcs_client_t* client_table;
char *location;

We've still got 'location'. What does 'location' represent now? Is it a duplicate variable for cachepath?

Also, please comment what these variables contain.

char *candidate_cachepaths; /* Colon-separated list of candidate paths (max 64) */
char *chosen_cachepath; /* The consensus path (same across all nodes). */
uint64_t cachepath_bitidx; /* Bit index used by allReduce() to arrive at consensus. */

spindle_launch is the user interface function. We don't need to expose chosen_cachepath or cachepath_bitidx to the user. They're not selecting or reading those values.

Also, what's the difference between 'location' and 'candidate_cachepaths' here? What does it mean if a user sets both, or only one?

}

COMM_LOCK;
client_recv_msg_static(fd, &message, LDCS_READ_BLOCK);

Kind of messy doing one send and three recvs. In other parts of spindle the recv would be one message with three strings rather than three messages.
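
A sketch of the one-message shape, assuming the three strings are packed back to back as NUL-terminated fields in a single payload. `pack_three_strings()` and `unpack_three_strings()` are hypothetical helpers, not existing spindle functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Pack three NUL-terminated strings back to back into one heap buffer.
 * Returns the buffer (caller frees) and stores its length in *len,
 * or returns NULL on allocation failure.
 */
static char *pack_three_strings(const char *a, const char *b, const char *c,
                                size_t *len)
{
    assert(a != NULL && b != NULL && c != NULL && len != NULL);
    size_t la = strlen(a) + 1, lb = strlen(b) + 1, lc = strlen(c) + 1;

    char *buf = malloc(la + lb + lc);
    if (buf == NULL)
        return NULL;

    memcpy(buf, a, la);
    memcpy(buf + la, b, lb);
    memcpy(buf + la + lb, c, lc);
    *len = la + lb + lc;
    return buf;
}

/*
 * Split a packed buffer into three pointers that point into the buffer.
 * Returns 0 on success, -1 if it does not hold three terminated strings.
 */
static int unpack_three_strings(const char *buf, size_t len,
                                const char *out[3])
{
    assert(buf != NULL && out != NULL);
    size_t pos = 0;
    for (int i = 0; i < 3; i++) {
        const char *end = memchr(buf + pos, '\0', len - pos);
        if (end == NULL)
            return -1;
        out[i] = buf + pos;
        pos = (size_t)(end - buf) + 1;
    }
    return 0;
}
```

The client then does one send and one recv, and the receiver validates the payload in one place instead of handling three partial states.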

msgbundle_force_flush(procdata);
}

ldcs_audit_server_md_consensus(procdata, msg);

Suggest renaming ldcs_audit_server_md_consensus(...) to something like ldcs_audit_server_md_allreduce(MD_AND).

The ldcs_audit_server_md_* are all network operation focused, not spindle algorithm focused. That's the layer where you'd add a different network implementation (like infiniband), and we want to keep higher-level spindle concepts out of that layer.
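
Roughly what I have in mind for the split, as a sketch only. `md_reduce_op_t`, `md_allreduce_local()`, and `pick_consensus_index()` are hypothetical names, the reduce runs over a local array in place of real network traffic, and choosing the lowest set bit is an assumed policy rather than something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation selector; MD_AND is the name suggested above. */
typedef enum { MD_AND, MD_OR } md_reduce_op_t;

/* Combine two values under the requested operation. */
static uint64_t md_combine(md_reduce_op_t op, uint64_t a, uint64_t b)
{
    return (op == MD_AND) ? (a & b) : (a | b);
}

/*
 * Local stand-in for a network allreduce: fold the per-node values into one
 * result. A real implementation would exchange values between servers.
 */
static uint64_t md_allreduce_local(md_reduce_op_t op,
                                   const uint64_t *vals, size_t n)
{
    assert(vals != NULL && n > 0);
    uint64_t acc = vals[0];
    for (size_t i = 1; i < n; i++)
        acc = md_combine(op, acc, vals[i]);
    return acc;
}

/*
 * Spindle-level policy, kept out of the network layer: take the lowest set
 * bit of the reduced mask as the agreed candidate index, or -1 if none.
 */
static int pick_consensus_index(uint64_t mask)
{
    for (int i = 0; i < 64; i++)
        if (mask & (UINT64_C(1) << i))
            return i;
    return -1;
}
```

The network layer only knows about the operation and the value; the meaning of each bit (one per candidate cache path) stays in the caller.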

ldcs_send_msg(connid, &msg);
procdata->server_stat.clientmsg.cnt++;
procdata->server_stat.clientmsg.time += ldcs_get_time() - client->query_arrival_time;

The repeated adds to clientmsg.cnt and clientmsg.time are triple counting here.

"reloc-python", &relocpython,
"python-prefix", &pyprefix,
"location", &location,
"numa", &numa,

Similar to my previous comment: what's the difference between location and cachepaths now?

shortExecExcludes = 298,
- shortPatchLdso
+ shortPatchLdso,
+ shortCachePaths,

Just to keep the pattern going, could you add numbers to the enum values (and fix shortPatchLdso while you're at it)? I know it's not necessary, but it makes it easier to look up values from debug_printfs if they're explicit here.
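
Something like the following, shown as a sketch. The values 299 and 300 are what the compiler would already assign implicitly after shortExecExcludes = 298; the enum tag is hypothetical and the earlier entries are left out.

```c
#include <assert.h>

/* Only the tail of the enum is shown; the tag name is hypothetical. */
enum short_option_sketch {
    shortExecExcludes = 298,
    shortPatchLdso    = 299,
    shortCachePaths   = 300
};

/* Compile-time guard against accidental renumbering (C11 static_assert). */
static_assert(shortCachePaths == shortExecExcludes + 2,
              "short option values must stay contiguous");
```

With the numbers written out, a value printed by debug_printf can be matched to its option by searching for the number.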

/* not the root, so forward our reduction result to our parent */
if (cobo_write_fd(cobo_parent_fd, pval, sizeof(*pval)) < 0) {
err_printf("Sending reduced data to parent failed\n");
exit(1);

Don't exit on network failure
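
A sketch of returning the error instead, so the caller decides how to recover. `write_fn_t` is a hypothetical stand-in for the shape of cobo_write_fd(), and the two stub writers exist only to exercise both paths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical writer signature standing in for cobo_write_fd(). */
typedef int (*write_fn_t)(int fd, const void *buf, size_t len);

/*
 * Forward a reduced value to the parent. On failure, log and return -1 so
 * the caller can handle it, instead of terminating the process.
 */
static int forward_to_parent(write_fn_t write_fd, int parent_fd,
                             const uint64_t *pval)
{
    assert(write_fd != NULL && pval != NULL);
    if (write_fd(parent_fd, pval, sizeof(*pval)) < 0) {
        fprintf(stderr, "Sending reduced data to parent failed\n");
        return -1;
    }
    return 0;
}

/* Stub writer that always succeeds (illustration only). */
static int stub_write_ok(int fd, const void *buf, size_t len)
{
    (void)fd; (void)buf; (void)len;
    return 0;
}

/* Stub writer that always fails (illustration only). */
static int stub_write_fail(int fd, const void *buf, size_t len)
{
    (void)fd; (void)buf; (void)len;
    return -1;
}
```

The error then travels up the same path as other failures in the reduction, and the caller can log it and shut down cleanly if that is the right response.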

Successfully merging this pull request may close these issues.

Allow specification for cache, daemon, and fifo paths

3 participants