Cachepath work broken out into functional commits #108
base: devel
debug_printf2("Propogating spindle environment by copying it to new envp list\n"); | ||
for (cur = (char **) envp; *cur; cur++, orig_size++); | ||
new_size = orig_size + 10; | ||
new_size = orig_size + 20; |
Should be 'new_size = orig_size + 9'. Eight environment variables get propagated, plus one slot for the NULL.
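A minimal sketch of the sizing being asked for, not the actual spindle code. The names `NUM_PROPAGATED_VARS`, `count_env`, and `grow_env` are hypothetical; the count of eight comes from the comment above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical constant: the review states eight variables are propagated. */
#define NUM_PROPAGATED_VARS 8

/* Count the entries in a NULL-terminated envp array (the NULL is not counted). */
static size_t count_env(char *const envp[])
{
    size_t n = 0;
    while (envp[n] != NULL)
        n++;
    return n;
}

/* Allocate a new list with room for the original entries, the propagated
 * variables, and the terminating NULL. Returns NULL on allocation failure. */
static char **grow_env(char *const envp[], size_t *new_size_out)
{
    size_t orig_size = count_env(envp);
    size_t new_size = orig_size + NUM_PROPAGATED_VARS + 1;
    char **new_envp = calloc(new_size, sizeof(*new_envp));
    if (new_envp == NULL)
        return NULL;
    /* Copy the original pointers; the remaining slots stay NULL from calloc. */
    memcpy(new_envp, envp, orig_size * sizeof(*new_envp));
    *new_size_out = new_size;
    return new_envp;
}
```

Deriving the size from a named count keeps the "+ 9" from drifting if the set of propagated variables changes.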
cachepath = chosen_realized_cachepath; | ||
chosen_parsed_cachepath = chosen_parsed_cachepath; | ||
chosen_symbolic_cachepath = chosen_symbolic_cachepath; | ||
} |
I'm a bit confused by set_intercept_readlink_cachepath() and set_should_intercept_cachepath(). It seems like we're keeping multiple copies of the paths in different static variables, then calling each of these functions to set the different copies.
Multiple variables holding copies of the same information are a source of bugs. Let's consolidate the variables holding paths in the clients to just one instance.
Or am I missing something where these are different?
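One possible shape for that consolidation, as a sketch only. The type `cachepaths_t` and the functions `set_cachepaths` and `get_cachepaths` are hypothetical names; the three fields mirror the realized, parsed, and symbolic paths in the hunk above.

```c
#include <stddef.h>

/* Hypothetical container: the single place where the client keeps its paths. */
typedef struct {
    const char *realized;
    const char *parsed;
    const char *symbolic;
} cachepaths_t;

static cachepaths_t client_cachepaths;

/* Set all three paths at once so they cannot get out of sync. */
void set_cachepaths(const char *realized, const char *parsed, const char *symbolic)
{
    client_cachepaths.realized = realized;
    client_cachepaths.parsed = parsed;
    client_cachepaths.symbolic = symbolic;
}

/* Readers (for example the readlink and open interception code) share this view. */
const cachepaths_t *get_cachepaths(void)
{
    return &client_cachepaths;
}
```

The interception code would then read through `get_cachepaths()` instead of holding its own static copies.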
int exit_readys_recvd; | ||
ldcs_dist_model_t dist_model; | ||
ldcs_client_t* client_table; | ||
char *location; |
We've still got 'location'. What does 'location' represent now? Is it a duplicate variable for cachepath?
Also, please comment what these variables contain.
char *candidate_cachepaths; /* Colon-separated list of candidate paths (max 64) */ | ||
char *chosen_cachepath; /* The consensus path (same across all nodes). */ | ||
uint64_t cachepath_bitidx; /* Bit index used by allReduce() to arrive at consensus. */ | ||
|
spindle_launch is the user interface function. We don't need to expose chosen_cachepath or cachepath_bitidx to the user. They're not selecting or reading those values.
Also, what's the difference between 'location' and 'candidate_cachepaths' here? What does it mean if a user sets both or only one?
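For reference, a sketch of how a colon-separated candidate list with a 64-entry cap could be split. The helper `split_candidates` is hypothetical, and skipping empty segments is an assumption; the PR may handle them differently.

```c
#include <string.h>

/* Limit taken from the "(max 64)" comment in the hunk above. */
#define MAX_CANDIDATES 64

/* Split `list` in place on ':' and store pointers to each non-empty segment
 * in `out`. Returns the number of segments, or -1 if there are more than `max`. */
static int split_candidates(char *list, char *out[], int max)
{
    int n = 0;
    char *p = list;
    while (p != NULL && *p != '\0') {
        char *colon = strchr(p, ':');
        if (colon != NULL)
            *colon = '\0';          /* terminate the current segment */
        if (*p != '\0') {           /* skip empty segments such as "a::b" */
            if (n == max)
                return -1;
            out[n++] = p;
        }
        p = (colon != NULL) ? colon + 1 : NULL;
    }
    return n;
}
```

If 'location' is meant to stay, one option is to treat it as a one-element candidate list so that only a single code path exists.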
} | ||
|
||
COMM_LOCK; | ||
client_recv_msg_static(fd, &message, LDCS_READ_BLOCK); |
Kind of messy doing one send and three recvs. In other parts of spindle the recv would be one message with three strings rather than three messages.
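A sketch of the single-message approach, independent of spindle's actual message structs. `pack3` and `unpack3` are hypothetical helpers that lay three NUL-terminated strings back to back in one payload.

```c
#include <stdlib.h>
#include <string.h>

/* Pack three NUL-terminated strings back to back into one heap buffer.
 * Returns the buffer (caller frees) and its length, or NULL on failure. */
static char *pack3(const char *a, const char *b, const char *c, size_t *len_out)
{
    size_t la = strlen(a) + 1, lb = strlen(b) + 1, lc = strlen(c) + 1;
    char *buf = malloc(la + lb + lc);
    if (buf == NULL)
        return NULL;
    memcpy(buf, a, la);
    memcpy(buf + la, b, lb);
    memcpy(buf + la + lb, c, lc);
    *len_out = la + lb + lc;
    return buf;
}

/* Recover pointers to the three strings inside `buf`. Returns 0 on success,
 * or -1 if the buffer does not contain three terminated strings. */
static int unpack3(const char *buf, size_t len,
                   const char **a, const char **b, const char **c)
{
    const char *out[3];
    size_t pos = 0;
    for (int i = 0; i < 3; i++) {
        const char *end = memchr(buf + pos, '\0', len - pos);
        if (end == NULL)
            return -1;
        out[i] = buf + pos;
        pos = (size_t)(end - buf) + 1;
    }
    *a = out[0];
    *b = out[1];
    *c = out[2];
    return 0;
}
```

With this layout the client does one send and one recv, and the bounds check in `unpack3` catches a truncated payload.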
msgbundle_force_flush(procdata); | ||
} | ||
|
||
ldcs_audit_server_md_consensus(procdata, msg); |
Suggest renaming ldcs_audit_server_md_consensus(...) to something like ldcs_audit_server_md_allreduce(MD_AND).
The ldcs_audit_server_md_* functions are all network-operation focused, not spindle-algorithm focused. That's the layer where you'd add a different network implementation (like infiniband), and we want to keep higher-level spindle concepts out of that layer.
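To illustrate the split, here is a sketch in which the network layer only knows about a generic reduction and the caller interprets the result. Everything here is hypothetical: `md_reduce_op_t`, `md_combine`, `md_reduce_local` (a local stand-in for a real allreduce), and `lowest_set_bit`. It also assumes each node contributes a bitmask of the candidate paths it can use, which is a guess at how cachepath_bitidx is meant to work.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduction operators exposed by the network layer. */
typedef enum { MD_AND, MD_OR } md_reduce_op_t;

/* Combine two contributions; this is all the network layer needs to know. */
static uint64_t md_combine(md_reduce_op_t op, uint64_t a, uint64_t b)
{
    switch (op) {
    case MD_AND: return a & b;
    case MD_OR:  return a | b;
    }
    return a;
}

/* Local stand-in for an allreduce: fold every node's value with `op`. */
static uint64_t md_reduce_local(md_reduce_op_t op, const uint64_t *vals, size_t n)
{
    uint64_t acc = (op == MD_AND) ? UINT64_MAX : 0;
    for (size_t i = 0; i < n; i++)
        acc = md_combine(op, acc, vals[i]);
    return acc;
}

/* Spindle-level policy, kept out of the network layer: pick the lowest
 * candidate index that every node agreed on, or -1 if there is none. */
static int lowest_set_bit(uint64_t mask)
{
    for (int i = 0; i < 64; i++)
        if (mask & (UINT64_C(1) << i))
            return i;
    return -1;
}
```

The consensus rule lives in the caller, so a different transport only has to implement the generic reduction.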
ldcs_send_msg(connid, &msg); | ||
procdata->server_stat.clientmsg.cnt++; | ||
procdata->server_stat.clientmsg.time += ldcs_get_time() - client->query_arrival_time; | ||
|
The repeated adds to clientmsg.cnt and clientmsg.time are triple counting here.
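A sketch of counting once per client query regardless of how many messages carry the reply. `msg_stat_t`, `record_client_reply`, and `reply_to_client` are hypothetical stand-ins, not the actual server_stat types.

```c
/* Hypothetical stand-in for the clientmsg statistics fields. */
typedef struct {
    unsigned long cnt;
    double time;
} msg_stat_t;

/* Record one answered query: one count, one elapsed-time sample. */
static void record_client_reply(msg_stat_t *stat, double arrival_time, double now)
{
    stat->cnt++;
    stat->time += now - arrival_time;
}

/* Send a reply made of `nparts` messages, then update the stats exactly once.
 * Returns the number of parts sent. */
static int reply_to_client(msg_stat_t *stat, int nparts, double arrival_time, double now)
{
    int sent = 0;
    for (int i = 0; i < nparts; i++) {
        /* A real implementation would send part i here. */
        sent++;
    }
    record_client_reply(stat, arrival_time, now);
    return sent;
}
```

Folding the three replies into one message, as suggested elsewhere in this review, would remove the problem at its source.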
"reloc-python", &relocpython, | ||
"python-prefix", &pyprefix, | ||
"location", &location, | ||
"numa", &numa, |
Similar comment to the previous one: what's the difference between location and cachepaths now?
shortExecExcludes = 298, | ||
shortPatchLdso | ||
shortPatchLdso, | ||
shortCachePaths, |
Just to keep the pattern going, could you add numbers to the enum values (and fix shortPatchLdso while at it)? I know it's not necessary, but it makes it easier to look up values from debug_printfs if they're explicit here.
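Something like the following, assuming the three enumerators are consecutive as they appear in the hunk (only that part of the enum is shown, and the anonymous enum is for illustration only):

```c
/* Explicit values match what the compiler would assign implicitly after 298. */
enum {
    shortExecExcludes = 298,
    shortPatchLdso    = 299,
    shortCachePaths   = 300
};
```

The values are unchanged from the implicit numbering; they are just visible when matching against debug output.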
/* not the root, so forward our reduction result to our parent */ | ||
if (cobo_write_fd(cobo_parent_fd, pval, sizeof(*pval)) < 0) { | ||
err_printf("Sending reduced data to parent failed\n"); | ||
exit(1); |
Don't exit on network failure
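A sketch of reporting the failure to the caller instead of exiting. `write_all` and `forward_to_parent` are hypothetical helpers built on POSIX `write`; they are not the cobo API, and `fprintf` stands in for `err_printf`.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write all `len` bytes, retrying on EINTR and short
 * writes. Returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Forward the reduction result to the parent. On failure, log and return -1
 * so the caller can decide how to recover or shut down cleanly. */
static int forward_to_parent(int parent_fd, const uint64_t *pval)
{
    if (write_all(parent_fd, pval, sizeof(*pval)) < 0) {
        fprintf(stderr, "Sending reduced data to parent failed\n");
        return -1;
    }
    return 0;
}
```

Returning the error lets the layers above decide whether to retry, fall back, or tear down the session.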
Fixes #61