Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

infohint: add nogpu as a gpudriver_id hint #169

Merged
merged 4 commits into from
Jan 18, 2021

Conversation

hzhou
Copy link
Contributor

@hzhou hzhou commented Jan 14, 2021

Pull Request Description

The purpose of this PR is add a way to tell yaksa the both in_buf and out_buf is cpu host memory. This is useful when yaksa is configured with GPU support but the application doesn't use gpu memory or the application knows when their buffer is entirely from CPU and thus doesn't need to pay the pointer attribute query overhead.

Expected Impact

Fixes #168

Author Checklist

Add a way for application to tell yaksa that both in_buf and out_buf are
host memory.

Signed-off-by: Hui Zhou <[email protected]>
@hzhou hzhou force-pushed the 2101_no_gpu branch 2 times, most recently from e4fbfd2 to b8d3a98 Compare January 14, 2021 21:52
@hzhou
Copy link
Contributor Author

hzhou commented Jan 14, 2021

test:yaksa/gpu

Apply nogpu hint every other test when no device memory is involved.

Signed-off-by: Hui Zhou <[email protected]>
The yaksu_atomic_decr returns the value before decrement.

Signed-off-by: Hui Zhou <[email protected]>
@hzhou
Copy link
Contributor Author

hzhou commented Jan 15, 2021

test:yaksa/gpu

RESULT: 3 hr 28min. 1 failure:

test/pack/pack -datatype int -count 17 -seed 73 -iters 32768 -segments 1 -ordering normal -overlap none -num-threads 4
 Stack Trace

CUDA Error (yaksuri_cudai_event_query:src/backend/cuda/pup/yaksuri_cudai_event.c,65): an illegal memory access was encountered
lt-pack: test/pack/pack.c:273: runtest: Assertion `rc == (0)' failed.
CUDA Error (yaksuri_cudai_event_query:src/backend/cuda/pup/yaksuri_cudai_event.c,65): an illegal memory access was encountered
      

@hzhou
Copy link
Contributor Author

hzhou commented Jan 15, 2021

@pavanbalaji The PR is ready for your review. There is 1 failure in the newly created gpu testing, see #169 (comment), I am not sure it is related to this PR. Maybe you can help to decide. Meanwhile, I'll re-run the test to confirm.

@hzhou
Copy link
Contributor Author

hzhou commented Jan 15, 2021

test:yaksa/gpu ✔️

@hzhou
Copy link
Contributor Author

hzhou commented Jan 15, 2021

@pavanbalaji Tests are clean. Ready for your review.

host_only_get_ptr_attr(&unpack_info);
} else {
pack_get_ptr_attr(tbuf_d, dbuf_d + dobj.DTP_buf_offset, &unpack_info);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch is forcing all host-only testing to use this new info hint. This will cause us to lose all testing for the cases where this info hint is not provided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only adds the info hint to half of the cases, similar to the gpu tests. Refer to line 55 above.

test/pack/pack.c Outdated
static void host_only_get_ptr_attr(yaksa_info_t * info)
{
static int count = 0;
if ((++count) % 2 == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this thread-safe?

Copy link
Contributor Author

@hzhou hzhou Jan 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. But the consequence is just less than perfect half by half testing. The same mechanism is used in test/pack/pack-cuda.c. We can make it atomic (do we have atomic library for the testing?) but maybe it isn't worth the trouble. For example, maybe some randomness is even desirable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, randomness is not desirable in this case. It's better to make both cases deterministic. If we don't have an atomic library, we could use a lock or decide on the info hint based on the iteration number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

The static counter is not thread-safe. Use iteration number to get more
consistent split-testing.

Signed-off-by: Hui Zhou <[email protected]>
@hzhou
Copy link
Contributor Author

hzhou commented Jan 17, 2021

test:yaksa/gpu

2 failures:

test/pack/pack  -datatype int -count 17 -seed 85 -iters 32768 -segments 64 -ordering  normal -overlap none -num-threads 4        
Stack Trace           
CUDA Error (yaksuri_cudai_event_query:src/backend/cuda/pup/yaksuri_cudai_event.c,65): misaligned address 
lt-pack: test/pack/pack.c:272: runtest: Assertion `rc == (0)' failed. 
CUDA Error (yaksuri_cudai_event_query:src/backend/cuda/pup/yaksuri_cudai_event.c,65): misaligned address 

test/pack/pack  -datatype int -count 17 -seed 97 -iters 32768 -segments 64 -ordering  reverse -overlap none -num-threads 4        
Stack Trace           
CUDA Error (yaksuri_cudai_event_query:src/backend/cuda/pup/yaksuri_cudai_event.c,65): an illegal memory access was encountered 
CUDA Error (yaksuri_cudai_event_query:src/backend/cuda/pup/yaksuri_cudai_event.c,65): an illegal memory access was encountered 
lt-pack: test/pack/pack.c:272: runtest: Assertion `rc == (0)' failed. 
lt-pack: test/pack/pack.c:272: runtest: Assertion `rc == (0)' failed.

@hzhou
Copy link
Contributor Author

hzhou commented Jan 17, 2021

test:yaksa/gpu

@hzhou
Copy link
Contributor Author

hzhou commented Jan 18, 2021

@pavanbalaji ready for your review again

@pavanbalaji
Copy link
Contributor

How did the failures disappear? Did you simply rerun the tests till they disappeared? That's not a good model. It only means that the errors are intermittent and are being ignored.

@hzhou
Copy link
Contributor Author

hzhou commented Jan 18, 2021

How did the failures disappear? Did you simply rerun the tests till they disappeared? That's not a good model. It only means that the errors are intermittent and are being ignored.

I did record them down. They seem to be intermittent gpu errors. I won't have time to debug them. I'll have to check them later.

@pavanbalaji
Copy link
Contributor

If they are happening with main too, then we should at least have an issue created for it. If it's not happening with main, then it's a problem with this PR and it cannot be merged till that's fixed.

@hzhou
Copy link
Contributor Author

hzhou commented Jan 18, 2021

If they are happening with main too, then we should at least have an issue created for it. If it's not happening with main, then it's a problem with this PR and it cannot be merged till that's fixed.

Let me open a dummy PR to check that

@pavanbalaji
Copy link
Contributor

I meant in the nightly tests. You don't need a dummy PR.

@hzhou
Copy link
Contributor Author

hzhou commented Jan 18, 2021

I meant in the nightly tests. You don't need a dummy PR.

Sure. Let me create one.

@hzhou
Copy link
Contributor Author

hzhou commented Jan 18, 2021

@pavanbalaji The cuda memory issue was reproduced in nightly test (https://jenkins-pmrs.cels.anl.gov/view/yaksa/job/yaksa-nightly-gpu/lastCompletedBuild/testReport/) and tracked in issue #170

@hzhou
Copy link
Contributor Author

hzhou commented Jan 18, 2021

@pavanbalaji Is there any remaining issues with this PR?

@pavanbalaji pavanbalaji merged commit 7a4ed0f into pmodels:main Jan 18, 2021
@hzhou hzhou deleted the 2101_no_gpu branch January 18, 2021 21:25
@hzhou hzhou mentioned this pull request Mar 25, 2022
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add an info hint key to disable gpu buffer query
2 participants