
CPU not fully utilized #36

Open
joshinils opened this issue Aug 15, 2022 · 4 comments

Comments

@joshinils
Contributor

[Screenshot: Task Manager showing CPU utilization well below 100%]
I own an AMD Ryzen 3800XT, an 8-core CPU with 16 threads.
When blurring a video, not all of the CPU is being used.

My guess is this has something to do with reading frames into memory, then running detection, then blurring, then storing the blurred frame, then repeating the loop.
My profiling skills in Python are non-existent, so I can only guess that this is why CPU utilization is not at 100%.
Interleaving the IO-bound tasks with the CPU-bound ones so they run concurrently would be ideal: reading and writing would happen in the background while detection takes place. Something like the sketch below is what I have in mind.
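
Just a minimal sketch of the idea with threads and bounded queues; `detect_and_blur` is a placeholder for the real detection + blurring step, not code from this repo:

```python
import queue
import threading

import cv2


def detect_and_blur(frame):
    # Placeholder for the real detection + blurring step.
    return cv2.GaussianBlur(frame, (5, 5), 0)


def run(in_path, out_path):
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)

    # Bounded queues so the reader can't run arbitrarily far ahead of detection.
    in_q = queue.Queue(maxsize=64)
    out_q = queue.Queue(maxsize=64)

    def reader():  # IO-bound: decode frames in the background
        while True:
            ok, frame = cap.read()
            if not ok:
                in_q.put(None)  # sentinel: end of stream
                return
            in_q.put(frame)

    def writer():  # IO-bound: write blurred frames in the background
        while True:
            frame = out_q.get()
            if frame is None:
                return
            out.write(frame)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()

    while True:  # CPU-bound detection stays on the main thread
        frame = in_q.get()
        if frame is None:
            out_q.put(None)
            break
        out_q.put(detect_and_blur(frame))

    for t in threads:
        t.join()
    cap.release()
    out.release()
```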

@tfaehse
Owner

tfaehse commented Aug 15, 2022

Lots of bottlenecks, unfortunately. Detection can run at almost any batch size; especially on GPUs, the higher the better. But there are two issues with that: you need quite a bit of memory, and frames need to arrive quickly enough. So far, neither is really a given.

I've tried offloading reading and writing frames to separate processes, but surprisingly that didn't help much: the additional overhead, at least on my system, ate the gains from parallelisation. So far, the best approach I could find was to massively speed up reading/writing frames. #32 contains a very-much-WIP draft, but the performance improvements are already quite large.
Basically, it makes sure that extracting frames, reading frames, and detection (given a high enough batch size) all attempt to saturate the available resources. As long as each of the steps uses near 100% of your CPU, it doesn't matter whether they're properly pipelined or not. So far, that has led to really nice results, on my laptop at least: up to 20 fps for inference at 360p, with small weights. Maybe not the most realistic workload, but it's my default test case...

I still have to fix a lot of things there, though, and check how this works when using a GPU.

TL;DR: Increasing the batch size and improving frame extraction should help.
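
For illustration, the batching part boils down to something like this minimal sketch; `model` and `frames` are stand-ins (a detector that accepts a stacked array and returns per-frame results), not identifiers from the actual code:

```python
import numpy as np


def batched_detect(frames, model, batch_size=16):
    # Accumulate frames and run one forward pass per batch; this
    # amortizes per-call overhead instead of paying it per frame.
    batch = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == batch_size:
            yield from model(np.stack(batch))
            batch = []
    if batch:  # flush the final, possibly partial, batch
        yield from model(np.stack(batch))
```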

@joshinils
Contributor Author

Hm, I found that when using the CPU, increasing the batch size does not really help; rather the opposite.

With batch_size==5:
[Screenshot: CPU utilization graph with periodic dips]
The dips are where the progress bar updates, i.e. when the batch of frames is saved and the next ones are loaded...
So even then, detection does not use more CPU. :-/

@joshinils
Contributor Author

It almost makes me think that having two batches run detection in parallel would be beneficial, however weird that may sound, since one batch of detections does not utilize the CPU well.
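
Roughly something like this sketch; `detect_batch` is a stand-in, and it assumes the detector releases the GIL during inference (as PyTorch ops do), otherwise processes would be needed:

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_detect(batches, detect_batch, workers=2):
    # Keep two batches in flight: while one finishes up, the other
    # can use the cores that would otherwise sit idle.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(detect_batch, batches))
```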

@tfaehse
Owner

tfaehse commented Aug 15, 2022

Hmm... it's not perfect for me, but much better. Batch size 16:
[Screenshot: CPU utilization graph at batch size 16]

I'll do some profiling soon.
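
For reference, even a bare-bones cProfile run would show where the wall time goes; `main()` here stands in for whatever entry point actually runs the blurring loop:

```python
import cProfile
import pstats

cProfile.run("main()", "blur.prof")  # hypothetical entry point
pstats.Stats("blur.prof").sort_stats("cumulative").print_stats(20)
```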
