Cannot handle huge command.txt files #125

Open
gauvinalexandre opened this issue Apr 17, 2016 · 7 comments

Comments


gauvinalexandre commented Apr 17, 2016

Attached, an example file from a worker that had a hard time...

Further details to come tomorrow...
@mgermain

770146_mp2_m_worker_105_e.txt

@mgermain
Member

Congrats! You broke Linux :D
After some digging, I found that the problem comes from Linux freaking out because the process has been waiting on the lock for too long. I'll see if there is anything I can do.

@MarcCote
Member

I think I have a better understanding of the issue now. When the commands.txt file holding all the pending commands is huge, e.g. 2.5 GB worth of text, getting the next command to run is highly inefficient.

We currently need to read and rewrite the whole ~2.5 GB file every time we fetch the next command to run. This is because we treat the next command to run as the first line of commands.txt. So, we read the first line, read the rest of the file, and then write back all lines except the first one, ouch!

We should take the last line of commands.txt as the next command to run and simply truncate the file.
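
A minimal sketch of the pattern described above (hypothetical code, not smart-dispatch's actual implementation; `pop_first_command` is a made-up name): popping the first line forces a full read and rewrite of the whole file on every call.

```python
# Hypothetical sketch of the current behaviour: taking the FIRST line of
# commands.txt means reading and rewriting the entire file each time.
def pop_first_command(path):
    with open(path, 'r+') as f:
        lines = f.readlines()    # reads the whole ~2.5 GB file into memory
        if not lines:
            return None
        command = lines[0].rstrip('\n')
        f.seek(0)
        f.writelines(lines[1:])  # rewrites everything except the first line
        f.truncate()
    return command
```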

@MarcCote MarcCote changed the title command.txt locking failure flags may lead to overhead issues Cannot handle huge command.txt files Nov 18, 2016
@mgermain
Member

@MarcCote What you say is true, but these are two distinct issues.
@gauvinalexandre How big was your command.txt?

@MarcCote I have a vague memory that we had a proper reason to do it this way in the past. There is also the fact that it was NEVER intended to be used in this way. Finally, I think that having an unfinished_commands.txt and/or a database might mitigate all of this.

@MarcCote
Member

Maybe those are two distinct issues. @gauvinalexandre, please let us know :).

@mgermain How is having a lot of commands to "dispatch" not the intended purpose of smart-dispatch?
You are right about the database solving the issue I mentioned.

@mgermain
Member

@MarcCote Well, in the case of the 70 brains, SD is in a way used as a substitute for MPI to implement parallelism inside their own program. Maybe another way to put what I said in my previous post is that SD was never designed to launch an actual 50 million jobs at once :P If we can find a way to do it nicely, I'm not against it though.

@gauvinalexandre
Author

Hey guys! Yes, I agree these are two different issues, but they are related in some way.

My issue was that my tasks had a very short process time, while the command.txt (big, but not gigabytes) was relatively slow to rewrite. So the workers were continuously asking for more tasks. Linux then thought they were stuck in a multiprocess lock contention, because the OS's default lock-wait timeout was exceeded, and killed the workers to prevent it. In other words, the workers were dying of boredom (thanks to @mgermain for figuring this out).

In Max's case, the file is very slow to rewrite. So I guess the same thing will happen in the end: workers killed out of boredom while waiting for millions of tasks to be rewritten millions of times.

So on one side there's overhead from reading the command.txt file too frequently, and on the other there's overhead from rewriting too much. I think they are related because the ratio of per-task process time to the number of tasks should stay reasonable; otherwise we'll have problems.

The final message here is: we cannot just throw anything at it yet, until geniuses like @mgermain and @MarcCote find a solution. There's some smart-dispatch tweaking to do. Thanks again guys!

@MarcCote
Member

From what I understand, speeding up the "pick a new command to execute" step would solve both issues. I suggest "taking the last line of commands.txt (a.k.a. pending) as the next command to run and simply truncate the file".
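
A minimal sketch of that suggestion (hypothetical code, assuming any locking around commands.txt is handled by the caller; `pop_last_command` is a made-up name): scanning backwards from the end of the file and truncating makes each pop proportional to the length of one command instead of the size of the whole file.

```python
import os

# Hypothetical sketch of the proposed fix: take the LAST line of
# commands.txt as the next command, then truncate the file.
def pop_last_command(path):
    with open(path, 'rb+') as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        if end == 0:
            return None              # no pending commands
        pos = end - 1
        f.seek(pos)
        if f.read(1) == b'\n':       # ignore a trailing newline
            end = pos
        # Scan backwards for the newline preceding the last command.
        while pos > 0:
            f.seek(pos - 1)
            if f.read(1) == b'\n':
                break
            pos -= 1
        f.seek(pos)
        command = f.read(end - pos).decode()
        f.truncate(pos)              # drop the popped command from the file
    return command
```

A real implementation would scan backwards in blocks rather than one byte at a time, but the key point is the same: the cost no longer depends on how many commands are still pending.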
