smp: add a function that barriers memory prefault work #2608

Open

wants to merge 1 commit into base: master

Conversation

tomershafir
Contributor

@tomershafir tomershafir commented Jan 6, 2025

Currently, memory prefault logic is internal and seastar doesn't provide much control to users. To improve the situation, I suggest providing a barrier for the prefault threads. This allows users to:

* Prefer predictable low latency and high throughput from the start of request serving, at the cost of a startup delay, depending on machine characteristics and application-specific requirements. For example, a fixed-capacity on-prem DB setup, where slower startup can be tolerated. From the users' perspective, they generally cannot tolerate inconsistency (like spikes in latency).
* Similarly, improve user scheduling decisions, like running less critical tasks while the prefault works.
* Reliably test the prefault logic, improving reliability and users' trust in seastar.
* Release memory_prefaulter::_worker_threads early and remove this overhead, rather than only at exit.

I tested locally. If you approve this change, next I will submit a prefault test.
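
For illustration, a minimal sketch of how an application could consume such a barrier. The name `seastar::smp::wait_for_prefault()` is hypothetical (the patch's actual API is not quoted in this thread); `app_template` and `make_ready_future` are existing Seastar APIs.

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        // Hypothetical barrier: the returned future resolves once the
        // prefault worker threads have finished touching all memory.
        return seastar::smp::wait_for_prefault().then([] {
            // Only now open listening ports and start serving requests,
            // trading startup time for predictable latency afterwards.
            return seastar::make_ready_future<>();
        });
    });
}
```
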
@tomershafir tomershafir marked this pull request as ready for review January 6, 2025 14:33
@avikivity
Member

Did you observe latency impact from the prefault threads? It was written carefully not to have latency impact, but it's of course possible that some workloads suffer.

@tomershafir
Contributor Author

As you described in #1702, page faults can cause deviation, and following that example, there can be 25 seconds where latency is variably higher.

@avikivity
Member

> As you described in #1702, page faults can cause deviation, and following that example, there can be 25 seconds where latency is variably higher.

I said nothing about latency being higher there.

We typically run large machines with a few vcpus not assigned to any shards, and the prefault threads run with low priority.

@tomershafir
Contributor Author

tomershafir commented Jan 7, 2025

There are 2 aspects:

  1. Page faults

In the previous comment, I meant page fault latency. Page faults can cause unpredictably high latency until the prefaulter finishes.

Regarding page fault measurement, it seems I cannot measure it reliably in my environment.

  2. Prefault thread competition

I tried to non-scientifically isolate the wall-time overhead of the prefault threads:

I have a test app that performs file I/O and processes memory buffers repeatedly. I used an Ubuntu OrbStack VM with 1 NUMA node, 10 cores, and --memory=14G (effectively a small NUMA node), and a small input to make the overhead most visible.

  • With --lock-memory=1 and without waiting, I see that the chrono time of the actual work is significantly higher than with --lock-memory=0 (~1800ms vs. ~600ms).
  • When waiting before doing the actual work, the overhead is removed.
  • When building seastar without the prefault code and with --lock-memory=1, I don't see the overhead.
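
A rough sketch of the kind of wall-clock ("chrono") measurement described above; do_work() is a stand-in for the test app's repeated file I/O and buffer processing, which is not shown in the thread.

```cpp
#include <chrono>
#include <iostream>

// Stand-in for the test app's repeated file I/O and buffer processing.
void do_work();

// Rough wall-clock measurement of the "actual work" phase: run it once right
// after startup (overlapping the prefault threads) and once after waiting for
// the prefault to finish, then compare the two durations.
void measure() {
    auto start = std::chrono::steady_clock::now();
    do_work();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    std::cout << "actual work took " << elapsed.count() << " ms\n";
}
```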

@tomershafir
Contributor Author

By default seastar uses all vcpus, which makes sense for resource efficiency.

Also, do you free specific vcpus? Like one per NUMA node, the granularity of the prefault threads.

@avikivity
Member

> By default seastar uses all vcpus, which makes sense for resource efficiency.
>
> Also, do you free specific vcpus? Like one per NUMA node, the granularity of the prefault threads.

1 in 8, with NUMA awareness. They're allocated for kernel network processing. See perftune.py.

@tomershafir
Contributor Author

Nice. Let me know if this change makes sense to you.

@tomershafir
Contributor Author

@avikivity ping

@tomershafir
Contributor Author

I also tried to simulate perftune with 1 free vcpu (--cpuset=0-8 given the above setup), and I still observe the overhead, even though it is smaller (~1600ms).

@avikivity
Member

I don't understand what this 1600ms overhead is.

@tomershafir
Contributor Author

tomershafir commented Jan 21, 2025

> I tried to non-scientifically isolate the wall-time overhead of the prefault threads:
>
> I have a test app that performs file I/O and processes memory buffers repeatedly. I used an Ubuntu OrbStack VM with 1 NUMA node, 10 cores, and --memory=14G (effectively a small NUMA node), and a small input to make the overhead most visible.
>
> • With --lock-memory=1 and without waiting, I see that the chrono time of the actual work is significantly higher than with --lock-memory=0 (~1800ms vs. ~600ms).
> • When waiting before doing the actual work, the overhead is removed.
> • When building seastar without the prefault code and with --lock-memory=1, I don't see the overhead.

I mean it's the wall time of the work that I observe, with 1 free vcpu (--cpuset=0-8 given the above setup).

@avikivity
Member

> I tried to non-scientifically isolate the wall-time overhead of the prefault threads:
>
> I have a test app that performs file I/O and processes memory buffers repeatedly. I used an Ubuntu OrbStack VM with 1 NUMA node, 10 cores, and --memory=14G (effectively a small NUMA node), and a small input to make the overhead most visible.
>
> • With --lock-memory=1 and without waiting, I see that the chrono time of the actual work is significantly higher than with --lock-memory=0 (~1800ms vs. ~600ms).
> • When waiting before doing the actual work, the overhead is removed.
> • When building seastar without the prefault code and with --lock-memory=1, I don't see the overhead.
>
> I mean it's the wall time of the work that I observe, with 1 free vcpu (--cpuset=0-8 given the above setup).

Okay. But what's the problem with that time?

Anyway, if we add future<> seastar::wait_for_background_initialization(), we can have application startup code elect to wait for it before opening ports. This way it can let its own initialization work overlap with memory initialization.
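
A sketch of the startup flow this suggests; wait_for_background_initialization() is only the name proposed above (not an existing Seastar API), and do_app_initialization() / start_listening() are hypothetical application stand-ins.

```cpp
#include <seastar/core/future.hh>

// Hypothetical application hooks, named here only for illustration.
seastar::future<> do_app_initialization();
seastar::future<> start_listening();

seastar::future<> start_service() {
    // Application initialization overlaps with the background memory prefault...
    return do_app_initialization().then([] {
        // ...and only before opening ports do we wait for it to complete.
        return seastar::wait_for_background_initialization(); // proposed API
    }).then([] {
        return start_listening();
    });
}
```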

@tomershafir
Contributor Author

The problem is that it is slower and not consistent/predictable. After memory initialization it is faster and consistent.

Regarding the implementation, the problem is that pthread_join blocks, so with the current implementation, wouldn't exposing a future be misleading?

@avikivity
Member

How is pthread_join relevant?

@tomershafir
Contributor Author

tomershafir commented Jan 21, 2025

Currently, the logical barrier calls pthread_join on all the threads that perform the prefault work, which blocks the reactor thread.
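
To make the blocking behaviour concrete, a minimal sketch; the type and member names are illustrative rather than the patch's actual code, and std::thread stands in for the pthread-based workers.

```cpp
#include <thread>
#include <vector>

// Illustrative only: joining the prefault worker threads directly stalls the
// caller until every worker finishes, which is why this must not run on a
// reactor thread.
struct prefaulter_sketch {
    std::vector<std::thread> _worker_threads;

    void barrier_blocking() {
        for (auto& t : _worker_threads) {
            if (t.joinable()) {
                t.join(); // blocks; on a reactor thread this stalls the event loop
            }
        }
    }
};
```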

@avikivity
Member

Ah, you're referring to the patch, while I was referring to the current state. Don't use join then; instead, figure out something else that can satisfy a seastar::promise. Maybe it's as simple as seastar::alien::submit_to(0, [&] { _reactor._prefault_complete.set_value(); }).
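
A minimal sketch of that idea, using alien::run_on in place of submit_to since no return value is needed; the promise wiring and names are assumptions about how the patch could be structured, not the actual implementation.

```cpp
#include <seastar/core/alien.hh>
#include <seastar/core/future.hh>

// A non-reactor (alien) prefault worker satisfies a promise owned by shard 0,
// so no reactor thread ever needs to call pthread_join.
seastar::promise<> prefault_complete;   // illustrative; must only be touched from shard 0

// Called from the last prefault worker thread once all prefault work is done.
void notify_prefault_done(seastar::alien::instance& alien) {
    seastar::alien::run_on(alien, 0, [] () noexcept {
        prefault_complete.set_value();  // executes on shard 0's reactor
    });
}

// Reactor-side barrier: the future resolves once the workers have signalled.
seastar::future<> wait_for_prefault_done() {
    return prefault_complete.get_future();
}
```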

@tomershafir
Contributor Author

Ah, I see. So if I understand correctly, you have just restated the clarified motivation for the patch (please correct me if I'm wrong). I'll work on a non-blocking method next week.

@avikivity
Member

I don't completely see that it's useful but can't deny that it might be.

I'd be happier with an example of a real application requiring it.

@tomershafir
Contributor Author

I have only a test application. How about ScyllaDB?

@avikivity
Member

> I have only a test application. How about ScyllaDB?

I'm not aware of reports of problems during the prefault stage. It takes some time for a node to join the cluster, and by that time enough memory has been prefaulted for it to work well.
