feature(smp): improve hart booting, TLS setup and per-core stack initialization #1678
Conversation
Thanks for the PR! Sorry for the late review!
I have left a few comments.
Generally, I think using `Ordering::Relaxed` in the cases where you changed it would be fine. Why do you think we need to change it? We don't synchronize other memory accesses with it.
If you need assistance with git or GitHub regarding interactive rebasing, feel free to reach out! :)
```rust
const MAX_CORES: usize = 32;

// Cache-line aligned CPU-local data
#[repr(align(64))]
struct PerCpuData {
    is_initialized: AtomicBool,
    local_counter: AtomicU64,
    #[allow(dead_code)]
    padding: [u8; 48], // Fill to full cache line
}

impl PerCpuData {
    const fn new() -> Self {
        Self {
            is_initialized: AtomicBool::new(false),
            local_counter: AtomicU64::new(0),
            padding: [0; 48],
        }
    }
}

#[allow(clippy::declare_interior_mutable_const)]
static CPU_DATA: [PerCpuData; MAX_CORES] = {
    const CPU_LOCAL: PerCpuData = PerCpuData::new();
    [CPU_LOCAL; MAX_CORES]
};
```
Why would we need this? Can we remove this?
It's working, but it's experimental.
The code tries to make RISC-V SMP systems faster by using CPU-local data. There is a dedicated data type called `PerCpuData`. With `#[repr(align(64))]`, it ensures that each instance of this data lines up with the CPU's cache lines. This stops different CPU cores from accidentally invalidating each other's cached data ("false sharing"). Because of this, each RISC-V core (hart) can work on its own data without slowing down other cores, which should help with parallel processing.
I understand using hart-local data and aligning it to cache lines. But we already have `arch::riscv64::kernel::CoreLocal` for that. Compared to `CoreLocal`, `PerCpuData` brings no benefit, as far as I can see. `MAX_CORES` is a limitation that `CoreLocal` does not have.
Also, the members (`is_initialized` and `local_counter`) seem useless, since they are never read, and the atomic orderings and fences make no sense to me.
Is there any reason to keep this that I overlooked? 🤔
src/arch/riscv64/kernel/scheduler.rs
Outdated
Can you split the `TaskTLS` changes into another PR that we can merge separately?
src/arch/riscv64/kernel/start.rs
Outdated
```rust
// Optimized HART_MASK calculation
let mut hart_mask = 0u64;
for cpu in fdt.cpus() {
    if let Some(cpu_id) = cpu.property("reg").and_then(|p| p.as_usize()) {
        if cpu
            .property("status")
            .and_then(|p| p.as_str())
            .is_some_and(|s| s != "disabled\u{0}")
        {
            hart_mask |= 1 << cpu_id;
        }
    }
}
```
How is this optimized? Isn't this the same code as before but in a more functional style? I honestly found the code more readable before.
You're right, "optimized" is misleading here, as this is just a more functional style without any performance improvement. I'll update the comment to avoid giving the wrong impression.
If you want the old code back, just let me know and I'll change it.
I think changing it back would be better. Your functional style avoids a few unwraps, but unwrapping here is not a bad thing, I think. If there is an issue with the FDT in your code, we would just silently skip initializing the hart mask.
src/arch/riscv64/kernel/start.rs
Outdated
```rust
// Optimized Hart-ID validation
if CPU_ONLINE.load(Ordering::Acquire) > 0 {
    // Faster check for Secondary-HARTs
    if (HART_MASK.load(Ordering::Relaxed) & (1 << hart_id)) == 0 {
        error!("Invalid hart ID: {hart_id}");
        processor::halt();
    }
}
```
What does optimized mean here? And what do we check for? And why would we want to halt the processor in that case?
Right again, this isn't really an optimization. The purpose here is to validate that the current hart is part of the expected hart mask (from the device tree). If not, we halt to prevent undefined behavior from unexpected or spurious CPUs starting. I'll update the comment to clarify that it is a sanity check, not an optimization.
Hmm, but nothing bad would happen if an unexpected hart starts, right? Are there real cases where harts start unexpectedly? What does "faster check for Secondary-HARTs" compare to? We don't check the main hart, right? This also just checks whether we are one of the harts in the hart mask, not the specific hart that was requested in `sbi_rt::hart_start`.
```rust
if CPU_ONLINE.load(Ordering::Acquire) == 0 {
    unsafe {
```
Let's put the refactoring of the reduction of the `unsafe` scope into a separate commit or PR.
You can check again :)
```rust
// TODO: Old: Changing cpu_online will cause uhyve to start the next processor
CPU_ONLINE.fetch_add(1, Ordering::Release);

// When running bare-metal/QEMU we use the firmware to start the next hart
```
I think these comments should be kept. The lower one is still true, and the upper one is still to-do, since Uhyve does not support RISC-V yet.
```rust
// Memory Fence before ID storage
fence(Ordering::Release);
CURRENT_BOOT_ID.store(hart_id as u32, Ordering::Release);
```
The memory fence is not needed, right? Why did you change the atomic ordering? Relaxed ordering should be sufficient, since we don't use `CURRENT_BOOT_ID` for synchronization.
```rust
// Memory Fence before HART_MASK update
fence(Ordering::Release);
HART_MASK.store(hart_mask, Ordering::Release);
```
Same question as https://github.com/hermit-os/kernel/pull/1678/files#r2120516736 here.
```rust
// Initialize TLS for boot core:
if let Some(tls_info) = env::boot_info().load_info.tls_info {
    // Load the value into 'tp' using the mv instruction:
    unsafe {
        asm!(
            "mv tp, {val}",
            val = in(reg) tls_info.start as usize,
            options(nostack, nomem)
        );
    }
}
```
Why is this needed? This is done in src/arch/riscv64/kernel/scheduler.rs#L274 and src/arch/riscv64/kernel/scheduler.rs#L404.
This pull request improves the RISC-V boot process, multi-core startup, and stack/TLS setup for better stability, performance, and future extensibility.
Changes:
In mod.rs, the function `boot_next_processor` now includes detailed logging to show the current hart mask and which hart will be started next. It also checks the return value of `sbi_rt::hart_start` to log any startup errors. The memory ordering for atomic variables like `CURRENT_STACK_ADDRESS` and `CPU_ONLINE` was adjusted for safer multi-core synchronization.
In start.rs, a check was added to skip invalid secondary harts early. A new `PerCpuData` struct was introduced, aligned to 64 bytes to match cache lines. This structure allows each core to track its state individually. The boot hart now correctly sets up its TLS pointer (`tp`) using inline assembly if TLS info is available. The hart mask is calculated by parsing the FDT, checking each CPU's status to ensure it is enabled before adding it to the mask.
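The hart-mask bookkeeping described above boils down to per-hart bit operations on a `u64`; a self-contained sketch (the helper names are made up for illustration):

```rust
// Each enabled CPU from the device tree sets its bit in a u64 mask.
fn add_hart(mask: u64, hart_id: usize) -> u64 {
    mask | (1 << hart_id)
}

// A started hart is valid if its bit is set in the mask.
fn hart_enabled(mask: u64, hart_id: usize) -> bool {
    mask & (1 << hart_id) != 0
}

fn main() {
    let mut mask = 0u64;
    for id in [0usize, 1, 3] {
        mask = add_hart(mask, id); // e.g. harts 0, 1, 3 are "okay"/enabled
    }
    assert!(hart_enabled(mask, 1));
    assert!(!hart_enabled(mask, 2)); // disabled hart stays out of the mask
    assert_eq!(mask, 0b1011);
}
```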
In scheduler.rs, the TLS setup now uses the alignment field provided in `tls_info`. Memory is allocated with proper alignment and zero-initialized. Stacks are allocated with extra pages to separate kernel, user, and interrupt stacks. The kernel stack is marked with 0xdeadbeef at the top to help debug stack overflows. The `TaskStacks` type now properly handles cloning and deallocation, which also ensures safe memory reuse and cleanup when tasks end.
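The 0xdeadbeef marker mentioned above is a classic stack-canary scheme: write a known value at the stack's guard position and treat any change as evidence of an overflow. A hypothetical sketch (the function names are made up; this is not the PR's code):

```rust
// Known pattern written at the top of the kernel stack.
const STACK_CANARY: u32 = 0xdead_beef;

fn write_canary(stack: &mut [u32]) {
    stack[0] = STACK_CANARY; // guard word at the stack's top
}

// If the canary was clobbered, something overflowed into the guard word.
fn canary_intact(stack: &[u32]) -> bool {
    stack[0] == STACK_CANARY
}

fn main() {
    let mut stack = [0u32; 16];
    write_canary(&mut stack);
    assert!(canary_intact(&stack));

    stack[0] = 0; // simulate an overflow clobbering the canary
    assert!(!canary_intact(&stack));
}
```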