
feature(smp): improve hart booting, TLS setup and per-core stack initialization #1678


Open · wants to merge 9 commits into base: main
5 changes: 3 additions & 2 deletions src/arch/riscv64/kernel/mod.rs
@@ -184,9 +184,12 @@ fn finish_processor_init() {
pub fn boot_next_processor() {
let new_hart_mask = HART_MASK.load(Ordering::Relaxed);

debug!("Current HART_MASK: 0x{new_hart_mask:x}");
let next_hart_index = lsb(new_hart_mask);

if let Some(next_hart_id) = next_hart_index {
debug!("Preparing to start HART {next_hart_id}");

{
let stack = physicalmem::allocate(KERNEL_STACK_SIZE)
.expect("Failed to allocate boot stack for new core");
@@ -199,10 +202,8 @@ pub fn boot_next_processor() {
next_hart_id
);

// TODO: Old: Changing cpu_online will cause uhyve to start the next processor
CPU_ONLINE.fetch_add(1, Ordering::Release);

//When running bare-metal/QEMU we use the firmware to start the next hart
Comment on lines -202 to -205 (Member):
I think these comments should be kept. The lower one is still true, and the upper one is still to-do, since Uhyve does not support RISC-V yet.

if !env::is_uhyve() {
sbi_rt::hart_start(next_hart_id as usize, start::_start as usize, 0).unwrap();
}
4 changes: 2 additions & 2 deletions src/arch/riscv64/kernel/scheduler.rs
@@ -281,7 +281,7 @@ impl TaskTLS {
let tls_start = VirtAddr::new(tls_info.start);
// Yes, it does, so we have to allocate TLS memory.
// Allocate enough space for the given size and one more variable of type usize, which holds the tls_pointer.
let tls_allocation_size = tls_size.align_up(32usize); // + mem::size_of::<usize>();
let tls_allocation_size = tls_size.align_up(tls_info.align as usize);
// We allocate in 128 byte granularity (= cache line size) to avoid false sharing
let memory_size = tls_allocation_size.align_up(128usize);
let layout =
@@ -311,7 +311,7 @@ impl TaskTLS {
}

debug!(
"Set up TLS at 0x{tls_pointer:x}, tdata_size 0x{tdata_size:x}, tls_size 0x{tls_size:x}"
"Set up TLS at {tls_pointer:#x}, tdata_size {tdata_size:#x}, tls_size {tls_size:#x}"
);

Some(Box::new(Self {
110 changes: 89 additions & 21 deletions src/arch/riscv64/kernel/start.rs
@@ -1,17 +1,40 @@
use core::arch::naked_asm;
use core::sync::atomic::Ordering;
use core::arch::{asm, naked_asm};
use core::sync::atomic::{AtomicBool, AtomicU64, Ordering, fence};

use fdt::Fdt;
use hermit_entry::Entry;
use hermit_entry::boot_info::RawBootInfo;

use super::{CPU_ONLINE, CURRENT_BOOT_ID, HART_MASK, NUM_CPUS, get_dtb_ptr};
use crate::arch::riscv64::kernel::CURRENT_STACK_ADDRESS;
#[cfg(not(feature = "smp"))]
use crate::arch::riscv64::kernel::processor;
use crate::arch::riscv64::kernel::{CURRENT_STACK_ADDRESS, processor};
use crate::{KERNEL_STACK_SIZE, env};

//static mut BOOT_STACK: [u8; KERNEL_STACK_SIZE] = [0; KERNEL_STACK_SIZE];
const MAX_CORES: usize = 32;

// Cache-line aligned CPU-local data
#[repr(align(64))]
struct PerCpuData {
is_initialized: AtomicBool,
local_counter: AtomicU64,
#[allow(dead_code)]
padding: [u8; 48], // Fill to full cache line
}

impl PerCpuData {
const fn new() -> Self {
Self {
is_initialized: AtomicBool::new(false),
local_counter: AtomicU64::new(0),
padding: [0; 48],
}
}
}

#[allow(clippy::declare_interior_mutable_const)]
static CPU_DATA: [PerCpuData; MAX_CORES] = {
const CPU_LOCAL: PerCpuData = PerCpuData::new();
[CPU_LOCAL; MAX_CORES]
};
Comment on lines +12 to +37 (Member):
Why would we need this? Can we remove this?

(Contributor Author):
It's working, but it's experimental.
The code tries to make RISC-V SMP systems faster by using CPU-local data. There is a dedicated data type called PerCpuData. With #[repr(align(64))], each instance of this data lines up with the CPU's cache lines, which stops different CPU cores from accidentally invalidating each other's data ("false sharing"). Because of this, each RISC-V core (hart) can work on its own data without slowing down other cores, which should help with parallel processing.

(Member):
I understand using hart-local data and aligning that to cache lines. But we already have arch::riscv64::kernel::CoreLocal for that. Compared to CoreLocal, PerCpuData brings no benefit, as far as I can see. MAX_CORES is a limitation that CoreLocal does not have.

Also, the members (is_initialized and local_counters) seem useless, since they are never read and the atomic orderings and fences make no sense to me.

Is there any reason to keep this that I overlooked? 🤔


/// Entrypoint - Initialize Stack pointer and Exception Table
#[unsafe(no_mangle)]
@@ -47,24 +70,59 @@ pub unsafe extern "C" fn _start(hart_id: usize, boot_info: Option<&'static RawBootInfo>
}

unsafe extern "C" fn pre_init(hart_id: usize, boot_info: Option<&'static RawBootInfo>) -> ! {
CURRENT_BOOT_ID.store(hart_id as u32, Ordering::Relaxed);
// Sanity check: validate hart_id against HART_MASK
if CPU_ONLINE.load(Ordering::Acquire) > 0 {
// Faster check for Secondary-HARTs
if (HART_MASK.load(Ordering::Relaxed) & (1 << hart_id)) == 0 {
error!("Invalid hart ID: {hart_id}");
processor::halt();
}
}

// Memory Fence before ID storage
fence(Ordering::Release);
CURRENT_BOOT_ID.store(hart_id as u32, Ordering::Release);
Comment on lines +82 to +84 (Member):
The memory fence is not needed, right? Why did you change the atomic ordering? Relaxed ordering should be sufficient, since we don't use CURRENT_BOOT_ID for synchronization.


if CPU_ONLINE.load(Ordering::Acquire) == 0 {
unsafe {
(Member, on the unsafe block above):
Let's put the refactoring of the reduction of unsafe scope in a separate commit or PR.

env::set_boot_info(*boot_info.unwrap());
let fdt = Fdt::from_ptr(get_dtb_ptr()).expect("FDT is invalid");
// Init HART_MASK
let mut hart_mask = 0;
for cpu in fdt.cpus() {
let hart_id = cpu.property("reg").unwrap().as_usize().unwrap();
let status = cpu.property("status").unwrap().as_str().unwrap();

if status != "disabled\u{0}" {
hart_mask |= 1 << hart_id;
// Boot CPU Initialization
env::set_boot_info(*boot_info.unwrap());
let fdt = unsafe { Fdt::from_ptr(get_dtb_ptr()) }.expect("FDT is invalid");

// Build HART_MASK using readable conditional checks
let mut hart_mask = 0u64;
for cpu in fdt.cpus() {
if let Some(cpu_id) = cpu.property("reg").and_then(|p| p.as_usize()) {
if cpu
.property("status")
.and_then(|p| p.as_str())
.is_some_and(|s| s != "disabled\u{0}")
{
hart_mask |= 1 << cpu_id;
}
}
NUM_CPUS.store(fdt.cpus().count().try_into().unwrap(), Ordering::Relaxed);
HART_MASK.store(hart_mask, Ordering::Relaxed);
}

NUM_CPUS.store(fdt.cpus().count().try_into().unwrap(), Ordering::Release);

// Memory Fence before HART_MASK update
fence(Ordering::Release);
HART_MASK.store(hart_mask, Ordering::Release);
Comment on lines +107 to +109 (Member):


CPU_DATA[hart_id]
.is_initialized
.store(true, Ordering::Release);
CPU_DATA[hart_id].local_counter.store(1, Ordering::Release);

// Initialize TLS for boot core:
if let Some(tls_info) = env::boot_info().load_info.tls_info {
// Load the value into 'tp' using the mv instruction:
unsafe {
asm!(
"mv tp, {val}",
val = in(reg) tls_info.start as usize,
options(nostack, nomem)
);
}
Comment on lines +116 to +125 (Member):

}
crate::boot_processor_main()
} else {
@@ -76,6 +134,16 @@ unsafe extern "C" fn pre_init(hart_id: usize, boot_info: Option<&'static RawBootInfo>) -> ! {
}
}
#[cfg(feature = "smp")]
crate::application_processor_main();
{
// Optimized Secondary-HART initialization
fence(Ordering::Acquire);
CPU_DATA[hart_id]
.is_initialized
.store(true, Ordering::Release);
CPU_DATA[hart_id]
.local_counter
.fetch_add(1, Ordering::Relaxed);
crate::application_processor_main()
}
}
}