Garuda: RISC-V ML Accelerator

Swift as the divine eagle, Garuda accelerates RISC-V with specialized hardware for neural network inference.

🚀 What's New (October 2025)

Latest Updates:

✅ Bug Fix: Corrected INT8 saturation values for proper two's complement representation
✅ New Feature: Overflow detection flag for debugging and profiling
✅ Verification: Added SystemVerilog assertions for protocol compliance
✅ Coverage: Added overflow tracking properties for better testing

📖 Project Overview

Garuda is a CVXIF coprocessor that extends RISC-V with custom INT8 multiply-accumulate (MAC) instructions for efficient neural network inference. The modular design integrates with CVA6 without CPU modifications, achieving 2-5× speedup over software implementations.

Key Features:

⚡ CVXIF Interface: Standard coprocessor protocol (no CPU changes)
🎯 Stateless Design: Supports speculative execution
🔧 Compact: ~200 LUTs per MAC unit
🚀 Pipelined: 3-4 cycle latency

INT8 Quantization

Modern neural networks use INT8 quantization to reduce memory footprint (4x smaller than FP32), power consumption, bandwidth requirements, and hardware cost. INT8 inference achieves near-FP32 accuracy for most models with proper quantization techniques.

CVXIF Interface

CVXIF provides a standard interface for RISC-V coprocessors, enabling modular accelerator design without CPU modifications. The interface handles instruction offloading, register access, and result writeback.

Features

Custom Instructions (Garuda 1.0):

mac8 - INT8 MAC with 8-bit accumulator + saturation
mac8.acc - INT8 MAC with 32-bit accumulator
mul8 - INT8 multiply without accumulation
clip8 - Saturate to INT8 range [-128, 127]

Recent Improvements (Oct 2025):

✅ Fixed saturation bug (invalid 8'sd128 → correct -8'sd128)
✅ Added overflow detection output (tracks when saturation occurs)
✅ Added SystemVerilog assertions for verification
✅ Added coverage tracking for overflow events

Architecture:

CVXIF coprocessor integration
Stateless design for speculative execution
Pipelined MAC unit (3-4 cycle latency)
Overflow detection for debugging
Efficient resource usage (~200 LUTs per MAC unit)

Repository Structure

garuda/                          # Garuda accelerator
├── rtl/                         # RTL source files
│   ├── int8_mac_instr_pkg.sv   # Instruction definitions
│   ├── int8_mac_unit.sv        # MAC execution unit
│   ├── int8_mac_decoder.sv     # Instruction decoder
│   └── int8_mac_coprocessor.sv # Top-level module
├── tb/                          # Testbenches
│   └── tb_int8_mac_unit.sv     # MAC unit testbench
└── sw/                          # Software tests

cva6/                            # CVA6 RISC-V CPU core (upstream)

Getting Started

Prerequisites

RISC-V GNU Toolchain (see cva6/util/toolchain-builder)
Verilator, ModelSim/Questa, or VCS
Python 3.7+

Clone Repository

git clone https://github.com/yourusername/cva6-garuda.git
cd cva6-garuda
git submodule update --init --recursive

Run Simulations

cd garuda
./run_sim.sh verilator

Verify CVA6 Environment

cd cva6
export RISCV=/path/to/toolchain
export DV_SIMULATORS=veri-testharness,spike
bash verif/regress/smoke-tests.sh

Example Usage

Assembly Code

# Dot product: result = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]

dot_product:
    lw      t0, 0(a0)           # Load a[3:0] (packed INT8s)
    lw      t1, 0(a1)           # Load b[3:0] (packed INT8s)
    li      t2, 0               # Initialize accumulator
    
    mac8.acc t2, t0, t1         # acc += a[0] * b[0]
    srli     t0, t0, 8
    srli     t1, t1, 8
    
    mac8.acc t2, t0, t1         # acc += a[1] * b[1]
    srli     t0, t0, 8
    srli     t1, t1, 8
    
    mac8.acc t2, t0, t1         # acc += a[2] * b[2]
    srli     t0, t0, 8
    srli     t1, t1, 8
    
    mac8.acc t2, t0, t1         # acc += a[3] * b[3]
    
    mv       a0, t2             # Return result
    ret

C with Inline Assembly

static inline int32_t mac8_acc(int32_t acc, int8_t a, int8_t b) {
    int32_t result;
    asm volatile (
        "mac8.acc %0, %1, %2"
        : "=r" (result)
        : "r" (a), "r" (b), "0" (acc)
    );
    return result;
}

int32_t dot_product(int8_t* a, int8_t* b, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; i++) {
        sum = mac8_acc(sum, a[i], b[i]);
    }
    return sum;
}

Architecture

System Overview

CVA6 CPU                           INT8 MAC Coprocessor
┌──────────────────────┐          ┌──────────────────────┐
│ Fetch → Decode →     │          │ Instruction Decoder  │
│ Issue → Execute → WB │◄────────►│ INT8 MAC Unit        │
└──────────────────────┘          │ Result Register      │
         CVXIF Interface           └──────────────────────┘

Datapath

rs1[7:0]  rs2[7:0]
   │         │
   └────┬────┘
        │
   ┌────▼────┐
   │ 8x8 MUL │  16-bit product
   └────┬────┘
        │
   ┌────▼────┐
   │ 32b ADD │  Accumulate
   └────┬────┘
        │
   ┌────▼────┐
   │ Pipeline│  1 cycle
   └────┬────┘
        │
     rd[31:0]

Resource Usage

LUTs: ~200 per MAC unit
8x8 multiplier: ~100 LUTs
32-bit adder: ~32 LUTs
Control logic: ~50 LUTs

Performance

Instruction Count

Operation	Standard RISC-V	With MAC8.ACC	Speedup
Single MAC	2 (mul + add)	1	2x
4-elem dot product	16	14	1.14x
256-elem dot product	1024	~770	1.3x

Cycle Count

Operation	Standard RISC-V	MAC Coprocessor
Single MAC	5-8 cycles	3-4 cycles
256-elem dot product	~2048 cycles	~1500 cycles

Performance depends on memory bandwidth and cache behavior.

📚 Documentation

RTL Documentation:

See garuda/README.md for detailed RTL documentation
Inline code comments in all source files
Module hierarchy and integration guide

External References:

🎯 Quick Start

1. Clone Repository

git clone https://github.com/yourusername/garuda-accelerator.git
cd garuda-accelerator
git submodule update --init --recursive

2. Run Garuda 1.0 Simulation

cd garuda
./run_sim.sh verilator

3. Explore Documentation

# RTL documentation
cat garuda/README.md

# View instruction definitions
cat garuda/rtl/int8_mac_instr_pkg.sv

📊 Performance

Current Implementation

Peak Performance: ~25 GOPS (INT8)
Power: ~10W (estimated)
Latency: 3-4 cycles per MAC operation
Resource Usage: ~200 LUTs per MAC unit
Fmax: 100+ MHz (FPGA), 1+ GHz (ASIC target)

Use Cases

Edge AI inference (resource-constrained devices)
Embedded neural networks
Educational projects
RISC-V accelerator research

📚 References

RISC-V:

Neural Network Quantization:

🤝 Contributing

We welcome contributions! Areas of interest:

RTL improvements and optimizations
Testbench enhancements
Software examples and benchmarks
Documentation improvements
Performance analysis and benchmarking

📧 Contact & Community

GitHub Issues: Bug reports and feature requests
RISC-V Slack: #garuda channel (join the conversation)
OpenHW Group: Contribute to RISC-V ecosystem

📜 License

Garuda RTL: Apache License 2.0
CVA6: Solderpad Hardware License v0.51
Documentation: Creative Commons BY 4.0

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
cva6 @ 4d823b8		cva6 @ 4d823b8
garuda		garuda
.gitignore		.gitignore
.gitmodules		.gitmodules
README.md		README.md

certainly-param/garuda-accelerator

Folders and files

Latest commit

History

Repository files navigation