Home
PiTubeDirect is a simple, cheap, and compact way to add a second processor to your BBC or Master micro. It's the lowest cost way to run a Tube-enhanced Elite on a fast 6502, or CP/M, DOS+, or Panos on the relevant CPU type, or a gigahertz of Native ARM. It's the fastest 6502 you're likely to see, and the highest performance coprocessor.
Second processors used to cost hundreds or thousands, and some are very rare and still expensive. With PiTubeDirect you just add a simple interface to a Raspberry Pi and connect directly to the Tube socket. It can be fitted under a Beeb or inside a Master, or just cable-connected.
The minimal configuration would use:
- a 40 pin IDC cable
- a pair of 74LVC245A level shifters which change the tube interface from 5V to 3.3V levels
- a Raspberry Pi Zero or any Raspberry Pi you already have
- a micro SD card with the PiTubeDirect package on it
The bill-of-materials for this is approximately ten pounds. But for a little more money you can buy one ready made: see Level Shifter Options.
With this project you will have a configurable coprocessor which can be powered by the Beeb and fitted inside it, with a choice of:
- 274MHz 65C102 (*FX 151,230,0)
- 3MHz 65C102 (*FX 151,230,1) (for games compatibility)
- 112MHz Z80 (*FX 151,230,4)
- 63MHz 80286 (*FX 151,230,8)
- 27MHz 6809 (*FX 151,230,9)
- 59MHz ARM2 (*FX 151,230,12)
- 35MHz 32016 (*FX 151,230,13)
- null co-pro (*FX 151,230,14) (to save having to power off the Pi)
- 1000MHz ARMnative (*FX 151,230,15)
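The selection list above can be captured as a small lookup table. This is only a minimal sketch: `COPROS` and `fx_command` are hypothetical names used for illustration and are not part of PiTubeDirect.

```python
# Co Processor numbers and labels copied from the list above.
# The names COPROS and fx_command are illustrative only.
COPROS = {
    0: "274MHz 65C102",
    1: "3MHz 65C102",
    4: "112MHz Z80",
    8: "63MHz 80286",
    9: "27MHz 6809",
    12: "59MHz ARM2",
    13: "35MHz 32016",
    14: "null co-pro",
    15: "1000MHz ARMnative",
}


def fx_command(copro: int) -> str:
    """Return the *FX command that selects the given Co Processor number."""
    if copro not in COPROS:
        raise ValueError(f"no Co Processor listed for number {copro}")
    return f"*FX 151,230,{copro}"


if __name__ == "__main__":
    # Print every listed selection command next to its label.
    for number, label in COPROS.items():
        print(f"{fx_command(number):<16} {label}")
```

Numbers missing from the table (such as 2 or 3) are rejected rather than guessed, since only the entries listed above are documented here.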
Equivalent speeds are approximate. Generally a Pi 3 will run faster than a Pi 1. For ideas to exercise each of the CPU models, see Examples for each CoPro core.
Here's the minimal configuration, using a Pi Zero and a DIY two-chip level shifter:
A closer view of the level shifter (a PCB design for this is in progress):
A closer view of the Pi Zero:
The 65C02 Co Processor running the CLOCKSP benchmark:
The 65C02 Co Processor running the Tube Elite:
The ARM Co Processor running the CLOCKSP benchmark:
The most complex / desirable / expensive Beeb Co Processor was the 32016:
This runs an operating system called Panos:
And was aimed at the scientific market:
It's possible to dynamically switch between Co Processors using *FX 151,230,N (the same mechanism the Matchbox Co Processor uses, if you are familiar with that):
After hitting BREAK, the Pi has reconfigured itself as an 80x86:
This runs Digital Research DOSPlus 2.1:
Which in turn runs an early graphical windowing system called GEM:
Here's the iconic Paint program from circa 1986 (30 years ago!):
Finally, the whole system, including an old Atari trackball converted to look like an AMX Mouse:
You can see the Pi is now a Pi 3, which helps with the larger/more complex emulators (more below).
The Tube chip in an original BBC Co Processor is a custom ULA (Uncommitted Logic Array) that provides four bidirectional FIFOs, allowing the BBC Micro (host) and the Co Processor (parasite) to reliably exchange messages with full flow control.
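To make the idea of a flow-controlled FIFO pair concrete, here is a minimal sketch of one bidirectional channel. It is not a model of the real Tube ULA: the class names, the `depth` parameter and the flag names are assumptions chosen for illustration, and the actual register layout and FIFO depths are not represented.

```python
from collections import deque
from typing import Optional


class FlowControlledFifo:
    """Bounded byte FIFO exposing the two flags needed for flow control."""

    def __init__(self, depth: int) -> None:
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self._depth = depth  # hypothetical depth, not the real ULA value
        self._queue = deque()

    @property
    def data_available(self) -> bool:
        """True when the reader may take a byte."""
        return len(self._queue) > 0

    @property
    def not_full(self) -> bool:
        """True when the writer may queue another byte."""
        return len(self._queue) < self._depth

    def write(self, value: int) -> bool:
        """Queue one byte; return False and leave the FIFO unchanged if full."""
        if not self.not_full:
            return False
        self._queue.append(value & 0xFF)
        return True

    def read(self) -> Optional[int]:
        """Take the oldest byte, or return None if the FIFO is empty."""
        if not self.data_available:
            return None
        return self._queue.popleft()


class BidirectionalChannel:
    """One host<->parasite channel: an independent FIFO in each direction."""

    def __init__(self, depth: int = 1) -> None:
        self.host_to_parasite = FlowControlledFifo(depth)
        self.parasite_to_host = FlowControlledFifo(depth)
```

The writer polls `not_full` before writing and the reader polls `data_available` before reading, so neither side can overrun the other.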
In PiTubeDirect, the functionality of the Tube chip is emulated in software on the Raspberry Pi, and the Tube host interface on the BBC micro is connected to the Raspberry Pi's GPIO header via a pair of 74LVC245A level shifter chips.
The level shifters are necessary because the Tube interface uses 5V levels, whereas the Raspberry Pi's GPIO signals use 3.3V levels. Omitting the level shifters would likely damage the Raspberry Pi, so please don't try!
The Tube host interface is simply an extension of the 6502 bus and operates at 2MHz, giving a 500ns bus cycle. The nTUBE signal (indicating an access to one of eight host-side tube registers) becomes active ~100ns into the bus cycle. This generates an interrupt on the Pi, which then has at most ~400ns to service the access in real time.
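The timing budget described above is simple arithmetic, sketched below. The 2MHz bus clock and ~100ns nTUBE delay come from the text; the 1GHz ARM clock is an assumption (it matches the Pi Zero figure quoted elsewhere on this page), and the function names are hypothetical.

```python
BUS_HZ = 2_000_000      # Tube host interface clock, from the text
NTUBE_DELAY_NS = 100    # approximate point in the cycle where nTUBE goes active
ARM_HZ = 1_000_000_000  # assumed ARM clock (1GHz)


def bus_cycle_ns(bus_hz: int) -> float:
    """Length of one host bus cycle in nanoseconds."""
    return 1e9 / bus_hz


def service_window_ns(bus_hz: int, ntube_delay_ns: float) -> float:
    """Upper bound on the time left in the cycle once nTUBE is active."""
    return bus_cycle_ns(bus_hz) - ntube_delay_ns


def arm_cycles(window_ns: float, arm_hz: int) -> float:
    """Number of ARM clock cycles that fit in the given window."""
    return window_ns * arm_hz / 1e9


if __name__ == "__main__":
    window = service_window_ns(BUS_HZ, NTUBE_DELAY_NS)
    print(f"bus cycle      : {bus_cycle_ns(BUS_HZ):.0f} ns")
    print(f"service window : {window:.0f} ns")
    print(f"ARM cycles     : {arm_cycles(window, ARM_HZ):.0f}")
```

This is an upper bound only: interrupt entry latency and the time the host needs the data to be stable before the end of the cycle both eat into the window.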
Clearly, minimizing interrupt latency is crucial to reliable operation, and we use several techniques here:
- dispense with an operating system - PiTubeDirect is a bare metal system where we control everything
- use a FIQ interrupt (so registers don't have to be stacked)
- carefully hand optimize the FIQ handler
- avoid cache misses within the FIQ handler by locking critical code and data into the cache
- if multiple cores are available, dedicate an entire core to the FIQ handler
Doing all this, it's just possible to achieve the required performance.
For more information, see the FIQ interrupt handler walkthrough
The PiTubeDirect firmware currently includes emulations of the following Beeb Co Processors:
- 65C102 (using 65tube - the fastest known native ARM 65C02 emulation)
- 65C102 (using lib6502 - written in C)
- 80x86 (using Fake86 - written in C)
- ARM2 (using MAME's ARM 2/3/6 emulation - written in C)
- 32016 (using a 32016 emulation that started life in B-Em, and was resurrected earlier this year)
- Z80
- 6809
See Credits and Acknowledgements for who we have to thank for each of these emulations.
Several Pi Models are supported, but within the team we are concentrating on the two extremes:
- the £4.00 Pi Zero (BCM2835/ARM1176) which has a single ARM core that runs at up to 1.0GHz
- the £30.00 Pi 3 (BCM2837/ARM Cortex A53) which has four ARM cores running at up to 1.2GHz
On the Pi Zero, the challenge is reducing interrupt latency, regardless of what the main emulator is doing, as they both share the same ARM core. The typical interrupt latency we observe is 80ns. However, if the main emulator has a cache miss at exactly the same time as the host attempts to read a tube register, this can increase to 300ns, which means the read data arrives marginally late.
We have focused on the 6502 emulation using 65tube, which has been reduced in size to ~9KB. In theory this should fit inside the 16KB L1 cache, but in practice we still observe occasional late reads (on a scope). That said, Tube Elite does run reasonably reliably. We are close to the edge here, and this is best viewed as an experiment that's still in progress.
We now use the GPU to handle the time-critical requests from the host, which means we no longer miss a request.
On the Pi 3, we dedicate one of the cores to interrupt handling. This results in an interrupt latency that is very tightly controlled, varying between 100ns and 120ns. This provides ample time to reliably service 6502 reads and writes, regardless of how large the main emulator is and what it is doing.
All of the emulators currently successfully boot on the Pi 3, and run more reliably than on the Pi Zero.
The above has now been moved over to the GPU for increased performance.
PiTubeDirect is closely related to, but distinct from, two earlier Beeb Co Processor projects:
- the Matchbox Co Processor (see github and stardot) implements multiple Co Processors using a Xilinx XC6SLX9 FPGA. More than 50 of these have been built and distributed through the stardot forums. The cost is about £50.
- the PiTubeClient project (see github and stardot) is an extension to the Matchbox Co Processor that allows a range of Co Processors to be emulated in software on a Raspberry Pi.
One of the designs in the Matchbox Co Processor is an "SPI Co Processor" containing a VHDL implementation of the Acorn Tube chip together with an SPI slave interface. A software emulation of a Co Processor, running on the Raspberry Pi, can use SPI to read/write the tube registers. The Raspberry Pi firmware to do all this is PiTubeClient.
PiTubeDirect is an evolution of PiTubeClient that avoids the need to use a Matchbox Co Processor. It does this by emulating the Acorn Tube chip itself in software on the Raspberry Pi. This introduces some very hard real time constraints on the Raspberry Pi, and the fun of this project was/is overcoming these.
If you connect a serial cable to the Pi, you will get some diagnostic logging:
FIRMWARE_VERSION : 572ca1d3
BOARD_MODEL : 00000000
BOARD_REVISION : 00a02082
BOARD_MAC_ADDRESS : 5ceb27b8 17d73569
BOARD_SERIAL : ce5c6935 00000000
EMMC_FREQ : 250.000 MHz 250.000 MHz 250.000 MHz
UART_FREQ : 48.000 MHz 1000.000 MHz 1000.000 MHz
ARM_FREQ : 1000.000 MHz 1000.000 MHz 1000.000 MHz
CORE_FREQ : 400.000 MHz 400.000 MHz 400.000 MHz
V3D_FREQ : 300.000 MHz 300.000 MHz 300.000 MHz
H264_FREQ : 300.000 MHz 300.000 MHz 300.000 MHz
ISP_FREQ : 300.000 MHz 300.000 MHz 300.000 MHz
SDRAM_FREQ : 450.000 MHz 450.000 MHz 450.000 MHz
PIXEL_FREQ : 0.000 MHz -1894.967 MHz -1894.967 MHz
PWM_FREQ : 0.000 MHz 500.000 MHz 500.000 MHz
CORE TEMP : 52.08 °C
CORE VOLTAGE : 1.32 V
SDRAM_C VOLTAGE : 1.20 V
SDRAM_P VOLTAGE : 1.20 V
SDRAM_I VOLTAGE : 1.20 V
CMD_LINE : dma.dmachans=0x7f35 bcm2708_fb.fbwidth=656 bcm2708_fb.fbheight=416 bcm2709.boardrev=0xa02082 bcm2709.serial=0xce5c6935 smsc95xx.macaddr=B8:27:EB:5C:69:35 bcm2708_fb.fbswap=1 bcm2709.uart_clock=48000000 vc_mem.mem_base=0x3dc00000 vc_mem.mem_size=0x3f000000 dwc_otg.lpm_enable=0 console=ttyS0,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline copro=0 fsck.repair=no rootwait
COPRO : 0
0 0000000000 1100000000
1 0000220000 0000220011
2 0000000010 0000111110
A0 = GPIO27 = mask 08000000
A1 = GPIO02 = mask 00000004
A2 = GPIO03 = mask 00000008
enable_MMU_and_IDCaches
cpsr = 600001d3
extctrl = 00000000 00000040
ttbcr = 00000000
ttbr0 = 01fac04a
sctrl = 00c5183d
ctype = 84448004
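The "A0 = GPIO27 = mask 08000000" lines in the log above are each a single bit set at the GPIO number. A minimal sketch of that relationship follows, along with how a 3-bit register number could be assembled from the three address pins. The pin assignments are taken from the log; the names `ADDRESS_PINS`, `pin_mask` and `decode_register`, and the idea of decoding from one 32-bit level word, are assumptions for illustration rather than the firmware's actual code.

```python
# Address line to GPIO number, as printed in the diagnostic log above.
ADDRESS_PINS = {"A0": 27, "A1": 2, "A2": 3}


def pin_mask(gpio: int) -> int:
    """Bit mask for one GPIO within a 32-bit level word."""
    return 1 << gpio


def decode_register(gpio_levels: int) -> int:
    """Assemble the tube register number (0-7) from a GPIO level word.

    gpio_levels is assumed to be a 32-bit snapshot with bit N = GPIO N.
    """
    register = 0
    for bit, name in enumerate(("A0", "A1", "A2")):
        if gpio_levels & pin_mask(ADDRESS_PINS[name]):
            register |= 1 << bit
    return register


if __name__ == "__main__":
    # Reproduce the mask column of the log.
    for name, gpio in ADDRESS_PINS.items():
        print(f"{name} = GPIO{gpio:02d} = mask {pin_mask(gpio):08x}")
```

Because the address pins are not adjacent GPIOs, the register number cannot be obtained with a single shift and mask, which is why each line is tested separately here.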
On power up, after the MMU, I and D caches are enabled, a short benchmark is run on Core 0:
benchmarking core....
cycle counter = 4000192
L1I_CACHE = 4000013
L1I_CACHE_REFILL = 2
L1D_CACHE = 2
L1D_CACHE_REFILL = 0
L2D_CACHE_REFILL = 2
INST_RETIRED = 6000026
benchmarking io toggling....
cycle counter = 63203584
L1I_CACHE = 3000029
L1I_CACHE_REFILL = 4
L1D_CACHE = 2000002
L1D_CACHE_REFILL = 1
L2D_CACHE_REFILL = 4
INST_RETIRED = 6000028
benchmarking 1KB memory copy....
cycle counter = 3904
L1I_CACHE = 446
L1I_CACHE_REFILL = 5
L1D_CACHE = 520
L1D_CACHE_REFILL = 10
L2D_CACHE_REFILL = 31
INST_RETIRED = 824
benchmarking 2KB memory copy....
cycle counter = 1920
L1I_CACHE = 840
L1I_CACHE_REFILL = 0
L1D_CACHE = 1032
L1D_CACHE_REFILL = 11
L2D_CACHE_REFILL = 19
INST_RETIRED = 1593
benchmarking 4KB memory copy....
cycle counter = 4160
L1I_CACHE = 1597
L1I_CACHE_REFILL = 0
L1D_CACHE = 2056
L1D_CACHE_REFILL = 11
L2D_CACHE_REFILL = 35
INST_RETIRED = 3128
benchmarking 8KB memory copy....
cycle counter = 8960
L1I_CACHE = 3131
L1I_CACHE_REFILL = 0
L1D_CACHE = 4104
L1D_CACHE_REFILL = 26
L2D_CACHE_REFILL = 69
INST_RETIRED = 6200
benchmarking 16KB memory copy....
cycle counter = 15104
L1I_CACHE = 6182
L1I_CACHE_REFILL = 0
L1D_CACHE = 8200
L1D_CACHE_REFILL = 14
L2D_CACHE_REFILL = 132
INST_RETIRED = 12342
benchmarking 32KB memory copy....
cycle counter = 37376
L1I_CACHE = 12325
L1I_CACHE_REFILL = 0
L1D_CACHE = 16392
L1D_CACHE_REFILL = 119
L2D_CACHE_REFILL = 260
INST_RETIRED = 24630
benchmarking 64KB memory copy....
cycle counter = 99200
L1I_CACHE = 24633
L1I_CACHE_REFILL = 0
L1D_CACHE = 32776
L1D_CACHE_REFILL = 189
L2D_CACHE_REFILL = 512
INST_RETIRED = 49208
benchmarking 128KB memory copy....
cycle counter = 224832
L1I_CACHE = 49190
L1I_CACHE_REFILL = 0
L1D_CACHE = 65544
L1D_CACHE_REFILL = 175
L2D_CACHE_REFILL = 1024
INST_RETIRED = 98358
benchmarking 256KB memory copy....
cycle counter = 422272
L1I_CACHE = 98343
L1I_CACHE_REFILL = 0
L1D_CACHE = 131080
L1D_CACHE_REFILL = 264
L2D_CACHE_REFILL = 2048
INST_RETIRED = 196662
benchmarking 512KB memory copy....
cycle counter = 875136
L1I_CACHE = 196647
L1I_CACHE_REFILL = 0
L1D_CACHE = 262152
L1D_CACHE_REFILL = 557
L2D_CACHE_REFILL = 4099
INST_RETIRED = 393270
benchmarking 1024KB memory copy....
cycle counter = 1901376
L1I_CACHE = 393256
L1I_CACHE_REFILL = 0
L1D_CACHE = 524296
L1D_CACHE_REFILL = 268
L2D_CACHE_REFILL = 9069
INST_RETIRED = 786486
The cycle counter is in 1GHz ARM clock cycles.
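Since the cycle counter is in 1GHz cycles, the memory copy figures above can be turned into throughput. The sketch below parses two of the log entries and converts them. The function name and the embedded sample are illustrative, and it assumes "1KB" in the log means 1024 bytes.

```python
import re

ARM_HZ = 1_000_000_000  # the cycle counter is in 1GHz ARM clock cycles

# Two entries copied from the benchmark log above.
SAMPLE_LOG = """\
benchmarking 1KB memory copy....
cycle counter = 3904
benchmarking 1024KB memory copy....
cycle counter = 1901376
"""


def copy_throughput(log: str, arm_hz: int = ARM_HZ) -> dict:
    """Map each copy size in KB to bytes copied per second."""
    results = {}
    size_kb = None
    for line in log.splitlines():
        header = re.match(r"benchmarking (\d+)KB memory copy", line)
        if header:
            size_kb = int(header.group(1))
            continue
        counter = re.match(r"cycle counter\s*=\s*(\d+)", line)
        if counter and size_kb is not None:
            seconds = int(counter.group(1)) / arm_hz
            # Assumes 1KB = 1024 bytes.
            results[size_kb] = size_kb * 1024 / seconds
            size_kb = None  # only the first counter after a header counts
    return results


if __name__ == "__main__":
    for size_kb, rate in copy_throughput(SAMPLE_LOG).items():
        print(f"{size_kb:5d}KB copy: {rate / 1e6:.1f} MB/s")
```

Other counters in the log (cache accesses, refills, instructions retired) are ignored here; only the cycle count following each memory copy header is used.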
Then, if there are multiple cores, these are started, and finally the emulator is started:
Raspberry Pi Direct 65C02 (65tube) Client
main running on core 0
starting core 1
SPIN1
starting core 2
SPIN2
starting core 3
CORE3
enable_MMU_and_IDCaches
cpsr = 600001d3
extctrl = 00000000 00000040
ttbcr = 00000000
ttbr0 = 01fac04a
sctrl = 00c5183d
ctype = 84448004
emulator running on core 3
Each time the Co Processor is reset (by hitting BREAK on the Beeb), ARM performance stats can be logged:
cycle counter = 244349525184
L1I_CACHE = 3928583582
L1I_CACHE_REFILL = 79
L1D_CACHE = 123315172
L1D_CACHE_REFILL = 26
L2D_CACHE_REFILL = 113
INST_RETIRED = 26060255
tube reset - copro 0