
RISC-V in the loop

Continuous integration (‘CI’) for hardware is a logical step to take: why not do for hardware what works fine for software?

To keep things short: I’ve decided to stick my proprietary RISC-V approach ‘pyrv32’ into the open-sourced MaSoCist testing loop, so that there is always an online reference that can run anywhere without massive software installation dances.

Because quite a part of the toolchain is still missing from the open source repo (work in progress), only a stripped-down VHDL edition of the pyrv32 is available for testing and playing around.

This is what it currently does when running ‘make all test’ in the provided Docker environment:

  • Build some tools necessary to build the virtual hardware
  • Compile source code, create a ROM file from it as VHDL
  • Build a virtual System on Chip based on the pyrv32 core
  • Download the ‘official’ riscv-tests suite onto the virtual target and run the tests
  • Optionally, you can also talk to the system via a virtual (UART) console

Instructions

This is the quickest ‘online’ way without installing software. You might need to register a Docker account beforehand.

  1. Log in at the docker playground: https://labs.play-with-docker.com
  2. Add a new instance of a virtual machine via the left panel
  3. Run the docker container:
    docker run -it hackfin/masocist
  4. Run the test suite:
    wget section5.ch/downloads/masocist_sfx.sh && sh masocist_sfx.sh && make all test
  5. Likewise, you can run the virtual console demo:
    make clean run-pyrv32
  6. Wait for the boot message and the # prompt to appear, then type h for help.
  7. Dump virtual SPI flash:
    s 0 1
  8. Exit minicom terminal by Ctrl-A, then q.

What’s in the box?

  • ghdl, ghdlex: Turns a set of VHDL sources into a simulation executable that exposes signals to the network (The engine for the virtual chip).
  • masocist: A build system for a System on Chip:
    • GNU Make, Linux kconfig
    • Plenty of XML hardware definitions based on netpp.
    • IP core library and plenty of ugly preprocessor hacks
    • Cross compiler packages for ZPU, riscv32 and msp430 architectures
  • gensoc: SoC generator alias IP-XACT’s mean little brother (from another mother…)
  • In-House CPU cores with In Circuit Emulation features (Debug TAPs over JTAG, etc.):
    • ZPUng: pipelined ZPU architecture with optimum code density
    • pyrv32: a rv32ui compatible RISC-V core
  • Third party opensource cores, not fully verified (but running a simple I/O test):
    • neo430: a msp430 compatible architecture in VHDL
    • potato: a RISC-V compatible CPU design


RISC-V experiments

Yes, the RISC-V is fun. It’s not only the market momentum it currently has, or the political aspects of disrupting the world of closed source processor IP. Its design is simply a logical attractor once you’ve made your way from classic RISC (MIPS) pipelines over the DLX improvements, with various visits to FPGA-specific developments like Xilinx’ MicroBlaze architecture, Altera’s Nios or the Lattice LM32.

Starting with MIPS makes sense, because you’ve got a mature open source toolchain which you can use as a reference for regression-testing your newly developed architecture. However, at some point one will run into scratchy issues, like the branch delay slot (which requires workaround logic for in-circuit emulation) or the suboptimal instruction set density, not to mention one of the biggest issues: absolute addressing versus PC-relative addressing (which was brought in with the MIPS16 ASE). So, sooner or later you will find out that the RISC-V arrangements are pretty optimal for this type of RISC architecture.

The classic RISC five stage pipeline

This is something that’s explained in detail in plenty of papers published from berkeley.edu. You will find a lot of valuable information from the driving forces behind the RISC-V architecture.

Anyway, these are the processing stages of that little bucket chain working in the processor — in parallel:

  • FE: The fetch stage, get an instruction word from instruction (program) memory
  • DE: Decode the instruction into an arithmetic, load/store, jump, branch or other function
  • EX: Execute the instruction
  • MEM: The memory stage is more complicated. Let’s discuss this later.
  • WB: Write back of a computed or loaded value to an internal register

In a chip architecture, bottlenecks will occur when much of the logic zoo gathers in one area of the silicon. In general, this is the case for often used multiplexers (these can be seen as routers, directing the data to the right logic). Flipflops do better, they will hold data in a register and can propagate the data to process to another area of the chip. Ok, we knew that, not too new.

Now the first question: If we have an instruction reading from data memory, when can we start asserting an address to the RAM? Answer: For most architectures, we can do that early in the DE stage. The ASCII art approach below demonstrates how that works for a fast reading scenario (READ_FAST): Data is READY one cycle after the READ event. If this was the case for all data access, we could omit the MEM stage. For a READ_WAIT scenario however, data might not be ready. So we just need this stage to determine whether data is ready and valid before we can write it to the destination register in the WB stage.

                 |   FE   |   DE   |   EX   |  MEM  |   WB   |
READ_FAST                    READ     READY          
READ_WAIT                    READ     WAIT    READY

On some architectures again, we might want to save on logic and not assert the READ event in the DE, but in the EX stage. Then we’d have to insert these infamous pipeline stalls that force the processor to wait (and do nothing) for these few cycles. On the other hand, we’d possibly save some adder and multiplexing logic in DE.

The same dispute can be carried out over branching. When a decision to branch can be taken, most architectures calculate a jump target address relative to the current program counter (of the DE stage). This can be calculated either in the DE stage, or by the ALU (arithmetic logic unit) in the EX stage (more about that later).

Branch penalties

When a branch is taken, i.e. if it turns out during the EX stage that the program counter being fetched from (which by default, without branch prediction, just keeps incrementing linearly) is actually invalid, the fetched and decoded data is thrown away until the data fetched from the jump target address is valid again. The cycles consumed by this ‘pipeline flushing’ are the so-called branch penalty (‘BP’). The later we calculate the target address, the greater the ‘BP’.

Hands on

For this example, a RISC-V design (called ‘pyrv32’) was synthesized for a Spartan-6 LX45. No optimization for specific RAMs took place, so the toolchain decides to allocate quite a bit of LUT RAM. A few words about this design:

  • Simplified RISC-Pipeline with minimum hazard scenarios, no branch delay slot (unlike MIPS)
  • RV32I instruction set compatibility, but missing CSR unit
  • Very simple exception/irq/debug/emulation support
  • 4-5 stage pipeline with shortcut logic to allow READ_FAST and READ_WAIT scenarios

The complete design is a fully working system on chip, like an off-the-shelf microprocessor with Ethernet MAC, I2C and what not. I wanted to see things in a known working setup, so I basically swapped out the original ZPUng for our proprietary ‘pyrv32’:

‘Early’ branch calculation:

f_max = 62 MHz

Branch penalty: 2 cycles

‘Late’ branch calculation:

f_max = 68 MHz

Branch penalty: 3 cycles

For the ‘late’ branch calculation you can see the maximum clock frequency going up, and likewise the LUT count. However, the total count is lower for the entire SoC. The synthesis is probably doing some optimization here, which deserves further scrutiny.

This branching option is a configurable variable (CONFIG_EARLY_BRANCH_DETECTION). By default it is True. Depending on the typical amount of branching, this configuration provides more processing power.
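
As a rough cross-check of this trade-off, the sketch below estimates the instruction throughput of both variants from the f_max and branch penalty figures quoted for the full SoC. It assumes an idealized pipeline that retires one instruction per cycle except for taken branches. The function names and this cost model are illustrative assumptions, not part of the pyrv32 sources.

```c
#include <stdio.h>

/* Idealized model (an assumption): one instruction per cycle, plus
 * 'penalty' flush cycles for every taken branch. 'taken' is the
 * fraction of executed instructions that are taken branches. */
static double throughput_mips(double f_max_mhz, int penalty, double taken)
{
    return f_max_mhz / (1.0 + taken * (double)penalty);
}

/* Taken-branch fraction above which the 'early' variant wins,
 * obtained by equating both throughput expressions. */
static double break_even(double f_early, int bp_early,
                         double f_late, int bp_late)
{
    return (f_late - f_early) / (f_early * bp_late - f_late * bp_early);
}

/* Print both estimates for a given taken-branch fraction,
 * using the SoC figures: 62 MHz / BP 2 versus 68 MHz / BP 3. */
static void compare(double taken)
{
    printf("taken=%.2f early=%.1f late=%.1f MIPS\n", taken,
           throughput_mips(62.0, 2, taken),
           throughput_mips(68.0, 3, taken));
}
```

In this model the break-even point for the SoC figures lies at a taken-branch fraction of 0.12: code in which more than about 12% of the executed instructions are taken branches favours the early variant despite its lower clock, while less branchy code favours the late one.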

Bare bone

When synthesizing the CPU as a single unit without the peripheral memory logic, the frequency variations are marginal:

  • ‘early’: f_max = 110 MHz
  • ‘late’: f_max = 114 MHz

This is to be expected, as logic congestion is reduced due to the missing peripheral and DMA bus logic.

RISC-V 32 bit for netpp node

Yes, the RISC-V fits on the netpp node with the existing ‘dagobert’ configuration:

  • IRAM size: 0x8000
  • DRAM size: 0x4000
  • DMA scratch pad like with the ZPUng configuration: 2x 0x800
  • 54 MHz core clock

Although the DRAM memory is DMA capable in principle, the scratch pad must be used for all fast I/O (networking), like on the ZPUng architecture. The reason for this is a few configurations reserved for the future, like shared memory between cache, pyrv32 and the DSP extension, which use the dual ports of the block RAM to read 64 bit instructions from certain memory portions.

The default DMA width, however, is 16 bit, unlike the 8 bit on the ZPUng. This allows for higher throughput, like 1G Ethernet. That makes no sense on the netpp node, but it does on a camera, for example.

The tedious path to optimization

A lot of tweaking is necessary to crank f_max up to the possible maximum. Some tools will help you with that and point out stupid mistakes, but with most of the synthesis tools it is a bit of trial and error and careful reading of the logfiles. These details are boring, so I will spare them here. Short version: You can get some interesting insight using the various floorplanning tools which can visualize signal and data paths between critical logic.

However, one may not want to go for maximum f_max, as most cycles are burnt elsewhere. Many things can be optimized in software or using clever DMA processing. This is where the CPU architecture is less relevant than a tricky SoC memory cross bar which allows the peripherals to use DMA while the CPU can do other things.

Optimization attack targets

There are a few deviations from the RISC-V standard you can look at:

  • Implementing the CSR as a memory mapped unit by replacing the CSR register exchange/set/clr commands with accesses to a memory mapped I/O range: This spares you some logic in the CPU and avoids further congestion close to the ALU
  • Eliminating IRQ support altogether by using DMA queues. This might appear odd, but for some data processing, pure DMA will do, and the main loop ends up running deterministically.

CSR quirks

When implementing the CSR unit as memory mapped (to MMR space, i.e. memory mapped registers), there are a few quirks. First, remember that CSRRS/CSRRC/CSRRW are supposed to be atomic, i.e. an emulation of the sort (1) get register, (2) OR with value, (3) write back is not OK. Therefore this MMR mapped CSR needs to implement W1C (write one to clear) and W1S (write one to set) logic. Because we only have write_enable or read_enable signals to the MMR I/O, W1C and W1S have to be implemented as shadow registers using an offset address.

Meaning, the register logic works as follows: an access to the CSR register 0xb05 maps the assembler instructions as follows, for example:

  • CSRRW: A simultaneous read/write from/to address 0xffe00000 + (0x0b05 << 2)
  • CSRRC: A W1C to 0xffe01000 + (0x0b05 << 2)
  • CSRRS: A W1S to 0xffe02000 + (0x0b05 << 2)

Thus, the W1* logic is implemented in the peripherals for each register. Note: Since the CSR is sitting in MMR, a read is subject to a delay. Under certain circumstances, the pipeline may stall for one cycle, depending on the previous memory access history. This has to do with the 4/5-Pipeline shortcut/delaying mechanisms.
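
The shadow register idea can be written down as a small behavioural model. This is only a sketch: the window names are made up, the 0x1000 spacing is taken from the example addresses, and returning the previous value on every write mirrors the CSR instruction semantics rather than anything the text states about the hardware.

```c
#include <stdint.h>

/* Window offsets relative to the plain read/write window. The 0x1000
 * spacing follows the example addresses; the names are hypothetical. */
enum csr_window {
    CSR_WIN_RW  = 0x0000, /* CSRRW: plain write */
    CSR_WIN_W1C = 0x1000, /* CSRRC: write one to clear */
    CSR_WIN_W1S = 0x2000  /* CSRRS: write one to set */
};

/* Byte offset of a CSR inside one window: one 32 bit word per CSR. */
static uint32_t csr_offset(uint32_t csr_number)
{
    return csr_number << 2;
}

/* Behavioural model of the peripheral-side register. The update is
 * done by the peripheral within a single bus write, so the CPU never
 * runs a read-modify-write sequence that could be interrupted.
 * Returns the previous register value (a modelling assumption). */
static uint32_t csr_mmr_write(uint32_t *reg, enum csr_window win,
                              uint32_t data)
{
    uint32_t old = *reg;

    switch (win) {
    case CSR_WIN_RW:  *reg = data;   break;
    case CSR_WIN_W1C: *reg &= ~data; break;
    case CSR_WIN_W1S: *reg |= data;  break;
    }
    return old;
}
```

For CSR 0xb05, csr_offset() yields 0x2c14, which is then added to the base address of the respective window.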

Optimization updates:
  1. 02 Oct 2019: no IRQ, pure DMA (no CSR), register file moved to LUT RAM, debug logic mostly eliminated:
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Module                             | Partition | Slices*       | Slice Reg     | LUTs          | LUTRAM        | BRAM/FIFO | DSP48A1 | BUFG  | BUFIO | BUFR  | DCM   | PLL_ADV   | Full Hierarchical Name                                                                        |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | ++pyrv32_cpu_inst                  |           | 347/348       | 354/354       | 1035/1036     | 44/44         | 0/0       | 0/0     | 0/0   | 0/0   | 0/0   | 0/0   | 0/0       | netpp_node_top/soc/pyrv32_cpu_inst                                                            |

Multitasking on the ZPUng

For preemptive or less preemptive multitasking on the ZPUng architecture, some mechanisms for task switching come in handy. Since the ZPUng is a context saving architecture by design, the context switching is very light: We only need to manipulate the stack pointer and program counter somewhere in the code (plus regard some minor details with global variables used by GCC, which we will ignore for now).

Every task has its own stack area in the stack memory. Since we have no virtual memory in this architecture, care must be taken that the local stack areas do not trash other tasks’ reserved stack spaces.

Preemptive (time slice) multitasking

In this case, a timer interrupt service routine will always change the context. By design of the interrupt handling hardware, the return address PC is always stored on the stack, so if we manipulate the stack pointer (SP) inside the interrupt handler routine, there’s not much more to do than saving any global context entities on the stack.

For this context switch, we need to store the current SP into the address pointed to by a global context pointer g_context on IRQ handler entry and restore it upon exit. The following assembler macros are required to do that:

; Save current SP context in a global ptr g_context
.macro save_context
pushsp
im g_context
load
store
.endm

; Restore SP context from global ptr g_context
.macro restore_context
im g_context
load
load
popsp
.endm
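
To see what the two macros amount to, here is a tiny C model of the stack operations involved. It is a sketch based on my reading of the ZPU instruction semantics, simplified to word addressing (the real core uses byte addresses); all names in it are hypothetical. The net effect is `*g_context = sp` for save_context and `sp = *g_context` for restore_context.

```c
#include <stdint.h>

#define MEM_WORDS 128

/* Tiny word-addressed stack machine, just enough for the two macros.
 * The stack grows downwards; sp indexes the top of stack (TOS). */
typedef struct {
    uint32_t mem[MEM_WORDS];
    uint32_t sp;
} zpu_model_t;

/* PUSHSP pushes the SP value as it was before the push. */
static void op_pushsp(zpu_model_t *z) { z->mem[z->sp - 1] = z->sp; z->sp--; }
/* IM pushes an immediate value. */
static void op_im(zpu_model_t *z, uint32_t v) { z->mem[--z->sp] = v; }
/* LOAD replaces the address on TOS with the word stored there. */
static void op_load(zpu_model_t *z) { z->mem[z->sp] = z->mem[z->mem[z->sp]]; }
/* POPSP loads SP from TOS. */
static void op_popsp(zpu_model_t *z) { z->sp = z->mem[z->sp]; }

/* STORE pops the address (TOS), then the value, and writes it. */
static void op_store(zpu_model_t *z)
{
    uint32_t addr = z->mem[z->sp];
    uint32_t val  = z->mem[z->sp + 1];

    z->mem[addr] = val;
    z->sp += 2;
}

/* save_context: *g_context = sp */
static void save_context(zpu_model_t *z, uint32_t g_context_addr)
{
    op_pushsp(z);
    op_im(z, g_context_addr);
    op_load(z);
    op_store(z);
}

/* restore_context: sp = *g_context */
static void restore_context(zpu_model_t *z, uint32_t g_context_addr)
{
    op_im(z, g_context_addr);
    op_load(z);
    op_load(z);
    op_popsp(z);
}
```

Note the double load in restore_context: the first one fetches the pointer stored in g_context, the second one fetches the saved SP that this pointer refers to.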

The timer service routine looks very simple as well:

; -- IRQ handler
	.globl irq_timer_handler
irq_timer_handler:
	; Stores the current context (sp) into the variable pointed to
	; by g_context.
	save_context
	set_stack_isr
	save_memregs

	im timer_service
	call

	restore_memregs
	restore_stack
	; This leaves a possibly new return jump address on
	; TOS, if g_context was modified by the timer_service.
	restore_context
	
	.byte 15
	poppc

So inside the timer_service() function, which can be coded in C, we only need to point g_context to the stack pointer storage address of the task to run next:

 g_context = &walk->sp; // Context switch

Using a very simple prioritized round robin scheduler and two example tasks toggling GPIO pins, we achieve a simulation result as shown below:

Multitasking trace

The TaskDesc debug output denotes the currently active task ID, 0x2a90 being the main task, where 0x2aa8 toggles GPIO1, 0x2ac0 toggles GPIO0.

Internally, these task descriptors are put into a worker queue and are cycled through using some bit of priority distribution, i.e. tasks with a lower ‘interval’ value get more CPU time. However, no task can ever be blocked completely.
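
A scheduler with these properties could look roughly like the sketch below. It is not the original implementation: only `g_context`, `walk->sp` and the ‘interval’ value come from the text, while the `count` field, the ring layout and `sched_init()` are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct TaskDesc {
    uintptr_t sp;            /* saved stack pointer of this task */
    unsigned interval;       /* lower value: scheduled more often (>= 1) */
    unsigned count;          /* visits left until the next time slice */
    struct TaskDesc *next;   /* circular worker queue */
} TaskDesc;

static uintptr_t *g_context; /* where the IRQ handler saves/restores SP */
static TaskDesc *walk;       /* currently running task */

/* Link n descriptors into a ring and start with the first one. */
static void sched_init(TaskDesc *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        tasks[i].count = tasks[i].interval;
        tasks[i].next = &tasks[(i + 1) % n];
    }
    walk = &tasks[0];
    g_context = &walk->sp;
}

/* Called from the timer IRQ: pick the next task whose counter expires.
 * Every visit decrements a counter, so the loop always terminates and
 * each task is selected at the latest on its 'interval'-th visit. */
static void timer_service(void)
{
    do {
        walk = walk->next;
    } while (--walk->count != 0);

    walk->count = walk->interval;
    g_context = &walk->sp;   /* Context switch */
}
```

With three tasks of interval 1, 2 and 2, this hands out the time slices in the repeating pattern A, B, C, A: the task with the lower interval gets half of the slices, and the other two still run regularly.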

Atomicity

You might notice something odd in the above wave trace around t = 1.5ms (and 1.7ms, likewise): GPIO0 is changed even though the corresponding task 0x2ac0 is not active. Why is that? Let’s have a look at the task code:

int task1(void *p)
{
    while (1) {
        MMR(Reg_GPIO_OUT) ^= 0x02;
    }
    return 0;
}

int task2(void *p)
{
    while (1) {
        MMR(Reg_GPIO_OUT) ^= 0x01;
    }
    return 0;
}

The solution:

The XOR statement to the GPIO register is not atomic. Meaning, it splits up into the following primitive instructions:

  1. Get value from OUT register
  2. XOR with a value
  3. Write back value to OUT register

Let a timer IRQ request come in after step 1 or 2 and assume it switches the context to the other GPIO manipulating task: here we go. task2() is actually getting in between!

If we were to use global variables and tasks depending on single bits, we should keep this big virtual banner around in our coder’s brains:

Make sure your semaphores are atomic!
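
The interleaving can be replayed deterministically on a host machine. In the sketch below a plain variable stands in for the GPIO register, and the hypothetical `preempt_after` parameter selects the point at which the other task gets the CPU for exactly one toggle.

```c
#include <stdint.h>

static uint32_t gpio_out; /* stands in for MMR(Reg_GPIO_OUT) */

/* The other task's complete toggle of GPIO0, run during preemption. */
static void task2_toggle(void)
{
    gpio_out ^= 0x01;
}

/* The toggle of GPIO1, split into its three primitive steps.
 * preempt_after selects after which step (0..3) the other task gets
 * the CPU for exactly one toggle; 0 means before step 1. */
static uint32_t task1_toggle_preempted(uint32_t initial, int preempt_after)
{
    uint32_t tmp;

    gpio_out = initial;
    if (preempt_after == 0) task2_toggle();
    tmp = gpio_out;              /* 1. get value from OUT register */
    if (preempt_after == 1) task2_toggle();
    tmp ^= 0x02;                 /* 2. XOR with a value */
    if (preempt_after == 2) task2_toggle();
    gpio_out = tmp;              /* 3. write back a possibly stale value */
    if (preempt_after == 3) task2_toggle();
    return gpio_out;
}
```

Starting from 0, a preemption before step 1 or after step 3 gives the expected 0x03. A preemption after step 1 or 2 gives 0x02: the write-back in step 3 flips GPIO0 back while the GPIO1 task is running, which is the kind of effect visible in the trace.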

Non-preemptive (user space) multi tasking

Another aspect of concurrent tasks: There might be a process waiting for input data, i.e. sleeping until data is ready and the IRQ handler wakes the corresponding process up. In the meantime, other processes might want to consume the CPU time. The rather dumb round robin scheme doesn’t take this into account; it just cycles through processes and makes sure each gets its slice once in a while.

Non-preemptive multitasking implies that some control is actually given to the currently running task. Loosely speaking: a task switch is induced from user space (not inside an IRQ handler). Let’s summarize what functionality we’d want to have for a user space triggered context switch:

  1. Process might want to sleep for a certain time:
    -> We put the context descriptor into a sleep queue that is worked on inside the timer service handler. Once the timeout is reached, the process is put back first into the worker queue, hence is resumed next.
  2. Process waits for data to arrive / DMA to complete:
    -> The context descriptor is put into a wait queue and resumes upon a specific data IRQ event.
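
The sleep queue from point 1 can be sketched with two singly linked lists. All names here (`Task`, `Queue`, `task_sleep`, `sleep_tick`) are hypothetical and only illustrate the mechanism; the actual ZPUng code may be organized differently.

```c
#include <stddef.h>

typedef struct Task {
    int id;
    unsigned timeout;     /* remaining timer ticks while sleeping */
    struct Task *next;
} Task;

typedef struct {
    Task *head;
} Queue;

static void push_front(Queue *q, Task *t)
{
    t->next = q->head;
    q->head = t;
}

/* User space side: park the calling task for a number of ticks. */
static void task_sleep(Queue *sleepq, Task *t, unsigned ticks)
{
    t->timeout = ticks;
    push_front(sleepq, t);
}

/* Timer service side: age all sleepers once per tick and move the
 * expired ones to the front of the worker queue, so they resume next. */
static void sleep_tick(Queue *sleepq, Queue *runq)
{
    Task **pp = &sleepq->head;

    while (*pp != NULL) {
        Task *t = *pp;

        if (t->timeout > 0)
            t->timeout--;
        if (t->timeout == 0) {
            *pp = t->next;        /* unlink from the sleep queue */
            push_front(runq, t);
        } else {
            pp = &t->next;
        }
    }
}
```

The wait queue from point 2 can reuse the same list handling; the only difference is that the move back into the worker queue is triggered by the specific data IRQ instead of a timeout.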

A similar scheme is run in the Linux kernel. We try to keep this layer way thinner for our simple ZPUng SoC though.

Now, with a lack of atomicity as shown above, things can get in each other’s way. Classical CPU architectures tend to block IRQs to implement atomic behaviour. On the ZPUng, we can avoid this overhead with a trick: jumping into microcode emulation code space (using a reserved instruction), where interrupts are masked by default, but still latched (like inside an IM instruction sequence). This introduces some minor latency for interrupt response; however, this is most of the time not of any concern.

Inside the context switch system call, the stack context is manipulated as inside the timer service handler. Using simple queue techniques we can make sure that no unwanted modification is getting in between non-atomic operations.

The simulation benefit

When developing tailored multitasking configurations without a generic OS overhead, bugs are easily introduced. The classical problem of a race condition with uninitialized variables (which never turns up in a source code review or MISRA compliance check) can cause a lot of headache on µC systems without a fully non-intrusive trace unit. In this case, a full 1:1 simulation comes in extremely handy.

For example, if a task accesses a variable before it was actually initialized or properly defined, the system would recognize the undefined memory content as such and display this event in the simulation.

However, the system as such cannot take the burden of creating proper test cases off you. For example, a multitasking setup may never show a problem in the simulation if the timing of interrupt events is deterministic. If external data availability comes into play, you would have to create a stimulating test bench that makes use of all possible timing intervals with respect to a task switch event to actually prove that the program is robust in all possible scenarios.