Posted on

MaSoCist support of OpenSource synthesis for ECP5

Find the a (currently unstable) development branch here:

https://github.com/hackfin/MaSoCist/tree/ghdlsynth_release

NOTE: In process of upgrading to ghdl v1.0 synthesis. The build system is currently not functional. Use the self extracting script from the Instructions (2) below for a frozen working configuration.

Configurations that work (from those appearing when you run ‘make which’ in the masocist top dir):

  • *-zpu-ghdlsynth: ZPUng setup with ‘beatrix’ configuration.
  • *-pyrv32-ghdlsynth: RISC-V 32 bit basic configuration, proof of concept only, not fully functional as SoC in synthesis (as of now)
    Note: You need to explicitely install the rv32 toolchain (inside the docker container) for this config:
    sudo apt-get install riscv32-binutils riscv32-gcc riscv32-newlib-libc

Instructions

This can be done online in a browser, if you don’t run Linux, see also https://section5.ch/index.php/2019/10/24/risc-v-in-the-loop/.

Note: Since this setup depends on external packages, there is no guarantee it will build smoothly.

  1. Run docker container with exported USB devices (if you want to program the plugged in board right away):
    docker run -it --device=/dev/bus/usb hackfin/masocist:synth
  2. Pull synthesis self-extracting build script:
    wget https://section5.ch/downloads/masocist-synth_sfx.sh && sh masocist-synth_sfx.sh
  3. Pull packages and build:
    make all
  4. Configure platform:
    cd src/vhdl/masocist-opensource;
    make versa_ecp5-zpu-ghdlsynth
  5. Build for synthesis:
    make clean sw syn
  6. When successful, you’ll end up with a $(PLATFORM).SVF file in syn/.
  7. See next paragraph on how to program the board (this procedure will be simplified)

Supported boards

Currently, the following ECP5 based boards are supported/under scrutiny:

Programming with on board FT2232H JTAG interface:

Note that programming will only work when the container is run/started after plugging in the board.

To program the FPGA SRAM on the board with the produced SVF file:

  1. Make sure board is connected via USB and powered up.
  2. You may need to restart the container from above:
    docker start -i <id>
    where ‘id’ is the container id of the above stopped container (retrieve from shell history or with docker ps -a)
  3. Install openocd:
    sudo apt-get install openocd
  4. Make sure board is connected via USB and powered up, then run, inside $(MASOCIST)/syn:
    make download OPENOCD="sudo openocd"
  5. If you see a lot scrolling by, board programming tends to be successful (interface was recognized). You can ignore errors like:
    Error: tdo check error at line 26780
    as they are due to the changed USERCODE.
  6. You should see the segment display on the board ‘spinning’. Also, you can talk to the SoC through the UART at 115200, 8N1, for example using minicom:
    minicom -o -D /dev/ttyUSB1
    The output upon booting of the SoC should be:
Probing flash…
Flash Type: m25p128         
Booting beatrix HW rev: 04 @25 MHZ
------------- test shell -------------                                          
--        SoC for Versa ECP5        --                                          
            arch: ZPUng                                                          
--  (c) 2012-2020  www.section5.ch  --                                          
--     type 'h' for help            --                                          
 

Quirks

Short summary on what works in particular and what does not:

Ram inference

RAM inference is currently problematic and needs to be investigated further. TODO:

  • Make synthesis recognize more variants of RAM with init values
  • Implement dual process true dual port RAM
  • Fully eliminate Verilog RAM wrapper workarounds

FSM optimization

Some FSM seem to optimize away in yosys. Needs to be investigated if it’s a VHDL synthesis or internal Yosys issue.

Vendor primitives

Vendor specific Black Box primitives will no longer have to be wrapped starting with new ghdl-1.0 releases (Container namehackfin/masocist:synth-1.0). However, for the time being you might want to visit:

https://github.com/ghdl/ghdl-yosys-plugin/issues/46

So far tested primitives within MaSoCist:

  • JTAGG: Test access port to ZPUng and pyrv32 for JTAG debugging or automated in circuit emulation hardware tests.
  • EHXPLLL: PLL primitive for clock frequency conversion
  • USRMCLK: Access to SPI master clock on ECP5

Not working

  • System Interrupt Controller (CONFIG_SIC) is currently not supported: #1140
    Fixed. Make sure to install the up to date debian GHDL packages when reusing an old container:
    sudo apt-get update; sudo apt-get install ghdl ghdl-libs
    You also have to rebuild the ghdl.so module in src/ghdl-yosys-plugin:
    make clean all; sudo make install
  • FLiX DSP and JPEG core unsupported (due to true dual port RAM issues)
  • Under scrutiny: pktfifo (CONFIG_MAC) problematic (TDP BRAM issues)
  • Post map simulation does not work with DP16KD primitives, due to missing ‘whitebox’ model, see also #32. You will have to separately use the supplied vendor model from the Diamond libraries.
Posted on 2 Comments

RISC-V in the loop

Continuous integration (‘CI’) for hardware is a logical step to take: Why not do for hardware, what works fine for software?

To keep things short: I’ve decided to stick my proprietary RISC-V approach ‘pyrv32’ into the opensourced MaSoCist testing loop to always have an online reference that can run anywhere without massive software installation dances.

Because there’s still quite a part of the toolchain missing from the OpenSource repo (work in progress), only a stripped down VHDL edition of the pyrv32 is available for testing and playing around.

This is what it currently does, when running ‘make all test’ in the provided Docker environment:

  • Build some tools necessary to build the virtual hardware
  • Compile source code, create a ROM file from it as VHDL
  • Build a virtual System on Chip based on the pyrv32 core
  • Downloads the ‘official’ riscv-tests suite onto the virtual target and runs the tests
  • Optionally, you can also talk to the system via a virtual (UART) console

Instructions

This is the quickest ‘online’ way without installing software. You might need to register yourself a docker account beforehand.

  1. Log in at the docker playground: https://labs.play-with-docker.com
  2. Add a new instance of a virtual machine via the left panel
  3. Run the docker container:
    docker run -it hackfin/masocist
  4. Run the test suite:
    wget section5.ch/downloads/masocist_sfx.sh && sh masocist_sfx.sh && make all test
  5. Likewise, you can run the virtual console demo:
    make clean run-pyrv32
  6. Wait for Boot message and # prompt to appear, then type h for help.
  7. Dump virtual SPI flash:
    s 0 1
  8. Exit minicom terminal by Ctrl-A, then q.

What’s in the box?

  • ghdl, ghdlex: Turns a set of VHDL sources into a simulation executable that exposes signals to the network (The engine for the virtual chip).
  • masocist: A build system for a System on Chip:
    • GNU Make, Linux kconfig
    • Plenty of XML hardware definitions based on netpp.
    • IP core library and plenty of ugly preprocessor hacks
    • Cross compiler packages for ZPU, riscv32 and msp430 architectures
  • gensoc: SoC generator alias IP-XACT’s mean little brother (from another mother…)
  • In-House CPU cores with In Circuit Emulation features (Debug TAPs over JTAG, etc.):
    • ZPUng: pipelined ZPU architecture with optimum code density
    • pyrv32: a rv32ui compatible RISC-V core
  • Third party opensource cores, not fully verified (but running a simple I/O test):
    • neo430: a msp430 compatible architecture in VHDL
    • potato: a RISC-V compatible CPU design

Posted on Leave a comment

Multitasking on the ZPUng

For preemptive or less preemptive multi tasking on the ZPUng architecture, some mechanisms for task switching come in handy. Since the ZPUng is a context saving architecture by design, the context switching is very light: We only need to manipulate the stack pointer and program counter somewhere in the code (plus regard some minor details with global variables used by GCC which we will ignore for the begin).

Every task has its own stack area in the stack memory. Since we have no virtual memory in this architecture, care must be taken that the  local stack areas are not trashing other task’s reserved stack spaces.

Preemptive (time slice) multitasking

In this case, a timer interrupt service routine will always change the context. By design of an interrupt handling hardware, the return address PC is always stored on the stack, so if we manipulate the stack pointer (SP) inside the interrupt handler routine, there’s not much more to do than saving any global context entities on the stack.

For this context switch, we need to store the current SP into the address pointed to by a global context pointer g_context on IRQ handler entry and restore it upon exit. The following assembler macros are required to do that:

; Save current SP context in a global ptr g_context
.macro save_context
pushsp
im g_context
load
store
.endm

; Restore SP context from global ptr g_context
.macro restore_context
im g_context
load
load
popsp
.endm

The timer service routine looks very simple as well:

; -- IRQ handler
	.globl irq_timer_handler
irq_timer_handler:
	; Stores the current context (sp) into the variable pointed to
	; by g_context.
	save_context
	set_stack_isr
	save_memregs

	im timer_service
	call

	restore_memregs
	restore_stack
	; This leaves a possibly new return jump address on
	; TOS, if g_context was modified by the timer_service.
	restore_context
	
	.byte 15
	poppc

So inside the timer_service() function which can be coded in C, we need to only modify the g_context pointer with each tasks stack pointer storage address:

 g_context = &walk->sp; // Context switch

Using a very simple prioritized round robin scheduler and two example tasks toggling GPIO pins, we achieve a simulation result as shown below:

Multitasking trace

The TaskDesc debug output denotes the currently active task ID, 0x2a90 being the main task, where 0x2aa8 toggles GPIO1, 0x2ac0 toggles GPIO0.

Internally, these task descriptors are put into a worker queue and are cycled through using some bit of priority distribution, i.e. tasks with a lower ‘interval’ value get more CPU time, however, a task can never block completely.

Atomicity

You might notice something odd in the above wave trace around t = 1.5ms (and 1.7ms, likewise): GPIO0 is changed even though the corresponding task 0x2ac0 is not active. Why is that? Let’s have a look at the task code:

int task1(void *p)
{
    while (1) {
        MMR(Reg_GPIO_OUT) ^= 0x02;
    }
    return 0;
}

int task2(void *p)
{
    while (1) {
        MMR(Reg_GPIO_OUT) ^= 0x01;
    }
    return 0;
}

The solution:

The XOR statement to the GPIO register is not atomic. Meaning, it splits up into the following primitive instructions:

  1. Get value from OUT register
  2. XOR with a value
  3. Write back value to OUT register

Let a timer IRQ request come in between 1 or 2 and assume it is switching the context to the other GPIO manipulating task – here we go. task2() is actually getting in between!

If we were to use global variables and tasks depending on single bits, we should keep this big virtual banner around in our coder’s brains:

Make sure your semaphores are atomic!

Non-preemptive (user space) multi tasking

Another aspect of concurring tasks: There might be a process waiting for input data, i.e. sleeping until data is ready and the IRQ handler wakes the corresponding process up. In the meantime, other processes might want to consume the CPU time. The rather dumb round robin scheme doesn’t take this into account, it just cycles through processes and makes sure each gets its slice once in a while.

Non-preemptive multi tasking implies, that some control is actually given to the currently running task. Loosely speaking: a task switch is induced from user space (not inside an IRQ handler). Let’s summarize what functionality we’d want to have for a user space triggered context switch:

  1. Process might want to sleep for a certain time:
    -> We put the context descriptor into a sleep queue that is worked on inside the timer service handler. Once the timeout is reached, the process is put back first into the worker queue, hence is resumed next.
  2. Process waits for data to arrive / DMA to complete:
    -> The context descriptor is put into a wait queue and resumes upon a specific data IRQ event.

A similar scheme is run in the Linux kernel. We try to keep this layer way thinner for our simple ZPUng SoC though.

Now, with a lack of atomicity as shown above, things can get in each other’s way. Classical CPU architecture tend to block IRQs to implement atomic behaviour, we can overcome this overhead using the ZPUng with a trick by jumping into microcode emulation code space (using a reserved instruction), where interrupts are by default masked, but still latched (like inside an IM instruction sequence). This introduces some minor latency for interrupt response, however this is most of the time not of any concern.

Inside the context switch system call, the stack context is manipulated as inside the timer service handler. Using simple queue techniques we can make sure that no unwanted modification is getting in between non-atomic operations.

The simulation benefit

When developing tailored multi tasking configurations without a generic OS overhead, bugs are easily introduced. The classical problem of a race condition with uninitialized variables (that never turn up in a source code review or MISRA compliance check) can cause a lot of headache on uC-Systems with no fully non-intrusive trace unit. In this case, a full 1:1 simulation comes in extremely handy.

For example, if a task accesses a variable before it was actually initialized or properly defined, the system would recognize the undefined memory content as such and display this event in the simulation.

However, the system as such can not take the burden of you, to create proper test cases. For example, a multi tasking setup may never show a problem in the simulation if the timing of interrupt events is deterministic. If external data availability comes into play, you would have to create a stimulating test bench that makes use of all possible timing intervals with respect to a task switch event to actually prove that the programm is robust in all possible scenarios.