Overview

MaSoCist – a Linux-like build system for custom CPU designs, available as open source

FPGAs allow a custom processor design with peripherals that is limited largely only by the available chip resources.
Since FPGAs often carry hardware specialized for the application at hand, a soft on-chip CPU core typically covers simple configuration tasks or the interface to the user, i.e. the peripherals. Such a combination is commonly referred to as an ‘SoC’ (System on Chip).

The MaSoCist environment (short for Martin's System on Chip Instancing/Simulation tool)
allows the CPU as well as the required peripherals to be configured much like a Linux kernel, and both the software and the hardware to be defined via device descriptions. Among other things, this makes it possible to maintain the firmware/software, which is generally built with the GNU toolchain, for many different FPGA platforms.

Details

In the FPGA SoC niche, resource-saving stack machines have become well established despite their speed penalty compared to more complex architectures, among them the ZPU architecture designed by Øyvind Harboe, which comes with a complete GNU toolchain (gcc, gdb).
It is considered the most compact 32-bit architecture with gcc support, since it requires few logic elements compared to proprietary vendor solutions and offers one of the best code densities.

The system is organized like a Linux kernel, just for hardware:
Options can be configured according to the target application; if, for example, four UARTs are needed, the corresponding variable CONFIG_NUM_UART is adjusted.
Everything the user needs is then generated automatically:

  • Hardware definition in HDL (VHDL) for synthesis or simulation
  • Header with register definitions for the C programmer (an illustrative example follows right below this list)
  • Register reference documentation (register bits, etc.)
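
As an illustration, a purely hypothetical excerpt of such a generated C header could look as follows; the register names, addresses and bit positions are made up for this example and do not correspond to an actual MaSoCist configuration:

/* Hypothetical excerpt of a generated register header.
 * Names, addresses and bit positions are illustrative only. */
#ifndef SOC_REGISTERS_H
#define SOC_REGISTERS_H

#define UART0_BASE           0x08000000UL  /* base address of the first UART instance */

/* Register offsets relative to UART0_BASE */
#define UART_TXR_OFFSET      0x00          /* transmit data register */
#define UART_RXR_OFFSET      0x04          /* receive data register  */
#define UART_STATUS_OFFSET   0x08          /* status register        */

/* Status register bits */
#define UART_STATUS_TXREADY  (1 << 0)      /* transmitter ready       */
#define UART_STATUS_RXAVAIL  (1 << 1)      /* received data available */

#endif /* SOC_REGISTERS_H */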

As part of a feasibility study, ZPUng, an improved and faster variant of the original ZPU architecture with a three-stage pipeline, was developed; it requires only marginally more logic and, thanks to a matching instruction interpreter, is fully opcode-compatible with the original.

Highlights

Reference applications

All applications require less than 32 kB of on-chip SRAM and normally no external memory.

Another feature is the ability to simulate the program code completely. If, for example, variables or registers are not initialized correctly, this shows up immediately in the simulation. Complete programs, provided they run deterministically, can thus be simulated and regression-tested together with the hardware and corresponding virtual stimuli.
This also makes safety-relevant functions, such as a hardware emergency shutdown on faulty behaviour, easy to verify.

The simulation uses ‘ghdl’, which is open source and highly robust. The presented solution is therefore completely vendor-independent and, depending on the configuration, runs on
platforms from different FPGA vendors.
The ‘MaSoCist’ build system is available under an open source license and as a Docker container (no installation of additional packages required). At the same time, developers are not
forced to disclose their designs when developing new components of their own.

Further information (in English):

https://section5.ch/index.php/documentation/masocist-soc/

The links to the Git repositories will be published here shortly. Thank you for your patience.

IP cores

Overview:

The MaSoCist build system

Pun intended: Applying a Linux kernel configuration approach to a hardware system turned out to be painful. However, it pays off:

  • Cross platform: Simulate and synthesize reusable code for various architectures (Lattice, Xilinx, Altera FPGAs)
  • Configure peripheral interface instances like a Linux kernel
  • Automated generation of address decoders, peripheral instances and the memory map from gensoc, our in-house SoC generator
  • Generate hardware configuration, software drivers and corresponding register documentation in one call: ‘make all’
  • Continuous integration for hardware designs: Run test benches automatically in the cloud for hardware verification

It allows vendor-specific designs and extensions without the need to open-source them; therefore, it is available in two license variants.

A fully usable Docker container is hosted at https://hub.docker.com/r/hackfin/masocist/. Use

docker pull hackfin/masocist

to download this container. If you want to run this on a Windows client, see Windows setup notes.

MaSoCist opensource primer: available for download.

IP blocks and cores

Cottonpicken Engine

The Cottonpicken Engine is an in-house, microcoded digital signal processing (DSP) engine with the following functionality:

  • Bayer pattern decoding into various formats (YUV 4:2:2, YUV 4:2:0, RGB, programmable delays)
  • YUV conversion supports YCrCb, YCoCg as well
  • 3×3, 5×5 filter kernels
  • Specific matrix operations, cascadeable

It is capable of running at full data clock (pixel clock) up to 150 MHz (platform dependent).

The engine is only available as a closed source netlist object as part of a development package.

Image compression

JPEG Encoder IP

Our in-house designed, machine-vision-proven JPEG encoder and streaming solutions are available for use in standard FPGAs at low cost (typically below 35’000 EUR in support costs per project). The supported pixel bit depth is up to 12 bits. The JPEG IP is available in the following standalone variants:

  • L1 monochrome multiplexed pipeline (150 MHz pixel clock on Spartan6)
  • L2 dual pipe simultaneous encoding for high quality YUV422, for example 1280×720@60fps (up to 100 MHz pixel clock)
  • L2H: Higher pixel clock variant (up to 200MHz) available for specific platforms

Fully deployable UDP/Ethernet (RFC 2435) streaming solutions and camera reference designs are available as well. The receiver’s software side is covered with embedded gstreamer OpenSource applications that run on Linux and Windows platforms alike.


Lossless and other compression methods

We have extensive know-how in:

  • High-speed DPCM compression (up to 200 MHz pixel clock, 16 bit, lossless, suitable for medical images, full simulation model available). Can be made software-compatible with lossless JPEG (not JPEG-LS).
  • Multirate adaptive predictors for special imagery (lossy coding / quantization support)
  • Wavelet coding kernels (lossy and lossless)
  • Combined, lossy approaches, non-standard (partial JPEG2000 transcoding possible)
  • Huffman/Golomb-Rice coding IP core, dual-channel (simultaneous throughput of Luma/Chroma channels).

cCAP SoC Reference designs

These System on Chip designs consist of a fully configurable CPU plus standard peripherals and can be customized with special interfaces. The CPU can be programmed with GCC and is accessed via ICE JTAG during development. The underlying build and configuration system and some of the peripherals are available as Open Source, see MaSoCist.


Simulation models

Full simulation models are available for all our IP cores; they can be co-simulated with custom IP or run ‘live’.

See also VirtualChip page.


Asynchronous remote simulation using GHDL

Simulation is daily business for hardware developers; you can’t get things running right by just staring at your VHDL code (unless you’re a real genius).

There are various commercial tools out there that have done the job so far: MentorGraphics, Xilinx isim, and many more, the limit mostly being your wallet.

We’re not cutting-edge chip designers, so we used to work with the stuff that comes for free with the standard FPGA toolchains. However, these tools – as all proprietary stuff – confront you with limitations sooner or later. Moreover, VHDL testbench coding is a very tedious task when you need to cover all test scenarios. Sooner or later you’ll want to interface with some real-world stuff, meaning: the program that is eventually supposed to work with the hardware should likewise be able to talk to the simulation first.

The interfacing

Ok, so we have a program written in, say, C – and a hardware description. How do we marry them? Searching for solutions, it turns out that the OpenSource GHDL simulation package is an ideal candidate for this kind of experiment. It implements the VHPI interface, allowing C routines to be integrated into your VHDL simulation. Its implementation is not too well documented, but hey, being able to read the source code compensates for that, doesn’t it?

So, we can call C routines from our simulation. But that means: The VHDL side is the master, or rather: It implements the main loop. This way, we can’t run an independent and fully asynchronous C procedure from the outside – YET.

Assume we want to build up some kind of communication between a program and an HDL core through a FIFO. We’d set up two FIFOs really: one for simulation to C, and one for the reverse direction. To run asynchronously, we could spawn the C routine into a separate thread, fill/empty the FIFO in a clock-sensitive process from within the simulation (respecting data buffer availability) and run a fully dynamic simulation. Would that work? Turns out it does. Let’s have a look at the routine below.

mainclock:
process -- clock process for clk
begin
    thread_init; -- Initialize external thread
    wait for OFFSET;
    clockloop : loop
        u_ifclk <= '0';
        wait for (PERIOD - (PERIOD * DUTY_CYCLE));
        u_ifclk <= '1';
        wait for (PERIOD * DUTY_CYCLE);
        if finish = '1' then
            print(output, "TERMINATED");
            u_ifclk <= 'X';
            wait;
        end if;
    end loop clockloop;
end process;

Before we actually start the clock, we initialize the external thread which runs our C test routine. Inside another, clock-sensitive process, we call the simulation interface of our little C library, for example the FIFO emptier. Of course we can keep things much simpler and just query a bunch of pins (e.g. button states). We’ll get to the detailed VHPI interfacing later.
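
To make this a bit more concrete, here is a minimal C sketch of what the thread side could look like, using POSIX threads. Apart from thread_init, which is called from the clock process above, all names (run_testcase, the ring-buffer layout, FIFO_SIZE) are placeholders and not the actual library code:

/* Hypothetical C counterpart: the test program runs in its own thread and
 * exchanges bytes with the simulation through two small ring buffers,
 * one per direction. Names and buffer layout are placeholders only. */
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define FIFO_SIZE 256

typedef struct {
    uint8_t data[FIFO_SIZE];
    volatile unsigned head, tail;      /* single producer, single consumer */
} fifo_t;

static fifo_t to_sim, from_sim;        /* C -> simulation, simulation -> C */
static pthread_t g_thread;

static int fifo_put(fifo_t *f, uint8_t c)
{
    unsigned next = (f->head + 1) % FIFO_SIZE;
    if (next == f->tail)
        return -1;                     /* full, caller has to retry */
    f->data[f->head] = c;
    f->head = next;
    return 0;
}

static int fifo_get(fifo_t *f, uint8_t *c)
{
    if (f->tail == f->head)
        return -1;                     /* empty */
    *c = f->data[f->tail];
    f->tail = (f->tail + 1) % FIFO_SIZE;
    return 0;
}

/* The asynchronous test routine: push a request, then wait for the reply.
 * A clock-sensitive process on the VHDL side drains to_sim and fills
 * from_sim once per clock cycle, respecting buffer availability. */
static void *run_testcase(void *arg)
{
    (void) arg;
    const uint8_t request[] = { 0x01, 0x02, 0x03 };
    uint8_t reply;

    for (size_t i = 0; i < sizeof(request); i++)
        while (fifo_put(&to_sim, request[i]) < 0)
            ;                          /* busy-wait until there is room */

    while (fifo_get(&from_sim, &reply) < 0)
        ;                              /* busy-wait for the answer */
    /* ... verify the reply, then signal the simulation to finish ... */
    return NULL;
}

/* Called once from the VHDL main clock process via the VHPI interface. */
void thread_init(void)
{
    pthread_create(&g_thread, NULL, run_testcase, NULL);
}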

Going “virtual”

The previous method still has some drawbacks: we have to write a specific thread for each of our asynchronous, functionality-specific C events. This is not too nice. Why can’t we just use a typical program that talks a UART protocol, for example, and reroute this into our simulation?

Well, you expected that: yes we can. Turns out there is another nice application for our netpp library (which we have used a lot for remote stuff). Inside the thread, we just fire up a netpp server listening on a TCP port and connect to it from our program. We can use a very simple server for a raw protocol, or use the netpp protocol to remote-control various simulation properties (pins, timing, stop conditions, etc).

This way, we can interactively communicate with our simulation through the FIFO, for example from a Python script:

import time
import netpp
dev = netpp.connect("localhost")
r = dev.sync()
r.EnablePin.set(1) # arm input in the simulation
r.Fifo.set(QUERY_FRAME) # Send query frame command sequence
frame = r.Fifo.get() # Fetch frame
hexdump(frame) # Dump frame data

Timing considerations

When running this for hours, you might realize that your simulation setup takes a lot of CPU time. Or when you’re plotting wave data, you might end up with huge wave files containing a lot of “idle data”. Why is that? Remember that your simulation does not run in ‘real time’. It simulates your entire clocked architecture just as fast as it can. If you have a fast machine and a not too complex design, chances are that the simulation actually has a shorter runtime than its real-time duration.

So for our clock main loop, we’d very likely have to insert some wait states and put the main clock process to sleep for a few µs. Well, now we’d like to introduce the resource which has taught us quite a bit about how to play with the VHPI interface: Yann Guidon’s GHDL extensions. Have a look at the GHDL/clk/ code. Taking this one step further, we enhance our netpp server with the Clock.Start and Clock.Stop properties so we can halt the simulation when we are idling.

Dirty little VHPI details

Not many words have been lost so far about exactly how this is done. Yann’s examples show how to pass integers around, but not std_logic_vectors. This is actually very simple: they are just character arrays. However, as we know, a std_logic does not only have the states 0 and 1; there are a few more (X, U, Z, ..).

Let’s have a look at our FIFO interfacing code. We have prototyped the routine sim_fifo_io() in VHDL as follows:

procedure fifo_io( din: inout fdata; flags : inout flag_t );
    attribute foreign of fifo_io : procedure is "VHPIDIRECT sim_fifo_io";

The attribute statement registers the routine as externally callable through the VHPI interface. On the C side, our interface looks like:

void sim_fifo_io(char *in, char *out, char *flag);

The char arrays simply have the length of the std_logic_vector from the VHDL definition. But there is one important thing: the LSB/MSB order is not respected in the indexing order of the array. So, if you have a definition for flag_t like ‘subtype flag_t is unsigned(3 downto 0)’, flag(3) (VHDL) will correspond to flag[0] in C. If you address single elements, it might be wise to reorder them or not to use a std_logic_vector. See also Yann’s ’bouton’ example.
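
To make the multi-valued encoding and the index reversal a bit more tangible, here is a small C sketch of two conversion helpers. The function names, the fixed width and the assumption that GHDL hands over each std_logic element as one byte holding the position of its literal in the std_ulogic enumeration are ours; this is not code from the library above:

/* Hypothetical helpers for the char-array convention described above.
 * Assumption: GHDL passes each std_logic element as one byte holding the
 * position of its literal in the std_ulogic enumeration:
 * 'U','X','0','1','Z','W','L','H','-'  ->  0..8 */
#include <stdint.h>

enum { SL_U, SL_X, SL_0, SL_1, SL_Z, SL_W, SL_L, SL_H, SL_DC };

#define FLAG_BITS 4   /* matches: subtype flag_t is unsigned(3 downto 0) */

/* Convert a flag vector received from VHDL into a plain integer.
 * Note the reversed indexing: flag[0] on the C side is flag(3) in VHDL,
 * i.e. the MSB comes first in the C array. */
static uint8_t flags_to_uint(const char *flag)
{
    uint8_t val = 0;
    for (int i = 0; i < FLAG_BITS; i++) {
        val <<= 1;
        if (flag[i] == SL_1)
            val |= 1;
    }
    return val;
}

/* Write an integer back into the flag vector, again MSB first. */
static void uint_to_flags(uint8_t val, char *flag)
{
    for (int i = 0; i < FLAG_BITS; i++)
        flag[i] = (val & (1u << (FLAG_BITS - 1 - i))) ? SL_1 : SL_0;
}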

Conclusion and more ideas

So with this enhancement we are able to:

  • Make a C program talk to a simulation – remotely!
  • Allow the same C program to run on real hardware without modifications
  • Trigger certain events (those nasty ones that occur in one out of 10000) and monitor them selectively
  • Script our entire simulation using Python

Well, there’s certainly more to it. A talented Java hacker could probably design a virtual FPGA board with buttons and 7-segment displays without much effort. A good starting point might be the OpenSource goJTAG application (for which we hacked an experimental virtual JTAG adapter that speaks to our simulation over netpp). Interested? Let us know!

Update: More of a stepwise approach is shown at http://www.fpgarelated.com/showarticle/20.php

Another update: Find my presentation and paper for the Embedded World 2012 trade show here: