
Setting up virtual CPU environment on Windows

This is a short howto for getting a Linux-specific virtual chip environment running on a Windows OS:

  1. Download and install the Docker Toolbox
  2. Download and install the Xming X-Server
  3. Run ‘Docker Quickstart Terminal’, normally installed on your desktop. Be patient; the environment takes some time to start up
  4. Download MaSoCist docker container:
    docker pull hackfin/masocist
  5. Optional: (GTKwave support): Prepare the Xming server by running XLaunch and configuring as follows using the Wizard:
    • Multiple Windows
    • Start no client
    • No access control selected (Warning, this could cause security issues, depending on your system config)
  6. Start the container using the script below
    docker run -ti --rm -u masocist -w /home/masocist -e DISPLAY=192.168.99.1:0 \
    -v /tmp/.X11-unix:/tmp/.X11-unix hackfin/masocist bash

You might save this script to a file like run.sh and start it next time from the Docker Quickstart terminal:

. run.sh

Depending on the MaSoCist release you’ve got, there are different install methods:

Opensource github

See README.md for the most up-to-date build notes. GTKwave is not installed by default. Run

sudo apt-get update && sudo apt-get install gtkwave

to install it.

When you have X support enabled, you can, after building a virtual board, run

make -C sim run

from the MaSoCist directory to start the interactive wave display simulation. If the board uses the virtual UART, you can now also connect to /tmp/virtualcom using minicom.

Custom ‘vendor’ license

You should have received a vendor-<YOUR_ORGANIZATION>.pdf with detailed setup notes; see its ‘Quick start’ section.

A few more notes

  • All changes you make to this Docker container are lost on exit. If this is not desired, remove the ‘--rm’ option and use the ‘docker ps -a’ and ‘docker start -i <container_id>’ commands to re-enter your container. Consult the Docker documentation for details.
  • Closing the GTKwave window will not stop the simulation!
  • Ctrl-C on the console stops the simulation, but does not close the wave window
  • The UART output of the virtual SoC is printed on the console (“Hello!”). Virtual UART input is not supported on this system, but can be implemented using tools supporting virtual COM ports and Windows pipes.
  • Once you have the Docker container imported, you can alternatively use the Kitematic GUI and apply the above options, in particular:
    • -v: Volume mounts to /tmp/.X11-unix
    • -e: DISPLAY environment setting

SoC design virtualization

Live simulation examples

The video below demonstrates a live running CPU simulation which can be fully debugged through gdb (at a somewhat slower speed). Code can also be downloaded into the running simulation without the need to recompile.

Older videos

These are legacy flash animations which may no longer be supported by your browser. They demonstrate various trace scenarios of cycle accurate virtual SoC debugging.

Virtualisation benefits

Being able to run a full cycle accurate CPU simulation is helpful in various situations:

  • Verification of algorithms
  • Hardware verification: Make sure an IP core is functioning properly and is not prone to timing issues
  • Firmware verification: hard verification of proper access (access to uninitialized registers or variables is found immediately)
  • Safety relevant applications: Full proof of correct functionality of a program main loop

Virtual interfaces and entities

Virtual entities make it possible to loop external data or events into a fully cycle- and timing-accurate HDL simulation. For example, interaction with user-space software can take place through a virtual UART console: a terminal program can talk to a program running on a simulated CPU with a 16550 UART.
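The PTY mechanism behind such a virtual UART can be sketched in a few lines of Python. This is only an illustration of the concept, not the ghdlex implementation (there, the simulation side drives the master end from VHDL/C):

```python
import os

# Create a pseudo-terminal pair: the simulation would drive the master
# end, while a terminal program (e.g. minicom) attaches to the slave
# device node -- the "virtual COM port".
master_fd, slave_fd = os.openpty()
print("terminal attaches to:", os.ttyname(slave_fd))

os.write(master_fd, b"Hello!\n")   # simulated UART transmit
received = os.read(slave_fd, 7)    # terminal side receives the bytes
print(received)
```

This is exactly why a stock terminal program can talk to the simulated SoC: from the terminal's point of view, the PTY is indistinguishable from a serial device.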

For all these virtualisation approaches, the software has to account for very different timing, because the simulation can run slower by a factor of up to 1000 when simulating complex SoC environments. However, using mimicked timing models, it turns out that the software generally becomes more stable and race conditions are avoided effectively.

So far, the following simple models are covered by our co-simulation library:

  • Virtual UART/PTY
  • FIFO: Cypress FX2 model, FT2232H model
  • Packet FIFO: Ethernet MAC receive/transmit (without Phy simulation)
  • Virtual Bus: Access wishbone components by address, or registers directly from Python, like:
    fpga.SPI_CONFIG.ENABLE.set(True)
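To illustrate how such an attribute-style register interface can be modeled, here is a minimal pure-Python mock. The names SPI_CONFIG and ENABLE follow the example above; the real netpp/ghdlex proxy issues bus transactions toward the simulation instead of storing values locally:

```python
class Bit:
    """A single named bit inside a register (mock, no simulation attached)."""
    def __init__(self):
        self.value = False
    def set(self, v):
        # A real virtual-bus proxy would issue a wishbone write here.
        self.value = bool(v)
    def get(self):
        # ...and a wishbone read here.
        return self.value

class Register:
    """A register containing named bit fields."""
    def __init__(self, *bitnames):
        for name in bitnames:
            setattr(self, name, Bit())

class Fpga:
    """Top-level device proxy, mirroring the register map."""
    def __init__(self):
        self.SPI_CONFIG = Register("ENABLE")

fpga = Fpga()
fpga.SPI_CONFIG.ENABLE.set(True)
print(fpga.SPI_CONFIG.ENABLE.get())
```

The point of the attribute chain is that the Python side mirrors the register map of the design, so test scripts read like the datasheet.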

The virtualization library is released in an opensource version at: github:hackfin/ghdlex

More complex model concepts:

RAM

For fast simulation, a dual-port RAM model was implemented for co-simulation that allows back-door access via the network. That way, new RAM content can be updated in a fraction of a second for regression tests via simple Python scripting.
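Conceptually, such a back-door update is just an address plus payload words sent over the network. A hedged sketch follows; the message layout here is hypothetical and does not reflect the actual ghdlex/netpp wire protocol:

```python
import struct

def pack_backdoor_write(addr, words):
    """Pack a hypothetical back-door write message: a 32-bit big-endian
    start address followed by the 32-bit payload words."""
    payload = b"".join(struct.pack(">I", w) for w in words)
    return struct.pack(">I", addr) + payload

# Two words destined for RAM offset 0; the message would then be sent
# to the simulation's network port.
msg = pack_backdoor_write(0x0000, [0xDEADBEEF, 0xCAFEBABE])
```

Because the write bypasses the simulated bus, it completes in host time rather than simulation time, which is what makes sub-second RAM reloads possible.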

Virtual optical sensor

For camera algorithm verification with simulated image data (such as from a PNG image or YUV video), we have developed a customizable virtual sensor model that can be fed with arbitrary image data. Its video timing (blanking times, etc.) can be configured freely; image data is fed in through a FIFO. A backward FIFO channel can in turn receive processed data, such as from an edge filter. This way, complex FPGA/DSP hybrid systems can be fully emulated and algorithms verified by automated regression tests.
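The timing idea can be sketched as a generator that emits active pixels followed by configurable blanking. This is a deliberately simplified model; the real sensor model also covers vertical blanking, sync codes, and so on:

```python
def sensor_lines(width, height, hblank, pixels):
    """Yield one video frame line by line: `width` active pixels,
    then `hblank` blank samples (None) as horizontal blanking."""
    it = iter(pixels)
    for _ in range(height):
        active = [next(it) for _ in range(width)]
        yield active + [None] * hblank

# Feed a 4x2 frame with a 2-sample horizontal blanking interval:
frame = list(sensor_lines(4, 2, 2, range(8)))
```

In the HDL simulation, the same stream would be clocked out sample by sample, so the design under test sees realistic line and blanking timing.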

Display

As a direct visual control or front end for virtualized LCD devices, the netpp display server allows YUV, RGB or indexed, hardware-processed images to be posted to the PC screen from within the simulation. For example, decoded YUV-format video can be displayed. When running as a full, cycle accurate HDL simulation, this is very slow. However, functional simulation, for example through Python, has turned out to be quite effective.

See also the old announcement post for an example: [ Link ]

Co-Simulation ecosystems

Sometimes it is necessary to link different development branches, like hardware and software: make ends meet (and meet the deadlines, as well). Or, you might want to pipe processed data from MATLAB into your simulation and compare it with the on-chip-processed result for thorough coverage or numerical stability. This is where you run into the typical problem:

  • Simulation runs on host A (Linux workstation)
  • Your LabVIEW client runs on the student’s Windows PC in the other building
  • The sensors are on the roof

When you order an IP core design, you might want to have the same reference (test environment) as we do. It is based on a Docker container, so you do not run into local dependency issues. Plus, it allows continuous integration of software and hardware designs.

HDL playground

The HDL playground is a Jupyter Notebook based environment that is launched in a browser via the link below:

https://github.com/hackfin/hdlplayground

Features:

  • Co-Simulation of Python stimuli and your own data against Verilog (optional VHDL) modules
  • yosys, nextpnr and Lattice ECP5 specific tools for synthesis, mapping and PnR in the cloud
  • Auto-testing of Notebooks
  • No installation of local software (other than the Docker service)

Virtual Hardware JPEG encoding

For a while I had been messing with various DSP architectures while playing with FPGA technology. So far, both worlds were kinda separated in their own sandboxes: the FPGA got to do the really stupid interfacing and simple transforms while the DSP did the real complex encoding.
Now it’s time for a leap: why not move some often-used DSP primitives smoothly into the FPGA?
What kept me from doing it were the tools. Most time is actually not burned on the concept or implementation, but on the debugging. Since the tools that help with debugging would cost a fortune, it was simply more economical to put a powerful chip next to the FPGA, even if the FPGA had the resources to run a decent number of soft cores in parallel.
Well, it turns out that over the development of the past years, our own tools have come to do the job. The ghdl extensions described here allow verifying processing chains with real data just by replacing a hard VHDL FIFO module for data input with its virtual counterpart. This VirtualFIFO simply runs in the simulation and can be fed with data from outside (via the network).

One good example of a complex processing chain is a JPEG encoder, which is typically implemented in software, i.e. as a serial procedure running on a CPU. If you wanted to migrate parts of the encoding, such as the computationally expensive DCT, into truly parallel hardware, the classical way would be to produce a number of offline data sets to run through the testbench (simulation) to verify the correct behaviour of your design.
But you could also loop the simulation into your software, in the sense of a “software in the loop” co-simulation.
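Either way, a bit-exact software golden model of the transform is needed for comparison. A minimal pure-Python DCT-II can serve that role; this is a sketch of the reference side only, not the encoder's actual (typically fixed-point) code:

```python
import math

def dct_1d(x):
    """Unscaled 1-D DCT-II, the transform at the heart of JPEG
    (applied row- and column-wise on 8x8 blocks)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k)
                for n in range(N))
            for k in range(N)]

# Sanity check: a constant input block puts all its energy into the
# DC coefficient; every AC coefficient is (numerically) zero.
coeffs = dct_1d([1.0] * 8)
```

Running the same blocks through this model and through the simulated hardware lets a regression script flag any mismatch down to the coefficient.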

Hacking the JPEG encoder

So, to replace the DCT of a JPEG routine with our DCT hardware simulation, we have to loop in some piece of code that implements our virtual DCT. We call it the “remote DCT”, because the simulation could run on another machine. To talk to the remote DCT, we use the VirtualFIFO, which has a netpp interface, meaning it can be accessed over the network.

The following Python script demonstrates how the remote DCT is accessed:

import netpp
d = netpp.connect("TCP:localhost")
r = d.sync()
buf = r.Fifo.Buffer
b = get_next_buffer()  # Example function returning a buffer
buf.set(b)             # Send buffer b to the simulation
rb = buf.get()         # Fetch the processed return buffer

The C version of this looks a bit more complicated, but it basically makes API calls to netpp to transfer the data and wait for the result. This is simply the concept of swapping out local functions for remote procedure calls that are answered by the VHDL simulation.

This way, we run our JPEG encoder with the black-and-white test PNG shown below:

Original PNG image

Since the DCT is running in a hardware simulation and not in software, it is very slow: it can take minutes to encode an image. However, what we do get from this simulation is a cycle accurate waveform of what is happening under the hood.

DCT encoder waveforms

After many hours of debugging and fixing some data flow issues, our virtual hardware does what it should.

Encoded JPEG image

And this is the encoded result. By swapping the remote DCT routine back for the built-in routine, we get a reference image which we can subtract from the virtually hardware-encoded image. If the result is all zeros, we know that both methods produce identical results.
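The subtraction check itself is tiny; a sketch assuming both results are available as equally sized pixel arrays:

```python
def identical(ref, hw):
    """Pixel-wise difference of reference vs. hardware-encoded result;
    an all-zero difference means both paths match exactly."""
    assert len(ref) == len(hw)
    return all(r - h == 0 for r, h in zip(ref, hw))
```

A single differing pixel fails the check, which is the kind of hard pass/fail criterion a regression test wants.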

Synthesis

Now for the interesting part: how many hardware resources are allocated? See the synthesis results below (this is for a Xilinx Spartan-3E 250).

Typically, the timing results from synthesis can’t be trusted: once place & route has completed and fitted other logic as well, the maximum clock will decrease significantly. Since this design uses up all DSP slices on this FPGA, we’ll only treat it as an intermediate station and move on to more gates and DSP power.

 Number of Slices:                      872  out of   2448    35%  
 Number of Slice Flip Flops:            623  out of   4896    12%  
 Number of 4 input LUTs:               1668  out of   4896    34%  
 Number of IOs:                          39
 Number of bonded IOBs:                  38  out of     66    57%  
 Number of BRAMs:                         6  out of     12    50%  
 Number of MULT18X18SIOs:                12  out of     12   100%     

 Timing constraint: Default period analysis for Clock 'clk'
 Clock period: 6.710ns (frequency: 149.032MHz)