Posted on Leave a comment

Virtual Hardware JPEG encoding

For a while I had been messing with various DSP architectures while playing with FPGA technology. So far both worlds were kinda separated in its own sandbox, the FPGA got to do really stupid interfacing and simple transforms while the DSP was doing the real complex encoding.
Now it’s time for a leap: Why not move some often used DSP primitives smoothly into the FPGA?
What kept me from doing it, were the tools. Most time is actually not burned on the concept or implementation, but on the debugging. Since the tools helping to debug would cost a fortune, it was just more economic to put a powerful chip next to the FPGA, even if it would have had the resources to run a decent number of soft cores in parallel.
Well, turns out somehow that in the development process of the past years, own tools can do. The ghdl extensions described here, allow to verify processing chains with real data just by replacing a hard VHDL FIFO module for data input by its virtual counterpart. This VirtualFIFO just runs in the simulation and can be fed from outside (via network) with data.

One good example of a complex processing chain is a JPEG encoder, which is typically implemented in software, i.e. as a serial procedure running on a CPU. If you’d want to migrate parts of the encoding such as the computationally expensive DCT into real parallel working hardware, the classical way could be to produce some number of offline data sets to run through the testbench (simulation) to verify the correct behaviour of your design.
But you could also just loop in the simulation into your software, in the sense of a “Software in the Loop” co-simulation.

Hacking the JPEG encoder

So, to replace the DCT of a JPEG routine using our DCT hardware simulation, we have to loop in some piece of code that implements our virtual DCT. We call it “remote DCT”, because the simulation could run on another machine. To speak to the remote DCT, we use the VirtualFIFO, which has a netpp interface, meaning, it can be accessed from the network.

The following python script demonstrates, how the remote DCT is accessed:

import netpp
d = netpp.connect("TCP:localhost")
r = d.sync()
buf = r.Fifo.Buffer
b = get_next_buffer()  # Example function returning a buffer
buf.set(b) # Send buffer b
rb = buf.get() # Get return buffer

The C version of this looks a bit more complicated, but basically makes API calls to netpp to transfer the data and wait for the return. This is simply the concept of swapping out local functions against remote procedure calls that are answered by the VHDL simulation.

This way, we run our JPEG encoder with a black and white test PNG shown below:

Original PNG image
Original PNG image

Since the DCT is running in a hardware simulation and not in software, it is very slow. It can take minutes, to encode an image. However, what we can get from this simulation, is a cycle accurate waveform of things that are happening under the hood.

DCT encoder waveforms
DCT encoder waveforms

After many hours of debugging and fixing some data flow issues, our virtual hardware does what it should.

Encoded JPEG image
Encoded JPEG image

And this is the encoded result. By swapping out the remoteDCT routine again by the built in routine, we get a reference image which we can subtract from the virtually hardware encoded image. If the result is all zeros, we know that both methods produce identical results.

Synthesis

Now the interesting part: How much hardware resources are allocated? See the synthesis results output below (This is for a Xilinx Spartan3E 250).

Typically, the timing results from synthesis can’t be trusted, when place & route has completed and fitted other logic as well, the maximum clock will significantly decrease. Since this design is using up all DSP slices on this FPGA, we’ll only see this as an intermediate station and move on to more gates and DSP power.

 Number of Slices:                      872  out of   2448    35%  
 Number of Slice Flip Flops:            623  out of   4896    12%  
 Number of 4 input LUTs:               1668  out of   4896    34%  
 Number of IOs:                          39
 Number of bonded IOBs:                  38  out of     66    57%  
 Number of BRAMs:                         6  out of     12    50%  
 Number of MULT18X18SIOs:                12  out of     12   100%     

 Timing constraint: Default period analysis for Clock 'clk'
 Clock period: 6.710ns (frequency: 149.032MHz)
Posted on

VHDL simulation remote display

In the previous article we described the netpp server enhancements to our GHDL based simulation to feed data to a FIFO. So we had condemned that extra thread being a slave listening to commands. But what if the GHDL-Simulation would be a netpp master?

Inspired by Yann Guidons framebuffer example at http://ygdes.com/GHDL/, the thought came up: why not hack a netpp client and use the existing ‘display’ device server (which we use for our intelligent camera remote display). What performance would it have?

For this, we extended our libnetpp.vhdl bindings by a few functions:

  • function device_open(id: string) return netpphandle_t — Opens a connection to a netpp (remote) device
  • function initfb(dev: netpphandle_t; x: integer; y: integer; buftype: integer) return framebuffer_t — Initialize virtual frame buffer from device
  • procedure setfb(fb: framebuffer_t; data: pixarray_t) — transfer data to framebuffer

Not to forget the cleanup functions device_close() and releasefb().

With this little functionality, we are running a little YUV color coded display as shown below:

Framebuffer output

It is very slow as  we keep repeatedly filling the YUV (more precisely UYVY interleaved data) using a clock sensitive process. We thus get a framerate of about 1fps – but it works!

The ghdlex source code is now hosted publicly at:

https://github.com/hackfin/ghdlex

Posted on

Asynchronous remote simulation using GHDL

Simulation is daily business for hardware developers, you can’t get things running right by just staring at your VHDL code (unless you’re a real genius).

There are various commercial tools out there which did the job so far: MentorGraphics, Xilinx isim, and many more, the limit mostly being your wallet.

We’re not cutting edge chip designers, so we used to work with the stuff that comes for free with the standard FPGA toolchains. However, these tools – as all proprietary stuff – confront you with limitations sooner or later. Moreover, VHDL testbench coding is a very tedious task when you need to cover all test scenarios. Sooner or later you’ll want to interface with some real world stuff, means: the program that should work with the hardware should first and likewise be able to talk to the simulation.

The interfacing

Ok, so we have a program written in say, C – and a hardware description. How do we marry them? Searching for solutions, it turns out that the OpenSource GHDL simulation package is an ideal candidate for these kind of experiments. It implements the VHPI-Interface, allowing to integrate C routines into your VHDL simulation. Its implementation is not too well documented, but hey, being able to read the source code compensates that, doesn’t it?

So, we can call C routines from our simulation. But that means: The VHDL side is the master, or rather: It implements the main loop. This way, we can’t run an independent and fully asynchronous C procedure from the outside – YET.

Assume we want to build up some kind of communication between a program and a HDL core through a FIFO. We’d set up two FIFOs really..one for Simulation to C, and one for the reverse direction. To run asynchronously, we could spawn the C routine into a separate thread, fill/empty the FIFO in a clock sensitive process from within the simulation (respecting data buffer availability) and run a fully dynamic simulation. Would that work? Turns out it does. Let’s have a look at the routine below.

mainclock:
process -- clock process for clk
begin
    thread_init; -- Initialize external thread
    wait for OFFSET;
    clockloop : loop
        u_ifclk <= '0';
        wait for (PERIOD - (PERIOD * DUTY_CYCLE));
        u_ifclk <= '1';
        wait for (PERIOD * DUTY_CYCLE);
        if finish = '1' then
            print(output, "TERMINATED");
            u_ifclk <= 'X';
            wait;
        end if;
    end loop clockloop;
end process;

Before we actually start the clock, we initialize the external thread which runs our C test routine. Inside another, clock sensitive process, we call the simulation interface of our little C library, for example, the FIFO emptier. Of course we can keep things much simpler and just query a bunch of pins (e.g. button states). We’ll get to the detailed VHPI interfacing later.

Going “virtual”

The previous method still has some drawbacks: We have to write a specific thread for all our asynchronous, functionality specific C events. This is not too nice. Why can’t we just use a typical program that talks a UART protocol, for example, and reroute this into our simulation?

Well, you expected that: yes we can. Turns out there is another nice application for our netpp library (which we have used a lot for remote stuff). Inside the thread, we just fire up a netpp server listening on a TCP port and connect to it from our program. We can use a very simple server for a raw protocol, or use the netpp protocol to remote-control various simulation properties (pins, timing, stop conditions, etc).

This way, we are interactively communicating with our simulation for example through a python script with the FIFO:

import time
import netpp dev = netpp.connect("localhost")
r = dev.sync()
r.EnablePin.set(1) # arm input in the simulation
r.Fifo.set(QUERY_FRAME) # Send query frame command sequence
frame = r.Fifo.get() # Fetch frame
hexdump(frame) # Dump frame data

Timing considerations

When running this for hours, you might realize that your simulation setup takes a lot of CPU time. Or when you’re plotting wave data, you might end up with huge wave files with a lot of “idle data”. Why is that? Remember that your simulation does not run ‘real time’. It simulates your entire clocked architecture just as fast as it can. If you have a fast machine and a not too complex design, chances are that the simulation actually has a shorter runtime that its actual realtime duration.

So for our clock main loop, we’d very likely have to insert some wait states and put the main clock process to sleep for a few µs. Well, now we’d like to introduce the resource which has taught us quite a bit on how to play with the VHPI interface: Yann Guidons GHDL extensions. Have a look at the GHDL/clk/ code. Taking this one step further, we enhance our netpp server with the Clock.Start and Clock.Stop properties so we can halt the simulation if we are idling.

Dirty little VHPI details

Little words have been lost about exactly how it’s done. Yanns examples show how to pass integers around, but not std_logic_vectors. However, this is very simple: they are just character arrays. However, as we know, a std_logic has not just 0 and 1 states, there are some more (X, U, Z, ..)

Let’s have a look at our FIFO interfacing code. We have prototyped the routine sim_fifo_io() in VHDL as follows:

procedure fifo_io( din: inout fdata; flags : inout flag_t );
    attribute foreign of fifo_io : procedure is "VHPIDIRECT sim_fifo_io";

The attribute statement registers the routine as externally callable through the VHPI interface. On the C side, our interface looks like:

void sim_fifo_io(char *in, char *out, char *flag);

The char arrays just have the length of the std_logic_vector from the VHDL definition. But there is one important thing: the LSB/MSB order is not respected in the indexing order of the array. So, if you have a definition for flag_t like ‘subtype flag_t is unsigned(3 downto 0)’, flag(3) (VHDL) will correspond to flag[0] in C. If you address single elements, it might be wise to reorder them or not use a std_logic_vector. See also Yanns ’bouton’ example.

Conclusion and more ideas

So with this enhancement we are able to:

  • Make a C program talk to a simulation – remotely!
  • Allow the same C program to run on real hardware without modifications
  • Trigger certain events (those nasty ones that occur in one out of 10000) and monitor them selectively
  • Script our entire simulation using Python

Well, there’s certainly more to it. A talented JAVA hacker could probably design a virtual FPGA board with buttons and 7 segment displays without much effort. A good starting point might be the OpenSource goJTAG application (for which we hacked an experimental virtual JTAG adapter that speaks to our simulation over netpp). Interested? Let us know!

Update: More of a stepwise approach is shown at http://www.fpgarelated.com/showarticle/20.php

Another update: Find my presentation and paper for the Embedded World 2012 trade show here: