Posted on

In Circuit Emulation for MIPS soft CPU

Many developers got used to this: Plug in a JTAG connector to your embedded target and debug the misbehaviour down to the hardware. No longer a luxury, isn’t it?
When you move on to a soft core from for a simple SoC solution on FPGA, it may be. Most of them don’t have a TAP – a Test Access Port – like other off the shelf ARM cores.

We had pimped a ZPU soft core with in circuit emulation features a while ago (In Circuit Emulation for the ZPU). This enabled us to connect to a hardware ZPU core with the GNU debugger like we would attach to for example a Blackfin, msp430, you name it.

Coming across various MIPS compatible IP cores without debug facilities, the urge came up to create a standard test access port that would work on this architecture as well. There is an existing Debug standard called EJTAG, but it turned out way more complex to implement than simple In Circuit Emulation (ICE) using the above TAP from the ZPU.

So we would like to do the same as with the ZPU:

  • Compile programs with the <arch>-elf-gcc
  • Download programs into the FPGA hardware using GDB
  • set breakpoints, inspect and manipulate values and registers, etc.
  • Run a little „chip scope“

Would that work without killer efforts?

Yes, it would. But we’re not releasing the white rabbit yet. Stay tuned for embedded world 2013 in Nuremberg in February. We (Rene Doß from Dossmatik GmbH Germany and myself) are going to show some interesting stuff that will boost your SoC development big time by using known architectures and debug tools.

His MIPS compatible Mais CPU is on the way to become stable, and turns out to be a quick bastard while being stingy on resources.

For a sneak peek, here’s some candy from the synthesizer (including I/O for the LCD as shown below):

Selected Device : 3s250evq100-5

Number of Slices:                     1172  out of   2448    47%
Number of Slice Flip Flops:            904  out of   4896    18%
Number of 4 input LUTs:               2230  out of   4896    45%
Number of IOs:                          15
Number of bonded IOBs:                  15  out of     66    22%
Number of BRAMs:                         2  out of     12    16%
Number of GCLKs:                         3  out of     24    12%

And for the timing:

Minimum period: 13.498ns (Maximum Frequency: 74.084MHz)
Mais on Papilio with TFT wing

Update: Here’s a link to the presentation (as PDF) given on the embedded world 2013 in Nuremberg:


Posted on

A full baseline pipelined JPEG encoder in VHDL

As described in a previous post, a framework was built to loop in a DCT hardware design into a software JPEG encoder for verification (and acceleration) purposes.

Turns out this strategy speeds up development a lot, and that the remaining modules on the way to a full hardware based and pipelined JPEG encoding solution weren’t a big job. Actually, I was expecting that this enhanced encoder would no longer fit into a small Spartan3E 250k. Wrong!

Have a look:

Device utilization summary:

Selected Device : 3s250evq100-5 

 Number of Slices:                     1567  out of   2448    64%  
 Number of Slice Flip Flops:           1063  out of   4896    21%  
 Number of 4 input LUTs:               2915  out of   4896    59%  
    Number used as logic:              2900
    Number used as Shift registers:      15
 Number of IOs:                          49
 Number of bonded IOBs:                  47  out of     66    71%  
 Number of BRAMs:                        12  out of     12   100%  
 Number of MULT18X18SIOs:                11  out of     12    91%  
 Number of GCLKs:                         2  out of     24     8%

JPEG encoder latency and timing

From the XST summary, we get:

Timing Summary:
Speed Grade: -5

   Minimum period: 13.449ns (Maximum Frequency: 74.353MHz)
   Minimum input arrival time before clock: 9.229ns
   Maximum output required time after clock: 6.532ns
   Maximum combinational path delay: 7.693ns

The timing is again optimistic, place and route normally deteriorates the latencies. The maximum clock is in fact the clock you can feed the JPEG encoder with pixel data (12 bit) without causing overflow. The output is a huffman coded byte stream that is typically embedded into a JFIF structure header, table data and the appropriate markers by a CPU.

There is quite some room for optimization, in fact, the best compromise of BRAM bandwidth and area has not yet been reached. Quite a few BRAMs ports are not used, but kept open to allow access through an external CPU, like for manipulation of the Huffman tables.

JPEG encoder waveforms
JPEG encoder waveforms

The last performance question might be the latency: how long does it take until encoded JPEG data appears after the first arriving pixel data? The above waveform snapshot should speak for itself: at 50MHz input clock, the latency is approx. 4 microseconds.

Colour encoding

We haven’t talked about colour yet. This is a complex subject, because there are many possibilities of encoding colour, but not really for the JPEG encoder. This is rather a matter of I/O sequencing and the proper colour conversion. As you might remember, a JPEG encoder does not encode three RGB channels, but in YUV space, which might be roughly described as: brightness, redness and blueness. The ‚greenness‘ is implicitely included in this information. But why repeat what’s already nicely described: You find all the details right here on Wikipedia.

So, to encode all the colour, we just need properly separated data according to one of the interleaving schemes (4:2:0 or 4:2:2) and feed the MCU blocks of 8×8 pixels through the encoder while assserting the channel value (Y, Cb, Cr) on the channel_select input. Voilà.

Turns out that the Bayer Pattern that we receive from many optical colour sensors can be converted rather directly into YUV 4:2:0 space using the right setting four our Scatter-Gather unit (‚Cottonpicken‘ engine). With a tiny bit of software intervention through a soft core, we finally cover the entire colour processing stream. Proof below.

Colour JPEG
Colour JPEG output from the encoder
The original RGB photo
The original RGB photo

As you can see, the colours are quite not perfect yet compared with the original. This is a typical problem, that you get a greenish tint. We leave this to the colour optimization department 🙂

One more serious word: Just yesterday I’ve read the news and had to see that the person who changed the optical colour sensor industry, Bryce Bayer, has passed away. As a final „thank you“ to his work, I’d like to post the Bayer Picture of the above.

Bayer pattern source image
Dedicated to Bryce Bayer
Posted on

Dynamic netpp properties

The legacy

Up to netpp v0.3x, device properties used to be static, that means, a device had a certain predefined set of properties written in XML and that’s it. No stuff on-the-fly.
However, as coders familiar with the internals know, there are dynamic properties: The port concept of netpp allows to dynamically add a ‚Port‘ to a ‚Hub‘ when the latter is probed.
For example, when sending a ‚probe‘ broadcast on the ‚TCP‘ Hub, all responding servers are added to the Port property list and show up when running the netpp master tool

> netpp
Available interfaces/hubs:
 Child: [80000000] 'TCP'
    Child: [80010000] ''
 Child: [80000001] 'UDP'
    Child: [80020000] ''

The challenge

So, where would dynamic properties in an embedded device make sense? You could think of the following scenarios:

  1. A device would not only certain properties when in a certain operation mode
  2. A device would retrieve its properties from another description than the static property list

The first scenario could so far always be dealt with using class derivation: When in Mode A, the device shows its base properties, when in mode B, it shows the base properties plus some extensions. The built in method of the base device would just solve this by returning the proper device index on the fly from the local_getroot() function.

The second scenario however is trickier:  Of course you could build a static property list from another device description and register it with netpp. However, this would turn out in a rather complicated external code mess, it turns out to be easier to enhance the existing structures for the dynamic properties.

Let’s just work on an example to get to the details:

Example for VHDL test bench

The VPI specification originating from the Verilog HDL (hardware description language) allows to iterate through the signal hierarchy of a hardware design simulation. Using these extensions, quite a few hardware simulators allow to load a shared library on top of the simulation that does a few things, like external signal manipulation.

In various articles (like ‚Asynchronous remote simulation‚), it is described how to interface netpp with virtual hardware using the VHPI specification. However, the VHPI interface does not have the fancy shared library option. This means, each test bench that should be netpp capable must be manually enhanced using the desired netpp interface and DClib hardware description that maps abstract properties into register addresses on the VHDL side.

For a quick and dirty approach, it would be nice if we could just load a module on top of a simulation that queries the top level signals of the test bench, export them as netpp properties, and the user could manipulate them from outside (or remotely) using a python script.

Within our so called vpiwrapper (vpiwrapper.c), we create a function that creates a dynamic property from a signal

TOKEN property_from_signal(TOKEN parent, vpiHandle sig)

The parent is a dynamic(!) root node that we have created previously, if none did already exist.

Enhanced functionality

The above example just created a number of top level properties from existing signals. So far so good. But what if we wanted the basic functionality from the VHPI extensions, too, common to all the simulations? This again would be the static properties from previous approaches. So we mix a lot of stuff: VPI with VHPI extensions and static with dynamic properties.

Wouldn’t that turn out into a nightmare?

Nope. The answer is again derivation: We just create two device root nodes: One is the static node from the static property list (proplist.c). The other is the dynamic node that we created inside the vpiwrapper (if it didn’t already exist). Now we simply set the static node as „base class“ of the dynamic node. That means, the property iteration on the slave side (inside the simulation) will see the signal properties created dynamically and the static properties, likewise.

As example, a running „simram“ simulation will answer the netpp call as follows:

Properties of Device 'VPI_GHDLwrapper' associated with Hub 'TCP':
Child: [80000001] 'clk'
Child: [80000002] 'we'
Child: [80000003] 'addr'
Child: [80000004] 'data0'
Child: [80000005] 'data1'
Child: [80000006] 'data2'
Child: [80000007] 'ram0'
Child: [80000008] 'ram1'
Child: [00000002] 'Enable'
Child: [00000005] 'Irq'
Child: [00000007] 'Reset'
Child: [00000008] 'Throttle'
Child: [00000009] 'Timeout'
Child: [00000003] 'Fifo'

The properties beginning with a capital are the static ones. You can see a difference in the TOKEN values as well when compared to the signal properties with lower caps. (Internals experts know, that dynamic property tokens have the MSB set)

On a side note: The properties ram0 and ram1 are explicitely registered by this simulation. The implement a netpp BUFFER variable that can simply be read/written asynchronously during the simulation. From the VHDL side, they simulate a simple dual port memory.

Posted on

Virtual Hardware JPEG encoding

For a while I had been messing with various DSP architectures while playing with FPGA technology. So far both worlds were kinda separated in its own sandbox, the FPGA got to do really stupid interfacing and simple transforms while the DSP was doing the real complex encoding.
Now it’s time for a leap: Why not move some often used DSP primitives smoothly into the FPGA?
What kept me from doing it, were the tools. Most time is actually not burned on the concept or implementation, but on the debugging. Since the tools helping to debug would cost a fortune, it was just more economic to put a powerful chip next to the FPGA, even if it would have had the resources to run a decent number of soft cores in parallel.
Well, turns out somehow that in the development process of the past years, own tools can do. Merely, the ghdl extensions described here, allow to verify processing chains with real data just by replacing a hard VHDL FIFO module for data input by its virtual counterpart. This VirtualFIFO just runs in the simulation and can be fed from outside (via network) with data.

One good example of a complex processing chain is a JPEG encoder, which is typically implemented in software, i.e. as a serial procedure running on a CPU. If you’d want to migrate parts of the encoding such as the computationally expensive DCT into real parallel working hardware, the classical way could be to produce some number of offline data sets to run through the testbench (simulation) to verify the correct behaviour of your design.
But you could also just loop in the simulation into your software, in the sense of a „Software in the Loop“ co-simulation.

Hacking the JPEG encoder

So, to replace the DCT of a JPEG routine using our DCT hardware simulation, we have to loop in some piece of code that implements our virtual DCT. We call it „remote DCT“, because the simulation could run on another machine. To speak to the remote DCT, we use the VirtualFIFO, which has a netpp interface, meaning, it can be accessed from the network.

The following python script demonstrates, how the remote DCT is accessed:

import netpp
d = netpp.connect("TCP:localhost")
r = d.sync()
buf = r.Fifo.Buffer
b = get_next_buffer()  # Example function returning a buffer
buf.set(b) # Send buffer b
rb = buf.get() # Get return buffer

The C version of this looks a bit more complicated, but basically makes API calls to netpp to transfer the data and wait for the return. This is simply the concept of swapping out local functions against remote procedure calls that are answered by the VHDL simulation.

This way, we run our JPEG encoder with a black and white test PNG shown below:

Original PNG image
Original PNG image

Since the DCT is running in a hardware simulation and not in software, it is very slow. It can take minutes, to encode an image. However, what we can get from this simulation, is a cycle accurate waveform of things that are happening under the hood.

DCT encoder waveforms
DCT encoder waveforms

After many hours of debugging and fixing some data flow issues, our virtual hardware does what it should.

Encoded JPEG image
Encoded JPEG image

And this is the encoded result. By swapping out the remoteDCT routine again by the built in routine, we get a reference image which we can subtract from the virtually hardware encoded image. If the result is all zeros, we know that both methods produce identical results.


Now the interesting part: How much hardware resources are allocated? See the synthesis results output below (This is for a Xilinx Spartan3E 250).

Typically, the timing results from synthesis can’t be trusted, when place & route has completed and fitted other logic as well, the maximum clock will significantly decrease. Since this design is using up all DSP slices on this FPGA, we’ll only see this as an intermediate station and move on to more gates and DSP power.

 Number of Slices:                      872  out of   2448    35%  
 Number of Slice Flip Flops:            623  out of   4896    12%  
 Number of 4 input LUTs:               1668  out of   4896    34%  
 Number of IOs:                          39
 Number of bonded IOBs:                  38  out of     66    57%  
 Number of BRAMs:                         6  out of     12    50%  
 Number of MULT18X18SIOs:                12  out of     12   100%     

 Timing constraint: Default period analysis for Clock 'clk'
 Clock period: 6.710ns (frequency: 149.032MHz)