Posted on

ZPU next generation – pipelined

Working with various ZPU adaptations for quite a while, some feature requests came up for which there was no immediate solution except swapping out the ZPU core. In the past months, quite a number of SoC designs turned out to be workarounds which were sufficient for its purpose, but were prognosed to be a maintenance nightmare.

Long story short: I decided to give it a new go. This time in MyHDL. If you’re not aware of this developer’s gem, check out the official website. It has its quirks, but it is yet the best tool to design a CPU, due to Python’s extensive test benching features. Verification and validation of a CPU is way easier than using the classical VHDL/Verilog approach.

Pipelining the ZPU

The ZPU is a pure stack machine, therefore sorting out the register hazards like on the MIPS architecture is kinda simple. The first approach however does not use a separate fetch stage, therefore branch penalties are not too bad for the moment (although sacrificing some clock speed). This design introduced “shortcuts”: When a pop() instruction is following a push(), the value does not need to be fetched from the stack explicitely but can be bypassed by a ‘write register’. There are some minor but nasty exceptions  that need to be sorted out, but these turned out to require not too much logic.

Eventually, parts of the design were borrowed from another very basic VLIW concept I used on a MIPS16 clone (made a while ago) and ZPU instructions just translate to VLIW sequences. Operations like LOAD that don’t pipeline well due to possible I/O stalls are just implemented as VLIW ‘micro code’.

Differences to ‘classic’ ZPU4/Zealot ‘small’

In the original ZPU small, quite some traffic occured between core and the shared memory (program, data, stack). The ZPUng introduces some changes:

  • Separate stack memory: This allows a pure register (distributed RAM) synthesis for higher speed. Plus, the stack cannot trash the program code
  • Shared Program/Data: This is required to be compatible to the Zealot ‘Phi’ programs. However, traffic is reduced and the writes are delayed (writeback stage). ZPUng v0.2 implements instruction prefetching and DMA access to the program/data memory
  • Optional: Allow usage of pseudo dual port memory on very small FPGAs.
  • A read immediately following a write is a classic hazard scenario which is handled by bypass logic on the stack memory. On the prog/data memory, it is not relevant on the ZPU.

DMA access could already be implemented on the ZPU4 using a specific DMA capable memory block.

I had implemented ICE debugging for the Zealot, using our in-house “StdTAP” interface that is running on a few native JTAG primitives of various FPGA vendors. The new ZPUng should of course behave likewise. Since an ICE event is handled like an exception on high priority, a bit of logic had to be added.

Handling events: IRQ and emulation

This is the harder part: The existing ZPU and the same program code with IRQ handlers is required to work likewise on the new ‘ZPUng’ (working title). However, there are a few extras: By using an external System Interrupt Controller (SIC), we get more control over generating interrupt vectors. Remember, the standard ZPU4 or Zealot in its “phi” configuration has only one interrupt channel and vector. In the ZPUng, we take an external interrupt vector from the SIC which can be configured using the peripheral I/O (memory mapped registers; MMR). Because the ZPU is a stack machine, no specific “return from exception” command is required, therefore it is very simple to register an IRQ: Just set the interrupt vector register ‘n’ of the SIC to a C function address handler.

The very tricky part is, to make interrupt handling work together with the on chip debugger (In Circuit Emulation aka ICE). There a few boundary conditions:

  • IRQs don’t interrupt inside a “IM” (immediate load) sequence. Therefore, no fixed IRQ latency possible, but IM sequences are always atomic
  • IRQs can interrupt inside a single step ICE session

Another feature of the IRQ enhancements: Typically, an IRQ handler acknowledges the interrupt request to the SIC, allowing another interrupt to occur. If this happens before the IRQ routine is actually ended, it will re-enter itself and trash the stack, eventually. This is avoided on the ZPUng by clearing the corresponding IPEND flag just before the final return (POPPC). The logic sets the IRQACK flag (which prevents reentrance) to the IRQ state on every branch instruction. So interrupt routines are not reentrant when following this scheme. Reentrance could be enabled by nasty hacks messing with the SIC configuration.

IRQ redesign rev1

In order to re-enable IRQ reentrancy and allowing IRQ priorisation through the SIC, the interrupt handling was redesigned in ZPUng v1 such that higher prioritised IRQs can interrupt a current interrupt handler. Other implementations had made use of a POPINT opcode – the same is now happening here, with one exception: It just clears the flag, return is still done by a POPPC. This makes the code easier to handle. IPEND flags are now cleared at the beginning of the IRQ handler and a final IRQ_REARM() macro clears the internal IRQ acknowledge.

The SIC was changed such that recurring IRQs with lower priority don’t cause another “dingdong” on the IRQ pin.

Resource usage

For example, the ZPUng ‘small’ (compatible to the ‘phi’ config) alone was synthesized in two configurations for a MACHXO2 7000 with the following resource usage:

Speed optimized (max. 32 MHz as SoC) : LUT: 906 Registers: 153  SLICE: 453
Area optimized (max. 25 Mhz as SoC)  : LUT: 745 Registers: 152  SLICE: 361

This SoC configuration just uses a system interrupt controller plus a 2×16 GPIO bank as peripherals. Complex peripherals on the Wishbone bus would slow down the design further due to logic congestion of the current architecture.

Synthesis on the Papilio Spartan3-250k platform produces quite similar results, however the CPU is running a few MHz faster. This is very likely due to the Xilinx architecture being a little faster on the block RAM side.

Development Roadmap

  • Sources release: Planned, but not scheduled yet.
  • Speedup optimizations: There is not much in for it, and changes to the pipeline will increase logic elements. This is something where other CPU architectures perform better.
  • Explore ‘sliding register window’ tricks for pure research fun. Might require a full new instruction set…?
Posted on

In Circuit Emulation for MIPS soft CPU

Many developers got used to this: Plug in a JTAG connector to your embedded target and debug the misbehaviour down to the hardware. No longer a luxury, isn’t it?
When you move on to a soft core from opencores.org for a simple SoC solution on FPGA, it may be. Most of them don’t have a TAP – a Test Access Port – like other off the shelf ARM cores.

We had pimped a ZPU soft core with in circuit emulation features a while ago (In Circuit Emulation for the ZPU). This enabled us to connect to a hardware ZPU core with the GNU debugger like we would attach to for example a Blackfin, msp430, you name it.

Coming across various MIPS compatible IP cores without debug facilities, the urge came up to create a standard test access port that would work on this architecture as well. There is an existing Debug standard called EJTAG, but it turned out way more complex to implement than simple In Circuit Emulation (ICE) using the above TAP from the ZPU.

So we would like to do the same as with the ZPU:

  • Compile programs with the <arch>-elf-gcc
  • Download programs into the FPGA hardware using GDB
  • set breakpoints, inspect and manipulate values and registers, etc.
  • Run a little “chip scope”

Would that work without killer efforts?

Yes, it would. But we’re not releasing the white rabbit yet. Stay tuned for embedded world 2013 in Nuremberg in February. We (Rene Doß from Dossmatik GmbH Germany and myself) are going to show some interesting stuff that will boost your SoC development big time by using known architectures and debug tools.

His MIPS compatible Mais CPU is on the way to become stable, and turns out to be a quick bastard while being stingy on resources.

For a sneak peek, here’s some candy from the synthesizer (including I/O for the LCD as shown below):

Selected Device : 3s250evq100-5

Number of Slices:                     1172  out of   2448    47%
Number of Slice Flip Flops:            904  out of   4896    18%
Number of 4 input LUTs:               2230  out of   4896    45%
Number of IOs:                          15
Number of bonded IOBs:                  15  out of     66    22%
Number of BRAMs:                         2  out of     12    16%
Number of GCLKs:                         3  out of     24    12%

And for the timing:

Minimum period: 13.498ns (Maximum Frequency: 74.084MHz)
Mais on Papilio with TFT wing

Update: Here’s a link to the presentation (as PDF) given on the embedded world 2013 in Nuremberg:

presentation-2013

Posted on

A full baseline pipelined JPEG encoder in VHDL

As described in a previous post, a framework was built to loop in a DCT hardware design into a software JPEG encoder for verification (and acceleration) purposes.

Turns out this strategy speeds up development a lot, and that the remaining modules on the way to a full hardware based and pipelined JPEG encoding solution weren’t a big job. Actually, I was expecting that this enhanced encoder would no longer fit into a small Spartan3E 250k. Wrong!

Have a look:

Device utilization summary:
---------------------------

Selected Device : 3s250evq100-5 

 Number of Slices:                     1567  out of   2448    64%  
 Number of Slice Flip Flops:           1063  out of   4896    21%  
 Number of 4 input LUTs:               2915  out of   4896    59%  
    Number used as logic:              2900
    Number used as Shift registers:      15
 Number of IOs:                          49
 Number of bonded IOBs:                  47  out of     66    71%  
 Number of BRAMs:                        12  out of     12   100%  
 Number of MULT18X18SIOs:                11  out of     12    91%  
 Number of GCLKs:                         2  out of     24     8%

JPEG encoder latency and timing

From the XST summary, we get:

Timing Summary:
---------------
Speed Grade: -5

   Minimum period: 13.449ns (Maximum Frequency: 74.353MHz)
   Minimum input arrival time before clock: 9.229ns
   Maximum output required time after clock: 6.532ns
   Maximum combinational path delay: 7.693ns

The timing is again optimistic, place and route normally deteriorates the latencies. The maximum clock is in fact the clock you can feed the JPEG encoder with pixel data (12 bit) without causing overflow. The output is a huffman coded byte stream that is typically embedded into a JFIF structure header, table data and the appropriate markers by a CPU.

There is quite some room for optimization, in fact, the best compromise of BRAM bandwidth and area has not yet been reached. Quite a few BRAMs ports are not used, but kept open to allow access through an external CPU, like for manipulation of the Huffman tables.

JPEG encoder waveforms
JPEG encoder waveforms

The last performance question might be the latency: how long does it take until encoded JPEG data appears after the first arriving pixel data? The above waveform snapshot should speak for itself: at 50MHz input clock, the latency is approx. 4 microseconds.

Colour encoding

We haven’t talked about colour yet. This is a complex subject, because there are many possibilities of encoding colour, but not really for the JPEG encoder. This is rather a matter of I/O sequencing and the proper colour conversion. As you might remember, a JPEG encoder does not encode three RGB channels, but in YUV space, which might be roughly described as: brightness, redness and blueness. The ‘greenness’ is implicitely included in this information. But why repeat what’s already nicely described: You find all the details right here on Wikipedia.

So, to encode all the colour, we just need properly separated data according to one of the interleaving schemes (4:2:0 or 4:2:2) and feed the MCU blocks of 8×8 pixels through the encoder while assserting the channel value (Y, Cb, Cr) on the channel_select input. Voilà.

Turns out that the Bayer Pattern that we receive from many optical colour sensors can be converted rather directly into YUV 4:2:0 space using the right setting for our Scatter-Gather unit (‘Cottonpicken’ engine). With a tiny bit of software intervention through a soft core, we finally cover the entire colour processing stream. Proof below.

Colour JPEG
Colour JPEG output from the encoder
The original RGB photo
The original RGB photo

As you can see, the colours are quite not perfect yet compared with the original. This is a typical problem, that you get a greenish tint. We leave this to the colour optimization department 🙂

One more serious word: Just yesterday I’ve read the news and had to see that the person who changed the optical colour sensor industry, Bryce Bayer, has passed away. As a final “thank you” to his work, I’d like to post the Bayer Picture of the above.

Bayer pattern source image
Dedicated to Bryce Bayer