
Using linux kernel config and devdesc/XML for VHDL designs

A true mess of a VHDL design

Almost every Linux kernel user is more or less familiar with it: ‘make menuconfig’ calls up the blue configuration screen that lets you choose all kinds of drivers. Some folks have been using the kconfig tool for embedded systems for a while, for example in busybox, or in the very nice antares setup for esp8266 systems.

So why not use the Linux kernel config for hardware designs? Since I’ve been working with System on Chip (SoC) implementations of various kinds, I’ve kept shooting myself in the foot with a lot of maintenance work caused by all kinds of different configurations. For example, I’d like to have a SoC where peripherals can be configured freely, or one that uses another soft core. Merging all these projects into a single setup with *one* configuration entity to touch turned out to be a bit of a challenge.

So far we used XML to describe the entire SoC. This generates all the necessary peripheral decoders and address maps automatically, like with the various system builder tools from the big FPGA vendors. But what if there is an entire family of SoCs, even running on FPGAs from different vendors? Even then, an XML file would have to be written for each platform configuration. No good. Plus, the XML does not help you much in selecting the source files, unless you export Makefiles on the fly. Not a good idea either.

A better approach is to specify *one* omnipotent hardware setup in the XML device description and use kconfig to turn components on or off, or even to specify the number of instances, for example of several UARTs.

So what’s left to do?
kconfig handles the export of its CONFIG parameters through various backends. There are already backends for:

  • Makefiles
  • C headers

Missing is the support for VHDL. This is currently done by a makefile hack which exports the configuration into a global_config.vhdl file. This is used as a package from every relevant HDL design file.
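
To illustrate what such an export boils down to, here is a minimal Python sketch that turns kconfig-style lines into a VHDL package of constants. It is not the makefile hack used in the project; the function name config_to_vhdl, the mapping of option types to VHDL types, and the layout of the generated package are assumptions made for this example.

```python
import re

def config_to_vhdl(config_text, package="global_config"):
    """Turn kconfig-style 'CONFIG_X=value' lines into a VHDL package string.

    Hypothetical helper: the real global_config.vhdl may look different.
    """
    constants = []
    for line in config_text.splitlines():
        line = line.strip()
        # kconfig marks disabled options as '# CONFIG_X is not set'
        m = re.match(r"# (CONFIG_\w+) is not set$", line)
        if m:
            constants.append(f"    constant {m.group(1)} : boolean := false;")
            continue
        m = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if not m:
            continue  # blank lines and other comments
        name, value = m.groups()
        if value == "y":
            constants.append(f"    constant {name} : boolean := true;")
        elif re.fullmatch(r"-?\d+", value):
            constants.append(f"    constant {name} : integer := {value};")
        elif value.startswith('"'):
            constants.append(f"    constant {name} : string := {value};")
        # other value kinds (hex, tristate 'm') are skipped in this sketch
    body = "\n".join(constants)
    return f"package {package} is\n{body}\nend package {package};\n"


sample = "CONFIG_UART=y\n# CONFIG_PWM is not set\nCONFIG_NUM_UART=2\n"
print(config_to_vhdl(sample))
```

A design file would then pull in the constants with a use clause on that package and, for instance, guard an instance with an if-generate on CONFIG_UART.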

Now another problem comes up: If units can be turned on or off, the component interface (I/O pin layout) will change. VHDL does not have preprocessing functionality with the full power of conditional compilation. However, every decent development system has a program that does: the C preprocessor.

So for the top level SoC module, which directly instantiates the peripheral I/O module with a conditional pin mapping, we use VHDL code decorated with the #ifdef statements known from C. These VHDL files have a .chdl suffix. In the Makefiles, there is simply a rule to convert such a .chdl file to .vhdl. Done. We just need to figure out the proper Makefile rules and make sure everything is resynced to source file changes upon calling “make all”.
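
To show what the conversion does to a decorated port list, here is a small Python stand-in for the preprocessing step. The project itself uses the C preprocessor; this sketch only handles #ifdef, #ifndef, #else and #endif, and the function name preprocess as well as the sample entity are made up for the illustration.

```python
def preprocess(chdl_text, defined):
    """Keep or drop lines according to #ifdef/#ifndef/#else/#endif.

    Simplified stand-in for the C preprocessor, for illustration only.
    """
    out = []
    stack = []  # one entry per open conditional: True if that branch is active
    for line in chdl_text.splitlines():
        words = line.split()
        directive = words[0] if words else ""
        if directive in ("#ifdef", "#ifndef"):
            is_set = words[1] in defined
            stack.append(is_set if directive == "#ifdef" else not is_set)
        elif directive == "#else":
            stack[-1] = not stack[-1]
        elif directive == "#endif":
            stack.pop()
        elif all(stack):
            # Emit the line only if every enclosing conditional is active
            out.append(line)
    return "\n".join(out) + "\n"


CHDL = """entity soc is
    port (
#ifdef CONFIG_UART
        uart_tx : out std_logic;
        uart_rx : in  std_logic;
#endif
        clk     : in  std_logic
    );
end entity;
"""

print(preprocess(CHDL, {"CONFIG_UART"}))
```

With CONFIG_UART in the set of defined symbols the two UART pins survive; with an empty set they disappear from the entity.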

The real stuff

So, how does it work?

menuconfig

When you call the classic “make menuconfig” inside our SoC project, you would see the above configuration screen. Everything else works pretty much linux’ish.

For defining new hardware peripherals, you still have to do a few redundant extra steps:

  1. Edit the XML file with the register map of the peripheral
  2. Define the peripheral address mapping in the XML file
  3. Add the configuration options to the perio/Kconfig file, just like in Linux
  4. Adapt the kconfig->VHDL makefile script (vhdlconfig.mk)
  5. Of course: Implement the core for your peripheral to be instanced by the soc_mmr_perio module.

Step (4) could be simplified by writing a specific VHDL backend for kconfig. This is just “nice to have”, so maybe someone else would want to step in?

devdesc/XML for hardware

netpp/devdesc is close to having its 10th birthday. It’s our own XML language to describe devices and has been in service for quite a bit for Internet of Things applications. It turned out that not only existing hardware can be described with it, but a full SoC design can be generated at very little overhead. The gensoc tool, making heavy use of netpp technology, creates the full peripheral module including decoders and peripheral instances (like UART0, UART1, PWM, …) from a system description XML file.

As mentioned, we don’t want to write plenty of similar looking XMLs for each SoC variant. It is much easier to decorate the existing XML, which contains all HW definitions, with processing commands. Their purpose is to emit XML nodes to a specific target description only when the corresponding CONFIG_<module> variable is set.

In short, we create a stripped down XML file describing the specific target from an XML family file containing the whole peripheral superset. The target XML file is then used to generate the HDL via gensoc.
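
The family-to-target filtering can be sketched in a few lines of Python. Note that this does not use the real devdesc syntax: the ifdef attribute, the soc and peripheral element names, and the function names are hypothetical stand-ins for the actual processing commands.

```python
import xml.etree.ElementTree as ET

def strip_disabled(element, enabled):
    """Remove child nodes whose (hypothetical) 'ifdef' attribute names an unset option."""
    for child in list(element):
        option = child.get("ifdef")
        if option is not None and option not in enabled:
            element.remove(child)
        else:
            # Node is kept: drop the decoration and descend into its children
            child.attrib.pop("ifdef", None)
            strip_disabled(child, enabled)

# Made-up family description containing the whole peripheral superset
FAMILY_XML = """<soc name="family">
  <peripheral name="UART0" ifdef="CONFIG_UART"/>
  <peripheral name="PWM" ifdef="CONFIG_PWM"/>
  <peripheral name="GPIO"/>
</soc>"""

def make_target(family_xml, enabled):
    """Return the stripped-down target tree for a set of enabled CONFIG options."""
    root = ET.fromstring(family_xml)
    strip_disabled(root, enabled)
    return root

target = make_target(FAMILY_XML, {"CONFIG_UART"})
print(ET.tostring(target, encoding="unicode"))
```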

The software side

Every SoC of course has some built-in software – a bootloader or a bare metal program doing stuff. This code is typically written in C or assembly and makes quite some happy usage of the kconfig output as well. For instance, an “autoconf.h” header is exported that contains all the macro defines for the configuration. So you can enable test routines or hardware drivers as usual.

Every SoC should come with proper debugging features. It helps a lot, when developing new SoC peripherals and drivers, to make use of some regression testing scripts that can run inside GDB or as a remote procedure call solution inside the simulation.

For example, we use the same framework to generate a wrapper such that all peripheral registers can be accessed through a python script, like:

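# r: the generated wrapper giving access to all peripheral registers by name
r.GPIO_DIR.set(0xfffffff0)

# Write the mask 0xffff to GPIO_SET, then to GPIO_CLR, 100 times
for i in range(100):
    r.GPIO_SET.set(0xffff)
    r.GPIO_CLR.set(0xffff)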

This is rather exclusive to the simulation. When running on the target, writing GDB scripts is typically the best option.

The Python script does not have to be aware of any hard addresses, so testing systems can be set up for entire families of SoCs with differing address bases. For a specific hardware target however, the GDB scripts containing the register mappings have to be explicitly regenerated.
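
How a name based wrapper keeps the test script free of addresses can be shown with a small Python model. This is not the generated wrapper itself: the classes Register, RegisterMap and FakeBus, as well as all addresses, are invented for this sketch.

```python
class Register:
    """One memory mapped register, bound to a bus and an address."""
    def __init__(self, bus, address):
        self._bus = bus
        self._address = address

    def set(self, value):
        self._bus.write(self._address, value)

    def get(self):
        return self._bus.read(self._address)


class RegisterMap:
    """Expose registers as attributes, looked up by name in an address table."""
    def __init__(self, bus, addresses):
        for name, address in addresses.items():
            setattr(self, name, Register(bus, address))


class FakeBus:
    """Stand-in for the simulation back end: a dict acting as register file."""
    def __init__(self):
        self.mem = {}

    def write(self, address, value):
        self.mem[address] = value

    def read(self, address):
        return self.mem.get(address, 0)


def gpio_test(r):
    # The test only uses register names, never addresses
    r.GPIO_DIR.set(0xfffffff0)
    r.GPIO_SET.set(0xffff)


# Two made-up SoC variants with different peripheral base addresses
SOC_A = {"GPIO_DIR": 0x8000, "GPIO_SET": 0x8004}
SOC_B = {"GPIO_DIR": 0xF000, "GPIO_SET": 0xF004}

bus_a, bus_b = FakeBus(), FakeBus()
gpio_test(RegisterMap(bus_a, SOC_A))
gpio_test(RegisterMap(bus_b, SOC_B))
```

The same gpio_test() runs unchanged against both variants; only the address table differs.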

Conclusion

The kconfig tool boosts SoC development quite a bit and makes things really clean and more robust. The entire system with the XML description is still a bit away from being perfect, but it just works on plenty of platforms using the typical (free) developer tools.

 


ZPU next generation – pipelined

After working with various ZPU adaptations for quite a while, some feature requests came up for which there was no immediate solution except swapping out the ZPU core. In the past months, quite a number of SoC designs turned out to be workarounds which were sufficient for their purpose, but were predicted to become a maintenance nightmare.

Long story short: I decided to give it a new go. This time in MyHDL. If you’re not aware of this developer’s gem, check out the official website. It has its quirks, but it is so far the best tool to design a CPU, due to Python’s extensive test benching features. Verification and validation of a CPU is way easier than with the classical VHDL/Verilog approach.

Pipelining the ZPU

The ZPU is a pure stack machine, therefore sorting out register hazards like on the MIPS architecture is fairly simple. The first approach however does not use a separate fetch stage, therefore branch penalties are not too bad for the moment (although this sacrifices some clock speed). This design introduced “shortcuts”: When a pop() instruction follows a push(), the value does not need to be fetched from the stack explicitly but can be bypassed via a ‘write register’. There are some minor but nasty exceptions that need to be sorted out, but these turned out to require not too much logic.

Eventually, parts of the design were borrowed from another very basic VLIW concept I used on a MIPS16 clone (made a while ago) and ZPU instructions just translate to VLIW sequences. Operations like LOAD that don’t pipeline well due to possible I/O stalls are just implemented as VLIW ‘micro code’.

Differences to ‘classic’ ZPU4/Zealot ‘small’

In the original ZPU small, quite some traffic occurred between the core and the shared memory (program, data, stack). The ZPUng introduces some changes:

  • Separate stack memory: This allows a pure register (distributed RAM) synthesis for higher speed. Plus, the stack cannot trash the program code
  • Shared Program/Data: This is required to be compatible with the Zealot ‘Phi’ programs. However, traffic is reduced and the writes are delayed (writeback stage). ZPUng v0.2 implements instruction prefetching and DMA access to the program/data memory
  • Optional: Allow usage of pseudo dual port memory on very small FPGAs.
  • A read immediately following a write is a classic hazard scenario which is handled by bypass logic on the stack memory. On the prog/data memory, it is not relevant on the ZPU.
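
The read-after-write bypass on the stack memory can be pictured with a small behavioural model in Python. This is not the ZPUng logic itself, just a sketch assuming that a write lands in memory one clock later and that a ‘write register’ holds it in the meantime; the class name BypassedMemory is made up.

```python
class BypassedMemory:
    """Memory whose writes land one clock later, with read-after-write bypass."""
    def __init__(self, size):
        self.mem = [0] * size
        self.pending = None  # the 'write register': (address, value) or None

    def write(self, address, value):
        self.clock()  # at most one write in flight: commit the older one first
        self.pending = (address, value)

    def read(self, address):
        # Bypass: serve the value from the write register if the addresses match
        if self.pending is not None and self.pending[0] == address:
            return self.pending[1]
        return self.mem[address]

    def clock(self):
        """Commit the pending write to the memory array."""
        if self.pending is not None:
            address, value = self.pending
            self.mem[address] = value
            self.pending = None


stack = BypassedMemory(16)
stack.write(3, 0xAB)        # push
value = stack.read(3)       # pop right after: served from the write register
```

Without the address comparison in read(), the pop would return the stale content of the memory array.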

DMA access could already be implemented on the ZPU4 using a specific DMA capable memory block.

I had implemented ICE debugging for the Zealot, using our in-house “StdTAP” interface that is running on a few native JTAG primitives of various FPGA vendors. The new ZPUng should of course behave likewise. Since an ICE event is handled like an exception on high priority, a bit of logic had to be added.

Handling events: IRQ and emulation

This is the harder part: Program code with IRQ handlers written for the existing ZPU is required to work likewise on the new ‘ZPUng’ (working title). However, there are a few extras: By using an external System Interrupt Controller (SIC), we get more control over generating interrupt vectors. Remember, the standard ZPU4 or Zealot in its “phi” configuration has only one interrupt channel and vector. In the ZPUng, we take an external interrupt vector from the SIC which can be configured using the peripheral I/O (memory mapped registers; MMR). Because the ZPU is a stack machine, no specific “return from exception” command is required, therefore it is very simple to register an IRQ: Just set the interrupt vector register ‘n’ of the SIC to the address of a C handler function.

The very tricky part is to make interrupt handling work together with the on chip debugger (In Circuit Emulation aka ICE). There are a few boundary conditions:

  • IRQs don’t interrupt inside an “IM” (immediate load) sequence. Therefore, no fixed IRQ latency is possible, but IM sequences are always atomic
  • IRQs can interrupt inside a single step ICE session

Another feature of the IRQ enhancements: Typically, an IRQ handler acknowledges the interrupt request to the SIC, allowing another interrupt to occur. If this happens before the IRQ routine is actually ended, it will re-enter itself and trash the stack, eventually. This is avoided on the ZPUng by clearing the corresponding IPEND flag just before the final return (POPPC). The logic sets the IRQACK flag (which prevents reentrance) to the IRQ state on every branch instruction. So interrupt routines are not reentrant when following this scheme. Reentrance could be enabled by nasty hacks messing with the SIC configuration.

IRQ redesign rev1

In order to re-enable IRQ reentrancy and allow IRQ prioritization through the SIC, the interrupt handling was redesigned in ZPUng v1 such that higher prioritized IRQs can interrupt a running interrupt handler. Other implementations had made use of a POPINT opcode. The same is now happening here, with one exception: It just clears the flag, the return is still done by a POPPC. This makes the code easier to handle. IPEND flags are now cleared at the beginning of the IRQ handler and a final IRQ_REARM() macro clears the internal IRQ acknowledge.

The SIC was changed such that recurring IRQs with lower priority don’t cause another “dingdong” on the IRQ pin.
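
The prioritization rule can be made concrete with a toy model in Python. It is only a sketch of the behaviour described above, not the actual SIC: the class SicModel, its method names, and the convention that a lower channel number means higher priority are assumptions.

```python
class SicModel:
    """Toy interrupt controller. Assumption: lower channel number = higher priority."""
    def __init__(self):
        self.pending = set()  # requested, not yet taken by the CPU
        self.active = []      # stack of channels whose handlers are running

    def _can_preempt(self, channel):
        return not self.active or channel < self.active[-1]

    def request(self, channel):
        """A peripheral raises an IRQ. Returns True if the CPU's IRQ pin should fire."""
        self.pending.add(channel)
        return self._can_preempt(channel)

    def enter(self, channel):
        """CPU takes the interrupt: the pending flag is cleared at handler entry."""
        self.pending.discard(channel)
        self.active.append(channel)

    def rearm(self):
        """End of a handler. Returns the next channel to service, or None."""
        self.active.pop()
        ready = [c for c in sorted(self.pending) if self._can_preempt(c)]
        return ready[0] if ready else None


sic = SicModel()
sic.request(2)
sic.enter(2)
low = sic.request(3)    # lower priority while handler 2 runs: pin stays quiet
high = sic.request(1)   # higher priority: may interrupt handler 2
```

In this model a request with equal or lower priority stays pending until the running handler rearms, which corresponds to the “no second dingdong” behaviour.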

Resource usage

For example, the ZPUng ‘small’ (compatible to the ‘phi’ config) alone was synthesized in two configurations for a MACHXO2 7000 with the following resource usage:

Configuration     Max. clock as SoC   LUT   Registers   SLICE
Speed optimized   32 MHz              906   153         453
Area optimized    25 MHz              745   152         361

This SoC configuration just uses a system interrupt controller plus a 2×16 GPIO bank as peripherals. Complex peripherals on the Wishbone bus would slow down the design further due to logic congestion of the current architecture.

Synthesis on the Papilio Spartan3-250k platform produces quite similar results, however the CPU is running a few MHz faster. This is very likely due to the Xilinx architecture being a little faster on the block RAM side.

Development Roadmap

  • Releases: The ZPUng is available as generated, synthesizable VHDL.
    See https://github.com/hackfin/MaSoCist/
  • Full source release is not on the table for now, I’m afraid. One reason is that some modifications of MyHDL took place that will have to be merged back some day.
  • Speedup optimizations: There is not much to gain here, and changes to the pipeline will increase logic elements. This is something where other CPU architectures perform better.
  • Update: The long term development is kinda unsure, as the GCC port of the ZPUng is not very much maintained. The future focus – I’m afraid – will be on compact RISC-V derivatives.

Hacking your own FPGA chip scope

Ok, why would we want to do that, when there are various existing solutions:

  • Altera SignalTap
  • Xilinx ChipScope
  • Lattice Reveal

To make a long story short: To be in control! Well, there were some quirks with existing code, some tools wouldn’t run on my OS, and I wanted to be vendor independent.

But one major reason: We wanted to debug our own DSP core based system while eavesdropping on some internal signals.

So there we go with the shopping list:

  1. JTAG port and TAP implementation [already done]
  2. JTAG agent for debugging and BSCAN register monitoring [already done]
  3. Output to ‘live’ wave display

So again, a little hack of a VCD output got me going. Basically, a specific JTAG register is continuously polled in the main loop and the output is written to a VCD file. Like we did here (JTAG debugging movies), this is now carried over to real hardware.
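
For readers who have not looked inside a VCD file, here is a minimal Python sketch of such an output routine. It is not the scope.c implementation: the function write_vcd, the signal name jtag_reg, and the sample list standing in for the polled JTAG register are all assumptions of this example.

```python
import io

def write_vcd(samples, out, name="jtag_reg", width=8, timescale="1 us"):
    """Write (time, value) samples as a VCD stream, emitting only value changes."""
    ident = "!"  # short identifier code for the single signal
    out.write(f"$timescale {timescale} $end\n")
    out.write("$scope module scope $end\n")
    out.write(f"$var wire {width} {ident} {name} $end\n")
    out.write("$upscope $end\n")
    out.write("$enddefinitions $end\n")
    last = None
    for time, value in samples:
        if value == last:
            continue  # polling returned the same state: nothing to record
        out.write(f"#{time}\n")
        out.write(f"b{value:0{width}b} {ident}\n")
        last = value


# In a real scope the samples would come from polling the JTAG register;
# here a fixed list stands in for that loop.
buf = io.StringIO()
write_vcd([(0, 1), (5, 1), (10, 3)], buf)
print(buf.getvalue())
```

For a live display, the output would go to a file instead of a StringIO and be flushed after every change so that the viewer sees it right away.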

You basically need a netpp installation for the header and the source files below:

scope.c : The VCD output for a homemade scope

scope.h : Necessary header

So what you do to build your own scope (that works with any other data source, by the way): Write the VCD data to a file, and meanwhile you run GTKwave under Linux as follows:

shmidcat /tmp/out.vcd | gtkwave -v -I run.sav

When selecting View->Partial VCD dynamic Zoom End within GTKwave, the window will scroll along your output. There are some compression options for GTKwave when the data file is getting too big.

Note also: GTKwave will overflow after a while. So make sure the time unit somewhat matches your resolution.

Ok, and now you might note that there is a drawback to this: The resolution might be pretty bad, the scope just shows a current state read out at some rather fuzzily defined real time. Some glitches or fast changes are not recorded!

So this needs a more advanced version which I won’t cover here, as it is highly specific to the problem you’re debugging. Just as a guideline: You’ll have to set up an internal trace buffer in block RAM that will monitor the interesting signals and record every change in some way. Then you read out this trace buffer through another channel (which does not have to be JTAG in particular). This technique gets way closer to what the more professional tools are doing: triggering a trace on a specific event and recording it to memory, possibly in a compressed way. And there you go: You’ll be able to write your own logic analyzer and find out that the pro tools would not always save you the time.
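
As a rough behavioural picture of such a trace buffer, here is a Python model that starts recording on a trigger and stores a time stamped entry only when the monitored value changes. It is a sketch under assumptions, not a hardware design: the class TraceBuffer, its depth handling (oldest entries are overwritten), and the trigger semantics are made up for illustration.

```python
from collections import deque

class TraceBuffer:
    """Fixed-depth buffer that stores (cycle, value) only when the value changes."""
    def __init__(self, depth):
        self.entries = deque(maxlen=depth)  # oldest entries drop out when full
        self.last = None
        self.armed = False

    def sample(self, cycle, value, trigger=False):
        """Call once per clock cycle with the monitored signal value."""
        if trigger and not self.armed:
            self.armed = True
            self.last = None  # force recording of the state at the trigger
        if self.armed and value != self.last:
            self.entries.append((cycle, value))
        self.last = value

    def read_out(self):
        """Return the recorded changes, as a readout channel would deliver them."""
        return list(self.entries)


trace = TraceBuffer(depth=4)
signal = [0, 0, 1, 1, 1, 0, 1]
for cycle, value in enumerate(signal):
    trace.sample(cycle, value, trigger=(cycle == 1))
print(trace.read_out())
```

Storing only the changes is a simple form of the compression mentioned above: long stretches without activity cost no buffer space.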