Posted on 1 Comment

Virtual ROM on small FPGAs

Readonly data – outsourced

When running display output applications on small FPGAs where printing of strings is required, using the internal block RAM for character sets would be a waste of resources, or even: not sufficient. The 64kB character map for a normal and inverted font using both RGB and BGR subpixel smoothed renderings (more below) would eat up more than available on the Spartan3-250k device, for example.
The Spartan3 on my good old Papilio has a SPI flash attached with less than 50% actual usage for the FPGA bit stream. Tadaa. Plenty of space for character bitmaps. So we can just store the bitmap in an unused sector on the SPI and blit-copy from SPI flash to the LCD display directly.
But what if there’s more read-only data, like second stage program code or coprocessor microcode that is loaded on the fly by an applet? If we don’t need it for the boot process, it should not live in the block ram permanently, but still execute from block RAM. Screams for a cache, doesn’t it?
So the simple solution actually is, to create a small controller entity ‘scache’ inside the system specific peripheral of the SoC. This simply watches access to certain addresses and generates an exception once this address is hit. The exception vector then jumps into a handler routine which does all the SPI flash loading into the physical cache memory area. Then, for the next LOAD instruction, the virtual address internally translates to the physical cache address.
This requires very simple logic, however runs through some program code and needs a bit of time to load from the SPI flash.
Turns out this is barely noticeable for the LCD display.

Under the hood: Linker scripting

Ok, so there is plenty more data from the .rodata section. We could implement some kind of overlay loader and a kind of file system, but why make it complicated, when our data is somewhat static. We just relocate the cached external data using a linker script. Say, our program memory ranges from 0x0000 to 0x2000, the cache is allocated after that. Then we define a memory specification in the linker script as shown below:

MEMORY
{
        l1ram(rwx): ORIGIN = 0x0000, LENGTH = 0x2000
        l1cache(rwx): ORIGIN = 0x2000, LENGTH = 0x2000
        xdata(r): ORIGIN = 0x10000, LENGTH = 0x8000
}

For the .rodata section, we can simply use a line like

.rodata         : { *(.rodata .rodata.* .gnu.linkonce.r.*) } > xdata

to allocate the read-only areas into the virtual SPI ROM starting from 0x10000. Eventually, compiling the whole story will result in an ELF file which we only have to separate into the boot ROM area and the second stage binaries that live in the SPI flash.

But we may want to have read-only data that should be present at all times, like library code that is frequently used. For this purpose, we can just define our own segments and use the __attribute__ decorator for the data that we want in BRAM:

__attribute__((section(".l1.rodata"))) char g_fifobuf[32];

I will not dive into the details of linker scripting, but likewise you can also allocate entire library or object files in specific memory segments.

This simple technique allows you to pack quite a bit of code into small FPGA SoCs.

The LCD example

LCD screen
LCD subpixel smoothing example

Here you can see an example of the character table in action. I have mentioned the RGB/BGR subpixel smoothing above which lead to a bit higher ROM usage. The used LCDs have quite a bit of functionality, like setting the orientation of the screen display. You could just not care about when designing a character table and use a simple black/white scheme. However, when trying to achieve a 4×8 pixel character set, the font will look quite unreadable and you’re better off with the subpixel smoothing. This again is sensitive to the pixel order, so for top and bottom display orientation you will already need the font rendered in two variants, plus in color. Now, as we have the space, it is just much simpler to drop the entire bit map into the flash instead of figuring out compression techniques.

Program code

Likewise, not too frequently used program code can be dropped into external SPI flash. This comes in handy when a program becomes bigger during development and the space usage can not be determined a priori. The ZPUng is able to handle a specific exception signal that is risen by the SCACHE controller. The cache handler microcode routine then loads code from SPI flash and resumes execution while translating the virtual PC address into a physical address inside the cache area, where instructions are effectively fetched from. This is the trick used to run the netpp communication library for IoT applications  on FPGAs with limited resources, like the Papilio One.


		
Posted on

netpp on programmable logic: IoT for the FPGA

For quite a while I wouldn’t have said, it’s impossible, but wouldn’t put much effort in it either. Why, if you have a spare $1 microprocessor that can run a simple communication stack like netpp.
Well, sometimes it’s the time to try something else: Running a soft core CPU (ZPU) on small FPGAs has found some interest, due to the limited resource consumption. The ZPU will even fit on a $5 FPGA and still leave some space for specific interfaces like motor control. It is a slow stack machine, even the fastest pipelined implementations don’t really beat the MIPS alike architectures, however this doesn’t bother us when we just have to configure a set of registers, moreover, the ZPU architecture compensates with quite some code density.
To keep a long sermon short: A full netpp stack running over a UART interface fits in less than 15kB of memory. And runs on a MachXO2-7000 from Lattice, for example, with less than 50% logic usage.
Who’s still saying that an FPGA is too dumb for the internet?

Ok, there’s one little missing piece: The TCP/IP stack and the ethernet MAC. For this purpose, I’m using a esp8266 module. You’re right, we don’t want to do the full networking on the FPGA – yet.

Update

The solution was presented on the Embedded World Conference 2016. The paper is available for download below.

embedded2016

Posted on Leave a comment

Using linux kernel config and devdesc/XML for VHDL designs

A true mess of a VHDL design

Almost every linux kernel user is kinda familiar with it: a ‘make menuconfig’ calls up the blue configuration screen that lets you choose all kind of drivers. Some folks have been using the kconfig tool for embedded systems for a while, like busybox, or the very nice antares setup for esp8266 systems ().

So why not use the linux kernel config for hardware designs? Since I’ve been working with System on Chip (SoC) implementations of various kinds, I’ve kept shooting myself in the foot with a lot of maintenance work, caused by all kinds of different configurations. For example, I’d like to have a SoC where peripherals can be configured freely, or using another soft core. Merging all these projects into a single setup with *one* configuration entity to touch, turned out to be a bit of a challenge.

So far we used XML to describe the entire SoC. This generates all the necessary peripheral decoders and address maps automatically, like with the various system builder tools from the big FPGA vendors. But what if there is an entire family of SoCs even running on FPGAs from different vendors? Even then, an XML file would have to be written for each platform configuration. No good. Plus, the XML does not help you much in selecting the source files, unless you export Makefiles on the fly. No good idea either.

A better approach is, to actually specify *one* omnipotent hardware setup in the XML device description and use kconfig to turn on/off the components or even specify the number of instantiations, for example of several UARTs.

So what’s left to do?
kconfig covers up the export of its CONFIG parameters using various backends. There are already backends for:

  • Makefiles
  • C headers

Missing is the support for VHDL. This is currently done by a makefile hack which exports the configuration into a global_config.vhdl file. This is used as a package from every relevant HDL design file.

Now there comes up another problem: If units can be turned on or off, the component interface (I/O pin layout) will change. VHDL does not have any preprocessing functionality for the full power of conditional compilation, however, every decent developer system has a program which can do: The C preprocessor.

So for the top level SoC module which directly instances the peripheral I/O module with a conditional pin mapping, we use VHDL code decorated with the #ifdef statements known from C. These VHDL files have a .chdl suffix. In the Makefiles, there is simply a rule to convert this .chdl file to .vhdl. Done. We just need to figure out the proper Makefile rules and make sure everything is resynced to source files changes upon calling a “make all”.

The real stuff

So, how does it work?

menuconfig

When you call the classic “make menuconfig” inside our SoC project, you would see the above configuration screen. Everything else works pretty much linux’ish.

For defining new hardware peripherals, you still have to do a few redundant extra steps:

  1. Edit the XML file with the register map of the peripheral
  2. Define the peripheral address mapping in the XML file
  3. Add the configuration options to the perio/Kconfig file, just like in Linux
  4. Adapt the kconfig->VHDL makefile script (vhdlconfig.mk)
  5. Of course: Implement the core for your peripheral to be instanced by the soc_mmr_perio module.

Step (4) could be simplified by writing a specific VHDL backend for kconfig. This is just “nice to have”, so maybe someone else would want to step in?

devdesc/XML for hardware

netpp/devdesc is close to having its 10th birthday. It’s our own XML language to describe devices and has been in service for quite a bit for Internet of Things applications. It turned out that not only existing hardware can be described with it, but a full SoC design can be generated at very little overhead. The gensoc tool, making heavy use of netpp technology, creates the full peripheral module including decoders and peripheral instances (like UART0, UART1, PWM, …) from a system description XML file.

As mentioned, we don’t want to write plenty of similar looking XMLs for each SoC variant. It is much easier to decorate the existing XML which contains all HW definitions with processing commands. Their purpose is, to only emit XML nodes to a specific target description when the corresponding CONFIG_<module> variable is set.

In short, we create a stripped down XML file, describing the specific target, from a XML family file containing the whole peripheral superset. The Target XML file is then used to generate the HDL via gensoc.

The software side

Every SoC of course has some built-in software – a bootloader or a bare metal program doing stuff. This code is typically written in C or assembly and makes quite some happy usage of the kconfig output as well. For instance, an “autoconf.h” header is exported that contains all the macro defines for the configuration. So you can enable test routines or hardware drivers as usual.

Every SoC should come with proper debugging features. It helps a lot, when developing new SoC peripherals and drivers, to make use of some regression testing scripts that can run inside GDB or as a remote procedure call solution inside the simulation.

For example, we use the same framework to generate a wrapper such that all peripheral registers can be accessed through a python script, like:

r.GPIO_DIR.set(0xfffffff0)

for i in range(100):
    r.GPIO_SET.set(0xffff)
    r.GPIO_CLR.set(0xffff)

This is rather exclusive to the simulation, when running on the target, writing GDB scripts is typically the best option.

The python script does not have to be aware of any hard addresses, so testing systems can be set up for entire families of SoCs with differing address bases. For a specific hardware target however, the GDB scripts containing the register mappings have to be explicitely regenerated.

Conclusion

The kconfig tool boosts the SoC development quite a bit and makes things really clean and more robust. The entire system with the XML description is still a bit from being perfect, but it just works on plenty of platforms using the typicall (free) developer tools.