
High level synthesis Part 1

Creating hardware elements in Python using its generator methods leads us to a few application scenarios that make complex development fun and less error prone. Let’s start with a simple use case.

Counter abstraction

Assume you have a complex design using a bunch of FIFOs and plenty of counters. Wait, what kind of counters? The first implementation, for debugging reasons, is likely to begin with a description of a binary (ripple) counter, as long as there is no cross clock domain transition involved (we deal with this below). So, you’d simulate, find it’s running correctly, and drop the design into the synthesizer toolchain, only to realize there is going to be a bottleneck with respect to speed or logic usage. Then one might recall that there is not only one way to count: an LFSR implementation, for instance, uses less logic than a ripple counter and cycles through some interval in which each value occurs only once.

So, having started out with a VHDL design, you’d have to identify every counter instance, change it into an LFSR, and deal with the translation of decimal start and stop values into their LFSR counterparts (note that, unlike for a Gray code, there is no direct conversion formula for this). Using the Counter abstraction from the cyrite library, this plays nicer with some early planning.

The Counter class represents a Signal type that behaves like a usual signal and allows adding or subtracting a constant as usual, within a @genprocess notation or an @always function in MyHDL style. So we’d instantiate Counters instead of signals, but introduce no further abstraction; it will all still look quite explicit:

c = Counter(SIZE = 12, START_VALUE = 0)
...

@always_seq(clk.posedge, reset)
def worker():
    if en:
        c.next = c + 1

However, when it comes to changing to a different counter implementation, we’d want to swap them all at once using one flag. So one would typically end up with a class design where the desired counter type is provided at the initialization stage, and the counter abstraction will generate the rest. Then, inside the Python ‘factory’ class, we’d spell:

class Unit(Module):
    def __init__(self):
        self.Counter = LFSRX
...
    @block_component
    def unit0(clk: ClkSignal, r: ResetSignal, a : Signal, b : Signal.Output):
        c = self.Counter(...)

        @always_seq(clk.posedge, r)
        def worker():
            ...

Now here comes the catch: The above worker process should not change, but the + 1 operation really only applies to a binary counter. Plus, a specific counter component such as an LFSR module will have to be implicitly inferred. Here’s where Python’s metaprogramming features kick in: By overriding the __add__ method, we can return a generator object creating HDL elements, plus apply rules that make sure only applicable counting steps are allowed (for instance, a Gray counter always counts up or down by one).
Behind this is a Composite class that provides mechanisms to combine hardware block implementations (hierarchically resolved) with a resulting signal.
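To make the idea a bit more tangible, here is a minimal plain-Python sketch of such an operator override. It is not the cyrite implementation: the class names `CounterBase`, `BinaryCounter`, `GrayCounter`, `LFSRCounter` and `Increment` are made up for illustration, and `Increment` merely stands in for the generator object that would create the HDL elements.

```python
class Increment:
    """Hypothetical stand-in for a generator object emitting HDL elements."""
    def __init__(self, counter, step):
        self.counter = counter
        self.step = step

    def __repr__(self):
        return f"{type(self.counter).__name__}.step({self.step:+d})"


class CounterBase:
    # Counting steps an implementation supports (assumed rule set)
    ALLOWED_STEPS = ()

    def __init__(self, size):
        self.size = size

    def __add__(self, step):
        # Reject steps the underlying hardware cannot perform
        if step not in self.ALLOWED_STEPS:
            raise ValueError(f"{type(self).__name__} cannot count by {step}")
        return Increment(self, step)

    def __sub__(self, step):
        return self.__add__(-step)


class BinaryCounter(CounterBase):
    def __add__(self, step):
        # A binary counter accepts any constant step
        return Increment(self, step)


class GrayCounter(CounterBase):
    ALLOWED_STEPS = (1, -1)


class LFSRCounter(CounterBase):
    ALLOWED_STEPS = (1,)


print(GrayCounter(12) + 1)
try:
    LFSRCounter(12) - 1
except ValueError as err:
    print("rejected:", err)
```

The worker code `c + 1` stays the same for all three types; only the object returned, and the rule check in front of it, differ.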

Here’s how they play in detail (allow a few minutes maximum for the binder to start up):

https://mybinder.org/v2/gh/hackfin/myhdl.v2we/verilog?urlpath=lab/tree/examples/composite_classes.ipynb

When it comes to LFSRs, there are a few issues. Let’s recap:

  • LFSR generators are given a known good polynomial creating a maximum-length sequence of 2**n - 1 numbers, where n is the number of register bits. See also LFSR reference polynomials.
  • The base LFSR feedback must not start with all zeros or all ones, depending on implementation
  • There is no direct function providing the LFSR-specific representation of a decimal value. It has to be iterated.
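The three points above can be tried out with a few lines of Python. This is a sketch under stated assumptions: a 4 bit Fibonacci style register shifting left, with feedback taps at bit positions 3 and 2 (a commonly tabulated maximum-length choice for n = 4). The function names are made up for this example.

```python
def lfsr_step(state, taps, nbits):
    """Advance a Fibonacci LFSR by one step (shift left, feedback into bit 0)."""
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1
    return ((state << 1) | fb) & ((1 << nbits) - 1)


def lfsr_sequence(start, taps, nbits):
    """Collect all states until the register returns to its start value."""
    seq = [start]
    state = lfsr_step(start, taps, nbits)
    while state != start:
        seq.append(state)
        state = lfsr_step(state, taps, nbits)
    return seq


def lfsr_encode(n, start, taps, nbits):
    """LFSR representation of the n'th count: found by iterating n times."""
    state = start
    for _ in range(n):
        state = lfsr_step(state, taps, nbits)
    return state


NBITS = 4
TAPS = (3, 2)   # assumed tap positions for this sketch

seq = lfsr_sequence(1, TAPS, NBITS)
print("period:", len(seq))
print("contains zero:", 0 in seq)
print("all-zero state is stuck:", lfsr_step(0, TAPS, NBITS) == 0)
```

With XOR feedback the all-zero state maps onto itself, which is why it must not be used as start value in this variant.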

In a FIFO scenario, you might usually require a counter covering all values in the interval [0, 2**n-1]. Therefore, you’ll have to generate an LFSR variant with an extra feedback that includes both the zero and the all ones state. This is covered by the LFSRX implementation.
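One well known way to add such an extra feedback is to invert the feedback bit whenever all bits below the MSB are zero, which splices the missing state into the cycle. The sketch below shows this technique for the same assumed 4 bit register as above (taps at bits 3 and 2); whether LFSRX is built exactly like this is not claimed here, and `lfsrx_step` is a name invented for this example.

```python
def lfsrx_step(state, taps, nbits):
    """One step of an LFSR with extra feedback covering all 2**n states."""
    mask = (1 << nbits) - 1
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1
    # Extra feedback: invert when all bits below the MSB are zero.
    # This redirects 1000 -> 0000 and 0000 -> 0001 for n = 4.
    if (state & (mask >> 1)) == 0:
        fb ^= 1
    return ((state << 1) | fb) & mask


NBITS = 4
TAPS = (3, 2)   # assumed tap positions, MSB must be one of them

states = [0]
s = lfsrx_step(0, TAPS, NBITS)
while s != 0:
    states.append(s)
    s = lfsrx_step(s, TAPS, NBITS)

print("period:", len(states))
print("covers full interval:", sorted(states) == list(range(2 ** NBITS)))
```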

Normally, you’ll have *in* and *out* counters that are compared against each other, so there is no need to determine the value of the n’th state. However, when generating a video signal that way, for instance, this is different, because a specific, configurable interval has to be accounted for. So we need some software extension that cycles the LFSR n times in order to get the corresponding value. This is viable for small n, but will take plenty of time for larger n.

The situation would be even worse when a reverse look-up is required. Then, we’d have to store all values in a large table. Again, not an option for large n. The improved solution, however, lies somewhere in between: a bit of storage and a bit of iterating, until the value is found.
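A possible reading of this ‘in between’ approach is a checkpoint table: only every k’th state is stored, and the reverse look-up steps forward from the given value until it hits a stored one. The following is a sketch of that idea, not the library’s code; the function names, the stride and the 4 bit register with taps at bits 3 and 2 are assumptions.

```python
def lfsr_step(state, taps=(3, 2), nbits=4):
    """One step of the assumed 4 bit Fibonacci LFSR."""
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1
    return ((state << 1) | fb) & ((1 << nbits) - 1)


def build_checkpoints(start, stride):
    """Store every stride'th state with its index; also return the period."""
    table = {}
    state, index = start, 0
    while True:
        if index % stride == 0:
            table[state] = index
        state = lfsr_step(state)
        index += 1
        if state == start:
            return table, index


def index_of(value, table, period):
    """Reverse look-up: iterate forward until a stored checkpoint is hit.

    The value must be part of the sequence, otherwise this never terminates.
    """
    steps = 0
    while value not in table:
        value = lfsr_step(value)
        steps += 1
    return (table[value] - steps) % period


table, period = build_checkpoints(start=1, stride=4)
print("stored entries:", len(table), "of", period)
print("index of state 9:", index_of(9, table, period))
```

The stride trades storage against look-up time: the table holds roughly period / stride entries, and a look-up takes at most stride iterations.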

Assuming this correspondence mapping is solved, we can finally tackle the comparison against this value: Since we already introduced plenty of abstraction, overriding the __eq__ special method creates no further pain. We check against an integer and translate it ad hoc into the corresponding state’s value.
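As a rough illustration of that translation step, consider the sketch below. The classes `LFSRSignal` and `Compare` are hypothetical and not taken from cyrite; `Compare` only stands in for whatever comparison element the real library would generate, and the register is again the assumed 4 bit LFSR with taps at bits 3 and 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compare:
    """Hypothetical stand-in for a generated HDL comparison element."""
    signal: str
    state: int


class LFSRSignal:
    def __init__(self, name, taps, nbits, start=1):
        self.name = name
        self.taps = taps
        self.nbits = nbits
        self.start = start

    def encode(self, n):
        """Translate count n into the LFSR state by iterating n steps."""
        state = self.start
        for _ in range(n):
            fb = 0
            for t in self.taps:
                fb ^= (state >> t) & 1
            state = ((state << 1) | fb) & ((1 << self.nbits) - 1)
        return state

    def __eq__(self, other):
        if isinstance(other, int):
            # Compare against the translated state, not the plain integer
            return Compare(self.name, self.encode(other))
        return NotImplemented

    # Defining __eq__ drops the default hash, so restore it explicitly
    __hash__ = object.__hash__


c = LFSRSignal("c", taps=(3, 2), nbits=4)
print(c == 3)
```

The user still writes `c == 3`; the comparison that ends up in hardware is against the third state of the sequence.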

This way it is possible to abstract counter elements and swap them against each other – provided they support the counting methods (LFSR: one up only, Gray: One up, one down).

Synthesis

The question might arise: What’s that got to do with High Level Synthesis? (HLS)

Well, this is only Part 1. What we understand as HLS so far (the term being dominated by Xilinx) is that we can drop in a formula written in a sequential programming language and get it rolled out into hardware. There’s some massive intelligence behind that, such as attempts to detect known elements and somehow infer a clock/area optimized netlist of primitives. However, this is not what we intended: HLS should give maximum control, from a high level perspective, over what’s created during synthesis. This can basically be covered using the mechanisms above: depending on the target, the same hardware description rolls out into target specific elements. We can leave the optimum inference to the synthesis/mapper intelligence, or we can decide to impose our own design rules and, for instance, infer specific (truncated or intentionally fuzzy) multipliers for certain AI operations.

HDL issues

The ongoing, everlasting and omnipresent discussion whether hardware design language authoring is considered ‘programming’ or ‘describing functionality’ has obviously reached a certain annoyance level in the communities, which often split themselves into hardware and software developers. The Xilinx HLS marketing does not take the tension out of the situation, as it suggests that a software developer can simply drop Matlab or C code into a tool and get the right thing out of it. I believe this is the wrong approach, as it will never bridge the gap between software and hardware development.

Eventually, let’s face it: VHDL *is* a programming language, as it was designed as one: to model behaviour. It is not a description language like XML, as it gives little means to specifically describe the wanted result in an abstracted way. I.e. unlike with XML, practice shows that there is no way to self-verify or to emit towards synthesis according to specific design rules, and it is very hard to extend the language with your own data types, although the architecture for derivation is implemented.

On the other hand, it is a massive trial and error game to determine which construct can be processed by what tool, let alone anything to do with simulation. The VHDL language complexity has led many commercial developments to poor or incorrect language support in the past.

So, back to HLS: Why does Python, as a programming language, get us there? Answer: it’s the built-in features:

  • Meta programming (overriding of operations)
  • Generator concepts (yield), etc.
  • Self-parsing: AST analysis, Transpilation

Using generator constructs, we are still programming. But we are led to a different way of thinking with respect to creating elements, and the tool itself (the HLS kernel) will tell us early what is allowed for synthesis, and what only for simulation. That way, a Python coder is effectively taught how to describe.
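How such early feedback can come about is easy to sketch with Python’s standard `ast` module. The example below is a toy, not the HLS kernel: the rule that calls to `print`, `open` or `input` are ‘simulation only’ is an assumption made up for this illustration, as is the function name.

```python
import ast

# Hypothetical rule set: calls that make sense in simulation only
SIM_ONLY_CALLS = {"print", "open", "input"}


def simulation_only_constructs(source):
    """Return (line, name) pairs of calls that would not be synthesizable."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in SIM_ONLY_CALLS:
                findings.append((node.lineno, node.func.id))
    return sorted(findings)


SOURCE = '''
def worker():
    if en:
        c.next = c + 1
        print("count", c)
'''

for line, name in simulation_only_constructs(SOURCE):
    print(f"line {line}: '{name}' is allowed in simulation only")
```

The source is only parsed, never executed, so the check can run before any hardware is generated.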

To be continued in Part 2..


Hardware generation and simulation in Python

There are various approaches to Python HDLs, some more suited to Python developers than to HDL developers. They all have one thing in common: The very refined test bench capabilities of the Python ecosystem, which allow you to connect almost everything to everything. Of all these Python dialects, myHDL turns out to be the most readable and sustainable language for hardware development. Let me outline a few more properties:

  • Has a built-in simulator (limited to defined values)
  • Converts a design into flattened Verilog or VHDL
  • Uses a sophisticated ‘wire’ concept for integer arithmetic

In a previous post, I mentioned experiments with yosys and its Python API. Not much has changed on that front, as the myHDL ‘kernel’ based approach turned out to be unmaintainable for various reasons. Plus, the myHDL kernel has a basic limitation due to its AST translation into target HDLs that impedes code reusability and easy extensibility with custom signal types.

For experiments with higher level synthesis, such as automated pipeline unrolling or matrix multiplications, a different approach was taken. This ‘kernel’, if you will, can handle the legacy myHDL data types plus derived extensions. This works as follows:

  • Front end language (myHDL) is slightly AST-translated into a different internal representation language (‘myIRL’)
  • The myIRL representation is executed within a target context to generate logic as:
    • VHDL (synthesizable)
    • RTL (via pyosys target)
    • mangled Verilog (via yosys)
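The principle of executing one representation within different target contexts can be shown with a deliberately tiny toy. None of the names below (`Assign`, `VHDLTarget`, `VerilogTarget`, `emit`) belong to myIRL; they are invented for this sketch, which only handles a plain signal-to-signal assignment.

```python
from dataclasses import dataclass


@dataclass
class Assign:
    """Toy intermediate representation node: dst is driven by src."""
    dst: str
    src: str


class VHDLTarget:
    def assign(self, dst, src):
        return f"{dst} <= {src};"


class VerilogTarget:
    def assign(self, dst, src):
        return f"assign {dst} = {src};"


def emit(ir, target):
    """Execute the same representation within a given target context."""
    return [target.assign(node.dst, node.src) for node in ir]


ir = [Assign("b", "a")]
print(emit(ir, VHDLTarget()))
print(emit(ir, VerilogTarget()))
```

The description is written once; the target context alone decides what comes out.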

Now the big omnipresent question is: Does that logic perform right? How to verify?

  • The VHDL output (hierarchical modules) is imported into the GHDL simulator and can be driven by a test bench. The test bench is also generated as a VHDL module. Co-Simulation support is currently not provided.
  • The Verilog output can be simulated with iverilog, however, Co-Simulation is not enabled for the time being for this target
  • The RTL representation is translated to C++ via the CXXRTL back end and is co-simulated against the Python test bench. Note that support for signal events is rudimentary. CXXRTL targets speedy execution with defined values (no ‘X’ and ‘U’)

Instead of using classic documentation frameworks, the strategy chosen was again to use Jupyter Notebooks running in a Jupyter Lab environment. Again, the Binder technology enables us to run this in the cloud without the requirement to install a specific Linux environment. The advantages:

  • Auto-Testing functionality for notebooks in a reference Docker environment
  • Reduced overhead for creating minimum working examples or error cases

This Binder is launched via the button below.

Launch button for myhdl emulation demos

Overview of functionality:

  • Generation of hardware as RTL or VHDL
  • Simulation (GHDL, rudimentary CXXRTL)
  • RTL display, output of waveforms
  • Application examples:
    • Generators (CRC, Gray-Counter, …)
    • Pipeline and vector operations
    • Extension types (SoC register map generation, etc.)

Yosys synthesis and target architectures

The open source yosys tool finally makes it possible to drop a reference tool chain flow into the cloud without licensing issues. This is particularly interesting for new, sustainable FPGA architectures. A few architectures have been under scrutiny for ‘dry dock’ synthesis without actually having hardware.

In particular, a reference SoC environment (MaSoCist) was dropped into the make flow for various target architectures to see:

  • How much logic is used
  • If synthesis translates into the correct primitives
  • If the entire mapped output simulates correctly with different simulators

The latter is a huge task that could only be somewhat automated using Python. Therefore, the entire MaSoCist SoC builder will slowly migrate towards a Python based architecture.

More documentation, in particular on several of these architectures, is expected to follow.

As an example, a synthesis and mapping step for a multiplier:

Limitations

As always with educational software, some scenarios don’t play. The restrictions in place for this release:

  • Variable usage in HDL not supported
  • Custom generators, such as Partial assignments (p(1 downto 0) <= s) or vector operations not supported in RTLIL
  • Limited support for @block interfaces
  • Thus: No HLS alike library support through direct synthesis (yet)

Exploring CXXRTL

CXXRTL by @whitequark is a relatively fresh simulator backend for yosys, creating heavily template-decorated C++ code compiling into a binary executable simulation model. It was found to perform quite well as a cythonized (compiled Python) back end driven from a thin simulator API integrated into the MyIRL library.

Since it requires its own driver from the top, a thin simulator API built on top of the myIRL library takes care of the event scheduling, unlike GHDL or icarus verilog which handle delays and delta cycling for complex combinatorial units. It is therefore still regarded as a ‘know thy innards’ tool. A few more benefits:

  • Allows distributing functional simulation models as executables, without the requirement to publish the source
  • Permits model-in-the-loop scenarios to integrate external simulators as black boxes
  • Eventually aids in mixed language (VHDL, Verilog, RTL) and many-level model simulations

There are also drawbacks: Like the MyHDL simulator, CXXRTL is not aware of ‘U’ (uninitialized) and ‘X’ (undefined) values; it knows 0 and 1 signals only. It is therefore not suitable for a full trace of your ASIC’s reset circuitry without workarounds. Plus, CXXRTL only processes synthesizable code and would not provide the necessary delay handling for post place and route simulation.
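Why this matters for reset tracing can be seen in a small comparison. The sketch below models an AND gate over the values ‘U’, ‘X’, ‘0’ and ‘1’, following the std_logic ‘and’ table restricted to these four values, and contrasts it with a plain two-valued evaluation. It is an illustration only and says nothing about CXXRTL internals; the function name is made up.

```python
def and4(a, b):
    """AND over 'U', 'X', '0', '1' (std_logic 'and' table, four values only)."""
    if a == '0' or b == '0':
        return '0'          # a driven zero dominates
    if a == 'U' or b == 'U':
        return 'U'          # uninitialized propagates
    if a == 'X' or b == 'X':
        return 'X'          # unknown propagates
    return '1'


# A register that was never reset, gated by an enable signal
reg_multi_valued = 'U'      # multi-valued simulator: visibly uninitialized
reg_two_valued = 0          # two-valued simulator: silently starts at 0

print("multi-valued output:", and4(reg_multi_valued, '1'))
print("two-valued output:  ", reg_two_valued & 1)
```

In the multi-valued case the missing reset shows up at the output as ‘U’, whereas the two-valued model yields a perfectly plausible 0.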

Co-Simulation: How does this play with MyHDL syntax?

This is where it gets complicated. MyHDL allows a subset of Python code to be translated to Verilog or VHDL such that you can write simple test benches for verification that run entirely in the target language.

Then there’s the co-simulation option, where native Python code (featured by the myHDL ‘simulator kernel’, if you will) runs alongside a compiled simulation model of your hardware. The simplest setup is basically a circuit or entire virtual board with only a virtual reset and clock stimulus. Any other simulation model, such as a UART, an SPI flash, etc. can be connected to such a simulation with more or less effort. The big issue: Who is producing the event, who is consuming it? This leads us back to the infamous master and slave topic (I am aware it’s got a connotation).

The de-facto standards aiding us so far in the simulator interfacing ecosystem:

  • VHDL: VHPI, VHDLDIRECT, specific GHDL implementations
  • Verilog/mixed: VPI, FLI
  • QEMU as CPU emulation coupled to hardware models

The easiest to handle may be the VPI transaction layer that is already present for myHDL. In this implementation, it uses a pipe to send signal events to the simulation and reads back results through another, reverse path. Here, myHDL plays a clear master role. For GHDL, an asynchronous concept was explored via my ghdlex library, allowing distributed co-simulation across networks where master and slave relationships become fuzzy.

Finally, the CXXRTL method provides the most flexibility, as we can add blackbox hardware that does just something. We have full control here over a simple C++ layer without any overhead induced through pipes. The binding for Python can easily be created using Cython code. However, it requires a clear separation of test bench code from hardware implementation.

This implies:

  • Test bench must be written in myHDL syntax style and needs to use specific simulation signal classes
  • Extended bulk signal/container classes re-usage is restricted
  • Hardware description can be in any syntax or intermediate representation, as well as blackbox Verilog or VHDL modules

Links and further documentation

As usual in the quickly moving open source world, documentation is sparse and solutions on top of it are prone to become orphanware once the one man bands retire or lose interest. However, I tend to rate the risk very low in this case. Useful links so far (hopefully, more will be found soon):

Disclaimers

  • Recommended for academic or private/experimental use only
  • The pyosys API (Python wrapper for libyosys) may at this moment crash without warning or yield misleading feedback. There’s not much being done about this now as updates from the yosys development are expected.
  • Therefore, jupyter notebooks may crash and you may lose your input/data
  • No liability taken!