High level synthesis Part 2

VHDL, Verilog and the like – although generally regarded as high level languages – can in fact take on the character of assembly language when it comes to more complex constructs such as pipelines. Moreover, arithmetic chains of signed and unsigned operations can become obscure:

  • In Verilog, behaviour is implicit, but a formula is easy to read
  • In VHDL, elaboration is explicit, but a formula can result in a complicated casting chain

High level meta programming approaches such as MyHDL have been successful in lifting the arithmetic troubles of conversion, up to the point where the core design exhibits translation flaws. However, a MyHDL-like syntax for HLS extensions has proven to be readable and well maintainable, as shown below.

Pipelining arithmetic formulas

As there is no formal representation for a thing such as a pipeline or a pipelined multiplication in these language approaches, pipelines become harder to maintain at that level. Fully fledged HLS, on the other hand, hides the entire pipeline unrolling from the user and simply turns a formula into an HDL construct. In some cases, this may not be wanted. For instance, an HLS suite is able to unroll a DCT (Discrete Cosine Transform) into HDL, but this will normally result in an unbalanced pipeline, meaning that the DSP units used for multiplication will idle for certain cycles. In the particular case of DCT-like processing, such a pipeline exhibits dual-lane capabilities: two channels can be processed in an interleaved way, getting the most out of the pipeline elements with minimal idle cycles. Such an optimization can only be done via an explicit, manual elaboration.
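
As a purely illustrative sketch of what dual-lane interleaving means, consider two sample channels time-multiplexed onto one multiplier lane. The function below is plain Python and only models the scheduling idea, not any actual HLS output:

def interleave(chan_a, chan_b):
    """Merge two channels into one time multiplexed stream: A0, B0, A1, B1, ..."""
    stream = []
    for a, b in zip(chan_a, chan_b):
        stream.append(("A", a))   # even cycle: lane A occupies the multiplier
        stream.append(("B", b))   # odd cycle: lane B fills the otherwise idle cycle
    return stream

print(interleave([1, 2, 3], [10, 20, 30]))
# [('A', 1), ('B', 10), ('A', 2), ('B', 20), ('A', 3), ('B', 30)]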

The need for an intermediate representation

The Pythonic high level approach starts out simple. You have a software function that you want to turn into a hardware element using a design rule. You would do that as follows (simplified):

@hls.pipeline_unroll(designrule)
def func(a : Data, b : Data, c : Data):
    result = a * b + c
    yield [ result ]

The HLS auxiliaries invoked by the decorator turn this formula into a suitable hardware element, based on the given data type. They take care of the automated fixed point extension according to a given design rule (see the sketch below). Such a simple construct is easy to handle; however, once things get more complicated, such as transformations involving start/stop conditions, more fine grained control is necessary.
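
What such a design rule could encode is sketched below. The class and its attributes are hypothetical and only illustrate the kind of bookkeeping meant here, namely how word widths are allowed to grow through a multiply-add chain:

class DesignRule:
    def __init__(self, frac_bits, max_width):
        self.frac_bits = frac_bits   # fractional bits to preserve through the chain
        self.max_width = max_width   # hard cap before rounding or saturation applies

    def product_width(self, wa, wb):
        # a full precision product grows to wa + wb bits, clipped to the cap
        return min(wa + wb, self.max_width)

    def sum_width(self, wa, wb):
        # a sum needs one extra bit to stay overflow-safe
        return min(max(wa, wb) + 1, self.max_width)

designrule = DesignRule(frac_bits=15, max_width=40)
print(designrule.sum_width(designrule.product_width(18, 18), 18))  # -> 37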

Therefore, an intermediate representation of a pipeline generator function is proposed as follows:

@mypipe.pipeline(clk, ce = ce, pass_in = ce, pass_out = rvalid)
def worker():
    @mypipe.stage
    def mul():
        x.next = a * b
    @mypipe.stage
    def add():
        r.next = c.delay(1) + x

    return mul, add

This explicitly lists each stage. Behind the scenes, we can also do latency accounting for signals. Note, for example, that c is delayed by one cycle in order to match up with the latency of x; a sketch of such accounting follows below. The only drawback in standard Python interpreters is that we need to explicitly return the stage funclets in the right order.
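
A minimal sketch of how such latency accounting could work behind the scenes, assuming each stage adds one register delay. The TrackedSignal class is invented for illustration and is not the @mypipe implementation:

class TrackedSignal:
    def __init__(self, latency=0):
        self.latency = latency      # pipeline cycle at which the value is valid

    def delay(self, n):
        return TrackedSignal(self.latency + n)

a, b, c = TrackedSignal(), TrackedSignal(), TrackedSignal()
x = TrackedSignal(max(a.latency, b.latency) + 1)           # result of stage 'mul'
r = TrackedSignal(max(c.delay(1).latency, x.latency) + 1)  # result of stage 'add'
assert c.delay(1).latency == x.latency  # inputs to 'add' line up in time
print("pipeline depth:", r.latency)     # -> 2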

The worker pipeline above is never idle as long as ce is asserted. However, when there are time slots where the processing needs to switch the source or destination, conditional constructs come into play. For example, if a calculation is to be made only every other sample, a slot variable will typically toggle while the pipeline is filled. Due to the latency of each stage, the slot needs to be checked according to the stage delay, and when another stage is inserted, all these conditional checks have to be revisited. This calls for another abstraction: a Slot.

slot = mypipe.Slot(1) # Two state boolean slot

@always_seq(clk.posedge, reset)
def toggle():
    if ce or rvalid:
        slot.next = ~slot

@mypipe.pipeline(clk, ce = ce, pass_in = ce, pass_out = rvalid)
def worker():
    @mypipe.stage
    def mul():
        if mul.slot(slot):
            x.next = a * b
        else:
            x.next = -a * b
    @mypipe.stage
    def add():
        r.next = c.delay(1) + x

    return mul, add

The .slot method internally determines, according to the stage latency, which value the slot should be compared against, as sketched below. For larger slot sizes, a simple bit array is used instead of a classical binary counter.
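
As a rough illustration of that latency-aware comparison (not the actual implementation, and assuming the slot advances on every accepted pipeline cycle), a stage effectively compares against the slot value that was current when its data entered the pipeline:

def effective_slot(current_slot, stage_latency, num_slots):
    # Data inside a stage of latency L entered the pipeline L cycles ago,
    # when the slot counter was L positions behind its current value.
    return (current_slot - stage_latency) % num_slots

# Example: with 4 slots and the counter currently at 2, a stage of latency 1
# is processing data that entered during slot 1.
print(effective_slot(2, 1, 4))  # -> 1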

Note that the stages determine the explicit relative latency of a signal. Once a slot condition is introduced, you need to make sure yourself that the next stage is getting the correct result from the previous stage and slot.

For instance, when a signal is only assigned to in one slot #n, it is valid in the next stage at the earliest, and only in the slots j > n. The latency check will not detect when you violate that rule.

More formalisms

Using the above approach, fairly homogeneous pipelines can be optimized manually and verified against the original formula in various ways:

  • By evaluating the pipeline hardware elements simply by running the funclets from Python via a specific Evaluator (see the sketch after this list)
  • By emitting the code into HDL and running it through a trusted simulator
  • By compiling the mapped/synthesized elements into executable code and co-simulating them within Python
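
The first of these checks can be as simple as running the stage bodies natively in Python against the reference formula. The sketch below only illustrates the idea; the actual Evaluator interface is not shown and the names are assumptions:

def reference(a, b, c):
    return a * b + c

def evaluate_pipeline(samples):
    # run the stage bodies directly as Python; register delays collapse
    # in this untimed evaluation
    for a, b, c in samples:
        x = a * b   # stage 'mul'
        r = c + x   # stage 'add' (c.delay(1) has no effect here)
        assert r == reference(a, b, c)

evaluate_pipeline([(2, 3, 4), (5, 6, 7)])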

This calls for a few more extensions:

We might want to know from the top level how deep a pipeline has effectively become. Thus, there need to be auxiliaries that determine the latency of a particular pipeline design.

Then, we may have to deal with signal delays outside the reach of our latency accounting, for instance external memories that yield a data word one clock cycle after address assertion.

Finally, we may want to implicitly create delay queues for signals that are not used from within the first stage, since we would not want to explicitly elaborate a delay queue for each buffered signal. This calls for an AutoAdjust construct that automatically inserts a delay queue of the proper length, such that the latencies match. This, however, is beyond the scope of this HLS brainstorm for now and is elaborated in detail in the @mypipe documentation, which will be released at a later point as a Jupyter notebook.

Flexible fixpoint arithmetic

In MyHDL, several approaches have been taken to implement adaptive fixpoint arithmetic (fixbv). In most cases, a manual implementation using an intbv derivative has proven sufficient and straightforward; however, due to issues with converting arithmetic chains, this was never taken to a truly effective result.

For complex pipelines, however, adaptive signal types do make sense from a high level perspective:

  • You define the precision needed in the wire type
  • You pass the Signal based on this specific wire type to the pipeline construct
  • The pipeline unrolls into HDL matching your particular fixpoint arithmetic design rule

For the cyrite dual head approach, the story is a little different than with MyHDL: since it is possible to run a functional description natively as well as to transpile it into a generator representation for output into target languages, a formal evaluation of a pipeline by *native* execution makes it possible to determine the required fixpoint precision for a certain arithmetic operation before emitting it as a hardware representation in the second pass. Remember that transpilation to HDL or direct synthesis also happens by execution, but with a generator character rather than native sequential execution.
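
A very simplified, hypothetical illustration of what such a precision-determining pass could do: evaluate the formula natively over a set of stimuli and derive the signed word width that covers all observed results. This is not the cyrite mechanism, only the underlying idea:

def observed_width(formula, stimuli):
    """Evaluate natively and return the signed bit width covering all results."""
    results = [formula(*args) for args in stimuli]
    lo, hi = min(results), max(results)
    pos_bits = hi.bit_length() if hi > 0 else 0
    neg_bits = (-lo - 1).bit_length() if lo < 0 else 0
    return max(pos_bits, neg_bits) + 1  # plus one sign bit

# a * b + c over the 16 bit signed corner cases
corners = (-(1 << 15), (1 << 15) - 1)
stimuli = [(a, b, c) for a in corners for b in corners for c in corners]
print(observed_width(lambda a, b, c: a * b + c, stimuli))  # -> 32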

This opens up quite a new field of auto-verifying arithmetic HLS methods to make sure that latency and precision are an accurate match to a software prototype that may a priori be based on built-in floating point arithmetic. The verification process then looks like this:

  1. Write software prototype routine in Python with built-in integers, floats, …
  2. Turn it into a bit-accurate sequence using cyrite datatypes and .next style assignments (a toy sketch of this step follows the list)
  3. Unroll such a @hls function into a @pipeline, optimize manually, if necessary
  4. Run Verification:
    • Verify that the pipeline calculates the same results as the prototype, using formal automated verification. This step will determine the actual resources needed for the arithmetic
    • Elaborate into hardware, run through external simulator again, co-simulate against software routine, if necessary
  5. Stick the design into a CI/CD software pipeline to make sure you don’t break things
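
The toy sketch for step 2: a floating point prototype versus a bit-accurate scaled-integer version, compared within a tolerance; the same kind of check is what later runs automatically in CI. The Q15 scaling, the truncation behaviour and the tolerance are assumptions made for this example, not cyrite semantics:

FRAC = 15  # Q15 style scaling, chosen only for this example

def prototype(a, b, c):        # step 1: plain floats
    return a * b + c

def bit_accurate(a, b, c):     # step 2: scaled integers, truncating the product
    ai, bi, ci = (int(round(v * (1 << FRAC))) for v in (a, b, c))
    product = (ai * bi) >> FRAC          # renormalize the Q30 product back to Q15
    return (product + ci) / (1 << FRAC)

for args in [(0.5, 0.25, -0.125), (0.9, -0.7, 0.3)]:
    # a few LSB of slack for input rounding and product truncation
    assert abs(bit_accurate(*args) - prototype(*args)) < 4 * 2 ** -FRAC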

Co-Simulation models ‘2.0’

The classical simulation method we have used so far, via GHDL with VHDL as the transfer language, had several drawbacks:

  • Slow
  • The libgrt.a runtime is subject to the GPL, meaning that VHDL code had to be published in order to comply with open source licensing

The latter is often not an option. Thus, network workarounds were required, tunneling hardware protocols through netpp into a hardware model by co-simulation.

The newer MLIR and CXXRTL approaches remove that problem: they produce executables that can be shipped without source disclosure, as long as they don’t link against GPL code.

The result of this, after some years of settling time:

  • We will gradually release legacy core IP models as co-simulation setups that you can stick into your CI
  • Some models will also allow VHDL or Verilog sources to be output in readable form or as a netlist, depending on the IP core terms.
  • All models will install via Python pip and be driven from a Python main loop, allowing software routines to be verified against the virtualized hardware (a hypothetical usage sketch follows below)
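
The usage sketch for the last point follows. Package, class and method names are invented for illustration and stand in for whatever interface a released model will actually expose; the stub class merely makes the sketch self-contained:

class ExampleModel:
    """Stand-in for a pip-installed, compiled co-simulation model."""
    def __init__(self):
        self.regs = {}
    def reset(self):
        self.regs.clear()
    def write(self, addr, data):
        self.regs[addr] = data
    def step(self):
        pass  # a real model would advance the simulated clock here
    def read(self, addr):
        return self.regs.get(addr, 0)

def main():
    m = ExampleModel()
    m.reset()
    for i in range(16):
        m.write(addr=0x10, data=i)     # drive the virtualized hardware
        m.step()                       # advance one clock cycle
        assert m.read(addr=0x10) == i  # verify the software routine's view

if __name__ == "__main__":
    main()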

How to

So that you do not have to install any software you don’t trust, the entire cyrite ecosystem is supplied as a Docker container with Jupyter Notebook extensions. This allows documentation to be combined with Python code.

The button below will launch a cyrite Binder with a default JupyterLab configuration. This does not yet contain the IP modules. To run a specific IP core release, you first need to download the associated Jupyter notebook via a given link or from the gist collection below (Download ZIP file). Then simply drag it onto the file pane in the JupyterLab IDE and open it by double clicking. Alternatively, you can use git from the command line to pull the file via the ‘Clone via HTTPS’ option (drop-down below the Embed choice).

Typically, such a notebook will perform a pip install procedure and install all necessary dependencies. This may modify the running container, so be aware that running installations on a persistent, local container setup may alter its behaviour.
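
A typical first cell of such a notebook might look like the line below; the package name is a placeholder, not an actual release:

%pip install example-ip-core-model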

Launch button for cyhdl demos

Once the installation is finished, run the notebook cells step by step. The Binder setup will let you modify the code and try things interactively, but note that timeouts will terminate your session without warning upon inactivity, and your changes will be lost. For proper evaluation of the Docker container you will either need to install the Docker environment on your Linux development machine or use a virtual machine setup for Docker on Windows.

Gist collection