VHDL, Verilog and the like – although generally regarded as high level languages – may in fact take on the character of assembly language when it comes to more complex constructs such as pipelines. Moreover, arithmetic chains of signed and unsigned operations can become obscure:
- In Verilog, behaviour is implicit, but a formula is easy to read
- In VHDL, elaboration is explicit, but a formula can result in a complicated casting chain
High level meta programming approaches such as MyHDL have been successful in lifting these arithmetic troubles during conversion, up to the point where the core design exhibits translation flaws. However, a MyHDL-alike syntax for HLS extensions has proven to be readable and well maintainable, as shown below.
Pipelining arithmetic formulas
As there is no formal representation for a construct such as a pipeline or a pipelined multiplication in these language approaches, pipelines become harder to maintain on that level. Fully fledged HLS, however, hides the entire pipeline unrolling from the user and simply turns a formula into an HDL construct. In some cases this may not be wanted. For instance, an HLS suite is able to unroll a DCT (Discrete Cosine Transform) into HDL, but this will normally result in an unbalanced pipeline, meaning that DSP units for multiplication will idle for certain cycles. In the particular case of DCT-alike processing, such a pipeline exhibits dual-lane capabilities: two channels can be processed in an interleaved way, getting the most out of the pipeline elements with minimized idle cycles, as modelled below. Such an optimization can only be done via an explicit, manual elaboration.
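To make the dual-lane idea concrete, here is a plain Python behavioural model of the interleaved scheduling; it is an illustration only, not part of any framework:

from itertools import chain

def dual_lane_products(lane0, lane1, coeff):
    # One shared multiplier serves two interleaved channels: even cycles
    # process a lane 0 sample, odd cycles a lane 1 sample, so the DSP unit
    # computes a useful product every cycle instead of idling every other one.
    stream = chain.from_iterable(zip(lane0, lane1))
    products = [s * coeff for s in stream]
    return products[0::2], products[1::2]   # de-interleave the results

p0, p1 = dual_lane_products([1, 2, 3], [4, 5, 6], coeff = 2)
assert p0 == [2, 4, 6] and p1 == [8, 10, 12]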
The need for an intermediate representation
The Pythonic high level approach starts out simple. You have a software function which you want to turn into a hardware element using a design rule. You would do that as follows (simplified):
@hls.pipeline_unroll(designrule)
def func(a : Data, b : Data, c : Data):
    result = a * b + c
    yield [ result ]
The HLS auxiliaries cast by the decorator turn this formula into a suitable hardware element, based on the given data type. They take care of the automated fixed point extension according to a given design rule, sketched below. Such a simple construct is easy to handle; however, when things get more complicated, such as transformations including start/stop conditions, more fine grained control is necessary.
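As an illustration of what such a design rule could specify, consider this hedged sketch; `hls.DesignRule` and its parameters are assumed names, not a confirmed API:

# Hypothetical design rule: the unroller grows result widths per operation
# so that intermediate results can never overflow.
designrule = hls.DesignRule(
    mul = lambda wa, wb: wa + wb,         # full precision product width
    add = lambda wa, wb: max(wa, wb) + 1  # one guard bit per addition
)
# With 16 bit a, b and a 32 bit c, `a * b + c` would then elaborate into a
# 16x16 -> 32 bit multiplier followed by a 33 bit wide adder.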
For this fine grained control, an intermediate representation of a pipeline generator function is proposed as follows:
@mypipe.pipeline(clk, ce = ce, pass_in = ce, pass_out = rvalid)
def worker():
    @mypipe.stage
    def mul():
        x.next = a * b

    @mypipe.stage
    def add():
        r.next = c.delay(1) + x

    return mul, add
This explicitly lists each stage. Behind the curtains, we can also do latency accounting for signals. Note, for example, that `c` is delayed by one cycle in order to match up with the latency of `x`. The only drawback in standard Python interpreters is that we need to explicitly return the stage funclets in their correct order.
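For completeness, the signals used above could be declared roughly as follows. This is a hedged sketch using MyHDL-style types; the actual framework presumably supplies its own signal classes providing `.delay()`:

from myhdl import Signal, ResetSignal, intbv

clk = Signal(bool(0))
ce, rvalid = Signal(bool(0)), Signal(bool(0))
reset = ResetSignal(0, active = 1, isasync = False)

# 16 bit operands; the product x needs double width, the sum r one more bit
a, b, c = (Signal(intbv(0, min = -2**15, max = 2**15)) for _ in range(3))
x = Signal(intbv(0, min = -2**31, max = 2**31))
r = Signal(intbv(0, min = -2**32, max = 2**32))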
The above pipeline is never idle as long as `ce` is asserted. However, when there are time slots where the processing needs to switch the source or destination, conditional constructs come into play. For example, if a calculation is to be made every other sample, a `slot` variable will typically toggle while the pipeline is filled. Due to the latency delay of each stage, the slot will need to be checked according to the stage delay. When another stage is inserted, all these conditional checks have to be revisited. This calls for another abstraction: a `Slot`.
slot = mypipe.Slot(1) # Two state boolean slot

@always_seq(clk.posedge, reset)
def toggle():
    if ce or rvalid:
        slot.next = ~slot

@mypipe.pipeline(clk, ce = ce, pass_in = ce, pass_out = rvalid)
def worker():
    @mypipe.stage
    def mul():
        if mul.slot(slot):
            x.next = a * b
        else:
            x.next = -a * b

    @mypipe.stage
    def add():
        r.next = c.delay(1) + x

    return mul, add
The `.slot` method internally determines, according to the latency, which value `slot` should be compared against. For greater slot sizes, a simple bit array is used instead of a classical binary counter, as sketched below.
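The following behavioural sketch illustrates that bit array idea; it is not the actual `Slot` implementation, and the constructor semantics (number of slots) are assumed:

class SlotModel:
    """Rotating one-hot register: slot checks become single bit tests."""
    def __init__(self, n):
        self.n = n          # number of slots
        self.state = 1      # one-hot, slot 0 active

    def advance(self):      # rotate left by one, wrapping the MSB around
        msb = self.state >> (self.n - 1)
        self.state = ((self.state << 1) & ((1 << self.n) - 1)) | msb

    def active(self, k):    # a single AND term instead of a comparator
        return bool(self.state & (1 << k))

s = SlotModel(4)
s.advance(); s.advance()
assert s.active(2) and not s.active(0)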
Note that the stages determine the explicit relative latency of a signal. Once a slot condition is introduced, you need to make sure yourself that the next stage is getting the correct result from the previous stage and slot. For instance, when a signal is only assigned to in one slot `#n`, it is valid earliest in the next stage, in the slots `j > n`. The latency check will not detect when you violate that rule, as the following sketch illustrates.
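Reusing the constructs from above (`y`, `q` and `k` are assumed signals), the rule plays out like this:

@mypipe.stage
def produce():
    if produce.slot(slot):     # y is assigned in slot 0 only
        y.next = a * b

@mypipe.stage
def consume():
    # Correct: read y in a later slot, where it is valid. Reading it in
    # slot 0 of this stage again would silently pick up the value from the
    # previous round -- a violation the latency check cannot flag.
    if not consume.slot(slot):
        q.next = y + k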
More formalisms
Using the above approach, fairly homogeneous pipelines can be optimized manually and verified against the original formula in various ways:
- By evaluating the pipeline hardware elements simply by running the funclets from Python via a specific `Evaluator` (sketched after this list)
- By emitting the code into HDL and running it through a trusted simulator
- By compiling the mapped/synthesized elements into executable code and co-simulating it within Python
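The first option could look roughly like this; the `Evaluator` usage is a hedged sketch, not a confirmed API:

ev = mypipe.Evaluator(worker)                  # runs the funclets natively
for va, vb, vc in [(3, 4, 5), (-2, 7, 1)]:
    result = ev.run(a = va, b = vb, c = vc)    # hypothetical call
    assert result == va * vb + vc              # matches the original formula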
This calls for a few more extensions:
We might want to know from the top level how deep a pipeline has effectively become. Thus, there need to be auxiliaries determining the latency of a particular pipeline design.
Then, we may have to deal with signal delays outside the reach of our latency accounting, for instance external memories that yield a data word one clock cycle after address assertion.
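Both could be covered by small auxiliaries along these lines; the names are assumptions for illustration, not existing calls:

depth = mypipe.latency(worker)           # hypothetical query, e.g. 2 cycles

# Declare one clock cycle of externally caused delay, e.g. a synchronous
# RAM that returns a data word one cycle after the address is asserted:
data = mypipe.external_delay(ram_q, 1)   # hypothetical helper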
Then finally, we may want to implicitly create delay queues for signals that are not used within the first stage, without explicitly elaborating a delay queue for each buffered signal. This calls for an `AutoAdjust` construct that automatically inserts a delay queue of the proper length, such that the latencies match. This, however, is beyond the scope of this HLS brainstorm for now and is elaborated in detail in the `@mypipe` documentation, which will be released at a later point as a Jupyter notebook.
Flexible fixpoint arithmetic
In MyHDL, several approaches have been taken to implement adaptive fixpoint arithmetic (`fixbv`). In most cases, a manual implementation using an `intbv` derivative has proven sufficient and straightforward; moreover, due to issues with the conversion of arithmetic chains, the `fixbv` approach was never taken to a finally effective result.
For complex pipelines, however, adaptive signal types do make sense from a high level perspective:
- You define the precision needed in the wire type
- You pass the Signal based on this specific wire type to the pipeline construct
- The pipeline unrolls into HDL matching your particular fixpoint arithmetic design rule (see the sketch after this list)
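A hedged sketch of these three steps, using a `fixbv`-style wire type; the exact constructor signature is assumed:

W = fixbv(0, min = -1, max = 1, res = 2 ** -15)  # 1) precision in the wire type
a, b, c = (Signal(W) for _ in range(3))          # 2) Signals based on it
# 3) passing these Signals to the @mypipe.pipeline construct makes it
#    unroll into HDL satisfying the fixpoint design rule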
For the cyrite dual head approach, the story is a little different from MyHDL: since it is possible to run a functional description natively as well as to transpile it into a generator representation for output into target languages, a formal evaluation of a pipeline by *native* execution allows us to determine the required fixpoint precision for a certain arithmetic operation before emitting it into a hardware representation in the second pass. Remember that transpilation to HDL or direct synthesis also occurs by execution, but in a generator fashion rather than by native sequential execution.
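Conceptually, that two-pass flow could look like this; all calls here are assumptions for illustration:

trace = pipeline.evaluate(stimulus)        # pass 1: run natively, track value ranges
sizes = designrule.derive(trace.ranges)    # derive the fixpoint precision
hdl = pipeline.elaborate(target = 'vhdl', sizes = sizes)  # pass 2: emit sized HDL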
This dual-pass capability opens up quite a new field of auto-verifying arithmetic HLS methods, making sure that latency and precision are an accurate match to a software prototype that may a priori be based on built-in floating point arithmetic. The verification process then looks like this:
- Write a software prototype routine in Python with built-in integers, floats, …
- Turn it into a bit-accurate sequence using cyrite datatypes and `.next` style assignments
- Unroll such a `@hls` function into a `@pipeline`, optimize manually if necessary
- Run verification:
  - Verify the pipeline calculates the same as the prototype, using formal automated verification. This step will determine the actual resources needed for the arithmetic
  - Elaborate into hardware, run it through an external simulator again, co-simulate against the software routine if necessary
- Stick the design into a CI/CD software pipeline to make sure you don’t break things; a minimal sketch of such a check follows
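Here, `run_pipeline` stands in for whichever evaluator hook the framework provides, and the tolerance assumes a Q1.15-style result:

import random

def prototype(a, b, c):            # the floating point reference
    return a * b + c

def test_pipeline_matches_prototype():
    random.seed(0)                 # reproducible in CI
    for _ in range(1000):
        a, b, c = (random.uniform(-1, 1) for _ in range(3))
        # the bit-accurate result must match within the fixpoint resolution:
        assert abs(run_pipeline(a, b, c) - prototype(a, b, c)) < 2 ** -14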