
netpp for Windows quick start

Here’s a short step-by-step intro to installing and running netpp on Windows 10 (it works likewise on older Windows versions):

  1. Download Python 2.7 (32 bit) and install it. Make sure to install it for all users, otherwise the netpp installer will not find the Python directory and will complain.
  2. Download the netpp installer and run it.
  3. Run the example server via Start Menu->netpp->Run example server. A firewall warning may appear, asking you to unblock the service. After that, a small example netpp server is listening on your local machine.
  4. Start the IDLE environment via Start Menu->Python 2.x->IDLE. Read more in detail below…

If you did not install Python before installing netpp, the netpp installer will throw a warning, but will still continue.

First steps

Windows 10 netpp session

When you have started IDLE, first try to import the netpp module as shown in the screenshot. Then make a connection to your local example server using the .connect() method.

The .sync() command creates a local property tree with root node ‘r’. On the first call, the device server is queried for all available properties, which can take a long time on some systems. Once the query has completed, the tree is stored in a cache and is only reloaded if the device properties have changed.

Note the message “using PWD for storage”. If no cache directory has been created, the cache is placed in the current program’s working directory, which may not be what you want. Create the folders ‘.netpp/cache’ in your home directory and the warning will go away. If you ever need to delete the cache files manually, that is where you will find them.
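
If you prefer to create that cache location from within Python rather than by hand, a small sketch like the following will do (the path layout follows the description above):

    # Create the netpp cache folders below the user's home directory
    import os

    cache_dir = os.path.join(os.path.expanduser("~"), ".netpp", "cache")
    if not os.path.isdir(cache_dir):
        os.makedirs(cache_dir)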

Next, we are going to look at the properties. This is simply done the Pythonic way using dir().

And last, we obtain a property value using the .get() method. Simple as that.
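
Putting the above together, a minimal session could look like the sketch below. The target string and the property name Container.Test are assumptions for illustration; dir() on the root node tells you which names actually exist on your example server.

    # Minimal netpp session sketch (example server running on this machine)
    import netpp

    dev = netpp.connect("TCP:localhost")   # target string may differ on your setup
    r = dev.sync()                         # build or load the cached property tree
    print(dir(r))                          # list the available properties
    print(r.Container.Test.get())          # read a property value (name assumed)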

Using PowerShell

There are two command line utilities:

  • master: Simple command line demo tool for netpp access
  • netpp-cli: interactive netpp client

Open up a PowerShell (or a legacy cmd.exe) and change to the directory where your netpp binaries are installed, e.g.

cd "c:\Program Files (x86)\section5\netpp\bin"

The master.exe is a very primitive command line tool for netpp device query. When run without arguments, it will display the available hubs and send out a broadcast on the local network for attached netpp devices. If your example server is running, you will see it listed. Try accessing it:

.\master TCP:192.168.56.1

and it will list the device’s properties.

The netpp-cli.exe is an interactive console with a bit more caching functionality and a session character, i.e. once you have made a connection, the device reserves a session for you until you exit the CLI. This operation mode may be required on more complex devices that work session-based.

Make a connection to a device:

.\netpp-cli TCP:192.168.56.1

At the netpp prompt, type ‘?’ for help. You can then read a property value simply by typing its name, e.g. Container.Test. Appending a value to the property name sets it, likewise.
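
A short session could then look like this: ‘?’ prints the help, typing Container.Test reads that property, and appending a value (here an arbitrary 1) writes it. The ‘>’ prompt is only a stand-in for whatever the CLI actually displays.

    .\netpp-cli TCP:192.168.56.1
    > ?
    > Container.Test
    > Container.Test 1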

Process viewer/browser support

To use the predefined GUI process control based on the free pvbrowser (see its github repository), you need a demo setup based on a modhub or another embedded Linux demo setup running a pvhub server. This assumes a netpp node setup with the default design.

  1. Start the pvbrowser on your PC
  2. Enter the URL
    pv://modhub/netpp:UDP:192.168.0.5:2016

    into the pvbrowser for a direct connection to the netpp node (assuming the default IP at 192.168.0.5). Replace ‘modhub’ with the IP address of your pvhub server host.

  3. The process viewer should display something like below:
PVbrowser netpp node example

 

More information

 

  • netpp HOWTO [PDF]
  • API reference OpenSource v0.3x [HTML]

 


netpp node evaluation platform

netpp node [Spartan6-LX9]

I am glad to announce a new user evaluation platform module called ‘netpp node’. Its motto is ‘IoT on FPGA done right’:

netpp node preview rendering
  • Integrated high speed UDP stack
  • dagobert network SoC with various configurable interfaces (SPI, UART, PWM, I2C) and pin multiplexing options
  • SDK: GCC, GDB hardware debugger
  • Up to 6 analog I/O channels with ‘analog population option’
  • Piggy backs onto a custom I/O PCB using two 2×16 pin headers
  • In-Field upgrade option (reprogramming of user images with fallback scenario)

The default firmware runs a fully functional netpp stack for remote control and measurement.

Its main applications:

  • Reliable computing (safety relevant applications with tamper-safe main loop watchdogs)
  • Guaranteed real-time response in the network for scalable applications (100s of units). Performance outline:
    • Up to 1200 netpp continuous property requests per second verified
    • Push-on-Demand streaming of arbitrary dataports (high speed ADC) at maximum network bandwidth (10 MByte/s)
  • Simple DSP applications and smart analog measurement (low power, filtering and differential options)
  • Evaluation of next generation ZPU architecture for embedded GNU style developers
netpp node alternate view
Update [11.9.]:
netpp node PCB prototype

First prototypes are finished and running 24/7 on the test bench at this moment.

Things are going very smoothly so far; only a minor capacitor change will be required for the v0.1 series.

Preliminary documentation

Analog I/O

ADC10 low level control

For analog I/O, U3 on the board is by default populated with an MSP430G2553, functioning as a smart ADC that is controlled from the Dagobert SoC via I2C. All relevant ADC configuration registers are directly accessible via netpp. For instance, we can access the low level registers through a process browser panel as shown above to play with the parameters. The process view panel automatically updates the volatile properties from the netpp peer device. The ADC10 variant of the netpp node provides up to six analog channels, internally sampled at up to 200 ksps. In the synchronous acquisition configuration (SPI master), only five channels can be used.
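
As a rough sketch, the same low level registers can of course also be poked from Python over netpp. The property names below (ADC10.CTL0) are invented for the example and will differ in the real device tree, and the write counterpart is assumed here to be .set().

    # Hypothetical ADC10 register access over netpp (property names are examples only)
    import netpp

    dev = netpp.connect("TCP:192.168.0.5")  # address of the netpp node (example)
    r = dev.sync()
    print(r.ADC10.CTL0.get())               # read a low level configuration register
    r.ADC10.CTL0.set(0x10)                  # write a new configuration value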

Differential 16 bit sigma-delta ADC

SD16 analog input

The alternate population option with an MSP430F2013 provides a 16 bit Sigma-Delta ADC with differential inputs and a programmable gain amplifier. This variant provides three different input channel configurations using the analog input pins provided on this board. Moreover, the internal temperature is available on a separate channel.

‘Push on demand’ data streaming

By default, the analog sensors are polled, i.e. a measurement value is delivered upon request by the master. For synchronous sampling however, a ‘push’ strategy may be desired, where a netpp node delivers a value stream to a data logger or database. This can be done via netpp (with the netpp node acting as a master); however, for high speed data transfers (‘network scope’), a low overhead UDP stream is more desirable. The dagobert SoC features a data port option with programmable slots to stream I/O channels as well as analog values using a standard real time protocol with 90 kHz time stamps.
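
On the receiving end (the data logger), such a pushed stream can be picked up with a plain UDP socket. The sketch below only unpacks the standard RTP header fields (sequence number and the 90 kHz timestamp); the port number is an assumption and the payload layout of the slots is device specific.

    # Minimal receiver sketch for a pushed UDP/RTP value stream
    import socket, struct

    PORT = 5004  # assumed stream port, configure to match the netpp node
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", PORT))

    while True:
        packet, addr = s.recvfrom(2048)
        # standard 12 byte RTP header: bytes 2..3 sequence, bytes 4..7 timestamp (90 kHz)
        seq, ts = struct.unpack("!HI", packet[2:8])
        payload = packet[12:]  # slot data as configured in the data port
        print("seq %d ts %d payload %d bytes" % (seq, ts, len(payload)))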

Monitoring netpp packet performance

Packet behaviour in a real network is measured using the Wireshark protocol analyzer.

The figure below shows an example netpp transaction log that the netpp node handles at very low CPU overhead, based on direct register accesses. The red bars show the effective number of query responses using somewhat inefficient ping-pong requests. The performance can be increased by accumulating data into larger buffer properties.

For I2C or SPI transactions however, the packet rate is expected to be considerably lower.

For high speed performance such as MJPEG video streaming, a separate UDP/RTP queue can be set up within the firmware to reach maximum throughput. However, there is no handshaking with this method.

The image below shows a repeated property query from within Python. The pauses are introduced by an external disturbance (stress test) that causes a packet drop and makes the netpp engine time out and re-synchronize.

Python property query session
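
The session in the screenshot boils down to a loop like the following. The target address and property name are placeholders, and the exact exception raised on a timeout depends on the netpp Python binding, so the error handling is only schematic.

    # Repeated property polling sketch; on a timeout the netpp engine resyncs
    import time
    import netpp

    dev = netpp.connect("TCP:192.168.0.5")   # example netpp node address
    r = dev.sync()

    for i in range(1000):
        try:
            r.Container.Test.get()           # property name is a placeholder
        except Exception as e:               # timeout / resync case
            print("query failed: %s" % e)
            time.sleep(0.5)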

Improved RX/TX queue

With an improved packet FIFO on the FPGA, I was able to crank up the number of netpp requests per second, as shown in the Wireshark trace below. This test makes sure that several netpp clients can poll the netpp node at high frequencies without disturbing each other. The blue trace is a repeated poll of the full property tree, the red bars are the timed queries from a process viewer daemon. With no other disturbance, we get occasional drops (e.g. at 45 s and 101.5 s) due to the queue running full.

 

In-Field/System update

The default boot loader firmware supports self-programming over the cable. That means the netpp node can be supplied remotely with a new firmware image via a simple upgrade procedure over netpp. If the uploaded image is faulty, the system falls back to the default boot loader. However, if the new design itself has errors, the system will be unable to recover unless the reset button is pressed.

Test procedures

As the full model of this design is available for simulation, we can effectively verify the system against stress situations. In particular, network safety is of utmost importance. The test procedure checklist for the dagobert SoC:

Completed

  • ARP and ping flooding
  • netpp packet performance test
  • Broken packet handling
  • Lost interrupt scenario (packet queue desynchronization)

In process

  • Jumbo packet flooding

 


JPEG encoding on FPGA [revisited]

Although considered ‘ancient’ because it was invented in the early nineties, the JPEG standard is far from dead or superseded. Its basic methods are still up to date for modern video compression.

For low latency image streaming, we developed our own system-on-chip encoder solution ‘dorothea’ in 2013. It is based on the second generation L2 (a tag referring to ‘two lane’) JPEG engine, allowing JPEG compression of YCbCr 4:2:2 video at full pixel clock.

The ‘dorothea’ SoC is now superseded by the new ZPUng architecture, allowing more microcode tricks than the previous MIPS based SoC. It is available as the reference design ‘dombert’ (see SoC design overview) for UDP streaming up to 100 Mbps; optionally, third party 1G cores can be deployed as well for more throughput.

L2 example videos

These example videos were taken by direct capture of the UDP video stream (as coming from the camera). The direct Bayer to YUV422 conversion is implemented in a microcode engine and may still show visible artefacts; also, color correction is not implemented for this demo. For the live videos, an MT9M024 sensor on the HDR60 development kit has been used. A bit file for the HDR60 kit is available on request.


DMA Autobuffering techniques

For high speed DMA throughput where you want the least interruptions, several higher end CPU cores deploy scatter-gather style DMA engines to avoid slow memory copying. Let me give an example:

  1. Assume you would want to compose a network package and stream it out to a media access controller IP core (MAC)
  2. You have one Ethernet/IP/UDP/RTP header or alike and you are not planning to change it much during a streaming session
  3. Your payload might be in memory, but could also come from an external data port (such as video data)
DMA descriptor scheme

The DMAA core (part of the cCAP families starting with ‘d’, like dombert) allows setting up descriptors in a special shared (and fast) dual port memory such that the DMAA engine can stream almost continuously to a peripheral. Likewise, descriptors can be set up for input streams (peripheral to memory), if required. The image above should speak for itself. You start the DMA engine by writing the descriptor address into the DMAA_START_DESC register. Once the transaction has completed, an IRQ is fired (which will for example allow you to update the sequence number of an RTP packet). Meanwhile, the DMAA engine fetches the NEXT descriptor pointer (next()) and streams the payload data pointed to by ptr. You just have to make sure the IRQ routine does not waste too much time doing other things if you reuse the same header in the following packet.

Note the control bits: an IRQ will for example only be fired when the corresponding IRQ bit is set. The EN bit tells the DMA engine to keep going after the current descriptor. The entire transaction stops when the last descriptor has the EN bit set to 0.

In the example for a MAC controller, you might want to append several chunks of data before actually issuing an Ethernet packet. How would the MAC know how long the packet actually is, without explicit (and timing critical) writing of length registers? This is taken care of by the FL (FLUSH) bit. Once a DMA transaction has completed and the FL bit is set, the packet is flushed in one go to the MAC and sent out immediately. The DMA engine waits until the data has been transmitted to the Packet FIFO and then resumes with the next descriptor.
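
To make the chain traversal concrete, here is a purely software stand-in written in Python. The field names follow the description above, but the actual in-memory layout, field widths and bit positions in the dual port RAM are defined by the DMAA core, not by this sketch.

    # Schematic model of a DMAA descriptor chain (illustration only, not firmware)
    class Descriptor(object):
        def __init__(self, ptr, count, EN=1, IRQ=0, FL=0, next_desc=None):
            self.ptr = ptr              # pointer to the payload data in memory
            self.count = count          # number of bytes to stream from ptr
            self.EN = EN                # keep going after this descriptor
            self.IRQ = IRQ              # fire an interrupt when this chunk is done
            self.FL = FL                # flush the assembled packet to the MAC
            self.next_desc = next_desc  # the next() descriptor pointer

    def stream_to_peripheral(ptr, count):
        print("stream %d bytes from 0x%x" % (count, ptr))

    def fire_irq():
        print("IRQ: e.g. update the RTP sequence number for the next packet")

    def flush_packet():
        print("FLUSH: hand the complete packet to the MAC, send immediately")

    def dmaa_run(desc):
        # Walk the chain the way the engine does, in software
        while desc is not None:
            stream_to_peripheral(desc.ptr, desc.count)
            if desc.IRQ:
                fire_irq()
            if desc.FL:
                flush_packet()
            if not desc.EN:
                break                   # last descriptor: the transaction stops
            desc = desc.next_desc

    # One header descriptor chained to a payload descriptor (sizes are examples)
    payload = Descriptor(ptr=0x2000, count=1400, EN=0, IRQ=1, FL=1)
    header  = Descriptor(ptr=0x1000, count=54, next_desc=payload)
    dmaa_run(header)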

In fact, this concept requires very little logic overhead and can run at high speeds even on small FPGAs (with a reasonable amount of memory buffers shared between the DMAA and the CPU core).

High speed streaming

When streaming high speed raw data, you might not want the CPU to compose packets from external sources; better to stream them directly from the peripheral to the MAC in an interleaved way. For the above payload packet descriptor, the alternative data input method is selected by setting the DP (DATAPORT) bit. In this case, the ptr and inc attributes of the descriptor are ignored and ‘count’ plus one bytes are taken from the dataport input FIFO. If there is a premature end to the data, a CANCEL action can occur. The DMA will then stop (ignoring the EN bit) and the user space program can react accordingly by checking whether the DMAA engine is active. The effectively written number of bytes can be read from a DMA_CURCOUNT register and, if necessary, padding action can occur.

Complex DMAA setup

Likewise, receiving packets of a priori unknown sizes is possible using this approach. A receive DMA IRQ handler just checks this register and fills in the effective packet size in the receiver packet queue which is later polled by ‘user space’ (this being the bare metal main loop, typically).

This way, data rates up to the theoretical maximum can be achieved. The rest is a matter of configuring the right packet and FIFO sizes. Below you can see a Wireshark packet graph for a regular packet burst (30 frames per second); during the burst, the throughput is cranked up to the maximum possible.

Wireshark Packet burst

Summary

Advanced DMA capabilities are easy to implement in FPGA SoCs and make high speed transfers possible even with simple and slow CPUs.

Some reference applications:

  • Almost-zero latency streaming of compressed video over RTP (Real-time Transport Protocol)
  • Signal Analyzer and trace units (‘digital scopes’)