Pi Day – Sigma Lambda Pi

Sigma Lambda Pi is the perfect thing to talk about on Pi Day (3/14)!

This crazy machine is a cluster of 16 Raspberry Pi 4 boards in a 2U rack server, built to run FaaS (Function as a Service) with a green-IT approach. Don’t dream of a Raspberry Pi high-performance demonstration: you would be disappointed, and that is not the purpose of this project. It is not a commercial product; the company behind it was aiming at research, team building and skills improvement. It was made by friends of mine working at Be|ys, a team of 9 people led by Christophe Prugnaud. They demoed it at the Clermont’Tech API Hour #46, and the video will be available soon.

Inside the machine

So this team had the crazy idea of building a cluster of Pi 4 boards and putting it in a 2U rack under industrial constraints: the architecture must be reliable, each Pi must be replaceable independently in production, and so on. On top of this, the Pi 4 boards can be physically powered on and off, so that the power consumption of the box adapts to the load.
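To make the "power follows the load" idea concrete, here is a minimal sketch of how one could decide how many Pi boards to keep powered. This is not the team's implementation: the function name, the per-node capacity and the 25% headroom are all hypothetical values chosen for illustration.

```python
import math

TOTAL_NODES = 16  # worker Pi boards in the rack (from the post)


def nodes_needed(requests_per_s, capacity_per_node, min_nodes=1, headroom=0.25):
    """Return how many Pi boards should be powered for a given load.

    All parameters are hypothetical: `capacity_per_node` is the number of
    requests per second one Pi is assumed to handle, and `headroom` keeps
    some spare capacity so a small burst does not overload the cluster.
    """
    wanted = math.ceil(requests_per_s * (1 + headroom) / capacity_per_node)
    # Never go below the minimum, never above what is physically installed.
    return max(min_nodes, min(TOTAL_NODES, wanted))


if __name__ == "__main__":
    for load in (0, 100, 400, 10_000):
        print(f"{load:>6} req/s -> {nodes_needed(load, capacity_per_node=50)} Pi powered on")
```

A real controller would probably add some hysteresis, so that boards are not switched on and off at every small variation of the load.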

inside the Sigma Lambda Pi

In the center of the appliance, you can see the 16 Pi boards, connected to the internal network and the internal power supply. On the top left, you have the internal switch and a 17th Pi used as a router for the inbound/outbound network traffic, plus a power measurement module. On the right side, at the top, you can see an Arduino used to control the fan speed and the front display. In the middle and at the bottom, there is an internal NAS with 4 x 250 GB of storage, controlled by an 18th Pi.

The result is an appliance with 64 CPU cores @ 1.5 GHz, 64 GB of RAM and 518 GFlops of computing power, with 1 TB of SSD storage and a 1 Gbps network, for a power consumption of around 100 W.
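These totals are simple multiplications. Here is a small sketch of the arithmetic, assuming each Pi 4 is the 4 GB model (the post only gives the 64 GB total) and taking the 518 GFlops and 100 W figures as quoted:

```python
# Back-of-the-envelope totals for the appliance.
WORKER_COUNT = 16      # worker Pi boards
CORES_PER_PI = 4       # the Pi 4 has a quad-core Cortex-A72
RAM_GB_PER_PI = 4      # assumption: inferred from 64 GB / 16 boards
SSD_COUNT = 4
SSD_GB = 250
TOTAL_GFLOPS = 518     # figure quoted in the post
POWER_W = 100          # approximate, for the whole box

total_cores = WORKER_COUNT * CORES_PER_PI
total_ram_gb = WORKER_COUNT * RAM_GB_PER_PI
total_storage_gb = SSD_COUNT * SSD_GB
gflops_per_pi = TOTAL_GFLOPS / WORKER_COUNT
# The 100 W also feeds the switch, the two extra Pi, the NAS and the fans,
# so this is an upper bound on what a single core really draws.
watts_per_core = POWER_W / total_cores

print(f"cores: {total_cores}, RAM: {total_ram_gb} GB, storage: {total_storage_gb} GB")
print(f"GFlops per Pi: {gflops_per_pi:.1f}, W per core: {watts_per_core:.2f}")
```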

To get a feel for how this compares with a standard server, I ran a benchmark tool (lmbench) that is easy to compile and run on different platforms. I ran it on a single Pi 4 and on a Proxmox VM hosted on one of my bare-metal servers, an Opteron 4334 with 6 cores @ 3.5 GHz (a bit old). Nothing can be concluded from such test conditions; the point is only to see the distance between the platforms. Finally, I added a third test on a bare-metal Core i7. All tests run on 1 core.

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host      Mhz  null null      open slct sig  sig  fork exec sh
               call I/O  stat clos TCP  inst hndl proc proc proc
--------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
rpi       1500 0.82 0.49 3.91 6.95 8.98 0.92 8.38 649. 1880 3925
amd       3075                4.76      0.57 1.88 254.  813 2590
Mac       4290 0.22 0.40 2.97 8.87 14.1 0.34 1.56 300. 1548 3691
***

Basic uint64 operations - times in nanoseconds - smaller is better
------------------------------------------------------------------
Host                 OS int64  int64  int64  int64  int64
                         bit    add    mul    div    mod
--------- ------------- ------ ------ ------ ------ ------
rpi       Linux 4.19.58  0.670                 88.8   20.7
amd       Linux 3.10.    0.33                  11.94  4.88
Mac-mini- Darwin 19.3.0  0.160        0.0800   10.0   9.6100

*** 
Basic double operations - times in nanoseconds - smaller is better
------------------------------------------------------------------
Host                 OS  double double double double
                         add    mul    div    bogo
--------- ------------- ------  ------ ------ ------
rpi       Linux 4.19.58 2.6700 2.6700   12.0   10.7
amd       Linux 3.10.   1.6300 1.6300    8.8    3.2
Mac-mini- Darwin 19.3.0 0.9300 0.9300   3.34    0.5
***

Context switching - times in microseconds - smaller is better
------------------------------------------------------------------
Host         OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                 ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
-- ------------- ------ ------ ------ ------ ------ ------- -------
rpi  Linux 4.x  5.9900 6.2900 5.6700 5.4200 6.7400 6.09000    11.3
amd  Linux 3.10   15.8   15.6   16.7   19.8   19.2    20.4    19.3
Mac  Darwin 19. 1.7500 2.1600 2.1900 2.6400 3.0700 2.82000     3.3
***

Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
------------------------------------------------------------------
Host    Mhz   L1 $   L2 $    Main mem    Rand mem    Guesses
-----   ---   ----   ----    --------    --------    -------
rpi   1500 2.6680 6.6720        15.8       256.9
amd   3075 1.3020 6.5150        22.6       221.4
mac   4290 0.9300 2.7890        18.2        82.0

In comparison, system-level performance on AMD/Intel is clearly better, by about 2x, even with a VM and an old kernel. Computation on AMD/Intel is about 2x to 5x faster (but the clock frequency is also 2x higher).

Context switching on the rpi is about 3x faster than on the amd VM, but 2x to 3x slower than on the bare-metal i7 Mac. Main memory latency is also better, and not far from the modern i7 architecture.
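To check these orders of magnitude, the ratios can be recomputed from a few rows of the lmbench tables above. This is only a sketch: the values are copied by hand from the tables, and the helper name is mine.

```python
# Values copied from the lmbench tables above (single-core runs).
# Every metric here is "smaller is better".
results = {
    "fork proc (us)":   {"rpi": 649.0, "amd": 254.0, "mac": 300.0},
    "exec proc (us)":   {"rpi": 1880.0, "amd": 813.0, "mac": 1548.0},
    "double add (ns)":  {"rpi": 2.67, "amd": 1.63, "mac": 0.93},
    "ctxsw 2p/0K (us)": {"rpi": 5.99, "amd": 15.8, "mac": 1.75},
    "main mem (ns)":    {"rpi": 15.8, "amd": 22.6, "mac": 18.2},
}


def rpi_slowdown(metric, host):
    """Ratio rpi / host: above 1 the Pi is slower, below 1 the Pi is faster."""
    row = results[metric]
    return row["rpi"] / row[host]


if __name__ == "__main__":
    for metric in results:
        print(
            f"{metric:18s} rpi vs amd: {rpi_slowdown(metric, 'amd'):.2f}x"
            f"   rpi vs mac: {rpi_slowdown(metric, 'mac'):.2f}x"
        )
```

A ratio below 1 on the context switch and main memory rows is what makes the Pi look interesting here; the fork, exec and floating point rows go the other way.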

So, in my view, it should not be totally irrelevant for FaaS, where memory access, fork and context switching are used intensively, especially with regard to the power consumption.

Multiple challenges

This work was not about building a high-performance cluster; it was about tackling a range of technical challenges as a team.

For example, the team decided not to use SD cards for the Pi local storage, because their reliability is too low. The Pi boards therefore had to boot over the LAN, which requires flashing the Pi bootloader and was an opportunity to learn how to do that.

Then, as they wanted to run a lightweight K8S cluster to dispatch the workload across the Pi boards, they had to build a light 64-bit Linux distribution running Docker. Nothing like this existed yet, so they created GenePi, which is now available on GitHub.

There were also multiple technical challenges, such as managing the fan speed according to the temperature and measuring the power consumption in real time. All of this is custom design and software running on a Pi or an Arduino.
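To illustrate what "fan speed according to the temperature" can look like, here is a minimal sketch of a linear fan curve. The team's actual firmware runs on an Arduino and is not shown in this post; the function name and all the thresholds below are hypothetical.

```python
def fan_duty(temp_c, t_min=45.0, t_max=70.0, duty_min=0.2, duty_max=1.0):
    """Map a temperature in degrees Celsius to a PWM duty cycle in [0, 1].

    Hypothetical curve: the fan idles at `duty_min` below `t_min`, runs at
    `duty_max` above `t_max`, and ramps linearly in between.
    """
    if temp_c <= t_min:
        return duty_min
    if temp_c >= t_max:
        return duty_max
    fraction = (temp_c - t_min) / (t_max - t_min)
    return duty_min + fraction * (duty_max - duty_min)


if __name__ == "__main__":
    for temp in (30, 45, 57.5, 70, 85):
        print(f"{temp:>5} C -> fan at {fan_duty(temp) * 100:.0f}%")
```

Keeping a non-zero minimum duty is a design choice of this sketch: some airflow is maintained even when the box is cool.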

There were also interesting mechanical design challenges in making all the pieces fit correctly inside the 2U rack. 3D printers ran for hours 🙂

As an example, this structure is the Pi rack: it holds the 16 RPi 4 boards and allows each of them to be replaced easily in the cluster.

So many things were made and learned during this experience that it would take too long to detail them in a post. I encourage you to take a look at the API Hour presentation (in French), which will be published soon.

Next Steps

The project is now ready for the next phase: deploying the OpenFaaS software on it to make it a scalable, low-power solution, with Pi boards shutting down automatically according to the load. To be continued!
