Sigfox, from the POC to the prod

Sigfox is a really nice technology when you want to run a quick IoT experiment: the time to Hello World is less than 5 minutes, and it makes everything really easy.

That said, after the POC comes production, and the way you manage your Sigfox backend for production is not the way you build your quick & dirty front-end platform for the POC.

This post will introduce how to build your production platform and the difficulties you need to consider. I'll propose some architectural solutions I've put in place, but they are only one of the ways to implement it. I won't detail the pre-packaged PaaS solutions as I'm not a big fan of them: in my point of view, they primarily address the POC situation, but this is only my own opinion.

Capture [all] the Sigfox messages

This is usually where we start building our own platform because it is the first need we have. Regarding the device life-cycle, this is not totally logical, but I'll start with this step because it is the one we want to focus on.

The main difficulty is not to miss any Sigfox messages. There are two ways to capture Sigfox messages: push & pull.

Pulling messages

Sigfox pull architecture based on API use

You can request messages from the backend through the Sigfox API. This solution has some advantages, like not having to expose and secure a service, but it has numerous limits:

  • The API is per device, so you need a large number of calls as soon as you manage a large fleet of objects
  • The API does not let you manage a from/to time range, so you need to handle paging in your code
  • There is no trigger when a message arrives, so you won't be able to anticipate; you need to call the API regularly and process the same data over and over. You spam the Sigfox API and your own server
  • There is no way to manage downlink communication in real time, as you are not synchronized with the device transmission.

As a consequence, it's not a good idea to base your capture on this solution. It's better to use it, for example, to handle callback failures, like reloading lost messages. Personally, I preferred to do it differently.
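For the cases where pulling is still useful, like reloading lost messages, the loop can be sketched in plain Java. The endpoint path and the paging.next field below reflect my reading of the Sigfox v2 API and should be treated as assumptions to verify:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Minimal polling sketch. The endpoint path and the paging "next" field are
// assumptions based on the Sigfox v2 REST API; verify them before use.
public class SigfoxPoller {

    // Naive extraction of the "next" page URL from the JSON body;
    // a real implementation would use a JSON library (Jackson, Gson...).
    static String nextPageUrl(String json) {
        int p = json.indexOf("\"next\"");
        if (p < 0) return null;
        int start = json.indexOf('"', json.indexOf(':', p) + 1) + 1;
        int end = json.indexOf('"', start);
        return (start > 0 && end > start) ? json.substring(start, end) : null;
    }

    public static void poll(String apiLogin, String apiPassword, String deviceId) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String auth = "Basic " + Base64.getEncoder()
                .encodeToString((apiLogin + ":" + apiPassword).getBytes());
        String url = "https://api.sigfox.com/v2/devices/" + deviceId + "/messages?limit=100";
        while (url != null) {
            HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                    .header("Authorization", auth).GET().build();
            String body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
            // ... store the messages contained in body ...
            url = nextPageUrl(body); // follow paging until exhausted
        }
    }
}
```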

Pushing messages

Sigfox push architecture based on callback

This is the usually preferred callback approach: when a message arrives from a device on the Sigfox backend, Sigfox calls one (or more) callback URLs on your own server to push you the data.

This technique is real-time, allows downlink, and offers more than just pushing the message. You can choose to receive messages one by one, at the risk of heavy traffic on your backend (we are talking about devices communicating 1% of the time, so "huge" is relative), or you can choose to get a callback at a maximum frequency of one per second containing all the messages received during that second. Personally, I find this second option more complicated to manage; until I have millions of devices I won't spend development time on this extra complexity. So using one callback per message (and one for each duplicate) is easy and good enough.

In this mode, we can use HTTP GET & POST. POST looks like the better choice: even if it is a little more complex to set up, later on it will become a great help (the complexity comes with device type management). A year ago it was harder to debug because the backend did not show the POST body, but this has been fixed, so there is no more reason to use GET.
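As a minimal sketch of the POST capture endpoint, here is a plain-JDK receiver (no Spring), assuming a custom callback configured with a JSON body template built from the standard Sigfox variables ({device}, {time}, {seqNumber}, {data}); the /capture/sigfox path is just an example:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal callback receiver sketch, assuming a body template such as:
//   {"device":"{device}","time":{time},"seqNumber":{seqNumber},"data":"{data}"}
public class CallbackReceiver {

    // Naive field extraction; a production backend would use a JSON mapper.
    static String field(String json, String key) {
        int p = json.indexOf("\"" + key + "\"");
        if (p < 0) return null;
        int start = json.indexOf(':', p) + 1;
        while (start < json.length() && (json.charAt(start) == ' ' || json.charAt(start) == '"')) start++;
        int end = start;
        while (end < json.length() && json.charAt(end) != '"' && json.charAt(end) != ',' && json.charAt(end) != '}') end++;
        return json.substring(start, end);
    }

    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/capture/sigfox", exchange -> {
            String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            String device = field(body, "device");
            String data = field(body, "data");
            // ... decode and persist (device, data) here ...
            exchange.sendResponseHeaders(204, -1); // 204: no downlink answer needed
            exchange.close();
        });
        server.start();
        return server;
    }
}
```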

The problem with the push method is handling failures:

  • No doubt, at some point, your server will crash. If it's not because of you, it will be because of your hoster, but believe me, it will crash
  • For sure you will need to manage releases and stop your service, and even if you are THE BOSS of blue-green deployment, you will end up crashing sometimes.

So it means that you need a solution for managing these situations:

  • The first thing to put in place is high availability: you can do it with a load balancer and multiple capture backend servers. This is the easiest and simplest way to implement it. The second option is to declare different callbacks towards your different servers and handle the deduplication of entries in your database on the backend side. This option is more complex in the end but removes the need for a load balancer.
  • The second thing is to have a totally different capture infrastructure as a backup. Let's detail this point:

This is the best solution I found to answer the two failure points identified above. On top of my HA capture service, I have a totally different software solution used to back up the messages. While the capture solution analyzes/decodes the data, the backup solution only stores it as raw information. This backup solution is really simple: initially, it was a simple PHP script storing the JSON body in a text file. Then I decided to change it for something better.

Basically, this backup component is simple in terms of code and in terms of infrastructure, so it has no reason to fail and no need to be updated. In any case, if it crashes, the main capture will not, and the data will still be received. To ensure this backup does not crash together with the main capture stream, I decided to deploy it in another data center on another server… The best option would be to deploy it with a different cloud hoster.

This backup service offers raw storage and an API. This API can be called by the main capture backend to resynchronize its data. The right question now is: why not use the Sigfox message API for this? The answer is simple: it was really easy for me to format the data coming from my backup exactly as if it came from the Sigfox callback, so I needed no extra development to manage it. With the Sigfox API I would have had to create a different interface with a different level of information than what I have in my callback.

As an example, in my callback I'm adding some global variables like the device protocol version related to my device firmware. This information is not part of the API but is important for me. Later I may have more information to add, and the API will not support it.

With this solution, I can schedule capture synchronization or simply run this resync process after a capture outage.
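Whether messages come from several capture servers or from a resync against the backup, insertion must be idempotent. A minimal deduplication sketch keyed on (device, seqNumber):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the deduplication needed when several capture paths can deliver
// the same message: a message is identified by (deviceId, seqNumber), the
// sequence number being incremented by the device at each transmission.
public class MessageDeduplicator {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true the first time a (device, seqNumber) pair is offered. */
    public boolean accept(String deviceId, long seqNumber) {
        return seen.add(deviceId + "#" + seqNumber);
    }
}
```

Note that the Sigfox sequence number wraps around, so in practice the key should also take the message timestamp (or a time window) into account.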

The callback push approach is not only about messages coming from the device; it also includes the network messages around billing, errors and downlink. They have to be captured and managed too.

The Sigfox backend now offers to retry messages in case of failure. This feature is interesting but only covers short outages, less than 5 minutes. It can be a good protection against stop/start or load-balancer switch time, but it does not protect against application crashes or releases.

This part is important: once you have captured the information, you need to react on it and start building device management processes. We will see that later.

Sigfox reliable backend architecture example

Manage out-of-order messages

Being able to replay missed messages after a while also means inserting messages out of order. This requires some specific processing to make sure the right actions are taken. As an example, sending an alarm 2 days after it was raised is not a good idea, and sending a callback response on a replay doesn't make sense. So even if the replay capability can follow the same path as the usual capture, it needs some specific processing you must not forget, and you have to integrate it into the first development steps.
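A sketch of this specific processing, with arbitrary example thresholds: before acting on a message, check its age, and never answer a downlink on a replay:

```java
import java.time.Duration;
import java.time.Instant;

// Guard sketch for replayed / out-of-order messages. The maxAge threshold
// is an example value to tune per alarm type.
public class ReplayGuard {

    /** Alarms are only worth raising while they are fresh. */
    public static boolean shouldRaiseAlarm(Instant messageTime, Instant now, Duration maxAge) {
        return !messageTime.isBefore(now.minus(maxAge));
    }

    /** Never answer a downlink on a replayed (non-live) message. */
    public static boolean shouldAnswerDownlink(boolean isReplay) {
        return !isReplay;
    }
}
```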

Manage device type

As soon as you have started to create your callbacks and include the service messages, your device type configuration will start looking like a mess, and duplicating a device type will be your nightmare (trust me, I did it so many times). Even though I tried to convince Sigfox to add this option in the backend, I have failed so far.

By the way, you can automate this part with the API. I will be honest, there is a bunch of lines of code to write: you first need to define your different message formats (personally, my choice was a global common message whatever the callback type, with some constants to identify it), then you need to automate the creation of the device types and of their callbacks through the API. (I'll share some parts of my code soon.)
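As an illustration of this automation, here is a hypothetical helper building the JSON body for creating a URL callback through the API. The field names (channel, callbackType, url, httpMethod, bodyTemplate…) are my reading of the v2 API documentation and should be verified:

```java
// Sketch of a callback definition builder for the Sigfox v2 API.
// All field names below are assumptions to check against the API docs.
public class CallbackDefinition {

    public static String urlCallbackJson(String targetUrl, String bodyTemplate) {
        return "{"
            + "\"channel\":\"URL\","
            + "\"callbackType\":0,"        // assumed: 0 = DATA
            + "\"callbackSubtype\":2,"     // assumed: 2 = UPLINK
            + "\"enabled\":true,"
            + "\"url\":\"" + targetUrl + "\","
            + "\"httpMethod\":\"POST\","
            + "\"contentType\":\"application/json\","
            + "\"bodyTemplate\":\"" + bodyTemplate.replace("\"", "\\\"") + "\""
            + "}";
    }
}
```

You would POST this body to the device type's callbacks endpoint with your API credentials, once per device type and per callback, which is what makes one-click deployment possible.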

In the end, this was working well until Sigfox broke the part of the API related to contract management; it is on the way to being solved (but it is taking too long in my point of view).

So this is an important part: being able to deploy a complex callback configuration in one click. Keep in mind that you will need multiple device types even with a single device model: because this device will have versions you will manage more easily that way, because you will have dev and production devices, because you will have multiple contracts to manage, and why not different clients owning Sigfox contracts directly.

If you do not have this automated, for sure you will forget something during the configuration and lose data or degrade the quality of your data.

You also need to think about deploying multiple device types in relation to your device lifecycle. In my personal experience, I want to be able to stop some device communications, and the easiest way is to move them into a dedicated device type with no callback. You may also want a device type for trashed devices or devices under test. The device type is a good way to organize your devices from a Sigfox point of view, and depending on the device state you can expect different callback targets. You need to start defining your rules with regard to the device lifecycle and then implement the automation around device transfers.

Manage the downlink communication

Downlink communication is a kind of art! You can check my post dedicated to downlink here. Basically, you need to understand that you are never sure of what is happening during the downlink process. You can only be certain that the device received your downlink if you have added a functional feedback. The problem is: if you got nothing back, it doesn't mean that your device did not get it.

Downlink confirmation

Consequently, managing downlinks means managing different statuses, managing the callback service and possibly checking OOB messages from the API (I need to check this: it seems we can get the OOB from the API, but it is not provided by the callback service even if this service exists). You can also estimate that an OOB has been sent by detecting a sequence id jump. So you understand the kind of magic you need to implement.

The key rules are the following:

  • Your device must be able to receive consecutive downlinks with the same value without becoming incoherent. Because of the unknown reception status and the associated retries, the device may receive the same downlink multiple times.
  • You need to manage downlink priority in your backend, since a pending downlink can sometimes take a long time before being processed. Priority must be given to urgent downlinks.
  • You need to manage downlink cancellation, downlink retry, and downlink with no retry.

Once you’ve mastered your downlink management, taking all these points into consideration, you are ready for downlink implementation.
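As a reference point for the implementation, the uplink callback can carry the downlink payload directly in its HTTP response. A sketch of building the documented response format (exactly 8 bytes, hex encoded, keyed by device id):

```java
// Sketch of the downlink answer: when a callback arrives with ack=true,
// the server replies with a JSON body carrying exactly 8 bytes of
// downlink data for the device, hex-encoded.
public class DownlinkAnswer {

    public static String response(String deviceId, byte[] payload) {
        if (payload.length != 8)
            throw new IllegalArgumentException("Sigfox downlink payload is exactly 8 bytes");
        StringBuilder hex = new StringBuilder();
        for (byte b : payload) hex.append(String.format("%02x", b));
        return "{\"" + deviceId + "\":{\"downlinkData\":\"" + hex + "\"}}";
    }
}
```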

From my own experience, the downlink management implementation will impact the firmware design. For that reason, it should be designed with the firmware team in parallel with the device making, not as a downstream step. Otherwise, you won't have a trustworthy downlink system.

You should also carefully consider your duty cycle during the downlink process. With multiple sequential communications for the request, the technical acknowledgement and the functional acknowledgement, you may have to keep quiet for a while.

Manage your devices

Managing the device life cycle is a long story; there are different elements to consider:

  • Manufacturing process
  • Activation process
  • Firmware versioning
  • Current configuration
  • Pending configuration
  • Subscription to the network
  • Subscription to your service

We can consider some major steps for the device:

  • Manufactured (soldered but not yet sold)

At that point, you will register the device in your backend with technical information such as the object type, the firmware version or the default configuration. As the device will not emit all this information (or it will, but you are not sure to get it, due to possible network loss), it is always better to store it, because it changes month after month and in the end you will never remember which version has which settings and which device runs which version.

  • Tested (it has emitted but not yet used a token)

A way to check that the device is working correctly is to send a frame during the test phase and verify it has been received by the network. That is possible if your contract allows you some test frames before starting to consume your token. In this case, it means that your device must be attached to a device type and a contract. The device type used for tests may not be the same as the production one. To manage this, you need to automate the device creation and its attachment to a device type, then switch to the production device type once the test has been validated. All these operations can be automated thanks to the API.

  • Activated (it is activated and consumes a token)

Your customer activated the device and now it communicates data. At this point, the first communication starts consuming the Sigfox subscription. You need to watch the number of free tokens you still have in your contracts and may need to switch to the right contract. To succeed at this step, you need to manage the network error messages and automate the corrective actions, or at least raise alarms to the platform manager.

During this phase, the device may declare itself on the backend, change its configuration and upgrade its firmware. There are many different situations to anticipate and code.

In the activated mode, you should detect crazy or silent devices, manage the battery level to monitor the sensor network, and advise manual operations.

  • Crazy

Sometimes, you can have crazy devices talking every 10 seconds. One of the usual reasons is a battery failure not correctly handled, or a bug in the firmware. Crazy devices generally can't be stopped from talking, but you can move them out of the normal data processing to reduce database spam.
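Detecting them can be as simple as a sliding-window rate check per device; here is a sketch with an arbitrary threshold (a Sigfox device is limited to 140 uplinks a day anyway, so anything well above that pace is suspicious):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a "crazy device" detector: count messages over a sliding
// window and flag the device when the rate becomes absurd.
public class CrazyDeviceDetector {
    private final Deque<Instant> timestamps = new ArrayDeque<>();
    private final int maxMessages;
    private final Duration window;

    public CrazyDeviceDetector(int maxMessages, Duration window) {
        this.maxMessages = maxMessages;
        this.window = window;
    }

    /** Record one message; returns true when the device exceeds the allowed rate. */
    public boolean recordAndCheck(Instant when) {
        timestamps.addLast(when);
        while (!timestamps.isEmpty() && timestamps.peekFirst().isBefore(when.minus(window)))
            timestamps.removeFirst();
        return timestamps.size() > maxMessages;
    }
}
```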

  • Out of contract (its network subscription is terminated)

A token has a Sigfox subscription end date. This should be managed, anticipated and automated for renewal.

  • Out of subscription (your service subscription ended)

Your service may also have a subscription, and you need to manage it. Once your subscription has ended, the device may continue to talk. You can have a function in your firmware to stop it, but basically it can keep talking. The automatic Sigfox token renewal is a good feature to avoid losing connectivity, but if you don't want to pay for talking devices you need to set up your device accordingly. Here you have a tight integration between your device management and the Sigfox device management.

  • Retired (will never be used again)

The device will never emit again. So you probably don't want to list it in your device list; hide it somewhere. This state is easy to manage and relates to the device-type management.

You can imagine a lot of other states, I'm sure. That depends on your functional needs, and each one adds complexity.

Device Configuration

As part of device management, we have the device configuration. This is not a simple part if you look at the future when you start designing your object. A device has a version, and a version will exist for many years. Versions will cover different firmware with different capabilities and bugs, different communication protocols (each firmware version adds functions and frame types), different hardware revisions or different devices.

The device configuration will sometimes change out of your control: the firmware can't be updated over the network, so it will be done manually, and the configuration may be changed directly on the device…

In my opinion, a device first needs to identify itself by sending its configuration and version to the network at boot. This would be perfect if we were not using a lossy-by-design network with far from infinite bandwidth. The reality is that we have basically 12 bytes to announce a device with its HW version, FW version and main configuration parameters. And this frame will be received… or not.

Therefore, the backend must know, before receiving the configuration, what information is expected in this frame. The device will then update this information by sending the frame. Part of the overall configuration will remain unknown and must be estimated from the initial config determined by the FW version and the different configuration changes requested by the backend.

A device can be requested to change its configuration thanks to the downlink mechanism. Under the best conditions, you can reconfigure the whole device with only one downlink request; that's perfect. Usually the number of parameters is higher and you need to send multiple downlinks to change the configuration. Consequently, reconfiguration takes a long time with respect to the duty cycle and your downlink policy. During this time your device will have to manage a shadow configuration (stored but not applied) and your backend will also have to manage the current configuration and the pending one. Once applied, the configuration must be updated on both sides. At this point the downlink confirmation protocol is very important, and for sure you need manual operations in your back/front-end to correct unexpected behaviours, unterminated configurations… From experience, a key point is to be able to change parameters one by one, so that you can re-execute a configuration without having to consider the initial configuration state. That is because you must assume at any point that your backend configuration image may be wrong compared to what actually runs in your device.
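A sketch of this double-sided state, with hypothetical parameter names: the backend tracks the configuration it believes is applied plus the pending parameters, and each functional confirmation moves exactly one parameter across, so the sequence can be re-executed at any time:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of backend-side configuration state: applied vs pending.
// Parameters are changed one by one, so a confirmation moves exactly
// one parameter from pending to applied.
public class DeviceConfigState {
    private final Map<String, Integer> applied = new HashMap<>();
    private final Map<String, Integer> pending = new HashMap<>();

    public void request(String param, int value) { pending.put(param, value); }

    /** Called when the functional feedback confirms one parameter change. */
    public void confirm(String param) {
        Integer v = pending.remove(param);
        if (v != null) applied.put(param, v);
    }

    /** Next parameter to push in a downlink, or null when in sync. */
    public String nextPendingParam() {
        return pending.isEmpty() ? null : pending.keySet().iterator().next();
    }

    public Integer appliedValue(String param) { return applied.get(param); }
}
```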

Configuration management with downlink and multiple stages

Manage device access right, group of users…

Even if it comes at the end of this post, rights management must be considered first. User/group and rights management is also something important. I'll not detail this part because it's quite standard, but there are some elements you can consider:

  • Managing devices means transferring ownership sometimes
  • Transferring devices means cleaning historical data
  • Devices can usually be accessed by multiple users
  • Users usually have multiple devices

Devices raise alerts, users receive alerts. They get them in different ways depending on their preferences.

Data access

Once the data is stored in a backend, it may be consumed by a front-end or by other systems. Front-ends consume APIs (if you try to be a little modern in your development), and APIs can also be exposed system to system.

System-to-system API integration can generate huge traffic if you do not offer the right access endpoints. As an example, if I want a quick device information update, even if the device emits once a day, I need to call the API every X seconds/minutes to know when this update has been made. This consumes a lot. To avoid such a situation, it's good to have an API exposing what has been updated since the last call, or when the information will be refreshed based on the last communication and the configuration. This is the pull approach.
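A sketch of the server-side logic behind such an endpoint: keep a last-update index per device and return only what changed since the caller's timestamp:

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Sketch of a "what changed since my last call" index, so consumers do not
// have to poll every device individually.
public class DeltaIndex {
    private final Map<String, Instant> lastUpdate = new TreeMap<>();

    public void touch(String deviceId, Instant when) { lastUpdate.put(deviceId, when); }

    /** Device ids modified strictly after the given timestamp. */
    public List<String> updatedSince(Instant since) {
        return lastUpdate.entrySet().stream()
                .filter(e -> e.getValue().isAfter(since))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```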

The push approach is better. You can do it in different ways, for example with webhooks (like the Sigfox callbacks): your consumer registers its callback URL and you push the information. That means you should implement retries and API access to ensure failures will be managed.

There are other ways to do push updates. From my point of view, MQTT is the easiest solution: you push your data to a message bus, on a topic the customer subscribes to. The bus keeps the information until the customer consumes it, and the consumer receives the information immediately. (I'm still hoping Sigfox will implement such a feature later; it would make integration simpler and better.)

Even if you build a full-stack service, my advice is: always think about exporting your data as part of the initial design. Even for yourself, it will help you in the future, and it costs little when designed in from the start.

Backend integration with frontend / mobile apps / external customer

Non functional data

Managing a device fleet also means being able to identify the cause of device failures. For this reason, some important parameters must be tracked in the backend to later search for failure correlations.

Among this information, you can track:

  • Battery – to track abnormal level decreases
  • Temperature – to identify conditions of use and correlate with battery discharge, as an example
  • Reboot – to identify watchdog executions, battery changes…
  • Shock – to identify some reboot conditions, communication stops…

Many other pieces of information can be tracked. Sometimes the non-functional data can be larger and more important than the functional data.
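As an example of exploiting this data, a small sketch flagging an abnormal battery discharge rate between two reports (the 1%-per-day threshold in the test is an arbitrary example value):

```java
// Sketch of a non-functional data check: estimate the battery discharge
// rate between two reports and flag an abnormal decrease.
public class BatteryMonitor {

    /** Discharge rate in percent per day between two battery readings. */
    public static double dischargePerDay(double pctBefore, double pctAfter, double days) {
        return (pctBefore - pctAfter) / days;
    }

    public static boolean isAbnormal(double pctBefore, double pctAfter, double days, double maxPctPerDay) {
        return dischargePerDay(pctBefore, pctAfter, days) > maxPctPerDay;
    }
}
```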

To conclude

I hope you've understood that making an IoT platform (this applies to Sigfox as to any other technology) is not something you can do in 5 minutes, and that the questions raised here are important to handle at the start of the design, not as a second step. You can build it iteratively, but you need to build it as a foundation.

To give you an idea, my current backend is about 20,000 lines of Java/Spring Boot code, and I spent 3 full weeks of work on it. It implements about 80% of what has been described above.

Just to let you know.

2 thoughts on “Sigfox, from the POC to the prod”

  1. Hi Paul,

    Disclaimer: working at sigfox, so not neutral 😉

    Thanks for your article, it exposes a lot of situations you have to take into account before going to production and gives good recipes, especially the way to handle backup data and manage device types.

    I would suggest however a few changes/precisions:
    – The Devices / Messages API does handle “since” and “before” URL parameters to return a time window
    – The Device Types / Messages API allows retrieval of messages per device type, not only for one device, even though I fully agree, polling is BAD and should be avoided whenever possible
    – API calls are rate-limited. You cannot send hundreds of API calls in a sustained way without being rejected by the backend, so polling cannot be your choice for getting messages if you plan to sell thousands of devices
    – The “Messages not sent this month” API allows retrieval of messages not handled by callbacks in case of server failure. Not only do you get the exact missed callbacks, but also their content as it should have been sent, so it is quite convenient to reintegrate them into the database. It has a 30-day history, so it should be polled at least every 30 days.
    – You can use the callback mechanism on both live and backup servers, with two distinct URLs. Recovering, on both servers, messages not sent due to an outage, as exposed earlier, will require you to distinguish which one failed before inserting data in the correct database
    – The alert eMail set on the device type will give you a clue whenever an issue exists in the callback chain
    – A message, without retry, can arrive very late after it has been sent. For example, a device can send a message that reaches only one base station, this base station being disconnected due to an Internet shortage in the area. This unconnected base station will keep the received messages until it recovers the network. During this time, the object can move and reach an area where there is no issue with base station connectivity to the backend. Those new messages will be retrieved on time, with correct callback handling, and just trigger a warning on the sequence number for the first one. Then, when the base station that was offline recovers the network, it will de-queue its stored messages, including the one sent earlier by the device. The callback will then reach the customer server(s), without retries, with a big delay compared to the timestamp of the message (cf. {time}), and out of order. In this case, you might want to avoid acting as if it was an on-time event. Thus, not only the sequence number but also the time of reception has to be taken into account
    – Callback retries have always been implemented, but before, only one attempt was done after a failure. Since a recent release, there are two more attempts, so 4 in total. Note that if a server replies with 3xx or 4xx, it will be considered a permanent failure. 5xx will allow retries. Other errors, such as timeouts, will also lead to retries (all those errors currently being classified as 600)
    – Since recently, you have the option to suspend a device (crazy devices) with the suspend/resume option (see the 7.3 release notes)
    – To be able to check OOB messages for downlink acknowledgments using the API or the GUI (they are not forwarded through callbacks), you need a role that most users don’t have: OPT-INTEGRATOR. Without this role, no luck, the OOB for downlink acknowledgments won’t be available. But as explained, if you do not have a break-in-sequence event triggered AND a seqNumber missing in your reception chain, then it’s a deal
    – To test a device, even without a test frames contract, there is a workaround: create a device type linked to a contract without tokens (or with only one that will be lost, as you cannot create device types without a valid contract). Register your device in this device type. When the device successfully sends a message, it will acquire the status ‘OFF-CONTRACT’. Even if you do not get the data payload, you’ll know that your device is able to communicate with the backend. Best thing: you can identify those tested devices using the device event callback ‘OFF-CONTRACT’, then register them again in the ‘verified’ device type

    Thanks again Paul for your post (it will help a lot of people) and your blog: not only did it contain a lot of information, but it made me discover Sigfox before I joined the company 😉
