The Helium router (aka console) is the LoRaWAN network server. In a previous post I described how to set up a Helium router / console. In this post I will detail what you can see in the Grafana monitoring dashboard, which will help you better understand how the network processes LoRaWAN packets. We are going to look at what an offer is, what a packet is, and the different monitoring information we can get from the router.
Let’s start with some definitions of the terms we are going to use and an explanation of how Helium traffic management works. What is specific to Helium is that we have on one side the hotspots (LoRaWAN gateways) that receive packets and on the other side the router (LoRaWAN Network Server) that purchases the packets. As these actors have different roles on the blockchain and do different work, there is a transaction between them.
These transactions are recorded in a blockchain, but not the main blockchain. Recording every transaction on the main chain would take too much space, so we create time-limited blockchains between routers and hotspots. These are named State Channels. A State Channel has a duration, expressed in number of blocks, and a maximum amount of DCs corresponding to the transaction fees. DC (Data Credit) is a stable coin used to pay for all LoRaWAN communication. DCs are created by burning HNT.
A transaction has different steps:
- The hotspot receives a LoRaWAN packet.
- In case of a JOIN, the XOR filter is used to identify the router. In case of an UPLINK, the DevAddr is used to identify the router.
- The hotspot makes an offer to the router through the State Channel.
- The router can accept or reject the offer; 1 DC will be given to the hotspot when the State Channel closes.
- When the router accepts the offer, the hotspot sends the packet to the router.
- Optionally, the router can send a message back to the device: this is a downlink or an ACK. It must be sent to the device in one of the RX windows, usually 1s or 2s after the end of the device transmission. Timing is critical.
When the offer is made, only a piece of the packet is proposed to the router. The router can accept one or more offers for a packet, depending on the Multiple Packet configuration of the device.
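To make the multiple packet behaviour more concrete, here is a minimal Python sketch. It is only an illustration and not the router code: the `MultiBuyTracker` class, the `multi_buy` parameter and the use of a packet hash to recognize offers for the same packet are all assumptions of mine.

```python
from collections import defaultdict


class MultiBuyTracker:
    """Hypothetical helper: accept at most `multi_buy` offers per packet."""

    def __init__(self, multi_buy: int = 1) -> None:
        self.multi_buy = multi_buy
        # Assumption: offers for the same packet share the same hash.
        self._accepted = defaultdict(int)

    def handle_offer(self, packet_hash: str) -> str:
        """Return "accepted", or "multi_buy" when the limit is reached."""
        if self._accepted[packet_hash] >= self.multi_buy:
            return "multi_buy"
        self._accepted[packet_hash] += 1
        return "accepted"


# Four hotspots heard the same packet; the device is configured for 2 copies.
tracker = MultiBuyTracker(multi_buy=2)
results = [tracker.handle_offer("abc") for _ in range(4)]
```

With these values, the first two offers are accepted and the next two are rejected with the multi_buy reason, which is the kind of rejection you will find later in the offer graphs.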
The XOR filter is a filter used to route a JOIN packet, based on the DevEUI and AppEUI. The XOR filters are on the main chain, so all the hotspots know them. This makes the routing table compact and fast. The DevAddr values have been acquired by the router owner and are specific to a given router. Multiple devices use the same DevAddr; the encryption/signature mechanisms are used to tell them apart.
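The routing decision described above can be sketched this way. This is a simplified illustration under my own assumptions: a plain Python set stands in for the XOR filter (a real XOR filter is a compact probabilistic structure that can return false positives), a `range` stands in for the DevAddr block of the router, and the `Router` class, the `find_router` function and the EUI / DevAddr values are all made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Router:
    name: str
    join_filter: frozenset   # stand-in for the XOR filter: {(dev_eui, app_eui), ...}
    devaddr_range: range     # stand-in for the DevAddr block owned by this router


def find_router(routers, dev_eui=None, app_eui=None, devaddr=None) -> Optional[Router]:
    """Return the first router matching a JOIN (EUIs) or an UPLINK (DevAddr)."""
    for router in routers:
        if devaddr is not None:
            # UPLINK: the DevAddr tells which router owns the device session.
            if devaddr in router.devaddr_range:
                return router
        elif (dev_eui, app_eui) in router.join_filter:
            # JOIN: no DevAddr yet, so the EUIs are checked against the filter.
            return router
    return None


# Made-up identifiers, for illustration only.
my_router = Router(
    name="my-router",
    join_filter=frozenset({("70B3D50000000001", "70B3D50000000000")}),
    devaddr_range=range(0x48000800, 0x48000808),
)
join_target = find_router([my_router], dev_eui="70B3D50000000001",
                          app_eui="70B3D50000000000")
uplink_target = find_router([my_router], devaddr=0x48000803)
```

A packet that matches no filter and no DevAddr range gets `None`, meaning the hotspot has no router to make an offer to.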
Let’s get started with the monitoring graphs. The first group addresses the duration of the different steps described above:
This graph shows the duration of the overall process, from the packet offer to the router downlink/ACK when there is one. This does not include the time between the hotspot reception and the offer creation. We will see later that this is sometimes long.
There are different types of offer considered in this graph.
- The Join packet, where a downlink (Join Accept) is mandatory. The RX windows for a JOIN are RX1 at 5s and RX2 at 6s. We usually see a duration of 2s: this is the time the router waits for the different offers. For a Join, the router uses multiple packets as much as possible to select the best hotspot to send the Join Accept. Over 6s, the Join Accept would be too late.
- The Packet downlink=true are packets with an ACK or downlink. The RX1 window is 1s after the end of the transmission, RX2 is 2s. It’s important to have a duration under 1s, or you risk not being able to respond. The router also waits for the different offers during the time given by the ROUTER_FRAME_TIMEOUT parameter. The longer the value, the more offers you can get, but the less time you have to reach the RX1 window for the response.
- The Packet downlink=false are packets without ACK or downlink. There is no time constraint on such packets; they are the ones in blue on the above graph. For this reason, the router waits longer to get all the offers before starting the purchasing phase.
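The trade-off between waiting for offers and reaching the RX window is simple arithmetic, sketched below in Python. The function name and all the timing values are illustrative assumptions of mine (they are not measured on a real router, and I am not claiming these are the default settings); only the 1s and 5s RX1 delays come from the text above.

```python
def downlink_margin_ms(rx_delay_ms: int, frame_timeout_ms: int,
                       processing_ms: int, travel_ms: int) -> int:
    """Time left before the RX window opens once the router has waited
    for offers, processed the packet and sent its response to the hotspot."""
    return rx_delay_ms - (frame_timeout_ms + processing_ms + travel_ms)


# Illustrative values only.
uplink_margin = downlink_margin_ms(rx_delay_ms=1000, frame_timeout_ms=500,
                                   processing_ms=50, travel_ms=150)
join_margin = downlink_margin_ms(rx_delay_ms=5000, frame_timeout_ms=2000,
                                 processing_ms=50, travel_ms=150)
too_late = downlink_margin_ms(rx_delay_ms=1000, frame_timeout_ms=900,
                              processing_ms=50, travel_ms=150)
```

A negative result, like `too_late` here, means the response would miss RX1. This is why a longer offer wait is acceptable for a Join (5s window) but has to stay short for a packet with downlink=true.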
The Travel Time is a technical indicator: it is the network time between the hotspot and the router. It really depends on the hotspot involved and the distance between the hotspot and the router. It can indicate a network problem on your router if the minimum time becomes really high, but the maximum time is not really important as it is related to the hotspots. At least, it helps you see that the average travel time should be kept as a margin in the offer-to-downlink duration.
Then we have the Offer Durations graph: this is the internal processing time for the different types of offer. These timings are basically related to the router internal code and the server load. You can sometimes see the impact of a release here.
The Packet Duration graph is a subset of the Offer->Packet->Downlink graph; it focuses on the packet processing, after the packet has been received. We see the processing time. We see the 2s wait for the packets with downlink = false, corresponding to the wait for multi-buy packets. We see the downlink = true processing time, which must be kept short to fit in the RX window timing. The most interesting things are the rejection reasons. They are interesting because at this point the router has purchased the packet, so if it is rejected, there is a potential source of problems. As an example here, we have unknown device and bad_mic. The first one means that we accepted a packet based on its DevAddr but in the end it does not belong to a known device, for example when the device has been removed from the console but its owner keeps it active with a running session. The second one is more about a Join request from a device using, as an example, a wrong AppKey. This is all about time, but later we will see the quantity of such errors.
This table is a bit complex to read and can be frightening: it basically indicates the time between the packet reception on the hotspot and the delivery of the packet to the router. Most of the time here is related to the hotspot, particularly when you see durations over a minute. These are also percentiles, so you can read the above table the following way: for 50% of the packets, the Rx-to-delivery time is under 487ms on average. Basically, when a hotspot receives a packet but can’t propose it on the chain, due to a network problem, the proposal will be delayed. If a hotspot is offline, it will queue the messages and propose them when it is back. So you can have a lot of messages arriving really late due to one hotspot, and this will impact the stats. This is a good indicator of the hotspot health.
The Hold Time graph next to that one traces the average hold time. It is usually under a second, but you can see high peaks on it (as an example, I currently have a 33-minute peak). Such a peak means that one (or more) hotspot delivered really old messages at that time, and this impacts the average. This is not related to the router health in general, and it will be more visible when you have a low number of devices on your router.
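A small Python example shows why a single late hotspot moves the average so much while the 50th percentile barely changes. The hold times below are made-up values; only the 33-minute delay echoes the peak mentioned above.

```python
import statistics

# Made-up hold times in milliseconds for five packets.
hold_times_ms = [300, 420, 487, 510, 650]

# Same traffic, plus one message queued by a hotspot for 33 minutes.
with_late_message = hold_times_ms + [33 * 60 * 1000]

median_before = statistics.median(hold_times_ms)
median_after = statistics.median(with_late_message)
mean_before = statistics.mean(hold_times_ms)
mean_after = statistics.mean(with_late_message)
```

Here the median stays around half a second in both cases, while the mean goes from under half a second to more than five minutes. This is why the percentile table and the average Hold Time graph should be read together.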
These 3 panels show the offer rate (average offers per second) in different ways. The first two give an idea of the router usage. The rejection rate is not an anomaly: for a single transmission, there will be multiple offers coming from the different hotspots receiving it. Only some will be accepted and the others rejected.
The analysis of the rejection causes is on the third graph: we have the rate and the total number of offers concerned. This is useful to understand the previous one and to see whether actions are needed. Basically, on the graph we see a large rejection due to multi_buy, corresponding to multiple offers for a single packet; the ratio is low and should be much larger in my opinion, but this is a different topic. Then we see “device_inactivate”, meaning that the concerned devices have been disabled from the console. More interesting, “join – device_not_enought_dc” indicates that some of the router users do not have the DCs to cover the communication, but the device is still active. More surprising, “join – console_unknown_device” seems related to devices removed from the console but still in the XOR filters and trying to connect.
Once the offer is accepted by the router, the packet is sent to it. Packets are the traffic that is really processed by the router.
The Packet rate and Join rate are, basically, the equivalent of the accepted Offer rate. So here the main new information is the Downlink rate. We usually see router_http_channel = error as a visible signal. In fact there is no real problem with this: when an integration does not return OK, it is considered as an HTTP error, and this is why we see it.
The next graphs give details on the router wallet and blockchain synchronization status. The router needs to be in sync with the blockchain to make sure it pushes the XOR filters and to maintain the state channels.
This graph shows the age of the last block on the router over time. As blocks are generated at a 1:30 – 2:00 rate, a value around 3:00 is a good value. Some blocks take much longer to compute, and this is what the peaks you see are related to. We also see some really long peaks that can come from the chain itself: when the chain takes a lot of time to generate a block, the age of the previous block increases during that time. So this is an interesting view of the blockchain health and your server health. An XOR filter update transaction will be delayed if you are late on the chain.
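If you want to alert on this value, the check is straightforward. The Python sketch below is only an illustration: the function names and the 600s alert threshold are assumptions of mine, chosen well above the 3:00 typical value so that an occasional slow block does not trigger it.

```python
import time
from typing import Optional


def block_age_s(last_block_unix_ts: float, now: Optional[float] = None) -> float:
    """Age of the last block known by the router, in seconds."""
    now = time.time() if now is None else now
    return now - last_block_unix_ts


def is_late(age_s: float, alert_after_s: float = 600) -> bool:
    """True when the last block is older than the (hypothetical) threshold."""
    return age_s > alert_after_s


# A block received 180s ago is fine; one received 1900s ago is suspicious.
normal = is_late(block_age_s(1000, now=1180))
suspicious = is_late(block_age_s(1000, now=2900))
```

A long alert does not tell you whether the chain or your server is the cause, so it is worth comparing with a public block explorer before restarting anything.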
These two graphs are important as they are related to money & cost. The DC Balance is your router balance. You need balance to open state channels, purchase packets, make XOR transactions… The router stops when the balance goes to 0. You can see the DC balance increasing when you have HNT burn activated for users (they are crediting your router wallet) and also when a state channel is closed: the remaining DCs are credited back, then debited to open a new state channel.
The XOR filter cost is the accumulated cost of the XOR filter update transactions. This graph goes back to zero when the router is restarted; this is why you see it making waves. The XOR filter grows when you add new devices; this is why the cost can be stable for a long period of time. The price currently depends on the total size of the XOR filter in bytes, and this will change soon to depend only on the delta. The initial price is 30K DCs; here you see 60K DCs on every update.
The monitoring of the State Channels is key. On the left you see the two state channels; there are some glitches related to router updates/restarts. It’s important not to have a red state channel. This will happen if you have locked an amount of 10K DCs, as an example, but you purchased more than 10K packets during the life of the state channel. In such a situation, the state channel can’t be used anymore (it is empty) but it is kept active until its end-of-life. A new one will be created, but as the number of State Channels is limited, you could reach a point where you can’t open new state channels until the end-of-life of the ones locked due to overspending. In such a situation, the router won’t be able to accept any more packets.
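The overspent situation can be modelled in a few lines of Python. This is a toy model under my own assumptions, not the router implementation: the `StateChannel` class, the `can_open_new` function, the `max_open` limit and the 1 DC per packet cost are hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class StateChannel:
    locked_dc: int          # DCs locked when the channel was opened
    expires_at_block: int   # end-of-life, as a block height
    spent_dc: int = 0

    def purchase(self, cost_dc: int = 1) -> bool:
        """Spend DCs for a packet; refuse when the locked amount is used up."""
        if self.spent_dc + cost_dc > self.locked_dc:
            return False
        self.spent_dc += cost_dc
        return True

    def is_expired(self, height: int) -> bool:
        return height >= self.expires_at_block


def can_open_new(channels, height: int, max_open: int) -> bool:
    """An empty channel still counts against the limit until it expires."""
    still_open = [c for c in channels if not c.is_expired(height)]
    return len(still_open) < max_open


# A tiny channel with 3 DCs: the fourth purchase is refused.
sc = StateChannel(locked_dc=3, expires_at_block=1240)
purchases = [sc.purchase() for _ in range(4)]

other = StateChannel(locked_dc=10, expires_at_block=1300)
blocked = can_open_new([sc, other], height=1100, max_open=2)
unblocked = can_open_new([sc, other], height=1240, max_open=2)
```

In this model the empty channel `sc` keeps a slot busy until block 1240, so no new channel can be opened before that height even though `sc` can’t buy anything anymore.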
The graph on the right comes with that one, indicating the current State Channels DC balance. When a State Channel is opened, an amount of DCs is locked into it and consumed as packets are purchased. The amount is a setting in the env-router file. The decrease rate and the amount left at state channel close time are the key indicators. You need to make sure you have a certain margin. It’s also a good indicator to monitor the use of the router by devices. On the side image, you see the state channel DC balance over a longer period: you can see the trend indicating that the usage is growing on the router / console, and you can anticipate the future need to increase the DC balance in the state channels.
The second graph is informational: it shows the number of different actors (hotspots) involved in the state channel transactions. It is basically the number of different hotspots that provided data in the different open state channels. When a state channel is closed, this number decreases, as the new one has no actors at start.
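To anticipate the amount to lock, you can extrapolate the decrease rate over the full life of a state channel and add a margin. The Python sketch below assumes a linear consumption, which is a simplification; the function name, the 20% default margin and the example values are mine and not router settings.

```python
def dc_to_lock(dc_spent: int, blocks_elapsed: int, channel_blocks: int,
               margin_pct: int = 20) -> int:
    """Estimate the DCs to lock in a state channel, rounded up.

    dc_spent / blocks_elapsed is the observed consumption rate, which is
    extended to the channel duration and increased by margin_pct percent.
    """
    numerator = dc_spent * channel_blocks * (100 + margin_pct)
    denominator = blocks_elapsed * 100
    return -(-numerator // denominator)  # integer ceiling division


# Example: 4000 DCs spent over 100 blocks, for a 240-block state channel.
needed = dc_to_lock(dc_spent=4000, blocks_elapsed=100, channel_blocks=240)
```

With these example values the estimate is 11520 DCs, so a channel locked with 10K DCs would likely run empty before its end-of-life.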
Decoders Graph and API
These graphs monitor the decoder functions written by users. You can check the processing time and the potential errors / crashes. This is more about debugging, or possibly alerting. I will not go into more detail about it.
The next block is about the API response time, basically the console response time (but also the users’ direct API calls). The API Success % allows you to make sure your APIs are up; you can also capture some low-rate errors from it. The graph can have some periods of time with no value: this is a Grafana display “bug” when the value is 100% OK. The Console Websocket is the state of the websocket between router and console. It should be 100%.
That graph allows you to monitor the response time for all API endpoints. The most complex API is “report_status”, corresponding to the device detail page on the console; this page is refreshed in real time and is the most complex. It is usually at the top. The best is to have all these API response times as stable as possible for the end-user, as this impacts the user experience.
Two other graphs detail the OK and Error rates to monitor the evolution of the API call rate over time.
The last part of the dashboard is about the machine state: CPU, processes, memory… I will not go into detail on this as it is really standard monitoring for a server. There is one dashboard that is really specific to Helium:
The GRPC connection is a different way for the hotspot to connect to the router for the communication. This is more efficient and faster than using the P2P network. Currently, only the data-only hotspots are using GRPC; later all the miners should migrate. So basically, this graph indicates that some data-only hotspots have sent data to my router.