Z-Wave Firmware Hardening – Avoid Truck Rolls
How to Go Out of Business
One of the fastest ways to go out of business is to ship product that requires a technician to come to the home and replace it – aka a “Truck Roll”. One of these infamous failures is the Wi-Fi door lock maker Lockstate. In 2017 Lockstate put out a firmware update that caused a percentage of their locks to “brick” and require the lock be removed and returned to the manufacturer. Lockstate didn’t go out of business, but they did have to rebrand as RemoteLock due to their reputation damage. Fortunately for them only a small percentage of the locks bricked so the company did survive, barely. The small percentage of failures means the update had passed normal Quality Assurance (QA) testing. There is never enough testing!
Every IoT device has bugs. Today’s devices have many thousands of lines of code and too many hardware features meaning there are plenty of bugs hiding in every device. Bugs are in your product. I know it, and you know it. I propose that the best solution for these bugs is to “harden” your firmware to make it more resilient and keep on truckin’ if something bad happens. Below are my Tips and Techniques for hardening Z-Wave firmware to survive a failure. These ideas are for Silicon Labs SDK but similar techniques apply to Trident IoT with obviously some different details.
I am only discussing resilient firmware and not talking about making a product more secure which is also called “hardening”. Security is an important topic which must be planned before coding begins. The most basic security measure is to set the Debug Lock on all production units. Debug Lock disables the debugger port of the chip making it difficult to read out the flash image. There are several more layers of security in the Z-Wave chips that can be enabled but at a bare minimum be sure to set Debug Lock.
Seven Tips to Harden Z-Wave Firmware
- Assume Everything is Broken
- Use FOR Instead of WHILE
- Replace Default_Handler
- Enable the Other Watchdog
- Reboot if no Comms in a Day
- Enable Stack Overflow Checking in FreeRTOS
- Run Static Analysis Tools
1. Everything is Broken
This is a philosophical idea you need to keep in the back of your mind with every line of code you write. Assume everything is broken all the time – hardware never goes “ready”, a queue is always full, a mutex never switches, an impossible state occurs and similar sorts of failures. The most common code technique is to always check for error conditions of any function that returns a value. Always check inputs for validity.
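As a minimal sketch of "check every return value and validate every input", here is a small event queue that reports a full condition instead of silently dropping data. All names here (`event_queue_t`, `queue_push`, `handle_set_level`, the depth of 4 and the maximum level of 99) are hypothetical and chosen only for illustration; they are not part of any Z-Wave SDK.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EVENT_QUEUE_DEPTH 4u   /* hypothetical depth, for illustration only */
#define LEVEL_MAX         99u  /* hypothetical highest valid level */

typedef enum { Q_OK, Q_FULL, Q_BAD_ARG } q_status_t;

typedef struct {
    uint8_t items[EVENT_QUEUE_DEPTH];
    size_t  head;   /* index of the oldest item */
    size_t  count;  /* number of items currently stored */
} event_queue_t;

/* Push one event and report failure rather than silently dropping it. */
q_status_t queue_push(event_queue_t *q, uint8_t event)
{
    if (q == NULL) {
        return Q_BAD_ARG;
    }
    /* Defensive: >= instead of == so a corrupted count still reads as full. */
    if (q->count >= EVENT_QUEUE_DEPTH) {
        return Q_FULL;
    }
    /* The modulo keeps the index in bounds even if head has been corrupted. */
    size_t tail = (q->head + q->count) % EVENT_QUEUE_DEPTH;
    q->items[tail] = event;
    q->count++;
    return Q_OK;
}

/* Validate the inputs first, then check the result of the call we depend on. */
bool handle_set_level(event_queue_t *q, uint8_t level, uint32_t *dropped_events)
{
    if (q == NULL || dropped_events == NULL || level > LEVEL_MAX) {
        return false;          /* reject invalid input up front */
    }
    if (queue_push(q, level) != Q_OK) {
        (*dropped_events)++;   /* count the failure so it can be acted on */
        return false;
    }
    return true;
}
```

The caller always learns whether the event was queued, and the dropped-event counter gives the rest of the firmware something to act on, for example deciding that the task draining the queue has stopped.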
The most insidious failures are stack overflows. This can happen where parts of RAM are overwritten and I’ve found it amazing that the code can keep running even after trashing potentially hundreds of memory locations. The challenge is there is no way to predict what might happen. All sorts of things that “can never happen” absolutely will happen when the stack overflows. Due to the limited RAM on Z-Wave chips, it is easy to overflow the stack.
Another common impossible condition is when the power supply sags just enough to flip bits in ways that are technically impossible. Strong magnetic fields from nearby motors and even cosmic radiation can flip bits in impossible ways. There is truly nothing that “can’t happen”. Thus, always code with the thought that the impossible can happen because eventually it will.
Wireless IoT devices using Z-Wave are often wired directly to mains-power. They cannot be easily rebooted like you do with your computer when it freezes. If a device bricks, it’s dead potentially for months or even years before a power failure brings it back online. Once a device is wired in, it’s usually there for many years and with Z-Wave it can be decades. Ensuring the firmware is resilient when (not if!) the impossible occurs will keep customers happy since they never knew the device rebooted – it kept on truckin’.
2. Use FOR instead of WHILE
The Silicon Labs SDK including the bootloader has many while(hardware_busy) loops that will wait forever and can cause the device to brick. For example, in the Silicon Labs SDK, if you enable the LFXO (32 kHz crystal oscillator) but don’t have the crystal wired up, the startup code waits in a while loop for the LFXO to be “ready”. This is obvious when debugging firmware, but it will brick a device in the field if for some reason the crystal stops working. The simple solution is to replace the WHILE with a FOR loop that has a timeout, enabling the code to continue.
Example in em_cmu.c:
Replace: while ((LFXO->STATUS & _LFXO_STATUS_ENS_MASK) != 0U) { }
With: for (int i=0; (i<1000)&&((LFXO->STATUS & _LFXO_STATUS_ENS_MASK) != 0U); i++) { __NOP(); }
Note the __NOP() is necessary to prevent the compiler from optimizing the loop and removing it. The timeout value (1000 in this case) must be chosen based on testing. I usually set it to 10X the typical value. Following the FOR should be an assert to check that the timeout didn’t occur. Note the use of < and not == for the check of the timeout. If the impossible were to happen, there is a chance “i” could skip past exactly 1000 and then since this is a 32-bit number, the timeout would be waiting for a long time for the 32-bit number to wrap all the way around. This is another defensive coding technique where in the back of my mind I’m thinking of the impossible and coding to be resilient even when the impossible happens.
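The same idea can be wrapped in a small helper so that every hardware wait in the project gets a timeout and a result the caller must look at. This is only a sketch: `wait_bits_clear` is a hypothetical name, and the register is passed in as a pointer so the logic can be exercised off-target with an ordinary variable standing in for the hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Poll a status register until the masked bits read as zero, or give up
 * after max_polls reads. Returns true if the bits cleared, false on timeout.
 * The volatile qualifier makes every poll re-read the register. */
bool wait_bits_clear(const volatile uint32_t *status_reg,
                     uint32_t mask,
                     uint32_t max_polls)
{
    /* Note the < comparison: any value of i at or beyond the limit ends the loop. */
    for (uint32_t i = 0; i < max_polls; i++) {
        if ((*status_reg & mask) == 0u) {
            return true;
        }
    }
    return false;  /* timed out: the caller decides how to recover */
}
```

A caller might assert on a false return in debug builds and log the timeout and carry on in release builds. As with the inline version, the poll limit should come from measurement on real hardware.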
3. Default_Handler
Segger has a great article on debugging the many “fault handlers” in the Cortex-M processors. The article provides code for many different handlers to help debug the fault and make the code more resilient.
Default_Handler is in startup_<chipnumber>.c and is unfortunately NOT declared as weak so you must edit the SDK file itself. Best to select the Copy Contents mode in the .slcp file Import Mode. Maybe it’s better to fix in the SDK for all your projects! The Silicon Labs SDK has only a single line while (true); for a default handler. This code counts on the watchdog to eventually reboot the chip. But the reboot isn’t guaranteed and there is no additional debugging information as shown in the Segger examples. All the fault handlers are mapped into this single handler, but they can be individually overridden due to weak assignments. Even something as simple as a divide by zero can cause a fault handler to be called and brick the device.
At a minimum, put the Segger recommended code in for at least some of the exception handlers to make debug easier. Generally, it is a good idea to light an LED (ideally red) or some other external indicator to help during debug. Other ideas are to log the address and condition that caused the exception and store it in the User Data Page/NVM that can be read out from production units that were returned from the field by angry customers. Then perform forensic analysis to identify the cause and release a firmware update that solves the problem.
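To make the "log the address and condition" idea concrete, here is a hedged sketch of the record a fault handler could fill in before it is written to non-volatile storage. It relies on one architectural fact: on exception entry a Cortex-M core stacks R0, R1, R2, R3, R12, LR, PC and xPSR, in that order. Everything else (`fault_record_t`, the magic value, the XOR checksum) is hypothetical, and the actual write to the User Data Page or NVM is left out because it depends on your SDK.

```c
#include <stdbool.h>
#include <stdint.h>

#define FAULT_RECORD_MAGIC 0xFA17C0DEu  /* hypothetical marker for a valid record */

typedef struct {
    uint32_t magic;
    uint32_t pc;            /* address of the instruction that faulted */
    uint32_t lr;            /* return address of the faulting function */
    uint32_t psr;           /* program status register at the fault */
    uint32_t fault_status;  /* e.g. the value read from SCB->CFSR */
    uint32_t checksum;      /* simple XOR over the fields above */
} fault_record_t;

static uint32_t fault_record_sum(const fault_record_t *r)
{
    return r->magic ^ r->pc ^ r->lr ^ r->psr ^ r->fault_status;
}

/* stacked_frame points at the 8 words pushed by the core on exception entry:
 * R0, R1, R2, R3, R12, LR, PC, xPSR. */
void fault_record_fill(fault_record_t *rec,
                       const uint32_t stacked_frame[8],
                       uint32_t fault_status)
{
    rec->magic        = FAULT_RECORD_MAGIC;
    rec->lr           = stacked_frame[5];
    rec->pc           = stacked_frame[6];
    rec->psr          = stacked_frame[7];
    rec->fault_status = fault_status;
    rec->checksum     = fault_record_sum(rec);
}

/* Used when reading a record back from a returned unit. */
bool fault_record_is_valid(const fault_record_t *rec)
{
    return rec->magic == FAULT_RECORD_MAGIC &&
           rec->checksum == fault_record_sum(rec);
}
```

The magic value and checksum let the readout tool tell a real fault record from erased or partially written flash, which matters because the write happens while the system is already in a bad state.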
4. Watchdogs
Watchdog timers are crucial for reliable 24x7x365 operation of an IoT device. A watchdog timer is a timer that slowly counts down. Every now and then, the firmware “feeds” the watchdog by resetting the counter to a high value. If the counter reaches zero, a full reset of the chip is triggered which reboots the chip and hopefully resolves the error condition. The watchdog timer typically takes a couple of seconds of being starved before the reset to ensure it doesn’t falsely reset. The trick to a resilient watchdog is deciding when to feed it, and more importantly, when not to. I wrote a blog post on watchdog timer best practices back in the 500 series days which still applies.
There may be opportunities to further align the current 700/800 series watchdog timer implementation in the Z-Wave SDK with established best practices. The watchdog is fed every time the FreeRTOS idle task is executed. The only two ways the watchdog will fire are if FreeRTOS crashes and stops servicing the idle task, or if a task sits in a tight loop long enough to starve it. Many other failure mechanisms can brick the device while the idle task continues to execute. The most common is a full queue: the code skips the write and plans to retry later, but the task that drains the queue has crashed, so the queue never empties. And even when the watchdog does reboot the chip, if the reboot doesn’t resolve the failure, the device is still bricked because it’s stuck in a reset loop.
The idle task is normally compiled into a library which makes altering it impossible. Trident has the entire source code for the SDK available, making it possible to improve the code to follow the best practices. The main effort in improving the watchdog is identifying everything that can cause the code to lock up. Check that all queues, mutexes, state machines and perhaps even peripherals are idle before feeding the watchdog. The other key is to disable the watchdog during development. Watchdog resets are notoriously good at hiding faults: the chip seems to briefly stop, but when you send the command again everything works because the chip rebooted, hiding the bug! The Z-Wave chips have a second watchdog timer so you can create your own robust watchdog following the best practices.
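One common way to decide when to feed is a check-in scheme: every task that must stay alive sets its own bit, and the watchdog is fed only once all bits are present. The sketch below shows just that decision logic. The task names and bit assignments are hypothetical, the actual hardware feed call is left to the caller, and on a real RTOS the updates to the shared flag word would need to be made atomic (for example inside a critical section).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical set of tasks that must all prove they are alive. */
#define WD_TASK_RADIO   (1u << 0)
#define WD_TASK_APP     (1u << 1)
#define WD_TASK_SENSOR  (1u << 2)
#define WD_ALL_TASKS    (WD_TASK_RADIO | WD_TASK_APP | WD_TASK_SENSOR)

static uint32_t wd_checkins;  /* one bit per task that has checked in */

/* Called by each task from its main loop once it has done useful work. */
void wd_checkin(uint32_t task_bit)
{
    wd_checkins |= task_bit;
}

/* Called periodically by the watchdog service code. Returns true only when
 * every task has checked in since the last feed, then starts a new round.
 * If any task is missing, the flags are kept and the watchdog is starved. */
bool wd_should_feed(void)
{
    if ((wd_checkins & WD_ALL_TASKS) != WD_ALL_TASKS) {
        return false;
    }
    wd_checkins = 0u;
    return true;
}
```

The service code calls the hardware feed only when `wd_should_feed()` returns true, so a single crashed task is enough to let the watchdog expire even though the idle task is still running.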
5. Reboot When no Communication for a Day
Hello, is anyone listening? The concept here is basically a long-duration watchdog timer. If the controller hasn’t sent a frame and/or hasn’t acknowledged the receipt of a frame in twenty-four hours, maybe a reboot will clear things up. I’ve seen this in the 500 series where on rare occasions reading the HomeID from the external NVM would fail. As a result, the device would forget the HomeID and assume some random number. This random number would then be stuck in the device for days or weeks or even years until the device rebooted for some reason. This was a classic Impossible Condition that seemed to happen on a fairly regular basis when several tens of thousands of Z-Wave devices have been operating for a few months. The only solution for the end-customer was to rip the bricked device out of the wall! Or more commonly factory reset it and rejoin the network which also resulted in one-star reviews.
The solution is simple: set up a 24-hour software timer, check the RX/TX statistics, and if nothing has made it through, reboot! This fixes rare impossible conditions in a way that most customers would not even notice. This check is only needed for always-on or FLiRS (LSEN) devices, as deep sleeping devices reboot every time they wake up. Below is an implementation you can drop into your application.
// Usually this is in app.c or your own application files
#include <AppTimer.h>

#define ONCE_PER_DAY (1000*60*60*24UL)  // 24 hours in milliseconds
static SSwTimer CheckTxStatsTimer;

void CheckTxStatsCallback(SSwTimer *pTimer) {
  (void)pTimer;  // unused
  // pStats was not declared in the original listing; this type is an assumption, verify it against your SDK version
  const zpal_radio_network_stats_t *pStats = ZAF_getNetworkStatistics();
  if ((pStats->tx_frames == 0) && (pStats->rx_frames == 0)) {
    while(true); // spin until the watchdog reboots the chip: no frames transmitted or received in the last 24hrs
  }
  zpal_radio_clear_network_stats(); // zero the stats.
}

// in ApplicationTask just before the FOR loop
AppTimerInit(EAPPLICATIONEVENT_TIMER, xTaskGetCurrentTaskHandle());
AppTimerRegister(&CheckTxStatsTimer, true, CheckTxStatsCallback); // auto reloads
TimerStart(&CheckTxStatsTimer, ONCE_PER_DAY);
6. Stack Overflow Checking
A real-time operating system adds complexity, but FreeRTOS has a feature to check for a stack overflow which is enabled by default. The configuration macro configCHECK_FOR_STACK_OVERFLOW is set to 2 by default in FreeRTOSConfig.h. This fills the stack space with 0xA5s when a task is created and checks at each task switch that the end of the stack still holds that pattern. Inspecting RAM after running the code for some time in the debugger can provide insight into how close the stack has come to overflowing so far. The check calls vApplicationStackOverflowHook if there is a failure but there is only an assert in the weak function. My recommendation is to add a breakpoint here during testing and consider rebooting in the released code. Note that the FreeRTOS documentation recommends stack overflow checking only during development and testing due to the additional overhead.
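Inspecting RAM by eye works, but the same measurement is easy to automate. The sketch below paints a stack region with the 0xA5 fill byte and then counts how many bytes at the low end are still untouched, which is the remaining headroom on a core whose stack grows downward (as it does on Cortex-M). The function names are hypothetical. FreeRTOS offers a comparable built-in measurement, uxTaskGetStackHighWaterMark(), when INCLUDE_uxTaskGetStackHighWaterMark is set to 1; whether that option is enabled in your SDK build is something to check.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL_BYTE 0xA5u  /* the fill value described above */

/* Paint the whole stack region with the known pattern before the task runs. */
void stack_paint(uint8_t *stack, size_t size)
{
    memset(stack, STACK_FILL_BYTE, size);
}

/* Count bytes at the low-address end that still hold the fill pattern.
 * With a downward-growing stack this is the headroom that has never been used.
 * A result of 0 means the stack has reached (or passed) its limit. */
size_t stack_unused_bytes(const uint8_t *stack, size_t size)
{
    size_t unused = 0;
    while (unused < size && stack[unused] == STACK_FILL_BYTE) {
        unused++;
    }
    return unused;
}
```

Logging this number periodically during long soak tests gives early warning that a task is creeping toward its limit, long before the overflow hook ever fires.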
The insidious problem with stack overflows is they often require several things to go wrong at the same time – a task switch, an interrupt, the radio sending or receiving data and having to find a new mesh route, code allocating sizable temporary buffers and maybe even more code that uses up the limited stack space. As mentioned above, I have observed the stack overflowing, trashing many dozens of memory locations and the code keeps running but eventually there is an impossible condition or more often a hardfault exception. As a result, the failure is often overlooked as it only happened that “one time” but in reality, it happens a lot. Getting the failure to happen repeatedly in a controlled environment is often very difficult. I have had dozens of units set up testing a specific failure case which would take all weekend to finally trigger. Then not having enough data on the unit that failed makes it even more exasperating.
7. Static Code Analysis
Use Claude, CodeX or other static code analysis tools to review all firmware. AI continues to improve at an exponential rate to grade code quality. Often AI can recommend changes to fix the code, but I would carefully check over the suggestions. AI can hallucinate or simply start making things up out of nowhere. The GCC compiler has a -fanalyzer option that will find a few interesting things. Coverity is an industry leader in this field but is pricey. What tools have you used?
Resources
I’m not the only one to make the siren call to make your code resilient. Jack Ganssle wrote the Embedded Muse newsletter for 27 years on embedded programming. He features a failure every issue with plenty of interesting stories and many tips on resilient embedded C coding. Michael Barr’s Embedded C Coding Standard should be on every firmware engineer’s bookshelf or bookmarked. The book is filled with practical advice on coding for resilience and maintainability. Maybe I should write a book on embedded coding based on this Journey?
Next Steps
Part 6 of the Z-Wave Developer’s Journey discusses hardware best practices. I present my tips and tricks for making low-cost, easy to debug and manufacture Z-Wave products from my 25 plus years of Z-Wave experience. As we continue along the Z-Wave Developer’s Journey, I welcome your comments and questions. Please feel free to reach out to me directly via email.
About the Author
Eric Ryherd has been at the forefront of Z-Wave innovation since 2003, beginning as a consultant and later serving as a Field Application Engineer at Silicon Labs. Over the course of his career, he has contributed to the design and development of a wide range of Z-Wave products, including sensors, remote controls, motorized window shades, and in-wall dimmers, many of which are on the market today.
Although he “retired” in 2022, Eric remains deeply engaged in embedded systems and Z-Wave development through his blog, DrZWave.blog, and ongoing IoT consulting projects. He is also a familiar face at Z-Wave Alliance Unplug Fests, where he frequently serves as the lead coordinator, supporting interoperability and developer collaboration.