Do chips have bugs?

There are probably many people who think that microcontrollers are bug-free. After all, they are glorified integrated circuits; a hard-wired jumble of infinitesimal transistor logic gates. There should be no unexpected behavior, as long as you operate the device within the rated voltage and temperature parameters….

Wrong!

What we tend to forget from our CPU architecture classes is that a CPU actually has a program inside. Known as microcode, its primary function is to translate each instruction into the right electrical signals to drive the various parts of the CPU. For example, an addlw 0x7F instruction might involve directing the ALU’s input to the literal value supplied with the instruction (0x7F), and then telling the ALU to add it to WREG, with the result written back to WREG. The microcode for addwf MyVar would be different again; it needs to fetch a value from RAM, and write the result back there too.
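To make the difference between those two instructions concrete, here is a minimal sketch in C of a toy CPU model. The names ToyCpu, toy_addlw and toy_addwf are made up for illustration; the sketch mimics only the data movement described above, and ignores status flags, banking and everything else a real PIC does.

```c
#include <stdint.h>

/* Toy model of a PIC-like core: one working register plus a little RAM.
   The struct and function names are illustrative, not Microchip's. */
typedef struct {
    uint8_t wreg;      /* working register (WREG) */
    uint8_t ram[256];  /* data memory */
} ToyCpu;

/* addlw k: the operand is a literal constant supplied with the
   instruction; the sum lands back in WREG. */
void toy_addlw(ToyCpu *cpu, uint8_t literal) {
    cpu->wreg = (uint8_t)(cpu->wreg + literal);
}

/* addwf f: the operand is fetched from RAM at address f and, in the
   form described in the text, the sum is written back to that same
   RAM location. WREG itself is left unchanged. */
void toy_addwf(ToyCpu *cpu, uint8_t addr) {
    cpu->ram[addr] = (uint8_t)(cpu->ram[addr] + cpu->wreg);
}
```

Same adder, different plumbing: toy_addlw only ever touches WREG, while toy_addwf has to read and write data memory. That difference in routing is what the decoding step has to sort out for each instruction.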

Well, where there is a program, there will definitely be bugs.

My first experience with a microcontroller bug cost me several weekends of frustration, fretting, and frantic but fruitless rework. Here’s how it happened:

Oscilloclock Gone Wild

It was the early days of the Prototype, and things were looking great! My dream was coming to fruition! Except… every once in a while, the clock would go absolutely berserk. Seemingly at random, it would start displaying crazy, meaningless images, and controls would cease to function. Sometimes it would recover; other times, it would exhibit brain death – requiring a hard reset.

April Fool's? No - it's a PIC bug!

No amount of testing or experimentation could tell me what the problem was. I rewrote huge blocks of code. I removed massive chunks to simplify the code. I drank more and more coffee. Sleepless nights and grumpy days ensued, wasting my precious youth!

Days passed – and at last, a Google search revealed that my PIC 18F2680 has several known issues. I learned that when Microchip discovers a bug, they describe the issue and any workarounds in the Errata document for the device.

PIC 18F2680 Errata Documents

One of these Errata documents had a note regarding interrupts and memory corruption. Hooray! Implementing the workaround described fixed the issue, and the Oscilloclock became stable and reliable. (And I also became stable and reliable.)

Key point: You must always check the errata document for your device! And check back regularly, in case more issues have been found!

More on the PIC Interrupt Corruption Bug

Microcontrollers have a facility known as interrupts. You enable an interrupt to trigger on a certain event, such as when an input signal changes, a timer countdown hits zero, or a butterfly flaps its wings in Brazil. When the event occurs, the MCU quickly stops whatever it is doing, and processes the task you have associated with that event – this task is called the interrupt routine. Once the task is done, the MCU goes back to what it was doing before.

It’s like your boss calling you into the office and giving you an urgent task. You return to your desk, put the papers you were writing in a neat stack on one side, and work on making the boss happy. Once the task is done, you unstack your papers, and return to your writing (after getting a Starbucks of course). You assume that your papers are all still there and all in the right order – you do not even check! And… you continue writing from the place you left off.

Except for the Starbucks, the MCU behaves in the same way. It assumes that the data it had in its active memory (registers) when it started to process the interrupt routine has been accurately restored afterwards.
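The stack-and-unstack idea can be sketched in a few lines of C. This is a host-side simulation, not code for the PIC itself. The Context struct, the choice of three registers, and the function names are all assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical register set for illustration; a real device has more state. */
typedef struct {
    uint8_t wreg;
    uint8_t status;
    uint8_t bsr;
} Context;

typedef void (*IsrFn)(Context *live);

/* Simulate servicing an interrupt: stash the live registers, run the
   interrupt routine (which may clobber them), then put them back. */
void service_interrupt(Context *live, IsrFn isr) {
    Context saved = *live;  /* stack the papers */
    isr(live);              /* do the urgent task */
    *live = saved;          /* unstack the papers */
}

/* Example interrupt routine that freely overwrites the registers. */
void noisy_isr(Context *live) {
    live->wreg = 0xFF;
    live->status = 0x00;
    live->bsr = 0x0F;
}
```

After service_interrupt(&ctx, noisy_isr) returns, ctx holds exactly what it held before the call, no matter what the routine did in between. The interrupted code relies on that guarantee without ever checking it.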

The bug in this PIC resulted in the data not getting restored properly. It was as if the last sheet of paper you’d been working on had gone missing!

In the case of the Oscilloclock, there are two interrupts enabled. One is triggered when a GPS message update is received. The other is triggered when an internal timer counts down to 0, indicating that 1/300 of a second has elapsed.

On both of these interrupts, data was being lost – sporadically and inexplicably. The effect of this loss really depended on what instruction, and what part of the code, the MCU had been working on at the time of the interrupt. Some of the code was not so critical, but the display-related code certainly was! Hence, crazy images.
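Why would the damage depend on timing? Here is a rough sketch in C, under loud assumptions: a three-step toy program (load, add, store), an interrupt routine that clobbers the working register, and a faulty_restore flag that stands in for the erratum. The flag only models "the register did not come back"; it does not describe the actual mechanism of the PIC bug, and run_with_interrupt is a made-up name.

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy main-line program: load 5 into the working register, add 3, then
   store the result. An interrupt may arrive just before any one step
   (pass -1 for no interrupt). faulty_restore models a restore that
   fails to bring WREG back; it is a stand-in for the erratum only. */
uint8_t run_with_interrupt(int interrupt_before_step, bool faulty_restore) {
    uint8_t wreg = 0;
    uint8_t result = 0;

    for (int step = 0; step < 3; step++) {
        if (step == interrupt_before_step) {
            uint8_t saved = wreg;   /* context saved on interrupt entry */
            wreg = 0xAA;            /* interrupt routine clobbers WREG */
            if (!faulty_restore) {
                wreg = saved;       /* correct restore on return */
            }
        }
        switch (step) {
        case 0: wreg = 5; break;                    /* load  */
        case 1: wreg = (uint8_t)(wreg + 3); break;  /* add   */
        case 2: result = wreg; break;               /* store */
        }
    }
    return result;
}
```

With a correct restore, the function returns 8 wherever the interrupt lands. With the faulty restore, an interrupt before the load is harmless (the load overwrites the damage), but an interrupt before the add or the store yields 0xAD or 0xAA instead of 8. Same fault, different symptoms, and only some of the time, which is what makes this kind of bug so hard to track down.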

If ONLY I had known about this earlier…

PIC Interrupt Bug Errata