Power Issues and LPC wakeup

I’ve also been having this problem, and after spending a lot of time this past week diagnosing and testing, I was able to figure out the problem (and a fix). The TLDR is that it is indeed a subtle issue with the keyboard firmware.

I’d be very appreciative if anyone else who’s encountered this problem, could verify that (a) they can readily reproduce the issue the same way I describe below, and (b) applying the minimal patch below (not the debug patch) means that the issue no longer occurs with the same repro procecdure.

cc @Chartreuse (who I believe wrote the original powersave code)

Summary:

The problem is that calling wdt_disable() from inside the WDT interrupt routine (ISR(WDT_vect)) is apparently dodgy/racey, and sometimes causes the MCU to never fully wake from sleep. I suspect this is because the AVR takes multiple cycles to fully wake from sleep (according to the datasheet). The problem can be avoided by replacing the wdt_disable() call with Delay_MS(1), and adding wdt_disable() immediately after sleep_disable() inside keyboard_power_off().

I also found a few other small improvements to the powersave code. I’ll create a Merge Request with all the changes as soon as I can - @minute, can you please approve my account on source.mnt.re?

In the meantime, the minimal patch (against commit 1a9c21f) to avoid the problem is:

diff --git a/reform2-keyboard-fw/keyboard.c b/reform2-keyboard-fw/keyboard.c
index 840dad6..1efeea9 100644
--- a/reform2-keyboard-fw/keyboard.c
+++ b/reform2-keyboard-fw/keyboard.c
@@ -342,7 +342,7 @@ void setup_hardware(void)
 ISR(WDT_vect)
 {
   // WDT interrupt enable and flag cleared on entry
-  wdt_disable(); // Disable watchdog for now
+  Sleep_MS(1);
 }

 /** Event handler for the library USB Connection event. */
diff --git a/reform2-keyboard-fw/powersave.c b/reform2-keyboard-fw/powersave.c
index 4edcc8f..5927304 100644
--- a/reform2-keyboard-fw/powersave.c
+++ b/reform2-keyboard-fw/powersave.c
@@ -60,6 +60,7 @@ void keyboard_power_off(void)
     sleep_cpu();        // Actually go to sleep
     // Zzzzzz
     sleep_disable();    // We've woken up
+    wdt_disable();      // Disable watchdog for now
     sei();
     // Check if circle key has been pressed (active-low)
     // If not reset the watchdog and try again

More details:

The first thing I noticed is that the device doesn’t have to be sleeping for a long time, for the problem to happen. One time I happened to press Circle (actually hold for 2+ secs) again straight away after using Circle + 0, and the problem occurred - the menu didn’t come up. There was still 3.3v on SYSCTL, and after resetting the keyboard, Circle + s showed the LPC still reporting many cycles in “normal” mode (ie. the LPC and batteries were fine).

I then tested the keyboard some more by holding down the Circle key, and pressing 0 every time the keyboard FW woke up. This turned out to be a reliable repro - the problem would eventually occur, although sometimes it could take a while, eg. 5-10 mins sitting there pressing 0 every 2 seconds.

I sped up the testing process considerably by using this patch, which removes the “goodbye” animation, lowers the timeout from 1 sec to 0.25 secs, and removes the need to hold down the Circle key:

diff --git a/reform2-keyboard-fw/menu.c b/reform2-keyboard-fw/menu.c
index 796f87d..aa9532c 100644
--- a/reform2-keyboard-fw/menu.c
+++ b/reform2-keyboard-fw/menu.c
@@ -77,7 +77,6 @@ void jump_to_bootloader(void) {
 int execute_meta_function(int keycode) {
   if (keycode == KEY_0) {
     // TODO: are you sure?
-    anim_goodbye();
     remote_turn_off_som();
     keyboard_power_off();
     reset_keyboard_state();
diff --git a/reform2-keyboard-fw/powersave.c b/reform2-keyboard-fw/powersave.c
index 4edcc8f..4837f05 100644
--- a/reform2-keyboard-fw/powersave.c
+++ b/reform2-keyboard-fw/powersave.c
@@ -51,7 +51,7 @@ void keyboard_power_off(void)
   do {
     wdt_reset();
     WDTCSR = (1<<WDCE) | (1<<WDE); // Enable writes to watchdog
-    WDTCSR = (1<<WDIE) | (1<<WDE) | (0<<WDP3) | (1<<WDP2) | (1<<WDP1) | (0<<WDP0); // Interrupt mode, 1s timeout
+    WDTCSR = (1<<WDIE) | (0<<WDE) | (0<<WDP3) | (1<<WDP2) | (0<<WDP1) | (0<<WDP0); // Interrupt mode, 0.25s timeout

     // Enter Power-save mode
     set_sleep_mode(SLEEP_MODE_PWR_DOWN);
@@ -63,7 +63,7 @@ void keyboard_power_off(void)
     sei();
     // Check if circle key has been pressed (active-low)
     // If not reset the watchdog and try again
-  } while(PINC&(1<<6));
+  } while(false);

   // Resume and reinitialize hardware
   setup_hardware();

I also added a lot of debug code, including clearing the OLED (instead of turning it off) and poking various info to it during the sleep / wake process. This eventually allowed me to confirm that the MCU was getting stuck inside the sleep_cpu(), and never returning from it. Another thing I noticed was that when I had calls to gfx_poke(...); gfx_flush(); inside the interrupt routine, the problem was significantly less likely to occur (or maybe it never occurred, I can’t recall exactly).

After reading the Sleep and WDT (Enhanced Watchdog Timer) sections of the ATMega32U4-AU’s datasheet, I started investigating some more (eg. confirming that the WDTCSR flags were actually being set, checking when the WDT interrupt routine was running, etc). I noticed the datasheet mentions that the AVR MCU takes several cycles to fully wake from sleep (restarting clocks/timers, etc). This, plus the problem being less likely when the interrupt routine took longer to complete, led me to try an interrupt routine which only waited for 1ms, and calling wdt_disable() from the main execution thread after sleep mode had been disabled.

(I also made the WDT use interrupt-only mode, not interrupt-and-reset mode, to remove the possibility of accidental WDT-induced bootloops. This is an issue I discovered while searching for problems with ATMegas and the WDT, but it turns out this wasn’t the cause here.)

This change (disabling the WDT from outside its interrupt routine) fixed the problem - using the 4x speed-up testing, I went 5 mins without the problem reoccurring. This test was equivalent to (at least) 20 mins of “normal” testing, and my previous “normal” testing had always struck the problem in under 10 mins.

6 Likes