Heat-related hard crashes?

A couple of days ago, my little Pocket Reform crashed in a horrifying way: the screen turned black for a second and came back on with the picture mangled and at full brightness, and the keyboard and trackball were unresponsive. I turned it off using the OLED menu, turned it on again, and it booted with a horribly flickering screen. So I turned it off to let it rest for a while until I regained the courage to turn it on again. The flicker was gone, but the right side of the screen had a different color than the rest. This wore off after some use, so the screen seems fine again.

So I was a bit worried and did not touch it for a couple of days, but used it for an hour yesterday without any problems. Unfortunately, today it happened again after 30 minutes, but this time the screen did not flicker after reboot; then it crashed again after a few minutes, with the screen remaining black. Trying to boot it again did not work because it crashed during the kernel messages. So I let it cool for a while, and it started again, but after a few minutes, it went black again.

I let it cool some more, and now it seems to be running fine again, even when stressing the machine by running some compilations and wildly scrolling through Mastodon in Firefox. The only difference, and I’m not sure if it’s related, is that I booted with a power adapter attached and I am in a spot with better Wi-Fi reception. This is how hot it will run now:

$ sensors
soc_thermal-virtual-0
Adapter: Virtual device
temp1:        +84.0°C  

nvme-pci-0100
Adapter: PCI adapter
Composite:    +52.9°C  (low  = -40.1°C, high = +83.8°C)
                       (crit = +87.8°C)
Sensor 1:     +73.8°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +48.9°C  (low  = -273.1°C, high = +65261.8°C)

BAT0-spi-1-0
Adapter: SPI adapter
in0:           0.00 V  
curr1:         0.00 A  

cpu_thermal-virtual-0
Adapter: Virtual device
temp1:        +80.0°C  

I am running my OS off a 2 TB WD Blue NVMe drive, installed using the Reform image. Here’s the output of reform-check:

$ sudo reform-check
I: Your platform name is: MNT Pocket Reform with i.MX8MP Module
I: You are running kernel version: 6.9.12-mnt-reform-arm64
I: Your installed kernel version: 6.9.12-1+reform20240803T055726Z
I: Your installed reform-tools version: 1.47
I: not installed:  pocket-reform-handbook 
W: eMMC does not contain latest uboot
W: You can update it to the latest version by running as root:
reform-flash-uboot emmc
E: unexpected last line in /etc/skel/.profile, should be:
if [ "$(whoami)" = "root" ]; then reform-help --root; elif [ -z "$WAYLAND_DISPLAY" ]; then reform-help; fi
I: the following files differ from how they are shipped by reform-tools:
??5??????   /var/lib/alsa/asound.state
??5?????? c /etc/skel/.config/sway/config.d/input
I: kernel boot parameters your system does use but which are not the default:
 + console=ttymxc1

The logs (using sudo journalctl) do not yield anything interesting around the time it crashes. I changed the apt sources.list to use Trixie, hoping to get slightly more stable Debian packages and to end up on stable automatically.
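
Since the journal shows nothing around the crashes, one way to see how hot things were just before a crash would be to log the thermal zones to a file and flush after every sample. This is only a minimal sketch, assuming the kernel exposes the zones under /sys/class/thermal with readings in millidegrees Celsius; the THERMAL_ROOT override and the log path are hypothetical names picked for illustration:

```shell
#!/bin/sh
# Append one timestamped line per thermal zone to a log file, then flush.
# Values are written as read from sysfs (millidegrees Celsius).
# THERMAL_ROOT is a hypothetical override so the function can be dry-run.
log_temps() {
    log="$1"
    root="${THERMAL_ROOT:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        # Skip if the glob matched nothing or the zone is unreadable
        [ -r "$zone/temp" ] || continue
        printf '%s %s %s\n' \
            "$(date '+%Y-%m-%dT%H:%M:%S')" \
            "$(cat "$zone/type")" \
            "$(cat "$zone/temp")" >> "$log"
    done
    # Flush to disk so the last readings survive a hard crash
    sync
}

# Usage (hypothetical log path): sample every 5 seconds until interrupted.
# while true; do log_temps "$HOME/temps.log"; sleep 5; done
```

After a crash, the last lines of the log would show the temperatures from a few seconds before the machine went down.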

So, is this some (heat-related) hardware issue, or am I running a bad kernel? Maybe both? How do I get back to a more stable situation? Because this is not a usable state.

At first glance, it sounds more like a hardware problem to me. I wonder why your SoC gets so hot. Maybe the thermal pad isn’t making proper contact? Also, it sounds like the display connection was lost; otherwise it’s rare that it would crash like that. The flickering is “normal” after a loss of display signal while the display is still powered on. It will recover after several minutes, like you discovered. Normally, the i.MX8MP system is very stable, so something unusual is going on. I would suggest checking all connections and the thermal pad. Another option would be to restrict the CPU to 1800 MHz using, for example, cpupower-gui.
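
If you prefer the command line over a GUI, the same 1800 MHz cap can be sketched roughly like this. It assumes the standard Linux cpufreq sysfs layout (scaling_max_freq, in kHz, under each policy directory); the CPUFREQ_ROOT override and the function names are hypothetical, and the setting does not persist across reboots:

```shell
#!/bin/sh
# Convert MHz to the kHz unit used by cpufreq's sysfs files.
mhz_to_khz() {
    echo "$(( $1 * 1000 ))"
}

# Write a maximum frequency (given in MHz) to every cpufreq policy.
# Needs root on a real system. CPUFREQ_ROOT is a hypothetical override
# that allows a dry run against a scratch directory.
cap_cpu_mhz() {
    khz=$(mhz_to_khz "$1")
    root="${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$root"/policy*; do
        # Skip if the glob matched nothing or the file is not writable
        [ -w "$policy/scaling_max_freq" ] || continue
        echo "$khz" > "$policy/scaling_max_freq"
    done
}

# Usage, as root, after sourcing this file: cap_cpu_mhz 1800
# Roughly equivalent with the cpupower tool:
#   sudo cpupower frequency-set -u 1800MHz
```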

Makes me want to check my temps. The back of my display (you know, the PCB plate) gets hot: hot enough to be uncomfortable to hold, but not hot enough to burn or make me want to let go. 84°C, on the other hand, would burn my hand for sure.

I know when you take the backplate off, you have to reseat the thermal pad on the SoC. Is it possible that isn’t seating properly or that it is somehow misaligned?

What is an example of your workload when hitting these temps? Lastly, what are the ambient temps in your environment?

Thanks for your reply and for easing my worry about it being fried somehow. I opened up the top plate, and the thermal pad is nicely stuck on top of the CPU on the processor board. I gently pulled on the display ribbon cables to feel if there’s any wiggle but did not detect any. How do I unlatch these cables to reseat them?

Oh, I am not sure this is related, because it was always the case and the machine had been stable for weeks before the crash mentioned in this topic, but the display is a couple of pixels off on the left side. So it’s missing the top (I’m assuming the left side is actually the top of the display) 2 (?) rows of pixels.

The MNT logo gets hot, but not “burn” hot, only a bit uncomfortable to touch. I don’t think 84°C should really be a problem; my AMD workhorse shoots up to 90°C when I push it a bit. It’s summer here in the Netherlands, and that means >20°C, so that’s not really a factor. I stressed it to 84°C by running a single-core compile (guix pull) and scrolling Mastodon in Firefox (LibreWolf); images with a CSS blur filter in particular are a bit challenging for this machine.

Yeah, I wasn’t trying to imply that the i.MX couldn’t handle those temps, just that they are a little high. I’m not sure about all of the components inside the display. For example, how hot can the display itself get before there are issues?

My Pocket gets pretty hot too, but since I changed from Firefox to ungoogled chromium and turned off GPU acceleration in both Firefox and vscode(ium), my Pocket stays quite “cool”.

Thanks for the tip. With ungoogled chromium (flatpak) instead of Firefox (LibreWolf), it does seem to run cooler.

For reference: chromium was too unstable to disable hardware acceleration in the UI due to graphical glitches, so I first started it with the --disable-gpu command line option, and now it runs smoothly (more so than Firefox).
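
For a flatpak install, passing that option would look roughly like the sketch below. The application ID is an assumption on my part (check yours with `flatpak list --app`), and the function name and the FLATPAK override are hypothetical, added only so the command can be printed instead of run:

```shell
#!/bin/sh
# Flatpak application ID; assumed here, confirm with `flatpak list --app`.
APP_ID="${APP_ID:-io.github.ungoogled_software.ungoogled_chromium}"

# Launch the browser with GPU hardware acceleration disabled.
# Setting FLATPAK=echo prints the command instead of running it.
chromium_no_gpu() {
    ${FLATPAK:-flatpak} run "$APP_ID" --disable-gpu "$@"
}

# Usage, after sourcing this file: chromium_no_gpu
```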

When running Firefox, is there any recommended configuration regarding hardware acceleration?

Using ungoogled chromium instead of Firefox does seem to make a difference. I haven’t crashed the machine since. Also interesting: stressing the machine by running parallel compiles (for many hours) runs cooler when the screen is locked (and thus inactive), so it seems that using the graphics-related parts of the module causes extra heat.