NVMe - strange behavior

Tolstoevsky · August 8, 2021, 4:54am

Hello again.
I’ve got some problems with NVMe. It was working fine for 2 days (booting without the SD card), but this morning I ran into the following:

The system got extremely slow after login.
A bit later I got I/O errors on the TTY.
Next came something like “NVMe IRQ blah-blah error. Reset controller”.
A reboot sent me to rescue mode (“No NVMe SSD detected. Falling back to SD card”).

I used the two reset switches on the motherboard (CPU and LPC) and reinstalled the SSD. Now it works fine (this message is sent from the Reform).

But could somebody please tell me:

What could be wrong? Maybe I didn’t screw it properly or something?
Is there a proper way to diagnose this if the issue repeats in the future?

Tolstoevsky · August 14, 2021, 6:03am

Now it seems to be dead. The reset buttons and re-inserting the NVMe do not help.
Not detected - falling back to SD…
@minute Please help. If the NVMe only worked for one week, how long will the SD card last?

Tolstoevsky · August 14, 2021, 8:48am

Lol. The NVMe is back online again. Maybe it’s all about battery voltage?

Kooda · August 24, 2021, 12:02pm

I had the same problem a few weeks ago! I was using the computer fine and then suddenly I couldn’t access any file on the SSD, only getting I/O errors. Rebooting did not fix the issue: the SSD wasn’t showing up in dmesg anymore, and I had to power off and on to get it back. (This is using the SSD from the kit.)

EDIT: 20 minutes after writing this message, it happened again! This time I noticed that the LED next to the SSD (D23) was flashing at a fixed rate (about 2 Hz) even after a soft reset (Circle+R). I had to power off and on a few times for it to come back.

minute · August 24, 2021, 12:31pm

The flashing LED probably means the NVMe is trying to do some internal repair/check. I would recommend using smartmontools, backing up your data, and maybe switching over to a new SSD if it happens. I had one SSD die on me over a year ago that never stopped flashing…

minute · August 24, 2021, 12:52pm

Hi, I forgot to ask, what is your NVMe model? Can you install smartmontools (via apt) and give us the output of smartctl --all /dev/nvme0?
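As a rough illustration of what to look for in that output, here is a minimal Python sketch (not something posted in the thread). It assumes a smartmontools version that supports `smartctl --json` and that the NVMe health data appears under the `nvme_smart_health_information_log` key; the helper names `read_report` and `health_warnings` and the sample values are hypothetical.

```python
import json
import subprocess

# Bit meanings of the NVMe "critical_warning" byte in the SMART / Health log.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "reliability degraded (media or internal errors)",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}


def read_report(device="/dev/nvme0"):
    """Run smartctl (usually needs root) and return its JSON report as a dict."""
    proc = subprocess.run(
        ["smartctl", "--json", "--all", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl's exit status is a bitmask, not a plain pass/fail
    )
    return json.loads(proc.stdout)


def health_warnings(report):
    """Return human-readable warnings found in a smartctl JSON report."""
    log = report.get("nvme_smart_health_information_log", {})
    found = []
    critical = log.get("critical_warning", 0)
    for bit, meaning in CRITICAL_WARNING_BITS.items():
        if critical & (1 << bit):
            found.append(meaning)
    if log.get("media_errors", 0) > 0:
        found.append(f"{log['media_errors']} media errors logged")
    return found


# Hypothetical report fragment; on a real machine use read_report() instead.
sample = {
    "nvme_smart_health_information_log": {"critical_warning": 0b100, "media_errors": 12}
}
print(health_warnings(sample))
```

An empty list only means these two indicators are clean; it does not prove the drive is healthy.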

Tolstoevsky · August 24, 2021, 12:52pm

My SSD from the kit is now definitely dead: there is nothing about it in dmesg. I replaced it with a new one and am working from it now…

minute · August 24, 2021, 12:52pm

Ah, our messages have crossed. I’m very sorry to hear this. Was it the 256GB model from the DIY kit?

Tolstoevsky · August 24, 2021, 12:56pm

Not DIY - it’s 256 GB from full Reform+wifi (SSD and wifi were packed in the box with laptop and SD)

minute · August 24, 2021, 1:03pm

Out of curiosity, which new model of SSD did you put in? And if you want, you can send us the defective SSD back and we will send it to the maker for replacement.

Tolstoevsky · August 24, 2021, 1:08pm

It’s a local electronics retailer’s own label, called DEXP. I don’t know the real manufacturer (I will try to find out in a few hours). It’s cheap and probably crappy; we’ll see by its behaviour…

Kooda · September 24, 2021, 10:46am

My SSD is now dead as well (256GB Transcend from the DIY kit). What would be the procedure to get it replaced? I suppose it should still be covered by the warranty.

minute · September 25, 2021, 4:20pm

I am very sorry to hear this. Please get in touch via lukas at mntre dot com and we’ll handle the replacement.

Edit: I’ve just seen that Transcend have a warranty service of 3-5 years. This might be the most efficient way to replace the drive(s): Warranty Periods - Transcend Information, Inc.

jirka · January 13, 2023, 12:08pm

Interestingly, my 512GB SSD from the kit died yesterday, just after I concluded that my setup is ideal. The LED next to the SSD was constantly blinking and the drive was no longer detected.

I have replaced it with a (bigger) Seagate one.

minute · January 13, 2023, 5:19pm

I am sorry to hear this but glad you have a better replacement.

josch · January 14, 2023, 4:39am

Since it is the model that comes with the reform, I bought a 1TB Transcend 220S M.2 2280 (TS1TMTE220S) in October 2022. That drive died last week as well with these errors in dmesg before it went down:

[ 2461.455397] nvme0n1: I/O Cmd(0x2) @ LBA 365600216, 256 blocks, I/O Error (sct 0x2 / sc 0x81) MORE 
[ 2461.466519] critical medium error, dev nvme0n1, sector 365600216 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
[ 2461.514774] nvme0n1: I/O Cmd(0x2) @ LBA 365600392, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE 
[ 2461.525875] critical medium error, dev nvme0n1, sector 365600392 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 2461.835394] nvme0n1: I/O Cmd(0x2) @ LBA 17021144, 128 blocks, I/O Error (sct 0x2 / sc 0x81) MORE 
[ 2461.847014] critical medium error, dev nvme0n1, sector 17021144 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 2
[ 2461.860241] nvme0n1: I/O Cmd(0x2) @ LBA 17021400, 256 blocks, I/O Error (sct 0x2 / sc 0x81) MORE 
[ 2461.871901] critical medium error, dev nvme0n1, sector 17021400 op 0x0:(READ) flags 0x80700 phys_seg 14 prio class 2
[ 2461.920537] nvme0n1: I/O Cmd(0x2) @ LBA 17021184, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE 
[ 2461.933666] critical medium error, dev nvme0n1, sector 17021184 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 2461.980344] nvme0n1: I/O Cmd(0x2) @ LBA 17021184, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE 
[ 2461.993570] critical medium error, dev nvme0n1, sector 17021184 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 4373.792523] nvme0n1: I/O Cmd(0x2) @ LBA 18844640, 256 blocks, I/O Error (sct 0x2 / sc 0x81) MORE 
[ 4373.803682] critical medium error, dev nvme0n1, sector 18844640 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
[ 4373.851291] nvme0n1: I/O Cmd(0x2) @ LBA 18844760, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE 
[ 4373.863513] critical medium error, dev nvme0n1, sector 18844760 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 4728.084598] nvme0n1: I/O Cmd(0x2) @ LBA 1325051056, 256 blocks, I/O Error (sct 0x2 / sc 0x81) MORE 
[ 4728.094794] critical medium error, dev nvme0n1, sector 1325051056 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2

And nvme error-log was full of these:

Error Log Entries for device:nvme0n1 entries:64
.................
 Entry[ 0]   
.................
error_count	: 975
sqid		: 1
cmdid		: 0x619e
status_field	: 0x2281(Unrecovered Read Error: The read data could not be recovered from the media)
phase_tag	: 0
parm_err_loc	: 0xffff
lba		: 0x103bb78
nsid		: 0x1
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
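
The status_field value can be split into its parts by hand. The sketch below assumes the printed value is the NVMe completion status with the phase tag already removed (the phase tag is listed separately above), so bits 0-7 are the status code, bits 8-10 the status code type, bit 13 the "more" flag and bit 14 "do not retry". The function name `decode_status` is made up, and the name tables cover only the values seen here.

```python
# Only the values that appear in this thread are named; others fall back to hex.
SCT_NAMES = {0x2: "Media and Data Integrity Errors"}
SC_NAMES = {(0x2, 0x81): "Unrecovered Read Error"}


def decode_status(status):
    """Split a 15-bit NVMe status value (phase tag excluded) into its fields."""
    sc = status & 0xFF            # status code
    sct = (status >> 8) & 0x7     # status code type
    return {
        "sct": sct,
        "sc": sc,
        "crd": (status >> 11) & 0x3,       # command retry delay
        "more": bool(status & 0x2000),     # more details are in the error log
        "dnr": bool(status & 0x4000),      # do not retry
        "sct_name": SCT_NAMES.get(sct, hex(sct)),
        "sc_name": SC_NAMES.get((sct, sc), hex(sc)),
    }


decoded = decode_status(0x2281)
print(decoded)
```

Under that assumption, 0x2281 decodes to sct 0x2 and sc 0x81 with the "more" flag set, which matches the `sct 0x2 / sc 0x81) MORE` text in the dmesg lines.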

I wonder why that happened because the smart info looked okay:

josch@reform:~$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning			: 0
temperature				: 47°C (320 Kelvin)
available_spare				: 97%
available_spare_threshold		: 10%
percentage_used				: 0%
endurance group critical warning summary: 0
data_units_read				: 2166207
data_units_written			: 4421615
host_read_commands			: 44677970
host_write_commands			: 57073967
controller_busy_time			: 1719
power_cycles				: 74
power_on_hours				: 1368
unsafe_shutdowns			: 37
media_errors				: 857
num_err_log_entries			: 857
Warning Temperature Time		: 0
Critical Composite Temperature Time	: 0
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 0
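
For reading these counters, a small parsing sketch may help. It assumes the `key : value` text layout shown above and that one NVMe data unit is 1000 blocks of 512 bytes, as defined for the SMART / Health log; the helper name `parse_smart_log` is made up and the sample is a subset of the output above. Note that critical_warning is 0 while media_errors is 857, so a single summary flag does not tell the whole story.

```python
import re

DATA_UNIT_BYTES = 512 * 1000  # one data unit = 1000 blocks of 512 bytes


def parse_smart_log(text):
    """Parse 'key : value' lines, keeping the leading integer of each value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        m = re.match(r"\s*(\d+)", value)
        if m:  # header lines without a numeric value are skipped
            fields[key.strip()] = int(m.group(1))
    return fields


SAMPLE = """\
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
available_spare : 97%
data_units_read : 2166207
data_units_written : 4421615
power_cycles : 74
unsafe_shutdowns : 37
media_errors : 857
"""

fields = parse_smart_log(SAMPLE)
written_bytes = fields["data_units_written"] * DATA_UNIT_BYTES
read_bytes = fields["data_units_read"] * DATA_UNIT_BYTES
unsafe_share = fields["unsafe_shutdowns"] / fields["power_cycles"]

print(f"written: {written_bytes / 1e12:.2f} TB, read: {read_bytes / 1e12:.2f} TB")
print(f"unsafe shutdowns: {unsafe_share:.0%} of power cycles")
print(f"media errors: {fields['media_errors']}")
```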

Yes, the “unsafe_shutdowns” counter is high but this is no surprise as I had to hard-reset the reform often while working on Debian integration of the reform kernel. And 37 unsafe shutdowns shouldn’t kill a drive, no?

At least the drive was still under warranty so I sent it in to get my money back. I’m now back to using a Western Digital Blue SN550 1 TB (WDS100T2B0C) with my reform.

If nothing else this is yet another reminder to do frequent backups because hardware can fail at any time!

@minute what do you think about including the nvme-cli package by default in the rescue sd-card image? Adding that package would require 1595 kB additional space on the sd-card image.