Changes introduced by kernel 6.5~rc7-1~exp1+reform20230831T205634Z1

Yesterday I upgraded my MNT Reform; the upgrade installed kernel 6.5~rc7-1~exp1+reform20230831T205634Z1. The system boots after the upgrade, but the new kernel brings two changes: one negative, one potentially positive.

Let’s start with the negative. The new kernel emits warnings related to encryption:

[  725.982526] INFO: task cryptomgr_test:495 blocked for more than 604 seconds.
[  725.992660]       Tainted: G        W  O       6.5.0-0-reform2-arm64 #1 Debian 6.5~rc7-1~exp1+reform20230831T205634Z1
[  726.005866] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  726.016334] task:cryptomgr_test  state:D stack:0     pid:495   ppid:2      flags:0x00000008
[  726.027373] Call trace:
[  726.032461]  __switch_to+0xe8/0x168
[  726.038617]  __schedule+0x390/0xd08
[  726.044752]  schedule+0x58/0xf0
[  726.050545]  schedule_timeout+0x170/0x188
[  726.057205]  __wait_for_common+0xcc/0x260
[  726.063868]  wait_for_completion+0x28/0x40
[  726.070628]  test_skcipher_vec_cfg+0x4bc/0x6a8
[  726.077718]  test_skcipher+0xac/0x140
[  726.083971]  alg_test_skcipher+0xa0/0x1a8
[  726.090514]  alg_test+0x144/0x550
[  726.096290]  cryptomgr_test+0x2c/0x50
[  726.102373]  kthread+0xe8/0xf8
[  726.107829]  ret_from_fork+0x10/0x20

and

[  199.504748] ------------[ cut here ]------------
[  199.513501] WARNING: CPU: 2 PID: 425 at crypto/api.c:172 crypto_wait_for_test+0xa4/0xd8
[  199.525149] Modules linked in: overlay binfmt_misc ext4 crc16 mbcache jbd2 hantro_vpu iwlmvm mac80211 v4l2_vp9 caam_jr(+) v4l2_h264 caamhash_desc snd_soc_fsl_asoc_card videobuf2_dma_contig caamalg_desc snd_soc_imx_audmux libarc4 crypto_engine v4l2_mem2mem snd_soc_simple_card_utils snd_soc_fsl_sai authenc videobuf2_memops libdes snd_soc_fsl_utils bonding snd_ac97_codec videobuf2_v4l2 crypto_null imx_pcm_dma reform2_lpc(O) snd_soc_wm8960 cpufreq_dt iwlwifi ac97_bus videodev tls snd_soc_core videobuf2_common chaoskey snd_pcm_dmaengine rng_core snd_pcm etnaviv fsl_imx8_ddr_perf mc gpu_sched snd_timer governor_userspace cfg80211 snd imx_bus soundcore caam rfkill imx2_wdt evdev error imx_mailbox imx_cpufreq_dt qoriq_thermal imx_sdma pkcs8_key_parser loop fuse efi_pstore dm_mod dax configfs ip_tables x_tables autofs4 btrfs blake2b_generic xor xor_neon hid_generic raid6_pq libcrc32c crc32c_generic imx8mq_interconnect imx_interconnect usbhid hid xhci_plat_hcd xhci_hcd usbcore nvme nvme_core t10_pi crc64_rocksoft_generic
[  199.525400]  polyval_ce polyval_generic ghash_ce gf128mul sha2_ce crc64_rocksoft crc_t10dif sha256_arm64 at803x sha1_ce imx_dcss sdhci_esdhc_imx crct10dif_generic dwc3 sdhci_pltfm phy_fsl_imx8mq_usb cqhci udc_core roles fec ulpi selftests usb_common of_mdio fixed_phy crct10dif_ce fwnode_mdio crc64 sdhci libphy crct10dif_common rtc_pcf8523 nvmem_imx_ocotp spi_imx mxsfb drm_dma_helper nwl_dsi phy_fsl_imx8_mipi_dphy ti_sn65dsi86 panel_edp drm_display_helper drm_dp_aux_bus drm_kms_helper drm pwm_bl pwm_imx27 i2c_mux_pca954x i2c_mux fan53555 i2c_imx fixed mux_mmio mux_core reset_imx7 aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[  199.704969] CPU: 2 PID: 425 Comm: (udev-worker) Tainted: G        W  O       6.5.0-0-reform2-arm64 #1  Debian 6.5~rc7-1~exp1+reform20230831T205634Z1
[  199.723109] Hardware name: MNT Reform 2 (DT)
[  199.732182] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  199.743953] pc : crypto_wait_for_test+0xa4/0xd8
[  199.753154] lr : crypto_wait_for_test+0x98/0xd8
[  199.762194] sp : ffff800082953680
[  199.769856] x29: ffff800082953680 x28: ffff80007a738600 x27: 0000000000000000
[  199.781243] x26: 0000000000000000 x25: 0000000000000001 x24: ffff80007a7216c8
[  199.792497] x23: 0000000000000003 x22: ffff800081a52440 x21: ffff80007a738640
[  199.803606] x20: ffff0000cf440a00 x19: ffff800081a52468 x18: 0000000000000000
[  199.814569] x17: 0000000000000000 x16: 0000000000000020 x15: 0000000000000002
[  199.825386] x14: 1740000000000000 x13: 0000000000000000 x12: 0000000000000031
[  199.836103] x11: 0000000000000002 x10: 0000000000000001 x9 : ffff8000800cccd0
[  199.846752] x8 : ffff0000ce1553b0 x7 : 747365745f72676d x6 : 676d6f7470797263
[  199.857395] x5 : 0000000000000000 x4 : ffff0000ff778af0 x3 : 0000000000000000
[  199.868029] x2 : ffffffffffffff00 x1 : 0000000000008001 x0 : 0000000000000000
[  199.878687] Call trace:
[  199.884630]  crypto_wait_for_test+0xa4/0xd8
[  199.892316]  crypto_register_alg+0xc8/0x100
[  199.899984]  crypto_register_skcipher+0x7c/0x98
[  199.907994]  caam_algapi_init+0x268/0x570 [caam_jr]
[  199.916388]  caam_jr_probe+0x518/0x668 [caam_jr]
[  199.924515]  platform_probe+0x70/0xd8
[  199.931662]  really_probe+0x190/0x3d8
[  199.938812]  __driver_probe_device+0x84/0x180
[  199.946660]  driver_probe_device+0x44/0x120
[  199.954342]  __driver_attach+0xf8/0x208
[  199.961669]  bus_for_each_dev+0x80/0xe8
[  199.968964]  driver_attach+0x2c/0x40
[  199.975996]  bus_add_driver+0x11c/0x238
[  199.983294]  driver_register+0x64/0x138
[  199.990600]  __platform_driver_register+0x30/0x48
[  199.998772]  jr_driver_init+0x3c/0xff8 [caam_jr]
[  200.006890]  do_one_initcall+0x5c/0x290
[  200.014198]  do_init_module+0x60/0x210
[  200.021424]  load_module+0x2198/0x2288
[  200.028636]  init_module_from_file+0x8c/0xd8
[  200.036373]  __arm64_sys_finit_module+0x1b4/0x3a0
[  200.044555]  invoke_syscall+0x78/0x100
[  200.051779]  el0_svc_common.constprop.0+0xcc/0xf8
[  200.059969]  do_el0_svc+0x40/0xa8
[  200.066772]  el0_svc+0x34/0xd8
[  200.073309]  el0t_64_sync_handler+0x120/0x130
[  200.081151]  el0t_64_sync+0x190/0x198
[  200.088282] ---[ end trace 0000000000000000 ]---
[  200.102118] caam algorithms registered in /proc/crypto
[  200.113198] caam 30900000.crypto: caam pkc algorithms registered in /proc/crypto
[  200.124770] caam 30900000.crypto: registering rng-caam
[  200.144303] caam 30900000.crypto: rng crypto API alg registered prng-caam
[  200.164490] caam_jr 30903000.jr: failed to create crypto request pump task
[  200.175190] caam_jr 30903000.jr: Could not init crypto-engine
[  200.184765] caam_jr: probe of 30903000.jr failed with error -12

They occur both when encrypted partitions are in use and when they are not. I was able to unlock a LUKS partition by entering the password during boot, so the error does not seem to be fatal. Still, I’m not sure how severe the problem is, so I haven’t yet upgraded my main system, most of which resides on an encrypted NVMe drive.

Should we discuss problems here, or should I file a bug in the Debian BTS?

The second change is more promising. I tested suspend and resume, and it was better than before! The screen was active after resume, without the need to toggle it with swaymsg output eDP-1 disable/enable.
However, dmesg was still showing

[  221.789440] nvme 0001:01:00.0: Unable to change power state from D3hot to D0, device inaccessible

On the original MNT system (built using the official scripts around October 2022 and upgraded since then) the NVMe was deactivated shortly afterward, but with a slightly different error message:

[  283.634916] nvme nvme0: I/O 24 QID 0 timeout, disable controller
[  283.670976] nvme nvme0: Identify Controller failed (-4)
[  283.676270] nvme nvme0: Disabling device after reset failure: -5

But my custom image (described shortly here) had NVMe responding! At least sgdisk --print /dev/nvme0n1 showed the proper disk layout and did not complain about an empty device. I haven’t tested suspend/resume with a mounted or decrypted partition, though.

It brings me a bit of hope, but I’m still a bit worried about safety here. Any tips how to test it without risking my disk and data on it?

I don’t have my system encrypted, so I am ambivalent about the encryption regression. It seems to be a separate issue, though, and one that has nothing to do with suspend.

I have had 4 successful resumes from suspend with 6.5. Are you saying that I can remove the sway output portion of the script that we added?

The interesting thing I have noticed is that on 6.4 the screen always came back, but was unresponsive UNTIL the sway output was toggled. I fear this is still the case, but am interested in your findings nonetheless.

Hi, I’m the one who usually rebases the MNT Reform patch stack and makes new kernel builds happen in the GitLab CI, so I guess I should reply here. But beware: while I am able to apply patches and build software, I’m not a kernel expert, so I’m afraid I cannot be of help with your specific problems.

It’s fine to discuss problems here, absolutely. If you bring something up in the Debian BTS, be careful about it, because this is not the Debian kernel. It is 99% the Debian kernel, but it has quite a few patches on top, uses a slightly different kernel config, and is built with some build profiles enabled that are not used for the regular builds. So any problem you spot might very well be due to those changes, and obviously Debian maintainers are not enthusiastic about having problems reported to them that are not their fault. If you do decide to ask the Debian kernel maintainers, be very clear about what kernel you are using and please put josch@debian.org (me) in CC.

But my custom image (described shortly here) had NVMe responding!

Are you saying that using the same hardware but with a different system image you got different results with your nvme after resume? Is that reproducible over multiple resumes? What are the differences?

Any tips how to test it without risking my disk and data on it?

The usual way: create a backup disk image of your system with dd so that you can restore everything later if anything gets messed up.
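A minimal sketch of such a backup, assuming GNU dd on Debian and enough free space on a second disk. The function name and the example paths are placeholders, not something taken from the MNT scripts.

```shell
# backup_image SRC DST: copy SRC to DST with dd, then verify the copy.
# SRC would be the whole disk (for example /dev/nvme0n1) and DST a file on
# another disk; both names are placeholders chosen for this sketch.
backup_image() {
    src=$1
    dst=$2
    # conv=fsync flushes the image to the target before dd exits.
    dd if="$src" of="$dst" bs=4M conv=fsync || return 1
    # cmp exits non-zero if the image differs from the source.
    cmp "$src" "$dst"
}

# Example (as root, with the disk not mounted, e.g. booted from an SD card):
#   backup_image /dev/nvme0n1 /mnt/backup/nvme0n1.img
```

Restoring is the same dd call with the source and target swapped. Verifying with cmp reads the whole disk a second time, so it can be dropped if time matters more than certainty.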

I have bad news: I let my system rest over the weekend and tested it on Monday and today, and the NVMe does not wake up after suspend.
I have multiple SD cards with different variants of the system: some with the official MNT image, some with my variant. Some have the old U-Boot (v3), some the new one (2023-07-04). Some have encrypted partitions, some don’t. Of course I do not have all variants (5 to be precise), but I believe it should be enough for testing. I didn’t plan it this way, but decided that they could be useful for testing. :grinning:
Unfortunately, this time the NVMe did not wake up, neither under 6.5~rc7 nor under 6.5.1-1~exp1+reform20230903T163512Z1. I tested all variants, and the situation is the same on all of them. I even tried running sgdisk --print /dev/nvme0n1 right after resume, but in that case it hangs until the kernel decides the NVMe is unresponsive. After that the situation is the same as before. I’m not sure what the next steps could be.
As for the crypto warnings: they do not occur on 6.4.11-1+reform20230821T151707Z1 (not even with a lower log level). On 6.5 they occur every few minutes, and on such a system the load is higher than usual: about 1 on some variants, almost 2 (1.80-1.90) on others. I’m not sure what to think about that. I’ll try to upgrade my old amd64 system to 6.5, but I can only do that on the weekend. I hope to know more then. In any case I’m not upgrading the kernel to 6.5 on my main system :unamused:
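To compare kernels or system variants, it may help to count how often the messages quoted in this thread show up in the kernel log. This is only a sketch: the function name is made up, and the three patterns are simply copied from the log excerpts above.

```shell
# summarize_dmesg: read kernel log text on stdin and print, for each known
# message fragment, how many lines contain it.
summarize_dmesg() {
    log=$(cat)
    for pattern in \
        'blocked for more than' \
        'Unable to change power state' \
        'Disabling device after reset failure'
    do
        # grep -c exits non-zero on zero matches, hence the "|| true".
        count=$(printf '%s\n' "$log" | grep -c -F -- "$pattern" || true)
        printf '%s: %s\n' "$pattern" "$count"
    done
}

# Example (dmesg may need root):
#   dmesg | summarize_dmesg
```

Running it once per kernel version after the same uptime gives a rough number to put next to "every few minutes".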

You are not the only one for whom nvme fails to come back after suspend. I have written about this elsewhere in this forum as well. Is this a regression for you? Did resume from suspend work for you before 6.5?

I heard @2disbetter had a much better experience with suspend and 6.5.

@josch Indeed! I am at 14 successful resumes on 6.5 right now. Honestly, my experience seems to show that resume reliability is greatly improved if the supply voltage doesn’t fluctuate. (I.e.: only suspend on power, and resume on power. If you unplug in between, that’s no problem; just make sure you plug back in BEFORE you resume again.) Real scientific, I know, but those are my experiences nonetheless.

you can echo y > /sys/module/cryptomgr/parameters/notests to turn off testing noise. the crypto testing is in bad shape even on x86_64 – it used to misbehave if compiled into the kernel but work ok if loaded as a module.

I’m running off an SD card, since nvme is for 9front on my system, and I get lots of graphics crashes after I resume from sleep. the system isn’t stable enough to collect dmesg output in those circumstances, but here’s a dmesg attachment from a non-crashing session just for reference

http://okturing.com/src/16893/body

Thanks for the tips about disabling the crypto tests; I’ll try it during my next experiments.

I was testing suspend/resume on battery; following the advice above, I’ll try testing while connected to the charger. Not sure if my other issues are related. On the other hand, suspend makes the most sense “on the road”, without easy access to power, so making sure it works on battery would be the next step.

It looks like the official set (WiFi and NVMe sold together in the shop) works without much trouble, while other models do not cooperate so well. On the one hand, it’s good to have reference hardware; on the other hand, I believe that the strength of the MNT Reform is customisability, so it should work in various configurations. In the future, when there are different motherboards (new release) or CPU modules (LS1028A, i.MX8MPlus, A511… with varying numbers of available PCIe lanes), I’m afraid that more variations might make the job of stabilising the entire system more complex, especially when we add U-Boot and LPC/firmware versions. Some people upgrade those, some stay on the initial versions. And let’s not forget about different operating systems (Genode, Plan 9, …).

We (as a community) might start thinking about gathering information about working hardware, configurations, and tips for dealing with potential problems. What do you think about creating a few wiki pages (using GitLab) where we could write down what works and what doesn’t? Currently, finding it requires going through many forum posts.

Just wanted to report back and say that resume from suspend seems to be really good on 6.5. Not sure why, either.

Results of my experiments:

echo y > /sys/module/cryptomgr/parameters/notests does not help; the tests are still being run and still write to the console. This is really annoying, so for now I’ve pinned the kernel to stay at version 6.4.0-3.

I’ve upgraded my amd64/x86_64 system to Debian kernel 6.5.3-1. It behaves much better: the load is low and no crypto test is running, so there are no messages, neither on the console nor in dmesg. Not sure how to start debugging this.

We have turned off the dysfunctional crypto tests using the cryptomgr.notests kernel commandline flag. Hopefully just apt update and apt upgrade should do the trick. You can verify that by checking cat /proc/cmdline after the fact.
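A small sketch of that check, assuming a POSIX shell. The function name is hypothetical; only /proc/cmdline and the cryptomgr.notests flag come from the post above.

```shell
# has_notests CMDLINE: succeed if the kernel command line in CMDLINE
# contains the cryptomgr.notests flag (bare or with "=value").
has_notests() {
    # Pad with spaces so the flag only matches as a whole word.
    case " $1 " in
        *" cryptomgr.notests "*|*" cryptomgr.notests="*) return 0 ;;
    esac
    return 1
}

# Report the state of the running kernel.
if has_notests "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "cryptomgr.notests is set"
else
    echo "cryptomgr.notests is NOT set"
fi
```

If the flag is missing after the upgrade, the system is most likely still booted into the old kernel configuration, so a reboot is the first thing to try.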

Thanks. I can confirm that the above kernel command line parameter (cryptomgr.notests) fixes the problem. For now I’ve applied it on my test systems (see Problem upgrading kernel - #5 by serpent for a more detailed description of the situation). I’ll see how it behaves and in a few days try it on my main system.