Has anyone done any local LLM inference on the rk3588, what's the performance like?

I’ve been considering what role this could have in my workflow. I’ll probably keep using it on my desktop, maybe over SSH, but I just wanted to know what the inference performance is like. Responses take a few seconds to render even on my fairly overpowered gaming PC.

I haven’t personally tried it, but apparently it is possible. Libre drivers for the NPU were submitted to mainline last year, although I don’t know their current status. As with everything LLM-related, much of the ecosystem is in flux, and there’s no telling what currently works without dredging through forums and probably Discord chats.

Resources that may help:

Edit: I see now that you were asking more about performance. The important thing to note is that you want to leverage the NPU to get maximum performance, which involves converting models to a special format for the NPU. Performance appears to be adequate within certain context limits, as described above. It should be better than many average machines, but worse than either a desktop with a beefy GPU or an M-series Mac with its unified memory architecture.
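For anyone curious what that conversion step involves: it is done offline on a host machine with Rockchip’s rkllm-toolkit, and the resulting model file is then copied to the board. Here is a rough sketch going from memory of the toolkit’s published examples; treat the exact function names, parameters, and quantization options as approximate, since they may differ between toolkit releases:

```python
# Offline conversion sketch: HuggingFace model -> NPU-ready .rkllm file.
# Based on rkllm-toolkit examples from memory; API details may vary by version.
from rkllm.api import RKLLM

llm = RKLLM()

# Load a small chat model from a local HuggingFace checkout (path is illustrative).
ret = llm.load_huggingface(model="./Qwen2-0.5B-Instruct")
assert ret == 0, "model load failed"

# Quantize and target the RK3588 NPU (w8a8 is a commonly used dtype here).
ret = llm.build(do_quantization=True,
                quantized_dtype="w8a8",
                target_platform="rk3588")
assert ret == 0, "build/quantization failed"

# Write out the converted model, which then gets copied to the board.
ret = llm.export_rkllm("./qwen2-0.5b-w8a8.rkllm")
assert ret == 0, "export failed"
```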

Thanks a lot; this is really useful info. It seems the SDK for compiling models for the NPU is not open source, but it’s still potentially usable.

Update for anyone wanting to do this: Radxa has released a guide for running a DeepSeek R1 1.5B distilled model on the RK3588, with reported performance of about 15 tokens/s: DeepSeek shown to run on Rockchip RK3588 with AI acceleration at about 15 tokens/s - CNX Software
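If you want to sanity-check a number like that yourself, tokens/s is just generated tokens divided by wall-clock generation time. A quick way to measure it around whatever runtime you end up using; note that `generate()` here is a hypothetical stand-in for your actual inference call, not a real API:

```python
# Rough tokens/s measurement; wrap this around whatever inference call you use.
# generate() is a hypothetical placeholder returning the generated token ids.
import time

def measure_tps(generate, prompt, max_new_tokens=128):
    start = time.perf_counter()
    output_tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    tps = len(output_tokens) / elapsed
    print(f"{len(output_tokens)} tokens in {elapsed:.2f}s -> {tps:.1f} tokens/s")
    return tps
```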


I’m still unable to get a suitably licensed version of RKLLM2, which is used to compile models into the correct format for the NPU.

But very cool nonetheless!