The below is something I put together to help @minute with some kernel tracing. I figured rather than me trying to write a tracing script for a set of kernel patches that I’m not using (which, to be fair, probably wouldn’t be too hard after this example) it would be more valuable to trace some totally different kernel function and thus teach a powerful tool instead!
I’ve glossed over a fair few things in all this, all questions welcome.
How it works
eBPF offers us the ability to introspect on what the kernel is doing. We’re going to look at how we can insert probes into the kernel to tell us what functions are doing on entry and exit, focusing on arguments and return codes. This helps a lot when debugging what the operating system is doing, without causing performance degradation, losing events, or resorting to lossy time-based sampling.
I’ve done this with a fairly simple test: I wanted to prove that I could print a string from an argument to a kernel function (not a syscall), one that I could trigger manually, that I already know works correctly, and that is specific to this hardware. I chose rockchip_thermal_get_temp.
Let’s skip straight to the end, here’s rockchip_thermal_get_temp.bt:
kprobe:rockchip_thermal_get_temp
{
    printf("Kprobe matched: %s\n", probe);
    $type = ((struct thermal_zone_device *)arg0)->type;
    printf("type: %s\n", $type);
}
Running it, and then in another terminal running sensors:
grimmware@fmlr/pts/4:~
> sudo bpftrace rockchip_thermal_get_temp.bt
[sudo] password for grimmware:
Attached 1 probe
Kprobe matched: kprobe:rockchip_thermal_get_temp
type: gpu-thermal
Kprobe matched: kprobe:rockchip_thermal_get_temp
type: littlecore-thermal
Kprobe matched: kprobe:rockchip_thermal_get_temp
type: bigcore0-thermal
Kprobe matched: kprobe:rockchip_thermal_get_temp
type: npu-thermal
Kprobe matched: kprobe:rockchip_thermal_get_temp
type: center-thermal
Kprobe matched: kprobe:rockchip_thermal_get_temp
type: bigcore2-thermal
Kprobe matched: kprobe:rockchip_thermal_get_temp
type: package-thermal
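The script above only hooks function entry. As a rough sketch of the exit side, a kretprobe can report the return code, and the entry probe can stash the second argument so the value written through it can be read once the function has returned. The file name rockchip_thermal_ret.bt and the map name @out are made up for illustration, and this assumes the second argument (an int *, per the BTF listing further down) is an output parameter and that the return value is a plain int status, neither of which is verified here for this driver.

```shell
# Write a sketch of an entry+exit probe pair to rockchip_thermal_ret.bt
# (hypothetical file name). Assumes arg1 is an int * output parameter.
cat > rockchip_thermal_ret.bt <<'EOF'
kprobe:rockchip_thermal_get_temp
{
    // a kretprobe cannot see the arguments, so remember the pointer per thread
    @out[tid] = arg1;
}

kretprobe:rockchip_thermal_get_temp
/@out[tid]/
{
    // retval is the return code; read the int behind the saved pointer
    $p = (int32 *)@out[tid];
    printf("ret: %d, *arg1: %d\n", (int32)retval, *$p);
    delete(@out[tid]);
}
EOF
```

Run it with sudo bpftrace rockchip_thermal_ret.bt and trigger it with sensors, as before.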
Understanding and introspecting
So it’s not terribly difficult to see what the example is doing, given it’s a 1-line script spun out to 3 lines, but that doesn’t help you bootstrap writing your own.
For starters, if you want to trace a function and have BTF enabled in your kernel, you can take a look at what the arguments for that function are:
grimmware@fmlr/pts/1:~
> sudo bpftrace -vl kprobe:rockchip_thermal_get_temp
kprobe:rockchip_thermal_get_temp
struct thermal_zone_device * arg0
int * arg1
I was looking for some string to print, on the basis that strings in eBPF are finicky things and bear addressing directly. I also wanted it to be a member of a struct, so I could get my brain around how I needed to cast the complex data structures I was probably going to run into using bpftrace. We’ve already found a struct that sounds like the sort of thing that probably needs a friendly name…
But do I really have to look up the struct for my kernel version on the internet? Again, BTF to the rescue with a really nasty little awk script:
grimmware@fmlr/pts/1:~
> sudo bpftool btf dump file /sys/kernel/btf/vmlinux format c | awk '
BEGIN { is_struct=0 }
/^struct thermal_zone_device {$/ { is_struct=1 }
is_struct==1 { print $0 }
/^};$/ { is_struct=0 }
'
struct thermal_zone_device {
int id;
char type[20];
struct device device;
struct completion removal;
struct completion resume;
struct attribute_group trips_attribute_group;
struct list_head trips_high;
struct list_head trips_reached;
struct list_head trips_invalid;
enum thermal_device_mode mode;
void *devdata;
int num_trips;
long unsigned int passive_delay_jiffies;
long unsigned int polling_delay_jiffies;
long unsigned int recheck_delay_jiffies;
int temperature;
int last_temperature;
int emul_temperature;
int passive;
int prev_low_trip;
int prev_high_trip;
struct thermal_zone_device_ops ops;
struct thermal_zone_params *tzp;
struct thermal_governor *governor;
void *governor_data;
struct ida ida;
struct mutex lock;
struct list_head node;
struct delayed_work poll_queue;
enum thermal_notify_event notify_event;
u8 state;
struct list_head user_thresholds;
struct thermal_trip_desc trips[0];
};
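If you find yourself doing this for more than one struct, the awk filter generalises easily. Here is a minimal sketch that takes the struct name as a parameter; btf_struct is a made-up helper name, not part of bpftool. Like the script above, it relies on top-level definitions in the C dump opening with "struct NAME {" and closing with "};" at the start of a line.

```shell
# btf_struct NAME: print the definition of one top-level struct from a
# C-format BTF dump on stdin. Hypothetical helper, not part of bpftool.
btf_struct() {
    awk -v name="$1" '
        $0 == "struct " name " {" { in_struct = 1 }   # opening line of the struct
        in_struct                  { print }           # echo every line while inside
        in_struct && $0 == "};"    { exit }            # stop at the closing line
    '
}
```

Usage would then be: sudo bpftool btf dump file /sys/kernel/btf/vmlinux format c | btf_struct thermal_zone_device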
All of this type data is read directly from BTF by bpftrace, which is why we don’t have to declare the struct ourselves like we would have had to do without BTF.
Because type is a char array rather than a pointer to a string, we can use it directly: bpftrace can infer from the data above that it just has to copy 20 bytes out, and there are safe, bounds-protected ways of doing that. If it were a pointer to a sequence of null-terminated chars, however, we would need to use the kernel’s wrapper for copying the memory, which would mean wrapping it in str().
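To make the array/pointer distinction concrete, here is a sketch that prints one of each from the same argument. type is the char array from the dump above. device.kobj.name is assumed to be a const char *, since struct device embeds a struct kobject whose name member is a pointer in mainline kernels; check that member path against your own BTF dump before relying on it. The file name thermal_names.bt is made up.

```shell
# Write a sketch comparing a char array member with a char pointer member.
# thermal_names.bt is a hypothetical file name; the device.kobj.name path
# is an assumption about this kernel's struct layout.
cat > thermal_names.bt <<'EOF'
kprobe:rockchip_thermal_get_temp
{
    $tz = (struct thermal_zone_device *)arg0;
    // char type[20]: a fixed-size array, usable directly with %s
    printf("type: %s\n", $tz->type);
    // const char *name: a pointer, so copy the string out with str()
    printf("kobj name: %s\n", str($tz->device.kobj.name));
}
EOF
```

As with the first script, run it with sudo bpftrace thermal_names.bt and trigger it from another terminal.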
There are a whole bunch of nuances, gotchas and interesting features but the great thing is it’s incredibly performant and you basically can’t break anything doing it. There are plenty of intros to eBPF online, and many different tools for interacting with it and a pretty extensive set of applications, but you should especially read up on what its limitations are if you want to do anything more advanced than the above.
Setup
To enable BTF data in the MNT Pocket Reform kernel package build, there were two changes needed:
- Add CONFIG_DEBUG_INFO_BTF=y to linux/config
- Remove pkg.linux.nokerneldbg and pkg.linux.nokerneldbginfo from DEB_BUILD_PROFILES in linux/build.sh
This results in the usual package build also producing a package for linux-image-$(uname -r)-mnt-reform-arm64-dbg to be installed in addition to the rest of the kernel packages. You’ll need the header packages installed as well.
You’ll also want the tools bpftrace and bpftool, and be aware that on most systems this capability is restricted to root by default.
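A quick way to check those prerequisites before going any further is sketched below. The helper names are made up for illustration, and the default BTF path is the one used with bpftool earlier in this post.

```shell
# Hypothetical preflight helpers; none of these names come from bpftrace or bpftool.

# Succeeds if BTF is readable at the given path (default: the vmlinux BTF
# file used with bpftool earlier).
has_btf() {
    [ -r "${1:-/sys/kernel/btf/vmlinux}" ]
}

# Prints each named tool that is not on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeeds if the current user is root.
is_root() {
    [ "$(id -u)" -eq 0 ]
}
```

For example, missing_tools bpftrace bpftool prints nothing when both are installed, and has_btf fails on a kernel built without CONFIG_DEBUG_INFO_BTF.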
Where you can take this
With all this in place and a decent kernel source browser, you can do all manner of mad tricks - for instance at work I wanted to trace commands run by logged-in users on our servers. I was able to use eBPF and the bcc framework to filter execve syscalls to just those with loginuid set (i.e. a person had logged in) but there was still an absolute spamcannon of PATH searches because shells mostly don’t stat to see if the file is there before trying to execve it. We could have stashed the syscall arguments to dispatch to userland on return, but then we’d never get a log line for a long-running process!
I solved this by browsing elixir.bootlin.com and finding that there’s a nonstatic function used in the syscall that actually loads the binary into memory. I was able to hook that to then only dispatch a log line with the execve if it actually reached loading into memory (after existence and permission checks). This is obviously not a stable API, unlike syscall probes.
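Because internal functions can be renamed, inlined or removed between kernel versions, it can be worth checking that the symbol still exists before attaching to it. A minimal sketch, assuming a kallsyms-style listing of "address type name [module]" per line; has_symbol is a made-up helper name. A symbol being listed doesn't guarantee a kprobe can attach to it, so treat this as a first sanity check only.

```shell
# has_symbol NAME [FILE]: succeed if NAME appears as a symbol in a
# kallsyms-style listing. FILE defaults to /proc/kallsyms.
# Hypothetical helper for illustration.
has_symbol() {
    awk -v sym="$1" '
        $3 == sym { found = 1; exit }   # third column is the symbol name
        END       { exit !found }       # exit status 0 only if it was seen
    ' "${2:-/proc/kallsyms}"
}
```

Something like has_symbol rockchip_thermal_get_temp || echo "function not present on this kernel" would then fail early rather than leaving you wondering why a probe never fires.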