When I first conceived the design of Austin, I swore to always adhere to two guiding principles:
- no dependencies other than the standard C library (and whatever system calls the OS provides);
- minimal impact on the tracee, even under high sampling frequency.
Let me elaborate on why I decided to stick to these two rules. The first one
is mostly a choice of simplicity. The workhorse of Austin is the ability
to read the private memory of any process, be it a child process or not. Many
platforms provide an API or system calls to do that, some with more security
gotchas than others. Once Austin has access to that information, the rest is
plain C code that makes sense of the data and provides a meaningful
representation to the user by merely calling
fprintf in a loop.
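
To give an idea of what "reading the private memory of a process" can look like, here is a minimal sketch for Linux only. It assumes the process_vm_readv system call is the mechanism in use, which is my illustration rather than a statement about Austin's actual code, and the helper name copy_remote is hypothetical. The call is made through syscall() so that the sketch does not depend on feature-test macros; glibc also exposes a process_vm_readv wrapper when _GNU_SOURCE is defined.

```c
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: copy `len` bytes located at `remote_addr` in the
 * address space of process `pid` into `buf`. The target process is not
 * stopped while the copy happens.
 * Returns 0 on success, -1 on error or on a short read. */
static int
copy_remote(pid_t pid, const void *remote_addr, void *buf, size_t len)
{
    struct iovec local  = { .iov_base = buf, .iov_len = len };
    struct iovec remote = { .iov_base = (void *)remote_addr, .iov_len = len };

    /* One local and one remote buffer; the flags argument must be 0. */
    long n = syscall(SYS_process_vm_readv, (long)pid,
                     &local, 1UL, &remote, 1UL, 0UL);

    return (n >= 0 && (size_t)n == len) ? 0 : -1;
}
```

Reading from another process is subject to the same permission checks as attaching to it with ptrace, which is one of the "security gotchas" mentioned above. A process can always read from itself, so calling copy_remote(getpid(), &some_variable, &copy, sizeof copy) is an easy way to try the helper out.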
The second guiding principle is what everybody desires from observability tools. We want to be able to extract as much information as possible from a running program while perturbing it as little as possible, so as to avoid skewed data. Austin can make this guarantee because reading VM memory does not require the tracee to be halted. Furthermore, the fact that Python has a GIL implies that a simple Python application will run on at most one physical core. To be more precise, a normal, pure-Python application would not spend more CPU time than wall-clock time. Therefore, on machines with multiple cores, even if Austin ends up acting like a busy loop at high sampling frequencies and hogging a physical core, there would still be plenty of other cores to run the Python application unperturbed and unaware that it is being spied on.
Even for multiprocess applications, the expected impact is minimal: if you are running, say, a uWSGI server on a 64-core machine, you wouldn't lose much if Austin hogs one of the cores. Besides, you probably don't need to sample at very high frequencies (like once every 50 microseconds, that is, 20 kHz). You could be happy with, e.g., 1000 Hz, which is still pretty high but would not cause Austin to require an entire core for itself.
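
The arithmetic behind these numbers is simple enough to sketch in a few lines of C. The helper names and the per-sample durations mentioned afterwards are made up for illustration and are not measurements of Austin.

```c
#include <stdint.h>

/* Sampling frequency in Hz for a sampling interval given in microseconds.
 * Returns 0 for a zero interval instead of dividing by zero. */
static uint64_t
interval_us_to_hz(uint64_t interval_us)
{
    return interval_us == 0 ? 0 : UINT64_C(1000000) / interval_us;
}

/* Microseconds the sampler can sleep after a sample that took `elapsed_us`.
 * When a sample takes at least as long as the interval there is nothing
 * left to sleep, and the sampler degenerates into a busy loop. */
static uint64_t
idle_us(uint64_t interval_us, uint64_t elapsed_us)
{
    return elapsed_us >= interval_us ? 0 : interval_us - elapsed_us;
}
```

With a hypothetical cost of 120 microseconds per sample, idle_us(1000, 120) leaves 880 microseconds of sleep at 1000 Hz, whereas a 50 microsecond interval (20 kHz) would already be exceeded by an 80 microsecond sample, and idle_us(50, 80) is 0.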
When you put these two principles together, you get a tool that compiles down to a single tiny binary and that has minimal impact on the tracee at runtime. The added bonus is that it doesn't even require any instrumentation! These are ideal features for an observability tool, and they make Austin very well suited for running in a production environment.
But Austin's strengths are, unfortunately, also its limitations. What if our
application has parts written as Python extensions, e.g. native C/C++
extensions, Cython, Rust, or even assembly? By
reading a process's private VM, Austin can only reconstruct the pure-Python call
stacks. To unwind the native call stacks, Austin would need to use some heavier
machinery. Never mind that using a third-party library for that would
violate the first principle; the more serious issue here is that there is
currently no way of avoiding the use of system calls like
ptrace from user-space. This would be a serious violation of the second principle. Why?
Because stack unwinding using
ptrace requires threads to be halted, thus
causing a non-negligible impact on the tracee. Besides, stack unwinding is not
exactly straightforward to implement on every platform.
The compromise is austinp, a variant of Austin that can do native
stack unwinding, just on Linux, using
ptrace. This tool is to be used when you really need observability into native
call stacks, as the use of
ptrace implies that the tracee will be impacted to
some extent. This is why, by default,
austinp samples at a much lower rate.
This doesn't mean that you cannot use this tool in a production environment, but
that you should be aware of the potential penalties that come with it. Many
observability tools from the past relied on
ptrace or similar to achieve their goals;
austinp is just a (relatively) new entry in that list. More modern
solutions rely on technologies like eBPF to provide efficient
observability into the Linux kernel, as well as into user-space.
Speaking of the Linux kernel, eBPF is not the only way to retrieve kernel
stacks. In the future we might have a variant of Austin that relies on eBPF for
some heavy lifting, but for now
austinp leverages the information exposed by
procfs to push stack unwinding down to the Linux kernel level. The
austinp variant has the same CLI as Austin, plus an extra option
that can be used to sample kernel stacks alongside native ones. I have yet to
find a valid use-case for wanting to obtain kernel observability from a Python
program, but I think this could be an interesting way to see how the interpreter
interacts with the kernel; and perhaps someone might find ways of inspecting the
Linux kernel performance by coding a simple Python script rather than a more
verbose C equivalent.
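
As an illustration of the kind of information procfs exposes, Linux provides a stack file for each task (for example /proc/<pid>/task/<tid>/stack), which usually requires elevated privileges to read and lists one kernel frame per line in a form like "[<0>] do_syscall_64+0x5b/0x1d0". Whether austinp reads exactly this file is an assumption on my part for the sake of the example, and the helper name kernel_symbol is hypothetical. A minimal sketch of extracting the symbol name from one such line:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: extract the symbol name from one line of a procfs
 * kernel stack file, e.g. "[<0>] do_syscall_64+0x5b/0x1d0" gives
 * "do_syscall_64".
 * Returns 0 on success, -1 if the line does not have the expected shape
 * or if the symbol does not fit into `out`. */
static int
kernel_symbol(const char *line, char *out, size_t out_size)
{
    /* Skip the "[<address>] " prefix. */
    const char *start = strstr(line, "] ");
    if (start == NULL)
        return -1;
    start += 2;

    /* The symbol ends at the "+offset/size" suffix or at the end of line. */
    size_t len = strcspn(start, "+\n");
    if (len == 0 || len >= out_size)
        return -1;

    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

Since the kernel has already done the unwinding by the time the file is read, all that is left in user-space is this kind of string handling.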
You can find some examples of
austinp in action on my Twitter
account. This, for example, is what you'd get for a simple
scikit-learn classification model, when you open the collected
samples via the Austin VS Code extension:
"The latest development builds of @AustinSampler, including the austinp variant for native stack sampling on Linux are now available from @github releases https://t.co/nBfzm3mDng." (Gabriele Tornetta 🇪🇺 🇮🇹 🇬🇧, @p403n1x87, September 8, 2021; image: pic.twitter.com/IjVfAm1hRk)
If you want to give
austinp a try, you can follow the instructions in the
README for compiling from source, or download the pre-built binary
from the Development build. In the future,
austinp will be
available from ordinary releases too!