As requested earlier on the mailing list, below is a detailed
description of user-space probes, followed by the patches.
Please review and provide your comments.
The earlier posting is at http://lkml.org/lkml/2006/3/20/2.
Changes since that posting:
- Separate patches to move generic code to the mm and vfs subsystems. (Nick)
- Use get_user_pages() for __copy_to_user_inatomic to succeed. (Arjan)
- Use kmap_atomic() instead of kmap(). (Andrew)
- Remove __lock_page() usage. (Andrew)
- Take mmap_sem before calling find_vma(). (Andrew)
- Use flush_dcache_page() in replace_original_insn(). (Andrew)
- Remove docbook style comments. (Andrew)
- Use inc/dec_preempt_count() in uprobe_handlers(). (Andrew)
A. What is the problem we are trying to solve?
The primary intent is to provide a system-wide tracing framework for Linux.
This framework can be used in conjunction with (as an extension of) kprobes
to gather information from both kernel and user-space, thus removing the
need to collect the data separately and correlate it afterwards. It provides
a system-wide view of the problem at hand.
Some use-cases could be:
- One process depletes a system-wide resource (dcache, etc)
- One process owns resources exclusively, causing others to wait
- One process hogs the CPU or I/O bandwidth.
B. Why does Linux need this feature?
As Linux gets deployed in bigger and more complicated computing environments,
more issues relating to performance are surfacing. Tools that provide a
holistic view of the system can provide invaluable insights to the problem
at hand. Some debug scenarios require system-wide instrumentation so that
thousands of active probes can co-exist with low overhead and every probe
hit on any instance of a binary can be detected. There are situations where
existing tools like ptrace do not scale well.
- When working on networking-related performance problems, you need to
correlate instrumentation from multiple layers (the MAC layer
and IP stack in the kernel up to the application in question).
- Diagnosing problems with the X-Windows server, for example, might
require instrumenting all clients that connect to the server.
- When tackling issues relating to performance of distributed systems,
involving, say, a filesystem, samba, apache and the like, gathering
data independently and then correlating the same is going to be a
C. Design drivers
The primary drivers in arriving at this design were:
- Dynamic instrumentation that can be created, installed and removed as
needed without rebooting or restarting applications
- System-wide instrumentation, with the user free to retain or discard
data as desired and able to gather both user and kernel data with the
same instrumentation code
- Not having to force COW on pages
- Not having to force pages into memory just to insert probes
- Not having to be concerned with evicting pages from memory under pressure
- Ability to probe shared libraries
- Ability to insert probes on applications that are yet to be started
- Probes are visible across fork() calls
D. Advantages of this approach
- No COW/privatization of pages or forcing of pages into memory just for
the sake of probe insertion
- No restriction on evicting pages with probes from memory
- Since probes are inserted based on the inode-offset tuple, all
instances of the program are instrumented -- user then has the
advantage of choosing what instances of the application he'd like
- Probes can be inserted on applications residing on read-only mounts,
since the text pages are discarded post execution
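To make the inode-offset tuple concrete, here is a minimal user-space sketch
of the arithmetic between a probed virtual address and its file offset. The
struct and function names are hypothetical and not taken from the patches,
and it assumes the text is mapped contiguously from a known file offset.

```c
#include <stdint.h>

/* Hypothetical description of one text mapping; not a kernel structure. */
struct text_mapping {
	uintptr_t vm_start;	/* virtual address where the mapping begins */
	uint64_t  file_off;	/* file offset that backs vm_start */
};

/* File offset of a probed virtual address within this mapping. */
static uint64_t vaddr_to_offset(const struct text_mapping *m, uintptr_t vaddr)
{
	return m->file_off + (vaddr - m->vm_start);
}

/* Virtual address at which a given file offset appears in this mapping. */
static uintptr_t offset_to_vaddr(const struct text_mapping *m, uint64_t off)
{
	return m->vm_start + (uintptr_t)(off - m->file_off);
}
```

If the executable's text were mapped at 0x08048000 from file offset 0 (an
assumption for illustration), address 0x080484d4 would correspond to offset
0x4d4, and the same offset in a mapping at another base would resolve to a
different address in that process.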
E. The details
At the basic level, similar to kprobes, a breakpoint instruction or
watchpoint is inserted at the instrumentation location and handlers are
run when the breakpoint/watchpoint is hit.
In order to be able to insert probes on pages that aren't in memory
during registration, the readpage(s) hooks of struct
address_space_operations are modified for the inode in question so as to
be able to first insert the probes onto the page at the time it is read
into memory. This mechanism adds some overhead, but is restricted to the
probed binaries only.
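The hook replacement described above can be sketched in user space roughly
as follows. This is only a model under stated assumptions: page_model,
aops_model and every function here are hypothetical stand-ins rather than
the kernel's definitions, a single probe offset is tracked, and 0xCC (the
x86 int3 opcode) is used as the example breakpoint.

```c
#include <string.h>

#define PAGE_SZ		4096
#define BREAKPOINT	0xCC	/* x86 int3, used as the example breakpoint */

/* Hypothetical stand-ins for the kernel objects; not the real definitions. */
struct page_model { unsigned long index; unsigned char data[PAGE_SZ]; };
struct aops_model { int (*readpage)(struct page_model *pg); };

static int (*orig_readpage)(struct page_model *pg);	/* saved original hook */
static unsigned long probe_off = 0x4d4;			/* one probed file offset */
static unsigned char saved_insn;			/* original byte at probe_off */

/* Stand-in for the filesystem's readpage: fills the page with NOPs (0x90). */
static int fs_readpage(struct page_model *pg)
{
	memset(pg->data, 0x90, PAGE_SZ);
	return 0;
}

/* Wrapper: read the page first, then plant the breakpoint if it lies in it. */
static int uprobe_readpage(struct page_model *pg)
{
	int ret = orig_readpage(pg);

	if (ret == 0 && probe_off / PAGE_SZ == pg->index) {
		saved_insn = pg->data[probe_off % PAGE_SZ];
		pg->data[probe_off % PAGE_SZ] = BREAKPOINT;
	}
	return ret;
}

/* Swap the hook for the probed inode only; other inodes are untouched. */
static void hook_readpage(struct aops_model *ops)
{
	orig_readpage = ops->readpage;
	ops->readpage = uprobe_readpage;
}
```

Only the operations table of the probed inode is redirected, which is why
the extra work is paid by the probed binaries alone.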
The instrumented binary should not be allowed to change for the duration
of the instrumentation. This is achieved by decrementing the
inode->i_writecount of the instrumented binary, so that attempts to open
it for writing fail for the entire instrumentation duration.
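As a rough illustration of the counter semantics, the model below follows
how the kernel's get_write_access() and deny_write_access() helpers behave
as far as I understand them: a positive count means active writers, a
negative count means writes are denied. Whether the patches call those
helpers or adjust the counter directly is an assumption, and all names
below are hypothetical.

```c
#include <errno.h>

/* Hypothetical model of inode->i_writecount: >0 writers, <0 deniers. */
struct inode_model { int i_writecount; };

/* A writer may open the file only while nobody denies writes. */
static int model_get_write_access(struct inode_model *inode)
{
	if (inode->i_writecount < 0)
		return -ETXTBSY;
	inode->i_writecount++;
	return 0;
}

/* Probe registration denies writes by decrementing the count. */
static int model_deny_write_access(struct inode_model *inode)
{
	if (inode->i_writecount > 0)
		return -ETXTBSY;
	inode->i_writecount--;
	return 0;
}

/* Probe removal undoes the decrement. */
static void model_allow_write_access(struct inode_model *inode)
{
	inode->i_writecount++;
}
```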
When the breakpoint is hit, similar to a kprobe, its associated pre_handler
is invoked. The original instruction is then single-stepped out-of-line,
so that the breakpoint stays in place and probe hits on other CPUs (SMP)
are not missed. Single-stepping out-of-line requires us to find an unused
area in the process address space to which we can copy the probed
instruction.
- The application stack is checked to see if there is sufficient space
for the instruction copy. If so, the instruction is copied to the bottom
of the page. Some architectures have stack pages with no-exec set.
In such cases, the no-exec bit for the corresponding stack page is
- If there is insufficient space on stack, the vma is expanded beyond
the current stack's vma and that is used for the single-stepping.
- In cases where the vma can't be extended (the process has exhausted
all its virtual address space), we resort to single-stepping inline by
replacing the original instruction back at the probed location.
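The fallback order in the list above can be summarised as a small decision
function. This is a sketch only; the enum, the struct and its fields are
hypothetical names introduced for illustration and do not appear in the
patches.

```c
#include <stddef.h>

enum sstep_mode { SSTEP_ON_STACK, SSTEP_EXPANDED_VMA, SSTEP_INLINE };

/* Hypothetical summary of the state the decision needs. */
struct sstep_ctx {
	size_t stack_free;	/* unused bytes on the current stack page */
	size_t insn_size;	/* bytes needed for the instruction copy */
	int    can_expand_vma;	/* nonzero if the stack vma can still grow */
};

/* Pick where the probed instruction will be single-stepped. */
static enum sstep_mode choose_sstep_mode(const struct sstep_ctx *c)
{
	if (c->stack_free >= c->insn_size)
		return SSTEP_ON_STACK;		/* copy fits on the stack page */
	if (c->can_expand_vma)
		return SSTEP_EXPANDED_VMA;	/* grow past the stack's vma */
	return SSTEP_INLINE;			/* last resort: step in place */
}
```

Inline stepping is kept as the last resort because it restores the original
instruction at the probed location for the duration of the step.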
F. Known issues/flaws:
- Currently, applications that access the page-cache directly for I/O
will see the breakpoint instruction in text. Similar is the case of
text pages that are mmap'ed private.
- Arjan pointed out that tripwire-like tools can clearly detect the text
- There is a way to fix these, albeit not too elegant.
- Modify the file_read_actor() to check if the read is
for a probed application and remove the breakpoints on
the copied image. This solution has been prototyped and
is known to work.
- There is a possibility that probes on an executable mmap'ed shared could
be written back to disk. The simplest solution is to disallow probes on
shared mmap objects.
- Instrumentation data that can be gathered is limited to pages resident
in memory when the probepoint is hit. A jprobe-like approach can be
used to collect the data from pages that are not present in
memory when the probepoint is hit.
- The instrumentation handler runs in kernel context. As Arjan pointed
out in one of the earlier discussion threads on this topic, running a
handler in user-space makes better debug information available.
- A jprobe like approach has been prototyped using a system-call
interface. This provides for executing the instrumentation
code in the process context in userspace. Clearly this has
- We have to take a debug trap to return back to the
"normal" process context from the instrumentation context.
- The instrumentation code must be made part of the
address spaces of the processes that map the same
- Probes on text that are mapped at different addresses by different
processes need special handling. This could be solved by tracking
vmas that map the same text pages and insert probes on them.
- Coexistence with debuggers is another issue. The simplest solution is
to fail registration of a breakpoint if one is already existing at the
location to be instrumented.
- Due to the system-wide approach to instrumentation, all processes
running the same executable end up having to pay the penalty of taking
the debug trap. Finer-grained controls can be provided to minimize
overhead, possibly by filtering events based on the pids of the processes
we are interested in.
- It's been suggested that writing a kernel module to gather user-space
data isn't a great idea. However, with tools like systemtap, it is
possible for application programmers and system admins to just script
and gather data.
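The pid-based filtering mentioned among the known issues could look roughly
like the sketch below. It is not part of the posted patches; pid_filter and
pid_filter_match() are hypothetical names, and the fixed-size array is an
assumption made to keep the example short.

```c
#include <stddef.h>
#include <sys/types.h>

#define MAX_PIDS 8

/* Hypothetical per-probe filter; an empty list means "trace every process". */
struct pid_filter {
	pid_t  pids[MAX_PIDS];
	size_t count;
};

/* Returns nonzero if the handler should collect data for this pid. */
static int pid_filter_match(const struct pid_filter *f, pid_t pid)
{
	size_t i;

	if (f->count == 0)
		return 1;
	for (i = 0; i < f->count; i++)
		if (f->pids[i] == pid)
			return 1;
	return 0;
}
```

A filter of this kind only skips the handler work for uninteresting
processes; they would still take the debug trap itself.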
G. What alternative solutions were there?
As far as we know, there doesn't exist a system-wide, dynamic tracing
There are, of course, tools like ptrace(), that are suitable for per-process
instrumentation. But ptrace() has its own design/performance issues and
it's also well known that the ptrace approach won't scale well, especially
given the overhead of context switches and other issues with the current
A short writeup on other approaches tried is available here:
The belief is, on Linux there is space for both types of instrumentation
to coexist, and a need for both. Hence the proposal.
H. Open questions:
1. What if the text is writably mapped?
- Fail inserting probes on them.
2. What are the typical cases when an executable (library?) is mmap'ed
/* Allocate a uprobe structure */
struct uprobe p;
/* Define pre handler */
int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	/* ... collect useful data ... */
	return 0;
}
/* Define post handler */
void handler_post(struct kprobe *p, struct pt_regs *regs,
		  unsigned long flags)
{
	/* ... collect useful data ... */
}
/* Define fault handler */
int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
{
	/* ... release allocated resources & try to recover ... */
	return 0;
}
Before inserting the probe, specify the pathname of the application
on which the probe is to be inserted.
/* Pointer to the pathname of the application */
p.pathname = "/home/prasanna/bin/myapp";
p.kp.pre_handler = handler_pre;
p.kp.post_handler = handler_post;
p.kp.fault_handler = handler_fault;
/* Specify the probe address */
/* $nm appln |grep func1 */
p.kp.addr = (kprobe_opcode_t *)0x080484d4;
/* Specify the offset within the application/executable */
p.offset = (unsigned long)0x4d4;
/* Now register the userspace probe */
ret = register_uprobe(&p);
if (ret)
	printk("register_uprobe: unsuccessful ret= %d\n", ret);
/* To unregister the registered probe, just call.. */
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]