RR: Engineering Record and Replay for Deployability



  • Topic: partial record and replay debugging with rr

  • Debugging nondeterminism

    • Nondeterministic bugs change the system's output from run to run

    • Difficult to debug

  • Deterministic hardware

  • Sources of nondeterminism

    • Record inputs

    • Replay execution

  • Old idea

    • Nirvana, PinPlay, ReVirt, Jockey, ReSpec, Chronomancer, PANDA, Echo, FlashBack, ...

RR goals

  1. Easy to deploy: stock hardware (i.e. not customized), commodity OS, no kernel changes

  2. Low overhead

  3. Works on Firefox

  4. Small investment

RR design

Idea: for user-space processes running on Linux, record all inputs to those processes (system call results, signals); replaying those inputs yields the same process execution, so you can replay and debug the run

  • No code instrumentation

  • Use modern HW/OS features

      • Linux API: ptrace

    • Data races: multiple CPUs running at the same time; one reads while another writes, which can lead to nondeterministic results

      • Shared-memory data accesses: limit execution to a single core, manage context switches

    • Asynchronous event timing: HW performance counters

      • Signals must land at the right program state during replay

      • Idea: count retired conditional branches while recording; during replay, deliver the signal once the counter reaches the recorded count

      • Counting is done in hardware, so no code instrumentation
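A minimal sketch of the counter mechanism, using the portable PERF_COUNT_HW_BRANCH_INSTRUCTIONS event as a stand-in for the retired-conditional-branches counter rr actually programs (the function name and workload loop are illustrative):

```c
/* Sketch: count hardware branch events over a region of code, the same
 * mechanism rr uses to pinpoint where an asynchronous signal must be
 * delivered during replay. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Returns the branch count for a small loop, or -1 if the PMU is
 * unavailable (e.g. inside a VM or with strict perf_event_paranoid). */
int64_t count_branches(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;   /* user-space only: allowed unprivileged */
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0 /* self */, -1 /* any cpu */, -1, 0);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile int sink = 0;
    for (int i = 0; i < 100000; i++)   /* branchy work to count */
        if (i & 1) sink++;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    int64_t count = 0;
    if (read(fd, &count, sizeof count) != sizeof count)
        count = -1;
    close(fd);
    return count;
}
```

During recording rr notes the counter value at the moment a signal arrives; during replay it programs a counter interrupt at that same value, so the signal lands at the identical instruction.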

    • Trap on a subset of system calls: seccomp-bpf

      • Plain ptrace costs two traps and four context switches per system call, making traced system calls expensive

      • Shim library: loaded into the process being traced; part of both recording and replay; wraps the common system calls; after each call, records the result into a buffer (periodically flushed by the supervisor process)

        • Injects a BPF predicate into the kernel

      • What happens if the system call blocks?

        • Schedule another thread to run

        • DESCHED perf event

          • Fires every time a thread is taken off the core and put on the wait queue

          • The recorder is notified via this perf event

    • Other issues

      • RDTSC

      • RDRAND


      • CPUID
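For RDTSC specifically, Linux exposes a prctl knob that makes the instruction fault, which a recorder can then catch and emulate. A sketch (the knob is x86-only; the call fails cleanly elsewhere, and the function name is illustrative):

```c
/* Sketch: make RDTSC trap so a recorder could intercept and replay
 * timestamp reads.  PR_SET_TSC is an x86-only prctl. */
#include <sys/prctl.h>

/* Switch RDTSC to fault (SIGSEGV), read the mode back, then restore.
 * Returns the observed mode, or -1 where the knob is unsupported. */
int probe_tsc_trap(void) {
    if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV, 0, 0, 0))
        return -1;
    int mode = 0;
    prctl(PR_GET_TSC, &mode, 0, 0, 0);
    /* Restore normal RDTSC before returning. */
    prctl(PR_SET_TSC, PR_TSC_ENABLE, 0, 0, 0);
    return mode;
}
```

With the trap armed, any RDTSC raises SIGSEGV; a recorder's signal handler can record the timestamp during recording and inject the recorded value during replay.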


  • cp: recursive copy (benchmark workload)

  • Octane: JavaScript benchmark

  • HTMLTEST: Firefox running HTML unit tests

  • Sambatest

Also: reverse-execution debugging


  • Replay performance matters

  • Session-cloning performance matters (checkpointing the current system state)

    • Cloning processes via fork() seems cheaper than e.g. cloning VM state
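A toy illustration of why fork()-based cloning is cheap: a checkpoint is just a stopped, copy-on-write child, so later writes in the parent don't disturb the saved state. All names here are illustrative, not rr's API:

```c
/* Toy checkpoint via fork(): the child is a stopped copy-on-write
 * snapshot of the parent's memory. */
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

int g_state = 0;   /* stand-in for "the replayer's state" */

typedef struct { pid_t pid; } checkpoint_t;

/* Fork a child that freezes immediately; its memory image preserves
 * the state at the moment of the fork. */
checkpoint_t checkpoint(void) {
    checkpoint_t cp;
    cp.pid = fork();
    if (cp.pid == 0) {
        raise(SIGSTOP);               /* freeze until resumed */
        _exit(g_state & 0xff);        /* report the snapshotted state */
    }
    waitpid(cp.pid, NULL, WUNTRACED); /* wait for the child to stop */
    return cp;
}

/* Wake the checkpoint and return what it saw (its exit status). */
int resume_and_wait(checkpoint_t cp) {
    int status = 0;
    kill(cp.pid, SIGCONT);
    waitpid(cp.pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent can keep mutating g_state after the fork; the resumed checkpoint still sees the value from when it was taken, because the kernel copies pages lazily only on write.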

  • In-process system-call interception is fragile

    • Applications make syscalls in strange states (bad TLS, insufficient stack, etc.)

    • In-process interception code could be accidentally or maliciously subverted

    • Move this part into kernel?

  • OS design implications

    • Recording boundary should

      • Be stable, simple, documented API boundary

      • Also be a boundary for hardware performance counter measurement

  • ARM

    • Need hardware support to detect / compensate

    • Or binary rewriting

  • Related work

    • VM-level replay: heavyweight

    • Kernel-supported replay: hard to maintain

    • Pure user-space replay: instrumentation, higher overhead

    • Higher-level replay: more limited scope

    • Parallel replay: more limited scope, higher overhead

    • Hardware-supported parallel replay: nonexistent hardware


  • rr's approach delivers a lot of value

  • more research needed for multicore approaches

  • lots of unexplored applications of record+replay


  • one-thread-at-a-time scheduling: does the bug disappear?

  • virtual system calls?

    • rr patches these into normal system calls

  • application: undefined behavior?

    • fine; rr reproduces the exact execution during replay

  • how to deal with malloc/free nondeterminism?

    • recording: deterministic behavior

    • record the locations of memory maps; use MAP_FIXED to make sure that ...

  • applications that have randomizations?

    • exponential back-off?

    • random numbers come from some source that is recorded as input, so record and replay covers random-number generation

  • is it difficult to move traces between different machines?

    • trace format: pack the trace

    • CPU ID differences between machines

Presentation: Practical Record & Replay Debugging with rr

  • Non-determinism (debugging)

    • Tests fail randomly; you don't know why or how often

    • Different test configurations (Linux opt/PGO/debug/...)

    • An orange/red test often has nothing to do with the change

  • Deterministic hardware

    • External sources of non-determinism

  • Drawing the recording boundary in the middle: record inputs

    • Nondeterministic conditions

  • Replay execution

  • Old idea

    • ODR, PinPlay, ...

  • RR goals

    • Easy to deploy

    • Low overhead

    • Works on FF

    • Small investment (other work: binary instrumentation, OS kernel changes, hard to maintain and distribute)

  • Modern HW/OS features

    • Ptrace: one process monitors what is happening in the other (system-call tracing)

      • Tracer / tracee

      • Each system call costs context switches (overhead)

        • Even for cheap system calls, e.g. getpid, read
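A sketch of the tracer/tracee pattern with plain ptrace, counting the stops: every syscall the child makes costs an entry stop and an exit stop, hence the context-switch overhead the notes mention. Illustrative only, not rr's actual loop:

```c
/* Sketch of the ptrace tracer/tracee pattern.  Each syscall the tracee
 * makes costs the tracer a syscall-entry stop and a syscall-exit stop,
 * i.e. four context switches per call. */
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/* Fork a child, trace it, and count its syscall stops.
 * Returns the stop count, or -1 if ptrace is unavailable. */
int trace_child(void) {
    pid_t pid = fork();
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(111);          /* ptrace forbidden here */
        raise(SIGSTOP);          /* hand control to the tracer */
        getppid();               /* one cheap syscall to observe */
        _exit(0);
    }
    int status = 0, stops = 0;
    waitpid(pid, &status, 0);    /* initial SIGSTOP (or early exit) */
    if (WIFEXITED(status))
        return -1;
    for (;;) {
        /* Resume until the next syscall entry or exit. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) != 0)
            break;
        waitpid(pid, &status, 0);
        if (!WIFSTOPPED(status))
            break;               /* child exited */
        stops++;
    }
    return stops;
}
```

Even this tiny tracee triggers several stops (its getppid plus its exit), each one a round trip through the tracer; this is the cost rr's seccomp-bpf filter avoids for common syscalls.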

    • Shared memory data races --> limit to single core

    • Async event timing --> HW performance counters (retired conditional branch)

      • When the tracee gets to the point, software interrupt

      • Alternative: instrument the instruction stream at runtime

      • Or use a JIT

    • Trap on a subset of system calls

      • seccomp-bpf

        • Filtering lets allowed system calls proceed without trapping to the tracer

      • Conditions to be checked before context-switches

      • Recording in user-space

    • Sys call block

      • Watch the deschedule perf events and record them

    • Record all memory-map locations

      • Replay places the same memory maps at the same locations

    • Other issues

      • instructions that generate randomness in CPU

      • RDTSC: tell the kernel to trap the instruction

      • Back then: had to replay on the same CPU

      • Now: ptrace can report what the CPU is

  • Replay: can be fast (no context switches)

Another approach: cloning the whole VM state

  • Capture the evolution of the memory?

    • See what's changing

    • Doing it at the process level

    • No need to re-record, but keep track of the changes

  • GDB: go backward

    • During forward execution, fork() the replay at different points in time

    • To go backward: hit the breakpoint, then resume into one of the fork()s

  • Move this part into kernel

    • Painful: because they do all of this inside the process

    • Some of the recording phase could maybe be done in the kernel

      • Security

      • Faster

  • could create snapshots, but that's not what they're describing

Distributed system

  • Can we apply this kind of technique

  • Common bugs but hard to find

    • FlyMC (EuroSys '19), SAMC (OSDI '14)

    • Make the traces more interpretable?

    • Oathkeeper (OSDI 22)

    • etcd still gets constant flow of bug reports

  • Bottom line: many bugs reproducible by recreating external conditions (e.g., not OS thread timing dependent)

    • No memory race conditions, etc.

  • Now

    • Debugging distributed systems now

      • Collect per-machine logs

      • Virtually unify them

      • Guess root causes

    • A system that emits machine readable logs?

      • Logs --> reproduce the bugs

      • Reverse-execute the system from the bug location

        • Like in rr

  • Root cause analysis in distributed systems: prior work

    • DEMi (NSDI '16) [minimizes faulty executions of distributed systems]

    • FlyMC, SAMC

      • Collect per-node partial event orders

      • Use DPOR to recreate total order

    • I.e. rr needs exact recording, but this approach does not

    • State-of-the-art debugging tools for distributed systems? Limited to particular types of systems, not general distributed systems (you would need to define a general model for every distributed system)

      • Ray: log file of outputs, per process

  • Message contents

    • Annotate each log entry? Or figure it out from the message contents/payload?

    • Merge the log files? (where to merge is the problem)

    • Common goal of distributed systems: a common state machine; that requires an exact ordering of the log; what about network partitions? [data inconsistency --> data corruption]
