Earthquake is a programmable fuzzy scheduler for testing real implementations of distributed systems (such as ZooKeeper).
Blog: http://osrg.github.io/earthquake/
Earthquake permutes C/Java function calls, Ethernet packets, filesystem events, and injected faults in various orders to find implementation-level bugs in the distributed system.
Earthquake can also control the non-determinism of thread interleaving (by calling sched_setattr(2) with randomized parameters).
So Earthquake can also be used for testing standalone multi-threaded software.
By default, Earthquake permutes events in a random order, but you can write your own state exploration policy (in Go) to find deep bugs more efficiently.
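To illustrate what "permuting events in a random order" means, here is a minimal standalone sketch. It does not use the Earthquake API: the `permute` helper and the event strings are hypothetical. It only shows the idea of shuffling a set of pending events with a seeded random source, so that a given seed always produces the same order.

```go
package main

import (
	"fmt"
	"math/rand"
)

// permute returns a shuffled copy of events and leaves the input untouched.
// The same seed always yields the same order, which helps when replaying
// a schedule that exposed a bug.
// NOTE: hypothetical helper for illustration, not part of Earthquake.
func permute(events []string, seed int64) []string {
	out := make([]string, len(events))
	copy(out, events)
	rng := rand.New(rand.NewSource(seed))
	rng.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

func main() {
	// Made-up events standing in for packets, filesystem events, and faults.
	events := []string{"packet A->B", "packet B->A", "fsync", "kill node C"}
	for seed := int64(0); seed < 3; seed++ {
		fmt.Println("seed", seed, "order", permute(events, seed))
	}
}
```

A custom exploration policy would replace the blind shuffle with a smarter choice of which event to release next.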
- ZooKeeper:
  - Found ZOOKEEPER-2212 (race): blog article (repro code)
  - Reproduced ZOOKEEPER-2080 (race): blog article (repro code)
- etcd:
  - Found a bug in the etcd command-line client (etcdctl), #3517 (timing specification), fixed in #3530: (repro code). The fix also provided a hint for #3611.
  - Reproduced flaky tests {#4006, #4039} (repro instruction)
- YARN:
  - Found YARN-4301 (fault tolerance): (repro code)
  - Reproduced flaky tests YARN-{1978, 4168, 4543, 4548, 4556} (repro instruction)
The following instructions show how to start Earthquake Container, the simplified CLI for Earthquake.
$ sudo apt-get install libzmq3-dev libnetfilter-queue-dev
$ go get github.com/osrg/earthquake/earthquake-container
$ sudo earthquake-container run -it --rm ubuntu bash
In Earthquake Container, you can run any command that might be flaky. JUnit tests are interesting to try.
earthquake-container$ git clone something
earthquake-container$ cd something
earthquake-container$ for f in $(seq 1 1000);do mvn test; done
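The shell loop above discards the results of the individual runs. If you want a failure count, a small driver like the following sketch could be run inside the container instead. It is not part of Earthquake; the `countFailures` helper and the fixed run count are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// countFailures calls run n times and returns how many calls reported
// failure (run returns true on success).
// NOTE: hypothetical helper for illustration, not part of Earthquake.
func countFailures(n int, run func() bool) int {
	failures := 0
	for i := 0; i < n; i++ {
		if !run() {
			failures++
		}
	}
	return failures
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: repeat COMMAND [ARGS...]")
		return
	}
	// Run the given command once, forwarding its output; a non-zero exit
	// status (or a failure to start) counts as a failed run.
	run := func() bool {
		cmd := exec.Command(os.Args[1], os.Args[2:]...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		return cmd.Run() == nil
	}
	const n = 1000
	fmt.Printf("%d of %d runs failed\n", countFailures(n, run), n)
}
```

For example, building this as `repeat` and invoking `repeat mvn test` would rerun the test suite and print how often it failed.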
You can also specify a config file (the -eq-config option for earthquake-container).
A typical configuration file (config.toml) is as follows:
# Policy for observing events and yielding actions
# You can also implement your own policy.
# Default: "random"
explorePolicy = "random"
[explorePolicyParam]
# For Ethernet/Filesystem/Java inspectors, events are non-deterministically delayed.
# minInterval and maxInterval are bounds for the non-deterministic delays
# Default: 0 and 0
minInterval = "80ms"
maxInterval = "3000ms"
[containerParam]
# Default: false
enableEthernetInspector = true
# Default: true
enableProcInspector = true
# Default: "1s"
procWatchInterval = "1s"
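To make the role of minInterval and maxInterval concrete, the following sketch parses the two bounds and draws a delay between them. It is an illustration only: the `delayer` type is hypothetical, and the uniform distribution is an assumption, since the configuration comments only say that the delays are bounded and non-deterministic.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// delayer draws delays between two bounds.
// NOTE: hypothetical helper for illustration, not part of Earthquake.
type delayer struct {
	lo, hi time.Duration
	rng    *rand.Rand
}

// newDelayer parses bounds written like the minInterval and maxInterval
// values in config.toml (for example "80ms" and "3000ms").
func newDelayer(seed int64, minInterval, maxInterval string) (*delayer, error) {
	lo, err := time.ParseDuration(minInterval)
	if err != nil {
		return nil, err
	}
	hi, err := time.ParseDuration(maxInterval)
	if err != nil {
		return nil, err
	}
	if lo < 0 || hi < lo {
		return nil, fmt.Errorf("invalid bounds: min=%v max=%v", lo, hi)
	}
	return &delayer{lo: lo, hi: hi, rng: rand.New(rand.NewSource(seed))}, nil
}

// next returns a delay drawn uniformly from [lo, hi] (assumed distribution).
func (d *delayer) next() time.Duration {
	return d.lo + time.Duration(d.rng.Int63n(int64(d.hi-d.lo)+1))
}

func main() {
	d, err := newDelayer(1, "80ms", "3000ms")
	if err != nil {
		panic(err)
	}
	for i := 0; i < 5; i++ {
		fmt.Println("delay event by", d.next())
	}
}
```

With the default bounds of 0 and 0, such a delayer would always return a zero delay.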
If you don't want to use containers, you can also use Earthquake with an arbitrary process tree.
$ go get github.com/osrg/earthquake/earthquake
$ sudo earthquake inspectors proc -root-pid $TARGET_PID -watch-interval 1s -autopilot config.toml
For the full-stack (fully distributed) Earthquake environment, please refer to doc/how-to-setup-env-full.md.
The slides for the presentation at FOSDEM might also be helpful.
- FOSDEM (January 30-31, 2016, Brussels)
- The poster session of ACM Symposium on Cloud Computing (SoCC) (August 27-29, 2015, Hawaii)
We welcome your contributions to Earthquake. Please feel free to send pull requests on GitHub!
Copyright (C) 2015 Nippon Telegraph and Telephone Corporation.
Released under Apache License 2.0.
// implements earthquake/explorepolicy/ExplorePolicy interface
type MyPolicy struct {
	actionCh chan Action
}

func (p *MyPolicy) GetNextActionChan() chan Action {
	return p.actionCh
}

func (p *MyPolicy) QueueNextEvent(event Event) {
	// Possible events:
	//  - JavaFunctionEvent (byteman)
	//  - PacketEvent (Netfilter, Openflow)
	//  - FilesystemEvent (FUSE)
	//  - ProcSetEvent (Linux procfs)
	//  - LogEvent (syslog)
	fmt.Printf("Event: %s\n", event)
	// You can also inject fault actions
	//  - PacketFaultAction
	//  - FilesystemFaultAction
	//  - ProcSetSchedAction
	//  - ShellAction
	action, err := event.DefaultAction()
	if err != nil {
		panic(err)
	}
	// send in a goroutine so as to make the function non-blocking.
	// (Note that earthquake/util/queue/TimeBoundedQueue provides
	// better semantics and determinism, this is just an example.)
	go func() {
		fmt.Printf("Action ready: %s\n", action)
		p.actionCh <- action
		fmt.Printf("Action passed: %s\n", action)
	}()
}

func NewMyPolicy() ExplorePolicy {
	return &MyPolicy{actionCh: make(chan Action)}
}

func main() {
	RegisterPolicy("mypolicy", NewMyPolicy)
	os.Exit(CLIMain(os.Args))
}
Please refer to example/template for further information.
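The policy above depends on the Earthquake packages, so it cannot be run on its own. The following standalone sketch mimics its shape so that you can experiment with the channel pattern in isolation. Everything in it is an assumption made for illustration: `Event` and `Action` are simplified stand-in types rather than the Earthquake ones, and `delayPolicy` is a hypothetical policy that releases each action after a random delay.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// Event and Action are simplified stand-ins for illustration only;
// they are NOT the Earthquake types.
type Event string
type Action string

// delayPolicy (hypothetical) emits one action per event after a random
// delay of at most maxDelay, so actions may come out in a different order
// than the events went in.
type delayPolicy struct {
	mu       sync.Mutex // guards rng, which is not safe for concurrent use
	rng      *rand.Rand
	maxDelay time.Duration
	actionCh chan Action
}

func newDelayPolicy(seed int64, maxDelayMillis int64) *delayPolicy {
	return &delayPolicy{
		rng:      rand.New(rand.NewSource(seed)),
		maxDelay: time.Duration(maxDelayMillis) * time.Millisecond,
		actionCh: make(chan Action),
	}
}

func (p *delayPolicy) GetNextActionChan() chan Action {
	return p.actionCh
}

// QueueNextEvent returns immediately; the action is sent from a goroutine
// once the randomly chosen delay has elapsed.
func (p *delayPolicy) QueueNextEvent(ev Event) {
	p.mu.Lock()
	d := time.Duration(p.rng.Int63n(int64(p.maxDelay) + 1))
	p.mu.Unlock()
	go func() {
		time.Sleep(d)
		p.actionCh <- Action("accept " + string(ev))
	}()
}

func main() {
	p := newDelayPolicy(1, 50)
	events := []Event{"e1", "e2", "e3", "e4"}
	for _, ev := range events {
		p.QueueNextEvent(ev)
	}
	// Receive exactly one action per queued event.
	for range events {
		fmt.Println(<-p.GetNextActionChan())
	}
}
```

The order of the printed actions depends on the delays and on goroutine scheduling, so it can differ from the order in which the events were queued.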
After running Earthquake (process inspector) many times, sched_setattr(2) can fail with EBUSY.
This seems to be a kernel bug; we're looking into it.