Pachyderm

News

We are in the midst of a refactor! See the release branch for the current, working release of Pachyderm.

Check out our Docker volume driver: https://github.com/pachyderm/pachyderm/tree/master/src/cmd/pfs-volume-driver

WE'RE HIRING! Love Docker, Go and distributed systems? Learn more about our team and email us at jobs@pachyderm.io.

What is Pachyderm?

Pachyderm is a complete data analytics solution that lets you efficiently store and analyze your data using containers. We offer the scalability and broad functionality of Hadoop, with the ease of use of Docker.

Key Features

  • Complete version control for your data
  • Pipelines are containerized, so you can use any languages and tools you want
  • Both batched and streaming analytics
  • One-click deploy on AWS without data migration

Is Pachyderm enterprise production ready?

No, Pachyderm is in beta, but can already solve some very meaningful data analytics problems. We'd love your help. :)

What is a commit-based file system?

Pfs (the Pachyderm File System) is implemented as a distributed layer on top of btrfs, the same copy-on-write file system that powers Docker. Btrfs already offers git-like semantics on a single machine; pfs scales these out to an entire cluster. This allows features such as:

  • Commit-based history: File systems are generally single-state entities. Pfs, on the other hand, provides a rich history of every previous state of your cluster. You can always revert to a prior commit in the event of a disaster.
  • Branching: Thanks to btrfs's copy-on-write semantics, branching is ridiculously cheap in pfs. Each user can experiment freely in their own branch without impacting anyone else or the underlying data. Branches can easily be merged back into the main cluster.
  • Cloning: Btrfs's send/receive functionality allows pfs to efficiently copy an entire cluster's worth of data while still maintaining its commit history.
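
The single-machine btrfs primitives behind these features can be sketched with the stock btrfs command-line tool. This only illustrates the underlying idea and is not how pfs itself is implemented: the helper names (run, commit, branch, clone) and the DRY_RUN switch are hypothetical, and running the commands for real requires root and a btrfs file system.

```shell
#!/bin/sh
# Illustrative helpers only; the names and DRY_RUN are assumptions, not part of pfs.

# With DRY_RUN=1, print the command instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# "Commit": a read-only snapshot freezes the current state of a subvolume.
commit() {
  run btrfs subvolume snapshot -r "$1" "$2"
}

# "Branch": a writable snapshot of a commit. Copy-on-write means blocks
# are shared with the commit until the branch modifies them.
branch() {
  run btrfs subvolume snapshot "$1" "$2"
}

# "Clone": stream a read-only snapshot to another btrfs file system.
clone() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "btrfs send $1 | btrfs receive $2"
  else
    btrfs send "$1" | btrfs receive "$2"
  fi
}
```

For example, DRY_RUN=1 commit /data /commits/c1 prints the snapshot command without touching the file system.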

What are containerized analytics?

Rather than thinking in terms of map or reduce jobs, pps (the Pachyderm Pipeline System) thinks in terms of pipelines expressed within a container. A pipeline is a generic way of expressing computation over large datasets, and it is containerized to make it portable, isolated, and easy to monitor. In Pachyderm, all analysis runs in containers. You can write pipelines in any language you want and include any libraries.

Development

We're hiring! If you like ambitious distributed systems problems and think there should be a better alternative to Hadoop, please reach out. Email jobs@pachyderm.io.

Running

You need to install docker-compose for the Makefile commands to work.

curl -L https://github.com/docker/compose/releases/download/1.4.0rc2/docker-compose-$(uname -s)-$(uname -m) > /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose

You need Go 1.5 installed, with the environment variable GO15VENDOREXPERIMENT=1 set.
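
A minimal way to satisfy this for the current shell session might look as follows. Where you persist the variable (for example, in your shell profile) is up to you, and the check for go on the PATH is just a convenience added here.

```shell
#!/bin/sh
# Enable the Go 1.5 vendor experiment for this shell session.
export GO15VENDOREXPERIMENT=1

# Print the installed Go version so you can confirm it is 1.5.
if command -v go >/dev/null 2>&1; then
  go version
else
  echo "go not found on PATH"
fi
```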

Useful development commands can be seen in the Makefile. Key commands:

make test-deps # download all golang dependencies
make build # build the source code (does not build the tests)
make test # run all the tests
make clean # clean up all pachyderm state
RUNARGS="go test -test.v ./..." make run # equivalent to TESTFLAGS=-test.v make test
make launch-pfsd # launch the new pfsd daemon
make install # install all binaries locally
pfs # runs the new pfs CLI if ${GOPATH}/bin is on your PATH; very experimental, does not check for common errors

Development Notes

Logs

We're using protolog for logging. All new log events should be wrapped in a protobuf message. A package that has log messages should have a proto file named protolog.proto in it. See src/pps/run/protolog.proto and src/pps/run/runner.go for an example.

Environment Setup

With golang, it's generally easiest to have your fork match the import paths in the code. We recommend you do it like this:

# assuming your github username is alice
rm -rf ${GOPATH}/src/github.com/pachyderm/pachyderm
mkdir -p ${GOPATH}/src/github.com/pachyderm
cd ${GOPATH}/src/github.com/pachyderm
git clone https://github.com/alice/pachyderm.git
cd pachyderm # git remote must be run inside the cloned repository
git remote add upstream https://github.com/pachyderm/pachyderm.git # so you can run 'git fetch upstream' to get upstream changes

The Vagrantfile in this repository will set up a development environment for Pachyderm that has all dependencies installed.

The easiest way to install Vagrant on a Mac is probably:

brew install caskroom/cask/brew-cask
brew cask install virtualbox vagrant

Basic usage:

mkdir -p pachyderm_vagrant
cd pachyderm_vagrant
mkdir -p etc/initdev
curl https://raw.githubusercontent.com/pachyderm/pachyderm/master/Vagrantfile > Vagrantfile
curl https://raw.githubusercontent.com/pachyderm/pachyderm/master/etc/initdev/init.sh > etc/initdev/init.sh
vagrant up # starts the vagrant box
vagrant ssh # ssh into the vagrant box

Once in the vagrant box, set everything up and verify that it works:

go get github.com/pachyderm/pachyderm/...
cd ~/go/src/github.com/pachyderm/pachyderm
make test

Some other useful vagrant commands:

vagrant suspend # suspends the vagrant box; useful if you are not actively developing and want to free up resources
vagrant resume # resumes a suspended vagrant box
vagrant destroy # destroys the vagrant box and everything on it, so be careful

See Vagrant's website for more details.

Common Problems

Problem: Nothing is running after launch.

  • Check to make sure the docker daemon is running with ps -ef | grep docker.
  • Check to see if the container exited with docker ps -a | grep IMAGE_NAME.
  • Check the container logs with docker logs CONTAINER_ID.
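
The three checks above can be chained into one helper. This is a rough sketch: the function name diagnose is hypothetical, IMAGE_NAME stays a placeholder for whichever image you launched, and the script assumes the standard ps, grep, awk, and head tools are available.

```shell
#!/bin/sh
# Hypothetical helper that runs the three checks in order.
# Usage: diagnose IMAGE_NAME
diagnose() {
  image="$1"

  # 1. Is the docker daemon running? ([d]ocker keeps grep from matching itself.)
  if ! ps -ef 2>/dev/null | grep -q '[d]ocker'; then
    echo "docker daemon does not appear to be running"
    return 1
  fi

  # 2. Look for a container from this image, including exited ones,
  #    and take the ID of the first match.
  id=$(docker ps -a 2>/dev/null | grep "$image" | awk '{print $1}' | head -n 1)
  if [ -z "$id" ]; then
    echo "no container found for image: $image"
    return 1
  fi

  # 3. Show the logs of the matching container.
  echo "logs for container $id:"
  docker logs "$id"
}
```

Call it as diagnose IMAGE_NAME; it stops at the first check that fails and says which one.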

Problem: Docker commands are failing with permission denied.

The bin scripts assume your user is in the docker group, as explained in the Docker Ubuntu installation docs. If this is set up properly, you do not need sudo to run docker. If you would rather use sudo for Docker development, wrap each command like so:

sudo -E bash -c 'bin/run go test ./...' # original command would have been `./bin/run go test ./...`
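
To see which case applies to you, a quick check like the one below may help. It is a sketch: the function name docker_group_status is hypothetical, it only reports group membership, and the usermod command it suggests is the usual approach from the Docker docs rather than anything specific to this repository.

```shell
#!/bin/sh
# Hypothetical helper: report whether the current user is in the docker group.
docker_group_status() {
  if id -nG | grep -qw docker; then
    echo "in docker group: sudo is not needed"
  else
    # A new group membership only takes effect after logging out and back in.
    echo "not in docker group; to add: sudo usermod -aG docker $(id -un)"
  fi
}

docker_group_status
```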

Contributing

To get started, sign the Contributor License Agreement.

Send us PRs! We would love to see what you do.
