
tv42/pachyderm

 
 


Pachyderm


News

WE'RE HIRING! Love Docker, Go and distributed systems? Learn more about our team and email us at jobs@pachyderm.io.

Getting Started

Get up and running with the grep example.

What is Pachyderm?

Pachyderm is a Data Lake: a place to dump and process gigantic data sets. Pachyderm is inspired by the Hadoop ecosystem but shares no code with it. Instead, we leverage the container ecosystem to provide the broad functionality of Hadoop with the ease of use of Docker.

Pachyderm offers the following broad functionality:

  • Virtually limitless storage for any data.
  • Virtually limitless processing power using any tools.
  • Tracking of data history, provenance and ownership. (Version Control)
  • Automatic processing of new data as it's ingested. (Streaming)
  • Chaining processes together. (Pipelining)

What's new about Pachyderm? (How is it different from Hadoop?)

There are two bold new ideas in Pachyderm:

  • Containers as the processing primitive
  • Version Control for data

These ideas lead directly to a system that's much easier to use and administer.

To process data you simply create a containerized program that reads and writes to the local filesystem. Pachyderm takes your container and injects data into it by way of a FUSE volume. You can use any tools you want! Pachyderm automatically replicates your container, creating multiple copies of the same container and showing each one a different chunk of data in the FUSE volume. With this technique Pachyderm can scale any code you write up to petabytes of data.

Pachyderm also version-controls all data using a commit-based distributed filesystem (PFS), much like what git does with code. Version control has far-reaching consequences in a distributed filesystem: you get the full history of your data, it's much easier to collaborate with teammates, and if anything goes wrong you can revert the entire cluster with one click!
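To make the commit model concrete, here is a toy in-memory sketch of the idea (an illustration only, not the real PFS API or data structures): each commit stores just the files that changed, reads walk the parent chain, and old commits remain readable forever, which is what makes history and one-click reverts possible.

```go
package main

import "fmt"

// Commit is a hypothetical layered snapshot: it holds only the files
// that changed relative to its parent, like a git commit's tree delta.
type Commit struct {
	parent *Commit
	files  map[string][]byte
}

// NewCommit layers a set of changed files on top of parent.
func NewCommit(parent *Commit, changes map[string][]byte) *Commit {
	return &Commit{parent: parent, files: changes}
}

// Read resolves path at this commit by walking back through ancestors
// until some commit recorded the file.
func (c *Commit) Read(path string) ([]byte, bool) {
	for cur := c; cur != nil; cur = cur.parent {
		if b, ok := cur.files[path]; ok {
			return b, true
		}
	}
	return nil, false
}

func main() {
	c1 := NewCommit(nil, map[string][]byte{"data.csv": []byte("v1")})
	c2 := NewCommit(c1, map[string][]byte{"data.csv": []byte("v2")})
	old, _ := c1.Read("data.csv")
	cur, _ := c2.Read("data.csv")
	fmt.Println(string(old), string(cur)) // v1 v2
}
```

Reverting is just pointing back at an earlier commit; nothing is ever destroyed by a new write.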

Version control is highly synergistic with our containerized processing engine. Pachyderm understands how your data changes and thus, as new data is ingested, can run your workload on the diff of the data rather than the whole thing. This means there's no difference between a batch job and a streaming job: the same code works for both!
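The batch/streaming equivalence can be sketched in a few lines (again an illustration, not Pachyderm's internal algorithm): a full batch run is just a diff against the empty snapshot, while an incremental run diffs against the previous one, and the workload sees only the changed files either way.

```go
package main

import "fmt"

// diff returns the files in cur that are new or changed relative to old.
// A nil old map represents the empty snapshot, i.e. a full batch run.
func diff(old, cur map[string]string) map[string]string {
	changed := map[string]string{}
	for path, content := range cur {
		if prev, ok := old[path]; !ok || prev != content {
			changed[path] = content
		}
	}
	return changed
}

func main() {
	snap1 := map[string]string{"a.txt": "1", "b.txt": "2"}
	snap2 := map[string]string{"a.txt": "1", "b.txt": "3", "c.txt": "4"}
	// Same function for both modes: only the baseline snapshot differs.
	fmt.Println("batch files:", len(diff(nil, snap2)))        // batch files: 3
	fmt.Println("incremental files:", len(diff(snap1, snap2))) // incremental files: 2
}
```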

Our Vision

Containers are a revolutionary new technology with compelling applications to big data. Our goal is to fully realize those applications. Hadoop has spawned a sprawling ecosystem of tools, but with each new tool the complexity of your cluster grows until maintaining it becomes a full-time job. Containers are the perfect antidote to this problem. What if adding a new tool to your data infrastructure were as easy as installing an app? Thanks to the magic of containers in Pachyderm, it really is.

The most exciting thing about this vision, though, is what comes next. Pachyderm can do big data with anything that runs on Linux, and anything you build can easily be shared with the rest of the community; after all, it's just a container. We have some ideas of our own about what to build, but they're just the tip of the iceberg; we expect our users will have many more interesting ideas. We can't wait to see what they are!

Contributing

Deploying Pachyderm.

To get started, sign the Contributor License Agreement.

Send us PRs, we would love to see what you do!
