Mozilla Services Data Pipeline

This repository contains the extra bits and pieces needed to build heka for use in the Cloud Services Data Pipeline.

Visit us on irc.mozilla.org in #datapipeline.

Building a Data Pipeline RPM

Run bin/build_pipeline_heka.sh from the top level of this repo to build a heka RPM.

Using the Data Pipeline

If you just want to test some data analysis plugins and don't want to set up your own pipeline, this is the fastest way to get going: https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Using+the+sandbox+manager+in+the+prod+prototype+pipeline

Running/Testing Your Own Data Pipeline

You can set up a bare-bones data pipeline of your own. It gives you an endpoint that listens for HTTP POST requests, performs GeoIP lookups, and wraps each request in a protobuf message. These messages are relayed to a stream processor and written to a local store on disk. You also get basic web-based monitoring and the ability to add your own stream-processing filters.

Clone this data-pipeline GitHub repo

git clone https://github.com/mozilla-services/data-pipeline.git

Build and configure heka by sourcing the build script. If you are unable to build heka, drop by #datapipeline on irc.mozilla.org and we will try to provide you a pre-built version.

source bin/build_pipeline_heka.sh

Install Lua modules

mkdir lua_modules
rsync -av build/heka/build/heka/lib/luasandbox/modules/ lua_modules/

Procure a GeoLiteCity.dat file and put it in the current directory. The download is gzipped, so decompress it afterwards.

wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz
gunzip GeoLiteCity.dat.gz

Set up the main pipeline using the examples/basic_local_pipeline.toml config file. This will listen for HTTP POSTs on port 8080, log the raw and decoded requests to stdout, run the example filter, and output the records to a file.
```
build/heka/build/heka/bin/hekad -config examples/basic_local_pipeline.toml
```
Check the monitoring dashboard at http://localhost:4352

Fire off some test submissions!

for f in $(seq 1 20); do
  curl -X POST "http://localhost:8080/submit/test/$f/foo/bar/baz" -d "{\"test\":$f}"
done

Verify that your data was stored in the output files using the heka-cat utility

build/heka/build/heka/bin/heka-cat data_raw.out
build/heka/build/heka/bin/heka-cat data_decoded.out

Experiment with sandbox filters, outputs, and configurations.
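
If you script the steps above, it can help to confirm that hekad is up before sending submissions. The following is a minimal sketch, assuming the dashboard at http://localhost:4352 answers plain HTTP GET requests once hekad has started; the function name wait_for_dashboard and its retry defaults are made up for illustration and are not part of this repo.

```shell
# Hypothetical helper: poll the monitoring dashboard until it responds.
# Usage: wait_for_dashboard [url] [attempts]
wait_for_dashboard() {
  local url
  local attempts
  local i
  url="${1:-http://localhost:4352}"
  attempts="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: quiet, -f: treat HTTP error statuses as failures,
    # --max-time: give up on a single request after 2 seconds.
    if curl -sf --max-time 2 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  # The dashboard never answered within the allowed attempts.
  return 1
}
```

For example, `wait_for_dashboard && echo "hekad is up"` could be placed between starting hekad in the background and the curl loop.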

Useful things to know

GeoIP
- It’s not terribly interesting to do GeoIP lookups on 127.0.0.1, so you may want to pass a --header "X-Forwarded-For: 8.8.8.8" argument to your curl commands. That forces a GeoIP lookup on the specified IP address (Google’s DNS server in this example).
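
As a rough sketch, the test submission from the steps above can be combined with the X-Forwarded-For header like this. The helper names submit_url and submit_with_ip and the PIPELINE_URL variable are hypothetical; the port, path layout, and JSON body are taken from the example loop in this README.

```shell
# Base URL of the local endpoint (assumes the example config on port 8080).
PIPELINE_URL="${PIPELINE_URL:-http://localhost:8080}"

# Hypothetical helper: build the submission URL for a namespace and id.
# The trailing path segments mirror the test loop in this README.
submit_url() {
  local namespace
  local id
  namespace="$1"
  id="$2"
  printf '%s/submit/%s/%s/foo/bar/baz\n' "$PIPELINE_URL" "$namespace" "$id"
}

# Hypothetical helper: POST a small JSON body to the test namespace while
# presenting client_ip as the originating address. Prints the HTTP status code.
submit_with_ip() {
  local id
  local client_ip
  id="$1"
  client_ip="$2"
  curl -s -o /dev/null -w '%{http_code}\n' \
    -X POST "$(submit_url test "$id")" \
    --header "X-Forwarded-For: ${client_ip}" \
    -d "{\"test\":${id}}"
}
```

With hekad running, `submit_with_ip 1 8.8.8.8` sends one submission that should be geolocated using 8.8.8.8 rather than 127.0.0.1.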
How to configure namespaces
- The example config allows submissions to either /submit/telemetry/docid/more/path/stuff or /submit/test/id/and/so/on
- You can add more endpoints by modifying the namespace_config parameter in basic_local_pipeline.edge.toml.
- The namespace config is more manageable if you keep the JSON in a separate file and run it through something like jq -c '.' < my_namespaces.json before putting it into the toml config.
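
The jq step might look like the sketch below. The keys inside my_namespaces.json are placeholders, not the real namespace schema (check the existing namespace_config value for the actual fields), and the compact_namespaces helper is made up for illustration.

```shell
# Keep the namespace JSON in its own, readable file.
# NOTE: "example" and "placeholder_setting" are placeholder keys, not the real schema.
cat > my_namespaces.json <<'EOF'
{
  "example": {
    "placeholder_setting": 1
  }
}
EOF

# Hypothetical helper: print the file as one compact line,
# ready to paste into the namespace_config parameter.
compact_namespaces() {
  jq -c '.' < "${1:-my_namespaces.json}"
}
```

Running `compact_namespaces` on the file above prints `{"example":{"placeholder_setting":1}}`.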
Where to get more info about configuring heka
- http://hekad.readthedocs.org/en/latest/index.html
