
# S5

"Because your data is probably small enough!" -- Author

S5 exports data directly from SQL-like databases to Amazon S3. Although the same could be achieved by gluing existing tools together, S5 is built to:

  • Be standalone, have no runtime dependencies and be easy to run and deploy.
  • Have predictable operational behavior.
  • Leave no trace, and be friendly to the health of your systems.

The name S5 comes from the tool's previous name, SqlS3Streamer, which contains five S's in total.

## Use cases

This tool exports small datasets stored in SQL-like databases to Amazon S3 in a compressed, field- and record-delimited format such as TSV or CSV. It is primarily designed with Amazon Redshift in mind, which imports such formats natively.

This tool is only useful if your data is small enough that, given your database and network resources, it can be exported in a reasonable amount of time.

You can export a selected subset of columns from your tables and even apply simple transformations, as far as your SQL database allows.

## How it works

S5 runs an arbitrary SQL query against a database, streams the result in chunks and uses Amazon S3's multipart upload API to store the data on S3.

The result set can be arbitrarily large, say tens of gigabytes, and memory consumption will remain constant.
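
To make this concrete, here is a minimal sketch of the streaming loop in Go, assuming a hypothetical `uploadPart` helper and a fixed two-column row shape; S5's actual internals may differ:

```go
package main

import (
	"bytes"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // an assumed driver, for illustration only
)

const chunkSize = 100 << 20 // 100 MB parts, as with --chunksizemb 100

// uploadPart stands in for uploading one part via S3's multipart upload API.
func uploadPart(part int, data []byte) error {
	fmt.Printf("uploading part %d (%d bytes)\n", part, len(data))
	return nil
}

func export(db *sql.DB, query string) error {
	rows, err := db.Query(query)
	if err != nil {
		return err
	}
	defer rows.Close()

	var buf bytes.Buffer
	part := 1
	for rows.Next() {
		var id int
		var name string
		if err := rows.Scan(&id, &name); err != nil {
			return err
		}
		fmt.Fprintf(&buf, "%d\t%s\n", id, name) // serialize one record

		// Only one chunk is ever held in memory, so memory stays constant
		// no matter how large the result set is.
		if buf.Len() >= chunkSize {
			if err := uploadPart(part, buf.Bytes()); err != nil {
				return err
			}
			part++
			buf.Reset()
		}
	}
	if buf.Len() > 0 {
		if err := uploadPart(part, buf.Bytes()); err != nil {
			return err
		}
	}
	return rows.Err()
}

func main() {
	db, err := sql.Open("mysql", "user:pass@/mydb")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := export(db, "select id, name from foos"); err != nil {
		log.Fatal(err)
	}
}
```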

### Output Format

Currently S5 only knows how to serialize records into field/record-delimited formats such as TSV and CSV.
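
For illustration, here is a tiny sketch of what such serialization looks like with Go's standard `encoding/csv` package; this is not necessarily how S5's own serializer is written:

```go
package main

import (
	"encoding/csv"
	"log"
	"os"
)

func main() {
	w := csv.NewWriter(os.Stdout)
	w.Comma = '\t' // tab-delimited output (TSV); leave as ',' for CSV

	records := [][]string{
		{"1", "foo", "2015-07-27"},
		{"2", "bar", "2015-07-28"},
	}
	for _, rec := range records {
		if err := w.Write(rec); err != nil {
			log.Fatal(err)
		}
	}
	w.Flush()
	if err := w.Error(); err != nil {
		log.Fatal(err)
	}
}
```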

### Compression

S5 can compress the exported data in such a way that the resulting file, stitched together by Amazon S3's multipart API, forms a valid gzip file.
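
This works because concatenated gzip members form a valid gzip stream. A small, self-contained demonstration of the idea (not S5's actual compression code):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// gzipPart compresses one part as a complete gzip member.
// Errors are ignored for brevity; writes to a bytes.Buffer cannot fail.
func gzipPart(data []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data)
	zw.Close()
	return buf.Bytes()
}

func main() {
	// Compress two "parts" independently, then stitch them together,
	// just as S3 does when it assembles a multipart upload.
	var stitched bytes.Buffer
	stitched.Write(gzipPart([]byte("1\tfoo\n")))
	stitched.Write(gzipPart([]byte("2\tbar\n")))

	// gzip readers treat concatenated members as a single stream.
	zr, err := gzip.NewReader(&stitched)
	if err != nil {
		log.Fatal(err)
	}
	out, err := io.ReadAll(zr)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(string(out)) // prints "1\tfoo\n2\tbar\n"
}
```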

### Pipelining

S5 works by reading chunks of data and uploading them until all the data is consumed. To speed up the process, reading and uploading happen in stages: while one part is being uploaded, the next is being read, serialized, and compressed.
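
A rough sketch of such a two-stage pipeline in Go, using hypothetical `readChunk` and `uploadPart` stand-ins; the real staging in S5 may be organized differently:

```go
package main

import (
	"fmt"
	"log"
)

// readChunk stands in for reading, serializing and compressing one part;
// it returns false when the result set is exhausted.
func readChunk(n int) ([]byte, bool) {
	if n > 3 {
		return nil, false
	}
	return []byte(fmt.Sprintf("part %d\n", n)), true
}

// uploadPart stands in for one S3 multipart PUT.
func uploadPart(n int, data []byte) error {
	fmt.Printf("uploaded part %d (%d bytes)\n", n, len(data))
	return nil
}

func main() {
	parts := make(chan []byte, 1) // buffer of one: read ahead by a single part

	// Producer: read, serialize and compress the next part while the
	// previous one is still being uploaded by the consumer below.
	go func() {
		defer close(parts)
		for n := 1; ; n++ {
			chunk, ok := readChunk(n)
			if !ok {
				return
			}
			parts <- chunk
		}
	}()

	// Consumer: upload parts in order as they become available.
	n := 1
	for chunk := range parts {
		if err := uploadPart(n, chunk); err != nil {
			log.Fatal(err)
		}
		n++
	}
}
```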

## Running

You can run S5 by supplying the arguments:

```
s5  --sql.user=user                               \
    --sql.password=pass                           \
    --sql.database=mydb                           \
    --s3.region us-east-1                         \
    --s3.accesskey S3ACCESSKEY                    \
    --s3.secretkey S3SECRETKEY                    \
    --s3.bucket mydbs                             \
    --s3.path '2015/07/27/foos.gz'                \
    --sql.query="select * from foos"              \
    --compress                                    \
    --chunksizemb 100
```

For an explanation of all command line arguments, try:

```
s5 --help
```

When a multipart S3 upload fails, the chunks are not automatically removed by S3. They won't even show up in any file hierarchy on S3, but you'll still be charged for them. To clean them up, I wrote this little tool, which you can schedule to run at a time that doesn't interfere with ongoing uploads:

```
s3multicleanup --s3.region us-east-1              \
               --s3.accesskey S3ACCESSKEY         \
               --s3.secretkey S3SECRETKEY         \
               --s3.bucket mydbs
```
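
For reference, aborting abandoned multipart uploads with the AWS SDK for Go looks roughly like the sketch below; this is not necessarily how `s3multicleanup` itself is implemented:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:      aws.String("us-east-1"),
		Credentials: credentials.NewStaticCredentials("S3ACCESSKEY", "S3SECRETKEY", ""),
	}))
	svc := s3.New(sess)
	bucket := aws.String("mydbs")

	// List multipart uploads that were started but never completed or aborted
	// (pagination omitted for brevity).
	out, err := svc.ListMultipartUploads(&s3.ListMultipartUploadsInput{
		Bucket: bucket,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Abort each one so the orphaned parts stop accruing storage charges.
	for _, u := range out.Uploads {
		_, err := svc.AbortMultipartUpload(&s3.AbortMultipartUploadInput{
			Bucket:   bucket,
			Key:      u.Key,
			UploadId: u.UploadId,
		})
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("aborted upload %s for key %s", *u.UploadId, *u.Key)
	}
}
```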

## Considerations

S5 passes whatever query you provide directly to the SQL database. Operations such as JOIN, GROUP BY, DISTINCT, or even DROP TABLE will be executed as-is!

Besides, I suggest avoiding any operation beyond sequential sweeps; there are plenty of tools out there to carry out the T part of your _ETL_s.

Depending on the storage engine, your query may not result in a consistent snapshot of your data.

## Development

S5 is written in Go and can be built and tested with Go's standard toolchain:

```
go test ./...
go build ./...
```

## Future development

Pull requests are welcome.

### TODO

  • More comments for packages and some functions.
  • Add tests for the S3 package.
  • Packaging for Debian.
  • Pluggable output formats for cases where CSV/TSV is not sufficient.
  • Configurable compression.

## Credits

  • This tool was originally made for Brainly.
