Unix utilities that process ‘abstracted tables’, or streams of records. In the spirit of of the classic Unix utilities: head, tail, cut, grep, sort, and works with multiple encodings (tab, csv, etc.).
go get github.com/kyleburton/go-abtab/cmd/abtab cd $GOPATH/src/github.com/kyleburton/go-abtab bash install.sh
Record streams are identified using a URI structure. To convert a tab file in the pwd (foo.tab
) to a csv file:
abcat -i tab://foo.tab -o csv://foo.csv
Tab separated values is the core file format. By default, files are expected to have a header line describing the columns in the file though a header can be specified as part of the url.
FieldSeparator=‘\t’
RecordSeparator=‘\n’
‘Memo’ fields, and embedded delimiters:
Embedded Newlines represnted as ‘\n’
Embedded Tabs represnted as ‘\t’
Embedded CR represnted as ‘\r’
Examples:
‘cat’ a tab delimited file to the terminal:
abcat -i foo.tab
All utilities supoprt using a URL to specify the encoding:
abcat -i tab://foo.tab
Convert from CSV to TAB:
abcat -i csv://foo.csv -o tab://foo.tab
View a ‘portrait’ mode of the records in a file:
abview -i csv://foo.csv
Cut columns out of a file by name:
# Id Created_At Updated_At Email_Address abcut -i csv://foo.csv -f Id,First_Name,Last_Name
Grep for lines
abgrep -i csv://foo.csv -e 'First_Name == "Bob"' abgrep -i csv://foo.csv -e '__LINE__ >= 300 && __LINE__ < 400'
Take a random sample: Flip a coin for each record:
abgrep -e 'RandFloat()>0.5' -i tab://foo.tab
The expression is evaluated using: https://github.com/kyleburton/go-eval, a restricted sub-set of the go language. There will be a pre-defined variable for each field in the row as well all __LINE__
which contains the current record number.
Take a sample: record 300 to 399:
abgrep -i csv://foo.csv -e '__LINE__ >= 300 && __LINE__ < 400'
Take a sample: Emit every Nth record:
abgrep -e '(__LINE__%2)==0' -i tab://foo.tab abgrep -e '(__LINE__%5)==0' -i tab://foo.tab abgrep -e '(__LINE__%10)==0' -i tab://foo.tab
Head and Tail
Header plus the first 100 records:
abhead -i csv://foo.csv -n 100
Header plus the last 100 records:
abtail -i csv://foo.csv -n 100
Header, skips the first 100 records, emits the rest:
abtail -i csv://foo.csv -n +100
Sorting
absort -i csv://foo.csv -f Last_Name,First_Name
Joining/Merging
TBD
Expressions have access to the record number in the stream (lnum), the array of field values (rec) and each of the declared fields.
Modify a source: add columns, apply an expression to records.
- -tmp /path/to/tmp
Specfiy an alternatvie temp directory to use.
Schema correlates to the driver / file encoding.
Path correlates to the file path to read from or write to. You can interaact with stdin/stdout on Unix by specifying file paths of /dev/stdin and /dev/stdout respectively. Drivers may interpret these differently, ex: database drivers would interpret the first parts of the path as the database host, schema and table name.
These are for common options and driver specific options.
- header=f1,f2,f3
Specify the header for a file if it does not have one of its own.
- -header
Indicates the source has no header and that one should be fabricated (F1, F2, F3, …). This is assumed if a header is supplied, and can be used to strip the header off of a destination file.
- skipLines=N
Indicates to skip the first N lines of the source.
There are no custom options supported for the tab driver.
There are no custom options supported for the csv driver.
The fixed width driver can be used to read or write fixed width file formats. It supports the following options:
TBD
NB: The pg driver supports read operations but does not support write operations.
abcat -i “pg://localhost:5432/database_name/table_name?order=somecol&limit=10&offset=0&user=pg_username&password=pg_password” -o csv://table.csv
TBD
abcat -i "tab://fixtures/galbithink.org/92f10-19.tab?skipLines=4&header=Name,Count" abcat -i "tab://fixtures/galbithink.org/92f10-19.tab?skipLines=4&header=Name,Count" -o csv:///dev/stdin abfillrates -i "tab://fixtures/galbithink.org/92f10-19.tab?skipLines=4&header=Name,Count" abgrep -e 'Substr(Name,-1,0) == "y"' -i "tab://fixtures/galbithink.org/92f10-19.tab?skipLines=4&header=Name,Count" 2>&1 | less
TODO:
- DONE implement an installer, and cmdline wrappers (abcat, abview, abgrep, …)
- documentation: packages, apis, examples, command line utilities
- DONE implement driver interface and types
- DONE driver type/struct
- DONE url parsing
- DONE driver options
- DONE implement tab driver
- DONE implement csv driver
- DONE implement abcat utility
- implement support for memo fields in tab driver (embedded newlines and tabs, escape on input, unescape on output)
- DONE implement abcut utility
- DONE implement abhead utility
- DONE implement abtail utility
- DONE implement abview utility
- DONE implement abgrep utility
- DONE fork https://github.com/sbinet/go-eval
- DONE make all necessary types public, or otherwise determine how to inject variables into the interpreter
- DONE embed the go-eval interpreter into grep in order to evaluate expressions per record
- DONE fork https://github.com/sbinet/go-eval
- DONE re-write grep’s expression fn as a stream filter
- extend the ‘standard library’ for abgrep’s expression language
- what does this need to be useful?
- DONE abgrep: implement Rand() functions
- implement pg driver
- implement absort utility
- implement abmod driver
- implement fixed driver
- implement mysql driver
- Fill Rates:
- Implement a cardinality checker.
Support both a strict uniqueness count (requires keeping the entire file in
ram), and estimation of cardinatlity (using a bloom filter or hyper-log-log)