Skip to content

kyleburton/go-abtab

Repository files navigation

Abstract Table Utilities

Unix utilities that process ‘abstracted tables’, or streams of records. In the spirit of of the classic Unix utilities: head, tail, cut, grep, sort, and works with multiple encodings (tab, csv, etc.).

Installation

go get github.com/kyleburton/go-abtab/cmd/abtab
cd $GOPATH/src/github.com/kyleburton/go-abtab
bash install.sh

Overview

Record streams are identified using a URI structure. To convert a tab file in the pwd (foo.tab) to a csv file:

abcat -i tab://foo.tab -o csv://foo.csv

Tab separated values is the core file format. By default, files are expected to have a header line describing the columns in the file though a header can be specified as part of the url.

FieldSeparator=‘\t’
RecordSeparator=‘\n’

‘Memo’ fields, and embedded delimiters:
Embedded Newlines represnted as ‘\n’
Embedded Tabs represnted as ‘\t’
Embedded CR represnted as ‘\r’

Examples:

‘cat’ a tab delimited file to the terminal:

  abcat -i foo.tab

All utilities supoprt using a URL to specify the encoding:

  abcat -i tab://foo.tab

Convert from CSV to TAB:

  abcat -i csv://foo.csv -o tab://foo.tab

View a ‘portrait’ mode of the records in a file:

  abview -i csv://foo.csv

Cut columns out of a file by name:

  # Id Created_At Updated_At Email_Address
  abcut -i csv://foo.csv -f Id,First_Name,Last_Name

Grep for lines

  abgrep -i csv://foo.csv -e 'First_Name == "Bob"'
  abgrep -i csv://foo.csv -e '__LINE__ >= 300 && __LINE__ < 400'

Take a random sample: Flip a coin for each record:

  abgrep -e 'RandFloat()>0.5' -i tab://foo.tab

The expression is evaluated using: https://github.com/kyleburton/go-eval, a restricted sub-set of the go language. There will be a pre-defined variable for each field in the row as well all __LINE__ which contains the current record number.

Take a sample: record 300 to 399:

  abgrep -i csv://foo.csv -e '__LINE__ >= 300 && __LINE__ < 400'

Take a sample: Emit every Nth record:

  abgrep -e '(__LINE__%2)==0' -i tab://foo.tab
  abgrep -e '(__LINE__%5)==0' -i tab://foo.tab
  abgrep -e '(__LINE__%10)==0' -i tab://foo.tab

Head and Tail

Header plus the first 100 records:

  abhead -i csv://foo.csv -n 100

Header plus the last 100 records:

  abtail -i csv://foo.csv -n 100

Header, skips the first 100 records, emits the rest:

  abtail -i csv://foo.csv -n +100

Sorting

  absort -i csv://foo.csv -f Last_Name,First_Name

Joining/Merging

TBD

Expressions have access to the record number in the stream (lnum), the array of field values (rec) and each of the declared fields.

Modify a source: add columns, apply an expression to records.

Common Command Line Options

  • -tmp /path/to/tmp

Specfiy an alternatvie temp directory to use.

URL Encoding

Schema

Schema correlates to the driver / file encoding.

Path

Path correlates to the file path to read from or write to. You can interaact with stdin/stdout on Unix by specifying file paths of /dev/stdin and /dev/stdout respectively. Drivers may interpret these differently, ex: database drivers would interpret the first parts of the path as the database host, schema and table name.

Query String

These are for common options and driver specific options.

Common Options
  • header=f1,f2,f3

Specify the header for a file if it does not have one of its own.

  • -header

Indicates the source has no header and that one should be fabricated (F1, F2, F3, …). This is assumed if a header is supplied, and can be used to strip the header off of a destination file.

  • skipLines=N

Indicates to skip the first N lines of the source.

Supported Encoding Formats

tab

There are no custom options supported for the tab driver.

csv

There are no custom options supported for the csv driver.

fixed

The fixed width driver can be used to read or write fixed width file formats. It supports the following options:

xls

TBD

pg

NB: The pg driver supports read operations but does not support write operations.

abcat -i “pg://localhost:5432/database_name/table_name?order=somecol&limit=10&offset=0&user=pg_username&password=pg_password” -o csv://table.csv

mysql

TBD

Examples

  abcat -i "tab://fixtures/galbithink.org/92f10-19.tab?skipLines=4&header=Name,Count"
  abcat -i "tab://fixtures/galbithink.org/92f10-19.tab?skipLines=4&header=Name,Count" -o csv:///dev/stdin
  abfillrates -i "tab://fixtures/galbithink.org/92f10-19.tab?skipLines=4&header=Name,Count"

  abgrep -e 'Substr(Name,-1,0) == "y"' -i "tab://fixtures/galbithink.org/92f10-19.tab?skipLines=4&header=Name,Count" 2>&1 | less

TODO:

  • DONE implement an installer, and cmdline wrappers (abcat, abview, abgrep, …)
  • documentation: packages, apis, examples, command line utilities
  • DONE implement driver interface and types
    • DONE driver type/struct
    • DONE url parsing
    • DONE driver options
  • DONE implement tab driver
  • DONE implement csv driver
  • DONE implement abcat utility
  • implement support for memo fields in tab driver (embedded newlines and tabs, escape on input, unescape on output)
  • DONE implement abcut utility
  • DONE implement abhead utility
  • DONE implement abtail utility
  • DONE implement abview utility
  • DONE implement abgrep utility
    • DONE fork https://github.com/sbinet/go-eval
      • DONE make all necessary types public, or otherwise determine how to inject variables into the interpreter
      • DONE embed the go-eval interpreter into grep in order to evaluate expressions per record
  • DONE re-write grep’s expression fn as a stream filter
  • extend the ‘standard library’ for abgrep’s expression language
    • what does this need to be useful?
  • DONE abgrep: implement Rand() functions
  • implement pg driver
  • implement absort utility
  • implement abmod driver
  • implement fixed driver
  • implement mysql driver
  1. Fill Rates:
  1. Implement a cardinality checker.

Support both a strict uniqueness count (requires keeping the entire file in
ram), and estimation of cardinatlity (using a bloom filter or hyper-log-log)

About

Abstract Tables in Go

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published