Span formats.
The span
command line tools aim at high-performance, versatile document conversion
between a variety of metadata formats.
The goal is to move quickly between input formats, such as stored API responses, XML-ish data or line-delimited JSON bibliographic data, and output formats, such as the finc intermediate format or formats that can be imported directly into SOLR or Elasticsearch.
As a non-goal, the span
tools do not care how you obtain your input data.
The tools expect a single input file and produce a single output file (stdin and stdout, respectively).
Why in Go?
Linux shell scripts have no native XML or JSON support, Python is a bit too slow for the casual processing of 100M or more records, and Java is a bit too verbose - which is why we chose Go. Go comes with XML and JSON support in the standard library, nice concurrency primitives and simple deployment of single static binaries.
Install with
$ go get github.com/miku/span/cmd/...
or via deb or rpm packages.
Supported formats include:
- CrossRef API, works and members
- JATS Journal Archiving and Interchange Tag Set, with various flavours for JSTOR and others
- DOAJ exports
- OVID holdings
- Google holdings
- FINC Intermediate Format
- FINC SOLR Schema
- GENIOS Profile XML
There are two main tools:
- span-import, anything to intermediate schema
- span-export, intermediate schema to anything
The span-import
tool should require minimal external information (no
holdings file, etc.) and be mainly concerned with the transformation of fancy
source formats into the catch-all intermediate schema.
The span-export
tool may include external sources to create output, e.g. holdings files.
$ span-import -h
Usage of span-import:
-cpuprofile="": write cpu profile to file
-i="": input format
-list=false: list formats
-log="": if given log to file
-members="": path to LDJ file, one member per line
-v=false: prints current program version
-verbose=false: more output
-w=4: number of workers
$ span-export -h
Usage of span-export:
-any=[]: ISIL
-b=20000: batch size
-cpuprofile="": write cpu profile to file
-dump=false: dump filters and exit
-f=[]: ISIL:/path/to/ovid.xml
-l=[]: ISIL:/path/to/list.txt
-list=false: list output formats
-o="solr413": output format
-skip=false: skip errors
-source=[]: ISIL:SID
-v=false: prints current program version
-w=4: number of workers
List available input formats:
$ span-import -list
doaj
genios
crossref
degruyter
jstor
Import crossref LDJ (with cached members API responses) or DeGruyter XML (preprocessed into a single file):
$ span-import -i crossref -members members.ldj crossref.ldj > crossref.is.ldj
$ span-import -i jats degruyter.ldj > degruyter.is.ldj
Concat for convenience:
$ cat crossref.is.ldj degruyter.is.ldj > ai.is.ldj
Load intermediate schema records into a memcache server with memcldj:
$ memcldj ai.is.ldj
Export to finc 1.3 SOLR 4 schema:
$ span-export -o solr413 -f DE-14:DE-14.xml -f DE-15:DE-15.xml ai.is.ldj > ai.ldj
The exported ai.ldj
contains all aggregated index records and incorporates
all holdings information. It can be indexed quickly with
solrbulk:
$ solrbulk ai.ldj
This is work/simplification-in-progress.
For the moment, a new data source has to implement the span.Source interface:
// Source can emit records given a reader. What is actually returned is decided
// by the source, e.g. it may return Importer or Batcher object.
// Dealing with the various types is the responsibility of the call site.
// The channel will block on slow consumers and will not drop objects.
type Source interface {
	Iterate(io.Reader) (<-chan interface{}, error)
}
Channels in APIs might not be the optimum, though we are dealing with a kind of unbounded stream here.
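As an illustration only, a call site for such a channel might look roughly like the sketch below; the consume function is hypothetical, only span.Source, span.Importer and span.Batcher are names from the package, and the usual imports (fmt, io) are assumed:
// Hypothetical call site, not from the repository: drain a Source and
// dispatch on the type of the emitted objects.
func consume(src span.Source, r io.Reader) error {
	ch, err := src.Iterate(r)
	if err != nil {
		return err
	}
	for obj := range ch {
		switch v := obj.(type) {
		case span.Importer:
			is, err := v.ToIntermediateSchema()
			if err != nil {
				return err
			}
			_ = is // hand the record off to an encoder or worker pool
		case span.Batcher:
			_ = v // handle a whole batch of records here
		default:
			return fmt.Errorf("unexpected object type: %T", v)
		}
	}
	return nil
}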
Additionally, the emitted objects must implement span.Importer or span.Batcher, which carry the transformation business logic:
// Importer objects can be converted into an intermediate schema.
type Importer interface {
	ToIntermediateSchema() (*finc.IntermediateSchema, error)
}
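For illustration, a minimal record type satisfying span.Importer might look like this; the ExampleRecord type, its fields and the intermediate schema field names are assumptions, not taken from an actual source format:
// Hypothetical example record; struct fields and the intermediate schema
// field names below are assumptions for illustration.
type ExampleRecord struct {
	Title string   `json:"title"`
	DOI   string   `json:"doi"`
	ISSN  []string `json:"issn"`
}

// ToIntermediateSchema converts the example record into the catch-all
// intermediate schema, satisfying span.Importer.
func (r ExampleRecord) ToIntermediateSchema() (*finc.IntermediateSchema, error) {
	is := new(finc.IntermediateSchema)
	is.ArticleTitle = r.Title // assumed field name
	is.DOI = r.DOI            // assumed field name
	is.ISSN = r.ISSN          // assumed field name
	return is, nil
}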
The exporters need to implement the finc.ExportSchema
interface:
// ExportSchema encapsulates an export flavour. This will most likely be a
// struct with fields and methods relevant to the exported format. For the
// moment we assume the output is JSON. If formats other than JSON are
// requested, move the marshalling into this interface.
type ExportSchema interface {
	// Convert takes an intermediate schema record to export. Returns an
	// error, if conversion failed.
	Convert(IntermediateSchema) error
	// Attach takes a list of strings (here: ISILs) and attaches them to the
	// current record.
	Attach([]string)
}
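As a sketch only (not the actual solr413 exporter), a minimal export flavour implemented outside the finc package might look like this; the DummyExport type and the ArticleTitle field name are assumptions:
// Hypothetical export flavour; names are assumptions.
type DummyExport struct {
	Title        string   `json:"title"`
	Institutions []string `json:"institutions"`
}

// Convert copies the relevant parts of an intermediate schema record.
func (d *DummyExport) Convert(is finc.IntermediateSchema) error {
	d.Title = is.ArticleTitle // assumed field name
	return nil
}

// Attach keeps the given ISILs with the record.
func (d *DummyExport) Attach(isils []string) {
	d.Institutions = isils
}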
Planned simplifications:
- decouple batching (performance) from record stream generation (content)
- write wrappers around common inputs like XML, JSON, CSV ...
- maybe factor out importer interface (like exporter)
- docs: add example files for each supported data format
A filtering pipeline.
The final processing step, from intermediate schema to an export format, involves various decisions:
- Should an ISIL be attached to a record or not?
- Should a record be excluded due to an expired or deleted DOI?
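One way to keep such decisions testable would be to model each of them as a small predicate over intermediate schema records. The following is only a sketch of that idea; neither the Filter interface nor HoldingsFilter exist in span:
// Sketch of a filter predicate; not part of the current span API.
type Filter interface {
	// Apply returns true, if the record passes the filter, e.g. if an ISIL
	// should be attached.
	Apply(finc.IntermediateSchema) bool
}

// HoldingsFilter is a hypothetical filter backed by a set of ISSNs taken
// from a holdings file.
type HoldingsFilter struct {
	ISSN map[string]bool
}

func (f HoldingsFilter) Apply(is finc.IntermediateSchema) bool {
	for _, issn := range is.ISSN { // assumed field name
		if f.ISSN[issn] {
			return true
		}
	}
	return false
}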