
ubleipzig/oaimi

Note: oaimi is deprecated. For a better experience, please take a look at metha - it supports incremental harvesting, compresses results, and has an overall simpler interface and internals.



The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. https://www.openarchives.org/pmh/

No-frills OAI-PMH harvesting. oaimi acts as a cache and takes care of incrementally retrieving new records.
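Under the hood, OAI-PMH is plain HTTP with a handful of query parameters: a verb such as Identify or ListRecords, plus metadataPrefix, from, until and a resumption token for paging. oaimi automates issuing these requests, following resumption tokens and caching the results. For illustration, a single raw ListRecords request against the endpoint from the examples below could look like this (assuming curl is available):

$ curl 'http://digital.ub.uni-duesseldorf.de/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2010-01-01&until=2010-12-31'

Each response carries at most one page of records; the resumptionToken element at the end must be sent back manually to fetch the next page - exactly the bookkeeping oaimi does for you.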


Installation

$ go get github.com/miku/oaimi/cmd/oaimi

There are deb and rpm packages as well.

Usage

Show repository information:

$ oaimi -id http://digital.ub.uni-duesseldorf.de/oai
{
  "formats": [
    {
      "prefix": "oai_dc",
      "schema": "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
    },
    ...
    {
      "prefix": "epicur",
      "schema": "http://www.persistent-identifier.de/xepicur/version1.0/xepicur.xsd"
    }
  ],
  "identify": {
    "name": "Visual Library Server der Universitäts- und Landesbibliothek Düsseldorf",
    "url": "http://digital.ub.uni-duesseldorf.de/oai/",
    "version": "2.0",
    "email": "docserv@uni-duesseldorf.de",
    "earliest": "2008-04-18T07:54:14Z",
    "delete": "no",
    "granularity": "YYYY-MM-DDThh:mm:ssZ"
  },
  "sets": [
    {
      "spec": "ulbdvester",
      "name": "Sammlung Vester (DFG)"
    },
    ...
    {
      "spec": "ulbd_rsh",
      "name": "RSH"
    }
  ]
}

Harvest the complete repository into a single file (the default format is oai_dc; the first run might take a few minutes):

$ oaimi -verbose http://digital.ub.uni-duesseldorf.de/oai > metadata.xml

Harvest only a slice (e.g. the set ulbdvester in the epicur format, limited to 2010):

$ oaimi -set ulbdvester -prefix epicur -from 2010-01-01 \
        -until 2010-12-31 http://digital.ub.uni-duesseldorf.de/oai > slice.xml

Harvest and add an artificial root element, so the result becomes well-formed XML:

$ oaimi -root records http://digital.ub.uni-duesseldorf.de/oai > withroot.xml

To list the harvested files, run:

$ ls $(oaimi -dirname http://digital.ub.uni-duesseldorf.de/oai)

Add any combination of parameters to see the resulting cache directory:

$ ls $(oaimi -dirname -set ulbdvester -prefix epicur -from 2010-01-01 \
             -until 2010-12-31 http://digital.ub.uni-duesseldorf.de/oai)

To remove all cached files:

$ rm -rf $(oaimi -dirname http://digital.ub.uni-duesseldorf.de/oai)

oaimi plays well with others:

$ oaimi http://acceda.ulpgc.es/oai/request | \
    xmlcutty -path /Response/ListRecords/record/metadata -root collection | \
    xmllint --format -

<?xml version="1.0"?>
<collection>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="ht...... dc.xsd">
      <dc:title>Elementos m&#xED;ticos y paralelos estructurales en la ...</dc:title>
...

Options:

$ oaimi -h
Usage of oaimi:
  -cache string
      oaimi cache dir (default "/Users/tir/.oaimicache")
  -dirname
      show shard directory for request
  -from string
      OAI from
  -id
      show repository info
  -prefix string
      OAI metadataPrefix (default "oai_dc")
  -root string
      name of artificial root element tag to use
  -set string
      OAI set
  -until string
      OAI until (default "2015-11-30")
  -v  prints current program version
  -verbose
      more output

There are experimental oaimi-id and oaimi-sync commands for identifying repositories or harvesting them in parallel:

$ oaimi-id -h
Usage of oaimi-id:
  -timeout duration
      deadline for requests (default 30m0s)
  -v  prints current program version
  -verbose
      be verbose
  -w int
      requests in parallel (default 8)

$ oaimi-sync -h
Usage of oaimi-sync:
  -cache string
      where to cache responses (default "/Users/tir/.oaimicache")
  -v  prints current program version
  -verbose
      be verbose
  -w int
      requests in parallel (default 8)

How it works

Harvesting is performed in chunks (currently weekly). The raw data is downloaded and appended to a single temporary file per source, set, prefix and month. Once a month has been harvested successfully, the temporary file is moved below the cache dir. In short: the cache dir will never contain partial files.
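The shard directory for a given request can be listed with -dirname, as shown above. The file names below are hypothetical and only illustrate the intended layout - one completed month per file, no partial files:

$ ls $(oaimi -dirname http://digital.ub.uni-duesseldorf.de/oai)
2015-01-01-2015-01-31.xml    # hypothetical file name
2015-02-01-2015-02-28.xml    # hypothetical file name
...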

If you request data for a given source, oaimi reuses the cache and only harvests data that is not yet cached. The output file is the concatenated content for the requested date range. On its own, the output is not well-formed XML, because a root element is missing; you can add a custom root element with the -root flag.

The value proposition of oaimi is that a single command gives you a single file with the raw data of a source, and that incremental updates are cheap - at most the last seven days need to be refetched.
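Since at most the last seven days are refetched, keeping a local copy current is cheap enough to run periodically. A minimal sketch, assuming a crontab entry (schedule and target path are only examples):

# run every Monday at 03:00, overwrite the local snapshot
0 3 * * 1  oaimi http://digital.ub.uni-duesseldorf.de/oai > /data/duesseldorf.xml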

For the moment, any further processing (such as handling deletions) must happen on the client side.
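For example, OAI-PMH marks deletions with a status="deleted" attribute on the record header. One possible way to count them client-side in a harvest that has a root element (withroot.xml from above) is xmllint:

$ xmllint --xpath 'count(//*[local-name()="header"][@status="deleted"])' withroot.xml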

More Docs: http://godoc.org/github.com/miku/oaimi

Similar projects

More sites

Distributions

Over 2038 repositories.

Miscellaneous

License

  • GPLv3
  • This project uses ioutil2, Copyright 2012, Google Inc. All rights reserved. Use of this source code is governed by a BSD-style license.
