Skip to content

ubleipzig/marctools

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

marctools

Various MARC command line utilities.

Build StatusGo Report Card

Installation

For native RPM or DEB packages see: Releases

If you have a local Go installation, you can just

go get github.com/ubleipzig/marctools/cmd/{marctojson,marctotsv,...}

Executables available:

Autogenerated docs: https://godoc.org/github.com/ubleipzig/marctools

marccount

Prints the number of records found in a file and then exits.

$ marccount fixtures/journals.mrc
10

marcdb

Turn a marc file into an sqlite3 database for random access. Supports secondary keys, so you can add an additional value as key, if needed (e.g. a date).

$ marcdb
Usage: marcdb [OPTIONS] MARCFILE
  -cpuprofile="": write cpu profile to file
  -encode=false: base64 encode record before inserting it
  -o="": output sqlite3 filename
  -secondary="": add a secondary value to the row
  -v=false: prints current program version

$ marcdb -secondary todo -o journals.db fixtures/journals.mrc
$ sqlite3 journals.db ".schema"
CREATE TABLE store (id TEXT, secondary TEXT, record BLOB, PRIMARY KEY (id, secondary));
CREATE INDEX idx_store_id ON store (id);

Note: sqlite3 version 3.8.6 has convenient io helper to extract binary data properly on the command line.

$ sqlite3 journals.db "select record from store where id = 'testsample1'" \
    > testsample1.mrc
$ sqlite3 journals.db "select record from store where id = 'testsample1' \
                       and secondary = 'todo'" > testsample1.mrc

If the -encode flag is set, the record will be base64 encoded before insert:

$ marcdb -encode -o journals.db fixtures/journals.mrc
$ sqlite3 journals.db "select record from store where id = 'testsample1'"
MDE1NzFjYXMgYTIyMDAzNjExYSA0NSAgMDAxMDAxMjAwMDAw... DE0Hh0=

marcdump

Dumps MARC to stdout, similar to yaz-marcdump:

$ marcdump fixtures/testbug2.mrc
001 testbug2
005 20110419140028.0
008 110214s1992    it a     b    001 0 ita d
020 [  ] [(a) 8820737493]
035 [  ] [(a) (OCoLC)ocm30585539]
040 [  ] [(a) RBN], [(c) RBN], [(d) OCLCG], [(d) PVU]
041 [1 ] [(a) ita], [(a) lat], [(h) lat]
043 [  ] [(a) e-it---]
050 [14] [(a) DG848.15], [(b) .V53 1992]
049 [  ] [(a) PVUM]
100 [1 ] [(a) Vico, Giambattista,], [(d) 1668-1744.]
240 [10] [(a) Principum Neapolitanorum coniurationis anni MDCCI ...
245 [13] [(a) La congiura dei Principi Napoletani 1701 :], [(b) (pr ...
250 [  ] [(a) Fictional edition.]
260 [  ] [(a) Morano :], [(b) Centro di Studi Vichiani,], [(c) 1992.]
300 [  ] [(a) 296 p. :], [(b) ill. ;], [(c) 24 cm.]
490 [1 ] [(a) Opere di Giambattista Vico ;], [(v) 2/1]
500 [  ] [(a) Italian and Latin.]
504 [  ] [(a) Includes bibliographical references (p. [277]-281) and index.]
520 [3 ] [(a) Sample abstract.]
590 [  ] [(a) April11phi]
651 [ 0] [(a) Naples (Kingdom)], [(x) History], [(y) Spanish rule, ....
700 [1 ] [(a) Pandolfi, Claudia.]
800 [1 ] [(a) Vico, Giambattista,], [(d) 1668-1744.], [(t) Works.], ...
856 [40] [(u) http://fictional.com/sample/url]
994 [  ] [(a) C0], [(b) PVU]

marcmap

Dumps a list of id, offset, length tuples to stdout (TSV) or to a sqlite3 database:

By default write to stdout:

$ marcmap fixtures/journals.mrc
testsample1 0   1571
testsample2 1571    1195
testsample3 2766    1057
testsample4 3823    1361
testsample5 5184    1707
testsample6 6891    1532
testsample7 8423    1426
testsample8 9849    1251
testsample9 11100   2173
testsample10    13273   1195

Dump listing into an sqlite database with -o FILENAME:

$ marcmap -o seekmap.db fixtures/journals.mrc
$ sqlite3 seekmap.db 'select id, offset, length from seekmap'
testsample1|0|1571
testsample2|1571|1195
testsample3|2766|1057
testsample4|3823|1361
testsample5|5184|1707
testsample6|6891|1532
testsample7|8423|1426
testsample8|9849|1251
testsample9|11100|2173
testsample10|13273|1195

marcsplit

Splits a MARC file into smaller pieces.

$ marcsplit
Usage of marcsplit:
  -C=1: number of records per file
  -cpuprofile="": write cpu profile to file
  -d=".": directory to write to
  -s="split-": split file prefix
  -v=false: prints current program version

$ marcsplit -d /tmp -C 3 -s "example-prefix-" fixtures/journals.mrc
$ ls -1 /tmp/example-prefix-0000000*
/tmp/example-prefix-00000000
/tmp/example-prefix-00000001
/tmp/example-prefix-00000002
/tmp/example-prefix-00000003

marctojson

Converts MARC to JSON. This is a bit slower than yaz-marcdump -i marc -o json, but offers a bit more flexibility in the output format: It is possible to filter fields, omit the leader and to add additional meta information. Also, the output format is terser. It keeps all the information (including order) from MARC, but tries to be as brief as possible, e.g. there are no explicit subfield keys and fields are used only once as keys. Here is a short side-by-side comparison.

$ marctojson
Usage of marctojson:
  -b=10000: batch size for intercom
  -cpuprofile="": write cpu profile to file
  -i=false: ignore marc errors (not recommended)
  -l=false: dump the leader as well
  -m="": a key=value pair to pass to meta
  -p=false: plain mode: dump without content and meta
  -r="": only dump the given tags (e.g. 001,003)
  -recordkey="record": key name of the record
  -v=false: prints current program version and exit
  -w=4: number of workers

Default conversion (abbreviated, pretty-printed):

$ marctojson fixtures/testbug2.mrc | jsonpp
{
   "record" : {
      ...
      "245" : [
         {
            "ind1" : "1",
            "c" : [
               "Giambattista Vico ; a cura di Claudia Pandolfi."
            ],
            "a" : [
               "La congiura dei Principi Napoletani 1701 :"
            ],
            "ind2" : "3",
            "b" : [
               "(prima e seconda stesura) /"
            ]
         }
      ],
      ...
      "250" : [
         {
            "ind2" : " ",
            "a" : [
               "Fictional edition."
            ],
            "ind1" : " "
         }
      ],
      "020" : [
         {
            "ind2" : " ",
            "ind1" : " ",
            "a" : [
               "8820737493"
            ]
         }
      ],
      "490" : [
         {
            "v" : [
               "2/1"
            ],
            "ind2" : " ",
            "a" : [
               "Opere di Giambattista Vico ;"
            ],
            "ind1" : "1"
         }
      ],
      "240" : [
         {
            "a" : [
               "Principum Neapolitanorum coniurationis anni MDCCI historia."
            ],
            "ind1" : "1",
            "l" : [
               "Italian & Latin"
            ],
            "ind2" : "0"
         }
      ],
      "001" : "testbug2"
   },
   "meta" : {}
}

Dump the leader as well with -l and only dump field 040 with -r 040:

$ marctojson -l -r 040 fixtures/testbug2.mrc | jsonpp
{
   "record" : {
      "040" : [
         {
            "ind2" : " ",
            "c" : [
               "RBN"
            ],
            "a" : [
               "RBN"
            ],
            "d" : [
               "OCLCG",
               "PVU"
            ],
            "ind1" : " "
         }
      ],
      "leader" : {
         "status" : "c",
         "sfcl" : "2",
         "lol" : "4",
         "losp" : "5",
         "type" : "a",
         "ba" : "337",
         "impldef" : "m Ma ",
         "length" : "1234",
         "ic" : "2",
         "raw" : "01234cam a2200337Ma 4500",
         "cs" : "a"
      }
   },
   "meta" : {}
}

Restrict JSON to 001 and 245, and use plain mode with -p, which has no meta or content key:

$ marctojson -r "001, 245" -p fixtures/testbug2.mrc | jsonpp
{
   "001" : "testbug2",
   "245" : [
      {
         "ind1" : "1",
         "a" : [
            "La congiura dei Principi Napoletani 1701 :"
         ],
         "ind2" : "3",
         "c" : [
            "Giambattista Vico ; a cura di Claudia Pandolfi."
         ],
         "b" : [
            "(prima e seconda stesura) /"
         ]
      }
   ]
}

Add some value (here the current date) to the meta map:

$ marctojson -r "001, 245" -m date="$(date)" fixtures/testbug2.mrc | jsonpp
{
   "record" : {
      "001" : "testbug2",
      "245" : [
         {
            "ind2" : "3",
            "c" : [
               "Giambattista Vico ; a cura di Claudia Pandolfi."
            ],
            "ind1" : "1",
            "a" : [
               "La congiura dei Principi Napoletani 1701 :"
            ],
            "b" : [
               "(prima e seconda stesura) /"
            ]
         }
      ]
   },
   "meta" : {
      "date" : "Wed Jul 23 17:21:24 CEST 2014"
   }
}

In marctools version 1.6, the record key can be supplied by the user, and the default key for the record data was changed from content to record.

$ marctojson -r "001, 245" -recordkey data fixtures/testbug2.mrc | jsonpp
{
  "data": {
    "001": "testbug2",
    "245": [
      {
        "a": [
          "La congiura dei Principi Napoletani 1701 :"
        ],
        "b": [
          "(prima e seconda stesura) /"
        ],
        "c": [
          "Giambattista Vico ; a cura di Claudia Pandolfi."
        ],
        "ind1": "1",
        "ind2": "3"
      }
    ]
  },
  "meta": {}
}

marctotsv

Converts selected MARC tags to tab-separated values (TSV).

$ marctotsv
Usage: marctotsv [OPTIONS] MARCFILE TAG [TAG, TAG, ...]
  -cpuprofile="": write cpu profile to file
  -f="<NULL>": fill missing values with this
  -i=false: ignore marc errors (not recommended)
  -k=false: skip incomplete lines (missing values)
  -s="": separator to use for multiple values
  -v=false: prints current program version and exit
  -w=4: number of workers

Extract a single column:

$ marctotsv fixtures/journals.mrc 001
testsample1
testsample2
testsample3
testsample4
testsample5
testsample6
testsample7
testsample8
testsample9
testsample10

Extract two columns:

$ marctotsv fixtures/journals.mrc 001 245.a
testsample1 Journal of rational emotive therapy :
testsample2 Rational living.
testsample3 Psychotherapy in private practice.
testsample4 Journal of quantitative criminology.
testsample5 The Journal of parapsychology.
testsample6 Journal of mathematics and mechanics.
testsample7 The Journal of psychology.
testsample8 Journal of psychosomatic research.
testsample9 The journal of sex research
testsample10    Journal of phenomenological psychology.

Use a custom value for undefined fields with -f UNDEF:

$ marctotsv -f UNDEF fixtures/journals.mrc  001 245.a 245.b
testsample1 Journal of rational emotive therapy :   the journal of the In ...
testsample2 Rational living.    UNDEF
testsample3 Psychotherapy in private practice.  UNDEF
testsample4 Journal of quantitative criminology.    UNDEF
testsample5 The Journal of parapsychology.  UNDEF
testsample6 Journal of mathematics and mechanics.   UNDEF
testsample7 The Journal of psychology.  UNDEF
testsample8 Journal of psychosomatic research.  UNDEF
testsample9 The journal of sex research UNDEF
testsample10    Journal of phenomenological psychology. UNDEF

Only keep complete rows with -k:

$ marctotsv -k fixtures/journals.mrc  001 245.a 245.b
testsample1 Journal of rational emotive therapy :   the journal of the In ...

Include all values, separated by a pipe via - s "|":

$ marctotsv -s "|" fixtures/journals.mrc  001 710.a
testsample1 Institute for Rational-Emotive Therapy (New York, N.Y.)
testsample2 Institute for Rational-Emotive Therapy (New York, N.Y.)|Inst ...
testsample3 <NULL>
testsample4 LINK (Online service)
testsample5 Duke University.|ProQuest Psychology Journals.
testsample6 Indiana University.|Indiana University.
testsample7 ProQuest Psychology Journals.
testsample8 ScienceDirect (Online service).
testsample9 Society for the Scientific Study of Sex (U.S.)|Society for ...
testsample10    Ingenta (Firm).

marcuniq

Iterate over a MARC file and keep only the first record (by field 001). To deduplicate a number of updates, the data should be reversed first.

$ marcuniq
Usage: marcuniq [OPTIONS] MARCFILE
  -i=false: ignore marc errors (not recommended)
  -o="": output file (or stdout if none given)
  -v=false: prints current program version
  -x="": comma separated list of ids to exclude (or filename with one id per line)

Exclude three IDs and dump to file:

$ marcuniq -x "testsample1,testsample2" -o filtered.mrc fixtures/journals.mrc
excluded ids interpreted as string
2 ids to exclude loaded
10 records read
8 records written, 0 skipped, 2 excluded, 0 without ID (001)

$ marctotsv filtered.mrc 001
testsample3
testsample4
testsample5
testsample6
testsample7
testsample8
testsample9
testsample10

marcxmltojson

Convert MARCXML to Json. Note that MARCXML does not suffer certain size limits, as binary MARC does.

$ marcxmltojson
Usage: marcxmltojson [OPTIONS] MARCFILE
  -cpuprofile="": write cpu profile to file
  -i=false: ignore marc errors (not recommended)
  -l=false: dump the leader as well
  -m="": a key=value pair to pass to meta
  -p=false: plain mode: dump without content and meta
  -r="": only dump the given tags (e.g. 001,003)
  -v=false: prints current program version and exit
  -w=4: number of workers

Parameters are the same as for marctojson. Both command might merge into one in some future release.


Development

To run the tests just type:

make

To open a coverage report in you browser, run:

make cover

To package an DEB adjust debian/marctools/DEBIAN/control, e.g. update the version, then run:

make deb

To package an RPM, adjust packaging/marctools.spec, e.g. update the version, then run:

make rpm

To package an RPM on a CentOS 6.2 with libc 2.12 setup a VM with veewee and vagrant. Then run:

vagrant up
make vm-setup

Subsequently build RPMs against libc 2.12 with

make rpm-compatible

Previous versions

Versions 1.0 up to 1.3.8 (named gomarckit) used a non-standard project layout and lacked tests. Their version history is preserved under the 1.3.8-maint branch.

Todo

  • Perform and include some performance benchmarks in README.
  • The MARC21 library used might issue more system calls than needed, e.g. in the main Record create loop each data and control field will issue a read system call. It could be more efficient to read MARC in larger block and distribute the Record parsing itself to the workers.
  • Add more tests for more fancy MARC files (encodings, broken dirents, etc.).