Various MARC command line utilities, with an eye on performance.
Included:
Based on marc21 by Dan Scott and William Waites.
You'll need a Go installation.
Go library dependencies: github.com/mattn/go-sqlite3 (fetched via go get during the build).
Build:
$ git clone git@github.com:miku/gomarckit.git
$ cd gomarckit
$ go get github.com/mattn/go-sqlite3
$ make
Symlink the executables:
$ make install-home
Clean everything:
$ make clean
Just like yaz-marcdump. The Go version is about 4-5x slower than the
C version.
$ ./marcdump test.mrc
...
001 003915646
003 DE-576
004 000577812
005 19951013000000
008 940607||||||||||||||||ger|||||||
689 [ ] [(A) g], [(0) 235869880], [(a) Französisch], [(x) 00]
689 [ ] [(A) s], [(0) 235840734], [(a) Syntax], [(x) 01]
852 [ ] [(a) DE-Ch1]
852 [ 1] [(c) ID 5150 boe], [(9) 00]
936 [ln] [(0) 221790136], [(a) ID 5150]
...
$ marctotsv -h
Usage: ./marctotsv [OPTIONS] MARCFILE TAG [TAG, ...]
-f="<NULL>": fill missing values with this
-i=false: ignore marc errors (not recommended)
-k=false: skip lines with missing values
-s="": separator to use for multiple values
-v=false: prints current program version
Convert MARC21 to TSV (tab-separated values). By default, if a given tag or subfield has multiple values, just the first one is considered. Most useful for control fields or non-repeatable fields with a single subfield.
For repeated fields, use the -s flag to specify a value delimiter.
Examples:
$ ./marctotsv test.mrc 001
...
111859182
111862493
111874173
111879078
...
Empty tag values get a default fill value <NULL>:
$ ./marctotsv test.mrc 001 004 005 852.a
...
121187764 01635253X 20040324000000 DE-105
38541028X 087701561 20010420000000 DE-540
385410298 087701561 20120910090128 <NULL>
385411057 087701723 20120910090145 DE-540
...
Use a custom fill value with -f:
$ ./marctotsv test.mrc -f UNDEF 001 004 005 852.a
...
385410271 087701553 20101125121554 DE-15
38541028X 087701561 20010420000000 DE-540
385410298 087701561 20120910090128 UNDEF
385411057 087701723 20120910090145 DE-540
...
Or skip incomplete rows entirely with -k (the effect is hard to see here):
$ ./marctotsv test.mrc -k 001 004 005 852.a
...
121187764 01635253X 20040324000000 DE-105
121187772 01635253X 20040324000000 DE-105
38541028X 087701561 20010420000000 DE-540
385411057 087701723 20120910090145 DE-540
...
Access leader with special tags:
$ ./marctotsv test.mrc -k 001 @Status
9780415681889 c
9780415839792 n
9780415773874 c
$ ./marctotsv test.mrc 001 @Length @Status 300.a
9781420069235 5000 c 712 p. :
9780415458931 6769 c 424 p. :
9781841846804 3983 c 444 p. :
...
Available special tags to access the leader:
- @Length
- @Status
- @Type
- @ImplementationDefined
- @CharacterEncoding
- @BaseAddress
- @IndicatorCount
- @SubfieldCodeLength
- @LengthOfLength
- @LengthOfStartPos
Literals can be passed along (as long as they do not look like tags):
$ ./marctotsv test.mrc 001 @Status Hello "x y z"
9781420086454 n Hello x y z
9781420086478 n Hello x y z
9781420094237 c Hello x y z
9780849397776 c Hello x y z
...
To convert repeated subfield values, you can specify a separator string. If a separator string is present, multiple subfield values will be joined using that separator:
$ ./marctotsv -f NA -s "|" test.mrc 001 773.w 800.w 810.w 811.w 830.w
...
007717717 (DE-576)007717709 NA (DE-576)005674956 NA NA
007717725 (DE-576)007717709 NA (DE-576)005674956 NA NA
007717997 NA NA NA NA (DE-576)008007659|(DE-576)014490781
...
Similar output to marctojson (Java version). The Go version is a bit more lightweight and faster.
Note: The JSON format is tailored for a project that uses Elasticsearch. The idea was to make queries as convenient as possible. Here are some example queries:
$ curl localhost:9200/_search?q=content.001:007717717
$ curl localhost:9200/_search?q=content.020.a:0-684-84328-5
Performance data points:
- Extracting 5 fields from 4007756 records in a 4.3G file takes about 8 minutes, so about 8349 records per second. That's about four times faster than the Java version.
- A conversion of 7 fields plus the leader from a 4.3G file in plain mode takes about 7m27.258s. Command:

$ time marctojson -p -l -r 001,020,100,245,260,700,776 test.mrc > test.json

real    7m27.258s
user    5m24.932s
sys     1m38.876s

The resulting JSON file is about 2.1G in size. A similar conversion to TSV

$ time marctotsv -f NA -s "|" test-tit.mrc 001 020.a 020.z 020.9 100.a \
    245.a 245.b 245.c 260.a 260.b 260.c 700.a 776.z > test.tsv

real    5m39.058s
user    4m10.628s
sys     1m29.868s

takes about 5 minutes with an output file size of about 700M.
- For comparison, the baseline iteration, which only creates the MARC data structures, takes about 4 minutes in Go, which amounts to about 17425 records per second.
- A simple

$ yaz-marcdump -np

seems to iterate over the same 4007756 records in about 30 seconds (133591 records per second), and a /dev/null-ed iteration takes about 65 seconds (61657 records per second). So C is still three to four times faster.
The upside of the Go version is that it is quite short: about 126 LOC (plus 826 for the entire library). The Java version has around 600 (with a bit more functionality, plus 11061 for the library), and the C version contains about 260 LOC (plus 41938 for the yaz source). (Note: I know this is a skewed comparison; it's only here for framing.)
Example usage:
$ marctojson
Usage: marctojson [OPTIONS] MARCFILE
-i=false: ignore marc errors (not recommended)
-l=false: dump the leader as well
-m="": a key=value pair(s) to pass to meta
-p=false: plain mode: dump without content and meta
-r="": only dump the given tags (e.g. 001,003)
-v=false: prints current program version and exit
$ ./marctojson -r 001,260 test-tit.mrc|head -1|json_pp
{
"content_type" : "application/marc",
"content" : {
"leader" : {
"sfcl" : "2",
"ba" : "265",
"status" : "c",
"lol" : "4",
"impldef" : "m a",
"length" : "1078",
"cs" : "a",
"type" : "a",
"losp" : "5",
"ic" : "2",
"raw" : "01078cam a2200265 a4500"
},
"260" : [
{
"c" : "18XX",
"a" : "Karlsruhe :",
"b" : "Groos,"
}
],
"001" : "00002144X"
},
"meta" : {}
}
$ ./marctojson -r 001,005,260 -m date=`date +"%Y-%-m-%d"` test-tit.mrc | head -1 | json_pp
{
"content_type" : "application/marc",
"content" : {
"leader" : {
"sfcl" : "2",
"ba" : "265",
"status" : "c",
"lol" : "4",
"impldef" : "m a",
"length" : "1078",
"cs" : "a",
"type" : "a",
"losp" : "5",
"ic" : "2",
"raw" : "01078cam a2200265 a4500"
},
"260" : [
{
"c" : "18XX",
"a" : "Karlsruhe :",
"b" : "Groos,"
}
],
"005" : "20120312115715.0",
"001" : "00002144X"
},
"meta" : {
"date" : "2013-10-30"
}
}
Convert MARC XML to JSON.
$ marcxmltojson
Usage: marcxmltojson MARCFILE
-m="": a key=value pair to pass to meta
-p=false: plain mode: dump without content and meta
-v=false: prints current program version and exit
Example input (snippet):
<?xml version="1.0" encoding="utf-8"?>
<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
<marc:leader> njm a22 2u 4500</marc:leader>
<marc:controlfield tag="001">NML00000001</marc:controlfield>
<marc:controlfield tag="003">DE-Khm1</marc:controlfield>
<marc:controlfield tag="005">20130916115438</marc:controlfield>
<marc:controlfield tag="006">m||||||||h||||||||</marc:controlfield>
<marc:controlfield tag="007">cr nnannnu uuu</marc:controlfield>
<marc:controlfield tag="008">130916s2013</marc:controlfield>
<marc:datafield tag="028" ind1="1" ind2="1">
<marc:subfield code="a">8.220369</marc:subfield>
<marc:subfield code="b">Naxos Digital Services Ltd</marc:subfield>
</marc:datafield>
<marc:datafield tag="035" ind1=" " ind2=" ">
<marc:subfield code="a">(DE-Khm1)NML00000001</marc:subfield>
</marc:datafield>
<marc:datafield tag="040" ind1=" " ind2=" ">
<marc:subfield code="a">DE-Khm1</marc:subfield>
...
<marc:datafield tag="830" ind1=" " ind2="0">
<marc:subfield code="a"/>
<marc:subfield code="p">Caucasian Sketches, Suite 1, Op. 10 -- Caucasian
Sketches, Suite 2, Op. 42, "Iveria"
</marc:subfield>
</marc:datafield>
<marc:datafield tag="856" ind1="4" ind2="0">
<marc:subfield code="u">
http://univportal.naxosmusiclibrary.com/catalogue/item.asp?cid=8.220369
</marc:subfield>
<marc:subfield code="z">Verbindung zu Naxos Music Library</marc:subfield>
</marc:datafield>
<marc:datafield tag="902" ind1=" " ind2=" ">
<marc:subfield code="a">130916</marc:subfield>
</marc:datafield>
</marc:record>
...
</marc:collection>
The Go XML unmarshaller is not streaming, which limits the size of the files that can be handled. Converting a 500M XML file takes about one minute, which is about twice as fast as a corresponding Python version.
Just like yaz-marcdump -s [PREFIX] -C [CHUNKSIZE] FILE, but a bit faster.
Count the number of records in a file (fast). Can be about 4 times faster than
yaz-marcdump -np file.mrc | tail -1 | awk '{print $3}'
for files with a lot of small records. Up to 20 times faster when the Linux file system cache is warmed up.
Like uniq, but considers MARC files and their 001 fields.
Usage: marcuniq [OPTIONS] MARCFILE
-i=false: ignore marc errors (not recommended)
-o="": output file (or stdout if none given)
-v=false: prints current program version
-x="": comma separated list of ids to exclude (or filename with one id per line)
The -x option is a bit unrelated: it additionally lets you specify ids that
should be excluded completely, either as a comma-separated string or as a
filename with one id per line. Example with filename:
$ marcuniq -x excludes.txt -o filtered.mrc file.mrc
Example with inline excludes:
$ marcuniq -x '15270, 15298, 15318, 15335' file.mrc > filtered.mrc
Generate tab separated values of the id (001), offset and length of the MARC records in a file:
Usage: marcmap [OPTIONS] MARCFILE
-o="": output to sqlite3 file
-v=false: prints current program version
Example:
$ marcmap file.mrc|head -5
1000574288 0 880
1000572307 880 777
1000570320 1657 861
1000570045 2518 908
1000555798 3426 823
This tool uses yaz-marcdump and awk internally to be more performant.
A 4.3G file with 4007803 records takes less than two minutes to map.
(The original version without the external commands needed to parse each record
and took about four minutes on the above file.)
Note: Since this tool does not parse the records, it currently cannot handle files containing records without an ID value (no 001 field).
Since 1.3.6: You can use the -o FILE flag to specify a filename where
a sqlite3 representation of the data is written into a table called
seekmap(id, offset, length). Example:
$ marcmap -o test.db test-dups.mrc
$ sqlite3 test.db ".schema seekmap"
CREATE TABLE seekmap (id text, offset int, length int);
$ sqlite3 test.db "select count(*) from seekmap"
102812
For development and benchmarks only.
loktotsv has been removed; it was just marctotsv with a fixed set of fields.
GPLv3.