, ,
/////|
///// |
|~~~| |
|===| |
|j | |
| g | |
| s| /
|===|/
'---'
Scratch project, assorted utilities around scholarly metadata formats and tasks.
status: wip, api and cli not stable yet
$ git clone https://github.com/miku/scholkit.git
$ cd scholkit
$ make
This builds a couple of executables, all starting with the sk prefix. The executables are designed to be as standalone as possible, but they also share configuration for various tasks (e.g. directories).
$ curl -sL https://archive.org/download/arxiv-2024-02-15/arxiv-2024-02-15.xml.zst | \
zstd -dc | \
    sk-convert -f arxiv

We want conversions from various source formats into one single target format (e.g. release entities). Source formats include:
- crossref
- datacite
- pubmed
- arxiv
- oaiscrape
- openalex
- dblp
- and more
Target:
- fatcat2 entities
For each format, try to find the smallest conversion unit, e.g. one record. Then add convenience layers on top, e.g. for streams.
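A minimal sketch of that layering in Go, with hypothetical names (ArxivRecord, Release, ConvertArxivRecord are illustrations, not the actual scholkit API): the single-record function is the smallest unit, and a batch helper sits on top of it.

// Sketch only: names and fields are illustrative, not the actual scholkit API.
package convert

import (
    "encoding/json"
    "io"
)

// Release is a placeholder for a fatcat-style release entity.
type Release struct {
    Title string   `json:"title"`
    DOI   string   `json:"doi,omitempty"`
    IDs   []string `json:"ids,omitempty"`
}

// ArxivRecord is a placeholder for one parsed arxiv metadata record.
type ArxivRecord struct {
    ID    string
    Title string
    DOI   string
}

// ConvertArxivRecord is the smallest conversion unit: one record in, one release out.
func ConvertArxivRecord(rec ArxivRecord) (Release, error) {
    return Release{
        Title: rec.Title,
        DOI:   rec.DOI,
        IDs:   []string{"arxiv:" + rec.ID},
    }, nil
}

// ConvertAll is a convenience layer on top of the single-record unit: it
// converts already parsed records and writes releases as JSON lines to w.
func ConvertAll(recs []ArxivRecord, w io.Writer) error {
    enc := json.NewEncoder(w)
    for _, rec := range recs {
        release, err := ConvertArxivRecord(rec)
        if err != nil {
            return err
        }
        if err := enc.Encode(&release); err != nil {
            return err
        }
    }
    return nil
}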
No bulk conversion should take longer than roughly one hour (the slowest currently is openalex - 250M records - which takes about 45 minutes).
Create a "works" view from releases.
The sk-cat utility streams content from multiple URLs to stdout. It can help
to create single-file versions of larger datasets like pubmed, openalex, etc.
$ curl -s "https://www.gutenberg.org/browse/scores/top" | \
grep -Eo "/ebooks/[0-9]+" | \
awk '{print "https://gutenberg.org"$0".txt.utf-8"}' > top100.txt
$ sk-cat < top100.txt > top100books.txt
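A rough, sequential sketch of the sk-cat idea (not the actual implementation, which may fetch concurrently, retry, or decompress on the fly): read one URL per line from stdin and copy each response body to stdout.

// Sketch only: a sequential version of the sk-cat idea.
package main

import (
    "bufio"
    "fmt"
    "io"
    "net/http"
    "os"
)

func main() {
    scanner := bufio.NewScanner(os.Stdin)
    for scanner.Scan() {
        url := scanner.Text()
        if url == "" {
            continue
        }
        resp, err := http.Get(url)
        if err != nil {
            fmt.Fprintf(os.Stderr, "fetch %s: %v\n", url, err)
            continue
        }
        if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
            fmt.Fprintf(os.Stderr, "copy %s: %v\n", url, err)
        }
        resp.Body.Close()
    }
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}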
- implement schema conversions and tests
- add layer for daily harvests and capturing data on disk
- cli to interact with the current files on disk
- cli for basic stats
- some simplistic index/query structure, e.g. to quickly find a record by id or the like (see the sketch below)
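For the index/query item, a minimal sketch under assumptions (newline-delimited JSON files with an "id" field; the on-disk layout is just an illustration): map record ids to byte offsets, so a lookup can seek straight to the line.

// Sketch only: an in-memory id -> byte offset index over a JSON-lines file.
package index

import (
    "bufio"
    "encoding/json"
    "io"
    "os"
)

// Build scans a JSON-lines file and records the starting offset of each id.
func Build(path string) (map[string]int64, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    idx := make(map[string]int64)
    var offset int64
    br := bufio.NewReader(f)
    for {
        line, err := br.ReadBytes('\n')
        if len(line) > 0 {
            var doc struct {
                ID string `json:"id"`
            }
            if jerr := json.Unmarshal(line, &doc); jerr == nil && doc.ID != "" {
                idx[doc.ID] = offset
            }
            offset += int64(len(line))
        }
        if err == io.EOF {
            return idx, nil
        }
        if err != nil {
            return nil, err
        }
    }
}

A persistent variant could write the map to a sidecar file next to the data.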
More:
- map basic fields to fatcat release entities
- map all fields to fatcat release entities
- basic clustering algorithm