scholkit

   ,   ,
  /////|
 ///// |
|~~~|  |
|===|  |
|j  |  |
| g |  |
|  s| /
|===|/
'---'

Scratch project, assorted utilities around scholarly metadata formats and tasks.

status: wip, api and cli not stable yet

Try

$ git clone https://github.com/miku/scholkit.git
$ cd scholkit
$ make

This builds a couple of executables, all starting with the sk prefix. The executables are designed to work standalone as far as possible, but they share configuration for various tasks (e.g. directories).

$ curl -sL https://archive.org/download/arxiv-2024-02-15/arxiv-2024-02-15.xml.zst | \
    zstd -dc | \
    sk-convert -f arxiv

Tools

Conversions

We want conversions from various formats to one single format (e.g. release entities). Source formats include:

  • crossref
  • datacite
  • pubmed
  • arxiv
  • oaiscrape
  • openalex
  • dblp
  • and more

Target:

  • fatcat2 entities

For each format, try to find the smallest conversion unit, e.g. one record. Then add convenience layers on top, e.g. for streams.
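The layering described above can be sketched as follows. This is a minimal illustration, not scholkit's actual code: the function names, the JSON-lines input, and the target field names are assumptions made for the example and do not reflect the real fatcat schema.

```python
import json
from typing import Iterable, Iterator


def convert_record(record: dict) -> dict:
    """Smallest unit: map one source record to a simplified target entity.

    The field names on both sides are hypothetical.
    """
    return {
        "title": (record.get("title") or "").strip(),
        "ext_ids": {"arxiv": record.get("id")},
        "release_year": record.get("year"),
    }


def convert_stream(lines: Iterable[str]) -> Iterator[dict]:
    """Convenience layer: convert a stream of JSON lines, one record per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield convert_record(json.loads(line))
```

Keeping the per-record function free of I/O makes it easy to test in isolation; the stream layer only handles line splitting and decoding.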

No bulk conversion should take longer than roughly one hour (the slowest currently is openalex, with 250M records, which takes about 45 min).

Clustering

Create a "works" view from releases.
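One very simple way to approximate such a view is to group releases by a normalized title. The sketch below is only an illustration under that assumption; it is not the clustering used (or planned) in scholkit, and the "title" field and function names are hypothetical.

```python
import re


def title_key(title: str) -> str:
    """Lowercase the title and drop everything that is not a letter or digit."""
    return re.sub(r"[\W_]+", "", title.lower())


def cluster_releases(releases):
    """Group releases into works by normalized title, in first-seen order.

    Releases without a usable title each end up in their own work.
    """
    groups = {}
    works = []
    for release in releases:
        key = title_key(release.get("title") or "")
        if not key:
            works.append([release])
            continue
        if key not in groups:
            groups[key] = []
            works.append(groups[key])
        groups[key].append(release)
    return works
```

Title-only matching will merge unrelated records that share a generic title, so a real implementation would likely also compare authors, years, or identifiers.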

Misc

The sk-cat utility streams content from multiple URLs to stdout. It can help create single-file versions of larger datasets such as pubmed or openalex.

$ curl -s "https://www.gutenberg.org/browse/scores/top" | \
    grep -Eo "/ebooks/[0-9]+" | \
    awk '{print "https://gutenberg.org"$0".txt.utf-8"}' > top100.txt

$ sk-cat < top100.txt > top100books.txt
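The idea behind this kind of tool can be sketched in a few lines. This is not how sk-cat is implemented; it is a minimal sequential version with no retries or error handling, and the cat_urls name and its opener parameter are assumptions introduced for the example.

```python
import shutil
import urllib.request


def cat_urls(urls, out, opener=urllib.request.urlopen):
    """Copy the body of each URL to the binary stream out, in input order.

    Blank lines are skipped. Returns the number of URLs copied.
    The opener is injectable so the function can be tested without network access.
    """
    count = 0
    for url in urls:
        url = url.strip()
        if not url:
            continue
        with opener(url) as resp:
            shutil.copyfileobj(resp, out)  # stream in chunks, constant memory
        count += 1
    return count
```

To mimic the shell example, one could call cat_urls(sys.stdin, sys.stdout.buffer) after importing sys, reading one URL per line from stdin.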

Notes

TODO

  • implement schema conversions and tests
  • add layer for daily harvests and capturing data on disk
  • cli to interact with the current files on disk
  • cli for basic stats
  • some simplistic index/query structure, e.g. to quickly find a record by id or the like

More:

  • map basic fields to fatcat release entities
  • map all fields to fatcat release entities
  • basic clustering algorithm
