, ,
/////|
///// |
|~~~| |
|===| |
|j | |
| g | |
| s| /
|===|/
'---'
Scratch project, assorted utilities around scholarly metadata formats and tasks.
status: wip, api and cli not stable yet
$ git clone https://github.com/miku/scholkit.git
$ cd scholkit
$ make
This builds a couple of executables, all starting with the sk prefix. The executables are designed to be as standalone as possible, but they also share configuration for various tasks (e.g. directories).
$ curl -sL https://archive.org/download/arxiv-2024-02-15/arxiv-2024-02-15.xml.zst | \
zstd -dc | \
    sk-convert -f arxiv

We want conversions from various source formats into one single target format (e.g. release entities). Source formats include:
- crossref
- datacite
- pubmed
- arxiv
- oaiscrape
- openalex
- dblp
- and more
Target:
- fatcat2 entities
For each format, try to find the smallest conversion unit, e.g. one record. Then add convenience layers on top, e.g. for streams.
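A minimal sketch of that layering in Go, with hypothetical names (ArxivRecord, Release, ConvertArxivRecord are illustrations, not the actual scholkit API): the single-record function is the smallest unit, and a batch helper sits on top of it.

// Sketch only: names and fields are illustrative, not the actual scholkit API.
package convert

import (
    "encoding/json"
    "io"
)

// Release is a placeholder for a fatcat-style release entity.
type Release struct {
    Title string   `json:"title"`
    DOI   string   `json:"doi,omitempty"`
    IDs   []string `json:"ids,omitempty"`
}

// ArxivRecord is a placeholder for one parsed arxiv metadata record.
type ArxivRecord struct {
    ID    string
    Title string
    DOI   string
}

// ConvertArxivRecord is the smallest conversion unit: one record in, one release out.
func ConvertArxivRecord(rec ArxivRecord) (Release, error) {
    return Release{
        Title: rec.Title,
        DOI:   rec.DOI,
        IDs:   []string{"arxiv:" + rec.ID},
    }, nil
}

// ConvertAll is a convenience layer on top of the single-record unit: it
// converts already parsed records and writes releases as JSON lines to w.
func ConvertAll(recs []ArxivRecord, w io.Writer) error {
    enc := json.NewEncoder(w)
    for _, rec := range recs {
        release, err := ConvertArxivRecord(rec)
        if err != nil {
            return err
        }
        if err := enc.Encode(&release); err != nil {
            return err
        }
    }
    return nil
}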
No bulk conversion should take longer than roughly one hour (the slowest currently is openalex - 250M records - which takes about 45 minutes).
Create a "works" view from releases.
The sk-cat utility streams content from multiple URLs to stdout. It can help
to create single-file versions of larger datasets like pubmed, openalex, etc.
$ curl -s "https://www.gutenberg.org/browse/scores/top" | \
grep -Eo "/ebooks/[0-9]+" | \
awk '{print "https://gutenberg.org"$0".txt.utf-8"}' > top100.txt
$ sk-cat < top100.txt > top100books.txt
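A rough, sequential sketch of the sk-cat idea (not the actual implementation, which may fetch concurrently, retry, or decompress on the fly): read one URL per line from stdin and copy each response body to stdout.

// Sketch only: a sequential version of the sk-cat idea.
package main

import (
    "bufio"
    "fmt"
    "io"
    "net/http"
    "os"
)

func main() {
    scanner := bufio.NewScanner(os.Stdin)
    for scanner.Scan() {
        url := scanner.Text()
        if url == "" {
            continue
        }
        resp, err := http.Get(url)
        if err != nil {
            fmt.Fprintf(os.Stderr, "fetch %s: %v\n", url, err)
            continue
        }
        if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
            fmt.Fprintf(os.Stderr, "copy %s: %v\n", url, err)
        }
        resp.Body.Close()
    }
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}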
- implement schema conversions and tests
- add layer for daily harvests and capturing data on disk
- cli to interact with the current files on disk
- cli for basic stats
- some simplistic index/query structure, e.g. to quickly find a record by id or the like (see the sketch below)
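For the index/query item, a minimal sketch under assumptions (newline-delimited JSON files with an "id" field; the on-disk layout is just an illustration): map record ids to byte offsets, so a lookup can seek straight to the line.

// Sketch only: an in-memory id -> byte offset index over a JSON-lines file.
package index

import (
    "bufio"
    "encoding/json"
    "io"
    "os"
)

// Build scans a JSON-lines file and records the starting offset of each id.
func Build(path string) (map[string]int64, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    idx := make(map[string]int64)
    var offset int64
    br := bufio.NewReader(f)
    for {
        line, err := br.ReadBytes('\n')
        if len(line) > 0 {
            var doc struct {
                ID string `json:"id"`
            }
            if jerr := json.Unmarshal(line, &doc); jerr == nil && doc.ID != "" {
                idx[doc.ID] = offset
            }
            offset += int64(len(line))
        }
        if err == io.EOF {
            return idx, nil
        }
        if err != nil {
            return nil, err
        }
    }
}

A persistent variant could write the map to a sidecar file next to the data.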
More:
- map basic fields to fatcat release entities
- map all fields to fatcat release entities
- basic clustering algorithm