Amazon is interested in using media from Commons, on the order of many millions of images. Enterprise does not currently support text or image dumps for Commons. Evaluate possible technical solutions using existing capabilities (dumps, APIs, etc.).
Most of the following is summarized from Slack discussion. I'm moving it to Phabricator for technical evaluation. Apologies to any of the Slack contributors if I've made errors in summarizing or crediting your contribution to that discussion. I've subscribed you all, for visibility. Feel free to unsubscribe if you like.
Fabio at Amazon (via email):
I’ve made some progress since that email, which I’ll try to summarise here. I’d also be happy to have a call about any of this, and even to explore requests to Wikimedia that would reduce our footprint on your servers to almost zero by using an all-metadata-in-dump solution (I’ve seen there are specific Yahoo dumps, for instance).
Here are my findings so far:
We were planning to use this images API first (ref. https://commons.wikimedia.org/w/api.php?action=help&modules=query%2Ballimages ), but we found two problems with it:
It doesn’t tell us if an image has been removed under a “last run” approach, as it only lists additions
Even maxing out our 5,000 requests per hour at the maximum allowance of 500 images per request, it would take us roughly 24 hours to work out whether anything changed across the whole 60 million images
We then tried to use https://commons.wikimedia.org/w/api.php?action=query&generator=allrevisions&arvdir=newer&arvlimit=50&arvstart=2022-08-05T01:34:56Z to get every revision since a “last run”, but we would have to filter out all the non-image entries from the results.
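For reference, a minimal sketch of the polling approach described above, following the standard action API continuation protocol; the User-Agent string and the start timestamp are placeholders:

```
# Sketch: poll list=allimages for files uploaded since a given timestamp,
# following API continuation. User-Agent and timestamp are placeholders.
import requests

API = "https://commons.wikimedia.org/w/api.php"
SESSION = requests.Session()
SESSION.headers["User-Agent"] = "ExampleMediaIngest/0.1 (ops@example.com)"

def new_images_since(timestamp):
    """Yield images uploaded at or after `timestamp` (ISO 8601)."""
    params = {
        "action": "query",
        "list": "allimages",
        "aisort": "timestamp",
        "aistart": timestamp,
        "aiprop": "timestamp|url|sha1",
        "ailimit": "max",            # up to 500 per request for normal clients
        "format": "json",
    }
    while True:
        data = SESSION.get(API, params=params, timeout=30).json()
        yield from data["query"]["allimages"]
        if "continue" not in data:
            break
        params.update(data["continue"])   # carry the aicontinue token forward

for img in new_images_since("2022-08-05T00:00:00Z"):
    print(img["timestamp"], img["name"])
```

As the email notes, this listing only surfaces additions; deletions and license changes never appear in it.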
Then we started analysing dumps and found that Wikimedia publishes a dump of Commons twice a month.
This is the dump we are targeting https://dumps.wikimedia.org/commonswiki/
Of all the dumps and SQL files I’ve analysed, it would be quite tricky to load 150 GB of SQL just to run some queries and extract a delta of all the images that changed, so we decided to target XML partials like commonswiki-20220801-pages-articles-multistream1.xml-p1p1500000.bz2.
I couldn’t find any “File:…”-only dump, which I would have been very keen on, so our approach will be to parse each partial for File pages and extract some metadata there, but the list of metadata we need goes well beyond that.
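A rough sketch of what that File-page filtering could look like with only the Python standard library; the filename is the partial named above, and the XML schema version in the namespace URI may differ between dump runs:

```
# Sketch: stream a pages-articles partial and keep only File: (namespace 6)
# pages, yielding title and raw wikitext. Standard library only.
import bz2
import xml.etree.ElementTree as ET

DUMP = "commonswiki-20220801-pages-articles-multistream1.xml-p1p1500000.bz2"
NS = "{http://www.mediawiki.org/xml/export-0.10/}"   # schema version may differ

def file_pages(path):
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag != NS + "page":
                continue
            if elem.findtext(NS + "ns") == "6":                  # File: namespace
                title = elem.findtext(NS + "title")
                text = elem.findtext(f"{NS}revision/{NS}text") or ""
                yield title, text
            elem.clear()                                         # keep memory bounded

for title, text in file_pages(DUMP):
    print(title, len(text))
```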
The closest thing I’ve found in your planning was this https://phabricator.wikimedia.org/T240520, but still not the ideal format.
Once we have fetched all the images, we can detect what changed by looking at the <revision> content and checking for both
=={{int:license-header}}== {{PD-Art|PD-old-100-expired}}
and
|Permission = {{Cc-by-1.0|[[User:Andre Engels|Andre Engels]]}}
We would then use the image for metadata acquisition.
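A rough illustration of that wikitext check as a plain regular-expression scan; the template names are just the two from the snippets above, whereas real Commons files use many more license templates:

```
# Illustration only: scan revision wikitext for the two license templates
# quoted above. Commons uses many more license templates than these.
import re

LICENSE_TEMPLATES = re.compile(
    r"\{\{\s*(PD-Art|PD-old-100-expired|Cc-by-1\.0)\b", re.IGNORECASE
)

def has_expected_license(wikitext: str) -> bool:
    """True if the wikitext mentions one of the illustrative templates."""
    return bool(LICENSE_TEMPLATES.search(wikitext))

sample = "=={{int:license-header}}==\n{{PD-Art|PD-old-100-expired}}"
print(has_expected_license(sample))   # True
```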
Our ideal result would be everything included on https://commons.wikimedia.org/wiki/File:President_Barack_Obama.jpg , that is: file information, base metadata, structured data, the “prominent” items, and the categories and pages the image is linked in, with all the entities in Wikidata ID format. As I wasn’t able to find anything similar, for each image we will have to combine several API results, specifically:
Calling the API to extract the page ID for File:President_Barack_Obama.jpg -> https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&titles=File:President_Barack_Obama.jpg&iiprop=dimensions%7Cmime%7Cextmetadata%7Curl
Using the page ID prefixed with “M” to call https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=wbgetentities&format=json&ids=M23956389
Extracting all the Wikidata IDs (Qxxxx) and the preferred information from the image
Understanding where the image is linked and used, using https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=wbgetentities&format=json&ids=M23956389
Combining every page returned in a new call to extract the Wikidata IDs of those pages, using https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&titles=President%20of%20the%20United%20States%7CBarack%20Obama%7CCategory:Barack%20Obama
So to extract most of https://commons.wikimedia.org/wiki/File:President_Barack_Obama.jpg (but not everything), we will have to call 4 APIs per file, and doing that for 60 million images could end up being quite intense on your infrastructure.
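A condensed sketch of the first two calls in that per-file workflow (imageinfo for the page ID and file metadata, then wbgetentities on the M-id for structured data); batching, error handling and the usage/pageprops calls are left out, and the User-Agent is a placeholder:

```
# Sketch of the first two calls of the per-file workflow described above.
import requests

API = "https://commons.wikimedia.org/w/api.php"
SESSION = requests.Session()
SESSION.headers["User-Agent"] = "ExampleMediaIngest/0.1 (ops@example.com)"

def file_metadata(title):
    # 1. imageinfo: page ID, dimensions, mime, extmetadata, url
    info = SESSION.get(API, params={
        "action": "query",
        "prop": "imageinfo",
        "titles": title,
        "iiprop": "dimensions|mime|extmetadata|url",
        "format": "json",
    }, timeout=30).json()
    page = next(iter(info["query"]["pages"].values()))
    mid = f"M{page['pageid']}"

    # 2. wbgetentities on the MediaInfo entity ("M" + page ID): structured data
    entity = SESSION.get(API, params={
        "action": "wbgetentities",
        "ids": mid,
        "format": "json",
    }, timeout=30).json()["entities"][mid]

    return page["imageinfo"][0], entity

imageinfo, structured = file_metadata("File:President_Barack_Obama.jpg")
print(imageinfo["mime"], list(structured.get("statements") or {}))
```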
With that analysis in mind, here are my questions:
As we are planning to ingest images with the highest level of “freshness”, we were going to use the API approach. We have an API to get every image and then, after the first run, every new image from a timestamp. Is there an API, or a combination of parameters, that would give us every new revision change for images only, so that we can drastically reduce the number of “continue” calls needed to fetch every image’s revision diff?
Aiming to reduce our footprint on your service (and therefore its cost) as much as possible, is there any API I haven’t yet found that, for a given image file, returns everything (url, extmetadata, dimensions, mime, structured data in Wikidata ID format, description, and usage of the image; basically what we can find grouped here https://commons.wikimedia.org/wiki/File:President_Barack_Obama.jpg) so that we can get every detail of an image, or a batch of images, with a single call?
Is there a data dump that combines questions 1 and 2, so that we can get a snapshot of every single image and all the metadata we need, without the noise of every other non-image page? This would specifically help us with the first bulk ingestion, as it would probably replace 60+ million calls to your service with a single data extraction; we could then work only with the delta, using the answer to my question #2.
Thanks for any information; as I said, I’d be very happy to have a live session anytime.
Regards,
Fabio
We would like to start ingesting images as soon as the 5th of September. If a meeting before that date is improbable, it would be great to at least understand, using a specific User-Agent for our API calls, what rate we should use so that we don’t overload any API endpoint. Calling https://commons.wikimedia.org/w/api.php, am I right to assume that the suggested rate is 200 requests per second? I found that on https://www.mediawiki.org/wiki/Wikimedia_REST_API#Terms_and_conditions
Thanks,
Fabio
Shari Wakiyama:
Amazon wants to start ingesting images on September 5th.
We want to understand, using a specific User-Agent for our API calls, what rate we should use so that we don’t overload any API endpoint when calling https://commons.wikimedia.org/w/api.php. Am I right to assume that the suggested rate is 200 requests per second? I found that on https://www.mediawiki.org/wiki/Wikimedia_REST_API#Terms_and_conditions.
- Is the best way to collect every image available on Wikimedia (CC BY-SA and PD) to use the commonswiki family data dump (example: https://dumps.wikimedia.org/commonswiki/20220720/ ), or is there a better dump aimed just at images? This would be for a one-off bulk ingestion, to avoid crawling APIs for the whole content and to reduce our footprint on Wikimedia servers.
- Once we are then working on a delta fetch, if using allimages I can get the most recently added images with something like https://commons.wikimedia.org/w/api.php?action=query&list=allimages&aisort=timestamp&aistart=2022-07-29T07:00:00Z&format=json , but wouldn’t this just give me ADD operations? Is there an endpoint for deletions and license changes, so that we can take down any image which changes from CC BY-SA/PD to something else, or any image that has been removed from Commons?
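A sketch of what a politely throttled delta fetch with a descriptive User-Agent might look like; the one-request-per-second pacing and the User-Agent string are illustrative placeholders, not rates agreed with Wikimedia:

```
# Sketch: throttled delta fetch with a descriptive User-Agent. The pacing
# below is an illustrative placeholder, not a confirmed Wikimedia limit.
import time
import requests

API = "https://commons.wikimedia.org/w/api.php"
SESSION = requests.Session()
SESSION.headers["User-Agent"] = "ExampleMediaIngest/0.1 (ops@example.com)"
MIN_INTERVAL = 1.0   # seconds between requests (placeholder)

def throttled_get(params):
    start = time.monotonic()
    resp = SESSION.get(API, params=params, timeout=30)
    resp.raise_for_status()
    elapsed = time.monotonic() - start
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    return resp.json()

data = throttled_get({
    "action": "query",
    "list": "allimages",
    "aisort": "timestamp",
    "aistart": "2022-07-29T07:00:00Z",
    "ailimit": "max",
    "format": "json",
})
print(len(data["query"]["allimages"]), "images in the first batch")
```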
Will Doran:
re: 1, there is no batch dump of images; we don’t have the storage space for it.
re: 2, there is no endpoint that would provide that directly. There are some deletion log entries that are public and can be retrieved via an endpoint, but many are hidden. Ariel suggests they use https://dumps.wikimedia.org/other/mediatitles/, a daily list of all media titles; they could then diff that against what they have.
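A sketch of that diff, assuming two daily title lists have already been downloaded and decompressed to local files (one title per line); the filenames are placeholders:

```
# Sketch: diff two daily media-title lists to find removals and additions.
# Filenames are placeholders for downloaded files from
# https://dumps.wikimedia.org/other/mediatitles/ (one title per line).
def load_titles(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

yesterday = load_titles("commonswiki-20220904-all-media-titles")
today = load_titles("commonswiki-20220905-all-media-titles")

deleted = yesterday - today   # candidates for takedown on the consumer side
added = today - yesterday     # new files to ingest

print(f"{len(deleted)} titles removed, {len(added)} titles added")
```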
They could verify page moves, version changes, etc. using the log events API https://www.mediawiki.org/wiki/API:Logevents , filtering by the appropriate action. Any API that accesses the log table would also work. There is also the page delete stream: https://stream.wikimedia.org/?doc#/streams/get_v2_stream_mediawiki_page_delete
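A sketch of pulling public deletion log entries for the File namespace via list=logevents; as noted above, suppressed or hidden deletions will not appear here, and the timestamp and User-Agent are placeholders:

```
# Sketch: fetch public deletion log entries in the File namespace since a
# timestamp via list=logevents. Hidden/suppressed deletions will not appear.
import requests

API = "https://commons.wikimedia.org/w/api.php"
SESSION = requests.Session()
SESSION.headers["User-Agent"] = "ExampleMediaIngest/0.1 (ops@example.com)"

params = {
    "action": "query",
    "list": "logevents",
    "letype": "delete",
    "lenamespace": "6",                 # File:
    "lestart": "2022-08-05T00:00:00Z",
    "ledir": "newer",
    "lelimit": "max",
    "format": "json",
}

while True:
    data = SESSION.get(API, params=params, timeout=30).json()
    for event in data["query"]["logevents"]:
        print(event["timestamp"], event.get("action"), event["title"])
    if "continue" not in data:
        break
    params.update(data["continue"])
```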
Seddon: There is the imageinfo API https://www.mediawiki.org/wiki/API:Imageinfo . Note, though, that imageinfo is at some point going to be deprecated and replaced: https://www.mediawiki.org/wiki/API/Architecture_work/Planning#Rewrite_prop=imageinfo_from_scratch_as_prop=fileinfo
There is also RDF output for a given file, available via dumps. Example RDF file: https://files.slack.com/files-pri/T012JBDTTHA-F03SL3SBWSG/download/m107361509.rdf?origin_team=T024KLHS4
https://dumps.wikimedia.org/commonswiki/entities/
That these exist is a byproduct of WikibaseMediaInfo
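A heavily hedged sketch of consuming those entity dumps offline: it assumes the mediainfo JSON dump uses the same one-entity-per-line layout as the Wikidata entity dumps, and the filename is a placeholder for whatever file is downloaded from the directory above:

```
# Sketch (assumptions noted above): scan a locally downloaded mediainfo JSON
# entity dump for entities that carry structured-data statements.
import gzip
import json

DUMP = "commons-20220801-mediainfo.json.gz"   # placeholder filename

with gzip.open(DUMP, "rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue                              # skip the array brackets
        entity = json.loads(line)
        statements = entity.get("statements") or {}
        if statements:
            print(entity["id"], sorted(statements))   # e.g. M123 ['P180', ...]
```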
The desire for image dumps is covered well in T298394: Produce regular public dumps of Commons media files. My understanding is that there is currently no endpoint for license changes. The closest you could get would be taking the recent changes event stream and parsing it to look for changes in the templates used, but the fragility of that approach is part of why Enterprise exists.
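An outline of that fragile stream-based approach, using the public recentchange EventStreams feed; deciding whether a license template actually changed would still require fetching and diffing the wikitext of the old and new revisions:

```
# Outline: watch the recentchange EventStreams feed for edits to File: pages
# on Commons. A real license check would still need to diff the wikitext of
# revision["old"] vs revision["new"].
import json
from sseclient import SSEClient as EventSource   # pip install sseclient

STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

for event in EventSource(STREAM):
    if event.event != "message" or not event.data:
        continue
    change = json.loads(event.data)
    if change.get("wiki") != "commonswiki":
        continue
    if change.get("namespace") != 6 or change.get("type") != "edit":
        continue
    rev = change.get("revision", {})
    print(change["title"], rev.get("old"), "->", rev.get("new"))
```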
(BPirkle note: T298394: Produce regular public dumps of Commons media files has interesting context, mostly about why image dumps would be challenging. There does not appear to be anyone actually working on image dumps at this time.)