
Evaluate technical solutions for Amazon use of Commons media
Closed, ResolvedPublic

Description

Amazon is interested in using media from Commons, on the order of many millions of images. Wikimedia Enterprise does not currently support text or image dumps for Commons. Evaluate possible technical solutions using existing capabilities (dumps, APIs, etc.).

Most of the following is summarized from Slack discussion. I'm moving it to Phabricator for technical evaluation. Apologies to any of the Slack contributors if I've made errors in summarizing or crediting your contribution to that discussion. I've subscribed you all, for visibility. Feel free to unsubscribe if you like.


Fabio at Amazon (via email):

I've made some progress since my last email, which I'll try to summarise here. I'd also be happy to have a call about any of this, and even to explore requests to Wikimedia that would decrease our footprint on your servers almost to zero using an all-metadata-in-dump solution (I've seen there are specific Yahoo dumps, for instance).

Here are my findings so far:
We were planning to use the allimages API first (ref. https://commons.wikimedia.org/w/api.php?action=help&modules=query%2Ballimages), but we found two problems with it:
It doesn't tell us whether an image has been removed when using a "last run" approach, as it only reports additions
We would have to max out our 5,000 requests per hour at the maximum allowance of 500 images per request; even then it would take about 24 hours to work out whether anything changed across the whole ~60 million images

We then tried to use https://commons.wikimedia.org/w/api.php?action=query&generator=allrevisions&arvdir=newer&arvlimit=50&arvstart=2022-08-05T01:34:56Z to get every revision since a "last run", but we would have to filter out all the non-image entries from it.
Then we started analysing dumps and found that Wikimedia publishes a dump of Commons twice a month.
This is the dump we are targeting https://dumps.wikimedia.org/commonswiki/
Of all the dumps and SQL files I've analysed, it would be quite tricky to load 150 GB of SQL just to run some queries and extract a delta of all the images that changed, so we decided to target the XML partials such as commonswiki-20220801-pages-articles-multistream1.xml-p1p1500000.bz2

I couldn't find any "File:…"-only dump, which I would have been very keen on, so our approach will be to parse each partial for File pages and extract some metadata there; however, the list of fields we need goes well beyond that.

The closest thing I've found in your planning was https://phabricator.wikimedia.org/T240520, but it's still not the ideal format.

Once we have fetched all the images, we can detect what changed using the <revision> element and by checking both

=={{int:license-header}}==
{{PD-Art|PD-old-100-expired}}

and

|Permission = {{Cc-by-1.0|[[User:Andre Engels|Andre Engels]]}}

We would then use the image for metadata acquisition.
Our perfect result would be everything that is included at https://commons.wikimedia.org/wiki/File:President_Barack_Obama.jpg: the file information, base metadata, structured data, what is prominent, and the categories plus pages the image is linked in, with all the entities in WikiId format. As I wasn't able to find anything similar, for each image we will have to combine several API results, specifically:

Calling the API to extract the page ID from File:President_Barack_Obama.jpg -> https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&titles=File:President_Barack_Obama.jpg&iiprop=dimensions%7Cmime%7Cextmetadata%7Curl
Using the page ID, prefixed with an M, to call https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=wbgetentities&format=json&ids=M23956389
Extracting all the WikiIds (Qxxxx) from the image and the preferred information
Understanding where the image is linked and used, using https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=wbgetentities&format=json&ids=M23956389
Combining every page ID returned into a new call to extract the WikiIds of those pages, using https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&titles=President%20of%20the%20United%20States%7CBarack%20Obama%7CCategory:Barack%20Obama

So to extract most of https://commons.wikimedia.org/wiki/File:President_Barack_Obama.jpg (but not everything), we will have to call four APIs per file, and doing that for 60 million images could be quite intense on your infrastructure.
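A minimal sketch of that four-call sequence in Python (using the requests library), for a single file. The User-Agent value is a placeholder, and step 3 here uses prop=globalusage for usage information since the URL pasted twice above appears to repeat step 2; treat this as an illustration rather than a recommended client:

  import requests

  COMMONS_API = "https://commons.wikimedia.org/w/api.php"
  ENWIKI_API = "https://en.wikipedia.org/w/api.php"
  HEADERS = {"User-Agent": "example-commons-ingest/0.1 (contact@example.com)"}  # placeholder
  title = "File:President_Barack_Obama.jpg"

  # 1. imageinfo: page ID, dimensions, mime, extmetadata, url
  r1 = requests.get(COMMONS_API, headers=HEADERS, params={
      "action": "query", "format": "json", "prop": "imageinfo", "titles": title,
      "iiprop": "dimensions|mime|extmetadata|url"}).json()
  page = next(iter(r1["query"]["pages"].values()))
  page_id = page["pageid"]

  # 2. Structured data (MediaInfo) entity for M<page id>
  r2 = requests.get(COMMONS_API, headers=HEADERS, params={
      "action": "wbgetentities", "format": "json", "ids": "M%d" % page_id}).json()

  # 3. Where the file is used on other wikis
  r3 = requests.get(COMMONS_API, headers=HEADERS, params={
      "action": "query", "format": "json", "prop": "globalusage",
      "titles": title, "gulimit": 500}).json()

  # 4. Map the pages using it on enwiki to their Wikidata IDs via pageprops
  used_on = [u["title"] for p in r3["query"]["pages"].values()
             for u in p.get("globalusage", []) if u["wiki"] == "en.wikipedia.org"]
  if used_on:
      r4 = requests.get(ENWIKI_API, headers=HEADERS, params={
          "action": "query", "format": "json", "prop": "pageprops",
          "titles": "|".join(used_on[:50]), "ppprop": "wikibase_item"}).json()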
With that analysis in mind, here are my questions:

As we are planning to ingest images with the highest level of "freshness", we intend to use the API approach. We have an API to get every image and then, after the first run, every new image from a timestamp. Is there an API, or a combination of parameters, that would give us every new revision change for images only, so that we can drastically reduce the number of "continue" calls needed to fetch the revision diff for every image?

In an effort to reduce our footprint on your service (and therefore its cost) as much as possible: is there any API I haven't yet found that, for a given image file, returns all the information such as url, extmetadata, dimensions, mime, structured data in WikiId format and description, and usage of the image (basically what is grouped at https://commons.wikimedia.org/wiki/File:President_Barack_Obama.jpg), so that we can get every detail of an image, or a batch of images, with a single call?

Is there a data dump that combines questions 1 and 2, so that we can get a snapshot of every single image and all the metadata we need, without the noise of every non-image page? This would specifically help us with the first bulk ingestion, as a single data extraction would probably avoid 60+ million calls to your service; we could then work only with the delta, using the answer to my question #2.

Thanks for any information, as I said I’d be very happy to have a live session anytime too.
Regards,
Fabio


We would like to start ingesting images as soon as the 5th of September. If a meeting before that date is improbable, it would be great to at least understand, using a specific User-Agent for our API calls, what rate we should use so that we don't overload any API endpoint. Calling https://commons.wikimedia.org/w/api.php, am I right to assume that the suggested rate is 200 requests per second? I found that figure at https://www.mediawiki.org/wiki/Wikimedia_REST_API#Terms_and_conditions.

Thanks,
Fabio


Shari Wakiyama:

Amazon wants to start ingesting images on September 5th.

We want to understand, using a specific User-Agent for our API calls, what rate we should use so that we don't overload any API endpoint when calling https://commons.wikimedia.org/w/api.php. Are we right to assume that the suggested rate is 200 requests per second? That figure comes from https://www.mediawiki.org/wiki/Wikimedia_REST_API#Terms_and_conditions.

  1. Is the best way to collect every image available on Wikimedia (CC BY-SA and PD) to use the commonswiki data dump (for example https://dumps.wikimedia.org/commonswiki/20220720/), or is there a better dump aimed just at images? This would be for a one-off bulk ingestion, to avoid crawling the APIs for the whole content and to reduce our footprint on Wikimedia servers.
  2. Once we are working on a delta fetch using allimages, I can get the most recently added images with something like https://commons.wikimedia.org/w/api.php?action=query&list=allimages&aisort=timestamp&aistart=2022-07-29T07:00:00Z&format=json, but wouldn't this just give me ADD operations? Is there an endpoint for deletes and license changes, so that we can take down any image that changes from CC BY-SA/PD to something else, or an image that has been removed from Commons?

Will Doran:

re: 1, there is no batch dump of images; we don't have the storage space for it.

re: 2, there is no endpoint that would provide that directly. There are some deletion log entries that are public and can be fetched via the API, but many are hidden. Ariel suggests they use
https://dumps.wikimedia.org/other/mediatitles/, a daily list of all media titles; they could then diff that against what they have.
They could verify page moves, version changes, etc. using the log events API (https://www.mediawiki.org/wiki/API:Logevents), filtering by the appropriate action. Any API that accesses the log table would also work. There is also the page delete stream: https://stream.wikimedia.org/?doc#/streams/get_v2_stream_mediawiki_page_delete
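As an illustration of the mediatitles diff idea, a rough Python sketch; the gzipped one-title-per-line layout and the file names are assumptions about the current dump format, so check https://dumps.wikimedia.org/other/mediatitles/ for the actual structure:

  import gzip

  def load_titles(path):
      # One media title per line, gzipped (assumed layout of the mediatitles dump)
      with gzip.open(path, "rt", encoding="utf-8") as f:
          return {line.strip() for line in f if line.strip()}

  yesterday = load_titles("commonswiki-20220904-all-media-titles.gz")
  today = load_titles("commonswiki-20220905-all-media-titles.gz")

  removed = yesterday - today   # deleted or renamed away since yesterday
  added = today - yesterday     # new uploads or renamed-in titles

  # "removed" titles can then be cross-checked against the delete/move logs,
  # e.g. action=query&list=logevents&letype=delete (or letype=move).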


Seddon: There is the image info API https://www.mediawiki.org/wiki/API:Imageinfo . A note, though, that imageinfo is at some point going to be deprecated and replaced: https://www.mediawiki.org/wiki/API/Architecture_work/Planning#Rewrite_prop=imageinfo_from_scratch_as_prop=fileinfo

There is also the RDF output for a given file, available via dumps. Example RDF file: https://files.slack.com/files-pri/T012JBDTTHA-F03SL3SBWSG/download/m107361509.rdf?origin_team=T024KLHS4

https://dumps.wikimedia.org/commonswiki/entities/
That these exist is a byproduct of WikibaseMediaInfo

The desire for image dumps is covered well in T298394: Produce regular public dumps of Commons media files. My understanding is that there is currently no existing endpoint for license changes. The closest you could get would be to take a recent changes event stream and parse it, looking for changes in the templates used. But that fragility is part of why Enterprise exists.
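For what it's worth, a rough sketch of what consuming a recent changes stream for this could look like; the URL is the public EventStreams recentchange feed, the User-Agent is a placeholder, and the actual license check is left as a comment because, as noted above, any template-based detection is fragile:

  import json, requests

  STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"
  HEADERS = {"User-Agent": "example-commons-watch/0.1 (contact@example.com)"}  # placeholder

  with requests.get(STREAM, headers=HEADERS, stream=True) as resp:
      for line in resp.iter_lines(decode_unicode=True):
          if not line or not line.startswith("data: "):
              continue  # skip SSE keepalives and non-data fields
          event = json.loads(line[len("data: "):])
          if event.get("wiki") != "commonswiki" or event.get("namespace") != 6:
              continue  # only changes to Commons File: pages
          # A real consumer would fetch the revision diff here and check whether
          # license templates (e.g. {{Cc-by-sa-4.0}}, {{PD-...}}) were added or removed.
          print(event["title"], event.get("type"), event.get("revision"))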

(BPirkle note: T298394: Produce regular public dumps of Commons media files has interesting context info, mostly about why image dumps would be challenging. There does not appear to be anyone actually working on image dumps at this time)

Event Timeline

The Structured Data on Commons x Alexa call notes also have useful background on use cases: https://docs.google.com/document/d/1Jv96p0PAELm7frkP44C96h6gLq3HZVhvKHdq7RmHQKU/edit

Alexa has two aims:
Attribute metadata to images that will feed into search
Build structured connections between entities

It doesn't tell us whether an image has been removed when using a "last run" approach, as it only reports additions

You can use https://commons.wikimedia.org/w/api.php?action=help&modules=query%2Bfilearchive to figure out which files have been deleted.
It's going to be a little awkward because it doesn't allow filtering by timestamp.
You could, however, use &fadir=descending to sort on most recent changes, and look at the returned timestamp information; this would allow you to stop querying after you've started to find pages from before your previous run.
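A rough sketch of that stop-when-older-than-the-previous-run loop, using the filearchive parameters from the module linked above (the User-Agent and timestamp are placeholders):

  import requests

  API = "https://commons.wikimedia.org/w/api.php"
  HEADERS = {"User-Agent": "example-commons-ingest/0.1 (contact@example.com)"}  # placeholder
  LAST_RUN = "2022-09-01T00:00:00Z"

  params = {"action": "query", "format": "json", "list": "filearchive",
            "falimit": 500, "fadir": "descending", "faprop": "sha1|timestamp"}
  deleted = []
  while True:
      data = requests.get(API, headers=HEADERS, params=params).json()
      batch = data["query"]["filearchive"]
      deleted.extend(fa["name"] for fa in batch if fa["timestamp"] >= LAST_RUN)
      if not batch or batch[-1]["timestamp"] < LAST_RUN or "continue" not in data:
          break  # everything past this point predates the previous run
      params.update(data["continue"])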

We would have to max out our 5,000 requests per hour at the maximum allowance of 500 images per request; even then it would take about 24 hours to work out whether anything changed across the whole ~60 million images

Initial ingestion would take a while. IIRC, there are about 86M files, which would take about 35 hours if you're able to enumerate 2.5M (500 * 5000) per hour.
After the initial run, you could narrow things down with &aisort=timestamp&aistart=20220901000000 (where the timestamp is whenever you last ran) to get additions.
Deletions could be found as described above. I'm not sure whether moves/renames show up in allimages, so you might need to check allrevisions (which already appears to be the plan anyway) or consult other sources as suggested by Will Doran & Seddon.
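For example, a minimal "additions since last run" query along those lines (placeholder User-Agent and timestamp, with standard continuation handling):

  import requests

  API = "https://commons.wikimedia.org/w/api.php"
  HEADERS = {"User-Agent": "example-commons-ingest/0.1 (contact@example.com)"}  # placeholder

  params = {"action": "query", "format": "json", "list": "allimages",
            "ailimit": 500, "aisort": "timestamp", "aistart": "2022-09-01T00:00:00Z",
            "aiprop": "timestamp|url|sha1|mime"}
  while True:
      data = requests.get(API, headers=HEADERS, params=params).json()
      for img in data["query"]["allimages"]:
          print(img["name"], img["timestamp"])  # files uploaded since aistart
      if "continue" not in data:
          break
      params.update(data["continue"])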

We then tried to use https://commons.wikimedia.org/w/api.php?action=query&generator=allrevisions&arvdir=newer&arvlimit=50&arvstart=2022-08-05T01:34:56Z to get every revision since a "last run", but we would have to filter out all the non-image entries from it.

Also include &arvnamespace=6 (this limits results to pages within a given namespace; 6 = the File namespace), so you can query these more efficiently, without the non-file results.

Also note: from the pasted query, you appear to be using allrevisions as a generator (generator=allrevisions).
Most modules can be used in both list (list=allrevisions) and generator (generator=allrevisions) mode: the latter allows you to immediately request props for the results, all in the same call and the same response. Compared to list mode, it loses the module-specific information, though (in this case: the list of revisions).
When used as a generator, all params must be prefixed with a g. &arvdir=newer&arvlimit=50&arvstart=2022-08-05T01:34:56Z should become &garvdir=newer&garvlimit=50&garvstart=2022-08-05T01:34:56Z
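Putting those pieces together, a small sketch of the namespace-filtered delta query, with allrevisions as a generator and its parameters g-prefixed (User-Agent and start timestamp are placeholders):

  import requests

  API = "https://commons.wikimedia.org/w/api.php"
  HEADERS = {"User-Agent": "example-commons-ingest/0.1 (contact@example.com)"}  # placeholder

  params = {"action": "query", "format": "json", "generator": "allrevisions",
            "garvnamespace": 6,                   # File: namespace only
            "garvdir": "newer", "garvstart": "2022-08-05T01:34:56Z",
            "garvlimit": 50, "prop": "info"}      # any prop(s) can be added here
  while True:
      data = requests.get(API, headers=HEADERS, params=params).json()
      for page in data.get("query", {}).get("pages", {}).values():
          print(page["pageid"], page["title"])    # File: pages changed since garvstart
      if "continue" not in data:
          break
      params.update(data["continue"])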

So to extract most of https://commons.wikimedia.org/wiki/File:President_Barack_Obama.jpg (but not everything), we will have to call four APIs per file, and doing that for 60 million images could be quite intense on your infrastructure.

You could use the globalusage prop (see https://commons.wikimedia.org/w/api.php?action=help&modules=query%2Bglobalusage) along with whatever generator is used to provide the list of files, to retrieve all the information at once. (Note: when using allimages (or any other module) as a generator, don't forget to also prefix all of its arguments with a g, e.g. aisort=timestamp becomes gaisort=timestamp.)
Something like, e.g.: https://commons.wikimedia.org/w/api.php?action=query&generator=allimages&gaisort=timestamp&gaistart=20220901000000&gailimit=500&prop=globalusage&gulimit=500
This will list 500 file pages from after 1 Sept 2022, along with their usage on other wikis, all in one call.
You could include even more props (e.g. imageinfo) at once, like so: https://commons.wikimedia.org/w/api.php?action=query&generator=allimages&gaisort=timestamp&gaistart=20220901000000&gailimit=500&prop=globalusage|imageinfo&gulimit=500
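A sketch of consuming that combined query from code, with continuation handled in one loop (User-Agent and start timestamp are placeholders):

  import requests

  API = "https://commons.wikimedia.org/w/api.php"
  HEADERS = {"User-Agent": "example-commons-ingest/0.1 (contact@example.com)"}  # placeholder

  params = {"action": "query", "format": "json",
            "generator": "allimages", "gaisort": "timestamp",
            "gaistart": "20220901000000", "gailimit": 500,
            "prop": "globalusage|imageinfo", "gulimit": 500,
            "iiprop": "url|mime|dimensions"}
  while True:
      data = requests.get(API, headers=HEADERS, params=params).json()
      for page in data.get("query", {}).get("pages", {}).values():
          usage = [u["wiki"] + ":" + u["title"] for u in page.get("globalusage", [])]
          info = (page.get("imageinfo") or [{}])[0]
          print(page["title"], info.get("url"), usage)
      if "continue" not in data:
          break
      params.update(data["continue"])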

Is there an API, or a combination of parameters, that would give us every new revision change for images only, so that we can drastically reduce the number of "continue" calls needed to fetch the revision diff for every image?

I think this was already answered above. You can use &arvnamespace=6 (or &garvnamespace=6 when used as generator) to narrow things down to only files.

Is there any API I haven't yet found that, for a given image file, returns all the information such as url, extmetadata, dimensions, mime, structured data in WikiId format and description, and usage of the image (basically what is grouped at https://commons.wikimedia.org/wiki/File:President_Barack_Obama.jpg), so that we can get every detail of an image, or a batch of images, with a single call?

All of the information available via props (&prop=... param) can be combined (e.g. &prop=globalusage|imageinfo), and even be fetched immediately when using modules as generators (as described above). Then you can also add additional arguments for each prop to the query string (e.g. &iiprop=dimensions|url|mime|extmetadata)
For most of the data you described, that would be something like: &prop=globalusage|imageinfo|entityterms&iiprop=dimensions|url|mime|extmetadata&wbetterms=label.
I'm not sure exactly what structured data you'll need; you probably won't be able to access it via props, and will have to call https://commons.wikimedia.org/w/api.php?action=help&modules=wbgetclaims separately instead.
Also note: extmetadata is very expensive. This data is essentially compiled by parsing the page and extracting information from the resulting DOM, based on the structures adopted by our contributor community on Commons. It could take a (very) long time to generate if you need it for many pages at once. Try to avoid it if you can do without. :)
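A sketch of what that combination could look like for a small batch, with the separate wbgetclaims call for full structured data; extmetadata is deliberately left out given the note above, and the User-Agent is a placeholder:

  import requests

  API = "https://commons.wikimedia.org/w/api.php"
  HEADERS = {"User-Agent": "example-commons-ingest/0.1 (contact@example.com)"}  # placeholder

  titles = "File:President_Barack_Obama.jpg"   # up to 50 titles joined with |
  r = requests.get(API, headers=HEADERS, params={
      "action": "query", "format": "json", "titles": titles,
      "prop": "globalusage|imageinfo|entityterms",
      "iiprop": "dimensions|url|mime", "wbetterms": "label"}).json()

  for page in r["query"]["pages"].values():
      # Full structured-data statements still need a separate wbgetclaims call per entity
      claims = requests.get(API, headers=HEADERS, params={
          "action": "wbgetclaims", "format": "json",
          "entity": "M%d" % page["pageid"]}).json()
      labels = page.get("entityterms", {}).get("label", [])
      print(page["title"], labels, sorted(claims.get("claims", {})))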

Is there a data dump that combines questions 1 and 2, so that we can get a snapshot of every single image and all the metadata we need, without the noise of every non-image page?

Perhaps, but none that I'm aware of.


Calling https://commons.wikimedia.org/w/api.php, am I right to assume that the suggested rate is 200 requests per second? I found that figure at https://www.mediawiki.org/wiki/Wikimedia_REST_API#Terms_and_conditions.

We have multiple API endpoints. The page linked to describes the REST API, which is a lot more limited at the moment; you probably won't find much of the data you need there.
Everything described earlier goes via the Action API. I'm not sure exactly what the limits are, but https://api.wikimedia.org/wiki/Documentation/Getting_started/Rate_limits seems to suggest up to 5,000 requests per hour.
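As an illustration only, a tiny throttling helper that stays at roughly 5,000 requests per hour and always sends a descriptive User-Agent; the exact limit and the contact details here are placeholders rather than confirmed values:

  import time, requests

  API = "https://commons.wikimedia.org/w/api.php"
  HEADERS = {"User-Agent": "amazon-commons-ingest/0.1 (contact: someone@example.com)"}  # placeholder
  MIN_INTERVAL = 3600 / 5000   # ~0.72 s between requests at 5,000/hour

  _last_request = 0.0

  def throttled_get(params):
      global _last_request
      wait = MIN_INTERVAL - (time.monotonic() - _last_request)
      if wait > 0:
          time.sleep(wait)
      _last_request = time.monotonic()
      return requests.get(API, headers=HEADERS, params=params, timeout=30)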

Closing this one, with the possibility of reopening it if needed.