Community-defined Translation Collections: definition, storage & recommendation
Closed, Resolved, Public

Description

Originating from the technical exploration in T368713, the scope of this ticket is:

1. Collection definition and storage
a) We will define a simple mechanism to mark a campaign page as part of a "translation campaign". We will not retrospectively support any of the existing campaign pages.
b) These are normal wiki pages and can be edited as usual. The only difference is that the page carries a 'marker' that identifies it as a translation campaign source. More information on the marker is in T373132.
c) Document the above process.

2. Candidate recommendation
a) If the API request includes an includes-campaign param, the recommendation API will look for all pages with the above-defined 'marker' using an appropriate CirrusSearch query (Help:CirrusSearch#Hastemplate).
b) Three-stage processing (fetch, filter, rank):
-> For these pages, fetch all the candidate articles listed as article titles and QIDs. That article list, together with the source language information, forms the featured list candidates.
-> Note that these articles need to go through the filtering stage to find all translatable articles, and then the ranking stage to sort and serve them according to relevance.
c) Cache the pages we find with that 'marker' with a reasonable cache eviction policy.
-> Cache the article lists too.
-> Use diskcache for this purpose (recommended).
-> Fill this cache on API server startup (recommended)
d) The API response can contain additional information about where the candidates were sourced from. This may not be useful for the UI at this point, but I expect it will help the development and debugging workflow.
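
As a rough illustration of step 2a, the lookup can be expressed as a `list=search` request with a `hastemplate:` filter. This is only a sketch: the Meta-Wiki endpoint and the template name `Translation campaign` are assumptions (the actual marker is defined in T373132), and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: campaign pages live on Meta-Wiki and transclude a marker template.
# The template name is a placeholder; the real marker is defined in T373132.
API_ENDPOINT = "https://meta.wikimedia.org/w/api.php"
MARKER_TEMPLATE = "Translation campaign"


def build_campaign_search_params(template=MARKER_TEMPLATE, limit=50, offset=0):
    """Return action API parameters that list pages transcluding `template`."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": f'hastemplate:"{template}"',
        "srnamespace": "*",  # campaign pages may sit in any namespace
        "srlimit": limit,
        "sroffset": offset,
        "format": "json",
        "formatversion": 2,
    }


def build_campaign_search_url(**kwargs):
    """Return the full request URL for the campaign page search."""
    return f"{API_ENDPOINT}?{urlencode(build_campaign_search_params(**kwargs))}"


def titles_from_search_response(response):
    """Extract page titles from a decoded list=search JSON response."""
    return [hit["title"] for hit in response.get("query", {}).get("search", [])]
```

The functions only build the request and read a decoded response, so the HTTP client and error handling are left to whatever the API server already uses.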

Test Scenarios

TBD

Acceptance Criteria

TBD

Event Timeline

@santhosh just following up on your summary in T368713#10030506 but moving discussion here. All of what's outlined here generally makes sense to me for a start. I guess the design research question is whether to ask organizers to add specific interwiki links to source articles or just the generic Wikidata IDs for the items they see fit for translation. I'm going to use a recent worklist from Wiki Loves Sports as an example because it conveniently seems very close to what you're envisioning (screenshot below). They have a table of articles that includes both the various individual interwiki links (those green cells with + in them are actually links to the language version) and the Wikidata items for the articles of interest.

Screenshot 2024-08-01 at 4.48.13 PM.png (1×1 px, 373 KB)

I could see this going a few ways:

So I think with this approach you can mostly reproduce the existing functionality of the recommendation API. Differences:

  • It'll take several more API calls because we can't use just the simple Search generator that we're using right now. This means a bit greater latency probably but I suspect within reason.
  • We do lose the ability to allow custom filtering of the list to articles that are similar to an example article (the morelike functionality of Search) but maybe that's okay for now.
  • Search API also operates at 500 candidates at a time, which is useful if you expect a lot of filtering (e.g., a lot of the articles already exist). I presume for most cases, operating in sets of 50 like the above would be okay because the UI only needs like 10 articles at a time to show to users.
  • The one case where we would have problems with working with 50 links at a time is if organizers are putting together really large lists -- e.g., using a SPARQL query to generate a massive table of all women scientists or something like that (example). In those cases, we have a few issues:
    • We probably don't want to always just surface the first 50 links that show up in the initial iwlinks API query. The cache could perhaps help with that but something to think about for how we pull results from the cache/APIs in a less deterministic way.
    • Folks presumably will want to apply topic filters -- e.g., women scientists campaign + a specific country. Depending on the country, this filter might only find results once every 100 articles and so applying it post-hoc would require a lot of querying to find enough results to return. We don't have this problem when we're using the Search API natively because the filter is being applied on the Search backend and so you're only being returned results that match that topic. The only thing I can think here is that if you're using caches, you might want to run the whole process when filling the cache and collect all the topics for each page too in the cache so it's faster to iterate through really large sets to find matching articles?
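
A minimal sketch of the last two points, assuming the cache already holds every candidate of a campaign together with its topics: filter on the cached topics first, then shuffle so that the same first 50 links are not always served. The record shape (`title`, `topics`) and the function name are hypothetical, not the recommendation API's actual data model.

```python
import random


def sample_matching_candidates(cached_candidates, wanted_topics, count=10, rng=None):
    """Pick up to `count` cached candidates that match any wanted topic.

    `cached_candidates` is a list of dicts such as
    {"title": "Marie Curie", "topics": ["women", "physics"]} (assumed shape).
    An empty `wanted_topics` means "no topic filter".
    """
    rng = rng or random.Random()
    wanted = set(wanted_topics)
    matching = [
        candidate
        for candidate in cached_candidates
        if not wanted or wanted & set(candidate["topics"])
    ]
    # Shuffle a copy of the matches so repeated calls surface different articles.
    rng.shuffle(matching)
    return matching[:count]
```

Because the topic filter runs over the cached records, a rare topic costs one pass over the cache rather than many rounds of API queries.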

@Isaac As a real example to work on for the first iteration, I created the simplest version of a campaign at https://meta.wikimedia.org/wiki/User:Santhosh.thottingal/Essential_Biography. You can see our campaign marker and the list on that page.

We need to give campaign organizers the flexibility to define the list using either QIDs or links. For example,

Both [[:en:Aristotle|Aristotle]] and [[d:Q868|Aristotle]] should be accepted.
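
To make the two accepted forms concrete, here is a small sketch that pulls both kinds of entries out of raw wikitext with a regular expression. It is illustrative only: the function name and output shape are hypothetical, and the language-prefix check is a rough heuristic (it would miss a prefix such as `simple`, for instance). A real implementation would more likely rely on the parsed links, as discussed next.

```python
import re

# Matches [[target]] or [[target|label]], ignoring an optional leading colon.
WIKILINK_RE = re.compile(r"\[\[\s*:?([^\[\]|]+?)\s*(?:\|[^\[\]]*)?\]\]")
QID_RE = re.compile(r"Q[1-9]\d*")
# Heuristic for language prefixes such as "en" or "zh-yue" (assumption).
LANG_PREFIX_RE = re.compile(r"[a-z]{2,3}(?:-[a-z0-9]+)*")


def parse_campaign_entries(wikitext):
    """Return the QID and interlanguage-title entries found in `wikitext`."""
    entries = []
    for match in WIKILINK_RE.finditer(wikitext):
        prefix, sep, rest = match.group(1).partition(":")
        if not sep:
            continue  # plain local link without an interwiki prefix
        prefix, rest = prefix.strip().lower(), rest.strip()
        if prefix == "d" and QID_RE.fullmatch(rest):
            entries.append({"type": "qid", "qid": rest})
        elif LANG_PREFIX_RE.fullmatch(prefix) and rest:
            entries.append({"type": "title", "language": prefix, "title": rest})
    return entries
```

For the example line above, this yields one title entry (`en`, `Aristotle`) and one QID entry (`Q868`).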

So I wrote this notebook illustrating the CirrusSearch query and the article list extraction: https://colab.research.google.com/drive/1F7Tud4i3Y8jm-_fjnDyP9wnV8_jlhxMq#scrollTo=pPpoUzxjgmkj
I used dump html parsing for iwlinks, but https://meta.wikimedia.org/w/api.php?action=query&prop=iwlinks&titles=User%3ASanthosh.thottingal%2FEssential_Biography&format=json&formatversion=2&iwlimit=max&iwprefix=d may be better
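
For reference, a sketch of how the `prop=iwlinks` route could look. The request parameters mirror the URL above; the response shape (a `pages` list whose `iwlinks` entries carry `prefix` and `title`) is what I would expect with `formatversion=2`, but treat it as an assumption to verify against a live response. The helper names are hypothetical.

```python
import re

QID_RE = re.compile(r"Q[1-9]\d*")


def build_iwlinks_params(page_title):
    """Parameters for listing a page's Wikidata ("d:") interwiki links."""
    return {
        "action": "query",
        "prop": "iwlinks",
        "titles": page_title,
        "iwprefix": "d",
        "iwlimit": "max",
        "format": "json",
        "formatversion": 2,
    }


def qids_from_iwlinks_response(response):
    """Collect unique QIDs, in first-seen order, from a decoded response."""
    qids = []
    seen = set()
    for page in response.get("query", {}).get("pages", []):
        for link in page.get("iwlinks", []):
            title = link.get("title", "")
            is_qid = link.get("prefix") == "d" and QID_RE.fullmatch(title)
            if is_qid and title not in seen:
                seen.add(title)
                qids.append(title)
    return qids
```

Long lists would also need to follow the API's `continue` block, which this sketch leaves out.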

My idea is to heavily cache the list of campaign pages and the articles listed in these pages. Also their interlanguage properties. Possibly weighted tags too?
We might need to work out the right cache eviction strategy, but I won't worry much about it since these are "suggestions" and they go through the filters anyway.

I guess heavy caching should address large campaign lists to some extent. But let us see how that goes through iterations and testing.
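
To illustrate the kind of eviction being discussed, here is a minimal time-based cache sketch. It is an in-memory stand-in written for illustration, not the diskcache setup recommended in the task description; with diskcache, the `expire` argument of `Cache.set` would play the role of `ttl_seconds`. The class and method names are hypothetical.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # evict lazily on read
            return default
        return value

    def get_or_load(self, key, loader):
        """Return the cached value, calling `loader()` to refill a missing or expired entry."""
        sentinel = object()
        value = self.get(key, sentinel)
        if value is sentinel:
            value = loader()
            self.set(key, value)
        return value
```

A campaign page's article list could then be fetched with something like `cache.get_or_load(page_title, fetch_articles)`, where `fetch_articles` is whatever function queries the wiki, so stale lists refresh themselves on the next request after expiry.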

@santhosh oh cool, yeah, this is looking good! Agreed that supporting both wikilinks and QIDs would be ideal. And my gut feeling is that caching everything about the articles to enable faster filtering is ideal so long as that doesn't fill up the cache. Two optional extensions I could imagine:

  • Let organizers maintain their pages on their local wikis and just create a page with a soft redirect (template that we could parse) that the backend could follow? Or probably easier just add a "campaign-list-page" as an optional parameter in the Translation campaign template? And then you could also support an organizer adding multiple translation campaign templates to a single page. I think it would be minor for the code to support it and would allow campaigns that already maintain extensive lists on their local wikis to take part without duplicating that work or moving it.
  • Maybe have an expiration date parameter on the list? Trying to think how we don't use up cache space for old translation lists. Maybe that's not necessary but could help with ensuring the translation lists are maintained.

@Isaac, both are good ideas. We had discussed the second one in our team. The first one adds flexibility with minimal technical cost on our side.

Change #1059945 had a related patch set uploaded (by Eamedina; author: Eamedina):

[research/recommendation-api@master] WIP - Community-defined campaign translations

https://gerrit.wikimedia.org/r/1059945

A thought I had when talking to Alex at Wikimania: It would probably be useful if these lists had Wikidata items, and that the inclusion of the topics were made through the Wikidata property "on focus list for WikiProject (P5008)". It fits really well, as this is a focus list by definition. It would also enable other ways to use the knowledge that this is a prioritized topic.

We at Wiki Project Med would love the ability to add lists based on MDWiki... We basically put articles once they are ready to be translated into a category. This allows us the ability to only suggest well developed articles for translation. https://mdwiki.org/wiki/Category:RTT @Ainali how would I add this category as a list to Wikidata?

Change #1064325 had a related patch set uploaded (by Santhosh; author: Santhosh):

[research/recommendation-api@master] Add cache for translation campaigns pages

https://gerrit.wikimedia.org/r/1064325

Change #1064350 had a related patch set uploaded (by Santhosh; author: Santhosh):

[research/recommendation-api@master] translation campaign pages: support listing by article title

https://gerrit.wikimedia.org/r/1064350

I think Ainali's idea here is actually a good way to leverage Wikidata without having to support queries as sources of lists (organizers would be able to cultivate a focus list on Wikidata, or create a Listeria bot page marked with the correct metadata for a list).

Change #1059945 merged by jenkins-bot:

[research/recommendation-api@master] Support filter by community-defined translation campaigns

https://gerrit.wikimedia.org/r/1059945

Change #1064325 merged by jenkins-bot:

[research/recommendation-api@master] Add cache for translation campaigns pages

https://gerrit.wikimedia.org/r/1064325

Change #1064350 merged by jenkins-bot:

[research/recommendation-api@master] translation campaign pages: support listing by article title

https://gerrit.wikimedia.org/r/1064350

ngkountas subscribed.

@eamedina it feels to me that this task is ready for QA. What do you think?

Change #1073176 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update rec-api image in staging and prod

https://gerrit.wikimedia.org/r/1073176

Change #1073176 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update rec-api image in staging and prod

https://gerrit.wikimedia.org/r/1073176

PWaigi-WMF renamed this task from Community-defined translation lists: List definition, storage & recommendation to Community-defined Translation Collections: definition, storage & recommendation. Thu, Oct 31, 2:52 PM
PWaigi-WMF updated the task description.

Change #1088276 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update recommendation-api to 2024-11-06-190017-production

https://gerrit.wikimedia.org/r/1088276

@eamedina Sorry for the confusion but what exactly do you want me to test from a QA point of view? Can you please provide a user story and acceptance criteria or QA steps? Thanks!

GMikesell-WMF updated Other Assignee, added: GMikesell-WMF.
GMikesell-WMF updated Other Assignee, removed: GMikesell-WMF.