Page MenuHomePhabricator

CirrusSearch should generate a document consistent to a given schema
Closed, ResolvedPublic5 Estimated Story Points

Description

While creating an avro schema to extract the CirrusSearch data out of elasticsearch we identified a couple inconsistencies that should probably be fixed at CirrusSearch level:

  • defaultsort should be null when not set instead of being false
  • coordinates should be always be floats (not ints)
  • file_text should be a string, not an empty array nor false
  • labels: should be null or an empty map instead of an empty array
  • descriptions should be null or an empty map instead of an empty array
  • descriptions should always be a map<string, array<string>> and not an map<string,string> on wikidata

AC:

  • CirrusSearch produces a document that can be validated against the provided avro schema

Event Timeline

dcausse renamed this task from CirrusSearch should generate a document with consistent schema to CirrusSearch should generate a document consistent to a given schema.Nov 3 2022, 1:50 PM

Change 858643 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/core@master] WikiTextStructure::getDefaultSort should return null, not false

https://gerrit.wikimedia.org/r/858643

Change 858652 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/GeoData@master] Ensure coordinates are stored as floats

https://gerrit.wikimedia.org/r/858652

Change 859131 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseCirrusSearch@master] Encode empty labels field as null

https://gerrit.wikimedia.org/r/859131

defaultsort should be null when not set instead of being false

false was the default returned value from ParserOutput::getPageProperty. Updated to convert the false into null, matching the existing phpdoc's

coordinates should be always be floats (not ints)

GeoData was using the value as-parsed, added a bit to cast to float on construct of GeoData\Coord, now matching the existing phpdoc's.

file_text should be a string, not an empty array nor false

Doesn't look to happen anymore, looks like all existing values are *.js and *.css files. I have a process running to collect the set of all pages with this issue, will re-review to make sure it's only *.js and *.css once complete. Will possibly issue a one-off update to the found pages to clean up existing indices.

labels: should be null or an empty map instead of an empty array
descriptions should be null or an empty map instead of an empty array

Making this an empty map is a bit awkward, we would have to pass around \stdClass in places and document that it should be expected. Instead converted the sources to return either a populated array or null.

descriptions should always be a map<string, array<string>> and not an map<string,string> on wikidata

Not sure yet, still investigating. A naive reading suggests this shouldn't be possible.

Change 858643 merged by jenkins-bot:

[mediawiki/core@master] WikiTextStructure::getDefaultSort should return null, not false

https://gerrit.wikimedia.org/r/858643

Change 859131 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] Encode empty label/description fields as null

https://gerrit.wikimedia.org/r/859131

labels: should be null or an empty map instead of an empty array
descriptions should be null or an empty map instead of an empty array

Making this an empty map is a bit awkward, we would have to pass around \stdClass in places and document that it should be expected. Instead converted the sources to return either a populated array or null.

\Wikibase\MediaInfo\Content\MediaInfoHandler::getContentDataForSearchIndex might need a small tweak too I think, it enforces the description field to $fieldsData[DescriptionsField::NAME] = [] when no labels are found.

descriptions should always be a map<string, array<string>> and not an map<string,string> on wikidata

Not sure yet, still investigating. A naive reading suggests this shouldn't be possible.

Could we simply wrap the single description in an array in DescriptionsField would $data[$language] = [$desc->getText()]; break something?

Change 859577 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseMediaInfo@master] Represent empty labels/descriptions in search as null

https://gerrit.wikimedia.org/r/859577

Change 859578 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseCirrusSearch@master] Provide an array of descriptions for search

https://gerrit.wikimedia.org/r/859578

Change 858652 merged by jenkins-bot:

[mediawiki/extensions/GeoData@master] Ensure coordinates are stored as floats

https://gerrit.wikimedia.org/r/858652

labels: should be null or an empty map instead of an empty array
descriptions should be null or an empty map instead of an empty array

Making this an empty map is a bit awkward, we would have to pass around \stdClass in places and document that it should be expected. Instead converted the sources to return either a populated array or null.

\Wikibase\MediaInfo\Content\MediaInfoHandler::getContentDataForSearchIndex might need a small tweak too I think, it enforces the description field to $fieldsData[DescriptionsField::NAME] = [] when no labels are found.

Good find, patch up.

descriptions should always be a map<string, array<string>> and not an map<string,string> on wikidata

Not sure yet, still investigating. A naive reading suggests this shouldn't be possible.

Could we simply wrap the single description in an array in DescriptionsField would $data[$language] = [$desc->getText()]; break something?

I had written that, but was still poking around trying to figure out why commons had arrays in the first place. Eventually found that they swap the labels into the description field only on the commons side. Patch up since now it's clear what's happening.

Change 859577 merged by jenkins-bot:

[mediawiki/extensions/WikibaseMediaInfo@master] Represent empty labels/descriptions in search as null

https://gerrit.wikimedia.org/r/859577

Change 859578 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] Provide an array of descriptions for search

https://gerrit.wikimedia.org/r/859578

After reviewing the results of previous work I'm seeing that file_text => false is still possible. This occurs on files where the media handler does not support extracting text from the file format. This is a significant number of files, most of them really. While other bits only changed the values being returned, here we need to set the default value on all pages to null to ensure it gets cleared out.

An alternative option would be some sort of painless script run during reindexing to clear out the old incorrect values. That might be viable, but we don't have a ton of experience with that kind of thing.

Change 863040 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/core@master] search: Set file_text to null when not available

https://gerrit.wikimedia.org/r/863040

Change 864779 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] import_cirrus: Update doc cleaning to match cirrus updates

https://gerrit.wikimedia.org/r/864779

Change 864779 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] import_cirrus: Update doc cleaning to match cirrus updates

https://gerrit.wikimedia.org/r/864779

After reviewing the results of previous work I'm seeing that file_text => false is still possible. This occurs on files where the media handler does not support extracting text from the file format. This is a significant number of files, most of them really. While other bits only changed the values being returned, here we need to set the default value on all pages to null to ensure it gets cleared out.

An alternative option would be some sort of painless script run during reindexing to clear out the old incorrect values. That might be viable, but we don't have a ton of experience with that kind of thing.

I think that as long as the CirrusSearch doc building code generates a doc that's compliant with the schema we expect we should be good enough, if existing data in our elastic indices still remains with the old schema it might mean that we keep some BC code in the elastic -> hive spark job.
We discussed about having more flexibility to deal with this kind of "migration" using mw-config (e.g. use the config to configure field removal during in-place re-index) so I'd be fine considering this task done and create a new one to address such migration plans.

Change 863040 merged by jenkins-bot:

[mediawiki/core@master] search: Set file_text to null when not available

https://gerrit.wikimedia.org/r/863040