
Reduce noise in Growth team's Logstash dashboard
Open, Low, Public

Description

The GrowthExperiments dashboard frequently has noise that makes it harder to quickly scan for issues relevant to our team. Examples:

None of those are actionable by us, and we also don't need to know about them unless there is a big spike. I propose filtering these messages out of our dashboard when they occur fewer than 100 times per day.
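
To make the proposed rule concrete, here is a rough sketch in Python of what "hide it unless it spikes" would mean for a batch of log events. The `(timestamp, message)` event shape, the function name and the noise list are all assumptions made for illustration; nothing here reflects how Logstash or Kibana store or query events.

```python
from collections import Counter
from datetime import datetime

# Threshold taken from the proposal above: hide noisy messages seen
# fewer than 100 times on a given day.
DAILY_THRESHOLD = 100


def filter_low_volume(events, noisy_messages, threshold=DAILY_THRESHOLD):
    """Drop events whose message is on the noise list and rare on that day.

    events: list of (timestamp, message) tuples (hypothetical shape).
    noisy_messages: messages that only matter when they spike.
    """
    # First pass: count each noisy message per calendar day.
    counts = Counter(
        (ts.date(), msg) for ts, msg in events if msg in noisy_messages
    )
    # Second pass: keep everything except noisy messages under the threshold.
    return [
        (ts, msg)
        for ts, msg in events
        if msg not in noisy_messages or counts[(ts.date(), msg)] >= threshold
    ]
```

The rule needs a count over the whole day before any single event can be judged, so it takes two passes over the data rather than a per-event check.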

Event Timeline

How would that work? Kibana can filter by any property of an individual log event, but I don't think you can filter in aggregate.

  • I think Error connecting is never relevant to us so we could just filter that out regardless of the frequency (I assume when there's a lot of it someone gets paged).
  • The Flow one is a genuine error in one of our extensions, we just don't think it's worth the effort to fix.
  • Search error might or might not be relevant (if we do invalid searches, e.g. exceed the length limit which happened in the past, we should know about it; OTOH if it is something like a search server being down, that's not actionable for us) and volume filtering wouldn't help that much either (when there is an issue with a server, there is usually a big spike). We could add an exclusion list to the logging code, not sure if it is worth the effort.
  • RequestTimeoutException might in theory be relevant (something is way too slow). In practice these seem to be slow parses where EarlyLifeCycleHooks::onMessageCache__get happens to be triggered just when the parse runs out of the 60 sec limit (it's triggered on every message lookup so it's not that unlikely). Not sure what to do about that. Filed T328183: Replace MessageCache__get hook with a way to pre-register message keys as a possible solution.

Change 900809 had a related patch set uploaded (by Mainframe98; author: Mainframe98):

[mediawiki/extensions/GrowthExperiments@master] Use MessageCacheFetchOverrides hook

https://gerrit.wikimedia.org/r/900809

Change 900809 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Use MessageCacheFetchOverrides hook

https://gerrit.wikimedia.org/r/900809

  • Search error might or might not be relevant (if we do invalid searches, e.g. exceed the length limit which happened in the past, we should know about it; OTOH if it is something like a search server being down, that's not actionable for us) and volume filtering wouldn't help that much either (when there is an issue with a server, there is usually a big spike). We could add an exclusion list to the logging code, not sure if it is worth the effort.

Done. Excluded "server is busy", which was the only irrelevant search error in the last 30 days. Kept query too long errors, which still happen occasionally.
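
For readers unfamiliar with the approach, an exclusion list in logging code can be sketched roughly as below. This is Python's standard `logging` module, not the GrowthExperiments PHP code; the channel name `growth.search` and the class name are hypothetical, and only the "server is busy" string comes from this thread.

```python
import logging

# Hypothetical exclusion list; "server is busy" is the one irrelevant
# search error named in this thread.
EXCLUDED_SUBSTRINGS = ("server is busy",)


class ExcludeKnownNoise(logging.Filter):
    """Drop records whose formatted message contains an excluded substring."""

    def filter(self, record):
        message = record.getMessage().lower()
        # Returning False tells the logging framework to discard the record.
        return not any(s in message for s in EXCLUDED_SUBSTRINGS)


logger = logging.getLogger("growth.search")  # hypothetical channel name
logger.addFilter(ExcludeKnownNoise())
```

With this filter attached, a "query too long" warning still reaches the handlers, while a "server is busy" warning is dropped before it is emitted.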

  • RequestTimeoutException might in theory be relevant (something is way too slow). In practice these seem to be slow parses where EarlyLifeCycleHooks::onMessageCache__get happens to be triggered just when the parse runs out of the 60 sec limit (it's triggered on every message lookup so it's not that unlikely). Not sure what to do about that. Filed T328183: Replace MessageCache__get hook with a way to pre-register message keys as a possible solution.

This is done, thanks to @Mainframe98 and @Urbanecm_WMF.

Change 900836 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/Echo@master] Fix logging of cross-wiki API errors

https://gerrit.wikimedia.org/r/900836

Change 900837 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] User impact: Reduce pageview fetching log noise

https://gerrit.wikimedia.org/r/900837

  • I think Error connecting is never relevant to us so we could just filter that out regardless of the frequency (I assume when there's a lot of it someone gets paged).

Also done.

Most of the noise I see right now is "Failed to fetch API response from {wiki}. Error code {code}" from Echo, which is relevant but a bit annoying to browse because the code parameter is missing. Fixed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/900836; after that we can filter out "server is down" type API errors, which aren't really useful.
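
As an illustration of why a missing parameter makes these entries hard to browse, here is a small Python sketch of PSR-3-style placeholder interpolation. It assumes "missing" means the log context carried no `code` entry; the `interpolate` helper and the `enwiki` value are hypothetical and are not Echo's actual code.

```python
import re


def interpolate(message, context):
    """Replace {name} placeholders with context values, PSR-3 style.

    Placeholders with no matching context key are left untouched.
    """
    def replace(match):
        key = match.group(1)
        return str(context[key]) if key in context else match.group(0)

    return re.sub(r"\{([\w.]+)\}", replace, message)


TEMPLATE = "Failed to fetch API response from {wiki}. Error code {code}"

# Context without a "code" key (illustrative values only).
rendered = interpolate(TEMPLATE, {"wiki": "enwiki"})
```

Here `rendered` is `Failed to fetch API response from enwiki. Error code {code}`: every such entry looks the same, so the entries cannot be grouped or filtered by error code.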

The other major source of noise is the "An earlier attempt to fetch page XXX failed" messages for user impact pageview metrics, which were filtered out but only for jobs; patch submitted.

Change 900836 merged by jenkins-bot:

[mediawiki/extensions/Echo@master] Fix logging of cross-wiki API errors

https://gerrit.wikimedia.org/r/900836

Change 900837 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] User impact: Reduce pageview fetching log noise

https://gerrit.wikimedia.org/r/900837

DMburugu triaged this task as Medium priority. Apr 26 2023, 2:41 PM
DMburugu lowered the priority of this task from Medium to Low.
Aklapper added a subscriber: DMburugu.

@DMburugu: Please keep the Growth team's project tag on this unresolved task, as it is about the Growth team's dashboard. If there is nothing left to do in this task, please set the task status to resolved. Thanks!

At least for warnings like "Search error: We could not complete your search due to a temporary problem. Please try again later.", we should not log them to logstash in the first place, but to statsd/graphite instead. There we are also able to create an alert.

/me makes a TODO to create a task for that.
DONE: T367211: Log unactionable errors to statslib/prometheus and set alert instead of using logstash
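
A rough sketch of the idea, counting an unactionable warning as a metric instead of writing a log line, might look like the following in Python. It uses the plain StatsD counter datagram format over UDP; the function names, the metric name in the usage note, and the host and port defaults are assumptions for illustration and do not reflect the MediaWiki stats library or the production setup.

```python
import socket


def format_counter(metric, value=1):
    """Build a StatsD counter datagram: '<metric>:<value>|c'."""
    return f"{metric}:{value}|c".encode("ascii")


def count_unactionable(metric, host="127.0.0.1", port=8125):
    """Send one counter increment over UDP instead of writing a log line."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(metric), (host, port))
```

A call such as `count_unactionable("growthexperiments.search.temporary_error")` (a made-up metric name) would then replace the warning, and an alert on that counter's rate would cover the "big spike" case without cluttering the dashboard.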