Reduce noise in Growth team's Logstash dashboard
Open, Low, Public

Assigned To

None

Authored By

	kostajh
	Jan 27 2023, 1:26 PM

Description

The GrowthExperiments dashboard frequently has noise that makes it harder to quickly scan for issues relevant to our team. Examples:

T303175: Error 1062 from Flow\Data\Storage\RevisionStorage::insert, {error} {sql1line} {db_server} (recently filtered out of our dashboard, thank you @Tgr!)
Error connecting to {db_server} as user {db_user}: {error}
Search error: {message}
[{reqId}] {exception_url} Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded

None of those are actionable by us, and we don't need to know about them unless there is a big spike. I'd like to propose that we filter these messages out of our dashboard when they occur fewer than 100 times per day.
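A rough sketch of what the proposed threshold would mean in practice, assuming log events can be exported as (day, message template) pairs. The function name, the event shape, and the constant name are hypothetical; nothing here is a Kibana or Logstash API.

```python
from collections import Counter
from datetime import date

# Threshold from the proposal above: hide a noisy message unless it
# occurs at least this many times in a day.
DAILY_THRESHOLD = 100


def visible_noisy_messages(events, noisy_templates, threshold=DAILY_THRESHOLD):
    """Return the (day, template) pairs that should stay visible.

    `events` is an iterable of (day, template) tuples, a hypothetical
    export format. Only templates listed in `noisy_templates` are
    counted, since only those are subject to the threshold.
    """
    counts = Counter(
        (day, template) for day, template in events if template in noisy_templates
    )
    return {key for key, count in counts.items() if count >= threshold}
```

Note that this is an aggregate over many events, so it has to run over exported or pre-counted data rather than as a per-event filter.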

Details

Subject	Repo	Branch	Lines +/-
User impact: Reduce pageview fetching log noise	mediawiki/extensions/GrowthExperiments	master	+13 -3
Fix logging of cross-wiki API errors	mediawiki/extensions/Echo	master	+3 -1
Use MessageCacheFetchOverrides hook	mediawiki/extensions/GrowthExperiments	master	+12 -10


Related Objects

Mentioned In: T367059: Error connecting to {db_server} as user {db_user}: {error}
T367211: Log unactionable errors to statslib/prometheus and set alert instead of using logstash
T328129: Implement alerting when spikes occur in Growth team dashboard in Logstash
Mentioned Here: T367211: Log unactionable errors to statslib/prometheus and set alert instead of using logstash
T328183: Replace MessageCache__get hook with a way to pre-register message keys
T303175: Error 1062 from Flow\Data\Storage\RevisionStorage::insert, {error} {sql1line} {db_server}

Event Timeline

kostajh created this task. Jan 27 2023, 1:26 PM

Restricted Application added a subscriber: Aklapper. Jan 27 2023, 1:26 PM

kostajh added a parent task: T311850: [Epic] FY 2022-23 Growth Maintenance Work. Jan 27 2023, 1:26 PM

kostajh mentioned this in T328129: Implement alerting when spikes occur in Growth team dashboard in Logstash. Jan 27 2023, 1:34 PM

kostajh moved this task from Inbox to Triaged on the Growth-Team board. Jan 27 2023, 2:23 PM

kostajh edited parent tasks, added: T323132: [Epic] Q3 FY 2022-23 Growth Maintenance Work; removed: T311850: [Epic] FY 2022-23 Growth Maintenance Work.

How would that work? Kibana can filter by any property of an individual log event, but I don't think you can filter in aggregate.

I think Error connecting is never relevant to us so we could just filter that out regardless of the frequency (I assume when there's a lot of it someone gets paged).
The Flow one is a genuine error in one of our extensions, we just don't think it's worth the effort to fix.
Search error might or might not be relevant (if we do invalid searches, e.g. exceed the length limit which happened in the past, we should know about it; OTOH if it is something like a search server being down, that's not actionable for us) and volume filtering wouldn't help that much either (when there is an issue with a server, there is usually a big spike). We could add an exclusion list to the logging code, not sure if it is worth the effort.
RequestTimeoutException might in theory be relevant (something is way too slow). In practice these seem to be slow parses where EarlyLifeCycleHooks::onMessageCache__get happens to be triggered just when the parse runs out of the 60 sec limit (it's triggered on every message lookup so it's not that unlikely). Not sure what to do about that. Filed T328183: Replace MessageCache__get hook with a way to pre-register message keys as a possible solution.

DMburugu moved this task from Triaged to Current Maintenance Focus on the Growth-Team board. Jan 31 2023, 3:23 PM

Sgs subscribed. Feb 6 2023, 6:03 PM

Change 900809 had a related patch set uploaded (by Mainframe98; author: Mainframe98):

[mediawiki/extensions/GrowthExperiments@master] Use MessageCacheFetchOverrides hook

https://gerrit.wikimedia.org/r/900809

gerritbot added a project: Patch-For-Review. Mar 19 2023, 12:57 PM

Change 900809 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Use MessageCacheFetchOverrides hook

https://gerrit.wikimedia.org/r/900809

Maintenance_bot removed a project: Patch-For-Review. Mar 19 2023, 2:30 PM

ReleaseTaggerBot added a project: MW-1.41-notes (1.41.0-wmf.1; 2023-03-20). Mar 19 2023, 3:00 PM

In T328128#8566364, @Tgr wrote:

Search error might or might not be relevant (if we do invalid searches, e.g. exceed the length limit which happened in the past, we should know about it; OTOH if it is something like a search server being down, that's not actionable for us) and volume filtering wouldn't help that much either (when there is an issue with a server, there is usually a big spike). We could add an exclusion list to the logging code, not sure if it is worth the effort.

Done. Excluded "server is busy", which was the only irrelevant search error in the last 30 days. Kept query too long errors, which still happen occasionally.
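A minimal sketch of the exclusion-list idea, written in Python for illustration rather than taken from the actual patch. The names `EXCLUDED_SEARCH_ERRORS` and `should_log`, and the sample message texts, are hypothetical; the only exclusion the thread confirms is "server is busy".

```python
import logging

# Hypothetical exclusion list: substrings of search errors that are not
# actionable for the team. Only "server is busy" comes from the thread.
EXCLUDED_SEARCH_ERRORS = ("server is busy",)


def should_log(message, excluded=EXCLUDED_SEARCH_ERRORS):
    """Return False when the message matches an excluded substring."""
    lowered = message.lower()
    return not any(fragment in lowered for fragment in excluded)


class SearchErrorExclusionFilter(logging.Filter):
    """Logging filter that drops records matching the exclusion list."""

    def filter(self, record):
        return should_log(record.getMessage())
```

Attaching the filter with `logger.addFilter(SearchErrorExclusionFilter())` would drop the excluded errors before they reach any handler, while errors such as an over-long query would still be logged.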

RequestTimeoutException might in theory be relevant (something is way too slow). In practice these seem to be slow parses where EarlyLifeCycleHooks::onMessageCache__get happens to be triggered just when the parse runs out of the 60 sec limit (it's triggered on every message lookup so it's not that unlikely). Not sure what to do about that. Filed T328183: Replace MessageCache__get hook with a way to pre-register message keys as a possible solution.

This is done, thanks to @Mainframe98 and @Urbanecm_WMF.

Change 900836 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/Echo@master] Fix logging of cross-wiki API errors

https://gerrit.wikimedia.org/r/900836

gerritbot added a project: Patch-For-Review. Mar 19 2023, 11:50 PM

Change 900837 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] User impact: Reduce pageview fetching log noise

https://gerrit.wikimedia.org/r/900837

In T328128#8566364, @Tgr wrote:

I think Error connecting is never relevant to us so we could just filter that out regardless of the frequency (I assume when there's a lot of it someone gets paged).

Also done.

Most of the noise I see right now is "Failed to fetch API response from {wiki}. Error code {code}" from Echo, which is relevant but a bit annoying to browse because the code parameter is missing. Fixed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/900836 ; after that we can filter out "server is down" type of API errors, which aren't really useful.
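To illustrate why a missing parameter makes these entries hard to browse, here is a small sketch that assumes the message is a template with {placeholder} keys filled from a context mapping. The helper names `interpolate` and `missing_placeholders` and the sample context are hypothetical and are not the Echo code.

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def interpolate(template, context):
    """Fill {key} placeholders from context; leave unknown keys untouched."""
    def replace(match):
        key = match.group(1)
        return str(context[key]) if key in context else match.group(0)
    return PLACEHOLDER.sub(replace, template)


def missing_placeholders(template, context):
    """List placeholder keys in the template that the context does not supply."""
    return [key for key in PLACEHOLDER.findall(template) if key not in context]
```

With a context that only supplies `wiki`, the `{code}` placeholder stays unfilled in the rendered message, so entries cannot be grouped or filtered by error code.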

The other major source of noise is the "An earlier attempt to fetch page XXX failed" messages for user impact pageview metrics, which were filtered out but only for jobs; patch submitted.

Tgr removed a subscriber: Mainframe98. Mar 20 2023, 12:14 AM

Change 900836 merged by jenkins-bot:

[mediawiki/extensions/Echo@master] Fix logging of cross-wiki API errors

https://gerrit.wikimedia.org/r/900836

Change 900837 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] User impact: Reduce pageview fetching log noise

https://gerrit.wikimedia.org/r/900837

ReleaseTaggerBot edited projects, added MW-1.41-notes (1.41.0-wmf.2; 2023-03-27); removed MW-1.41-notes (1.41.0-wmf.1; 2023-03-20). Mar 23 2023, 2:00 PM

Maintenance_bot removed a project: Patch-For-Review. Mar 23 2023, 2:10 PM

DMburugu edited parent tasks, added: T333335: [Epic] Q4 FY 2022-23 Growth Maintenance Work; removed: T323132: [Epic] Q3 FY 2022-23 Growth Maintenance Work. Mar 28 2023, 2:25 PM

DMburugu triaged this task as Medium priority. Apr 26 2023, 2:41 PM

DMburugu lowered the priority of this task from Medium to Low.

DMburugu moved this task from Current Maintenance Focus to Sprint 0 (Growth Team) on the Growth-Team board.

DMburugu edited projects, added Growth-Team (Sprint 0 (Growth Team)); removed Growth-Team.

DMburugu moved this task from Incoming to Current Month Maintenance Priorities on the Growth-Team (Sprint 0 (Growth Team)) board.

DMburugu edited parent tasks, added: T340455: [Epic] Q1 FY 2023-24 Growth Maintenance Work; removed: T333335: [Epic] Q4 FY 2022-23 Growth Maintenance Work. Jun 26 2023, 5:07 PM

DMburugu removed a parent task: T340455: [Epic] Q1 FY 2023-24 Growth Maintenance Work. Oct 2 2023, 10:15 AM

DMburugu removed a project: Growth-Team (Sprint 0 (Growth Team)). Oct 6 2023, 8:41 AM

@DMburugu: Please keep the Growth team's project tag on this unresolved task, as this is about the Growth team's dashboard. If there is nothing left to do in this task, please set the task status to resolved - thanks!

JFernandez-WMF moved this task from Inbox to Blocked on the Growth-Team board. Jan 23 2024, 10:01 AM

JFernandez-WMF moved this task from Blocked to Current Maintenance Focus on the Growth-Team board.

Michael subscribed. May 28 2024, 8:10 AM

At least for warnings like "Search error: We could not complete your search due to a temporary problem. Please try again later.", we should not log them to logstash in the first place, but to statsd/graphite instead. There we are also able to create an alert.

/me makes a TODO to create a task for that.
DONE: T367211: Log unactionable errors to statslib/prometheus and set alert instead of using logstash
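A sketch of the "count it instead of logging it" idea, assuming unactionable errors can be recognised by a substring. The in-memory `Counter` is only a stand-in for a real metrics client, and every name here (`UNACTIONABLE_FRAGMENTS`, `report_search_error`, `should_alert`, the metric key) is hypothetical.

```python
import logging
from collections import Counter

logger = logging.getLogger("growth_sketch")

# Stand-in for a metrics backend; a real setup would use a metrics client.
unactionable_counts = Counter()

# Hypothetical marker for the temporary-problem warning quoted above.
UNACTIONABLE_FRAGMENTS = ("temporary problem",)


def report_search_error(message):
    """Count unactionable errors as a metric; log everything else."""
    if any(fragment in message for fragment in UNACTIONABLE_FRAGMENTS):
        unactionable_counts["search_error_unactionable"] += 1
        return "metric"
    logger.warning("Search error: %s", message)
    return "log"


def should_alert(count, threshold):
    """Alert only when the counted errors reach the spike threshold."""
    return count >= threshold
```

The dashboard then only receives errors someone can act on, while a spike in the counter can still trigger an alert.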

Michael mentioned this in T367211: Log unactionable errors to statslib/prometheus and set alert instead of using logstash. Jun 11 2024, 5:30 PM

DAlangi_WMF mentioned this in T367059: Error connecting to {db_server} as user {db_user}: {error}. Jun 13 2024, 10:31 AM

Urbanecm_WMF edited projects, added Growth-Team (Maintenance); removed Growth-Team. Mon, Nov 18, 5:24 PM

Sgs moved this task from Backlog to Statslib migration on the Growth-Team (Maintenance) board. Tue, Nov 19, 12:03 PM

Sgs moved this task from Statslib migration to Backlog on the Growth-Team (Maintenance) board. Wed, Nov 20, 3:40 PM
