Page MenuHomePhabricator

Improve wikifunctions logging
Closed, ResolvedPublic

Description

Currently wikifunctions pods log ~2500 lines per hour in production (in one DC). The absolute majority of those logs are from "levelPath: trace/req" for the /metrics and /_info endpoints, which could probably suppressed. There are also a bunch of "levelPath: debug" logs logging memory and cpu usage, lacking any other identifier (like a request-id). Same goes for "levelPath: info" which seems to print the function call (plus arguments?) and some "levelPath: notice".

But especially the amount of trace/req logs probably made it hard to spot the problem that lead to T344998 right away, see https://logstash.wikimedia.org/goto/aa3497c585530d231916e0d2b479c026 vs. https://logstash.wikimedia.org/goto/d9f2a3f0cab9474b5d48bed41cf7b8c1

Event Timeline

JMeybohm renamed this task from Improcve wikifunctions logging to Improve wikifunctions logging.Sep 13 2023, 4:59 PM
JMeybohm created this task.
Jdforrester-WMF changed the task status from Open to In Progress.Sep 26 2023, 2:09 PM
Jdforrester-WMF claimed this task.
Jdforrester-WMF triaged this task as High priority.
Jdforrester-WMF moved this task from To triage to In Progress on the Abstract Wikipedia team board.

Change 961848 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/WikiLambda@master] Update function-schemata sub-module to HEAD (958de5c)

https://gerrit.wikimedia.org/r/961848

Change 961848 merged by jenkins-bot:

[mediawiki/extensions/WikiLambda@master] Update function-schemata sub-module to HEAD (958de5c)

https://gerrit.wikimedia.org/r/961848

Change 962650 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Use function-orchestrator image with better logging

https://gerrit.wikimedia.org/r/962650

Change 962650 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Use function-orchestrator image with better logging

https://gerrit.wikimedia.org/r/962650

When I check https://logstash.wikimedia.org/goto/dbb318a11eaeb4b54503b9878c805f15 for the last 1 hour and remove NOT levelPath:trace/req, the number of events is unchanged, so the levelPath:trace/req is suppressed.
But levelPath:trace is still present on Apps Logs (Kubernetes) (about 3,600 in 1 hour). We don't need to suppress them too?

Also, the dashboard wikifunctions has events with level 10 (`levelPath
trace`), ~8,404; and level INFO accounts for 4,551 events. Overall (without filtering), wikifunctions dashboard presents 13,110 events in 1 hour. Should the dashboard filters be adjusted further?

Additional improvements for logging will be addressed in other tasks if necessary. trace/req is not present anymore. Closing this task as Resolved.