
Add hook for formatting Kubernetes Event messages #910

Open
agoose77 wants to merge 51 commits into jupyterhub:main from agoose77:feat-add-hook

Conversation


agoose77 commented Mar 3, 2026

TL;DR

Note

No LLMs were used in the authoring of this PR.

2i2c is currently working on a user story to improve the kubespawner progress messages, as part of an initiative to improve the spawn-progress page.

This PR does several things:

  1. Adds a custom decorate_progress_message hook that overrides the pretty-printing of event messages.
  2. Adds a kubespawner.events module for richer built-in formatting of log messages.
  3. Adds a kubespawner.events.RuleEventFormatter and other types for defining event formatting rules.

See the before and after:

Before: (screenshot: old)

After: (screenshot: new)

Goal

The goal is modest: to improve the human readability of spawn messages, and to allow further customisation.

Example Decoration Hook

Basic hook

def decorate_progress_message(spawner, event, text):
    # Return both a plain-text and an HTML representation of the message.
    return {
        "message": f"custom-message-{text}",
        "html_message": f"<span>{text}</span>",
    }

c.KubeSpawner.decorate_progress_message = decorate_progress_message

Use the rules to define custom renderers

c.RuleEventFormatter.rules = {
    "01-container-image-events": {
        "match": {
            "reportingComponent": r"kubelet",
            "fieldPath": r"spec\.(?P<container>initContainers|containers)\{([^}]+)\}",
            "reason": r"(?P<action>Pulling|Pulled)",
        },
        "template": "{action} {image} image for the {container} container",
    },
    "02-container-lifecycle-events": {
        "match": {
            "reportingComponent": r"kubelet",
            "fieldPath": r"spec\.(?P<container>initContainers|containers)\{([^}]+)\}",
            "reason": r"(?P<action>Started|Killing|Created|Stopped)",
        },
        "template": "{action} the {container} container",
    },
    "03-pod-resource-events": {
        "match": {
            "reportingComponent": r"kubelet",
            "reason": r"OutOf(?P<resource>memory|cpu|ephemeral-storage|pods)",
        },
        "template": "The node selected to run your server ran out of {resource}",
    },
    "04-scheduler-node-found-events": {
        "match": {
            "reportingComponent": r".*-user-scheduler",
            "reason": r"Scheduled",
            "message": r".*?assigned \S+ to (?P<node>\S+)",
        },
        "template": "A node ({node}) has been found to run your server",
    },
    "05-scheduler-no-nodes-events": {
        "match": {
            "reportingComponent": r".*-user-scheduler",
            "reason": r"FailedScheduling",
        },
        "template": "No existing nodes are currently able to run your server",
    },
    "06-cluster-autoscaler-events": {
        "match": {
            "reportingComponent": r"cluster-autoscaler",
            "reason": r"TriggeredScaleUp",
        },
        "template": "Launching new nodes by scaling up the cluster",
    },
    "07-node-affinity-events": {
        "match": {
            "reportingComponent": r"kubelet",
            "message": r"Predicate NodeAffinity failed.*",
            "reason": r"NodeAffinity",
        },
        "template": "It was not possible to find or launch any nodes to run your server. This is likely due to a configuration problem with the infrastructure or the JupyterHub",
    },
    "08-gke-scheduler-node-found-events": {
        "match": {
            "reportingComponent": r"gke\.io/optimize-utilization-scheduler",
            "reason": r"Scheduled",
            "message": r".*?assigned \S+ to (?P<node>\S+)",
        },
        "template": "A node ({node}) has been found to run your server",
    },
    "09-gke-scheduler-no-nodes-events": {
        "match": {
            "reportingComponent": r"gke\.io/optimize-utilization-scheduler",
            "reason": r"FailedScheduling",
        },
        "template": "No existing nodes are currently able to run your server",
    },
    "10-taint-eviction-events": {
        "match": {
            "reportingComponent": r"taint-eviction-controller",
            "reason": r"gke\.io/optimize-utilization-scheduler",
            "message": r"Cancelling deletion of Pod.*",
        },
        "template": "Cancelling deletion of your server. This normally happens when a scale-up has just taken place.",
    },
}
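To make the rule semantics above concrete, here is a minimal, hypothetical sketch of how a formatter could apply such rules: each `match` regex is searched against the corresponding event field, named capture groups from all patterns are pooled, and the first fully-matching rule's `template` is formatted with those groups plus the raw event fields. The function name `apply_rules` and the flat event dict are illustrative assumptions; the real `RuleEventFormatter` implementation may differ.

```python
import re


def apply_rules(rules, event):
    """Return a formatted message for the first matching rule, else None.

    `rules` maps rule names to {"match": {field: regex}, "template": str};
    `event` is a flat dict of Kubernetes Event fields. (Hypothetical sketch;
    the real RuleEventFormatter may behave differently.)
    """
    for name in sorted(rules):  # numeric name prefixes control rule ordering
        rule = rules[name]
        groups = {}
        for field, pattern in rule["match"].items():
            m = re.search(pattern, event.get(field, ""))
            if m is None:
                break  # one non-matching field disqualifies this rule
            groups.update(m.groupdict())
        else:
            # Every field matched: fill the template with captured groups
            # plus the raw event fields (e.g. {image}).
            return rule["template"].format(**{**event, **groups})
    return None


event = {
    "reportingComponent": "kubelet",
    "fieldPath": "spec.containers{notebook}",
    "reason": "Pulling",
    "image": "jupyter/base-notebook",
}
rules = {
    "01-container-image-events": {
        "match": {
            "reportingComponent": r"kubelet",
            "fieldPath": r"spec\.(?P<container>initContainers|containers)\{([^}]+)\}",
            "reason": r"(?P<action>Pulling|Pulled)",
        },
        "template": "{action} {image} image for the {container} container",
    },
}
print(apply_rules(rules, event))
# → Pulling jupyter/base-notebook image for the containers container
```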

Design Details

UI

  • Timestamps are formatted to a regular ISO-8601-like %Y-%m-%dT%H:%M:%SZ to keep a fixed width
  • Timestamps and message types are pretty formatted as button-pills
  • Messages are simplified where possible
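As a sketch of the timestamp handling described above (assuming the intended strftime spec is %Y-%m-%dT%H:%M:%SZ; the helper name is illustrative, not kubespawner API):

```python
from datetime import datetime, timezone


def format_event_time(dt):
    """Render an aware datetime as a fixed-width, ISO-8601-like UTC string."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(format_event_time(datetime(2026, 3, 3, 14, 0, 5, tzinfo=timezone.utc)))
# → 2026-03-03T14:00:05Z
```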

Constraints

  • I targeted Python 3.7 given pyproject.toml, meaning no match, :=, or removeprefix.
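For illustration, working within that constraint means writing a backport where 3.9+ code would use str.removeprefix, along the lines of:

```python
def removeprefix(text, prefix):
    """Python 3.7-compatible stand-in for str.removeprefix (added in 3.9)."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


print(removeprefix("spec.containers", "spec."))  # → containers
```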

Questions

  • Is Kubespawner making too many assumptions if we bake-in the expectation of Bootstrap?
  • Could we consider adding a start-time timestamp so that times can simply be given as "minutes since spawn" rather than UTC times?

agoose77 marked this pull request as ready for review on March 3, 2026 14:00
agoose77 force-pushed the feat-add-hook branch 3 times, most recently from 4cf984c to ed9f488 on March 3, 2026 16:00
manics (Member) commented Mar 3, 2026

> This does effectively vendor some cluster-provider specifics (like the GCP scheduler). I think that's OK? But if we are vehemently against that, we can just pull those parts out.

Whatever we decide, we need to be consistent in future: if we add GCP-specific code, we need to accept code for other clouds, including from third parties who use platforms that we can't test ourselves.

agoose77 force-pushed the feat-add-hook branch 3 times, most recently from dc81d08 to 8e4f471 on March 3, 2026 17:28
jnywong (Member) left a review

I enjoy this feature ❤️ I like that the format_event_hook was easy to configure for basic formatting. It took me a while to understand what was going on with the default formatting, but I think there will always be room for improvement there.

In general, my main comment is to complete the test suite so that it covers how the sample-events were generated through messages.py, since I think that represents the bulk of the work in this PR. The default formatting may undergo further development, but, as you say, I think we want to add some basic regression testing and update it as needed in future.


In answer to your questions:

> Is Kubespawner making too many assumptions if we bake-in the expectation of Bootstrap?

We know Bootstrap ships with JupyterHub, so I think this is a safe assumption for now. I don't think we need to worry about this for BinderHub?

> Could we consider adding a start-time timestamp so that times can simply be given as "minutes since spawn" rather than UTC times?

I think this is a nice-to-have. Most users hopefully shouldn't have to dwell on the spawn-progress screen, but if they do, then they will likely screenshot their spawn failure to send to an admin. Keeping the timestamps consistent with server-side logs and raw k8s events should be useful for sysadmins when troubleshooting.

> Should I rework the built-in formatter to generate HTML at every stage — would it be useful to have e.g. image names/tags, and node names in button-tags?

At this stage, I would prefer not to have the default formatter be too flashy.

> Is there motivation to move the default message format into its own configurable, rather than requiring users to create their own hook?

Yes, out of all of these questions I think this would be one to focus on. Most people will probably want to configure an extension to the default formatter.

jnywong (Member) commented Mar 5, 2026

> This does effectively vendor some cluster-provider specifics (like the GCP scheduler). I think that's OK? But if we are vehemently against that, we can just pull those parts out.
>
> Whatever we decide we need to be consistent in future. If we add GCP specific code we need to accept code for other clouds, including from third parties who use platforms that we can't test ourselves.

I am okay with that – I don't think we need to assume full responsibility for testing code on platforms we don't have access to, but we should require contributors who would like this functionality to include full test suites for it. I think the scope of changes in this PR is pretty cosmetic, so there doesn't seem to be huge scope for someone to introduce anything too crazy on a third-party platform.

There is a small question about how to structure this as the corpus of messages to reformat scales, but I think we can cross that bridge when we get to it?

yuvipanda (Collaborator) left a review

Thank you for working on this, @agoose77! I left a comment about changing the implementation to be a lot more declarative than it is now, which should hopefully make both maintenance and extension much easier.


agoose77 commented Mar 6, 2026

@yuvipanda I've reworked the PR to keep the functional hook that completely bypasses event formatting, and to remove the default formatter callable.

I think the test failure is just a flaky test?

I've then added a rule system with no defaults. This was a two-fold decision:

  1. It keeps kubespawner leaner (no cloud-specific functionality).
  2. It resolves the problem of making this overrideable.¹

As such, I intend to put the specific implementation of these rules in the z2jh chart instead. I'm happy to revert that decision, but off the top of my head I am not aware of a nicer way to do it than defining the default value as a dict and implementing the extra functionality in kubespawner itself, where users could clobber the default names and set them to None (plus None-handling logic to remove these).

Footnotes

  1. My knowledge of traitlets is that one can't refer to the default value of a trait. This means that to allow users to override this, z2jh, for example, would literally have to import the HasTraits-derived parent class and extract the default in order to compose them.

In a review thread on tests/test_spawner.py, agoose77 (Author) noted:

> TODO: test validation logic


yuvipanda commented Mar 6, 2026

I haven't looked at the implementation yet, but thank you for reworking it! I think the rules system should live here, not in z2jh. You can make it extensible by having a default list here, and then allowing an extra_event_formatter_rules (or similar) that is appended. This is essential functionality that will benefit everyone using kubespawner, and I'm not concerned about a few extra rules that are cloud-provider specific, especially as they mostly come from the open-source autoscaler project. We aren't doing anything specific to any cloud provider here; we are supporting things from the autoscaler, which has plugins for cloud-specific functionality.
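The extension pattern suggested here could be as simple as a keyed dict merge. The names DEFAULT_RULES and extra_event_formatter_rules below are hypothetical (taken from the suggestion above, not an existing kubespawner API):

```python
# Hypothetical built-in defaults shipped with kubespawner.
DEFAULT_RULES = {
    "05-scheduler-no-nodes-events": {
        "match": {"reason": r"FailedScheduling"},
        "template": "No existing nodes are currently able to run your server",
    },
}

# Hypothetical operator-supplied additions (e.g. via a config trait).
extra_event_formatter_rules = {
    "99-site-specific": {
        "match": {"reason": r"Preempted"},
        "template": "Your server was preempted by the cluster",
    },
}

# Extra rules extend, and can override by key, the defaults.
effective_rules = {**DEFAULT_RULES, **extra_event_formatter_rules}
print(sorted(effective_rules))
```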
