Conversation

@jfongatyelp (Contributor)

We sometimes have folks identify one or a few pods that show higher timings or errors, and they often want to try restarting just that bad pod rather than forcing all pods to restart.

This PR adds that option through the PaaSTA API at /services/{service}/{instance}/replicas/{replica}/restart, and reuses the PaaSTA API auth added for remote-run. (This means I also moved a helper that had originally been remote-run-only into utils so I could use it here too.)
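
For reference, here's a minimal sketch of hitting the new endpoint directly over HTTP (illustrative only, not the PaaSTA client code; the POST method, host, and auth header below are assumptions):

import requests

# Illustrative only: assumes the endpoint lives under /v1, accepts a POST with
# no body, and uses the same token-based auth as remote-run. The host and
# header are placeholders.
PAASTA_API = "http://paasta-api.example.com"
resp = requests.post(
    f"{PAASTA_API}/v1/services/paasta-contract-monitor/main/replicas/"
    "paasta-contract-monitor-main-84564c8585-698x5/restart",
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())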

Since this seemed straightforward as an additional option on top of the existing paasta restart syntax, I just added it there via a --replica flag. I opted for --replica instead of pod to keep the terminology consistent with paasta status output (though instances/instances still exists, tbf :p).

Running this locally seemed to do the right thing:

$ PAASTA_SYSTEM_CONFIG_DIR=./etc_paasta_for_development/ python paasta_tools/cli/cli.py restart --replica paasta-contract-monitor-main-84564c8585-698x5 -s paasta-contract-monitor -i main  -c eksstage
[service paasta-contract-monitor] Issued request to restart replica 'paasta-contract-monitor-main-84564c8585-698x5' of paasta-contract-monitor.main by jfong@dev208-uswest1adevc.uswest1-devc.yelpcorp.com
✓ Successfully initiated restart of replica 'paasta-contract-monitor-main-84564c8585-698x5'
  Service: paasta-contract-monitor
  Instance: main
  Cluster: eksstage
  Replica: paasta-contract-monitor-main-84564c8585-698x5

Kubernetes will automatically create a replacement pod.

Under the hood, the PaaSTA API sends a pod deletion request rather than actually 'restarting' the main container, but I think this is what we generally reach for when someone asks us as oncall for something similar.

I do think that, more often than not, people fighting 'bad' containers are really dealing with a bad node rather than a bad replica, but this at least gives service owners a troubleshooting option that doesn't require elevated privileges or restarting all running pods.

For now, this calls into our PaaSTA API auth, but /v1/services calls are not restricted; I'll attempt to add a rule with a more specific path to restrict this particular call. It also reuses the existing client-side LDAP check that paasta restart already uses.

@jfongatyelp requested a review from a team as a code owner on December 16, 2025 at 22:02
)
def instance_replica_restart(
    request: Request,
) -> dict[str, Any]:

Member

non-blocking: I wonder if a TypedDict or the generated model would maybe be nicer here?

Contributor Author

Interesting, at the very least having a named model for the response would be great... I don't really see us importing the generated models into api/views/ files or instance/kubernetes.py at all; they're only used on the client side. So I suppose I could do both: a TypedDict on the API side, and reference the generated model on the client side?

Member

Yea, I'm not sure if there's a reason why we haven't tried using the generated named models or if we just never bothered, since a lot of these endpoints predate our typing enthusiasm :p
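
For illustration, a minimal sketch of what a named response type could look like on the API side; the field names are inferred from the CLI output in the PR description and are assumptions, not the actual response shape:

from typing import TypedDict


class ReplicaRestartResponse(TypedDict):
    # Hypothetical fields, inferred from the CLI output above; the real
    # model may differ.
    service: str
    instance: str
    cluster: str
    replica: str
    message: str


# Example construction inside the view handler:
response: ReplicaRestartResponse = {
    "service": "paasta-contract-monitor",
    "instance": "main",
    "cluster": "eksstage",
    "replica": "paasta-contract-monitor-main-84564c8585-698x5",
    "message": "Issued request to restart replica",
}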

delete_options = V1DeleteOptions()
try:
    kube_client.core.delete_namespaced_pod(
        pod_name, namespace, body=delete_options, grace_period_seconds=0

Member

do we want to force-delete?

Contributor Author

I wondered the same thing, but I couldn't find a force param in DeleteOptions or delete_namespaced_pod, and came across kubernetes-client/python#508 (comment).

So I've just set grace_period_seconds=0 here (which we generally do when people ask us to delete pods as well)... I think eventually people might want the option of an alternate grace period, but for now this seems ok?

Member

Sorry, I meant: do we actually want to force here? Seems like it'd be safer to default to following the configured terminationGracePeriodSeconds rather than always force-deleting.

Contributor Author

Ah gotcha, so my thinking was that when people are reaching for a restart of an individual replica, it's probably already in a bad state and they want it forcibly restarted; whenever we've done this manually for people at their request, it's basically always been forced.

But yes, I can go ahead and create that option and default to the configured value, making force opt-in 👍

Member

Ah, true - I guess if a pod is already causing errors, it's not like we'll make things worse

this is fine with me then!

Contributor Author

Nah, you were right, it's best to be safe by default and force only when asked, so I've now added the --force option that passes things along. One minor hiccup is that the API was only getting LongRunningInstanceConfigs rather than KubernetesDeploymentConfig, and only the latter has get_termination_grace_period(), so I just added a minor hasattr check to get around it.
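
For illustration, a rough sketch of the fallback described above; the helper name, the force plumbing, and the surrounding signature are assumptions about the shape of the change, not the actual diff:

from typing import Optional

from kubernetes.client import V1DeleteOptions


def delete_pod_for_restart(
    kube_client, pod_name: str, namespace: str, instance_config, force: bool = False
) -> None:
    # Default: let Kubernetes honor the configured termination grace period.
    grace_period_seconds: Optional[int] = None
    if force:
        grace_period_seconds = 0
    elif hasattr(instance_config, "get_termination_grace_period"):
        # Only KubernetesDeploymentConfig exposes this; LongRunningInstanceConfig
        # does not, hence the hasattr check mentioned above.
        grace_period_seconds = instance_config.get_termination_grace_period()

    kube_client.core.delete_namespaced_pod(
        pod_name,
        namespace,
        body=V1DeleteOptions(),
        grace_period_seconds=grace_period_seconds,
    )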

port = pick_random_port("paasta-dev-api")
# Generate api endpoints
api_endpoints = {"api_endpoints": {cluster: f"http://localhost:{port}"}}
# Just create both non-eks and eks endpoints, one has to be right :D

Member

++
