Tags: luccabb/monarch

v0.1.0

Add unhandled supervision error hook to crash the client (meta-pytorch#1637)

Summary:

Part of meta-pytorch#1209

Make two variants of the "actor_states_monitor" watchdog: one for Owned ActorMesh, which
sends a message to the owner if one exists, and one for Ref ActorMesh, which does not.
This way, Ref actor meshes generate liveness exceptions without propagation, and Owned
actor meshes send a SupervisionFailureMessage to their owning actor. Since every Owned mesh
does this, events that are not handled along the way will always reach the client.
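
As a rough illustration of the propagation rule described above, here is a toy model in plain
Python. It does not use Monarch: `ToyActor`, `propagate`, and `on_unhandled` are hypothetical
names invented for this sketch, and the real watchdog is not implemented this way. It only shows
the idea that a failure walks up the owner chain until some owner handles it or it reaches the
client.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyActor:
    """Hypothetical stand-in for an actor that owns a mesh (not a Monarch type)."""
    name: str
    owner: Optional["ToyActor"] = None  # None means this is the client (root)
    handles_failures: bool = False


def propagate(
    failure: str,
    mesh_owner: ToyActor,
    on_unhandled: Callable[[str], None],
) -> List[str]:
    """Deliver a failure to the mesh owner, then up the chain until handled.

    Returns the names of the actors that saw the failure, in order.
    """
    visited: List[str] = []
    current: Optional[ToyActor] = mesh_owner
    while current is not None:
        visited.append(current.name)
        if current.handles_failures:
            return visited  # handled here, propagation stops
        if current.owner is None:
            on_unhandled(failure)  # reached the client without a handler
        current = current.owner
    return visited
```

In this model a Ref mesh would simply never call `propagate`; it would only raise a liveness
exception to the caller.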

Add a `monarch.actor.unhandled_fault_hook` function, which is called when an unhandled supervision
error reaches the client. It takes one argument, a MeshFailure object, and is expected
to halt the process. By default it logs the error and then calls `sys.exit(1)`.
Raising an exception from the hook is not sufficient, because the hook is called from a tokio task
rather than from a Python thread.
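
A minimal sketch of what a replacement hook could look like, mirroring the default behaviour
described above (log, then exit with status 1). The function name `crash_on_unhandled_fault` is
made up for this example, and the installation step shown in the trailing comment (assigning to
`monarch.actor.unhandled_fault_hook`) is an assumption, not something this commit message confirms.

```python
import logging
import sys

logger = logging.getLogger("fault_hook_sketch")


def crash_on_unhandled_fault(failure: object) -> None:
    """Log the failure, then halt the process.

    `failure` is expected to be a MeshFailure; it is typed as `object`
    here so the sketch does not depend on Monarch being installed.
    """
    logger.error("Unhandled supervision failure: %s", failure)
    # Halt rather than raise: per the description above, an ordinary
    # exception raised from the hook would not stop the client.
    sys.exit(1)


# Assumed installation (not confirmed by the source):
#   import monarch.actor
#   monarch.actor.unhandled_fault_hook = crash_on_unhandled_fault
```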

Note that propagation will not happen if an ActorMesh and all endpoints are unreachable and garbage
collected, but the actors are still running something that generates an error. We'll want to fix this
eventually.

Reviewed By: mariusae

Differential Revision: D85163744

v0.0.0

v0.0.0 release