How to detect and respond to zombie flows
Learn how to detect and respond to zombie flows.
Sudden infrastructure failures, such as machine crashes or container evictions, can cause flow runs to become unresponsive and appear stuck in a Running state.
To mitigate this, flow runs triggered by deployments can emit heartbeat events. An automation can then detect these “zombie” flow runs and mark them as Crashed once they stop reporting heartbeats.
Enable flow run heartbeat events
To emit flow run heartbeat events, run Prefect version 3.1.8 or later and set PREFECT_RUNNER_HEARTBEAT_FREQUENCY to an integer number of seconds, 30 or greater.
Create the automation
To create an automation that marks zombie flow runs as crashed, run this script:
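A minimal sketch of such a script, assuming the Prefect 3 Python API (`Automation`, `EventTrigger`, and `ChangeFlowRunState`). The function name `build_zombie_automation`, the module-level constants, the automation name, and the message text are illustrative choices, not names defined by Prefect.

```python
from datetime import timedelta

# Event names follow Prefect's "prefect.flow-run.<StateName>" convention.
HEARTBEAT_EVENT = "prefect.flow-run.heartbeat"
TERMINAL_EVENTS = {
    "prefect.flow-run.Completed",
    "prefect.flow-run.Failed",
    "prefect.flow-run.Cancelled",
    "prefect.flow-run.Crashed",
}
# How long to wait for a follow-up event after each heartbeat.
WITHIN = timedelta(seconds=90)


def build_zombie_automation():
    """Build, without registering, an automation that crashes zombie flow runs."""
    # Imported here so the module loads even where Prefect is not installed.
    from prefect.automations import Automation
    from prefect.client.schemas.objects import StateType
    from prefect.events.actions import ChangeFlowRunState
    from prefect.events.schemas.automations import EventTrigger, Posture
    from prefect.events.schemas.events import ResourceSpecification

    return Automation(
        name="Crash zombie flows",  # illustrative name
        trigger=EventTrigger(
            # Start the clock after every heartbeat ...
            after={HEARTBEAT_EVENT},
            # ... and expect another heartbeat or a terminal state event.
            expect={HEARTBEAT_EVENT, *TERMINAL_EVENTS},
            match=ResourceSpecification(
                {"prefect.resource.id": ["prefect.flow-run.*"]}
            ),
            # Evaluate each flow run independently.
            for_each={"prefect.resource.id"},
            # Proactive: fire when the expected event does NOT arrive in time.
            posture=Posture.Proactive,
            threshold=1,
            within=WITHIN,
        ),
        actions=[
            ChangeFlowRunState(
                state=StateType.CRASHED,
                message="Flow run marked as crashed due to missing heartbeats.",
            )
        ],
    )
```

Calling `build_zombie_automation().create()` in an environment configured to reach your Prefect API should register the automation. Building and registering are kept separate here so the definition can be inspected before it is created.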
The trigger definition says that after each heartbeat event for a flow run, we expect to see another heartbeat event or a terminal state event for that same flow run within 90 seconds.
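To make that rule concrete, here is a small pure-Python simulation of it for a single flow run. It is a sketch for illustration only and not how Prefect evaluates triggers internally. The function name `zombie_detected_at` and the `(seconds, kind)` event format are hypothetical, and treating a follow-up that arrives exactly at the deadline as on time is an assumption.

```python
def zombie_detected_at(events, within=90.0):
    """Return the time (in seconds) at which the rule would fire, or None.

    events: list of (seconds, kind) tuples sorted by time, where kind is
    "heartbeat" or "terminal". All events belong to one flow run.
    """
    for i, (t, kind) in enumerate(events):
        if kind == "terminal":
            # The run finished normally; nothing further is expected.
            return None
        # Time of the next event of either kind, if there is one.
        following = events[i + 1][0] if i + 1 < len(events) else None
        if following is None or following - t > within:
            # No heartbeat or terminal event arrived inside the window.
            return t + within
    return None


# A healthy run: heartbeats every 30 seconds, then a terminal state.
healthy = [(0, "heartbeat"), (30, "heartbeat"), (60, "heartbeat"), (70, "terminal")]
# A zombie run: the heartbeats simply stop after 60 seconds.
zombie = [(0, "heartbeat"), (30, "heartbeat"), (60, "heartbeat")]

print(zombie_detected_at(healthy))  # None
print(zombie_detected_at(zombie))   # 150.0
```

In the zombie case the last heartbeat arrives at 60 seconds, so the rule fires 90 seconds later, at 150 seconds.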
Adjusting behavior with settings
If PREFECT_RUNNER_HEARTBEAT_FREQUENCY is set to 30 and the trigger waits 90 seconds, the automation will trigger only after 3 heartbeats have been missed.
You can adjust the within value in the trigger definition and PREFECT_RUNNER_HEARTBEAT_FREQUENCY to change how quickly the automation fires after the server stops receiving flow run heartbeats.
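The relationship between the two values is simple arithmetic, sketched below. The helper name `detection_window` is hypothetical, and the calculation assumes heartbeats arrive exactly on schedule until the run dies.

```python
def detection_window(frequency, within):
    """Summarize how a heartbeat frequency and trigger window interact.

    Both arguments are whole seconds. Returns a tuple of:
      - heartbeats missed before the automation fires,
      - shortest delay between the crash and the automation firing,
      - longest delay between the crash and the automation firing.
    """
    if within < frequency:
        # A window shorter than the heartbeat interval would expire
        # between two heartbeats of a healthy run.
        raise ValueError("within must be at least the heartbeat frequency")
    missed = within // frequency
    # The crash can happen anywhere between two heartbeats, so the delay
    # measured from the crash ranges from (within - frequency) to within.
    return missed, within - frequency, within


print(detection_window(30, 90))  # (3, 60, 90)
print(detection_window(60, 90))  # (1, 30, 90)
```

A shorter window detects zombies sooner, but leaves less slack for heartbeats that are delayed rather than missing.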
You can also add further actions to your automation, for example to send a notification when zombie runs are detected.