How to detect and respond to zombie flows

Sudden infrastructure failures (like machine crashes or container evictions) can cause flow runs to become unresponsive and appear stuck in a Running state. To mitigate this, flow runs triggered by deployments can emit heartbeats to drive Automations that detect and respond to these “zombie” flow runs, ensuring they are marked as Crashed if they stop reporting heartbeats.

Enable flow run heartbeat events

You will need to ensure you’re running Prefect version 3.1.8 or greater and set PREFECT_RUNNER_HEARTBEAT_FREQUENCY to an integer greater than 30 to emit flow run heartbeat events.

Create the automation

To create an automation that marks zombie flow runs as crashed, run this script:

from datetime import timedelta

from prefect.automations import Automation
from prefect.client.schemas.objects import StateType
from prefect.events.actions import ChangeFlowRunState
from prefect.events.schemas.automations import EventTrigger, Posture
from prefect.events.schemas.events import ResourceSpecification


my_automation = Automation(
    name="Crash zombie flows",
    trigger=EventTrigger(
        after={"prefect.flow-run.heartbeat"},
        expect={
            "prefect.flow-run.heartbeat",
            "prefect.flow-run.Completed",
            "prefect.flow-run.Failed",
            "prefect.flow-run.Cancelled",
            "prefect.flow-run.Crashed",
        },
        match=ResourceSpecification({"prefect.resource.id": ["prefect.flow-run.*"]}),
        for_each={"prefect.resource.id"},
        posture=Posture.Proactive,
        threshold=1,
        within=timedelta(seconds=90),
    ),
    actions=[
        ChangeFlowRunState(
            state=StateType.CRASHED,
            message="Flow run marked as crashed due to missing heartbeats.",
        )
    ],
)

if __name__ == "__main__":
    my_automation.create()

The trigger definition says that after each heartbeat event for a flow run we expect to see another heartbeat event or a terminal state event for that same flow run within 90 seconds of a heartbeat event.

Adjusting behavior with settings

If PREFECT_RUNNER_HEARTBEAT_FREQUENCY is set to 30, the automation will trigger only after 3 heartbeats have been missed. You can adjust within in the trigger definition and PREFECT_RUNNER_HEARTBEAT_FREQUENCY to change how quickly the automation will fire after the server stops receiving flow run heartbeats. You can also add additional actions to your automation to send a notification when zombie runs are detected.

Workflows

Automations

Workflow Infrastructure

Platform

Extensibility

How to detect and respond to zombie flows

Enable flow run heartbeat events

Create the automation

Adjusting behavior with settings

Workflows

Automations

Workflow Infrastructure

Platform

Extensibility

​Enable flow run heartbeat events

​Create the automation

​Adjusting behavior with settings

Enable flow run heartbeat events

Create the automation

Adjusting behavior with settings