Walk it like I talk it
By Nate Nowack
One of the most common questions from OSS Prefect users is:
How do I run open-source Prefect server in a high availability (HA) mode?
Historically, it hasn’t been directly possible, for a couple of reasons:
- an in-memory event bus
- “singleton server” assumptions (e.g. caching automation objects in memory)
Recently, we’ve been working towards allowing users to run a setup that you can reasonably call HA, by:
- implementing Redis Streams-based messaging (sketched just after this list)
- using Postgres LISTEN/NOTIFY to broadcast events to all server instances
- unwinding those “singleton server” assumptions
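For the curious, opting a server into the Redis-backed messaging is a matter of settings. Here’s a minimal sketch, assuming the `prefect-redis` integration is installed; the `PREFECT_REDIS_MESSAGING_*` variable names are my assumption of how the connection details in the Helm values below are spelled as environment variables:

```bash
# Sketch: point an OSS Prefect server at Redis Streams-backed messaging.
# Assumes `pip install prefect-redis`; the PREFECT_REDIS_MESSAGING_* names
# are assumptions, mirroring the redis block in the Helm values further down.
export PREFECT_MESSAGING_BROKER=prefect_redis.messaging
export PREFECT_MESSAGING_CACHE=prefect_redis.messaging
export PREFECT_REDIS_MESSAGING_HOST=<REDIS_HOST_IP>
export PREFECT_REDIS_MESSAGING_PORT=6379

prefect server start --no-services
```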
We’ve had some brave souls test out the new recommended setup, but there have been some rough edges.
So, in the words of the visionary triumvirate known as Migos, one must:
Walk it like I talk it
… that is to say, we must run and use the HA setup that we suggest to others.
We always compulsively test fixes and features against actual Prefect servers, but it’s been a while since we’ve made a concerted effort to scale a single server installation. The new, broad interest in HA server configurations has made it a great time to do so!
Making our test bed before we lie in it
We decided on the following initial setup in GKE:
- 2 replicas of the Prefect webserver (i.e. `prefect server start --no-services`)
- 1 instance of the background services (i.e. `prefect server services start`)
- 1 instance of the Kubernetes worker (i.e. `prefect worker start --type kubernetes`; see the work pool note just below)
- 1 Google-managed Redis instance
- 1 Google-managed Postgres instance
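One prerequisite the list glosses over: the worker polls a work pool, which needs to exist on the server. If it doesn’t already, creating it is a one-liner (the pool name is a placeholder):

```bash
# Create the kubernetes-type work pool the worker will poll
prefect work-pool create <YOUR_WORK_POOL_NAME> --type kubernetes
```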
For the YAML junkies among us, here are (more or less) the Helm chart configurations we used.

`server.yaml`:
```yaml
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: prefect-server
spec:
  interval: 5m
  chart:
    spec:
      chart: prefect-server
      version: "2025.9.5190948" # Pin to specific version
      sourceRef:
        kind: HelmRepository
        name: prefect
        namespace: flux-system
  values:
    global:
      prefect:
        image:
          repository: prefecthq/prefect
          prefectTag: 3.4.17-python3.11-kubernetes
    server:
      replicaCount: 2
      loggingLevel: WARNING
      uiConfig:
        prefectUiApiUrl: http://localhost:4200/api
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: "1"
          memory: 1Gi
      # Add Redis auth environment variables to server pod
      extraEnvVarsSecret: prefect-redis-env-secret
    migrations:
      enabled: true
    backgroundServices:
      runAsSeparateDeployment: true
      # Use the Redis auth secret for password
      extraEnvVarsSecret: prefect-redis-env-secret
    messaging:
      broker: prefect_redis.messaging
      cache: prefect_redis.messaging
      redis:
        host: <REDIS_HOST_IP>
        port: 6379
        db: 0
        username: default
    # External PostgreSQL (Cloud SQL)
    postgresql:
      enabled: false
    # External Redis (Memorystore)
    redis:
      enabled: false
    # External database connection via secret
    # This secret will be created by SecretProviderClass from GCP Secret Manager
    secret:
      create: false
      name: prefect-db-connection-secret
```
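The release above references two externally-created secrets. In our setup the database secret comes from GCP Secret Manager via a SecretProviderClass, but for illustration, here’s a hedged sketch of creating both by hand; the key names are assumptions about what the chart and `prefect-redis` expect, so check the chart’s values before copying:

```bash
# Sketch only: the key names below are assumptions; verify them against the
# chart's values before relying on this.

# Redis auth, surfaced to server pods via extraEnvVarsSecret
kubectl create secret generic prefect-redis-env-secret \
  --from-literal=PREFECT_REDIS_MESSAGING_PASSWORD='<YOUR_REDIS_AUTH_STRING>'

# External Postgres connection string for the server
kubectl create secret generic prefect-db-connection-secret \
  --from-literal=connection-string='postgresql+asyncpg://<USER>:<PASSWORD>@<HOST>:5432/prefect'
```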
`worker.yaml`:
```yaml
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: prefect-worker
  namespace: <YOUR_NAMESPACE>
spec:
  chart:
    spec:
      chart: prefect-worker
      version: 2025.9.5190948
      sourceRef:
        kind: HelmRepository
        name: prefect
        namespace: flux-system
  driftDetection:
    mode: warn
  install:
    remediation:
      retries: 3
  interval: 5m
  maxHistory: 2
  upgrade:
    remediation:
      retries: 3
  values:
    worker:
      config:
        http2: false
        workPool: <YOUR_WORK_POOL_NAME>
      apiConfig: selfHostedServer
      selfHostedServerApiConfig:
        apiUrl: http://prefect-server.<YOUR_NAMESPACE>.svc.cluster.local:4200/api
      image:
        repository: prefecthq/prefect
        prefectTag: 3.4.17-python3.11-kubernetes
        pullPolicy: Always
      livenessProbe:
        enabled: true
      revisionHistoryLimit: 2
      resources:
        requests:
          memory: 1Gi
        limits:
          memory: 2Gi
```
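With both releases applied, a quick sanity check on the webserver replicas is to port-forward the service and hit the health endpoint (the service name may differ depending on your release name):

```bash
# Forward the server service locally and check the API is healthy
kubectl port-forward svc/prefect-server 4200:4200 &
curl http://localhost:4200/api/health
```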
You can read more about the Helm chart here, and more about Flux here.
Note that as of `prefect==3.4.16`, `prefect server start --no-services` truly avoids running any services, which sidesteps a bug that caused Redis connection errors in the server pod. We still need to audit each background service independently to ensure they can be horizontally scaled.
There were several PRs that came out of this deployment process:
and in general, in the coming weeks, we expect to feel more of the friction that Prefect server operators have felt in the past, as we expand our use of the server for our own needs.
For example, I immediately got very annoyed that I had to click many buttons to pause all schedules for all deployments, so I opened the #18860 PR above to add an `--all` flag to `prefect deployment schedule pause`.
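Once that flag lands, pausing every schedule becomes a one-liner instead of a pile of clicks:

```bash
# With the --all flag from #18860, pause schedules across all deployments at once
prefect deployment schedule pause --all
```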
Running flows against our new setup
We’ve got a nice set of flows deployed and executing successfully via our Kubernetes worker, which should help us catch bugs early using the nightly dev builds and generally act as a canary in the coal mine for issues with HA setups.
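For reference, a canary deployment like ours can be created with `prefect deploy`; the entrypoint, names, and schedule here are hypothetical placeholders:

```bash
# Hypothetical canary: deploy a flow to the kubernetes work pool on a schedule
prefect deploy flows/healthcheck.py:healthcheck \
  --name ha-canary \
  --pool <YOUR_WORK_POOL_NAME> \
  --cron "*/15 * * * *"
```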