Deploy

The whole stack is a single Compose file, dataland-infrastructure/compose.yml. Deploy is one shell script, deploy.sh, that fast-forwards every repo, derives an immutable image tag, runs a pre-deploy boot guard, then rolls the stack with docker compose up -d --build.

Where this runs

Production lives on the Spark DGX VDS (ege@100.124.170.43, repos under /home/cobanov/DATALAND/). The public surface is fronted by Cloudflare tunnels; operators on the Tailscale tailnet reach the stack directly via the *_PUBLIC_BIND host bindings. Service-to-service traffic always uses the dataland-network bridge and is unaffected by host port binding. See Services for per-service detail.

Recent changes

  • DAT-291 added the pre-deploy boot-guard fail-fast (run the real agent boot guard against the production .env before any rebuild), wired the tailnet *_PUBLIC_BIND publishing across stateful services, added the docs.dataland.chat service, and folded prod secret rotation into the deploy flow.
  • DAT-269 standardized every service on gemini-3.5-flash (chat + Gemini captioning); RAG vectors use gemini-embedding. A deploy after a model bump just rebuilds the agent + rag images — the model id is read from .env at boot.
  • DAT-82 introduced the host-metrics Compose profile (cAdvisor + node-exporter) that deploy.sh layers automatically on Linux deploys.

First time on a new host

cd /home/cobanov/DATALAND
cp dataland-infrastructure/.env.example .env
mkdir -p secrets && chmod 700 secrets   # (1)!
# fill in .env, drop secrets/gcp-key.json, then:
chmod 600 .env secrets/gcp-key.json   # (2)!
  1. 700 on secrets/ so only the operator can traverse the directory. The gcp-key.json you drop inside, and the .env that sits alongside the directory, still need their own 600; the directory mode alone does not protect the files.
  2. DAT-88 — the default umask on a fresh secrets/ would leave new files world-readable. The Compose mount is :ro from the container, but host-side file permissions are the operator's job, so lock both the .env and the GCP key down to 600 (owner read/write only).

chmod 600 matters: the default umask on a fresh secrets/ would leave new files world-readable. The compose mount is :ro from the container, but host permissions are the operator's job (DAT-88).
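A quick way to confirm the modes actually landed is to read them back. The sketch below is an illustration, not part of deploy.sh: file_mode and check_mode are hypothetical helper names, and the stat fallback assumes either GNU stat (Linux) or BSD stat (macOS).

```shell
# Hypothetical helper: print a file's octal mode.
# Tries GNU stat first (Linux), then BSD stat (macOS).
file_mode() {
  stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1" 2>/dev/null
}

# Hypothetical helper: succeed only if every path has exactly the expected mode.
# Usage: check_mode 600 .env secrets/gcp-key.json
check_mode() {
  local expected="$1" path mode rc=0
  shift
  for path in "$@"; do
    mode="$(file_mode "$path")" || { echo "missing: $path" >&2; rc=1; continue; }
    if [ "$mode" != "$expected" ]; then
      echo "$path is $mode, expected $expected" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Run from /home/cobanov/DATALAND, `check_mode 600 .env secrets/gcp-key.json && check_mode 700 secrets` would flag a file that kept the umask default.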

Fill the :?-required secrets before the first deploy

Several services hard-fail at startup if their secret is missing or still a placeholder. Compose enforces a subset via ${VAR:?...}:

| Variable | Service | Enforced by |
| --- | --- | --- |
| REDIS_PASSWORD | redis + every consumer (DAT-76) | Compose :? + healthcheck |
| MUSEUM_PASSWORD, MUSEUM_SESSION_SECRET | museum-api dashboard gate | Compose :? |
| RDC_REDIS_URL | museum-api (external RDC redis) | Compose :? |
| INFORMATION_WEBUI_PASSWORD, INFORMATION_WEBUI_SESSION_SECRET | Catalog Studio | Compose :? |
| DOCS_PASSWORD | docs.dataland.chat basic auth | Compose :? + entrypoint |
| GRAFANA_ADMIN_PASSWORD | Grafana (DAT-266) | Compose :? |

Anything else flagged by the boot guard (DAT-291) is caught on the next deploy — see below.
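Before the first deploy it can help to list which of the Compose-enforced variables are still empty. The following is a minimal sketch under stated assumptions: missing_vars is a hypothetical helper, the variable list is copied from the table above, and it only detects absent or empty values. It does not detect placeholder values; that is what the boot guard is for.

```shell
# Variable names taken from the table above.
required_vars="REDIS_PASSWORD MUSEUM_PASSWORD MUSEUM_SESSION_SECRET RDC_REDIS_URL
INFORMATION_WEBUI_PASSWORD INFORMATION_WEBUI_SESSION_SECRET
DOCS_PASSWORD GRAFANA_ADMIN_PASSWORD"

# Hypothetical helper: print each required variable that is absent or empty
# in the given env file, one per line. Returns 1 if any are missing.
# Limitation: a quoted empty value such as VAR="" is not detected.
missing_vars() {
  local env_file="$1" var rc=0
  for var in $required_vars; do
    # Counts as set only if the file has a line of the form VAR=<something>.
    if ! grep -Eq "^${var}=.+" "$env_file"; then
      echo "$var"
      rc=1
    fi
  done
  return "$rc"
}
```

Usage: `missing_vars .env` prints nothing and returns 0 when every listed variable has a value.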

Routine deploy

cd /home/cobanov/DATALAND
./dataland-infrastructure/deploy.sh

deploy.sh first cds to the parent of dataland-infrastructure, so every path is relative to /home/cobanov/DATALAND. It then runs, in order:

flowchart TD
    A["git checkout main<br/>git pull --ff-only<br/>(all 6 repos)"] --> B["Derive IMAGE_TAG<br/>= UTC timestamp + infra short SHA"]
    B --> C{"dataland/agent:latest<br/>image exists?"}
    C -- "yes" --> D["DAT-291 boot guard:<br/>docker run --rm --env-file .env<br/>agent:latest → assert_boot_required_env()"]
    C -- "no (first deploy)" --> F
    D -- "exit 1" --> E["ABORT before rebuild<br/>(no crash-loop)"]
    D -- "exit 0" --> F["docker compose<br/>--profile host-metrics<br/>up -d --build"]
    F --> G["Retag each built image<br/>:IMAGE_TAG → :latest"]
    G --> H["docker compose ps<br/>echo Active deploy: IMAGE_TAG"]

1. Fast-forward every repo

The script checks out main and git pull --ff-only for all six repos:

repos=(
  dataland-infrastructure   # (1)!
  dataland-agent
  dataland-rag-v2
  dataland-museum
  dataland-notification
  dataland-atlas
)
  1. All six repos are checked out to main and pulled with git pull --ff-only. --ff-only is deliberate: a deploy must never silently create a merge commit on the box. A repo that has diverged from origin/main fails the pull loudly and stops the deploy before anything is rebuilt.

--ff-only is deliberate: a deploy must never silently create a merge commit on the box. If a repo has diverged from origin/main, the pull fails loudly and the deploy stops before anything is rebuilt.
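The fail-fast behaviour can be sketched as a loop that stops at the first repo that cannot fast-forward. This is an illustration of the idea, not the script's actual code: fast_forward_all is a hypothetical name, and it assumes the working directory holds all the checkouts.

```shell
# Hypothetical sketch: check out main and fast-forward each repo in turn.
# Returns non-zero at the first failure, so nothing later in a deploy runs.
fast_forward_all() {
  local repo
  for repo in "$@"; do
    git -C "$repo" checkout main || return 1
    # --ff-only refuses to create a merge commit; a diverged repo fails here.
    git -C "$repo" pull --ff-only || {
      echo "ABORT: could not fast-forward $repo" >&2
      return 1
    }
  done
}
```

Usage: `fast_forward_all dataland-infrastructure dataland-agent` from /home/cobanov/DATALAND.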

2. Derive IMAGE_TAG

infra_sha=$(git -C dataland-infrastructure rev-parse --short HEAD)   # (1)!
export IMAGE_TAG="${IMAGE_TAG:-$(date -u +%Y%m%d-%H%M%S)-${infra_sha}}"   # (2)!
  1. Short SHA of the dataland-infrastructure repo at deploy time — the deploy is keyed to the infra repo HEAD, not the per-service repos, since compose.yml lives here.
  2. DAT-77: ${IMAGE_TAG:-…} only mints a fresh <UTC-timestamp>-<infra-sha> tag (e.g. 20260604-093015-eb4857a) when IMAGE_TAG is not already set, which is what lets you pin a deploy to an existing tag. The timestamp + SHA makes every deploy's image set immutable and rollback-addressable.

IMAGE_TAG is a UTC timestamp plus the infrastructure repo short SHA, e.g. 20260604-093015-eb4857a (DAT-77). Every internally-built image in compose.yml is tagged dataland/<svc>:${IMAGE_TAG:-latest}, so each deploy produces an immutable, timestamped set of images you can roll back to.

Pinning a deploy

Set IMAGE_TAG in the environment (or in .env) to reuse an existing tag instead of minting a new one. Leaving it unset in .env means a plain docker compose up uses whatever :latest the local Docker last built. deploy.sh always exports a fresh value unless you pre-set one.
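The mint-or-pin behaviour comes down to one parameter expansion. The sketch below isolates it so it can be tried without running a deploy; resolve_image_tag and is_minted_tag are hypothetical names, and the 7 to 40 hex-character range for the short SHA is an assumption about what git prints.

```shell
# Hypothetical helper: return the pinned IMAGE_TAG if one is set,
# otherwise mint <UTC-timestamp>-<short-sha> for the given SHA.
resolve_image_tag() {
  local sha="$1"
  echo "${IMAGE_TAG:-$(date -u +%Y%m%d-%H%M%S)-${sha}}"
}

# Hypothetical helper: true if a tag has the minted shape
# YYYYMMDD-HHMMSS-<hex sha>, false for anything else (e.g. "latest").
is_minted_tag() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{8}-[0-9]{6}-[0-9a-f]{7,40}$'
}
```

With IMAGE_TAG unset, `resolve_image_tag eb4857a` mints a new tag ending in -eb4857a; with IMAGE_TAG=20260604-093015-eb4857a exported, it returns that value unchanged.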

3. Pre-deploy boot guard (DAT-291)

Before rebuilding anything, deploy.sh runs the real agent boot guard against the production .env using the current dataland/agent:latest image:

if docker image inspect dataland/agent:latest > /dev/null 2>&1; then   # (1)!
  if ! docker run --rm --env-file .env dataland/agent:latest /app/.venv/bin/python -c "
import sys
from app.runtime import assert_boot_required_env
try:
    assert_boot_required_env()
except RuntimeError as exc:
    print(exc, file=sys.stderr)
    sys.exit(1)
"; then
    echo "ABORT: .env failed the agent boot guard (see above)." >&2   # (2)!
    exit 1   # (3)!
  fi
fi
  1. Guards the whole block on dataland/agent:latest existing. On the very first deploy there is no :latest image to run the guard from, so the check is skipped and the deploy trusts the operator's .env (DAT-291).
  2. Runs the real assert_boot_required_env() — the same function the container runs at boot — against the production .env, so the pre-flight can never drift from the boot-time contract. --rm discards the throwaway container; --env-file .env feeds it the exact env a real boot would see.
  3. Exits non-zero before docker compose ... --build is ever reached. Nothing is rebuilt or rolled — fix the flagged secrets in .env and re-run deploy.sh. This is the check that prevents the crash-loop outage.

Why this exists: the agent's boot guard (assert_boot_required_env) crash-loops the container if the production .env still holds placeholder/default secrets. Before DAT-291, a freshly-built agent would build cleanly, start, fail the guard, restart, and fail again, taking chat offline. That is the exact outage this check prevents.

Properties of the guard

  • It runs the same function the container runs at boot, so the pre-flight check can never drift from the boot-time contract.
  • It is a no-op outside APP_ENV=production, so dev deploys are unaffected.
  • It is skipped on the very first deploy when no dataland/agent:latest image exists yet (nothing to run the guard from). The first deploy therefore trusts the operator's .env; subsequent deploys are guarded.
  • On failure the script exits 1 before docker compose ... --build is invoked. Nothing is rebuilt, nothing is rolled. Fix the flagged secrets in .env, re-run deploy.sh.
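The skip-on-first-deploy and abort-on-failure properties can be sketched as a small gate that takes the guard command as arguments. This is a simplified illustration of the control flow only: preflight_gate is a hypothetical name, and the real script inlines the docker run shown earlier rather than calling a helper.

```shell
# Hypothetical sketch of the gate's control flow.
# Usage: preflight_gate <image> <guard command...>
preflight_gate() {
  local image="$1"
  shift
  # No image yet (first deploy): nothing to run the guard from, so skip.
  if ! docker image inspect "$image" > /dev/null 2>&1; then
    echo "skip: $image not found, guard not run"
    return 0
  fi
  # Guard failed: return non-zero so the caller stops before any rebuild.
  if ! "$@"; then
    echo "ABORT: pre-flight failed for $image" >&2
    return 1
  fi
  echo "ok: pre-flight passed for $image"
}
```

A caller would chain it ahead of the build, for example `preflight_gate dataland/agent:latest <guard command> || exit 1`, so a failing guard never reaches docker compose.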

4. Build + roll the stack (with the host-metrics profile)

docker compose -f dataland-infrastructure/compose.yml --profile host-metrics --env-file .env up -d --build   # (1)!
  1. DAT-82: --profile host-metrics layers in cadvisor + node-exporter, which mount the host /proc and /sys and only work on Linux. deploy.sh always passes it because production is the Spark Linux box; on a macOS dev box you drop the flag so those two containers stay off. --build rebuilds only images whose Dockerfile/source changed, then -d recreates each service detached.

This rebuilds any image whose Dockerfile or source changed, then recreates each service with the new image. From then on, restart: unless-stopped (from the x-common-env anchor) brings a container back up if it exits, unless an operator stopped it explicitly.

The host-metrics profile (DAT-82)

The profile enables cadvisor + node-exporter, which mount the host /proc and /sys and only work on Linux. deploy.sh always passes --profile host-metrics because production is the Spark Linux box. On a macOS dev box you would run docker compose up without the profile so those two containers stay off. The rest of the monitoring stack (prometheus, grafana, alertmanager, postgres-exporter, redis-exporter) has no profile and starts everywhere.

The simulator profile is separate and is not enabled by deploy.sh; the simulator only runs when you explicitly opt in — see Simulator.

5. Retag :IMAGE_TAG → :latest

for image in "${internal_images[@]}"; do
  if docker image inspect "${image}:${IMAGE_TAG}" > /dev/null 2>&1; then   # (1)!
    docker tag "${image}:${IMAGE_TAG}" "${image}:latest"   # (2)!
  fi
done
  1. The inspect guard skips any image that was not built this run. Only images that actually got the fresh :IMAGE_TAG are retagged.
  2. Points :latest at the just-built :IMAGE_TAG. This is what makes the next deploy's boot guard (step 3) and any plain docker compose up resolve :latest to the most recent deploy.

The six internal images (dataland/auth-server, dataland/rag, dataland/museum-api, dataland/agent, dataland/notification, dataland/information-webui) get a :latest alias pointing at the just-built :IMAGE_TAG. The next boot guard run (step 3 of the following deploy) and any plain docker compose up then resolve :latest to the most recent deploy. The conditional skips images that weren't built this run.
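To confirm the alias landed after a deploy, one option is to compare image IDs. The sketch below assumes only the standard docker image inspect --format '{{.Id}}' call; latest_matches_tag is a hypothetical helper, not something deploy.sh provides.

```shell
# Hypothetical check: does <image>:latest resolve to the same image ID
# as <image>:<tag>? Returns non-zero if either reference is missing.
latest_matches_tag() {
  local image="$1" tag="$2" latest_id tag_id
  latest_id="$(docker image inspect --format '{{.Id}}' "${image}:latest" 2>/dev/null)" || return 1
  tag_id="$(docker image inspect --format '{{.Id}}' "${image}:${tag}" 2>/dev/null)" || return 1
  [ "$latest_id" = "$tag_id" ]
}
```

Usage: `latest_matches_tag dataland/agent "$IMAGE_TAG" || echo "agent :latest is stale"`.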

6. Report

docker compose -f dataland-infrastructure/compose.yml --env-file .env ps
echo "Active deploy: ${IMAGE_TAG}"   # (1)!
echo "Roll back with: IMAGE_TAG=<previous-tag> docker compose -f dataland-infrastructure/compose.yml --env-file .env up -d"   # (2)!
  1. Prints the immutable tag this deploy just stamped, so you have a record of exactly which image set is live.
  2. Emits the ready-to-paste rollback one-liner with the current tag interpolated — copy it before the next deploy and you have a no-rebuild path back to this exact state.

The final lines print the active tag and the exact rollback one-liner.

What deploy.sh does NOT do

The docs service is not in internal_images, so it never gets a :IMAGE_TAG → :latest retag in step 5 (it builds fine; it just isn't aliased). The script also does not run the smoke suite, run migrations explicitly (agent + auth run them on startup), or touch volumes.

Surgical single-service redeploy

A full deploy.sh rebuilds whatever changed across six repos. When you've changed exactly one service and want to roll only that container — without re-evaluating dependencies or disturbing healthy services — use a targeted up:

cd /home/cobanov/DATALAND

# Rebuild + recreate ONE service, leave its dependencies untouched:
docker compose -f dataland-infrastructure/compose.yml --env-file .env \
  up -d --no-deps --build agent   # (1)!
  1. --no-deps stops Compose from also recreating postgres, redis, rag, museum-api just because agent depends_on them — those stay up and serving. --build rebuilds only the target image from its repo context. The final arg is the compose service name, not the container name (valid names: agent, rag, museum-api, notification-worker, notification-api, information-webui, auth, docs).

  • --no-deps stops Compose from also recreating postgres, redis, rag, museum-api just because agent depends_on them. Those stay up and serving.
  • --build rebuilds the target image from its repo context.
  • Pass the compose service name, not the container name: agent, rag, museum-api, notification-worker, notification-api, information-webui, auth, docs.

Pin the surgical redeploy to the active tag

A bare surgical up builds a :latest image (no IMAGE_TAG exported). To keep the deployed set consistent and immutable, mint a tag the same way deploy.sh does:

export IMAGE_TAG="$(date -u +%Y%m%d-%H%M%S)-$(git -C dataland-infrastructure rev-parse --short HEAD)"   # (1)!
docker compose -f dataland-infrastructure/compose.yml --env-file .env \
  up -d --no-deps --build agent
docker tag dataland/agent:${IMAGE_TAG} dataland/agent:latest   # (2)!
  1. Mints the tag exactly the way deploy.sh does (UTC timestamp + infra short SHA) so a surgical redeploy stays immutable instead of building an anonymous :latest.
  2. deploy.sh's retag loop (step 5) does not run on a surgical up, so you alias :latest to the freshly-built tag by hand — otherwise the next full deploy's boot guard would still resolve :latest to the old image.

An .env change needs up, not restart

docker compose restart <svc> reuses the old environment. To pick up an edited .env, recreate the container: docker compose ... up -d --no-deps <svc>. (See the runbook — ".env change didn't take effect".)

The consumer-name guard restart window (DAT-109)

The notification consumers are the one place where a surgical redeploy has a timing constraint. notification-worker and notification-api both join the museum:telemetry Redis Streams consumer group (CONSUMER_GROUP, default notification-service) and each must own a unique consumer name in that group, or two replicas race over the same PEL slot and acks silently collide.

To defend against this, the consumer refuses to start if a consumer with its name is already registered in the group with an idle time short enough to mean it's still alive (app/consumer.py::_reject_duplicate_consumer, DAT-109):

RuntimeError: consumer_name '<name>' is already registered in group
'notification-service' with idle=<n>ms (< pel_reap_idle_ms=60000ms).
Another worker is using this name. Set CONSUMER_NAME to a unique value
or wait for the prior process to age out.

How this interacts with deploys:

  • Default CONSUMER_NAME is unique per process: hostname-<8 hex> (_default_consumer_name). So a normal recreate (old container stops, new one starts with a fresh uuid suffix) does not collide. The guard is invisible in the happy path.
  • If you pin CONSUMER_NAME to a fixed value, the new container can collide with the outgoing one during the brief overlap of a rolling recreate, or with a still-registered prior process. The guard then aborts the new container.

The restart window

The "still alive" threshold is pel_reap_idle_ms, default 60s (PEL_REAP_IDLE_MS=60000). A consumer whose last activity is older than that is treated as gone — the PEL reaper will reclaim its orphaned entries and re-registering the same name is safe (logged as consumer_name_reclaimed_from_idle_predecessor).

Practically, if you run the notification service with a fixed CONSUMER_NAME, leave ~60s between stopping the old container and starting the new one, or just rely on the unique default. The cleanest surgical redeploy of a fixed-name worker is:

docker compose -f dataland-infrastructure/compose.yml --env-file .env stop notification-worker
# wait > pel_reap_idle_ms (60s) for the old name to age out, then:
docker compose -f dataland-infrastructure/compose.yml --env-file .env \
  up -d --no-deps --build notification-worker   # (1)!
  1. DAT-109 — only needed when CONSUMER_NAME is pinned to a fixed value. The new replica's _reject_duplicate_consumer guard aborts startup if a consumer with the same name is still registered with idle < pel_reap_idle_ms (default 60s). Stopping first and waiting out that window lets the PEL reaper reclaim the old name before the new container claims it. With the unique default name (hostname-<8 hex>) you can skip the stop-and-wait entirely.

Or, set a fresh CONSUMER_NAME for the new replica and skip the wait entirely.
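The restart-window arithmetic can be sketched as two small helpers. This mirrors only the comparison quoted in the error message (reject while idle < pel_reap_idle_ms); the helper names are hypothetical and this is not the service's code. The idle value itself would come from Redis, for example from XINFO CONSUMERS on the museum:telemetry stream and the notification-service group.

```shell
# Threshold from the text above; override by exporting PEL_REAP_IDLE_MS.
PEL_REAP_IDLE_MS="${PEL_REAP_IDLE_MS:-60000}"

# Hypothetical helper: a fixed name is safe to reuse once the old
# consumer's idle time has reached the reap threshold.
# Usage: name_is_reusable <idle_ms>
name_is_reusable() {
  local idle_ms="$1"
  [ "$idle_ms" -ge "$PEL_REAP_IDLE_MS" ]
}

# Hypothetical helper: milliseconds still to wait before the name ages out
# (0 if it is already safe to reuse).
remaining_wait_ms() {
  local idle_ms="$1" left
  left=$(( PEL_REAP_IDLE_MS - idle_ms ))
  [ "$left" -gt 0 ] || left=0
  echo "$left"
}
```

For a consumer that has been idle for 15 s, `remaining_wait_ms 15000` reports 45000, which is how much longer to leave between the stop and the up.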

Rebuilding the docs container

docs.dataland.chat is its own Compose service (docs) built from dataland-infrastructure/docs/Dockerfile — a two-stage build that runs mkdocs build --strict (MkDocs 1.6.1 + Material 9.5.44 + pymdown-extensions 10.11.2) and serves the static site/ from nginx:1.27-alpine.

Edits to any page under docs/src/ (this site) or to docs/mkdocs.yml only land after the docs image is rebuilt:

cd /home/cobanov/DATALAND
docker compose -f dataland-infrastructure/compose.yml --env-file .env \
  up -d --no-deps --build docs   # (1)!
  1. Edits to any page under docs/src/ or to docs/mkdocs.yml are baked into the static site/ at image-build time, so they only land after a --build. --no-deps keeps the rest of the stack untouched. The two-stage build runs mkdocs build --strict, so a broken link fails the build here rather than serving a broken page.

--strict will fail the build on a broken link

mkdocs build --strict (in the Dockerfile) turns broken internal links, missing nav entries, and broken anchors into build errors. A docs redeploy that fails to come up is almost always a --strict failure — check docker logs dataland-docs (build output) and fix the link/anchor before re-running. This is why cross-links between pages must be valid relative paths, e.g. [RAG](../services/rag.md).

How the docs container is served and secured (DAT-291 added the service, DAT-73 the bind policy):

  • Basic auth is generated at container start by docker-entrypoint.sh: htpasswd -bcB from DOCS_USERNAME / DOCS_PASSWORD, so the password lives only in env + memory, never baked into the image. The entrypoint hard-fails if DOCS_PASSWORD is unset.
  • TLS terminates at Cloudflare; the credential never travels plaintext as long as docs is reached through the cloudflared tunnel.
  • Health is GET /healthz (auth-exempt in nginx.conf), checked with wget --spider.
  • Bind policy (DAT-73): published on two host IPs — 127.0.0.1:${DOCS_PUBLIC_PORT:-4148} for the host cloudflared systemd unit (→ public docs.dataland.chat) and ${DOCS_PUBLIC_BIND:-100.124.170.43}:4148 for direct tailnet browser access (spark:4148). Never set DOCS_PUBLIC_BIND to 0.0.0.0 — basic auth is the only protection.
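The bind rule lends itself to a one-line sanity check before a deploy. The sketch below is a hypothetical guard, not something the stack ships: is_safe_bind rejects the wildcard addresses and an empty value, on the assumption that a port mapping with no host IP publishes on all interfaces.

```shell
# Hypothetical guard for a *_PUBLIC_BIND value.
# Returns 0 for a specific address, 1 for empty or all-interfaces values.
is_safe_bind() {
  case "$1" in
    ""|0.0.0.0|::|"[::]"|"*") return 1 ;;
    *) return 0 ;;
  esac
}
```

Usage: `is_safe_bind "${DOCS_PUBLIC_BIND:-100.124.170.43}" || { echo "refusing wildcard bind" >&2; exit 1; }`.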

Adding docs to the public tunnel

If docs.dataland.chat is a new ingress, add the route once in the Cloudflare Zero Trust dashboard (Tunnels → minotaur tunnel → Public Hostname → Add), pointing at http://127.0.0.1:4148. After that, every docs change is just a --no-deps --build docs.

Rollback

Deploy strategy is :latest in place — there is no blue-green. Rollback is a redeploy of a previous image. Because every deploy stamps an immutable IMAGE_TAG, the fastest rollback re-points the stack at a prior tag without rebuilding:

cd /home/cobanov/DATALAND
IMAGE_TAG=<previous-tag> \
  docker compose -f dataland-infrastructure/compose.yml --env-file .env up -d   # (1)!
  1. No --build — this re-points the running stack at an already-built prior tag, so rollback is near-instant. The strategy is :latest in place with no blue-green, which is why every deploy stamps an immutable IMAGE_TAG you can address here. List candidates with docker images 'dataland/*'.

(deploy.sh prints exactly this line, with the active tag, at the end of every run.) List available tags with docker images 'dataland/*'.
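Because minted tags start with a UTC timestamp, plain lexicographic order is also chronological, which makes picking the previous tag mechanical. The sketch below is a hypothetical helper (previous_tag) that reads tags on stdin; it ignores anything that does not look like a minted tag, such as latest.

```shell
# Hypothetical helper: given tags on stdin (one per line) and the active tag,
# print the newest minted tag that sorts before it. Returns 1 if none exists.
previous_tag() {
  local current="$1"
  grep -E '^[0-9]{8}-[0-9]{6}-' | LC_ALL=C sort -u |
    awk -v cur="$current" '
      # Concatenating "" forces a string comparison.
      ($0 "") < (cur "") { prev = $0 }
      END { if (prev == "") exit 1; print prev }
    '
}
```

One way to feed it is `docker images dataland/agent --format '{{.Tag}}' | previous_tag "$IMAGE_TAG"`, then pass the result as IMAGE_TAG to the rollback command above.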

If the bad state is in source rather than an image you still have locally, roll the affected repo back and rebuild:

  1. cd /home/cobanov/DATALAND/dataland-<repo>
  2. git log --oneline -10 → identify the last known-good commit.
  3. git checkout <sha> (or git reset --hard <sha> if bad code was already pushed).
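  4. cd /home/cobanov/DATALAND && ./dataland-infrastructure/deploy.sh (or a surgical up --no-deps --build <svc> for a single repo). Note that deploy.sh checks out main and fast-forwards every repo first (step 1), so a checkout or reset made only on the box is moved back to origin/main; use the surgical up for a local-only rollback, or revert origin/main before running deploy.sh.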
  5. Confirm health: docker compose -f dataland-infrastructure/compose.yml --env-file .env ps.

For dependency-upgrade and migration rollbacks (uv.lock revert, alembic hand-revert), see the on-call runbook — the migration path is forward-only, so a bad migration is reverted by hand or restored from a Postgres dump.

Smoke test

After a deploy:

bash dataland-infrastructure/scripts/smoke.sh

smoke.sh runs a read-only suite per service (agent, museum, rag-v2, notification, information-webui), prints a green/red rollup, and writes logs to dataland-infrastructure/.smoke-logs/.

| Exit code | Meaning |
| --- | --- |
| 0 | all green |
| 1 | one or more suites failed |
| 2 | one or more skipped (tunnel down, suite missing) |
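When the smoke suite is run from another script, the exit codes in the table above can be mapped to a message with a case statement. This is a small sketch; describe_smoke_exit is a hypothetical name and the fallback branch is an assumption about codes the table does not list.

```shell
# Hypothetical helper: translate the smoke suite's exit code into
# the meaning given in the table above.
describe_smoke_exit() {
  case "$1" in
    0) echo "all green" ;;
    1) echo "one or more suites failed" ;;
    2) echo "one or more suites skipped" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}
```

Usage: run `bash dataland-infrastructure/scripts/smoke.sh`, then call `describe_smoke_exit "$?"` on the next line.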

For full coverage including write paths, open the internal-port tunnel and flip writes on:

ssh -L 4143:127.0.0.1:4143 -L 8080:127.0.0.1:8080 -L 4145:127.0.0.1:4145 \
    ege@100.124.170.43   # (1)!

SMOKE_ALLOW_WRITES=1 \
SMOKE_NOTIFICATION_REDIS_CLEANUP=tunneled \
bash dataland-infrastructure/scripts/smoke.sh   # (2)!
  1. Forwards the loopback-only internal ports (4143, 8080, 4145) over SSH so the write-path suites can reach services that are bound to 127.0.0.1 on the box and are not exposed publicly.
  2. SMOKE_ALLOW_WRITES=1 flips the suite from read-only to exercising write paths — only safe behind the tunnel. SMOKE_NOTIFICATION_REDIS_CLEANUP=tunneled tells the suite to clean up its notification Redis test keys through the forwarded port rather than assuming a local redis.

There is also an end-to-end visit-flow harness, scripts/smoke-visit-flow.sh, which exercises the ticket → telemetry → notification path and tails dataland-notification-worker to confirm rules fire.

Logs

docker compose -f dataland-infrastructure/compose.yml --env-file .env logs -f agent
docker compose -f dataland-infrastructure/compose.yml --env-file .env logs -f museum-api
# ...etc

Each service also writes JSONL to ${DATALAND_LOG_DIR:-/home/cobanov/DATALAND/logs}/<service>/. The Docker json-file driver is capped at 10 MiB × 5 rotations (50 MiB max per container, DAT-81) so a chatty service can't fill host disk; the bind-mounted JSONL needs a separate logrotate config — see reports/runbook.md.

Simulator

For local end-to-end testing without an Empatica band, layer compose.sim.yml:

SIM_TICKET_ID=<a-real-ticket> \
SIM_REDIS_PASSWORD=$(openssl rand -hex 24) \
SIM_FORCE_HR_SPIKE=1 SIM_FORCE_EDA_SPIKE=1 \
bash dataland-infrastructure/scripts/start-simulator.sh   # (1)!

docker logs -f dataland-notification-worker  # (2)!
  1. SIM_TICKET_ID must be a real ticket so the synthetic telemetry maps to a routable visitor. SIM_REDIS_PASSWORD is minted fresh for the isolated sidecar redis, not reused from prod. SIM_FORCE_HR_SPIKE / SIM_FORCE_EDA_SPIKE inject heart-rate and EDA spikes so the notification rules actually fire instead of waiting on organic signal.
  2. Tail the worker to watch the rules trigger end-to-end. Because the overlay points notification-worker at the sim redis, these are sim events only — the live dataland-redis is untouched and real visitors are undisturbed.

The overlay does three things:

  1. Spins up a sidecar dataland-redis-sim (separate volume, loopback port 4149).
  2. Spins up dataland-telemetry-sim, a Python publisher that XADDs synthetic events to museum:telemetry on the sim redis.
  3. Recreates notification-worker + notification-api with REDIS_HOST pointed at the sim redis. The main dataland-redis keeps serving every other service, so real visitors aren't disturbed.

Stop with bash dataland-infrastructure/scripts/stop-simulator.sh.

Never replay simulation telemetry into live redis

The playback/simulator path must stay isolated on its own redis container. Replaying museum-simulation Avro/telemetry into the production dataland-redis would corrupt the live museum:telemetry stream that the notification consumers drain.

Destructive reset

./dataland-infrastructure/reset-stack.sh   # (1)!
  1. Wraps docker compose down -v --remove-orphans then a fresh up -d --build. The -v deletes every named volume (postgres-data, qdrant-data, redis-data, auth-data, prometheus-data, alertmanager-data, grafana-data, plus the webui SQLite bind). This is not reversible from inside the stack — read reports/backup-restore.md first, and note a wiped auth-data means re-provisioning the JWKS mirror (DAT-286, data/extra_jwks.json).

This is docker compose down -v --remove-orphans followed by a fresh up -d --build. -v deletes every named volume: postgres-data, qdrant-data, redis-data, auth-data, prometheus-data, alertmanager-data, grafana-data, plus the webui SQLite bind. Read reports/backup-restore.md first: this is not reversible from inside the stack, and a wiped auth-data means re-provisioning the JWKS mirror (DAT-286, data/extra_jwks.json).

Inspect

docker compose -f dataland-infrastructure/compose.yml --env-file .env ps
docker compose -f dataland-infrastructure/compose.yml --env-file .env config   # (1)!
docker images 'dataland/*'   # (2)!
  1. Renders the fully resolved Compose config with .env interpolated and anchors expanded — the canonical way to confirm what a service will actually run with (env, binds, profiles) before or after a deploy.
  2. Lists every built dataland/* image tag. Each :IMAGE_TAG here is a rollback candidate you can feed straight into the IMAGE_TAG=<previous-tag> … up -d one-liner.

Adding a new service

  1. Add the service block to compose.yml, attached to dataland-network and inheriting <<: *common-env.
  2. Pin mem_limit / cpus to a realistic budget — defaults are unbounded (DAT-80).
  3. Bind host port to 127.0.0.1: (and optionally a *_PUBLIC_BIND tailnet IP, DAT-73) unless it's genuinely a public ingress. Never 0.0.0.0.
  4. Add a healthcheck.
  5. If it builds an internal image, add dataland/<svc> to the internal_images array in deploy.sh so it gets the :IMAGE_TAG → :latest retag.
  6. Re-deploy. If it's public-facing, add the Cloudflare ingress route (Zero Trust → Tunnels → minotaur tunnel → Public Hostname → Add).
  7. Document it: copy a service page in dataland-infrastructure/docs/src/services/, add it to docs/mkdocs.yml nav (so the --strict build passes), and rebuild the docs container.