Service hosting & relocation

Working document — first drafted 2026-05-28 ahead of the museum infra alignment call. Status: decisions landing. The directional GCP split is captured in the sibling Migration plan (Spark ↔ GCP); this page stays the operator-facing "how the stack maps to a host and what to re-provision when it moves" reference.

This page answers three operational questions:

  1. What runs where today and what the opening per-service placement is (the matrix below).
  2. Which on-disk state must survive a move — every named volume + host bind that holds authoritative data.
  3. What you have to re-provision by hand after the move — secrets, the GCS key ACL for the non-root runtime user, the JWKS mirror, and the museum knowledge re-ingest. These are the steps a git clone && deploy.sh does not do for you, and the ones that have bitten us.

Recent changes

This page reflects the 2026-06-03 → 2026-06-04 change-set. The provisioning gotchas were materially reshaped by DAT-291 (deploy.sh now fails fast on placeholder prod secrets + tailnet *_PUBLIC_BIND publishing + the docs.dataland.chat service), DAT-286 (the dataland-auth JWKS mirror via data/extra_jwks.json), and DAT-288 (museum chapter placeholder purge). The Qdrant knowledge collection was also re-ingested 4839 → 4969 points as part of the section/scene re-ingest.

Current state

Every Dataland service runs on a single host, spark (ege@100.124.170.43, tailnet IP), with the production checkout under /home/cobanov/DATALAND/. All inter-service traffic stays on the dataland-network Docker bridge; the only things that leave the host are:

  • Cloudflare tunnels (host systemd cloudflared) — public ingress for dataland.chat, data.dataland.chat, and docs.dataland.chat. TLS terminates at Cloudflare; the tunnel ingresses to 127.0.0.1 on the host.
  • Tailscale tailnet — direct operator + peer access to the data-plane and ops ports. Every stateful/ops service is published twice in compose.yml: once on 127.0.0.1 (local tooling + SSH-tunnel) and once on a *_PUBLIC_BIND host IP (the tailnet interface) for direct peer access (DAT-73, extended by DAT-291). Never 0.0.0.0 — several of these have weak (Postgres) or no (Qdrant) auth, so the tailnet is the trust boundary.

The single load-bearing cross-network flow is museum-api → the external RDC redis (Refik Anadol data center: the wearable/sensor source of truth). Every other external touchpoint is public internet (Gemini, GCS, OneSignal) or our own internal infra.

vlan placement is itself being corrected

Two statements from the museum infra side drove the relocation:

  • ktm (2026-05-28): "we don't want it running on vlan14."
  • Alex M (2026-05-28): "vlan14 is only for backend nodes to connect to the internet. no services should be hosted on that network."

vlan14 is an egress vlan, not a hosting vlan, so Spark's placement there was a misuse. The directional answer (see Migration plan) is that Spark moves to vlan23 (no public internet) and the public-facing half lands on GCP, with a Cloud VPN / Interconnect as the only Spark ↔ GCP path. This page covers the host-level provisioning regardless of which destination a given service lands on.

Per-service hosting recommendation

The "Suggested location" column is the opening position; the locked split lives in the Migration plan. Ports below are the host-published ports from .env.example.

| Service | Container | Host port | Suggested location | Why | RDC dependency | Statefulness |
|---|---|---|---|---|---|---|
| Agent | dataland-agent | 4141 | GCP | Public-facing (dataland.chat), stateless behind Postgres | Indirect (via museum-api) | Stateless |
| Auth | dataland-auth | 9000 | GCP | Public JWKS, low traffic, needs Postgres | None | Signing keys in auth-data volume |
| RAG | dataland-rag | 4143 | Spark | Heavy ONNX rerank + BM25, co-located with Qdrant | None | Stateless (recompute from GCS) |
| Information WebUI | dataland-atlas | 4152 | GCP | Curator console (data.dataland.chat), talks to GCS + RAG | None | SQLite + uploads on host bind |
| Docs | dataland-docs | 4148 | GCP | Static MkDocs build (docs.dataland.chat), nginx basic auth | None | Stateless (image rebuilt) |
| Postgres | dataland-postgres | 5432 | GCP (Cloud SQL candidate) | Agent + auth DBs | None | Stateful (dump/restore) |
| Qdrant | dataland-qdrant | 4146/4147 | Spark | Vector store, co-located with RAG | None | Stateful (re-ingest from GCS possible but slow) |
| Museum API | dataland-museum | 4144 | Spark | Bridges the live RDC redis; latency + bandwidth sensitive | Direct PSUBSCRIBE | In-memory cache only |
| Museum simulator | dataland-simulator | | Spark | Writes to local redis (profile simulator) | None | Stateless |
| Notification worker | dataland-notification-worker | | GCP/Spark | Consumes museum:telemetry, needs outbound to OneSignal/Slack/Discord | Indirect (internal redis) | Consumer-group state in redis |
| Notification API | dataland-notification-api | 8080 | Same as worker | Sibling container (DLQ + state inspector) | Indirect | Consumer-group state in redis |
| Internal Redis | dataland-redis | 4145 | Spark (vlan23) | Backs museum:telemetry + agent ephemeral state; KTM directive | None (it's ours) | Stateful (AOF persistence) |
| Observability | dataland-prometheus / -grafana / -alertmanager (+ exporters) | 9090/3000/9093 | wherever services land | Tunnel-friendly either side | None | Stateful but rebuildable from config |

Runtime UIDs differ per image

This matters for the GCS-key ACL step below. The agent (and auth, which shares the agent image) runs as uid 10001 (dataland-agent/Dockerfile, DAT-153). The rag and information-webui images run as uid 1000. All three read the same secrets/gcp-key.json mounted read-only — so on a fresh host the key file must be readable by both uids (see Re-provision after a move → GCS key ACL).

Volumes & host binds that MUST persist

A relocation is "done" only when this state lands intact on the new host. The named volumes are declared at the bottom of compose.yml; the binds come from .env.

| State | Kind | Source → container path | Owner | Authoritative? | Recovery if lost |
|---|---|---|---|---|---|
| dataland_postgres-data | named volume | … → /var/lib/postgresql/data | postgres | Yes: auth users, agent conversations | pg_dump/pg_restore only |
| dataland_qdrant-data | named volume | … → /qdrant/storage | qdrant | No (downstream of GCS) | Snapshot, or re-run /ingest/sync (hours) |
| dataland_redis-data | named volume | … → /data | redis | Partial (stream + AOF) | RDB/AOF restore, or replay RDC bridge |
| dataland_auth-data | named volume | … → /app/data | auth | Yes: RS256 signing key + extra_jwks.json | See JWKS section; do not restore stale signing material |
| webui data | host bind | ${INFORMATION_WEBUI_DATA_DIR} → /app/data | information-webui | Yes: catalog.sqlite3, uploaded images, thumbnails | cp/rsync + backup-webui.sh |
| museum images | GCS, not on host | dataland-public/artworks, cobanov-public/chapters | RAG / webui / catalog | n/a (lives in GCS) | Nothing to move; re-ingest pointers |
| app logs | host bind | ${DATALAND_LOG_DIR} → /app/logs | all services | No (operational) | New host can start empty |
| prometheus-data / grafana-data / alertmanager-data | named volumes | | observability | No (rebuildable from config) | Re-provision from ./monitoring/ |

Two binds, not volumes

The two pieces of authoritative state that are host bind mounts (not Docker named volumes) are the easiest to forget in a docker volume-centric migration:

  • WebUI catalog + uploads at ${INFORMATION_WEBUI_DATA_DIR} (prod default /home/cobanov/DATALAND/dataland-atlas/data). Holds catalog.sqlite3, projects/<slug>/images/, museum/, and thumbnails/<slug>/. Use backup-webui.sh (WAL-safe SQLite .backup + tarball of the museum, projects, and thumbnails directories); it verifies the dump and writes a manifest. A bare docker volume migration will silently leave this behind.
  • Logs at ${DATALAND_LOG_DIR} (prod default /home/cobanov/DATALAND/logs). Not authoritative, but the path must exist on the new host or every container fails to mount /app/logs. Both defaults only exist on the prod VDS — set them explicitly on any other host.

For the full backup/restore command set and DR scenarios, see the backup & restore runbook in the repo (reports/backup-restore.md). reset-stack.sh drops every named volume in one command and is not reversible — read that runbook first.

Re-provision after a move

These are the manual steps a fresh checkout + deploy.sh cannot do for you. Run them in order.

1. Secrets (.env + GCS key)

# from the parent checkout dir on the new host, e.g. /home/cobanov/DATALAND
cp dataland-infrastructure/.env.example .env
mkdir -p secrets && chmod 700 secrets   # (1)!
chmod 600 .env   # (2)!
# only now fill the real secrets into .env
  1. 0700 on secrets/ keeps the directory (and the gcp-key.json you drop in next) reachable only by the deploy user. Anything looser exposes the service-account key on a shared host.
  2. .env is the canonical deploy config (DAT-265) and holds every plaintext secret in the table below. 0600 locks it to the deploy user before you populate it — do this before filling in real values, not after.

.env is the canonical deploy config (DAT-265) — compose.yml reads it for every service. These keys are required (compose.yml uses the :? form and refuses to render without them):

| Var | Gate |
|---|---|
| REDIS_PASSWORD | --requirepass on redis; every consumer authenticates (DAT-76) |
| RDC_REDIS_URL | museum-api streams from the live RDC redis only |
| MUSEUM_PASSWORD, MUSEUM_SESSION_SECRET | museum dashboard gate |
| INFORMATION_WEBUI_PASSWORD, INFORMATION_WEBUI_SESSION_SECRET | CMS gate |
| DOCS_PASSWORD | docs.dataland.chat nginx basic auth |
| GRAFANA_ADMIN_PASSWORD | Grafana admin (DAT-266) |

deploy.sh fails fast on placeholder secrets (DAT-291)

Before rebuilding, deploy.sh runs the real agent boot guard (app.runtime.assert_boot_required_env) from the current dataland/agent:latest image against the new .env. If any production secret is still a placeholder/default, the deploy aborts before anything is rebuilt — this prevents the exact crash-loop outage that motivated the check (the agent's boot guard would otherwise crash-loop the fresh container and take chat offline). The guard is a no-op outside APP_ENV=production, and is skipped on the very first deploy when no dataland/agent:latest image exists yet. So on a brand-new host the first deploy will not catch placeholders — validate .env by hand (scripts/validate-env.sh, scripts/check-env-drift.sh) before the first deploy.sh.
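For that first deploy on a brand-new host, the hand check can be approximated in a few lines of shell. The sketch below is not the repo's scripts/validate-env.sh (its contents are not shown on this page); it only confirms that each key from the required-vars table is present and not an obvious placeholder. The function name check_env_file and the placeholder markers (changeme, placeholder, <...>) are assumptions for illustration; the authoritative list lives in the agent boot guard.

```shell
# Keys taken from the required-vars table on this page.
REQUIRED_KEYS="REDIS_PASSWORD RDC_REDIS_URL MUSEUM_PASSWORD MUSEUM_SESSION_SECRET
INFORMATION_WEBUI_PASSWORD INFORMATION_WEBUI_SESSION_SECRET DOCS_PASSWORD
GRAFANA_ADMIN_PASSWORD"

# check_env_file FILE: return 0 if every required key has a non-placeholder value.
check_env_file() {
  local env_file="$1" key line value status=0
  for key in $REQUIRED_KEYS; do
    value=""
    # This sketch keeps the last KEY=value assignment it sees in the file.
    while IFS= read -r line || [ -n "$line" ]; do
      case "$line" in
        "$key="*) value="${line#*=}" ;;
      esac
    done < "$env_file"
    # Strip one pair of surrounding double quotes, if present.
    value="${value#\"}"
    value="${value%\"}"
    case "$value" in
      ""|changeme|CHANGEME|placeholder|"<"*">")   # assumed placeholder markers
        echo "missing or placeholder: $key" >&2
        status=1
        ;;
    esac
  done
  return "$status"
}
```

A typical call would be `check_env_file .env || echo "fix .env before the first deploy.sh"`. Passing this check says nothing about whether the values are correct, only that they were filled in.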

2. GCS key ACL for the runtime user

Put the GCP service-account key at secrets/gcp-key.json, then lock it down (DAT-88):

chmod 600 secrets/gcp-key.json   # (1)!
chown "$(id -u)":"$(id -g)" secrets/gcp-key.json   # (2)!
stat -c '%a %n' secrets/gcp-key.json   # (3)!
  1. DAT-88 lockdown: the service-account key is a long-lived GCS credential, so it stays owner-read only by default. Note the gotcha below — 0600 is the starting posture, not always the final one, because the runtime container uids must still be able to read it.
  2. Set ownership to the current deploy user explicitly. On a relocation the host uid mapping is the thing most likely to have shifted, and a wrong owner here is exactly what breaks the in-container read.
  3. Sanity check: expected output is 600 secrets/gcp-key.json. Verifies the chmod stuck before you move on.

secrets/gcp-key.json is bind-mounted :ro into agent, rag, and information-webui at /app/gcp-key.json. The catch is that those containers run as non-root users, and the container UID must be able to read the host file through the bind:

  • agent (+ auth, same image) → uid 10001
  • rag, information-webui → uid 1000

uid 10001 must be able to read the key (GCS key ACL gotcha)

A chmod 600 key owned by the host deploy user is readable inside a container only by a process running with that same numeric uid. Unless the deploy user's host uid happens to match a runtime uid, the read of /app/gcp-key.json fails and GCS-backed calls break, while the host operator sees a perfectly fine cat secrets/gcp-key.json. Because the agent (uid 10001) and rag/information-webui (uid 1000) run as different uids, a single 600 owner can satisfy at most one of them. After a relocation the host uid mapping is the thing most likely to have changed. Two clean fixes:

# Option A: make the key owned by the agent runtime uid and readable by its
# group (0640). World-readable (644) also works on a single-tenant host but
# weakens the DAT-88 lockdown.
sudo chown 10001:10001 secrets/gcp-key.json   # (1)!
sudo chmod 640 secrets/gcp-key.json   # (2)!

# Option B — group both runtime uids and grant the group read.
  1. 10001 is the agent/auth runtime uid (shared image, DAT-153). This makes the key owned by the in-container uid so the read through the :ro bind succeeds. For rag/information-webui you would chown to 1000 instead, or use the group approach in Option B to cover both.
  2. 0640 (owner read/write + group read) keeps the key off world-read while letting the new owning uid read it. Prefer this over 0644, which works on a single-tenant host but undoes the DAT-88 lockdown.

Verify from inside each container after deploy.sh:

docker exec dataland-agent cat /app/gcp-key.json >/dev/null && echo agent-ok   # (1)!
docker exec dataland-rag   cat /app/gcp-key.json >/dev/null && echo rag-ok   # (2)!
  1. Reads the key as uid 10001 from inside the container and discards the contents — prints agent-ok only if the in-container read actually succeeds. This is the check that catches the uid-mismatch gotcha that a host-side cat hides.
  2. Same proof for uid 1000 (rag, and by extension information-webui, share the same uid). Prints rag-ok only when the through-the-bind read works.
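The in-container checks need running containers. Before deploy.sh, the same uid mismatch can be predicted from the host side by comparing the file's owner, group, and mode against each runtime uid. The sketch below is a rough pre-check, not a replacement for the docker exec proof: the helper name can_uid_read is hypothetical, it assumes GNU stat (as used earlier in this step), and it assumes each runtime gid equals its uid, which this page does not confirm. POSIX ACLs and supplementary groups are ignored.

```shell
# can_uid_read FILE UID GID: return 0 if a process with that uid/gid would pass
# the classic owner/group/other read check on FILE. POSIX ACLs, supplementary
# groups, and root's override capability are deliberately not modelled.
can_uid_read() {
  local file="$1" uid="$2" gid="$3" owner group mode perms rest digit
  owner="$(stat -c '%u' "$file")"   # numeric owner (GNU stat)
  group="$(stat -c '%g' "$file")"   # numeric group
  mode="$(stat -c '%a' "$file")"    # octal mode without leading zero, e.g. 600
  # Left-pad to 4 digits, then drop the special-bits digit: 600 -> 0600 -> 600.
  perms="$(printf '%04d' "$mode")"
  perms="${perms#?}"
  rest="${perms#?}"
  if [ "$uid" = "$owner" ]; then
    digit="${perms%??}"   # owner digit
  elif [ "$gid" = "$group" ]; then
    digit="${rest%?}"     # group digit
  else
    digit="${rest#?}"     # other digit
  fi
  # The read bit has value 4, so digits 4..7 grant read.
  [ "$digit" -ge 4 ]
}

# Example (commented out so the block has no side effects when sourced):
# can_uid_read secrets/gcp-key.json 10001 10001 || echo "agent cannot read the key"
# can_uid_read secrets/gcp-key.json 1000 1000   || echo "rag/webui cannot read the key"
```

If either call reports a failure, fix ownership or mode first; the docker exec checks remain the acceptance test because they exercise the real bind mount.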

On the GCP side, the migration also splits Gemini/GCS access behind a new ai-proxy so Spark callers don't need the key at all post-cut — see Migration plan. Until that lands, every host hosting agent/rag/information-webui needs a readable key.

3. JWKS mirror (data/extra_jwks.json)

dataland-auth runs from the agent image (command: python auth_server.py) and persists its RS256 signing key in the auth-data volume at /app/data (auth_rsa_private.pem + auth_rsa_kid.txt, kid=dataland-rs256-1). On top of that, DAT-286 has it serve a mirror of the external CMS signing key's public JWK alongside its own:

flowchart LR
  CMS["CMS / mobile backend<br/>signs tokens (kid dataland-rs256-1)"]
  subgraph auth["dataland-auth"]
    local["local RS256 signing key<br/>(auth_rsa_private.pem)"]
    extra["extra_jwks.json<br/>(public CMS key, mirror)"]
    served["/.well-known/jwks.json<br/>(local key first, then mirror)"]
  end
  Agent["dataland-agent<br/>verifies JWT against each JWKS URL"]
  CMSremote["CMS-staging JWKS<br/>(fallback / sole validator if mirror absent)"]

  CMS -->|public key| extra
  local --> served
  extra --> served
  served --> Agent
  CMSremote -.->|fallback| Agent

Why it matters for relocation: the agent verifies a token by trying each configured JWKS URL in turn. With the CMS key absent from the local JWKS, the agent falls through to the external CMS-staging endpoint as the sole validator for all chat auth — a single point of failure. A JWKS only ever exposes public material (never the signing key), so mirroring it locally demotes the external endpoint to a pure backup. The agent WARNs when a fallback JWKS provider is the only validator.

The mirror lives in the persisted auth-data volume, so a volume restore carries it. A fresh volume does not — re-provision it:

# fetch the upstream CMS public JWKS and drop it into the auth volume
docker exec dataland-auth python -c "import urllib.request; \
  open('/app/data/extra_jwks.json','wb').write( \
  urllib.request.urlopen('https://<cms-host>/.well-known/jwks.json').read())"   # (1)!
docker restart dataland-auth   # (2)!
# confirm the served JWKS now carries >1 key (local + mirrored kid)
curl -fsS http://localhost:9000/.well-known/jwks.json | jq '.keys | length'   # (3)!
  1. Writes the file at /app/data/extra_jwks.json — the persisted auth-data mount, so it survives a restart. A JWKS only ever carries public key material, so this never moves signing material; it just gives auth a local copy of the CMS public key.
  2. Auth loads extra_jwks.json at startup, so the restart is what actually picks up the mirror. No restart, no mirror.
  3. Expect >1: one entry for the local dataland-rs256-1 key plus the mirrored CMS kid. A 1 means the mirror did not load and the external CMS endpoint is still the sole validator.

Alternatively set AUTH_EXTRA_JWKS_JSON (inline JSON) or AUTH_EXTRA_JWKS_PATH in .env. A malformed or missing source is logged (auth.extra_jwks.bad_source) and never raised — a bad mirror must not take auth, and thus all chat, down. The local key is always authoritative: an extra key reusing dataland-rs256-1 is skipped so a stale mirror can never shadow the key auth signs with.

Never restore stale auth signing material

The local RS256 private key in auth-data is signing material. If the volume is lost and you cannot prove the old key was never compromised, rotate instead of restoring: let auth regenerate, re-provision the extra_jwks.json mirror, and notify any client that pinned the prior JWKS. Treat a restore as a key-rotation event.

4. Museum knowledge re-ingest (only if Qdrant is rebuilt)

Qdrant is recomputable — vectors derive from GCS documents via RAG. If you carry the qdrant-data volume (or a per-collection snapshot) you skip this. If you start with an empty Qdrant, re-ingest:

curl -fsS -X DELETE http://localhost:4146/collections/knowledge   # (1)!
curl -fsS -X POST  http://localhost:4143/ingest/sync -H "X-API-Key: $API_KEY"   # (2)!
  1. Drops the knowledge collection entirely on the Qdrant REST port (4146) before the rebuild. This is the destructive half — skip it if you carried the qdrant-data volume or a snapshot.
  2. Triggers a full drop-and-re-embed from GCS via RAG (port 4143); the X-API-Key header is required. Budget hours, not minutes — every museum document is re-embedded.

The museum content specifically — 20 sections + scenes + the overview — is (re-)ingested into the Qdrant knowledge collection through the WebUI's RAG live-sync (text → /ingest/file, images → /ingest/image; rag slugs museum-section-<slug> / museum-scene-<slug>, replace-by-slug with UUIDv5 ids). As part of this change-set the knowledge collection went 4839 → 4969 points. After a webui-data restore, re-run the WebUI live-sync (or scripts/sync_from_qdrant.py) to reconcile the catalog against Qdrant — drift between the two is the most common after-restore footgun.

Museum chapter placeholders are purged (DAT-288)

The default_reference_* placeholder images were removed from both chapters.json and GCS cobanov-public/chapters. They were never in Qdrant, so this is purely a GCS + catalog concern — do not re-seed them on the new host.

5. Tailnet binds + Cloudflare ingress

compose.yml publishes the data-plane/ops services on each host's tailnet IP via *_PUBLIC_BIND (default 100.124.170.43). On a new host these must point at that host's tailnet interface IP, never 0.0.0.0 and never 127.0.0.1 (which would collide with the loopback line already in compose). Set, at minimum: QDRANT_PUBLIC_BIND, REDIS_PUBLIC_BIND, RAG_PUBLIC_BIND, NOTIFICATION_PUBLIC_BIND, POSTGRES_PUBLIC_BIND, DOCS_PUBLIC_BIND. The observability UIs (PROMETHEUS_PUBLIC_BIND, GRAFANA_PUBLIC_BIND, ALERTMANAGER_PUBLIC_BIND) default to 127.0.0.1 — override to the tailnet IP only if operators want browser access without an SSH tunnel.
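The bind rules above are easy to get wrong by hand, so a small guard can be run against each data-plane value before deploying. This is a minimal sketch, assuming IPv4 tailnet addresses in 100.64.0.0/10, the range Tailscale assigns from; the function name validate_public_bind is hypothetical and only the first two octets are range-checked. It is meant for the data-plane variables, not the observability ones, which legitimately default to 127.0.0.1.

```shell
# validate_public_bind NAME IP: return 0 only for a dotted IPv4 address inside
# 100.64.0.0/10; reject empty values, 0.0.0.0, and loopback.
validate_public_bind() {
  local name="$1" ip="$2" first rest second
  case "$ip" in
    "")       echo "$name: unset" >&2; return 1 ;;
    0.0.0.0)  echo "$name: 0.0.0.0 publishes beyond the tailnet" >&2; return 1 ;;
    127.*)    echo "$name: loopback is already published by compose" >&2; return 1 ;;
    *[!0-9.]*|*.*.*.*.*) echo "$name: '$ip' is not a dotted IPv4 address" >&2; return 1 ;;
    *.*.*.*)  ;;  # shaped like a dotted quad, keep checking
    *)        echo "$name: '$ip' is not a dotted IPv4 address" >&2; return 1 ;;
  esac
  first="${ip%%.*}"
  rest="${ip#*.}"
  second="${rest%%.*}"
  case "$second" in
    "") echo "$name: '$ip' is not a dotted IPv4 address" >&2; return 1 ;;
  esac
  # 100.64.0.0/10 covers 100.64.x.x through 100.127.x.x.
  if [ "$first" = "100" ] && [ "$second" -ge 64 ] && [ "$second" -le 127 ]; then
    return 0
  fi
  echo "$name: '$ip' is outside 100.64.0.0/10" >&2
  return 1
}
```

For example, `validate_public_bind QDRANT_PUBLIC_BIND "$QDRANT_PUBLIC_BIND"` after sourcing .env. The check only proves the value is shaped like a tailnet address; confirming it is this host's tailnet address still needs a look at the interface itself.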

Finally, repoint Cloudflare tunnel ingress for dataland.chat, data.dataland.chat, and docs.dataland.chat at the new origin (host systemd cloudflared, ingressing to 127.0.0.1). Swap DNS / origin last so rollback stays a one-line origin change.

Cross-network dependencies

Each service's external touchpoints — useful for sizing firewall / peering asks.

| Service | Reaches out to | Direction |
|---|---|---|
| Agent | RAG, museum-api, notification-api, Auth JWKS, CMS JWKS, Gemini API, Postgres, internal redis | Egress |
| Museum API | External RDC redis (museum LAN), internal redis, RAG | Egress |
| RAG | GCS (dataland-public, dataland-private, cobanov-public), Gemini API, Qdrant | Egress |
| Information WebUI | RAG, GCS, (no DB; SQLite is local) | Egress |
| Notification worker / api | Internal redis, Agent (service token), OneSignal, Discord/Slack (ops alerts) | Egress |
| Auth | Postgres | Egress |
| Internal Redis / Docs | None outbound | |

The one critical cross-network flow is museum-api → RDC redis on the museum's network. Every other external touchpoint is public internet or our own internal infra. Post-split, the Migration plan routes Gemini + GCS through a GCP ai-proxy so Spark (vlan23, no internet) can still reach them over the Spark ↔ GCP link.

Open decisions (asks for the call)

  1. Is there a museum-internal "services" vlan where a host can sit and still reach the RDC redis directly? If yes, museum-api + notification + internal redis can stay on-premise (sub-ms redis latency). Current direction: Spark → vlan23 with a Cloud VPN to GCP.
  2. If everything goes to GCP, what's the supported network path back to the RDC redis? Options: Cloud VPN / IPSec, Dedicated Interconnect (likely overkill), a TLS-fronted redis endpoint with an IP allow-list, or a Tailscale subnet router. Current direction: Cloud VPN / Interconnect.
  3. Long-term plan for the RDC redis endpoint — same instance, or are server-side consumers moving?
  4. IP/auth model on the new GCP peer — IAM, IP allow-list, mTLS. We build to it but need the shape.

Suggested migration phases

This is the host-level ordering; the network-level phasing lives in the Migration plan.

Phase A (independent of museum decision, 1-2 days)
  □ Stand up the destination host(s) + VPC + service accounts
  □ Migrate stateless services first: agent, auth, rag, information-webui, docs
  □ Carry/re-provision the persistent state:
      postgres-data (dump+restore), auth-data (incl. extra_jwks.json),
      webui host bind (catalog.sqlite3 + uploads), redis-data (AOF)
  □ Qdrant: carry the volume/snapshot, OR rebuild from GCS via /ingest/sync

Phase B (after museum redis-access decision)
  □ Land museum-api + notification (worker+api) + internal redis in the
    decided location; smoke: telemetry flowing, rules firing

Phase C (clean-up)
  □ Re-provision the GCS-key ACL for uids 10001/1000 on the new host
  □ Re-point tailnet *_PUBLIC_BIND to the new host's tailnet IP
  □ Cloudflare ingress + DNS swap LAST (keeps rollback a one-line origin change)
  □ Drain the old host, decommission

Post-move smoke (acceptance)

Before declaring a relocation done, confirm — in this order:

  1. deploy.sh completes (its DAT-291 boot guard passed against the new .env).
  2. docker compose ps shows every container healthy (the healthchecks in compose.yml).
  3. GCS key is readable inside agent, rag, information-webui (the docker exec cat checks above).
  4. Auth serves >1 JWK (/.well-known/jwks.json carries local + mirrored kid).
  5. Qdrant knowledge reports ~4969 points (or your last re-ingest count).
  6. dataland-agent answers a museum-mode chat round-trip end-to-end, and a welcome push fires on the first empty /museum message (DAT-296).
  7. Telemetry flows: museum:telemetry is being XADD-ed and notification rules fire.
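Because the order matters (a later check is meaningless if an earlier one failed), the scriptable items in this list can be chained through a runner that stops at the first failure. The sketch below is illustrative only: run_in_order is a hypothetical helper, and the commented usage reuses only commands that already appear on this page. Items 6 and 7 still need a human or a purpose-built probe.

```shell
# run_in_order CMD...: run each command string in turn via sh -c,
# print a result line per step, and stop at the first failure.
run_in_order() {
  local step=0 check
  for check in "$@"; do
    step=$((step + 1))
    if sh -c "$check" >/dev/null 2>&1; then
      echo "ok   $step: $check"
    else
      echo "FAIL $step: $check"
      return 1
    fi
  done
}

# Example (commented out; requires the running stack and jq):
# run_in_order \
#   'docker exec dataland-agent cat /app/gcp-key.json' \
#   'docker exec dataland-rag cat /app/gcp-key.json' \
#   'test "$(curl -fsS http://localhost:9000/.well-known/jwks.json | jq ".keys | length")" -gt 1'
```

Command output is discarded so the key material read by the cat checks never reaches the terminal or a log; only the pass/fail line per step is printed.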

What's outside our control

  • Network design for redis access — museum infra responsibility. We build to the access pattern they sanction.
  • RDC redis availability / deprecation — museum-side roadmap. We can feature-flag a transport swap, but the cut-over date isn't ours.
  • Mobile client behavior — Refik Anadol Studio. Our chat API contract stays the same regardless of any RDC transport change.