Service hosting & relocation
Working document — first drafted 2026-05-28 ahead of the museum infra alignment call. Status: decisions landing. The directional GCP split is captured in the sibling Migration plan (Spark ↔ GCP); this page stays the operator-facing "how the stack maps to a host and what to re-provision when it moves" reference.
This page answers three operational questions:
- What runs where today and what the opening per-service placement is (the matrix below).
- Which on-disk state must survive a move — every named volume + host bind that holds authoritative data.
- What you have to re-provision by hand after the move: secrets, the GCS key ACL for the non-root runtime user, the JWKS mirror, and the museum knowledge re-ingest. These are the steps a git clone && deploy.sh does not do for you, and the ones that have bitten us.
Recent changes
This page reflects the 2026-06-03 → 2026-06-04 change-set. The provisioning gotchas were
materially reshaped by DAT-291 (deploy.sh now fails fast on placeholder prod secrets +
tailnet *_PUBLIC_BIND publishing + the docs.dataland.chat service), DAT-286 (the
dataland-auth JWKS mirror via data/extra_jwks.json), and DAT-288 (museum chapter
placeholder purge). The Qdrant knowledge collection was also re-ingested 4839 → 4969 points
as part of the section/scene re-ingest.
Current state
Every Dataland service runs on a single host, spark (ege@100.124.170.43, tailnet IP), with
the production checkout under /home/cobanov/DATALAND/. All inter-service traffic stays on the
dataland-network Docker bridge; the only things that leave the host are:
- Cloudflare tunnels (host systemd cloudflared): public ingress for dataland.chat, data.dataland.chat, and docs.dataland.chat. TLS terminates at Cloudflare; the tunnel ingresses to 127.0.0.1 on the host.
- Tailscale tailnet: direct operator + peer access to the data-plane and ops ports. Every stateful/ops service is published twice in compose.yml: once on 127.0.0.1 (local tooling + SSH-tunnel) and once on a *_PUBLIC_BIND host IP (the tailnet interface) for direct peer access (DAT-73, extended by DAT-291). Never 0.0.0.0: several of these have weak (Postgres) or no (Qdrant) auth, so the tailnet is the trust boundary.
The single load-bearing cross-network flow is museum-api → the external RDC redis (Refik
Anadol data center: the wearable/sensor source of truth). Every other external touchpoint is
public internet (Gemini, GCS, OneSignal) or our own internal infra.
vlan placement is itself being corrected
Two statements from the museum infra side drove the relocation:
- ktm (2026-05-28): "we don't want it running on vlan14."
- Alex M (2026-05-28): "vlan14 is only for backend nodes to connect to the internet. no services should be hosted on that network."
vlan14 is an egress vlan, not a hosting vlan, so Spark's placement there was a misuse. The directional answer (see Migration plan) is that Spark moves to vlan23 (no public internet) and the public-facing half lands on GCP, with a Cloud VPN / Interconnect as the only Spark ↔ GCP path. This page covers the host-level provisioning regardless of which destination a given service lands on.
Per-service hosting recommendation
The "Suggested location" column is the opening position; the locked split lives in the
Migration plan. Ports below are the host-published ports from .env.example.
| Service | Container | Host port | Suggested location | Why | RDC dependency | Statefulness |
|---|---|---|---|---|---|---|
| Agent | dataland-agent | 4141 | GCP | Public-facing (dataland.chat), stateless behind Postgres | Indirect (via museum-api) | Stateless |
| Auth | dataland-auth | 9000 | GCP | Public JWKS, low traffic, needs Postgres | None | Signing keys in auth-data volume |
| RAG | dataland-rag | 4143 | Spark | Heavy ONNX rerank + BM25, co-located with Qdrant | None | Stateless (recompute from GCS) |
| Information WebUI | dataland-atlas | 4152 | GCP | Curator console (data.dataland.chat), talks to GCS + RAG | None | SQLite + uploads on host bind |
| Docs | dataland-docs | 4148 | GCP | Static MkDocs build (docs.dataland.chat), nginx basic auth | None | Stateless (image rebuilt) |
| Postgres | dataland-postgres | 5432 | GCP (Cloud SQL candidate) | Agent + auth DBs | None | Stateful (dump/restore) |
| Qdrant | dataland-qdrant | 4146/4147 | Spark | Vector store, co-located with RAG | None | Stateful (re-ingest from GCS possible but slow) |
| Museum API | dataland-museum | 4144 | Spark | Bridges the live RDC redis; latency + bandwidth sensitive | Direct PSUBSCRIBE | In-memory cache only |
| Museum simulator | dataland-simulator | none | Spark | Writes to local redis (profile simulator) | None | Stateless |
| Notification worker | dataland-notification-worker | none | GCP/Spark | Consumes museum:telemetry, needs outbound to OneSignal/Slack/Discord | Indirect (internal redis) | Consumer-group state in redis |
| Notification API | dataland-notification-api | 8080 | Same as worker | Sibling container (DLQ + state inspector) | Indirect | Consumer-group state in redis |
| Internal Redis | dataland-redis | 4145 | Spark (vlan23) | Backs museum:telemetry + agent ephemeral state; KTM directive | None (it's ours) | Stateful (AOF persistence) |
| Observability | dataland-prometheus / -grafana / -alertmanager (+ exporters) | 9090/3000/9093 | wherever services land | Tunnel-friendly either side | None | Stateful but rebuildable from config |
Runtime UIDs differ per image
This matters for the GCS-key ACL step below. The agent (and auth, which shares the agent
image) runs as uid 10001 (dataland-agent/Dockerfile, DAT-153). The rag and
information-webui images run as uid 1000. All three read the same
secrets/gcp-key.json mounted read-only — so on a fresh host the key file must be readable by
both uids (see Re-provision after a move → GCS key ACL).
Volumes & host binds that MUST persist
A relocation is "done" only when this state lands intact on the new host. The named volumes are
declared at the bottom of compose.yml; the binds come from .env.
| State | Kind | Source → container path | Owner | Authoritative? | Recovery if lost |
|---|---|---|---|---|---|
| dataland_postgres-data | named volume | … → /var/lib/postgresql/data | postgres | Yes: auth users, agent conversations | pg_dump/pg_restore only |
| dataland_qdrant-data | named volume | … → /qdrant/storage | qdrant | No (downstream of GCS) | Snapshot, or re-/ingest/sync (hours) |
| dataland_redis-data | named volume | … → /data | redis | Partial (stream + AOF) | RDB/AOF restore, or replay RDC bridge |
| dataland_auth-data | named volume | … → /app/data | auth | Yes: RS256 signing key + extra_jwks.json | See JWKS section; do not restore stale signing material |
| webui data | host bind | ${INFORMATION_WEBUI_DATA_DIR} → /app/data | information-webui | Yes: catalog.sqlite3, uploaded images, thumbnails | cp/rsync + backup-webui.sh |
| museum images | GCS, not on host | dataland-public/artworks, cobanov-public/chapters | RAG / webui / catalog | n/a (lives in GCS) | Nothing to move; re-ingest pointers |
| app logs | host bind | ${DATALAND_LOG_DIR} → /app/logs | all services | No (operational) | New host can start empty |
| prometheus-data / grafana-data / alertmanager-data | named volumes | | observability | No (rebuildable from config) | Re-provision from ./monitoring/ |
Two binds, not volumes
The two pieces of authoritative state that are host bind mounts (not Docker named volumes)
are the easiest to forget in a docker volume-centric migration:
- WebUI catalog + uploads at ${INFORMATION_WEBUI_DATA_DIR} (prod default /home/cobanov/DATALAND/dataland-atlas/data). Holds catalog.sqlite3, projects/<slug>/images/, museum/, and thumbnails/<slug>/. Use backup-webui.sh (WAL-safe SQLite .backup + tarball of the museum, projects, and thumbnails directories); it verifies the dump and writes a manifest. A bare docker volume migration will silently leave this behind.
- Logs at ${DATALAND_LOG_DIR} (prod default /home/cobanov/DATALAND/logs). Not authoritative, but the path must exist on the new host or every container fails to mount /app/logs.

Both defaults only exist on the prod VDS, so set them explicitly on any other host.
For the full backup/restore command set and DR scenarios, see the
backup & restore runbook
in the repo (reports/backup-restore.md). reset-stack.sh drops every named volume in one
command and is not reversible — read that runbook first.
Re-provision after a move
These are the manual steps a fresh checkout + deploy.sh cannot do for you. Run them in order.
1. Secrets (.env + GCS key)
# from the parent checkout dir on the new host, e.g. /home/cobanov/DATALAND
cp dataland-infrastructure/.env.example .env
mkdir -p secrets && chmod 700 secrets # (1)!
chmod 600 .env # (2)!
# then fill the real secrets in .env
- (1) 0700 on secrets/ keeps the directory (and the gcp-key.json you drop in next) reachable only by the deploy user. Anything looser exposes the service-account key on a shared host.
- (2) .env is the canonical deploy config (DAT-265) and holds every plaintext secret in the table below. 0600 locks it to the deploy user before you populate it; do this before filling in real values, not after.
.env is the canonical deploy config (DAT-265) — compose.yml reads it for every service. These
keys are required (compose.yml uses the :? form and refuses to render without them):
| Var | Gate |
|---|---|
| REDIS_PASSWORD | --requirepass on redis; every consumer authenticates (DAT-76) |
| RDC_REDIS_URL | museum-api streams from the live RDC redis only |
| MUSEUM_PASSWORD, MUSEUM_SESSION_SECRET | museum dashboard gate |
| INFORMATION_WEBUI_PASSWORD, INFORMATION_WEBUI_SESSION_SECRET | CMS gate |
| DOCS_PASSWORD | docs.dataland.chat nginx basic auth |
| GRAFANA_ADMIN_PASSWORD | Grafana admin (DAT-266) |
deploy.sh fails fast on placeholder secrets (DAT-291)
Before rebuilding, deploy.sh runs the real agent boot guard
(app.runtime.assert_boot_required_env) from the current dataland/agent:latest image against
the new .env. If any production secret is still a placeholder/default, the deploy aborts
before anything is rebuilt — this prevents the exact crash-loop outage that motivated the
check (the agent's boot guard would otherwise crash-loop the fresh container and take chat
offline). The guard is a no-op outside APP_ENV=production, and is skipped on the very first
deploy when no dataland/agent:latest image exists yet. So on a brand-new host the first
deploy will not catch placeholders — validate .env by hand (scripts/validate-env.sh,
scripts/check-env-drift.sh) before the first deploy.sh.
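Because the very first deploy skips the boot guard, a hand check along the following lines can catch obvious placeholders. This is only a minimal sketch, not what scripts/validate-env.sh or app.runtime.assert_boot_required_env actually do: the key list is copied from the table above, while PLACEHOLDER_MARKERS and both function names are hypothetical.

```python
# Hypothetical pre-deploy check; NOT the project's real boot guard.
REQUIRED_KEYS = [
    "REDIS_PASSWORD", "RDC_REDIS_URL",
    "MUSEUM_PASSWORD", "MUSEUM_SESSION_SECRET",
    "INFORMATION_WEBUI_PASSWORD", "INFORMATION_WEBUI_SESSION_SECRET",
    "DOCS_PASSWORD", "GRAFANA_ADMIN_PASSWORD",
]
# Assumed markers; adjust to whatever .env.example really ships with.
PLACEHOLDER_MARKERS = ("changeme", "change-me", "placeholder", "replace-me")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def find_bad_secrets(env_text):
    """Return required keys that are missing, empty, or look like placeholders."""
    values = parse_env(env_text)
    bad = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        if not value or any(m in value.lower() for m in PLACEHOLDER_MARKERS):
            bad.append(key)
    return bad
```

Calling find_bad_secrets(open(".env").read()) and treating a non-empty list as a stop sign gives a rough stand-in for the guard on that first deploy.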
2. GCS key ACL for the runtime user
Put the GCP service-account key at secrets/gcp-key.json, then lock it down (DAT-88):
chmod 600 secrets/gcp-key.json # (1)!
chown "$(id -u)":"$(id -g)" secrets/gcp-key.json # (2)!
stat -c '%a %n' secrets/gcp-key.json # (3)!
- (1) DAT-88 lockdown: the service-account key is a long-lived GCS credential, so by default it stays readable by its owner only. Note the gotcha below: 0600 is the starting posture, not always the final one, because the runtime container uids must still be able to read it.
- (2) Set ownership to the current deploy user explicitly. On a relocation the host uid mapping is the thing most likely to have shifted, and a wrong owner here is exactly what breaks the in-container read.
- (3) Sanity check: expected output is 600 secrets/gcp-key.json. Verifies the chmod stuck before you move on.
secrets/gcp-key.json is bind-mounted :ro into agent, rag, and information-webui at
/app/gcp-key.json. The catch is that those containers run as non-root users, and the
container UID must be able to read the host file through the bind:
- agent (+ auth, same image) → uid 10001
- rag, information-webui → uid 1000
uid 10001 must be able to read the key (GCS key ACL gotcha)
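A chmod 600 key is readable only by the uid that owns it, and (absent user-namespace remapping) a bind mount passes the host's numeric uid through unchanged. A key owned by the host deploy user is therefore readable inside a container only if that container's uid equals the deploy user's host uid. Since 600 admits a single uid, it can satisfy at most one of the two runtime uids: if the deploy user's host uid is not 10001, the agent fails to read /app/gcp-key.json and GCS-backed calls break, and if it is not 1000, rag and information-webui fail the same way. Meanwhile the host operator sees a perfectly fine cat secrets/gcp-key.json. After a relocation the host uid mapping is the thing most likely to have changed. Two clean fixes: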
# Option A: make the key owned by a runtime uid and group-readable (640).
# World-readable (644) also works on a single-tenant host but
# weakens the DAT-88 lockdown.
sudo chown 10001:10001 secrets/gcp-key.json # (1)!
sudo chmod 640 secrets/gcp-key.json # (2)!
# Option B — group both runtime uids and grant the group read.
- (1) 10001 is the agent/auth runtime uid (shared image, DAT-153). This makes the key owned by the in-container uid so the read through the :ro bind succeeds. For rag/information-webui you would chown to 1000 instead, or use the group approach in Option B to cover both.
- (2) 0640 (owner read/write + group read) keeps the key off world-read while letting the new owning uid read it. Prefer this over 0644, which works on a single-tenant host but undoes the DAT-88 lockdown.
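The rule behind both options is the ordinary POSIX permission check: the owner bits apply if the uid matches, otherwise the group bits if a gid matches, otherwise the other bits. The sketch below models that rule to show which runtime users can read the key under each posture. It assumes no ACLs and no user-namespace remapping; the in-container gids, the deploy uid 1001, and the shared gid 2000 are hypothetical values chosen for illustration.

```python
def can_read(mode, owner_uid, owner_gid, uid, gids):
    """POSIX-style read check for a non-root process (no ACLs, no capabilities)."""
    if uid == owner_uid:
        return bool(mode & 0o400)  # owner class wins, even if group/other are looser
    if owner_gid in gids:
        return bool(mode & 0o040)  # group class
    return bool(mode & 0o004)      # other class


# uid and gid set per runtime user; the gid values are assumptions.
RUNTIME_USERS = {
    "agent": (10001, {10001}),
    "rag": (1000, {1000}),
}


def readers(mode, owner_uid, owner_gid, users=RUNTIME_USERS):
    """Names of the runtime users able to read a file with this mode and ownership."""
    return sorted(
        name
        for name, (uid, gids) in users.items()
        if can_read(mode, owner_uid, owner_gid, uid, gids)
    )


# 600 owned by a hypothetical deploy user (uid/gid 1001): nobody in the containers reads it.
print(readers(0o600, 1001, 1001))
# Option A as written (chown 10001:10001, chmod 640): covers the agent only.
print(readers(0o640, 10001, 10001))
# Group approach: both runtime users carry a shared supplementary gid (hypothetical 2000).
shared = {"agent": (10001, {10001, 2000}), "rag": (1000, {1000, 2000})}
print(readers(0o640, 1001, 2000, shared))
```

The middle case is the one to watch: Option A with a single chown fixes the agent but leaves rag and information-webui unable to read the key, which is why the group approach is the one that covers both uids.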
Verify from inside each container after deploy.sh:
docker exec dataland-agent cat /app/gcp-key.json >/dev/null && echo agent-ok # (1)!
docker exec dataland-rag cat /app/gcp-key.json >/dev/null && echo rag-ok # (2)!
- (1) Reads the key as uid 10001 from inside the container and discards the contents; prints agent-ok only if the in-container read actually succeeds. This is the check that catches the uid-mismatch gotcha that a host-side cat hides.
- (2) Same proof for uid 1000 (rag, and by extension information-webui, share the same uid). Prints rag-ok only when the through-the-bind read works.
On the GCP side, the migration also splits Gemini/GCS access behind a new ai-proxy so Spark
callers don't need the key at all post-cut — see Migration plan. Until that
lands, every host hosting agent/rag/information-webui needs a readable key.
3. JWKS mirror (data/extra_jwks.json)
dataland-auth runs from the agent image (command: python auth_server.py) and persists its RS256
signing key in the auth-data volume at /app/data (auth_rsa_private.pem + auth_rsa_kid.txt,
kid=dataland-rs256-1). On top of that, DAT-286 has it serve a mirror of the external CMS
signing key's public JWK alongside its own:
flowchart LR
CMS["CMS / mobile backend<br/>signs tokens (kid dataland-rs256-1)"]
subgraph auth["dataland-auth"]
local["local RS256 signing key<br/>(auth_rsa_private.pem)"]
extra["extra_jwks.json<br/>(public CMS key, mirror)"]
served["/.well-known/jwks.json<br/>(local key first, then mirror)"]
end
Agent["dataland-agent<br/>verifies JWT against each JWKS URL"]
CMSremote["CMS-staging JWKS<br/>(fallback / sole validator if mirror absent)"]
CMS -->|public key| extra
local --> served
extra --> served
served --> Agent
CMSremote -.->|fallback| Agent
Why it matters for relocation: the agent verifies a token by trying each configured JWKS URL in turn. With the CMS key absent from the local JWKS, the agent falls through to the external CMS-staging endpoint as the sole validator for all chat auth — a single point of failure. A JWKS only ever exposes public material (never the signing key), so mirroring it locally demotes the external endpoint to a pure backup. The agent WARNs when a fallback JWKS provider is the only validator.
The mirror lives in the persisted auth-data volume, so a volume restore carries it. A fresh
volume does not — re-provision it:
# fetch the upstream CMS public JWKS and drop it into the auth volume
docker exec dataland-auth python -c "import urllib.request; \
open('/app/data/extra_jwks.json','wb').write( \
urllib.request.urlopen('https://<cms-host>/.well-known/jwks.json').read())" # (1)!
docker restart dataland-auth # (2)!
# confirm the served JWKS now carries >1 key (local + mirrored kid)
curl -fsS http://localhost:9000/.well-known/jwks.json | jq '.keys | length' # (3)!
- (1) Writes the file at /app/data/extra_jwks.json, the persisted auth-data mount, so it survives a restart. A JWKS only ever carries public key material, so this never moves signing material; it just gives auth a local copy of the CMS public key.
- (2) Auth loads extra_jwks.json at startup, so the restart is what actually picks up the mirror. No restart, no mirror.
- (3) Expect >1: one entry for the local dataland-rs256-1 key plus the mirrored CMS kid. A 1 means the mirror did not load and the external CMS endpoint is still the sole validator.
Alternatively set AUTH_EXTRA_JWKS_JSON (inline JSON) or AUTH_EXTRA_JWKS_PATH in .env. A
malformed or missing source is logged (auth.extra_jwks.bad_source) and never raised — a bad
mirror must not take auth, and thus all chat, down. The local key is always authoritative: an extra
key reusing dataland-rs256-1 is skipped so a stale mirror can never shadow the key auth signs with.
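The merge rules described above (local key first, mirrored keys after, a colliding kid skipped, a bad source ignored rather than raised) can be sketched as follows. This is an illustration of the stated behavior, not the code in auth_server.py: merge_jwks is a hypothetical name, and the cms-1 kid in the usage lines is made up.

```python
import json


def merge_jwks(local_keys, extra_source):
    """Build the served JWKS: local keys first, then mirrored keys.

    extra_source is the raw JSON text of the mirror (or None). A malformed
    mirror is dropped silently here; the real service is described as logging
    auth.extra_jwks.bad_source at this point instead of raising.
    """
    merged = list(local_keys)
    seen = {key.get("kid") for key in merged}
    try:
        extra = json.loads(extra_source)["keys"] if extra_source else []
    except (ValueError, KeyError, TypeError):
        extra = []
    if not isinstance(extra, list):
        extra = []
    for key in extra:
        if not isinstance(key, dict):
            continue
        kid = key.get("kid")
        if kid in seen:
            continue  # a mirrored key must never shadow a local kid
        seen.add(kid)
        merged.append(key)
    return {"keys": merged}


local = [{"kid": "dataland-rs256-1", "kty": "RSA"}]
mirror = json.dumps({"keys": [
    {"kid": "cms-1", "kty": "RSA"},                          # hypothetical CMS kid
    {"kid": "dataland-rs256-1", "kty": "RSA", "n": "stale"},  # collides, so skipped
]})
served = merge_jwks(local, mirror)
print([key["kid"] for key in served["keys"]])
```

Under these assumptions the served list keeps the local key in first position and gains exactly one mirrored entry, which is the >1 count the curl check above looks for.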
Never restore stale auth signing material
The local RS256 private key in auth-data is signing material. If the volume is lost and you
cannot prove the old key was never compromised, rotate instead of restoring: let auth
regenerate, re-provision the extra_jwks.json mirror, and notify any client that pinned the
prior JWKS. Treat a restore as a key-rotation event.
4. Museum knowledge re-ingest (only if Qdrant is rebuilt)
Qdrant is recomputable — vectors derive from GCS documents via RAG. If you carry the
qdrant-data volume (or a per-collection snapshot) you skip this. If you start with an empty
Qdrant, re-ingest:
curl -fsS -X DELETE http://localhost:4146/collections/knowledge # (1)!
curl -fsS -X POST http://localhost:4143/ingest/sync -H "X-API-Key: $API_KEY" # (2)!
- (1) Drops the knowledge collection entirely on the Qdrant REST port (4146) before the rebuild. This is the destructive half; skip it if you carried the qdrant-data volume or a snapshot.
- (2) Triggers a full drop-and-re-embed from GCS via RAG (port 4143); the X-API-Key header is required. Budget hours, not minutes, because every museum document is re-embedded.
The museum content specifically — 20 sections + scenes + the overview — is (re-)ingested into
the Qdrant knowledge collection through the WebUI's RAG live-sync (text → /ingest/file, images →
/ingest/image; rag slugs museum-section-<slug> / museum-scene-<slug>, replace-by-slug with
UUIDv5 ids). As part of this change-set the knowledge collection went 4839 → 4969 points. After
a webui-data restore, re-run the WebUI live-sync (or scripts/sync_from_qdrant.py) to reconcile the
catalog against Qdrant — drift between the two is the most common after-restore footgun.
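Replace-by-slug with UUIDv5 ids works because a UUIDv5 is a pure function of a namespace and a name: the same slug always maps to the same point id, so a re-ingest overwrites rather than duplicates. A minimal sketch of that property follows. The namespace the real pipeline uses is not stated on this page, so uuid.NAMESPACE_URL is a stand-in, and point_id and the entrance-hall slug are hypothetical.

```python
import uuid

# Assumption: stand-in namespace; the real ingest may use a different one.
NAMESPACE = uuid.NAMESPACE_URL


def point_id(kind, slug):
    """Deterministic id for a museum-section-<slug> or museum-scene-<slug> rag slug."""
    if kind not in ("section", "scene"):
        raise ValueError(f"unknown kind: {kind!r}")
    rag_slug = f"museum-{kind}-{slug}"
    return str(uuid.uuid5(NAMESPACE, rag_slug))


first = point_id("section", "entrance-hall")   # hypothetical slug
second = point_id("section", "entrance-hall")
print(first == second)                          # same slug, same id on every run
print(first == point_id("scene", "entrance-hall"))
```

Because the id never changes for a given slug, a stable point count across repeated syncs is a reasonable sign that replace-by-slug is doing its job, while a count that grows on every run suggests ids are not being derived deterministically.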
Museum chapter placeholders are purged (DAT-288)
The default_reference_* placeholder images were removed from both chapters.json and GCS
cobanov-public/chapters. They were never in Qdrant, so this is purely a GCS + catalog concern
— do not re-seed them on the new host.
5. Tailnet binds + Cloudflare ingress
compose.yml publishes the data-plane/ops services on each host's tailnet IP via *_PUBLIC_BIND
(default 100.124.170.43). On a new host these must point at that host's tailnet interface IP,
never 0.0.0.0 and never 127.0.0.1 (which would collide with the loopback line already in
compose). Set, at minimum: QDRANT_PUBLIC_BIND, REDIS_PUBLIC_BIND, RAG_PUBLIC_BIND,
NOTIFICATION_PUBLIC_BIND, POSTGRES_PUBLIC_BIND, DOCS_PUBLIC_BIND. The observability UIs
(PROMETHEUS_PUBLIC_BIND, GRAFANA_PUBLIC_BIND, ALERTMANAGER_PUBLIC_BIND) default to 127.0.0.1
— override to the tailnet IP only if operators want browser access without an SSH tunnel.
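The bind rules for the data-plane set (never 0.0.0.0, never loopback, always the host's tailnet IP) are easy to check mechanically before a deploy. The sketch below is one way to do it, assuming IPv4 tailnet addresses drawn from the 100.64.0.0/10 CGNAT range that Tailscale allocates from; check_public_bind is a hypothetical helper, not something in the repo. It does not apply to the observability binds, which legitimately default to 127.0.0.1.

```python
import ipaddress

# Tailscale hands out IPv4 addresses from the CGNAT block.
TAILNET = ipaddress.ip_network("100.64.0.0/10")


def check_public_bind(name, value):
    """Return None if the bind looks acceptable, else a reason string."""
    try:
        addr = ipaddress.ip_address(value)
    except ValueError:
        return f"{name}: {value!r} is not an IP address"
    if addr.is_unspecified:
        return f"{name}: {addr} publishes the port on every interface"
    if addr.is_loopback:
        return f"{name}: {addr} collides with the loopback publish already in compose.yml"
    if addr not in TAILNET:
        return f"{name}: {addr} is outside {TAILNET}, so probably not a tailnet IP"
    return None


for var, value in [
    ("QDRANT_PUBLIC_BIND", "100.124.170.43"),  # current default from this page
    ("REDIS_PUBLIC_BIND", "0.0.0.0"),
    ("RAG_PUBLIC_BIND", "127.0.0.1"),
]:
    print(var, check_public_bind(var, value) or "ok")
```

The range test is deliberately a soft signal: an address inside 100.64.0.0/10 is not proof that it belongs to this host's tailnet interface, so confirming it against the host's own Tailscale address is still worthwhile.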
Finally, repoint Cloudflare tunnel ingress for dataland.chat, data.dataland.chat, and
docs.dataland.chat at the new origin (host systemd cloudflared, ingressing to 127.0.0.1).
Swap DNS / origin last so rollback stays a one-line origin change.
Cross-network dependencies
Each service's external touchpoints — useful for sizing firewall / peering asks.
| Service | Reaches out to | Direction |
|---|---|---|
| Agent | RAG, museum-api, notification-api, Auth JWKS, CMS JWKS, Gemini API, Postgres, internal redis | Egress |
| Museum API | External RDC redis (museum LAN), internal redis, RAG | Egress |
| RAG | GCS (dataland-public, dataland-private, cobanov-public), Gemini API, Qdrant | Egress |
| Information WebUI | RAG, GCS, (no DB — SQLite is local) | Egress |
| Notification worker / api | Internal redis, Agent (service token), OneSignal, Discord/Slack (ops alerts) | Egress |
| Auth | Postgres | Egress |
| Internal Redis / Docs | — | None outbound |
The one critical cross-network flow is museum-api → RDC redis on the museum's network. Every
other external touchpoint is public internet or our own internal infra. Post-split, the
Migration plan routes Gemini + GCS through a GCP ai-proxy so Spark (vlan23, no
internet) can still reach them over the Spark ↔ GCP link.
Open decisions (asks for the call)
- Is there a museum-internal "services" vlan where a host can sit and still reach the RDC redis directly? If yes, museum-api + notification + internal redis can stay on-premise (sub-ms redis latency). Current direction: Spark → vlan23 with a Cloud VPN to GCP.
- If everything goes to GCP, what's the supported network path back to the RDC redis? Options: Cloud VPN / IPSec, Dedicated Interconnect (likely overkill), a TLS-fronted redis endpoint with an IP allow-list, or a Tailscale subnet router. Current direction: Cloud VPN / Interconnect.
- Long-term plan for the RDC redis endpoint — same instance, or are server-side consumers moving?
- IP/auth model on the new GCP peer — IAM, IP allow-list, mTLS. We build to it but need the shape.
Suggested migration phases
This is the host-level ordering; the network-level phasing lives in the Migration plan.
Phase A (independent of museum decision, 1-2 days)
□ Stand up the destination host(s) + VPC + service accounts
□ Migrate stateless services first: agent, auth, rag, information-webui, docs
□ Carry/re-provision the persistent state:
postgres-data (dump+restore), auth-data (incl. extra_jwks.json),
webui host bind (catalog.sqlite3 + uploads), redis-data (AOF)
□ Qdrant: carry the volume/snapshot, OR rebuild from GCS via /ingest/sync
Phase B (after museum redis-access decision)
□ Land museum-api + notification (worker+api) + internal redis in the
decided location; smoke: telemetry flowing, rules firing
Phase C (clean-up)
□ Re-provision the GCS-key ACL for uids 10001/1000 on the new host
□ Re-point tailnet *_PUBLIC_BIND to the new host's tailnet IP
□ Cloudflare ingress + DNS swap LAST (keeps rollback a one-line origin change)
□ Drain the old host, decommission
Post-move smoke (acceptance)
Before declaring a relocation done, confirm — in this order:
1. deploy.sh completes (its DAT-291 boot guard passed against the new .env).
2. docker compose ps shows every container healthy (the healthchecks in compose.yml).
3. GCS key is readable inside agent, rag, information-webui (the docker exec cat checks above).
4. Auth serves >1 JWK (/.well-known/jwks.json carries local + mirrored kid).
5. Qdrant knowledge reports ~4969 points (or your last re-ingest count).
6. dataland-agent answers a museum-mode chat round-trip end-to-end, and a welcome push fires on the first empty /museum message (DAT-296).
7. Telemetry flows: museum:telemetry is being XADD-ed and notification rules fire.
What's outside our control
- Network design for redis access — museum infra responsibility. We build to the access pattern they sanction.
- RDC redis availability / deprecation — museum-side roadmap. We can feature-flag a transport swap, but the cut-over date isn't ours.
- Mobile client behavior — Refik Anadol Studio. Our chat API contract stays the same regardless of any RDC transport change.