Public ports & bind policy

Every Dataland service runs as a Docker-Compose container on the Spark DGX VDS (ege@100.124.170.43). Internal service-to-service traffic stays on the dataland-network Docker bridge and uses container DNS names (http://dataland-agent:4141, http://dataland-rag:4143, dataland-redis:6379, …), never host-published ports. Only the host port mappings described below are reachable from outside the Docker network, and only a small subset of those is exposed publicly.

The one rule: never 0.0.0.0 for data-plane or ops services

Binding a stateful or unauthenticated service to 0.0.0.0 republishes it on the host's public Spectrum IP, not just the tailnet. That regression once exposed Postgres (5432) and Qdrant (no API key) to the open internet (DAT-73). The fix: bind those services to 127.0.0.1 plus the tailnet IP, explicitly, and let deploy.sh fail fast on placeholder secrets (DAT-291).
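The rule above can be checked mechanically before a deploy. The following is a minimal sketch, assuming the bind values are available as plain strings; `validate_public_bind` and its `dual`/`single` modes are hypothetical names for illustration and are not part of deploy.sh.

```shell
# Hypothetical helper, not part of deploy.sh.
# Usage: validate_public_bind NAME VALUE [dual|single]
#   dual   (default): the service already has a 127.0.0.1 line, so loopback is rejected
#   single: the variable is the only bind (monitoring UIs), so loopback is allowed
validate_public_bind() {
  name="$1" value="$2" mode="${3:-dual}"
  case "$value" in
    "")
      echo "$name: empty"; return 1 ;;
    0.0.0.0)
      echo "$name: 0.0.0.0 would publish on the public IP"; return 1 ;;
    127.0.0.1)
      if [ "$mode" = dual ]; then
        echo "$name: 127.0.0.1 collides with the existing loopback line"; return 1
      fi ;;
  esac
  # Shape check only: dotted quad, octet ranges are not verified.
  if ! printf '%s\n' "$value" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "$name: not an IPv4 literal: $value"; return 1
  fi
  echo "$name: ok ($value)"
}
```

For example, `validate_public_bind QDRANT_PUBLIC_BIND 0.0.0.0` returns non-zero, while `validate_public_bind QDRANT_PUBLIC_BIND 100.124.170.43` passes.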

Recent changes (DAT-73 / DAT-291)

*_PUBLIC_BIND variables let each data-plane and ops service publish on two host IPs: 127.0.0.1 (local host + SSH-tunnel workflow) and the host's tailnet IP (direct access from Tailscale peers). Each defaults to the Spark's tailnet IP, 100.124.170.43. Never set one to 0.0.0.0 or to 127.0.0.1 (the latter would collide with the loopback line already present in compose.yml).
deploy.sh now runs the agent's real boot guard against the prod .env before rebuilding, aborting on placeholder/default secrets so a crash-loop deploy can't take chat offline.
The docs.dataland.chat service (this site) is dual-bound (loopback for cloudflared + tailnet for peers) and Cloudflare-fronted.
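The fail-fast idea behind the deploy.sh change can be illustrated with a small pre-flight scan. This is only a sketch: the real check is the agent's own boot guard, whose rules are not described here. The marker list (`changeme`, `change-me`, `placeholder`, `example`) and the function names are assumptions made for the example.

```shell
# Hypothetical stand-in for the boot guard; the marker list is an assumption.
# Prints the names of PASSWORD/SECRET/TOKEN variables whose value is empty
# or equal to a well-known placeholder string.
find_placeholder_secrets() {
  # $1: path to an env file with KEY=value lines
  # grep exits 1 when nothing matches; that is not an error here.
  grep -Ei '^[A-Z0-9_]*(PASSWORD|SECRET|TOKEN)[A-Z0-9_]*=(changeme|change-me|placeholder|example)?$' "$1" \
    | cut -d= -f1 || true
}

# Returns non-zero (and lists the offenders on stderr) if any are found.
preflight_env() {
  bad=$(find_placeholder_secrets "$1")
  if [ -n "$bad" ]; then
    echo "refusing to deploy; placeholder or empty secrets:" >&2
    echo "$bad" >&2
    return 1
  fi
  echo "secrets look populated"
}
```

A deploy script could call `preflight_env .env || exit 1` before rebuilding. Variables ending in `_KEY` are deliberately not scanned in this sketch, since `QDRANT_API_KEY` is empty by design in the deploy env.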

Bind policy at a glance

flowchart LR
  Internet([Public internet]) -->|HTTPS| CF[Cloudflare Tunnel<br/>cloudflared on host systemd]
  Peers([Tailscale peers]) -->|tailnet 100.124.170.43| TN
  Op([Operator workstation]) -->|SSH -L tunnel| LB

  subgraph Host["Spark VDS — host network namespace"]
    CF -->|127.0.0.1| LB[127.0.0.1 loopback binds]
    TN[Tailnet-IP binds<br/>*_PUBLIC_BIND]
    LB --> DKR
    TN --> DKR
  end

  DKR[(Docker bridge<br/>dataland-network)]

Three publishing patterns are used in compose.yml:

Pattern	Host binding(s)	Used by	Why
Cloudflare-fronted	`0.0.0.0` (single mapping); the intended path is through cloudflared	Agent, Museum API/dashboard, Information WebUI, Auth (JWKS)	Public hostnames terminate TLS at Cloudflare; the app enforces its own auth (JWT / shared-password / API token).
Dual-bind (loopback + tailnet)	`127.0.0.1` and `${X_PUBLIC_BIND}`	Postgres, Qdrant (HTTP+gRPC), Redis, RAG, Notification API, Docs	Stateful or weakly/un-authenticated: the tailnet is the trust boundary. Two explicit lines, never `0.0.0.0`.
Loopback-only / no host port	`127.0.0.1` only, or no `ports:` at all	Prometheus / Grafana / Alertmanager (default), exporters, cAdvisor, node-exporter, sim redis	Operator-only or scrape-only; the exporters publish no host port and are read by Prometheus over the Docker network.

Cloudflare-fronted does not make 0.0.0.0 safe

The agent, museum, webui and auth containers publish with a single ${X_PUBLIC_PORT}:container mapping, which Docker binds on 0.0.0.0. That is acceptable only because each enforces application-layer auth (RS256 JWT, shared-password session, API token) and the intended path is the Cloudflare tunnel. Do not copy this pattern to a service without auth — that is exactly the data-plane mistake DAT-73 fixed.
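To spot a single-mapping publish before it ships, the short `ports:` syntax can be classified by counting colons. The sketch below assumes fully interpolated IPv4 mappings (for example, as printed by `docker compose config`); uninterpolated forms such as `${VAR:-4141}:4141` and bracketed IPv6 addresses are not handled. `classify_port_mapping` is a hypothetical helper name.

```shell
# Hypothetical helper: classify one short-syntax port mapping.
#   CONTAINER or HOST:CONTAINER      -> no host IP, Docker binds 0.0.0.0
#   IP:HOST:CONTAINER                -> bound to that IP only
classify_port_mapping() {
  case "$1" in
    *:*:*)
      ip="${1%%:*}"            # text before the first colon is the host IP
      case "$ip" in
        0.0.0.0)   echo wildcard ;;
        127.0.0.1) echo loopback ;;
        *)         echo ip-bound ;;
      esac ;;
    *)
      echo wildcard ;;
  esac
}
```

`classify_port_mapping 4141:4141` prints `wildcard`, and `classify_port_mapping 127.0.0.1:4148:80` prints `loopback`. Any `wildcard` result on a service without application-layer auth is the pattern to reject.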

Application services

Service	Container	Host port (var)	Container port	Bind policy	Cloudflare hostname
Agent	`dataland-agent`	`4141` (`AGENT_PUBLIC_PORT`)	`4141`	`0.0.0.0` (Cloudflare-fronted; RS256 JWT via JWKS)	`dataland.chat`
Museum API + dashboard	`dataland-museum`	`4144` (`MUSEUM_PUBLIC_PORT`)	`5001`	`0.0.0.0` (Cloudflare-fronted; dashboard shared-password gate)	museum dashboard hostname
Information WebUI ("Catalog Studio")	`dataland-atlas`	`4152` (`INFORMATION_WEBUI_PUBLIC_PORT`)	`4152`	`0.0.0.0` (Cloudflare-fronted; shared-password gate)	`data.dataland.chat`
Auth (JWKS)	`dataland-auth`	`9000` (`AUTH_PUBLIC_PORT`)	`9000`	`0.0.0.0` (serves `/.well-known/jwks.json`)	— (internal / tunnel)
RAG	`dataland-rag`	`4143` (`RAG_PUBLIC_PORT`)	`4143`	dual-bind `127.0.0.1` + `RAG_PUBLIC_BIND`; `X-API-Key` auth	— (internal)
Notification API	`dataland-notification-api`	`8080` (`NOTIFICATION_PUBLIC_PORT`)	`8080`	dual-bind `127.0.0.1` + `NOTIFICATION_PUBLIC_BIND`; ops-token on writes	— (internal)
Docs (this site)	`dataland-docs`	`4148` (`DOCS_PUBLIC_PORT`)	`80`	dual-bind `127.0.0.1` + `DOCS_PUBLIC_BIND`; nginx basic auth	`docs.dataland.chat`

Notification worker has no host port

dataland-notification-worker is a pure stream consumer of museum:telemetry — it publishes nothing and exposes no host port. Only dataland-notification-api (the DLQ + state inspector) is published, and only on loopback + tailnet.

Cloudflare hostnames

Hostname	Fronts	Container target (host)	App-layer gate
`dataland.chat`	Chat agent (SSE chat, conversations, service endpoints)	`dataland-agent` → `127.0.0.1:4141`	RS256 JWT (JWKS)
`data.dataland.chat`	Information WebUI (Catalog Studio CMS)	`dataland-atlas` → `127.0.0.1:4152`	Shared-password session + optional API token
`docs.dataland.chat`	This documentation site	`dataland-docs` → `127.0.0.1:4148`	nginx HTTP basic auth (`DOCS_USERNAME` / `DOCS_PASSWORD`)
museum dashboard	Live museum monitor UI	`dataland-museum` → `127.0.0.1:4144`	Shared-password session (`MUSEUM_PASSWORD`)

How the tunnel reaches the container

cloudflared runs as a host systemd unit (not a compose service) and routes each public hostname to localhost:<host-port> on the VDS. That is why the dual-bound services (docs especially) keep their 127.0.0.1 line: cloudflared needs the loopback path, and the tailnet line is a separate, additive binding for direct peer browsing. TLS terminates at Cloudflare, so basic-auth credentials never travel in plaintext over the public path. Adding a new public service means registering a route in the Zero Trust dashboard (Tunnels → tunnel → Public Hostname → Add), as described in Deploy.
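When a public hostname misbehaves, it can help to probe the same loopback origin that cloudflared targets. The sketch below hard-codes the hostname-to-origin pairs from the table above; the function names are hypothetical, and the museum dashboard is left out because its hostname is not given here.

```shell
# Hypothetical helper: loopback origin behind each public hostname,
# copied from the Cloudflare hostnames table.
origin_for_hostname() {
  case "$1" in
    dataland.chat)      echo 127.0.0.1:4141 ;;
    data.dataland.chat) echo 127.0.0.1:4152 ;;
    docs.dataland.chat) echo 127.0.0.1:4148 ;;
    *) echo "unknown hostname: $1" >&2; return 1 ;;
  esac
}

# Probe the origin directly on the VDS, bypassing Cloudflare.
# docs is nginx and answers on /healthz; the others expose /health.
probe_origin() {
  origin=$(origin_for_hostname "$1") || return 1
  case "$1" in
    docs.dataland.chat) path=/healthz ;;
    *)                  path=/health ;;
  esac
  curl -fsS -o /dev/null "http://$origin$path"
}
```

Run on the VDS, `probe_origin docs.dataland.chat` succeeding while the public URL fails would point at the tunnel or the Cloudflare route rather than the container.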

Data plane (dual-bind: loopback + tailnet)

These are stateful or carry weak/no built-in auth, so they are published on 127.0.0.1 and their *_PUBLIC_BIND tailnet IP — two explicit lines in compose.yml, never 0.0.0.0.

Service	Container	Loopback host port (var)	Container port	Tailnet bind var	Trust boundary
Postgres	`dataland-postgres`	`5432` (`POSTGRES_PORT`)	`5432`	`POSTGRES_PUBLIC_BIND`	Tailnet (password auth only)
Redis	`dataland-redis`	`4145` (`REDIS_PUBLIC_PORT`)	`6379`	`REDIS_PUBLIC_BIND`	Tailnet + `--requirepass` (`REDIS_PASSWORD`, DAT-76)
Qdrant HTTP	`dataland-qdrant`	`4146` (`QDRANT_HTTP_PORT`)	`6333`	`QDRANT_PUBLIC_BIND`	Tailnet (Qdrant runs with no API key)
Qdrant gRPC	`dataland-qdrant`	`4147` (`QDRANT_GRPC_PUBLIC_PORT`)	`6334`	`QDRANT_PUBLIC_BIND`	Tailnet (no API key)

Qdrant has no auth — the tailnet IS the auth

QDRANT_API_KEY is empty in the deploy env. The only thing between an attacker and the knowledge / images / scenes collections is the bind policy. Setting QDRANT_PUBLIC_BIND=0.0.0.0 would publish a fully open vector store on the public IP. Redis is safer (DAT-76 --requirepass) but the same rule holds: tailnet, not world.

Host binding does not affect service-to-service traffic

rag talks to qdrant, the agent talks to postgres and redis, etc., all over the dataland-network Docker bridge. Changing or removing a host port binding never breaks internal calls — those resolve container-to-container, independent of the host publish lines.

Observability (loopback by default)

Prometheus, Grafana and Alertmanager default to 127.0.0.1, so their UIs are unreachable on the public Spectrum IP. Each has a *_PUBLIC_BIND you may set to the tailnet IP for direct browser access without an SSH tunnel.

Service	Container	Host port (var)	Bind var (default)	Auth
Prometheus	`dataland-prometheus`	`9090` (`PROMETHEUS_PUBLIC_PORT`)	`PROMETHEUS_PUBLIC_BIND` (`127.0.0.1`)	None (`--web.enable-lifecycle` is on)
Grafana	`dataland-grafana`	`3000` (`GRAFANA_PUBLIC_PORT`)	`GRAFANA_PUBLIC_BIND` (`127.0.0.1`)	Admin login (`GRAFANA_ADMIN_USER` / `GRAFANA_ADMIN_PASSWORD`, DAT-266)
Alertmanager	`dataland-alertmanager`	`9093` (`ALERTMANAGER_PUBLIC_PORT`)	`ALERTMANAGER_PUBLIC_BIND` (`127.0.0.1`)	None

Never 0.0.0.0 for the monitoring UIs

Prometheus and Alertmanager ship with no authentication, and Grafana's admin password is the only thing between the public internet and full Prometheus data. For tailnet browser access set, e.g., PROMETHEUS_PUBLIC_BIND=100.124.170.43 in .env. Otherwise reach them via an SSH tunnel:

ssh -L 9090:127.0.0.1:9090 -L 9093:127.0.0.1:9093 -L 3000:127.0.0.1:3000 ege@100.124.170.43 # (1)!

Forwards all three loopback-only monitoring UIs (Prometheus 9090, Alertmanager 9093, Grafana 3000) over one SSH session to the same local ports. Required because each service defaults to *_PUBLIC_BIND=127.0.0.1, so they are unreachable off-host unless you either tunnel or set the bind var to the tailnet IP.
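If the set of forwarded UIs varies, the same command can be assembled from a port list. This is a small sketch; `tunnel_command` is a hypothetical helper that only prints the command, mapping each remote loopback port to the same local port.

```shell
# Hypothetical helper: print an ssh command that forwards each given port
# from the remote loopback interface to the same local port.
# Usage: tunnel_command USER@HOST PORT...
tunnel_command() {
  target="$1"; shift
  cmd="ssh"
  for port in "$@"; do
    cmd="$cmd -L $port:127.0.0.1:$port"
  done
  echo "$cmd $target"
}

tunnel_command ege@100.124.170.43 9090 9093 3000
```

With the three monitoring ports this prints the same command as above; passing only `3000` yields a Grafana-only tunnel.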

Exporters & host-metrics profile (no host ports)

These publish no host ports — Prometheus scrapes them over the Docker network by container name. cadvisor and node-exporter are gated behind the host-metrics Compose profile (Linux-only; deploy.sh layers --profile host-metrics, macOS dev boxes leave it off because they can't expose host /proc + /sys).

Container	Role	Profile
`dataland-postgres-exporter`	`pg_stat_*` snapshots	always on
`dataland-redis-exporter`	`INFO`, `LATENCY`, stream depth (auths via `REDIS_PASSWORD`)	always on
`dataland-cadvisor`	per-container CPU/mem/IO	`host-metrics`
`dataland-node-exporter`	host CPU steal, IO wait, disk pressure	`host-metrics`

See Observability for the scrape map and dashboards.
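The Linux-only gating of the host-metrics profile can be sketched as a kernel check. How deploy.sh actually layers the flag is not shown here, so `compose_profile_args` is a hypothetical helper; `--profile` itself is a standard `docker compose` option.

```shell
# Hypothetical helper: emit the profile flag only on Linux hosts.
# $1 optionally overrides the kernel name (defaults to `uname -s`).
compose_profile_args() {
  kernel="${1:-$(uname -s)}"
  if [ "$kernel" = Linux ]; then
    echo "--profile host-metrics"
  fi
}

# Intended use (left unquoted on purpose so the flag splits into two words):
#   docker compose $(compose_profile_args) up -d
```

On a macOS dev box the function prints nothing, so cadvisor and node-exporter stay off.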

Dev & simulator overlays

Service	Container	Host port	Bind	Notes
Museum simulator	`dataland-simulator`	—	—	`compose.dev.yml` overlay or the `simulator` profile. Publishes synthetic `museum:telemetry` into `dataland-redis`. Production runs without it — the RDC bridge in `museum-api` feeds telemetry instead.
Sim redis	`dataland-redis-sim`	`4149`	`127.0.0.1` only	`compose.sim.yml` overlay. Separate volume + password (`SIM_REDIS_PASSWORD`). Kept on loopback so the operator can `XINFO` from the VDS without exposing the sim stream to the tailnet.
Telemetry sim publisher	`dataland-telemetry-sim`	—	—	`compose.sim.yml`. Emits `museum:telemetry` events, shaped like those from the prod RDC → museum bridge, into `dataland-redis-sim`, and points the notification worker/api at the sim redis.

Isolate playback from live telemetry

The compose.sim.yml overlay exists precisely so synthetic events land on a dedicated dataland-redis-sim, never on the live dataland-redis. Real visitors are not disturbed. Tear down with scripts/stop-simulator.sh.

Port allocation reference

All internally allocated host ports live in the 414x/415x range, plus a few upstream-conventional ports. The canonical values live in .env.example under "Public host ports".

Port	Service	Source var
`4141`	Agent	`AGENT_PUBLIC_PORT`
`4143`	RAG	`RAG_PUBLIC_PORT`
`4144`	Museum API + dashboard	`MUSEUM_PUBLIC_PORT`
`4145`	Redis	`REDIS_PUBLIC_PORT`
`4146`	Qdrant HTTP	`QDRANT_HTTP_PORT`
`4147`	Qdrant gRPC	`QDRANT_GRPC_PUBLIC_PORT`
`4148`	Docs	`DOCS_PUBLIC_PORT`
`4149`	Sim redis (dev)	hard-coded (`compose.sim.yml`)
`4152`	Information WebUI	`INFORMATION_WEBUI_PUBLIC_PORT`
`5432`	Postgres	`POSTGRES_PORT`
`8080`	Notification API	`NOTIFICATION_PUBLIC_PORT`
`9000`	Auth (JWKS)	`AUTH_PUBLIC_PORT`
`9090`	Prometheus	`PROMETHEUS_PUBLIC_PORT`
`9093`	Alertmanager	`ALERTMANAGER_PUBLIC_PORT`
`3000`	Grafana	`GRAFANA_PUBLIC_PORT`
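Because every port except the sim redis comes from a variable, an override in .env can collide with another service. A minimal sketch for catching that, assuming plain `NAME_PORT=number` lines on stdin; `find_duplicate_ports` is a hypothetical helper, and the hard-coded `4149` in compose.sim.yml is outside its view.

```shell
# Hypothetical helper: print every port number that is assigned to more
# than one *_PORT variable. Reads KEY=value lines from stdin.
find_duplicate_ports() {
  # grep exits 1 when no *_PORT lines are present; treat that as "no duplicates".
  grep -E '^[A-Z0-9_]+_PORT=[0-9]+$' \
    | cut -d= -f2 \
    | sort -n \
    | uniq -d || true
}
```

Typical use is `find_duplicate_ports < .env`; empty output means no two variables claim the same host port.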

Health-check cheat sheet

Run on the VDS (everything resolves on localhost there). Off-host, open the SSH tunnel for the loopback-only services first, or hit the tailnet IP (100.124.170.43) for the dual-bound ones.

# Public-path + loopback health on the host:
for url in \
  http://localhost:4141/health \
  http://localhost:4143/health \
  http://localhost:4144/health \
  http://localhost:4152/health \
  http://localhost:8080/health \
  http://localhost:9000/.well-known/jwks.json \
  http://localhost:4148/healthz \
  http://localhost:9090/-/healthy \
  http://localhost:9093/-/healthy \
  http://localhost:3000/api/health
do
  printf '%-55s ' "$url"
  curl -fsS -o /dev/null -w '%{http_code}\n' "$url" || echo down # (1)!
done

-f makes curl exit non-zero on HTTP 4xx/5xx so the || echo down branch fires; -sS silences the progress meter but keeps real errors; -o /dev/null discards the body and -w '%{http_code}\n' prints only the status code. The probe set covers the host-published HTTP application and monitoring services; the data-plane ports (Postgres, Redis, Qdrant) are not probed here. The auth check hits /.well-known/jwks.json (the JWKS endpoint, not a /health route), and docs uses /healthz because it is nginx behind basic auth (see the note below), not a FastAPI /health.

Why /healthz (not /health) for docs

The docs container is nginx, not a FastAPI app. nginx returns 200 ok at location = /healthz with auth_basic off, so the health probe (and the Compose wget --spider healthcheck) bypasses the basic-auth gate. Every other path requires the DOCS_USERNAME / DOCS_PASSWORD credential.

Deploy — deploy.sh, image pinning, Cloudflare ingress, smoke tests.
Observability — Prometheus scrape map, Grafana dashboards, Alertmanager.
Service hosting & relocation — what leaves the host and what stays on dataland-network.
Auth — JWKS, RS256 validation, the CMS signing-key mirror.