Public ports & bind policy

Every Dataland service runs as a Docker Compose container on the Spark DGX VDS (ege@100.124.170.43). Internal service-to-service traffic stays on the dataland-network Docker bridge and uses container DNS names (http://dataland-agent:4141, http://dataland-rag:4143, dataland-redis:6379, …), never host-published ports. Only the host port mappings described below are reachable from outside the containers, and only a small public subset leaves the host at all.

The one rule: never 0.0.0.0 for data-plane or ops services

Binding a stateful or unauthenticated service to 0.0.0.0 republishes it on the host's public Spectrum IP, not just the tailnet. That regression once exposed Postgres (5432) and Qdrant (no API key) to the open internet (DAT-73). The fix: bind those services to 127.0.0.1 plus the tailnet IP, explicitly, and let deploy.sh fail fast on placeholder secrets (DAT-291).

Recent changes (DAT-73 / DAT-291)

  • *_PUBLIC_BIND variables let each data-plane and ops service publish on two host IPs: 127.0.0.1 (local host + SSH-tunnel workflow) and the host's tailnet IP (direct access from Tailscale peers). For the dual-bind services each variable defaults to the Spark's tailnet IP, 100.124.170.43. Never set it to 0.0.0.0, and never to 127.0.0.1 (it would collide with the loopback line already present in compose.yml).
  • deploy.sh now runs the agent's real boot guard against the prod .env before rebuilding, aborting on placeholder/default secrets so a crash-loop deploy can't take chat offline.
  • The docs.dataland.chat service (this site) is dual-bound (loopback for cloudflared + tailnet for peers) and Cloudflare-fronted.
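
The bind rule for the dual-bind services can be sketched as a small pre-flight check. This is a minimal illustration, not the check deploy.sh actually runs: the function names validate_public_bind and dual_bind_lines are hypothetical, and the rule shown applies only to the dual-bind variables (the monitoring bind variables default to 127.0.0.1 and would need a different rule).

```shell
# Hypothetical helper: validate one dual-bind *_PUBLIC_BIND value.
# Prints "ok" and returns 0 for an acceptable value; prints a reason and returns 1 otherwise.
validate_public_bind() {
  case "$1" in
    "")        echo "rejected: empty"; return 1 ;;
    0.0.0.0)   echo "rejected: 0.0.0.0 republishes on the public IP"; return 1 ;;
    127.0.0.1) echo "rejected: collides with the loopback line in compose.yml"; return 1 ;;
    *)         echo "ok"; return 0 ;;
  esac
}

# Hypothetical helper: print the two publish lines of the dual-bind pattern.
# Args: tailnet bind IP, host port, container port.
dual_bind_lines() {
  printf '127.0.0.1:%s:%s\n' "$2" "$3"
  printf '%s:%s:%s\n' "$1" "$2" "$3"
}

# Example: Qdrant HTTP (host port 4146, container port 6333) on the tailnet IP.
validate_public_bind 100.124.170.43 && dual_bind_lines 100.124.170.43 4146 6333
```

The two printed mappings correspond to the "two explicit lines" the dual-bind pattern requires: one on loopback, one on the tailnet IP.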

Bind policy at a glance

flowchart LR
  Internet([Public internet]) -->|HTTPS| CF[Cloudflare Tunnel<br/>cloudflared on host systemd]
  Peers([Tailscale peers]) -->|tailnet 100.124.170.43| TN
  Op([Operator workstation]) -->|SSH -L tunnel| LB

  subgraph Host["Spark VDS — host network namespace"]
    CF -->|127.0.0.1| LB[127.0.0.1 loopback binds]
    TN[Tailnet-IP binds<br/>*_PUBLIC_BIND]
    LB --> DKR
    TN --> DKR
  end

  DKR[(Docker bridge<br/>dataland-network)]

Three publishing patterns are used in compose.yml:

| Pattern | Host binding(s) | Used by | Why |
|---|---|---|---|
| Cloudflare-fronted | 0.0.0.0 (single mapping), intended to be reached only through cloudflared | Agent, Museum API/dashboard, Information WebUI, Auth (JWKS) | Public hostnames terminate TLS at Cloudflare; the app enforces its own auth (JWT / shared-password / API token). |
| Dual-bind (loopback + tailnet) | 127.0.0.1 and ${X_PUBLIC_BIND} | Postgres, Qdrant (HTTP+gRPC), Redis, RAG, Notification API, Docs | Stateful or weakly/un-authenticated: the tailnet is the trust boundary. Two explicit lines, never 0.0.0.0. |
| Loopback-only / no host port | 127.0.0.1 only, or no ports: entry at all | Prometheus / Grafana / Alertmanager (default), exporters, cAdvisor, node-exporter, sim redis | Operator-only or scrape-only; the exporters publish no host port and are read by Prometheus over the Docker network. |

Cloudflare-fronted does not mean 0.0.0.0 is safe

The agent, museum, webui and auth containers publish with a single ${X_PUBLIC_PORT}:container mapping, which Docker binds on 0.0.0.0. That is acceptable only because each enforces application-layer auth (RS256 JWT, shared-password session, API token) and the intended path is the Cloudflare tunnel. Do not copy this pattern to a service without auth — that is exactly the data-plane mistake DAT-73 fixed.
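
To make the difference between the patterns concrete, the sketch below reports which host address a Compose short-syntax port mapping publishes on. It assumes IPv4 mappings and a Docker daemon with the default bind address; the function name publish_host_ip is hypothetical and is not part of compose.yml or deploy.sh.

```shell
# Hypothetical helper: print the host address a short-syntax mapping binds to.
# "ip:host:container" binds the given ip; "host:container" (or a bare
# container port) binds all interfaces, i.e. 0.0.0.0.
publish_host_ip() (
  mapping=${1%/*}                      # drop an optional /tcp or /udp suffix
  case "$mapping" in
    *:*:*) echo "${mapping%%:*}" ;;    # explicit host ip
    *)     echo "0.0.0.0" ;;           # no ip given: all interfaces
  esac
)

# Example mappings: one single-mapping service and both lines of a dual-bind one.
for m in "4141:4141" "127.0.0.1:4148:80" "100.124.170.43:4148:80"; do
  printf '%-28s %s\n' "$m" "$(publish_host_ip "$m")"
done
```

A check like this could flag any mapping that resolves to 0.0.0.0 on a service without application-layer auth.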

Application services

| Service | Container | Host port (var) | Container port | Bind policy | Cloudflare hostname |
|---|---|---|---|---|---|
| Agent | dataland-agent | 4141 (AGENT_PUBLIC_PORT) | 4141 | 0.0.0.0 (Cloudflare-fronted; RS256 JWT via JWKS) | dataland.chat |
| Museum API + dashboard | dataland-museum | 4144 (MUSEUM_PUBLIC_PORT) | 5001 | 0.0.0.0 (Cloudflare-fronted; dashboard shared-password gate) | museum dashboard hostname |
| Information WebUI ("Catalog Studio") | dataland-atlas | 4152 (INFORMATION_WEBUI_PUBLIC_PORT) | 4152 | 0.0.0.0 (Cloudflare-fronted; shared-password gate) | data.dataland.chat |
| Auth (JWKS) | dataland-auth | 9000 (AUTH_PUBLIC_PORT) | 9000 | 0.0.0.0 (serves /.well-known/jwks.json) | none (internal / tunnel) |
| RAG | dataland-rag | 4143 (RAG_PUBLIC_PORT) | 4143 | dual-bind 127.0.0.1 + RAG_PUBLIC_BIND; X-API-Key auth | none (internal) |
| Notification API | dataland-notification-api | 8080 (NOTIFICATION_PUBLIC_PORT) | 8080 | dual-bind 127.0.0.1 + NOTIFICATION_PUBLIC_BIND; ops-token on writes | none (internal) |
| Docs (this site) | dataland-docs | 4148 (DOCS_PUBLIC_PORT) | 80 | dual-bind 127.0.0.1 + DOCS_PUBLIC_BIND; nginx basic auth | docs.dataland.chat |

Notification worker has no host port

dataland-notification-worker is a pure stream consumer of museum:telemetry — it publishes nothing and exposes no host port. Only dataland-notification-api (the DLQ + state inspector) is published, and only on loopback + tailnet.

Cloudflare hostnames

| Hostname | Fronts | Container target (host) | App-layer gate |
|---|---|---|---|
| dataland.chat | Chat agent (SSE chat, conversations, service endpoints) | dataland-agent (127.0.0.1:4141) | RS256 JWT (JWKS) |
| data.dataland.chat | Information WebUI (Catalog Studio CMS) | dataland-atlas (127.0.0.1:4152) | Shared-password session + optional API token |
| docs.dataland.chat | This documentation site | dataland-docs (127.0.0.1:4148) | nginx HTTP basic auth (DOCS_USERNAME / DOCS_PASSWORD) |
| museum dashboard | Live museum monitor UI | dataland-museum (127.0.0.1:4144) | Shared-password session (MUSEUM_PASSWORD) |

How the tunnel reaches the container

cloudflared runs as a host systemd unit (not a compose service) and ingresses each public hostname to localhost:<host-port> on the VDS. That is why the dual-bound services (docs especially) keep their 127.0.0.1 line — cloudflared needs the loopback path, and the tailnet line is a separate, additive binding for direct peer browsing. TLS terminates at Cloudflare, so basic-auth credentials never travel plaintext over the public path. Adding a new public service means registering a route in the Zero Trust dashboard (Tunnels → tunnel → Public Hostname → Add), as described in Deploy.

Data plane (dual-bind: loopback + tailnet)

These are stateful or carry weak/no built-in auth, so they are published on 127.0.0.1 and their *_PUBLIC_BIND tailnet IP — two explicit lines in compose.yml, never 0.0.0.0.

| Service | Container | Loopback host port (var) | Container port | Tailnet bind var | Trust boundary |
|---|---|---|---|---|---|
| Postgres | dataland-postgres | 5432 (POSTGRES_PORT) | 5432 | POSTGRES_PUBLIC_BIND | Tailnet (password auth only) |
| Redis | dataland-redis | 4145 (REDIS_PUBLIC_PORT) | 6379 | REDIS_PUBLIC_BIND | Tailnet + --requirepass (REDIS_PASSWORD, DAT-76) |
| Qdrant HTTP | dataland-qdrant | 4146 (QDRANT_HTTP_PORT) | 6333 | QDRANT_PUBLIC_BIND | Tailnet (Qdrant runs with no API key) |
| Qdrant gRPC | dataland-qdrant | 4147 (QDRANT_GRPC_PUBLIC_PORT) | 6334 | QDRANT_PUBLIC_BIND | Tailnet (no API key) |

Qdrant has no auth — the tailnet IS the auth

QDRANT_API_KEY is empty in the deploy env. The only thing between an attacker and the knowledge / images / scenes collections is the bind policy. Setting QDRANT_PUBLIC_BIND=0.0.0.0 would publish a fully open vector store on the public IP. Redis is safer (DAT-76 --requirepass) but the same rule holds: tailnet, not world.

Host binding does not affect service-to-service traffic

rag talks to qdrant, the agent talks to postgres and redis, etc., all over the dataland-network Docker bridge. Changing or removing a host port binding never breaks internal calls — those resolve container-to-container, independent of the host publish lines.

Observability (loopback by default)

Prometheus, Grafana and Alertmanager default to 127.0.0.1 so their UIs are invisible on the public Spectrum IP. Each has a *_PUBLIC_BIND you may set to the tailnet IP for direct browser access without an SSH tunnel.

| Service | Container | Host port (var) | Bind var (default) | Auth |
|---|---|---|---|---|
| Prometheus | dataland-prometheus | 9090 (PROMETHEUS_PUBLIC_PORT) | PROMETHEUS_PUBLIC_BIND (127.0.0.1) | None (--web.enable-lifecycle is on) |
| Grafana | dataland-grafana | 3000 (GRAFANA_PUBLIC_PORT) | GRAFANA_PUBLIC_BIND (127.0.0.1) | Admin login (GRAFANA_ADMIN_USER / GRAFANA_ADMIN_PASSWORD, DAT-266) |
| Alertmanager | dataland-alertmanager | 9093 (ALERTMANAGER_PUBLIC_PORT) | ALERTMANAGER_PUBLIC_BIND (127.0.0.1) | None |

Never 0.0.0.0 for the monitoring UIs

Prometheus and Alertmanager ship with no authentication, and Grafana's admin password is the only thing between the public internet and full Prometheus data. For tailnet browser access set, e.g., PROMETHEUS_PUBLIC_BIND=100.124.170.43 in .env. Otherwise reach them via an SSH tunnel:

ssh -L 9090:127.0.0.1:9090 -L 9093:127.0.0.1:9093 -L 3000:127.0.0.1:3000 ege@100.124.170.43

This forwards all three loopback-only monitoring UIs (Prometheus 9090, Alertmanager 9093, Grafana 3000) over one SSH session to the same local ports. It is required because each service defaults to *_PUBLIC_BIND=127.0.0.1, so the UIs are unreachable off-host unless you either tunnel or set the bind var to the tailnet IP.
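
When only some of the UIs are needed, the same command can be assembled from a list of ports. The helper below is a minimal sketch; the name tunnel_cmd is hypothetical, and it prints the command rather than running it so the result can be inspected first.

```shell
# Hypothetical helper: build an ssh command that forwards each given
# loopback-only port on the remote host to the same local port.
# Args: ssh target, then one or more ports.
tunnel_cmd() (
  host=$1
  shift
  args=""
  for port in "$@"; do
    args="$args -L ${port}:127.0.0.1:${port}"
  done
  echo "ssh$args $host"
)

# Example: print the full three-port tunnel command.
tunnel_cmd ege@100.124.170.43 9090 9093 3000
```

Passing a single port, for example tunnel_cmd ege@100.124.170.43 3000, yields a Grafana-only tunnel.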

Exporters & host-metrics profile (no host ports)

These publish no host ports — Prometheus scrapes them over the Docker network by container name. cadvisor and node-exporter are gated behind the host-metrics Compose profile (Linux-only; deploy.sh layers --profile host-metrics, macOS dev boxes leave it off because they can't expose host /proc + /sys).

| Container | Role | Profile |
|---|---|---|
| dataland-postgres-exporter | pg_stat_* snapshots | always on |
| dataland-redis-exporter | INFO, LATENCY, stream depth (auths via REDIS_PASSWORD) | always on |
| dataland-cadvisor | per-container CPU/mem/IO | host-metrics |
| dataland-node-exporter | host CPU steal, IO wait, disk pressure | host-metrics |
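
The Linux-only gating of the host-metrics profile can be illustrated with a small helper that picks the profile flag from the kernel name. This is a sketch under assumptions: the function name compose_profile_args is hypothetical, and the exact command deploy.sh runs is not shown in this page, so the example only prints a plausible command instead of executing it.

```shell
# Hypothetical helper: choose Compose profile flags from the kernel name
# (the value printed by `uname -s`). Only Linux gets the host-metrics profile.
compose_profile_args() {
  case "$1" in
    Linux) echo "--profile host-metrics" ;;
    *)     : ;;   # macOS and others: no extra profile
  esac
}

# Print (not run) the command this machine would use.
echo "docker compose $(compose_profile_args "$(uname -s)") up -d"
```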

See Observability for the scrape map and dashboards.

Dev & simulator overlays

| Service | Container | Host port | Bind | Notes |
|---|---|---|---|---|
| Museum simulator | dataland-simulator | | | compose.dev.yml overlay or the simulator profile. Publishes synthetic museum:telemetry into dataland-redis. Production runs without it; the RDC bridge in museum-api feeds telemetry instead. |
| Sim redis | dataland-redis-sim | 4149 | 127.0.0.1 only | compose.sim.yml overlay. Separate volume + password (SIM_REDIS_PASSWORD). Kept on loopback so the operator can XINFO from the VDS without exposing the sim stream to the tailnet. |
| Telemetry sim publisher | dataland-telemetry-sim | | | compose.sim.yml. Emits museum:telemetry shaped like the prod RDC → museum bridge into dataland-redis-sim, and redirects the notification worker/api to the sim redis. |

Isolate playback from live telemetry

The compose.sim.yml overlay exists precisely so synthetic events land on a dedicated dataland-redis-sim, never on the live dataland-redis. Real visitors are not disturbed. Tear down with scripts/stop-simulator.sh.

Port allocation reference

All internally allocated host ports live in the 414x/415x range, plus a few upstream-conventional ports. The canonical values live in .env.example under "Public host ports".

| Port | Service | Source var |
|---|---|---|
| 3000 | Grafana | GRAFANA_PUBLIC_PORT |
| 4141 | Agent | AGENT_PUBLIC_PORT |
| 4143 | RAG | RAG_PUBLIC_PORT |
| 4144 | Museum API + dashboard | MUSEUM_PUBLIC_PORT |
| 4145 | Redis | REDIS_PUBLIC_PORT |
| 4146 | Qdrant HTTP | QDRANT_HTTP_PORT |
| 4147 | Qdrant gRPC | QDRANT_GRPC_PUBLIC_PORT |
| 4148 | Docs | DOCS_PUBLIC_PORT |
| 4149 | Sim redis (dev) | hard-coded (compose.sim.yml) |
| 4152 | Information WebUI | INFORMATION_WEBUI_PUBLIC_PORT |
| 5432 | Postgres | POSTGRES_PORT |
| 8080 | Notification API | NOTIFICATION_PUBLIC_PORT |
| 9000 | Auth (JWKS) | AUTH_PUBLIC_PORT |
| 9090 | Prometheus | PROMETHEUS_PUBLIC_PORT |
| 9093 | Alertmanager | ALERTMANAGER_PUBLIC_PORT |
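
Because every host port must be unique, a quick duplicate check is useful when adding a new service. The sketch below assumes the port variables appear as plain NAME=NUMBER lines (the usual env-file shape); the function name duplicate_ports is hypothetical, and the sample input is a subset of the table above rather than the real .env.example.

```shell
# Hypothetical helper: read NAME=PORT lines on stdin and print every port
# number assigned to more than one variable. Prints nothing if all are unique.
duplicate_ports() {
  grep -E '^[A-Z_]+=[0-9]+$' | cut -d= -f2 | sort -n | uniq -d
}

# Example input taken from the allocation table; it has no duplicates.
duplicate_ports <<'EOF'
AGENT_PUBLIC_PORT=4141
RAG_PUBLIC_PORT=4143
MUSEUM_PUBLIC_PORT=4144
REDIS_PUBLIC_PORT=4145
QDRANT_HTTP_PORT=4146
QDRANT_GRPC_PUBLIC_PORT=4147
DOCS_PUBLIC_PORT=4148
INFORMATION_WEBUI_PUBLIC_PORT=4152
EOF
```

Against a real file the call would look like duplicate_ports < .env.example; any printed number is a collision to resolve before deploying.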

Health-check cheat sheet

Run on the VDS (everything resolves on localhost there). Off-host, open the SSH tunnel for the loopback-only services first, or hit the tailnet IP (100.124.170.43) for the dual-bound ones.

# Public-path + loopback health on the host:
for url in \
  http://localhost:4141/health \
  http://localhost:4143/health \
  http://localhost:4144/health \
  http://localhost:4152/health \
  http://localhost:8080/health \
  http://localhost:9000/.well-known/jwks.json \
  http://localhost:4148/healthz \
  http://localhost:9090/-/healthy \
  http://localhost:9093/-/healthy \
  http://localhost:3000/api/health
do
  printf '%-55s ' "$url"
  curl -fsS -o /dev/null -w '%{http_code}\n' "$url" || echo down
done

-f makes curl exit non-zero on HTTP 4xx/5xx so the || echo down branch fires; -sS silences the progress meter but keeps real errors; -o /dev/null discards the body and -w '%{http_code}\n' prints only the status code. On a failed request curl still prints the status code (000 when no connection was made), followed by down on the next line. The probe set covers the host-published HTTP application and monitoring services; the data-plane ports (Postgres, Redis, Qdrant) are not probed. The auth check hits /.well-known/jwks.json (the JWKS endpoint, not a /health route), and docs uses /healthz because it is nginx behind basic auth (see the note below), not a FastAPI /health.
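
For the off-host case, the dual-bound HTTP services can be probed on the tailnet IP with the same ports and paths. The helper below is a minimal sketch that only generates the URL list; the name dual_bound_health_urls is hypothetical, and it assumes the RAG, Notification API and Docs health paths behave on the tailnet binding as they do on localhost.

```shell
# Hypothetical helper: print the health URLs of the dual-bound HTTP services
# (RAG 4143, Notification API 8080, Docs 4148) for a given host or IP.
dual_bound_health_urls() (
  for target in 4143/health 8080/health 4148/healthz; do
    echo "http://$1:$target"
  done
)

# Example: URLs to probe from a Tailscale peer.
dual_bound_health_urls 100.124.170.43
```

Each printed URL can be fed to the same curl -fsS -o /dev/null -w '%{http_code}\n' probe used on the host.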

Why /healthz (not /health) for docs

The docs container is nginx, not a FastAPI app. nginx returns 200 ok at location = /healthz with auth_basic off, so the health probe (and the Compose wget --spider healthcheck) bypasses the basic-auth gate. Every other path requires the DOCS_USERNAME / DOCS_PASSWORD credential.

  • Deploy: deploy.sh, image pinning, Cloudflare ingress, smoke tests.
  • Observability: Prometheus scrape map, Grafana dashboards, Alertmanager.
  • Service hosting & relocation: what leaves the host and what stays on dataland-network.
  • Auth: JWKS, RS256 validation, the CMS signing-key mirror.