Public ports & bind policy
Every Dataland service runs as a Docker Compose container on the Spark DGX VDS
(ege@100.124.170.43). Internal service-to-service traffic stays on the
dataland-network Docker bridge and uses container DNS names
(http://dataland-agent:4141, http://dataland-rag:4143,
dataland-redis:6379, …), never host-published ports. Only the host port
mappings described below are reachable from outside the containers, and only a
small public subset leaves the host at all.
The one rule: never 0.0.0.0 for data-plane or ops services
Binding a stateful or unauthenticated service to 0.0.0.0 republishes it on
the host's public Spectrum IP, not just the tailnet. That regression once
exposed Postgres (5432) and Qdrant (no API key) to the open internet
(DAT-73). The fix: bind those services to 127.0.0.1 plus the tailnet IP,
explicitly, and let deploy.sh fail fast on placeholder secrets (DAT-291).
Recent changes (DAT-73 / DAT-291)
- *_PUBLIC_BIND variables let each data-plane and ops service publish on two host IPs: 127.0.0.1 (local host + SSH-tunnel workflow) and the host's tailnet IP (direct access from Tailscale peers). The dual-bind variables default to the Spark's 100.124.170.43; never set them to 0.0.0.0, and never to 127.0.0.1 (it would collide with the loopback line already present in compose.yml).
- deploy.sh now runs the agent's real boot guard against the prod .env before rebuilding, aborting on placeholder/default secrets so a crash-loop deploy can't take chat offline.
- The docs.dataland.chat service (this site) is dual-bound (loopback for cloudflared + tailnet for peers) and Cloudflare-fronted.
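The bind rule for dual-bound services can be checked mechanically before a deploy. The following is a minimal sketch, not the guard that deploy.sh actually runs: the function name `check_dual_bind` is hypothetical, and it only encodes the two forbidden values named above (the wildcard and the loopback address).

```shell
# Hypothetical pre-deploy check for a dual-bind *_PUBLIC_BIND value.
# Returns 0 when the value is acceptable, 1 otherwise.
check_dual_bind() {
  case "$1" in
    "")
      echo "empty bind value" >&2; return 1 ;;
    0.0.0.0|::|"*")
      echo "wildcard bind would publish on the public IP" >&2; return 1 ;;
    127.0.0.1|localhost)
      echo "loopback collides with the existing 127.0.0.1 line" >&2; return 1 ;;
    *)
      return 0 ;;
  esac
}

# Example: validate the documented default before deploying.
if check_dual_bind "${POSTGRES_PUBLIC_BIND:-100.124.170.43}"; then
  echo "POSTGRES_PUBLIC_BIND ok"
fi
```

The loopback rejection applies only to the dual-bind variables; the observability bind variables legitimately default to 127.0.0.1 because those services have a single host mapping.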
Bind policy at a glance
flowchart LR
Internet([Public internet]) -->|HTTPS| CF[Cloudflare Tunnel<br/>cloudflared on host systemd]
Peers([Tailscale peers]) -->|tailnet 100.124.170.43| TN
Op([Operator workstation]) -->|SSH -L tunnel| LB
subgraph Host["Spark VDS — host network namespace"]
CF -->|127.0.0.1| LB[127.0.0.1 loopback binds]
TN[Tailnet-IP binds<br/>*_PUBLIC_BIND]
LB --> DKR
TN --> DKR
end
DKR[(Docker bridge<br/>dataland-network)]
Three publishing patterns are used in compose.yml:
| Pattern | Host binding(s) | Used by | Why |
|---|---|---|---|
| Cloudflare-fronted | 0.0.0.0 (single mapping), intended to be reached only through cloudflared | Agent, Museum API/dashboard, Information WebUI, Auth (JWKS) | Public hostnames terminate TLS at Cloudflare; the app enforces its own auth (JWT / shared-password / API token). |
| Dual-bind (loopback + tailnet) | 127.0.0.1 and ${X_PUBLIC_BIND} | Postgres, Qdrant (HTTP+gRPC), Redis, RAG, Notification API, Docs | Stateful or weakly/un-authenticated: the tailnet is the trust boundary. Two explicit lines, never 0.0.0.0. |
| Loopback-only / no host port | 127.0.0.1 only, or no ports: at all | Prometheus / Grafana / Alertmanager (default), exporters, cAdvisor, node-exporter, sim redis | Operator-only or scrape-only; the exporters publish no host port and are read by Prometheus over the Docker network. |
Cloudflare-fronted ≠ 0.0.0.0 is safe
The agent, museum, webui and auth containers publish with a single
${X_PUBLIC_PORT}:container mapping, which Docker binds on 0.0.0.0. That
is acceptable only because each enforces application-layer auth
(RS256 JWT, shared-password session, API token) and the intended path is the
Cloudflare tunnel. Do not copy this pattern to a service without auth — that
is exactly the data-plane mistake DAT-73 fixed.
Application services
| Service | Container | Host port (var) | Container port | Bind policy | Cloudflare hostname |
|---|---|---|---|---|---|
| Agent | dataland-agent | 4141 (AGENT_PUBLIC_PORT) | 4141 | 0.0.0.0 (Cloudflare-fronted; RS256 JWT via JWKS) | dataland.chat |
| Museum API + dashboard | dataland-museum | 4144 (MUSEUM_PUBLIC_PORT) | 5001 | 0.0.0.0 (Cloudflare-fronted; dashboard shared-password gate) | museum dashboard hostname |
| Information WebUI ("Catalog Studio") | dataland-atlas | 4152 (INFORMATION_WEBUI_PUBLIC_PORT) | 4152 | 0.0.0.0 (Cloudflare-fronted; shared-password gate) | data.dataland.chat |
| Auth (JWKS) | dataland-auth | 9000 (AUTH_PUBLIC_PORT) | 9000 | 0.0.0.0 (serves /.well-known/jwks.json) | none (internal / tunnel) |
| RAG | dataland-rag | 4143 (RAG_PUBLIC_PORT) | 4143 | dual-bind 127.0.0.1 + RAG_PUBLIC_BIND; X-API-Key auth | none (internal) |
| Notification API | dataland-notification-api | 8080 (NOTIFICATION_PUBLIC_PORT) | 8080 | dual-bind 127.0.0.1 + NOTIFICATION_PUBLIC_BIND; ops-token on writes | none (internal) |
| Docs (this site) | dataland-docs | 4148 (DOCS_PUBLIC_PORT) | 80 | dual-bind 127.0.0.1 + DOCS_PUBLIC_BIND; nginx basic auth | docs.dataland.chat |
Notification worker has no host port
dataland-notification-worker is a pure stream consumer of
museum:telemetry — it publishes nothing and exposes no host port. Only
dataland-notification-api (the DLQ + state inspector) is published, and only
on loopback + tailnet.
Cloudflare hostnames
| Hostname | Fronts | Container target (host) | App-layer gate |
|---|---|---|---|
| dataland.chat | Chat agent (SSE chat, conversations, service endpoints) | dataland-agent → 127.0.0.1:4141 | RS256 JWT (JWKS) |
| data.dataland.chat | Information WebUI (Catalog Studio CMS) | dataland-atlas → 127.0.0.1:4152 | Shared-password session + optional API token |
| docs.dataland.chat | This documentation site | dataland-docs → 127.0.0.1:4148 | nginx HTTP basic auth (DOCS_USERNAME / DOCS_PASSWORD) |
| museum dashboard | Live museum monitor UI | dataland-museum → 127.0.0.1:4144 | Shared-password session (MUSEUM_PASSWORD) |
How the tunnel reaches the container
cloudflared runs as a host systemd unit (not a compose service) and
ingresses each public hostname to localhost:<host-port> on the VDS. That is
why the dual-bound services (docs especially) keep their 127.0.0.1 line —
cloudflared needs the loopback path, and the tailnet line is a separate,
additive binding for direct peer browsing. TLS terminates at Cloudflare, so
basic-auth credentials never travel plaintext over the public path. Adding a
new public service means registering a route in the Zero Trust dashboard
(Tunnels → tunnel → Public Hostname → Add), as described in
Deploy.
Data plane (dual-bind: loopback + tailnet)
These are stateful or carry weak/no built-in auth, so they are published on
127.0.0.1 and their *_PUBLIC_BIND tailnet IP — two explicit lines in
compose.yml, never 0.0.0.0.
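To make the "two explicit lines" concrete, the sketch below prints what such a pair of `ports:` entries could look like in Compose short syntax (`IP:HOST_PORT:CONTAINER_PORT`). The helper name `dual_bind_ports` is made up for illustration, and the exact lines in the real compose.yml are an assumption; only the port numbers and the default tailnet IP come from this page.

```shell
# Hypothetical helper: print the two compose `ports:` entries for a
# dual-bound service. Arguments: tailnet bind IP, host port, container port.
dual_bind_ports() {
  bind_ip=$1 host_port=$2 container_port=$3
  if [ "$bind_ip" = "0.0.0.0" ]; then
    echo "refusing wildcard bind" >&2
    return 1
  fi
  # Line 1: loopback, used by local tools and SSH tunnels.
  printf '      - "127.0.0.1:%s:%s"\n' "$host_port" "$container_port"
  # Line 2: tailnet IP, used by Tailscale peers.
  printf '      - "%s:%s:%s"\n' "$bind_ip" "$host_port" "$container_port"
}

# Redis: host port 4145, container port 6379, default tailnet IP.
dual_bind_ports "${REDIS_PUBLIC_BIND:-100.124.170.43}" 4145 6379
```

Keeping the loopback entry as a literal and only the second entry variable-driven is one way to guarantee that a bad environment value can never remove the loopback path.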
| Service | Container | Loopback host port (var) | Container port | Tailnet bind var | Trust boundary |
|---|---|---|---|---|---|
| Postgres | dataland-postgres | 5432 (POSTGRES_PORT) | 5432 | POSTGRES_PUBLIC_BIND | Tailnet (password auth only) |
| Redis | dataland-redis | 4145 (REDIS_PUBLIC_PORT) | 6379 | REDIS_PUBLIC_BIND | Tailnet + --requirepass (REDIS_PASSWORD, DAT-76) |
| Qdrant HTTP | dataland-qdrant | 4146 (QDRANT_HTTP_PORT) | 6333 | QDRANT_PUBLIC_BIND | Tailnet (Qdrant runs with no API key) |
| Qdrant gRPC | dataland-qdrant | 4147 (QDRANT_GRPC_PUBLIC_PORT) | 6334 | QDRANT_PUBLIC_BIND | Tailnet (no API key) |
Qdrant has no auth — the tailnet IS the auth
QDRANT_API_KEY is empty in the deploy env. The only thing between an
attacker and the knowledge / images / scenes collections is the bind
policy. Setting QDRANT_PUBLIC_BIND=0.0.0.0 would publish a fully open vector
store on the public IP. Redis is safer (DAT-76 --requirepass) but the same
rule holds: tailnet, not world.
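Because the bind policy is the only guard here, it can be worth auditing what the host is actually listening on. The sketch below is one possible approach, assuming the output format of `ss -ltnH` (local address in the fourth column) and assuming Docker's published ports show up as ordinary listeners on the host; the function name `flag_wildcard_listeners` is hypothetical, and the guarded port list is copied from the tables on this page.

```shell
# Reads `ss -ltnH`-style lines on stdin and prints every listener on a
# data-plane or ops port whose local address is a wildcard.
flag_wildcard_listeners() {
  awk '
    BEGIN {
      n = split("5432 4145 4146 4147 4143 8080 4148 9090 9093 3000", p, " ")
      for (i = 1; i <= n; i++) guarded[p[i]] = 1
    }
    {
      local_addr = $4
      port = local_addr; sub(/.*:/, "", port)        # keep text after last colon
      host = local_addr; sub(/:[0-9]+$/, "", host)   # drop the trailing :port
      if ((host == "0.0.0.0" || host == "*" || host == "[::]") && (port in guarded))
        print "EXPOSED " local_addr
    }
  '
}

# Sample input standing in for real `ss` output.
flag_wildcard_listeners <<'EOF'
LISTEN 0 4096 127.0.0.1:5432 0.0.0.0:*
LISTEN 0 4096 0.0.0.0:4146 0.0.0.0:*
LISTEN 0 4096 0.0.0.0:4141 0.0.0.0:*
EOF
# prints: EXPOSED 0.0.0.0:4146
```

On the VDS the same function would be fed with `ss -ltnH | flag_wildcard_listeners`. Port 4141 is not flagged in the sample because the agent is deliberately Cloudflare-fronted on 0.0.0.0.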
Host binding does not affect service-to-service traffic
rag talks to qdrant, the agent talks to postgres and redis, etc., all
over the dataland-network Docker bridge. Changing or removing a host port
binding never breaks internal calls — those resolve container-to-container,
independent of the host publish lines.
Observability (loopback by default)
Prometheus, Grafana and Alertmanager default to 127.0.0.1 so their UIs are
invisible on the public Spectrum IP. Each has a *_PUBLIC_BIND you may set to
the tailnet IP for direct browser access without an SSH tunnel.
| Service | Container | Host port (var) | Bind var (default) | Auth |
|---|---|---|---|---|
| Prometheus | dataland-prometheus | 9090 (PROMETHEUS_PUBLIC_PORT) | PROMETHEUS_PUBLIC_BIND (127.0.0.1) | None (--web.enable-lifecycle is on) |
| Grafana | dataland-grafana | 3000 (GRAFANA_PUBLIC_PORT) | GRAFANA_PUBLIC_BIND (127.0.0.1) | Admin login (GRAFANA_ADMIN_USER / GRAFANA_ADMIN_PASSWORD, DAT-266) |
| Alertmanager | dataland-alertmanager | 9093 (ALERTMANAGER_PUBLIC_PORT) | ALERTMANAGER_PUBLIC_BIND (127.0.0.1) | None |
Never 0.0.0.0 for the monitoring UIs
Prometheus and Alertmanager ship with no authentication, and Grafana's
admin password is the only thing between the public internet and full
Prometheus data. For tailnet browser access set, e.g.,
PROMETHEUS_PUBLIC_BIND=100.124.170.43 in .env. Otherwise reach them via an
SSH tunnel:
The tunnel forwards all three loopback-only monitoring UIs (Prometheus 9090, Alertmanager 9093, Grafana 3000) over one SSH session to the same local ports. It is required because each service defaults to *_PUBLIC_BIND=127.0.0.1, so they are unreachable off-host unless you either tunnel or set the bind var to the tailnet IP.
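A tunnel of that shape can be assembled from standard `ssh -L` forwards. The following is a sketch rather than the project's own command: the helper names are hypothetical, the host string is the one given at the top of this page, and the script prints the command instead of running it so it can be reviewed first.

```shell
# Build one -L forward per monitoring port (same port on both ends).
tunnel_args() {
  for port in 9090 9093 3000; do
    printf '%s %s:127.0.0.1:%s ' -L "$port" "$port"
  done
}

# Print the full ssh invocation; -N means "forward only, no remote shell".
# The optional first argument overrides the assumed user@host.
monitoring_tunnel_cmd() {
  printf 'ssh -N %s%s\n' "$(tunnel_args)" "${1:-ege@100.124.170.43}"
}

monitoring_tunnel_cmd
```

With the tunnel open, the three UIs would be reachable on the workstation at localhost:9090, localhost:9093 and localhost:3000.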
Exporters & host-metrics profile (no host ports)
These publish no host ports — Prometheus scrapes them over the Docker
network by container name. cadvisor and node-exporter are gated behind the
host-metrics Compose profile (Linux-only; deploy.sh layers --profile
host-metrics, macOS dev boxes leave it off because they can't expose host
/proc + /sys).
| Container | Role | Profile |
|---|---|---|
| dataland-postgres-exporter | pg_stat_* snapshots | always on |
| dataland-redis-exporter | INFO, LATENCY, stream depth (auths via REDIS_PASSWORD) | always on |
| dataland-cadvisor | per-container CPU/mem/IO | host-metrics |
| dataland-node-exporter | host CPU steal, IO wait, disk pressure | host-metrics |
See Observability for the scrape map and dashboards.
Dev & simulator overlays
| Service | Container | Host port | Bind | Notes |
|---|---|---|---|---|
| Museum simulator | dataland-simulator | none | none | compose.dev.yml overlay or the simulator profile. Publishes synthetic museum:telemetry into dataland-redis. Production runs without it; the RDC bridge in museum-api feeds telemetry instead. |
| Sim redis | dataland-redis-sim | 4149 | 127.0.0.1 only | compose.sim.yml overlay. Separate volume + password (SIM_REDIS_PASSWORD). Kept on loopback so the operator can XINFO from the VDS without exposing the sim stream to the tailnet. |
| Telemetry sim publisher | dataland-telemetry-sim | none | none | compose.sim.yml. Emits museum:telemetry shaped like the prod RDC → museum bridge into dataland-redis-sim, redirecting the notification worker/API to the sim redis. |
Isolate playback from live telemetry
The compose.sim.yml overlay exists precisely so synthetic events land on a
dedicated dataland-redis-sim, never on the live dataland-redis. Real
visitors are not disturbed. Tear down with scripts/stop-simulator.sh.
Port allocation reference
All internally allocated host ports live in the 414x/415x range (4141 to 4152)
plus a few upstream-conventional ports. The canonical values live in
.env.example under "Public host ports".
| Port | Service | Source var |
|---|---|---|
| 4141 | Agent | AGENT_PUBLIC_PORT |
| 4143 | RAG | RAG_PUBLIC_PORT |
| 4144 | Museum API + dashboard | MUSEUM_PUBLIC_PORT |
| 4145 | Redis | REDIS_PUBLIC_PORT |
| 4146 | Qdrant HTTP | QDRANT_HTTP_PORT |
| 4147 | Qdrant gRPC | QDRANT_GRPC_PUBLIC_PORT |
| 4148 | Docs | DOCS_PUBLIC_PORT |
| 4149 | Sim redis (dev) | hard-coded (compose.sim.yml) |
| 4152 | Information WebUI | INFORMATION_WEBUI_PUBLIC_PORT |
| 5432 | Postgres | POSTGRES_PORT |
| 8080 | Notification API | NOTIFICATION_PUBLIC_PORT |
| 9000 | Auth (JWKS) | AUTH_PUBLIC_PORT |
| 9090 | Prometheus | PROMETHEUS_PUBLIC_PORT |
| 9093 | Alertmanager | ALERTMANAGER_PUBLIC_PORT |
| 3000 | Grafana | GRAFANA_PUBLIC_PORT |
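When a new port variable is added, a quick collision check against the existing allocation can catch mistakes early. The sketch below assumes the env file uses plain `KEY=VALUE` lines and that port variables end in `_PORT`; the function name `find_duplicate_ports` is hypothetical, and the sample input deliberately contains a collision that is not in the real allocation.

```shell
# Read KEY=VALUE lines on stdin and report any port number that is
# assigned to more than one *_PORT variable.
find_duplicate_ports() {
  awk -F= '
    $1 ~ /_PORT$/ && $2 ~ /^[0-9]+$/ {
      if ($2 in seen) print "duplicate port " $2 ": " seen[$2] " and " $1
      else seen[$2] = $1
    }
  '
}

# Deliberately colliding sample values, for illustration only.
find_duplicate_ports <<'EOF'
AGENT_PUBLIC_PORT=4141
RAG_PUBLIC_PORT=4143
REDIS_PUBLIC_PORT=4145
QDRANT_HTTP_PORT=4145
EOF
# prints: duplicate port 4145: REDIS_PUBLIC_PORT and QDRANT_HTTP_PORT
```

Against a real file the call would be `find_duplicate_ports < .env.example`; lines with comments or non-numeric values are simply skipped.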
Health-check cheat sheet
Run on the VDS (everything resolves on localhost there). Off-host, open the
SSH tunnel for the loopback-only services first, or hit the tailnet IP
(100.124.170.43) for the dual-bound ones.
# Public-path + loopback health on the host:
for url in \
http://localhost:4141/health \
http://localhost:4143/health \
http://localhost:4144/health \
http://localhost:4152/health \
http://localhost:8080/health \
http://localhost:9000/.well-known/jwks.json \
http://localhost:4148/healthz \
http://localhost:9090/-/healthy \
http://localhost:9093/-/healthy \
http://localhost:3000/api/health
do
printf '%-55s ' "$url"
curl -fsS -o /dev/null -w '%{http_code}\n' "$url" || echo down
done
In the curl call, -f makes curl exit non-zero on HTTP 4xx/5xx so the || echo down branch fires; -sS silences the progress meter but keeps real errors; -o /dev/null discards the body and -w '%{http_code}\n' prints only the status code. The probe set covers every host-published service: the auth check hits /.well-known/jwks.json (the JWKS endpoint, not a /health route), and docs uses /healthz because it is nginx behind basic auth (see the note below), not a FastAPI /health.
Why /healthz (not /health) for docs
The docs container is nginx, not a FastAPI app. nginx returns 200 ok at
location = /healthz with auth_basic off, so the health probe (and the
Compose wget --spider healthcheck) bypasses the basic-auth gate. Every
other path requires the DOCS_USERNAME / DOCS_PASSWORD credential.
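The split between the open /healthz route and the gated rest of the site can be verified with two status-code probes. This is an illustrative sketch: the helper name `expect_status` is made up, and the 401 expectation for the root path assumes nginx answers unauthenticated requests to a basic-auth location with its standard 401 response.

```shell
# Succeed only when the URL answers with the expected HTTP status code.
expect_status() {
  url=$1 want=$2
  # curl prints the status code; fall back to 000 if the request fails.
  got=$(curl -s -o /dev/null -w '%{http_code}' "$url") || got=000
  if [ "$got" = "$want" ]; then
    echo "ok   $url -> $got"
  else
    echo "FAIL $url -> $got (wanted $want)"
    return 1
  fi
}

# On the VDS one might run (not executed here):
#   expect_status http://localhost:4148/healthz 200   # auth_basic off
#   expect_status http://localhost:4148/ 401          # basic-auth gate
```

A 200 on the root path without credentials would indicate that the basic-auth gate is not being applied.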
Related pages
- Deploy: deploy.sh, image pinning, Cloudflare ingress, smoke tests.
- Observability: Prometheus scrape map, Grafana dashboards, Alertmanager.
- Service hosting & relocation: what leaves the host and what stays on dataland-network.
- Auth: JWKS, RS256 validation, the CMS signing-key mirror.