Dataland Architecture
Welcome. This is the operator + contributor map of the Dataland backend stack: the set of microservices that powers the museum guide app, the live visitor experience, and the curator tooling behind Refik Anadol Studio's Dataland museum.
If you have just landed on the project, read Architecture for the network topology and data flows, then the Services reference for the per-service detail. If you came here because something broke, jump straight to Deploy and Observability.
Recent changes (2026-06-03 → 2026-06-04)
A large change-set landed across every repo. See the What changed this cycle section below for the full DAT-referenced summary. Highlights: the whole stack standardized on gemini-3.5-flash (DAT-269), the first museum message now produces an instant zero-LLM welcome (DAT-296), silent visitor-complaint detection went live (DAT-213), the local auth service now mirrors the CMS signing key to kill the chat-auth single point of failure (DAT-286), and deploy.sh now fails fast on placeholder prod secrets (DAT-291).
What Dataland is
Dataland is a physical AI-art museum. Visitors wear biosensor bands; sensor data flows from the RDC (Refik Anadol data center) into this stack, where it drives a live AI guide and a context-aware notification engine. The backend has four jobs:
- Talk to visitors. An AI guide (agent) answers questions over SSE, grounded in live room/vitals telemetry and a retrieval corpus of artwork and museum knowledge.
- Ingest the live experience. The museum bridge reads the external RDC redis (the wearable/sensor source of truth) and normalizes it into an internal telemetry stream.
- React to the visit. The notification engine consumes that telemetry and fires OneSignal pushes (welcome, room-transition tips, session-flow nudges, checkout) plus internal ops alerts.
- Keep the corpus current. Curators use the Catalog Studio (information-webui) to manage the artwork + museum catalog, which live-syncs into the RAG vector store.
What runs where
The whole stack runs on a single Linux host: the Spark DGX VDS (tailnet IP 100.124.170.43, repos under /home/cobanov/DATALAND/), under Docker Compose. The dataland-infrastructure repo is the orchestrator. It owns compose.yml, the deploy script, the monitoring stack, and this docs site itself.
The unified compose stack brings up three rings plus the docs site:
| Layer | Services |
|---|---|
| Data plane | postgres, redis, qdrant |
| App services | auth, rag, museum-api, agent, notification-worker, notification-api, information-webui |
| Docs | docs (this site) |
| Observability | prometheus, grafana, alertmanager, postgres-exporter, redis-exporter, cadvisor, node-exporter |
A dev-only overlay (compose.dev.yml) and the simulator profile add museum-simulator for synthetic telemetry. The simulator is not part of production. In production, museum-api itself bridges from the external RDC redis.
Naming
The compose name: key is dataland, so containers are dataland-agent, dataland-rag, dataland-postgres, and so on. Internal callers use those container DNS names over the Docker network (http://dataland-rag:4143), never host-published ports. See Architecture.
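A minimal sketch of that naming rule, assuming a hypothetical helper named internal_url. The container names and internal ports are copied from the service map on this page; the helper, the dict, and the /health path are illustrative and do not come from any Dataland repo.

```python
# Hypothetical lookup table: service -> (container DNS name, internal port).
# Values are copied from the service map on this page.
INTERNAL_SERVICES = {
    "agent": ("dataland-agent", 4141),
    "museum-api": ("dataland-museum", 5001),
    "rag": ("dataland-rag", 4143),
    "auth": ("dataland-auth", 9000),
}


def internal_url(service: str, path: str = "") -> str:
    """Return the URL an in-network caller should use for `service`."""
    container, port = INTERNAL_SERVICES[service]
    return f"http://{container}:{port}{path}"


print(internal_url("rag"))
print(internal_url("museum-api", "/health"))  # "/health" is a made-up path
```

Note that museum-api resolves to port 5001 here, not its published port 4144: in-network callers always target the internal port.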
Service map at a glance
| Service | Container | Role | Public port | Internal port |
|---|---|---|---|---|
| Agent | dataland-agent | AI guide. Museum + general chat over SSE, conversation history, service + ops endpoints | 4141 (0.0.0.0) | 4141 |
| Museum API | dataland-museum | RDC redis bridge, telemetry normalizer, chapters catalog + dashboard | 4144 (0.0.0.0) | 5001 |
| RAG | dataland-rag | Retrieval. Qdrant search + ingest, Gemini captioning + embeddings | 4143 (loopback / tailnet) | 4143 |
| Information WebUI | dataland-atlas | Catalog Studio CMS. Projects + Museum workspaces, RAG live-sync | 4152 (0.0.0.0) | 4152 |
| Notification worker | dataland-notification-worker | Telemetry rules engine → OneSignal pushes + ops alerts | none (no HTTP) | none |
| Notification API | dataland-notification-api | DLQ + replay + state inspector ops surface | 8080 (loopback / tailnet) | 8080 |
| Auth | dataland-auth | JWKS provider for RS256 JWT chat auth | 9000 (0.0.0.0) | 9000 |
| Redis | dataland-redis | museum:telemetry stream, ticket state, dedup keys (requirepass) | 4145 (loopback / tailnet) | 6379 |
| Postgres | dataland-postgres | Agent + auth databases | 5432 (loopback / tailnet) | 5432 |
| Qdrant | dataland-qdrant | Vector store: knowledge, images, scenes | 4146/4147 (loopback / tailnet) | 6333/6334 |
| Docs | dataland-docs | This site (MkDocs Material + nginx basic auth) | 4148 (loopback / tailnet) | 80 |
For the authoritative port + bind matrix, see Public ports.
The 30-second mental model
flowchart LR
subgraph external["External"]
Mobile["Mobile app<br/>(visitor)"]
RDC[("RDC Redis<br/>wearable/sensor<br/>source of truth")]
OneSignal["OneSignal"]
GCS[("GCS<br/>public + private")]
Gemini["Gemini API<br/>(gemini-3.5-flash<br/>+ gemini-embedding)"]
Curator["Curator"]
end
subgraph dataland["Dataland host (Spark, 100.124.170.43)"]
Agent["agent :4141"]
Museum["museum-api :5001"]
RAG["rag :4143"]
WebUI["information-webui :4152"]
NotifW["notification-worker"]
NotifA["notification-api :8080"]
Auth["auth :9000"]
Redis[("redis :6379")]
PG[("postgres :5432")]
QD[("qdrant :6333")]
end
Mobile -->|SSE chat| Agent
Agent -->|JWT verify| Auth
Agent --> RAG
Agent --> Museum
Agent --> PG
Agent --> Redis
Agent -->|complaint detect| NotifA
Museum -->|PSUBSCRIBE| RDC
Museum -->|XADD museum:telemetry| Redis
NotifW -->|XREADGROUP| Redis
NotifW -->|service call| Agent
NotifW --> OneSignal
RAG --> QD
RAG --> GCS
RAG --> Gemini
Curator --> WebUI
WebUI --> RAG
WebUI --> GCS
Two flows do most of the work:
- Telemetry → reaction. museum-api PSUBSCRIBEs the external RDC redis, normalizes each event, and XADDs it to the internal museum:telemetry stream on dataland-redis. The notification worker XREADGROUPs that stream, evaluates rules, and fans out OneSignal pushes (and ops alerts). See the telemetry sequence.
- Chat. The mobile app POSTs to the agent with a Bearer JWT; the agent verifies it against the auth JWKS, calls its tools (get_visitor_vitals, get_room_info, get_scene_flow, search_knowledge, search_artwork_images) against museum-api and RAG, and streams tokens back over SSE. See the chat sequence.
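The telemetry path can be sketched without a Redis server. The block below is an in-memory stand-in, not the museum-api or notification-worker code: FakeStreams only mimics the append and read-new-entries behaviour of XADD and XREADGROUP, and the payload keys (ticket_id, room, heart_rate), the channel name, the consumer group name, and the single toy rule are all assumptions made for illustration.

```python
import json
from collections import defaultdict

STREAM = "museum:telemetry"  # stream name from this page


class FakeStreams:
    """In-memory stand-in for the two stream operations in the flow."""

    def __init__(self):
        self.entries = defaultdict(list)  # stream -> [(entry_id, fields)]
        self.cursor = defaultdict(int)    # (stream, group) -> next unread index

    def xadd(self, stream, fields):
        entry_id = f"{len(self.entries[stream]) + 1}-0"
        self.entries[stream].append((entry_id, fields))
        return entry_id

    def xreadgroup(self, group, stream, count=10):
        # Hand each group only the entries it has not been given yet.
        start = self.cursor[(stream, group)]
        batch = self.entries[stream][start:start + count]
        self.cursor[(stream, group)] = start + len(batch)
        return batch


def normalize(channel, raw):
    """Bridge step: flatten one pub/sub message into string stream fields."""
    payload = json.loads(raw)
    return {
        "source_channel": channel,
        "ticket_id": str(payload["ticket_id"]),
        "room": str(payload.get("room", "")),
        "heart_rate": str(payload.get("heart_rate", "")),
    }


def evaluate_rules(fields, last_room):
    """Worker step: one toy rule that fires when a ticket changes room."""
    ticket, room = fields["ticket_id"], fields["room"]
    previous = last_room.get(ticket)
    pushes = []
    if room and previous is not None and previous != room:
        pushes.append(("room_transition", ticket, room))
    if room:
        last_room[ticket] = room
    return pushes


streams = FakeStreams()
for raw in ('{"ticket_id": "T1", "room": "ON"}', '{"ticket_id": "T1", "room": "GA"}'):
    streams.xadd(STREAM, normalize("rdc:visitors", raw))  # hypothetical channel

seen_rooms = {}
for entry_id, fields in streams.xreadgroup("notification-worker", STREAM):
    for push in evaluate_rules(fields, seen_rooms):
        print(entry_id, push)
```

The point of the stream in the middle is decoupling: the bridge only appends, and each consumer group keeps its own read position, so a slow or restarted worker does not block ingestion.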
Edge + access
flowchart LR
Internet(("Internet")) --> CF["Cloudflare Tunnel<br/>(host systemd, token mode)"]
Tailnet(("Tailscale tailnet")) -.->|*_PUBLIC_BIND| Host
CF --> Host["Spark host<br/>published ports"]
Host --> Agent["agent → dataland.chat"]
Host --> Museum["museum-api → museum.dataland.chat"]
Host --> WebUI["information-webui → data.dataland.chat"]
Host --> Docs["docs → docs.dataland.chat"]
- Cloudflare is the only public ingress. cloudflared runs as a host systemd service and routes the public *.dataland.chat hostnames to the right local port. There is no nginx/Traefik in front of the stack. TLS terminates at Cloudflare.
- Tailnet is the second access path. Stateful + internal services publish on 127.0.0.1 and on a *_PUBLIC_BIND host IP (default 100.124.170.43, the tailnet interface) so tailnet peers reach them directly without an SSH tunnel. These services never bind 0.0.0.0 (DAT-73). See Public ports.
- This docs site is private. Cloudflare routes docs.dataland.chat to the docs container; nginx inside enforces a single shared-password gate (HTTP basic auth, DOCS_USERNAME / DOCS_PASSWORD from the host .env). Same posture as the museum and Catalog Studio dashboards. Ask an admin for credentials, or hit http://127.0.0.1:4148 on the host (still password-gated). The docs service itself shipped this cycle (DAT-291).
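As a rough illustration of the loopback + tailnet publishing rule, the sketch below builds Compose short-syntax port mappings (HOST_IP:HOST_PORT:CONTAINER_PORT) and refuses 0.0.0.0. The function name publish_specs is hypothetical and the real compose.yml may express this differently; only the default bind IP and the Redis ports are taken from this page.

```python
import ipaddress

DEFAULT_PUBLIC_BIND = "100.124.170.43"  # tailnet IP from this page


def publish_specs(host_port, container_port, public_bind=DEFAULT_PUBLIC_BIND):
    """Return the two port mappings for a stateful or internal service."""
    bind = ipaddress.IPv4Address(public_bind)  # raises ValueError on a bad IP
    if bind.is_unspecified:  # True for 0.0.0.0
        raise ValueError("internal services must not publish on 0.0.0.0")
    return [
        f"127.0.0.1:{host_port}:{container_port}",
        f"{bind}:{host_port}:{container_port}",
    ]


# Redis: host port 4145 maps to container port 6379 (from the service map).
print(publish_specs(4145, 6379))
```

Validating the bind address up front means a misconfigured environment fails loudly instead of silently exposing a data-plane port on every interface.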
How to navigate these docs
| You want to… | Go to |
|---|---|
| Understand the topology + data flows | Architecture |
| Look up what a service does, its ports, env, and auth | Services → the per-service page |
| Find which port / Cloudflare hostname maps where | Public ports |
| Deploy, rebuild, or roll back | Deploy |
| Debug something slow, or read metrics/alerts | Observability |
| Re-home a service to another host | Service hosting & relocation |
| Plan the Spark ↔ GCP move | Migration plan |
Quick service jumps: Agent · Museum · RAG · Information WebUI · Notification · Auth · Redis · Postgres · Qdrant.
Models
The stack is standardized on Google Gemini:
| Use | Model | Where |
|---|---|---|
| Chat (museum + general) | gemini-3.5-flash | agent (agent_model = google-gla:gemini-3.5-flash) |
| Image captioning at ingest | gemini-3.5-flash | rag (gemini_model) |
| RAG vector embeddings | gemini-embedding-2 | rag (embedding_model) |
Single model, single source of truth
Everything chat- and caption-facing runs gemini-3.5-flash after DAT-269. Code, .env.example (GEMINI_MODEL=gemini-3.5-flash), and the live .env are aligned. Older gemini-2.5-flash / gemini-3.1-flash-lite references are deprecated and should not appear in docs or config.
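One way to keep stale references out of docs and config is a small scan for the two deprecated model names listed above. This is a sketch only: find_deprecated_models is a hypothetical helper, not an existing check in any repo, and the sample env text is invented.

```python
import re

# Deprecated names listed on this page.
DEPRECATED_MODELS = ("gemini-2.5-flash", "gemini-3.1-flash-lite")


def find_deprecated_models(text):
    """Return (line_number, model) for each deprecated model name in `text`."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for model in DEPRECATED_MODELS:
            # Reject matches embedded in a longer identifier.
            pattern = rf"(?<![\w-]){re.escape(model)}(?![\w-])"
            if re.search(pattern, line):
                hits.append((number, model))
    return hits


sample_env = "GEMINI_MODEL=gemini-3.5-flash\nOLD_MODEL=gemini-2.5-flash\n"
print(find_deprecated_models(sample_env))
```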
What changed this cycle (2026-06-03/04)
This was a heavy cycle touching the agent, RAG, museum, notification, auth, and infra repos. Grouped by area, with Linear DAT references.
Models + chat experience
- DAT-269: Standardized on gemini-3.5-flash across agent, RAG, and infra. Code, .env.example, and the live .env aligned.
- DAT-296: Instant personalized welcome. An empty first /v1/chat/museum message now returns a personalized welcome without an LLM call and triggers a welcome push, ticket-deduped against the RDC visit_started event. The /register and /current endpoints were removed: the first /museum message lazily registers the ticket, and conversation_id == ticket_id.
- DAT-284: Follow-up suggestions restored. Post-stream follow-up suggestions now reappear when a conversation is reloaded.
- DAT-281: Real room names. The agent speaks the public gallery names, never bare codes: Data Pavilion (GA), Latent Gallery (GB), Infinity Room (GC), The Sanctuary (GD), and the Discovery Portal entrance (ON).
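The DAT-296 behaviour can be sketched as a first-message branch. Everything named here is hypothetical (ChatState, handle_museum_message, the welcome wording, the call_llm callback); the sketch only illustrates the three properties described above: lazy ticket registration, a templated welcome with no LLM call, and a welcome push that fires at most once per ticket.

```python
from dataclasses import dataclass, field


@dataclass
class ChatState:
    """In-memory stand-in for ticket storage and the welcome dedup key."""
    tickets: set = field(default_factory=set)
    welcomed: set = field(default_factory=set)
    pushes: list = field(default_factory=list)


def send_welcome_push_once(state, ticket_id):
    """Record the welcome push unless this ticket was already welcomed."""
    if ticket_id in state.welcomed:
        return False
    state.welcomed.add(ticket_id)
    state.pushes.append(("welcome", ticket_id))
    return True


def handle_museum_message(state, ticket_id, message, visitor_name, call_llm):
    """Sketch of the first-message path for a museum chat request."""
    first_message = ticket_id not in state.tickets
    state.tickets.add(ticket_id)  # lazy registration, no separate register step
    if first_message and not message.strip():
        send_welcome_push_once(state, ticket_id)
        reply = f"Welcome to Dataland, {visitor_name}!"  # template, no LLM call
    else:
        reply = call_llm(message)
    return {"conversation_id": ticket_id, "reply": reply}


state = ChatState()
print(handle_museum_message(state, "T1", "", "Ada", call_llm=lambda m: "unused"))
```

Sharing send_welcome_push_once between the chat path and the visit_started handler is what would make the push ticket-deduped: whichever arrives first wins, and the second becomes a no-op.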
Agent tools + reliability
- DAT-279 / 280 / 285 / 261: get_room_info image cap + dedup, an empty-room "flail" fix, physiological sanity bounds on vitals, and the new get_scene_flow tool.
- RAG client timeout: the agent's RAG /search read timeout was raised 10s → 25s. Museum-knowledge queries were tripping a 10s timeout → 3x retry → agent_timeout.
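To make two of those tool guards concrete, here is a small sketch. The helper names, the cap of 4 images, and the 30 to 220 bpm range are placeholders chosen for illustration; the actual limits live in the agent repo and are not stated on this page.

```python
def dedup_and_cap(image_urls, cap=4):
    """Drop repeated URLs (keeping first-seen order), then keep at most `cap`."""
    unique = list(dict.fromkeys(image_urls))
    return unique[:cap]


def sane_heart_rate(bpm, low=30, high=220):
    """Return `bpm` if it is inside the assumed plausible range, else None."""
    if bpm is None or not (low <= bpm <= high):
        return None
    return bpm


print(dedup_and_cap(["a.jpg", "b.jpg", "a.jpg", "c.jpg"], cap=2))
print(sane_heart_rate(72), sane_heart_rate(0))
```

Returning None for an out-of-range reading lets the caller treat it as "no data" rather than passing a sensor glitch to the model as fact.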
Notifications
- DAT-213: Silent visitor-complaint detection. A server-side LLM judge runs off the response path, invisible to the visitor, and raises an ops ticket on detected dissatisfaction via POST /v1/ops/complaint. The ops notifier is swappable (Discord/Slack) with comma-separated multi-provider fan-out (OPS_NOTIFIER_PROVIDER); the ticket carries visitor identity, with per-session dedup + an audit log.
- DAT-287 / 289 / 290 / 293 / 282: notification gating (no content pushes until a visitor reaches Gallery A; the welcome push is exempt), Gallery-B session-flow pushes, room-transition pushes that bypass the 240s cooldown, a visit_ended / checkout push, RDC plain-UTF-8 decode for Q-SYS/audio fields, and a resolver that falls back to user_id when external_id is absent.
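The gating rules above compose roughly as follows. This is a sketch under stated assumptions, not the worker's rule engine: the push kind strings are invented, the welcome push is assumed to skip the cooldown as well as the Gallery A gate, and room-transition pushes are assumed to remain subject to the gate. Only the 240s cooldown, the resolver fallback, and the comma-separated OPS_NOTIFIER_PROVIDER format come from this page.

```python
COOLDOWN_SECONDS = 240  # cooldown from this page


def should_send(kind, reached_gallery_a, seconds_since_last_push):
    """Decide whether a push of `kind` may go out right now."""
    if kind == "welcome":
        return True  # exempt from the Gallery A gate
    if not reached_gallery_a:
        return False  # no content pushes before Gallery A
    if kind == "room_transition":
        return True  # bypasses the cooldown
    if seconds_since_last_push is None:
        return True  # nothing sent yet
    return seconds_since_last_push >= COOLDOWN_SECONDS


def resolve_recipient(visitor):
    """Prefer external_id; fall back to user_id when it is absent or empty."""
    return visitor.get("external_id") or visitor.get("user_id")


def parse_providers(value):
    """Split a comma-separated OPS_NOTIFIER_PROVIDER value into provider names."""
    return [name.strip().lower() for name in value.split(",") if name.strip()]


print(should_send("session_flow", True, 100))  # inside the cooldown window
print(resolve_recipient({"user_id": "u1"}))
print(parse_providers("discord, slack"))
```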
Content + retrieval
- Museum corpus re-ingest: the 20 museum sections + scenes + overview were (re-)ingested into the Qdrant knowledge collection (point count 4839 → 4969).
- DAT-288: Placeholder images purged. default_reference_* placeholder images were removed from chapters.json and the GCS cobanov-public/chapters bucket. (These were never in Qdrant.)
Security + ops
- DAT-286: Auth single-point-of-failure removed. The local dataland-auth service now mirrors the CMS signing key (kid dataland-rs256-1) via data/extra_jwks.json in the auth-data volume, so chat auth no longer depends on a single remote JWKS. The agent WARNs when a fallback JWKS provider is the sole validator. Re-run the mirroring after a volume wipe or CMS key rotation.
- DAT-291: Safer deploys + tailnet publishing. deploy.sh now runs the agent's real boot guard (assert_boot_required_env) against the new .env before rebuilding, so placeholder/default prod secrets abort the deploy instead of crash-looping chat. This cycle also added tailnet *_PUBLIC_BIND publishing and the docs.dataland.chat service, plus a prod secret rotation.
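The key-mirroring idea in DAT-286 amounts to serving a merged key set. The sketch below assumes that data/extra_jwks.json uses the standard JWKS shape, an object with a keys array whose entries carry a kid; the function names are hypothetical and the key material is omitted.

```python
def merge_jwks(primary, extra):
    """Combine two JWKS documents, keeping the first key seen for each kid."""
    merged = {}
    for jwks in (primary, extra):
        for key in jwks.get("keys", []):
            merged.setdefault(key["kid"], key)
    return {"keys": list(merged.values())}


def find_key(jwks, kid):
    """Return the key with the given kid, or None if it is not present."""
    return next((key for key in jwks["keys"] if key["kid"] == kid), None)


# Remote JWKS unavailable (empty), mirrored key still resolves.
mirrored = {"keys": [{"kty": "RSA", "kid": "dataland-rs256-1"}]}
served = merge_jwks({"keys": []}, mirrored)
print(find_key(served, "dataland-rs256-1") is not None)
```

Because the mirror is a copy, it goes stale when the CMS rotates its key, which is why the mirroring has to be repeated after a rotation.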
Deploy gotcha
Because of DAT-291, deploy.sh will abort if the prod .env still holds placeholder secrets, before any rebuild. If a deploy stops at "ABORT: .env failed the agent boot guard", fix the flagged secrets and re-run. See Deploy.
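For intuition, a pre-deploy check of this kind might look like the sketch below. It is not the agent's assert_boot_required_env: the required key names and placeholder markers are invented for illustration, and the list of failing keys appended to the abort message is an assumption.

```python
# Hypothetical values; the real guard defines its own keys and markers.
REQUIRED_KEYS = ("GEMINI_API_KEY", "REDIS_PASSWORD")
PLACEHOLDER_MARKERS = ("changeme", "placeholder", "example")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def failing_secrets(env):
    """Return required keys that are missing, empty, or look like placeholders."""
    bad = []
    for key in REQUIRED_KEYS:
        value = env.get(key, "")
        if not value or any(marker in value.lower() for marker in PLACEHOLDER_MARKERS):
            bad.append(key)
    return bad


def preflight(text):
    """Abort before any rebuild if the env text fails the check."""
    bad = failing_secrets(parse_env(text))
    if bad:
        raise SystemExit("ABORT: .env failed the agent boot guard: " + ", ".join(bad))


sample = "GEMINI_API_KEY=changeme\nREDIS_PASSWORD=s3cr3t-value\n"
print(failing_secrets(parse_env(sample)))
```

Running the check before the rebuild is the design point: a bad secret stops the deploy while the old containers are still serving, instead of surfacing as a crash loop afterwards.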