Redis¶
Dataland runs against two completely separate Redis servers, and keeping the distinction straight is the single most important thing to understand about the data plane:
dataland-redis (internal) |
RDC redis (external) | |
|---|---|---|
| Owner | Dataland | Refik Anadol Studio (RDC = Refik data center) |
| Role | Internal control plane: the museum:telemetry stream, ticket state, dedup keys, counters, the active-ticket mirror |
Source of truth for live wearable / sensor / scene data |
| Access | Read and write | Read-only from our side |
| Encoding | Redis-stream maps (UTF-8 strings) | Mixed: msgpack for most keys, plain UTF-8 for some (Q-SYS, audio flags) |
| Reached via | dataland-redis:6379 on the docker network; :4145 on loopback + tailnet |
RDC_REDIS_URL (required at museum-api startup) |
| Persistence | AOF (redis-data volume) |
Not our concern |
dataland-museum is the only service that talks to both: it reads the external RDC redis and writes a derived telemetry stream + active-ticket mirror into the internal dataland-redis. Everything else (agent, notification worker) only touches dataland-redis.
Recent changes
- DAT-76:
dataland-redisnow enforces--requirepass; every client passesREDIS_PASSWORD. Empty password collapses toNoneso unauthenticated local dev still works. - DAT-296: ticket-keyed welcome dedup key
notification:welcome_sent:<ticket>(atomicSET NX), shared by the mobile-init welcome and the RDCvisit_startedwelcome. - DAT-287 / 289 / 290 / 293: gallery-gated notifications,
visit_endedevents emitted from the active-ticket mirror loop, room-transition pushes bypass the cooldown. - DAT-282: the RDC reader decodes plain-UTF-8 values (Q-SYS health, audio-player flags) when msgpack fails.
1. dataland-redis — the internal control plane¶
| Container | dataland-redis |
| Image | redis:7-alpine |
| Internal URL | redis://dataland-redis:6379 |
| Published ports | 127.0.0.1:4145 → 6379 (loopback) and ${REDIS_PUBLIC_BIND:-100.124.170.43}:4145 → 6379 (tailnet) |
| Memory / CPU | mem_limit 1g, mem_reservation 128m, cpus 0.5, Redis maxmemory 512mb |
| Persistence | --appendonly yes, named volume redis-data (AOF + RDB) |
| Auth | --requirepass ${REDIS_PASSWORD} (DAT-76) |
1.1 Compose configuration¶
The service is defined in dataland-infrastructure/compose.yml. The command line is the operative contract:
command:
- "redis-server"
- "--appendonly"
- "yes" # (1)!
- "--maxmemory"
- "512mb" # (2)!
- "--maxmemory-policy"
- "noeviction" # (3)!
- "--requirepass"
- "${REDIS_PASSWORD:?REDIS_PASSWORD is required (DAT-76); set a strong unique value in .env}" # (4)!
- Enables AOF persistence so the
redis-datavolume survives restarts. The control-plane state (ticket hashes, dedup keys, the active-ticket mirror) must not vanish on a container bounce. - Caps Redis at 512 MB even though
mem_limitis 1g — the gap leaves headroom for the alpine process itself. Crossing 85% of this is alerted on (maxmemory above 85%). - DAT-80 —
museum:telemetryis a stream consumers must drain. Any eviction policy would silently drop entries at the cap;noevictioninstead fails writes loudly so the producer errors and the operator gets paged. This is the deliberate backpressure signal. - DAT-76 — the
:?form makes the container refuse to start ifREDIS_PASSWORDis unset, so the tailnet-exposed instance is never accidentally launched without auth. The value is sourced from.env, not hard-coded.
ports:
- "127.0.0.1:${REDIS_PUBLIC_PORT:-4145}:6379" # (1)!
- "${REDIS_PUBLIC_BIND:-100.124.170.43}:${REDIS_PUBLIC_PORT:-4145}:6379" # (2)!
- Loopback-only publish for
redis-cliand SSH-tunnel ops on the host. The external port defaults to4145(REDIS_PUBLIC_PORT), mapped to the container's6379. - DAT-73 — tailnet bind for direct peer access at
spark:4145. The default100.124.170.43is the tailnet IP; it is never0.0.0.0. Exposure here is gated by--requirepass(DAT-76).
Never bind 0.0.0.0
The two published ports are deliberate (DAT-73): loopback for redis-cli / SSH-tunnel ops, and the tailnet IP for direct peer access at spark:4145. Exposure on the tailnet is gated by --requirepass (DAT-76). The bind address is never 0.0.0.0.
Why noeviction (DAT-80)
museum:telemetry is a stream that consumers must drain. Under any eviction policy, Redis would silently drop entries when it hit maxmemory — losing telemetry events. noeviction instead makes writes fail loudly when the cap is reached, which is the correct backpressure signal: the producer (museum-api's bridge / the simulator) sees an error and the operator gets paged. The alert rules live in monitoring/rules/dataland.alerts.yml (maxmemory above 85%, redis-exporter cannot reach dataland-redis).
Why requirepass (DAT-76)
REDIS_PASSWORD is passed through the container environment (not the command line) so the secret never appears in the ps process listing. The healthcheck reads it from REDISCLI_AUTH:
test:
- "CMD-SHELL"
- 'REDISCLI_AUTH="$$REDIS_PASSWORD" redis-cli --no-auth-warning ping | grep -q PONG' # (1)!
REDISCLI_AUTHletsredis-cliauthenticate without putting the password on argv (so it stays out ofps); the$$escapes Compose interpolation so the literal$REDIS_PASSWORDreaches the shell at container runtime.--no-auth-warningsuppresses the stderr notice that would otherwise pollute healthcheck output.
1.2 Keyspace¶
Every key written to dataland-redis is enumerated below, grouped by producer. The notification worker is the heaviest tenant; the agent and museum-api each write a narrow slice.
The telemetry stream + DLQ (museum-api / simulator → notification worker)¶
| Key | Type | Producer | Consumer | Notes |
|---|---|---|---|---|
museum:telemetry |
stream | museum-api bridge + dataland-simulator |
notification worker (XREADGROUP) |
MAXLEN ~10000 (approximate). Consumer group notification-service (CONSUMER_GROUP). |
museum:telemetry:dlq |
stream | notification worker | ops / replay tool | DLQ_STREAM_KEY, MAXLEN ~10000 (DLQ_MAX_LEN). Bearer/Basic tokens scrubbed via redact() (DAT-115). |
The stream is the seam between Museum and Notification. The publisher (dataland-museum/app/services/redis_client.py::publish_telemetry and telemetry_publisher.py) encodes every value as a stream-field string; the worker's process_stream_fields re-parses bool/numeric strings on the way out, so the round-trip is lossless.
flowchart LR
RDC[(RDC redis<br/>BioSensors PUBLISH)] -->|msgpack| SUB[museum-api<br/>RDCSubscriber]
SUB -->|XADD normalised event| ST[(dataland-redis<br/>museum:telemetry)]
SIM[dataland-simulator] -->|XADD synthetic| ST
ST -->|XREADGROUP<br/>group=notification-service| W[notification worker]
W -->|on error| DLQ[(museum:telemetry:dlq)]
W -->|XACK| ST
Ticket state (notification worker)¶
app/state.py::TicketStateStore keeps one hash per ticket so rules can reason about session changes, gallery entry, and alert latches.
| Key | Type | TTL | Fields |
|---|---|---|---|
notification:state:<ticket_id> |
hash | STATE_TTL_SECONDS = 86400 (24 h) |
last_room_code, last_session_id, last_chapter_id, heart_rate_alert_active, skin_conductance_alert_active, visit_count, entered_gallery |
entered_gallery is the gating latch (DAT-287): content/condition notifications only fire once the ticket has been seen in a gallery (GA/GB/GC/GD). A new session_id increments visit_count and re-gates entered_gallery for the fresh visit. The welcome push is exempt from this gate.
Dedup / fire-once stores (notification worker)¶
These four stores all share the visit-length TTL of 21600 s (6 h) and are structurally near-identical on purpose.
| Key | Type | TTL (env) | Purpose |
|---|---|---|---|
notification:rooms_seen:<ticket>:<session_id> |
set | ROOMS_SEEN_TTL_SECONDS = 21600 |
room_transition fires once per genuinely-new room per session (DAT-228). Revisiting a room in the same session does not refire. SADD + EXPIRE, membership via SISMEMBER. |
notification:tips_seen:<ticket>:<session_id> |
set | EXPERIENCE_TIPS_SEEN_TTL_SECONDS = 21600 |
experience_tip fires a given tip once per session (DAT-259). Mirrors rooms_seen. |
notification:welcome_sent:<ticket> |
string | WELCOME_DEDUP_TTL_SECONDS = 21600 |
Welcome fires exactly once per visit (DAT-296). try_claim is an atomic SET key 1 NX EX ttl — the first of the two welcome paths to claim wins, the other skips. Keyed by ticket_id because that is the only id both paths share (mobile-init has no session_id). |
notification:complaint_seen:<session_id> |
string (int rank) | COMPLAINT_DEDUP_TTL_SECONDS = 21600 |
One complaint ticket per session unless severity escalates to a strictly higher tier (DAT-213). Stores the highest severity rank fired; a same-or-lower repeat is suppressed, a clear escalation re-fires. |
Welcome dedup is ticket_id-keyed for a reason
Two independent paths can trigger the welcome, sharing only the ticket:
- the mobile sending an empty first
/museummessage (agent →POST /v1/ops/welcome), - the RDC
visit_startedtelemetry (worker →VisitStartedRule).
try_claim (a single SET NX) is the atomic tie-breaker. If the send fails, release() deletes the claim so the other path can retry.
Rate limiter + EDA baseline (notification worker)¶
| Key | Type | TTL | Purpose |
|---|---|---|---|
notification:cooldown:<ticket_id> |
string | per-rule cooldown_seconds |
RedisRateLimiter (app/rate_limiter.py). acquire is a SET NX EX — holds a per-ticket cooldown across rules. Room-transition pushes bypass this (DAT-293). |
notification:eda_baseline:<ticket> |
sorted set | ttl_seconds = 900 (15 min) |
Per-ticket rolling 5-min skin-conductance window (DAT-230). Member "{ts}:{value}:{uuid}", score = sample_ts. Out-of-window samples evicted on every read via ZREMRANGEBYSCORE. Lets the EDA rule judge spikes relative to that visitor's baseline rather than a fixed 1.5 µS threshold. |
Audit + metrics (notification worker)¶
| Key | Type | Cap | Purpose |
|---|---|---|---|
notification:attempts |
stream | MAXLEN ~50000 |
One entry per notification attempt (DAT-57): ticket_id, rule_name, channel (push/chat), status, error, payload_json (redacted), timestamps. Best-effort writes; a Redis hiccup never crashes the engine. |
museum:complaints:audit |
stream | max_len |
Complaint audit trail (DAT-213). |
notification:counters:* |
string (INCRBY) | — | Stable metric counters: processed, dlq, push:sent, push:failed, chat:opened, chat:failed, rule:triggered:<rule>, rule:rate_limited:<rule> (DAT-59). |
notification:counters:index:rule:triggered / …:rate_limited |
set | — | Index SETs so /metrics reads via SMEMBERS + MGET instead of a per-scrape SCAN (DAT-107). |
notification:hist:e2e_latency_seconds:* |
string | — | Event-ts → push-ack latency histogram (:sum, :count, :bucket:<le>) (DAT-257). |
The Prometheus renderer (app/metrics.py::render_prometheus) also emits live gauges read straight off the stream: notification_dlq_depth (XLEN of DLQ), notification_telemetry_stream_depth (XLEN, diagnostic only — it pegs near MAXLEN in steady state), and notification_telemetry_consumer_lag from XINFO GROUPS (the correct backpressure signal, DAT-82/DAT-110).
Agent + museum-api keys¶
| Key | Type | Producer | Consumer | Notes |
|---|---|---|---|---|
museum:active_ticket_ids |
set | museum-api (1 Hz mirror) |
agent (session_state.py), mobile |
Cross-service contract key (constant, not env-overridable). See §3. |
| Agent rate-limit / session state | string / hash | agent | agent | TTL-bounded ephemeral state. |
1.3 Consumer-group mechanics¶
The worker reads with XREADGROUP (group=notification-service, default CONSUMER_COUNT=10, BLOCK_MS=2000) and runs a parallel PEL reaper.
- Unique consumer names (DAT-109): each replica/process needs a unique
CONSUMER_NAMEwithin the group or two workers share a PEL slot and acks race. Default is${HOSTNAME}-<uuid8>. On startup,_reject_duplicate_consumer()refuses to boot if a live consumer (idle <PEL_REAP_IDLE_MS) already holds the name. - PEL reaping:
_reap_loopruns everyPEL_REAP_INTERVAL_SECONDS(30 s).XPENDING+XCLAIMre-process entries idle pastPEL_REAP_IDLE_MS(60 s); entries delivered more thanPEL_REAP_MAX_DELIVERIES(5) times go to the DLQ. - Dedicated pools (DAT-116/DAT-216): the blocking
XREADGROUPgets its own 2-connection pool (consumer_redis_kwargs) so the read coroutine and the reap coroutine (XPENDING/XCLAIM) never contend with — or get starved by — the shared 32-connection pool (REDIS_MAX_CONNECTIONS).
1.4 Reaching it¶
# From the host (PING):
docker exec dataland-redis sh -c '
redis-cli -a "$REDIS_PASSWORD" --no-auth-warning PING
' # (1)!
# Stream depth:
docker exec dataland-redis sh -c '
redis-cli -a "$REDIS_PASSWORD" --no-auth-warning XLEN museum:telemetry
' # (2)!
# Consumer-group lag (the real backpressure signal):
docker exec dataland-redis sh -c '
redis-cli -a "$REDIS_PASSWORD" --no-auth-warning XINFO GROUPS museum:telemetry
' # (3)!
# Active-ticket mirror membership:
docker exec dataland-redis sh -c '
redis-cli -a "$REDIS_PASSWORD" --no-auth-warning SMEMBERS museum:active_ticket_ids
' # (4)!
# From a workstation (SSH tunnel first):
ssh -L 4145:127.0.0.1:4145 ege@100.124.170.43 # (5)!
redis-cli -h 127.0.0.1 -p 4145 -a "$REDIS_PASSWORD" PING
- Run inside the container so
$REDIS_PASSWORDresolves from the containerenvironment(DAT-76).-aplus--no-auth-warningauthenticates quietly; a healthy server repliesPONG. XLEN museum:telemetryis diagnostic only — in steady state it pegs nearMAXLEN ~10000, so a high value is normal, not a backlog. Use consumer lag (next) to judge backpressure.XINFO GROUPSexposes thenotification-servicegroup'slag/pending— the correct backpressure signal (DAT-82/DAT-110). A growing lag means the worker is falling behind, unlike raw stream depth.SMEMBERS museum:active_ticket_idslists the live visits the mirror replicates from RDC every 1 Hz. Empty (or missing key) means RDC reported nobody in — the loopDELs the set on an empty list.- Off-box you cannot reach
dataland-redisdirectly — tunnel4145(the loopback-published port) to your workstation first, then pointredis-cliat127.0.0.1:4145.
1.5 Metrics¶
dataland-redis-exporter (oliver006/redis_exporter, REDIS_ADDR=redis://dataland-redis:6379, auth via REDIS_PASSWORD) scrapes INFO, LATENCY, and stream depths. Surfaced on Grafana → Dataland folder → Redis dashboard. Alert rules in monitoring/rules/dataland.alerts.yml.
2. RDC redis — the external source of truth¶
The Refik Anadol Studio data-center redis is the sole source of live museum data after simulator/playback mode was removed. dataland-museum reads it; it never falls back to a substitute store.
| Connection | RDC_REDIS_URL (e.g. redis://user:pw@host:6379/0) |
| Required at startup | Yes — app/config.py raises RuntimeError if RDC_REDIS_URL is empty |
| Access | Read-only (on-demand GET/MGET/SCAN + pub/sub PSUBSCRIBE) |
| Encoding | msgpack for most keys; plain UTF-8 for some |
Mixed encoding — try msgpack, fall back to UTF-8
Most RDC SET keys are msgpack. A few are plain UTF-8 strings: Rooms:DATALAND:Qsys:* server-health usage strings ("5.03%", "52.0°C") and Rooms:*:Audio_Player_* flags. The reader's _decode_value() (dataland-museum/app/services/rdc_reader.py) unpacks msgpack first, then falls back to raw.decode("utf-8") (DAT-282). Pure-msgpack keys use _unpack() directly.
2.1 How museum-api reads it¶
Two clients, both built from RDC_REDIS_URL (app/services/rdc_redis.py):
- Synchronous (
get_rdc_redis) — HTTP routes do on-demandGET/MGET/SCANfor static-ish keys (visitor record, watch state, scene control, AQI, Q-SYS health, kiosks, audio players). These change at second-to-minute cadence so on-demand reads are cheap. - Async (
get_rdc_async_redis) — theRDCSubscriberbackground taskPSUBSCRIBEs the live channels and keeps an in-process latest-payload cache (RDCStateCache), so live biometrics/position never need a round-trip per request.
2.2 Keys read (representative; full catalogue in museum-rdc-redis-data-dictionary.md)¶
| Key pattern | Encoding | Used for |
|---|---|---|
Visitors:ActiveTicketIDs |
msgpack (list) | Source of truth for live visits → mirrored locally; drives visit_ended. |
Visitors:<ticket>:Status / :EmpaticaDeviceID / :EmpaticaDeviceIDShort / :ScentDeviceID / :ETemp |
msgpack | Per-ticket visitor record; serial ↔ ticket binding. |
Wearables:WatchDevices:<serial>:RoomCode / :Zone / :IsAlive / :Status |
msgpack | Watch localisation + liveness. |
Galleries:<gallery>:<art_id>:SceneControl:{RTChapter,DDSChapter,Playhead,IsPlaying,…} |
msgpack | Active chapter / scene flow. |
IO:AQI:<room> |
msgpack (Enviro+ dict) | Ambient temp (°F → °C), lux, humidity. |
Rooms:GA:Augmenta:Clusters |
msgpack (list) | Crowd density (GA only). |
Rooms:DATALAND:Qsys:* |
UTF-8 strings | Q-SYS server health. |
Rooms:*:Audio_Player_* |
UTF-8 flags | Audio playback state. |
Rooms:ON:Kiosks:<n>:* |
mixed | Onboarding/checkout lifecycle. |
2.3 Channels subscribed (RDCSubscriber.PATTERNS, msgpack)¶
Wearables:WatchDevices:*:BioSensors # (1)!
Wearables:WatchDevices:*:BlueIoTSensors # (2)!
Visitors:Register # (3)!
Visitors:*:Status
Visitors:*:AssignDevice # (4)!
- Carries heart rate, EDA, skin temp, activity, steps, and battery. This is the live biometric feed the EDA-baseline window (DAT-230) and heart-rate alert latches are built from, so it is cached in-process rather than re-fetched per request.
BlueIoT_Position+Room_ID— the watch's BlueIoT localisation, which drives room-code transitions and theroom_transitionpush (cooldown-exempt, DAT-293).- The onboarding control plane channel: announces new visitor registrations as they come online.
- Device rebinding events. Receiving one clears the cached serial↔ticket binding so a re-paired watch is attributed to the correct ticket on the next read.
Never replay RDC payloads into the live dataland-redis
Museum-simulation playback uses a dedicated sidecar redis (compose.sim.yml → dataland-redis-sim). Never replay captured RDC Avro/msgpack into the production dataland-redis — it would poison museum:telemetry for the live notification pipeline.
3. The active-ticket mirror (RDC → dataland-redis)¶
Visitors:ActiveTicketIDs is RDC's single source of truth for which visits are live. The museum-api RDC broadcast loop (app/web/server.py::_sync_active_tickets) replicates it verbatim into the local dataland-redis SET museum:active_ticket_ids every tick (1 Hz), so the agent and mobile can answer "is this ticket active?" without touching RDC.
flowchart LR
RDC[(RDC redis<br/>Visitors:ActiveTicketIDs)] -->|1 Hz read| LOOP[museum-api<br/>_rdc_broadcast_loop]
LOOP -->|DEL staging; SADD; RENAME| MIRROR[(dataland-redis<br/>museum:active_ticket_ids)]
LOOP -->|prev - current → XADD visit_ended| ST[(museum:telemetry)]
MIRROR -->|SISMEMBER| AGENT[agent session_state]
MIRROR -->|SISMEMBER| MOBILE[mobile app]
Design contract (deliberately trust-RDC):
- Atomic swap — staged in
museum:active_ticket_ids:stagingthenRENAMEd, so a reader never sees a half-built set. - Empty list →
DEL— when RDC reports zero active tickets, the set is deleted; membership reads false for everyone, exactly mirroring "nobody is in". - No TTL, no debounce, no fail-safe heuristics — the set is rewritten every tick to equal RDC verbatim. If RDC drops a ticket, the next sync drops it here (within ~1 s). Wrong membership is an upstream RDC problem, not something the mirror second-guesses.
- Best-effort + read-gated — the sync only runs after a successful RDC read, so a failed RDC read skips the tick and leaves the last good set in place rather than wiping it. A
dataland-redishiccup is caught and never breaks the RDC loop.
ACTIVE_TICKETS_MIRROR_KEY = "museum:active_ticket_ids" is a fixed constant in dataland-museum/app/config.py (not env-overridable) and the agent reads the identical literal in dataland-agent/app/services/session_state.py (_ACTIVE_TICKETS_KEY), so writer and reader can never drift apart.
The same loop also emits visit_ended telemetry (DAT-287) for every ticket that left the active set since the last tick (prev_active - current_active), which the notification worker's VisitEndedRule turns into the checkout "How was the experience?" push.
4. Volumes¶
The RDC redis has no Dataland-managed volume — it is external infrastructure owned by Refik Anadol Studio.
Related¶
- Museum — the RDC bridge, subscriber, and broadcast loop that read RDC and write the stream + mirror.
- Notification — the rules engine that consumes
museum:telemetryand owns everynotification:*key. - Agent — reads
museum:active_ticket_idsfor session/ticket activity.