Redis¶

Dataland runs against two completely separate Redis servers, and keeping the distinction straight is the single most important thing to understand about the data plane:

	`dataland-redis` (internal)	RDC redis (external)
Owner	Dataland	Refik Anadol Studio (RDC = Refik data center)
Role	Internal control plane: the `museum:telemetry` stream, ticket state, dedup keys, counters, the active-ticket mirror	Source of truth for live wearable / sensor / scene data
Access	Read and write	Read-only from our side
Encoding	Redis-stream maps (UTF-8 strings)	Mixed: msgpack for most keys, plain UTF-8 for some (Q-SYS, audio flags)
Reached via	`dataland-redis:6379` on the docker network; `:4145` on loopback + tailnet	`RDC_REDIS_URL` (required at museum-api startup)
Persistence	AOF (`redis-data` volume)	Not our concern

dataland-museum is the only service that talks to both: it reads the external RDC redis and writes a derived telemetry stream + active-ticket mirror into the internal dataland-redis. Everything else (agent, notification worker) only touches dataland-redis.

Recent changes

DAT-76: dataland-redis now enforces --requirepass; every client passes REDIS_PASSWORD. Empty password collapses to None so unauthenticated local dev still works.
DAT-296: ticket-keyed welcome dedup key notification:welcome_sent:<ticket> (atomic SET NX), shared by the mobile-init welcome and the RDC visit_started welcome.
DAT-287 / 289 / 290 / 293: gallery-gated notifications, visit_ended events emitted from the active-ticket mirror loop, room-transition pushes bypass the cooldown.
DAT-282: the RDC reader decodes plain-UTF-8 values (Q-SYS health, audio-player flags) when msgpack fails.

1. `dataland-redis` — the internal control plane¶


Container	`dataland-redis`
Image	`redis:7-alpine`
Internal URL	`redis://dataland-redis:6379`
Published ports	`127.0.0.1:4145 → 6379` (loopback) and `${REDIS_PUBLIC_BIND:-100.124.170.43}:4145 → 6379` (tailnet)
Memory / CPU	`mem_limit 1g`, `mem_reservation 128m`, `cpus 0.5`, Redis `maxmemory 512mb`
Persistence	`--appendonly yes`, named volume `redis-data` (AOF + RDB)
Auth	`--requirepass ${REDIS_PASSWORD}` (DAT-76)

1.1 Compose configuration¶

The service is defined in dataland-infrastructure/compose.yml. The command line is the operative contract:

command:
  - "redis-server"
  - "--appendonly"
  - "yes"                  # (1)!
  - "--maxmemory"
  - "512mb"                # (2)!
  - "--maxmemory-policy"
  - "noeviction"           # (3)!
  - "--requirepass"
  - "${REDIS_PASSWORD:?REDIS_PASSWORD is required (DAT-76); set a strong unique value in .env}"  # (4)!

Enables AOF persistence so the redis-data volume survives restarts. The control-plane state (ticket hashes, dedup keys, the active-ticket mirror) must not vanish on a container bounce.
Caps Redis at 512 MB even though mem_limit is 1g — the gap leaves headroom for the alpine process itself. Crossing 85% of this is alerted on (maxmemory above 85%).
DAT-80 — museum:telemetry is a stream consumers must drain. Any eviction policy would silently drop entries at the cap; noeviction instead fails writes loudly so the producer errors and the operator gets paged. This is the deliberate backpressure signal.
DAT-76 — the :? form makes the container refuse to start if REDIS_PASSWORD is unset, so the tailnet-exposed instance is never accidentally launched without auth. The value is sourced from .env, not hard-coded.

ports:
  - "127.0.0.1:${REDIS_PUBLIC_PORT:-4145}:6379"                          # (1)!
  - "${REDIS_PUBLIC_BIND:-100.124.170.43}:${REDIS_PUBLIC_PORT:-4145}:6379"  # (2)!

Loopback-only publish for redis-cli and SSH-tunnel ops on the host. The external port defaults to 4145 (REDIS_PUBLIC_PORT), mapped to the container's 6379.
DAT-73 — tailnet bind for direct peer access at spark:4145. The default 100.124.170.43 is the tailnet IP; it is never 0.0.0.0. Exposure here is gated by --requirepass (DAT-76).

Never bind 0.0.0.0

The two published ports are deliberate (DAT-73): loopback for redis-cli / SSH-tunnel ops, and the tailnet IP for direct peer access at spark:4145. Exposure on the tailnet is gated by --requirepass (DAT-76). The bind address is never 0.0.0.0.

Why noeviction (DAT-80)

museum:telemetry is a stream that consumers must drain. Under any eviction policy, Redis would silently drop entries when it hit maxmemory — losing telemetry events. noeviction instead makes writes fail loudly when the cap is reached, which is the correct backpressure signal: the producer (museum-api's bridge / the simulator) sees an error and the operator gets paged. The alert rules live in monitoring/rules/dataland.alerts.yml (maxmemory above 85%, redis-exporter cannot reach dataland-redis).

Why requirepass (DAT-76)

REDIS_PASSWORD is passed through the container environment (not the command line) so the secret never appears in the ps process listing. The healthcheck reads it from REDISCLI_AUTH:

test:
  - "CMD-SHELL"
  - 'REDISCLI_AUTH="$$REDIS_PASSWORD" redis-cli --no-auth-warning ping | grep -q PONG'  # (1)!

REDISCLI_AUTH lets redis-cli authenticate without putting the password on argv (so it stays out of ps); the $$ escapes Compose interpolation so the literal $REDIS_PASSWORD reaches the shell at container runtime. --no-auth-warning suppresses the stderr notice that would otherwise pollute healthcheck output.

1.2 Keyspace¶

Every key written to dataland-redis is enumerated below, grouped by producer. The notification worker is the heaviest tenant; the agent and museum-api each write a narrow slice.

The telemetry stream + DLQ (museum-api / simulator → notification worker)¶

Key	Type	Producer	Consumer	Notes
`museum:telemetry`	stream	`museum-api` bridge + `dataland-simulator`	notification worker (`XREADGROUP`)	`MAXLEN ~10000` (approximate). Consumer group `notification-service` (`CONSUMER_GROUP`).
`museum:telemetry:dlq`	stream	notification worker	ops / replay tool	`DLQ_STREAM_KEY`, `MAXLEN ~10000` (`DLQ_MAX_LEN`). Bearer/Basic tokens scrubbed via `redact()` (DAT-115).

The stream is the seam between Museum and Notification. The publisher (dataland-museum/app/services/redis_client.py::publish_telemetry and telemetry_publisher.py) encodes every value as a stream-field string; the worker's process_stream_fields re-parses bool/numeric strings on the way out, so the round-trip is lossless.

flowchart LR
  RDC[(RDC redis<br/>BioSensors PUBLISH)] -->|msgpack| SUB[museum-api<br/>RDCSubscriber]
  SUB -->|XADD normalised event| ST[(dataland-redis<br/>museum:telemetry)]
  SIM[dataland-simulator] -->|XADD synthetic| ST
  ST -->|XREADGROUP<br/>group=notification-service| W[notification worker]
  W -->|on error| DLQ[(museum:telemetry:dlq)]
  W -->|XACK| ST

Ticket state (notification worker)¶

app/state.py::TicketStateStore keeps one hash per ticket so rules can reason about session changes, gallery entry, and alert latches.

Key	Type	TTL	Fields
`notification:state:<ticket_id>`	hash	`STATE_TTL_SECONDS` = 86400 (24 h)	`last_room_code`, `last_session_id`, `last_chapter_id`, `heart_rate_alert_active`, `skin_conductance_alert_active`, `visit_count`, `entered_gallery`

entered_gallery is the gating latch (DAT-287): content/condition notifications only fire once the ticket has been seen in a gallery (GA/GB/GC/GD). A new session_id increments visit_count and re-gates entered_gallery for the fresh visit. The welcome push is exempt from this gate.

Dedup / fire-once stores (notification worker)¶

These four stores all share the visit-length TTL of 21600 s (6 h) and are structurally near-identical on purpose.

Key	Type	TTL (env)	Purpose
`notification:rooms_seen:<ticket>:<session_id>`	set	`ROOMS_SEEN_TTL_SECONDS` = 21600	`room_transition` fires once per genuinely-new room per session (DAT-228). Revisiting a room in the same session does not refire. `SADD` + `EXPIRE`, membership via `SISMEMBER`.
`notification:tips_seen:<ticket>:<session_id>`	set	`EXPERIENCE_TIPS_SEEN_TTL_SECONDS` = 21600	`experience_tip` fires a given tip once per session (DAT-259). Mirrors `rooms_seen`.
`notification:welcome_sent:<ticket>`	string	`WELCOME_DEDUP_TTL_SECONDS` = 21600	Welcome fires exactly once per visit (DAT-296). `try_claim` is an atomic `SET key 1 NX EX ttl` — the first of the two welcome paths to claim wins, the other skips. Keyed by `ticket_id` because that is the only id both paths share (mobile-init has no `session_id`).
`notification:complaint_seen:<session_id>`	string (int rank)	`COMPLAINT_DEDUP_TTL_SECONDS` = 21600	One complaint ticket per session unless severity escalates to a strictly higher tier (DAT-213). Stores the highest severity rank fired; a same-or-lower repeat is suppressed, a clear escalation re-fires.

Welcome dedup is ticket_id-keyed for a reason

Two independent paths can trigger the welcome, sharing only the ticket:

the mobile sending an empty first /museum message (agent → POST /v1/ops/welcome),
the RDC visit_started telemetry (worker → VisitStartedRule).

try_claim (a single SET NX) is the atomic tie-breaker. If the send fails, release() deletes the claim so the other path can retry.

Rate limiter + EDA baseline (notification worker)¶

Key	Type	TTL	Purpose
`notification:cooldown:<ticket_id>`	string	per-rule `cooldown_seconds`	`RedisRateLimiter` (`app/rate_limiter.py`). `acquire` is a `SET NX EX` — holds a per-ticket cooldown across rules. Room-transition pushes bypass this (DAT-293).
`notification:eda_baseline:<ticket>`	sorted set	`ttl_seconds` = 900 (15 min)	Per-ticket rolling 5-min skin-conductance window (DAT-230). Member `"{ts}:{value}:{uuid}"`, score = `sample_ts`. Out-of-window samples evicted on every read via `ZREMRANGEBYSCORE`. Lets the EDA rule judge spikes relative to that visitor's baseline rather than a fixed 1.5 µS threshold.

Audit + metrics (notification worker)¶

Key	Type	Cap	Purpose
`notification:attempts`	stream	`MAXLEN ~50000`	One entry per notification attempt (DAT-57): `ticket_id`, `rule_name`, `channel` (`push`/`chat`), `status`, `error`, `payload_json` (redacted), timestamps. Best-effort writes; a Redis hiccup never crashes the engine.
`museum:complaints:audit`	stream	`max_len`	Complaint audit trail (DAT-213).
`notification:counters:*`	string (INCRBY)	—	Stable metric counters: `processed`, `dlq`, `push:sent`, `push:failed`, `chat:opened`, `chat:failed`, `rule:triggered:<rule>`, `rule:rate_limited:<rule>` (DAT-59).
`notification:counters:index:rule:triggered` / `…:rate_limited`	set	—	Index SETs so `/metrics` reads via `SMEMBERS` + `MGET` instead of a per-scrape `SCAN` (DAT-107).
`notification:hist:e2e_latency_seconds:*`	string	—	Event-ts → push-ack latency histogram (`:sum`, `:count`, `:bucket:<le>`) (DAT-257).

The Prometheus renderer (app/metrics.py::render_prometheus) also emits live gauges read straight off the stream: notification_dlq_depth (XLEN of DLQ), notification_telemetry_stream_depth (XLEN, diagnostic only — it pegs near MAXLEN in steady state), and notification_telemetry_consumer_lag from XINFO GROUPS (the correct backpressure signal, DAT-82/DAT-110).

Agent + museum-api keys¶

Key	Type	Producer	Consumer	Notes
`museum:active_ticket_ids`	set	`museum-api` (1 Hz mirror)	agent (`session_state.py`), mobile	Cross-service contract key (constant, not env-overridable). See §3.
Agent rate-limit / session state	string / hash	agent	agent	TTL-bounded ephemeral state.

1.3 Consumer-group mechanics¶

The worker reads with XREADGROUP (group=notification-service, default CONSUMER_COUNT=10, BLOCK_MS=2000) and runs a parallel PEL reaper.

Unique consumer names (DAT-109): each replica/process needs a unique CONSUMER_NAME within the group or two workers share a PEL slot and acks race. Default is ${HOSTNAME}-<uuid8>. On startup, _reject_duplicate_consumer() refuses to boot if a live consumer (idle < PEL_REAP_IDLE_MS) already holds the name.
PEL reaping: _reap_loop runs every PEL_REAP_INTERVAL_SECONDS (30 s). XPENDING + XCLAIM re-process entries idle past PEL_REAP_IDLE_MS (60 s); entries delivered more than PEL_REAP_MAX_DELIVERIES (5) times go to the DLQ.
Dedicated pools (DAT-116/DAT-216): the blocking XREADGROUP gets its own 2-connection pool (consumer_redis_kwargs) so the read coroutine and the reap coroutine (XPENDING/XCLAIM) never contend with — or get starved by — the shared 32-connection pool (REDIS_MAX_CONNECTIONS).

1.4 Reaching it¶

# From the host (PING):
docker exec dataland-redis sh -c '
  redis-cli -a "$REDIS_PASSWORD" --no-auth-warning PING
'                                                          # (1)!

# Stream depth:
docker exec dataland-redis sh -c '
  redis-cli -a "$REDIS_PASSWORD" --no-auth-warning XLEN museum:telemetry
'                                                          # (2)!

# Consumer-group lag (the real backpressure signal):
docker exec dataland-redis sh -c '
  redis-cli -a "$REDIS_PASSWORD" --no-auth-warning XINFO GROUPS museum:telemetry
'                                                          # (3)!

# Active-ticket mirror membership:
docker exec dataland-redis sh -c '
  redis-cli -a "$REDIS_PASSWORD" --no-auth-warning SMEMBERS museum:active_ticket_ids
'                                                          # (4)!

# From a workstation (SSH tunnel first):
ssh -L 4145:127.0.0.1:4145 ege@100.124.170.43             # (5)!
redis-cli -h 127.0.0.1 -p 4145 -a "$REDIS_PASSWORD" PING

Run inside the container so $REDIS_PASSWORD resolves from the container environment (DAT-76). -a plus --no-auth-warning authenticates quietly; a healthy server replies PONG.
XLEN museum:telemetry is diagnostic only — in steady state it pegs near MAXLEN ~10000, so a high value is normal, not a backlog. Use consumer lag (next) to judge backpressure.
XINFO GROUPS exposes the notification-service group's lag / pending — the correct backpressure signal (DAT-82/DAT-110). A growing lag means the worker is falling behind, unlike raw stream depth.
SMEMBERS museum:active_ticket_ids lists the live visits the mirror replicates from RDC every 1 Hz. Empty (or missing key) means RDC reported nobody in — the loop DELs the set on an empty list.
Off-box you cannot reach dataland-redis directly — tunnel 4145 (the loopback-published port) to your workstation first, then point redis-cli at 127.0.0.1:4145.

1.5 Metrics¶

dataland-redis-exporter (oliver006/redis_exporter, REDIS_ADDR=redis://dataland-redis:6379, auth via REDIS_PASSWORD) scrapes INFO, LATENCY, and stream depths. Surfaced on Grafana → Dataland folder → Redis dashboard. Alert rules in monitoring/rules/dataland.alerts.yml.

2. RDC redis — the external source of truth¶

The Refik Anadol Studio data-center redis is the sole source of live museum data after simulator/playback mode was removed. dataland-museum reads it; it never falls back to a substitute store.


Connection	`RDC_REDIS_URL` (e.g. `redis://user:pw@host:6379/0`)
Required at startup	Yes — `app/config.py` raises `RuntimeError` if `RDC_REDIS_URL` is empty
Access	Read-only (on-demand `GET`/`MGET`/`SCAN` + pub/sub `PSUBSCRIBE`)
Encoding	msgpack for most keys; plain UTF-8 for some

Mixed encoding — try msgpack, fall back to UTF-8

Most RDC SET keys are msgpack. A few are plain UTF-8 strings: Rooms:DATALAND:Qsys:* server-health usage strings ("5.03%", "52.0°C") and Rooms:*:Audio_Player_* flags. The reader's _decode_value() (dataland-museum/app/services/rdc_reader.py) unpacks msgpack first, then falls back to raw.decode("utf-8") (DAT-282). Pure-msgpack keys use _unpack() directly.

2.1 How museum-api reads it¶

Two clients, both built from RDC_REDIS_URL (app/services/rdc_redis.py):

Synchronous (get_rdc_redis) — HTTP routes do on-demand GET/MGET/SCAN for static-ish keys (visitor record, watch state, scene control, AQI, Q-SYS health, kiosks, audio players). These change at second-to-minute cadence so on-demand reads are cheap.
Async (get_rdc_async_redis) — the RDCSubscriber background task PSUBSCRIBEs the live channels and keeps an in-process latest-payload cache (RDCStateCache), so live biometrics/position never need a round-trip per request.

2.2 Keys read (representative; full catalogue in `museum-rdc-redis-data-dictionary.md`)¶

Key pattern	Encoding	Used for
`Visitors:ActiveTicketIDs`	msgpack (list)	Source of truth for live visits → mirrored locally; drives `visit_ended`.
`Visitors:<ticket>:Status` / `:EmpaticaDeviceID` / `:EmpaticaDeviceIDShort` / `:ScentDeviceID` / `:ETemp`	msgpack	Per-ticket visitor record; serial ↔ ticket binding.
`Wearables:WatchDevices:<serial>:RoomCode` / `:Zone` / `:IsAlive` / `:Status`	msgpack	Watch localisation + liveness.
`Galleries:<gallery>:<art_id>:SceneControl:{RTChapter,DDSChapter,Playhead,IsPlaying,…}`	msgpack	Active chapter / scene flow.
`IO:AQI:<room>`	msgpack (Enviro+ dict)	Ambient temp (°F → °C), lux, humidity.
`Rooms:GA:Augmenta:Clusters`	msgpack (list)	Crowd density (GA only).
`Rooms:DATALAND:Qsys:*`	UTF-8 strings	Q-SYS server health.
`Rooms::Audio_Player_`	UTF-8 flags	Audio playback state.
`Rooms:ON:Kiosks:<n>:*`	mixed	Onboarding/checkout lifecycle.

2.3 Channels subscribed (`RDCSubscriber.PATTERNS`, msgpack)¶

Wearables:WatchDevices:*:BioSensors      # (1)!
Wearables:WatchDevices:*:BlueIoTSensors  # (2)!
Visitors:Register                        # (3)!
Visitors:*:Status
Visitors:*:AssignDevice                  # (4)!

Carries heart rate, EDA, skin temp, activity, steps, and battery. This is the live biometric feed the EDA-baseline window (DAT-230) and heart-rate alert latches are built from, so it is cached in-process rather than re-fetched per request.
BlueIoT_Position + Room_ID — the watch's BlueIoT localisation, which drives room-code transitions and the room_transition push (cooldown-exempt, DAT-293).
The onboarding control plane channel: announces new visitor registrations as they come online.
Device rebinding events. Receiving one clears the cached serial↔ticket binding so a re-paired watch is attributed to the correct ticket on the next read.

Never replay RDC payloads into the live dataland-redis

Museum-simulation playback uses a dedicated sidecar redis (compose.sim.yml → dataland-redis-sim). Never replay captured RDC Avro/msgpack into the production dataland-redis — it would poison museum:telemetry for the live notification pipeline.

3. The active-ticket mirror (RDC → dataland-redis)¶

Visitors:ActiveTicketIDs is RDC's single source of truth for which visits are live. The museum-api RDC broadcast loop (app/web/server.py::_sync_active_tickets) replicates it verbatim into the local dataland-redis SET museum:active_ticket_ids every tick (1 Hz), so the agent and mobile can answer "is this ticket active?" without touching RDC.

flowchart LR
  RDC[(RDC redis<br/>Visitors:ActiveTicketIDs)] -->|1 Hz read| LOOP[museum-api<br/>_rdc_broadcast_loop]
  LOOP -->|DEL staging; SADD; RENAME| MIRROR[(dataland-redis<br/>museum:active_ticket_ids)]
  LOOP -->|prev - current → XADD visit_ended| ST[(museum:telemetry)]
  MIRROR -->|SISMEMBER| AGENT[agent session_state]
  MIRROR -->|SISMEMBER| MOBILE[mobile app]

Design contract (deliberately trust-RDC):

Atomic swap — staged in museum:active_ticket_ids:staging then RENAMEd, so a reader never sees a half-built set.
Empty list → DEL — when RDC reports zero active tickets, the set is deleted; membership reads false for everyone, exactly mirroring "nobody is in".
No TTL, no debounce, no fail-safe heuristics — the set is rewritten every tick to equal RDC verbatim. If RDC drops a ticket, the next sync drops it here (within ~1 s). Wrong membership is an upstream RDC problem, not something the mirror second-guesses.
Best-effort + read-gated — the sync only runs after a successful RDC read, so a failed RDC read skips the tick and leaves the last good set in place rather than wiping it. A dataland-redis hiccup is caught and never breaks the RDC loop.

ACTIVE_TICKETS_MIRROR_KEY = "museum:active_ticket_ids" is a fixed constant in dataland-museum/app/config.py (not env-overridable) and the agent reads the identical literal in dataland-agent/app/services/session_state.py (_ACTIVE_TICKETS_KEY), so writer and reader can never drift apart.

The same loop also emits visit_ended telemetry (DAT-287) for every ticket that left the active set since the last tick (prev_active - current_active), which the notification worker's VisitEndedRule turns into the checkout "How was the experience?" push.

4. Volumes¶

redis-data    <- named volume for dataland-redis (AOF + RDB)

The RDC redis has no Dataland-managed volume — it is external infrastructure owned by Refik Anadol Studio.

Museum — the RDC bridge, subscriber, and broadcast loop that read RDC and write the stream + mirror.
Notification — the rules engine that consumes museum:telemetry and owns every notification:* key.
Agent — reads museum:active_ticket_ids for session/ticket activity.