Redis

Dataland runs against two completely separate Redis servers, and keeping the distinction straight is the single most important thing to understand about the data plane:

| | dataland-redis (internal) | RDC redis (external) |
|---|---|---|
| Owner | Dataland | Refik Anadol Studio (RDC = Refik data center) |
| Role | Internal control plane: the museum:telemetry stream, ticket state, dedup keys, counters, the active-ticket mirror | Source of truth for live wearable / sensor / scene data |
| Access | Read and write | Read-only from our side |
| Encoding | Redis-stream maps (UTF-8 strings) | Mixed: msgpack for most keys, plain UTF-8 for some (Q-SYS, audio flags) |
| Reached via | dataland-redis:6379 on the docker network; :4145 on loopback + tailnet | RDC_REDIS_URL (required at museum-api startup) |
| Persistence | AOF (redis-data volume) | Not our concern |

dataland-museum is the only service that talks to both: it reads the external RDC redis and writes a derived telemetry stream + active-ticket mirror into the internal dataland-redis. Everything else (agent, notification worker) only touches dataland-redis.

Recent changes

  • DAT-76: dataland-redis now enforces --requirepass; every client passes REDIS_PASSWORD. Empty password collapses to None so unauthenticated local dev still works.
  • DAT-296: ticket-keyed welcome dedup key notification:welcome_sent:<ticket> (atomic SET NX), shared by the mobile-init welcome and the RDC visit_started welcome.
  • DAT-287 / 289 / 290 / 293: gallery-gated notifications, visit_ended events emitted from the active-ticket mirror loop, room-transition pushes bypass the cooldown.
  • DAT-282: the RDC reader decodes plain-UTF-8 values (Q-SYS health, audio-player flags) when msgpack fails.

1. dataland-redis — the internal control plane

| | |
|---|---|
| Container | dataland-redis |
| Image | redis:7-alpine |
| Internal URL | redis://dataland-redis:6379 |
| Published ports | 127.0.0.1:4145 → 6379 (loopback) and ${REDIS_PUBLIC_BIND:-100.124.170.43}:4145 → 6379 (tailnet) |
| Memory / CPU | mem_limit 1g, mem_reservation 128m, cpus 0.5, Redis maxmemory 512mb |
| Persistence | --appendonly yes, named volume redis-data (AOF + RDB) |
| Auth | --requirepass ${REDIS_PASSWORD} (DAT-76) |

1.1 Compose configuration

The service is defined in dataland-infrastructure/compose.yml. The command line is the operative contract:

command:
  - "redis-server"
  - "--appendonly"
  - "yes"                  # (1)!
  - "--maxmemory"
  - "512mb"                # (2)!
  - "--maxmemory-policy"
  - "noeviction"           # (3)!
  - "--requirepass"
  - "${REDIS_PASSWORD:?REDIS_PASSWORD is required (DAT-76); set a strong unique value in .env}"  # (4)!
  1. Enables AOF persistence so the redis-data volume survives restarts. The control-plane state (ticket hashes, dedup keys, the active-ticket mirror) must not vanish on a container bounce.
  2. Caps Redis at 512 MB even though mem_limit is 1g; the gap leaves headroom for the redis-server process's own overhead inside the container. An alert fires when usage crosses 85% of maxmemory.
  3. DAT-80: museum:telemetry is a stream consumers must drain. Any eviction policy would silently drop entries at the cap; noeviction instead fails writes loudly so the producer errors and the operator gets paged. This is the deliberate backpressure signal.
  4. DAT-76 — the :? form makes the container refuse to start if REDIS_PASSWORD is unset, so the tailnet-exposed instance is never accidentally launched without auth. The value is sourced from .env, not hard-coded.
ports:
  - "127.0.0.1:${REDIS_PUBLIC_PORT:-4145}:6379"                          # (1)!
  - "${REDIS_PUBLIC_BIND:-100.124.170.43}:${REDIS_PUBLIC_PORT:-4145}:6379"  # (2)!
  1. Loopback-only publish for redis-cli and SSH-tunnel ops on the host. The external port defaults to 4145 (REDIS_PUBLIC_PORT), mapped to the container's 6379.
  2. DAT-73 — tailnet bind for direct peer access at spark:4145. The default 100.124.170.43 is the tailnet IP; it is never 0.0.0.0. Exposure here is gated by --requirepass (DAT-76).

Never bind 0.0.0.0

The two published ports are deliberate (DAT-73): loopback for redis-cli / SSH-tunnel ops, and the tailnet IP for direct peer access at spark:4145. Exposure on the tailnet is gated by --requirepass (DAT-76). The bind address is never 0.0.0.0.

Why noeviction (DAT-80)

museum:telemetry is a stream that consumers must drain. Under any eviction policy, Redis would silently drop entries when it hit maxmemory — losing telemetry events. noeviction instead makes writes fail loudly when the cap is reached, which is the correct backpressure signal: the producer (museum-api's bridge / the simulator) sees an error and the operator gets paged. The alert rules live in monitoring/rules/dataland.alerts.yml (maxmemory above 85%, redis-exporter cannot reach dataland-redis).

Why requirepass (DAT-76)

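REDIS_PASSWORD is also present in the container environment, so a client running inside the container (such as the healthcheck) can authenticate without the secret appearing on its own command line in a ps listing. Note that the redis-server command in §1.1 still receives the value through --requirepass. The healthcheck reads it from REDISCLI_AUTH: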

test:
  - "CMD-SHELL"
  - 'REDISCLI_AUTH="$$REDIS_PASSWORD" redis-cli --no-auth-warning ping | grep -q PONG'  # (1)!

  1. REDISCLI_AUTH lets redis-cli authenticate without putting the password on argv (so it stays out of ps); the $$ escapes Compose interpolation so the literal $REDIS_PASSWORD reaches the shell at container runtime. --no-auth-warning suppresses the stderr notice that would otherwise pollute healthcheck output.
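
On the client side, the DAT-76 behaviour noted under Recent changes (an empty REDIS_PASSWORD collapsing to None so unauthenticated local dev still works) could look roughly like the sketch below. The helper names and the REDIS_HOST / REDIS_PORT variables are assumptions for illustration; the actual client wiring is not shown on this page.

```python
from typing import Mapping, Optional


def redis_password(env: Mapping[str, str]) -> Optional[str]:
    """Return the configured password, or None when it is unset or empty.

    Hypothetical helper: an empty string collapses to None so that a local
    Redis started without --requirepass can still be used.
    """
    return env.get("REDIS_PASSWORD", "") or None


def redis_client_kwargs(env: Mapping[str, str]) -> dict:
    """Build keyword arguments for a Redis client constructor (assumed names)."""
    return {
        "host": env.get("REDIS_HOST", "dataland-redis"),  # assumed variable
        "port": int(env.get("REDIS_PORT", "6379")),       # assumed variable
        "password": redis_password(env),                   # None means "no password"
    }
```

Passing `os.environ` as `env` would give the production behaviour; passing a plain dict keeps the helper easy to unit-test.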

1.2 Keyspace

Every key written to dataland-redis is enumerated below, grouped by producer. The notification worker is the heaviest tenant; the agent and museum-api each write a narrow slice.

The telemetry stream + DLQ (museum-api / simulator → notification worker)

| Key | Type | Producer | Consumer | Notes |
|---|---|---|---|---|
| `museum:telemetry` | stream | museum-api bridge + dataland-simulator | notification worker (XREADGROUP) | MAXLEN ~10000 (approximate). Consumer group notification-service (CONSUMER_GROUP). |
| `museum:telemetry:dlq` | stream | notification worker | ops / replay tool | DLQ_STREAM_KEY, MAXLEN ~10000 (DLQ_MAX_LEN). Bearer/Basic tokens scrubbed via redact() (DAT-115). |

The stream is the seam between Museum and Notification. The publisher (dataland-museum/app/services/redis_client.py::publish_telemetry and telemetry_publisher.py) encodes every value as a stream-field string; the worker's process_stream_fields re-parses bool/numeric strings on the way out, so the round-trip is lossless.
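
The string round-trip described above can be sketched as follows. This is a minimal illustration, not the code in redis_client.py or process_stream_fields: the function names and the "true"/"false" spelling for booleans are assumptions.

```python
from typing import Dict, Union

Scalar = Union[bool, int, float, str]


def encode_fields(event: Dict[str, Scalar]) -> Dict[str, str]:
    """Turn every value into a string, as a Redis stream field must be."""
    out: Dict[str, str] = {}
    for key, value in event.items():
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            out[key] = "true" if value else "false"
        else:
            out[key] = str(value)
    return out


def parse_field(raw: str) -> Scalar:
    """Re-parse a bool/numeric string; anything else stays a string."""
    if raw in ("true", "false"):
        return raw == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def decode_fields(fields: Dict[str, str]) -> Dict[str, Scalar]:
    """Apply parse_field to every stream field."""
    return {key: parse_field(value) for key, value in fields.items()}
```

One design point worth noting: with a scheme like this the round-trip is only lossless for values that were not numeric-looking strings to begin with. A string such as "007" would come back as the integer 7.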

flowchart LR
  RDC[(RDC redis<br/>BioSensors PUBLISH)] -->|msgpack| SUB[museum-api<br/>RDCSubscriber]
  SUB -->|XADD normalised event| ST[(dataland-redis<br/>museum:telemetry)]
  SIM[dataland-simulator] -->|XADD synthetic| ST
  ST -->|XREADGROUP<br/>group=notification-service| W[notification worker]
  W -->|on error| DLQ[(museum:telemetry:dlq)]
  W -->|XACK| ST

Ticket state (notification worker)

app/state.py::TicketStateStore keeps one hash per ticket so rules can reason about session changes, gallery entry, and alert latches.

| Key | Type | TTL | Fields |
|---|---|---|---|
| `notification:state:<ticket_id>` | hash | STATE_TTL_SECONDS = 86400 (24 h) | last_room_code, last_session_id, last_chapter_id, heart_rate_alert_active, skin_conductance_alert_active, visit_count, entered_gallery |

entered_gallery is the gating latch (DAT-287): content/condition notifications only fire once the ticket has been seen in a gallery (GA/GB/GC/GD). A new session_id increments visit_count and re-gates entered_gallery for the fresh visit. The welcome push is exempt from this gate.
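
The gating behaviour can be modelled as a small state update. The sketch below is an in-memory illustration only; TicketState, apply_event, and may_notify are hypothetical names, and "ON" is assumed to be a non-gallery room code.

```python
from dataclasses import dataclass
from typing import Optional

GALLERY_ROOM_CODES = {"GA", "GB", "GC", "GD"}


@dataclass
class TicketState:
    """A subset of the per-ticket hash fields, held in memory for the sketch."""
    last_session_id: Optional[str] = None
    last_room_code: Optional[str] = None
    visit_count: int = 0
    entered_gallery: bool = False


def apply_event(state: TicketState, session_id: str, room_code: str) -> TicketState:
    """Update the state for one telemetry event."""
    if session_id != state.last_session_id:
        # A new session counts as a fresh visit and re-arms the gate.
        state.visit_count += 1
        state.entered_gallery = False
        state.last_session_id = session_id
    if room_code in GALLERY_ROOM_CODES:
        state.entered_gallery = True
    state.last_room_code = room_code
    return state


def may_notify(state: TicketState, is_welcome: bool = False) -> bool:
    """Content/condition pushes need the latch; the welcome push is exempt."""
    return is_welcome or state.entered_gallery
```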

Dedup / fire-once stores (notification worker)

These four stores all share the visit-length TTL of 21600 s (6 h) and are structurally near-identical on purpose.

| Key | Type | TTL (env) | Purpose |
|---|---|---|---|
| `notification:rooms_seen:<ticket>:<session_id>` | set | ROOMS_SEEN_TTL_SECONDS = 21600 | room_transition fires once per genuinely-new room per session (DAT-228). Revisiting a room in the same session does not refire. SADD + EXPIRE, membership via SISMEMBER. |
| `notification:tips_seen:<ticket>:<session_id>` | set | EXPERIENCE_TIPS_SEEN_TTL_SECONDS = 21600 | experience_tip fires a given tip once per session (DAT-259). Mirrors rooms_seen. |
| `notification:welcome_sent:<ticket>` | string | WELCOME_DEDUP_TTL_SECONDS = 21600 | Welcome fires exactly once per visit (DAT-296). try_claim is an atomic SET key 1 NX EX ttl: the first of the two welcome paths to claim wins, the other skips. Keyed by ticket_id because that is the only id both paths share (mobile-init has no session_id). |
| `notification:complaint_seen:<session_id>` | string (int rank) | COMPLAINT_DEDUP_TTL_SECONDS = 21600 | One complaint ticket per session unless severity escalates to a strictly higher tier (DAT-213). Stores the highest severity rank fired; a same-or-lower repeat is suppressed, a clear escalation re-fires. |
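
The complaint escalation rule in the last row can be illustrated with a plain dict standing in for Redis. This is a sketch under assumptions: record_complaint is a hypothetical name, integer ranks are assumed to grow with severity, and a real implementation would need the read and the write to be atomic.

```python
from typing import Dict


def record_complaint(store: Dict[str, int], session_id: str, rank: int) -> bool:
    """Return True if a complaint of this severity rank should fire.

    The store keeps the highest rank fired so far for the session; a repeat
    at the same or a lower rank is suppressed, a strictly higher rank fires.
    """
    key = f"notification:complaint_seen:{session_id}"
    stored = store.get(key)
    if stored is not None and rank <= stored:
        return False
    store[key] = rank
    return True
```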

Welcome dedup is ticket_id-keyed for a reason

Two independent paths can trigger the welcome, sharing only the ticket:

  1. the mobile sending an empty first /museum message (agent → POST /v1/ops/welcome),
  2. the RDC visit_started telemetry (worker → VisitStartedRule).

try_claim (a single SET NX) is the atomic tie-breaker. If the send fails, release() deletes the claim so the other path can retry.
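
A minimal sketch of the claim/release pair follows. The WelcomeDedup class is hypothetical and only mirrors the behaviour described above. InMemoryRedis is a stand-in so the example runs without a server; its set() copies the redis-py convention of returning True on success and None when NX finds an existing key.

```python
import time
from typing import Dict, Optional, Tuple

WELCOME_DEDUP_TTL_SECONDS = 21600


class InMemoryRedis:
    """Stand-in for the two commands used below (not a real Redis client)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[object, float]] = {}

    def set(self, key: str, value: object, nx: bool = False, ex: Optional[int] = None):
        now = time.monotonic()
        entry = self._data.get(key)
        exists = entry is not None and entry[1] > now
        if nx and exists:
            return None  # SET NX on an existing, unexpired key does nothing
        expires_at = now + ex if ex is not None else float("inf")
        self._data[key] = (value, expires_at)
        return True

    def delete(self, key: str) -> int:
        return 1 if self._data.pop(key, None) is not None else 0


class WelcomeDedup:
    """Ticket-keyed fire-once claim shared by both welcome paths."""

    def __init__(self, client, ttl: int = WELCOME_DEDUP_TTL_SECONDS) -> None:
        self._client = client
        self._ttl = ttl

    @staticmethod
    def key(ticket_id: str) -> str:
        return f"notification:welcome_sent:{ticket_id}"

    def try_claim(self, ticket_id: str) -> bool:
        """True for the first caller only; later callers get False."""
        return bool(self._client.set(self.key(ticket_id), 1, nx=True, ex=self._ttl))

    def release(self, ticket_id: str) -> None:
        """Drop the claim after a failed send so the other path can retry."""
        self._client.delete(self.key(ticket_id))
```

A caller would typically wrap the send in try/except and call release() only when the send raised, so a successful welcome keeps its claim for the full TTL.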

Rate limiter + EDA baseline (notification worker)

| Key | Type | TTL | Purpose |
|---|---|---|---|
| `notification:cooldown:<ticket_id>` | string | per-rule cooldown_seconds | RedisRateLimiter (app/rate_limiter.py). acquire is a SET NX EX that holds a per-ticket cooldown across rules. Room-transition pushes bypass this (DAT-293). |
| `notification:eda_baseline:<ticket>` | sorted set | ttl_seconds = 900 (15 min) | Per-ticket rolling 5-min skin-conductance window (DAT-230). Member "{ts}:{value}:{uuid}", score = sample_ts. Out-of-window samples evicted on every read via ZREMRANGEBYSCORE. Lets the EDA rule judge spikes relative to that visitor's baseline rather than a fixed 1.5 µS threshold. |
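
The rolling EDA window can be sketched in memory as below. This is illustrative only: EdaBaseline and is_spike are hypothetical names, the mean is an assumed choice of baseline statistic, and the ratio of 2.0 is a made-up value rather than the rule's real threshold. Only the member format and the evict-on-read behaviour come from the table above.

```python
import uuid
from statistics import mean
from typing import List, Optional, Tuple

WINDOW_SECONDS = 300  # rolling 5-minute window


def make_member(ts: float, value: float) -> str:
    """Build a unique member in the "{ts}:{value}:{uuid}" format."""
    return f"{ts}:{value}:{uuid.uuid4().hex}"


def parse_member(member: str) -> Tuple[float, float]:
    """Recover (timestamp, value) from a member string."""
    ts, value, _ = member.split(":", 2)
    return float(ts), float(value)


class EdaBaseline:
    """In-memory stand-in for the sorted set: (score, member) pairs."""

    def __init__(self) -> None:
        self._entries: List[Tuple[float, str]] = []

    def add(self, ts: float, value: float) -> None:
        self._entries.append((ts, make_member(ts, value)))

    def baseline(self, now: float) -> Optional[float]:
        """Evict out-of-window samples, then average what is left."""
        cutoff = now - WINDOW_SECONDS
        # Plays the role of ZREMRANGEBYSCORE on every read.
        self._entries = [(s, m) for s, m in self._entries if s >= cutoff]
        if not self._entries:
            return None
        return mean(parse_member(m)[1] for _, m in self._entries)


def is_spike(value: float, baseline: Optional[float], ratio: float = 2.0) -> bool:
    """Judge a sample relative to the visitor's own baseline (ratio is illustrative)."""
    return baseline is not None and value > baseline * ratio
```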

Audit + metrics (notification worker)

| Key | Type | Cap | Purpose |
|---|---|---|---|
| `notification:attempts` | stream | MAXLEN ~50000 | One entry per notification attempt (DAT-57): ticket_id, rule_name, channel (push/chat), status, error, payload_json (redacted), timestamps. Best-effort writes; a Redis hiccup never crashes the engine. |
| `museum:complaints:audit` | stream | max_len | Complaint audit trail (DAT-213). |
| `notification:counters:*` | string (INCRBY) | | Stable metric counters: processed, dlq, push:sent, push:failed, chat:opened, chat:failed, `rule:triggered:<rule>`, `rule:rate_limited:<rule>` (DAT-59). |
| `notification:counters:index:rule:triggered` / `…:rate_limited` | set | | Index SETs so /metrics reads via SMEMBERS + MGET instead of a per-scrape SCAN (DAT-107). |
| `notification:hist:e2e_latency_seconds:*` | string | | Event-ts → push-ack latency histogram (`:sum`, `:count`, `:bucket:<le>`) (DAT-257). |

The Prometheus renderer (app/metrics.py::render_prometheus) also emits live gauges read straight off the stream: notification_dlq_depth (XLEN of DLQ), notification_telemetry_stream_depth (XLEN, diagnostic only — it pegs near MAXLEN in steady state), and notification_telemetry_consumer_lag from XINFO GROUPS (the correct backpressure signal, DAT-82/DAT-110).
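
To make the histogram key layout concrete, the sketch below records one latency observation into `:sum`, `:count`, and `:bucket:<le>` counters held in a dict. The bucket bounds are invented for illustration, and the cumulative-bucket convention is borrowed from Prometheus; whether the worker stores buckets cumulatively is an assumption.

```python
from typing import Dict, Sequence

PREFIX = "notification:hist:e2e_latency_seconds"
BUCKETS: Sequence[float] = (0.5, 1.0, 2.5, 5.0, 10.0)  # illustrative bounds


def observe(counters: Dict[str, float], seconds: float) -> None:
    """Record one latency sample using cumulative (Prometheus-style) buckets."""
    counters[f"{PREFIX}:sum"] = counters.get(f"{PREFIX}:sum", 0.0) + seconds
    counters[f"{PREFIX}:count"] = counters.get(f"{PREFIX}:count", 0) + 1
    for le in BUCKETS:
        if seconds <= le:
            # Every bucket whose upper bound covers the sample is incremented.
            key = f"{PREFIX}:bucket:{le}"
            counters[key] = counters.get(key, 0) + 1
    inf_key = f"{PREFIX}:bucket:+Inf"
    counters[inf_key] = counters.get(inf_key, 0) + 1
```

Against Redis, each dict update would map to an INCRBY or INCRBYFLOAT on the corresponding string key.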

Agent + museum-api keys

| Key | Type | Producer | Consumer | Notes |
|---|---|---|---|---|
| `museum:active_ticket_ids` | set | museum-api (1 Hz mirror) | agent (session_state.py), mobile | Cross-service contract key (constant, not env-overridable). See §3. |
| Agent rate-limit / session state | string / hash | agent | agent | TTL-bounded ephemeral state. |

1.3 Consumer-group mechanics

The worker reads with XREADGROUP (group=notification-service, default CONSUMER_COUNT=10, BLOCK_MS=2000) and runs a parallel PEL reaper.

  • Unique consumer names (DAT-109): each replica/process needs a unique CONSUMER_NAME within the group or two workers share a PEL slot and acks race. Default is ${HOSTNAME}-<uuid8>. On startup, _reject_duplicate_consumer() refuses to boot if a live consumer (idle < PEL_REAP_IDLE_MS) already holds the name.
  • PEL reaping: _reap_loop runs every PEL_REAP_INTERVAL_SECONDS (30 s). XPENDING + XCLAIM re-process entries idle past PEL_REAP_IDLE_MS (60 s); entries delivered more than PEL_REAP_MAX_DELIVERIES (5) times go to the DLQ.
  • Dedicated pools (DAT-116/DAT-216): the blocking XREADGROUP gets its own 2-connection pool (consumer_redis_kwargs) so the read coroutine and the reap coroutine (XPENDING/XCLAIM) never contend with — or get starved by — the shared 32-connection pool (REDIS_MAX_CONNECTIONS).
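
The reaper's decision for each pending entry can be sketched as a pure function. PendingEntry and plan_reap are hypothetical names. The boundary handling is also an assumption: idle at or above the threshold counts as reapable, and the DLQ check is applied only to idle entries.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

PEL_REAP_IDLE_MS = 60_000
PEL_REAP_MAX_DELIVERIES = 5


@dataclass(frozen=True)
class PendingEntry:
    """The fields of an XPENDING row that the decision needs."""
    entry_id: str
    idle_ms: int
    deliveries: int


def plan_reap(pending: Iterable[PendingEntry]) -> Tuple[List[str], List[str]]:
    """Return (ids to XCLAIM and re-process, ids to move to the DLQ)."""
    reclaim: List[str] = []
    dead_letter: List[str] = []
    for entry in pending:
        if entry.idle_ms < PEL_REAP_IDLE_MS:
            continue  # recently touched: leave it with its current consumer
        if entry.deliveries > PEL_REAP_MAX_DELIVERIES:
            dead_letter.append(entry.entry_id)  # delivered too often: give up
        else:
            reclaim.append(entry.entry_id)
    return reclaim, dead_letter
```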

1.4 Reaching it

# From the host (PING):
docker exec dataland-redis sh -c '
  redis-cli -a "$REDIS_PASSWORD" --no-auth-warning PING
'                                                          # (1)!

# Stream depth:
docker exec dataland-redis sh -c '
  redis-cli -a "$REDIS_PASSWORD" --no-auth-warning XLEN museum:telemetry
'                                                          # (2)!

# Consumer-group lag (the real backpressure signal):
docker exec dataland-redis sh -c '
  redis-cli -a "$REDIS_PASSWORD" --no-auth-warning XINFO GROUPS museum:telemetry
'                                                          # (3)!

# Active-ticket mirror membership:
docker exec dataland-redis sh -c '
  redis-cli -a "$REDIS_PASSWORD" --no-auth-warning SMEMBERS museum:active_ticket_ids
'                                                          # (4)!

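# From a workstation: open the SSH tunnel first, then run redis-cli in a
# second terminal with REDIS_PASSWORD exported in that local shell.
ssh -L 4145:127.0.0.1:4145 ege@100.124.170.43             # (5)!
redis-cli -h 127.0.0.1 -p 4145 -a "$REDIS_PASSWORD" --no-auth-warning PING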
  1. Run inside the container so $REDIS_PASSWORD resolves from the container environment (DAT-76). -a plus --no-auth-warning authenticates quietly; a healthy server replies PONG.
  2. XLEN museum:telemetry is diagnostic only — in steady state it pegs near MAXLEN ~10000, so a high value is normal, not a backlog. Use consumer lag (next) to judge backpressure.
  3. XINFO GROUPS exposes the notification-service group's lag / pending — the correct backpressure signal (DAT-82/DAT-110). A growing lag means the worker is falling behind, unlike raw stream depth.
  4. SMEMBERS museum:active_ticket_ids lists the live visits that the mirror replicates from RDC once per second (1 Hz). An empty result (or a missing key) means RDC reported nobody in; the loop DELs the set on an empty list.
  5. Off-box you cannot reach dataland-redis directly — tunnel 4145 (the loopback-published port) to your workstation first, then point redis-cli at 127.0.0.1:4145.

1.5 Metrics

dataland-redis-exporter (oliver006/redis_exporter, REDIS_ADDR=redis://dataland-redis:6379, auth via REDIS_PASSWORD) scrapes INFO, LATENCY, and stream depths. Surfaced on Grafana → Dataland folder → Redis dashboard. Alert rules in monitoring/rules/dataland.alerts.yml.


2. RDC redis — the external source of truth

The Refik Anadol Studio data-center redis is the sole source of live museum data after simulator/playback mode was removed. dataland-museum reads it; it never falls back to a substitute store.

| | |
|---|---|
| Connection | RDC_REDIS_URL (e.g. redis://user:pw@host:6379/0) |
| Required at startup | Yes: app/config.py raises RuntimeError if RDC_REDIS_URL is empty |
| Access | Read-only (on-demand GET/MGET/SCAN + pub/sub PSUBSCRIBE) |
| Encoding | msgpack for most keys; plain UTF-8 for some |

Mixed encoding — try msgpack, fall back to UTF-8

Most RDC SET keys are msgpack. A few are plain UTF-8 strings: Rooms:DATALAND:Qsys:* server-health usage strings ("5.03%", "52.0°C") and Rooms:*:Audio_Player_* flags. The reader's _decode_value() (dataland-museum/app/services/rdc_reader.py) unpacks msgpack first, then falls back to raw.decode("utf-8") (DAT-282). Pure-msgpack keys use _unpack() directly.
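
The try-then-fall-back shape can be sketched as below. This is not the code in rdc_reader.py: decode_value here is a hypothetical function that takes the unpacker as a parameter, and `json.loads` stands in for a msgpack unpacker purely so the example runs with the standard library alone.

```python
import json
from typing import Any, Callable


def decode_value(raw: bytes, unpack: Callable[[bytes], Any]) -> Any:
    """Try the structured unpacker first; fall back to plain UTF-8 text."""
    try:
        return unpack(raw)
    except Exception:  # broad on purpose in this sketch; narrow it in real code
        return raw.decode("utf-8")


# json.loads is only a stand-in decoder for the demonstration.
structured = decode_value(b'{"hr": 72}', json.loads)
usage = decode_value(b"5.03%", json.loads)
temperature = decode_value("52.0°C".encode("utf-8"), json.loads)
```

With the msgpack package, `msgpack.unpackb` would be passed as `unpack`. One caveat worth checking in that case: some short ASCII values are themselves valid msgpack (the single byte b"1" unpacks to a small integer), so a fallback of this kind only triggers when unpacking actually fails.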

2.1 How museum-api reads it

Two clients, both built from RDC_REDIS_URL (app/services/rdc_redis.py):

  • Synchronous (get_rdc_redis) — HTTP routes do on-demand GET/MGET/SCAN for static-ish keys (visitor record, watch state, scene control, AQI, Q-SYS health, kiosks, audio players). These change at second-to-minute cadence so on-demand reads are cheap.
  • Async (get_rdc_async_redis) — the RDCSubscriber background task PSUBSCRIBEs the live channels and keeps an in-process latest-payload cache (RDCStateCache), so live biometrics/position never need a round-trip per request.

2.2 Keys read (representative; full catalogue in museum-rdc-redis-data-dictionary.md)

| Key pattern | Encoding | Used for |
|---|---|---|
| `Visitors:ActiveTicketIDs` | msgpack (list) | Source of truth for live visits → mirrored locally; drives visit_ended. |
| `Visitors:<ticket>:Status` / `:EmpaticaDeviceID` / `:EmpaticaDeviceIDShort` / `:ScentDeviceID` / `:ETemp` | msgpack | Per-ticket visitor record; serial ↔ ticket binding. |
| `Wearables:WatchDevices:<serial>:RoomCode` / `:Zone` / `:IsAlive` / `:Status` | msgpack | Watch localisation + liveness. |
| `Galleries:<gallery>:<art_id>:SceneControl:{RTChapter,DDSChapter,Playhead,IsPlaying,…}` | msgpack | Active chapter / scene flow. |
| `IO:AQI:<room>` | msgpack (Enviro+ dict) | Ambient temp (°F → °C), lux, humidity. |
| `Rooms:GA:Augmenta:Clusters` | msgpack (list) | Crowd density (GA only). |
| `Rooms:DATALAND:Qsys:*` | UTF-8 strings | Q-SYS server health. |
| `Rooms:*:Audio_Player_*` | UTF-8 flags | Audio playback state. |
| `Rooms:ON:Kiosks:<n>:*` | mixed | Onboarding/checkout lifecycle. |

2.3 Channels subscribed (RDCSubscriber.PATTERNS, msgpack)

Wearables:WatchDevices:*:BioSensors      # (1)!
Wearables:WatchDevices:*:BlueIoTSensors  # (2)!
Visitors:Register                        # (3)!
Visitors:*:Status
Visitors:*:AssignDevice                  # (4)!
  1. Carries heart rate, EDA, skin temp, activity, steps, and battery. This is the live biometric feed the EDA-baseline window (DAT-230) and heart-rate alert latches are built from, so it is cached in-process rather than re-fetched per request.
  2. BlueIoT_Position + Room_ID — the watch's BlueIoT localisation, which drives room-code transitions and the room_transition push (cooldown-exempt, DAT-293).
  3. The onboarding control plane channel: announces new visitor registrations as they come online.
  4. Device rebinding events. Receiving one clears the cached serial↔ticket binding so a re-paired watch is attributed to the correct ticket on the next read.
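
How a subscriber might route these channels into a latest-payload cache is sketched below. StateCache and match_pattern are hypothetical, fnmatchcase is used as an approximation of Redis glob matching (close enough for patterns that only use `*`), and clearing the whole binding map on AssignDevice is a deliberately coarse assumption.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, Optional

PATTERNS = (
    "Wearables:WatchDevices:*:BioSensors",
    "Wearables:WatchDevices:*:BlueIoTSensors",
    "Visitors:Register",
    "Visitors:*:Status",
    "Visitors:*:AssignDevice",
)


def match_pattern(channel: str) -> Optional[str]:
    """Return the first subscribed pattern that matches the channel name."""
    for pattern in PATTERNS:
        if fnmatchcase(channel, pattern):
            return pattern
    return None


class StateCache:
    """Keeps the latest payload per channel plus a serial-to-ticket map."""

    def __init__(self) -> None:
        self.latest: Dict[str, Any] = {}
        self.serial_to_ticket: Dict[str, str] = {}

    def on_message(self, channel: str, payload: Any) -> bool:
        """Store the payload; return False for channels nobody subscribed to."""
        if match_pattern(channel) is None:
            return False
        self.latest[channel] = payload
        if channel.endswith(":AssignDevice"):
            # A rebinding invalidates cached bindings (coarse: drop them all).
            self.serial_to_ticket.clear()
        return True
```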

Never replay RDC payloads into the live dataland-redis

Museum-simulation playback uses a dedicated sidecar redis (compose.sim.yml → dataland-redis-sim). Never replay captured RDC Avro/msgpack into the production dataland-redis; it would poison museum:telemetry for the live notification pipeline.


3. The active-ticket mirror (RDC → dataland-redis)

Visitors:ActiveTicketIDs is RDC's single source of truth for which visits are live. The museum-api RDC broadcast loop (app/web/server.py::_sync_active_tickets) replicates it verbatim into the local dataland-redis SET museum:active_ticket_ids every tick (1 Hz), so the agent and mobile can answer "is this ticket active?" without touching RDC.

flowchart LR
  RDC[(RDC redis<br/>Visitors:ActiveTicketIDs)] -->|1 Hz read| LOOP[museum-api<br/>_rdc_broadcast_loop]
  LOOP -->|DEL staging; SADD; RENAME| MIRROR[(dataland-redis<br/>museum:active_ticket_ids)]
  LOOP -->|prev - current → XADD visit_ended| ST[(museum:telemetry)]
  MIRROR -->|SISMEMBER| AGENT[agent session_state]
  MIRROR -->|SISMEMBER| MOBILE[mobile app]

Design contract (deliberately trust-RDC):

  • Atomic swap — staged in museum:active_ticket_ids:staging then RENAMEd, so a reader never sees a half-built set.
  • Empty list → DEL — when RDC reports zero active tickets, the set is deleted; membership reads false for everyone, exactly mirroring "nobody is in".
  • No TTL, no debounce, no fail-safe heuristics — the set is rewritten every tick to equal RDC verbatim. If RDC drops a ticket, the next sync drops it here (within ~1 s). Wrong membership is an upstream RDC problem, not something the mirror second-guesses.
  • Best-effort + read-gated — the sync only runs after a successful RDC read, so a failed RDC read skips the tick and leaves the last good set in place rather than wiping it. A dataland-redis hiccup is caught and never breaks the RDC loop.

ACTIVE_TICKETS_MIRROR_KEY = "museum:active_ticket_ids" is a fixed constant in dataland-museum/app/config.py (not env-overridable) and the agent reads the identical literal in dataland-agent/app/services/session_state.py (_ACTIVE_TICKETS_KEY), so writer and reader can never drift apart.

The same loop also emits visit_ended telemetry (DAT-287) for every ticket that left the active set since the last tick (prev_active - current_active), which the notification worker's VisitEndedRule turns into the checkout "How was the experience?" push.
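
One tick of the mirror can be sketched as below, combining the staged swap, the empty-list DEL, and the `prev - current` difference that feeds visit_ended. sync_active_tickets here is a simplified illustration, not the function in app/web/server.py. InMemorySets is a stand-in whose method names follow redis-py (delete, sadd, rename) so a real client could be dropped in.

```python
from typing import Dict, Iterable, Set, Tuple

MIRROR_KEY = "museum:active_ticket_ids"
STAGING_KEY = MIRROR_KEY + ":staging"


class InMemorySets:
    """Stand-in for the DEL / SADD / RENAME commands (not a real client)."""

    def __init__(self) -> None:
        self.sets: Dict[str, Set[str]] = {}

    def delete(self, key: str) -> None:
        self.sets.pop(key, None)

    def sadd(self, key: str, *members: str) -> None:
        self.sets.setdefault(key, set()).update(members)

    def rename(self, src: str, dst: str) -> None:
        self.sets[dst] = self.sets.pop(src)


def sync_active_tickets(
    client, current: Iterable[str], previous: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Mirror `current` into MIRROR_KEY; return (new active set, tickets that left)."""
    current_set = set(current)
    if not current_set:
        client.delete(MIRROR_KEY)  # nobody is in: membership reads false for all
    else:
        client.delete(STAGING_KEY)              # clear any leftover staging set
        client.sadd(STAGING_KEY, *current_set)  # build the new set off to the side
        client.rename(STAGING_KEY, MIRROR_KEY)  # swap it in with a single command
    ended = previous - current_set  # each of these would get a visit_ended event
    return current_set, ended
```

The caller keeps the returned set as `previous` for the next tick and skips the call entirely when the RDC read failed, which is what leaves the last good set in place.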


4. Volumes

redis-data    <- named volume for dataland-redis (AOF + RDB)

The RDC redis has no Dataland-managed volume — it is external infrastructure owned by Refik Anadol Studio.


  • Museum — the RDC bridge, subscriber, and broadcast loop that read RDC and write the stream + mirror.
  • Notification — the rules engine that consumes museum:telemetry and owns every notification:* key.
  • Agent — reads museum:active_ticket_ids for session/ticket activity.