Auth — RS256 JWT issuance, JWKS, and the CMS-key mirror

dataland-auth is the local identity provider that issues and serves the signing material the rest of the stack trusts. It runs as a separate container but lives inside the agent repo — it is the auth_server.py entrypoint of dataland-agent, sharing the same image, Postgres database, and data/ volume layout.

Two distinct roles converge here:

  1. Issuer: dataland-auth signs RS256 JWTs (access + refresh) for accounts it stores in Postgres (argon2id-hashed passwords), and exposes a standard JWKS endpoint so any verifier can validate tokens without ever talking back to the issuer.
  2. JWKS mirror (DAT-286) — the same JWKS endpoint also serves the external CMS / mobile-backend public key (kid=dataland-rs256-1), so the agent can validate production-issued chat tokens locally, removing the chat-auth single point of failure.

Visitor identity for the museum chat flow is handled upstream by the CMS / mobile backend (dataland-cms-staging-…a.run.app), which signs the real production tokens. dataland-auth is the agent's primary, locally-resolvable JWKS source; the external CMS endpoint is demoted to a pure backup. See Agent for how tokens are consumed on the request path.

Container: dataland-auth
Image: dataland/auth-server:${IMAGE_TAG} (built from dataland-agent/Dockerfile)
Command: uv run python auth_server.py
Public port: 9000 (host ${AUTH_PUBLIC_PORT:-9000} → container 9000)
Internal URL: http://dataland-auth:9000
Memory / CPU: 256m / 0.5 core
Healthcheck: GET /.well-known/jwks.json (interval 10s, timeout 5s, 3 retries, 5s start period)
Depends on: postgres (service_healthy)
Repo: dataland-agent (auth_server.py)

Recent changes (DAT-286, DAT-145, DAT-291)

  • DAT-286 — the local JWKS now mirrors the CMS signing key (kid=dataland-rs256-1) via data/extra_jwks.json, so the agent validates chat tokens locally instead of depending on the external CMS endpoint as the sole validator. The agent emits a WARNING whenever a fallback (non-primary) JWKS provider is the only one that could verify a token.
  • DAT-145 — passwords are hashed with argon2id (was sha256(salt+password)). Legacy rows verify and silently upgrade on next login.
  • DAT-291: deploy.sh runs the agent's boot guard against .env before rebuild; an empty JWKS_URL or AUTH_SKIP=true in production aborts the deploy.

What it does

  • Issues RS256 JWTs: POST /api/auth/signup and POST /api/auth/login return an access token (1 h TTL) and a refresh token (7 day TTL), both signed with the local RSA private key and carrying the local kid in the token header.
  • Serves JWKS at GET /.well-known/jwks.json — the local public key plus any mirrored external public keys (DAT-286).
  • Stores accounts in Postgres table auth_server_users (the shared dataland database, via AUTH_DATABASE_URL). Passwords are argon2id (DAT-145).
  • Hosts a dark-themed web UI at / — a login/signup form that mints a token and can deep-link into the agent's test chat client with ?token=….
  • GET /api/auth/me — returns the current user record for a valid Bearer token.

Auth issuance vs. token verification live in two places

dataland-auth (auth_server.py) signs and serves keys. The agent (app/auth.py) verifies tokens against one or more JWKS URLs. They share the repo and the data/ volume but are different processes with different jobs. Do not conflate the auth server's private key (signing) with the agent's verification path (public-key only).


Architecture

graph LR
  subgraph local["Docker network: dataland-network"]
    AU["dataland-auth<br/>auth_server.py :9000"]
    AG["dataland-agent<br/>app/auth.py :4141"]
    PG[("postgres<br/>auth_server_users + agent users")]
  end
  CMS["CMS / mobile backend<br/>dataland-cms-staging…a.run.app<br/>kid=dataland-rs256-1"]

  AU -->|asyncpg pool| PG
  AG -.->|"JWKS_URL (primary)"| AU
  AG -.->|"JWKS_URLS (fallback, backup)"| CMS
  CMS -.->|"public key mirrored into<br/>data/extra_jwks.json (DAT-286)"| AU

Hard dependency: depends_on: postgres: service_healthy.


RSA signing keys (local, persisted)

On first boot auth_server.py generates a 2048-bit RSA key pair and a random kid of the form local-<8hex>, then persists them to the data/ directory so issued tokens survive restarts:

data/auth_rsa_private.pem   <- PKCS8 private key (signing material, secret)  # (1)
data/auth_rsa_kid.txt       <- the kid stamped into every token header  # (2)
data/extra_jwks.json        <- (optional) mirrored external public JWKs (DAT-286)  # (3)
  1. The 2048-bit RSA private key, generated once on first boot and never re-derived. This is the only secret that can sign tokens — losing it (e.g. wiping auth-data) invalidates every previously issued JWT.
  2. The local-<8hex> key id stamped into every token header so verifiers can pick the right public key. It is also the kid that the dedup logic protects: an extra mirrored key reusing this kid is always skipped.
  3. DAT-286 — the mirrored CMS public key (kid=dataland-rs256-1) that lets the agent verify production chat tokens locally. Optional, but absent it the external CMS endpoint becomes the sole validator (the SPOF this feature removes).

In production these live in the auth-data named volume (auth-data:/app/data).
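
The load-or-create behaviour for the kid can be sketched as below. This is a minimal illustration and not the code in auth_server.py: the function name load_or_create_kid is hypothetical, only the file name and the local-<8hex> format come from the description above, and RSA key generation is deliberately left out.

```python
import secrets
import tempfile
from pathlib import Path


def load_or_create_kid(data_dir: Path) -> str:
    """Return the persisted kid, generating local-<8hex> on first boot.

    Hypothetical helper: only the file name and kid format come from the docs.
    """
    kid_path = data_dir / "auth_rsa_kid.txt"
    if kid_path.exists():
        # Later boots: reuse the stored kid so already-issued tokens still resolve.
        return kid_path.read_text().strip()
    # First boot: 4 random bytes render as 8 hex characters.
    kid = f"local-{secrets.token_hex(4)}"
    data_dir.mkdir(parents=True, exist_ok=True)
    kid_path.write_text(kid)
    return kid


# Two consecutive "boots" against the same directory see the same kid.
with tempfile.TemporaryDirectory() as tmp:
    first = load_or_create_kid(Path(tmp))
    second = load_or_create_kid(Path(tmp))
```

The second call returning the first call's value is the property that matters: a new kid only appears when the directory (the auth-data volume in production) is empty.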

Do not delete auth-data

The signing key lives in this volume. Blow it away (e.g. an aggressive reset-stack.sh) and every previously issued JWT becomes invalid — every active staff session and any locally-minted token dies, and the agent re-fetches a brand-new kid it has never seen. The backup/restore runbook is in dataland-infrastructure/reports/backup-restore.md; read it first. Wiping extra_jwks.json re-introduces the DAT-286 SPOF until re-provisioned.

The public key is exported as a JWK with kid, use="sig", alg="RS256".
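
As a rough sketch of what that export amounts to, the snippet below builds a public RSA JWK from the modulus n and exponent e using only the standard library. The helper names b64url_uint and public_jwk are illustrative assumptions, the example kid is made up, and the tiny modulus is a toy value; the real server presumably derives n and e from its 2048-bit key.

```python
import base64


def b64url_uint(value: int) -> str:
    """Encode a non-negative integer as unpadded base64url (big-endian bytes)."""
    length = max(1, (value.bit_length() + 7) // 8)
    raw = value.to_bytes(length, "big")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def public_jwk(n: int, e: int, kid: str) -> dict:
    """Build a public RSA JWK: n and e only, never the private exponent d."""
    return {
        "kty": "RSA",
        "use": "sig",
        "alg": "RS256",
        "kid": kid,
        "n": b64url_uint(n),
        "e": b64url_uint(e),
    }


# Toy modulus for illustration only; a real key has a 2048-bit n.
jwk = public_jwk(n=3233, e=65537, kid="local-0a1b2c3d")
print(jwk["e"])  # AQAB
```

The common public exponent 65537 encodes to AQAB, which is why that value appears in the mirrored CMS key example as well.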


The JWKS endpoint and the CMS-key mirror (DAT-286)

The problem it solves

The agent verifies a token by trying each configured JWKS URL in order (JWKS_URL first, then each entry of JWKS_URLS). Production chat tokens are signed by the CMS backend with kid=dataland-rs256-1. With that key absent from the local JWKS, the only place the agent can resolve it is the external CMS endpoint — making that external endpoint the sole source of the verification key for all chat auth. If the CMS endpoint blips, every chat request 401s.

The fix

dataland-auth serves the external public key alongside its own. A JWKS only ever exposes public material — never the private signing key — so mirroring it is safe. With the key present locally, chat auth resolves against dataland-auth (the primary), and the external endpoint becomes a pure backup.

How the served key set is built

auth_server.py merges keys at startup:

  1. The local public JWK is always first and authoritative.
  2. Extra public JWKs are loaded from inline env JSON first (AUTH_EXTRA_JWKS_JSON), then the on-disk file (AUTH_EXTRA_JWKS_PATH, default data/extra_jwks.json).
  3. Entries are accepted only if they carry both kty and kid; everything else is dropped as noise.
  4. Keys are deduplicated by kid — an extra key reusing the local kid is skipped, so a stale mirror can never shadow the key this server actually signs with.
  5. A malformed or missing extra-key source yields zero extra keys, logged via the auth.extra_jwks.bad_source observability event — it never raises. A bad mirror file must not take auth (and thus all chat) down.

When extras load successfully, the server emits auth.extra_jwks.loaded with the count and the mirrored kids.
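
The merge rules above can be sketched as follows. This is an illustration under stated assumptions rather than the actual load_extra_jwks code: the function names parse_extra_jwks and build_jwks are hypothetical, the observability events are omitted, and the example key values are placeholders.

```python
import json
from pathlib import Path


def parse_extra_jwks(raw: str) -> list:
    """Parse {"keys": [...]} or a bare list; any malformed input yields []."""
    try:
        doc = json.loads(raw)
    except (TypeError, ValueError):
        return []
    keys = doc.get("keys") if isinstance(doc, dict) else doc
    if not isinstance(keys, list):
        return []
    # Keep only entries that carry both kty and kid.
    return [k for k in keys if isinstance(k, dict) and k.get("kty") and k.get("kid")]


def build_jwks(local_jwk: dict, inline_json: str = "", path: str = "") -> dict:
    """Local key first, then inline extras, then file extras, deduplicated by kid."""
    extras = parse_extra_jwks(inline_json) if inline_json else []
    if path and Path(path).is_file():
        try:
            extras += parse_extra_jwks(Path(path).read_text())
        except OSError:
            pass  # an unreadable mirror file must not take auth down
    merged, seen = [local_jwk], {local_jwk["kid"]}
    for key in extras:
        if key["kid"] not in seen:  # the local kid is never overridden
            seen.add(key["kid"])
            merged.append(key)
    return {"keys": merged}


local = {"kty": "RSA", "kid": "local-0a1b2c3d", "n": "<local>", "e": "AQAB"}
inline = json.dumps({"keys": [
    {"kty": "RSA", "kid": "dataland-rs256-1", "n": "<cms>", "e": "AQAB"},
    {"kty": "RSA", "kid": "local-0a1b2c3d", "n": "<stale>", "e": "AQAB"},  # skipped
    {"kid": "no-kty"},  # dropped: missing kty
]})
jwks = build_jwks(local, inline_json=inline)
print([k["kid"] for k in jwks["keys"]])
```

Seeding the seen set with the local kid before iterating the extras is what guarantees a stale mirror entry cannot shadow the signing key.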

flowchart TD
  A["local public JWK<br/>(kid=local-xxxxxxxx)"] --> M{merge by kid}
  B["AUTH_EXTRA_JWKS_JSON<br/>(inline env)"] --> L[load_extra_jwks]
  C["data/extra_jwks.json<br/>(AUTH_EXTRA_JWKS_PATH)"] --> L
  L --> M
  M -->|local first, dedup by kid,<br/>local kid never overridden| S["/.well-known/jwks.json<br/>{ keys: [...] }"]

Provisioning the mirror

Drop the upstream public JWK into the persisted volume (preferred), or pass it inline:

# On the VDS: write the CMS public JWK into the auth-data volume
docker exec -i dataland-auth sh -c 'cat > /app/data/extra_jwks.json' <<'JSON'  # (1)
{ "keys": [ { "kty": "RSA", "use": "sig", "alg": "RS256",
  "kid": "dataland-rs256-1", "n": "<modulus>", "e": "AQAB" } ] }
JSON
docker restart dataland-auth  # (2)

# Verify both keys are now served
curl -fsS http://localhost:9000/.well-known/jwks.json | jq '.keys[].kid'  # (3)
# -> "local-xxxxxxxx"
# -> "dataland-rs256-1"
  1. Writes the file straight into the running container's data/ dir, which is the persisted auth-data volume — so the mirror survives restarts. Only the public JWK goes here (n/e, never d); a JWKS must never carry private material. Entries are kept only if they have both kty and kid (DAT-286 merge rules).
  2. A restart is required: auth_server.py merges the extra-key sources at startup, so the new extra_jwks.json is not picked up until the process re-reads it.
  3. Confirms the merged key set now serves two kids — the local authoritative key first, then the mirrored dataland-rs256-1. Seeing only local-… means the mirror did not load (check the auth.extra_jwks.bad_source event).

| Env var | Default | Purpose |
| --- | --- | --- |
| AUTH_EXTRA_JWKS_PATH | data/extra_jwks.json | File of extra public JWKs to mirror |
| AUTH_EXTRA_JWKS_JSON | (empty) | Inline JSON ({"keys":[…]} or a bare list) of extra public JWKs |

Re-run after a volume wipe or CMS key rotation

extra_jwks.json lives in auth-data. If the volume is wiped, or the CMS rotates dataland-rs256-1, re-provision the mirror. Until you do, the agent falls back to the external endpoint as sole validator and emits the DAT-286 WARNING for every token verified that way.


How the agent verifies tokens

Verification lives in the agent at dataland-agent/app/auth.py. The flow (_verify_access_token_sync):

  1. If AUTH_SKIP=true, the signature and expiry are not checked — the token is decoded unverified and a WARNING is logged. Never enable in production (the boot guard refuses it).
  2. Otherwise, for each URL in settings.auth_jwks_urls (i.e. JWKS_URL followed by every JWKS_URLS entry, de-duplicated, order preserved):
       a. Resolve the signing key via a cached PyJWKClient (jwks_cache_ttl, default 3600 s). On a cache miss the keys are re-fetched once before giving up, which transparently handles key rotation.
       b. jwt.decode(..., algorithms=["RS256"], options={"require": ["exp"], "verify_aud": False}). exp is required; aud is intentionally not checked because CMS tokens scope aud to the mobile client UUID and the agent treats any authenticated caller as a valid consumer.
       c. First URL that verifies wins; the loop breaks.
  3. If verification succeeds on a fallback provider (index > 0), the agent logs:

    JWT accepted by FALLBACK JWKS provider <url> -- local JWKS is missing this signing key; chat auth depends on this external endpoint

    This is the DAT-286 alert signal: the local JWKS is missing the key and should be re-mirrored. Treat it as an actionable warning, not noise.
  4. If no provider verifies, the agent returns 401 Invalid or expired token.
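
The provider loop described above can be sketched with the per-URL verification injected as a callable, so the example runs without PyJWT or a network. Everything named here (verify_with_providers, InvalidToken, fake_verify, the cms.example URL) is a hypothetical stand-in; in the agent the per-URL step would be the PyJWKClient lookup plus jwt.decode.

```python
import logging
from typing import Callable, Sequence

log = logging.getLogger("auth-sketch")


class InvalidToken(Exception):
    """Stands in for the 401 'Invalid or expired token' response."""


def verify_with_providers(
    token: str,
    jwks_urls: Sequence[str],
    verify_one: Callable[[str, str], dict],
) -> dict:
    """Try each JWKS URL in order; the first one that verifies wins."""
    for index, url in enumerate(jwks_urls):
        try:
            payload = verify_one(token, url)  # raises if this provider cannot verify
        except Exception:
            continue  # try the next provider
        if index > 0:
            # A non-primary provider verified the token: the DAT-286 signal.
            log.warning(
                "JWT accepted by FALLBACK JWKS provider %s -- local JWKS is "
                "missing this signing key; chat auth depends on this external endpoint",
                url,
            )
        return payload
    raise InvalidToken("Invalid or expired token")


def fake_verify(token: str, url: str) -> dict:
    # Stand-in for key lookup plus jwt.decode: only the CMS URL knows this key.
    if "cms" not in url:
        raise LookupError("kid not found")
    return {"sub": "visitor-1"}


urls = [
    "http://dataland-auth:9000/.well-known/jwks.json",
    "https://cms.example/jwks.json",  # hypothetical fallback URL
]
claims = verify_with_providers("header.payload.sig", urls, fake_verify)
```

In this run the primary fails and the fallback succeeds, so the warning fires; with the mirror provisioned the primary would return first and nothing would be logged.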

After verification, _normalize_access_payload enforces token shape:

  • token_type, if present, must be access — refresh tokens are rejected with 403.
  • A user identifier is required: user_id if present, else sub. Missing both → 401 Token missing user identifier claim.
  • get_or_create_user auto-creates a local agent user from the claims (user_id, email, full_name, location, profile_photo_url, joined_date, access_permissions, stripe_customer_id) and stashes it on request.state.current_user for the logging middleware.
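
The shape checks in the first two bullets can be sketched like this. It is an assumption-laden illustration, not the body of _normalize_access_payload: the AuthError class, the function name, and the 403 message text are invented for the example, and user creation is left out.

```python
class AuthError(Exception):
    """Carries an HTTP-style status so a caller can map it to a response."""

    def __init__(self, status: int, detail: str) -> None:
        super().__init__(detail)
        self.status = status
        self.detail = detail


def normalize_access_payload(payload: dict) -> dict:
    """Reject non-access tokens and require a user identifier."""
    token_type = payload.get("token_type")
    if token_type is not None and token_type != "access":
        # The 403 status is from the docs; the message wording is an assumption.
        raise AuthError(403, "Only access tokens are accepted")
    # Prefer user_id when present, otherwise fall back to sub.
    user_id = payload.get("user_id") or payload.get("sub")
    if not user_id:
        raise AuthError(401, "Token missing user identifier claim")
    return {**payload, "user_id": str(user_id)}


normalized = normalize_access_payload({"sub": "abc", "email": "curator@example.com"})
print(normalized["user_id"])  # abc
```

Tokens with no token_type claim pass the first check, which matches the "if present" wording: CMS-issued tokens are not required to carry that claim.
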
sequenceDiagram
  participant C as Client
  participant AG as agent /v1/*
  participant AU as dataland-auth JWKS (primary)
  participant CMS as CMS JWKS (fallback)
  C->>AG: Bearer <RS256 JWT>
  AG->>AU: fetch JWKS (cached, TTL 3600s)
  alt kid resolves locally (mirror present)
    AU-->>AG: public key for kid
    AG->>AG: verify sig + exp, normalize claims
    AG-->>C: 200 (SSE / JSON)
  else kid missing locally
    AG->>CMS: fallback fetch
    CMS-->>AG: public key
    AG->>AG: WARN "FALLBACK provider ... chat auth depends on external"
    AG-->>C: 200 (but mirror should be re-provisioned)
  end

Login & signup (argon2id — DAT-145)

Passwords are hashed with argon2id via argon2-cffi defaults (time_cost=3, memory_cost=64 MiB, parallelism=4), landing at ~70–90 ms/hash on the VDS — costly for an attacker, invisible to a user. The previous scheme was sha256(salt + password), a fast GPU-friendly hash that would crack to plaintext in minutes from a leaked table.

  • New rows store an $argon2… string; the salt column is set to the literal "argon2" (argon2 encodes its own salt + parameters in the hash). The salt column is retained only for legacy verification.
  • Verification dispatches by format: $argon2… → argon2id verify; 64-char hex → legacy sha256(salt+password) constant-time compare. Any malformed/corrupt row returns False rather than raising.
  • Opportunistic upgrade: a legacy sha256 row that authenticates successfully is re-hashed to argon2id on that login. New hashes are never written as sha256, so every legacy user migrates on their next successful sign-in.
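
A sketch of the format dispatch follows, with the argon2id verifier passed in as a callable so the example stays within the standard library. The function name verify_password, the (ok, needs_upgrade) return shape, and the UTF-8 encoding of salt + password are assumptions made for illustration; the real server would call argon2-cffi for the first branch.

```python
import hashlib
import hmac
import re
from typing import Callable

_HEX64 = re.compile(r"[0-9a-f]{64}")


def verify_password(
    stored_hash: str,
    salt: str,
    password: str,
    argon2_verify: Callable[[str, str], bool],
) -> tuple:
    """Return (ok, needs_upgrade). Never raises on a malformed row."""
    try:
        if stored_hash.startswith("$argon2"):
            return argon2_verify(stored_hash, password), False
        if _HEX64.fullmatch(stored_hash):
            # Legacy scheme: sha256(salt + password), compared in constant time.
            legacy = hashlib.sha256((salt + password).encode("utf-8")).hexdigest()
            ok = hmac.compare_digest(legacy, stored_hash)
            return ok, ok  # a successful legacy login should be re-hashed
    except Exception:
        pass  # corrupt row (e.g. NULL hash): treat as a failed login
    return False, False


legacy_row = hashlib.sha256(("s3" + "secret").encode("utf-8")).hexdigest()


def reject_all(stored: str, password: str) -> bool:
    # Placeholder for a real argon2id verifier.
    return False


print(verify_password(legacy_row, "s3", "secret", reject_all))  # (True, True)
```

Returning the upgrade flag alongside the result lets the login handler re-hash with argon2id while it still holds the plaintext password, which is the only moment the upgrade is possible.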

Schema (auth_server_users)

Created idempotently at startup (CREATE TABLE IF NOT EXISTS + additive ALTERs). Columns: id, sub (UUID, the JWT subject), email (unique, lower-cased), hash, salt, full_name, location, profile_photo_url, stripe_customer_id, created_at. Rows lacking a sub are backfilled with a fresh UUID on boot.

Issued token claims

_mint_tokens produces two RS256 JWTs signed with the local kid:

| Token | TTL | Notable claims |
| --- | --- | --- |
| access | 1 hour | sub, email, full_name, location, profile_photo_url, stripe_customer_id, joined_date, access_permissions: [], iss, aud, iat, exp, jti |
| refresh | 7 days | token_type: "refresh", sub, iss, aud, iat, exp, jti |

iss defaults to http://dataland-auth:9000 (AUTH_ISSUER); aud defaults to dataland-agent (AUTH_AUDIENCE). Because the agent runs verify_aud=False, the aud value is informational on the agent side.
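
To make the two claim sets concrete, here is a sketch that builds the payloads (without signing them) from a user record. The function name build_claims, the uuid4-based jti, and reading joined_date straight from the user dict are assumptions; only the claim names, TTLs, and defaults come from the table and paragraph above.

```python
import time
import uuid
from typing import Optional

ACCESS_TTL = 60 * 60            # 1 hour, in seconds
REFRESH_TTL = 7 * 24 * 60 * 60  # 7 days, in seconds


def build_claims(
    user: dict,
    issuer: str = "http://dataland-auth:9000",
    audience: str = "dataland-agent",
    now: Optional[int] = None,
) -> tuple:
    """Return (access_claims, refresh_claims) ready to be signed with RS256."""
    now = int(time.time()) if now is None else now
    base = {"sub": user["sub"], "iss": issuer, "aud": audience, "iat": now}
    access = {
        **base,
        "email": user["email"],
        "full_name": user.get("full_name"),
        "location": user.get("location"),
        "profile_photo_url": user.get("profile_photo_url"),
        "stripe_customer_id": user.get("stripe_customer_id"),
        "joined_date": user.get("joined_date"),
        "access_permissions": [],
        "exp": now + ACCESS_TTL,
        "jti": str(uuid.uuid4()),  # assumption: any unique id works here
    }
    refresh = {
        **base,
        "token_type": "refresh",
        "exp": now + REFRESH_TTL,
        "jti": str(uuid.uuid4()),
    }
    return access, refresh


access, refresh = build_claims(
    {"sub": "6f1c0d3e-0000-4000-8000-000000000000", "email": "curator@example.com"},
    now=1_700_000_000,
)
print(access["exp"] - access["iat"], refresh["exp"] - refresh["iat"])  # 3600 604800
```

Only the refresh payload carries token_type, which is what lets the agent reject refresh tokens on the request path while still accepting access tokens that omit the claim.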


API reference

| Method | Path | Auth | Description |
| --- | --- | --- | --- |
| GET | / | none | Login/signup web UI (mints a token, can deep-link to the agent test chat) |
| GET | /.well-known/jwks.json | none | Public JWKS (local key + mirrored extras) |
| POST | /api/auth/signup | none | Create account → { access, refresh, user, … } |
| POST | /api/auth/login | none | Authenticate → { access, refresh, user, … } |
| GET | /api/auth/me | Bearer | Current user record (verifies with the local public key) |
| GET | /metrics | | Prometheus metrics (service="dataland-auth") |

signup requires email + password (min length 4); a duplicate email returns 409. login returns 401 on bad credentials. me verifies the Bearer token against the local public key only (verify_aud=False) and looks the user up by sub.

me here vs. the agent's /v1/auth/me

GET /api/auth/me on dataland-auth (port 9000) verifies against the local key and is for the auth server's own UI/flows. The visitor-facing GET /v1/auth/me lives on the agent (port 4141) and runs the full multi-JWKS verification path. They are different endpoints on different services.


Key env vars

dataland-auth (the issuer/server)

AGENT_URL=https://dataland.chat          # (1)
AUTH_DATABASE_URL=postgresql://dataland:***@dataland-postgres:5432/dataland  # (2)
AUTH_PORT=9000                           # (3)
AUTH_ISSUER=http://dataland-auth:9000    # (4)
AUTH_AUDIENCE=dataland-agent             # (5)
AUTH_EXTRA_JWKS_PATH=/app/data/extra_jwks.json   # (6)
AUTH_EXTRA_JWKS_JSON=                    # (7)
LOG_FILE_DIR=/app/logs
  1. Baked into the web UI's "open agent" deep-link so the minted token can be handed off to the agent's test chat client (?token=…). Cosmetic only — it does not affect token issuance.
  2. The Postgres DSN for the shared dataland database (table auth_server_users). Falls back to DATABASE_URL, then a local-dev default; a postgresql+asyncpg:// form is normalized to postgresql:// for asyncpg, and credentials are masked in the boot banner.
  3. Listen port, default 9000. Mapped to the host via ${AUTH_PUBLIC_PORT:-9000} and probed by the healthcheck at /.well-known/jwks.json.
  4. The iss claim stamped into issued tokens (default). The agent does not enforce it, but it identifies who minted the token.
  5. The aud claim (default dataland-agent). Informational on the agent side because verification runs with verify_aud=False — CMS tokens scope aud to the mobile client UUID instead.
  6. DAT-286 — on-disk file of extra public JWKs to mirror, default data/extra_jwks.json inside the auth-data volume. Loaded after the inline JSON source.
  7. DAT-286 — optional inline mirror ({"keys":[…]} or a bare list). Loaded first, before the on-disk file; useful when you would rather not write a file into the volume.

AUTH_DATABASE_URL falls back to DATABASE_URL, then to a local-dev default. A postgresql+asyncpg:// DSN is normalized to postgresql:// for asyncpg. Credentials are masked (scheme://***@host) in the boot banner.
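
The DSN normalization and credential masking can be sketched as two small helpers. The names normalize_dsn and mask_dsn are hypothetical, and the regex is a simplification that assumes the credentials contain no unescaped "@" or "/".

```python
import re


def normalize_dsn(dsn: str) -> str:
    """Rewrite a SQLAlchemy-style postgresql+asyncpg:// DSN for plain asyncpg."""
    prefix = "postgresql+asyncpg://"
    if dsn.startswith(prefix):
        return "postgresql://" + dsn[len(prefix):]
    return dsn


def mask_dsn(dsn: str) -> str:
    """Replace user:password with *** for log output (scheme://***@host...)."""
    return re.sub(r"^([a-zA-Z][a-zA-Z0-9+.-]*://)[^@/]*@", r"\1***@", dsn)


dsn = normalize_dsn("postgresql+asyncpg://dataland:pw@dataland-postgres:5432/dataland")
print(mask_dsn(dsn))  # postgresql://***@dataland-postgres:5432/dataland
```

A DSN without credentials has no "@" before the host, so the masking pattern does not match and the string is returned unchanged.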

dataland-agent (the verifier) — auth-relevant settings

JWKS_URL=http://dataland-auth:9000/.well-known/jwks.json        # (1)
JWKS_URLS=                                                      # (2)
JWKS_CACHE_TTL=3600                                             # (3)
AUTH_SKIP=false                                                 # (4)
  1. The primary JWKS source, tried first on every verification. Pointing it at dataland-auth (which now mirrors the CMS key per DAT-286) makes the agent verify production chat tokens against a locally-resolvable endpoint. The boot guard refuses to start production if this is empty.
  2. Optional fallback providers, comma-separated or a JSON array, tried in order after JWKS_URL. DAT-143 — keep this empty in production once the mirror is in place; a committed default pointing at staging once let prod silently accept staging-signed tokens. A token that verifies here (index > 0) triggers the DAT-286 FALLBACK warning.
  3. PyJWKClient cache lifetime in seconds (default 3600). On a cache miss the keys are re-fetched once before giving up, which is how transparent key rotation works.
  4. Bypass switch — when true, neither signature nor exp is checked and the token is decoded unverified (with a WARNING). NEVER true in production; assert_boot_required_env() crash-loops the agent if it is set under APP_ENV=production.
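
How JWKS_URL and JWKS_URLS combine into the ordered provider list might look like the sketch below. The function name resolve_jwks_urls and the handling of a malformed JSON array are assumptions, and the example.com-style URLs are placeholders; the accepted formats (comma-separated or JSON array) and the dedup-with-order rule come from the notes above.

```python
import json


def resolve_jwks_urls(jwks_url: str, jwks_urls: str) -> list:
    """JWKS_URL first, then each JWKS_URLS entry, de-duplicated, order preserved."""
    raw = jwks_urls.strip()
    if raw.startswith("["):
        try:
            parsed = json.loads(raw)
        except ValueError:
            parsed = []  # assumption: a malformed JSON array contributes nothing
        extras = [str(item) for item in parsed] if isinstance(parsed, list) else []
    else:
        extras = raw.split(",")
    ordered = []
    for url in [jwks_url, *extras]:
        url = url.strip()
        if url and url not in ordered:
            ordered.append(url)
    return ordered


primary = "http://dataland-auth:9000/.well-known/jwks.json"
print(resolve_jwks_urls(primary, ""))
print(resolve_jwks_urls(primary, f'["{primary}", "https://backup.example/jwks.json"]'))
```

With JWKS_URLS empty, the list has a single entry, so no token can ever be accepted by a fallback provider; that is the production posture recommended here.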

Keep JWKS_URLS empty in production once the mirror is in place (DAT-143)

A committed default pointing at staging once caused prod to silently accept staging-signed tokens (including self-issued ones). With the CMS key mirrored into dataland-auth (DAT-286), production verifies entirely against the local primary; leave JWKS_URLS empty unless you deliberately need a second issuer. The agent's infra .env.example ships JWKS_URLS= (commented example only). Note the agent's own docker-compose.yml (standalone dev) still defaults JWKS_URLS to the CMS staging endpoint — the production infra compose.yml does not set it, deferring to .env.


Production boot guard & deploy gate (DAT-140 / DAT-291)

The agent refuses to boot in production with broken auth. production_required_env_issues() (in app/runtime.py) flags:

  • JWKS_URL is empty; tokens cannot be verified in production
  • AUTH_SKIP is true; production deploys must verify JWTs

assert_boot_required_env() raises RuntimeError (crash-loop) on any such issue when APP_ENV=production; non-production stays warn-only.
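
A minimal sketch of the guard's logic is shown below. The function names match the ones mentioned above, but the signatures (taking an env dict), the set of values treated as "true", and the warning output are assumptions made so the example is self-contained; the real code lives in app/runtime.py.

```python
def production_required_env_issues(env: dict) -> list:
    """Collect human-readable problems that would break auth in production."""
    issues = []
    if not env.get("JWKS_URL", "").strip():
        issues.append("JWKS_URL is empty; tokens cannot be verified in production")
    # Assumption: these spellings count as "true".
    if env.get("AUTH_SKIP", "").strip().lower() in {"1", "true", "yes"}:
        issues.append("AUTH_SKIP is true; production deploys must verify JWTs")
    return issues


def assert_boot_required_env(env: dict) -> None:
    """Raise in production, warn elsewhere."""
    issues = production_required_env_issues(env)
    if not issues:
        return
    if env.get("APP_ENV") == "production":
        raise RuntimeError("; ".join(issues))
    for issue in issues:
        print(f"WARNING: {issue}")


bad_env = {"APP_ENV": "production", "JWKS_URL": "", "AUTH_SKIP": "true"}
good_env = {
    "APP_ENV": "production",
    "JWKS_URL": "http://dataland-auth:9000/.well-known/jwks.json",
    "AUTH_SKIP": "false",
}
print(production_required_env_issues(bad_env))
assert_boot_required_env(good_env)  # passes silently
```

Keeping issue collection separate from the raise is what allows a deploy script to run the same check against a candidate .env before any container is rebuilt.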

DAT-291 runs this exact guard before a rebuild: deploy.sh executes assert_boot_required_env() from the current dataland/agent:latest image against the new .env. A failure aborts the deploy ("ABORT: .env failed the agent boot guard") rather than shipping a container that crash-loops and takes chat offline. The check is skipped only on the very first deploy (no image yet).


Reaching it

# JWKS (host):
curl -fsS http://localhost:9000/.well-known/jwks.json | jq  # (1)

# From inside the network (agent's perspective):
curl -fsS http://dataland-auth:9000/.well-known/jwks.json  # (2)

# Mint a token, then verify the agent accepts it:
TOKEN=$(curl -fsS -X POST http://localhost:9000/api/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"email":"curator@example.com","password":"secret"}' | jq -r .access)  # (3)

curl -fsS http://localhost:9000/api/auth/me -H "Authorization: Bearer $TOKEN" | jq  # (4)

# Confirm the agent verifies the same token against its JWKS chain:
curl -fsS https://dataland.chat/v1/auth/me -H "Authorization: Bearer $TOKEN" | jq  # (5)
  1. Hits the public JWKS over the host-mapped port. After provisioning the mirror you should see two kids here — the local-… key and dataland-rs256-1.
  2. The same endpoint via the in-network service name, which is exactly the URL the agent uses for JWKS_URL. Use this to confirm the agent can resolve the primary from inside dataland-network.
  3. Extracts the access token (1 h TTL) from a login response. jq -r strips the quotes so $TOKEN is the raw JWT; the refresh token in the same response is ignored here.
  4. Verifies against dataland-auth's local public key only (verify_aud=False), looking the user up by sub. This is the auth server's own me, not the agent's.
  5. Exercises the agent's /v1/auth/me (port 4141), which runs the full multi-JWKS chain. If this succeeds while the agent logs a FALLBACK warning, the local mirror is missing the kid and should be re-provisioned.

Volumes

auth-data:/app/data   <- RSA private key, kid, extra_jwks.json  (NEVER delete casually)  # (1)
/app/logs             <- host: ${DATALAND_LOG_DIR}/auth/  # (2)
  1. The named volume that persists all signing material: auth_rsa_private.pem, auth_rsa_kid.txt, and the DAT-286 extra_jwks.json. Wiping it (e.g. an aggressive reset-stack.sh) invalidates every issued JWT and drops the CMS mirror. If that happens, restore the volume by following the runbook in dataland-infrastructure/reports/backup-restore.md before restarting.
  2. Log directory, bind-mounted to ${DATALAND_LOG_DIR}/auth/ on the host so auth logs survive container recreation and sit alongside the other services' logs.

Operational notes & gotchas

Dropping auth-data invalidates every issued token

Dropping the volume loses auth_rsa_private.pem, auth_rsa_kid.txt, and extra_jwks.json. Every previously issued JWT fails verification and the DAT-286 mirror disappears. Restore from backup before restarting.

The FALLBACK JWKS provider warning means re-mirror the key

If the agent logs that a fallback provider validated a token, the local JWKS is missing that kid. Re-provision data/extra_jwks.json so verification returns to the local primary and the external CMS endpoint is back to pure-backup status.

Legacy passwords self-heal

Pre-DAT-145 sha256 rows keep working and upgrade to argon2id on the next successful login. No bulk migration is needed — just expect a brief, one-time UPDATE per legacy user on their first sign-in after the rollover.

See also

  • Agent — consumes these tokens; full multi-JWKS verification path and /v1/auth/me.
  • RAG — separate API_KEY bearer auth, unrelated to JWKS.