Auth: RS256 JWT issuance, JWKS, and the CMS-key mirror
dataland-auth is the local identity provider that issues and serves the signing material the rest of the stack trusts. It runs as a separate container but lives inside the agent repo — it is the auth_server.py entrypoint of dataland-agent, sharing the same image, Postgres database, and data/ volume layout.
Two distinct roles converge here:
- Issuer: dataland-auth signs RS256 JWTs (access + refresh) for accounts it stores in Postgres (argon2id-hashed passwords), and exposes a standard JWKS endpoint so any verifier can validate tokens without ever talking back to the issuer.
- JWKS mirror (DAT-286): the same JWKS endpoint also serves the external CMS / mobile-backend public key (kid=dataland-rs256-1), so the agent can validate production-issued chat tokens locally, removing the chat-auth single point of failure.
Visitor identity for the museum chat flow is handled upstream by the CMS / mobile backend (dataland-cms-staging-…a.run.app), which signs the real production tokens. dataland-auth is the agent's primary, locally-resolvable JWKS source; the external CMS endpoint is demoted to a pure backup. See Agent for how tokens are consumed on the request path.
| Container | dataland-auth |
| Image | dataland/auth-server:${IMAGE_TAG} (built from dataland-agent/Dockerfile) |
| Command | uv run python auth_server.py |
| Public port | 9000 (host ${AUTH_PUBLIC_PORT:-9000} → container 9000) |
| Internal URL | http://dataland-auth:9000 |
| Memory / CPU | 256m / 0.5 core |
| Healthcheck | GET /.well-known/jwks.json (interval 10s, timeout 5s, 3 retries, 5s start) |
| Depends on | postgres: service_healthy |
| Repo | dataland-agent (auth_server.py) |
Recent changes (DAT-286, DAT-145, DAT-291)
- DAT-286: the local JWKS now mirrors the CMS signing key (kid=dataland-rs256-1) via data/extra_jwks.json, so the agent validates chat tokens locally instead of depending on the external CMS endpoint as the sole validator. The agent emits a WARNING whenever a fallback (non-primary) JWKS provider is the only one that could verify a token.
- DAT-145: passwords are hashed with argon2id (was sha256(salt+password)). Legacy rows verify and silently upgrade on next login.
- DAT-291: deploy.sh runs the agent's boot guard against .env before rebuild; an empty JWKS_URL or AUTH_SKIP=true in production aborts the deploy.
What it does
- Issues RS256 JWTs: POST /api/auth/signup and POST /api/auth/login return an access token (1 h TTL) and a refresh token (7 day TTL), both signed with the local RSA private key under the header kid.
- Serves JWKS at GET /.well-known/jwks.json: the local public key plus any mirrored external public keys (DAT-286).
- Stores accounts in Postgres table auth_server_users (the shared dataland database, via AUTH_DATABASE_URL). Passwords are argon2id (DAT-145).
- Hosts a dark-themed web UI at /: a login/signup form that mints a token and can deep-link into the agent's test chat client with ?token=….
- GET /api/auth/me: returns the current user record for a valid Bearer token.
Auth issuance vs. token verification live in two places
dataland-auth (auth_server.py) signs and serves keys. The agent (app/auth.py) verifies tokens against one or more JWKS URLs. They share the repo and the data/ volume but are different processes with different jobs. Do not conflate the auth server's private key (signing) with the agent's verification path (public-key only).
Architecture
graph LR
subgraph local["Docker network: dataland-network"]
AU["dataland-auth<br/>auth_server.py :9000"]
AG["dataland-agent<br/>app/auth.py :4141"]
PG[("postgres<br/>auth_server_users + agent users")]
end
CMS["CMS / mobile backend<br/>dataland-cms-staging…a.run.app<br/>kid=dataland-rs256-1"]
AU -->|asyncpg pool| PG
AG -.->|"JWKS_URL (primary)"| AU
AG -.->|"JWKS_URLS (fallback, backup)"| CMS
CMS -.->|"public key mirrored into<br/>data/extra_jwks.json (DAT-286)"| AU
Hard dependency: depends_on: postgres: service_healthy.
RSA signing keys (local, persisted)
On first boot auth_server.py generates a 2048-bit RSA key pair and a random kid of the form local-<8hex>, then persists them to the data/ directory so issued tokens survive restarts:
data/auth_rsa_private.pem <- PKCS8 private key (signing material — secret) # (1)!
data/auth_rsa_kid.txt <- the kid stamped into every token header # (2)!
data/extra_jwks.json <- (optional) mirrored external public JWKs (DAT-286) # (3)!
- The 2048-bit RSA private key, generated once on first boot and never re-derived. This is the only secret that can sign tokens; losing it (e.g. wiping auth-data) invalidates every previously issued JWT.
- The local-<8hex> key id stamped into every token header so verifiers can pick the right public key. It is also the kid that the dedup logic protects: an extra mirrored key reusing this kid is always skipped.
- DAT-286: the mirrored CMS public key (kid=dataland-rs256-1) that lets the agent verify production chat tokens locally. Optional, but without it the external CMS endpoint becomes the sole validator (the SPOF this feature removes).
In production these live in the auth-data named volume (auth-data:/app/data).
Do not delete auth-data
The signing key lives in this volume. Blow it away (e.g. an aggressive reset-stack.sh) and every previously issued JWT becomes invalid — every active staff session and any locally-minted token dies, and the agent re-fetches a brand-new kid it has never seen. The backup/restore runbook is in dataland-infrastructure/reports/backup-restore.md; read it first. Wiping extra_jwks.json re-introduces the DAT-286 SPOF until re-provisioned.
The public key is exported as a JWK with kid, use="sig", alg="RS256".
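As a rough illustration of that export, the sketch below builds a public JWK from an RSA modulus and exponent using only the standard library. The function names (b64url_uint, new_local_kid, public_jwk) are hypothetical and not taken from auth_server.py; the real server presumably derives n and e from the persisted key with a crypto library.

```python
import base64
import secrets


def b64url_uint(value: int) -> str:
    """Encode a non-negative integer as unpadded base64url, as JWK members n and e require."""
    length = max(1, (value.bit_length() + 7) // 8)
    raw = value.to_bytes(length, "big")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def new_local_kid() -> str:
    """Random key id of the form local-<8hex> (4 random bytes)."""
    return f"local-{secrets.token_hex(4)}"


def public_jwk(n: int, e: int, kid: str) -> dict:
    """Build a public RSA JWK. Only n and e are exported, never the private exponent d."""
    return {
        "kty": "RSA",
        "use": "sig",
        "alg": "RS256",
        "kid": kid,
        "n": b64url_uint(n),
        "e": b64url_uint(e),
    }
```

With the common public exponent 65537 the e member encodes to AQAB, which is the value shown in the provisioning example later on this page.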
The JWKS endpoint and the CMS-key mirror (DAT-286)
The problem it solves
The agent verifies a token by trying each configured JWKS URL in order (JWKS_URL first, then each entry of JWKS_URLS). Production chat tokens are signed by the CMS backend with kid=dataland-rs256-1. With that key absent from the local JWKS, the only place the agent can resolve it is the external CMS endpoint — making that external endpoint the sole source of the verification key for all chat auth. If the CMS endpoint blips, every chat request 401s.
The fix
dataland-auth serves the external public key alongside its own. A JWKS only ever exposes public material — never the private signing key — so mirroring it is safe. With the key present locally, chat auth resolves against dataland-auth (the primary), and the external endpoint becomes a pure backup.
How the served key set is built
auth_server.py merges keys at startup:
- The local public JWK is always first and authoritative.
- Extra public JWKs are loaded from inline env JSON first (AUTH_EXTRA_JWKS_JSON), then the on-disk file (AUTH_EXTRA_JWKS_PATH, default data/extra_jwks.json).
- Entries are accepted only if they carry both kty and kid; everything else is dropped as noise.
- Keys are deduplicated by kid: an extra key reusing the local kid is skipped, so a stale mirror can never shadow the key this server actually signs with.
- A malformed or missing extra-key source yields zero extra keys, logged via the auth.extra_jwks.bad_source observability event; it never raises. A bad mirror file must not take auth (and thus all chat) down.
When extras load successfully, the server emits auth.extra_jwks.loaded with the count and the mirrored kids.
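The merge rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in auth_server.py: load_extra_jwks is the name used in the flowchart, but its signature here, the merge_jwks helper, and the log format are assumptions.

```python
import json
import logging
from pathlib import Path

log = logging.getLogger("auth")


def _parse_keys(raw: str) -> list:
    """Accept {"keys": [...]} or a bare list; any other shape yields no keys."""
    data = json.loads(raw)
    if isinstance(data, dict):
        data = data.get("keys", [])
    return data if isinstance(data, list) else []


def _usable(key) -> bool:
    """Keep only objects that carry both kty and a string kid."""
    return isinstance(key, dict) and bool(key.get("kty")) and isinstance(key.get("kid"), str) and bool(key["kid"])


def load_extra_jwks(inline_json: str, path: str) -> list:
    """Collect extra public JWKs, inline source first, then the file. Never raises."""
    extras = []
    sources = [
        ("env", lambda: inline_json),
        ("file", lambda: Path(path).read_text(encoding="utf-8")),
    ]
    for name, read in sources:
        try:
            raw = read()
            if raw and raw.strip():
                extras.extend(_parse_keys(raw))
        except (OSError, ValueError) as exc:  # missing file or malformed JSON
            log.warning("auth.extra_jwks.bad_source source=%s error=%s", name, exc)
    return [key for key in extras if _usable(key)]


def merge_jwks(local_jwk: dict, extras: list) -> dict:
    """Local key first; dedup by kid so the local kid is never overridden."""
    keys = [local_jwk]
    seen = {local_jwk["kid"]}
    for key in extras:
        if key["kid"] in seen:
            continue
        seen.add(key["kid"])
        keys.append(key)
    return {"keys": keys}
```

Seeding the seen set with the local kid is what guarantees that a stale mirror entry reusing that kid is skipped rather than served.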
flowchart TD
A["local public JWK<br/>(kid=local-xxxxxxxx)"] --> M{merge by kid}
B["AUTH_EXTRA_JWKS_JSON<br/>(inline env)"] --> L[load_extra_jwks]
C["data/extra_jwks.json<br/>(AUTH_EXTRA_JWKS_PATH)"] --> L
L --> M
M -->|local first, dedup by kid,<br/>local kid never overridden| S["/.well-known/jwks.json<br/>{ keys: [...] }"]
Provisioning the mirror
Drop the upstream public JWK into the persisted volume (preferred), or pass it inline:
# On the VDS: write the CMS public JWK into the auth-data volume
docker exec -i dataland-auth sh -c 'cat > /app/data/extra_jwks.json' <<'JSON' # (1)!
{ "keys": [ { "kty": "RSA", "use": "sig", "alg": "RS256",
"kid": "dataland-rs256-1", "n": "<modulus>", "e": "AQAB" } ] }
JSON
docker restart dataland-auth # (2)!
# Verify both keys are now served
curl -fsS http://localhost:9000/.well-known/jwks.json | jq '.keys[].kid' # (3)!
# -> "local-xxxxxxxx"
# -> "dataland-rs256-1"
- Writes the file straight into the running container's data/ dir, which is the persisted auth-data volume, so the mirror survives restarts. Only the public JWK goes here (n/e, never d); a JWKS must never carry private material. Entries are kept only if they have both kty and kid (DAT-286 merge rules).
- A restart is required: auth_server.py merges the extra-key sources at startup, so the new extra_jwks.json is not picked up until the process re-reads it.
- Confirms the merged key set now serves two kid values: the local authoritative key first, then the mirrored dataland-rs256-1. Seeing only local-… means the mirror did not load (check the auth.extra_jwks.bad_source event).
| Env var | Default | Purpose |
|---|---|---|
| AUTH_EXTRA_JWKS_PATH | data/extra_jwks.json | File of extra public JWKs to mirror |
| AUTH_EXTRA_JWKS_JSON | (empty) | Inline JSON ({"keys":[…]} or a bare list) of extra public JWKs |
Re-run after a volume wipe or CMS key rotation
extra_jwks.json lives in auth-data. If the volume is wiped, or the CMS rotates dataland-rs256-1, re-provision the mirror — otherwise the agent silently falls back to the external endpoint as sole validator and starts emitting the DAT-286 WARNING.
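Before re-provisioning, it can help to check that the file you are about to write carries public material only. The helper below is a hypothetical pre-flight check, not part of the repo; it flags entries missing kty or kid and any private RSA members defined by the JWA specification (d, p, q, dp, dq, qi, oth).

```python
import json

# Private RSA parameters per RFC 7518; none of these belong in a JWKS.
PRIVATE_RSA_MEMBERS = {"d", "p", "q", "dp", "dq", "qi", "oth"}


def check_public_jwks(raw: str) -> list:
    """Return a list of problems; an empty list means the document looks safe to mirror."""
    data = json.loads(raw)
    keys = data.get("keys", []) if isinstance(data, dict) else data
    if not isinstance(keys, list):
        return ["top level must be an object with a keys array, or a bare list"]
    problems = []
    for i, key in enumerate(keys):
        if not isinstance(key, dict):
            problems.append(f"entry {i}: not an object")
            continue
        for field in ("kty", "kid"):
            if not key.get(field):
                problems.append(f"entry {i}: missing {field}")
        leaked = sorted(PRIVATE_RSA_MEMBERS.intersection(key))
        if leaked:
            problems.append(f"entry {i}: private members present: {', '.join(leaked)}")
    return problems
```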
How the agent verifies tokens
Verification lives in the agent at dataland-agent/app/auth.py. The flow (_verify_access_token_sync):
1. If AUTH_SKIP=true, the signature and expiry are not checked: the token is decoded unverified and a WARNING is logged. Never enable in production (the boot guard refuses it).
2. Otherwise, for each URL in settings.auth_jwks_urls (i.e. JWKS_URL followed by every JWKS_URLS entry, de-duplicated, order preserved):
   - Resolve the signing key via a cached PyJWKClient (jwks_cache_ttl, default 3600 s). On a cache miss the keys are re-fetched once before giving up; this transparently handles key rotation.
   - jwt.decode(..., algorithms=["RS256"], options={"require": ["exp"], "verify_aud": False}). exp is required; aud is intentionally not checked because CMS tokens scope aud to the mobile client UUID and the agent treats any authenticated caller as a valid consumer.
   - First URL that verifies wins; the loop breaks.
3. If verification succeeds on a fallback provider (index > 0), the agent logs:
   JWT accepted by FALLBACK JWKS provider <url> -- local JWKS is missing this signing key; chat auth depends on this external endpoint
   This is the DAT-286 alert signal: the local JWKS is missing the key and should be re-mirrored. Treat it as an actionable warning, not noise.
4. If no provider verifies, the agent returns 401 Invalid or expired token.
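The ordered-fallback part of this flow can be sketched without any JWT library by injecting the per-provider check. In the sketch below, verify_one is a hypothetical callable standing in for the PyJWKClient lookup plus jwt.decode call described above; verify_with_chain and TokenRejected are illustrative names, not the agent's.

```python
import logging
from typing import Callable, Sequence

log = logging.getLogger("agent.auth")


class TokenRejected(Exception):
    """Raised when no configured JWKS provider verifies the token (maps to a 401)."""


def verify_with_chain(
    token: str,
    jwks_urls: Sequence[str],
    verify_one: Callable[[str, str], dict],
) -> dict:
    """Try each JWKS URL in order; the first provider that verifies wins."""
    for index, url in enumerate(jwks_urls):
        try:
            claims = verify_one(token, url)
        except Exception:  # bad signature, expired, unknown kid, network error
            continue
        if index > 0:
            # DAT-286 signal: only a non-primary provider knew this key.
            log.warning(
                "JWT accepted by FALLBACK JWKS provider %s -- local JWKS is "
                "missing this signing key; chat auth depends on this external endpoint",
                url,
            )
        return claims
    raise TokenRejected("Invalid or expired token")
```

Keeping the provider check injectable also makes the fallback warning easy to unit-test without a live JWKS endpoint.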
After verification, _normalize_access_payload enforces token shape:
- token_type, if present, must be access; refresh tokens are rejected with 403.
- A user identifier is required: user_id if present, else sub. Missing both → 401 Token missing user identifier claim.
- get_or_create_user auto-creates a local agent user from the claims (user_id, email, full_name, location, profile_photo_url, joined_date, access_permissions, stripe_customer_id) and stashes it on request.state.current_user for the logging middleware.
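The shape checks can be illustrated as a small pure function. This is a sketch under assumptions: the AuthError class and the 403 message text are placeholders, and the real _normalize_access_payload may return a different structure.

```python
class AuthError(Exception):
    """Carries the HTTP status the request handler should return."""

    def __init__(self, status: int, detail: str):
        super().__init__(detail)
        self.status = status
        self.detail = detail


def normalize_access_payload(payload: dict) -> dict:
    """Reject refresh tokens and require a user identifier (user_id, else sub)."""
    token_type = payload.get("token_type")
    if token_type is not None and token_type != "access":
        # Placeholder message; only the 403 status comes from the description above.
        raise AuthError(403, "Refresh tokens cannot be used for API access")
    user_id = payload.get("user_id") or payload.get("sub")
    if not user_id:
        raise AuthError(401, "Token missing user identifier claim")
    return {**payload, "user_id": str(user_id)}
```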
sequenceDiagram
participant C as Client
participant AG as agent /v1/*
participant AU as dataland-auth JWKS (primary)
participant CMS as CMS JWKS (fallback)
C->>AG: Bearer <RS256 JWT>
AG->>AU: fetch JWKS (cached, TTL 3600s)
alt kid resolves locally (mirror present)
AU-->>AG: public key for kid
AG->>AG: verify sig + exp, normalize claims
AG-->>C: 200 (SSE / JSON)
else kid missing locally
AG->>CMS: fallback fetch
CMS-->>AG: public key
AG->>AG: WARN "FALLBACK provider ... chat auth depends on external"
AG-->>C: 200 (but mirror should be re-provisioned)
end
Login & signup (argon2id, DAT-145)
Passwords are hashed with argon2id via argon2-cffi defaults (time_cost=3, memory_cost=64 MiB, parallelism=4), landing at ~70–90 ms/hash on the VDS — costly for an attacker, invisible to a user. The previous scheme was sha256(salt + password), a fast GPU-friendly hash that would crack to plaintext in minutes from a leaked table.
- New rows store an $argon2… string; the salt column is set to the literal "argon2" (argon2 encodes its own salt + parameters in the hash). The salt column is retained only for legacy verification.
- Verification dispatches by format: $argon2… → argon2id verify; 64-char hex → legacy sha256(salt+password) constant-time compare. Any malformed/corrupt row returns False rather than raising.
- Opportunistic upgrade: a legacy sha256 row that authenticates successfully is re-hashed to argon2id on that login. No new logins ever land sha256, so every user migrates on next sign-in.
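A minimal sketch of the format dispatch follows. To keep it standard-library only, the argon2id branch is delegated to an injected callable; argon2_verify is a hypothetical wrapper that would call argon2-cffi's PasswordHasher.verify and return False on a mismatch. The function name and the (ok, needs_upgrade) return shape are assumptions for illustration.

```python
import hashlib
import hmac
import re
from typing import Callable

_LEGACY_HEX = re.compile(r"[0-9a-f]{64}")


def verify_password(
    password: str,
    stored_hash: str,
    salt: str,
    argon2_verify: Callable[[str, str], bool],
) -> tuple:
    """Return (ok, needs_upgrade); malformed rows yield (False, False)."""
    try:
        if stored_hash.startswith("$argon2"):
            return bool(argon2_verify(stored_hash, password)), False
        if _LEGACY_HEX.fullmatch(stored_hash.lower()):
            digest = hashlib.sha256((salt + password).encode("utf-8")).hexdigest()
            ok = hmac.compare_digest(digest, stored_hash.lower())
            return ok, ok  # a matching legacy row should be re-hashed to argon2id
    except Exception:
        pass  # corrupt row: treat as a failed login, never raise
    return False, False
```

The caller would re-hash and UPDATE the row only when needs_upgrade is true, which is the opportunistic migration described above.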
Schema (auth_server_users)
Created idempotently at startup (CREATE TABLE IF NOT EXISTS + additive ALTERs). Columns: id, sub (UUID, the JWT subject), email (unique, lower-cased), hash, salt, full_name, location, profile_photo_url, stripe_customer_id, created_at. Rows lacking a sub are backfilled with a fresh UUID on boot.
Issued token claims
_mint_tokens produces two RS256 JWTs signed with the local kid:
| Token | TTL | Notable claims |
|---|---|---|
| access | 1 hour | sub, email, full_name, location, profile_photo_url, stripe_customer_id, joined_date, access_permissions: [], iss, aud, iat, exp, jti |
| refresh | 7 days | token_type: "refresh", sub, iss, aud, iat, exp, jti |
iss defaults to http://dataland-auth:9000 (AUTH_ISSUER); aud defaults to dataland-agent (AUTH_AUDIENCE). Because the agent runs verify_aud=False, the aud value is informational on the agent side.
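For orientation, the two claim sets in the table could be assembled roughly as below before signing. This is an illustrative sketch, not _mint_tokens itself: build_claims and the shape of the user dict are assumptions, and the RS256 signing step is left out.

```python
import time
import uuid

ACCESS_TTL = 60 * 60            # 1 hour
REFRESH_TTL = 7 * 24 * 60 * 60  # 7 days


def build_claims(user: dict, issuer: str, audience: str, now=None) -> tuple:
    """Return (access_claims, refresh_claims) ready to be signed."""
    now = int(time.time()) if now is None else int(now)
    base = {"sub": user["sub"], "iss": issuer, "aud": audience, "iat": now}
    access = {
        **base,
        "email": user["email"],
        "full_name": user.get("full_name"),
        "location": user.get("location"),
        "profile_photo_url": user.get("profile_photo_url"),
        "stripe_customer_id": user.get("stripe_customer_id"),
        "joined_date": user.get("joined_date"),
        "access_permissions": [],
        "exp": now + ACCESS_TTL,
        "jti": str(uuid.uuid4()),
    }
    refresh = {
        **base,
        "token_type": "refresh",
        "exp": now + REFRESH_TTL,
        "jti": str(uuid.uuid4()),
    }
    return access, refresh
```

Only the refresh token carries token_type, which is what lets the agent reject it on the API path.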
API reference
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | / | none | Login/signup web UI (mints a token, can deep-link to the agent test chat) |
| GET | /.well-known/jwks.json | none | Public JWKS (local key + mirrored extras) |
| POST | /api/auth/signup | none | Create account → { access, refresh, user, … } |
| POST | /api/auth/login | none | Authenticate → { access, refresh, user, … } |
| GET | /api/auth/me | Bearer | Current user record (verifies with the local public key) |
| GET | /metrics | - | Prometheus metrics (service="dataland-auth") |
signup requires email + password (min length 4); a duplicate email returns 409. login returns 401 on bad credentials. me verifies the Bearer token against the local public key only (verify_aud=False) and looks the user up by sub.
me here vs. the agent's /v1/auth/me
GET /api/auth/me on dataland-auth (port 9000) verifies against the local key and is for the auth server's own UI/flows. The visitor-facing GET /v1/auth/me lives on the agent (port 4141) and runs the full multi-JWKS verification path. They are different endpoints on different services.
Key env vars
dataland-auth (the issuer/server)
AGENT_URL=https://dataland.chat # (1)!
AUTH_DATABASE_URL=postgresql://dataland:***@dataland-postgres:5432/dataland # (2)!
AUTH_PORT=9000 # (3)!
AUTH_ISSUER=http://dataland-auth:9000 # (4)!
AUTH_AUDIENCE=dataland-agent # (5)!
AUTH_EXTRA_JWKS_PATH=/app/data/extra_jwks.json # (6)!
AUTH_EXTRA_JWKS_JSON= # (7)!
LOG_FILE_DIR=/app/logs
- Baked into the web UI's "open agent" deep-link so the minted token can be handed off to the agent's test chat client (?token=…). Cosmetic only; it does not affect token issuance.
- The Postgres DSN for the shared dataland database (table auth_server_users). Falls back to DATABASE_URL, then a local-dev default; a postgresql+asyncpg:// form is normalized to postgresql:// for asyncpg, and credentials are masked in the boot banner.
- Listen port, default 9000. Mapped to the host via ${AUTH_PUBLIC_PORT:-9000} and probed by the healthcheck at /.well-known/jwks.json.
- The iss claim stamped into issued tokens (default http://dataland-auth:9000). The agent does not enforce it, but it identifies who minted the token.
- The aud claim (default dataland-agent). Informational on the agent side because verification runs with verify_aud=False; CMS tokens scope aud to the mobile client UUID instead.
- DAT-286: on-disk file of extra public JWKs to mirror, default data/extra_jwks.json inside the auth-data volume. Loaded after the inline JSON source.
- DAT-286: optional inline mirror ({"keys":[…]} or a bare list). Loaded first, before the on-disk file; useful when you would rather not write a file into the volume.
AUTH_DATABASE_URL falls back to DATABASE_URL, then to a local-dev default. A postgresql+asyncpg:// DSN is normalized to postgresql:// for asyncpg. Credentials are masked (scheme://***@host) in the boot banner.
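The normalization and masking steps are simple string transforms. The sketch below shows one way to write them; normalize_dsn and mask_dsn are hypothetical names, and the regex assumes the credentials contain no unescaped @ or / characters.

```python
import re


def normalize_dsn(dsn: str) -> str:
    """asyncpg expects postgresql://, not the SQLAlchemy-style postgresql+asyncpg://."""
    prefix = "postgresql+asyncpg://"
    if dsn.startswith(prefix):
        return "postgresql://" + dsn[len(prefix):]
    return dsn


def mask_dsn(dsn: str) -> str:
    """Replace user:password with *** so a banner shows scheme://***@host."""
    return re.sub(r"^([a-zA-Z][a-zA-Z0-9+.\-]*://)[^@/]*@", r"\1***@", dsn)
```

A DSN without credentials passes through mask_dsn unchanged, so it is safe to apply unconditionally before logging.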
dataland-agent (the verifier): auth-relevant settings
JWKS_URL=http://dataland-auth:9000/.well-known/jwks.json # (1)!
JWKS_URLS= # (2)!
JWKS_CACHE_TTL=3600 # (3)!
AUTH_SKIP=false # (4)!
- The primary JWKS source, tried first on every verification. Pointing it at dataland-auth (which now mirrors the CMS key per DAT-286) makes the agent verify production chat tokens against a locally-resolvable endpoint. The boot guard refuses to start production if this is empty.
- Optional fallback providers, comma-separated or a JSON array, tried in order after JWKS_URL. DAT-143: keep this empty in production once the mirror is in place; a committed default pointing at staging once let prod silently accept staging-signed tokens. A token that verifies here (index > 0) triggers the DAT-286 FALLBACK warning.
- PyJWKClient cache lifetime in seconds (default 3600). On a cache miss the keys are re-fetched once before giving up, which is how transparent key rotation works.
- Bypass switch: when true, neither signature nor exp is checked and the token is decoded unverified (with a WARNING). NEVER true in production; assert_boot_required_env() crash-loops the agent if it is set under APP_ENV=production.
Keep JWKS_URLS empty in production once the mirror is in place (DAT-143)
A committed default pointing at staging once caused prod to silently accept staging-signed tokens (including self-issued ones). With the CMS key mirrored into dataland-auth (DAT-286), production verifies entirely against the local primary; leave JWKS_URLS empty unless you deliberately need a second issuer. The agent's infra .env.example ships JWKS_URLS= (commented example only). Note the agent's own docker-compose.yml (standalone dev) still defaults JWKS_URLS to the CMS staging endpoint — the production infra compose.yml does not set it, deferring to .env.
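How JWKS_URL and JWKS_URLS combine into one ordered, de-duplicated list can be sketched as follows. parse_jwks_urls is a hypothetical helper; the agent exposes the result as settings.auth_jwks_urls, and its actual parsing code may differ.

```python
import json


def parse_jwks_urls(primary: str, extra: str) -> list:
    """JWKS_URL first, then JWKS_URLS (comma-separated or JSON array), de-duplicated."""
    extra = (extra or "").strip()
    if extra.startswith("["):
        candidates = json.loads(extra)  # JSON array form
    else:
        candidates = extra.split(",")   # comma-separated form
    ordered = []
    for url in [primary or "", *candidates]:
        url = str(url).strip()
        if url and url not in ordered:
            ordered.append(url)
    return ordered
```

With JWKS_URLS left empty, the list has exactly one entry, so no token can ever be accepted by a fallback provider.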
Production boot guard & deploy gate (DAT-140 / DAT-291)
The agent refuses to boot in production with broken auth. production_required_env_issues() (in app/runtime.py) flags:
- JWKS_URL is empty; tokens cannot be verified in production
- AUTH_SKIP is true; production deploys must verify JWTs
assert_boot_required_env() raises RuntimeError (crash-loop) on any such issue when APP_ENV=production; non-production stays warn-only.
DAT-291 runs this exact guard before a rebuild: deploy.sh executes assert_boot_required_env() from the current dataland/agent:latest image against the new .env. A failure aborts the deploy ("ABORT: .env failed the agent boot guard") rather than shipping a container that crash-loops and takes chat offline. The check is skipped only on the very first deploy (no image yet).
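The guard logic described here is small enough to sketch. The version below takes the environment as a dict so it can be exercised in isolation; the real functions in app/runtime.py are called without arguments and presumably read the agent's settings, so treat the signatures and the non-production log format as assumptions.

```python
import logging

log = logging.getLogger("agent.runtime")


def production_required_env_issues(env: dict) -> list:
    """Collect the auth misconfigurations that must never reach production."""
    issues = []
    if not (env.get("JWKS_URL") or "").strip():
        issues.append("JWKS_URL is empty; tokens cannot be verified in production")
    if (env.get("AUTH_SKIP") or "").strip().lower() == "true":
        issues.append("AUTH_SKIP is true; production deploys must verify JWTs")
    return issues


def assert_boot_required_env(env: dict) -> None:
    """Raise in production on any issue; stay warn-only everywhere else."""
    issues = production_required_env_issues(env)
    if not issues:
        return
    if (env.get("APP_ENV") or "").strip().lower() == "production":
        raise RuntimeError("; ".join(issues))
    for issue in issues:
        log.warning("boot guard (non-production): %s", issue)
```

Because the same function runs at container start and in the deploy gate, a bad .env is rejected before the rebuild instead of surfacing as a crash-loop.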
Reaching it
# JWKS (host):
curl -fsS http://localhost:9000/.well-known/jwks.json | jq # (1)!
# From inside the network (agent's perspective):
curl -fsS http://dataland-auth:9000/.well-known/jwks.json # (2)!
# Mint a token, then verify the agent accepts it:
TOKEN=$(curl -fsS -X POST http://localhost:9000/api/auth/login \
-H 'Content-Type: application/json' \
-d '{"email":"curator@example.com","password":"secret"}' | jq -r .access) # (3)!
curl -fsS http://localhost:9000/api/auth/me -H "Authorization: Bearer $TOKEN" | jq # (4)!
# Confirm the agent verifies the same token against its JWKS chain:
curl -fsS https://dataland.chat/v1/auth/me -H "Authorization: Bearer $TOKEN" | jq # (5)!
- Hits the public JWKS over the host-mapped port. After provisioning the mirror you should see two kid values here: the local-… key and dataland-rs256-1.
- The same endpoint via the in-network service name, which is exactly the URL the agent uses for JWKS_URL. Use this to confirm the agent can resolve the primary from inside dataland-network.
- Extracts the access token (1 h TTL) from a login response. jq -r strips the quotes so $TOKEN is the raw JWT; the refresh token in the same response is ignored here.
- Verifies against dataland-auth's local public key only (verify_aud=False), looking the user up by sub. This is the auth server's own me, not the agent's.
- Exercises the agent's /v1/auth/me (port 4141), which runs the full multi-JWKS chain. If this succeeds while the agent logs a FALLBACK warning, the local mirror is missing the kid and should be re-provisioned.
Volumes
auth-data:/app/data <- RSA private key, kid, extra_jwks.json (NEVER delete casually) # (1)!
/app/logs <- host: ${DATALAND_LOG_DIR}/auth/ # (2)!
- The named volume that persists all signing material: auth_rsa_private.pem, auth_rsa_kid.txt, and the DAT-286 extra_jwks.json. Wiping it (e.g. an aggressive reset-stack.sh) invalidates every issued JWT and drops the CMS mirror; restore it by following the runbook in dataland-infrastructure/reports/backup-restore.md before restarting.
- Log directory, bind-mounted to ${DATALAND_LOG_DIR}/auth/ on the host so auth logs survive container recreation and sit alongside the other services' logs.
Operational notes & gotchas
Dropping auth-data invalidates every issued token
Wiping the volume loses auth_rsa_private.pem, auth_rsa_kid.txt, and extra_jwks.json. Every previously issued JWT then fails verification and the DAT-286 mirror disappears. Restore from backup before restarting.
The FALLBACK JWKS provider warning means re-mirror the key
If the agent logs that a fallback provider validated a token, the local JWKS is missing that kid. Re-provision data/extra_jwks.json so verification returns to the local primary and the external CMS endpoint is back to pure-backup status.
Legacy passwords self-heal
Pre-DAT-145 sha256 rows keep working and upgrade to argon2id on the next successful login. No bulk migration is needed — just expect a brief, one-time UPDATE per legacy user on their first sign-in after the rollover.