
Migrations

GCP service migration (2026-06)

Beyond the schema/data/model migrations below, services are moving Spark → GCP one at a time. Done: the Atlas catalog (formerly information-webui) now runs on Cloud Run (https://atlas.dataland.chat, Cloud SQL + GCS); the self-hosted monitoring stack was removed. GCP infra is managed in Terraform: dataland-ai/dataland-gcp-terraform. Still on Spark: agent, auth, museum-api, notification×2, rag (GPU — stays on-prem), and the postgres/redis/qdrant datastores.

"Migration" means several different things in the Dataland stack, and this page covers all of them:

| Kind | Owner | Lives in | Run by |
| --- | --- | --- | --- |
| DB schema (Postgres DDL) | dataland-agent | migrations/versions/ (Alembic) | explicit operator step |
| Data backfills (one-shot SQL) | dataland-agent | migrations/data/ | by hand, with approval |
| Model migration (Gemini id) | agent + rag + infra | .env + code defaults | deploy.sh redeploy |
| Env consolidation | dataland-infrastructure | .env.example + env-inventory.md | guarded by check-env-drift.sh |
| Vector re-ingest (Qdrant) | webui → rag | images / knowledge collections | replace-by-slug sync |
| Network split (Spark ↔ GCP) | infra / KTM / GCP | MIGRATION_PLAN.md | phased cutover |

Recent changes

This page reflects the 2026-06-03 → 2026-06-04 change-set: DAT-269 (model → gemini-3.5-flash everywhere), DAT-265 (env consolidation + drift detection), DAT-291 (deploy boot-guard), the 0010_user_profile_fields Alembic revision, and the museum re-ingest that moved the knowledge collection from ~4839 → ~4969 points.


1. Database schema migrations (Alembic — dataland-agent)

The agent owns the only relational schema that evolves over time (the agent + auth Postgres DBs). Schema changes are managed with Alembic (introduced in DAT-68). RAG vectors live in Qdrant and museum/RDC state lives in Redis — neither uses Alembic.

Two flavours of Postgres change live in dataland-agent/migrations/:

  • migrations/versions/ — Alembic-managed schema migrations. Each file is a Python module with upgrade() and downgrade().
  • migrations/data/ — one-shot SQL data backfills run by hand against production (see §2).

Configuration

alembic.ini (repo root) wires the migration runner:

[alembic]
script_location = migrations
file_template = %%(rev)s_%%(slug)s
truncate_slug_length = 60
timezone = UTC
sqlalchemy.url = postgresql://dataland:dataland@localhost:5432/dataland

  • file_template: the %% is a literal % escaped for ConfigParser. This template yields the strictly linear four-digit-prefix filenames like 0010_user_profile_fields.py, so ls migrations/versions sorts in deploy order.
  • sqlalchemy.url: placeholder only, never the real DB. The actual DB URL comes from $DATABASE_URL at runtime (see below); this static value just keeps the runner importable when no env is set.

The real DB URL comes from $DATABASE_URL at runtime; migrations/env.py strips the async-driver suffix so the same URL the application uses works for migrations too:

| Application URL | Rewritten for Alembic (sync) |
| --- | --- |
| postgresql+asyncpg://… | postgresql+psycopg2://… |
| sqlite+aiosqlite:///… | sqlite:///… |
| bare postgresql://… | postgresql+psycopg2://… |
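
The rewrite in the table above can be sketched as a small pure function. This is a minimal illustration, not the actual code in migrations/env.py; the function name to_sync_url is hypothetical, and only the three cases listed in the table are handled.

```python
def to_sync_url(url: str) -> str:
    """Rewrite an application DB URL to a sync-driver URL (hypothetical helper)."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"not a database URL: {url!r}")
    dialect = scheme.split("+", 1)[0]  # drop any "+driver" suffix such as "+asyncpg"
    if dialect == "postgresql":
        return f"postgresql+psycopg2://{rest}"
    if dialect == "sqlite":
        return f"sqlite://{rest}"
    return url  # anything else is passed through unchanged (an assumption)


for app_url in (
    "postgresql+asyncpg://dataland:dataland@localhost:15432/dataland",
    "sqlite+aiosqlite:///tmp/test.db",
    "postgresql://dataland:dataland@localhost:15432/dataland",
):
    print(app_url, "->", to_sync_url(app_url))
```

Keeping the rewrite in one place means operators can export the same $DATABASE_URL the application uses and never maintain a second, migration-only URL.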

env.py also wires target_metadata = Base.metadata (from app.models) so alembic revision --autogenerate diffs against the same models the app uses at runtime, with compare_type=True and compare_server_default=True.

Idempotent, dialect-aware migrations

Every revision is written to be idempotent and dual-dialect (Postgres and SQLite). Production is Postgres; the test fixture builds a fresh SQLite schema from Base.metadata.create_all. Migrations therefore inspect the live schema (sa.inspect(bind)) and skip ops that are already satisfied — e.g. 0005 only drops messages.event if the column still exists; 0006 skips the varchar → timestamptz cast if the column is already a DateTime. This is why the round-trip test (upgrade → downgrade → upgrade) passes on a freshly-create_all'd DB even though production took a different path to the same shape.
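
The inspect-then-skip pattern can be illustrated without Alembic. The sketch below uses the stdlib sqlite3 module and a PRAGMA lookup in place of sa.inspect(bind), so it is an analogue of the technique rather than a copy of any real revision; the users.full_name column is borrowed from 0010 purely as an example.

```python
import sqlite3


def column_names(conn: sqlite3.Connection, table: str) -> set[str]:
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def upgrade(conn: sqlite3.Connection) -> None:
    """Add users.full_name only when the live schema does not have it yet."""
    if "full_name" in column_names(conn, "users"):
        return  # already satisfied, e.g. the schema was built by create_all
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
upgrade(conn)
upgrade(conn)  # second run is a no-op instead of a "duplicate column" error
print(sorted(column_names(conn, "users")))
```

Without the guard, the second upgrade would raise, which is exactly the failure a freshly create_all'd database would hit when a revision assumes the pre-migration shape.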

Migration chain (current head: 0010)

graph LR
  B["0001<br/>baseline"] --> U["0002<br/>utm active uniq"]
  U --> M["0003<br/>messages conv_seq uniq"]
  M --> E["0004<br/>users LOWER(email) uniq"]
  E --> D["0005<br/>drop messages.event"]
  D --> T1["0006<br/>messages.created_at → tz"]
  T1 --> T2["0007<br/>lift timestamps → tz"]
  T2 --> TK["0008<br/>tickets table"]
  TK --> R["0009<br/>runs table"]
  R --> P["0010<br/>user profile fields"]

| Rev | File | Purpose | Linear |
| --- | --- | --- | --- |
| 0001 | 0001_baseline.py | No-op. Establishes the Alembic head; the pre-Alembic schema (built by create_all) is stamped to this. | DAT-68 |
| 0002 | 0002_utm_active_uniq.py | Partial unique index ix_utm_ticket_active on user_ticket_mappings(ticket_id) WHERE active. Guarded: no-op on the simplified schema that no longer has the table. | DAT-47 |
| 0003 | 0003_messages_conv_seq_uniq.py | Composite unique ix_messages_conv_seq on messages(conversation_id, seq); also a covering index for the canonical read pattern. | DAT-63 |
| 0004 | 0004_users_email_lower_uniq.py | Case-insensitive partial unique ix_users_email_lower on users(LOWER(email)) WHERE email IS NOT NULL. Raw SQL (cross-dialect functional index). | DAT-64 |
| 0005 | 0005_drop_messages_event.py | Drop the dead messages.event column (NULL on 100% of 5949 rows). | DAT-53 |
| 0006 | 0006_messages_created_at_tz.py | messages.created_at varchar → timestamptz (all 5949 rows parse as ISO-8601). | DAT-62 |
| 0007 | 0007_lift_timestamps_to_tz.py | Lift the remaining 5 timestamp columns (users, conversations, user_ticket_mappings) to timestamptz, interpreting existing values as UTC. | DAT-61 |
| 0008 | 0008_tickets_table.py | Create tickets (per-ticket state) + backfill first_seen/last_seen/visit_count from user_ticket_mappings. Additive. | DAT-66 |
| 0009 | 0009_runs_table.py | Create runs (first-class run entity: model, tokens, cost, status) + best-effort backfill from messages (2918 distinct run_id). No hard FK yet. | DAT-65 |
| 0010 | 0010_user_profile_fields.py | Add nullable users profile columns from the museum-wide JWT: full_name, location, profile_photo_url, joined_date, access_permissions, stripe_customer_id. | (profile fields) |

The baseline + _ensure_schema_migrations retirement

0001_baseline.py is intentionally empty. The schema that existed at adoption-time predates Alembic — it was built by Base.metadata.create_all in app/db/session.py::init_db(). DAT-147 then retired the ad-hoc _ensure_schema_migrations DDL that used to run alongside create_all (every operation it performed had long since landed on every deploy). Today, init_db() still calls create_all so a freshly-spun environment boots with a working schema, but schema evolution from here lives in migrations/versions/ exclusively.

Running migrations

The application does not run alembic upgrade head on startup. Migrations are an explicit operator step so deploys stay predictable and rollbacks stay simple.

| Environment | Command |
| --- | --- |
| Local mirror | DATABASE_URL=postgresql://dataland:dataland@localhost:15432/dataland uv run alembic upgrade head |
| Production | ssh ege@<host> 'docker exec dataland-agent uv run alembic upgrade head' |

First deploy of Alembic (one-time only). The version table must be created and stamped to baseline before any subsequent migration applies:

ssh ege@<host> 'docker exec dataland-agent uv run alembic stamp head'  # (1)!
  1. stamp writes the version row without running any migration SQL. It tells Alembic the existing (pre-Alembic, create_all-built) schema already matches the head revision. Run this exactly once on first adoption; running upgrade instead here would try to re-apply migrations against an already-current schema.

After that, every future migration deploy is just alembic upgrade head.

Authoring a new migration

# From the dataland-agent repo, DATABASE_URL pointed at the local mirror
DATABASE_URL=postgresql://dataland:dataland@localhost:15432/dataland \
  uv run alembic revision --autogenerate -m "dat_NN_short_description"  # (1)!
  1. --autogenerate diffs Base.metadata against the live mirror schema to draft the migration. It is a starting point, not the final artifact — autogen misses functional/partial indexes, data backfills, and dialect quirks, so the next step (hand-edit + round-trip) is mandatory.

Then always hand-edit before committing, and verify the round-trip:

DATABASE_URL=... uv run alembic upgrade head
DATABASE_URL=... uv run alembic downgrade -1
DATABASE_URL=... uv run alembic upgrade head   # (1)!
  1. The second upgrade must converge to the same schema as the first. This proves downgrade() is a true inverse and the revision is idempotent — a revision that fails this round-trip is not safe to ship.

Two open PRs, one revision line

Revision ids are strictly linear four-digit prefixes (%%(rev)s_%%(slug)s) so ls migrations/versions sorts in deploy order. Two PRs each adding a migration will collide on the next id. Coordinate the next number in Linear before branching, and never edit a migration that has already been applied to any environment — write a follow-up instead.
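
A pre-branch check for the next free id, and for collisions after a merge, could look like the sketch below. It is an assumption-laden illustration: the helper names are hypothetical, nothing like it is confirmed to exist in the repo, and it only looks at filenames that follow the four-digit-prefix convention.

```python
import re
from collections import Counter

# Matches files such as 0010_user_profile_fields.py and captures the prefix.
REV_FILE = re.compile(r"^(\d{4})_.+\.py$")


def revision_ids(filenames):
    """Four-digit prefixes of every migration-shaped filename."""
    return [m.group(1) for name in filenames if (m := REV_FILE.match(name))]


def next_revision_id(filenames):
    """Next free id, zero-padded to four digits."""
    ids = revision_ids(filenames)
    return f"{max(map(int, ids), default=0) + 1:04d}"


def duplicate_ids(filenames):
    """Ids claimed by more than one file, e.g. after two PRs merge."""
    counts = Counter(revision_ids(filenames))
    return sorted(rev for rev, n in counts.items() if n > 1)


# In practice the list would come from os.listdir("migrations/versions").
files = ["0009_runs_table.py", "0010_user_profile_fields.py", "__init__.py"]
print(next_revision_id(files))  # 0011
print(duplicate_ids(files + ["0011_a.py", "0011_b.py"]))  # ['0011']
```

A check like this catches a filename collision, but it does not replace coordinating the number in Linear, because two branches can each compute the same next id independently.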


2. Data backfills (migrations/data/)

One-shot SQL backfills live outside Alembic because backfill SQL is per-row, slow on large tables, and benefits from decision-log review before commit. Every file is idempotent, wrapped in BEGIN; … COMMIT;, ends with a post-check SELECT COUNT(*), and includes a DO $$ … RAISE EXCEPTION $$ guard that aborts the transaction if the post-state is out of bounds.

| # | File | Purpose | Result on mirror | Linear |
| --- | --- | --- | --- | --- |
| 0001 | 0001_dat_48_backfill_null_conversation_ids.sql | Fill user_ticket_mappings.conversation_id on 62/70 legacy NULL rows (newest in-window conversation, 7-day grace). | 62 updated; 0 NULL after; guard asserts < 5% NULL. | DAT-48 |
| 0002 | 0002_dat_51_backfill_null_conversation_mode.sql | Fill conversations.mode on 290/2638 NULL rows (owner's most-used mode, else museum). | 290 updated (184 owner-pref, 106 default); guard asserts 0 NULL after. | DAT-51 |

The runbook for each backfill (restore the local mirror, run there first, take a fresh production pg_dump parachute, run on production with explicit user approval, re-run the audit queries) is in dataland-agent/migrations/README.md.

Backfills are not reversible by script

Data migrations keep no undo log. The rollback path is the point-in-time pg_dump taken immediately before the run. Always take it; always run on the local mirror first.


3. Gemini model migration (DAT-269)

The stack is standardized on gemini-3.5-flash for all generative work (chat + Gemini captioning + RAG reranking). Vector embeddings are a separate model and were out of scope for this migration.

Why we migrated

On 2026-06-15 Google removes access to gemini-2.5-flash, gemini-2.5-flash-lite, and gemini-3-flash-preview for new and inactive GCP projects, and disables model tuning. Active projects aren't cut off, but a project that goes idle (or a fresh one) would break. gemini-3.5-flash is GA, multimodal, 1M context, and roughly half the cost of Gemini 3 Flash. (Earlier PRs briefly landed on gemini-3.1-flash-lite; DAT-269 is the final consolidation to gemini-3.5-flash.)

This is an id-only change — no schema, no data, no behavioural change. It lives entirely in config defaults + the deploy .env:

| Repo | Setting | Value | Where |
| --- | --- | --- | --- |
| dataland-agent | agent_model | google-gla:gemini-3.5-flash | app/config.py:37, .env.example, AGENT_MODEL |
| dataland-agent | gemini_model | gemini-3.5-flash | app/config.py:121, GEMINI_MODEL |
| dataland-rag-v2 | gemini_model | gemini-3.5-flash | config.py:25, GEMINI_MODEL (captioning + rerank fallback + kreuzberg VLM) |
| dataland-infrastructure | AGENT_MODEL / GEMINI_MODEL | gemini-3.5-flash | .env.example:101-102, propagated by compose |

The agent's boot guard (app/runtime.py) refuses to start if AGENT_MODEL uses the google-gla: provider but GEMINI_API_KEY is empty.
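
The rule in that guard can be sketched as follows. This is a hypothetical reconstruction from the sentence above, not the contents of app/runtime.py; the names BootGuardError and check_model_env are invented for illustration, and the real guard may check more than this one pair of variables.

```python
import os


class BootGuardError(RuntimeError):
    """Raised when the environment cannot support the configured model."""


def check_model_env(env=None) -> None:
    """Refuse to continue if a google-gla: model is set without an API key."""
    env = os.environ if env is None else env
    model = env.get("AGENT_MODEL", "")
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if model.startswith("google-gla:") and not api_key:
        raise BootGuardError(
            "AGENT_MODEL uses the google-gla: provider but GEMINI_API_KEY is empty"
        )


try:
    check_model_env({"AGENT_MODEL": "google-gla:gemini-3.5-flash", "GEMINI_API_KEY": ""})
except BootGuardError as exc:
    print(f"boot refused: {exc}")
```

Failing at boot rather than on the first chat request turns a misconfigured model id into an immediate, visible deploy error.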

Operator audit checklist (needs GCP console access):

  • [ ] Confirm the prod project still has gemini-3.5-flash in us-central1.
  • [ ] Per-project: list models used in the last 30 days (Cloud Monitoring → generativelanguage.googleapis.com / Vertex), confirm nothing pins a to-be-removed id.
  • [ ] Confirm no Model Armor template (DAT-268) or tuning job pins a removed model.
  • [ ] Post-deploy, watch agent/rag logs + Grafana for model not found / 404 for one full day.

Rollback: set AGENT_MODEL / GEMINI_MODEL back to the previous value and redeploy. No schema or data is involved. See the standalone working doc docs/gemini-deprecation-migration.md.

Embeddings are separate

EMBEDDING_MODEL (gemini-embedding-2-preview in the infra/rag .env.example; the rag in-code default is gemini-embedding-2) is the vector model for the Qdrant collections and was not touched by DAT-269. Changing it would require re-embedding every point — a far larger operation than an id swap. See RAG.


4. Environment consolidation (DAT-265)

DAT-265 made dataland-infrastructure/.env.example the canonical template for the whole stack, with automated drift detection so a new variable can never silently ship unset.

Source-of-truth hierarchy

1. /home/cobanov/DATALAND/.env          ← actual deploy values, gitignored, 0640
2. dataland-infrastructure/.env.example ← canonical template (git). New vars start here.
3. dataland-<service>/.env.example      ← standalone local-dev template (a subset)

Operational rule: every variable a deployed service reads must appear in the infra template. A service-repo var that is absent from both the infra template and the explicit service-local-only allowlist is treated as drift.

Adding a new variable (the workflow)

  1. Add it to dataland-infrastructure/.env.example (with an ownership comment).
  2. Wire it into the relevant services.*.environment: block in compose.yml.
  3. Mirror it in the owning service repo's .env.example for local-dev parity.
  4. Run bash scripts/check-env-drift.sh — exit 0 means done.

flowchart LR
  A["new var in service<br/>.env.example"] --> C{"in infra<br/>.env.example?"}
  C -->|yes| OK["check-env-drift.sh → exit 0"]
  C -->|no| D{"in SERVICE_LOCAL<br/>allowlist?"}
  D -->|yes| OK
  D -->|no| FAIL["exit 1 → smoke/CI fails<br/>follow-up PR required"]

The drift guardrail (scripts/check-env-drift.sh) extracts ^[A-Z][A-Z0-9_]*= keys from each service .env.example, subtracts the infra template, and fails on anything left that isn't in the SERVICE_LOCAL allowlist (per-developer tuning knobs like CHUNK_SIZE, RERANKER_MODEL, the notification EXPLORER_* vars). The full ownership map (required-in-prod, shared secrets, deploy-only, service-local-only) is documented in docs/env-inventory.md.
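
The set arithmetic behind the guardrail is small enough to show directly. The sketch below is a Python rendering of the described logic, not the shell script itself; the sample file contents, the CHUNK_SIZE value, and the NEW_FLAG variable are made up for the example.

```python
import re

# Same key pattern the text describes, applied line by line.
KEY = re.compile(r"^([A-Z][A-Z0-9_]*)=", re.MULTILINE)


def env_keys(text: str) -> set[str]:
    """Variable names declared in a .env.example-style file."""
    return set(KEY.findall(text))


def drift(service_example: str, infra_example: str, service_local: set[str]) -> list[str]:
    """Service keys missing from both the infra template and the allowlist."""
    return sorted(env_keys(service_example) - env_keys(infra_example) - service_local)


infra = "AGENT_MODEL=google-gla:gemini-3.5-flash\nGEMINI_MODEL=gemini-3.5-flash\n"
service = "GEMINI_MODEL=gemini-3.5-flash\nCHUNK_SIZE=512\nNEW_FLAG=1\n# COMMENTED=1\n"
allowlist = {"CHUNK_SIZE", "RERANKER_MODEL"}

leftover = drift(service, infra, allowlist)
print(leftover)  # ['NEW_FLAG']
raise_exit_code = 1 if leftover else 0  # mirrors the script's exit 1 on drift
```

Commented-out lines are ignored because the pattern anchors on an uppercase letter at the start of the line.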

Phase 2 is deferred

The larger restructure (env/{dev,staging,prod}.env + replacing inline environment: blocks with env_file: references) is a separate follow-up. DAT-265 only establishes the hierarchy + drift detection that Phase 2 can build on.

Deploy-time boot guard (DAT-291)

deploy.sh now fails fast before rebuilding if the production .env still holds placeholder/default secrets. It runs the real agent boot guard (assert_boot_required_env) from the current dataland/agent:latest image against the new .env, so the check can never drift from the boot-time contract:

docker run --rm --env-file .env dataland/agent:latest \
  /app/.venv/bin/python -c "from app.runtime import assert_boot_required_env; assert_boot_required_env()"  # (1)!
  1. DAT-291. Runs the real boot guard from the already-built dataland/agent:latest image against the new .env, so the deploy-time check can never drift from the boot-time contract. A non-zero exit aborts deploy.sh before the rebuild, preventing the crash-loop outage where a freshly-built container fails the guard and takes chat offline.

If it exits non-zero, deploy.sh aborts — preventing the crash-loop outage (freshly-built container fails the guard and takes chat offline) that motivated the check. The guard is a no-op outside APP_ENV=production, and it is skipped on the very first deploy when no image exists yet. See Deploy.


5. Vector store re-ingest (Qdrant backfills)

Qdrant has no Alembic equivalent — content "migrations" are re-ingests. Schema/payload changes or content edits are applied by re-running ingestion, which is safe because of two properties enforced by dataland-rag-v2 and the webui's app/rag_sync.py:

  • Deterministic point ids — UUIDv5 derived from the source slug/path, so a re-ingest upserts in place instead of duplicating.
  • Replace-by-slug — every sync first issues a DELETE (e.g. DELETE /ingest/by-project-slug/<slug>, or by the namespaced museum slugs museum-section-<slug> / museum-scene-<slug>) to wipe stale points across both collections, then re-ingests.
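
Together, the two properties make a sync converge no matter how often it runs. The sketch below models that with a plain dict standing in for a Qdrant collection; the namespace value, the slug#index id scheme, and the museum-section-atrium slug are assumptions for illustration, not the scheme dataland-rag-v2 actually uses.

```python
import uuid

# Hypothetical namespace; any fixed UUID gives stable ids across runs.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/dataland-rag")


def point_id(slug: str, chunk_index: int) -> str:
    """Deterministic id: the same slug and chunk index always map to the same UUID."""
    return str(uuid.uuid5(NAMESPACE, f"{slug}#{chunk_index}"))


def sync(store: dict, slug: str, chunks: list[str]) -> None:
    """Replace-by-slug: delete every point for the slug, then upsert the new set."""
    for pid in [pid for pid, point in store.items() if point["slug"] == slug]:
        del store[pid]
    for index, text in enumerate(chunks):
        store[point_id(slug, index)] = {"slug": slug, "text": text}


store = {}
sync(store, "museum-section-atrium", ["intro", "history", "hours"])
first_ids = set(store)
sync(store, "museum-section-atrium", ["intro", "history", "hours"])
print(set(store) == first_ids)  # True: re-running creates no duplicates
sync(store, "museum-section-atrium", ["intro"])
print(len(store))  # 1: the delete step removed the two stale chunks
```

Deterministic ids alone would leave stale points behind when content shrinks; the delete-first step is what removes them.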

Museum re-ingest (this change-set)

The 20 museum sections + their scenes + the museum overview were re-ingested into the Qdrant knowledge collection, moving it from ~4839 → ~4969 points. Text flows to /ingest/file (knowledge); images flow to /ingest/image (Gemini-captioned, images collection). Entity types on the payloads are section / scene / museum (plus section_image / scene_image).

Re-ingest is idempotent by design

Because ids are UUIDv5 and each sync deletes-then-reingests by slug, running the museum re-ingest twice converges on the same point set. The point-count delta (≈130) reflects new/edited content, not duplication. See Information WebUI and RAG.


6. Network split: Spark ↔ GCP migration

Living architecture for the museum / GCP network split. The authoritative working document is MIGRATION_PLAN.md at the repo root; this section mirrors its diagrams + summary. See also Service hosting & relocation — the 2026-05-28 precursor.

Context

The Spark host is moving from the museum's vlan14 to vlan23. On vlan23 it has no public-internet access, but it can reach a private GCP VPC over a Cloud VPN / Interconnect link. The single-host stack splits in two:

  • Spark (museum, vlan23): GPU- / ML-heavy services that benefit from local compute, plus Redis (which KTM requires to live on vlan23).
  • GCP (public-facing): everything reachable from the public internet, plus Postgres and a new ai-proxy that fronts Gemini + GCS for Spark callers.

Network roles: KTM (museum-side network, owns the vlan23 transition), Christian (GCP project + VPC + VPN/Interconnect peer), Mert (application architecture, sits between the two).

Target topology

flowchart TB
  Internet((Public Internet))

  subgraph gcp["GCP VPC — public-facing half"]
    direction TB
    CF["Cloudflare Tunnel"]

    subgraph gcp_public[" Public-facing apps "]
      direction LR
      AG["agent"]
      AU["auth"]
      WU["information-webui"]
      DC["docs"]
    end

    subgraph gcp_async[" Async workers "]
      direction LR
      NA["notification-api"]
      NW["notification-worker"]
    end

    subgraph gcp_infra[" Data &amp; infra "]
      direction LR
      PG[("postgres<br/>Cloud SQL")]
      AIP["ai-proxy<br/>(Gemini + GCS)"]
      OBS["monitoring"]
      BST["bastion + IAP"]
    end

    CF --> gcp_public
  end

  subgraph spark["Spark — museum, vlan23 (no internet)"]
    direction TB

    subgraph spark_app[" App services "]
      direction LR
      RG["rag"]
      MU["museum-api"]
      SIM["museum-simulator"]
    end

    subgraph spark_data[" Data plane "]
      direction LR
      QD[("qdrant")]
      RD[("redis")]
    end
  end

  Internet ==>|"end users"| CF
  AIP -.->|"egress to<br/>Gemini, GCS"| Internet
  gcp <==>|"Cloud VPN / Interconnect<br/>(only Spark↔GCP path)"| spark

  classDef ext fill:#fff,stroke:#666,stroke-width:2px
  class Internet ext

Key invariants of the new topology:

  1. Spark has no public internet. Any code path that today calls googleapis.com, GCS, Docker Hub, PyPI, etc. directly from a Spark container must route through a GCP service.
  2. Cloud VPN / Interconnect is the only Spark↔GCP path. All inter-half traffic crosses it, including hot paths like agent → rag and agent → redis.
  3. Cloudflare Tunnel is the only public ingress — as today, but terminating on GCP load balancers instead of on Spark.
  4. ai-proxy is the single boundary for Gemini + GCS. Spark callers (today: rag) treat it as a normal HTTP base URL; the proxy owns the GCP credentials and outbound calls to Google APIs.

Chat request flow (post-cut)

sequenceDiagram
  participant App as Mobile app
  participant CF as Cloudflare
  participant Agent as agent (GCP)
  participant Auth as auth (GCP)
  participant SQL as postgres (GCP)
  participant Redis as redis (Spark, via VPN)
  participant Museum as museum-api (Spark, via VPN)
  participant RAG as rag (Spark, via VPN)
  participant QD as qdrant (Spark, local to rag)
  participant Proxy as ai-proxy (GCP)
  participant Gemini as Gemini API

  App->>CF: POST /v1/chat (Bearer JWT)
  CF->>Agent: forward
  Agent->>Auth: JWKS (local in GCP)
  Agent->>SQL: session / state (Cloud SQL, local)
  Agent->>Redis: rate-limit check (VPN hop)
  Agent->>Museum: GET vitals (VPN hop)
  Agent->>RAG: POST /search (VPN hop)
  RAG->>Proxy: embed query (reverse VPN hop)
  Proxy->>Gemini: embed (GCP internet egress)
  Gemini-->>Proxy: vector
  Proxy-->>RAG: vector
  RAG->>QD: vector search (local)
  RAG->>RAG: ONNX rerank (local, heavy)
  RAG-->>Agent: ranked passages
  Agent->>Gemini: generate (direct, agent has internet)
  Agent-->>App: SSE stream

A single chat round-trip crosses the VPN up to four times (agent→redis, agent→museum-api, agent→rag, and rag→ai-proxy). The link's RTT dominates end-to-end p50 / p95; a baseline measurement is an open action item in the working doc.

RAG search timeout vs. the new VPN RTT

The agent's RAG /search client read timeout was raised 10s → 25s (app/config.py::rag_search_timeout_seconds = 25.0). A /search round-trip (query embedding + rerank) runs ~10s today; a 10s client timeout caused ReadTimeout → 3 retries (~30s) → a second search → 60s agent wall-clock → agent_timeout on museum-knowledge queries. The 25s budget lets a single search complete on the first attempt. Re-validate this budget once the VPN RTT is measured, since every agent → rag call becomes a cross-WAN hop.
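
The arithmetic behind that failure chain can be worked through in a few lines. The model below is a simplification under stated assumptions: three attempts per search, up to two searches per agent turn, each timed-out attempt costing the full timeout, and an illustrative 10.5s search duration standing in for "~10s".

```python
def search_outcome(search_s: float, timeout_s: float, attempts: int = 3):
    """Return (succeeded, seconds spent) for one /search call including retries."""
    if search_s <= timeout_s:
        return True, search_s  # completes on the first attempt
    return False, timeout_s * attempts  # every attempt burns the full timeout


def agent_wall_clock(search_s: float, timeout_s: float, searches: int = 2, attempts: int = 3) -> float:
    """Seconds the agent spends on RAG before it succeeds or gives up."""
    total = 0.0
    for _ in range(searches):
        ok, spent = search_outcome(search_s, timeout_s, attempts)
        total += spent
        if ok:
            break
    return total


print(agent_wall_clock(10.5, timeout_s=10.0))  # 60.0, the agent_timeout case
print(agent_wall_clock(10.5, timeout_s=25.0))  # 10.5, one search, first attempt
```

The same function can be re-run with the measured VPN RTT added to search_s to see how much headroom the 25s budget leaves.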

Content ingestion flow (post-cut)

sequenceDiagram
  participant Curator
  participant WebUI as information-webui (GCP)
  participant GCS
  participant RAG as rag (Spark, via VPN)
  participant Proxy as ai-proxy (GCP)
  participant Gemini as Gemini API
  participant QD as qdrant (Spark, local)

  Curator->>WebUI: upload artwork + metadata
  WebUI->>GCS: PUT object (native GCP path)
  WebUI->>RAG: POST /ingest/file (VPN hop)
  RAG->>Proxy: embed chunks (reverse VPN hop)
  Proxy->>Gemini: embed
  Gemini-->>Proxy: vectors
  Proxy-->>RAG: vectors
  RAG->>QD: upsert (local)

Ingestion bytes (the file itself) move once from information-webui to GCS over a native GCP path — they never cross the VPN. Only the control call (/ingest/file with a GCS pointer) does. Embedding round-trips are per-chunk and the dominant VPN cost during a large ingest such as the museum re-ingest in §5.

Operator remote access

graph LR
  Laptop["Engineer laptop"]
  IAP["Google IAP<br/>(IAM-controlled)"]
  Bastion["bastion VM<br/>(no public IP)"]
  Spark["Spark host"]

  Laptop -->|gcloud + HTTPS| IAP
  IAP -->|tunnelled SSH| Bastion
  Bastion -->|SSH over VPN| Spark

  Fallback["Fallback: KTM-issued<br/>vlan23 VPN client"]
  Laptop -.->|emergency only| Fallback
  Fallback -.-> Spark

Default path is gcloud compute ssh bastion --tunnel-through-iap, with ProxyJump set so ssh spark transparently traverses IAP + bastion. IAM controls access; revoke = remove the role. The KTM VPN fallback exists only for the case where GCP itself is down.

What changes per service

| Service | Before (single-host Spark) | After (Spark+GCP) | Notes |
| --- | --- | --- | --- |
| agent | Spark | GCP | Public chat endpoint at dataland.chat. |
| auth | Spark | GCP | JWKS must be publicly reachable. |
| information-webui | Spark | GCP | Operator-facing CMS. |
| docs | Spark | GCP | Public docs site (docs.dataland.chat). |
| notification-worker / notification-api | Spark | GCP | Need outbound to Slack / OneSignal / etc. |
| postgres | Spark | GCP (Cloud SQL) | Co-locates with agent + auth; the Alembic chain in §1 runs against this Cloud SQL instance post-cut. |
| rag | Spark | Spark | Heavy ONNX rerank + BM25; not easily portable. |
| qdrant | Spark | Spark | Co-located with rag to avoid per-search VPN hops. |
| redis | Spark | Spark (vlan23) | KTM directive. AOF persistence on local disk. |
| museum-api | Spark | Spark | Confirmed museum-internal consumers only. |
| museum-simulator | Spark | Spark | Writes to local Redis. |
| ai-proxy | (didn't exist) | GCP (new) | Fronts Gemini + GCS for Spark callers. |
| Monitoring (Prom / Grafana / AM) | Spark | GCP (lean) | Better alert-delivery from the internet side. |

Phased migration (sketch)

  1. Phase 0 — decisions & baselines. Lock per-service split (done for the rows above); confirm GCP transport + RTT with Christian.
  2. Phase 1 — vlan23 + GCP scaffolding. KTM moves Spark + Redis to vlan23 (vlan14 dual-homed temporarily); Christian stands up VPC, VPN, bastion, test VM. Acceptance: redis-cli -h <spark-redis> ping from a GCP box returns PONG.
  3. Phase 2 — stateless GCP services live, dual-running. Bring auth, agent, information-webui, docs up on GCP pointing at Spark Redis (and still Spark Postgres). Cloudflare origin moves to GCP.
  4. Phase 3 — Postgres move. Snapshot + restore Spark Postgres into Cloud SQL; cutover DSNs. Re-run alembic upgrade head against Cloud SQL and confirm the head revision matches.
  5. Phase 4 — ai-proxy live. rag swaps its Gemini SDK / GCS client for the proxy base URL. Verified by running ingestion + chat with Spark's public internet blocked at the host level.
  6. Phase 5 — vlan14 cut. KTM disconnects vlan14 from Spark. Smoke the full stack.
  7. Phase 6 — cleanup. Remove migrated entries from the Spark compose stack; retire the Tailscale exposure path; document the new shape here.

Open questions

  • Cross-WAN RTT once the VPN is up (drives whether per-request agent→rag is acceptable as-is, or whether the 25s search timeout / agent-side caching needs revisiting).
  • DR posture per side: Cloud SQL handles GCP; what's the post-split backup plan for Spark qdrant + redis?
  • Build & deploy path: Spark's docker pull and uv sync are dead post-cut. Default plan: GCP Artifact Registry + PyPI mirror, both reachable from Spark via the VPN. Christian to confirm Artifact Registry shape; we update deploy.sh after.

Living plan & change log

For the network split, the authoritative tracker — open action items, change log, working notes — is MIGRATION_PLAN.md at the repo root. For schema + backfills it is dataland-agent/migrations/README.md. For env ownership it is docs/env-inventory.md. This page is updated when a decision lands; it is not the working doc.