Migrations¶
GCP service migration (2026-06)
Beyond the schema/data/model migrations below, services are moving
Spark → GCP one at a time. Done: the Atlas catalog (formerly
information-webui) now runs on Cloud Run (https://atlas.dataland.chat,
Cloud SQL + GCS); the self-hosted monitoring stack was removed. GCP infra
is managed in Terraform: dataland-ai/dataland-gcp-terraform. Still on
Spark: agent, auth, museum-api, notification×2, rag (GPU — stays
on-prem), and the postgres/redis/qdrant datastores.
"Migration" means several different things in the Dataland stack, and this page covers all of them:
| Kind | Owner | Lives in | Run by |
|---|---|---|---|
| DB schema (Postgres DDL) | dataland-agent | migrations/versions/ (Alembic) | explicit operator step |
| Data backfills (one-shot SQL) | dataland-agent | migrations/data/ | by hand, with approval |
| Model migration (Gemini id) | agent + rag + infra | .env + code defaults | deploy.sh redeploy |
| Env consolidation | dataland-infrastructure | .env.example + env-inventory.md | guarded by check-env-drift.sh |
| Vector re-ingest (Qdrant) | webui → rag | images / knowledge collections | replace-by-slug sync |
| Network split (Spark ↔ GCP) | infra / KTM / GCP | MIGRATION_PLAN.md | phased cutover |
Recent changes
This page reflects the 2026-06-03 → 2026-06-04 change-set:
DAT-269 (model → gemini-3.5-flash everywhere), DAT-265 (env
consolidation + drift detection), DAT-291 (deploy boot-guard), the
0010_user_profile_fields Alembic revision, and the museum re-ingest that
moved the knowledge collection from ~4839 → ~4969 points.
1. Database schema migrations (Alembic — dataland-agent)¶
The agent owns the only relational schema that evolves over time (the agent + auth Postgres DBs). Schema changes are managed with Alembic (introduced in DAT-68). RAG vectors live in Qdrant and museum/RDC state lives in Redis — neither uses Alembic.
Two flavours of Postgres change live in dataland-agent/migrations/:
- migrations/versions/: Alembic-managed schema migrations. Each file is a Python module with upgrade() and downgrade().
- migrations/data/: one-shot SQL data backfills run by hand against production (see §2).
Configuration¶
alembic.ini (repo root) wires the migration runner:
[alembic]
script_location = migrations
file_template = %%(rev)s_%%(slug)s # (1)!
truncate_slug_length = 60
timezone = UTC
sqlalchemy.url = postgresql://dataland:dataland@localhost:5432/dataland # (2)!
- The %% is a literal % escaped for ConfigParser. This template yields strictly linear four-digit-prefix filenames like 0010_user_profile_fields.py, so ls migrations/versions sorts in deploy order.
- Placeholder only, never the real DB. The actual DB URL comes from $DATABASE_URL at runtime (see below); this static value just keeps the runner importable when no env is set.
The real DB URL comes from $DATABASE_URL at runtime; migrations/env.py
strips the async-driver suffix so the same URL the application uses works for
migrations too:
| Application URL | Rewritten for Alembic (sync) |
|---|---|
| postgresql+asyncpg://… | postgresql+psycopg2://… |
| sqlite+aiosqlite:///… | sqlite:///… |
| bare postgresql://… | postgresql+psycopg2://… |
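The rewrite rules in the table can be sketched as a small pure function. This is an illustration only, not the code in migrations/env.py; the function name to_sync_url is hypothetical, and the pass-through for unknown dialects is an assumption.

```python
def to_sync_url(url: str) -> str:
    """Rewrite an application DB URL to a sync-driver URL (hypothetical helper)."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"not a database URL: {url!r}")
    dialect = scheme.split("+", 1)[0]  # drop any "+driver" suffix, e.g. "+asyncpg"
    if dialect == "postgresql":
        # Both the asyncpg form and the bare form map to psycopg2.
        return f"postgresql+psycopg2://{rest}"
    if dialect == "sqlite":
        return f"sqlite://{rest}"
    return url  # assumption: unknown dialects pass through unchanged


print(to_sync_url("postgresql+asyncpg://user:pw@db:5432/dataland"))
print(to_sync_url("sqlite+aiosqlite:///tmp/test.db"))
```

Keeping the rewrite in one function means the application and the migration runner can share a single $DATABASE_URL value.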
env.py also wires target_metadata = Base.metadata (from app.models) so
alembic revision --autogenerate diffs against the same models the app uses at
runtime, with compare_type=True and compare_server_default=True.
Idempotent, dialect-aware migrations
Every revision is written to be idempotent and dual-dialect (Postgres
and SQLite). Production is Postgres; the test fixture builds a fresh SQLite
schema from Base.metadata.create_all. Migrations therefore inspect the live
schema (sa.inspect(bind)) and skip ops that are already satisfied — e.g.
0005 only drops messages.event if the column still exists; 0006 skips
the varchar → timestamptz cast if the column is already a DateTime. This
is why the round-trip test (upgrade → downgrade → upgrade) passes on a
freshly-create_all'd DB even though production took a different path to the
same shape.
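The inspect-then-skip pattern can be illustrated with the standard library alone. The sketch below uses sqlite3 and PRAGMA table_info as a stand-in for sa.inspect(bind); the helper names are hypothetical, and it adds a column rather than dropping one only because that works on any SQLite version.

```python
import sqlite3


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    # Identifiers are interpolated directly, so they must come from trusted code.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)


def add_column_if_missing(conn, table, column, ddl_type) -> bool:
    """Apply the change only if the live schema still needs it."""
    if column_exists(conn, table, column):
        return False  # already satisfied: skip, so a re-run is a no-op
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
first = add_column_if_missing(conn, "users", "full_name", "TEXT")   # applies
second = add_column_if_missing(conn, "users", "full_name", "TEXT")  # skipped
```

Because the second call inspects the schema and finds the column, running the same step twice leaves the database in the same state.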
Migration chain (current head: 0010)¶
graph LR
B["0001<br/>baseline"] --> U["0002<br/>utm active uniq"]
U --> M["0003<br/>messages conv_seq uniq"]
M --> E["0004<br/>users LOWER(email) uniq"]
E --> D["0005<br/>drop messages.event"]
D --> T1["0006<br/>messages.created_at → tz"]
T1 --> T2["0007<br/>lift timestamps → tz"]
T2 --> TK["0008<br/>tickets table"]
TK --> R["0009<br/>runs table"]
R --> P["0010<br/>user profile fields"]
| Rev | File | Purpose | Linear |
|---|---|---|---|
| 0001 | 0001_baseline.py | No-op. Establishes the Alembic head; the pre-Alembic schema (built by create_all) is stamped to this. | DAT-68 |
| 0002 | 0002_utm_active_uniq.py | Partial unique index ix_utm_ticket_active on user_ticket_mappings(ticket_id) WHERE active. Guarded: no-op on the simplified schema that no longer has the table. | DAT-47 |
| 0003 | 0003_messages_conv_seq_uniq.py | Composite unique ix_messages_conv_seq on messages(conversation_id, seq); also a covering index for the canonical read pattern. | DAT-63 |
| 0004 | 0004_users_email_lower_uniq.py | Case-insensitive partial unique ix_users_email_lower on users(LOWER(email)) WHERE email IS NOT NULL. Raw SQL (cross-dialect functional index). | DAT-64 |
| 0005 | 0005_drop_messages_event.py | Drop the dead messages.event column (NULL on 100% of 5949 rows). | DAT-53 |
| 0006 | 0006_messages_created_at_tz.py | messages.created_at varchar → timestamptz (all 5949 rows parse as ISO-8601). | DAT-62 |
| 0007 | 0007_lift_timestamps_to_tz.py | Lift the remaining 5 timestamp columns (users, conversations, user_ticket_mappings) to timestamptz, interpreting existing values as UTC. | DAT-61 |
| 0008 | 0008_tickets_table.py | Create tickets (per-ticket state) + backfill first_seen/last_seen/visit_count from user_ticket_mappings. Additive. | DAT-66 |
| 0009 | 0009_runs_table.py | Create runs (first-class run entity: model, tokens, cost, status) + best-effort backfill from messages (2918 distinct run_id). No hard FK yet. | DAT-65 |
| 0010 | 0010_user_profile_fields.py | Add nullable users profile columns from the museum-wide JWT: full_name, location, profile_photo_url, joined_date, access_permissions, stripe_customer_id. | (profile fields) |
The baseline + _ensure_schema_migrations retirement
0001_baseline.py is intentionally empty. The schema that existed at
adoption-time predates Alembic — it was built by Base.metadata.create_all
in app/db/session.py::init_db(). DAT-147 then retired the ad-hoc
_ensure_schema_migrations DDL that used to run alongside create_all
(every operation it performed had long since landed on every deploy).
Today, init_db() still calls create_all so a freshly-spun environment
boots with a working schema, but schema evolution from here lives in
migrations/versions/ exclusively.
Running migrations¶
The application does not run alembic upgrade head on startup. Migrations
are an explicit operator step so deploys stay predictable and rollbacks stay
simple.
| Environment | Command |
|---|---|
| Local mirror | DATABASE_URL=postgresql://dataland:dataland@localhost:15432/dataland uv run alembic upgrade head |
| Production | ssh ege@<host> 'docker exec dataland-agent uv run alembic upgrade head' |
First deploy of Alembic (one-time only). The version table must be created and stamped to baseline before any subsequent migration applies:
stamp writes the version row without running any migration SQL. It tells Alembic the existing (pre-Alembic, create_all-built) schema already matches the head revision. Run this exactly once on first adoption; running upgrade here instead would try to re-apply migrations against an already-current schema.
After that, every future migration deploy is just alembic upgrade head.
Authoring a new migration¶
# From the dataland-agent repo, DATABASE_URL pointed at the local mirror
DATABASE_URL=postgresql://dataland:dataland@localhost:15432/dataland \
uv run alembic revision --autogenerate -m "dat_NN_short_description" # (1)!
--autogenerate diffs Base.metadata against the live mirror schema to draft the migration. It is a starting point, not the final artifact: autogen misses functional/partial indexes, data backfills, and dialect quirks, so the next step (hand-edit + round-trip) is mandatory.
Then always hand-edit before committing, and verify the round-trip:
DATABASE_URL=... uv run alembic upgrade head
DATABASE_URL=... uv run alembic downgrade -1
DATABASE_URL=... uv run alembic upgrade head # (1)!
- The second upgrade must converge to the same schema as the first. This proves downgrade() is a true inverse and the revision is idempotent; a revision that fails this round-trip is not safe to ship.
Two open PRs, one revision line
Revision ids are strictly linear four-digit prefixes (%%(rev)s_%%(slug)s)
so ls migrations/versions sorts in deploy order. Two PRs each adding a
migration will collide on the next id. Coordinate the next number in
Linear before branching, and never edit a migration that has already been
applied to any environment — write a follow-up instead.
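A collision between two branches can be caught mechanically before merge. The following is a minimal sketch, assuming revision files follow the NNNN_slug.py shape shown above; find_prefix_collisions is a hypothetical helper, not an existing script in the repo.

```python
import re
from collections import defaultdict

# Four-digit prefix, underscore, lowercase slug, as in 0010_user_profile_fields.py.
REV_FILE = re.compile(r"^(\d{4})_[a-z0-9_]+\.py$")


def find_prefix_collisions(filenames):
    """Return {prefix: [files]} for every prefix claimed by more than one file."""
    by_prefix = defaultdict(list)
    for name in filenames:
        match = REV_FILE.match(name)
        if match:  # ignore __init__.py and anything else that is not a revision
            by_prefix[match.group(1)].append(name)
    return {p: sorted(names) for p, names in by_prefix.items() if len(names) > 1}


merged_tree = [
    "__init__.py",
    "0009_runs_table.py",
    "0010_user_profile_fields.py",
    "0010_other_change.py",  # hypothetical second PR that took the same number
]
print(find_prefix_collisions(merged_tree))
```

In practice the filename list would come from listing migrations/versions on the merged branch; an empty result means the prefixes are still unique.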
2. Data backfills (migrations/data/)¶
One-shot SQL backfills live outside Alembic because backfill SQL is per-row,
slow on large tables, and benefits from decision-log review before commit. Every
file is idempotent, wrapped in BEGIN; … COMMIT;, ends with a post-check
SELECT COUNT(*), and includes a DO $$ … RAISE EXCEPTION $$ guard that aborts
the transaction if the post-state is out of bounds.
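The shape of such a backfill (transaction, idempotent update, post-check, abort on a bad post-state) can be sketched in Python against SQLite. This is an illustration of the pattern only: the real files are SQL run against Postgres, the function name is hypothetical, and the fill rule is simplified to a single default value.

```python
import sqlite3


def backfill_mode(conn: sqlite3.Connection, default_mode: str = "museum") -> int:
    """Fill NULL conversations.mode; roll back if any NULL row remains."""
    conn.execute("BEGIN")
    try:
        cur = conn.execute(
            "UPDATE conversations SET mode = ? WHERE mode IS NULL",
            (default_mode,),
        )
        updated = cur.rowcount
        # Post-check: the guard aborts the whole transaction if it fails.
        remaining = conn.execute(
            "SELECT COUNT(*) FROM conversations WHERE mode IS NULL"
        ).fetchone()[0]
        if remaining != 0:
            raise RuntimeError(f"post-check failed: {remaining} NULL rows remain")
        conn.execute("COMMIT")
        return updated
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None gives explicit BEGIN/COMMIT control over the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE conversations (id INTEGER PRIMARY KEY, mode TEXT)")
conn.executemany(
    "INSERT INTO conversations (mode) VALUES (?)",
    [("museum",), (None,), (None,)],
)
first = backfill_mode(conn)   # fills the two NULL rows
second = backfill_mode(conn)  # nothing left to fill: idempotent re-run
```

The WHERE mode IS NULL predicate is what makes the re-run safe: rows that were already filled are never touched again.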
| # | File | Purpose | Result on mirror | Linear |
|---|---|---|---|---|
| 0001 | 0001_dat_48_backfill_null_conversation_ids.sql | Fill user_ticket_mappings.conversation_id on 62/70 legacy NULL rows (newest in-window conversation, 7-day grace). | 62 updated; 0 NULL after; guard asserts < 5% NULL. | DAT-48 |
| 0002 | 0002_dat_51_backfill_null_conversation_mode.sql | Fill conversations.mode on 290/2638 NULL rows (owner's most-used mode, else museum). | 290 updated (184 owner-pref, 106 default); guard asserts 0 NULL after. | DAT-51 |
The runbook for each backfill (restore the local mirror, run there first, take a
fresh production pg_dump parachute, run on production with explicit user
approval, re-run the audit queries) is in dataland-agent/migrations/README.md.
Backfills are not reversible by script
Data migrations keep no undo log. The rollback path is the point-in-time
pg_dump taken immediately before the run. Always take it; always run on the
local mirror first.
3. Gemini model migration (DAT-269)¶
The stack is standardized on gemini-3.5-flash for all generative work
(chat + Gemini captioning + RAG reranking). Vector embeddings are a separate
model and were out of scope for this migration.
Why we migrated
On 2026-06-15 Google removes access to gemini-2.5-flash,
gemini-2.5-flash-lite, and gemini-3-flash-preview for new and inactive
GCP projects, and disables model tuning. Active projects aren't cut off, but
a project that goes idle (or a fresh one) would break. gemini-3.5-flash is
GA, multimodal, 1M context, and roughly half the cost of Gemini 3 Flash.
(Earlier PRs briefly landed on gemini-3.1-flash-lite; DAT-269 is the final
consolidation to gemini-3.5-flash.)
This is an id-only change — no schema, no data, no behavioural change. It
lives entirely in config defaults + the deploy .env:
| Repo | Setting | Value | Where |
|---|---|---|---|
| dataland-agent | agent_model | google-gla:gemini-3.5-flash | app/config.py:37, .env.example, AGENT_MODEL |
| dataland-agent | gemini_model | gemini-3.5-flash | app/config.py:121, GEMINI_MODEL |
| dataland-rag-v2 | gemini_model | gemini-3.5-flash | config.py:25, GEMINI_MODEL (captioning + rerank fallback + kreuzberg VLM) |
| dataland-infrastructure | AGENT_MODEL / GEMINI_MODEL | gemini-3.5-flash | .env.example:101-102, propagated by compose |
The agent's boot guard (app/runtime.py) refuses to start if
AGENT_MODEL uses the google-gla: provider but GEMINI_API_KEY is empty.
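The rule the guard enforces can be written down as a few lines of Python. This is a sketch of the stated rule only, not the contents of app/runtime.py; check_model_env is a hypothetical name, and treating a whitespace-only key as empty is an assumption.

```python
def check_model_env(env: dict) -> None:
    """Raise if the google-gla provider is selected without an API key."""
    model = env.get("AGENT_MODEL", "")
    key = env.get("GEMINI_API_KEY", "")
    # Assumption: a whitespace-only key counts as empty.
    if model.startswith("google-gla:") and not key.strip():
        raise RuntimeError(
            "AGENT_MODEL uses the google-gla: provider but GEMINI_API_KEY is empty"
        )


# Passes: provider and key are both present.
check_model_env({"AGENT_MODEL": "google-gla:gemini-3.5-flash", "GEMINI_API_KEY": "k"})
```

Failing at boot turns a misconfigured key into an immediate, visible error instead of a failure on the first chat request.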
Operator audit checklist (needs GCP console access):
- [ ] Confirm the prod project still has gemini-3.5-flash in us-central1.
- [ ] Per-project: list models used in the last 30 days (Cloud Monitoring → generativelanguage.googleapis.com / Vertex), confirm nothing pins a to-be-removed id.
- [ ] Confirm no Model Armor template (DAT-268) or tuning job pins a removed model.
- [ ] Post-deploy, watch agent/rag logs + Grafana for model not found / 404 for one full day.
Rollback: set AGENT_MODEL / GEMINI_MODEL back to the previous value and
redeploy. No schema or data is involved. See the standalone working doc
docs/gemini-deprecation-migration.md.
Embeddings are separate
EMBEDDING_MODEL (gemini-embedding-2-preview in the infra/rag
.env.example; the rag in-code default is gemini-embedding-2) is the
vector model for the Qdrant collections and was not touched by DAT-269.
Changing it would require re-embedding every point — a far larger operation
than an id swap. See RAG.
4. Environment consolidation (DAT-265)¶
DAT-265 made dataland-infrastructure/.env.example the canonical template
for the whole stack, with automated drift detection so a new variable can never
silently ship unset.
Source-of-truth hierarchy¶
1. /home/cobanov/DATALAND/.env ← actual deploy values, gitignored, 0640
2. dataland-infrastructure/.env.example ← canonical template (git). New vars start here.
3. dataland-<service>/.env.example ← standalone local-dev template (a subset)
Operational rule: every variable a deployed service reads must appear in the infra template. A service-repo var that is absent from both the infra template and the explicit service-local-only allowlist is treated as drift.
Adding a new variable (the workflow)¶
- Add it to dataland-infrastructure/.env.example (with an ownership comment).
- Wire it into the relevant services.*.environment: block in compose.yml.
- Mirror it in the owning service repo's .env.example for local-dev parity.
- Run bash scripts/check-env-drift.sh; exit 0 means done.
flowchart LR
A["new var in service<br/>.env.example"] --> C{"in infra<br/>.env.example?"}
C -->|yes| OK["check-env-drift.sh → exit 0"]
C -->|no| D{"in SERVICE_LOCAL<br/>allowlist?"}
D -->|yes| OK
D -->|no| FAIL["exit 1 → smoke/CI fails<br/>follow-up PR required"]
The drift guardrail (scripts/check-env-drift.sh) extracts ^[A-Z][A-Z0-9_]*=
keys from each service .env.example, subtracts the infra template, and fails
on anything left that isn't in the SERVICE_LOCAL allowlist (per-developer
tuning knobs like CHUNK_SIZE, RERANKER_MODEL, the notification EXPLORER_*
vars). The full ownership map (required-in-prod, shared secrets, deploy-only,
service-local-only) is documented in docs/env-inventory.md.
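The set arithmetic behind the guardrail is small enough to sketch in Python. The real check is the shell script scripts/check-env-drift.sh; the function names and the sample file contents below are hypothetical, and only the key regex and the allowlist idea come from the description above.

```python
import re

# Same key shape the script extracts: ^[A-Z][A-Z0-9_]*=
KEY_RE = re.compile(r"^([A-Z][A-Z0-9_]*)=", re.MULTILINE)


def env_keys(text: str) -> set:
    """Collect variable names from .env.example content; commented lines do not match."""
    return set(KEY_RE.findall(text))


def find_drift(service_example: str, infra_example: str, service_local) -> list:
    """Service keys missing from both the infra template and the allowlist."""
    return sorted(env_keys(service_example) - env_keys(infra_example) - set(service_local))


infra = "GEMINI_MODEL=gemini-3.5-flash\nAGENT_MODEL=\n"
service = "GEMINI_MODEL=\nCHUNK_SIZE=512\nNEW_FLAG=1\n# OLD_FLAG=0\n"
drift = find_drift(service, infra, service_local={"CHUNK_SIZE", "RERANKER_MODEL"})
print(drift)  # a non-empty list corresponds to the script's exit 1
```

Here NEW_FLAG is reported because it appears in neither the infra template nor the allowlist, while CHUNK_SIZE is accepted as service-local.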
Phase 2 is deferred
The larger restructure (env/{dev,staging,prod}.env + replacing inline
environment: blocks with env_file: references) is a separate follow-up.
DAT-265 only establishes the hierarchy + drift detection that Phase 2 can
build on.
Deploy-time boot guard (DAT-291)¶
deploy.sh now fails fast before rebuilding if the production .env still
holds placeholder/default secrets. It runs the real agent boot guard
(assert_boot_required_env) from the current dataland/agent:latest image
against the new .env, so the check can never drift from the boot-time
contract:
docker run --rm --env-file .env dataland/agent:latest \
/app/.venv/bin/python -c "from app.runtime import assert_boot_required_env; assert_boot_required_env()" # (1)!
- DAT-291. Runs the real boot guard from the already-built dataland/agent:latest image against the new .env, so the deploy-time check can never drift from the boot-time contract. A non-zero exit aborts deploy.sh before the rebuild, preventing the crash-loop outage where a freshly-built container fails the guard and takes chat offline.
If it exits non-zero, deploy.sh aborts — preventing the crash-loop outage
(freshly-built container fails the guard and takes chat offline) that motivated
the check. The guard is a no-op outside APP_ENV=production, and it is skipped
on the very first deploy when no image exists yet. See Deploy.
5. Vector store re-ingest (Qdrant backfills)¶
Qdrant has no Alembic equivalent — content "migrations" are re-ingests.
Schema/payload changes or content edits are applied by re-running ingestion,
which is safe because of two properties enforced by dataland-rag-v2 and the
webui's app/rag_sync.py:
- Deterministic point ids: UUIDv5 derived from the source slug/path, so a re-ingest upserts in place instead of duplicating.
- Replace-by-slug: every sync first issues a DELETE (e.g. DELETE /ingest/by-project-slug/<slug>, or by the namespaced museum slugs museum-section-<slug> / museum-scene-<slug>) to wipe stale points across both collections, then re-ingests.
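The first property can be demonstrated with Python's uuid module. This is a sketch, not the id scheme used by dataland-rag-v2: the namespace value, the point_id name, and the slug-plus-chunk-index key format are all assumptions made for illustration.

```python
import uuid

# Hypothetical namespace; the real service may use a different one.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/dataland-rag")


def point_id(slug: str, chunk_index: int) -> str:
    """Derive a stable point id from the source slug and chunk position."""
    return str(uuid.uuid5(NAMESPACE, f"{slug}#{chunk_index}"))


a = point_id("museum-section-<slug>", 0)
b = point_id("museum-section-<slug>", 0)  # same input, same id: upsert in place
c = point_id("museum-section-<slug>", 1)  # different chunk, different id
```

UUIDv5 is a hash of namespace plus name, so the id depends only on the inputs and never on when or how often the ingest runs.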
Museum re-ingest (this change-set)¶
The 20 museum sections + their scenes + the museum overview were
re-ingested into the Qdrant knowledge collection, moving it from
~4839 → ~4969 points. Text flows to /ingest/file (knowledge); images flow
to /ingest/image (Gemini-captioned, images collection). Entity types on the
payloads are section / scene / museum (plus section_image /
scene_image).
Re-ingest is idempotent by design
Because ids are UUIDv5 and each sync deletes-then-reingests by slug, running the museum re-ingest twice converges on the same point set. The point-count delta (≈130) reflects new/edited content, not duplication. See Information WebUI and RAG.
6. Network split: Spark ↔ GCP migration¶
Living architecture for the museum / GCP network split. The authoritative working document is
MIGRATION_PLAN.md at the repo root; this section mirrors its diagrams + summary. See also Service hosting & relocation, the 2026-05-28 precursor.
Context¶
The Spark host is moving from the museum's vlan14 to vlan23. On vlan23 it has no public-internet access, but it can reach a private GCP VPC over a Cloud VPN / Interconnect link. The single-host stack splits in two:
- Spark (museum, vlan23): GPU- / ML-heavy services that benefit from local compute, plus Redis (which KTM requires to live on vlan23).
- GCP (public-facing): everything reachable from the public internet, plus Postgres and a new ai-proxy that fronts Gemini + GCS for Spark callers.
Network roles: KTM (museum-side network, owns the vlan23 transition), Christian (GCP project + VPC + VPN/Interconnect peer), Mert (application architecture, sits between the two).
Target topology¶
flowchart TB
Internet((Public Internet))
subgraph gcp["GCP VPC — public-facing half"]
direction TB
CF["Cloudflare Tunnel"]
subgraph gcp_public[" Public-facing apps "]
direction LR
AG["agent"]
AU["auth"]
WU["information-webui"]
DC["docs"]
end
subgraph gcp_async[" Async workers "]
direction LR
NA["notification-api"]
NW["notification-worker"]
end
subgraph gcp_infra[" Data & infra "]
direction LR
PG[("postgres<br/>Cloud SQL")]
AIP["ai-proxy<br/>(Gemini + GCS)"]
OBS["monitoring"]
BST["bastion + IAP"]
end
CF --> gcp_public
end
subgraph spark["Spark — museum, vlan23 (no internet)"]
direction TB
subgraph spark_app[" App services "]
direction LR
RG["rag"]
MU["museum-api"]
SIM["museum-simulator"]
end
subgraph spark_data[" Data plane "]
direction LR
QD[("qdrant")]
RD[("redis")]
end
end
Internet ==>|"end users"| CF
AIP -.->|"egress to<br/>Gemini, GCS"| Internet
gcp <==>|"Cloud VPN / Interconnect<br/>(only Spark↔GCP path)"| spark
classDef ext fill:#fff,stroke:#666,stroke-width:2px
class Internet ext
Key invariants of the new topology:
- Spark has no public internet. Any code path that today calls googleapis.com, GCS, Docker Hub, PyPI, etc. directly from a Spark container must route through a GCP service.
- Cloud VPN / Interconnect is the only Spark↔GCP path. All inter-half traffic crosses it, including hot paths like agent → rag and agent → redis.
- Cloudflare Tunnel is the only public ingress: as today, but terminating on GCP load balancers instead of on Spark.
- ai-proxy is the single boundary for Gemini + GCS. Spark callers (today: rag) treat it as a normal HTTP base URL; the proxy owns the GCP credentials and outbound calls to Google APIs.
Chat request flow (post-cut)¶
sequenceDiagram
participant App as Mobile app
participant CF as Cloudflare
participant Agent as agent (GCP)
participant Auth as auth (GCP)
participant SQL as postgres (GCP)
participant Redis as redis (Spark, via VPN)
participant Museum as museum-api (Spark, via VPN)
participant RAG as rag (Spark, via VPN)
participant QD as qdrant (Spark, local to rag)
participant Proxy as ai-proxy (GCP)
participant Gemini as Gemini API
App->>CF: POST /v1/chat (Bearer JWT)
CF->>Agent: forward
Agent->>Auth: JWKS (local in GCP)
Agent->>SQL: session / state (Cloud SQL, local)
Agent->>Redis: rate-limit check (VPN hop)
Agent->>Museum: GET vitals (VPN hop)
Agent->>RAG: POST /search (VPN hop)
RAG->>Proxy: embed query (reverse VPN hop)
Proxy->>Gemini: embed (GCP internet egress)
Gemini-->>Proxy: vector
Proxy-->>RAG: vector
RAG->>QD: vector search (local)
RAG->>RAG: ONNX rerank (local, heavy)
RAG-->>Agent: ranked passages
Agent->>Gemini: generate (direct, agent has internet)
Agent-->>App: SSE stream
A single chat round-trip can cross the VPN up to four times (agent→redis, agent→museum-api, agent→rag, and rag→ai-proxy). The link's RTT dominates end-to-end p50 / p95; the baseline measurement is an open action item in the working doc.
RAG search timeout vs. the new VPN RTT
The agent's RAG /search client read timeout was raised 10s → 25s
(app/config.py::rag_search_timeout_seconds = 25.0). A /search round-trip
(query embedding + rerank) runs ~10s today; a 10s client timeout caused
ReadTimeout → 3 retries (~30s) → a second search → 60s agent wall-clock → agent_timeout on museum-knowledge queries. The 25s budget lets a single
search complete on the first attempt. Re-validate this budget once the VPN
RTT is measured, since every agent → rag call becomes a cross-WAN hop.
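The arithmetic behind the failure chain and the new budget can be checked with a short sketch. It assumes every attempt runs to the full read timeout with no backoff, and that added link latency simply adds to the search time; both function names are hypothetical.

```python
def worst_case_wall_clock(timeout_s: float, attempts: int, searches: int) -> float:
    """Upper bound when every attempt of every search hits the read timeout."""
    return timeout_s * attempts * searches


def headroom(timeout_s: float, search_s: float, extra_rtt_s: float = 0.0) -> float:
    """Seconds left in the client budget after one search plus added link latency."""
    return timeout_s - (search_s + extra_rtt_s)


old_worst = worst_case_wall_clock(10.0, attempts=3, searches=2)  # 60.0 seconds
old_margin = headroom(10.0, search_s=10.0)                       # 0.0: no slack at all
new_margin = headroom(25.0, search_s=10.0)                       # 15.0 seconds of slack
```

Once the VPN RTT is measured, passing it as extra_rtt_s shows how much of the 15 seconds of slack the cross-WAN hop consumes.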
Content ingestion flow (post-cut)¶
sequenceDiagram
participant Curator
participant WebUI as information-webui (GCP)
participant GCS
participant RAG as rag (Spark, via VPN)
participant Proxy as ai-proxy (GCP)
participant Gemini as Gemini API
participant QD as qdrant (Spark, local)
Curator->>WebUI: upload artwork + metadata
WebUI->>GCS: PUT object (native GCP path)
WebUI->>RAG: POST /ingest/file (VPN hop)
RAG->>Proxy: embed chunks (reverse VPN hop)
Proxy->>Gemini: embed
Gemini-->>Proxy: vectors
Proxy-->>RAG: vectors
RAG->>QD: upsert (local)
Ingestion bytes (the file itself) move once from information-webui to GCS over
a native GCP path — they never cross the VPN. Only the control call
(/ingest/file with a GCS pointer) does. Embedding round-trips are per-chunk and
the dominant VPN cost during a large ingest such as the museum re-ingest in
§5.
Operator remote access¶
graph LR
Laptop["Engineer laptop"]
IAP["Google IAP<br/>(IAM-controlled)"]
Bastion["bastion VM<br/>(no public IP)"]
Spark["Spark host"]
Laptop -->|gcloud + HTTPS| IAP
IAP -->|tunnelled SSH| Bastion
Bastion -->|SSH over VPN| Spark
Fallback["Fallback: KTM-issued<br/>vlan23 VPN client"]
Laptop -.->|emergency only| Fallback
Fallback -.-> Spark
Default path is gcloud compute ssh bastion --tunnel-through-iap, with
ProxyJump set so ssh spark transparently traverses IAP + bastion. IAM
controls access; revoke = remove the role. The KTM VPN fallback exists only for
the case where GCP itself is down.
What changes per service¶
| Service | Before (single-host Spark) | After (Spark+GCP) | Notes |
|---|---|---|---|
| agent | Spark | GCP | Public chat endpoint at dataland.chat. |
| auth | Spark | GCP | JWKS must be publicly reachable. |
| information-webui | Spark | GCP | Operator-facing CMS. |
| docs | Spark | GCP | Public docs site (docs.dataland.chat). |
| notification-worker / notification-api | Spark | GCP | Need outbound to Slack / OneSignal / etc. |
| postgres | Spark | GCP (Cloud SQL) | Co-locates with agent + auth; the Alembic chain in §1 runs against this Cloud SQL instance post-cut. |
| rag | Spark | Spark | Heavy ONNX rerank + BM25; not easily portable. |
| qdrant | Spark | Spark | Co-located with rag to avoid per-search VPN hops. |
| redis | Spark | Spark (vlan23) | KTM directive. AOF persistence on local disk. |
| museum-api | Spark | Spark | Confirmed museum-internal consumers only. |
| museum-simulator | Spark | Spark | Writes to local Redis. |
| ai-proxy | (didn't exist) | GCP (new) | Fronts Gemini + GCS for Spark callers. |
| Monitoring (Prom / Grafana / AM) | Spark | GCP (lean) | Better alert-delivery from the internet side. |
Phased migration (sketch)¶
- Phase 0: decisions & baselines. Lock per-service split (done for the rows above); confirm GCP transport + RTT with Christian.
- Phase 1: vlan23 + GCP scaffolding. KTM moves Spark + Redis to vlan23 (vlan14 dual-homed temporarily); Christian stands up VPC, VPN, bastion, test VM. Acceptance: redis-cli -h <spark-redis> ping from a GCP box returns PONG.
- Phase 2: stateless GCP services live, dual-running. Bring auth, agent, information-webui, docs up on GCP pointing at Spark Redis (and still Spark Postgres). Cloudflare origin moves to GCP.
- Phase 3: Postgres move. Snapshot + restore Spark Postgres into Cloud SQL; cutover DSNs. Re-run alembic upgrade head against Cloud SQL and confirm the head revision matches.
- Phase 4: ai-proxy live. rag swaps its Gemini SDK / GCS client for the proxy base URL. Verified by running ingestion + chat with Spark's public internet blocked at the host level.
- Phase 5: vlan14 cut. KTM disconnects vlan14 from Spark. Smoke the full stack.
- Phase 6: cleanup. Remove migrated entries from the Spark compose stack; retire the Tailscale exposure path; document the new shape here.
Open questions¶
- Cross-WAN RTT once the VPN is up (drives whether per-request agent→rag is acceptable as-is, or whether the 25s search timeout / agent-side caching needs revisiting).
- DR posture per side: Cloud SQL handles GCP; what's the post-split backup plan for Spark qdrant + redis?
- Build & deploy path: Spark's docker pull and uv sync are dead post-cut. Default plan: GCP Artifact Registry + PyPI mirror, both reachable from Spark via the VPN. Christian to confirm Artifact Registry shape; we update deploy.sh after.
Living plan & change log¶
For the network split, the authoritative tracker — open action items, change
log, working notes — is MIGRATION_PLAN.md at the repo root. For schema +
backfills it is dataland-agent/migrations/README.md. For env ownership it is
docs/env-inventory.md. This page is updated when a
decision lands; it is not the working doc.