Qdrant

dataland-qdrant is the vector store that backs all of Dataland's retrieval. It holds three collections (knowledge, images, and scenes), each storing 3072-dimensional Gemini embeddings compared by cosine distance. The knowledge collection additionally carries a native BM25 sparse vector so RAG can fuse dense and lexical signals on the museum corpus.

Only dataland-rag-v2 reads and writes Qdrant. Every other service that needs retrieval (the Agent's search_knowledge / search_artwork_images tools, the Information WebUI's live-sync) goes through RAG's HTTP API, never to Qdrant directly.

| Property | Value |
| --- | --- |
| Container | dataland-qdrant |
| Image | qdrant/qdrant:v1.12.4 (QDRANT_VERSION) |
| HTTP (REST) port | 4146 → container :6333 (QDRANT_HTTP_PORT) |
| gRPC port | 4147 → container :6334 (QDRANT_GRPC_PUBLIC_PORT) |
| Host bind | 127.0.0.1 + QDRANT_PUBLIC_BIND (tailnet 100.124.170.43). Never 0.0.0.0 |
| Internal URLs | http://dataland-qdrant:6333, dataland-qdrant:6334 (gRPC) |
| Memory / CPU | mem_limit 4 GB (reservation 1 GB) / cpus 2.0 |
| Persistence | Named volume qdrant-data → /qdrant/storage |
| API key | None (QDRANT_API_KEY empty); the tailnet is the trust boundary |

No API key — the network is the only fence

Qdrant runs with no authentication. That is deliberate but load-bearing: the container is published on 127.0.0.1 (local + SSH tunnel) and on the Tailscale interface (QDRANT_PUBLIC_BIND=100.124.170.43, exposed as spark:4146 / spark:4147), and must never bind 0.0.0.0. RAG reaches Qdrant over the internal dataland-network Docker bridge (QDRANT_HOST=dataland-qdrant) regardless of host binding, so the published host ports exist purely for operator inspection and direct tailnet peer access. If you ever flip a *_PUBLIC_BIND to a public interface, you have handed the whole vector store to the internet. (Bind discipline tightened in DAT-73 / DAT-291.)


How RAG connects

RAG builds the client from its settings (storage/client.py):

QdrantClient(
    host=settings.qdrant_host,        # (1)!
    port=settings.qdrant_port,        # (2)!
    grpc_port=settings.qdrant_grpc_port,  # (3)!
    api_key=settings.qdrant_api_key or None,  # (4)!
    prefer_grpc=False,                # (5)!
)
  1. QDRANT_HOST, resolves to dataland-qdrant on the internal dataland-network Docker bridge. The container name is the DNS name; this is why RAG reaches Qdrant regardless of host port binding.
  2. QDRANT_PORT, the REST port 6333 inside the container (published to the host as 4146).
  3. QDRANT_GRPC_PORT, 6334. Built into the client but unused at runtime because prefer_grpc=False.
  4. QDRANT_API_KEY — empty today, so this evaluates to None. The tailnet is the only trust boundary; there is no key to pass.
  5. Forces the REST transport on :6333. The gRPC port is published for tooling but the service itself never speaks gRPC.
| Setting (env) | Default |
| --- | --- |
| QDRANT_HOST | dataland-qdrant |
| QDRANT_PORT | 6333 |
| QDRANT_GRPC_PORT | 6334 |
| QDRANT_API_KEY | (empty) |
| KNOWLEDGE_COLLECTION | knowledge |
| IMAGES_COLLECTION | images |
| SCENES_COLLECTION | scenes |

prefer_grpc=False means RAG talks REST on :6333; the gRPC port is published for tooling but not used by the service.
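As a rough illustration of how the environment settings above could map onto the client arguments, here is a minimal sketch. The `QdrantSettings` dataclass and `client_kwargs` helper are hypothetical names introduced for this example; the real settings class in RAG may be structured differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class QdrantSettings:
    """Hypothetical settings holder; defaults mirror the settings table."""

    qdrant_host: str = "dataland-qdrant"
    qdrant_port: int = 6333
    qdrant_grpc_port: int = 6334
    qdrant_api_key: str = ""

    @classmethod
    def from_env(cls, env=None):
        # Accept an explicit mapping so the function is easy to test.
        env = os.environ if env is None else env
        return cls(
            qdrant_host=env.get("QDRANT_HOST", "dataland-qdrant"),
            qdrant_port=int(env.get("QDRANT_PORT", "6333")),
            qdrant_grpc_port=int(env.get("QDRANT_GRPC_PORT", "6334")),
            qdrant_api_key=env.get("QDRANT_API_KEY", ""),
        )


def client_kwargs(settings):
    """Build the keyword arguments shown in the QdrantClient(...) call."""
    return {
        "host": settings.qdrant_host,
        "port": settings.qdrant_port,
        "grpc_port": settings.qdrant_grpc_port,
        # An empty key collapses to None, i.e. no authentication header.
        "api_key": settings.qdrant_api_key or None,
        # REST only; the gRPC port is configured but not used.
        "prefer_grpc": False,
    }
```

Passing `**client_kwargs(QdrantSettings.from_env())` to `QdrantClient` would reproduce the call above under these assumptions.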


The three collections

| Name | Points (approx.) | Purpose | Write path |
| --- | --- | --- | --- |
| knowledge | ~4969 | Museum docs, sections, scenes-as-text, blog posts, external articles, uploaded files | /ingest/file, GCS /sync, Information WebUI live-sync |
| images | ~1485 | Artwork images + extracted-document images | /ingest/image, GCS /sync, Information WebUI uploads |
| scenes | (small) | Museum scene metadata embedded for semantic similarity | GCS /sync of museum/scenes/*.json |

Recent changes

  • Knowledge grew 4839 → 4969 when the 20 museum sections + their scenes + the museum overview were (re-)ingested into Qdrant knowledge (today's change-set). The agent now answers room/section questions from this corpus and speaks the real room names — Data Pavilion (GA), Latent Gallery (GB), Infinity Room (GC), The Sanctuary (GD), Discovery Portal (ON) — instead of bare codes (DAT-281).
  • DAT-288: the default_reference_* placeholder images were purged from chapters.json and the GCS cobanov-public/chapters bucket. Those reference images are URL-only and were never embedded into Qdrant, so the images collection was unaffected.

Why three, and why they differ

All three share one vector geometry, but they are populated by different embedder calls and serve different query shapes:

flowchart LR
    subgraph Qdrant[dataland-qdrant 3072d cosine]
        K[knowledge<br/>dense + sparse BM25]
        I[images<br/>dense only]
        S[scenes<br/>dense only]
    end
    subgraph Embedder[GeminiEmbedder gemini-embedding-2-preview]
        D[RETRIEVAL_DOCUMENT]
        Q[RETRIEVAL_QUERY]
        SS[SEMANTIC_SIMILARITY]
        M[multimodal image]
    end
    D -->|chunk text| K
    SS -->|scene description| S
    M -->|image bytes| I
    Q -.->|query text| K & I
    M -.->|query image| I
  • knowledge stores text chunks embedded with task type RETRIEVAL_DOCUMENT; queries use RETRIEVAL_QUERY. This is the only hybrid collection — see Sparse BM25.
  • images stores image vectors embedded from the image bytes themselves (multimodal, no task type). It supports both image→image (search_by_image) and text→image (search_images_by_text, using a RETRIEVAL_QUERY text vector against the same image space).
  • scenes embeds the scene description with SEMANTIC_SIMILARITY.

Vector configuration

Every collection is created with one shared spec (storage/schema.py):

from qdrant_client.models import Distance, VectorParams

VECTOR_SIZE = 3072  # (1)!
VECTOR_DISTANCE = Distance.COSINE  # (2)!
COLLECTION_VECTOR_PARAMS = VectorParams(size=VECTOR_SIZE, distance=VECTOR_DISTANCE)  # (3)!
  1. The Gemini gemini-embedding-2-preview output dimension. This is baked into every stored point — changing it forces a full drop-and-re-ingest of all collections (see the danger admonition below).
  2. Cosine distance, matching how Gemini embeddings are normalized. All three collections share this geometry.
  3. One shared VectorParams reused for knowledge, images, and scenes at creation time, so the dense space is identical across all three.
| Property | Value | Source |
| --- | --- | --- |
| Dense dimensions | 3072 | EMBEDDING_DIM, VECTOR_SIZE |
| Distance | Cosine | VECTOR_DISTANCE |
| Embedding model | gemini-embedding-2-preview | EMBEDDING_MODEL |
| Batch size | 50 | EMBEDDING_BATCH_SIZE |

The embedding model is gemini-embedding-2-preview, not the chat model

Vectors come from Gemini's embedding model, gemini-embedding-2-preview. That is the effective value in both .env.example and compose, and the Python class default matches it; only config.py's bare default differs (gemini-embedding-2). This is a separate model from the chat / captioning model, which is gemini-3.5-flash (GEMINI_MODEL, standardized in DAT-269). Do not conflate the two: the chat model never touches a vector, and the embedding model never writes a chat token.

Changing the embedding model is a full re-index

The 3072 dimension and the cosine geometry are baked into every stored point. Switching EMBEDDING_MODEL to a model with a different output dimension — or even the same dimension but a different latent space — makes all existing vectors incomparable to new queries. There is no in-place migration: you would drop and re-ingest every collection (hours of Gemini-billed embed calls). Treat EMBEDDING_MODEL as immutable for the life of a qdrant-data volume.
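One way to make part of this constraint fail loudly is to validate every vector before it is upserted. The sketch below is illustrative only; `check_dense_vector` is a hypothetical helper, not something taken from the RAG codebase.

```python
import math

VECTOR_SIZE = 3072  # matches the shared collection spec


def check_dense_vector(vector, expected_dim=VECTOR_SIZE):
    """Raise ValueError if a vector cannot sit in the 3072-d cosine space."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(vector)}"
        )
    if not all(math.isfinite(x) for x in vector):
        raise ValueError("vector contains NaN or infinite values")
    if not any(vector):
        # Cosine similarity is undefined for a zero-length vector.
        raise ValueError("zero vector has no direction under cosine distance")
    return vector
```

A guard like this only catches a change in output dimension. A model with the same dimension but a different latent space would pass the check and still return meaningless neighbours, which is why the model name itself should be treated as fixed.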

Sparse BM25 (knowledge only)

The knowledge collection gets a second, sparse named vector for lexical/keyword recall, fused with the dense channel via Reciprocal Rank Fusion server-side. It is provisioned at startup (ensure_collections → _ensure_sparse_support) and is idempotent:

from qdrant_client.models import Modifier, SparseVectorParams

sparse_vectors_config = {
    "sparse": SparseVectorParams(modifier=Modifier.IDF),  # (1)!
}
  1. Modifier.IDF tells Qdrant to apply inverse-document-frequency weighting natively, server-side, so RAG only ships raw term vectors. The vector is named sparse (SPARSE_VECTOR_NAME); images and scenes never get one, which is why hybrid search only fires when the target collection is knowledge.
| Property | Value |
| --- | --- |
| Sparse vector name | sparse (SPARSE_VECTOR_NAME) |
| Modifier | IDF (inverse document frequency); Qdrant computes IDF natively |
| Sparse model | Qdrant/bm25 (SPARSE_EMBEDDING_MODEL), run client-side via FastEmbed |
| Enabled by | HYBRID_SEARCH_ENABLED=true |

The BM25 sparse vectors themselves are produced client-side by RAG's FastEmbed SparseTextEmbedding (retrieval/sparse.py) and upserted as a second vector on each knowledge point; the IDF weighting and fusion happen inside Qdrant. images and scenes have no sparse vector — hybrid search only fires when the target collection is knowledge (Searcher._hybrid_query guards on collection == knowledge_collection).
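To make the fusion step concrete, here is a small pure-Python sketch of Reciprocal Rank Fusion over a dense and a sparse result list. It illustrates the general technique only: in Dataland the fusion runs inside Qdrant, and the constant `k=60` is a commonly used default assumed here, not necessarily the value Qdrant applies internally.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists; score(id) = sum of 1 / (k + rank), rank from 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda pid: (-scores[pid], str(pid)))


dense_hits = ["a", "b", "c"]   # order returned by the dense channel
sparse_hits = ["c", "a", "d"]  # order returned by the BM25 channel
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['a', 'c', 'b', 'd']
```

Points that appear in both lists ("a" and "c") outrank points found by only one channel, which is the behaviour hybrid search relies on.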

Backfilling sparse vectors

Documents ingested before hybrid was enabled have dense-only points. SPARSE_BACKFILL_ON_STARTUP=true (default off) walks the collection at boot and adds the missing sparse vector in batches of SPARSE_BACKFILL_BATCH_SIZE (64). Day-to-day, leave it off — fresh ingests write both vectors in one pass (embed_and_upsert_knowledge).
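The batching behaviour can be sketched as follows. This is an assumption-laden illustration: `missing_sparse_batches` is a hypothetical helper, and it assumes each point is a dict with an `id` and a `vector` mapping of named vectors, which may not match how RAG actually walks the collection.

```python
from itertools import islice

SPARSE_VECTOR_NAME = "sparse"
SPARSE_BACKFILL_BATCH_SIZE = 64


def missing_sparse_batches(points, batch_size=SPARSE_BACKFILL_BATCH_SIZE):
    """Yield lists of point IDs lacking the sparse vector, batch_size at most."""
    pending = (
        point["id"]
        for point in points
        if SPARSE_VECTOR_NAME not in (point.get("vector") or {})
    )
    while True:
        batch = list(islice(pending, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch would then be embedded with the sparse model and written back; points that already carry the sparse vector are skipped, so a repeated run does no extra work.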

Payload indexes (knowledge only)

On startup RAG also ensures payload indexes on knowledge so filtered and phrase searches stay fast (_ensure_payload_indexes):

  • Text indexes (multilingual tokenizer, lowercase, ASCII-folding, phrase matching, min_token_len=2) on content, title, section_title, source.
  • Keyword indexes on category, source_type, doc_id, source_path.

These back the optional "text-scroll" lexical channel (TEXT_SEARCH_ENABLED, off by default since DAT-167 because it is redundant with BM25 and dominates latency at scale) and any MatchValue / MatchAny filters.


Payload shapes

The payload is the JSON metadata stored alongside each vector. RAG returns it on every search hit (minus the bulky content, which is hoisted to the result's content field) and uses several keys for dedup, filtering, and the rerank prompt.

knowledge payload

Written by embed_and_upsert_knowledge (ingestion/ingestors/document.py). Point IDs are random UUIDv4 (one per chunk).

| Key | Type | Notes |
| --- | --- | --- |
| content | str | The chunk text (≈400 tokens, 60 overlap) |
| title | str | Derived from filename |
| source | str | Original filename or URL |
| category | str | documents, blog-posts, or external (derived from GCS prefix) |
| source_type | str | File extension (pdf, md, txt, …) |
| source_path | str | bucket/prefix/filename; dedup key |
| doc_id | str | doc_<12hex>; groups all chunks of one document |
| chunk_index | int | Position within the document |
| token_count | int | |
| section_title | str? | When the parser found a heading |
| page_number | int? | When available |
| tags | list[str] | |
| created_at | str | ISO-8601 |

WebUI museum payloads (entity_type, museum_slug)

The Information WebUI ingests museum sections/scenes/overview into knowledge via POST /ingest/file with an extra_metadata JSON blob, which RAG merges onto the base payload (any key outside the reserved set is passed through verbatim). That is how knowledge points pick up WebUI-specific keys like entity_type (section | scene | museum) and museum_slug, alongside the RAG slug convention museum-section-<slug> / museum-scene-<slug>. The reserved keys the caller may not override are content, title, source, source_path, source_type, category, doc_id, chunk_index, token_count, section_title, page_number, created_at, tags (a 400 is returned if any appears) — they are set by the ingestor and are load-bearing for dedup and the rerank pipeline.
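The merge rule described above can be sketched in a few lines. The function and exception names are hypothetical; only the reserved key list is taken from this page, and the translation of the error into an HTTP 400 is assumed to happen in the API layer.

```python
RESERVED_KEYS = frozenset({
    "content", "title", "source", "source_path", "source_type", "category",
    "doc_id", "chunk_index", "token_count", "section_title", "page_number",
    "created_at", "tags",
})


class ReservedKeyError(ValueError):
    """Hypothetical error; an HTTP layer would map it to a 400 response."""


def merge_extra_metadata(base_payload, extra_metadata):
    """Return base_payload plus caller extras, rejecting reserved keys."""
    clashes = sorted(RESERVED_KEYS.intersection(extra_metadata))
    if clashes:
        raise ReservedKeyError(
            f"reserved keys not allowed in extra_metadata: {clashes}"
        )
    # Non-reserved keys pass through verbatim onto the stored payload.
    return {**base_payload, **extra_metadata}
```

For example, merging `{"entity_type": "section", "museum_slug": "dataland"}` onto a base payload adds both keys, while an `extra_metadata` containing `doc_id` is rejected before anything is written.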

images payload

Every image point — regardless of which of the three ingest paths produced it — shares the same 11 base keys via the unified build_image_point helper (DAT-172):

file_name, title, caption, keywords, tags, gs_uri, public_url, collection_name, artwork_id, source_path, ingested_at.

| Key | Type | Notes |
| --- | --- | --- |
| gs_uri | str | Canonical gs://bucket/path URI |
| public_url | str | Full HTTPS URL (rendered by the chat client) |
| collection_name | str | Sub-folder under artworks/ for GCS ingest; the project_slug for WebUI uploads |
| artwork_id | str | Filename stem |
| source_path | str | dedup key |
| caption | str | Gemini-generated (gemini-3.5-flash) on GCS/extracted ingest; curator-supplied on WebUI |

The WebUI image path (/ingest/image) layers ~20 extra keys on top through extras, including the all-important project_slug (plus project_id, project_name, image_id, width/height, description, alt_text, source_type="webui-image", and provenance fields bucket, object_path, cdn_url, source_url, source_page). These extras keep the Qdrant payload in lockstep with the WebUI's DB schema.

Deterministic IDs for WebUI images (UUIDv5)

The GCS and extracted-document image paths use random UUIDv4 IDs. The WebUI path mints a deterministic UUIDv5 from source_path (namespace 4b7c8a3a-…-1c7a6b3a51d2), so re-ingesting the same image upserts in place rather than accumulating duplicates — load-bearing for the WebUI live-sync hook. The WebUI also issues DELETE /ingest/by-project-slug/{slug} (filtering on payload.project_slug) to wipe a project's footprint across both images and knowledge before a re-sync.
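The idempotency property comes directly from how UUIDv5 works: the same namespace and name always hash to the same ID. A minimal sketch follows; the namespace used here is a placeholder invented for the example, because the real namespace UUID is elided on this page, and `webui_image_point_id` is a hypothetical helper name.

```python
import uuid

# Placeholder namespace for illustration only. The real namespace UUID is
# elided in this document and is deliberately not reproduced here.
EXAMPLE_NAMESPACE = uuid.uuid5(
    uuid.NAMESPACE_URL, "https://example.invalid/webui-images"
)


def webui_image_point_id(source_path, namespace=EXAMPLE_NAMESPACE):
    """Derive a stable point ID from source_path (UUIDv5, SHA-1 based)."""
    return str(uuid.uuid5(namespace, source_path))


first = webui_image_point_id("bucket/projects/demo/cover.png")
second = webui_image_point_id("bucket/projects/demo/cover.png")
other = webui_image_point_id("bucket/projects/demo/detail.png")
```

Here `first` and `second` are identical, so a second ingest of the same path overwrites the existing point, while `other` gets its own ID.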

scenes payload

Written by ingest_scene. Point IDs are random UUIDv4.

| Key | Type | Notes |
| --- | --- | --- |
| scene_id | str | From the JSON id, or filename stem |
| title, description, location | str | description is what gets embedded |
| tags | list[str] | |
| related_images | list[str] | |
| metadata | dict | Any extra JSON fields not mapped above |
| source_path | str | dedup key |
| created_at | str | ISO-8601 |

Search filtering

Callers may scope a /search with a filters map, but RAG enforces an allowlist (services/search_filters.py) — any key outside it is silently dropped so a caller cannot probe internal payload state or force expensive unindexed filters:

category, source_type, source, source_path, doc_id, tags,
collection_name, artwork_id, project_slug, project_id, project_name

Per-project search isolation depends on project_slug being on this list (DAT-232): filters={"project_slug": "<x>"} was previously dropped to an unfiltered search and leaked cross-project hits. Scalar values become MatchValue; lists become MatchAny. For the full retrieval flow (dense + RRF-hybrid + rerank), see RAG.
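The allowlist and the scalar/list mapping can be sketched as below. The output uses the JSON shape of Qdrant's REST filter conditions (`match.value` and `match.any`); the `build_filter` helper itself is a hypothetical stand-in for whatever services/search_filters.py actually does.

```python
ALLOWED_FILTER_KEYS = frozenset({
    "category", "source_type", "source", "source_path", "doc_id", "tags",
    "collection_name", "artwork_id", "project_slug", "project_id",
    "project_name",
})


def build_filter(filters):
    """Return a Qdrant-style filter dict, or None if no key is allowed."""
    must = []
    for key, value in filters.items():
        if key not in ALLOWED_FILTER_KEYS:
            continue  # silently dropped, as described above
        if isinstance(value, (list, tuple, set)):
            must.append({"key": key, "match": {"any": list(value)}})
        else:
            must.append({"key": key, "match": {"value": value}})
    return {"must": must} if must else None
```

With this shape, `build_filter({"project_slug": "demo", "internal_flag": 1})` keeps only the `project_slug` condition, which is the isolation behaviour DAT-232 depends on.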


Healthcheck quirk

The Qdrant image ships bash but no curl / wget, so the compose healthcheck uses bash's /dev/tcp builtin to TCP-connect to :6333, send a raw GET /healthz HTTP/1.0, read the response, and grep -q passed:

test:
  - "CMD-SHELL"
  - "bash -c 'exec 3<>/dev/tcp/localhost/6333 && printf \"GET /healthz HTTP/1.0\\r\\nHost: localhost\\r\\n\\r\\n\" >&3 && cat <&3 | grep -q passed'"  # (1)!
interval: 10s
timeout: 5s
retries: 5
start_period: 15s  # (2)!
  1. No curl/wget in the image, so this opens a TCP socket via bash's /dev/tcp builtin, hand-writes a raw HTTP/1.0 request to /healthz, and greps the response for passed. Qdrant binds :6333 only after storage init, so one check covers both "process up" and "ready to serve".
  2. 15s grace before failed checks count, giving the storage layer time to initialise and open :6333. Without it the container would flap unhealthy on a cold boot.

Qdrant binds :6333 only after the storage layer is initialised, so this single check catches both "process up" and "ready to serve".
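For operators who prefer to run the same probe from Python, the healthcheck logic translates roughly as follows. This is a sketch of the technique, not the compose healthcheck itself; the function names are hypothetical and the default host and port assume the probe runs inside the container's network namespace.

```python
import socket


def response_is_healthy(raw):
    """Mirror `grep -q passed`: healthy when the body mentions b"passed"."""
    return b"passed" in raw


def qdrant_healthz(host="localhost", port=6333, timeout=5.0):
    """Send a raw HTTP/1.0 GET /healthz over TCP and inspect the reply."""
    request = b"GET /healthz HTTP/1.0\r\nHost: localhost\r\n\r\n"
    chunks = []
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            # HTTP/1.0 closes the connection after the response, so read
            # until the peer signals end-of-stream.
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
    except OSError:
        # Connection refused or timed out: the port is not open yet.
        return False
    return response_is_healthy(b"".join(chunks))
```

From the host, the equivalent call would target the published port, for example `qdrant_healthz("localhost", 4146)`.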


Reaching it

# Health (loopback, on the VDS):
curl -fsS http://localhost:4146/healthz  # (1)!

# List collections:
curl -fsS http://localhost:4146/collections | jq

# Per-collection stats (vector count, status, config):
curl -fsS http://localhost:4146/collections/knowledge | jq
curl -fsS http://localhost:4146/collections/images   | jq
curl -fsS http://localhost:4146/collections/scenes   | jq

# Native Prometheus endpoint (scraped by dataland-prometheus):
curl -fsS http://localhost:4146/metrics  # (2)!
  1. Port 4146 is the host publish of container :6333 (QDRANT_HTTP_PORT). -fsS fails on HTTP error, stays silent on progress, but still prints errors. No API key — the loopback/tailnet boundary is the only gate.
  2. Qdrant's native Prometheus exposition endpoint, scraped by dataland-prometheus. Same port 4146, no auth.

From a tailnet peer, swap localhost for spark (100.124.170.43) on the same 4146 / 4147 ports. There is no API key to pass — access is gated entirely by the network boundary.
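The per-collection curl calls can also be scripted. The sketch below assumes the standard Qdrant collection-info response, where the useful fields sit under `result`; the helper names are hypothetical and the base URL assumes loopback access on host port 4146.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:4146"  # use http://spark:4146 from a tailnet peer
COLLECTIONS = ("knowledge", "images", "scenes")


def summarize(collection_response):
    """Pick the status and point count out of a collection-info response."""
    result = collection_response["result"]
    return {
        "status": result.get("status"),
        "points_count": result.get("points_count"),
    }


def fetch_summary(name, base_url=BASE_URL, timeout=5.0):
    """GET /collections/<name> and return the summarized fields."""
    with urlopen(f"{base_url}/collections/{name}", timeout=timeout) as resp:
        return summarize(json.load(resp))
```

Calling `fetch_summary(name)` for each entry in `COLLECTIONS` gives a quick view of whether the point counts match the approximate figures listed for the three collections.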


Volumes + backup + recovery

qdrant-data  -> /qdrant/storage   (named volume, stateful)

Losing qdrant-data is a full re-ingest

Qdrant is the heaviest stateful tenant (DAT-80: 4 GB / 2 cores). If you lose the volume, RAG can rebuild everything from GCS — knowledge/images/scenes re-derive from the source buckets — but expect several hours of Gemini-billed embed + caption calls plus a full Information WebUI live-sync cycle to repopulate the WebUI-curated points. The legacy GCS image points and the WebUI image points are distinguishable by project_slug (WebUI points have it; ~66 legacy points do not). Backup procedure: dataland-infrastructure/reports/backup-restore.md.
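To see how many legacy image points exist before or after a rebuild, one option is a count request with an `is_empty` condition on project_slug. The sketch below targets Qdrant's points count REST endpoint; the helper names are hypothetical, and the base URL assumes loopback access on host port 4146.

```python
import json
from urllib.request import Request, urlopen


def legacy_image_count_body():
    """Count body matching points whose project_slug is missing or empty."""
    return {
        "filter": {"must": [{"is_empty": {"key": "project_slug"}}]},
        "exact": True,
    }


def count_legacy_images(base_url="http://localhost:4146",
                        collection="images", timeout=5.0):
    """POST the count request and return the number of matching points."""
    request = Request(
        f"{base_url}/collections/{collection}/points/count",
        data=json.dumps(legacy_image_count_body()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)["result"]["count"]
```

If the description above holds, the returned number should be close to the roughly 66 legacy points; everything else in images carries a project_slug and came from the WebUI.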

See also

  • RAG (dataland-rag-v2) — the only reader/writer; hybrid search, reranking, ingest endpoints.
  • Information WebUI — live-sync source of museum + project payloads.
  • Agent — consumes retrieval via search_knowledge / search_artwork_images tools.