pgvector: why most RAG should just be Postgres

The contrarian case. pgvector + pgvectorscale gets you HNSW-indexed vector search inside the database you already operate. When a dedicated vector DB is worth it, and when it's just operational debt.

The default RAG stack in 2026 shouldn’t involve a dedicated vector database. For most teams shipping most applications, pgvector in the Postgres instance you already run — with pgvectorscale layered on when you outgrow it — is the right answer. The burden of proof belongs to the team reaching for Pinecone, Weaviate, Chroma, or Turbopuffer, not to the team using the database already in their stack.

That is a contrarian take if you’ve been reading vendor blog posts. It is not a contrarian take if you’ve actually shipped a RAG system in the last year. The gap between “dedicated vector DB” and “Postgres with an extension” closed somewhere in late 2023, and by now the dedicated-DB narrative is mostly a marketing artifact of a problem that was real in 2022 and isn’t anymore.

This primer is the case for pgvector-first, the honest list of cases where a dedicated vector DB is still the right call, and a working five-minute starter you can copy.

What pgvector actually is

pgvector is a Postgres extension. It adds three things:

A vector column type that stores fixed-dimension float arrays efficiently.
Distance operators: <-> for L2 (Euclidean), <=> for cosine distance, <#> for negative inner product.
Exact and approximate nearest-neighbor queries, with two index types: IVFFlat and HNSW.

That’s the whole surface. It’s deliberately small. You write normal SQL and one of your columns happens to be a vector.

CREATE EXTENSION vector;

CREATE TABLE documents (
  id          bigserial PRIMARY KEY,
  user_id     bigint NOT NULL REFERENCES users(id),
  content     text NOT NULL,
  embedding   vector(1536) NOT NULL,
  created_at  timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

You now have a vector search database. That’s it. The schema has normal columns next to the vector column, which is the feature that matters most — we’ll come back to it.

What pgvectorscale adds

pgvector on its own is great for the first few million vectors. Past that point, two things start to hurt: index build time and tail-latency consistency under write load. pgvectorscale, maintained by Timescale as an open-source companion extension, addresses both:

StreamingDiskANN — a disk-based ANN index inspired by Microsoft’s DiskANN work. Uses the disk as part of the index, not just memory, which matters when your embedding table is bigger than available RAM.
Statistical Binary Quantization (SBQ) — a compression scheme that shrinks vectors ~32x with minimal recall loss. Meaningful when you’re storing tens of millions of embeddings.
Better tail latency under concurrent reads and writes than pgvector’s HNSW at high scale.

Think of pgvectorscale as the “production hardening” layer. You don’t need it for your first few million vectors. You probably want it past ~10M, and you definitely want it past ~30M on a single node.

Both extensions install cleanly side-by-side. If your managed Postgres provider doesn’t have pgvectorscale available yet, that’s a real constraint worth checking before committing to the stack — as of early 2026, it’s available on Timescale Cloud, Supabase, and most self-managed Postgres deployments, but not all managed providers have caught up.

Why this is not obvious

The dedicated-vector-DB narrative was correct in 2022. pgvector existed but didn’t have HNSW indexes yet (that shipped in 0.5.0, mid-2023). Postgres with IVFFlat was measurably slower on realistic RAG workloads than Pinecone or Weaviate, and the gap was large enough to matter.

That gap closed in stages:

pgvector 0.5 (2023) — HNSW indexing. Query latency dropped into the same order of magnitude as dedicated vector DBs.
pgvectorscale (2024) — StreamingDiskANN and SBQ compression landed. The “Postgres can’t handle real scale” critique stopped being true for most definitions of real scale.
Ecosystem maturity (2025) — every major managed Postgres provider shipped pgvector support. Supabase made it a one-click enable. AWS RDS supported it. Neon, Crunchy, Railway, Fly — all in.

By early 2026, the question isn’t “can Postgres do this?” The question is “do I have a specific reason to pay the operational cost of a second database?”

For most teams, the honest answer is no.

The real selling point

The argument I find most compelling isn’t about latency or cost or feature parity. It’s about joins.

Here’s a query you write all the time in real RAG systems:

SELECT
  d.id,
  d.content,
  d.created_at,
  d.embedding <=> $1 AS distance
FROM documents d
WHERE
  d.user_id = $2
  AND d.created_at > now() - interval '30 days'
  AND d.project_id = ANY($3)
ORDER BY d.embedding <=> $1
LIMIT 10;

“Find the 10 documents most similar to this query embedding, belonging to user $2, in one of these projects, created in the last 30 days.” That is a single SQL query against one database. Planner picks the right index, uses the filter predicates to narrow the candidate set, does the vector comparison on what remains, returns results.

Now do that with a dedicated vector DB. You have two choices, both bad:

Denormalize everything into the vector DB as metadata fields, duplicate your access-control logic in two systems, and hope the metadata filter implementation in your vector DB is as good as Postgres’s query planner. Spoiler: it isn’t.
Query the vector DB for top-N candidates, then filter against Postgres in application code. You over-fetch by some multiple (10x? 100x?) and hope enough candidates survive the filter. Your relevance degrades, your latency is the sum of two network round-trips, and your application code is doing work the database planner should be doing.

Hybrid queries — vector search plus structured filters plus joins to other tables — are the thing that RAG actually looks like in production. Not “find the 10 nearest vectors.” That’s a demo. Production RAG is “find the 10 nearest vectors that this user is allowed to see, in these document categories, written by these authors, more recent than X, and also return the author’s team and the document’s source URL.”

Postgres does that natively. Every other vector DB makes you fight for it.

The setup

End-to-end setup for a fresh Postgres instance. Assumes you have psql access and are running Postgres 15+.

-- One-time extension install.
CREATE EXTENSION IF NOT EXISTS vector;
-- If you're using pgvectorscale:
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

-- A simple documents table.
CREATE TABLE documents (
  id          bigserial PRIMARY KEY,
  user_id     bigint NOT NULL,
  title       text NOT NULL,
  content     text NOT NULL,
  embedding   vector(1536) NOT NULL,
  metadata    jsonb NOT NULL DEFAULT '{}'::jsonb,
  created_at  timestamptz NOT NULL DEFAULT now()
);

-- HNSW index for approximate nearest-neighbor search.
CREATE INDEX documents_embedding_hnsw
  ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- A supporting B-tree for common filter predicates.
CREATE INDEX documents_user_created_idx
  ON documents (user_id, created_at DESC);

vector(1536) is the right size if you’re using OpenAI’s text-embedding-3-small. vector(1024) if you’re using Voyage’s voyage-3. Keep this number locked — changing it means re-embedding your entire corpus (more on that below).

HNSW parameters:

m — the number of bidirectional links per graph node. 16 is the default and the right starting point. Go up to 32 for higher-recall workloads; past that, diminishing returns and big build-time cost.
ef_construction — candidate list size during index build. 64 is the default. Higher = better index quality, slower build. 128 is a reasonable bump if you can afford the build time.

At query time, the knob that actually matters is ef_search:

SET hnsw.ef_search = 100;  -- default is 40

Higher ef_search = more candidates considered per query = higher recall, higher latency. 40 is fine for demos. 100 is the practical default for production. 200 if you genuinely need the extra recall and your p95 can absorb it.

Embedding choice, briefly

This primer is about the vector database, not the embeddings. But one paragraph of opinion, because it affects your schema:

In early 2026, the sensible defaults are OpenAI text-embedding-3-small (1536 dimensions, cheap, broadly good) or Voyage voyage-3 (1024 dimensions, slightly better on retrieval benchmarks, charges for it). Pick one. Stay with it. Changing embedding models means re-embedding your entire corpus, which at a million documents and a few cents per thousand embeddings is real money and real time. Don’t chase benchmark deltas — the difference between “good” and “great” embeddings is almost always smaller than the difference between a good retrieval system with no reranker and the same retrieval system with a reranker. More on that below.

HNSW vs IVFFlat

pgvector supports two approximate-NN index types. In 2026, the choice is almost always HNSW:

	HNSW	IVFFlat
Build cost	Higher (minutes to hours)	Lower (seconds to minutes)
Query latency	Excellent, very consistent	Good, high variance
Recall at similar speed	Better	Worse
Behavior on small tables	Great	Needs training data, gets weird
Update-heavy workloads	Fine	Fine
Recommended for	Almost everything	Legacy/niche cases

IVFFlat was the original pgvector index type. It works by clustering your vectors into Voronoi cells and only searching the cells nearest to the query. That’s efficient, but it has a sharp cliff: if you pick the wrong number of lists for your data size, recall falls off a cliff; if your data distribution shifts over time, you have to rebuild; and its behavior on small tables (under ~10k rows) is genuinely bad because the clustering can’t find meaningful structure.

HNSW doesn’t have those problems. It’s a layered navigable graph; it’s good at small tables, good at big tables, and tolerant of data distribution shifts. The only argument for IVFFlat today is “I need a very fast index build and I’ll accept the recall cost” — which, for most teams, is a bad trade.

Use HNSW. Don’t overthink it.

Latency and scale — honest numbers

I won’t invent specific benchmark numbers here, because the honest answer is “it depends on your hardware, your embedding dimension, your query patterns, and your filter predicates.” But in rough orders of magnitude:

Under 1M vectors on a reasonably-specced Postgres instance (say, 8 vCPU / 32 GB RAM): HNSW queries with filters comfortably under 50ms p95. You will not notice the vector layer in your latency budget.
1M to 10M vectors on the same box: still under 50ms p95 if your indexes fit in memory and your filters are selective. If your indexes don’t fit in memory, you will feel it — bump RAM before you bump extensions.
10M to 100M vectors: pgvectorscale’s StreamingDiskANN starts paying for itself. You can still do this on one box; you need to be intentional about RAM, disk IOPS, and index parameters.
Past 100M vectors: you should at least evaluate dedicated solutions. Not because Postgres can’t — it can, with partitioning and pgvectorscale — but because at that scale the ratio of “operational cost of scaling Postgres” to “operational cost of a managed vector DB” starts to flip.

The point isn’t specific numbers. The point is that “pgvector doesn’t scale” is a claim from 2022 that got copy-pasted into every vendor blog post since. Most teams aren’t anywhere near the regime where that concern is real.

pgvector vs dedicated vector DBs at a glance

A rough feature and cost comparison, the way I’d frame it if I were helping a team decide:

	pgvector (+ pgvectorscale)	Dedicated vector DB
New infrastructure	None if you already run Postgres	Yes — new service, new SDK, new failure mode
Filtered search (“where user_id = X”)	Native, planner-optimized, fast	Implemented as metadata filters, quality varies widely
Joins to application data	Native SQL joins	Round-trip to Postgres or denormalize
Transactional consistency between row + embedding	Yes	No — two systems, eventual consistency
Cost under 10M vectors	Your existing Postgres bill	$50–$500/mo on top of Postgres
Cost at 50M+ vectors	Scale Postgres vertically or partition	Often cheaper per-query at scale
Serverless / bursty pricing	Not really (Postgres wants to be running)	Turbopuffer and similar excel here
Operational surface	Standard Postgres ops	A second DB to monitor, back up, and upgrade
Team skill requirement	Whoever owns Postgres	New SDK, new query semantics to learn
Ceiling before it hurts	~30M–100M vectors on one node	Effectively unbounded (different tier)

Read that table as “defaults favor pgvector until you have a specific reason they don’t.” Which brings us to the specific reasons.

When a dedicated vector DB is actually right

I promised to be fair. Here are the cases where reaching past pgvector is the correct call:

You have no Postgres in your stack and no database team. If your backend is MongoDB, DynamoDB, or Firestore, and nobody on the team is comfortable running Postgres, adopting Postgres purely for RAG is a bigger commitment than picking a managed vector DB. In that world, Turbopuffer or Pinecone is the right answer, and nobody will fault you.
You need sub-10ms p99 on >50M vectors. This is a specialized regime. If your product’s UX literally breaks at 50ms, and you’re at 8-figures-plus of vectors, you want purpose-built infrastructure. This describes a narrower set of products than vendors would have you believe — most “we need it fast” translates to “we need it under 200ms p95,” which pgvector handles trivially.
You need serverless, bursty economics. This is Turbopuffer’s specific pitch and it’s a good one. If your RAG workload is 10 queries per minute on weekdays and 10,000 queries per minute during a once-a-quarter event, paying for a running Postgres instance that’s idle 99% of the time is wasteful. Serverless vector DBs shine here.
Multi-tenant vector search at scale, with per-tenant isolation as a hard requirement. At some point, running a vector index per tenant in Postgres stops being fun. Dedicated vector DBs with native namespace-per-tenant APIs are purpose-built for this. If you have thousands of small tenants and each one needs isolated retrieval, evaluate the dedicated options.

Three honest observations about this list:

None of these cases apply to most teams. If you recognize yourself in one, trust that recognition — you’re not most teams. If you don’t recognize yourself in any, don’t talk yourself into it.
Case 1 (no Postgres) is the most common genuinely valid reason.
Cases 2 and 4 are where the dedicated-vector-DB economics earn their keep. Case 3 is where the serverless specifically earns its keep — not a dedicated vector DB broadly.

Reranking is the real quality dial

A section worth reading even if you skip the rest of this piece.

Vector search gets you a candidate set. A reranker decides the final order. The quality delta between “pgvector top-10” and “pgvector top-50 → reranked to top-10” is almost always larger than the quality delta between “pgvector” and any dedicated vector DB.

This means teams spending ops effort on the vector DB choice while running no reranker are optimizing the wrong layer. The right shape for a modern RAG pipeline is:

Embed the query.
Get ~50 candidates from pgvector using HNSW.
Rerank with a cross-encoder reranker — Voyage’s rerank-2 or Cohere’s Rerank 3 are the two sensible defaults.
Pass the reranker’s top-10 to the LLM.

The reranker is a few hundred milliseconds and a few cents per thousand queries. The quality uplift is significant — often the difference between “the LLM’s answer is occasionally wrong” and “the LLM’s answer is reliably grounded.” If you’re not reranking, you’re leaving most of the achievable quality on the table, regardless of your vector DB.

Five-minute starter

A concrete Node/TypeScript snippet: embed a string with OpenAI, store it, query for nearest neighbors.

import OpenAI from "openai";
import postgres from "postgres";

const openai = new OpenAI();
const sql = postgres(process.env.DATABASE_URL!);

async function embed(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

async function store(userId: number, title: string, content: string) {
  const embedding = await embed(`${title}\n\n${content}`);
  // pgvector accepts a string-formatted array like '[0.1, 0.2, ...]'
  await sql`
    INSERT INTO documents (user_id, title, content, embedding)
    VALUES (${userId}, ${title}, ${content}, ${JSON.stringify(embedding)})
  `;
}

async function search(userId: number, query: string, limit = 10) {
  const queryEmbedding = await embed(query);
  const rows = await sql`
    SELECT
      id,
      title,
      content,
      embedding <=> ${JSON.stringify(queryEmbedding)} AS distance
    FROM documents
    WHERE user_id = ${userId}
    ORDER BY embedding <=> ${JSON.stringify(queryEmbedding)}
    LIMIT ${limit}
  `;
  return rows;
}

// Usage.
await store(42, "Deployment runbook", "To deploy, run ./deploy.sh ...");
const hits = await search(42, "how do I ship to production");
console.log(hits);

That’s a complete RAG retrieval path. ~40 lines. No second database, no sync job, no eventual-consistency worry.

Swap OpenAI for Voyage by changing the embed function; swap in the reranker step between search and response by calling Voyage’s rerank endpoint on the rows array. Both are drop-in.

Pitfalls

The mistakes I see most often:

Misconfigured ef_search. The default of 40 is too low for production. Bump it to 100 and measure. If your p95 latency budget allows 200, use 200 and enjoy the recall.

Forgetting to VACUUM. Postgres needs to reclaim dead tuples. If you do a lot of updates or deletes on your documents table, autovacuum may not keep up with the index bloat. Monitor pg_stat_user_tables and tune autovacuum aggressiveness on your vector-heavy tables.

Mixing dimension sizes. If you change embedding models mid-flight without re-embedding, you’ll have rows with vector(1536) and rows with vector(1024) mingling. Postgres will reject the insert, but application code that expects a single dimension can still get confused. Version your embeddings explicitly — either a column indicating which model produced each row, or separate tables per embedding version.

Inconsistent distance operator choice. Cosine (<=>), L2 (<->), and negative inner product (<#>) are different metrics. Your HNSW index is built for one of them (vector_cosine_ops, vector_l2_ops, vector_ip_ops). Query with a different operator and the index won’t be used — you’ll get correct results, but via a slow sequential scan. Match the query operator to the index operator class.

Under-provisioning RAM. HNSW indexes want to live in memory. If your working set evicts the index, query latency becomes unpredictable. Size your Postgres instance so the active index fits. This is the single most common cause of “pgvector is slow” complaints that aren’t actually pgvector’s fault.

Ignoring connection pooling. Postgres performs best with a bounded connection pool in front of it (PgBouncer, Supavisor, or your framework’s built-in pooler). A RAG endpoint that opens a new connection per request will hit connection limits long before it hits query latency limits.

Treating the vector DB as the whole RAG system. Chunking strategy, embedding model choice, reranking, context assembly, and prompt construction all matter more, on the margin, than which vector DB you picked. Get those right first.

Storing raw floats when you could compress. If you’re past a few million vectors and using pgvectorscale, SBQ compression is essentially free quality-wise and gives you substantial storage and memory savings. Many teams leave it off out of inertia. Turn it on when you cross the threshold where it matters; leaving it off at 20M vectors is a choice, not a default.

Building the HNSW index before bulk-loading. If you’re importing a large corpus, drop the HNSW index first, bulk-insert the rows, then create the index. Creating an index on an empty table and streaming rows in is an order of magnitude slower than creating the index on a populated table. This is a one-time mistake teams make on initial load and never repeat.

Where next

First eval suite with Braintrust — the next step after you have retrieval working is measuring whether it’s working well. A retrieval eval harness is not optional for production RAG.
MCP in 20 minutes — if you want your RAG retrieval exposed as a tool to a coding agent or chat client, wrapping it in an MCP server is the modern path. A Postgres-backed pgvector RAG makes an especially clean MCP server because everything lives in one place.
The pgvector docs and pgvectorscale docs are both excellent. Read the HNSW parameter tuning notes specifically — the defaults are good, but the tuning knobs are worth understanding before you hit scale.

The short version: pgvector is the default. Use it first. Reach for a dedicated vector database only when you’ve identified a specific, named reason — not because the vendor blog told you to.