What is the difference between keyword search and vector search?

Keyword search finds documents containing specific words. Vector search finds documents that are semantically similar (discussing the same concepts, topics, or entities) even when they use different vocabulary. For research and discovery tasks in journalism, where the exact wording of past coverage is rarely known in advance, vector search is often the more effective of the two.

What embedding models are used in newsroom vector search?

Common embedding models for newsroom applications include OpenAI text-embedding-3-small (cost-effective, high performance), OpenAI text-embedding-3-large (highest accuracy), Google text-embedding-004, Cohere embed-v3, and open-source models like BGE-M3 and E5-large for self-hosted deployments.

What is pgvector?

pgvector is an open-source PostgreSQL extension that adds vector storage and similarity search capabilities to PostgreSQL databases. It supports cosine, dot product, and Euclidean distance metrics, and HNSW indexing for fast approximate nearest-neighbour search at scale.

How much archive content can vector search handle?

Production vector search systems routinely handle tens of millions of document chunks. pgvector with HNSW indexing maintains sub-100ms query times at million-document scale. Pinecone, Weaviate, and Qdrant are purpose-built vector databases that scale to billions of vectors.
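The "document chunks" mentioned above come from splitting each article into smaller passages before embedding. The sketch below shows one simple way to do that with overlapping word windows. The function name `chunk_words` and the `chunk_size` and `overlap` defaults are illustrative assumptions, not settings taken from any particular system.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into overlapping windows of words.

    chunk_size and overlap are illustrative defaults, not recommendations.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A synthetic 500-word "article" made of placeholder tokens w0..w499.
article = " ".join(f"w{i}" for i in range(500))
chunks = chunk_words(article)
print(len(chunks), "chunks")
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some words twice.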

Can vector search replace traditional CMS search?

Vector search complements but does not replace traditional keyword search. A hybrid approach — combining vector similarity search with keyword filtering and metadata facets — typically outperforms either approach alone, enabling both semantic concept search and precise title/author/date filtering.
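As a rough illustration of the hybrid idea, the sketch below filters on metadata first and then blends a semantic score with a keyword score. Everything in it is an assumption made for the example: the sample articles and author names are invented, the two-dimensional vectors stand in for real embeddings, and the simple weighted blend (with `alpha=0.7`) is just one of several ways to combine the two signals.

```python
import math
from datetime import date


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def keyword_score(query, text):
    """Fraction of query terms that appear verbatim in the text."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return sum(t in words for t in terms) / len(terms)


def hybrid_search(query, query_vec, docs, author=None, since=None, alpha=0.7):
    """Filter on metadata, then blend semantic and keyword scores.

    alpha weights the vector score; 0.7 is an arbitrary illustrative value.
    """
    results = []
    for doc in docs:
        if author and doc["author"] != author:
            continue  # metadata facet: exact author match
        if since and doc["published"] < since:
            continue  # metadata facet: publication date floor
        score = (alpha * cosine_similarity(query_vec, doc["vector"])
                 + (1 - alpha) * keyword_score(query, doc["text"]))
        results.append((round(score, 3), doc["title"]))
    return sorted(results, reverse=True)


# Invented sample data; the vectors are toy stand-ins for embeddings.
docs = [
    {"title": "Investor linked to ministry tender", "author": "A. Reyes",
     "published": date(2019, 5, 2), "vector": [0.9, 0.1],
     "text": "an investor close to the minister won the tender"},
    {"title": "Corruption trial opens", "author": "B. Chan",
     "published": date(2023, 1, 15), "vector": [0.8, 0.3],
     "text": "the corruption trial of a former official opened today"},
    {"title": "Cup final report", "author": "A. Reyes",
     "published": date(2022, 8, 9), "vector": [0.1, 0.9],
     "text": "the home side won the cup final"},
]

hits = hybrid_search("corruption official", [0.85, 0.2], docs, since=date(2020, 1, 1))
for score, title in hits:
    print(score, title)
```

Note that the first sample article never uses the word "corruption", so only the vector half of the score can surface it, while the date and author filters stay exact.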

Vector Search in Newsrooms: Finding Hidden Connections in Your Archive

What Is Vector Search?

Vector search (also called semantic search or embedding-based search) is a retrieval technique that represents documents as numerical vectors in a high-dimensional space, enabling search by meaning rather than keyword matching. Unlike traditional keyword search, which returns documents containing the exact terms queried, vector search finds documents that are conceptually or semantically similar — even when they use entirely different vocabulary.

The technology works by encoding text — whether a news article, a query, or a document excerpt — into a numerical vector using an embedding model (such as OpenAI's text-embedding-3-small, Google's text-embedding-004, or open-source models like bge-large-en). Two texts that discuss the same concept will be encoded as vectors that are close together in the embedding space, enabling similarity search that matches concepts rather than words.
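To make "close together in the embedding space" concrete, here is a minimal sketch that ranks a tiny archive by cosine similarity. The three-dimensional vectors are hand-made stand-ins chosen for the example, not output from any real embedding model, and the headlines are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings". Real models emit hundreds or thousands
# of dimensions; these values are illustrative only.
archive = {
    "Minister's investor ties questioned": [0.9, 0.1, 0.2],
    "Local team wins cup final": [0.1, 0.9, 0.1],
    "Audit finds irregular ministry contracts": [0.8, 0.2, 0.3],
}
query_vector = [0.85, 0.15, 0.25]  # pretend embedding of a corruption query


def rank(query, docs):
    """Return (score, title) pairs, most similar first."""
    scored = [(cosine_similarity(query, vec), title) for title, vec in docs.items()]
    return sorted(scored, reverse=True)


ranked = rank(query_vector, archive)
for score, title in ranked:
    print(f"{score:.3f}  {title}")
```

The two ministry stories score close to 1.0 and the sports story falls well behind, even though no words are compared at any point.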

Why Vector Search Transforms Newsroom Archive Access

Every large newsroom sits atop an archive of immense journalistic value — decades of reporting that captures expertise, established sources, investigative precedents, and context for current stories. The problem is that traditional keyword search fails to surface this value effectively. A journalist researching corruption in a specific ministry will not find all relevant prior coverage unless they know the exact terms used in each article. A story about a politician's past involvement in a financial scandal may be buried in an article that used the word "investor" rather than "corrupt" — invisible to keyword search, but immediately surfaced by semantic search.

Vector search enables journalists to query an archive with natural language — "stories about government officials who later faced corruption charges" — and retrieve semantically relevant articles that may not share a single word with the query. This is transformative for investigative journalism, where connecting historical dots is often the key to breaking new stories.

Implementation: pgvector and the Omniscient AI Approach

For production newsroom retrieval-augmented generation (RAG) systems, pgvector (an open-source vector similarity search extension for PostgreSQL) is an increasingly popular choice because it integrates vector search with the standard relational database most newsrooms already operate. pgvector stores document embeddings as a native PostgreSQL data type and supports approximate nearest-neighbour (ANN) search with HNSW or IVFFlat indexing, with query times of 1–50ms on million-document archives.
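A minimal sketch of what such a setup might look like follows, written as SQL strings held in Python. The table name `article_chunks`, its columns, and the index name are hypothetical. The 1536 dimension assumes the default output size of text-embedding-3-small, and the `%(name)s` placeholders assume a psycopg-style driver. The `vector` type, the `hnsw` index method with `vector_cosine_ops`, and the `<=>` cosine-distance operator are pgvector's own.

```python
EMBEDDING_DIM = 1536  # assumed: default size for text-embedding-3-small

# Hypothetical schema: one row per embedded chunk of an article.
SETUP_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS article_chunks (
    id         bigserial PRIMARY KEY,
    article_id bigint NOT NULL,
    content    text   NOT NULL,
    embedding  vector({EMBEDDING_DIM})
);

-- HNSW index built for cosine distance.
CREATE INDEX IF NOT EXISTS article_chunks_embedding_idx
    ON article_chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine-distance operator (smaller means more similar).
SEARCH_SQL = """
SELECT article_id, content, embedding <=> %(query)s::vector AS distance
FROM article_chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values):
    """Format numbers in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


# A real query vector must have EMBEDDING_DIM entries; three are shown
# here only to keep the example readable.
params = {"query": to_vector_literal([0.1, 0.2, 0.3]), "k": 5}
print(params["query"])
```

With a database driver, `SETUP_SQL` would be run once and `SEARCH_SQL` executed with `params` for each query. Ordering by the distance expression directly, rather than by a wrapped or transformed value, is what lets the planner use the HNSW index.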

Omniscient AI's fact-checking infrastructure uses pgvector to store embeddings of more than 1,200 news and fact-check sources, enabling real-time semantic retrieval of relevant passages for any factual query in under 100ms — fast enough to power the extension's real-time fact-checking interface.
