Government reports, academic datasets, and official statistics are the gold-standard sources for verified journalism, but they are often slow to search. Building a RAG-enabled index of the key primary sources for your beat can turn hours of document search into seconds of natural-language querying.

The Primary Source Indexing Process

1. Identify key sources for your beat: List the 20–30 most important recurring primary sources (government statistics agencies, regulatory filings databases, academic journals, international organisation reports).
2. Set up automated downloads: Many official data sources offer APIs or RSS feeds. Configure automatic downloads so new reports are fetched as soon as they are published.
3. Convert to text: Extract text from PDFs with Adobe Acrobat, PyMuPDF, or Apache Tika; scanned PDFs with no text layer need OCR first. For CSV data, write natural-language descriptions of the column definitions so the contents can be searched as text.
4. Chunk and embed: Split the text into chunks of about 500 tokens, embed each chunk with a model such as text-embedding-3-small, and store the vectors in a vector database. Keep the source document and page number with every chunk so that answers can be cited.
5. Build a query interface: A simple chat interface lets reporters ask "What did the ONS say about employment in Q4 2026?" and returns the answer with a citation to the specific document and page number.
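
Steps 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: tokens are approximated by whitespace-separated words, toy_embed is a bag-of-words stand-in for a real embedding model such as text-embedding-3-small, and an in-memory list stands in for the vector database. The names Chunk, chunk_pages, toy_embed, and search, the filename, and the sample page text are all hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Chunk:
    source: str  # document identifier, e.g. a filename
    page: int    # 1-based page number the chunk came from
    text: str


def chunk_pages(source: str, pages: Iterable[str], max_tokens: int = 500) -> list[Chunk]:
    """Split each page into chunks of at most max_tokens words.

    Assumption: whitespace-separated words approximate model tokens.
    Chunks never cross a page boundary, so each one maps to a single page.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for start in range(0, len(words), max_tokens):
            text = " ".join(words[start:start + max_tokens])
            chunks.append(Chunk(source, page_number, text))
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    words = (w.strip(".,;:?!\"'()").lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, chunks: list[Chunk],
           embed: Callable[[str], Counter] = toy_embed, top_k: int = 3) -> list[Chunk]:
    """Return the top_k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c.text)), reverse=True)
    return ranked[:top_k]


# Made-up sample pages, for illustration only.
pages = [
    "Employment rose in the fourth quarter.",
    "Inflation figures are discussed on this page.",
]
chunks = chunk_pages("labour_market_report.pdf", pages)
best = search("What happened to employment?", chunks, top_k=1)[0]
print(f"{best.text} [{best.source}, p. {best.page}]")
```

Because every chunk carries its source and page number, the query interface can attach a citation to whatever it retrieves. In a real deployment the embed argument would call the chosen embedding model, and the sorted list would be replaced by a vector database query.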