In other words, you can chat with them and the AI executes tasks. But if you’ve spent time on the Internet, you know it’s rife with incorrect information. The problem with using AI models trained on the whole Internet is that sometimes they make mistakes or make up information — what AI providers call “hallucinations.”
Google’s Gemini 2.5 Pro hallucinates 1.1 percent of the time, whereas xAI’s Grok hallucinates 2.1 percent of the time. Other models hallucinate more often: Mistral’s Large 2 has a hallucination rate of 4.1 percent, and DeepSeek’s R1 hallucinates 7.7 percent of the time.
For government institutions, the relative speed with which large models can be used in an organization — most only require an email to sign in — is not worth the resultant inaccuracies. Sometimes hallucinations are relatively innocuous, mistaking a past elected official for a current one; sometimes large language models (LLMs) tell users to eat rocks and put glue on their pizza.
Innocuous or not, the error rates from large, off-the-shelf, mass market AI tools are inadequate for public agencies. Not many of us would be happy to get our mail a day earlier if it meant 2 percent of it never came at all. The trade-off is more acute when public servants are working to end homelessness, reinvigorate economic development, or provide public health resources.
When the mission matters, accuracy is king. And when accuracy is king, “small” AI models win the day.
In reality, the models themselves are not small. What is small is the data set from which the LLM generates its answers, especially when compared to all the information on the Internet.
LLMs deployed on private or local data sets use retrieval-augmented generation, or RAG. RAG LLMs enable organizations like government agencies to essentially say, “Here is the body of information you should use to pull answers from.” By gating the information LLMs can use to generate responses, government agencies can reduce hallucinations born of inaccurate, imprecise or intentionally misleading information from the Internet at large.
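To make that concrete, here is a minimal sketch of the retrieve-then-generate pattern in Python. The two-document corpus, the keyword-overlap retriever and the generate() call are all hypothetical stand-ins; a production system would use a vector database and an agency-approved model endpoint. The point is simply that the model only ever sees passages pulled from the agency’s own, gated corpus.

```python
# Minimal retrieval-augmented generation (RAG) sketch. The corpus, the
# keyword-overlap retriever and generate() are hypothetical placeholders;
# real deployments use a vector database and an approved model API.

CORPUS = {
    "memo-2024-07.txt": "Policy memo: eligibility rules for the housing voucher program ...",
    "faq-permits.txt": "FAQ: how residents apply for small-business permits ...",
}

def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k corpus documents whose words overlap most with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        CORPUS.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def generate(prompt: str) -> str:
    """Stand-in for the agency's approved LLM endpoint (hypothetical)."""
    raise NotImplementedError("Wire this to the approved model API.")

def answer(question: str) -> str:
    # Gate the model: it is shown only passages retrieved from the agency corpus.
    sources = retrieve(question)
    context = "\n\n".join(f"[{name}]\n{text}" for name, text in sources)
    prompt = (
        "Answer using ONLY the sources below, and cite the file names you used.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```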
Here are three things you should know about effectively deploying RAG LLMs.
1. IT IS A CRITICAL COMPONENT
There are many avenues to deploying RAG LLMs in a government agency, with varying degrees of technical know-how required. Regardless of which path you choose, IT is a critical constituency for getting it right.
Many agencies will choose to build a RAG LLM themselves on private data sets. That typically means identifying, collating and storing data in a homegrown database, which requires a ton of upfront investment from IT — to say nothing of maintaining it. Other options, like deploying an LLM through a cloud computing provider like Azure or AWS, or fine-tuning a RAG LLM, like custom GPTs, in a private cloud, also require IT staff’s time and energy.
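For the homegrown path, the “identify, collate and store” step can be as simple as the sketch below, which uses only the Python standard library. The folder name and table layout are illustrative; a real deployment would add a full-text or vector index on top of this kind of store.

```python
# Sketch of collating vetted documents into a homegrown store (SQLite).
# The folder and table names are hypothetical.
import sqlite3
from pathlib import Path

DATA_DIR = Path("approved_documents")  # folder of vetted agency files (hypothetical)
DB_PATH = "agency_corpus.db"

conn = sqlite3.connect(DB_PATH)
conn.execute(
    "CREATE TABLE IF NOT EXISTS documents ("
    "doc_id TEXT PRIMARY KEY, source_path TEXT, body TEXT)"
)

# Collate every vetted text file into the database the RAG system will query.
for path in DATA_DIR.glob("**/*.txt"):
    conn.execute(
        "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
        (path.stem, str(path), path.read_text(encoding="utf-8")),
    )

conn.commit()
conn.close()
```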
Deploying AI is a massive technical lift, and IT can be a powerful ally if they’re prioritized early and often throughout the process.
2. TAKE A BROAD DEFINITION OF DATA
It’s easy to think of data as only resources — files, memos, spreadsheets, videos, and the like. But to operate most effectively, small AI models in government agencies require context. Let’s use a policy memo as an example.
It is helpful for public servants to be able to ask questions about a policy memo through an LLM, making the memo a critical piece of underlying data. What that approach misses, however, are the email conversations about the memo before it was drafted; the Teams chat from the 20-year department veteran who provided helpful feedback on the memo; the transcript of the call between agency staff and the nonprofits whom the memo will affect. This level of data, which we call contextual metadata, is fundamental.
Without it, government AI is missing a key piece of the puzzle. An LLM built on top of a database that includes not only core assets like policy memos, but also emails, chats, call transcripts and other contextual metadata, is able to answer questions using a richer data set. The result is more accurate and more reliable AI-generated information, which is a paramount consideration when deploying AI in government agencies.
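As a rough illustration of what this broader definition of data can look like in practice, the record layout below is hypothetical, but the pattern is general: index core assets and contextual items alike, each tagged with metadata about where it came from, so the retriever and the citations can draw on memos, emails, chats and call transcripts together.

```python
from dataclasses import dataclass

@dataclass
class Record:
    """One indexed item in the agency corpus, tagged with contextual metadata."""
    doc_id: str
    source_type: str   # "memo", "email", "teams_chat", "call_transcript"
    created: str       # ISO date, useful for filtering out stale guidance
    author_role: str   # e.g. "policy analyst", "department veteran"
    text: str

# Hypothetical core asset plus the context that surrounds it.
corpus = [
    Record("memo-114", "memo", "2024-05-02", "policy analyst",
           "Draft guidance on the emergency rental assistance program ..."),
    Record("memo-114-email-3", "email", "2024-04-18", "department veteran",
           "Before we finalize the memo, note that the 2019 pilot excluded ..."),
    Record("call-2024-04-25", "call_transcript", "2024-04-25", "nonprofit partner",
           "Our intake staff will need at least 30 days to adopt the new form ..."),
]

# Because every record carries its source_type, a retriever can boost or filter
# by it, and every answer can cite the conversations around a memo, not just
# the memo itself.
context_only = [r for r in corpus if r.source_type != "memo"]
```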
3. TRACEABILITY AND AUDITABILITY ARE NON-NEGOTIABLE
Ultimately, public servants are beholden to the public. Because generative AI tools can be error-prone, the ability to trace and audit their work must be in place before a RAG LLM is deployed.
Public servants should be trained and given the ability to quickly and easily verify the information the AI provides. The best way to do this is with robust citations — enabling government employees to know exactly which documents the LLM derived its information from.
Beyond that, key leaders and technical administrators should have the ability to easily identify what data the LLM is querying, which personnel queried the LLM and what they asked. Multiple redundancies ensure the safe and accurate use of AI to improve government operations.
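Here is a minimal sketch of what that can look like, assuming a simple JSON-lines audit log; the file name and fields are illustrative rather than any standard. Each response carries the documents it cited, and each query leaves a record of who asked what and which sources were used.

```python
import json
from datetime import datetime, timezone

AUDIT_LOG = "rag_audit.jsonl"  # hypothetical audit log location

def log_query(user_id: str, question: str, cited_docs: list[str]) -> None:
    """Append one audit record: who asked, what they asked and what was cited."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user_id,
        "question": question,
        "cited_documents": cited_docs,
    }
    with open(AUDIT_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

# Example: the citations shown to the public servant are the same ones written
# to the audit log that leaders and administrators can review later.
log_query(
    user_id="analyst.jdoe",
    question="What are the eligibility rules in memo-114?",
    cited_docs=["memo-114", "memo-114-email-3"],
)
```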
AI is upending how employees across industries work, and the government is no exception. AI can be a great enabler, powering efficiencies that lead to better service delivery, but off-the-shelf AI tools don’t work for public servants.
Madeleine Smith is the co-founder and CEO of Civic Roundtable, a venture-backed startup recognized by Fast Company as one of the Most Innovative Companies in Social Good. Civic Roundtable is a government operations platform purpose-built for public servants, helping agencies coordinate across teams, share institutional knowledge and streamline mission-critical work. With an MPP/MBA from Harvard, Smith has spent her career at the intersection of policy and technology, focused on equipping public servants with practical, user-friendly tools that drive real impact.