Artificial intelligence delivers remarkable capabilities, but only when it knows when to hold back. In high-stakes applications such as government services, an incorrect answer can cause real problems for citizens. In this post, we explore a practical method for evaluating whether an AI system can reliably say “I don’t know” when it lacks the necessary context.
The Challenge: Hallucination and Knowledge Gaps
Imagine asking a government chatbot about specific policy limits, only to receive confidently stated but outdated information. This happens when large language models (LLMs) generate plausible-sounding yet factually incorrect responses. While many systems use retrieval-augmented generation (RAG) to anchor answers in trusted documents, even this method cannot anticipate every citizen’s query.
The key question becomes: What should an AI do when it does not know the answer? Rather than fabricating responses, the ideal behavior is for the system to recognize its knowledge gap and abstain from answering.
Building a Robust Evaluation Pipeline
The solution lies in a structured evaluation pipeline that systematically tests an AI’s ability to recognize when it should say “I don’t know.” This approach breaks down into three key stages:
- Atomic Fact Extraction: Policy documents are deconstructed into individual, testable facts. For instance, a complex sentence is split into separate, independent facts that can be targeted in tests.
- Question Generation with Diversity Filtering: For each fact, questions are generated that require that specific information. Careful filtering is applied to ensure that no other available fact can inadvertently provide the answer. Techniques like keyword analysis and semantic filtering ensure each question is unique and unambiguous.
- Leave-One-Out Experiment Construction: For every generated question, the AI is provided with a context that deliberately omits the critical fact. This leave-one-out setup forces the AI to either acknowledge its lack of knowledge or risk fabricating an answer.
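The leave-one-out construction described above can be sketched in a few lines. This is a minimal illustration, not the library’s internal logic, and the policy facts are hypothetical:

```python
def build_leave_one_out_contexts(facts):
    """For each fact, build a context containing every *other* fact.

    The question targeting facts[i] is then asked against contexts[i],
    which deliberately omits the one fact needed to answer it.
    """
    return [
        [f for j, f in enumerate(facts) if j != i]
        for i in range(len(facts))
    ]

# Hypothetical atomic facts extracted from a policy document
facts = [
    "The rebate cap is $500 per household.",
    "Applications close on 31 March.",
    "Only first-time applicants are eligible.",
]

contexts = build_leave_one_out_contexts(facts)
# contexts[0] holds the deadline and eligibility facts but not the cap,
# so a question about the cap should produce an "I don't know" response.
```

A well-behaved system answers correctly when given the full fact list, and abstains when handed the corresponding leave-one-out context.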
Easy Implementation: A Few Lines of Code
The entire testing pipeline can be implemented using an open-source library that streamlines the process. With just a few lines of code, you can set up the experiment:
```python
kon = KnowOrNot()
kon.add_openai()

question_doc = kon.create_questions(
    source_paths=...,
    knowledge_base_identifier=...,
    context_prompt=...,
    path_to_save_questions=...,
    filter_method=...,
)

experiment_input = kon.create_experiment_input(
    question_document=...,
    system_prompt=...,
    experiment_type=...,
    retrieval_type=...,
    input_store_path=...,
    output_store_path=...,
)
```
This simple code illustrates how to transform policy documents into systematic knowledge tests, enabling developers to gauge whether the AI correctly abstains from providing uncertain or unsupported responses.
The Evaluation Process
Creating test cases is only part of the solution. Equally important is evaluating the AI’s responses. A common approach is to use an LLM as a judge, assessing whether a response appropriately abstains or contains factual inaccuracies. The evaluation takes place through a hybrid approach combining automated judgment with human validation. Key steps include:
- Defining Explicit Evaluation Criteria: Establish clear rules for what constitutes an acceptable response, for example, specifying that absence of critical context should result in an “I don’t know” answer.
- Iterative Refinement: Compare automated labels with human annotations to improve consistency. Iteratively refine prompts to maximize judgment accuracy.
- Calculating Agreement Metrics: Metrics help pinpoint where the AI might be overconfident or where evaluation criteria need adjustment.
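One standard agreement metric for the last step is Cohen’s kappa, which corrects raw agreement between the LLM judge and human annotators for chance. A minimal sketch with illustrative labels (the label names are assumptions, not the library’s schema):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label the same
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label rates
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

human = ["abstain", "abstain", "answer", "answer", "abstain"]
judge = ["abstain", "answer", "answer", "answer", "abstain"]
print(round(cohens_kappa(human, judge), 2))  # 0.62
```

A kappa near 1.0 means the automated judge can stand in for human review; a low value signals that the judging prompt or the evaluation criteria need another refinement pass.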
Insights from Practical Testing
Experiments in environments simulating high-stakes use cases, such as government policy queries, reveal startling results. Even with a conservative prompt and advanced retrieval techniques, systems may still provide confident answers nearly 40% of the time when the necessary information is absent. Moreover, when the AI does attempt to answer, factuality rates can be as low as 24% to 33%. These findings emphasize the importance of building robust guardrails and customized evaluation metrics tailored to the domain at hand.
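These headline numbers reduce to two simple rates over the judged leave-one-out responses. A sketch with hypothetical judgment labels, chosen only to illustrate the calculation:

```python
def abstention_and_factuality(judgments):
    """judgments: one label per leave-one-out question, drawn from
    "abstain", "correct", or "incorrect" (hypothetical label set)."""
    n = len(judgments)
    abstained = sum(j == "abstain" for j in judgments)
    answered = n - abstained
    abstention_rate = abstained / n
    # Factuality is measured only over cases where the model answered
    factuality = (
        sum(j == "correct" for j in judgments) / answered if answered else None
    )
    return abstention_rate, factuality

# 10 hypothetical judgments: 6 abstentions, 4 answers of which 1 is correct
judgments = ["abstain"] * 6 + ["correct"] * 1 + ["incorrect"] * 3
rate, fact = abstention_and_factuality(judgments)
print(rate, fact)  # 0.6 0.25
```

In this toy run the system answers 40% of unanswerable questions, and only a quarter of those answers are factual, mirroring the pattern seen in the experiments above.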
By integrating groundedness detection and systematic fact verification into your testing strategy, you can better ensure that AI systems not only provide better factual responses but also know when to refrain from answering.
Conclusion
Establishing trust in AI systems requires that they are able to recognize and communicate their own limitations. A methodical pipeline that extracts atomic facts, generates targeted questions, and tests the system’s responses with the critical context deliberately withheld is a powerful way to evaluate and improve AI abstention behavior. As AI continues to be integrated into public services where accuracy is paramount, developing these robust evaluation pipelines is essential for ensuring responsible and reliable AI deployment.
