
AI-Powered Rules as Code: Experiments with Public Benefits Policy

This report documents four experiments exploring whether AI can be used to expedite the translation of SNAP and Medicaid policies into software code for implementation in public benefits eligibility and enrollment systems under a Rules as Code approach.

Abstract

Public interest technologists are still early in our understanding of how to effectively use large language models (LLMs) to translate policy into code. This report documents four experiments conducted to evaluate the current performance of commercially-available LLMs in translating policies into plain language summaries, machine-readable pseudocode, or usable code within a Rules as Code process. We used eligibility rules and policies for the Supplemental Nutrition Assistance Program (SNAP) and Medicaid. The experiments include asking a chatbot or LLM about specific policies, summarizing policy in a machine-readable format, and using fine-tuning or Retrieval-Augmented Generation (RAG) to enhance an LLM’s ability to generate code that encodes policy. We found that LLMs are capable of supporting the process of generating code from policy, but still require external knowledge and human oversight within an iterative process for any policies containing complex logic.

Introduction

The rise of commercially-available large language models (LLMs) presents an opportunity to use artificial intelligence (AI) to expedite the translation of policies into software code for implementation in public benefits eligibility and enrollment systems under a Rules as Code approach. Our team experimented with multiple LLMs and methodologies to determine how well the LLMs could translate Supplemental Nutrition Assistance Program (SNAP) and Medicaid policies across seven different states. We present our findings in this report.

We conducted initial experiments from June to September 2024 during the Policy2Code Prototyping Challenge. The challenge was hosted by the Digital Benefits Network and the Massive Data Institute as part of the Rules as Code Community of Practice. Twelve teams from the U.S. and Canada participated in the resulting Policy2Code Demo Day. We completed the remaining experiments and analysis between October 2024 and February 2025.

In this report you will learn more about key takeaways including:

  • LLMs can help support the Rules as Code pipeline. LLMs can extract programmable rules from policy by leveraging expert knowledge retrieved from policy documents and employing well-crafted templates.
  • LLMs achieve better policy-to-code conversion when prompts are detailed and the policy logic is simple.
  • State governments can make it easier for LLMs to use their policies by making them digitally accessible.
  • Humans must be in the loop to review outputs from LLMs. Accuracy and equity considerations must outweigh efficiency in high-stakes benefits systems.
  • Current web-based chatbots produce mixed results, often presenting incorrect information in a confident tone.


Overview of Experiments

Our experiments addressed the following questions: 

  • Experiment 1: How well can LLM chatbots answer general SNAP and Medicaid eligibility questions based on their training data and/or resources available on the internet? What factors affect their responses?
  • Experiment 2: How well can an LLM generate accurate, complete, and logical summaries of benefits policy rules when provided official policy documents? 
  • Experiment 3: How well can an LLM extract machine-readable rules from unstructured policy documents in terms of output relevance and accuracy?
  • Experiment 4: How effectively can an LLM generate software code to determine eligibility for public benefits programs? 

Below, we describe each of the four experiments, focusing on their motivation, methodology, findings, and considerations for public benefits use cases.

Open Materials

Each experiment includes openly accessible materials in a linked appendix. We share the rubrics we developed to evaluate the accuracy, completeness, and relevance of the responses. Additionally, we share our prompts, responses, and scores. Note that the prompts may generate different responses if run again through the same models because models are continually being updated. 

Current State: Policy from Federal and State Government

Public benefits in the United States are governed by a complex network of federal, state, territorial, tribal, and local entities. Governments enact a web of intersecting laws, regulations, and policies related to these programs. These are often communicated1 in lengthy and complex PDFs and documents distributed across disparate websites. Maintaining eligibility rules for public benefits across eligibility screeners, enrollment systems, and policy analysis tools—like calculating criteria for household size, income, and expenses—is already a significant cross-sector challenge. This complexity hinders government administrators’ ability to deliver benefits effectively and makes it difficult for residents to understand their eligibility. 

We have heard directly from state and local government practitioners about the challenges2 with their eligibility and enrollment systems and how they intersect with policy. For example, one state practitioner described it as “a scramble” to update the code in legacy systems. Implementation is often completed by a policy team, technology team, or external vendor, which increases the potential for errors or inconsistencies when system changes or updates are needed. An additional challenge is that no U.S. government agency currently makes its eligibility code open source, which ultimately reduces transparency and adaptation across other government systems.

Introduction to Rules as Code

Adopting Rules as Code for public benefits is a crucial strategy to improve the connection between policy and digital service delivery in the U.S. The Organization for Economic Co-operation and Development (OECD) defines3 Rules as Code as “an official version of rules (e.g. laws and regulations) in a machine-consumable form, which allows rules to be understood and actioned by computer systems in a consistent way.” Using Rules as Code in expert systems and automated or semi-automated decision-making systems is a key use case4.

This approach facilitates seamless and transparent policy integration into standardized public benefits delivery systems. A state leader noted that Rules as Code supports a “no wrong door” vision, ensuring consistent eligibility criteria across all service entry points, or “doors.”

In a future Rules as Code approach to benefits eligibility, policies and regulations would be translated into both plain-language logic flows (pseudocode) and standardized, machine-readable formats, such as Extensible Markup Language (XML), JavaScript Object Notation (JSON), or YAML Ain’t Markup Language (YAML). This would enable easy access by various systems, including benefits applications, eligibility screeners, and policy analysis tools. Doing this would enable stakeholders to simultaneously review the legislation or regulation, plain language logic, and code, and know that all are official and ready for implementation. Ideally, this would also enable a more seamless cycle of writing, testing, and revising the rules. Others have written about the challenges and consequences of implementing Rules as Code, including an in-depth analysis5 of work in New Zealand, and a lawyer and programmer’s recreation of government system code6 for French housing benefits.
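
As a concrete illustration, a single eligibility rule in such a standardized format might look like the sketch below. The structure, field names, and values are hypothetical placeholders, not an established schema or actual policy.

```python
import json

# Hypothetical, illustrative example only: a single eligibility rule expressed as a
# machine-readable structure. Field names and threshold values are placeholders,
# not actual SNAP policy for any state.
gross_income_rule = {
    "program": "SNAP",
    "jurisdiction": "US-XX",  # placeholder state code
    "rule_id": "gross_income_limit",
    "description": "Gross monthly income must not exceed a multiple of the federal poverty level.",
    "parameters": {"fpl_multiplier": 1.30},
    "inputs": ["gross_monthly_income", "household_size"],
    "citation": "placeholder citation to the governing regulation",
}

# The same structure can be serialized to JSON (or YAML/XML) so that screeners,
# enrollment systems, and policy analysis tools can all read one official source.
print(json.dumps(gross_income_rule, indent=2))
```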

Figure 1: Benefits Eligibility Rules as Code
Benefits Eligibility Rules as Code Diagram: a policy, law, or regulation is translated into machine-readable code in standardized syntax and plain language logic, which can then be used in benefits eligibility and enrollment systems

There is a vibrant ecosystem of organizations that have already created online eligibility screening tools, benefits applications, policy analysis tools, and open-source resources across numerous public benefits programs. These organizations play a crucial role in increasing access to benefits for eligible individuals, and several serve as tangible examples of how open, standardized code can translate policy into system rules. 

In addition to open-code bases, there are other frameworks that can inform the standardization of communicating Rules as Code for U.S. public benefits programs. To reduce time and money spent on creating new frameworks, we believe it is important to evaluate and test any existing frameworks or standards to determine if they can be adopted or further developed. 

Introduction to Generative AI 

Artificial intelligence (AI) is the science of developing intelligent machines. Machine learning is a branch of AI that learns to make predictions based on training data. Deep learning is a specific type of machine learning, inspired by the human brain, that uses artificial neural networks to make predictions. The goal is for each node in the network to process information in a way that imitates a neuron in the brain. The more layers of artificial neurons in the network, the more patterns the neural network is able to learn and use for prediction.

Many successful generative AI tools and applications are designed using deep learning models. A large difference between classic machine learning and generative AI is that generative AI models not only identify patterns that they have learned, but also generate new content. One specific type of deep learning model for processing human text is the large language model (LLM). LLMs aim to capture the relationships between words and phrases. They have been trained on enormous amounts of text and have human-like language ability across a range of traditional language tasks. However, it remains unclear how well different LLMs and tools built using LLMs—e.g. chatbots—work in more complex, domain-specific prediction tasks, like Rules as Code.

Intersection of Generative AI with Rules as Code

As the use of automated and generative AI tools increases and more software attempts to translate benefits policies into code, the existing pain points will likely be exacerbated. To increase access for eligible individuals and ensure system accuracy, it is imperative to quickly build a better understanding of how generative AI technologies can, or cannot, aid in the translation and standardization of eligibility rules in software code for these vital public benefits programs. We see a critical opportunity to create open, foundational resources for benefits eligibility logic that can inform new AI technologies and ensure that they are correctly interpreting benefits policies. 

If generative AI tools are able to effectively translate public policy or program rules into code and/or convert code between software languages7, it could: 

  • Speed up a necessary digital transformation to Rules as Code;
  • Allow for humans to check for accuracy and act as logic and code reviewers instead of programmers;
  • Automate testing to validate code;
  • Allow for easier implementation of updates and changes to rules;
  • Create more transparency in how rules operate in digital systems;
  • Increase efficiency by eliminating duplicate efforts across multiple levels of government and delivery organizations; and 
  • Reduce the burden on public employees administering and delivering benefits.

Additionally, once rules are in code, it could enable new pathways for policymakers, legislators, administrators, and the public to model and measure the impacts of policy changes. For example, a generative AI tool could help model different scenarios for a rules change, using de-identified or synthetic data to measure impact on a specific population or geography. Additionally, generative AI can also help governments migrate code from legacy systems and translate it into modern code languages or syntaxes, as well as enable interoperability between systems. 

How We’ve Been Advancing Rules as Code 

The Digital Benefits Network (DBN) and Massive Data Institute (MDI) teams are among the first to extensively research how to apply a Rules as Code framework to the U.S. public benefits system. We draw on international research and examples, and have identified numerous U.S.-based projects that could inform a national strategy, along with a shared syntax and data standard.

The DBN hosts the Rules as Code Community of Practice—a shared space for people working on public benefits eligibility and enrollment systems, particularly people tackling how policy becomes software code.

Project Overview

Policy Focus

We focused on SNAP and Medicaid because they are essential benefits programs with high participation rates and are often integrated into a combined application. Both are means-tested programs that use specific criteria to determine eligibility and benefits amounts. 

Broadly, benefits policy rules fall into two categories: eligibility rules and benefits rules. Eligibility rules determine whether an applicant qualifies for the program based on their personal and household information. On the other hand, benefits rules refer to the type and scope of benefits an eligible applicant is entitled to receive. Both categories of rules are essential for developing scalable applications that improve and streamline the digitized delivery of benefits. 
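
To make the distinction concrete, the sketch below separates the two categories into two functions. It is a hypothetical illustration; the thresholds and the benefit formula are placeholders, not actual SNAP or Medicaid rules.

```python
from dataclasses import dataclass

@dataclass
class Household:
    size: int
    net_monthly_income: float

# Placeholder values for illustration only, not actual policy figures.
NET_INCOME_LIMIT = {1: 1_255.0, 2: 1_704.0}   # hypothetical limits by household size
MAX_ALLOTMENT = {1: 292.0, 2: 536.0}          # hypothetical maximum benefit amounts

def is_eligible(h: Household) -> bool:
    """Eligibility rule: does the applicant qualify for the program at all?"""
    return h.net_monthly_income <= NET_INCOME_LIMIT.get(h.size, float("inf"))

def benefit_amount(h: Household) -> float:
    """Benefits rule: what amount does an eligible household receive?"""
    if not is_eligible(h):
        return 0.0
    # Hypothetical formula: maximum allotment minus 30% of net income.
    return max(0.0, MAX_ALLOTMENT.get(h.size, 0.0) - 0.3 * h.net_monthly_income)

print(is_eligible(Household(size=1, net_monthly_income=1_100.0)))   # True
print(benefit_amount(Household(size=1, net_monthly_income=1_100.0)))
```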

States

We focused on SNAP and Medicaid policies in five states: California, Georgia, Michigan, Pennsylvania, and Texas. Additionally, we included Alaska’s SNAP policy and Oklahoma’s Medicaid policy.

We selected these states based on the following criteria, aiming for diversity across these characteristics:

  1. The state’s previous identification as delivering “strong” or “poor” communication of eligibility criteria in SNAP policy manuals
  2. Political party affiliation of the state’s governor
  3. The state’s ability to offer Broad-Based Categorical Eligibility (BBCE) for SNAP
  4. The state’s participation in the Affordable Care Act (ACA) expansion for Medicaid
Table 1: State Selection Criteria
| State | Strong/Poor SNAP Manual | Governor’s Political Party (as of October 2024) | BBCE Offered for SNAP? | ACA Expansion? |
|---|---|---|---|---|
| Alaska | Strong | Republican | No | N/A (only looked at SNAP policy) |
| California | Unknown | Democrat | Yes | Yes |
| Georgia | Poor | Republican | Yes | No |
| Michigan | Unknown | Democrat | Yes | Yes |
| Oklahoma | Unknown | Republican | N/A (only looked at Medicaid policy) | Yes (recent expansion) |
| Pennsylvania | Strong | Democrat | Yes | Yes |
| Texas | Unknown | Republican | Yes, with asset limit on Temporary Assistance for Needy Families (TANF)/Maintenance of Effort (MOE) programs | No |

Technology Focus

Our four experiments used a well-known LLM, GPT-4o, accessed through its Application Programming Interface (API). We also used the web browser versions of two AI chatbots: ChatGPT and Gemini. At the time of this study, ChatGPT used GPT-3.5 while Gemini used Gemini 1.5 Flash. In this study, we refer to both the LLM and chatbot outputs as LLM output. We recognize that chatbots offer less flexibility with respect to setting parameters.
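
For readers unfamiliar with API access, the sketch below shows roughly how a single prompt can be sent to GPT-4o programmatically, assuming the OpenAI Python client and an API key available in the environment; it is illustrative rather than our exact experimental harness.

```python
# A minimal sketch (not the study's actual harness) of sending one eligibility
# prompt to GPT-4o through the API, assuming the OpenAI Python client (openai>=1.0)
# and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Who is eligible for Medicaid in Texas?"},
    ],
    temperature=0,  # assumption: a low temperature for more deterministic answers
)

print(response.choices[0].message.content)
```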

Assessing LLM Outputs

Our research group developed rubrics to evaluate the responses the LLMs generated in Experiments 1 and 2. Each category—accuracy, completeness, relevance, and state-specific or current applicability—captured a different aspect of the response. Moreover, each factor influenced the degree to which the response met the requirements of the prompt. For example, a response might have been highly accurate, but lacked relevance, making it less effective for specific prompts. By evaluating these categories separately, we were able to better assess the quality of LLM-generated responses. 

Additionally, human evaluation—often considered the gold standard—is typically slow and costly. For Experiments 1 and 2, our human evaluators were a small group of two people who, while knowledgeable when reviewing the documents, were not experts in the state policies. Moving forward, it would be valuable to have established guidelines or evaluation metrics to support this type of task. Also, having a larger and more diverse group of evaluators could improve evaluation consistency and reduce subjectivity in scoring.

For Experiment 3, we evaluated the LLM’s structured rule outputs (e.g., income limits or work requirements) by comparing them against manually-verified values from state policy documents. This helped us quantify the alignment between generated rules and primary sources. In Experiment 4, we adopted a qualitative approach, evaluating the quality and utility of AI-generated code through criteria such as output accuracy (correct implementation of policy logic), logical consistency (coherence of decision flows), and rules coverage (completeness of encoded policy requirements). 
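
The comparison in Experiment 3 can be pictured with a small sketch like the one below; the rule names and values are placeholders rather than real policy figures.

```python
# Illustrative sketch of an Experiment 3-style check: compare structured values the
# LLM extracted against values verified by hand from a state policy manual.
# All numbers below are placeholders, not actual policy figures.
verified = {"gross_income_limit_hh1": 1_632, "abawd_work_hours_per_week": 20}
extracted = {"gross_income_limit_hh1": 1_632, "abawd_work_hours_per_week": 30}

matches = {key: extracted.get(key) == value for key, value in verified.items()}
alignment = sum(matches.values()) / len(verified)

for key, ok in matches.items():
    print(f"{key}: {'match' if ok else 'mismatch'} "
          f"(verified={verified[key]}, extracted={extracted.get(key)})")
print(f"Alignment with primary sources: {alignment:.0%}")
```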

Experiment 1: Asking Chatbots About Benefits Eligibility 

Motivation or Question

In this research, we set out to analyze how three LLMs—the web browser versions of ChatGPT and Gemini, as well as the GPT-4o API—extracted and provided information about eligibility for different policies. Our goal was to measure how well these models answered questions related to benefits programs like SNAP and Medicaid. The main questions guiding our study were: 

  • How well can AI chatbots answer general SNAP and Medicaid eligibility questions?
  • What factors affect their responses?

Methodology

Figure 2: Experiment 1 Methodology
Experiment 1 Methodology Diagram

1. Prompt Generation

We took two different approaches to designing the prompts in the experiment. First, we used zero-shot prompting, in which the model is asked a question without being given any examples or demonstrations. We also used general knowledge prompting, a technique in which questions are structured to leverage the model’s broad understanding of the world. Read more about different prompting techniques here8.

We divided the prompts into the following categories:

  • Overall Eligibility Criteria prompts asked who is eligible for the SNAP or Medicaid programs. Optimal answers included all important eligibility criteria for specific programs. For instance:
    • “Who is eligible for Medicaid in Texas?”
  • Individual Eligibility Criteria prompts asked about a single SNAP or Medicaid eligibility criterion rather than all criteria at once. Ideal answers provided specific details about the criterion in question. For instance:
    • “What is the income criteria for Medicaid eligibility in Texas?”
    • “What is the employment criteria for Medicaid eligibility in Texas?”
  • Employment Scenario prompts described a specific situation about the individual’s work status and asked how it affected eligibility for SNAP or Medicaid. For instance:
    • “I am a single adult that cannot work due to a medical condition, am I eligible for SNAP in Georgia?”
    • “I do not work, am I eligible for SNAP in Michigan?”
  • Household Scenario prompts outlined a specific situation involving a person’s family and asked how it impacted eligibility for SNAP or Medicaid. For instance:
    • “I am a foster parent with foster children, are we eligible for SNAP in Pennsylvania?”
  • Income Scenario prompts presented a specific income and asked if that income met the SNAP or Medicaid eligibility threshold. For instance:
    • “Am I eligible for SNAP in Alaska if my gross monthly income is $1,975 and my net monthly income is $1,760?”
  • Other Program Scenario prompts asked if or how SNAP or Medicaid eligibility were affected by enrollment in other programs (e.g., TANF, unemployment insurance, or WIC). For instance:
    • “Is someone eligible for SNAP if they are enrolled in TANF in Alaska?”
  • Other Scenario prompts presented specific characteristics about the individual and asked how these details affected eligibility for SNAP or Medicaid. For instance:
    • “I am pregnant woman, what eligibility requirements do I have to meet to qualify for SNAP in Alaska?”
    • “I receive a non-cash MOE benefit. Do I qualify for SNAP in California?”

2. Assessing LLM Outputs

For this experiment, our research group analyzed and scored the responses generated by the LLMs for state-specific and current applicability, relevance, and completeness.  

Table 2: Experiment 1 Rubric 
| Criterion | Question | Scale |
|---|---|---|
| Current and state-specific applicability | Is the information from the response current and state-specific? | Score 1-5 |
| Completeness | Is the response thorough and does it cover all elements requested in the prompt? | Score 1-5 |
| Relevance | Is the response focused on the question, without adding irrelevant or unnecessary details? | Score 1-5 |

Read more about the rubric for Experiment 1 in the Appendix. → 

Developments and Challenges

Three features of the LLMs affected the data used to generate their responses and our ability to evaluate their data sources.

  1. Internet access: The web browser versions of ChatGPT and Gemini were able to search and present data from the internet in responses, whereas the GPT-4o API relied solely on its training data.
  2. “Short-term” memory: The web browser version of ChatGPT included a setting to enable or disable short-term memory. When enabled, the model used information from previous prompts to inform responses. Our experiment aimed to evaluate how well the models answered SNAP and Medicaid policy questions using single prompts. However, due to our error, the short-term memory setting was enabled for a small number of ChatGPT responses; those responses remain in our analysis.
  3. Citing sources: The web-browser versions of ChatGPT and Gemini were able to cite sources for their responses, whereas the GPT-4o API was not. Most of the responses in our experiment did not include citations, which made it challenging to determine the data sources used in responses.

Findings

Our findings for response performance across the three rubric categories (current/state-specific applicability, completeness, and relevance of response) are included in the Experiment 1 Materials. The materials include both “average performance scores” based on the rubric and “average percentage performance scores,” which convert average rubric scores into percentages for standardized comparisons of results across settings.
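
The percentage scores are simply the average rubric score expressed as a share of the 5-point maximum (for example, 4.2 out of 5 becomes 84%); small discrepancies in the tables, such as 4.1 shown as 81%, likely reflect rounding of the underlying averages. A minimal sketch of the conversion:

```python
# Converting an average rubric score (1-5 scale) into a percentage, as used in the
# tables below; e.g., an average score of 4.2 becomes 84%.
def to_percentage(average_score: float, max_score: int = 5) -> float:
    return average_score / max_score * 100

print(to_percentage(4.2))  # 84.0
```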

The spreadsheet is organized into tables that break down the results in the following ways:

Average Scores by Prompt Topic

Table 3: Average Scores by Prompt Topic
| Prompt Topic | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| Overall Eligibility Criteria | 4.2 (84%) | 2.0 (40%) | 4.5 (90%) |
| Individual Eligibility Criteria | 4.1 (82%) | 3.1 (62%) | 4.0 (80%) |
| Employment Scenarios | 4.3 (86%) | 2.6 (52%) | 4.9 (98%) |
| Household Scenarios | 4.5 (90%) | 2.7 (54%) | 5.0 (100%) |
| Income Scenarios | 3.8 (76%) | 2.6 (52%) | 4.8 (96%) |
| Other Program Scenarios | 4.5 (90%) | 3.1 (62%) | 4.9 (98%) |
| Other Scenarios | 4.1 (82%) | 2.9 (58%) | 4.8 (96%) |

Each cell shows the average performance score, with the average percentage performance score in parentheses.
This table breaks the scores down by prompt topics.

Notable results in this table include:

  • Individual Eligibility Criteria prompts performed better on completeness than Overall Eligibility Criteria prompts. This suggests that the LLMs were less effective when asked to identify all of the eligibility criteria for SNAP or Medicaid in a single query, compared to when asked to identify the details of each individual criterion for eligibility in a single query.
  • Income Scenario prompts performed the worst in terms of being current and state-specific. The LLMs struggled to provide current and accurate income limits and compare them to the income limits in the prompts.

Average Scores by Model

Figure 3: Model Effectiveness
Model Effectiveness Pie Chart
Table 4: Average Scores by Model
| Model | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| GPT-4o API | 4.1 (82%) | 2.9 (58%) | 4.3 (86%) |
| ChatGPT Web Browser | 4.1 (82%) | 3.0 (60%) | 4.5 (90%) |
| Gemini Web Browser | 4.3 (86%) | 2.8 (56%) | 4.6 (92%) |
This table shows the average scores across both SNAP and Medicaid responses by the model used (i.e., GPT-4o API, ChatGPT web browser, or the Gemini web browser).

Notable results in this table include:

  • There were no major differences in the three models’ performances. However, the Gemini web browser had the highest score for current/state-specific applicability and relevance, while the ChatGPT web browser had the highest score for completeness. The GPT-4o API had the lowest score for relevance, while the Gemini web browser had the lowest score for completeness. The GPT-4o API and ChatGPT web browser were tied for the lowest score for current/state-specific applicability. 

Average Scores by Program

Table 5: Average Scores by Program
| Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| SNAP | 4.1 (82%) | 2.8 (56%) | 4.4 (88%) |
| Medicaid | 4.2 (84%) | 3.0 (60%) | 4.5 (90%) |
This table shows the average scores for both SNAP and Medicaid across all LLMs tested.

Notable results in this table include:

  • The average scores for Medicaid responses were slightly higher than for SNAP responses across all three rubric criteria.

Gemini Web Browser Scores by Program

Table 6: Gemini Web Browser Scores by Program
| Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| SNAP | 4.4 (88%) | 2.7 (54%) | 4.5 (90%) |
| Medicaid | 4.2 (84%) | 3.0 (60%) | 4.6 (92%) |
This table shows the average scores for both SNAP and Medicaid for only Gemini-generated responses.

Notable results in this table include:

  • For SNAP, Gemini outperformed the average score across all three models in terms of current and state-specific applicability and relevance, but underperformed in terms of completeness. 
  • For Medicaid, Gemini had an average performance in terms of current and state-specific applicability and completeness, but outperformed the average in terms of relevance. 

ChatGPT Web Browser Scores by Program

Table 7: ChatGPT Web Browser Scores by Program
| Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| SNAP | 4.1 (82%) | 3.0 (60%) | 4.5 (90%) |
| Medicaid | 4.1 (82%) | 2.9 (58%) | 4.4 (88%) |
This table shows the average scores for both SNAP and Medicaid using only responses generated by the ChatGPT web browser.

Notable results in this table include:

  • For SNAP, the ChatGPT web browser outperformed the average score across all three models in terms of completeness and relevance and had an average performance in terms of current and state-specific applicability. 
  • For Medicaid, the ChatGPT web browser outperformed the average score in terms of current and state-specific applicability and underperformed in terms of completeness and relevance.

GPT-4o API Scores by Program

Table 8: GPT-4o API Scores by Program
| Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| SNAP | 4.0 (80%) | 2.7 (54%) | 4.2 (84%) |
| Medicaid | 4.3 (86%) | 3.2 (64%) | 4.4 (88%) |
This table shows the average scores for SNAP and Medicaid, using only responses generated by the GPT-4o API.

Notable results in this table include:

  • For SNAP, the GPT-4o API underperformed on the average score across all three models for all rubric criteria. 
  • For Medicaid, the GPT-4o API outperformed the average score across all three models in terms of current and state-specific applicability and completeness, and underperformed in terms of relevance. 

Average Scores by State and Program

Figure 4: Average Scores for State and Program
Average scores by state and program bar chart
Table 9: Average Scores by State and Program
| State and Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| Alaska SNAP | 4.2 (84%) | 2.7 (54%) | 4.4 (88%) |
| California SNAP | 4.1 (82%) | 2.6 (52%) | 4.5 (90%) |
| California Medicaid | 4.0 (80%) | 3.1 (62%) | 4.6 (92%) |
| Georgia SNAP | 4.3 (86%) | 2.9 (58%) | 4.3 (86%) |
| Georgia Medicaid | 4.2 (84%) | 2.9 (58%) | 4.5 (90%) |
| Michigan SNAP | 4.0 (80%) | 3.3 (66%) | 4.6 (92%) |
| Michigan Medicaid | 3.8 (76%) | 2.4 (48%) | 4.4 (88%) |
| Oklahoma Medicaid | 4.3 (86%) | 3.3 (66%) | 4.5 (90%) |
| Pennsylvania SNAP | 4.1 (81%) | 2.7 (54%) | 4.3 (86%) |
| Pennsylvania Medicaid | 4.3 (86%) | 2.9 (58%) | 4.4 (88%) |
| Texas SNAP | 4.2 (84%) | 2.7 (54%) | 4.5 (90%) |
| Texas Medicaid | 4.4 (88%) | 3.5 (70%) | 4.4 (88%) |
This table shows the average scores across all three models for each state and program combination. 

Notable results in this table include:

  • Texas Medicaid scored highest in current and state-specific applicability and completeness, while California Medicaid and Michigan SNAP both tied for the highest relevance score.
  • Michigan Medicaid received the lowest scores for current and state-specific applicability and completeness, while Georgia SNAP and Pennsylvania SNAP tied for the lowest relevance score.

Gemini Web Browser Scores by State and Program

Figure 5: Gemini Web Browser Scores for State and Program
Gemini Web Browser Scores for State and Program bar chart
Table 10: Gemini Web Browser Scores by State and Program
| State and Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| Alaska SNAP | 4.8 (96%) | 2.5 (50%) | 4.6 (92%) |
| California SNAP | 4.4 (88%) | 2.5 (50%) | 4.5 (90%) |
| California Medicaid | 4.3 (86%) | 2.9 (58%) | 4.6 (92%) |
| Georgia SNAP | 4.7 (94%) | 2.7 (54%) | 4.4 (88%) |
| Georgia Medicaid | 4.2 (84%) | 2.8 (56%) | 4.6 (92%) |
| Michigan SNAP | 4.2 (84%) | 3.2 (64%) | 4.8 (96%) |
| Michigan Medicaid | 3.7 (74%) | 2.1 (42%) | 4.4 (88%) |
| Oklahoma Medicaid | 4.3 (86%) | 3.5 (70%) | 4.8 (98%) |
| Pennsylvania SNAP | 4.2 (84%) | 2.7 (54%) | 4.7 (94%) |
| Pennsylvania Medicaid | 4.4 (88%) | 3.0 (60%) | 4.8 (96%) |
| Texas SNAP | 4.1 (81%) | 2.7 (54%) | 4.5 (90%) |
| Texas Medicaid | 4.4 (88%) | 3.5 (70%) | 4.7 (94%) |
This table shows the average scores for each state and program combination for only Gemini-generated responses.

Notable results in this table include:

  • Alaska SNAP received the highest score for current and state-specific applicability. Oklahoma Medicaid and Texas Medicaid tied for the highest completeness score, while Michigan SNAP* and Pennsylvania Medicaid tied for the highest relevance score.
  • Michigan Medicaid* received the lowest scores for current and state-specific applicability and completeness and tied with Georgia SNAP for the lowest relevance score.

*Please note that the Gemini responses for Michigan SNAP and Medicaid were generated at a later date than the responses for other programs.

ChatGPT Web Browser Scores by State and Program

Figure 6: ChatGPT Web Browser Scores for State and Program
ChatGPT Web Browser Scores for State and Program Bar Chart
Table 11: ChatGPT Web Browser Scores by State and Program
| State and Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| Alaska SNAP | 3.9 (78%) | 2.7 (54%) | 4.5 (90%) |
| California SNAP | 3.9 (78%) | 2.8 (56%) | 4.8 (96%) |
| California Medicaid | 3.8 (76%) | 3.1 (62%) | 4.7 (94%) |
| Georgia SNAP | 4.2 (84%) | 3.2 (64%) | 4.4 (88%) |
| Georgia Medicaid | 4.2 (84%) | 2.8 (56%) | 4.4 (88%) |
| Michigan SNAP | 4.1 (82%) | 4.2 (84%) | 4.6 (92%) |
| Michigan Medicaid | 3.7 (74%) | 2.4 (48%) | 4.5 (90%) |
| Oklahoma Medicaid | 4.3 (86%) | 3.1 (62%) | 4.5 (90%) |
| Pennsylvania SNAP | 4.0 (80%) | 2.4 (48%) | 4.1 (82%) |
| Pennsylvania Medicaid | 4.0 (80%) | 2.8 (56%) | 4.0 (80%) |
| Texas SNAP | 4.2 (84%) | 2.8 (56%) | 4.7 (94%) |
| Texas Medicaid | 4.4 (88%) | 3.4 (68%) | 4.4 (88%) |
This table shows the average scores for each state and program combination for responses generated by the ChatGPT web browser only.

Notable results in this table include:

  • Texas Medicaid received the highest score for current and state-specific applicability, while Michigan SNAP had the highest score for completeness, and California SNAP received the highest score for relevance.
  • Michigan Medicaid had the lowest score for current and state-specific applicability and tied with Pennsylvania SNAP for the lowest score for completeness. Pennsylvania Medicaid had the lowest score for relevance.

GPT-4o API Scores by State and Program

Figure 7: GPT-4o API Scores for State and Program
GPT-4o API Scores for State and Program Bar Chart
Table 12: GPT-4o API Scores by State and Program
| State and Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| Alaska SNAP | 4.2 (84%) | 2.6 (52%) | 4.1 (82%) |
| California SNAP | 4.0 (80%) | 2.5 (50%) | 4.2 (84%) |
| California Medicaid | 3.9 (78%) | 3.4 (68%) | 4.4 (88%) |
| Georgia SNAP | 4.0 (80%) | 2.9 (58%) | 4.1 (82%) |
| Georgia Medicaid | 4.1 (82%) | 3.0 (60%) | 4.4 (88%) |
| Michigan SNAP | 3.8 (76%) | 2.6 (52%) | 4.5 (90%) |
| Michigan Medicaid | 4.1 (82%) | 2.6 (52%) | 4.5 (90%) |
| Oklahoma Medicaid | 4.3 (86%) | 3.3 (66%) | 4.3 (86%) |
| Pennsylvania SNAP | 3.9 (78%) | 2.9 (58%) | 4.3 (86%) |
| Pennsylvania Medicaid | 4.6 (92%) | 3.1 (62%) | 4.5 (90%) |
| Texas SNAP | 4.3 (86%) | 2.7 (54%) | 4.4 (88%) |
| Texas Medicaid | 4.5 (90%) | 3.6 (72%) | 4.2 (84%) |
This table shows the average scores for each state and program combination for only GPT-4o API-generated responses.

Notable results in this table include:

  • Pennsylvania Medicaid had the highest score for currency and state-specific applicability; Texas Medicaid had the highest score for completeness; and Michigan SNAP, Michigan Medicaid, and Pennsylvania Medicaid were tied for receiving the highest score for relevance. 
  • Michigan SNAP had the lowest score for current and state-specific applicability, while California SNAP had the lowest score for completeness. Alaska SNAP and Georgia SNAP were tied for the lowest score for relevance. 

Considerations for Public Benefits Use Cases

While we applied these methods to specific states and policies, there are also considerations for wider public benefits use cases. 

People seeking benefits information are likely receiving inaccurate information from chatbots. 

When generating responses based only on their training data or internet sources, the web browser versions of ChatGPT and Gemini, and the GPT-4o API, do not reliably generate current and state-specific, complete, and relevant responses to questions about SNAP and Medicaid eligibility policies. 

It is challenging to know where chatbots are sourcing information from. 

While the data sources used by the models likely had a significant influence on how responses scored on our rubric, the lack of citations made it difficult to identify specific sources for most responses. The information provided in the responses likely came from official policy documents on government sites or third-party sources. Furthermore, we did not find a consistent pattern in our scores for the different state and program combinations or the different models, underscoring the challenge of relying only on the models’ training data or other internet sources to gain accurate policy information. For this reason, in Experiment 2, we decided to assess how well the models answered questions when directly provided official policy documents.

Chatbots used confident, authoritative language to convey information, even when inaccurate. 

When the models gave incorrect information, they often did so in a confident tone that would likely cause misunderstandings for readers who were not subject matter experts. Prior research shows that this is even the case for some models explicitly asked to quantify uncertainty9. This provides evidence for the necessity of a standardized method to quantify the uncertainty in AI responses, which is particularly important for Rules as Code generation where correct details are essential. 

Experiment 2: Focusing LLMs on Specific Benefits Policy Documents

Motivation and Question

For our second experiment, the goal was to determine how effectively LLMs summarized eligibility criteria from policy documents into plain language and formats easily convertible to pseudocode. A primary motivation for this study was to investigate how different types of questions affected the extraction of eligibility information from policies. For Experiment 2, we used Retrieval-Augmented Generation (RAG), a technique that allowed an LLM to use not only its training data to respond to queries, but also knowledge from authoritative sources provided by the user. This experiment allowed us to test the impact of adding an authoritative policy document dataset to the process. 

Overall, the main question guiding our study was: 

  • How well does the GPT-4o API generate accurate, complete, and logical summaries of benefits policy rules in response to different types of prompts?

Retrieval-Augmented Generation (RAG) is a process that combines a retrieval system with a generative model (e.g., an LLM like GPT-4 or Llama) to improve the quality and relevance of responses. RAG complements text generation with information from authoritative data sources, e.g. peer reviewed journal articles or actual state policies. By using RAG techniques, conversational interfaces use the relevant information from the authoritative data sources to augment and guide their response. 
In other words, RAG combines the generative power of a neural AI model with specialized information from a database to generate responses that are context-aware and generally more accurate. You can read more about RAG in this survey10.

Methodology

Figure 8: Experiment 2 Methodology
Experiment 2 Methodology Diagram

1. Obtaining Data on Policies

Gather and Assess: We acquired policy documents, guides, and links by consolidating PDF documents and using web scraping.

Evaluate Relevancy: We ensured the data gathered was relevant to the policies in question.

Create a Single PDF: We consolidated documents into a single PDF for easier reference and analysis by state. For example, if a state had separate PDFs for each policy section, they were combined into one PDF.

2. Prompt Generation

Create Specific Prompts: We tailored prompts for each policy and state, asking the LLM about only one policy at a time to ensure we received targeted responses. For example, the prompts for our SNAP eligibility analysis were divided into seven categories based on specific goals.

  • Plain Language Summaries: We aimed to simplify the language and ensure completeness by using prompts, such as:
    • “Please provide a plain language summary of the eligibility criteria for SNAP in Georgia found in the provided document.”
    • “Please summarize the non-financial and financial eligibility requirements for SNAP in Georgia found in the provided document.”
    • “Please do not skip any important details.”
    • “Please keep the response to 500 words or less.”
  • General Summaries: We requested high-level summaries, such as:
    • “Summarize the income, resource, work, or other requirements for SNAP.”
    • “Please summarize the eligibility policy for SNAP in Texas found in the uploaded document.”
  • Logic Requests: We assessed how requesting eligibility logic differed from summary requests, such as:
    • “Based on this document, please provide the logic for what makes someone eligible for SNAP in Georgia.”
  • Eligibility Criteria: We evaluated if LLMs would return plain-language summaries without specific requests, such as:
    • “Based on this document, what are the eligibility criteria for SNAP in Texas?”
  • Eligibility Scenarios: We examined if LLMs could identify relevant details for specific situations, such as:
    • “I am an unemployed single dad in Georgia. What determines my eligibility for SNAP?”
    • “I am a pregnant woman in Georgia. What determines my eligibility for SNAP?”

Additionally, we experimented with a prompt for generating mermaid code—a syntax that uses text to create diagrams—to visualize eligibility logic. However, GPT-4o was unable to produce it.

Note: The example prompts above reference Georgia’s SNAP program, but were adapted for all states and programs included in the project.
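
The per-state, per-program adaptation can be pictured with a simple template loop like the sketch below; it is illustrative only and not the exact tooling we used to produce the prompt set.

```python
# Illustrative sketch of adapting one prompt template across states and programs;
# the wording mirrors the example prompts above, but the loop and data structure
# are assumptions rather than the study's actual tooling.
STATES_BY_PROGRAM = {
    "SNAP": ["Alaska", "California", "Georgia", "Michigan", "Pennsylvania", "Texas"],
    "Medicaid": ["California", "Georgia", "Michigan", "Oklahoma", "Pennsylvania", "Texas"],
}

TEMPLATE = (
    "Please provide a plain language summary of the eligibility criteria for "
    "{program} in {state} found in the provided document. "
    "Please do not skip any important details. "
    "Please keep the response to 500 words or less."
)

prompts = [
    TEMPLATE.format(program=program, state=state)
    for program, states in STATES_BY_PROGRAM.items()
    for state in states
]
print(prompts[0])
```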

3. Building a RAG Environment to Ask Questions about Specific State Policies

Setting up LangChain: We used the LangChain framework, which allows building more advanced AI applications by combining language models with other tools and complex workflows. We implemented LangChain in this step to facilitate the interaction and data processing between the language model and the different sources of information used in the experiment.

LangChain has document loaders, allowing users to feed extracted policies into the LLM. Because LLMs can only process a certain amount of text at once, we split the text into manageable chunks using character-based text splitting. We then converted these chunks into embeddings using LangChain’s Embeddings class and used FAISS (Facebook AI Similarity Search) to perform an efficient similarity search between embeddings. After uploading the document, we entered the prompts that we generated for each state and policy (as seen in the previous step).
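
A condensed sketch of this pipeline is shown below. The import paths and class names follow recent LangChain releases (langchain-community, langchain-openai, langchain-text-splitters) and may differ from the exact versions and settings used in the experiment; the file name and chunking parameters are placeholders.

```python
# A minimal RAG sketch in the spirit of this step, not the study's exact code.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS

# 1. Load the consolidated state policy PDF (hypothetical file name).
pages = PyPDFLoader("georgia_snap_manual.pdf").load()

# 2. Split the text into chunks the model can handle.
chunks = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(pages)

# 3. Embed the chunks and index them with FAISS for similarity search.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 4. Retrieve the chunks most similar to the prompt and ask GPT-4o to answer
#    using only that retrieved policy text.
question = "Based on this document, what are the eligibility criteria for SNAP in Georgia?"
context = "\n\n".join(doc.page_content for doc in index.similarity_search(question, k=4))

llm = ChatOpenAI(model="gpt-4o", temperature=0)
answer = llm.invoke(f"Answer using only this policy text:\n{context}\n\nQuestion: {question}")
print(answer.content)
```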

4. Assessing LLM Outputs

Evaluate Responses: For this experiment, our research group analyzed the responses generated by the LLM for accuracy, completeness and relevance, ensuring they met the criteria set for each policy. 

Table 13: Experiment 2 Rubric 
| Criterion | Question | Scale |
|---|---|---|
| Accuracy | Is the information from the response accurate? | Score 1-5 |
| Completeness | Is the response thorough and does it cover all elements requested in the prompt? | Score 1-5 |
| Relevance | Is the response focused on the question, without adding irrelevant or unnecessary details? | Score 1-5 |

Read more about the Experiment 2 Rubric in the Appendix. →

Developments and Challenges

As outlined in the methodology, we extracted SNAP policies for Alaska, California, Michigan, Georgia, Texas, and Pennsylvania, and Medicaid policies for Oklahoma, California, Georgia, Michigan, Pennsylvania, and Texas. Policy extraction excluded rule summaries on state websites, focusing only on current legislative documents or policy manuals for each state.

A key challenge was policy fragmentation across separate PDFs. In Georgia’s Online Directives Information System of the Department of Human Services (DHS), each section required a separate PDF download, while states like Pennsylvania and Alaska had interactive policy manuals that were easy to navigate on their websites but difficult to download and compile into a single document. For this process, we used Selenium to automate extraction and create a consolidated PDF.
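
The sketch below illustrates this kind of consolidation step: collecting per-section PDF links with Selenium, downloading them, and merging them into one document. The URL, CSS selector, and use of pypdf for merging are assumptions for illustration, not our exact scripts.

```python
# Illustrative consolidation sketch: gather per-section PDF links from a manual's
# index page with Selenium, download each section, and merge them into one PDF.
import requests
from pypdf import PdfWriter
from selenium import webdriver
from selenium.webdriver.common.by import By

INDEX_URL = "https://example.gov/policy-manual"  # placeholder manual index page

driver = webdriver.Chrome()
driver.get(INDEX_URL)
pdf_links = [
    a.get_attribute("href")
    for a in driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")
]
driver.quit()

writer = PdfWriter()
for i, url in enumerate(pdf_links):
    path = f"section_{i}.pdf"
    with open(path, "wb") as f:
        f.write(requests.get(url, timeout=60).content)
    writer.append(path)  # append each downloaded section to the consolidated document

with open("consolidated_manual.pdf", "wb") as out:
    writer.write(out)
```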

Another major challenge was scanned PDFs—like from California’s State Medicaid Program Section 2—which prevented text selection, highlighting, and searching. This was especially problematic since LLMs rely on selectable text for reading and analysis. 

It is important to note that we extracted policy manuals as-is, without edits, to evaluate LLM performance in generating accurate, complete, and logical summaries of the benefit rules.

The tables below provide an overview of key observations from our research and extraction of SNAP and Medicaid policy manuals.

Table 14: Observations of SNAP Manuals: Ease of Use and Accessibility for LLM Input
| State | Strengths | Challenges | Pages | Link |
|---|---|---|---|---|
| Alaska | Easy online accessibility, with a functional index and chapter navigation | Hard to obtain the full policy text directly; requires manual clicking or web scraping through sections, and some appendices are not readily accessible online. | 350 pages, excluding transmittals | Alaska SNAP Manual |
| California | Website includes updated dates of files | Lengthy, making it difficult for AI tools like RAG to process in full. | 1,395 pages | Eligibility and Assistance Standards (EAS) Manual |
| Michigan | A comprehensive guide that’s available in one PDF format | Focuses on Medicaid and SEBT (Summer Electronic Benefit Transfer Program), in addition to SNAP. Embedded TOC (Table of Contents) lacks organization and detail compared to the on-page TOC. | 1,214 pages (same PDF as Michigan Medicaid) | Michigan Bridges Eligibility Policy Manuals |
| Georgia | Well-documented, easy to understand, with separate chapters and sections | Lacks hyperlinks to other sections. The manual is split across multiple PDFs, requiring manual merging for a complete view. | 495 pages | Online Directives Information System of the Georgia Department of Human Services (DHS) |
| Texas | Accessible and easy to navigate, both online and in PDF form | Covers multiple benefit programs, not just SNAP | 985 pages | Texas Works Handbook |
| Pennsylvania | Includes scenarios and examples to make eligibility clear. Separate chapters and sections | The separate sections can make it hard to obtain the full policy text directly; requires manual clicking or web scraping through sections | 553 pages | Pennsylvania Supplemental Nutrition Assistance Program (SNAP) Handbook |
Table 15: Observations on Medicaid Manuals: Ease of Use and Accessibility for LLM Input
| State | Strengths | Challenges | Pages | Link |
|---|---|---|---|---|
| Oklahoma | Single PDF; no need to extract different things | Embedded Table of Contents (TOC) lacks organization and detail compared to the on-page TOC | 1,368 pages | State Plan 05.30.2024.pdf; Oklahoma Administrative Code |
| California | The website provides a table that organizes the manual’s sections and includes its attachment | Scanned pages are not very searchable; revoked policy is sometimes crossed out with a pen mark | 126 pages | California’s State Medicaid Program (Section 2 – Coverage and Eligibility) |
| Michigan | A comprehensive guide that’s available in one PDF format | Focuses on SNAP and SEBT (Summer Electronic Benefit Transfer Program) in addition to Medicaid. Embedded Table of Contents (TOC) lacks organization and detail compared to the on-page TOC. | 1,227 pages | Michigan Bridges Eligibility Policy Manuals |
| Georgia | After merging all the PDFs, the manual is well-documented, easy to understand, and organized with distinct chapters and sections | Website with different sections, but requires manual extraction of each PDF from each section. | 1,193 pages | Georgia Division of Family and Children Services Medicaid Policy Manual |
| Texas | A comprehensive guide that’s available in PDF and easily accessible in HTML and as a website. | Important information about eligibility, such as A-1200 (Resources) and A-1300 (Income), is not included in the guide, but is linked to another webpage for reference. Focuses more on the coverage of each program. | 41 pages | Texas Works Handbook |
| Pennsylvania | Easy online accessibility, with a functional index and chapter navigation | The separate sections can make it hard to obtain the full policy text directly; requires manual clicking or web scraping through sections | 942 pages | Pennsylvania Medical Assistance Eligibility Handbook |

Findings

After feeding the extracted policies into the GPT-4o API using the RAG technique, we applied the specific prompts designed for each policy and state (step 4 of the methodology).

For each state policy, the LLM generated responses to 19 prompts. Since we worked with six states for SNAP and six for Medicaid, and we used 19 prompts for each of these cases, this resulted in 228 responses. Our prompts, responses, and scores are available in the Experiment 2 Materials.

The responses generated by the LLM were reviewed using the evaluation rubric mentioned above in step 5 of the methodology. 

Our findings for how responses performed across three rubric categories— accuracy, completeness, and relevance of response—are included in this spreadsheet. They include both “average rubric performance scores” and “average percentage performance scores,” which convert the average rubric score into a percentage, for standardized comparison.

Figure 9: GPT-4o API Performance by Category by State and Program 
GPT-4o API Performance by Category by State and Program

SNAP Results

When given prompts for SNAP, we found that the GPT-4o API generally returns accurate results; however, scores drop off significantly for completeness and relevance. This means that while the response may be accurate, it is likely missing information that would help inform an action or decision about SNAP. The response may also include information that is not relevant to the prompt, making it less useful for assessing eligibility.

The tables below analyze SNAP guideline performance in six states (Alaska, California, Georgia, Michigan, Pennsylvania, and Texas) across three dimensions: accuracy, completeness, and relevance.

Table 16: Average Performance of SNAP Policies in Accuracy, Completeness, and Relevance by States
| State | Accuracy | Completeness | Relevance |
|---|---|---|---|
| Alaska | 4.84 (96.84%) | 3.00 (60%) | 3.68 (73.68%) |
| California | 4.16 (83.16%) | 3.16 (63.16%) | 3.84 (76.84%) |
| Georgia | 4.26 (85.26%) | 1.74 (34.74%) | 2.63 (52.63%) |
| Michigan | 4.32 (86.32%) | 1.95 (38.95%) | 2.74 (54.74%) |
| Pennsylvania | 4.68 (93.68%) | 2.47 (49.47%) | 3.05 (61.05%) |
| Texas | 4.21 (84.21%) | 2.32 (46.32%) | 3.16 (63.16%) |

Each cell shows the average performance score, with the average performance score by percentage in parentheses.
The table shows the overall performance level, considering all scores for a more comprehensive view of average response performance. The “average performance score” is a numerical evaluation on a scale of 1 to 5 based on the rubric. The “average performance score by percentage” converts this score into a percentage, providing a standardized way to compare results across different contexts.

Notable results in this table include: 

  • All states scored 4 or higher in evaluating response accuracy. Alaska scored the highest with an average performance score of 4.84 (96.84%). This was followed by Pennsylvania with 4.68 (93.68%) and Michigan with 4.32 (86.32%). 
  • All states scored below 3.2 in evaluating response completeness. California had the highest completeness score with 3.16 (63.16%), followed by Alaska with a 3 (60%). Pennsylvania had an average score of 2.47 (49.47%), followed by Texas with 2.32 (46.32%). Michigan scored 1.95 (38.95%), and Georgia had the lowest score at 1.74 (34.74%). 
  • All states scored between 2.5 to 3.8 in evaluating relevance of responses. California received the highest average relevance score of 3.84 (76.84%), followed by Alaska with an average score of 3.68 (73.68%). Texas received an average performance score of 3.16 (63.16%), followed by Pennsylvania at 3.05 (61.05%). Michigan and Georgia scored 2.74 (54.74%) and 2.63 (52.63%), respectively.

Table 17: Percentage of Responses Receiving Top Scores for Accuracy, Completeness, and Relevance in SNAP
| State | Accuracy: % of responses scoring 4 or 5 | Completeness: % of responses scoring 4 or 5 | Relevance: % of responses scoring 4 or 5 |
|---|---|---|---|
| Alaska | 100% | 26.32% | 42.11% |
| California | 84.21% | 42.11% | 57.89% |
| Georgia | 73.68% | 10.53% | 26.32% |
| Michigan | 78.95% | 0% | 31.58% |
| Pennsylvania | 94.74% | 21.05% | 31.58% |
| Texas | 78.95% | 10.53% | 36.84% |
The table shows the percentage of all responses that received the highest scores (a 4 or 5) on accuracy, completeness, and relevance in SNAP.

Notable results in this table include:

  • All states had more than 70% of responses rated as high in accuracy. Alaska performed best among all states, with 100% of responses rated 4 or 5, classified as very good or excellent for accuracy. This was followed by Pennsylvania, where 94.74% of responses received high scores, and California, with 84.21%; then Texas, Michigan and Georgia at 78.95%, 78.95% and 73.7%, respectively.
  • All states had fewer than 43% of responses rated as high-scoring for completeness.  California had the highest percentage, with only 42.11% of responses rated as good or excellent, followed by Alaska, Pennsylvania, Texas and Georgia with 26.32%, 21.05%, 10.53% and 10.53%, respectively. Michigan did not have any responses which scored 4 or 5, resulting in 0%.
  • All states had between 26% and 58% of responses rated as high-scoring for relevance. California had the highest percentage, with 57.89% of responses scoring 4 or 5, followed by Alaska with 42.1%, Texas with 36.84%, Pennsylvania with 31.58%, and Michigan with 31.58%. The state with the lowest percentage of high-scoring responses was Georgia at 26.32%.

Medicaid Results

When given prompts for Medicaid, we found that the GPT-4o API generally returns accurate results; however, scores drop off significantly for completeness and relevance. This means that while the response may be accurate, it is likely missing information that would help inform an action or decision about Medicaid. The response may also include information that is not relevant to the prompt, making it less useful for assessing eligibility.

The following tables analyze the performance of Medicaid guidelines in six states (California, Georgia, Michigan, Oklahoma, Pennsylvania, and Texas) across three dimensions: accuracy, completeness, and relevance.

Table 18: Average Performance in Accuracy, Completeness, and Relevance of Medicaid Policies by States
| State | Accuracy | Completeness | Relevance |
|---|---|---|---|
| California | 5.00 (100%) | 2.58 (51.58%) | 3.84 (76.84%) |
| Georgia | 4.63 (92.63%) | 2.26 (45.26%) | 3.42 (68.42%) |
| Michigan | 4.53 (90.53%) | 2.95 (58.95%) | 4.42 (88.42%) |
| Oklahoma | 4.58 (91.58%) | 2.32 (46.32%) | 3.58 (71.58%) |
| Pennsylvania | 4.32 (86.32%) | 2.58 (51.58%) | 3.47 (69.47%) |
| Texas | 4.58 (91.58%) | 2.95 (58.95%) | 3.89 (77.89%) |
The table shows the overall performance level of responses about Medicaid, including all scores, giving a more complete picture of how responses performed on average. The “average performance score” is a numerical evaluation on a scale of 1 to 5 based on the rubric. The “average performance score by percentage” converts this 1-5 score into a percentage, allowing for a more standardized comparison of results across settings.

Notable results in this table include:

  • All states scored 4 or higher in evaluating accuracy of responses. California scored the highest with an average performance score of 5.00 (100%). This was followed by Georgia, Texas, and Oklahoma with 4.63 (92.63%), 4.58 (91.58%), and 4.58 (91.58%), respectively, while Pennsylvania scored the lowest at 4.32 (86.32%), though still high.
  • All states scored below 3 in evaluating completeness of responses. Texas and Michigan held the highest score, both with a 2.95 (58.95%), followed by Pennsylvania and California, both at 2.58 (51.58%). Oklahoma and Georgia had the lowest scores in this category, with 2.32 (46.32%) and 2.26 (45.26%), respectively.
  • All states scored between 3.4 and 4.5 in evaluating relevance of responses. Michigan led with a score of 4.42 (88.42%), while Texas (3.89, 77.89%) and California (3.84, 76.84%) also performed well. Oklahoma and Pennsylvania showed similar, satisfactory results, with scores of 3.58 (71.58%) and 3.47 (69.47%). Georgia had the lowest score at 3.42 (68.42%).
  • In general, the answers tended to be accurate and relevant, but lacked completeness.
Table 19: Percentage of Responses Receiving Top Scores for Accuracy, Completeness, and Relevance in Medicaid
| State | Accuracy: % of responses scoring 4 or 5 | Completeness: % of responses scoring 4 or 5 | Relevance: % of responses scoring 4 or 5 |
|---|---|---|---|
| California | 100% | 26.32% | 52.63% |
| Georgia | 94.74% | 15.79% | 52.63% |
| Michigan | 84.21% | 21.05% | 78.95% |
| Oklahoma | 89.47% | 21.05% | 52.63% |
| Pennsylvania | 94.74% | 10.53% | 47.37% |
| Texas | 94.74% | 26.32% | 63.16% |
The table shows the distribution of top scores, highlighting how many responses were excellent or very good, meaning they received scores of 4 or 5. 

Notable results in this table include:

  • All states had more than 84% of responses rated as high-scoring in terms of accuracy. California performed the best, with 100% of responses rated 4 or 5, classified as very good or excellent for accuracy. They were followed by Georgia, Pennsylvania, and Texas, with 94.74% for all.
  • All states had less than 27% of responses rated as high-scoring in terms of completeness. Texas and California had the highest percentage, with only 26.32% of responses rated as good or excellent, followed by Michigan and Oklahoma, both with 21.05%. Georgia and Pennsylvania had less than 16% of responses rated 4 or 5, with 15.79%, and 10.53%, respectively.
  • All states had between 45% and 80% of responses rated as high-scoring in terms of relevance. Michigan led with 78.95%, standing out as the only state with more than 70% of responses scoring 4 or 5. Texas followed with 63.16%. The remaining states, including Oklahoma, California, Georgia, and Pennsylvania, each had less than 53% of responses rated 4 or 5, scoring 52.63%, 52.63%, 52.63%, and 47.37%, respectively.

Accuracy, Relevance, and Completeness Trends for Georgia SNAP and Oklahoma Medicaid

Accuracy
  • The AI-generated responses scored very high in accuracy because a document with reliable, up-to-date, and state-specific data was fed into the LLM. In the category of accuracy, 73.68% of Georgia SNAP LLM-generated responses performed well, scoring a 4 or 5 when evaluated with our rubric. For Oklahoma Medicaid, 89.47% of responses scored a 4 or 5.
  • Common accuracy errors included misinterpretation of eligibility criteria (e.g., age or work requirements), confusion between different groups’ criteria, or an overemphasis on recent policies while overlooking older ones that still applied.
Completeness
  • Fewer than a quarter of the LLM-generated responses from both programs scored very high in completeness. Only 10.53% of Georgia SNAP LLM-generated responses scored a 4 or 5 in completeness, while Oklahoma Medicaid saw slightly better results at 21.05%.
  • Common completeness errors included missing crucial details necessary for understanding eligibility and a focus on irrelevant administrative information (e.g., application timelines, identity verification) instead of general eligibility criteria.
Relevance
  • Only between 26% and 53% of LLM-generated responses for Georgia SNAP and Oklahoma Medicaid scored very highly for relevance. In relevance, 26.32% of Georgia SNAP AI-generated responses performed well, scoring a 4 or 5. For Oklahoma Medicaid, 52.63% of responses scored a 4 or 5.
  • Common relevance errors included off-topic answers or extraneous details that confused or diluted the main point.

Comparing Prompt Effectiveness for SNAP and Medicaid Across States

The following charts compare AI-generated responses with RAG for SNAP and Medicaid in Georgia, Pennsylvania, Michigan, and Texas. Each state's chart contains up to 19 bars representing AI responses about a policy, and each bar corresponds to one of 19 questions categorized into six different question types: 

  • Plain Language Summary Request
  • General Summary Request
  • Logic Requests and Logic in a Flowchart Requests
  • Eligibility Criteria (“Who is eligible?” or ”What are the eligibility criteria?”)
  • Scenario-Based Prompts (“What would determine if I am eligible?”)

These visualizations highlight how the LLM performed on different question types, with some questions scoring higher than others. Scores were broken down into three evaluation categories: accuracy, completeness, and relevance. It is important to note that question categories excluded from the charts received “N/A” results, indicating that there was no applicable answer for questions in that specific question category.

Georgia
Figure 10: Prompt Effectiveness for SNAP and Medicaid in Georgia
Prompt Effectiveness for SNAP and Medicaid in Georgia  Bar Chart

For Georgia SNAP, the greatest variation was in the “What would determine if I am eligible” category. The lowest score came from a response to the question, “Who is eligible?/What are the eligibility criteria?”, which received an overall score of 3. This means it scored 1 in all three categories: Accuracy (A), Completeness (C), and Relevance (R). 

Table 20: Example Results of Lowest Score from Georgia SNAP
State: GA
Policy: SNAP
Question on Topic: Who is eligible?/What are the eligibility criteria?
Prompt: Based on this document, what are the eligibility criteria for SNAP in Georgia?
LLM-Generated Response:
The eligibility criteria for SNAP in Georgia are:
1. Age range of 53-59 years (effective October 1, 2023-September 30, 2024)
2. Residing in a SNAP household with a household member under age 18, even if the member under 18 is not eligible for SNAP benefits. Note: The ineligible member under 18 must reside in the SNAP AU and may be an ineligible AU member such as an ineligible alien or a sanctioned AU member.
3. Receiving SNAP assistance.
4. Being fit for employment.
5. Voluntarily enrolling in the SNAP Works Program.
6. Meeting one or more of the following criteria:
– Being a work registrant (either an ABAWD or mandatory registrant)
– Being pregnant
Scores (A/C/R): 1 / 1 / 1

It’s worth noting that none of the responses reached the maximum possible score of 15. However, two responses scored the highest for Georgia SNAP, with a total of 13 points.

Table 21: Example Results from Georgia Medicaid
State: GA
Policy: Medicaid
Question on Topic: Logic in a flow chart
Prompt: Based on this document, please generate the mermaid code for a flow chart of the logic for what makes someone eligible for Medicaid in Georgia.
LLM-Generated Response: According to the document, you could potentially be eligible for Medicaid if you have a child with special needs and meet the eligibility criteria for the Adoption Assistance Program, or if you are living in a placement outside the home under certain circumstances. It is recommended that you contact the State Adoptions Unit or the Department of Family and Children Services for more information on eligibility for Medicaid.
Scores (A/C/R): 1 / 2 / 1

For Georgia Medicaid, the lowest overall score across the categories was a 4. This score came from one response to the “Logic in a flow chart” prompt. 

Table 22: Example Results from Georgia Medicaid
State: GA
Policy: Medicaid
Question on Topic: What would determine if I am eligible?
Prompt: I am a veteran living with my spouse and three children in Georgia. Based on the provided document, what would determine if I am eligible for Medicaid?
LLM-Generated Response: As a veteran, your eligibility for Medicaid in Georgia would depend on several factors, including your current residence, citizenship/alienage criteria, and income. If you are a resident of Georgia and meet the citizenship/alienage criteria, you may be eligible for Medicaid if your income falls within the eligibility limits. However, if your spouse is not eligible for Medicaid, their income may be deemed to determine your eligibility. Additionally, if you have children who are determined eligible for Adoption Assistance, they may also be eligible for Medicaid. It is important to refer to the specific eligibility criteria outlined in the document and consult with the Division of Family and Children Services for further clarification and assistance with your Medicaid application.
Scores (A/C/R): 5 / 4 / 5

Similarly, no responses achieved the maximum score of 15, but two responses scored 14, and four responses scored 13.

Pennsylvania
Figure 11: Prompt Effectiveness for SNAP and Medicaid in Pennsylvania
Prompt Effectiveness for SNAP and Medicaid in Pennsylvania Bar Chart
For Pennsylvania SNAP, the greatest variation was observed in the “Summary” category. The lowest score was a 7, which came from a response to a “Summary” prompt.

Table 23: Example Result from Pennsylvania SNAP
State: PA
Policy: SNAP
Question on Topic: Summary
Prompt: Please summarize the eligibility policy for SNAP in Pennsylvania found in the uploaded document. Please do not skip any important details.
LLM-Generated Response:
The eligibility policy for SNAP (Supplemental Nutrition Assistance Program) in Pennsylvania includes the following criteria:

1. Household members who are ineligible or disqualified, such as noncitizens, those without a Social Security number, those who have committed intentional program violations, and those who have not complied with work requirements, may not participate in the program.

2. Income must be below 130% of the Federal Poverty Income Guidelines for the household size for Semi-Annual Reporting households.

3. ABAWD (Able-Bodied Adults Without Dependents) must meet a work requirement of at least 80 hours per month.

4. Boarders may only be included as household members if the household wants to include them. Foster care children and adults are treated as boarders.

5. Nonexempt ABAWDs who have received SNAP benefits for 3 months in an earlier 36-month period and are not meeting the work requirement are not eligible for SNAP.

6. Other criteria, such as allowable shelter deductions, temporary employment with the United States Census Bureau, and income from state employees, may also affect eligibility.
Scores (A/C/R): 5 / 1 / 1
None of the responses reached the maximum possible score of 15. However, two responses achieved 14 points, and five responses scored 13.

Table 24: Example Results from Pennsylvania SNAP

Example 1
State: PA
Policy: SNAP
Question on Topic: Summary
Prompt: Please summarize the eligibility policy for SNAP in Pennsylvania found in the uploaded document. Please keep the response to 500 words of less.
LLM-Generated Response:
The eligibility policy for SNAP in Pennsylvania can be summarized as follows:

– The income limit for SNAP is 130% of the Federal Poverty Income Guidelines (FPIGs) for the household size, and there is no resource limit for households that qualify under the expanded categorical eligibility criteria.
– For households that are on a Semi-Annual Reporting (SAR) schedule, income that is more than 130% of the FPIGs and ABAWD (Able-Bodied Adults without Dependents) work hours that fall below 80 hours per month may result in ineligibility for SNAP.
– The eligibility policy for SNAP is governed by 7 CFR § 273.2(f)(6), 7 CFR § 273.10(c), 7 CFR § 273.12(a)(i)(vii), and 7 CFR § 273.12(a)(1)(5)(v).
– The policy was reissued on March 1, 2012, replacing the previous version from December 19, 2008.
– The eligibility policy also includes updates on various topics such as UC, EUC, and EB in Pennsylvania, allowable shelter deductions, SNAP interview appointments, categorical eligibility and notices, temporary employment with the United States
Scores (A/C/R): 5 / 4 / 5

Example 2
State: PA
Policy: SNAP
Question on Topic: What would determine if I am eligible?
Prompt: I am a single adult who cannot work due to a medical condition in Pennsylvania. Based on the provided document, what would determine if I am eligible for SNAP?
LLM-Generated Response: The eligibility for SNAP is determined by the individual’s income and resources, as well as their household status. If a licensed medical provider has indicated on Form PA 1663 or PA 635 that the individual is permanently or temporarily disabled for a period of 12 months or more, they may be eligible for SNAP under expanded categorical eligibility criteria. Other factors such as household size, income, and resources will also be considered in determining eligibility.
Scores (A/C/R): 5 / 4 / 5

For Pennsylvania Medicaid, the lowest overall score across the three categories was a 9. This score came from one response to the “plain language summary” prompt.

Table 25: Example Result from Pennsylvania Medicaid
State: PA
Policy: Medicaid
Question on Topic: “Plain language” summary
Prompt: Please provide a plain language summary of the eligibility criteria for Medicaid in Pennsylvania found in the provided document.
LLM-Generated Response: In order to be eligible for Medicaid in Pennsylvania, an individual must meet certain financial, non-financial, and resource requirements. These requirements may vary depending on the specific Medicaid program that the individual is applying for, such as SSI-related, MAGI-related, or TANF-related Medicaid. The application process should be completed within 30 days, but may take up to 45 days in some cases. Once eligibility is determined, the individual will receive a written notice and can choose which program is best for them. Children under 21 are automatically eligible for certain benefits, including ambulance services, dental services, and prescription drugs. The Standard Filing Unit (SFU) is a system used to determine eligibility for Medicaid and includes all eligible members of an applicant/recipient group, as well as certain household members. Immediate family members must be included in the SFU. For more information, individuals can contact their MA MCO or the Fee-For-Service Recipient Service Center.
Scores (A/C/R): 5 / 2 / 2

Similarly, no responses reached the maximum score of 15; only one response achieved a 14.

Table 26: Example Result from Pennsylvania Medicaid
State: PA
Policy: Medicaid
Question on Topic: “Plain language” summary
Prompt: I am a pregnant woman in Pennsylvania. Based on the provided document, what would determine if I am eligible for Medicaid?
LLM-Generated Response: Based on the provided document, your eligibility for Medicaid would depend on factors such as your residency status, income level, and whether or not you have applied for Medicaid in another state. It is also important to note that pregnant women in Pennsylvania may be eligible for MA or CHIP at the time of birth, regardless of their income or residency status. You may also be eligible for PE for Pregnant Women Healthcare Benefits Package, which provides coverage for pregnancy-related services. Ultimately, your eligibility for Medicaid would need to be determined by a qualified MA provider using the appropriate forms and documentation.
Scores (A/C/R): 5 / 4 / 5

Michigan
Figure 12: Prompt Effectiveness for SNAP and Medicaid in Michigan
Prompt Effectiveness for SNAP and Medicaid in Michigan Bar Chart

For Michigan SNAP, the greatest variation was observed in the “Summary” category. The lowest score was a 7, shared by one response to a “Summary” prompt, one to a “Logic” prompt, and one to the “What would determine if I am eligible?” prompt. None of the responses reached the maximum possible score of 15, and only a few responses scored as high as 13.

For Michigan Medicaid, the greatest variation was observed in the “Plain language summary” category. The lowest overall score across the three categories was an 8, which came from one response to a “Plain language summary” prompt. Similarly, no responses reached the maximum score of 15, but seven responses achieved a score of 13.

Texas
Figure 13: Prompt Effectiveness for SNAP and Medicaid in Texas
Prompt Effectiveness for SNAP and Medicaid in Texas Bar Chart

For Texas SNAP, the greatest variation was observed in the “Summary” category. The lowest score was a 4, which came from one response to the “What would determine if I am eligible?” prompt. None of the responses achieved the maximum score of 15, but one response scored 14, and two responses scored 13.

For Texas Medicaid, the greatest variation was found in the “What would determine if I am eligible?” category. The lowest overall score across the three categories was a 9, which came from one response to a “Plain language summary” prompt. Similarly, no responses reached the maximum score of 15, but three responses scored 14, and two scored 13.

Considerations for Public Benefits Use Cases

While we applied these methods to specific states and policies, there are considerations for wider public benefits use cases. 

State governments can make it easier for their policies to be used in LLMs by making them digitally accessible. At a minimum, this is a single PDF that allows text to be extracted; even better is a plain text or HTML document or webpage that presents the policy in full. One excellent example of digitally accessible policy is Oklahoma. The Oklahoma Rules website allows users to view and download the HTML of the entire administrative code for the state.


Focusing LLMs on specific policy documents increases accuracy in responses, but results are mixed for relevance and completeness. Simply pointing an LLM to an authoritative document does not mean it will pull out the relevant information from the document. A more promising direction may be to consider one of the following: custom-designed RAG document collections, fine-tuning with labeled data, or newer reinforcement learning approaches that automatically attempt to identify the relevant information for the query. Additionally, the way policy documents are written and structured—such as including specific eligibility criteria tables—can also improve LLM performance.

Experiment 3: Using LLMs to Generate Machine-Readable Rules

Motivation and Question

In this experiment, we examined the capability of LLMs to automate an essential component of Rules as Code solutions. Specifically, we assessed the ability of LLMs to systematically extract rules from official policy manuals and encode them in a machine-readable format. This type of experiment supported our goal of enabling efficient implementation of digitized policy. If rules could be successfully generated, then a rules engine could be developed more efficiently and programmatically integrated into broader software applications. Due to the inherently unstructured nature of policy documents, we recognized the need for a structured approach to access rules in a standardized format. This type of standardization would create consistency and applicability across different states and programs, while facilitating a uniform evaluation process for the generated rules in terms of accuracy and completeness.

The main guiding questions of the study are:

  • Do LLMs reliably extract rules from unstructured policy documents when guided by a structured template?
  • How does the use of a structured rules template impact an LLM’s ability to produce relevant and accurate output?
  • What performance differences emerge in extracting policy rules when using a structured template with a plain-prompting approach versus a RAG framework?

Rules Templating

Initial attempts to prompt the LLM to directly identify rules from policy documents proved unreliable, as outputs were highly sensitive to prompt phrasing. We concluded that the LLM struggled to produce dependable and uniform results without a strictly-defined rules template. We conducted the remaining experiments with a manually-curated rules template for both SNAP and Medicaid across six states. For SNAP, the rules included criteria such as income limits based on family size, citizenship requirements, and other key eligibility conditions. For Medicaid, the rules incorporated covered groups (e.g., pregnant women, low-income families) and age requirements, among other criteria. Figure 14 presents an example of a simplified rules template used for SNAP.

Figure 14: Simplified SNAP Rules Template
Simplified SNAP Rules Template
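
Since Figure 14 is presented as an image, the sketch below gives a sense of what a simplified rules template of this kind might look like. The field names, structure, and placeholder values are illustrative assumptions, not the project's actual template or real policy figures.

```python
# Hypothetical, simplified SNAP rules template (illustrative only).
# Field names and values are placeholders, not the project's actual template.
snap_rules_template = {
    "program": "SNAP",
    "state": "<STATE>",
    "rules": {
        "gross_income_limit": {
            "description": "Monthly gross income limit by household size",
            "unit": "USD/month",
            "values_by_household_size": {"1": None, "2": None, "3": None},
        },
        "asset_limit": {
            "description": "Countable resource limit",
            "unit": "USD",
            "value": None,
        },
        "citizenship_requirement": {
            "description": "Citizenship or eligible immigration status",
            "value": None,
        },
        "abawd_work_requirement": {
            "description": "ABAWD work requirement",
            "unit": "hours/month",
            "value": None,
        },
        "certification_period": {
            "description": "Length of the certification period",
            "unit": "months",
            "value": None,
        },
    },
}
```

In the experiments, the LLM's task was then to fill in the missing values for a given program and state.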

Methodology

Figure 15: Experiment 3 Methodology
Experiment 3 Methodology Diagram

Our task in this experiment involved prompting the LLM to populate the predefined rules with corresponding values based on a specified program and state. The goal was to compare the ability of readily available LLMs and LLMs with additional external knowledge to generate programmatic policy rules. To this end, we evaluated two distinct approaches using GPT-4o as the LLM. 

An initial direct-query approach was unsuccessful, which led us to an experimental design that began by compiling a rules template: a predefined set of policy rules assembled in a structured JSON format. These rules were selected based on their common applicability across different programs and state-specific variations in our case studies. Below, we describe the workflow of both approaches, illustrated in Figure 15. 

A. Template-Based Rules Generation (Plain Prompting)

Prompt Construction: We formulated a prompt for generating rules using the rules template and task-specific instructions.

Rules Generation: We provided the curated prompts to the LLM (GPT-4o), which generated values for the policy rules relying on its own training data. 

Output Evaluation: We evaluated the produced rules against the original values manually-extracted from authoritative sources (e.g., state policy manuals).
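
As a minimal sketch of this plain-prompting pass (assuming a `call_llm` helper that wraps the GPT-4o call and a rules template like the one sketched above; the instruction wording is paraphrased, not our exact prompt):

```python
import json
from typing import Callable

def generate_rules_plain(rules_template: dict, program: str, state: str,
                         call_llm: Callable[[str], str]) -> dict:
    """Ask the LLM to fill in the template using only its own training data (no retrieval)."""
    prompt = (
        f"Fill in every value in the following {program} rules template for {state}, "
        "using only your own knowledge, and return valid JSON.\n\n"
        + json.dumps(rules_template, indent=2)
    )
    return json.loads(call_llm(prompt))
```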

B. Template-Based Rules Generation with RAG

Document Indexing: We collected official policy documents and organized them in a searchable format. The process involved breaking the documents into fixed-size chunks, which were then converted into vector embeddings (numerical representations of the text that deep learning models can use). By mapping these vectors into a common embedding space, we could assess how semantically similar two text segments were. The goal of this step was to access the relevant information from the policy document collection to improve the final response provided by the LLM. 
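
As an illustration of the indexing step just described, the sketch below chunks a policy document into fixed-size pieces, embeds each chunk, and ranks chunks by cosine similarity to a query. The `embed` callable is a stand-in for whichever embedding model is used; it is an assumption of this sketch, not a specific library from our experiments.

```python
import math
from typing import Callable, List, Tuple

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split a policy document into fixed-size, slightly overlapping chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_index(chunks: List[str],
                embed: Callable[[str], List[float]]) -> List[Tuple[str, List[float]]]:
    """Embed every chunk once so it can be searched later."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, index: List[Tuple[str, List[float]]],
             embed: Callable[[str], List[float]], top_k: int = 5) -> List[str]:
    """Return the top_k chunks most semantically similar to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```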

Prompt Construction: The rules generation prompt was formulated using the rules template and task-specific instructions. This prompt was used twice:

  • First, the LLM was instructed to generate search queries to retrieve relevant documents from the RAG database.
  • Once the relevant documents were retrieved, they were appended to the generation prompt, and the LLM was instructed to generate the policy rules. 

RAG Queries and Document Retrieval: The LLM-generated search queries were then used to find the most semantically-relevant text segments in the policy database. The retrieved segments were appended to the rules generation prompt.

Rules Generation: After incorporating the relevant policy information into the prompt, the LLM generated policy rules guided by the retrieved policy excerpts. 

Output Evaluation: We evaluated the produced rules against manually-extracted rules values and verified the output from authoritative sources.
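
Putting the workflow above together, a minimal orchestration sketch might look like the following. Here `call_llm` and `retrieve_chunks` are placeholders for the GPT-4o call and the vector-database lookup, and the prompt wording is a condensed paraphrase rather than our exact template.

```python
import json
from typing import Callable, List

def generate_rules_with_rag(rules_template: dict,
                            call_llm: Callable[[str], str],
                            retrieve_chunks: Callable[[str], List[str]]) -> dict:
    """Two-pass RAG workflow: (1) ask the LLM for search queries,
    (2) retrieve matching policy text and ask it to fill the template."""
    template_json = json.dumps(rules_template, indent=2)

    # Pass 1: the LLM proposes search queries for the rules it must fill in.
    query_prompt = (
        "You will fill in the policy rules template below. "
        "Write one search query per rule, one per line, to locate the relevant policy text.\n\n"
        + template_json
    )
    queries = [q.strip() for q in call_llm(query_prompt).splitlines() if q.strip()]

    # Retrieval: collect the most semantically relevant policy excerpts for each query.
    excerpts: List[str] = []
    for query in queries:
        excerpts.extend(retrieve_chunks(query))

    # Pass 2: the retrieved excerpts are appended to the rules generation prompt.
    generation_prompt = (
        "Using only the policy excerpts below, fill in every value in this rules template "
        "and return valid JSON.\n\nTEMPLATE:\n" + template_json
        + "\n\nPOLICY EXCERPTS:\n" + "\n---\n".join(excerpts)
    )
    return json.loads(call_llm(generation_prompt))
```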

Developments and Challenges

Template Design: Developing a comprehensive rules template to capture the evolving, state-specific nuances of public benefits policies was challenging. However, manually developing such a template remained a critical first step to standardize outputs and enable systematic rule extraction.

Chunking and Context Overhead: Another significant challenge with the RAG framework involved compiling a manageable yet comprehensive set of resources for retrieval. As noted in Experiment 2, state-issued policy manuals, which sometimes spanned thousands of pages, presented a retrieval challenge due to their sheer volume and complexity. To mitigate this, large chunks of text were regularly retrieved to maximize the chances of including the relevant information in the LLM’s input. However, this strategy created a trade-off: While longer contexts increased the likelihood of capturing relevant details, they also degraded the LLM’s performance by overwhelming its processing capacity—a challenge commonly referred to as the “needle in the haystack” problem. In other words, the model struggled to find critical information buried in a large volume of less relevant data. In contrast, retrieving fewer documents reduced context size but increased the risk of missing key information.

Information Scattering and Retrieval: Policy details relevant to a single rule were sometimes scattered across multiple sections of a manual. In such cases, a document re-ranking strategy—which improved retrieval accuracy by re-scoring the initially-retrieved documents using advanced semantic similarity—became essential to prioritize the most relevant segments for retrieval. While effective, this approach introduced additional computational overhead and complexity in generation.
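
A re-ranking step of the kind described above can be sketched as follows. The `score_pair` callable stands in for a more expensive semantic scorer (for example, a cross-encoder) and is an assumption of the sketch, not a specific library call from our experiments.

```python
from typing import Callable, List

def rerank(query: str, candidates: List[str],
           score_pair: Callable[[str, str], float], keep_top: int = 3) -> List[str]:
    """Re-score initially retrieved chunks against the query and keep the best ones.

    score_pair(query, chunk) is applied only to the small candidate set, trading
    extra computation for better prioritization of the most relevant segments.
    """
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep_top]]
```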

Findings

For Georgia, Pennsylvania, and Texas, we generated rules that represented policies associated with state-specific SNAP programs. These rules were grouped into higher-level categories (e.g., income, asset limit). For each category, we calculated the average accuracy of the generated rules for each state and approach. The results are presented in Figure 16.

A detailed spreadsheet of the results is included in the Appendix.

Figure 16: Average Rules Generation Accuracy by State and Category
Average Rules Generation Accuracy by State and Category

Notable results shown in Figure 16 include: 

Plain prompting yielded unreliable generation accuracy for state-specific rules. The plain prompting approach without additional external knowledge (i.e., without RAG) demonstrated significant limitations with numerical rules like income and asset limits, achieving 0% accuracy in these categories across all states. Conversely, categories that were more likely to be consistent across states, such as citizenship and work requirements, were generated with 100% accuracy (except for citizenship in Texas, where the plain LLM approach failed to provide the correct residency requirement). 

Other categories, such as standard deductions and certification periods, yielded mixed success. Challenges were primarily due to the LLM’s reliance on training data, which lacked the granularity required for producing localized state variations. However, for rules that remained uniform across states, such as federally-mandated rules (citizenship and categorical eligibility), the LLM was able to retrieve information accurately from its native knowledge base. 

RAG enhanced the alignment of generated rules with policy documents. The RAG approach effectively addressed the shortcomings of the plain prompting method by providing the additional context necessary to enhance both generation accuracy and timeliness. With this framework, the LLM successfully generated all predefined rules with 100% accuracy, except for a few cases such as the certification period in Georgia, deductions in Pennsylvania, and the monthly allotment in Texas. These results, as shown in Figure 16, highlight the clear difference in performance between the two approaches.

Missing context in RAG led to erroneous outputs. A closer analysis of the certification period category in Georgia revealed that the inaccurate generation using the RAG framework was a consequence of the absence of relevant information from our policy database—excluding key context from the LLM’s input. This highlighted a unique challenge introduced by the RAG framework: In general, the LLM relies heavily on provided context and less on pretrained knowledge. As a result, when required information was missing from the context, the model was more prone to generating erroneous outputs, as it assumed the answer existed in the retrieved text.

Considerations for Public Benefits Use Cases

Structured templates are essential for rules extraction. The success of machine-readable rules generation is, to a large degree, dependent on the use of standardized templates. Public benefits agencies should collaborate to develop shared schemas that capture multiple levels of eligibility criteria. The absence of such templates results in significantly higher levels of inaccuracy and inconsistency in LLM output, further complicating interoperability with other components of an automated system. 

Rules generation can provide a structure for code generation. Given the success of rules generation for most contexts in this experiment, separating the rule generation from the code generation is a promising direction. While there is a need to begin with a manually-created template, once that is created, it can be part of a template library that can be used to generate the rules. The generated rules act as a structured skeleton of the code, providing the necessary logic for cleaner, more robust code generation. This multi-step, LLM-assisted implementation is a promising direction because policies in different states are inconsistently written. This variation means that translating certain highly variable policies directly to code may not be as promising a path for complicated policies as generating rules. 

Regular policy changes impact LLM output. To keep up with the recurring policy and legislative changes of public benefits programs, automated Rules as Code systems must remain up to date. This involves closely monitoring the LLM’s output to ensure it reflects the latest policy amendments. This might require scheduled model fine-tuning and updating the policy library used for document retrieval. Requiring the LLM to provide a reference for the generated rules may help with this as well. 

LLMs must handle dependencies across different benefits eligibility criteria. Benefits programs frequently involve interconnected requirements. For instance, a household’s gross income limits may vary depending on its composition (e.g., whether a member is elderly or disabled). If the LLM failed to account for such variations, it would lead to incorrect eligibility determination in automated systems. Therefore, careful handling of these dependencies is essential to prevent inaccuracies in LLM-powered automations. This could be addressed through detailed audits using test cases that feature intricate policy rules. A complementary approach would be to consider incorporating knowledge graphs into the RAG input, either independently or in combination with a traditional vector database. 
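
To make this kind of dependency concrete, the sketch below shows how a test case could exercise a gross income test that is waived when a household includes an elderly or disabled member. The function name and dollar amounts are hypothetical placeholders, not actual SNAP or Medicaid limits.

```python
# Illustration of an interdependent rule; all dollar amounts are placeholders.
PLACEHOLDER_GROSS_LIMITS = {1: 1500, 2: 2000, 3: 2500}  # USD/month, not real limits

def passes_gross_income_test(monthly_income: float, household_size: int,
                             has_elderly_or_disabled_member: bool) -> bool:
    if has_elderly_or_disabled_member:
        # In this illustration, the gross income test does not apply to such households.
        return True
    return monthly_income <= PLACEHOLDER_GROSS_LIMITS[household_size]

# Test cases like these would catch generated code that ignores the dependency:
assert passes_gross_income_test(2600, 3, has_elderly_or_disabled_member=True)
assert not passes_gross_income_test(2600, 3, has_elderly_or_disabled_member=False)
```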

Policy documents need to be accessible by machines. Policy accessibility directly impacts accuracy in LLM-powered systems. In addition to making policies more accessible for humans, state governments should advocate for digitizing policies in structured formats and adopting more modular and easily searchable documents to improve their ability to be read by machines. This would allow these systems to easily localize critical information in lengthy manuals and reduce retrieval complexity.

Experiment 4: Using LLMs to Generate Software Code

Motivation and Question

In this experiment, we aimed to evaluate the ability of LLMs to generate software code, specifically for determining eligibility status in benefits programs, by building on insights from earlier phases of this study. Our primary objective was to understand the feasibility of using LLMs to automate code generation for key components of the benefits delivery process. By doing this, we began exploring how such automation could simplify access to benefits for eligible individuals, assist organizations in optimizing service workflows, and contribute to the development of more efficient and adaptable systems for managing benefits programs.

The main questions guiding our study were: 

  1. How did different prompting strategies impact the accuracy and usability of LLM-generated code for eligibility determination?
  2. Did LLMs synthesize end-to-end logic from unstructured policy documents into functional software code, and what were the key limitations in modeling interdependencies between eligibility criteria?
  3. Did an iterative, multi-step approach reduce errors compared to direct code generation, or did it introduce compounding inaccuracies?

Methodology

Figure 17: Experiment 4 Methodology 
Experiment 4 Methodology Diagram

We developed three distinct experimental designs to assess how different inputs and prompting strategies impacted the quality and effectiveness of LLM-generated software code. Our goal was to determine the optimal input structure and level of guidance required by LLMs to generate accurate and functional code. 

To control for potential confounding factors, we designed a small-scale prototype task that focused on a single program and state. Specifically, we prompted the LLM to generate software code to determine an applicant’s SNAP worker registration status in Georgia. This targeted experiment allowed us to closely examine, in a controlled setting, the LLM’s code-generation capabilities. For this experiment, we used GPT-4o to evaluate the different prompting strategies. We tested the following three prompting strategies in this task. Figure 17 illustrates the workflow of each experimental design.

A. Design 1: Simple Prompt (Baseline)

Expert knowledge construction: In order to provide state-specific policy knowledge, we added a policy excerpt as input to the LLM. This excerpt was extracted from Georgia’s official SNAP manual and was manually-verified to include all necessary information for determining an applicant’s worker registration status, given their attributes (e.g., age and employment status). This served as the primary reference for the LLM to infer policy rules and code logic.

Minimalistic design prompt construction: For this experimental design, we used minimal instructions in the prompt to guide the LLM’s code generation. This approach allowed us to assess the LLM’s capability to autonomously interpret the policy text and extract the relevant, underlying logic without additional guidance.

B. Design 2: Detailed Prompt (Guided Approach)

Expert knowledge construction: We provided the policy excerpt, along with a more structured prompting template featuring detailed instructions. 

Expanded design prompt construction: We expanded the instructions to address the observed shortcomings from the baseline, e.g., incomplete decision-making logic, inadequate input handling, or a lack of modularity in the generated code. The detailed guidance emphasized improving code structure and enhancing the LLM’s ability to correctly interpret and apply the rules. Table 27 shows the prompt template used for the guided approach.

Table 27: Example Guided Prompt
System Prompt:
You are an expert policy analyst and a skilled software developer expert working on a project to digitize the US benefits policies by translating complex policy text into software code to automate the benefits eligibility determination process.

User Prompt:
You are tasked with writing a Python script to determine the work registration status of an applicant for SNAP benefits in {STATE}. You are provided with official policy documentation that covers the rules and regulations regarding the work registration requirements. Follow these steps carefully:
1. Extract Relevant Inputs:
 – Thoroughly analyze the policy documentation.
 – Identify and list all inputs required to determine the work registration status.
2. Model Logical Relationships:
 – Define the relationships between the identified inputs.
 – Specify valid values for each input.
 – Implement error handling for invalid inputs.
3. Handle Missing Inputs:
 – For each input, determine if it’s mandatory or optional.
 – For mandatory inputs, implement user prompts to collect the information.
 – For optional inputs, define default values and explain their implications.
4. Data Structure Design:
 – Choose appropriate data structures to store and manage the inputs efficiently.
 – Ensure the chosen structures allow for easy updates and modifications.
5. Implement Decision Logic:
 – Develop the core logic to determine the work registration status based on the inputs.
 – Ensure all possible scenarios and edge cases are covered.
6. Documentation and Structure:
 – Provide clear, comprehensive comments throughout the code.
 – Organize the code into logical sections or functions for better readability.
7. Rule Storage:
 – Store decision-making rules in a format that’s easily updatable (e.g., JSON, YAML).
 – Implement a mechanism to load and apply these rules dynamically.
8. Testing and Validation:
 – Include test cases covering various scenarios.
 – Validate the script’s output against manual calculations.

**Policy Documentation**
{policy_text}

Provide your Python script, following the steps outlined above
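
As an illustration of how a template like the one in Table 27 might be filled in and sent to GPT-4o, the sketch below uses the OpenAI Python SDK. The function name, variable names, and the decision to substitute {STATE} and {policy_text} with simple string replacement are assumptions of this sketch, not our experiment code.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def generate_eligibility_script(state: str, policy_text: str,
                                system_prompt: str, user_prompt_template: str) -> str:
    """Fill the {STATE} and {policy_text} placeholders and request a Python script."""
    user_prompt = (user_prompt_template
                   .replace("{STATE}", state)
                   .replace("{policy_text}", policy_text))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```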

C. Design 3: Iterative Prompts

To gain a better understanding of the basic tasks that make up the code generation process, we implemented an iterative prompting strategy, using the following steps:

Expert knowledge construction: We summarized the policy excerpt to extract and distill key rules and requirements. This was a textual summary, not an explicit template as described in Experiment 3. The generated summary served as a condensed representation of the policy text, directing the LLM’s focus toward only the essential information for subsequent steps.

Pseudocode generation: Based on the summarized policy, the LLM was instructed to generate pseudocode. Pseudocode is a simplified, human-readable description of the logical steps required to implement a program. This intermediate representation can be easier for an LLM to translate into code than the policy text itself. 

Pseudocode translation: We asked the LLM to translate the pseudocode into executable Python code. This step converted the pseudocode into a functional program that determined the worker registration status of an applicant given their input information.

In this iterative approach, we created a structured pipeline for code generation by feeding each step’s output in as the input to the next step. We have made our prompt templates available in the Appendix.
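
Conceptually, the pipeline chains three LLM calls, as in the sketch below. `call_llm` is a placeholder for the GPT-4o call, and the prompts are condensed paraphrases of the templates in the Appendix rather than their exact text.

```python
from typing import Callable

def iterative_code_generation(policy_excerpt: str, call_llm: Callable[[str], str]) -> str:
    """Three-step pipeline: summarize policy -> generate pseudocode -> translate to Python.

    Each step's output becomes the next step's input.
    """
    summary = call_llm(
        "Summarize the key rules and requirements for SNAP work registration "
        "from the following policy text:\n\n" + policy_excerpt
    )
    pseudocode = call_llm(
        "Write pseudocode that determines an applicant's work registration status "
        "based on this policy summary:\n\n" + summary
    )
    return call_llm(
        "Translate the following pseudocode into a runnable Python script:\n\n" + pseudocode
    )
```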

This structured methodology allowed us to systematically evaluate the impact of different prompting techniques and assess how well the LLM adapts to varying levels of guidance and task complexity. Each approach produced a Python script, which was then manually evaluated based on key qualitative benchmarks: variable identification, input handling, output correctness, decision making, logical consistency, rule coverage, and code execution. We discuss these criteria in the findings section. 

Developments and Challenges

Prompt Complexity and Task Alignment. Crafting prompts proved to be a central challenge throughout the experiment, particularly in the iterative approach, where each stage required tailored prompts with varying input structure and output requirements. In our experimentation, overly strict prompts constrained adaptability, while looser prompts led to reasoning gaps that LLMs struggled to bridge. Striking the right balance required extensive trial and refinement.

Lack of Expert-Curated Test Cases for Benchmarking. Our evaluation was limited by the absence of standardized test cases designed by policy experts. A structured test suite would enable a more systematic validation, reducing reliance on qualitative assessment and improving comparability across different systems and prompting methods.

Isolating Policy Interpretation from Code Generation. In non-iterative approaches, errors were difficult to attribute. Policy misinterpretations and code synthesis errors were often entangled in the final output. A more systematic verification framework is needed to distinguish policy comprehension failures from issues in logical modeling and code implementation. 

Findings

Table 28 summarizes the performance of the three approaches using qualitative metrics that evaluate the effectiveness of the generated code. At a high level, Design 2 was the most successful. We now discuss each design individually before summarizing our findings.

We have made prompts and generated code available in the Appendix. 

Table 28: Performance of Designs Across Criteria
Criteria | Design 1 (Simple Prompt) | Design 2 (Detailed Prompt) | Design 3 (Iterative Prompts)
Variable Identification | Partial | Good | Good
Input Handling | Poor | Partially good | Poor
Output Correctness | Incorrect | Partially correct | Incorrect
Decision Making | Poor | Partially good | Poor (mechanical)
Logical Consistency | Poor | Improved | Partially consistent
Rule Coverage | Partially covered | Improved | Partially covered
Code Execution | Runs (unstable) | Runs (Improved) | Doesn’t run

Design 1: Baseline approach was limited in code synthesis despite having foundational rules extracting capabilities. In the baseline approach, the LLM demonstrated a reasonable capacity to identify relevant rules and variables required to construct the script, including correctly assigning data types (e.g., int, bool) and effectively categorizing rules, such as distinguishing exemption rules from mandatory registrant rules. This indicated a foundational ability to extract rules from the policy excerpt. However, significant issues existed in the final code output. Variable naming inconsistencies led to logical errors in the code, resulting in runtime errors and incorrect eligibility classifications. For example, as shown in Figure 18, the rule for determining “head-of-household” eligibility for 16- and 17-year-olds as part of the mandatory registrant decision logic referenced a ‘school_status’ variable that was never collected in the program input, causing the program to exit with an error.

Other issues causing runtime failures included the flawed input handling from a lack of proper validation and poor error-checking mechanisms. Additionally, failure to accurately model the policy logic led to inefficient code execution, such as requesting unnecessary user inputs rather than focusing on the subset needed for eligibility determination. In general, the LLM struggled to synthesize the multiple steps required—extracting relevant rules, interpreting their definitions, modeling interdependencies, and assembling the final code— to generate a coherent and functional output when provided with minimal guidance in the prompt. 

Figure 18: Code Snippet from the Baseline Approach Output
Code Snippet from the Baseline Approach Output

Design 2: The guided approach improved code quality but requires further refinement. With the guided approach, which incorporated detailed instructions, the LLM showed significant improvements over the baseline approach. The generated code demonstrated a more consistent logical structure, with variable naming and usage aligning throughout the program. The eligibility determination logic improved, and the resulting code was more accurate overall. However, certain limitations remained, including inadequate error handling mechanisms and a user interaction design that lacked clarity and intuitiveness. In addition, as shown in Figure 19, the eligibility logic still relies on hard-coded threshold values, limiting both code maintainability and adaptability to evolving policy and economic conditions. While the additional guidance enabled the LLM to better interpret and apply the policy rules, it still fell short of fully generating accurate, modular, adaptable code, highlighting areas where further refinement or human intervention may be necessary.

Figure 19: Code Snippet from the Guided Approach Output
Code Snippet from the Guided Approach Output
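
As an illustration of this maintainability concern (and of the "rule storage" step requested in the guided prompt), the sketch below contrasts a hard-coded threshold with one loaded from an external rules file. The 80-hour ABAWD figure comes from the policy text quoted earlier; the file name and key are assumptions of this sketch.

```python
import json

# Hard-coded version, similar in spirit to what the generated code tended to produce:
ABAWD_REQUIRED_HOURS = 80  # hours per month

def meets_work_requirement_hardcoded(hours_per_month: int) -> bool:
    return hours_per_month >= ABAWD_REQUIRED_HOURS

# More maintainable version: the threshold lives in an external, easily updatable
# rules file (file name and key are illustrative), so policy updates do not
# require changes to the code itself.
def meets_work_requirement(hours_per_month: int, rules_path: str = "snap_rules.json") -> bool:
    with open(rules_path) as f:
        rules = json.load(f)
    return hours_per_month >= rules["abawd_required_hours_per_month"]
```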

Design 3: Iterative approach introduced compounding error challenges despite task compartmentalization. 

In the iterative approach, where the task was broken down into incremental steps—policy summarization, pseudo-code generation, and final code generation—the LLM showed promise in its ability to compartmentalize tasks. The policy summary effectively captured the document’s structure and extracted key rules into a concise format, although there were occasional inconsistencies in articulating the eligibility logic. For example, the summary classified 51- to 59-year-olds as mandatory without mentioning they can be exempt if they are medically unfit, physically or mentally, or responsible for caring for a child under six. The pseudocode also formalized this summary into a logical framework but lacked precision, providing a simplified rather than comprehensive representation of the policy. For example, it treated every caretaker status as automatically exempt without verifying if the caretaker provided care for an incapacitated person or a child under six, overlooking essential policy requirements for eligibility checks. The final code, derived as a direct translation of the pseudocode, mirrored these deficiencies. The LLM’s interpretation was mechanical, failing to fully grasp some of the policy logic embedded in the pseudocode. As a result, gaps in the pseudocode carried over to the final script, leading to execution errors. Key details were lost during the transitions between steps. For example, as shown in Figure 20, “head of household” was not explicitly defined as an input variable in the pseudocode, despite being an essential eligibility factor in the decision logic. This omission caused the generated code to reference an uninitialized variable, resulting in a runtime error. Contrary to our initial belief, this multi-step process introduced cumulative errors at each stage, proving less effective for generating functional code than Design 2. 

Figure 20: Code Snippet from the Iterative Approach Output
Code Snippet from the Iterative Approach Output

In summary, our baseline experimental design generated unusable code, but did partially structure the logic effectively. The guided experimental design offered the strongest results for practical implementation when paired with human oversight. The prompts were refined and adapted to better align with policy rules, the underlying logic, and expected outputs. The iterative approach holds potential for complex and extended tasks; however, its utility depends on carefully mitigating errors that occur at each step of the pipeline through automated checks or expert review. Until more templates and standardized documents are developed, this approach is less viable than the guided approach. 

Considerations for Public Benefits Use Cases

Summarized policy guidance for code generation reduces code errors. Our guided approach improved code quality significantly when compared to the baseline and also performed better than the iterative approach. Public benefits agencies should collaborate with each other to develop prompting frameworks that specify input format, variable naming conventions, and expected output structure to reduce ambiguities in LLM tasks, thereby improving the quality of the generated code. 

Modular design is particularly important in LLM workflows. Our findings show that LLMs struggle when they have to synthesize end-to-end logic for complex, multi-step tasks. Although their overall performance in the iterative approach was not markedly better than other methods, they showed a clear pattern of improved response quality when handling individual, compartmentalized tasks (e.g., summarizing policy). This suggests that public benefits agencies might potentially split complex workflows into smaller, reusable modules, which LLMs can tackle incrementally. At this stage, human developers are needed to both break down the workflow into reasonable chunks for LLMs and validate outputs from one step to the next. 

Accuracy and equity considerations must outweigh efficiency in high-stakes benefits systems. In code generation tasks—where outputs such as applicant eligibility status carry serious consequences and unresolved automation risks—we suggest treating LLMs as assistive tools to support human experts and developers, not as standalone solutions. Precision and fairness should be firm priorities in public benefits systems, making human oversight essential to ensure equitable outcomes. Efficiency gains from automation should not compromise these priorities.

Conclusion

Key Takeaways

  1. Our experiments highlight that LLMs can support different parts of the Rules as Code pipeline, but a human in the loop and rich databases containing relevant, up-to-date policy excerpts are essential to facilitate the use of this technology.
  2. Using LLMs to distill benefits eligibility policies leads to mixed results. The returned information varies in its accuracy, timeliness, state-specificity, completeness, and relevance. The results are mixed across models and interfaces, including the web browser and APIs. While one model may perform better using a specific prompt type (e.g., income or individual eligibility) or category of scoring (e.g., accuracy or relevance), it may perform poorly with other prompts and categories, leading to mixed overall performance. 
  3. Mixed LLM results have a direct impact on people seeking or receiving benefits, risking incorrect information when they ask generative AI models questions about programs like SNAP and Medicaid. When testing chatbots in Experiment 1, we found that Medicaid responses scored slightly higher overall than SNAP responses and that there were no major differences in the models we tested. Chatbots access information from the model’s training data, and it is often unclear if that information is from authoritative sources. This means that we should expect some answers from chatbots to be inaccurate and therefore, encourage people to not use chatbots to obtain information about benefits programs if a source for the information is not returned. 
  4. It is possible to improve the performance of the benefits-related responses by pointing LLMs to authoritative sources like policy manuals. However, the current methods for communicating SNAP and Medicaid policy in fragmented PDFs and interactive websites make it hard for LLMs to use the information. State governments can make it easier for LLMs to use their policies by making them digitally accessible. At a minimum, this means a single PDF that allows text extraction; even better is a plain text or HTML document, or a webpage presenting the policy in full, including the effective date.
  5. When AI models provide incorrect information, they often do so in a confident tone, which can mislead those without subject expertise. This highlights the need for a standard way to quantify the uncertainty in AI responses, particularly for policy Rules as Code generation, where accuracy is essential.
  6. Asking LLMs to write policy code directly leads to poor code. However, a guided approach with attached policy excerpts and very detailed prompts yields better executable code, though it remains poorly-designed.
  7. Code generation can be improved by using an LLM with RAG to generate machine-readable policy rules. This is a viable alternative to manual rule curation. However, a manually-curated template is still necessary.

Potential for Future Experimentation

We encourage ongoing, open experimentation to test the application of LLMs in public benefits eligibility and the development of Rules as Code. 

  • Our methodologies can be repeated for other states, programs, policies, or LLMs. Given that our research captures LLMs’ performance at the time of our experiments, repeating these experiments as new versions are released can identify if and how performance improves with updates. Different prompting techniques, such as active prompting, may also improve LLM performance in Experiments 1 and 2. All experiments can also be applied to additional states, programs, and policies. 
  • Policy experts and software engineers can pair program and evaluate. We conducted these experiments with general policy knowledge and computer science students. We’ve been following Nava Labs’ project to work closely with benefits navigators and their design and engineering teams to develop new tools for benefits access and enrollment. Working with policy experts would be fruitful for capturing policy logic templates more effectively.
  • Explore creating an LLM for a specific program or policy. Creating specific LLMs fine-tuned for a specific program—such as SNAP—and continually updated with RAG as policies change is an important direction, allowing states to have a stronger starting point in this space. Additionally, a program-specific model could have increased capabilities to understand differences between federal and state policy. 
  • Consider different programming languages for outputs. Experiment 4 could be explored using different programming languages to see if some LLMs are more or less successful at generating code. 
  • Compare results against existing systems. Administrators of eligibility and enrollment systems could compare results against the code and information provided by those systems. Additionally, many existing systems also have test cases that could be used to evaluate LLM outputs. 
  • Explore extracting code from legacy systems. Many eligibility and enrollment systems are written in legacy code languages. Administrators could explore using LLMs to convert the legacy code into an updated language using a guided template. 
  • Compare code writing efforts to policy analysis efforts. It is unclear how much effort using LLMs can save when developing Rules as Code infrastructures for different programs. An analysis of this may highlight a shift from programmer time to policy analyst time.

Get in Touch 

We’re eager to hear from you. This work, like any public-facing project, improves as more stakeholders offer their input. We welcome your thoughts, questions, or potential collaboration ideas. Email us at rulesascode@georgetown.edu.

Citation

Cite as: Ariel Kennan, Lisa Singh, Alessandra Garcia Guevara, Mohamed Ahmed, Jason Goodman, “AI-Powered Rules as Code: Experiments with Public Benefits Policy.” Beeck Center for Social Impact + Innovation and Massive Data Institute, Georgetown University, March 24, 2025.

Icons from NounProject and Icons8.

Thank You 

We’d like to acknowledge and thank the following colleagues for their contributions and support of this research:

  • Anjelika Deogirikar Grossman, Associate Director, MDI
  • Ashlee Sellung, Digital Government Hub Coordinator 
  • Elizabeth Bynum Sorrell, Senior Research + Engagement Manager, DBN
  • Jennifer Melot, Center for Security and Emerging Technology (CSET)
  • Lynn Overmann, Executive Director, Beeck Center
  • Olivia Zhao, Student Analyst, DBN 
  • Michael Ribar, Policy2Code Coach, Principal Associate for Social Policy and Economics Research at Westat
  • Mahendran Velauthapillai, Professor and McBride Family Endowed Chair, Georgetown University Department of Computer Science
  • Sabrina Toppa and Ashleigh Fryer, Beeck Center Communications Team
  • Steve Kent, Software Developer, MDI
  • Tina Amper, Community Advisor, Rules As Code Community of Practice, DBN

Appendix 

Experiment materials are available in a public shared folder.

Experiment 1 Materials

  • Spreadsheet: Prompts with scores and comments
  • Spreadsheet: Cumulative scores
  • Rubric 

Experiment 1 Rubric 

Criterion | Question | Scale
Current and state-specific applicability | Is the information from the response current and state-specific? | Score 1-5
Completeness | Is the response thorough and does it cover all elements requested in the prompt? | Score 1-5
Relevance | Is the response focused on the question, without adding irrelevant or unnecessary details? | Score 1-5

The following serves as an overview of the assessment rubric and explains what each score means:

  • Current and state-specific information
    • 5 points: All information provided is accurate, current, and state-specific
    • 4 points: More accurate/current/state-specific than inaccurate/not current/lacking state-specificity 
    • 3 points: About as accurate/current/state-specific as inaccurate/not current/lacking state-specificity
    • 2 points: More inaccurate/not current/lacking state-specificity than accurate/current/state-specific 
    • 1 point: All inaccurate and not current information that lacks state-specificity
       
  • Completeness
    • 5 points: Fully thorough response that covers all the elements that were requested in the prompt, and explains each of them in detail
    • 4 points: Mostly thorough answer that covers most of the elements that were requested in the prompt, but explains each of them in a not very meticulous way
    • 3 points: The answer is not so meticulous but it is not so broad either
    • 2 points: The response is mostly broad and lacks some details (such as specific numbers, definitions, etc.) or information about the item(s) requested in the prompt 
    • 1 point: The response is extremely broad and does not contain any details (such as specific numbers, definitions, etc.) or information about the item(s) requested in the prompt
  • Relevance
    • 5 points: The answer directly addresses the question asked, without including unrelated or unimportant information
    • 4 points: The answer is mostly relevant, but includes a few details that are not essential to the question asked. These details do not obstruct the understanding of the response
    • 3 points: The answer addresses the question as much as it deviates from the focus of the question
    • 2 points: The answer contains mostly irrelevant or tangentially unrelated elements that could confuse or dilute the main information needed to effectively answer the question
    • 1 point: The answer is completely off-topic, with all the content irrelevant to the question

Experiment 2 Materials

  • Spreadsheet: Prompts with scores and comments
  • Spreadsheet: Cumulative scores
  • Rubric

Experiment 2 Rubric 

Criterion | Question | Scale
Accuracy | Is the information from the response accurate? | Score 1-5
Completeness | Is the response thorough and does it cover all elements requested in the prompt? | Score 1-5
Relevance | Is the response focused on the question, without adding irrelevant or unnecessary details? | Score 1-5

The following provides an overview of the assessment rubric and what each score means:

  • Accuracy
    • 5 points: All information provided is accurate
    • 4 points: More accurate than inaccurate
    • 3 points: About as accurate as inaccurate
    • 2 points: More inaccurate than accurate
    • 1 point: Completely inaccurate
  • Completeness
    • 5 points: Fully thorough response that covers all the logic and elements that were requested in the prompt
    • 4 points: Mostly thorough answer that covers most of the logic and elements that were requested in the prompt
    • 3 points: The answer contains about half of the logic/elements requested in the prompt
    • 2 points: The response lacks some important details (such as specific numbers, definitions, etc.) or information to satisfy the prompt 
    • 1 point: The response does not contain any details (such as specific numbers, definitions, etc.) or information about the item(s) requested in the prompt
  • Relevance
    • 5 points: The answer directly addresses the question asked, without including unrelated or unimportant information
    • 4 points: The answer is mostly relevant, but includes a few details that are not essential to the question asked. These details do not obstruct the understanding of the response
    • 3 points: The answer addresses the question as much as it deviates from the focus of the question
    • 2 points: The answer contains mostly irrelevant or tangentially unrelated elements that could confuse or dilute the main information needed to effectively answer the question
    • 1 point: The answer is completely off-topic, with all the content irrelevant to the question

Experiment 3 Materials

  • Spreadsheet: Contains the generated rules from both the plain prompting and RAG approaches, compared against the verified SNAP Policy values in Georgia, Pennsylvania, and Texas.

Experiment 4 Materials 

  • Prompts: Includes the system and user prompts used for code generation. For the iterative approach, there are three sets of prompts for each stage, as outlined in the experiment details.
  • Generated Code: Contains the code results from each approach, with our manual annotations highlighting strengths [STRENGTH] and weaknesses [WEAKNESS].

Footnotes

  1. Ariel Kennan, Lisa Singh, Bianca Dammholz, Keya Sengupta, and Jason Yi. “Exploring Rules Communication: Moving Beyond Static Documents to Standardized Code for U.S. Public Benefits Programs,” June 6, 2023.
  2. Ariel Kennan and Tina Amper. “Cross-Sector Insights From the Rules as Code Community of Practice,” Digital Benefits Network, February 6, 2024.
  3. James Mohun and A. Roberts. “Cracking the code: Rulemaking for humans and machines,” OECD Working Papers on Public Governance, No. 42, OECD Publishing, Paris, 2020.
  4. Hamish Fraser and Tom Barraclough. “Governing Digital Legal Systems: Insights on Artificial Intelligence and Rules as Code,” MIT Computational Law Report, 2024.
  5. Tom Barraclough, Hamish Fraser, and Curtis Barnes. Legislation as Code, March 2021.
  6. Denis Merigoux, Marie Alauzen, and Lilya Slimani. “Rules, Computation and Politics: Scrutinizing Unnoticed Programming Choices in French Housing Benefits,” Journal of Cross-disciplinary Research in Computational Law, 2023, 2 (1), pp. 23.
  7. Justin D. Weisz, Michael Muller, Steven I. Ross, Fernando Martinez, Stephanie Houde, Mayank Agarwal, Kartik Talamadupula, and John T. Richards. “Better Together? An Evaluation of AI-Supported Code Translation.” In 27th International Conference on Intelligent User Interfaces (IUI ’22), March 22–25, 2022, Helsinki, Finland. ACM, New York, NY, USA, 35 pages.
  8. Basem Aly. “How to talk to AIs: Advanced Prompt Engineering.” Emerging Technologies Knowledge Base, December 2023.
  9. Autumn Toney-Wails and Lisa Singh. “Are You Sure About That: Eliciting Natural Language Confidence Indicators from Chatbots.” In Proceedings of the 12th International Conference on Human-Agent Interaction (HAI ’24). Association for Computing Machinery, New York, NY, USA, 278–286.
  10. Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. “Retrieval-Augmented Generation for Large Language Models: A Survey,” March 2024.