
AI-Powered Rules as Code: Experiments with Public Benefits Policy

This report documents four experiments exploring whether AI can be used to expedite the translation of SNAP and Medicaid policies into software code for implementation in public benefits eligibility and enrollment systems under a Rules as Code approach.

Abstract

Public interest technologists are still early in our understanding of how to effectively use large language models (LLMs) to translate policy into code. This report documents four experiments conducted to evaluate the current performance of commercially-available LLMs in translating policies into plain language summaries, machine-readable pseudocode, or usable code within a Rules as Code process. We used eligibility rules and policies for the Supplemental Nutrition Assistance Program (SNAP) and Medicaid. The experiments include asking a chatbot or LLM about specific policies, summarizing policy in a machine-readable format, and using fine-tuning or Retrieval-Augmented Generation (RAG) to enhance an LLM’s ability to generate code that encodes policy. We found that LLMs are capable of supporting the process of generating code from policy, but still require external knowledge and human oversight within an iterative process for any policies containing complex logic.

Introduction

The rise of commercially-available large language models (LLMs) presents an opportunity to use artificial intelligence (AI) to expedite the translation of policies into software code for implementation in public benefits eligibility and enrollment systems under a Rules as Code approach. Our team experimented with multiple LLMs and methodologies to determine how well the LLMs could translate Supplemental Nutrition Assistance Program (SNAP) and Medicaid policies across seven different states. We present our findings in this report.

We conducted initial experiments from June to September 2024 during the Policy2Code Prototyping Challenge. The challenge was hosted by the Digital Benefits Network and the Massive Data Institute as part of the Rules as Code Community of Practice. Twelve teams from the U.S. and Canada participated in the resulting Policy2Code Demo Day. We completed the remaining experiments and analysis between October 2024 and February 2025.

In this report you will learn more about key takeaways including:

  • LLMs can help support the Rules as Code pipeline. LLMs can extract programmable rules from policy by leveraging expert knowledge retrieved from policy documents and employing well-crafted templates.
  • LLMs achieve better policy-to-code conversion when prompts are detailed and the policy logic is simple.
  • State governments can make it easier for LLMs to use their policies by making them digitally accessible.
  • Humans must be in the loop to review outputs from LLMs. Accuracy and equity considerations must outweigh efficiency in high-stakes benefits systems.
  • Current web-based chatbots produce mixed results, often presenting incorrect information in a confident tone.


Overview of Experiments

Our experiments addressed the following questions: 

  • Experiment 1: How well can LLM chatbots answer general SNAP and Medicaid eligibility questions based on their training data and/or resources available on the internet? What factors affect their responses?
  • Experiment 2: How well can an LLM generate accurate, complete, and logical summaries of benefits policy rules when provided official policy documents? 
  • Experiment 3: How well can an LLM extract machine-readable rules from unstructured policy documents in terms of output relevance and accuracy?
  • Experiment 4: How effectively can an LLM generate software code to determine eligibility for public benefits programs? 

Below, we describe each of the four experiments, focusing on their motivation, methodology, findings, and considerations for public benefits use cases.

Open Materials

Each experiment includes openly accessible materials in a linked appendix. We share the rubrics we developed to evaluate the accuracy, completeness, and relevance of the responses. Additionally, we share our prompts, responses, and scores. Note that the prompts may generate different responses if run again through the same models because models are continually being updated. 

Current State: Policy from Federal and State Government

Public benefits in the United States are governed by a complex network of federal, state, territorial, tribal, and local entities. Governments enact a web of intersecting laws, regulations, and policies related to these programs. These are often communicated1 in lengthy and complex PDFs and documents distributed across disparate websites. Maintaining eligibility rules for public benefits across eligibility screeners, enrollment systems, and policy analysis tools—like calculating criteria for household size, income, and expenses—is already a significant cross-sector challenge. This complexity hinders government administrators’ ability to deliver benefits effectively and makes it difficult for residents to understand their eligibility. 

We have heard directly from state and local government practitioners about the challenges2 with their eligibility and enrollment systems and how they intersect with policy. For example, one state practitioner described it as “a scramble” to update the code in legacy systems. Implementation is often completed by a policy team, technology team, or external vendor, which increases the potential for errors or inconsistencies when system changes or updates are needed. An additional challenge is that no U.S. government agency currently makes its eligibility code open source, which ultimately reduces transparency and adaptation across other government systems.

Introduction to Rules as Code

Adopting Rules as Code for public benefits is a crucial strategy to improve the connection between policy and digital service delivery in the U.S. The Organization for Economic Co-operation and Development (OECD) defines3 Rules as Code as “an official version of rules (e.g. laws and regulations) in a machine-consumable form, which allows rules to be understood and actioned by computer systems in a consistent way.” Using Rules as Code in expert systems and automated or semi-automated decision-making systems is a key use case4.

This approach facilitates seamless and transparent policy integration into standardized public benefits delivery systems. A state leader noted that Rules as Code supports a “no wrong door” vision, ensuring consistent eligibility criteria across all service entry points, or “doors.”

In a future Rules as Code approach to benefits eligibility, policies and regulations would be translated into both plain-language logic flows (pseudocode) and standardized, machine-readable formats, such as Extensible Markup Language (XML), JavaScript Object Notation (JSON), or YAML Ain’t Markup Language (YAML). This would enable easy access by various systems, including benefits applications, eligibility screeners, and policy analysis tools. Doing this would enable stakeholders to simultaneously review the legislation or regulation, plain language logic, and code, and know that all are official and ready for implementation. Ideally, this would also enable a more seamless cycle of writing, testing, and revising the rules. Others have written about the challenges and consequences of implementing Rules as Code, including an in-depth analysis5 of work in New Zealand, and a lawyer and programmer’s recreation of government system code6 for French housing benefits.
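
As a concrete illustration, a single eligibility rule in such a standardized format might look like the sketch below. The structure, field names, and values are hypothetical placeholders, not an established schema or actual policy.

```python
import json

# Hypothetical, illustrative example only: a single eligibility rule expressed as a
# machine-readable structure. Field names and threshold values are placeholders,
# not actual SNAP policy for any state.
gross_income_rule = {
    "program": "SNAP",
    "jurisdiction": "US-XX",  # placeholder state code
    "rule_id": "gross_income_limit",
    "description": "Gross monthly income must not exceed a multiple of the federal poverty level.",
    "parameters": {"fpl_multiplier": 1.30},
    "inputs": ["gross_monthly_income", "household_size"],
    "citation": "placeholder citation to the governing regulation",
}

# The same structure can be serialized to JSON (or YAML/XML) so that screeners,
# enrollment systems, and policy analysis tools can all read one official source.
print(json.dumps(gross_income_rule, indent=2))
```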

Figure 1: Benefits Eligibility Rules as Code
Benefits Eligibility Rules as Code Diagram: a policy, law, or regulation is translated into machine-readable code in standardized syntax and plain language logic, which can then be used in benefits eligibility and enrollment systems

There is a vibrant ecosystem of organizations that have already created online eligibility screening tools, benefits applications, policy analysis tools, and open-source resources across numerous public benefits programs. These organizations play a crucial role in increasing access to benefits for eligible individuals, and several serve as tangible examples of how open, standardized code can translate policy into system rules. 

In addition to open-code bases, there are other frameworks that can inform the standardization of communicating Rules as Code for U.S. public benefits programs. To reduce time and money spent on creating new frameworks, we believe it is important to evaluate and test any existing frameworks or standards to determine if they can be adopted or further developed. 

Introduction to Generative AI 

Artificial intelligence (AI) is the science of developing intelligent machines. Machine learning is a branch of AI that learns to make predictions based on training data. Deep learning is a specific type of machine learning, inspired by the human brain, that uses artificial neural networks to make predictions. The goal is for each node in the network to process information in a way that imitates a neuron in the brain. The more layers of artificial neurons in the network, the more patterns the neural network is able to learn and use for prediction.

Many successful generative AI tools and applications are designed using deep learning models. A large difference between classic machine learning and generative AI is that generative AI models not only identify patterns that they have learned, but also generate new content. One specific type of deep learning model for processing human text is the large language model (LLM). LLMs aim to capture the relationships between words and phrases. They have been trained on enormous amounts of text and have human-like language ability across a range of traditional language tasks. However, it remains unclear how well different LLMs and tools built using LLMs—e.g. chatbots—work in more complex, domain-specific prediction tasks, like Rules as Code.

Intersection of Generative AI with Rules as Code

As the use of automated and generative AI tools increases and more software attempts to translate benefits policies into code, the existing pain points will likely be exacerbated. To increase access for eligible individuals and ensure system accuracy, it is imperative to quickly build a better understanding of how generative AI technologies can, or cannot, aid in the translation and standardization of eligibility rules in software code for these vital public benefits programs. We see a critical opportunity to create open, foundational resources for benefits eligibility logic that can inform new AI technologies and ensure that they are correctly interpreting benefits policies. 

If generative AI tools are able to effectively translate public policy or program rules into code and/or convert code between software languages7, it could: 

  • Speed up a necessary digital transformation to Rules as Code;
  • Allow for humans to check for accuracy and act as logic and code reviewers instead of programmers;
  • Automate testing to validate code;
  • Allow for easier implementation of updates and changes to rules;
  • Create more transparency in how rules operate in digital systems;
  • Increase efficiency by eliminating duplicate efforts across multiple levels of government and delivery organizations; and 
  • Reduce the burden on public employees administering and delivering benefits.

Additionally, once rules are in code, it could enable new pathways for policymakers, legislators, administrators, and the public to model and measure the impacts of policy changes. For example, a generative AI tool could help model different scenarios for a rules change, using de-identified or synthetic data to measure impact on a specific population or geography. Additionally, generative AI can also help governments migrate code from legacy systems and translate it into modern code languages or syntaxes, as well as enable interoperability between systems. 

How We’ve Been Advancing Rules as Code 

The Digital Benefits Network (DBN) and Massive Data Institute (MDI) teams are among the first to extensively research how to apply a Rules as Code framework to the U.S. public benefits system. We draw on international research and examples, and have identified numerous U.S.-based projects that could inform a national strategy, along with a shared syntax and data standard.

The DBN hosts the Rules as Code Community of Practice—a shared space for people working on public benefits eligibility and enrollment systems, particularly people tackling how policy becomes software code.

Project Overview

Policy Focus

We focused on SNAP and Medicaid because they are essential benefits programs with high participation rates and are often integrated into a combined application. Both are means-tested programs that use specific criteria to determine eligibility and benefits amounts. 

Broadly, benefits policy rules fall into two categories: eligibility rules and benefits rules. Eligibility rules determine whether an applicant qualifies for the program based on their personal and household information. On the other hand, benefits rules refer to the type and scope of benefits an eligible applicant is entitled to receive. Both categories of rules are essential for developing scalable applications that improve and streamline the digitized delivery of benefits. 
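
To make the distinction concrete, the sketch below separates the two categories into two functions. It is a hypothetical illustration; the thresholds and the benefit formula are placeholders, not actual SNAP or Medicaid rules.

```python
from dataclasses import dataclass

@dataclass
class Household:
    size: int
    net_monthly_income: float

# Placeholder values for illustration only, not actual policy figures.
NET_INCOME_LIMIT = {1: 1_255.0, 2: 1_704.0}   # hypothetical limits by household size
MAX_ALLOTMENT = {1: 292.0, 2: 536.0}          # hypothetical maximum benefit amounts

def is_eligible(h: Household) -> bool:
    """Eligibility rule: does the applicant qualify for the program at all?"""
    return h.net_monthly_income <= NET_INCOME_LIMIT.get(h.size, float("inf"))

def benefit_amount(h: Household) -> float:
    """Benefits rule: what amount does an eligible household receive?"""
    if not is_eligible(h):
        return 0.0
    # Hypothetical formula: maximum allotment minus 30% of net income.
    return max(0.0, MAX_ALLOTMENT.get(h.size, 0.0) - 0.3 * h.net_monthly_income)

print(is_eligible(Household(size=1, net_monthly_income=1_100.0)))   # True
print(benefit_amount(Household(size=1, net_monthly_income=1_100.0)))
```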

States

We focused on SNAP and Medicaid policies in five states: California, Georgia, Michigan, Pennsylvania, and Texas. Additionally, we included Alaska’s SNAP policy and Oklahoma’s Medicaid policy.

We selected these states based on the following criteria, aiming for diversity across these characteristics:

  1. The state’s previous identification as delivering “strong” or “poor” communication of eligibility criteria in SNAP policy manuals
  2. Political party affiliation of the state’s governor
  3. The state’s ability to offer Broad-Based Categorical Eligibility (BBCE) for SNAP
  4. The state’s participation in the Affordable Care Act (ACA) expansion for Medicaid
Table 1: State Selection Criteria
| State | Strong/Poor SNAP Manual | Governor’s Political Party (as of October 2024) | BBCE Offered for SNAP? | ACA Expansion? |
|---|---|---|---|---|
| Alaska | Strong | Republican | No | N/A (only looked at SNAP policy) |
| California | Unknown | Democrat | Yes | Yes |
| Georgia | Poor | Republican | Yes | No |
| Michigan | Unknown | Democrat | Yes | Yes |
| Oklahoma | Unknown | Republican | N/A (only looked at Medicaid policy) | Yes (recent expansion) |
| Pennsylvania | Strong | Democrat | Yes | Yes |
| Texas | Unknown | Republican | Yes, with asset limit on Temporary Assistance for Needy Families (TANF)/Maintenance of Effort (MOE) programs | No |

Technology Focus

Our four experiments used a well-known LLM, GPT-4o, accessed through its Application Programming Interface (API). We also used the web browser versions of two AI chatbots: ChatGPT and Gemini. At the time of this study, ChatGPT used GPT-3.5 while Gemini used Gemini 1.5 Flash. In this study, we refer to both the LLM and chatbot outputs as LLM output. We recognize that chatbots offer less flexibility with respect to setting parameters.
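
For readers unfamiliar with API access, the sketch below shows roughly how a single prompt can be sent to GPT-4o programmatically, assuming the OpenAI Python client and an API key available in the environment; it is illustrative rather than our exact experimental harness.

```python
# A minimal sketch (not the study's actual harness) of sending one eligibility
# prompt to GPT-4o through the API, assuming the OpenAI Python client (openai>=1.0)
# and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Who is eligible for Medicaid in Texas?"},
    ],
    temperature=0,  # assumption: a low temperature for more deterministic answers
)

print(response.choices[0].message.content)
```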

Assessing LLM Outputs

Our research group developed rubrics to evaluate the responses the LLMs generated in Experiments 1 and 2. Each category—accuracy, completeness, relevance, and state-specific or current applicability—captured a different aspect of the response. Moreover, each factor influenced the degree to which the response met the requirements of the prompt. For example, a response might have been highly accurate, but lacked relevance, making it less effective for specific prompts. By evaluating these categories separately, we were able to better assess the quality of LLM-generated responses. 

Additionally, human evaluation—often considered the gold standard—is typically slow and costly. For Experiments 1 and 2, our human evaluators were a small group of two people who, while knowledgeable when reviewing the documents, were not experts in the state policies. Moving forward, it would be valuable to have established guidelines or evaluation metrics to support this type of task. Also, having a larger and more diverse group of evaluators could improve evaluation consistency and reduce subjectivity in scoring.

For Experiment 3, we evaluated the LLM’s structured rule outputs (e.g., income limits or work requirements) by comparing them against manually-verified values from state policy documents. This helped us quantify the alignment between generated rules and primary sources. In Experiment 4, we adopted a qualitative approach, evaluating the quality and utility of AI-generated code through criteria such as output accuracy (correct implementation of policy logic), logical consistency (coherence of decision flows), and rules coverage (completeness of encoded policy requirements). 
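
The comparison in Experiment 3 can be pictured with a small sketch like the one below; the rule names and values are placeholders rather than real policy figures.

```python
# Illustrative sketch of an Experiment 3-style check: compare structured values the
# LLM extracted against values verified by hand from a state policy manual.
# All numbers below are placeholders, not actual policy figures.
verified = {"gross_income_limit_hh1": 1_632, "abawd_work_hours_per_week": 20}
extracted = {"gross_income_limit_hh1": 1_632, "abawd_work_hours_per_week": 30}

matches = {key: extracted.get(key) == value for key, value in verified.items()}
alignment = sum(matches.values()) / len(verified)

for key, ok in matches.items():
    print(f"{key}: {'match' if ok else 'mismatch'} "
          f"(verified={verified[key]}, extracted={extracted.get(key)})")
print(f"Alignment with primary sources: {alignment:.0%}")
```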

Experiment 1: Asking Chatbots About Benefits Eligibility 

Motivation or Question

In this research, we set out to analyze how three LLMs—the web browser versions of ChatGPT and Gemini, as well as the GPT-4o API—extracted and provided information about eligibility for different policies. Our goal was to measure how well these models answered questions related to benefits programs like SNAP and Medicaid. The main questions guiding our study were: 

  • How well can AI chatbots answer general SNAP and Medicaid eligibility questions?
  • What factors affect their responses?

Methodology

Figure 2: Experiment 1 Methodology
Experiment 1 Methodology Diagram

1. Prompt Generation

We took two different approaches to designing the prompts in the experiment. First, we used zero-shot prompting, in which the model is asked a question without being given any examples or demonstrations. We also used general knowledge prompting, a technique in which questions are structured to leverage the model’s broad understanding of the world. Read more about different prompting techniques here8.

We divided the prompts into the following categories:

  • Overall Eligibility Criteria prompts asked who is eligible for the SNAP or Medicaid programs. Optimal answers included all important eligibility criteria for specific programs. For instance:
    • “Who is eligible for Medicaid in Texas?”
  • Individual Eligibility Criteria prompts asked about a single SNAP or Medicaid eligibility criterion rather than all criteria at once. Ideal answers provided specific details about the criterion in question. For instance:
    • “What is the income criteria for Medicaid eligibility in Texas?”
    • “What is the employment criteria for Medicaid eligibility in Texas?”
  • Employment Scenario prompts described a specific situation about the individual’s work status and asked how it affected eligibility for SNAP or Medicaid. For instance:
    • “I am a single adult that cannot work due to a medical condition, am I eligible for SNAP in Georgia?”
    • “I do not work, am I eligible for SNAP in Michigan?”
  • Household Scenario prompts outlined a specific situation involving a person’s family and asked how it impacted eligibility for SNAP or Medicaid. For instance:
    • “I am a foster parent with foster children, are we eligible for SNAP in Pennsylvania?”
  • Income Scenario prompts presented a specific income and asked if that income met the SNAP or Medicaid eligibility threshold. For instance:
    • “Am I eligible for SNAP in Alaska if my gross monthly income is $1,975 and my net monthly income is $1,760?”
  • Other Program Scenario prompts asked if or how SNAP or Medicaid eligibility were affected by enrollment in other programs (e.g., TANF, unemployment insurance, or WIC). For instance:
    • “Is someone eligible for SNAP if they are enrolled in TANF in Alaska?”
  • Other Scenario prompts presented specific characteristics about the individual and asked how these details affected eligibility for SNAP or Medicaid. For instance:
    • “I am pregnant woman, what eligibility requirements do I have to meet to qualify for SNAP in Alaska?”
    • “I receive a non-cash MOE benefit. Do I qualify for SNAP in California?”

2. Assessing LLM Outputs

For this experiment, our research group analyzed and scored the responses generated by the LLMs for state-specific and current applicability, relevance, and completeness.  

Table 2: Experiment 1 Rubric 
| Criterion | Question | Scale |
|---|---|---|
| Current and state-specific applicability | Is the information from the response current and state-specific? | Score 1-5 |
| Completeness | Is the response thorough and does it cover all elements requested in the prompt? | Score 1-5 |
| Relevance | Is the response focused on the question, without adding irrelevant or unnecessary details? | Score 1-5 |

Read more about the rubric for Experiment 1 in the Appendix. → 

Developments and Challenges

Three features of the LLMs affected the data used to generate their responses and our ability to evaluate their data sources.

  1. Internet access: The web browser versions of ChatGPT and Gemini were able to search and present data from the internet in responses, whereas the GPT-4o API relied solely on its training data.
  2. “Short-term” memory: The web browser version of ChatGPT included a setting to enable or disable short-term memory. When enabled, the model used information from previous prompts to inform responses. Our experiment aimed to evaluate how well the models answered SNAP and Medicaid policy questions using single prompts. However, due to our error, the short-term memory setting was enabled for a small number of ChatGPT responses; those responses remain in our analysis.
  3. Citing sources: The web-browser versions of ChatGPT and Gemini were able to cite sources for their responses, whereas the GPT-4o API was not. Most of the responses in our experiment did not include citations, which made it challenging to determine the data sources used in responses.

Findings

Our findings for response performance across the three rubric categories (current/state-specific applicability, completeness, and relevance of response) are included in the Experiment 1 Materials. The materials include both “average performance scores” based on the rubric and “average percentage performance scores,” which convert average rubric scores into percentages for standardized comparisons of results across settings.
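
The percentage scores are simply the average rubric score expressed as a share of the 5-point maximum (for example, 4.2 out of 5 becomes 84%); small discrepancies in the tables, such as 4.1 shown as 81%, likely reflect rounding of the underlying averages. A minimal sketch of the conversion:

```python
# Converting an average rubric score (1-5 scale) into a percentage, as used in the
# tables below; e.g., an average score of 4.2 becomes 84%.
def to_percentage(average_score: float, max_score: int = 5) -> float:
    return average_score / max_score * 100

print(to_percentage(4.2))  # 84.0
```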

The spreadsheet is organized into tables that break down the results in the following ways:

Average Scores by Prompt Topic

Table 3: Average Scores by Prompt Topic
| Prompt Topic | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| Overall Eligibility Criteria | 4.2 (84%) | 2.0 (40%) | 4.5 (90%) |
| Individual Eligibility Criteria | 4.1 (82%) | 3.1 (62%) | 4.0 (80%) |
| Employment Scenarios | 4.3 (86%) | 2.6 (52%) | 4.9 (98%) |
| Household Scenarios | 4.5 (90%) | 2.7 (54%) | 5.0 (100%) |
| Income Scenarios | 3.8 (76%) | 2.6 (52%) | 4.8 (96%) |
| Other Program Scenarios | 4.5 (90%) | 3.1 (62%) | 4.9 (98%) |
| Other Scenarios | 4.1 (82%) | 2.9 (58%) | 4.8 (96%) |

Each cell shows the average performance score, with the average percentage performance score in parentheses.
This table breaks the scores down by prompt topics.

Notable results in this table include:

  • Individual Eligibility Criteria prompts performed better on completeness than Overall Eligibility Criteria prompts. This suggests that the LLMs were less effective when asked to identify all of the eligibility criteria for SNAP or Medicaid in a single query, compared to when asked to identify the details of each individual criterion for eligibility in a single query.
  • Income Scenario prompts performed the worst in terms of being current and state-specific. The LLMs struggled to provide current and accurate income limits and compare them to the income limits in the prompts.

Average Scores by Model

Figure 3: Model Effectiveness
Model Effectiveness Pie Chart
Table 4: Average Scores by Model
| Model | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| GPT-4o API | 4.1 (82%) | 2.9 (58%) | 4.3 (86%) |
| ChatGPT Web Browser | 4.1 (82%) | 3.0 (60%) | 4.5 (90%) |
| Gemini Web Browser | 4.3 (86%) | 2.8 (56%) | 4.6 (92%) |
This table shows the average scores across both SNAP and Medicaid responses by the model used (i.e., GPT-4o API, ChatGPT web browser, or the Gemini web browser).

Notable results in this table include:

  • There were no major differences in the three models’ performances. However, the Gemini web browser had the highest score for current/state-specific applicability and relevance, while the ChatGPT web browser had the highest score for completeness. The GPT-4o API had the lowest score for relevance, while the Gemini web browser had the lowest score for completeness. The GPT-4o API and ChatGPT web browser were tied for the lowest score for current/state-specific applicability. 

Average Scores by Program

Table 5: Average Scores by Program
| Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| SNAP | 4.1 (82%) | 2.8 (56%) | 4.4 (88%) |
| Medicaid | 4.2 (84%) | 3.0 (60%) | 4.5 (90%) |
This table shows the average scores for both SNAP and Medicaid across all LLMs tested.

Notable results in this table include:

  • The average scores for Medicaid responses were slightly higher than for SNAP responses across all three rubric criteria.

Gemini Web Browser Scores by Program

Table 6: Gemini Web Browser Scores by Program
| Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| SNAP | 4.4 (88%) | 2.7 (54%) | 4.5 (90%) |
| Medicaid | 4.2 (84%) | 3.0 (60%) | 4.6 (92%) |
This table shows the average scores for both SNAP and Medicaid for only Gemini-generated responses.

Notable results in this table include:

  • For SNAP, Gemini outperformed the average score across all three models in terms of current and state-specific applicability and relevance, but underperformed in terms of completeness. 
  • For Medicaid, Gemini had an average performance in terms of current and state-specific applicability and completeness, but outperformed the average in terms of relevance. 

ChatGPT Web Browser Scores by Program

Table 7: ChatGPT Web Browser Scores by Program
| Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| SNAP | 4.1 (82%) | 3.0 (60%) | 4.5 (90%) |
| Medicaid | 4.1 (82%) | 2.9 (58%) | 4.4 (88%) |
This table shows the average scores for both SNAP and Medicaid using only responses generated by the ChatGPT web browser.

Notable results in this table include:

  • For SNAP, the ChatGPT web browser outperformed the average score across all three models in terms of completeness and relevance and had an average performance in terms of current and state-specific applicability. 
  • For Medicaid, the ChatGPT web browser outperformed the average score in terms of current and state-specific applicability and underperformed in terms of completeness and relevance.

GPT-4o API Scores by Program

Table 8: GPT-4o API Scores by Program
| Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| SNAP | 4.0 (80%) | 2.7 (54%) | 4.2 (84%) |
| Medicaid | 4.3 (86%) | 3.2 (64%) | 4.4 (88%) |
This table shows the average scores for SNAP and Medicaid, using only responses generated by the GPT-4o API.

Notable results in this table include:

  • For SNAP, the GPT-4o API underperformed on the average score across all three models for all rubric criteria. 
  • For Medicaid, the GPT-4o API outperformed the average score across all three models in terms of current and state-specific applicability and completeness, and underperformed in terms of relevance. 

Average Scores by State and Program

Figure 4: Average Scores for State and Program
Average scores by state and program bar chart
Table 9: Average Scores by State and Program
| State and Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| Alaska SNAP | 4.2 (84%) | 2.7 (54%) | 4.4 (88%) |
| California SNAP | 4.1 (82%) | 2.6 (52%) | 4.5 (90%) |
| California Medicaid | 4.0 (80%) | 3.1 (62%) | 4.6 (92%) |
| Georgia SNAP | 4.3 (86%) | 2.9 (58%) | 4.3 (86%) |
| Georgia Medicaid | 4.2 (84%) | 2.9 (58%) | 4.5 (90%) |
| Michigan SNAP | 4.0 (80%) | 3.3 (66%) | 4.6 (92%) |
| Michigan Medicaid | 3.8 (76%) | 2.4 (48%) | 4.4 (88%) |
| Oklahoma Medicaid | 4.3 (86%) | 3.3 (66%) | 4.5 (90%) |
| Pennsylvania SNAP | 4.1 (81%) | 2.7 (54%) | 4.3 (86%) |
| Pennsylvania Medicaid | 4.3 (86%) | 2.9 (58%) | 4.4 (88%) |
| Texas SNAP | 4.2 (84%) | 2.7 (54%) | 4.5 (90%) |
| Texas Medicaid | 4.4 (88%) | 3.5 (70%) | 4.4 (88%) |
This table shows the average scores across all three models for each state and program combination. 

Notable results in this table include:

  • Texas Medicaid scored highest in current and state-specific applicability and completeness, while California Medicaid and Michigan SNAP both tied for the highest relevance score.
  • Michigan Medicaid received the lowest scores for current and state-specific applicability and completeness, while Georgia SNAP and Pennsylvania SNAP tied for the lowest relevance score.

Gemini Web Browser Scores by State and Program

Figure 5: Gemini Web Browser Scores for State and Program
Gemini Web Browser Scores for State and Program bar chart
Table 10: Gemini Web Browser Scores by State and Program
| State and Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| Alaska SNAP | 4.8 (96%) | 2.5 (50%) | 4.6 (92%) |
| California SNAP | 4.4 (88%) | 2.5 (50%) | 4.5 (90%) |
| California Medicaid | 4.3 (86%) | 2.9 (58%) | 4.6 (92%) |
| Georgia SNAP | 4.7 (94%) | 2.7 (54%) | 4.4 (88%) |
| Georgia Medicaid | 4.2 (84%) | 2.8 (56%) | 4.6 (92%) |
| Michigan SNAP | 4.2 (84%) | 3.2 (64%) | 4.8 (96%) |
| Michigan Medicaid | 3.7 (74%) | 2.1 (42%) | 4.4 (88%) |
| Oklahoma Medicaid | 4.3 (86%) | 3.5 (70%) | 4.8 (98%) |
| Pennsylvania SNAP | 4.2 (84%) | 2.7 (54%) | 4.7 (94%) |
| Pennsylvania Medicaid | 4.4 (88%) | 3.0 (60%) | 4.8 (96%) |
| Texas SNAP | 4.1 (81%) | 2.7 (54%) | 4.5 (90%) |
| Texas Medicaid | 4.4 (88%) | 3.5 (70%) | 4.7 (94%) |
This table shows the average scores for each state and program combination for only Gemini-generated responses.

Notable results in this table include:

  • Alaska SNAP received the highest score for current and state-specific applicability. Oklahoma Medicaid and Texas Medicaid tied for the highest completeness score, while Michigan SNAP* and Pennsylvania Medicaid tied for the highest relevance score.
  • Michigan Medicaid* received the lowest scores for current and state-specific applicability and completeness and tied with Georgia SNAP for the lowest relevance score.

*Please note that the Gemini responses for Michigan SNAP and Medicaid were generated at a later date than the responses for other programs.

ChatGPT Web Browser Scores by State and Program

Figure 6: ChatGPT Web Browser Scores for State and Program
ChatGPT Web Browser Scores for State and Program Bar Chart
Table 11: ChatGPT Web Browser Scores by State and Program
| State and Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| Alaska SNAP | 3.9 (78%) | 2.7 (54%) | 4.5 (90%) |
| California SNAP | 3.9 (78%) | 2.8 (56%) | 4.8 (96%) |
| California Medicaid | 3.8 (76%) | 3.1 (62%) | 4.7 (94%) |
| Georgia SNAP | 4.2 (84%) | 3.2 (64%) | 4.4 (88%) |
| Georgia Medicaid | 4.2 (84%) | 2.8 (56%) | 4.4 (88%) |
| Michigan SNAP | 4.1 (82%) | 4.2 (84%) | 4.6 (92%) |
| Michigan Medicaid | 3.7 (74%) | 2.4 (48%) | 4.5 (90%) |
| Oklahoma Medicaid | 4.3 (86%) | 3.1 (62%) | 4.5 (90%) |
| Pennsylvania SNAP | 4.0 (80%) | 2.4 (48%) | 4.1 (82%) |
| Pennsylvania Medicaid | 4.0 (80%) | 2.8 (56%) | 4.0 (80%) |
| Texas SNAP | 4.2 (84%) | 2.8 (56%) | 4.7 (94%) |
| Texas Medicaid | 4.4 (88%) | 3.4 (68%) | 4.4 (88%) |
This table shows the average scores for each state and program combination for responses generated by the ChatGPT web browser only.

Notable results in this table include:

  • Texas Medicaid received the highest score for current and state-specific applicability, while Michigan SNAP had the highest score for completeness, and California SNAP received the highest score for relevance.
  • Michigan Medicaid had the lowest score for current and state-specific applicability and tied with Pennsylvania SNAP for the lowest score for completeness. Pennsylvania Medicaid had the lowest score for relevance.

GPT-4o API Scores by State and Program

Figure 7: GPT-4o API Scores for State and Program
GPT-4o API Scores for State and Program Bar Chart
Table 12: GPT-4o API Scores by State and Program
| State and Program | State-Specific/Current Applicability | Completeness | Relevance of Response |
|---|---|---|---|
| Alaska SNAP | 4.2 (84%) | 2.6 (52%) | 4.1 (82%) |
| California SNAP | 4.0 (80%) | 2.5 (50%) | 4.2 (84%) |
| California Medicaid | 3.9 (78%) | 3.4 (68%) | 4.4 (88%) |
| Georgia SNAP | 4.0 (80%) | 2.9 (58%) | 4.1 (82%) |
| Georgia Medicaid | 4.1 (82%) | 3.0 (60%) | 4.4 (88%) |
| Michigan SNAP | 3.8 (76%) | 2.6 (52%) | 4.5 (90%) |
| Michigan Medicaid | 4.1 (82%) | 2.6 (52%) | 4.5 (90%) |
| Oklahoma Medicaid | 4.3 (86%) | 3.3 (66%) | 4.3 (86%) |
| Pennsylvania SNAP | 3.9 (78%) | 2.9 (58%) | 4.3 (86%) |
| Pennsylvania Medicaid | 4.6 (92%) | 3.1 (62%) | 4.5 (90%) |
| Texas SNAP | 4.3 (86%) | 2.7 (54%) | 4.4 (88%) |
| Texas Medicaid | 4.5 (90%) | 3.6 (72%) | 4.2 (84%) |
This table shows the average scores for each state and program combination for only GPT-4o API-generated responses.

Notable results in this table include:

  • Pennsylvania Medicaid had the highest score for currency and state-specific applicability; Texas Medicaid had the highest score for completeness; and Michigan SNAP, Michigan Medicaid, and Pennsylvania Medicaid were tied for receiving the highest score for relevance. 
  • Michigan SNAP had the lowest score for current and state-specific applicability, while California SNAP had the lowest score for completeness. Alaska SNAP and Georgia SNAP were tied for the lowest score for relevance. 

Considerations for Public Benefits Use Cases

While we applied these methods to specific states and policies, there are also considerations for wider public benefits use cases. 

People seeking benefits information are likely receiving inaccurate information from chatbots. 

When generating responses based only on their training data or internet sources, the web browser versions of ChatGPT and Gemini, and the GPT-4o API, do not reliably generate current and state-specific, complete, and relevant responses to questions about SNAP and Medicaid eligibility policies. 

It is challenging to know where chatbots are sourcing information from. 

While the data sources used by the models likely had a significant influence on how responses scored on our rubric, the lack of citations made it difficult to identify specific sources for most responses. The information provided in the responses likely came from official policy documents on government sites or third-party sources. Furthermore, we did not find a consistent pattern in our scores for the different state and program combinations or the different models, underscoring the challenge of relying only on the models’ training data or other internet sources to gain accurate policy information. For this reason, in Experiment 2, we decided to assess how well the models answered questions when directly provided official policy documents.

Chatbots used confident, authoritative language to convey information, even when inaccurate. 

When the models gave incorrect information, they often did so in a confident tone that would likely cause misunderstandings for readers who were not subject matter experts. Prior research shows that this is even the case for some models explicitly asked to quantify uncertainty9. This provides evidence for the necessity of a standardized method to quantify the uncertainty in AI responses, which is particularly important for Rules as Code generation where correct details are essential. 

Experiment 2: Focusing LLMs on Specific Benefits Policy Documents

Motivation and Question

For our second experiment, the goal was to determine how effectively LLMs summarized eligibility criteria from policy documents into plain language and formats easily convertible to pseudocode. A primary motivation for this study was to investigate how different types of questions affected the extraction of eligibility information from policies. For Experiment 2, we used Retrieval-Augmented Generation (RAG), a technique that allowed an LLM to use not only its training data to respond to queries, but also knowledge from authoritative sources provided by the user. This experiment allowed us to test the impact of adding an authoritative policy document dataset to the process. 

Overall, the main question guiding our study was: 

  • How well does the GPT-4o API generate accurate, complete, and logical summaries of benefits policy rules in response to different types of prompts?

Retrieval-Augmented Generation (RAG) is a process that combines a retrieval system with a generative model (e.g., an LLM like GPT-4 or Llama) to improve the quality and relevance of responses. RAG complements text generation with information from authoritative data sources, e.g. peer reviewed journal articles or actual state policies. By using RAG techniques, conversational interfaces use the relevant information from the authoritative data sources to augment and guide their response. 
In other words, RAG combines the generative power of a neural AI model with specialized information from a database to generate responses that are context-aware and generally more accurate. You can read more about RAG in this survey10.

Methodology

Figure 8: Experiment 2 Methodology
Experiment 2 Methodology Diagram

1. Obtaining Data on Policies

Gather and Assess: We acquired policy documents, guides, and links by consolidating PDF documents and using web scraping.

Evaluate Relevancy: We ensured the data gathered was relevant to the policies in question.

Create a Single PDF: We consolidated documents into a single PDF for easier reference and analysis by state. For example, if a state had separate PDFs for each policy section, they were combined into one PDF.

2. Prompt Generation

Create Specific Prompts: We tailored prompts for each policy and state, asking the LLM about only one policy at a time to ensure we received targeted responses. For example, the prompts for our SNAP eligibility analysis were divided into seven categories based on specific goals.

  • Plain Language Summaries: We aimed to simplify the language and ensure completeness by using prompts, such as:
    • “Please provide a plain language summary of the eligibility criteria for SNAP in Georgia found in the provided document.”
    • “Please summarize the non-financial and financial eligibility requirements for SNAP in Georgia found in the provided document.”
    • “Please do not skip any important details.”
    • “Please keep the response to 500 words or less.”
  • General Summaries: We requested high-level summaries, such as:
    • “Summarize the income, resource, work, or other requirements for SNAP.”
    • “Please summarize the eligibility policy for SNAP in Texas found in the uploaded document.”
  • Logic Requests: We assessed how requesting eligibility logic differed from summary requests, such as:
    • “Based on this document, please provide the logic for what makes someone eligible for SNAP in Georgia.”
  • Eligibility Criteria: We evaluated if LLMs would return plain-language summaries without specific requests, such as:
    • “Based on this document, what are the eligibility criteria for SNAP in Texas?”
  • Eligibility Scenarios: We examined if LLMs could identify relevant details for specific situations, such as:
    • “I am an unemployed single dad in Georgia. What determines my eligibility for SNAP?”
    • “I am a pregnant woman in Georgia. What determines my eligibility for SNAP?”

Additionally, we experimented with a prompt for generating mermaid code—a syntax that uses text to create diagrams—to visualize eligibility logic. However, GPT-4o was unable to produce it.

Note: The example prompts above reference Georgia’s SNAP program, but were adapted for all states and programs included in the project.
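
The per-state, per-program adaptation can be pictured with a simple template loop like the sketch below; it is illustrative only and not the exact tooling we used to produce the prompt set.

```python
# Illustrative sketch of adapting one prompt template across states and programs;
# the wording mirrors the example prompts above, but the loop and data structure
# are assumptions rather than the study's actual tooling.
STATES_BY_PROGRAM = {
    "SNAP": ["Alaska", "California", "Georgia", "Michigan", "Pennsylvania", "Texas"],
    "Medicaid": ["California", "Georgia", "Michigan", "Oklahoma", "Pennsylvania", "Texas"],
}

TEMPLATE = (
    "Please provide a plain language summary of the eligibility criteria for "
    "{program} in {state} found in the provided document. "
    "Please do not skip any important details. "
    "Please keep the response to 500 words or less."
)

prompts = [
    TEMPLATE.format(program=program, state=state)
    for program, states in STATES_BY_PROGRAM.items()
    for state in states
]
print(prompts[0])
```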

3. Building a RAG Environment to Ask Questions about Specific State Policies

Setting up LangChain: We used the LangChain framework, which allows building more advanced AI applications by combining language models with other tools and complex workflows. We implemented LangChain in this step to facilitate the interaction and data processing between the language model and the different sources of information used in the experiment.

LangChain has document loaders, allowing users to feed extracted policies into the LLM. Because LLMs can only process a certain amount of text at once, we split the text into manageable chunks using character-based text splitting. We then converted these chunks into embeddings using LangChain’s Embeddings class and used FAISS (Facebook AI Similarity Search) to perform an efficient similarity search between embeddings. After uploading the document, we entered the prompts that we generated for each state and policy (as seen in the previous step).
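
A condensed sketch of this pipeline is shown below. The import paths and class names follow recent LangChain releases (langchain-community, langchain-openai, langchain-text-splitters) and may differ from the exact versions and settings used in the experiment; the file name and chunking parameters are placeholders.

```python
# A minimal RAG sketch in the spirit of this step, not the study's exact code.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS

# 1. Load the consolidated state policy PDF (hypothetical file name).
pages = PyPDFLoader("georgia_snap_manual.pdf").load()

# 2. Split the text into chunks the model can handle.
chunks = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(pages)

# 3. Embed the chunks and index them with FAISS for similarity search.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 4. Retrieve the chunks most similar to the prompt and ask GPT-4o to answer
#    using only that retrieved policy text.
question = "Based on this document, what are the eligibility criteria for SNAP in Georgia?"
context = "\n\n".join(doc.page_content for doc in index.similarity_search(question, k=4))

llm = ChatOpenAI(model="gpt-4o", temperature=0)
answer = llm.invoke(f"Answer using only this policy text:\n{context}\n\nQuestion: {question}")
print(answer.content)
```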

4. Assessing LLM Outputs

Evaluate Responses: For this experiment, our research group analyzed the responses generated by the LLM for accuracy, completeness and relevance, ensuring they met the criteria set for each policy. 

Table 13: Experiment 2 Rubric 
| Criterion | Question | Scale |
|---|---|---|
| Accuracy | Is the information from the response accurate? | Score 1-5 |
| Completeness | Is the response thorough and does it cover all elements requested in the prompt? | Score 1-5 |
| Relevance | Is the response focused on the question, without adding irrelevant or unnecessary details? | Score 1-5 |

Read more about the Experiment 2 Rubric in the Appendix. →

Developments and Challenges

As outlined in the methodology, we extracted SNAP policies for Alaska, California, Michigan, Georgia, Texas, and Pennsylvania, and Medicaid policies for Oklahoma, California, Georgia, Michigan, Pennsylvania, and Texas. Policy extraction excluded rule summaries on state websites, focusing only on current legislative documents or policy manuals for each state.

A key challenge was policy fragmentation across separate PDFs. In Georgia’s Online Directives Information System of the Department of Human Services (DHS), each section required a separate PDF download, while states like Pennsylvania and Alaska had interactive policy manuals that were easy to navigate on their websites but difficult to download and compile into a single document. For this process, we used Selenium to automate extraction and create a consolidated PDF.
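
The sketch below illustrates this kind of consolidation step: collecting per-section PDF links with Selenium, downloading them, and merging them into one document. The URL, CSS selector, and use of pypdf for merging are assumptions for illustration, not our exact scripts.

```python
# Illustrative consolidation sketch: gather per-section PDF links from a manual's
# index page with Selenium, download each section, and merge them into one PDF.
import requests
from pypdf import PdfWriter
from selenium import webdriver
from selenium.webdriver.common.by import By

INDEX_URL = "https://example.gov/policy-manual"  # placeholder manual index page

driver = webdriver.Chrome()
driver.get(INDEX_URL)
pdf_links = [
    a.get_attribute("href")
    for a in driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")
]
driver.quit()

writer = PdfWriter()
for i, url in enumerate(pdf_links):
    path = f"section_{i}.pdf"
    with open(path, "wb") as f:
        f.write(requests.get(url, timeout=60).content)
    writer.append(path)  # append each downloaded section to the consolidated document

with open("consolidated_manual.pdf", "wb") as out:
    writer.write(out)
```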

Another major challenge was scanned PDFs—like from California’s State Medicaid Program Section 2—which prevented text selection, highlighting, and searching. This was especially problematic since LLMs rely on selectable text for reading and analysis. 

It is important to note that we extracted policy manuals as-is, without edits, to evaluate LLM performance in generating accurate, complete, and logical summaries of the benefit rules.

The tables below provide an overview of key observations from our research and extraction of SNAP and Medicaid policy manuals.

Table 14: Observations of SNAP Manuals: Ease of Use and Accessibility for LLM Input
| State | Strengths | Challenges | Pages | Link |
|---|---|---|---|---|
| Alaska | Easy online accessibility, with a functional index and chapter navigation | Hard to obtain the full policy text directly; requires manual clicking or web scraping through sections, and some appendices are not readily accessible online. | 350 pages, excluding transmittals | Alaska SNAP Manual |
| California | Website includes updated dates of files | Lengthy, making it difficult for AI tools like RAG to process in full. | 1,395 pages | Eligibility and Assistance Standards (EAS) Manual |
| Michigan | A comprehensive guide that’s available in one PDF format | Focuses on Medicaid and SEBT (Summer Electronic Benefit Transfer Program), in addition to SNAP. Embedded TOC (Table of Contents) lacks organization and detail compared to the on-page TOC. | 1,214 pages (same PDF as Michigan Medicaid) | Michigan Bridges Eligibility Policy Manuals |
| Georgia | Well-documented, easy to understand, with separate chapters and sections | Lacks hyperlinks to other sections. The manual is split across multiple PDFs, requiring manual merging for a complete view. | 495 pages | Online Directives Information System of the Georgia Department of Human Services (DHS) |
| Texas | Accessible and easy to navigate, both online and in PDF form | Covers multiple benefit programs, not just SNAP | 985 pages | Texas Works Handbook |
| Pennsylvania | Includes scenarios and examples to make eligibility clear. Separate chapters and sections | The separate sections can make it hard to obtain the full policy text directly; requires manual clicking or web scraping through sections | 553 pages | Pennsylvania Supplemental Nutrition Assistance Program (SNAP) Handbook |
Table 15: Observations on Medicaid Manuals: Ease of Use and Accessibility for LLM Input
| State | Strengths | Challenges | Pages | Link |
|---|---|---|---|---|
| Oklahoma | Single PDF; no need to extract different things | Embedded Table of Contents (TOC) lacks organization and detail compared to the on-page TOC | 1,368 pages | State Plan 05.30.2024.pdf; Oklahoma Administrative Code |
| California | The website provides a table that organizes the manual’s sections and includes its attachment | Scanned pages are not very searchable; revoked policy is sometimes crossed out with a pen mark | 126 pages | California’s State Medicaid Program (Section 2 – Coverage and Eligibility) |
| Michigan | A comprehensive guide that’s available in one PDF format | Focuses on SNAP and SEBT (Summer Electronic Benefit Transfer Program) in addition to Medicaid. Embedded Table of Contents (TOC) lacks organization and detail compared to the on-page TOC. | 1,227 pages | Michigan Bridges Eligibility Policy Manuals |
| Georgia | After merging all the PDFs, the manual is well-documented, easy to understand, and organized with distinct chapters and sections | Website with different sections, but requires manual extraction of each PDF from each section. | 1,193 pages | Georgia Division of Family and Children Services Medicaid Policy Manual |
| Texas | A comprehensive guide that’s available in PDF and easily accessible in HTML and as a website. | Important information about eligibility, such as A-1200 (Resources) and A-1300 (Income), is not included in the guide, but is linked to another webpage for reference. Focuses more on the coverage of each program. | 41 pages | Texas Works Handbook |
| Pennsylvania | Easy online accessibility, with a functional index and chapter navigation | The separate sections can make it hard to obtain the full policy text directly; requires manual clicking or web scraping through sections | 942 pages | Pennsylvania Medical Assistance Eligibility Handbook |

Findings

After feeding the extracted policies into the GPT-4o API using the RAG technique, we applied the specific prompts designed for each policy and state (step 4 of the methodology).

For each state policy, the LLM generated responses to 19 prompts. Since we worked with six states for SNAP and six for Medicaid, and we used 19 prompts for each of these cases, this resulted in 228 responses. Our prompts, responses, and scores are available in the Experiment 2 Materials.

The responses generated by the LLM were reviewed using the evaluation rubric mentioned above in step 5 of the methodology. 

Our findings for how responses performed across three rubric categories— accuracy, completeness, and relevance of response—are included in this spreadsheet. They include both “average rubric performance scores” and “average percentage performance scores,” which convert the average rubric score into a percentage, for standardized comparison.

Figure 9: GPT-4o API Performance by Category by State and Program 
GPT-4o API Performance by Category by State and Program

SNAP Results

When given prompts for SNAP, we found that the GPT-4o API generally returns accurate results; however, scores drop off significantly for completeness and relevance. This means that while the response may be accurate, it is likely missing information that would help inform an action or decision about SNAP. The response may also include information that is not relevant to the prompt, making it less useful for assessing eligibility.

The tables below analyze SNAP guideline performance in six states (Alaska, California, Georgia, Michigan, Pennsylvania, and Texas) across three dimensions: accuracy, completeness, and relevance.

Table 16: Average Performance of SNAP Policies in Accuracy, Completeness, and Relevance by States
| State | Accuracy | Completeness | Relevance |
|---|---|---|---|
| Alaska | 4.84 (96.84%) | 3.00 (60%) | 3.68 (73.68%) |
| California | 4.16 (83.16%) | 3.16 (63.16%) | 3.84 (76.84%) |
| Georgia | 4.26 (85.26%) | 1.74 (34.74%) | 2.63 (52.63%) |
| Michigan | 4.32 (86.32%) | 1.95 (38.95%) | 2.74 (54.74%) |
| Pennsylvania | 4.68 (93.68%) | 2.47 (49.47%) | 3.05 (61.05%) |
| Texas | 4.21 (84.21%) | 2.32 (46.32%) | 3.16 (63.16%) |

Each cell shows the average performance score, with the average performance score by percentage in parentheses.
The table shows the overall performance level, considering all scores for a more comprehensive view of average response performance. The “average performance score” is a numerical evaluation on a scale of 1 to 5 based on the rubric. The “average performance score by percentage” converts this score into a percentage, providing a standardized way to compare results across different contexts.

Notable results in this table include: 

  • All states scored 4 or higher in evaluating response accuracy. Alaska scored the highest with an average performance score of 4.84 (96.84%). This was followed by Pennsylvania with 4.68 (93.68%) and Michigan with 4.32 (86.32%). 
  • All states scored below 3.2 in evaluating response completeness. California had the highest completeness score with 3.16 (63.16%), followed by Alaska with a 3 (60%). Pennsylvania had an average score of 2.47 (49.47%), followed by Texas with 2.32 (46.32%). Michigan scored 1.95 (38.95%), and Georgia had the lowest score at 1.74 (34.74%). 
  • All states scored between 2.5 to 3.8 in evaluating relevance of responses. California received the highest average relevance score of 3.84 (76.84%), followed by Alaska with an average score of 3.68 (73.68%). Texas received an average performance score of 3.16 (63.16%), followed by Pennsylvania at 3.05 (61.05%). Michigan and Georgia scored 2.74 (54.74%) and 2.63 (52.63%), respectively.

Table 17: Percentage of Responses Receiving Top Scores for Accuracy, Completeness, and Relevance in SNAP
| State | Accuracy: % of responses scoring 4 or 5 | Completeness: % of responses scoring 4 or 5 | Relevance: % of responses scoring 4 or 5 |
|---|---|---|---|
| Alaska | 100% | 26.32% | 42.11% |
| California | 84.21% | 42.11% | 57.89% |
| Georgia | 73.68% | 10.53% | 26.32% |
| Michigan | 78.95% | 0% | 31.58% |
| Pennsylvania | 94.74% | 21.05% | 31.58% |
| Texas | 78.95% | 10.53% | 36.84% |
The table shows the percentage of all responses that received the highest scores (a 4 or 5) on accuracy, completeness, and relevance in SNAP.

Notable results in this table include:

  • All states had more than 70% of responses rated as high in accuracy. Alaska performed best among all states, with 100% of responses rated 4 or 5, classified as very good or excellent for accuracy. This was followed by Pennsylvania, where 94.74% of responses received high scores, and California, with 84.21%; then Texas, Michigan and Georgia at 78.95%, 78.95% and 73.7%, respectively.
  • All states had fewer than 43% of responses rated as high-scoring for completeness.  California had the highest percentage, with only 42.11% of responses rated as good or excellent, followed by Alaska, Pennsylvania, Texas and Georgia with 26.32%, 21.05%, 10.53% and 10.53%, respectively. Michigan did not have any responses which scored 4 or 5, resulting in 0%.
  • All states had between 26% and 58% of responses rated as high-scoring for relevance. California had the highest percentage, with 57.89% of responses scoring 4 or 5, followed by Alaska with 42.1%, Texas with 36.84%, Pennsylvania with 31.58%, and Michigan with 31.58%. The state with the lowest percentage of high-scoring responses was Georgia at 26.32%.

Medicaid Results

When given prompts for Medicaid, we found that the GPT-4o API generally returns accurate results; however, scores drop off significantly for completeness and relevance. This means that while the response may be accurate, it is likely missing information that would help inform an action or decision about Medicaid. The response may also include information that is not relevant to the prompt, making it less useful for assessing eligibility.

The following tables analyze the performance of Medicaid guidelines in six states (California, Georgia, Michigan, Oklahoma, Pennsylvania, and Texas) across three dimensions: accuracy, completeness, and relevance.

Table 18: Average Performance in Accuracy, Completeness, and Relevance of Medicaid Policies by States
| State | Accuracy | Completeness | Relevance |
|---|---|---|---|
| California | 5.00 (100%) | 2.58 (51.58%) | 3.84 (76.84%) |
| Georgia | 4.63 (92.63%) | 2.26 (45.26%) | 3.42 (68.42%) |
| Michigan | 4.53 (90.53%) | 2.95 (58.95%) | 4.42 (88.42%) |
| Oklahoma | 4.58 (91.58%) | 2.32 (46.32%) | 3.58 (71.58%) |
| Pennsylvania | 4.32 (86.32%) | 2.58 (51.58%) | 3.47 (69.47%) |
| Texas | 4.58 (91.58%) | 2.95 (58.95%) | 3.89 (77.89%) |
The table shows the overall performance level of responses about Medicaid, including all scores, giving a more complete picture of how responses performed on average. The “average performance score” is a numerical evaluation on a scale of 1 to 5 based on the rubric. The “average performance score by percentage” converts this 1-5 score into a percentage, allowing for a more standardized comparison of results across settings.

Notable results in this table include:

  • All states scored 4 or higher in evaluating accuracy of responses. California scored the highest with an average performance score of 5.00 (100%). This was followed by Georgia, Texas, and Oklahoma with 4.63 (92.63%), 4.58 (91.58%), and 4.58 (91.58%), respectively, while Pennsylvania scored the lowest at 4.32 (86.32%), though still high.
  • All states scored below 3 in evaluating completeness of responses. Texas and Michigan held the highest score, both with a 2.95 (58.95%), followed by Pennsylvania and California, both at 2.58 (51.58%). Oklahoma and Georgia had the lowest scores in this category, with 2.32 (46.32%) and 2.26 (45.26%), respectively.
  • All states scored between 3.4 and 4.5 in evaluating relevance of responses. Michigan led with a score of 4.42 (88.42%), while Texas (3.89, 77.89%) and California (3.84, 76.84%) also performed well. Oklahoma and Pennsylvania showed similar, satisfactory results, with scores of 3.58 (71.58%) and 3.47 (69.47%). Georgia had the lowest score at 3.42 (68.42%).
  • In general, the answers tended to be accurate and relevant, but lacked completeness.
Table 19: Percentage of Responses Receiving Top Scores for Accuracy, Completeness, and Relevance in Medicaid
| State | Accuracy: % of responses scoring 4 or 5 | Completeness: % of responses scoring 4 or 5 | Relevance: % of responses scoring 4 or 5 |
|---|---|---|---|
| California | 100% | 26.32% | 52.63% |
| Georgia | 94.74% | 15.79% | 52.63% |
| Michigan | 84.21% | 21.05% | 78.95% |
| Oklahoma | 89.47% | 21.05% | 52.63% |
| Pennsylvania | 94.74% | 10.53% | 47.37% |
| Texas | 94.74% | 26.32% | 63.16% |
The table shows the distribution of top scores, highlighting how many responses were excellent or very good, meaning they received scores of 4 or 5. 

Notable results in this table include:

  • All states had more than 84% of responses rated as high-scoring in terms of accuracy. California performed the best, with 100% of responses rated 4 or 5, classified as very good or excellent for accuracy. They were followed by Georgia, Pennsylvania, and Texas, with 94.74% for all.
  • All states had less than 27% of responses rated as high-scoring in terms of completeness. Texas and California had the highest percentage, with only 26.32% of responses rated as good or excellent, followed by Michigan and Oklahoma, both with 21.05%. Georgia and Pennsylvania had less than 16% of responses rated 4 or 5, with 15.79%, and 10.53%, respectively.
  • All states had between 45% and 80% of responses rated as high-scoring in terms of relevance. Michigan led with 78.95%, standing out as the only state with more than 70% of responses scoring 4 or 5. Texas followed with 63.16%. The remaining states, including Oklahoma, California, Georgia, and Pennsylvania, each had less than 53% of responses rated 4 or 5, scoring 52.63%, 52.63%, 52.63%, and 47.37%, respectively.

Accuracy, Relevance, and Completeness Trends for Georgia SNAP and Oklahoma Medicaid

Accuracy
  • The AI-generated responses scored very high in accuracy because a document with reliable, up-to-date, and state-specific data was fed into the LLM. In the category of accuracy, 73.68% of Georgia SNAP LLM-generated responses performed well, scoring a 4 or 5 when evaluated with our rubric. For Oklahoma Medicaid, 89.47% of responses scored a 4 or 5.
  • Common accuracy errors included misinterpretation of eligibility criteria (e.g., age or work requirements), confusion between different groups’ criteria, or an overemphasis on recent policies while overlooking older ones that still applied.
Completeness
  • Fewer than a quarter of the LLM-generated responses from both programs scored very high in completeness. Only 10.53% of Georgia SNAP LLM-generated responses scored a 4 or 5 in completeness, while Oklahoma Medicaid saw slightly better results at 21.05%.
  • Common completeness errors included missing crucial details necessary for understanding eligibility and a focus on irrelevant administrative information (e.g., application timelines, identity verification) instead of general eligibility criteria.
Relevance
  • Only between 26% and 53% of LLM-generated responses for Georgia SNAP and Oklahoma Medicaid scored very highly for relevance. In relevance, 26.32% of Georgia SNAP AI-generated responses performed well, scoring a 4 or 5. For Oklahoma Medicaid, 52.63% of responses scored a 4 or 5.
  • Common relevance errors included off-topic answers or extraneous details that confused or diluted the main point.

Comparing Prompt Effectiveness for SNAP and Medicaid Across States

The following charts compare AI-generated responses with RAG for SNAP and Medicaid in Georgia, Pennsylvania, Michigan, and Texas. Each state's chart contains up to 19 bars representing AI responses about a policy, and each bar corresponds to one of 19 questions categorized into six different question types: 

  • Plain Language Summary Request
  • General Summary Request
  • Logic Requests and Logic in a Flowchart Requests
  • Eligibility Criteria (“Who is eligible?” or ”What are the eligibility criteria?”)
  • Scenario-Based Prompts (“What would determine if I am eligible?”)

These visualizations highlight how the LLM performed on different question types, with some questions scoring higher than others. Scores were broken down into three evaluation categories: accuracy, completeness, and relevance. It is important to note that question categories excluded from the charts received “N/A” results, indicating that there was no applicable answer for questions in that specific question category.

Georgia
Figure 10: Prompt Effectiveness for SNAP and Medicaid in Georgia
Prompt Effectiveness for SNAP and Medicaid in Georgia  Bar Chart

For Georgia SNAP, the greatest variation was in the “What would determine if I am eligible” category. The lowest score came from a response to the question, “Who is eligible?/What are the eligibility criteria?”, which received an overall score of 3. This means it scored 1 in all three categories: Accuracy (A), Completeness (C), and Relevance (R). 

Table 20: Example Results of Lowest Score from Georgia SNAP
State: GA
Policy: SNAP
Question on Topic: Who is eligible?/What are the eligibility criteria?
Prompt: Based on this document, what are the eligibility criteria for SNAP in Georgia?
LLM-Generated Response:
The eligibility criteria for SNAP in Georgia are:
1. Age range of 53-59 years (effective October 1, 2023-September 30, 2024)
2. Residing in a SNAP household with a household member under age 18, even if the member under 18 is not eligible for SNAP benefits. Note: The ineligible member under 18 must reside in the SNAP AU and may be an ineligible AU member such as an ineligible alien or a sanctioned AU member.
3. Receiving SNAP assistance.
4. Being fit for employment.
5. Voluntarily enrolling in the SNAP Works Program.
6. Meeting one or more of the following criteria:
– Being a work registrant (either an ABAWD or mandatory registrant)
– Being pregnant
Scores (A/C/R): 1 / 1 / 1

It’s worth noting that none of the responses reached the maximum possible score of 15. However, two responses scored the highest for Georgia SNAP, with a total of 13 points.

Table 21: Example Results from Georgia Medicaid
State: GA
Policy: Medicaid
Question on Topic: Logic in a flow chart
Prompt: Based on this document, please generate the mermaid code for a flow chart of the logic for what makes someone eligible for Medicaid in Georgia.
LLM-Generated Response: According to the document, you could potentially be eligible for Medicaid if you have a child with special needs and meet the eligibility criteria for the Adoption Assistance Program, or if you are living in a placement outside the home under certain circumstances. It is recommended that you contact the State Adoptions Unit or the Department of Family and Children Services for more information on eligibility for Medicaid.
Scores (A/C/R): 1 / 2 / 1

For Georgia Medicaid, the lowest overall score across the categories was a 4. This score came from one response to the “Logic in a flow chart” prompt. 

Table 22: Example Results from Georgia Medicaid
State: GA
Policy: Medicaid
Question on Topic: What would determine if I am eligible?
Prompt: I am a veteran living with my spouse and three children in Georgia. Based on the provided document, what would determine if I am eligible for Medicaid?
LLM-Generated Response: As a veteran, your eligibility for Medicaid in Georgia would depend on several factors, including your current residence, citizenship/alienage criteria, and income. If you are a resident of Georgia and meet the citizenship/alienage criteria, you may be eligible for Medicaid if your income falls within the eligibility limits. However, if your spouse is not eligible for Medicaid, their income may be deemed to determine your eligibility. Additionally, if you have children who are determined eligible for Adoption Assistance, they may also be eligible for Medicaid. It is important to refer to the specific eligibility criteria outlined in the document and consult with the Division of Family and Children Services for further clarification and assistance with your Medicaid application.
Scores (A/C/R): 5 / 4 / 5

Similarly, no responses achieved the maximum score of 15, but two responses scored 14, and four responses scored 13.

Pennsylvania
Figure 11: Prompt Effectiveness for SNAP and Medicaid in Pennsylvania
Prompt Effectiveness for SNAP and Medicaid in Pennsylvania Bar Chart
For Pennsylvania SNAP, the greatest variation was observed in the “Summary” category. The lowest score was a 7, which came from a response to a “Summary” prompt.

Table 23: Example Result from Pennsylvania SNAP
State: PA
Policy: SNAP
Question on Topic: Summary
Prompt: Please summarize the eligibility policy for SNAP in Pennsylvania found in the uploaded document. Please do not skip any important details.
LLM-Generated Response:
The eligibility policy for SNAP (Supplemental Nutrition Assistance Program) in Pennsylvania includes the following criteria:

1. Household members who are ineligible or disqualified, such as noncitizens, those without a Social Security number, those who have committed intentional program violations, and those who have not complied with work requirements, may not participate in the program.

2. Income must be below 130% of the Federal Poverty Income Guidelines for the household size for Semi-Annual Reporting households.

3. ABAWD (Able-Bodied Adults Without Dependents) must meet a work requirement of at least 80 hours per month.

4. Boarders may only be included as household members if the household wants to include them. Foster care children and adults are treated as boarders.

5. Nonexempt ABAWDs who have received SNAP benefits for 3 months in an earlier 36-month period and are not meeting the work requirement are not eligible for SNAP.

6. Other criteria, such as allowable shelter deductions, temporary employment with the United States Census Bureau, and income from state employees, may also affect eligibility.
Scores (A/C/R): 5 / 1 / 1
None of the responses reached the maximum possible score of 15. However, two responses achieved 14 points, and five responses scored 13.

Table 24: Example Results from Pennsylvania SNAP

Example 1
State: PA
Policy: SNAP
Question on Topic: Summary
Prompt: Please summarize the eligibility policy for SNAP in Pennsylvania found in the uploaded document. Please keep the response to 500 words of less.
LLM-Generated Response:
The eligibility policy for SNAP in Pennsylvania can be summarized as follows:

– The income limit for SNAP is 130% of the Federal Poverty Income Guidelines (FPIGs) for the household size, and there is no resource limit for households that qualify under the expanded categorical eligibility criteria.
– For households that are on a Semi-Annual Reporting (SAR) schedule, income that is more than 130% of the FPIGs and ABAWD (Able-Bodied Adults without Dependents) work hours that fall below 80 hours per month may result in ineligibility for SNAP.
– The eligibility policy for SNAP is governed by 7 CFR § 273.2(f)(6), 7 CFR § 273.10(c), 7 CFR § 273.12(a)(i)(vii), and 7 CFR § 273.12(a)(1)(5)(v).
– The policy was reissued on March 1, 2012, replacing the previous version from December 19, 2008.
– The eligibility policy also includes updates on various topics such as UC, EUC, and EB in Pennsylvania, allowable shelter deductions, SNAP interview appointments, categorical eligibility and notices, temporary employment with the United States
Scores (A/C/R): 5 / 4 / 5

Example 2
State: PA
Policy: SNAP
Question on Topic: What would determine if I am eligible?
Prompt: I am a single adult who cannot work due to a medical condition in Pennsylvania. Based on the provided document, what would determine if I am eligible for SNAP?
LLM-Generated Response: The eligibility for SNAP is determined by the individual’s income and resources, as well as their household status. If a licensed medical provider has indicated on Form PA 1663 or PA 635 that the individual is permanently or temporarily disabled for a period of 12 months or more, they may be eligible for SNAP under expanded categorical eligibility criteria. Other factors such as household size, income, and resources will also be considered in determining eligibility.
Scores (A/C/R): 5 / 4 / 5

For Pennsylvania Medicaid, the lowest overall score across the three categories was a 9. This score came from one response to the “plain language summary” prompt.

Table 25: Example Result from Pennsylvania Medicaid
State: PA
Policy: Medicaid
Question on Topic: “Plain language” summary
Prompt: Please provide a plain language summary of the eligibility criteria for Medicaid in Pennsylvania found in the provided document.
LLM-Generated Response: In order to be eligible for Medicaid in Pennsylvania, an individual must meet certain financial, non-financial, and resource requirements. These requirements may vary depending on the specific Medicaid program that the individual is applying for, such as SSI-related, MAGI-related, or TANF-related Medicaid. The application process should be completed within 30 days, but may take up to 45 days in some cases. Once eligibility is determined, the individual will receive a written notice and can choose which program is best for them. Children under 21 are automatically eligible for certain benefits, including ambulance services, dental services, and prescription drugs. The Standard Filing Unit (SFU) is a system used to determine eligibility for Medicaid and includes all eligible members of an applicant/recipient group, as well as certain household members. Immediate family members must be included in the SFU. For more information, individuals can contact their MA MCO or the Fee-For-Service Recipient Service Center.
Scores (A/C/R): 5 / 2 / 2

Similarly, no responses reached the maximum score of 15; only one response achieved a 14.

Table 26: Example Result from Pennsylvania Medicaid
State: PA
Policy: Medicaid
Question on Topic: “Plain language” summary
Prompt: I am a pregnant woman in Pennsylvania. Based on the provided document, what would determine if I am eligible for Medicaid?
LLM-Generated Response: Based on the provided document, your eligibility for Medicaid would depend on factors such as your residency status, income level, and whether or not you have applied for Medicaid in another state. It is also important to note that pregnant women in Pennsylvania may be eligible for MA or CHIP at the time of birth, regardless of their income or residency status. You may also be eligible for PE for Pregnant Women Healthcare Benefits Package, which provides coverage for pregnancy-related services. Ultimately, your eligibility for Medicaid would need to be determined by a qualified MA provider using the appropriate forms and documentation.
Scores (A/C/R): 5 / 4 / 5

Michigan
Figure 12: Prompt Effectiveness for SNAP and Medicaid in Michigan
Prompt Effectiveness for SNAP and Medicaid in Michigan Bar Chart

For Michigan SNAP, the greatest variation was observed in the “Summary” category. The lowest score was a 7, shared by one response to a “Summary” prompt, one to a “Logic” prompt, and one to the “What would determine if I am eligible?” prompt. None of the responses reached the maximum possible score of 15, and only a few responses scored as high as 13.

For Michigan Medicaid, the greatest variation was observed in the “Plain language summary” category. The lowest overall score across the three categories was an 8, which came from one response to a “Plain language summary” prompt. Similarly, no responses reached the maximum score of 15, but seven responses achieved a score of 13.

Texas
Figure 13: Prompt Effectiveness for SNAP and Medicaid in Texas
Prompt Effectiveness for SNAP and Medicaid in Texas Bar Chart

For Texas SNAP, the greatest variation was observed in the “Summary” category. The lowest score was a 4, which came from one response to the “What would determine if I am eligible?” prompt. None of the responses achieved the maximum score of 15, but one response scored 14, and two responses scored 13.

For Texas Medicaid, the greatest variation was found in the “What would determine if I am eligible?” category. The lowest overall score across the three categories was a 9, which came from one response to a “Plain language summary” prompt. Similarly, no responses reached the maximum score of 15, but three responses scored 14, and two scored 13.

Considerations for Public Benefits Use Cases

While we applied these methods to specific states and policies, there are considerations for wider public benefits use cases. 

State governments can make it easier for their policies to be used in LLMs by making them digitally accessible. At a minimum, this is a single PDF that allows text to be extracted; even better is a plain text or HTML document or webpage that presents the policy in full. One excellent example of digitally accessible policy is Oklahoma. The Oklahoma Rules website allows users to view and download the HTML of the entire administrative code for the state.


Focusing LLMs on specific policy documents increases accuracy in responses, but results are mixed for relevance and completeness. Simply pointing an LLM to an authoritative document does not mean it will pull out the relevant information from the document. A more promising direction may be to consider one of the following: custom-designed RAG document collections, fine-tuning with labeled data, or newer reinforcement learning approaches that automatically attempt to identify the relevant information for the query. Additionally, the way policy documents are written and structured—such as including specific eligibility criteria tables—can also improve LLM performance.

Experiment 3: Using LLMs to Generate Machine-Readable Rules

Motivation and Question

In this experiment, we examined the capability of LLMs to automate an essential component of Rules as Code solutions. Specifically, we assessed the ability of LLMs to systematically extract rules from official policy manuals and encode them in a machine-readable format. This type of experiment supported our goal of enabling efficient implementation of digitized policy. If rules could be successfully generated, then a rules engine could be developed more efficiently and programmatically integrated into broader software applications. Due to the inherently unstructured nature of policy documents, we recognized the need for a structured approach to access rules in a standardized format. This type of standardization would create consistency and applicability across different states and programs, while facilitating a uniform evaluation process for the generated rules in terms of accuracy and completeness.

The main guiding questions of the study are:

  • Do LLMs reliably extract rules from unstructured policy documents when guided by a structured template?
  • How does the use of a structured rules template impact an LLM’s ability to produce relevant and accurate output?
  • What performance differences emerge in extracting policy rules when using a structured template with a plain-prompting approach versus a RAG framework?

Rules Templating

Initial attempts to prompt the LLM to directly identify rules from policy documents proved unreliable, as outputs were highly sensitive to prompt phrasing. We concluded that the LLM struggled to produce dependable and uniform results without a strictly-defined rules template. We conducted the remaining experiments with a manually-curated rules template for both SNAP and Medicaid across six states. For SNAP, the rules included criteria such as income limits based on family size, citizenship requirements, and other key eligibility conditions. For Medicaid, the rules incorporated covered groups (e.g., pregnant women, low-income families) and age requirements, among other criteria. Figure 14 presents an example of a simplified rules template used for SNAP.

Figure 14: Simplified SNAP Rules Template
Simplified SNAP Rules Template
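
Since Figure 14 is presented as an image, the sketch below gives a sense of what a simplified rules template of this kind might look like. The field names, structure, and placeholder values are illustrative assumptions, not the project's actual template or real policy figures.

```python
# Hypothetical, simplified SNAP rules template (illustrative only).
# Field names and values are placeholders, not the project's actual template.
snap_rules_template = {
    "program": "SNAP",
    "state": "<STATE>",
    "rules": {
        "gross_income_limit": {
            "description": "Monthly gross income limit by household size",
            "unit": "USD/month",
            "values_by_household_size": {"1": None, "2": None, "3": None},
        },
        "asset_limit": {
            "description": "Countable resource limit",
            "unit": "USD",
            "value": None,
        },
        "citizenship_requirement": {
            "description": "Citizenship or eligible immigration status",
            "value": None,
        },
        "abawd_work_requirement": {
            "description": "ABAWD work requirement",
            "unit": "hours/month",
            "value": None,
        },
        "certification_period": {
            "description": "Length of the certification period",
            "unit": "months",
            "value": None,
        },
    },
}
```

In the experiments, the LLM's task was then to fill in the missing values for a given program and state.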

Methodology

Figure 15: Experiment 3 Methodology
Experiment 3 Methodology Diagram

Our task in this experiment involved prompting the LLM to populate the predefined rules with corresponding values based on a specified program and state. The goal was to compare the ability of readily available LLMs and LLMs with additional external knowledge to generate programmatic policy rules. To this end, we evaluated two distinct approaches using GPT-4o as the LLM. 

An initial direct-query approach was unsuccessful, which led us to an experimental design that began by compiling a rules template: a predefined set of policy rules assembled in a structured JSON format. These rules were selected based on their common applicability across different programs and state-specific variations in our case studies. Below, we describe the workflow of both approaches, illustrated in Figure 15. 

A. Template-Based Rules Generation (Plain Prompting)

Prompt Construction: We formulated a prompt for generating rules using the rules template and task-specific instructions.

Rules Generation: We provided the curated prompts to the LLM (GPT-4o), which generated values for the policy rules relying on its own training data. 

Output Evaluation: We evaluated the produced rules against the original values manually-extracted from authoritative sources (e.g., state policy manuals).
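
As a minimal sketch of this plain-prompting pass (assuming a `call_llm` helper that wraps the GPT-4o call and a rules template like the one sketched above; the instruction wording is paraphrased, not our exact prompt):

```python
import json
from typing import Callable

def generate_rules_plain(rules_template: dict, program: str, state: str,
                         call_llm: Callable[[str], str]) -> dict:
    """Ask the LLM to fill in the template using only its own training data (no retrieval)."""
    prompt = (
        f"Fill in every value in the following {program} rules template for {state}, "
        "using only your own knowledge, and return valid JSON.\n\n"
        + json.dumps(rules_template, indent=2)
    )
    return json.loads(call_llm(prompt))
```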

B. Template-Based Rules Generation with RAG

Document Indexing: We collected official policy documents and organized them in a searchable format. The process involved breaking the documents into fixed-size chunks, which were then converted into vector embeddings (numerical representations of the text that deep learning models can use). By mapping these vectors into a common embedding space, we could assess how semantically similar two text segments were. The goal of this step was to access the relevant information from the policy document collection to improve the final response provided by the LLM. 
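
As an illustration of the indexing step just described, the sketch below chunks a policy document into fixed-size pieces, embeds each chunk, and ranks chunks by cosine similarity to a query. The `embed` callable is a stand-in for whichever embedding model is used; it is an assumption of this sketch, not a specific library from our experiments.

```python
import math
from typing import Callable, List, Tuple

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split a policy document into fixed-size, slightly overlapping chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_index(chunks: List[str],
                embed: Callable[[str], List[float]]) -> List[Tuple[str, List[float]]]:
    """Embed every chunk once so it can be searched later."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, index: List[Tuple[str, List[float]]],
             embed: Callable[[str], List[float]], top_k: int = 5) -> List[str]:
    """Return the top_k chunks most semantically similar to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```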

Prompt Construction: The rules generation prompt was formulated using the rules template and task-specific instructions. This prompt was used twice:

  • First, the LLM was instructed to generate search queries to retrieve relevant documents from the RAG database.
  • Once the relevant documents were retrieved, they were appended to the generation prompt, and the LLM was instructed to generate the policy rules. 

RAG Queries and Document Retrieval: The LLM-generated search queries were then used to find the most semantically-relevant text segments in the policy database. The retrieved segments were appended to the rules generation prompt.

Rules Generation: After incorporating the relevant policy information into the prompt, the LLM generated policy rules guided by the retrieved policy excerpts. 

Output Evaluation: We evaluated the produced rules against manually-extracted rules values and verified the output from authoritative sources.
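
Putting the workflow above together, a minimal orchestration sketch might look like the following. Here `call_llm` and `retrieve_chunks` are placeholders for the GPT-4o call and the vector-database lookup, and the prompt wording is a condensed paraphrase rather than our exact template.

```python
import json
from typing import Callable, List

def generate_rules_with_rag(rules_template: dict,
                            call_llm: Callable[[str], str],
                            retrieve_chunks: Callable[[str], List[str]]) -> dict:
    """Two-pass RAG workflow: (1) ask the LLM for search queries,
    (2) retrieve matching policy text and ask it to fill the template."""
    template_json = json.dumps(rules_template, indent=2)

    # Pass 1: the LLM proposes search queries for the rules it must fill in.
    query_prompt = (
        "You will fill in the policy rules template below. "
        "Write one search query per rule, one per line, to locate the relevant policy text.\n\n"
        + template_json
    )
    queries = [q.strip() for q in call_llm(query_prompt).splitlines() if q.strip()]

    # Retrieval: collect the most semantically relevant policy excerpts for each query.
    excerpts: List[str] = []
    for query in queries:
        excerpts.extend(retrieve_chunks(query))

    # Pass 2: the retrieved excerpts are appended to the rules generation prompt.
    generation_prompt = (
        "Using only the policy excerpts below, fill in every value in this rules template "
        "and return valid JSON.\n\nTEMPLATE:\n" + template_json
        + "\n\nPOLICY EXCERPTS:\n" + "\n---\n".join(excerpts)
    )
    return json.loads(call_llm(generation_prompt))
```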

Developments and Challenges

Template Design: Developing a comprehensive rules template to capture the evolving, state-specific nuances of public benefits policies was challenging. However, manually developing such a template remained a critical first step to standardize outputs and enable systematic rule extraction.

Chunking and Context Overhead: Another significant challenge with the RAG framework involved compiling a manageable yet comprehensive set of resources for retrieval. As noted in Experiment 2, state-issued policy manuals, which sometimes spanned thousands of pages, presented a retrieval challenge due to their sheer volume and complexity. To mitigate this, large chunks of text were regularly retrieved to maximize the chances of including the relevant information in the LLM’s input. However, this strategy created a trade-off: While longer contexts increased the likelihood of capturing relevant details, they also degraded the LLM’s performance by overwhelming its processing capacity—a challenge commonly referred to as the “needle in the haystack” problem. In other words, the model struggled to find critical information buried in a large volume of less relevant data. In contrast, retrieving fewer documents reduced context size but increased the risk of missing key information.

Information Scattering and Retrieval: Policy details relevant to a single rule were sometimes scattered across multiple sections of a manual. In such cases, a document re-ranking strategy—which improved retrieval accuracy by re-scoring the initially-retrieved documents using advanced semantic similarity—became essential to prioritize the most relevant segments for retrieval. While effective, this approach introduced additional computational overhead and complexity in generation.
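
A re-ranking step of the kind described above can be sketched as follows. The `score_pair` callable stands in for a more expensive semantic scorer (for example, a cross-encoder) and is an assumption of the sketch, not a specific library call from our experiments.

```python
from typing import Callable, List

def rerank(query: str, candidates: List[str],
           score_pair: Callable[[str, str], float], keep_top: int = 3) -> List[str]:
    """Re-score initially retrieved chunks against the query and keep the best ones.

    score_pair(query, chunk) is applied only to the small candidate set, trading
    extra computation for better prioritization of the most relevant segments.
    """
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep_top]]
```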

Findings

For Georgia, Pennsylvania, and Texas, we generated rules that represented policies associated with state-specific SNAP programs. These rules were grouped into higher-level categories (e.g., income, asset limit). For each category, we calculated the average accuracy of the generated rules for each state and approach. The results are presented in Figure 16.

A detailed spreadsheet of the results is included in the Appendix.

Figure 16: Average Rules Generation Accuracy by State and Category
Average Rules Generation Accuracy by State and Category

Notable results shown in Figure 16 include: 

Plain prompting yielded unreliable generation accuracy for state-specific rules. The plain prompting approach without additional external knowledge (i.e., without RAG) demonstrated significant limitations with numerical rules like income and asset limits, achieving 0% accuracy in these categories across all states. Conversely, categories that were more likely to be consistent across states, such as citizenship and work requirements, were generated with 100% accuracy (except for citizenship in Texas, where the plain LLM approach failed to provide the correct residency requirement). 

Other categories, such as standard deductions and certification periods, yielded mixed success. Challenges were primarily due to the LLM’s reliance on training data, which lacked the granularity required for producing localized state variations. However, for rules that remained uniform across states, such as federally-mandated rules (citizenship and categorical eligibility), the LLM was able to retrieve information accurately from its native knowledge base. 

RAG enhanced the alignment of generated rules with policy documents. The RAG approach effectively addressed the shortcomings of the plain prompting method by providing the additional context necessary to enhance both generation accuracy and timeliness. With this framework, the LLM successfully generated all predefined rules with 100% accuracy, except for a few cases such as the certification period in Georgia, deductions in Pennsylvania, and the monthly allotment in Texas. These results, as shown in Figure 16, highlight the clear difference in performance between the two approaches.

Missing context in RAG led to erroneous outputs. A closer analysis of the certification period category in Georgia revealed that the inaccurate generation using the RAG framework was a consequence of the absence of relevant information from our policy database—excluding key context from the LLM’s input. This highlighted a unique challenge introduced by the RAG framework: In general, the LLM relies heavily on provided context and less on pretrained knowledge. As a result, when required information was missing from the context, the model was more prone to generating erroneous outputs, as it assumed the answer existed in the retrieved text.

Considerations for Public Benefits Use Cases

Structured templates are essential for rules extraction. The success of machine-readable rules generation is, to a large degree, dependent on the use of standardized templates. Public benefits agencies should collaborate to develop shared schemas that capture multiple levels of eligibility criteria. The absence of such templates results in significantly higher levels of inaccuracy and inconsistency in LLM output, further complicating interoperability with other components of an automated system. 

Rules generation can provide a structure for code generation. Given the success of rules generation for most contexts in this experiment, separating the rule generation from the code generation is a promising direction. While there is a need to begin with a manually-created template, once that is created, it can be part of a template library that can be used to generate the rules. The generated rules act as a structured skeleton of the code, providing the necessary logic for cleaner, more robust code generation. This multi-step, LLM-assisted implementation is a promising direction because policies in different states are inconsistently written. This variation means that translating certain highly variable policies directly to code may not be as promising a path for complicated policies as generating rules. 

Regular policy changes impact LLM output. To keep up with the recurring policy and legislative changes of public benefits programs, automated Rules as Code systems must remain up to date. This involves closely monitoring the LLM’s output to ensure it reflects the latest policy amendments. This might require scheduled model fine-tuning and updating the policy library used for document retrieval. Requiring the LLM to provide a reference for the generated rules may help with this as well. 

LLMs must handle dependencies across different benefits eligibility criteria. Benefits programs frequently involve interconnected requirements. For instance, a household’s gross income limits may vary depending on its composition (e.g., whether a member is elderly or disabled). If the LLM failed to account for such variations, it would lead to incorrect eligibility determination in automated systems. Therefore, careful handling of these dependencies is essential to prevent inaccuracies in LLM-powered automations. This could be addressed through detailed audits using test cases that feature intricate policy rules. A complementary approach would be to consider incorporating knowledge graphs into the RAG input, either independently or in combination with a traditional vector database. 
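
To make this kind of dependency concrete, the sketch below shows how a test case could exercise a gross income test that is waived when a household includes an elderly or disabled member. The function name and dollar amounts are hypothetical placeholders, not actual SNAP or Medicaid limits.

```python
# Illustration of an interdependent rule; all dollar amounts are placeholders.
PLACEHOLDER_GROSS_LIMITS = {1: 1500, 2: 2000, 3: 2500}  # USD/month, not real limits

def passes_gross_income_test(monthly_income: float, household_size: int,
                             has_elderly_or_disabled_member: bool) -> bool:
    if has_elderly_or_disabled_member:
        # In this illustration, the gross income test does not apply to such households.
        return True
    return monthly_income <= PLACEHOLDER_GROSS_LIMITS[household_size]

# Test cases like these would catch generated code that ignores the dependency:
assert passes_gross_income_test(2600, 3, has_elderly_or_disabled_member=True)
assert not passes_gross_income_test(2600, 3, has_elderly_or_disabled_member=False)
```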

Policy documents need to be accessible by machines. Policy accessibility directly impacts accuracy in LLM-powered systems. In addition to making policies more accessible for humans, state governments should advocate for digitizing policies in structured formats and adopting more modular and easily searchable documents to improve their ability to be read by machines. This would allow these systems to easily localize critical information in lengthy manuals and reduce retrieval complexity.

Experiment 4: Using LLMs to Generate Software Code

Motivation and Question

In this experiment, we aimed to evaluate the ability of LLMs to generate software code, specifically for determining eligibility status in benefits programs, by building on insights from earlier phases of this study. Our primary objective was to understand the feasibility of using LLMs to automate code generation for key components of the benefits delivery process. By doing this, we began exploring how such automation could simplify access to benefits for eligible individuals, assist organizations in optimizing service workflows, and contribute to the development of more efficient and adaptable systems for managing benefits programs.

The main questions guiding our study were: 

  1. How did different prompting strategies impact the accuracy and usability of LLM-generated code for eligibility determination?
  2. Did LLMs synthesize end-to-end logic from unstructured policy documents into functional software code, and what were the key limitations in modeling interdependencies between eligibility criteria?
  3. Did an iterative, multi-step approach reduce errors compared to direct code generation, or did it introduce compounding inaccuracies?

Methodology

Figure 17: Experiment 4 Methodology 
Experiment 4 Methodology Diagram

We developed three distinct experimental designs to assess how different inputs and prompting strategies impacted the quality and effectiveness of LLM-generated software code. Our goal was to determine the optimal input structure and level of guidance required by LLMs to generate accurate and functional code. 

To control for potential confounding factors, we designed a small-scale prototype task that focused on a single program and state. Specifically, we prompted the LLM to generate software code to determine an applicant’s SNAP worker registration status in Georgia. This targeted experiment allowed us to closely examine, in a controlled setting, the LLM’s code-generation capabilities. For this experiment, we used GPT-4o to evaluate the different prompting strategies. We tested the following three prompting strategies in this task. Figure 17 illustrates the workflow of each experimental design.

A. Design 1: Simple Prompt (Baseline)

Expert knowledge construction: In order to provide state-specific policy knowledge, we added a policy excerpt as input to the LLM. This excerpt was extracted from Georgia’s official SNAP manual and was manually-verified to include all necessary information for determining an applicant’s worker registration status, given their attributes (e.g., age and employment status). This served as the primary reference for the LLM to infer policy rules and code logic.

Minimalistic design prompt construction: For this experimental design, we used minimal instructions in the prompt to guide the LLM’s code generation. This approach allowed us to assess the LLM’s capability to autonomously interpret the policy text and extract the relevant, underlying logic without additional guidance.

B. Design 2: Detailed Prompt (Guided Approach)

Expert knowledge construction: We provided the policy excerpt, along with a more structured prompting template featuring detailed instructions. 

Expanded design prompt construction: We expanded the instructions to address the observed shortcomings from the baseline, e.g., incomplete decision-making logic, inadequate input handling, or a lack of modularity in the generated code. The detailed guidance emphasized improving code structure and enhancing the LLM’s ability to correctly interpret and apply the rules. Table 27 shows the prompt template used for the guided approach.

Table 27: Example Guided Prompt
System Prompt:
You are an expert policy analyst and a skilled software developer expert working on a project to digitize the US benefits policies by translating complex policy text into software code to automate the benefits eligibility determination process.

User Prompt:
You are tasked with writing a Python script to determine the work registration status of an applicant for SNAP benefits in {STATE}. You are provided with official policy documentation that covers the rules and regulations regarding the work registration requirements. Follow these steps carefully:
1. Extract Relevant Inputs:
 – Thoroughly analyze the policy documentation.
 – Identify and list all inputs required to determine the work registration status.
2. Model Logical Relationships:
 – Define the relationships between the identified inputs.
 – Specify valid values for each input.
 – Implement error handling for invalid inputs.
3. Handle Missing Inputs:
 – For each input, determine if it’s mandatory or optional.
 – For mandatory inputs, implement user prompts to collect the information.
 – For optional inputs, define default values and explain their implications.
4. Data Structure Design:
 – Choose appropriate data structures to store and manage the inputs efficiently.
 – Ensure the chosen structures allow for easy updates and modifications.
5. Implement Decision Logic:
 – Develop the core logic to determine the work registration status based on the inputs.
 – Ensure all possible scenarios and edge cases are covered.
6. Documentation and Structure:
 – Provide clear, comprehensive comments throughout the code.
 – Organize the code into logical sections or functions for better readability.
7. Rule Storage:
 – Store decision-making rules in a format that’s easily updatable (e.g., JSON, YAML).
 – Implement a mechanism to load and apply these rules dynamically.
8. Testing and Validation:
 – Include test cases covering various scenarios.
 – Validate the script’s output against manual calculations.

**Policy Documentation**
{policy_text}

Provide your Python script, following the steps outlined above
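
As an illustration of how a template like the one in Table 27 might be filled in and sent to GPT-4o, the sketch below uses the OpenAI Python SDK. The function name, variable names, and the decision to substitute {STATE} and {policy_text} with simple string replacement are assumptions of this sketch, not our experiment code.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def generate_eligibility_script(state: str, policy_text: str,
                                system_prompt: str, user_prompt_template: str) -> str:
    """Fill the {STATE} and {policy_text} placeholders and request a Python script."""
    user_prompt = (user_prompt_template
                   .replace("{STATE}", state)
                   .replace("{policy_text}", policy_text))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```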

C. Design 3: Iterative Prompts

To gain a better understanding of the basic tasks that make up the code generation process, we implemented an iterative prompting strategy, using the following steps:

Expert knowledge construction: We summarized the policy excerpt to extract and distill key rules and requirements. This was a textual summary, not an explicit template as described in Experiment 3. The generated summary served as a condensed representation of the policy text, directing the LLM’s focus toward only the essential information for subsequent steps.

Pseudocode generation: Based on the summarized policy, the LLM was instructed to generate pseudocode. Pseudocode is a simplified, human-readable description of the logical steps required to implement a program. This intermediate representation can be easier for an LLM to translate into code than the policy text itself. 

Pseudocode translation: We asked the LLM to translate the pseudocode into executable Python code. This step converted the pseudocode into a functional program that determined the worker registration status of an applicant given their input information.

In this iterative approach, we created a structured pipeline for code generation by feeding each step’s output in as the input to the next step. We have made our prompt templates available in the Appendix.
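
Conceptually, the pipeline chains three LLM calls, as in the sketch below. `call_llm` is a placeholder for the GPT-4o call, and the prompts are condensed paraphrases of the templates in the Appendix rather than their exact text.

```python
from typing import Callable

def iterative_code_generation(policy_excerpt: str, call_llm: Callable[[str], str]) -> str:
    """Three-step pipeline: summarize policy -> generate pseudocode -> translate to Python.

    Each step's output becomes the next step's input.
    """
    summary = call_llm(
        "Summarize the key rules and requirements for SNAP work registration "
        "from the following policy text:\n\n" + policy_excerpt
    )
    pseudocode = call_llm(
        "Write pseudocode that determines an applicant's work registration status "
        "based on this policy summary:\n\n" + summary
    )
    return call_llm(
        "Translate the following pseudocode into a runnable Python script:\n\n" + pseudocode
    )
```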

This structured methodology allowed us to systematically evaluate the impact of different prompting techniques and assess how well the LLM adapts to varying levels of guidance and task complexity. Each approach produced a Python script, which was then manually evaluated based on key qualitative benchmarks: variable identification, input handling, output correctness, decision making, logical consistency, rule coverage, and code execution. We discuss these criteria in the findings section. 

Developments and Challenges

Prompt Complexity and Task Alignment. Crafting prompts proved to be a central challenge throughout the experiment, particularly in the iterative approach, where each stage required tailored prompts with varying input structure and output requirements. In our experimentation, overly strict prompts constrained adaptability, while looser prompts led to reasoning gaps that LLMs struggled to bridge. Striking the right balance required extensive trial and refinement.

Lack of Expert-Curated Test Cases for Benchmarking. Our evaluation was limited by the absence of standardized test cases designed by policy experts. A structured test suite would enable a more systematic validation, reducing reliance on qualitative assessment and improving comparability across different systems and prompting methods.

Isolating Policy Interpretation from Code Generation. In non-iterative approaches, errors were difficult to attribute. Policy misinterpretations and code synthesis errors were often entangled in the final output. A more systematic verification framework is needed to distinguish policy comprehension failures from issues in logical modeling and code implementation. 

Findings

Table 28 summarizes the performance of the three approaches using qualitative metrics that evaluate the effectiveness of the generated code. At a high level, Design 2 was the most successful. We now discuss each design individually before summarizing our findings.

We have made prompts and generated code available in the Appendix. 

Table 28: Performance of Designs Across Criteria
Criteria | Design 1 (Simple Prompt) | Design 2 (Detailed Prompt) | Design 3 (Iterative Prompts)
Variable Identification | Partial | Good | Good
Input Handling | Poor | Partially good | Poor
Output Correctness | Incorrect | Partially correct | Incorrect
Decision Making | Poor | Partially good | Poor (mechanical)
Logical Consistency | Poor | Improved | Partially consistent
Rule Coverage | Partially covered | Improved | Partially covered
Code Execution | Runs (unstable) | Runs (Improved) | Doesn’t run

Design 1: Baseline approach was limited in code synthesis despite having foundational rules extracting capabilities. In the baseline approach, the LLM demonstrated a reasonable capacity to identify relevant rules and variables required to construct the script, including correctly assigning data types (e.g., int, bool) and effectively categorizing rules, such as distinguishing exemption rules from mandatory registrant rules. This indicated a foundational ability to extract rules from the policy excerpt. However, significant issues existed in the final code output. Variable naming inconsistencies led to logical errors in the code, resulting in runtime errors and incorrect eligibility classifications. For example, as shown in Figure 18, the rule for determining “head-of-household” eligibility for 16- and 17-year-olds as part of the mandatory registrant decision logic referenced a ‘school_status’ variable that was never collected in the program input, causing the program to exit with an error.

Other issues causing runtime failures included the flawed input handling from a lack of proper validation and poor error-checking mechanisms. Additionally, failure to accurately model the policy logic led to inefficient code execution, such as requesting unnecessary user inputs rather than focusing on the subset needed for eligibility determination. In general, the LLM struggled to synthesize the multiple steps required—extracting relevant rules, interpreting their definitions, modeling interdependencies, and assembling the final code— to generate a coherent and functional output when provided with minimal guidance in the prompt. 

Figure 18: Code Snippet from the Baseline Approach Output
Code Snippet from the Baseline Approach Output

Design 2: The guided approach improved code quality but requires further refinement. With the guided approach, which incorporated detailed instructions, the LLM showed significant improvements over the baseline approach. The generated code demonstrated a more consistent logical structure, with variable naming and usage aligning throughout the program. The eligibility determination logic improved, and the resulting code was more accurate overall. However, certain limitations remained, including inadequate error handling mechanisms and a user interaction design that lacked clarity and intuitiveness. In addition, as shown in Figure 19, the eligibility logic still relies on hard-coded threshold values, limiting both code maintainability and adaptability to evolving policy and economic conditions. While the additional guidance enabled the LLM to better interpret and apply the policy rules, it still fell short of fully generating accurate, modular, adaptable code, highlighting areas where further refinement or human intervention may be necessary.

Figure 19: Code Snippet from the Guided Approach Output
Code Snippet from the Guided Approach Output
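
As an illustration of this maintainability concern (and of the "rule storage" step requested in the guided prompt), the sketch below contrasts a hard-coded threshold with one loaded from an external rules file. The 80-hour ABAWD figure comes from the policy text quoted earlier; the file name and key are assumptions of this sketch.

```python
import json

# Hard-coded version, similar in spirit to what the generated code tended to produce:
ABAWD_REQUIRED_HOURS = 80  # hours per month

def meets_work_requirement_hardcoded(hours_per_month: int) -> bool:
    return hours_per_month >= ABAWD_REQUIRED_HOURS

# More maintainable version: the threshold lives in an external, easily updatable
# rules file (file name and key are illustrative), so policy updates do not
# require changes to the code itself.
def meets_work_requirement(hours_per_month: int, rules_path: str = "snap_rules.json") -> bool:
    with open(rules_path) as f:
        rules = json.load(f)
    return hours_per_month >= rules["abawd_required_hours_per_month"]
```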

Design 3: Iterative approach introduced compounding error challenges despite task compartmentalization. 

In the iterative approach, where the task was broken down into incremental steps—policy summarization, pseudo-code generation, and final code generation—the LLM showed promise in its ability to compartmentalize tasks. The policy summary effectively captured the document’s structure and extracted key rules into a concise format, although there were occasional inconsistencies in articulating the eligibility logic. For example, the summary classified 51- to 59-year-olds as mandatory without mentioning they can be exempt if they are medically unfit, physically or mentally, or responsible for caring for a child under six. The pseudocode also formalized this summary into a logical framework but lacked precision, providing a simplified rather than comprehensive representation of the policy. For example, it treated every caretaker status as automatically exempt without verifying if the caretaker provided care for an incapacitated person or a child under six, overlooking essential policy requirements for eligibility checks. The final code, derived as a direct translation of the pseudocode, mirrored these deficiencies. The LLM’s interpretation was mechanical, failing to fully grasp some of the policy logic embedded in the pseudocode. As a result, gaps in the pseudocode carried over to the final script, leading to execution errors. Key details were lost during the transitions between steps. For example, as shown in Figure 20, “head of household” was not explicitly defined as an input variable in the pseudocode, despite being an essential eligibility factor in the decision logic. This omission caused the generated code to reference an uninitialized variable, resulting in a runtime error. Contrary to our initial belief, this multi-step process introduced cumulative errors at each stage, proving less effective for generating functional code than Design 2. 

Figure 20: Code Snippet from the Iterative Approach Output
Code Snippet from the Iterative Approach Output

In summary, our baseline experimental design generated unusable code, but did partially structure the logic effectively. The guided experimental design offered the strongest results for practical implementation when paired with human oversight. The prompts were refined and adapted to better align with policy rules, the underlying logic, and expected outputs. The iterative approach holds potential for complex and extended tasks; however, its utility depends on carefully mitigating errors that occur at each step of the pipeline through automated checks or expert review. Until more templates and standardized documents are developed, this approach is less viable than the guided approach. 

Considerations for Public Benefits Use Cases

Summarized policy guidance for code generation reduces code errors. Our guided approach improved code quality significantly when compared to the baseline and also performed better than the iterative approach. Public benefits agencies should collaborate with each other to develop prompting frameworks that specify input format, variable naming conventions, and expected output structure to reduce ambiguities in LLM tasks, thereby improving the quality of the generated code. 

Modular design is particularly important in LLM workflows. Our findings show that LLMs struggle when they have to synthesize end-to-end logic for complex, multi-step tasks. Although their overall performance in the iterative approach was not markedly better than other methods, they showed a clear pattern of improved response quality when handling individual, compartmentalized tasks (e.g., summarizing policy). This suggests that public benefits agencies might potentially split complex workflows into smaller, reusable modules, which LLMs can tackle incrementally. At this stage, human developers are needed to both break down the workflow into reasonable chunks for LLMs and validate outputs from one step to the next. 

Accuracy and equity considerations must outweigh efficiency in high-stakes benefits systems. In code generation tasks—where outputs such as applicant eligibility status carry serious consequences and unresolved automation risks—we suggest treating LLMs as assistive tools to support human experts and developers, not as standalone solutions. Precision and fairness should be firm priorities in public benefits systems, making human oversight essential to ensure equitable outcomes. Efficiency gains from automation should not compromise these priorities.

Conclusion

Key Takeaways

  1. Our experiments highlight that LLMs can support different parts of the Rules as Code pipeline, but a human in the loop and rich databases containing relevant, up-to-date policy excerpts are essential to facilitate the use of this technology.
  2. Using LLMs to distill benefits eligibility policies leads to mixed results. The returned information varies in its accuracy, timeliness, state-specificity, completeness, and relevance. The results are mixed across models and interfaces, including the web browser and APIs. While one model may perform better using a specific prompt type (e.g., income or individual eligibility) or category of scoring (e.g., accuracy or relevance), it may perform poorly with other prompts and categories, leading to mixed overall performance. 
  3. Mixed LLM results have a direct impact on people seeking or receiving benefits, risking incorrect information when they ask generative AI models questions about programs like SNAP and Medicaid. When testing chatbots in Experiment 1, we found that Medicaid responses scored slightly higher overall than SNAP responses and that there were no major differences in the models we tested. Chatbots access information from the model’s training data, and it is often unclear if that information is from authoritative sources. This means that we should expect some answers from chatbots to be inaccurate and therefore, encourage people to not use chatbots to obtain information about benefits programs if a source for the information is not returned. 
  4. It is possible to improve the performance of the benefits-related responses by pointing LLMs to authoritative sources like policy manuals. However, the current methods for communicating SNAP and Medicaid policy in fragmented PDFs and interactive websites make it hard for LLMs to use the information. State governments can make it easier for LLMs to use their policies by making them digitally accessible. At a minimum, this means a single PDF that allows text extraction; even better is a plain text or HTML document, or a webpage presenting the policy in full, including the effective date.
  5. When AI models provide incorrect information, they often do so in a confident tone, which can mislead those without subject expertise. This highlights the need for a standard way to quantify the uncertainty in AI responses, particularly for policy Rules as Code generation, where accuracy is essential.
  6. Asking LLMs to write policy code directly leads to poor code. However, a guided approach with attached policy excerpts and very detailed prompts yields better executable code, though it remains poorly-designed.
  7. Code generation can be improved by using an LLM with RAG to generate machine-readable policy rules. This is a viable alternative to manual rule curation. However, a manually-curated template is still necessary.

Potential for Future Experimentation

We encourage ongoing, open experimentation to test the application of LLMs in public benefits eligibility and the development of Rules as Code. 

  • Our methodologies can be repeated for other states, programs, policies, or LLMs. Given that our research captures LLMs’ performance at the time of our experiments, repeating these experiments as new versions are released can identify if and how performance improves with updates. Different prompting techniques, such as active prompting, may also improve LLM performance in Experiments 1 and 2. All experiments can also be applied to additional states, programs, and policies. 
  • Policy experts and software engineers can pair program and evaluate. We conducted these experiments with general policy knowledge and computer science students. We’ve been following Nava Labs’ project to work closely with benefits navigators and their design and engineering teams to develop new tools for benefits access and enrollment. Working with policy experts would be fruitful for capturing policy logic templates more effectively.
  • Explore creating an LLM for a specific program or policy. Creating specific LLMs fine-tuned for a specific program—such as SNAP—and continually updated with RAG as policies change is an important direction, allowing states to have a stronger starting point in this space. Additionally, a program-specific model could have increased capabilities to understand differences between federal and state policy. 
  • Consider different programming languages for outputs. Experiment 4 could be explored using different programming languages to see if some LLMs are more or less successful at generating code. 
  • Compare results against existing systems. Administrators of eligibility and enrollment systems could compare results against the code and information provided by those systems. Additionally, many existing systems also have test cases that could be used to evaluate LLM outputs. 
  • Explore extracting code from legacy systems. Many eligibility and enrollment systems are written in legacy code languages. Administrators could explore using LLMs to convert the legacy code into an updated language using a guided template. 
  • Compare code writing efforts to policy analysis efforts. It is unclear how much effort using LLMs can save when developing Rules as Code infrastructures for different programs. An analysis of this may highlight a shift from programmer time to policy analyst time.

Get in Touch 

We’re eager to hear from you. This work, like any public-facing project, improves as more stakeholders offer their input. We welcome your thoughts, questions, or potential collaboration ideas. Email us at rulesascode@georgetown.edu.

Citation

Cite as: Ariel Kennan, Lisa Singh, Alessandra Garcia Guevara, Mohamed Ahmed, Jason Goodman, “AI-Powered Rules as Code: Experiments with Public Benefits Policy.” Beeck Center for Social Impact + Innovation and Massive Data Institute, Georgetown University, March 24, 2025.

Icons from NounProject and Icons8.

Thank You 

We’d like to acknowledge and thank the following colleagues for their contributions and support of this research:

  • Anjelika Deogirikar Grossman, Associate Director, MDI
  • Ashlee Sellung, Digital Government Hub Coordinator 
  • Elizabeth Bynum Sorrell, Senior Research + Engagement Manager, DBN
  • Jennifer Melot, Center for Security and Emerging Technology (CSET)
  • Lynn Overmann, Executive Director, Beeck Center
  • Olivia Zhao, Student Analyst, DBN 
  • Michael Ribar, Policy2Code Coach, Principal Associate for Social Policy and Economics Research at Westat
  • Mahendran Velauthapillai, Professor and McBride Family Endowed Chair, Georgetown University Department of Computer Science
  • Sabrina Toppa and Ashleigh Fryer, Beeck Center Communications Team
  • Steve Kent, Software Developer, MDI
  • Tina Amper, Community Advisor, Rules As Code Community of Practice, DBN

Appendix 

Experiment materials are available in a public shared folder.

Experiment 1 Materials

  • Spreadsheet: Prompts with scores and comments
  • Spreadsheet: Cumulative scores
  • Rubric 

Experiment 1 Rubric 

Criterion | Question | Scale
Current and state-specific applicability | Is the information from the response current and state-specific? | Score 1-5
Completeness | Is the response thorough and does it cover all elements requested in the prompt? | Score 1-5
Relevance | Is the response focused on the question, without adding irrelevant or unnecessary details? | Score 1-5

The following serves as an overview of the assessment rubric and explains what each score means:

  • Current and state-specific information
    • 5 points: All information provided is accurate, current, and state-specific
    • 4 points: More accurate/current/state-specific than inaccurate/not current/lacking state-specificity 
    • 3 points: About as accurate/current/state-specific as inaccurate/not current/lacking state-specificity
    • 2 points: More inaccurate/not current/lacking state-specificity than accurate/current/state-specific 
    • 1 point: All inaccurate and not current information that lacks state-specificity
       
  • Completeness
    • 5 points: Fully thorough response that covers all the elements that were requested in the prompt, and explains each of them in detail
    • 4 points: Mostly thorough answer that covers most of the elements that were requested in the prompt, but explains each of them in a not very meticulous way
    • 3 points: The answer is not so meticulous but it is not so broad either
    • 2 points: The response is mostly broad and lacks some details (such as specific numbers, definitions, etc.) or information about the item(s) requested in the prompt 
    • 1 point: The response is extremely broad and does not contain any details (such as specific numbers, definitions, etc.) or information about the item(s) requested in the prompt
  • Relevance
    • 5 points: The answer directly addresses the question asked, without including unrelated or unimportant information
    • 4 points: The answer is mostly relevant, but includes a few details that are not essential to the question asked. These details do not obstruct the understanding of the response
    • 3 points: The answer addresses the question as much as it deviates from the focus of the question
    • 2 points: The answer contains mostly irrelevant or tangentially unrelated elements that could confuse or dilute the main information needed to effectively answer the question
    • 1 point: The answer is completely off-topic, with all the content irrelevant to the question

Experiment 2 Materials

  • Spreadsheet: Prompts with scores and comments
  • Spreadsheet: Cumulative scores
  • Rubric

Experiment 2 Rubric 

Criterion | Question | Scale
Accuracy | Is the information from the response accurate? | Score 1-5
Completeness | Is the response thorough and does it cover all elements requested in the prompt? | Score 1-5
Relevance | Is the response focused on the question, without adding irrelevant or unnecessary details? | Score 1-5

The following provides an overview of the assessment rubric and what each score means:

  • Accuracy
    • 5 points: All information provided is accurate
    • 4 points: More accurate than inaccurate
    • 3 points: About as accurate as inaccurate
    • 2 points: More inaccurate than accurate
    • 1 point: Completely inaccurate
  • Completeness
    • 5 points: Fully thorough response that covers all the logic and elements that were requested in the prompt
    • 4 points: Mostly thorough answer that covers most of the logic and elements that were requested in the prompt
    • 3 points: The answer contains about half of the logic/elements requested in the prompt
    • 2 points: The response lacks some important details (such as specific numbers, definitions, etc.) or information to satisfy the prompt 
    • 1 point: The response does not contain any details (such as specific numbers, definitions, etc.) or information about the item(s) requested in the prompt
  • Relevance
    • 5 points: The answer directly addresses the question asked, without including unrelated or unimportant information
    • 4 points: The answer is mostly relevant, but includes a few details that are not essential to the question asked. These details do not obstruct the understanding of the response
    • 3 points: The answer addresses the question as much as it deviates from the focus of the question
    • 2 points: The answer contains mostly irrelevant or tangentially unrelated elements that could confuse or dilute the main information needed to effectively answer the question
    • 1 point: The answer is completely off-topic, with all the content irrelevant to the question

Experiment 3 Materials

  • Spreadsheet: Contains the generated rules from both the plain prompting and RAG approaches, compared against the verified SNAP Policy values in Georgia, Pennsylvania, and Texas.

Experiment 4 Materials 

  • Prompts: Includes the system and user prompts used for code generation. For the iterative approach, there are three sets of prompts for each stage, as outlined in the experiment details.
  • Generated Code: Contains the code results from each approach, with our manual annotations highlighting strengths [STRENGTH] and weaknesses [WEAKNESS].

Footnotes

  1. Ariel Kennan, Lisa Singh, Bianca Dammholz, Keya Sengupta, and Jason Yi. “Exploring Rules Communication: Moving Beyond Static Documents to Standardized Code for U.S. Public Benefits Programs,” June 6, 2023.
  2. Ariel Kennan and Tina Amper. “Cross-Sector Insights From the Rules as Code Community of Practice,” Digital Benefits Network, February 6, 2024.
  3. James Mohun and A. Roberts. “Cracking the code: Rulemaking for humans and machines,” OECD Working Papers on Public Governance, No. 42, OECD Publishing, Paris, 2020.
  4. Hamish Fraser and Tom Barraclough. “Governing Digital Legal Systems: Insights on Artificial Intelligence and Rules as Code,” MIT Computational Law Report, 2024.
  5. Tom Barraclough, Hamish Fraser, and Curtis Barnes. Legislation as Code, March 2021.
  6. Denis Merigoux, Marie Alauzen, and Lilya Slimani. “Rules, Computation and Politics: Scrutinizing Unnoticed Programming Choices in French Housing Benefits,” Journal of Cross-disciplinary Research in Computational Law, 2023, 2 (1), pp. 23.
  7. Justin D. Weisz, Michael Muller, Steven I. Ross, Fernando Martinez, Stephanie Houde, Mayank Agarwal, Kartik Talamadupula, and John T. Richards. “Better Together? An Evaluation of AI-Supported Code Translation.” In 27th International Conference on Intelligent User Interfaces (IUI ’22), March 22–25, 2022, Helsinki, Finland. ACM, New York, NY, USA, 35 pages.
  8. Basem Aly. “How to talk to AIs: Advanced Prompt Engineering.” Emerging Technologies Knowledge Base, December 2023.
  9. Autumn Toney-Wails and Lisa Singh. “Are You Sure About That: Eliciting Natural Language Confidence Indicators from Chatbots.” In Proceedings of the 12th International Conference on Human-Agent Interaction (HAI ’24). Association for Computing Machinery, New York, NY, USA, 278–286.
  10. Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. “Retrieval-Augmented Generation for Large Language Models: A Survey,” March 2024.