
A Sandbox for GenAI in the Commonwealth of Pennsylvania: A Digital Service Network Spotlight

DSN Spotlights are short-form project profiles that feature exciting work happening across our network of digital government practitioners. Spotlights celebrate our members’ stories, lift up actionable takeaways for other practitioners, and put the resources + examples we host in the Digital Government Hub in context. 

Author: Tuqa Alibadi
Published Date: Jun 9, 2025
Last Updated: Jun 9, 2025

Background

In February 2024, the Commonwealth of Pennsylvania (PA) launched a pilot initiative to explore the responsible use of generative artificial intelligence (genAI) in government. The pilot gave a select number of employees access to a large language model (LLM), enabling the Commonwealth to explore the practical application of LLMs in the day-to-day workflows of public employees. The aim of the pilot was to learn directly from employees how, when, and whether tools like LLMs might provide value in public-sector work. 

To learn more, the Digital Service Network spoke with Harrison MacRae, director of emerging technologies at the PA Office of Administration; Amaya Capellán, chief information officer of the Commonwealth of PA; and Terence Milstead, head of user experience research at the Commonwealth’s Office of Digital Experience (CODE PA).

What is an AI sandbox?

In software development, a “sandbox” is a controlled space for testing and experimentation, ensuring that issues, bugs, or vulnerabilities are contained within the experimental environment and don’t affect any live systems. 

Some state governments across the United States are setting up sandboxes to explore the use of genAI in public-sector operations and service delivery while managing risk, ensuring oversight, and centering user needs. These sandboxes provide time-bound, controlled settings in which teams can test functionality, assess value, and gather feedback without committing to full-scale implementation. For digital government delivery in particular, AI sandboxes such as pilots offer a way to stay responsive to rapid technological change while upholding the public sector’s unique legal and ethical responsibilities.

In PA’s case, setting up a sandbox for genAI meant giving a limited number of employees access to an enterprise-grade LLM and carefully observing how they used those tools in their daily work. The sandbox approach enabled researchers and policymakers to evaluate usability, utility, and trust—while also identifying training needs, use case patterns, and adoption barriers.

Inside Pennsylvania’s genAI sandbox

By late 2023, as the day-to-day use of LLMs was rapidly taking off across sectors, public-sector agencies lacked clear guidance on how these tools could be adopted responsibly. “There was virtually no user-informed information on how to approach these tools—particularly for government employees,” MacRae said.

In response, PA developed a pilot program to test, evaluate, and understand the real-world value and risks of using genAI in line with the duties and responsibilities of government. “We leveraged a procurement vehicle that gave us the ability to move forward quickly with an LLM tool for a very limited duration and scope,” MacRae said. 

Building the pilot team

The Office of Administration led the pilot initiative, with CODE PA as a partner and collaborator. CODE PA’s user experience (UX) research team brought expertise in human-centered design, collecting feedback, and structuring engagement strategies with participants. The Office of Administration also managed the relationship with the LLM vendor, keeping the vendor engaged and organizing training sessions on topics like prompt writing and building custom LLMs. Carnegie Mellon University also supported the initiative, offering strategic guidance and research input based on its work in AI and public interest technology.

Pilot design and evaluation

The pilot engaged 14 state agencies in total, with a goal to include participants from a variety of roles and backgrounds. “We wanted to understand opportunities and challenges across a wide range of employees—folks brand new to the Commonwealth, folks who had been here 30 years, people in very technical roles and others in non-technical ones,” MacRae said.

PA’s LLM pilot was a year-long study, organized into two phases. The first phase, which lasted about six months, involved participants from a single agency—primarily IT and HR staff in the Office of Administration. In the second phase, more agencies joined, including policy teams, legal counsel, and administrative staff. The team running the pilot used ethnographic-style observational techniques to learn about how participants were using the LLM during the course of the pilot. Milstead emphasized the importance of this observational approach: “We didn’t want to lead them. We wanted to see what use cases would emerge organically when people were simply given a license.”

“We wanted to see what use cases would emerge organically when people were simply given a license.”

Terence Milstead, Head of User Experience and Research, CODE PA

“We also wanted to observe the learning process,” Milstead added. To that end, beyond observation, the team used a mix of qualitative methodologies to collect insights: diary studies, bi-weekly surveys, one-on-one interviews, and focus groups. These methods helped surface emerging patterns in users’ understanding of the LLM and in how their use of the tool evolved over time.

Engagement strategy: open doors and collaborative learning

PA’s LLM pilot was informed by a guiding principle of human-centered engagement. This meant that the pilot team actively supported participants throughout the experience. “What helped us succeed was the constant communication and an open-door policy for questions, feedback, and learning,” Capellán said. This approach to participant engagement included a number of available touchpoints for feedback and learning: office hours, Microsoft Teams chats, internal websites, and direct email communication. “We welcomed any and all incoming questions about AI use.”

Shortly after the pilot began, it became clear that participants didn’t just feel they were learning how to use one specific LLM tool—they felt they were engaging with the broader AI “moment.” The pilot team acknowledged this experience and responded by hosting a monthly seminar series with sessions covering topics like federal AI policy, multimodal models, and the implications of genAI for climate, energy use, and infrastructure. Many of the topics for these sessions were suggested by pilot participants themselves.

“What helped us succeed was constant communication and an open-door policy for questions, feedback, and learning.”

Amaya Capellán, Chief Information Officer, Commonwealth of Pennsylvania

Pilot results

Through the pilot, three dominant use cases emerged. The team developed user personas based on each use case:

  1. The Innovation Engine: These employees used the LLM to brainstorm, ideate, and generate new, creative solutions. One agency team, for example, used the tool to reverse-engineer business requirements from decades-old legacy code.
  2. The Bureaucracy Hacker: These employees used the LLM to cut through red tape—summarizing dense policies, translating legal text, and streamlining forms, among other things. One participant explored reducing 94 IT vendor procurement policies down to 30 by identifying overlapping language and redundancies.
  3. The Strategic Communicator: These employees leaned on the LLM to refine their internal and external communications. One HR team rebuilt its entire job description workflow, automating an estimated 80 percent of the process with standardized language. Milstead supported this team’s efforts by helping build a custom LLM dubbed the “Strategic Communication Advisor.”

The team also observed a set of key challenges related to uptake of LLM use in government settings. For example, the team found that adoption wasn’t necessarily tied to job function or technical background. “What we actually saw across the board was that adoption had a lot less to do with the specific type of work someone did and more to do with the extent that participants dealt with core adoption barriers,” MacRae said.

The barriers the pilot team identified were:

  1. Habit formation: employees reported they often forgot to try the LLM tool because it wasn’t part of their routine;
  2. Workload: some employees reported they didn’t have the time to explore a new tool;
  3. Prompt fluency: employees reported struggling to phrase prompts effectively; and
  4. Trust and privacy: employees showed healthy skepticism and wanted clarity on data usage before using the LLM.

Other challenges related to operationalizing LLM use in government included keeping training and guidance up to date amid rapid technological change. “New features are introduced almost weekly,” Milstead emphasized.

Ultimately, 85 percent of participants in the pilot reported a positive experience. Many said that using an LLM improved their work product and saved them time.

What’s next for LLM use in Pennsylvania?

Using the results from the program, the pilot team is now exploring the development of a simple visual guide that walks employees through practical uses of LLMs based on their role or the task at hand. “We envision a ‘recipe card’ that would make it really clear in a visual way how LLMs can be applied as part of their work,” Milstead said. 

The pilot team also hopes to double down in the future on peer learning opportunities. “Some folks who participated in the pilot felt like they were on an island,” MacRae said. This isolation limited their opportunity for collaborative problem-solving using the LLM. The team is considering how they can prioritize group-based deployments in the future to enable cross-pollination of ideas and support.

Beyond LLMs, the team felt that the pilot set a new precedent for testing the use of new technologies in government operations writ large. “Beyond genAI, our approach to this pilot serves as a more general playbook for responsibly testing and learning about new tools in a structured way,” Capellán said.

Lessons Learned

Real-world sandbox environments help assess the potential value of new technologies in government settings. One of the biggest questions going into the pilot was whether employees would actually find LLMs useful. The answer was yes—across multiple agencies and roles, participants reported that the tool helped them brainstorm, summarize dense materials, improve communications, and save time. This validation came from putting theory into practice: implementing the tool in a responsible, time-bound setting enabled a deeper, more rigorous evaluation of the value these newly available tools can bring to government work.

Understanding adoption barriers is essential to thoughtful, effective implementation of new technologies. The pilot highlighted that successful adoption of LLMs was less about role or technical background and more about behavioral factors like habit, time, and trust. Without this information, the pilot team might not have known what support to develop for employees ahead of a broader rollout. Using human-centered strategies to pay close attention to how employees actually used the tool enabled these key insights about uptake.

Different employees need different tools—and that’s okay. Not everyone benefited from or was interested in using the LLM. Some participants preferred features integrated into familiar tools, while others wanted simpler or more specialized applications tailored to their workflow. The pilot surfaced this insight, and the team can now plan a broader rollout knowing that LLMs can’t and won’t serve every employee or use case.

To see how this work was put into practice, explore the related assets in the Digital Government Hub.