AI Agent Studio
A vision and prototype for a platform that helps users develop and monitor AI Agents with large language models
Context
Large language models are a “step-change” technology, meaning that integrating them into the products we use has the potential to revolutionize how we work, learn, travel, and much more. However, users building LLM-powered applications are running into problems around iteration, evaluation, and monitoring that are unique to this emerging technology. Palantir saw an opportunity to help users develop AI-powered products more effectively, and spun up a team to explore the space through rapid prototypes.
Project Structure
I partnered with a group of internal model builders + researchers for four weeks, working on concept development and vision mocks. Then, I pitched that vision to my modeling team, and partnered with four devs to build out a prototype over the course of six weeks.
Disclaimer: I left the company before this product fully shipped. The final product looks and behaves a little differently, as a different team took it to production.
Stage 1: Gathering signal + defining the opportunity space
Agent builder pain points:
Configuring large language models is slow and confusing. There are a lot of knobs to turn that end up impacting the final result in unforeseen ways.
Users are unsure how to measure the quality of non-deterministic models, especially over time as their usage scales. Users are running into a lot of bugs they never even knew to look for.
There is SO much excitement around LLMs, but users are unsure where to start, and some don’t have any experience with modeling.
Examples of LLM-powered applications:
A movie-marathon agent that recommends a list of 3 movies to watch according to your requested theme and time period.
A fraud-detection agent that takes in an insurance claim document, predicts whether the claim is fraudulent, and sorts flagged documents into a queue for human review (sketched in code after this list).
A travel-booking agent that assembles a full itinerary of flights, hotels, and experiences based on a user’s request, and then books everything on their behalf.
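To make the agent pattern concrete, here is a minimal Python sketch of the fraud-detection example; everything in it is illustrative, and `call_llm` is a stand-in for whatever configured model the agent would use rather than a real API:

```python
from dataclasses import dataclass


@dataclass
class Claim:
    claim_id: str
    text: str


def call_llm(prompt: str) -> str:
    """Placeholder for the agent's configured LLM endpoint (hypothetical)."""
    # A real agent would call the configured model with its parameters here.
    return "FRAUD" if "inconsistent" in prompt.lower() else "OK"


def triage_claims(claims: list[Claim]) -> list[Claim]:
    """Predict fraud for each claim and collect flagged ones for human review."""
    review_queue: list[Claim] = []
    for claim in claims:
        verdict = call_llm(
            "Classify this insurance claim as FRAUD or OK.\n\n" + claim.text
        )
        if verdict.strip().upper() == "FRAUD":
            review_queue.append(claim)
    return review_queue


if __name__ == "__main__":
    claims = [
        Claim("c-1", "Dates in the police report are inconsistent with the claim."),
        Claim("c-2", "Standard windshield replacement after hail damage."),
    ]
    flagged = triage_claims(claims)
    print("For human review:", [c.claim_id for c in flagged])  # ['c-1']
```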
Product goal: Help people safely create, effectively test, and transparently monitor AI Agents.
🔥 Agent creation is easy + understandable
Help users improve the quality of their agents with a creation process that emphasizes safe and iterative experimentation.
📊 Robust testing, evaluation and monitoring
Help users maintain stable, production-quality models with robust evaluation tools.
👋 Transparent processes
Increase user confidence in their solutions by providing transparency into production agent activity.
Stage 2: Concept + Architecture exploration
Concepts
Agent: An LLM configured to solve a problem. At minimum, consists of an LLM + model parameters + prompt. Can optionally include tools + guardrails. (Example: A travel chat assistant that helps you book flights + accommodations.) A minimal configuration sketch follows this list.
LLM: Large language model. A model trained for general-purpose understanding and generation of language. (Example: GPT-3)
Prompt Template: A reusable prompt set by the agent builder. It is sent along with user input to shape how the LLM forms its response, but the end user never sees it. (Example: “Answer the following question from a user in the style of an old western cowboy, in two sentences maximum.”)
System prompt: A prompt that determines how/when the Agent orchestrates the execution of the LLM and its various tools. (Example: If an LLM responds with “I don’t know,” send the query again with a max of 3 iterations)
Tool: An additional capability given to the LLM. This could be a set of documents available for searching, another model, or even another agent. (Example: A plug-in that the LLM can use to actually book flights)
Guardrail: A description of or limitation on how an agent can behave. (Example: This agent can call on other agents as tools.)
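The sketch below shows one plausible way these concepts compose into a single agent configuration, using the travel example. The class and field names are assumptions for illustration, not the product’s actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str         # e.g. "flight_booker"
    description: str  # tells the LLM what the tool does and when to use it


@dataclass
class Guardrail:
    description: str  # a limitation on, or permitted behavior of, the agent


@dataclass
class Agent:
    llm: str                  # e.g. "gpt-3"
    model_parameters: dict    # temperature, max tokens, ...
    prompt_template: str      # sent with user input; hidden from the end user
    system_prompt: str        # orchestration rules (retries, tool use, ...)
    tools: list[Tool] = field(default_factory=list)
    guardrails: list[Guardrail] = field(default_factory=list)


travel_agent = Agent(
    llm="gpt-3",
    model_parameters={"temperature": 0.2, "max_tokens": 512},
    prompt_template="Answer the user's travel question in two sentences maximum.",
    system_prompt="If the LLM responds with 'I don't know', retry up to 3 times.",
    tools=[Tool("flight_booker", "Books flights through an external API")],
    guardrails=[Guardrail("This agent can call on other agents as tools")],
)
```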
Experimenting with app architecture
Stage 3: Assembling a vision
Agent activity: Runs
Click into an agent to see a detailed log of all its executions. Dig into an individual run to see a timeline of steps the agent took, as well as a log of specific thoughts.
Transparency into agent processes boosts user confidence and is critical for debugging, since it reveals which step is causing a problem.
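As a rough illustration of what a run record might contain (the field names are assumptions, not the product’s data model), a run can be thought of as a timeline of typed steps plus an optional error, which is what lets users pinpoint the failing step:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Step:
    started_at: datetime
    kind: str    # "thought", "llm_call", "tool_call", ...
    detail: str  # the prompt sent, tool arguments, or the agent's reasoning text


@dataclass
class Run:
    run_id: str
    agent_version: str
    steps: list[Step] = field(default_factory=list)
    error: str | None = None


def print_timeline(run: Run) -> None:
    """Render the step timeline the way the run detail view describes it."""
    for step in run.steps:
        print(f"{step.started_at:%H:%M:%S}  [{step.kind}] {step.detail}")
    if run.error:
        print("run failed:", run.error)


run = Run(
    run_id="r-102",
    agent_version="v4",
    steps=[
        Step(datetime(2023, 9, 1, 9, 15, 2), "thought", "User wants flights and a hotel."),
        Step(datetime(2023, 9, 1, 9, 15, 4), "tool_call", "flight_booker(SFO -> JFK)"),
    ],
)
print_timeline(run)
```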
Agent activity: Monitors
Flip to the “monitors” tab to see an agent’s performance against key metrics over time, like errors or latency. Users can also see how changes in the agent configuration impacted those monitors to ensure quality is upheld as the agent gets updated.
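A minimal sketch of the kind of aggregation such a monitor might perform, grouping runs by agent version and computing error rate and mean latency (the run fields shown are assumptions for illustration):

```python
from collections import defaultdict
from statistics import mean

# Each run record carries the fields the monitors aggregate (illustrative).
runs = [
    {"agent_version": "v3", "latency_ms": 820, "errored": False},
    {"agent_version": "v3", "latency_ms": 1430, "errored": True},
    {"agent_version": "v4", "latency_ms": 610, "errored": False},
]

by_version: dict[str, list[dict]] = defaultdict(list)
for run in runs:
    by_version[run["agent_version"]].append(run)

for version, group in sorted(by_version.items()):
    error_rate = sum(r["errored"] for r in group) / len(group)
    avg_latency = mean(r["latency_ms"] for r in group)
    print(f"{version}: error rate {error_rate:.0%}, mean latency {avg_latency:.0f} ms")
```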
Other edit pane tabs
Tools: Users can add tools from the central tool library to determine what capabilities/resources the agent has access to.
Guardrails: Users can choose guardrails from a central guardrail registry to determine specific limitations and behaviors. These guardrails can be authored independently by any team, such as security.
Evaluation
Users construct an evaluation strategy by selecting a problem type (like binary decision-making), the metric library that visualizes performance against that problem, and the benchmark datasets used to test the agent.
The right side of the page shows those metrics, as well as the human-readable inputs and outputs.
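As a hedged sketch of what such an evaluation strategy might look like for a binary decision problem (the class names and the accuracy metric are assumptions, not the product’s actual metric library):

```python
from dataclasses import dataclass


@dataclass
class BenchmarkExample:
    input_text: str
    expected_label: str  # "FRAUD" or "OK" for a binary decision problem


def accuracy(predictions: list[str], benchmark: list[BenchmarkExample]) -> float:
    """One metric from a hypothetical metric library for binary decisions."""
    correct = sum(p == ex.expected_label for p, ex in zip(predictions, benchmark))
    return correct / len(benchmark)


benchmark = [
    BenchmarkExample("Dates conflict with the police report.", "FRAUD"),
    BenchmarkExample("Routine windshield replacement after hail.", "OK"),
]
predictions = ["FRAUD", "OK"]  # in practice, produced by running the agent on each input
print(f"accuracy: {accuracy(predictions, benchmark):.0%}")
```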
Tool Library
An example of one of the resource libraries. Users can create custom tools or import them from other sources: for example, a “Google search” tool from Google, or an “Email drafter” tool from Microsoft.
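Below is a rough sketch of what a custom tool definition in such a library might look like; the `ToolDefinition` shape and the placeholder `web_search` function are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolDefinition:
    name: str
    description: str           # tells the LLM what the tool does and when to use it
    run: Callable[[str], str]  # the capability itself


def web_search(query: str) -> str:
    # Placeholder: a real "Google search" tool would call an external search API.
    return f"(search results for: {query})"


tool_library = {
    "web_search": ToolDefinition("web_search", "Search the web for a query", web_search),
}

print(tool_library["web_search"].run("best time to visit Lisbon"))
```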
Feedback
Users LOVED the sandbox view
This was the first time they’d seen all the configuration options together in one place, in an easily editable UI view. Versioning was called out as a critical feature here.
Users wanted more evaluation options
Sending individual text queries doesn’t scale well. Users were interested in more systematic approaches to testing.
Formal evaluation is less important than quick testing
Users were interested in the formal evaluation capabilities, but it was a lower priority than the core functionality of iteration + quick testing.
Stage 4: Developing the prototype
I pitched the vision to fellow designers, teammates, and coworkers. The modeling team assembled a small crew of four engineers and gave us six weeks to prototype the platform and see whether it was worth further investment. We built the agent configuration sandbox and some of the testing functionality before having to pause the project. The following are screen recordings of the app:
Agent configuration page: Editing an agent and testing inputs
We successfully implemented version control, branching, the edit pane, and the testing ground, staying fairly close to the vision mocks. We added the execution log to this page since users found it useful to view while testing.
Testing workflows: Test cases + compare page
The testing workflows evolved as we prototyped. Early users wanted the ability to save inputs and validate outputs for quicker testing, so we added a “Test Cases” feature to the configuration page. Because LLMs are non-deterministic, users can choose to validate on exact responses, or on response types or structures instead. Customers loved not having to re-type their test inputs over and over, and the validation saved them from manually reviewing results (see example of validation results below).
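A rough sketch of that validation logic, assuming three illustrative modes (exact match, JSON-structure check, non-empty text); the mode names are assumptions rather than the prototype’s actual options:

```python
import json


def validate(output: str, mode: str, expected: str | None = None) -> bool:
    """Check an agent response against a test case's validation rule."""
    if mode == "exact":            # strict: the response must match exactly
        return output.strip() == (expected or "").strip()
    if mode == "json_object":      # structural: any valid JSON object passes
        try:
            return isinstance(json.loads(output), dict)
        except json.JSONDecodeError:
            return False
    if mode == "non_empty_text":   # type-level: any non-empty string passes
        return bool(output.strip())
    raise ValueError(f"unknown validation mode: {mode}")


print(validate('{"movies": ["Alien", "Aliens", "Alien 3"]}', "json_object"))  # True
print(validate("I don't know", "exact", expected="I don't know"))             # True
```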
Users also wanted the ability to tweak one component of the agent config to see how the agent performed, like trying different models with the same prompt. On this page, users can create multiple versions of the agent, change the configuration, and then run a test case to see how each model configuration performs side-by-side.
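And a minimal sketch of the comparison flow: the same saved test input run against several hypothetical configurations so the outputs can be reviewed side by side, with `call_llm` again standing in for the configured model endpoint:

```python
def call_llm(model: str, temperature: float, prompt: str) -> str:
    """Placeholder for the configured model endpoint (hypothetical)."""
    return f"[{model} @ temperature={temperature}] response to: {prompt}"


configurations = [
    {"version": "v4-gpt-3", "model": "gpt-3", "temperature": 0.2},
    {"version": "v4-small", "model": "small-llm", "temperature": 0.2},
]

test_input = "Plan a three-movie sci-fi marathon that fits in six hours."
for config in configurations:
    output = call_llm(config["model"], config["temperature"], test_input)
    print(f"{config['version']}: {output}")
```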
Reflection
Designing a product with emerging tech was challenging, but was actually a great way to learn and quickly synthesize concepts. The exploration mocks were an incredibly helpful tool for facilitating conversations, especially for making sure we were talking about the same concepts while the actual language was still in flux.
Next steps
This project is paused for now while the company invests in a separate, pipeline-builder-style app. These vision mocks are being used to apply for several patents, and ideas proposed here are being adapted for other apps around the company. The prototype is still available internally at Palantir, and we believe the need for it will emerge as soon as the company reaches a certain scale of LLM-powered workflows to manage.