Notes on seeking wisdom and crafting software

Three papers on Software Engineering agents


I read a set of papers related to Software Engineering AI agents in August 2024. This note summarizes the key learnings.

Papers in this note

  1. Wang, Xingyao, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, et al. “OpenDevin: An Open Platform for AI Software Developers as Generalist Agents.” arXiv, July 23, 2024. Paper.
  2. Wu, Qingyun, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, et al. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.” arXiv, October 3, 2023. Paper.
  3. Xia, Chunqiu Steven, Yinlin Deng, Soren Dunn, and Lingming Zhang. “Agentless: Demystifying LLM-Based Software Engineering Agents.” arXiv, July 1, 2024. Paper.

Goal

Our goal is to learn about Software Engineering (SWE) agents in this edition.

  • Do we need SWE agents?
  • What are the challenges in building autonomous agents?
  • What are the building blocks?
  • How are they evaluated? What metrics do we care about?

We’ll take a holistic view across these papers instead of going through each one by one, focusing on the mental model and the bigger picture 🔆

Motivation

SWE Agents attempt to mimic a developer. A typical workflow will do the following:

  • Create and modify code.
  • Use tools to gather information.
    • Task planning.
    • Diagnose and fix errors.
  • Reach out to experts to brainstorm and identify solutions to deep problems in a specific niche.
  • Commit changes safely, avoid side effects and keep human in the loop.

Different papers propose interesting approaches to achieve this workflow.

  • OpenDevin breaks the developer loop into a state model. The agent observes an event stream and decides the next action; each action operates on an environment using tools and records an observation back into the event stream. Agents can optionally delegate tasks to other agents.
  • Autogen, on the other hand, uses conversable agents as its core building block. Multiple agents work together to achieve the goals of an app.
  • Agentless questions the basic premise: do we really need complex agents for SWE tasks? To our surprise, Agentless is currently the SOTA on the SWE-bench evaluation suite.

The approaches above range from a SWE domain-specific solution to a generic agent framework that can be adapted to any domain. Each comes with trade-offs in complexity and efficiency that we’ll examine in this note.

Challenges

  • Replicating the human developer and current practices like inner loops.
  • Building blocks and patterns for extensible agents and their tools.
  • Delegation and multi-agent communication protocols, consensus, moderation etc.
  • Trade-offs in finding a repeatable core loop. How does this change across various domains? How do we scale it?
  • Avoiding error amplification. Can agents reflect and autocorrect? Can we control the decision planning? What if the agent enters a loop following an incorrect path?

Approaches

Generalist Software Developers (OpenDevin)

OpenDevin closely resembles today’s generalist software developer: plan, code, find information, communicate. It uses the CodeAct framework to enumerate tools for the LLM, decide which ones to invoke, and process the next action.

The agent’s world view consists of the following:

  1. A sandboxed environment with various capabilities like a Unix shell, a browser etc. Think of this as a workspace.
  2. An agent can operate on the environment using an interface, e.g., run synthesized programs or browse websites. Think of these as tools.
  3. Agents “act” on the environment via Actions, and “observe” the output. These actions and observations are captured in an event stream.

The above world view can be extended with multi-agent delegation, e.g., one agent can delegate a task to another; the second agent shares the same world view.

Fig 1: Key abstractions in OpenDevin: Event Stream (state), Agent (interface) and the Runtime (environment). Source: OpenDevin paper.

Core task loop

  • The agent perceives the state of the environment. The state is a chronological collection of previous actions and observations.
  • Is there a pending task? If yes, trigger an action; otherwise, create the final response.
  • The action can be static (e.g., browse the web), a dynamic Python program, or a shell script. Actions are executed by the runtime on the environment.
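
To make the loop concrete, here is a minimal Python sketch of an event-stream loop. The class and function names are illustrative, not OpenDevin’s actual API, and the LLM call is elided.

# Minimal sketch of an OpenDevin-style core loop (illustrative names only).
from dataclasses import dataclass

@dataclass
class Event:
    kind: str      # "action" or "observation"
    content: str

def run(agent, runtime, task: str, max_steps: int = 30) -> list[Event]:
    # The event stream doubles as the agent's state.
    stream = [Event("observation", task)]
    for _ in range(max_steps):
        action = agent.step(stream)       # LLM decides next action from history
        if action is None:                # no pending task: produce final answer
            break
        stream.append(action)
        result = runtime.execute(action)  # shell, browser, or Jupyter tool
        stream.append(Event("observation", result))
    return stream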

Extensibility

  • Define custom tools as Python functions (see CodeAct, and the sketch after this list).
    • The AgentSkills library provides a ready-to-use set of tools, automatically imported into the Jupyter notebook.
    • E.g., edit_file, parse_image, parse_pdf etc.
  • Create new agents, overriding the state -> action logic.
    • The default CodeActAgent is a generalist agent; BrowsingAgent handles browser usage.
    • The GPTSwarm agent allows multi-agent communication via a graph of operations with edges indicating collaborations.
    • Community contributions are available in Agent Hub.
  • Create new actions to support additional runtimes or delegation.
    • AgentDelegateAction can be used by CodeActAgent to invoke BrowsingAgent.
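
To give a flavor of tool extensibility, here is a hypothetical AgentSkills-style tool: a plain Python function with a docstring that the agent can call from its sandboxed notebook. search_todos is my own example, not a function shipped by the library.

# Hypothetical AgentSkills-style tool (search_todos is not in the real library).
import pathlib

def search_todos(path: str) -> list[str]:
    """Return all TODO lines under `path` so the agent can plan follow-ups."""
    hits = []
    for f in pathlib.Path(path).rglob("*.py"):
        for n, line in enumerate(f.read_text(errors="ignore").splitlines(), 1):
            if "TODO" in line:
                hits.append(f"{f}:{n}: {line.strip()}")
    return hits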

Collaboration

  • Both human-AI and multi-agent collaboration are supported.

OpenDevin includes a set of benchmarks evaluating the general assistant, web and software agent capabilities.

Customizable and Conversable Agents (Autogen)

Autogen proposes a multi-agent approach to building complex LLM apps. The strategy is to design multiple agents and conversation patterns that enable them to collaborate.

Motivation

  • Conversations enable feedback and cooperation.
  • Specialized agents are modular.
  • Planning and task decomposition allow divergent thinking and better reasoning.
Fig 2: Autogen building blocks. Source: Autogen paper.

The agent’s world view consists of the following:

  • Every message in the chat has a Sender and a Receiver agent.
  • Agent can send or receive messages.
    • Upon receipt, a reply function is invoked, selected by matching the Sender or the previous message.
    • E.g., if the previous message mentions a tool call, the tool is invoked; if it provides a code block, a code executor is invoked.
    • Reply functions are registered by the Agent and executed in LIFO order, e.g., check the termination condition, then tool calls, then the LLM reply.
  • The Assistant is always the proxy for an LLM; UserProxyAgent is the equivalent proxy for the human user. The latter can execute code and provide feedback to the Assistant, and the Assistant drives the next step.

Essentially, to build a self-sufficient agent with access to tools and an environment to run them, you provide a pair <Assistant, UserProxyAgent>. See conversation pattern 3 below.

With Autogen, the agent graph is the essential mental model to build. Here are a few conversation patterns; note how the agents are constructed.

# Sample conversation patterns
# Pattern 1: two agent chat without human input
sender (UserProxyAgent): create a travel plan for 3 days to Bhutan
receiver (Assistant): // text blob from LLM
sender: // doesn't respond with anything
receiver: Is there anything else I can help you with?

# Pattern 1a: two agent, no human input, code execution
#   See https://github.com/microsoft/autogen/blob/main/notebook/agentchat_auto_feedback_from_code_execution.ipynb
sender (UserProxyAgent): plot a sine curve
receiver (Assistant): Sure, here's the python code...
sender: // executes the code and returns 0
receiver: The code has successfully executed. TERMINATE

# Pattern 2: two agent chat with sender providing code execution
#   See the MathChat example: https://github.com/autogen-ai/autogen/blob/main/notebook/agentchat_MathChat.ipynb
sender (MathProxyAgent): solve problem XYZ with two tools wolfram and python
receiver (Assistant): wolfram(a, b, c)
sender: // calls wolfram and provides output
receiver: python(x, y)
sender: // calls python and provides output
receiver: // LLM call decides we have an answer. TERMINATE

# Pattern 3: two agent with delegation to an expert
#   See the Planner example: https://github.com/microsoft/autogen/blob/main/notebook/agentchat_planning.ipynb
#   Agents: assistant, user, planner, planner_user
#     assistant is proxy to LLM
#     user provides code execution and an ask_planner capability. The latter
#     forwards the message to the planner agent (as planner_user)
#     planner is a proxy to LLM with custom persona (system message)
sender (user): find and fix first good issue in xyz repo
receiver (assistant): // generates python code for fetching issue from github
sender (user): // runs py code and returns issues list
receiver (assistant): // generates tool call to ask_planner
sender (user): // initiate_chat as planner_user to get plan
  sender (planner_user): // get plan for issue 1
  receiver (planner): // create step by step plan
  // send the ask_planner response to assistant
receiver (assistant): // responds with the plan
sender (user): // empty
receiver (assistant): TERMINATE
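
For reference, Pattern 1a wires up in a few lines with Autogen’s Python API. Treat this as a sketch: the llm_config values are placeholders you must supply.

# Pattern 1a in code (sketch; supply your own llm_config).
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    "assistant",
    llm_config={"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]},
)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",   # no human feedback in the loop
    code_execution_config={"work_dir": "coding", "use_docker": False},
    is_termination_msg=lambda m: "TERMINATE" in (m.get("content") or ""),
)
user_proxy.initiate_chat(assistant, message="plot a sine curve")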

Core task loop

  • Variants of conversation:
    • two-agent chat: the conversation sequence is always round-robin.
    • group chat can provide a moderator and various heuristics for choosing a speaker.
  • Within an agent
    • Evaluate the received message and find the reply function to answer it.
    • Run the reply function.
    • Send the response to the sender.
  • Send “TERMINATE” (for termination condition), or “UPDATE CONTEXT” (for RAG scenarios).
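
A custom reply function makes the LIFO dispatch tangible. This sketch follows the register_reply signature from the Autogen docs; verify it against the version you use.

# Sketch: register a custom reply function on a ConversableAgent.
from autogen import ConversableAgent

def log_and_pass(recipient, messages=None, sender=None, config=None):
    # Returning (False, None) means "not final"; dispatch falls through to
    # the next registered reply function (tool call, code execution, LLM).
    print(f"{sender.name} -> {recipient.name}: {messages[-1]['content']}")
    return False, None

agent = ConversableAgent("worker", llm_config=False)
agent.register_reply([ConversableAgent, None], log_and_pass)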

Extensibility [1]

  • Create new agents based on ConversableAgent, e.g., AssistantAgent or UserProxyAgent.
    • Implement the unified conversation interface (send, receive, generate_reply) and provide a persona with a default system prompt.
    • Act as a human proxy, use an LLM, or execute a tool via code/function execution.
    • Common features: result cache, error handling, templating etc.
  • Create new function calls wrapped as nested agents (see the planner example above).
  • Create new code execution capabilities in the agent.

Collaboration

  • Communication is enabled by send, receive and generate_reply primitives.
  • Group chat provides custom speaker selection heuristics.
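
Here is a group chat sketch using Autogen’s API; the parameter names follow the v0.2 docs and the llm_config is a placeholder.

# Sketch: three agents moderated by a GroupChatManager.
from autogen import AssistantAgent, GroupChat, GroupChatManager, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]}

coder = AssistantAgent("coder", llm_config=llm_config)
critic = AssistantAgent("critic", llm_config=llm_config)
user = UserProxyAgent("user", human_input_mode="TERMINATE")

chat = GroupChat(
    agents=[user, coder, critic],
    messages=[],
    max_round=10,
    speaker_selection_method="auto",  # the manager's LLM picks the next speaker
)
manager = GroupChatManager(groupchat=chat, llm_config=llm_config)
user.initiate_chat(manager, message="review and fix the failing test")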

Two Phase Software Developer (Agentless)

Motivation: do we really need complex agents for software engineering tasks? The challenges with agent-based approaches:

  • Complex tool designs.
  • Delegation is confusing.
  • Agents cannot always self-reflect and might never recover from an incorrect step.

Proposal

  • No agents, no delegation of decisions.
  • Two-stage approach: localize the error, then repair by generating patches, testing, and iterating.
Fig 3: Two-stage approach in Agentless. Source: Agentless paper.

Stage 1: Localize an error

  • Find the files using a repo structure format along with the issue description.
  • Find the classes by creating a skeleton format of classes, function signatures, and comments.
  • Find the lines by direct code snippet references.
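
Here is a sketch of how the repo-structure context might be built; this is my approximation, not the paper’s exact format.

# Sketch: compact repo tree for the file-localization prompt.
import os

def repo_structure(root: str, exts=(".py",)) -> str:
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        depth = dirpath[len(root):].count(os.sep)
        lines.append("  " * depth + os.path.basename(dirpath) + "/")
        for name in sorted(filenames):
            if name.endswith(exts):   # keep source files only, to fit the prompt
                lines.append("  " * (depth + 1) + name)
    return "\n".join(lines)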

Stage 2: Repair error

  • Context for the LLM: X lines of prefix and suffix around each code snippet; concatenate snippets with "...".
  • Generate patches with less code, cost, and hallucination:
    • Search: find the original code.
    • Replace: generate replacement code.
    • Create a diff patch to apply.
  • Filter patches by running regression tests and syntax checks.
  • Re-rank the candidate solutions via majority voting (see the sketch after this list).
    • Normalize the patch to ignore whitespace and line-ending diffs.
    • Parse the old and the patched code into ASTs.
    • Serialize the ASTs back to source code without docstrings.
    • Text diffs between the old and new code are judged via majority voting.
  • Submit patch with the highest votes.
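
The normalization and voting steps can be approximated in a few lines of Python. A sketch under the assumption that each candidate carries its fully patched source; it is not the paper’s exact pipeline. Requires Python 3.9+ for ast.unparse.

# Sketch: majority voting over normalized candidate patches.
import ast
from collections import Counter

def normalize(source: str) -> str:
    # Parse to an AST, drop docstrings, and unparse so that whitespace,
    # line endings, and comments no longer distinguish candidates.
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            if (node.body and isinstance(node.body[0], ast.Expr)
                    and isinstance(node.body[0].value, ast.Constant)
                    and isinstance(node.body[0].value.value, str)):
                node.body = node.body[1:] or [ast.Pass()]
    return ast.unparse(tree)

def pick_patch(candidates: list[tuple[str, str]]) -> str:
    # candidates: (patched_source, diff_patch) pairs that survived filtering.
    votes = Counter(normalize(src) for src, _ in candidates)
    winner = votes.most_common(1)[0][0]
    for src, patch in candidates:
        if normalize(src) == winner:
            return patch  # submit the patch with the most votes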

Agentless tops the benchmarks and is the current SOTA (as of July 2024). See the paper for specifics of evaluation.

Conclusion

These three papers provide an interesting set of patterns for software engineering agents. We can design generalist agents and tools as in OpenDevin, compose a collection of agents with Autogen, or choose the simple Agentless approach.

The Agentless approach is bare-metal and directly provides the base set of capabilities required for finding and fixing issues. In my opinion, this is the core requirement of any SWE agent: it must browse code, localize the error, and create a patch.

Autogen and OpenDevin provide agents with access to a set of tools; agents pick tools (Python, browser, etc.) to propose a solution. I had high hopes of finding solutions to the challenges of human-agent communication, consensus, etc., which I didn’t yet see in these papers.

For my next project, I’d probably start with the Agentless approach. I believe the tools OpenDevin provides, like headless browsing or editing capabilities, will be unbundled; for example, see the hide.sh project for headless IDEs. Autogen is an interesting paradigm, but the core of the logic seems easy to roll on one’s own. Will it be an abstraction that stands the test of time?

There’s strong interest in multi-agent paradigms, and I am sure we’re just seeing the first few attempts. I’ll keep an eye out to learn more.

Footnotes

  1. We recommend looking at the Autogen source code for more clarity on the various abstractions. See conversable agent.