
How We Collect SWE-Bench for Other Languages

One consistent quality standard, no matter what you code in
In software development, where code quality and efficiency are paramount, there is a growing need for reliable tools to evaluate and compare AI agents capable of independently solving tasks in projects of varying complexity. One such tool gaining popularity is SWE-bench, a benchmark designed to measure the performance and efficiency of AI agents on Python-related tasks. It is based on real-world problems that developers encounter and provides a standardized metric for assessing an AI agent’s ability to automate their resolution.

However, despite its value, the initial concept of SWE-bench being limited exclusively to Python restricted its applicability across the broader software development landscape. Modern projects often involve multiple programming languages, each serving different purposes, such as web development, mobile applications, or system programming. This has led to the need to expand SWE-bench to support a wider range of programming languages, enabling a more comprehensive evaluation of AI capabilities in automating developer tasks across various technology stacks.
My name is Kirill Stolbov, and I am part of the expert team at Fermatix AI. Our company recognized this need and took active steps to expand SWE-bench to other popular programming languages.
What Is SWE-bench?
SWE-bench, or Software Engineering Benchmark, is a comprehensive benchmark designed to evaluate AI agents on real-world problems that arise during software development. It is built on a carefully curated set of tasks sourced from popular open-source repositories on GitHub. Each task in SWE-bench typically consists of an issue description (a bug report, a feature request, etc.) and a corresponding pull request that contains the changes intended to resolve the issue, along with a set of tests to verify the correctness of the proposed solution. The evaluated AI agent must analyze the issue description and the codebase, then generate a solution that resolves the problem while passing all related tests.
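
To make this concrete, a single data point can be pictured roughly as the record below. This is an illustrative sketch: the field names follow the general shape of SWE-bench instances (problem statement, golden patch, test patch, and the two test lists) rather than reproducing any exact published schema, and all values are hypothetical.

    # Illustrative sketch of a single SWE-bench-style data point.
    # Field names approximate the general structure of an instance;
    # the exact schema of any given dataset release may differ.
    swe_bench_instance = {
        "repo": "example-org/example-project",   # hypothetical repository
        "instance_id": "example-project-1234",   # hypothetical identifier
        "base_commit": "abc123",                 # commit the agent starts from
        "problem_statement": "Parsing fails when the config file is empty...",
        "patch": "diff --git a/src/parser.py ...",               # developer's fix (golden patch)
        "test_patch": "diff --git a/tests/test_parser.py ...",   # test-only changes
        "FAIL_TO_PASS": ["tests/test_parser.py::test_empty_config"],
        "PASS_TO_PASS": ["tests/test_parser.py::test_basic_config"],
    }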

Unlike traditional benchmarks, which often focus on execution speed or resource consumption, SWE-bench aims to assess deeper aspects of software development, such as understanding existing code, generating new code, identifying and fixing bugs, and performing refactoring. The benchmark was created through a multi-step process that involves selecting repositories, filtering pull requests based on specific attributes (e.g., linkage to an issue and the presence of test changes), and verifying whether the applied changes successfully pass the tests. As a result, SWE-bench provides a standardized tool for evaluating and comparing different approaches to automating developer tasks, contributing to the advancement of more sophisticated AI agents for software engineering.
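
As a rough illustration of the filtering stage, the sketch below keeps only merged pull requests that reference an issue and change both source and test files. The PullRequest type and its fields are hypothetical stand-ins for metadata retrieved from the hosting platform; a real pipeline applies additional checks (e.g., whether the tests actually run and pass after the fix).

    # Simplified sketch of the pull-request filtering step described above.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class PullRequest:
        """Minimal, hypothetical stand-in for PR metadata from GitHub."""
        merged: bool
        linked_issue_numbers: List[int] = field(default_factory=list)
        changed_files: List[str] = field(default_factory=list)

    def is_candidate(pr: PullRequest) -> bool:
        """Keep merged PRs that reference an issue and change both source and tests."""
        if not pr.merged or not pr.linked_issue_numbers:
            return False
        touches_tests = any("test" in path.lower() for path in pr.changed_files)
        touches_source = any("test" not in path.lower() for path in pr.changed_files)
        return touches_tests and touches_source

    # Example: this PR links an issue and modifies both code and tests, so it is kept.
    pr = PullRequest(True, [1234], ["src/parser.py", "tests/test_parser.py"])
    assert is_candidate(pr)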

The initial version of SWE-bench focused on Python, which limited the evaluation of large language models to domains related to data processing and artificial intelligence. However, this did not cover other widely used and essential areas of software development that rely on fundamentally different technology stacks. The need for multilingual benchmarks can be illustrated by the Java version of SWE-bench, which was the first attempt to extend this benchmark to other programming languages. Our initiative to expand SWE-bench is a direct response to this growing demand for AI-driven development automation across diverse programming ecosystems.
Key Objectives of SWE-bench
SWE-bench serves several important purposes, including:

  • Objective Comparison of AI Agents: Provides a standardized framework for comparing the performance of different programming models and developer tools on real-world software engineering tasks.
  • Tracking AI Progress: Enables the assessment of advancements in AI-driven software automation and helps identify the most promising solutions.
  • Analysis of Current Capabilities: Helps uncover the strengths and weaknesses of both existing and emerging models and tools, guiding efforts toward their improvement.
Benefits for Clients
SWE-bench also offers significant advantages to clients, helping them make more informed decisions and improve development efficiency:

  • Optimizing the Selection of Tools and Technologies: Clients can leverage SWE-bench results to objectively evaluate various AI agents available on the market. This allows them to choose solutions that demonstrate the best performance on tasks relevant to their projects, ultimately improving code quality and reducing development time.
  • Enhancing Developer Team Efficiency: By understanding the capabilities and limitations of different AI agents, clients can more effectively integrate these tools into their workflows. SWE-bench helps determine which tasks can be successfully automated, freeing developers to focus on more complex and creative challenges, which ultimately boosts overall team productivity.
  • Evaluating and Improving In-House AI Development: For companies developing their own AI agents, SWE-bench serves as a reliable benchmark for assessing their effectiveness. By comparing their solutions against benchmark results, organizations can identify areas for improvement and direct their efforts toward building more competitive and high-quality AI-driven tools.
Challenges in Data Collection and Testing
The process of data collection requires a deep understanding of each programming language’s ecosystem, including common project structures, dependency management tools, and testing conventions. Simply porting the Python-centric approach of the original SWE-bench would be ineffective.
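
For example, even the default build-and-test entry points differ sharply across ecosystems. The mapping below is a hedged illustration of the kind of per-language defaults a collection harness might start from; real repositories often override these with custom build scripts, so the commands are assumptions rather than a guaranteed setup.

    # Illustrative defaults only: real projects frequently customize their
    # build and test entry points, so the harness must detect overrides per repo.
    DEFAULT_TOOLCHAINS = {
        "Python":     {"deps": "pip install -e .",        "tests": "pytest"},
        "Java":       {"deps": "mvn install -DskipTests", "tests": "mvn test"},
        "JavaScript": {"deps": "npm ci",                  "tests": "npm test"},
        "Rust":       {"deps": "cargo build",             "tests": "cargo test"},
        "Go":         {"deps": "go mod download",         "tests": "go test ./..."},
        "C#":         {"deps": "dotnet restore",          "tests": "dotnet test"},
    }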

As with any benchmark, collecting and testing data for SWE-bench comes with several challenges:

  • Task Complexity: SWE-bench tasks are sourced from real-world projects and are often complex and ambiguous. This makes automating the data collection and testing process difficult. Developing effective methods for automatic evaluation and validation of model-generated solutions is essential.
  • Task Diversity: SWE-bench covers a wide range of software engineering challenges. This diversity complicates the creation of a universal testing methodology suitable for all task types. Specialized testing approaches need to be designed for different task categories.
  • Data Volume: To obtain statistically significant results, a large dataset must be collected. This requires substantial computational resources and time for data collection and processing. Optimizing data collection and testing processes is crucial for handling large-scale datasets efficiently.
  • Evaluation: Assessing the quality of AI-generated solutions in such scenarios is inherently challenging. There is often no single correct solution, and evaluations can be subjective. Ensuring objective and reliable assessments requires multiple test types (a minimal sketch of this check appears after this list):
    - Pass-to-pass tests, which pass both before and after the modification (e.g., the pull request) is applied.
    - Fail-to-pass tests, which fail when only test_patch (a test-only modification designed to check the changes) is applied, but pass once golden_patch (the developer-proposed solution from the pull request) is applied as well.
    Additionally, robust and objective evaluation metrics must be developed, and expert reviewers should be involved for manual verification in complex cases.
  • Resources: Building and maintaining SWE-bench requires significant resources — both computational and human. Proper funding and a sufficient number of AI trainers are essential to ensure the benchmark’s continued development and support.
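
A minimal sketch of how the two-patch check from the Evaluation item above can be wired together for a single data point is shown below. Here run_tests is a hypothetical callback that executes the named tests in a prepared environment and returns the set of tests that passed; the real harness additionally handles containerized environments, dependency setup, timeouts, and flaky tests.

    # Minimal sketch of fail-to-pass / pass-to-pass evaluation for one data point.
    import subprocess

    def apply_patch(repo_dir: str, patch: str) -> None:
        """Apply a unified diff to the working tree (thin wrapper around git apply)."""
        subprocess.run(["git", "apply", "-"], input=patch, text=True,
                       cwd=repo_dir, check=True)

    def evaluate(repo_dir, instance, model_patch, run_tests):
        """Return True if the model's patch resolves this instance."""
        apply_patch(repo_dir, instance["test_patch"])   # add the test-only changes
        apply_patch(repo_dir, model_patch)              # then the candidate fix
        passed = run_tests(repo_dir,
                           instance["FAIL_TO_PASS"] + instance["PASS_TO_PASS"])
        fail_to_pass_ok = all(t in passed for t in instance["FAIL_TO_PASS"])
        pass_to_pass_ok = all(t in passed for t in instance["PASS_TO_PASS"])
        return fail_to_pass_ok and pass_to_pass_ok
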
Our Experience: Collected Data and Supported Programming Languages
Practical experience has shown that SWE-bench can be adapted for data collection and model evaluation across various programming languages. The key factors for successful adaptation include the availability of a sufficient number of relevant tasks and open-source projects in the target language, as well as client requirements.

As part of individual SWE-bench development projects, our team has actively explored data collection and benchmark adaptation for a wide range of programming languages, extending beyond the original Python focus. We have successfully collected data and conducted preliminary testing for the following languages: Java, C#, PHP, JavaScript/TypeScript, Ruby, Rust, Scala, Kotlin, C++, and Golang.
Manual Review, or SWE-bench Verified
Given the complexity and ambiguity of SWE-bench tasks, as well as the importance of ensuring high-quality evaluation, manual review of model-generated solutions plays a crucial role. The SWE-bench Verified concept involves a process of manual validation and verification of the quality and correctness of benchmark data points by software engineering experts. Our approach aims to address the inherent challenges by developing specialized methodologies for each programming language. Integrating manual verification underscores our commitment to maintaining the quality and reliability of benchmark data across all supported languages.

Automation alone cannot guarantee the relevance and solvability of benchmark tasks. Human expertise is essential for identifying and filtering out problematic cases, particularly in different language-specific contexts.

The key objectives of manual review include:

  • Validating the Accuracy of Automated Evaluation: The automatic quality assessment metrics used in SWE-bench may not always fully capture the real-world correctness of solutions. Manual review helps confirm or adjust the quality and accuracy of individual tasks, identifying potential issues such as incorrect conditions or ambiguous issue descriptions.
  • Detecting Errors and Inconsistencies in the Benchmark: During manual review, experts can uncover flaws in the benchmark itself, such as invalid tasks, unclear conditions, or vague evaluation criteria. This feedback loop improves the overall quality of SWE-bench.
  • Enhancing Trust in Benchmark Results: Manual validation by experts increases confidence in the accuracy and reliability of SWE-bench data points, making it a more authoritative and dependable tool for evaluating AI-driven programming models.
Engaging domain experts for each supported programming language is also critical to ensuring the precision and relevance of the manual review process. These experts provide valuable insights into language-specific challenges and best practices, leading to a more robust and credible benchmark. Maintaining this standard across multiple languages will significantly strengthen client confidence in the expanded benchmark.
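
One way to picture the output of this review loop is a simple per-task record like the sketch below. The criteria listed are illustrative examples of the kinds of checks reviewers can apply; they are assumptions made for the purpose of the example, not a published rubric.

    # Illustrative review record for a single benchmark task; the criteria are
    # examples of typical checks, not an official SWE-bench Verified rubric.
    from dataclasses import dataclass

    @dataclass
    class ReviewRecord:
        instance_id: str
        issue_is_well_specified: bool    # solvable from the issue text alone?
        tests_match_issue: bool          # do fail-to-pass tests check the reported problem?
        tests_not_overly_strict: bool    # are valid alternative fixes accepted?
        environment_reproducible: bool   # does the test environment build deterministically?
        reviewer_notes: str = ""

        @property
        def verified(self) -> bool:
            """A task enters the verified subset only if every check passes."""
            return (self.issue_is_well_specified and self.tests_match_issue
                    and self.tests_not_overly_strict and self.environment_reproducible)
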
Comparison with Other Software Engineering Benchmarks
SWE-bench is not the only benchmark designed for evaluating AI agents in software development. Several other benchmarks exist, each with its own strengths and limitations. Comparing SWE-bench with some of the most well-known alternatives helps highlight its unique features and advantages.
For example:

  • HumanEval focuses on generating Python code from textual descriptions.
  • MBPP also targets Python code generation, but is designed to evaluate models’ ability to synthesize short programs solvable by beginner-level programmers.
  • CodeSearchNet specializes in code search from textual descriptions across multiple languages, including Python, Java, Go, JavaScript, PHP, and Ruby.
  • Defects4J focuses on bug fixing in Java code from real-world projects.
  • HumanEval-V is an extended, multimodal benchmark, but it is still Python-oriented and has limited size and coverage of real-world tasks.

Unlike these benchmarks, our extended SWE-bench offers unique advantages due to its multilingual approach. It focuses on complex software engineering tasks, including code understanding, generation, bug fixing, and refactoring, and covers a wide range of programming languages. Additionally, our benchmark, based on real problems from open-source repositories, provides a more realistic evaluation of AI agents’ capabilities in practical development scenarios. The inclusion of manual validation via SWE-bench Verified for multiple languages ensures more reliable and high-quality assessment of model solutions compared to purely automated metrics used in most other benchmarks. The test-driven evaluation inherent in the SWE-bench framework, and extended by us to multiple languages, provides a stricter and more practical measure of AI agents’ performance compared to benchmarks that rely solely on code generation without execution.

Ultimately, the following features distinguish SWE-bench from other benchmarks and make it especially valuable:

  • Focus on complex tasks: SWE-bench stands out from many other benchmarks by focusing on complex, multifaceted tasks that reflect real-world software engineering problems, rather than narrow, specialized tasks limited to code generation or search.
  • Manual validation via SWE-bench Verified: The concept of manual validation through SWE-bench Verified provides a more reliable and high-quality evaluation of model solutions than purely automated metrics used in most other benchmarks.
  • Realistic tasks: The tasks in SWE-bench are taken from real-world open-source projects, ensuring their relevance and practical significance, unlike synthetic tasks used in some other benchmarks.

Pricing Models
We offer several pricing models, for example:

  • Pay-per-data-point: The client pays a fixed price for each data point, i.e., for testing the model on each SWE-bench task. The price may vary depending on the task’s complexity and the type of testing.
  • Package deals: Clients are offered packages of data points at a discounted price. Different packages may include varying amounts of data points and different levels of detail in the reports.
  • Custom agreements: For large clients with specific requirements, custom pricing agreements can be made, taking into account their particular needs and the volume of testing.

Although one of our pricing options is a pay-per-data-point model, which offers the flexibility and scalability to adapt the benchmark to individual evaluation needs and budgets across programming languages, we also offer other terms of cooperation. The cost of a single data point may depend on the task’s complexity and the type of testing (e.g., automatic or verified). We are also open to discussing custom terms for clients with specific requirements regarding particular programming languages, repository selection, or validation processes.
The exact cost of a single SWE-bench data point varies and is agreed upon with each client individually. As a rough estimate, one data point ranges from several dollars to around one hundred dollars, depending on the factors mentioned above.
Conclusion
Our extended and specialized SWE-bench is designed to evaluate AI agents across a wide range of programming languages, offering capabilities that go beyond standard benchmarks. Drawing on our experience, we provide clients not just a tool, but a reliable solution for achieving measurable results. With our version of SWE-bench, you get not only a powerful instrument for evaluating AI agents but also direct collaboration with our team of experts. We are ready to provide the deep knowledge, up-to-date tooling, and personalized support needed for the successful development and accurate evaluation of your most ambitious projects.