
How We Collect SWE-Bench for Other Languages

One consistent quality standard, no matter what you code in

10 min read

In the world of software development, where code quality and efficiency are of paramount importance, there is a growing need for reliable tools to evaluate and compare AI agents capable of independently solving tasks in projects of varying complexity. One such tool that is gaining popularity is SWE-bench — a benchmark designed to measure the performance and efficiency of AI agents in the context of Python-related tasks. This benchmark is based on real-world problems that developers encounter and provides a standardized metric for assessing an AI agent's ability to automate their resolution.

However, despite its value, the initial concept of SWE-bench being limited exclusively to Python restricted its applicability across the broader software development landscape. Modern projects often involve multiple programming languages, each serving different purposes, such as web development, mobile applications, or system programming. This has led to the need to expand SWE-bench to support a wider range of programming languages, enabling a more comprehensive evaluation of AI capabilities in automating developer tasks across various technology stacks.

My name is Kirill Stolbov, and I am part of the expert team at Fermatix AI. Our company recognized this need and took active steps to expand SWE-bench to other popular programming languages.

What is SWE-bench

SWE-bench, or Software Engineering Benchmark, is a comprehensive benchmark designed to evaluate AI agents on real-world problems that arise during software development. It is built on a carefully curated set of tasks sourced from popular open-source repositories on GitHub. Each task in SWE-bench typically consists of an issue description (a bug report, a feature request, etc.) and a corresponding pull request that contains the changes intended to resolve the issue, along with a set of tests to verify the correctness of the proposed solution. The AI agent under evaluation must analyze the issue description and the codebase and generate a solution that resolves the problem while passing all related tests.
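Conceptually, each data point bundles the issue, the reference fix, and the verifying tests. A minimal sketch in Python (field names loosely follow the public SWE-bench dataset, simplified here for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class SWEBenchInstance:
    """One benchmark data point: an issue plus the human-written fix and its tests."""
    instance_id: str        # unique task identifier
    repo: str               # GitHub repository the task comes from
    base_commit: str        # commit to check out before applying any patch
    problem_statement: str  # the issue text shown to the AI agent
    patch: str              # reference (gold) diff that resolves the issue
    test_patch: str         # diff adding/updating the verifying tests
    fail_to_pass: list = field(default_factory=list)  # tests that must flip to passing
    pass_to_pass: list = field(default_factory=list)  # tests that must keep passing
```

The agent only ever sees the problem statement and the repository at `base_commit`; the gold patch and test lists are reserved for evaluation.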

Unlike traditional benchmarks, which often focus on execution speed or resource consumption, SWE-bench aims to assess deeper aspects of software development, such as understanding existing code, generating new code, identifying and fixing bugs, and performing refactoring. The benchmark was created through a multi-step process that involves selecting repositories, filtering pull requests based on specific attributes (e.g., linkage to an issue and the presence of test changes), and verifying whether the applied changes successfully pass the tests. As a result, SWE-bench provides a standardized tool for evaluating and comparing different approaches to automating developer tasks, contributing to the advancement of more sophisticated AI agents for software engineering.
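The pull-request filtering step described above can be sketched roughly as follows; the dictionary shape and helper name are illustrative, not the actual SWE-bench tooling:

```python
def is_candidate(pr: dict) -> bool:
    """Keep only merged PRs that reference an issue and touch test files.

    `pr` is an illustrative record, e.g.:
    {"merged": True, "linked_issues": [123],
     "changed_files": ["src/a.py", "tests/test_a.py"]}
    """
    if not pr.get("merged"):
        return False
    if not pr.get("linked_issues"):  # must resolve a reported issue
        return False
    # must add or modify tests so the fix can be verified by execution
    return any("test" in path.lower() for path in pr.get("changed_files", []))
```

Candidates that survive this filter are then checked in a real execution environment: the change is applied and the associated tests must pass.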

The initial version of SWE-bench focused on Python, which limited the evaluation of large language models to domains related to data processing and artificial intelligence. However, this did not cover other widely used and essential areas of software development that rely on fundamentally different technology stacks. The need for multilingual benchmarks can be illustrated by the Java version of SWE-bench, which was the first attempt to extend this benchmark to other programming languages. Our initiative to expand SWE-bench is a direct response to this growing demand for AI-driven development automation across diverse programming ecosystems.


Challenges in Data Collection and Testing

The process of data collection requires a deep understanding of each programming language's ecosystem, including common project structures, dependency management tools, and testing conventions. Simply porting the Python-centric approach of the original SWE-bench would be ineffective.
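For illustration, adapting the harness per language largely means mapping each ecosystem to its own dependency-installation and test-invocation conventions. The commands below are typical defaults for each toolchain, not our actual configuration:

```python
# Illustrative per-language harness configuration: each ecosystem has its own
# dependency manager and test runner, so a Python-centric pipeline cannot be
# reused as-is.
LANGUAGE_CONFIG = {
    "python":     {"install": "pip install -e .",        "test": "pytest"},
    "java":       {"install": "mvn install -DskipTests", "test": "mvn test"},
    "go":         {"install": "go mod download",         "test": "go test ./..."},
    "rust":       {"install": "cargo fetch",             "test": "cargo test"},
    "javascript": {"install": "npm ci",                  "test": "npm test"},
}

def commands_for(language: str) -> dict:
    """Look up the build/test convention for a language, erroring on unknown ones."""
    try:
        return LANGUAGE_CONFIG[language]
    except KeyError:
        raise ValueError(f"no harness configuration for {language!r}")
```

In practice each repository may further override these defaults (custom build scripts, pinned toolchain versions), which is precisely where the per-language expertise comes in.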

As with any benchmark, collecting and testing data for SWE-bench comes with a number of challenges. Robust and objective evaluation metrics must be developed, and expert reviewers should be involved for manual verification in complex cases.

Our Experience: Collected Data and Supported Programming Languages

Practical experience has shown that SWE-bench can be adapted for data collection and model evaluation across various programming languages. The key factors for successful adaptation include the availability of a sufficient number of relevant tasks and open-source projects in the target language, as well as client requirements.

As part of individual SWE-bench development projects, our team has actively explored data collection and benchmark adaptation for a wide range of programming languages, extending beyond the original Python focus. We have successfully collected data and conducted preliminary testing for the following languages: Java, C#, PHP, JavaScript/TypeScript, Ruby, Rust, Scala, Kotlin, C++, and Golang.

Manual Review, or SWE-bench Verified

Given the complexity and ambiguity of SWE-bench tasks, as well as the importance of ensuring high-quality evaluation, manual review of model-generated solutions plays a crucial role. The SWE-bench Verified concept involves a process of manual validation and verification of the quality and correctness of benchmark data points by software engineering experts. Our approach aims to address the inherent challenges by developing specialized methodologies for each programming language. Integrating manual verification underscores our commitment to maintaining the quality and reliability of benchmark data across all supported languages.

Automation alone cannot guarantee the relevance and solvability of benchmark tasks. Human expertise is essential for identifying and filtering out problematic cases, particularly in different language-specific contexts.

The key objectives of manual review are to confirm that each task is relevant and solvable and to filter out data points that would distort evaluation.

Engaging domain experts for each supported programming language is also critical to ensuring the precision and relevance of the manual review process. These experts provide valuable insights into language-specific challenges and best practices, leading to a more robust and credible benchmark. Maintaining this standard across multiple languages will significantly strengthen client confidence in the expanded benchmark.

Comparison with Other Software Engineering Benchmarks

SWE-bench is not the only benchmark designed for evaluating AI agents in software development. Several other benchmarks exist, each with its own strengths and limitations. Comparing SWE-bench with some of the most well-known alternatives helps highlight its unique features and advantages.

For example, HumanEval focuses on generating Python code from natural-language descriptions. MBPP also targets Python code generation, but it is designed to evaluate a model's ability to synthesize short programs solvable by beginner-level programmers. CodeSearchNet specializes in natural-language code search across multiple languages, including Python, Java, Go, JavaScript, PHP, and Ruby. Defects4J focuses on fixing bugs in Java code from real-world projects. There is also HumanEval-V, a multimodal extension that nevertheless remains Python-oriented, with limited size and coverage of real-world tasks.

Unlike these benchmarks, our extended SWE-bench offers unique advantages due to its multilingual approach. It focuses on complex software engineering tasks, including code understanding, generation, bug fixing, and refactoring, and covers a wide range of programming languages. Additionally, our benchmark, based on real problems from open-source repositories, provides a more realistic evaluation of AI agents' capabilities in practical development scenarios. The inclusion of manual validation via SWE-bench Verified for multiple languages ensures more reliable and high-quality assessment of model solutions compared to purely automated metrics used in most other benchmarks. The test-driven evaluation inherent in the SWE-bench framework, and extended by us to multiple languages, provides a stricter and more practical measure of AI agents' performance compared to benchmarks that rely solely on code generation without execution.
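The test-driven criterion reduces to a simple rule: a task counts as resolved only if every previously failing test now passes and no previously passing test regresses. A sketch, with an assumed mapping from test name to pass/fail status:

```python
def is_resolved(results: dict, fail_to_pass: list, pass_to_pass: list) -> bool:
    """Decide whether an agent's patch resolves a task.

    `results` maps test name -> True (passed) / False (failed) after the
    agent's patch is applied; a test missing from `results` counts as failed.
    """
    fixed = all(results.get(t, False) for t in fail_to_pass)           # issue tests now pass
    no_regressions = all(results.get(t, False) for t in pass_to_pass)  # nothing broke
    return fixed and no_regressions
```

Because the verdict comes from actually running the test suite, superficially plausible patches that do not fix the behavior are rejected automatically.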

Ultimately, SWE-bench stands apart from these benchmarks in its multilingual coverage, its focus on complex, real-world engineering tasks, its test-driven evaluation, and its manually verified subset.

Examples of Pricing Models

One of our pricing options is a pay-per-data-point model, which provides the flexibility to scale benchmark usage to individual evaluation needs and budgets across programming languages; we also offer other cooperation terms. The cost of a single data point may depend on the task's complexity and the type of testing (automatic or verified). We are likewise open to custom terms for clients with specific requirements around particular programming languages, repository selection, or validation processes.

The exact cost of a single SWE-bench data point varies and is agreed upon individually. As a rough estimate, one data point can range from a few dollars to around a hundred dollars, depending on the factors above.

Conclusion

Our extended and specialized SWE-bench is designed to evaluate AI agents across a wide range of programming languages, offering capabilities that go beyond standard benchmarks. Drawing on our experience, we provide clients not just a tool but a reliable solution for achieving measurable results. With our version of SWE-bench, you get not only access to a powerful evaluation tool but also direct collaboration with our team of experts. We are ready to provide the deep knowledge, cutting-edge tooling, and personalized support necessary for the successful development and accurate evaluation of your most ambitious projects.


Read our other stories:
- Automating Our Client Dataset Verification with LLMs
- Fermatix's Multilingual SWE-Bench
