
Multilingual SWE-Bench: Evaluating Compact Open-Source LLMs on Real-World Software Engineering Tasks

This paper introduces a multilingual extension of SWE-Bench—a carefully designed benchmark for assessing large language models (LLMs) in real-world software engineering tasks across various programming languages and industry codebases. We evaluate a set of leading open-source LLMs, offering in-depth quantitative and qualitative insights, and examine how benchmark quality influences the reliability of model evaluations.
Introduction
While current benchmarks for evaluating LLMs in code understanding and generation play a crucial role in model assessment, many public datasets have been incorporated into modern models' training data, resulting in contamination and artificially inflated performance. To mitigate this issue, we introduce a novel, independently curated multilingual extension of SWE-Bench, constructed from diverse open-source repositories. Unlike existing public datasets, our benchmark ensures no overlap with commonly used training corpora, providing a more reliable measure of model capabilities. Additionally, the dataset is difficult to replicate or contaminate, as its validation process leverages AI trainers for data verification. This approach enables transparent, unbiased, and reproducible evaluation of LLMs across a wider spectrum of programming languages and real-world software engineering tasks.
SWE-Bench Multilingual: Dataset Description
Our dataset extends the SWE-Bench Verified benchmark with several key advancements:
1. Comprehensive Multilingual Support:
Our dataset encompasses a diverse array of programming languages, including C++, C#, Go, JavaScript, Kotlin, PHP, Ruby, Rust, and others. This extensive coverage facilitates a comprehensive evaluation of large language models (LLMs) across various programming paradigms, syntactic structures, and domain-specific conventions.
2. Rigorous Quality Assurance:
To balance scalability with accuracy, we employ a hybrid validation framework combining automated static analysis, rule-based checks, and manual review by expert annotators. This dual-layer verification guarantees that task formulations, code diffs, and test cases meet strict consistency standards while minimizing labeling noise.
3. Industrial-Grade Task Selection:
All tasks are sourced from actively maintained, high-impact open-source projects (e.g., kubernetes, llvm-project) and mirror real development workflows. Each instance represents an authentic software engineering activity—such as backporting bug fixes, implementing API extensions, or optimizing performance-critical components—providing a realistic proxy for evaluating models in production-like environments.
Dataset Statistics
| Language   | Task count | Avg. lines changed | Median lines changed |
|------------|------------|--------------------|----------------------|
| All        | 506        | 9.7                | 5.0                  |
| Go         | 137        | 7.1                | 4.0                  |
| JavaScript | 120        | 14.8               | 9.5                  |
| Rust       | 103        | 11.1               | 7.0                  |
| PHP        | 53         | 8.2                | 4.0                  |
| Ruby       | 53         | 0.8                | 0.0                  |
| Kotlin     | 34         | 14.9               | 13.5                 |
| C++        | 3          | 5.0                | 2.0                  |
| C#         | 3          | 8.7                | 4.0                  |
The extended analysis (see Appendix A) presents the distribution of change volumes across tasks and the distribution of source repositories.
Evaluation Setup
Model Selection
Our study evaluated several prominent language models, both open-source and proprietary, that possess reasoning capabilities and demonstrate proficiency in coding and computer science:
  • google/gemini-2.5-pro-preview
  • anthropic/claude-3.7-sonnet
  • meta-llama/llama-4-maverick
  • deepseek/deepseek-r1
  • deepseek/deepseek-chat (deepseek v3)
  • qwen/qwen-2.5-72b-instruct
All models were integrated and orchestrated using the MOpenHands agent framework. MOpenHands was responsible for managing the end-to-end interaction flow, including prompt formatting, model invocation, and post-processing of responses.

Patch Generation
  • For each benchmark task, solution patches were generated by issuing prompts to the models via the MOpenHands agent.
  • The agent ensured uniform prompt structure, consistent decoding parameters, and reproducible generation conditions across all models.

Patch Evaluation
  • Model-generated patches were evaluated using the multi-swe-bench evaluation framework. This framework automated the application of each patch to the corresponding codebase, performed project build and test execution, and produced detailed logs for each evaluation run.
  • The primary function of the multi-swe-bench framework in this setup was to provide a standardized and reproducible environment for patch application and test execution across all models. The framework did not compute aggregate metrics itself; instead, it emitted pass/fail statuses and execution logs for each task and patch.
  • After obtaining the test results, we conducted an analysis and calculated the metrics, ensuring consistency and full transparency in the calculation and reporting of key performance indicators such as pass@k, precision, and F1-score.

Evaluation Metrics
pass@k (k=1, 3):
The pass@k metric measures the fraction of tasks for which at least one correct solution is present among the top-k model completions. For each task, we generate k independent model outputs using the MOpenHands agent and evaluate them using the multi-swe-bench framework, which applies each patch to the codebase, runs the tests, and reports the results.
  • pass@1 quantifies the success rate when only the top prediction is considered, reflecting the model's deterministic effectiveness.
  • pass@3 measures the probability that at least one of the top three predictions is correct. This metric is useful for scenarios where multiple valid solutions may exist, and the model can offer several correct solutions from the top-k results.

Precision and F1 Score:
We calculate precision and F1 Score at the patch and test levels, evaluating how accurately the model-generated code modifications correspond to the ground-truth patches. These metrics are derived from the execution logs produced by the multi-swe-bench framework, which provides raw results after applying the generated patches.
  • Precision is defined as the proportion of correctly generated changes among all changes proposed by the model.
  • F1 Score is calculated as the harmonic mean between precision and pass@1, providing a balanced evaluation of model performance.
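The two definitions above can be sketched in a few lines of Python. The function names are ours, and the sample values are claude-3.7-sonnet's precision and pass@1 from the Results table, converted to fractions:

```python
def precision(correct_changes: int, total_changes: int) -> float:
    """Fraction of model-proposed changes that were correct."""
    return correct_changes / total_changes if total_changes else 0.0

def f1_score(precision_val: float, pass_at_1: float) -> float:
    """Harmonic mean of precision and pass@1, per the definition above."""
    if precision_val + pass_at_1 == 0:
        return 0.0
    return 2 * precision_val * pass_at_1 / (precision_val + pass_at_1)

# claude-3.7-sonnet's reported precision and pass@1, as fractions:
print(round(f1_score(0.1654, 0.0563) * 100, 2))  # → 8.4, the reported F1
```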

Infrastructure & Hyperparameters
All model inference was performed using the MOpenHands agent framework, which was extended to interact with the OpenRouter platform. This integration ensures consistent hardware resources, standardized execution environments, and high-throughput evaluation for all experiments. Below, we detail the relevant aspects of the infrastructure, dependencies, and evaluation automation for full reproducibility.

Cloud & Hardware Environment
  • API/Hosting Provider:
The inference was conducted via OpenRouter (https://openrouter.ai), which provides unified access to a wide selection of open-source LLMs deployed on enterprise-grade infrastructure.
  • Compute Resources:
All inference requests were executed on OpenRouter’s backend infrastructure. Specific hardware configurations, such as GPU type (e.g., NVIDIA A100 or RTX 4090), are abstracted and managed by the provider, ensuring scalability and consistency across experiments.
  • Batching and Parallelization:
Queries were executed using task-level parallelization in the MOpenHands framework, which dispatched individual inference tasks in parallel. This setup maximized throughput and minimized latency by leveraging Python's standard ThreadPoolExecutor for parallel execution. Batch inference was disabled to avoid artifacts associated with batch processing and ensure independent handling of each model query.
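A minimal sketch of this task-level parallelization pattern, using Python's standard ThreadPoolExecutor; `run_task` is a hypothetical stand-in for a single MOpenHands inference call routed through OpenRouter:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_task(task_id: str) -> dict:
    # Stand-in for one MOpenHands inference call routed through OpenRouter.
    return {"task": task_id, "status": "completed"}

def run_all(task_ids, max_workers=8):
    """Dispatch independent inference tasks in parallel, one query per
    thread (no batch inference), mirroring the setup described above."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_task, t): t for t in task_ids}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```

Each task is an independent future, so a slow or failing query never blocks the rest of the batch.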

Software Environment
  • Containerization: All scripts and tools were executed within Docker containers, incorporating the multi-swe-bench framework, to ensure a consistent and isolated environment throughout all experiments.
  • Operating System: Ubuntu 22.04 LTS (containerized)
  • Primary Language: Python 3.11.8
  • Key Frameworks:
MOpenHands: Facilitates interaction with models, task management, and GitHub integration for seamless prompt execution and patch generation.
multi-swe-bench: Used for evaluating model responses, including patch application, build/test execution, and result logging.
OpenRouter: Provides access to a wide selection of models, enabling consistent and standardized inference across experiments.
  • No fine-tuning or custom model training was performed — only inference with provided checkpoints.

Model Inference Parameters
  • All model inferences were performed using the standard configuration settings provided by both MOpenHands and OpenRouter. These include default decoding parameters and unified prompt templates for each task type, unless otherwise specified by the model provider.

Automation
  • Evaluation Orchestration:
The evaluation workflow was fully automated using custom Python scripts. All tasks were processed using the MOpenHands framework, which handled interaction with OpenRouter and ensured consistent model evaluation across different tasks.
  • Parallel Task Processing:
Parallel task processing was performed through MOpenHands using Python’s ThreadPoolExecutor, adapting the number of concurrent threads to the available API tokens and OpenRouter constraints.

Random Seed:
  • No random seed was explicitly set during model inference, and the default settings of the OpenRouter API and MOpenHands were used. As all evaluations were conducted via external APIs, results may exhibit minor non-determinism due to stochastic decoding settings (e.g., top-k, top-p, temperature > 0), which are not user-configurable.

Logging:
  • Every API call, including all model parameters, request/response payloads, timestamps, and returned completions, was logged and versioned for full auditability.

Note:
All prompt engineering, context truncation, and input formatting were based on the original prompts from MOpenHands, ensuring consistency and facilitating reproducibility for future work.

Results
| Model                                | Precision (%) | pass@1 (%) | pass@3 (%) | F1 (%) |
|--------------------------------------|---------------|------------|------------|--------|
| anthropic/claude-3.7-sonnet          | 16.54         | 5.63       | 14.47      | 8.40   |
| google/gemini-2.5-pro-preview        | 14.08         | 4.78       | 11.27      | 7.13   |
| deepseek/deepseek-r1                 | 7.24          | 2.46       | 6.03       | 3.67   |
| deepseek/deepseek-chat (deepseek v3) | 7.58          | 3.04       | 7.22       | 4.33   |
| meta-llama/llama-4-maverick          | 6.09          | 1.31       | 5.39       | 2.15   |
| qwen/qwen-2.5-72b-instruct           | 1.26          | 0.47       | 1.09       | 0.63   |
Strengths & Weaknesses
Strengths:
  • Precision: Reflects the accuracy of proposed solutions. Models with high precision (e.g., anthropic/claude-3.7-sonnet) demonstrate that their solutions are often correct, even if not always top-ranked.
  • Pass@1: Measures the model’s ability to generate a correct solution as its first prediction. High values indicate models capable of solving tasks accurately on the first attempt.
  • Pass@3: Evaluates how often at least one of three proposed solutions is correct. This is useful for tasks with multiple valid solutions.

Weaknesses:
Low pass@1 and F1 scores: Most models struggle to generate correct solutions on the first attempt, making them less suitable for high-stakes tasks requiring immediate accuracy.
Result instability: Certain models (e.g., qwen/qwen-2.5-72b-instruct) underperformed across all metrics, likely due to:
  • Smaller model size
  • Lightweight architecture
These factors appear to reduce their effectiveness for complex benchmark tasks.

Key Takeaways:
Despite strong precision, low Pass@1 and F1 scores highlight challenges in generating accurate solutions on the first attempt—a critical requirement for error-sensitive tasks.
Even with the ability to produce multiple answer variants, models exhibit difficulties delivering correct solutions immediately, underscoring the high complexity of this benchmark type for tested models.

Data Quality Impact
The quality of the evaluation dataset plays a pivotal role in the validity and interpretability of model assessment results, especially in the context of large language models for code generation and automated software engineering. In our study, several specific aspects of data quality were found to critically influence both overall scores and the observed ranking of models:

Rigorous Selection Process
Each task in the multilingual SWE-Bench was manually curated and reviewed to ensure it represents a realistic, unambiguous, and non-trivial software engineering scenario. Unlike some public benchmarks dominated by auto-generated or poorly vetted examples, our dataset prioritizes human oversight at every stage—from task selection to solution validation. This manual curation significantly reduces the prevalence of mislabeled, ill-defined, or irrelevant tasks that might otherwise artificially inflate model performance metrics.

Programming Language Distribution
Model performance often varies substantially across programming languages, especially when models are pretrained or fine-tuned primarily on one or a few dominant languages (e.g., Python). To mitigate bias and ensure fair generalization assessment, our multilingual SWE-Bench dataset was deliberately balanced to include a diverse set of languages with representative task counts and difficulty levels. This enables robust model differentiation and reveals genuine strengths/weaknesses in less common or understudied language domains.

Ground Truth Validation Protocol
Reference solutions underwent a two-tier verification process:
  1. Automated checks (builds/tests, patch application)
  2. Expert review (correctness and solution relevance validation)
This approach minimizes:
  • False positives (e.g., solutions that compile but contain semantic errors)
  • False rejections (e.g., valid alternative solutions mistakenly marked incorrect due to overly narrow acceptance criteria)

Cross-Checking with Public Benchmarks
During dataset development, we systematically compared a subset of our curated tasks with their analogues in prominent public code benchmarks. This cross-checking revealed several common issues:
  • Frequent Label Errors: Misalignment between provided "ground-truth" patches and actual repository state, or acceptance of syntactically correct but semantically incorrect solutions.
  • Synthetic Artifacts: The presence of programmatically generated tasks or solutions, which do not accurately reflect real-world software engineering challenges and can be exploited by models that have memorized common patterns.
  • Ambiguous or Underspecified Tasks: Tasks with incomplete context or multiple valid solutions, making objective evaluation difficult.

Impact on Model Evaluation
Our research demonstrates that high-quality, expert-validated, and balanced datasets:
  • Provide lower but more realistic model performance assessments
  • Enable more accurate model comparisons based on actual capabilities rather than dataset biases
Specifically, models achieving SOTA results on noisy benchmarks exhibited significant performance drops and ranking changes when evaluated on our multilingual benchmark and the original SWE-Bench.

Key Insight:
Reliable evaluation of code-generating LLMs requires:
  1. Diverse and challenging tasks
  2. Meticulous dataset preparation and validation
  3. Continuous quality control
Without these:
  • Benchmark results may overestimate models' real-world applicability
  • Progress in AI-powered software development slows
Discussion
Our findings highlight the critical importance of comprehensive, multi-task benchmarks based on real-world industrial challenges for accurately assessing large language models' (LLMs) true capabilities in software development.

Limited model generalizability: Models trained on narrow datasets (e.g., competitive programming problems or single-language corpora) perform well in constrained domains but struggle with diverse real-world scenarios — particularly in underrepresented programming languages and complex codebases.
The data quality imperative: Balanced and rigorously validated datasets yield more reliable and reproducible results while enabling finer differentiation between models that appear similar on lower-quality benchmarks.
Outstanding challenges: Even top-performing models face difficulties with:
  • Multi-file modifications
  • Project-level context understanding
  • Generating semantically correct fixes
Addressing these limitations will require simultaneous improvements in both model architectures and benchmark design.

Conclusion
This work introduces a multilingual, industry-focused extension of the SWE-Bench benchmark, specifically designed to address gaps in fairness, diversity, and realism that persist in existing evaluation datasets for code-focused large language models. By applying rigorous curation, balancing across programming languages, and thorough validation, we provide a robust platform for unbiased and reproducible assessment of open-source LLMs on practical software engineering tasks. Our empirical results demonstrate that many current models, despite impressive performance on standard benchmarks, face significant challenges when confronted with real-world, multi-language scenarios and higher-quality ground truth.

Beyond quantitative evaluation, our findings highlight the critical role of data quality and benchmark design in shaping the measured capabilities and limitations of advanced AI systems. We advocate for a continuous, community-driven effort to improve both datasets and evaluation protocols, as these underpin the development of AI tools that are trustworthy and effective in real engineering environments. Ultimately, the methodologies and insights presented in this study serve as a foundation for building more reliable, context-aware, and practically valuable AI-driven solutions for software development.
References
Princeton NLP. SWE-bench GitHub Repository. Original SWE-Bench dataset and evaluation pipeline; core pipeline scripts and methodology for evaluation and patch application were adapted from this resource.
https://github.com/princeton-nlp/SWE-bench

Zan, D., Huang, Z., Liu, W., Chen, H., Zhang, L., Xin, S., ... & Xiang, L. (2025). Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving. arXiv preprint arXiv:2504.02605. https://arxiv.org/abs/2504.02605

Multi-SWE-bench Team. (2025). Multi-SWE-bench Documentation. https://multi-swe-bench.github.io

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., ... & Narasimhan, K. R. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Proceedings of the 12th International Conference on Learning Representations (ICLR 2024).
https://arxiv.org/abs/2310.06770

Zhang, L., He, S., Zhang, C., ... & Zhang, D. (2025). SWE-bench Goes Live! arXiv preprint arXiv:2505.23419. https://arxiv.org/abs/2505.23419

Hugging Face. (2024). Transformers Library Documentation. https://huggingface.co/docs/transformers

OpenRouter. (2024). OpenRouter API Documentation. https://openrouter.ai/docs

Wolf, T., Debut, L., Sanh, V., ... (2020). Transformers: State-of-the-Art Natural Language Processing. Proceedings of the 2020 EMNLP: System Demonstrations (pp. 38-45). https://aclanthology.org/2020.emnlp-demos.6/

Multi-SWE-bench Team. (2025). Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving. GitHub Repository. https://github.com/multi-swe-bench/multi-swe-bench
Appendix
A. Data Collection & Validation Procedures

A.1. Source Selection:
Description of the open-source repositories included and the task distribution by programming language.
Key Observations:
Language Dominance:
  • Ruby is the dominant language in the dataset, appearing in 8 out of 20 repositories. These repositories are primarily focused on infrastructure and development tools, such as CocoaPods/CocoaPods, github-changelog-generator/github-changelog-generator, and ruby-grape/grape.
Repository Focus:
  • The repositories are predominantly centered around infrastructure tools (e.g., go-kratos/kratos, prometheus/prometheus) and developer utilities (e.g., gohugoio/hugo, go-resty/resty). There is also significant representation from web/social platforms (e.g., mastodon/mastodon, facebook/docusaurus).
  • Android and Kotlin repositories primarily focus on mobile applications, while Rust is concentrated around system-level utilities and performance-oriented tools.
Licensing Trends:
  • The majority of repositories are under permissive licenses such as MIT, with a few notable exceptions, like copyleft licenses for social/web platforms (e.g., mastodon/mastodon), which emphasize open-source and free software principles.
A.2. Task Extraction:
Pipeline for Identifying Real-World Software Engineering Tasks
Issue and Pull Request Mining:
  • Automatically collect issues and pull requests (PRs) from selected repositories using the GitHub API and/or direct repository scraping.
  • Focus on PRs that reference a specific issue and include both a human-readable description and a code patch.
Patch Extraction and Diff Generation:
  • For each qualifying PR, extract the minimal code diff (patch) associated with the issue or feature.
  • Retrieve and store all relevant files before and after the patch, ensuring that the full context for applying and validating the patch is available.
Context and Metadata Attachment
For each task, we record associated metadata including:
  • Commit hash
  • Change author
  • Modification date
  • Affected files
  • Linked issue description
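The metadata listed above could be modeled as a simple record type. The field names and the example values below are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TaskMetadata:
    """Illustrative record for the per-task metadata fields listed above."""
    commit_hash: str
    author: str
    modified_at: str                      # date of the modification
    affected_files: list = field(default_factory=list)
    issue_description: str = ""

# Hypothetical example entry:
meta = TaskMetadata(
    commit_hash="abc123",
    author="jdoe",
    modified_at="2024-11-02",
    affected_files=["pkg/server/handler.go"],
    issue_description="Fix nil-pointer dereference on empty request body",
)
```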
A.3. Golden Set Generation
Automated Extraction of Ground-Truth Patches:
  • For each accepted pull request, the associated commit diff is extracted using the GitHub API or `git diff` tools.
  • The “before” and “after” versions of each affected file are stored, along with the patch file representing the minimal code change needed to resolve the issue.
  • Patches are linked to corresponding issue and PR metadata, ensuring traceability.
Checks for Patch Validity:
  • Each patch is automatically applied to the original codebase using standard VCS tools (`git apply`).
  • After application, the codebase is built and available automated tests are executed (via `make`, `pytest`, `mvn test`, etc.).
Only patches that successfully build and pass tests are retained.
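The validity check above can be sketched as follows. The `check_patch` wrapper and the default build/test commands are our illustrative placeholders (commands are repository-specific), not the benchmark's actual tooling:

```python
import subprocess

def retain(apply_rc: int, build_rc: int, test_rc: int) -> bool:
    """Retention rule: keep a golden patch only if it applies cleanly,
    the project builds, and all available tests pass."""
    return apply_rc == 0 and build_rc == 0 and test_rc == 0

def check_patch(repo_dir: str, patch_file: str,
                build_cmd=("make",), test_cmd=("make", "test")) -> bool:
    """Apply the patch with git, then build and test the project."""
    if subprocess.run(["git", "apply", "--check", patch_file],
                      cwd=repo_dir).returncode != 0:
        return False                      # patch does not apply cleanly
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    build_rc = subprocess.run(list(build_cmd), cwd=repo_dir).returncode
    test_rc = subprocess.run(list(test_cmd), cwd=repo_dir).returncode
    return retain(0, build_rc, test_rc)
```

`git apply --check` is a dry run, so a non-applying patch is rejected before the working tree is touched.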
A.4. Human Validation
Manual Validation Protocol
Sampling and Coverage:
  • Each task in the dataset was reviewed by at least one expert annotator with software development experience (100% coverage).
  • 30% of tasks underwent cross-validation by a second independent annotator to assess consistency (accounting for repository, programming language, and task type).
Annotator Assignment:
  • All tasks were evaluated using a standardized protocol.
  • For cross-validation, two annotators independently assessed solution correctness and completeness.
Acceptance Criteria:
A patch was marked as accepted if it:
  • Correctly resolved the described issue or implemented the intended feature;
  • Did not introduce regressions or unrelated changes;
  • Was minimal and focused on the described context.
Issue Specification Quality:
For each task, annotators rated issue clarity according to the SWE-bench Verified guidelines:
  • (0) “The issue is well-specified and it is clear what is required for a successful solution.”
  • (1) “There are some blanks to fill in about the issue, but there is a sensible interpretation of what is required for a successful solution.”
Test Coverage Quality:
For each task, annotators rated the coverage of the provided tests following the SWE-bench Verified guidelines:
  • (0) “The tests perfectly cover all possible solutions.”
  • (1) “The tests cover the majority of correct solutions; however, some unusual solutions may be missed.”
Disagreement Resolution:
In cases where annotators disagreed on task validity or labeling, a third, more senior annotator adjudicated the decision. Tasks for which consensus could not be reached, or which remained ambiguous after discussion, were excluded from the final dataset.

Annotation Guidelines and Edge Cases:
Annotators are provided with a detailed guideline document covering:
Multiple Valid Solutions
  • All solutions meeting acceptance criteria must be marked as correct, even when multiple valid variations exist.
Partial Fixes
  • Partial fixes are excluded from evaluation unless:
  • Explicitly specified in requirements
  • Verifiable through testing
Refactoring/Style-Only Changes
  • Such modifications are excluded unless:
  • Explicitly requested in the task description
Exclusion/Disagreement Documentation
  • Annotators must document:
  • Rationale for task exclusion
  • Causes of evaluation disagreements
  • Supporting evidence for decisions
A.5. Language Balancing
Ensuring Proportional Task Representation:
  • After task extraction, the distribution of tasks across programming languages is analyzed.
  • Underrepresented languages may be supplemented by targeted mining of additional repositories or inclusion of less frequent, but still high-quality, tasks.
Maintaining Difficulty Distribution:
The dataset is stratified by difficulty, defined via:
  • Number of lines changed in a patch;
  • Number of files affected;
  • Presence of associated tests or complex bug reports.
Tasks are sampled so that each language has a balanced representation across easy, medium, and hard categories.
A post-processing step ensures no single language or task type (e.g., trivial bugfixes) dominates the final dataset, promoting fair model assessment.
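A sketch of how such stratified balancing might be checked. The difficulty thresholds below are illustrative assumptions, not the benchmark's actual cut-offs:

```python
from collections import Counter

def difficulty_bin(lines_changed: int, files_affected: int) -> str:
    """Bin a task into easy/medium/hard; thresholds are illustrative
    assumptions, not the benchmark's actual criteria."""
    if lines_changed <= 5 and files_affected <= 1:
        return "easy"
    if lines_changed <= 30 and files_affected <= 3:
        return "medium"
    return "hard"

def balance_report(tasks):
    """Tally tasks per (language, difficulty) cell to reveal skew."""
    return Counter(
        (t["lang"], difficulty_bin(t["lines"], t["files"])) for t in tasks
    )
```

Cells with outsized counts in the report would flag a language or difficulty level that dominates the dataset.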
B. Detailed Metric Definitions
B.1. pass@k:
  • Mathematical definition:
For each task, pass@k is computed as the probability that at least one out of k sampled model outputs produces a successful solution. With n total generated samples per task, of which c are correct, the expected value is calculated as:

pass@k = E[ 1 − C(n − c, k) / C(n, k) ]

where C(·, ·) denotes the binomial coefficient.
  • Sampling procedure:
For each task, the same prompt was submitted to the model 3 times, yielding 3 independent outputs. Each output was evaluated separately.
  • Success criterion:
A model output is considered successful if, after applying the generated patch, all test results (from the ground-truth logs and the LLM-generated patch logs) match exactly—indicating functional equivalence.
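The pass@k definition above (n samples per task, c of them correct) can be implemented directly. This is the standard unbiased estimator, not code from the evaluation pipeline:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    with n samples per task and c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With this study's setup of n = 3 samples per task, a task with one correct sample contributes pass@1 = 1/3 and pass@3 = 1.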

B.2. Precision, F1:
  • Precision Definition:
Precision is the proportion of correctly generated code changes among all the changes proposed by the model.
  • F1 Score Definition:
The F1 score is calculated as the harmonic mean of precision and pass@1, providing a balanced evaluation of the model's ability to generate correct and top-ranking solutions.
  • Handling of Partial Matches:
Functionally correct patches that may differ syntactically are considered valid. Partial matches or non-identical but functionally equivalent patches are treated as correct.

B.3. Qualitative Analysis Protocol:
After applying patches and testing using multi-swe-bench, the resulting logs are analyzed to evaluate model quality. During analysis, the logs are compared based on task statuses, verifying whether test results obtained with reference patches match those generated by the model. This enables manual assessment of syntax errors, logical inconsistencies, or contextual misunderstandings in the task execution.
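The log comparison described above reduces to an exact match of per-test statuses. A minimal sketch, where the status dictionaries are hypothetical values parsed from multi-swe-bench logs:

```python
def statuses_match(golden_log: dict, model_log: dict) -> bool:
    """Success criterion: per-test statuses under the model patch must
    exactly match those recorded under the ground-truth patch."""
    return golden_log == model_log

# Hypothetical per-test status maps parsed from evaluation logs:
golden = {"test_parse": "PASS", "test_edge": "PASS"}
model = {"test_parse": "PASS", "test_edge": "FAIL"}
```

Here `statuses_match(golden, model)` is False, flagging the task for manual inspection of the failing test.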
C. Golden Solution Changes Analysis
Discussion of Possible Bias Due to Repository Selection
While we aimed to maximize language and domain diversity in repository selection, certain biases may persist:
  • Language Bias: The final dataset reflects a concentration of tasks in Go, Kotlin, Ruby, and Rust, due to the popularity and activity levels of available repositories in these languages. As a result, languages with fewer high-quality open-source projects (e.g., PHP, compared to Go or Ruby) are underrepresented, potentially influencing model evaluation outcomes in favor of models better suited to dominant languages.
  • Domain and Ecosystem Bias: The dataset contains a significant number of web frameworks, mobile clients, and infrastructure tools. Conversely, scientific computing, embedded, or low-level systems are less represented. This may favor models or approaches optimized for application and web development domains.
  • Maintenance and Popularity Bias: We prioritized actively maintained and popular repositories (as measured by stars, forks, and recent commit activity). While this improves task relevance and data quality, it may exclude less-known but technically significant projects, and could overemphasize patterns common to popular ecosystems.
  • Task Type Bias: Some repositories, such as content management systems and developer tools, naturally yield a larger volume of trivial or repetitive tasks. Despite our filtering, this may affect the distribution of task difficulty and types across the benchmark.