
Fermatix's Multilingual SWE-Bench: Evaluating Compact Open-Source LLMs on Real-World Software Engineering Tasks


This paper introduces a multilingual extension of SWE-Bench — a carefully designed benchmark for assessing large language models (LLMs) in real-world software engineering tasks across various programming languages and industry codebases. We evaluate a set of leading open-source LLMs, offering in-depth quantitative and qualitative insights, and examine how benchmark quality influences the reliability of model evaluations.

Introduction

While current benchmarks for evaluating LLMs on code understanding and generation play a crucial role in model assessment, many public datasets have been incorporated into modern models' training data, resulting in contamination and artificially inflated performance. To mitigate this issue, we introduce a novel, independently curated multilingual extension of SWE-Bench, constructed from diverse open-source repositories. Unlike existing public datasets, our benchmark ensures no overlap with commonly used training corpora, providing a more reliable measure of model capabilities. The dataset is also hard to replicate automatically, because its validation process relies on human AI trainers for data verification; this guards it against future contamination, while the evaluation pipeline itself remains transparent, unbiased, and reproducible across a wide spectrum of programming languages and real-world software engineering tasks.

SWE-Bench Multilingual: Dataset Description

Our dataset extends the SWE-Bench Verified benchmark with several key advancements:

1. Comprehensive Multilingual Support: Our dataset encompasses a diverse array of programming languages, including C++, C#, Go, JavaScript, Kotlin, PHP, Ruby, Rust, and others. This extensive coverage enables evaluation of LLMs across varied programming paradigms, syntactic structures, and domain-specific conventions.

2. Rigorous Quality Assurance: To balance scalability with accuracy, we employ a hybrid validation framework combining automated static analysis, rule-based checks, and manual review by expert annotators. This layered verification ensures that task formulations, code diffs, and test cases meet strict consistency standards while minimizing labeling noise.

3. Industrial-Grade Task Selection: All tasks are sourced from actively maintained, high-impact open-source projects (e.g., kubernetes, llvm-project) and mirror real development workflows. Each instance represents an authentic software engineering activity — such as backporting bug fixes, implementing API extensions, or optimizing performance-critical components — providing a realistic proxy for evaluating models in production-like environments.

Dataset Statistics

| Language | Task count | Avg. lines changed | Median lines changed |
|----------|-----------:|-------------------:|----------------------:|
| ALL      | 506        | 9.7                | 5.0                    |
| Go       | 137        | 7.1                | 4.0                    |
| JS       | 120        | 14.8               | 9.5                    |
| Rust     | 103        | 11.1               | 7.0                    |
| PHP      | 53         | 8.2                | 4.0                    |
| Ruby     | 53         | 0.8                | 0.0                    |
| Kotlin   | 34         | 14.9               | 13.5                   |
| C++      | 3          | 5.0                | 2.0                    |
| C#       | 3          | 8.7                | 4.0                    |

An extended analysis (see Appendix A) covers the distribution of change volumes across tasks and the distribution of source repositories.

Evaluation Setup

Model Selection

Our study evaluated several prominent language models, both open-source and proprietary, that possess reasoning capabilities and demonstrate proficiency in coding and computer science. The full list of evaluated models appears in the Results table below.

All models were integrated and orchestrated using the MOpenHands agent framework. MOpenHands was responsible for managing the end-to-end interaction flow, including prompt formatting, model invocation, and post-processing of responses.

Patch Generation

For each benchmark task, solution patches were generated by issuing prompts to the models via the MOpenHands agent. The agent ensured uniform prompt structure, consistent decoding parameters, and reproducible generation conditions across all models.

Patch Evaluation

Model-generated patches were evaluated using the multi-swe-bench evaluation framework. This framework automated the application of each patch to the corresponding codebase, performed project build and test execution, and produced detailed logs for each evaluation run.

The primary function of the multi-swe-bench framework in this setup was to provide a standardized, reproducible environment for patch application and test execution across all models. The framework did not compute aggregate metrics itself; instead, it emitted pass/fail statuses and execution logs for each task and patch.

From these test results we then computed the metrics ourselves, ensuring consistent and fully transparent calculation and reporting of the key performance indicators: pass@k, precision, and F1 score.

Evaluation Metrics

pass@k (k=1, 3): The pass@k metric measures the fraction of tasks for which at least one correct solution is present among the top-k model completions. For each task, we generate k independent model outputs using the MOpenHands agent and evaluate them using the multi-swe-bench framework.

Precision and F1 Score: We calculate precision and F1 score at the patch and test levels, evaluating how accurately the model-generated code modifications correspond to the ground-truth patches.

Infrastructure & Hyperparameters

All model inference was performed using the MOpenHands agent framework, which was extended to interact with the OpenRouter platform.
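For illustration, here is a minimal sketch of what a single model call looks like through OpenRouter's OpenAI-compatible chat-completions endpoint. The model name, prompt, and decoding parameters below are placeholders rather than the study's actual configuration, which is managed by MOpenHands.

```python
# Minimal sketch of one model call through OpenRouter's OpenAI-compatible
# chat-completions endpoint. Prompt, model, and temperature are placeholders.
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def query_model(model: str, prompt: str, temperature: float = 0.0) -> str:
    """Send a single prompt to `model` via OpenRouter and return the reply text."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,  # e.g. "deepseek/deepseek-r1"
            "messages": [{"role": "user", "content": prompt}],
            "temperature": temperature,
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

reply = query_model("deepseek/deepseek-r1", "Summarize the bug described below ...")
```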


Software Environment

No fine-tuning or custom model training was performed; we ran only inference with the provided checkpoints.

Model Inference Parameters

All model inferences were performed using the standard configuration settings provided by both MOpenHands and OpenRouter. These settings include default decoding parameters and unified prompt templates for each task type.


Results

| Model | Precision (%) | pass@1 (%) | pass@3 (%) | F1 (%) |
|-------|--------------:|-----------:|-----------:|-------:|
| anthropic/claude-3.7-sonnet | 16.54 | 5.63 | 14.47 | 8.4 |
| google/gemini-2.5-pro-preview | 14.08 | 4.78 | 11.27 | 7.13 |
| deepseek/deepseek-r1 | 7.24 | 2.46 | 6.03 | 3.67 |
| deepseek/deepseek-chat (deepseek v3) | 7.58 | 3.04 | 7.22 | 4.33 |
| meta-llama/llama-4-maverick | 6.09 | 1.31 | 5.39 | 2.15 |
| qwen/qwen-2.5-72b-instruct | 1.26 | 0.47 | 1.09 | 0.63 |

Strengths & Weaknesses

Strengths:

- Precision: Reflects the accuracy of proposed solutions. Models with high precision (e.g., anthropic/claude-3.7-sonnet) demonstrate that their solutions are often correct, even if not always top-ranked.
- Pass@1: Measures the model's ability to generate a correct solution as its first prediction.
- Pass@3: Evaluates how often at least one of three proposed solutions is correct.

Weaknesses:

- Low Pass@1 and F1 scores: Most models struggle to generate correct solutions on the first attempt, making them less suitable for high-stakes tasks requiring immediate accuracy.
- Result instability: Certain models (e.g., qwen/qwen-2.5-72b-instruct) underperformed across all metrics, likely due to smaller model size and a lighter-weight architecture.

Key Takeaways: Despite relatively strong precision, low Pass@1 and F1 scores highlight how hard it is for models to produce an accurate solution on the first attempt, a critical requirement for error-sensitive tasks. Even when allowed multiple answer variants, models struggle to deliver correct solutions immediately, underscoring how challenging this benchmark is for the models tested.

Data Quality Impact

The quality of the evaluation dataset plays a pivotal role in the validity and interpretability of model assessment results.

Rigorous Selection Process

Each task in the multilingual SWE-Bench was manually curated and reviewed to ensure it represents a realistic, unambiguous, and non-trivial software engineering scenario. Unlike some public benchmarks dominated by auto-generated or poorly vetted examples, our dataset prioritizes human oversight at every stage — from task selection to solution validation.

Programming Language Distribution

Model performance often varies substantially across programming languages. To mitigate bias and ensure fair generalization assessment, our multilingual SWE-Bench dataset was deliberately balanced to include a diverse set of languages with representative task counts and difficulty levels.

Ground Truth Validation Protocol

Reference solutions underwent a two-tier verification process:

- Automated checks (builds/tests, patch application)
- Expert review (correctness and solution-relevance validation)

This approach minimizes false positives (e.g., solutions that compile but contain semantic errors) and false rejections (e.g., valid alternative solutions mistakenly marked incorrect).

Cross-Checking with Public Benchmarks

During dataset development, we systematically compared a subset of our curated tasks with their analogues in prominent public code benchmarks. This cross-checking revealed several common issues:

- Frequent Label Errors: Misalignment between provided "ground-truth" patches and the actual repository state, or acceptance of syntactically correct but semantically incorrect solutions.
- Synthetic Artifacts: The presence of programmatically generated tasks or solutions.
- Ambiguous or Underspecified Tasks: Tasks with incomplete context or multiple valid solutions.

Impact on Model Evaluation

Our research demonstrates that high-quality, expert-validated, and balanced datasets:

- Provide lower but more realistic model performance assessments
- Enable more accurate model comparisons based on actual capabilities rather than dataset biases

Specifically, models achieving SOTA results on noisy benchmarks exhibited significant performance drops and ranking changes when evaluated on our multilingual benchmark and the original SWE-Bench.

Key Insight: Reliable evaluation of code-generating LLMs requires diverse and challenging tasks, meticulous dataset preparation and validation, and continuous quality control. Without these, benchmark results may overestimate models' real-world applicability, slowing genuine progress in AI-powered software development.

Discussion

Our findings highlight the critical importance of comprehensive, multi-task benchmarks based on real-world industrial challenges for accurately assessing large language models' (LLMs) true capabilities in software development.

Addressing the limitations revealed here, notably low first-attempt accuracy and unstable results across languages and models, will require simultaneous improvements in both model architectures and benchmark design.

Conclusion

This work introduces a multilingual, industry-focused extension of the SWE-Bench benchmark, specifically designed to address gaps in fairness, diversity, and realism that persist in existing evaluation datasets for code-focused large language models. By applying rigorous curation, balancing across programming languages, and thorough validation, we provide a robust platform for unbiased and reproducible assessment of open-source LLMs on practical software engineering tasks. Our empirical results demonstrate that many current models, despite impressive performance on standard benchmarks, face significant challenges when confronted with real-world, multi-language scenarios and higher-quality ground truth.

Beyond quantitative evaluation, our findings highlight the critical role of data quality and benchmark design in shaping the measured capabilities and limitations of advanced AI systems. We advocate for a continuous, community-driven effort to improve both datasets and evaluation protocols, as these underpin the development of AI tools that are trustworthy and effective in real engineering environments.

References

  1. Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., ... & Narasimhan, K. R. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. https://arxiv.org/abs/2310.06770; https://github.com/princeton-nlp/SWE-bench
  2. Zan, D., Huang, Z., Liu, W., Chen, H., Zhang, L., Xin, S., ... & Xiang, L. (2025). Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving. arXiv preprint arXiv:2504.02605. https://arxiv.org/abs/2504.02605
  3. Zan, D., et al. (2025). Multi-SWE-bench Documentation. https://multi-swe-bench.github.io
  4. Zhang, L., He, S., Zhang, C., ... & Zhang, D. (2025). SWE-bench Goes Live! arXiv preprint arXiv:2505.23419. https://arxiv.org/abs/2505.23419
  5. Hugging Face. (2024). Transformers Library Documentation. https://huggingface.co/docs/transformers
  6. OpenRouter. (2024). OpenRouter API Documentation. https://openrouter.ai/docs
  7. Wolf, T., Debut, L., Sanh, V., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. EMNLP 2020 System Demonstrations. https://aclanthology.org/2020.emnlp-demos.6/
  8. Multi-SWE-bench Team. (2025). GitHub Repository. https://github.com/multi-swe-bench/multi-swe-bench

Appendix

A. Data Collection & Validation Procedures

A.1. Source Selection

- Description of open-source repositories included
- Task distribution by programming language

Key Observations:

- Language Dominance: Ruby contributes the most repositories in the dataset, appearing in 8 out of 20. These repositories are primarily focused on infrastructure and development tools, such as CocoaPods/CocoaPods, github-changelog-generator/github-changelog-generator, and ruby-grape/grape.
- Repository Focus: The repositories are predominantly centered on infrastructure tools (e.g., go-kratos/kratos, prometheus/prometheus) and developer utilities (e.g., gohugoio/hugo, go-resty/resty), with significant representation from web/social platforms (e.g., mastodon/mastodon, facebook/docusaurus). Android and Kotlin repositories primarily cover mobile applications, while Rust is concentrated in system-level utilities and performance-oriented tools.
- Licensing Trends: The majority of repositories carry permissive licenses such as MIT, with a few notable exceptions, like the copyleft licenses of social/web platforms (e.g., mastodon/mastodon).

A.2. Task Extraction — Pipeline for Identifying Real-World Software Engineering Tasks

Issue and Pull Request Mining:

- Automatically collect issues and pull requests (PRs) from selected repositories using the GitHub API and/or direct repository scraping.
- Focus on PRs that reference a specific issue and include both a human-readable description and a code patch.

Patch Extraction and Diff Generation:

- For each qualifying PR, extract the minimal code diff (patch) associated with the issue or feature.
- Retrieve and store all relevant files before and after the patch, ensuring that the full context for applying and validating the patch is available.
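A minimal sketch of the mining and diff-extraction steps using the public GitHub REST API follows. The issue-linking heuristic (scanning the PR body for "#<number>" references) and the example repository are illustrative simplifications; pagination and authentication are omitted for brevity.

```python
# Sketch of PR mining and diff extraction via the public GitHub REST API.
# The "#<number>" issue-reference heuristic is a simplification; pagination
# and authentication are omitted for brevity.
import re
import requests

API = "https://api.github.com"

def mine_candidate_prs(owner: str, repo: str):
    """Yield merged PRs that carry a description and reference an issue."""
    resp = requests.get(
        f"{API}/repos/{owner}/{repo}/pulls",
        headers={"Accept": "application/vnd.github+json"},
        params={"state": "closed", "per_page": 100},
    )
    resp.raise_for_status()
    for pr in resp.json():
        body = pr.get("body") or ""
        if pr.get("merged_at") and re.search(r"#\d+", body):
            yield pr

def fetch_diff(owner: str, repo: str, pr_number: int) -> str:
    """Download a PR's unified diff using GitHub's diff media type."""
    resp = requests.get(
        f"{API}/repos/{owner}/{repo}/pulls/{pr_number}",
        headers={"Accept": "application/vnd.github.diff"},
    )
    resp.raise_for_status()
    return resp.text

for pr in mine_candidate_prs("prometheus", "prometheus"):
    patch = fetch_diff("prometheus", "prometheus", pr["number"])
```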

Context and Metadata Attachment: For each task, we record associated metadata, including commit hash, change author, modification date, affected files, and linked issue description.

A.3. Golden Set Generation

Automated Extraction of Ground-Truth Patches:

- For each accepted pull request, the associated commit diff is extracted using the GitHub API or git diff tools.
- The "before" and "after" versions of each affected file are stored, along with the patch file representing the minimal code change needed to resolve the issue.
- Patches are linked to corresponding issue and PR metadata, ensuring traceability.

Checks for Patch Validity:

- Each patch is automatically applied to the original codebase using standard VCS tools (git apply).
- After application, the codebase is built and available automated tests are executed (via make, pytest, mvn test, etc.).
- Only patches that successfully build and pass tests are retained.
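A compact sketch of this validity check under the stated tooling (git apply plus a per-project test command); the pytest default below is only a stand-in, since the actual build/test command varies by project.

```python
# Sketch of the automated validity check: dry-run the patch, apply it,
# then run the project's test suite. The test command is per-project
# (make, pytest, mvn test, ...); pytest here is only a stand-in.
import subprocess

def patch_is_valid(repo_dir: str, patch_file: str,
                   test_cmd=("pytest", "-q")) -> bool:
    # Dry-run first so a failing patch leaves the working tree untouched.
    check = subprocess.run(["git", "apply", "--check", patch_file], cwd=repo_dir)
    if check.returncode != 0:
        return False
    # Apply for real, then build/test; only patches whose tests pass are kept.
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    tests = subprocess.run(list(test_cmd), cwd=repo_dir)
    return tests.returncode == 0
```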

A.4. Human Validation — Manual Validation Protocol

Sampling and Coverage:

- Each task in the dataset was reviewed by at least one expert annotator with software development experience (100% coverage).
- 30% of tasks underwent cross-validation by a second independent annotator to assess consistency.

Annotator Assignment: All tasks were evaluated using a standardized protocol. For cross-validation, two annotators independently assessed solution correctness and completeness.
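The post does not name an agreement statistic for the cross-validated 30% subset; Cohen's kappa is a common choice for two annotators making binary accept/reject judgments, sketched here purely for illustration.

```python
# Cohen's kappa for two annotators making binary accept(1)/reject(0) calls
# on the same tasks. Corrects raw agreement for chance agreement.
def cohens_kappa(labels_a: list, labels_b: list) -> float:
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal accept rate.
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:  # both annotators fully one-sided
        return 1.0
    return (observed - expected) / (1 - expected)

# One entry per cross-validated task, e.g. from the 30% double-annotated subset.
kappa = cohens_kappa([1, 1, 0, 1, 0], [1, 1, 0, 0, 0])
```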

Acceptance Criteria: A patch was marked as accepted if it correctly resolved the described issue or implemented the intended feature, did not introduce regressions or unrelated changes, and was minimal and focused on the described context.

Issue Specification Quality: Annotators rated issue clarity:

- (0) "The issue is well-specified and it is clear what is required for a successful solution"
- (1) "There are some blanks to fill in about the issue, but there is a sensible interpretation of what is required for a successful solution"

Test Coverage Quality:

- (0) "The tests perfectly cover all possible solutions"
- (1) "The tests cover the majority of correct solutions, however some unusual solutions may be missed"

Disagreement Resolution: In cases where annotators disagreed on task validity or labeling, a third, more senior annotator adjudicated the decision. Tasks for which consensus could not be reached were excluded from the final dataset.

Annotation Guidelines and Edge Cases:

- Multiple Valid Solutions: All solutions meeting acceptance criteria must be marked as correct, even when multiple valid variations exist.
- Partial Fixes: Excluded from evaluation unless explicitly specified in requirements and verifiable through testing.
- Refactoring/Style-Only Changes: Excluded unless explicitly requested in the task description.
- Exclusion/Disagreement Documentation: Annotators must document the rationale for task exclusion, causes of evaluation disagreements, and supporting evidence for decisions.

A.5. Language Balancing

Ensuring Proportional Task Representation: After task extraction, the distribution of tasks across programming languages is analyzed. Underrepresented languages may be supplemented by targeted mining of additional repositories.

Maintaining Difficulty Distribution: The dataset is stratified by difficulty, defined via number of lines changed in a patch, number of files affected, and presence of associated tests or complex bug reports. Tasks are sampled so that each language has a balanced representation across easy, medium, and hard categories. A post-processing step ensures no single language or task type dominates the final dataset.
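A sketch of what such a balancing pass might look like. The easy/medium/hard cut-offs (10 and 50 changed lines) and the per-cell cap are illustrative assumptions; the post does not specify exact thresholds.

```python
# Sketch of the balancing pass: bucket tasks by (language, difficulty),
# where difficulty is derived from lines changed, then cap each bucket.
# The 10/50-line cut-offs and the per-cell cap are illustrative assumptions.
import random
from collections import defaultdict

def difficulty(task: dict) -> str:
    loc = task["lines_changed"]
    return "easy" if loc <= 10 else "medium" if loc <= 50 else "hard"

def balance(tasks: list, per_cell: int, seed: int = 0) -> list:
    cells = defaultdict(list)
    for task in tasks:
        cells[(task["language"], difficulty(task))].append(task)
    rng = random.Random(seed)
    sample = []
    for bucket in cells.values():
        rng.shuffle(bucket)
        sample.extend(bucket[:per_cell])  # cap so no cell dominates
    return sample
```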

B. Detailed Metric Definitions

B.1. pass@k

Mathematical definition: For each task, pass@k is computed as the probability that at least one out of k sampled model outputs produces a successful solution.

Sampling procedure: For each task, the same prompt was submitted to the model 3 times, yielding 3 independent outputs. Each output was evaluated separately.

Success criterion: A model output is considered successful if, after applying the generated patch, all test results (from the ground-truth logs and the LLM-generated patch logs) match exactly — indicating functional equivalence.
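Since each task is sampled n = 3 times, pass@k can be computed with the standard unbiased estimator introduced with HumanEval (Chen et al., 2021). The post does not state which estimator it uses, so the sketch below shows the conventional choice.

```python
# pass@k from n samples per task, using the standard unbiased estimator
# pass@k = 1 - C(n-c, k) / C(n, k), where c of the n samples pass.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k completions drawn from n passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# One (n, c) pair per task; here n = 3 samples each, c = how many passed.
counts = [(3, 1), (3, 0), (3, 3), (3, 2)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in counts) / len(counts)
pass_at_3 = sum(pass_at_k(n, c, 3) for n, c in counts) / len(counts)
```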

B.2. Precision, F1

Precision: The proportion of correctly generated code changes among all the changes proposed by the model.

F1 Score: Harmonic mean of precision and pass@1, providing a balanced evaluation of the model's ability to generate correct and top-ranking solutions.

Handling of Partial Matches: Functionally correct patches that may differ syntactically are considered valid. Partial matches or non-identical but functionally equivalent patches are treated as correct.
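Given these definitions, F1 follows directly from the precision and pass@1 columns of the results table; a minimal check:

```python
# F1 as defined here: the harmonic mean of precision and pass@1.
def f1_score(precision: float, pass_at_1: float) -> float:
    if precision + pass_at_1 == 0:
        return 0.0
    return 2 * precision * pass_at_1 / (precision + pass_at_1)

# Sanity check against the results table: claude-3.7-sonnet's row
# (precision 16.54%, pass@1 5.63%) yields roughly the reported 8.4%.
assert abs(f1_score(0.1654, 0.0563) - 0.084) < 0.001
```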

B.3. Qualitative Analysis Protocol

After patches are applied and tested with multi-swe-bench, the resulting logs are analyzed to evaluate model quality. Per-task test statuses from runs with the reference patches are compared against those from runs with the model-generated patches, which enables manual assessment of syntax errors, logical inconsistencies, and contextual misunderstandings in task execution.
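A minimal sketch of that comparison, assuming per-test statuses have already been parsed from the logs into dictionaries (the dict shape is illustrative, not multi-swe-bench's actual log format):

```python
# Sketch of the log comparison: a run counts as successful only if the
# per-test status map from the model's patch matches the reference patch
# exactly. The dict shape is illustrative, not multi-swe-bench's log format.
def status_mismatches(gold: dict, model: dict) -> list:
    """Return (test_id, gold_status, model_status) triples that differ."""
    tests = gold.keys() | model.keys()
    return [(t, gold.get(t), model.get(t))
            for t in sorted(tests) if gold.get(t) != model.get(t)]

gold = {"test_parse": "PASS", "test_render": "PASS", "test_edge": "FAIL"}
model = {"test_parse": "PASS", "test_render": "FAIL", "test_edge": "FAIL"}
run_successful = not status_mismatches(gold, model)  # False here
```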

D. Golden Solution Changes Analysis — Discussion of Possible Bias Due to Repository Selection


