
Fermatix's Multilingual SWE-Bench: Evaluating Compact Open-Source LLMs on Real-World Software Engineering Tasks


This paper introduces a multilingual extension of SWE-Bench — a carefully designed benchmark for assessing large language models (LLMs) in real-world software engineering tasks across various programming languages and industry codebases. We evaluate a set of leading open-source LLMs, offering in-depth quantitative and qualitative insights, and examine how benchmark quality influences the reliability of model evaluations.

Introduction

While current benchmarks for evaluating LLMs on code understanding and generation play a crucial role in model assessment, many public datasets have been incorporated into modern models' training data, resulting in contamination and artificially inflated performance. To mitigate this issue, we introduce a novel, independently curated multilingual extension of SWE-Bench, constructed from diverse open-source repositories. Unlike existing public datasets, our benchmark ensures no overlap with commonly used training corpora, providing a more reliable measure of model capabilities. The dataset is also hard to replicate automatically, because its validation process relies on human AI trainers for data verification; this guards it against future contamination, while the evaluation pipeline itself remains transparent, unbiased, and reproducible across a wide spectrum of programming languages and real-world software engineering tasks.

SWE-Bench Multilingual: Dataset Description

Our dataset extends the SWE-Bench Verified benchmark with several key advancements:

1. Comprehensive Multilingual Support: Our dataset encompasses a diverse array of programming languages, including C++, C#, Go, JavaScript, Kotlin, PHP, Ruby, Rust, and others. This extensive coverage enables evaluation of LLMs across varied programming paradigms, syntactic structures, and domain-specific conventions.

2. Rigorous Quality Assurance: To balance scalability with accuracy, we employ a hybrid validation framework combining automated static analysis, rule-based checks, and manual review by expert annotators. This layered verification ensures that task formulations, code diffs, and test cases meet strict consistency standards while minimizing labeling noise.

3. Industrial-Grade Task Selection: All tasks are sourced from actively maintained, high-impact open-source projects (e.g., kubernetes, llvm-project) and mirror real development workflows. Each instance represents an authentic software engineering activity — such as backporting bug fixes, implementing API extensions, or optimizing performance-critical components — providing a realistic proxy for evaluating models in production-like environments.

Dataset Statistics

| Language | Task count | Avg. lines changed | Median lines changed |
|----------|-----------:|-------------------:|----------------------:|
| ALL      | 506        | 9.7                | 5.0                    |
| Go       | 137        | 7.1                | 4.0                    |
| JS       | 120        | 14.8               | 9.5                    |
| Rust     | 103        | 11.1               | 7.0                    |
| PHP      | 53         | 8.2                | 4.0                    |
| Ruby     | 53         | 0.8                | 0.0                    |
| Kotlin   | 34         | 14.9               | 13.5                   |
| C++      | 3          | 5.0                | 2.0                    |
| C#       | 3          | 8.7                | 4.0                    |

An extended analysis (see Appendix A) covers the distribution of change volumes across tasks and the distribution of source repositories.

Evaluation Setup

Model Selection

Our study evaluated several prominent language models, both open-source and proprietary, that possess reasoning capabilities and demonstrate proficiency in coding and computer science. The full list of evaluated models appears in the Results table below.

All models were integrated and orchestrated using the MOpenHands agent framework. MOpenHands was responsible for managing the end-to-end interaction flow, including prompt formatting, model invocation, and post-processing of responses.

Patch Generation

For each benchmark task, solution patches were generated by issuing prompts to the models via the MOpenHands agent. The agent ensured uniform prompt structure, consistent decoding parameters, and reproducible generation conditions across all models.

Patch Evaluation

Model-generated patches were evaluated using the multi-swe-bench evaluation framework. This framework automated the application of each patch to the corresponding codebase, performed project build and test execution, and produced detailed logs for each evaluation run.

The primary function of the multi-swe-bench framework in this setup was to provide a standardized, reproducible environment for patch application and test execution across all models. The framework did not compute aggregate metrics itself; instead, it emitted pass/fail statuses and execution logs for each task and patch.

From these test results we then computed the metrics ourselves, ensuring consistent and fully transparent calculation and reporting of the key performance indicators: pass@k, precision, and F1 score.

Evaluation Metrics

pass@k (k=1, 3): The pass@k metric measures the fraction of tasks for which at least one correct solution is present among the top-k model completions. For each task, we generate k independent model outputs using the MOpenHands agent and evaluate them using the multi-swe-bench framework.

Precision and F1 Score: We calculate precision and F1 score at the patch and test levels, evaluating how accurately the model-generated code modifications correspond to the ground-truth patches.

Infrastructure & Hyperparameters

All model inference was performed using the MOpenHands agent framework, which was extended to interact with the OpenRouter platform.
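For illustration, here is a minimal sketch of what a single model call looks like through OpenRouter's OpenAI-compatible chat-completions endpoint. The model name, prompt, and decoding parameters below are placeholders rather than the study's actual configuration, which is managed by MOpenHands.

```python
# Minimal sketch of one model call through OpenRouter's OpenAI-compatible
# chat-completions endpoint. Prompt, model, and temperature are placeholders.
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def query_model(model: str, prompt: str, temperature: float = 0.0) -> str:
    """Send a single prompt to `model` via OpenRouter and return the reply text."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,  # e.g. "deepseek/deepseek-r1"
            "messages": [{"role": "user", "content": prompt}],
            "temperature": temperature,
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

reply = query_model("deepseek/deepseek-r1", "Summarize the bug described below ...")
```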


Software Environment

No fine-tuning or custom model training was performed; we ran only inference with the provided checkpoints.

Model Inference Parameters

All model inferences were performed using the standard configuration settings provided by both MOpenHands and OpenRouter. These settings include default decoding parameters and unified prompt templates for each task type.


Results

| Model | Precision (%) | pass@1 (%) | pass@3 (%) | F1 (%) |
|-------|--------------:|-----------:|-----------:|-------:|
| anthropic/claude-3.7-sonnet | 16.54 | 5.63 | 14.47 | 8.4 |
| google/gemini-2.5-pro-preview | 14.08 | 4.78 | 11.27 | 7.13 |
| deepseek/deepseek-r1 | 7.24 | 2.46 | 6.03 | 3.67 |
| deepseek/deepseek-chat (deepseek v3) | 7.58 | 3.04 | 7.22 | 4.33 |
| meta-llama/llama-4-maverick | 6.09 | 1.31 | 5.39 | 2.15 |
| qwen/qwen-2.5-72b-instruct | 1.26 | 0.47 | 1.09 | 0.63 |

Strengths & Weaknesses

Strengths:

- Precision: Reflects the accuracy of proposed solutions. Models with high precision (e.g., anthropic/claude-3.7-sonnet) demonstrate that their solutions are often correct, even if not always top-ranked.
- Pass@1: Measures the model's ability to generate a correct solution as its first prediction.
- Pass@3: Evaluates how often at least one of three proposed solutions is correct.

Weaknesses:

- Low Pass@1 and F1 scores: Most models struggle to generate correct solutions on the first attempt, making them less suitable for high-stakes tasks requiring immediate accuracy.
- Result instability: Certain models (e.g., qwen/qwen-2.5-72b-instruct) underperformed across all metrics, likely due to smaller model size and a lighter-weight architecture.

Key Takeaways: Despite relatively strong precision, low Pass@1 and F1 scores highlight how hard it is for models to produce an accurate solution on the first attempt, a critical requirement for error-sensitive tasks. Even when allowed multiple answer variants, models struggle to deliver correct solutions immediately, underscoring how challenging this benchmark is for the models tested.

Data Quality Impact

The quality of the evaluation dataset plays a pivotal role in the validity and interpretability of model assessment results.

Rigorous Selection Process

Each task in the multilingual SWE-Bench was manually curated and reviewed to ensure it represents a realistic, unambiguous, and non-trivial software engineering scenario. Unlike some public benchmarks dominated by auto-generated or poorly vetted examples, our dataset prioritizes human oversight at every stage — from task selection to solution validation.

Programming Language Distribution

Model performance often varies substantially across programming languages. To mitigate bias and ensure fair generalization assessment, our multilingual SWE-Bench dataset was deliberately balanced to include a diverse set of languages with representative task counts and difficulty levels.

Ground Truth Validation Protocol

Reference solutions underwent a two-tier verification process:

- Automated checks (builds/tests, patch application)
- Expert review (correctness and solution-relevance validation)

This approach minimizes false positives (e.g., solutions that compile but contain semantic errors) and false rejections (e.g., valid alternative solutions mistakenly marked incorrect).

Cross-Checking with Public Benchmarks

During dataset development, we systematically compared a subset of our curated tasks with their analogues in prominent public code benchmarks. This cross-checking revealed several common issues:

- Frequent Label Errors: Misalignment between provided "ground-truth" patches and the actual repository state, or acceptance of syntactically correct but semantically incorrect solutions.
- Synthetic Artifacts: The presence of programmatically generated tasks or solutions.
- Ambiguous or Underspecified Tasks: Tasks with incomplete context or multiple valid solutions.

Impact on Model Evaluation

Our research demonstrates that high-quality, expert-validated, and balanced datasets:

- Provide lower but more realistic model performance assessments
- Enable more accurate model comparisons based on actual capabilities rather than dataset biases

Specifically, models achieving SOTA results on noisy benchmarks exhibited significant performance drops and ranking changes when evaluated on our multilingual benchmark and the original SWE-Bench.

Key Insight: Reliable evaluation of code-generating LLMs requires diverse and challenging tasks, meticulous dataset preparation and validation, and continuous quality control. Without these, benchmark results may overestimate models' real-world applicability, slowing genuine progress in AI-powered software development.

Discussion

Our findings highlight the critical importance of comprehensive, multi-task benchmarks based on real-world industrial challenges for accurately assessing large language models' (LLMs) true capabilities in software development.

Addressing the limitations revealed here, notably low first-attempt accuracy and unstable results across languages and models, will require simultaneous improvements in both model architectures and benchmark design.

Conclusion

This work introduces a multilingual, industry-focused extension of the SWE-Bench benchmark, specifically designed to address gaps in fairness, diversity, and realism that persist in existing evaluation datasets for code-focused large language models. By applying rigorous curation, balancing across programming languages, and thorough validation, we provide a robust platform for unbiased and reproducible assessment of open-source LLMs on practical software engineering tasks. Our empirical results demonstrate that many current models, despite impressive performance on standard benchmarks, face significant challenges when confronted with real-world, multi-language scenarios and higher-quality ground truth.

Beyond quantitative evaluation, our findings highlight the critical role of data quality and benchmark design in shaping the measured capabilities and limitations of advanced AI systems. We advocate for a continuous, community-driven effort to improve both datasets and evaluation protocols, as these underpin the development of AI tools that are trustworthy and effective in real engineering environments.

References

  1. Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., ... & Narasimhan, K. R. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. https://arxiv.org/abs/2310.06770; https://github.com/princeton-nlp/SWE-bench
  2. Zan, D., Huang, Z., Liu, W., Chen, H., Zhang, L., Xin, S., ... & Xiang, L. (2025). Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving. arXiv preprint arXiv:2504.02605. https://arxiv.org/abs/2504.02605
  3. Zan, D., et al. (2025). Multi-SWE-bench Documentation. https://multi-swe-bench.github.io
  4. Zhang, L., He, S., Zhang, C., ... & Zhang, D. (2025). SWE-bench Goes Live! arXiv preprint arXiv:2505.23419. https://arxiv.org/abs/2505.23419
  5. Hugging Face. (2024). Transformers Library Documentation. https://huggingface.co/docs/transformers
  6. OpenRouter. (2024). OpenRouter API Documentation. https://openrouter.ai/docs
  7. Wolf, T., Debut, L., Sanh, V., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. EMNLP 2020 System Demonstrations. https://aclanthology.org/2020.emnlp-demos.6/
  8. Multi-SWE-bench Team. (2025). GitHub Repository. https://github.com/multi-swe-bench/multi-swe-bench

Appendix

A. Data Collection & Validation Procedures

A.1. Source Selection

- Description of open-source repositories included
- Task distribution by programming language

Key Observations:

- Language Dominance: Ruby contributes the most repositories in the dataset, appearing in 8 out of 20. These repositories are primarily focused on infrastructure and development tools, such as CocoaPods/CocoaPods, github-changelog-generator/github-changelog-generator, and ruby-grape/grape.
- Repository Focus: The repositories are predominantly centered on infrastructure tools (e.g., go-kratos/kratos, prometheus/prometheus) and developer utilities (e.g., gohugoio/hugo, go-resty/resty), with significant representation from web/social platforms (e.g., mastodon/mastodon, facebook/docusaurus). Android and Kotlin repositories primarily cover mobile applications, while Rust is concentrated in system-level utilities and performance-oriented tools.
- Licensing Trends: The majority of repositories carry permissive licenses such as MIT, with a few notable exceptions, like the copyleft licenses of social/web platforms (e.g., mastodon/mastodon).

A.2. Task Extraction — Pipeline for Identifying Real-World Software Engineering Tasks

Issue and Pull Request Mining:

- Automatically collect issues and pull requests (PRs) from selected repositories using the GitHub API and/or direct repository scraping.
- Focus on PRs that reference a specific issue and include both a human-readable description and a code patch.

Patch Extraction and Diff Generation:

- For each qualifying PR, extract the minimal code diff (patch) associated with the issue or feature.
- Retrieve and store all relevant files before and after the patch, ensuring that the full context for applying and validating the patch is available.
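A minimal sketch of the mining and diff-extraction steps using the public GitHub REST API follows. The issue-linking heuristic (scanning the PR body for "#<number>" references) and the example repository are illustrative simplifications; pagination and authentication are omitted for brevity.

```python
# Sketch of PR mining and diff extraction via the public GitHub REST API.
# The "#<number>" issue-reference heuristic is a simplification; pagination
# and authentication are omitted for brevity.
import re
import requests

API = "https://api.github.com"

def mine_candidate_prs(owner: str, repo: str):
    """Yield merged PRs that carry a description and reference an issue."""
    resp = requests.get(
        f"{API}/repos/{owner}/{repo}/pulls",
        headers={"Accept": "application/vnd.github+json"},
        params={"state": "closed", "per_page": 100},
    )
    resp.raise_for_status()
    for pr in resp.json():
        body = pr.get("body") or ""
        if pr.get("merged_at") and re.search(r"#\d+", body):
            yield pr

def fetch_diff(owner: str, repo: str, pr_number: int) -> str:
    """Download a PR's unified diff using GitHub's diff media type."""
    resp = requests.get(
        f"{API}/repos/{owner}/{repo}/pulls/{pr_number}",
        headers={"Accept": "application/vnd.github.diff"},
    )
    resp.raise_for_status()
    return resp.text

for pr in mine_candidate_prs("prometheus", "prometheus"):
    patch = fetch_diff("prometheus", "prometheus", pr["number"])
```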

Context and Metadata Attachment: For each task, we record associated metadata, including commit hash, change author, modification date, affected files, and linked issue description.

A.3. Golden Set Generation

Automated Extraction of Ground-Truth Patches:

- For each accepted pull request, the associated commit diff is extracted using the GitHub API or git diff tools.
- The "before" and "after" versions of each affected file are stored, along with the patch file representing the minimal code change needed to resolve the issue.
- Patches are linked to corresponding issue and PR metadata, ensuring traceability.

Checks for Patch Validity:

- Each patch is automatically applied to the original codebase using standard VCS tools (git apply).
- After application, the codebase is built and available automated tests are executed (via make, pytest, mvn test, etc.).
- Only patches that successfully build and pass tests are retained.
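A compact sketch of this validity check under the stated tooling (git apply plus a per-project test command); the pytest default below is only a stand-in, since the actual build/test command varies by project.

```python
# Sketch of the automated validity check: dry-run the patch, apply it,
# then run the project's test suite. The test command is per-project
# (make, pytest, mvn test, ...); pytest here is only a stand-in.
import subprocess

def patch_is_valid(repo_dir: str, patch_file: str,
                   test_cmd=("pytest", "-q")) -> bool:
    # Dry-run first so a failing patch leaves the working tree untouched.
    check = subprocess.run(["git", "apply", "--check", patch_file], cwd=repo_dir)
    if check.returncode != 0:
        return False
    # Apply for real, then build/test; only patches whose tests pass are kept.
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    tests = subprocess.run(list(test_cmd), cwd=repo_dir)
    return tests.returncode == 0
```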

A.4. Human Validation — Manual Validation Protocol

Sampling and Coverage:

- Each task in the dataset was reviewed by at least one expert annotator with software development experience (100% coverage).
- 30% of tasks underwent cross-validation by a second independent annotator to assess consistency.

Annotator Assignment: All tasks were evaluated using a standardized protocol. For cross-validation, two annotators independently assessed solution correctness and completeness.
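The post does not name an agreement statistic for the cross-validated 30% subset; Cohen's kappa is a common choice for two annotators making binary accept/reject judgments, sketched here purely for illustration.

```python
# Cohen's kappa for two annotators making binary accept(1)/reject(0) calls
# on the same tasks. Corrects raw agreement for chance agreement.
def cohens_kappa(labels_a: list, labels_b: list) -> float:
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal accept rate.
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:  # both annotators fully one-sided
        return 1.0
    return (observed - expected) / (1 - expected)

# One entry per cross-validated task, e.g. from the 30% double-annotated subset.
kappa = cohens_kappa([1, 1, 0, 1, 0], [1, 1, 0, 0, 0])
```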

Acceptance Criteria: A patch was marked as accepted if it correctly resolved the described issue or implemented the intended feature, did not introduce regressions or unrelated changes, and was minimal and focused on the described context.

Issue Specification Quality: Annotators rated issue clarity:

- (0) "The issue is well-specified and it is clear what is required for a successful solution"
- (1) "There are some blanks to fill in about the issue, but there is a sensible interpretation of what is required for a successful solution"

Test Coverage Quality:

- (0) "The tests perfectly cover all possible solutions"
- (1) "The tests cover the majority of correct solutions, however some unusual solutions may be missed"

Disagreement Resolution: In cases where annotators disagreed on task validity or labeling, a third, more senior annotator adjudicated the decision. Tasks for which consensus could not be reached were excluded from the final dataset.

Annotation Guidelines and Edge Cases:

- Multiple Valid Solutions: All solutions meeting acceptance criteria must be marked as correct, even when multiple valid variations exist.
- Partial Fixes: Excluded from evaluation unless explicitly specified in requirements and verifiable through testing.
- Refactoring/Style-Only Changes: Excluded unless explicitly requested in the task description.
- Exclusion/Disagreement Documentation: Annotators must document the rationale for task exclusion, causes of evaluation disagreements, and supporting evidence for decisions.

A.5. Language Balancing

Ensuring Proportional Task Representation: After task extraction, the distribution of tasks across programming languages is analyzed. Underrepresented languages may be supplemented by targeted mining of additional repositories.

Maintaining Difficulty Distribution: The dataset is stratified by difficulty, defined via number of lines changed in a patch, number of files affected, and presence of associated tests or complex bug reports. Tasks are sampled so that each language has a balanced representation across easy, medium, and hard categories. A post-processing step ensures no single language or task type dominates the final dataset.
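A sketch of what such a balancing pass might look like. The easy/medium/hard cut-offs (10 and 50 changed lines) and the per-cell cap are illustrative assumptions; the post does not specify exact thresholds.

```python
# Sketch of the balancing pass: bucket tasks by (language, difficulty),
# where difficulty is derived from lines changed, then cap each bucket.
# The 10/50-line cut-offs and the per-cell cap are illustrative assumptions.
import random
from collections import defaultdict

def difficulty(task: dict) -> str:
    loc = task["lines_changed"]
    return "easy" if loc <= 10 else "medium" if loc <= 50 else "hard"

def balance(tasks: list, per_cell: int, seed: int = 0) -> list:
    cells = defaultdict(list)
    for task in tasks:
        cells[(task["language"], difficulty(task))].append(task)
    rng = random.Random(seed)
    sample = []
    for bucket in cells.values():
        rng.shuffle(bucket)
        sample.extend(bucket[:per_cell])  # cap so no cell dominates
    return sample
```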

B. Detailed Metric Definitions

B.1. pass@k

Mathematical definition: For each task, pass@k is computed as the probability that at least one out of k sampled model outputs produces a successful solution.

Sampling procedure: For each task, the same prompt was submitted to the model 3 times, yielding 3 independent outputs. Each output was evaluated separately.

Success criterion: A model output is considered successful if, after applying the generated patch, all test results (from the ground-truth logs and the LLM-generated patch logs) match exactly — indicating functional equivalence.
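Since each task is sampled n = 3 times, pass@k can be computed with the standard unbiased estimator introduced with HumanEval (Chen et al., 2021). The post does not state which estimator it uses, so the sketch below shows the conventional choice.

```python
# pass@k from n samples per task, using the standard unbiased estimator
# pass@k = 1 - C(n-c, k) / C(n, k), where c of the n samples pass.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k completions drawn from n passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# One (n, c) pair per task; here n = 3 samples each, c = how many passed.
counts = [(3, 1), (3, 0), (3, 3), (3, 2)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in counts) / len(counts)
pass_at_3 = sum(pass_at_k(n, c, 3) for n, c in counts) / len(counts)
```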

B.2. Precision, F1

Precision: The proportion of correctly generated code changes among all the changes proposed by the model.

F1 Score: Harmonic mean of precision and pass@1, providing a balanced evaluation of the model's ability to generate correct and top-ranking solutions.

Handling of Partial Matches: Functionally correct patches that may differ syntactically are considered valid. Partial matches or non-identical but functionally equivalent patches are treated as correct.
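Given these definitions, F1 follows directly from the precision and pass@1 columns of the results table; a minimal check:

```python
# F1 as defined here: the harmonic mean of precision and pass@1.
def f1_score(precision: float, pass_at_1: float) -> float:
    if precision + pass_at_1 == 0:
        return 0.0
    return 2 * precision * pass_at_1 / (precision + pass_at_1)

# Sanity check against the results table: claude-3.7-sonnet's row
# (precision 16.54%, pass@1 5.63%) yields roughly the reported 8.4%.
assert abs(f1_score(0.1654, 0.0563) - 0.084) < 0.001
```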

B.3. Qualitative Analysis Protocol

After patches are applied and tested with multi-swe-bench, the resulting logs are analyzed to evaluate model quality. Per-task test statuses from runs with the reference patches are compared against those from runs with the model-generated patches, which enables manual assessment of syntax errors, logical inconsistencies, and contextual misunderstandings in task execution.
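A minimal sketch of that comparison, assuming per-test statuses have already been parsed from the logs into dictionaries (the dict shape is illustrative, not multi-swe-bench's actual log format):

```python
# Sketch of the log comparison: a run counts as successful only if the
# per-test status map from the model's patch matches the reference patch
# exactly. The dict shape is illustrative, not multi-swe-bench's log format.
def status_mismatches(gold: dict, model: dict) -> list:
    """Return (test_id, gold_status, model_status) triples that differ."""
    tests = gold.keys() | model.keys()
    return [(t, gold.get(t), model.get(t))
            for t in sorted(tests) if gold.get(t) != model.get(t)]

gold = {"test_parse": "PASS", "test_render": "PASS", "test_edge": "FAIL"}
model = {"test_parse": "PASS", "test_render": "FAIL", "test_edge": "FAIL"}
run_successful = not status_mismatches(gold, model)  # False here
```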

D. Golden Solution Changes Analysis — Discussion of Possible Bias Due to Repository Selection


