Model Selection

Our study evaluated several prominent language models, both open-source and proprietary, that possess reasoning capabilities and demonstrate proficiency in coding and computer science:
- google/gemini-2.5-pro-preview
- anthropic/claude-3.7-sonnet
- meta-llama/llama-4-maverick
- deepseek/deepseek-r1
- deepseek/deepseek-chat (deepseek v3)
- qwen/qwen-2.5-72b-instruct
All models were integrated and orchestrated using the MOpenHands agent framework. MOpenHands was responsible for managing the end-to-end interaction flow, including prompt formatting, model invocation, and post-processing of responses.
Patch Generation

- For each benchmark task, solution patches were generated by issuing prompts to the models via the MOpenHands agent.
- The agent ensured uniform prompt structure, consistent decoding parameters, and reproducible generation conditions across all models.
Patch Evaluation

- Model-generated patches were evaluated using the multi-swe-bench evaluation framework, which automated the application of each patch to the corresponding codebase, performed the project build and test execution, and produced detailed logs for each evaluation run.
- The primary function of the multi-swe-bench framework in this setup was to provide a standardized, reproducible environment for patch application and test execution across all models. The framework did not compute aggregate metrics itself; instead, it output pass/fail statuses and execution logs for each task and patch.
- From these test results, we computed the metrics ourselves, ensuring consistency and full transparency in the calculation and reporting of key performance indicators such as pass@k, precision, and F1 score.
Evaluation Metrics

pass@k (k = 1, 3): The pass@k metric measures the fraction of tasks for which at least one correct solution appears among the top-k model completions. For each task, we generate k independent model outputs using the MOpenHands agent and evaluate them with the multi-swe-bench framework, which applies each patch to the codebase, runs the tests, and reports the results.
- pass@1 quantifies the success rate when only the top prediction is considered, reflecting the model's deterministic effectiveness.
- pass@3 measures the probability that at least one of the top three predictions is correct. This metric is useful in scenarios where multiple valid solutions exist and the model can surface a correct one among its top-k outputs.
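Under this sampling scheme (k independent completions per task, each verified by the test suite), the empirical pass@k reduces to a per-task "any sample passed" check. A minimal sketch; the function and variable names are illustrative and not part of the actual evaluation scripts:

```python
def pass_at_k(task_results: list[list[bool]], k: int) -> float:
    """Fraction of tasks with at least one passing patch among the first k samples.

    task_results: one inner list per task; each entry is True if that
    sampled patch passed all tests in its evaluation run.
    """
    if not task_results:
        return 0.0
    solved = sum(1 for samples in task_results if any(samples[:k]))
    return solved / len(task_results)

# Example: 3 tasks, 3 sampled patches each (hypothetical outcomes).
results = [[True, False, False], [False, False, True], [False, False, False]]
# pass@1 considers only the first sample per task; pass@3 considers all three.
```

With the example outcomes above, only the first task passes on its first sample (pass@1 = 1/3), while two of the three tasks pass within three samples (pass@3 = 2/3).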
Precision and F1 Score: We calculate precision and F1 score at the patch and test levels, evaluating how accurately the model-generated code modifications correspond to the ground-truth patches. These metrics are derived from the execution logs produced by the multi-swe-bench framework, which provides the raw results of applying the generated patches.
- Precision is defined as the proportion of correctly generated changes among all changes proposed by the model.
- F1 Score is calculated as the harmonic mean between precision and pass@1, providing a balanced evaluation of model performance.
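Given the two definitions above, the metric computation after parsing the pass/fail logs is a few lines of arithmetic. A sketch under exactly those definitions (precision over proposed changes; F1 as the harmonic mean of precision and pass@1); the function names are illustrative:

```python
def precision(correct_changes: int, total_changes: int) -> float:
    """Proportion of correctly generated changes among all changes proposed."""
    return correct_changes / total_changes if total_changes else 0.0

def f1_score(prec: float, pass_at_1: float) -> float:
    """Harmonic mean of precision and pass@1, per this evaluation's definition."""
    if prec + pass_at_1 == 0:
        return 0.0
    return 2 * prec * pass_at_1 / (prec + pass_at_1)
```

For instance, a model whose patches contain 3 correct changes out of 4 proposed (precision 0.75) with a pass@1 of 0.5 would score F1 = 2 · 0.75 · 0.5 / 1.25 = 0.6.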
Infrastructure & Hyperparameters

All model inference was performed using the MOpenHands agent framework, which was extended to interact with the OpenRouter platform. This integration ensures consistent hardware resources, standardized execution environments, and high-throughput evaluation across all experiments. Below, we detail the relevant aspects of the infrastructure, dependencies, and evaluation automation for full reproducibility.
Cloud & Hardware Environment

Inference was conducted via OpenRouter (https://openrouter.ai), which provides unified access to a wide selection of open-source and proprietary LLMs deployed on enterprise-grade infrastructure.
All inference requests were executed on OpenRouter’s backend infrastructure. Specific hardware configurations, such as GPU type (e.g., NVIDIA A100 or RTX 4090), are abstracted and managed by the provider, ensuring scalability and consistency across experiments.
- Batching and Parallelization: Queries were executed using task-level parallelization in the MOpenHands framework, which dispatched individual inference tasks in parallel. This setup maximized throughput and minimized latency by leveraging Python's standard ThreadPoolExecutor. Batch inference was disabled to avoid batch-processing artifacts and to ensure each model query was handled independently.
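Task-level parallelization of this kind can be sketched with the standard library alone. The `run_task` callable and worker count below are hypothetical placeholders for the actual MOpenHands dispatch logic:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_task(task_id: str) -> dict:
    # Placeholder for a single inference task: format the prompt,
    # call the model via the API, and return the generated patch.
    return {"task": task_id, "status": "done"}

task_ids = ["task-001", "task-002", "task-003"]
results = {}
# Each task is submitted independently; no batch inference is used.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(run_task, t): t for t in task_ids}
    for fut in as_completed(futures):
        results[futures[fut]] = fut.result()
```

Because each future wraps exactly one query, a failure or timeout in one task cannot contaminate the decoding of another, which is the stated motivation for disabling batch inference.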
Software Environment

- Containerization: All scripts and tools were executed within Docker containers incorporating the multi-swe-bench framework, ensuring a consistent, isolated environment throughout all experiments.
- Operating System: Ubuntu 22.04 LTS (containerized)
- Primary Language: Python 3.11.8
- Key Frameworks:
  ◦ MOpenHands: Facilitates interaction with models, task management, and GitHub integration for seamless prompt execution and patch generation.
  ◦ multi-swe-bench: Used for evaluating model responses, including patch application, build/test execution, and result logging.
  ◦ OpenRouter: Provides access to a wide selection of models, enabling consistent, standardized inference across experiments.
- No fine-tuning or custom model training was performed; only inference with the provided checkpoints.
Model Inference Parameters

- All model inferences were performed using the standard configuration settings provided by MOpenHands and OpenRouter. These include default decoding parameters and unified prompt templates for each task type, unless otherwise specified by the model provider.
Automation

- Evaluation Orchestration: The evaluation workflow was fully automated using custom Python scripts. All tasks were processed with the MOpenHands framework, which handled interaction with OpenRouter and ensured consistent model evaluation across tasks.
- Parallel Task Processing: Tasks were processed in parallel through MOpenHands using Python's ThreadPoolExecutor, with the number of concurrent threads adapted to the available API tokens and OpenRouter constraints.
Random Seed:

- No random seed was explicitly set during model inference; the default settings of the OpenRouter API and MOpenHands were used. Since all evaluations were conducted via external APIs, results may exhibit minor non-determinism due to stochastic decoding settings (e.g., top-k, top-p, temperature > 0) that are not user-configurable.
Logging:

- Every API call, including all model parameters, request/response payloads, timestamps, and returned completions, was logged and versioned for full auditability.
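Per-call logging of this kind can be as simple as appending one JSON record per request to a JSONL file. A sketch; the field names and paths below are illustrative, not the actual log schema used:

```python
import json
import os
import tempfile
import time

def log_api_call(path: str, params: dict, request: dict, response: dict) -> None:
    # Append one JSON object per call: parameters, payloads, and a timestamp.
    record = {
        "timestamp": time.time(),
        "params": params,
        "request": request,
        "response": response,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Example: log two hypothetical calls to a temporary JSONL file.
log_path = os.path.join(tempfile.mkdtemp(), "api_calls.jsonl")
log_api_call(log_path, {"model": "example"}, {"prompt": "hi"}, {"text": "ok"})
log_api_call(log_path, {"model": "example"}, {"prompt": "bye"}, {"text": "ok"})
```

Append-only JSONL keeps each call independently replayable and diff-friendly for versioning.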
Note: All prompt engineering, context truncation, and input formatting followed the original MOpenHands prompts, ensuring consistency and facilitating reproducibility for future work.