End-to-end data solutions for code LLMs and AI agents

Train on code your competitors can't access.

Track record

A 30-person engineering team, 21+ programming languages, and 24/7 worldwide delivery, scaling from a pilot to 12,000+ datapoints per month.

4,000
Code repositories
322K
Datapoints
21+
Programming languages
30
Engineers & experts
Team

Engineering team & domain experts

30 engineers, annotators, and domain experts. Rigorous hiring pipeline — up to 100 structured interviews per week.

Quality

Multi-level QC

Cross-validation by practicing industry experts. Multi-stage QC gates and per-rater calibration.

Compliance

GDPR · ISO

Compliance with data security and confidentiality standards. Full PII redaction across enterprise data.

How we work

The full lifecycle, end to end

Not just a data supplier — we own every stage from sourcing to evaluation. Plug us in at any step, or hand off the whole chain. Every handoff is reproducible.

01

Data sourcing

Non-public repos, real enterprise content, licensed archives — never seen in public sets.

02

Generation & SFT

Prompt → response pairs, multi-turn dialogues, code-task authoring at scale.

03

Annotation & review

Human-in-the-loop labeling, rationales, pairwise ranking across 40 criteria.

04

Benchmarks & harnesses

SWE-Bench, Multi-SWE, Terminal-Bench, RAG eval, Dockerized reproducible environments.

05

Evaluation & red team

Agent-trajectory scoring, plan/thought eval, safety probes, regression testing.

Datasets

Ready-to-license datasets

Non-public code repositories, SWE-Bench benchmarks, alignment sets, and enterprise data.

01 Non-Public Code Repositories for LLM Pre-Training

Production code from real companies — never indexed, never crawled.

Around 3,000 proprietary codebases that have never appeared on public code platforms (GitHub, GitLab, Hugging Face) or in the training sets built from them. Production-grade repositories from real companies — primarily sourced from a network of outsourcing agencies and startups whose products were discontinued or acquired.

Snapshot

~3,000
Repositories
300M+
Lines of code
100K+
Commits
500K+
Files
~25
Avg contributors
2018–25
Creation period

Distribution: JS/TS 35%, PHP 30%, Obj-C/Swift 12%, Java 8%, Python 4%, Other 11%.

Composition: 54% discontinued / 46% active or maintained. Full legal rights to license every repository.

Why this matters

  • Zero contamination risk. None of this code exists in public datasets — cleaner pre-training signal, more reliable benchmark evaluation.
  • Human-authored guarantee. All code reviewed by the engineering team. For post-2024 code, any LLM-assisted portions are redacted, and the remainder is verified as production-grade, human-reviewed output.

Start with a curated pilot subset to validate quality and fit before scaling.

02 SWE-Bench-Style Benchmarks

Fully compatible with the Multi-SWE-Bench framework.

Open-Source SWE-Bench

  • ~8,712 files (~8.85 GB)
  • Task types: bug fixing, code completion, PR generation, automated code review, regression validation, environment-setup verification
  • Artifacts: issue descriptions, PRs, commit messages, golden/test patches, install scripts, Dockerized environments, Parquet metadata (a schematic task record is sketched after this list)
  • Verified: every task tested by real developers
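
For illustration, a minimal sketch of what a single task record in this style can look like. The field names follow the public SWE-Bench instance schema; every value below is a placeholder, not actual dataset content.

    # Illustrative only: field names follow the public SWE-Bench instance schema;
    # every value is a placeholder, not real dataset content.
    example_task = {
        "instance_id": "acme__billing-1234",            # hypothetical repo + issue id
        "repo": "acme/billing",                         # hypothetical repository
        "base_commit": "<sha of the pre-fix commit>",
        "problem_statement": "Invoice totals drift by one cent when ...",
        "patch": "<golden patch: the diff that resolves the issue>",
        "test_patch": "<diff adding the tests that must pass after the fix>",
        "FAIL_TO_PASS": ["tests/test_totals.py::test_rounding"],
        "PASS_TO_PASS": ["tests/test_totals.py::test_existing_behaviour"],
        "environment_setup_commit": "<sha used to build the Docker image>",
    }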

Non-Public-Repository SWE-Bench

  • True out-of-distribution evaluation. Models can't memorize these tasks.
  • Multi-language support (not Python-only like public SWE-Bench Verified).
  • Golden/test patches, Dockerized environments, Parquet metadata for reproducible patch-apply, build, and test workflows (a minimal run sketch follows below).
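
Below, a minimal sketch of how such a patch-apply and test run can be driven from Python against a pre-built task image. The image tag, mount paths, and test command are hypothetical placeholders, not the actual harness interface.

    import subprocess

    def run_task(image: str, golden_patch: str, test_cmd: str) -> bool:
        """Apply a golden patch inside a task's Docker image and run its tests.
        Sketch only: image tag, repo path, and test command are placeholders."""
        apply_and_test = "cd /workspace/repo && git apply /mnt/golden.patch && " + test_cmd
        result = subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{golden_patch}:/mnt/golden.patch:ro",
             image, "bash", "-lc", apply_and_test],
            capture_output=True, text=True,
        )
        return result.returncode == 0  # True when the golden patch makes the tests pass

    # Hypothetical usage:
    # run_task("registry.example.com/swe-task:billing-1234", "/data/golden.patch", "pytest -q")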

A ready-made non-public Python repository is available as a pilot — fully annotated and bench-ready.

Coverage by language

  • Kotlin: 327
  • Rust: 250
  • Go: 207
  • PHP: 166
  • C++: 122
  • Java: 110
  • C: 109
  • C#: 102
  • Ruby: 80
  • TS: 47
  • Scala: 39
  • JS: 18
  • Total: 1,577

→ View sample (Hugging Face)

03 Post-training data

Fine-tuning, alignment, RAG and safety data for code models.

  • SFT dialogues — instruction-tuning data authored by practicing engineers, not crowd workers.
  • RLHF & DPO pairs — pairwise preferences, rejection sampling across custom rubrics (a schematic record is sketched after this list).
  • Prompt-to-Code datasets — generation-focused training data.
  • RAG evaluation — retrieval-augmented generation evaluation datasets.
  • Red teaming — adversarial multi-step scenario datasets.
  • Open-source dataset expansion — popular open datasets extended with new sources, domains, and refined taxonomy.
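
As referenced above, a minimal sketch of a single DPO preference record serialized to JSONL. The prompt/chosen/rejected field names follow a common convention (for example, TRL's DPOTrainer); the values and rubric tags are placeholders.

    import json

    # Illustrative DPO preference record: field names follow a common
    # prompt/chosen/rejected convention (e.g. TRL's DPOTrainer);
    # all values and the rubric tags are placeholders.
    pair = {
        "prompt": "Write a Python function that deduplicates a list while preserving order.",
        "chosen": "def dedupe(items):\n    seen = set()\n    ...",
        "rejected": "def dedupe(items):\n    return list(set(items))  # loses ordering",
        "criteria": ["correctness", "ordering preserved", "readability"],
    }

    with open("dpo_pairs.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")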

04 Agent data

Trajectories, architecture eval, dialog scoring and safety probes for code agents.

  • Agent behavior trajectories — datasets of agent actions and reasoning for improving accuracy and performance (a schematic trajectory record is sketched after this list).
  • Agent architecture analysis — evaluation datasets for architectures: search, integration, tools, and external systems.
  • Dialog evaluation — datasets scoring dialog history across multiple criteria.
  • Dialog safety — evaluation suites for politeness, honesty, and integrity.
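
As referenced above, a minimal sketch of what one trajectory record can look like. The schema (thought/action/observation steps plus an outcome score) is hypothetical and shown only to indicate the kind of structure these datasets take.

    # Hypothetical trajectory record; the schema is illustrative, not the delivered format.
    trajectory = {
        "task_id": "fix-flaky-test-0042",  # placeholder
        "steps": [
            {"thought": "The failure looks like a race condition in the cache layer.",
             "action": "open_file", "args": {"path": "cache/store.py"},
             "observation": "<file contents>"},
            {"thought": "Add a lock around the write path and re-run the suite.",
             "action": "run_tests", "args": {"cmd": "pytest -q"},
             "observation": "212 passed"},
        ],
        "outcome": {"resolved": True, "score": 0.92},  # per-trajectory rating
    }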

05 Enterprise Data

Internal enterprise data sourced from real companies — active, acquired, or wound down — each with certified consent to license.

Data types

  • Team communication platforms (Slack, Discord) and CRM systems
  • Task trackers (Jira, YouTrack)
  • Meeting recordings with transcripts
  • Knowledge bases and internal documentation

Coverage & compliance

~20 companies now, scalable to 100+.
  • Certified consent from each source company.
  • Full PII redaction.

Start with a 10-company pilot at a bulk rate.

→ View metadata (Google Sheets)

Integration

Data delivered in the shape your pipeline expects

Not just data delivery — our engineers integrate at every stage of your pipeline. Standard formats, your evaluation frameworks, schema design, ingestion adapters, continuous delivery on your cadence — data arrives ready to train or benchmark, no glue code on your side.

Formats

Drop-in formats and schemas

JSONL for SFT and DPO, Parquet for pre-training corpora, HuggingFace Datasets for publishing. Conversation data in ShareGPT, Alpaca, or OpenAI chat schemas — drop-in for your training loop, no conversion on your side.
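
For illustration, one SFT conversation in the ShareGPT-style schema and how it loads straight into a training loop via HuggingFace Datasets. The file name and conversation content are placeholders.

    from datasets import load_dataset  # pip install datasets

    # One ShareGPT-style conversation per JSONL line, e.g.:
    # {"conversations": [
    #   {"from": "human", "value": "Why does this goroutine leak?"},
    #   {"from": "gpt",   "value": "The channel is never closed, so ..."}
    # ]}
    # "sft_dialogues.jsonl" is a hypothetical file name.
    ds = load_dataset("json", data_files="sft_dialogues.jsonl", split="train")
    print(ds[0]["conversations"][0]["value"])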

Reproducibility

Reproducible eval environments

Dockerized SWE-Bench and Multi-SWE-Bench harnesses (Harbor-compatible), RAG eval — patch-apply, build, and test runs work identically on our machines and yours.

Embedded experts

Engineers inside your roadmap

Practicing developers, ML engineers, and domain specialists as an extension of your team — scoping taxonomies, defining quality criteria, resolving edge cases as they surface.

Blog

Research & engineering notes

Contact

Start a pilot

Curated subset of non-public repositories and benchmark tasks for hands-on quality validation. Schedule a technical deep dive with our engineering team.

Email: hi@fermatix.ai

Website: Fermatix.AI


AVENIDAS INTELIGENTES, LDA

Lg Alberto Sampaio, 3 A, Sala 10

Linda a Velha, 2795-007

Portugal