Next-gen datasets for code AI

End-to-end data solutions for code LLMs and AI agents

Train on code your competitors can't access.

Covering every stage of code-data collection and annotation — from pre-training to post-training.

Track record

Scaling quality, not quantity

An engineering team of 124, 21+ programming languages, and 24/7 worldwide delivery — from a pilot to 12,000+ datapoints per month.

4,000
Code repositories
2.27M
Datapoints
21+
Programming languages
124
Engineers & experts
Team

Engineering team & domain experts

124 engineers, annotators, and domain experts. Rigorous hiring pipeline — up to 100 structured interviews per week.

Quality

Multi-level QC

Cross-validation across practicing industry experts. Project-specific optimizations speed up labeling by up to 300%.

Compliance

GDPR · ISO

Compliance with data security and confidentiality standards. Full PII redaction across enterprise data.

How we work

The full lifecycle, end to end.

Not just a data supplier — we own every stage from sourcing to evaluation. Plug us in at any step, or hand off the whole chain. Every handoff is reproducible.

01

Data sourcing

Non-public repos, real enterprise content, licensed archives — never seen in public sets.

02

Generation & SFT

Prompt → response pairs, multi-turn dialogues, code-task authoring at scale.

03

Annotation & review

Human-in-the-loop labeling, rationales, pairwise ranking across 40 criteria.

04

Benchmarks & harnesses

SWE-Bench, Multi-SWE-Bench, Terminal-Bench, RAG eval, Dockerized reproducible environments.

05

Evaluation & red team

Agent-trajectory scoring, plan/thought eval, safety probes, regression testing.

Datasets

Ready-to-license datasets

Non-public code repositories, SWE-Bench benchmarks, alignment sets, and enterprise data.

01 Non-Public Code Repositories for LLM Pre-Training

Production code from real companies — never indexed, never crawled.

Around 3,000 proprietary codebases that have never appeared on public hosting platforms or in public training sets (GitHub, GitLab, Hugging Face). Production-grade repositories from real companies — primarily sourced from a network of outsourcing agencies and startups whose products were discontinued or acquired.

Snapshot

~3,000
Repositories
300M+
Lines of code
100K+
Commits
500K+
Files
~25
Avg contributors
2018–2025
Creation period

Distribution: JS/TS 35%, PHP 30%, Obj-C/Swift 12%, Java 8%, Python 4%, Other 11%.

Composition: 54% discontinued / 46% active or maintained. Full legal rights to license every repository.

Why this matters

  • Zero contamination risk. None of this code exists in public datasets — cleaner pre-training signal, more reliable benchmark evaluation. A minimal overlap-check sketch follows this list.
  • Human-authored guarantee. All code is reviewed by the engineering team. For post-2024 code, LLM-assisted portions are redacted, and the remaining code is verified as production-grade, human-reviewed output.
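
A sketch of the kind of overlap check a lab can run to verify the contamination claim for itself, assuming a hash index built once over a public corpus such as The Stack; the function names and the 8-gram window are illustrative choices, not part of the delivery.

  import hashlib

  def shingles(text: str, n: int = 8) -> set[str]:
      # Hash every n-token window so overlap can be tested set-wise.
      tokens = text.split()
      return {
          hashlib.sha1(" ".join(tokens[i:i + n]).encode()).hexdigest()
          for i in range(max(len(tokens) - n + 1, 1))
      }

  def contamination_rate(candidate: str, public_index: set[str]) -> float:
      # Fraction of the candidate's n-gram hashes already seen publicly.
      grams = shingles(candidate)
      return len(grams & public_index) / max(len(grams), 1)

  # Usage: build public_index from a public corpus once, then flag any
  # repository file whose rate exceeds a small threshold (e.g. 0.01).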

Start with a curated pilot subset to validate quality and fit before scaling.

02 SWE-Bench-Style Benchmarks

Fully compatible with the Multi-SWE-Bench framework.

Open-Source SWE-Bench

  • 8,712 files (~8.85 GB)
  • Task types: bug fixing, code completion, PR generation, automated code review, regression validation, environment-setup verification
  • Artifacts: issue descriptions, PRs, commit messages, golden/test patches, install scripts, Dockerized environments, Parquet metadata (see the loading sketch after this list)
  • Verified: every task tested by real developers
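
A sketch of how the Parquet metadata and Dockerized environments fit together in a patch-apply-and-test loop, assuming pandas with a Parquet engine and a local Docker daemon; the column names (docker_image, test_command) and the patch path inside the container are assumptions, not the shipped schema.

  import subprocess
  import pandas as pd

  tasks = pd.read_parquet("swebench_tasks.parquet")  # hypothetical filename
  row = tasks.iloc[0]

  result = subprocess.run(
      [
          "docker", "run", "--rm", row["docker_image"],
          "bash", "-lc",
          # Apply the golden patch, then run the task's test command.
          f"git apply /task/golden.patch && {row['test_command']}",
      ],
      capture_output=True, text=True,
  )
  print("PASS" if result.returncode == 0 else "FAIL")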

Non-Public-Repository SWE-Bench

  • True out-of-distribution evaluation. Models can't memorize these tasks.
  • Multi-language support (not Python-only like public SWE-Bench Verified).
  • Golden/test patches, Dockerized environments, Parquet metadata for reproducible patch-apply, build, and test workflows.

A ready-made non-public Python repository is available as a pilot — fully annotated and bench-ready.

Coverage by language

Kotlin · 327
Rust · 250
Go · 207
PHP · 166
C++ · 122
Java · 110
C · 109
C# · 102
Ruby · 80
TS · 47
Scala · 39
JS · 18
Total · 1,577

→ View sample (Hugging Face)

03 Alignment & Evaluation Data

The full alignment stack — fine-tuning, evaluation, safety, and red-teaming — covering both code models and code agents.

For code agents

  • Agent Behavior Trajectories — datasets of agent actions and reasoning for accuracy and performance improvements; see the record sketch after this list.
  • Agent Architecture Analysis — evaluation datasets for agent architectures: search, tool integration, and connections to external systems.
  • Dialog Evaluation — datasets scoring dialog history across multiple criteria.
  • Dialog Safety — evaluation suites for politeness, honesty, and integrity.
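
An illustrative shape for a single trajectory datapoint, written as plain Python; every field name here is an assumption made for the example, not the delivered schema.

  trajectory = {
      "task_id": "repo-142/issue-88",
      "steps": [
          {
              "thought": "The stack trace points at the date parser.",
              "action": {"tool": "grep", "args": {"pattern": "parse_date"}},
              "observation": "src/utils/dates.py:41: def parse_date(...)",
          },
          {
              "thought": "Fix the off-by-one in the month index.",
              "action": {"tool": "edit", "args": {"path": "src/utils/dates.py"}},
              "observation": "edit applied",
          },
      ],
      "outcome": {"tests_passed": True, "annotator_score": 4.5},
  }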

For code models

  • Fine-tuning & Alignment — SFT dialogues, RLHF pairwise comparisons, critique annotations.
  • Prompt-to-Code Datasets — generation-focused training data.
  • RAG Evaluation — retrieval-augmented generation evaluation datasets.
  • Red Teaming — adversarial multi-step scenario datasets.
  • Open Source Dataset Expansion — popular open datasets extended with new sources, new domains, and a refined taxonomy.

04 Enterprise Data

Internal enterprise data sourced from real companies — active, acquired, or wound down — each with certified consent to license.

Data types

  • Team communication platforms (Slack, Discord) and CRM systems
  • Task trackers (Jira, YouTrack)
  • Meeting recordings with transcripts
  • Knowledge bases and internal documentation

Coverage & compliance

~20
Companies today
100+
Companies at scale
  • Certified consent from each source company.
  • Full PII redaction.

Start with a 10-company pilot at a bulk rate.

→ View metadata (Google Sheets)

Custom

Tell us what you need — we'll build it.

Beyond the ready-to-license catalog, our team builds bespoke datasets from scoping to delivery: sourcing, annotation, QC, and handoff. Pre-training corpora, evaluation benchmarks, RLHF pairs, agent trajectories, safety probes, RAG sets — whatever your pipeline needs.

What we collect

Any code-related data, any format

  • Non-public code — from our partner network of agencies and startups, matched to your stack and domain.
  • Benchmarks & evaluation sets — bug-fix, PR review, agent trajectories, RAG, red-teaming.
  • Enterprise & multimodal data — Slack/Jira/meetings, images, video, audio — consent and PII redaction included.
  • Human-in-the-loop pipelines — SFT dialogues, RLHF pairwise comparisons, critique annotations, expert review.

How it works

Turnkey, on your schedule

  • Scoping — task taxonomy, data schema, quality criteria defined with your team.
  • Expert annotators — practicing developers and domain specialists, not crowd workers.
  • Multi-level QC — cross-validation, calibration, project-specific optimizations.
  • Scalable throughput — from a pilot to 12,000+ datapoints per month, 21+ languages, 24/7 delivery.

Beyond data delivery

Built into your pipeline

A dataset on its own rarely moves the needle. Our team stands up the infrastructure around it — benchmark harnesses, eval pipelines, annotation tooling, embedded experts — so the data is usable on day one and keeps producing signal long after.

Benchmark infrastructure

Reproducible eval environments

Dockerized SWE-Bench, Multi-SWE-Bench, and Terminal-Bench-style harnesses with golden/test patches, install scripts, and Parquet metadata — patch-apply, build, and test runs work identically on our machines and yours.
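
A sketch of what "identically on our machines and yours" means in practice: every task re-runs inside an image pinned by digest rather than by tag. The manifest file and its field names are hypothetical.

  import json
  import subprocess

  def run_task(task: dict) -> bool:
      # Pinning by digest, not tag, is what keeps runs reproducible
      # across machines; the image can never silently change.
      cmd = [
          "docker", "run", "--rm",
          f"{task['image']}@{task['image_digest']}",
          "bash", "-lc", task["test_command"],
      ]
      return subprocess.run(cmd, capture_output=True).returncode == 0

  with open("harness_manifest.json") as fh:  # hypothetical manifest
      manifest = json.load(fh)

  results = {t["task_id"]: run_task(t) for t in manifest["tasks"]}
  print(f"{sum(results.values())}/{len(results)} tasks reproduced")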

Integration

Wired into your training & eval stack

Schema design, format conversion, ingestion adapters, and continuous delivery on your cadence. Datasets land in the shape your pipeline already expects — no glue code on your side.
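
A sketch of one such ingestion adapter, assuming the common case of Parquet in and JSONL chat records out; the column and field names on both sides stand in for whatever schema is agreed during scoping.

  import json
  import pandas as pd

  def to_chat_record(row: pd.Series) -> dict:
      # Map a delivered row onto the chat shape a typical SFT trainer expects.
      return {
          "messages": [
              {"role": "user", "content": row["prompt"]},
              {"role": "assistant", "content": row["response"]},
          ],
          "meta": {"source": row["repo_id"], "license": row["license"]},
      }

  df = pd.read_parquet("delivery_batch_001.parquet")  # hypothetical filename
  with open("train.jsonl", "w") as out:
      for _, row in df.iterrows():
          out.write(json.dumps(to_chat_record(row)) + "\n")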

Evaluation as a service

Run scoring against expert rubrics

Agent-trajectory and plan/thought scoring, RAG retrieval accuracy with path:line citations, pairwise ranking across 40 criteria — calibrated against expert baselines.
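
One standard way to read "calibrated against expert baselines" is chance-corrected agreement between a rater's pairwise verdicts and the expert set. A minimal Cohen's kappa, as a sketch rather than the in-house scoring pipeline:

  from collections import Counter

  def cohens_kappa(rater: list[str], expert: list[str]) -> float:
      # Observed agreement, corrected for the agreement expected by chance.
      n = len(rater)
      p_o = sum(a == b for a, b in zip(rater, expert)) / n
      ra, ex = Counter(rater), Counter(expert)
      p_e = sum(ra[k] * ex[k] for k in set(ra) | set(ex)) / (n * n)
      return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

  # Verdicts ("A", "B", or "tie") on the same comparison set:
  rater = ["A", "B", "A", "tie", "B", "A"]
  expert = ["A", "B", "B", "tie", "B", "A"]
  print(round(cohens_kappa(rater, expert), 3))  # agreement above chance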

Annotation tooling

Labeling platforms tailored to your rubric

Argilla and bespoke portals stood up per project: task taxonomy, batch assignment, multi-stage QC gates, and per-rater calibration configured to match your evaluation methodology.

Embedded experts

Engineers inside your roadmap

Practicing developers, ML engineers, and domain specialists working as an extension of your team — scoping task taxonomies, defining quality criteria, and resolving edge cases as they surface.

Safety & robustness

Red-teaming and adversarial eval

Adversarial multi-step scenarios, MCP tool-access stress tests, dialog safety probes, and regression suites that catch failure modes before deployment.

About

Three years on one thing: training data for coding AI

Most AI vendors do everything. Fermatix does code data — and only code data — at production grade for the labs and startups building frontier code models.

Why focus matters

Public code corpora are saturated. The frontier of code-model performance now depends on code that hasn't been seen — proprietary repositories, real enterprise interactions, multimodal corpora with documented provenance. Sourcing it takes a different kind of organization: legal infrastructure, anonymization pipelines, expert-only annotation. That's the only thing Fermatix builds.

Who we work with

Three years working with leading code-AI teams has shaped the catalog. Every dataset in production today exists because a partner asked for something they couldn't get elsewhere. We deliver the data, document its origin, and stay out of the way.

Partner

Not a vendor

Long-term engagements with frontier code-AI teams. We grow inside our clients' roadmaps, not outside them.

Supply

Unique code-data pipeline

Non-public codebases and enterprise data competitors can't access — continuously sourced, all properly licensed, traced to origin.

Quality

Human-reviewed, end-to-end

Every output reviewed by working engineers with production experience. No crowd-sourced labeling, no untrained annotators.

Blog

Research & engineering notes

Contact

Start a pilot

Curated subset of non-public repositories and benchmark tasks for hands-on quality validation. Schedule a technical deep dive with our engineering team.

Email: hi@fermatix.ai

Website: Fermatix.AI


AVENIDAS INTELIGENTES, LDA

Lg Alberto Sampaio, 3 A, Sala 10

Linda a Velha, 2795-007

Portugal