End‑to‑end data solutions for code LLMs and AI agents

Fast, no‑friction execution across every stage of code‑data collection and annotation — from pre‑training to post‑training.

  • 4,000+ code repositories
  • 322K datapoints
  • 21 programming languages
  • 30 engineers & experts
Team

Engineering team & domain experts

30 engineers, annotators, and domain experts. Rigorous hiring pipeline — up to 100 structured interviews per week.

Quality

Multi‑level QC

Cross‑validation by practicing industry experts. Multi‑stage QC gates and per‑rater calibration.

Compliance

GDPR · ISO

Compliance with data security and confidentiality standards. Full PII redaction across enterprise data.

How we work

The full lifecycle, end to end

Not just a data supplier — we own every stage from sourcing to evaluation. Plug us in at any step, or hand off the whole chain.

01

Data sourcing

Non‑public repos, real enterprise content, licensed archives — never seen in public sets.

02

Generation & SFT

Prompt → response pairs, multi‑turn dialogues, code‑task authoring at scale.

03

Annotation & review

Human‑in‑the‑loop labeling, rationales, pairwise ranking across 40 criteria.

04

Benchmarks & harnesses

SWE‑Bench, Multi‑SWE‑Bench, Harbor, RAG eval, Dockerized reproducible environments.

05

Evaluation & red team

Agent‑trajectory scoring, plan/thought eval, safety probes, regression testing.

Datasets

Ready‑to‑license datasets

Non‑public code repositories, SWE‑Bench benchmarks, alignment sets, and enterprise data.

01 Non‑Public Code Repositories for LLM Pre‑Training

Production code from real companies — never indexed, never crawled.

Around 3,000 proprietary codebases that have never appeared on public hosts (GitHub, GitLab, Hugging Face) or in the training sets crawled from them. Production-grade repositories from real companies, sourced primarily from a network of outsourcing agencies and from startups whose products were discontinued or acquired.

Snapshot

  • ~3,000 repositories
  • 300M+ lines of code
  • 100K+ commits
  • 500K+ files
  • ~25 average contributors per repository
  • 2018–2025 creation period

Distribution: JS/TS 35%, PHP 30%, Obj‑C/Swift 12%, Java 8%, Python 4%, Other 11%.

Composition: 54% discontinued / 46% active or maintained. Full legal rights to license every repository.

02 SWE‑Bench‑Style Benchmarks

Open‑Source SWE‑Bench

  • ~8,712 files (~8.85 GB)
  • Task types: bug fixing, code completion, PR generation, automated code review, regression validation, environment‑setup verification
  • Artifacts: issue descriptions, PRs, commit messages, golden/test patches, install scripts, Dockerized environments, Parquet metadata
  • Verified: every task tested by real developers
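
To make the artifact bundle concrete, here is a minimal sketch of one task record. The field names follow the public SWE-Bench schema (instance_id, base_commit, patch, test_patch, FAIL_TO_PASS, PASS_TO_PASS); the exact columns in a delivery may differ, and every value below is an illustrative placeholder.

    # Illustrative task record, modeled on the public SWE-Bench schema.
    # All values are placeholders, not drawn from the actual dataset.
    task = {
        "instance_id": "example-org__example-repo-1234",
        "repo": "example-org/example-repo",
        "base_commit": "0123abcd",                                # commit the patches apply to
        "problem_statement": "Text of the original issue description.",
        "patch": "diff --git a/src/app.py b/src/app.py\n...",     # golden patch
        "test_patch": "diff --git a/tests/test_app.py ...",       # tests added for the fix
        "FAIL_TO_PASS": ["tests/test_app.py::test_bug_fixed"],    # must pass after patching
        "PASS_TO_PASS": ["tests/test_app.py::test_existing"],     # must keep passing
    }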

Non‑Public‑Repository SWE‑Bench

  • True out-of-distribution evaluation: these tasks have never been public, so models cannot have memorized them.
  • Multi‑language support (not Python‑only like public SWE‑Bench Verified).
  • Golden/test patches, Dockerized environments, Parquet metadata for reproducible patch‑apply, build, and test workflows.
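
Reading the metadata and reconstructing a task's evaluation inputs is a one-liner with pandas; a minimal sketch, with hypothetical file and column names (the actual schema is documented per delivery):

    import pandas as pd

    # Hypothetical file and column names -- the delivered schema may differ.
    tasks = pd.read_parquet("swe_bench_private_tasks.parquet")

    row = tasks.iloc[0]
    print(row["language"], row["repo"], row["base_commit"])
    print(row["docker_image"])  # pinned environment for the patch/build/test run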

Coverage by language

  • Kotlin: 327
  • Rust: 250
  • Go: 207
  • PHP: 166
  • C++: 122
  • Java: 110
  • C: 109
  • C#: 102
  • Ruby: 80
  • TypeScript: 47
  • Scala: 39
  • JavaScript: 18
  • Total: 1,577

→ View sample (Hugging Face)

03 Post‑training data

  • SFT dialogues — instruction‑tuning data authored by practicing engineers, not crowd workers.
  • RLHF & DPO pairs — pairwise preferences and rejection-sampling comparisons scored against custom rubrics (see the JSONL sketch after this list).
  • Prompt‑to‑Code datasets — generation‑focused training data.
  • RAG evaluation — retrieval‑augmented generation evaluation datasets.
  • Red teaming — adversarial multi‑step scenario datasets.
  • Open‑source dataset expansion — popular open datasets extended with new sources, domains, and refined taxonomy.
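
As a concrete illustration of the RLHF & DPO item above, here is one preference pair in the widely used prompt/chosen/rejected layout, serialized as a JSONL line; the rubric field is a hypothetical stand-in for project-specific criteria.

    import json

    # One DPO preference pair; "rubric" is an illustrative per-criterion score map.
    pair = {
        "prompt": "Write a Python function that reverses a singly linked list.",
        "chosen": (
            "def reverse(head):\n"
            "    prev = None\n"
            "    while head:\n"
            "        head.next, prev, head = prev, head, head.next\n"
            "    return prev"
        ),
        "rejected": "def reverse(head):\n    return head[::-1]  # not a linked list",
        "rubric": {"correctness": 1, "idiomatic": 1},
    }
    with open("dpo_pairs.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(pair) + "\n")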

04 Agent data

  • Agent behavior trajectories — step-by-step records of agent actions and reasoning, used to improve agent accuracy and performance (see the sketch after this list).
  • Agent architecture analysis — evaluation datasets for architectures: search, integration, tools, and external systems.
  • Dialog evaluation — datasets scoring dialog history across multiple criteria.
  • Dialog safety — evaluation suites for politeness, honesty, and integrity.
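
A minimal sketch of what a single trajectory datapoint could look like, assuming a flat list of thought/action/observation steps; real schemas are agreed per project, and every value here is a placeholder.

    # Hypothetical agent-trajectory record: an ordered list of
    # thought / action / observation steps plus a final outcome label.
    trajectory = {
        "task": "Fix the failing test in tests/test_parser.py",
        "steps": [
            {"thought": "Run the suite to locate the failure.",
             "action": {"tool": "shell", "args": "pytest tests/test_parser.py -x"},
             "observation": "1 failed: test_parse_empty_input"},
            {"thought": "The parser crashes on empty input; add a guard clause.",
             "action": {"tool": "edit_file", "args": "src/parser.py"},
             "observation": "File updated."},
        ],
        "outcome": {"resolved": True, "steps_used": 2},
    }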

05 Enterprise Data

Internal enterprise data sourced from real companies — active, acquired, or wound down — each with certified consent to license.

Data types

  • Team communication platforms (Slack, Discord) and CRM systems
  • Task trackers (Jira, YouTrack)
  • Meeting recordings with transcripts
  • Knowledge bases and internal documentation

Coverage & compliance

  • ~20 companies today
  • Scalable to 100+ companies
  • Certified consent from each source company.
  • Full PII redaction.
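
As an illustration of what redaction involves, a minimal regex-based sketch that masks emails and phone-like numbers; production redaction covers far more entity types (names, addresses, internal identifiers) and uses stronger detection than two patterns.

    import re

    # Minimal illustration only: mask emails and phone-like digit runs.
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

    def redact(text: str) -> str:
        text = EMAIL.sub("[EMAIL]", text)
        return PHONE.sub("[PHONE]", text)

    print(redact("Ping maria@acme.example or +351 21 123 4567 re: the Jira ticket."))
    # -> Ping [EMAIL] or [PHONE] re: the Jira ticket.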

→ View metadata (Google Sheets)

Integration

Data delivered in the shape your pipeline expects

Not just data delivery — our engineers integrate at every stage of your pipeline. Standard formats, your evaluation frameworks, schema design, ingestion adapters, continuous delivery on your cadence — data arrives ready to train or benchmark.

Formats

Drop‑in formats and schemas

JSONL for SFT and DPO, Parquet for pre‑training corpora, Hugging Face Datasets for publishing. Conversation data in ShareGPT, Alpaca, or OpenAI chat schemas — drop‑in for your training loop, no conversion on your side.
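
For example, one SFT dialogue in the OpenAI chat schema occupies a single JSONL line; a minimal sketch with placeholder content:

    import json

    # One dialogue per JSONL line, in the OpenAI chat "messages" layout.
    example = {
        "messages": [
            {"role": "system", "content": "You are a senior Go engineer."},
            {"role": "user", "content": "Why does this goroutine leak?"},
            {"role": "assistant", "content": "The channel is never closed, so ..."},
        ]
    }
    print(json.dumps(example))  # append lines like this to train.jsonl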

Reproducibility

Reproducible eval environments

Dockerized SWE‑Bench and Multi‑SWE‑Bench harnesses (Harbor‑compatible), RAG eval — patch‑apply, build, and test runs work identically on our machines and yours.
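
Concretely, reproducibility means one pinned image runs the same patch-apply, build, and test sequence anywhere Docker is available. A minimal sketch, with a hypothetical image tag and in-container paths (the harness pins the real image per task):

    import subprocess

    # Hypothetical image and paths; one ephemeral container runs the whole sequence.
    IMAGE = "registry.example.com/swe-task-1234:pinned"
    SCRIPT = "git apply /task/golden.patch && make build && pytest -q"

    subprocess.run(["docker", "run", "--rm", IMAGE, "bash", "-lc", SCRIPT], check=True)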

Embedded experts

Engineers inside your roadmap

Practicing developers, ML engineers, and domain specialists as an extension of your team — scoping taxonomies, defining quality criteria, resolving edge cases as they surface.

Blog

Research & engineering notes

Contact

Start a pilot

Curated subset of non‑public repositories and benchmark tasks for hands‑on quality validation. Schedule a technical deep dive with our engineering team.

Email: hi@fermatix.ai


AVENIDAS INTELIGENTES, LDA

Lg Alberto Sampaio, 3 A, Sala 10

Linda a Velha, 2795‑007

Portugal