Next-gen datasets for code AI

End-to-end data solutions for code LLMs and AI agents

Train on code your competitors can't access.

Covering every stage of code-data collection and annotation — from pre-training to post-training.

Track record

Scaling quality, not quantity

An engineering team of 124, 21+ programming languages, and 24/7 worldwide delivery — from a pilot to 12,000+ datapoints per month.

4,000
Code repositories
2.27M
Datapoints
21+
Programming languages
124
Engineers & experts
Team

Engineering team & domain experts

124 engineers, annotators, and domain experts. Rigorous hiring pipeline — up to 100 structured interviews per week.

Quality

Multi-level QC

Cross-validation across practicing industry experts. Project-specific optimizations speed up labeling by up to 300%.

Compliance

GDPR · ISO

Compliance with data security and confidentiality standards. Full PII redaction across enterprise data.

How we work

The full lifecycle, end to end.

Not just a data supplier — we own every stage from sourcing to evaluation. Plug us in at any step, or hand off the whole chain. Every handoff is reproducible.

01

Data sourcing

Non-public repos, real enterprise content, licensed archives — never seen in public sets.

02

Generation & SFT

Prompt → response pairs, multi-turn dialogues, code-task authoring at scale.

03

Annotation & review

Human-in-the-loop labeling, rationales, pairwise ranking across 40 criteria.

04

Benchmarks & harnesses

SWE-Bench, Multi-SWE-Bench, Terminal-Bench, RAG eval, Dockerized reproducible environments.

05

Evaluation & red team

Agent-trajectory scoring, plan/thought eval, safety probes, regression testing.

Datasets

Ready-to-license datasets

Non-public code repositories, SWE-Bench benchmarks, alignment sets, and enterprise data.

01 Non-Public Code Repositories for LLM Pre-Training

Production code from real companies — never indexed, never crawled.

Around 3,000 proprietary codebases that have never appeared on public hosting platforms or in public training sets (GitHub, GitLab, Hugging Face). Production-grade repositories from real companies — primarily sourced from a network of outsourcing agencies and startups whose products were discontinued or acquired.

Snapshot

~3,000
Repositories
300M+
Lines of code
100K+
Commits
500K+
Files
~25
Avg contributors
2018–2025
Creation period

Distribution: JS/TS 35%, PHP 30%, Obj-C/Swift 12%, Java 8%, Python 4%, Other 11%.

Composition: 54% discontinued / 46% active or maintained. Full legal rights to license every repository.

Why this matters

  • Zero contamination risk. None of this code exists in public datasets — cleaner pre-training signal, more reliable benchmark evaluation. A minimal overlap-check sketch follows this list.
  • Human-authored guarantee. All code is reviewed by the engineering team. For post-2024 code, LLM-assisted portions are redacted, and the remaining code is verified as production-grade, human-reviewed output.
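
A sketch of the kind of overlap check a lab can run to verify the contamination claim for itself, assuming a hash index built once over a public corpus such as The Stack; the function names and the 8-gram window are illustrative choices, not part of the delivery.

  import hashlib

  def shingles(text: str, n: int = 8) -> set[str]:
      # Hash every n-token window so overlap can be tested set-wise.
      tokens = text.split()
      return {
          hashlib.sha1(" ".join(tokens[i:i + n]).encode()).hexdigest()
          for i in range(max(len(tokens) - n + 1, 1))
      }

  def contamination_rate(candidate: str, public_index: set[str]) -> float:
      # Fraction of the candidate's n-gram hashes already seen publicly.
      grams = shingles(candidate)
      return len(grams & public_index) / max(len(grams), 1)

  # Usage: build public_index from a public corpus once, then flag any
  # repository file whose rate exceeds a small threshold (e.g. 0.01).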

Start with a curated pilot subset to validate quality and fit before scaling.

02 SWE-Bench-Style Benchmarks

Fully compatible with the Multi-SWE-Bench framework.

Open-Source SWE-Bench

  • 8,712 files (~8.85 GB)
  • Task types: bug fixing, code completion, PR generation, automated code review, regression validation, environment-setup verification
  • Artifacts: issue descriptions, PRs, commit messages, golden/test patches, install scripts, Dockerized environments, Parquet metadata (see the loading sketch after this list)
  • Verified: every task tested by real developers
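
A sketch of how the Parquet metadata and Dockerized environments fit together in a patch-apply-and-test loop, assuming pandas with a Parquet engine and a local Docker daemon; the column names (docker_image, test_command) and the patch path inside the container are assumptions, not the shipped schema.

  import subprocess
  import pandas as pd

  tasks = pd.read_parquet("swebench_tasks.parquet")  # hypothetical filename
  row = tasks.iloc[0]

  result = subprocess.run(
      [
          "docker", "run", "--rm", row["docker_image"],
          "bash", "-lc",
          # Apply the golden patch, then run the task's test command.
          f"git apply /task/golden.patch && {row['test_command']}",
      ],
      capture_output=True, text=True,
  )
  print("PASS" if result.returncode == 0 else "FAIL")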

Non-Public-Repository SWE-Bench

  • True out-of-distribution evaluation. Models can't memorize these tasks.
  • Multi-language support (not Python-only like public SWE-Bench Verified).
  • Golden/test patches, Dockerized environments, Parquet metadata for reproducible patch-apply, build, and test workflows.

A ready-made non-public Python repository is available as a pilot — fully annotated and bench-ready.

Coverage by language

Kotlin · 327
Rust · 250
Go · 207
PHP · 166
C++ · 122
Java · 110
C · 109
C# · 102
Ruby · 80
TS · 47
Scala · 39
JS · 18
Total · 1,577

→ View sample (Hugging Face)

03 Alignment & Evaluation Data

The full alignment stack — fine-tuning, evaluation, safety, and red-teaming — covering both code models and code agents.

For code agents

  • Agent Behavior Trajectories — datasets of agent actions and reasoning for accuracy and performance improvements; see the record sketch after this list.
  • Agent Architecture Analysis — evaluation datasets for agent architectures: search, tool integration, and connections to external systems.
  • Dialog Evaluation — datasets scoring dialog history across multiple criteria.
  • Dialog Safety — evaluation suites for politeness, honesty, and integrity.
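
An illustrative shape for a single trajectory datapoint, written as plain Python; every field name here is an assumption made for the example, not the delivered schema.

  trajectory = {
      "task_id": "repo-142/issue-88",
      "steps": [
          {
              "thought": "The stack trace points at the date parser.",
              "action": {"tool": "grep", "args": {"pattern": "parse_date"}},
              "observation": "src/utils/dates.py:41: def parse_date(...)",
          },
          {
              "thought": "Fix the off-by-one in the month index.",
              "action": {"tool": "edit", "args": {"path": "src/utils/dates.py"}},
              "observation": "edit applied",
          },
      ],
      "outcome": {"tests_passed": True, "annotator_score": 4.5},
  }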

For code models

  • Fine-tuning & Alignment — SFT dialogues, RLHF pairwise comparisons, critique annotations.
  • Prompt-to-Code Datasets — generation-focused training data.
  • RAG Evaluation — retrieval-augmented generation evaluation datasets.
  • Red Teaming — adversarial multi-step scenario datasets.
  • Open Source Dataset Expansion — popular open datasets extended with new sources, new domains, and a refined taxonomy.

04 Enterprise Data

Internal enterprise data sourced from real companies — active, acquired, or wound down — each with certified consent to license.

Data types

  • Team communication platforms (Slack, Discord) and CRM systems
  • Task trackers (Jira, YouTrack)
  • Meeting recordings with transcripts
  • Knowledge bases and internal documentation

Coverage & compliance

~20
Companies today
100+
Companies at scale
  • Certified consent from each source company.
  • Full PII redaction.

Start with a 10-company pilot at a bulk rate.

→ View metadata (Google Sheets)

Custom

Tell us what you need — we'll build it.

Beyond the ready-to-license catalog, our team builds bespoke datasets from scoping to delivery: sourcing, annotation, QC, and handoff. Pre-training corpora, evaluation benchmarks, RLHF pairs, agent trajectories, safety probes, RAG sets — whatever your pipeline needs.

What we collect

Any code-related data, any format

  • Non-public code — from our partner network of agencies and startups, matched to your stack and domain.
  • Benchmarks & evaluation sets — bug-fix, PR review, agent trajectories, RAG, red-teaming.
  • Enterprise & multimodal data — Slack/Jira/meetings, images, video, audio — consent and PII redaction included.
  • Human-in-the-loop pipelines — SFT dialogues, RLHF pairwise comparisons, critique annotations, expert review.

How it works

Turnkey, on your schedule

  • Scoping — task taxonomy, data schema, quality criteria defined with your team.
  • Expert annotators — practicing developers and domain specialists, not crowd workers.
  • Multi-level QC — cross-validation, calibration, project-specific optimizations.
  • Scalable throughput — from a pilot to 12,000+ datapoints per month, 21+ languages, 24/7 delivery.

Beyond data delivery

Built into your pipeline

A dataset on its own rarely moves the needle. Our team stands up the infrastructure around it — benchmark harnesses, eval pipelines, annotation tooling, embedded experts — so the data is usable on day one and keeps producing signal long after.

Benchmark infrastructure

Reproducible eval environments

Dockerized SWE-Bench, Multi-SWE-Bench, and Terminal-Bench-style harnesses with golden/test patches, install scripts, and Parquet metadata — patch-apply, build, and test runs work identically on our machines and yours.
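
A sketch of what "identically on our machines and yours" means in practice: every task re-runs inside an image pinned by digest rather than by tag. The manifest file and its field names are hypothetical.

  import json
  import subprocess

  def run_task(task: dict) -> bool:
      # Pinning by digest, not tag, is what keeps runs reproducible
      # across machines; the image can never silently change.
      cmd = [
          "docker", "run", "--rm",
          f"{task['image']}@{task['image_digest']}",
          "bash", "-lc", task["test_command"],
      ]
      return subprocess.run(cmd, capture_output=True).returncode == 0

  with open("harness_manifest.json") as fh:  # hypothetical manifest
      manifest = json.load(fh)

  results = {t["task_id"]: run_task(t) for t in manifest["tasks"]}
  print(f"{sum(results.values())}/{len(results)} tasks reproduced")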

Integration

Wired into your training & eval stack

Schema design, format conversion, ingestion adapters, and continuous delivery on your cadence. Datasets land in the shape your pipeline already expects — no glue code on your side.
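
A sketch of one such ingestion adapter, assuming the common case of Parquet in and JSONL chat records out; the column and field names on both sides stand in for whatever schema is agreed during scoping.

  import json
  import pandas as pd

  def to_chat_record(row: pd.Series) -> dict:
      # Map a delivered row onto the chat shape a typical SFT trainer expects.
      return {
          "messages": [
              {"role": "user", "content": row["prompt"]},
              {"role": "assistant", "content": row["response"]},
          ],
          "meta": {"source": row["repo_id"], "license": row["license"]},
      }

  df = pd.read_parquet("delivery_batch_001.parquet")  # hypothetical filename
  with open("train.jsonl", "w") as out:
      for _, row in df.iterrows():
          out.write(json.dumps(to_chat_record(row)) + "\n")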

Evaluation as a service

Run scoring against expert rubrics

Agent-trajectory and plan/thought scoring, RAG retrieval accuracy with path:line citations, pairwise ranking across 40 criteria — calibrated against expert baselines.
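
One standard way to read "calibrated against expert baselines" is chance-corrected agreement between a rater's pairwise verdicts and the expert set. A minimal Cohen's kappa, as a sketch rather than the in-house scoring pipeline:

  from collections import Counter

  def cohens_kappa(rater: list[str], expert: list[str]) -> float:
      # Observed agreement, corrected for the agreement expected by chance.
      n = len(rater)
      p_o = sum(a == b for a, b in zip(rater, expert)) / n
      ra, ex = Counter(rater), Counter(expert)
      p_e = sum(ra[k] * ex[k] for k in set(ra) | set(ex)) / (n * n)
      return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

  # Verdicts ("A", "B", or "tie") on the same comparison set:
  rater = ["A", "B", "A", "tie", "B", "A"]
  expert = ["A", "B", "B", "tie", "B", "A"]
  print(round(cohens_kappa(rater, expert), 3))  # agreement above chance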

Annotation tooling

Labeling platforms tailored to your rubric

Argilla and bespoke portals stood up per project: task taxonomy, batch assignment, multi-stage QC gates, and per-rater calibration configured to match your evaluation methodology.

Embedded experts

Engineers inside your roadmap

Practicing developers, ML engineers, and domain specialists working as an extension of your team — scoping task taxonomies, defining quality criteria, and resolving edge cases as they surface.

Safety & robustness

Red-teaming and adversarial eval

Adversarial multi-step scenarios, MCP tool-access stress tests, dialog safety probes, and regression suites that catch failure modes before deployment.

About

Three years on one thing: training data for coding AI

Most AI vendors do everything. Fermatix does code data — and only code data — at production grade for the labs and startups building frontier code models.

Why focus matters

Public code corpora are saturated. The frontier of code-model performance now depends on code that hasn't been seen — proprietary repositories, real enterprise interactions, multimodal corpora with documented provenance. Sourcing it takes a different kind of organization: legal infrastructure, anonymization pipelines, expert-only annotation. That's the only thing Fermatix builds.

Who we work with

Three years working with leading code-AI teams has shaped the catalog. Every dataset in production today exists because a partner asked for something they couldn't get elsewhere. We deliver the data, document its origin, and stay out of the way.

Partner

Not a vendor

Long-term engagements with frontier code-AI teams. We grow inside our clients' roadmaps, not outside them.

Supply

Unique code-data pipeline

Non-public codebases and enterprise data competitors can't access — continuously sourced, all properly licensed, traced to origin.

Quality

Human-reviewed, end-to-end

Every output reviewed by working engineers with production experience. No crowd-sourced labeling, no untrained annotators.

Blog

Research & engineering notes

Contact

Start a pilot

Curated subset of non-public repositories and benchmark tasks for hands-on quality validation. Schedule a technical deep dive with our engineering team.

Email: hi@fermatix.ai

Website: Fermatix.AI


AVENIDAS INTELIGENTES, LDA

Lg Alberto Sampaio, 3 A, Sala 10

Linda a Velha, 2795-007

Portugal