
Automating Our Client Dataset Verification with LLMs

Cutting Errors by 40% and Costs by 60%


I'm Javokhirbek Khaytibaev from Fermatix.AI. In this article, I'll share how we helped a client automate a critical step in their production pipeline, speeding up product delivery, reducing errors through automation, and saving them a significant amount of money.

Dataset Creation for Training Large Language Models (LLMs): How the Client Used to Work

Our client specialized in creating datasets to train LLMs used to assist programmers by detecting code errors, suggesting solutions, training novices, and automating repetitive tasks.

Fine-tuning generative models, whether with supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF), requires high-quality datasets of query-response pairs. To create these, the client outsourced prompt generation and verification.

Why is Data Verification Critical?

Verification is a key quality-control step: data quality directly impacts the efficiency of training large language models and, ultimately, the client's results. However, the manual verification process was slow, error-prone, and expensive.

How Automation with LLMs Improved Prompt Verification

We proposed an automated verification system built on pre-trained LLMs and integrated it seamlessly into the client's existing pipeline. Automated dialogue verification improved both flexibility and accuracy, and the results, detailed below, were transformative.

This approach eliminated bottlenecks and significantly sped up the creation of high-quality data for training language models.

Automating the Process: How We Did It

We began building the automated data verification system by developing tailored prompts. We established dialogue evaluation criteria based on the requirements provided by dataset clients, which allowed us to design multiple prompts for the LLM, each aimed at assessing a specific aspect of a dialogue. Here's an example of one such prompt:

Evaluate the following JSON-formatted dialogue based on the specified criteria and provide a JSON-formatted response, considering each criterion. Minor issues should not lead to disqualification unless they critically undermine the dialogue's quality or there are numerous minor issues that collectively impact the dialogue's quality.

Apply assessment criteria only to the sections that correspond to them. Give an assessment for each criterion from 1 to 5, where 5 means the criterion is fully met and 1 means it is not met at all. Give specific examples of significant mistakes in the detailed_issues field.

Some of the key criteria, based on clients' needs, include the adequacy and naturalness of translated questions and, for answers, executable and error-free code, relevance, efficiency and adherence to standards, quality of code comments, comprehensiveness, and neutrality.

Then, we tested two hypotheses to find the best approach for dialogue verification: splitting dialogues into parts for separate evaluation or assessing them as a whole. Comparing both to manual verification showed better accuracy with the split method — deviations were under 20%, compared to 33% for whole dialogues. Based on this, we adopted the split approach.

The process involved these steps:

  1. Extract dialogues from Argilla.
  2. Split them into smaller parts.
  3. Send the parts asynchronously to our LLM with tailored prompts.
  4. Evaluate each part against predefined criteria and store the results in JSON format.
  5. Perform a final review based on the evaluations. Dialogues exceeding a threshold of low scores ("2" or "3") were marked invalid.
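The steps above can be sketched roughly as follows. This is a minimal illustration, not the production code: `evaluate_part` stands in for the real LLM call, the Argilla export is replaced by an inline dialogue, and the function names and the low-score threshold are assumptions.

```python
import asyncio
import json

async def evaluate_part(part: dict) -> dict:
    # In production this would send the tailored prompt plus the dialogue
    # part to the LLM and parse its JSON response; stubbed out here.
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"part_id": part["id"],
            "scores": {"Relevance": 5, "Error Free": 3}}

def split_dialogue(dialogue: dict, size: int = 2) -> list[dict]:
    # Step 2: break the list of dialogue turns into smaller chunks.
    turns = dialogue["turns"]
    return [{"id": f"{dialogue['id']}-{i}", "turns": turns[i:i + size]}
            for i in range(0, len(turns), size)]

async def verify_dialogue(dialogue: dict, max_low_scores: int = 1) -> dict:
    # Steps 3-5: evaluate the parts concurrently, then apply the final
    # review: too many low scores ("2" or "3") marks the dialogue invalid.
    parts = split_dialogue(dialogue)
    results = await asyncio.gather(*(evaluate_part(p) for p in parts))
    low = sum(1 for r in results
              for score in r["scores"].values() if score in (2, 3))
    return {"id": dialogue["id"],
            "valid": low <= max_low_scores,
            "assessments": results}

dialogue = {"id": "15012002", "turns": ["Q1", "A1", "Q2", "A2"]}
report = asyncio.run(verify_dialogue(dialogue))
print(json.dumps(report, indent=2))
```

With the stubbed scores, the two parts contribute two low "Error Free" scores, which exceeds the illustrative threshold, so the dialogue comes back invalid.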

Here's an example output:

{
  "id": "15012002",
  "valid": false,
  "assessments": {
    "translation": {
      "question": {
        "Adequacy": 2,
        "Naturalness": 2
      },
      "answer": {
        "Executable Code": 3,
        "Relevance": 5,
        "Error Free": 3,
        "No Nonsense": 5,
        "Efficiency and Standards": 4,
        "Code Comments": 3,
        "Comprehensiveness": 5,
        "Neutrality": 1
      }
    }
  }
}
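The final verdict in that example follows mechanically from the scores. As a sketch, the rule can be expressed as counting low-scoring criteria across the nested assessment and comparing against a cutoff; treating scores of 3 and below as "low" and a cutoff of 3 are assumptions made for illustration.

```python
# The example assessment from above, as a Python dict.
report = {
    "id": "15012002",
    "assessments": {
        "translation": {
            "question": {"Adequacy": 2, "Naturalness": 2},
            "answer": {"Executable Code": 3, "Relevance": 5,
                       "Error Free": 3, "No Nonsense": 5,
                       "Efficiency and Standards": 4, "Code Comments": 3,
                       "Comprehensiveness": 5, "Neutrality": 1},
        }
    },
}

def count_low(node, low=frozenset({1, 2, 3})) -> int:
    # Walk the nested assessment dict and count low leaf scores.
    if isinstance(node, dict):
        return sum(count_low(v, low) for v in node.values())
    return int(node in low)

low_count = count_low(report["assessments"])
report["valid"] = low_count <= 3  # hypothetical cutoff
print(low_count, report["valid"])  # 6 low scores -> invalid
```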

We conducted several iterations of tool testing to ensure accuracy and efficiency. A small dataset was processed through our automated verification tool and manually reviewed in parallel for comparison.

After analyzing differences between automated and manual checks, we refined the prompt accordingly. Within 3–4 iterations, we achieved a 10% discrepancy with manual verification, making the tool a reliable replacement.
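One way to picture the discrepancy metric used during these iterations: the share of dialogues where the tool's valid/invalid verdict differs from the human reviewer's. The labels below are made up purely for illustration.

```python
# Hypothetical pilot batch: per-dialogue valid/invalid verdicts from
# manual review and from the automated tool.
manual = {"d1": True, "d2": False, "d3": True, "d4": True, "d5": False,
          "d6": True, "d7": True, "d8": False, "d9": True, "d10": True}
automated = {"d1": True, "d2": False, "d3": False, "d4": True, "d5": False,
             "d6": True, "d7": True, "d8": False, "d9": True, "d10": True}

# Discrepancy = fraction of dialogues where the verdicts disagree.
disagreements = sum(manual[d] != automated[d] for d in manual)
discrepancy = disagreements / len(manual)
print(f"discrepancy: {discrepancy:.0%}")  # one verdict in ten differs
```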

Results

Implementing automated dialogue verification reduced errors by 40%, halved task processing time, and allowed for re-evaluation and feedback loops twice as fast. Manual effort was cut by 8,000 hours monthly, and dataset verification costs dropped by 60%. So far, 35,000 dataset elements have been processed through the system.


Read our other stories:
- How We Collect SWE-Bench for Other Languages
- Fermatix's Multilingual SWE-Bench
