Page 1 of 540
1 2 3 540

New AI model predicts which genetic mutations truly drive disease

Scientists at Mount Sinai have created an artificial intelligence system that can predict how likely rare genetic mutations are to actually cause disease. By combining machine learning with millions of electronic health records and routine lab tests like cholesterol or kidney function, the system produces "ML penetrance" scores that place genetic risk on a spectrum rather than a simple yes/no. Some variants once thought dangerous showed little real-world impact, while others previously labeled uncertain revealed strong disease links.

Developing self-deploying material for next-gen robotics

The field of robotics has transformed drastically in this century, with a special focus on soft robotics. In this context, origami-inspired deployable structures with compact storage and efficient deployment features have gained prominence in aerospace, architecture, and medical fields.

Unusual microbug anatomy shown to optimize wing weight—findings could benefit tiny drone design

Skoltech and MSU scientists have uncovered the advantage gained by microscopic bugs from their feather-like wings that are unlike those of dragonflies, bees, mosquitoes and other familiar insects. A wing largely made up of bristles that stand somewhat apart from each other is lighter than the conventional membranous wing that comes in one piece.

Robot regret: New research helps robots make safer decisions around humans

Imagine for a moment that you're in an auto factory. A robot and a human are working next to each other on the production line. The robot is busy rapidly assembling car doors while the human runs quality control, inspecting the doors for damage and making sure they come together as they should.

Warehouse automation hasn’t made workers safer—it’s just reshuffled the risk, say researchers

Rapid advancements in robotics are changing the face of the world's warehouses, as dangerous and physically taxing tasks are being reassigned en masse from humans to machines. Automation and digitization are nothing new in the logistics sector, or any sector heavily reliant on manual labor. Bosses prize automation because it can bring up to two- to four-fold gains in productivity. But workers can also benefit from the putative improvements in safety that come from shifting dangerous tasks onto non-human shoulders.

Caltech breakthrough makes quantum memory last 30 times longer

While superconducting qubits are great at fast calculations, they struggle to store information for long periods. A team at Caltech has now developed a clever solution: converting quantum information into sound waves. By using a tiny device that acts like a miniature tuning fork, the researchers were able to extend quantum memory lifetimes up to 30 times longer than before. This breakthrough could pave the way toward practical, scalable quantum computers that can both compute and remember.

Smarter navigation: AI helps robots stay on track without a map

Navigating without a map is a difficult task for robots, especially when they can't reliably determine where they are. A new AI-powered solution helps robots overcome this challenge by training them to make movement decisions that also protect their ability to localize. Instead of blindly heading toward a target, the robot evaluates the visual richness of its surroundings and favors routes where it's less likely to get "lost."

Starfish-inspired tube feet could help underwater robots get a grip

Soft robotics, which uses flexible and deformable materials, is an emerging field in autonomous systems. It has recently been applied to next-generation tasks such as deep-sea sampling with soft robotic grippers—requiring strong adhesion and autonomous detachment. Bioinspired adhesion offers a promising solution.

Judging judges: Building trustworthy LLM evaluations

TL;DR
LLM-as-a-Judge systems can be fooled by confident-sounding but wrong answers, giving teams false confidence in their models. We built a human-labeled dataset and used our open-source framework syftr to systematically test judge configurations. The results? They’re in the full post. But here’s the takeaway: don’t just trust your judge — test it.


When we shifted to self-hosted open-source models for our agentic retrieval-augmented generation (RAG) framework, we were thrilled by the initial results. On tough benchmarks like FinanceBench, our systems appeared to deliver breakthrough accuracy. 

That excitement lasted right up until we looked closer at how our LLM-as-a-Judge system was grading the answers.

The truth: our new judges were being fooled.

A RAG system, unable to find data to compute a financial metric, would simply explain that it couldn’t find the information. 

The judge would reward this plausible-sounding explanation with full credit, concluding the system had correctly identified the absence of data. That single flaw was skewing results by 10–20% — enough to make a mediocre system look state-of-the-art.

Which raised a critical question: if you can’t trust the judge, how can you trust the results?

Your LLM judge might be lying to you, and you won’t know unless you rigorously test it. The best judge isn’t always the biggest or most expensive. 

With the right data and tools, however, you can build one that’s cheaper, more accurate, and more trustworthy than gpt-4o-mini. In this research deep dive, we show you how.

Why LLM judges fail

The challenge we uncovered went far beyond a simple bug. Evaluating generated content is inherently nuanced, and LLM judges are prone to subtle but consequential failures.

Our initial issue was a textbook case of a judge being swayed by confident-sounding reasoning. For example, in one evaluation about a family tree, the judge concluded:

“The generated answer is relevant and correctly identifies that there’s insufficient information to determine the specific cousin… While the reference answer lists names, the generated answer’s conclusion aligns with the reasoning that the question lacks necessary data.”

In reality, the information was available — the RAG system just failed to retrieve it. The judge was fooled by the authoritative tone of the response.

Digging deeper, we found other challenges:

  • Numerical ambiguity: Is an answer of 3.9% “close enough” to 3.8%? Judges often lack the context to decide.
  • Semantic equivalence: Is “APAC” an acceptable substitute for “Asia-Pacific: India, Japan, Malaysia, Philippines, Australia”?
  • Faulty references:  Sometimes the “ground truth” answer itself is wrong, leaving the judge in a paradox.

These failures underscore a key lesson: simply picking a powerful LLM and asking it to grade isn’t enough. Perfect agreement between judges, human or machine, is unattainable without a more rigorous approach.

Building a framework for trust

To address these challenges, we needed a way to evaluate the evaluators. That meant two things:

  1. A high-quality, human-labeled dataset of judgments.
  2. A system to methodically test different judge configurations.

First, we created our own dataset, now available on HuggingFace. We generated hundreds of question-answer-response triplets using a wide range of RAG systems.

Then, our team hand-labeled all 807 examples. 

Every edge case was debated, and we established clear, consistent grading rules.

The process itself was eye-opening, showing just how subjective evaluation can be. In the end, our labeled dataset reflected a distribution of 37.6% failing and 62.4% passing responses.

DataRobot's judging LLM judges dataset
The judge-eval dataset was created using syftr studies, which generate diverse agentic RAG flows across the latency–accuracy Pareto frontier. These flows produce LLM responses for many QA pairs, which human labelers then evaluate against reference answers to ensure high-quality judgment labels.

Next, we needed an engine for experimentation. That’s where our open-source framework, syftr, came in. 

We extended it with a new JudgeFlow class and a configurable search space to vary LLM choice, temperature, and prompt design. This made it possible to systematically explore — and identify — the judge configurations most aligned with human judgment.

Putting the judges to the test

With our framework in place, we began experimenting.

Our first test focused on the Master-RM model, specifically tuned to avoid “reward hacking” by prioritizing content over reasoning phrases. 

We pitted it against its base model using four prompts: 

  1. The “default” LlamaIndex CorrectnessEvaluator prompt, asking for a 1–5 rating
  2. The same CorrectnessEvaluator prompt, asking for a 1–10 rating
  3. A more detailed version of the CorrectnessEvaluator prompt with more explicit criteria. 
  4. A simple prompt: “Return YES if the Generated Answer is correct relative to the Reference Answer, or NO if it is not.”


The syftr optimization results are shown below in the cost-versus-accuracy plot. Accuracy is the simple percent agreement between the judge and human evaluators, and cost is estimated based on the per-token pricing of Together.ai‘s hosting services.

judge optimization master rm vs qwen2.5 7b instruct
Accuracy vs. cost for different judge prompts and LLMs. Each dot represents the performance of a trial with specific parameters. The “detailed” prompt delivers the most human-like performance but at significantly higher cost, estimated using Together.ai’s per-token hosting prices.)

The results were surprising. 

Master-RM was no more accurate than its base model and struggled with producing anything beyond the “simple” prompt response format due to its focused training.

While the model’s specialized training was effective in combating the effects of specific reasoning phrases, it did not improve overall alignment to the human judgements in our dataset.

We also saw a clear trade-off. The “detailed” prompt was the most accurate, but nearly four times as expensive in tokens.

Next, we scaled up, evaluating a cluster of large open-weight models (from Qwen, DeepSeek, Google, and NVIDIA) and testing new judge strategies:

  • Random: Selecting a judge at random from a pool for each evaluation.
  • Consensus: Polling 3 or 5 models and taking the majority vote.
judge optimization flow comparison
judge optimization prompt comparison
Optimization results from the larger study, broken down by judge type and prompt. The chart shows a clear Pareto frontier, enabling data-driven choices between cost and accuracy.)

Here the results converged: consensus-based judges offered no accuracy advantage over single or random judges. 

All three methods topped out around 96% agreement with human labels. Across the board, the best-performing configurations used the detailed prompt.

But there was an important exception: the simple prompt paired with a powerful open-weight model like Qwen/Qwen2.5-72B-Instruct was nearly 20× cheaper than detailed prompts, while only giving up a few percentage points of accuracy.

What makes this solution different?

For a long time, our rule of thumb was: “Just use gpt-4o-mini.” It’s a common shortcut for teams looking for a reliable, off-the-shelf judge. And while gpt-4o-mini did perform well (around 93% accuracy with the default prompt), our experiments revealed its limits. It’s just one point on a much broader trade-off curve.

A systematic approach gives you a menu of optimized options instead of a single default:

  • Top accuracy, no matter the cost. A consensus flow with the detailed prompt and models like Qwen3-32B, DeepSeek-R1-Distill, and Nemotron-Super-49B achieved 96% human alignment.
  • Budget-friendly, rapid testing. A single model with the simple prompt hit ~93% accuracy at one-fifth the cost of the gpt-4o-mini baseline.

By optimizing across accuracy, cost, and latency, you can make informed choices tailored to the needs of each project — instead of betting everything on a one-size-fits-all judge.

Building reliable judges: Key takeaways

Whether you use our framework or not, our findings can help you build more reliable evaluation systems:

  1. Prompting is the biggest lever. For the highest human alignment, use detailed prompts that spell out your evaluation criteria. Don’t assume the model knows what “good” means for your task.
  2. Simple works when speed matters. If cost or latency is critical, a simple prompt (e.g., “Return YES if the Generated Answer is correct relative to the Reference Answer, or NO if it is not.”) paired with a capable model delivers excellent value with only a minor accuracy trade-off.
  3. Committees bring stability. For critical evaluations where accuracy is non-negotiable, polling 3–5 diverse, powerful models and taking the majority vote reduces bias and noise. In our study, the top-accuracy consensus flow combined Qwen/Qwen3-32B, DeepSeek-R1-Distill-Llama-70B, and NVIDIA’s Nemotron-Super-49B.
  4. Bigger, smarter models help. Larger LLMs consistently outperformed smaller ones. For example, upgrading from microsoft/Phi-4-multimodal-instruct (5.5B) with a detailed prompt to gemma3-27B-it with a simple prompt delivered an 8% boost in accuracy — at a negligible difference in cost.

From uncertainty to confidence

Our journey began with a troubling discovery: instead of following the rubric, our LLM judges were being swayed by long, plausible-sounding refusals.

By treating evaluation as a rigorous engineering problem, we moved from doubt to confidence. We gained a clear, data-driven view of the trade-offs between accuracy, cost, and speed in LLM-as-a-Judge systems. 

More data means better choices.

We hope our work and our open-source dataset encourage you to take a closer look at your own evaluation pipelines. The “best” configuration will always depend on your specific needs, but you no longer have to guess.

Ready to build more trustworthy evaluations? Explore our work in syftr and start judging your judges.

The post Judging judges: Building trustworthy LLM evaluations appeared first on DataRobot.

Page 1 of 540
1 2 3 540