
Robot regret: New research helps robots make safer decisions around humans

Imagine for a moment that you're in an auto factory. A robot and a human are working next to each other on the production line. The robot is busy rapidly assembling car doors while the human runs quality control, inspecting the doors for damage and making sure they come together as they should.

Warehouse automation hasn’t made workers safer—it’s just reshuffled the risk, say researchers

Rapid advancements in robotics are changing the face of the world's warehouses, as dangerous and physically taxing tasks are being reassigned en masse from humans to machines. Automation and digitization are nothing new in the logistics sector, or any sector heavily reliant on manual labor. Bosses prize automation because it can bring two- to four-fold gains in productivity. But workers can also benefit from the putative improvements in safety that come from shifting dangerous tasks onto non-human shoulders.

Caltech breakthrough makes quantum memory last 30 times longer

While superconducting qubits are great at fast calculations, they struggle to store information for long periods. A team at Caltech has now developed a clever solution: converting quantum information into sound waves. By using a tiny device that acts like a miniature tuning fork, the researchers were able to extend quantum memory lifetimes up to 30 times longer than before. This breakthrough could pave the way toward practical, scalable quantum computers that can both compute and remember.

Smarter navigation: AI helps robots stay on track without a map

Navigating without a map is a difficult task for robots, especially when they can't reliably determine where they are. A new AI-powered solution helps robots overcome this challenge by training them to make movement decisions that also protect their ability to localize. Instead of blindly heading toward a target, the robot evaluates the visual richness of its surroundings and favors routes where it's less likely to get "lost."

Starfish-inspired tube feet could help underwater robots get a grip

Soft robotics, which uses flexible and deformable materials, is an emerging field in autonomous systems. It has recently been applied to next-generation tasks such as deep-sea sampling with soft robotic grippers, which requires both strong adhesion and autonomous detachment. Bioinspired adhesion offers a promising solution.

Judging judges: Building trustworthy LLM evaluations

TL;DR
LLM-as-a-Judge systems can be fooled by confident-sounding but wrong answers, giving teams false confidence in their models. We built a human-labeled dataset and used our open-source framework syftr to systematically test judge configurations. The results? They’re in the full post. But here’s the takeaway: don’t just trust your judge — test it.


When we shifted to self-hosted open-source models for our agentic retrieval-augmented generation (RAG) framework, we were thrilled by the initial results. On tough benchmarks like FinanceBench, our systems appeared to deliver breakthrough accuracy. 

That excitement lasted right up until we looked closer at how our LLM-as-a-Judge system was grading the answers.

The truth: our new judges were being fooled.

A RAG system, unable to find data to compute a financial metric, would simply explain that it couldn’t find the information. 

The judge would reward this plausible-sounding explanation with full credit, concluding the system had correctly identified the absence of data. That single flaw was skewing results by 10–20% — enough to make a mediocre system look state-of-the-art.

Which raised a critical question: if you can’t trust the judge, how can you trust the results?

Your LLM judge might be lying to you, and you won’t know unless you rigorously test it. The best judge isn’t always the biggest or most expensive. 

With the right data and tools, however, you can build one that’s cheaper, more accurate, and more trustworthy than gpt-4o-mini. In this research deep dive, we show you how.

Why LLM judges fail

The challenge we uncovered went far beyond a simple bug. Evaluating generated content is inherently nuanced, and LLM judges are prone to subtle but consequential failures.

Our initial issue was a textbook case of a judge being swayed by confident-sounding reasoning. For example, in one evaluation about a family tree, the judge concluded:

“The generated answer is relevant and correctly identifies that there’s insufficient information to determine the specific cousin… While the reference answer lists names, the generated answer’s conclusion aligns with the reasoning that the question lacks necessary data.”

In reality, the information was available — the RAG system just failed to retrieve it. The judge was fooled by the authoritative tone of the response.

Digging deeper, we found other challenges:

  • Numerical ambiguity: Is an answer of 3.9% “close enough” to 3.8%? Judges often lack the context to decide.
  • Semantic equivalence: Is “APAC” an acceptable substitute for “Asia-Pacific: India, Japan, Malaysia, Philippines, Australia”?
  • Faulty references: Sometimes the “ground truth” answer itself is wrong, leaving the judge in a paradox.

These failures underscore a key lesson: simply picking a powerful LLM and asking it to grade isn’t enough. Perfect agreement between judges, human or machine, may be unattainable, but getting close requires a far more rigorous approach.

Building a framework for trust

To address these challenges, we needed a way to evaluate the evaluators. That meant two things:

  1. A high-quality, human-labeled dataset of judgments.
  2. A system to methodically test different judge configurations.

First, we created our own dataset, now available on HuggingFace. We generated hundreds of question-answer-response triplets using a wide range of RAG systems.

Then, our team hand-labeled all 807 examples. 

Every edge case was debated, and we established clear, consistent grading rules.

The process itself was eye-opening, showing just how subjective evaluation can be. In the end, our labeled dataset reflected a distribution of 37.6% failing and 62.4% passing responses.
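
For readers who want to poke at the labels themselves, the sketch below shows one way to load a dataset like this from HuggingFace and reproduce the pass/fail split. The repository id and column name are placeholders, since the post does not spell them out.

```python
# Minimal sketch: load a human-labeled judge-evaluation dataset and check the
# pass/fail distribution. The repo id "your-org/judge-eval" and the column
# name "human_label" are placeholders, not the actual identifiers.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("your-org/judge-eval", split="train")  # hypothetical repo id

counts = Counter(ds["human_label"])  # e.g. {"pass": ..., "fail": ...}
total = sum(counts.values())
for label, count in counts.most_common():
    print(f"{label}: {count} ({count / total:.1%})")
```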

DataRobot's judging LLM judges dataset
The judge-eval dataset was created using syftr studies, which generate diverse agentic RAG flows across the latency–accuracy Pareto frontier. These flows produce LLM responses for many QA pairs, which human labelers then evaluate against reference answers to ensure high-quality judgment labels.

Next, we needed an engine for experimentation. That’s where our open-source framework, syftr, came in. 

We extended it with a new JudgeFlow class and a configurable search space to vary LLM choice, temperature, and prompt design. This made it possible to systematically explore — and identify — the judge configurations most aligned with human judgment.
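
The snippet below is a schematic of that search space, not the actual syftr JudgeFlow API. It simply enumerates the dimensions the post says were varied (LLM choice, temperature, and prompt design), to show what systematically exploring judge configurations amounts to; the model ids listed are examples drawn from models mentioned later in the post.

```python
# Schematic only; not the real syftr JudgeFlow API. It enumerates the judge
# search space described in the post: LLM choice, temperature, and prompt.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class JudgeConfig:
    model: str          # hosted LLM identifier
    temperature: float  # sampling temperature for the judge
    prompt: str         # name of a prompt template

SEARCH_SPACE = {
    # Example model ids drawn from those mentioned in the post.
    "model": ["Qwen/Qwen2.5-72B-Instruct", "Qwen/Qwen3-32B"],
    "temperature": [0.0, 0.7],
    "prompt": ["default_1_5", "default_1_10", "detailed", "simple"],
}

configs = [
    JudgeConfig(m, t, p)
    for m, t, p in product(
        SEARCH_SPACE["model"], SEARCH_SPACE["temperature"], SEARCH_SPACE["prompt"]
    )
]
print(f"{len(configs)} judge configurations to evaluate against the labeled dataset")
```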

Putting the judges to the test

With our framework in place, we began experimenting.

Our first test focused on the Master-RM model, specifically tuned to avoid “reward hacking” by prioritizing content over reasoning phrases. 

We pitted it against its base model using four prompts: 

  1. The “default” LlamaIndex CorrectnessEvaluator prompt, asking for a 1–5 rating
  2. The same CorrectnessEvaluator prompt, asking for a 1–10 rating
  3. A more detailed version of the CorrectnessEvaluator prompt with more explicit grading criteria
  4. A simple prompt: “Return YES if the Generated Answer is correct relative to the Reference Answer, or NO if it is not.”
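
As a concrete illustration, here is what a single judge call with the “simple” prompt (option 4) might look like against an OpenAI-compatible endpoint. The base URL, environment variable, and default model name are assumptions about the hosting setup, not details from the study.

```python
# Sketch of one judge call using the "simple" YES/NO prompt above, sent to an
# OpenAI-compatible endpoint. The base_url, API-key handling, and default
# model are assumptions; adapt them to your own hosting setup.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # any OpenAI-compatible endpoint
    api_key=os.environ["TOGETHER_API_KEY"],  # hypothetical env var
)

SIMPLE_PROMPT = (
    "Return YES if the Generated Answer is correct relative to the "
    "Reference Answer, or NO if it is not.\n\n"
    "Reference Answer: {reference}\n\nGenerated Answer: {generated}"
)

def judge(reference: str, generated: str,
          model: str = "Qwen/Qwen2.5-72B-Instruct") -> bool:
    """Return True if the judge model says the generated answer is correct."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": SIMPLE_PROMPT.format(reference=reference, generated=generated),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```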


The syftr optimization results are shown below in the cost-versus-accuracy plot. Accuracy is the simple percent agreement between the judge and human evaluators, and cost is estimated based on the per-token pricing of Together.ai's hosting services.
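
Both metrics are straightforward to compute. The sketch below shows percent agreement against the human labels and a rough cost estimate; the per-token price is a placeholder, not Together.ai's actual rate.

```python
# Sketch of the two metrics described above: percent agreement with human
# labels, and an estimated cost from token counts. The default price is a
# placeholder, not an actual hosting rate.
def percent_agreement(judge_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of examples where the judge and the human labeler agree."""
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)

def estimated_cost(prompt_tokens: int, completion_tokens: int,
                   usd_per_million_tokens: float = 1.20) -> float:
    """Rough cost estimate based on a flat per-token hosting price."""
    return (prompt_tokens + completion_tokens) / 1_000_000 * usd_per_million_tokens
```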

Accuracy vs. cost for different judge prompts and LLMs (Master-RM vs. its Qwen2.5-7B-Instruct base model). Each dot represents the performance of a single trial with specific parameters. The “detailed” prompt delivers the most human-like performance, but at significantly higher cost, estimated using Together.ai’s per-token hosting prices.

The results were surprising. 

Master-RM was no more accurate than its base model, and its focused training left it struggling to produce any response format beyond the one required by the “simple” prompt.

While the model’s specialized training was effective in combating the effects of specific reasoning phrases, it did not improve overall alignment with the human judgments in our dataset.

We also saw a clear trade-off. The “detailed” prompt was the most accurate, but nearly four times as expensive in tokens.
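
The post does not reproduce the detailed prompt verbatim, but a hypothetical version that spells out criteria like the ones discussed earlier (numeric tolerance, acceptable abbreviations, refusals when the reference actually contains the answer) makes it obvious where the extra tokens go:

```python
# Hypothetical example of a "detailed" judge prompt with explicit criteria.
# This is not the prompt used in the study; it only illustrates why spelling
# out the rubric costs several times more tokens than the simple YES/NO prompt.
DETAILED_PROMPT = """You are grading a Generated Answer against a Reference Answer.

Score the Generated Answer from 1 to 5 using these criteria:
- Factual agreement: numbers within a reasonable tolerance (e.g., rounding)
  count as correct; contradictions do not.
- Completeness: all parts of the question must be addressed.
- Abbreviations and synonyms (e.g., "APAC" for "Asia-Pacific") are acceptable
  when they refer to the same entities.
- If the Generated Answer claims the information is unavailable but the
  Reference Answer contains it, the answer is incorrect.

Reference Answer: {reference}
Generated Answer: {generated}

Respond with the score followed by a one-sentence justification."""
```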

Next, we scaled up, evaluating a cluster of large open-weight models (from Qwen, DeepSeek, Google, and NVIDIA) and testing new judge strategies:

  • Random: Selecting a judge at random from a pool for each evaluation.
  • Consensus: Polling 3 or 5 models and taking the majority vote.
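
A minimal sketch of those two strategies, reusing the judge() helper from the earlier snippet; the model identifiers other than Qwen/Qwen3-32B are approximations of the names given in the post.

```python
# Sketch of the "random" and "consensus" judge strategies above, reusing the
# judge() helper from the earlier snippet. Model identifiers are approximate.
import random
from collections import Counter

JUDGE_POOL = [
    "Qwen/Qwen3-32B",
    "DeepSeek-R1-Distill-Llama-70B",  # approximate identifier
    "Nemotron-Super-49B",             # approximate identifier
]

def random_judge(reference: str, generated: str) -> bool:
    """Pick one judge at random from the pool for each evaluation."""
    return judge(reference, generated, model=random.choice(JUDGE_POOL))

def consensus_judge(reference: str, generated: str) -> bool:
    """Poll every model in the pool and return the majority verdict."""
    votes = Counter(judge(reference, generated, model=m) for m in JUDGE_POOL)
    return votes[True] > votes[False]
```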
Optimization results from the larger study, broken down by judge flow and by prompt. The charts show a clear Pareto frontier, enabling data-driven choices between cost and accuracy.

Here the results converged: consensus-based judges offered no accuracy advantage over single or random judges. 

All three methods topped out around 96% agreement with human labels. Across the board, the best-performing configurations used the detailed prompt.

But there was an important exception: the simple prompt paired with a powerful open-weight model like Qwen/Qwen2.5-72B-Instruct was nearly 20× cheaper than detailed prompts, while only giving up a few percentage points of accuracy.

What makes this solution different?

For a long time, our rule of thumb was: “Just use gpt-4o-mini.” It’s a common shortcut for teams looking for a reliable, off-the-shelf judge. And while gpt-4o-mini did perform well (around 93% accuracy with the default prompt), our experiments revealed its limits. It’s just one point on a much broader trade-off curve.

A systematic approach gives you a menu of optimized options instead of a single default:

  • Top accuracy, no matter the cost. A consensus flow with the detailed prompt and models like Qwen3-32B, DeepSeek-R1-Distill, and Nemotron-Super-49B achieved 96% human alignment.
  • Budget-friendly, rapid testing. A single model with the simple prompt hit ~93% accuracy at one-fifth the cost of the gpt-4o-mini baseline.

By optimizing across accuracy, cost, and latency, you can make informed choices tailored to the needs of each project — instead of betting everything on a one-size-fits-all judge.

Building reliable judges: Key takeaways

Whether you use our framework or not, our findings can help you build more reliable evaluation systems:

  1. Prompting is the biggest lever. For the highest human alignment, use detailed prompts that spell out your evaluation criteria. Don’t assume the model knows what “good” means for your task.
  2. Simple works when speed matters. If cost or latency is critical, a simple prompt (e.g., “Return YES if the Generated Answer is correct relative to the Reference Answer, or NO if it is not.”) paired with a capable model delivers excellent value with only a minor accuracy trade-off.
  3. Committees bring stability. For critical evaluations where accuracy is non-negotiable, polling 3–5 diverse, powerful models and taking the majority vote reduces bias and noise. In our study, the top-accuracy consensus flow combined Qwen/Qwen3-32B, DeepSeek-R1-Distill-Llama-70B, and NVIDIA’s Nemotron-Super-49B.
  4. Bigger, smarter models help. Larger LLMs consistently outperformed smaller ones. For example, upgrading from microsoft/Phi-4-multimodal-instruct (5.5B) with a detailed prompt to gemma3-27B-it with a simple prompt delivered an 8% boost in accuracy — at a negligible difference in cost.

From uncertainty to confidence

Our journey began with a troubling discovery: instead of following the rubric, our LLM judges were being swayed by long, plausible-sounding refusals.

By treating evaluation as a rigorous engineering problem, we moved from doubt to confidence. We gained a clear, data-driven view of the trade-offs between accuracy, cost, and speed in LLM-as-a-Judge systems. 

More data means better choices.

We hope our work and our open-source dataset encourage you to take a closer look at your own evaluation pipelines. The “best” configuration will always depend on your specific needs, but you no longer have to guess.

Ready to build more trustworthy evaluations? Explore our work in syftr and start judging your judges.


Engineering fantasy into reality

“One of the dreams I had as a kid was about the first day of school, and being able to build and be creative, and it was the happiest day of my life. And at MIT, I felt like that dream became reality,” says Ballesteros. Credit: Ryan A Lannom, Jet Propulsion Laboratory.

By Jennifer Chu

Growing up in the suburban town of Spring, Texas, just outside of Houston, Erik Ballesteros couldn’t help but be drawn in by the possibilities for humans in space.

It was the early 2000s, and NASA’s space shuttle program was the main transport for astronauts to the International Space Station (ISS). Ballesteros’ hometown was less than an hour from Johnson Space Center (JSC), where NASA’s mission control center and astronaut training facility are based. And as often as they could, he and his family would drive to JSC to check out the center’s public exhibits and presentations on human space exploration.

For Ballesteros, the highlight of these visits was always the tram tour, which brings visitors to JSC’s Astronaut Training Facility. There, the public can watch astronauts test out spaceflight prototypes and practice various operations in preparation for living and working on the International Space Station.

“It was a really inspiring place to be, and sometimes we would meet astronauts when they were doing signings,” he recalls. “I’d always see the gates where the astronauts would go back into the training facility, and I would think: One day I’ll be on the other side of that gate.”

Today, Ballesteros is a PhD student in mechanical engineering at MIT, and has already made good on his childhood goal. Before coming to MIT, he interned on multiple projects at JSC, working in the training facility to help test new spacesuit materials, portable life support systems, and a propulsion system for a prototype Mars rocket. He also helped train astronauts to operate the ISS’ emergency response systems.

Those early experiences steered him to MIT, where he hopes to make a more direct impact on human spaceflight. He and his advisor, Harry Asada, are building a system that will quite literally provide helping hands to future astronauts. The system, dubbed SuperLimbs, consists of a pair of wearable robotic arms that extend out from a backpack, similar to the fictional Inspector Gadget, or Doctor Octopus (“Doc Ock,” to comic book fans). Ballesteros and Asada are designing the robotic arms to be strong enough to lift an astronaut back up if they fall. The arms could also crab-walk around a spacecraft’s exterior as an astronaut inspects or makes repairs.

Ballesteros is collaborating with engineers at the NASA Jet Propulsion Laboratory to refine the design, which he plans to introduce to astronauts at JSC in the next year or two, for practical testing and user feedback. He says his time at MIT has helped him make connections across academia and in industry that have fueled his life and work.

“Success isn’t built by the actions of one, but rather it’s built on the shoulders of many,” Ballesteros says. “Connections — ones that you not just have, but maintain — are so vital to being able to open new doors and keep great ones open.”

Getting a jumpstart

Ballesteros didn’t always seek out those connections. As a kid, he counted down the minutes until the end of school, when he could go home to play video games and watch movies, “Star Wars” being a favorite. He also loved to create and had a talent for cosplay, tailoring intricate, life-like costumes inspired by cartoon and movie characters.

In high school, he took an introductory class in engineering that challenged students to build robots from kits that they would then pit against each other, BattleBots-style. Ballesteros built a robotic ball that moved by shifting an internal weight, similar to Star Wars’ fictional, sphere-shaped BB-8.

“It was a good introduction, and I remember thinking, this engineering thing could be fun,” he says.

After graduating high school, Ballesteros attended the University of Texas at Austin, where he pursued a bachelor’s degree in aerospace engineering. What would typically be a four-year degree stretched into an eight-year period during which Ballesteros combined college with multiple work experiences, taking on internships at NASA and elsewhere. 

In 2013, he interned at Lockheed Martin, where he contributed to various aspects of jet engine development. That experience unlocked a number of other aerospace opportunities. After a stint at NASA’s Kennedy Space Center, he went on to Johnson Space Center, where, as part of a co-op program called Pathways, he returned every spring or summer over the next five years, to intern in various departments across the center.

While the time at JSC gave him a huge amount of practical engineering experience, Ballesteros still wasn’t sure if it was the right fit. Along with his childhood fascination with astronauts and space, he had always loved cinema and the special effects that bring films to life. In 2018, he took a year off from the NASA Pathways program to intern at Disney, where he spent the spring semester working as a safety engineer, performing safety checks on Disney rides and attractions.

During this time, he got to know a few people in Imagineering — the research and development group that creates, designs, and builds rides, theme parks, and attractions. That summer, the group took him on as an intern, and he worked on the animatronics for upcoming rides, which involved translating certain scenes in a Disney movie into practical, safe, and functional scenes in an attraction.

“In animation, a lot of things they do are fantastical, and it was our job to find a way to make them real,” says Ballesteros, who loved every moment of the experience and hoped to be hired as an Imagineer after the internship came to an end. But he had one year left in his undergraduate degree and had to move on.

After graduating from UT Austin in December 2019, Ballesteros accepted a position at NASA’s Jet Propulsion Laboratory in Pasadena, California. He started at JPL in February of 2020, working on some last adjustments to the Mars Perseverance rover. After a few months during which JPL shifted to remote work during the Covid pandemic, Ballesteros was assigned to a project to develop a self-diagnosing spacecraft monitoring system. While working with that team, he met an engineer who was a former lecturer at MIT. As a practical suggestion, she nudged Ballesteros to consider pursuing a master’s degree, to add more value to his CV.

“She opened up the idea of going to grad school, which I hadn’t ever considered,” he says.

Full circle

In 2021, Ballesteros arrived at MIT to begin a master’s program in mechanical engineering. In interviewing with potential advisors, he immediately hit it off with Harry Asada, the Ford Professor of Engineering and director of the d’Arbeloff Laboratory for Information Systems and Technology. Years ago, Asada had pitched JPL an idea for wearable robotic arms to aid astronauts, which they quickly turned down. But Asada held onto the idea, and proposed that Ballesteros take it on as a feasibility study for his master’s thesis.

The project would require bringing a seemingly sci-fi idea into practical, functional form, for use by astronauts in future space missions. For Ballesteros, it was the perfect challenge. SuperLimbs became the focus of his master’s degree, which he earned in 2023. His initial plan was to return to industry, degree in hand. But he chose to stay at MIT to pursue a PhD, so that he could continue his work with SuperLimbs in an environment where he felt free to explore and try new things.

“MIT is like nerd Hogwarts,” he says. “One of the dreams I had as a kid was about the first day of school, and being able to build and be creative, and it was the happiest day of my life. And at MIT, I felt like that dream became reality.”

Ballesteros and Asada are now further developing SuperLimbs. The team recently re-pitched the idea to engineers at JPL, who reconsidered, and have since struck up a partnership to help test and refine the robot. In the next year or two, Ballesteros hopes to bring a fully functional, wearable design to Johnson Space Center, where astronauts can test it out in space-simulated settings.

In addition to his formal graduate work, Ballesteros has found a way to have a bit of Imagineer-like fun. He is a member of the MIT Robotics Team, which designs, builds, and runs robots in various competitions and challenges. Within this club, Ballesteros has formed a sub-club of sorts, called the Droid Builders, that aims to build animatronic droids from popular movies and franchises.

“I thought I could use what I learned from Imagineering and teach undergrads how to build robots from the ground up,” he says. “Now we’re building a full-scale WALL-E that could be fully autonomous. It’s cool to see everything come full circle.”
