Are the New GPT-OSS Models Any Good? We put them to the test.
OpenAI hadn’t released an open-weight language model since GPT-2 back in 2019. Six years later, they surprised everyone with two: gpt-oss-120b and the smaller gpt-oss-20b.
Naturally, we wanted to know — how do they actually perform?
To find out, we ran both models through our open-source workflow optimization framework, syftr. It evaluates models across different configurations — fast vs. cheap, high vs. low accuracy — and includes support for OpenAI’s new “thinking effort” setting.
In theory, more thinking should mean better answers. In practice? Not always.
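If you want to experiment with the setting yourself, here is a minimal sketch of toggling thinking effort when calling a gpt-oss model served behind an OpenAI-compatible endpoint. The base URL, model name, and the system-prompt mechanism for setting effort are assumptions about the serving stack, not details from our syftr runs.

```python
# Minimal sketch (not syftr internals): query a locally served gpt-oss model with a
# chosen "thinking effort" level. The endpoint, model name, and the system-prompt
# mechanism for setting effort are assumptions about the serving setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str, effort: str = "low") -> str:
    """Ask gpt-oss-20b a question with reasoning effort set to low, medium, or high."""
    response = client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[
            {"role": "system", "content": f"Reasoning: {effort}"},
            {"role": "user", "content": question},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content

print(ask("Summarize the main drivers of free cash flow.", effort="low"))
```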
We also use syftr to explore questions like “is LLM-as-a-Judge actually working?” and “what workflows perform well across many datasets?”.
Our first results with GPT-OSS might surprise you: the best performer wasn’t the biggest model or the deepest thinker.
Instead, the 20b model with low thinking effort consistently landed on the Pareto frontier, even rivaling the 120b medium configuration on benchmarks like FinanceBench, HotpotQA, and MultihopRAG. Meanwhile, high thinking effort rarely mattered at all.
How we set up our experiments
We didn’t just pit GPT-OSS against itself. Instead, we wanted to see how it stacked up against other strong open-weight models. So we compared gpt-oss-20b and gpt-oss-120b with:
- qwen3-235b-a22b
- glm-4.5-air
- nemotron-super-49b
- qwen3-30b-a3b
- gemma3-27b-it
- phi-4-multimodal-instruct
To test OpenAI’s new “thinking effort” feature, we ran each GPT-OSS model in three modes: low, medium, and high thinking effort. That gave us six configurations in total:
- gpt-oss-120b-low / -medium / -high
- gpt-oss-20b-low / -medium / -high
For evaluation, we cast a wide net: five RAG and agent modes, 16 embedding models, and a range of flow configuration options. To judge model responses, we used GPT-4o-mini and compared answers against known ground truth.
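As a rough illustration of that judging step, the sketch below grades a single answer against its ground truth with GPT-4o-mini; the prompt wording and the CORRECT/INCORRECT protocol are simplified stand-ins, not syftr’s actual evaluator.

```python
# Hedged sketch of an LLM-as-a-Judge check; assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Ground-truth answer: {truth}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge(question: str, truth: str, candidate: str) -> bool:
    """Return True if GPT-4o-mini judges the candidate answer to match the ground truth."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, truth=truth, candidate=candidate)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().upper().startswith("CORRECT")
```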
Finally, we tested across four datasets:
- FinanceBench (financial reasoning)
- HotpotQA (multi-hop QA)
- MultihopRAG (retrieval-augmented reasoning)
- PhantomWiki (synthetic Q&A pairs)
We optimized workflows twice: once for accuracy + latency, and once for accuracy + cost—capturing the tradeoffs that matter most in real-world deployments.
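Conceptually, each optimization pass keeps only the configurations that are not dominated on both objectives. Here is a small sketch of that selection for accuracy vs. cost (swap in latency for the other pass); the trial names and numbers are purely illustrative, not measured results.

```python
# Sketch of extracting a Pareto frontier over (accuracy, cost) from scored trials.
# Field names and values are illustrative; syftr's own data structures differ.
from dataclasses import dataclass

@dataclass
class Trial:
    name: str        # e.g. "gpt-oss-20b-low + some embedding model + vanilla RAG"
    accuracy: float  # higher is better
    cost: float      # lower is better (use latency here for the latency study)

def pareto_frontier(trials: list[Trial]) -> list[Trial]:
    """Keep trials no other trial beats on both objectives at once."""
    frontier = []
    for t in trials:
        dominated = any(
            (o.accuracy >= t.accuracy and o.cost <= t.cost) and
            (o.accuracy > t.accuracy or o.cost < t.cost)
            for o in trials
        )
        if not dominated:
            frontier.append(t)
    return sorted(frontier, key=lambda t: t.cost)

trials = [
    Trial("gpt-oss-20b-low", 0.57, 1.0),      # illustrative numbers only
    Trial("gpt-oss-120b-medium", 0.60, 6.0),
    Trial("gpt-oss-120b-high", 0.58, 12.0),
]
print([t.name for t in pareto_frontier(trials)])  # the 120b-high trial is dominated
```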
Optimizing for latency, cost, and accuracy
When we optimized the GPT-OSS models, we looked at two tradeoffs: accuracy vs. latency and accuracy vs. cost. The results were more surprising than we expected:
- GPT-OSS 20b (low thinking effort): Fast, inexpensive, and consistently accurate. This setup appeared on the Pareto frontier repeatedly, making it the best default choice for most non-scientific tasks. In practice, that means quicker responses and lower bills compared to higher thinking efforts.
- GPT-OSS 120b (medium thinking effort): Best suited for tasks that demand deeper reasoning, like financial benchmarks. Use this when accuracy on complex problems matters more than cost.
- GPT-OSS 120b (high thinking effort): Expensive and usually unnecessary. Keep it in your back pocket for edge cases where other models fall short. For our benchmarks, it didn’t add value.


Reading the results more carefully
At first glance, the results look straightforward. But there’s an important nuance: an LLM’s top accuracy score depends not just on the model itself, but on how the optimizer weighs it against other models in the mix. To illustrate, let’s look at FinanceBench.
When optimizing for latency, all GPT-OSS models (except high thinking effort) produced similar Pareto frontiers. In this case, the optimizer had little reason to concentrate on the 20b low thinking configuration: its top accuracy was only 51%.

When optimizing for cost, the picture shifts dramatically. The same 20b low thinking configuration jumps to 57% accuracy, while the 120b medium configuration actually drops 22%. Why? Because the 20b model is far cheaper, so the optimizer shifts more weight toward it.

The takeaway: Performance depends on context. Optimizers will favor different models depending on whether you’re prioritizing speed, cost, or accuracy. And given the huge search space of possible configurations, there may be even better setups beyond the ones we tested.
Finding agentic workflows that work well in your setup
The new GPT-OSS models performed strongly in our tests, especially the 20b with low thinking effort, which often outpaced more expensive competitors. The bigger lesson? A bigger model and more thinking effort don’t always mean more accuracy. Sometimes, paying more just gets you less.
This is exactly why we built syftr and made it open-source. Every use case is different, and the best workflow for you depends on the tradeoffs you care about most. Want lower costs? Faster responses? Maximum accuracy?
Run your own experiments and find the Pareto sweet spot that balances those priorities for your setup.
#ICML2025 outstanding position paper: Interview with Jaeho Kim on addressing the problems with conference reviewing

At this year’s International Conference on Machine Learning (ICML2025), Jaeho Kim, Yunseok Lee and Seulki Lee won an outstanding position paper award for their work Position: The AI Conference Peer Review Crisis Demands Author Feedback and Reviewer Rewards. We hear from Jaeho about the problems they were trying to address, and their proposed author feedback mechanism and reviewer reward system.
Could you say something about the problem that you address in your position paper?
Our position paper addresses the problems plaguing current AI conference peer review systems, while also raising questions about the future direction of peer review.
The imminent problem with the current peer review system in AI conferences is the exponential growth in paper submissions driven by increasing interest in AI. To put this in numbers, NeurIPS received over 30,000 submissions this year, while ICLR saw a 59.8% increase in submissions in just one year. This huge increase in submissions has created a fundamental mismatch: while paper submissions grow exponentially, the pool of qualified reviewers has not kept pace.
Submissions to some of the major AI conferences over the past few years.
This imbalance has severe consequences. The majority of papers are no longer receiving adequate review quality, undermining peer review’s essential function as a gatekeeper of scientific knowledge. When the review process fails, inappropriate papers and flawed research can slip through, potentially polluting the scientific record.
Considering AI’s profound societal impact, this breakdown in quality control poses risks that extend far beyond academia. Poor research that enters the scientific discourse can mislead future work, influence policy decisions, and ultimately hinder genuine knowledge advancement. Our position paper focuses on this critical question and proposes methods on how we can enhance the quality of review, thus leading to better dissemination of knowledge.
What do you argue for in the position paper?
Our position paper proposes two major changes to tackle the current peer review crisis: an author feedback mechanism and a reviewer reward system.
First, the author feedback system enables authors to formally evaluate the quality of reviews they receive. This system allows authors to assess reviewers’ comprehension of their work, identify potential signs of LLM-generated content, and establish basic safeguards against unfair, biased, or superficial reviews. Importantly, this isn’t about penalizing reviewers, but rather creating minimal accountability to protect authors from the small minority of reviewers who may not meet professional standards.
Second, our reviewer incentive system provides both immediate and long-term professional value for quality reviewing. For short-term motivation, author evaluation scores determine eligibility for digital badges (such as “Top 10% Reviewer” recognition) that can be displayed on academic profiles like OpenReview and Google Scholar. For long-term career impact, we propose novel metrics like a “reviewer impact score” – essentially an h-index calculated from the subsequent citations of papers a reviewer has evaluated. This treats reviewers as contributors to the papers they help improve and validates their role in advancing scientific knowledge.
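As a concrete illustration, such a reviewer impact score could be computed like an ordinary h-index over the citation counts of the papers a reviewer has evaluated; the sketch below is not from the paper, and the citation counts are made up.

```python
# Sketch of a "reviewer impact score": the largest h such that the reviewer
# evaluated h papers that each went on to receive at least h citations.
def reviewer_impact_score(citations_of_reviewed_papers: list[int]) -> int:
    counts = sorted(citations_of_reviewed_papers, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Made-up citation counts for six reviewed papers -> score of 4.
print(reviewer_impact_score([42, 17, 9, 5, 3, 1]))
```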
Could you tell us more about your proposal for this new two-way peer review method?
Our proposed two-way peer review system makes one key change to the current process: we split review release into two phases.
The authors’ proposed modification to the peer-review system.
Currently, authors submit papers, reviewers write complete reviews, and all reviews are released at once. In our system, authors first receive only the neutral sections – the summary, strengths, and questions about their paper. Authors then provide feedback on whether reviewers properly understood their work. Only after this feedback do we release the second part containing weaknesses and ratings.
This approach offers three main benefits. First, it’s practical – we don’t need to change existing timelines or review templates. The second phase can be released immediately after the authors give feedback. Second, it protects authors from irresponsible reviews since reviewers know their work will be evaluated. Third, since reviewers typically review multiple papers, we can track their feedback scores to help area chairs identify (ir)responsible reviewers.
The key insight is that authors know their own work best and can quickly spot when a reviewer hasn’t properly engaged with their paper.
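To make the mechanics concrete, here is a minimal sketch of the two-phase release as a simple data structure; the field names and the feedback gate are illustrative, since the paper proposes a process rather than an implementation.

```python
# Sketch of the two-phase review release: neutral sections first, then weaknesses
# and the rating only after the authors have given feedback. Illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Review:
    summary: str
    strengths: str
    questions: str
    weaknesses: str
    rating: int
    author_feedback: Optional[int] = None  # score authors give to phase one

    def phase_one(self) -> dict:
        """Released first: only the neutral sections of the review."""
        return {"summary": self.summary, "strengths": self.strengths,
                "questions": self.questions}

    def record_feedback(self, score: int) -> None:
        """Authors rate how well the reviewer understood the paper."""
        self.author_feedback = score

    def phase_two(self) -> dict:
        """Released only after author feedback: weaknesses and the rating."""
        if self.author_feedback is None:
            raise RuntimeError("Phase two is withheld until authors give feedback.")
        return {"weaknesses": self.weaknesses, "rating": self.rating}
```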
Could you talk about the concrete reward system that you suggest in the paper?
We propose both short-term and long-term rewards to address reviewer motivation, which naturally declines over time despite starting enthusiastically.
Short-term: Digital badges displayed on reviewers’ academic profiles, awarded based on author feedback scores. The goal is making reviewer contributions more visible. While some conferences list top reviewers on their websites, these lists are hard to find. Our badges would be prominently displayed on profiles and could even be printed on conference name tags.
Example of a badge that could appear on profiles.
Long-term: Numerical metrics to quantify reviewer impact at AI conferences. We suggest tracking measures like an h-index for reviewed papers. These metrics could be included in academic portfolios, similar to how we currently track publication impact.
The core idea is creating tangible career benefits for reviewers while establishing peer review as a professional academic service that rewards both authors and reviewers.
What do you think could be some of the pros and cons of implementing this system?
The benefits of our system are threefold. First, it is a very practical solution. Our approach doesn’t change current review schedules or review burdens, making it easy to incorporate into existing systems. Second, it encourages reviewers to act more responsibly, knowing their work will be evaluated. We emphasize that most reviewers already act professionally – however, even a small number of irresponsible reviewers can seriously damage the peer review system. Third, with sufficient scale, author feedback scores will make conferences more sustainable. Area chairs will have better information about reviewer quality, enabling them to make more informed decisions about paper acceptance.
However, there is strong potential for gaming by reviewers. Reviewers might optimize for rewards by giving overly positive reviews. Measures to counteract these problems are definitely needed. We are currently exploring solutions to address this issue.
Are there any concluding thoughts you’d like to add about the potential future of conferences and peer review?
One emerging trend we’ve observed is the increasing discussion of LLMs in peer review. While we believe current LLMs have several weaknesses (e.g., prompt injection, shallow reviews), we also think they will eventually surpass humans. When that happens, we will face a fundamental dilemma: if LLMs provide better reviews, why should humans be reviewing? Just as the rapid rise of LLMs caught us unprepared and created chaos, we cannot afford a repeat. We should start preparing for this question as soon as possible.
About Jaeho
Jaeho Kim is a Postdoctoral Researcher at Korea University with Professor Changhee Lee. He received his Ph.D. from UNIST under the supervision of Professor Seulki Lee. His main research focuses on time series learning, particularly developing foundation models that generate synthetic and human-guided time series data to reduce computational and data costs. He also contributes to improving the peer review process at major AI conferences, with his work recognized by the ICML 2025 Outstanding Position Paper Award.
Read the work in full
Position: The AI Conference Peer Review Crisis Demands Author Feedback and Reviewer Rewards, Jaeho Kim, Yunseok Lee, Seulki Lee.