Towards a virtual stuntman
Motion control problems have become standard benchmarks for reinforcement learning, and deep RL methods have been shown to be effective for a diverse suite of tasks ranging from manipulation to locomotion. However, characters trained with deep RL often exhibit unnatural behaviours, bearing artifacts such as jittering, asymmetric gaits, and excessive movement of limbs. Can we train our characters to produce more natural behaviours?
Simulated humanoid performing a variety of highly dynamic and acrobatic skills.
A wealth of inspiration can be drawn from computer graphics, where the physics-based simulation of natural movements has been a subject of intense study for decades. The greater emphasis placed on motion quality is often motivated by applications in film, visual effects, and games. Over the years, a rich body of work in physics-based character animation has developed controllers to produce robust and natural motions for a large corpus of tasks and characters. These methods often leverage human insight to incorporate task-specific control structures that provide strong inductive biases on the motions that can be achieved by the characters (e.g. finite-state machines, reduced models, and inverse dynamics). But as a result of these design decisions, the controllers are often specific to a particular character or task, and controllers developed for walking may not extend to more dynamic skills, where human insight becomes scarce.
In this work, we will draw inspiration from the two fields to take advantage of the generality afforded by deep learning models while also producing naturalistic behaviours that rival the state-of-the-art in full body motion simulation in computer graphics. We present a conceptually simple RL framework that enables simulated characters to learn highly dynamic and acrobatic skills from reference motion clips, which can be provided in the form of mocap data recorded from human subjects. Given a single demonstration of a skill, such as a spin-kick or a backflip, our character is able to learn a robust policy to imitate the skill in simulation. Our policies produce motions that are nearly indistinguishable from mocap.
Motion Imitation
In most RL benchmarks, simulated characters are represented using simple models that provide only a crude approximation of real world dynamics. Characters are therefore prone to exploiting idiosyncrasies of the simulation to develop unnatural behaviours that are infeasible in the real world. Incorporating more realistic biomechanical models can lead to more natural behaviours. But constructing high-fidelity models can be extremely challenging, and the resulting motions may nonetheless be unnatural.
An alternative is to take a data-driven approach, where reference motion capture of humans provides examples of natural motions. The character can then be trained to produce more natural behaviours by imitating the reference motions. Imitating motion data in simulation has a long history in computer animation and has seen some recent demonstrations with deep RL. While the results do appear more natural, they are still far from being able to faithfully reproduce a wide variety of motions.
In this work, our policies will be trained through a motion imitation task, where the goal of the character is to reproduce a given kinematic reference motion. Each reference motion is represented by a sequence of target poses $\{\hat{q}_0, \hat{q}_1,\ldots,\hat{q}_T\}$, where $\hat{q}_t$ is the target pose at timestep $t$. The reward function encourages the character to minimize the least squares pose error between the target pose $\hat{q}_t$ and the pose of the simulated character $q_t$.
While more sophisticated methods have been applied for motion imitation, we found that simply minimizing the tracking error (along with a couple of additional insights) works surprisingly well. The policies are trained by optimizing this objective using PPO.
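As a rough illustration of a tracking reward, the sketch below exponentiates the negative squared pose error, so a perfect match scores 1 and the reward decays towards 0 as the error grows. The exponential form, the `scale` constant, and the representation of a pose as a flat list of joint coordinates are assumptions made for this example, not details taken from the text above.

```python
import math


def tracking_reward(target_pose, sim_pose, scale=2.0):
    """Map the squared pose error to a reward in (0, 1].

    `scale` is an assumed sensitivity constant chosen for illustration.
    """
    if len(target_pose) != len(sim_pose):
        raise ValueError("poses must have the same number of coordinates")
    # Sum of squared differences between target and simulated coordinates.
    squared_error = sum(
        (q_hat - q) ** 2 for q_hat, q in zip(target_pose, sim_pose)
    )
    # Zero error gives reward 1; larger errors give smaller rewards.
    return math.exp(-scale * squared_error)
```

A policy optimizer such as PPO would then maximize the sum of these per-timestep rewards over an episode.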
With this framework, we are able to develop policies for a rich repertoire of challenging skills ranging from locomotion to acrobatics, martial arts to dancing.
The humanoid learns to imitate various skills. The blue character is the simulated character, and the green character is replaying the respective mocap clip. Top left: sideflip. Top right: cartwheel. Bottom left: kip-up. Bottom right: speed vault.
Next, we compare our method with previous results that used, e.g., generative adversarial imitation learning (GAIL) to imitate mocap clips. Our method is substantially simpler than GAIL, and it is better able to reproduce the reference motions. The resulting policy avoids many of the artifacts commonly exhibited by deep RL methods, and enables the character to produce a fluid, life-like running gait.
Comparison of our method (left) and work from Merel et al. [2017] using GAIL to imitate mocap data. Our motions appear significantly more natural than previous work using deep RL.
Insights
Reference State Initialization (RSI)
Suppose the character is trying to imitate a backflip. How would it know that doing a full rotation midair will result in high rewards? Since most RL algorithms are retrospective, they only observe rewards for states they have visited. In the case of a backflip, the character will have to observe successful trajectories of a backflip before it learns that those states will yield high rewards. But since a backflip can be very sensitive to the initial conditions at takeoff and landing, the character is unlikely to accidentally execute a successful trajectory through random exploration. To give the character a hint, at the start of each episode, we will initialize the character to a state sampled randomly along the reference motion. So sometimes the character will start on the ground, and sometimes it will start in the middle of the flip. This allows the character to learn which states will result in high rewards even before it has acquired the proficiency to reach those states.
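The idea can be sketched in a few lines. This is a minimal illustration, assuming the reference motion is a list of poses and that the start frame is drawn uniformly; the function name `sample_initial_state` and the `use_rsi` flag are hypothetical.

```python
import random


def sample_initial_state(reference_motion, use_rsi=True, rng=random):
    """Pick the frame and pose that an episode starts from.

    With RSI the start is drawn uniformly along the clip (an assumption);
    without it the episode always begins at the first frame.
    """
    if not reference_motion:
        raise ValueError("reference motion must contain at least one pose")
    t = rng.randrange(len(reference_motion)) if use_rsi else 0
    return t, reference_motion[t]
```

Passing a seeded `random.Random` instance as `rng` makes the sampled start frames reproducible between training runs.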
RSI provides the character with a richer initial state distribution by initializing it to a random point along the reference motion.
Below is a comparison of the backflip policy trained with RSI and without RSI, where the character is always initialized to a fixed initial state at the start of the motion. Without RSI, instead of learning a flip, the policy just cheats by hopping backwards.
Comparison of policies trained without RSI or ET. RSI and ET can be crucial for learning more dynamic motions. Left: RSI+ET. Middle: No RSI. Right: No ET.
Early Termination (ET)
Early termination is a staple for RL practitioners, and it is often used to improve simulation efficiency. If the character gets stuck in a state from which there is no chance of success, then the episode is terminated early, to avoid simulating the rest. Here we show that early termination can in fact have a significant impact on the results. Again, let’s consider a backflip. During the early stages of training, the policy is terrible and the character will spend most of its time falling. Once the character has fallen, it can be extremely difficult for it to recover. So the rollouts will be dominated by samples where the character is just struggling in vain on the ground. This is analogous to the class imbalance problem encountered by other methodologies such as supervised learning. This issue can be mitigated by terminating an episode as soon as the character enters such a futile state (e.g. falling). Coupled with RSI, ET helps to ensure that a larger portion of the dataset consists of samples close to the reference trajectory. Without ET the character never learns to perform a flip. Instead, it just falls and then tries to mime the motion on the ground.
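The effect on the collected data can be seen in a toy rollout loop. This is only a sketch: `step` and `has_fallen` are hypothetical callbacks standing in for a physics simulator and a fall detector, and the state is reduced to a single integer height in centimetres.

```python
def collect_rollout(initial_state, step, has_fallen, max_steps, use_et=True):
    """Simulate one episode and return the list of visited states.

    `step(state)` advances the simulation by one timestep and
    `has_fallen(state)` flags a futile state; both are hypothetical.
    """
    states = [initial_state]
    state = initial_state
    for _ in range(max_steps):
        state = step(state)
        states.append(state)
        if use_et and has_fallen(state):
            break  # stop here instead of recording the struggle on the ground
    return states


def sink(height_cm):
    """Toy dynamics: the character drops 20 cm per step until it hits the floor."""
    return max(0, height_cm - 20)


def on_ground(height_cm):
    """Toy fall detector: below 50 cm counts as fallen."""
    return height_cm < 50


with_et = collect_rollout(100, sink, on_ground, max_steps=20)
without_et = collect_rollout(100, sink, on_ground, max_steps=20, use_et=False)
```

In this toy setting the rollout with ET stops at the first fallen state, while the rollout without ET keeps accumulating samples from the ground until `max_steps` is reached.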
More Results
In total, we have been able to learn over 24 skills for the humanoid just by providing it with different reference motions.
Humanoid trained to imitate a rich repertoire of skills.
In addition to imitating mocap clips, we can also train the humanoid to perform some additional tasks like kicking a randomly placed target, or throwing a ball to a target.
Policies trained to kick and throw a ball to a random target.
We can also train a simulated Atlas robot to imitate mocap clips from a human. Though the Atlas has a very different morphology and mass distribution, it is still able to reproduce the desired motions. Not only can the policies imitate the reference motions, they can also recover from pretty significant perturbations.
Atlas trained to perform a spin-kick and backflip. The policies are robust to significant perturbations.
But what do we do if we don’t have mocap clips? Suppose we want to simulate a T-Rex. For various reasons, it is a bit difficult to mocap a T-Rex. So instead, we can have an artist hand-animate some keyframes and then train a policy to imitate those.
Simulated T-Rex trained to imitate artist-authored keyframes.
But why stop at a T-Rex? Let’s train a lion:
Simulated lion. Reference motion courtesy of Ziva Dynamics.
and a dragon:
Simulated dragon with a 418D state space and 94D action space.
The story here is that a simple method ends up working surprisingly well. Just by minimizing the tracking error, we are able to train policies for a diverse collection of characters and skills. We hope this work will help inspire the development of more dynamic motor skills for both simulated characters and robots in the real world. Exploring methods for imitating motions from more prevalent sources such as video is also an exciting avenue for scenarios that are challenging to mocap, such as animals and cluttered environments.
To learn more, check out our paper.
We would like to thank the co-authors of this work: Pieter Abbeel, Sergey Levine, and Michiel van de Panne. This project was done in collaboration with the University of British Columbia. This article was initially published on the BAIR blog, and appears here with the authors’ permission.
NHTSA/SAE’s “levels” of robocars may be contributing to highway deaths
The NHTSA/SAE “levels” of robocars are not just incorrect. I now believe they are contributing to an attitude towards their “level 2” autopilots that plays a small, but real role in the recent Tesla fatalities.
Readers of this blog will know I have been critical of the NHTSA/SAE “levels” taxonomy for robocars since it was announced. My criticisms have included simply viewing them as incorrect or misleading, and you might have enjoyed my satire of the levels, which questions the wisdom of defining the robocar based on the role the human being plays in driving it.
Recent events lead me to go further. I believe a case can be made that these levels are holding the industry back, and have a possible minor role in the traffic fatalities we have seen with Tesla autopilot. As such I urge the levels be renounced by NHTSA and the SAE and replaced by something better.
Some history
It’s true that in the early days, when Google was effectively the only company doing work on a full self-driving car for the roads, people were looking for some sort of taxonomy to describe the different types of potential cars. NHTSA’s first article laid one out as a series of levels numbered 0 to 4 which gave the appearance of an evolutionary progression.
Problem was, none of those stages existed. Even Google didn’t know what it wanted to build, and my most important contribution there probably was being one of those pushing it from the highway car with occasional human driving to the limited area urban taxi. Anthony Levandowski first wanted the highway car because it was the easiest thing to build and he’s always been eager to get out into reality as soon as possible.
The models were just ideas, and I don’t think the authors at NHTSA knew how much they would be carving them into public and industry thinking by releasing the idea of levels in an official document. They may not have known that once in people’s minds, they would affect product development, and also change ideas about regulation. To regulate something you must define it, and this was the only definition coming from government.
The central error of the levels was threefold. First, it defined vehicles according to the role the human being played in their operation. (In my satire, I compare that to how the “horseless carriage” was first defined by the role the horse played, or didn’t play in it.)
Second, by giving numbered levels and showing a chart of the future, it advanced the prediction that the levels were a progression. That they would be built in order, each level building on the work of the ones before it.
Worst of all, it cast into stone the guess that the driver assist (ADAS) technologies were closely related to, and the foundation of the robocar technologies. That they were just different levels of the same core idea.
(The ADAS technologies are things like adaptive cruise control, lanekeeping, forward collision avoidance, blindspot warning, anti-lock brakes and everything else that helps in driving or alerts or corrects driver errors.)
There certainly wasn’t consensus agreement on that guess. When Google pushed Nevada to pass the first robocar regulations, the car companies came forward during the comment phase to make one thing very clear: this new law had nothing to do with them. This law was for crazy out-there projects like Google. The law specifically exempted all the ADAS projects car companies had, including and up to things like the Tesla Autopilot, from being governed by the law or considered self-driving car technology.
Many in the car companies, whose specialty was ADAS, loved the progression idea. It meant that they were already on the right course. That the huge expertise they had built in ADAS was the right decision, and would give them a lead into the future.
Outside the car companies, the idea was disregarded. Almost all companies went directly to full robocar projects, barely using existing ADAS tools and hardware if at all. The exceptions were companies in the middle like Tesla and MobilEye who had feet in both camps.
SAE, in their 2nd version of their standard, partly at my urging, added language to say that the fact that the levels were numbered was not to be taken as an ordering. Very good, but not enough.
The reality
In spite of the levels, the first vehicle to get commercial deployment was the Navia (now Navya) which is a low speed shuttle with no user controls inside. What would be called a “level 4.” Later, using entirely different technology, Tesla’s Autopilot was the first commercial offering of “level 2.” Recently, Audi has declared that given the constraints of operating in a traffic jam, they are selling “level 3.” While nobody sold it, car companies demonstrated autopilot technologies going back to 2006, and of course prototype “level 4” cars completed the DARPA grand challenge in 2005 and urban challenge in 2007.
In other words, no ordering at all. DARPA’s rules were so against human involvement that teams were not allowed to send any signal to their car other than “abort the race.”
The confusion between the two broad technological areas has extended out to the public. People routinely think of Tesla’s autopilot as actual self-driving car technology, or simply as primitive self-driving car technology, rather than as extra-smart ADAS.
Tesla’s messaging points the public in both directions. On the one hand, Tesla’s manuals and startup screen are very clear that the Autopilot is not a self-driving car. That it needs to be constantly watched, with the driver ready to take over at any time. In some areas, that’s obvious to drivers — the AutoPilot does not respond to stop signs or red lights, so anybody driving an urban street without looking would soon get in an accident. On the highway, though, it’s better, and some would say too good. It can cruise around long enough without intervention to lull drivers into a false sense of security.
To prevent that, Tesla takes one basic measure — it requires you to apply a little torque to the steering wheel every so often to indicate your hands are on it. Wait too long and you get a visual alert, wait longer and you get an audible alarm. This is the lowest level of driver attention monitoring out there. Some players have a camera actually watch the driver’s eyes to make sure they are on the road most of the time.
At the same time, Tesla likes to talk up the hope their AutoPilot is a stepping stone. When you order your new Tesla, you can order AutoPilot and you can also pay $5,000 for “full self driving.” It’s mostly clear that they are two different things. When you order the full self driving package, you don’t get it, because it doesn’t exist. Rather you get some extra sensors in the car, and Tesla’s promise that a new software release in the future will use those extra sensors to give you some form of full self driving. Elon Musk likes to make radically optimistic predictions of when Tesla will produce full robocars that can come to you empty or take you door to door.
Operating domain
NHTSA improved things in their later documents starting in 2016. In particular they clarified that it was very important to consider what roads and road conditions a robocar was rated to operate in. They called this the ODD (Operational Design Domain.) The SAE had made that clear earlier when they had added a “Level 5” to make it clear that their Level 4 did not go everywhere. The Level 5 car that can go literally everywhere remains a science fiction goal for now — nobody knows how to do it or even truly plans to do it, because there are diminishing economic returns to handling and certifying safety on absolutely all roads, but it exists to remind people that the only meaningful level (4) does not go everywhere and is not science fiction. The 3rd level is effectively a car whose driving domain includes places where a human must take the controls to leave it.
NHTSA unified their levels with the SAE a few years in, but they are still called the NHTSA levels by most.
The confusion
The recent fatalities involving Uber and Tesla have shown the level of confusion among the public is high. Indeed, there is even confusion among those with greater familiarity with the industry. It has required press comments from some of the robocar companies to remind people, “very tragic about that Tesla crash, but realize that was not a self-driving car.” And indeed, there are still people in the industry who believe they will turn ADAS into robocars. I am not declaring them to be fools, but rather stating that we need people to be aware that this is very far from a foregone conclusion.
Are the levels solely responsible for the confusion? Surely not; a great deal of the blame can be laid in many places, including automakers who have been keen to be perceived as in the game even though their primary work is still in ADAS. Automakers were extremely frustrated back in 2010 when the press started writing that the true future of the car was in the hands of Google and other Silicon Valley companies, not with them. Many of them got to work on real robocar projects as well.
NHTSA and SAE’s levels may not be to blame for all the confusion, but they are to blame for not doing their best to counter it. They should renounce the levels and, if necessary, create a superior taxonomy which is both based on existing work and flexible enough to handle our lack of understanding of the future.
Robocars and ADAS should be declared two distinct things
While the world still hasn’t settled on a term (and the government and SAE documents have gone through a few themselves: Highly Automated Vehicles, Driving Automation for On-Road Vehicles, Automated Vehicles, Automated Driving Systems, etc.), I will use my preferred term here (robocars) but understand they will probably come up with something of their own. (As long as it’s not “driverless.”)
The Driver Assist systems would include traditional ADAS, as well as autopilots. There is no issue of the human’s role in this technology — it is always present and alert. These systems have been unregulated and may remain so, though there might be investigation into technologies to assure the drivers are remaining alert.
The robocar systems might be divided up by their operating domains. While this domain will be a map of specific streets, for the purposes of a taxonomy, people will be interested in types of roads and conditions. A rough guess at some categories would be “Highway,” “City-Fast” and “City-Slow.” Highway would be classified as roads that do not allow pedestrians and/or cyclists. The division between fast and slow will change with time, but today it’s probably at about 25mph. Delivery robots that run on roads will probably stick to the slow area. Subclassifications could include questions about the presence of snow, rain, crowds, animals and children.
What about old level 3?
What is called level 3 — a robocar that needs a human on standby to take over in certain situations — adds some complexity. This is a transitionary technology. It will only exist during the earliest phases of robocars as a “cheat” to get things going in places where the car’s domain is so limited that it’s forced to transition to human control while moving on short but not urgent notice.
Many people (including Waymo) think this is a bad idea — that it should never be made. It certainly should not be declared as one of the levels of a numbered progression. It is felt that a transition to human driving while moving at speed is a risky thing, exactly the sort of thing where failure is most common in other forms of automation.
Even so, car companies are building this, particularly for the traffic jam. While first visions of a car with a human on standby mostly talked about a highway car with the human handling exit ramps and construction zones, an easier and useful product is the traffic jam autopilot. This can drive safely with no human supervision in traffic jams. When the jam clears, the human needs to do the driving. This can be built without the need for takeover at speed, however. The takeover can be when stopped or at low speed, and if the human can’t take over, stopping is a reasonable option because the traffic was recently very slow.
Generally, however, these standby driver cars will be a footnote of history, and don’t really deserve a place in the taxonomy. While all cars will have regions they don’t drive, they will also all be capable of coming to a stop near the border of such regions, allowing the human to take control while slow or stopped, which is safe.
The public confusion slows things down
Tesla knows it does not have a robocar, and warns its drivers about this regularly, though they ignore it. Some of that inattention may come from those drivers imagining they have “almost a robocar.” But even without that factor, the taxonomies create another problem. The public, told that the Tesla is just a lower level of robocar, sees the deaths of Tesla drivers as a sign that real robocar projects are more dangerous. The real projects do have dangers, but not the same dangers as the autopilots have. (Though clearly lack of driver attention is an issue both have on their plates.) A real robocar is not going to follow badly painted highway lines right into a crash barrier. They follow their maps, and the lane markers are just part of how they decide where they are and where to go.
But if the public says, “We need the government to slow down the robocar teams because of those Tesla crashes” or “I don’t trust getting in the Waymo car because of the Tesla crashes” then we’ve done something very wrong.
(If the public says they are worried about the Waymo car because of the Uber crash, they have a more valid point, though those teams are also very different from one another.)
The Automated/Autonomous confusion
For decades, roboticists used the word “autonomous” to refer to robots that took actions and decisions without having to rely on an outside source (such as human guidance.) They never used it in the “self-ruling” sense it has politically, though that is the more common (but not only) definition in common parlance.
Unfortunately, one early figure in car experiments hated that the roboticists’ vocabulary didn’t match his narrow view of the meaning of the word, and he pushed with moderate success for the academic and governmental communities to use the word “automated.” To many people, unfortunately, “automated” means simple levels of automation. Your dishwasher is automated. Your teller machine is automated. To the roboticist, the robocar is autonomous — it can operate entirely without you. The autopilot is automated — it needs human guidance.
I suspect that the public might better understand the difference if these words were split in these fashions. The Waymo car is autonomous, the Tesla automated. Everybody in robotics knows they don’t use the word autonomous in the political sense. I expressed this in a joke many years ago, “A truly autonomous car is one that, if you tell it to take you to the office, says it would rather go to the beach instead.” Nobody is building that car. Yet.
Are they completely disjoint?
I am not attempting to say that there are no commonalities between ADAS and robocars. In fact, as development of both technologies progresses, elements of each have slipped into the other, and will continue to do so. Robocars have always used radars created for ADAS work. Computer vision tools are being used in both systems. The small ultrasonic sensors for ADAS are used by some robocars for close in detection where their LIDARs don’t see.
Even so, the difference is big enough to be qualitative and not, as numbered levels imply, quantitative. A robocar is not just an ADAS autopilot that is 10 times or 100 times or even 1,000 times better. It’s such a large difference that it doesn’t happen by evolutionary improvement but through different ways of thinking.
There are people who don’t believe this, and Tesla is the most prominent of them.
As such, I am not declaring that the two goals are entirely disjoint, but rather that official taxonomies should declare them to be disjoint. Make sure that developers and the public know the difference and so modulate their expectations. It won’t forbid the cross pollination between the efforts. It won’t even stop those who disagree with all I have said from trying the evolutionary approach on their ADAS systems to create robocars.
SAE should release a replacement for its levels, and NHTSA should endorse this or do the same.
Five projects make the first cut and receive a ROBOTT-NET pilot
It all started with 166 companies spread across 12 European countries applying for a “golden ticket” to ROBOTT-NET’s Voucher Program. 64 companies received a voucher and highly specialized consultancy from a broad range of the brightest robotics experts around Europe. Now five of the 64 projects have been selected for a ROBOTT-NET pilot.
Trumpf, Maser, Picolo, Weibel and Air Liquide are the five companies that will have their technology implemented in a pilot on a real-world use case.
Their voucher work varies greatly. Whilst Trumpf wanted to find out if automated handling of a large variety of sheet metal parts was possible, Picolo was working on generating welding robot programs. Weibel concentrated on flexible PCB Soldering and Air Liquide focused on autonomous goods picking, handling and transportation in industrial environments. Finally, Maser investigated automated systems to detect types of defects in chromed parts.
ROBOTT-NET’s mission is to collect and share the latest knowledge about robot technology that can improve production, bring new ideas to market and ensure economic competitiveness.
The pilot will help the companies develop their voucher work through proof of concept level and accelerate it towards commercialisation. It will be a medium-scale research installation that will last for up to 18 months, developing the robot-technology and business case explored in the voucher stage, and applying it to an industrial demonstrator at an end user’s site.
When selecting the five projects, ROBOTT-NET was looking for pilots that will scale well across new applications and create high impact on markets through enhanced productivity, competition and disruption. Scalability and market impact were key measures in the Pilot application.
Beyond these five projects, three more pilots will be announced soon.
If you want to know more about the five projects that have been selected for a pilot you can find them on ROBOTT-NET or ROBOTT-NET’s YouTube channel.
Trumpf
Maser
Picolo
Weibel
Air Liquide
Note: ROBOTT-NET will be at HANNOVER MESSE from April 24-27, 2018. If you are there, make sure you pass by Stand G46 in Hall 6 by the European Commission and see project results from EU-funded projects like nextgenio, ultraSURFACE, covr, fed4sae, DiFiCIL, IPP4CPPS, Smart Anything Everywhere (SAE), RADICLE, cloudSME, BEinCPPS, CloudiFacturing & Fortissimo.
What’s all the fuss about AI, robotics and China?
In the constantly changing landscape of today’s global digital workspace, AI’s presence grows in almost every industry. Retail giants like Amazon and Alibaba are using algorithms written by machine learning software to add value to the customer experience. Machine learning is also prevalent in the new Service Robotics world as robots transition from blind, dumb and caged to mobile and perceptive.
Competition is particularly focused between the US and China even though other countries and global corporations have large AI programs as well. The competition is real, fierce and dramatic. Talent is hard to find and costly. It’s a complex field that few fully understand; consequently, the talent pool is limited. Grabs of key players and companies headline the news every few days. “Apple hires away Google’s chief of search and AI.” “Amazon acquires AI cybersecurity startup.” “IBM invests millions into MIT AI research lab.” “Oracle acquires Zenedge.” “Ford acquires auto tech startup Argo AI.” “Baidu hires three world-renowned artificial intelligence scientists.”
Media, partly from the complexity of the subject, and partly from lack of knowledge, frighten people with scare headlines about misuse and autonomous weaponry. They exaggerate the competition into a hotly contested war for mastery of the field. It’s not really a “war” but it is dramatic and it’s playing out right now on many levels: immigration law, intellectual property transgressions, trade war fears, labor cost and availability challenges, and unfair competitive practices as well as technological breakthroughs and lower costs enabling experimentation and testing.
Two recent trends have sparked widespread use of machine learning: the availability of massive amounts of training data, and powerful and efficient parallel computing. GPUs are parallel processors and are used to train these deep neural networks. GPUs do so in less time, using far less datacenter infrastructure than non-parallel-processing super-computers.
Service and mobile robots often need to have all their computing power onboard as compared to stationary robots with control systems in separate nearby boxes. Sometimes onboard computing involves multiple processors; other times it necessitates super-computing power such as offered by chip makers that offer parallel processing and super-computer speeds. Nvidia’s Jetson chip, Isaac lab, and toolset are an example.
Nvidia
The Nvidia GPU Developers Conference held in San Jose last month highlighted Nvidia’s goal to capture the robotics AI market. They’ve set up an SDK and lab to help robotics companies capture and learn from the amount of data they are processing as they go about their tasks in mobility and vision processing.
Nvidia’s Jetson GPU, SDK, toolset and simulation platform are designed to help roboticists build and test robotics applications and simultaneously manage all the various onboard processes such as perception, navigation and manipulation. As a demonstration of the breadth of capabilities in their toolset, Nvidia had a delivery robot to cart around objects at the show.
Nvidia is offering libraries, an SDK, APIs, an open source deep learning accelerator, and other tools to encourage robot makers to incorporate Nvidia chips into their products. Nvidia sees this as a future source of revenue; right now it is mostly research and experimentation.
Examples of deep learning in robotics
In a recent CBInsights graphic categorizing the 2018 AI 100, 12 companies were highlighted in the robotics and auto technology sectors. Note from the Venn Diagram that not all AI companies are involved with robotics (in fact, most aren’t – there were 2,000+ startups in the pool of companies from which the 100 were chosen). The same is true for robotics.
- Robotics:
  - Vicarious
  - Kindred (CA)
  - Anki
  - UBTech (CN)
  - Brain Corp
  - Neurala
  - CloudMinds (CN)
- Auto Tech:
Here are four use cases of robot companies using AI chips in their products:
- Cobalt Robotics – Says CEO and Co-founder Travis Deyle, “Cobalt uses a high-end NVidia GPU (a 1080 variant) directly on the robot. We do a lot of processing locally (e.g. anomaly detection, person detection, etc) using a host of libraries: CUDA, TensorFlow, and various computer vision libraries. The algorithms running on the robot are just the tip of the iceberg. The on-robot detectors and classifiers are tuned to be very sensitive; upon detection, data is transmitted to the internet and runs through an extensive cloud-based machine learning pipeline and ultimately flags a remote human specialist for additional input and high-level decision making. The cloud-based pipeline also makes use of deep-learning processing power, which is likely powered by NVidia as well.”
- Bossa Nova Robotics – Walmart is partnering with San Francisco-based robotics company Bossa Nova on robots that roam the grocery and health products aisles of Walmart stores, auditing shelves and then sending data back to employees to ensure that missing items are restocked, as well as locating incorrect prices and wrong or missing labels. Bossa Nova’s Walmart robots house three Nvidia GPUs: one for navigation and mapping; another for perception and image stitching (it’s viewing 6′ of shelving at 2 mph); and a third for computing and analyzing what it’s seeing and turning that info into actionable restocking reports.
- Fetch Robotics – In addition to navigating, avoiding collisions and mapping, Fetch Robotics’ automated material transports and its new data survey line of AMRs collect data continuously and consistently. When the robots recharge themselves, all the stored data is uploaded to the cloud for post-processing and analytics.
- TuSimple (CN) – Beijing-based TuSimple’s truck driving technology is focused on the middle mile, i.e., transporting container boxes from one hub to another. Along the way TuSimple trucks are able to detect and track objects at distances of greater than 300 meters through advanced sensor fusion that combines data from multiple cameras using decimeter-level localization technology. Simultaneously, the truck’s decision-making system dynamically adapts to road conditions, including changing lanes and adjusting driving speeds. TuSimple uses NVIDIA GPUs, NVIDIA DRIVE PX 2, Jetson TX2, CUDA, TensorRT and cuDNN in its autonomous driving solution.
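The two-stage pattern Travis Deyle describes for Cobalt above (a deliberately sensitive on-robot detector, a heavier cloud pipeline, and finally a human specialist) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the threshold values, the `Detection` fields, and the function names are hypothetical and are not taken from Cobalt’s software.

```python
from dataclasses import dataclass

# All names and threshold values here are hypothetical, chosen for illustration.
ON_ROBOT_THRESHOLD = 0.2  # deliberately low: the on-robot detector is tuned to be sensitive
CLOUD_THRESHOLD = 0.7     # stricter second-stage check standing in for the cloud pipeline


@dataclass
class Detection:
    label: str
    robot_score: float  # confidence from the lightweight on-robot model
    cloud_score: float  # confidence the heavier cloud model would assign


def on_robot_stage(d: Detection) -> bool:
    """Sensitive first pass: forward anything that might be interesting."""
    return d.robot_score >= ON_ROBOT_THRESHOLD


def cloud_stage(d: Detection) -> bool:
    """Stricter second pass that filters out the first stage's false alarms."""
    return d.cloud_score >= CLOUD_THRESHOLD


def triage(detections):
    """Return the labels that would be flagged for a remote human specialist."""
    return [d.label for d in detections if on_robot_stage(d) and cloud_stage(d)]


events = [
    Detection("person in lobby", robot_score=0.35, cloud_score=0.91),
    Detection("shadow on wall", robot_score=0.25, cloud_score=0.10),
    Detection("empty hallway", robot_score=0.05, cloud_score=0.02),
]
print(triage(events))  # ['person in lobby']
```

The low first-stage threshold trades extra uploads for fewer missed events; the second stage is what keeps the human specialist from seeing every false alarm.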
The China factor
Twelve years ago, as a national long-term strategic goal, China crafted 5-year plans with specific goals: to encourage the use of robots in manufacturing to enhance quality and reduce the need for unskilled labor, and to establish the manufacture of robots in-country to reduce reliance on foreign suppliers. After three successive well-funded and fully incentivized 5-year robotics plans, one can easily see the transformation: robot and component manufacturers have grown from fewer than 10 to more than 700, while companies using robots in their manufacturing and material handling processes have grown similarly.
[NOTE: During the same period, America implemented various manufacturing initiatives involving robotics; however, none were comparably funded or, more importantly, continuously funded over time.]
Recently China turned its focus to artificial intelligence. Specifically, they’ve set out a three-pronged plan to catch up by 2020, achieve mid-term parity in autonomous vehicles, image recognition and, perhaps, simultaneous translation by 2025, and lead the world in AI and machine learning by 2030.
Western companies doing business in China have been plagued by intellectual property thievery, copying and reverse engineering, and heavy-handed partnerships and joint ventures where IP must be given to the Chinese venture. Steve Dickinson, a lawyer with Harris | Bricken, a Seattle law firm whose slogan is “Tough Markets; Bold Lawyers,” wrote:
“With respect to appropriating the technology and then selling it back into the developed market from which it came: that is of course the Chinese strategy. It is the strategy of businesses in every developing country. The U.S. followed this approach during the entire 19th and early 20th centuries. Japan and Korea and Taiwan did it with great success in the post WWII era. That is how technical progress is made.”
“It is clear that appropriating foreign AI technology is the goal of every Chinese company operating in this sector [robotics, e-commerce, logistics and manufacturing]. For that reason, all foreign entities that work with Chinese companies in any way must be aware of the significant risk and must take the steps required to protect themselves.”
What is really clear is that where data in large quantity is available, as in China, and where speed is normal and privacy is nil, as in China, AI techniques such as machine and deep learning can thrive and achieve remarkable results at breakneck speed. That’s what is happening right now in China.
Bottom line:
Growth in the service robotics sector is still a promise more than a reality and there is a pressing need to deliver on those promises. We have seen tremendous progress on processors, sensors, cameras and communications but so far the integration is lacking. One roboticist characterized the integration of all that data as a need for a “reality sensor”, i.e., a higher-level indicator of what is being seen or processed. If the sensors pick up a series of pixels that are interpreted to be a person, and the processing determines its motion to be intersecting with your robot, it would be helpful to know whether it’s a pedestrian, a policeman, a fireman, a sanitation worker, a construction worker, a surveyor, etc. That information would help refine the prediction and your actions. It would add reality to image processing and visual perception.
Even as the ratio of development in hardware to software shifts more toward software, there are still many challenges to overcome. Henrik Christensen, the director of the Institute for Contextual Robotics at the University of California San Diego, cited a few of those challenges:
- Better end-effectors / hands. We still only have very limited capability hands and they are WAY too expensive
- The user interfaces for most robots are still very limited, e.g., different robots have different chargers
- The cost of integrating systems is very high. We need much better plug-n-play systems
- We see lots of use of AI / deep learning but in most cases without performance guarantees; not a viable long-term solution until things improve
One often forgets the science involved in robotics and embedded AI, and the many challenges remaining until we have a functional, fully capable, fully interactive service robot.
Artificial intelligence in action
By Meg Murphy
A person watching videos that show things opening — a door, a book, curtains, a blooming flower, a yawning dog — easily understands the same type of action is depicted in each clip.
“Computer models fail miserably to identify these things. How do humans do it so effortlessly?” asks Dan Gutfreund, a principal investigator at the MIT-IBM Watson AI Laboratory and a staff member at IBM Research. “We process information as it happens in space and time. How can we teach computer models to do that?”
Such are the big questions behind one of the new projects underway at the MIT-IBM Watson AI Laboratory, a collaboration for research on the frontiers of artificial intelligence. Launched last fall, the lab brings MIT and IBM researchers together to work on AI algorithms, the application of AI to industries, the physics of AI, and ways to use AI to advance shared prosperity.
The Moments in Time dataset is one of the projects related to AI algorithms that is funded by the lab. It pairs Gutfreund with Aude Oliva, a principal research scientist at the MIT Computer Science and Artificial Intelligence Laboratory, as the project’s principal investigators. Moments in Time is built on a collection of 1 million annotated videos of dynamic events unfolding within three seconds. Gutfreund and Oliva, who is also the MIT executive director at the MIT-IBM Watson AI Lab, are using these clips to address one of the next big steps for AI: teaching machines to recognize actions.
Learning from dynamic scenes
The goal is to provide deep-learning algorithms with large coverage of an ecosystem of visual and auditory moments that may enable models to learn information that isn’t necessarily taught in a supervised manner and to generalize to novel situations and tasks, say the researchers.
“As we grow up, we look around, we see people and objects moving, we hear sounds that people and objects make. We have a lot of visual and auditory experiences. An AI system needs to learn the same way and be fed with videos and dynamic information,” Oliva says.
For every action category in the dataset, such as cooking, running, or opening, there are more than 2,000 videos. The short clips enable computer models to better learn the diversity of meaning around specific actions and events.
“This dataset can serve as a new challenge to develop AI models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis,” Oliva adds, describing the factors involved. Events can include people, objects, animals, and nature. They may be symmetrical in time — for example, opening means closing in reverse order. And they can be transient or sustained.
Oliva and Gutfreund, along with additional researchers from MIT and IBM, met weekly for more than a year to tackle technical issues, such as how to choose the action categories for annotations, where to find the videos, and how to put together a wide array so the AI system learns without bias. The team also developed machine-learning models, which were then used to scale the data collection. “We aligned very well because we have the same enthusiasm and the same goal,” says Oliva.
Augmenting human intelligence
One key goal at the lab is the development of AI systems that move beyond specialized tasks to tackle more complex problems and benefit from robust and continuous learning. “We are seeking new algorithms that not only leverage big data when available, but also learn from limited data to augment human intelligence,” says Sophie V. Vandebroek, chief operating officer of IBM Research, about the collaboration.
In addition to pairing the unique technical and scientific strengths of each organization, IBM is also bringing MIT researchers an influx of resources, signaled by its $240 million investment in AI efforts over the next 10 years, dedicated to the MIT-IBM Watson AI Lab. And the alignment of MIT-IBM interest in AI is proving beneficial, according to Oliva.
“IBM came to MIT with an interest in developing new ideas for an artificial intelligence system based on vision. I proposed a project where we build data sets to feed the model about the world. It had not been done before at this level. It was a novel undertaking. Now we have reached the milestone of 1 million videos for visual AI training, and people can go to our website, download the dataset and our deep-learning computer models, which have been taught to recognize actions.”
Qualitative results so far have shown models can recognize moments well when the action is well-framed and close up, but they misfire when the category is fine-grained or there is background clutter, among other things. Oliva says that MIT and IBM researchers have submitted an article describing the performance of neural network models trained on the dataset, which itself was deepened by shared viewpoints. “IBM researchers gave us ideas to add action categories to have more richness in areas like health care and sports. They broadened our view. They gave us ideas about how AI can make an impact from the perspective of business and the needs of the world,” she says.
This first version of the Moments in Time dataset is one of the largest human-annotated video datasets capturing visual and audible short events, all of which are tagged with an action or activity label among 339 different classes that include a wide range of common verbs. The researchers intend to produce more datasets with a variety of levels of abstraction to serve as stepping stones toward the development of learning algorithms that can build analogies between things, imagine and synthesize novel events, and interpret scenarios.
In other words, they are just getting started, says Gutfreund. “We expect the Moments in Time dataset to enable models to richly understand actions and dynamics in videos.”