In February we asked for input from the robotics community regarding a potential Robotics Flagship, a pan-European interdisciplinary effort with 1B EUR in funding, if successful! The goal of the flagship is to drive the development of future robots and AIs that are ethically, socially, economically, energetically, and environmentally responsible and sustainable.
This is the first of many activities we will host to engage the community. You can read more about the Robotics Flagship in a nutshell here.
We received 125 replies (120 from Europe) from roboticists.
In what areas does robotics have the highest potential to benefit society?
Overall, replies show the potential of robotics in all sectors to benefit society, since they all received an average score above 3 out of 5 (high potential). Sectors which received the highest average score were industry, logistics, agriculture, inspection of infrastructure, healthcare, exploration, and transport, in that order, all with an average above 4. Other sectors highlighted by respondents included ecology and environmental protection, tourism, construction, and the use of robots for human understanding, or for scientific investigation of body and brain.
What are the main challenges to achieving this potential?
The main challenge to achieving this potential was seen as technological with an average score of 4.35 out of 5 (very challenging), then societal and regulatory (average scores of 3.69), and finally economic (average score of 3.52). Respondents also highlighted ethical, ideological and political challenges.
What are the key abilities that need to be developed for the robots of tomorrow?
Central to the flagship proposal is the need for new robot abilities that will make robots a reality in our everyday lives. All abilities shown below were seen as central to developing the robots of tomorrow, with average scores above 2.9 out of 5 (very important). Abilities which received the highest average score were learning, advanced sensing, and cognition, in that order, all with an average above 4. This clearly shows the need to develop robotics and AI hand in hand. Other abilities highlighted by respondents included reliability, security and safety, reconfigurability, modularity and customisation, advanced actuation, and efficient energy usage.
What resources would you need to make your robots a reality?
We also asked the community what resources they would need to make their robots a reality. Not surprisingly, funding came out on top with an average score of 4.7 out of 5 (very important), followed by networking opportunities (3.75), experimental sites (3.72), fabrication facilities (3.58) and standards (3.31).
So what else did the community think would be helpful? Time, software and hardware aggregators, integrators, and maintainers, ethical and legal support, as well as a better understanding of user requirements and social attitudes.
What would you like to see in a robotics flagship?
Finally, we asked what the community would like to see in a robotics flagship. There were too many suggestions to post here, but a recurring theme was high risk projects and big ideas, the need for cross-disciplinary research, and the hope that robots will finally leave the lab to work alongside humans.
Should we be afraid of artificial intelligence? For me, this is a simple question with an even simpler, two-letter answer: no. But not everyone agrees – many people, including the late physicist Stephen Hawking, have raised concerns that the rise of powerful AI systems could spell the end for humanity.
Clearly, your view on whether AI will take over the world will depend on whether you think it can develop intelligent behaviour surpassing that of humans – something referred to as “super intelligence”. So let’s take a look at how likely this is, and why there is much concern about the future of AI.
Humans tend to be afraid of what they don’t understand. Fear is often blamed for racism, homophobia and other sources of discrimination. So it’s no wonder it also applies to new technologies – they are often surrounded with a certain mystery. Some technological achievements seem almost unrealistic, clearly surpassing expectations and in some cases human performance.
No ghost in the machine
But let us demystify the most popular AI techniques, known collectively as “machine learning”. These allow a machine to learn a task without being programmed with explicit instructions. This may sound spooky but the truth is it is all down to some rather mundane statistics.
The machine, which is a program, or rather an algorithm, is designed with the ability to discover relationships within provided data. There are many different methods that allow us to achieve this. For example, we can present to the machine images of handwritten letters (a-z), one by one, and ask it to tell us which letter we show each time. We have already provided the possible answers – it can only be one of (a-z). At the beginning the machine says a letter at random and we correct it by providing the right answer. We have also programmed the machine to reconfigure itself so that the next time it is presented with the same letter, it is more likely to give us the correct answer. As a consequence, the machine improves its performance over time and “learns” to recognise the alphabet.
In essence, we have programmed the machine to exploit common relationships in the data in order to achieve the specific task. For instance, all versions of “a” look structurally similar, but different to “b”, and the algorithm can exploit this. Interestingly, after the training phase, the machine can apply the obtained knowledge to new letter samples, for example written by a person whose handwriting the machine has never seen before.
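The guess-and-correct loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production recogniser: it assumes three toy 3x3 "letters" instead of the full alphabet, starts from a fixed first guess rather than a random one so that the run is reproducible, and uses a simple perceptron-style update as one possible way for the machine to "reconfigure itself". All names (`GLYPHS`, `predict`, `train`) are hypothetical.

```python
# Toy 3x3 bitmaps (row by row); an assumption standing in for handwritten letters.
GLYPHS = {
    "I": [0, 1, 0,  0, 1, 0,  0, 1, 0],
    "L": [1, 0, 0,  1, 0, 0,  1, 1, 1],
    "T": [1, 1, 1,  0, 1, 0,  0, 1, 0],
}
LABELS = list(GLYPHS)


def predict(weights, pixels):
    """Return the label whose weight vector scores the image highest."""
    def score(label):
        return sum(w * p for w, p in zip(weights[label], pixels))
    return max(LABELS, key=score)


def train(samples, max_epochs=20):
    """Show each sample, and nudge the weights whenever the guess is wrong."""
    weights = {label: [0] * 9 for label in LABELS}
    for _ in range(max_epochs):
        mistakes = 0
        for label, pixels in samples:
            guess = predict(weights, pixels)
            if guess != label:
                mistakes += 1
                for i, p in enumerate(pixels):
                    weights[label][i] += p   # make the right answer more likely
                    weights[guess][i] -= p   # make the wrong answer less likely
        if mistakes == 0:                    # every training letter is recognised
            break
    return weights


weights = train(list(GLYPHS.items()))

# "Handwriting" the machine has never seen: an L with one pixel missing.
unseen_l = [1, 0, 0,  1, 0, 0,  1, 1, 0]
print(predict(weights, unseen_l))
```

The weights never store the letters themselves; they only encode which pixels distinguish one letter from another, which is why a slightly different "L" is still recognised.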
Humans, however, are good at reading. Perhaps a more interesting example is Google DeepMind’s artificial Go player, which has surpassed every human player in their performance of the game. It clearly learns in a way different to humans – playing a number of games with itself that no human could play in their lifetime. It has been specifically instructed to win and told that the actions it takes determine whether it wins or not. It has also been told the rules of the game. By playing the game again and again it can discover in each situation what is the best action – inventing moves that no human has played before.
Toddlers versus robots
Now does that make the AI Go player smarter than a human? Certainly not. AI is very specialised to particular types of tasks and it doesn’t display the versatility that humans do. Humans develop an understanding of the world over years that no AI has achieved or seems likely to achieve anytime soon.
The fact that AI is dubbed “intelligent” is ultimately down to the fact that it can learn. But even when it comes to learning, it is no match for humans. In fact, toddlers can learn by just watching somebody solving a problem once. An AI, on the other hand, needs tonnes of data and loads of tries to succeed on very specific problems, and it is difficult to generalise its knowledge to tasks very different from those it was trained on. So while humans develop breathtaking intelligence rapidly in the first few years of life, the key concepts behind machine learning are not so different from what they were one or two decades ago.
The success of modern AI is less due to a breakthrough in new techniques and more due to the vast amount of data and computational power available. Importantly, though, even an infinite amount of data won’t give AI human-like intelligence – we need to make significant progress on developing artificial “general intelligence” techniques first. Some approaches to doing this involve building a computer model of the human brain – which we’re not even close to achieving.
Ultimately, just because an AI can learn, it doesn’t really follow that it will suddenly learn all aspects of human intelligence and outsmart us. There is no simple definition of what human intelligence even is and we certainly have little idea how exactly intelligence emerges in the brain. But even if we could work it out and then create an AI that could learn to become more intelligent, that doesn’t necessarily mean that it would be more successful.
Personally, I am more concerned by how humans use AI. Machine learning algorithms are often thought of as black boxes, and less effort is made in pinpointing the specifics of the solution our algorithms have found. This is an important and frequently neglected aspect as we are often obsessed with performance and less with understanding. Understanding the solutions that these systems have discovered is important, because we can also evaluate if they are correct or desirable solutions.
If, for instance, we train our system in the wrong way, we can also end up with a machine that has learned relationships that do not hold in general. Say for instance that we want to design a machine to evaluate the ability of potential students in engineering. Probably a terrible idea, but let us follow it through for the sake of the argument. Traditionally, this is a male-dominated discipline, which means that training samples are likely to be from previous male students. If we don’t make sure, for instance, that the training data are balanced, the machine might end up with the conclusion that engineering students are male, and incorrectly apply it to future decisions.
Machine learning and artificial intelligence are tools. They can be used in a right or a wrong way, like everything else. It is the way that they are used that should concern us, not the methods themselves. Human greed and human unintelligence scare me far more than artificial intelligence.
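A deliberately naive sketch makes the problem concrete. Assume a hypothetical "model" that simply scores an applicant by how closely they resemble past students, and a made-up history of 90 male and 10 female students. Neither the data nor the function names come from a real admissions system; they only illustrate how imbalance in the training set turns into a learned rule.

```python
from collections import Counter


def fit_resemblance_model(past_students):
    """Learn how often each group appears among past students (a toy model)."""
    counts = Counter(student["gender"] for student in past_students)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def balance(past_students):
    """Downsample every group to the size of the smallest one."""
    groups = {}
    for student in past_students:
        groups.setdefault(student["gender"], []).append(student)
    smallest = min(len(members) for members in groups.values())
    return [s for members in groups.values() for s in members[:smallest]]


# Hypothetical history from a male-dominated discipline.
history = [{"gender": "male"}] * 90 + [{"gender": "female"}] * 10

biased_model = fit_resemblance_model(history)
balanced_model = fit_resemblance_model(balance(history))

print(biased_model)    # the imbalance in the data becomes the model's "knowledge"
print(balanced_model)  # after balancing, gender no longer separates applicants
```

Balancing the data is only one possible remedy, and it does not make the original idea a good one; the point is that the skew was in the samples, not in the students.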
This week a Harvard Business School student challenged me to name a startup capable of producing an intelligent robot – TODAY! At first I did not understand the question, as artificial intelligence (AI) is an implement like any other in a roboticist’s toolbox. The student persisted; she demanded to know if I thought that the current co-bots working in factories could one day evolve to perceive the world like humans. It’s a good question that I didn’t appreciate at the time as robots are best deployed for specific repeatable tasks, even with deep learning systems. By contrast, mortals comprehend their surroundings (and other organisms) using a sixth sense, intuition.
As an avid tennis player, I also enjoyed meeting Tennibot this week. The autonomous ball-gathering robot sweeps the court like a Roomba sucking up dust off a rug. In order to accomplish this task, without knocking over players, it navigates around the cage utilizing six cameras on each side. This is a perfect example of the type of job that an unmanned system excels at performing, freeing up athletes from wasting precious court time with tedious cleanup. Yet, Tennibot, at the end of the day, is a dumb appliance. While it gobbles up balls quicker than any person, it is unable to discern the quality of the game or the health of players.
No one expects Tennibot to save Roger Federer’s life, but what happens when a person has a heart attack inside a self-driving car on a two-hour journey? While autonomous vehicles are packed with sensors to identify and safely steer around cities and highways, few are able to perceive human intent. As Ann Cheng of Hyundai explains, “We [drivers] think about what that other person is doing or has the intent to do. We see a lot of AI companies working on more classical problems, like object detection [or] object classification. Perceptive is trying to go one layer deeper: what we do intuitively already.” Hyundai joined Jim Adler’s Toyota AI Ventures this month in investing in Perceptive Automata, an “intuitive self-driving system that is able to recognize, understand, and predict human behavior.”
As stated by Adler’s Medium post, Perceptive’s technology uses “behavioral science techniques to characterize the way human drivers understand the state-of-mind of other humans and then train deep learning models to acquire that human ability. These deep learning models are designed for integration into autonomous driving stacks and next-generation driver assistance systems, sandwiched between the perception and planning layers. These deep learning, predictive models provide real-time information on the intention, awareness, and other state-of-mind attributes of pedestrians, cyclists and other motorists.”
While Perceptive Automata is creating “predictive models” for outside the vehicle, few companies are focused on the conditions inside the cabin. The closest implementations are a number of eye-tracking cameras that alert occupants to distracted driving. While these technologies observe the general conditions of passengers, they rely on direct eye contact to distinguish between emotions (fatigue, excitability, stress, etc.), which is impossible if one is passed out. Furthermore, none of these vision systems have the ability to predict human actions before they become catastrophic.
Isaac Litman, formerly of Mobileye, understands full well the dilemma presented by computer vision systems in delivering on the promise of autonomous travel. In speaking with Litman this week about his newest venture Neteera, he declared that in today’s automotive landscape “the only unknown variable is the human.” Unfortunately, the recent wave of Tesla and Uber autopilot crashes has glaringly illustrated the importance of tracking the attention of vehicle occupants in handing off between autopilot systems and human drivers. Litman further explains that Waymo and others are collecting data on occupant comfort as AI-enabled drivers have reportedly led to high levels of nausea from driving too consistently. Litman describes this as the indigestion problem, clarifying that after eating a big meal one may want to drive more slowly than on an empty stomach. In the future Litman professes that autonomous cars will be marketed “not by the performance of their engines, but on the comfort of their rides.”
Litman’s view is further endorsed by the recent patent application filed this summer by Apple’s Project Titan team for developing “Comfort Profiles” for autonomous driving. According to AppleInsider, the application “describes how an autonomous driving and navigation system can move through an environment, with motion governed by a number of factors that are set indirectly by the passengers of the vehicle.” The Project Titan system would utilize a fusion of sensors (LIDAR, depth cameras, and infrared) to monitor the occupants’ “eye movements, body posture, gestures, pupil dilation, blinking, body temperature, heart beat, perspiration, and head position.” The application details how the data would integrate into the vehicle systems to automatically adjust the acceleration, turning rate, performance, suspension, traction control and other factors to the personal preferences of the riders. While Project Titan is taking the first step toward developing an autonomous comfort system, Litman expresses that it is limited by the inherent shortcomings of vision-based systems that are susceptible to light, dust, line of sight, condensation, motion, resolution, and safety concerns.
Unlike vision sensors, Neteera is a cost-effective micro-radar on a chip that leverages its own network of proprietary algorithms to provide “the first contact free vital sign detection platform.” Its FDA-level of accuracy is not only being utilized by the automotive sector, but also by healthcare systems across the United States for monitoring such elusive conditions as sleep apnea and sudden infant death syndrome. To date, the challenge of monitoring vital signs through micro-skin motion in the automotive industry has been the displacement caused by a moving vehicle. However, Litman’s team has developed a patent-pending “motion compensation algorithm” that tracks “quasi-periodic signals in the presence of massive random motions,” providing near perfect accuracy (see tables below).
While the automotive industry races to launch fleets of autonomous vehicles, Litman estimates that the most successful players will be the ones that install empathic engines into the machines’ framework. Unlike the crowded field of AI and computer vision startups that are enabling robocars to safely navigate city streets, Neteera’s “intuition on a chip” is probably one of the only mechatronic ventures that actually report on the psychological state of drivers and passengers. Litman’s innovation has wider societal implications, as social robots begin to augment humans in the workplace and support the infirm and elderly in coping with the fragility of life.
As scientists improve artificial intelligence, it is still unclear what the reaction will be from ordinary people to such “emotional” robots. In the words of writer Adam Williams, “Emotion is something we reserve for ourselves: depth of feeling is what we use to justify the primacy of human life. If a machine is capable of feeling, that doesn’t make it dangerous in a Terminator-esque fashion, but in the abstract sense of impinging on what we think of as classically human.”
If you have seen science fiction television series such as Humans or Westworld, you might be imagining a near future where intelligent, humanoid robots play an important role in meeting the needs of people, including caring for children or older relatives.
Well-designed smart technology tools can help government agencies, the environment and residents. Smart cities can upgrade the efficiency of city services by eliminating redundancies and finding ways to save money, as well as provide higher-quality services at a lower cost.
Automated transportation is not just limited to providing practical solutions for logistical problems, but it also ensures that its solutions enable enterprises to pinpoint and address their logistical inefficiencies in real-time.
The iTENDO is the world’s first intelligent toolholder with real-time process control. In an interview, developers Friedrich Bleicher and Johannes Ketterer explain how the embedded systems solution makes production smart and economical.
The goal of our project is to create a dual-arm collaborative robot with a humanoid shape entirely designed, made, and built at Polytechnique Montreal (Canada).
A crucial task for energy providers is the reliable and safe operation of their plants, especially when producing energy offshore. Autonomous mobile robots are able to offer comprehensive support through regular and automated inspection of machinery and infrastructure. In a world-first pilot installation, transmission system operator TenneT tested the autonomous legged robot ANYmal on one of the world’s largest offshore converter platforms in the North Sea.
In September 2018, the ANYbotics field team boarded a helicopter to fly out to one of the world’s largest offshore converter platforms in the North Sea. Equipped with a customized sensorhead, our four-legged robotic platform ANYmal autonomously performed various inspection tasks of the platform in a one-week pilot installation, making it the world’s first autonomous offshore robot.
Offshore Wind Farms
Offshore energy production is a key component of global energy supply. Apart from oil and natural gas extraction, wind energy is increasingly being produced offshore. One of the key innovators in this field is the Dutch-German transmission system operator TenneT, which connects large-scale offshore wind farms to the onshore grid over a high voltage DC connection. Setting out to provide reliable and low-cost energy transmission and distribution, the company adopts the most recent technologies.
Robotic Inspection with ANYmal
ANYbotics is partnering with TenneT to evaluate robotic inspection and maintenance on their offshore converter platforms. In periods of unmanned platform operation, a mobile robot helps to reduce the risk of disruptions and ensures the security of the electricity supply. Based on its autonomous navigation capabilities, ANYmal performs routine inspection tasks to monitor machine operations, read out sensory equipment and detect thermal hotspots and oil or water leakages. Whenever required, ANYmal can be remotely operated from an onshore control center in order for TenneT to receive real-time information through the robot’s onboard visual and thermal cameras, microphones and gas detection sensors.
Successful Pilot Installation
Before being deployed on the offshore mission, the ANYbotics field team underwent rigorous safety training, including helicopter escape and survival-at-sea scenarios. After being taken on a guided tour of the platform to 3D-map the environment and learn the position and characteristics of all inspection points, ANYmal autonomously navigated the platform and processed inspection protocols. The video documents a fully autonomous mission, covering a total of 16 inspection points such as gauges, levers, oil and water levels and various other visual and thermal measurements.
Finding lost hikers in forests can be a difficult and lengthy process, as helicopters and drones can’t get a glimpse through the thick tree canopy. Recently, it’s been proposed that autonomous drones, which can bob and weave through trees, could aid these searches. But the GPS signals used to guide the aircraft can be unreliable or nonexistent in forest environments.
In a paper being presented at the International Symposium on Experimental Robotics conference next week, MIT researchers describe an autonomous system for a fleet of drones to collaboratively search under dense forest canopies. The drones use only onboard computation and wireless communication — no GPS required.
Each autonomous quadrotor drone is equipped with laser-range finders for position estimation, localization, and path planning. As the drone flies around, it creates an individual 3-D map of the terrain. Algorithms help it recognize unexplored and already-searched spots, so it knows when it’s fully mapped an area. An off-board ground station fuses individual maps from multiple drones into a global 3-D map that can be monitored by human rescuers.
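As a rough illustration of the bookkeeping involved, the sketch below tracks explored and unexplored cells on a small grid and fuses two drones' maps. It is a simplification under stated assumptions, not the researchers' implementation: the world is a flat 4x4 grid rather than a 3-D map, a scan is assumed to reveal a fixed square around the drone, and the two maps are assumed to be aligned already. The function names are hypothetical.

```python
def new_map(rows, cols):
    """A grid where False means 'not yet explored'."""
    return [[False] * cols for _ in range(rows)]


def mark_scanned(grid, row, col, radius=1):
    """Mark every cell within `radius` of the drone's cell as explored."""
    for r in range(max(0, row - radius), min(len(grid), row + radius + 1)):
        for c in range(max(0, col - radius), min(len(grid[0]), col + radius + 1)):
            grid[r][c] = True


def unexplored_cells(grid):
    """List the cells that still need a visit; empty means fully mapped."""
    return [(r, c) for r, row in enumerate(grid)
            for c, seen in enumerate(row) if not seen]


def fuse_maps(grid_a, grid_b):
    """Ground-station view: a cell is explored if either drone has seen it."""
    return [[a or b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(grid_a, grid_b)]


drone_a, drone_b = new_map(4, 4), new_map(4, 4)
mark_scanned(drone_a, 1, 1)   # drone A scans around the upper-left area
mark_scanned(drone_b, 2, 2)   # drone B scans around the lower-right area

global_map = fuse_maps(drone_a, drone_b)
print(unexplored_cells(global_map))   # the two corners neither drone has seen
```

An empty list from `unexplored_cells` is the signal that an area is fully mapped.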
In a real-world implementation, though not in the current system, the drones would come equipped with object detection to identify a missing hiker. When located, the drone would tag the hiker’s location on the global map. Humans could then use this information to plan a rescue mission.
“Essentially, we’re replacing humans with a fleet of drones to make the search part of the search-and-rescue process more efficient,” says first author Yulun Tian, a graduate student in the Department of Aeronautics and Astronautics (AeroAstro).
The researchers tested multiple drones in simulations of randomly generated forests, and tested two drones in a forested area within NASA’s Langley Research Center. In both experiments, each drone mapped a roughly 20-square-meter area in about two to five minutes and collaboratively fused their maps together in real-time. The drones also performed well across several metrics, including overall speed and time to complete the mission, detection of forest features, and accurate merging of maps.
Co-authors on the paper are: Katherine Liu, a PhD student in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and AeroAstro; Kyel Ok, a PhD student in CSAIL and the Department of Electrical Engineering and Computer Science; Loc Tran and Danette Allen of the NASA Langley Research Center; Nicholas Roy, an AeroAstro professor and CSAIL researcher; and Jonathan P. How, the Richard Cockburn Maclaurin Professor of Aeronautics and Astronautics.
Exploring and mapping
On each drone, the researchers mounted a LIDAR system, which creates a 2-D scan of the surrounding obstacles by shooting laser beams and measuring the reflected pulses. This can be used to detect trees; however, to drones, individual trees appear remarkably similar. If a drone can’t recognize a given tree, it can’t determine if it’s already explored an area.
The researchers programmed their drones to instead identify multiple trees’ orientations, which is far more distinctive. With this method, when the LIDAR signal returns a cluster of trees, an algorithm calculates the angles and distances between trees to identify that cluster. “Drones can use that as a unique signature to tell if they’ve visited this area before or if it’s a new area,” Tian says.
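One simple way to build such a signature is to use only the pairwise distances between the trees in a cluster, which do not change when the drone sees the cluster from a different position or heading. The sketch below takes that route; it is an assumption made for brevity, since the article says the researchers' algorithm uses both angles and distances, and the function names and the tolerance value are hypothetical.

```python
import math
from itertools import combinations


def cluster_signature(trees):
    """Sorted pairwise distances between trees: unchanged by rotation or shift."""
    return sorted(math.dist(a, b) for a, b in combinations(trees, 2))


def same_cluster(trees_a, trees_b, tolerance=0.05):
    """Compare two clusters by signature, allowing for small sensor noise."""
    sig_a, sig_b = cluster_signature(trees_a), cluster_signature(trees_b)
    return len(sig_a) == len(sig_b) and all(
        math.isclose(da, db, abs_tol=tolerance) for da, db in zip(sig_a, sig_b)
    )


def seen_from_elsewhere(trees, angle, dx, dy):
    """Simulate the same trees observed in a rotated, shifted drone frame."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in trees]


cluster = [(0.0, 0.0), (4.0, 0.0), (1.0, 3.0), (5.0, 4.0)]
revisit = seen_from_elsewhere(cluster, math.radians(40), 12.0, -7.0)
other_cluster = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]

print(same_cluster(cluster, revisit))        # same trees, different viewpoint
print(same_cluster(cluster, other_cluster))  # a different group of trees
```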
This feature-detection technique helps the ground station accurately merge maps. The drones generally explore an area in loops, producing scans as they go. The ground station continuously monitors the scans. When two drones loop around to the same cluster of trees, the ground station merges the maps by calculating the relative transformation between the drones, and then fusing the individual maps to maintain consistent orientations.
“Calculating that relative transformation tells you how you should align the two maps so it corresponds to exactly how the forest looks,” Tian says.
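For intuition, here is a minimal sketch of how a relative transformation can be computed once two maps share matched landmarks. It uses the standard closed-form least-squares alignment for 2-D points (a rotation plus a translation) and assumes the tree-to-tree correspondences are already known. The article does not specify the estimator the researchers use, so treat this as one common option, with hypothetical function names.

```python
import math


def estimate_rigid_transform(points_a, points_b):
    """Find the rotation angle and translation that map points_a onto points_b."""
    n = len(points_a)
    ax = sum(p[0] for p in points_a) / n
    ay = sum(p[1] for p in points_a) / n
    bx = sum(p[0] for p in points_b) / n
    by = sum(p[1] for p in points_b) / n

    cos_sum = sin_sum = 0.0
    for (px, py), (qx, qy) in zip(points_a, points_b):
        px, py, qx, qy = px - ax, py - ay, qx - bx, qy - by  # centre both sets
        cos_sum += px * qx + py * qy
        sin_sum += px * qy - py * qx

    theta = math.atan2(sin_sum, cos_sum)        # best-fit rotation
    c, s = math.cos(theta), math.sin(theta)
    translation = (bx - (c * ax - s * ay), by - (s * ax + c * ay))
    return theta, translation


def apply_transform(theta, translation, point):
    """Rotate a point by theta, then shift it by the translation."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = point
    return (c * x - s * y + translation[0], s * x + c * y + translation[1])


# The same three trees as recorded in drone A's map and in drone B's map.
trees_in_a = [(0.0, 0.0), (4.0, 0.0), (1.0, 3.0)]
trees_in_b = [apply_transform(math.radians(30), (5.0, -2.0), p) for p in trees_in_a]

theta, translation = estimate_rigid_transform(trees_in_a, trees_in_b)
print(round(math.degrees(theta), 3), [round(t, 3) for t in translation])
```

Applying the recovered transformation to every point in drone A's map expresses it in drone B's frame, after which the two maps can be overlaid.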
In the ground station, robotic navigation software called “simultaneous localization and mapping” (SLAM) — which both maps an unknown area and keeps track of an agent inside the area — uses the LIDAR input to localize and capture the position of the drones. This helps it fuse the maps accurately.
The end result is a map with 3-D terrain features. Trees appear as blocks of colored shades of blue to green, depending on height. Unexplored areas are dark but turn gray as they’re mapped by a drone. On-board path-planning software tells a drone to always explore these dark unexplored areas as it flies around. Producing a 3-D map is more reliable than simply attaching a camera to a drone and monitoring the video feed, Tian says. Transmitting video to a central station, for instance, requires a lot of bandwidth that may not be available in forested areas.
More efficient searching
A key innovation is a novel search strategy that lets the drones more efficiently explore an area. According to a more traditional approach, a drone would always search the closest possible unknown area. However, that could be in any number of directions from the drone’s current position. The drone usually flies a short distance, and then stops to select a new direction.
“That doesn’t respect dynamics of drone [movement],” Tian says. “It has to stop and turn, so that means it’s very inefficient in terms of time and energy, and you can’t really pick up speed.”
Instead, the researchers’ drones explore the closest possible area while considering their speed and direction and maintaining a consistent velocity. This strategy — where the drone tends to travel in a spiral pattern — covers a search area much faster. “In search and rescue missions, time is very important,” Tian says.
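The difference between the two strategies can be sketched as two target-selection rules. The cost function below, distance plus a penalty proportional to the required turn, is a hypothetical stand-in chosen for illustration; the article does not give the researchers' actual formulation, and `turn_weight` is an invented tuning parameter.

```python
import math


def nearest_target(position, unexplored):
    """Traditional rule: always head for the closest unexplored spot."""
    return min(unexplored, key=lambda spot: math.dist(position, spot))


def velocity_aware_target(position, velocity, unexplored, turn_weight=2.0):
    """Prefer nearby spots that do not force the drone to stop and turn around."""
    heading = math.atan2(velocity[1], velocity[0])

    def cost(spot):
        bearing = math.atan2(spot[1] - position[1], spot[0] - position[0])
        # Smallest absolute angle between current heading and the new bearing.
        turn = abs(math.atan2(math.sin(bearing - heading),
                              math.cos(bearing - heading)))
        return math.dist(position, spot) + turn_weight * turn

    return min(unexplored, key=cost)


position, velocity = (0.0, 0.0), (1.0, 0.0)   # flying east
unexplored = [(-1.0, 0.0), (2.0, 0.5)]        # one spot behind, one ahead

print(nearest_target(position, unexplored))                   # the spot behind
print(velocity_aware_target(position, velocity, unexplored))  # the spot ahead
```

With the turn penalty in place, the drone keeps moving roughly along its current heading and only gradually curves toward unexplored space, which is consistent with the spiral-like paths described above.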
In the paper, the researchers compared their new search strategy with a traditional method. Compared to that baseline, the researchers’ strategy helped the drones cover significantly more area, several minutes faster and with higher average speeds.
One limitation for practical use is that the drones still must communicate with an off-board ground station for map merging. In their outdoor experiment, the researchers had to set up a wireless router that connected each drone and the ground station. In the future, they hope to design the drones to communicate wirelessly when approaching one another, fuse their maps, and then cut communication when they separate. The ground station, in that case, would only be used to monitor the updated global map.
Children learn language by observing their environment, listening to the people around them, and connecting the dots between what they see and hear. Among other things, this helps children establish their language’s word order, such as where subjects and verbs fall in a sentence.
In computing, learning language is the task of syntactic and semantic parsers. These systems are trained on sentences annotated by humans that describe the structure and meaning behind words. Parsers are becoming increasingly important for web searches, natural-language database querying, and voice-recognition systems such as Alexa and Siri. Soon, they may also be used for home robotics.
But gathering the annotation data can be time-consuming and difficult for less common languages. Additionally, humans don’t always agree on the annotations, and the annotations themselves may not accurately reflect how people naturally speak.
In a paper being presented at this week’s Empirical Methods in Natural Language Processing conference, MIT researchers describe a parser that learns through observation to more closely mimic a child’s language-acquisition process, which could greatly extend the parser’s capabilities. To learn the structure of language, the parser observes captioned videos, with no other information, and associates the words with recorded objects and actions. Given a new sentence, the parser can then use what it’s learned about the structure of the language to accurately predict a sentence’s meaning, without the video.
This “weakly supervised” approach — meaning it requires limited training data — mimics how children can observe the world around them and learn language, without anyone providing direct context. The approach could expand the types of data and reduce the effort needed for training parsers, according to the researchers. A few directly annotated sentences, for instance, could be combined with many captioned videos, which are easier to come by, to improve performance.
In the future, the parser could be used to improve natural interaction between humans and personal robots. A robot equipped with the parser, for instance, could constantly observe its environment to reinforce its understanding of spoken commands, including when the spoken sentences aren’t fully grammatical or clear. “People talk to each other in partial sentences, run-on thoughts, and jumbled language. You want a robot in your home that will adapt to their particular way of speaking … and still figure out what they mean,” says co-author Andrei Barbu, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Center for Brains, Minds, and Machines (CBMM) within MIT’s McGovern Institute.
The parser could also help researchers better understand how young children learn language. “A child has access to redundant, complementary information from different modalities, including hearing parents and siblings talk about the world, as well as tactile information and visual information, [which help him or her] to understand the world,” says co-author Boris Katz, a principal research scientist and head of the InfoLab Group at CSAIL. “It’s an amazing puzzle, to process all this simultaneous sensory input. This work is part of bigger piece to understand how this kind of learning happens in the world.”
Co-authors on the paper are: first author Candace Ross, a graduate student in the Department of Electrical Engineering and Computer Science and CSAIL, and a researcher in CBMM; Yevgeni Berzak PhD ’17, a postdoc in the Computational Psycholinguistics Group in the Department of Brain and Cognitive Sciences; and CSAIL graduate student Battushig Myanganbayar.
Visual learner
For their work, the researchers combined a semantic parser with a computer-vision component trained in object, human, and activity recognition in video. Semantic parsers are generally trained on sentences annotated with code that ascribes meaning to each word and the relationships between the words. Some have been trained on still images or computer simulations.
The new parser is the first to be trained using video, Ross says. In part, videos are more useful in reducing ambiguity. If the parser is unsure about, say, an action or object in a sentence, it can reference the video to clear things up. “There are temporal components — objects interacting with each other and with people — and high-level properties you wouldn’t see in a still image or just in language,” Ross says.
The researchers compiled a dataset of about 400 videos depicting people carrying out a number of actions, including picking up an object or putting it down, and walking toward an object. Participants on the crowdsourcing platform Mechanical Turk then provided 1,200 captions for those videos. The researchers set aside 840 video-caption examples for training and tuning, and used the remaining 360 for testing. One advantage of using vision-based parsing is “you don’t need nearly as much data — although if you had [the data], you could scale up to huge datasets,” Barbu says.
In training, the researchers gave the parser the objective of determining whether a sentence accurately describes a given video. They fed the parser a video and matching caption. The parser extracts possible meanings of the caption as logical mathematical expressions. The sentence, “The woman is picking up an apple,” for instance, may be expressed as: λxy. woman x, pick_up x y, apple y.
Those expressions and the video are fed into a computer-vision algorithm called “Sentence Tracker,” developed by Barbu and other researchers. The algorithm looks at each video frame to track how objects and people change over time, and determines whether the actions play out as described. In this way, it determines whether the meaning could be true of the video.
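To make the idea of checking a logical expression against a scene concrete, here is a minimal Python sketch. It is not the researchers' Sentence Tracker, which works on raw video frames; it assumes a vision system has already reduced the video to a handful of symbolic facts. The encoding of the expression, the entity names (p1, o1, o2), and the function names are all invented for illustration.

```python
from itertools import product

# Hypothetical encoding of "λxy. woman x, pick_up x y, apple y":
# each conjunct is (predicate_name, tuple_of_variables).
MEANING = [("woman", ("x",)), ("pick_up", ("x", "y")), ("apple", ("y",))]

# Toy stand-in for what a vision component might report about one video:
# a set of (predicate_name, tuple_of_entity_ids) facts. Invented for illustration.
SCENE = {
    ("woman", ("p1",)),
    ("apple", ("o1",)),
    ("chair", ("o2",)),
    ("pick_up", ("p1", "o1")),
}

def bindings_satisfying(meaning, scene):
    """Return every variable assignment under which all conjuncts hold in the scene."""
    variables = sorted({v for _, args in meaning for v in args})
    entities = sorted({e for _, args in scene for e in args})
    results = []
    # Brute force: try every way of mapping variables to entities.
    for combo in product(entities, repeat=len(variables)):
        env = dict(zip(variables, combo))
        if all((pred, tuple(env[v] for v in args)) in scene for pred, args in meaning):
            results.append(env)
    return results

def is_true_of(meaning, scene):
    """A meaning is possibly true of the scene if at least one binding satisfies it."""
    return bool(bindings_satisfying(meaning, scene))

print(is_true_of(MEANING, SCENE))
print(bindings_satisfying(MEANING, SCENE))
```

In this toy scene the expression holds only when x is bound to the woman and y to the apple. A real system would have to cope with uncertain detections and with actions that unfold across many frames, which this sketch ignores.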
Connecting the dots
The expression with the most closely matching representations for objects, humans, and actions becomes the most likely meaning of the caption. The expression, initially, may refer to many different objects and actions in the video, but the set of possible meanings serves as a training signal that helps the parser continuously winnow down possibilities. “By assuming that all of the sentences must follow the same rules, that they all come from the same language, and seeing many captioned videos, you can narrow down the meanings further,” Barbu says.
In short, the parser learns through passive observation: To determine if a caption is true of a video, the parser by necessity must identify the highest probability meaning of the caption. “The only way to figure out if the sentence is true of a video [is] to go through this intermediate step of, ‘What does the sentence mean?’ Otherwise, you have no idea how to connect the two,” Barbu explains. “We don’t give the system the meaning for the sentence. We say, ‘There’s a sentence and a video. The sentence has to be true of the video. Figure out some intermediate representation that makes it true of the video.’”
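The selection step described here can be pictured with a small Python sketch. It assumes the parser has proposed several candidate meanings for one caption, and it scores each by the fraction of its conjuncts that a toy symbolic scene supports. That scoring rule, the data, and the function names are illustrative assumptions; the actual system derives its scores from the vision component running on video.

```python
from itertools import product

def match_score(meaning, scene):
    """Best fraction of conjuncts supported by the scene over all variable bindings."""
    variables = sorted({v for _, args in meaning for v in args})
    entities = sorted({e for _, args in scene for e in args})
    best = 0.0
    for combo in product(entities, repeat=len(variables)):
        env = dict(zip(variables, combo))
        hits = sum((pred, tuple(env[v] for v in args)) in scene for pred, args in meaning)
        best = max(best, hits / len(meaning))
    return best

def most_likely_meaning(candidates, scene):
    """Pick the candidate meaning that the scene supports best."""
    return max(candidates, key=lambda m: match_score(m, scene))

# Toy scene facts (invented): a woman p1 picks up an apple o1.
SCENE = {("woman", ("p1",)), ("apple", ("o1",)), ("pick_up", ("p1", "o1"))}

# Hypothetical candidate meanings for "The woman is picking up an apple".
CANDIDATES = [
    [("woman", ("x",)), ("put_down", ("x", "y")), ("apple", ("y",))],  # wrong verb
    [("woman", ("x",)), ("pick_up", ("x", "y")), ("apple", ("y",))],   # intended reading
    [("apple", ("x",)), ("pick_up", ("x", "y")), ("woman", ("y",))],   # roles swapped
]

best = most_likely_meaning(CANDIDATES, SCENE)
print(best, match_score(best, SCENE))
```

Only the intended reading is fully supported by the scene, so it is selected. During training, a choice like this would act as the weak supervision signal: no one labels the caption's meaning directly, but the video constrains which meanings are plausible.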
The training produces a syntactic and semantic grammar for the words the parser has learned. Given a new sentence, the parser no longer requires videos; instead, it uses its grammar and lexicon to determine the sentence’s structure and meaning.
Ultimately, this process is learning “as if you’re a kid,” Barbu says. “You see [the] world around you and hear people speaking to learn meaning. One day, I can give you a sentence and ask what it means and, even without a visual, you know the meaning.”
“This research is exactly the right direction for natural language processing,” says Stefanie Tellex, a professor of computer science at Brown University who focuses on helping robots use natural language to communicate with humans. “To interpret grounded language, we need semantic representations, but it is not practicable to make it available at training time. Instead, this work captures representations of compositional structure using context from captioned videos. This is the paper I have been waiting for!”
In future work, the researchers are interested in modeling interactions, not just passive observations. “Children interact with the environment as they’re learning. Our idea is to have a model that would also use perception to learn,” Ross says.
This work was supported, in part, by the CBMM, the National Science Foundation, a Ford Foundation Graduate Research Fellowship, the Toyota Research Institute, and the MIT-IBM Brain-Inspired Multimedia Comprehension project.
In this episode of Robots in Depth, Per Sjöborg speaks with Stefano Stramigioli about the Robotics and Mechatronics lab he leads at the University of Twente. The lab focuses on inspection and maintenance robotics, as well as medical applications.
Stefano got into robotics when he saw the robots in Star Wars, and started out building a robotic arm from scratch, including doing his own PCBs.
He also tells us about the robotic peregrine falcon developed in the lab, which has been spun out into what is now a successful company.