Archive 18.11.2019


Very Narrow Aisle (VNA) Inventory Counts using Drones in Warehouses & Distribution Centers

This, in turn, is driving inventory stakeholders to move their top-tier sites from majority bulk storage to majority racking, and from traditional aisles to very narrow aisles (VNAs). Rack heights have steadily increased from 25 feet, on average, to 35 feet or more.

#297: Using Natural Language in Human-Robot Collaboration, with Brad Hayes


In this episode, we hear from Brad Hayes, Assistant Professor of Computer Science at the University of Colorado Boulder, who directs the university’s Collaborative AI and Robotics lab. The lab’s work focuses on developing systems that can learn from and work with humans—from physical robots or machines, to software systems or decision support tools—so that together, the human and system can achieve more than each could achieve on their own.

Our interviewer Audrow caught up with Dr. Hayes to discuss why collaboration may at times be preferable to full autonomy and automation, how human narration can be used to help robots learn from demonstration, and the challenges of developing collaborative systems, including the importance of shared models and safety to allow adoption of such technologies in the future.

#IROS2019 videos and exhibit floor

The 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (#IROS2019) is being held in Macau this week. The theme this year is “robots connecting people”.

For those who can’t make it in person, or can’t possibly see everything, IROS is launching IROS TV. The first two episodes are below, but you can have a look at the 22 videos already produced here.

And here’s a quick tour of the exhibit floor.

Finally, follow #IROS2019 or @IROS2019MACAU on Twitter.

Sabine (@sabinehauert) from Robohub will be on site. Please share your IROS stories and latest publications with her.

RoboBee powered by soft muscles

The Wyss Institute and SEAS robotics team built several models of the soft-actuator-powered RoboBee. Shown here are a four-wing, two-actuator model and an eight-wing, four-actuator model, the latter being the first soft-actuator-powered flying microrobot capable of controlled hovering flight. Credit: Harvard Microrobotics Lab/Harvard SEAS
By Leah Burrows

The sight of a RoboBee careening towards a wall or crashing into a glass box may have once triggered panic in the researchers in the Harvard Microrobotics Laboratory at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS), but no more.

Researchers at SEAS and Harvard’s Wyss Institute for Biologically Inspired Engineering have developed a resilient RoboBee powered by soft artificial muscles that can crash into walls, fall onto the floor, and collide with other RoboBees without being damaged. It is the first microrobot powered by soft actuators to achieve controlled flight.

“There has been a big push in the field of microrobotics to make mobile robots out of soft actuators because they are so resilient,” said Yufeng Chen, Ph.D., a former graduate student and postdoctoral fellow at SEAS and first author of the paper. “However, many people in the field have been skeptical that they could be used for flying robots because the power density of those actuators simply hasn’t been high enough and they are notoriously difficult to control. Our actuator has high enough power density and controllability to achieve hovering flight.”

The research is published in Nature.

To solve the problem of power density, the researchers built upon the electrically-driven soft actuators developed in the lab of David Clarke, Ph.D., the Extended Tarr Family Professor of Materials at SEAS. These soft actuators are made using dielectric elastomers, soft materials with good insulating properties that deform when an electric field is applied.

By improving the electrode conductivity, the researchers were able to operate the actuator at 500 Hertz, on par with the rigid actuators used previously in similar robots.

Another challenge when dealing with soft actuators is that the system tends to buckle and become unstable. To solve this challenge, the researchers built a lightweight airframe with a piece of vertical constraining thread to prevent the actuator from buckling.

The soft actuators can be easily assembled and replaced in these small-scale robots. To demonstrate various flight capabilities, the researchers built several different models of the soft actuator-powered RoboBee. A two-wing model could take off from the ground but had no additional control. A four-wing, two-actuator model could fly in a cluttered environment, overcoming multiple collisions in a single flight.

“One advantage of small-scale, low-mass robots is their resilience to external impacts,” said Elizabeth Farrell Helbling, Ph.D., a former graduate student at SEAS and a coauthor on the paper. “The soft actuator provides an additional benefit because it can absorb impact better than traditional actuation strategies. This would come in handy in potential applications such as flying through rubble for search and rescue missions.”

An eight-wing, four-actuator model demonstrated controlled hovering flight, the first for a soft-actuator-powered flying microrobot.

Next, the researchers aim to increase the efficiency of the soft-powered robot, which still lags far behind more traditional flying robots.

“Soft actuators with muscle-like properties and electrical activation represent a grand challenge in robotics,” says Wyss Institute Core Faculty member Robert Wood, Ph.D., who also is the Charles River Professor of Engineering and Applied Sciences in SEAS and senior author of the paper. “If we could engineer high performance artificial muscles, the sky is the limit for what robots we could build.”

Harvard’s Office of Technology Development has protected the intellectual property relating to this project and is exploring commercialization opportunities.

This paper was also co-authored by Huichan Zhao, Jie Mao, Pakpong Chirarattananon, and Nak-seung Patrick Hyun. It was supported in part by the National Science Foundation.

Look then listen: Pre-learning environment representations for data-efficient neural instruction following

By David Gaddy

When learning to follow natural language instructions, neural networks tend to be very data hungry – they require a huge number of examples pairing language with actions in order to learn effectively. This post is about reducing those heavy data requirements by first watching actions in the environment before moving on to learning from language data. Inspired by the idea that it is easier to map language to meanings that have already been formed, we introduce a semi-supervised approach that aims to separate the formation of abstractions from the learning of language.

Empirically, we find that pre-learning of patterns in the environment can help us learn grounded language with much less data.

Before we dive into the details, let’s look at an example to see why neural networks struggle to learn from smaller amounts of data. For now, we’ll use examples from the SHRDLURN block stacking task, but later we’ll look at results on another environment.

Let’s put ourselves in the shoes of a model that is learning to follow instructions. Suppose we are given the single training example below, which pairs a language command with an action in the environment:

This example tells us that if we are in state (a) and are trying to follow the instruction (b), the correct output for our model is the state (c). Before learning, the model doesn’t know anything about language, so it must rely on examples like the one shown to figure out the meaning of the words. After learning, we will be given new environment states and new instructions, and the model’s job is to choose the output states that result from correctly executing the instructions. First let’s consider a simple case where we get the exact same language, but the environment state is different, like the one shown here:

On this new state, the model has many different possible outputs that it could consider. Here are just a few:

Some of these outputs seem reasonable to a human, like stacking red blocks on orange blocks or stacking red blocks on the left, but others are kind of strange, like generating a completely unrelated configuration of blocks. To a neural network with no prior knowledge, however, all of these options look plausible.

A human learning a new language might approach this task by reasoning about possible meanings of the language that are consistent with the given example and choosing states that correspond to those meanings. The set of possible meanings to consider comes from prior knowledge about what types of things might happen in an environment and how we can talk about them. In this context, a meaning is an abstract transformation that we can apply to states to get new states. For example, if someone saw the training instance above paired with language they didn’t understand, they might focus on two possible meanings for the instruction: it could be telling us to stack red blocks on orange blocks, or it could be telling us to stack a red block on the leftmost position.

Although we don’t know which of these two options is correct – both are plausible given the evidence – we now have many fewer options and might easily distinguish between them with just one or two more related examples. Having a set of pre-formed meanings makes learning easier because the meanings constrain the space of possible outputs that must be considered.
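The elimination process described above can be sketched in a few lines of Python. This is only an illustration, not the method from the post: the state encoding (a list of stacks, each listed bottom to top), the two hand-written candidate meanings, and the example states are all assumptions made up for this sketch.

```python
from typing import List

State = List[List[str]]  # stacks from left to right, each listed bottom to top


def stack_red_on_orange(state: State) -> State:
    """Candidate meaning 1: add a red block to every stack topped by orange."""
    return [s + ["red"] if s and s[-1] == "orange" else list(s) for s in state]


def stack_red_on_leftmost(state: State) -> State:
    """Candidate meaning 2: add a red block to the leftmost stack."""
    return [list(s) + ["red"] if i == 0 else list(s) for i, s in enumerate(state)]


# Hypothetical set of pre-formed meanings (abstract state transformations).
CANDIDATES = {
    "red on orange": stack_red_on_orange,
    "red on leftmost": stack_red_on_leftmost,
}


def consistent_meanings(examples):
    """Names of candidate meanings that reproduce every (before, after) pair."""
    return [name for name, fn in CANDIDATES.items()
            if all(fn(before) == after for before, after in examples)]


# First example: the only orange-topped stack is also the leftmost one.
ex1 = ([["orange"], ["blue"]], [["orange", "red"], ["blue"]])
# Second example: the two candidate meanings now predict different outputs.
ex2 = ([["blue"], ["orange"]], [["blue"], ["orange", "red"]])

print(consistent_meanings([ex1]))       # both candidates remain
print(consistent_meanings([ex1, ex2]))  # only "red on orange" remains
```

With a small candidate set, one extra example is enough to pick a single meaning, whereas a learner with no candidates has to consider every possible output state.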

In fact, pre-formed meanings do even more than just restricting the number of choices, because once we have chosen a meaning to pair with the language, it specifies the correct way to generalize across a wide variety of different initial environment states. For example, consider the following transitions:

If we know in advance that all of these transitions belong together in a single semantic group (adding a red block on the left), learning language becomes easier because we can map to the group instead of the individual transitions. An end-to-end network that doesn’t start with any grouping of transitions has a much harder time because it has to learn the correct way to generalize across initial states. One approach used by a long line of past work has been to provide the learner with a manually defined set of abstractions called logical forms. In contrast, we take a more data-driven approach where we learn abstractions from unsupervised (language-free) data instead.

In this work, we help a neural network learn language with fewer examples by first learning abstractions from language-free observations of actions in an environment. The idea here is that if the model sees lots of actions happening in an environment, perhaps it can pick up on patterns in what tends to be done, and these patterns might give hints at what abstractions are useful. Our pre-learned abstractions can make language learning easier by constraining the space of outputs we need to consider and guiding generalization across different environment states.

We break up learning into two phases: an environment learning phase where our agent builds abstractions from language-free observation of the environment, and a language learning phase where natural language instructions are mapped to the pre-learned abstractions. The motivation for this setup is that language-free observations of the environment are often easier to get than interactions paired with language, so we should use the cheaper unlabeled data to help us learn with less language data. For example, a virtual assistant could learn with data from regular smartphone use, or in the longer term robots might be able to learn by watching humans naturally interact with the world. In the environments we are using in this post, we don’t have a natural source of unlabeled observations, so we generate the environment data synthetically.

Method

Now we’re ready to dive into our method. We’ll start with the environment learning phase, where we will learn abstractions by observing an agent, such as a human, acting in the environment. Our approach during this phase will be to create a type of autoencoder of the state transitions (actions) that we see, shown below:

The encoder takes in the states before and after the transition and computes a representation of the transition itself. The decoder takes that transition representation from the encoder and must use it to recreate the final state from the initial one. The encoder and decoder architectures will be task specific, but use generic components such as convolutions or LSTMs. For example, in the block stacking task states are represented as a grid and we use a convolutional architecture. We train using a standard cross-entropy loss on the decoder’s output state, and after training we will use the representation passed between the encoder and decoder as our learned abstraction.
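As a rough illustration of the training signal, the cross-entropy loss on the decoder’s output state can be computed cell by cell. The sketch below assumes the decoder emits an independent probability distribution over colors for each grid cell; that factorization, the type names, and the tiny 1x2 grid are assumptions for illustration, not details given in the post.

```python
import math
from typing import List

Grid = List[List[int]]                # each cell holds a color index (0 = empty)
CellProbs = List[List[List[float]]]   # per-cell distribution over color indices


def cross_entropy(pred: CellProbs, target: Grid) -> float:
    """Mean negative log-probability assigned to the true color of each cell."""
    losses = [-math.log(pred[r][c][target[r][c]])
              for r in range(len(target))
              for c in range(len(target[0]))]
    return sum(losses) / len(losses)


# Hypothetical final state: a 1x2 grid with three possible colors per cell.
target = [[2, 0]]
confident = [[[0.05, 0.05, 0.90], [0.90, 0.05, 0.05]]]
uniform = [[[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]]

print(cross_entropy(confident, target))  # small: most mass on the true colors
print(cross_entropy(uniform, target))    # log(3): an uninformed guess
```

A decoder that reconstructs the final state well drives this value toward zero, which it can only do if the encoder’s representation carries enough information about the transition.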

One thing that this autoencoder will learn is which types of transitions tend to happen, because the model will learn to only output transitions like the ones it sees during training. In addition, this model will learn to group different transitions. This grouping happens because the representation between the encoder and decoder acts as an information bottleneck, and its limited capacity forces the model to reuse the same representation vector for multiple different transitions. We find that the groupings it chooses are often semantically meaningful, because representations that align with the semantics of the environment tend to be the most compact.

After environment learning pre-training, we are ready to move on to learning language. For the language learning phase, we will start with the decoder that we pre-trained during environment learning (“action decoder” in the figures above and below). The decoder maps from our learned representation space to particular state outputs. To learn language, we now just need to introduce a language encoder module that maps from language into the representation space and train it by backpropagating through the decoder. The model structure is shown in the figure below.
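To make the idea of training a language encoder by backpropagating through the decoder concrete, here is a deliberately tiny sketch with hand-derived gradients. Everything in it is a stand-in: the decoder is a fixed linear map with made-up weights, the language encoder is an average of word embeddings, and the decoder is kept frozen, which is an assumption of this sketch rather than something the post states.

```python
import math

# Hypothetical pre-trained decoder: a fixed linear map from a 2-d representation
# to scores over 3 candidate output states. These weights are never updated.
DECODER_W = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]


def decode(z):
    """Softmax over candidate output states given representation z."""
    scores = [sum(w * x for w, x in zip(row, z)) for row in DECODER_W]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode(embeddings, words):
    """Language encoder stand-in: the average of trainable word embeddings."""
    dim = len(DECODER_W[0])
    return [sum(embeddings[w][d] for w in words) / len(words) for d in range(dim)]


def train_step(embeddings, words, target, lr=0.5):
    """One gradient step on the word embeddings; returns the loss before the step."""
    z = encode(embeddings, words)
    probs = decode(z)
    loss = -math.log(probs[target])
    # d(loss)/d(scores) = probs - one_hot(target); chain through the linear decoder.
    dscores = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    dz = [sum(dscores[k] * DECODER_W[k][d] for k in range(len(DECODER_W)))
          for d in range(len(z))]
    for w in words:  # each distinct word contributes 1/len(words) of z
        for d in range(len(z)):
            embeddings[w][d] -= lr * dz[d] / len(words)
    return loss


embeddings = {"add": [0.0, 0.0], "red": [0.0, 0.0]}
losses = [train_step(embeddings, ["add", "red"], target=1) for _ in range(20)]
print(losses[0], losses[-1])  # the loss falls as the encoder learns to reach state 1
```

Only the embeddings change during training, so whatever the decoder learned about the environment stays in place and the language side just has to find the right point in the representation space.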

The model in this phase looks a lot like other encoder-decoder models used previously for instruction following tasks, but now the pre-trained decoder can constrain the output and help control generalization.

Results

Now let’s look at some results. We’ll compare our method to an end-to-end neural model, which has an identical neural architecture to our ultimate language learning model but without any environment learning pre-training of the decoder. First we test on the SHRDLURN block stacking task, a task that is especially challenging for neural models because it requires learning with just tens of examples. A baseline neural model gets an accuracy of 18% on the task, but with our environment learning pre-training, the model reaches 28%, an improvement of ten absolute percentage points.
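For readers who want the gain in relative terms as well, the two accuracies quoted above can be compared directly. The variable names are just for this snippet.

```python
baseline, pretrained = 0.18, 0.28  # accuracies quoted in the text

absolute_gain = pretrained - baseline      # difference in accuracy
relative_gain = absolute_gain / baseline   # gain as a fraction of the baseline

print(f"{absolute_gain * 100:.0f} points absolute, {relative_gain:.0%} relative")
```

So the ten-point absolute improvement corresponds to roughly a 56% relative improvement over the baseline.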

We also tested our method on a string manipulation task where we learn to execute instructions like “insert the letters vw after every vowel” on a string of characters. The chart below shows accuracy as we vary the amount of data for both the baseline end-to-end model and the model with our pre-training procedure.
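For a sense of what the model has to learn, the instruction quoted above is a one-line program when written by hand. The function name and the choice of "aeiou" as the vowel set are assumptions for this sketch; the model in the post has to infer the same behavior from instruction and string pairs alone.

```python
VOWELS = set("aeiou")


def insert_after_vowels(text: str, insertion: str = "vw") -> str:
    """Insert `insertion` after every vowel in `text`."""
    return "".join(ch + insertion if ch.lower() in VOWELS else ch for ch in text)


print(insert_after_vowels("robot"))  # rovwbovwt
```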

As shown above, using our pre-training method leads to much more data-efficient language learning compared to learning from scratch. By pre-learning abstractions from the environment, our method increases data efficiency by more than an order of magnitude. To learn more about our method, including some additional performance-improving tricks and an analysis of what pre-training learns, check out our paper from ACL 2019: https://arxiv.org/abs/1907.09671.

This article was initially published on the BAIR blog, and appears here with the authors’ permission.
