
This drone can play dodgeball – and win

By Nicola Nosengo

Drones can do many things, but avoiding obstacles is not their strongest suit yet – especially when they move quickly. Although many flying robots are equipped with cameras that can detect obstacles, it typically takes from 20 to 40 milliseconds for the drone to process the image and react. It may seem quick, but it is not enough to avoid a bird or another drone, or even a static obstacle when the drone itself is flying at high speed. This can be a problem when drones are used in unpredictable environments, or when there are many of them flying in the same area.

Reaction of a few milliseconds
To solve this problem, researchers at the University of Zurich have equipped a quadcopter (a drone with four propellers) with special cameras and algorithms that reduce its reaction time to a few milliseconds – enough to avoid a ball thrown at it from a short distance. The results, published in Science Robotics, can make drones more effective in situations such as the aftermath of a natural disaster. The work was funded by the Swiss National Science Foundation through the National Center of Competence in Research (NCCR) Robotics.

“For search and rescue applications, such as after an earthquake, time is very critical, so we need drones that can navigate as fast as possible in order to accomplish more within their limited battery life”, explains Davide Scaramuzza, who leads the Robotics and Perception Group at the University of Zurich as well as the NCCR Robotics Search and Rescue Grand Challenge. “However, by navigating fast, drones are also more exposed to the risk of colliding with obstacles, and even more so if these are moving. We realised that a novel type of camera, called an event camera, is a perfect fit for this purpose”.

Event cameras have smart pixels
Traditional video cameras, such as the ones found in every smartphone, work by regularly taking snapshots of the whole scene. This is done by exposing the pixels of the image all at the same time. This way, though, a moving object can only be detected after all the pixels have been analysed by the on-board computer. Event cameras, on the other hand, have smart pixels that work independently of each other. The pixels that detect no changes remain silent, while the ones that see a change in light intensity immediately send out the information. This means that only a tiny fraction of all the pixels of the image needs to be processed by the on-board computer, which speeds up the computation considerably.
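The idea of pixels that stay silent unless something changes can be illustrated with a small sketch. It is only a frame-based emulation: a real event camera works asynchronously and never produces full frames. The function name, the log-intensity threshold and the toy 3×3 images are assumptions made for illustration, not details taken from the study.

```python
import math

def events_between(prev_frame, next_frame, threshold=0.2):
    """Emulate event generation: report only pixels whose log-intensity changed.

    Returns a list of (row, col, polarity) tuples, with polarity +1 for a
    brighter pixel and -1 for a darker one. The threshold is an assumed value.
    """
    events = []
    for r, (prev_row, next_row) in enumerate(zip(prev_frame, next_frame)):
        for c, (p, n) in enumerate(zip(prev_row, next_row)):
            # Small offset avoids log(0) for completely dark pixels.
            delta = math.log(n + 1e-6) - math.log(p + 1e-6)
            if abs(delta) >= threshold:
                events.append((r, c, 1 if delta > 0 else -1))
    return events

# Toy scene: only two of the nine pixels change between the two snapshots.
prev_frame = [[0.5, 0.5, 0.5],
              [0.5, 0.5, 0.5],
              [0.5, 0.5, 0.5]]
next_frame = [[0.5, 0.5, 0.5],
              [0.5, 0.9, 0.5],
              [0.5, 0.5, 0.2]]

events = events_between(prev_frame, next_frame)
print(events)
```

Only the two changed pixels produce output, so the on-board computer has two events to process instead of nine pixel values.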

Event cameras are a recent innovation, and existing object-detection algorithms for drones do not work well with them. So the researchers had to invent their own algorithms, which collect all the events recorded by the camera over a very short time and then subtract the effect of the drone’s own movement – which typically accounts for most of the changes in what the camera sees.

Only 3.5 milliseconds to detect incoming objects
Scaramuzza and his team first tested the cameras and algorithms alone. They threw objects of various shapes and sizes towards the camera, and measured how effective the algorithm was at detecting them. The success rate varied between 81 and 97 per cent, depending on the size of the object and the distance of the throw, and the system took only 3.5 milliseconds to detect incoming objects.

Then the most serious test began: putting cameras on an actual drone, flying it both indoors and outdoors and throwing objects directly at it. The drone was able to avoid the objects – including a ball thrown from a 3-meter distance and travelling at 10 meters per second – more than 90 per cent of the time. When the drone “knew” the size of the object in advance, one camera was enough. When, instead, it had to face objects of varying size, two cameras were used to give it stereoscopic vision.
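A quick back-of-the-envelope calculation shows why the latency figures matter. The sketch below assumes, for simplicity, that the ball flies straight at the drone at a constant 10 meters per second; the function name is hypothetical, and the latencies are the ones quoted in the article.

```python
def distance_during_latency(speed_m_per_s, latency_ms):
    """Distance (in meters) an object covers while the drone is still 'thinking'."""
    return speed_m_per_s * latency_ms / 1000.0

ball_speed = 10.0      # meters per second, from the experiment
throw_distance = 3.0   # meters, from the experiment

# Total time available before impact, assuming constant speed.
time_to_impact_ms = throw_distance / ball_speed * 1000.0

# Event-based detection (3.5 ms) versus conventional processing (20 to 40 ms).
travelled = {
    latency_ms: distance_during_latency(ball_speed, latency_ms)
    for latency_ms in (3.5, 20.0, 40.0)
}

print(f"time to impact: {time_to_impact_ms:.0f} ms")
for latency_ms, meters in travelled.items():
    print(f"latency {latency_ms} ms -> ball travels {meters * 100:.1f} cm")
```

Under these assumptions the ball covers only a few centimeters during a 3.5-millisecond detection, but 20 to 40 centimeters during a conventional 20-to-40-millisecond processing delay, out of a total flight of about 300 milliseconds.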

According to Scaramuzza, these results show that event cameras can increase the speed at which drones can navigate by up to ten times, thus expanding their possible applications. “One day drones will be used for a large variety of applications, such as delivery of goods, transportation of people, aerial filmography and, of course, search and rescue,” he says. “But enabling robots to perceive and make decisions faster can be a game changer also for other domains where reliably detecting incoming obstacles plays a crucial role, such as automotive, goods delivery, transportation, mining, and remote inspection with robots”.

Nearly as reliable as human pilots
In the future, the team aims to test this system on an even more agile quadrotor. “Our ultimate goal is to one day make autonomous drones navigate as well as human drone pilots. Currently, in all search and rescue applications where drones are involved, the human is actually in control. If we could have autonomous drones navigate as reliably as human pilots we would then be able to use them for missions that fall beyond line of sight or beyond the reach of the remote control”, says Davide Falanga, the PhD student who is the primary author of the article.

Does on-policy data collection fix errors in off-policy reinforcement learning?

Reinforcement learning has seen a great deal of success in solving complex decision-making problems ranging from robotics to games to supply chain management to recommender systems. Despite their success, deep reinforcement learning algorithms can be exceptionally difficult to use, due to unstable training, sensitivity to hyperparameters, and generally unpredictable and poorly understood convergence properties. Multiple explanations, and corresponding solutions, have been proposed for improving the stability of such methods, and we have seen good progress over the last few years on these algorithms. In this blog post, we will dive deep into analyzing a central and underexplored reason behind some of the problems with the class of deep RL algorithms based on dynamic programming, which encompasses the popular DQN and soft actor-critic (SAC) algorithms – the detrimental connection between data distributions and learned models.

Before diving deep into a description of this problem, let us quickly recap some of the main concepts in dynamic programming. Algorithms that apply dynamic programming in conjunction with function approximation are generally referred to as approximate dynamic programming (ADP) methods. ADP algorithms include some of the most popular, state-of-the-art RL methods such as variants of deep Q-networks (DQN) and soft actor-critic (SAC) algorithms. ADP methods based on Q-learning train action-value functions, $Q(s, a)$, via a Bellman backup. In practice, this corresponds to training a parametric function, $Q_\theta(s, a)$, by minimizing the mean squared difference to a backup estimate of the Q-function, defined as:

$\mathcal{B}^*Q(s, a) = r(s, a) + \gamma \mathbb{E}_{s'|s, a} [\max_{a'} \bar{Q}(s', a')],$

where $\bar{Q}$ denotes a previous instance of the original Q-function, $Q_\theta$, and is commonly referred to as a target network. This update is summarized in the equation below.

$\min_\theta \; \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[\left(Q_\theta(s, a) - \mathcal{B}^*Q(s, a)\right)^2\right]$

An analogous update is also used for actor-critic methods that maintain an explicitly parametrized policy, $\pi_\phi(a|s)$, alongside a Q-function. Such an update typically replaces $\max_{a'}$ with an expectation under the policy, $\mathbb{E}_{a' \sim \pi_\phi}$. We shall use the $\max_{a'}$ version for consistency throughout; the actor-critic version follows analogously. These ADP methods aim to learn the optimal value function, $Q^*$, by applying the Bellman backup iteratively until convergence.
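To make the backup concrete, here is a minimal sketch of how a target value and the corresponding squared error could be computed for a single deterministic transition. The function names and the numbers are illustrative assumptions, and plain Python lists stand in for the target network $\bar{Q}$.

```python
def bellman_target(reward, next_q_values, gamma):
    """Q-learning target: r(s, a) + gamma * max_a' Q_bar(s', a')."""
    return reward + gamma * max(next_q_values)

def expected_target(reward, next_q_values, policy_probs, gamma):
    """Actor-critic variant: the max is replaced by an expectation under the policy."""
    expectation = sum(p * q for p, q in zip(policy_probs, next_q_values))
    return reward + gamma * expectation

def squared_bellman_error(q_value, target):
    """Squared difference between the current estimate Q_theta(s, a) and its target."""
    return (q_value - target) ** 2

# Hypothetical target-network values Q_bar(s', a') for two actions at the next state.
q_bar_next = [1.0, 2.0]

max_target = bellman_target(reward=0.5, next_q_values=q_bar_next, gamma=0.9)
soft_target = expected_target(reward=0.5, next_q_values=q_bar_next,
                              policy_probs=[0.5, 0.5], gamma=0.9)
loss = squared_bellman_error(q_value=2.0, target=max_target)

print(max_target, soft_target, loss)
```

In a deep RL implementation the same quantities would be computed in batches over samples from $\mathcal{D}$ and minimized by gradient descent on $\theta$; the sketch only shows the arithmetic for one sample.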

A central factor that affects the performance of ADP algorithms is the choice of the training data-distribution, $\mathcal{D}$, as shown in the equation above. The choice of $\mathcal{D}$ is an integral component of the backup, and it affects solutions obtained via ADP methods, especially since function approximation is involved. Unlike tabular settings, function approximation causes the learned Q function to depend on the choice of data distribution $\mathcal{D}$, thereby affecting the dynamics of the learning process. We show that on-policy exploration induces distributions $\mathcal{D}$ such that training Q-functions under $\mathcal{D}$ may fail to correct systematic errors in the Q-function, even if Bellman error is minimized as much as possible – a phenomenon that we refer to as an absence of corrective feedback.

Corrective Feedback and Why it is Absent in ADP

What is corrective feedback formally? How do we determine if it is present or absent in ADP methods? In order to build intuition, we first present a simple contextual bandit (one-step RL) example, where the Q-function is trained to match $Q^*$ via supervised updates, without bootstrapping. This enjoys corrective feedback, and we then contrast it with ADP methods, which do not. In this example, the goal is to learn the optimal value function $Q^*(s, a)$, which is equal to the reward $r(s, a)$. At iteration $k$, the algorithm minimizes the estimation error of the Q-function:

$\mathcal{L}(Q) = \mathbb{E}_{s \sim \beta(s), a \sim \pi_k(a|s)}[|Q_k(s, a) - Q^*(s, a)|].$

Using an $\varepsilon$-greedy or Boltzmann policy for exploration, denoted by $\pi_k$, gives rise to a hard negative mining phenomenon – the policy chooses precisely those actions that correspond to possibly over-estimated Q-values for each state $s$ and observes the corresponding $r(s, a)$, or equivalently $Q^*(s, a)$, as a result. Then, minimizing $\mathcal{L}(Q)$ on samples collected this way corrects errors in the Q-function, as $Q_k(s, a)$ is pushed closer to match $Q^*(s, a)$ for actions $a$ with incorrectly high Q-values, correcting precisely the Q-values which may cause sub-optimal performance. This constructive interaction between online data collection and error correction – where the induced online data distribution corrects errors in the value function – is what we refer to as corrective feedback.
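The bandit argument can be traced in a few lines of code. The sketch below is an illustration under simplifying assumptions, not the experiment from the paper: a single state, three actions, a purely greedy policy (the $\varepsilon = 0$ case, chosen so that the run is deterministic), and a simple step towards the observed reward in place of an exact minimization of $\mathcal{L}(Q)$. The reward values and initial estimates are made up.

```python
def greedy_action(q_values):
    """Index of the action with the highest current Q-value estimate."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def run_bandit(q_values, true_rewards, steps, learning_rate=0.5):
    """Collect data with the greedy policy and regress Q towards observed rewards."""
    q = list(q_values)
    chosen = []
    for _ in range(steps):
        a = greedy_action(q)
        # In a bandit, the observed reward equals Q*(s, a), so this is a supervised update.
        q[a] += learning_rate * (true_rewards[a] - q[a])
        chosen.append(a)
    return q, chosen

true_rewards = [1.0, 0.0, 0.5]   # Q*(s, a) for the three actions (assumed values)
initial_q = [0.2, 2.0, 0.1]      # action 1 is badly over-estimated

final_q, chosen = run_bandit(initial_q, true_rewards, steps=20)

print("actions chosen:", chosen)
print("final estimates:", final_q)
```

The over-estimated action is selected first, its value is pulled down towards the true reward, and once it falls below the estimate of the best action the policy switches to that action for good. The data collected by the policy is exactly the data needed to fix the error that matters.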

In contrast, we will demonstrate that ADP methods, which rely on previous Q-functions to generate targets for training the current Q-function, may not benefit from corrective feedback. This difference between bandits and ADP arises because the target values are computed by applying a Bellman backup to the previous Q-function, $\bar{Q}$ (the target network), rather than to the optimal $Q^*$, so errors in $\bar{Q}$ at the next states can result in incorrect Q-value targets at the current state. No matter how often the current transition is observed, or how accurately Bellman errors are minimized, the error in the Q-value with respect to the optimal Q-function, $|Q - Q^*|$, at this state is not reduced. Furthermore, in order to obtain correct target values, we need to ensure that values at state-action pairs occurring at the tail ends of the data distribution $\mathcal{D}$, which are primary causes of errors in Q-values at other states, are correct. However, as we will show via a simple didactic example, this correction process may be extremely slow and may not occur at all, mainly because of undesirable generalization effects of the function approximator.

Let’s consider a didactic example of a tree-structured deterministic MDP with 7 states and 2 actions, $a_1$ and $a_2$, at each state.




Figure 1: Run of an ADP algorithm with on-policy data collection. Boxed nodes and circled nodes denote groups of states aliased by function approximation — values of these nodes are affected due to parameter sharing and function approximation.

A run of an ADP algorithm that chooses the current on-policy state-action marginal as $\mathcal{D}$ on this tree MDP is shown in Figure 1. Thus, the Bellman error at a state is minimized in proportion to the frequency of occurrence of that state in the policy state-action marginal. Since the leaf node states are the least frequent in this on-policy marginal distribution (due to the discounting), the Bellman backup is unable to correct errors in Q-values at such leaf nodes, both because of their low frequency and because function approximation aliases them with other states. Using incorrect Q-values at the leaf nodes to generate targets for other nodes in the tree just gives rise to incorrect values, even if Bellman error is fully minimized at those states. Thus, most of the Bellman updates do not actually bring Q-values at the states of the MDP closer to $Q^*$, since the primary cause of incorrect target values isn’t corrected.

This observation is surprising, since it demonstrates how the choice of an online distribution coupled with function approximation might actually learn incorrect Q-values. On the other hand, a scheme that chooses to update states level by level progressively (Figure 2), ensuring that target values used at any iteration of learning are correct, very easily learns correct Q-values in this example.




Figure 2: Run of an ADP algorithm with an oracle distribution that updates states level by level, progressing through the tree from the leaves to the root. Even in the presence of function approximation, selecting the right set of nodes for updates gives rise to correct Q-values.
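The role of update order can be checked on a small tabular version of the tree. The sketch below is a simplification under stated assumptions: it uses a lookup table, so it does not reproduce the aliasing caused by function approximation, and the rewards, the discount factor and the initial over-estimate at one leaf are hypothetical. It only shows that a backup is as good as the target values it reads.

```python
GAMMA = 0.9
ACTIONS = (0, 1)  # stand-ins for a_1 and a_2

# CHILDREN[s] = (next state under a_1, next state under a_2); None ends the episode.
CHILDREN = {
    0: (1, 2),
    1: (3, 4),
    2: (5, 6),
    3: (None, None), 4: (None, None), 5: (None, None), 6: (None, None),
}

# Hypothetical rewards: only one action at one leaf pays off.
REWARD = {(s, a): 0.0 for s in CHILDREN for a in ACTIONS}
REWARD[(6, 1)] = 1.0

def backup(q, s, a):
    """Bellman backup for (s, a), reading target values from the table q."""
    child = CHILDREN[s][a]
    future = 0.0 if child is None else max(q[(child, b)] for b in ACTIONS)
    return REWARD[(s, a)] + GAMMA * future

def sweep(q, state_order):
    """Apply one backup per state-action pair, visiting states in the given order."""
    q = dict(q)
    for s in state_order:
        for a in ACTIONS:
            q[(s, a)] = backup(q, s, a)
    return q

def max_error(q, q_ref):
    return max(abs(q[key] - q_ref[key]) for key in q_ref)

# Reference Q*: three full sweeps suffice for a tree of depth three.
zero_q = {key: 0.0 for key in REWARD}
q_star = zero_q
for _ in range(3):
    q_star = sweep(q_star, range(7))

# Start from a table with one over-estimated leaf value.
initial_q = dict(zero_q)
initial_q[(3, 0)] = 5.0

leaves_first = sweep(initial_q, [6, 5, 4, 3, 2, 1, 0])  # level by level, leaves to root
root_first = sweep(initial_q, [0, 1, 2, 3, 4, 5, 6])    # same backups, opposite order

print("error after leaves-first sweep:", max_error(leaves_first, q_star))
print("error after root-first sweep:  ", max_error(root_first, q_star))
```

With the same number of backups, the leaves-first sweep recovers $Q^*$ exactly, while the root-first sweep copies the wrong leaf value into its parent and leaves the root unaware of the only rewarding path.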

Consequences of Absent Corrective Feedback

Now, one might ask whether an absence of corrective feedback occurs in practice, beyond a simple didactic example, and whether it hurts in practical problems. Since the dynamics of the learning process are hard to visualize in practical problems the way we did for the didactic example, we instead devise a metric that quantifies our intuition for corrective feedback. This metric, which we call value error, is given by:

$\mathcal{E}_k = \mathbb{E}_{d^{\pi_k}}[|Q_k - Q^*|]$

Increasing values of $\mathcal{E}_k$ imply that the algorithm is pushing Q-values farther away from $Q^*$, which means that corrective feedback is absent if this happens over a number of iterations. On the other hand, decreasing values of $\mathcal{E}_k$ imply that the algorithm is continuously improving its estimate of $Q$, by moving it towards $Q^*$ with each iteration, indicating the presence of corrective feedback.
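As a small illustration, value error can be computed as the visitation-weighted average of $|Q_k - Q^*|$ over state-action pairs. The sketch below assumes a tabular setting in which $Q^*$ is known, which is only realistic for diagnostic purposes, and it keeps the visitation distribution fixed across iterations for simplicity. All names and numbers are hypothetical.

```python
def value_error(q_k, q_star, visitation):
    """Expected |Q_k - Q*| under a state-action visitation distribution."""
    return sum(weight * abs(q_k[sa] - q_star[sa]) for sa, weight in visitation.items())

q_star = {("s0", "a1"): 1.0, ("s0", "a2"): 0.0}

# Assumed visitation frequencies of the current policy; they sum to one.
visitation = {("s0", "a1"): 0.75, ("s0", "a2"): 0.25}

# Two successive Q-function iterates.
q_iterates = [
    {("s0", "a1"): 0.2, ("s0", "a2"): 0.8},
    {("s0", "a1"): 0.6, ("s0", "a2"): 0.4},
]

errors = [value_error(q_k, q_star, visitation) for q_k in q_iterates]
print(errors)
```

Here the error shrinks from one iterate to the next, which is the signature of corrective feedback; a sequence that grows or fluctuates over many iterations would indicate its absence.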

Observe in Figure 3 that ADP methods can suffer from prolonged periods where this global measure of error in the Q-function, $\mathcal{E}_k$, is increasing or fluctuating, and the corresponding returns degrade or stagnate, implying an absence of corrective feedback.




Figure 3: Consequences of absent corrective feedback, including (a) sub-optimal convergence, (b) instability in learning and (c) inability to learn with sparse rewards.

In particular, we describe three different consequences of an absence of corrective feedback:

  1. Convergence to suboptimal Q-functions. We find that on-policy sampling can cause ADP to converge to a suboptimal solution, even in the absence of sampling error. Figure 3(a) shows that the value error $\mathcal{E}_k$ rapidly decreases initially, and eventually converges to a value significantly greater than 0, from which the learning process never recovers.

  2. Instability in the learning process. We observe that ADP with replay buffers can be unstable. For instance, the algorithm is prone to degradation even if the latest policy obtains returns that are very close to the optimal return in Figure 3(b).

  3. Inability to learn with low signal-to-noise ratio. Absence of corrective feedback can also prevent ADP algorithms from learning quickly in scenarios with low signal-to-noise ratio, such as tasks with sparse/noisy rewards as shown in Figure 3(c). Note that this is not an exploration issue, since all transitions in the MDP are provided to the algorithm in this experiment.

Inducing Maximal Corrective Feedback via Distribution Correction

Now that we have defined corrective feedback and gone over some detrimental consequences that its absence can have on the learning process of an ADP algorithm, what might be some ways to fix this problem? To recap, an absence of corrective feedback occurs when ADP algorithms naively use the on-policy or replay buffer distributions for training Q-functions. One way to prevent this problem is to compute an “optimal” data distribution that provides maximal corrective feedback, and to train Q-functions using this distribution. This way we can ensure that the ADP algorithm always enjoys corrective feedback, and hence makes steady learning progress. The strategy we used in our work is to compute this optimal distribution and then perform a weighted Bellman update that re-weights the data distribution in the replay buffer to this optimal distribution (in practice, a tractable approximation is required, as we will see) via importance sampling based techniques.

We will not go into the full details of our derivation in this article; however, we mention the optimization problem used to obtain a form for this optimal distribution and encourage readers interested in the theory to check out Section 4 in our paper. In this optimization problem, our goal is to minimize a measure of corrective feedback, given by value error $\mathcal{E}_k$, with respect to the distribution $p_k$ used for Bellman error minimization, at every iteration $k$. This gives rise to the following problem:

$\min _{p_{k}} \; \mathbb{E}_{d^{\pi_{k}}}[|Q_{k}-Q^{*}|]$

$\text { s.t. }\;\; Q_{k}=\arg \min _{Q} \mathbb{E}_{p_{k}}\left[\left(Q-\mathcal{B}^{*} Q_{k-1}\right)^{2}\right]$

We show in our paper that the solution of this optimization problem, that we refer to as the optimal distribution, $p_k^*$, is given by:

$p_{k}^*(s, a) \propto \exp \left(-\left|Q_{k}-Q^{*}\right|(s, a)\right) \frac{\left|Q_{k}-\mathcal{B}^{*} Q_{k-1}\right|(s, a)}{\lambda^{*}}$

By simplifying this expression, we obtain a practically viable expression for weights, $w_k$, at any iteration $k$ that can be used to re-weight the data distribution:

$w_{k}(s, a) \propto \exp \left(-\frac{\gamma \mathbb{E}_{s'|s, a} \mathbb{E}_{a' \sim \pi_\phi(\cdot|s')} \Delta_{k-1}(s', a')}{\tau}\right)$

where $\Delta_k$ is the accumulated Bellman error over iterations, and it satisfies a convenient recursion making it amenable to practical implementations,

$\Delta_{k}(s, a) =\left|Q_{k}-\mathcal{B}^{*} Q_{k-1}\right|(s, a) +\gamma \mathbb{E}_{s'|s, a} \mathbb{E}_{a' \sim \pi_\phi(\cdot|s')} \Delta_{k-1}(s', a')$

and $\pi_\phi$ is the Boltzmann or $\varepsilon$-greedy policy corresponding to the current Q function.

What does this expression for $w_k$ intuitively correspond to? Observe that the term appearing in the exponent in the expression for $w_k$ corresponds to the accumulated Bellman error in the target values. Our choice of $w_k$, thus, basically down-weights transitions with highly incorrect target values. This technique falls into a broader class of abstention based techniques that are common in supervised learning settings with noisy labels, where down-weighting datapoints (transitions here) with errorful labels (target values here) can boost generalization and correctness properties of the learned model.
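A tabular sketch of the weight computation may help make this concrete. It is an illustration under assumptions rather than the implementation from the paper: transitions are deterministic, $\Delta_{k-1}$ is stored in a dictionary instead of being estimated by a network, and the state names, policy and numerical values are hypothetical.

```python
import math

def expected_next_delta(delta_prev, next_state, policy):
    """E_{a' ~ pi(.|s')} Delta_{k-1}(s', a') for a deterministic next state s'."""
    if next_state is None:  # terminal transition: no error flows back
        return 0.0
    return sum(prob * delta_prev[(next_state, action)]
               for action, prob in policy[next_state].items())

def discor_weight(delta_prev, next_state, policy, gamma, tau):
    """Unnormalized weight exp(-gamma * E[Delta_{k-1}(s', a')] / tau)."""
    return math.exp(-gamma * expected_next_delta(delta_prev, next_state, policy) / tau)

def update_delta(bellman_error, delta_prev, next_state, policy, gamma):
    """Recursion: Delta_k(s, a) = |Bellman error| + gamma * E[Delta_{k-1}(s', a')]."""
    return abs(bellman_error) + gamma * expected_next_delta(delta_prev, next_state, policy)

# Accumulated errors at two possible next states: s1 is reliable, s2 is not.
delta_prev = {("s1", "a1"): 0.1, ("s1", "a2"): 0.1,
              ("s2", "a1"): 2.0, ("s2", "a2"): 2.0}
policy = {"s1": {"a1": 0.5, "a2": 0.5}, "s2": {"a1": 0.5, "a2": 0.5}}

w_reliable = discor_weight(delta_prev, "s1", policy, gamma=0.9, tau=1.0)
w_unreliable = discor_weight(delta_prev, "s2", policy, gamma=0.9, tau=1.0)

# Normalize so that the weights over this small batch sum to one.
total = w_reliable + w_unreliable
weights = [w_reliable / total, w_unreliable / total]

# Accumulated error for a transition into s1 whose current Bellman error is 0.5.
delta_new = update_delta(0.5, delta_prev, "s1", policy, gamma=0.9)

print(weights, delta_new)
```

The transition whose target is built from the unreliable state receives a much smaller weight, and the temperature $\tau$ controls how sharply that down-weighting acts.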




Figure 4: Schematic of the DisCor algorithm. Transitions with errorful target values are downweighted.

Why does our choice of $\Delta_k$, i.e., the sum of accumulated Bellman errors, suffice? This is because $\Delta_k$ accounts for how error is propagated in ADP methods. Bellman errors, $|Q_k - \mathcal{B}^*Q_{k-1}|$, are propagated under the current policy $\pi_{k-1}$, and then discounted when computing target values for updates in ADP. $\Delta_k$ captures exactly this, and therefore, using this estimate in our weights suffices.

Our practical algorithm, which we refer to as DisCor (Distribution Correction), is identical to conventional ADP methods like Q-learning, with the exception that it performs a weighted Bellman backup – it assigns a weight $w_k(s,a)$ to a transition, $(s, a, r, s')$, and performs a Bellman backup weighted by these weights, as shown below.

$Q_k \leftarrow \arg\min_Q \; \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[w_k(s, a) \left(Q(s, a) - \mathcal{B}^{*} Q_{k-1}(s, a)\right)^2\right]$

We depict the general principle in the schematic diagram shown in Figure 4.
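The weighted backup itself amounts to a weighted squared error over a batch of transitions. The following sketch assumes a tabular Q-function, deterministic transitions, and weights that have already been computed; the function name, the batch and the numbers are made up for illustration.

```python
def weighted_bellman_loss(batch, q, q_prev, weights, gamma, actions):
    """Mean of w(s, a) * (Q(s, a) - [r + gamma * max_a' Q_prev(s', a')])^2."""
    total = 0.0
    for (s, a, r, s_next), w in zip(batch, weights):
        future = 0.0 if s_next is None else max(q_prev[(s_next, b)] for b in actions)
        target = r + gamma * future
        total += w * (q[(s, a)] - target) ** 2
    return total / len(batch)

actions = ("a1", "a2")
q = {("s0", "a1"): 1.0, ("s0", "a2"): 0.0}           # current estimates
q_prev = {("s1", "a1"): 1.0, ("s1", "a2"): 0.0,      # previous iterate, used for targets
          ("s2", "a1"): 3.0, ("s2", "a2"): 0.0}

# Transitions (s, a, r, s'); the second one bootstraps from a suspicious value at s2.
batch = [("s0", "a1", 0.0, "s1"), ("s0", "a2", 0.0, "s2")]

uniform_loss = weighted_bellman_loss(batch, q, q_prev, [1.0, 1.0], gamma=0.5, actions=actions)
discor_loss = weighted_bellman_loss(batch, q, q_prev, [0.9, 0.1], gamma=0.5, actions=actions)

print(uniform_loss, discor_loss)
```

With uniform weights the loss is dominated by the transition with the questionable target; with the re-weighting, that transition contributes far less to the update.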

How does DisCor perform in practice?

We finally present some results that demonstrate the efficacy of our method, DisCor, in practical scenarios. Since DisCor only modifies the chosen distribution for the Bellman update, it can be applied on top of any standard ADP algorithm including soft actor-critic (SAC) or deep Q-network (DQN). Our paper presents results for a number of tasks spanning a wide variety of settings including robotic manipulation tasks, multi-task reinforcement learning tasks, learning with stochastic and noisy rewards, and Atari games. In this blog post, we present two of these results from robotic manipulation and multi-task RL.

  1. Robotic manipulation tasks. On six challenging benchmark tasks from the Meta-World suite, we observe that DisCor, when combined with SAC, greatly outperforms prior state-of-the-art RL algorithms such as soft actor-critic (SAC) and prioritized experience replay (PER), a prior method that prioritizes states with high Bellman error during training. Note that DisCor usually starts learning earlier than the other methods it is compared to. DisCor outperforms vanilla SAC by about 50% on average, in terms of success rate on these tasks.


  2. Multi-task reinforcement learning. We also present results on the Multi-task 10 (MT10) and Multi-task 50 (MT50) benchmarks from the Meta-World suite. The goal here is to learn a single policy that can solve a number of (10 or 50, respectively) different manipulation tasks that share common structure. We note that DisCor outperforms the state-of-the-art SAC algorithm on both of these benchmarks by a wide margin (e.g., 50% in success rate on MT10). Unlike the learning curve of SAC, which tends to plateau over the course of learning, the learning curve of DisCor always exhibits a non-zero gradient until it converges.


In our paper, we also perform evaluations on other domains such as Atari games and OpenAI gym benchmarks, and we encourage the readers to check those out. We also perform an analysis of the method on tabular domains, understanding different aspects of the method.

Perspectives, Future Work and Open Problems

Some of our own and other prior work has highlighted the impact of the data distribution on the performance of ADP algorithms. We observed in another prior work that, in contrast to the intuitive belief about the efficacy of online Q-learning with on-policy data collection, Q-learning with a uniform distribution over states and actions seemed to perform best. Obtaining a uniform distribution over state-action tuples during training is not possible in RL, unless all states and actions are observed at least once, which may not be the case in a number of scenarios. We might also ask whether the uniform distribution is the best choice that can be used in an RL setting. The form of the optimal distribution derived in Section 4 of our paper is a potentially better choice, since it is customized to the MDP under consideration.

Furthermore, in the domain of purely offline reinforcement learning, studied in our prior work and in some other works, we observe that the data distribution is again a central feature, where backing up out-of-distribution actions and the inability to try these actions out in the environment to obtain answers to counterfactual queries can cause error accumulation and backups to diverge. However, in this work, we demonstrate a somewhat counterintuitive finding: even with on-policy data collection, where the algorithm, in principle, can evaluate all forms of counterfactual queries, the algorithm may not make steady learning progress, due to an undesirable interaction between the data distribution and generalization effects of the function approximator.


What might be a few promising directions to pursue in future work?

Formal analysis of learning dynamics: While our study is an initial foray into the role that data distributions play in the learning dynamics of ADP algorithms, it motivates a significantly deeper direction of future study. We need to answer questions about how the deep neural network based function approximators behind these ADP methods actually behave, in order to get them to enjoy corrective feedback.

Re-weighting to supplement exploration in RL problems: Our work demonstrates the promise of re-weighting techniques as a practically simple replacement for altering entire exploration strategies. We believe that re-weighting techniques are very promising as a general tool in our toolkit to develop RL algorithms. In an online RL setting, re-weighting can take some of the burden off exploration algorithms, and can thus potentially help us employ complex exploration strategies in RL algorithms.

More generally, we would like to make a case for analyzing the effects of the data distribution more deeply in the context of deep RL algorithms. It is well known that narrow distributions can lead to brittle solutions in supervised learning that also do not generalize. What is the corresponding analogue in reinforcement learning? Distributional robustness style techniques have been used in supervised learning to guarantee a uniformly convergent learning process, but it still remains unclear how to apply these in RL with function approximation. Part of the reason is that the theory of RL often derives from tabular settings, where distributions do not hamper the learning process to the extent they do with function approximation. However, as we showed in this work, choosing the right distribution may lead to significant gains in deep RL methods, and therefore we believe that this issue should be studied in more detail.


This blog post is based on our recent paper:

  • DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction
    Aviral Kumar, Abhishek Gupta, Sergey Levine
    arXiv

We thank Sergey Levine and Marvin Zhang for their valuable feedback on this blog post.


This article was initially published on the BAIR blog, and appears here with the authors’ permission.

Does on-policy data collection fix errors in off-policy reinforcement learning?

Reinforcement learning has seen a great deal of success in solving complex decision making problems ranging from robotics to games to supply chain management to recommender systems. Despite their success, deep reinforcement learning algorithms can be exceptionally difficult to use, due to unstable training, sensitivity to hyperparameters, and generally unpredictable and poorly understood convergence properties. Multiple explanations, and corresponding solutions, have been proposed for improving the stability of such methods, and we have seen good progress over the last few years on these algorithms. In this blog post, we will dive deep into analyzing a central and underexplored reason behind some of the problems with the class of deep RL algorithms based on dynamic programming, which encompass the popular DQN and soft actor-critic (SAC) algorithms – the detrimental connection between data distributions and learned models.

Before diving deep into a description of this problem, let us quickly recap some of the main concepts in dynamic programming. Algorithms that apply dynamic programming in conjunction with function approximation are generally referred to as approximate dynamic programming (ADP) methods. ADP algorithms include some of the most popular, state-of-the-art RL methods such as variants of deep Q-networks (DQN) and soft actor-critic (SAC) algorithms. ADP methods based on Q-learning train action-value functions, $Q(s, a)$, via a Bellman backup. In practice, this corresponds to training a parametric function, $Q_\theta(s, a)$, by minimizing the mean squared difference to a backup estimate of the Q-function, defined as:

$\mathcal{B}^*Q(s, a) = r(s, a) + \gamma \mathbb{E}_{s’|s, a} [\max_{a’} \bar{Q}(s’, a’)],$

where $\bar{Q}$ denotes a previous instance of the original Q-function, $Q_\theta$, and is commonly referred to as a target network. This update is summarized in the equation below.

An analogous update is also used for actor-critic methods that also maintain an explicitly parametrized policy, $\pi_\phi(a|s)$, alongside a Q-function. Such an update typically replaces $\max_{a’}$ with an expectation under the policy, $\mathbb{E}_{a’ \sim \pi_\phi}$. We shall use the $\max_{a’}$ version for consistency throughout, however, the actor-critic version follows analogously. These ADP methods aim at learning the optimal value function, $Q^*$, by applying the Bellman backup iteratively untill convergence.

A central factor that affects the performance of ADP algorithms is the choice of the training data-distribution, $\mathcal{D}$, as shown in the equation above. The choice of $\mathcal{D}$ is an integral component of the backup, and it affects solutions obtained via ADP methods, especially since function approximation is involved. Unlike tabular settings, function approximation causes the learned Q function to depend on the choice of data distribution $\mathcal{D}$, thereby affecting the dynamics of the learning process. We show that on-policy exploration induces distributions $\mathcal{D}$ such that training Q-functions under $\mathcal{D}$ may fail to correct systematic errors in the Q-function, even if Bellman error is minimized as much as possible – a phenomenon that we refer to as an absence of corrective feedback.

Corrective Feedback and Why it is Absent in ADP

What is corrective feedback formally? How do we determine if it is present or absent in ADP methods? In order to build intuition, we first present a simple contextual bandit (one step RL) example, where the Q-function is trained to match $Q^*$ via supervised updates, without bootstrapping. This enjoys corrective feedback, and we then contrast it with ADP methods, which do not. In this example, the goal is to learn the optimal value function $Q^*(s, a)$, which, is equal to the reward $r(s, a)$. At iteration $k$, the algorithm minimizes the estimation error of the Q-function:

$\mathcal{L}(Q) = \mathbb{E}_{s \sim \beta(s), a \sim \pi_k(a|s)}[|Q_k(s, a) – Q^*(s, a)|].$

Using an $\varepsilon$-greedy or Boltzmann policy for exploration, denoted by $\pi_k$, gives rise to a hard negative mining phenomenon – the policy chooses precisely those actions that correspond to possibly over-estimated Q-values for each state $s$ and observes the corresponding, $r(s, a)$ or $Q^*(s, a)$, as a result. Then, minimizing $\mathcal{L}(Q)$, on samples collected this way corrects errors in the Q-function, as $Q_k(s, a)$ is pushed closer to match $Q^*(s, a)$ for actions $a$ with incorrectly high Q-values, correcting precisely the Q-values which may cause sub-optimal performance. This constructive interaction between online data collection and error correction – where the induced online data distribution corrects errors in the value function – is what we refer to as corrective feedback.

In contrast, we will demonstrate that ADP methods that rely on previous Q-functions to generate targets for training the current Q-function, may not benefit from corrective feedback. This difference between bandits and ADP happens because the target values are computed by applying a Bellman backup on the previous Q-function, (target value), rather than the optimal $Q^*$, so, errors in $\bar{Q}$, at the next states can result in incorrect Q-value targets at the current state. No matter how often the current transition is observed, or how accurately Bellman errors are minimized, the error in the Q-value with respect to the optimal Q-function, $|Q – Q^*|$, at this state is not reduced. Furthermore, in order to obtain correct target values, we need to ensure that values at state-action pairs occurring at the tail ends of the data distribution $\mathcal{D}$, which are primary causes of errors in Q-values at other states, are correct. However, as we will show via a simple didactic example, that this correction process may be extremely slow and may not occur, mainly because of undesirable generalization effects of the function approximator.

Let’s consider a didactic example of a tree-structured deterministic MDP with 7 states and 2 actions, $a_1$ and $a_2$, at each state.




Figure 1: Run of an ADP algorithm with on-policy data collection. Boxed nodes and circled nodes denote groups of states aliased by function approximation — values of these nodes are affected due to parameter sharing and function approximation.

A run of an ADP algorithm that chooses the current on-policy state-action marginal as $\mathcal{D}$ on this tree MDP is shown in Figure 1. Thus, the Bellman error at a state is minimized in proportion to the frequency of occurrence of that state in the policy state-action marginal. Since the leaf node states are the least frequent in this on-policy marginal distribution (due to the discounting), the Bellman backup is unable to correct errors in Q-values at such leaf nodes, due to their low frequency and their aliasing with other states under function approximation. Using incorrect Q-values at the leaf nodes to generate targets for other nodes in the tree just gives rise to incorrect values, even if Bellman error is fully minimized at those states. Thus, most of the Bellman updates do not actually bring Q-values at the states of the MDP closer to $Q^*$, since the primary cause of incorrect target values isn’t corrected.

This observation is surprising, since it demonstrates how the choice of an online distribution coupled with function approximation might actually learn incorrect Q-values. On the other hand, a scheme that chooses to update states level by level progressively (Figure 2), ensuring that target values used at any iteration of learning are correct, very easily learns correct Q-values in this example.




Figure 2: Run of an ADP algorithm with an oracle distribution that updates states level by level, progressing through the tree from the leaves to the root. Even in the presence of function approximation, selecting the right set of nodes for updates gives rise to correct Q-values.
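The effect of update order can be sketched on a 7-state binary tree. This is a simplified tabular illustration, not the function-approximation setting of the figures: the tree layout, the leaf rewards, the zero rewards at internal nodes and the initial over-estimates are all assumptions.

```python
GAMMA = 0.9
# Internal state -> child reached by action a_1, a_2 (deterministic tree).
CHILD = {0: (1, 2), 1: (3, 4), 2: (5, 6)}
# Hypothetical terminal rewards for the two actions at each leaf.
LEAF_REWARD = {3: (0.0, 1.0), 4: (0.0, 0.0), 5: (0.5, 0.0), 6: (0.0, 0.2)}

def backup(q, s):
    """Bellman backup at state s using the current table q for the targets."""
    if s in LEAF_REWARD:
        return list(LEAF_REWARD[s])
    return [GAMMA * max(q[c]) for c in CHILD[s]]

def sweep(q, order):
    """Apply one backup per state, in the given order, on a copy of q."""
    q = {s: list(v) for s, v in q.items()}
    for s in order:
        q[s] = backup(q, s)
    return q

q_bad = {s: [5.0, 5.0] for s in range(7)}             # over-estimated everywhere
leaves_first = sweep(q_bad, [3, 4, 5, 6, 1, 2, 0])    # oracle-style ordering
root_first = sweep(q_bad, [0, 1, 2, 3, 4, 5, 6])      # targets still wrong at the root
```

With the leaves-first ordering every target is already correct when it is used, so a single sweep recovers the optimal values at the root; with the root-first ordering the root is backed up from the stale over-estimates.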

Consequences of Absent Corrective Feedback

Now, one might ask whether an absence of corrective feedback occurs in practice, beyond a simple didactic example, and whether it hurts in practical problems. Since the dynamics of the learning process are hard to visualize in practical problems, unlike in the didactic example, we instead devise a metric that quantifies our intuition for corrective feedback. This metric, which we call value error, is given by:

$\mathcal{E}_k = \mathbb{E}_{d^{\pi_k}}[|Q_k - Q^*|]$

Increasing values of $\mathcal{E}_k$ imply that the algorithm is pushing Q-values farther away from $Q^*$, which means that corrective feedback is absent if this happens over a number of iterations. On the other hand, decreasing values of $\mathcal{E}_k$ imply that the algorithm is continuously improving its estimate of $Q$, by moving it towards $Q^*$ with each iteration, indicating the presence of corrective feedback.
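As a rough illustration, the value error can be computed in a tabular setting as the visitation-weighted mean absolute gap between $Q_k$ and $Q^*$, matching the objective $\mathbb{E}_{d^{\pi_k}}[|Q_k - Q^*|]$ that appears in the optimization problem later in this post. The dictionary layout and the numbers below are assumptions for the example.

```python
def value_error(q_k, q_star, visitation):
    """E_k = sum over (s, a) of d(s, a) * |Q_k(s, a) - Q*(s, a)|.

    `visitation` holds unnormalised visit counts under the current policy.
    """
    total = sum(visitation.values())
    return sum(count / total * abs(q_k[sa] - q_star[sa])
               for sa, count in visitation.items())

# Hypothetical two-action example at a single state.
q_star = {("s0", "a1"): 1.0, ("s0", "a2"): 0.0}
q_k = {("s0", "a1"): 0.5, ("s0", "a2"): 0.4}
visits = {("s0", "a1"): 3, ("s0", "a2"): 1}
err = value_error(q_k, q_star, visits)
```

Tracking this number across iterations is what separates "improving" from "drifting": it requires access to $Q^*$, so it is a diagnostic for analysis rather than something available during ordinary training.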

Observe in Figure 3 that ADP methods can suffer from prolonged periods where this global measure of error in the Q-function, $\mathcal{E}_k$, is increasing or fluctuating, and the corresponding returns degrade or stagnate, implying an absence of corrective feedback.




Figure 3: Consequences of absent corrective feedback, including (a) sub-optimal convergence, (b) instability in learning and (c) inability to learn with sparse rewards.

In particular, we describe three different consequences of an absence of corrective feedback:

  1. Convergence to suboptimal Q-functions. We find that on-policy sampling can cause ADP to converge to a suboptimal solution, even in the absence of sampling error. Figure 3(a) shows that the value error $\mathcal{E}_k$ rapidly decreases initially, and eventually converges to a value significantly greater than 0, from which the learning process never recovers.

  2. Instability in the learning process. We observe that ADP with replay buffers can be unstable. For instance, the algorithm is prone to degradation even if the latest policy obtains returns that are very close to the optimal return in Figure 3(b).

  3. Inability to learn with low signal-to-noise ratio. Absence of corrective feedback can also prevent ADP algorithms from learning quickly in scenarios with low signal-to-noise ratio, such as tasks with sparse/noisy rewards as shown in Figure 3(c). Note that this is not an exploration issue, since all transitions in the MDP are provided to the algorithm in this experiment.

Inducing Maximal Corrective Feedback via Distribution Correction

Now that we have defined corrective feedback and gone over some detrimental consequences that its absence can have on the learning process of an ADP algorithm, what might be some ways to fix this problem? To recap, an absence of corrective feedback occurs when ADP algorithms naively use the on-policy or replay buffer distributions for training Q-functions. One way to prevent this problem is to compute an “optimal” data distribution that provides maximal corrective feedback, and to train Q-functions using this distribution. This way we can ensure that the ADP algorithm always enjoys corrective feedback, and hence makes steady learning progress. The strategy we used in our work is to compute this optimal distribution and then perform a weighted Bellman update that re-weights the data distribution in the replay buffer to this optimal distribution (in practice, a tractable approximation is required, as we will see) via importance sampling based techniques.

We will not go into the full details of our derivation in this article. However, we mention the optimization problem used to obtain a form for this optimal distribution, and encourage readers interested in the theory to check out Section 4 in our paper. In this optimization problem, our goal is to minimize a measure of corrective feedback, given by value error $\mathcal{E}_k$, with respect to the distribution $p_k$ used for Bellman error minimization, at every iteration $k$. This gives rise to the following problem:

$\min _{p_{k}} \; \mathbb{E}_{d^{\pi_{k}}}[|Q_{k}-Q^{*}|]$

$\text { s.t. }\;\; Q_{k}=\arg \min _{Q} \mathbb{E}_{p_{k}}\left[\left(Q-\mathcal{B}^{*} Q_{k-1}\right)^{2}\right]$

We show in our paper that the solution of this optimization problem, which we refer to as the optimal distribution, $p_k^*$, is given by:

$p_{k}^*(s, a) \propto \exp \left(-\left|Q_{k}-Q^{*}\right|(s, a)\right) \frac{\left|Q_{k}-\mathcal{B}^{*} Q_{k-1}\right|(s, a)}{\lambda^{*}}$

By simplifying this expression, we obtain a practically viable expression for weights, $w_k$, at any iteration $k$ that can be used to re-weight the data distribution:

$w_{k}(s, a) \propto \exp \left(-\frac{\gamma \mathbb{E}_{s'|s, a} \mathbb{E}_{a' \sim \pi_\phi(\cdot|s')} \Delta_{k-1}(s', a')}{\tau}\right)$

where $\Delta_k$ is the accumulated Bellman error over iterations, and it satisfies a convenient recursion making it amenable to practical implementations,

$\Delta_{k}(s, a) =\left|Q_{k}-\mathcal{B}^{*} Q_{k-1}\right|(s, a) +\gamma \mathbb{E}_{s'|s, a} \mathbb{E}_{a' \sim \pi_\phi(\cdot|s')} \Delta_{k-1}(s', a')$

and $\pi_\phi$ is the Boltzmann or $\varepsilon$-greedy policy corresponding to the current Q function.
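The recursion for $\Delta_k$ and the resulting weights $w_k$ can be sketched in a small tabular setting. This is an illustration under stated assumptions, not the paper's implementation: the MDP is taken to be deterministic (so the expectation over $s'$ collapses to a lookup), terminal transitions are marked with `None`, and the function names, the two-state example and the values of $\gamma$ and $\tau$ are hypothetical.

```python
import math

def expected_next_delta(delta_prev, next_state, policy, s, a):
    """E_{a' ~ pi(.|s')} Delta_{k-1}(s', a') for a deterministic next state."""
    s2 = next_state[s][a]
    if s2 is None:                      # terminal transition: no future error
        return 0.0
    return sum(p * d for p, d in zip(policy[s2], delta_prev[s2]))

def update_delta(bellman_err, delta_prev, next_state, policy, gamma):
    """Delta_k = |Q_k - B*Q_{k-1}| + gamma * E[Delta_{k-1}(s', a')]."""
    return [[bellman_err[s][a]
             + gamma * expected_next_delta(delta_prev, next_state, policy, s, a)
             for a in range(len(row))]
            for s, row in enumerate(bellman_err)]

def discor_weights(delta_prev, next_state, policy, gamma, tau):
    """w_k proportional to exp(-gamma * E[Delta_{k-1}(s', a')] / tau), normalised."""
    raw = [[math.exp(-gamma * expected_next_delta(delta_prev, next_state, policy, s, a) / tau)
            for a in range(len(row))]
           for s, row in enumerate(delta_prev)]
    total = sum(map(sum, raw))
    return [[w / total for w in row] for row in raw]

# Hypothetical 2-state, 2-action example: only (s=0, a=0) leads to state 1,
# where the accumulated error is large.
next_state = [[1, None], [None, None]]
policy = [[0.5, 0.5], [0.5, 0.5]]
delta_prev = [[0.0, 0.0], [2.0, 2.0]]
bellman_err = [[0.1, 0.1], [0.1, 0.1]]

weights = discor_weights(delta_prev, next_state, policy, gamma=0.5, tau=1.0)
delta = update_delta(bellman_err, delta_prev, next_state, policy, gamma=0.5)
```

In this example the transition whose target depends on the error-prone state receives the smallest weight, while the three transitions with error-free targets share the remaining mass equally.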

What does this expression for $w_k$ intuitively correspond to? Observe that the term appearing in the exponent in the expression for $w_k$ corresponds to the accumulated Bellman error in the target values. Our choice of $w_k$, thus, basically down-weights transitions with highly incorrect target values. This technique falls into a broader class of abstention based techniques that are common in supervised learning settings with noisy labels, where down-weighting datapoints (transitions here) with errorful labels (target values here) can boost generalization and correctness properties of the learned model.




Figure 4: Schematic of the DisCor algorithm. Transitions with errorful target values are downweighted.

Why does our choice of $\Delta_k$, i.e., the sum of accumulated Bellman errors, suffice? This is because $\Delta_k$ accounts for how error is propagated in ADP methods. Bellman errors, $|Q_k - \mathcal{B}^*Q_{k-1}|$, are propagated under the current policy $\pi_{k-1}$, and then discounted when computing target values for updates in ADP. $\Delta_k$ captures exactly this, and therefore using this estimate in our weights suffices.

Our practical algorithm, which we refer to as DisCor (Distribution Correction), is identical to conventional ADP methods like Q-learning, with the exception that it performs a weighted Bellman backup – it assigns a weight $w_k(s,a)$ to each transition $(s, a, r, s')$ and performs a Bellman backup weighted by these weights.

We depict the general principle in the schematic diagram shown in Figure 4.
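A weighted Bellman backup of this kind can be sketched as one gradient step per transition on the weighted squared Bellman error. The tabular Q-table, the `(s, a, r, s2, done)` tuple layout, the step size and the function name are assumptions for illustration; the weights are taken as given inputs.

```python
def weighted_backup_step(q, q_prev, batch, weights, gamma, lr):
    """One pass over a batch, minimising w * 0.5 * (Q(s, a) - target)^2.

    Targets are built from the previous table q_prev, as in Q-learning.
    """
    for (s, a, r, s2, done), w in zip(batch, weights):
        target = r if done else r + gamma * max(q_prev[s2])
        # Gradient step scaled by the transition weight: a weight of zero
        # leaves the entry untouched, larger weights move it further.
        q[s][a] -= lr * w * (q[s][a] - target)
    return q

q_prev = [[0.0, 0.0], [1.0, 3.0]]
q = [row[:] for row in q_prev]
batch = [(0, 0, 1.0, 1, False),   # bootstraps from state 1
         (0, 1, 0.0, 1, True)]    # terminal transition
weights = [1.0, 0.0]              # hypothetical: second transition fully down-weighted
q = weighted_backup_step(q, q_prev, batch, weights, gamma=0.9, lr=1.0)
```

Setting all weights to the same constant recovers an ordinary (unweighted) backup, which is why this kind of re-weighting can sit on top of an existing ADP method.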

How does DisCor perform in practice?

We finally present some results that demonstrate the efficacy of our method, DisCor, in practical scenarios. Since DisCor only modifies the chosen distribution for the Bellman update, it can be applied on top of any standard ADP algorithm including soft actor-critic (SAC) or deep Q-network (DQN). Our paper presents results for a number of tasks spanning a wide variety of settings including robotic manipulation tasks, multi-task reinforcement learning tasks, learning with stochastic and noisy rewards, and Atari games. In this blog post, we present two of these results from robotic manipulation and multi-task RL.

  1. Robotic manipulation tasks. On six challenging benchmark tasks from the MetaWorld suite, we observe that DisCor, when combined with SAC, greatly outperforms prior state-of-the-art RL algorithms such as soft actor-critic (SAC) and prioritized experience replay (PER), a prior method that prioritizes states with high Bellman error during training. Note that DisCor usually starts learning earlier than the other methods we compare against. DisCor outperforms vanilla SAC by about 50% on average, in terms of success rate on these tasks.


  2. Multi-task reinforcement learning. We also present results on the Multi-task 10 (MT10) and Multi-task 50 (MT50) benchmarks from the MetaWorld suite. The goal here is to learn a single policy that can solve a number of (10 or 50, respectively) different manipulation tasks that share common structure. We note that DisCor outperforms the state-of-the-art SAC algorithm on both of these benchmarks by a wide margin (e.g., 50% in success rate on MT10). Unlike SAC, whose learning curve tends to plateau over the course of training, DisCor keeps making learning progress until it converges.


In our paper, we also perform evaluations on other domains such as Atari games and OpenAI gym benchmarks, and we encourage the readers to check those out. We also perform an analysis of the method on tabular domains, understanding different aspects of the method.

Perspectives, Future Work and Open Problems

Some of our own and other prior work has highlighted the impact of the data distribution on the performance of ADP algorithms. We observed in another prior work that, in contrast to the intuitive belief about the efficacy of online Q-learning with on-policy data collection, Q-learning with a uniform distribution over states and actions seemed to perform best. Obtaining a uniform distribution over state-action tuples during training is not possible in RL unless all states and actions are observed at least once, which may not be the case in a number of scenarios. We might also ask whether the uniform distribution is the best choice that can be used in an RL setting. The form of the optimal distribution derived in Section 4 of our paper is a potentially better choice, since it is customized to the MDP under consideration.

Furthermore, in the domain of purely offline reinforcement learning, studied in our prior work and some other works, we observe that the data distribution is again a central feature, where backing up out-of-distribution actions, together with the inability to try these actions out in the environment to obtain answers to counterfactual queries, can cause error accumulation and cause backups to diverge. However, in this work, we demonstrate a somewhat counterintuitive finding: even with on-policy data collection, where the algorithm can in principle evaluate all forms of counterfactual queries, the algorithm may not make steady learning progress, due to an undesirable interaction between the data distribution and generalization effects of the function approximator.


What might be a few promising directions to pursue in future work?

Formal analysis of learning dynamics: While our study is an initial foray into the role that data distributions play in the learning dynamics of ADP algorithms, it motivates a significantly deeper direction of future study. We need to answer questions about how the deep neural network function approximators behind these ADP methods actually behave, in order to get them to enjoy corrective feedback.

Re-weighting to supplement exploration in RL problems: Our work shows the promise of re-weighting techniques as a practically simple replacement for altering entire exploration strategies. We believe that re-weighting techniques are very promising as a general tool in our toolkit for developing RL algorithms. In an online RL setting, re-weighting can help take some of the burden off exploration algorithms, and can thus potentially help us employ complex exploration strategies in RL algorithms.

More generally, we would like to make a case for analyzing the effects of the data distribution more deeply in the context of deep RL algorithms. It is well known that narrow distributions can lead to brittle solutions in supervised learning that also do not generalize. What is the corresponding analogue in reinforcement learning? Distributional robustness style techniques have been used in supervised learning to guarantee a uniformly convergent learning process, but it still remains unclear how to apply these in RL with function approximation. Part of the reason is that the theory of RL often derives from tabular settings, where distributions do not hamper the learning process to the extent they do with function approximation. However, as we showed in this work, choosing the right distribution may lead to significant gains in deep RL methods, and we therefore believe that this issue should be studied in more detail.


This blog post is based on our recent paper:

  • DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction
    Aviral Kumar, Abhishek Gupta, Sergey Levine
    arXiv

We thank Sergey Levine and Marvin Zhang for their valuable feedback on this blog post.


This article was initially published on the BAIR blog, and appears here with the authors’ permission.

How the Automotive Technology Showcased at CES Could Impact Vehicle Manufacturing

Which automakers are best poised to succeed in the future of mobility? That’s the question prompting the appearance of many of the world’s top automotive manufacturers – including Ford, Honda and BMW – at major technology shows like CES 2020 in Las Vegas.

Showing robots how to do your chores

Roboticists are developing automated robots that can learn new tasks solely by observing humans. At home, you might someday show a domestic robot how to do routine chores.
Image: Christine Daniloff, MIT

By Rob Matheson

Training interactive robots may one day be an easy job for everyone, even those without programming expertise. Roboticists are developing automated robots that can learn new tasks solely by observing humans. At home, you might someday show a domestic robot how to do routine chores. In the workplace, you could train robots like new employees, showing them how to perform many duties.

Making progress on that vision, MIT researchers have designed a system that lets these types of robots learn complicated tasks that would otherwise stymie them with too many confusing rules. One such task is setting a dinner table under certain conditions.  

At its core, the researchers’ “Planning with Uncertain Specifications” (PUnS) system gives robots the humanlike planning ability to simultaneously weigh many ambiguous — and potentially contradictory — requirements to reach an end goal. In doing so, the system always chooses the most likely action to take, based on a “belief” about some probable specifications for the task it is supposed to perform.

In their work, the researchers compiled a dataset with information about how eight objects — a mug, glass, spoon, fork, knife, dinner plate, small plate, and bowl — could be placed on a table in various configurations. A robotic arm first observed randomly selected human demonstrations of setting the table with the objects. Then, the researchers tasked the arm with automatically setting a table in a specific configuration, in real-world experiments and in simulation, based on what it had seen.

To succeed, the robot had to weigh many possible placement orderings, even when items were purposely removed, stacked, or hidden. Normally, all of that would confuse robots too much. But the researchers’ robot made no mistakes over several real-world experiments, and only a handful of mistakes over tens of thousands of simulated test runs.  

“The vision is to put programming in the hands of domain experts, who can program robots through intuitive ways, rather than describing orders to an engineer to add to their code,” says first author Ankit Shah, a graduate student in the Department of Aeronautics and Astronautics (AeroAstro) and the Interactive Robotics Group, who emphasizes that their work is just one step in fulfilling that vision. “That way, robots won’t have to perform preprogrammed tasks anymore. Factory workers can teach a robot to do multiple complex assembly tasks. Domestic robots can learn how to stack cabinets, load the dishwasher, or set the table from people at home.”

Joining Shah on the paper are AeroAstro and Interactive Robotics Group graduate student Shen Li and Interactive Robotics Group leader Julie Shah, an associate professor in AeroAstro and the Computer Science and Artificial Intelligence Laboratory.

Bots hedging bets

Robots are fine planners in tasks with clear “specifications,” which help describe the task the robot needs to fulfill, considering its actions, environment, and end goal. Learning to set a table by observing demonstrations is full of uncertain specifications. Items must be placed in certain spots, depending on the menu and where guests are seated, and in certain orders, depending on an item’s immediate availability or social conventions. Present approaches to planning are not capable of dealing with such uncertain specifications.

A popular approach to planning is “reinforcement learning,” a trial-and-error machine-learning technique that rewards and penalizes robots for actions as they work to complete a task. But for tasks with uncertain specifications, it’s difficult to define clear rewards and penalties. In short, robots never fully learn right from wrong.

The researchers’ system, called PUnS (for Planning with Uncertain Specifications), enables a robot to hold a “belief” over a range of possible specifications. The belief itself can then be used to dish out rewards and penalties. “The robot is essentially hedging its bets in terms of what’s intended in a task, and takes actions that satisfy its belief, instead of us giving it a clear specification,” Ankit Shah says.

The system is built on “linear temporal logic” (LTL), an expressive language that enables robotic reasoning about current and future outcomes. The researchers defined templates in LTL that model various time-based conditions, such as what must happen now, must eventually happen, and must happen until something else occurs. The robot’s observations of 30 human demonstrations for setting the table yielded a probability distribution over 25 different LTL formulas. Each formula encoded a slightly different preference — or specification — for setting the table. That probability distribution becomes its belief.

“Each formula encodes something different, but when the robot considers various combinations of all the templates, and tries to satisfy everything together, it ends up doing the right thing eventually,” Ankit Shah says.

Following criteria

The researchers also developed several criteria that guide the robot toward satisfying the entire belief over those candidate formulas. One, for instance, satisfies the most likely formula, which discards everything else apart from the template with the highest probability. Others satisfy the largest number of unique formulas, without considering their overall probability, or they satisfy several formulas that represent highest total probability. Another simply minimizes error, so the system ignores formulas with high probability of failure.

Designers can choose any one of the four criteria to preset before training and testing. Each has its own tradeoff between flexibility and risk aversion. The choice of criteria depends entirely on the task. In safety critical situations, for instance, a designer may choose to limit possibility of failure. But where consequences of failure are not as severe, designers can choose to give robots greater flexibility to try different approaches.

With the criteria in place, the researchers developed an algorithm to convert the robot’s belief — the probability distribution pointing to the desired formula — into an equivalent reinforcement learning problem. This model will ping the robot with a reward or penalty for an action it takes, based on the specification it’s decided to follow.
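The idea of turning a belief over specifications into a reward signal can be sketched with a toy example. This is not the PUnS algorithm: the formulas below are plain Python predicates standing in for LTL formulas, and the probabilities, the helper names `before` and `eventually`, and the three scoring rules are assumptions that only loosely mirror the criteria described above.

```python
def before(x, y):
    """Stand-in for an ordering constraint: x is placed before y."""
    return lambda seq: x in seq and y in seq and seq.index(x) < seq.index(y)

def eventually(x):
    """Stand-in for an eventuality constraint: x is placed at some point."""
    return lambda seq: x in seq

# Hypothetical belief: (probability, formula) pairs summing to 1.
belief = [
    (0.6, before("plate", "bowl")),
    (0.3, eventually("fork")),
    (0.1, before("fork", "plate")),
]

def reward(seq, belief, criterion="expected"):
    """Score a placement sequence against the belief under one criterion."""
    if criterion == "most_likely":
        # Keep only the single most probable formula and ignore the rest.
        _, formula = max(belief, key=lambda pf: pf[0])
        return 1.0 if formula(seq) else 0.0
    if criterion == "max_coverage":
        # Fraction of formulas satisfied, regardless of their probability.
        return sum(1.0 for _, f in belief if f(seq)) / len(belief)
    # Default: total probability mass of the satisfied formulas.
    return sum(p for p, f in belief if f(seq))

seq = ["plate", "bowl", "fork"]
r_expected = reward(seq, belief)
r_top = reward(seq, belief, "most_likely")
r_cover = reward(seq, belief, "max_coverage")
```

The same sequence scores differently under each rule, which is the flexibility-versus-risk tradeoff the designers choose between before training.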

In simulations asking the robot to set the table in different configurations, it only made six mistakes out of 20,000 tries. In real-world demonstrations, it showed behavior similar to how a human would perform the task. If an item wasn’t initially visible, for instance, the robot would finish setting the rest of the table without the item. Then, when the fork was revealed, it would set the fork in the proper place. “That’s where flexibility is very important,” Ankit Shah says. “Otherwise it would get stuck when it expects to place a fork and not finish the rest of table setup.”

Next, the researchers hope to modify the system to help robots change their behavior based on verbal instructions, corrections, or a user’s assessment of the robot’s performance. “Say a person demonstrates to a robot how to set a table at only one spot. The person may say, ‘do the same thing for all other spots,’ or, ‘place the knife before the fork here instead,’” Ankit Shah says. “We want to develop methods for the system to naturally adapt to handle those verbal commands, without needing additional demonstrations.”  

The Tentacle Bot

By Leah Burrows

Of all the cool things about octopuses (and there are a lot), their arms may rank among the coolest.

Two-thirds of an octopus’s neurons are in its arms, meaning each arm literally has a mind of its own. Octopus arms can untie knots, open childproof bottles, and wrap around prey of any shape or size. The hundreds of suckers that cover their arms can form strong seals even on rough surfaces underwater.

Imagine if a robot could do all that.

Researchers have developed an octopus-inspired robot that can grip, move, and manipulate a wide range of objects. Credit: Elias Knubben, Zhexin Xie, August Domel, and Li Wen

Researchers at Harvard’s Wyss Institute for Biologically Inspired Engineering and John A. Paulson School of Engineering and Applied Sciences (SEAS) and colleagues from Beihang University have developed an octopus-inspired soft robotic arm that can grip, move, and manipulate a wide range of objects. Its flexible, tapered design, complete with suction cups, gives the gripper a firm grasp on objects of all shapes, sizes and textures — from eggs to smartphones to large exercise balls.

“Most previous research on octopus-inspired robots focused either on mimicking the suction or the movement of the arm, but not both,” said co-first author August Domel, Ph.D., a Postdoctoral Scholar at Stanford University and former graduate student at the Wyss Institute and Harvard. “Our research is the first to quantify the tapering angles of the arms and the combined functions of bending and suction, which allows for a single small gripper to be used for a wide range of objects that would otherwise require the use of multiple grippers.”

The research is published in Soft Robotics.

The researchers began by studying the tapering angle of real octopus arms and quantifying which design for bending and grabbing objects would work best for a soft robot. Next, the team looked at the layout and structure of the suckers (yes, that is the scientific term) and incorporated them into the design.

“We mimicked the general structure and distribution of these suckers for our soft actuators,” said co-first author Zhexin Xie, Ph.D., a graduate student at Beihang University. “Although our design is much simpler than its biological counterpart, these vacuum-based biomimetic suckers can attach to almost any object.”

Xie is the co-inventor of the Festo Tentacle Gripper, which is the first fully integrated implementation of this technology in a commercial prototype.

The soft robot is controlled with two valves, one to apply pressure for bending the arm and one for a vacuum that engages the suckers. By changing the pressure and vacuum, the arm can attach to any object, wrap around it, carry it, and release it. Credit: Bertoldi Lab/Harvard SEAS

Researchers control the arm with two valves, one to apply pressure for bending the arm and one as a vacuum that engages the suckers. By changing the pressure and vacuum, the arm can attach to an object, wrap around it, carry it, and release it.

The researchers successfully tested the device on many different objects, including thin sheets of plastic, coffee mugs, test tubes, eggs, and even live crabs. The tapered design also allowed the arm to squeeze into confined spaces and retrieve objects.

“The results from our study not only provide new insights into the creation of next-generation soft robotic actuators for gripping a wide range of morphologically diverse objects, but also contribute to our understanding of the functional significance of arm taper angle variability across octopus species,” said Katia Bertoldi, Ph.D., an Associate Faculty member of the Wyss Institute who is also the William and Ami Kuan Danoff Professor of Applied Mechanics at SEAS, and co-senior author of the study.

This research was also co-authored by James Weaver from the Wyss Institute, Ning An and Connor Green from Harvard SEAS, Zheyuan Gong, Tianmiao Wang, and Li Wen from Beihang University, and Elias M. Knubben from Festo SE & Co. It was supported in part by the National Science Foundation under grant DMREF-1533985 and Festo Corporate’s project division.

Driverless shuttles: the latest from two European projects

Autonomous vehicles must be well-integrated into public transport systems if they are to take off in Europe’s cities, say researchers. Image credit – Keolis

By Julianna Photopoulos

Jutting out into the sea, the industrial port area of Nordhavn in Denmark’s capital, Copenhagen, is currently being transformed into a futuristic waterfront city district made up of small islets. It’s billed as Scandinavia’s largest metropolitan development project and, when complete, will have living space for 40,000 people and workspace for another 40,000.

At the moment, Nordhavn is only served by a nearby S-train station and bus stops located near the station. There are no buses or trains running within the development area, although there are plans for an elevated metro line, and parking will be discouraged in the new neighbourhood. This is a great opportunity for autonomous vehicles (AVs) to operate as a new public transport solution, connecting this area more efficiently, says Professor Dimitri Konstantas at the University of Geneva in Switzerland.

‘We believe that AVs will become the new form of transport in Europe,’ he said. ‘We want to prove that autonomous vehicles are a sustainable, viable and environmental solution for urban and suburban public transportation.’

Prof. Konstantas is coordinating a project called AVENUE, which aims to do this in four European cities. In Nordhavn, the team plans to roll out autonomous shuttles on a loop with six stops around the seafront. They hope to have them up and running in two years. But once in place, the Nordhavn plan may provide a glimpse of how AV-based public transportation systems could work in the future.

Prof. Konstantas envisages these eventually becoming an on-demand, door-to-door service, where people can get picked up and go where they want rather than following predetermined itineraries and bus stops.

In Nordhavn, AVENUE will test and implement an autonomous ‘mobility cloud’, currently under development, to link the shuttles with existing public transport, such as the nearby train station. An on-demand service will ultimately allow passengers to access the available transport with a single app, says Prof. Konstantas.

Integrating autonomous shuttles into the wider transport system is vital if they are to take off, says Guido Di Pasquale from the International Association of Public Transport (UITP) in Brussels, Belgium.

‘Autonomous vehicles have to be deployed as fleets of shared vehicles, fully integrated and complementing public transport,’ he said. ‘This is the only way we can ensure a sustainable usage of AVs in terms of space occupancy, traffic congestion and the environment.’

Single service

Di Pasquale points to a concept known as Mobility-as-a-Service (MaaS) as a possible model for future transport systems. This model combines both public and private transport. It allows users to create, manage and pay for trips as a single service with an online account. For example, Uber, UbiGo in Sweden and Transport for Greater Manchester in the UK are exploring MaaS to enable users to get from one destination to another by combining transport and booking it as one trip, depending on their preferred option based on cost, time and convenience.

Di Pasquale coordinates a project called SHOW, which aims to deploy more than 70 automated vehicles in 21 European cities to assess how they can best be integrated with different wider transport systems and diverse users’ needs. They are testing combinations of AV types, from shuttles to cars and buses, in real-life conditions over the next four years. During this time, he expects the project’s AVs to transport more than 1,500,000 people and 350,000 containers of goods. ‘SHOW will be the biggest ever showcase and living lab for AV fleets,’ he said.

He says that most of the cities involved have tested autonomous last-mile vehicles in the past and are keen to include them in their future sustainable urban mobility plans.

However, rolling out AVs requires overcoming city-specific challenges, such as demonstrating safety.

‘Safety and security risks have restricted the urban use of AVs to dedicated lanes and low speed — typically below 20km/h,’ explained Di Pasquale. ‘This strongly diminishes their usefulness and efficiency, as in most city environments there is a lack of space and a high cost to keep or build such dedicated lanes.’

It could also deter users. ‘For most people, a speed barely faster than walking is not an attractive solution,’ he said.

Di Pasquale hopes novel technology will make higher speeds and mixed traffic safer, and guarantee that fleets operate safely by monitoring and controlling them remotely.

Each city participating in SHOW will use autonomous vehicles in various settings, including mixed and dedicated lanes, at various speeds and in different weather conditions. For safety and regulatory reasons, all of the vehicles will have a driver present.

The objective is to make the vehicles fully autonomous, with no need for a driver, and to optimise the service so that people are encouraged to shift from car ownership to shared services, according to Di Pasquale. ‘This would also make on-demand and last-mile services sustainable in less densely populated areas or rural areas,’ he said.

Authorisation

But the technical issues of making the vehicle autonomous are only a part of the challenge.

There’s also the issue of who pays for it, says Di Pasquale. ‘AVs require sensors onboard, as well as adaptations to the physical and digital infrastructure to be deployed,’ he explained. ‘Their market deployment would require cities to drastically renew their fleets and infrastructures.’

SHOW’s pilots are scheduled to start two years from now, as each city has to prepare by obtaining the necessary permits and getting the vehicles and technology ready, says Di Pasquale.

Getting authorisation to operate in cities is one of the biggest hurdles. City laws and regulations differ everywhere, says Prof. Konstantas.

AVENUE is still awaiting city licences to test in Nordhavn, despite a national law passed on 1 July 2017 that allows AVs to be tested in public areas. The project currently has pilots running in Lyon, France, and in Luxembourg. In Geneva, the team has obtained the required licences, and the world’s first on-demand AV public transportation service will be rolled out on a 69-bus-stop circuit this summer.

AVENUE’s initial results show that cities need to make substantial investments to deploy AVs and to benefit from this technology. The legal and regulatory framework in Europe will also need to be adapted for smooth deployment of services, says Prof. Konstantas.

Both he and Di Pasquale hope their work can help convince operators and authorities to invest in fleets across Europe’s cities.

‘Depending on the willingness of public authorities, this can take up to four years until we see real, commercially sustainable AV-based public transportation services on a large scale in Europe,’ said Prof. Konstantas.

The research in this article was funded by the EU.

This post Driverless shuttles: what are we waiting for? was originally published on Horizon: the EU Research & Innovation magazine | European Commission.

Aerial Imaging Market Growth Predicted at 12% Till 2024, Revenue to Cross USD 4 Billion-Mark: Global Market Insights, Inc.

Major aerial imaging market players include Eagle View Technologies, Google, Digital Aerial Solutions, DJI, Kucera International, PrecisionHawk, Nearmap, Cooper Aerial Surveys, Getmapping, 3D Robotics, DroneDeploy, Airobotics, and GeoVantage.
