Interactive Learning with Corrective Feedback for Policies based on Deep Neural Networks

Deep Reinforcement Learning (DRL) has become a powerful strategy to
solve complex decision making problems based on Deep Neural Networks (DNNs).

However, it is highly data demanding, so unfeasible in physical systems for most
applications. In this work, we approach an alternative Interactive Machine Learning (IML) strategy for training DNN policies based on human corrective feedback,
with a method called Deep COACH (D-COACH). This approach not only takes advantage of the knowledge and insights of human teachers as well as the power of
DNNs, but also has no need of a reward function (which sometimes implies the
need of external perception for computing rewards). We combine Deep Learning
with the COrrective Advice Communicated by Humans (COACH) framework, in
which non-expert humans shape policies by correcting the agent’s actions during
execution. The D-COACH framework has the potential to solve complex problems
without much data or time required.

Experimental results validated the efficiency of the framework in three different problems (two simulated, one with a real robot),with state spaces of low and high dimensions, showing the capacity to successfully learn policies for continuous action spaces like in the Car Racing and Cart-Pole problems faster than with DRL.

Introduction

Deep Reinforcement Learning (DRL) has obtained unprecedented results in decisionmaking problems, such as playing Atari games [1], or beating the world champion inGO [2].

Nevertheless, in robotic problems, DRL is still limited in applications with
real-world systems [3]. Most of the tasks that have been successfully addressed with
DRL have two common characteristics: 1) they have well-specified reward functions, and 2) they require large amounts of trials, which means long training periods
(or powerful computers) to obtain a satisfying behavior. These two characteristics
can be problematic in cases where 1) the goals of the tasks are poorly defined or
hard to specify/model (reward function does not exist), 2) the execution of many
trials is not feasible (real systems case) and/or not much computational power or
time is available, and 3) sometimes additional external perception is necessary for
computing the reward/cost function.

On the other hand, Machine Learning methods that rely on transfer of human
knowledge, Interactive Machine Learning (IML) methods, have shown to be time efficient for obtaining good performance policies and may not require a well-specified
reward function; moreover, some methods do not need expert human teachers for
training high performance agents [4–6]. In previous years, IML techniques were
limited to work with low-dimensional state spaces problems and to the use of function approximation such as linear models of basis functions (choosing a right basis
function set was crucial for successful learning), in the same way as RL. But, as
DRL have showed, by approximating policies with Deep Neural Networks (DNNs)
it is possible to solve problems with high-dimensional state spaces, without the need
of feature engineering for preprocessing the states. If the same approach is used in
IML, the DRL shortcomings mentioned before can be addressed with the support of
human users who participate in the learning process of the agent.
This work proposes to extend the use of human corrective feedback during task
execution to learn policies with state spaces of low and high dimensionality in continuous action problems (which is the case for most of the problems in robotics)
using deep neural networks.

We combine Deep Learning (DL) with the corrective advice based learning
framework called COrrective Advice Communicated by Humans (COACH) [6],
thus creating the Deep COACH (D-COACH) framework. In this approach, no reward functions are needed and the amount of learning episodes is significantly reduced in comparison to alternative approaches. D-COACH is validated in three different tasks, two in simulations and one in the real-world.

Conclusions

This work presented D-COACH, an algorithm for training policies modeled with
DNNs interactively with corrective advice. The method was validated in a problem
of low-dimensionality, along with problems of high-dimensional state spaces like
raw pixel observations, with a simulated and a real robot environment, and also
using both simulated and real human teachers.

The use of the experience replay buffer (which has been well tested for DRL) was
re-validated for this different kind of learning approach, since this is a feature not
included in the original COACH. The comparisons showed that the use of memory
resulted in an important boost in the learning speed of the agents, which were able
to converge with less feedback, and to perform better even in cases with a significant
amount of erroneous signals.

The results of the experiments show that teachers advising corrections can train
policies in fewer time steps than a DRL method like DDPG. So it was possible
to train real robot tasks based on human corrections during the task execution, in
an environment with a raw pixel level state space. The comparison of D-COACH
with respect to DDPG, shows how this interactive method makes it more feasible
to learn policies represented with DNNs, within the constraints of physical systems.
DDPG needs to accumulate millions of time steps of experience in order to obtain