Instructions: From the left column, select the step in the indicated move of the MICE Model that best corresponds to each group of sentences in this introduction. The feedback is shown at the bottom.
Adapted from Jens Lundell, "Dynamic movement primitives and reinforcement learning for adapting a learned skill," Master's thesis, Dept. of Elect. Eng., Aalto Univ., Espoo, Finland, 2016.
1 Introduction (steps within moves)
Move 1: Current situation. Step: ____
(1) Robots are used in human environments today more than ever before.
Move 1: Current situation. Step: ____
(2) Examples of such robots include simple vacuum-cleaning or lawn-mowing robots.
(3) Although robots have become increasingly popular, they are still mainly used in industrial settings as welders and assemblers.
(4) The reason robots excel in industrial settings is that the environment is for the most part static and the task at hand is known a priori.
(5) For this reason, the robot can be preprogrammed to precisely follow specific trajectories.
Step: ____
(6) However, in a constantly changing environment, such as in our homes, or with complex tasks, such as playing ping pong [61], the robot cannot be preprogrammed as before.
Move 1: Current situation. Step: ____
(7) Thus, for a robot to learn in these situations, one possible and frequently used solution is to learn from a human demonstrator.
(8) The concept of having a human teach the robot by interacting with it is known as Learning from Demonstration (LfD), as well as Programming by Demonstration (PbD) or imitation learning.
(9) LfD has enabled robots to carry out complex tasks, such as playing the ball-in-a-cup game [50], playing ping pong [61], and many other tasks [10, 63].
(10) For robots to acquire a movement, they must first record the demonstrated movement using either their internal sensors or an external monitoring system.
(11) Then, the responsibility shifts to the robot, which is expected to learn a reasonable representation of the movement from the recorded information.
(12) This learning is guided by a learning algorithm, which must be able to produce a representation of the movement and cope with perturbations without risking failure.
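As a concrete illustration of this record-then-learn pipeline, the sketch below fits the weights of a set of Gaussian basis functions to a recorded one-dimensional trajectory by least squares. It is a minimal, hypothetical example: the function names, basis parameters, and synthetic demonstration are assumptions made here for illustration, and a real DMP would instead fit the forcing term of a second-order dynamical system.

    import numpy as np

    # Minimal sketch: learn a movement representation from one recorded
    # demonstration by fitting Gaussian basis-function weights (illustrative
    # only; not the thesis's actual implementation).

    def fit_basis_weights(y_demo, n_basis=20):
        """Fit basis-function weights to a 1-D demonstrated trajectory."""
        T = len(y_demo)
        s = np.linspace(0.0, 1.0, T)                 # phase variable, 0 -> 1
        centers = np.linspace(0.0, 1.0, n_basis)     # basis-function centres
        width = 0.5 * n_basis ** 2                   # shared basis width
        Psi = np.exp(-width * (s[:, None] - centers[None, :]) ** 2)
        w, *_ = np.linalg.lstsq(Psi, y_demo, rcond=None)
        return w, Psi

    def reproduce(w, Psi):
        """Reproduce the movement from the learned weights."""
        return Psi @ w

    # Usage: record a (here synthetic) demonstration, learn its
    # representation, and reproduce it.
    demo = np.sin(np.linspace(0, np.pi, 200))
    weights, Psi = fit_basis_weights(demo)
    reproduction = reproduce(weights, Psi)
    print("reproduction error:", np.max(np.abs(reproduction - demo)))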
Move 1: Current situation. Step: ____
(13) LfD has attracted significant attention in recent years, with the number of papers written on the subject serving as a clear indicator [5, 60, 64].
(14) Several reasons have been identified for this increased interest [3, 53].
(15) The first reason is the reduced learning time compared to traditional approaches that require a human to program the robot off-line.
(16) Another reason is that by demonstrating the movement to the robot, the user can also predict the robot's behaviour before it executes the learned skill, as this should follow the demonstrated movement quite accurately.
(17) This, in turn, increases safety and is especially important when integrating robots into human environments.
(18) Finally, because the acquired movement is aligned with the human workspace, critical parts of the movement can be modelled very accurately.
Step: ____
(19) Despite all the advantages of LfD, problems still remain concerning how to improve and generalize the learned skill.
(20) For example, learning from a demonstration has been shown to be insufficient for mastering complex tasks, such as flipping a pancake [52] or playing the ball-in-a-cup game [50].
(21) Hence, truly mastering a learned skill requires subsequent practice.
Move 1: Current situation. Step: ____
(22) A method for improving a skill through practice has already been implemented in robotics and is known as Reinforcement Learning (RL), in which the robot tries to improve the learned movement by interacting with the environment and receiving responses in the form of rewards [48].
(23) Based on these rewards, the learned movement is adapted to increase future rewards.
(24) This resembles the way humans acquire a new skill: first, we imitate the skill performed by another person, and then we practise it to fine-tune the movement until we master it.
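One common way to realise this reward-driven adaptation in episodic tasks is to sample perturbed versions of the learned parameters, execute each of them, and move the parameters toward the samples that earned higher rewards. The sketch below illustrates that general idea only; the rollout() function, noise level, and reward-weighted update rule are assumptions for illustration, not necessarily the RL algorithm selected in the thesis.

    import numpy as np

    def reward_weighted_update(sampled_ws, rewards):
        """Average the sampled parameter vectors, weighted by their rewards."""
        rewards = np.asarray(rewards, dtype=float)
        # Softmax-style weights so the best rollouts dominate the update.
        weights = np.exp(rewards - rewards.max())
        weights /= weights.sum()
        return (np.asarray(sampled_ws) * weights[:, None]).sum(axis=0)

    def improve(w, rollout, n_iterations=50, n_samples=10, sigma=0.05):
        """Episodic improvement loop; rollout(w) is a hypothetical function
        that executes the movement encoded by w and returns a scalar reward."""
        for _ in range(n_iterations):
            samples = [w + sigma * np.random.randn(*w.shape)
                       for _ in range(n_samples)]
            rewards = [rollout(ws) for ws in samples]
            w = reward_weighted_update(samples, rewards)
        return w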
Step: ____
(25) Although reinforcement learning has become the standard approach for improving a taught skill, one problem still remains: how to safely explore new trajectories.
Move 1: Current situation. Step: ____
(26) To explore new trajectories, noise is typically added to the learned representation of the skill, resulting in a slightly new trajectory.
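A minimal sketch of this idea, under the assumption that the skill is encoded by a weight vector: the weight values, noise magnitude, and covariance below are illustrative choices only.

    import numpy as np

    rng = np.random.default_rng(0)
    w_learned = rng.standard_normal(20)   # weights of the learned skill (assumed)

    # Uncorrelated exploration: independent Gaussian noise on every weight;
    # the magnitude sigma must be chosen by hand.
    sigma = 0.05
    w_explore = w_learned + sigma * rng.standard_normal(w_learned.shape)

    # Correlated exploration: draw the noise from a full covariance so that
    # neighbouring basis functions are perturbed together, which tends to
    # keep the resulting trajectory smoother.
    idx = np.arange(len(w_learned))
    cov = sigma ** 2 * np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2)
    cov += 1e-9 * np.eye(len(w_learned))  # small jitter for numerical stability
    w_explore_corr = rng.multivariate_normal(w_learned, cov)

Either perturbed weight vector yields a slightly different trajectory when the skill is executed; choosing sigma and the correlation structure by hand is exactly the difficulty raised in the next group of sentences.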
Step: ____
(27) However, generating this noise is not a trivial task, and issues such as the magnitude of the noise and whether it should be correlated remain open questions.
Move 1: Current situation. Step: ____
(28) Consequently, various solutions have been proposed for generating noise.
(29) For example, in [49] and [50], the generated noise depends on the learned model.
Step: ____
(30) Intuitively, this makes sense; however, it does not take into consideration the constraints of the executing system, which in both cases is a robotic arm.
Step: ____
(31) Although recent studies [44] have generated noise focusing on the constraints of the actual system, no studies have yet attempted to apply this to improving a learned model represented as a Dynamic Movement Primitive (DMP).
Move 4: Your solution. Step: ____
(32) Hence, this thesis will develop a method to safely explore new trajectories for a skill modelled as a DMP.
Move 4: Your solution. Step: ____
(33) To accomplish this objective, the thesis must carry out the following tasks:
- model the skill as a DMP,
- select an RL algorithm for improving the modelled skill,
- evaluate the safety of the exploration method.
Move 4: Your solution. Step: ____
(34) To test whether the new approach for generating safe trajectories works, a sufficiently challenging real-world task is needed for experimentation.
(35) For this purpose, the thesis will use the ball-in-a-cup game, as this game can be seen as a benchmark problem in robotics [50].
(36) The game consists of a ball attached to a string, which in turn is attached to the bottom of a cup.
(37) The objective of the game is to get the ball into the cup.
(38) This is achieved by inducing a movement on the ball: the cup is quickly moved back and forth, then pulled up and moved under the ball.
(39) Although the game is simple by nature, it is not that easy to learn, as it requires fast, precise movements, and even small changes in the movement can have a drastic impact on the trajectory of the ball.
(40) The ball-in-a-cup game is of interest because a robot cannot get the ball into the cup by merely executing the movement produced by the initially learned model; subsequent RL is thus required to successfully master the skill.
Move 4: Your solution. Step: ____
(41) The rest of this thesis is organized as follows.
(42) Chapter 2 introduces and compares several state-of-the-art LfD approaches.
(43) Based on the results, DMP is chosen to model movements.
(44) Chapter 3, in turn, is devoted to explaining DMP in more detail.
(45) As reinforcement learning is used to improve the learned movement, Chapter 4 is dedicated to explaining RL.
(46) The chapter first explains the theory behind RL and then introduces state-of-the-art algorithms.
(47) Chapter 5 introduces the testbed, the hardware, and both the existing and the newly developed software supporting the experimental part of this thesis.
(48) The experiments and results are presented in Chapter 6.
(49) These are also critically discussed in the same chapter.
(50) Finally, conclusions and suggestions for future work are presented in Chapter 7.