According to AWS: “AWS DeepRacer is the fastest way to get rolling with machine learning, literally. Get hands-on with a fully autonomous 1/18th scale race car driven by reinforcement learning, 3D racing simulator, and global racing league.”
Here's what I think it is: a wonderful opportunity for us to get our feet wet with reinforcement learning in a fun way.
Why learn reinforcement learning?
Reinforcement learning (RL) has many uses beyond cool robotics projects; it is increasingly applied to NLP, financial trading, and various other domains where models have traditionally been trained using supervised learning. The biggest attraction of RL is that it enables models to learn complex behaviour without labeled data.
Main building blocks of RL
Let’s take a look at how reinforcement learning works by using the AWS DeepRacer as an example.
RL is about creating a model that an agent can use to choose which actions to take in an environment in order to achieve a specific goal. To put things into context: the agent is the DeepRacer car; the environment is the race track and what the car can see through its camera; the actions are the steering and throttle used to control the car; and the goal is to complete a lap as quickly as possible without leaving the track.
How does an RL model learn?
With these key components of RL in mind, let’s see how they work together to train an RL model.
The camera at the front of the car captures images of the track ahead as the car moves, and these images represent the state of the environment. The agent (DeepRacer) uses the state (images) and the model to decide which action to take (steer left/right or speed up). When this action is applied to the environment, the agent receives a reward and an updated state (new images). The model in the agent learns from these responses, and the whole cycle repeats.
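To make the cycle concrete, here is a minimal sketch of the state, action, reward loop. Everything in it is a stand-in we made up for illustration: `ToyTrack` replaces the camera images with a single number (the car's offset from the centre line), and `choose_action` is a hard-coded placeholder for the model. It is not how DeepRacer itself is implemented.

```python
import random


class ToyTrack:
    """Hypothetical environment: the state is the car's offset from the centre line."""

    def __init__(self):
        self.offset = 0

    def reset(self):
        self.offset = 0
        return self.offset

    def step(self, action):
        # action: -1 steers left, 0 goes straight, +1 steers right
        self.offset += action
        off_track = abs(self.offset) > 2
        reward = -10.0 if off_track else 1.0
        return self.offset, reward, off_track


def choose_action(state):
    """Placeholder for the model: mostly steer back towards the centre line."""
    if random.random() < 0.2:
        return random.choice([-1, 0, 1])
    if state > 0:
        return -1
    if state < 0:
        return 1
    return 0


def run_episode(env, max_steps=50):
    """Run one episode and collect (state, action, reward, next_state) experiences."""
    state = env.reset()
    total_reward = 0.0
    experiences = []
    for _ in range(max_steps):
        action = choose_action(state)                 # agent decides from the state
        next_state, reward, done = env.step(action)   # environment responds
        experiences.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
        if done:                                      # the car left the track
            break
    return total_reward, experiences
```

In a real setup, the collected experiences would be fed back into the model so that `choose_action` improves between episodes.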
At the beginning, the untrained reinforcement model is in an exploration stage, where it chooses actions at random to explore the environment. As it trains, it notices which actions lead to better outcomes and gradually enters an exploitation stage, where it uses the knowledge it has gained to repeatedly take those actions. In practice, the speed of the transition between exploration and exploitation is a hyperparameter to tune, and we should be familiar with the trade-offs, just as with any other ML algorithm.
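One common way to express this transition is an epsilon-greedy rule with a decaying epsilon. The sketch below is only an illustration of that idea, not necessarily what DeepRacer uses internally; the function names and the default values for `start`, `end` and `decay` are our own assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


def decayed_epsilon(step, start=1.0, end=0.05, decay=0.99):
    """Shrink epsilon exponentially from start towards end.

    decay controls how quickly we move from exploration to exploitation;
    this is the hyperparameter we would tune.
    """
    return max(end, start * decay ** step)
```

A `decay` close to 1 keeps the agent exploring for longer, while a smaller value commits to the learned behaviour sooner, at the risk of settling on a poor strategy.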
Which action to take?
At each timestep, the model chooses the action that it expects to lead to the highest eventual reward. In return for choosing that action, the model receives a reward that quantifies how good or bad the outcome is. Not every action returns a reward; in some cases, an agent needs to perform a series of steps before it finally receives a +1 reward, and coping with such delayed rewards is the true advantage of reinforcement learning.
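A standard way to score a whole series of steps is the discounted cumulative reward, where later rewards count a little less than immediate ones. The discount factor `gamma` is a concept we are introducing here for illustration; the post has not defined it, and its default value below is arbitrary.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards of an episode, discounting each later step by gamma."""
    total = 0.0
    # Walk backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Two steps with no reward, then a +1 at the end.
print(discounted_return([0, 0, 1], gamma=0.5))  # 0.25
```

Even though the first two actions earned nothing on their own, they still receive credit (0.25) for leading to the final +1.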
All of these experiences are used to update the model, which learns which action in each state leads to the maximum cumulative reward. We could imagine that as a lookup table:
This forms the basis of Q-learning and the Q-table, which we will discuss in more depth in a future post. But you get the idea.
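As a rough preview, the lookup table can be sketched as a dictionary keyed by state, updated with the textbook Q-learning rule. The state names, the three actions, and the values of the learning rate `alpha` and discount `gamma` are all made up for illustration.

```python
from collections import defaultdict

ACTIONS = ["left", "straight", "right"]  # hypothetical discrete actions


def make_q_table():
    """Every unseen state starts with an estimate of 0.0 for each action."""
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def best_action(q_table, state):
    """Look up the action with the highest estimated cumulative reward."""
    return max(ACTIONS, key=lambda a: q_table[state][a])


def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Nudge Q(state, action) towards reward + gamma * max Q(next_state, .)."""
    best_next = max(q_table[next_state].values())
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])


q = make_q_table()
q_update(q, "s0", "left", reward=1.0, next_state="s1")
print(q["s0"]["left"])       # 0.5
print(best_action(q, "s0"))  # left
```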
Although maintaining this lookup table to retrieve the maximum cumulative reward for a given state seems easy, in some scenarios it’s not possible to explore every state-action combination, for example when actions are continuous, such as steering a wheel by 1 degree, 2.5 degrees, and so on.
To overcome this problem, instead of storing a value for every state-action pair, we could learn a parameterised policy directly using a family of methods called Policy Gradient, which we will talk about in a future post.
Training the DeepRacer
Let’s get back to the AWS DeepRacer. We don’t need to put the RL model physically into the DeepRacer; that would slow down the whole training/evaluation cycle.
AWS recommends using the DeepRacer console as an end-to-end platform to train, evaluate and simulate RL models.
We do not have access to the DeepRacer console, as it requires whitelisted access and many people have reported waiting a very long time for it, but we can look at some screenshots of the platform to see how it works.
The interactive training view shows the model’s ability to take actions that lead to higher cumulative rewards, alongside a simulation of the environment.
Using reward to incentivise correct driving behaviour
One thing we need to configure here is the reward function. Remember that the reward drives the choice of action: by returning an appropriate amount of reward for a particular action, we help the model learn to adjust its behaviour accordingly.
The console offers a panel where we could program the logic of the reward function in Python.
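Since we could not try the console ourselves, the following is only a sketch of what such a reward function might look like. We are assuming the console calls a function named `reward_function` with a `params` dictionary, and that it contains the keys `all_wheels_on_track`, `track_width` and `distance_from_center`; these names are from our recollection of the AWS documentation and should be checked against the current docs. The scoring logic itself is our own.

```python
def reward_function(params):
    """Reward the car for staying close to the centre line (illustrative sketch).

    Assumed inputs (verify against the DeepRacer documentation):
      all_wheels_on_track  -- bool, False once the car leaves the track
      track_width          -- float, width of the track in metres
      distance_from_center -- float, distance from the centre line in metres
    """
    # Going off the track earns almost nothing.
    if not params["all_wheels_on_track"]:
        return 1e-3

    # Scale the reward from 1.0 on the centre line down to ~0 at the edge.
    half_width = 0.5 * params["track_width"]
    closeness = 1.0 - min(params["distance_from_center"] / half_width, 1.0)
    return float(max(closeness, 1e-3))
```

A function like this only rewards staying centred; to win races we would also want to factor in speed and progress, which is where tuning the reward becomes interesting.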
Back in the diagram above, this reward function can be seen as an adjustable component of the environment module. In environments that we cannot edit, such as OpenAI Gym or Mario games, we could still adjust the amount of reward sent to the model in addition to the default rewards given by the environment, and this could be seen as a reward function as well.
Under the hood of DeepRacer Console
As geeks, we are tempted to find out what’s behind the console. In fact, the console is merely a platform that brings together a series of AWS services to facilitate the training process.
We can see that the training part (model, reward, parameters) that we saw above is powered by Amazon SageMaker, which then kicks off AWS RoboMaker to create a simulation of the environment. The trained models are stored in our S3 buckets, the first-person-view video is stored in Kinesis, and the metrics (cumulative rewards) are stored in CloudWatch, which provides the data behind the training graph.
We could also use Amazon SageMaker and RoboMaker directly, without going through the console. There are example notebooks for building models and tutorials to link these services together here.
Refer here for the latest information and league standings.
Beyond AWS DeepRacer
The realm of RL extends far beyond DeepRacer. There are tons of interesting environments available at OpenAI Gym, from classic control problems like balancing a pole to Atari games.
This is the result of a trained RL model (Deep Q-Network) tasked with getting an underpowered car to the top of the hill with the flag. We can see that in the beginning the agent (car) is exploring and receives a -1 reward for every step; once it has received a +10 reward for reaching the flag, it learns how to shape its future behaviour to reach the maximum reward.
Refer here for the expected rewards and states in this environment.
Other fun examples
In future posts, we are going to take a look at the simplest RL method, Q-learning, and gradually move on to deep learning methods (Deep Q-Network (DQN)) and some more sophisticated approaches from DeepMind papers (Actor-Critic, A2C, A3C, etc.).
RL models learn to make short-term decisions while optimising for a long-term goal.