
Exponentially Improving at AWS Deep Racer


In the top 2% for the re:Invent 2022 Warm Up event at the time of writing

When you first start learning how to use AWS Deep Racer it’s almost inevitable you’ll go in with a complete trial and error approach. AWS Deep Racer is designed to be a friendly introduction to reinforcement learning, and the low-code approach, with hyperparameters selectable via a GUI and pre-canned Python reward functions that you can use, expand on or replace with your own, provides a genuinely low barrier to entry. But what should you do if you want to take a more scientific approach to your understanding of AWS Deep Racer and really want to take your models to the next level?

It’s all in the analysis!

If all you do is look at the graphs in the AWS Deep Racer console you’re likely to be left scratching your head as to what’s truly going on.

The reward graph in the console will certainly give you an indication of whether your model is ‘improving’ (defined here as the reward increasing over time), with a view of the average reward and the average percentage of lap completion in training and evaluation. This can be useful for getting an overall picture of whether your model is improving during the training process or reaching the utopia of convergence.

Example where a model is improving over time
Example where a model stops improving over time

Where you see a lot of zig-zagging this could indicate the model’s learning rate is too high, so it may be worth starting again and experimenting with the same reward function but with different learning rates to see which works best. Generally speaking, the more complex the track, the lower the learning rate should be.

However, beyond this generalisation the reward graph provides little further insight into your model. Crucially, it doesn’t really tell you whether the intent behind your reward function is actually translating into how the model is being rewarded. To understand that in more detail you really need to get into log analysis. There are a number of different ways you could analyse the logs; I came across this repo in the AWS Deep Racer community GitHub, set it up, and it has been a real eye-opener. There are instructions in the ReadMe on how to set it up, and I used my Windows laptop with WSL to create a Docker image.

Once I had the Jupyter notebook set up and had learnt how to use it I was able to do all sorts of analysis, which I’ve broken out into the key things I learnt in the sections below.
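To give a flavour of the kind of questions the analysis answers, here’s a minimal sketch of summarising step-level training logs with pandas. The CSV file name and the column names (episode, steps, progress, reward) are assumptions for illustration, not the exact schema the community notebook uses — the notebook itself does all of this and much more for you.

import pandas as pd

# Hypothetical export of step-level training logs: one row per step.
# Column names are assumptions, not the notebook's exact schema.
df = pd.read_csv("training-logs.csv")

# Summarise each training episode: total reward, steps taken and progress achieved
episodes = df.groupby("episode").agg(
    total_reward=("reward", "sum"),
    steps=("steps", "max"),
    progress=("progress", "max"),
)

# The interesting question: do the high-reward episodes line up with fast, complete laps?
print(episodes.sort_values("total_reward", ascending=False).head(10))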

The start point isn’t the start/finish line during training

One of the first things I looked at once the Jupyter notebook was fired up was the waypoint mapping of the tracks.

Note — I’ve seen a number of waypoint-based models on the internet, where car behaviour is ‘hardcoded’ in the reward function. These models work by telling the car to be on the left of the track at certain waypoints, or to go faster at the waypoints on the straights. Looking at some of the models in the AWS virtual races, my suspicion is that some people are using this type of model. For me this almost feels like cheating and certainly isn’t within the spirit of machine learning; I can’t imagine Tesla are programming waypoints into their cars for every road on Earth!

My hope is these models are overfit and don’t generalise well enough to translate into the real world, where wrinkles in the track, shadows from lighting, or a track not quite laid out and stretched to the exact dimensions of the virtual world may all be a factor. This type of waypoint model also won’t generalise well enough to be used on a track other than the one it was trained on. If my hope doesn’t come true you’ll find me crying on Ryan Myrehn’s shoulder in the MGM Arena in a few weeks!

Waypoints for the re:Invent 2022 track

What I did learn from looking at the waypoints for a track, along with episodes of training, was that the training doesn’t always begin from the start/finish line (waypoint 0). Therefore you cannot infer from the reward graphs in the console that a model averaging 20% completion is struggling to get around a corner 20% of the way around the track. If you watch the training videos in the console you can see episode 0 starts at the start/finish line, but as the training progresses and the car leaves the track it restarts a new episode, with 0 reward, from a point close to where it left the track. This makes sense if you think it through: if the training always started at the start/finish line you’d end up with a model that was much better at the first part of the lap than the latter part, as it wouldn’t experience the latter part of the lap as often. A quick way of checking this from the logs is sketched after the image below.

This episode started its lap at waypoint 19
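If you want to check this for yourself, one way is to look at the first step of each episode and see which waypoint the car was nearest to. This sketch reuses the step-level DataFrame from earlier and assumes it contains a closest_waypoint column (an assumption on my part — check what your logs actually expose):

# For each episode, take the row with the lowest step count (the episode start)
# and see which waypoint the car was nearest to at that moment.
start_points = (
    df.sort_values("steps")
      .groupby("episode")
      .first()["closest_waypoint"]
)
print(start_points.value_counts().head())  # far from every episode starts at waypoint 0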

Progress isn’t always rewarding

On simple tracks I’d had some success with some fairly simple models rewarding progress around the track, using a reward function not dissimilar to AWS’ sample code for steps. An example might look something like this:

def reward_function(params):
    TOTAL_NUM_STEPS = 150  # needs setting per track
    progress = params['progress']  # percentage of the lap completed
    steps = params['steps']  # steps taken so far this episode
    if progress > (steps / TOTAL_NUM_STEPS) * 100:
        reward = 1.0
    else:
        reward = 0.01
    return float(reward)

On a simple track with fairly uniform progress (re:Invent 2018, Bowtie etc.) this type of reward function works well. However, on more complex tracks I learnt that these types of models don’t work so well, due to the variable amount of progress being made depending on whether the car is on a long, fast straight or going through a series of complex turns. When analysing my models I found that if the lap didn’t start well the model never recovered to a level of progress where it would pick up the greater reward, so it would just potter around the track slowly, picking up 0.01 per step based on the above code snippet. A toy illustration of that failure mode follows below.
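Here the per-step progress figures are entirely made up, but they show the mechanics: once the car falls behind the expected pace early in the episode, the progress check may never pass again, so it only ever collects the 0.01 consolation reward.

# Toy numbers: the car loses time in an early corner sequence, then runs at
# roughly the 'expected' pace of 100 / TOTAL_NUM_STEPS percent per step afterwards.
TOTAL_NUM_STEPS = 150
per_step_progress = [0.2] * 30 + [0.65] * 120  # percent of track per step (made up)

progress, steps_rewarded = 0.0, 0
for step, delta in enumerate(per_step_progress, start=1):
    progress += delta
    if progress > (step / TOTAL_NUM_STEPS) * 100:
        steps_rewarded += 1

print(steps_rewarded)  # 0 - the early deficit is never recovered, so the reward of 1 is never earned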

I thought about varying the progress reward based on how much of the track was complete, but once I understood that each training episode starts at a variable location on the track I realised this was somewhat futile. Trying to reward progress more at the start of the lap (which in my head was always the straight on the re:Invent 2022 track) could just as easily end up rewarding the complex turns halfway around the track if a training episode happened to start from that point. Mapping progress to certain waypoints would be an option, but again, per my earlier comment, that doesn’t feel within the spirit of things and wouldn’t transfer to a well-generalised model.

Penalising behaviour

Penalising going off track is a good way of getting your model to learn to get 100% of the way around the track. However, it’s important to set the amount of penalisation to an appropriate level, which in my experience needs to be a proportion of the overall reward the model is capable of earning. Too low a proportion and the car may prioritise picking up reward on the track without being unduly influenced by the penalty for going off. Too high a penalty can make the model very risk averse, as going off track wipes out any reward picked up beforehand, the end result being a very conservative, slow lap time. A sketch of how the penalty fits into a reward function follows the figures below.

Model 1 — reward graph with penalty of -1000 for going off track
Model 1 evaluation — capable of doing hot laps but doesn’t always get around without a 3s penalty
Model 2 — reward graph with penalty of -10000 for going off track
Model 2 evaluation — much slower as it’s more cautious
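For completeness, a simplified sketch of how the penalty fits into a reward function. The -1000 matches Model 1 above, but the right value depends on how much total reward a good lap of your model can accumulate, so treat it as a starting point to tune rather than a recommendation:

def reward_function(params):
    # Simplified sketch: a flat reward for every step the car stays on the track,
    # and a large one-off penalty when it leaves it.
    if not params['all_wheels_on_track']:
        return -1000.0  # tune as a proportion of the total reward a good lap can earn
    return 1.0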

Impact of speed on reward

It may not be immediately obvious to the casual user of Deep Racer, as it wasn’t immediately apparent to me, but it’s possible that your reward function is discouraging going faster, because overall less reward is received for a faster lap. I didn’t understand this until I started doing some log analysis, having scratched my head as to why my trial and error approach didn’t result in quicker laps even as the average reward increased.

No correlation between faster lap times and reward

To understand what’s going on you really need to understand a bit more about Deep Racer and how rewards are given. The Deep Racer camera, and its virtual equivalent, captures 15 frames per second (fps) for the model to learn from. Each frame translates to a ‘step’, and the model is given a reward based on your reward function at every step (unless you code your reward function to only reward at specific steps). This means that if your Deep Racer model completes a lap in 10 seconds that lap has 150 steps where it was given reward, whereas a lap that takes 20 seconds has 300 steps.

reward = speed

So, working with a simple reward function, if the 10s lap was doing 3m/s then the total reward would be 3 x 150 = 450. If the 20s lap was going 2m/s then the total reward would be 2 x 300 = 600, so the model would actually learn that it’s better to go slower and complete the lap in 20s because the reward is higher.

def reward_function(params):
    speed = params['speed']  # speed in m/s
    if params['all_wheels_on_track']:
        reward = 2 * speed
    else:
        reward = 0.01
    return float(reward)

The reality is most reward functions combine multiple desired behaviours together. So, working with the above example and assuming the wheels always remained on the track, the 10s lap at 3m/s would have a reward of 2 x 3 x 150 = 900, whereas the 20s lap at 2m/s would have a reward of 2 x 2 x 300 = 1200.

reward = speed**2

It’s therefore necessary to reward speed more than linearly (squaring it, in this case) to improve lap times. The ‘**’ above is how you write ‘to the power of’ in Python, so the result is ‘speed squared’. In the same scenario as before, the 10s lap at 3m/s would have a reward of 3 x 3 x 150 = 1350, whereas the 20s lap at 2m/s would have a reward of 2 x 2 x 300 = 1200. Once I understood this I started to get models with a much stronger correlation between my fastest laps and my best reward laps in training. A quick side-by-side calculation follows the graph below.

Better correlation between fastest and most rewarded laps
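Putting the three variations side by side for the same hypothetical laps (10s at 3m/s giving 150 steps, 20s at 2m/s giving 300 steps) makes the difference clear:

# Total reward for two hypothetical laps under each reward shape
laps = {"10s lap at 3m/s": (3.0, 150), "20s lap at 2m/s": (2.0, 300)}

for name, (speed, steps) in laps.items():
    print(name)
    print("  reward = speed     ->", speed * steps)       # 450.0 vs 600.0  - the slower lap earns more
    print("  reward = 2 * speed ->", 2 * speed * steps)   # 900.0 vs 1200.0 - the slower lap still earns more
    print("  reward = speed**2  ->", speed ** 2 * steps)  # 1350.0 vs 1200.0 - the faster lap finally wins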

It’s all relative

The final point I’d like to share is to make sure you consider the relative reward you’re providing if the intent of your reward function is to reward multiple behaviours. If you overly reward one behaviour over another, the behaviour with the smaller reward can get ‘drowned out’ by the reward for the dominant behaviour.

def reward_function(params):
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']
    speed = params['speed']

    # Reward staying close to the centre of the track
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    else:
        reward = 1e-3  # likely crashed / close to off track

    # Then scale by speed squared to encourage going faster
    reward *= speed ** 2

    return float(reward)

In the above example the intent is to keep the car within 10% of the middle of the track and to go quickly. However the actual behaviour of the car will be different, because the last line multiplies the reward from the track position by ‘speed squared’. So going 4m/s within the middle 25% (but outside the middle 10%) would result in a reward of 0.5 x 4 x 4 = 8, whereas going 2m/s in the middle 10% will only get 1 x 2 x 2 = 4. Altering the reward levels within the if statement to values that aren’t overly dominated by the ‘speed**2’ multiplier would make the behaviour more balanced.
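One possible rebalance (my own example values, not a recommendation) is to widen the gap between the track-position rewards, say 10.0 for the middle 10% and 1.0 for the middle 25%, so the ‘speed**2’ multiplier no longer dominates. A quick per-step comparison:

# Hypothetical single-step comparison: original weights (1.0 / 0.5) versus
# rebalanced weights (10.0 / 1.0), multiplied by speed**2 in both cases.
for label, centre_reward, offcentre_reward in [("original", 1.0, 0.5), ("rebalanced", 10.0, 1.0)]:
    on_centre = centre_reward * 2 ** 2       # 2 m/s within the middle 10%
    off_centre = offcentre_reward * 4 ** 2   # 4 m/s within the middle 25%
    print(label, "- on centre:", on_centre, "off centre:", off_centre)
# original:   on centre 4.0,  off centre 8.0  - speed drowns out the centreline intent
# rebalanced: on centre 40.0, off centre 16.0 - staying central now pays off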

I hope this blog post was useful. If you’re looking for more useful resources on AWS Deep Racer I would recommend looking at this GitHub Organisation, joining the Slack Community, and reading my blog on doing DeepRacer cost-effectively with spot instances. Happy Deep Racing!
