It's not uncommon for people to join our DeepRacer community on Discord and ask for tips on how to improve their racing. As it's a competition, no-one is giving away all their secrets, many of which have been hard won over months or years of research, trial and error. This blog covers fairly advanced topics and therefore assumes the reader is reasonably familiar with DeepRacer, has trained some models, performed some log analysis and is looking to take their racing to the next level.
Whilst I won't be giving away everything here, I do want to share some useful observations and techniques. You'll need to do some structured experimentation, ideally changing one thing at a time and observing the results. What I will say is that you need a good combination of reward function, hyperparameters and action space. My personal opinion is that I've just listed them in priority order for making the most progress with your training; however, a bad action space (e.g. only having slow actions) will ruin your lap times, and poor hyperparameters will hamper performance too.
It should be noted that techniques optimized for virtual racing, where the top racers bias towards overfit models that give the best possible performance on the track they're trained for, are totally different to those for physical racing, where top racers bias towards well-generalized models that are not overfit.
The reward function is, in my opinion, the most important part of your DeepRacer model creation. You can have the 'best' hyperparameter and action space combination, but without a good reward function you're not going to see the results you hope for. The result of a good reward function should be lap times that reduce over time, until at some point they inevitably plateau on reaching the limit of the reward function / hyperparameters / action space combination. At that point you'll have to decide whether to clone the model and tweak something to see if further improvements can be made, or whether you should start again (note: whilst existing actions can be changed during a clone, you cannot add or remove actions, as the total number must remain the same).

The best advice I can give you whilst creating your reward function is to first think about the 'business problem' you're trying to solve. Once you've got that clear in your head, start looking at what input parameters exist to help you solve that problem. This 'top down' way of thinking can give you ideas that you might not consider with 'bottom up' thinking of simply looking at the input parameters and trying to shoehorn them into your code. You could even try putting your plain-text strategy into the AWS PartyRock application I created to give you some inspiration with your coding: https://partyrock.aws/u/markjross2/Rdjf1Rg9I/DeepRacer-Model-Builder (note - as with all GenAI the results aren't guaranteed to be accurate but should be more help than hindrance).
The next best piece of advice is to understand how the reward is calculated and what the model associates with that reward. The reward is calculated 15 times a second, i.e. at every 'step'. At each step the virtual camera takes a picture of the environment, which is converted into greyscale and reduced in size, and this picture is what gets associated with the reward, and therefore with the action that's more or less likely to be taken when the model sees that picture again in the future. Over time this leads to a weighted probability of one action being taken more frequently than another when seeing that picture. You can test this in the DeepRacer analysis repo notebook or the GenAI workshop, both of which are in the DeepRacer community GitHub.

What might not be immediately apparent is that it's therefore really important that the same (or at least a similar) reward is given when seeing the same picture, otherwise the model isn't going to learn which action is best. You might think this is a given, but it's not.
It's possible to get wildly different rewards for performing the same action when faced with the same picture, by virtue of the fact that the training doesn't always start from the same position on the track. Consider the concept of a bonus every time you complete 20% of the track, when the car's start position is moved 5% around the track for each training episode. Assuming full laps are being completed and there are 100 waypoints, then when starting at zero the model is going to get a bonus at around waypoints 20, 40, 60, 80 and 100, roughly speaking. However, when the car then starts its next episode at waypoint 5 it's going to get a bonus at waypoints 25, 45, 65, 85 and 5. It's therefore going to be very hard for the model to learn what is desirable behaviour when it gets a big reward for seeing a picture one lap, and then sees the same picture, does the same thing, but doesn't get the same reward on the next lap.

The 'self motivator' reward function I came across also suffers from this problem of inconsistent reward for the same action, because progress/steps is at the heart of the reward, and this means that the reward at the current step is highly dependent on how well the rest of the lap has gone. Start on a straight and make good progress from the start and the rest of the steps also get a higher reward; start on a corner and initial progress is slower, so the rest of the steps get a lower reward. This model therefore tends to 'oscillate' in terms of the reward it gets, and this impacts overall performance.
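To make that concrete, here's a minimal sketch (the 100-waypoint track, 20% bonus interval and start positions are just the numbers from the example above) of where the bonus lands depending on the episode's start position:

```python
# Illustrative only: a bonus every 20% of progress lands on different
# waypoints depending on where the episode starts, so the same picture
# ends up being rewarded inconsistently.
def bonus_waypoints(start_waypoint, total_waypoints=100, bonus_every_pct=20):
    step = total_waypoints * bonus_every_pct // 100
    return [(start_waypoint + i * step) % total_waypoints
            for i in range(1, 100 // bonus_every_pct + 1)]

print(bonus_waypoints(0))  # [20, 40, 60, 80, 0]
print(bonus_waypoints(5))  # [25, 45, 65, 85, 5]
# Waypoint 20 gets a big reward in the first episode but not in the second,
# even though the car sees the same picture and takes the same action.
```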
You can view ‘surrogate loss’ in the Sagemaker logs which gives you a view of what the model expected versus what it got as a reward. The smaller the value the better.
Another thing to understand from the above is that the slower the model goes around the track, the more times the reward function is going to be calculated. Consider that a 10 second lap will have 150 steps, whereas a 20 second lap will have 300 steps, so the reward function is calculated twice as many times per lap in the latter case.
Whilst the model isn't aware of the concept of lap time, it is going to learn how to maximize the reward it can get within the horizon of the discount factor you set, and maximizing the reward over each section of the track it's considering will ultimately lead to slower or faster laps depending on how you craft your reward function. A general correlation between lap time and reward is good; finding a method to force the best lap times to be the best rewarded can be better. If you're getting much larger rewards per lap for slower laps, in my experience that will result in slower laps over time.



Whilst I'm not using the centerline model for my virtual racing, I want to use it here to explain how you might encourage your model to do what you want, using GeoGebra to create the second and third graphs.
The default centerline model in the console uses bands for reward. If you're in the middle 10% of the track you get a reward of 1, if you're in the middle 50% but not the middle 10% you get 0.5, if you're within the outer 50% you get 0.1, and beyond that a tiny reward. Whilst this reward function will converge, it often leads to laps slowing down over time and the car learning to zig-zag. The reason for this is that once the model learns to stay within the middle 10% band it cannot differentiate between actions, as they're all rewarded the same. Go straight ahead, get a reward of 1; turn left down the straight but stay within the middle 10% of the track, get a reward of 1. It may also take a little longer to train because the model isn't getting a hint from a different size of reward about what it should do until it moves between the bands.
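A banded reward along these lines looks something like this (a simplified sketch of the banded approach, not the exact console sample code):

```python
def reward_function(params):
    # Simplified sketch of a banded centerline reward.
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    if distance_from_center <= 0.05 * track_width:    # middle 10% of the track
        reward = 1.0
    elif distance_from_center <= 0.25 * track_width:  # middle 50% of the track
        reward = 0.5
    elif distance_from_center <= 0.5 * track_width:   # still on the track
        reward = 0.1
    else:
        reward = 1e-3                                  # likely off track

    return float(reward)
```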
The middle graph alters this behaviour by giving a linear reward. This gives the model more instant feedback, so it should learn to go to the centerline more quickly, as well as stay near the centerline without the zig-zag behaviour, as being on the centerline is more rewarding than being off it.
The graph on the right takes this concept one step further by making the reward exponential with the distance from the center.
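For illustration, here's a sketch of both alternatives (the exact curve shape and exponent are my own choices rather than the ones plotted in the graphs):

```python
def reward_function(params):
    # Sketch of the linear and exponential centerline rewards described above.
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # 0 on the centerline, 1 at the track edge.
    distance_ratio = min(distance_from_center / (track_width / 2.0), 1.0)

    # Linear (middle graph): reward drops in a straight line from 1 to 0.
    linear_reward = 1.0 - distance_ratio

    # Exponential (right-hand graph): reward drops away much faster as the
    # car moves off the centerline; the exponent here is illustrative.
    exponential_reward = (1.0 - distance_ratio) ** 4

    return float(exponential_reward)  # or linear_reward for the middle graph
```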

You do have to be careful that you don't over-bias one element of the reward though, as if you heavily reward one element the model may virtually ignore another. One example here is a penalty for going off-track: if you give a large penalty for doing this you will likely improve your percentage of lap completions, however you may find your lap times slow down as the model learns to bias towards going slower and staying on track, because overall that's more rewarding more frequently than going faster and getting some off-tracks.
If you’re creating advanced reward functions a couple of other techniques may be of interest to you.
You can import modules into your code. In the console it's limited to math, random, numpy, scipy and shapely; however in DeepRacer for Cloud or DeepRacer on the Spot it's not limited. Simply import what you want prior to defining your reward function: -
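For example (a minimal sketch; the particular modules and the trivial reward logic are just for illustration):

```python
# Imports go above the reward function definition.
import math
import numpy as np
from shapely.geometry import Point

def reward_function(params):
    # Use the imported modules inside your reward logic, for example:
    car = Point(params['x'], params['y'])                     # car position as a shapely Point
    speed_factor = np.clip(params['speed'] / 4.0, 0.0, 1.0)   # normalised speed
    heading_rad = math.radians(params['heading'])             # heading in radians

    reward = float(speed_factor)
    return reward
```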

You can define global variables outside of your reward function. These variables can be altered by your reward function code, thereby allowing you to store things beyond each step. The example below shows storing the number of steps for a good lap time, and then referencing it in the reward function. What's not shown in this example is that it's actually possible to update the variable in your reward function too, so before returning the reward you could add 'good_laptime_steps = 140' and then the next execution of the reward function would see the variable as 140, not 150. I'm not saying that I'm using the below code in the models I'm submitting to the league, but the technique is a useful one to be aware of.
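Something along these lines (a minimal sketch; the good_laptime_steps value of 150 comes from the description above, the rest of the logic is illustrative):

```python
# Global variable defined outside the reward function - it persists between
# calls, so it can carry information from one step (or lap) to the next.
good_laptime_steps = 150  # 150 steps at 15 steps/second is roughly a 10 second lap

def reward_function(params):
    global good_laptime_steps  # only needed if you later want to update the variable

    steps = params['steps']
    progress = params['progress']

    reward = 1e-3
    # Big reward for completing the lap in fewer steps than the stored 'good' lap.
    if progress == 100 and steps < good_laptime_steps:
        reward = 100.0

    return float(reward)
```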

The hyperparameter batch size controls the number of experiences available for the model to learn from when it is updated between training iterations. These experiences are randomly sampled, and the number of times the random sampling occurs is controlled by the number of epochs. A smaller batch size means that an experience remains available to be learned from for less time than with a larger batch size, and the probability of an experience being learned from increases with a larger epoch value.

The trade-off here is time: the larger the batch size and number of epochs, the slower the model will likely learn and improve. Therefore, depending on the time horizon you're training over, you may find these settings adversely impact model performance. With a larger batch size it's possible to not even see a complete lap by an iteration where a smaller batch size had already resulted in many complete laps, for example.

The hyperparameter discount factor controls the balance between considering the immediate reward and future rewards. The maths can get a little complicated, as each future step's reward is reduced. For example, a discount factor of 0.99 and a reward at every step of 1 results in the current step's reward being considered as 1, the next step as 1 x 0.99 (i.e. 0.99), the step after that as 1 x 0.99 x 0.99 (i.e. 0.9801), and so on. The easiest way to think about discount factor is that the closer the value is to 1, the further down the track the model considers; the closer it is to 0, the less distance down the track the model considers.
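To put rough numbers on that horizon, here's a quick back-of-the-envelope calculation (a rule of thumb, not anything DeepRacer computes itself) of how many steps it takes for a future step's weight to drop below 10%:

```python
import math

# Steps until a future reward's weight (discount_factor ** n) drops below 10%.
for discount_factor in (0.9, 0.99, 0.999):
    steps = math.log(0.1) / math.log(discount_factor)
    seconds = steps / 15  # the reward is calculated 15 times a second
    print(f"{discount_factor}: ~{steps:.0f} steps (~{seconds:.1f} seconds of driving)")

# 0.9:   ~22 steps   (~1.5 seconds ahead)
# 0.99:  ~229 steps  (~15 seconds ahead)
# 0.999: ~2302 steps (~153 seconds ahead - far more than a whole lap)
```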

A very small discount factor can therefore result in a model that goes extremely quickly into a corner but is unable to stay on the track as often, whereas a very large discount factor results in a model that's slower, as it gives too much weight to future rewards and therefore the difference in reward values at the current step doesn't carry enough weight. In my personal experience discount factor wins the award for the worst default value AWS chose for the console (0.999), and I'd be surprised if anyone is having success with anything beyond 0.99!

Learning rate controls how much the model is updated between training iterations. If you consider the bottom of the gradients below to be the optimum, then you can see the impact of learning rates of different sizes, where you might never get to the optimal result.

Even with a large batch size and a high number of epochs the model won’t improve much with a tiny learning rate. Conversely with a large learning rate the model could be updated too quickly, lurching from taking one action to another and that can impact lap times adversely too.
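A toy illustration of that effect (plain gradient descent on a simple curve, nothing DeepRacer-specific) shows a tiny learning rate barely moving and a large one overshooting:

```python
# Toy gradient descent on f(x) = x^2 (optimum at x = 0), purely to illustrate
# the effect of learning rate size.
def descend(learning_rate, start=10.0, iterations=20):
    x = start
    for _ in range(iterations):
        gradient = 2 * x            # derivative of x^2
        x -= learning_rate * gradient
    return x

print(descend(0.001))  # ~9.6  - tiny learning rate, barely moved towards the optimum
print(descend(0.1))    # ~0.1  - reasonable learning rate, close to the optimum
print(descend(1.05))   # ~67   - too large, every update overshoots and it diverges
```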

There are two supported training algorithms (PPO and SAC). I've only done limited testing with SAC and have kept to using PPO. The SAC algorithm only supports a single Robomaker worker, so this constrains training improvements versus the parallelism that can be achieved with PPO.
There are two loss types (Huber and Mean Squared Error). I've only done limited testing with Mean Squared Error, which delivered less optimal results than Huber.
Entropy controls the amount of randomness in the model. When you start training, entropy is high as the model doesn't know what to do, and it reduces over time as the model becomes more certain about what action to take when confronted with a particular picture. The value for entropy can be seen in the Sagemaker logs during the policy update that occurs between each training iteration. You can inject more or less entropy by altering the entropy hyperparameter, although the default setting for this hyperparameter is fairly good in my experience.
Number of episodes is another important hyperparameter to consider, as it controls the number of attempted laps between updates to the model. In the console this isn't too taxing to consider: you have one Robomaker worker and it'll perform all the episodes and then update. However, if you're using the community provided DeepRacer for Cloud (configure yourself on-premise, or on AWS, Azure or GCP) or DeepRacer on the Spot (an optimized DRfC wrapper for training on AWS Spot Instances), then you need to do some calculations to make sure that the number of episodes each Robomaker worker completes and your value for DR_TRAIN_ROUND_ROBIN_ADVANCE_DIST get the car all the way back to the start line by the end, otherwise your evaluations will take place at points on the track that are not the start/finish line and it'll be very hard for you to tell if your model is improving over time. The formula to use is: -
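(number of episodes / number of Robomaker workers) x DR_TRAIN_ROUND_ROBIN_ADVANCE_DIST = 1.0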

So, for example, 20 episodes with 2 Robomaker workers and a DR_TRAIN_ROUND_ROBIN_ADVANCE_DIST of 0.1 will result in each worker doing 10 episodes, advancing 10% of the track each time it starts an episode, and once it gets to the end of its episodes it'll have advanced 100% of the way around the track, so the evaluations will start from the start line.
The action space defines the combination of steering angles and speeds that the model can take. With a continuous action space you select the steering angle and speed ranges, e.g. -30 to 30 degrees and 0.5 m/s to 4 m/s: -
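A sketch of how such a continuous range might be represented (this mirrors the structure of the model_metadata.json used by DeepRacer for Cloud, but treat the exact keys as illustrative):

```python
# Illustrative representation of a continuous action space: just the ranges.
continuous_action_space = {
    "action_space_type": "continuous",
    "action_space": {
        "steering_angle": {"low": -30.0, "high": 30.0},  # degrees
        "speed": {"low": 0.5, "high": 4.0},              # metres per second
    },
}
```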

With a discrete action space you set specific actions, e.g. 0 degrees at 4 m/s, 30 degrees at 1 m/s, -30 degrees at 1 m/s, etc.: -
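And a sketch of the discrete equivalent, using the example actions above (again, the exact representation is illustrative):

```python
# Illustrative discrete action space: each entry is one specific
# steering angle / speed combination the model can choose.
discrete_action_space = [
    {"steering_angle": 0.0,   "speed": 4.0},  # fast, straight ahead
    {"steering_angle": 30.0,  "speed": 1.0},  # slow, hard left
    {"steering_angle": -30.0, "speed": 1.0},  # slow, hard right
    # ... typically many more angle/speed combinations ...
]
```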

Logic suggests the continuous action space should be better, because it has a near-infinite combination of speeds and steering angles. However, the reality is that the continuous action space poses two challenges: one is simply the time it would take to find the perfect actions among the near-infinite possibilities, and the second is that without somehow shaping the reward / hyperparameters it can be difficult to get a model to enter corners at a sensible speed in order to avoid going off-track.
It's therefore easier to control what the model does with discrete actions, for example by only giving fast actions at narrow angles and slower actions at wider angles, finding steering angle and speed combinations that avoid, or minimize, the chance of spinning. Finding the balance between some power sliding / drifting and avoiding spinning is, I think, what has elevated the performance of those at the top of the league in 2024.
Hopefully the above is useful in improving your own DeepRacer journey. Don’t forget to use the great community resources to analyze and improve your models, it’ll lead to better performance!