National Institute of Technology Karnataka, Surathkal

Navigation using Reinforcement Learning


The aim is to train a maze-solving agent in a 3D Unity environment. To do this, we first build a 3D maze in Unity to serve as the environment in which the agent learns to navigate. The goal is for the agent to learn to navigate efficiently to a target object that is spawned at a random location in the maze. The agent needs an action space that lets it move freely through the entire maze, and the observations it collects from the environment must be sufficient for it to learn this behavior.


Since reinforcement learning is the core of the project, we need a suitable RL algorithm to learn the required behavior. We use the Proximal Policy Optimization (PPO) implementation provided with the Unity ML-Agents Toolkit, which is well suited to this task.
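For readers unfamiliar with PPO, the sketch below computes the standard clipped surrogate objective that gives the algorithm its name. It is only an illustration of the general technique, not the ML-Agents implementation; the function name `ppo_clipped_objective` and the clip range of 0.2 are assumptions made for this example.

```python
import math

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity PPO tries to maximize).

    clip_eps is an illustrative value, not one taken from the project.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the old policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to move the policy
        # far from the old one in a single update.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage; the clipping only takes effect once the policy starts to drift.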


● Unity
● Unity ML-Agents Toolkit (
● Python environment with the ml-agents dependencies installed (TensorFlow environment)


The first step was to build an environment in Unity that meets the criteria set out in the problem statement. We built a simple maze consisting of a base and walls; the walls act as obstacles that prevent the agent from reaching the target easily. We also had to decide on the agent and its capabilities. The agent is a cube with a simple action space of four movements (Up, Down, Left and Right), which is enough for it to navigate the maze. A stationary cylindrical object serves as the target.
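The four-action space described above can be sketched as follows. This is a simplified grid-world stand-in written in Python, not the Unity agent script: the `ACTIONS` table, the `step` function, and the grid representation of the maze are all hypothetical and exist only to illustrate how a discrete action index maps to a movement.

```python
# Hypothetical mapping from a discrete action index to a unit move.
ACTIONS = {
    0: (0, 1),    # Up
    1: (0, -1),   # Down
    2: (-1, 0),   # Left
    3: (1, 0),    # Right
}

def step(position, action, walls, width, height):
    """Return the agent's new cell after taking `action`.

    `walls` is a set of blocked (x, y) cells. A move into a wall or off
    the base leaves the agent where it was.
    """
    dx, dy = ACTIONS[action]
    x, y = position
    candidate = (x + dx, y + dy)
    inside = 0 <= candidate[0] < width and 0 <= candidate[1] < height
    if not inside or candidate in walls:
        return position
    return candidate
```

For example, `step((0, 0), 3, set(), 3, 3)` moves the agent one cell to the right, while the same call with `walls={(1, 0)}` leaves it in place.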

The next step was to decide which observations the agent collects from its surroundings. These observations are the input to the PPO algorithm from which the model learns. Initially we used only the position vectors of the agent and the target as observations. This failed because the trained model had no way of identifying obstacles (walls) on its path to the target. To overcome this, we needed additional observations, and we chose the raycast observations provided by Unity. Raycasting projects rays of a defined maximum length in different directions from the agent. It lets the agent perceive its surroundings by reporting whether each ray hits an obstacle and, if so, at what distance. With raycast observations added, the PPO algorithm had enough information to learn a model that reliably navigates to the target.
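The idea behind the raycast observations can be sketched in a few lines of Python. This is a simplified 2D illustration under stated assumptions, not Unity's ray sensor: the maze is a set of unit wall cells, and the names `cast_ray` and `ray_observations`, the ray count, the maximum length, and the marching step size are all hypothetical.

```python
import math

def cast_ray(origin, angle_deg, walls, max_length, step_size=0.05):
    """March along a ray and return (hit, normalized_distance).

    hit is 1.0 if a wall cell is reached within max_length, else 0.0.
    The distance is scaled to [0, 1]; a miss reports 1.0.
    """
    ox, oy = origin
    dx = math.cos(math.radians(angle_deg))
    dy = math.sin(math.radians(angle_deg))
    steps = int(max_length / step_size)
    for i in range(1, steps + 1):
        dist = i * step_size
        cell = (math.floor(ox + dx * dist), math.floor(oy + dy * dist))
        if cell in walls:
            return 1.0, dist / max_length
    return 0.0, 1.0

def ray_observations(origin, walls, max_length=5.0, num_rays=8):
    """Cast evenly spaced rays and flatten the results into one vector."""
    obs = []
    for k in range(num_rays):
        hit, dist = cast_ray(origin, 360.0 * k / num_rays, walls, max_length)
        obs.extend([hit, dist])
    return obs
```

In a setup like this, the resulting vector would be combined with the agent and target positions, so the policy receives both where the target is and which directions are blocked.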

To train the model, we had to design a reward scheme that encourages the agent to learn the right behavior. The scheme we settled on is as follows. The agent starts each episode with a fixed positive reward. It then receives a small penalty at regular time intervals, which encourages it to solve the maze as quickly as possible. It receives a larger penalty if it sticks to a wall, and this penalty continues for as long as the agent remains in contact with the wall. With the observations and reward function described above, we trained the model using the PPO algorithm provided with the Unity ML-Agents Toolkit. In addition, we applied the DQN algorithm to a Flappy Bird environment to understand how that algorithm works and how it is implemented.
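The reward scheme described above can be sketched as a small Python function. The numeric values are illustrative assumptions, since the write-up does not state them, and the sketch applies the time penalty once per step as a stand-in for "regular time intervals"; the names `START_REWARD`, `TIME_PENALTY`, `WALL_PENALTY` and `episode_return` are hypothetical.

```python
# Illustrative values only; the actual magnitudes are not given in the write-up.
START_REWARD = 1.0     # fixed positive reward at the start of each episode
TIME_PENALTY = 0.001   # small penalty applied at every time step
WALL_PENALTY = 0.01    # larger penalty applied while touching a wall

def episode_return(wall_contacts):
    """Accumulate the reward over one episode.

    `wall_contacts` holds one boolean per time step: True if the agent
    is in contact with a wall during that step.
    """
    total = START_REWARD
    for touching_wall in wall_contacts:
        total -= TIME_PENALTY          # pressure to finish quickly
        if touching_wall:
            total -= WALL_PENALTY      # repeated for every step of contact
    return total
```

Under these assumed values, an agent that reaches the target in 10 steps without touching a wall keeps a return of about 0.99, while each step spent against a wall costs eleven times as much as an ordinary step.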


We succeeded in training a model that reliably takes the agent to the target. Although the agent occasionally appears unsure of which action to take, given enough time it almost always reaches the randomly spawned target.
We also successfully applied the DQN algorithm to the Flappy Bird game environment.



This project was done mainly for learning purposes.


Through this project we learned how reinforcement learning algorithms work and how to apply them, using the TensorFlow framework, to an environment we created on the Unity platform. We also applied a reinforcement learning algorithm (DQN) to the Flappy Bird environment.


We achieved the learning outcomes we set for this project.



● Chaitany Pandiya (
● Karn Tiwari (
● K Rahul Reddy (
● Rajath Aralikatti (