Reinforcement Learning is a branch of Machine Learning that enables software to learn to take optimal actions in different scenarios in order to maximise cumulative reward. It is one of the three paradigms of Machine Learning, alongside Supervised and Unsupervised Learning. Reinforcement Learning is potentially the future of AI, as it enables a machine to learn in real time while interacting with its environment, without being explicitly told the dynamics of the system.
Markov Property - This property holds if the conditional probability of future states of an environment depends only on the current state and not on any of the events preceding it.
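A minimal sketch of what the Markov property means in code: the transition function below is a made-up illustration (a simple random walk), and the key point is that `step` looks only at the current state, never at the history.

```python
import random

def step(state):
    """One transition of a simple Markov chain: the next state depends
    only on the current state, not on the history of earlier states."""
    # Hypothetical transition rule, chosen just for illustration: move +1 or -1.
    return state + random.choice([-1, 1])

def rollout(start, n_steps, seed=0):
    """Generate a trajectory; note that step() never reads `history`."""
    random.seed(seed)
    state, history = start, [start]
    for _ in range(n_steps):
        state = step(state)
        history.append(state)
    return history
```

If `step` also consulted `history`, the process would no longer be Markov, and the standard RL machinery described below would not directly apply.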
In RL, we typically assume the environment is Markov and model the decision process as a Markov Decision Process. It is important to note that most Reinforcement Learning algorithms today rely on this Markov property, though there are extensions to the non-Markovian setting.
Reinforcement Learning is quite different from the conventional Supervised or Unsupervised Learning that has taken off in recent years. Unlike supervised learning, it does not need labelled data to train on.
Many aspects of Reinforcement Learning are yet to be explored, and several of its concepts have not been proven yet. Still, it is likely to take off with the advances in Deep Learning, which serves as a form of information representation in Deep Reinforcement Learning, which we talk about on this [ page].
Let us first understand what Reinforcement Learning is all about. Basically, the agent (our program) is responsible for taking optimal actions (decisions) while interacting with the environment. The agent receives some reward from the environment in response to its action, along with a new state of the environment. The agent analyzes this and, in order to maximise the cumulative reward earned, starts exploiting the knowledge about the environment that it has gained while interacting with it.
Here creeps in one of the most fundamental challenges in RL: the Exploration vs Exploitation tradeoff. The agent, seeking to maximise reward, takes the optimal action based on its current knowledge, but it may not have explored the environment enough to know which action is actually best. It is quite natural to think that taking a sub-optimal action in order to learn more about the environment is essential to optimal decision making in the future. This tradeoff really is at the core of RL. There are several strategies to tackle it, which we look into in later sections. So delve in here to get a deeper insight into the concept.
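The agent-environment loop described above can be sketched as follows. Both `ToyEnv` and `RandomAgent` are hypothetical stand-ins invented for this example, not part of any real library:

```python
import random

class RandomAgent:
    """A placeholder agent that acts uniformly at random (no learning)."""
    def act(self, state):
        return random.choice([0, 1])

class ToyEnv:
    """A made-up environment: reward 1 when the action matches the state."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        reward = 1.0 if action == self.state else 0.0
        self.state = random.choice([0, 1])   # new state for the next decision
        return self.state, reward

def run_episode(agent, env, n_steps=100, seed=0):
    """The basic interaction loop: act, observe reward and new state, repeat."""
    random.seed(seed)
    total, state = 0.0, env.state
    for _ in range(n_steps):
        action = agent.act(state)            # agent decides
        state, reward = env.step(action)     # environment responds
        total += reward                      # cumulative reward
    return total
```

A learning agent would replace `RandomAgent.act` with something that uses the observed rewards, which is exactly what the rest of this section builds toward.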
Multi-Armed Bandits
We first look at a simple setting that we build upon to get greater insights. Consider a k-armed bandit problem. At every time step, you are required to select one of k options, each of which gives you some reward, and you want to maximise the total reward over time by exploring what the actions pay as you select them stochastically or deterministically. This is similar to a slot machine with k levers. Note that the task is non-associative: the action on one time step doesn't affect the reward at any later time step. We talk about associative tasks in the next section.
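A k-armed bandit can be sketched in a few lines. The Gaussian arm means below are an arbitrary modelling assumption for illustration; the agent never sees them and must estimate them from pulls:

```python
import random

class KArmedBandit:
    """A stationary k-armed bandit: each arm pays a noisy reward around a
    fixed hidden mean. The means are hidden from the agent."""
    def __init__(self, k, seed=0):
        rng = random.Random(seed)
        # Assumed reward model: arm means drawn from a standard normal.
        self.means = [rng.gauss(0, 1) for _ in range(k)]

    def pull(self, arm):
        """Return a noisy reward for the chosen arm (unit-variance noise)."""
        return random.gauss(self.means[arm], 1.0)
```

Because the reward of a pull never depends on earlier pulls, this matches the non-associative setting described above.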
Understanding this is really fundamental to RL as it introduces two main concepts:
- Action-Value Method - Estimating the expected reward of each action in order to make informed decisions in the future.
- Incremental Updates - Updating our action-value estimates incrementally as more data is collected brings them closer to the true mean of the reward distribution. There is also a pragmatic reason for this: it is memory efficient and much easier to code.
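The incremental update in the second bullet is the standard sample-average rule, Q_{n+1} = Q_n + (R_n - Q_n)/n. A minimal sketch:

```python
def incremental_mean(rewards):
    """Incrementally estimate the mean reward: Q <- Q + (R - Q)/n.
    Only the current estimate and a count are stored, never the full
    history of rewards -- this is the memory-efficiency point above."""
    q, n = 0.0, 0
    for r in rewards:
        n += 1
        q += (r - q) / n
    return q
```

Replacing the `1/n` step size with a small constant gives the usual variant for non-stationary problems, where recent rewards should count more.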
Go through the slides to get a better intuition for the concept and its technicalities.
We looked at non-associative tasks till now, where we faced the same state every time we had to choose an action. We now move on to associative tasks. Say there are n k-armed bandits, and at any time step we face one of them at random. We then have to associate a separate set of action values with each bandit and find the best action for each of them individually. Delve into this article to get into the technical details of the concepts.
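The idea of one action-value table per context can be sketched as below. This is an assumed epsilon-greedy learner (epsilon-greedy is discussed in the next section); `pull(ctx, arm)` is a reward function supplied by the caller, standing in for whichever bandit is shown at that step:

```python
import random

def contextual_epsilon_greedy(n_contexts, k, pull, steps=1000, eps=0.1, seed=0):
    """Learn a separate action-value table per context: one of n k-armed
    bandits is faced at random on each step."""
    rng = random.Random(seed)
    q = [[0.0] * k for _ in range(n_contexts)]        # per-context values
    counts = [[0] * k for _ in range(n_contexts)]     # per-context pull counts
    for _ in range(steps):
        ctx = rng.randrange(n_contexts)               # random bandit this step
        if rng.random() < eps:
            arm = rng.randrange(k)                    # explore
        else:
            arm = max(range(k), key=lambda a: q[ctx][a])  # exploit
        r = pull(ctx, arm)
        counts[ctx][arm] += 1
        q[ctx][arm] += (r - q[ctx][arm]) / counts[ctx][arm]  # incremental update
    return q
```

Collapsing all contexts into one shared table would recover the plain bandit, which is why the associative case is a strict step up in difficulty.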
We have now progressed far enough to jump into actual RL, which is one step beyond contextual bandits. Take your time to get through these concepts first.
Exploration vs Exploitation
- Optimistic Initial Values - We start by initialising action values optimistically, i.e. to some high value. This is a good technique to encourage exploration. It is easy to see that if the initial values are too high, incremental updates actually reduce the action values, so even after exploring one action, greedy selection will choose some other action, enabling greater exploration. But once the action values are close to the true values, taking greedy actions no longer leads to exploration. So this can cause problems if the environment changes after some time, because our agent is no longer exploring.
- Upper Confidence Bound Action Selection - This is based on the idea that actions whose value estimates are more uncertain should be explored more, so that the estimates converge to the true values. It gives each action a bonus based on how few times that action has been taken and how much time has passed since it was last taken. It is somewhat computationally expensive, since we have to store the number of times each action was taken and process it every time we select an action.
- Epsilon-Greedy Strategy - With probability epsilon, we take a random action from the action space; otherwise we choose the action greedily. This enables exploration irrespective of the time step. It is also computationally efficient, as only one extra number needs to be stored, so it meets our challenges quite well. There are several variants of the Epsilon-Greedy Strategy, which we discuss in another section. It is by far one of the most used exploration strategies.
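Two of the selection rules above can be sketched compactly. The UCB form used here is the common Q(a) + c*sqrt(ln t / N(a)) bonus with an assumed exploration constant `c`; both functions are illustrations, not a definitive implementation:

```python
import math
import random

def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps pick a random action, else the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def ucb_select(q_values, counts, t, c=2.0):
    """UCB selection: value estimate plus an uncertainty bonus that shrinks
    as an action is taken more often. Untried actions are selected first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(q_values)),
               key=lambda a: q_values[a] + c * math.sqrt(math.log(t) / counts[a]))
```

Note the storage difference mentioned above: epsilon-greedy needs only `eps`, while UCB needs the per-action counts and the current time step.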