CartPole reinforcement learning training process and code using DQN and A2C networks
This content is excerpted in part from the original material, "CartPole reinforcement learning training process and code using DQN and A2C networks".
2024.04.06
Topics in this document
  • 1. CartPole environment
    OpenAI Gym's CartPole provides an environment in which a pole is attached to a cart and naturally tips toward the floor under gravity. The goal of CartPole is to keep the pole upright by moving the cart left and right, and a reinforcement learning algorithm is used so that a software agent can learn on its own how to balance the pole.
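    As a quick orientation, here is a minimal sketch of interacting with CartPole using a random policy. It assumes the gymnasium package (the maintained fork of OpenAI Gym); older gym releases return slightly different values from reset() and step().

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()                    # push the cart left or right at random
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                                # +1 for every step the pole stays up
    done = terminated or truncated

print("Return of a random policy:", total_reward)
env.close()
```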
  • 2. DQN algorithm
    A Deep Q-Network (DQN) approximates the state-action value Q with deep learning. Before DQN, the policy was built by keeping every state-action value in a table (the Q-Table), updating it, and selecting actions based on it. DQN replaces this table with a deep learning model parameterized by weights w that approximates Q. Using Experience Replay, DQN does not train the network weights immediately as an episode proceeds; instead, at every time step it stores a data set of [S (current state), A (action), R (reward), S' (next state)] and trains on that collection.
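    A minimal sketch of the Experience Replay idea described above: [S, A, R, S'] tuples (plus a done flag) are stored per time step and random mini-batches are drawn later for training. The buffer capacity and batch size here are illustrative assumptions, not values from the original material.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (S, A, R, S', done) transitions and samples random mini-batches."""

    def __init__(self, capacity=50_000):                  # assumed capacity
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):                      # assumed batch size
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```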
  • 3. DQN code
    I did not write this code myself; having taken the deep learning course in the Industrial Information Systems major and completed a project there, I naturally became interested in algorithms that combine deep learning and reinforcement learning, so I ran and analyzed the code of an actual DQN implementation. The DQN code builds two Q-networks with layers of 100, 50, and 2 neurons, stores the samples obtained from simulation in a replay buffer, draws random batches from it to obtain Q-values, takes the difference between those values and the values computed with the Bellman equation as the loss, computes gradients by backpropagation, and applies them to each network.
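    To make the description above concrete, here is a hedged sketch of the two Q-networks (100, 50, and 2 units) and of one training step: a batch is drawn from the replay buffer, Bellman targets are computed with the second (target) network, and the squared difference is backpropagated through the online network. The learning rate and discount factor are assumptions, not values taken from the analyzed code.

```python
import tensorflow as tf

def build_q_network(num_actions=2):
    # 100 -> 50 -> 2 units, matching the layer sizes described above
    return tf.keras.Sequential([
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(50, activation="relu"),
        tf.keras.layers.Dense(num_actions),
    ])

q_net, target_net = build_q_network(), build_q_network()
optimizer = tf.keras.optimizers.Adam(1e-3)                # assumed learning rate
gamma = 0.99                                              # assumed discount factor

def train_step(states, actions, rewards, next_states, dones):
    states = tf.convert_to_tensor(states, tf.float32)
    next_states = tf.convert_to_tensor(next_states, tf.float32)
    rewards = tf.convert_to_tensor(rewards, tf.float32)
    dones = tf.convert_to_tensor(dones, tf.float32)

    # Bellman target: r + gamma * max_a' Q_target(s', a'); no future value at terminal states
    targets = rewards + gamma * (1.0 - dones) * tf.reduce_max(target_net(next_states), axis=1)

    with tf.GradientTape() as tape:
        q_values = q_net(states)
        q_taken = tf.reduce_sum(q_values * tf.one_hot(actions, q_values.shape[1]), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))   # TD error as the loss
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return loss
```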
  • 4. A2C (Advantage Actor-Critic)
    A2C is an Actor-Critic method that uses the advantage as the learning target. The advantage is Q(s,a) minus V(s). In A2C the critic is trained with the advantage, because this lets the agent learn not only how good an action is in a given state but also how much better it is than expected. Actor-Critic is a hybrid reinforcement learning model combining Q-learning and policy gradient: the Actor uses a policy gradient model to decide which action to perform, and the Critic uses a Q-learning model to evaluate the action that was performed.
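    As a short illustration of the idea above, the advantage can be estimated by using the discounted return as a stand-in for Q(s,a) and subtracting the critic's value estimate V(s). This is one common estimator, assumed here for illustration rather than taken from the original material.

```python
import tensorflow as tf

def compute_advantages(rewards, values, gamma=0.99):
    """A(s_t, a_t) ~ G_t - V(s_t), with G_t the discounted return from step t onward."""
    returns, g = [], 0.0
    for r in reversed(rewards):                # accumulate discounted returns backwards
        g = r + gamma * g
        returns.insert(0, g)
    returns = tf.constant(returns, tf.float32)
    values = tf.constant(values, tf.float32)
    return returns - values                    # positive: the action was better than expected
```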
  • 5. A2C code
    In the A2C code, the actor outputs a policy value for each possible action and the critic outputs the Q-value. The expected return is computed, and a loss function that combines the actor and critic losses is used. Every step leading to the loss is executed under tf.GradientTape so that it can be differentiated automatically, and the Adam optimizer applies the gradients to the model parameters. The episode_reward, the undiscounted sum of rewards, is computed to check whether the success criterion has been met.
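    Below is a hedged sketch of the update described above: a small shared network with an actor head (one logit per action) and a critic head, a combined actor/critic loss computed under tf.GradientTape, and Adam applying the gradients. The hidden-layer size, learning rate, and the Huber loss for the critic are assumptions for illustration.

```python
import tensorflow as tf

class ActorCritic(tf.keras.Model):
    def __init__(self, num_actions=2):
        super().__init__()
        self.common = tf.keras.layers.Dense(128, activation="relu")  # assumed hidden size
        self.actor = tf.keras.layers.Dense(num_actions)              # one policy logit per action
        self.critic = tf.keras.layers.Dense(1)                       # value estimate

    def call(self, x):
        h = self.common(x)
        return self.actor(h), self.critic(h)

model = ActorCritic()
optimizer = tf.keras.optimizers.Adam(1e-2)
huber = tf.keras.losses.Huber()

def a2c_update(states, actions, returns):
    """One update from a batch of visited states, chosen actions, and expected returns."""
    with tf.GradientTape() as tape:                       # record every step leading to the loss
        logits, values = model(states)
        values = tf.squeeze(values, axis=1)
        advantages = returns - values
        log_probs = tf.nn.log_softmax(logits)
        chosen = tf.reduce_sum(tf.one_hot(actions, logits.shape[1]) * log_probs, axis=1)
        actor_loss = -tf.reduce_mean(chosen * tf.stop_gradient(advantages))
        critic_loss = huber(returns, values)
        loss = actor_loss + critic_loss                   # combined actor and critic loss
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```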
  • 6. DQN results
    Running the DQN code, the average return can be seen converging to 200.
  • 7. A2C results
    Running the A2C code, the results can be checked after 10 episodes, after 100 episodes, and after 628 episodes.
Exploring the topics with Easy AI
  • 1. CartPole environment
    The CartPole environment is a classic reinforcement learning problem that involves balancing an inverted pendulum on a moving cart. It is a widely used benchmark for evaluating reinforcement learning algorithms because of its simplicity and ease of implementation. The environment provides a continuous state space and a discrete action space, making it suitable for testing a variety of RL algorithms. The goal is to learn a policy that keeps the pendulum balanced for as long as possible by applying the appropriate force to the cart. With its straightforward setup and clear performance metrics, CartPole is a good starting point for exploring reinforcement learning concepts and a solid foundation before moving on to more complex environments.
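    As a small illustration of the continuous state space and discrete action space mentioned above, the snippet below (again assuming gymnasium) simply prints the two spaces.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)  # Box of 4 values: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): push the cart to the left or to the right
env.close()
```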
  • 2. DQN algorithm
    The Deep Q-Network (DQN) algorithm is a groundbreaking reinforcement learning technique that combines the power of deep neural networks with the principles of Q-learning. DQN has revolutionized the field of RL by demonstrating the ability to learn complex control policies directly from high-dimensional sensory inputs, such as raw pixel data from video games. The key innovations of DQN include the use of a deep neural network as the Q-function approximator, the introduction of a target network to stabilize the training process, and the incorporation of experience replay to break the correlation between consecutive samples. These advancements have enabled DQN to achieve superhuman performance on a wide range of Atari games, showcasing its ability to learn effective strategies from raw visual inputs. The success of DQN has inspired further research and development in deep reinforcement learning, leading to the emergence of various extensions and improvements, such as Double DQN, Dueling DQN, and Prioritized Experience Replay. DQN's impact on the field of RL is undeniable, as it has paved the way for more advanced and capable agents that can tackle complex real-world problems.
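    One of the innovations mentioned above, the target network, can be sketched as follows: a second copy of the Q-network provides the bootstrap targets and is refreshed only occasionally, either by a hard copy or by Polyak averaging. The update interval and the tau value are illustrative assumptions.

```python
import tensorflow as tf

def hard_update(target_net: tf.keras.Model, online_net: tf.keras.Model):
    """Copy the online weights into the target network (done every N training steps)."""
    target_net.set_weights(online_net.get_weights())

def soft_update(target_net: tf.keras.Model, online_net: tf.keras.Model, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    blended = [tau * w + (1.0 - tau) * tw
               for w, tw in zip(online_net.get_weights(), target_net.get_weights())]
    target_net.set_weights(blended)
```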
  • 3. DQN code
    The implementation of the Deep Q-Network (DQN) algorithm involves a well-structured and modular codebase that encompasses the key components of the algorithm. The typical DQN code structure includes the following main elements:
      1. Environment Interaction: code that handles the interaction with the environment, such as stepping through the environment, observing the current state, and taking actions.
      2. Neural Network Model: the definition of the deep neural network that serves as the Q-function approximator, including the network architecture, hyperparameters, and the necessary layers and activation functions.
      3. Experience Replay Buffer: code that manages the experience replay buffer, which stores the agent's experiences (state, action, reward, next state) for efficient training.
      4. Training Loop: the main training loop that iterates through the learning process, including sampling from the experience replay buffer, computing the target Q-values, updating the network weights, and updating the target network.
      5. Evaluation: code for evaluating the agent's performance, such as running episodes in the environment and tracking the cumulative rewards or other relevant metrics.
      6. Utility Functions: auxiliary functions that support the main components, such as preprocessing the input data, computing the loss function, and managing the training process.
    The DQN code should be well-documented, modular, and easy to understand, allowing for easy extensibility and integration with other RL techniques. Additionally, the code should be optimized for efficient computation, leveraging techniques like GPU acceleration and parallelization where applicable. A well-designed DQN codebase can serve as a foundation for further research and development in deep reinforcement learning, enabling researchers and practitioners to explore and experiment with various modifications and extensions of the algorithm.
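    The elements listed above can be tied together roughly as in the following skeleton. It is not the original codebase: it reuses the hypothetical build_q_network/train_step, ReplayBuffer, and hard_update sketches from earlier in this document, and the exploration schedule is an assumption.

```python
import numpy as np
import tensorflow as tf
import gymnasium as gym

def epsilon_greedy(q_net, state, epsilon, num_actions=2):
    """Action selection for the environment-interaction component (explore vs. exploit)."""
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    q = q_net(tf.convert_to_tensor([state], tf.float32))
    return int(tf.argmax(q[0]).numpy())

env = gym.make("CartPole-v1")                              # 1. environment interaction
buffer = ReplayBuffer()                                    # 3. experience replay buffer

for episode in range(500):                                 # 4. training loop
    state, _ = env.reset()
    done, episode_return = False, 0.0
    epsilon = max(0.05, 1.0 - episode / 300)               # assumed exploration schedule
    while not done:
        action = epsilon_greedy(q_net, state, epsilon)     # 2. neural network model (online Q-net)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        buffer.push(state, action, reward, next_state, done)
        state, episode_return = next_state, episode_return + reward
        if len(buffer) >= 64:
            train_step(*buffer.sample(64))                 # 6. utility: loss + gradient step
    if episode % 10 == 0:
        hard_update(target_net, q_net)                     # refresh the bootstrap targets
    print(episode, episode_return)                         # 5. evaluation metric
```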
  • 4. A2C (Advantage Actor-Critic)
    A2C (Advantage Actor-Critic) is a powerful reinforcement learning algorithm that combines the strengths of the actor-critic and advantage-based methods. It is an on-policy algorithm that learns both a policy (the actor) and a value function (the critic) simultaneously, allowing it to efficiently explore the environment and learn effective control policies. The key features of A2C include:
      1. Actor-Critic Architecture: A2C consists of two neural networks - the actor network, which learns the policy, and the critic network, which learns the value function. The actor and critic networks work together to optimize the agent's behavior.
      2. Advantage Function: A2C uses the advantage function, which measures the difference between the expected return and the current state value, to guide the policy updates. This helps the agent focus on actions that lead to higher rewards.
      3. Synchronous Updates: A2C performs synchronous updates, where the actor and critic networks are updated in parallel, ensuring that the policy and value function are consistently learned.
      4. Exploration-Exploitation Balance: A2C balances exploration and exploitation by using a combination of stochastic policy updates and value function estimates, allowing the agent to explore the environment while still exploiting the learned knowledge.
    The A2C algorithm has been successfully applied to a wide range of reinforcement learning problems, including continuous control tasks, robotics, and game-playing scenarios. It has shown strong performance compared to other on-policy algorithms, such as REINFORCE and A3C, and is often used as a baseline for evaluating more advanced RL techniques. The implementation of A2C involves the careful design and integration of the actor and critic networks, the advantage function computation, and the synchronous update process. A well-designed A2C codebase should be modular, efficient, and easy to extend, allowing researchers and practitioners to experiment with various modifications and extensions of the algorithm.
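    As a small illustration of point 4 above, exploration in A2C comes from sampling the stochastic policy rather than always taking the highest-scoring action. The sketch assumes the hypothetical ActorCritic model defined in the earlier A2C sketch.

```python
import tensorflow as tf

def sample_action(model, state):
    """Draw an action from the current policy; sampling explores, the learned logits exploit."""
    logits, value = model(tf.convert_to_tensor([state], tf.float32))
    action = tf.random.categorical(logits, num_samples=1)[0, 0]   # sample instead of argmax
    return int(action.numpy()), float(value[0, 0])
```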
  • 5. A2C code
    The implementation of the Advantage Actor-Critic (A2C) algorithm involves a structured and modular codebase that encompasses the key components of the algorithm. The typical A2C code structure includes the following main elements:
      1. Environment Interaction: code that handles the interaction with the environment, such as stepping through the environment, observing the current state, and taking actions.
      2. Actor-Critic Networks: the definition of the neural network architectures for the actor (policy) and the critic (value function), including the network structures, hyperparameters, and the necessary layers and activation functions.
      3. Advantage Computation: code that computes the advantage function, which measures the difference between the expected return and the current state value. This is a crucial component of the A2C algorithm.
      4. Training Loop: the main training loop that iterates through the learning process, including sampling from the environment, computing the advantage, updating the actor and critic networks, and managing the training process.
      5. Evaluation: code for evaluating the agent's performance, such as running episodes in the environment and tracking the cumulative rewards or other relevant metrics.
      6. Utility Functions: auxiliary functions that support the main components, such as preprocessing the input data, managing the training process, and logging the results.
    The A2C code should be well-documented, modular, and easy to understand, allowing for easy extensibility and integration with other RL techniques. Additionally, the code should be optimized for efficient computation, leveraging techniques like GPU acceleration and parallelization where applicable. A well-designed A2C codebase can serve as a foundation for further research and development in reinforcement learning, enabling researchers and practitioners to explore and experiment with various modifications and extensions of the algorithm, such as incorporating different network architectures, exploration strategies, or reward shaping techniques.
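    A minimal sketch of the evaluation element (5. above): run full episodes with the current policy, act greedily, and track the undiscounted episode reward. It again assumes gymnasium and the hypothetical ActorCritic model from the earlier sketch.

```python
import numpy as np
import tensorflow as tf
import gymnasium as gym

def evaluate(model, num_episodes=10):
    """Average undiscounted return of the current policy over a few evaluation episodes."""
    env = gym.make("CartPole-v1")
    returns = []
    for _ in range(num_episodes):
        state, _ = env.reset()
        done, episode_reward = False, 0.0
        while not done:
            logits, _ = model(tf.convert_to_tensor([state], tf.float32))
            action = int(tf.argmax(logits[0]).numpy())     # greedy action for evaluation
            state, reward, terminated, truncated, _ = env.step(action)
            episode_reward += reward
            done = terminated or truncated
        returns.append(episode_reward)
    env.close()
    return float(np.mean(returns))
```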
  • 6. A2C results
    The Advantage Actor-Critic (A2C) algorithm has demonstrated promising results in various reinforcement learning domains. Some of the key findings and results of A2C include:
      1. Stable and Consistent Performance: A2C has shown stable and consistent performance across a range of benchmark tasks, including classic control problems, continuous control tasks, and complex game environments. The on-policy nature of A2C and its use of the advantage function contribute to its robust and reliable behavior.
      2. Sample Efficiency: as an on-policy method, A2C cannot reuse past experience the way replay-based off-policy algorithms like DQN can, so it generally needs more environment interactions; in exchange, its updates are simpler and parallelize well across many environments.
      3. Continuous Control Tasks: A2C has been successfully applied to continuous control problems, such as robotic manipulation and locomotion tasks, where it has demonstrated the ability to learn complex control policies directly from high-dimensional sensory inputs.
      4. Scalability and Parallelization: the synchronous updates and the modular structure of A2C make it amenable to parallelization, allowing it to scale to larger and more complex environments. This has enabled the application of A2C to challenging multi-agent and distributed control problems.
      5. Interpretability and Explainability: the actor-critic architecture of A2C provides a level of interpretability, as the separate policy and value function networks can offer insights into the agent's decision-making process and the underlying value estimates.
      6. Limitations and Extensions: while A2C has shown strong performance, it is sensitive to hyperparameter tuning and can become unstable in certain environments. This has led to the development of extensions such as Proximal Policy Optimization (PPO) and Distributed Proximal Policy Optimization (DPPO), which aim to address these limitations and further enhance the capabilities of on-policy actor-critic algorithms.
    The results of A2C have contributed to the advancement of reinforcement learning, demonstrating the effectiveness of the actor-critic approach and the advantage-based learning paradigm. A2C has become a widely used baseline and a common starting point for further research and development in deep reinforcement learning.