CS 4789/5789: Introduction to Reinforcement Learning
Modern Artificial Intelligence (AI) systems often need the ability to make sequential decisions in an unknown,
uncertain, possibly hostile environment, by actively interacting with the environment to collect relevant data.
Reinforcement Learning (RL) is a general framework that can capture the interactive learning setting and
has been used to optimize generative AI models and to design intelligent agents that achieve super-human performance on
challenging tasks such as board games, computer games, and robotics.
This course focuses on the basics of Reinforcement Learning. The four main parts of the course are
(1) basics of Markov Decision Processes (MDPs), (2) planning and control in MDPs, (3) learning in MDPs, and (4) RL from human feedback.
After taking this course, students will be able to understand classic RL algorithms, their analysis, and their usage in modern AI applications.
All lectures will be math-heavy; we will go through algorithms and their analysis.
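To give a flavor of part (2), below is a minimal value iteration sketch on a toy two-state MDP. The MDP, its transition probabilities, and its rewards are made up purely for illustration; this is not part of the official course materials.

import numpy as np

# Toy 2-state, 2-action MDP (all numbers are hypothetical, for illustration only).
# P[s, a, s'] = probability of moving to state s' when taking action a in state s.
# R[s, a]     = expected immediate reward for taking action a in state s.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * (P @ V)      # Q[s, a], shape (2, 2)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("Optimal state values:", V)
print("Greedy policy (action per state):", Q.argmax(axis=1))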
Staff
Instructor: Wen Sun
Lecture time: Monday/Wednesday 2:55pm - 4:10pm ET
Instructor office hours: Thursday 3-4pm, Gates 416b
TA office hours: see Ed Discussion
Contact: Ed Discussion
Please communicate with the instructor and TAs through Ed.
Emails sent to the instructor or TAs regarding the course
will not be responded to in a timely manner.
Recorded Lectures
The course does not provide recorded videos. You can find recorded lectures from Spring 2021 here,
but they are out of date.
Prerequisites
Since lectures are math-heavy and we will focus on algorithm design and analysis, we require students to have a strong machine learning background (e.g., CS 4780). Students should be comfortable with the basics of probability and linear algebra.
Since the homework contains programming problems, we also expect students to be comfortable with programming. We will use Python as the programming language in ALL homework assignments.
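To give a sense of the expected programming comfort level, below is a minimal sketch of an environment-interaction loop of the kind used in the programming assignments. It assumes the Gymnasium package and the CartPole-v1 environment, both chosen only for illustration; the actual assignments may use different environments and tooling.

import gymnasium as gym  # assumes the Gymnasium package is installed

env = gym.make("CartPole-v1")        # illustrative environment choice
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()                          # random policy placeholder
    obs, reward, terminated, truncated, info = env.step(action)  # one environment step
    total_reward += reward
    done = terminated or truncated

print("Episode return:", total_reward)
env.close()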
Grading Policies
CS4789: Exams: 50%; Homework: 10%; Programming assignments: 40%
CS5789: Exams: 45%; Homework: 10%; Programming assignments: 35%; Paper comprehension: 10%
Undergraduates enrolled in 4789 may choose to do the paper comprehension assignments;
if completed, you will receive the higher of your grades under the two schemes above.
Homework: There will be a number of homework assignments throughout the course,
typically made available roughly one to two weeks before the due date.
The homework primarily focuses on theoretical aspects of the material and is intended to provide preparation for the exams.
Homework may be completed in groups of up to two. You are allowed two slip days per homework.
Programming: To provide hands-on learning with the methods we will discuss in class,
there are a number of programming projects throughout the course.
The projects may be completed solo or in a group of two. You are allowed two slip days per project.
Paper comprehension: Students enrolled in this course at the graduate level (i.e., enrolled in 5789)
are required to read the assigned research papers and complete the associated online quizzes.
Papers will be assigned roughly once every two to three weeks. You are allowed two slip days per quiz.
Exams: There will be two exams for this class, an evening prelim and a final exam.
Prelim:
Final:
For both the prelim and the final, we will usually arrange one (and only one) makeup exam during that week.
You are expected to be available during the prelim and final weeks (e.g., do not plan travel during those two weeks). No online exam will be given.
If you miss the prelim, you can use the final to cover the prelim. If you miss the final,
you can take an incomplete and take the exam next semester.
Diversity and Inclusiveness
While many academic disciplines have historically been dominated by one cross-section of society,
the study of and participation in STEM disciplines is a joy that the instructors hope everyone can pursue,
regardless of their socio-economic background, race, gender, etc.
We encourage students to both be mindful of these issues, and,
in good faith, try to take steps to fix them. You are the next generation here.
You should expect and demand to be treated by your classmates and the course staff with respect.
You belong here, and we are here to help you learn and enjoy this course.
If any incident occurs that challenges this commitment to a supportive and inclusive environment,
please let the instructors know so that the issue can be addressed. We are personally committed to this, and subscribe to the
Computer Science Department Values of Inclusion.
Honor Code
Collaboration is allowed only where explicitly stated.
Do not use forums like Course Hero or Chegg.
Properly cite whatever materials you use for your homework, including generative AI (e.g., GPT).
If you are unclear about whether some online material can be used, ask the instructor and TAs first.
Do not share your solutions within or outside the class at any time.
We take academic integrity extremely seriously. The above is not an exhaustive list, and in general any Cornell and common-sense rules about academic integrity apply. If it is not something we have explicitly allowed, ask us whether it is OK before you do it.
Cornell University Code of Academic Integrity,
CS Department Code of Academic Integrity.
Course Notes
The course will sometimes use a working draft of the book "Reinforcement Learning Theory and Algorithms", available here.
Note that this is an extremely advanced RL theory book; a lot of its material is beyond the scope of this class,
so we will pick very specific sections for you to read.
If you find typos or errors, please let us know. We would appreciate it!
You can also self-study the classic book "Reinforcement Learning: An Introduction", available here.
Schedule (tentative)
Date | Lecture | Reading | Slides/HW
01/22/25 | Introduction | AJKS: 1.1.1 | Slides
01/27/25 | Fundamentals: Markov Decision Processes | AJKS: 1.1.1 | Slides, Annotated Slides
01/29/25 | Fundamentals: Policy Evaluation | AJKS: 1.1.2 | Slides, Annotated Slides
02/03/25 | Fundamentals: Value Iteration | AJKS: 1.3.1 | Slides, Annotated Slides
02/05/25 | Fundamentals: Policy Iteration | AJKS: 1.3.2 | Slides, Annotated Slides
02/10/25 | Value-based Learning: Temporal Difference Learning | RL intro: 6.1 | Slides, Annotated Slides
02/12/25 | Value-based Learning: Q Learning | RL intro: 6.5 | Slides, Annotated Slides
02/19/25 | Model-based Learning: Tabular Model-based RL | | Slides, Annotated Slides
02/24/25 | ML recap: Supervised Learning, PyTorch, and Gym | |
02/26/25 | Value-based Learning: Deep Q Network (DQN) | Paper |
03/03/25 | Policy Optimization: REINFORCE | |
03/05/25 | Policy Optimization: Natural Policy Gradient | |
03/10/25 | Policy Optimization: REBEL | |
03/12/25 | Policy Optimization: Variance Reduction and Actor-Critic | |
03/17/25 | Policy Optimization: Proximal Policy Optimization (PPO) | |
03/19/25 | Midterm office hour | |
03/24/25 | Policy Optimization: TBD | |
03/26/25 | Policy Optimization: TBD | |
04/07/25 | RL for LLM: Direct Preference Optimization (DPO) | |
04/09/25 | RL for LLM: Reward modeling and online RL | |
04/14/25 | RL for LLM: Q#: solving KL-regularized RL | |
04/16/25 | Search: MCTS | |
04/21/25 | Case study: AlphaGo | |