CS 6789: Foundations of Reinforcement Learning


Modern Artificial Intelligence (AI) systems often need the ability to make sequential decisions in an unknown, uncertain, possibly hostile environment, actively interacting with the environment to collect relevant data. Reinforcement Learning (RL) is a general framework that captures this interactive learning setting and has been used to design intelligent agents that achieve superhuman performance on challenging tasks such as Go, computer games, and robotic manipulation.

This graduate-level course focuses on the theoretical and algorithmic foundations of Reinforcement Learning. The four main themes of the course are (1) fundamentals (MDPs, computation, statistics, generalization), (2) provably efficient exploration (and high-dimensional RL), (3) direct policy optimization (e.g., policy gradient methods), and (4) further topics (control, offline RL, partial observability, and RL from human feedback).

After taking this course, students will be able to understand both classic and state-of-the-art provably correct RL algorithms and their analyses, and will be prepared to conduct research on RL-related topics.

Staff


Instructor: Wen Sun

TA: Nico Espinosa Dice

Lecture time: Tuesday/Thursday 1:25-2:40pm ET

Office hours: TBD

Location: Gates 114

Contact: cornellcs6789@gmail.com.

Please communicate with the course staff only through this account. Course-related emails sent elsewhere may not receive a timely response.


Prerequisites


This is an advanced and theory-heavy course: there are no programming assignments, and students are required to work on a theory-focused course project.

Students need a strong grasp of Machine Learning (e.g., CS 4780), Probability and Statistics (e.g., BTRY 3080, ECON 3130, or MATH 4710), Optimization (e.g., ORIE 3300), and Linear Algebra (e.g., MATH 2940). The best way to assess your background is to check out HW0. Undergraduate and MEng students may enroll only with the permission of the instructor, subject to their performance on HW0.

Grading Policies


Assignments 55% (HW0: 10%; HW1-HW3: 15% each), Project 40%, Reading 5%, Participation bonus 5%

All homework will be mathematical in nature, focusing on the theory of RL and bandits; there will not be a programming component. Each homework must be submitted as a single typed PDF document (not handwritten). Passing HW0 at a satisfactory level is MANDATORY; it checks your knowledge of the prerequisites in probability, statistics, and linear algebra.

Homework Rules: Homework must be done individually: each student must understand, write, and hand in their own answers. It is acceptable for students to discuss problems with each other; it is not acceptable for students to share answers or look at another student's written answers. You must also indicate on each homework with whom you collaborated and what online resources you used.

Late days: Homework must be submitted by the posted due date. You are allowed up to 6 total LATE DAYS for the homeworks throughout the entire semester (late days do not apply to HW0 or the project reports); these are deducted automatically when an assignment is late. For example, if an assignment is up to 24 hours late, one late day is used. After your late days are used up, late penalties apply: an assignment turned in late incurs a 33% score reduction for each late day, so an assignment up to 24 hours late incurs a 33% penalty, an assignment up to 48 hours late incurs a 66% penalty, and anything later receives no credit. We will track all your late days, and any deductions will be applied when computing final grades. If you are unable to turn in homework on time, aside from the permitted late days, then do not enroll in the course.
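To make the arithmetic concrete, here is a minimal sketch of how a late submission's score could be computed under this policy. The function name, the 0-100 score scale, and the per-homework accounting of remaining free late days are illustrative assumptions, not official course software.

```python
import math

def penalized_score(raw_score, hours_late, free_days_remaining):
    """Return (adjusted_score, free_late_days_used) for one homework.

    Illustrative sketch only: assumes 6 free late days per semester are
    tracked externally and passed in via free_days_remaining.
    """
    days_late = math.ceil(hours_late / 24) if hours_late > 0 else 0
    free_used = min(days_late, free_days_remaining)
    penalized_days = days_late - free_used
    if penalized_days == 0:
        return raw_score, free_used      # on time, or covered by free late days
    if penalized_days > 2:               # more than 48 hours beyond free days
        return 0.0, free_used            # no credit
    # 33% score reduction per penalized late day (so 33% or 66%)
    return round(raw_score * (1 - 0.33 * penalized_days), 2), free_used

# Example: 30 hours late with no free late days left -> 2 late days -> 66% penalty.
print(penalized_score(100, 30, 0))  # (34.0, 0)
```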

Participation/extra-effort bonus: We encourage participation, including asking and answering questions in lectures and on Ed Discussion, as well as extra effort in reading the book chapters and lecture notes (e.g., proofreading additional chapters and sending back comments/feedback).

Reading Assignment

Please sign up for reading materials here.

The reading assignment is done individually or in a group of two. Each student or group will read an assigned chapter of the AJKS book (V3) or a research paper, and must submit a one-page report that summarizes the chapter or paper. For chapter readings, there is an additional requirement: carefully read the chapter, checking for errors, typos, and arguments or explanations that are unclear, and point these out to the instructors either on a separate page of the report or via Ed Discussion. For both chapter and paper readings, the expectation is that you check all mathematical steps; this gives you an opportunity to gain mastery of the chapter or paper you choose.

Course Project

Please see the course project page. It is a course requirement that you be in attendance for all student presentations. In addition, we ask everyone to block approximately 2 hours for each of the presentation sessions (tentatively, there will be three presentation sessions at the end of the semester).

Diversity in STEM

While many academic disciplines have historically been dominated by one cross-section of society, the study of and participation in STEM disciplines is a joy that the instructors hope everyone can pursue, regardless of their socio-economic background, race, gender, etc. The instructors encourage students both to be mindful of these issues and, in good faith, to take steps to fix them. You are the next generation here.

Course Notes: RL Theory and Algorithms

The course will be largely based on the working draft of the book "Reinforcement Learning: Theory and Algorithms", available here. If you find typos or errors, please let us know; we would appreciate it!

Honor Code

Cornell University Code of Academic Integrity, CS Department Code of Academic Integrity.

Schedule (tentative)

Date Lecture Reading Slides/HW
08/27/24 Fundamentals: Markov Decision Processes Ch. 1 Slides, Annotated Slides, HW0
08/29/24 Fundamentals: Value Iteration Ch. 1 Slides, Annotated Slides
09/03/24 Fundamentals: Policy Iteration and LP Formulation Ch. 1 Slides, Annotated Slides
09/05/24 Fundamentals: Tabular MDP with a Generative Model Ch. 2 Slides, Annotated Slides, Simulation Lemma note
09/10/24 Fundamentals: Linear Functions w/ Generative Model Ch. 3 Slides, Annotated Slides
09/12/24 Fundamentals: Linear Bellman Completeness (continued) Slides, Annotated Slides
09/17/24 Exploration: Multi-Armed Bandits (MAB) Ch. 6 Slides, Annotated Slides
09/19/24 Exploration: Tabular MDP Ch. 7 Slides, Annotated Slides
09/24/24 Exploration: Tabular MDP (continued) Ch. 7 Slides, Annotated Slides
09/26/24 Exploration: Contextual Bandits Ch. 8 Slides
10/01/24 Exploration: Linear Bandits Ch. 6 Slides
10/03/24 Exploration: Model-Free RL w/ Function Approximation Ch. 8 Slides
10/08/24 Exploration: Model-Free RL w/ Function Approximation (continued) Ch. 8 Slides
10/10/24 Policy Optimization: Policy Gradient Formulation Ch. 10 Slides
10/15/24 No class (Fall break)
10/17/24 Policy Optimization: Natural Policy Gradient Ch. 12 Slides
10/22/24 Policy Optimization: Global Optimality of PG Ch. 11 Slides
10/24/24 Policy Optimization: Global Optimality of PG and NPG Ch. 12 Slides
10/29/24 Offline RL: Fitted Q Iteration Ch. 4 Slides
10/31/24 Offline RL: Model-Based Offline RL w/ Partial Coverage Paper Slides
11/05/24 Offline RL: Model-Based Offline RL (continued) Slides
11/07/24 No class (instructor traveling)
11/12/24 Hybrid RL: Efficient RL from Both Online & Offline Data Paper Note
11/14/24 RL from Human Feedback: BT Model and REBEL Paper Slides
11/19/24 RL from Human Feedback: Direct Preference Optimization Paper Slides
11/21/24 RL from Human Feedback: Multi-Turn RLHF Paper Slides
11/26/24 Student Presentation
11/28/24 No class (Thanksgiving)
12/03/24 Student Presentation
12/05/24 Student Presentation