Important Dates


Midterm report due: March 31st, 11:59pm

Project Presentations: May 2, May 4, May 9

Final report due: May 11, 11:59pm

Grading


Midterm report: 5%

Project Presentations: 15%

Final report: 20%

Reports and Presentations

Presentations: Details forthcoming.

Report Format: All reports must use the NeurIPS LaTeX format.

Midterm Report: Your report should be 2 pages maximum (not including references). Your midterm report should include the title, team members, an abstract, related work, the problem formulation, and goals.

Final Report: Your report should be 9 pages maximum (not including references). Your final report will be evaluated by the following criteria:

  • Merit: Do you have sound reasoning for the approach? Is the question well motivated, and are you taking a justifiably simple approach? If you are choosing a more complicated method, do you have sound reasoning for doing so?
  • Technical depth: How technically challenging was what you did? Did you use a package or write your own code? It is fine if you use a package, though this means other aspects of your project must be more ambitious.
  • Presentation: How well did you explain what you did, present your results, and interpret the outcomes? Did you use good graphs and visualizations? How clear was the writing? Did you justify your approach?

Project Ideas

We provide a few project ideas below. Studying existing RL theory papers and reproducing proofs is also a good option for the course project. Experiments for verifying conclusions and testing conjectures are also welcome.


Refined analysis in Tabular MDPs: Conduct a survey of a family of tabular MDP papers with tight regret bounds, e.g., Azar et al., Jin et al., Wang et al.
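
A recurring algorithmic template in these papers is optimistic value iteration with a count-based exploration bonus. The snippet below is a minimal Python sketch of that template; the Hoeffding-style bonus and the bonus_scale constant are illustrative simplifications, not the exact bonuses or constants from any of the papers above.

    import numpy as np

    def optimistic_value_iteration(counts, rewards, H, bonus_scale=1.0):
        """One planning step of a UCBVI-style algorithm (illustrative sketch only).

        counts[s, a, s2] : observed transition counts
        rewards[s, a]    : rewards in [0, 1], assumed known for simplicity
        H                : episode horizon
        """
        S, A, _ = counts.shape
        n_sa = np.maximum(counts.sum(axis=2), 1)        # visit counts N(s, a), floored at 1
        p_hat = counts / n_sa[:, :, None]               # empirical transition model
        bonus = bonus_scale * np.sqrt(1.0 / n_sa)       # Hoeffding-style bonus; constants omitted

        Q = np.zeros((H + 1, S, A))
        V = np.zeros((H + 1, S))
        for h in reversed(range(H)):
            Q[h] = rewards + bonus + p_hat @ V[h + 1]   # optimistic Bellman backup
            Q[h] = np.minimum(Q[h], H - h)              # clip at the maximum achievable value
            V[h] = Q[h].max(axis=1)
        policy = Q[:H].argmax(axis=2)                   # greedy policy for the next episode
        return policy, V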

Comparison between variants of linear MDP models: Conduct a survey of papers that assume some form of linear structure, e.g., Yang and Wang, Jin et al.

Thompson Sampling in RL: Survey Thompson sampling techniques used in RL. This is a good starting point.
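
As a warm-up, the bandit special case is easy to prototype before moving to the RL setting. The sketch below is a minimal Beta-Bernoulli Thompson sampling loop; the true_means environment is made up purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    true_means = np.array([0.3, 0.5, 0.7])   # hypothetical arm means, for illustration only
    K, T = len(true_means), 5000
    alpha, beta = np.ones(K), np.ones(K)     # Beta(1, 1) prior on each arm's mean

    for t in range(T):
        theta = rng.beta(alpha, beta)        # sample one model from the posterior
        arm = int(np.argmax(theta))          # act greedily with respect to the sample
        reward = rng.binomial(1, true_means[arm])
        alpha[arm] += reward                 # conjugate posterior update for a Bernoulli arm
        beta[arm] += 1 - reward

    print("posterior means:", alpha / (alpha + beta))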

Gittins Index: Understand and survey the Gittins index method. This is a framework for Bayes optimal learning for multi-armed bandits. Think about open questions and why extensions are difficult. This is a good starting point.

RL with Constraints: RL with convex and knapsack constraints is studied here for tabular settings. Can you extend it to non-tabular setting such as linear MDPs?

RL with Adversarial Corruption: Exploration in RL with corruption is studied here. Can you think about different attack models and study attack/defense in other RL frameworks such as policy gradient or batch RL?

Policy Gradient: Starting from the analysis of PG/NPG, can you think about how to do data-reuse in policy optimization to potentially improve its sample complexity?
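
To see where data reuse could enter, it helps to write out a vanilla on-policy update. The sketch below is a REINFORCE-style score-function estimator for a softmax policy on a toy bandit (the arm means are made up for illustration); each batch is used for exactly one gradient step and then discarded, which is the inefficiency that data reuse (e.g., via importance weighting) would target.

    import numpy as np

    rng = np.random.default_rng(0)
    true_means = np.array([0.2, 0.5, 0.8])     # hypothetical arm means, illustration only
    K, batch, iters, lr = len(true_means), 64, 200, 0.5
    theta = np.zeros(K)                        # softmax policy parameters

    for _ in range(iters):
        pi = np.exp(theta - theta.max()); pi /= pi.sum()
        arms = rng.choice(K, size=batch, p=pi)            # on-policy samples
        rewards = rng.binomial(1, true_means[arms])
        baseline = rewards.mean()                         # variance-reducing baseline
        grad = np.zeros(K)
        for a, r in zip(arms, rewards):
            score = -pi.copy(); score[a] += 1.0           # gradient of log softmax at arm a
            grad += (r - baseline) * score
        theta += lr * grad / batch                        # batch used once, then thrown away

    pi = np.exp(theta - theta.max()); pi /= pi.sum()
    print("learned policy:", np.round(pi, 3))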

Policy Gradient with Exploration: Starting from PC-PG, can you think about ways to improve its sample complexity?

Policy Gradient: Starting from this paper, can you think about how to extend the algorithm here to other linear MDP models?

Reward Free Exploration: Conduct a survey of MDP methods that do not use a reward signal. See Max-Ent exploration as a starting point.
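
Most reward-free methods share a two-phase structure: a reward-independent exploration phase that only collects transition data, followed by planning for whatever reward is revealed afterwards. The sketch below illustrates that structure on a toy chain MDP; the uniform-random exploration policy is only a placeholder for a principled strategy such as Max-Ent exploration.

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, H, episodes = 5, 2, 10, 2000

    # Toy chain MDP for illustration: action 1 tries to move right, action 0 to move left.
    def step(s, a):
        move = 1 if a == 1 else -1
        if rng.random() < 0.8:
            return int(np.clip(s + move, 0, S - 1))
        return int(np.clip(s - move, 0, S - 1))

    # Phase 1: reward-free exploration (uniform random here, a placeholder strategy).
    counts = np.zeros((S, A, S))
    for _ in range(episodes):
        s = 0
        for _ in range(H):
            a = int(rng.integers(A))
            s_next = step(s, a)
            counts[s, a, s_next] += 1
            s = s_next

    # Phase 2: a reward is revealed only now; plan on the estimated model by value iteration.
    reward = np.zeros((S, A)); reward[S - 1, :] = 1.0     # arbitrary downstream reward
    p_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)
    V = np.zeros(S)
    for _ in range(H):
        V = (reward + p_hat @ V).max(axis=1)
    print("estimated optimal values:", np.round(V, 2))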

Imitation Learning from many experts: This paper studies learning from multiple experts in the interactive learning setting. Can we do learning from multiple experts in non-interactive settings?

Online MDPs with expert advice: RL can sometimes be done in adversarial contexts. Conduct a survey of online MDP methods (in adversarial settings). See Online MDPs as a starting point. Also, comment on the connections to the NPG analysis.

Safe Control: Control barrier functions have been used for learning safe policies (e.g., this paper). Can we provide learning guarantees (e.g., regret analysis) for safe RL?

Meta Learning in RL or Imitation Learning: Can you formulate the setting of meta-learning in RL or imitation learning? Multitask RL has previously been studied in the tabular setting (e.g., this paper).

Thompson Sampling for Kernelized Nonlinear Regulator: Can you think about Thompson sampling algorithms for KNRs?

Hierarchical RL: When the action space is complex and/or the planning horizon is long, it is often considered helpful to impose a hierarchy on the action space. This has been formalized in theory using frameworks such as options, and there is some theory on the benefits of doing this here. For a reading project, survey these or other papers that focus on techniques for hierarchical RL. For a novel project, think about how to analyze learning with options and how to formalize hierarchical RL and its theoretical benefits.

Square Root T Regret in Large-Scale MDPs: Can you modify the algorithms and analyses of the Bellman rank and Witness rank papers to obtain a square-root-T regret bound?