Important Dates
Midterm report due: Oct 13th 11:59pm
Project Presentations: Nov 24th, Dec 1st, Dec 3rd
Final report due: Dec 10th 11:59pm

Grading
Midterm report: 5%
Project Presentations: 15%
Final report: 25%

Reports and Presentations
Presentations: Details forthcoming.
Report Format: All reports must use the NeurIPS LaTeX format.
Midterm Report: Your report should be 2 pages maximum (not including references). It should include a title, team members, an abstract, related work, a problem formulation, and goals.
Final Report: Your report should be 9 pages maximum (not including references). Your final report will be evaluated by the following criteria:
 Merit: Do you have sound reasoning for the approach? Is the question well motivated? Are you taking a justifiably simple approach, or, if you chose a more complicated method, do you have sound reasoning for doing so?
 Technical depth: How technically challenging was what you did? Did you use a package or write your own code? It is fine if you use a package, though this means other aspects of your project must be more ambitious.
 Presentation: How well did you explain what you did, your results, and interpret the outcomes? Did you use good graphs and visualizations? How clear was the writing? Did you justify your approach?

Project Ideas
We provide a few project ideas below. Studying existing RL theory papers and reproducing proofs is also a good option for the course project. Experiments that verify conclusions or test conjectures are also welcome.
Refined analysis in Tabular MDPs: Conduct a survey of a family of tabular MDP papers with tight regret bounds, e.g., Azar et al., Jin et al., Wang et al.
Comparison between variants of linear MDP models: Conduct a survey of papers with some kind of linear structure, e.g., Yang and Wang, Jin et al.
Thompson Sampling in RL: Survey Thompson sampling techniques used in RL. This is a good starting point.
Gittins Index: Understand and survey the Gittins index method. This is a framework for Bayes-optimal learning for multi-armed bandits. Think about open questions and why extensions are difficult. This is a good starting point.
RL with Constraints: RL with convex and knapsack constraints is studied here for tabular settings. Can you extend it to non-tabular settings such as linear MDPs?
RL with Adversarial Corruption: Exploration in RL with corruption is studied here.
Can you think about different attack models and study attack/defense in other RL frameworks such as policy gradient or batch RL?
Policy Gradient: Starting from the analysis of PG/NPG, can you think about how to reuse data in policy optimization to potentially improve its sample complexity?
Policy Gradient with Exploration: Starting from PCPG, can you think about ways to improve its sample complexity?
Policy Gradient: Starting from this paper, can you think about how to extend the algorithm here to other linear MDP models?
Reward Free Exploration: Conduct a survey of MDP methods that do not use a reward signal. See MaxEnt exploration as a starting point.
Imitation Learning from many experts: This paper studies learning from multiple experts in the interactive learning setting. Can we learn from multiple experts in non-interactive settings?
Online MDPs with expert advice: Sometimes RL can be done in adversarial contexts. Conduct a survey of online MDP methods (in adversarial settings). See Online MDPs as a starting point. Also, comment on the connections to the NPG analysis.
Safe Control: Control barrier functions have been used for learning safe policies (e.g., this paper). Can we provide learning guarantees (e.g., regret analysis) for safe RL?
Meta Learning in RL or Imitation Learning: Can you formulate the setting of meta-learning in RL or imitation learning? Multitask RL has previously been studied in the tabular setting (e.g., this paper).
Thompson Sampling for Kernelized Nonlinear Regulator: Can you think about Thompson sampling algorithms for KNRs?
Hierarchical RL: When the action space is complex and/or the planning horizon is long, it is often considered helpful to impose a hierarchy on the action space. This has been formalized in theory using frameworks such as options. There is some theory on the benefits of doing this here. For a reading project, survey these or other papers that focus on techniques for hierarchical RL. For a novel project, think about how to analyze learning with options and how to formalize hierarchical RL and its theoretical benefits.
Square Root T Regret in Large Scale MDPs: Can you modify the algorithms and analyses of the Bellman rank and witness rank papers to obtain a square root T regret bound?
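
For teams considering one of the Thompson sampling ideas above, the following is a minimal sketch of Thompson sampling in the simplest setting, a Bernoulli multi-armed bandit with Beta(1, 1) priors. It is only an illustration and is not taken from any of the referenced papers. The function name thompson_sampling, the arm means, the horizon, and the seed are all assumptions chosen for the example.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Run Beta-Bernoulli Thompson sampling and return the pull count per arm.

    true_means: hypothetical success probability of each arm (unknown to the learner).
    horizon: number of rounds to play.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k  # observed rewards of 1 per arm
    failures = [0] * k   # observed rewards of 0 per arm
    pulls = [0] * k

    for _ in range(horizon):
        # Draw one sample per arm from its Beta posterior (Beta(1, 1) prior).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(k)
        ]
        # Play the arm whose sampled mean is largest.
        arm = max(range(k), key=lambda a: samples[a])
        # Simulate a Bernoulli reward from the environment.
        reward = 1 if rng.random() < true_means[arm] else 0
        # Update the posterior counts of the played arm.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1

    return pulls


# Illustrative run: three arms with assumed means 0.1, 0.5, and 0.9.
pulls = thompson_sampling([0.1, 0.5, 0.9], horizon=2000)
print(pulls)
```

In this run the arm with the highest mean should receive most of the pulls. Extending the same posterior-sampling idea to MDPs or KNRs requires a posterior over models rather than over arm means, which is where the open questions in the project ideas begin.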

