Theoretical Foundations of Reinforcement Learning
@ ICML 2020
July 17, 2020


In solidarity with #ShutDownSTEM, the organizing committee of the ICML 2020 Workshop on the Theoretical Foundations of Reinforcement Learning has moved the paper submission deadline to June 13, midnight UTC, to encourage all submitting authors to participate in the strike. We grieve the deaths of George Floyd, Breonna Taylor, Ahmaud Arbery, David McAtee, and the thousands of Black people lost to police brutality, which is but one part of systemic inequality. Black lives matter -- a statement whose very necessity betrays the tragedy, and the urgency, of our present conditions.


In many settings, such as education, healthcare, drug design, robotics, transportation, and achieving better-than-human performance in strategic games, it is important to make decisions sequentially. This poses two interconnected algorithmic and statistical challenges: effectively exploring to learn information about the underlying dynamics, and effectively planning using this information. Reinforcement Learning (RL) is the main paradigm that tackles both of these challenges simultaneously, which is essential in the aforementioned applications. In recent years, reinforcement learning has seen enormous progress, both in solidifying our understanding of its theoretical underpinnings and in applying these methods in practice.

This workshop aims to highlight recent theoretical contributions, with an emphasis on addressing significant challenges on the road ahead. Such theoretical understanding is important for designing algorithms that have robust and compelling performance in real-world applications. As part of the ICML 2020 conference, this workshop will be held virtually. It will feature keynote talks from six reinforcement learning experts, each tackling a different significant facet of RL. It will also offer the opportunity for contributed material (see the call for papers and our outstanding program committee below). The authors of each accepted paper will prerecord a 10-minute presentation and will also appear in a poster session. Finally, the workshop will include a panel discussing important challenges on the road ahead.


6:30 am - 7:15 am PDT Exploration, Policy Gradient Methods, and the Deadly Triad - Sham Kakade (Invited Talk)
7:20 am - 8:05 am PDT A Unifying View of Optimism in Episodic Reinforcement Learning - Gergely Neu (Invited Talk)
8:10 am - 9:25 am PDT Early Poster Session (check below for zoom links)
9:30 am - 10:25 am PDT Speaker Panel
10:30 am - 11:25 am PDT An Off-policy Policy Gradient Theorem: A Tale About Weightings - Martha White (Invited Talk)
11:30 am - 12:50 pm PDT Short contributed talks - Kwang-Sung Jun, Sean Sinclair, Omar Darwiche Domingues, Edgar Minasyan, Tiancheng Yu, Kush Bhatia
1:00 pm - 2:15 pm PDT Late Poster Session (check below for zoom links)
2:20 pm - 3:05 pm PDT Representation learning and exploration in reinforcement learning - Akshay Krishnamurthy (Invited Talk)
3:10 pm - 3:55 pm PDT Learning to price under the Bass model for dynamic demand - Shipra Agrawal (Invited Talk)
4:00 pm - 4:45 pm PDT Efficient Planning in Large MDPs with Weak Linear Function Approximation - Csaba Szepesvari (Invited Talk)

Keynote Speakers

Shipra Agrawal

Assistant Professor
Columbia University

Sham Kakade

University of Washington

Akshay Krishnamurthy

Principal Researcher
Microsoft Research NYC

Gergely Neu

Research Assistant Professor
Universitat Pompeu Fabra

Csaba Szepesvari

University of Alberta / DeepMind

Martha White

Assistant Professor
University of Alberta

Contributed Papers

Early Poster Session

  • (10) Provable Hierarchical Imitation Learning via EM
    Zhiyu Zhang, Ioannis Paschalidis
    [video] [zoom]
  • (12) Multi-Task Reinforcement Learning as a Hidden-Parameter Block MDP
    Amy Zhang, Shagun Sodhani, Khimya Khetarpal, Joelle Pineau
    [arXiv] [zoom]
  • (14) Sample Complexity of Estimating the Policy Gradient for Nearly Deterministic Dynamical Systems
    Osbert Bastani
    [arXiv] [zoom]
  • (15) If MaxEnt RL is the Answer, What is the Question?
    Benjamin Eysenbach, Sergey Levine
    [arXiv] [video] [zoom]
  • (17) Online Markov Decision Processes with Max-Min Fairness
    Wang Chi Cheung
  • (24) Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium
    Qiaomin Xie, Yudong Chen, Zhaoran Wang, Zhuoran Yang
    [arXiv] [zoom]
  • (28) Power-Constrained Bandits
    Jiayu Yao, Emma Brunskill, Weiwei Pan, Susan Murphy, Finale Doshi-Velez
    [arXiv] [zoom]
  • (35) Adaptive Reward-Free Exploration
    Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, Michal Valko
    [arXiv] [video] [zoom]
  • (38) Near-Optimal Reinforcement Learning with Self-Play
    Yu Bai, Chi Jin, Tiancheng Yu
    [arXiv] [video] [zoom]
  • (40) Reinforcement Learning with Feedback Graphs
    Christoph Dann, Yishay Mansour, Mehryar Mohri, Ayush Sekhari, Karthik Sridharan
    [arXiv] [zoom]
  • (41) Learning Implicit Credit Assignment for Multi-Agent Actor-Critic
    Meng Zhou, Ziyu Liu, Pengwei Sui, Yixuan Li, Yuk Ying Chung
    [arXiv] [video] [zoom]
  • (42) Refined Analysis of FPL for Adversarial Markov Decision Processes
    Yuanhao Wang, Kefan Dong
  • (47) A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces
    Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko
    [arXiv] [video] [zoom]
  • (50) Provably Efficient Exploration for Reinforcement Learning with Unsupervised Learning
    Fei Feng, Ruosong Wang, Wotao Yin, Simon Shaolei Du, Lin Yang
    [arXiv] [video] [zoom]
  • (52) Multi-Armed Bandits with Correlated Arms
    Samarth Gupta, Shreyas Chaudhari, Gauri Joshi, Osman Yagan
    [arXiv] [video] [zoom]
  • (54) TDprop: Does Jacobi Preconditioning Help Temporal Difference Learning?
    Joshua Romoff, Peter Henderson, David Kanaa, Emmanuel Bengio, Ahmed Touati, Pierre-Luc Bacon, Joelle Pineau
    [arXiv] [zoom]
  • (55) Sharp Analysis of Smoothed Bellman Error Embedding
    Ahmed Touati, Pascal Vincent
    [arXiv] [zoom]
  • (60) Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition
    Tiancheng Jin, Haipeng Luo
    [arXiv] [zoom]
  • (62) Adaptive Discretization for Model-Based Reinforcement Learning
    Sean R. Sinclair, Tianyu Wang, Gauri Jain, Sid Banerjee, Christina Yu
    [arXiv] [zoom]
  • (64) Exploration-Exploitation in Constrained MDPs
    Yonathan Efroni, Shie Mannor, Matteo Pirotta
    [arXiv] [zoom]
  • (66) Learning the Linear Quadratic Regulator from Nonlinear Observations
    Zakaria Mhammedi, Dylan J Foster, Max Simchowitz, Dipendra Misra, Wen Sun, Akshay Krishnamurthy, Alexander Rakhlin, John Langford
  • (69) Finite-Sample Analysis of Stochastic Approximation Using Smooth Convex Envelopes
    Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, Karthikeyan Shanmugam
    [arXiv] [video] [zoom]
  • (81) Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits under Realizability
    David Simchi-Levi, Yunzong Xu
    [arXiv] [zoom]
  • (83) Set-Invariant Constrained Reinforcement Learning with a Meta-Optimizer
    Chuangchuang Sun, Dong-Ki Kim, Jonathan P. How
    [arXiv] [zoom]
  • (85) Finding Equilibrium in Multi-Agent Games with Payoff Uncertainty
    Wenshuo Guo, Mihaela Curmei, Serena Wang, Benjamin Recht
    [arXiv] [zoom]
  • (88) Reward-Free Exploration beyond Finite-Horizon
    Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric
    [paper] [video] [zoom]
  • (96) Control as Hybrid Inference
    Alexander Tschantz, Beren Millidge, Anil K Seth, Christopher Buckley
    [arXiv] [zoom]
  • (34) Distributional Robustness and Regularization in Reinforcement Learning
    Esther Derman, Shie Mannor
    [arXiv] [video] [zoom]

Late Poster Session

  • (2) Minimax Confidence Interval for Off-Policy Evaluation and Policy Optimization
    Nan Jiang, Jiawei Huang
    [arXiv] [zoom]
  • (4) An operator view of policy gradient methods
    Dibya Ghosh, Marlos C. Machado, Nicolas Le Roux
    [arXiv] [video] [zoom]
  • (5) Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality
    Kwang-Sung Jun, Chicheng Zhang
    [arXiv] [zoom]
  • (13) Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies
    Nathan Kallus, Masatoshi Uehara
  • (16) Q-Learning Algorithm for Mean-Field Controls, with Convergence and Complexity Analysis
    Haotian Gu, Xin Guo, Xiaoli Wei, Renyuan Xu
    [arXiv] [video] [zoom]
  • (21) PAC Imitation and Model-based Batch Learning of Contextual MDPs
    Yash Nair, Finale Doshi-Velez
    [arXiv] [zoom]
  • (23) A Decentralized Policy Gradient Approach to Multi-task Reinforcement Learning
    Sihan Zeng, Aqeel Anwar, Thinh T. Doan, Arijit Raychowdhury, Justin Romberg
    [arXiv] [zoom]
  • (25) A Finite-Time Analysis of Two Time-Scale Actor-Critic Methods
    Yue Wu, Weitong Zhang, Pan Xu, Quanquan Gu
  • (33) Provably More Efficient Q-Learning in the Full-Feedback/One-Sided-Feedback Settings
    Xiao-Yue Gong, David Simchi-Levi
    [arXiv] [zoom]
  • (39) Sample-Efficient Reinforcement Learning of Undercomplete POMDPs
    Chi Jin, Sham M. Kakade, Akshay Krishnamurthy, Qinghua Liu
    [arXiv] [video] [zoom]
  • (44) Bandit Linear Control
    Asaf Benjamin Cassel, Tomer Koren
    [arXiv] [zoom]
  • (45) Robust Reinforcement Learning via Adversarial training with Langevin Dynamics
    Parameswaran Kamalaruban, Yu-Ting Huang, Ya-Ping Hsieh, Paul Rolland, Cheng Shi, Volkan Cevher
    [arXiv] [zoom]
  • (51) The Mean-Squared Error of Double Q-Learning
    Wentao Weng, Harsh Gupta, Niao He, Lei Ying, R. Srikant
  • (53) Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning
    Nathan Kallus, Angela Zhou
    [arXiv] [zoom]
  • (56) Minimax Model Learning
    Cameron Voloshin, Nan Jiang, Yisong Yue
  • (57) Preference learning along multiple criteria: A game-theoretic perspective
    Kush Bhatia, Ashwin Pananjady, Peter Bartlett, Anca Dragan, Martin Wainwright
  • (58) Provably Good Batch Reinforcement Learning Without Great Exploration
    Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill
  • (68) Black-Box Control for Linear Dynamical Systems
    Xinyi Chen, Elad Hazan
  • (71) Finite Sample Analysis of Two-Time-Scale Natural Actor-Critic Algorithm
    Sajad Khodadadian, Thinh T. Doan, Siva Theja Maguluri, Justin Romberg
  • (72) Logarithmic Regret Bound in Partially Observable Linear Dynamical Systems
    Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, Anima Anandkumar
    [arXiv] [zoom]
  • (73) Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis
    Koulik Khamaru, Ashwin Pananjady, Feng Ruan, Martin J. Wainwright, Michael Jordan
  • (75) Smoothness-Adaptive Contextual Bandits
    Yonatan Gur, Ahmadreza Momeni, Stefan Wager
    [arXiv] [zoom]
  • (84) Geometric Exploration for Online Control
    Orestis Plevrakis, Elad Hazan
    [video] [zoom]
  • (86) Conservative Q-Learning for Offline Reinforcement Learning
    Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine
    [arXiv] [zoom]
  • (89) Efficient MDP Analysis for Selfish-Mining in Blockchain
    Ittay Eyal, Aviv Tamar
  • (95) Adaptive Regret for Online Control
    Paula Gradu, Elad Hazan, Edgar Minasyan
  • (97) Generalized Chernoff Sampling: A New Perspective on Structured Bandit Algorithms
    Subhojyoti Mukherjee, Ardhendu Tripathy, Robert D Nowak
    [video] [zoom]
  • (100) Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes
    Yi Tian, Jian Qian, Suvrit Sra
    [arXiv] [zoom]
  • (102) Model Selection for Finite and Continuous-Armed Stochastic Contextual Bandits
    Avishek Ghosh, Abishek Sankararaman, Kannan Ramchandran
    [arXiv] [zoom]

Program Committee

  • Dhaval Adjodah (MIT)
  • Alon Cohen (Google Research)
  • Sarah Dean (UC Berkeley)
  • Yaqi Duan (Princeton University)
  • Chris Dann (Google Research)
  • Dylan Foster (MIT)
  • Botao Hao (Purdue University)
  • Chi Jin (Princeton University)
  • Alec Koppel (U.S. Army Research Laboratory)
  • Tor Lattimore (DeepMind)
  • Christina Lee Yu (Cornell University)
  • Bo Liu (Auburn)
  • Horia Mania (UC Berkeley)
  • Aditya Modi (University of Michigan, Ann Arbor)
  • Tong Mu (Stanford University)
  • Vidya Muthukumar (UC Berkeley)
  • Sobhan Miryoosefi (Princeton)
  • Aldo Pacchiano (UC Berkeley)
  • Ciara Pike-Burke (Universitat Pompeu Fabra)
  • Tim G. J. Rudner (University of Oxford)
  • Tuhin Sarkar (MIT)
  • Karan Singh (Princeton University)
  • Adith Swaminathan (Microsoft Research Redmond)
  • Yi Su (Cornell University)
  • Masatoshi Uehara (Harvard University)
  • Ruosong Wang (CMU)
  • Qiaomin Xie (Cornell University)
  • Tengyang Xie (UIUC)
  • Renyuan Xu (Oxford University)
  • Lin Yang (UCLA)
  • Zhuoran Yang (Princeton University)
  • Tiancheng Yu (MIT)
  • Andrea Zanette (Stanford University)
  • Angela Zhou (Cornell University)
  • Zhengyuan Zhou (NYU)

Important Dates


Paper Submission Deadline: June 13, 2020, 11:59 PM UTC ([OpenReview])

Author Notification: July 3, 2020, 11:59 PM PDT

Final Version: July 10, 2020, 11:59 PM PDT

Workshop: July 17, 2020 (Time: TBD)

Workshop Organizers                

Emma Brunskill

Stanford University

Thodoris Lykouris

Microsoft Research NYC

Max Simchowitz

UC Berkeley

Wen Sun

Cornell University / Microsoft Research NYC

Mengdi Wang

Princeton University

We thank Hoang M. Le for providing the website template.