1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities

Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzciński, Benjamin Eysenbach

  • Problem: how to scale RL productively and efficiently given sparse rewards, which make it challenging to train large RL models
  • Idea: reinforcement learning + self-supervised learning = self-supervised RL
  • How: deep contrastive RL with modern architectural improvements
    • Contrastive RL (CRL)
    • Increasing available data in a scalable way via GPU-accelerated RL frameworks
    • 100x deeper networks (1024 layers) vs. traditional RL (2-5 layers)
    • Residual connections
    • Layer normalization
    • Swish activation (swish(x) = x · sigmoid(x), a smooth ReLU-like activation; see the residual-block sketch after this list)
    • Experimental: batch size and network width
  • Context: unsupervised goal-conditioned locomotion and manipulation tasks, with no demonstrations or rewards, driving the agent to explore the environment from scratch
  • Results: depth-wise scaling of RL models pays off
    • Significant performance increase vs. baselines, emergent capabilities at certain network scales
    • Networks that can be productively scaled depth-wise
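
As a concrete illustration of the architectural recipe above, here is a minimal sketch (my own code in JAX, not the authors'; the pre-norm block layout, init scheme, and width are illustrative assumptions) of a residual block with layer normalization and Swish, stacked many times to scale depth:

```python
# Minimal sketch of the depth-scaling recipe: a pre-norm residual block with
# layer normalization and Swish, stacked many times (assumptions noted above).
import jax
import jax.numpy as jnp

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

def residual_block(params, x):
    h = layer_norm(x)
    h = jax.nn.swish(h @ params["w1"] + params["b1"])  # swish(x) = x * sigmoid(x)
    h = layer_norm(h)
    h = jax.nn.swish(h @ params["w2"] + params["b2"])
    return x + h  # skip connection keeps gradients usable at large depth

def init_block(key, width):
    k1, k2 = jax.random.split(key)
    scale = 1.0 / jnp.sqrt(width)
    return {"w1": scale * jax.random.normal(k1, (width, width)), "b1": jnp.zeros(width),
            "w2": scale * jax.random.normal(k2, (width, width)), "b2": jnp.zeros(width)}

# Depth is the scaling knob: stack more blocks (each block adds two dense layers).
blocks = [init_block(k, 256) for k in jax.random.split(jax.random.PRNGKey(0), 64)]
x = jnp.zeros((1, 256))
for p in blocks:
    x = residual_block(p, x)
```

The skip connection (x + h) is the piece the paper identifies as critical: without it, deeper stacks stop improving.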

Details

Objective: maximize the expected return over policies, where the reward at each step is the probability of reaching the goal at the next time step, and a goal is simply (a function of) a state.
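
Written out roughly (my notation, following how goal-conditioned objectives of this kind are usually stated; the paper's exact formulation may differ):

```latex
% Rough formalization (my notation): the reward for goal g is the probability
% of reaching g at the next step, and the policy maximizes the expected
% discounted return over sampled goals.
\[
  r_g(s_t, a_t) = (1 - \gamma)\, p(s_{t+1} = g \mid s_t, a_t),
  \qquad
  \max_{\pi}\; \mathbb{E}_{g \sim p(g)}\,
  \mathbb{E}_{\pi(a \mid s, g)}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_g(s_t, a_t)\right].
\]
```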

Contrastive reinforcement learning: an actor-critic method that pushes the policy to maximize the critic, which scores a (state, action) pair against a goal by the negative L2 distance between their embeddings, written as f(s, a, g) = −‖φ(s, a) − ψ(g)‖₂. The embeddings are produced by two neural networks trained contrastively: the critic should be high on samples from the agent’s own trajectory and low on samples where the goal has been swapped for a random goal from another trajectory. This is a form of Representation Learning, but in the context of reinforcement learning.
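
A minimal sketch of how such a contrastive critic objective can be computed over a batch (my own JAX simplification, not the authors' implementation; the in-batch negative scheme and the actor-loss form are assumptions):

```python
# phi_sa are embeddings of (state, action) pairs, psi_g are embeddings of the
# goals reached on the same trajectories; goals from other rows of the batch
# act as the "swapped" negatives.
import jax
import jax.numpy as jnp

def critic(phi_sa, psi_g):
    # f(s, a, g) = -||phi(s, a) - psi(g)||_2, computed for every pair in the batch
    diffs = phi_sa[:, None, :] - psi_g[None, :, :]   # (B, B, d)
    return -jnp.linalg.norm(diffs, axis=-1)          # (B, B) critic values

def critic_loss(phi_sa, psi_g):
    logits = critic(phi_sa, psi_g)
    # Diagonal = (state, action, goal) from the same trajectory (positives);
    # off-diagonal = goals swapped in from other trajectories (negatives).
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    return -jnp.mean(jnp.diag(log_probs))            # InfoNCE-style objective

def actor_loss(phi_sa_pi, psi_g):
    # The policy maximizes the critic at its own sampled actions, i.e. it pulls
    # the embedding of (state, sampled action) toward the commanded goal's embedding.
    return jnp.mean(jnp.linalg.norm(phi_sa_pi - psi_g, axis=-1))
```

Pushing the diagonal scores above the off-diagonal ones is what trains φ and ψ; the policy then only needs to choose actions whose embedding lands near the commanded goal's embedding.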

Experimental Results

Emergent policies through depth: significant jumps in performance observed at particular critical network depths, varying by task / environment. The authors highlight the humanoid environments in particular, where “to the best of our knowledge, this is the first goal-conditioned approach to document such behaviors on the humanoid environment.” Residual connections appear to be critical to enabling this, as networks without this component do not see improved performance with depth.

Depth outperforms width: scaling depth appears to work much better than scaling network width, with a doubling of depth from 4 to 8 outperforming the widest networks in all test environments. This is again most notable in the humanoid environments. Width does help as well, but not to the same degree as depth.

Scaling the critic helps more than scaling the actor: scaling the critic networks drives substantial performance gains, while scaling the actor is only marginally helpful. Worth noting, however, that this contrasts with prior work, which found scaling the actor to be actively harmful; here it is merely unhelpful.

Scaling batch size is beneficial with sufficiently deep networks: batch size has historically not been a useful lever for RL performance. However, experiments here show that once the network reaches a certain depth, larger batch sizes do outperform smaller ones. This suggests that larger networks can learn from larger batches in a way that smaller networks cannot, which perhaps explains why batch size has generally been found unhelpful in prior research.

Deep networks learn better representations: deeper networks appear to learn better representations over the course of training. The authors evaluate this in two ways: investigating the Q values computed at various points in an environment, and looking at where the agent tends to spend time in embedding space throughout a trajectory. The Q values for the larger networks are more sophisticated in ways that likely improve task performance, while the trajectories of the larger networks cluster around the goal in embedding space more broadly than those of smaller networks, which tend to form much tighter clusters.


References

Wang, K., Javali, I., Bortkiewicz, M., Trzciński, T., & Eysenbach, B. 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities.