Paper: https://arxiv.org/abs/2306.07052
Abstract:
In this work, we empirically show that updating pretrained LMs (350M, 1.3B, 2.7B) with just a few steps of Gradient Ascent Post-training (GAP) on random, unlabeled text corpora enhances their zero-shot generalization capabilities across diverse NLP tasks. Specifically, we show that GAP can allow LMs to become comparable to 2-3x larger LMs across 12 different NLP tasks. We also show that applying GAP on out-of-distribution corpora leads to the most reliable performance improvements. Our findings indicate that GAP can be a promising method for improving the generalization capability of LMs without any task-specific fine-tuning.
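For anyone wondering what "a few steps of gradient ascent on unlabeled text" looks like in code, here's a minimal sketch assuming a HuggingFace causal LM. The checkpoint name, step count, learning rate, and example texts are placeholders I picked for illustration, not the paper's actual settings; the only essential bit is negating the LM loss so the optimizer ascends instead of descends.

```python
# Minimal sketch of Gradient Ascent Post-training (GAP) on unlabeled text.
# Hyperparameters and data here are illustrative, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-350m"  # a 350M-parameter causal LM (assumed checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Random, unlabeled text; in the paper this would come from a text corpus.
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Stock markets fluctuated sharply this week.",
]

num_gap_steps = 5  # "just a few steps", per the abstract
for step in range(num_gap_steps):
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    outputs = model(**batch, labels=batch["input_ids"])
    # Gradient *ascent*: maximize the LM loss by minimizing its negation.
    loss = -outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```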
Pretty cool research. I’m wondering if this method could be applied in a more effective way, e.g. by introducing gradient ascent throughout the full training process (I’d be curious to see how different ratios of descent:ascent during training would affect convergence/generalization; rough sketch of what I mean below). It would also be neat to see this applied to larger models.
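To make the descent:ascent ratio idea concrete, here's a toy sketch of a training loop that flips the sign of the loss on every Nth step. This is my own speculation, not anything from the paper, and `ascent_every` is a hypothetical knob controlling the ratio.

```python
# Toy sketch of interleaving gradient ascent into ordinary training.
# `ascent_every` is a hypothetical ratio knob, not something from the paper.
import torch

def train_with_ascent(model, optimizer, data_loader, ascent_every=10):
    """Take one gradient-ascent step after every `ascent_every - 1` descent steps."""
    model.train()
    for step, batch in enumerate(data_loader):
        # Assumes `batch` is a dict with "input_ids" (plus attention masks etc.).
        outputs = model(**batch, labels=batch["input_ids"])
        # Flip the sign on every Nth step to ascend instead of descend.
        sign = -1.0 if (step + 1) % ascent_every == 0 else 1.0
        (sign * outputs.loss).backward()
        optimizer.step()
        optimizer.zero_grad()
```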
Edit for Lemmies: You can read the paper here: https://www.researchgate.net/publication/371505904_Gradient_Ascent_Post-training_Enhances_Language_Model_Generalization