Microsoft and OpenAI have developed a new method for tuning massive AI models, such as GPT-3, that are too expensive to train more than once.
A blog post published by Microsoft Research describes a technique called µ-Parametrization (or µP), which builds on the discovery of similarities between the behavior of small- and large-scale AI models to minimize the compute resources required for tuning.
Although you’d need a doctorate to make sense of the specifics, the essential message is this: with µ-Parametrization, it will be cheaper and simpler to develop larger-scale AI models capable of yielding far superior performance to those available today.
Optimizing AI models
As explained in the blog post, one reason large AI models are difficult to train effectively is that we have little insight into the way their behavior changes as they scale. As such, the larger the AI model, the less well-tuned researchers would currently expect it to be.
However, µ-Parametrization offers a route to tuning large-scale models at much lower cost and with much greater efficiency, by capitalizing on the insight that neural networks of varying sizes share the same optimal hyperparameters (HPs) under certain conditions.
Essentially, this means hyperparameters tuned on a small model can be carried over to a much larger one, instead of tuning an entire multi-billion-parameter model directly.
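To make the idea concrete, here is a minimal sketch in Python. It is a toy, not the µP rule set: the model is a single linear layer whose inputs are all 1, and the "rescaled" variant simply divides the output by the square root of the width. The function names, the widths and the learning-rate grid are all assumptions chosen for illustration. The actual µP scaling rules are more involved and are set out in the Microsoft Research post.

```python
# Toy illustration only: a linear model whose inputs are all 1, trained by
# gradient descent toward a target output. This is NOT the µP rule set.

def final_loss(width, lr, rescaled, steps=5, target=1.0):
    """Return the squared-error loss after `steps` gradient-descent updates."""
    # Plain model: f = sum(w). Rescaled model: f = sum(w) / sqrt(width).
    mult = width ** -0.5 if rescaled else 1.0
    weights = [0.0] * width
    for _ in range(steps):
        output = mult * sum(weights)
        grad = (output - target) * mult  # same gradient for every weight
        weights = [w - lr * grad for w in weights]
    output = mult * sum(weights)
    return 0.5 * (output - target) ** 2


def best_lr(width, rescaled, grid=(0.01, 0.03, 0.1, 0.3, 1.0)):
    """Grid-search the learning rate with the lowest final loss."""
    return min(grid, key=lambda lr: final_loss(width, lr, rescaled))


SMALL, LARGE = 4, 1024  # hypothetical proxy and target widths

for rescaled in (False, True):
    lr = best_lr(SMALL, rescaled)           # tune on the cheap proxy only
    loss = final_loss(LARGE, lr, rescaled)  # reuse that setting at full width
    print(f"rescaled={rescaled}: lr tuned at width {SMALL} = {lr}, "
          f"loss at width {LARGE} = {loss:.3g}")
```

In the plain model, the learning rate that works best at width 4 makes training diverge at width 1024. In the rescaled model, the same learning rate is the best choice at both widths, so the cheap search on the small model is all that is needed.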
“µP’s principled way of parameterizing the model and selecting the learning rate make it easier for anybody to scale the training of deep neural networks. Such an elegant combination of beautiful theory and practical impact,” said Johannes Gehrke, Lab Director at Microsoft Research.
To put the theory into practice, Microsoft worked with OpenAI to unleash µ-Parametrization on GPT-3, a natural language model whose largest iteration is made up of 175 billion parameters.
“After parameterizing a version of GPT-3 with relative attention in µP, we tuned a small proxy model with 40 million parameters before copying the best hyperparameter combination to the 6.7-billion parameter variant of GPT-3,” Microsoft explained.
The results were quite startling: the collaborators produced an even more performant version of the 6.7-billion-parameter GPT-3, with tuning costing just 7% of the compute consumed in pretraining that model.
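For a sense of scale, the figures quoted above can be put side by side in a few lines of Python. The variable names are illustrative, and the 7% figure is read here as the cost of tuning relative to one pretraining run of the 6.7-billion-parameter model.

```python
# Figures quoted in the article.
proxy_params = 40_000_000        # small proxy model that was tuned
target_params = 6_700_000_000    # GPT-3 variant that received the hyperparameters
tuning_fraction = 0.07           # tuning cost as a share of pretraining compute

size_ratio = target_params / proxy_params
print(f"The target model has {size_ratio:.1f}x as many parameters as the proxy.")
print(f"Tuning added {tuning_fraction:.0%} on top of one pretraining run.")
```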
To help other practitioners benefit from these findings, Microsoft has published a PyTorch package designed to help them integrate µ-Parametrization into their existing models, a process that can supposedly be finicky in practice.
The company also says, however, that plenty remains to be understood about the scaling of AI models, and has pledged to continue its work to “derive more principled approaches to large-scale machine learning”.