Google's New AI Architecture: Why Do Neural Networks Need Their Own "Experts"?

March 10, 2022, 12:28 (UTC+3)

In early January, developers at Google Brain, the IT corporation's research division, described the Vision Mixture of Experts (V-MoE) neural network architecture and published its source code. According to their article, this approach (adding special “expert” layers to the neural network) makes it possible to process images with AI more efficiently without a significant increase in computing power.

ICT.Moscow looked into what distinguishes this architecture, whether it is really new, and how applicable it is to real problems across different areas and different AI modalities: text, voice and vision. One of the authors of the work, Carlos Riquelme, a Research Scientist at Google Brain, as well as AI developers from Yandex, SberDevices, Speech Technology Center (STC), EORA, Intento and Tortu, helped make sense of it.

New experts for neural networks 

The V-MoE architecture, summarizing the views of ICT.Moscow's interlocutors, is a logical, evolutionary development of transformer neural networks, which Google Brain introduced in 2017. A distinctive feature of vision transformers (ViT), as Kirill Danilyuk, head of the computer vision group for autonomous vehicles at Yandex, explained, is that they carry very little inductive bias. In other words, such networks “configure themselves”, which in turn requires vast datasets and computing power. Accordingly, the history of transformers began with text models, the most famous example being GPT-3 (ruGPT-3 was created for the Russian language by Sberbank).

Transformers came primarily into language models because there were huge open volumes of texts. You could just give all the available data and hope that the neural network itself would learn all the necessary patterns. It turned out that the more data you give, and the more model parameters you have, the more complex patterns the network can learn.

But there is a problem: the more parameters you have, the more calculations are needed, and any increase in the depth of the neural network significantly raises the computational requirements. So in 2017 Google began to think about how to keep increasing the number of parameters, and thus get a more complex network, without increasing the required calculations. The answer was the Mixture of Experts (MoE) approach: creating separate subnetworks inside the neural network and routing computation not through all of them at once, but only through those selected by the router. Google's examples show that small V-MoE architectures contain four and a half times more parameters for the same number of calculations, and this ratio also holds for other ViT vs. V-MoE variants.

There were two tasks. The first was to make the model scale to a much larger number of parameters so that it could work with more complex patterns. The second was to make the amount of computation the network performs grow in a controlled manner.

Kirill Danilyuk

Head of Computer Vision in Autonomous Vehicles at Yandex

The principle of constructing a transformer neural network with MoE “expert” layers. Source: Google Brain
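To make the parameters-versus-compute trade-off concrete, here is some back-of-the-envelope Python with illustrative layer sizes of our own choosing (not figures from the V-MoE paper): an MoE block stores the weights of every expert, but each token only runs through the k experts the router picks, so parameter count and per-token compute decouple.

```python
# Back-of-the-envelope arithmetic for a transformer feed-forward (FFN) block.
# All sizes here are illustrative, not taken from the V-MoE paper.
d_model = 1024      # token embedding width
d_ff = 4096         # hidden width of the FFN
num_experts = 8     # experts in the MoE layer
k = 1               # experts actually evaluated per token (Switch-style routing)

# A dense FFN has two weight matrices: d_model x d_ff and d_ff x d_model.
dense_params = 2 * d_model * d_ff
dense_flops_per_token = 2 * dense_params   # roughly 2 FLOPs per weight (multiply + add)

# The MoE layer stores num_experts copies of those weights...
moe_params = num_experts * dense_params
# ...but each token passes through only k of them (the small router is ignored here).
moe_flops_per_token = k * dense_flops_per_token

print(f"parameter ratio (MoE / dense): {moe_params / dense_params:.1f}x")
print(f"per-token compute ratio (MoE / dense): {moe_flops_per_token / dense_flops_per_token:.1f}x")
# -> 8.0x the parameters at 1.0x the per-token compute in this toy setup.
```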

In other words, the difficulty is as follows: transformer neural networks already need a large number of parameters and serious computing power just to work with texts (it is no coincidence that supercomputers have become one of the notable topics on the AI agenda in recent years). And since images are much harder material for neural networks to process, the computers must be much more powerful still, quadratically so, as Kirill Danilyuk explained. The MoE architecture (V-MoE for images) was invented to reduce the required power.

What these networks have in common is that their architecture is not a monolithic one trained end-to-end (with the gradient computed for all network weights at once). Instead, a set of individually trained “expert” networks is used, whose connection and disconnection (with different weights) is controlled by a separate dispatcher model, the gating network. This architecture is called Mixture of Experts (MoE). The training process of an MoE model is somewhat similar to the layer-by-layer training of deep networks that was popular at the turn of the millennium.

In V-MoE, experts “learn separately” not in the sense that they are trained apart from the rest of the network, but in the sense that activations are sparse and gradients pass through only one expert per update. Incidentally, before Switch Transformer (a Google Brain transformer model developed in 2021 that uses the MoE approach — ed.), at least two experts were used in each such layer.

Sergey Markov

Managing Director at the Department of Experimental Machine Learning Systems, SberDevices

Visualization of how an MoE neural network differs from conventional transformers under limited computing resources. Source: Google Brain
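As a rough illustration of the "router plus experts" construction described above, the sketch below is a minimal, simplified MoE layer in PyTorch; the class and variable names are ours, and real implementations (including Google's) add expert-capacity limits and load-balancing losses that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy Mixture-of-Experts feed-forward layer with top-k routing."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        # The "experts": independent feed-forward subnetworks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # The router (gating network) scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- image patches or text tokens.
        gate_logits = self.router(x)                             # (tokens, experts)
        weights, expert_ids = gate_logits.topk(self.k, dim=-1)   # keep the k best experts
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        # Each token is processed only by its selected experts.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route 16 tokens of width 64 through 4 experts, evaluating 2 per token.
layer = SparseMoELayer(d_model=64, d_ff=128, num_experts=4, k=2)
tokens = torch.randn(16, 64)
print(layer(tokens).shape)  # torch.Size([16, 64])
```

With k=1 the routing degenerates to the Switch-Transformer-style single-expert case mentioned in the quote above.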

Carlos Riquelme from Google Brain explained to ICT.Moscow that the operation of one such neural network usually requires several computers forming a network, and here lies another problem that has not yet been solved.

MoEs offer massive capacity (i.e. networks with a huge number of parameters that can learn extremely complex functions). In order to take advantage of so many parameters, we need lots of data. I guess datasets have been growing in size during the last few years, maybe enabling the successful training of these sparse networks just recently.

Also, from an engineering perspective, implementing and training these networks is non-trivial. They are so big that they are typically distributed across a bunch of computers, and even in a single forward pass data needs to be exchanged across computers (for example, input I may need to go to an expert stored in machine J first, and then another one in machine K). We are starting to develop mature tools to handle these new kinds of models.

Carlos Riquelme

Research Scientist at Google Brain
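Riquelme's point about data crossing machine boundaries can be imitated in a deliberately simplified, single-process sketch (the machine and expert names are invented for the example): tokens are grouped by the expert the router assigned them, "sent" to whichever machine hosts that expert, processed there, and put back in their original order, which is roughly what a real all-to-all exchange does across accelerators.

```python
from collections import defaultdict

# Pretend placement: each expert lives on a different machine.
expert_to_machine = {"expert_0": "machine_A", "expert_1": "machine_B", "expert_2": "machine_C"}

def dispatch(tokens, assignments):
    """Group tokens by the machine that hosts their assigned expert."""
    outbox = defaultdict(list)
    for position, (token, expert) in enumerate(zip(tokens, assignments)):
        outbox[expert_to_machine[expert]].append((position, expert, token))
    return outbox  # in a real system this grouping becomes an all-to-all exchange

def process_and_return(outbox):
    """Each 'machine' runs its local experts, then results are scattered back."""
    results = {}
    for machine, items in outbox.items():
        for position, expert, token in items:
            # Stand-in for the expert's actual computation.
            results[position] = f"{token} processed by {expert} on {machine}"
    return [results[i] for i in sorted(results)]

tokens = ["patch_0", "patch_1", "patch_2", "patch_3"]
assignments = ["expert_1", "expert_0", "expert_1", "expert_2"]  # router output
print(process_and_return(dispatch(tokens, assignments)))
```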

Although the developers ICT.Moscow spoke to view the new architecture positively, they do not yet see a breakthrough from an applied point of view. According to Vlad Vinogradov from EORA Data Lab, the development still requires further effort before it moves things forward.

In general, their paper is not as unexpected as, say, the paper from 2017 with the participation of Geoffrey Hinton (a well-known scientist in the field of neural networks — ed.). There is an even earlier mention of MoE in a 1991 article.

For images, Google assembled a vision transformer (ViT) with MoE and got a 2x speedup without losing quality. This is not a breakthrough: a twofold acceleration is not that attractive given the high complexity of implementation and training. However, it may suit those who have already squeezed everything out of current methods in balancing speed and quality.

For me, a breakthrough will occur when it is possible to speed up the model at least tenfold without losing quality.

Vlad Vinogradov

CTO at EORA Data Lab

Grigory Sapunov from Intento also noted that this is a rather old approach, and the development presented by Google in January is “not a revolution, but rather a consistently wider application of an already known idea” (a more detailed technical analysis of this architecture can be found here).

In general, it's an interesting architecture, with good potential. You can create large distributed networks. Essentially, it adds another dimension to model scaling. Now you can add not only layers to make the model heavier but also expert layers, without particularly slowing down the entire model.

In size, you can get a huge neural network, much larger than the largest GPT-3, but at any given time, only a small part of the weights will “work”. Previously, the main difficulty in training such architectures was probably that they require a more complex distributed infrastructure. Now, this is no longer such a problem.

Grigory Sapunov

Co-founder and CTO at Intento

Why do we need neural networks with experts?

Based on conversations with AI developers, it was possible to identify at least four positive effects of transformer neural networks with expert layers. The first has already been mentioned above: the ability to work with a larger number of parameters than traditional transformers.

Models based on the MoE approach outperform monolithic models in the number of parameters (while requiring fewer operations to execute), so V-MoE with 15 billion parameters has formally become the largest pre-trained neural network for computer vision tasks. For comparison, the DALL-E and ruDALL-E generative models contain 13 billion parameters, while the functionally similar networks ViT-G/14 (a transformer neural network without "experts" — ed.) and Florence have 1.8 billion and 637 million parameters, respectively.

Sergey Markov

Managing Director at the Department of Experimental Machine Learning Systems, SberDevices

Kirill Danilyuk from Yandex compared the development with convolutional neural networks, which come pre-configured for their task (they carry more inductive bias) and are currently used in self-driving cars.

An optimized convolutional neural network of the kind used in self-driving cars (in this case, a classifier) has about 10 million parameters. Another convolutional neural network, ResNet-152, has 60 million parameters and is already too big to be used in a car. The vision transformer (ViT), meanwhile, has 656 million parameters, and what Google presented has 15 billion.

Kirill Danilyuk

Head of Computer Vision in Autonomous Vehicles at Yandex

15 billion parameters for V-MoE neural networks is probably far from the limit if this architecture continues to develop in the same way as the “parent” MoE. Grigory Sapunov gave several examples of text models that continue the series started by the specialists from SberDevices and Yandex: the language MoE models presented in December by Meta AI, with 1.1 trillion parameters, and Google's GLaM, with 1.2 trillion.

The second effect is the lower computing power required to work with an equivalent amount of data. ICT.Moscow's interlocutors see this effect from two angles. Kirill Danilyuk from Yandex views MoE as a tool for slowing the growth of required computing power as the number of parameters and the amount of data increase (compared to traditional transformers). Grigory Sapunov from Intento points to a direct comparison of GLaM (an MoE neural network) with GPT-3 (a comparable AI model without "experts"). Exact figures are hard to come by, but according to the developer, GLaM significantly outperformed GPT-3 in 5 out of 7 task categories.

From a practical point of view, MoE networks train faster and better. Another effect follows from this: if training is faster and better, the carbon footprint of the computers is noticeably reduced. However, this effect is not yet fully understood, since it is unclear how much everything else contributes: the hardware, software, data centers and so on.

Grigory Sapunov

Co-founder and CTO at Intento

Thus, the four positive effects of MoE networks compared to traditional transformers are directly interconnected:

  1. they work with a large number of parameters;
  2. they require less computing power;
  3. they learn faster and better on less data;
  4. as a result, carbon emissions from the work of computers with such neural networks are reduced.

But what does this mean in practice: where can MoE neural networks, and V-MoE in particular, actually be applied?

In what fields will such neural networks be useful?

Carlos Riquelme from Google Brain is positive that there are no fundamental limitations on the applicability of neural networks with “experts” in the future.

It's been applied successfully to NLP and machine translation before. The MoE layers (or more generally, their inductive bias) can be applied to any type of learning problem. We believe this is a good fit for problems where the input can be split into a sequence of tokens (like words in sentences for language problems, or frames in videos).

Moreover, MoEs could be helpful for multi-task setups where some tasks may interfere with learning others, and MoE paths could make it easy to isolate task-specific parts of the network (learning to play game 2 doesn't make you forget how to play game 1).

Carlos Riquelme

Research Scientist at Google Brain
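Riquelme's multi-task remark can be illustrated with a hypothetical hard-routing scheme, sketched below: if each task is given its own block of experts, training on one task leaves the other task's experts untouched. This assignment rule is our own toy construction, not something described in the V-MoE paper.

```python
import torch
import torch.nn as nn

class TaskRoutedExperts(nn.Module):
    """Hypothetical hard routing: each task gets its own slice of experts."""

    def __init__(self, d_model: int, num_experts: int, experts_per_task: int):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
        self.experts_per_task = experts_per_task

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Pick the block of experts reserved for this task.
        start = task_id * self.experts_per_task
        selected = self.experts[start:start + self.experts_per_task]
        # Average the selected experts' outputs; the other experts stay untouched,
        # so training on task 1 does not overwrite the experts used by task 0.
        return torch.stack([e(x) for e in selected]).mean(dim=0)

layer = TaskRoutedExperts(d_model=32, num_experts=8, experts_per_task=4)
x = torch.randn(5, 32)
print(layer(x, task_id=0).shape, layer(x, task_id=1).shape)  # both torch.Size([5, 32])
```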

At the current stage, developers are more inclined to wait and see: they are watching what specific effects this architecture can deliver and how easy it is to work with compared to the tools they already have. This also applies to professionals working with AI voice interfaces.

When developing conversational interfaces, several layers can be distinguished on which a task is designed and implemented. In our view, when working with voice and conversational user interfaces (VUI/CUI), it is the design of the dialogue system (the scripts), the client layer, that plays the bigger role: it is what makes the dialogue as human-like and informative as possible and accounts for all possible forks. NLP (the instrumental layer) acts as the processing and classification system on which that logic is implemented.

Segmentation has been worked on before, so we can't call this a new approach to data. What is different is how the approach is applied inside data processing algorithms. Profiling and narrow segmentation increase the detail and quality of analysis in a particular area, but it is important not to lose context (the big picture). Most likely, the Google team has managed to find the right approach; we will be watching their results and how the chosen architecture develops.

Julia Mitskevich

COO of the Design and Development team for conversational products at TORTU

Specialists who work with static images, like those the Google Brain developers cite, are also in no hurry to draw firm conclusions about whether V-MoE applies to their tasks.

We could use this approach to train models for similar-image search: here we need large models with high generalization ability. At the same time, we also care about high processing speed. However, before any real use we would have to weigh whether it is worth complicating the training methods for the sake of a twofold speedup, or whether it is better to lose a couple of percent in quality but use a smaller classical neural network.

Vlad Vinogradov

CTO at EORA Data Lab

This also applies to developers involved in neural networks for the analysis of medical images.

Using Mixture of Experts (MoE) in transformers for 2D and 3D medical image segmentation can really improve accuracy and reduce exam processing time. At the same time, the question of how well the weights of the model are transferred to medical problems remains unexplored.

Aleksander Gromov

Computer Vision Team Lead at Third Opinion 

The task of biometric recognition is more difficult than working with static images. But Speech Technology Center (STC), which develops such systems, is confident that such a neural network architecture will be applicable in their work.

We are planning experiments on applying the dynamic-network approach to improve our own biometric facial recognition system, including for recognizing masked faces. For example, if the face is covered by a mask, the network can run the computation through one branch, and if it is not, through another.

In the future, the "dynamic networks" approach can be extended to make the algorithm work even better on the faces of minors, the elderly and various ethnic groups (which matters for the biometric video identification systems in Safe City and Smart City projects). By our forecasts, the "dynamic networks" approach will continue to be actively used in computer vision.

Dmitry Dyrmovsky

CEO at STC Group
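The "one branch for masked faces, another for unmasked ones" idea from the quote can be sketched as a simple conditional network. The components below (a mask detector and two embedding branches) are placeholders chosen for illustration, not STC's actual models.

```python
import torch
import torch.nn as nn

class DynamicFaceEncoder(nn.Module):
    """Toy dynamic network: a gate decides which branch processes the face."""

    def __init__(self, d_in: int, d_emb: int):
        super().__init__()
        self.mask_detector = nn.Sequential(nn.Linear(d_in, 1), nn.Sigmoid())
        self.masked_branch = nn.Linear(d_in, d_emb)    # e.g. tuned for the eye region
        self.unmasked_branch = nn.Linear(d_in, d_emb)  # uses the whole face

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        # face: (d_in,) feature vector for a single detected face.
        if self.mask_detector(face) > 0.5:
            return self.masked_branch(face)
        return self.unmasked_branch(face)

encoder = DynamicFaceEncoder(d_in=128, d_emb=64)
print(encoder(torch.randn(128)).shape)  # torch.Size([64])
```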

Finally, the most difficult scenario is running such neural networks “in the field” to solve problems in real time. Autonomous vehicles are a telling example here: they not only work with images in real time but must also be as reliable as possible. Whether V-MoE has prospects for such problems is still an open question.

An autonomous vehicle works with images from cameras and with data from other sensors, while what we see from Google is only about images. Obviously, there is no work with data from lidars, radars and so on. Google uses the ImageNet dataset with low-resolution images (224×224 pixels). If we take a picture at 1000×1000 pixels, the complexity of working with it grows roughly quadratically. Plus, we work with video, that is, with a sequence of frames rather than a single frame. You need to be able to process a stream of images at 10-30 frames per second, which makes things even harder.

So from the point of view of self-driving cars, this development is still too far from practical application. Moreover, transformers here solve the problem of classifying what is in the frame, which is of little interest for self-driving cars; we need to solve a more complex problem, detection. That is, to find out not only what kind of object it is, but also where it is located, and to predict its movement in three-dimensional space. In other words, once again we need data not only from cameras but also from other sensors.

It is also worth noting that the goal of a model working “in the field” is not to show the best metrics on some dataset (as Google does on JFT-300M and ImageNet). The goal is to detect the required types of objects reliably enough, as quickly as possible and as efficiently as possible in terms of resources.

Kirill Danilyuk

Head of Computer Vision in Autonomous Vehicles at Yandex
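Danilyuk's resolution argument can be checked with quick arithmetic: a vision transformer cuts the image into patches, the number of tokens grows with image area, and self-attention cost grows roughly with the square of the token count. The 16-pixel patch size below is a common ViT choice, used here only as an assumption.

```python
import math

def vit_cost(height: int, width: int, patch: int = 16):
    """Token count and relative self-attention cost for a given resolution."""
    tokens = math.ceil(height / patch) * math.ceil(width / patch)
    return tokens, tokens ** 2  # attention compares every token with every other

tokens_224, attn_224 = vit_cost(224, 224)      # 14 * 14 = 196 tokens
tokens_1000, attn_1000 = vit_cost(1000, 1000)  # 63 * 63 = 3969 tokens

print(f"224x224:   {tokens_224} tokens")
print(f"1000x1000: {tokens_1000} tokens "
      f"(~{tokens_1000 / tokens_224:.0f}x more, "
      f"~{attn_1000 / attn_224:.0f}x the attention cost)")
```

And at 10-30 frames per second, that cost is paid dozens of times every second.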

The interlocutor of ICT.Moscow emphasized that even taking the reduced computing requirements into account, “we could probably solve our problems right now if we had a data center or a cluster of several powerful computers on board the car.” In other words, even with MoE, transformer neural networks still require far more computation and data.

Visualization of data collected by lidar on an unmanned vehicle. Source: Yandex

The question of data also remains open. Collecting texts for transformer networks posed no problems: there, dataset markup happens fully automatically. Static images for V-MoE are harder, but the Google Brain developers handled that and assembled the JFT-300M dataset. Automating the assembly of a dataset for autonomous driving tasks is something existing methods have not yet solved.

Transformers require an order of magnitude more data compared to the AI models that are currently used in self-driving cars. Looking at the JFT-300M dataset that Google put together semi-automatically for V-MoE training, it's 300 times larger than the ImageNet they show their result on. And this in itself is a large dataset collected over several years.

Google came up with an interesting way to collect data automatically for its class of tasks. How to automatically collect labels from cameras and sensors for autonomous driving tasks is still an open question. When the world figures out how to assemble good, high-quality markup automatically for this class of tasks, we will probably get closer to starting to use transformers.

In any case, we believe that V-MoE-style neural networks will eventually reach the real-time industry, that is, they will be used to solve problems “here and now”. But whether this will happen in self-driving cars is not yet certain.

Kirill Danilyuk

Head of Computer Vision in Autonomous Vehicles at Yandex

Prospects for the development of MoE neural networks

Work with vision transformers, including those built on the MoE architecture, will, according to Sergey Markov from SberDevices, remain primarily the preserve of large research teams in the near future.

Experiments with such large neural networks today can only be done by large research teams that have not only advanced knowledge in the field of neural network technologies but also the necessary computing resources and data sets.

Sergey Markov

Managing Director at the Department of Experimental Machine Learning Systems, SberDevices

However, we can still try to determine in which direction this architecture will develop further.

There are many possible directions for development, but among those already within sight there is, for example, decoupled training, where one layer stops depending on another altogether and the entire network can be trained in a distributed manner. Plus, modularization can go even further, with whole large networks acting as "experts".

In the very short term it is still about scaling; there is room for development there. A little further out on the horizon it is more about modularization and multimodality: multimodal MoEs will certainly be interesting when they appear.

Grigory Sapunov

Co-founder and CTO at Intento

A distinctive feature of multimodal neural networks is that they are trained in parallel on several types of data: for example, texts and images. An example is DALL-E, a transformer that generates images from text descriptions. It is logical to assume that the multimodal approach will also interest autonomous vehicle developers, since, as Kirill Danilyuk noted, their field also requires several types of data (from cameras, lidars and radars) to be used within a single network for one prediction task.

If we assume that MoE neural networks will indeed develop in the areas that Grigory Sapunov spoke about — modularization, multimodality and scaling — then this will probably affect the development of the trend that Igor Pivovarov spoke to ICT.Moscow about earlier — the emergence of Foundation Models. The idea is that now you don’t even need to write code: you take a ready-made model, customize it for your tasks and get a working solution.

Foundation Models are a really big topic; I wrote about them not long ago. There is definite movement in that direction, and that is good. MoE, of course, can be one of the solutions in such models, but not necessarily. In general it fits in well, and I would expect some Foundation Models to be based on MoE. But hardly all of them: other new solutions will definitely appear. Clearly, MoE is not the end of the story.

Grigory Sapunov

Co-founder and CTO at Intento

Original (in Russian)
