In early January, researchers at Google Brain, the company's research division, described the Vision Mixture of Experts (V-MoE) neural network architecture and published its source code. According to their article, adding special “expert” layers to a neural network makes image processing with AI noticeably more efficient without a significant increase in computing power.
ICT.Moscow looked into what distinguishes this architecture, whether it is genuinely new, and how applicable it is to real problems across different domains and AI modalities: text, voice and vision. Carlos Riquelme, one of the authors and a Research Scientist at Google Brain, as well as AI developers from Yandex, SberDevices, Speech Technology Center (STC), EORA, Intento and Tortu, helped to clarify the issue.
The V-MoE architecture, to summarize the views of ICT.Moscow's interlocutors, is a logical, evolutionary step in work with transformer neural networks, which Google Brain presented in 2017. A distinctive feature of vision transformers (ViT), as Kirill Danilyuk, head of the computer vision group in autonomous vehicles at Yandex, explained, is that they carry minimal inductive bias. In other words, such networks “configure themselves”, which in turn requires vast datasets and computing power. Accordingly, the history of transformers began with text models, the most famous example being GPT-3 (ruGPT-3 was created for the Russian language by Sberbank).
But there is a problem: the more parameters a model has, the more computation it needs, and any increase in the depth of the network significantly raises the compute requirements. So in 2017 Google began to look for a way to keep increasing the number of parameters, and thus the complexity of the network, without increasing the required computation. The answer was the Mixture of Experts (MoE) approach: separate subnetworks ("experts") are created inside the neural network, and computation runs not through all of them at once but only through those selected by a router. Google's examples show that small V-MoE configurations contain roughly four and a half times more parameters for the same amount of computation, and this ratio holds for other ViT vs V-MoE variants as well.
There were two tasks. The first was to make the model scale to a much larger number of parameters in order to work with more complex patterns. The second was to make the amount of computation the network performs grow in a controlled way.
Kirill Danilyuk
Head of Computer Vision in Autonomous Vehicles at Yandex
The principle of constructing a transformer neural network with MoE “expert” layers. Source: Google Brain
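To make the routing described above concrete, below is a minimal illustrative sketch of a sparse MoE layer in Python with NumPy. The sizes, the two-layer MLP experts and all variable names are assumptions for illustration only; this is not Google's actual V-MoE code.

```python
# A toy sparse MoE layer: a router scores the experts for each token and only the
# top_k selected expert subnetworks are actually evaluated for that token.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden, num_experts, top_k = 64, 128, 8, 2

# Each "expert" is an independent two-layer MLP (a separate subnetwork).
experts = [
    (rng.standard_normal((d_model, d_hidden)) * 0.02,
     rng.standard_normal((d_hidden, d_model)) * 0.02)
    for _ in range(num_experts)
]
# The router is a single linear layer that scores every expert for every token.
router_w = rng.standard_normal((d_model, num_experts)) * 0.02


def moe_layer(tokens):
    """tokens: (num_tokens, d_model); each token is processed by only top_k experts."""
    logits = tokens @ router_w                       # (num_tokens, num_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)       # softmax over experts
    chosen = np.argsort(-probs, axis=-1)[:, :top_k]  # experts selected by the router

    out = np.zeros_like(tokens)
    for t, token in enumerate(tokens):
        for e in chosen[t]:
            w1, w2 = experts[e]
            hidden = np.maximum(token @ w1, 0.0)     # the selected expert's ReLU MLP
            out[t] += probs[t, e] * (hidden @ w2)    # weighted by the router probability
    return out


patches = rng.standard_normal((16, d_model))         # e.g. 16 image-patch embeddings
print(moe_layer(patches).shape)                      # -> (16, 64)
```

The point to notice is that every token is multiplied through only top_k of the eight expert subnetworks, even though all eight sets of weights exist in the model.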
In other words, the difficulty is this: transformer neural networks already need a large number of parameters and serious computing power just to work with text (it is no coincidence that supercomputers have become one of the notable topics on the AI agenda in recent years). And since images are far harder for neural networks to process, the computers must be much more powerful still: the cost grows quadratically, as Kirill Danilyuk explained. The MoE architecture (V-MoE for images) was devised to reduce the required power.
In V-MoE, the experts “learn separately” not in the sense that they are trained apart from the rest of the network, but in the sense that activations are sparse and, in a single update, gradients pass through only one “expert”. Incidentally, before Switch Transformer (a transformer neural network model from Google Brain, developed in 2021, that uses the MoE approach; ed. note), at least two “experts” were used per expert layer.
Sergey Markov
Managing Director at the Department of Experimental Machine Learning Systems, SberDevices
Visualization of how an MoE neural network differs from conventional transformers when computing resources are limited. Source: Google Brain
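A toy illustration of Sergey Markov's point about sparse activation (the numbers are made up, not taken from the paper): with top-1 routing, as in Switch Transformer, each token touches a single expert, so only that expert's weights receive a gradient from it in a given update; with top-2 routing, two experts do.

```python
# Which experts a batch of tokens touches under top-1 (Switch-style) vs top-2 routing.
import numpy as np

rng = np.random.default_rng(1)
num_tokens, num_experts = 6, 4
router_probs = rng.dirichlet(np.ones(num_experts), size=num_tokens)

for k in (1, 2):
    chosen = np.argsort(-router_probs, axis=-1)[:, :k]   # experts picked per token
    active = sorted({int(e) for row in chosen for e in row})
    print(f"top-{k}: assignments per token {chosen.tolist()}, "
          f"experts updated this step {active}")
```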
Carlos Riquelme from Google Brain explained to ICT.Moscow that running one such neural network usually requires several computers joined into a network, and here lies another problem that has not yet been fully solved.
Also, from an engineering perspective, implementing and training these networks is non-trivial. They are so big that they are typically distributed across a bunch of computers, and even in a single forward pass data needs to be exchanged across computers (for example, input I may need to go to an expert stored in machine J first, and then another one in machine K). We are starting to develop mature tools to handle these new kinds of models.
Carlos Riquelme
Research Scientist at Google Brain
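A heavily simplified, single-process sketch of the cross-machine traffic Riquelme describes: experts live on different machines, so each token has to be shipped to the machine hosting its chosen expert and the result shipped back. The machine names and expert placement below are invented purely for illustration.

```python
# Simulated "dispatch" step of an expert-parallel MoE layer: group tokens by the
# machine that hosts their chosen expert (in a real system this grouping becomes
# an all-to-all network exchange between accelerators).
from collections import defaultdict

expert_to_machine = {0: "machine_A", 1: "machine_A", 2: "machine_B", 3: "machine_C"}
token_to_expert = {"token_0": 2, "token_1": 0, "token_2": 3, "token_3": 2}

outbox = defaultdict(list)
for token, expert in token_to_expert.items():
    outbox[expert_to_machine[expert]].append((token, expert))

for machine, batch in outbox.items():
    print(f"send to {machine}: {batch}")
# Each machine would then run its local experts on the tokens it received and
# send the outputs back to wherever the rest of the network's layers live.
```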
Although the developers with whom ICT.Moscow spoke view the new architecture positively, they do not yet see a breakthrough from an applied point of view. According to Vlad Vinogradov from EORA Data Lab, the development still needs further work to move forward.
For images, Google assembled a vision transformer (ViT) with MoE and got a 2x speedup without losing quality. This is not a breakthrough: a twofold acceleration is not that attractive given the high complexity of implementation and training. However, it may suit those who have already squeezed everything they can out of current methods in the trade-off between speed and quality.
For me, a breakthrough will occur when it is possible to speed up the model at least tenfold without losing quality.
Vlad Vinogradov
CTO at EORA Data Lab
Grigory Sapunov from Intento also noted that this is a rather old approach, and that the development Google presented in January is “not a revolution, but rather a consistent, broader application of an already known idea” (a more detailed technical analysis of this architecture can be found here).
In general, it's an interesting architecture, with good potential. You can create large distributed networks. Essentially, it adds another dimension to model scaling. Now you can add not only layers to make the model heavier but also expert layers, without particularly slowing down the entire model.
In terms of size, you can get a huge neural network, much larger than the largest GPT-3, but at any given moment only a small portion of the weights will “work”. Previously, the main difficulty in training such architectures was probably that they require a more complex distributed infrastructure. Now this is no longer such a problem.
Grigory Sapunov
Co-founder and CTO at Intento
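A back-of-the-envelope calculation of the point Sapunov makes, with round numbers chosen purely for illustration: the total parameters of an MoE feed-forward layer grow with the number of experts, while the parameters actually touched by a single token depend only on how many experts the router selects.

```python
# Total vs "active" parameters in a hypothetical MoE feed-forward layer.
d_model, d_ff = 1024, 4096
ffn_params = 2 * d_model * d_ff          # weights of one dense feed-forward block

for num_experts, top_k in [(1, 1), (32, 2), (256, 2)]:
    total = num_experts * ffn_params      # parameters stored in the layer
    active = top_k * ffn_params           # parameters used for a single token
    print(f"{num_experts:>3} experts: total ~{total / 1e6:.0f}M params, "
          f"active per token ~{active / 1e6:.0f}M")
```

So the model can grow to an enormous total size while the per-token compute stays essentially flat.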
Conversations with AI developers point to at least four positive effects of transformer neural networks with “expert” layers. The first has already been mentioned above: the ability to work with a larger number of parameters than traditional transformers.
Sergey Markov
Managing Director at the Department of Experimental Machine Learning Systems, SberDevices
Kirill Danilyuk from Yandex compared the development with convolutional neural networks, which come pre-configured for their task (that is, they carry strong inductive biases) and are currently used in autonomous vehicles.
Kirill Danilyuk
Head of Computer Vision in Autonomous Vehicles at Yandex
The 15 billion parameters of V-MoE neural networks are probably far from the limit if this architecture continues to develop along the same lines as its “parent”, MoE. Grigory Sapunov gave several examples of text models that continue the series started by the experts from SberDevices and Yandex: the language MoE models presented in December by Meta AI with 1.1 trillion parameters, and Google's GLaM with 1.2 trillion.
The second effect is the lower processing power needed to work with an equivalent amount of data. ICT.Moscow's interlocutors see this effect from two angles. Kirill Danilyuk from Yandex views MoE as a tool for slowing the growth of required computing power as the number of parameters and amount of data increase (compared to traditional transformers). Grigory Sapunov from Intento points to a direct comparison of GLaM (an MoE neural network) with GPT-3 (a comparable AI model without “experts”). Exact figures are hard to pin down, but, according to the developer, in 5 out of 7 task categories “GLaM significantly outperformed GPT-3”.
Grigory Sapunov
Co-founder and CTO at Intento
Thus, the four positive effects of MoE networks compared to traditional transformers are directly interconnected.
But what does this mean in practice: where can MoE neural networks, and V-MoE in particular, actually be applied?
Carlos Riquelme from Google Brain is confident that there are no fundamental limits on where neural networks with “experts” can be applied in the future.
Moreover, MoEs could be helpful for multi-task setups where some tasks may interfere with learning others, and MoE paths could make it easy to isolate task-specific parts of the network (learning to play game 2 doesn't make you forget how to play game 1).
Carlos Riquelme
Research Scientist at Google Brain
At the current stage, developers are mostly taking a wait-and-see attitude, watching what specific effects this architecture can deliver and how easy it is to work with compared to the tools they already use. This also applies to professionals working on AI voice interfaces.
Segmentation has been worked on before, so we cannot say this is a new approach to working with data. Applying this approach inside data processing algorithms is another matter. Profiling and narrow segmentation increase the detail and quality of analysis in a particular area, but it is important not to lose context (the big picture). Most likely, the Google team managed to find the best approach. We will be watching their results and how the chosen architecture develops.
Julia Mitskevich
COO of the Design and Development team for conversational products at TORTU
Experts who work with static images of the kind cited by the Google Brain developers are also in no hurry to draw specific conclusions about the applicability of V-MoE to their tasks.
Vlad Vinogradov
CTO at EORA Data Lab
This also applies to developers involved in neural networks for the analysis of medical images.
Aleksander Gromov
Computer Vision Team Lead at Third Opinion
The task of biometric recognition is more difficult than working with static images. But Speech Technology Center (STC), which develops such systems, is confident that such a neural network architecture will be applicable in their work.
In the future, the “dynamic networks” approach can be extended so that the algorithm works even better with the faces of minors, the elderly and various ethnic groups (relevant for biometric video identification systems in Safe City and Smart City projects). According to our forecasts, the “dynamic networks” approach will continue to be actively used in computer vision.
Dmitry Dyrmovsky
CEO at STC Group
Finally, the most difficult scenario is running such neural networks “in the field” to solve problems in real time. Autonomous vehicles are a telling example here: they not only work with images in real time but must also be as reliable as possible. Whether V-MoE has prospects for such problems is still an open question.
So from the point of view of self-driving cars, this development is still far from practical application. Moreover, these transformers solve the problem of classifying objects in a frame, which is of little interest for self-driving cars; we need to solve a more complex problem, detection. That is, we need to know not only what kind of object it is, but also where it is located, and to predict its movement in three-dimensional space. In other words, we again need data not only from cameras but also from other sensors.
It is also worth noting that the goal of a model “in production” is not to show the best metrics on some dataset (as Google does on JFT-300M and ImageNet). The goal is to see the required types of objects reliably enough, as quickly as possible, and as efficiently as possible in terms of resources.
Kirill Danilyuk
Head of Computer Vision in Autonomous Vehicles at Yandex
ICT.Moscow's interlocutor emphasized that even with the reduced computing requirements, “we could probably solve our problems right now if we had a data center or a cluster of several powerful computers on board the vehicle.” In other words, even with MoE, transformer neural networks still require a great deal of computation and data.
Visualization of data collected by lidar on an unmanned vehicle. Source: Yandex
The question of data also remains open. Collecting text for transformer networks posed no problems: in that case the dataset is labeled completely automatically. Static images for V-MoE are harder, but the Google Brain developers handled the task and assembled the JFT-300M dataset. Automating dataset assembly for autonomous driving tasks has not yet been solved by existing methods.
Google came up with an interesting way to automatically collect data for their class of tasks. How to automatically collect labels from cameras and sensors for autonomous driving tasks is still an open question. When the world figures out how to assemble good, high-quality, automatic labeling for this class of tasks, we will probably get closer to starting to use transformers.
In any case, we believe that V-MoE neural networks will eventually reach real-time applications, that is, they will be used to solve problems “here and now”. But whether this will happen in self-driving cars is not yet certain.
Kirill Danilyuk
Head of Computer Vision in Autonomous Vehicles at Yandex
According to Sergey Markov from SberDevices, working with vision transformers, including those built on the MoE architecture, will in the near future remain primarily the preserve of large research teams.
Sergey Markov
Managing Director at the Department of Experimental Machine Learning Systems, SberDevices
However, we can still try to determine in which direction this architecture will develop further.
In the very short term, this is still about scaling; there is room for development. A little further out on the horizon, it is more about modularization and multimodality: multimodal MoEs will certainly be interesting when they appear.
Grigory Sapunov
Co-founder and CTO at Intento
A distinctive feature of multimodal neural networks is that they are trained in parallel on several types of data: for example, text and images. One example is DALL-E, a transformer that generates images from text descriptions. It is logical to assume that the multimodal approach will also be of interest to autonomous vehicle developers since, as Kirill Danilyuk noted, their task likewise involves several types of data (from cameras, lidars and radars) that should be used within a single network for a single prediction task.
If MoE neural networks do indeed develop in the directions Grigory Sapunov described (modularization, multimodality and scaling), this will likely reinforce a trend Igor Pivovarov discussed with ICT.Moscow earlier: the emergence of Foundation Models. The idea is that you no longer even need to write code: you take a ready-made model, adapt it to your tasks and get a working solution.
Grigory Sapunov
Co-founder and CTO at Intento