Longread: will LO-shot revolutionize machine learning?

November 17, 2020, 10:03 (UTC+3)


In October, MIT described a new approach to training artificial intelligence (AI) algorithms: “less than one”-shot, or LO-shot, learning, which in theory could fundamentally change the approach to machine learning (ML). Its main feature is that an AI model trained with this technique can recognize more objects than were presented in the training dataset. In other words, training neural networks would not require as large a data sample as has usually been needed.

ICT.Moscow and Russian AI and ML experts discussed how revolutionary this approach really is, whether it has real prospects for changing the field of machine learning, or whether it is nothing more than a curious experiment. The experts explained what the LO-shot approach is all about, how applicable it is today or may be in the future, and what might hinder its application.

The method is not new, but promising

*This block contains a lot of technical information. The main barriers and prospects for using LO-shot learning are presented in the following sections.

Most of the experts agreed that the technique can be promising. In particular, Alexander Gromov, Head of the Computer Vision Department at “Third Opinion”, noted that the LO-shot principle is an interesting concept that can potentially be applied to certain areas of AI.

But it is still too early to say that it will fundamentally change the approach to machine learning.

This is an interesting experiment, but certainly not a revolution in ML. Moreover, the concept of the experiment is not new: there is a branch of data science research devoted to Few-shot and One-shot learning. The author of the article went further and proposed LO-shot learning, in which we artificially construct “super informative examples” for training. There is also a brilliant article on knowledge distillation written in 2015 by Geoffrey Hinton.

The approaches described in the article are based on the beautiful and simple idea of using soft-targets, which are more informative for training ML-models, instead of using hard-targets.

We can explain this idea with a simple example: during training, we do not tell ML-models: “This is a cat”. We say that this is a ‘cat’, but it looks a little like a ‘lynx’, just a bit like a ‘dog’, and not at all like a ‘plane’. This way, a single example transfers much more knowledge about the modeled world to the model. People can also build such relationships and learn from them: for example, a unicorn is a horse, but also just a little like a rhino; a pegasus is a horse, but shares some features with a bird.

Yaroslav Shmulev

Head of the machine learning group, Jet Infosystems
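The soft-target idea can be made concrete with a minimal Python sketch. The class names follow the cat example above; the probability values and the helper `cross_entropy` are illustrative assumptions and are not taken from the article under discussion. The sketch compares two hypothetical model outputs that are equally confident about ‘cat’ but disagree about what else the picture resembles.

```python
import math

CLASSES = ["cat", "lynx", "dog", "plane"]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Hard target: "this is a cat" and nothing else.
hard_target = [1.0, 0.0, 0.0, 0.0]
# Soft target: mostly cat, a little lynx, a bit of dog, almost no plane.
# These numbers are made up for illustration.
soft_target = [0.70, 0.20, 0.09, 0.01]

# Two hypothetical predictions, both 70% sure about "cat".
pred_lynx_like = [0.70, 0.20, 0.09, 0.01]   # thinks a cat resembles a lynx
pred_plane_like = [0.70, 0.01, 0.09, 0.20]  # thinks a cat resembles a plane

hard_a = cross_entropy(hard_target, pred_lynx_like)
hard_b = cross_entropy(hard_target, pred_plane_like)
soft_a = cross_entropy(soft_target, pred_lynx_like)
soft_b = cross_entropy(soft_target, pred_plane_like)

print("hard target losses:", hard_a, hard_b)
print("soft target losses:", soft_a, soft_b)
```

With the hard target both predictions receive the same loss, so the model gets no signal about whether a cat is closer to a lynx or to a plane. With the soft target, the prediction that confuses a cat with a plane is penalized more.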

This technique is generally as promising as data compression (as an approach). I know people who do not listen to music in mp3 because the loss of information is uncomfortable for them, but I cannot imagine a robot that will complain that it is taught using not all the available data, but using such “residues” with a high concentration of content. If it really works, that is pretty cool.

Peter Emelianov

R&D Director, UBIC Technologies

There is a range of approaches to training: Zero-shot learning, One-shot learning, Few-shot learning, conventional training on large datasets. This technique occupies a position somewhere in the middle between the first two. All of these approaches have their own achievements, so, in this regard, the current work is more likely to complement the existing landscape. This is a further development of the works of the same authors from 2019, where they have already described the possibility of training a neural network when the number of examples is fewer than the number of classes. The current work is more of a theoretical study of this new approach and provides a basis for understanding it.

Both articles are a development of a 2018 work by other authors titled Dataset Distillation. All of them focus on “downsizing” the existing dataset and replacing it with a small artificial synthesized dataset.

Grigory Sapunov

Co-founder and CTO, Intento
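How can fewer examples than classes be enough? The toy sketch below, written under stated assumptions, shows two synthetic prototypes on a line that together separate three classes. The positions, the soft-label values and the inverse-distance weighting rule in `classify` are illustrative choices made for this example and should not be read as the authors' implementation.

```python
def classify(x, prototypes, eps=1e-9):
    """Distance-weighted soft-label prototype classifier (toy version).

    Each prototype votes with its whole soft label; closer prototypes
    get a larger weight. The class with the highest total score wins.
    """
    n_classes = len(prototypes[0][1])
    scores = [0.0] * n_classes
    for position, soft_label in prototypes:
        weight = 1.0 / (abs(x - position) + eps)
        for c, p in enumerate(soft_label):
            scores[c] += weight * p
    return max(range(n_classes), key=scores.__getitem__)

# Two prototypes, three classes (0, 1, 2). Values are made up.
prototypes = [
    (0.0, [0.6, 0.4, 0.0]),  # mostly class 0, partly class 1
    (1.0, [0.0, 0.4, 0.6]),  # mostly class 2, partly class 1
]

predictions = [classify(x, prototypes) for x in (0.1, 0.5, 0.9)]
print(predictions)  # -> [0, 1, 2]
```

Neither prototype is labeled mainly as class 1, yet points in the middle of the line are assigned to it, because both prototypes contribute their share of class 1 there. In this construction the middle class wins roughly between x = 1/3 and x = 2/3.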

Igor Korsakov from Webiomed, a developer in the field of medicine, notes that this work also builds on and extends approaches to developing machine learning algorithms that are already used in the industry.

Synthetic data generation has been used for quite a long time in cases where there is not enough data in the dataset for deep learning. The LO-shot approach also uses the generation of synthetic data (with the involvement of a person), but, unlike the existing methods, here the data is reduced.

Igor Korsakov

Machine learning expert, Webiomed

There has always been unsupervised learning in machine learning - learning without a teacher. It makes it possible to structure a dataset into classes without any labeling at all. Actually, the KNN method uses this principle. What is new in this case is that some of the classes will be labeled, and some will be semi-supervised.

Anton Balakirev

Founder of Celado AI

However, in comparison with more traditional approaches, experts see a number of advantages and a positive trend in the development of this area.

The LO-shot principle is quite interesting because it shows that using probabilistic labels makes it possible to get higher training accuracy from a few generated training prototypes obtained by distilling the entire training set. In fact, the authors build on the previously proposed approach of distilling the examples of each class into one prototype: they propose training not only the parameters of a class prototype but also its probabilistic label distribution, which lets a single prototype carry characteristics of several classes. The attractive side of this method is the possibility of obtaining good training data free of privacy-related restrictions, as the prototypes do not explicitly contain any information about the subjects.

Sergey Milyaev

Head of research projects, VisionLabs

In machine learning the rule almost always is “the more the better”. In reality (especially in business), clients often have little data, but the required algorithm quality is 99%. Thus, the general trend in machine learning is the competent reuse of open data and additional training on a smaller number of target examples that illustrate a specific required skill. There already are Few-shot learning, One-shot learning and Zero-shot learning, and now Less-than-one has appeared. The result of the work is interesting: it fits very well into the general trend and meets the general desire to reduce costs and speed up the training of algorithms.

Tatiana Shavrina

Head of R&D in NLP, Sberbank

Nevertheless, not all experts view LO-shot learning positively. For example, Dmitry Nikolaev, CTO of Smart Engines, called the article “theoretically odd”.

Simply speaking, it is obvious to a person that the Sun revolves around the Earth, but this is fundamentally wrong. Obviously, training on N classes requires at least N examples, but this is not true either. From the conclusion in the article it is clear that the authors themselves do not attach any applied significance to it - this is a pure scientific sport. The article is certainly interesting, but it is not a breakthrough, it describes research, not technology.

Dmitry Nikolaev

CTO, Smart Engines

I cannot say that it looks like something very unique and useful. At best, it is a small and useful study on dataset distillation, but so far it looks not really useful from a practical point of view. Perhaps, some more practical modifications will appear in the future, but I would not be too optimistic about it.

Alexey Tikhonov

Leading analyst, Yandex

This raises the question of what barriers could hinder the development of the new approach to machine learning.

Modify the method and determine the data

Most experts named the same key factor that raises doubts about the prospects of LO-shot: the method has not yet been proven in practice.

The method is still immature, and too much additional research has to be done before it can be used in development. The main reason is that, for now, it is large datasets that are being distilled. It is not clear what to do if there is no big data for a specific task.

Roman Doronin

CEO, EORA

The main limiting factor is obvious - so far, one experiment has been performed using one very narrow dataset. Perhaps in other cases the application of this method will not be so easy.

Valeriy Babushkin

Machine Learning Expert, WhatsApp Integrity on Facebook

It is too early to talk about the limiting factors for the application of this approach because of the insufficiency of both its description and the scope of application.

Igor Korsakov

Machine learning expert, Webiomed

So far, the main limiting factor is the lack of publications on the accuracy and effectiveness of this method. We need to increase the number of studies and publications in peer-reviewed scientific literature for us to try this technology in practice.

Alexander Gusev

Chief Business Development Officer, Webiomed

Results on large training samples with a large number of classes (for example, in face recognition) have not been demonstrated yet, while this is one of the important indicators of the wide applicability of new methods. When recognizing objects that are very similar to each other, combining classes into one prototype can deprive the model of the ability to correctly distinguish objects of close classes.

Sergey Milyaev

Head of research projects, VisionLabs

Natural Language Processing expert Tatiana Shavrina explains the idea: not every small dataset is suitable for LO-shot.

The main limiting factor is that not all small data is equally good. Most commonly, small datasets that provide training quality comparable to big data training are very carefully calibrated and sorted. In fact, this is the same big data, but artificially reduced, and not just “small data that our accountant has accumulated in a month”, as it happens in the industry.

Tatiana Shavrina

Head of R&D in NLP, Sberbank

Therefore, preparing the correct dataset is likely to require more resources than collecting a standard dataset.

In practice, there are several pitfalls here. First, the accuracy of the resulting classifier (so far this approach is used mainly for training classifiers) is quite poor. Not critically poor, but it is still noticeable. Second, the dataset distillation procedure is sophisticated and resource-demanding.

In fact, first you need to collect a dataset, and only then get a lighter one from it. It is not quite clear how you can get a lighter dataset right away. As a result, these methods have started to be applied to a limited extent, but I would not expect mass implementation yet.

Grigory Sapunov

Co-founder and CTO, Intento

Alexander Gromov from “Third Opinion” adds that problems can arise even if there is data, but it is not prepared correctly.

The main limiting factors are the instability of this approach in the case of “noisy” data, the ambiguity of producing soft labels (distribution vectors describing the class - a term from the article) and the complexity of this process in the case of large data sets.

Alexander Gromov

Head of the Computer Vision Department, “Third Opinion”

So, in fact, there is no clear algorithm for constructing “correct” datasets for the correct operation of LO-shot - this idea is explained by Yaroslav Shmulev from Jet Infosystems.

The main limiting factor is the lack of important pieces of the “puzzle” in this method. For example, there is no algorithm for constructing the “super examples”, which should also be independent of the ML-algorithm used. At the moment, LO-shot learning is not an approach, but an idea, since the author of the article does not have ready answers to critical questions that hinder its application:

• How to design those several examples for training neural networks?

• What is the complexity of the algorithm for generating such examples?

• What is the amount of data that will be needed for the operation of such an algorithm?

Yaroslav Shmulev

Head of the machine learning group, Jet Infosystems

Finally, according to Adam Turaev from Cleverbots, even if everything is done correctly, the method cannot be applied to all tasks.

The main limiting factor is that in some tasks it is impossible to define intermediate characteristics - it is not always possible to say that class C is something in between classes A and B. And this idea is the basis of LO-shot learning. And, of course, so far there are no cases that demonstrate how well this approach works with data of a different nature: texts, images, etc.

Adam Turaev

Strategy Director, Cleverbots

Peter Emelianov, R&D Director of UBIC Technologies, highlighted the aspect of demand for the new approach.

The main limiting factor is the number of people who may be interested in the approach. It seems to me that this method is still a niche thing: I am interested because it can potentially improve my product. But I do not think that everyone will immediately rush to “make a revolution”.

Peter Emelianov

R&D Director, UBIC Technologies

Anton Balakirev from Celado AI draws a disappointing conclusion.

There will be no revolution. This is another algorithm, which in some cases will provide a result a little faster than, for instance, classical neural networks.

Anton Balakirev

Founder of Celado AI

But is it true that this method is really useless and cannot be applied to any tasks?

Applicable almost everywhere, but there are caveats

Some experts believe that the method is viable, if it is modified and tested on a sufficient number of practical cases.

It still cannot be called a revolution, but it is a very promising area that can potentially accelerate the development and creation of machine learning algorithms. The approach seems to be very versatile, but it is too early to talk about it. It can be suitable for many types of problems.

Valeriy Babushkin

Machine Learning Expert, WhatsApp Integrity on Facebook

Sergey Milyaev from VisionLabs highlights an important aspect: the method can be applied to neural networks of different architectures - it is not tied to a specific ML stack. But from the point of view of the tasks performed, it is necessary to take into account some problems that may arise.

The underlying concept of the method makes it possible to apply it to any task that requires a pattern recognition model for multiple classes, regardless of the architecture of the chosen neural network model. The main requirement is that the loss function must be differentiable. However, the effectiveness of the application, at first glance, depends heavily on the final task.

The authors of the work showed that when recognizing digits, it is possible to create several prototypes whose features and probabilistic labels contain the main properties of each digit. This seems logical, since the number of digit classes is limited and their images are redundant. But when this idea is adapted to face recognition, where the number of classes can reach a million and the differences between classes can be minimal (classes of twins or very similar people), the resulting class prototypes may lose the features needed to train the model to tell apart people who look alike.

Sergey Milyaev

Head of research projects, VisionLabs
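The differentiability requirement can be illustrated with a short check. The sketch below assumes a softmax output with a cross-entropy loss, which is a common choice but is not named in the quote; the logits, the soft label and all function names are hypothetical. It shows that the loss against a soft label has a simple gradient with respect to the logits and confirms the formula numerically.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft label and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs))

def analytic_gradient(logits, target):
    """d(loss)/d(logits) = softmax(logits) - target, if target sums to 1."""
    return [p - t for p, t in zip(softmax(logits), target)]

def numeric_gradient(logits, target, h=1e-6):
    """Central finite differences, used only to check the formula above."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += h
        down[i] -= h
        diff = soft_cross_entropy(up, target) - soft_cross_entropy(down, target)
        grads.append(diff / (2 * h))
    return grads

logits = [2.0, 0.5, -1.0]     # made-up model outputs for three classes
soft_label = [0.6, 0.4, 0.0]  # made-up probabilistic label

exact = analytic_gradient(logits, soft_label)
approx = numeric_gradient(logits, soft_label)
max_error = max(abs(a - b) for a, b in zip(exact, approx))
print("largest difference between the two gradients:", max_error)
```

Because the gradient exists for any label distribution, the same training loop works whether the labels are one-hot or probabilistic, which is consistent with the point that the method is not tied to a particular architecture.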

Peter Emelianov from UBIC Technologies believes that the new approach can be applied in the field of cryptographic data protection.

We work with data privacy and develop methods for protecting data in use, including in machine learning tasks. Data has three states: at rest, in transit and in use. Cryptographic protection for the first two is well known and used everywhere, while data in the third state is the most vulnerable. In other words, we train models on encrypted data. The good news is that it works and your data is always protected; the bad news is that models for cryptography tasks take twice as long to train. Therefore, any technology that can somehow speed up the process is interesting for us. I think we will experiment with it.

Peter Emelianov

R&D Director, UBIC Technologies

NLP experts see the potential of the new approach in working with texts and voice, but they treat it with caution.

In many areas, in particular in text, image and video processing, this approach is potentially applicable. It is too early to talk about a specific application, because it is always possible to make a model faster, cheaper, easier, but you need to check how good and stable the quality is.

Tatiana Shavrina

Head of R&D in NLP, Sberbank

The problem raised by this technology is very urgent: it is usually difficult to find good data for non-classical tasks, and doing so is a very labour-intensive process. For our company this problem is also relevant.

But it is important to understand that this is not a silver bullet, and no matter how cool the name “Less-than-one”-shot learning sounds, it does not mean that data is no longer needed. Moreover, we still cannot abandon the principle “the more data, the better”. We believe there is an opportunity to try a similar approach for some of our tasks, but we would not expect incredible results.

Adam Turaev

Strategy Director, Cleverbots

However, for developers in the field of document recognition LO-shot learning may simply not be of interest, notes Dmitry Nikolaev, CTO of Smart Engines.

In document recognition the task is not to significantly reduce the volume of datasets. We do not have many classes to reduce. Even the recognition of Chinese characters does not require this. On the contrary, we solve the problem of insufficient real data (when we want to increase the number of examples for each class, because the classes are very variable) by generating synthetic (artificial) examples.

Dmitry Nikolaev

CTO, Smart Engines

Nevertheless, there is an area where the accuracy of algorithms is such a critical factor that relying on artificially generated data and adopting an approach with “floating” characteristics can be dangerous. This area is medicine.

The current version of this method can be used for some clustering tasks. Unfortunately, it cannot be applied to our problems and tasks.

Alexander Gromov

Head of the Computer Vision Department, “Third Opinion”

I did not see any clear evidence of high efficiency in this method and drew the following conclusions for myself. For medicine, this approach is rather inapplicable. Maybe it can help in some tasks in computer vision, but I doubt it. This is due to insufficient elaboration, a lack of successful public experience in the community, and skepticism about the architecture of the solution. Also, in medicine there are very high quality requirements and many “corner cases” (cases outside the standard models). This approach is unlikely to cover all these aspects.

Konstantin Schetkin

Deep learning engineer, Care Mentor AI

The authors did not provide any examples of real areas for specific use in AI tasks. In medicine there are strict requirements both for the machine learning algorithms and for the datasets that are used for training. For example, to pass clinical trials, datasets must contain data from at least 20-50 medical organizations, and each organization should provide data on about 500-1000 patients. Therefore, it is impossible to talk about artificial data creation, including synthetic generation.

Igor Korsakov

Machine learning expert, Webiomed
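The scale implied by these figures is easy to work out. The small sketch below uses only the ranges quoted above; the variable names are illustrative.

```python
# Ranges quoted above for datasets used in clinical trials.
organizations = (20, 50)            # minimum and maximum number of organizations
patients_per_organization = (500, 1000)

min_records = organizations[0] * patients_per_organization[0]
max_records = organizations[1] * patients_per_organization[1]

print(min_records, max_records)  # -> 10000 50000
```

In other words, the requirement amounts to roughly ten to fifty thousand patient records, which is the opposite end of the scale from training on fewer examples than classes.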

However, Alexander Gusev, Chief Business Development Officer at Webiomed, still provides one example where the new approach can theoretically be applied to medical problems.

In medicine there are specific tasks where it is impossible to obtain datasets with a large number of records, for example, rare (orphan) diseases. Given their prevalence, it is almost impossible to get datasets with thousands of records simply because there are not that many patients. Therefore, the presented approach can be very promising in healthcare precisely for such rare diseases.