To talk about self-supervised learning, we first need to recap what we do when we train a model in the supervised setting. We use labeled data to train the model for a classification or regression task by iteratively minimizing the difference between the model's predictions and the target labels. Recapping this process makes it clear that model performance depends heavily on the quantity and quality of the training dataset.
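To make "iteratively minimizing the difference between predictions and labels" concrete, here is a minimal sketch of supervised training as plain gradient descent on a linear regression. The toy dataset, the learning rate, and the number of steps are all assumptions picked for illustration, not values from any real project.

```python
import numpy as np

# Toy labeled dataset (hypothetical): y = 3x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.05, size=100)

def mse(w, b):
    """Mean squared error between the predictions and the labels."""
    return float(np.mean((w * x + b - y) ** 2))

w, b = 0.0, 0.0            # model parameters, initialized arbitrarily
learning_rate = 0.1        # assumed value, chosen for this toy problem
initial_loss = mse(w, b)

for step in range(500):
    error = w * x + b - y                 # prediction minus label
    grad_w = 2.0 * np.mean(error * x)     # d(MSE)/dw
    grad_b = 2.0 * np.mean(error)         # d(MSE)/db
    w -= learning_rate * grad_w           # step against the gradient
    b -= learning_rate * grad_b

final_loss = mse(w, b)
print(f"w={w:.2f}, b={b:.2f}, loss {initial_loss:.3f} -> {final_loss:.4f}")
```

With fewer or noisier labeled points, the recovered `w` and `b` drift further from the true values, which is the dependence on data quantity and quality described above.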
If we only have a little data in our domain
It is not always easy to obtain a large labeled dataset, and labeling can cost a lot of time and money when it requires expertise from a specific domain.
To address the lack of labeled data in our domain, there is an approach called transfer learning. First, we train our model on a large existing labeled dataset, even if it is only loosely related to our domain. Then we retrain this well-trained model on our limited dataset, usually keeping the lower layers of the model fixed. This process is called fine-tuning. The idea is that the model learns how to extract features from the large existing dataset and then adapts them to our domain. This also helps avoid the overfitting we would risk by training the model directly on a small domain dataset.
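The fine-tuning step can be sketched in a few lines of NumPy. This is only an illustration under loud assumptions: `W_lower` stands in for lower-layer weights learned on a large dataset (here it is just random), the small "domain" dataset is synthetic, and the learning rate and step count are assumed values. The point is that only the new head is updated while the lower layer stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for lower-layer weights learned on a large source dataset.
# (Random here, purely for illustration; a real model would load them.)
W_lower = rng.normal(size=(4, 8))
W_lower_before = W_lower.copy()

# Small labeled dataset from "our domain" (hypothetical toy data).
X = rng.normal(size=(40, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def features(inputs):
    """Frozen lower layer: a linear map followed by ReLU."""
    return np.maximum(inputs @ W_lower, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w_head, b_head):
    """Binary cross-entropy of the head on top of the frozen features."""
    p = sigmoid(features(X) @ w_head + b_head)
    eps = 1e-9
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

# Only the new head is trainable.
w_head = np.zeros(8)
b_head = 0.0
initial_loss = loss(w_head, b_head)

H = features(X)   # computed once, because the lower layer never changes
for step in range(300):
    p = sigmoid(H @ w_head + b_head)
    w_head -= 0.05 * (H.T @ (p - y) / len(y))   # update the head weights
    b_head -= 0.05 * float(np.mean(p - y))      # update the head bias

final_loss = loss(w_head, b_head)
print(f"head loss {initial_loss:.3f} -> {final_loss:.3f}")
```

In a deep learning framework the same effect is usually achieved by marking the lower layers as non-trainable, so the optimizer only touches the head.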
But it is easy to see that the dataset used for pre-training does not necessarily have a direct relationship with our domain. Here an idea emerges: “what if we pre-train our model on an unlabeled dataset?”
What if the data the pre-trained model learns from does not have to be labeled?
That is the promise of self-supervised learning: in many domains, getting a gigantic unlabeled dataset is much easier than getting a labeled one.
How does self-supervised learning work?
We take the unlabeled data and force the model to learn feature representations from the data we mainly care about. The goal is to obtain good feature representations that we can reuse when fine-tuning on the downstream task.
How does the model learn a great representation from data?
A good example of learning a good representation is rotation prediction: we rotate images by several different angles and let the model learn to predict the angle by which each image was rotated.
To classify the rotation angle, the model must learn something about the content of the image, such as that the legs must connect to the body or that the head must be in front of the body.
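The rotation task above needs no human labels, because the label is created by the rotation itself. Here is a minimal sketch of how such training pairs could be generated. It assumes square images and only rotations by multiples of 90 degrees (so `np.rot90` is enough); the helper name `make_rotation_batch` and the random "images" are hypothetical stand-ins.

```python
import numpy as np

def make_rotation_batch(images, rng):
    """Rotate each square image by a random multiple of 90 degrees.

    Returns the rotated images and the rotation class (0, 1, 2, 3 for
    0, 90, 180, 270 degrees), which serves as a free label.
    """
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack(
        [np.rot90(img, k=int(k)) for img, k in zip(images, labels)]
    )
    return rotated, labels

rng = np.random.default_rng(0)
images = rng.random((8, 32, 32))   # 8 unlabeled grayscale "images" (random stand-ins)
rotated, labels = make_rotation_batch(images, rng)
print(rotated.shape, labels)
```

A classifier trained on `(rotated, labels)` is then solving an ordinary 4-class problem, and its learned lower layers are what we would carry over to the downstream task.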
Raw, noisy data is much easier to obtain than formally labeled data. In a domain like NLP, the model has to be pre-trained starting from one-hot vectors, which means the input is discrete in every vector element, unlike an image dataset. A feature representation, by contrast, is a continuous vector that carries mathematical meaning. Transforming discrete vectors into continuous ones takes much more data, because the elements of a one-hot vector do not encode any relationship between one another. We therefore need to provide much more data during training to give the model more information, and that is why we need self-supervised learning.
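The gap between one-hot vectors and continuous representations can be seen in a tiny example. This is a sketch with a made-up three-word vocabulary, and the embedding values are hand-picked for illustration rather than learned. In practice they would have to be learned from a lot of text, which is the point made above.

```python
import numpy as np

vocab = ["cat", "dog", "car"]      # hypothetical toy vocabulary
one_hot = np.eye(len(vocab))       # row i is the one-hot vector of word i

# Any two distinct one-hot vectors have dot product 0, so "cat" is
# no closer to "dog" than it is to "car".
cat_dog = float(one_hot[0] @ one_hot[1])
cat_car = float(one_hot[0] @ one_hot[2])

# An embedding matrix maps each discrete word to a continuous vector.
# The values below are hand-picked for illustration, not learned.
embedding = np.array([
    [0.9, 0.1],    # cat
    [0.8, 0.2],    # dog
    [0.1, 0.9],    # car
])

# Multiplying a one-hot vector by the matrix simply selects one row.
cat_vec = one_hot[0] @ embedding

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_cat_dog = cosine(embedding[0], embedding[1])
sim_cat_car = cosine(embedding[0], embedding[2])
print(cat_dog, cat_car, round(sim_cat_dog, 3), round(sim_cat_car, 3))
```

In the one-hot space every pair of words is equally unrelated, while in the continuous space "cat" ends up closer to "dog" than to "car".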