What is important in contrastive learning and how it can help

Photo by Nadir sYzYgY on Unsplash

A large amount of labeled data is necessary for deep learning models to reach high accuracy and good performance. For academic purposes, we can use millions of labeled images from public datasets like ImageNet. In commercial projects, however, it is extremely difficult to collect such a large amount of labeled data because of time and budget limits.

Since data labeling is expensive, people tend to use labeled data more efficiently through unsupervised pre-training + supervised fine-tuning, rather than supervised training alone. According to some recent results from researchers at Google and Facebook, unsupervised pre-training +…
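As a rough illustration of that recipe, here is a minimal PyTorch sketch of contrastive pre-training on unlabeled images followed by supervised fine-tuning on a small labeled set. The tiny encoder, the random stand-in data, and the hyper-parameters are placeholders, not the setup used in the papers referenced above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style contrastive loss between two augmented views, each (B, D)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)     # (2B, D)
    sim = z @ z.t() / temperature                          # pairwise cosine similarities
    mask = torch.eye(z.size(0), dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))             # exclude self-similarity
    b = z1.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
proj = nn.Linear(256, 64)

# 1) Unsupervised pre-training: two augmented views of the same unlabeled images.
view1, view2 = torch.rand(16, 3, 32, 32), torch.rand(16, 3, 32, 32)   # stand-in batch
opt = torch.optim.Adam(list(encoder.parameters()) + list(proj.parameters()), lr=1e-3)
loss = nt_xent_loss(proj(encoder(view1)), proj(encoder(view2)))
opt.zero_grad(); loss.backward(); opt.step()

# 2) Supervised fine-tuning: reuse the pre-trained encoder, add a small task head.
images, labels = torch.rand(16, 3, 32, 32), torch.randint(0, 10, (16,))  # small labeled set
head = nn.Linear(256, 10)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
loss = F.cross_entropy(head(encoder(images)), labels)
opt.zero_grad(); loss.backward(); opt.step()
```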

Making object detection end to end

(source)

Object detection is a traditional task in computer vision. Since 2015, people have tended to use modern deep learning techniques to improve its performance. Although the accuracy of the models keeps getting higher, their complexity has also increased, mainly due to the various kinds of dynamic labeling used during training and to NMS post-processing. This complexity not only makes object detection models difficult to implement, but also hinders an end-to-end style of model design.
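For readers who have not seen it, the NMS step mentioned above is a small but representative piece of that extra machinery. Below is a minimal plain-PyTorch sketch of greedy NMS (in practice one would usually call torchvision.ops.nms); the IoU threshold is an arbitrary illustrative choice.

```python
import torch

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns indices of kept boxes."""
    order = scores.argsort(descending=True)      # process boxes from highest score down
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # Intersection-over-union between the top box and the remaining boxes.
        x1 = torch.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]          # drop boxes that overlap too much
    return keep
```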

Early methods (2015–2019)

Since 2015, various deep learning methods for object detection have been proposed, greatly influencing the field. The methods are…

Hands-on Tutorials

Leverage video frames with sparsely labeled data

(source)

The convolution layer is the basic layer of convolutional neural networks. Although it is widely used in computer vision and deep learning, it has several shortcomings. For example, the kernel weights are fixed for a given input feature map and do not adapt to local feature changes, so we need more kernels to model the complicated context of feature maps, which is redundant and inefficient. Moreover, the receptive field of an output pixel is always a rectangle; as a cumulative effect of layered convolutions, the receptive field keeps growing and includes background context unrelated to the…
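As a side note on that rectangular receptive field, a tiny calculation (my own illustration, with arbitrary layer settings) shows how quickly it grows as plain convolutions are stacked.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, in input-to-output order."""
    rf, jump = 1, 1          # receptive field size and spacing between adjacent outputs
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# e.g. five 3x3 convolutions with stride 1, then one more with stride 2:
print(receptive_field([(3, 1)] * 5))             # 11 -> an 11x11 square per output pixel
print(receptive_field([(3, 1)] * 5 + [(3, 2)]))  # 13
```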

How to solve optical flow using multi-scale correlations in an iterative manner

(source)

One of the most important problems in computer vision is correspondence learning. That is, given two images of an object, how do we find the pixels in the two images that correspond to the same points on the object?

Correspondence learning, mainly for videos, has broad applications in object detection and tracking, especially when objects in a video are occluded, change color, or deform.
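To make the idea concrete, here is a minimal sketch of an all-pairs correlation volume between two feature maps, one common building block for correspondence and optical flow methods. The feature shapes, the scaling, and the crude argmax matching rule are illustrative assumptions, not the exact method described in this article.

```python
import torch

def correlation_volume(feat1, feat2):
    """feat1, feat2: (C, H, W) feature maps. Returns (H*W, H*W) pairwise similarities."""
    C, H, W = feat1.shape
    f1 = feat1.reshape(C, H * W).t()        # (H*W, C): one feature vector per pixel
    f2 = feat2.reshape(C, H * W)            # (C, H*W)
    return (f1 @ f2) / C ** 0.5             # scaled dot-product similarity

feat1, feat2 = torch.randn(64, 16, 16), torch.randn(64, 16, 16)   # stand-in features
corr = correlation_volume(feat1, feat2)
# For each pixel in frame 1, the most similar pixel in frame 2 is a crude correspondence.
best_match = corr.argmax(dim=1)             # (H*W,) indices into frame 2
```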

What is optical flow?

Avoiding the numerical traps inside a deep neural network

(source)

Computer vision and deep learning form a rapidly growing field. Thousands of good papers appear at top conferences every year, and new drafts show up on arXiv every day. Although the number of papers is far too large for any individual to read them all, the ideas inside are largely shared among them: they are (almost) all based on convolutional neural networks (CNNs), which can be easily implemented using modern libraries such as PyTorch. …

The important design ideas behind U-Net, FPN, PSPNet and HRNet etc.

Photo by @canmandawe on Unsplash

In deep learning and computer vision, one of the most important tasks, perhaps the most important, is to learn a meaningful feature representation from images. The learned feature representations may be general-purpose or task-specific. General-purpose feature representations may be learned with unsupervised or self-supervised methods, such as auto-encoders [1]. Task-specific feature representations, as the name implies, are learned from the label domains of specific tasks, including classification, segmentation, object detection, and keypoint estimation.

Although the effort devoted to general-purpose feature representation learning has grown rapidly in recent years, it is still weak in…
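For reference, the auto-encoder mentioned above can be sketched in a few lines of PyTorch; the layer sizes and the random stand-in batch are arbitrary choices for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)                 # the learned general-purpose representation
        return self.decoder(z), z

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(16, 784)                     # stand-in for a batch of flattened images
recon, z = model(x)
loss = F.mse_loss(recon, x)                 # reconstruct the input; no labels needed
opt.zero_grad(); loss.backward(); opt.step()
```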

Understanding the concepts, methods, and online training-data mining in metric learning

Photo by Markos Mant on Unsplash

Imagine you have a database containing face images of 1,000 people, with only a few images per person. Now you want to build a face recognition system based on this dataset. How would you do it?

Build a classification model? No! Each person has only a few face images, which is far from enough for classification training.

Actually, in deep learning and computer vision, such a task has been well studied over the past 15 years; it is called metric learning.

Metric learning, as the name implies, is a technique that maps images to…
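As a rough preview of what such a mapping looks like in code, here is a minimal triplet-loss sketch in PyTorch, one common metric-learning objective. The embedding network, image sizes, and margin are placeholder assumptions, not the exact setup discussed in the article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy embedding network: maps a face image to a 128-d unit vector.
embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 128))

anchor   = torch.randn(8, 3, 112, 112)   # images of person A
positive = torch.randn(8, 3, 112, 112)   # other images of person A
negative = torch.randn(8, 3, 112, 112)   # images of other people

za, zp, zn = (F.normalize(embed(x), dim=1) for x in (anchor, positive, negative))
# Pull anchor and positive together, push the negative at least `margin` further away.
loss = F.triplet_margin_loss(za, zp, zn, margin=0.2)
loss.backward()
```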

A smooth version of cross entropy loss for quick convergence

Photo by Patrick Schneider on Unsplash

As I wrote in the last article of this series, focal loss is a more focused cross entropy loss. In semantic segmentation problems, focal loss helps the model focus on pixels that have not been well trained yet, which is more effective and purposeful than plain cross entropy loss. I recommend that article if you haven't read it yet.

In this article, I'd like to talk about a variant of focal loss that can be used as a distance-aware cross entropy loss in semantic segmentation problems, especially with sparse labels.

The issue with standard cross entropy loss

Cross entropy loss is typically used in…
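Before getting into the distance-aware variant, it may help to see plain focal loss next to standard cross entropy. The sketch below assumes PyTorch segmentation logits of shape (B, C, H, W); gamma and the mean reduction are arbitrary illustrative choices.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    """Down-weights well-classified pixels so training focuses on the hard ones."""
    ce = F.cross_entropy(logits, target, reduction='none')   # per-pixel CE, (B, H, W)
    pt = torch.exp(-ce)                                       # probability of the true class
    return ((1 - pt) ** gamma * ce).mean()

logits = torch.randn(2, 5, 64, 64, requires_grad=True)   # 5-class segmentation logits
target = torch.randint(0, 5, (2, 64, 64))
print(F.cross_entropy(logits, target))   # standard cross entropy
print(focal_loss(logits, target))        # focal variant, smaller on easy pixels
```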

An elegant method to group predictions without labeling

Photo by Alex Alvarez on Unsplash

In some computer vision and deep learning tasks, we need to predict all the results first and then split them into several individual results.

A common task in this spirit is multi-person pose estimation, in which the key points of all the people in an image are predicted first and then split into individual poses as the final prediction (Fig. 1).
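As a purely illustrative sketch of what grouping without labeling can look like, the snippet below gives each predicted keypoint a scalar tag embedding and greedily groups keypoints whose tags are close. The tag values, tolerance, and greedy rule are my own assumptions, not necessarily the method in this article.

```python
import torch

def group_by_tag(keypoints, tags, tol=0.5):
    """keypoints: (N, 2) coordinates; tags: (N,) embeddings. Returns a list of groups."""
    groups, centers = [], []
    for kp, tag in zip(keypoints, tags):
        # Assign to the nearest existing group if its running tag is within tol.
        if centers:
            dists = torch.stack([abs(tag - c) for c in centers])
            j = int(dists.argmin())
            if dists[j] < tol:
                groups[j].append(kp)
                centers[j] = (centers[j] + tag) / 2
                continue
        groups.append([kp])                 # otherwise start a new group (a new person)
        centers.append(tag)
    return groups

kps = torch.tensor([[10., 20.], [12., 25.], [80., 90.], [82., 95.]])
tags = torch.tensor([0.10, 0.15, 1.30, 1.28])        # two people -> two tag clusters
print([len(g) for g in group_by_tag(kps, tags)])     # [2, 2]
```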

Shuchen Du

Machine learning engineer based in Tokyo
