Kernel learning for visual perception

6 Dec 2019 · Chen Wang

The visual perceptual system of animals allows them to assimilate information from their surroundings. In artificial intelligence, the objective of visual perception is to enable a computer system to interpret its surrounding environment using data acquired from cameras and other auxiliary sensors. Since the last century, researchers in visual perception have delivered many technologies and algorithms for applications such as object detection and image recognition. Despite this technological progress, the performance of artificial visual perceptual systems remains unsatisfactory. One of the main reasons is that traditional methods usually rely on large amounts of training data and powerful processors, and require great effort and time for process modeling. The research goal of this thesis is to develop visual perceptual systems that require fewer computational resources yet deliver higher performance. To this end, novel kernel learning methods for several basic visual perceptual tasks, including object tracking, localization, mapping, and image recognition, are proposed and demonstrated both theoretically and practically.

In visual object tracking, state-of-the-art algorithms that leverage kernelized correlation filters are limited by circulant training data and non-weighted kernel functions. This makes them applicable only to translation prediction and prevents their use in other applications. To overcome these problems, a kernel cross-correlator (KCC) is introduced. First, by introducing the kernel trick, the KCC extends linear cross-correlation to non-linear space, which is more robust to signal noise and distortion. Second, connections to existing work show that the KCC provides a unified solution for correlation filters. Third, the KCC is not only applicable to any training data and kernel functions, but is also able to predict affine transforms with customized properties.
Last, by leveraging the fast Fourier transform (FFT), the KCC eliminates the direct calculation of kernel vectors, thus achieving better performance at a reasonable computational cost. Comprehensive experiments on visual tracking and on human activity recognition using wearable devices demonstrate its robustness, flexibility, and efficiency.

Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene. It is calculated from sequences of ordered images and allows motion to be estimated as instantaneous image velocities, which is crucial for autonomous robot navigation. This thesis proposes a KCC-based algorithm, named correlation flow (CF), to determine optical flow using a monocular camera. CF provides reliable and accurate velocity estimation and is robust to motion blur. In addition, a joint kernel scale-rotation correlator is proposed to estimate the altitude velocity and yaw rate, which are not available from traditional methods. Autonomous flight tests on a quadcopter show that CF can provide robust trajectory estimation with very low processing power.

In simultaneous localization and mapping (SLAM), traditional odometry methods resort to iterative algorithms, which are usually computationally expensive or require well-designed initialization. To overcome this problem, a KCC-based non-iterative solution for an RGB-D-inertial odometry system is proposed. To reduce the odometry and inertial drifts, two frameworks for non-iterative SLAM (NI-SLAM) are presented: one incorporates visual loop closure detection, and the other draws on ultra-wideband (UWB) technology. Dominated by the FFT, the non-iterative front-end has a complexity of only $\mathcal{O}(n\log n)$, where $n$ is the number of pixels. Therefore, both frameworks can provide reliable performance at very low computational complexity.
The map fusion is conducted by element-wise operations, so that both time and space complexity are further reduced. Extensive experiments show that, owing to the lightweight non-iterative front-end, both NI-SLAM frameworks run much faster while achieving accuracy comparable to the state of the art.

The convolutional neural network (CNN) is one of the most powerful tools in visual perception. It has enabled state-of-the-art performance in image recognition, object detection, and other tasks. However, little effort has been devoted to establishing convolution in non-linear space. In this thesis, a new operation, kervolution (kernel convolution), is introduced to approximate the non-linear behavior of the human perceptual system. It generalizes traditional convolution and increases model capacity without introducing more parameters. Similarly, kervolution can also be calculated through element-wise multiplication via the Fourier transform. Extensive experiments show that kervolutional neural networks (KNN) achieve better performance and faster convergence than traditional CNNs on the MNIST, CIFAR, and ImageNet datasets.

In summary, the thesis demonstrates the superiority of the proposed kernel tools, including KCC, CF, NI-SLAM, and KNN, for visual perceptual problems. These kernel tools may find use in further applications, such as the internet of things, robotics, transfer learning, and reinforcement learning.
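
The abstract attributes the efficiency of the KCC and of the NI-SLAM front-end to the FFT. As a rough illustration of that idea only, the sketch below computes a plain linear circular cross-correlation through the FFT and uses its peak to recover a translation. It is not the KCC itself: no kernel function is involved, and the function names `circular_cross_correlation` and `estimate_shift` are hypothetical, chosen for this example.

```python
import numpy as np

def circular_cross_correlation(x, z):
    """Correlate z against x over all circular shifts using the FFT.

    Entry k of the result equals sum_n z[(n + k) % N] * x[n], obtained in
    O(N log N) time instead of the O(N^2) cost of the direct sum.
    Assumption: plain linear correlation, not a kernelized correlator.
    """
    X = np.fft.fft(x)
    Z = np.fft.fft(z)
    # Cross-correlation theorem: element-wise product in the Fourier domain.
    return np.real(np.fft.ifft(Z * np.conj(X)))

def estimate_shift(x, z):
    """Return the circular shift that best aligns x with z."""
    return int(np.argmax(circular_cross_correlation(x, z)))

# Illustrative data: a random 1-D signal and a circularly shifted copy.
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
z = np.roll(x, 37)
shift = estimate_shift(x, z)
print(shift)
```

The correlation peak sits at the lag where the two signals align, so `shift` recovers the 37-sample translation applied by `np.roll`. The same element-wise product in the Fourier domain applies to 2-D images with `np.fft.fft2`, which is where an $\mathcal{O}(n\log n)$ cost in the number of pixels comes from.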
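
To make the claim that kervolution generalizes convolution without adding parameters more concrete, here is a minimal sketch that assumes a polynomial kernel; the abstract does not say which kernel functions the thesis uses. The function name `kervolve2d` and its arguments `degree` and `offset` are hypothetical, and both are treated as fixed hyperparameters rather than learned values.

```python
import numpy as np

def kervolve2d(image, weight, degree=2, offset=1.0):
    """Slide `weight` over `image` and apply a polynomial kernel per patch.

    Each output value is (<patch, weight> + offset) ** degree.
    With degree=1 and offset=0 this reduces to the ordinary
    (cross-correlation style) convolution used in CNN layers, and the
    learnable weights are the same in both cases.
    Assumption: the polynomial kernel is one illustrative choice.
    """
    kh, kw = weight.shape
    # All kh-by-kw patches, shape (H - kh + 1, W - kw + 1, kh, kw).
    patches = np.lib.stride_tricks.sliding_window_view(image, (kh, kw))
    # Inner product of every patch with the filter (the linear response).
    linear = np.einsum("ijkl,kl->ij", patches, weight)
    return (linear + offset) ** degree

# Illustrative data: a 5x5 ramp image and a 3x3 averaging filter.
image = np.arange(25, dtype=float).reshape(5, 5)
weight = np.full((3, 3), 1.0 / 9.0)

linear_out = kervolve2d(image, weight, degree=1, offset=0.0)  # plain convolution
poly_out = kervolve2d(image, weight, degree=2, offset=1.0)    # non-linear response
```

Here `linear_out` and `poly_out` share the same nine filter weights; only the patch-wise function changes, which is the sense in which model capacity can grow without extra parameters. `sliding_window_view` requires NumPy 1.20 or later.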
