We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU).
This paper proposes an adaptive dynamic programming-based adaptive-gain sliding mode control (ADP-ASMC) scheme for a fixed-wing unmanned aerial vehicle (UAV) subject to both matched and unmatched disturbances.
Computer vision applications such as visual relationship detection and human object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (triplet as a whole) are to be detected in a hierarchical fashion.
Deep convolutional neural networks (CNNs) have demonstrated remarkable success in computer vision by learning strong visual feature representations under full supervision.
The formulation of the learning objective is essential to the success of convolutional neural networks.
In particular, existing deep learning methods mostly assume class-balanced or only moderately imbalanced data in model training, and ignore the challenge of learning from significantly imbalanced training data.
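The imbalance problem described above is often first countered by re-weighting the loss so that rare classes are not swamped by majority classes. The sketch below computes inverse-frequency class weights; it is a generic illustration under our own assumptions, not the specific rectification scheme proposed in any of the papers summarised here.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights inversely proportional to class frequency.

    Rare classes receive larger weights, so their loss terms count
    more during training. Normalised so that a perfectly balanced
    dataset yields weight 1.0 for every class. Illustrative only.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    # weight_c = total / (num_classes * count_c)
    return {c: total / (num_classes * n) for c, n in counts.items()}

# Toy imbalanced label set: class 0 outnumbers class 1 three to one
weights = inverse_frequency_weights([0, 0, 0, 0, 0, 0, 1, 1])
```

Such weights are typically passed to a weighted cross-entropy loss; more elaborate schemes (hard-example mining, incremental rectification) build on the same idea.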
Recognising detailed facial or clothing attributes in images of people is a challenging task for computer vision, especially when the training data are both very large in scale and extremely imbalanced across attribute classes.
Recognising detailed clothing characteristics (fine-grained attributes) in unconstrained images of people in-the-wild is a challenging task for computer vision, especially when only limited training data from the wild are available whilst most data for model learning are captured in well-controlled environments using fashion models (well lit, no background clutter, frontal view, high resolution).