Along with providing baseline results for existing object detectors on FGVD Dataset, we also present the results of a combination of an existing detector and the recent Hierarchical Residual Network (HRN) classifier for the FGVD task.
We introduce Action-GPT, a plug-and-play framework for incorporating Large Language Models (LLMs) into text-based action generation models.
Pictionary, the popular sketch-based guessing game, provides an opportunity to analyze shared goal cooperative game play in restricted communication settings.
27 Sep 2022 • Kushagra Srivastava, Dhruv Patel, Aditya Kumar Jha, Mohhit Kumar Jha, Jaskirat Singh, Ravi Kiran Sarvadevabhatla, Pradeep Kumar Ramancharla, Harikumar Kandath, K. Madhava Krishna
Unmanned Aerial Vehicle (UAV)-based remote sensing systems incorporating computer vision have demonstrated potential for assisting building construction and disaster management tasks such as damage assessment during earthquakes.
At the representation level, we propose a global frame-based part stream approach as opposed to conventional modality-based streams.
Ranked #5 on Skeleton Based Action Recognition on NTU RGB+D
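A global frame-based part stream can be pictured as splitting each skeleton frame into body-part groups whose joints are all expressed relative to one shared root joint, instead of feeding separate modality streams. The joint indexing, part grouping, and function names below are illustrative assumptions, not the paper's actual configuration:

```python
# Illustrative sketch of part-based streams over a skeleton frame.
# PARTS and the joint indices are hypothetical, not the paper's layout.
PARTS = {
    "left_arm": [4, 5, 6],
    "right_arm": [8, 9, 10],
    "torso": [0, 1, 2, 3],
}

def part_streams(frame, root_index=0):
    """Split one skeleton frame (a list of (x, y, z) joints) into per-part
    streams, with every joint expressed in a global frame anchored at a
    single root joint rather than in separate per-modality coordinates."""
    rx, ry, rz = frame[root_index]
    streams = {}
    for part, indices in PARTS.items():
        streams[part] = [
            (frame[i][0] - rx, frame[i][1] - ry, frame[i][2] - rz)
            for i in indices
        ]
    return streams

# A dummy 11-joint frame:
frame = [(float(i), float(2 * i), 0.0) for i in range(11)]
streams = part_streams(frame)
```

Each part stream could then be encoded independently and the per-part features fused for recognition.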
In many Asian countries with unconstrained road traffic conditions, driving violations such as not wearing helmets and triple-riding are a significant source of fatalities involving motorcycles.
This skew affects all stages within the pipelines of deep crowd counting approaches.
We obtain TCDCA of 96.77% on the test videos, with a remarkable improvement of 22.58% over baseline, and demonstrate that our counting module's performance is close to human level.
We introduce MUGL, a novel deep neural model for large-scale, diverse generation of single and multi-person pose-based action sequences with locomotion.
We introduce MeronymNet, a novel hierarchical approach for controllable, part-based generation of multi-category objects using a single unified model.
F3 adopts multiple heuristics to improve fairness across different demographic groups without requiring a data homogeneity assumption.
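One generic example of such a fairness heuristic is inverse-frequency sample reweighting, so that minority demographic groups contribute equally during training. This is a common illustration of the idea, not F3's actual method:

```python
from collections import Counter

def inverse_frequency_weights(group_labels):
    """Weight each sample inversely to its demographic group's frequency,
    so every group contributes equal total weight. A generic fairness
    heuristic for illustration only, not F3's specific approach."""
    counts = Counter(group_labels)
    total = len(group_labels)
    # total weight per group = total / num_groups, split among its members
    return [total / (len(counts) * counts[g]) for g in group_labels]

weights = inverse_frequency_weights(["a", "a", "a", "b"])
# Group "a" members each get 2/3; the lone "b" member gets 2.0.
```

Note the weights still sum to the number of samples, so the overall loss scale is unchanged.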
Precise boundary annotations of image regions can be crucial for downstream applications which rely on region-class semantics.
Handwritten documents are often characterized by dense and uneven layouts.
We analyze the performance of representative crowd counting approaches across standard datasets at a per-stratum level and in aggregate.
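A per-stratum analysis of this kind amounts to computing the counting error within each density stratum separately, alongside the usual aggregate figure. A minimal sketch, assuming records of the form (stratum label, true count, predicted count); the stratum labels are illustrative:

```python
from collections import defaultdict

def strata_mae(records):
    """Compute mean absolute counting error per stratum and in aggregate.
    Each record is (stratum_label, true_count, predicted_count)."""
    per_stratum_errs, all_errs = defaultdict(list), []
    for stratum, true_count, pred_count in records:
        err = abs(true_count - pred_count)
        per_stratum_errs[stratum].append(err)
        all_errs.append(err)
    per_stratum = {s: sum(e) / len(e) for s, e in per_stratum_errs.items()}
    aggregate = sum(all_errs) / len(all_errs)
    return per_stratum, aggregate

records = [("low", 20, 25), ("low", 30, 28), ("high", 1000, 900)]
per_stratum, aggregate = strata_mae(records)
```

The aggregate MAE here is dominated by the high-density stratum, which is exactly the kind of skew a stratified breakdown exposes.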
State-of-the-art architectures for untrimmed video Temporal Action Localization (TAL) have considered only the RGB and Flow modalities, leaving the information-rich audio modality entirely unexploited.
Ranked #1 on Temporal Action Localization on THUMOS'14
Given a monocular colour image of a warehouse rack, we aim to predict the bird's-eye view layout for each shelf in the rack, which we term as multi-layer layout prediction.
The lack of fine-grained joints (facial joints, hand fingers) is a fundamental performance bottleneck for state-of-the-art skeleton action recognition models.
Ranked #1 on Skeleton Based Action Recognition on NTU60-X
We deploy SynSE for the task of skeleton-based action sequence recognition.
Ranked #1 on Zero Shot Skeletal Action Recognition on NTU RGB+D
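Zero-shot recognition of this kind typically works by embedding the skeleton feature and the class labels in a shared space, then assigning the nearest class embedding. The toy vectors and function names below are illustrative assumptions, not SynSE's actual alignment procedure:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(feature, class_embeddings):
    """Assign an unseen-class label by nearest class embedding,
    measured with cosine similarity in the shared space."""
    return max(class_embeddings,
               key=lambda c: cosine(feature, class_embeddings[c]))

# Toy language-side embeddings for two unseen action classes:
class_embeddings = {
    "jumping": [1.0, 0.1, 0.0],
    "waving":  [0.0, 1.0, 0.2],
}
label = zero_shot_classify([0.9, 0.2, 0.1], class_embeddings)
# → "jumping", since the feature lies closer to that class embedding
```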
In particular, our integration of VPR with SLAM by leveraging the robustness of deep-learned features and our homography-based extreme viewpoint invariance significantly boosts the performance of VPR, feature correspondence, and pose graph submodules of the SLAM pipeline.
To study skeleton-action recognition in the wild, we introduce Skeletics-152, a curated and 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset.
Ranked #1 on Skeleton Based Action Recognition on Skeletics-152
We propose OPAL-Net, a novel hierarchical architecture for part-based layout generation of objects from multiple categories using a single unified model.
At the intermediate level, the map is represented as a Manhattan Graph where the nodes and edges are characterized by Manhattan properties and as a Pose Graph at the lower-most level of detail.
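The intermediate-level idea can be sketched as a coarse graph whose edges are constrained to the four axis-aligned (Manhattan) directions, with each node expanding into a finer pose graph. The class and attribute names below are hypothetical, for illustration only:

```python
# Hypothetical sketch of a two-level map representation: a coarse
# Manhattan Graph over a fine-grained pose graph. Names are illustrative.
MANHATTAN_DIRS = {(1, 0), (-1, 0), (0, 1), (0, -1)}

class ManhattanGraph:
    def __init__(self):
        self.nodes = {}        # node id -> (x, y) grid position
        self.edges = []        # (src, dst) pairs
        self.pose_graphs = {}  # node id -> list of fine-grained poses

    def add_node(self, nid, pos, poses=()):
        self.nodes[nid] = pos
        self.pose_graphs[nid] = list(poses)

    def add_edge(self, a, b):
        ax, ay = self.nodes[a]
        bx, by = self.nodes[b]
        dx, dy = bx - ax, by - ay
        # Enforce the Manhattan property: edges must be axis-aligned.
        direction = (0 if dx == 0 else dx // abs(dx),
                     0 if dy == 0 else dy // abs(dy))
        if direction not in MANHATTAN_DIRS:
            raise ValueError("edge violates Manhattan property")
        self.edges.append((a, b))

g = ManhattanGraph()
g.add_node("A", (0, 0), poses=[(0.0, 0.0, 0.0)])
g.add_node("B", (3, 0))
g.add_edge("A", "B")  # axis-aligned, so permitted
```

A diagonal edge (e.g. between nodes at (0, 0) and (1, 1)) would be rejected, which is what keeps the intermediate level compact relative to the full pose graph.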
To address this deficiency, we introduce Indiscapes, the first-ever dataset with multi-regional layout annotations for historical Indic manuscripts.
Therefore, target identifications by an operator in a subset of cameras cannot be utilized to improve the ranking of the target in the remaining cameras of the network.
Similarly, performance on multi-disciplinary tasks such as Visual Question Answering (VQA) is considered a marker for gauging progress in Computer Vision.
We propose SketchParse, the first deep-network architecture for fully automatic parsing of freehand object sketches.
A class of recent approaches for generating images, called Generative Adversarial Networks (GANs), has been used to generate impressively realistic images of objects, bedrooms, handwritten digits, and a variety of other image modalities.
In this paper, we analyze the results of a free-viewing gaze fixation study conducted on 3904 freehand sketches distributed across 160 object categories.
Our results show that the proposed benchmarking procedure enables additional differentiation among state-of-the-art object classifiers in terms of their ability to handle missing content and insufficient object detail.
In our work, we propose a recurrent neural network architecture for sketch object recognition which exploits the long-term sequential and structural regularities in stroke data in a scalable manner.
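The core idea of processing a sketch as an ordered stroke sequence can be pictured as a recurrence over per-point features. The Elman-style cell, toy weights, and (dx, dy, pen-state) encoding below are simplifying assumptions for illustration, not the paper's architecture:

```python
import math

def rnn_encode(strokes, W_in, W_rec, hidden_size):
    """Run a minimal Elman-style recurrence over a stroke sequence and
    return the final hidden state as a sequence-level sketch feature.
    Each stroke point is assumed to be (dx, dy, pen_lifted)."""
    h = [0.0] * hidden_size
    for point in strokes:
        new_h = []
        for j in range(hidden_size):
            pre = sum(W_in[j][k] * point[k] for k in range(len(point)))
            pre += sum(W_rec[j][k] * h[k] for k in range(hidden_size))
            new_h.append(math.tanh(pre))
        h = new_h  # hidden state carries long-term stroke context forward
    return h

# Toy weights and a two-point stroke sequence:
W_in = [[0.1, 0.0, 0.0],
        [0.0, 0.1, 0.0]]
W_rec = [[0.0, 0.5],
         [0.5, 0.0]]
feature = rnn_encode([(1.0, 0.0, 0), (0.0, 1.0, 0)], W_in, W_rec,
                     hidden_size=2)
```

The final hidden state can then be fed to a classifier head; because the recurrence is shared across time steps, the cost scales linearly with stroke count.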
Current state-of-the-art object recognition architectures achieve impressive performance but are typically specialized for a single depictive style (e.g., photos only, sketches only).
With this new paradigm, every problem in computer vision is now being re-examined from a deep learning perspective.
Studies from neuroscience show that part-mapping computations are employed by the human visual system in the process of object recognition.
To provide a user-friendly interface for designing, training, and developing deep learning frameworks, we have developed Expresso, a GUI tool written in Python.
Therefore, analyzing such sparse sketches can aid our understanding of the neuro-cognitive processes involved in visual representation and recognition.