Search Results for author: Khoa Vo

Found 12 papers, 8 papers with code

Offboard 3D Object Detection from Point Cloud Sequences

no code implementations • CVPR 2021 • Charles R. Qi, Yin Zhou, Mahyar Najibi, Pei Sun, Khoa Vo, Boyang Deng, Dragomir Anguelov

While current 3D object recognition research mostly focuses on the real-time, onboard scenario, there are many offboard use cases of perception that are largely under-explored, such as using machines to automatically generate high-quality 3D labels.

3D Object Detection • 3D Object Recognition • +2

AEI: Actors-Environment Interaction with Adaptive Attention for Temporal Action Proposals Generation

1 code implementation • 21 Oct 2021 • Khoa Vo, Hyekang Joo, Kashu Yamazaki, Sang Truong, Kris Kitani, Minh-Triet Tran, Ngan Le

In this paper, we attempt to simulate that human ability by proposing the Actor Environment Interaction (AEI) network, which improves the video representation used for temporal action proposal generation.

Action Detection • Temporal Action Proposal Generation
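
As a rough illustration of the actor-environment interaction idea above, the sketch below fuses per-actor features with a global environment feature through attention (PyTorch). The module structure, shapes, and names are illustrative assumptions, not the AEI architecture.

import torch
import torch.nn as nn

class ActorEnvironmentFusion(nn.Module):
    # Hypothetical actor-environment fusion sketch; not the authors' code.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # The environment feature attends over actor features to produce a
        # context-aware snippet representation.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, actor_feats, env_feat):
        # actor_feats: (B, num_actors, dim); env_feat: (B, 1, dim)
        fused, _ = self.attn(query=env_feat, key=actor_feats, value=actor_feats)
        return self.proj(fused.squeeze(1))  # (B, dim) snippet feature

feats = ActorEnvironmentFusion()(torch.randn(2, 3, 256), torch.randn(2, 1, 256))
print(feats.shape)  # torch.Size([2, 256])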

ABN: Agent-Aware Boundary Networks for Temporal Action Proposal Generation

1 code implementation • 16 Mar 2022 • Khoa Vo, Kashu Yamazaki, Sang Truong, Minh-Triet Tran, Akihiro Sugimoto, Ngan Le

Temporal action proposal generation (TAPG) aims to estimate the temporal intervals of actions in untrimmed videos, a task that is challenging yet plays an important role in many video analysis and understanding tasks.

Action Detection • Temporal Action Proposal Generation
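
To make the TAPG task definition above concrete, here is a generic sketch of turning per-snippet start/end boundary probabilities into scored candidate intervals. This illustrates the common boundary-pairing formulation of TAPG, not ABN's agent-aware network.

import torch

def generate_proposals(start_prob, end_prob, top=5):
    # start_prob, end_prob: (T,) per-snippet probabilities of an action
    # starting / ending at each temporal position.
    T = start_prob.numel()
    s, e = torch.meshgrid(start_prob, end_prob, indexing="ij")
    scores = s * e                                  # score every (start, end) pair
    valid = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # end > start
    scores = torch.where(valid, scores, torch.zeros_like(scores))
    flat = scores.flatten().topk(top)
    return [(int(i) // T, int(i) % T, float(v))     # (start, end, confidence)
            for i, v in zip(flat.indices, flat.values)]

proposals = generate_proposals(torch.rand(20), torch.rand(20))
print(proposals[0])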

VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning

1 code implementation • 26 Jun 2022 • Kashu Yamazaki, Sang Truong, Khoa Vo, Michael Kidd, Chase Rainwater, Khoa Luu, Ngan Le

In this paper, we leverage the human perceiving process, which involves interaction between vision and language, to generate a coherent paragraph description of untrimmed videos.

Contrastive Learning • Video Captioning
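
Since this entry centers on contrastive vision-language learning, here is a minimal symmetric InfoNCE-style loss between batched video and paragraph embeddings. The temperature and exact formulation are assumptions rather than VLCap's published loss.

import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (B, d) paired embeddings; matched pairs share an index.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (B, B) cosine similarities
    targets = torch.arange(len(v))            # positives lie on the diagonal
    # Symmetric cross-entropy over video->text and text->video directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = video_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())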

AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation

1 code implementation • 5 Oct 2022 • Khoa Vo, Sang Truong, Kashu Yamazaki, Bhiksha Raj, Minh-Triet Tran, Ngan Le

The PMR module represents each video snippet by a visual-linguistic feature, in which the main actors and surrounding environment are represented by visual information, whereas relevant objects are depicted by linguistic features obtained through an image-text model.

Action Detection • Temporal Action Proposal Generation
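
A small sketch of the visual-linguistic snippet feature the abstract describes: pooled actor features and an environment feature supply the visual part, while text embeddings of detected objects (from an image-text model such as CLIP) supply the linguistic part. The pooling and concatenation choices are illustrative assumptions, not the AOE-Net code.

import torch

def snippet_feature(actor_feats, env_feat, object_text_embs):
    # actor_feats: (num_actors, d) visual features of the main actors
    # env_feat: (d,) visual feature of the surrounding environment
    # object_text_embs: (num_objs, d) text embeddings of relevant object labels
    actors = actor_feats.mean(dim=0)           # pool the main actors
    objects = object_text_embs.mean(dim=0)     # pool the linguistic object cues
    return torch.cat([actors, env_feat, objects])  # (3d,) visual-linguistic feature

feat = snippet_feature(torch.randn(3, 512), torch.randn(512), torch.randn(5, 512))
print(feat.shape)  # torch.Size([1536])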

AISFormer: Amodal Instance Segmentation with Transformer

1 code implementation • 12 Oct 2022 • Minh Tran, Khoa Vo, Kashu Yamazaki, Arthur Fernandes, Michael Kidd, Ngan Le

AISFormer explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object's regions of interest by treating them as learnable queries.

Amodal Instance Segmentation • Segmentation • +1
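
The "learnable queries" idea in the abstract can be sketched as a transformer decoder with four query embeddings, one each for the occluder, visible, amodal, and invisible masks, attending to flattened ROI features. Layer counts and dimensions are assumptions, not the AISFormer configuration.

import torch
import torch.nn as nn

class MaskQueryDecoder(nn.Module):
    # Hedged sketch of mask types as learnable queries; illustrative only.
    def __init__(self, dim=256, num_queries=4):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)  # one query per mask type
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, roi_feats):
        # roi_feats: (B, H*W, dim) flattened per-ROI features
        q = self.queries.weight.unsqueeze(0).expand(roi_feats.size(0), -1, -1)
        return self.decoder(q, roi_feats)  # (B, 4, dim) per-mask embeddings

out = MaskQueryDecoder()(torch.randn(2, 14 * 14, 256))
print(out.shape)  # torch.Size([2, 4, 256])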

VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

1 code implementation • 28 Nov 2022 • Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le

Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in coherent storytelling.

Sentence • Video Captioning

CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection

1 code implementation • 9 Dec 2022 • Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, Ngan Le

Video anomaly detection (VAD) -- commonly formulated as a weakly supervised multiple-instance learning problem due to the labor-intensive nature of frame-level annotation -- is a challenging problem in video surveillance, in which anomalous frames must be localized within an untrimmed video.

Anomaly Detection • Multiple Instance Learning • +1
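
For the multiple-instance learning formulation mentioned above, a common weakly supervised objective pools the top-k most anomalous snippet scores against the video-level label. The sketch below shows that generic objective, not CLIP-TSA's exact loss.

import torch
import torch.nn.functional as F

def topk_mil_loss(snippet_logits, video_label, k=3):
    # snippet_logits: (T,) per-snippet anomaly logits for one untrimmed video;
    # video_label: 1.0 if the video contains an anomaly anywhere, else 0.0.
    topk_mean = snippet_logits.topk(k).values.mean()  # pool most-anomalous snippets
    return F.binary_cross_entropy_with_logits(topk_mean, torch.tensor(video_label))

loss = topk_mil_loss(torch.randn(32), 1.0)
print(loss.item())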

Contextual Explainable Video Representation: Human Perception-based Understanding

1 code implementation • 12 Dec 2022 • Khoa Vo, Kashu Yamazaki, Phong X. Nguyen, Phat Nguyen, Khoa Luu, Ngan Le

We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception-based contextual representation in video understanding.

Action Detection • Action Recognition • +4

Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation

no code implementations • 5 Oct 2023 • Kashu Yamazaki, Taisei Hanyu, Khoa Vo, Thang Pham, Minh Tran, Gianfranco Doretto, Anh Nguyen, Ngan Le

Open-Fusion harnesses the power of a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension and employs the Truncated Signed Distance Function (TSDF) for swift 3D scene reconstruction.

3D Scene Reconstruction
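
The queryable-scene idea can be sketched as follows: each reconstructed point carries a feature from the vision-language foundation model, and a text query selects matching points by cosine similarity. The threshold and feature plumbing are assumptions; the TSDF reconstruction step is omitted.

import torch
import torch.nn.functional as F

def query_scene(point_feats, text_emb, threshold=0.3):
    # point_feats: (N, d) per-point features fused from a vision-language model;
    # text_emb: (d,) embedding of the text query from the same model.
    sims = F.cosine_similarity(point_feats, text_emb.unsqueeze(0), dim=-1)
    return sims > threshold  # boolean mask over scene points

mask = query_scene(torch.randn(1000, 512), torch.randn(512))
print(int(mask.sum()), "points matched the query")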
