We augment the ground-truth solutions of our seed data and train a back-translation model to translate the augmented solutions back into new questions.
Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista.
Ranked #1 on Multimodal Reasoning on MATH-V (using extra training data)
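A minimal sketch of the solution-to-question back-translation described above, assuming a seq2seq model fine-tuned on (solution → question) pairs; the "t5-small" checkpoint and the prompt prefix are placeholders, not the authors' model.

```python
# Sketch of back-translating augmented solutions into new questions.
# "t5-small" is only a stand-in checkpoint for a fine-tuned back-translation model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def back_translate(solution: str, num_questions: int = 3) -> list[str]:
    """Generate candidate questions whose answer is the given solution."""
    inputs = tokenizer("solution to question: " + solution, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_return_sequences=num_questions,
        do_sample=True,          # sample to obtain diverse question paraphrases
        top_p=0.9,
        max_new_tokens=64,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

augmented_solution = "x = 4, because 2x + 3 = 11 implies 2x = 8."
print(back_translate(augmented_solution))
```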
Beyond capturing global information like other multi-modal models, our proposed model excels at tasks demanding a detailed understanding of local information within the input.
no code implementations • 29 Oct 2023 • Nan He, Hanyu Lai, Chenyang Zhao, Zirui Cheng, Junting Pan, Ruoyu Qin, Ruofan Lu, Rui Lu, Yunchen Zhang, Gangming Zhao, Zhaohui Hou, Zhiyuan Huang, Shaoqing Lu, Ding Liang, Mingjie Zhan
Based on TeacherLM-7.1B, we augmented 58 NLP datasets and taught various student models with different parameters from the OPT and BLOOM series in a multi-task setting.
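A rough sketch of training a student LM on teacher-augmented samples in this spirit; the data format (question plus a teacher-written explanation) is an assumption, and "facebook/opt-125m" is only a small stand-in for the OPT-series students.

```python
# Minimal sketch: fine-tune a small student LM on teacher-augmented text.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
student = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

augmented_batch = [
    "Q: Is the Pacific larger than the Atlantic? "
    "Explanation (teacher): the Pacific covers ~165M km^2 vs ~106M km^2. A: yes",
]

for text in augmented_batch:
    enc = tokenizer(text, return_tensors="pt")
    # Standard causal-LM loss on the teacher-augmented sequence.
    loss = student(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```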
no code implementations • 25 Jul 2023 • Alireza Shafizadeh, Hossein Shahbeik, Mohammad Hossein Nadian, Vijai Kumar Gupta, Abdul-Sattar Nizami, Su Shiung Lam, WanXi Peng, Junting Pan, Meisam Tabatabaei, Mortaza Aghbashlo
The literature is used to compile a database covering a variety of catalyst characteristics and reaction conditions.
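A minimal sketch of fitting a model on such a literature-compiled tabular database; the file name, column names, and target variable ("yield") are hypothetical placeholders, and the gradient-boosting regressor is just one reasonable choice, not necessarily the paper's model.

```python
# Sketch: regression on a literature-compiled catalyst database (placeholder columns).
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = pd.read_csv("catalyst_database.csv")          # compiled from the literature
X = df[["metal_loading", "surface_area", "temperature", "pressure"]]
y = df["yield"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)
print("R^2:", r2_score(y_test, model.predict(X_test)))
```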
Video Question Answering (VideoQA) has been significantly advanced by the scaling of recent Large Language Models (LLMs).
Ranked #3 on Temporal/Causal QA on NExT-QA (using extra training data)
no code implementations • 24 May 2023 • Hossein Shahbeik, Alireza Shafizadeh, Mohammad Hossein Nadian, Dorsa Jeddi, Seyedali Mirjalili, Yadong Yang, Su Shiung Lam, Junting Pan, Meisam Tabatabaei, Mortaza Aghbashlo
The input features are constructed using an innovative approach to reflect the physics of the process.
Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.
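A rough sketch of the general idea of feeding video features to a pre-trained LLM as a token sequence; the linear projection, frozen GPT-2 backbone, and feature dimensions are illustrative assumptions, not the paper's architecture.

```python
# Sketch: treat per-frame video features as a token sequence for a frozen LLM.
import torch
import torch.nn as nn
from transformers import GPT2Model

llm = GPT2Model.from_pretrained("gpt2")
for p in llm.parameters():
    p.requires_grad = False                     # keep the language backbone frozen

project = nn.Linear(768, llm.config.n_embd)     # map video features to the LLM width
head = nn.Linear(llm.config.n_embd, 174)        # e.g. 174 action classes (assumption)

video_feats = torch.randn(2, 32, 768)           # (batch, frames, feature dim)
tokens = project(video_feats)                   # video "tokens"
hidden = llm(inputs_embeds=tokens).last_hidden_state
logits = head(hidden.mean(dim=1))               # clip-level prediction
```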
Driven by large-data pre-training, the Segment Anything Model (SAM) has been demonstrated to be a powerful and promptable framework, revolutionizing segmentation models.
Ranked #1 on Personalized Segmentation on PerSeg
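For context, a short usage sketch of prompting SAM with a single point via Meta's segment-anything package; the checkpoint path and the blank stand-in image are placeholders, and this shows plain promptable segmentation rather than the personalization method of the paper above.

```python
# Sketch: point-prompted segmentation with the segment-anything package.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder path
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)   # stand-in for a real RGB image
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),          # one foreground click
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)                        # candidate masks with confidences
```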
In this paper, we aim to reduce the model complexity of large vision transformers pretrained with MAE, with the assistance of sparse training.
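As an illustration of sparsifying a ViT, here is a minimal sketch using generic unstructured magnitude pruning; the timm model (with MAE weights assumed to be loaded separately) and the 50% sparsity level are placeholders, and this is not necessarily the paper's exact sparse-training scheme.

```python
# Sketch: magnitude pruning of a ViT's linear layers (generic, illustrative).
import timm
import torch.nn as nn
import torch.nn.utils.prune as prune

vit = timm.create_model("vit_base_patch16_224", pretrained=False)  # MAE weights would be loaded here

for module in vit.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)  # zero the 50% smallest weights
        prune.remove(module, "weight")                            # make the sparsity permanent

# The pruned model would then be fine-tuned to recover accuracy.
```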
1 code implementation • 6 Dec 2022 • Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
Ranked #1 on Action Classification on Kinetics-400 (using extra training data)
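A minimal sketch of the video-language contrastive objective mentioned above, written as a symmetric InfoNCE loss over paired video and text embeddings; the embedding dimension and temperature are illustrative.

```python
# Sketch: symmetric InfoNCE loss between video and text embeddings.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (batch, dim) embeddings of paired clips and captions."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))               # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = video_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```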
2 code implementations • 17 Nov 2022 • Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei HUANG, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, LiMin Wang, Yu Qiao
In this report, we present our champion solutions to five tracks of the Ego4D challenge.
Ranked #1 on State Change Object Detection on Ego4D
This has led to a new research direction in parameter-efficient transfer learning.
Ranked #20 on Action Recognition on Something-Something V2 (using extra training data)
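A minimal sketch of the parameter-efficient transfer idea: freeze a pre-trained backbone and train only small bottleneck adapters. The adapter placement, shapes, and the stand-in transformer layer are illustrative, not the paper's design.

```python
# Sketch: frozen backbone plus a small trainable bottleneck adapter.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False              # the backbone stays frozen

adapter = Adapter(768)                   # only these ~100K parameters are trained
x = torch.randn(2, 16, 768)              # (batch, tokens, dim)
out = adapter(backbone(x))
```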
In this work, pushing further along this under-studied direction, we introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs in the tradeoff between accuracy and on-device efficiency.
This technical report introduces our winning solution to the spatio-temporal action localization track, AVA-Kinetics Crossover, in ActivityNet Challenge 2020.
We propose to explicitly model the Actor-Context-Actor Relation, which is the relation between two actors based on their interactions with the context.
Ranked #2 on Action Recognition on AVA v2.1
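A toy sketch of modeling higher-order actor-context-actor relations: each actor first attends over context features, then pairs of the resulting actor-context features interact. The cross-attention layer, dimensions, and MLP relation head are illustrative only, not the paper's exact network.

```python
# Sketch: first-order actor-context relations, then second-order actor-actor relations.
import torch
import torch.nn as nn

dim = 256
actor_feats = torch.randn(3, dim)            # features of 3 detected actors
context_feats = torch.randn(49, dim)         # flattened 7x7 scene feature map

attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
# First-order: each actor attends over the spatial context.
actor_context, _ = attn(actor_feats[None], context_feats[None], context_feats[None])
actor_context = actor_context[0]             # (3, dim)

# Second-order: relate two actors through their context interactions.
relation_head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
pair = torch.cat([actor_context[0], actor_context[1]], dim=-1)
actor_actor_relation = relation_head(pair)   # feature describing actors 0 and 1 jointly
```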
This paper proposes the novel task of video generation conditioned on a SINGLE semantic label map, which provides a good balance between flexibility and quality in the generation process.
Imagining multiple consecutive frames given one single snapshot is challenging, since it is difficult to simultaneously predict diverse motions from a single image and faithfully generate novel frames without visual distortions.
We aim to tackle a novel task in action detection - Online Detection of Action Start (ODAS) in untrimmed, streaming videos.
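A toy sketch of the online setting: score each incoming frame causally and flag an action start when the per-class probability crosses a threshold. The linear per-frame classifier, class count, and threshold are placeholders, not the paper's method.

```python
# Sketch: streaming per-frame scoring for online action-start detection.
import torch
import torch.nn as nn

frame_dim, num_classes = 512, 20
start_classifier = nn.Linear(frame_dim, num_classes + 1)   # +1 background class

threshold = 0.8
stream = torch.randn(100, frame_dim)         # stand-in for streaming frame features

for t, frame in enumerate(stream):           # frames arrive one at a time
    probs = start_classifier(frame).softmax(dim=-1)
    score, cls = probs[:-1].max(dim=0)       # ignore the background class
    if score > threshold:
        print(f"action {int(cls)} detected as starting at frame {t}")
```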
We introduce SalGAN, a deep convolutional neural network for visual saliency prediction trained with adversarial examples.
The prediction of salient areas in images has been traditionally addressed with hand-crafted features based on neuroscience principles.
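A minimal sketch of adversarially trained saliency prediction in the spirit of SalGAN: a generator predicts a saliency map and a discriminator judges (image, map) pairs. The tiny convolutional networks, random tensors, and loss weight are placeholders rather than the paper's architecture.

```python
# Sketch: generator/discriminator setup for saliency prediction (generator step only).
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())
discriminator = nn.Sequential(nn.Conv2d(4, 16, 3, stride=2, padding=1), nn.ReLU(),
                              nn.Flatten(), nn.LazyLinear(1), nn.Sigmoid())
bce = nn.BCELoss()

image = torch.rand(2, 3, 64, 64)             # stand-in images
gt_saliency = torch.rand(2, 1, 64, 64)       # stand-in ground-truth saliency maps

pred = generator(image)
# Generator loss: pixel-wise BCE to ground truth plus fooling the discriminator.
d_fake = discriminator(torch.cat([image, pred], dim=1))
g_loss = bce(pred, gt_saliency) + 0.05 * bce(d_fake, torch.ones_like(d_fake))
g_loss.backward()
```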