1 code implementation • 28 Aug 2024 • Jialin Wu, Jiangyi Deng, Shengyuan Pang, Yanjiao Chen, Jiayang Xu, Xinfeng Li, Wenyuan Xu
We propose a practical and unified content moderation framework for LLM services, named Legilimens, designed to be both effective and efficient.
1 code implementation • 4 Jul 2024 • Xinnan Zhang, Jialin Wu, Junyi Xie, Tianlong Chen, Kaixiong Zhou
In view of these issues, we conduct a comprehensive survey and benchmark of drug-target interaction modeling from a structure perspective, integrating dozens of explicit (i.e., GNN-based) and implicit (i.e., Transformer-based) structure learning algorithms.
no code implementations • CVPR 2024 • Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan
Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
no code implementations • CVPR 2024 • Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
We explore the boundaries of scaling up a multilingual vision and language model both in terms of size of the components and the breadth of its training task mixture.
no code implementations • 19 Dec 2023 • Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, Radu Soricut
In this paper, we evaluate the reasoning capabilities of VLMs along various axes through the lens of geometry problems.
no code implementations • CVPR 2024 • Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, Radu Soricut
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks.
Ranked #1 on Visual Question Answering (VQA) on A-OKVQA (using extra training data)
no code implementations • 18 Oct 2023 • Yaqing Wang, Jialin Wu, Tanmaya Dabral, Jiageng Zhang, Geoff Brown, Chun-Ta Lu, Frederick Liu, Yi Liang, Bo Pang, Michael Bendersky, Radu Soricut
Intrusive PEFT techniques directly change a model's internal architecture.
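As a generic illustration of what "intrusive" means here (not from the paper), the sketch below inserts a bottleneck adapter inside a host model's blocks, i.e., it changes the internal architecture; the module name and layer sizes are assumptions.

```python
# Generic illustration of an "intrusive" PEFT technique (not from the paper):
# a bottleneck adapter adds new trainable layers inside the host model's
# blocks, thereby changing its internal architecture. Sizes are placeholders.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, dim)     # project back up

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual bottleneck applied to the block's hidden states.
        return hidden + self.up(torch.relu(self.down(hidden)))
```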
1 code implementation • 13 Oct 2023 • Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut
This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger.
Ranked #2 on Temporal/Causal QA on NExT-QA (using extra training data)
1 code implementation • 14 Aug 2023 • Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut
Recent empirical evidence indicates that transformer-based in-context learning performs better with a prefix language model (prefixLM), in which all in-context samples can attend to each other, than with a causal language model (causalLM), whose auto-regressive attention prohibits in-context samples from attending to future samples.
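The difference comes down to the attention mask. Below is a minimal sketch (not the paper's code) contrasting the two masking schemes; `prefix_len`, the number of in-context tokens, is an assumed parameter.

```python
# Minimal sketch (not the paper's code) contrasting causalLM and prefixLM
# attention masks. `prefix_len` marks the in-context tokens; True = may attend.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Each position attends only to itself and earlier positions.
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

def prefix_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Start from the causal mask, then let all in-context (prefix) tokens
    # attend to each other bidirectionally.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

if __name__ == "__main__":
    # Four in-context tokens followed by two query tokens.
    print(causal_mask(6).int())
    print(prefix_mask(6, prefix_len=4).int())
```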
1 code implementation • 28 Jul 2023 • Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, Brianna Zitkovich
Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web.
2 code implementations • 29 May 2023 • Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.
Ranked #1 on Fine-Grained Image Recognition on OVEN
no code implementations • 18 Oct 2022 • Jialin Wu, Raymond J. Mooney
To address these issues, we propose an Entity-Focused Retrieval (EnFoRe) model that provides stronger supervision during training and recognizes question-relevant entities to help retrieve more specific knowledge.
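A minimal sketch of the general entity-focused re-ranking idea (not the EnFoRe implementation): score each retrieved knowledge passage by its question relevance plus its best match against question-relevant entities. The scoring rule, `alpha`, and the assumption of pre-computed, L2-normalized embeddings are all illustrative.

```python
# Illustrative sketch of entity-focused re-ranking (not the EnFoRe code).
# Inputs are assumed to be pre-computed, L2-normalized sentence embeddings.
import numpy as np

def rank_knowledge(q_vec, entity_vecs, passage_vecs, alpha=0.5):
    """Return passage indices ordered from most to least relevant."""
    q_scores = passage_vecs @ q_vec                        # question relevance, shape (P,)
    e_scores = (passage_vecs @ entity_vecs.T).max(axis=1)  # best-matching entity, shape (P,)
    scores = alpha * q_scores + (1 - alpha) * e_scores
    return np.argsort(-scores)
```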
no code implementations • 1 Sep 2022 • Jialin Wu
The possibility of super-AIs taking over the world has been intensively studied by numerous scholars.
no code implementations • 29 Sep 2021 • Jialin Wu, Ray Mooney
While general Visual Question Answering (VQA) focuses on querying visual content within an image, there is a recent trend towards Knowledge-Based VQA (KB-VQA) where a system needs to link some aspects of the question to different types of knowledge beyond the image, such as commonsense concepts and factual information.
1 code implementation • 23 Mar 2021 • Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi
Instead of searching for the answer in a vast collection of often irrelevant facts as most existing approaches do, MAVEx aims to learn how to extract relevant knowledge from noisy sources, which knowledge source to trust for each answer candidate, and how to validate the candidate using that source.
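A hedged sketch of the validate-and-trust idea (not the MAVEx implementation): each knowledge source gets a learned trust weight and a support score per answer candidate, and the final candidate score is the trust-weighted support. The heads and embedding inputs below are assumptions for illustration.

```python
# Hedged sketch of answer validation against multiple noisy knowledge sources
# (not the MAVEx implementation). Module names and inputs are assumptions.
import torch
import torch.nn as nn

class AnswerValidator(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.trust_head = nn.Linear(2 * dim, 1)    # how much to trust this source for this candidate
        self.support_head = nn.Linear(2 * dim, 1)  # how well this source supports this candidate

    def forward(self, cand_emb: torch.Tensor, source_embs: torch.Tensor) -> torch.Tensor:
        # cand_emb: (C, dim); source_embs: (C, S, dim)  ->  returns (C,) scores
        cand = cand_emb.unsqueeze(1).expand_as(source_embs)
        pair = torch.cat([cand, source_embs], dim=-1)
        trust = torch.softmax(self.trust_head(pair).squeeze(-1), dim=-1)
        support = self.support_head(pair).squeeze(-1)
        return (trust * support).sum(dim=-1)
```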
no code implementations • 22 Jan 2021 • Jung-Jun Kim, Dong-Gyu Lee, Jialin Wu, Hong-Gyu Jung, Seong-Whan Lee
We quantitatively and qualitatively evaluated the proposed method on the VQA v2 dataset and compared it with state-of-the-art methods in terms of answer prediction.
no code implementations • COLING 2020 • Jungjun Kim, Hanbin Ko, Jialin Wu
These highly related neighbors, which serve as complementary features, are determined by an attentional ranking module that highlights the discriminative aspects of the target object.
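A generic sketch of an attention-based neighbor ranker (not the paper's module): score candidate neighbor regions against the target object and keep the top-k as complementary features. The scaled dot-product scoring and the value of k are assumptions.

```python
# Generic attention-based neighbor ranking sketch (not the paper's module).
import torch
import torch.nn.functional as F

def rank_neighbors(target: torch.Tensor, neighbors: torch.Tensor, k: int = 3):
    # target: (dim,); neighbors: (N, dim)
    scores = neighbors @ target / target.shape[0] ** 0.5   # scaled dot-product relevance
    weights = F.softmax(scores, dim=0)
    topk = torch.topk(weights, k)
    complementary = (weights[topk.indices].unsqueeze(1) * neighbors[topk.indices]).sum(dim=0)
    return topk.indices, complementary
```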
no code implementations • 28 Jun 2020 • Jialin Wu, Liyan Chen, Raymond J. Mooney
Most recent state-of-the-art Visual Question Answering (VQA) systems are opaque black boxes that are only trained to fit the answer distribution given the question and visual content.
no code implementations • 31 Oct 2019 • Jialin Wu, Raymond J. Mooney
Most RNN-based image captioning models receive supervision on the output words to mimic human captions.
no code implementations • ACL 2019 • Jialin Wu, Zeyuan Hu, Raymond J. Mooney
Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision.
Ranked #31 on Visual Question Answering (VQA) on VQA v2 test-std
1 code implementation • NeurIPS 2019 • Jialin Wu, Raymond J. Mooney
Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors and fail to generalize to test data with a significantly different question-answer (QA) distribution.
Ranked #6 on Visual Question Answering (VQA) on VQA-CP
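To see the failure mode described above, consider a toy "language-prior" baseline that always predicts the most frequent training answer for a question prefix; under VQA-CP's shifted answer distribution such a predictor collapses. The data and prefix rule below are toy illustrations, not from the paper.

```python
# Toy illustration of the language-prior failure mode (not the paper's model).
from collections import Counter, defaultdict

def fit_prior(train_pairs):
    by_prefix = defaultdict(Counter)
    for question, answer in train_pairs:
        prefix = " ".join(question.lower().split()[:2])  # e.g. "what color"
        by_prefix[prefix][answer] += 1
    return {p: c.most_common(1)[0][0] for p, c in by_prefix.items()}

train = [("What color is the banana?", "yellow")] * 9 + [("What color is the car?", "red")]
prior = fit_prior(train)
print(prior["what color"])  # "yellow" -- every "what color" question gets the same answer
```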
no code implementations • ICLR 2019 • Simiao Zuo, Jialin Wu
There have long been debates about how to interpret neural networks and understand the decisions our models make.
1 code implementation • WS 2019 • Jialin Wu, Raymond J. Mooney
AI systems' ability to explain their reasoning is critical to their utility and trustworthiness.
Ranked #5 on Explanatory Visual Question Answering on GQA-REX
no code implementations • 22 May 2018 • Jialin Wu, Zeyuan Hu, Raymond J. Mooney
Answering visual questions requires acquiring everyday commonsense knowledge and modeling the semantic connections among different parts of an image, which is difficult for VQA systems to learn when answers are the only supervision.
no code implementations • ECCV 2018 • Jialin Wu, Dai Li, Yu Yang, Chandrajit Bajaj, Xiangyang Ji
We propose a dynamic filtering strategy with large sampling field for ConvNets (LS-DFN), where the position-specific kernels learn from not only the identical position but also multiple sampled neighbor regions.
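A minimal, illustrative sketch of per-position dynamic filtering over sampled neighbor offsets (not the LS-DFN implementation); the offsets, the 1x1 weight generator, and the sigmoid gating are all assumptions.

```python
# Minimal sketch of position-specific dynamic filtering over sampled neighbors
# (not the LS-DFN implementation). Offsets and gating are assumptions.
import torch
import torch.nn as nn

class DynamicNeighborFilter(nn.Module):
    def __init__(self, channels: int, offsets=((0, 0), (0, 2), (2, 0), (0, -2), (-2, 0))):
        super().__init__()
        self.offsets = offsets
        self.weight_gen = nn.Conv2d(channels, channels, kernel_size=1)  # position-specific weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        weights = torch.sigmoid(self.weight_gen(x))
        out = torch.zeros_like(x)
        for dy, dx in self.offsets:
            neighbor = torch.roll(x, shifts=(dy, dx), dims=(2, 3))  # a sampled neighbor region
            out = out + weights * neighbor
        return out / len(self.offsets)

print(DynamicNeighborFilter(16)(torch.randn(1, 16, 8, 8)).shape)  # torch.Size([1, 16, 8, 8])
```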
no code implementations • 9 Jul 2016 • Jialin Wu, Gu Wang, Wukui Yang, Xiangyang Ji
We propose a novel deep supervised neural network for the task of action recognition in videos, which implicitly takes advantage of visual tracking and shares the robustness of both deep Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
Tasks: Action Recognition in Videos • Temporal Action Localization • +1
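For context, a hedged sketch of the common CNN+RNN pattern for video action recognition (not the paper's architecture): per-frame CNN features are fed to an LSTM over time. Layer sizes and the single-conv backbone are placeholders.

```python
# Hedged sketch of a CNN+RNN video classifier (not the paper's architecture).
import torch
import torch.nn as nn

class CNNRNNClassifier(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:  # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).flatten(1).view(b, t, -1)  # per-frame features
        _, (h, _) = self.rnn(feats)                                     # temporal aggregation
        return self.head(h[-1])
```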