Learning with 3D rotations, a hitchhiker's guide to SO(3)

martius-lab/hitchhiking-rotations 17 Apr 2024

Many settings in machine learning require the selection of a rotation representation.

Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head

om-ai-lab/OmDet 11 Mar 2024

End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities.

Open Vocabulary Object Detection, Real-Time Object Detection

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

thudm/chatglm4 18 Jun 2024

We introduce ChatGLM, an evolving family of large language models that we have been developing over time.

GSM8K, Instruction Following

TextGrad: Automatic "Differentiation" via Text

zou-group/textgrad 11 Jun 2024

Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from $51\%$ to $55\%$, yields $20\%$ relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity.

Ranked #1 on GPQA

Question Answering, Specificity

Tarsier: Recipes for Training and Evaluating Large Video Description Models

bytedance/tarsier arXiv 2024

In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions.

Video Captioning, Video Description

xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart

tianrun-chen/xLSTM-UNet-PyTorch 1 Jul 2024

With comprehensive experiments performed, this technical report highlights the potential of xLSTM-based architectures in advancing biomedical image analysis in both 2D and 3D.

3D Medical Imaging Segmentation, Image Classification

ProPainter: Improving Propagation and Transformer for Video Inpainting

sczhou/propainter ICCV 2023

We also propose a mask-guided sparse video Transformer, which achieves high efficiency by discarding unnecessary and redundant tokens.

Optical Flow Estimation, Video Inpainting

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

opengvlab/internvl 25 Apr 2024

Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks.

4k, Language Modelling

TokenPacker: Efficient Visual Projector for Multimodal LLM

circleradon/tokenpacker 2 Jul 2024

However, the visual tokens are redundant and can be considerably increased when dealing with high-resolution images, impairing the efficiency of MLLMs significantly.

Language Modelling, Large Language Model

Consistency Flow Matching: Defining Straight Flows with Velocity Consistency

yangling0818/consistency_flow_matching 2 Jul 2024

Additionally, we propose a multi-segment training approach for Consistency-FM to enhance expressiveness, achieving a better trade-off between sampling quality and speed.

Image Generation

