Segment Anything in High Quality

syscv/sam-hq 2 Jun 2023

HQ-SAM is trained only on the introduced dataset of 44k masks, which takes only 4 hours on 8 GPUs.

2D Semantic Segmentation, Semantic Segmentation

CodeTF: One-stop Transformer Library for State-of-the-art Code LLM

salesforce/codetf 31 May 2023

In this paper, we present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.

ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models

billxbf/rewoo 23 May 2023

Augmented Language Models (ALMs) blend the reasoning capabilities of Large Language Models (LLMs) with tools that allow for knowledge retrieval and action execution.


DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement

rikorose/deepfilternet 14 May 2023

Multi-frame algorithms for single-channel speech enhancement are able to take advantage of short-time correlations within the speech signal.

Speech Enhancement

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

vahe1994/spqr 5 Jun 2023

Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities.

Language Modelling, Quantization

Humans in 4D: Reconstructing and Tracking Humans with Transformers

shubham-goel/4D-Humans 31 May 2023

To analyze video, we use 3D reconstructions from HMR 2.0 as input to a tracking system that operates in 3D.

Action Recognition, Human Mesh Recovery, +1

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

facebookresearch/hiera 1 Jun 2023

Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance.

Ranked #1 on Action Recognition on AVA v2.2 (using extra training data)

Action Classification, Action Recognition In Videos, +4

XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech

vinairesearch/xphonebert 31 May 2023

We present XPhoneBERT, the first multilingual model pre-trained to learn phoneme representations for the downstream text-to-speech (TTS) task.

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

damo-nlp-sg/video-llama 5 Jun 2023

For the second challenge, we leverage ImageBind, a universal embedding model aligning multiple modalities as the pre-trained audio encoder, and introduce an Audio Q-former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module.

Language Modelling, Text Generation, +1

Gorilla: Large Language Model Connected with Massive APIs

ShishirPatil/gorilla 24 May 2023

Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis.

Language Modelling, Mathematical Reasoning, +2

