Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

homebrewltd/ichigo 20 Oct 2024

Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities.

Question Answering speech-recognition +1

1,631
1.31 stars / hour

PiML Toolbox for Interpretable Machine Learning Model Development and Diagnostics

selfexplainml/piml-toolbox 7 May 2023

PiML (read $\pi$-ML, /`pai`em`el/) is an integrated and open-access Python toolbox for interpretable machine learning model development and model diagnostics.

Fairness Interpretable Machine Learning

1,150
0.88 stars / hour

Classification Done Right for Vision-Language Pre-Training

x-cls/superclass 5 Nov 2024

Due to the absence of the text encoding as a contrastive target, SuperClass does not require a text encoder and does not need to maintain a large batch size as CLIP does.

Classification

21
0.79 stars / hour

D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement

Peterande/D-FINE 17 Oct 2024

When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors.

Ranked #1 on Real-Time Object Detection on MS COCO (using extra training data)

Real-Time Object Detection regression

626
0.66 stars / hour

Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks

jill0001/leopard 2 Oct 2024

Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but also reasoning about inter-relationships and logical flows across multiple visual inputs.

Language Modelling

107
0.64 stars / hour

Moonshine: Speech Recognition for Live Transcription and Voice Commands

usefulsensors/moonshine 21 Oct 2024

This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing.

Decoder Position +2

2,010
0.72 stars / hour

Adaptive Length Image Tokenization via Recurrent Allocation

shivamduggal4/adaptive-length-tokenizer 4 Nov 2024

Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts.

Decoder

19
0.63 stars / hour

Data Formulator 2: Iteratively Creating Rich Visualizations with AI

microsoft/data-formulator 28 Aug 2024

To create rich visualizations, data analysts often need to iterate back and forth between data processing and chart specification to achieve their goals.

Code Generation Navigate

1,164
0.62 stars / hour

Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

fudan-generative-vision/hallo2 10 Oct 2024

To the best of our knowledge, Hallo2, proposed in this paper, is the first method to achieve 4K resolution and generate hour-long, audio-driven portrait image animations enhanced with textual prompts.

4k Image Animation +2

3,468
0.65 stars / hour

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

gpt-omni/mini-omni2 15 Oct 2024

It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction.

Language Modelling

1,473
0.61 stars / hour