InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

reallm-labs/infiguiagent 8 Jan 2025

Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones.

41
0.45 stars / hour

Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations

stanford-oval/storm 27 Aug 2024

While language model (LM)-powered chatbots and generative search engines excel at answering concrete queries, discovering information in the terrain of unknown unknowns remains challenging for users.

Sentiment Analysis

19,964
0.61 stars / hour

PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models

ssmisya/PRMBench 6 Jan 2025

Since language models are prone to many types of errors during reasoning, process-level reward models (PRMs) must be able to detect a range of implicit error types in real-world scenarios.

Decision Making

52
0.44 stars / hour

MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization

tencent-ailab/muq 2 Jan 2025

In this paper, we propose a self-supervised music representation learning model for music understanding.

Contrastive Learning, Key Detection, +4 more

113
0.45 stars / hour

CNMBert: A Model for Hanyu Pinyin Abbreviation to Character Conversion Task

igarashiakatuki/cnmbert 18 Nov 2024

Converting Hanyu Pinyin abbreviations to Chinese characters is a significant branch of Chinese Spelling Correction (CSC).

Fill Mask, Named Entity Recognition, +3 more

75
0.45 stars / hour

garak: A Framework for Security Probing Large Language Models

leondz/garak 16 Jun 2024

As Large Language Models (LLMs) are deployed and integrated into thousands of applications, the need for scalable evaluation of how models respond to adversarial attacks grows rapidly.

Red Teaming

3,603
0.52 stars / hour

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

mark12ding/dispider 6 Jan 2025

Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly.

58
0.41 stars / hour

MinerU: An Open-Source Solution for Precise Document Content Extraction

opendatalab/mineru 27 Sep 2024

Document content analysis has been a crucial research area in computer vision.

Diversity, Optical Character Recognition (OCR)

24,344
0.51 stars / hour

JoyGen: Audio-Driven 3D Depth-Aware Talking-Face Video Editing

JOY-MM/JoyGen 3 Jan 2025

Significant progress has been made in talking-face video generation research; however, precise lip-audio synchronization and high visual quality remain challenging in editing lip shapes based on input audio.

3D Reconstruction, Motion Generation, +3 more

73
0.42 stars / hour

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

freedomintelligence/huatuogpt-o1 25 Dec 2024

To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs.

Reinforcement Learning (RL)

634
0.48 stars / hour