Search Results for author: Jen-tse Huang

Found 20 papers, 15 papers with code

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

1 code implementation · 18 Mar 2024 · Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Michael R. Lyu

Additionally, we conduct evaluations across various LLMs and find that GPT-4 outperforms other models on GAMA-Bench, achieving a score of 72.5.

Decision Making

New Job, New Gender? Measuring the Social Bias in Image Generation Models

no code implementations · 1 Jan 2024 · Wenxuan Wang, Haonan Bai, Jen-tse Huang, Yuxuan Wan, Youliang Yuan, Haoyi Qiu, Nanyun Peng, Michael R. Lyu

BiasPainter uses a diverse range of seed images of individuals and prompts the image generation models to edit these images using gender-, race-, and age-neutral queries.

Fairness · Image Generation
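As a rough sketch of the testing idea described in the snippet above, the following code probes an image-editing model with attribute-neutral prompts and flags images whose perceived gender changes after editing. The `edit_image` and `perceived_gender` helpers are hypothetical placeholders, not part of the BiasPainter release.

```python
# Rough sketch: edit seed images with attribute-neutral prompts and flag
# cases where a protected attribute appears to change after editing.
# edit_image() and perceived_gender() are hypothetical stand-ins for an
# image-editing model and a face-attribute classifier.

from typing import Callable

def bias_probe(
    seed_images: list,
    neutral_prompts: list[str],
    edit_image: Callable,
    perceived_gender: Callable,
) -> list[dict]:
    """Return cases where a neutral edit prompt changed the perceived gender."""
    suspicious = []
    for image in seed_images:
        before = perceived_gender(image)
        for prompt in neutral_prompts:
            edited = edit_image(image, prompt)  # e.g. "a photo of a nurse"
            after = perceived_gender(edited)
            if after != before:
                suspicious.append({"prompt": prompt, "before": before, "after": after})
    return suspicious

# Toy usage with stand-in callables; a real run would plug in actual models.
flags = bias_probe(
    ["img_001", "img_002"],
    ["a photo of a nurse", "a photo of a CEO"],
    edit_image=lambda img, prompt: f"{img}|{prompt}",
    perceived_gender=lambda img: "unknown",
)
print(flags)
```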

The Earth is Flat? Unveiling Factual Errors in Large Language Models

no code implementations · 1 Jan 2024 · Wenxuan Wang, Juluan Shi, Zhaopeng Tu, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

Current methods for evaluating LLMs' veracity are limited by test data leakage or the need for extensive human labor, hindering efficient and accurate error detection.

In-Context Learning · Multiple-choice

A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models

no code implementations · 1 Jan 2024 · Yuxuan Wan, Wenxuan Wang, Yiliu Yang, Youliang Yuan, Jen-tse Huang, Pinjia He, Wenxiang Jiao, Michael R. Lyu

In addition, the test cases of LogicAsker can be further used to design demonstration examples for in-context learning, which effectively improves the logical reasoning ability of LLMs, e.g., 10% for GPT-4.

Code Generation · In-Context Learning · +2
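The snippet above mentions reusing test cases as in-context learning demonstrations. A minimal sketch of that idea follows; the demonstration cases and prompt template are illustrative assumptions, not the LogicAsker implementation.

```python
# Minimal sketch: turning solved logic test cases into in-context
# demonstrations that are prepended to a new question.
# The example cases below are hypothetical placeholders.

demonstrations = [
    {
        "question": "If A and B is true, is B and A true?",
        "answer": "Yes. Conjunction is commutative, so A and B entails B and A.",
    },
    {
        "question": "If A implies B, does B imply A?",
        "answer": "Not necessarily. Implication is not symmetric.",
    },
]

def build_prompt(new_question: str) -> str:
    """Prepend solved test cases as demonstrations before the new question."""
    parts = []
    for demo in demonstrations:
        parts.append(f"Q: {demo['question']}\nA: {demo['answer']}")
    parts.append(f"Q: {new_question}\nA:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    # The resulting prompt would then be sent to the LLM under evaluation.
    print(build_prompt("If not (A or B) holds, does not A hold?"))
```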

Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models

1 code implementation · 31 Oct 2023 · Tian Liang, Zhiwei He, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi, Xing Wang

Ideally, an advanced agent should be able to describe a given word accurately in its aggressive description while maximizing confusion in its conservative description, enhancing its performance in the game.

Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models

no code implementations · 19 Oct 2023 · Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, Michael R. Lyu

This paper identifies a cultural dominance issue within large language models (LLMs) due to the predominant use of English data in model training (e.g., ChatGPT).

All Languages Matter: On the Multilingual Safety of Large Language Models

1 code implementation · 2 Oct 2023 · Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice.

Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench

1 code implementation · 2 Oct 2023 · Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu

Large Language Models (LLMs) have recently showcased their remarkable capacities, not only in natural language processing tasks but also across diverse domains such as clinical medicine, legal consultation, and education.

Benchmarking

An Image is Worth a Thousand Toxic Words: A Metamorphic Testing Framework for Content Moderation Software

no code implementations · 18 Aug 2023 · Wenxuan Wang, Jingyuan Huang, Jen-tse Huang, Chang Chen, Jiazhen Gu, Pinjia He, Michael R. Lyu

Moreover, retraining the moderation models with the test cases generated by OASIS improves their robustness without performance degradation.

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

1 code implementation · 12 Aug 2023 · Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu

We propose a novel framework, CipherChat, to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers.

Ethics
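As a rough illustration of the cipher idea mentioned above, the sketch below encodes a query with a Caesar shift and decodes a reply with the inverse shift. It is a generic example of cipher-based encoding, not the CipherChat framework itself.

```python
# Minimal sketch of the cipher idea: encode text with a Caesar shift before
# sending it to a model, and decode the reply with the inverse shift.
# This is a generic illustration, not the CipherChat implementation.

def caesar(text: str, shift: int) -> str:
    """Shift alphabetic characters by `shift` positions, preserving case."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

plaintext = "How should a model respond to unsafe requests?"
encoded = caesar(plaintext, 3)   # encode with shift +3
decoded = caesar(encoded, -3)    # decode with shift -3

print(encoded)
assert decoded == plaintext
```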

Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench

1 code implementation · 7 Aug 2023 · Jen-tse Huang, Man Ho Lam, Eric John Li, Shujie Ren, Wenxuan Wang, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu

Evaluating Large Language Models' (LLMs) anthropomorphic capabilities has become increasingly important in contemporary discourse.

Revisiting the Reliability of Psychological Scales on Large Language Models

1 code implementation · 31 May 2023 · Jen-tse Huang, Wenxuan Wang, Man Ho Lam, Eric John Li, Wenxiang Jiao, Michael R. Lyu

Recent research has extended beyond assessing the performance of Large Language Models (LLMs) to examining them from a psychological standpoint, acknowledging the necessity of understanding their behavioral characteristics.

ParroT: Translating during Chat using Large Language Models tuned with Human Translation and Feedback

1 code implementation · 5 Apr 2023 · Wenxiang Jiao, Jen-tse Huang, Wenxuan Wang, Zhiwei He, Tian Liang, Xing Wang, Shuming Shi, Zhaopeng Tu

Therefore, we propose ParroT, a framework to enhance and regulate translation abilities during chat, built on open-source LLMs (e.g., LLaMA) and human-written translation and feedback data.

Instruction Following · Machine Translation · +1
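To make the data side of the description above concrete, here is a sketch of a single instruction-tuning record for translation-during-chat. The field names and template are generic assumptions for illustration and are not necessarily ParroT's exact format.

```python
# Illustrative instruction-tuning record for translation-during-chat.
# The field names and wording below are generic assumptions for this sketch,
# not necessarily the exact format used by ParroT.

import json

def make_record(src: str, tgt: str, src_lang: str = "German", tgt_lang: str = "English") -> dict:
    """Build one instruction/input/output training record for a translation pair."""
    return {
        "instruction": f"Translate the following {src_lang} sentence into {tgt_lang}.",
        "input": src,
        "output": tgt,
    }

record = make_record("Das Wetter ist heute schön.", "The weather is nice today.")
print(json.dumps(record, ensure_ascii=False, indent=2))
# Records like this (plus feedback-annotated variants) would then be used to
# fine-tune an open-source LLM such as LLaMA.
```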

Improving the Transferability of Adversarial Samples by Path-Augmented Method

1 code implementation · CVPR 2023 · Jianping Zhang, Jen-tse Huang, Wenxuan Wang, Yichen Li, Weibin Wu, Xiaosen Wang, Yuxin Su, Michael R. Lyu

However, such methods select the image augmentation path heuristically and may augment images that are semantically inconsistent with the target images, which harms the transferability of the generated adversarial samples.

Image Augmentation

MTTM: Metamorphic Testing for Textual Content Moderation Software

1 code implementation · 11 Feb 2023 · Wenxuan Wang, Jen-tse Huang, Weibin Wu, Jianping Zhang, Yizhan Huang, Shuqing Li, Pinjia He, Michael Lyu

In addition, we leverage the test cases generated by MTTM to retrain the model we explored, which largely improves model robustness (0% to 5.9% EFR) while maintaining the accuracy on the original test set.

Sentence
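As an illustration of the metamorphic-testing idea behind MTTM, the sketch below applies a meaning-preserving perturbation to a piece of text and checks whether the moderation verdict stays consistent. The `moderate` callable and the specific perturbation are hypothetical stand-ins, not MTTM's actual transformation rules.

```python
# Sketch of a metamorphic test for a text moderation model: apply a
# semantics-preserving perturbation and check that the verdict does not flip.
# moderate() is a hypothetical stand-in for the moderation model under test.

def insert_spaces(text: str) -> str:
    """Perturb the text by spacing out its characters (still readable to a human)."""
    return " ".join(text)

def metamorphic_test(text: str, moderate) -> bool:
    """Return True if the moderation verdict is consistent across the perturbation."""
    original_verdict = moderate(text)
    perturbed_verdict = moderate(insert_spaces(text))
    return original_verdict == perturbed_verdict

# Toy keyword-based moderator used only to demonstrate the mechanics.
toy_moderator = lambda t: "toxic" if "badword" in t.lower() else "ok"

# Prints False: the spaced-out variant evades the toy moderator, i.e. the
# metamorphic relation is violated and a robustness issue is detected.
print(metamorphic_test("this contains badword", toy_moderator))
```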

Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine

1 code implementation · 20 Jan 2023 · Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Shuming Shi, Zhaopeng Tu

By evaluating on a number of benchmark test sets, we find that ChatGPT performs competitively with commercial translation products (e.g., Google Translate) on high-resource European languages but lags behind significantly on low-resource or distant languages.

Machine Translation · Sentence · +1

Tencent's Multilingual Machine Translation System for WMT22 Large-Scale African Languages

1 code implementation · 18 Oct 2022 · Wenxiang Jiao, Zhaopeng Tu, Jiarui Li, Wenxuan Wang, Jen-tse Huang, Shuming Shi

This paper describes Tencent's multilingual machine translation systems for the WMT22 shared task on Large-Scale Machine Translation Evaluation for African Languages.

Data Augmentation · Machine Translation · +1

AEON: A Method for Automatic Evaluation of NLP Test Cases

1 code implementation · 13 May 2022 · Jen-tse Huang, Jianping Zhang, Wenxuan Wang, Pinjia He, Yuxin Su, Michael R. Lyu

However, in practice, many of the generated test cases fail to preserve similar semantic meaning and are unnatural (e.g., grammar errors), which leads to a high false alarm rate and unnatural test cases.

Semantic Similarity · Semantic Textual Similarity · +1
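The snippet above points at semantic preservation as a key criterion for test-case quality. Below is a minimal sketch of one way to screen generated test cases for semantic drift using off-the-shelf sentence embeddings; it is not the AEON method itself, and the embedding model and threshold are assumptions.

```python
# Minimal sketch of screening generated test cases for semantic drift using
# off-the-shelf sentence embeddings. This is not the AEON method; the model
# name and the 0.8 threshold below are assumptions for illustration.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def keeps_meaning(original: str, mutated: str, threshold: float = 0.8) -> bool:
    """Accept a test case only if it stays semantically close to its seed sentence."""
    emb = model.encode([original, mutated], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()
    return similarity >= threshold

print(keeps_meaning("The movie was great.", "The film was wonderful."))
print(keeps_meaning("The movie was great.", "The movie was a disaster."))
```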
