1 code implementation • 30 Jan 2025 • Benjamin Feuer, Chinmay Hegde
Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy.
1 code implementation • 5 Dec 2024 • Kasra Arabi, Benjamin Feuer, R. Teal Witter, Chinmay Hegde, Niv Cohen
For detection, we (i) retrieve the relevant group of noises, and (ii) search within the given group for an initial noise that might match our image.
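The two-stage detection described above can be sketched as a nearest-neighbor search: first over group keys, then over the noises inside the retrieved group. Everything here (the toy data layout, the group-key construction, the `detect` helper) is an illustrative assumption, not the paper's implementation:

```python
import math
import random

rng = random.Random(0)
n_groups, group_size, dim = 4, 8, 16

def randn(n, scale=1.0):
    """A vector of n Gaussian samples."""
    return [rng.gauss(0.0, scale) for _ in range(n)]

# Toy database: each group of initial noises clusters around a group key
# (an assumption made so that stage (i) has something to retrieve by).
group_keys = [randn(dim, 3.0) for _ in range(n_groups)]
noises = [[[k + e for k, e in zip(key, randn(dim))] for _ in range(group_size)]
          for key in group_keys]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def detect(query):
    # (i) retrieve the relevant group: nearest group key to the query
    g = min(range(n_groups), key=lambda j: dist(group_keys[j], query))
    # (ii) search within that group for the best-matching initial noise
    i = min(range(group_size), key=lambda j: dist(noises[g][j], query))
    return g, i

# Query: a known initial noise plus a small perturbation, standing in for
# the noise recovered from a watermarked image.
true_g, true_i = 2, 5
query = [x + 0.05 * e for x, e in zip(noises[true_g][true_i], randn(dim))]
g, i = detect(query)
print(g, i)
```

The two-stage structure matters for scale: stage (i) narrows the search to one group, so stage (ii) only compares against `group_size` candidates rather than the full database.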
1 code implementation • 7 Oct 2024 • Benjamin Feuer, Jiawei Xu, Niv Cohen, Patrick Yubeaton, Govind Mittal, Chinmay Hegde
In this work, we take steps towards a formal evaluation of data curation strategies and introduce SELECT, the first large-scale benchmark of curation strategies for image classification.
1 code implementation • 23 Sep 2024 • Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, John P. Dickerson
In this work, we attempt to answer the following question -- do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not?
2 code implementations • 25 Jun 2024 • Chih-Hsuan Yang, Benjamin Feuer, Zaki Jubery, Zi K. Deng, Andre Nakkab, Md Zahid Hasan, Shivani Chiranjeevi, Kelly Marshall, Nirmal Baishnab, Asheesh K Singh, Arti Singh, Soumik Sarkar, Nirav Merchant, Chinmay Hegde, Baskar Ganapathysubramanian
We introduce BioTrove, the largest publicly accessible dataset designed to advance AI applications in biodiversity.
2 code implementations • 17 Feb 2024 • Benjamin Feuer, Robin Tibor Schirrmeister, Valeriia Cherepanova, Chinmay Hegde, Frank Hutter, Micah Goldblum, Niv Cohen, Colin White
Notably, TabPFN achieves very strong performance on small tabular datasets, but is not designed to make predictions on datasets with more than 1,000 samples.

no code implementations • 17 Nov 2023 • Benjamin Feuer, Chinmay Hegde, Niv Cohen
Tabular classification has traditionally relied on supervised algorithms, which estimate the parameters of a prediction model using its training data.
no code implementations • 7 Nov 2023 • Benjamin Feuer, Chinmay Hegde
Modern computer vision foundation models are trained on massive amounts of data, incurring large economic and environmental costs.
1 code implementation • 27 Oct 2023 • Benjamin Feuer, Yurong Liu, Chinmay Hegde, Juliana Freire
We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner.
Ranked #1 on Column Type Annotation on WDC SOTAB (Weighted F1 metric)
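The four stages named in the entry above (context sampling, prompt serialization, model querying, label remapping) can be sketched end to end. The label set, prompt format, synonym table, and the stubbed `query_model` function are all illustrative assumptions, not ArcheType's actual implementation:

```python
import random

LABELS = ["country", "currency", "company"]  # assumed target label set

def sample_context(column_values, k=3, seed=0):
    """Context sampling: pick a few representative cell values."""
    rng = random.Random(seed)
    return rng.sample(column_values, min(k, len(column_values)))

def serialize_prompt(samples, labels):
    """Prompt serialization: pack samples and candidate labels into text."""
    return (f"Column values: {', '.join(samples)}.\n"
            f"Choose one type from: {', '.join(labels)}.\nType:")

def query_model(prompt):
    """Model querying: stand-in for an LLM call; returns free-form text."""
    if "USD" in prompt or "EUR" in prompt:
        return "monetary unit"  # deliberately off-label, to exercise remapping
    return "unknown"

def remap_label(raw, labels):
    """Label remapping: map the model's free-form answer onto the label set."""
    synonyms = {"monetary unit": "currency"}  # assumed remapping table
    raw = synonyms.get(raw.strip().lower(), raw.strip().lower())
    return raw if raw in labels else None

column = ["USD", "EUR", "JPY", "GBP"]
samples = sample_context(column)
prompt = serialize_prompt(samples, LABELS)
pred = remap_label(query_model(prompt), LABELS)
print(pred)
```

The remapping stage is the piece that makes the pipeline fully zero-shot workable in practice: the model is free to answer in natural language, and its answer is then projected onto the fixed label vocabulary.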
1 code implementation • 7 Aug 2023 • Benjamin Feuer, Ameya Joshi, Minh Pham, Chinmay Hegde
To our knowledge, this is the first result showing (near) state-of-the-art distributional robustness on limited data budgets.
2 code implementations • NeurIPS 2023 • Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Benjamin Feuer, Chinmay Hegde, Ganesh Ramakrishnan, Micah Goldblum, Colin White
To this end, we conduct the largest tabular data analysis to date, comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs.
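"Light hyperparameter tuning" of the kind the finding above refers to can be as simple as a small random search over a GBDT-style grid. The scoring function below is a synthetic stand-in, not a real model fit; in practice it would be cross-validated accuracy from a GBDT library such as XGBoost or LightGBM:

```python
import random

# Assumed GBDT-style search space (illustrative values).
grid = {
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 300, 500],
}

def score(cfg):
    """Stand-in response surface: peaks at lr=0.1, depth=5, 300 trees."""
    return (1.0
            - abs(cfg["learning_rate"] - 0.1)
            - 0.02 * abs(cfg["max_depth"] - 5)
            - 0.0005 * abs(cfg["n_estimators"] - 300))

rng = random.Random(0)
budget = 10  # "light" tuning: only a handful of trials
trials = [{k: rng.choice(v) for k, v in grid.items()} for _ in range(budget)]
best = max(trials, key=score)
print(best)
```

Even this small a budget often closes much of the gap the 'NN vs. GBDT' debate is about, which is the point of the finding: the tuning step can matter more than the model-family choice.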
1 code implementation • 12 Feb 2023 • Andre Nakkab, Benjamin Feuer, Chinmay Hegde
Recent advances in training vision-language models have demonstrated unprecedented robustness and transfer learning effectiveness; however, standard computer vision datasets are image-only, and therefore not well adapted to such training methods.
1 code implementation • 13 Oct 2022 • Benjamin Feuer, Ameya Joshi, Chinmay Hegde
Vision language (VL) models like CLIP are robust to natural distribution shifts, in part because CLIP learns on unstructured data using a technique called caption supervision, in which the model interprets image-linked texts as ground-truth labels.
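Caption supervision in the loose sense described above can be illustrated by matching image-linked captions against a class vocabulary. This toy is purely illustrative (CLIP itself learns a joint image-text embedding rather than explicit labels), and the vocabulary and matching rule are assumptions:

```python
CLASSES = ["dog", "cat", "car"]  # assumed class vocabulary

def caption_to_label(caption):
    """Match whole words only; drop ambiguous or unmatched captions."""
    words = caption.lower().split()
    hits = [c for c in CLASSES if c in words]
    return hits[0] if len(hits) == 1 else None

captions = [
    "A brown dog catching a frisbee",  # matches "dog"; "catching" != "cat"
    "A red car parked outside",        # matches "car"
    "My cat and my dog on the sofa",   # two class words -> ambiguous, dropped
]
labels = [caption_to_label(c) for c in captions]
print(labels)
```

Matching on whole words rather than substrings is deliberate: naive substring matching would mislabel "catching" as "cat", which is exactly the kind of noise that makes caption-derived labels weaker than curated ones.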
no code implementations • 15 Jun 2022 • Benjamin Feuer, Ameya Joshi, Chinmay Hegde
State-of-the-art image classifiers trained on massive datasets (such as ImageNet) have been shown to be vulnerable to a range of both intentional and incidental distribution shifts.