Search Results for author: Yuval Pinter

Found 43 papers, 23 papers with code

CIAug: Equipping Interpolative Augmentation with Curriculum Learning

1 code implementation NAACL 2022 Ramit Sawhney, Ritesh Soun, Shrey Pandit, Megh Thakkar, Sarvagya Malaviya, Yuval Pinter

CIAug achieves state-of-the-art results over existing interpolative augmentation methods on 10 benchmark datasets across 4 languages in text classification and named-entity recognition tasks.

Data Augmentation named-entity-recognition +5

Probing Subphonemes in Morphology Models

no code implementations 16 May 2025 Gal Astrach, Yuval Pinter

Transformers have achieved state-of-the-art performance in morphological inflection tasks, yet their ability to generalize across languages and morphological rules remains limited.

Morphological Inflection

Splintering Nonconcatenative Languages for Better Tokenization

1 code implementation 18 Mar 2025 Bar Gazit, Shaltiel Shmidman, Avi Shmidman, Yuval Pinter

Common subword tokenization algorithms like BPE and UnigramLM assume that text can be split into meaningful units by concatenative measures alone.

Token-Level Privacy in Large Language Models

no code implementations 5 Mar 2025 Re'em Harel, Niv Gilboa, Yuval Pinter

We evaluate dchi-stencil using state-of-the-art language models and diverse datasets, achieving comparable and even better trade-off between utility and privacy compared to existing methods.

Privacy Preserving Semantic Similarity +1

How Much is Enough? The Diminishing Returns of Tokenization Training Data

no code implementations 27 Feb 2025 Varshini Reddy, Craig W. Schmidt, Yuval Pinter, Chris Tanner

For Russian text, we observe diminishing returns after training a tokenizer from 200GB of data, which is approximately 33% more than when training on English.

Attribute

Information Types in Product Reviews

1 code implementation 20 Feb 2025 Ori Shapira, Yuval Pinter

Information in text is communicated in a way that supports a goal for its reader.

Don't Touch My Diacritics

no code implementations 31 Oct 2024 Kyle Gorman, Yuval Pinter

The common practice of preprocessing text before feeding it into NLP models introduces many decision points which have unintended consequences on model performance.

Multilingual NLP

OMPar: Automatic Parallelization with AI-Driven Source-to-Source Compilation

no code implementations 23 Sep 2024 Tal Kadosh, Niranjan Hasabnis, Prema Soundararajan, Vy A. Vo, Mihai Capota, Nesreen Ahmed, Yuval Pinter, Gal Oren

Manual parallelization of code remains a significant challenge due to the complexities of modern software systems and the widespread adoption of multi-core architectures.

C++ code

Protecting Privacy in Classifiers by Token Manipulation

no code implementations 1 Jul 2024 Re'em Harel, Yair Elboher, Yuval Pinter

Using language models as a remote service entails sending private information to an untrusted provider.

Text Classification

Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

1 code implementation 20 Apr 2024 Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella

Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalizations compared to morphological tokenization for semantic compositionality of word meanings.

Text Classification

An Analysis of BPE Vocabulary Trimming in Neural Machine Translation

no code implementations 30 Mar 2024 Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter

We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords.

Machine Translation Translation

Greed is All You Need: An Evaluation of Tokenizer Inference Methods

1 code implementation 2 Mar 2024 Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter

While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed.

All

Tokenization Is More Than Compression

2 code implementations 28 Feb 2024 Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models.

Data Compression

MonoCoder: Domain-Specific Code Language Model for HPC Codes and Tasks

3 code implementations 20 Dec 2023 Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

Specifically, we start with HPC as a domain and build an HPC-specific LM, named MonoCoder, which is orders of magnitude smaller than existing LMs but delivers better performance on non-HPC and HPC codes.

Code Generation Language Modeling +1

Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies

no code implementations 19 Dec 2023 Anaelia Ovalle, Ninareh Mehrabi, Palash Goyal, Jwala Dhamala, Kai-Wei Chang, Richard Zemel, Aram Galstyan, Yuval Pinter, Rahul Gupta

Our paper is the first to link LLM misgendering to tokenization and deficient neopronoun grammar, indicating that LLMs unable to correctly treat neopronouns as pronouns are more prone to misgender.

Analyzing Cognitive Plausibility of Subword Tokenization

no code implementations 20 Oct 2023 Lisa Beinborn, Yuval Pinter

Subword tokenization has become the de-facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce.

Emptying the Ocean with a Spoon: Should We Edit Models?

no code implementations 18 Oct 2023 Yuval Pinter, Michael Elhadad

We call into question the recently popularized method of direct model editing as a means of correcting factual errors in LLM generations.

Model Editing Retrieval

Scope is all you need: Transforming LLMs for HPC Code

2 code implementations 18 Aug 2023 Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks.

All Code Completion

Advising OpenMP Parallelization via a Graph-Based Approach with Transformers

2 code implementations 16 May 2023 Tal Kadosh, Nadav Schneider, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren

Specifically, we propose a novel approach, called OMPify, to detect and predict the OpenMP pragmas and shared-memory attributes in parallel code, given its serial version.

Data Augmentation

Incorporating Context into Subword Vocabularies

1 code implementation 13 Oct 2022 Shaked Yehezkel, Yuval Pinter

Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context.

NER

Lost in Space Marking

no code implementations 2 Aug 2022 Cassandra L. Jacobs, Yuval Pinter

We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one.

UniMorph 4.0: Universal Morphology

no code implementations LREC 2022 Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóǧa, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud'hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, Ekaterina Vylomova

The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema.

Morphological Inflection

Learning to Parallelize in a Shared-Memory Environment with Transformers

2 code implementations 27 Apr 2022 Re'em Harel, Yuval Pinter, Gal Oren

As a result, there is a growing need to utilize these architectures by introducing shared memory parallelization schemes to software applications.

Management

Integrating Approaches to Word Representation

no code implementations 10 Sep 2021 Yuval Pinter

The problem of representing the atomic elements of language in modern neural learning systems is one of the central challenges of the field of natural language processing.

Survey

Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information

no code implementations 1 Aug 2021 Yuval Pinter, Amanda Stent, Mark Dredze, Jacob Eisenstein

Commonly-used transformer language models depend on a tokenization schema which sets an unchangeable subword vocabulary prior to pre-training, destined to be applied to all downstream tasks regardless of domain shift, novel word formations, or other sources of vocabulary mismatch.

Restoring Hebrew Diacritics Without a Dictionary

1 code implementation Findings (NAACL) 2022 Elazar Gershuni, Yuval Pinter

We demonstrate that it is feasible to diacritize Hebrew script without any human-curated resources other than plain diacritized text.

Will it Unblend?

1 code implementation SCiL 2021 Yuval Pinter, Cassandra L. Jacobs, Jacob Eisenstein

Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data.

Learning to Faithfully Rationalize by Construction

2 code implementations ACL 2020 Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, Byron C. Wallace

In NLP this often entails extracting snippets of an input text 'responsible for' corresponding model output; when such a snippet comprises tokens that indeed informed the model's prediction, it is a faithful explanation.

Feature Importance text-classification +1

NYTWIT: A Dataset of Novel Words in the New York Times

1 code implementation COLING 2020 Yuval Pinter, Cassandra L. Jacobs, Max Bittker

We present baseline results for both uncontextual and contextual prediction of novelty class, showing that there is room for improvement even for state-of-the-art NLP systems.

Attending Form and Context to Generate Specialized Out-of-Vocabulary Words Representations

no code implementations 14 Dec 2019 Nicolas Garneau, Jean-Samuel Leboeuf, Yuval Pinter, Luc Lamontagne

We propose a new contextual-compositional neural network layer that handles out-of-vocabulary (OOV) words in natural language processing (NLP) tagging tasks.

Form Sentence

Attention is not not Explanation

2 code implementations IJCNLP 2019 Sarah Wiegreffe, Yuval Pinter

We show that even when reliable adversarial distributions can be found, they don't perform well on the simple diagnostic, indicating that prior work does not disprove the usefulness of attention mechanisms for explainability.

Decision Making Diagnostic +1

Character Eyes: Seeing Language through Character-Level Taggers

1 code implementation WS 2019 Yuval Pinter, Marc Marone, Jacob Eisenstein

Character-level models have been used extensively in recent years in NLP tasks as both supplements and replacements for closed-vocabulary token-level word representations.

POS

Predicting Semantic Relations using Global Graph Properties

1 code implementation EMNLP 2018 Yuval Pinter, Jacob Eisenstein

Semantic graphs, such as WordNet, are resources which curate natural language on two distinguishable layers.

Link Prediction

Si O No, Que Penses? Catalonian Independence and Linguistic Identity on Social Media

no code implementations NAACL 2018 Ian Stewart, Yuval Pinter, Jacob Eisenstein

We also find that Catalan is used more often in referendum-related discourse than in other contexts, contrary to prior findings on language variation.

Sí o no, què penses? Catalonian Independence and Linguistic Identity on Social Media

1 code implementation 13 Apr 2018 Ian Stewart, Yuval Pinter, Jacob Eisenstein

We also find that Catalan is used more often in referendum-related discourse than in other contexts, contrary to prior findings on language variation.

Mimicking Word Embeddings using Subword RNNs

2 code implementations EMNLP 2017 Yuval Pinter, Robert Guthrie, Jacob Eisenstein

In this paper, we present MIMICK, an approach to generating OOV word embeddings compositionally, by learning a function from spellings to distributional embeddings.

Word Embeddings

The Yahoo Query Treebank, V. 1.0

no code implementations 10 May 2016 Yuval Pinter, Roi Reichart, Idan Szpektor

A description and annotation guidelines for the Yahoo Webscope release of Query Treebank, Version 1.0, May 2016.
