In the case of DDx, the proposed network has achieved a mean accuracy of 99.82% and a mean F1 score of 0.9472.
Since thermal imagery offers a unique modality to investigate pain, the U.S. National Institutes of Health (NIH) has collected a large and diverse set of cancer patient facial thermograms for AI-based pain research.
In this paper, we propose cuSLINK, a novel and state-of-the-art reformulation of the SLINK algorithm on the GPU which requires only $O(Nk)$ space and uses a parameter $k$ to trade off space and time.
Differential privacy (DP) is the prevailing technique for protecting user data in machine learning models.
In this survey we review the current literature on attacks and their real-world occurrences, or limited evidence thereof, to critically evaluate the real-world risks of adversarial machine learning (AML) for the average entity.
For a variety of biometric cross-spectral tasks, Visible-Thermal (VT) facial pairs are used.
Our work explores the potential of antivirus (AV) scan data as a scalable source of features for malware.
In recent years, self-attention has become the dominant paradigm for sequence modeling in a variety of domains.
LASSO regularized logistic regression is particularly useful for its built-in feature selection, allowing coefficients to be removed from deployment and producing sparse solutions.
Memorization, or the tendency of large language models (LLMs) to output entire sequences from their training data verbatim, is a key concern for safely deploying language models.
3 code implementations • 3 Apr 2023 • Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal
How do large language models (LLMs) develop and evolve over the course of training?
Ranked #23 on Language Modelling on LAMBADA
Linear $L_1$-regularized models have remained one of the simplest and most effective tools in data analysis, especially in information retrieval problems where n-grams over text with TF-IDF or Okapi feature values are a strong and easy baseline.
Thermal facial imagery offers valuable insight into physiological states such as inflammation and stress by detecting emitted radiation in the infrared spectrum, which is unseen in the visible spectra.
Over the past decade, the machine learning security community has developed a myriad of defenses for evasion attacks.
1 code implementation • 19 Dec 2022 • Zheng-Xin Yong, Hailey Schoelkopf, Niklas Muennighoff, Alham Fikri Aji, David Ifeoluwa Adelani, Khalid Almubarak, M Saiful Bari, Lintang Sutawika, Jungo Kasai, Ahmed Baruwa, Genta Indra Winata, Stella Biderman, Edward Raff, Dragomir Radev, Vassilina Nikoulina
We find language adaptation to be effective at improving zero-shot performance in new languages.
In this paper, we explore the use of metric learning to embed Windows PE files in a low-dimensional vector space for downstream use in a variety of applications, including malware detection, family classification, and malware attribute tagging.
Recurrent neural nets have been successful in processing sequences for a number of tasks; however, they are known to be both ineffective and computationally expensive when applied to very long sequences.
1 code implementation • 3 Nov 2022 • Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, Colin Raffel
We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus.
Ranked #1 on Zero-Shot Learning on Story Cloze
We present a framework to statistically audit the privacy guarantee conferred by a differentially private machine learner in practice.
The quantification of uncertainty is important for the adoption of machine learning, especially to reject out-of-distribution (OOD) data back to human experts for review.
Due to the computational cost of running inference for a neural network, the need to deploy the inferential steps on a third party's compute environment or hardware is common.
Many metric learning tasks, such as triplet learning, nearest neighbor retrieval, and visualization, are treated primarily as embedding tasks where the ultimate metric is some variant of the Euclidean distance (e.g., cosine or Mahalanobis), and the algorithm must learn to embed points into the pre-chosen space.
Data augmentation has been rare in the cyber security domain due to technical difficulties in altering data in a manner that is semantically consistent with the original data.
Generating and editing images from open domain text prompts is a challenging task that heretofore has required expensive and specially trained models.
Yet to the best of our knowledge, only one work has attempted to look at this combined space, concluding that non-reproducible work is more highly cited.
Cancer patients experience high rates of chronic pain throughout the treatment process.
These challenges are widely studied in enterprise networks, but there are many gaps in research and practice as well as novel problems in other domains.
We explore the utility of information contained within a dropout-based Bayesian neural network (BNN) for the task of detecting out-of-distribution (OOD) data.
Our method, which we term continuously generalized ordinal logistic, significantly outperforms the standard ordinal logistic model over a thorough set of ordinal regression benchmark datasets.
In this paper we explore whether transformers can be used to solve introductory level programming assignments while bypassing commonly used AI tools to detect similarities between pieces of software.
Although groups of strongly correlated antivirus engines are known to exist, at present there is limited understanding of how or why these correlations came to be.
Learning to understand grounded language, which connects natural language to percepts, is a critical research area.
Malware family classification is a significant issue with public safety and research implications that has been hindered by the high cost of expert labels.
In some problem spaces, the high cost of obtaining ground truth labels necessitates use of lower quality reference datasets.
In this work we note that as studied, current transfer attack research has an unrealistic advantage for the attacker: the attacker has the exact same training data as the victim.
HRRs today are not effective in a differentiable solution due to numerical instability, a problem we solve by introducing a projection step that forces the vectors to exist in a well behaved point in space.
The detection of malware is a critical task for the protection of computing environments.
These combined data are captured from similar sensors in order to bootstrap the training and transfer learning task, especially valuable because visible-thermal face datasets are limited.
The use of Machine Learning has become a significant part of malware detection efforts due to the influx of new malware, an ever changing threat landscape, and the ability of Machine Learning methods to discover meaningful distinctions between malicious and benign software.
High-performance primitives for mathematical operations on sparse vectors must deal with the challenges of skewed degree distributions and limits on memory consumption that are typically not issues in dense operations.
no code implementations • 1 Mar 2021 • Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent
Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices.
Recent works within machine learning have been tackling inputs of ever-increasing size, with cybersecurity presenting sequence classification problems of particularly extreme lengths.
Ordering the selection of training data using active learning can lead to improvements in learning efficiently from smaller corpora.
But we also propose that thermal imagery may provide a semi-anonymous modality for computer vision, over RGB, which has been plagued by misuse in facial recognition.
Yara rules are a ubiquitous tool among cybersecurity practitioners and analysts.
We propose a cross-modality manifold alignment procedure that leverages triplet loss to jointly learn consistent, multi-modal embeddings of language-based concepts of real-world items.
The Uniform Manifold Approximation and Projection (UMAP) algorithm has become widely popular for its ease of use, quality of results, and support for exploratory, unsupervised, supervised, and semi-supervised learning.
no code implementations • 29 Jul 2020 • Patrick Jenkins, Rishabh Sachdeva, Gaoussou Youssouf Kebe, Padraig Higgins, Kasra Darvish, Edward Raff, Don Engel, John Winder, Francis Ferraro, Cynthia Matuszek
Grounded language acquisition -- learning how language-based interactions refer to the world around them -- is a major area of research in robotics, NLP, and HCI.
Malware classification is a difficult problem, to which machine learning methods have been applied for decades.
As the time to retweet increases, the density of connections also increases; in our sample, we found distinct users dominating the attention of COVID-19 retweeters.
Prior work inspired by compression algorithms has described how the Burrows Wheeler Transform can be used to create a distance measure for bioinformatics problems.
We treat each individual layer of the DNN as a nonlinear dynamical system and use Lyapunov theory to prove stability and robustness locally.
Successful malware attacks on information technology systems can cause millions of dollars in damage, the exposure of sensitive and private information, and the irreversible destruction of data.
Recent work has developed Bayesian methods for the automatic statistical analysis and description of single time series as well as of homogeneous sets of time series data.
N-grams have been a common tool for information retrieval and machine learning applications for decades.
Significant work is being done to develop the math and tools necessary to build provable defenses, or at least bounds, against adversarial attacks of neural networks.
Artificial Intelligence and Machine Learning have become transformative to a number of industries, and as such many industries' need for AI talent is increasing the demand for individuals with these skills.
Adversarial attacks against neural networks are a problem of considerable importance, for which effective defenses are not yet readily available.
As machine-learning (ML) based systems for malware detection become more prevalent, it becomes necessary to quantify the benefits compared to the more traditional anti-virus (AV) systems widely used today.
The Min-Hashing approach to sketching has become an important tool in data analysis, information retrieval, and classification.
In this work we explore the use of metric index structures, which accelerate nearest neighbor queries, in the scenario where we need to interleave insertions and queries during deployment.
The potential lack of fairness in the outputs of machine learning algorithms has recently gained attention both within the research community and in society more broadly.
In this work we introduce malware detection from raw byte sequences as a fruitful research area to the larger machine learning community.
Recent work has proposed the Lempel-Ziv Jaccard Distance (LZJD) as a method to measure the similarity between binary byte sequences for malware classification.
Cryptography and Security