no code implementations • 28 Mar 2024 • Sarwan Ali, Prakash Chourasia, Murray Patterson
This study introduces a novel approach, combining substructure counting, $k$-mers, and Daylight-like fingerprints, to expand the representation of chemical structures in SMILES strings.
no code implementations • 12 Feb 2024 • Sarwan Ali, Tamkanat E Ali, Prakash Chourasia, Murray Patterson
In this work, we present a novel compression-based model, motivated by \cite{jiang2023low}, which combines the simplicity of basic compression algorithms such as Gzip and Bz2 with the Normalized Compression Distance (NCD) to achieve better performance on classification tasks without relying on handcrafted features or pre-trained models.
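As a rough illustration of the idea (not the authors' implementation), the NCD between two strings can be computed from compressed lengths as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. The sketch below uses Python's standard gzip and bz2 modules; the function names and the 1-nearest-neighbour classifier are assumptions made for this example.

```python
import bz2
import gzip


def compressed_size(data: bytes, compressor: str = "gzip") -> int:
    """Return the length in bytes of `data` after compression."""
    if compressor == "gzip":
        return len(gzip.compress(data))
    if compressor == "bz2":
        return len(bz2.compress(data))
    raise ValueError(f"unknown compressor: {compressor!r}")


def ncd(x: str, y: str, compressor: str = "gzip") -> float:
    """Normalized Compression Distance between two strings."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx = compressed_size(bx, compressor)
    cy = compressed_size(by, compressor)
    cxy = compressed_size(bx + by, compressor)  # size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_label(query: str, labelled, compressor: str = "gzip"):
    """Hypothetical 1-nearest-neighbour classifier over (text, label) pairs."""
    closest = min(labelled, key=lambda pair: ncd(query, pair[0], compressor))
    return closest[1]
```

Similar inputs share structure that the compressor can reuse, so their concatenation compresses almost as well as either one alone and the distance stays small.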
no code implementations • 15 Jul 2023 • Usama Sardar, Sarwan Ali, Muhammad Sohaib Ayub, Muhammad Shoaib, Khurram Bashir, Imdad Ullah Khan, Murray Patterson
We curated a comprehensive dataset of nanobody-antigen binding and non-binding data and devised an embedding method based on gapped $k$-mers to predict binding based only on the sequences of the nanobody and antigen.
no code implementations • 8 Jun 2023 • Mansoor Ahmed, Usama Sardar, Sarwan Ali, Shafiq Alam, Murray Patterson, Imdad Ullah Khan
The proposed BAE framework provides a new approach for estimating brain age, which has important implications for the understanding of neurological disorders and age-related brain changes.
1 code implementation • 25 Apr 2023 • Zahra Tayebi, Sarwan Ali, Prakash Chourasia, Taslim Murad, Murray Patterson
Sparse coding is a popular technique in machine learning that represents data with a set of informative features; it can capture complex relationships between amino acids and identify subtle patterns in the sequence that might be missed by low-dimensional methods.
no code implementations • 24 Apr 2023 • Sarwan Ali, Babatunde Bello, Prakash Chourasia, Ria Thazhe Punathil, Pin-Yu Chen, Imdad Ullah Khan, Murray Patterson
Understanding the host-specificity of different families of viruses sheds light on the origin of, e.g., SARS-CoV-2, rabies, and other such zoonotic pathogens in humans.
no code implementations • 13 Apr 2023 • Sarwan Ali, Taslim Murad, Murray Patterson
Therefore, the usage of only the spike protein, instead of the full genome, provides most of the essential information for performing analyses such as host classification.
1 code implementation • 6 Apr 2023 • Sarwan Ali, Prakash Chourasia, Zahra Tayebi, Babatunde Bello, Murray Patterson
In this work, we propose ViralVectors, a method for generating compact feature vectors from virome sequencing data that allows effective downstream analysis.
no code implementations • 1 Apr 2023 • Sarwan Ali, Usama Sardar, Murray Patterson, Imdad Ullah Khan
Kernel-based methods, e.g., SVM, have proven to be an efficient and useful alternative for several machine learning (ML) tasks such as sequence classification.
no code implementations • 4 Mar 2023 • Taslim Murad, Sarwan Ali, Murray Patterson
Machine learning (ML) technologies provide new tools for biological sequence analysis, making it possible to effectively analyze the functions and structures of the sequences.
no code implementations • 17 Feb 2023 • Prakash Chourasia, Taslim Murad, Zahra Tayebi, Sarwan Ali, Imdad Ullah Khan, Murray Patterson
This paper presents a federated learning (FL) approach to train an AI model for SARS-CoV-2 variant classification.
no code implementations • 1 Feb 2023 • Sarwan Ali, Prakash Chourasia, Murray Patterson
Anderson acceleration (AA) is a well-known method for accelerating the convergence of iterative algorithms, with applications in various fields including deep learning and optimization.
1 code implementation • 16 Nov 2022 • Prakash Chourasia, Sarwan Ali, Murray Patterson
We show that, with techniques such as informed initialization and kernel matrix selection, t-SNE performs significantly better.
no code implementations • 15 Nov 2022 • Prakash Chourasia, Sarwan Ali, Simone Ciccolella, Gianluca Della Vedova, Murray Patterson
As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared.
no code implementations • 11 Sep 2022 • Sarwan Ali, Bikram Sahoo, Muhammad Asad Khan, Alexander Zelikovsky, Imdad Ullah Khan, Murray Patterson
More specifically, we improve the quality of the approximate kernel using domain knowledge (computed using information gain) and efficient preprocessing (using minimizer computation) to classify coronavirus spike protein sequences corresponding to different variants (e.g., Alpha, Beta, Gamma).
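For readers unfamiliar with minimizers, the following is a minimal sketch under a simplifying assumption: the minimizer of each length-k window is taken to be its lexicographically smallest length-m substring. The paper may use a different definition (for example, one that also considers the reversed window), and the function name and default parameters here are hypothetical.

```python
def minimizers(seq: str, k: int = 9, m: int = 3) -> list:
    """Return one minimizer per k-mer window of `seq`.

    Assumption for this sketch: the minimizer of a k-mer is its
    lexicographically smallest m-mer (with m <= k).
    """
    if not 0 < m <= k:
        raise ValueError("require 0 < m <= k")
    result = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # All m-mers inside this window; keep the smallest one.
        result.append(min(window[j:j + m] for j in range(k - m + 1)))
    return result
```

Because neighbouring windows often share the same minimizer, the resulting list contains far fewer distinct strings than the raw k-mers, which is what makes it useful as a preprocessing step.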
1 code implementation • 18 Jul 2022 • Sarwan Ali, Bikram Sahoo, Alexander Zelikovskiy, Pin-Yu Chen, Murray Patterson
The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome, with millions of sequences and counting.
no code implementations • 6 Jan 2022 • Sarwan Ali, Babatunde Bello, Prakash Chourasia, Ria Thazhe Punathil, Yijing Zhou, Murray Patterson
In coronaviruses, the surface (S) protein, or spike protein, plays an important part in determining host specificity since it is the point of contact between the virus and the host cell membrane.
1 code implementation • 18 Oct 2021 • Zahra Tayebi, Sarwan Ali, Murray Patterson
We then show that with the appropriate feature selection, we can efficiently and effectively cluster the spike sequences based on the different variants.
no code implementations • 18 Oct 2021 • Sarwan Ali, Yijing Zhou, Murray Patterson
Applying machine learning-based algorithms to this big data is a natural approach to this aim, since they can quickly scale to such data and extract the relevant information in the presence of variety and different levels of veracity.
1 code implementation • 2 Oct 2021 • Sarwan Ali, Babatunde Bello, Zahra Tayebi, Murray Patterson
With the rapid spread of COVID-19 worldwide, viral genomic data is available on the order of millions of sequences in public databases such as GISAID.
no code implementations • 29 Sep 2021 • Sarwan Ali, Bikram Sahoo, Pin-Yu Chen, Murray Patterson
The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 viral genome, with millions of sequences and counting.
1 code implementation • 12 Sep 2021 • Sarwan Ali, Murray Patterson
Through experiments, we show that Spike2Vec is not only scalable on several million spike sequences, but also outperforms the baseline models in terms of prediction accuracy, F1 score, etc.
no code implementations • 18 Aug 2021 • Sarwan Ali, Tamkanat-E-Ali, Muhammad Asad Khan, Imdadullah Khan, Murray Patterson
Using $k$-mer-based feature vector generation and efficient feature selection methods, our approach is effective in identifying variants, as well as being efficient and scalable to millions of sequences.
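A $k$-mer feature vector can be sketched as follows: slide a window of length $k$ over the sequence and count how often each possible $k$-mer occurs, giving a fixed-length vector with one entry per $k$-mer over the alphabet. This is an illustrative sketch only; the function name, the default nucleotide alphabet, and the choice to skip windows containing unknown characters are assumptions, not details taken from the paper.

```python
from itertools import product


def kmer_vector(seq: str, k: int = 3, alphabet: str = "ACGT") -> list:
    """Count-based k-mer feature vector of length len(alphabet) ** k."""
    # Map every possible k-mer to a fixed position, in lexicographic
    # order of the alphabet as given.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # skip windows with characters outside the alphabet
            vec[pos] += 1
    return vec
```

For protein sequences such as spike sequences, the same function can be called with an amino-acid alphabet passed as `alphabet`; the vector length then grows as the alphabet size raised to the power $k$, which is why feature selection matters at scale.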
no code implementations • 7 Aug 2021 • Sarwan Ali, Bikram Sahoo, Naimat Ullah, Alexander Zelikovskiy, Murray Patterson, Imdadullah Khan
With the rapid spread of the novel coronavirus (COVID-19) across the globe and its continuous mutation, it is of pivotal importance to design a system to identify different known (and unknown) variants of SARS-CoV-2.