Benchmarking Machine Learning Robustness in Covid-19 Spike Sequence Classification

29 Sep 2021 · Sarwan Ali, Bikram Sahoo, Pin-Yu Chen, Murray Patterson

The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 viral genome: millions of sequences and counting. This amount of data, while orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics and evolution of viruses, is nonetheless a rich resource for machine learning (ML) and deep learning (DL) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML and DL approaches. This paper is the first (to our knowledge) to explore such a framework. We introduce several ways to perturb SARS-CoV-2 spike protein sequences that mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML approaches, from naive Bayes to logistic regression, that DL approaches are more robust to (and more accurate under) such adversarial attacks on the input sequences, while $k$-mer based feature vector representations are more robust than the baseline one-hot embedding. Our benchmarking framework may help developers of further ML and DL techniques to properly assess their approaches towards understanding the behaviour of the SARS-CoV-2 virus, or towards avoiding possible future pandemics.
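
To make the two feature representations and the idea of a sequence perturbation concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names (`kmer_counts`, `one_hot`, `perturb`) are hypothetical, the input is a short toy amino-acid string, and the perturbation is a uniform random substitution, which is a simplifying assumption standing in for the platform-specific error profiles the abstract refers to.

```python
import random
from collections import Counter

# The 20 standard amino acids (the alphabet assumed for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(seq, k=3):
    """Count overlapping k-mers; the counts act as a sparse feature vector."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def one_hot(seq, alphabet=AMINO_ACIDS):
    """Flattened one-hot encoding: one block of len(alphabet) per position."""
    index = {a: i for i, a in enumerate(alphabet)}
    vec = [0] * (len(seq) * len(alphabet))
    for pos, residue in enumerate(seq):
        vec[pos * len(alphabet) + index[residue]] = 1
    return vec


def perturb(seq, rate, rng, alphabet=AMINO_ACIDS):
    """Replace each residue, independently with probability `rate`,
    by a different residue drawn uniformly from the alphabet.
    (Assumption: a uniform substitution model, not a real error profile.)"""
    out = []
    for residue in seq:
        if rng.random() < rate:
            out.append(rng.choice([a for a in alphabet if a != residue]))
        else:
            out.append(residue)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed for repeatability
    toy = "MFVFLVLLPLVSSQCVNLT"     # short toy sequence, for illustration only
    noisy = perturb(toy, rate=0.1, rng=rng)
    print(noisy)
    print(kmer_counts(noisy, k=3).most_common(3))
    print(sum(one_hot(noisy)))      # one active entry per position
```

A single substitution changes exactly two entries of the one-hot vector but at most `k` of the k-mer counts, so comparing a classifier's accuracy on clean versus perturbed inputs under each representation is one simple way to probe the kind of robustness the abstract describes.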
