Seq2Tok: Deep Sequence Tokenizer for Retrieval
Search over sequences is a fundamental problem. Very efficient solutions exist for text sequences, which are made up of discrete tokens drawn from a finite alphabet. Other sequences, such as audio, video, or sensor readings, are made up of continuous-valued samples recorded at a high sampling rate, which makes similarity search inefficient. This paper proposes Seq2Tok, a deep sequence tokenizer that converts continuous-valued sequences into discrete tokens that are easier to retrieve via sequence queries. The only information available for training Seq2Tok is pairs of similar sequences, so the similarity semantics it learns depend on how the pairs are formed. Seq2Tok compresses the query and target sequences into short sequences of tokens that are faster to match. Experiments show consistent performance of Seq2Tok across various audio retrieval tasks, namely music search (query by humming) and speech keyword search via audio query.
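To make the idea of tokenizing and then matching concrete, the following is a minimal sketch and not the Seq2Tok model itself. Seq2Tok learns its tokenizer from pairs of similar sequences; here we assume instead a fixed, hand-picked codebook with nearest-neighbor quantization, merge consecutive repeated tokens to shorten the sequence, and compare token sequences with Levenshtein edit distance. The names `tokenize`, `edit_distance`, and `codebook`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
from itertools import groupby


def tokenize(frames, codebook):
    """Map each frame to its nearest codebook index, then merge consecutive repeats.

    Assumption: a fixed codebook stands in for the learned tokenizer.
    """
    def nearest(frame):
        # Index of the codebook vector with the smallest squared Euclidean distance.
        return min(
            range(len(codebook)),
            key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, codebook[k])),
        )

    ids = [nearest(frame) for frame in frames]
    # Collapsing runs of identical tokens yields a shorter token sequence.
    return [token for token, _ in groupby(ids)]


def edit_distance(a, b):
    """Levenshtein distance between two token sequences (illustrative matcher)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


# Toy 2-D "frames"; real inputs would be audio features at a high frame rate.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
query = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.1), (1.0, 0.9), (0.1, 0.9)]
target = [(0.0, 0.0), (1.1, 1.0), (0.9, 0.9), (1.0, 1.0), (0.0, 1.1), (0.1, 1.0)]

query_tokens = tokenize(query, codebook)
target_tokens = tokenize(target, codebook)
distance = edit_distance(query_tokens, target_tokens)
```

In this toy setup the two sequences have different lengths and noisy values, yet both collapse to the same short token sequence, so they can be compared with ordinary discrete string matching.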