Encoding high-cardinality string categorical variables

1 code implementation · 3 Jul 2019 · Patricio Cerda, Gaël Varoquaux

We introduce two encoding approaches for string categories: a Gamma-Poisson matrix factorization on substring counts, and the min-hash encoder, for fast approximation of string similarities.
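As a rough illustration of the min-hash idea mentioned above, the sketch below encodes a string as the component-wise minimum of several salted hashes over its character 3-grams. The fraction of matching components between two encodings then approximates the Jaccard similarity of their 3-gram sets. The function names, the choice of BLAKE2 as the hash, the 3-gram size, and the 64-dimensional output are all assumptions made for this example; this is not the authors' implementation.

```python
import hashlib


def char_ngrams(s, n=3):
    """Return the set of character n-grams of s (the whole string if shorter than n)."""
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def salted_hash(gram, seed):
    """Deterministic 64-bit hash of an n-gram; the seed acts as the salt (illustrative choice)."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_encode(s, dim=64, n=3):
    """Encode s as a vector of dim min-hash values, one per salted hash function."""
    grams = char_ngrams(s, n)
    return [min(salted_hash(g, seed) for g in grams) for seed in range(dim)]


def estimated_jaccard(a, b):
    """Fraction of equal components; estimates the Jaccard similarity of the n-gram sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    london = minhash_encode("london")
    londres = minhash_encode("londres")
    paris = minhash_encode("paris")
    print(estimated_jaccard(london, londres))  # strings sharing 3-grams
    print(estimated_jaccard(london, paris))    # strings sharing no 3-grams
```

Because each component only requires a minimum over hashed substrings, the encoding needs no fitting step and no stored vocabulary, which is what makes this kind of approach attractive for high-cardinality columns.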

AutoML, Feature Engineering

Similarity encoding for learning with dirty categorical variables

2 code implementations · 4 Jun 2018 · Patricio Cerda, Gaël Varoquaux, Balázs Kégl

We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains.
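One way to picture similarity encoding is as a relaxed one-hot encoding: instead of a 0/1 indicator per reference category, each value gets its string similarity to every reference category, so near-duplicates such as typos or variants receive nearby vectors. The sketch below uses the Jaccard similarity over character 3-grams; that particular similarity, the lowercasing step, and all function names are assumptions chosen for illustration rather than the paper's exact definition.

```python
def char_ngrams(s, n=3):
    """Return the set of character n-grams of the lowercased string."""
    s = s.lower()
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of a and b (assumed similarity measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(value, reference_categories, n=3):
    """Encode value as its similarity to each reference category."""
    return [ngram_similarity(value, ref, n) for ref in reference_categories]


if __name__ == "__main__":
    # Hypothetical reference categories, e.g. the distinct values seen in training data.
    references = ["Police Officer", "Police Sergeant", "Teacher"]
    print(similarity_encode("Police Officer", references))
    print(similarity_encode("Polce Officer", references))  # typo still lands near "Police Officer"
```

An exact match yields 1.0 in its own column, as in one-hot encoding, while overlapping categories contribute fractional values in the other columns.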

Dimensionality Reduction

