no code implementations • 2 Feb 2024 • Tanishq Kumar, Kevin Luo, Mark Sellke
We put forward a theoretical explanation for this, based on the model's effective parameter count, $p_\text{eff}$, given by the sum of the number of non-zero weights in the final network and the mutual information between the sparsity mask and the data.