We introduce a learning-based framework to optimize tensor programs for deep
learning workloads. Efficient implementations of tensor operators, such as
matrix multiplication and high-dimensional convolution, are key enablers of
effective deep learning systems. However, existing systems rely on manually
optimized libraries such as cuDNN, where only a narrow range of server-class
GPUs are well supported. The reliance on hardware-specific operator libraries
limits the applicability of high-level graph optimizations and incurs
significant engineering costs when deploying to new hardware targets. We use
learning to remove this engineering burden. We learn domain-specific
statistical cost models to guide the search for tensor operator implementations
over billions of possible program variants. We further accelerate the search by
effective model transfer across workloads. Experimental results show that our
framework delivers performance competitive with state-of-the-art hand-tuned
libraries on low-power CPUs, mobile GPUs, and server-class GPUs.
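
To make the idea of cost-model-guided search concrete, the following is a minimal
sketch and not the framework's actual implementation. Everything in it is an
assumption made for illustration: the three-knob search space, the synthetic
measure function standing in for a hardware run, and the nearest-neighbor
predictor standing in for the learned statistical cost model. Model transfer
across workloads is not shown.

```python
import itertools
import random

# Hypothetical search space: (tile_x, tile_y, unroll) knobs of a loop nest.
SPACE = list(itertools.product([1, 2, 4, 8, 16, 32],
                               [1, 2, 4, 8, 16, 32],
                               [1, 2, 4, 8]))


def measure(cfg):
    """Synthetic stand-in for running a program variant on real hardware."""
    tx, ty, unroll = cfg
    return abs(tx - 8) + abs(ty - 16) + 2 * abs(unroll - 4) + 1.0


def predict(cfg, measured, k=3):
    """Toy cost model: mean cost of the k nearest measured configurations."""
    def dist(other):
        return sum((a - b) ** 2 for a, b in zip(cfg, other))

    nearest = sorted(measured, key=dist)[:k]
    return sum(measured[c] for c in nearest) / len(nearest)


def tune(rounds=5, batch=8, seed=0):
    """Alternate between ranking candidates with the model and measuring."""
    rng = random.Random(seed)
    # Bootstrap the model with a few random measurements.
    measured = {cfg: measure(cfg) for cfg in rng.sample(SPACE, batch)}
    for _ in range(rounds):
        candidates = [c for c in SPACE if c not in measured]
        # Rank unmeasured candidates by predicted cost; measure only the best.
        candidates.sort(key=lambda c: predict(c, measured))
        for cfg in candidates[:batch]:
            measured[cfg] = measure(cfg)
    best = min(measured, key=measured.get)
    return best, measured


if __name__ == "__main__":
    best, measured = tune()
    print("best config:", best, "cost:", measured[best])
    print("measured", len(measured), "of", len(SPACE), "variants")
```

The point of the loop is that expensive measurements are spent only on
candidates the model ranks as promising, so a small fraction of the space is
ever run. A real system would replace the toy predictor with a trained model
and the synthetic cost with timed executions on the target device.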