Search Results for author: Pradeep Dubey

Found 17 papers, 4 papers with code

Systolic Computing on GPUs for Productive Performance

no code implementations • 29 Oct 2020 • Hongbo Rong, Xiaochen Hao, Yun Liang, Lidong Xu, Hong H Jiang, Pradeep Dubey

We propose a language and compiler to productively build high-performance software systolic arrays that run on GPUs.

Context-Aware Parse Trees

no code implementations • 24 Mar 2020 • Fangke Ye, Shengtian Zhou, Anand Venkat, Ryan Marcus, Paul Petersen, Jesmin Jahan Tithi, Tim Mattson, Tim Kraska, Pradeep Dubey, Vivek Sarkar, Justin Gottschlich

The simplified parse tree (SPT) presented in Aroma, a state-of-the-art code recommendation system, is a tree-structured representation used to infer code semantics by capturing program structure rather than program syntax.

K-TanH: Efficient TanH For Deep Learning

no code implementations • 17 Sep 2019 • Abhisek Kundu, Alex Heinecke, Dhiraj Kalamkar, Sudarshan Srinivasan, Eric C. Qin, Naveen K. Mellempudi, Dipankar Das, Kunal Banerjee, Bharat Kaul, Pradeep Dubey

We propose K-TanH, a novel, highly accurate, hardware-efficient approximation of the popular activation function TanH for Deep Learning.

Translation

On Scale-out Deep Learning Training for Cloud and HPC

no code implementations • 24 Jan 2018 • Srinivas Sridharan, Karthikeyan Vaidyanathan, Dhiraj Kalamkar, Dipankar Das, Mikhail E. Smorkalov, Mikhail Shiryaev, Dheevatsa Mudigere, Naveen Mellempudi, Sasikanth Avancha, Bharat Kaul, Pradeep Dubey

The exponential growth in use of large deep neural networks has accelerated the need for training these deep neural networks in hours or even minutes.

Ternary Residual Networks

no code implementations • 15 Jul 2017 • Abhisek Kundu, Kunal Banerjee, Naveen Mellempudi, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, Pradeep Dubey

Aided by such an elegant trade-off between accuracy and compute, the 8-2 model (8-bit activations, ternary weights), enhanced by ternary residual edges, turns out to be sophisticated enough to achieve very high accuracy ($\sim 1\%$ drop from our FP-32 baseline), despite a $\sim 1.6\times$ reduction in model size, a $\sim 26\times$ reduction in the number of multiplications, and potentially a $\sim 2\times$ power-performance gain compared to the 8-8 representation, on the state-of-the-art deep network ResNet-101 pre-trained on the ImageNet dataset.

Ternary Neural Networks with Fine-Grained Quantization

no code implementations • 2 May 2017 • Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, Pradeep Dubey

We address this by fine-tuning ResNet-50 with 8-bit activations and ternary weights at $N=64$, improving the Top-1 accuracy to within $4\%$ of the full-precision result with $<30\%$ additional training overhead.

Quantization

Parallelizing Word2Vec in Multi-Core and Many-Core Architectures

1 code implementation • 18 Nov 2016 • Shihao Ji, Nadathur Satish, Sheng Li, Pradeep Dubey

Word2vec is a widely used algorithm for extracting low-dimensional vector representations of words.

Faster CNNs with Direct Sparse Convolutions and Guided Pruning

1 code implementation • 4 Aug 2016 • Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, Pradeep Dubey

Pruning CNNs in a way that increases inference speed often imposes specific sparsity structures, thus limiting the achievable sparsity levels.

Parallelizing Word2Vec in Shared and Distributed Memory

no code implementations • 15 Apr 2016 • Shihao Ji, Nadathur Satish, Sheng Li, Pradeep Dubey

In combination, these techniques allow us to scale up the computation near linearly across cores and nodes, and process hundreds of millions of words per second, which is the fastest word2vec implementation to the best of our knowledge.

Machine Translation, Named Entity Recognition

Distributed Deep Learning Using Synchronous Stochastic Gradient Descent

no code implementations • 22 Feb 2016 • Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, Pradeep Dubey

We design and implement a distributed multinode synchronous SGD algorithm without altering hyperparameters, compressing data, or altering algorithmic behavior.

BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies

1 code implementation • 21 Nov 2015 • Shihao Ji, S. V. N. Vishwanathan, Nadathur Satish, Michael J. Anderson, Pradeep Dubey

One way to understand BlackOut is to view it as an extension of the DropOut strategy to the output layer, wherein we use a discriminative training loss and a weighted sampling scheme.

Language Modelling
