I'm trying to profile my pytorch network to see what is the bottleneck. I noticed that there is an operation called cudaLaunchKernel which is taking up most of the time. This answer says that it is called for every operation done with cuda. If suppose I implement this network in C++ or any other language, would it be possible to reduce this time?
Basically, I'm asking if this overhead is because I've implemented my network in python or will this overhead be always there and impossible to optimize in any language?
Full profiler output:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
cudaLaunchKernel 99.80% 933.739ms 99.80% 933.739ms 20.750ms 0.000us 0.00% 0.000us 0.000us 45
model_inference 0.05% 453.000us 100.00% 935.567ms 935.567ms 0.000us 0.00% 195.000us 195.000us 1
aten::cudnn_convolution 0.04% 388.000us 99.84% 934.047ms 103.783ms 195.000us 100.00% 195.000us 21.667us 9
aten::_convolution 0.01% 138.000us 99.88% 934.419ms 103.824ms 0.000us 0.00% 195.000us 21.667us 9
aten::conv2d 0.01% 122.000us 99.89% 934.592ms 103.844ms 0.000us 0.00% 195.000us 21.667us 9
aten::add_ 0.01% 112.000us 0.02% 155.000us 17.222us 0.000us 0.00% 0.000us 0.000us 9
aten::upsample_nearest2d 0.01% 82.000us 0.01% 105.000us 26.250us 0.000us 0.00% 0.000us 0.000us 4
aten::empty 0.01% 79.000us 0.01% 79.000us 3.292us 0.000us 0.00% 0.000us 0.000us 24
aten::threshold 0.01% 74.000us 0.02% 149.000us 18.625us 0.000us 0.00% 0.000us 0.000us 8
aten::_cat 0.01% 71.000us 0.01% 119.000us 29.750us 0.000us 0.00% 0.000us 0.000us 4
aten::relu 0.01% 57.000us 0.02% 206.000us 25.750us 0.000us 0.00% 0.000us 0.000us 8
aten::convolution 0.01% 51.000us 99.88% 934.470ms 103.830ms 0.000us 0.00% 195.000us 21.667us 9
aten::view 0.01% 50.000us 0.01% 50.000us 5.556us 0.000us 0.00% 0.000us 0.000us 9
aten::cat 0.00% 32.000us 0.02% 151.000us 37.750us 0.000us 0.00% 0.000us 0.000us 4
aten::reshape 0.00% 29.000us 0.01% 79.000us 8.778us 0.000us 0.00% 0.000us 0.000us 9
aten::resize_ 0.00% 25.000us 0.00% 25.000us 0.962us 0.000us 0.00% 0.000us 0.000us 26
aten::rsub 0.00% 21.000us 0.00% 33.000us 33.000us 0.000us 0.00% 0.000us 0.000us 1
aten::mul 0.00% 17.000us 0.00% 27.000us 27.000us 0.000us 0.00% 0.000us 0.000us 1
aten::zeros 0.00% 13.000us 0.00% 16.000us 16.000us 0.000us 0.00% 0.000us 0.000us 1
cudaEventRecord 0.00% 12.000us 0.00% 12.000us 1.333us 0.000us 0.00% 0.000us 0.000us 9
cudaBindTexture 0.00% 11.000us 0.00% 11.000us 2.750us 0.000us 0.00% 0.000us 0.000us 4
aten::empty_strided 0.00% 6.000us 0.00% 6.000us 6.000us 0.000us 0.00% 0.000us 0.000us 1
aten::zero_ 0.00% 1.000us 0.00% 1.000us 1.000us 0.000us 0.00% 0.000us 0.000us 1
cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::ma... 0.00% 0.000us 0.00% 0.000us 0.000us 195.000us 100.00% 195.000us 195.000us 1
cudaUnbindTexture 0.00% 0.000us 0.00% 0.000us 0.000us 0.000us 0.00% 0.000us 0.000us 4
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 935.583ms
Self CUDA time total: 195.000us
from What is cudaLaunchKernel in pytorch profiler output
No comments:
Post a Comment