Another slow spot in the code is the call to __svml_exp4_e9, which comes from the "exp" function I use in another part of my code. According to the VTune analysis, exp takes ~1 sec in the non-vectorized code, but __svml_exp4_e9 takes ~4 sec in the vectorized code. Do I need to do some tuning before calling math functions?