As was already mentioned, your code could be incurring AVX-to-SSE transition penalties. Your program is single-threaded, so there is no sibling thread competing with it for the execution ports. However, I am thinking about the possibility that your floating-point code contains dependency chains between instructions, so that the underlying hardware (Port 0 and Port 1) cannot fully exploit instruction-level parallelism.
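
To illustrate what I mean by interdependencies, here is a minimal sketch in C. It is not your code; the function names sum_single_chain and sum_four_chains are made up for the example. In the first function every addition has to wait for the result of the previous one, so the loop is bound by the latency of a floating-point add. In the second, the four accumulators are independent of each other, so the out-of-order core is free to overlap the additions.

```c
#include <stddef.h>

/* Hypothetical example: one accumulator. Each add depends on the
   result of the previous add, forming a single long dependency chain. */
float sum_single_chain(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Hypothetical example: four independent accumulators. The four adds
   in one iteration do not depend on each other, so they can overlap. */
float sum_four_chains(const float *x, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;

    /* Main loop: process four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }

    /* Tail: handle the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        a0 += x[i];

    /* Combine the partial sums once, at the end. */
    return (a0 + a1) + (a2 + a3);
}
```

Keep in mind that floating-point addition is not associative, so the two versions can differ in the last bits of the result for general input. Whether the second version is actually faster on your machine depends on the compiler and its flags, so it is worth measuring rather than assuming.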