- Moved the implementations for the different block sizes into inline functions that are defined in terms of each other, reducing the amount of redundant code.
- Performance of sad_8bit_32x32_avx2 improved by about 10% due to unrolling of the loop.
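
For reviewers, the layering described above can be pictured roughly as follows. This is a minimal scalar sketch, not the AVX2 code from this change: the function names (sad_8bit_row8, sad_8bit_8x8, sad_8bit_16x16, sad_8bit_32x32), the shared-stride signature, and the quadrant decomposition are all assumptions made for illustration. It only shows how each size can be an inline function built from the next smaller one, with the largest size written out without a loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar stand-in for one 8-pixel row.
 * The real implementation would use AVX2 intrinsics here. */
static inline uint32_t sad_8bit_row8(const uint8_t *a, const uint8_t *b)
{
    uint32_t sum = 0;
    for (int x = 0; x < 8; ++x) {
        sum += (uint32_t)abs((int)a[x] - (int)b[x]);
    }
    return sum;
}

/* 8x8 block: eight rows, both blocks assumed to share the same stride. */
static inline uint32_t sad_8bit_8x8(const uint8_t *a, const uint8_t *b,
                                    size_t stride)
{
    assert(a != NULL && b != NULL);
    uint32_t sum = 0;
    for (size_t y = 0; y < 8; ++y) {
        sum += sad_8bit_row8(a + y * stride, b + y * stride);
    }
    return sum;
}

/* 16x16 block: defined as the sum of its four 8x8 quadrants. */
static inline uint32_t sad_8bit_16x16(const uint8_t *a, const uint8_t *b,
                                      size_t stride)
{
    return sad_8bit_8x8(a, b, stride)
         + sad_8bit_8x8(a + 8, b + 8, stride)
         + sad_8bit_8x8(a + 8 * stride, b + 8 * stride, stride)
         + sad_8bit_8x8(a + 8 * stride + 8, b + 8 * stride + 8, stride);
}

/* 32x32 block: four 16x16 quadrants, written out instead of looped. */
static inline uint32_t sad_8bit_32x32(const uint8_t *a, const uint8_t *b,
                                      size_t stride)
{
    return sad_8bit_16x16(a, b, stride)
         + sad_8bit_16x16(a + 16, b + 16, stride)
         + sad_8bit_16x16(a + 16 * stride, b + 16 * stride, stride)
         + sad_8bit_16x16(a + 16 * stride + 16, b + 16 * stride + 16, stride);
}
```

Because every level is static inline, a compiler can flatten the whole chain into the caller, so the layering itself should not add call overhead. Whether unrolling helps a given size still has to be measured, as was done for the 32x32 case.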