- Added 64x64 version for completeness.
- With the exception of 16x16, these were all slightly slower than the ASM
versions, as measured by "kvazaar_test -s speed -t intra_sad", but now they
are on par or slightly faster.
- None of these actually use any AVX2 intrinsics, and probably never will,
unless someone adds an interface for doing more than one block at a time,
in which case the non-destructive versions might come in handy.
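
As an illustration of the kind of kernel discussed above, here is a minimal
sketch of a 64x64 8-bit SAD that uses only SSE2 intrinsics (no AVX2), with a
scalar fallback for other targets. The function name and the assumption that
both blocks are stored contiguously (stride 64) are made up for this example
and are not taken from kvazaar's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: SAD of two contiguous 64x64 blocks of 8-bit pixels. */
unsigned sad_8bit_64x64_sketch(const uint8_t *a, const uint8_t *b)
{
  unsigned sum = 0;
  assert(a != NULL && b != NULL);
#if defined(__SSE2__)
  /* Two 64-bit lanes, each accumulating the SAD of 8 pixels per step. */
  __m128i acc = _mm_setzero_si128();
  for (size_t i = 0; i < 64 * 64; i += 16) {
    __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
    __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
    acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
  }
  /* The total is at most 64*64*255, so the low 32 bits of each lane suffice. */
  sum = (unsigned)_mm_cvtsi128_si32(acc)
      + (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
#else
  /* Scalar fallback for targets without SSE2. */
  for (size_t i = 0; i < 64 * 64; ++i) {
    sum += (unsigned)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
  }
#endif
  return sum;
}
```

Neither input is modified, and the loop never needs a register wider than
128 bits, which is consistent with the note that AVX2 would only pay off
when several blocks are processed at once.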
- Having more than one target in a pattern rule means that both of those
files are created by a single run of the recipe. This only worked for debug
because the debug build was never done in the same invocation as the release
build.
- Always use the compiler to invoke the linker. Clang passes additional
parameters to the linker when invoked with -flto.
- Giving a different optimization level to the linker did not make any
difference with gcc-5.1.1.
- It's required for .so and .dylib, but not for .dll or the executable.
- It might be better to use libtool for this, but I'm not ready to go that
far yet.
- Moves travis package installations to addons.apt.packages.
- Disables sudo in travis configuration.
- Substitutes nasm for yasm in travis builds since yasm is not available
on travis.
Makes it possible to build kvazaar using nasm instead of yasm.
- Adds trailing slashes to -I params in ASFLAGS.
- Disables CPU NOP directives when assembler is not yasm.
Commit 9cfbd55e removed the "./" prefix from the TESTS variable in the
Makefile, but the recipe of the tests target still expected it. Fixed by
prepending "./" in the tests recipe.
Replaces calls to __get_cpuid with __cpuid_count on gcc and clang, and
calls to __cpuid with __cpuidex on MSVC. Unlike __get_cpuid and __cpuid,
__cpuid_count and __cpuidex set the ecx register (the subleaf index) before
executing cpuid, which is required for AVX2 detection.
- Moved cpuid data to a struct to make it easier to group data from one
cpuid call together.
- Renamed the bit masks to make it harder to mask the wrong register or
cpuid.
- Removed the .byte trick. We probably don't need to support such ancient
compilers.
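
The cpuid changes above could look roughly like the following sketch. The
struct, the mask name and both function names are hypothetical and chosen
for this example; only __cpuid_count (from <cpuid.h>) and __cpuidex (from
<intrin.h>) are the real compiler-provided interfaces. The sketch checks
only the CPU feature bit (leaf 7, subleaf 0, EBX bit 5) and leaves out the
separate check that the OS saves the YMM registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define SKETCH_CPUID_MSVC 1
#elif defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#define SKETCH_CPUID_GNU 1
#endif

/* Hypothetical struct: all four registers from one cpuid call kept together. */
typedef struct {
  uint32_t eax, ebx, ecx, edx;
} cpuid_regs_sketch;

/* Mask name encodes both the leaf and the register it applies to. */
#define CPUID7_EBX_AVX2 (1u << 5)

void get_cpuid_sketch(uint32_t leaf, uint32_t subleaf, cpuid_regs_sketch *out)
{
  assert(out != NULL);
  out->eax = out->ebx = out->ecx = out->edx = 0;
#if defined(SKETCH_CPUID_MSVC)
  int regs[4];
  __cpuidex(regs, (int)leaf, (int)subleaf); /* subleaf goes into ecx */
  out->eax = (uint32_t)regs[0];
  out->ebx = (uint32_t)regs[1];
  out->ecx = (uint32_t)regs[2];
  out->edx = (uint32_t)regs[3];
#elif defined(SKETCH_CPUID_GNU)
  __cpuid_count(leaf, subleaf, out->eax, out->ebx, out->ecx, out->edx);
#else
  (void)leaf;    /* not an x86 target: report no features */
  (void)subleaf;
#endif
}

/* True if the CPU advertises AVX2; OS support is not checked here. */
bool cpu_has_avx2_bit_sketch(void)
{
  cpuid_regs_sketch leaf0, leaf7;
  get_cpuid_sketch(0, 0, &leaf0);
  if (leaf0.eax < 7) return false; /* leaf 7 is not available */
  get_cpuid_sketch(7, 0, &leaf7);
  return (leaf7.ebx & CPUID7_EBX_AVX2) != 0;
}
```

Putting the leaf number and register name into the mask name makes a
mismatch such as testing the AVX2 bit against leaf 1 visible at the call
site.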
The lengths of the leaf streams must be available when the slice header
is written. Writing the header before joining child streams removes the
need to copy leaf bitstreams instead of moving them.
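
The ordering argument above can be sketched with a toy chunk-list stream.
All types and function names here are hypothetical, and the one-byte length
field merely stands in for the real slice header syntax. The point is that
moving a stream empties its source, so the leaf lengths have to be read
while the leaves are still intact.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct chunk_sketch {
  struct chunk_sketch *next;
  size_t len;
  uint8_t data[64];
} chunk_sketch;

typedef struct {
  chunk_sketch *first, *last;
  size_t len; /* total bytes in all chunks */
} stream_sketch;

/* Append n bytes, allocating new chunks as needed. */
void stream_put_sketch(stream_sketch *s, const uint8_t *bytes, size_t n)
{
  while (n > 0) {
    if (s->last == NULL || s->last->len == sizeof s->last->data) {
      chunk_sketch *c = calloc(1, sizeof *c);
      assert(c != NULL);
      if (s->last) s->last->next = c; else s->first = c;
      s->last = c;
    }
    size_t room = sizeof s->last->data - s->last->len;
    size_t take = n < room ? n : room;
    memcpy(s->last->data + s->last->len, bytes, take);
    s->last->len += take;
    s->len += take;
    bytes += take;
    n -= take;
  }
}

/* Move: relink src's chunks onto dst without copying; src becomes empty. */
void stream_move_sketch(stream_sketch *dst, stream_sketch *src)
{
  if (src->first == NULL) return;
  if (dst->last) dst->last->next = src->first; else dst->first = src->first;
  dst->last = src->last;
  dst->len += src->len;
  src->first = src->last = NULL;
  src->len = 0;
}

/* Header first (leaf lengths are still known), then move the leaves. */
void write_slice_sketch(stream_sketch *out, stream_sketch *leaves, size_t n)
{
  for (size_t i = 0; i < n; ++i) {
    uint8_t len_byte = (uint8_t)leaves[i].len; /* toy header field */
    stream_put_sketch(out, &len_byte, 1);
  }
  for (size_t i = 0; i < n; ++i) stream_move_sketch(out, &leaves[i]);
}

void stream_free_sketch(stream_sketch *s)
{
  for (chunk_sketch *c = s->first; c != NULL; ) {
    chunk_sketch *next = c->next;
    free(c);
    c = next;
  }
  s->first = s->last = NULL;
  s->len = 0;
}
```

If the leaves were joined before the header was written, their individual
lengths would already be gone, which is why the joining step would have to
copy instead of move.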