Commit graph

47 commits

Author SHA1 Message Date
Pauli Oikkonen df2e6c54fd 4-unroll hor_sad_sse41_arbitrary
This may not increase perf though because it's so rarely used
function, so keeping icache footprint may be more essential...
2019-03-05 22:45:23 +02:00
Pauli Oikkonen 448eacba7b Avoid overreading block borders in hor_sad_sse41_arbitrary 2019-03-05 22:34:50 +02:00
Pauli Oikkonen 41f51c08c4 Avoid overrunning buffer in hor_sad_sse41_w32 2019-03-01 15:37:38 +02:00
Pauli Oikkonen 9b0e079262 Use SSE instructions for 64-bit SADs instead of MMX
VC++ seems to choke on MMX instructions
2019-02-18 20:13:33 +02:00
Pauli Oikkonen d8b8923028 Add LGPL notices to reg_sad headers 2019-02-18 17:52:47 +02:00
Pauli Oikkonen 770db825b9 Create hor_sad_w8 and w4 epol mask the way w16 works 2019-02-06 19:34:26 +02:00
Pauli Oikkonen aa19bcac8a Avoid branching in creating shuffle mask in hor_sad_w16 2019-02-06 18:58:46 +02:00
Pauli Oikkonen 2d05ca8520 Remove width from constant-width hor_sad func params
They should kinda know it already
2019-02-04 20:41:40 +02:00
Pauli Oikkonen 57db234d95 Move 32-wide SSE4.1 hor_sad to picture-sse41.c
It's not used by picture-avx2.c that also includes the header, so
it should not be in the header
2019-02-04 20:41:40 +02:00
Pauli Oikkonen ff70c8a5ec Utilize horizontal SAD functions for SSE4.1 as well 2019-02-04 20:41:40 +02:00
Pauli Oikkonen f5ff4db01f 4-wide hor_sad border agnostic 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 35e7f9a700 Fix hor_sad w8 to work with both borders 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 69687c8d24 Modify hor_sad_sse41_w16 to work over left and right borders 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 686fb2c957 Unroll arbitrary-width SSE4.1 hor_sad by 4 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 768203a2de First version of arbitrary-width SSE4.1 hor_sad 2019-02-04 20:41:40 +02:00
Pauli Oikkonen ccf683b9b6 Start work on left and right border aware hor_sad
Comes with 4, 8, 16 and 32 pixel wide implementations now, at some point
investigate if this can start to thrash icache
2019-02-04 20:41:40 +02:00
Pauli Oikkonen c36482a11a Fix bug in 24-wide SAD
*facepalm*
2019-02-04 20:41:40 +02:00
Pauli Oikkonen f781dc31f0 Create strategy for ver_sad
Easy to vectorize
2019-02-04 20:41:40 +02:00
Pauli Oikkonen 9db0a1bcda Create get_optimized_sad func for SSE4.1 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 91cb0fbd45 Create strategy for directly obtaining pointer to constant-width SAD function 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 94035be342 Unify unrolling naming conventions 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 517a4338f6 Unroll SSE SAD for 8-wide blocks to process 4 lines at once 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 0f665b28f6 Unroll arbitrary width SSE4.1 SAD by 4 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 84cf771dea Unroll 32 and 16 wide SAD vector implementations by 4 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 5df5c5f8a4 Cast all pointers to const types in vector SAD funcs
Also tidy up the pointer arithmetic
2019-02-04 20:41:40 +02:00
Pauli Oikkonen a711ce3df5 Inline fixed width vectorized SAD functions 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 4cb371184b Add SSE4.1 strategy for 24px wide SAD and an AVX2 strategy for 16 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 796568d9cc Add SSE4.1 strategies for SAD on widths 4 and 12 and AVX2 strategies for 32 and 64 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 2eaa7bc9d2 Move SSE4.1 SAD functions to separate header 2019-02-04 20:41:40 +02:00
Pauli Oikkonen d2db0086e1 Create constant width SAD versions for 8 and 16 pixels 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 26e1b2c783 Use (u)int32_t instead of (unsigned) int in reg_sad_sse41 2019-01-10 14:37:04 +02:00
Pauli Oikkonen b2176bf72a Optimize SSE4.1 version of SAD
Make it use the same vblend trick as AVX2. Interestingly, on my test
setup this seems to be faster than the same code using 256-bit AVX
vectors.
2019-01-07 19:40:57 +02:00
Ari Koivula 14a7bcba25 Use a faster function for clipped inter SAD
Use the vectorized general SSE41 inter SAD in AVX reg_sad for shapes
for which we don't have AVX versions yet.

Also improves speed of --smp and --amp a lot. Got a 1.25x speedup for:
--preset=ultrafast -q 27 --gop=lp-g4d3r3t1 --me-early-termination=on --rd=1 --pu-depth-inter=1-3 --smp --amp

* Suite speed_tests:
-PASS inter_sad: 0.898M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec)
+PASS inter_sad: 2.503M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec)
-PASS inter_sad: 115.054M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec)
+PASS inter_sad: 133.577M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec)
2016-09-27 20:48:30 +03:00
Ari Koivula 61fc3e87ba Run include-what-you-use fix_includes.py fix_includes.py
The includes should make more sense now and not just happen to compile
due to headers included from other headers.

Used a modified version of IWYU. Modifications were to attribute int8_t
and so on to stdint.h instead of sys/types.h and immintrin.h instead of
more specific headers.

include-what-you-use 0.7 (git:b70df35)
based on clang version 3.9.0 (trunk 264728)
2016-04-01 17:46:55 +03:00
Ari Koivula 8908d85d66 Change all relative includes to absolute 2016-04-01 17:46:44 +03:00
Ari Koivula 4876879b82 Add IWYU pragmas 2016-03-31 12:33:34 +03:00
Ari Koivula fa1af14637 Fix includes to include global.h first everywhere 2016-01-22 15:07:49 +02:00
Ari Koivula 947bae24f9 Update Doxygen documentation
Add module information to all header files.

Update all header file documentations to briefly say what they are, and
to use the javadoc format so the brief actually gets included into the
doxygen documentation.

Remove \file from implementation files, in order to not repeat the info
from the header files.

Add files under strategies and tools to Doxygen and update the Doxygen
settings to be just plain better.

Make README be the main page of Doxygen documentation.
2015-12-17 14:05:50 +02:00
Arttu Ylä-Outinen 3a10e9e3e0 Prefix all non-static symbols with "kvz_". 2015-08-26 13:02:28 +03:00
Ari Lemmetti 5d96dbc6c0 Make strategy selection use bit depth given via parameter instead of excluding registration with defines 2015-08-12 13:33:38 +03:00
Ari Lemmetti 4122f36089 Prevent the registration of strategies that are incompatible when KVZ_BIT_DEPTH != 8
Remove unnecessary or misleading mentions of "8bit"
2015-08-12 11:29:53 +03:00
Arttu Ylä-Outinen f7f17a060c Rename pixel_t to kvz_pixel. 2015-07-02 16:58:28 +03:00
Ari Koivula ded6fd9ee8 Renamed typedef pixel to pixel_t. 2015-03-04 16:35:53 +02:00
Ari Koivula d7383ccb25 Change license to LGPL.
- Everyone who has contributed code to the project has been asked to license
  their contributions under LPGL and they have agreed.

- COPYING file changed to say LGPLv2.1 instead of GPLv2.

- GPL changed to LGPL in the header of every single file that a header and
  header added to the few that were missing one.

- Also.. Happy new year!
2015-02-25 15:19:05 +02:00
Ari Koivula 8a5b24bcbe Remove usages of GCC __attribute__.
- To allow clang to compile, as it doesn't according to #58.
- The target attributes are not needed anymore due to makefile handling
  targetting now.
- The __attribute__((unused)) used for debugging. I don't know if clang
  supports this attribute or not but it doesn't seem very important so
  I'm removing it just in case.
2014-10-13 16:46:26 +03:00
Ari Koivula f49332c9b8 Add missing includes. 2014-07-18 17:56:15 +03:00
Ari Koivula 5d0df56c94 Move optimizations to their own compilation units according to target.
- This is necessary in order to compile AVX intrinsics correctly in
  Visual Studio. Having everything in their own units should also make
  compiling normal C code with optimizations on easier.

- For now the makefile still relies on GCC __target__ attribute for compiling
  intrinsics.
2014-07-11 17:26:19 +03:00