Pauli Oikkonen
768203a2de
First version of arbitrary-width SSE4.1 hor_sad
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
ccf683b9b6
Start work on left and right border aware hor_sad
...
Comes with 4, 8, 16 and 32 pixel wide implementations now, at some point
investigate if this can start to thrash icache
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
f781dc31f0
Create strategy for ver_sad
...
Easy to vectorize
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
91cb0fbd45
Create strategy for directly obtaining pointer to constant-width SAD function
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
cbca3347b5
Unroll 64-wide AVX2 SAD by 2
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
84cf771dea
Unroll 32 and 16 wide SAD vector implementations by 4
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
a711ce3df5
Inline fixed width vectorized SAD functions
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
6504145cce
Remove 16-pixel wide AVX2 SAD implementation
...
At least on Skylake, it's noticeably slower than the very simple
version using SSE4.1
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
4cb371184b
Add SSE4.1 strategy for 24px wide SAD and an AVX2 strategy for 16
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
796568d9cc
Add SSE4.1 strategies for SAD on widths 4 and 12 and AVX2 strategies for 32 and 64
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
4d45d828fa
Use constant-width SSE4.1 SAD funcs for AVX2
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
a13fc51003
Include a blank AVX2 strategy registration function even in non-AVX2 builds
2019-02-04 19:52:24 +02:00
Pauli Oikkonen
d55414db66
Only build AVX2 coeff encoding when supported
...
..whoops
2019-02-04 19:34:30 +02:00
Pauli Oikkonen
3fe2f29456
Merge branch 'encode-coeffs-avx2'
2019-02-04 18:52:31 +02:00
Pauli Oikkonen
722b738888
Fix more naming issues
2019-02-04 16:05:43 +02:00
Pauli Oikkonen
e26d98fb75
Rename a couple variables and add crucial comments
2019-02-04 15:57:07 +02:00
Pauli Oikkonen
f186455619
Move encode_last_significant_xy out of strategy modules
...
It's the exact same in both AVX2 and generic, and does not seem to
be worth even trying to vectorize
2019-02-04 14:55:41 +02:00
Pauli Oikkonen
3f7340c932
Fine-tune pack_16x16b_to_16x2b
...
Avoid mm_set1 operation when it's possible to create the constant with
one bit-shift operation from another instead. Thanks Intel for
3-operand instruction encoding!
2019-02-04 14:44:47 +02:00
Pauli Oikkonen
314f5b0e1f
Rename 16x2b cmpgt function, comment it better, optimize it slightly
...
Eliminate an unnecessary bit masking to make it even more messy
2019-02-04 14:44:32 +02:00
Pauli Oikkonen
d8ff6a6459
Fix _andn_u32 to work on old Visual Studio
2019-02-01 15:34:42 +02:00
Pauli Oikkonen
3a1f2eb752
Prefer SSE4.1 implementation of SAD over AVX2
...
It seems that the 128-bit wide version consistently outperforms the
256-bit one
2019-01-10 13:48:55 +02:00
Pauli Oikkonen
9b24d81c6a
Use SSE instead of AVX for small widths
...
Highly dubious if this will help performance at all
2019-01-07 20:12:13 +02:00
Pauli Oikkonen
887d7700a8
Modify AVX2 SAD to mask data by byte granularity in AVX registers
...
Avoids using any SAD calculations narrower than 256 bits, and
simplifies the code. Also improves execution speed
2019-01-07 18:53:15 +02:00
Pauli Oikkonen
7585f79a71
AVX2-ize SAD calculation
...
Performance is no better than SSE though
2019-01-07 16:26:24 +02:00
Pauli Oikkonen
ab3dc58df6
Copy SAD SSE4.1 impl to AVX2
2019-01-03 18:31:57 +02:00
Pauli Oikkonen
45ac6e6d03
Tidy pack_16x16b_to_16x2b comments
2019-01-03 16:37:05 +02:00
Pauli Oikkonen
016eb014ad
Move packing 16x16b -> 16x2b into separate function
2018-12-20 10:51:44 +02:00
Ari Lemmetti
b234897e8a
Fix smp and amp blocks in fme and revert previous change.
...
Filter 8x8 (sub)blocks even with 8x4, 4x8, 16x4, 4x16 etc.
Calculate SATD on the 8x4, ... part
2018-12-19 21:30:53 +02:00
Pauli Oikkonen
9aaa6f260d
Fixes to enable portability
2018-12-18 20:42:09 +02:00
Pauli Oikkonen
2fdbbe9730
Move CG reordering code from quant-avx2 to shared header
2018-12-18 19:42:18 +02:00
Pauli Oikkonen
d02207306d
Create a header file for shared AVX2 code
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
361bf0c7db
Precompute >=2 coeff encoding loop with 2-bit arithmetic
...
Who needs 16x16b vectors when you can do practically the same with
16x2b pseudovectors in 32-bit general purpose registers!
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
f66cb23d5b
Optimize greater1 encoding loop
...
Calculating the c1 variable need not be a serial operation!
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
8c8b791c35
Vectorize kvz_context_get_sig_ctx_inc
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
033261eb74
Eliminate two branches using bit magic
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
c4434e8d04
Scan CG's in forward order to simplify finding last significant
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
efd097f5a5
Vectorize the coeff group loop to some extent
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
a01362e638
use the efficient method of reordering raster->scan
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
50a888e789
Use the efficient method to find first and last nz coeffs in block
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
7e9203f566
Scan coeff groups in scan order to help find last significant one
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
9a5a6fdbc7
Simplify two ifs in encode_coeff_nxn-avx2
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
37a2a8bac8
See if loop can be optimized by rearranging
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
584f2f74b6
Vectorize significant coeff group scanning loop
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
1bfed73221
Add AVX2 strategy for encode_coding_tree
2018-12-18 19:41:09 +02:00
Reima Hyvönen
1fcc5c6a8d
Merge branch 'bipred_recon'
2018-12-11 09:59:35 +02:00
Reima Hyvönen
e4a10880f3
Added case 12 to bipred_recon no mov
2018-12-11 09:52:17 +02:00
Marko Viitanen
a4f3968e52
Fix Visual Studio errors by initializing some variables used in AVX2 signhiding
2018-12-11 09:33:26 +02:00
Pauli Oikkonen
c465578048
Add a descriptive comment to coefficient reordering
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
f78bf2ebcb
Optimize q_coefs usage for indexed fetch
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
d9591f1b49
Eliminate midway buffering of reordered coefs
...
TODO: For some mysterious reason seems slightly slower than the
buffered one
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
7fe454c51f
Optimize get_cheapest_alternative()
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
6bbd3e5a44
Optimize rearrange_512 function
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
cb8209d1b3
Vectorize transform coefficient reordering loop
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
7cf4c7ae5f
Rename "reduce" functions to hsum
...
That's what the functions fundamendally do anyway
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
316cd8a846
Fix ALIGNED keyword and grow alignment to 64B
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
1befc69a4c
Implement sign bit hiding in AVX2
2018-12-03 15:36:32 +02:00
Reima Hyvönen
f8696b54a4
Updated bipred_recon_avx2 in avx2/picture-avx2.c. Now it detects blocks that can be not equal to 8 (ie. width = 12)
2018-11-20 17:09:19 +02:00
Reima Hyvönen
710ba288db
Chroma has some problems
2018-11-15 16:42:48 +02:00
Ari Lemmetti
a832206bb6
Replace 32-bit incompatible instrinsics
2018-11-12 18:54:33 +02:00
Ari Lemmetti
5c774c4105
Rewrite most of FME and interpolation filters
...
Changes had to break a lot of stuff and were just squashed into this horrible code dump
2018-11-08 20:21:16 +02:00
Reima Hyvönen
7406c33a42
Some more cleaning
2018-10-26 12:25:18 +03:00
Reima Hyvönen
4c71546b2e
Cleaned some coding
2018-10-26 12:19:44 +03:00
Reima Hyvönen
4fe3909e48
Switched luma to use 32bits size ints intstead of 16bit size
2018-10-24 18:24:46 +03:00
Reima Hyvönen
381e786e10
Trying to find the bug in luma
2018-10-11 18:08:41 +03:00
Reima Hyvönen
2f5f81bac3
removed the non-optimated bipred function
2018-10-09 11:19:23 +03:00
Reima Hyvönen
212a8e68fa
Modified to avoid memory overflow, still some bug inside luma
2018-10-02 20:23:32 +03:00
Reima Hyvönen
896034b7cf
Some renamed functions back
2018-08-28 15:31:10 +03:00
Reima Hyvönen
7de5c74434
Updated bipred_recon to work faster
2018-08-28 15:12:31 +03:00
Reima Hyvönen
2ca99a44e8
Updated shuffle operation to be in right order
2018-08-27 18:16:38 +03:00
Reima Hyvönen
508b218a12
some modifications made to prevent reading too much
2018-08-14 10:50:39 +03:00
Reima Hyvönen
1d935ee888
some useless stuff removed
2018-08-13 16:47:11 +03:00
Reima Hyvönen
ce3ac4c05e
some modifications to no_mov
2018-08-13 16:41:02 +03:00
Reima Hyvönen
15a613ae94
test if no_mov breaks testing
2018-08-13 16:02:56 +03:00
Reima Hyvönen
97a2049e58
removed pointer declaration out from switch
2018-08-10 16:42:26 +03:00
Reima Hyvönen
aa94bcedbc
Stream is now pointer
2018-08-10 16:38:49 +03:00
Reima Hyvönen
fa5b227ece
256 to 32 doesn't work, made them by hand
2018-08-10 16:01:20 +03:00
Reima Hyvönen
408dedbcc8
removed _mm256_extract_epi8 and replaced with _mm_stream
2018-08-10 15:53:26 +03:00
Reima Hyvönen
31c35091c6
_mm256_cvtsi256_si32 removed
2018-08-10 10:06:40 +03:00
Reima Hyvönen
99dc43074f
_mm256_cvtsi256_si32 breaks system, too much bits. back to extract
2018-08-10 09:59:33 +03:00
Reima Hyvönen
4f1f80b2cb
Transformed convert from 256 to cast 256 -> 128 and then convert from 128
2018-08-09 15:35:54 +03:00
Reima Hyvönen
4957555eb3
Removed leftover from 939
2018-08-09 15:25:03 +03:00
Reima Hyvönen
28b165c971
Clearified some sections, added _MM_SHUFFLE macro
2018-08-09 15:23:01 +03:00
Reima Hyvönen
dd04df8667
testing if error in both avx2 functions
2018-08-03 11:49:00 +03:00
Reima Hyvönen
ed50d71fde
Switched some variables to different location, altered inter_recon_bipred_avx2 function
2018-08-02 16:08:59 +03:00
Reima Hyvönen
f5739a0028
Renaming and removing useless prints
2018-08-02 14:47:17 +03:00
Reima Hyvönen
bc09f59bb6
Edited some definitions
2018-08-02 11:54:53 +03:00
Reima Hyvönen
a4bf77f208
Tested some extract functions
2018-07-12 09:29:32 +03:00
Reima Hyvönen
c05033a893
Even more useless vectors removed
2018-07-11 15:09:14 +03:00
Reima Hyvönen
884cb77238
Removed some not used vectors
2018-07-11 15:06:11 +03:00
Reima Hyvönen
792689a5ff
Removed for-loops, added extract instead
2018-07-11 14:56:41 +03:00
Reima Hyvönen
f9c7f6ee66
Added some break-operations for avx2 optimation
2018-07-11 14:15:38 +03:00
Reima Hyvönen
cc064da143
some more optimation for bipred
2018-07-11 11:27:54 +03:00
Reima Hyvönen
a22cf03ddb
Updated to have no movement function to avx2 strategies
2018-07-10 16:07:15 +03:00
Reima Hyvönen
ea83ae45f0
Toimiva ratkaisu
2018-07-03 11:18:51 +03:00
Reima Hyvönen
17babfffa4
25.6 working optimation, ~50% faster than original
2018-06-25 17:06:16 +03:00
Arttu Ylä-Outinen
0a69e6d18f
Fix selection of transform function for 4x4 blocks
...
DST function was returned for inter luma transform blocks of size 4x4
even though they must use DCT. Fixed by checking the prediction mode of
the block in addition to whether it is chroma or luma.
2018-01-18 10:36:25 +02:00
Arttu Ylä-Outinen
9694bd2fae
Fix build on 32-bit systems
...
Function coeff_abs_sum_avx2 that was added in e950c9b
was outside the
AVX2 #if directive.
2017-07-28 09:19:29 +03:00
Arttu Ylä-Outinen
e950c9b101
Add AVX2 implementation for coefficient sum
2017-07-28 07:39:36 +03:00
Arttu Ylä-Outinen
fdb3480b54
Enable strategies for SAO reconstruction
...
Re-enables strategies for SAO reconstruction. They were disabled in
commit ec9ff42
.
2017-07-11 10:35:18 +03:00
Arttu Ylä-Outinen
333dba3884
Add static to SAO strategies
2017-07-11 10:02:01 +03:00