Reima Hyvönen
7853be8eeb
Incomple optimation
2019-08-07 16:35:24 +03:00
Ari Lemmetti
40609aa865
Add missing headers to Makefile.am
2019-07-12 19:15:51 +03:00
Ari Lemmetti
5db3a78499
Bump versions for release 1.3
2019-07-09 22:09:32 +03:00
Ari Lemmetti
d513ab1999
Add missing newline
2019-07-09 21:06:05 +03:00
Ari Lemmetti
4967072625
Do not bypass search on skip cu if early_skip is not enabled
2019-07-09 20:20:12 +03:00
Ari Lemmetti
b20992a9f3
Rename functions more descriptive
2019-07-09 20:20:11 +03:00
Ari Lemmetti
a348a0ec23
Fix transform depth in early skip
2019-07-09 20:05:48 +03:00
Pauli Oikkonen
8d48bee180
Tidy fast coeff cost code
2019-07-09 18:01:54 +03:00
Pauli Oikkonen
201a43b08e
Clean up the RD-estimation code
2019-07-09 18:01:54 +03:00
Pauli Oikkonen
b111df5073
Create preliminary version of improved cost estimator
2019-07-09 18:01:54 +03:00
Ari Lemmetti
be08a87d94
Add missing parameter max-merge to the help message
2019-07-09 16:28:46 +03:00
Ari Lemmetti
d0bb9b4a6d
Add parameter max-merge to presets
2019-07-09 16:26:03 +03:00
Ari Lemmetti
4097331fd6
Early skip
2019-07-09 15:59:31 +03:00
Joose Sainio
977e885ea2
Fix issue with gop=0 introduced in 1c36f68d0c
2019-07-05 12:57:27 +03:00
Mikko Pitkänen
a7f09c8114
Merge branch 'threadwrapper'
2019-06-24 16:54:59 +03:00
Mikko Pitkänen
3dd606ce2e
Add new threadwrapper
2019-06-18 18:45:45 +03:00
Joose Sainio
c94077d15e
remove hardcoded value
2019-06-12 14:37:41 +03:00
Joose Sainio
ac68c8444d
remove negation that wasn't supposed to be there
2019-06-12 14:35:24 +03:00
Joose Sainio
5851dcc3be
missing negation
2019-06-12 14:08:18 +03:00
Joose Sainio
1c36f68d0c
Fix owf>=9 gop=8 and add test to catch such problem in future
2019-06-12 14:04:41 +03:00
Ari Lemmetti
933ff6ed55
Merge branch 'set-qp-in-cu-fix'
2019-06-07 09:01:03 +03:00
Ari Lemmetti
c6da839002
Set lcu sqrt lambda according to lcu lambda instead of frame lambda when ROI is used
2019-05-29 18:32:10 +03:00
Ari Lemmetti
9339845e8b
Set QP completely at CU level as the name '--set-qp-in-cu' implies
...
-Move slice delta QP to CU level when using --set-qp-in-cu
-Separate functionality from roi
2019-05-24 20:38:39 +03:00
Pauli Oikkonen
081d16fc33
Fix intrinsics that may be missing on some systems
...
Create a header to collect all the workarounds for missing intrinsics
in one place
2019-05-23 19:59:40 +03:00
Pauli Oikkonen
87a9208db8
Eliminate cvtsi64_si128 intrinsic
...
Apparently it'll cause Win32 builds to break because it emits the movq
instruction or something..
2019-04-17 16:30:40 +03:00
Pauli Oikkonen
7175d20bb2
Still include stdint.h for non-vector builds
2019-04-15 19:36:01 +03:00
Pauli Oikkonen
1315c7e2b0
Do not compile any vector code for non-SSE4/AVX2 builds
2019-04-15 19:10:48 +03:00
Pauli Oikkonen
f5f70e7bc5
Merge branch 'sad-optimization'
2019-04-15 19:02:01 +03:00
Jan Beich
85f46e17a9
Detect AltiVec via elf_aux_info() on FreeBSD 12+
2019-04-01 13:08:04 +00:00
Jan Beich
82486255da
Simplify AltiVec detection on Linux
2019-04-01 13:08:04 +00:00
Pauli Oikkonen
6d43759604
Create a border-respecting 32-wide AVX hor_sad
2019-03-07 18:01:22 +02:00
Pauli Oikkonen
f218cecb38
Remove offending hor_sad_avx2_w32 function
...
Consider possibly creating a non-offending AVX2 version instead, the
way hor_sad_sse41_w32 works. Or maybe there's more essential work to
do.
2019-03-05 22:51:41 +02:00
Pauli Oikkonen
df2e6c54fd
4-unroll hor_sad_sse41_arbitrary
...
This may not increase perf though because it's so rarely used
function, so keeping icache footprint may be more essential...
2019-03-05 22:45:23 +02:00
Pauli Oikkonen
448eacba7b
Avoid overreading block borders in hor_sad_sse41_arbitrary
2019-03-05 22:34:50 +02:00
Eemeli Kallio
c159e275b7
Merge branch 'max_merge'
2019-03-05 14:39:03 +02:00
Pauli Oikkonen
41f51c08c4
Avoid overrunning buffer in hor_sad_sse41_w32
2019-03-01 15:37:38 +02:00
Pauli Oikkonen
bcd9879359
Include quant coeff range check in non-scaling list execution path too
2019-02-27 17:26:44 +02:00
Pauli Oikkonen
24e6363f64
Remove the kvz_quant_avx2 wrapper function
2019-02-27 16:32:58 +02:00
Pauli Oikkonen
748820f3c5
Eliminate unnecessary loading of coeffs if scaling lists are off
2019-02-27 16:26:35 +02:00
Pauli Oikkonen
5994350f40
Allow quant_flat_avx2 to be used with scaling lists on
2019-02-27 16:25:59 +02:00
Eemeli Kallio
7f4e0acf41
Added check if max-merge is out of bounds
2019-02-19 13:53:42 +02:00
Pauli Oikkonen
9b0e079262
Use SSE instructions for 64-bit SADs instead of MMX
...
VC++ seems to choke on MMX instructions
2019-02-18 20:13:33 +02:00
Pauli Oikkonen
d8b8923028
Add LGPL notices to reg_sad headers
2019-02-18 17:52:47 +02:00
Eemeli Kallio
2a40560888
some variables to const
2019-02-12 11:24:10 +02:00
Eemeli Kallio
8f8e7bb53c
Added possibility to reduce number of maximum number of merge candidates.
2019-02-12 09:21:03 +02:00
Pauli Oikkonen
770db825b9
Create hor_sad_w8 and w4 epol mask the way w16 works
2019-02-06 19:34:26 +02:00
Pauli Oikkonen
aa19bcac8a
Avoid branching in creating shuffle mask in hor_sad_w16
2019-02-06 18:58:46 +02:00
Pauli Oikkonen
2d05ca8520
Remove width from constant-width hor_sad func params
...
They should kinda know it already
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
57db234d95
Move 32-wide SSE4.1 hor_sad to picture-sse41.c
...
It's not used by picture-avx2.c that also includes the header, so
it should not be in the header
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
dd7d989a39
Implement 32-wide hor_sad on AVX2
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
ff70c8a5ec
Utilize horizontal SAD functions for SSE4.1 as well
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
f5ff4db01f
4-wide hor_sad border agnostic
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
35e7f9a700
Fix hor_sad w8 to work with both borders
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
836783dd6e
Use hor_sad_w32 for both left and right borders
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
69687c8d24
Modify hor_sad_sse41_w16 to work over left and right borders
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
51c2abe99a
Modify image_interpolated_sad to use kvz_hor_sad
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
1e0eb1af30
Add generic strategy for hor_sad'ing an non-split width block
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
686fb2c957
Unroll arbitrary-width SSE4.1 hor_sad by 4
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
768203a2de
First version of arbitrary-width SSE4.1 hor_sad
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
ccf683b9b6
Start work on left and right border aware hor_sad
...
Comes with 4, 8, 16 and 32 pixel wide implementations now, at some point
investigate if this can start to thrash icache
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
760bd0397d
Pad the image buffer by 64 bytes from both ends
...
This will be necessary for an efficient and straightforward
implementation of hor_sad for blocks over 16 pixels wide, because they
cannot use the shuffle trick because inter-lane shuffling is so hard to
do
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
c36482a11a
Fix bug in 24-wide SAD
...
*facepalm*
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
f781dc31f0
Create strategy for ver_sad
...
Easy to vectorize
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
ca94ae9529
Handle extrapolated blocks with unmodified width using optimized_sad pointer
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
91b30c7064
Tidy up kvz_image_calc_sad
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
9db0a1bcda
Create get_optimized_sad func for SSE4.1
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
91380729b1
Add generic get_optimized_sad implementation
...
NOTE: To force generic SAD implementation on devices supporting
vectorized variants, you now have to override both get_optimized_sad
and reg_sad to generic (only overriding get_optimized_sad on AVX2
hardware would just run all SAD blocks through reg_sad_avx2). Let's
see if there's a more sensible way to do it, but it's not trivial.
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
45f36645a6
Move choosing of tailored SAD function higher up the calling chain
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
91cb0fbd45
Create strategy for directly obtaining pointer to constant-width SAD function
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
94035be342
Unify unrolling naming conventions
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
517a4338f6
Unroll SSE SAD for 8-wide blocks to process 4 lines at once
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
0f665b28f6
Unroll arbitrary width SSE4.1 SAD by 4
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
cbca3347b5
Unroll 64-wide AVX2 SAD by 2
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
84cf771dea
Unroll 32 and 16 wide SAD vector implementations by 4
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
5df5c5f8a4
Cast all pointers to const types in vector SAD funcs
...
Also tidy up the pointer arithmetic
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
a711ce3df5
Inline fixed width vectorized SAD functions
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
6504145cce
Remove 16-pixel wide AVX2 SAD implementation
...
At least on Skylake, it's noticeably slower than the very simple
version using SSE4.1
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
4cb371184b
Add SSE4.1 strategy for 24px wide SAD and an AVX2 strategy for 16
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
796568d9cc
Add SSE4.1 strategies for SAD on widths 4 and 12 and AVX2 strategies for 32 and 64
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
4d45d828fa
Use constant-width SSE4.1 SAD funcs for AVX2
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
2eaa7bc9d2
Move SSE4.1 SAD functions to separate header
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
d2db0086e1
Create constant width SAD versions for 8 and 16 pixels
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
a13fc51003
Include a blank AVX2 strategy registration function even in non-AVX2 builds
2019-02-04 19:52:24 +02:00
Pauli Oikkonen
d55414db66
Only build AVX2 coeff encoding when supported
...
..whoops
2019-02-04 19:34:30 +02:00
Pauli Oikkonen
3fe2f29456
Merge branch 'encode-coeffs-avx2'
2019-02-04 18:52:31 +02:00
Pauli Oikkonen
722b738888
Fix more naming issues
2019-02-04 16:05:43 +02:00
Pauli Oikkonen
e26d98fb75
Rename a couple variables and add crucial comments
2019-02-04 15:57:07 +02:00
Pauli Oikkonen
f186455619
Move encode_last_significant_xy out of strategy modules
...
It's the exact same in both AVX2 and generic, and does not seem to
be worth even trying to vectorize
2019-02-04 14:55:41 +02:00
Pauli Oikkonen
3f7340c932
Fine-tune pack_16x16b_to_16x2b
...
Avoid mm_set1 operation when it's possible to create the constant with
one bit-shift operation from another instead. Thanks Intel for
3-operand instruction encoding!
2019-02-04 14:44:47 +02:00
Pauli Oikkonen
314f5b0e1f
Rename 16x2b cmpgt function, comment it better, optimize it slightly
...
Eliminate an unnecessary bit masking to make it even more messy
2019-02-04 14:44:32 +02:00
Pauli Oikkonen
d8ff6a6459
Fix _andn_u32 to work on old Visual Studio
2019-02-01 15:34:42 +02:00
Pauli Oikkonen
26e1b2c783
Use (u)int32_t instead of (unsigned) int in reg_sad_sse41
2019-01-10 14:37:04 +02:00
Pauli Oikkonen
3a1f2eb752
Prefer SSE4.1 implementation of SAD over AVX2
...
It seems that the 128-bit wide version consistently outperforms the
256-bit one
2019-01-10 13:48:55 +02:00
Pauli Oikkonen
9b24d81c6a
Use SSE instead of AVX for small widths
...
Highly dubious if this will help performance at all
2019-01-07 20:12:13 +02:00
Pauli Oikkonen
b2176bf72a
Optimize SSE4.1 version of SAD
...
Make it use the same vblend trick as AVX2. Interestingly, on my test
setup this seems to be faster than the same code using 256-bit AVX
vectors.
2019-01-07 19:40:57 +02:00
Pauli Oikkonen
887d7700a8
Modify AVX2 SAD to mask data by byte granularity in AVX registers
...
Avoids using any SAD calculations narrower than 256 bits, and
simplifies the code. Also improves execution speed
2019-01-07 18:53:15 +02:00
Pauli Oikkonen
7585f79a71
AVX2-ize SAD calculation
...
Performance is no better than SSE though
2019-01-07 16:26:24 +02:00
Pauli Oikkonen
ab3dc58df6
Copy SAD SSE4.1 impl to AVX2
2019-01-03 18:31:57 +02:00
Pauli Oikkonen
45ac6e6d03
Tidy pack_16x16b_to_16x2b comments
2019-01-03 16:37:05 +02:00
Ari Lemmetti
cd818db724
Add missing quantization and residual in cost calculation (inter rd=2).
2018-12-21 15:55:29 +02:00