465 commits

Author SHA1 Message Date
Ari Lemmetti 84222cf3e7 Replace old block extrapolation with more capable one.
Separate paddings for different directions can be now specified.
2021-03-08 22:36:04 +02:00
Pauli Oikkonen 816789c9f4 Allow fast coeff weights to be read from a file 2020-10-29 15:22:51 +02:00
Pauli Oikkonen 6799019db0 Move fast coeff table to transform.h
Guess this is a more logical place for it
2020-10-29 15:20:27 +02:00
Pauli Oikkonen 4712ce5f59 Round the fast coeff result instead of flooring 2020-10-29 15:20:27 +02:00
Pauli Oikkonen 0fb09c9920 New filtered coeff weight by QP values 2020-10-29 15:20:27 +02:00
Pauli Oikkonen 24d487f553 New weights for 12 <= QP <= 42
Trained using MSU ultrafast settings now
2020-10-29 15:20:27 +02:00
Pauli Oikkonen 3e1c6d84b8 Fix issues in fast coeff estimation
Allow weight table to start from nonzero QP, and round weights to Q8.8
instead of flooring them
2020-10-29 15:20:27 +02:00
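As a rough illustration of the rounding fix described in 3e1c6d84b8 (and the related 4712ce5f59), the sketch below converts a real-valued weight to Q8.8 fixed point by rounding to nearest instead of flooring, and scales a Q8.8 accumulator back to an integer the same way. All function names are hypothetical and the code is not taken from the repository; it assumes non-negative weights below 256.

```c
#include <stdint.h>

/* Q8.8: 8 integer bits and 8 fractional bits, so 1.0 is stored as 256. */
#define Q8_8_ONE 256

/* Hypothetical helper: truncate the fractional part (flooring for w >= 0). */
static uint16_t weight_to_q8_8_floor(double w)
{
  return (uint16_t)(w * Q8_8_ONE);
}

/* Hypothetical helper: round to nearest by adding half an LSB before truncating. */
static uint16_t weight_to_q8_8_round(double w)
{
  return (uint16_t)(w * Q8_8_ONE + 0.5);
}

/* Hypothetical helper: scale a Q8.8 accumulator back to an integer,
 * adding half of 1.0 (128) first so the shift rounds instead of flooring. */
static uint32_t q8_8_to_int_round(uint32_t acc)
{
  return (acc + Q8_8_ONE / 2) >> 8;
}
```

With a weight of 1.999, flooring stores 511 while rounding stores 512, so flooring consistently biases the estimate downward.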
Pauli Oikkonen 5f91bda762 Use newer data for fast coeff cost estimation
Same training dataset, but this time only buckets 0...3 were used to
approximate the function, no sign/cg width bucket.
2020-10-29 15:20:27 +02:00
Pauli Oikkonen 2abd733199 Use unsigned min() to correctly clip -32768
If a coeff happens to be -32768 (0x8000), its 16-bit abs() is also
0x8000. It should ultimately be clipped to 3, so interpret absolute
values as unsigned instead to make that happen.
2020-10-29 15:20:27 +02:00
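The corner case in 2abd733199 can be reproduced in scalar C. This is a minimal sketch, not the AVX2 code from the commit: the helper names are hypothetical, and the 16-bit wraparound of a packed abs is emulated with unsigned arithmetic.

```c
#include <stdint.h>

/* 16-bit absolute value with wraparound: -32768 stays 0x8000,
 * because +32768 does not fit in a signed 16-bit integer. */
static uint16_t abs16_wrapping(int16_t v)
{
  uint16_t u = (uint16_t)v;
  return (v < 0) ? (uint16_t)(0u - u) : u;
}

/* Clip with a signed comparison: 0x8000 reads as -32768, which is
 * smaller than the limit, so the value is NOT clipped. */
static uint16_t clip_abs_signed(int16_t coeff, int16_t limit)
{
  uint16_t a = abs16_wrapping(coeff);
  /* Reinterpret the 16-bit pattern as signed using plain arithmetic. */
  int32_t a_signed = (a >= 0x8000u) ? (int32_t)a - 65536 : (int32_t)a;
  return (a_signed < limit) ? a : (uint16_t)limit;
}

/* Clip with an unsigned comparison: 0x8000 reads as 32768, which is
 * larger than the limit, so the value is clipped as intended. */
static uint16_t clip_abs_unsigned(int16_t coeff, uint16_t limit)
{
  uint16_t a = abs16_wrapping(coeff);
  return (a < limit) ? a : limit;
}
```

For a coefficient of -32768 and a limit of 3, the signed variant returns 0x8000 while the unsigned variant returns 3; for every other input the two agree.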
Pauli Oikkonen b93b90c0d7 Implement new fast coeff cost estimator in AVX2 2020-10-29 15:20:27 +02:00
Pauli Oikkonen 2f74a112b3 Try first lookup table based fast coeff estimation 2020-10-29 15:20:27 +02:00
Pauli Oikkonen 780da4568a Exclude 8-bit-only code from 10-bit builds and use uint8_t instead of kvz_pixel for code that assumes 8-bit pixels 2020-09-02 17:46:33 +03:00
Jan Beich 1fa69c705d Rename truncate() from 30ce461d98 to avoid conflict with POSIX version
strategies/avx2/dct-avx2.c:55:23: error: static declaration of 'truncate' follows non-static declaration
static INLINE __m256i truncate(__m256i v, __m256i debias, int32_t shift)
                      ^
/usr/include/stdio.h:448:6: note: previous declaration is here
int      truncate(const char *, __off_t);
         ^
2020-04-22 16:09:42 +00:00
Ari Lemmetti f31dddc019 Bypass inverse quantization and inverse transform when trying early skip 2020-04-10 16:02:09 +03:00
Pauli Oikkonen 8617530b13 Use _mm_store_epi64 instead of _mm_cvtsi128_si64
Fix 32-bit builds that tend to lack the cvt intrinsic. Hope it will be
optimized to a movq r64, xmm on modern platforms though
2020-04-07 23:51:54 +03:00
Pauli Oikkonen a82966c0f5 Fix lacking _mm256_cvtss_f32 intrinsic on VS
Cast __m256 into __m128 first, the XMM variant of the intrinsic has been
around for a long enough time to be supported
2020-04-07 22:38:10 +03:00
Ari Lemmetti 901c25c0c8 Merge branch 'vaq' 2020-04-03 19:51:17 +03:00
Ari Lemmetti 51451be5ef Handle cases where the number of pixels is not divisible by 32 2020-04-03 19:37:47 +03:00
siivonek e5267f7706 Fix define for use with Visual Studio. 2020-04-03 15:11:01 +02:00
Pauli Oikkonen addc1c3ede Fix warning about potentially unused hsum_8x32b
There's a lot of alternative options available, such as making it
globally visible with a kvz_ prefix, force inlining it, or anything.
This could be good too, hope it won't be compiled at all to translation
units where it's not used.
2020-04-02 16:44:22 +03:00
siivonek 566680af7b Move function hsum to file where it is used to avoid errors. 2020-04-02 14:03:06 +02:00
siivonek 58be514e2a Fix pipeline error. 2020-04-02 13:50:08 +02:00
Pauli Oikkonen 99889dab15 Fix switch(bool) in picture-avx2.c
It passes on GCC but warns on Clang
2020-03-31 15:42:19 +03:00
Jaakko Laitinen af3d559d8d Let pu-depth be defined per gop-layer 2020-03-17 17:57:18 +02:00
Pauli Oikkonen 60e7956dc5 Disable inaccurate integer variance calculation for now 2020-03-02 19:18:55 +02:00
Pauli Oikkonen fc1b91335b Implement variance calculation in integer math
Maybe this is a bit faster than FP, it's not accurate though
2020-03-02 18:17:18 +02:00
Pauli Oikkonen 35c825c75f Move hsum_8x32b to avx2_common_functions 2020-02-27 17:52:17 +02:00
Pauli Oikkonen b00ac7d1c4 AVX2 version of buffer variance calculation 2020-02-25 15:57:56 +02:00
Pauli Oikkonen 1bd9c6dd93 Make a strategy out of pixel_var 2020-02-24 19:37:36 +02:00
Ari Lemmetti 3c7dd0752f Remove the broken "no mov" branch.
Causes hash mismatches for example in SlideShow sequence.
2020-02-03 15:26:31 +02:00
RLamm 30d5df40c5 Custom headers for the distributed coding 2020-01-29 15:54:49 +02:00
Pauli Oikkonen c3d9e97e9f Fix VS build 2019-12-12 18:34:55 +02:00
Pauli Oikkonen 7f238ca299 Remove debug print functions
Whoops
2019-12-12 18:19:31 +02:00
Pauli Oikkonen eefb5e50b3 De-inline pred_filtered_dc functions, shouldn't make much difference though 2019-12-12 17:30:00 +02:00
Pauli Oikkonen 169314de4f 32x32 filtered DC prediction in AVX2 2019-12-11 18:17:06 +02:00
Pauli Oikkonen fb2481b7e4 16x16 filtered DC implemented in AVX2 2019-12-10 15:54:50 +02:00
Pauli Oikkonen da370ea36d Implement AVX2 8x8 filtered DC algorithm 2019-11-28 14:10:10 +02:00
Pauli Oikkonen 5d9b7019ca Implement a 4x4 filtered DC pred function 2019-11-26 17:05:54 +02:00
Pauli Oikkonen f1485ab087 Start doing an arbitrary size filtered DC pred - maybe easier to just create separate functions for fixed block sizes? 2019-11-25 15:20:29 +02:00
Pauli Oikkonen 979d66031c Create a strategy out of intra_pred_filtered_dc 2019-11-19 14:50:31 +02:00
Pauli Oikkonen fa4bb86406 Optimize intra_pred_planar_avx2 for 4x4 blocks 2019-11-19 13:39:02 +02:00
Pauli Oikkonen 4761d228f9 Start to vectorize the 4x4 loop 2019-11-15 17:32:40 +02:00
Pauli Oikkonen 8d45ab4951 Stupidify the 4x4 planar loop for vectorization 2019-11-14 17:14:04 +02:00
Pauli Oikkonen 6d7a4f555c Also remove 16x16 (A * B^T)^T matrix multiply
Can be done using (B * A^T) instead, it's the exact same
2019-10-28 16:19:42 +02:00
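The equivalence this commit (and a58608d0b8) relies on follows from the transpose-of-a-product rule. For any matrices A and B of compatible dimensions:

```latex
(A B^{T})^{T} = (B^{T})^{T} A^{T} = B A^{T}
```

So a routine that computes a product against a transposed right-hand operand can presumably produce the transposed result directly by swapping its arguments, with no separate transpose step.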
Pauli Oikkonen 2c2deb2366 Tidy AVX2 32x32 matrix multiply 2019-10-28 16:19:42 +02:00
Pauli Oikkonen 98ad78b333 Tidy the old AVX2 32x32 matrix multiply
It was actually a very good algorithm, just looked messy!
2019-10-28 16:19:42 +02:00
Pauli Oikkonen 4a921cbdb5 Retain data as much in YMM registers as possible
This seems to make it a whole lot quicker
2019-10-28 16:19:42 +02:00
Pauli Oikkonen ac4d710e23 Unroll 32x32 matrix multiply, use all regs 2019-10-28 16:19:42 +02:00
Pauli Oikkonen a58608d0b8 Remove totally unnecessary (A * B^T)^T 32x32 multiply 2019-10-28 16:19:42 +02:00
Pauli Oikkonen 043f53539f Implement a streamlined matrix-multiply 32x32 DCT 2019-10-28 16:19:42 +02:00