Pauli Oikkonen
|
780da4568a
|
Exclude 8-bit-only code from 10-bit builds and use uint8_t instead of kvz_pixel for code that assumes 8-bit pixels
|
2020-09-02 17:46:33 +03:00 |
|
Pauli Oikkonen
|
8617530b13
|
Use _mm_store_epi64 instead of _mm_cvtsi128_si64
Fix 32-bit builds that tend to lack the cvt intrinsic. Hope it will be
optimized to a movq r64, xmm on modern platforms though
|
2020-04-07 23:51:54 +03:00 |
|
Pauli Oikkonen
|
a82966c0f5
|
Fix lacking _mm256_cvtss_f32 intrinsic on VS
Cast __m256 into __m128 first, the XMM variant of the intrinsic has been
around for a long enough time to be supported
|
2020-04-07 22:38:10 +03:00 |
|
Ari Lemmetti
|
901c25c0c8
|
Merge branch 'vaq'
|
2020-04-03 19:51:17 +03:00 |
|
Ari Lemmetti
|
51451be5ef
|
Handle cases where the number of pixels is not divisible by 32
|
2020-04-03 19:37:47 +03:00 |
|
Pauli Oikkonen
|
99889dab15
|
Fix switch(bool) in picture-avx2.c
It passes on GCC but warns on Clang
|
2020-03-31 15:42:19 +03:00 |
|
Pauli Oikkonen
|
60e7956dc5
|
Disable inaccurate integer variance calculation for now
|
2020-03-02 19:18:55 +02:00 |
|
Pauli Oikkonen
|
fc1b91335b
|
Implement variance calculation in integer math
Maybe this is a bit faster than FP, it's not accurate though
|
2020-03-02 18:17:18 +02:00 |
|
Pauli Oikkonen
|
b00ac7d1c4
|
AVX2 version of buffer variance calculation
|
2020-02-25 15:57:56 +02:00 |
|
Ari Lemmetti
|
3c7dd0752f
|
Remove the broken "no mov" branch.
Causes hash mismatches for example in SlideShow sequence.
|
2020-02-03 15:26:31 +02:00 |
|
RLamm
|
30d5df40c5
|
Custom headers for the distributed coding
|
2020-01-29 15:54:49 +02:00 |
|
Ari Lemmetti
|
557bcbc6aa
|
Make luma or chroma only inter "recon" or predict possible
|
2019-09-02 17:15:28 +03:00 |
|
Pauli Oikkonen
|
7175d20bb2
|
Still include stdint.h for non-vector builds
|
2019-04-15 19:36:01 +03:00 |
|
Pauli Oikkonen
|
1315c7e2b0
|
Do not compile any vector code for non-SSE4/AVX2 builds
|
2019-04-15 19:10:48 +03:00 |
|
Pauli Oikkonen
|
6d43759604
|
Create a border-respecting 32-wide AVX hor_sad
|
2019-03-07 18:01:22 +02:00 |
|
Pauli Oikkonen
|
f218cecb38
|
Remove offending hor_sad_avx2_w32 function
Consider possibly creating a non-offending AVX2 version instead, the
way hor_sad_sse41_w32 works. Or maybe there's more essential work to
do.
|
2019-03-05 22:51:41 +02:00 |
|
Pauli Oikkonen
|
2d05ca8520
|
Remove width from constant-width hor_sad func params
They should kinda know it already
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
dd7d989a39
|
Implement 32-wide hor_sad on AVX2
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
f5ff4db01f
|
4-wide hor_sad border agnostic
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
35e7f9a700
|
Fix hor_sad w8 to work with both borders
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
836783dd6e
|
Use hor_sad_w32 for both left and right borders
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
69687c8d24
|
Modify hor_sad_sse41_w16 to work over left and right borders
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
768203a2de
|
First version of arbitrary-width SSE4.1 hor_sad
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
ccf683b9b6
|
Start work on left and right border aware hor_sad
Comes with 4, 8, 16 and 32 pixel wide implementations now, at some point
investigate if this can start to thrash icache
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
f781dc31f0
|
Create strategy for ver_sad
Easy to vectorize
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
91cb0fbd45
|
Create strategy for directly obtaining pointer to constant-width SAD function
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
6504145cce
|
Remove 16-pixel wide AVX2 SAD implementation
At least on Skylake, it's noticeably slower than the very simple
version using SSE4.1
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
4cb371184b
|
Add SSE4.1 strategy for 24px wide SAD and an AVX2 strategy for 16
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
796568d9cc
|
Add SSE4.1 strategies for SAD on widths 4 and 12 and AVX2 strategies for 32 and 64
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
4d45d828fa
|
Use constant-width SSE4.1 SAD funcs for AVX2
|
2019-02-04 20:41:40 +02:00 |
|
Pauli Oikkonen
|
3a1f2eb752
|
Prefer SSE4.1 implementation of SAD over AVX2
It seems that the 128-bit wide version consistently outperforms the
256-bit one
|
2019-01-10 13:48:55 +02:00 |
|
Pauli Oikkonen
|
9b24d81c6a
|
Use SSE instead of AVX for small widths
Highly dubious if this will help performance at all
|
2019-01-07 20:12:13 +02:00 |
|
Pauli Oikkonen
|
887d7700a8
|
Modify AVX2 SAD to mask data by byte granularity in AVX registers
Avoids using any SAD calculations narrower than 256 bits, and
simplifies the code. Also improves execution speed
|
2019-01-07 18:53:15 +02:00 |
|
Pauli Oikkonen
|
7585f79a71
|
AVX2-ize SAD calculation
Performance is no better than SSE though
|
2019-01-07 16:26:24 +02:00 |
|
Pauli Oikkonen
|
ab3dc58df6
|
Copy SAD SSE4.1 impl to AVX2
|
2019-01-03 18:31:57 +02:00 |
|
Reima Hyvönen
|
1fcc5c6a8d
|
Merge branch 'bipred_recon'
|
2018-12-11 09:59:35 +02:00 |
|
Reima Hyvönen
|
e4a10880f3
|
Added case 12 to bipred_recon no mov
|
2018-12-11 09:52:17 +02:00 |
|
Reima Hyvönen
|
f8696b54a4
|
Updated bipred_recon_avx2 in avx2/picture-avx2.c. It now detects blocks whose width is not 8 (e.g. width = 12)
|
2018-11-20 17:09:19 +02:00 |
|
Reima Hyvönen
|
710ba288db
|
Chroma has some problems
|
2018-11-15 16:42:48 +02:00 |
|
Ari Lemmetti
|
5c774c4105
|
Rewrite most of FME and interpolation filters
Changes had to break a lot of stuff and were just squashed into this horrible code dump
|
2018-11-08 20:21:16 +02:00 |
|
Reima Hyvönen
|
7406c33a42
|
Some more cleaning
|
2018-10-26 12:25:18 +03:00 |
|
Reima Hyvönen
|
4c71546b2e
|
Cleaned up some code
|
2018-10-26 12:19:44 +03:00 |
|
Reima Hyvönen
|
4fe3909e48
|
Switched luma to use 32-bit ints instead of 16-bit ints
|
2018-10-24 18:24:46 +03:00 |
|
Reima Hyvönen
|
381e786e10
|
Trying to find the bug in luma
|
2018-10-11 18:08:41 +03:00 |
|
Reima Hyvönen
|
2f5f81bac3
|
Removed the non-optimized bipred function
|
2018-10-09 11:19:23 +03:00 |
|
Reima Hyvönen
|
212a8e68fa
|
Modified to avoid memory overflow, still some bug inside luma
|
2018-10-02 20:23:32 +03:00 |
|
Reima Hyvönen
|
896034b7cf
|
Renamed some functions back
|
2018-08-28 15:31:10 +03:00 |
|
Reima Hyvönen
|
7de5c74434
|
Updated bipred_recon to work faster
|
2018-08-28 15:12:31 +03:00 |
|
Reima Hyvönen
|
2ca99a44e8
|
Updated shuffle operation to be in right order
|
2018-08-27 18:16:38 +03:00 |
|
Reima Hyvönen
|
508b218a12
|
some modifications made to prevent reading too much
|
2018-08-14 10:50:39 +03:00 |
|