RLamm
30d5df40c5
Custom headers for the distributed coding
2020-01-29 15:54:49 +02:00
Pauli Oikkonen
c3d9e97e9f
Fix VS build
2019-12-12 18:34:55 +02:00
Pauli Oikkonen
7f238ca299
Remove debug print functions
...
Whoops
2019-12-12 18:19:31 +02:00
Pauli Oikkonen
eefb5e50b3
De-inline pred_filtered_dc functions, shouldn't make much difference though
2019-12-12 17:30:00 +02:00
Pauli Oikkonen
169314de4f
32x32 filtered DC prediction in AVX2
2019-12-11 18:17:06 +02:00
Pauli Oikkonen
fb2481b7e4
16x16 filtered DC implemented in AVX2
2019-12-10 15:54:50 +02:00
Pauli Oikkonen
da370ea36d
Implement AVX2 8x8 filtered DC algorithm
2019-11-28 14:10:10 +02:00
Pauli Oikkonen
5d9b7019ca
Implement a 4x4 filtered DC pred function
2019-11-26 17:05:54 +02:00
Pauli Oikkonen
f1485ab087
Start doing an arbitrary size filtered DC pred - maybe easier to just create separate functions for fixed block sizes?
2019-11-25 15:20:29 +02:00
Marko Viitanen
eb2caf9118
Fix intra angle filter, changed from gauss filter table to run-time calculated 4-tap filter
2019-11-19 15:15:21 +02:00
Pauli Oikkonen
979d66031c
Create a strategy out of intra_pred_filtered_dc
2019-11-19 14:50:31 +02:00
Marko Viitanen
466d8772b0
Apply JVET_P0170_ZERO_POS_SIMPLIFICATION in coeff bypass coding
2019-11-19 14:32:38 +02:00
Pauli Oikkonen
fa4bb86406
Optimize intra_pred_planar_avx2 for 4x4 blocks
2019-11-19 13:39:02 +02:00
Marko Viitanen
17a53230fd
Code cleanup, remove unused arrays and remove tabs
2019-11-18 09:01:23 +02:00
Pauli Oikkonen
4761d228f9
Start to vectorize the 4x4 loop
2019-11-15 17:32:40 +02:00
Pauli Oikkonen
8d45ab4951
Stupidify the 4x4 planar loop for vectorization
2019-11-14 17:14:04 +02:00
Pauli Oikkonen
6d7a4f555c
Also remove 16x16 (A * B^T)^T matrix multiply
...
Can be done using (B * A^T) instead, it's the exact same
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
2c2deb2366
Tidy AVX2 32x32 matrix multiply
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
98ad78b333
Tidy the old AVX2 32x32 matrix multiply
...
It was actually a very good algorithm, just looked messy!
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
4a921cbdb5
Retain data as much in YMM registers as possible
...
This seems to make it a whole lot quicker
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
ac4d710e23
Unroll 32x32 matrix multiply, use all regs
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
a58608d0b8
Remove totally unnecessary (A * B^T)^T 32x32 multiply
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
043f53539f
Implement a streamlined matrix-multiply 32x32 DCT
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
e9da2d851b
Tidy 32x32 fast DCT's helper functions
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
e382339182
Implement fast (butterfly) 32x32 DCT in AVX2
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
b5962dadac
Tidy indentation in AVX2 16x16 iDCT
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
36a8f89025
Fine-tune 16x16 AVX2 iDCT
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
ca9409de2b
Implement 16x16 DCT as butterfly algorithm in AVX2
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
7c69a26717
Use aligned loads and stores for AVX2 DCT
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
8e9c65dca6
Align DCT matrices and temp transform buffers
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
148a150522
Align DCT source and dest blocks to cache line
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
8e60bbf6a6
Slightly tune 16x16 forward DCT
...
Use an array of __m256i's to store temporary values, essentially letting
the compiler enforce alignment and use aligned loads and stores.
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
c0cc0e8a75
Optimize 16x16 multiply by only slicing right mat once
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
e463d27f22
Implement streamlined generic 16x16 matrix multiply
...
It can't be this fast for real, can it?
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
beb85ce9d6
Reorder parameters for 8x8 matrix multiplies
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
292af62256
Implement tailored 16x16 forward DCT
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
30ce461d98
Redo 4x4 matrix multiplication
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
07970ea82f
Streamline by-the-book 8x8 matrix multiplication
...
Also chop up the forward transform into two tailored multiply functions
2019-10-28 16:19:42 +02:00
Pauli Oikkonen
7ec7ab3361
Implement a tailored AVX2 8x8 DCT
2019-10-28 16:19:42 +02:00
pkubaj
1d7fcf4227
Fix build on powerpc64 with LLVM
2019-09-12 15:05:00 +02:00
Pauli Oikkonen
99597b828a
Work around the ancient Win32 calling convention hassle
...
See if this'll work now
2019-09-06 13:14:42 +03:00
Pauli Oikkonen
c5ca18950c
Revert "Revert to 6924d90052 due to broken visual studio build"
...
This reverts commit 1dd0619bd7.
2019-09-05 18:21:55 +03:00
Pauli Oikkonen
55529decd5
Implement _mm256_insert_epi32 and extract pseudo-ops
...
Visual Studio headers apparently lack these guys
2019-09-05 18:20:52 +03:00
Ari Lemmetti
557bcbc6aa
Make luma or chroma only inter "recon" or predict possible
2019-09-02 17:15:28 +03:00
RLamm
60be6d411c
Intra filtering fixed at least for luma. All intra modes output valid luma (hashes match), but chroma is still broken.
2019-08-30 16:14:00 +03:00
Marko Viitanen
cb0d7c340a
Use the new PDPC filtering in angular intra
2019-08-23 14:44:41 +03:00
Marko Viitanen
5bebb18943
Change intra filtering according to VTM6
2019-08-23 08:56:35 +03:00
Marko Viitanen
a16efe6b52
Merge remote-tracking branch 'remotes/github_kvazaar/master'
...
# Conflicts:
# build/kvazaar_VS2013.sln
# build/kvazaar_VS2015.sln
# build/kvazaar_VS2017.sln
# build/kvazaar_cli/kvazaar_cli.vcxproj
# build/kvazaar_lib/kvazaar_lib.vcxproj
# build/kvazaar_tests/kvazaar_tests.vcxproj
# src/encode_coding_tree.c
# src/encode_coding_tree.h
# src/encoder_state-bitstream.c
# src/inter.c
# src/strategies/avx2/quant-avx2.c
2019-08-22 15:12:01 +03:00
Ari Lemmetti
1dd0619bd7
Revert to 6924d90052 due to broken visual studio build
2019-08-08 15:15:34 +03:00
Pauli Oikkonen
2852baa673
Separate sign3_diff_epu8 from calc_eo_cat
...
Just to keep things simple, clear and obvious
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
a858e7dd4b
Combine duplicate code into inline functions
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
de0e97f711
Take 8/16/24b loads and stores into separate functions
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
10979f58fe
Tidy up code
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
9cc11976c0
Combine the delta accumulation from edge and band ddistortion into shared func
...
This won't reduce object size, but there'll be less duplicate code
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
55d877bd66
Vectorize sao_edge_ddistortion
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
aef0f301d3
Fix function signatures
...
Mark anything intended as read-only to be const, and fix alignment
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
997fd369b3
Redo calc_sao_edge_dir_avx2
...
Do it wider, 32 pixels at once!
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
db1e475e02
Use i32 instead of i8 for x/y offsets
...
Doesn't matter too much, because this number isn't used in SIMD
computation, only as a memory reference offset.
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
12de466ef5
Reimplement non-band SAO color reconstruction in AVX2
...
Streamline things to work on 32 pixels at once instead of 8
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
e8bff99329
Redo the SAO_TYPE_BAND subsection of AVX2 SAO color reconstruction
...
Vectorize it all, hope this helps with perf
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
7b5dffa855
Implement calc_sao_offset_array in AVX2
...
To be efficient, the AVX2 color reconstruction algorithm will need
offsets in byte, not dword, arrays. This is completely specific to 8-bit
pixels and the function signature is fundamentally distinct from the
generic algorithm, so it's better to not strategize SAO offset array
calculation.
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
08881f5e9b
(TEMP) (TODO) (whatever) Avoid compiler warnings
...
I want the CI to not crash on its -Wall -Werror, but instead to actually
build the thing and report actual memory errors etc. to me
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
c18adc5ee0
Redo sao_band_ddistortion_avx2
...
Avoid branching and do the entire thing on 32 pixels at once in YMMs.
Also make the sao_bands function parameter const.
2019-08-07 16:35:24 +03:00
Pauli Oikkonen
1bb9a079a8
Fix indentation
2019-08-07 16:35:24 +03:00
Reima Hyvönen
7bc959c7c5
3 sao functions are now working
2019-08-07 16:35:24 +03:00
Reima Hyvönen
0e0f2d3490
Clear the sum vector after it has been stored to memory
2019-08-07 16:35:24 +03:00
Reima Hyvönen
f146de7acb
removed some variables to prevent memory losses
2019-08-07 16:35:24 +03:00
Reima Hyvönen
247c3a7a71
Converted signed to unsigned int
2019-08-07 16:35:24 +03:00
Reima Hyvönen
ac5c216974
Some more memory error prevention in sao_edge_ddistortion_avx2
2019-08-07 16:35:24 +03:00
Reima Hyvönen
3fb1cbca35
more editing sao_edge_ddistortion_avx2
2019-08-07 16:35:24 +03:00
Reima Hyvönen
afbb6fb960
some more modifications to sao_edge_ddistortion_avx2 to prevent memory failures
2019-08-07 16:35:24 +03:00
Reima Hyvönen
3496a57f7a
Edited sao_edge_ddistortion_avx2 to avoid memory overflow
2019-08-07 16:35:24 +03:00
Reima Hyvönen
267ba1d6ce
Modified sao_band_ddistortion_avx2
2019-08-07 16:35:24 +03:00
Reima Hyvönen
e70663b245
added some sub commands to avoid memory read errors
2019-08-07 16:35:24 +03:00
Reima Hyvönen
59dfb4570c
Converted some loads to load int8_t instead of ints
2019-08-07 16:35:24 +03:00
Reima Hyvönen
8b253209a8
Found false address load from calc_sao_edge_dir. Should now work like generic
2019-08-07 16:35:24 +03:00
Reima Hyvönen
50e0a47b7a
Took away __restrict
2019-08-07 16:35:24 +03:00
Reima Hyvönen
8a39eb674e
Removed c-variable from calc_sao_edge_dir_avx2
2019-08-07 16:35:24 +03:00
Reima Hyvönen
bc0a36830d
Clarified some 6 pixel loads
2019-08-07 16:35:24 +03:00
Reima Hyvönen
1a8b211e05
Added break to line 170
2019-08-07 16:35:24 +03:00
Reima Hyvönen
d05e750ebe
Added some switches to prevent segmentation fault from reading
2019-08-07 16:35:24 +03:00
Reima Hyvönen
203580047d
Defined some AVX functions
2019-08-07 16:35:24 +03:00
Reima Hyvönen
c884c738b1
Updated some commands to match the standard
2019-08-07 16:35:24 +03:00
Reima Hyvönen
b412ed2f59
Removed some setr and used loads calc_sao_edge_dir_avx2
2019-08-07 16:35:24 +03:00
Reima Hyvönen
c6cc063534
converted some hadd operations at calc_sao_edge_dir_avx2 to cast and extract
2019-08-07 16:35:24 +03:00
Reima Hyvönen
47ac109b10
Optimized part of sao_reconstruct_color_avx2 for when sao->type == SAO_TYPE_BAND
2019-08-07 16:35:24 +03:00
Reima Hyvönen
96dc60a1ed
First working optimization
2019-08-07 16:35:24 +03:00
Reima Hyvönen
c148aff9fb
Some optimization done to function sao_reconstruct_color_avx2
2019-08-07 16:35:24 +03:00
Reima Hyvönen
bf16ba6cc4
Remade sao_edge_ddistortion_avx2 and calc_sao_edge_dir_avx2
2019-08-07 16:35:24 +03:00
Reima Hyvönen
79dc39a676
Some editing for sao_edge_ddistortion_avx2
2019-08-07 16:35:24 +03:00
Reima Hyvönen
06ee52924e
some reconst done to calc_sao_edge_dir_avx2
2019-08-07 16:35:24 +03:00
Reima Hyvönen
5fbc65d823
reconst optimization doesn't work yet
2019-08-07 16:35:24 +03:00
Reima Hyvönen
d29f834a69
Remove useless function
2019-08-07 16:35:24 +03:00
Reima Hyvönen
a232a12160
calc_sao_edge_dir_avx2 updated
2019-08-07 16:35:24 +03:00
Reima Hyvönen
b1febc02a5
sao_edge_ddistortion_avx2 now working properly
2019-08-07 16:35:24 +03:00
Reima Hyvönen
cd6092a1ec
Still too many bits, looking for where they appear
2019-08-07 16:35:24 +03:00
Reima Hyvönen
7853be8eeb
Incomplete optimization
2019-08-07 16:35:24 +03:00
Marko Viitanen
dfa5621024
Intrapred cleanup
2019-07-16 14:23:10 +03:00
Pauli Oikkonen
8d48bee180
Tidy fast coeff cost code
2019-07-09 18:01:54 +03:00
Pauli Oikkonen
201a43b08e
Clean up the RD-estimation code
2019-07-09 18:01:54 +03:00
Pauli Oikkonen
b111df5073
Create preliminary version of improved cost estimator
2019-07-09 18:01:54 +03:00
Marko Viitanen
10d850e98a
Use index_offset in intra angular and change the offset to width+1
2019-07-08 14:23:19 +03:00
Marko Viitanen
3d1fa2a9cf
Fixing angular intra prediction reference pixels
2019-07-08 14:00:02 +03:00
Marko Viitanen
0656c54cab
Fix some problems with reference pixels in angular intra prediction kvz_angular_pred_generic()
2019-07-05 15:54:51 +03:00
Marko Viitanen
89ca2d4ba1
Use correct type for modedisp2sampledisp array
2019-07-05 14:12:10 +03:00
Marko Viitanen
c6217e236f
Enable 4-tap filtering for the intra angular
2019-07-04 16:26:10 +03:00
Marko Viitanen
cda6d951c0
Change DCT arrays back to 8-bit -> some frames are now correct
2019-07-04 15:59:10 +03:00
Marko Viitanen
8280bd3217
Add channel info to angular_pred and fix the displacement tables.
...
Also includes 4-tap intra filtering code commented out
2019-07-04 09:35:47 +03:00
Pauli Oikkonen
081d16fc33
Fix intrinsics that may be missing on some systems
...
Create a header to collect all the workarounds for missing intrinsics
in one place
2019-05-23 19:59:40 +03:00
Marko Viitanen
30a8a7b97c
WIP fixing the last significant xy coding
2019-05-07 15:01:02 +03:00
Pauli Oikkonen
87a9208db8
Eliminate cvtsi64_si128 intrinsic
...
Apparently it'll cause Win32 builds to break because it emits the movq
instruction or something..
2019-04-17 16:30:40 +03:00
Pauli Oikkonen
7175d20bb2
Still include stdint.h for non-vector builds
2019-04-15 19:36:01 +03:00
Pauli Oikkonen
1315c7e2b0
Do not compile any vector code for non-SSE4/AVX2 builds
2019-04-15 19:10:48 +03:00
Pauli Oikkonen
f5f70e7bc5
Merge branch 'sad-optimization'
2019-04-15 19:02:01 +03:00
Pauli Oikkonen
6d43759604
Create a border-respecting 32-wide AVX hor_sad
2019-03-07 18:01:22 +02:00
Pauli Oikkonen
f218cecb38
Remove offending hor_sad_avx2_w32 function
...
Consider possibly creating a non-offending AVX2 version instead, the
way hor_sad_sse41_w32 works. Or maybe there's more essential work to
do.
2019-03-05 22:51:41 +02:00
Pauli Oikkonen
df2e6c54fd
4-unroll hor_sad_sse41_arbitrary
...
This may not increase perf though, because it's such a rarely used
function that keeping the icache footprint small may be more essential...
2019-03-05 22:45:23 +02:00
Pauli Oikkonen
448eacba7b
Avoid overreading block borders in hor_sad_sse41_arbitrary
2019-03-05 22:34:50 +02:00
Pauli Oikkonen
41f51c08c4
Avoid overrunning buffer in hor_sad_sse41_w32
2019-03-01 15:37:38 +02:00
Pauli Oikkonen
bcd9879359
Include quant coeff range check in non-scaling list execution path too
2019-02-27 17:26:44 +02:00
Pauli Oikkonen
24e6363f64
Remove the kvz_quant_avx2 wrapper function
2019-02-27 16:32:58 +02:00
Pauli Oikkonen
748820f3c5
Eliminate unnecessary loading of coeffs if scaling lists are off
2019-02-27 16:26:35 +02:00
Pauli Oikkonen
5994350f40
Allow quant_flat_avx2 to be used with scaling lists on
2019-02-27 16:25:59 +02:00
Pauli Oikkonen
9b0e079262
Use SSE instructions for 64-bit SADs instead of MMX
...
VC++ seems to choke on MMX instructions
2019-02-18 20:13:33 +02:00
Pauli Oikkonen
d8b8923028
Add LGPL notices to reg_sad headers
2019-02-18 17:52:47 +02:00
Pauli Oikkonen
770db825b9
Create hor_sad_w8 and w4 epol mask the way w16 works
2019-02-06 19:34:26 +02:00
Pauli Oikkonen
aa19bcac8a
Avoid branching in creating shuffle mask in hor_sad_w16
2019-02-06 18:58:46 +02:00
Pauli Oikkonen
2d05ca8520
Remove width from constant-width hor_sad func params
...
They should kinda know it already
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
57db234d95
Move 32-wide SSE4.1 hor_sad to picture-sse41.c
...
It's not used by picture-avx2.c that also includes the header, so
it should not be in the header
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
dd7d989a39
Implement 32-wide hor_sad on AVX2
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
ff70c8a5ec
Utilize horizontal SAD functions for SSE4.1 as well
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
f5ff4db01f
4-wide hor_sad border agnostic
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
35e7f9a700
Fix hor_sad w8 to work with both borders
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
836783dd6e
Use hor_sad_w32 for both left and right borders
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
69687c8d24
Modify hor_sad_sse41_w16 to work over left and right borders
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
1e0eb1af30
Add generic strategy for hor_sad'ing a non-split width block
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
686fb2c957
Unroll arbitrary-width SSE4.1 hor_sad by 4
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
768203a2de
First version of arbitrary-width SSE4.1 hor_sad
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
ccf683b9b6
Start work on left and right border aware hor_sad
...
Comes with 4, 8, 16 and 32 pixel wide implementations now, at some point
investigate if this can start to thrash icache
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
c36482a11a
Fix bug in 24-wide SAD
...
*facepalm*
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
f781dc31f0
Create strategy for ver_sad
...
Easy to vectorize
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
9db0a1bcda
Create get_optimized_sad func for SSE4.1
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
91380729b1
Add generic get_optimized_sad implementation
...
NOTE: To force generic SAD implementation on devices supporting
vectorized variants, you now have to override both get_optimized_sad
and reg_sad to generic (only overriding get_optimized_sad on AVX2
hardware would just run all SAD blocks through reg_sad_avx2). Let's
see if there's a more sensible way to do it, but it's not trivial.
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
45f36645a6
Move choosing of tailored SAD function higher up the calling chain
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
91cb0fbd45
Create strategy for directly obtaining pointer to constant-width SAD function
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
94035be342
Unify unrolling naming conventions
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
517a4338f6
Unroll SSE SAD for 8-wide blocks to process 4 lines at once
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
0f665b28f6
Unroll arbitrary width SSE4.1 SAD by 4
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
cbca3347b5
Unroll 64-wide AVX2 SAD by 2
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
84cf771dea
Unroll 32 and 16 wide SAD vector implementations by 4
2019-02-04 20:41:40 +02:00