Ari Lemmetti
|
3c7dd0752f
|
Remove the broken "no mov" branch.
Causes hash mismatches, for example in the SlideShow sequence.
|
2020-02-03 15:26:31 +02:00 |
|
RLamm
|
30d5df40c5
|
Custom headers for the distributed coding
|
2020-01-29 15:54:49 +02:00 |
|
Pauli Oikkonen
|
c3d9e97e9f
|
Fix VS build
|
2019-12-12 18:34:55 +02:00 |
|
Pauli Oikkonen
|
7f238ca299
|
Remove debug print functions
Whoops
|
2019-12-12 18:19:31 +02:00 |
|
Pauli Oikkonen
|
eefb5e50b3
|
De-inline pred_filtered_dc functions, shouldn't make much difference though
|
2019-12-12 17:30:00 +02:00 |
|
Pauli Oikkonen
|
169314de4f
|
32x32 filtered DC prediction in AVX2
|
2019-12-11 18:17:06 +02:00 |
|
Pauli Oikkonen
|
fb2481b7e4
|
16x16 filtered DC implemented in AVX2
|
2019-12-10 15:54:50 +02:00 |
|
Pauli Oikkonen
|
da370ea36d
|
Implement AVX2 8x8 filtered DC algorithm
|
2019-11-28 14:10:10 +02:00 |
|
Pauli Oikkonen
|
5d9b7019ca
|
Implement a 4x4 filtered DC pred function
|
2019-11-26 17:05:54 +02:00 |
|
Pauli Oikkonen
|
f1485ab087
|
Start doing an arbitrary size filtered DC pred - maybe easier to just create separate functions for fixed block sizes?
|
2019-11-25 15:20:29 +02:00 |
|
Pauli Oikkonen
|
fa4bb86406
|
Optimize intra_pred_planar_avx2 for 4x4 blocks
|
2019-11-19 13:39:02 +02:00 |
|
Pauli Oikkonen
|
4761d228f9
|
Start to vectorize the 4x4 loop
|
2019-11-15 17:32:40 +02:00 |
|
Pauli Oikkonen
|
8d45ab4951
|
Stupidify the 4x4 planar loop for vectorization
|
2019-11-14 17:14:04 +02:00 |
|
Pauli Oikkonen
|
6d7a4f555c
|
Also remove 16x16 (A * B^T)^T matrix multiply
Can be done using (B * A^T) instead, it's the exact same
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
2c2deb2366
|
Tidy AVX2 32x32 matrix multiply
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
98ad78b333
|
Tidy the old AVX2 32x32 matrix multiply
It was actually a very good algorithm, just looked messy!
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
4a921cbdb5
|
Retain data in YMM registers as much as possible
This seems to make it a whole lot quicker
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
ac4d710e23
|
Unroll 32x32 matrix multiply, use all regs
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
a58608d0b8
|
Remove totally unnecessary (A * B^T)^T 32x32 multiply
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
043f53539f
|
Implement a streamlined matrix-multiply 32x32 DCT
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
e9da2d851b
|
Tidy 32x32 fast DCT's helper functions
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
e382339182
|
Implement fast (butterfly) 32x32 DCT in AVX2
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
b5962dadac
|
Tidy indentation in AVX2 16x16 iDCT
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
36a8f89025
|
Fine-tune 16x16 AVX2 iDCT
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
ca9409de2b
|
Implement 16x16 DCT as butterfly algorithm in AVX2
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
7c69a26717
|
Use aligned loads and stores for AVX2 DCT
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
8e9c65dca6
|
Align DCT matrices and temp transform buffers
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
148a150522
|
Align DCT source and dest blocks to cache line
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
8e60bbf6a6
|
Slightly tune 16x16 forward DCT
Use an array of __m256i's to store temporary values, essentially letting
the compiler enforce alignment and use aligned loads and stores.
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
c0cc0e8a75
|
Optimize 16x16 multiply by only slicing right mat once
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
e463d27f22
|
Implement streamlined generic 16x16 matrix multiply
It can't be this fast for real, can it?
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
beb85ce9d6
|
Reorder parameters for 8x8 matrix multiplies
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
292af62256
|
Implement tailored 16x16 forward DCT
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
30ce461d98
|
Redo 4x4 matrix multiplication
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
07970ea82f
|
Streamline by-the-book 8x8 matrix multiplication
Also chop up the forward transform into two tailored multiply functions
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
7ec7ab3361
|
Implement a tailored AVX2 8x8 DCT
|
2019-10-28 16:19:42 +02:00 |
|
Pauli Oikkonen
|
99597b828a
|
Work around the ancient Win32 calling convention hassle
See if this'll work now
|
2019-09-06 13:14:42 +03:00 |
|
Pauli Oikkonen
|
c5ca18950c
|
Revert "Revert to 6924d90052 due to broken visual studio build"
This reverts commit 1dd0619bd7.
|
2019-09-05 18:21:55 +03:00 |
|
Pauli Oikkonen
|
55529decd5
|
Implement _mm256_insert_epi32 and extract pseudo-ops
Visual Studio headers apparently lack these guys
|
2019-09-05 18:20:52 +03:00 |
|
Ari Lemmetti
|
557bcbc6aa
|
Make luma or chroma only inter "recon" or predict possible
|
2019-09-02 17:15:28 +03:00 |
|
Ari Lemmetti
|
1dd0619bd7
|
Revert to 6924d90052 due to broken visual studio build
|
2019-08-08 15:15:34 +03:00 |
|
Pauli Oikkonen
|
2852baa673
|
Separate sign3_diff_epu8 from calc_eo_cat
Just to keep things simple, clear and obvious
|
2019-08-07 16:35:24 +03:00 |
|
Pauli Oikkonen
|
a858e7dd4b
|
Combine duplicate code into inline functions
|
2019-08-07 16:35:24 +03:00 |
|
Pauli Oikkonen
|
de0e97f711
|
Take 8/16/24b loads and stores into separate functions
|
2019-08-07 16:35:24 +03:00 |
|
Pauli Oikkonen
|
10979f58fe
|
Tidy up code
|
2019-08-07 16:35:24 +03:00 |
|
Pauli Oikkonen
|
9cc11976c0
|
Combine the delta accumulation from edge and band ddistortion into shared func
This won't reduce object size, but there'll be less duplicate code
|
2019-08-07 16:35:24 +03:00 |
|
Pauli Oikkonen
|
55d877bd66
|
Vectorize sao_edge_ddistortion
|
2019-08-07 16:35:24 +03:00 |
|
Pauli Oikkonen
|
aef0f301d3
|
Fix function signatures
Mark anything intended as read-only to be const, and fix alignment
|
2019-08-07 16:35:24 +03:00 |
|
Pauli Oikkonen
|
997fd369b3
|
Redo calc_sao_edge_dir_avx2
Do it wider, 32 pixels at once!
|
2019-08-07 16:35:24 +03:00 |
|
Pauli Oikkonen
|
db1e475e02
|
Use i32 instead of i8 for x/y offsets
Doesn't matter too much, because this number isn't used in SIMD
computation, only as a memory reference offset.
|
2019-08-07 16:35:24 +03:00 |
|