hashirama/uvg266

mirror of https://github.com/ultravideo/uvg266.git synced 2024-12-04 13:54:05 +00:00

Author	SHA1	Message	Date
Ari Lemmetti	3c7dd0752f	Remove the broken "no mov" branch. Causes hash mismatches for example in SlideShow sequence.	2020-02-03 15:26:31 +02:00
RLamm	30d5df40c5	Custom headers for the distributed coding	2020-01-29 15:54:49 +02:00
Pauli Oikkonen	c3d9e97e9f	Fix VS build	2019-12-12 18:34:55 +02:00
Pauli Oikkonen	7f238ca299	Remove debug print functions. Whoops	2019-12-12 18:19:31 +02:00
Pauli Oikkonen	eefb5e50b3	De-inline pred_filtered_dc functions, shouldn't make much difference though	2019-12-12 17:30:00 +02:00
Pauli Oikkonen	169314de4f	32x32 filtered DC prediction in AVX2	2019-12-11 18:17:06 +02:00
Pauli Oikkonen	fb2481b7e4	16x16 filtered DC implemented in AVX2	2019-12-10 15:54:50 +02:00
Pauli Oikkonen	da370ea36d	Implement AVX2 8x8 filtered DC algorithm	2019-11-28 14:10:10 +02:00
Pauli Oikkonen	5d9b7019ca	Implement a 4x4 filtered DC pred function	2019-11-26 17:05:54 +02:00
Pauli Oikkonen	f1485ab087	Start doing an arbitrary size filtered DC pred - maybe easier to just create separate functions for fixed block sizes?	2019-11-25 15:20:29 +02:00
Pauli Oikkonen	fa4bb86406	Optimize intra_pred_planar_avx2 for 4x4 blocks	2019-11-19 13:39:02 +02:00
Pauli Oikkonen	4761d228f9	Start to vectorize the 4x4 loop	2019-11-15 17:32:40 +02:00
Pauli Oikkonen	8d45ab4951	Stupidify the 4x4 planar loop for vectorization	2019-11-14 17:14:04 +02:00
Pauli Oikkonen	6d7a4f555c	Also remove 16x16 (A * B^T)^T matrix multiply. Can be done using (B * A^T) instead, it's the exact same	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	2c2deb2366	Tidy AVX2 32x32 matrix multiply	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	98ad78b333	Tidy the old AVX2 32x32 matrix multiply. It was actually a very good algorithm, just looked messy!	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	4a921cbdb5	Retain data as much in YMM registers as possible. This seems to make it a whole lot quicker	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	ac4d710e23	Unroll 32x32 matrix multiply, use all regs	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	a58608d0b8	Remove totally unnecessary (A * B^T)^T 32x32 multiply	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	043f53539f	Implement a streamlined matrix-multiply 32x32 DCT	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	e9da2d851b	Tidy 32x32 fast DCT's helper functions	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	e382339182	Implement fast (butterfly) 32x32 DCT in AVX2	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	b5962dadac	Tidy indentation in AVX2 16x16 iDCT	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	36a8f89025	Fine-tune 16x16 AVX2 iDCT	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	ca9409de2b	Implement 16x16 DCT as butterfly algorithm in AVX2	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	7c69a26717	Use aligned loads and stores for AVX2 DCT	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	8e9c65dca6	Align DCT matrices and temp transform buffers	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	148a150522	Align DCT source and dest blocks to cache line	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	8e60bbf6a6	Slightly tune 16x16 forward DCT. Use an array of __m256i's to store temporary value, essentially letting the compiler enforce alignment and use aligned loads and stores.	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	c0cc0e8a75	Optimize 16x16 multiply by only slicing right mat once	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	e463d27f22	Implement streamlined generic 16x16 matrix multiply. It can't be this fast for real, can it?	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	beb85ce9d6	Reorder parameters for 8x8 matrix multiplies	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	292af62256	Implement tailored 16x16 forward DCT	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	30ce461d98	Redo 4x4 matrix multiplication	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	07970ea82f	Streamline by-the-book 8x8 matrix multiplication. Also chop up the forward transform into two tailored multiply functions	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	7ec7ab3361	Implement a tailored AVX2 8x8 DCT	2019-10-28 16:19:42 +02:00
Pauli Oikkonen	99597b828a	Work around the ancient Win32 calling convention hassle. See if this'll work now	2019-09-06 13:14:42 +03:00
Pauli Oikkonen	c5ca18950c	Revert "Revert to `6924d90052` due to broken visual studio build". This reverts commit `1dd0619bd7`.	2019-09-05 18:21:55 +03:00
Pauli Oikkonen	55529decd5	Implement _mm256_insert_epi32 and extract pseudo-ops. Visual Studio headers apparently lack these guys	2019-09-05 18:20:52 +03:00
Ari Lemmetti	557bcbc6aa	Make luma or chroma only inter "recon" or predict possible	2019-09-02 17:15:28 +03:00
Ari Lemmetti	1dd0619bd7	Revert to `6924d90052` due to broken visual studio build	2019-08-08 15:15:34 +03:00
Pauli Oikkonen	2852baa673	Separate sign3_diff_epu8 from calc_eo_cat. Just to keep things simple, clear and obvious	2019-08-07 16:35:24 +03:00
Pauli Oikkonen	a858e7dd4b	Combine duplicate code into inline functions	2019-08-07 16:35:24 +03:00
Pauli Oikkonen	de0e97f711	Take 8/16/24b loads and stores into separate functions	2019-08-07 16:35:24 +03:00
Pauli Oikkonen	10979f58fe	Tidy up code	2019-08-07 16:35:24 +03:00
Pauli Oikkonen	9cc11976c0	Combine the delta accumulation from edge and band ddistortion into shared func. This won't reduce object size, but there'll be less duplicate code	2019-08-07 16:35:24 +03:00
Pauli Oikkonen	55d877bd66	Vectorize sao_edge_ddistortion	2019-08-07 16:35:24 +03:00
Pauli Oikkonen	aef0f301d3	Fix function signatures. Mark anything intended as read-only to be const, and fix alignment	2019-08-07 16:35:24 +03:00
Pauli Oikkonen	997fd369b3	Redo calc_sao_edge_dir_avx2. Do it wider, 32 pixels at once!	2019-08-07 16:35:24 +03:00
Pauli Oikkonen	db1e475e02	Use i32 instead of i8 for x/y offsets. Doesn't matter too much, because this number isn't used in SIMD computation, only as a memory reference offset.	2019-08-07 16:35:24 +03:00
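
Note on commits 6d7a4f555c and a58608d0b8: the statement that the (A * B^T)^T multiply can be replaced by (B * A^T) follows from the transpose-of-a-product rule. A short derivation, where A and B stand for any two square matrices of the same size (the commit messages do not say which operand holds the transform coefficients, so no roles are assumed here):

```latex
\[
(A B^{\mathsf{T}})^{\mathsf{T}}
  = (B^{\mathsf{T}})^{\mathsf{T}} A^{\mathsf{T}}
  = B A^{\mathsf{T}}
\]
```

Under that reading, a separate routine that multiplies and then transposes the result is redundant wherever a routine computing a product with a transposed right operand already exists: calling it with the operands swapped gives the same matrix.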


328 commits