Commit graph

491 commits

Author SHA1 Message Date
Pauli Oikkonen 770db825b9 Create hor_sad_w8 and w4 epol mask the way w16 works 2019-02-06 19:34:26 +02:00
Pauli Oikkonen aa19bcac8a Avoid branching in creating shuffle mask in hor_sad_w16 2019-02-06 18:58:46 +02:00
Pauli Oikkonen 2d05ca8520 Remove width from constant-width hor_sad func params
They should kinda know it already
2019-02-04 20:41:40 +02:00
Pauli Oikkonen 57db234d95 Move 32-wide SSE4.1 hor_sad to picture-sse41.c
It's not used by picture-avx2.c that also includes the header, so
it should not be in the header
2019-02-04 20:41:40 +02:00
Pauli Oikkonen dd7d989a39 Implement 32-wide hor_sad on AVX2 2019-02-04 20:41:40 +02:00
Pauli Oikkonen ff70c8a5ec Utilize horizontal SAD functions for SSE4.1 as well 2019-02-04 20:41:40 +02:00
Pauli Oikkonen f5ff4db01f 4-wide hor_sad border agnostic 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 35e7f9a700 Fix hor_sad w8 to work with both borders 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 836783dd6e Use hor_sad_w32 for both left and right borders 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 69687c8d24 Modify hor_sad_sse41_w16 to work over left and right borders 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 1e0eb1af30 Add generic strategy for hor_sad'ing a non-split-width block 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 686fb2c957 Unroll arbitrary-width SSE4.1 hor_sad by 4 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 768203a2de First version of arbitrary-width SSE4.1 hor_sad 2019-02-04 20:41:40 +02:00
Pauli Oikkonen ccf683b9b6 Start work on left and right border aware hor_sad
Comes with 4, 8, 16 and 32 pixel wide implementations now, at some point
investigate if this can start to thrash icache
2019-02-04 20:41:40 +02:00
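
As a rough illustration of the border-aware idea (a sketch only; the function name, signature and exact clamping rule are assumptions based on these commit messages, not the actual kvazaar code), a horizontal SAD that tolerates the reference block overhanging the left or right picture border can clamp the reference x coordinate to the row's edge pixel:

```
#include <stdint.h>
#include <stdlib.h>

// cur:     top-left of the current block (cur_stride bytes between rows)
// ref_row: start of the reference-picture row aligned with the block's top row
// ref_x:   x position of the block in the reference picture; it may be
//          negative or reach past pic_width, in which case the out-of-bounds
//          columns are clamped to the edge pixel of the row.
static uint32_t hor_sad_sketch(const uint8_t *cur, int cur_stride,
                               const uint8_t *ref_row, int ref_stride,
                               int width, int height,
                               int ref_x, int pic_width)
{
  uint32_t sad = 0;
  for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
      int rx = ref_x + x;
      if (rx < 0)          rx = 0;              // over the left border
      if (rx >= pic_width) rx = pic_width - 1;  // over the right border
      sad += abs((int)cur[x] - (int)ref_row[rx]);
    }
    cur     += cur_stride;
    ref_row += ref_stride;
  }
  return sad;
}
```

The vectorized w4/w8/w16/w32 variants above replace the per-pixel clamp with a shuffle ("epol") mask, presumably replicating the edge byte across the out-of-bounds lanes.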
Pauli Oikkonen c36482a11a Fix bug in 24-wide SAD
*facepalm*
2019-02-04 20:41:40 +02:00
Pauli Oikkonen f781dc31f0 Create strategy for ver_sad
Easy to vectorize
2019-02-04 20:41:40 +02:00
Pauli Oikkonen 9db0a1bcda Create get_optimized_sad func for SSE4.1 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 91380729b1 Add generic get_optimized_sad implementation
NOTE: To force generic SAD implementation on devices supporting
vectorized variants, you now have to override both get_optimized_sad
and reg_sad to generic (only overriding get_optimized_sad on AVX2
hardware would just run all SAD blocks through reg_sad_avx2). Let's
see if there's a more sensible way to do it, but it's not trivial.
2019-02-04 20:41:40 +02:00
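
The note above amounts to a two-level dispatch: get_optimized_sad may hand out a constant-width kernel, and reg_sad remains the catch-all. A minimal sketch of such a dispatcher, with hypothetical kernel names (the real strategy interface may differ):

```
#include <stdint.h>

typedef uint32_t (*const_width_sad_fn)(const uint8_t *cur, const uint8_t *ref,
                                        int height, uint32_t cur_stride,
                                        uint32_t ref_stride);

// Hypothetical constant-width kernels; in the real code these would be the
// SSE4.1/AVX2 implementations registered for the current build.
uint32_t sad_w8 (const uint8_t *, const uint8_t *, int, uint32_t, uint32_t);
uint32_t sad_w16(const uint8_t *, const uint8_t *, int, uint32_t, uint32_t);
uint32_t sad_w32(const uint8_t *, const uint8_t *, int, uint32_t, uint32_t);

// Return a tailored SAD kernel for the given block width, or NULL so the
// caller falls back to the general reg_sad implementation.
static const_width_sad_fn get_optimized_sad_sketch(int width)
{
  switch (width) {
    case 8:  return sad_w8;
    case 16: return sad_w16;
    case 32: return sad_w32;
    default: return NULL;  // no tailored kernel for this width
  }
}
```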
Pauli Oikkonen 45f36645a6 Move choosing of tailored SAD function higher up the calling chain 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 91cb0fbd45 Create strategy for directly obtaining pointer to constant-width SAD function 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 94035be342 Unify unrolling naming conventions 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 517a4338f6 Unroll SSE SAD for 8-wide blocks to process 4 lines at once 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 0f665b28f6 Unroll arbitrary width SSE4.1 SAD by 4 2019-02-04 20:41:40 +02:00
Pauli Oikkonen cbca3347b5 Unroll 64-wide AVX2 SAD by 2 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 84cf771dea Unroll 32 and 16 wide SAD vector implementations by 4 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 5df5c5f8a4 Cast all pointers to const types in vector SAD funcs
Also tidy up the pointer arithmetic
2019-02-04 20:41:40 +02:00
Pauli Oikkonen a711ce3df5 Inline fixed width vectorized SAD functions 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 6504145cce Remove 16-pixel wide AVX2 SAD implementation
At least on Skylake, it's noticeably slower than the very simple
version using SSE4.1
2019-02-04 20:41:40 +02:00
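
For comparison, the "very simple version using SSE4.1" for a 16-pixel-wide block can be little more than one psadbw per row; a sketch of that shape (not the exact kvazaar kernel):

```
#include <immintrin.h>
#include <stdint.h>

// 16-wide SAD: one 128-bit load per buffer per row and a single _mm_sad_epu8,
// which sums the absolute byte differences into two 64-bit halves.
static uint32_t sad_w16_sse_sketch(const uint8_t *cur, const uint8_t *ref,
                                   int height, uint32_t cur_stride,
                                   uint32_t ref_stride)
{
  __m128i acc = _mm_setzero_si128();
  for (int y = 0; y < height; y++) {
    __m128i c = _mm_loadu_si128((const __m128i *)(cur + y * cur_stride));
    __m128i r = _mm_loadu_si128((const __m128i *)(ref + y * ref_stride));
    acc = _mm_add_epi64(acc, _mm_sad_epu8(c, r));
  }
  acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));  // fold the two halves
  return (uint32_t)_mm_cvtsi128_si32(acc);
}
```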
Pauli Oikkonen 4cb371184b Add SSE4.1 strategy for 24px wide SAD and an AVX2 strategy for 16 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 796568d9cc Add SSE4.1 strategies for SAD on widths 4 and 12 and AVX2 strategies for 32 and 64 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 4d45d828fa Use constant-width SSE4.1 SAD funcs for AVX2 2019-02-04 20:41:40 +02:00
Pauli Oikkonen 2eaa7bc9d2 Move SSE4.1 SAD functions to separate header 2019-02-04 20:41:40 +02:00
Pauli Oikkonen d2db0086e1 Create constant width SAD versions for 8 and 16 pixels 2019-02-04 20:41:40 +02:00
Pauli Oikkonen a13fc51003 Include a blank AVX2 strategy registration function even in non-AVX2 builds 2019-02-04 19:52:24 +02:00
Pauli Oikkonen d55414db66 Only build AVX2 coeff encoding when supported
..whoops
2019-02-04 19:34:30 +02:00
Pauli Oikkonen 3fe2f29456 Merge branch 'encode-coeffs-avx2' 2019-02-04 18:52:31 +02:00
Pauli Oikkonen 722b738888 Fix more naming issues 2019-02-04 16:05:43 +02:00
Pauli Oikkonen e26d98fb75 Rename a couple variables and add crucial comments 2019-02-04 15:57:07 +02:00
Pauli Oikkonen f186455619 Move encode_last_significant_xy out of strategy modules
It's the exact same in both AVX2 and generic, and does not seem to
be worth even trying to vectorize
2019-02-04 14:55:41 +02:00
Pauli Oikkonen 3f7340c932 Fine-tune pack_16x16b_to_16x2b
Avoid mm_set1 operation when it's possible to create the constant with
one bit-shift operation from another instead. Thanks Intel for
3-operand instruction encoding!
2019-02-04 14:44:47 +02:00
Pauli Oikkonen 314f5b0e1f Rename 16x2b cmpgt function, comment it better, optimize it slightly
Eliminate an unnecessary bit-masking operation to make it even more messy
2019-02-04 14:44:32 +02:00
Pauli Oikkonen d8ff6a6459 Fix _andn_u32 to work on old Visual Studio 2019-02-01 15:34:42 +02:00
Pauli Oikkonen 26e1b2c783 Use (u)int32_t instead of (unsigned) int in reg_sad_sse41 2019-01-10 14:37:04 +02:00
Pauli Oikkonen 3a1f2eb752 Prefer SSE4.1 implementation of SAD over AVX2
It seems that the 128-bit wide version consistently outperforms the
256-bit one
2019-01-10 13:48:55 +02:00
Pauli Oikkonen 9b24d81c6a Use SSE instead of AVX for small widths
Highly dubious if this will help performance at all
2019-01-07 20:12:13 +02:00
Pauli Oikkonen b2176bf72a Optimize SSE4.1 version of SAD
Make it use the same vblend trick as AVX2. Interestingly, on my test
setup this seems to be faster than the same code using 256-bit AVX
vectors.
2019-01-07 19:40:57 +02:00
Pauli Oikkonen 887d7700a8 Modify AVX2 SAD to mask data by byte granularity in AVX registers
Avoids using any SAD calculations narrower than 256 bits, and
simplifies the code. Also improves execution speed
2019-01-07 18:53:15 +02:00
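
The byte-granularity masking described above boils down to blending away the lanes outside the block before a full-width vpsadbw. A sketch of the trick for one row of a block narrower than 32 bytes (the mask construction and names are assumptions, and it presumes reading 32 bytes past the block is safe, as discussed elsewhere in this log):

```
#include <immintrin.h>
#include <stdint.h>

// SAD of one row that is 'width' (< 32) bytes wide, using a full 256-bit
// vpsadbw: reference bytes past 'width' are replaced by the corresponding
// current-block bytes so their difference is zero.
static uint32_t masked_row_sad_sketch(const uint8_t *cur, const uint8_t *ref,
                                      int width)
{
  // Byte mask: 0xff for lanes < width, 0x00 for the rest.
  const __m256i lane_ids = _mm256_setr_epi8( 0,  1,  2,  3,  4,  5,  6,  7,
                                             8,  9, 10, 11, 12, 13, 14, 15,
                                            16, 17, 18, 19, 20, 21, 22, 23,
                                            24, 25, 26, 27, 28, 29, 30, 31);
  const __m256i mask = _mm256_cmpgt_epi8(_mm256_set1_epi8((char)width), lane_ids);

  __m256i c = _mm256_loadu_si256((const __m256i *)cur);
  __m256i r = _mm256_loadu_si256((const __m256i *)ref);
  __m256i r_masked = _mm256_blendv_epi8(c, r, mask);  // outside block: take cur

  __m256i sad = _mm256_sad_epu8(c, r_masked);
  __m128i s = _mm_add_epi64(_mm256_castsi256_si128(sad),
                            _mm256_extracti128_si256(sad, 1));
  s = _mm_add_epi64(s, _mm_srli_si128(s, 8));
  return (uint32_t)_mm_cvtsi128_si32(s);
}
```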
Pauli Oikkonen 7585f79a71 AVX2-ize SAD calculation
Performance is no better than SSE though
2019-01-07 16:26:24 +02:00
Pauli Oikkonen ab3dc58df6 Copy SAD SSE4.1 impl to AVX2 2019-01-03 18:31:57 +02:00
Pauli Oikkonen 45ac6e6d03 Tidy pack_16x16b_to_16x2b comments 2019-01-03 16:37:05 +02:00
Pauli Oikkonen 016eb014ad Move packing 16x16b -> 16x2b into separate function 2018-12-20 10:51:44 +02:00
Ari Lemmetti b234897e8a Fix SMP and AMP blocks in FME and revert previous change.
Filter 8x8 (sub)blocks even with 8x4, 4x8, 16x4, 4x16 etc.
Calculate SATD on the 8x4, ... part
2018-12-19 21:30:53 +02:00
Pauli Oikkonen 9aaa6f260d Fixes to enable portability 2018-12-18 20:42:09 +02:00
Pauli Oikkonen 2fdbbe9730 Move CG reordering code from quant-avx2 to shared header 2018-12-18 19:42:18 +02:00
Pauli Oikkonen d02207306d Create a header file for shared AVX2 code 2018-12-18 19:41:09 +02:00
Pauli Oikkonen 361bf0c7db Precompute >=2 coeff encoding loop with 2-bit arithmetic
Who needs 16x16b vectors when you can do practically the same with
16x2b pseudovectors in 32-bit general purpose registers!
2018-12-18 19:41:09 +02:00
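
The "16x2b pseudovector" idea packs sixteen 2-bit lanes into a single 32-bit register so that lane-parallel tests become plain integer bit operations. A small illustration (the layout and helper names are made up for this sketch, not taken from the kvazaar sources):

```
#include <stdint.h>

// Pack sixteen 2-bit values (each 0..3) into one 32-bit word;
// lane i occupies bits [2*i+1 : 2*i].
static uint32_t pack_16x2b(const uint8_t v[16])
{
  uint32_t packed = 0;
  for (int i = 0; i < 16; i++)
    packed |= (uint32_t)(v[i] & 0x3u) << (2 * i);
  return packed;
}

// Lane-parallel tests as plain integer ops: a lane is >= 2 exactly when its
// high bit is set, and nonzero when either of its two bits is set.
static uint32_t lanes_ge2(uint32_t packed)
{
  return packed & 0xAAAAAAAAu;
}

static uint32_t lanes_nonzero(uint32_t packed)
{
  return (packed | (packed >> 1)) & 0x55555555u;
}
```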
Pauli Oikkonen f66cb23d5b Optimize greater1 encoding loop
Calculating the c1 variable need not be a serial operation!
2018-12-18 19:41:09 +02:00
Pauli Oikkonen 8c8b791c35 Vectorize kvz_context_get_sig_ctx_inc 2018-12-18 19:41:09 +02:00
Pauli Oikkonen 033261eb74 Eliminate two branches using bit magic 2018-12-18 19:41:09 +02:00
Pauli Oikkonen c4434e8d04 Scan CG's in forward order to simplify finding last significant 2018-12-18 19:41:09 +02:00
Pauli Oikkonen efd097f5a5 Vectorize the coeff group loop to some extent 2018-12-18 19:41:09 +02:00
Pauli Oikkonen a01362e638 Use the efficient method of reordering raster->scan 2018-12-18 19:41:09 +02:00
Pauli Oikkonen 50a888e789 Use the efficient method to find first and last nz coeffs in block 2018-12-18 19:41:09 +02:00
Pauli Oikkonen 7e9203f566 Scan coeff groups in scan order to help find last significant one 2018-12-18 19:41:09 +02:00
Pauli Oikkonen 9a5a6fdbc7 Simplify two ifs in encode_coeff_nxn-avx2 2018-12-18 19:41:09 +02:00
Pauli Oikkonen 37a2a8bac8 See if loop can be optimized by rearranging 2018-12-18 19:41:09 +02:00
Pauli Oikkonen 584f2f74b6 Vectorize significant coeff group scanning loop 2018-12-18 19:41:09 +02:00
Pauli Oikkonen 1bfed73221 Add AVX2 strategy for encode_coding_tree 2018-12-18 19:41:09 +02:00
Pauli Oikkonen c3a6f3112a Add generic strategy group for encode_coding_tree 2018-12-18 19:41:09 +02:00
Sergei Trofimovich 68a70e45a1 x86 asm: mark stack as non-executable
Gentoo's `scanelf` QA tool detects the writable/executable stack
of assembly-written files as:

```
$ scanelf -qRa  .
 0644 LE !WX --- ---     ./src/strategies/x86_asm/.libs/picture-x86-asm-sad.o
 0644 LE !WX --- ---     ./src/strategies/x86_asm/.libs/picture-x86-asm-satd.o
 0644 LE !WX --- ---     ./src/strategies/x86_asm/picture-x86-asm-sad.o
 0644 LE !WX --- ---     ./src/strategies/x86_asm/picture-x86-asm-satd.o
```

Normally the C compiler emits a non-executable stack marking (or the GNU
assembler does via `-Wa,--noexecstack`).

The change adds a non-executable stack marking for yasm-based assembly files.

https://wiki.gentoo.org/wiki/Hardened/GNU_stack_quickstart has more details.

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
2018-12-16 11:31:56 +00:00
Reima Hyvönen 1fcc5c6a8d Merge branch 'bipred_recon' 2018-12-11 09:59:35 +02:00
Reima Hyvönen e4a10880f3 Added case 12 to bipred_recon no mov 2018-12-11 09:52:17 +02:00
Marko Viitanen a4f3968e52 Fix Visual Studio errors by initializing some variables used in AVX2 signhiding 2018-12-11 09:33:26 +02:00
Pauli Oikkonen c465578048 Add a descriptive comment to coefficient reordering 2018-12-03 15:36:32 +02:00
Pauli Oikkonen f78bf2ebcb Optimize q_coefs usage for indexed fetch 2018-12-03 15:36:32 +02:00
Pauli Oikkonen d9591f1b49 Eliminate midway buffering of reordered coefs
TODO: For some mysterious reason seems slightly slower than the
buffered one
2018-12-03 15:36:32 +02:00
Pauli Oikkonen 7fe454c51f Optimize get_cheapest_alternative() 2018-12-03 15:36:32 +02:00
Pauli Oikkonen 6bbd3e5a44 Optimize rearrange_512 function 2018-12-03 15:36:32 +02:00
Pauli Oikkonen cb8209d1b3 Vectorize transform coefficient reordering loop 2018-12-03 15:36:32 +02:00
Pauli Oikkonen 7cf4c7ae5f Rename "reduce" functions to hsum
That's what the functions fundamentally do anyway
2018-12-03 15:36:32 +02:00
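
Here "hsum" presumably stands for horizontal sum, i.e. folding all lanes of a vector into a single scalar. A typical AVX2 shape for 32-bit lanes (a sketch, not the renamed kvazaar functions themselves):

```
#include <immintrin.h>
#include <stdint.h>

// Horizontal sum of the eight 32-bit lanes of an AVX2 vector.
static uint32_t hsum_8x32_sketch(__m256i v)
{
  __m128i s = _mm_add_epi32(_mm256_castsi256_si128(v),
                            _mm256_extracti128_si256(v, 1));            // 8 -> 4 lanes
  s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));  // 4 -> 2
  s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));  // 2 -> 1
  return (uint32_t)_mm_cvtsi128_si32(s);
}
```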
Pauli Oikkonen 316cd8a846 Fix ALIGNED keyword and grow alignment to 64B 2018-12-03 15:36:32 +02:00
Pauli Oikkonen 1befc69a4c Implement sign bit hiding in AVX2 2018-12-03 15:36:32 +02:00
Reima Hyvönen f8696b54a4 Updated bipred_recon_avx2 in avx2/picture-avx2.c. Now it detects blocks whose width is not equal to 8 (i.e. width = 12) 2018-11-20 17:09:19 +02:00
Reima Hyvönen 710ba288db Chroma has some problems 2018-11-15 16:42:48 +02:00
Ari Lemmetti a832206bb6 Replace 32-bit incompatible intrinsics 2018-11-12 18:54:33 +02:00
Ari Lemmetti 5c774c4105 Rewrite most of FME and interpolation filters
Changes had to break a lot of stuff and were just squashed into this horrible code dump
2018-11-08 20:21:16 +02:00
Reima Hyvönen 7406c33a42 Some more cleaning 2018-10-26 12:25:18 +03:00
Reima Hyvönen 4c71546b2e Cleaned up some code 2018-10-26 12:19:44 +03:00
Reima Hyvönen 4fe3909e48 Switched luma to use 32-bit ints instead of 16-bit 2018-10-24 18:24:46 +03:00
Marko Viitanen 465bc2cfee [EMT] make functions static and prefix arrays with kvz_g 2018-10-18 10:54:33 +03:00
Marko Viitanen 169febd1c4 [EMT] Simplify DCT8, DCT5, DST1 and DST7 definitions 2018-10-17 12:17:54 +03:00
Marko Viitanen e015d7eb2b Fix compiler warnings 2018-10-17 10:43:11 +03:00
Marko Viitanen ad310c77d3 Added EMT transforms to the strategies 2018-10-17 08:56:49 +03:00
Reima Hyvönen 381e786e10 Trying to find the bug in luma 2018-10-11 18:08:41 +03:00
Reima Hyvönen 2f5f81bac3 Removed the non-optimized bipred function 2018-10-09 11:19:23 +03:00
Reima Hyvönen 212a8e68fa Modified to avoid memory overflow, still some bug inside luma 2018-10-02 20:23:32 +03:00
Marko Viitanen 389aeebe07 Added 2x2 transform functions 2018-09-13 14:51:07 +03:00
Marko Viitanen 445c059b4a Fix transforms for VTM 2.0, generated new transform matrices and added a shift by 2 for forward and inverse 2018-09-13 14:39:49 +03:00
Marko Viitanen 382917bcd3 New table for choosing angular intra filtered references and a small bugfix on the end condition of angular intra 2018-09-13 09:35:55 +03:00
Marko Viitanen d4ed0ee3ad Fixed some array offsets in intra angular prediction 2018-09-12 08:53:17 +03:00
Sami Ahovainio 787264f568 Fixed dst indexing in kvz_angular_pred_generic 2018-08-31 10:36:28 +03:00
Sami Ahovainio d2291fea83 Intra mode scaling moved from angular prediction to kvz_intra_predict. pdpc implemented in kvz_intra_predict. 2018-08-31 10:01:28 +03:00
Sami Ahovainio 54ebadfc43 Clarifying comments and changes towards WAIP 2018-08-29 16:00:08 +03:00
Reima Hyvönen 896034b7cf Renamed some functions back 2018-08-28 15:31:10 +03:00
Reima Hyvönen e8b5e6db4c Did some merging 2018-08-28 15:26:27 +03:00
Reima Hyvönen 7de5c74434 Updated bipred_recon to work faster 2018-08-28 15:12:31 +03:00
Reima Hyvönen 47b357cca2 Comment one test 2018-08-27 18:52:14 +03:00
Reima Hyvönen 2ca99a44e8 Updated shuffle operation to be in the right order 2018-08-27 18:16:38 +03:00
Sami Ahovainio 42741a2c40 Some changes for PCM and Intra towards VTM 2.0 compatibility. 2018-08-27 09:18:15 +03:00
Marko Viitanen 4f7da86285 Commented out sign hiding code, which is not used in VVC 2018-08-17 09:38:11 +03:00
Marko Viitanen c9cbdd5dc3 Added a couple of TODO comments for large CTU support 2018-08-17 09:37:14 +03:00
Marko Viitanen daf041406f Disable DST 2018-08-16 16:05:32 +03:00
Reima Hyvönen 508b218a12 some modifications made to prevent reading too much 2018-08-14 10:50:39 +03:00
Reima Hyvönen 1d935ee888 some useless stuff removed 2018-08-13 16:47:11 +03:00
Reima Hyvönen ce3ac4c05e some modifications to no_mov 2018-08-13 16:41:02 +03:00
Reima Hyvönen 15a613ae94 test if no_mov breaks testing 2018-08-13 16:02:56 +03:00
Reima Hyvönen 97a2049e58 Removed pointer declaration from the switch 2018-08-10 16:42:26 +03:00
Reima Hyvönen aa94bcedbc Stream is now a pointer 2018-08-10 16:38:49 +03:00
Reima Hyvönen fa5b227ece 256 to 32 doesn't work, made them by hand 2018-08-10 16:01:20 +03:00
Reima Hyvönen 408dedbcc8 removed _mm256_extract_epi8 and replaced with _mm_stream 2018-08-10 15:53:26 +03:00
Reima Hyvönen 31c35091c6 _mm256_cvtsi256_si32 removed 2018-08-10 10:06:40 +03:00
Reima Hyvönen 99dc43074f _mm256_cvtsi256_si32 breaks the system, too many bits. Back to extract 2018-08-10 09:59:33 +03:00
Reima Hyvönen 4f1f80b2cb Changed the 256-bit convert into a 256 -> 128 cast followed by a convert from 128 2018-08-09 15:35:54 +03:00
Reima Hyvönen 4957555eb3 Removed leftover from 939 2018-08-09 15:25:03 +03:00
Reima Hyvönen 28b165c971 Clarified some sections, added _MM_SHUFFLE macro 2018-08-09 15:23:01 +03:00
Reima Hyvönen dd04df8667 Testing whether the error is in both AVX2 functions 2018-08-03 11:49:00 +03:00
Reima Hyvönen ed50d71fde Switched some variables to different location, altered inter_recon_bipred_avx2 function 2018-08-02 16:08:59 +03:00
Reima Hyvönen f5739a0028 Renaming and removing useless prints 2018-08-02 14:47:17 +03:00
Reima Hyvönen bc09f59bb6 Edited some definitions 2018-08-02 11:54:53 +03:00
Reima Hyvönen a4bf77f208 Tested some extract functions 2018-07-12 09:29:32 +03:00
Reima Hyvönen c05033a893 Even more useless vectors removed 2018-07-11 15:09:14 +03:00
Reima Hyvönen 884cb77238 Removed some unused vectors 2018-07-11 15:06:11 +03:00
Reima Hyvönen 792689a5ff Removed for-loops, added extract instead 2018-07-11 14:56:41 +03:00
Reima Hyvönen f9c7f6ee66 Added some break operations for AVX2 optimization 2018-07-11 14:15:38 +03:00
Reima Hyvönen cc064da143 Some more optimization for bipred 2018-07-11 11:27:54 +03:00
Reima Hyvönen a22cf03ddb Updated to have no movement function to avx2 strategies 2018-07-10 16:07:15 +03:00
Reima Hyvönen ea83ae45f0 Working solution 2018-07-03 11:18:51 +03:00
Reima Hyvönen 17babfffa4 25.6 working optimization, ~50% faster than original 2018-06-25 17:06:16 +03:00
Reima Hyvönen 9fed29f950 Optimization for inter_recon_bipred 2018-04-18 15:25:44 +03:00
Arttu Ylä-Outinen 0a69e6d18f Fix selection of transform function for 4x4 blocks
DST function was returned for inter luma transform blocks of size 4x4
even though they must use DCT. Fixed by checking the prediction mode of
the block in addition to whether it is chroma or luma.
2018-01-18 10:36:25 +02:00
Arttu Ylä-Outinen 9694bd2fae Fix build on 32-bit systems
Function coeff_abs_sum_avx2 that was added in e950c9b was outside the
AVX2 #if directive.
2017-07-28 09:19:29 +03:00
Arttu Ylä-Outinen e950c9b101 Add AVX2 implementation for coefficient sum 2017-07-28 07:39:36 +03:00
Arttu Ylä-Outinen d50ae6990c Add sum of absolute coefficients to strategies 2017-07-28 07:39:15 +03:00
Arttu Ylä-Outinen fdb3480b54 Enable strategies for SAO reconstruction
Re-enables strategies for SAO reconstruction. They were disabled in
commit ec9ff42.
2017-07-11 10:35:18 +03:00
Arttu Ylä-Outinen 333dba3884 Add static to SAO strategies 2017-07-11 10:02:01 +03:00
Arttu Ylä-Outinen 563bc26e71 Fix out-of-bounds read in AVX2 SAO
The AVX2 version of SAO loaded the offsets with a 256-bit read even though there
are only five 32-bit integers.
2017-07-06 13:04:52 +03:00
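
One common way to load exactly five 32-bit values without reading past the array is a masked load; a sketch of that approach (the actual fix here may simply have padded the offsets array instead):

```
#include <immintrin.h>
#include <stdint.h>

// Load five int32 offsets into a 256-bit register without touching the three
// trailing lanes in memory: only lanes whose mask high bit is set are read.
static __m256i load_5_offsets_sketch(const int32_t offsets[5])
{
  const __m256i mask = _mm256_setr_epi32(-1, -1, -1, -1, -1, 0, 0, 0);
  return _mm256_maskload_epi32((const int *)offsets, mask);
}
```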
Arttu Ylä-Outinen 2c66e0bbd2 Fix warnings about invalid reads in AVX2 ipol
AVX2 filter functions read pixels in chunks of 8 or 16 bytes. At the end
of the block, the read goes out of the bounds of the pixels array. The
extra pixels do not affect the result.

Fixes valgrind complaining about the invalid reads by allocating 5 extra
pixels in kvz_get_extended_block_avx2
2017-06-22 09:37:55 +03:00
Arttu Ylä-Outinen 95775a1645 Change coefficient storage order
Changes coefficient storage order to a zig-zag order. Reduces
unnecessary copying of coefficients to temporary arrays.
2017-05-12 16:46:57 +03:00
Arttu Ylä-Outinen 51786eda67 Drop redundant fields in encoder_control_t
Some of the fields in encoder_control_t were simply copies of the
corresponding fields in kvz_config. This commit drops the copied fields
in favor of using the fields in encoder_control_t.cfg directly.
2017-02-09 14:05:28 +09:00
Arttu Ylä-Outinen e78a8dfcf5 Copy the kvz_config passed to encoder_open
The kvz_config struct is created by the user but kvazaar keeps a pointer
to it. It is easy to break things by modifying the configuration outside
kvazaar. In addition, kvazaar modifies the struct even though it has
a const modifier.

This commit changes the field cfg in encoder_control_t to be a copy of
the kvz_config struct instead of a pointer, removing modifications to
the const struct and allowing users to do whatever they want with it
after opening the encoder.
2017-02-09 13:23:54 +09:00
Arttu Ylä-Outinen 640ff94ecd Use separate lambda and QP for each LCU
Adds fields lambda, lambda_sqrt and qp to encoder_state_t. Drops field
cur_lambda_cost_sqrt from encoder_state_config_frame_t and renames
cur_lambda_cost to lambda.
2017-01-09 01:24:23 +09:00
Ari Lemmetti 70a52f0e48 10-bit: add missing bit depth adjustment to ssd 2016-11-17 19:28:04 +02:00
Ari Lemmetti 29153ed503 Remove unused variable 2016-10-21 17:28:42 +03:00
Ari Lemmetti 778e46dfd8 Add AVX2 version of SSD 2016-10-21 15:07:53 +03:00
Ari Lemmetti 6f5d7c9e06 Move SSD to strategies 2016-10-21 15:07:23 +03:00
Ari Lemmetti 89b941eab4 Fix typo 2016-10-21 15:07:02 +03:00
Ari Koivula cbfa824d1a Merge branch 'simd' 2016-09-27 20:49:45 +03:00
Ari Koivula 14a7bcba25 Use a faster function for clipped inter SAD
Use the vectorized general SSE41 inter SAD in AVX reg_sad for shapes
for which we don't have AVX versions yet.

Also improves speed of --smp and --amp a lot. Got a 1.25x speedup for:
--preset=ultrafast -q 27 --gop=lp-g4d3r3t1 --me-early-termination=on --rd=1 --pu-depth-inter=1-3 --smp --amp

* Suite speed_tests:
-PASS inter_sad: 0.898M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec)
+PASS inter_sad: 2.503M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec)
-PASS inter_sad: 115.054M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec)
+PASS inter_sad: 133.577M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec)
2016-09-27 20:48:30 +03:00
Eemeli Kallio f41e428e5f Removed kvz_skip_unnecessary_rdoq and reworked --rdoq-skip to skip 4x4 blocks when it is on. 2016-09-09 10:26:07 +03:00
Ari Koivula 02cd17b427 Add faster AVX inter SAD for 32x32 and 64x64
Add implementations for these functions that process the image line by
line instead of using the 16x16 function to process block by block.

The 32x32 is around 30% faster, and 64x64 is around 15% faster,
on Haswell.

PASS inter_sad: 28.744M x reg_sad(32x32):x86_asm_avx (1014 ticks, 1.014 sec)
PASS inter_sad: 7.882M x reg_sad(64x64):x86_asm_avx (1014 ticks, 1.014 sec)
to
PASS inter_sad: 37.828M x reg_sad(32x32):x86_asm_avx (1014 ticks, 1.014 sec)
PASS inter_sad: 9.081M x reg_sad(64x64):x86_asm_avx (1014 ticks, 1.014 sec)
2016-09-01 21:36:39 +03:00
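
Line by line here means one pair of full-width loads per row instead of stitching 16x16 sub-blocks together. A sketch of that shape for the 32-wide case, written with intrinsics for readability (the commit itself is hand-written x86 assembly):

```
#include <immintrin.h>
#include <stdint.h>

// 32-wide SAD, one row per iteration: a single pair of 256-bit loads and one
// vpsadbw per row, accumulated into four 64-bit partial sums.
static uint32_t sad_w32_avx2_sketch(const uint8_t *cur, const uint8_t *ref,
                                    int height, uint32_t cur_stride,
                                    uint32_t ref_stride)
{
  __m256i acc = _mm256_setzero_si256();
  for (int y = 0; y < height; y++) {
    __m256i c = _mm256_loadu_si256((const __m256i *)(cur + y * cur_stride));
    __m256i r = _mm256_loadu_si256((const __m256i *)(ref + y * ref_stride));
    acc = _mm256_add_epi64(acc, _mm256_sad_epu8(c, r));
  }
  __m128i s = _mm_add_epi64(_mm256_castsi256_si128(acc),
                            _mm256_extracti128_si256(acc, 1));
  s = _mm_add_epi64(s, _mm_srli_si128(s, 8));
  return (uint32_t)_mm_cvtsi128_si32(s);
}
```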
Ari Lemmetti 28c4174d0e Fix incorrect shuffle parameters
_MM_SHUFFLE uses reverse order
2016-08-23 19:40:46 +03:00
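
The reverse-order pitfall: _MM_SHUFFLE lists its selectors from the highest lane down to the lowest, so _MM_SHUFFLE(3, 2, 1, 0) is the identity permutation. A tiny illustration:

```
#include <immintrin.h>

// _MM_SHUFFLE(z, y, x, w) expands to (z << 6) | (y << 4) | (x << 2) | w, and
// the first argument selects the HIGHEST 32-bit lane of the result.
static __m128i identity_sketch(__m128i v)
{
  return _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 2, 1, 0));  // lanes unchanged
}

static __m128i reversed_sketch(__m128i v)
{
  return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));  // lane order reversed
}
```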
Ari Lemmetti ce77bfa15b Replace KVZ_PERMUTE with _MM_SHUFFLE
The same exact macro already exists
2016-08-22 19:08:46 +03:00
Eemeli Kallio 99d8b9abeb Changed skip_rdoq name to kvz_skip_unnecessary_rdoq. Changed the order it uses when it goes through CGs and tuned its sum calculation. 2016-08-18 14:02:56 +03:00
Eemeli Kallio 1fb4755f31 Added rdoq-skip to quant-generic.c 2016-08-18 12:17:54 +03:00
Eemeli Kallio d20ac03ca2 Added --rdoq-skip option 2016-08-18 12:17:53 +03:00
Arttu Ylä-Outinen 2a946bd88e Rename encoder_state_t.global to frame
"Frame" is more accurate than "global" since when OWF is used, encoder
states for each frame have their own struct.
2016-08-10 13:22:36 +09:00
Arttu Ylä-Outinen 5fbb0a8c27 Fix includes 2016-08-10 13:05:40 +09:00
Ari Lemmetti 6bcba004ff Comment out to fix unused code error on clang. 2016-07-14 14:12:16 +03:00
Ari Lemmetti c0979ebdcb Implement AVX2 luma sampling 2016-07-14 12:53:02 +03:00
Ari Lemmetti 6244560426 Add avx2 strategy for kvz_filter_frac_blocks_luma. 2016-07-14 12:53:02 +03:00
Ari Lemmetti 9c4e9e049b Load only what is needed. Eliminate latency from hadds. 2016-07-14 12:53:01 +03:00
Ari Lemmetti fccfbd2f28 Add strategy for kvz_filter_frac_blocks_luma 2016-07-14 12:51:02 +03:00
Ari Lemmetti 2b0c8db349 Add quad satd for avx2 2016-07-14 12:50:24 +03:00
Ari Lemmetti 0ff69fd6f8 Add any size multi satd 2016-07-14 12:48:37 +03:00
Arttu Ylä-Outinen bf26661782 Add support for 4x4 blocks to SATD_ANY_SIZE.
Makes functions satd_any_size_generic and satd_any_size_8bit_avx2 work
on blocks whose width and/or height are not multiples of 8.
2016-06-16 18:53:17 +09:00
Ari Lemmetti 3107a93eaf Fix avx2 chroma sampling for amp 2016-05-17 14:09:57 +03:00
Ari Lemmetti efbdc5dade Utilize registers more efficiently for 8x8 and larger blocks 2016-04-21 13:26:38 +03:00
Ari Lemmetti 192cee95b2 Vectorize vertical filtering 2016-04-21 13:26:38 +03:00
Ari Lemmetti 0be35f72b8 Filter 4 pixels simultaneously in x direction 2016-04-21 13:26:38 +03:00
Ari Lemmetti 10484bda9f Make strategies out of fractional pixel sample functions 2016-04-21 13:26:38 +03:00
Ari Lemmetti 8247faf8e0 Remove 64-bit only instruction to fix 32-bit compilation. 2016-04-19 18:05:11 +03:00
Ari Lemmetti eb55d6b6b9 Fix writing over boundary. 2016-04-19 16:03:43 +03:00
Ari Lemmetti bcabc6fadd Remove pixel blit from strategies. Use memcpy instead. 2016-04-06 18:44:04 +03:00
Ari Koivula 61fc3e87ba Run include-what-you-use fix_includes.py
The includes should make more sense now and not just happen to compile
due to headers included from other headers.

Used a modified version of IWYU. Modifications were to attribute int8_t
and so on to stdint.h instead of sys/types.h and immintrin.h instead of
more specific headers.

include-what-you-use 0.7 (git:b70df35)
based on clang version 3.9.0 (trunk 264728)
2016-04-01 17:46:55 +03:00
Ari Koivula 8908d85d66 Change all relative includes to absolute 2016-04-01 17:46:44 +03:00
Ari Koivula 4876879b82 Add IWYU pragmas 2016-03-31 12:33:34 +03:00
Ari Koivula 5b66578f71 Add kvz_ prefix to md5 functions
The non kvz_ symbols were being exported in the static lib, which got caught
by Travis tests.
2016-03-18 13:13:35 +02:00
Ari Koivula 4125218cfa Add --hash=md5
Add md5 through extras/libmd5 taken from HM with BSD license. It's
implemented as a generic strategy using the same interface as checksum,
so we can write a SIMD version if it seems necessary.
2016-03-18 05:23:57 +02:00
Ari Lemmetti e502292ba8 Remove old function 2016-03-16 20:18:55 +02:00
Ari Lemmetti c6cc96f5ec Optimize sao band ddistortion 2016-03-16 20:16:00 +02:00
Ari Lemmetti ab577f476f Optimize sao reconstruct color 2016-03-16 20:15:32 +02:00
Ari Lemmetti 48bfddf4ec Optimize calc sao edge dir 2016-03-16 20:14:50 +02:00
Ari Lemmetti ba69992941 Optimize sao edge ddistortion 2016-03-16 20:14:19 +02:00
Ari Lemmetti 941b6b3e27 Optimize calc eo cat 2016-03-16 20:13:30 +02:00
Ari Lemmetti 04fbb48a09 Add strategy for avx2. Copy generic functions there. 2016-03-16 20:13:15 +02:00
Ari Lemmetti 4e30a215d8 Create generic strategy for sao. 2016-03-16 20:11:15 +02:00
Ari Lemmetti 99e37ec235 Update old pixel type to the current one 2016-01-30 19:33:09 +02:00
Ari Koivula fa1af14637 Fix includes to include global.h first everywhere 2016-01-22 15:07:49 +02:00
Ari Lemmetti 44656aeb19 Remove useless calculation 2016-01-19 16:35:16 +02:00
Ari Lemmetti a2fc9920e6 Merge branch 'alternative-satd' 2016-01-13 15:00:43 +02:00