hashirama/uvg266

mirror of https://github.com/ultravideo/uvg266.git synced 2024-11-24 18:34:06 +00:00

Author	SHA1	Message	Date
Pauli Oikkonen	3a1f2eb752	Prefer SSE4.1 implementation of SAD over AVX2 It seems that the 128-bit wide version consistently outperforms the 256-bit one	2019-01-10 13:48:55 +02:00
Pauli Oikkonen	9b24d81c6a	Use SSE instead of AVX for small widths Highly dubious if this will help performance at all	2019-01-07 20:12:13 +02:00
Pauli Oikkonen	b2176bf72a	Optimize SSE4.1 version of SAD Make it use the same vblend trick as AVX2. Interestingly, on my test setup this seems to be faster than the same code using 256-bit AVX vectors.	2019-01-07 19:40:57 +02:00
Pauli Oikkonen	887d7700a8	Modify AVX2 SAD to mask data by byte granularity in AVX registers Avoids using any SAD calculations narrower than 256 bits, and simplifies the code. Also improves execution speed	2019-01-07 18:53:15 +02:00
Pauli Oikkonen	7585f79a71	AVX2-ize SAD calculation Performance is no better than SSE though	2019-01-07 16:26:24 +02:00
Pauli Oikkonen	ab3dc58df6	Copy SAD SSE4.1 impl to AVX2	2019-01-03 18:31:57 +02:00
Pauli Oikkonen	45ac6e6d03	Tidy pack_16x16b_to_16x2b comments	2019-01-03 16:37:05 +02:00
Pauli Oikkonen	016eb014ad	Move packing 16x16b -> 16x2b into separate function	2018-12-20 10:51:44 +02:00
Ari Lemmetti	b234897e8a	Fix smp and amp blocks in fme and revert previous change. Filter 8x8 (sub)blocks even with 8x4, 4x8, 16x4, 4x16 etc. Calculate SATD on the 8x4, ... part	2018-12-19 21:30:53 +02:00
Pauli Oikkonen	9aaa6f260d	Fixes to enable portability	2018-12-18 20:42:09 +02:00
Pauli Oikkonen	2fdbbe9730	Move CG reordering code from quant-avx2 to shared header	2018-12-18 19:42:18 +02:00
Pauli Oikkonen	d02207306d	Create a header file for shared AVX2 code	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	361bf0c7db	Precompute >=2 coeff encoding loop with 2-bit arithmetic Who needs 16x16b vectors when you can do practically the same with 16x2b pseudovectors in 32-bit general purpose registers!	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	f66cb23d5b	Optimize greater1 encoding loop Calculating the c1 variable need not be a serial operation!	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	8c8b791c35	Vectorize kvz_context_get_sig_ctx_inc	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	033261eb74	Eliminate two branches using bit magic	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	c4434e8d04	Scan CG's in forward order to simplify finding last significant	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	efd097f5a5	Vectorize the coeff group loop to some extent	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	a01362e638	use the efficient method of reordering raster->scan	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	50a888e789	Use the efficient method to find first and last nz coeffs in block	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	7e9203f566	Scan coeff groups in scan order to help find last significant one	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	9a5a6fdbc7	Simplify two ifs in encode_coeff_nxn-avx2	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	37a2a8bac8	See if loop can be optimized by rearranging	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	584f2f74b6	Vectorize significant coeff group scanning loop	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	1bfed73221	Add AVX2 strategy for encode_coding_tree	2018-12-18 19:41:09 +02:00
Pauli Oikkonen	c3a6f3112a	Add generic strategy group for encode_coding_tree	2018-12-18 19:41:09 +02:00
Sergei Trofimovich	68a70e45a1	x86 asm: mark stack as non-executable Gentoo's `scanelf` QA tool detects writable/executable stack of assembly-writtent files as: ``` $ scanelf -qRa . 0644 LE !WX --- --- ./src/strategies/x86_asm/.libs/picture-x86-asm-sad.o 0644 LE !WX --- --- ./src/strategies/x86_asm/.libs/picture-x86-asm-satd.o 0644 LE !WX --- --- ./src/strategies/x86_asm/picture-x86-asm-sad.o 0644 LE !WX --- --- ./src/strategies/x86_asm/picture-x86-asm-satd.o ``` Normally C compiler emits non-executable stack marking (or GNU assembler via `-Wa,--noexecstack`). The change adds non-executable stack marking for yasm-based assmbly files. https://wiki.gentoo.org/wiki/Hardened/GNU_stack_quickstart has more details. Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>	2018-12-16 11:31:56 +00:00
Reima Hyvönen	1fcc5c6a8d	Merge branch 'bipred_recon'	2018-12-11 09:59:35 +02:00
Reima Hyvönen	e4a10880f3	Added case 12 to bipred_recon no mov	2018-12-11 09:52:17 +02:00
Marko Viitanen	a4f3968e52	Fix Visual Studio errors by initializing some variables used in AVX2 signhiding	2018-12-11 09:33:26 +02:00
Pauli Oikkonen	c465578048	Add a descriptive comment to coefficient reordering	2018-12-03 15:36:32 +02:00
Pauli Oikkonen	f78bf2ebcb	Optimize q_coefs usage for indexed fetch	2018-12-03 15:36:32 +02:00
Pauli Oikkonen	d9591f1b49	Eliminate midway buffering of reordered coefs TODO: For some mysterious reason seems slightly slower than the buffered one	2018-12-03 15:36:32 +02:00
Pauli Oikkonen	7fe454c51f	Optimize get_cheapest_alternative()	2018-12-03 15:36:32 +02:00
Pauli Oikkonen	6bbd3e5a44	Optimize rearrange_512 function	2018-12-03 15:36:32 +02:00
Pauli Oikkonen	cb8209d1b3	Vectorize transform coefficient reordering loop	2018-12-03 15:36:32 +02:00
Pauli Oikkonen	7cf4c7ae5f	Rename "reduce" functions to hsum That's what the functions fundamendally do anyway	2018-12-03 15:36:32 +02:00
Pauli Oikkonen	316cd8a846	Fix ALIGNED keyword and grow alignment to 64B	2018-12-03 15:36:32 +02:00
Pauli Oikkonen	1befc69a4c	Implement sign bit hiding in AVX2	2018-12-03 15:36:32 +02:00
Reima Hyvönen	f8696b54a4	Updated bipred_recon_avx2 in avx2/picture-avx2.c. Now it detects blocks that can be not equal to 8 (ie. width = 12)	2018-11-20 17:09:19 +02:00
Reima Hyvönen	710ba288db	Chroma has some problems	2018-11-15 16:42:48 +02:00
Ari Lemmetti	a832206bb6	Replace 32-bit incompatible instrinsics	2018-11-12 18:54:33 +02:00
Ari Lemmetti	5c774c4105	Rewrite most of FME and interpolation filters Changes had to break a lot of stuff and were just squashed into this horrible code dump	2018-11-08 20:21:16 +02:00
Reima Hyvönen	7406c33a42	Some more cleaning	2018-10-26 12:25:18 +03:00
Reima Hyvönen	4c71546b2e	Cleaned some coding	2018-10-26 12:19:44 +03:00
Reima Hyvönen	4fe3909e48	Switched luma to use 32bits size ints intstead of 16bit size	2018-10-24 18:24:46 +03:00
Reima Hyvönen	381e786e10	Trying to find the bug in luma	2018-10-11 18:08:41 +03:00
Reima Hyvönen	2f5f81bac3	removed the non-optimated bipred function	2018-10-09 11:19:23 +03:00
Reima Hyvönen	212a8e68fa	Modified to avoid memory overflow, still some bug inside luma	2018-10-02 20:23:32 +03:00
Reima Hyvönen	896034b7cf	Some renamed functions back	2018-08-28 15:31:10 +03:00
Reima Hyvönen	e8b5e6db4c	Did some merging	2018-08-28 15:26:27 +03:00
Reima Hyvönen	7de5c74434	Updated bipred_recon to work faster	2018-08-28 15:12:31 +03:00
Reima Hyvönen	47b357cca2	Comment one test	2018-08-27 18:52:14 +03:00
Reima Hyvönen	2ca99a44e8	Updated shuffle operation to be in right order	2018-08-27 18:16:38 +03:00
Reima Hyvönen	508b218a12	some modifications made to prevent reading too much	2018-08-14 10:50:39 +03:00
Reima Hyvönen	1d935ee888	some useless stuff removed	2018-08-13 16:47:11 +03:00
Reima Hyvönen	ce3ac4c05e	some modifications to no_mov	2018-08-13 16:41:02 +03:00
Reima Hyvönen	15a613ae94	test if no_mov breaks testing	2018-08-13 16:02:56 +03:00
Reima Hyvönen	97a2049e58	removed pointer declaration out from switch	2018-08-10 16:42:26 +03:00
Reima Hyvönen	aa94bcedbc	Stream is now pointer	2018-08-10 16:38:49 +03:00
Reima Hyvönen	fa5b227ece	256 to 32 doesn't work, made them by hand	2018-08-10 16:01:20 +03:00
Reima Hyvönen	408dedbcc8	removed _mm256_extract_epi8 and replaced with _mm_stream	2018-08-10 15:53:26 +03:00
Reima Hyvönen	31c35091c6	_mm256_cvtsi256_si32 removed	2018-08-10 10:06:40 +03:00
Reima Hyvönen	99dc43074f	_mm256_cvtsi256_si32 breaks system, too much bits. back to extract	2018-08-10 09:59:33 +03:00
Reima Hyvönen	4f1f80b2cb	Transformed convert from 256 to cast 256 -> 128 and then convert from 128	2018-08-09 15:35:54 +03:00
Reima Hyvönen	4957555eb3	Removed leftover from 939	2018-08-09 15:25:03 +03:00
Reima Hyvönen	28b165c971	Clearified some sections, added _MM_SHUFFLE macro	2018-08-09 15:23:01 +03:00
Reima Hyvönen	dd04df8667	testing if error in both avx2 functions	2018-08-03 11:49:00 +03:00
Reima Hyvönen	ed50d71fde	Switched some variables to different location, altered inter_recon_bipred_avx2 function	2018-08-02 16:08:59 +03:00
Reima Hyvönen	f5739a0028	Renaming and removing useless prints	2018-08-02 14:47:17 +03:00
Reima Hyvönen	bc09f59bb6	Edited some definitions	2018-08-02 11:54:53 +03:00
Reima Hyvönen	a4bf77f208	Tested some extract functions	2018-07-12 09:29:32 +03:00
Reima Hyvönen	c05033a893	Even more useless vectors removed	2018-07-11 15:09:14 +03:00
Reima Hyvönen	884cb77238	Removed some not used vectors	2018-07-11 15:06:11 +03:00
Reima Hyvönen	792689a5ff	Removed for-loops, added extract instead	2018-07-11 14:56:41 +03:00
Reima Hyvönen	f9c7f6ee66	Added some break-operations for avx2 optimation	2018-07-11 14:15:38 +03:00
Reima Hyvönen	cc064da143	some more optimation for bipred	2018-07-11 11:27:54 +03:00
Reima Hyvönen	a22cf03ddb	Updated to have no movement function to avx2 strategies	2018-07-10 16:07:15 +03:00
Reima Hyvönen	ea83ae45f0	Toimiva ratkaisu	2018-07-03 11:18:51 +03:00
Reima Hyvönen	17babfffa4	25.6 working optimation, ~50% faster than original	2018-06-25 17:06:16 +03:00
Reima Hyvönen	9fed29f950	optimation for inter_recon_bipred	2018-04-18 15:25:44 +03:00
Arttu Ylä-Outinen	0a69e6d18f	Fix selection of transform function for 4x4 blocks DST function was returned for inter luma transform blocks of size 4x4 even though they must use DCT. Fixed by checking the prediction mode of the block in addition to whether it is chroma or luma.	2018-01-18 10:36:25 +02:00
Arttu Ylä-Outinen	9694bd2fae	Fix build on 32-bit systems Function coeff_abs_sum_avx2 that was added in `e950c9b` was outside the AVX2 #if directive.	2017-07-28 09:19:29 +03:00
Arttu Ylä-Outinen	e950c9b101	Add AVX2 implementation for coefficient sum	2017-07-28 07:39:36 +03:00
Arttu Ylä-Outinen	d50ae6990c	Add sum of absolute coefficients to strategies	2017-07-28 07:39:15 +03:00
Arttu Ylä-Outinen	fdb3480b54	Enable strategies for SAO reconstruction Re-enables strategies for SAO reconstruction. They were disabled in commit `ec9ff42`.	2017-07-11 10:35:18 +03:00
Arttu Ylä-Outinen	333dba3884	Add static to SAO strategies	2017-07-11 10:02:01 +03:00
Arttu Ylä-Outinen	563bc26e71	Fix out-of-bounds read in AVX2 SAO AVX2 version of SAO loaded offsets with a 256 bit read even though there are only five 32 bit integers.	2017-07-06 13:04:52 +03:00
Arttu Ylä-Outinen	2c66e0bbd2	Fix warnings about invalid reads in AVX2 ipol AVX2 filter functions read pixels in chunks of 8 or 16 bytes. At the end of the block, the read goes out of the bounds of the pixels array. The extra pixels do not affect the result. Fixes valgrind complaining about the invalid reads by allocating 5 extra pixels in kvz_get_extended_block_avx2	2017-06-22 09:37:55 +03:00
Arttu Ylä-Outinen	95775a1645	Change coefficient storage order Changes coefficient storage order to a zig-zag order. Reduces unnecessary copying of coefficients to temporary arrays.	2017-05-12 16:46:57 +03:00
Arttu Ylä-Outinen	51786eda67	Drop redundant fields in encoder_control_t Some of the fields in encoder_control_t were simply copies of the corresponding fields in kvz_config. This commit drops the copied fields in favor of using the fields in encoder_control_t.cfg directly.	2017-02-09 14:05:28 +09:00
Arttu Ylä-Outinen	e78a8dfcf5	Copy the kvz_config passed to encoder_open The kvz_config struct is created by the user but kvazaar keeps a pointer to it. It is easy to break things by modifying the configuration outside kvazaar. In addition, kvazaar modifies the struct even though it is has a const modifier. This commit changes the field cfg in encoder_control_t to be a copy of the kvz_config struct instead of a pointer, removing modifications to the const struct and allowing users to do whatever they want with it after opening the encoder.	2017-02-09 13:23:54 +09:00
Arttu Ylä-Outinen	640ff94ecd	Use separate lambda and QP for each LCU Adds fields lambda, lambda_sqrt and qp to encoder_state_t. Drops field cur_lambda_cost_sqrt from encoder_state_config_frame_t and renames cur_lambda_cost to lambda.	2017-01-09 01:24:23 +09:00
Ari Lemmetti	70a52f0e48	10-bit: add missing bit depth adjustment to ssd	2016-11-17 19:28:04 +02:00
Ari Lemmetti	29153ed503	Remove unused variable	2016-10-21 17:28:42 +03:00
Ari Lemmetti	778e46dfd8	Add AVX2 version of SSD	2016-10-21 15:07:53 +03:00
Ari Lemmetti	6f5d7c9e06	Move SSD to strategies	2016-10-21 15:07:23 +03:00
Ari Lemmetti	89b941eab4	Fix typo	2016-10-21 15:07:02 +03:00
Ari Koivula	cbfa824d1a	Merge branch 'simd'	2016-09-27 20:49:45 +03:00
Ari Koivula	14a7bcba25	Use a faster function for clipped inter SAD Use the vectorized general SSE41 inter SAD in AVX reg_sad for shapes for which we don't have AVX versions yet. Also improves speed of --smp and --amp a lot. Got a 1.25x speedup for: --preset=ultrafast -q 27 --gop=lp-g4d3r3t1 --me-early-termination=on --rd=1 --pu-depth-inter=1-3 --smp --amp * Suite speed_tests: -PASS inter_sad: 0.898M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec) +PASS inter_sad: 2.503M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec) -PASS inter_sad: 115.054M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec) +PASS inter_sad: 133.577M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec)	2016-09-27 20:48:30 +03:00

1 2 3 4 5 ...

333 commits