Pauli Oikkonen
9db0a1bcda
Create get_optimized_sad func for SSE4.1
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
91380729b1
Add generic get_optimized_sad implementation
...
NOTE: To force generic SAD implementation on devices supporting
vectorized variants, you now have to override both get_optimized_sad
and reg_sad to generic (only overriding get_optimized_sad on AVX2
hardware would just run all SAD blocks through reg_sad_avx2). Let's
see if there's a more sensible way to do it, but it's not trivial.
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
45f36645a6
Move choosing of tailored SAD function higher up the calling chain
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
91cb0fbd45
Create strategy for directly obtaining pointer to constant-width SAD function
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
94035be342
Unify unrolling naming conventions
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
517a4338f6
Unroll SSE SAD for 8-wide blocks to process 4 lines at once
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
0f665b28f6
Unroll arbitrary width SSE4.1 SAD by 4
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
cbca3347b5
Unroll 64-wide AVX2 SAD by 2
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
84cf771dea
Unroll 32 and 16 wide SAD vector implementations by 4
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
5df5c5f8a4
Cast all pointers to const types in vector SAD funcs
...
Also tidy up the pointer arithmetic
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
a711ce3df5
Inline fixed width vectorized SAD functions
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
6504145cce
Remove 16-pixel wide AVX2 SAD implementation
...
At least on Skylake, it's noticeably slower than the very simple
version using SSE4.1
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
4cb371184b
Add SSE4.1 strategy for 24px wide SAD and an AVX2 strategy for 16
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
796568d9cc
Add SSE4.1 strategies for SAD on widths 4 and 12 and AVX2 strategies for 32 and 64
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
4d45d828fa
Use constant-width SSE4.1 SAD funcs for AVX2
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
2eaa7bc9d2
Move SSE4.1 SAD functions to separate header
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
d2db0086e1
Create constant width SAD versions for 8 and 16 pixels
2019-02-04 20:41:40 +02:00
Pauli Oikkonen
a13fc51003
Include a blank AVX2 strategy registration function even in non-AVX2 builds
2019-02-04 19:52:24 +02:00
Pauli Oikkonen
d55414db66
Only build AVX2 coeff encoding when supported
...
..whoops
2019-02-04 19:34:30 +02:00
Pauli Oikkonen
3fe2f29456
Merge branch 'encode-coeffs-avx2'
2019-02-04 18:52:31 +02:00
Pauli Oikkonen
722b738888
Fix more naming issues
2019-02-04 16:05:43 +02:00
Pauli Oikkonen
e26d98fb75
Rename a couple variables and add crucial comments
2019-02-04 15:57:07 +02:00
Pauli Oikkonen
f186455619
Move encode_last_significant_xy out of strategy modules
...
It's the exact same in both AVX2 and generic, and does not seem to
be worth even trying to vectorize
2019-02-04 14:55:41 +02:00
Pauli Oikkonen
3f7340c932
Fine-tune pack_16x16b_to_16x2b
...
Avoid mm_set1 operation when it's possible to create the constant with
one bit-shift operation from another instead. Thanks Intel for
3-operand instruction encoding!
2019-02-04 14:44:47 +02:00
Pauli Oikkonen
314f5b0e1f
Rename 16x2b cmpgt function, comment it better, optimize it slightly
...
Eliminate an unnecessary bit masking to make it even more messy
2019-02-04 14:44:32 +02:00
Pauli Oikkonen
d8ff6a6459
Fix _andn_u32 to work on old Visual Studio
2019-02-01 15:34:42 +02:00
Pauli Oikkonen
26e1b2c783
Use (u)int32_t instead of (unsigned) int in reg_sad_sse41
2019-01-10 14:37:04 +02:00
Pauli Oikkonen
3a1f2eb752
Prefer SSE4.1 implementation of SAD over AVX2
...
It seems that the 128-bit wide version consistently outperforms the
256-bit one
2019-01-10 13:48:55 +02:00
Pauli Oikkonen
9b24d81c6a
Use SSE instead of AVX for small widths
...
Highly dubious if this will help performance at all
2019-01-07 20:12:13 +02:00
Pauli Oikkonen
b2176bf72a
Optimize SSE4.1 version of SAD
...
Make it use the same vblend trick as AVX2. Interestingly, on my test
setup this seems to be faster than the same code using 256-bit AVX
vectors.
2019-01-07 19:40:57 +02:00
Pauli Oikkonen
887d7700a8
Modify AVX2 SAD to mask data by byte granularity in AVX registers
...
Avoids using any SAD calculations narrower than 256 bits, and
simplifies the code. Also improves execution speed
2019-01-07 18:53:15 +02:00
Pauli Oikkonen
7585f79a71
AVX2-ize SAD calculation
...
Performance is no better than SSE though
2019-01-07 16:26:24 +02:00
Pauli Oikkonen
ab3dc58df6
Copy SAD SSE4.1 impl to AVX2
2019-01-03 18:31:57 +02:00
Pauli Oikkonen
45ac6e6d03
Tidy pack_16x16b_to_16x2b comments
2019-01-03 16:37:05 +02:00
Pauli Oikkonen
016eb014ad
Move packing 16x16b -> 16x2b into separate function
2018-12-20 10:51:44 +02:00
Ari Lemmetti
b234897e8a
Fix smp and amp blocks in fme and revert previous change.
...
Filter 8x8 (sub)blocks even with 8x4, 4x8, 16x4, 4x16 etc.
Calculate SATD on the 8x4, ... part
2018-12-19 21:30:53 +02:00
Pauli Oikkonen
9aaa6f260d
Fixes to enable portability
2018-12-18 20:42:09 +02:00
Pauli Oikkonen
2fdbbe9730
Move CG reordering code from quant-avx2 to shared header
2018-12-18 19:42:18 +02:00
Pauli Oikkonen
d02207306d
Create a header file for shared AVX2 code
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
361bf0c7db
Precompute >=2 coeff encoding loop with 2-bit arithmetic
...
Who needs 16x16b vectors when you can do practically the same with
16x2b pseudovectors in 32-bit general purpose registers!
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
f66cb23d5b
Optimize greater1 encoding loop
...
Calculating the c1 variable need not be a serial operation!
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
8c8b791c35
Vectorize kvz_context_get_sig_ctx_inc
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
033261eb74
Eliminate two branches using bit magic
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
c4434e8d04
Scan CG's in forward order to simplify finding last significant
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
efd097f5a5
Vectorize the coeff group loop to some extent
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
a01362e638
use the efficient method of reordering raster->scan
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
50a888e789
Use the efficient method to find first and last nz coeffs in block
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
7e9203f566
Scan coeff groups in scan order to help find last significant one
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
9a5a6fdbc7
Simplify two ifs in encode_coeff_nxn-avx2
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
37a2a8bac8
See if loop can be optimized by rearranging
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
584f2f74b6
Vectorize significant coeff group scanning loop
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
1bfed73221
Add AVX2 strategy for encode_coding_tree
2018-12-18 19:41:09 +02:00
Pauli Oikkonen
c3a6f3112a
Add generic strategy group for encode_coding_tree
2018-12-18 19:41:09 +02:00
Sergei Trofimovich
68a70e45a1
x86 asm: mark stack as non-executable
...
Gentoo's `scanelf` QA tool detects writable/executable stack
of assembly-writtent files as:
```
$ scanelf -qRa .
0644 LE !WX --- --- ./src/strategies/x86_asm/.libs/picture-x86-asm-sad.o
0644 LE !WX --- --- ./src/strategies/x86_asm/.libs/picture-x86-asm-satd.o
0644 LE !WX --- --- ./src/strategies/x86_asm/picture-x86-asm-sad.o
0644 LE !WX --- --- ./src/strategies/x86_asm/picture-x86-asm-satd.o
```
Normally C compiler emits non-executable stack marking (or GNU assembler
via `-Wa,--noexecstack`).
The change adds non-executable stack marking for yasm-based assmbly files.
https://wiki.gentoo.org/wiki/Hardened/GNU_stack_quickstart has more details.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
2018-12-16 11:31:56 +00:00
Reima Hyvönen
1fcc5c6a8d
Merge branch 'bipred_recon'
2018-12-11 09:59:35 +02:00
Reima Hyvönen
e4a10880f3
Added case 12 to bipred_recon no mov
2018-12-11 09:52:17 +02:00
Marko Viitanen
a4f3968e52
Fix Visual Studio errors by initializing some variables used in AVX2 signhiding
2018-12-11 09:33:26 +02:00
Pauli Oikkonen
c465578048
Add a descriptive comment to coefficient reordering
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
f78bf2ebcb
Optimize q_coefs usage for indexed fetch
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
d9591f1b49
Eliminate midway buffering of reordered coefs
...
TODO: For some mysterious reason seems slightly slower than the
buffered one
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
7fe454c51f
Optimize get_cheapest_alternative()
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
6bbd3e5a44
Optimize rearrange_512 function
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
cb8209d1b3
Vectorize transform coefficient reordering loop
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
7cf4c7ae5f
Rename "reduce" functions to hsum
...
That's what the functions fundamendally do anyway
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
316cd8a846
Fix ALIGNED keyword and grow alignment to 64B
2018-12-03 15:36:32 +02:00
Pauli Oikkonen
1befc69a4c
Implement sign bit hiding in AVX2
2018-12-03 15:36:32 +02:00
Reima Hyvönen
f8696b54a4
Updated bipred_recon_avx2 in avx2/picture-avx2.c. Now it detects blocks that can be not equal to 8 (ie. width = 12)
2018-11-20 17:09:19 +02:00
Reima Hyvönen
710ba288db
Chroma has some problems
2018-11-15 16:42:48 +02:00
Ari Lemmetti
a832206bb6
Replace 32-bit incompatible instrinsics
2018-11-12 18:54:33 +02:00
Ari Lemmetti
5c774c4105
Rewrite most of FME and interpolation filters
...
Changes had to break a lot of stuff and were just squashed into this horrible code dump
2018-11-08 20:21:16 +02:00
Reima Hyvönen
7406c33a42
Some more cleaning
2018-10-26 12:25:18 +03:00
Reima Hyvönen
4c71546b2e
Cleaned some coding
2018-10-26 12:19:44 +03:00
Reima Hyvönen
4fe3909e48
Switched luma to use 32bits size ints intstead of 16bit size
2018-10-24 18:24:46 +03:00
Marko Viitanen
465bc2cfee
[EMT] make functions static and prefix arrays with kvz_g
2018-10-18 10:54:33 +03:00
Marko Viitanen
169febd1c4
[EMT] Simplify DCT8, DCT5, DST1 and DST7 definitions
2018-10-17 12:17:54 +03:00
Marko Viitanen
e015d7eb2b
Fix compiler warnings
2018-10-17 10:43:11 +03:00
Marko Viitanen
ad310c77d3
Added EMT transforms to the strategies
2018-10-17 08:56:49 +03:00
Reima Hyvönen
381e786e10
Trying to find the bug in luma
2018-10-11 18:08:41 +03:00
Reima Hyvönen
2f5f81bac3
removed the non-optimated bipred function
2018-10-09 11:19:23 +03:00
Reima Hyvönen
212a8e68fa
Modified to avoid memory overflow, still some bug inside luma
2018-10-02 20:23:32 +03:00
Marko Viitanen
389aeebe07
Added 2x2 transform functions
2018-09-13 14:51:07 +03:00
Marko Viitanen
445c059b4a
Fix transforms for VTM 2.0, generated new transform matrices and added a shift by 2 for forward and inverse
2018-09-13 14:39:49 +03:00
Marko Viitanen
382917bcd3
New table for choosing angular intra filtered references and a small bugfix on the end condition of angular intra
2018-09-13 09:35:55 +03:00
Marko Viitanen
d4ed0ee3ad
Fixed some array offsets in intra angular prediction
2018-09-12 08:53:17 +03:00
Sami Ahovainio
787264f568
Fixed dst indexing in kvz_angular_pred_generic
2018-08-31 10:36:28 +03:00
Sami Ahovainio
d2291fea83
Intra mode scaling moved from angular prediction to kvz_intra_predict. pdpc implemented in kvz_intra_predict.
2018-08-31 10:01:28 +03:00
Sami Ahovainio
54ebadfc43
Clarifying comments and changes towards WAIP
2018-08-29 16:00:08 +03:00
Reima Hyvönen
896034b7cf
Some renamed functions back
2018-08-28 15:31:10 +03:00
Reima Hyvönen
e8b5e6db4c
Did some merging
2018-08-28 15:26:27 +03:00
Reima Hyvönen
7de5c74434
Updated bipred_recon to work faster
2018-08-28 15:12:31 +03:00
Reima Hyvönen
47b357cca2
Comment one test
2018-08-27 18:52:14 +03:00
Reima Hyvönen
2ca99a44e8
Updated shuffle operation to be in right order
2018-08-27 18:16:38 +03:00
Sami Ahovainio
42741a2c40
Some changes for PCM and Intra towards VTM 2.0 compatibility.
2018-08-27 09:18:15 +03:00
Marko Viitanen
4f7da86285
Commented out sign hiding code, which is not used in VVC
2018-08-17 09:38:11 +03:00
Marko Viitanen
c9cbdd5dc3
Added couple of ToDo comments for large CTU support
2018-08-17 09:37:14 +03:00
Marko Viitanen
daf041406f
Disable DST
2018-08-16 16:05:32 +03:00
Reima Hyvönen
508b218a12
some modifications made to prevent reading too much
2018-08-14 10:50:39 +03:00
Reima Hyvönen
1d935ee888
some useless stuff removed
2018-08-13 16:47:11 +03:00
Reima Hyvönen
ce3ac4c05e
some modifications to no_mov
2018-08-13 16:41:02 +03:00
Reima Hyvönen
15a613ae94
test if no_mov breaks testing
2018-08-13 16:02:56 +03:00
Reima Hyvönen
97a2049e58
removed pointer declaration out from switch
2018-08-10 16:42:26 +03:00
Reima Hyvönen
aa94bcedbc
Stream is now pointer
2018-08-10 16:38:49 +03:00
Reima Hyvönen
fa5b227ece
256 to 32 doesn't work, made them by hand
2018-08-10 16:01:20 +03:00
Reima Hyvönen
408dedbcc8
removed _mm256_extract_epi8 and replaced with _mm_stream
2018-08-10 15:53:26 +03:00
Reima Hyvönen
31c35091c6
_mm256_cvtsi256_si32 removed
2018-08-10 10:06:40 +03:00
Reima Hyvönen
99dc43074f
_mm256_cvtsi256_si32 breaks system, too much bits. back to extract
2018-08-10 09:59:33 +03:00
Reima Hyvönen
4f1f80b2cb
Transformed convert from 256 to cast 256 -> 128 and then convert from 128
2018-08-09 15:35:54 +03:00
Reima Hyvönen
4957555eb3
Removed leftover from 939
2018-08-09 15:25:03 +03:00
Reima Hyvönen
28b165c971
Clearified some sections, added _MM_SHUFFLE macro
2018-08-09 15:23:01 +03:00
Reima Hyvönen
dd04df8667
testing if error in both avx2 functions
2018-08-03 11:49:00 +03:00
Reima Hyvönen
ed50d71fde
Switched some variables to different location, altered inter_recon_bipred_avx2 function
2018-08-02 16:08:59 +03:00
Reima Hyvönen
f5739a0028
Renaming and removing useless prints
2018-08-02 14:47:17 +03:00
Reima Hyvönen
bc09f59bb6
Edited some definitions
2018-08-02 11:54:53 +03:00
Reima Hyvönen
a4bf77f208
Tested some extract functions
2018-07-12 09:29:32 +03:00
Reima Hyvönen
c05033a893
Even more useless vectors removed
2018-07-11 15:09:14 +03:00
Reima Hyvönen
884cb77238
Removed some not used vectors
2018-07-11 15:06:11 +03:00
Reima Hyvönen
792689a5ff
Removed for-loops, added extract instead
2018-07-11 14:56:41 +03:00
Reima Hyvönen
f9c7f6ee66
Added some break-operations for avx2 optimation
2018-07-11 14:15:38 +03:00
Reima Hyvönen
cc064da143
some more optimation for bipred
2018-07-11 11:27:54 +03:00
Reima Hyvönen
a22cf03ddb
Updated to have no movement function to avx2 strategies
2018-07-10 16:07:15 +03:00
Reima Hyvönen
ea83ae45f0
Toimiva ratkaisu
2018-07-03 11:18:51 +03:00
Reima Hyvönen
17babfffa4
25.6 working optimation, ~50% faster than original
2018-06-25 17:06:16 +03:00
Reima Hyvönen
9fed29f950
optimation for inter_recon_bipred
2018-04-18 15:25:44 +03:00
Arttu Ylä-Outinen
0a69e6d18f
Fix selection of transform function for 4x4 blocks
...
DST function was returned for inter luma transform blocks of size 4x4
even though they must use DCT. Fixed by checking the prediction mode of
the block in addition to whether it is chroma or luma.
2018-01-18 10:36:25 +02:00
Arttu Ylä-Outinen
9694bd2fae
Fix build on 32-bit systems
...
Function coeff_abs_sum_avx2 that was added in e950c9b
was outside the
AVX2 #if directive.
2017-07-28 09:19:29 +03:00
Arttu Ylä-Outinen
e950c9b101
Add AVX2 implementation for coefficient sum
2017-07-28 07:39:36 +03:00
Arttu Ylä-Outinen
d50ae6990c
Add sum of absolute coefficients to strategies
2017-07-28 07:39:15 +03:00
Arttu Ylä-Outinen
fdb3480b54
Enable strategies for SAO reconstruction
...
Re-enables strategies for SAO reconstruction. They were disabled in
commit ec9ff42
.
2017-07-11 10:35:18 +03:00
Arttu Ylä-Outinen
333dba3884
Add static to SAO strategies
2017-07-11 10:02:01 +03:00
Arttu Ylä-Outinen
563bc26e71
Fix out-of-bounds read in AVX2 SAO
...
AVX2 version of SAO loaded offsets with a 256 bit read even though there
are only five 32 bit integers.
2017-07-06 13:04:52 +03:00
Arttu Ylä-Outinen
2c66e0bbd2
Fix warnings about invalid reads in AVX2 ipol
...
AVX2 filter functions read pixels in chunks of 8 or 16 bytes. At the end
of the block, the read goes out of the bounds of the pixels array. The
extra pixels do not affect the result.
Fixes valgrind complaining about the invalid reads by allocating 5 extra
pixels in kvz_get_extended_block_avx2
2017-06-22 09:37:55 +03:00
Arttu Ylä-Outinen
95775a1645
Change coefficient storage order
...
Changes coefficient storage order to a zig-zag order. Reduces
unnecessary copying of coefficients to temporary arrays.
2017-05-12 16:46:57 +03:00
Arttu Ylä-Outinen
51786eda67
Drop redundant fields in encoder_control_t
...
Some of the fields in encoder_control_t were simply copies of the
corresponding fields in kvz_config. This commit drops the copied fields
in favor of using the fields in encoder_control_t.cfg directly.
2017-02-09 14:05:28 +09:00
Arttu Ylä-Outinen
e78a8dfcf5
Copy the kvz_config passed to encoder_open
...
The kvz_config struct is created by the user but kvazaar keeps a pointer
to it. It is easy to break things by modifying the configuration outside
kvazaar. In addition, kvazaar modifies the struct even though it is has
a const modifier.
This commit changes the field cfg in encoder_control_t to be a copy of
the kvz_config struct instead of a pointer, removing modifications to
the const struct and allowing users to do whatever they want with it
after opening the encoder.
2017-02-09 13:23:54 +09:00
Arttu Ylä-Outinen
640ff94ecd
Use separate lambda and QP for each LCU
...
Adds fields lambda, lambda_sqrt and qp to encoder_state_t. Drops field
cur_lambda_cost_sqrt from encoder_state_config_frame_t and renames
cur_lambda_cost to lambda.
2017-01-09 01:24:23 +09:00
Ari Lemmetti
70a52f0e48
10-bit: add missing bit depth adjustment to ssd
2016-11-17 19:28:04 +02:00
Ari Lemmetti
29153ed503
Remove unused variable
2016-10-21 17:28:42 +03:00
Ari Lemmetti
778e46dfd8
Add AVX2 version of SSD
2016-10-21 15:07:53 +03:00
Ari Lemmetti
6f5d7c9e06
Move SSD to strategies
2016-10-21 15:07:23 +03:00
Ari Lemmetti
89b941eab4
Fix typo
2016-10-21 15:07:02 +03:00
Ari Koivula
cbfa824d1a
Merge branch 'simd'
2016-09-27 20:49:45 +03:00
Ari Koivula
14a7bcba25
Use a faster function for clipped inter SAD
...
Use the vectorized general SSE41 inter SAD in AVX reg_sad for shapes
for which we don't have AVX versions yet.
Also improves speed of --smp and --amp a lot. Got a 1.25x speedup for:
--preset=ultrafast -q 27 --gop=lp-g4d3r3t1 --me-early-termination=on --rd=1 --pu-depth-inter=1-3 --smp --amp
* Suite speed_tests:
-PASS inter_sad: 0.898M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec)
+PASS inter_sad: 2.503M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec)
-PASS inter_sad: 115.054M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec)
+PASS inter_sad: 133.577M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec)
2016-09-27 20:48:30 +03:00
Eemeli Kallio
f41e428e5f
Removed kvz_skip_unnecessary_rdoq and reworked --rdoq-skip to skip 4x4 blocks when it is on.
2016-09-09 10:26:07 +03:00
Ari Koivula
02cd17b427
Add faster AVX inter SAD for 32x32 and 64x64
...
Add implementations for these functions that process the image line by
line instead of using the 16x16 function to process block by block.
The 32x32 is around 30% faster, and 64x64 is around 15% faster,
on Haswell.
PASS inter_sad: 28.744M x reg_sad(32x32):x86_asm_avx (1014 ticks, 1.014 sec)
PASS inter_sad: 7.882M x reg_sad(64x64):x86_asm_avx (1014 ticks, 1.014 sec)
to
PASS inter_sad: 37.828M x reg_sad(32x32):x86_asm_avx (1014 ticks, 1.014 sec)
PASS inter_sad: 9.081M x reg_sad(64x64):x86_asm_avx (1014 ticks, 1.014 sec)
2016-09-01 21:36:39 +03:00
Ari Lemmetti
28c4174d0e
Fix incorrect shuffle parameters
...
_MM_SHUFFLE uses reverse order
2016-08-23 19:40:46 +03:00
Ari Lemmetti
ce77bfa15b
Replace KVZ_PERMUTE with _MM_SHUFFLE
...
The same exact macro already exists
2016-08-22 19:08:46 +03:00
Eemeli Kallio
99d8b9abeb
Changed skip_rdoq name to kvz_skip_unnecessary_rdoq. Changed the order it uses when it goes through CGs and tuned its sum calculation.
2016-08-18 14:02:56 +03:00
Eemeli Kallio
1fb4755f31
Added rdoq-skip to quant-generic.c
2016-08-18 12:17:54 +03:00
Eemeli Kallio
d20ac03ca2
Added --rdoq-skip option
2016-08-18 12:17:53 +03:00
Arttu Ylä-Outinen
2a946bd88e
Rename encoder_state_t.global to frame
...
"Frame" is more accurate than "global" since when OWF is used, encoder
states for each frame have their own struct.
2016-08-10 13:22:36 +09:00
Arttu Ylä-Outinen
5fbb0a8c27
Fix includes
2016-08-10 13:05:40 +09:00
Ari Lemmetti
6bcba004ff
Comment out to fix unused code error on clang.
2016-07-14 14:12:16 +03:00
Ari Lemmetti
c0979ebdcb
Implement AVX2 luma sampling
2016-07-14 12:53:02 +03:00
Ari Lemmetti
6244560426
Add avx2 strategy for kvz_filter_frac_blocks_luma.
2016-07-14 12:53:02 +03:00
Ari Lemmetti
9c4e9e049b
Load only what is needed. Eliminate latency from hadds.
2016-07-14 12:53:01 +03:00
Ari Lemmetti
fccfbd2f28
Add strategy for kvz_filter_frac_blocks_luma
2016-07-14 12:51:02 +03:00
Ari Lemmetti
2b0c8db349
Add quad satd for avx2
2016-07-14 12:50:24 +03:00
Ari Lemmetti
0ff69fd6f8
Add any size multi satd
2016-07-14 12:48:37 +03:00
Arttu Ylä-Outinen
bf26661782
Add support for 4x4 blocks to SATD_ANY_SIZE.
...
Makes functions satd_any_size_generic and satd_any_size_8bit_avx2 work
on blocks whose width and/or height are not multiples of 8.
2016-06-16 18:53:17 +09:00
Ari Lemmetti
3107a93eaf
Fix avx2 chroma sampling for amp
2016-05-17 14:09:57 +03:00
Ari Lemmetti
efbdc5dade
Utilize registers more efficiently for 8x8 and larger blocks
2016-04-21 13:26:38 +03:00
Ari Lemmetti
192cee95b2
Vectorize vertical filtering
2016-04-21 13:26:38 +03:00
Ari Lemmetti
0be35f72b8
Filter 4 pixels simultaneously in x direction
2016-04-21 13:26:38 +03:00
Ari Lemmetti
10484bda9f
Make strategies out of fractional pixel sample functions
2016-04-21 13:26:38 +03:00
Ari Lemmetti
8247faf8e0
Remove 64-bit only instruction to fix 32-bit compilation.
2016-04-19 18:05:11 +03:00
Ari Lemmetti
eb55d6b6b9
Fix writing over boundary.
2016-04-19 16:03:43 +03:00
Ari Lemmetti
bcabc6fadd
Remove pixel blit from strategies. Use memcpy instead.
2016-04-06 18:44:04 +03:00
Ari Koivula
61fc3e87ba
Run include-what-you-use fix_includes.py fix_includes.py
...
The includes should make more sense now and not just happen to compile
due to headers included from other headers.
Used a modified version of IWYU. Modifications were to attribute int8_t
and so on to stdint.h instead of sys/types.h and immintrin.h instead of
more specific headers.
include-what-you-use 0.7 (git:b70df35)
based on clang version 3.9.0 (trunk 264728)
2016-04-01 17:46:55 +03:00
Ari Koivula
8908d85d66
Change all relative includes to absolute
2016-04-01 17:46:44 +03:00
Ari Koivula
4876879b82
Add IWYU pragmas
2016-03-31 12:33:34 +03:00
Ari Koivula
5b66578f71
Add kvz_ prefix to md5 functions
...
The non kvz_ symbols were being exported in the static lib, which got caught
by Travis tests.
2016-03-18 13:13:35 +02:00
Ari Koivula
4125218cfa
Add --hash=md5
...
Add md5 through extras/libmd5 taken from HM with BSD license. It's
implemented as a generic strategy using the same interface as checksum,
so we can write a SIMD version if it seems necessary.
2016-03-18 05:23:57 +02:00
Ari Lemmetti
e502292ba8
Remove old function
2016-03-16 20:18:55 +02:00
Ari Lemmetti
c6cc96f5ec
Optimize sao band ddistortion
2016-03-16 20:16:00 +02:00
Ari Lemmetti
ab577f476f
Optimize sao reconstruct color
2016-03-16 20:15:32 +02:00
Ari Lemmetti
48bfddf4ec
Optimize calc sao edge dir
2016-03-16 20:14:50 +02:00
Ari Lemmetti
ba69992941
Optimize sao edge ddistortion
2016-03-16 20:14:19 +02:00
Ari Lemmetti
941b6b3e27
Optimize calc eo cat
2016-03-16 20:13:30 +02:00
Ari Lemmetti
04fbb48a09
Add strategy for avx2. Copy generic functions there.
2016-03-16 20:13:15 +02:00
Ari Lemmetti
4e30a215d8
Create generic strategy for sao.
2016-03-16 20:11:15 +02:00
Ari Lemmetti
99e37ec235
Update old pixel type to the current one
2016-01-30 19:33:09 +02:00
Ari Koivula
fa1af14637
Fix includes to include global.h first everywhere
2016-01-22 15:07:49 +02:00
Ari Lemmetti
44656aeb19
Remove useless calculation
2016-01-19 16:35:16 +02:00
Ari Lemmetti
a2fc9920e6
Merge branch 'alternative-satd'
2016-01-13 15:00:43 +02:00
Ari Lemmetti
1ed34f2df8
Add some planar pred optimization for blocks larger than 8x8
2016-01-13 14:50:17 +02:00
Ari Lemmetti
0df88697ff
Copy generic function to AVX2 strategy
2016-01-12 23:51:18 +02:00
Ari Lemmetti
62799a9fc3
Create generic strategy of planar prediction
2016-01-12 23:50:47 +02:00
Ari Lemmetti
3cb1cebfe5
Add missing inlines
2016-01-12 23:03:31 +02:00
Ari Lemmetti
6a0b13b8b6
Remove unused functions
2016-01-12 22:55:37 +02:00
Ari Lemmetti
61155f0edd
Add 128-bit version of the functions as well
2016-01-12 22:52:00 +02:00
Ari Lemmetti
a6afb8a8f4
Small refactoring
2016-01-12 22:29:33 +02:00
Ari Lemmetti
a756f6133a
Manually unroll vertical Hadamard transform
2016-01-12 21:45:02 +02:00
Ari Lemmetti
66350aa20e
Experiment with alternative implementation of FWHT
2016-01-11 16:25:56 +02:00
Ari Koivula
947bae24f9
Update Doxygen documentation
...
Add module information to all header files.
Update all header file documentations to briefly say what they are, and
to use the javadoc format so the brief actually gets included into the
doxygen documentation.
Remove \file from implementation files, in order to not repeat the info
from the header files.
Add files under strategies and tools to Doxygen and update the Doxygen
settings to be just plain better.
Make README be the main page of Doxygen documentation.
2015-12-17 14:05:50 +02:00
Arttu Ylä-Outinen
864c77f6eb
Use kvz_satd_any_size in inter search.
...
Changes search_frac and kvz_search_cu_iter to use kvz_satd_any_size for
computing the SATDs instead of getting the SATD function with
kvz_pixels_get_satd_func.
2015-12-15 11:21:45 +02:00
Arttu Ylä-Outinen
056fa09ba5
Add arbitrary-sized SATD functions.
...
Adds strategy satd_any_size for generic and AVX2. The satd_any_size
functions are implemented with macro SATD_ANY_SIZE defined in
strategies-picture.h.
2015-12-15 11:21:45 +02:00
Arttu Ylä-Outinen
728a6abecc
Extract macro SATD_NxN.
...
Combines definitions of macros SATD_NXN and SATD_NXN_AVX2 to macro
SATD_NxN and moves it to strategies-picture.h.
2015-12-15 11:21:44 +02:00
Arttu Ylä-Outinen
4402e251ae
Fix kvz_get_extended_block functions.
...
The buffers allocated in functions kvz_get_extended_block_avx2 and
kvz_get_extended_block_generic were too small when the width of the
block was less than its height. Fixed to allocate correctly sized
buffers.
2015-12-15 11:21:43 +02:00
Ari Lemmetti
b78460b02c
Optimize another loop
2015-12-11 11:21:43 +02:00
Ari Lemmetti
ee8c2d0218
Add 4x4 dual SATD for AVX2
2015-12-03 17:13:11 +02:00
Ari Lemmetti
00736fa708
Generate larger than 8x8 dual satd functions with macro
2015-12-03 17:13:11 +02:00
Ari Lemmetti
bd3e1922cd
Add AVX2 8x8 dual hadamard transform
2015-12-03 17:13:11 +02:00
Ari Lemmetti
d575b94357
Implement generic functions for dual sad / satd
2015-12-03 17:13:11 +02:00
Ari Lemmetti
183ee53f47
Add alternative version of rough intra search.
...
Calculate two costs simultaneously to exploit larger SIMD registers.
Implementation for dual functions missing currently.
2015-12-03 17:12:38 +02:00
Arttu Ylä-Outinen
940ada4c0d
Mark AVX2 intra filter functions as static.
...
Marks functions filter_4x4_avx2, filter_16x16_avx2 and filter_NxN_avx2
static as they are not used outside strategies/avx2/intra-avx2.
2015-11-09 12:48:20 +02:00
Ari Lemmetti
fbd0596114
Merge branch 'avx2-pixels-blit'
2015-11-04 11:06:10 +02:00
Ari Lemmetti
57ea7d223b
Pass SIMD registers to functions as pointers to fix 32-bit compilation in visual studio
2015-11-04 10:51:26 +02:00
Ari Lemmetti
a3855652e9
Add AVX2 version with separate handling of basic blocks and strideless copy.
2015-11-04 10:07:25 +02:00
Ari Lemmetti
0816fbea2c
Create generic strategy of blit function
2015-11-04 10:07:25 +02:00
Marko Viitanen
821d5c478b
Added missing parameter to kvz_strategy_register_picture_generic()
2015-11-02 08:55:54 +02:00
Ari Lemmetti
d71f1b5bd0
Disable incompatible optimizations for 32-bit version
2015-10-24 15:32:27 +03:00
Ari Lemmetti
df995d85e8
Utilize AVX2 for dequantization.
2015-10-23 20:17:08 +03:00
Ari Lemmetti
cf347e33c4
Move dequant to strategies. Copy generic to AVX2 as well.
2015-10-23 19:53:50 +03:00
Ari Lemmetti
47082738aa
...and the same tricks for quantized reconstruction
2015-10-23 19:44:38 +03:00
Ari Lemmetti
7961ba80d8
Add functions for bigger block sizes to calculate more residual simultaneously and reduce memory accesses
2015-10-23 19:11:56 +03:00
Ari Lemmetti
15edd5060d
Load and store multiple elements simultaneously. Use 128-bit wide zero
...
test. *wip*
2015-10-23 17:03:16 +03:00
Ari Lemmetti
b37cca87c8
Copy generic to avx2
2015-10-23 17:03:15 +03:00
Ari Lemmetti
cad2ea9d6e
Move quantize_residual to quant strategies.
2015-10-23 17:03:15 +03:00
Ari Lemmetti
0c63041ba7
Add filtering functions for different block sizes. Simplify logic a bit to reduce branching. Sorry for the large commit!
2015-10-23 16:54:15 +03:00
Ari Lemmetti
5af7a42ebe
Enable AVX2 strategy. Add first version of optimizations.
2015-10-08 12:36:20 +03:00
Ari Lemmetti
f4fe3dca5e
Add AVX2 strategy. Copy generic implementation there.
2015-10-08 12:36:15 +03:00
Ari Lemmetti
54e8b346a3
Add intra strategy. Move angular prediction there.
2015-10-08 12:36:05 +03:00
Ari Lemmetti
38106afa50
Add AVX2 version of quantization.
2015-10-02 16:18:52 +03:00
Ari Lemmetti
ef0ad292ef
Add quantization strategy.
2015-10-02 16:17:02 +03:00
Ari Lemmetti
989cee1b04
Add 4x4 function as well
2015-10-01 22:14:56 +03:00
Ari Lemmetti
8b57b2bb1a
Refactor SATD to inline most of the function. Replace full horizontal add with shuffle and regular packed add.
2015-10-01 21:29:25 +03:00
Ari Lemmetti
55da2a9958
Add intrinsic version of SATD for 8x8 and larger blocks
2015-10-01 19:42:22 +03:00
Ari Lemmetti
d68fc4c41e
Add header for common utilities to use with strategies.
2015-10-01 19:40:35 +03:00
Ari Koivula
9a23ae3d92
Resolve remaining Visual Studio warnings.
...
- Ignore most of them and fix the ones that can't be ignored.
2015-08-31 15:02:25 +03:00
Arttu Ylä-Outinen
3a10e9e3e0
Prefix all non-static symbols with "kvz_".
2015-08-26 13:02:28 +03:00
Arttu Ylä-Outinen
bfe2b31cee
Make generic satd functions static.
2015-08-26 12:10:27 +03:00
Ari Lemmetti
923f4a74d5
Fix filtering over limits
2015-08-17 17:39:56 +03:00
Ari Lemmetti
82cf4e8ff4
Output error messages to stderr
2015-08-17 15:01:46 +03:00
Ari Lemmetti
3da71b62bf
Add checks if malloc fails
2015-08-17 15:01:46 +03:00
Ari Lemmetti
4718fe7fda
Change variable names to match used convention
2015-08-17 15:01:46 +03:00
Ari Lemmetti
6a5eaf08de
Rename extend_borders to get_extended_block. Add kvz_ prefix to type definition.
2015-08-17 15:01:46 +03:00
Ari Lemmetti
d82582c37c
Changes to extend border function.
...
Now outputs a pointer to a block with guaranteed padding for filtering.
Only generate extra pixels if samples are needed out of bounds.
Use memcpy otherwise.
2015-08-17 15:01:46 +03:00
Ari Lemmetti
5d96dbc6c0
Make strategy selection use bit depth given via parameter instead of excluding registration with defines
2015-08-12 13:33:38 +03:00
Ari Lemmetti
4122f36089
Prevent the registration of strategies that are incompatible when KVZ_BIT_DEPTH != 8
...
Remove unnecessary or misleading mentions of "8bit"
2015-08-12 11:29:53 +03:00
Ari Lemmetti
348d7780fc
Remove third shift and offset from 14-bit sampling functions (change missing from rebase)
2015-08-11 15:06:16 +03:00
Marko Viitanen
8409317bd9
Fixed rebasing errors for 10bit branch
2015-08-11 14:56:45 +03:00
Marko Viitanen
6453a511d7
Scale SAD/SATD costs to match bit depth
...
Conflicts:
src/image.c
2015-08-11 08:18:14 +03:00
Marko Viitanen
0304b6c412
Fixed luma interpolation filter when 10bit coding and some other minor fixes
2015-08-11 08:17:48 +03:00
Marko Viitanen
450b5e64ca
Fixed overflow on generic ipol filters when 10bit encoding
...
Conflicts:
src/strategies/generic/ipol-generic.c
2015-08-11 08:17:48 +03:00
Marko Viitanen
414ebe6101
Fixed checksum on bitdepth > 8 cases
...
Conflicts:
src/nal.c
src/nal.h
src/strategies/generic/nal-generic.c
src/strategies/strategies-nal.c
src/strategies/strategies-nal.h
2015-08-11 08:14:35 +03:00
Marko Viitanen
57ab46f110
Small fixes all around to enable 10bit encoding
...
Conflicts:
src/encmain.c
src/encoder.c
src/encoderstate.c
src/global.h
2015-08-11 07:59:20 +03:00
Ari Lemmetti
5887c96991
Add and use 14bit reconstruction for fractional motion vectors with bipred
2015-08-10 18:45:29 +03:00
Ari Lemmetti
8b4a6c92da
Add 14bit precision sample functions.
2015-08-10 18:02:06 +03:00
Ari Lemmetti
b30f17d4b8
Add fractional pixel sampling for chroma
2015-08-10 17:55:37 +03:00
Ari Lemmetti
01f40ec104
Add fractional pixel sampling for luma
2015-08-10 17:51:48 +03:00
Ari Koivula
0c3c93d456
Optimize intra SAD intrinsics.
...
- Added 64x64 version for completeness.
- With the exception of 16x16, these were all slightly slower than the ASM
versions, as measured by "kvazaar_test -s speed -t intra_sad", but now they
are on par or slightly faster.
- None of these actually use any AVX2 intrinsics, and probably never will,
unless someone adds an interface for doing more than one block at a time,
in which case the non-destructive versions might come in handy.
2015-08-06 19:35:00 +03:00
Arttu Ylä-Outinen
f7f17a060c
Rename pixel_t to kvz_pixel.
2015-07-02 16:58:28 +03:00
Arttu Ylä-Outinen
fab07d80da
Rename macro BIT_DEPTH to KVZ_BIT_DEPTH.
2015-07-02 16:55:47 +03:00
Marko Viitanen
8ed5d06ebe
Fixed compiler warnings caused by the bipred branch merge
2015-04-23 15:12:48 +03:00
Ari Lemmetti
b9ec4b0a54
AVX2 acceleration for new luma filtering.
2015-03-11 15:33:38 +02:00
Ari Lemmetti
39eceec38d
Rewrite of luma fractional pixel filtering. Utilizes intermediate values instead of calculating everything again.
2015-03-06 17:58:22 +02:00
Ari Koivula
ded6fd9ee8
Renamed typedef pixel to pixel_t.
2015-03-04 16:35:53 +02:00
Ari Koivula
f6147b410a
Rename struct encoder_control to encoder_control_t.
...
Conflicts:
src/encoder_state-geometry.h
src/encoderstate.h
2015-03-04 14:01:14 +02:00
Ari Koivula
d7383ccb25
Change license to LGPL.
...
- Everyone who has contributed code to the project has been asked to license
their contributions under LPGL and they have agreed.
- COPYING file changed to say LGPLv2.1 instead of GPLv2.
- GPL changed to LGPL in the header of every single file that a header and
header added to the few that were missing one.
- Also.. Happy new year!
2015-02-25 15:19:05 +02:00
Ari Lemmetti
7430622038
Copy ipol-generic strategy as a base for avx2 strategy
2015-02-05 13:28:07 +02:00
Ari Lemmetti
8495870df8
Using BIT_DEPTH macro because it is constant
2015-02-05 13:19:54 +02:00
Ari Lemmetti
c82adae0c4
Use four tap functions in octpel chroma interpolation
2015-02-04 18:23:57 +02:00
Ari Lemmetti
2f11caeb73
Added generic four tap functions. Use them in halfpel chroma interpolation.
2015-02-04 17:50:12 +02:00
Ari Lemmetti
041d970ece
Apply fast clipping also to chroma filtering.
2015-01-29 16:19:04 +02:00
Ari Lemmetti
c21351cc12
Added fast clipping function for clamping values to bit depth.
2015-01-21 17:53:06 +02:00
Ari Lemmetti
f037ed580c
Improved data layout
2015-01-15 16:31:18 +02:00
Ari Lemmetti
465f718eeb
Move value clipping away from separate loop
2015-01-15 16:14:00 +02:00
Ari Lemmetti
9d12ce21d5
Cleaned luma interpolation, added functions for 8-tap filtering.
2015-01-15 16:13:12 +02:00
Ari Lemmetti
0e56d13b5d
Use smaller bit depth for fractional pixel interpolation
2015-01-15 15:00:09 +02:00
Ari Lemmetti
cc061b4c3d
Added ipol strategy for interpolation filters.
...
Added initial files for AVX2 and generic strategies.
2015-01-15 14:59:37 +02:00
Ari Koivula
fcb6fa6d4b
Fix compilation error on PowerPC.
...
- Need abs from stdlib.
2014-10-21 18:14:32 +03:00
Ari Koivula
55ab08c213
Fix incorrect const qualifiers.
...
- Change input pointers to const in dct-generic, like they should have been.
- Fixes compilation error on GCC.
2014-10-13 16:57:15 +03:00
Ari Koivula
8a5b24bcbe
Remove usages of GCC __attribute__.
...
- To allow clang to compile, as it doesn't according to #58 .
- The target attributes are not needed anymore due to makefile handling
targetting now.
- The __attribute__((unused)) used for debugging. I don't know if clang
supports this attribute or not but it doesn't seem very important so
I'm removing it just in case.
2014-10-13 16:46:26 +03:00
Ari Koivula
d893a489d6
Fix mingw compilation issue.
...
strategies/avx2/dct-avx2.c:334:25: error: pasting "g_dct_16" and "[" does
not give a valid preprocessing token
- The [ is not part of the token so compilation failed on mingw GCC 4.9.1.
- Fixes #86 .
2014-10-10 16:32:39 +03:00
Ari Lemmetti
bcf12567d0
Added some comments.
2014-10-03 17:51:58 +03:00
Ari Lemmetti
fea517c2ae
Misc code cleanup
2014-10-03 17:06:09 +03:00
Ari Lemmetti
85682c3b6a
Removed unused transpose functions.
2014-10-03 11:39:31 +03:00
Ari Koivula
f6272f06fc
Unify signature for transform functions.
...
- Some used block, coeff and some src, dst. Now all signatures are const input
and non-const output.
2014-10-03 11:21:43 +03:00
Ari Koivula
b932cf4b21
Clean up avx2 dct macros.
2014-10-03 11:16:25 +03:00
Ari Koivula
47244a15c3
Merge branch 'dct-optimizations'
...
Conflicts:
src/strategies/avx2/dct-avx2.c
src/strategies/generic/dct-generic.c
2014-10-02 13:45:21 +03:00
Ari Lemmetti
61e1510480
Transform functions in dct-avx2.c are now generated with macros.
2014-10-02 13:24:30 +03:00
Ari Lemmetti
9407610555
Moved DCT / DST matrices to dct-generic.c
2014-10-02 13:24:30 +03:00
Ari Lemmetti
7255112bd8
Added transposed DCT/DST tables. Use them while calculating transforms instead of doing runtime transpose. Added separate functions for DST and IDST.
2014-10-02 13:24:30 +03:00
Ari Lemmetti
e7bcb58846
Added 32x32 IDCT
2014-10-02 13:24:30 +03:00
Ari Lemmetti
eacf173b7e
Added 32x32 DCT for AVX2
2014-10-02 13:24:30 +03:00
Ari Lemmetti
d2856a5d40
Added 32x32 transpose
2014-10-02 13:24:30 +03:00
Ari Lemmetti
7a33f08312
Added 16x16 DCT and IDCT for AVX2
2014-10-02 13:24:30 +03:00
Ari Lemmetti
d2fe2a5391
Added 16x16 transpose
2014-10-02 13:24:30 +03:00
Ari Lemmetti
d6af146a2e
Added part of the functions 16x16 DCT needs
2014-10-02 13:24:30 +03:00
Ari Lemmetti
aba3acdfff
Added AVX2 optimized transforms for 4x4 and 8x8 blocks
2014-10-02 13:24:30 +03:00
Ari Lemmetti
5856f32d81
Fixed incorrect shift values for inverse transforms in generic strategy
2014-10-02 13:24:29 +03:00
Ari Lemmetti
41b032664d
First version of 4x4 forward DCT
2014-10-02 13:24:29 +03:00
Laurent Fasnacht
f1b303a2d2
Fix compilation errors
2014-08-11 09:53:06 +02:00
Ari Lemmetti
47e3bcfb50
Fixed incorrect shift values for inverse transforms in generic strategy
2014-08-07 16:01:30 +03:00
Ari Lemmetti
709520a233
Removed all AVX2 instructions from SATD functions.
...
-Zero extend macro now returns results in 2 xmm registers instead of one ymm
2014-07-31 13:25:28 +03:00
Ari Lemmetti
0beb278f5b
Partial butterfly strategy is now called DCT strategy. Made changes to transform functions in preparation for optimizations.
...
-Moved fast_forward_dst and fast_inverse_dst to DCT strategies
2014-07-31 13:25:28 +03:00
Ari Lemmetti
6bf63bd171
Added AVX2 strategy for partial butterfly (no optimizations yet)
2014-07-31 13:25:28 +03:00
Ari Lemmetti
faccc4f09b
Partial butterfly functions now utilize the strategy selector
2014-07-31 13:25:28 +03:00
Ari Koivula
669e99dd7f
Improve intra SAD AVX2 intrinsics.
...
- Moved implementations for different sizes to inline functions that are
defined using each other, reducing the amount of redundant code.
- Performance of sad_8bit_32x32_avx2 improved by about 10% due to unrolling of
the loop.
2014-07-25 15:59:55 +03:00
Ari Koivula
e00102f0ca
Compile asm optimizations only if yasm is present.
2014-07-23 14:57:40 +03:00
Ari Lemmetti
85fb0784e4
Fixed intendentation and added some empty lines for readability
2014-07-23 12:32:27 +03:00
Ari Lemmetti
4f88ebce5a
Added comments and made visual studio not to compile x86inc.asm
2014-07-22 15:07:57 +03:00
Ari Koivula
cfd3636e08
Move some repetitive SATD asm into a macro.
...
Conflicts:
src/strategies/x86_avx/picture_x86.asm
2014-07-22 12:46:39 +03:00
Ari Lemmetti
c81639dd09
Removed old unused macro
2014-07-22 11:11:20 +03:00
Ari Lemmetti
cf0797cafd
Reordered and intended assembly code
2014-07-22 11:07:42 +03:00
Ari Lemmetti
fea44c8234
Renaming AVX/asm files
...
-Splitted SAD and SATD functions in separate files
2014-07-21 18:02:01 +03:00
Ari Lemmetti
a64df6f0d0
Merge branch 'asm'
...
Conflicts:
build/kvazaar_lib/kvazaar_lib.vcxproj.filters
src/Makefile
src/strategies/strategies-picture.c
2014-07-21 16:41:09 +03:00
Ari Lemmetti
1be2c3aae5
Preparing push to master and misc
...
-Removed unnecessary <math.h> headers
-Updated AVX/asm optimizations to match the new file hierarchy
-Makefile only compiles .asm files if KVAZAAR_DISABLE_YASM is not set to 1 and TARGET_CPU_ARCH is x86
2014-07-21 12:39:56 +03:00
Ari Koivula
a8f7103797
Add AVX2 implementations for sad_8bit_ 8x8, 16x16 and 32x32.
2014-07-18 18:27:30 +03:00
Ari Koivula
3daa5dd1f1
Add sse2 implementaton for sad_8bit_4x4.
2014-07-18 18:20:34 +03:00
Ari Koivula
f49332c9b8
Add missing includes.
2014-07-18 17:56:15 +03:00
Ari Lemmetti
1e94262f85
Made AVX asm compatible with the changed system
...
- x86inc.asm is now located in extras
- Removed unused cpu.asm/h
2014-07-14 18:51:17 +03:00
Ari Lemmetti
683eda1183
Merge branch 'master' into asm
...
Conflicts:
build/kvazaar_lib/kvazaar_lib.vcxproj
build/kvazaar_lib/kvazaar_lib.vcxproj.filters
src/Makefile
src/strategies/strategies-picture.c
2014-07-14 16:42:33 +03:00
Ari Lemmetti
2169f9ab8c
Added AVX asm comments and fixes
...
-Added vzeroupper to satd macro to prevent AVX-SSE transition penalties int picture_x86.asm
-Fixed the order of registers in zero extend macro in picture_x86.asm
-Fixed SATD checkers test pattern in satd_tests.c
2014-07-14 14:43:36 +03:00
Ari Koivula
5d0df56c94
Move optimizations to their own compilation units according to target.
...
- This is necessary in order to compile AVX intrinsics correctly in
Visual Studio. Having everything in their own units should also make
compiling normal C code with optimizations on easier.
- For now the makefile still relies on GCC __target__ attribute for compiling
intrinsics.
2014-07-11 17:26:19 +03:00
Ari Lemmetti
048127c7e3
AVX assembly optimizations improved
2014-07-02 16:57:06 +03:00
Ari Lemmetti
bdef5384ef
Added AVX strategy
2014-06-17 16:52:24 +03:00
Ari Koivula
1c97a10a6d
Move intra SAD and SATD functions under strategies.
2014-06-16 12:13:41 +03:00
Ari Koivula
1de102be61
Move strategies to their own compilation units.
...
- Enforces a little bit more hierarchy. Compilation units are in strategies
and whatever inline includes they have are in a folder with the same name
as the strategy.
2014-06-13 15:30:23 +03:00
Laurent Fasnacht
27a49d287d
Big refactor to use videoframe, image_list, and image instead of picture*
2014-06-10 09:19:06 +02:00
Laurent Fasnacht
ea04bcd6a4
AltiVec support for SAD
2014-06-05 06:57:34 +02:00
Laurent Fasnacht
5ed69b063b
Strategy selector for array_checksum, basic implementation using precomputed 256*256 block with larger accesses than byte
2014-06-03 07:42:22 +02:00
Ari Koivula
42295d3cb9
Pass preprocessor defines for supported intrinsics in VS2010 explicitly.
...
- _M_IX86_FP defines whether VS should generate code using SSE or SSE2
instructions. It isn't correct to use it to check whether optional runtime
optimizations should be compiled in. It's also not defined at all in 64-bit
mode.
- So let's just keep it simple and give a list of everything that is supported
as release optimizations. It's not clear from the documentation if all of
these are really supported. It just list a bunch of intrinsics from these
that are.
2014-04-30 17:41:15 +03:00
Ari Koivula
bd7e021742
Modify strategyselector to work with VS2010.
...
- VS doesn't have snprintf.
- VS doesn't support GCC attributes.
- Add defines for __SSE__ and __SSE2__ on VS.
2014-04-29 15:29:06 +03:00
Laurent Fasnacht
bf7e755cf7
Strategies and runtime detection/choice of best algorithm
2014-04-29 11:51:41 +02:00