Commit graph

2849 commits

Author SHA1 Message Date
Ari Lemmetti a1390ca3c0 Merge branch 'ssd-avx2' 2016-10-21 15:08:44 +03:00
Ari Lemmetti 778e46dfd8 Add AVX2 version of SSD 2016-10-21 15:07:53 +03:00
Ari Lemmetti 6f5d7c9e06 Move SSD to strategies 2016-10-21 15:07:23 +03:00
Ari Lemmetti 89b941eab4 Fix typo 2016-10-21 15:07:02 +03:00
Ari Koivula bfdd492c9f Merge pull request #141 from aballier/multilib
Include i386 & i486 for compiling intel asm.
2016-10-19 21:19:25 +03:00
Alexis Ballier 1dcc993743 Include i386 & i486 for compiling intel asm.
x86_64-pc-linux-gnu-gcc -m32, which I use for building 32-bit libraries on amd64, defines only __i386__.
2016-10-14 18:07:37 +02:00
Arttu Ylä-Outinen 8ae791a3e1 Fix building with crypto++
Depending on the distro, the pkg-config package name of crypto++ could
be either cryptopp or libcrypto++. This commit changes configure to
check for both instead of cryptopp only.
2016-10-10 15:13:20 +09:00
Arttu Ylä-Outinen e7cdd47745 Merge branch 'implicit-rdpcm' 2016-10-03 20:04:00 +09:00
Arttu Ylä-Outinen 5fb7afe8c4 Add --implicit-rdpcm command line parameter.
Makes it possible to use lossless coding without implicit residual DPCM.
2016-10-03 20:01:55 +09:00
Arttu Ylä-Outinen 5affc0f527 Use implicit RDPCM in lossless mode.
Sets implicit RDPCM flag in SPS when lossy coding is disabled and
applies DPCM to intra residual when prediction mode is horizontal or
vertical.
2016-10-03 19:31:38 +09:00
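The residual DPCM described in the commit above can be illustrated with a minimal Python sketch. This is not kvazaar's implementation: the function names `rdpcm_forward` and `rdpcm_inverse` and the list-of-rows block layout are assumptions made for illustration. For a horizontal prediction mode each residual is replaced by its difference from the left neighbour, and for a vertical mode by its difference from the neighbour above.

```python
def rdpcm_forward(residual, direction):
    """Apply residual DPCM to a 2-D block given as a list of rows."""
    height, width = len(residual), len(residual[0])
    out = [row[:] for row in residual]
    for y in range(height):
        for x in range(width):
            if direction == "horizontal" and x > 0:
                # Difference from the left neighbour.
                out[y][x] = residual[y][x] - residual[y][x - 1]
            elif direction == "vertical" and y > 0:
                # Difference from the neighbour above.
                out[y][x] = residual[y][x] - residual[y - 1][x]
    return out


def rdpcm_inverse(coded, direction):
    """Undo rdpcm_forward by accumulating along the same direction."""
    height, width = len(coded), len(coded[0])
    out = [row[:] for row in coded]
    for y in range(height):
        for x in range(width):
            if direction == "horizontal" and x > 0:
                out[y][x] = coded[y][x] + out[y][x - 1]
            elif direction == "vertical" and y > 0:
                out[y][x] = coded[y][x] + out[y - 1][x]
    return out


block = [[5, 6, 8], [5, 7, 7]]
coded = rdpcm_forward(block, "horizontal")
print(coded)  # [[5, 1, 2], [5, 2, 0]]
```

Because the inverse only accumulates integers, the round trip is exact, which is what makes the technique usable in lossless mode.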
Arttu Ylä-Outinen c418db660b Update preset table in README.md 2016-10-02 20:11:38 +09:00
Ari Koivula 23dc9a0ada Allow osx to fail on Travis 2016-09-29 17:39:28 +03:00
Ari Koivula 5f5fffb8b5 Merge branch 'new_presets'
Significant boost to either BDRate, speed or both for every preset.
2016-09-29 17:36:45 +03:00
Ari Koivula 016dbe0894 Further refine presets
The rd-complexity of slow presets is better with a less aggressive GOP.

Adding the GOP as part of the preset improved BDRate enough that it
no longer made sense to have veryslow target the best BDRate.
Instead, push that responsibility to placebo by making it a little bit
faster.
2016-09-29 17:35:12 +03:00
Ari Koivula 278cd4da9b Disable WPP in Travis tile tests
Now that WPP is on by default, Valgrind is finding memory leaks on
these tests. It's not a priority so I'll just disable it for now.

==8120== Memcheck, a memory error detector
==8120== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==8120== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==8120== Command: /home/travis/build/Venti-/kvazaar/src/.libs/lt-kvazaar -i mandelbrot_264x130.yuv --input-res=264x130 -o test.265 -p4 -r2 --owf=1 --threads=2 --tiles-height-split=u2 --rd=0 --no-rdoq --no-deblock --no-sao --no-signhide --subme=0 --pu-depth-inter=1-3 --pu-depth-intra=2-3
==8120==
Disabling TMVP because tiles are used.
Compiled: INTEL, flags: MMX SSE SSE2
Detected: INTEL, flags: MMX SSE SSE2 SSE3 SSSE3 SSE41 SSE42
Available: sse2(2) sse41(1)
In use: sse2(1) sse41(1)
Input: mandelbrot_264x130.yuv, output: test.265
  Video size: 264x136 (input=264x130)
==8120== Conditional jump or move depends on uninitialised value(s)
==8120==    at 0x4E5FEE5: kvz_threadqueue_job_dep_add (threadqueue.c:616)
==8120==    by 0x4E3DEAB: encoder_state_worker_encode_children (encoderstate.c:432)
==8120==    by 0x4E3E219: encoder_state_encode (encoderstate.c:649)
==8120==    by 0x4E3DE35: encoder_state_worker_encode_children (encoderstate.c:417)
==8120==    by 0x4E3E219: encoder_state_encode (encoderstate.c:649)
==8120==    by 0x4E3DE35: encoder_state_worker_encode_children (encoderstate.c:417)
==8120==    by 0x4E3E219: encoder_state_encode (encoderstate.c:649)
==8120==    by 0x4E3ECBD: kvz_encode_one_frame (encoderstate.c:941)
==8120==    by 0x4E4DA22: kvazaar_encode (kvazaar.c:229)
==8120==    by 0x4E4E228: kvazaar_field_encoding_adapter (kvazaar.c:280)
==8120==    by 0x40137F: main (encmain.c:436)
==8120==
lt-kvazaar: threadqueue.c:618: kvz_threadqueue_job_dep_add: Assertion `job && depends_on' failed.
==8120==
==8120== HEAP SUMMARY:
==8120==     in use at exit: 1,320,764 bytes in 568 blocks
==8120==   total heap usage: 584 allocs, 16 frees, 1,330,691 bytes allocated
==8120==
==8120== 112 bytes in 1 blocks are definitely lost in loss record 27 of 88
==8120==    at 0x4C2B6CD: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8120==    by 0x4E46BA5: kvz_image_alloc (image.c:49)
==8120==    by 0x401E12: input_read_thread (encmain.c:183)
==8120==    by 0x55EDE99: start_thread (pthread_create.c:308)
==8120==
==8120== 272 bytes in 1 blocks are possibly lost in loss record 41 of 88
==8120==    at 0x4C29DB4: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8120==    by 0x4012034: _dl_allocate_tls (dl-tls.c:297)
==8120==    by 0x55EEABC: pthread_create@@GLIBC_2.2.5 (allocatestack.c:571)
==8120==    by 0x4012B9: main (encmain.c:404)
==8120==
==8120== 544 bytes in 2 blocks are possibly lost in loss record 45 of 88
==8120==    at 0x4C29DB4: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8120==    by 0x4012034: _dl_allocate_tls (dl-tls.c:297)
==8120==    by 0x55EEABC: pthread_create@@GLIBC_2.2.5 (allocatestack.c:571)
==8120==    by 0x4E5EF65: kvz_threadqueue_init (threadqueue.c:308)
==8120==    by 0x4E3BD2F: kvz_encoder_control_init (encoder.c:173)
==8120==    by 0x4E4DD7E: kvazaar_open (kvazaar.c:80)
==8120==    by 0x401112: main (encmain.c:346)
==8120==
==8120== 53,856 bytes in 1 blocks are possibly lost in loss record 81 of 88
==8120==    at 0x4C2B6CD: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8120==    by 0x4E46BEC: kvz_image_alloc (image.c:59)
==8120==    by 0x401E12: input_read_thread (encmain.c:183)
==8120==    by 0x55EDE99: start_thread (pthread_create.c:308)
==8120==
==8120== LEAK SUMMARY:
==8120==    definitely lost: 112 bytes in 1 blocks
==8120==    indirectly lost: 0 bytes in 0 blocks
==8120==      possibly lost: 54,672 bytes in 4 blocks
==8120==    still reachable: 1,265,980 bytes in 563 blocks
==8120==         suppressed: 0 bytes in 0 blocks
==8120== Reachable blocks (those to which a pointer was found) are not shown.
==8120== To see them, rerun with: --leak-check=full --show-reachable=yes
==8120==
==8120== For counts of detected and suppressed errors, rerun with: -v
==8120== Use --track-origins=yes to see where uninitialised values come from
==8120== ERROR SUMMARY: 5 errors from 5 contexts (suppressed: 2 from 2)
2016-09-29 00:21:03 +03:00
Ari Koivula 31c5ff0f16 Add cross-platform core number detection
Well, turns out pthread_num_processors_np isn't standard so we need to
do this crap. Threw in hyper threading detection as a bonus.
2016-09-29 00:03:21 +03:00
Ari Koivula 8c7351eac8 Fix lp-gop with depth 1
GOPs with depth 1 had the same structure as those with depth 2:
g4d3t1 = 3 2 3 1
g4d2t1 = 2 2 2 1
g4d1t1 = 2 2 2 1

It now results in the correct:
g4d1t1 = 1 1 1 1
2016-09-29 00:03:21 +03:00
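One rule that reproduces the three corrected structures listed in the commit above is sketched below in Python. The rule is inferred from those examples only and is not taken from the kvazaar source. The function name `lp_gop_layers` is hypothetical, the GOP length is assumed to be a power of two, and the `t1` part of the syntax is ignored.

```python
def lp_gop_layers(gop_length, depth):
    """Return the layer of each picture in one low-delay GOP.

    Assumed rule: the last picture is on layer 1, the middle one on
    layer 2, and so on, with every layer capped at `depth`.
    """
    max_layer = gop_length.bit_length()  # log2(gop_length) + 1
    layers = []
    for i in range(1, gop_length + 1):
        trailing_zeros = (i & -i).bit_length() - 1
        layers.append(min(depth, max_layer - trailing_zeros))
    return layers


for d in (3, 2, 1):
    print(f"g4d{d}t1 =", *lp_gop_layers(4, d))
```

Under this rule the depth acts purely as a cap, so depth 1 flattens every picture onto layer 1 instead of repeating the depth 2 structure.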
Ari Koivula a395aeaac9 Set default settings to those of --preset=medium 2016-09-29 00:03:21 +03:00
Ari Koivula 4388fe0d30 Set presets to rate-distortion-complexity optimized versions 2016-09-29 00:03:20 +03:00
Ari Koivula facb1e16df Use -p64 -q22 and --gop=lp-g4d3t1 by default
Coding inter without GOP of any kind really isn't a very sensible
default. Defaulting to B-GOP of some kind would be better,
but lp-gop is more robust for now.
2016-09-29 00:03:20 +03:00
Ari Koivula d7391a9593 Improve default for number of parallel frames 2016-09-29 00:03:20 +03:00
Ari Koivula 19d423ab29 Use all available cores by default 2016-09-29 00:03:20 +03:00
Ari Koivula 3f138f087a Allow non-gop-length --period for lp-gop 2016-09-29 00:03:19 +03:00
Ari Koivula 16790c9f15 Remove number of references from --gop=lp syntax
The number of references should be part of the presets, so gop should
be defined separately.
2016-09-29 00:03:19 +03:00
Ari Koivula cbfa824d1a Merge branch 'simd' 2016-09-27 20:49:45 +03:00
Ari Koivula 14a7bcba25 Use a faster function for clipped inter SAD
Use the vectorized general SSE41 inter SAD in AVX reg_sad for shapes
for which we don't have AVX versions yet.

Also improves speed of --smp and --amp a lot. Got a 1.25x speedup for:
--preset=ultrafast -q 27 --gop=lp-g4d3r3t1 --me-early-termination=on --rd=1 --pu-depth-inter=1-3 --smp --amp

* Suite speed_tests:
-PASS inter_sad: 0.898M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec)
+PASS inter_sad: 2.503M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec)
-PASS inter_sad: 115.054M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec)
+PASS inter_sad: 133.577M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec)
2016-09-27 20:48:30 +03:00
Arttu Ylä-Outinen 4313e56c2d Add --no-rdoq-skip command line switch 2016-09-11 17:40:16 +09:00
Ari Koivula 19caa1e574 Update README and man page 2016-09-10 21:06:07 +03:00
Ari Koivula a7a33b08ec Remove --slice-addresses from usage message
And give a warning if it's used.

Slices will have to be implemented at some point, but they aren't yet
so let's not advertise them.
2016-09-10 21:06:00 +03:00
Eemeli Kallio f41e428e5f Removed kvz_skip_unnecessary_rdoq and reworked --rdoq-skip to skip 4x4 blocks when it is on. 2016-09-09 10:26:07 +03:00
Eemeli Kallio ed9c0b0416 RDOQ reworked in rdo.c. rdoq_signhide now skips coeffs that are after best_last_idx. 2016-09-09 10:16:51 +03:00
Ari Koivula 17f3f6bc86 Add clipped test cases to inter speed tests
Add tests for the extreme shapes that can happen when a motion vector
points outside the frame. A single pixel case where it probably doesn't
make sense to call a vectorized function, and the maximum size where it
definitely does make sense to call a vectorized function.
2016-09-01 23:08:16 +03:00
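The clipped case that these tests target can be pictured with a scalar reference SAD. The Python sketch below assumes that pixels outside the frame are obtained by clamping coordinates to the nearest border pixel; kvazaar's actual handling may differ, and the name `clipped_sad` is hypothetical.

```python
def clipped_sad(cur, ref, ref_x, ref_y):
    """Sum of absolute differences between block `cur` and the block at
    (ref_x, ref_y) in frame `ref`, clamping reference coordinates so that
    a motion vector may point outside the frame."""
    ref_h, ref_w = len(ref), len(ref[0])
    total = 0
    for y, row in enumerate(cur):
        ry = min(max(ref_y + y, 0), ref_h - 1)
        for x, pixel in enumerate(row):
            rx = min(max(ref_x + x, 0), ref_w - 1)
            total += abs(pixel - ref[ry][rx])
    return total


frame = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]
block = [[10, 10],
         [10, 10]]

print(clipped_sad(block, frame, 1, 1))    # fully inside the frame: 240
print(clipped_sad(block, frame, -1, -1))  # only the corner pixel is visible: 0
```

In the second call every reference sample collapses onto the single corner pixel, which corresponds to the 1x1 shape where calling a vectorized function probably does not pay off.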
Ari Koivula 02cd17b427 Add faster AVX inter SAD for 32x32 and 64x64
Add implementations for these functions that process the image line by
line instead of using the 16x16 function to process block by block.

The 32x32 is around 30% faster, and 64x64 is around 15% faster,
on Haswell.

PASS inter_sad: 28.744M x reg_sad(32x32):x86_asm_avx (1014 ticks, 1.014 sec)
PASS inter_sad: 7.882M x reg_sad(64x64):x86_asm_avx (1014 ticks, 1.014 sec)
to
PASS inter_sad: 37.828M x reg_sad(32x32):x86_asm_avx (1014 ticks, 1.014 sec)
PASS inter_sad: 9.081M x reg_sad(64x64):x86_asm_avx (1014 ticks, 1.014 sec)
2016-09-01 21:36:39 +03:00
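The quoted percentages follow from the calls-per-second figures in the commit message above. A short Python check, using only those four numbers:

```python
# Millions of reg_sad calls per second, copied from the commit message.
before = {"32x32": 28.744, "64x64": 7.882}
after = {"32x32": 37.828, "64x64": 9.081}

speedup = {shape: after[shape] / before[shape] for shape in before}
for shape, ratio in speedup.items():
    print(f"{shape}: {ratio:.2f}x")
# 32x32: 1.32x
# 64x64: 1.15x
```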
Ari Koivula f098e46f4f Add a more realistic speed test for inter sad
Inter SAD accesses pixels directly from the frame buffers, so give it
a 4k frame to work in for more realistic results. The old test used
intra test data, which consist of tiny buffers.
2016-09-01 20:30:26 +03:00
Ari Koivula ce34b73505 Merge branch 'mvd-fixedpoint'
Measured a cumulative effect of 1.04x speedup, when inter search is
used a lot. Not a huge difference.

--preset=ultrafast --me=tz --gop=lp-g4d3r2t2 --wpp --owf=4 --threads=14
2016-08-30 21:43:18 +03:00
Ari Koivula d0512d25c6 Use fixed point in get_mvd_coding_cost 2016-08-30 21:37:12 +03:00
Ari Koivula ec7507a935 Further optimize get_ep_ex_golomb_bitcost
Unrolled 16-bit log2 calculation.
2016-08-30 21:37:01 +03:00
Ari Koivula 3d17a194b5 Merge branch 'mvd' 2016-08-30 15:26:07 +03:00
Ari Koivula c0eef0adfb Merge branch 'simd-tests' 2016-08-30 15:25:42 +03:00
Ari Koivula 9ea7bfd19a Return calls per second from speed tests
Sometimes the tests overrun their time limit by varying amounts.
Return calls per second based on the amount of time actually spent in
the loop instead of how much time we tried to spend in the loop.
2016-08-30 15:23:44 +03:00
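The idea in the commit above, dividing by the time actually spent rather than by the requested time limit, can be sketched in Python as follows. The function name `measure_calls_per_second` and the injectable `clock` parameter are assumptions for illustration and are not part of kvazaar's test suite.

```python
import time


def measure_calls_per_second(func, time_limit=1.0, clock=time.perf_counter):
    """Call `func` repeatedly for about `time_limit` seconds and return the
    rate based on the elapsed time that was actually measured."""
    start = clock()
    now = start
    calls = 0
    while now - start < time_limit:
        func()
        calls += 1
        now = clock()
    # The loop may overrun time_limit, so divide by the real elapsed time.
    return calls / (now - start)


rate = measure_calls_per_second(lambda: sum(range(100)), time_limit=0.05)
print(f"{rate:.0f} calls per second")
```

If the last call overruns the limit, dividing by `time_limit` would overstate the rate, whereas dividing by `now - start` does not.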
Ari Koivula 07a919cb3e Add speed tests for dual intra SAD functions
The speed test suite was crashing due to these being missing.
2016-08-30 15:23:11 +03:00
Ari Koivula 345ef833d7 Add more tests for inter SAD
The existing tests only covered the edge cases of border extension, but
not the SIMD optimized versions of reg_sad. This adds proper tests for
current optimized reg_sad implementations and ones we are likely to
have in the future.
2016-08-30 15:22:05 +03:00
Ari Koivula a4ba794587 Optimize get_ep_ex_golomb_bitcost
Arrange the decision tree such that there are only 3 branches on the
most common paths and the more likely branch is always fall-through.

A profile guided optimization pass would probably do something similar.
2016-08-30 05:24:16 +03:00
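For context, the bit count of a k-th order Exp-Golomb code can be obtained either by stepping through the binarization or from a single integer log2, which is the kind of quantity a decision tree or an unrolled log2 can compute quickly. The Python sketch below shows that the two agree. It is not the code of `get_ep_ex_golomb_bitcost`; both function names and the choice of orders tested are assumptions.

```python
def exp_golomb_bits_loop(value, k):
    """Count bits by mimicking the binarization step by step."""
    bits = 0
    while value >= (1 << k):
        bits += 1            # one prefix bit
        value -= 1 << k
        k += 1
    return bits + 1 + k      # terminating bit plus k suffix bits


def exp_golomb_bits_log2(value, k):
    """Same count computed from floor(log2(value + 2**k))."""
    log2 = (value + (1 << k)).bit_length() - 1
    return 2 * log2 - k + 1


for v in (0, 1, 2, 5, 6):
    print(v, exp_golomb_bits_loop(v, 1), exp_golomb_bits_log2(v, 1))
```

The result is a step function of the value: it only changes when `value + 2**k` crosses a power of two, so a few well-ordered comparisons are enough to classify the common small values.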
Ari Koivula 82cfab58f8 Improve fast mvd coding cost estimation
A lot of time is being taken up by this function on ultrafast, and it
doesn't do a very good job. This change aims to both simplify the
logic and make the estimate better.

The logic is simplified by using a lookup for the mvd bit cost
step function instead of mimicking the binarization process. The
estimation is made better by checking fractional cabac bit costs.

The new function returns the same results as
kvz_get_mvd_coding_cost_cabac, but is also faster than the old
function.
2016-08-30 04:55:09 +03:00
Ari Koivula d31be8eb27 Make mvd_coding_cost functions take const cabac 2016-08-30 04:46:46 +03:00
Ari Koivula 11defe1595 Update readme and man page 2016-08-26 12:15:28 +03:00
Ari Koivula 64d631c174 Fix 8bit to 10bit input conversion regression 2016-08-25 22:09:40 +03:00
Ari Koivula 27789125d8 Fix input bit depth conversion
The input was being shifted to the wrong direction.
2016-08-25 22:05:25 +03:00
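The direction of the shift is easy to get wrong, so here is a minimal Python sketch of the intended behaviour. It assumes a plain shift without rounding or dithering, and the helper name `convert_bit_depth` is hypothetical rather than a kvazaar function.

```python
def convert_bit_depth(samples, from_bits, to_bits):
    """Scale samples between bit depths with a plain shift.

    Going to a higher bit depth shifts left, going to a lower one shifts
    right.
    """
    shift = to_bits - from_bits
    if shift >= 0:
        return [s << shift for s in samples]
    return [s >> -shift for s in samples]


print(convert_bit_depth([0, 128, 255], 8, 10))   # [0, 512, 1020]
print(convert_bit_depth([0, 512, 1023], 10, 8))  # [0, 128, 255]
```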
Ari Koivula cff3bc8458 Merge branch 'monochrome' 2016-08-25 20:16:20 +03:00
Ari Koivula 4ec039004b Add monochrome encoding
Write bitstream without chroma when encoding with --input-format=P400.
This reduces bitstream size by 0-1 %, compared to coding monochrome in
420 format, and speeds up encoding slightly due to not processing
chroma.
2016-08-25 20:15:26 +03:00