The rd-complexity of slow presets is better with a less aggressive GOP.
Adding the GOP as part of the preset improved BDRate enough that it
no longer made sense to have veryslow target the best BDRate.
Instead, push that responsibility to placebo by making it a little bit
faster.
Now that WPP is on by default, Valgrind is finding memory leaks on
these tests. It's not a priority so I'll just disable it for now.
==8120== Memcheck, a memory error detector
==8120== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==8120== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==8120== Command: /home/travis/build/Venti-/kvazaar/src/.libs/lt-kvazaar -i mandelbrot_264x130.yuv --input-res=264x130 -o test.265 -p4 -r2 --owf=1 --threads=2 --tiles-height-split=u2 --rd=0 --no-rdoq --no-deblock --no-sao --no-signhide --subme=0 --pu-depth-inter=1-3 --pu-depth-intra=2-3
==8120==
Disabling TMVP because tiles are used.
Compiled: INTEL, flags: MMX SSE SSE2
Detected: INTEL, flags: MMX SSE SSE2 SSE3 SSSE3 SSE41 SSE42
Available: sse2(2) sse41(1)
In use: sse2(1) sse41(1)
Input: mandelbrot_264x130.yuv, output: test.265
Video size: 264x136 (input=264x130)
==8120== Conditional jump or move depends on uninitialised value(s)
==8120== at 0x4E5FEE5: kvz_threadqueue_job_dep_add (threadqueue.c:616)
==8120== by 0x4E3DEAB: encoder_state_worker_encode_children (encoderstate.c:432)
==8120== by 0x4E3E219: encoder_state_encode (encoderstate.c:649)
==8120== by 0x4E3DE35: encoder_state_worker_encode_children (encoderstate.c:417)
==8120== by 0x4E3E219: encoder_state_encode (encoderstate.c:649)
==8120== by 0x4E3DE35: encoder_state_worker_encode_children (encoderstate.c:417)
==8120== by 0x4E3E219: encoder_state_encode (encoderstate.c:649)
==8120== by 0x4E3ECBD: kvz_encode_one_frame (encoderstate.c:941)
==8120== by 0x4E4DA22: kvazaar_encode (kvazaar.c:229)
==8120== by 0x4E4E228: kvazaar_field_encoding_adapter (kvazaar.c:280)
==8120== by 0x40137F: main (encmain.c:436)
==8120==
lt-kvazaar: threadqueue.c:618: kvz_threadqueue_job_dep_add: Assertion `job && depends_on' failed.
==8120==
==8120== HEAP SUMMARY:
==8120== in use at exit: 1,320,764 bytes in 568 blocks
==8120== total heap usage: 584 allocs, 16 frees, 1,330,691 bytes allocated
==8120==
==8120== 112 bytes in 1 blocks are definitely lost in loss record 27 of 88
==8120== at 0x4C2B6CD: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8120== by 0x4E46BA5: kvz_image_alloc (image.c:49)
==8120== by 0x401E12: input_read_thread (encmain.c:183)
==8120== by 0x55EDE99: start_thread (pthread_create.c:308)
==8120==
==8120== 272 bytes in 1 blocks are possibly lost in loss record 41 of 88
==8120== at 0x4C29DB4: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8120== by 0x4012034: _dl_allocate_tls (dl-tls.c:297)
==8120== by 0x55EEABC: pthread_create@@GLIBC_2.2.5 (allocatestack.c:571)
==8120== by 0x4012B9: main (encmain.c:404)
==8120==
==8120== 544 bytes in 2 blocks are possibly lost in loss record 45 of 88
==8120== at 0x4C29DB4: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8120== by 0x4012034: _dl_allocate_tls (dl-tls.c:297)
==8120== by 0x55EEABC: pthread_create@@GLIBC_2.2.5 (allocatestack.c:571)
==8120== by 0x4E5EF65: kvz_threadqueue_init (threadqueue.c:308)
==8120== by 0x4E3BD2F: kvz_encoder_control_init (encoder.c:173)
==8120== by 0x4E4DD7E: kvazaar_open (kvazaar.c:80)
==8120== by 0x401112: main (encmain.c:346)
==8120==
==8120== 53,856 bytes in 1 blocks are possibly lost in loss record 81 of 88
==8120== at 0x4C2B6CD: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8120== by 0x4E46BEC: kvz_image_alloc (image.c:59)
==8120== by 0x401E12: input_read_thread (encmain.c:183)
==8120== by 0x55EDE99: start_thread (pthread_create.c:308)
==8120==
==8120== LEAK SUMMARY:
==8120== definitely lost: 112 bytes in 1 blocks
==8120== indirectly lost: 0 bytes in 0 blocks
==8120== possibly lost: 54,672 bytes in 4 blocks
==8120== still reachable: 1,265,980 bytes in 563 blocks
==8120== suppressed: 0 bytes in 0 blocks
==8120== Reachable blocks (those to which a pointer was found) are not shown.
==8120== To see them, rerun with: --leak-check=full --show-reachable=yes
==8120==
==8120== For counts of detected and suppressed errors, rerun with: -v
==8120== Use --track-origins=yes to see where uninitialised values come from
==8120== ERROR SUMMARY: 5 errors from 5 contexts (suppressed: 2 from 2)
GOPs with depth 1 had the same structure as those with depth 2:
g4d3t1 = 3 2 3 1
g4d2t1 = 2 2 2 1
g4d1t1 = 2 2 2 1
It now results in the correct:
g4d1t1 = 1 1 1 1
Coding inter without a GOP of any kind really isn't a very sensible
default. Defaulting to a B-GOP of some kind would be better,
but lp-gop is more robust for now.
Use the vectorized general SSE41 inter SAD in AVX reg_sad for shapes
for which we don't have AVX versions yet.
Also improves speed of --smp and --amp a lot. Got a 1.25x speedup for:
--preset=ultrafast -q 27 --gop=lp-g4d3r3t1 --me-early-termination=on --rd=1 --pu-depth-inter=1-3 --smp --amp
* Suite speed_tests:
-PASS inter_sad: 0.898M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec)
+PASS inter_sad: 2.503M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec)
-PASS inter_sad: 115.054M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec)
+PASS inter_sad: 133.577M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec)
Add tests for the extreme shapes that can happen when a motion vector
points outside the frame: a single pixel case, where it probably doesn't
make sense to call a vectorized function, and the maximum size, where it
definitely does make sense to call a vectorized function.
Add implementations for these functions that process the image line by
line instead of using the 16x16 function to process block by block.
The 32x32 is around 30% faster, and 64x64 is around 15% faster,
on Haswell.
PASS inter_sad: 28.744M x reg_sad(32x32):x86_asm_avx (1014 ticks, 1.014 sec)
PASS inter_sad: 7.882M x reg_sad(64x64):x86_asm_avx (1014 ticks, 1.014 sec)
to
PASS inter_sad: 37.828M x reg_sad(32x32):x86_asm_avx (1014 ticks, 1.014 sec)
PASS inter_sad: 9.081M x reg_sad(64x64):x86_asm_avx (1014 ticks, 1.014 sec)
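For reference, the line-by-line approach can be sketched in scalar C.
This is only an illustration of the loop structure, not the AVX code
from this change; the name sad_by_rows and its signature are made up
for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// Scalar sketch of a row-by-row SAD (sum of absolute differences).
// Hypothetical helper, not part of kvazaar. Each block is walked one
// full line at a time instead of being split into 16x16 sub-blocks.
static unsigned sad_by_rows(const uint8_t *a, const uint8_t *b,
                            int width, int height,
                            int stride_a, int stride_b)
{
  assert(width > 0 && height > 0);
  unsigned sad = 0;
  for (int y = 0; y < height; ++y) {
    // Start of the current line in each buffer.
    const uint8_t *row_a = a + (size_t)y * (size_t)stride_a;
    const uint8_t *row_b = b + (size_t)y * (size_t)stride_b;
    for (int x = 0; x < width; ++x) {
      sad += (unsigned)abs((int)row_a[x] - (int)row_b[x]);
    }
  }
  return sad;
}
```

A vectorized version would replace the inner loop with wide loads and
a packed SAD instruction, but the outer per-line loop stays the same.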
Inter SAD accesses pixels directly from the frame buffers, so give it
a 4k frame to work in for more realistic results. The old test used
intra test data, which consist of tiny buffers.
Measured a cumulative speedup of 1.04x when inter search is used a
lot. Not a huge difference.
--preset=ultrafast --me=tz --gop=lp-g4d3r2t2 --wpp --owf=4 --threads=14
Sometimes the tests overrun their time limit by varying amounts.
Return calls per second based on the amount of time actually spent in
the loop instead of how much time we tried to spend in the loop.
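A minimal sketch of the calculation, assuming the loop counts its calls
and the elapsed time is measured in clock ticks. The helper name
calls_per_second and its parameters are hypothetical, not the names
used in the test code.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: calls per second from the time actually spent.
// elapsed_ticks is measured after the loop ends, so an overrun of the
// time limit lowers the rate instead of inflating it.
static double calls_per_second(uint64_t calls,
                               uint64_t elapsed_ticks,
                               uint64_t ticks_per_sec)
{
  assert(ticks_per_sec > 0);
  if (elapsed_ticks == 0) {
    return 0.0;  // Nothing was measured, avoid dividing by zero.
  }
  return (double)calls * (double)ticks_per_sec / (double)elapsed_ticks;
}
```

With a 1000 tick limit and a loop that actually ran for 1014 ticks,
dividing by 1014 instead of 1000 removes the overrun from the result.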
The existing tests only covered the edge cases of border extension, but
not the SIMD optimized versions of reg_sad. This adds proper tests for
current optimized reg_sad implementations and ones we are likely to
have in the future.
Arrange the decision tree such that there are only 3 branches on the
most common paths and the more likely branch is always fall-through.
A profile guided optimization pass would probably do something similar.
A lot of time is being taken up by this function on ultrafast, and it
doesn't do a very good job. This change aims to both simplify the
logic and make the estimate better.
The logic is simplified by using a lookup for the mvd bit cost step
function instead of mimicking the binarization process. The
estimation is made better by checking fractional cabac bit costs.
The new function returns the same results as
kvz_get_mvd_coding_cost_cabac, but is also faster than the old
function.
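To illustrate the lookup idea, here is a sketch that assumes the step
function is the length of the first order Exp-Golomb code that HEVC
uses for abs_mvd_minus2. The table and both function names are made up
for this example and are not the ones in kvazaar; the context coded
flags and their fractional costs are left out.

```c
#include <assert.h>
#include <stdint.h>

// Bit count obtained by mimicking first order Exp-Golomb binarization.
static unsigned eg1_bits_binarize(unsigned value)
{
  assert(value < (1u << 16));
  unsigned k = 1;
  unsigned bits = 0;
  while (value >= (1u << k)) {
    value -= 1u << k;  // One prefix bin per escaped range.
    ++k;
    ++bits;
  }
  return bits + 1 + k;  // Prefix bins, terminating bin, k suffix bins.
}

// Illustrative table: the length only steps up at 2, 6, 14, 30, ...
static const uint8_t eg1_bits_table[16] = {
  2, 2, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8
};

// Same result as above, but small values are a single table read.
static unsigned eg1_bits_lookup(unsigned value)
{
  if (value < 16) {
    return eg1_bits_table[value];
  }
  return eg1_bits_binarize(value);  // Rare large values take the slow path.
}
```

Small motion vector differences are by far the most common, so the
table covers the hot path and the loop is only a fallback.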
Write bitstream without chroma when encoding with --input-format=P400.
This reduces bitstream size by 0-1 %, compared to coding monochrome in
420 format, and speeds up encoding slightly due to not processing
chroma.