Rename and reorder everything to make more sense.
- Moved input tables into their own struct and renamed them to what
they actually represent.
- Renamed pretty much every variable to comform to our style and
to make sense.
- Removed the lastCG stuff, as the function already gets passed the
last coeff anyway. (it was named width, what the hell?)
This is to prepare for changing the code using the floating point table
to use the fixed point table instead.
This also allows reducing the size of the fractional part, which was
useful for finding every place where the the fixed point presentation
is relied upon.
Changes luma deblocking to use gather and scatter instead of reading
to and writing from here and there in memory. Should make them
faster and easier to vectorize, or at least cleaner.
Splits strong and weak luma deblocking to two functions, as they have
almost nothing in common.
Adds field lcu_stats to encoder_state_config_frame_t. The following data
is recorded for each LCU:
- number of bits
- squared cost
- used lambda value
- alpha parameter used for rate control
- beta parameter used for rate control
When rate control is enabled, enable cu_qp_delta_enabled_flag in PPS
with diff_cu_qp_delta_depth set to 0. Also adds code for writing the QP
deltas and a new cabac context.
Adds fields lambda, lambda_sqrt and qp to encoder_state_t. Drops field
cur_lambda_cost_sqrt from encoder_state_config_frame_t and renames
cur_lambda_cost to lambda.
- Defines MIN_LAMBDA and MAX_LAMBDA constants.
- Moves resetting state->frame->cur_gop_bits_coded to rate_control.c.
- Changes gop_allocate_bits to return the number of bits allocated like
pic_allocate_bits does.
When --threads=auto was given on the command line, cfg->threads was
actually set to zero, disabling threads altogether. Fixed to set
cfg->threads to -1, so that the number of threads is chosen
automatically.
The CABAC engine only writes to the bitstream when it has a full byte.
These writes are also always byte-aligned, so there is no need to even
check for stream alignment.
Speedup was around 3% with ultrafast and low QP.
Enforce bit depth promised by --input-bitdepth to avoid crashes when
larger values are provided.
Do endianess byte swap for all bytes when the buffer gets extended
to multiple of 8 pixels, and not just the number of input pixels.
Don't swap bytes on a little-endian system.
- Reduce indentation to 6 spaces
- Word wrap everything to under 80 characters
- Remove defaults from options covered by presets
- Add a dash in front of argument descriptions
- Add --(no-) to names of parameters that accept it and remove mention
of enabling or disabling
- Add executable and scripts as a dependancy to make docs
This value is not represented in the HEVC bitstream, which is why it
was not set previously. FFmpeg sets and needs it however, so make the
CLI set it as well to make sure we handle it correctly.
The rd-complexity of slow presets is better with a less agressive GOP.
Adding the GOP as part of the preset improved BDRate enough, that it
didn't make sense anymore to have a veryslow target the best BDRate.
Instead, push that responsibility to placebo by making it a little bit
faster.
GOPs with depth 1 had the same structure as those with depth 2:
g4d3t1 = 3 2 3 1
g4d2t1 = 2 2 2 1
g4d1t1 = 2 2 2 1
It now results in the correct:
g4d1t1 = 1 1 1 1
Coding inter without GOP of any kind really isn't a very sensible
default. Defaulting to B-GOP of some kind would be more better,
but lp-gop is more robust for now.
Use the vectorized general SSE41 inter SAD in AVX reg_sad for shapes
for which we don't have AVX versions yet.
Also improves speed of --smp and --amp a lot. Got a 1.25x speedup for:
--preset=ultrafast -q 27 --gop=lp-g4d3r3t1 --me-early-termination=on --rd=1 --pu-depth-inter=1-3 --smp --amp
* Suite speed_tests:
-PASS inter_sad: 0.898M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec)
+PASS inter_sad: 2.503M x reg_sad(64x63):x86_asm_avx (1000 ticks, 1.000 sec)
-PASS inter_sad: 115.054M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec)
+PASS inter_sad: 133.577M x reg_sad(1x1):x86_asm_avx (1000 ticks, 1.000 sec)
Add implementations for these functions that process the image line by
line instead of using the 16x16 function to process block by block.
The 32x32 is around 30% faster, and 64x64 is around 15% faster,
on Haswell.
PASS inter_sad: 28.744M x reg_sad(32x32):x86_asm_avx (1014 ticks, 1.014 sec)
PASS inter_sad: 7.882M x reg_sad(64x64):x86_asm_avx (1014 ticks, 1.014 sec)
to
PASS inter_sad: 37.828M x reg_sad(32x32):x86_asm_avx (1014 ticks, 1.014 sec)
PASS inter_sad: 9.081M x reg_sad(64x64):x86_asm_avx (1014 ticks, 1.014 sec)
Arrange the decision tree such that there is only 3 branches on the
most common paths and the more likely branch is always fall-through.
A profile guided optimization pass would probably do something similar.
A lot of time is being taken up by this function on ultrafast, and it
doesn't do a very good job. This change aims to both simplify the
logic and make the estimate better.
The logic is simplified by using a look up for the step mvd bit cost
step function instead of mimicking the binarization process. The
estimation is made better by checking fractional cabac bit costs.
The new function returns the same results as
kvz_get_mvd_coding_cost_cabac, but is also faster than the old
function.
Write bitstream without chroma when encoding with --input-format=P400.
This reduces bitstream size by 0-1 %, compared to coding monochrome in
420 format, and speeds up encoding slightly due to not processing
chroma.
Changes encoder_set_source_picture to set the reconstructed picture to
a copy of the source picture instead of allocating a new picture when
lossless coding is used.
- Moves allocation of the reconstructed picture after the source picture
is set.
- Extracts main state initialization to a separate function from
encoder_state_new_frame.
- Changes kvz_encoder_feed_frame to return the frame.
- Renames some functions to better match their purpose.
When --lossless is given, set cu_transquant_bypass_flag for every CU and
bypass transform and quantization by directly copying reference pixels
to reconstruction and the residual to coefficients.
When a list does not have space for the new element, its size is
doubled. If the size of the list is zero, it would not be resized. Fixed
to always resize the list so that the new element can be added.
Enables search for 2NxN and Nx2N partition modes for 8x8 CUs and 2NxnU,
2NxnD, nLx2N and nRx2N partition modes for 16x16 CUs.
Changes the loop for copying reconstructed luma pixels in
kvz_inter_recon_lcu to use 4 byte chunks instead of 8 byte chunks since
it is now possible to have 4 pixel wide blocks.
The first frame was always qp51 due to gop_offset being -1 for the
first frame. This fix makes it so that bits are allocated as if it was
the last (high quality) frame from the previous GOP.
When using ratecontrol with lowdelay-P, this improves BDRate by 1-25%.
Strongest effect is when using 4 layers and multiple references.
Also allow using 1 or 2 layers with ratecontrol.
This problem resulted in an illegal bitstream with --gop=lp, because it
uses IDR's. The --gop=8 would not code IDR pictures, even when told to
with -p, which masked this problem.
This fix solves the problem with --gop=lp and also prevents references
across the intra picture in --gop=8. The intra pictures should be set
to IDR in a later fix, or an alternate method of differentiating
between IDR and non-IDR intra should be made.
The includes should make more sense now and not just happen to compile
due to headers included from other headers.
Used a modified version of IWYU. Modifications were to attribute int8_t
and so on to stdint.h instead of sys/types.h and immintrin.h instead of
more specific headers.
include-what-you-use 0.7 (git:b70df35)
based on clang version 3.9.0 (trunk 264728)
While these are only used for strategies, it's non-intuitive to have
to include strategyselector.h in every file under strategies before
including anything else.
The main thread has to wait for the worker threads to finish. The
pthread_cond_timedwait call used to accomplish this was given
a relative instead of absolute time, which resulted in the call
returning immediately, because the time had already passed.
This removes the now unnecessary sleeps and fixes the time given to
the pthread_cond_timedwait such that it now waits until a job finishes
or 100ms have passed.
The OWF wpp limit code assumed square blocks, and as such did not work
correctly when height != width. This changes the relevant code to consider
both height and width.
Add md5 through extras/libmd5 taken from HM with BSD license. It's
implemented as a generic strategy using the same interface as checksum,
so we can write a SIMD version if it seems necessary.
The previous reasoning used deblocking and fractional motion estimation
together to arrive at a margin of 4 pixels. This was wrong, and with
either of these off, half pixel chroma interpolation could use pixels
outside the intended region.
Deblocking does not currently affect the margin needed.
I was a bit unclear about exactly what happens and when regarding SAO
and deblocking when we do frame-parallel WPP parallelism, so I checked
and commented the bits that were unclear to me.
The check was done in regard to the wrong dimension, allowing the
access to unfinished parts of the frame when coding multiple frames
at the same time.
Add new parameter --tiles that accept only uniform split. I considered
supporting the syntax of --tiles-width-split for this, but writing
--tiles=u2xu2 is just not as intuitive as --tiles=2x2, and there is
hardly ever any reason to use anything but uniform split. The more
cumbersome --tiles-width-split and --tiles-height-split parameters
are still there to allow finer control.
There was an off by one error in the dependance setting code, which
resulted in dependencies not being set resulting in checksum errors.
For example if ref_neg=1 and owf=1.
Earlier fix that fixed the supply side of the cu_array to take tile
coordinates into account should have been accompanied with this one
that does the same thing to demand side.
Moves sao search from function encoder_state_worker_encode_lcu in
encoderstate.c to function kvz_sao_search_lcu in sao.c. Makes functions
kvz_init_sao_info, kvz_sao_search_chroma and kvz_sao_search_luma static
since they are no longer used outside sao.c.
If 0,0 vector is illegal, it's possible that no legal movement vector,
is found, in which case a large cost is returned instead. The cost
overflowed and there is all sorts of silliness with converting from
double to int, but I'm not going to fix all of it because when we
remove the doubles it will all get fixed.
An incorrect frame boundary check caused a checksum error, because the
chroma reconstruction of the encoder was wrong. The encoder treated
horizontal tile boundaries as frame boundaries when the vertical
component of the movement vector was a multiple of 8.
CU data was being copied to the wrong place in the reference frames
cu_array, which led to uninitialized data being used as a starting
point for motion vector search.
Fixes#99.
A 32 bit int overflowed after 2^31 bits (2Gb). It will still overflow
eventually, after 500 years of outputting 1Gb/s, but by that time,
I recon we will have fixed this properly and it's time to upgrade.
Changes communication between the input thread and main thread in
encmain.c so that only one of them uses img_in and retval at a time.
Fixes a race condition which would sometimes result in a deadlock.
Add dependency to the reference frame instead of the previous frame,
in order to allow more frames to be encoded in parallel when temporal
stepping >1 in LP-gop (such as --gop=lp-g8d4r1t2).
This moves the interlacing from CLI code to api->encoder_encode, in
order to make it possible to use field coding through the lib API.
The field order is now determined per frame, as FFmpeg gives it per
frame and it's signaled per frame.
As a side effect, the CLI also now prints info from frames instead of
fields. While we might want to extend the API in the future to allow
printing of more detailed information about fields, for now it's
more important that the CLI uses the real lib API.
PSNR calculation for interlaced frames disabled until we have a way to
avoid deinterlacing the frame when it's not necessary.
Prevents a conflict with config.h and src/config.h so that the config.h
generated by configure is included in global.h. Fixes problems with
large input files on 32-bit systems.
- Handle input processing in a separate thread to allow main thread more time with thread handling etc
- Significant speedup can be seen when run on ultrafast settings and on a system with great number of cores
Option -mavx2 was omitted when compiling AVX2 strategies. This commit
moves strategies to convenience libraries so that their compilation
flags can be easily set and adds -mavx2 to CFLAGS of the AVX2 library.
We have soname versioning now, so we should focus on getting that right
instead. This also serves as an example of correctly incrementing the
lib-version.
Now that we put the timing info into the bitstream, the time base must
be precisely known. Represent framerate as a fraction and add timing
info only if the old floating point framerate was not used.
Deprecate cfg->framerate so it can be removed once we get patches to
FFmpeg and libav.
Add support for (num)/(denom) format to --input-fps.
Also moves CLI stuff under CLI project, so they are compiled as their
own lib just like when the Makefile is used.
The file interface_main.c was an artifact from a bygone era and should have
been deleted long ago.
Add module information to all header files.
Update all header file documentations to briefly say what they are, and
to use the javadoc format so the brief actually gets included into the
doxygen documentation.
Remove \file from implementation files, in order to not repeat the info
from the header files.
Add files under strategies and tools to Doxygen and update the Doxygen
settings to be just plain better.
Make README be the main page of Doxygen documentation.
Bits were being added to rate distortion without being multiplied by
lambda in a few places. Fixing this bug also finally allows us to remove
the magic bits from the Coding Unit split decision.
I tried to find new optimum value for CU_COST and it turned out to be 2
for veryslow and 0 for superfast. The difference between 0 and 2 on
veryslow was only 0.1% however, so I don't think this parameter is
needed any longer. Before this fix the effect of removing CU_COST would
have been 0.8%.
- Updates function search_mv_full so that it compiles and handles
non-square blocks.
- Enables compilation of search_mv_full.
- Sets full search radius to 32.
- Enables selecting full mv search with "--me full".
Changes search_frac and kvz_search_cu_iter to use kvz_satd_any_size for
computing the SATDs instead of getting the SATD function with
kvz_pixels_get_satd_func.
Adds strategy satd_any_size for generic and AVX2. The satd_any_size
functions are implemented with macro SATD_ANY_SIZE defined in
strategies-picture.h.
- Adds function is_pu_boundary.
- Moves code for filtering an edge of a single PU or TU to a new
function filter_deblock_unit.
- Replaces recursive CU tree traversal in filter_deblock_cu with
a simple loop and renames it to filter_deblock_lcu_inside.
Adds parameter block_height to functions inter_recon_frac_luma,
inter_recon_14bit_frac_luma and inter_recon_14bit_frac_chroma so that
they can handle SMP blocks.
Makes the following functions static since they are not used outside
inter.c:
- kvz_inter_recon_frac_luma
- kvz_inter_recon_14bit_frac_luma
- kvz_inter_recon_frac_chroma
- kvz_inter_recon_14bit_frac_chroma
The buffers allocated in functions kvz_get_extended_block_avx2 and
kvz_get_extended_block_generic were too small when the width of the
block was less than its height. Fixed to allocate correctly sized
buffers.
Adds parameter height to functions kvz_inter_recon_lcu and
kvz_inter_recon_lcu_bipred and makes them work on non-square sizes.
Fractional reconstruction functions do not handle non-square blocks yet.
- Moves SIZE_* definitions to cu.h.
- Adds constant arrays kvz_part_mode_num_parts, kvz_part_mode_offsets
and kvz_part_mode_sizes for storing the number of PUs, PU offsets and
PU sizes.
- Adds macros PU_GET_X, PU_GET_Y, PU_GET_W and PU_GET_H for getting the
location and size of a PU.
Moves checks for motion vector prediction and merge candidate block
types (inter/intra) from functions kvz_inter_get_mv_cand and
kvz_inter_get_merge_cand to kvz_inter_get_spatial_merge_candidates.
Remove the need to count the coefficients by populating the significant
coefficient group map first and finding the last coefficient from the
last group afterward. The speedup is about 2% on ultrafast.
The previous version of this patch was reverted due to a bug, which
has now been fixed.
This reverts commit 25462124f8.
That commit broke the bitstream. If it's not good enough to push on Friday
night, it's probably not good enough on Monday morning either.
Remove the need to count the coefficients by populating the significant
coefficient group map first and finding the last coefficient from the
last group afterward.
Increases the MV safety margin of OWF from 2 to 3 when deblocking
is used and 4 when both deblocking and FME are used.
Fractional pixel motion estimation can move the vector one more pixel
down causing checksum error. This fixes that error by increasing the
OWF safety margin and changes the interface, so that different margin
can be used when FME or deblocking are not in use.
Replaces repetitive calls to kvz_filter_deblock_luma and
kvz_filter_deblock_chroma with loops in functions
filter_deblock_edge_luma and filter_deblock_edge_chroma.
Marks the following functions static and removes them from filter.h
since they are not used outside the filter module.
- kvz_filter_deblock_luma
- kvz_filter_deblock_chroma
- kvz_filter_deblock_edge_luma
- kvz_filter_deblock_edge_chroma
- kvz_filter_deblock_cu
- Replace parameter depth in kvz_filter_deblock_edge_{luma,chroma} with
length.
- Move checking whether an edge needs to be filtered from functions
kvz_filter_deblock_edge_{luma,chroma} to functions
kvz_filter_deblock_{cu,lcu}.
- Use pixel coordinates instead of 8-pixel block coordinates in
kvz_filter_deblock_cu.
- Add comments.
These changes should make it easier to modify the deblocking filter to
handle SMP and AMP blocks.
The mac version of KVZ_GET_TIME macro has many statements, which
prevented it being used inside a for loop statement. Added brackets
to all versions to prevent this issue arising in the future.
Fixes#115.
Giving a single number as an argument to --deblock option would enable
deblocking and set both beta and tc to that value. This commit changes
a single number argument to be interpreted as a boolean specifying
whether to enable deblocking or not. As a result, "--deblock 0" can be
now used to disable deblocking.
This fixes deblocking being enabled in all presets.