To extract the block costs, build Kvazaar as usual, then edit the relevant
parameters at the beginning of extract_rdcosts.py and run_filter.py, most
importantly the number of cores and the set of video sequences you want to
encode. Then run extract_rdcosts.py. It will use Kvazaar to encode each
sequence and extract the costs measured for the quantized blocks. The costs
are stored compressed and sorted by block QP, in the following format:

Size (B)  | Description
----------+------------
4         | size:   Coeff group size, in int16's
4         | ccc:    Coeff group's coding cost
size * 2  | coeffs: Coeff group data
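
As a minimal sketch of how one record in the layout above could be parsed,
the following Python reads records from an already-decompressed byte stream.
The function name read_records is hypothetical, and the byte order
(little-endian) and signedness of the two 4-byte fields (unsigned) are
assumptions, since the table only gives their sizes. Decompression is left
out because the compressor is not named here.

```python
import io
import struct


def read_records(stream):
    """Yield (ccc, coeffs) tuples from an already-decompressed byte stream."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of stream
        if len(header) < 8:
            raise ValueError("truncated record header")
        # Assumption: both header fields are little-endian unsigned 32-bit.
        size, ccc = struct.unpack("<II", header)
        payload = stream.read(size * 2)
        if len(payload) < size * 2:
            raise ValueError("truncated coefficient data")
        # 'size' counts int16 values, so the payload is size * 2 bytes long.
        coeffs = struct.unpack("<%dh" % size, payload)
        yield ccc, coeffs


# Round trip on one synthetic record: 4 coefficients, coding cost 37.
raw = struct.pack("<II4h", 4, 37, 0, -1, 2, 5)
records = list(read_records(io.BytesIO(raw)))
print(records)
```

Because the function is a generator, it holds only one record in memory at a
time, regardless of how large the input is.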

To analyze the costs by running a linear regression over them, build the two
tools using:

$ gcc filter_rdcosts.c -O2 -o frcosts_matrix
$ gcc ols_2ndpart.c -O2 -o ols_2ndpart

Then run the regression in parallel by running run_filter.py. It is done this
way because the data sets are huge (larger than the RAM of a decent
workstation) and are stored compressed, so there is no way to mmap them in
Matlab, Octave or similar. Streaming the compressed data lets each job run in
O(1) memory, so you can run as many jobs in parallel as you have CPU cores.
Each result file consists of 4 numbers, which represent an approximate linear
solution to the corresponding set of costs: the price in bits of a coefficient
whose absolute value is a) 0, b) 1, c) 2, d) 3 or higher.
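
To illustrate how a 4-weight linear fit can be computed in O(1) memory, here
is a sketch that accumulates the normal equations (X^T X and X^T y) one
coefficient group at a time and solves the 4x4 system at the end. This is one
standard way to do streaming least squares and is an assumption for
illustration only; it is not claimed to be what filter_rdcosts.c and
ols_2ndpart.c actually do. All names (bucket_counts, StreamingOLS) and the
test data are hypothetical.

```python
def bucket_counts(coeffs):
    """Count coefficients whose absolute value is 0, 1, 2, and 3 or higher."""
    counts = [0, 0, 0, 0]
    for c in coeffs:
        counts[min(abs(c), 3)] += 1
    return counts


class StreamingOLS:
    """Least squares via normal equations; state is fixed-size (n*n + n)."""

    def __init__(self, n=4):
        self.n = n
        self.xtx = [[0.0] * n for _ in range(n)]
        self.xty = [0.0] * n

    def add(self, x, y):
        # Fold one observation (feature vector x, cost y) into the sums.
        for i in range(self.n):
            self.xty[i] += x[i] * y
            for j in range(self.n):
                self.xtx[i][j] += x[i] * x[j]

    def solve(self):
        # Gauss-Jordan elimination with partial pivoting on [X^T X | X^T y].
        n = self.n
        a = [row[:] + [rhs] for row, rhs in zip(self.xtx, self.xty)]
        for col in range(n):
            pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
            if abs(a[pivot][col]) < 1e-12:
                raise ValueError("singular system; need more varied data")
            a[col], a[pivot] = a[pivot], a[col]
            for r in range(n):
                if r != col:
                    f = a[r][col] / a[col][col]
                    for k in range(col, n + 1):
                        a[r][k] -= f * a[col][k]
        return [a[i][n] / a[i][i] for i in range(n)]


# Synthetic check: costs generated from known (made-up) weights.
true_weights = [0.5, 2.0, 3.0, 5.0]
groups = [(0, 0, 0, 1), (1, -1, 2, 0), (3, 7, 0, 0),
          (2, 2, -2, 1), (0, 0, 0, 0), (4, -1, 1, 2)]
ols = StreamingOLS()
for g in groups:
    x = bucket_counts(g)
    ols.add(x, sum(w * n for w, n in zip(true_weights, x)))
weights = ols.solve()
print(weights)
```

The accumulator holds 20 floats no matter how many groups are fed in, which
is the property that makes one such job per CPU core practical.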

After that, run rdcost_do_avg.py. It will calculate a per-QP average of the
costs over the set of sequences that were run (i.e. for each QP, take the
results for that QP from each sequence and calculate their average). You can
use this data to fill in the default_fast_coeff_cost_wts table in
src/fast_coeff_cost.h.
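
The averaging step described above amounts to the following sketch. It is not
the code of rdcost_do_avg.py; the function name average_per_qp, the input
shape (one (qp, weights) pair per sequence and QP) and the numbers are all
made up for illustration.

```python
from collections import defaultdict


def average_per_qp(results):
    """Average 4-element weight vectors per QP.

    results: iterable of (qp, weights) pairs, one per (sequence, QP) result.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for qp, weights in results:
        for i, w in enumerate(weights):
            sums[qp][i] += w
        counts[qp] += 1
    return {qp: [s / counts[qp] for s in sums[qp]] for qp in sorted(sums)}


# Made-up numbers, not measurements: two sequences at QP 22, one at QP 27.
example = [
    (22, [1.0, 2.0, 3.0, 4.0]),
    (22, [3.0, 4.0, 5.0, 6.0]),
    (27, [0.5, 1.5, 2.5, 3.5]),
]
averages = average_per_qp(example)
for qp, row in averages.items():
    print(qp, row)
```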