Previously, the block sizes would be checked at runtime to
determine the transform size to apply for residuals. Making the block
sizes into constant expressions allows all the loops to be unrolled
and reduces branching significantly.
This results in about a 26% improvement (~18s -> ~13.2s) in speed in an
intra-heavy test video.
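As a rough sketch of the idea (function names here are hypothetical, not the decoder's actual code), dispatching once on the runtime block size into a function templated on that size gives the compiler a constant trip count it can unroll:

```cpp
// Illustrative sketch only: templating on the block size makes the trip count
// a constant expression, so the compiler can fully unroll the inner loops
// instead of branching on a runtime size.
#include <cstddef>
#include <cstdint>

template<size_t BlockSize>
void add_residuals(int16_t* dest, int16_t const* residuals, size_t stride)
{
    for (size_t row = 0; row < BlockSize; row++)
        for (size_t column = 0; column < BlockSize; column++)
            dest[row * stride + column] += residuals[row * BlockSize + column];
}

// Dispatch once on the runtime size, then run the fully-unrolled body.
void add_residuals(int16_t* dest, int16_t const* residuals, size_t stride, size_t block_size)
{
    switch (block_size) {
    case 4:  add_residuals<4>(dest, residuals, stride); break;
    case 8:  add_residuals<8>(dest, residuals, stride); break;
    case 16: add_residuals<16>(dest, residuals, stride); break;
    case 32: add_residuals<32>(dest, residuals, stride); break;
    }
}
```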
Inter-prediction convolution filters are selected based on the
subpixel position determined for the motion vector relative to the
block being predicted. Subpixel position 0 uses only a single sample at the
center of the convolution, without averaging any other samples. Let's call
this a copy.

Reference frames can also be a different size relative to the frame
being predicted, but in almost every case, that scale will be 1:1
for every single frame in a video.
Taking into account these facts, we can create multiple fast paths for
inter prediction. These fast paths are only active when scaling is 1:1.
If we are doing a copy in both dimensions, then we can do a straight
memcpy from the reference frame to the output block buffer. In videos
where there is no motion, this is a dramatic speedup.
If we are doing a copy in one dimension, we can just do one convolution
and average directly into the output block buffer.
If we aren't doing a copy in either dimension, we can still cut out a
few operations from the convolution loops, since we only need to
advance our samples by whole pixels instead of subpixels.
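A hypothetical sketch of how that dispatch might look (all names and the layout here are illustrative, not the decoder's actual code):

```cpp
// Illustrative fast-path selection for inter prediction at 1:1 scale.
#include <cstddef>
#include <cstdint>
#include <cstring>

struct BlockParams {
    uint8_t const* reference; // top-left reference sample for this block
    size_t reference_stride;
    uint8_t* output;
    size_t output_stride;
    size_t width;
    size_t height;
    bool scaled;              // reference frame is not 1:1 with this frame
    int subpixel_x;           // 0 means a "copy" in that dimension
    int subpixel_y;
};

void predict_block(BlockParams const& params)
{
    if (!params.scaled && params.subpixel_x == 0 && params.subpixel_y == 0) {
        // Copy in both dimensions: a straight row-by-row memcpy.
        for (size_t row = 0; row < params.height; row++)
            memcpy(params.output + row * params.output_stride,
                   params.reference + row * params.reference_stride,
                   params.width);
        return;
    }
    if (!params.scaled && (params.subpixel_x == 0 || params.subpixel_y == 0)) {
        // Copy in one dimension: a single convolution pass, averaging
        // directly into the output block buffer.
        // convolve_one_dimension(params); (omitted)
        return;
    }
    // General path: convolve both dimensions; at 1:1 scale the sample
    // position advances by whole pixels, so no subpixel stepping is needed.
    // convolve_both_dimensions(params); (omitted)
}
```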
These fast paths result in about a 34% improvement (~31.2s -> ~20.6s)
in a video which relies heavily on intra-predicted blocks due to high
motion. In videos with less motion, the improvement will be even
greater.
Also, note that the accumulators in these faster loops are only 16-bit.
High bit-depth videos will overflow those, so for now the fast path is
only used for 8-bit videos.
A typo caused the Y scale value to never be used, so if a reference
frame's aspect ratio didn't match up with the current frame's, it would
decode incorrectly.
Some comments have been added to clarify the frame constants used in the
function as well.
This moves all the frame size calculation to `FrameContext`, where the
subsampling is easily accessible to determine the size for each plane.
The internal framebuffer size has also been reduced to the exact frame
size that is output.
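A minimal sketch of the per-plane size calculation, assuming the usual round-up for odd frame dimensions (names here are illustrative):

```cpp
// Illustrative only: per-plane dimensions derived from the frame size and the
// chroma subsampling flags, which FrameContext now has easy access to.
#include <cstdint>

struct PlaneSize {
    uint32_t width;
    uint32_t height;
};

PlaneSize plane_size(uint32_t frame_width, uint32_t frame_height,
                     uint8_t plane, bool subsampling_x, bool subsampling_y)
{
    // Plane 0 is luma; planes 1 and 2 are chroma and may be subsampled,
    // rounding up for odd frame sizes.
    bool is_chroma = plane > 0;
    uint32_t width = is_chroma && subsampling_x ? (frame_width + 1) >> 1 : frame_width;
    uint32_t height = is_chroma && subsampling_y ? (frame_height + 1) >> 1 : frame_height;
    return { width, height };
}
```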
The division in the `round_mv_...()` functions used in the motion vector
selection process was done by shifting right. However, since an arithmetic
right shift rounds negative values towards negative infinity, it was flooring
instead of rounding.
This changes it to match the spec and rely on the compiler to simplify
down to a bit shift.
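A small illustration of the difference (the shift amount and rounding constants here are examples, not the exact values from the decoder):

```cpp
#include <cassert>

int floored_divide(int value)
{
    return value >> 2; // arithmetic shift: floors towards negative infinity
}

int rounded_divide(int value)
{
    // Round half away from zero, then divide truncating towards zero; the
    // compiler can still reduce this to shifts and a sign correction.
    return (value < 0 ? value - 2 : value + 2) / 4;
}

int main()
{
    // -5 / 4 = -1.25: the shift floors to -2, rounding gives -1.
    assert(floored_divide(-5) == -2);
    assert(rounded_divide(-5) == -1);
    // 6 / 4 = 1.5: the shift floors to 1, rounding away from zero gives 2.
    assert(floored_divide(6) == 1);
    assert(rounded_divide(6) == 2);
    return 0;
}
```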
Extending the borders of reference frames so that motion vectors pointing
outside the reference frame still read valid samples allows
`predict_inter_block()` to avoid some branches that clamp the sample
coordinates in its loops.
This results in about a 25% improvement in decode time of a motion-
heavy YouTube video (~20.8s -> ~15.6s).
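A sketch of the border extension idea, under the assumption of simple edge replication (names and layout are illustrative, not the decoder's actual code):

```cpp
// Replicating a frame's edge samples into a margin around it means motion
// vectors that point past the edge still land on valid data, so the
// prediction loops no longer need to clamp each coordinate.
#include <cstddef>
#include <cstdint>
#include <vector>

// 'plane' holds (height + 2 * margin) rows of (width + 2 * margin) samples,
// with the real frame starting at offset (margin, margin).
void extend_borders(std::vector<uint8_t>& plane, size_t width, size_t height, size_t margin)
{
    size_t const stride = width + 2 * margin;
    auto sample_at = [&](size_t x, size_t y) -> uint8_t& { return plane[y * stride + x]; };

    // Extend each row leftwards and rightwards.
    for (size_t y = margin; y < margin + height; y++) {
        for (size_t x = 0; x < margin; x++) {
            sample_at(x, y) = sample_at(margin, y);
            sample_at(margin + width + x, y) = sample_at(margin + width - 1, y);
        }
    }
    // Extend the top and bottom rows (including the corners just filled).
    for (size_t y = 0; y < margin; y++) {
        for (size_t x = 0; x < stride; x++) {
            sample_at(x, y) = sample_at(x, margin);
            sample_at(x, margin + height + y) = sample_at(x, margin + height - 1);
        }
    }
}
```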
Moving the clamping of the reference frame sample coordinates, as well as some
bounds checks, outside of the loop significantly reduces the branches needed
in `predict_inter_block()`.
This results in a whopping ~41% improvement in decode performance
of an inter-prediction-heavy YouTube video (~35.4s -> ~20.8s).
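One way to picture the change (illustrative only, with hypothetical names): check the block's bounds once, then run a branch-free inner loop whenever the block lies fully inside the reference frame:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

void copy_block(uint8_t* out, size_t out_stride,
                uint8_t const* ref, size_t ref_stride,
                int ref_width, int ref_height,
                int start_x, int start_y, int width, int height)
{
    bool fully_inside = start_x >= 0 && start_y >= 0
        && start_x + width <= ref_width && start_y + height <= ref_height;

    if (fully_inside) {
        // Fast path: no clamping branches in the inner loops.
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                out[y * out_stride + x] = ref[(start_y + y) * ref_stride + (start_x + x)];
        return;
    }

    // Slow path: clamp each coordinate to the frame edges.
    for (int y = 0; y < height; y++) {
        int sample_y = std::clamp(start_y + y, 0, ref_height - 1);
        for (int x = 0; x < width; x++) {
            int sample_x = std::clamp(start_x + x, 0, ref_width - 1);
            out[y * out_stride + x] = ref[sample_y * ref_stride + sample_x];
        }
    }
}
```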
Changing the calculation of reference frame scale factors to be done on
a per-frame basis reduces the amount of work done in
`predict_inter_block()`, which is a big hotspot in most videos.
This reduces decode times in a test video from YouTube by about 5%
(~37.2s -> ~35.4s).
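A sketch of the per-frame calculation (the fixed-point representation and names here are assumptions for illustration):

```cpp
// Compute the reference-to-current scale once per frame and store it,
// instead of re-deriving it for every predicted block.
#include <cstdint>

struct ScaleFactors {
    // Fixed-point scale with 14 fractional bits (an assumed representation).
    int32_t x_scale;
    int32_t y_scale;
};

constexpr int32_t SCALE_SHIFT = 14;

ScaleFactors compute_scale_factors(uint32_t ref_width, uint32_t ref_height,
                                   uint32_t frame_width, uint32_t frame_height)
{
    return {
        static_cast<int32_t>((static_cast<int64_t>(ref_width) << SCALE_SHIFT) / frame_width),
        static_cast<int32_t>((static_cast<int64_t>(ref_height) << SCALE_SHIFT) / frame_height),
    };
}
// Stored per reference frame when the frame header is parsed, then reused by
// every block that predicts from that reference.
```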
This changes the order of the loop copying data to a reference frame
store so that it copies each row in a contiguous line rather than
copying a column at a time, which caused unnecessary branches.
This reduces the decode time on a fairly long 720p YouTube video by
about 14.5% (~43.5s -> ~37.2s).
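An illustrative sketch of the row-contiguous copy (names are hypothetical):

```cpp
// Copying a row at a time keeps each copy contiguous in memory, while the old
// column-at-a-time order touched one sample per iteration and paid for the
// loop branches (and cache misses) on every one.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

void store_reference_plane(std::vector<uint8_t>& store, uint8_t const* frame,
                           size_t width, size_t height, size_t frame_stride)
{
    store.resize(width * height);
    for (size_t row = 0; row < height; row++)
        memcpy(store.data() + row * width, frame + row * frame_stride, width);
}
```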
Checking the bounds of the intermediate values was only implemented to
help debug the decoder. However, it is non-fatal to have the values
exceed the spec-defined bounds, and causes a measurable performance
reduction.
Additionally, the checks were implemented as an assertion, which bad input
files could easily trigger.
I see about a 4-5% decrease in decoding times in the `webm_in_vp9` test
in TestVP9Decode.
There were rare cases in which u8 was not large enough for the total
count of values read, and increasing this to u32 should have no real
effect on performance (hopefully).
Only the residual tokens array needs to be kept for the transforms to use
after all the tokens have been parsed. The token cache can be kept on the
stack just for the duration of the token parsing loop.
Since the enum is used as an index to arrays, it unfortunately can't
be converted to an enum class, but at least we can make sure to use it
with the qualified enum name to make things a bit clearer.
Moving these to another header allows Parser.h to include fewer of the context
structs/classes that were previously in Context.h.
This change will also allow consolidating some common calculations into
Context.h, since we won't be polluting the VP9 namespace as much. There
are quite a few duplicate calculations for block size, transform size,
number of horizontal and vertical sub-blocks per block, all of which
could be moved to Context.h to allow for code deduplication and more
semantic code where those calculations are needed.
Those previous constants were only set and used to select the first and
second transforms done by the Decoder class. By turning it into a
struct, we can make the code a bit more legible while keeping those
transform modes the same size as before or smaller.
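A rough sketch of the shape this takes (the type and field names are assumptions, not necessarily the actual ones):

```cpp
#include <cstdint>

enum class TransformType : uint8_t {
    DCT,
    ADST,
};

// Replaces the pair of loose constants with one small, self-describing value
// selecting the first and second transforms applied to a block's residuals.
struct TransformSet {
    TransformType first_transform;
    TransformType second_transform;
};
```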
The sub-block transform types are set and then used in a very small scope, so
now they are just stored in a variable and passed to the two functions that
need them, Parser::tokens() and Decoder::reconstruct().
Note that some of the previous segmentation feature settings must be
preserved when a frame is decoded that doesn't use segmentation.
This change also allowed a few functions in Decoder to be made static.
Previously, we were using size_t, often coerced from bool or u8, to
index reference pairs. Now, they must either be taken directly from
named fields or indexed using the `ReferenceIndex` enum with options
`primary` and `secondary`. With a more explicit method of indexing
these, the compiler can aid in using reference pairs correctly, and
fuzzers may be able to detect undefined behavior more easily.
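A sketch of the idea (the exact types in the decoder may differ):

```cpp
#include <cstdint>

enum class ReferenceIndex : uint8_t {
    primary,
    secondary,
};

// A pair that can only be indexed by name or by the enum above, so a stray
// integer (or bool) can no longer silently pick the wrong reference.
template<typename T>
struct ReferencePair {
    T primary;
    T secondary;

    T& operator[](ReferenceIndex index)
    {
        return index == ReferenceIndex::primary ? primary : secondary;
    }
    T const& operator[](ReferenceIndex index) const
    {
        return index == ReferenceIndex::primary ? primary : secondary;
    }
};
```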
This also renames most of the related quantizer functions and variables to
make more sense. The AC/DC naming comes from transform coding: DC refers to
the first (zero-frequency) coefficient of a block and AC to the remaining
coefficients, which VP9 quantizes with separate parameters.
The color config is reused for most inter predicted frames, so we use a
struct ColorConfig to store the config from intra frames, and put it in
a field in Parser to copy from when an inter frame without color config
is encountered.
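A sketch of roughly what such a struct carries, based on the information in the uncompressed header's color config (the exact names are assumptions):

```cpp
#include <cstdint>

enum class ColorSpace : uint8_t {
    Unknown,
    BT601,
    BT709,
    BT2020,
    RGB,
};

enum class ColorRange : uint8_t {
    Studio,
    Full,
};

struct ColorConfig {
    uint8_t bit_depth { 8 };
    ColorSpace color_space { ColorSpace::Unknown };
    ColorRange color_range { ColorRange::Studio };
    bool subsampling_x { true };
    bool subsampling_y { true };
};
```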
There are three mutually exclusive frame-showing states:
- Show no new frame, only store the frame as a reference.
- Show a newly decoded frame.
- Show frame from the reference frame store.
Since they are mutually exclusive, using an enum rather than two bools
makes more sense.
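A sketch of such an enum (the name and enumerators here are illustrative):

```cpp
enum class FrameShowMode {
    DoNotShowFrame,     // only store the frame as a reference
    ShowNewFrame,       // show the newly decoded frame
    ShowExistingFrame,  // show a frame from the reference frame store
};
```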
These are used to pass context needed for decoding, with mutability
scoped only to the sections that the function receiving the contexts
needs to modify. This allows lifetimes of data to be more explicit
rather than being stored in fields, as well as preventing tile threads
from modifying outside their allowed bounds.
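A rough sketch of the layering this describes (names and members are illustrative only):

```cpp
#include <cstdint>

struct FrameContext {
    uint32_t width { 0 };
    uint32_t height { 0 };
    // ... frame-wide state, shared read-only by all tiles ...
};

struct TileContext {
    FrameContext const& frame;    // tiles may read, but not modify, frame state
    uint32_t rows_start { 0 };
    uint32_t rows_end { 0 };
    uint32_t columns_start { 0 };
    uint32_t columns_end { 0 };
    // ... per-tile state that the tile thread is allowed to mutate ...
};

struct BlockContext {
    TileContext& tile;            // blocks mutate only their own tile's state
    FrameContext const& frame;
    uint32_t row { 0 };
    uint32_t column { 0 };
};
```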
These are now passed as parameters to each function that uses them.
These will later be moved to a struct to further reduce the amount of
parameters that get passed around.
Above and left per-frame block contexts are now also parameters passed
to the functions that use them instead of being retrieved when needed
from a field. This will allow them to be more easily moved to a tile-
specific context later.
There are three fields that we need to store from FrameBlockContext to keep
between frames; they are used when parsing those same fields for the next
frame.
The function serves no purpose now; any debug information we want to pull from
the decoder should instead be accessed through some other interface that is
yet to be created.
All state that needed to persist between calls to decode_block was
previously stored in plain Vector fields. This moves them into a struct
which sets a more explicit lifetime on that data. It may be possible to
store this data on the stack of a function with the appropriate
lifetime now that it is split into its own struct.
This has two benefits:
- I observed a ~34% decrease in decoding time running TestVP9Decode.
- Removing all of these silly Vector fields helps simplify the code
relationships between all the functions in Decoder.cpp. It'll also be
much easier to make these static with template specializations, if
that turns out to be a worthwhile performance improvement.
The two different mode sets are stored in single fields, and the
underlying values didn't overlap, so there was no reason to keep them
separate.
The enum is now an enum class as well, to enforce that almost all uses
of the enum are named. The only case where underlying values are used
is in lookup tables, but it may be worth abstracting that as well to
make array bounds more clear.
Frames will now be queued for retrieval by the user of the decoder.
When the end of the current queue is reached, a DecoderError of
category NeedsMoreInput will be emitted, allowing the caller to react
by displaying what was previously retrieved before sending more samples.
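A hypothetical sketch of the calling pattern (all names here are placeholders, not the decoder's real API):

```cpp
#include <deque>
#include <optional>

enum class DecoderErrorCategory { Corrupted, NeedsMoreInput };

struct Frame { /* decoded image data */ };

struct QueueingDecoder {
    std::deque<Frame> frame_queue;

    // Returns a frame, or sets the error category when none are queued.
    std::optional<Frame> get_next_frame(DecoderErrorCategory& error)
    {
        if (frame_queue.empty()) {
            error = DecoderErrorCategory::NeedsMoreInput;
            return std::nullopt;
        }
        Frame frame = frame_queue.front();
        frame_queue.pop_front();
        return frame;
    }
};

void drain_queue(QueueingDecoder& decoder)
{
    auto error = DecoderErrorCategory::Corrupted;
    while (true) {
        auto frame = decoder.get_next_frame(error);
        if (!frame.has_value())
            break;
        // display_frame(*frame);
    }
    if (error == DecoderErrorCategory::NeedsMoreInput) {
        // Show what was retrieved so far, then feed the decoder the next
        // sample and call drain_queue() again.
    }
}
```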
The class is virtual and has one subclass, SubsampledYUVFrame, which
is used by the VP9 decoder to return a single frame. The
output_to_bitmap(Bitmap&) function can be used to set pixels on an
existing bitmap of the correct size to the RGB values that
should be displayed. The to_bitmap() function will allocate a new bitmap
and fill it using output_to_bitmap.
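A sketch of the interface described above (the class is called VideoFrame here for illustration, and the bitmap type is stood in by a minimal stub rather than the system's real bitmap class):

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Minimal stand-in for the real bitmap class.
struct Bitmap {
    int width { 0 };
    int height { 0 };
    std::vector<uint32_t> pixels; // packed RGB, one value per pixel
};

class VideoFrame {
public:
    virtual ~VideoFrame() = default;

    // Write RGB values into an existing bitmap of the correct size.
    virtual void output_to_bitmap(Bitmap& bitmap) = 0;

    // Allocate a new bitmap and fill it using output_to_bitmap().
    std::unique_ptr<Bitmap> to_bitmap()
    {
        auto bitmap = std::make_unique<Bitmap>();
        bitmap->width = m_width;
        bitmap->height = m_height;
        bitmap->pixels.resize(static_cast<size_t>(m_width) * m_height);
        output_to_bitmap(*bitmap);
        return bitmap;
    }

protected:
    int m_width { 0 };
    int m_height { 0 };
};

// SubsampledYUVFrame derives from this to output the VP9 decoder's frames.
```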
This new class also implements bilinear scaling of the subsampled U and
V planes so that subsampled videos' colors will appear smoother.
This adds a struct called CodingIndependentCodePoints and related enums that
video codecs use to define the color space that frames must be converted from
when displaying a video.
Pre-multiplied matrices and lookup tables are stored to avoid most of
the floating point division and exponentiation in the conversion.
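A sketch of the general shape, carrying the four pieces of information that CICP (ITU-T H.273) defines; the enumerator lists are abbreviated, and the exact names in the decoder may differ:

```cpp
#include <cstdint>

enum class ColorPrimaries : uint8_t { BT709 = 1, Unspecified = 2, BT601 = 6, BT2020 = 9 };
enum class TransferCharacteristics : uint8_t { BT709 = 1, Unspecified = 2, SRGB = 13, SMPTE2084 = 16 };
enum class MatrixCoefficients : uint8_t { Identity = 0, BT709 = 1, Unspecified = 2, BT601 = 6, BT2020NonConstantLuminance = 9 };
enum class VideoFullRangeFlag : uint8_t { Studio = 0, Full = 1 };

struct CodingIndependentCodePoints {
    ColorPrimaries color_primaries { ColorPrimaries::Unspecified };
    TransferCharacteristics transfer_characteristics { TransferCharacteristics::Unspecified };
    MatrixCoefficients matrix_coefficients { MatrixCoefficients::Unspecified };
    VideoFullRangeFlag video_full_range_flag { VideoFullRangeFlag::Studio };
};
```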