Age | Commit message (Collapse) | Author |
|
This is a proof of concept RTCD implementation to replace the current
system of nested includes, prototypes, INVOKE macros, etc. Currently
only the decoder specific functions are implemented in the new system.
Additional functions will be added in subsequent commits.
Overview:
RTCD "functions" are implemented as either a global function pointer
or a macro (when only one eligible specialization available).
Functions which have RTCD specializations are listed using a simple
DSL identifying the function's base name, its prototype, and the
architecture extensions that specializations are available for.
Advantages over the old system:
- No INVOKE macros. A call to an RTCD function looks like an ordinary
function call.
- No need to pass vtables around.
- If there is only one eligible function to call, the function is
called directly, rather than indirecting through a function pointer.
- Supports the notion of "required" extensions, so in combination with
the above, on x86_64 if the best function available is sse2 or lower
it will be called directly, since all x86_64 platforms implement
sse2.
- Elides all references to functions which will never be called, which
could reduce binary size. For example if sse2 is required and there
are both mmx and sse2 implementations of a certain function, the
code will have no link time references to the mmx code.
- Significantly easier to add a new function, just one file to edit.
Disadvantages:
- Requires global writable data (though this is not a new requirement)
- 1 new generated source file.
Change-Id: Iae6edab65315f79c168485c96872641c5aa09d55
|
|
A processor with ARMv7 instructions does not
necessarily have NEON dsp extensions. This CL
has the added side effect of allowing the ability
to enable/disable the dsp extensions cleanly.
Change-Id: Ie1e879b8fe131885bc3d4138a0acc9ffe73a36df
|
|
These functions are now used by the encoder.
This is WIP with the goal of creating a common idct/add for
the encoder and decoder. A boost of 1.8% was seen for
the HD rt test clip used.
[Tero] Added needed changes to ARM side.
Change-Id: Ibbb8000be09034203d7adffc457d3c3f8b06a5bf
|
|
to the dqcoeff or qcoeff buffer. The encoder would
populate the dc coeffs of the y blocks as a separate
stage (recon_dcblock) and the decoder would use a special
version of the idct. This change eliminates the extra copy
and reduces the code footprint.
[Tero] Added needed changes to armv6 and NEON assembly.
Change-Id: I83202ffdbaf83f6e5dd69f4ba2519fcf0b13b3ba
|
|
Instead of using the predict buffer, the decoder now writes
the predictor into the recon buffer. For blocks with eob=0,
unnecessary idcts can be eliminated. This gave a performance
boost of ~1.8% for the HD clips used.
Tero: Added needed changes to ARM side and scheduled some
assembly code to prevent interlocks.
Patch Set 6: Merged (I1bcdca7a95aacc3a181b9faa6b10e3a71ee24df3)
into this commit because of similarities in the idct
functions.
Patch Set 7: EC bug fix.
Change-Id: Ie31d90b5d3522e1108163f2ac491e455e3f955e6
|
|
Just a clean-up.
Change-Id: Iea5b6dc925dcfa7db548bc1ab1a13d26ed5a2c9a
|
|
Change-Id: I7e6bc28e7974a376da747300744e0dd5dc1d21e9
|
|
armv5 dequantizer is not referenced
Change-Id: Id1cc617dcee35ebd6a406816ec6aaa26e8bbc8ad
|
|
Change-Id: I3683cb87e9cb7c36fc22c1d70f0799c7c46a21df
|
|
The current code stores pointers to coefficient tables and loads them to
access the tables contents. As these pointers are stored in the code
sections, it means we end up with text relocations. eu-findtextrel will
thus complain about code not compiled with -fpic/-fPIC.
Since the pointers are stored in the code sections, we can actually cheat
and let the assembler generate relative addressing when accessing the
coefficient tables, and just load their location with adr.
Change-Id: Ib74ae2d3f2bab80b29991355f2dbe6955f38f6ae
|
|
Change-Id: I9467d7a50eac32d8e8f3a2f26db818e47c93c94b
|
|
Change-Id: I64fa47889c54cfed094a674c49ef0996d49bdd42
|
|
|
|
hasn't been kept up to date. remove it to avoid confusion.
Change-Id: I52ffde19b59fec5c7a381299ca2e85cb38330be7
|
|
Allow compiling without adding vp8/{common,encoder,decoder} to the
include paths.
Change-Id: Ifeb5dac351cdfadcd659736f5158b315a0030b6c
|
|
it's difficult to mux the *_offsets.c files because of header conflicts.
make three instead, name them consistently and partititon the contents
to allow building them as required.
Change-Id: I8f9768c09279f934f44b6c5b0ec363f7943bb796
|
|
common/arm/vpx_asm_offsets moves up a level. prepare for muxing with
encoder/arm/vpx_vp8_enc_asm_offsets
Change-Id: I89a04a5235447e66571995c9d9b4b6edcb038e24
|
|
we were holding on to this "just in case." purge it instead
Change-Id: I77a367b36d0821d731019f2566ecfffdae1d4b8a
|
|
This eliminates a large set of warnings exposed by the Mozilla build
system (Use of C++ comments in ISO C90 source, commas at the end of
enum lists, a couple incomplete initializers, and signed/unsigned
comparisons).
It also eliminates many (but not all) of the warnings expose by newer
GCC versions and _FORTIFY_SOURCE (e.g., calling fread and fwrite
without checking the return values).
There are a few spurious warnings left on my system:
../vp8/encoder/encodemb.c:274:9: warning: 'sz' may be used
uninitialized in this function
gcc seems to be unable to figure out that the value shortcut doesn't
change between the two if blocks that test it here.
../vp8/encoder/onyx_if.c:5314:5: warning: comparison of unsigned
expression >= 0 is always true
../vp8/encoder/onyx_if.c:5319:5: warning: comparison of unsigned
expression >= 0 is always true
This is true, so far as it goes, but it's comparing against an enum, and the C
standard does not mandate that enums be unsigned, so the checks can't be
removed.
Change-Id: Iaf689ae3e3d0ddc5ade00faa474debe73b8d3395
|
|
The primary goal is to allow a binary to be built which supports
NEON, but can fall back to non-NEON routines, since some Android
devices do not have NEON, even if they are otherwise ARMv7 (e.g.,
Tegra).
The configure-generated flags HAVE_ARMV7, etc., are used to decide
which versions of each function to build, and when
CONFIG_RUNTIME_CPU_DETECT is enabled, the correct version is chosen
at run time.
In order for this to work, the CFLAGS must be set to something
appropriate (e.g., without -mfpu=neon for ARMv7, and with
appropriate -march and -mcpu for even earlier configurations), or
the native C code will not be able to run.
The ASFLAGS must remain set for the most advanced instruction set
required at build time, since the ARM assembler will refuse to emit
them otherwise.
I have not attempted to make any changes to configure to do this
automatically.
Doing so will probably require the addition of new configure options.
Many of the hooks for RTCD on ARM were already there, but a lot of
the code had bit-rotted, and a good deal of the ARM-specific code
is not integrated into the RTCD structs at all.
I did not try to resolve the latter, merely to add the minimal amount
of protection around them to allow RTCD to work.
Those functions that were called based on an ifdef at the calling
site were expanded to check the RTCD flags at that site, but they
should be added to an RTCD struct somewhere in the future.
The functions invoked with global function pointers still are, but
these should be moved into an RTCD struct for thread safety (I
believe every platform currently supported has atomic pointer
stores, but this is not guaranteed).
The encoder's boolhuff functions did not even have _c and armv7
suffixes, and the correct version was resolved at link time.
The token packing functions did have appropriate suffixes, but the
version was selected with a define, with no associated RTCD struct.
However, for both of these, the only armv7 instruction they actually
used was rbit, and this was completely superfluous, so I reworked
them to avoid it.
The only non-ARMv4 instruction remaining in them is clz, which is
ARMv5 (not even ARMv5TE is required).
Considering that there are no ARM-specific configs which are not at
least ARMv5TE, I did not try to detect these at runtime, and simply
enable them for ARMv5 and above.
Finally, the NEON register saving code was completely non-reentrant,
since it saved the registers to a global, static variable.
I moved the storage for this onto the stack.
A single binary built with this code was tested on an ARM11 (ARMv6)
and a Cortex A8 (ARMv7 w/NEON), for both the encoder and decoder,
and produced identical output, while using the correct accelerated
functions on each.
I did not test on any earlier processors.
Change-Id: I45cbd63a614f4554c3b325c45d46c0806f009eaa
|
|
Most of the code that actually uses these matrices indexes them as
if they were a single contiguous array, and coverity produces
reports about the resulting accesses that overflow the static
bounds of the first row.
This is perfectly legal in C, but converting them to actual [16]
arrays should eliminate the report, and removes a good deal of
extraneous indexing and address operators from the code.
Change-Id: Ibda479e2232b3e51f9edf3b355b8640520fdbf23
|
|
the previous commit laid the groundwork by doing two sets of idcts
together. this moved that further by grouping the interesting data
(q[0], q+16[0]) together to allow using wider instructions. also
managed to drop a few instructions by recognizing that the constant
for sinpi8sqrt2 could be downshifted all the time which avoided a
dowshift as well as workarounds for a function which only accepted
signed data
looks like a modest gain for performance: at qcif, went from ~180
fps to ~183
Change-Id: I842673f3080b8239e026cc9b50346dbccbab4adf
|
|
Expand 93c32a55 which used SSE2 instructions to do two
idct/dequant/recons at a time to NEON. Initial working
commit. More work needs to be put into rearranging and
interlacing the data to take advantage of quadword
operations, which is when we'll hopefully see a much
better boost
Change-Id: I86d59d96f15e0d0f9710253e2c098ac2ff2865d1
|
|
Changes 'The VP8 project' to 'The WebM project', for consistency
with other webmproject.org repositories.
Fixes issue #97.
Change-Id: I37c13ed5fbdb9d334ceef71c6350e9febed9bbba
|
|
make the arm asm detokenizer work with the new structures
Change-Id: I7cd92c2a018ec24032bb1cfd1bb9739bc84b444a
|
|
Moving the eob structure allows for a non-struct based
function to handle decoding an entire mb of
idct/dequant/recon data. This allows for SIMD functions
to idct/dequant/recon multiple blocks at once.
SSE2 implementation gives 3% gain on Atom.
Change-Id: I8a8f3efd546ea4e0535f517d94f347cfb737c9c2
|
|
adds a compile time option: --enable-arm-asm-detok which pulls in
vp8/decoder/arm/detokenize.asm
currently about break even speed wise, but changes are pending to
the fill code (branch and load 3 bytes versus conditionally always
load one) and the error handling. Currently it doesn't handle zero
runs or overrunning the buffer.
this is really just so i don't have to rebase my changes all the
time to run benchmarks - now just need to replace one file!
Change-Id: I56d0e2354dc0ca3811bffd0e88fe1f952fa6c797
|
|
Jeff Muizelaar posted some changes to the idct/reconstruction c code.
This is the equivalent update for the arm assembly.
This shows a good boost on v6, and a minor boost on neon.
Here are some numbers for highway in qcif, 2641 frames:
HEAD neon: ~161 fps
new neon: ~162 fps
HEAD v6: ~102 fps
new v6: ~106 fps
The following functions have been updated for armv6 and neon:
vp8_dc_only_idct_add
vp8_dequant_idct_add
vp8_dequant_dc_idct_add
Conflicts:
vp8/decoder/arm/armv6/dequantdcidct_v6.asm
vp8/decoder/arm/armv6/dequantidct_v6.asm
Resolved by removing these files. When I rewrote the functions, I also
moved the files to dequant_dc_idct_v6.asm/dequant_idct_v6.asm
Change-Id: Ie3300df824d52474eca1a5134cf22d8b7809a5d4
|
|
This moves the prediction step before the idct and combines the idct and
reconstruction steps into a single step. Combining them seems to give an
overall decoder performance improvement of about 1%.
Change-Id: I90d8b167ec70d79c7ba2ee484106a78b3d16e318
|
|
These files were out of date and no longer maintained.
Token decoding has implemented the no-crash code which
is incompatible with this arm assembly code.
Change-Id: Ibf729886c56fca48181af60b44bda896c30023fc
|
|
When the license headers were updated, they accidentally contained
trailing whitespace, so unfortunately we have to touch all the files
again.
Change-Id: I236c05fade06589e417179c0444cb39b09e4200d
|
|
Change bitreading functions to use a larger window which is refilled less
often.
This makes it cheap enough to do bounds checking each time the window is
refilled, which avoids the need to copy the input into a large circular
buffer.
This uses less memory and speeds up the total decode time by 1.6% on an ARM11,
2.8% on a Cortex A8, and 2.2% on x86-32, but less than 1% on x86-64.
Inlining vp8dx_bool_decoder_fill() has a big penalty on x86-32, as does moving
the refill loop to the front of vp8dx_decode_bool().
However, having the refill loop between computation of the split values and
the branch in vp8_decode_mb_tokens() is a big win on ARM (presumably due to
memory latency and code size: refilling after normalization duplicates the
code in the DECODE_AND_BRANCH_IF_ZERO and DECODE_AND_LOOP_IF_ZERO cases.
Unfortunately, refilling at the end of vp8dx_bool_decoder_fill() and at the
beginning of each decode step in vp8_decode_mb_tokens() means the latter
requires an extra refill at the end.
Platform-specific versions could avoid the problem, but would require most of
detokenize.c to be duplicated.
Change-Id: I16c782a63376f2a15b78f8086d899b987204c1c7
|
|
Change-Id: Ieebea089095d9073b3a94932791099f614ce120c
|
|
|