Age | Commit message (Collapse) | Author |
|
Reduces the number of rows calculated for 2D 4-tap interpolation filter
from h+7 rows to h+3 rows.
Also fixes a bug in the avx2 function for 4-tap filters where the last
row is computed incorrectly.
Performance:
           | Baseline | Result   | Pct Gain |
bitdepth lo| 4.00 fps | 4.02 fps | 0.5%     |
bitdepth 10| 1.90 fps | 1.91 fps | 0.5%     |
The performance is evaluated on speed 1 on jets.y4m br 500 over 100
frames.
No BDBR loss is observed.
Change-Id: I90b0d4d697319b7bba599f03c5dc01abd85d13b1
|
|
BUG=webm:1584
Change-Id: I1be768446b9304123da7b1ea0aed0db056db31c5
|
|
|
|
|
|
BUG=webm:1584
Change-Id: Iaba854952534a95e710a985acfcab46e093872c2
|
|
BUG=webm:1584
Change-Id: Ia2d9fcbccbad0c2142a3759e610670b86af0fef4
|
|
BUG=webm:1584
Change-Id: I5990c0100af83d13f7a4800147473bc997f5e5d1
|
|
|
|
BUG=webm:1584
Change-Id: I48b9a9cdcfe52536f685c41fb2d3c0f3e9192d34
|
|
vpx_asm_stubs.c only references these sse2 functions. Combine the files
similar to the way the ssse3/avx2 files are set up.
Mark the intrinsics as static because they are only used within the
macros here. It is unfortunate that the assembly functions cannot be
marked static as well.
BUG=webm:1584
Change-Id: I342687a1046ae6ca46ae58644a7c170440de1dfb
|
|
BUG=webm:1584
Change-Id: I92504ed4a2e54129c981b7380249962afb7966df
|
|
|
|
BUG=webm:1584
Change-Id: Ia3f152bf2a37f8a1ea4178eeb1a6a262ea034a8d
|
|
The optimizations were accidentally disabled during the move from vp9:
commit c3bdffb0a508ad08d5dfa613c029f368d4293d4c
author Johann <johannkoenig@google.com> Fri May 15 18:52:03 2015
Move variance functions to vpx_dsp
subpel functions will be moved in another patch.
BUG=webm:1584
Change-Id: Ia7899ee0cfad13a0e1516b89756552064846e81c
|
|
|
|
Speed test:
[ RUN ] C/HadamardHighbdTest.DISABLED_Speed/2
Hadamard32x32[ 10 runs]: 9 us
Hadamard32x32[ 10000 runs]: 8914 us
Hadamard32x32[ 10000000 runs]: 8991776 us
[ RUN ] AVX2/HadamardHighbdTest.DISABLED_Speed/2
Hadamard32x32[ 10 runs]: 5 us
Hadamard32x32[ 10000 runs]: 4582 us
Hadamard32x32[ 10000000 runs]: 4548203 us
Change-Id: Ied1b38b510bd033299f05869216d394e3b7f70f1
|
|
Speed Test:
C/SatdHighbdTest
blocksize: 16 time: 138 us
blocksize: 64 time: 315 us
blocksize: 256 time: 1120 us
blocksize: 1024 time: 3955 us
AVX2/SatdHighbdTest
blocksize: 16 time: 89 us
blocksize: 64 time: 189 us
blocksize: 256 time: 590 us
blocksize: 1024 time: 1912 us
Change-Id: I6357174462fccd589a475b13d8114b853cab5383
|
|
Speed test:
[ RUN ] C/HadamardHighbdTest.DISABLED_Speed/1
Hadamard16x16[ 10 runs]: 2 us
Hadamard16x16[ 10000 runs]: 1836 us
Hadamard16x16[ 10000000 runs]: 1829451 us
[ RUN ] AVX2/HadamardHighbdTest.DISABLED_Speed/1
Hadamard16x16[ 10 runs]: 1 us
Hadamard16x16[ 10000 runs]: 1009 us
Hadamard16x16[ 10000000 runs]: 984856 us
Change-Id: I89b9cdbe19350815576d66e627df87e5025ed0a4
|
|
Speed tests:
[ RUN ] C/HadamardHighbdTest.DISABLED_Speed/0
Hadamard8x8[ 10 runs]: 0 us
Hadamard8x8[ 10000 runs]: 316 us
Hadamard8x8[ 10000000 runs]: 311749 us
[ OK ] C/HadamardHighbdTest.DISABLED_Speed/0 (371 ms)
[ RUN ] AVX2/HadamardHighbdTest.DISABLED_Speed/0
Hadamard8x8[ 10 runs]: 0 us
Hadamard8x8[ 10000 runs]: 161 us
Hadamard8x8[ 10000000 runs]: 156910 us
[ OK ] AVX2/HadamardHighbdTest.DISABLED_Speed/0 (160 ms)
Change-Id: I94f7324be20405ff55f8a02ad4651c4ab4c10202
|
|
This slows down low bitdepth builds but is necessary to obtain correct
values.
BUG=webm:1448
Change-Id: I4ca9145f576089bb8496fcfeedeb556dc8fe6574
|
|
Calculate the high bits of dqcoeff and store them appropriately in high
bit depth builds.
Low bit depth builds still do not pass. C truncates the results after
division. X86 only supports packing with saturation at this step.
BUG=webm:1448
Change-Id: Ic80def575136c7ca37edf18d21e26925b475da98
|
|
Calculate the high bits of dqcoeff in high bit depth builds and store
them appropriately.
BUG=webm:1448
Change-Id: I61a2f8bfcf2e30765f10a94073c4d58321d2fa24
|
|
Pave the way for new quantize_OPT.h helper files.
Change-Id: Ice7225612983f5587a9660af3320c7d0c8bb1c2f
|
|
|
|
|
|
BUG=webm:1444
Change-Id: Iee19be068afc6c81396c79218a89c469d2e66207
|
|
Always use src/ref and _ptr/_stride suffixes.
Normalize to [xy]_offset and second_pred.
Drop some stray source/recon_strides.
BUG=webm:1444
Change-Id: I32362a50988eb84464ab78686348610ea40e5c80
|
|
Unit test performance on bitdepth 10:
    | 4X4 | 8X8 |16X16|64X64|
2D  |1.582|1.461|1.425|1.572|
HORZ|1.643|1.247|1.346|1.345|
VERT|1.378|1.695|2.020|1.763|
Unit test performance on bitdepth 12:
    | 4X4 | 8X8 |16X16|64X64|
2D  |1.578|1.409|1.426|1.497|
HORZ|1.625|1.153|1.323|1.259|
VERT|1.392|1.707|2.030|1.787|
Change-Id: I6df85330ac33fcb17d46e4302b41415dda1219f5
|
|
About 10% faster on 64-bit but about 10% slower on 32-bit.
Removes the assembly usage of vpx_rv.
Change-Id: I214698fb5677f615dee0a8f5f5bb8f64daf2565e
|
|
Speed gain:
BIT DEPTH | 8TAP FPS | 4TAP FPS | PCT INC |
10        | 1.69     | 1.85     | 9.46%   |
12        | 1.64     | 1.78     | 8.54%   |
Speed test is done on jet.y4m on speed 1 profile 2 over 100 frames with
br=500.
Change-Id: I411e122553e2c466be7a26e64b4dd144efb884a9
|
|
~20% faster than the MMX version. Removes the last usage of
vp8_bilinear_filters_x86_[48].
Change-Id: Iee976fab9655d0020440f26c4403ce50103af913
|
|
|
|
Performance:
     | 4X4 | 8X8 |16X16|64X64|
2 DIM|1.491|1.902|1.772|1.479|
 HORZ|1.145|1.521|1.757|1.497|
 VERT|1.176|1.614|1.707|1.467|
Each number in the chart above is 8-tap function time / 4-tap function time.
The framerate tested on jets.y4m for 100 frames on speed 1 increased from 3.72
fps to 3.91 fps (about 5% increase).
Change-Id: Ic0ad275cf32fafeefd0a89811badd8adff2134a0
|
|
Remove unnecessary includes and reword some functions/comments.
Change-Id: Ied557d7faa9d845d38255e6e3e0e3fe1395276e1
|
|
AVX2's 8-tap filter is slightly faster than 4-tap SSSE3 filter.
Change-Id: I5fc37c431670780108706b206b32c791828555c9
|
|
Performance:
     | 4X4 | 8X8 |16X16|64X64|
2 DIM|1.526|1.827|1.844|1.906|
 HORZ|1.336|1.795|1.886|1.654|
 VERT|1.443|1.539|2.139|2.190|
The ratio is SSSE3 8-tap time / SSSE3 4-tap time.
Change-Id: I01ed2ab494428256e918875774a459afecc5ec6a
|
|
Performance:
The chart below shows the speed relative to baseline
(baseline_time/new_time)
_____| 4X4 | 8X8 |16X16|64X64|
2 DIM|1.889|1.780|1.811|1.963|
 HORZ|2.266|1.834|1.617|1.595|
 VERT|2.043|2.190|2.373|2.485|
Change-Id: Ic4262222db78f013b94a8c61b46efb8520722927
|
|
Some repeated code is refactored into inline functions. No performance
degradation is observed. These inline functions can be used for width 8
and width 4.
Change-Id: Ibf08cc9ebd2dd47bd2a6c2bcc1616f9d4c252d4d
|
|
Horizontal filter on 64x64 block: 1.59 times as fast as baseline.
Vertical filter on 64x64 block: 2.5 times as fast as baseline.
2D filter on 64x64 block: 1.96 times as fast as baseline.
Change-Id: I12e46679f3108616d5b3475319dd38b514c6cb3c
|
|
Change-Id: I7a3314a268cf6049a7260361043e76d4561085c6
|
|
|
|
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: I838c8678e62f7cff13387b84d4f3ea42710a67ea
|
|
Another instance of unaligned 4-byte loads.
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: I06afc5405bb074384eec7a8c8123e5803e522937
|
|
When built with -fsanitize=address,undefined a number of tests,
such as ByteAlignmentTest.SwitchByteAlignment, produce runtime
errors about unaligned 4-byte loads/stores. While normally not
really a problem, this does technically violate the language and it
is easy to fix in a standard-conforming way using memcpy, which does
not produce inferior code.
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: Ie1e97ab25fe874f864df48b473569f00563181ae
|
|
x86inc.asm's cglobal macro is frequently used to declare more
arguments than the function actually has. Normally, this is
done to acquire an alias to a register that would correspond to
that positional function argument if it existed. This is safe
when used in this manner.
In the case fixed here, however, the alias is used to temporarily
store addresses obtained through the GOT in memory. Because those
extra arguments don't actually exist, those stores corrupt the
caller's stack frame.
SSE2/VpxHBDSubpelVarianceTest.Ref is a test that may fail as a
result.
As a simple fix, the space allocated to actual arguments that have
already been loaded into registers is reused.
This avoids having to allocate extra space for local variables.
Also removed duplicate code while at it.
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: I505281ecaa6be586185fe6a2d34d62bdf40c839f
|
|
use the recommended format [1] of:
<PROJECT>_<PATH>_<FILE>_H_
[1] https://google.github.io/styleguide/cppguide.html#The__define_Guard
"All header files should have #define guards to prevent multiple
inclusion. The format of the symbol name should be
<PROJECT>_<PATH>_<FILE>_H_."
Change-Id: I2e8ab0b32fb23c30fa43cff5fec12d043c0d2037
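As an illustration of the format, a header at the hypothetical path
vpx_dsp/x86/example.h in a project named VPX would use the guard
below. The file name and the example_add() function are assumptions
made up for this sketch.

```c
/* Hypothetical header: vpx_dsp/x86/example.h */
#ifndef VPX_VPX_DSP_X86_EXAMPLE_H_
#define VPX_VPX_DSP_X86_EXAMPLE_H_

/* Placeholder declaration so the guarded region is not empty. */
static inline int example_add(int a, int b) { return a + b; }

#endif  /* VPX_VPX_DSP_X86_EXAMPLE_H_ */
```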
|
|
|
|
this avoids reading 4 pixels into another block, which may be operated
on by a different thread. quiets a tsan warning.
Change-Id: Id27ad9d61819b0e5de0230647b4b510f7c265a71
|
|
BUG=webm:1546
Change-Id: I48224f047547b666c519e0cc23706dd0bda5df20
|
|
Change-Id: I710b66dc571a6bd38fbcc2528486d5e028a68b37
|