Age | Commit message | Author |
|
Change-Id: I83c7e64fe70f7c49aa2492ed2d640c6756b7ebaa
|
|
|
|
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: I838c8678e62f7cff13387b84d4f3ea42710a67ea
|
|
These variables are passed to SSE2 functions that use aligned
loads.
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: I796c3483c6f3425d63d9262b02b19da59d536600
|
|
Another instance of unaligned 4-byte loads.
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: I06afc5405bb074384eec7a8c8123e5803e522937
|
|
When built with -fsanitize=address,undefined, a number of tests,
such as ByteAlignmentTest.SwitchByteAlignment, produce runtime
errors about unaligned 4-byte loads/stores. While normally not really
a problem, this does technically violate the language standard, and it
is easy to fix in a standard-conforming way using memcpy, which does
not produce inferior code.
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: Ie1e97ab25fe874f864df48b473569f00563181ae
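The memcpy approach described above can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper names load_u32_unaligned and store_u32_unaligned are hypothetical. Compilers typically lower a fixed-size 4-byte memcpy to a single load or store on targets that allow unaligned access, which is why the generated code is usually no worse than a pointer cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper names; not taken from the libvpx sources. */

/* Load 4 bytes from a possibly unaligned address without invoking
 * undefined behavior (no cast to uint32_t * is involved). */
static uint32_t load_u32_unaligned(const void *src) {
  uint32_t v;
  assert(src != NULL);
  memcpy(&v, src, sizeof(v));
  return v;
}

/* Store 4 bytes to a possibly unaligned address. */
static void store_u32_unaligned(void *dst, uint32_t v) {
  assert(dst != NULL);
  memcpy(dst, &v, sizeof(v));
}
```

A caller would replace `*(uint32_t *)p` with `load_u32_unaligned(p)`, and `*(uint32_t *)p = v` with `store_u32_unaligned(p, v)`.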
|
|
x86inc.asm's cglobal macro is frequently used to declare more
arguments than the function actually has. Normally, this is
done to acquire an alias to a register that would correspond to
that positional function argument if it existed. This is safe
when used in this manner.
In the case fixed here, however, the alias is used to temporarily
store addresses obtained through the GOT in memory. Because those
extra arguments don't actually exist, those stores corrupt the
caller's stack frame.
SSE2/VpxHBDSubpelVarianceTest.Ref is a test that may fail as a
result.
As a simple fix, the space allocated to actual arguments that have
already been loaded into registers is reused.
This avoids having to allocate extra space for local variables.
Also removed duplicate code while at it.
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: I505281ecaa6be586185fe6a2d34d62bdf40c839f
|
|
Use the recommended format [1] of:
<PROJECT>_<PATH>_<FILE>_H_
[1] https://google.github.io/styleguide/cppguide.html#The__define_Guard
"All header files should have #define guards to prevent multiple
inclusion. The format of the symbol name should be
<PROJECT>_<PATH>_<FILE>_H_."
Change-Id: I2e8ab0b32fb23c30fa43cff5fec12d043c0d2037
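As an illustration of the <PROJECT>_<PATH>_<FILE>_H_ format, a guard for a hypothetical header vpx_dsp/example_util.h in a project named "vpx" might look like the sketch below. The file name and the example_clamp_u8 function are assumptions made up for this example and do not exist in the tree.

```c
/* Hypothetical file: vpx_dsp/example_util.h in a project named "vpx".
 * Guard = <PROJECT>_<PATH>_<FILE>_H_ */
#ifndef VPX_VPX_DSP_EXAMPLE_UTIL_H_
#define VPX_VPX_DSP_EXAMPLE_UTIL_H_

/* Placeholder content so the header is not empty. */
static inline int example_clamp_u8(int v) {
  return v < 0 ? 0 : (v > 255 ? 255 : v);
}

#endif /* VPX_VPX_DSP_EXAMPLE_UTIL_H_ */
```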
|
|
|
|
This avoids reading 4 pixels into another block, which may be operated
on by a different thread. Quiets a TSan warning.
Change-Id: Id27ad9d61819b0e5de0230647b4b510f7c265a71
|
|
The ".syntax unified" directives in a few source files aren't valid
ADS assembly directives, and they break compilation for Windows,
since ads2armasm_ms.pl doesn't handle them.
Explicitly add them via ads2gas.pl and ads2gas_apple.pl instead,
and tweak one instruction to be valid unified syntax.
Change-Id: I37f1709f163d11474597161fe02eb433859cb9b8
|
|
Adds NEON assembly for motion compensation, which shows an improvement
over the existing NEON code.
Performance improvement:
Platform             Resolution   1 Thread   4 Threads
Nexus 6 @ 2.65 GHz   720p         12.16%     7.21%
Nexus 6 @ 2.65 GHz   1080p        18.00%     15.28%
Change-Id: Ic0b0412eeb01c8317642b20bb99092c2f5baba37
|
|
|
|
Profile 1 or 3 bitstreams may require 11 bytes for the header in the
intra-only case.
Additionally add a check on the bit reader's error handler callback to
ensure it's non-NULL before calling to avoid future regressions.
This has existed since at least (pre-1.4.0):
09bf1d61c Changes hdr for profiles > 1 for intraonly frames
BUG=webm:1543
Change-Id: I23901e6e3a219170e8ea9efecc42af0be2e5c378
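The non-NULL check on the error handler callback can be sketched as below. This is a simplified stand-in, not the libvpx bit reader: the struct layout and the names bit_reader, reader_has_bytes, and count_error are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type and reader struct, for illustration only. */
typedef void (*error_handler_fn)(void *data);

struct bit_reader {
  const unsigned char *buf;
  const unsigned char *buf_end;
  error_handler_fn error_handler; /* may legitimately be NULL */
  void *error_handler_data;
};

/* Example handler that counts how many times it was invoked. */
static void count_error(void *data) { ++*(int *)data; }

/* Returns 1 if at least n bytes remain. Otherwise reports the error
 * through the callback, but only if one was installed, and returns 0. */
static int reader_has_bytes(const struct bit_reader *r, size_t n) {
  assert(r != NULL);
  if ((size_t)(r->buf_end - r->buf) >= n) return 1;
  if (r->error_handler != NULL) r->error_handler(r->error_handler_data);
  return 0;
}
```

With this shape, a caller that needs an 11-byte header can ask for 11 bytes up front, and a reader created without a handler fails cleanly instead of calling through a NULL pointer.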
|
|
BUG=webm:1546
Change-Id: I48224f047547b666c519e0cc23706dd0bda5df20
|
|
Change-Id: I710b66dc571a6bd38fbcc2528486d5e028a68b37
|
|
~14% improvement.
BUG=webm:1546
Change-Id: I0b25f62f053e13c2185e4e8bd54e52250251efd0
|
|
BUG=webm:1546
Change-Id: I64629ed83cb7acd0f2ac49b9c31f369d17a1aed2
|
|
|
|
|
|
BUG=webm:1546
Change-Id: Ide5828b890c5c27cfcca2d5e318a914f7cde1158
|
|
instead of vpx_hadamard_16x16().
Change-Id: Ie16aacad39d7f429e282dd4c93e57c07000d0f29
|
|
~12% improvement.
Change-Id: Ieca4d870a4c1c5ea2c689e27fc4550fcbab9f867
|
|
This fixes the build with at least GCC 7.3, where it was previously failing
with:
sum_squares_neon.c: In function 'vpx_sum_squares_2d_i16_neon':
sum_squares_neon.c: note: use -flax-vector-conversions to permit conversions between vectors with differing element types or numbers of subparts
s2 = vpaddl_u32(s1);
^~
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
s2 = vpaddl_u32(s1);
^
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
s2 = vadd_u64(vget_low_u64(s1), vget_high_u64(s1));
^
sum_squares_neon.c: incompatible type for argument 1 of 'vget_lane_u64'
return vget_lane_u64(s2, 0);
^~
The generated assembly was verified to remain identical with both GCC and
LLVM.
Bug: chromium:819249
Change-Id: I2778428ee1fee0a674d0d4910347c2a717de21ac
|
|
Add a C implementation of the 32x32 Hadamard transform. Replace the
forward 32x32 2D DCT in the tpl model with the Hadamard transform. This
reduces the encoding-time overhead of running the tpl model by ~3x.
Change-Id: I1c743dab786b818d89f14928cc3998d056830aa9
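For readers unfamiliar with the Hadamard transform, the sketch below shows a generic in-place 1D fast Walsh-Hadamard transform. It is not the vpx_hadamard_32x32 implementation, whose coefficient ordering and scaling may differ; the name fwht is hypothetical. It only illustrates why the transform is cheap: each butterfly is one addition and one subtraction, with no multiplications.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic in-place fast Walsh-Hadamard transform (natural ordering,
 * unnormalized). n must be a power of two. Hypothetical helper, not
 * the libvpx routine. */
static void fwht(int32_t *x, size_t n) {
  size_t h, i, j;
  assert(x != NULL && n != 0 && (n & (n - 1)) == 0);
  for (h = 1; h < n; h <<= 1) {
    for (i = 0; i < n; i += h << 1) {
      for (j = i; j < i + h; ++j) {
        const int32_t a = x[j];
        const int32_t b = x[j + h];
        x[j] = a + b;     /* butterfly: sum */
        x[j + h] = a - b; /* butterfly: difference */
      }
    }
  }
}
```

A 2D transform of a square block can be built from this by transforming every row and then every column. Applying the unnormalized transform twice returns the input scaled by n.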
|
|
~5% gain for SAD.
Change-Id: Ief7d7691f837474f5b6b582129628276fdcce319
|
|
|
|
vpx_quantize_b:
VP9QuantizeTest Speed Test (POWER8 Model 2.1)
32x32 Old VSX time = 8.1 ms, new VSX time = 7.9 ms
vp9_quantize_fp:
VP9QuantizeTest Speed Test (POWER8 Model 2.1)
32x32 Old VSX time = 6.5 ms, new VSX time = 6.2 ms
Change-Id: Ic2183e8bd721bb69eaeb4865b542b656255a0870
|
|
Low bit depth version only. Passes the Trans32x32Test test suite.
Trans32x32Test Speed Test (POWER9 Model 2.2)
32x32 C time = 212.7 ms (±0.1 ms), VSX time = 82.3 ms (±0.0 ms) [2.6x]
Change-Id: If906ec9b56ce3818cae0cc462c7277284ab29859
|
|
|
|
BUG=webm:1537
Change-Id: I5f216f35436189b67d9f350991f41ed31431d4fe
|
|
* changes:
ppc: add vp9_iht16x16_256_add_vsx
ppc: add vp9_iht8x8_64_add_vsx
ppc: add vp9_iht4x4_16_add_vsx
|
|
clang-6 seems to support it out of the box.
E.g. VP9SubtractBlockTest.DISABLED_Speed with the workaround:
[ BENCH ] 4x4 286.5 ms ( ±0.2 ms )
Without:
[ BENCH ] 4x4 215.2 ms ( ±0.9 ms )
Change-Id: I28b3a2cc93c0d72f52f5a48cc06d8ed4ef26913f
|
|
Change-Id: I51e7ed32d8d87c25ee126e8b4f8fc616d0327584
|
|
The PROCESS16 macro now uses 8-bit lanes instead of 16-bit lanes.
SADTest Speed Test (POWER8 Model 2.1)
16x8 Old VSX time = 16.7 ms, new VSX time = 9.1 ms [1.8x]
16x16 Old VSX time = 15.7 ms, new VSX time = 7.9 ms [2.0x]
16x32 Old VSX time = 14.4 ms, new VSX time = 7.2 ms [2.0x]
32x16 Old VSX time = 14.0 ms, new VSX time = 7.4 ms [1.9x]
32x32 Old VSX time = 13.4 ms, new VSX time = 6.5 ms [2.0x]
32x64 Old VSX time = 12.7 ms, new VSX time = 6.3 ms [2.0x]
64x32 Old VSX time = 12.6 ms, new VSX time = 6.3 ms [2.0x]
64x64 Old VSX time = 12.7 ms, new VSX time = 6.2 ms [2.0x]
Change-Id: I51776f0e428162e78edde8eac47f30ffd2379873
|
|
VSX versions of the SAD functions of width 8.
SADTest Speed Test (POWER8 Model 2.1)
8x4 C time = 68.7 ms (±0.3 ms), VSX time = 31.8 ms (±0.1 ms) [2.2x]
8x8 C time = 55.6 ms (±0.3 ms), VSX time = 18.3 ms (±0.1 ms) [3.0x]
8x16 C time = 46.5 ms (±0.1 ms), VSX time = 15.6 ms (±0.1 ms) [3.0x]
Change-Id: I843f3b34e103b72deeade4a939193d8b53cee460
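For context on what the SAD (sum of absolute differences) functions compute, a plain scalar reference can be sketched as follows. This is an illustrative assumption of the general technique, not the libvpx C or VSX code; the name sad_block and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference: sum of absolute differences between a source block
 * and a reference block, each addressed with its own row stride.
 * Hypothetical helper for illustration. */
static unsigned int sad_block(const uint8_t *src, int src_stride,
                              const uint8_t *ref, int ref_stride,
                              int width, int height) {
  unsigned int sad = 0;
  int r, c;
  assert(src != NULL && ref != NULL && width > 0 && height > 0);
  for (r = 0; r < height; ++r) {
    for (c = 0; c < width; ++c) {
      sad += (unsigned int)abs((int)src[c] - (int)ref[c]);
    }
    src += src_stride; /* advance one row in each buffer */
    ref += ref_stride;
  }
  return sad;
}
```

A vectorized version processes many pixels of a row per instruction, which is where speedups like those in the tables above come from.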
|
|
~2x speedup or better.
[ RUN ] C/VP9SubtractBlockTest.Speed/0
[ BENCH ] 4x4 365.1 ms ( ±2.2 ms )
[ BENCH ] 8x4 258.5 ms ( ±0.3 ms )
[ BENCH ] 4x8 202.7 ms ( ±0.2 ms )
[ BENCH ] 8x8 162.2 ms ( ±0.5 ms )
[ BENCH ] 16x8 138.8 ms ( ±0.3 ms )
[ BENCH ] 8x16 121.5 ms ( ±0.4 ms )
[ BENCH ] 16x16 110.2 ms ( ±0.5 ms )
[ BENCH ] 32x16 104.8 ms ( ±0.1 ms )
[ BENCH ] 16x32 32.7 ms ( ±0.1 ms )
[ BENCH ] 32x32 30.0 ms ( ±0.0 ms )
[ BENCH ] 64x32 28.7 ms ( ±0.0 ms )
[ BENCH ] 32x64 20.1 ms ( ±0.0 ms )
[ BENCH ] 64x64 19.3 ms ( ±0.0 ms )
[ RUN ] VSX/VP9SubtractBlockTest.Speed/0
[ BENCH ] 4x4 155.3 ms ( ±0.9 ms )
[ BENCH ] 8x4 99.3 ms ( ±0.4 ms )
[ BENCH ] 4x8 77.2 ms ( ±0.1 ms )
[ BENCH ] 8x8 45.7 ms ( ±0.0 ms )
[ BENCH ] 16x8 34.1 ms ( ±0.0 ms )
[ BENCH ] 8x16 29.5 ms ( ±0.0 ms )
[ BENCH ] 16x16 19.9 ms ( ±0.0 ms )
[ BENCH ] 32x16 15.1 ms ( ±0.0 ms )
[ BENCH ] 16x32 16.7 ms ( ±0.0 ms )
[ BENCH ] 32x32 14.1 ms ( ±0.0 ms )
[ BENCH ] 64x32 12.6 ms ( ±0.0 ms )
[ BENCH ] 32x64 12.0 ms ( ±0.0 ms )
[ BENCH ] 64x64 11.2 ms ( ±0.0 ms )
Change-Id: I89ce12b6475871dc9e8fde84d0b6fe5c420c28c7
|
|
* changes:
force-inline the convolve functions
Unbreak the force inline directive for gcc
|
|
Change-Id: I3ba75c459ed7c9591b7892e9f8f108146c04507d
|
|
Low bit depth version only. Passes the VpxPostProcDownAndAcrossMbRowTest.
VpxMbPostProcAcrossIpTest Speed Test (POWER8 Model 2.1)
C time = 121.3 ms (±4.0 ms), VSX time = 9.4 ms (±0.3 ms) [12.9x]
Change-Id: I28300779e197ea3855cf30867d17a2805388b447
|
|
Change-Id: I99a9535bf1ae58c494113fc88d9616bda202716a
|
|
Change-Id: Id584d8f65fdda51b8680f41424074b4b0c979622
|
|
Low bit depth version only. Passes the VpxMbPostProcAcrossIpTest.
VpxMbPostProcAcrossIpTest Speed Test (POWER8 Model 2.1)
C time = 188.5 ms (±0.2 ms), VSX time = 65.2 ms (±0.1 ms) [2.9x]
Change-Id: I1cf72365d94a9d7f1e9323925a87a30e3bd5cfe2
|
|
Low bit depth version only. Passes the VpxMbPostProcDownTest.
VpxMbPostProcDownTest Speed Test (POWER8 Model 2.1)
Full calculations:
C time = 195.4 ms, VSX time = 33.7 ms (5.8x)
Change-Id: If1aca7c135de036a1ab7923c0d1e6733bfe27ef7
|
|
|
|
Speedups:
64x64 5.9
64x32 6.2
32x64 5.8
32x32 6.2
32x16 5.1
16x32 3.3
16x16 2.6
16x8 2.6
8x16 2.4
8x8 2.3
8x4 2.1
4x8 1.6
4x4 1.6
Change-Id: Idfaab96c03d3d1f487301cf398da0dd47a34e887
|
|
Low bit depth version only. Passes the VP9QuantizeTest.
VP9QuantizeTest Speed Test (POWER8 Model 2.1)
Full calculations:
C time = 1456 ms, VSX time = 80 ms (18x)
Change-Id: I1b1d6d03b1aeff63640efbdeb222cab857ddd95e
|
|
|
|
quiets ptrdiff_t -> int conversion warning
Change-Id: If6b545a736fc19e48e290961736b1618df97db3e
|
|
|