Age | Commit message | Author |
|
|
|
This avoids reading 4 pixels into another block, which may be operated
on by a different thread. Quiets a TSan warning.
Change-Id: Id27ad9d61819b0e5de0230647b4b510f7c265a71
|
|
The ".syntax unified" directives in a few source files aren't valid
ADS assembly directives, and they break compilation for Windows,
since ads2armasm_ms.pl doesn't handle them.
Explicitly add them via ads2gas.pl and ads2gas_apple.pl instead,
and tweak one instruction to be valid unified syntax.
Change-Id: I37f1709f163d11474597161fe02eb433859cb9b8
|
|
Commit adds NEON assembly for motion compensation, which shows an improvement
over the existing NEON code.
Performance Improvement -
Platform             Resolution   1 Thread   4 Threads
Nexus 6 @2.65 GHz    720p         12.16%     7.21%
Nexus 6 @2.65 GHz    1080p        18.00%     15.28%
Change-Id: Ic0b0412eeb01c8317642b20bb99092c2f5baba37
|
|
|
|
Profile 1 or 3 bitstreams may require 11 bytes for the header in the
intra-only case.
Additionally add a check on the bit reader's error handler callback to
ensure it's non-NULL before calling to avoid future regressions.
This has existed since at least (pre-1.4.0):
09bf1d61c Changes hdr for profiles > 1 for intraonly frames
BUG=webm:1543
Change-Id: I23901e6e3a219170e8ea9efecc42af0be2e5c378
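The callback guard described in this message can be sketched as follows. This is a minimal illustration, not the libvpx source: the struct `bit_reader`, the function `br_read_bit`, and the callback `count_error` are hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only; not the libvpx definitions. */
typedef void (*br_error_handler)(void *data);

struct bit_reader {
  const uint8_t *buf;             /* first byte of the header */
  const uint8_t *buf_end;         /* one past the last readable byte */
  size_t bit_offset;              /* next bit to read, counted from buf */
  void *error_handler_data;       /* passed through to the callback */
  br_error_handler error_handler; /* may be NULL */
};

/* Reads one bit, most significant bit first. On overrun, the error is
 * reported only if a callback is installed, and 0 is returned. */
int br_read_bit(struct bit_reader *br) {
  const size_t byte = br->bit_offset >> 3;
  const int shift = 7 - (int)(br->bit_offset & 7);
  if (byte < (size_t)(br->buf_end - br->buf)) {
    br->bit_offset++;
    return (br->buf[byte] >> shift) & 1;
  }
  if (br->error_handler != NULL) br->error_handler(br->error_handler_data);
  return 0;
}

/* Example callback: counts how many overruns were reported. */
void count_error(void *data) { ++*(int *)data; }
```

With the NULL check in place, a reader set up without a handler returns 0 on overrun instead of calling through a null function pointer.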
|
|
BUG=webm:1546
Change-Id: I48224f047547b666c519e0cc23706dd0bda5df20
|
|
Change-Id: I710b66dc571a6bd38fbcc2528486d5e028a68b37
|
|
~14% improvement.
BUG=webm:1546
Change-Id: I0b25f62f053e13c2185e4e8bd54e52250251efd0
|
|
BUG=webm:1546
Change-Id: I64629ed83cb7acd0f2ac49b9c31f369d17a1aed2
|
|
|
|
|
|
BUG=webm:1546
Change-Id: Ide5828b890c5c27cfcca2d5e318a914f7cde1158
|
|
instead of vpx_hadamard_16x16().
Change-Id: Ie16aacad39d7f429e282dd4c93e57c07000d0f29
|
|
~12% improvement.
Change-Id: Ieca4d870a4c1c5ea2c689e27fc4550fcbab9f867
|
|
This fixes the build with at least GCC 7.3, where it was previously failing
with:
sum_squares_neon.c: In function 'vpx_sum_squares_2d_i16_neon':
sum_squares_neon.c: note: use -flax-vector-conversions to permit conversions between vectors with differing element types or numbers of subparts
s2 = vpaddl_u32(s1);
^~
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
s2 = vpaddl_u32(s1);
^
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
s2 = vadd_u64(vget_low_u64(s1), vget_high_u64(s1));
^
sum_squares_neon.c: incompatible type for argument 1 of 'vget_lane_u64'
return vget_lane_u64(s2, 0);
^~
The generated assembly was verified to remain identical with both GCC and
LLVM.
Bug: chromium:819249
Change-Id: I2778428ee1fee0a674d0d4910347c2a717de21ac
|
|
Add a C implementation of the 32x32 Hadamard transform. Replace the
forward 32x32 2D-DCT in the tpl model with the Hadamard transform. This
would reduce the encoding-time overhead of running the tpl model
by ~3x.
Change-Id: I1c743dab786b818d89f14928cc3998d056830aa9
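For readers unfamiliar with the transform, a generic 2D Walsh-Hadamard transform can be sketched as below. It uses only additions and subtractions, which is why it is cheaper than a DCT. This is an illustrative sketch under stated assumptions: the names `wht_1d` and `wht_2d` are hypothetical, the output is unnormalized and in natural Hadamard order, and the coefficient ordering and scaling used by the libvpx function may differ.

```c
#include <stdint.h>

/* In-place, unnormalized 1D Walsh-Hadamard transform of n values that are
 * `stride` elements apart. n must be a power of two. */
void wht_1d(int32_t *v, int n, int stride) {
  for (int half = 1; half < n; half *= 2) {
    for (int i = 0; i < n; i += 2 * half) {
      for (int j = i; j < i + half; ++j) {
        /* Butterfly: replace the pair (a, b) with (a + b, a - b). */
        const int32_t a = v[j * stride];
        const int32_t b = v[(j + half) * stride];
        v[j * stride] = a + b;
        v[(j + half) * stride] = a - b;
      }
    }
  }
}

/* 2D transform of an n x n row-major block: all rows, then all columns. */
void wht_2d(int32_t *block, int n) {
  for (int r = 0; r < n; ++r) wht_1d(block + r * n, n, 1);
  for (int c = 0; c < n; ++c) wht_1d(block + c, n, n);
}
```

Calling `wht_2d(block, 32)` on a 1024-element buffer gives the 32x32 case. A constant block transforms to a single nonzero coefficient at index 0.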
|
|
~5% gain for SAD.
Change-Id: Ief7d7691f837474f5b6b582129628276fdcce319
|
|
|
|
vpx_quantize_b:
VP9QuantizeTest Speed Test (POWER8 Model 2.1)
32x32 Old VSX time = 8.1 ms, new VSX time = 7.9 ms
vp9_quantize_fp:
VP9QuantizeTest Speed Test (POWER8 Model 2.1)
32x32 Old VSX time = 6.5 ms, new VSX time = 6.2 ms
Change-Id: Ic2183e8bd721bb69eaeb4865b542b656255a0870
|
|
Low bit depth version only. Passes the Trans32x32Test test suite.
Trans32x32Test Speed Test (POWER9 Model 2.2)
32x32 C time = 212.7 ms (±0.1 ms), VSX time = 82.3 ms (±0.0 ms) [2.6x]
Change-Id: If906ec9b56ce3818cae0cc462c7277284ab29859
|
|
|
|
BUG=webm:1537
Change-Id: I5f216f35436189b67d9f350991f41ed31431d4fe
|
|
* changes:
ppc: add vp9_iht16x16_256_add_vsx
ppc: add vp9_iht8x8_64_add_vsx
ppc: add vp9_iht4x4_16_add_vsx
|
|
clang-6 seems to support it out of the box.
E.g. VP9SubtractBlockTest.DISABLED_Speed with the workaround:
[ BENCH ] 4x4 286.5 ms ( ±0.2 ms )
Without:
[ BENCH ] 4x4 215.2 ms ( ±0.9 ms )
Change-Id: I28b3a2cc93c0d72f52f5a48cc06d8ed4ef26913f
|
|
Change-Id: I51e7ed32d8d87c25ee126e8b4f8fc616d0327584
|
|
The PROCESS16 macro now uses 8-bit lanes instead of 16-bit lanes.
SADTest Speed Test (POWER8 Model 2.1)
16x8 Old VSX time = 16.7 ms, new VSX time = 9.1 ms [1.8x]
16x16 Old VSX time = 15.7 ms, new VSX time = 7.9 ms [2.0x]
16x32 Old VSX time = 14.4 ms, new VSX time = 7.2 ms [2.0x]
32x16 Old VSX time = 14.0 ms, new VSX time = 7.4 ms [1.9x]
32x32 Old VSX time = 13.4 ms, new VSX time = 6.5 ms [2.0x]
32x64 Old VSX time = 12.7 ms, new VSX time = 6.3 ms [2.0x]
64x32 Old VSX time = 12.6 ms, new VSX time = 6.3 ms [2.0x]
64x64 Old VSX time = 12.7 ms, new VSX time = 6.2 ms [2.0x]
Change-Id: I51776f0e428162e78edde8eac47f30ffd2379873
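A scalar sketch of why 8-bit lanes suffice for SAD: the absolute difference of two unsigned bytes always fits in a byte when computed as max minus min, so no widening is needed before accumulation. The names `absdiff_u8` and `sad_ref` are hypothetical, and whether the VSX macro uses exactly this max/min formulation is an assumption, not something stated in the message.

```c
#include <stdint.h>

/* |a - b| computed without leaving the 8-bit range: max(a, b) - min(a, b). */
uint8_t absdiff_u8(uint8_t a, uint8_t b) {
  const uint8_t hi = a > b ? a : b;
  const uint8_t lo = a > b ? b : a;
  return (uint8_t)(hi - lo);
}

/* Scalar reference: sum of absolute differences over a width x height block.
 * Strides are in elements and may be larger than width. */
unsigned int sad_ref(const uint8_t *src, int src_stride, const uint8_t *ref,
                     int ref_stride, int width, int height) {
  unsigned int sad = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) sad += absdiff_u8(src[x], ref[x]);
    src += src_stride;
    ref += ref_stride;
  }
  return sad;
}
```

A vector version would apply the same per-byte operation to 16 pixels at once, twice as many as with 16-bit lanes, which is consistent with the roughly 2x figures in the table.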
|
|
VSX versions of the SAD functions of width 8.
SADTest Speed Test (POWER8 Model 2.1)
8x4 C time = 68.7 ms (±0.3 ms), VSX time = 31.8 ms (±0.1 ms) [2.2x]
8x8 C time = 55.6 ms (±0.3 ms), VSX time = 18.3 ms (±0.1 ms) [3.0x]
8x16 C time = 46.5 ms (±0.1 ms), VSX time = 15.6 ms (±0.1 ms) [3.0x]
Change-Id: I843f3b34e103b72deeade4a939193d8b53cee460
|
|
~2x speedup or better.
[ RUN ] C/VP9SubtractBlockTest.Speed/0
[ BENCH ] 4x4 365.1 ms ( ±2.2 ms )
[ BENCH ] 8x4 258.5 ms ( ±0.3 ms )
[ BENCH ] 4x8 202.7 ms ( ±0.2 ms )
[ BENCH ] 8x8 162.2 ms ( ±0.5 ms )
[ BENCH ] 16x8 138.8 ms ( ±0.3 ms )
[ BENCH ] 8x16 121.5 ms ( ±0.4 ms )
[ BENCH ] 16x16 110.2 ms ( ±0.5 ms )
[ BENCH ] 32x16 104.8 ms ( ±0.1 ms )
[ BENCH ] 16x32 32.7 ms ( ±0.1 ms )
[ BENCH ] 32x32 30.0 ms ( ±0.0 ms )
[ BENCH ] 64x32 28.7 ms ( ±0.0 ms )
[ BENCH ] 32x64 20.1 ms ( ±0.0 ms )
[ BENCH ] 64x64 19.3 ms ( ±0.0 ms )
[ RUN ] VSX/VP9SubtractBlockTest.Speed/0
[ BENCH ] 4x4 155.3 ms ( ±0.9 ms )
[ BENCH ] 8x4 99.3 ms ( ±0.4 ms )
[ BENCH ] 4x8 77.2 ms ( ±0.1 ms )
[ BENCH ] 8x8 45.7 ms ( ±0.0 ms )
[ BENCH ] 16x8 34.1 ms ( ±0.0 ms )
[ BENCH ] 8x16 29.5 ms ( ±0.0 ms )
[ BENCH ] 16x16 19.9 ms ( ±0.0 ms )
[ BENCH ] 32x16 15.1 ms ( ±0.0 ms )
[ BENCH ] 16x32 16.7 ms ( ±0.0 ms )
[ BENCH ] 32x32 14.1 ms ( ±0.0 ms )
[ BENCH ] 64x32 12.6 ms ( ±0.0 ms )
[ BENCH ] 32x64 12.0 ms ( ±0.0 ms )
[ BENCH ] 64x64 11.2 ms ( ±0.0 ms )
Change-Id: I89ce12b6475871dc9e8fde84d0b6fe5c420c28c7
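For context, the operation being benchmarked is a per-pixel residual: source minus prediction, widened to 16 bits. A scalar sketch follows. The name `subtract_block_ref` is hypothetical and the parameter layout is an assumption made for the example, so it should not be read as the library's exact signature.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: diff = src - pred for a rows x cols block.
 * Inputs are 8-bit pixels, the residual is 16-bit signed, and each
 * buffer has its own stride in elements. */
void subtract_block_ref(int rows, int cols, int16_t *diff,
                        ptrdiff_t diff_stride, const uint8_t *src,
                        ptrdiff_t src_stride, const uint8_t *pred,
                        ptrdiff_t pred_stride) {
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      diff[c] = (int16_t)(src[c] - pred[c]);
    }
    diff += diff_stride;
    src += src_stride;
    pred += pred_stride;
  }
}
```

The residual ranges from -255 to 255, which is why the output must be wider than the 8-bit inputs.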
|
|
* changes:
force-inline the convolve functions
Unbreak the force inline directive for gcc
|
|
Change-Id: I3ba75c459ed7c9591b7892e9f8f108146c04507d
|
|
Low bit depth version only. Passes the VpxPostProcDownAndAcrossMbRowTest.
VpxMbPostProcAcrossIpTest Speed Test (POWER8 Model 2.1)
C time = 121.3 ms (±4.0 ms), VSX time = 9.4 ms (±0.3 ms) [12.9x]
Change-Id: I28300779e197ea3855cf30867d17a2805388b447
|
|
Change-Id: I99a9535bf1ae58c494113fc88d9616bda202716a
|
|
Change-Id: Id584d8f65fdda51b8680f41424074b4b0c979622
|
|
Low bit depth version only. Passes the VpxMbPostProcAcrossIpTest.
VpxMbPostProcAcrossIpTest Speed Test (POWER8 Model 2.1)
C time = 188.5 ms (±0.2 ms), VSX time = 65.2 ms (±0.1 ms) [2.9x]
Change-Id: I1cf72365d94a9d7f1e9323925a87a30e3bd5cfe2
|
|
Low bit depth version only. Passes the VpxMbPostProcDownTest.
VpxMbPostProcDownTest Speed Test (POWER8 Model 2.1)
Full calculations:
C time = 195.4 ms, VSX time = 33.7 ms (5.8x)
Change-Id: If1aca7c135de036a1ab7923c0d1e6733bfe27ef7
|
|
|
|
Speedups:
64x64 5.9
64x32 6.2
32x64 5.8
32x32 6.2
32x16 5.1
16x32 3.3
16x16 2.6
16x8 2.6
8x16 2.4
8x8 2.3
8x4 2.1
4x8 1.6
4x4 1.6
Change-Id: Idfaab96c03d3d1f487301cf398da0dd47a34e887
|
|
Low bit depth version only. Passes the VP9QuantizeTest.
VP9QuantizeTest Speed Test (POWER8 Model 2.1)
Full calculations:
C time = 1456 ms, VSX time = 80 ms (18x)
Change-Id: I1b1d6d03b1aeff63640efbdeb222cab857ddd95e
|
|
|
|
quiets ptrdiff_t -> int conversion warning
Change-Id: If6b545a736fc19e48e290961736b1618df97db3e
|
|
|
|
|
|
Process 16 coefficients on the first iteration (a full 4x4) and 24 coefficients
on each subsequent iteration.
VSX/VP9QuantizeTest.DISABLED_Speed
Before:
4x4 176 ms
8x8 91 ms
16x16 72 ms
After:
4x4 152 ms
8x8 82 ms
16x16 64 ms
Change-Id: I07cb130833504206ccdc5bc12ae5af369364999a
|
|
Change-Id: Ie2ac06c090c8f92268e9a799e96aa5192a1bdcd2
|
|
|
|
Low bit depth version only. Passes the VP9QuantizeTest.
Change-Id: I6546f872864bd404a7e353348b0554aab1de5bf0
|
|
Separate the width 4 and width 8 cases to reduce jumps in the loop with clang.
Change-Id: I6ffc6f1555f2ad08b72a8dba35a78b9fd5f95a73
|
|
Change-Id: Ia313a6da00a05837fcd4de6ece31fa1c0016438c
|
|
|