Age | Commit message | Author |
|
Adds NEON assembly for motion compensation, which improves on the
existing NEON code.
Performance improvement:
Platform     Resolution   1 Thread   4 Threads
Nexus 6      720p         12.16%     7.21%
@2.65 GHz    1080p        18.00%     15.28%
Change-Id: Ic0b0412eeb01c8317642b20bb99092c2f5baba37
|
|
Low bit depth version only. Passes the Trans32x32Test test suite.
Trans32x32Test Speed Test (POWER9 Model 2.2)
32x32 C time = 212.7 ms (±0.1 ms), VSX time = 82.3 ms (±0.0 ms) [2.6x]
Change-Id: If906ec9b56ce3818cae0cc462c7277284ab29859
|
|
~2x speedup or better.
[ RUN ] C/VP9SubtractBlockTest.Speed/0
[ BENCH ] 4x4 365.1 ms ( ±2.2 ms )
[ BENCH ] 8x4 258.5 ms ( ±0.3 ms )
[ BENCH ] 4x8 202.7 ms ( ±0.2 ms )
[ BENCH ] 8x8 162.2 ms ( ±0.5 ms )
[ BENCH ] 16x8 138.8 ms ( ±0.3 ms )
[ BENCH ] 8x16 121.5 ms ( ±0.4 ms )
[ BENCH ] 16x16 110.2 ms ( ±0.5 ms )
[ BENCH ] 32x16 104.8 ms ( ±0.1 ms )
[ BENCH ] 16x32 32.7 ms ( ±0.1 ms )
[ BENCH ] 32x32 30.0 ms ( ±0.0 ms )
[ BENCH ] 64x32 28.7 ms ( ±0.0 ms )
[ BENCH ] 32x64 20.1 ms ( ±0.0 ms )
[ BENCH ] 64x64 19.3 ms ( ±0.0 ms )
[ RUN ] VSX/VP9SubtractBlockTest.Speed/0
[ BENCH ] 4x4 155.3 ms ( ±0.9 ms )
[ BENCH ] 8x4 99.3 ms ( ±0.4 ms )
[ BENCH ] 4x8 77.2 ms ( ±0.1 ms )
[ BENCH ] 8x8 45.7 ms ( ±0.0 ms )
[ BENCH ] 16x8 34.1 ms ( ±0.0 ms )
[ BENCH ] 8x16 29.5 ms ( ±0.0 ms )
[ BENCH ] 16x16 19.9 ms ( ±0.0 ms )
[ BENCH ] 32x16 15.1 ms ( ±0.0 ms )
[ BENCH ] 16x32 16.7 ms ( ±0.0 ms )
[ BENCH ] 32x32 14.1 ms ( ±0.0 ms )
[ BENCH ] 64x32 12.6 ms ( ±0.0 ms )
[ BENCH ] 32x64 12.0 ms ( ±0.0 ms )
[ BENCH ] 64x64 11.2 ms ( ±0.0 ms )
Change-Id: I89ce12b6475871dc9e8fde84d0b6fe5c420c28c7
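The per-size speedup can be read off the two benchmark runs above by dividing each C time by the matching VSX time. A minimal sketch of that arithmetic follows; the helper names `speedup` and `min_speedup` are hypothetical and are not part of the test harness.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of baseline time to optimized time; a value above 1.0 means the
 * optimized version is faster. Hypothetical helper for illustration. */
double speedup(double c_ms, double simd_ms) {
  assert(simd_ms > 0.0);
  return c_ms / simd_ms;
}

/* Smallest speedup across n paired measurements (n must be at least 1). */
double min_speedup(const double *c_ms, const double *simd_ms, size_t n) {
  assert(n > 0);
  double worst = speedup(c_ms[0], simd_ms[0]);
  for (size_t i = 1; i < n; ++i) {
    const double s = speedup(c_ms[i], simd_ms[i]);
    if (s < worst) worst = s;
  }
  return worst;
}
```

With the figures listed above, 4x4 gives 365.1 / 155.3, roughly 2.35x, and 64x64 gives 19.3 / 11.2, roughly 1.72x.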
|
|
Low bit depth version only. Passes the VpxMbPostProcDownTest.
VpxMbPostProcDownTest Speed Test (POWER8 Model 2.1)
Full calculations:
C time = 195.4 ms, VSX time = 33.7 ms (5.8x)
Change-Id: If1aca7c135de036a1ab7923c0d1e6733bfe27ef7
|
|
Low bit depth version only. Passes the VP9QuantizeTest.
Change-Id: I6546f872864bd404a7e353348b0554aab1de5bf0
|
|
Perf shows CPU time of this function dropped from 0.81% to 0.15%.
Change-Id: I8a7649ca5c15af2fc65cfb848f5befa0cc5e64f2
|
|
1. vpx_convolve8_vert_mmi
2. vpx_convolve8_horiz_mmi
3. vpx_convolve8_mmi
4. vpx_convolve8_avg_mmi
5. vpx_convolve8_avg_vert_mmi
Change-Id: I41a6b3b4f327d6b67d282e0163cfa0aee8648abe
|
|
BUG=webm:1403
Change-Id: Id9833e985fb70958cf4bde38f8e6303ed83c12f9
|
|
Change-Id: I9f95f47bc7ecbb7980f21cbc3a91f699624141af
|
|
The added AVX-512 support requires the subset of AVX-512 added in Skylake-X.
Change-Id: I39666b00d10bf96d06c709823663eb09b89265b7
|
|
|
|
This version is ~1.91x faster than the sse2 version. When
highbitdepth is enabled, it is ~1.74x faster.
Change-Id: I2b0e92ede9f55c6259ca07bf1f8c8a5d0d0955bd
|
|
Change-Id: I6539111dfb35a43028e9755785b2e9ea31854305
|
|
Change-Id: Id6a8c549709a3c516ed5d7b719b05117c5ef8bac
|
|
Add some load and store sse2 inline functions.
Change-Id: Ib1e0650b5a3d8e2b3736ab7c7642d6e384354222
|
|
BUG=webm:1419
Change-Id: I39c8033734562efc0ac0e28e7f06fa05130f9b96
|
|
This reverts commit 8c42237bb200253931c49e2c530838f3a877dd65.
Because ssse3 code is used for the reference, the qcoeff and dqcoeff
reference buffers must be aligned.
Original change's description:
> quantize avx: copy 32x32 implementation
>
> Ensure avx and ssse3 stay in sync by testing them against each other.
>
> Change-Id: I699f3b48785c83260825402d7826231f475f697c
Change-Id: Ieeef11b9406964194028b0d81d84bcb63296ae06
|
|
C vs SSE2 speed gains:
_4x4 : ~2.31x
C vs SSSE3 speed gains:
_8x8 : ~4.73x
_16x16 : ~10.88x
_32x32 : ~4.80x
BUG=webm:1411
Change-Id: I0bac29db261079181ddabc6814bd62c463109caf
|
|
|
|
Change-Id: I4ac576875c91fee7cb150d298fae4a2c156d374c
|
|
1. vpx_sadWxH_c
2. vpx_sadWxH_avg_c
3. vpx_sadWxHx3_c
4. vpx_sadWxHx8_c
5. vpx_sadWxHx4d_c
Change-Id: Ie13161e3d73a052ea6ea7bac9cfadf55598fea7a
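For readers unfamiliar with the SAD (sum of absolute differences) family listed above, here is a scalar sketch of the basic operation and of a four-reference variant. The names `sad_block` and `sad_block_x4d` and their parameter lists are assumptions for illustration. They are not the libvpx prototypes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between a source block and one reference
 * block. Hypothetical helper, not the library's prototype. */
unsigned int sad_block(const uint8_t *src, int src_stride,
                       const uint8_t *ref, int ref_stride,
                       int width, int height) {
  assert(width > 0 && height > 0);
  unsigned int sad = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      sad += (unsigned int)abs(src[x] - ref[x]);
    }
    src += src_stride; /* advance one row in each buffer */
    ref += ref_stride;
  }
  return sad;
}

/* Same source block compared against four reference blocks in one call. */
void sad_block_x4d(const uint8_t *src, int src_stride,
                   const uint8_t *const ref[4], int ref_stride,
                   int width, int height, unsigned int out[4]) {
  for (int i = 0; i < 4; ++i) {
    out[i] = sad_block(src, src_stride, ref[i], ref_stride, width, height);
  }
}
```

An optimized version would typically process many pixels per instruction, but it should return the same sums as this scalar loop.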
|
|
C vs SSE2 speed gains:
_4x4 : ~8.12x
_8x8 : ~9.71x
_16x16 : ~8.21x
_32x32 : ~5.0x
BUG=webm:1422
Change-Id: I5e8a1ed4db7b8dc539b3e2a728b0b34d8b4b1993
|
|
This reverts commit f60d1dcd3de46f72bafc5eeef481bd1a4e203301.
Reason for revert: Failures in AVX/VP9QuantizeTest in nightly tests.
Original change's description:
> quantize avx: copy 32x32 implementation
>
> Ensure avx and ssse3 stay in sync by testing them against each other.
>
> Change-Id: I699f3b48785c83260825402d7826231f475f697c
TBR=slavarnway@google.com,johannkoenig@google.com,builds@webmproject.org
Change-Id: Ibd38636212269328317dd0721be9d25452113d1c
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
|
|
Ensure avx and ssse3 stay in sync by testing them against each other.
Change-Id: I699f3b48785c83260825402d7826231f475f697c
|
|
Still does not pass tests. Does match the previous assembly, although
saving the sign before multiplying is dubious.
Change-Id: Ia163f18c755aba542d6e93f7bf7343184660df5a
|
|
mmi."
|
|
Adds an early exit based on ptest. Slightly slower than ssse3 in the
full case because of the extra check, but potentially faster if lots of
rows can be skipped.
Very close in speed to the assembly.
Can run in 32 bit, unlike the assembly. Allows reworking the function
prototype to use structs.
Change-Id: If80e2b9ba059370a4cad3c973196e82a97b4330e
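The trade-off described above (an extra check that costs a little when nothing can be skipped, but saves work when many rows can) can be illustrated in scalar C. The `ptest` instruction reports whether a vector AND is all zero. The sketch below stands in for that with a plain loop asking whether any value in a group reaches a threshold. The function names, the grouping, and the copy used as a placeholder for the full computation are all assumptions, since the message does not name the function being changed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar stand-in for a vector "is any lane set" test: returns 1 if any
 * value in the group has magnitude >= threshold. Hypothetical helper. */
int any_at_or_above(const int16_t *v, int n, int threshold) {
  for (int i = 0; i < n; ++i) {
    if (abs(v[i]) >= threshold) return 1;
  }
  return 0;
}

/* Processes n_groups groups of group_len values. Groups that fail the
 * check are zero-filled and skipped; the rest take the "full" path, which
 * here is only a copy acting as a placeholder. Returns the number of
 * groups that took the full path. */
int process_with_early_exit(const int16_t *in, int n_groups, int group_len,
                            int threshold, int16_t *out) {
  assert(n_groups >= 0 && group_len > 0);
  int full = 0;
  for (int g = 0; g < n_groups; ++g) {
    const int16_t *src = in + g * group_len;
    int16_t *dst = out + g * group_len;
    if (!any_at_or_above(src, group_len, threshold)) {
      for (int i = 0; i < group_len; ++i) dst[i] = 0; /* early exit */
      continue;
    }
    for (int i = 0; i < group_len; ++i) dst[i] = src[i]; /* placeholder */
    ++full;
  }
  return full;
}
```

When every group takes the full path, the check is pure overhead, which matches the "slightly slower in the full case" observation.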
|
|
Change-Id: I2c782d18d9004414ba61b77238e0caf3e022d8f2
|
|
with mmi."
|
|
Change-Id: Ia120ad1064d0b6106d9685cf075bdab373eef19e
|
|
BUG=webm:1412
Change-Id: I08b562b60fa85fbc2fec1c15c323a3444b44618f
|
|
|
|
Fairly minor differences from sse2. pabsw and psignw are the big gains.
Also re-uses some values in eob calculation to avoid an extra pcmp.
Fixes test failures in HBD and OS X builds.
Allows using it in 32bit builds, where it is about 40% faster than sse2.
Substantially faster than the assembly for skip_block. 10-20% faster the
rest of the time.
Change-Id: If783bb3567e561e47667e10133b9c84414a334e2
|
|
BUG=webm:1412
Change-Id: I8877c986b4042f7b8e33f5674c86700675a0e4ca
|
|
BUG=webm:1404
Change-Id: Ieb8f85c3811b05df78722cb41eeb1166966ceec4
|
|
With skip block or coeff < zbin it is about twice as fast as C.
If most coeff values are > zbin it is about 10-15x as fast as C.
BUG=webm:1426
Change-Id: I5d3c007b014a372d5ef0882b39bb48983b4131c7
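The two cases mentioned above (skip block or coeff < zbin, versus most values above zbin) correspond to the cheap and expensive paths of a dead-zone quantizer. The following is a simplified scalar sketch under stated assumptions: a single `zbin`, `round`, `quant` and `dequant` for all positions, a Q16 multiplier, and no scan order. The function name `quantize_sketch` is hypothetical and the real routine is more involved.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified dead-zone quantizer. All parameters are illustrative.
 * Returns eob: index of the last nonzero quantized value plus one. */
int quantize_sketch(const int16_t *coeff, int n, int skip_block,
                    int zbin, int round, int quant, int dequant,
                    int16_t *qcoeff, int16_t *dqcoeff) {
  assert(n >= 0);
  int eob = 0;
  for (int i = 0; i < n; ++i) {
    qcoeff[i] = 0;
    dqcoeff[i] = 0;
  }
  if (skip_block) return 0; /* cheap path: outputs stay zero */
  for (int i = 0; i < n; ++i) {
    const int abs_c = abs(coeff[i]);
    if (abs_c < zbin) continue; /* inside the dead zone: stays zero */
    /* quant is treated as a Q16 fixed-point multiplier (assumption). */
    const int q = (int)(((int64_t)(abs_c + round) * quant) >> 16);
    const int signed_q = coeff[i] < 0 ? -q : q;
    qcoeff[i] = (int16_t)signed_q;
    dqcoeff[i] = (int16_t)(signed_q * dequant);
    if (q) eob = i + 1;
  }
  return eob;
}
```

When most coefficients fall below `zbin`, the loop does little beyond the comparison, which leaves less work for a vectorized version to save. That is consistent with the smaller gain reported for that case.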
|
|
Change-Id: Iaf9e88ff636ccf8f0ef310869c6827f3f205cca8
|
|
Change-Id: Id2673eece32027fb245919c7a5c81994a4a19fd8
|
|
* changes:
Add vpx_highbd_idct8x8_{12, 64}_add_sse4_1
sse2: Add transpose_32bit_4x4x2() and update transpose_32bit_4x4()
Refactor highbd idct 4x4 sse4.1 code and add highbd_inv_txfm_sse4.h
Refactor vpx_idct8x8_12_add_ssse3() and add inv_txfm_ssse3.h
|
|
BUG=webm:1412
Change-Id: I5d038b4fa842ce2f6b9bd5c8c44c70647bda9591
|
|
Also clean highbd_inv_txfm_sse2.h
BUG=webm:1412
Change-Id: I0722841d824ce602874019bd9779b10d49d10c0b
|
|
BUG=webm:1412
Change-Id: I1f640db71ad4c644b7521305a781f2218eb1ba9d
|
|
The function was originally written with HBD in mind. Enable it and
configure the tests.
BUG=webm:1424
Change-Id: I78a2eba8d4d9d59db98a344ba0840d4a60ebe9a1
|
|
BUG=webm:1412
Change-Id: Ie33482409351a01be4e89466b0441834eb1e905a
|
|
Almost 3x faster in constrained loop testing. Over 10x faster in HBD
builds.
BUG=webm:1424
Change-Id: I2b7f8453e1d4ada63cde729d8115d684c4a71ff9
|
|
|
|
|
|
BUG=webm:1438
Change-Id: Ie3dc034c7dbb498a0b088a767b1936ddeed4df14
|
|
Roughly 2x speedup. Since the only change for HBD is to store(), the
improvement appears to hold there as well.
BUG=webm:1424
Change-Id: I15b813d50deb2e47b49a6b0705945de748e83c19
|
|
BUG=webm:1423
Change-Id: I33de537f238f58f89b7a6c1c2d6e8110de4b8804
|