path: root/vpx_dsp
Age | Commit message | Author
2018-08-07 | Merge "VPX: Improve HBD vpx_hadamard_32x32_sse2()" | Scott LaVarnway
2018-08-07 | vpx_highbd_d153_predictor_4x4_sse2: reduce load size | James Zern
this avoids reading 4 pixels into another block, which may be operated on by a different thread. quiets a tsan warning. Change-Id: Id27ad9d61819b0e5de0230647b4b510f7c265a71
2018-07-28 | arm: Consistently use unified syntax for asm | Martin Storsjo
The ".syntax unified" directives in a few source files aren't valid ADS assembly directives, and they break compilation for Windows, since ads2armasm_ms.pl doesn't handle them. Explicitly add them via ads2gas.pl and ads2gas_apple.pl instead, and tweak one instruction to be valid unified syntax. Change-Id: I37f1709f163d11474597161fe02eb433859cb9b8
2018-07-26 | Add New Neon Assemblies for Motion Compensation | Venkatarama NG. Avadhani
Commit adds neon assemblies for motion compensation which show an improvement over the existing neon code.
Performance Improvement -
Platform    Resolution   1 Thread   4 Threads
Nexus 6     720p         12.16%     7.21%
@2.65 GHz   1080p        18.00%     15.28%
Change-Id: Ic0b0412eeb01c8317642b20bb99092c2f5baba37
2018-07-26 | Merge "vp9: fix OOB read in decoder_peek_si_internal" | James Zern
2018-07-25 | vp9: fix OOB read in decoder_peek_si_internal | James Zern
Profile 1 or 3 bitstreams may require 11 bytes for the header in the intra-only case. Additionally add a check on the bit reader's error handler callback to ensure it's non-NULL before calling to avoid future regressions.
This has existed since at least (pre-1.4.0):
09bf1d61c Changes hdr for profiles > 1 for intraonly frames
BUG=webm:1543
Change-Id: I23901e6e3a219170e8ea9efecc42af0be2e5c378
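The fix described above comes down to confirming that enough bytes are available before parsing, and to guarding the error callback. A minimal sketch of that kind of guard in C follows. The names `peek_header`, `MIN_HEADER_BYTES_INTRA_ONLY`, and `error_handler_fn` are hypothetical and are not libvpx's actual API; only the 11-byte figure comes from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the commit message says profile 1 or 3
 * intra-only headers may need 11 bytes. */
#define MIN_HEADER_BYTES_INTRA_ONLY 11

/* Hypothetical callback type standing in for a bit reader's error handler. */
typedef void (*error_handler_fn)(void *ctx);

/* Returns 0 when the buffer is long enough to parse, -1 otherwise. */
static int peek_header(const uint8_t *data, size_t size,
                       error_handler_fn on_error, void *ctx) {
  if (data == NULL || size < MIN_HEADER_BYTES_INTRA_ONLY) {
    /* Only invoke the handler when one was actually supplied. */
    if (on_error != NULL) on_error(ctx);
    return -1;
  }
  /* Header parsing would go here; every read stays within size bytes. */
  return 0;
}
```

With this shape, a 10-byte buffer is rejected up front instead of being read past its end, and passing a NULL handler is harmless.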
2018-07-25 | VPX: Improve HBD vpx_hadamard_32x32_sse2() | Scott LaVarnway
BUG=webm:1546 Change-Id: I48224f047547b666c519e0cc23706dd0bda5df20
2018-07-24 | VPX: avg_intrin_sse2.c, avg_intrin_avx2.c cleanup | Scott LaVarnway
Change-Id: I710b66dc571a6bd38fbcc2528486d5e028a68b37
2018-07-24 | VPX: Improve HBD vpx_hadamard_32x32_avx2() | Scott LaVarnway
~14% improvement. BUG=webm:1546 Change-Id: I0b25f62f053e13c2185e4e8bd54e52250251efd0
2018-07-23 | VPX: Add vpx_hadamard_32x32_avx2 | Scott LaVarnway
BUG=webm:1546 Change-Id: I64629ed83cb7acd0f2ac49b9c31f369d17a1aed2
2018-07-22 | Merge "VPX: Add vpx_hadamard_32x32_sse2" | Scott LaVarnway
2018-07-22 | Merge "VPX: Improve HBD vpx_hadamard_16x16_sse2()" | Scott LaVarnway
2018-07-21 | VPX: Add vpx_hadamard_32x32_sse2 | Scott LaVarnway
BUG=webm:1546 Change-Id: Ide5828b890c5c27cfcca2d5e318a914f7cde1158
2018-07-20 | VPX: Call vpx_hadamard_16x16_c() in vpx_hadamard_32x32_c() | Scott LaVarnway
instead of vpx_hadamard_16x16(). Change-Id: Ie16aacad39d7f429e282dd4c93e57c07000d0f29
2018-07-20 | VPX: Improve HBD vpx_hadamard_16x16_sse2() | Scott LaVarnway
~12% improvement. Change-Id: Ieca4d870a4c1c5ea2c689e27fc4550fcbab9f867
2018-07-17 | vpx_sum_squares_2d_i16_neon(): Make |s2| a uint64x1_t. | Raphael Kubo da Costa
This fixes the build with at least GCC 7.3, where it was previously failing with:
sum_squares_neon.c: In function 'vpx_sum_squares_2d_i16_neon':
sum_squares_neon.c: note: use -flax-vector-conversions to permit conversions between vectors with differing element types or numbers of subparts
    s2 = vpaddl_u32(s1);
    ^~
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
    s2 = vpaddl_u32(s1);
    ^
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
    s2 = vadd_u64(vget_low_u64(s1), vget_high_u64(s1));
    ^
sum_squares_neon.c: incompatible type for argument 1 of 'vget_lane_u64'
    return vget_lane_u64(s2, 0);
    ^~
The generated assembly was verified to remain identical with both GCC and LLVM.
Bug: chromium:819249
Change-Id: I2778428ee1fee0a674d0d4910347c2a717de21ac
2018-07-11 | Add 32x32 Hadamard transform | Jingning Han
Add 32x32 Hadamard transform in C implementation. Replace the forward 32x32 2D-DCT in tpl model with Hadamard transform. This would reduce the overhead encoding time due to running tpl model by ~3x. Change-Id: I1c743dab786b818d89f14928cc3998d056830aa9
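For readers unfamiliar with the transform, the butterfly structure that makes a Hadamard transform cheaper than a DCT can be sketched as below. This is a generic, unnormalized 1-D fast Walsh-Hadamard transform, not the libvpx implementation: the function name `hadamard_1d` is hypothetical, and libvpx's 2-D 32x32 version uses its own coefficient ordering and scaling.

```c
#include <stddef.h>
#include <stdint.h>

/* In-place, unnormalized fast Walsh-Hadamard transform of n values.
 * n must be a power of two. Illustrative only. */
static void hadamard_1d(int32_t *v, size_t n) {
  for (size_t half = 1; half < n; half *= 2) {
    for (size_t i = 0; i < n; i += 2 * half) {
      for (size_t j = i; j < i + half; ++j) {
        const int32_t a = v[j];
        const int32_t b = v[j + half];
        v[j] = a + b;        /* butterfly: sum */
        v[j + half] = a - b; /* butterfly: difference */
      }
    }
  }
}
```

Only additions and subtractions are involved, so a length-n transform costs n*log2(n) simple operations and no multiplications.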
2018-07-08 | [VSX] Add support to Power9-only vec_absd | Luca Barbato
~5% gain for SAD. Change-Id: Ief7d7691f837474f5b6b582129628276fdcce319
2018-06-27 | Merge "[VSX] Drop the clang-4 workaround for vec_xxpermdi" | Luca Barbato
2018-06-27 | [VSX] Replace vec_pack and vec_perm with single vec_perm | Luc Trudeau
vpx_quantize_b: VP9QuantizeTest Speed Test (POWER8 Model 2.1)
32x32 Old VSX time = 8.1 ms, new VSX time = 7.9 ms
vp9_quantize_fp: VP9QuantizeTest Speed Test (POWER8 Model 2.1)
32x32 Old VSX time = 6.5 ms, new VSX time = 6.2 ms
Change-Id: Ic2183e8bd721bb69eaeb4865b542b656255a0870
2018-06-27 | VSX Version of fdct32x32_rd | Luc Trudeau
Low bit depth version only. Passes the Trans32x32Test test suite.
Trans32x32Test Speed Test (POWER9 Model 2.2)
32x32 C time = 212.7 ms (±0.1 ms), VSX time = 82.3 ms (±0.0 ms) [2.6x]
Change-Id: If906ec9b56ce3818cae0cc462c7277284ab29859
2018-06-25 | Merge "Add vpx_highbd_avg_8x8, vpx_highbd_avg_4x4" | Scott LaVarnway
2018-06-22 | Add vpx_highbd_avg_8x8, vpx_highbd_avg_4x4 | Scott LaVarnway
BUG=webm:1537 Change-Id: I5f216f35436189b67d9f350991f41ed31431d4fe
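As a rough illustration of what a high-bit-depth 8x8 block average computes, here is a scalar sketch in C. It assumes 16-bit samples and round-to-nearest averaging; the name `highbd_avg_8x8_ref` and its signature are hypothetical and do not reproduce the libvpx prototype.

```c
#include <stdint.h>

/* Rounded mean of an 8x8 block of 16-bit samples.
 * stride is the distance between rows, in samples. Illustrative only. */
static unsigned int highbd_avg_8x8_ref(const uint16_t *src, int stride) {
  unsigned int sum = 0;
  for (int r = 0; r < 8; ++r) {
    for (int c = 0; c < 8; ++c) sum += src[c];
    src += stride;
  }
  /* 64 samples: add half the divisor, then divide by 64 via a shift. */
  return (sum + 32) >> 6;
}
```

A 4x4 variant would follow the same pattern with 16 samples, a rounding offset of 8, and a shift of 4.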
2018-06-22 | Merge changes I51e7ed32,I99a9535b,Id584d8f6 | Luca Barbato
* changes:
  ppc: add vp9_iht16x16_256_add_vsx
  ppc: add vp9_iht8x8_64_add_vsx
  ppc: add vp9_iht4x4_16_add_vsx
2018-06-15 | [VSX] Drop the clang-4 workaround for vec_xxpermdi | Luca Barbato
clang-6 seems to support it out of the box.
E.g. VP9SubtractBlockTest.DISABLED_Speed
With the workaround:
[ BENCH ] 4x4 286.5 ms ( ±0.2 ms )
Without:
[ BENCH ] 4x4 215.2 ms ( ±0.9 ms )
Change-Id: I28b3a2cc93c0d72f52f5a48cc06d8ed4ef26913f
2018-06-14 | ppc: add vp9_iht16x16_256_add_vsx | Alexandra Hájková
Change-Id: I51e7ed32d8d87c25ee126e8b4f8fc616d0327584
2018-06-14 | [VSX] Optimize PROCESS16 macro | Luc Trudeau
The PROCESS16 macro now uses 8-bit lanes instead of 16-bit lanes.
SADTest Speed Test (POWER8 Model 2.1)
16x8 Old VSX time = 16.7 ms, new VSX time = 9.1 ms [1.8x]
16x16 Old VSX time = 15.7 ms, new VSX time = 7.9 ms [2.0x]
16x32 Old VSX time = 14.4 ms, new VSX time = 7.2 ms [2.0x]
32x16 Old VSX time = 14.0 ms, new VSX time = 7.4 ms [1.9x]
32x32 Old VSX time = 13.4 ms, new VSX time = 6.5 ms [2.0x]
32x64 Old VSX time = 12.7 ms, new VSX time = 6.3 ms [2.0x]
64x32 Old VSX time = 12.6 ms, new VSX time = 6.3 ms [2.0x]
64x64 Old VSX time = 12.7 ms, new VSX time = 6.2 ms [2.0x]
Change-Id: I51776f0e428162e78edde8eac47f30ffd2379873
2018-06-13 | VSX Version of SAD8xN | Luc Trudeau
VSX versions of the SAD functions of width 8.
SADTest Speed Test (POWER8 Model 2.1)
8x4 C time = 68.7 ms (±0.3 ms), VSX time = 31.8 ms (±0.1 ms) [2.2x]
8x8 C time = 55.6 ms (±0.3 ms), VSX time = 18.3 ms (±0.1 ms) [3.0x]
8x16 C time = 46.5 ms (±0.1 ms), VSX time = 15.6 ms (±0.1 ms) [3.0x]
Change-Id: I843f3b34e103b72deeade4a939193d8b53cee460
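For reference, the quantity these SAD kernels compute (the sum of absolute differences between a source block and a reference block) can be written as a scalar C sketch. The function name `sad_ref` and its parameter list are hypothetical and chosen for illustration; they are not the libvpx prototypes, which are specialized per block size.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over a width x height block of 8-bit pixels.
 * Strides are distances between rows, in pixels. Illustrative only. */
static unsigned int sad_ref(const uint8_t *src, int src_stride,
                            const uint8_t *ref, int ref_stride,
                            int width, int height) {
  unsigned int sad = 0;
  for (int r = 0; r < height; ++r) {
    for (int c = 0; c < width; ++c) {
      sad += (unsigned int)abs(src[c] - ref[c]);
    }
    src += src_stride;
    ref += ref_stride;
  }
  return sad;
}
```

A SIMD version processes a whole row (or several rows) per instruction, which is where the 8-bit-lane and width-8 specializations in the entries above get their speedups.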
2018-06-08 | Implement subtract_block for VSX | Luca Barbato
~2x speedup or better.
[ RUN ] C/VP9SubtractBlockTest.Speed/0
[ BENCH ] 4x4 365.1 ms ( ±2.2 ms )
[ BENCH ] 8x4 258.5 ms ( ±0.3 ms )
[ BENCH ] 4x8 202.7 ms ( ±0.2 ms )
[ BENCH ] 8x8 162.2 ms ( ±0.5 ms )
[ BENCH ] 16x8 138.8 ms ( ±0.3 ms )
[ BENCH ] 8x16 121.5 ms ( ±0.4 ms )
[ BENCH ] 16x16 110.2 ms ( ±0.5 ms )
[ BENCH ] 32x16 104.8 ms ( ±0.1 ms )
[ BENCH ] 16x32 32.7 ms ( ±0.1 ms )
[ BENCH ] 32x32 30.0 ms ( ±0.0 ms )
[ BENCH ] 64x32 28.7 ms ( ±0.0 ms )
[ BENCH ] 32x64 20.1 ms ( ±0.0 ms )
[ BENCH ] 64x64 19.3 ms ( ±0.0 ms )
[ RUN ] VSX/VP9SubtractBlockTest.Speed/0
[ BENCH ] 4x4 155.3 ms ( ±0.9 ms )
[ BENCH ] 8x4 99.3 ms ( ±0.4 ms )
[ BENCH ] 4x8 77.2 ms ( ±0.1 ms )
[ BENCH ] 8x8 45.7 ms ( ±0.0 ms )
[ BENCH ] 16x8 34.1 ms ( ±0.0 ms )
[ BENCH ] 8x16 29.5 ms ( ±0.0 ms )
[ BENCH ] 16x16 19.9 ms ( ±0.0 ms )
[ BENCH ] 32x16 15.1 ms ( ±0.0 ms )
[ BENCH ] 16x32 16.7 ms ( ±0.0 ms )
[ BENCH ] 32x32 14.1 ms ( ±0.0 ms )
[ BENCH ] 64x32 12.6 ms ( ±0.0 ms )
[ BENCH ] 32x64 12.0 ms ( ±0.0 ms )
[ BENCH ] 64x64 11.2 ms ( ±0.0 ms )
Change-Id: I89ce12b6475871dc9e8fde84d0b6fe5c420c28c7
2018-06-06 | Merge changes I3ba75c45,I97d26285 | James Zern
* changes:
  force-inline the convolve functions
  Unbreak the force inline directive for gcc
2018-06-05 | force-inline the convolve functions | Luca Barbato
Change-Id: I3ba75c459ed7c9591b7892e9f8f108146c04507d
2018-05-31 | VSX version of vpx_post_proc_down_and_across_mb_row | Luc Trudeau
Low bit depth version only. Passes the VpxPostProcDownAndAcrossMbRowTest.
VpxMbPostProcAcrossIpTest Speed Test (POWER8 Model 2.1)
C time = 121.3 ms (±4.0 ms), VSX time = 9.4 ms (±0.3 ms) [12.9x]
Change-Id: I28300779e197ea3855cf30867d17a2805388b447
2018-05-31 | ppc: add vp9_iht8x8_64_add_vsx | Alexandra Hájková
Change-Id: I99a9535bf1ae58c494113fc88d9616bda202716a
2018-05-31 | ppc: add vp9_iht4x4_16_add_vsx | Alexandra Hájková
Change-Id: Id584d8f65fdda51b8680f41424074b4b0c979622
2018-05-29 | VSX version of vpx_mbpost_proc_ip | Luc Trudeau
Low bit depth version only. Passes the VpxMbPostProcAcrossIpTest.
VpxMbPostProcAcrossIpTest Speed Test (POWER8 Model 2.1)
C time = 188.5 ms (±0.2 ms), VSX time = 65.2 ms (±0.1 ms) [2.9x]
Change-Id: I1cf72365d94a9d7f1e9323925a87a30e3bd5cfe2
2018-05-29 | VSX version of vpx_mbpost_proc_down | Luc Trudeau
Low bit depth version only. Passes the VpxMbPostProcDownTest.
VpxMbPostProcDownTest Speed Test (POWER8 Model 2.1)
Full calculations: C time = 195.4 ms, VSX time = 33.7 ms (5.8x)
Change-Id: If1aca7c135de036a1ab7923c0d1e6733bfe27ef7
2018-05-15 | Merge "Add vpx_varianceNxM_vsx and vpx_mseNxM_vsx" | Luca Barbato
2018-05-15 | Add vpx_varianceNxM_vsx and vpx_mseNxM_vsx | Luca Barbato
Speedups:
64x64 5.9
64x32 6.2
32x64 5.8
32x32 6.2
32x16 5.1
16x32 3.3
16x16 2.6
16x8 2.6
8x16 2.4
8x8 2.3
8x4 2.1
4x8 1.6
4x4 1.6
Change-Id: Idfaab96c03d3d1f487301cf398da0dd47a34e887
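The block variance these kernels accelerate can be sketched in scalar C as the sum of squared differences minus the squared sum of differences divided by the pixel count. The name `variance_ref` and its signature are hypothetical and only approximate the shape of the real functions, which are specialized per block size.

```c
#include <stdint.h>

/* Variance of the difference between two w x h blocks of 8-bit pixels.
 * Writes the sum of squared differences to *sse and returns
 * sse - sum^2 / (w * h). Illustrative only. */
static uint32_t variance_ref(const uint8_t *src, int src_stride,
                             const uint8_t *ref, int ref_stride,
                             int w, int h, uint32_t *sse) {
  int64_t sum = 0;
  uint64_t sq = 0;
  for (int r = 0; r < h; ++r) {
    for (int c = 0; c < w; ++c) {
      const int d = src[c] - ref[c];
      sum += d;
      sq += (uint64_t)(d * d);
    }
    src += src_stride;
    ref += ref_stride;
  }
  *sse = (uint32_t)sq;
  return (uint32_t)(sq - (uint64_t)((sum * sum) / (w * h)));
}
```

Both accumulators are simple reductions over the block, which is why larger blocks in the table above benefit most from vectorization.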
2018-05-14 | VSX version of vpx_quantize_b_32x32_vsx | Luc Trudeau
Low bit depth version only. Passes the VP9QuantizeTest.
VP9QuantizeTest Speed Test (POWER8 Model 2.1)
Full calculations: C time = 1456 ms, VSX time = 80 ms (18x)
Change-Id: I1b1d6d03b1aeff63640efbdeb222cab857ddd95e
2018-05-11 | Merge "Faster VSX vpx_quantize_b" | Luc Trudeau
2018-05-10 | vpx_subtract_block_neon: add explicit cast | James Zern
quiets ptrdiff_t -> int conversion warning Change-Id: If6b545a736fc19e48e290961736b1618df97db3e
2018-05-11 | Merge "ppc: Add vpx_iwht4x4_16_add_vsx" | James Zern
2018-05-11 | Merge "Update vpx_subtract_block_neon()" | James Zern
2018-05-10 | Faster VSX vpx_quantize_b | Luc Trudeau
Process 16 coefficients on the first iteration (a full 4x4) and 24 coefficients on subsequent iterations.
VSX/VP9QuantizeTest.DISABLED_Speed
Before:
4x4 176 ms
8x8 91 ms
16x16 72 ms
After:
4x4 152 ms
8x8 82 ms
16x16 64 ms
Change-Id: I07cb130833504206ccdc5bc12ae5af369364999a
2018-05-10 | Update vpx_subtract_block_neon() | Linfeng Zhang
Change-Id: Ie2ac06c090c8f92268e9a799e96aa5192a1bdcd2
2018-05-10 | Merge "Update vpx_comp_avg_pred_neon()" | James Zern
2018-05-09 | VSX version of vpx_quantize_b_vsx | Luc Trudeau
Low bit depth version only. Passes the VP9QuantizeTest. Change-Id: I6546f872864bd404a7e353348b0554aab1de5bf0
2018-05-08 | Update vpx_comp_avg_pred_neon() | Linfeng Zhang
Separate width 4 and 8 cases to reduce jumps in loop in clang. Change-Id: I6ffc6f1555f2ad08b72a8dba35a78b9fd5f95a73
2018-05-08 | Update SadMxNx4 NEON functions | Linfeng Zhang
Change-Id: Ia313a6da00a05837fcd4de6ece31fa1c0016438c
2018-05-08 | Merge "Add vpx_sum_squares_2d_i16_neon()" | Linfeng Zhang