path: root/vpx_dsp
Age | Commit message | Author
2023-06-12 | Merge "Fix c vs intrinsic mismatch of vpx_hadamard_32x32() function" into main | Yunqing Wang
2023-06-09 | Fix c vs intrinsic mismatch of vpx_hadamard_32x32() function | Anupam Pandey
This CL resolves the mismatch between the C and intrinsic implementations of the vpx_hadamard_32x32 function. The mismatch was due to integer overflow during the addition operation in the intrinsic functions. Specifically, the addition in the intrinsic functions was performed at the 16-bit level, while the calculation of a0 + a1 can produce a 17-bit value. This change addresses the problem by performing the addition at the 32-bit level (with sign extension) in both SSE2 and AVX2, and then converting the results back to the 16-bit level after a right shift. STATS_CHANGED Change-Id: I576ca64e3b9ebb31d143fcd2da64322790bc5853
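For illustration, a minimal scalar sketch of the overflow being described (function names are hypothetical, not the library's): adding two int16_t intermediates in a 16-bit lane wraps when the true sum needs 17 bits, while widening to 32 bits before the shift matches the C reference.

  #include <stdint.h>

  /* Wraps: the explicit 16-bit truncation mimics a 16-bit SIMD lane overflowing. */
  static int16_t combine_16bit(int16_t a0, int16_t a1) {
    return (int16_t)((int16_t)(a0 + a1) >> 2);
  }

  /* Matches the C reference: widen, add, shift, then narrow. */
  static int16_t combine_widened(int16_t a0, int16_t a1) {
    const int32_t sum = (int32_t)a0 + (int32_t)a1;  /* no 16-bit wraparound */
    return (int16_t)(sum >> 2);
  }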
2023-06-07 | Fix more typos (n/n) | Jerome Jiang
impace -> impact
taget -> target
prediciton -> prediction
addtion -> addition
the the -> the
Bug: webm:1803
Change-Id: I759c9d930a037ca69662164fcd6be160ed707d77
2023-05-31 | Merge changes I6a906803,I0307a3b6 into main | James Zern
* changes:
  Optimize Neon implementation of vpx_int_pro_row
  Optimize Neon implementation of vpx_int_pro_col
2023-05-31 | Optimize Neon implementation of vpx_int_pro_row | Jonathan Wright
Double the number of accumulator registers to remove the bottleneck. Also peel the first loop iteration. Change-Id: I6a90680369f9c33cdfe14ea547ac1569ec3f50de
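A rough Neon sketch of those two ideas under simplified assumptions (8 columns, h >= 1; not the actual vpx_int_pro_row kernel): the first row is peeled into one accumulator, and the remaining rows alternate between two independent accumulators so the widening adds are not serialized on a single register.

  #include <arm_neon.h>
  #include <stdint.h>

  static uint16x8_t column_sums_sketch(const uint8_t *src, int stride, int h) {
    uint16x8_t acc0 = vmovl_u8(vld1_u8(src));  /* peeled first iteration */
    uint16x8_t acc1 = vdupq_n_u16(0);
    int r = 1;
    for (; r <= h - 2; r += 2) {
      acc0 = vaddw_u8(acc0, vld1_u8(src + r * stride));        /* independent  */
      acc1 = vaddw_u8(acc1, vld1_u8(src + (r + 1) * stride));  /* add chains   */
    }
    if (r < h) acc0 = vaddw_u8(acc0, vld1_u8(src + r * stride));
    return vaddq_u16(acc0, acc1);  /* combine the partial sums once at the end */
  }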
2023-05-31 | Optimize Neon implementation of vpx_int_pro_col | Jonathan Wright
Use widening pairwise addition instructions to halve the number of additions required. Change-Id: I0307a3b65e50d2b1ae582938bc5df9c2b21df734
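A minimal sketch of the widening pairwise addition idea (a generic byte-summing loop, not the vpx_int_pro_col kernel; vaddlvq_u16 is AArch64-only): vpadalq_u8 both widens u8 lanes to u16 and adds adjacent pairs in a single instruction.

  #include <arm_neon.h>
  #include <stdint.h>

  /* Sums n bytes; n is assumed to be a multiple of 16. */
  static uint32_t sum_bytes_sketch(const uint8_t *src, int n) {
    uint16x8_t acc = vdupq_n_u16(0);
    for (int i = 0; i < n; i += 16) {
      /* Pairwise add 16 u8 lanes into the 8 u16 accumulator lanes. */
      acc = vpadalq_u8(acc, vld1q_u8(src + i));
    }
    return vaddlvq_u16(acc);  /* final horizontal reduction */
  }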
2023-05-23 | vpx_dsp_common.h,clip_pixel: work around VS2022 Arm64 issue | James Zern
cl.exe targeting AArch64 with optimizations enabled produces invalid code for clip_pixel() when the return type is uint8_t. See: https://developercommunity.visualstudio.com/t/Misoptimization-for-ARM64-in-VS-2022-17/10363361 Bug: b/277255076 Bug: webm:1788 Change-Id: Ia3647698effd34f1cf196cd33fa4a8cab9fa53d6
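For context, clip_pixel() is essentially a clamp of an int result into the 8-bit pixel range before narrowing; a sketch of that shape (not necessarily the exact header contents, and not the workaround itself):

  #include <stdint.h>

  /* Clamp an int to [0, 255] and narrow to uint8_t. The VS2022 issue above is
   * triggered by the uint8_t return type under optimization. */
  static uint8_t clip_pixel_sketch(int val) {
    return (uint8_t)((val > 255) ? 255 : (val < 0) ? 0 : val);
  }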
2023-05-23 | fdct_partial_neon.c: work around VS2022 Arm64 issue | James Zern
cl.exe targeting AArch64 with optimizations enabled will fail with an internal compiler error. See: https://developercommunity.visualstudio.com/t/Compiler-crash-C1001-when-building-a-for/10346110 Bug: b/277255076 Bug: webm:1788 Change-Id: I55caf34e910dab47a7775f07280677cdfe606f5b
2023-05-18 | Merge "Improve convolve AVX2 intrinsic for speed" into main | Yunqing Wang
2023-05-17 | Improve convolve AVX2 intrinsic for speed | Anupam Pandey
This CL refactors the code related to the convolve functions. Furthermore, it improves the AVX2 intrinsics that compute the vertical convolution for the w = 4 case and the horizontal convolution for the w = 16 case. Please note the module-level scaling w.r.t. the C function (timer based) for the existing and new AVX2 intrinsics:
Block     Scaling
size      AVX2 (existing)   AVX2 (new)
4x4       5.34x             5.91x
4x8       7.10x             7.79x
16x8      23.52x            25.63x
16x16     29.47x            30.22x
16x32     33.42x            33.44x
This is a bit-exact change.
Change-Id: If130183bc12faab9ca2bcec0ceeaa8d0af05e413
2023-05-16 | Merge changes Ie77ad184,Idfcac43c into main | James Zern
* changes:
  Add 2D-specific Neon horizontal convolution functions
  Refactor standard bitdepth Neon convolution functions
2023-05-13 | Add 2D-specific Neon horizontal convolution functions | Jonathan Wright
2D 8-tap convolution filtering is performed in two passes - horizontal and vertical. The horizontal pass must produce enough input data for the subsequent vertical pass - 3 rows above and 4 rows below, in addition to the actual block height. At present, all Neon horizontal convolution algorithms process 4 rows at a time, but this means we end up doing at least 1 row too much work in the 2D first pass case where we need h + 7, not h + 8 rows of output. This patch adds additional dot-product (SDOT and USDOT) Neon paths that process h + 7 rows of data exactly, saving the work of the unnecessary extra row. It is impractical to take a similar approach for the Armv8.0 MLA paths since we have to transpose the data block both before and after calling the convolution helper functions. vpx_convolve_neon performance impact: we observe a speedup of ~9% for smaller (and wider) blocks, and a speedup of 0-3% for larger blocks. This is to be expected since the proportion of redundant work decreases as the block height increases. Change-Id: Ie77ad1848707d2d48bb8851345a469aae9d097e1
2023-05-12 | Refactor standard bitdepth Neon convolution functions | Jonathan Wright
1) Use #define constants instead of magic numbers for right shifts.
2) Move saturating narrow into helper functions that return 4-element result vectors.
3) Use mem_neon.h helpers for load/store sequences in Armv8.0 paths.
4) Tidy up: assert conditions and some longer variable names.
5) Prefer != 0 to > 0 where possible for loop termination conditions.
Change-Id: Idfcac43ca38faf729dca07b8cc8f7f45ad264d24
2023-05-09 | Add AVX2 intrinsic for vpx_comp_avg_pred() function | Anupam Pandey
The module level scaling w.r.t C function (timer based) for existing (SSE2) and new AVX2 intrinsics:
If ref_padding = 0
Block     Scaling
size      SSE2    AVX2
8x4       3.24x   3.24x
8x8       4.22x   4.90x
8x16      5.91x   5.93x
16x8      1.63x   3.52x
16x16     1.53x   4.19x
16x32     1.38x   4.82x
32x16     1.28x   3.08x
32x32     1.45x   3.13x
32x64     1.38x   3.04x
64x32     1.39x   2.12x
64x64     1.46x   2.24x
If ref_padding = 8
Block     Scaling
size      SSE2    AVX2
8x4       3.20x   3.21x
8x8       4.61x   4.83x
8x16      5.50x   6.45x
16x8      1.56x   3.35x
16x16     1.53x   4.19x
16x32     1.37x   4.83x
32x16     1.28x   3.07x
32x32     1.46x   3.29x
32x64     1.38x   3.22x
64x32     1.38x   2.14x
64x64     1.38x   2.12x
This is a bit-exact change.
Change-Id: I72c5d155f64d0c630bc8c3aef21dc8bbd045d9e6
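The scalar behaviour being vectorized is a rounded average of predictor and reference pixels; a simplified sketch (assumed from the C reference, with flattened loops) is:

  #include <stdint.h>

  static void comp_avg_pred_sketch(uint8_t *comp_pred, const uint8_t *pred,
                                   int width, int height, const uint8_t *ref,
                                   int ref_stride) {
    for (int y = 0; y < height; ++y) {
      for (int x = 0; x < width; ++x) {
        /* Rounded average of the two predictions. */
        comp_pred[x] = (uint8_t)((pred[x] + ref[x] + 1) >> 1);
      }
      comp_pred += width;
      pred += width;
      ref += ref_stride;
    }
  }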
2023-05-05 | Merge "macros_msa.h: clear -Wshadow warnings" into main | James Zern
2023-05-05 | Merge "vpx_subpixel_8t_intrin_avx2,cosmetics: shorten long comment" into main | James Zern
2023-05-05 | macros_msa.h: clear -Wshadow warnings | James Zern
Bug: webm:1793 Change-Id: Ib2e3bd3c52632cdd4410cb2c54d69750e64e5201
2023-05-05 | Merge "Add AVX2 intrinsic for idct16x16 and idct32x32 functions" into main | Yunqing Wang
2023-05-05 | Add AVX2 intrinsic for idct16x16 and idct32x32 functions | Anupam Pandey
Added AVX2 intrinsic optimization for the following functions:
1. vpx_idct16x16_256_add
2. vpx_idct32x32_1024_add
3. vpx_idct32x32_135_add
The module level scaling w.r.t C function (timer based) for existing (SSE2) and new AVX2 intrinsics:
                          Scaling
Function Name             SSE2    AVX2
vpx_idct32x32_1024_add    3.62x   7.49x
vpx_idct32x32_135_add     4.85x   9.41x
vpx_idct16x16_256_add     4.82x   7.70x
This is a bit-exact change.
Change-Id: Id9dda933aa1f5093bb6b35ac3b8a41846afca9d2
2023-05-04 | vpx_subpixel_8t_intrin_avx2,cosmetics: shorten long comment | James Zern
Change-Id: I8badedc2ad07d60896e45de28b707ad9f6c4d499
2023-05-04 | Merge changes I226215a2,Ia4918eb0,If6219446,Ibf00a6e1,I900a0a48 into main | Chi Yo Tsai
* changes:
  Fix mismatched param names in vpx_dsp/x86/sad4d_avx2.c
  Fix mismatched param names in vpx_dsp/arm/highbd_sad4d_neon.c
  Fix mismatched param names in vpx_dsp/arm/sad4d_neon.c
  Fix mismatched param names in vpx_dsp/arm/highbd_avg_neon.c
  Fix clang warning on const-qualification of parameters
2023-05-03 | Fix mismatched param names in vpx_dsp/x86/sad4d_avx2.c | chiyotsai
Change-Id: I226215a2ff8798b72abe0c2caf3d18875595caa5
2023-05-03 | Fix mismatched param names in vpx_dsp/arm/highbd_sad4d_neon.c | chiyotsai
Change-Id: Ia4918eb0bac3b28b27e1ef205b9171680b2eb9a4
2023-05-03 | Fix mismatched param names in vpx_dsp/arm/sad4d_neon.c | chiyotsai
Change-Id: If621944684cf9bb9f353db5961ed8b4b4ae38f24
2023-05-03 | Fix mismatched param names in vpx_dsp/arm/highbd_avg_neon.c | chiyotsai
Change-Id: Ibf00a6e1029284e637b10ef01ac9b31ffadc74ca
2023-05-03 | Fix clang warning on const-qualification of parameters | chiyotsai
Change-Id: I900a0a48dde5fcb262157b191ac536e18269feb3
2023-05-03 | s/__aarch64__/VPX_ARCH_AARCH64/ | James Zern
This allows AArch64 to be correctly detected when building with Visual Studio (cl.exe) and fixes a crash in vp9_diamond_search_sad_neon.c. There are still test failures, however. Microsoft's compiler doesn't define __ARM_FEATURE_*. To use those paths we may need to rely on _M_ARM64_EXTENSION. Bug: webm:1788 Bug: b/277255076 Change-Id: I4d26f5f84dbd0cbcd1cdf0d7d932ebcf109febe5
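Illustrative only (VPX_ARCH_AARCH64 itself comes from the build configuration): a hand-rolled check of this shape shows why a build-level macro is needed, since cl.exe defines _M_ARM64 rather than __aarch64__.

  /* Hypothetical detection macro, not the library's definition. */
  #if defined(__aarch64__) || defined(_M_ARM64)
  #define AARCH64_DETECTED 1
  #else
  #define AARCH64_DETECTED 0
  #endif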
2023-04-21 | highbd_vpx_convolve8_neon: clear -Wshadow warning | James Zern
Bug: webm:1793 Change-Id: If1a46fe183cd18e05b5538b1eba098e420b745ec
2023-04-19 | Add Neon implementations of vpx_highbd_sad_skip_<w>x<h>x4d | Jonathan Wright
Add Neon implementations of high bitdepth downsampling SAD4D functions for all block sizes. Also add corresponding unit tests. Change-Id: Ib0c2f852e269cbd6cbb8f4dfb54349654abb0adb
2023-04-19 | Add Neon implementation of vpx_sad_skip_<w>x<h>x4d functions | Jonathan Wright
Add Neon implementations of standard bitdepth downsampling SAD4D functions for all block sizes. Also add corresponding unit tests. Change-Id: Ieb77661ea2bbe357529862a5fb54956e34e8d758
2023-04-19 | Add Neon implementation of vpx_highbd_sad_skip_<w>x<h> functions | Jonathan Wright
Add Neon implementations of high bitdepth downsampling SAD functions for all block sizes. Also add corresponding unit tests. Change-Id: I56ea656e9bb5f8b2aedfdc4637c9ab4e1951b31b
2023-04-19 | Add Neon implementation of vpx_sad_skip_<w>x<h> functions | Jonathan Wright
Add Neon implementations of standard bitdepth downsampling SAD functions for all block sizes. Also add corresponding unit tests. Change-Id: Ibda734c270278d947673ffcc29ef17a2f4970b01
2023-04-18 | Merge "Downsample SAD computation in motion search" into main | Yunqing Wang
2023-04-17 | Merge "Add AVX2 intrinsic for vpx_fdct16x16() function" into main | Yunqing Wang
2023-04-17 | Add AVX2 intrinsic for vpx_fdct16x16() function | Anupam Pandey
Introduced an AVX2 intrinsic to compute the FDCT for the 16x16 block size. This is a bit-exact change. Please check the module-level scaling w.r.t. the C function (timer based) for the existing (SSE2) and new AVX2 intrinsics:
 Scaling
SSE2    AVX2
3.88x   5.95x
Change-Id: I02299c3746fcb52d808e2a75d30aa62652c816dc
2023-04-11 | Downsample SAD computation in motion search | Deepa K G
Added a speed feature to skip every other row in SAD computation during motion search.
                      Instruction Count          BD-Rate Loss(%)
cpu   Resolution      Reduction(%)      avg.psnr   ovr.psnr   ssim
0     LOWRES2         0.958              0.0204     0.0095    0.0275
0     MIDRES2         1.891             -0.0636     0.0032    0.0247
0     HDRES2          2.869              0.0434     0.0345    0.0686
0     Average         1.905              0.0000     0.0157    0.0403
STATS_CHANGED
Change-Id: I1a8692757ed0cbcb2259729b3ecfb0436cdf49ce
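A scalar sketch of the downsampled SAD idea (parameter names illustrative; the scale-by-two compensation is assumed from the skip-SAD convention): only every other row contributes, and the result is doubled to stay comparable with a full SAD.

  #include <stdint.h>
  #include <stdlib.h>

  static unsigned int sad_skip_rows_sketch(const uint8_t *src, int src_stride,
                                           const uint8_t *ref, int ref_stride,
                                           int width, int height) {
    unsigned int sad = 0;
    for (int y = 0; y < height; y += 2) {  /* skip every other row */
      for (int x = 0; x < width; ++x) sad += abs(src[x] - ref[x]);
      src += 2 * src_stride;
      ref += 2 * ref_stride;
    }
    return 2 * sad;  /* compensate for the skipped rows */
  }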
2023-04-10 | Merge "Add AVX2 intrinsic for variance function for block width 8" into main | Yunqing Wang
2023-04-07 | Merge "Optimize Armv8.0 Neon SAD4D 16xh, 32xh, and 64xh functions" into main | James Zern
2023-04-06 | vpx_subpixel_8t_intrin_avx2: clear -Wshadow warning | James Zern
Bug: webm:1793 Change-Id: Icba4ad242dcd0cad736b9a203829361c5bd1ca3f
2023-04-06 | Merge "Optimize Neon paths of high bitdepth SAD and SAD4d for 8xh blocks" into main | James Zern
2023-04-06 | Optimize Armv8.0 Neon SAD4D 16xh, 32xh, and 64xh functions | Jonathan Wright
Add a widening 4D reduction function operating on uint16x8_t vectors and use it to optimize the final reduction in Armv8.0 Neon standard bitdepth 16xh, 32xh and 64xh SAD4D computations. Also simplify the Armv8.0 Neon version of the sad64xhx4d_neon helper function since VP9 block sizes are not large enough to require widening to 32-bit accumulators before the final reduction. Change-Id: I32b0a283d7688d8cdf21791add9476ed24c66a28
2023-04-04 | Optimize 4D Neon reduction for 4xh and 8xh SAD4D blocks | Jonathan Wright
Add a 4D reduction function operating on uint16x8_t vectors and use it to optimize the final reduction in standard bitdepth 4xh and 8xh SAD4D computations. Similar 4D reduction optimizations have already been implemented for all other standard bitdepth block sizes, and all high bitdepth block sizes.[1] [1] https://chromium-review.googlesource.com/c/webm/libvpx/+/4224681 Change-Id: I0aa0b6e0f70449776f316879cafc4b830e86ea51
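A sketch of such a 4-D reduction on AArch64 (close in spirit to, but not copied from, the sum_neon.h helper): pairwise adds group the partial sums per input vector, and a final widening pairwise add yields one 32-bit total per vector.

  #include <arm_neon.h>

  static uint32x4_t reduce_4d_u16_sketch(uint16x8_t s0, uint16x8_t s1,
                                         uint16x8_t s2, uint16x8_t s3) {
    /* After these pairwise adds, lanes 0-1, 2-3, 4-5 and 6-7 of b hold the
     * partial sums of s0, s1, s2 and s3 respectively. */
    const uint16x8_t a01 = vpaddq_u16(s0, s1);
    const uint16x8_t a23 = vpaddq_u16(s2, s3);
    const uint16x8_t b = vpaddq_u16(a01, a23);
    /* Widening pairwise add: one 32-bit total per original input vector. */
    return vpaddlq_u16(b);
  }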
2023-04-04 | Add AVX2 intrinsic for variance function for block width 8 | Anupam Pandey
Added AVX2 intrinsic optimization for the following functions:
1. vpx_variance8x4
2. vpx_variance8x8
3. vpx_variance8x16
This is a bit-exact change.
                  Instruction Count
cpu   Resolution  Reduction(%)
0     LOWRES2     0.698
0     MIDRES2     0.577
0     HDRES2      0.469
0     Average     0.582
Change-Id: Iae8fdf9344fd012cda4955ed140633141d60ba86
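The scalar computation these intrinsics accelerate is the usual variance identity, sse minus the squared-sum contribution; a simplified sketch (assumed from the C reference, names illustrative):

  #include <stdint.h>

  static uint32_t variance_sketch(const uint8_t *src, int src_stride,
                                  const uint8_t *ref, int ref_stride,
                                  int w, int h) {
    int64_t sum = 0, sse = 0;
    for (int y = 0; y < h; ++y) {
      for (int x = 0; x < w; ++x) {
        const int diff = src[x] - ref[x];
        sum += diff;
        sse += (int64_t)diff * diff;
      }
      src += src_stride;
      ref += ref_stride;
    }
    return (uint32_t)(sse - (sum * sum) / (w * h));  /* sse - sum^2 / N */
  }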
2023-03-30 | Avoid vshr and vget_{low,high} in Neon d135 predictor impl | George Steed
The shift instructions have marginally worse performance on some micro-architectures, and the vget_{low,high} instructions are unnecessary. This commit improves performance of the d135 predictors by 1.5% geomean averaged across a range of compilers and micro-architectures. Change-Id: Ied4c3eecc12fc973841696459d868ce403ed4e6c
2023-03-30 | Use sum_neon.h helpers in Neon DC predictors | George Steed
Use sum_neon.h helpers for horizontal reductions in Neon DC predictors, enabling use of dedicated Neon reduction instructions on AArch64. Some of the surrounding code is also optimized to remove redundant broadcast instructions in the dc_store helpers. Performance is largely unchanged on both the standard and the high bit-depth predictors. The main improvement appears to be the 16x16 standard-bitdepth DC predictor, which improves by 10-15% when benchmarked on Neoverse N1. Change-Id: Ibfcc6ecf4b1b2f87ce1e1f63c314d0cc35a0c76f
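A rough AArch64 sketch of the reduction pattern for the 16x16 DC value (vaddlvq_u8 is the dedicated widening reduction instruction; this is an illustration, not the library helper):

  #include <arm_neon.h>
  #include <stdint.h>

  /* DC value for a 16x16 block: rounded average of 16 above + 16 left pixels,
   * broadcast once so the row stores can reuse the same vector. */
  static uint8x16_t dc_16x16_sketch(const uint8_t *above, const uint8_t *left) {
    const uint16_t sum =
        (uint16_t)(vaddlvq_u8(vld1q_u8(above)) + vaddlvq_u8(vld1q_u8(left)));
    return vdupq_n_u8((uint8_t)((sum + 16) >> 5));
  }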
2023-03-29 | Merge changes Ie4ffa298,If5ec220a,I670dc379 into main | James Zern
* changes:
  Avoid LD2/ST2 instructions in highbd v predictors in Neon
  Avoid interleaving loads/stores in Neon for highbd dc predictor
  Avoid LD2/ST2 instructions in vpx_dc_predictor_32x32_neon
2023-03-29 | Optimize Neon paths of high bitdepth SAD and SAD4d for 8xh blocks | Salome Thirot
For these block sizes there is no need to widen to 32-bits until the final reduction, so use a single vabaq instead of vabd + vpadalq. Change-Id: I9c19d620f7bb8b3a6b0bedd37789c03bb628b563
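A sketch of the accumulation change (names illustrative): per 8-wide row of 16-bit pixels, a single absolute-difference-and-accumulate replaces the previous widen-to-32-bit pair, since the running SAD still fits in 16 bits for these block heights.

  #include <arm_neon.h>

  /* Before: absolute difference, then widening pairwise accumulate. */
  static uint32x4_t sad8_row_old(uint32x4_t acc, uint16x8_t s, uint16x8_t r) {
    return vpadalq_u16(acc, vabdq_u16(s, r));
  }

  /* After: one vaba keeps the accumulator at 16 bits until the final reduction. */
  static uint16x8_t sad8_row_new(uint16x8_t acc, uint16x8_t s, uint16x8_t r) {
    return vabaq_u16(acc, s, r);  /* acc += |s - r| per lane */
  }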
2023-03-29 | Avoid LD2/ST2 instructions in highbd v predictors in Neon | George Steed
The interleaving load/store instructions (LD2/LD3/LD4 and ST2/ST3/ST4) are useful if we are dealing with interleaved data (e.g. real/imag components of complex numbers), but for simply loading or storing larger quantities of data it is preferable to simply use the normal load/store instructions. This patch replaces such occurrences in the two larger block sizes: vpx_highbd_v_predictor_16x16_neon and vpx_highbd_v_predictor_32x32_neon. Change-Id: Ie4ffa298a2466ceaf893566fd0aefe3f66f439e4
2023-03-29 | Avoid interleaving loads/stores in Neon for highbd dc predictor | George Steed
The interleaving load/store instructions (LD2/LD3/LD4 and ST2/ST3/ST4) are useful if we are dealing with interleaved data (e.g. real/imag components of complex numbers), but for simply loading or storing larger quantities of data it is preferable to use two or more of the normal load/store instructions. This patch replaces such occurrences in the two larger block sizes: vpx_highbd_dc_predictor_16x16_neon, vpx_highbd_dc_predictor_32x32_neon, and related helper functions.
Speedups over the original Neon code (higher is better):
Microarch.  | Compiler | Block | Speedup
Neoverse N1 | LLVM 15  | 16x16 | 1.25
Neoverse N1 | LLVM 15  | 32x32 | 1.13
Neoverse N1 | GCC 12   | 16x16 | 1.56
Neoverse N1 | GCC 12   | 32x32 | 1.52
Neoverse V1 | LLVM 15  | 16x16 | 1.63
Neoverse V1 | LLVM 15  | 32x32 | 1.08
Neoverse V1 | GCC 12   | 16x16 | 1.59
Neoverse V1 | GCC 12   | 32x32 | 1.37
Change-Id: If5ec220aba9dd19785454eabb0f3d6affec0cc8b
2023-03-29 | Avoid LD2/ST2 instructions in vpx_dc_predictor_32x32_neon | George Steed
The LD2 and ST2 instructions are useful if we are dealing with interleaved data (e.g. real/imag components of complex numbers), but for simply loading or storing larger quantities of data it is preferable to simply use two of the normal load/store instructions. This patch replaces such occurrences in vpx_dc_predictor_32x32_neon and related functions. With Clang-15 this speeds up this function by 10-30% depending on the micro-architecture being benchmarked on. With GCC-12 this speeds up the function by 40-60% depending on the micro-architecture being benchmarked on. Change-Id: I670dc37908aa238f360104efd74d6c2108ecf945
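A minimal contrast of the two store patterns (illustrative, not the predictor code itself): ST2 interleaves its two source registers, which is both unnecessary and slower when the goal is simply 32 contiguous bytes, whereas two plain ST1 stores keep the data in order.

  #include <arm_neon.h>
  #include <stdint.h>

  /* Store 32 contiguous bytes with two ST1 stores instead of one ST2. */
  static void store_32_bytes(uint8_t *dst, uint8x16_t lo, uint8x16_t hi) {
    vst1q_u8(dst, lo);       /* bytes 0..15  */
    vst1q_u8(dst + 16, hi);  /* bytes 16..31 */
  }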