path: root/vpx_dsp
Age | Commit message | Author
2023-02-28 | Implement d117_predictor using Neon | George Steed
Add Neon implementations of the d117 predictor for 4x4, 8x8, 16x16 and 32x32 block sizes. Also update tests to add new corresponding cases. An explanation of the general implementation strategy is given in the 8x8 implementation body.
Speedups over the C code (higher is better):
Microarch.  | Compiler | Block | Speedup
Neoverse N1 | LLVM 15  | 4x4   | 1.73
Neoverse N1 | LLVM 15  | 8x8   | 5.24
Neoverse N1 | LLVM 15  | 16x16 | 9.77
Neoverse N1 | LLVM 15  | 32x32 | 14.13
Neoverse N1 | GCC 12   | 4x4   | 2.04
Neoverse N1 | GCC 12   | 8x8   | 4.70
Neoverse N1 | GCC 12   | 16x16 | 8.64
Neoverse N1 | GCC 12   | 32x32 | 4.57
Neoverse V1 | LLVM 15  | 4x4   | 1.75
Neoverse V1 | LLVM 15  | 8x8   | 6.79
Neoverse V1 | LLVM 15  | 16x16 | 9.16
Neoverse V1 | LLVM 15  | 32x32 | 14.47
Neoverse V1 | GCC 12   | 4x4   | 1.75
Neoverse V1 | GCC 12   | 8x8   | 6.00
Neoverse V1 | GCC 12   | 16x16 | 7.63
Neoverse V1 | GCC 12   | 32x32 | 4.32
Change-Id: I7228327b5be27ee7a68deecafa05be0bd2a40ff4
2023-02-28 | Implement d63_predictor using Neon | George Steed
Add Neon implementations of the d63 predictor for 4x4, 8x8, 16x16 and 32x32 block sizes. Also update tests to add new corresponding cases.
Speedups over the C code (higher is better):
Microarch.  | Compiler | Block | Speedup
Neoverse N1 | LLVM 15  | 4x4   | 2.10
Neoverse N1 | LLVM 15  | 8x8   | 4.45
Neoverse N1 | LLVM 15  | 16x16 | 4.74
Neoverse N1 | LLVM 15  | 32x32 | 2.27
Neoverse N1 | GCC 12   | 4x4   | 2.46
Neoverse N1 | GCC 12   | 8x8   | 10.37
Neoverse N1 | GCC 12   | 16x16 | 11.46
Neoverse N1 | GCC 12   | 32x32 | 6.57
Neoverse V1 | LLVM 15  | 4x4   | 2.24
Neoverse V1 | LLVM 15  | 8x8   | 3.53
Neoverse V1 | LLVM 15  | 16x16 | 4.44
Neoverse V1 | LLVM 15  | 32x32 | 2.17
Neoverse V1 | GCC 12   | 4x4   | 2.25
Neoverse V1 | GCC 12   | 8x8   | 7.67
Neoverse V1 | GCC 12   | 16x16 | 8.97
Neoverse V1 | GCC 12   | 32x32 | 4.77
Change-Id: Ib4a1a2cb5a5c4495ae329529f8847664cbd0dfe0
2023-02-13 | Optimize vpx_highbd_comp_avg_pred_neon | Salome Thirot
Optimize the implementation of vpx_highbd_comp_avg_pred_neon by making use of the URHADD instruction to compute the average. Change-Id: Id74a6d9c33e89bc548c3c7ecace59af69051b4a7
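The URHADD-based average above can be sketched in portable scalar C. This is an illustrative model, not the libvpx implementation; the function names and signatures below are hypothetical. urhadd(a, b) computes the rounding halving add (a + b + 1) >> 1 without intermediate overflow, which is exactly the per-element average a compound prediction needs.

```c
#include <stdint.h>

/* Scalar model of the Neon URHADD instruction:
 * urhadd(a, b) == (a + b + 1) >> 1, with no intermediate overflow. */
uint16_t rounding_avg_u16(uint16_t a, uint16_t b) {
  /* Widen to 32 bits so a + b + 1 cannot wrap. */
  return (uint16_t)(((uint32_t)a + b + 1) >> 1);
}

/* Hypothetical comp_avg sketch: average a predicted block against a
 * reference block, one rounding average per pixel. */
void comp_avg_pred_model(uint16_t *comp, const uint16_t *pred,
                         const uint16_t *ref, int w, int h,
                         int ref_stride) {
  int i, j;
  for (i = 0; i < h; ++i) {
    for (j = 0; j < w; ++j) comp[j] = rounding_avg_u16(pred[j], ref[j]);
    comp += w;
    pred += w;
    ref += ref_stride;
  }
}
```

A single URHADD replaces the widen/add/round/narrow sequence this scalar code spells out.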
2023-02-13 | Specialize Neon high bitdepth avg subpel variance by filter value | Salome Thirot
Use the same specialization as for standard bitdepth. The rationale for the specialization is as follows: the optimal implementation of the bilinear interpolation depends on the filter values being used. For both horizontal and vertical interpolation this can simplify to just taking the source values, or averaging the source and reference values - which can be computed more easily than a bilinear interpolation with arbitrary filter values. This patch introduces tests to find the optimal bilinear interpolation implementation based on the filter values being used. This new specialization is only used for larger block sizes. Change-Id: Id5a2b2d9fac6f878795a6ed9de2bc27d9e62d661
2023-02-13 | Specialize Neon high bitdepth subpel variance by filter value | Salome Thirot
Use the same specialization as for standard bitdepth. The rationale for the specialization is as follows: the optimal implementation of the bilinear interpolation depends on the filter values being used. For both horizontal and vertical interpolation this can simplify to just taking the source values, or averaging the source and reference values - which can be computed more easily than a bilinear interpolation with arbitrary filter values. This patch introduces tests to find the optimal bilinear interpolation implementation based on the filter values being used. This new specialization is only used for larger block sizes. Change-Id: I73182c979255f0332a274f2e5907df7f38c9eeb3
2023-02-13 | Refactor Neon high bitdepth avg subpel variance functions | Salome Thirot
Use the same general code style as in the standard bitdepth Neon implementation - merging the computation of vpx_highbd_comp_avg_pred with the second pass of the bilinear filter to avoid storing and loading the block again. Also move vpx_highbd_comp_avg_pred_neon to its own file (like the standard bitdepth implementation) since we're no longer using it for averaging sub-pixel variance. Change-Id: I2f5916d5b397db44b3247b478ef57046797dae6c
2023-02-13 | Optimize Neon high bitdepth subpel variance functions | Salome Thirot
Use the same general code style as in the standard bitdepth Neon implementation. Additionally, do not unnecessarily widen to 32-bit data types when doing bilinear filtering - allowing us to process twice as many elements per instruction. Change-Id: I1e178991d2aa71f5f77a376e145d19257481e90f
2023-02-09 | Optimize Neon high bitdepth convolve copy | Jonathan Wright
Use standard loads and stores instead of the significantly slower interleaving/de-interleaving variants. Also move all loads in loop bodies above all stores as a mitigation against the compiler thinking that the src and dst pointers alias (since we can't use restrict in C89.) Change-Id: Idd59dca51387f553f8db27144a2b8f2377c937d3
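The "loads above stores" mitigation can be sketched in scalar C. This is a hypothetical illustration, not the libvpx convolve code: because C89 has no `restrict`, the compiler must assume `src` and `dst` may alias, so interleaving a store between loads would force it to re-load; grouping the loads first keeps them free to issue together.

```c
#include <stdint.h>

/* Hypothetical sketch: copy four rows a column at a time, with every
 * load performed before any store. Without 'restrict' (C89) the
 * compiler must assume src and dst may alias; once a store happens,
 * later loads could be ordered after it. Hoisting the loads avoids
 * that. */
void copy_4rows_model(const uint8_t *src, int src_stride,
                      uint8_t *dst, int dst_stride, int w) {
  int j;
  for (j = 0; j < w; ++j) {
    /* All four loads first... */
    const uint8_t a = src[0 * src_stride + j];
    const uint8_t b = src[1 * src_stride + j];
    const uint8_t c = src[2 * src_stride + j];
    const uint8_t d = src[3 * src_stride + j];
    /* ...then all four stores. */
    dst[0 * dst_stride + j] = a;
    dst[1 * dst_stride + j] = b;
    dst[2 * dst_stride + j] = c;
    dst[3 * dst_stride + j] = d;
  }
}
```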
2023-02-07 | Use 4D reduction Neon helper for standard bitdepth SAD4D | Salome Thirot
Move the 4D reduction helper function to sum_neon.h and use this for both standard and high bitdepth SAD4D paths. This also removes the AArch64 requirement for using the UDOT Neon SAD4D paths. Change-Id: I207f76b3d42aa541809b0672c3b3d86e54d133ff
2023-02-06 | Optimize Neon implementation of high bitdepth SAD4D functions | Salome Thirot
Optimizations take a similar form to those implemented for Armv8.0 standard bitdepth SAD4D:
- Use ABD, UADALP instead of ABAL, ABAL2 (double the throughput on modern out-of-order Arm-designed cores.)
- Use more accumulator registers to make better use of Neon pipeline resources on Arm CPUs that have four Neon pipes.
- Compute the four SAD sums in parallel so that we only load the source block once - instead of four times.
Change-Id: Ica45c44fd167e5fcc83871d8c138fc72ed3a9723
2023-02-06 | Optimize Neon implementation of high bitdepth avg SAD functions | Salome Thirot
Optimizations take a similar form to those implemented for standard bitdepth averaging SAD:
- Use ABD, UADALP instead of ABAL, ABAL2 (double the throughput on modern out-of-order Arm-designed cores.)
- Use more accumulator registers to make better use of Neon pipeline resources on Arm CPUs that have four Neon pipes.
Change-Id: I75c5f09948f6bf17200f82e00e7a827a80451108
2023-02-06 | Optimize Neon implementation of high bitdepth SAD functions | Salome Thirot
Optimizations take a similar form to those implemented for standard bitdepth SAD:
- Use ABD, UADALP instead of ABAL, ABAL2 (double the throughput on modern out-of-order Arm-designed cores.)
- Use more accumulator registers to make better use of Neon pipeline resources on Arm CPUs that have four Neon pipes.
Change-Id: I9e626d7fa0e271908dc43448405a7985b80e6230
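The structure described above can be modelled in scalar C. The names below are illustrative, not the libvpx API: four independent partial sums stand in for the extra Neon accumulator registers (so independent chains can execute in parallel on a four-pipe core), and the per-element absolute difference plus widening accumulate mirrors ABD followed by UADALP.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar SAD sketch with four independent accumulators,
 * modelling the "more accumulator registers" optimization. */
uint32_t sad_model(const uint16_t *src, int src_stride,
                   const uint16_t *ref, int ref_stride, int w, int h) {
  uint32_t sum[4] = { 0, 0, 0, 0 };
  int i, j;
  for (i = 0; i < h; ++i) {
    for (j = 0; j < w; ++j) {
      /* abs-diff (ABD) then accumulate into a wider sum (UADALP). */
      const int d = (int)src[j] - (int)ref[j];
      sum[j & 3] += (uint32_t)abs(d); /* spread over 4 dependency chains */
    }
    src += src_stride;
    ref += ref_stride;
  }
  return sum[0] + sum[1] + sum[2] + sum[3];
}
```

The final reduction of the four partial sums corresponds to the horizontal-add step at the end of the Neon kernel.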
2023-01-31 | Merge "Use load_unaligned mem_neon.h helpers in SAD and SAD4D" into main | James Zern
2023-01-31 | Use load_unaligned mem_neon.h helpers in SAD and SAD4D | Jonathan Wright
Use the load_unaligned helper functions in mem_neon.h to load strided sequences of 4 bytes where alignment is not guaranteed in the Neon SAD and SAD4D paths. Change-Id: I941d226ef94fd7a633b09fc92165a00ba68a1501
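The idea behind those helpers can be sketched portably. This is a hypothetical model, not the actual mem_neon.h code: each 4-byte group is fetched with `memcpy`, which is well-defined for any alignment (and compiles to a plain 4-byte load on targets that permit unaligned access); the real helpers then move the gathered bytes into a Neon register.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: gather four 4-byte groups from strided rows
 * whose alignment is not guaranteed, without undefined behaviour. */
void load_unaligned_4x4_model(const uint8_t *buf, int stride,
                              uint8_t out[16]) {
  int i;
  for (i = 0; i < 4; ++i) {
    /* memcpy is legal at any alignment; a direct uint32_t* load
     * through a misaligned pointer would be UB. */
    memcpy(out + 4 * i, buf + i * stride, 4);
  }
}
```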
2023-01-30 | Fix unsigned integer overflow in sse computation | Cheng Chen
Basically port the fix from libaom: https://aomedia-review.googlesource.com/c/aom/+/169361 Change-Id: Id06a5db91372037832399200ded75d514e096726
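The shape of this class of fix can be illustrated in scalar C (hypothetical names; the ported change itself lives in the variance/SSE code): a sum of squared 16-bit differences can exceed 32 bits for large high-bitdepth blocks, so each squared term is widened before it is accumulated into a 64-bit total.

```c
#include <stdint.h>

/* Hypothetical sse sketch: accumulate squared differences in 64 bits
 * so large high-bitdepth blocks cannot overflow a 32-bit sum.
 * (With 12-bit samples, a single squared diff can reach ~2^24, so a
 * 64x64 block can overflow uint32_t.) */
uint64_t sse_model(const uint16_t *src, const uint16_t *ref, int n) {
  uint64_t sse = 0;
  int i;
  for (i = 0; i < n; ++i) {
    const int32_t d = (int32_t)src[i] - (int32_t)ref[i];
    sse += (uint64_t)((int64_t)d * d); /* widen before accumulating */
  }
  return sse;
}
```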
2023-01-30 | Merge "Refactor 8x8 16-bit Neon transpose functions" into main | James Zern
2023-01-30 | Refactor Neon implementation of SAD4D functions | Salome Thirot
Refactor and optimize the Neon implementation of SAD4D functions - effectively backporting these libaom changes[1,2]. [1] https://aomedia-review.googlesource.com/c/aom/+/162181 [2] https://aomedia-review.googlesource.com/c/aom/+/162183 Change-Id: Icb04bd841d86f2d0e2596aa7ba86b74f8d2d360b
2023-01-27 | Refactor 8x8 16-bit Neon transpose functions | Gerda Zsejke More
Refactor the Neon implementation of transpose_s16_8x8(q) and transpose_u16_8x8 so that the final step compiles to 8 ZIP1/ZIP2 instructions as opposed to 8 EXT, MOV pairs. This change removes 8 instructions per call to transpose_s16_8x8(q), transpose_u16_8x8 where the result stays in registers for further processing - rather than being stored to memory - like in vpx_hadamard_8x8_neon, for example. This is a backport of this libaom patch[1]. [1] https://aomedia-review.googlesource.com/c/aom/+/169426 Change-Id: Icef3e51d40efeca7008e1c4fc701bf39bd319c88
2023-01-26 | Merge "Refactor Neon implementation of SAD functions" into main | James Zern
2023-01-25 | Refactor Neon implementation of SAD functions | Salome Thirot
Refactor and optimize the Neon implementation of SAD functions - effectively backporting these libaom changes[1,2,3]. [1] https://aomedia-review.googlesource.com/c/aom/+/161921 [2] https://aomedia-review.googlesource.com/c/aom/+/161923 [3] https://aomedia-review.googlesource.com/c/aom/+/166963 Change-Id: I2d72fd0f27d61a3e31a78acd33172e2afb044cb8
2023-01-24 | [NEON] Add Highbd FHT 8x8/16x16 functions | Konstantinos Margaritis
In total this gives about 9% extra performance for both rt/best profiles. Also add a transpose_s32 16x16 function. Change-Id: Ib6f368bbb9af7f03c9ce0deba1664cef77632fe2
2023-01-23 | Specialize Neon averaging subpel variance by filter value | Salome Thirot
Use the same specialization for averaging subpel variance functions as used for the non-averaging variants. The rationale for the specialization is as follows: the optimal implementation of the bilinear interpolation depends on the filter values being used. For both horizontal and vertical interpolation this can simplify to just taking the source values, or averaging the source and reference values - which can be computed more easily than a bilinear interpolation with arbitrary filter values. This patch introduces tests to find the optimal bilinear interpolation implementation based on the filter values being used. This new specialization is only used for larger block sizes. This is a backport of this libaom change[1]. After this change, the only differences between the code in libvpx and libaom are due to libvpx being compiled with ISO C90, which forbids mixing declarations and code [-Wdeclaration-after-statement]. [1] https://aomedia-review.googlesource.com/c/aom/+/166962 Change-Id: I7860c852db94a7c9c3d72ae4411316685f3800a4
2023-01-23 | Refactor Neon averaging subpel variance functions | Salome Thirot
Merge the computation of vpx_comp_avg_pred into the second pass of the bilinear filter - avoiding the overhead of loading and storing the entire block again. This is a backport of this libaom change[1]. [1] https://aomedia-review.googlesource.com/c/aom/+/166961 Change-Id: I9327ff7382a46d50c42a5213a11379b957146372
2023-01-23 | Specialize Neon subpel variance by filter value for large blocks | Salome Thirot
The optimal implementation of the bilinear interpolation depends on the filter values being used. For both horizontal and vertical interpolation this can simplify to just taking the source values, or averaging the source and reference values - which can be computed more easily than a bilinear interpolation with arbitrary filter values. This patch introduces tests to find the optimal bilinear interpolation implementation based on the filter values being used. This new specialization is only used for larger block sizes (>= 16x16) as we need to be doing enough work to make the cost of finding the optimal implementation worth it. This is a backport of this libaom change[1]. After this change, the only differences between the code in libvpx and libaom are due to libvpx being compiled with ISO C90, which forbids mixing declarations and code [-Wdeclaration-after-statement]. [1] https://aomedia-review.googlesource.com/c/aom/+/162463 Change-Id: Ia818e148f6fd126656e8411d59c184b55dd43094
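The three cases the specialization distinguishes can be sketched in scalar C. This is an illustrative model, and it assumes 2-tap bilinear filters whose taps sum to 128 (so a {128, 0} filter is a pure copy and {64, 64} is a simple average); the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical 2-tap bilinear sketch, taps (128 - f1, f1) summing to
 * 128. The two special cases are the ones the commit message names:
 * f1 == 0 needs no filtering at all, f1 == 64 is a plain rounding
 * average, and only the general case needs the full multiply. */
uint8_t bilinear_model(uint8_t a, uint8_t b, int f1) {
  if (f1 == 0) return a;                             /* copy           */
  if (f1 == 64) return (uint8_t)((a + b + 1) >> 1);  /* simple average */
  /* General case: full bilinear interpolation with rounding. */
  return (uint8_t)((a * (128 - f1) + b * f1 + 64) >> 7);
}
```

Note the f1 == 64 shortcut agrees with the general formula: (64a + 64b + 64) >> 7 == (a + b + 1) >> 1.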
2023-01-23 | Refactor Neon subpel variance functions | Salome Thirot
Refactor the Neon implementation of the sub-pixel variance bilinear filter helper functions - effectively backporting this libaom patch[1]. [1] https://aomedia-review.googlesource.com/c/aom/+/162462 Change-Id: I3dee32e8125250bbeffeb63d1fef5da559bacbf1
2023-01-18 | Refactor Neon implementation of variance functions | Salome Thirot
Refactor and optimize the Neon implementation of variance functions - effectively backporting these libaom changes[1,2]. After this change, the only differences between the code in libvpx and libaom are due to libvpx being compiled with ISO C90, which forbids mixing declarations and code [-Wdeclaration-after-statement]. [1] https://aomedia-review.googlesource.com/c/aom/+/162241 [2] https://aomedia-review.googlesource.com/c/aom/+/162262 Change-Id: Ia4e8fff4d53297511d1a1e43bca8053bf811e551
2023-01-12 | Implement vertical convolutions using Neon USDOT instruction | Jonathan Wright
Add additional AArch64 paths for vpx_convolve8_vert_neon and vpx_convolve8_avg_vert_neon that use the Armv8.6-A USDOT (mixed-sign dot-product) instruction. The USDOT instruction takes an 8-bit unsigned operand vector and a signed 8-bit operand vector to produce a signed 32-bit result. This is helpful because convolution filters often have both positive and negative values, while the 8-bit pixel channel data being filtered is all unsigned. As a result, the USDOT convolution paths added here do not have to do the "transform the pixel channel data to [-128, 128) and correct for it later" dance that we have to do with the SDOT paths. The USDOT instruction is optional from Armv8.2 to Armv8.5 but mandatory from Armv8.6 onwards. The availability of the USDOT instruction is indicated by the feature macro __ARM_FEATURE_MATMUL_INT8. The SDOT paths are retained for use on target CPUs that do not implement the USDOT instructions. Change-Id: Ifbf467681dd53bb1d26e22359885e6edde3c5c72
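The "transform to [-128, 128) and correct for it later" dance can be modelled in scalar C (illustrative names, not the libvpx kernels). The SDOT-style path biases each unsigned pixel into signed range by subtracting 128, convolves, then undoes the bias by adding 128 * sum(filter); the USDOT-style path multiplies unsigned pixels by signed taps directly. The two must agree exactly.

```c
#include <stdint.h>

/* USDOT-style: unsigned pixels times signed taps, no correction. */
int32_t convolve_direct(const uint8_t *s, const int8_t *f, int n) {
  int32_t sum = 0;
  int i;
  for (i = 0; i < n; ++i) sum += (int32_t)s[i] * f[i];
  return sum;
}

/* SDOT-style: bias pixels into signed range, then correct. */
int32_t convolve_biased(const uint8_t *s, const int8_t *f, int n) {
  int32_t sum = 0, filter_sum = 0;
  int i;
  for (i = 0; i < n; ++i) {
    sum += ((int32_t)s[i] - 128) * f[i]; /* both operands now signed */
    filter_sum += f[i];
  }
  return sum + 128 * filter_sum; /* undo the -128 bias */
}
```

Skipping the bias and correction is exactly the saving the USDOT paths provide.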
2023-01-11 | Implement horizontal convolutions using Neon USDOT instruction | Jonathan Wright
Add additional AArch64 paths for vpx_convolve8_horiz_neon and vpx_convolve8_avg_horiz_neon that use the Armv8.6-A USDOT (mixed-sign dot-product) instruction. The USDOT instruction takes an 8-bit unsigned operand vector and a signed 8-bit operand vector to produce a signed 32-bit result. This is helpful because convolution filters often have both positive and negative values, while the 8-bit pixel channel data being filtered is all unsigned. As a result, the USDOT convolution paths added here do not have to do the "transform the pixel channel data to [-128, 128) and correct for it later" dance that we have to do with the SDOT paths. The USDOT instruction is optional from Armv8.2 to Armv8.5 but mandatory from Armv8.6 onwards. The availability of the USDOT instruction is indicated by the feature macro __ARM_FEATURE_MATMUL_INT8. The SDOT paths are retained for use on target CPUs that do not implement the USDOT instructions. Change-Id: If19f5872c3453458a8cfb7c7d2be82a2c0eab46a
2023-01-05 | Use Neon load/store helper functions consistently | Jonathan Wright
Define all Neon load/store helper functions in mem_neon.h and use them consistently in Neon convolution functions. Change-Id: I57905bc0a3574c77999cf4f4a73442c3420fa2be
2023-01-05 | Use lane-referencing intrinsics in Neon convolution kernels | Jonathan Wright
The Neon convolution helper functions take a pointer to a filter and load the 8 values into a single Neon register. For some reason, filter values 3 and 4 are then duplicated into their own separate registers. This patch modifies these helper functions so that they access filter values 3 and 4 via the lane-referencing versions of the various Neon multiply instructions. This reduces register pressure and tidies up the source code quite a bit. Change-Id: Ia4aeee8b46fe218658fb8577dc07ff04a9324b3e
2022-12-20 | [x86]: Add vpx_highbd_comp_avg_pred_sse2(). | Scott LaVarnway
C vs SSE2 speedups:
4x4:   3.38x
8x8:   3.45x
16x16: 2.06x
32x32: 2.19x
64x64: 1.39x
Change-Id: I46638fe187b49a78fee554114fac51c485d74474
2022-12-08 | [x86]: Add vpx_highbd_subtract_block_avx2(). | Scott LaVarnway
Up to 4x faster than "sse2 vectorized C". Change-Id: Ie9b3c12a437c5cddf92c4d5349c4f659ca6b82ea
2022-11-11 | [NEON] Optimize FHT functions, add highbd FHT 4x4 | Konstantinos Margaritis
Refactor and optimize the FHT functions further using the new butterfly functions: 4x4 is 5% faster, and 8x8 and 16x16 are 10% faster than the previous versions. The highbd 4x4 FHT is 2.27x faster than the C version for --rt. Change-Id: I3ebcd26010f6c5c067026aa9353cde46669c5d94
2022-11-04 | Add Neon implementation of vpx_hadamard_32x32 | Andrew Salkeld
Add an Arm Neon implementation of vpx_hadamard_32x32 and use it instead of the scalar C implementation. Also add test coverage for the new Neon implementation. Change-Id: Iccc018eec4dbbe629fb0c6f8ad6ea8554e7a0b13
2022-11-03 | [NEON] Optimize highbd 32x32 DCT | Konstantinos Margaritis
For --best quality, the resulting function vpx_highbd_fdct32x32_rd_neon takes 0.27% of cpu time in profiling, vs 6.27% for the sum of the scalar functions vpx_fdct32, vpx_fdct32.constprop.0 and vpx_fdct32x32_rd_c for rd. For --rt quality, the function takes 0.19% vs 4.57% for the scalar version. Overall, this improves highbd encoding time by ~6% for --best and ~9% for --rt. Change-Id: I1ce4bbef6e364bbadc76264056aa3f86b1a8edc5
2022-11-01 | [NEON] Optimize and homogenize Butterfly DCT functions | Konstantinos Margaritis
Provide a set of commonly used Butterfly DCT functions for use in the DCT 4x4, 8x8, 16x16 and 32x32 functions. These are provided in various forms, using vqrdmulh_s16/vqrdmulh_s32 for the _fast variants, which unfortunately are only usable in pass1 of most DCTs, as they do not provide the necessary precision in pass2. This gave a performance gain ranging from 5% to 15% in the 16x16 case. Also, for 32x32, the loads were rearranged; along with the butterfly optimizations, this gave a 10% gain in the 32x32_rd function. This refactoring was necessary to allow easier porting of the highbd 32x32 functions, which follow in this patchset. Change-Id: I6282e640b95a95938faff76c3b2bace3dc298bc3
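The precision caveat above comes from what vqrdmulh computes. A scalar model of the 16-bit form (illustrative; the real operation is the Neon SQRDMULH instruction): a saturating, rounding, doubling multiply that keeps only the high half, so the final >> 16 discards low-order bits - enough accuracy for pass1 of the DCT, but not for pass2.

```c
#include <stdint.h>

/* Scalar model of vqrdmulh_s16 (SQRDMULH): saturating rounding
 * doubling multiply-high. Computed in 64 bits to avoid overflow;
 * only INT16_MIN * INT16_MIN can exceed the int16_t range, so a
 * single upper clamp suffices. */
int16_t sqrdmulh_s16_model(int16_t a, int16_t b) {
  int64_t p = (2 * (int64_t)a * b + (1 << 15)) >> 16; /* double, round */
  if (p > INT16_MAX) p = INT16_MAX;                   /* saturate      */
  return (int16_t)p;
}
```

Because the product is truncated to its high half, a constant c must be pre-scaled as c * 2^15 / divisor (or similar) before use, and the rounding error this introduces is what rules the _fast variants out of pass2.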
2022-10-24 | highbd_sad_avx2: normalize function param names | James Zern
(src|ref)8_ptr -> (src|ref)_ptr. This aligns the names with the rtcd header and clears some clang-tidy warnings. Change-Id: Id1aa29da8c0fa5860b46ac902f5b2620c0d3ff54
2022-10-13 | [NEON] fix clang compile warnings | Konstantinos Margaritis
Change-Id: Ib7ce7a774ec89ba51169ea64d24c878109ef07d1
2022-10-13 | Merge "Add vpx_highbd_sad64x{64,32}_avg_avx2." into main | Scott LaVarnway
2022-10-12 | [NEON] Add highbd FDCT 16x16 function | Konstantinos Margaritis
90-95% faster than C version in best/rt profiles Change-Id: I41d5e9acdc348b57153637ec736498a25ed84c25
2022-10-12 | Merge "[NEON] Add highbd FDCT 8x8 function" into main | James Zern
2022-10-12 | Merge "Add vpx_highbd_sad32x{64,32,16}_avg_avx2." into main | Scott LaVarnway
2022-10-12 | Merge "Add vpx_highbd_sad16x{32,16,8}_avg_avx2." into main | Scott LaVarnway
2022-10-12 | [NEON] Add highbd FDCT 8x8 function | Konstantinos Margaritis
50% faster than C version in best/rt profiles Change-Id: I0f9504ed52b5d5f7722407e91108ed4056d66bc2
2022-10-12 | Add vpx_highbd_sad64x{64,32}_avg_avx2. | Scott LaVarnway
~2.8x faster than the sse2 version. Bug: b/245917257 Change-Id: Ib727ba8a8c8fa4df450bafdde30ed99fd283f06d
2022-10-12 | [NEON] Add highbd FDCT 4x4 function | Konstantinos Margaritis
~80% faster than C version for both best/rt profiles. Change-Id: Ibb3c8e1862131d2a020922420d53c66b31d5c2c3
2022-10-12 | Add vpx_highbd_sad32x{64,32,16}_avg_avx2. | Scott LaVarnway
2.1x to 2.8x faster than the sse2 version. Bug: b/245917257 Change-Id: I1aaffa4a1debbe5559784e854b8fc6fba07e5000
2022-10-12 | Add vpx_highbd_sad16x{32,16,8}_avg_avx2. | Scott LaVarnway
1.6x to 2.1x faster than the sse2 version. Bug: b/245917257 Change-Id: I56c467a850297ae3abcca4b4843302bb8d5d0ac1
2022-10-12 | [NEON] Move helper functions for reuse | Konstantinos Margaritis
Move all butterfly functions to fdct_neon.h. Slightly optimize the load/scale/cross functions in fdct 16x16. These will be reused in the highbd variants. Change-Id: I28b6e0cc240304bab6b94d9c3f33cca77b8cb073
2022-10-10 | [NEON] move transpose_8x8 to reuse | Konstantinos Margaritis
Change-Id: I3915b6c9971aedaac9c23f21fdb88bc271216208