path: root/vp9/encoder/arm/neon
Age | Commit message | Author
2023-05-03 | s/__aarch64__/VPX_ARCH_AARCH64/ | James Zern
This allows AArch64 to be correctly detected when building with Visual Studio (cl.exe) and fixes a crash in vp9_diamond_search_sad_neon.c. There are still test failures, however. Microsoft's compiler doesn't define __ARM_FEATURE_*. To use those paths we may need to rely on _M_ARM64_EXTENSION. Bug: webm:1788 Bug: b/277255076 Change-Id: I4d26f5f84dbd0cbcd1cdf0d7d932ebcf109febe5
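As a rough illustration of the detection issue this commit describes: GCC and Clang predefine `__aarch64__` for 64-bit Arm targets, while Microsoft's compiler predefines `_M_ARM64` instead. The sketch below derives a single macro from both. `EXAMPLE_ARCH_AARCH64` and `example_is_aarch64` are hypothetical names used only for illustration; this is not how libvpx defines `VPX_ARCH_AARCH64`.

```c
/* Hypothetical macro for illustration; not part of libvpx. */
#if defined(__aarch64__) || defined(_M_ARM64)
#define EXAMPLE_ARCH_AARCH64 1
#else
#define EXAMPLE_ARCH_AARCH64 0
#endif

/* Returns 1 when compiled for 64-bit Arm, 0 otherwise. */
static int example_is_aarch64(void) { return EXAMPLE_ARCH_AARCH64; }
```

Code that previously tested `__aarch64__` directly would then test the single macro, so the same path is taken under cl.exe.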
2023-04-11 | Downsample SAD computation in motion search | Deepa K G
Added a speed feature to skip every other row in SAD computation during motion search.

                  Instruction Count   BD-Rate Loss(%)
cpu  Resolution   Reduction(%)        avg.psnr  ovr.psnr  ssim
0    LOWRES2      0.958               0.0204    0.0095    0.0275
0    MIDRES2      1.891              -0.0636    0.0032    0.0247
0    HDRES2       2.869               0.0434    0.0345    0.0686
0    Average      1.905               0.0000    0.0157    0.0403

STATS_CHANGED
Change-Id: I1a8692757ed0cbcb2259729b3ecfb0436cdf49ce
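The row-skipping idea can be sketched in scalar C as follows. This is a minimal sketch, not the libvpx implementation: the function names are hypothetical, and doubling the partial sum so it stays on the same scale as a full SAD is an assumption that the commit message does not state.

```c
#include <stdint.h>
#include <stdlib.h>

/* Plain sum of absolute differences over a width x height block. */
static unsigned int sad_block(const uint8_t *src, int src_stride,
                              const uint8_t *ref, int ref_stride,
                              int width, int height) {
  unsigned int sad = 0;
  for (int r = 0; r < height; ++r) {
    for (int c = 0; c < width; ++c) sad += (unsigned int)abs(src[c] - ref[c]);
    src += src_stride;
    ref += ref_stride;
  }
  return sad;
}

/* Downsampled SAD: visit rows 0, 2, 4, ... by doubling both strides and
 * halving the row count, then double the sum (assumed scaling). */
static unsigned int sad_block_skip_rows(const uint8_t *src, int src_stride,
                                        const uint8_t *ref, int ref_stride,
                                        int width, int height) {
  return 2 * sad_block(src, 2 * src_stride, ref, 2 * ref_stride, width,
                       height / 2);
}
```

The estimate matches the full SAD only when the skipped rows contribute as much as the visited ones, which is why a change like this is expected to alter encoder statistics.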
2023-04-11 | Avoid redundant start MV SAD calculation | Deepa K G
Avoided repeated calculation of start MV SAD during full pixel motion search.

                  Instruction Count
cpu  Resolution   Reduction(%)
0    LOWRES2      0.162
0    MIDRES2      0.246
0    HDRES2       0.325
0    Average      0.245

Change-Id: I2b4786901f254ce32ee8ca8a3d56f1c9f112f1d4
2023-03-14 | Merge "Add Neon implementation of vp9_highbd_block_error_c" into main | James Zern
2023-03-14 | Add Neon implementation of vp9_highbd_block_error_c | Salome Thirot
Add Neon implementation of vp9_highbd_block_error_c as well as the corresponding tests. Change-Id: Ibe0eb077f959ced0dcd7d0d8d9d529d3b5bc1874
2023-03-14 | [NEON] Add temporal filter functions, 8-bit and highbd | Konstantinos Margaritis
Both are around 3x faster than original C version. 8-bit gives a small 0.5% speed increase, whereas highbd gives ~2.5%. Change-Id: I71d75ddd2757b19aa201e879fd9fa8f3a25431ad
2023-03-07 | Add Neon implementation of vp9_block_error_c | Salome Thirot
Add Neon implementation of vp9_block_error_c as well as the corresponding tests. Change-Id: I79247b5ae24f51b7b55fc5e517d5e403dc86367a
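The commit message does not spell out what a block error function computes. As a rough orientation, a scalar reference of this kind typically returns the sum of squared differences between the original and dequantized coefficients, and also reports the sum of squared original coefficients. The sketch below assumes that behaviour; `block_error_ref` is a hypothetical name and `int32_t` stands in for the coefficient type.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: returns sum((coeff - dqcoeff)^2) and
 * stores sum(coeff^2) in *ssz. Accumulates in 64 bits to avoid overflow. */
static int64_t block_error_ref(const int32_t *coeff, const int32_t *dqcoeff,
                               size_t n, int64_t *ssz) {
  int64_t error = 0;
  int64_t sqcoeff = 0;
  for (size_t i = 0; i < n; ++i) {
    const int64_t diff = (int64_t)coeff[i] - dqcoeff[i];
    error += diff * diff;
    sqcoeff += (int64_t)coeff[i] * coeff[i];
  }
  *ssz = sqcoeff;
  return error;
}
```

A SIMD version would be tested by comparing both outputs against a reference like this for random inputs.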
2023-03-07 | Optimize vp9_block_error_fp_neon | Salome Thirot
Currently vp9_block_error_fp_neon is only used when CONFIG_VP9_HIGHBITDEPTH is set to false. This patch optimizes the implementation and uses tran_low_t instead of int16_t so that the function can also be used in builds where vp9_highbitdepth is enabled. Change-Id: Ibab7ec5f74b7652fa2ae5edf328f9ec587088fd3
2023-02-01 | vp9_diamond_search_sad_neon: use DECLARE_ALIGNED | James Zern
rather than the gcc specific __attribute__((aligned())); fixes build targeting ARM64 windows. Bug: webm:1788 Change-Id: I2210fc215f44d90c1ce9dee9b54888eb1b78c99e
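For context, a portable alignment macro usually selects between the GCC-style attribute and Microsoft's `__declspec(align())`. The sketch below shows the general pattern under that assumption; `EXAMPLE_DECLARE_ALIGNED` and `example_is_aligned16` are hypothetical names, and the real `DECLARE_ALIGNED` definition may differ.

```c
#include <stdint.h>

/* Hypothetical portable alignment macro, for illustration only. */
#if defined(_MSC_VER)
#define EXAMPLE_DECLARE_ALIGNED(n, type, name) __declspec(align(n)) type name
#elif defined(__GNUC__)
#define EXAMPLE_DECLARE_ALIGNED(n, type, name) \
  type name __attribute__((aligned(n)))
#else
/* Unknown compiler: no alignment guarantee in this sketch. */
#define EXAMPLE_DECLARE_ALIGNED(n, type, name) type name
#endif

/* Declares a 16-byte aligned local array and checks its address. */
static int example_is_aligned16(void) {
  EXAMPLE_DECLARE_ALIGNED(16, int32_t, v[4]) = { 0, 1, 2, 3 };
  return ((uintptr_t)v % 16) == 0;
}
```

Writing the declaration through a macro keeps the source free of compiler-specific syntax, which is what lets the same file build for ARM64 Windows.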
2023-01-24 | [NEON] Add Highbd FHT 8x8/16x16 functions | Konstantinos Margaritis
In total this gives about 9% extra performance for both rt/best profiles. Furthermore, add transpose_s32 16x16 function Change-Id: Ib6f368bbb9af7f03c9ce0deba1664cef77632fe2
2023-01-05 | Use Neon load/store helper functions consistently | Jonathan Wright
Define all Neon load/store helper functions in mem_neon.h and use them consistently in Neon convolution functions. Change-Id: I57905bc0a3574c77999cf4f4a73442c3420fa2be
2022-11-11 | [NEON] Optimize FHT functions, add highbd FHT 4x4 | Konstantinos Margaritis
Refactor & optimize FHT functions further, use new butterfly functions. 4x4 5% faster, 8x8 & 16x16 10% faster than previous versions. Highbd 4x4 FHT version 2.27x faster than C version for --rt. Change-Id: I3ebcd26010f6c5c067026aa9353cde46669c5d94
2022-11-01 | [NEON] Optimize and homogenize Butterfly DCT functions | Konstantinos Margaritis
Provide a set of commonly used Butterfly DCT functions for use in DCT 4x4, 8x8, 16x16, 32x32 functions. These are provided in various forms, using vqrdmulh_s16/vqrdmulh_s32 for _fast variants, which unfortunately are only usable in pass1 of most DCTs, as they do not provide the necessary precision in pass2. This gave a performance gain ranging from 5% to 15% in the 16x16 case. Also, for 32x32, the loads were rearranged; together with the butterfly optimizations, this gave a 10% gain in the 32x32_rd function. This refactoring was necessary to allow easier porting of the highbd 32x32 functions, which will follow this patchset. Change-Id: I6282e640b95a95938faff76c3b2bace3dc298bc3
2022-10-24 | vp9_highbd_quantize_fp*_neon: normalize fn param name | James Zern
count -> n_coeffs. aligns the name with the rtcd header; clears a clang-tidy warning Change-Id: I36545ff479df92b117c95e494f16002e6990f433
2022-09-23 | quantize: increase iscan by 1 | Johann
All of the assembly adds 1 to iscan to convert from a 0 based array to the EOB value. Add 1 to all iscan values and remove the extra instructions from the assembly. Change-Id: I219dd7f2bd10533ab24b206289565703176dc5e9
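The effect of storing `iscan + 1` can be sketched in scalar C. With a zero-based inverse scan, the end-of-block (EOB) value is the largest scan position of a nonzero coefficient plus one; if the table already holds position plus one, the maximum is the EOB directly and zero naturally means an empty block. The function and parameter names below are hypothetical.

```c
#include <stdint.h>

/* iscan_plus1[i] holds the scan position of coefficient i, plus one.
 * Returns the EOB: 0 if every quantized coefficient is zero. */
static int eob_from_iscan_plus1(const int16_t *qcoeff,
                                const int16_t *iscan_plus1, int n) {
  int eob = 0;
  for (int i = 0; i < n; ++i) {
    if (qcoeff[i] != 0 && iscan_plus1[i] > eob) eob = iscan_plus1[i];
  }
  return eob;
}
```

In a SIMD version the same idea becomes a masked maximum over the table, with no separate increment per vector.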
2022-08-16 | Add vp9_highbd_quantize_fp_32x32_neon(). | Scott LaVarnway
Up to 2.6x faster than vp9_highbd_quantize_fp_32x32_c() for full calculations. Bug: b/237714063 Change-Id: Icfeff2ad4dcd57d0ceb47fe04789710807b9cbad
2022-08-15 | Merge "VPX: Add vp9_highbd_quantize_fp_neon()." into main | Scott LaVarnway
2022-08-15 | vp9_quantize_fp_32x32_neon() cleanup. | Scott LaVarnway
No change in performance. Bug: b/237714063 Change-Id: If6ad5fc27de4babe0bfff3fdbb4b7fd99a0544ab
2022-08-15 | VPX: Add vp9_highbd_quantize_fp_neon(). | Scott LaVarnway
Up to 4.1x faster than vp9_highbd_quantize_fp_c() for full calculations. ~1.3% overall encoder improvement for the test clip used. Bug: b/237714063 Change-Id: I8c6466bdbcf1c398b1d8b03cab4165c1d8556b0c
2022-08-11 | VPX: vp9_quantize_fp_neon() cleanup. | Scott LaVarnway
No change in performance. Bug: b/237714063 Change-Id: I868cda7acb0de840fbc85b23f3e36c50b39c331b
2022-07-13 | Actually include the fix for commit 8f4d1890c. | Konstantinos Margaritis
Change-Id: I6780f610151f2e092da525ff064d4b69f74fa61b
2022-07-08 | Revert "Revert "[NEON] Optimize vp9_diamond_search_sad() for NEON"" | Konstantinos Margaritis
This reverts commit 9f1329f8ac88ea5d7c6ae5d6a57221c36cf85ac8 and fixes a dumb mistake in evaluation of vfcmv. Used vdupq_n_s16, instead of vdupq_n_s32. Change-Id: Ie236c878c166405c49bc0f93f6d63a6715534a0a
2022-06-28 | rtc-svc: Fix to make SVC work for Profile 1 | Marco Paniconi
Added datarate unittest for 4:4:4 and 4:2:2 input, for spatial and temporal layers. Fix is needed in vp9_set_literal_size(): the sampling_x/y should be passed into update_initial_width(), otherwise sampling_x/y = 1/1 (4:2:0) was forced. vp9_set_literal_size() is only called by the svc and on dynamic resize. Fix issue with the normative optimized scaler: UV width/height was assumed to be 1/2 of Y, for the SSSE3 and Neon code. Also fix to assert for the scaled width/height: in case scaled width/height is odd it should be incremented by 1 (make it even). Change-Id: I3a2e40effa53c505f44ef05aaa3132e1b7f57dd5
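The two size rules mentioned in this message can be sketched with a pair of helpers. For 4:4:4 input the subsampling shift is 0 in both directions, so chroma planes match the luma size, and 4:2:2 subsamples horizontally only; a scaler that always halves the luma size is therefore wrong for Profile 1. The helper names are hypothetical, and rounding the chroma size up is an assumption of this sketch.

```c
/* Chroma dimension for a luma dimension and a subsampling shift (0 or 1).
 * Rounds up so an odd luma size still covers every sample (assumed). */
static int example_chroma_dim(int luma_dim, int subsampling) {
  return (luma_dim + subsampling) >> subsampling;
}

/* Makes an odd scaled dimension even by incrementing it by 1. */
static int example_round_up_to_even(int v) { return v + (v & 1); }
```

For example, a 640-wide 4:4:4 frame has 640-wide chroma planes, while the same frame in 4:2:0 has 320-wide chroma planes.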
2022-05-26 | Revert "[NEON] Optimize vp9_diamond_search_sad() for NEON" | Jerome Jiang
This reverts commit 258affdeab68ed59e181368baa46e2f1d077b0ab.

Reason for revert: Not bitexact with C version

Original change's description:
> [NEON] Optimize vp9_diamond_search_sad() for NEON
>
> About 50% improvement in comparison to the C function.
> I have followed the AVX version with some simplifications.
>
> Change-Id: I72ddbdb2fbc5ed8a7f0210703fe05523a37db1c9

Change-Id: I5c210b3dfe1f6dec525da857dd8c83946be566fc
2022-05-07 | [NEON] Optimize vp9_diamond_search_sad() for NEON | Konstantinos Margaritis
About 50% improvement in comparison to the C function. I have followed the AVX version with some simplifications. Change-Id: I72ddbdb2fbc5ed8a7f0210703fe05523a37db1c9
2022-03-31 | Merge "Optimize FHT functions for NEON" into main | James Zern
2022-03-30 | Optimize FHT functions for NEON | Konstantinos Margaritis
[NEON] Optimize vp9_fht4x4, vp9_fht8x8, vp9_fht16x16 for NEON

Following change #3516278, the improvement for these functions is:

Before:
  4.10%  0.75%  vpxenc  vpxenc  [.] vp9_fht16x16_c
  2.93%  0.65%  vpxenc  vpxenc  [.] vp9_fht8x8_c
  0.93%  0.77%  vpxenc  vpxenc  [.] vp9_fht4x4_c

And after the patch:
  0.69%  0.16%  vpxenc  vpxenc  [.] vp9_fht16x16_neon
  0.28%  0.28%  vpxenc  vpxenc  [.] vp9_fht8x8_neon
  0.54%  0.53%  vpxenc  vpxenc  [.] vp9_fht4x4_neon

Bug: webm:1634
Change-Id: I6748a0c4e0cfaafa3eefdd4848d0ac3aab6900e4
2022-03-30 | remove skip_block from quantize | Johann
Whether a block is skipped is handled by mi->skip. x->skip_block is kept exclusively to verify that the quantize functions are not called for skip blocks. Finishes the cleanup in 13eed991f Bug: libvpx:1612 Change-Id: I1598c3b682d3c5e6c57a15fa4cb5df2c65b3a58a
2021-05-04 | vp9_denoiser_neon,horizontal_add_s8x16: use vaddlv w/aarch64 | James Zern
this reduces the number of instructions to compute the sum Change-Id: Icae4d4fb3e343d5b6e5a095c60ac6d171b3e7d54
2019-08-05 | Fix vp9_quantize_fp(_32x32)_neon for HBD | Jerome Jiang
In high bitdepth builds, the Neon code could go out of range because of its use of int16x8_t and vmulq_s16. The C code always truncates out-of-range values. Change-Id: I33a968b8d812e3c8477f3a61d84482758a3f8b21
2019-08-01 | Fix saturation issue in vp9_quantize_fp_neon | Jerome Jiang
Change-Id: I7850a5c5aea3633e50e9a2efc8116b9e16383a8f
2019-03-25 | Remove deprecated code for vp9_fdct8x8_quant() | Jingning Han
Change-Id: If146bbf24f446f71be9147402e6d30533eee99d1
2018-11-12 | quantize: use aarch64 vmaxv | Johann
Simplify max value calculation on aarch64 by using vmaxv. Much faster for 4x4 but diminishing returns as the block size grows. Only the vp9 quantize has a speed test hooked up. Anticipate similar results for the other quantize versions.

Before:
[ RUN ] NEON/VP9QuantizeTest.DISABLED_Speed/2
[ BENCH ] Bypass calculations 4x4 31.6 ms ( ±0.0 ms )
[ BENCH ] Full calculations 4x4 31.6 ms ( ±0.0 ms )
[ BENCH ] Bypass calculations 8x8 17.7 ms ( ±0.0 ms )
[ BENCH ] Full calculations 8x8 17.7 ms ( ±0.0 ms )
[ BENCH ] Bypass calculations 16x16 14.2 ms ( ±0.0 ms )
[ BENCH ] Full calculations 16x16 14.2 ms ( ±0.0 ms )
[ OK ] NEON/VP9QuantizeTest.DISABLED_Speed/2 (1906 ms)
[ RUN ] NEON/VP9QuantizeTest.DISABLED_Speed/3
[ BENCH ] Bypass calculations 32x32 18.6 ms ( ±0.0 ms )
[ BENCH ] Full calculations 32x32 18.6 ms ( ±0.0 ms )

After:
[ RUN ] NEON/VP9QuantizeTest.DISABLED_Speed/2
[ BENCH ] Bypass calculations 4x4 29.1 ms ( ±0.0 ms )
[ BENCH ] Full calculations 4x4 29.1 ms ( ±0.0 ms )
[ BENCH ] Bypass calculations 8x8 16.9 ms ( ±0.0 ms )
[ BENCH ] Full calculations 8x8 16.9 ms ( ±0.0 ms )
[ BENCH ] Bypass calculations 16x16 14.1 ms ( ±0.0 ms )
[ BENCH ] Full calculations 16x16 14.1 ms ( ±0.0 ms )
[ OK ] NEON/VP9QuantizeTest.DISABLED_Speed/2 (1803 ms)
[ RUN ] NEON/VP9QuantizeTest.DISABLED_Speed/3
[ BENCH ] Bypass calculations 32x32 18.6 ms ( ±0.0 ms )
[ BENCH ] Full calculations 32x32 18.6 ms ( ±0.0 ms )

Change-Id: Ic95812b3fdbd4e47b4dcb8ed46c68a9617de38d2
2018-10-30 | clang-tidy: fix vp9/encoder parameters | Johann
BUG=webm:1444 Change-Id: I6823635eb1a99c3fcca0a8f091878e3ab2fdd2ac
2017-10-09 | Rename some inline functions in NEON scaling | Linfeng Zhang
Change-Id: I9d4c1af53d57f72fc716bacbe3b0965719c045ac
2017-10-02 | Add 4 to 3 scaling NEON optimization | Linfeng Zhang
Speed compared with the version calling vpx_scaled_2d_neon(): ~1.7x in general, ~2.8x for the BILINEAR filter. BUG=webm:1419 Change-Id: I8f0a54c2013e61ea086033010f97c19ecf47c7c6
2017-09-19 | cosmetics: NEON scaling code | Linfeng Zhang
Change-Id: Ib91054622c1f09c4ca523bc6837d7d8ab9f03618
2017-09-11 | Add 4 to 1 scaling NEON optimization | Linfeng Zhang
BUG=webm:1419 Change-Id: If82a93935d2453e61b7647aae70983db1740bec7
2017-09-07 | Add 2 to 1 scaling NEON optimization | Linfeng Zhang
BUG=webm:1419 Change-Id: I99c954ffa50a62ccff2c4ab54162916141826d9b
2017-08-23 | quantize fp: neon implementation | Johann
About 4x faster when values are below the dequant threshold and 10x faster if everything needs to be calculated. Both numbers would improve if the division for dqcoeff could be simplified. BUG=webm:1426 Change-Id: I8da67c1f3fcb4abed8751990c1afe00bc841f4b2
2017-08-21 | quantize fp: ignore skip_block in arm | Johann
Change-Id: Ie8ac00efa826eead2a227726a1add816e04ff147
2017-05-15 | move neon load/stores to a new file | Johann
Move the tran_low_t helper functions to a new file. Additional load/store functions will be added here. Change-Id: I52bf652c344c585ea2f3e1230886be93f5caefc3
2017-05-05 | vp9: Neon optimization for denoiser. Add unit tests. | Jerome Jiang
Denoiser on Neon is 5x faster than C code. BUG=webm:1420 Change-Id: I805ab64f809ff2137354116be6213e7ec29c1dcb
2017-02-16 | Drop zbin_ptr and quant_shift_ptr | Johann
vp9[_highbd]_quantize_fp[_32x32] and vp9_fdct8x8_quant do not make use of these parameters. scan is used for C code and iscan is used for SIMD implementations. Change-Id: I908a0ff7d3febac33da97e0596e040ec7bc18ca5
2017-02-14 | vp9 fdct highbd neon: connect existing highbd calls | Johann
Change-Id: Ia8f822bd6e70b3911bc433a5a750bfb6f9a3a75c
2017-02-14 | quantize_fp highbd neon: use tran_low_t for coeff | Johann
Change-Id: I90fd815f15884490ad138f35df575a00d31e8c95
2016-08-02 | vp9/encoder: apply clang-format | clang-format
Change-Id: I45d9fb4013f50766b24363a86365e8063e8954c2
2015-12-14 | move vp9_avg to vpx_dsp | James Zern
Change-Id: I7bc991abea383db1f86c1bb0f2e849837b54d90f
2015-12-08 | Add vp9_avg_4x4_neon and the unit test. | jackychen
Change-Id: I3ef9a9648841374ed3cc865a02053c14ad821a20
2015-11-24 | add vp9_satd_neon | James Zern
~60-65% faster at the function level across block sizes Change-Id: Iaf8cbe95731c43fdcbf68256e44284ba51a93893