Age  Commit message  Author
2023-08-10  RISC-V: optimize vp8_copy_mem with RVV  riscv64_android_optimization  Yuuta Liang
Test environment: 8c 1804Mhz i5-1140G7

RVV Impl:
    % CROSS=riscv64-unknown-linux-gnu- configure --target=riscv64-linux-gcc \
        --enable-debug --enable-gprof && make -j
    % time qemu-riscv64 -cpu rv64,v=true,zba=true,vlen=128 -L /path/to/sysroot/ \
        ./vpxenc --codec=vp8 -w 352 -h 288 -o akiyol.vpx ./akiyo_cif.yuv
    Pass 1/1 frame 300/300 314977B 8399b/f 251981b/s 92226 ms (3.25 fps)
    user 1m30.108s
    % gprof -abp ./vpxenc ./gmon.out | grep vp8_copy_mem
    1.36 53.09 1.04 1025863 0.00 0.00 vp8_copy_mem16x16_rvv
    0.72 59.01 0.55 1641368 0.00 0.00 vp8_copy_mem8x8_rvv
    0.05 65.95 0.04 764377 0.00 0.00 vp8_copy_mem8x4_rvv

C Impl:
    % CROSS=riscv64-unknown-linux-gnu- configure --target=generic-gnu --enable-debug \
        --enable-gprof && make -j
    % time qemu-riscv64 -cpu rv64,v=true,zba=true,vlen=128 -L /path/to/sysroot/ \
        ./vpxenc --codec=vp8 -w 352 -h 288 -o akiyol.vpx ./akiyo_cif.yuv
    Pass 1/1 frame 300/300 314977B 8399b/f 251981b/s 98417 ms (3.05 fps)
    user 1m36.146s
    % gprof -abp ./vpxenc ./gmon.out | grep vp8_copy_mem
    0.38 63.96 0.31 vp8_copy_mem8x4_c
    0.04 70.61 0.03 204336 0.00 0.00 vp8_copy_mem16x16_c

Signed-off-by: Yuuta Liang <yuuta@yuuta.moe>
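For readers unfamiliar with the technique, the following is a minimal sketch of a strided block copy using RVV intrinsics, not the code from this commit. The function name `copy_block_sketch` is hypothetical, and the intrinsic names assume a toolchain that provides the `__riscv_`-prefixed RVV intrinsics in `<riscv_vector.h>`; on other targets the sketch falls back to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* Hypothetical helper (not a libvpx function): copies a width x height
 * block of bytes between two buffers that may have different strides. */
static void copy_block_sketch(const uint8_t *src, int src_stride,
                              uint8_t *dst, int dst_stride,
                              int width, int height) {
  assert(width > 0 && height > 0);
  for (int r = 0; r < height; ++r) {
#if defined(__riscv_vector)
    /* Strip-mine the row: vsetvl returns how many bytes this pass handles. */
    size_t done = 0;
    while (done < (size_t)width) {
      size_t vl = __riscv_vsetvl_e8m1((size_t)width - done);
      vuint8m1_t v = __riscv_vle8_v_u8m1(src + done, vl);
      __riscv_vse8_v_u8m1(dst + done, v, vl);
      done += vl;
    }
#else
    /* Portable fallback so the sketch builds on non-RVV targets. */
    memcpy(dst, src, (size_t)width);
#endif
    src += src_stride;
    dst += dst_stride;
  }
}
```

With vlen=128, as in the QEMU command line above, a 16-byte row fits in a single vector register at LMUL=1, so the inner loop runs once per row for a 16x16 block.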
2023-07-10  add example how to use rtcd  Wang Chen
Uses vp8_sixtap_predict as the example; the actual algorithm is not implemented yet.

Test:
    $ CROSS=riscv64-unknown-linux-gnu- ../libvpx/configure --target=riscv64-linux-gcc
    $ make

Check that the vp8_sixtap_predict functions have been replaced with those suffixed with "_rvv":
    $ riscv64-unknown-linux-gnu-nm ./vp8/decoder/decodeframe.c.o | grep vp8_sixtap_predict16x16
    U vp8_sixtap_predict16x16_rvv

Check whether the vp8_sixtap_predictMxN_rvv functions work:
    $ qemu-riscv64 -L $SYSROOT_RV64 ./build-test/test_libvpx --gtest_filter="RVV/SixtapPredictTest.TestWithPresetData/*"

You should see log output such as:
    "--> vp8_sixtap_predict4x4_rvv"

"FAILED" is expected because the actual algorithm is not implemented yet.

Signed-off-by: Wang Chen <wangchen20@iscas.ac.cn>
Co-authored-by: sun min <sunmin89@outlook.com>
2023-07-10  support framework for RunTime Cpu Detection  Wang Chen
Adds only the RTCD-related code needed to set up the framework. Actual runtime detection is not supported yet, and how RTCD works is not yet fully understood (FIXME). For more analysis, see https://github.com/aosp-riscv/libvpx/issues/8#issuecomment-1627896402.

Signed-off-by: Wang Chen <wangchen20@iscas.ac.cn>
Co-authored-by: sun min <sunmin89@outlook.com>
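As background, runtime CPU detection is commonly built on function pointers that are assigned once at startup. The sketch below shows that general pattern only; every name in it (`sum_fn`, `sum_c`, `sum_fast`, `CAP_FAST`, `setup_rtcd_sketch`) is hypothetical and none of it is taken from the libvpx sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical function type shared by all implementations. */
typedef int (*sum_fn)(const int *v, size_t n);

/* Plain C reference implementation. */
static int sum_c(const int *v, size_t n) {
  int total = 0;
  for (size_t i = 0; i < n; ++i) total += v[i];
  return total;
}

/* Stand-in for an optimized variant (e.g. a SIMD version); it must
 * return the same result as the reference. */
static int sum_fast(const int *v, size_t n) {
  int total = 0;
  size_t i = 0;
  for (; i + 1 < n; i += 2) total += v[i] + v[i + 1];
  if (i < n) total += v[i];
  return total;
}

/* Hypothetical capability bit reported by a CPU probe. */
enum { CAP_FAST = 1 };

/* Callers go through this pointer; it defaults to the C version. */
static sum_fn sum_ptr = sum_c;

/* Chooses an implementation once, based on the detected capabilities. */
static void setup_rtcd_sketch(int caps) {
  assert(caps >= 0);
  sum_ptr = sum_c;
  if (caps & CAP_FAST) sum_ptr = sum_fast;
}
```

The point of the indirection is that callers never name a specific variant, so adding an optimized version only requires changing the setup routine.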
2023-07-10  enable rvv  Wang Chen
Test:
    CROSS=riscv64-unknown-linux-gnu- ../libvpx/configure --target=riscv64-linux-gcc

Check console output:
    ......
    enabling rvv
    ......

Check mk files' content:
    $ less libs-riscv64-linux-gcc.mk | grep RVV
    HAVE_RVV=yes

Signed-off-by: Wang Chen <wangchen20@iscas.ac.cn>
Co-authored-by: sun min <sunmin89@outlook.com>
2023-07-08  libvpx[riscv]: Add riscv64 support.  sun min
With this change, configure can be run with "--target=riscv64-linux-gcc". There are no other changes, so it is currently equivalent to "--target=generic-gnu".

Signed-off-by: sun min <sunmin89@outlook.com>
2023-06-12  Merge "Fix c vs intrinsic mismatch of vpx_hadamard_32x32() function" into main  Yunqing Wang
2023-06-09  RTC RC: clean up unnecessary headers  Jerome Jiang
Change-Id: I77c407be59f4eb0c70a89a5fffd88c648e634123
2023-06-09  Fix c vs intrinsic mismatch of vpx_hadamard_32x32() function  Anupam Pandey
This CL resolves the mismatch between C and intrinsic implementation of vpx_hadamard_32x32 function. The mismatch was due to integer overflow during the addition operation in the intrinsic functions. Specifically, the addition in the intrinsic function was performed at the 16-bit level, while the calculation of a0 + a1 resulted in a 17-bit value. This code change addresses the problem by performing the addition at the 32-bit level (with sign extension) in both SSE2 and AVX2, and then converting the results back to the 16-bit level after a right shift. STATS_CHANGED Change-Id: I576ca64e3b9ebb31d143fcd2da64322790bc5853
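The overflow described above can be illustrated with scalar C. This is a sketch under stated assumptions, not the SSE2/AVX2 code from the CL: the function names are hypothetical and the shift amount of 1 is chosen only for illustration, since the message does not state it. The first function models a 16-bit lane whose sum wraps modulo 2^16; the second widens to 32 bits before adding.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed shift amount, for illustration only. */
#define SKETCH_SHIFT 1

/* Models addition in a 16-bit lane: the 17-bit sum wraps modulo 2^16
 * before the shift is applied, so large inputs give a wrong result. */
static int32_t add_shift_16bit_lane(int16_t a0, int16_t a1) {
  uint16_t wrapped = (uint16_t)((uint16_t)a0 + (uint16_t)a1);
  int32_t as_signed =
      wrapped >= 0x8000u ? (int32_t)wrapped - 0x10000 : (int32_t)wrapped;
  return as_signed >> SKETCH_SHIFT;
}

/* Widens both operands to 32 bits first, so the sum cannot overflow;
 * the shifted result fits in 16 bits again. */
static int32_t add_shift_32bit(int16_t a0, int16_t a1) {
  int32_t sum = (int32_t)a0 + (int32_t)a1;
  assert(sum >= -65536 && sum <= 65534);
  return sum >> SKETCH_SHIFT;
}
```

For small inputs the two functions agree, which is why a mismatch of this kind only shows up on extreme input values.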
2023-06-08  Replace NONE with NO_REF_FRAME  Jerome Jiang
NONE is a common name and it has conflicts with symbols defined in Chromium. Bug: b/286163500 Change-Id: I3d935a786f771a4d90b258fabc6fd6c2ecbf1c59
2023-06-08  Merge "Fix more typos (n/n)" into main  Jerome Jiang
2023-06-07  Merge "Fix more typos (3/n)" into main  Jerome Jiang
2023-06-07  Fix more typos (n/n)  Jerome Jiang
impace -> impact
taget -> target
prediciton -> prediction
addtion -> addition
the the -> the

Bug: webm:1803
Change-Id: I759c9d930a037ca69662164fcd6be160ed707d77
2023-06-07  Fix more typos (3/n)  Jerome Jiang
Propogation -> Propagation
propogate -> propagate
cant -> can't
upto -> up to
canddiates -> candidates
refernce -> reference
USEAGE -> USAGE

Change-Id: Iadaf2dffd86b54e04411910f667e8c2dfc6c4c77
2023-06-07  Merge "Fix more typos (2/n)" into main  Jerome Jiang
2023-06-07  Merge "Fix more typos (1/n)" into main  Jerome Jiang
2023-06-07  Merge "Fix a few typos" into main  Jerome Jiang
2023-06-07  Fix more typos (2/n)  Jerome Jiang
kernal -> kernel
e.g -> e.g.
paritioning -> partitioning
partioning -> partitioning
coefficents -> coefficients
i.e, -> i.e.,
equivalend -> equivalent
recive -> receive
resoultions -> resolutions

Bug: webm:1803
Change-Id: I1d6176202ee5daee7a64bf59114e8b304aeb4db7
2023-06-07  Fix more typos (1/n)  Jerome Jiang
Dont -> Don't
setings -> settings
thresold -> thresh
thresold -> threshold
becasue -> because
itterations -> iterations
its a -> it's a
an constant -> a constant

Bug: webm:1803
Change-Id: I1e019393939ed25c59c898c88d4941ec360b026d
2023-06-07  Fix a few typos  Jerome Jiang
segement -> segment
dont -> don't
useage -> usage
devide -> divide

Bug: webm:1803
Change-Id: I0153380b0003825c4b62cf323d4f2bc837c8a264
2023-06-06  Add comments in vp9_diamond_search_sad_avx()  Deepa K G
Added comments related to re-arranging the elements of the SAD vector to find the minimum. Change-Id: I58b702d304a6cdd32f04775fba603e39c19a8947
2023-06-05  Fix c vs avx mismatch of diamond_search_sad()  Deepa K G
In the function vp9_diamond_search_sad_avx(), arranged the cost vector in a specific order. This ensures that the motion vector with the least index is selected, when there exists more than one candidate motion vector with the minimum cost, thus resolving the c vs avx mismatch. STATS_CHANGED Change-Id: I4f8864f464f9ea2aae6250db3d8ad91cb08b26e2
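The tie-breaking rule described above, picking the lowest index among equal minimum costs, is what a plain sequential scan with a strict comparison produces. The sketch below shows that scalar behaviour; `argmin_first` is a hypothetical name, and the AVX code in the CL achieves the same ordering by other means.

```c
#include <assert.h>

/* Returns the index of the smallest cost. Because the comparison is
 * strict, a later candidate with an equal cost never replaces an
 * earlier one, so ties resolve to the lowest index. */
static int argmin_first(const unsigned int *cost, int n) {
  assert(n > 0);
  int best = 0;
  for (int i = 1; i < n; ++i) {
    if (cost[i] < cost[best]) best = i;
  }
  return best;
}
```

A vectorized minimum search that compares lanes in a different order can return a different index for tied costs, which is the kind of mismatch this commit addresses.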
2023-05-31  Merge "Trim tpl stats by 2 extra frames" into main  Jerome Jiang
2023-05-31  Trim tpl stats by 2 extra frames  Jerome Jiang
Not applicable to the last GOP. Bug: b/284162396 Change-Id: I55b7e04e9fc4b68a08ce3e00b10743823c828954
2023-05-31  Merge changes I6a906803,I0307a3b6 into main  James Zern
* changes:
  Optimize Neon implementation of vpx_int_pro_row
  Optimize Neon implementation of vpx_int_pro_col
2023-05-31  Optimize Neon implementation of vpx_int_pro_row  Jonathan Wright
Double the number of accumulator registers to remove the bottleneck. Also peel the first loop iteration. Change-Id: I6a90680369f9c33cdfe14ea547ac1569ec3f50de
2023-05-31  Optimize Neon implementation of vpx_int_pro_col  Jonathan Wright
Use widening pairwise addition instructions to halve the number of additions required. Change-Id: I0307a3b65e50d2b1ae582938bc5df9c2b21df734
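A scalar model may help show why widening pairwise addition halves the work. A widening pairwise add (on Neon, an instruction such as UADDLP) sums each adjacent pair of 8-bit elements into one 16-bit element, so the accumulation that follows handles half as many values. The function below is a hypothetical illustration in plain C, not the Neon code from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Sums one row of pixels in two stages:
 *   1. widening pairwise add: adjacent 8-bit pairs -> 16-bit sums
 *   2. accumulate the width/2 pair sums
 * The width limit keeps the total within 16 bits (64 * 255 = 16320). */
static uint16_t row_sum_pairwise(const uint8_t *row, int width) {
  assert(width > 0 && width % 2 == 0 && width <= 64);
  uint16_t pair_sums[32];
  for (int i = 0; i < width / 2; ++i) {
    pair_sums[i] = (uint16_t)(row[2 * i] + row[2 * i + 1]);
  }
  uint16_t total = 0;
  for (int i = 0; i < width / 2; ++i) {
    total = (uint16_t)(total + pair_sums[i]);
  }
  return total;
}
```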
2023-05-25  Merge changes Ia3647698,I55caf34e,Id2c60f39 into main  James Zern
* changes:
  vpx_dsp_common.h,clip_pixel: work around VS2022 Arm64 issue
  fdct_partial_neon.c: work around VS2022 Arm64 issue
  fdct8x8_test.cc: work around VS2022 Arm64 issue
2023-05-24  Merge "examples.mk,vpxdec: rm libwebm muxer dependency" into main  James Zern
2023-05-24  Merge "Add IO for TPL stats" into main  Jerome Jiang
2023-05-23  vpx_dsp_common.h,clip_pixel: work around VS2022 Arm64 issue  James Zern
cl.exe targeting AArch64 with optimizations enabled produces invalid code for clip_pixel() when the return type is uint8_t. See: https://developercommunity.visualstudio.com/t/Misoptimization-for-ARM64-in-VS-2022-17/10363361 Bug: b/277255076 Bug: webm:1788 Change-Id: Ia3647698effd34f1cf196cd33fa4a8cab9fa53d6
2023-05-23  fdct_partial_neon.c: work around VS2022 Arm64 issue  James Zern
cl.exe targeting AArch64 with optimizations enabled will fail with an internal compiler error. See: https://developercommunity.visualstudio.com/t/Compiler-crash-C1001-when-building-a-for/10346110 Bug: b/277255076 Bug: webm:1788 Change-Id: I55caf34e910dab47a7775f07280677cdfe606f5b
2023-05-23  fdct8x8_test.cc: work around VS2022 Arm64 issue  James Zern
cl.exe targeting AArch64 with optimizations enabled produces invalid code in RunExtremalCheck() and RunInvAccuracyCheck(). See: https://developercommunity.visualstudio.com/t/1770-preview-1:-Misoptimization-for-AR/10369786 Bug: b/277255076 Bug: webm:1788 Change-Id: Id2c60f3948d8f788c78602aea1b5232133415dea
2023-05-23  Add IO for TPL stats  Jerome Jiang
Overload TempOutFile constructor to allow IO mode. Bug: b/281563704 Change-Id: I1f4f5b29db0e331941b6795e478eeeab51f625ad
2023-05-18  Merge "Add new vpx_tpl.h API file" into main  Jerome Jiang
2023-05-18  Merge "Improve convolve AVX2 intrinsic for speed" into main  Yunqing Wang
2023-05-17  Add new vpx_tpl.h API file  Jerome Jiang
New file (vpx_tpl.c) in the following CLs will add new APIs dealing with TPL stats from VP9 encoder. Change-Id: I5102ef64214cba1ca6ecea9582a19049666c6ca4
2023-05-17  Improve convolve AVX2 intrinsic for speed  Anupam Pandey
This CL refactors the code related to the convolve functions. It also improves the AVX2 intrinsics for the vertical convolve in the w = 4 case and the horizontal convolve in the w = 16 case.

Module-level scaling w.r.t. the C function (timer based) for the existing and new AVX2 intrinsics:

    Block size    AVX2 (existing)    AVX2 (new)
    4x4           5.34x              5.91x
    4x8           7.10x              7.79x
    16x8          23.52x             25.63x
    16x16         29.47x             30.22x
    16x32         33.42x             33.44x

This is a bit-exact change.

Change-Id: If130183bc12faab9ca2bcec0ceeaa8d0af05e413
2023-05-16  Merge changes Ie77ad184,Idfcac43c into main  James Zern
* changes:
  Add 2D-specific Neon horizontal convolution functions
  Refactor standard bitdepth Neon convolution functions
2023-05-13  Add 2D-specific Neon horizontal convolution functions  Jonathan Wright
2D 8-tap convolution filtering is performed in two passes - horizontal and vertical. The horizontal pass must produce enough input data for the subsequent vertical pass - 3 rows above and 4 rows below, in addition to the actual block height.

At present, all Neon horizontal convolution algorithms process 4 rows at a time, but this means we end up doing at least 1 row too much work in the 2D first pass case where we need h + 7, not h + 8 rows of output.

This patch adds additional dot-product (SDOT and USDOT) Neon paths that process h + 7 rows of data exactly, saving the work of the unnecessary extra row. It is impractical to take a similar approach for the Armv8.0 MLA paths since we have to transpose the data block both before and after calling the convolution helper functions.

vpx_convolve_neon performance impact: we observe a speedup of ~9% for smaller (and wider) blocks, and a speedup of 0-3% for larger blocks. This is to be expected since the proportion of redundant work decreases as the block height increases.

Change-Id: Ie77ad1848707d2d48bb8851345a469aae9d097e1
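The row arithmetic in this message can be checked with a few lines of C. This is an illustrative sketch with hypothetical names; it assumes block heights that are multiples of 4, which is the case in which rounding h + 7 up to a multiple of 4 gives h + 8.

```c
#include <assert.h>

enum { SKETCH_TAPS = 8, SKETCH_ROWS_PER_ITER = 4 };

/* Rows the horizontal pass must produce for an 8-tap vertical pass:
 * 3 rows above + h rows + 4 rows below = h + 7. */
static int rows_needed(int h) {
  assert(h > 0);
  return h + SKETCH_TAPS - 1;
}

/* Rows actually computed when the loop always handles 4 rows at a time:
 * rows_needed rounded up to a multiple of 4. */
static int rows_computed_in_groups(int h) {
  int need = rows_needed(h);
  return (need + SKETCH_ROWS_PER_ITER - 1) / SKETCH_ROWS_PER_ITER *
         SKETCH_ROWS_PER_ITER;
}
```

For h = 4 this gives 11 rows needed against 12 computed, so 1 row in 12 (about 8%) is redundant; for h = 64 it is 1 row in 72 (about 1.4%). That is consistent with the message's remark that the proportion of redundant work shrinks as the block height grows.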
2023-05-12  Merge "Don't use -Wl,-z,defs with Clang's sanitizers" into main  James Zern
2023-05-12  Don't use -Wl,-z,defs with Clang's sanitizers  James Zern
This avoids link errors related to the sanitizers: https://clang.llvm.org/docs/AddressSanitizer.html#usage "When linking shared libraries, the AddressSanitizer run-time is not linked, so -Wl,-z,defs may cause link errors ..." See also: https://crbug.com/aomedia/3438 Bug: webm:1801 Fixed: webm:1801 Change-Id: Ie212318005a5f7222e5486775175534025306367
2023-05-12  Refactor standard bitdepth Neon convolution functions  Jonathan Wright
1) Use #define constant instead of magic numbers for right shifts.
2) Move saturating narrow into helper functions that return 4-element result vectors.
3) Use mem_neon.h helpers for load/store sequences in Armv8.0 paths.
4) Tidy up: assert conditions and some longer variable names.
5) Prefer != 0 to > 0 where possible for loop termination conditions.

Change-Id: Idfcac43ca38faf729dca07b8cc8f7f45ad264d24
2023-05-09  configure: add -Wshadow  James Zern
libraries under third_party/ are out of scope for this change. Bug: webm:1793 Change-Id: I562065a3c0ea9fdfc9615d1a6b1ae47da79b8ce0
2023-05-09  Merge "vp8_macros_msa.h: clear -Wshadow warnings" into main  James Zern
2023-05-09  Merge changes Iac020280,I8ca8660a into main  James Zern
* changes:
  gen_msvs_vcxproj: add ARM64EC w/VS >= 2022
  configure: add clang-cl vs1[67] arm64 targets
2023-05-09  Merge "Add AVX2 intrinsic for vpx_comp_avg_pred() function" into main  Yunqing Wang
2023-05-09  Add AVX2 intrinsic for vpx_comp_avg_pred() function  Anupam Pandey
The module level scaling w.r.t C function (timer based) for existing (SSE2) and new AVX2 intrinsics:

If ref_padding = 0
    Block size    SSE2     AVX2
    8x4           3.24x    3.24x
    8x8           4.22x    4.90x
    8x16          5.91x    5.93x
    16x8          1.63x    3.52x
    16x16         1.53x    4.19x
    16x32         1.38x    4.82x
    32x16         1.28x    3.08x
    32x32         1.45x    3.13x
    32x64         1.38x    3.04x
    64x32         1.39x    2.12x
    64x64         1.46x    2.24x

If ref_padding = 8
    Block size    SSE2     AVX2
    8x4           3.20x    3.21x
    8x8           4.61x    4.83x
    8x16          5.50x    6.45x
    16x8          1.56x    3.35x
    16x16         1.53x    4.19x
    16x32         1.37x    4.83x
    32x16         1.28x    3.07x
    32x32         1.46x    3.29x
    32x64         1.38x    3.22x
    64x32         1.38x    2.14x
    64x64         1.38x    2.12x

This is a bit-exact change.

Change-Id: I72c5d155f64d0c630bc8c3aef21dc8bbd045d9e6
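For context, a compound average prediction combines two predictions by taking the rounded mean of each pixel pair. The scalar sketch below shows that operation under stated assumptions: the name `comp_avg_pred_sketch` and its parameter layout are hypothetical, with `pred` and the output packed at `width` bytes per row and `ref` using its own stride. A bit-exact SIMD version has to reproduce this rounding for every pixel.

```c
#include <assert.h>
#include <stdint.h>

/* Writes the rounded average of pred and ref into comp_pred.
 * comp_pred and pred are packed (stride == width); ref has its own
 * stride, which may be larger than width. */
static void comp_avg_pred_sketch(uint8_t *comp_pred, const uint8_t *pred,
                                 int width, int height,
                                 const uint8_t *ref, int ref_stride) {
  assert(width > 0 && height > 0 && ref_stride >= width);
  for (int i = 0; i < height; ++i) {
    for (int j = 0; j < width; ++j) {
      /* (a + b + 1) >> 1 rounds halves up; the sum fits in an int. */
      comp_pred[j] = (uint8_t)((pred[j] + ref[j] + 1) >> 1);
    }
    comp_pred += width;
    pred += width;
    ref += ref_stride;
  }
}
```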
2023-05-08  vp8_macros_msa.h: clear -Wshadow warnings  James Zern
Bug: webm:1793 Change-Id: Ia940b06bd23a915a050432e03bb630567e891d8d
2023-05-08  Merge "README: update target list" into main  James Zern
2023-05-08  Merge changes Ie165d410,I6d9bb8da,I6858e574 into main  James Zern
* changes:
  vp8_[cd]x_iface: clear setjmp flag on function exit
  vp9_decodeframe,tile_worker_hook: relocate setjmp=1
  vp9,encoder_set_config: set setjmp flag after setjmp()