Age | Commit message | Author |
|
Test environment: 8c 1804MHz i5-1140G7
RVV Impl:
% CROSS=riscv64-unknown-linux-gnu- configure --target=riscv64-linux-gcc \
--enable-debug --enable-gprof && make -j
% time qemu-riscv64 -cpu rv64,v=true,zba=true,vlen=128 -L /path/to/sysroot/ \
./vpxenc --codec=vp8 -w 352 -h 288 -o akiyol.vpx ./akiyo_cif.yuv
Pass 1/1 frame 300/300 314977B 8399b/f 251981b/s 92226 ms (3.25 fps)
user 1m30.108s
% gprof -abp ./vpxenc ./gmon.out | grep vp8_copy_mem
1.36 53.09 1.04 1025863 0.00 0.00 vp8_copy_mem16x16_rvv
0.72 59.01 0.55 1641368 0.00 0.00 vp8_copy_mem8x8_rvv
0.05 65.95 0.04 764377 0.00 0.00 vp8_copy_mem8x4_rvv
C Impl:
% CROSS=riscv64-unknown-linux-gnu- configure --target=generic-gnu --enable-debug \
--enable-gprof && make -j
% time qemu-riscv64 -cpu rv64,v=true,zba=true,vlen=128 -L /path/to/sysroot/ \
./vpxenc --codec=vp8 -w 352 -h 288 -o akiyol.vpx ./akiyo_cif.yuv
Pass 1/1 frame 300/300 314977B 8399b/f 251981b/s 98417 ms (3.05 fps)
user 1m36.146s
% gprof -abp ./vpxenc ./gmon.out | grep vp8_copy_mem
0.38 63.96 0.31 vp8_copy_mem8x4_c
0.04 70.61 0.03 204336 0.00 0.00 vp8_copy_mem16x16_c
Signed-off-by: Yuuta Liang <yuuta@yuuta.moe>
|
|
This uses vp8_sixtap_predict only as an example; the functions are
not actually implemented yet.
Test:
$ CROSS=riscv64-unknown-linux-gnu- ../libvpx/configure --target=riscv64-linux-gcc
$ make
Check if vp8_sixtap_predict functions have been replaced with those
suffixed with "_rvv":
$ riscv64-unknown-linux-gnu-nm ./vp8/decoder/decodeframe.c.o | grep vp8_sixtap_predict16x16
U vp8_sixtap_predict16x16_rvv
Check whether the vp8_sixtap_predictMxN_rvv functions work:
$ qemu-riscv64 -L $SYSROOT_RV64 ./build-test/test_libvpx --gtest_filter="RVV/SixtapPredictTest.TestWithPresetData/*"
You should see log output such as: "--> vp8_sixtap_predict4x4_rvv"
"FAILED" is expected because the actual algorithm has not been
implemented yet.
Signed-off-by: Wang Chen <wangchen20@iscas.ac.cn>
Co-authored-by: sun min <sunmin89@outlook.com>
|
|
Shift the final read from the source by 3 to avoid breaking the
assumption that the 6-tap filter needs only 5 pixels outside of the
macroblock; this matches the sse2 and ssse3 implementations.
It's possible this restriction could be removed if the source buffers
are assumed to be padded.
Bug: webm:1795
Change-Id: I4c791e3a214898a503c78f4cedca154c75cdbaef
Fixed: webm:1795
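The read pattern described above can be modelled in portable C. This is only a sketch under assumptions, not the NEON code from the change: it assumes a 16-pixel-wide row, 8-byte loads, and a 6-tap filter that needs 2 pixels to the left and 3 to the right of the row (5 outside in total). The helper name load_row_16_sixtap is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: gather the 21 source bytes a 16-wide, 6-tap
 * horizontal filter needs (src[-2] .. src[18]) with three 8-byte loads,
 * never reading past src[18]. */
static void load_row_16_sixtap(const uint8_t *src, uint8_t out[21]) {
  uint8_t a[8], b[8], c[8];

  memcpy(a, src - 2, 8); /* bytes -2 .. 5 */
  memcpy(b, src + 6, 8); /* bytes  6 .. 13 */
  /* A third load at src + 14 would cover 14 .. 21, which is 3 bytes
   * past the last pixel the filter needs. Shifting the load back by 3
   * keeps the final byte read at src[18]. */
  memcpy(c, src + 11, 8); /* bytes 11 .. 18 */

  memcpy(out, a, 8);          /* out[0..7]   = src[-2..5]  */
  memcpy(out + 8, b, 8);      /* out[8..15]  = src[6..13]  */
  memcpy(out + 16, c + 3, 5); /* out[16..20] = src[14..18] */
}
```

The first three bytes of the shifted load overlap the second load and are simply discarded, so the cost is one slightly redundant read rather than an out-of-bounds one.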
|
|
This causes various buffer overflows in the tests:
[ RUN ] NEON/SixtapPredictTest.TestWithPresetData/0
=================================================================
==22346==ERROR: AddressSanitizer: global-buffer-overflow on address
0x0000012b4a5b at pc 0x000000df0f60 bp 0xffffcf6e64b0 sp 0xffffcf6e64a8
READ of size 8 at 0x0000012b4a5b thread T0
#0 0xdf0f5c in vp8_sixtap_predict16x16_neon
vp8/common/arm/neon/sixtappredict_neon.c:1507:13
#1 0x8819e4 in (anonymous
namespace)::SixtapPredictTest_TestWithPresetData_Test::TestBody()
test/predict_test.cc:293:3
...
0x0000012b4a5b is located 2 bytes to the right of global variable
'kTestData' defined in '../test/predict_test.cc:237:24' (0x12b48a0) of
size 441
[ RUN ] NEON/SixtapPredictTest.TestWithRandomData/0
=================================================================
==22338==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff8b5321fb at pc 0x000000df0f60 bp 0xfffff7e0cf30 sp 0xfffff7e0cf28
READ of size 8 at 0xffff8b5321fb thread T0
#0 0xdf0f5c in vp8_sixtap_predict16x16_neon
vp8/common/arm/neon/sixtappredict_neon.c:1507:13
#1 0x87d4c0 in (anonymous
namespace)::PredictTestBase::TestWithRandomData(void (*)(unsigned
char*, int, int, int, unsigned char*, int))
test/predict_test.cc:170:9
...
0xffff8b5321fb is located 2 bytes to the right of 441-byte region
[0xffff8b532040,0xffff8b5321f9)
allocated by thread T0 here:
#0 0x5fd4f0 in operator new[](unsigned long) (test_libvpx+0x5fd4f0)
#1 0x87c2e0 in (anonymous namespace)::PredictTestBase::SetUp()
test/predict_test.cc:47:12
#2 0x87d074 in non-virtual thunk to (anonymous
namespace)::PredictTestBase::SetUp() test/predict_test.cc
...
Bug: webm:1795
Change-Id: I32213a381eef91547d00f88acf90f1cf2ec2ea75
|
|
1. vp8_sixtap_predict4x4
Bug: webm:1755
Change-Id: If7d844496ef2cfe2252f2ef12bb7cded63ad03dd
|
|
1. vp8_short_fdct8x4_lsx
2. vp8_diamond_search_sad_lsx
3. vpx_sad8x8_lsx
Bug: webm:1755
Change-Id: Ic9df84ead2d4fc07ec58e9730d6a12ac2b2d31c1
|
|
1. vp8_short_fdct4x4
2. vp8_regular_quantize_b
3. vp8_block_error
4. vp8_mbblock_error
5. vpx_subtract_block
Bug: webm:1755
Change-Id: I3dbfc7e3937af74090fc53fb4c9664e6cdda29ef
|
|
1. vp8_dequant_idct_add_uv_block_lsx
2. vp8_dequant_idct_add_y_block_lsx
Bug: webm:1755
Change-Id: I1f006daaefb2075b422bc72a3f69c5abee776e2e
|
|
These would compute the sum of absolute differences (sad) for a
group of 3 or 8 references. This was used as part of an exhaustive
search.
vp8 only uses these functions in speed 0 and best quality.
For vp9 this is only used with the --enable-non-greedy-mv
experiment.
This removes the 3- and 8-at-a-time optimized functions and uses
the fallback code, which processes 1 or 4 (vpx_sadMxNx4d)
references at a time.
For configure --target=x86_64-linux-gcc --enable-realtime-only:
libvpx.a
before: 3002424 after: 2937622 delta: 64802
after 'strip libvpx.a'
before: 2116998 after: 2073090 delta: 43908
Change-Id: I566d06e027c327b3bede68649dd551bba81a848e
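For readers unfamiliar with the terms, the following is a minimal scalar sketch of a SAD and of a 4-at-a-time variant in the spirit of vpx_sadMxNx4d. The names sad_block and sad_block_x4d and their signatures are hypothetical, chosen for illustration; they are not the libvpx prototypes.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between a w x h source block and one
 * reference block. */
static unsigned int sad_block(const uint8_t *src, int src_stride,
                              const uint8_t *ref, int ref_stride,
                              int w, int h) {
  unsigned int sad = 0;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      sad += (unsigned int)abs(src[x] - ref[x]);
    }
    src += src_stride;
    ref += ref_stride;
  }
  return sad;
}

/* Four references per call: one SAD per candidate position. An
 * optimized version would share the source loads across candidates;
 * this sketch just loops. */
static void sad_block_x4d(const uint8_t *src, int src_stride,
                          const uint8_t *const ref[4], int ref_stride,
                          int w, int h, unsigned int sad[4]) {
  for (int i = 0; i < 4; ++i) {
    sad[i] = sad_block(src, src_stride, ref[i], ref_stride, w, h);
  }
}
```

A motion search calls the 4-at-a-time form with four candidate reference positions and keeps the candidate with the smallest SAD.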
|
|
1. vp8_dc_only_idct_add_lsx
2. vp8_loop_filter_bh_lsx
3. vp8_loop_filter_bv_lsx
Bug: webm:1755
Change-Id: I9b629767e2a4e9db8cbb3ee2369186502dc6eb00
|
|
1. vp8_loop_filter_mbh, vp8_loop_filter_mbv
2. vp8_sixtap_predict16x16, vp8_sixtap_predict8x8
3. vpx_dc_predictor_16x16, vpx_dc_predictor_8x8
./vpxdec --progress -o YUV_1920X1080.yuv original_1200f/VP8_1920X1080.webm
before: 37.77fps
after : 220.90fps
Bug: webm:1755
Change-Id: I1a3ce16f0c872261d813b6531cfdf25bd59bb774
|
|
BUG=webm:1584
Change-Id: I8279e099fb9595edad858bf7332bf2b40fecae02
|
|
BUG=webm:1444
Change-Id: I57a305cdab0d62b0745116272fbd5d9257c6e679
|
|
Match function definitions to declarations
BUG=webm:1444
Change-Id: Ib96d3b735eaf81cece5406c89cc5156bc2cde462
|
|
~20% faster than the MMX version. Removes the last usage of
vp8_bilinear_filters_x86_[48].
Change-Id: Iee976fab9655d0020440f26c4403ce50103af913
|
|
8x8 is 15% faster than the assembly. 8x4 is 200% faster than MMX.
Remove MMX version.
Change-Id: I55642ebd276db265911f2c79616177a3a9a7e04f
|
|
Allows them to pass the license check in chromium.
BUG=chromium:98319
Change-Id: Iefc1706152a549d8c4ae774c917596bf1c9492d8
|
|
1. vp8_dequant_idct_add_y_block_mmi
2. vp8_dequant_idct_add_uv_block_mmi
Change-Id: I9987147be2685ac79d4b045d1d56f6709ee1223c
|
|
1. vp8_short_fdct4x4_mmi
2. vp8_short_fdct8x4_mmi
3. vp8_short_walsh4x4_mmi
Change-Id: I89a7df25cfd09fae309fac257ad8b6a3dc1c8acb
|
|
1. vp8_fast_quantize_b_mmi
2. vp8_regular_quantize_b_mmi
Change-Id: Ic6e21593075f92c1004acd67184602d2aa5d5646
|
|
1. vp8_copy_mem16x16_mmi
2. vp8_copy_mem8x8_mmi
3. vp8_copy_mem8x4_mmi
Change-Id: I3de29a11fa7402df0e48bbb944440b1e66498a65
|
|
1. vp8_dequantize_b_mmi
2. vp8_dequant_idct_add_mmi
Change-Id: I505f8afb7a444173392b325906e6a4f420f00709
|
|
1. vp8_short_idct4x4llm_mmi
2. vp8_short_inv_walsh4x4_mmi
3. vp8_dc_only_idct_add_mmi
Change-Id: I616923681e79d78607a4988608fc39df77b093f4
|
|
1. vp8_loop_filter_horizontal_edge_mmi
2. vp8_loop_filter_vertical_edge_mmi
3. vp8_mbloop_filter_horizontal_edge_mmi
4. vp8_mbloop_filter_vertical_edge_mmi
5. vp8_loop_filter_simple_horizontal_edge_mmi
6. vp8_loop_filter_simple_vertical_edge_mmi
Change-Id: Ie34bbff3a16cff64e39a50798afd2b7dac9bcdc3
|
|
1. vp8_sixtap_predict16x16_mmi
2. vp8_sixtap_predict8x8_mmi
3. vp8_sixtap_predict8x4_mmi
4. vp8_sixtap_predict4x4_mmi
Change-Id: I186669d1a1d998a0f3ba3a548e25eee8b52c251b
|
|
This uses the same sdx4df pointers as vp8_diamond_search_sadx4 and
should therefore target the same optimizations.
See e4ddf9db6a37eee59c079f5ae427643ae3424fcf
Change-Id: Ic298e9b25c34bbe6b7a0799509355b0addb56675
|
|
When they have sse2 equivalents.
Change-Id: I158f631a3bcecba57b36093ac10114b1904767a7
|
|
Avoid the extra level of indirection/confusion.
Change-Id: I0555f639d67835df9fb7dac0c75085e9954805f1
|
|
Use vpx_clear_system_state instead.
Change-Id: Ia3e9122f69a2c690ddd7c7bc54f92ccb9ec18b3e
|
|
Remove lines which specify the same name for a function.
Change-Id: I956bd8ce2b81a2a8feab5621d28bd2499c2b4c2d
|
|
The original commit never set any 'specialize' line:
61311e61039c300ae872ccba22304e9e60dc0205
It appears the sadx4 version of the function uses sdx4df calls to speed up
the search. There are no sse3 versions of the sdx4df functions, but
there are sse2 and msa versions.
There is a neon version of vpx_sad16x16x4d but not any of the smaller
versions. Perhaps if they existed this function could be expanded to use
them.
Change-Id: I936d7d6b1a3ff6dcd5a4d2322272708c47cdec13
|
|
This restores d9dce2f48eed1368a44c368fa87a506bd89ffec5
Switched to using signed shift-and-narrow. Instead of saturating
negative results to 0, it was saturating them to 255.
BUG=webm:817
BUG=webm:1273
Change-Id: I571095336aa4182e3288b17924fcaaece42b0a49
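The difference between the two saturation behaviours can be modelled in scalar C. This is a sketch under assumptions: the rounding shift of 7 and the function names narrow_signed and narrow_unsigned are chosen for illustration and do not come from the change itself.

```c
#include <stdint.h>

enum { kFilterShift = 7 }; /* assumed rounding shift for the filter sum */

/* Signed input, unsigned saturating narrow: negative results clamp
 * to 0, large results clamp to 255. */
static uint8_t narrow_signed(int16_t sum) {
  int32_t v = (int32_t)sum + (1 << (kFilterShift - 1));
  if (v < 0) return 0;
  v >>= kFilterShift;
  return v > 255 ? 255 : (uint8_t)v;
}

/* Same bits treated as an unsigned value: a negative sum looks like a
 * very large number and therefore saturates to 255. */
static uint8_t narrow_unsigned(int16_t sum) {
  uint32_t v = (uint32_t)(uint16_t)sum + (1u << (kFilterShift - 1));
  v >>= kFilterShift;
  return v > 255 ? 255 : (uint8_t)v;
}
```

Both functions agree for non-negative sums; they differ only when the filter sum goes negative, which is where the 0 versus 255 discrepancy shows up.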
|
|
|
|
This reverts commit d9dce2f48eed1368a44c368fa87a506bd89ffec5.
Appears to be failing the SixtapPredict tests in some configurations and possibly test vectors as well.
Change-Id: Ica6aa83ebac47d0a76e451846e7da67b1c17a7d7
|
|
This function was removed when clang started introducing alignment hints
which caused the 32 bit vld1_lane_u32/vst1_lane_u32 to fail:
https://llvm.org/bugs/show_bug.cgi?id=24421
The load has been made safe with an implementation that is nearly
indistinguishable in performance, uses _u8 loads, and over-reads slightly.
It is still ~5x faster than C in the unaligned case and doing both
filters.
BUG=webm:892
BUG=webm:1273
Change-Id: Icf7167189391b46202f47233bb585c24c42bcc36
|
|
This function was removed when clang started introducing alignment hints
which caused the 32 bit vld1_lane_u32/vst1_lane_u32 to fail:
https://llvm.org/bugs/show_bug.cgi?id=24421
The load has been made safe with an implementation that is nearly
indistinguishable in performance, uses _u8 loads, and over-reads slightly.
The store, when unaligned, has a version that is ~25% slower but safe
when xoffset = 0 (second pass filter only). When the first pass filter
(or both) are in play, the new version is almost identical in speed.
Worst case performance (both filters, unaligned stores) is roughly 3-4x
faster than C.
BUG=webm:817
BUG=webm:1273
Change-Id: I1e490e94453e0872151fe0dafb05557463f6247d
|
|
Change-Id: I1fa81cc9cabf362a185fc3a53f1e58de533a41e5
|
|
The deblocking filters used in vp8 have been moved to vpx_dsp for
use by both vp8 and vp9.
Change-Id: I5209d76edafc894b550f751fc76d3aa6799b392d
|
|
These implementations rely on casting the pointers to load the data.
Clang implemented optimizations which automatically add alignment hints
to such loads. The 4x4 filters do not guarantee the necessary alignment
so the resulting assembly is broken.
https://llvm.org/bugs/show_bug.cgi?id=24421
BUG=webm:817
BUG=webm:892
Change-Id: I608885299f1f86ff83653b65e0e40d0ae87fb3fe
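As a generic illustration of the hazard, and not the fix used in libvpx, a 4-byte load or store through a cast pointer lets the compiler assume 4-byte alignment, while memcpy makes no such promise. The helper names below are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Load 4 bytes from a possibly unaligned address. Unlike
 * *(const uint32_t *)p, memcpy does not tell the compiler that p is
 * 4-byte aligned, so no alignment hint can be attached to the load. */
static uint32_t load_u32_unaligned(const uint8_t *p) {
  uint32_t v;
  memcpy(&v, p, sizeof(v));
  return v;
}

/* Store 4 bytes to a possibly unaligned address. */
static void store_u32_unaligned(uint8_t *p, uint32_t v) {
  memcpy(p, &v, sizeof(v));
}
```

Compilers typically lower these fixed-size memcpy calls to a single load or store instruction, so the safe form usually costs nothing on targets that support unaligned access.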
|
|
Change-Id: I12218d8331c0558c0587a66321e3ca46da7e5cc7
|
|
I've added a few new functions (d45e, d63e, he, ve) to cover the
filtered h/v 4x4 predictors that are vp8-specific, the "correct"
d45 with the correctly filtered bottom-right pixel (as opposed to
the unfiltered version in vp9), and the "broken" d63 with weirdly
filtered bottom-right pixels (which is correctly filtered in vp9).
There may be a minor performance impact on all systems because we
have to do an extra copy of the Above pixel array to incorporate
the topleft pixel in the same array (thus fitting the vpx_dsp API).
In addition, armv6 will have a more serious performance impact because
I removed the armv6/vp8-specific assembly. I'm not sure anyone
cares...
Change-Id: I7f9e5ebee11d8e21aca2cd517a69eefc181b2e86
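The extra copy mentioned above can be sketched as follows. This assumes the predictor reads the top-left pixel at above[-1] and up to 8 pixels of the above row for a 4x4 block; the helper name pack_above and the buffer layout are hypothetical, not the code from the change.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: place the top-left pixel immediately before the
 * above row so a predictor can read it as above[-1].
 * buf must hold 9 bytes; the returned pointer is buf + 1. */
static const uint8_t *pack_above(uint8_t buf[9], uint8_t top_left,
                                 const uint8_t above[8]) {
  buf[0] = top_left;         /* becomes above[-1] */
  memcpy(buf + 1, above, 8); /* becomes above[0..7] */
  return buf + 1;
}
```

The copy is only 9 bytes per 4x4 block, which is consistent with the impact being described as minor.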
|
|
Change-Id: I936c2430c3c5b1e0ab5dec0a20110525e925b5e4
|
|
Change-Id: I2000820e0c04de2c975d370a0cf7145330289bb2
|
|
Change-Id: Ic61f30af12d1b01c1d5adc4e08bc20e20ad38027
|
|
average improvement ~2x-3x
Change-Id: I6c17012c731fa4d56e0343f8de0df47b2dde289b
|
|
average improvement ~2x-3x
Change-Id: I05593bed583234dc7809aaec6cab82773a29505d
|
|
average improvement ~2x-3x
Change-Id: I30abf4c92cddcc9e87b7a40d4106076e1ec701c2
|
|
average improvement ~2x-3x
Change-Id: I6fc37191bf9cb5a67e1af9787d0d27659c17bdba
|
|
average improvement ~2x-4x
Change-Id: Id0bc600440f7ef53348f585ebadb1ac6869e9a00
|
|
average improvement ~2x-4x
Change-Id: I93abc15389649c169bb8b69127c0b95407d34692
|