|
Test environment: 8c 1804MHz i5-1140G7
RVV Impl:
% CROSS=riscv64-unknown-linux-gnu- configure --target=riscv64-linux-gcc \
--enable-debug --enable-gprof && make -j
% time qemu-riscv64 -cpu rv64,v=true,zba=true,vlen=128 -L /path/to/sysroot/ \
./vpxenc --codec=vp8 -w 352 -h 288 -o akiyol.vpx ./akiyo_cif.yuv
Pass 1/1 frame 300/300 314977B 8399b/f 251981b/s 92226 ms (3.25 fps)
user 1m30.108s
% gprof -abp ./vpxenc ./gmon.out | grep vp8_copy_mem
1.36 53.09 1.04 1025863 0.00 0.00 vp8_copy_mem16x16_rvv
0.72 59.01 0.55 1641368 0.00 0.00 vp8_copy_mem8x8_rvv
0.05 65.95 0.04 764377 0.00 0.00 vp8_copy_mem8x4_rvv
C Impl:
% CROSS=riscv64-unknown-linux-gnu- configure --target=generic-gnu --enable-debug \
--enable-gprof && make -j
% time qemu-riscv64 -cpu rv64,v=true,zba=true,vlen=128 -L /path/to/sysroot/ \
./vpxenc --codec=vp8 -w 352 -h 288 -o akiyol.vpx ./akiyo_cif.yuv
Pass 1/1 frame 300/300 314977B 8399b/f 251981b/s 98417 ms (3.05 fps)
user 1m36.146s
% gprof -abp ./vpxenc ./gmon.out | grep vp8_copy_mem
0.38 63.96 0.31 vp8_copy_mem8x4_c
0.04 70.61 0.03 204336 0.00 0.00 vp8_copy_mem16x16_c
Signed-off-by: Yuuta Liang <yuuta@yuuta.moe>
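
For reference, the operation that the vp8_copy_mem* kernels profiled above
perform can be sketched in scalar C as a strided block copy. This is a
minimal sketch, not the libvpx source: the helper name copy_block and its
w/h parameters are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not the libvpx source): copies a w x h block of
 * bytes one row at a time, honouring separate source and destination
 * strides. */
static void copy_block(const unsigned char *src, int src_stride,
                       unsigned char *dst, int dst_stride, int w, int h) {
  int r;
  assert(w > 0 && h > 0 && src_stride >= w && dst_stride >= w);
  for (r = 0; r < h; ++r) {
    memcpy(dst, src, (size_t)w); /* one row of w bytes */
    src += src_stride;           /* advance to the next source row */
    dst += dst_stride;           /* advance to the next destination row */
  }
}
```

A 16x16 copy would be copy_block(src, src_stride, dst, dst_stride, 16, 16);
a vector version would presumably replace the per-row memcpy with vector
loads and stores.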
|
|
Use vp8_sixtap_predict as the example; the actual algorithm is
not implemented yet.
Test:
$ CROSS=riscv64-unknown-linux-gnu- ../libvpx/configure --target=riscv64-linux-gcc
$ make
Check if vp8_sixtap_predict functions have been replaced with those
suffixed with "_rvv":
$ riscv64-unknown-linux-gnu-nm ./vp8/decoder/decodeframe.c.o | grep vp8_sixtap_predict16x16
U vp8_sixtap_predict16x16_rvv
Check whether the vp8_sixtap_predictMxN_rvv functions work:
$ qemu-riscv64 -L $SYSROOT_RV64 ./build-test/test_libvpx --gtest_filter="RVV/SixtapPredictTest.TestWithPresetData/*"
The log output should contain lines such as: "--> vp8_sixtap_predict4x4_rvv"
"FAILED" is expected because the actual algorithm has not been
implemented yet.
Signed-off-by: Wang Chen <wangchen20@iscas.ac.cn>
Co-authored-by: sun min <sunmin89@outlook.com>
|
|
Add only the RTCD-related code needed to set up the framework.
Actual runtime detection is not supported yet, and I do not yet
fully understand how RTCD works (FIXME). For more analysis, please
refer to https://github.com/aosp-riscv/libvpx/issues/8#issuecomment-1627896402.
Signed-off-by: Wang Chen <wangchen20@iscas.ac.cn>
Co-authored-by: sun min <sunmin89@outlook.com>
|
|
Test: CROSS=riscv64-unknown-linux-gnu- ../libvpx/configure --target=riscv64-linux-gcc
Check console output:
......
enabling rvv
......
Check mk files' content:
$ less libs-riscv64-linux-gcc.mk | grep RVV
HAVE_RVV=yes
Signed-off-by: Wang Chen <wangchen20@iscas.ac.cn>
Co-authored-by: sun min <sunmin89@outlook.com>
|
|
With this change, we can run configure with
"--target=riscv64-linux-gcc". There are no other changes, so it
is currently equivalent to "--target=generic-gnu".
Signed-off-by: sun min <sunmin89@outlook.com>
|
|
|
|
Change-Id: I77c407be59f4eb0c70a89a5fffd88c648e634123
|
|
This CL resolves the mismatch between C and intrinsic implementation
of vpx_hadamard_32x32 function. The mismatch was due to integer
overflow during the addition operation in the intrinsic functions.
Specifically, the addition in the intrinsic function was performed
at the 16-bit level, while the calculation of a0 + a1 resulted in
a 17-bit value.
This code change addresses the problem by performing
the addition at the 32-bit level (with sign extension) in both SSE2
and AVX2, and then converting the results back to the 16-bit level
after a right shift.
STATS_CHANGED
Change-Id: I576ca64e3b9ebb31d143fcd2da64322790bc5853
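
The overflow described in this CL can be reproduced in scalar C. This is a
minimal sketch under stated assumptions: the function names are
hypothetical, the shift amount of 2 is chosen for illustration (the message
only says "a right shift"), and the narrowing cast and the shift of a
negative value rely on the usual two's-complement behaviour of common
compilers.

```c
#include <assert.h>
#include <stdint.h>

#define SHIFT 2 /* illustrative shift amount, an assumption */

/* Sum kept in 16 bits: a0 + a1 wraps when the true result needs 17 bits.
 * Going through uint16_t keeps the wraparound free of undefined
 * behaviour. */
static int16_t add_shift_16bit(int16_t a0, int16_t a1) {
  const int16_t sum = (int16_t)(uint16_t)((uint16_t)a0 + (uint16_t)a1);
  return (int16_t)(sum >> SHIFT);
}

/* Sum widened to 32 bits (sign extension), shifted, then narrowed. After
 * a shift of at least 1 the result always fits in 16 bits again. */
static int16_t add_shift_32bit(int16_t a0, int16_t a1) {
  const int32_t sum = (int32_t)a0 + (int32_t)a1;
  return (int16_t)(sum >> SHIFT);
}
```

With a0 = a1 = 20000 the true sum is 40000, which does not fit in a signed
16-bit lane: the 16-bit path wraps to a negative value, while the 32-bit
path yields 10000.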
|
|
NONE is a common name and it has conflicts with symbols defined in
Chromium.
Bug: b/286163500
Change-Id: I3d935a786f771a4d90b258fabc6fd6c2ecbf1c59
|
|
|
|
|
|
impace -> impact
taget -> target
prediciton -> prediction
addtion -> addition
the the -> the
Bug: webm:1803
Change-Id: I759c9d930a037ca69662164fcd6be160ed707d77
|
|
Propogation -> Propagation
propogate -> propagate
cant -> can't
upto -> up to
canddiates -> candidates
refernce -> reference
USEAGE -> USAGE
Change-Id: Iadaf2dffd86b54e04411910f667e8c2dfc6c4c77
|
|
|
|
|
|
|
|
kernal -> kernel
e.g -> e.g.
paritioning -> partitioning
partioning -> partitioning
coefficents -> coefficients
i.e, -> i.e.,
equivalend -> equivalent
recive -> receive
resoultions -> resolutions
Bug: webm:1803
Change-Id: I1d6176202ee5daee7a64bf59114e8b304aeb4db7
|
|
Dont -> Don't
setings -> settings
thresold -> thresh
thresold -> threshold
becasue -> because
itterations -> iterations
its a -> it's a
an constant -> a constant
Bug: webm:1803
Change-Id: I1e019393939ed25c59c898c88d4941ec360b026d
|
|
segement -> segment
dont -> don't
useage -> usage
devide -> divide
Bug: webm:1803
Change-Id: I0153380b0003825c4b62cf323d4f2bc837c8a264
|
|
Added comments related to re-arranging the
elements of the SAD vector to find the
minimum.
Change-Id: I58b702d304a6cdd32f04775fba603e39c19a8947
|
|
In the function vp9_diamond_search_sad_avx(), arrange the cost
vector in a specific order. This ensures that the motion vector
with the lowest index is selected when more than one candidate
motion vector has the minimum cost, thus resolving the
C vs. AVX mismatch.
STATS_CHANGED
Change-Id: I4f8864f464f9ea2aae6250db3d8ad91cb08b26e2
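
The tie-breaking rule that the vector code has to match can be sketched in
scalar C as follows. This is an illustration only: min_cost_index is a
hypothetical helper, not a libvpx function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the smallest cost. On ties the lowest index wins,
 * because only a strictly smaller cost replaces the current best. */
static size_t min_cost_index(const uint32_t *cost, size_t n) {
  size_t best = 0;
  size_t i;
  assert(n > 0);
  for (i = 1; i < n; ++i) {
    if (cost[i] < cost[best]) best = i;
  }
  return best;
}
```

For costs {9, 4, 7, 4} the function returns index 1, not 3. A horizontal
minimum over vector lanes has to reproduce this ordering to stay bit-exact
with a scalar loop of this shape.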
|
|
|
|
Not applicable to the last GOP.
Bug: b/284162396
Change-Id: I55b7e04e9fc4b68a08ce3e00b10743823c828954
|
|
* changes:
Optimize Neon implementation of vpx_int_pro_row
Optimize Neon implementation of vpx_int_pro_col
|
|
Double the number of accumulator registers to remove the bottleneck.
Also peel the first loop iteration.
Change-Id: I6a90680369f9c33cdfe14ea547ac1569ec3f50de
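
The two ideas in this change can be modelled in scalar C. This is only a
sketch of the technique, not the Neon code: sum_two_acc is a hypothetical
helper, and scalar variables stand in for vector accumulator registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums n bytes using two independent accumulators, so consecutive
 * additions do not all depend on the same register. The first iteration
 * is peeled: the accumulators start from the first two elements instead
 * of being zeroed and then added to. */
static uint32_t sum_two_acc(const uint8_t *p, size_t n) {
  uint32_t acc0, acc1;
  size_t i;
  assert(n >= 2 && n % 2 == 0);
  acc0 = p[0]; /* peeled first iteration */
  acc1 = p[1];
  for (i = 2; i < n; i += 2) {
    acc0 += p[i];
    acc1 += p[i + 1];
  }
  return acc0 + acc1; /* combine the partial sums once at the end */
}
```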
|
|
Use widening pairwise addition instructions to halve the number of
additions required.
Change-Id: I0307a3b65e50d2b1ae582938bc5df9c2b21df734
|
|
* changes:
vpx_dsp_common.h,clip_pixel: work around VS2022 Arm64 issue
fdct_partial_neon.c: work around VS2022 Arm64 issue
fdct8x8_test.cc: work around VS2022 Arm64 issue
|
|
|
|
|
|
cl.exe targeting AArch64 with optimizations enabled
produces invalid code for clip_pixel() when the return type is uint8_t.
See:
https://developercommunity.visualstudio.com/t/Misoptimization-for-ARM64-in-VS-2022-17/10363361
Bug: b/277255076
Bug: webm:1788
Change-Id: Ia3647698effd34f1cf196cd33fa4a8cab9fa53d6
|
|
cl.exe targeting AArch64 with optimizations enabled
will fail with an internal compiler error.
See:
https://developercommunity.visualstudio.com/t/Compiler-crash-C1001-when-building-a-for/10346110
Bug: b/277255076
Bug: webm:1788
Change-Id: I55caf34e910dab47a7775f07280677cdfe606f5b
|
|
cl.exe targeting AArch64 with optimizations enabled
produces invalid code in RunExtremalCheck() and RunInvAccuracyCheck().
See:
https://developercommunity.visualstudio.com/t/1770-preview-1:-Misoptimization-for-AR/10369786
Bug: b/277255076
Bug: webm:1788
Change-Id: Id2c60f3948d8f788c78602aea1b5232133415dea
|
|
Overload TempOutFile constructor to allow IO mode.
Bug: b/281563704
Change-Id: I1f4f5b29db0e331941b6795e478eeeab51f625ad
|
|
|
|
|
|
A new file (vpx_tpl.c) in follow-up CLs will add new APIs for handling
TPL stats from the VP9 encoder.
Change-Id: I5102ef64214cba1ca6ecea9582a19049666c6ca4
|
|
This CL refactors the code related to the convolve functions.
It also improves the AVX2 intrinsics that compute the
vertical convolve for the w = 4 case and the horizontal convolve
for the w = 16 case.
Module-level scaling relative to the C function (timer based)
for the existing and new AVX2 intrinsics:
Block     Scaling
size      AVX2          AVX2
          (existing)    (new)
4x4       5.34x         5.91x
4x8       7.10x         7.79x
16x8      23.52x        25.63x
16x16     29.47x        30.22x
16x32     33.42x        33.44x
This is a bit-exact change.
Change-Id: If130183bc12faab9ca2bcec0ceeaa8d0af05e413
|
|
* changes:
Add 2D-specific Neon horizontal convolution functions
Refactor standard bitdepth Neon convolution functions
|
|
2D 8-tap convolution filtering is performed in two passes -
horizontal and vertical. The horizontal pass must produce enough
input data for the subsequent vertical pass - 3 rows above and 4 rows
below, in addition to the actual block height.
At present, all Neon horizontal convolution algorithms process 4 rows
at a time, but this means we end up doing at least 1 row too much
work in the 2D first pass case where we need h + 7, not h + 8 rows of
output.
This patch adds additional dot-product (SDOT and USDOT) Neon paths
that process h + 7 rows of data exactly, saving the work of the
unnecessary extra row. It is impractical to take a similar approach
for the Armv8.0 MLA paths since we have to transpose the data block
both before and after calling the convolution helper functions.
vpx_convolve_neon performance impact: we observe a speedup of ~9% for
smaller (and wider) blocks, and a speedup of 0-3% for larger blocks.
This is to be expected since the proportion of redundant work
decreases as the block height increases.
Change-Id: Ie77ad1848707d2d48bb8851345a469aae9d097e1
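
The row arithmetic behind this patch can be checked with a few lines of C.
This is a sketch under the assumptions stated in the message (8-tap filter,
horizontal pass working in groups of 4 rows); the function names are
hypothetical.

```c
#include <assert.h>

#define TAPS 8 /* 8-tap filter, as stated in the message */

/* Rows the vertical pass needs from the horizontal pass: h + 7
 * (3 above, 4 below, plus the block height). */
static int rows_needed(int h) {
  assert(h > 0);
  return h + TAPS - 1;
}

/* Rows actually produced when the horizontal pass handles 4 rows at a
 * time: rows_needed() rounded up to a multiple of 4. */
static int rows_in_groups_of_4(int h) {
  return (rows_needed(h) + 3) / 4 * 4;
}

/* Rows computed but never read by the vertical pass. */
static int redundant_rows(int h) {
  return rows_in_groups_of_4(h) - rows_needed(h);
}
```

For h = 4 this gives 12 rows computed for 11 needed, so roughly 8% of the
first-pass work is redundant; for h = 64 it is 1 row in 72, about 1.4%.
That trend is in line with the larger speedup reported for smaller blocks.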
|
|
|
|
This avoids link errors related to the sanitizers:
https://clang.llvm.org/docs/AddressSanitizer.html#usage
"When linking shared libraries, the AddressSanitizer run-time is not
linked, so -Wl,-z,defs may cause link errors ..."
See also:
https://crbug.com/aomedia/3438
Bug: webm:1801
Fixed: webm:1801
Change-Id: Ie212318005a5f7222e5486775175534025306367
|
|
1) Use #define constant instead of magic numbers for right shifts.
2) Move saturating narrow into helper functions that return 4-element
result vectors.
3) Use mem_neon.h helpers for load/store sequences in Armv8.0 paths.
4) Tidy up: assert conditions and some longer variable names.
5) Prefer != 0 to > 0 where possible for loop termination conditions.
Change-Id: Idfcac43ca38faf729dca07b8cc8f7f45ad264d24
|
|
libraries under third_party/ are out of scope for this change.
Bug: webm:1793
Change-Id: I562065a3c0ea9fdfc9615d1a6b1ae47da79b8ce0
|
|
|
|
* changes:
gen_msvs_vcxproj: add ARM64EC w/VS >= 2022
configure: add clang-cl vs1[67] arm64 targets
|
|
|
|
The module level scaling w.r.t C function (timer based) for
existing (SSE2) and new AVX2 intrinsics:
If ref_padding = 0
Block     Scaling
size      SSE2      AVX2
8x4       3.24x     3.24x
8x8       4.22x     4.90x
8x16      5.91x     5.93x
16x8      1.63x     3.52x
16x16     1.53x     4.19x
16x32     1.38x     4.82x
32x16     1.28x     3.08x
32x32     1.45x     3.13x
32x64     1.38x     3.04x
64x32     1.39x     2.12x
64x64     1.46x     2.24x
If ref_padding = 8
Block     Scaling
size      SSE2      AVX2
8x4       3.20x     3.21x
8x8       4.61x     4.83x
8x16      5.50x     6.45x
16x8      1.56x     3.35x
16x16     1.53x     4.19x
16x32     1.37x     4.83x
32x16     1.28x     3.07x
32x32     1.46x     3.29x
32x64     1.38x     3.22x
64x32     1.38x     2.14x
64x64     1.38x     2.12x
This is a bit-exact change.
Change-Id: I72c5d155f64d0c630bc8c3aef21dc8bbd045d9e6
|
|
Bug: webm:1793
Change-Id: Ia940b06bd23a915a050432e03bb630567e891d8d
|
|
|
|
* changes:
vp8_[cd]x_iface: clear setjmp flag on function exit
vp9_decodeframe,tile_worker_hook: relocate setjmp=1
vp9,encoder_set_config: set setjmp flag after setjmp()
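
The pattern these three changes refer to can be sketched in plain C. This
is a minimal illustration, not the libvpx code: struct error_ctx, its field
names, and the functions fail() and run_guarded() are all hypothetical.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical error context; field names are illustrative only. */
struct error_ctx {
  jmp_buf jmp;
  int setjmp_flag; /* nonzero only while jmp is a valid longjmp target */
};

/* Error path: jump back only if a valid target has been recorded. */
static void fail(struct error_ctx *ctx) {
  if (ctx->setjmp_flag) longjmp(ctx->jmp, 1);
}

static int run_guarded(struct error_ctx *ctx, int should_fail) {
  if (setjmp(ctx->jmp)) {
    ctx->setjmp_flag = 0; /* arrived via longjmp: clear before returning */
    return -1;
  }
  ctx->setjmp_flag = 1; /* set only after setjmp() has filled in jmp */
  if (should_fail) fail(ctx);
  ctx->setjmp_flag = 0; /* clear on normal exit so jmp is not reused */
  return 0;
}
```

Setting the flag before setjmp() would let an early error jump through an
uninitialized jmp_buf, and leaving it set after the function returns would
let a later error jump into a stack frame that no longer exists.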
|