|
This allows AArch64 to be correctly detected when building with Visual
Studio (cl.exe) and fixes a crash in vp9_diamond_search_sad_neon.c.
There are still test failures, however.
Microsoft's compiler doesn't define __ARM_FEATURE_*. To use those paths
we may need to rely on _M_ARM64_EXTENSION.
Bug: webm:1788
Bug: b/277255076
Change-Id: I4d26f5f84dbd0cbcd1cdf0d7d932ebcf109febe5
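As a rough illustration (not the libvpx configure logic), target detection that works for both GCC/Clang and cl.exe might combine the compiler-specific macros like so; `_M_ARM64` is the documented MSVC macro for an ARM64 target, and the helper name is illustrative:

```c
/* Sketch: detect an AArch64 target under both GCC/Clang and MSVC (cl.exe),
 * which does not define the __ARM_FEATURE_* macros. */
#if defined(__aarch64__) || defined(_M_ARM64)
#define HAVE_AARCH64 1
#else
#define HAVE_AARCH64 0
#endif

/* Returns 1 when compiled for AArch64, 0 otherwise. */
int have_aarch64(void) { return HAVE_AARCH64; }
```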
|
|
Added a speed feature to skip every other row
in SAD computation during motion search.
                 Instruction Count    BD-Rate Loss (%)
cpu  Resolution  Reduction (%)        avg.psnr  ovr.psnr    ssim
  0  LOWRES2             0.958          0.0204    0.0095  0.0275
  0  MIDRES2             1.891         -0.0636    0.0032  0.0247
  0  HDRES2              2.869          0.0434    0.0345  0.0686
  0  Average             1.905          0.0000    0.0157  0.0403
STATS_CHANGED
Change-Id: I1a8692757ed0cbcb2259729b3ecfb0436cdf49ce
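A minimal scalar sketch of the speed feature described above: a SAD that visits only every other row. The function name, the `* 2` rescaling, and the parameters are illustrative, not the libvpx implementation:

```c
#include <stdlib.h>

/* SAD sampled on every other row of the block (rows 0, 2, 4, ...). */
static unsigned int sad_skip_rows(const unsigned char *src, int src_stride,
                                  const unsigned char *ref, int ref_stride,
                                  int width, int height) {
  unsigned int sad = 0;
  for (int r = 0; r < height; r += 2) {
    for (int c = 0; c < width; ++c)
      sad += abs((int)src[r * src_stride + c] - (int)ref[r * ref_stride + c]);
  }
  return sad * 2;  /* scale back up so thresholds stay comparable */
}
```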
|
|
Avoided repeated calculation of start MV
SAD during full pixel motion search.
                 Instruction Count
cpu  Resolution  Reduction (%)
  0  LOWRES2             0.162
  0  MIDRES2             0.246
  0  HDRES2              0.325
  0  Average             0.245
Change-Id: I2b4786901f254ce32ee8ca8a3d56f1c9f112f1d4
|
|
|
|
Add Neon implementation of vp9_highbd_block_error_c as well as the
corresponding tests.
Change-Id: Ibe0eb077f959ced0dcd7d0d8d9d529d3b5bc1874
|
|
Both are around 3x faster than the original C version. The 8-bit version
gives a small 0.5% speed increase, whereas highbd gives ~2.5%.
Change-Id: I71d75ddd2757b19aa201e879fd9fa8f3a25431ad
|
|
Add Neon implementation of vp9_block_error_c as well as the
corresponding tests.
Change-Id: I79247b5ae24f51b7b55fc5e517d5e403dc86367a
|
|
Currently vp9_block_error_fp_neon is only used when
CONFIG_VP9_HIGHBITDEPTH is set to false. This patch optimizes the
implementation and uses tran_low_t instead of int16_t so that the
function can also be used in builds where vp9_highbitdepth is enabled.
Change-Id: Ibab7ec5f74b7652fa2ae5edf328f9ec587088fd3
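For reference, a scalar sketch of the block-error computation this entry optimizes. In libvpx, `tran_low_t` is a 32-bit type when vp9_highbitdepth is enabled, which is why the Neon code moved away from `int16_t` loads; the function body here is illustrative:

```c
#include <stdint.h>

/* Stand-in for libvpx's tran_low_t in a highbitdepth build. */
typedef int32_t tran_low_t;

/* Sum of squared differences between coefficients and dequantized
 * coefficients, accumulated in 64 bits to avoid overflow. */
static int64_t block_error_fp(const tran_low_t *coeff,
                              const tran_low_t *dqcoeff, int block_size) {
  int64_t error = 0;
  for (int i = 0; i < block_size; ++i) {
    const int64_t diff = coeff[i] - dqcoeff[i];
    error += diff * diff;
  }
  return error;
}
```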
|
|
rather than the gcc specific __attribute__((aligned())); fixes build
targeting ARM64 windows.
Bug: webm:1788
Change-Id: I2210fc215f44d90c1ce9dee9b54888eb1b78c99e
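A simplified stand-in for the portable alignment macro this change describes (libvpx's own macro is `DECLARE_ALIGNED`); MSVC uses `__declspec(align(n))` where GCC/Clang use `__attribute__((aligned(n)))`:

```c
#include <stdint.h>

/* Compiler-portable alignment attribute; simplified illustration only. */
#if defined(_MSC_VER)
#define ALIGNED(n) __declspec(align(n))
#else
#define ALIGNED(n) __attribute__((aligned(n)))
#endif

/* A constant table aligned for 128-bit vector loads. */
ALIGNED(16) static const int16_t kCoeffs[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
```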
|
|
In total this gives about 9% extra performance for both the rt and best
profiles.
Furthermore, add a transpose_s32 16x16 function.
Change-Id: Ib6f368bbb9af7f03c9ce0deba1664cef77632fe2
|
|
Define all Neon load/store helper functions in mem_neon.h and use
them consistently in Neon convolution functions.
Change-Id: I57905bc0a3574c77999cf4f4a73442c3420fa2be
|
|
Refactor and optimize the FHT functions further, using the new butterfly
functions. 4x4 is 5% faster, 8x8 and 16x16 are 10% faster than the
previous versions.
The highbd 4x4 FHT is 2.27x faster than the C version for --rt.
Change-Id: I3ebcd26010f6c5c067026aa9353cde46669c5d94
|
|
Provide a set of commonly used Butterfly DCT functions for use in
DCT 4x4, 8x8, 16x16, 32x32 functions. These are provided in various
forms, using vqrdmulh_s16/vqrdmulh_s32 for _fast variants, which
unfortunately are only usable in pass1 of most DCTs, as they do not
provide the necessary precision in pass2.
This gave a performance gain ranging from 5% to 15% in the 16x16 case.
Also, for 32x32, the loads were rearranged; along with the butterfly
optimizations, this gave a 10% gain in the 32x32_rd function.
This refactoring was necessary to allow easier porting of the highbd
32x32 functions, which follow in this patchset.
Change-Id: I6282e640b95a95938faff76c3b2bace3dc298bc3
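A scalar model of the `vqrdmulh_s16` operation the `_fast` butterfly variants rely on: a saturating, rounding, doubling multiply that returns the high half of the product. The rounding high-half extraction is what limits precision and, as noted above, makes these variants unusable in pass2. The function name is illustrative:

```c
#include <stdint.h>

/* Scalar equivalent of one lane of vqrdmulh_s16:
 * saturate((2 * a * b + (1 << 15)) >> 16). */
static int16_t sqrdmulh_s16(int16_t a, int16_t b) {
  int64_t p = (2 * (int64_t)a * b + (1 << 15)) >> 16;
  if (p > INT16_MAX) p = INT16_MAX; /* saturate the one overflow case */
  if (p < INT16_MIN) p = INT16_MIN;
  return (int16_t)p;
}
```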
|
|
count -> n_coeffs. Aligns the name with the rtcd header and clears a
clang-tidy warning.
Change-Id: I36545ff479df92b117c95e494f16002e6990f433
|
|
All of the assembly adds 1 to iscan to convert from
a 0-based array to the EOB value.
Add 1 to all iscan values and remove the extra
instructions from the assembly.
Change-Id: I219dd7f2bd10533ab24b206289565703176dc5e9
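An illustration of the rebasing described above: with the iscan table already incremented by 1, the end-of-block value is simply the maximum iscan entry over nonzero coefficients, with no +1 fixup left in the SIMD code. Names and types are illustrative:

```c
#include <stdint.h>

/* EOB from a 1-based iscan table: 0 means no nonzero coefficients. */
static int compute_eob(const int16_t *iscan_plus_one,
                       const int32_t *qcoeff, int n) {
  int eob = 0;
  for (int i = 0; i < n; ++i) {
    if (qcoeff[i] != 0 && iscan_plus_one[i] > eob) eob = iscan_plus_one[i];
  }
  return eob;
}
```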
|
|
Up to 2.6x faster than vp9_highbd_quantize_fp_32x32_c() for full
calculations.
Bug: b/237714063
Change-Id: Icfeff2ad4dcd57d0ceb47fe04789710807b9cbad
|
|
|
|
No change in performance.
Bug: b/237714063
Change-Id: If6ad5fc27de4babe0bfff3fdbb4b7fd99a0544ab
|
|
Up to 4.1x faster than vp9_highbd_quantize_fp_c() for full
calculations.
~1.3% overall encoder improvement for the test clip used.
Bug: b/237714063
Change-Id: I8c6466bdbcf1c398b1d8b03cab4165c1d8556b0c
|
|
No change in performance.
Bug: b/237714063
Change-Id: I868cda7acb0de840fbc85b23f3e36c50b39c331b
|
|
Change-Id: I6780f610151f2e092da525ff064d4b69f74fa61b
|
|
This reverts commit 9f1329f8ac88ea5d7c6ae5d6a57221c36cf85ac8
and fixes a dumb mistake in the evaluation of vfcmv: vdupq_n_s16 was
used instead of vdupq_n_s32.
Change-Id: Ie236c878c166405c49bc0f93f6d63a6715534a0a
|
|
Added datarate unit tests for 4:4:4 and 4:2:2 input,
for spatial and temporal layers.
A fix is needed in vp9_set_literal_size():
the sampling_x/y should be passed into update_initial_width(),
otherwise sampling_x/y = 1/1 (4:2:0) was forced.
vp9_set_literal_size() is only called by the svc and
on dynamic resize.
Fix an issue with the normative optimized scaler:
the UV width/height was assumed to be 1/2 of Y
in the SSSE3 and Neon code.
Also fix the assert for the scaled width/height:
in case the scaled width/height is odd it should be
incremented by 1 (made even).
Change-Id: I3a2e40effa53c505f44ef05aaa3132e1b7f57dd5
|
|
This reverts commit 258affdeab68ed59e181368baa46e2f1d077b0ab.
Reason for revert:
Not bitexact with C version
Original change's description:
> [NEON] Optimize vp9_diamond_search_sad() for NEON
>
> About 50% improvement in comparison to the C function.
> I have followed the AVX version with some simplifications.
>
> Change-Id: I72ddbdb2fbc5ed8a7f0210703fe05523a37db1c9
Change-Id: I5c210b3dfe1f6dec525da857dd8c83946be566fc
|
|
About 50% improvement in comparison to the C function.
I have followed the AVX version with some simplifications.
Change-Id: I72ddbdb2fbc5ed8a7f0210703fe05523a37db1c9
|
|
|
|
[NEON] Optimize vp9_fht4x4, vp9_fht8x8 and vp9_fht16x16
Following change #3516278, the improvement for these functions is:
Before:
4.10% 0.75% vpxenc vpxenc [.] vp9_fht16x16_c
2.93% 0.65% vpxenc vpxenc [.] vp9_fht8x8_c
0.93% 0.77% vpxenc vpxenc [.] vp9_fht4x4_c
And after the patch:
0.69% 0.16% vpxenc vpxenc [.] vp9_fht16x16_neon
0.28% 0.28% vpxenc vpxenc [.] vp9_fht8x8_neon
0.54% 0.53% vpxenc vpxenc [.] vp9_fht4x4_neon
Bug: webm:1634
Change-Id: I6748a0c4e0cfaafa3eefdd4848d0ac3aab6900e4
|
|
Whether a block is skipped is handled by mi->skip. x->skip_block
is kept exclusively to verify that the quantize functions are not
called for skip blocks.
Finishes the cleanup in 13eed991f
Bug: libvpx:1612
Change-Id: I1598c3b682d3c5e6c57a15fa4cb5df2c65b3a58a
|
|
This reduces the number of instructions needed to compute the sum.
Change-Id: Icae4d4fb3e343d5b6e5a095c60ac6d171b3e7d54
|
|
In the high bitdepth build, the Neon code could go out of range because
of the use of int16x8_t and vmulq_s16.
The C code always truncates out-of-range values.
Change-Id: I33a968b8d812e3c8477f3a61d84482758a3f8b21
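A scalar illustration of the overflow this change fixes: multiplying two 16-bit values in a 16-bit lane (as `vmulq_s16` does) truncates the product, while widening to 32 bits first (as `vmull_s16` does) keeps it intact. The function names are illustrative:

```c
#include <stdint.h>

/* Truncating 16-bit lane multiply, like one lane of vmulq_s16. */
static int16_t mul_16bit_lane(int16_t a, int16_t b) {
  return (int16_t)(a * b);
}

/* Widening multiply, like one lane of vmull_s16: full 32-bit product. */
static int32_t mul_widened(int16_t a, int16_t b) {
  return (int32_t)a * (int32_t)b;
}
```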
|
|
Change-Id: I7850a5c5aea3633e50e9a2efc8116b9e16383a8f
|
|
Change-Id: If146bbf24f446f71be9147402e6d30533eee99d1
|
|
Simplify max value calculation on aarch64 by using vmaxv. Much
faster for 4x4 but diminishing returns as the block size grows.
Only the vp9 quantize has a speed test hooked up. Anticipate
similar results for the other quantize versions.
Before:
[ RUN ] NEON/VP9QuantizeTest.DISABLED_Speed/2
[ BENCH ] Bypass calculations 4x4 31.6 ms ( ±0.0 ms )
[ BENCH ] Full calculations 4x4 31.6 ms ( ±0.0 ms )
[ BENCH ] Bypass calculations 8x8 17.7 ms ( ±0.0 ms )
[ BENCH ] Full calculations 8x8 17.7 ms ( ±0.0 ms )
[ BENCH ] Bypass calculations 16x16 14.2 ms ( ±0.0 ms )
[ BENCH ] Full calculations 16x16 14.2 ms ( ±0.0 ms )
[ OK ] NEON/VP9QuantizeTest.DISABLED_Speed/2 (1906 ms)
[ RUN ] NEON/VP9QuantizeTest.DISABLED_Speed/3
[ BENCH ] Bypass calculations 32x32 18.6 ms ( ±0.0 ms )
[ BENCH ] Full calculations 32x32 18.6 ms ( ±0.0 ms )
After:
[ RUN ] NEON/VP9QuantizeTest.DISABLED_Speed/2
[ BENCH ] Bypass calculations 4x4 29.1 ms ( ±0.0 ms )
[ BENCH ] Full calculations 4x4 29.1 ms ( ±0.0 ms )
[ BENCH ] Bypass calculations 8x8 16.9 ms ( ±0.0 ms )
[ BENCH ] Full calculations 8x8 16.9 ms ( ±0.0 ms )
[ BENCH ] Bypass calculations 16x16 14.1 ms ( ±0.0 ms )
[ BENCH ] Full calculations 16x16 14.1 ms ( ±0.0 ms )
[ OK ] NEON/VP9QuantizeTest.DISABLED_Speed/2 (1803 ms)
[ RUN ] NEON/VP9QuantizeTest.DISABLED_Speed/3
[ BENCH ] Bypass calculations 32x32 18.6 ms ( ±0.0 ms )
[ BENCH ] Full calculations 32x32 18.6 ms ( ±0.0 ms )
Change-Id: Ic95812b3fdbd4e47b4dcb8ed46c68a9617de38d2
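For context, a scalar equivalent of what `vmaxv` (e.g. `vmaxvq_s16`) gives the AArch64 code in a single instruction: a horizontal maximum across the lanes of a vector. The quantize code uses the maximum coefficient magnitude to decide whether a block can bypass the full calculation. This sketch is illustrative, not the libvpx code:

```c
#include <stdint.h>

/* Horizontal maximum over `lanes` elements, as vmaxvq_s16 does in one
 * instruction for an 8-lane vector. */
static int16_t horizontal_max(const int16_t *v, int lanes) {
  int16_t m = v[0];
  for (int i = 1; i < lanes; ++i)
    if (v[i] > m) m = v[i];
  return m;
}
```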
|
|
BUG=webm:1444
Change-Id: I6823635eb1a99c3fcca0a8f091878e3ab2fdd2ac
|
|
Change-Id: I9d4c1af53d57f72fc716bacbe3b0965719c045ac
|
|
Speed compared with the version calling vpx_scaled_2d_neon():
~1.7x in general,
~2.8x for the BILINEAR filter.
BUG=webm:1419
Change-Id: I8f0a54c2013e61ea086033010f97c19ecf47c7c6
|
|
Change-Id: Ib91054622c1f09c4ca523bc6837d7d8ab9f03618
|
|
BUG=webm:1419
Change-Id: If82a93935d2453e61b7647aae70983db1740bec7
|
|
BUG=webm:1419
Change-Id: I99c954ffa50a62ccff2c4ab54162916141826d9b
|
|
About 4x faster when values are below the dequant threshold and 10x
faster if everything needs to be calculated.
Both numbers would improve if the division for dqcoeff could be
simplified.
BUG=webm:1426
Change-Id: I8da67c1f3fcb4abed8751990c1afe00bc841f4b2
|
|
Change-Id: Ie8ac00efa826eead2a227726a1add816e04ff147
|
|
Move the tran_low_t helper functions to a new file. Additional
load/store functions will be added here.
Change-Id: I52bf652c344c585ea2f3e1230886be93f5caefc3
|
|
The denoiser on Neon is 5x faster than the C code.
BUG=webm:1420
Change-Id: I805ab64f809ff2137354116be6213e7ec29c1dcb
|
|
vp9[_highbd]_quantize_fp[_32x32] and vp9_fdct8x8_quant do not make use
of these parameters.
scan is used for the C code and iscan is used for the SIMD
implementations.
Change-Id: I908a0ff7d3febac33da97e0596e040ec7bc18ca5
|
|
Change-Id: Ia8f822bd6e70b3911bc433a5a750bfb6f9a3a75c
|
|
Change-Id: I90fd815f15884490ad138f35df575a00d31e8c95
|
|
Change-Id: I45d9fb4013f50766b24363a86365e8063e8954c2
|
|
Change-Id: I7bc991abea383db1f86c1bb0f2e849837b54d90f
|
|
Change-Id: I3ef9a9648841374ed3cc865a02053c14ad821a20
|
|
~60-65% faster at the function level across block sizes
Change-Id: Iaf8cbe95731c43fdcbf68256e44284ba51a93893
|