Age | Commit message | Author |
|
Add Neon implementations of the d117 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also update the tests to add corresponding new cases.
An explanation of the general implementation strategy is given in the
8x8 implementation body.
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 1.73
Neoverse N1 | LLVM 15 | 8x8 | 5.24
Neoverse N1 | LLVM 15 | 16x16 | 9.77
Neoverse N1 | LLVM 15 | 32x32 | 14.13
Neoverse N1 | GCC 12 | 4x4 | 2.04
Neoverse N1 | GCC 12 | 8x8 | 4.70
Neoverse N1 | GCC 12 | 16x16 | 8.64
Neoverse N1 | GCC 12 | 32x32 | 4.57
Neoverse V1 | LLVM 15 | 4x4 | 1.75
Neoverse V1 | LLVM 15 | 8x8 | 6.79
Neoverse V1 | LLVM 15 | 16x16 | 9.16
Neoverse V1 | LLVM 15 | 32x32 | 14.47
Neoverse V1 | GCC 12 | 4x4 | 1.75
Neoverse V1 | GCC 12 | 8x8 | 6.00
Neoverse V1 | GCC 12 | 16x16 | 7.63
Neoverse V1 | GCC 12 | 32x32 | 4.32
Change-Id: I7228327b5be27ee7a68deecafa05be0bd2a40ff4
|
|
Add Neon implementations of the d63 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also update the tests to add corresponding new cases.
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 2.10
Neoverse N1 | LLVM 15 | 8x8 | 4.45
Neoverse N1 | LLVM 15 | 16x16 | 4.74
Neoverse N1 | LLVM 15 | 32x32 | 2.27
Neoverse N1 | GCC 12 | 4x4 | 2.46
Neoverse N1 | GCC 12 | 8x8 | 10.37
Neoverse N1 | GCC 12 | 16x16 | 11.46
Neoverse N1 | GCC 12 | 32x32 | 6.57
Neoverse V1 | LLVM 15 | 4x4 | 2.24
Neoverse V1 | LLVM 15 | 8x8 | 3.53
Neoverse V1 | LLVM 15 | 16x16 | 4.44
Neoverse V1 | LLVM 15 | 32x32 | 2.17
Neoverse V1 | GCC 12 | 4x4 | 2.25
Neoverse V1 | GCC 12 | 8x8 | 7.67
Neoverse V1 | GCC 12 | 16x16 | 8.97
Neoverse V1 | GCC 12 | 32x32 | 4.77
Change-Id: Ib4a1a2cb5a5c4495ae329529f8847664cbd0dfe0
|
|
Optimize the implementation of vpx_highbd_comp_avg_pred_neon by making
use of the URHADD instruction to compute the average.
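For reference, URHADD is an unsigned rounding halving add: elementwise (a + b + 1) >> 1 without widening. A minimal scalar model of the averaging step (not the actual Neon path; the helper names here are illustrative only):

```c
#include <stdint.h>

/* Scalar model of the Neon URHADD (unsigned rounding halving add):
 * (a + b + 1) >> 1, computed without overflowing 8 bits via the
 * identity (a | b) - ((a ^ b) >> 1). */
static inline uint8_t urhadd_u8(uint8_t a, uint8_t b) {
  return (uint8_t)((a | b) - (((uint8_t)(a ^ b)) >> 1));
}

/* comp_avg_pred, scalar sketch: average a predictor block against a
 * reference block, element by element. */
static void comp_avg_pred_scalar(uint8_t *comp, const uint8_t *pred,
                                 const uint8_t *ref, int n) {
  int i;
  for (i = 0; i < n; ++i) comp[i] = urhadd_u8(pred[i], ref[i]);
}
```

The identity form mirrors how the hardware avoids a 9-bit intermediate; a plain `(a + b + 1) >> 1` in wider arithmetic gives the same result.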
Change-Id: Id74a6d9c33e89bc548c3c7ecace59af69051b4a7
|
|
Use the same specialization as for standard bitdepth. The rationale for
the specialization is as follows:
The optimal implementation of the bilinear interpolation depends on the
filter values being used. For both horizontal and vertical interpolation
this can simplify to just taking the source values, or averaging the
source and reference values - which can be computed more easily than a
bilinear interpolation with arbitrary filter values.
This patch introduces tests to find the optimal bilinear
interpolation implementation based on the filter values being used.
This new specialization is only used for larger block sizes.
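The specialization can be sketched in scalar form. Assuming 2-tap weights of the shape {128 - 16*offset, 16*offset} with offset in 0..7 (hypothetical here, for illustration): offset 0 reduces to a copy and offset 4 to a rounded average, both cheaper than the general filter:

```c
#include <stdint.h>

/* Hypothetical scalar sketch of the bilinear specialization: the two
 * special offsets avoid the multiply/shift of the general case. */
static void bilinear_1d(uint8_t *dst, const uint8_t *src, int n, int off) {
  int i;
  if (off == 0) { /* weights {128, 0}: plain copy */
    for (i = 0; i < n; ++i) dst[i] = src[i];
  } else if (off == 4) { /* weights {64, 64}: rounded average */
    for (i = 0; i < n; ++i) dst[i] = (uint8_t)((src[i] + src[i + 1] + 1) >> 1);
  } else { /* general 2-tap filter, rounded and scaled by >> 7 */
    const int f0 = 128 - 16 * off, f1 = 16 * off;
    for (i = 0; i < n; ++i)
      dst[i] = (uint8_t)((src[i] * f0 + src[i + 1] * f1 + 64) >> 7);
  }
}
```

The actual Neon code dispatches on the offset once per block, which is why the specialization only pays off for larger block sizes.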
Change-Id: Id5a2b2d9fac6f878795a6ed9de2bc27d9e62d661
|
|
Use the same specialization as for standard bitdepth. The rationale for
the specialization is as follows:
The optimal implementation of the bilinear interpolation depends on the
filter values being used. For both horizontal and vertical interpolation
this can simplify to just taking the source values, or averaging the
source and reference values - which can be computed more easily than a
bilinear interpolation with arbitrary filter values.
This patch introduces tests to find the optimal bilinear
interpolation implementation based on the filter values being used.
This new specialization is only used for larger block sizes.
Change-Id: I73182c979255f0332a274f2e5907df7f38c9eeb3
|
|
Use the same general code style as in the standard bitdepth Neon
implementation - merging the computation of vpx_highbd_comp_avg_pred
with the second pass of the bilinear filter to avoid storing and loading
the block again.
Also move vpx_highbd_comp_avg_pred_neon to its own file (like the
standard bitdepth implementation) since we're no longer using it for
averaging sub-pixel variance.
Change-Id: I2f5916d5b397db44b3247b478ef57046797dae6c
|
|
Use the same general code style as in the standard bitdepth Neon
implementation. Additionally, do not unnecessarily widen to 32-bit data
types when doing bilinear filtering - allowing us to process twice as
many elements per instruction.
Change-Id: I1e178991d2aa71f5f77a376e145d19257481e90f
|
|
Use standard loads and stores instead of the significantly slower
interleaving/de-interleaving variants. Also move all loads in loop
bodies above all stores as a mitigation against the compiler thinking
that the src and dst pointers alias (since we can't use restrict in
C89).
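The load-hoisting pattern looks like this in scalar form (an illustrative sketch, not the actual Neon routine):

```c
#include <stdint.h>

/* Illustrative only: performing all loads of a loop iteration before
 * any store lets the compiler keep values in registers even though,
 * under C89 (no "restrict"), it must assume src and dst may alias. */
static void copy_rows_8wide(uint8_t *dst, int dst_stride,
                            const uint8_t *src, int src_stride, int h) {
  int i, j;
  for (i = 0; i < h; ++i) {
    uint8_t row[8];
    for (j = 0; j < 8; ++j) row[j] = src[j]; /* all loads first */
    for (j = 0; j < 8; ++j) dst[j] = row[j]; /* then all stores */
    src += src_stride;
    dst += dst_stride;
  }
}
```

If a store preceded a later load, the compiler would have to assume the store may have modified the source and reload it from memory.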
Change-Id: Idd59dca51387f553f8db27144a2b8f2377c937d3
|
|
Move the 4D reduction helper function to sum_neon.h and use this for
both standard and high bitdepth SAD4D paths. This also removes the
AArch64 requirement for using the UDOT Neon SAD4D paths.
Change-Id: I207f76b3d42aa541809b0672c3b3d86e54d133ff
|
|
Optimizations take a similar form to those implemented for Armv8.0
standard bitdepth SAD4D:
- Use ABD, UADALP instead of ABAL, ABAL2 (double the throughput on
modern out-of-order Arm-designed cores.)
- Use more accumulator registers to make better use of Neon pipeline
resources on Arm CPUs that have four Neon pipes.
- Compute the four SAD sums in parallel so that we only load the source
block once - instead of four times.
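The single-pass structure can be shown in scalar form (a sketch of the idea only; the Neon code vectorizes across the row and uses ABD/UADALP for the accumulation):

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar sketch of SAD4D: accumulate the SADs against all four
 * reference blocks in one pass, so each source element is loaded
 * once rather than four times. */
static void sad4d_scalar(const uint8_t *src, int src_stride,
                         const uint8_t *const ref[4], int ref_stride,
                         int w, int h, uint32_t sad[4]) {
  int i, j, k;
  sad[0] = sad[1] = sad[2] = sad[3] = 0;
  for (i = 0; i < h; ++i) {
    for (j = 0; j < w; ++j) {
      const int s = src[i * src_stride + j]; /* one source load ... */
      for (k = 0; k < 4; ++k)                /* ... four accumulations */
        sad[k] += (uint32_t)abs(s - ref[k][i * ref_stride + j]);
    }
  }
}
```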
Change-Id: Ica45c44fd167e5fcc83871d8c138fc72ed3a9723
|
|
Optimizations take a similar form to those implemented for standard
bitdepth averaging SAD:
- Use ABD, UADALP instead of ABAL, ABAL2 (double the throughput on
modern out-of-order Arm-designed cores.)
- Use more accumulator registers to make better use of Neon pipeline
resources on Arm CPUs that have four Neon pipes.
Change-Id: I75c5f09948f6bf17200f82e00e7a827a80451108
|
|
Optimizations take a similar form to those implemented for standard
bitdepth SAD:
- Use ABD, UADALP instead of ABAL, ABAL2 (double the throughput on
modern out-of-order Arm-designed cores.)
- Use more accumulator registers to make better use of Neon pipeline
resources on Arm CPUs that have four Neon pipes.
Change-Id: I9e626d7fa0e271908dc43448405a7985b80e6230
|
|
Use the load_unaligned helper functions in mem_neon.h to load strided
sequences of 4 bytes where alignment is not guaranteed in the Neon
SAD and SAD4D paths.
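The portable idiom such helpers typically rely on is a memcpy-based load (a sketch of the technique; the actual helpers in mem_neon.h assemble the bytes into Neon registers):

```c
#include <stdint.h>
#include <string.h>

/* Portable unaligned 4-byte load: memcpy compiles to a single load on
 * targets that permit unaligned access, without the undefined
 * behaviour of dereferencing a misaligned uint32_t pointer. */
static inline uint32_t load_unaligned_u32(const uint8_t *p) {
  uint32_t v;
  memcpy(&v, p, sizeof(v));
  return v;
}
```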
Change-Id: I941d226ef94fd7a633b09fc92165a00ba68a1501
|
|
Basically port the fix from libaom:
https://aomedia-review.googlesource.com/c/aom/+/169361
Change-Id: Id06a5db91372037832399200ded75d514e096726
|
|
Refactor and optimize the Neon implementation of SAD4D functions -
effectively backporting these libaom changes[1,2].
[1] https://aomedia-review.googlesource.com/c/aom/+/162181
[2] https://aomedia-review.googlesource.com/c/aom/+/162183
Change-Id: Icb04bd841d86f2d0e2596aa7ba86b74f8d2d360b
|
|
Refactor the Neon implementation of transpose_s16_8x8(q) and
transpose_u16_8x8 so that the final step compiles to 8 ZIP1/ZIP2
instructions as opposed to 8 EXT, MOV pairs. This change removes 8
instructions per call to transpose_s16_8x8(q), transpose_u16_8x8
where the result stays in registers for further processing - rather
than being stored to memory - like in vpx_hadamard_8x8_neon, for
example.
This is a backport of this libaom patch[1].
[1] https://aomedia-review.googlesource.com/c/aom/+/169426
Change-Id: Icef3e51d40efeca7008e1c4fc701bf39bd319c88
|
|
Refactor and optimize the Neon implementation of SAD functions -
effectively backporting these libaom changes[1,2,3].
[1] https://aomedia-review.googlesource.com/c/aom/+/161921
[2] https://aomedia-review.googlesource.com/c/aom/+/161923
[3] https://aomedia-review.googlesource.com/c/aom/+/166963
Change-Id: I2d72fd0f27d61a3e31a78acd33172e2afb044cb8
|
|
In total, this gives about 9% extra performance for both the rt and
best profiles.
Furthermore, add a transpose_s32 16x16 function.
Change-Id: Ib6f368bbb9af7f03c9ce0deba1664cef77632fe2
|
|
Use the same specialization for averaging subpel variance functions
as used for the non-averaging variants. The rationale for the
specialization is as follows:
The optimal implementation of the bilinear interpolation depends on
the filter values being used. For both horizontal and vertical
interpolation this can simplify to just taking the source values, or
averaging the source and reference values - which can be computed
more easily than a bilinear interpolation with arbitrary filter
values.
This patch introduces tests to find the optimal bilinear
interpolation implementation based on the filter values being used.
This new specialization is only used for larger block sizes.
This is a backport of this libaom change[1].
After this change, the only differences between the code in libvpx and
libaom are due to libvpx being compiled with ISO C90, which forbids
mixing declarations and code [-Wdeclaration-after-statement].
[1] https://aomedia-review.googlesource.com/c/aom/+/166962
Change-Id: I7860c852db94a7c9c3d72ae4411316685f3800a4
|
|
Merge the computation of vpx_comp_avg_pred into the second pass of the
bilinear filter - avoiding the overhead of loading and storing the
entire block again.
This is a backport of this libaom change[1].
[1] https://aomedia-review.googlesource.com/c/aom/+/166961
Change-Id: I9327ff7382a46d50c42a5213a11379b957146372
|
|
The optimal implementation of the bilinear interpolation depends on
the filter values being used. For both horizontal and vertical
interpolation this can simplify to just taking the source values, or
averaging the source and reference values - which can be computed
more easily than a bilinear interpolation with arbitrary filter
values.
This patch introduces tests to find the optimal bilinear
interpolation implementation based on the filter values being used.
This new specialization is only used for larger block sizes
(>= 16x16) as we need to be doing enough work to make the cost of
finding the optimal implementation worth it.
This is a backport of this libaom change[1].
After this change, the only differences between the code in libvpx and
libaom are due to libvpx being compiled with ISO C90, which forbids
mixing declarations and code [-Wdeclaration-after-statement].
[1] https://aomedia-review.googlesource.com/c/aom/+/162463
Change-Id: Ia818e148f6fd126656e8411d59c184b55dd43094
|
|
Refactor the Neon implementation of the sub-pixel variance bilinear
filter helper functions - effectively backporting this libaom patch[1].
[1] https://aomedia-review.googlesource.com/c/aom/+/162462
Change-Id: I3dee32e8125250bbeffeb63d1fef5da559bacbf1
|
|
Refactor and optimize the Neon implementation of variance functions -
effectively backporting these libaom changes[1,2].
After this change, the only differences between the code in libvpx and
libaom are due to libvpx being compiled with ISO C90, which forbids
mixing declarations and code [-Wdeclaration-after-statement].
[1] https://aomedia-review.googlesource.com/c/aom/+/162241
[2] https://aomedia-review.googlesource.com/c/aom/+/162262
Change-Id: Ia4e8fff4d53297511d1a1e43bca8053bf811e551
|
|
Add additional AArch64 paths for vpx_convolve8_vert_neon and
vpx_convolve8_avg_vert_neon that use the Armv8.6-A USDOT (mixed-sign
dot-product) instruction. The USDOT instruction takes an 8-bit
unsigned operand vector and a signed 8-bit operand vector to produce
a signed 32-bit result. This is helpful because convolution filters
often have both positive and negative values, while the 8-bit pixel
channel data being filtered is all unsigned. As a result, the USDOT
convolution paths added here do not have to do the "transform the
pixel channel data to [-128, 128) and correct for it later" dance
that we have to do with the SDOT paths.
The USDOT instruction is optional from Armv8.2 to Armv8.5 but
mandatory from Armv8.6 onwards. The availability of the USDOT
instruction is indicated by the feature macro
__ARM_FEATURE_MATMUL_INT8. The SDOT paths are retained for use on
target CPUs that do not implement the USDOT instructions.
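The correction the SDOT path needs follows from a simple identity: with unsigned samples u and signed filter f, sum(u*f) = sum((u - 128)*f) + 128*sum(f), so the bias can be undone with one constant per output. A scalar model of both styles (illustrative only; the real paths use the SDOT/USDOT instructions on vectors):

```c
#include <stdint.h>

/* SDOT-style: bias samples into [-128, 128), accumulate the signed
 * products, then add back 128 * sum(f) to undo the bias. */
static int32_t dot_sdot_style(const uint8_t *u, const int8_t *f, int n) {
  int32_t acc = 0, fsum = 0, i;
  for (i = 0; i < n; ++i) {
    acc += (int32_t)((int8_t)(u[i] - 128)) * f[i]; /* biased samples */
    fsum += f[i];
  }
  return acc + 128 * fsum; /* undo the bias */
}

/* USDOT-style: mixed-sign product computed directly, no correction. */
static int32_t dot_usdot_style(const uint8_t *u, const int8_t *f, int n) {
  int32_t acc = 0, i;
  for (i = 0; i < n; ++i) acc += (int32_t)u[i] * f[i];
  return acc;
}
```

Both return the same value; USDOT simply removes the bias/correct steps from the inner loop.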
Change-Id: Ifbf467681dd53bb1d26e22359885e6edde3c5c72
|
|
Add additional AArch64 paths for vpx_convolve8_horiz_neon and
vpx_convolve8_avg_horiz_neon that use the Armv8.6-A USDOT (mixed-sign
dot-product) instruction. The USDOT instruction takes an 8-bit
unsigned operand vector and a signed 8-bit operand vector to produce
a signed 32-bit result. This is helpful because convolution filters
often have both positive and negative values, while the 8-bit pixel
channel data being filtered is all unsigned. As a result, the USDOT
convolution paths added here do not have to do the "transform the
pixel channel data to [-128, 128) and correct for it later" dance
that we have to do with the SDOT paths.
The USDOT instruction is optional from Armv8.2 to Armv8.5 but
mandatory from Armv8.6 onwards. The availability of the USDOT
instruction is indicated by the feature macro
__ARM_FEATURE_MATMUL_INT8. The SDOT paths are retained for use on
target CPUs that do not implement the USDOT instructions.
Change-Id: If19f5872c3453458a8cfb7c7d2be82a2c0eab46a
|
|
Define all Neon load/store helper functions in mem_neon.h and use
them consistently in Neon convolution functions.
Change-Id: I57905bc0a3574c77999cf4f4a73442c3420fa2be
|
|
The Neon convolution helper functions take a pointer to a filter and
load the 8 values into a single Neon register. For some reason,
filter values 3 and 4 are then duplicated into their own separate
registers.
This patch modifies these helper functions so that they access filter
values 3 and 4 via the lane-referencing versions of the various Neon
multiply instructions. This reduces register pressure and tidies up
the source code quite a bit.
Change-Id: Ia4aeee8b46fe218658fb8577dc07ff04a9324b3e
|
|
C vs SSE2
4x4: 3.38x
8x8: 3.45x
16x16: 2.06x
32x32: 2.19x
64x64: 1.39x
Change-Id: I46638fe187b49a78fee554114fac51c485d74474
|
|
Up to 4x faster than "sse2 vectorized C".
Change-Id: Ie9b3c12a437c5cddf92c4d5349c4f659ca6b82ea
|
|
Refactor and optimize the FHT functions further, using the new
butterfly functions. 4x4 is 5% faster, and 8x8 and 16x16 are 10%
faster than the previous versions.
The highbd 4x4 FHT version is 2.27x faster than the C version for --rt.
Change-Id: I3ebcd26010f6c5c067026aa9353cde46669c5d94
|
|
Add an Arm Neon implementation of vpx_hadamard_32x32 and use it
instead of the scalar C implementation.
Also add test coverage for the new Neon implementation.
Change-Id: Iccc018eec4dbbe629fb0c6f8ad6ea8554e7a0b13
|
|
For --best quality, the resulting function
vpx_highbd_fdct32x32_rd_neon takes 0.27% of CPU time in
profiling, vs 6.27% for the sum of the scalar functions
vpx_fdct32, vpx_fdct32.constprop.0 and vpx_fdct32x32_rd_c for rd.
For --rt quality, the function takes 0.19% vs 4.57% for the scalar
version.
Overall, this improves highbd encoding time by ~6% for --best and
~9% for --rt.
Change-Id: I1ce4bbef6e364bbadc76264056aa3f86b1a8edc5
|
|
Provide a set of commonly used butterfly DCT functions for use in the
DCT 4x4, 8x8, 16x16 and 32x32 functions. These are provided in various
forms, using vqrdmulh_s16/vqrdmulh_s32 for the _fast variants, which
unfortunately are only usable in pass 1 of most DCTs, as they do not
provide the necessary precision in pass 2.
This gave a performance gain ranging from 5% to 15% in the 16x16 case.
Also, for 32x32, the loads were rearranged; together with the butterfly
optimizations this gave a 10% gain in the 32x32_rd function.
This refactoring was necessary to allow easier porting of the highbd
32x32 functions, which follow in this patchset.
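The precision remark can be seen in a scalar model of vqrdmulh_s16: a saturating, rounding, doubling multiply that returns the high half, i.e. sat16((2*a*b + 2^15) >> 16). This keeps only 15 fractional bits of the product, which is why the _fast variants are limited to pass 1 (a sketch of the instruction's semantics, not the library code):

```c
#include <stdint.h>

/* Scalar model of the Neon VQRDMULH instruction used by the _fast
 * butterfly variants: sat16((2 * a * b + (1 << 15)) >> 16). */
static int16_t vqrdmulh_s16_scalar(int16_t a, int16_t b) {
  int64_t p = ((int64_t)a * b * 2 + (1 << 15)) >> 16;
  if (p > INT16_MAX) p = INT16_MAX; /* saturation only triggers for
                                       a == b == INT16_MIN */
  return (int16_t)p;
}
```

In Q15 terms this is a rounded fixed-point multiply: 0.5 * 0.5 gives 0.25 (16384 * 16384 -> 8192).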
Change-Id: I6282e640b95a95938faff76c3b2bace3dc298bc3
|
|
Rename (src|ref)8_ptr to (src|ref)_ptr. This aligns the names with the
rtcd header and clears some clang-tidy warnings.
Change-Id: Id1aa29da8c0fa5860b46ac902f5b2620c0d3ff54
|
|
Change-Id: Ib7ce7a774ec89ba51169ea64d24c878109ef07d1
|
|
90-95% faster than C version in best/rt profiles
Change-Id: I41d5e9acdc348b57153637ec736498a25ed84c25
|
|
50% faster than C version in best/rt profiles
Change-Id: I0f9504ed52b5d5f7722407e91108ed4056d66bc2
|
|
~2.8x faster than the sse2 version.
Bug: b/245917257
Change-Id: Ib727ba8a8c8fa4df450bafdde30ed99fd283f06d
|
|
~80% faster than C version for both best/rt profiles.
Change-Id: Ibb3c8e1862131d2a020922420d53c66b31d5c2c3
|
|
2.1x to 2.8x faster than the sse2 version.
Bug: b/245917257
Change-Id: I1aaffa4a1debbe5559784e854b8fc6fba07e5000
|
|
1.6x to 2.1x faster than the sse2 version.
Bug: b/245917257
Change-Id: I56c467a850297ae3abcca4b4843302bb8d5d0ac1
|
|
Move all butterfly functions to fdct_neon.h.
Slightly optimize the load/scale/cross functions in fdct 16x16.
These will be reused in the highbd variants.
Change-Id: I28b6e0cc240304bab6b94d9c3f33cca77b8cb073
|
|
Change-Id: I3915b6c9971aedaac9c23f21fdb88bc271216208
|