Age | Commit message | Author |
|
While porting this function to NEON, using the SSE4_1 implementation
as a base, I noticed that both produced files whose checksums
differ from those of the C reference implementation. After investigating
further I found that the saturating pack was the culprit. Doing
the multiplication on the 32-bit values instead produces results
that match the C implementation.
Change-Id: I40c2a36551b2db363a58ea9aa19ef327f2676de3
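The mismatch described above can be illustrated with a small scalar sketch. The function names and the operand values are hypothetical and are not taken from the patch; the only point is that clamping a 32-bit value to 16 bits before multiplying can differ from multiplying the full 32-bit value, as the C reference does.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit value to the int16_t range, as a saturating pack does. */
static int16_t saturate_to_s16(int32_t v) {
  if (v > INT16_MAX) return INT16_MAX;
  if (v < INT16_MIN) return INT16_MIN;
  return (int16_t)v;
}

/* Narrow first, then multiply: information is lost once the value
 * falls outside the int16_t range. */
static int64_t mul_after_pack(int32_t coeff, int16_t scale) {
  return (int64_t)saturate_to_s16(coeff) * scale;
}

/* Multiply the full 32-bit value, as a plain C reference would. */
static int64_t mul_32bit(int32_t coeff, int16_t scale) {
  return (int64_t)coeff * scale;
}

/* Self-check: the two paths agree for small inputs and diverge once the
 * input exceeds the 16-bit range. */
static void pack_example(void) {
  assert(mul_after_pack(1000, 3) == mul_32bit(1000, 3));
  assert(mul_after_pack(40000, 3) != mul_32bit(40000, 3));
}
```

Any input that stays within the 16-bit range hides the difference, which is one reason such a mismatch may only show up as a checksum difference on particular streams.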
|
|
|
|
This reverts commit 7cdf139e3d6237386e0f93bdb0bdc1b459c663bf.
This causes failures in the VP9/ExternalFrameBufferMD5Test and
VP9/TestVectorTest.MD5Match tests in both armv7 and aarch64 builds.
Change-Id: I7ac4ba0ddc70e7e7860df9f962e6658defe1cdd5
|
|
Currently MSE functions just call the variance helpers but don't
actually use the computed sum. This patch adds dedicated helpers to
perform the computation of sse.
Add the corresponding tests as well.
Change-Id: I96a8590e3410e84d77f7187344688e02efe03902
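As a rough scalar illustration of the difference between the two kinds of helper, the sketch below contrasts a variance-style routine that accumulates both sum and sse with a dedicated routine that accumulates only sse. The names `variance_helper` and `mse_helper` and their signatures are assumptions made for this example, not the helpers added by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Variance-style helper: accumulates the sum of differences and the sum
 * of squared differences (sse). */
static void variance_helper(const uint8_t *src, int src_stride,
                            const uint8_t *ref, int ref_stride, int w, int h,
                            uint32_t *sse, int *sum) {
  assert(w > 0 && h > 0);
  *sse = 0;
  *sum = 0;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const int diff = src[y * src_stride + x] - ref[y * ref_stride + x];
      *sum += diff;
      *sse += (uint32_t)(diff * diff);
    }
  }
}

/* Dedicated MSE-style helper: the sum is never needed, so only the sse
 * is accumulated. */
static uint32_t mse_helper(const uint8_t *src, int src_stride,
                           const uint8_t *ref, int ref_stride, int w, int h) {
  uint32_t sse = 0;
  assert(w > 0 && h > 0);
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const int diff = src[y * src_stride + x] - ref[y * ref_stride + x];
      sse += (uint32_t)(diff * diff);
    }
  }
  return sse;
}
```

Dropping the sum accumulation removes work from the inner loop, which is presumably where a vectorized version would benefit most.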
|
|
further reduces the arguments for the 32x32. This will be applied to the base
version as well.
Change-Id: I25a162b5248b14af53d9e20c6a7fa2a77028a6d1
|
|
Change-Id: I431a41279c4c4193bc70cfe819da6ea7e1d2fba1
|
|
* changes:
Implement highbd_d117_predictor using Neon
Implement highbd_d63_predictor using Neon
Implement d117_predictor using Neon
Implement d63_predictor using Neon
|
|
|
|
Add Neon implementations of the highbd d117 predictor for 4x4, 8x8,
16x16 and 32x32 block sizes. Also update tests to add new corresponding
cases.
An explanation of the general implementation strategy is given in the
8x8 implementation body, and is mostly identical to the non-highbd
version.
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 1.99
Neoverse N1 | LLVM 15 | 8x8 | 4.37
Neoverse N1 | LLVM 15 | 16x16 | 6.81
Neoverse N1 | LLVM 15 | 32x32 | 6.49
Neoverse N1 | GCC 12 | 4x4 | 2.49
Neoverse N1 | GCC 12 | 8x8 | 4.10
Neoverse N1 | GCC 12 | 16x16 | 5.58
Neoverse N1 | GCC 12 | 32x32 | 2.16
Neoverse V1 | LLVM 15 | 4x4 | 1.99
Neoverse V1 | LLVM 15 | 8x8 | 5.03
Neoverse V1 | LLVM 15 | 16x16 | 6.61
Neoverse V1 | LLVM 15 | 32x32 | 6.01
Neoverse V1 | GCC 12 | 4x4 | 2.09
Neoverse V1 | GCC 12 | 8x8 | 4.52
Neoverse V1 | GCC 12 | 16x16 | 4.23
Neoverse V1 | GCC 12 | 32x32 | 2.70
Change-Id: I892fbd2c17ac527ddc22b91acca907ffc84c5cd2
|
|
Add Neon implementations of the highbd d63 predictor for 4x4, 8x8, 16x16
and 32x32 block sizes. Also update tests to add new corresponding cases.
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 2.43
Neoverse N1 | LLVM 15 | 8x8 | 4.03
Neoverse N1 | LLVM 15 | 16x16 | 3.07
Neoverse N1 | LLVM 15 | 32x32 | 4.11
Neoverse N1 | GCC 12 | 4x4 | 2.92
Neoverse N1 | GCC 12 | 8x8 | 7.20
Neoverse N1 | GCC 12 | 16x16 | 4.43
Neoverse N1 | GCC 12 | 32x32 | 3.18
Neoverse V1 | LLVM 15 | 4x4 | 1.99
Neoverse V1 | LLVM 15 | 8x8 | 3.66
Neoverse V1 | LLVM 15 | 16x16 | 3.60
Neoverse V1 | LLVM 15 | 32x32 | 3.29
Neoverse V1 | GCC 12 | 4x4 | 2.39
Neoverse V1 | GCC 12 | 8x8 | 4.76
Neoverse V1 | GCC 12 | 16x16 | 3.29
Neoverse V1 | GCC 12 | 32x32 | 2.43
Change-Id: Ic59df16ceeb468003754b4374be2f4d9af6589e4
|
|
Add Neon implementations of the d117 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also update tests to add new corresponding cases.
An explanation of the general implementation strategy is given in the
8x8 implementation body.
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 1.73
Neoverse N1 | LLVM 15 | 8x8 | 5.24
Neoverse N1 | LLVM 15 | 16x16 | 9.77
Neoverse N1 | LLVM 15 | 32x32 | 14.13
Neoverse N1 | GCC 12 | 4x4 | 2.04
Neoverse N1 | GCC 12 | 8x8 | 4.70
Neoverse N1 | GCC 12 | 16x16 | 8.64
Neoverse N1 | GCC 12 | 32x32 | 4.57
Neoverse V1 | LLVM 15 | 4x4 | 1.75
Neoverse V1 | LLVM 15 | 8x8 | 6.79
Neoverse V1 | LLVM 15 | 16x16 | 9.16
Neoverse V1 | LLVM 15 | 32x32 | 14.47
Neoverse V1 | GCC 12 | 4x4 | 1.75
Neoverse V1 | GCC 12 | 8x8 | 6.00
Neoverse V1 | GCC 12 | 16x16 | 7.63
Neoverse V1 | GCC 12 | 32x32 | 4.32
Change-Id: I7228327b5be27ee7a68deecafa05be0bd2a40ff4
|
|
Add Neon implementations of the d63 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also update tests to add new corresponding cases.
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 2.10
Neoverse N1 | LLVM 15 | 8x8 | 4.45
Neoverse N1 | LLVM 15 | 16x16 | 4.74
Neoverse N1 | LLVM 15 | 32x32 | 2.27
Neoverse N1 | GCC 12 | 4x4 | 2.46
Neoverse N1 | GCC 12 | 8x8 | 10.37
Neoverse N1 | GCC 12 | 16x16 | 11.46
Neoverse N1 | GCC 12 | 32x32 | 6.57
Neoverse V1 | LLVM 15 | 4x4 | 2.24
Neoverse V1 | LLVM 15 | 8x8 | 3.53
Neoverse V1 | LLVM 15 | 16x16 | 4.44
Neoverse V1 | LLVM 15 | 32x32 | 2.17
Neoverse V1 | GCC 12 | 4x4 | 2.25
Neoverse V1 | GCC 12 | 8x8 | 7.67
Neoverse V1 | GCC 12 | 16x16 | 8.97
Neoverse V1 | GCC 12 | 32x32 | 4.77
Change-Id: Ib4a1a2cb5a5c4495ae329529f8847664cbd0dfe0
|
|
Now that all the implementations of the 32x32 quantize are in
intrinsics we can reference struct members directly. Saves
pushing them to the stack.
n_coeffs is not used at all for this function.
Change-Id: I2104fea3fa20c455087e21b347d6abd7ea1f3e1e
|
|
|
|
|
|
Change-Id: Ic309aab2ff1750bdbcc36e8aafe05d52930ba694
|
|
|
|
Currently only vpx_mse16x16 has a Neon implementation. This patch adds
optimized Armv8.0 and Armv8.4 dot-product paths for all block sizes:
8x8, 8x16, 16x8 and 16x16.
Add the corresponding tests as well.
Change-Id: Ib0357fdcdeb05860385fec89633386e34395e260
|
|
1) Use vtrn[12]q_[su]64 in vpx_vtrnq_[su]64* helpers on AArch64
targets. This produces half as many TRN1/2 instructions compared to
the number of MOVs that result from vcombine.
2) Use vpx_vtrnq_[su]64* helpers wherever applicable.
3) Refactor transpose_4x8_s16 to operate on 128-bit vectors.
Change-Id: I9a8b1c1fe2a98a429e0c5f39def5eb2f65759127
|
|
Use (void) to indicate an empty parameter list and match the declaration
of vpx_codec_vp[89]_[cd]x. This fixes a cfi sanitizer error.
Change-Id: I190f432eea4d1765afffd84c7458ec44d863f90c
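A minimal sketch of the pattern, using a hypothetical `example_codec_iface` function rather than the real libvpx symbols: the definition spells out `(void)` so that its type is exactly the type of the declaration and of the function pointer used for the indirect call. Before C23, a definition written with `()` has no prototype, and a checker that compares function types at indirect calls may treat it as a different type.

```c
#include <assert.h>

/* Hypothetical interface type, standing in for a codec interface. */
typedef struct {
  const char *name;
} example_iface_t;

/* Declaration with an explicit empty parameter list. */
example_iface_t *example_codec_iface(void);

/* The definition repeats (void) so that it matches the declaration. */
example_iface_t *example_codec_iface(void) {
  static example_iface_t iface = { "example" };
  return &iface;
}

/* Pointer type used for the indirect call. */
typedef example_iface_t *(*example_iface_fn_t)(void);

/* Calls through the pointer; this is the kind of call a control-flow
 * integrity check would inspect. */
static const char *iface_name(example_iface_fn_t fn) {
  assert(fn != 0);
  return fn()->name;
}
```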
|
|
* changes:
Add Neon implementation of high bitdepth 32x32 hadamard transform
Add Neon implementation of high bitdepth 16x16 hadamard transform
Add Neon implementation of high bitdepth 8x8 hadamard transform
|
|
* changes:
vp9_loop_filter_alloc: clear -Wshadow warnings
vp9_adapt_mode_probs: clear -Wshadow warning
|
|
Add Neon implementation of vpx_highbd_hadamard_32x32 as well as the
corresponding tests.
Change-Id: I65d8603896649de1996b353aa79eee54824b4708
|
|
Add Neon implementation of vpx_highbd_hadamard_16x16 as well as the
corresponding tests.
Change-Id: If3299fe556351dfe3db994ac171d83a95ea1504b
|
|
|
|
Change-Id: Ib45522e32d9137678da9062830044e9dd87537e5
|
|
|
|
Add Neon implementation of vpx_highbd_hadamard_8x8 as well as the
corresponding tests.
Change-Id: I3ef1ff199d76b6b010591ef15a81b0f36c9ded03
|
|
Bug: webm:1793
Change-Id: Ia64d175aa69dc2ecde2babf64bde04f02b32795b
|
|
Bug: webm:1793
Change-Id: Ie4ea8f0a3295e6f58dc6f7d5c61d46700c539d40
|
|
|
|
Performance:
| SPD_SET | TESTSET | AVG_PSNR | OVR_PSNR | SSIM | ENC_T |
|---------|---------|----------|----------|---------|-------|
| 0 | hdres2 | +0.036% | +0.032% | +0.014% | -3.9% |
| 0 | lowres2 | -0.002% | -0.011% | +0.020% | -3.6% |
| 0 | midres2 | +0.045% | +0.025% | -0.007% | -4.0% |
STATS_CHANGED
Change-Id: I75a927333d26f2a37f0dda57a641b455b845f5b9
|
|
no changes to assembly
Bug: webm:1793
Change-Id: I6a82290cafee7f4a7909d497ccfdefd5a78fb8ed
|
|
This matches the style guide and fixes some -Wshadow warnings related to
variables with the same name. Something similar was done in libaom in:
863b04994b Fix warnings reported by -Wshadow: Part2: av1 directory
Bug: webm:1793
Change-Id: I4df1bbc8d079a3174d75f0d35d54c200ffdbb677
|
|
|
|
|
|
Specialize implementation of high bitdepth variance functions such that
we only widen data processing element types when absolutely necessary.
Change-Id: If4cc3fea7b5ab0821e3129ebd79ff63706a512bf
|
|
In joint_motion_search, there are four iterations.
Even iterations search in the first reference frame
and odd iterations search in the second. The last two
iterations use the search result of the first two
iterations as the start point. If the search result does
not change, the last two iterations are not necessary and can
be skipped.
Instruction Count
cpu-used | Reduction(%)
0 | 1.411
Change-Id: Ie583c9f75dd0a22bbdfb432ccdd62eea6ec4fce8
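The early exit can be sketched as follows. This is one possible reading of the description, not the actual joint_motion_search code: `mv_t`, `search_fn_t`, the two callbacks and the exact "unchanged" test (both vectors equal to their starting values after the first two iterations) are all assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical motion vector and search callback types. */
typedef struct {
  int row, col;
} mv_t;
typedef mv_t (*search_fn_t)(int ref_idx, mv_t start, mv_t other);

static int same_mv(mv_t a, mv_t b) { return a.row == b.row && a.col == b.col; }

/* Runs up to four iterations, alternating between the two reference
 * frames, and returns how many iterations were actually performed. */
static int joint_search_sketch(mv_t mv[2], search_fn_t search) {
  const mv_t start[2] = { mv[0], mv[1] };
  int iterations = 0;
  assert(search != 0);
  for (int ite = 0; ite < 4; ++ite) {
    const int id = ite & 1; /* even: first reference, odd: second */
    /* If the first two iterations left both vectors unchanged, the last
     * two would start from the same point, so skip them. */
    if (ite == 2 && same_mv(mv[0], start[0]) && same_mv(mv[1], start[1])) {
      break;
    }
    mv[id] = search(id, mv[id], mv[!id]);
    ++iterations;
  }
  return iterations;
}

/* Toy callbacks: one never moves the vector, one always moves it. */
static mv_t search_no_change(int ref_idx, mv_t start, mv_t other) {
  (void)ref_idx;
  (void)other;
  return start;
}
static mv_t search_step(int ref_idx, mv_t start, mv_t other) {
  (void)ref_idx;
  (void)other;
  start.row += 1;
  return start;
}
```

With the callback that never moves the vector only two iterations run; with the one that always moves it all four run.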
|
|
Added unit test.
Keep track of the spatial layer id and frame type for the case where
spatial layers are encoded in parallel by the hardware encoder.
ComputeQP() / PostEncodeUpdate() don't need to be called sequentially
when there is no inter-layer prediction.
Bug: b/257368998
Change-Id: I50beaefcfc205d3f9a9d3dbe11fead5bfdc71489
|
|
|
|
Change-Id: Ic669c96d25d7c039d370e9acd00dc45e09054552
|
|
Performance:
| SPD_SET | TESTSET | AVG_PSNR | OVR_PSNR | SSIM | ENC_T |
|---------|---------|----------|----------|---------|-------|
| 0 | hdres2 | -0.028% | +0.030% | -0.408% | -2.0% |
| 0 | lowres2 | +0.000% | +0.000% | +0.000% | +0.0% |
| 0 | midres2 | -0.138% | +0.042% | -0.427% | -2.5% |
|---------|---------|----------|----------|---------|-------|
| 1 | hdres2 | -0.032% | +0.018% | -0.342% | -1.1% |
| 1 | lowres2 | +0.000% | +0.000% | +0.000% | +0.0% |
| 1 | midres2 | +0.050% | +0.060% | -0.257% | -1.6% |
Rate Error:
| | | AVG_RC_ERROR | MAX_RC_ERROR |
| | |---------------------|---------------------|
| SPD_SET | TESTSET | BASE | TEST | BASE | TEST |
|---------|---------|----------|----------|----------|----------|
| 0 | hdres2 | 33.044% | 33.065% | 149.903% | 149.903% |
| 0 | midres2 | 59.632% | 59.566% | 79.091% | 79.249% |
|---------|---------|----------|----------|----------|----------|
| 1 | hdres2 | 33.050% | 33.057% | 151.278% | 151.278% |
| 1 | midres2 | 59.640% | 59.614% | 78.707% | 78.842% |
STATS_CHANGED
Change-Id: I5d09601fede3912d5173717ce9dd070df3a97ec8
|
|
Performance:
| SPD_SET | TESTSET | AVG_PSNR | OVR_PSNR | SSIM | ENC_T |
|---------|---------|----------|----------|---------|-------|
| 0 | hdres2 | +0.034% | +0.030% | +0.033% | -3.7% |
| 0 | lowres2 | +0.012% | +0.017% | +0.044% | -2.1% |
| 0 | midres2 | +0.030% | +0.035% | +0.060% | -1.9% |
|---------|---------|----------|----------|---------|-------|
| 1 | hdres2 | +0.027% | +0.036% | +0.030% | -2.7% |
| 1 | lowres2 | -0.006% | -0.002% | +0.006% | -1.0% |
| 1 | midres2 | -0.006% | -0.012% | -0.010% | -1.0% |
|---------|---------|----------|----------|---------|-------|
| 2 | hdres2 | -0.006% | -0.001% | -0.020% | -2.4% |
| 2 | lowres2 | -0.010% | -0.015% | -0.001% | -0.9% |
| 2 | midres2 | +0.006% | -0.005% | +0.009% | -1.0% |
STATS_CHANGED
Change-Id: I1431ac07215bb844739a410697387b9aead82792
|
|
* changes:
Optimize vpx_highbd_comp_avg_pred_neon
Add Neon AvgPredTestHBD test suite
Specialize Neon high bitdepth avg subpel variance by filter value
Specialize Neon high bitdepth subpel variance by filter value
Refactor Neon high bitdepth avg subpel variance functions
Optimize Neon high bitdepth subpel variance functions
|
|
Optimize the implementation of vpx_highbd_comp_avg_pred_neon by making
use of the URHADD instruction to compute the average.
Change-Id: Id74a6d9c33e89bc548c3c7ecace59af69051b4a7
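For reference, a scalar model of what URHADD computes in each lane: an unsigned rounding halving add, (a + b + 1) >> 1, evaluated without overflowing the lane. In Neon intrinsics the 16-bit form corresponds to `vrhaddq_u16`. The function names below are hypothetical and the loop is a plain C stand-in, not the Neon code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* One URHADD lane: rounding average of two unsigned 16-bit values. The
 * intermediate sum is widened here so it cannot overflow. */
static uint16_t urhadd_u16(uint16_t a, uint16_t b) {
  return (uint16_t)(((uint32_t)a + (uint32_t)b + 1u) >> 1);
}

/* Scalar stand-in for a compound-average prediction over n samples. */
static void comp_avg_pred_sketch(uint16_t *comp_pred, const uint16_t *pred,
                                 const uint16_t *ref, int n) {
  assert(n >= 0);
  for (int i = 0; i < n; ++i) {
    comp_pred[i] = urhadd_u16(pred[i], ref[i]);
  }
}
```

A single rounding-average instruction replaces a widening add followed by a rounding narrowing shift.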
|
|
Add test suite for vpx_highbd_comp_avg_pred_neon.
Change-Id: I5c31e0e990661ee3b8030bb517829c088fceae4d
|
|
Use the same specialization as for standard bitdepth. The rationale for
the specialization is as follows:
The optimal implementation of the bilinear interpolation depends on the
filter values being used. For both horizontal and vertical interpolation
this can simplify to just taking the source values, or averaging the
source and reference values - which can be computed more easily than a
bilinear interpolation with arbitrary filter values.
This patch introduces tests to select the best bilinear
interpolation implementation based on the filter values being used.
This new specialization is only used for larger block sizes.
Change-Id: Id5a2b2d9fac6f878795a6ed9de2bc27d9e62d661
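The rationale above can be sketched in scalar C. The weights are an assumption made for this example: a filter offset k in 0..7 blends two samples with weights (8 - k, k) and a rounding shift by 3. Under that assumption, k = 0 reduces to a copy and k = 4 to a rounding average. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* General path: weighted blend of src[i] and src[i + pixel_step]. */
static void bilinear_1d_general(const uint16_t *src, int pixel_step, int n,
                                int offset, uint16_t *dst) {
  for (int i = 0; i < n; ++i) {
    const uint32_t acc = (uint32_t)src[i] * (uint32_t)(8 - offset) +
                         (uint32_t)src[i + pixel_step] * (uint32_t)offset;
    dst[i] = (uint16_t)((acc + 4) >> 3);
  }
}

/* Specialized path: test the offset once, outside the pixel loop, and
 * pick the cheapest equivalent computation. */
static void bilinear_1d_specialized(const uint16_t *src, int pixel_step,
                                    int n, int offset, uint16_t *dst) {
  assert(offset >= 0 && offset < 8);
  if (offset == 0) {
    /* Weights (8, 0): the output is the source sample itself. */
    for (int i = 0; i < n; ++i) dst[i] = src[i];
  } else if (offset == 4) {
    /* Weights (4, 4): the output is a rounding average. */
    for (int i = 0; i < n; ++i) {
      dst[i] = (uint16_t)(((uint32_t)src[i] + src[i + pixel_step] + 1) >> 1);
    }
  } else {
    bilinear_1d_general(src, pixel_step, n, offset, dst);
  }
}
```

Both paths give identical results for every offset; the specialization only changes how much work is done per sample.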
|
|
Use the same specialization as for standard bitdepth. The rationale for
the specialization is as follows:
The optimal implementation of the bilinear interpolation depends on the
filter values being used. For both horizontal and vertical interpolation
this can simplify to just taking the source values, or averaging the
source and reference values - which can be computed more easily than a
bilinear interpolation with arbitrary filter values.
This patch introduces tests to select the best bilinear
interpolation implementation based on the filter values being used.
This new specialization is only used for larger block sizes.
Change-Id: I73182c979255f0332a274f2e5907df7f38c9eeb3
|
|
Use the same general code style as in the standard bitdepth Neon
implementation - merging the computation of vpx_highbd_comp_avg_pred
with the second pass of the bilinear filter to avoid storing and loading
the block again.
Also move vpx_highbd_comp_avg_pred_neon to its own file (like the
standard bitdepth implementation) since we're no longer using it for
averaging sub-pixel variance.
Change-Id: I2f5916d5b397db44b3247b478ef57046797dae6c
|
|
Use the same general code style as in the standard bitdepth Neon
implementation. Additionally, do not unnecessarily widen to 32-bit data
types when doing bilinear filtering - allowing us to process twice as
many elements per instruction.
Change-Id: I1e178991d2aa71f5f77a376e145d19257481e90f
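Why 16-bit arithmetic can be enough is easiest to see with numbers. The sketch below assumes, for illustration only, samples of at most 12 bits (maximum 4095) and weights (8 - k, k) with a rounding shift by 3; the message above does not state these values. Under those assumptions the largest accumulator value is 4095 * 8 + 4 = 32764, which fits in a uint16_t. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Blend using 16-bit storage only; valid under the assumptions above. */
static uint16_t blend_u16(uint16_t a, uint16_t b, int k) {
  assert(k >= 0 && k < 8);
  assert(a <= 4095 && b <= 4095);
  uint16_t acc = (uint16_t)(a * (8 - k)); /* at most 4095 * 8 = 32760 */
  acc = (uint16_t)(acc + b * k);          /* still at most 32760 */
  return (uint16_t)((acc + 4) >> 3);      /* at most 32764 before the shift */
}

/* Same blend with a 32-bit accumulator, for comparison. */
static uint16_t blend_u32(uint16_t a, uint16_t b, int k) {
  const uint32_t acc =
      (uint32_t)a * (uint32_t)(8 - k) + (uint32_t)b * (uint32_t)k;
  return (uint16_t)((acc + 4) >> 3);
}
```

A 128-bit vector holds eight 16-bit lanes but only four 32-bit lanes, which is where the "twice as many elements per instruction" figure comes from.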
|