path: root/vpx_dsp/arm
Age Commit message Author
2019-07-17Fix comment typos.Wan-Teh Chang
Fix comment typos in transpose_s16_4x4q() and transpose_u16_4x4q(). Change-Id: I21bcc1fb3fb880798e5a3927c3dbe81dd518c83b
2019-03-21vp9 postproc neon: Remove the condition on mb cols.Jerome Jiang
VP8 and VP9 have different padding on buffer stride. VP8 macroblocks are 16x16, so the buffer stride must be divisible by 16, and the UV buffer stride is therefore divisible by 8. VP9 works in 8x8 units, so the buffer stride is only extended to be divisible by 8, and the UV buffer stride is then not necessarily divisible by 8. Change-Id: I6fa953feb951f2fb2e48f72a623786b85e23822f
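The stride reasoning above can be checked with a little arithmetic. The sketch below is a simplified model, not libvpx code: it assumes the luma stride is just the width rounded up to the alignment, that the UV stride is half the luma stride (4:2:0), and it ignores border padding. The helper names `align_up` and `uv_stride_for` are hypothetical.

```c
/* Round width up to the next multiple of align.
 * align must be a power of two. */
static int align_up(int width, int align) {
  return (width + align - 1) & ~(align - 1);
}

/* Simplified model (assumption): the UV stride is half the aligned
 * luma stride, as for 4:2:0 subsampling with no border padding. */
static int uv_stride_for(int width, int align) {
  return align_up(width, align) / 2;
}
```

For a width of 24, aligning to 16 gives a luma stride of 32 and a UV stride of 16, which is divisible by 8. Aligning to 8 gives a luma stride of 24 and a UV stride of 12, which is not.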
2019-01-09highbd idct: resolve missing declarationsJohann
BUG=webm:1584 Change-Id: I596f5f0e1a1c152493cd8177b32d416cc79937e0
2019-01-07arm neon: resolve missing declarationsJohann
BUG=webm:1584 Change-Id: I2dcf39f2327b72b58be72c27f952ea781a790dd3
2018-12-03quantize neon: fix hbd buildsJohann
BUG=webm:1448 Change-Id: I2140fb9b6ce92716d2d9509f3031244088a62127
2018-11-12quantize: use aarch64 vmaxvJohann
Simplify max value calculation on aarch64 by using vmaxv. Much faster for 4x4 but diminishing returns as the block size grows.

Only the vp9 quantize has a speed test hooked up. Anticipate similar results for the other quantize versions.

Before:
[ RUN ] NEON/VP9QuantizeTest.DISABLED_Speed/2
[ BENCH ] Bypass calculations 4x4 31.6 ms ( ±0.0 ms )
[ BENCH ] Full calculations 4x4 31.6 ms ( ±0.0 ms )
[ BENCH ] Bypass calculations 8x8 17.7 ms ( ±0.0 ms )
[ BENCH ] Full calculations 8x8 17.7 ms ( ±0.0 ms )
[ BENCH ] Bypass calculations 16x16 14.2 ms ( ±0.0 ms )
[ BENCH ] Full calculations 16x16 14.2 ms ( ±0.0 ms )
[ OK ] NEON/VP9QuantizeTest.DISABLED_Speed/2 (1906 ms)
[ RUN ] NEON/VP9QuantizeTest.DISABLED_Speed/3
[ BENCH ] Bypass calculations 32x32 18.6 ms ( ±0.0 ms )
[ BENCH ] Full calculations 32x32 18.6 ms ( ±0.0 ms )

After:
[ RUN ] NEON/VP9QuantizeTest.DISABLED_Speed/2
[ BENCH ] Bypass calculations 4x4 29.1 ms ( ±0.0 ms )
[ BENCH ] Full calculations 4x4 29.1 ms ( ±0.0 ms )
[ BENCH ] Bypass calculations 8x8 16.9 ms ( ±0.0 ms )
[ BENCH ] Full calculations 8x8 16.9 ms ( ±0.0 ms )
[ BENCH ] Bypass calculations 16x16 14.1 ms ( ±0.0 ms )
[ BENCH ] Full calculations 16x16 14.1 ms ( ±0.0 ms )
[ OK ] NEON/VP9QuantizeTest.DISABLED_Speed/2 (1803 ms)
[ RUN ] NEON/VP9QuantizeTest.DISABLED_Speed/3
[ BENCH ] Bypass calculations 32x32 18.6 ms ( ±0.0 ms )
[ BENCH ] Full calculations 32x32 18.6 ms ( ±0.0 ms )

Change-Id: Ic95812b3fdbd4e47b4dcb8ed46c68a9617de38d2
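As a rough illustration of the vmaxv idea, the sketch below reduces eight 16-bit lanes to their maximum. It is not the code from this commit: the helper name `max_of_8_s16` is hypothetical, and the choice of signed 16-bit lanes is an assumption. On aarch64 it uses the across-lanes intrinsic `vmaxvq_s16`; elsewhere it falls back to a scalar loop with the same result.

```c
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Returns the largest of eight int16_t values (hypothetical helper). */
static int16_t max_of_8_s16(const int16_t *v) {
#if defined(__aarch64__) && defined(__ARM_NEON)
  /* AArch64: one across-lanes reduction over the whole vector. */
  return vmaxvq_s16(vld1q_s16(v));
#else
  /* Portable fallback that computes the same value lane by lane. */
  int16_t m = v[0];
  for (int i = 1; i < 8; ++i) {
    if (v[i] > m) m = v[i];
  }
  return m;
#endif
}
```

Without the across-lanes instruction, a NEON reduction typically needs several pairwise max steps, which is consistent with the gain being largest for small blocks where the reduction is a bigger share of the work.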
2018-11-01clang-tidy: fix vpx_dsp parametersJohann
BUG=webm:1444 Change-Id: Iee19be068afc6c81396c79218a89c469d2e66207
2018-10-31clang-tidy: normalize variance functionsJohann
Always use src/ref and _ptr/_stride suffixes. Normalize to [xy]_offset and second_pred. Drop some stray source/recon_strides. BUG=webm:1444 Change-Id: I32362a50988eb84464ab78686348610ea40e5c80
2018-09-15cosmetics: normalize include guardsJames Zern
Use the recommended format [1] of: <PROJECT>_<PATH>_<FILE>_H_

[1] https://google.github.io/styleguide/cppguide.html#The__define_Guard
"All header files should have #define guards to prevent multiple inclusion. The format of the symbol name should be <PROJECT>_<PATH>_<FILE>_H_."

Change-Id: I2e8ab0b32fb23c30fa43cff5fec12d043c0d2037
2018-07-28arm: Consistently use unified syntax for asmMartin Storsjo
The ".syntax unified" directives in a few source files aren't valid ADS assembly directives, and they break compilation for Windows, since ads2armasm_ms.pl doesn't handle them. Explicitly add them via ads2gas.pl and ads2gas_apple.pl instead, and tweak one instruction to be valid unified syntax. Change-Id: I37f1709f163d11474597161fe02eb433859cb9b8
2018-07-26Add New Neon Assemblies for Motion CompensationVenkatarama NG. Avadhani
Commit adds neon assemblies for motion compensation which show an improvement over the existing neon code.

Performance Improvement -

Platform     Resolution   1 Thread   4 Threads
Nexus 6      720p         12.16%     7.21%
@2.65 GHz    1080p        18.00%     15.28%

Change-Id: Ic0b0412eeb01c8317642b20bb99092c2f5baba37
2018-07-17vpx_sum_squares_2d_i16_neon(): Make |s2| a uint64x1_t.Raphael Kubo da Costa
This fixes the build with at least GCC 7.3, where it was previously failing with:

sum_squares_neon.c: In function 'vpx_sum_squares_2d_i16_neon':
sum_squares_neon.c: note: use -flax-vector-conversions to permit conversions between vectors with differing element types or numbers of subparts
  s2 = vpaddl_u32(s1);
  ^~
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
  s2 = vpaddl_u32(s1);
  ^
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
  s2 = vadd_u64(vget_low_u64(s1), vget_high_u64(s1));
  ^
sum_squares_neon.c: incompatible type for argument 1 of 'vget_lane_u64'
  return vget_lane_u64(s2, 0);
  ^~

The generated assembly was verified to remain identical with both GCC and LLVM.

Bug: chromium:819249
Change-Id: I2778428ee1fee0a674d0d4910347c2a717de21ac
2018-05-10vpx_subtract_block_neon: add explicit castJames Zern
quiets ptrdiff_t -> int conversion warning Change-Id: If6b545a736fc19e48e290961736b1618df97db3e
2018-05-11Merge "Update vpx_subtract_block_neon()"James Zern
2018-05-10Update vpx_subtract_block_neon()Linfeng Zhang
Change-Id: Ie2ac06c090c8f92268e9a799e96aa5192a1bdcd2
2018-05-08Update vpx_comp_avg_pred_neon()Linfeng Zhang
Separate width 4 and 8 cases to reduce jumps in loop in clang. Change-Id: I6ffc6f1555f2ad08b72a8dba35a78b9fd5f95a73
2018-05-08Update SadMxNx4 NEON functionsLinfeng Zhang
Change-Id: Ia313a6da00a05837fcd4de6ece31fa1c0016438c
2018-05-07Add vpx_sum_squares_2d_i16_neon()Linfeng Zhang
Perf shows CPU time of this function dropped from 0.81% to 0.15%. Change-Id: I8a7649ca5c15af2fc65cfb848f5befa0cc5e64f2
2018-03-13Add vp9_highbd_iht16x16_256_add_neon()Linfeng Zhang
BUG=webm:1403 Change-Id: I2293c11666786be276909d48ee78dacb40a89e25
2018-02-27Add vp9_iht16x16_256_add_neon()Linfeng Zhang
BUG=webm:1403 Change-Id: I1413cc3dfcb62143ba04fe9b0f8d8b010fdf69b6
2018-02-26Fix a bug in create_s16x4_neon()Linfeng Zhang
This bug is exposed when the 2nd argument is negative: the upper 32 bits would then be all 1s. Change-Id: I189ee8cd3753fde00a34847e7a37cde2caa4ba72
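One way this class of bug can arise is sketched below in portable C. It is not the actual create_s16x4_neon() source: both function names are hypothetical, and the packing into a plain uint64_t stands in for building the 64-bit value that a NEON vector is created from. If the second 16-bit lane is widened as a signed value, a negative input sign-extends and fills the upper 32 bits with 1s, clobbering lanes 2 and 3.

```c
#include <stdint.h>

/* Buggy variant (illustrative): c1 is sign-extended to 64 bits before
 * the shift, so a negative c1 sets every bit above its own lane. */
static uint64_t pack_s16x4_buggy(int16_t c0, int16_t c1, int16_t c2,
                                 int16_t c3) {
  return (uint64_t)(uint16_t)c0 | ((uint64_t)(int64_t)c1 << 16) |
         ((uint64_t)(uint16_t)c2 << 32) | ((uint64_t)(uint16_t)c3 << 48);
}

/* Fixed variant: every lane is first truncated to its 16-bit pattern,
 * so no sign bits leak into neighbouring lanes. */
static uint64_t pack_s16x4_fixed(int16_t c0, int16_t c1, int16_t c2,
                                 int16_t c3) {
  return (uint64_t)(uint16_t)c0 | ((uint64_t)(uint16_t)c1 << 16) |
         ((uint64_t)(uint16_t)c2 << 32) | ((uint64_t)(uint16_t)c3 << 48);
}
```

With inputs (1, -1, 2, 3) the fixed variant keeps all four lanes intact, while the buggy variant loses lanes 2 and 3. For non-negative inputs the two agree, which is why such a bug can go unnoticed.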
2018-02-20Add vp9_highbd_iht8x8_16_add_neon()Linfeng Zhang
BUG=webm:1403 Change-Id: I11efb652f1aee371c71eee2d29e33793e4736832
2018-02-08Update iadst NEON functionsLinfeng Zhang
Use scalar multiply. No impact on clang, but improves gcc compiling. BUG=webm:1403 Change-Id: I4922e7e033d9e93282c754754100850e232e1529
2018-02-05Add vp9_highbd_iht4x4_16_add_neon()Linfeng Zhang
BUG=webm:1403 Change-Id: Id9833e985fb70958cf4bde38f8e6303ed83c12f9
2018-01-29Update vp9_iht8x8_64_add_neon()Linfeng Zhang
Change-Id: Ie70ed8b9273df5e1fd06bc93cb469e80630941d2
2018-01-29Clean dct_const_round_shift() related neon codeLinfeng Zhang
Change-Id: I8f4e0fc6ecb77b623519f2dd3cd2886f89218ddd
2018-01-24cosmetic: clean idct neon functionsLinfeng Zhang
Change-Id: I9c7c52567850aded0437b13ba1260e94441bc49d
2018-01-18clang-format v5.0.0 vpx_dsp/Johann
Remove comments above #define statements because they get indented unnecessarily. https://bugs.llvm.org/show_bug.cgi?id=35930 Add blank lines to prevent comments from being treated as blocks. Change-Id: I04dce21b2a10e13b8dc07411a0019c098f6dd705
2017-10-26vpx: hadamard: use ptrdiff_t instead of int for strideScott LaVarnway
Eliminates the following instruction for the x86 (64 bit) intrinsic code: movslq %esi,%rax Change-Id: I8f5ebd40726f998708a668b0f52ea7a0576befae
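The effect of the stride type can be illustrated with a small sketch. The function below is hypothetical and is not the hadamard code itself. It only shows a stride passed as ptrdiff_t: because the stride is already pointer-sized, a 64-bit compiler does not need to sign-extend a 32-bit int (the movslq above) before using it in address arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/* Sums the first column of a 2-D int16_t block with the given row
 * stride. The stride is ptrdiff_t, so r * stride is computed at
 * pointer width with no int-to-64-bit widening of the stride. */
static int32_t sum_column(const int16_t *src, ptrdiff_t stride, int rows) {
  int32_t sum = 0;
  for (int r = 0; r < rows; ++r) {
    sum += src[r * stride];
  }
  return sum;
}
```

Callers that hold the stride in an int still work, since the value converts implicitly at the call site.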
2017-09-27Merge "fix signed integer overflow of idct"James Zern
2017-09-27fix signed integer overflow of idctLinfeng Zhang
Exposed by a fuzz test in high bitdepth. The bug was introduced in commit 64653fa. BUG=webm:1466 Change-Id: Idd77d5c6a60efb9241471611ce1aba0646cb6ff5
2017-09-26Add vpx_scaled_2d_neon()Linfeng Zhang
BUG=webm:1419 Change-Id: I39c8033734562efc0ac0e28e7f06fa05130f9b96
2017-09-26Merge changes Ib9105462,Idfac00ed,If8d8a0e2Linfeng Zhang
* changes:
  cosmetics: NEON scaling code
  Refactor convolve NEON code
  Refactor convolve code
2017-09-20Remove the unnecessary cast of (int16_t)cospi_{1...31}_64Linfeng Zhang
BUG=webm:1450 Change-Id: If59743aafe99226e0ec67ab5d20678ce25f53ab8
2017-09-20Remove the unnecessary upcasts of (int)cospi_{1...31}_64Linfeng Zhang
BUG=webm:1450 Change-Id: Ib046fe28caec5b9ebdc9d0152df7c54ff4266858
2017-09-20Change cospi_{1...31}_64 from tran_high_t to tran_coef_tLinfeng Zhang
The unnecessary upcast to (int) will be cleaned later. BUG=webm:1450 Change-Id: Ia234575206d5a74540526924b06ed3939322d063
2017-09-19cosmetics: NEON scaling codeLinfeng Zhang
Change-Id: Ib91054622c1f09c4ca523bc6837d7d8ab9f03618
2017-09-19Refactor convolve NEON codeLinfeng Zhang
Rename a couple of hbd static functions. Move the position of NEON function convolve8_4(). Change-Id: Idfac00edf2e99cdd8e0a73b9f895402f60be6349
2017-09-11Add 4 to 1 scaling NEON optimizationLinfeng Zhang
BUG=webm:1419 Change-Id: If82a93935d2453e61b7647aae70983db1740bec7
2017-09-06Refactor convolve8 NEON functionsLinfeng Zhang
Change-Id: I4ac576875c91fee7cb150d298fae4a2c156d374c
2017-09-05Remove get_filter_base() and get_filter_offset() in convolveLinfeng Zhang
so that the convolve functions are independent of table alignment. Change-Id: Ieab132a30d72c6e75bbe9473544fbe2cf51541ee
2017-08-23Merge "quantize neon: round dqcoeff towards zero"Johann Koenig
2017-08-23quantize neon: round dqcoeff towards zeroJohann
Add 1 if negative to get dqcoeff to round towards zero. 10-15% faster than converting to positive before shifting. Change-Id: I01a62fd0c9bca786b6885b318bd447bb9229903d
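The rounding trick can be shown in scalar C. This is a sketch under assumptions, not the NEON code from the commit: the helper name is hypothetical, the shift amount of 1 is assumed, and it relies on >> of a negative signed value being an arithmetic shift, which is implementation-defined in C but holds on mainstream compilers.

```c
#include <stdint.h>

/* Halves x, rounding toward zero. An arithmetic >> 1 alone rounds
 * toward negative infinity (-3 >> 1 == -2), so 1 is added first when
 * x is negative. Assumes arithmetic right shift for negative values. */
static int32_t halve_toward_zero(int32_t x) {
  return (x + (x < 0)) >> 1;
}
```

This matches C integer division by 2 without taking the absolute value first and restoring the sign afterwards.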
2017-08-21quantize: ignore skip_block in armJohann
Change-Id: Icfb70687476b2edb25d255793ba325b261d40584
2017-08-08neon: vpx_quantize_b_32x32Johann
With skip block the neon is about twice as fast as C. The neon has no shortcut for coeff < zbin so it always takes the same amount of time. Even when the C code can take the shortcut, the neon is still over twice as fast. When it can't, that gap increases to over 10x. BUG=webm:1426 Change-Id: I400722146c1b5a5f6289f67d85fd642463d2bfc6
2017-07-31neon: vpx_quantize_bJohann
With skip block or coeff < zbin it is about twice as fast as C. If most coeff values are > zbin it is about 10-15x as fast as C. BUG=webm:1426 Change-Id: I5d3c007b014a372d5ef0882b39bb48983b4131c7
2017-07-12sad4d neon: 64x[32,64]Johann
Rewrite 64x64. BUG=webm:1425 Change-Id: I336bf5a3aa4b783389c10b16a50f0f559346ecbf
2017-07-12sad4d neon: 32x[16,32,64]Johann
Rewrite 32x32. Use half the accumulator registers. BUG=webm:1425 Change-Id: Ibf5e61dc4ba15056102aef8495f4a02c668c5d13
2017-07-12sad4d neon: 16x[8,16,32]Johann
Rewrite 16x16. Use half the accumulator registers. BUG=webm:1425 Change-Id: I44b48512b1e3629505d83c2645e800f53878ccc2
2017-07-12sad4d neon: 8x[4,8,16]Johann
BUG=webm:1425 Change-Id: I7de2500cca4b621f21478c4b0333c56d76dbc9a4