path: root/vp9/common/arm/neon
Age | Commit message | Author
2014-02-14 | Replace vqshrun by vqmovun if shift #0 bit | James Yu
Change-Id: Ifabb8c7ec0c327fea9d6739cab10addb060ff435
Signed-off-by: James Yu <james.yu@linaro.org>
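The equivalence behind this change can be illustrated with a scalar model. The sketch below is not libvpx code: `model_vqshrun_s16` and `model_vqmovun_s16` are hypothetical helpers that mimic one 16-bit lane of each instruction, assuming the usual "shift right, then saturate to unsigned 8 bits" behaviour. With a shift of 0 the two produce the same result.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate a signed value to the unsigned 8-bit range [0, 255]. */
uint8_t saturate_u8(int32_t v) {
  if (v < 0) return 0;
  if (v > 255) return 255;
  return (uint8_t)v;
}

/* Hypothetical scalar model of one lane of a saturating shift-right
 * unsigned narrow (signed 16-bit in, unsigned 8-bit out). */
uint8_t model_vqshrun_s16(int16_t x, int shift) {
  assert(shift >= 0 && shift <= 8);
  /* A negative input stays negative after an arithmetic shift and so
   * saturates to 0; returning early avoids shifting a negative value. */
  if (x < 0) return 0;
  return saturate_u8(x >> shift);
}

/* Hypothetical scalar model of one lane of a saturating move unsigned
 * narrow: the same saturation with no shift at all. */
uint8_t model_vqmovun_s16(int16_t x) { return saturate_u8(x); }
```

In this model, `model_vqshrun_s16(x, 0)` and `model_vqmovun_s16(x)` agree for every 16-bit input, which is why the shift-free form can stand in for the shift-by-zero form.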
2014-02-14 | Merge "Remove redundant arm neon instructions." | Johann
2014-02-14 | Merge "minor spelling cleanup in comments" | Yaowu Xu
2014-02-12 | Fix neon wide loopfilter for filter8 only branch | Frank Galligan
The current code removed the check to only perform the filter8.
Change-Id: Ie54e19a77745042a5660eab986d9ef1c42e82410
2014-02-12 | minor spelling cleanup in comments | Andrew Russell
Change-Id: Ia91c6c406273345b08505097ffe1af3896980f06
2014-02-11 | Remove redundant arm neon instructions. | James Yu
Change-Id: I1fabad59747eb5f68c64275a36c3a1d94daf32a3
Signed-off-by: James Yu <james.yu@linaro.org>
2014-02-05 | arm: Consistently use braces around doubleword arguments to vld | Martin Storsjo
This isn't strictly necessary, but makes the file more consistent with the other arm assembly source files.
Change-Id: I245c9677d89e0ab3f31991e473764858af35b180
2014-02-05 | arm: Use {} around quadword arguments to vld | Martin Storsjo
This fixes building for iOS.
Change-Id: Ice082648c02a3faf93891f7ddc122875e2bdc9cb
2014-01-31 | Removing "_short" suffix from arm transform file names. | Dmitry Kovalev
Change-Id: Iefe118f61a335e88821a21a9f50fb919212c1507
2014-01-27 | Add vp9_tm_predictor_32x32 neon implementation | hkuang
which is 7.8 times faster than C.
Change-Id: I858ef4ec09202a07d445da8db702783d6d9d7321
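For context on what the neon `tm_predictor` routines accelerate, here is a minimal scalar sketch of TM ("true motion") intra prediction. It assumes the usual VP9 definition, where each predicted pixel is `left[r] + above[c] - top_left` clipped to 8 bits. The function name `tm_predictor_sketch` and its signature are illustrative and are not taken from the commits listed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit pixel range. */
uint8_t clip_pixel(int v) {
  if (v < 0) return 0;
  if (v > 255) return 255;
  return (uint8_t)v;
}

/* Scalar TM prediction for a bs x bs block (illustrative sketch).
 * `above` points at the row above the block and above[-1] must be the
 * valid top-left pixel; `left` holds the column to the left. */
void tm_predictor_sketch(uint8_t *dst, ptrdiff_t stride, int bs,
                         const uint8_t *above, const uint8_t *left) {
  assert(bs > 0 && stride >= bs);
  const int top_left = above[-1];
  for (int r = 0; r < bs; ++r) {
    for (int c = 0; c < bs; ++c) {
      dst[c] = clip_pixel(left[r] + above[c] - top_left);
    }
    dst += stride;
  }
}
```

Every output pixel depends only on one `left` value, one `above` value and the shared top-left pixel, so whole rows can be computed with vector adds and a saturating narrow. That structure is what makes the predictor a natural fit for SIMD.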
2014-01-27 | Fix the vp9_tm_predictor_8x8_neon. | hkuang
Change-Id: I832cf83871044bfee7b7e57dbd31bae05cbd53e9
2014-01-24 | Merge "Optimize vp9_tm_predictor_8x8_neon function" | Frank Galligan
2014-01-24 | Optimize vp9_tm_predictor_8x8_neon function | Frank Galligan
Change-Id: Ia12aae491202098ff66366145aa0c3da38dc97e5
2014-01-24 | Add vp9_tm_predictor_16x16 neon implementation | hkuang
which is 3.5 times faster than C.
Change-Id: I24439ba7a2971829c11620f34848facf2c916678
2014-01-22 | Add tm_predictor_8x8 neon implementation. | hkuang
Change-Id: I76c2720546b737cb63018a8ab6a3ff62a291786d
2014-01-16 | Merge "Add vp9_tm_predictor_4x4 neon implementation" | hkuang
2014-01-15 | Add vp9_tm_predictor_4x4 neon implementation | hkuang
Change-Id: I10c423bde7ea5a3bac9f14f35c73b6bc31c8f3e3
2014-01-08 | Merge "Add initial intra frame neon optimization. 1~2% gain." | hkuang
2014-01-08 | Add initial intra frame neon optimization. 1~2% gain. | hkuang
More intra optimizations will be added.
Change-Id: I33ae8d93f6002bf7b64cc2669602d9e6bfa5a6e8
2013-12-17 | rename loop filter functions | Jim Bankoski
This renames all the loop filter functions so that they no longer refer to mb.
Change-Id: I8a58a8c7fd253d835cb619bde13913e896ece90b
2013-11-26 | Fix 16 wide neon horz loopfilter. | Frank Galligan
Multiply by 3 was on 8bit vectors when it should have been on 16bit vectors.
Change-Id: I248c1429b3134dfd171dfab0ebb109fd2437e1fc
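The class of bug described in this fix can be shown with plain integer arithmetic. The sketch below is illustrative only and does not reproduce the loop filter code: the two hypothetical helpers model a single unsigned lane, once with the product kept in 8 bits and once widened to 16 bits.

```c
#include <stdint.h>

/* Product kept in an 8-bit lane: anything above 255 wraps modulo 256. */
uint8_t times3_u8_lane(uint8_t v) { return (uint8_t)(v * 3); }

/* Product widened to a 16-bit lane: 3 * 255 = 765 always fits. */
uint16_t times3_u16_lane(uint8_t v) { return (uint16_t)(v * 3); }
```

For a pixel value of 100, the 8-bit lane yields 44 (300 modulo 256) while the 16-bit lane yields 300. Any 8-bit input above 85 overflows when tripled in place, which is why the multiply needs the wider element size.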
2013-11-22 | Do vertical loopfiltering in parallel | Yunqing Wang
This patch followed "Add filter_selectively_vert_row2 to enable parallel loopfiltering" commit, and added x86 SSE2 optimization to do 16-pixel filtering in parallel. For other optimizations (neon and dspr2), current 16-pixel functions were done by calling 8-pixel functions twice, and real 16-pixel functions could be added later.
Decoder speedup:
  tulip clip: 2% speed gain;
  old_town_cross: 1.2% speed gain;
  bus: 2% speed gain.
Change-Id: I4818a0c72f84b34f5fe678e496cf4a10238574b7
2013-11-21 | Revert "Add 16 wide neon horz loopfilter." | Frank Galligan
The change caused mismatches with some test vectors on neon.
Original CL: https://gerrit.chromium.org/gerrit/#/c/67863/
Change-Id: I913891636d53783e93cb1865ca78ded1821dc4b0
2013-11-21 | Add 16 wide neon horz loopfilter. | Frank Galligan
Add support to do 16 pixel horizontal filtering in Neon. Nexus devices saw about 0.5% decode speed increase.
Change-Id: I2993f6c2d49f31fa74976879eeaa289fd3f4e15d
2013-11-15 | Do horizontal loopfiltering in parallel | Yunqing Wang
This patch followed "Rewrite filter_selectively_horiz for parallel loopfiltering" commit, and added x86 SSE2 optimization to do 16-pixel filtering in parallel. Also, corrected the declaration of aligned arrays. For 8-pixel-in-parallel case, improved the calculation of the masks and filters. Updated the threshold loading since the thresholds were already duplicated. Updated neon C functions to call neon loopfilters twice.
Using tulip clip, tests showed it gave a ~1.5% decoder speed gain.
Change-Id: Id02638626ac27a4b0e0b09d71792a24c0499bd35
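The "call the 8-pixel filter twice" approach mentioned for the neon C functions can be sketched as a thin wrapper. Everything below is hypothetical: the names, the signatures and the toy averaging filter are stand-ins chosen for illustration, not the libvpx loop filter.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for an 8-pixel filter across a horizontal edge: for 8
 * adjacent columns, replace the pixel above and below the edge with
 * their rounded average. `s` points at the first pixel below the edge. */
void toy_filter_horizontal_8(uint8_t *s, int pitch) {
  assert(pitch > 0);
  for (int i = 0; i < 8; ++i) {
    uint8_t *p0 = s + i - pitch; /* pixel just above the edge */
    uint8_t *q0 = s + i;         /* pixel just below the edge */
    const uint8_t avg = (uint8_t)((*p0 + *q0 + 1) >> 1);
    *p0 = avg;
    *q0 = avg;
  }
}

/* 16-pixel wrapper: run the 8-pixel routine on each half of the edge. */
void toy_filter_horizontal_16(uint8_t *s, int pitch) {
  toy_filter_horizontal_8(s, pitch);
  toy_filter_horizontal_8(s + 8, pitch);
}
```

A wrapper like this keeps the 16-pixel entry point available on every platform while a dedicated 16-wide implementation is still missing. The cost is that it gains none of the parallelism a true 16-pixel routine would provide.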
2013-11-12 | Use lowercase 'b' to branch | Johann
iOS doesn't recognize B:
bad instruction `B idct32_pass_loop'
Change-Id: I3cf6aede4639f1d9efa97f7962fa287ba6feaaef
2013-11-11 | Fix a bug in the assembly code. | hkuang
Change-Id: Ic416e3f8a11e82ee298e6f709b2119a9ddf1e2f8
2013-11-05 | Add back vp9_short_idct32x32_1_add_neon which is deleted in | hkuang
cleanup I63df79a13cf62aa2c9360a7a26933c100f9ebda3.
Change-Id: I034848cf05031618818f7df2e7f9c35102686948
2013-10-11 | Making input pointer of any inverse transform constant. | Dmitry Kovalev
Also renaming dest_stride to stride in some places.
Change-Id: I75f602b623a5a7071d4922b747c45fa0b7d7a940
2013-10-11 | Consistent names for inverse hybrid transforms (1 of 2). | Dmitry Kovalev
Renames:
  vp9_short_iht4x4_add -> vp9_iht4x4_16_add
  vp9_short_iht8x8_add -> vp9_iht8x8_64_add
  vp9_short_iht16x16_add_c -> vp9_iht16x16_256_add
Change-Id: Ibca7a188fd062b196787ac5efc1ea545e7f166c0
2013-10-10 | Giving consistent names to IDCT 32x32 functions. | Dmitry Kovalev
Renames:
  vp9_short_idct32x32_add -> vp9_idct32x32_1024_add
  vp9_short_idct32x32_1_add -> vp9_idct32x32_1_add
  vp9_idct_add_32x32 -> vp9_idct32x32_add
Change-Id: Id85306f5814bac6c47463a6b5901a93082510666
2013-10-07 | Giving consistent names to IDCT 16x16 functions. | Dmitry Kovalev
Renames:
  vp9_short_idct16x16_add -> vp9_idct16x16_256_add
  vp9_short_idct16x16_10_add -> vp9_idct16x16_10_add
  vp9_short_idct16x16_1_add -> vp9_idct16x16_1_add
  vp9_idct_add_16x16 -> vp9_idct16x16_add
Change-Id: Ief8a3904de78deab0f4ede944c4d0339c228cfc3
2013-10-06 | Giving consistent names to IDCT 8x8 functions. | Dmitry Kovalev
Renames:
  vp9_short_idct8x8_add -> vp9_idct8x8_64_add
  vp9_short_idct8x8_1_add -> vp9_idct8x8_1_add
  vp9_short_idct8x8_10_add -> vp9_idct8x8_10_add
  vp9_idct_add_8x8 -> vp9_idct8x8_add
Change-Id: Ifb8d3a45b4c0397aa805b30463f3d14581bf72c1
2013-10-04 | Giving consistent names to IDCT/IWHT functions. | Dmitry Kovalev
The idea is to have the following names for each transform size:
  vp9_idct4x4_add
  vp9_idct4x4_1_add
  vp9_idct4x4_10_add
  vp9_idct4x4_16_add
  vp9_idct8x8_add
  vp9_idct8x8_1_add
  vp9_idct8x8_10_add
  vp9_idct8x8_64_add
  etc for 16x16, 32x32
The actual list of renames in this patch:
  vp9_idct_add_lossless -> vp9_iwht4x4_add
  vp9_short_iwalsh4x4_add -> vp9_iwht4x4_16_add
  vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add
  vp9_idct_add -> vp9_idct4x4_add
  vp9_short_idct4x4_add -> vp9_idct4x4_16_add
  vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add
Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1
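One way to read the naming scheme is that the numeric suffix gives the number of coefficients a variant handles (1 for DC only, 10 for a small low-frequency subset, 64 for the full 8x8 block), with the unsuffixed name acting as a dispatcher. The commit message does not state this, so the sketch below rests on that assumption. The enum, the function name and the end-of-block threshold values are all hypothetical.

```c
#include <assert.h>

/* Hypothetical labels for the three 8x8 variants named in the scheme. */
typedef enum { IDCT8X8_1, IDCT8X8_10, IDCT8X8_64 } Idct8x8Variant;

/* Pick a variant from the end-of-block position `eob` (the count of
 * coefficients up to and including the last non-zero one). The cut-off
 * values are an assumption made for illustration. */
Idct8x8Variant select_idct8x8_variant(int eob) {
  assert(eob >= 1 && eob <= 64);
  if (eob == 1) return IDCT8X8_1;   /* DC only */
  if (eob <= 10) return IDCT8X8_10; /* few low-frequency coefficients */
  return IDCT8X8_64;                /* full inverse transform */
}
```

Under this reading, a name such as `vp9_idct8x8_add` would select among `vp9_idct8x8_1_add`, `vp9_idct8x8_10_add` and `vp9_idct8x8_64_add`, so sparse blocks can take a cheaper path.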
2013-09-27 | Renaming vp9_short_idct10_8x8_add to vp9_short_idct8x8_10_add. | Dmitry Kovalev
Making name consistent with vp9_short_idct8x8 and vp9_short_idct8x8_1.
Change-Id: I99e0be040ec893f9571dcf090e18f98dc58339f5
2013-09-27 | Properly save neon registers. | Christian Duvivier
Replace the current code, which corrupts the stack, with a duplicate of the vp8 code that saves and restores neon registers.
Change-Id: Ibb0220b9aa985d10533befa0a455ebce57a2891a
2013-09-27 | Merge "Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10." | Dmitry Kovalev
2013-09-26 | Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10. | Dmitry Kovalev
Making function name consistent with vp9_short_idct16x16 and vp9_short_idct16x16_1.
Change-Id: I70e54be9e6b9a1dddab0de470686591e96d05517
2013-09-25 | Fix a bunch of TODO from vp9_short_idct32x32_add_neon. | Christian Duvivier
- full ASM version, no more C gateway file.
- integrate combine-add with last step of 2nd pass.
- remove a few push/pop pairs.
- some instruction reordering to hide latency.
Change-Id: Ic9d9933c908b65d1bf7ba8fd47b524cda808c9c6
2013-09-20 | Use lowercase instruction in assembly | Johann
The iOS compiler does not recognize BLE:
bad instruction `BLE idct32_transpose_pair_loop'
Change-Id: I7426694c66bc31caf939a2d5000968da1222c15b
2013-09-16 | Speed up iht8x8 by rearranging instructions. | hkuang
Speed improves from 282% to 302% faster based on assembly-perf.
Change-Id: I08c5c1a542d43361611198f750b725e4303d19e2
2013-09-12 | Merge "Add neon optimize iht8x8 which is 282% faster than C." | hkuang
2013-09-12 | Add neon optimize iht8x8 which is 282% faster than C. | hkuang
Change-Id: I963dd4a6e8671957403ccbb9a16ea7de703e3530
2013-09-11 | First draft of vp9_short_idct32x32_add_neon. | Christian Duvivier
Lots of TODOs, which will be taken care of in upcoming changes. As is, it is about 6x faster than the C version.
Change-Id: Ie2557b72fd2d8edca376dbf400a4d173aa5e63e0
2013-09-09 | Speed up idct16x16 by rearrange instructions. | hkuang
Speed improves from 376% to 400% faster, based on assembly-perf.
Change-Id: If0b2eccc39d5793dc101ce9feb7fcadf88396ea2
2013-09-04 | Speed up idct8x8 by rearrange instructions. | hkuang
Speed improves from 264%~270% to 280%~300%, based on assembly-perf.
Change-Id: I3e2cc818ec14b432204ff43732f39b6438db685d
2013-09-04 | Add neon optimize vp9_short_iht4x4_add. | hkuang
Change-Id: I42c497b68ae1ee645b59c9968ad805db0a43e37e
2013-08-27 | Add neon optimize vp9_short_idct16x16_1_add. | hkuang
Change-Id: Ib9354c1d975d03e8081df20d50b6a77dfe2dc7e5
2013-08-26 | Add neon optimize vp9_short_idct8x8_1_add. | hkuang
Change-Id: I0b15d5e3b0eb97abb9ab5ec08e88b61f8723aaf4
2013-08-26 | Add neon optimize vp9_short_idct4x4_1_add. | hkuang
Change-Id: I6ecb5c4a1a472feb8e84e9f3352b536d5e28a4a5