path: root/vp9/common/arm
Age | Commit message | Author
2013-12-17 | rename loop filter functions | Jim Bankoski
This renames all the loop filter functions so that they no longer refer to mb.
Change-Id: I8a58a8c7fd253d835cb619bde13913e896ece90b
2013-11-26 | Fix 16 wide neon horz loopfilter. | Frank Galligan
The multiply by 3 was done on 8-bit vectors when it should have been done on 16-bit vectors.
Change-Id: I248c1429b3134dfd171dfab0ebb109fd2437e1fc
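To illustrate the class of bug this fix describes, here is a minimal scalar sketch in plain C. It is not the NEON code from the commit, which is not shown on this page; the function names are hypothetical and unsigned values are assumed to keep the wraparound well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Product kept in 8 bits: wraps modulo 256, like a multiply on 8-bit lanes. */
static uint8_t times3_u8(uint8_t x) {
    return (uint8_t)(x * 3u);
}

/* Widen to 16 bits first, then multiply: the full product is preserved. */
static uint16_t times3_u16(uint8_t x) {
    uint16_t wide = x;
    uint16_t r = (uint16_t)(wide * 3u);
    assert(r == 3u * x); /* 3 * 255 = 765 always fits in 16 bits */
    return r;
}
```

With an input of 100, the 8-bit version wraps 300 down to 44, while the widened version keeps 300.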
2013-11-22 | Do vertical loopfiltering in parallel | Yunqing Wang
This patch follows the "Add filter_selectively_vert_row2 to enable parallel loopfiltering" commit and adds an x86 SSE2 optimization that filters 16 pixels in parallel. For the other optimizations (neon and dspr2), the current 16-pixel functions call the 8-pixel functions twice; real 16-pixel functions could be added later.
Decoder speedup: tulip clip: 2% speed gain; old_town_cross: 1.2% speed gain; bus: 2% speed gain.
Change-Id: I4818a0c72f84b34f5fe678e496cf4a10238574b7
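The "call the 8-pixel function twice" fallback mentioned in this commit can be sketched as follows. This is only an illustration: `filter8` and `filter16` are hypothetical names, and the 8-pixel kernel is a stand-in (a rounded average with the row below), not a real loop filter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-pixel kernel: averages each pixel with the one a row below. */
static void filter8(uint8_t *s, ptrdiff_t pitch) {
    assert(s != NULL);
    for (int i = 0; i < 8; ++i)
        s[i] = (uint8_t)((s[i] + s[i + pitch] + 1) >> 1);
}

/* 16-pixel entry point built from two 8-pixel calls, left half then right. */
static void filter16(uint8_t *s, ptrdiff_t pitch) {
    filter8(s, pitch);
    filter8(s + 8, pitch);
}
```

A wrapper like this keeps the 16-pixel interface available on every platform, so a true 16-wide kernel can replace it later without touching callers.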
2013-11-21 | Revert "Add 16 wide neon horz loopfilter." | Frank Galligan
The change caused mismatches with some test vectors on neon.
Original CL: https://gerrit.chromium.org/gerrit/#/c/67863/
Change-Id: I913891636d53783e93cb1865ca78ded1821dc4b0
2013-11-21 | Add 16 wide neon horz loopfilter. | Frank Galligan
Add support for 16-pixel horizontal filtering in Neon. Nexus devices saw about a 0.5% increase in decode speed.
Change-Id: I2993f6c2d49f31fa74976879eeaa289fd3f4e15d
2013-11-15 | Do horizontal loopfiltering in parallel | Yunqing Wang
This patch follows the "Rewrite filter_selectively_horiz for parallel loopfiltering" commit and adds an x86 SSE2 optimization that filters 16 pixels in parallel. It also corrects the declaration of aligned arrays. For the 8-pixel-in-parallel case, it improves the calculation of the masks and filters. It updates the threshold loading, since the thresholds were already duplicated, and updates the neon C functions to call the neon loopfilters twice.
Using the tulip clip, tests showed a ~1.5% decoder speed gain.
Change-Id: Id02638626ac27a4b0e0b09d71792a24c0499bd35
2013-11-12 | Use lowercase 'b' to branch | Johann
iOS doesn't recognize B: bad instruction `B idct32_pass_loop'
Change-Id: I3cf6aede4639f1d9efa97f7962fa287ba6feaaef
2013-11-11 | Fix a bug in the assembly code. | hkuang
Change-Id: Ic416e3f8a11e82ee298e6f709b2119a9ddf1e2f8
2013-11-05 | Add back vp9_short_idct32x32_1_add_neon which is deleted in | hkuang
cleanup I63df79a13cf62aa2c9360a7a26933c100f9ebda3.
Change-Id: I034848cf05031618818f7df2e7f9c35102686948
2013-10-11 | Making input pointer of any inverse transform constant. | Dmitry Kovalev
Also renaming dest_stride to stride in some places.
Change-Id: I75f602b623a5a7071d4922b747c45fa0b7d7a940
2013-10-11 | Consistent names for inverse hybrid transforms (1 of 2). | Dmitry Kovalev
Renames:
vp9_short_iht4x4_add -> vp9_iht4x4_16_add
vp9_short_iht8x8_add -> vp9_iht8x8_64_add
vp9_short_iht16x16_add_c -> vp9_iht16x16_256_add
Change-Id: Ibca7a188fd062b196787ac5efc1ea545e7f166c0
2013-10-10 | Giving consistent names to IDCT 32x32 functions. | Dmitry Kovalev
Renames:
vp9_short_idct32x32_add -> vp9_idct32x32_1024_add
vp9_short_idct32x32_1_add -> vp9_idct32x32_1_add
vp9_idct_add_32x32 -> vp9_idct32x32_add
Change-Id: Id85306f5814bac6c47463a6b5901a93082510666
2013-10-07 | Giving consistent names to IDCT 16x16 functions. | Dmitry Kovalev
Renames:
vp9_short_idct16x16_add -> vp9_idct16x16_256_add
vp9_short_idct16x16_10_add -> vp9_idct16x16_10_add
vp9_short_idct16x16_1_add -> vp9_idct16x16_1_add
vp9_idct_add_16x16 -> vp9_idct16x16_add
Change-Id: Ief8a3904de78deab0f4ede944c4d0339c228cfc3
2013-10-06 | Giving consistent names to IDCT 8x8 functions. | Dmitry Kovalev
Renames:
vp9_short_idct8x8_add -> vp9_idct8x8_64_add
vp9_short_idct8x8_1_add -> vp9_idct8x8_1_add
vp9_short_idct8x8_10_add -> vp9_idct8x8_10_add
vp9_idct_add_8x8 -> vp9_idct8x8_add
Change-Id: Ifb8d3a45b4c0397aa805b30463f3d14581bf72c1
2013-10-04 | Giving consistent names to IDCT/IWHT functions. | Dmitry Kovalev
The idea is to have the following names for each transform size:
vp9_idct4x4_add
vp9_idct4x4_1_add
vp9_idct4x4_10_add
vp9_idct4x4_16_add
vp9_idct8x8_add
vp9_idct8x8_1_add
vp9_idct8x8_10_add
vp9_idct8x8_64_add
etc for 16x16, 32x32
The actual list of renames in this patch:
vp9_idct_add_lossless -> vp9_iwht4x4_add
vp9_short_iwalsh4x4_add -> vp9_iwht4x4_16_add
vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add
vp9_idct_add -> vp9_idct4x4_add
vp9_short_idct4x4_add -> vp9_idct4x4_16_add
vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add
Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1
2013-09-27 | Renaming vp9_short_idct10_8x8_add to vp9_short_idct8x8_10_add. | Dmitry Kovalev
Making name consistent with vp9_short_idct8x8 and vp9_short_idct8x8_1.
Change-Id: I99e0be040ec893f9571dcf090e18f98dc58339f5
2013-09-27 | Properly save neon registers. | Christian Duvivier
Replace the current code, which corrupts the stack, with a duplicate of the vp8 code that saves and restores neon registers.
Change-Id: Ibb0220b9aa985d10533befa0a455ebce57a2891a
2013-09-27 | Merge "Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10." | Dmitry Kovalev
2013-09-26 | Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10. | Dmitry Kovalev
Making function name consistent with vp9_short_idct16x16 and vp9_short_idct16x16_1.
Change-Id: I70e54be9e6b9a1dddab0de470686591e96d05517
2013-09-25 | Fix a bunch of TODO from vp9_short_idct32x32_add_neon. | Christian Duvivier
- full ASM version, no more C gateway file.
- integrate combine-add with last step of 2nd pass.
- remove a few push/pop pairs.
- some instruction reordering to hide latency.
Change-Id: Ic9d9933c908b65d1bf7ba8fd47b524cda808c9c6
2013-09-20 | Use lowercase instruction in assembly | Johann
The iOS compiler does not recognize BLE: bad instruction `BLE idct32_transpose_pair_loop'
Change-Id: I7426694c66bc31caf939a2d5000968da1222c15b
2013-09-16 | Speed up iht8x8 by rearranging instructions. | hkuang
Speed improves from 282% to 302% faster based on assembly-perf.
Change-Id: I08c5c1a542d43361611198f750b725e4303d19e2
2013-09-12 | Merge "Add neon optimize iht8x8 which is 282% faster than C." | hkuang
2013-09-12 | Add neon optimize iht8x8 which is 282% faster than C. | hkuang
Change-Id: I963dd4a6e8671957403ccbb9a16ea7de703e3530
2013-09-11 | First draft of vp9_short_idct32x32_add_neon. | Christian Duvivier
Lots of TODOs, which will be taken care of in upcoming changes. As is, about 6x faster than the C version.
Change-Id: Ie2557b72fd2d8edca376dbf400a4d173aa5e63e0
2013-09-09 | Speed up idct16x16 by rearrange instructions. | hkuang
Speed improves from 376% to 400% faster, based on assembly-perf.
Change-Id: If0b2eccc39d5793dc101ce9feb7fcadf88396ea2
2013-09-04 | Speed up idct8x8 by rearrange instructions. | hkuang
Speed improves from 264% ~ 270% to 280% ~ 300%, based on assembly-perf.
Change-Id: I3e2cc818ec14b432204ff43732f39b6438db685d
2013-09-04 | Add neon optimize vp9_short_iht4x4_add. | hkuang
Change-Id: I42c497b68ae1ee645b59c9968ad805db0a43e37e
2013-08-27 | Add neon optimize vp9_short_idct16x16_1_add. | hkuang
Change-Id: Ib9354c1d975d03e8081df20d50b6a77dfe2dc7e5
2013-08-26 | Add neon optimize vp9_short_idct8x8_1_add. | hkuang
Change-Id: I0b15d5e3b0eb97abb9ab5ec08e88b61f8723aaf4
2013-08-26 | Add neon optimize vp9_short_idct4x4_1_add. | hkuang
Change-Id: I6ecb5c4a1a472feb8e84e9f3352b536d5e28a4a5
2013-08-23 | Merge "Optimise idct4x4: rearrange the instructions a bit to improve instruction scheduling." | hkuang
2013-08-22 | Add neon optimize vp9_short_idct10_16x16_add. | hkuang
vp9_short_idct10_16x16_add handles blocks whose only valid data is in the top-left 4x4 block; all other data is 0. So many unnecessary calculations can be cut to save instructions.
Change-Id: I6e30a3fee1ece5af7f258532416d0bfddd1143f0
2013-08-22 | Optimise idct4x4: rearrange the instructions a bit | hkuang
to improve instruction scheduling.
Change-Id: I5ea881a6e419f9e8ed4b3b619406403b4de24134
2013-08-20 | Add neon optimize vp9_short_idct10_8x8_add. | hkuang
vp9_short_idct10_8x8_add handles blocks whose only valid data is in the top-left 4x4 block; all other data is 0. So several unnecessary calculations can be cut to save instructions.
Change-Id: I34fda95e29082b789aded97c2df193991c2d9195
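The saving described for the idct10 variants comes from linearity: zero inputs contribute nothing, so loops can be restricted to the nonzero 4x4 corner. The sketch below shows that idea for one row pass of an 8x8 block. It is an assumption-laden illustration, not the commit's NEON code: `basis` is an arbitrary integer stand-in for the real IDCT constants, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { N = 8, NZ = 4 }; /* block size, size of the nonzero corner */

/* Arbitrary integer basis, standing in for the real transform constants. */
static int basis(int k, int n) { return ((k + 1) * (n + 2)) % 7 - 3; }

/* Full row pass: out[r][k] = sum over n of in[r][n] * basis(k, n). */
static void row_pass_full(const int16_t *in, int32_t *out) {
    assert(in != NULL && out != NULL);
    for (int r = 0; r < N; ++r)
        for (int k = 0; k < N; ++k) {
            int32_t acc = 0;
            for (int n = 0; n < N; ++n) acc += in[r * N + n] * basis(k, n);
            out[r * N + k] = acc;
        }
}

/* Reduced row pass: valid only when everything outside the top-left
 * NZ x NZ corner of `in` is zero. Rows >= NZ produce zeros, and the
 * inner sum stops after NZ terms. */
static void row_pass_top_left(const int16_t *in, int32_t *out) {
    assert(in != NULL && out != NULL);
    memset(out, 0, sizeof(*out) * N * N);
    for (int r = 0; r < NZ; ++r)
        for (int k = 0; k < N; ++k) {
            int32_t acc = 0;
            for (int n = 0; n < NZ; ++n) acc += in[r * N + n] * basis(k, n);
            out[r * N + k] = acc;
        }
}
```

In this sketch the reduced pass performs a quarter of the multiply-accumulates of the full pass (4 rows of 4 terms instead of 8 rows of 8) while producing the same output for qualifying blocks.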
2013-08-16 | Merge "Reduce the instructions of idct8x8. Also add the saving and restoring of D registers." | Johann
2013-08-16 | Merge "Reduce instructions of idct4x4." | Johann
2013-08-16 | Merge "vp9: neon: optimise vp9_wide_mbfilter_neon" | Frank Galligan
2013-08-16 | Reduce instructions of idct4x4. | hkuang
Change-Id: Ia26a2526804e7e2f656b0051618a615fca8fc79d
2013-08-16 | Reduce the instructions of idct8x8. Also add the | hkuang
saving and restoring of D registers.
Change-Id: Id3630c90fcb160ef939fef55411342608af5f990
2013-08-16 | vp9: neon: use aligned stores in convolve functions | Mans Rullgard
The destination is block-aligned so it is safe to use aligned stores.
Change-Id: I38261e4fa40bc60e6472edffece59e372908da7e
2013-08-15 | Merge "vp9: neon: add vp9_convolve_avg_neon" | Johann
2013-08-15 | Merge "vp9: neon: add vp9_convolve_copy_neon" | Johann
2013-08-15 | vp9: neon: optimise vp9_wide_mbfilter_neon | Mans Rullgard
Break up long dependency chains to improve instruction scheduling.
Change-Id: I0e0cb66943df24af920767bb4167b25c38af9630
2013-08-14 | Merge "Add neon optimize vp9_short_idct16x16_add." | hkuang
2013-08-14 | Add neon optimize vp9_short_idct16x16_add. | hkuang
Change-Id: I27134b9a5cace2bdad53534562c91d829b48838d
2013-08-14 | vp9: neon: add vp9_convolve_avg_neon | Mans Rullgard
Change-Id: I33cff9ac4f2234558f6f87729f9b2e88a33fbf58
2013-08-14 | vp9: neon: add vp9_convolve_copy_neon | Mans Rullgard
Change-Id: I15adbbda15d1842e9f15f21878a5ffbb75c3c0c9
2013-08-12 | Merge "vp9: neon: optimise convolve8_vert functions" | Johann
2013-08-12 | vp9: neon: optimise convolve8_vert functions | Mans Rullgard
Invert loops to operate vertically in the inner loop. This allows removing redundant loads. Also add preloading of data.
Change-Id: I4fa85c0ab1735bcb1dd6ea58937efac949172bdc
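The loop inversion described in this commit can be sketched in scalar C as below. This is a hedged illustration, not the NEON implementation: the function names are hypothetical, an 8-tap filter with 7-bit precision (taps summing to 128) is assumed, and `src` is assumed to point at the first tap row. With the vertical loop innermost, each column keeps a sliding window of samples, so every new output row needs one new load instead of eight.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TAPS = 8 };

/* Round a 7-bit fixed-point sum and clamp to 0..255. Negative sums clamp
 * to 0, so they are handled before the shift. */
static uint8_t round_and_clip(int sum) {
    if (sum < 0) return 0;
    sum = (sum + 64) >> 7;
    return (uint8_t)(sum > 255 ? 255 : sum);
}

/* Row-major reference: every output pixel reloads all TAPS source rows. */
static void convolve_vert_rows(const uint8_t *src, ptrdiff_t src_stride,
                               uint8_t *dst, ptrdiff_t dst_stride,
                               const int16_t *filter, int w, int h) {
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int sum = 0;
            for (int k = 0; k < TAPS; ++k)
                sum += src[(y + k) * src_stride + x] * filter[k];
            dst[y * dst_stride + x] = round_and_clip(sum);
        }
}

/* Inverted loops: vertical loop innermost, one new load per output pixel. */
static void convolve_vert_cols(const uint8_t *src, ptrdiff_t src_stride,
                               uint8_t *dst, ptrdiff_t dst_stride,
                               const int16_t *filter, int w, int h) {
    assert(w > 0 && h > 0);
    for (int x = 0; x < w; ++x) {
        int win[TAPS];
        /* Preload the first TAPS - 1 rows, shifted by one slot. */
        for (int k = 0; k < TAPS - 1; ++k)
            win[k + 1] = src[k * src_stride + x];
        for (int y = 0; y < h; ++y) {
            int sum = 0;
            for (int k = 0; k < TAPS - 1; ++k) win[k] = win[k + 1];
            win[TAPS - 1] = src[(y + TAPS - 1) * src_stride + x];
            for (int k = 0; k < TAPS; ++k) sum += win[k] * filter[k];
            dst[y * dst_stride + x] = round_and_clip(sum);
        }
    }
}
```

Both functions produce identical output; the second only changes the traversal order so that samples already loaded for the previous row are reused.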