path: root/vp9/common/arm/neon
Age  Commit message  Author
2013-09-04  Add neon optimize vp9_short_iht4x4_add.  hkuang
Change-Id: I42c497b68ae1ee645b59c9968ad805db0a43e37e
2013-08-27  Add neon optimize vp9_short_idct16x16_1_add.  hkuang
Change-Id: Ib9354c1d975d03e8081df20d50b6a77dfe2dc7e5
2013-08-26  Add neon optimize vp9_short_idct8x8_1_add.  hkuang
Change-Id: I0b15d5e3b0eb97abb9ab5ec08e88b61f8723aaf4
2013-08-26  Add neon optimize vp9_short_idct4x4_1_add.  hkuang
Change-Id: I6ecb5c4a1a472feb8e84e9f3352b536d5e28a4a5
2013-08-23  Merge "Optimise idct4x4: rearrange the instructions a bit to improve instruction scheduling."  hkuang
2013-08-22  Add neon optimize vp9_short_idct10_16x16_add.  hkuang
vp9_short_idct10_16x16_add handles blocks that have valid data only in the top-left 4x4 block; all other coefficients are 0, so many unnecessary calculations can be skipped to save instructions. Change-Id: I6e30a3fee1ece5af7f258532416d0bfddd1143f0
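The shortcut described in this commit message can be illustrated with a minimal C sketch. This is not the NEON code from the commit: the 8x8 size, the integer `basis` function, and the names `inverse_full` and `inverse_top_left_only` are all hypothetical, chosen only to show why zero inputs let both passes of a separable transform shrink.

```c
#include <assert.h>
#include <stdint.h>

#define N 8  /* transform size used for this sketch */
#define NZ 4 /* only the top-left NZ x NZ inputs may be nonzero */

/* Hypothetical integer basis; NOT the VP9 transform constants. */
static int32_t basis(int k, int n) { return ((k + 1) * (n + 2)) % 7 - 3; }

/* Reference: full two-pass separable transform, out = B^T * in * B. */
void inverse_full(const int32_t *in, int32_t *out) {
  int32_t tmp[N * N];
  for (int r = 0; r < N; ++r)
    for (int c = 0; c < N; ++c) {
      int32_t sum = 0;
      for (int k = 0; k < N; ++k) sum += in[r * N + k] * basis(k, c);
      tmp[r * N + c] = sum;
    }
  for (int r = 0; r < N; ++r)
    for (int c = 0; c < N; ++c) {
      int32_t sum = 0;
      for (int k = 0; k < N; ++k) sum += basis(k, r) * tmp[k * N + c];
      out[r * N + c] = sum;
    }
}

/* Shortcut: rows NZ..N-1 of the input are zero, so the row pass only
 * touches NZ rows and NZ terms per sum, and the column pass only needs
 * the NZ intermediate rows that can be nonzero. */
void inverse_top_left_only(const int32_t *in, int32_t *out) {
  int32_t tmp[NZ * N];
  /* Debug check of the precondition this shortcut relies on. */
  for (int i = 0; i < N * N; ++i)
    assert((i / N < NZ && i % N < NZ) || in[i] == 0);
  for (int r = 0; r < NZ; ++r)
    for (int c = 0; c < N; ++c) {
      int32_t sum = 0;
      for (int k = 0; k < NZ; ++k) sum += in[r * N + k] * basis(k, c);
      tmp[r * N + c] = sum;
    }
  for (int r = 0; r < N; ++r)
    for (int c = 0; c < N; ++c) {
      int32_t sum = 0;
      for (int k = 0; k < NZ; ++k) sum += basis(k, r) * tmp[k * N + c];
      out[r * N + c] = sum;
    }
}
```

Under these assumptions the shortcut produces the same output as the full version while doing half the multiplies in the column pass and a quarter of them in the row pass.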
2013-08-22  Optimise idct4x4: rearrange the instructions a bit  hkuang
to improve instruction scheduling. Change-Id: I5ea881a6e419f9e8ed4b3b619406403b4de24134
2013-08-20  Add neon optimize vp9_short_idct10_8x8_add.  hkuang
vp9_short_idct10_8x8_add handles blocks that have valid data only in the top-left 4x4 block; all other coefficients are 0, so several unnecessary calculations can be skipped to save instructions. Change-Id: I34fda95e29082b789aded97c2df193991c2d9195
2013-08-16  Merge "Reduce the instructions of idct8x8. Also add the saving and restoring of D registers."  Johann
2013-08-16  Merge "Reduce instructions of idct4x4."  Johann
2013-08-16  Merge "vp9: neon: optimise vp9_wide_mbfilter_neon"  Frank Galligan
2013-08-16  Reduce instructions of idct4x4.  hkuang
Change-Id: Ia26a2526804e7e2f656b0051618a615fca8fc79d
2013-08-16  Reduce the instructions of idct8x8. Also add the  hkuang
saving and restoring of D registers. Change-Id: Id3630c90fcb160ef939fef55411342608af5f990
2013-08-16  vp9: neon: use aligned stores in convolve functions  Mans Rullgard
The destination is block-aligned so it is safe to use aligned stores. Change-Id: I38261e4fa40bc60e6472edffece59e372908da7e
2013-08-15  Merge "vp9: neon: add vp9_convolve_avg_neon"  Johann
2013-08-15  Merge "vp9: neon: add vp9_convolve_copy_neon"  Johann
2013-08-15  vp9: neon: optimise vp9_wide_mbfilter_neon  Mans Rullgard
Break up long dependency chains to improve instruction scheduling. Change-Id: I0e0cb66943df24af920767bb4167b25c38af9630
2013-08-14  Merge "Add neon optimize vp9_short_idct16x16_add."  hkuang
2013-08-14  Add neon optimize vp9_short_idct16x16_add.  hkuang
Change-Id: I27134b9a5cace2bdad53534562c91d829b48838d
2013-08-14  vp9: neon: add vp9_convolve_avg_neon  Mans Rullgard
Change-Id: I33cff9ac4f2234558f6f87729f9b2e88a33fbf58
2013-08-14  vp9: neon: add vp9_convolve_copy_neon  Mans Rullgard
Change-Id: I15adbbda15d1842e9f15f21878a5ffbb75c3c0c9
2013-08-12  Merge "vp9: neon: optimise convolve8_vert functions"  Johann
2013-08-12  vp9: neon: optimise convolve8_vert functions  Mans Rullgard
Invert loops to operate vertically in the inner loop. This allows removing redundant loads. Also add preloading of data. Change-Id: I4fa85c0ab1735bcb1dd6ea58937efac949172bdc
2013-08-11  vp9: neon: optimise convolve8_horiz functions  Mans Rullgard
Each iteration of the horizontal loop reuses 7 of the 11 source values. Loading only the 4 new values saves some time. Also add preload for source data. Overall 4% faster on Chromebook. Change-Id: I8f69e749f2b7f79e9734620dcee51dbfcd716b44
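The reuse described in this commit message follows from the arithmetic: with an 8-tap filter producing 4 outputs per iteration, each iteration reads 8 + 4 - 1 = 11 source values, and the next iteration shares 7 of them. A minimal scalar C sketch of that sliding window is below. It is not the NEON code from the commit; the name `filter_row_sliding`, the 4-outputs-per-iteration step, and the unrounded `int32_t` output are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { TAPS = 8, STEP = 4, WINDOW = TAPS + STEP - 1 }; /* WINDOW == 11 */

/* Filters one row and writes unrounded sums (no rounding or clamping).
 * Assumes width is a multiple of STEP and that src has
 * width + TAPS - 1 readable samples. */
void filter_row_sliding(const uint8_t *src, int32_t *dst, int width,
                        const int16_t *taps) {
  uint8_t win[WINDOW];
  assert(width % STEP == 0);

  /* Prime the window with the 7 values the first iteration will reuse. */
  memcpy(win, src, WINDOW - STEP);
  src += WINDOW - STEP;

  for (int x = 0; x < width; x += STEP) {
    /* Load only the 4 new values for this iteration. */
    memcpy(win + WINDOW - STEP, src, STEP);
    src += STEP;

    for (int i = 0; i < STEP; ++i) {
      int32_t sum = 0;
      for (int k = 0; k < TAPS; ++k) sum += win[i + k] * taps[k];
      dst[x + i] = sum;
    }

    /* Keep the last 7 values; they are the first 7 of the next window. */
    memmove(win, win + STEP, WINDOW - STEP);
  }
}
```

In a SIMD version the `memmove` would typically become a register shuffle rather than a memory operation; the sketch only shows which values are loaded and which are carried over.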
2013-08-06  Neon version of vp9_short_idct4x4_add.  Christian Duvivier
Change-Id: Idec4cae0cb9b3a29835fd2750d354c1393d47aa4
2013-08-02  vp9: neon: convolve: replace some insns with simpler equivalents  Mans Rullgard
Change-Id: I5d6906772e6e6adf68d7f0fd5b8b5207a64a3a37
2013-08-02  vp9: neon: convolve: simplify branching to C fallbacks  Mans Rullgard
Change-Id: Ic7cacd02d6dc9243ad8fc85082c5618a9d1e66dc
2013-08-02  vp9: neon: optimise loads in horiz convolve functions  Mans Rullgard
Loading to single lanes in multiple registers is expensive since it requires a read and write of each register which saturates the register file access. Loading to single registers followed by a separate transpose reduces this pressure. Change-Id: I4cc35887ddbca80e5e635b50d2b1d158de9668ee
2013-08-02  vp9: neon: add vp9_mb_lpf_* functions  Mans Rullgard
Change-Id: I13e0880df234f15abc4cc7c57fe84488d5d46a75
2013-07-26  Fix some format error and code error in neon code.  hkuang
Change-Id: I748dee8938dfb19f417f24eed005f3d216f83a82
2013-07-22  Merge "Speedup loopfilter neon code."  Frank Galligan
2013-07-22  Speedup loopfilter neon code.  Frank Galligan
Try to cut down the cycle count by rearranging the instructions so there are fewer stalls. Change-Id: Ic1383335ee0f05e656477d9ee9c179ec231285d5
2013-07-19  Merge "Add neon optimize vp9_short_idct8x8_add."  hkuang
2013-07-18  Add neon optimize vp9_short_idct8x8_add.  hkuang
Change-Id: Ic32acf3e2939c6d12d9c2bf192a5f5da59705fda
2013-07-18  Fix horz loopfilter loops  Frank Galligan
If count was greater than 1 the src pointer would be off on the second loop. Change-Id: I8e09037e68dc4ae92076a8067f7b6dacbbef8263
2013-07-17  vp9_convolve8_neon placeholder  Johann
Call the individually optimized horizontal and vertical functions. This implementation abuses the temp buffer. This will be replaced with a custom optimized function. Over 2x speedup. Change-Id: I5b908d2a73d264e9810d6022bbff73207a3055dd
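The two-pass arrangement this commit message describes can be sketched in plain C as follows. This is only an illustration under stated assumptions: `pass_horiz`, `pass_vert`, and `convolve_2d` are hypothetical names, the intermediate values are kept as unrounded `int32_t` sums (no rounding or clamping between passes), and `src` is taken to point at the top-left corner of the filter support rather than at the block centre.

```c
#include <assert.h>
#include <stdint.h>

enum { TAPS = 8, MAX_W = 64, MAX_H = 64 };

/* Hypothetical horizontal pass: unrounded 8-tap sums along each row. */
static void pass_horiz(const uint8_t *src, int src_stride, int32_t *dst,
                       int dst_stride, int w, int h, const int16_t *taps) {
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x) {
      int32_t sum = 0;
      for (int k = 0; k < TAPS; ++k) sum += src[y * src_stride + x + k] * taps[k];
      dst[y * dst_stride + x] = sum;
    }
}

/* Hypothetical vertical pass: unrounded 8-tap sums down each column. */
static void pass_vert(const int32_t *src, int src_stride, int32_t *dst,
                      int dst_stride, int w, int h, const int16_t *taps) {
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x) {
      int32_t sum = 0;
      for (int k = 0; k < TAPS; ++k) sum += src[(y + k) * src_stride + x] * taps[k];
      dst[y * dst_stride + x] = sum;
    }
}

/* 2-D filter built from the two 1-D passes via a temporary buffer.
 * The horizontal pass must produce h + TAPS - 1 rows so the vertical
 * pass has enough context. src needs (w + 7) x (h + 7) readable samples. */
void convolve_2d(const uint8_t *src, int src_stride, int32_t *dst,
                 int dst_stride, int w, int h, const int16_t *taps_x,
                 const int16_t *taps_y) {
  int32_t temp[(MAX_H + TAPS - 1) * MAX_W];
  assert(w <= MAX_W && h <= MAX_H);
  pass_horiz(src, src_stride, temp, MAX_W, w, h + TAPS - 1, taps_x);
  pass_vert(temp, MAX_W, dst, dst_stride, w, h, taps_y);
}
```

The cost of this arrangement is the extra round trip through `temp`, which is presumably what a dedicated combined function would avoid.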
2013-07-16  Merge "vp9_convolve8_[horiz|vert]_avg"  Johann
2013-07-15  Neon: Update mbfilter if all vectors follow one branch.  Frank Galligan
Change the mbfilter Neon code so that it no longer executes both branches when all vectors follow only one branch. The code is about 5% faster when executing only one branch and about 1% slower when executing both branches. PS5: Remove local stack space from mbfilter. Change-Id: I6a23f9b318a9f4568a2718b4c9348db988fe2182
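The trade-off in this commit message can be sketched in scalar C: SIMD code normally evaluates both branches for every lane and blends the results with a mask, and the change adds a check that skips one branch when every lane agrees. The sketch below is an assumption-laden illustration, not the NEON code: `filter_a` and `filter_b` are hypothetical stand-ins for the two filter branches, and the mask layout (0xFF or 0x00 per lane) is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two filter branches. */
static uint8_t filter_a(uint8_t p) { return (uint8_t)(p >> 1); }
static uint8_t filter_b(uint8_t p) { return (uint8_t)(p | 1); }

/* mask[i] is 0xFF when lane i takes branch A and 0x00 when it takes
 * branch B. Returns how many branches were evaluated (1 or 2). */
int filter_lanes(uint8_t *px, const uint8_t *mask, size_t n) {
  uint8_t all = 0xFF;
  uint8_t any = 0x00;
  for (size_t i = 0; i < n; ++i) {
    assert(mask[i] == 0x00 || mask[i] == 0xFF);
    all &= mask[i];
    any |= mask[i];
  }

  if (all == 0xFF) { /* every lane takes branch A */
    for (size_t i = 0; i < n; ++i) px[i] = filter_a(px[i]);
    return 1;
  }
  if (any == 0x00) { /* every lane takes branch B */
    for (size_t i = 0; i < n; ++i) px[i] = filter_b(px[i]);
    return 1;
  }

  /* Mixed case: evaluate both branches and blend per lane. */
  for (size_t i = 0; i < n; ++i) {
    uint8_t a = filter_a(px[i]);
    uint8_t b = filter_b(px[i]);
    px[i] = (uint8_t)((a & mask[i]) | (b & ~mask[i]));
  }
  return 2;
}
```

The uniformity check itself costs a few instructions, which is consistent with the message's note that the mixed case gets slightly slower while the uniform case gets faster.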
2013-07-12  vp9_convolve8_[horiz|vert]_avg  Johann
Super basic conversion from the other implementations. Any changes to one should be trivial to copy over to keep them in sync. Change-Id: I1720b4128e0aba4b2779e3761f6494f8a09d3ea8
2013-07-11  convolve8 optimizations for neon  Johann
Independent horizontal and vertical implementations. Requires that blocks be built from 4x4 and [xy]_step_q4 == 16. 6-10% improvement; CIF improved the least. Change-Id: I137f5ceae4440adc0960bf88e4453e55a618bcda
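The preconditions in this commit message amount to a dispatch guard in front of the optimized path. A step of 16 in Q4 fixed point (4 fractional bits) is exactly 1.0, meaning no scaling. The C sketch below shows one way such a guard might look; `choose_path`, `conv_path`, and the reading of "built from 4x4" as "width and height are multiples of 4" are assumptions for illustration, not the actual libvpx dispatch code.

```c
#include <assert.h>

enum { STEP_ONE_Q4 = 16 }; /* 1.0 in Q4 fixed point (4 fractional bits) */

typedef enum { PATH_NEON, PATH_C_FALLBACK } conv_path;

/* Picks the optimized path only when the stated preconditions hold:
 * an unscaled step and block dimensions that are multiples of 4
 * (an interpretation of "built from 4x4"). */
conv_path choose_path(int step_q4, int w, int h) {
  assert(w > 0 && h > 0);
  if (step_q4 == STEP_ONE_Q4 && w % 4 == 0 && h % 4 == 0) return PATH_NEON;
  return PATH_C_FALLBACK;
}
```

For example, `choose_path(16, 8, 8)` selects the optimized path, while a scaled step such as `choose_path(32, 8, 8)` falls back to the generic implementation.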
2013-07-11  Add neon optimize vp9_dc_only_idct_add.  hkuang
Change-Id: Iae84ab945cc9662a0ddd839aa2b9ca59f2ae5423
2013-07-09  Add Neon horizontal and vertical vp9_mbloop_filter  Frank Galligan
- The vp9 mbfilter C code will branch on flat and mask. This CL will perform both branches and combine the data. A later CL will perform a check to see if all patch will take one branch.
- These functions are about 1.75 times faster than the C code on Nexus 7.
PS #3
- Changed all functions to dup limit, blimit, and thresh from vld {dx[]}, freeing up r4-r6.
- Changed code to use vbif to reduce one instruction and free up a d register.
Change-Id: I028dae0e434dc9891c3677bdb182e201ffb04777
2013-06-27  Add Neon optimized loop filter functions.  Frank Galligan
- Added vp9_loop_filter_horizontal_edge_neon and vp9_loop_filter_vertical_edge_neon.
- The functions are based off the vp8 loopfilter functions.
- Matches x86 md5 checksum.
Change-Id: Id1c4dddb03584227e5ecd29f574a6ac27738fdd0