Age | Commit message | Author |
|
Change-Id: I42c497b68ae1ee645b59c9968ad805db0a43e37e
|
|
Change-Id: Ib9354c1d975d03e8081df20d50b6a77dfe2dc7e5
|
|
Change-Id: I0b15d5e3b0eb97abb9ab5ec08e88b61f8723aaf4
|
|
Change-Id: I6ecb5c4a1a472feb8e84e9f3352b536d5e28a4a5
|
|
instruction scheduling."
|
|
vp9_short_idct10_16x16_add handles blocks that only have valid data in the
top-left 4x4 block. All the other coefficients are 0, so we can cut many
unnecessary calculations in order to save instructions.
Change-Id: I6e30a3fee1ece5af7f258532416d0bfddd1143f0
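The saving described above can be illustrated with a minimal C sketch. It does not reproduce the butterfly IDCT used in libvpx; it assumes a generic 1-D transform written as a basis-table multiply, and the names `transform_full`, `transform_first_k`, and `TX_SIZE` are hypothetical. When the caller knows that only the first few coefficients are non-zero, the inner loop can stop early and still produce identical output.

```c
#include <assert.h>
#include <stdint.h>

#define TX_SIZE 16 /* 1-D transform length (hypothetical name) */

/* Full transform: every output sums over all TX_SIZE input coefficients.
 * basis is a flat TX_SIZE x TX_SIZE table, row-major. */
static void transform_full(const int16_t *basis, const int16_t *in,
                           int32_t *out) {
  for (int i = 0; i < TX_SIZE; ++i) {
    int32_t sum = 0;
    for (int k = 0; k < TX_SIZE; ++k) sum += basis[i * TX_SIZE + k] * in[k];
    out[i] = sum;
  }
}

/* Reduced transform: the caller guarantees in[nonzero..TX_SIZE-1] are 0,
 * so the multiplies for those coefficients are skipped entirely. */
static void transform_first_k(const int16_t *basis, const int16_t *in,
                              int nonzero, int32_t *out) {
  assert(nonzero >= 0 && nonzero <= TX_SIZE);
  for (int i = 0; i < TX_SIZE; ++i) {
    int32_t sum = 0;
    for (int k = 0; k < nonzero; ++k) sum += basis[i * TX_SIZE + k] * in[k];
    out[i] = sum;
  }
}
```

With 4 non-zero coefficients out of 16, the reduced version performs a quarter of the multiply-accumulates per output in this simplified model.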
|
|
to improve instruction scheduling.
Change-Id: I5ea881a6e419f9e8ed4b3b619406403b4de24134
|
|
vp9_short_idct10_8x8_add handles blocks that only have valid data in the
top-left 4x4 block. All the other coefficients are 0, so we can cut several
unnecessary calculations in order to save instructions.
Change-Id: I34fda95e29082b789aded97c2df193991c2d9195
|
|
of D registers."
|
|
|
|
|
|
Change-Id: Ia26a2526804e7e2f656b0051618a615fca8fc79d
|
|
saving and restoring of D registers.
Change-Id: Id3630c90fcb160ef939fef55411342608af5f990
|
|
The destination is block-aligned so it is safe to use aligned
stores.
Change-Id: I38261e4fa40bc60e6472edffece59e372908da7e
|
|
|
|
|
|
Break up long dependency chains to improve instruction scheduling.
Change-Id: I0e0cb66943df24af920767bb4167b25c38af9630
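As a rough illustration of breaking up a dependency chain, the C sketch below sums bytes with one accumulator and then with four independent accumulators. This is a generic example, not the NEON code touched by the commit, and the function names are hypothetical. In the second version the four additions in each iteration do not depend on each other, which gives the scheduler more freedom.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add has to wait for the previous add to finish. */
static uint32_t sum_single_chain(const uint8_t *src, size_t n) {
  assert(src != NULL || n == 0);
  uint32_t acc = 0;
  for (size_t i = 0; i < n; ++i) acc += src[i];
  return acc;
}

/* Four accumulators: the adds inside one iteration are independent,
 * so they can be issued back to back without stalling on each other. */
static uint32_t sum_four_chains(const uint8_t *src, size_t n) {
  assert(src != NULL || n == 0);
  uint32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    a0 += src[i];
    a1 += src[i + 1];
    a2 += src[i + 2];
    a3 += src[i + 3];
  }
  for (; i < n; ++i) a0 += src[i]; /* leftover elements */
  return (a0 + a1) + (a2 + a3);    /* combine the chains once at the end */
}
```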
|
|
|
|
Change-Id: I27134b9a5cace2bdad53534562c91d829b48838d
|
|
Change-Id: I33cff9ac4f2234558f6f87729f9b2e88a33fbf58
|
|
Change-Id: I15adbbda15d1842e9f15f21878a5ffbb75c3c0c9
|
|
|
|
Invert loops to operate vertically in the inner loop. This allows
removing redundant loads.
Also add preloading of data.
Change-Id: I4fa85c0ab1735bcb1dd6ea58937efac949172bdc
|
|
Each iteration of the horizontal loop reuses 7 of the 11 source
values. Loading only the 4 new values saves some time.
Also add preload for source data.
Overall 4% faster on Chromebook.
Change-Id: I8f69e749f2b7f79e9734620dcee51dbfcd716b44
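The 7-of-11 reuse follows from producing 4 outputs per iteration with an 8-tap filter: 4 + 8 - 1 = 11 source values are needed, and the last 7 are the first 7 of the next iteration. The C sketch below models this with a small window buffer. It is a scalar sketch under assumptions: `src` points at the first tap position, the rounding is `(sum + 64) >> 7`, and all names are hypothetical rather than taken from the NEON source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAPS 8
#define OUT_PER_ITER 4
#define WINDOW (OUT_PER_ITER + TAPS - 1) /* 11 source values */

static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Reference: reloads all 8 source values for every output.
 * The shift assumes an arithmetic right shift for negative sums. */
static void convolve_horiz_naive(const uint8_t *src, uint8_t *dst, int w,
                                 const int16_t *filter) {
  for (int x = 0; x < w; ++x) {
    int sum = 0;
    for (int k = 0; k < TAPS; ++k) sum += src[x + k] * filter[k];
    dst[x] = clip_pixel((sum + 64) >> 7);
  }
}

/* Window version: src must have w + TAPS - 1 readable values and w must
 * be a multiple of 4. Only 4 new values are loaded per iteration. */
static void convolve_horiz_reuse(const uint8_t *src, uint8_t *dst, int w,
                                 const int16_t *filter) {
  uint8_t win[WINDOW];
  assert(w % OUT_PER_ITER == 0);
  memcpy(win, src, TAPS - 1); /* prime the window with the first 7 values */
  for (int x = 0; x < w; x += OUT_PER_ITER) {
    /* Load only the 4 values that were not seen before. */
    memcpy(win + TAPS - 1, src + x + TAPS - 1, OUT_PER_ITER);
    for (int o = 0; o < OUT_PER_ITER; ++o) {
      int sum = 0;
      for (int k = 0; k < TAPS; ++k) sum += win[o + k] * filter[k];
      dst[x + o] = clip_pixel((sum + 64) >> 7);
    }
    /* Keep the last 7 values for the next iteration. */
    memmove(win, win + OUT_PER_ITER, TAPS - 1);
  }
}
```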
|
|
Change-Id: Idec4cae0cb9b3a29835fd2750d354c1393d47aa4
|
|
Change-Id: I5d6906772e6e6adf68d7f0fd5b8b5207a64a3a37
|
|
Change-Id: Ic7cacd02d6dc9243ad8fc85082c5618a9d1e66dc
|
|
Loading to single lanes in multiple registers is expensive, since
it requires a read and a write of each register, which saturates
register file access. Loading whole registers followed
by a separate transpose reduces this pressure.
Change-Id: I4cc35887ddbca80e5e635b50d2b1d158de9668ee
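The two load strategies produce the same transposed data, which the following C sketch demonstrates on an 8x8 byte block. It is only a data-layout model: each row of `regs` stands in for one 8-byte register, the function names are hypothetical, and scalar C cannot show the register-file cost that motivates the change.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DIM 8

/* Lane-by-lane: every source byte is inserted into a different register,
 * so each register is read and rewritten once per source row. */
static void load_transposed_by_lane(const uint8_t *src, int stride,
                                    uint8_t regs[DIM][DIM]) {
  assert(stride >= DIM);
  for (int r = 0; r < DIM; ++r)
    for (int c = 0; c < DIM; ++c) regs[c][r] = src[r * stride + c];
}

/* Whole-register loads first, then a separate in-place transpose. */
static void load_rows_then_transpose(const uint8_t *src, int stride,
                                     uint8_t regs[DIM][DIM]) {
  assert(stride >= DIM);
  for (int r = 0; r < DIM; ++r) memcpy(regs[r], src + r * stride, DIM);
  for (int r = 0; r < DIM; ++r) {
    for (int c = r + 1; c < DIM; ++c) {
      uint8_t tmp = regs[r][c];
      regs[r][c] = regs[c][r];
      regs[c][r] = tmp;
    }
  }
}
```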
|
|
Change-Id: I13e0880df234f15abc4cc7c57fe84488d5d46a75
|
|
Change-Id: I748dee8938dfb19f417f24eed005f3d216f83a82
|
|
|
|
Try to cut down the cycle count by rearranging the instructions
so there are fewer stalls.
Change-Id: Ic1383335ee0f05e656477d9ee9c179ec231285d5
|
|
|
|
Change-Id: Ic32acf3e2939c6d12d9c2bf192a5f5da59705fda
|
|
If count was greater than 1, the src pointer would be off on
the second loop iteration.
Change-Id: I8e09037e68dc4ae92076a8067f7b6dacbbef8263
|
|
Call the individually optimized horizontal and vertical functions. This
implementation abuses the temp buffer.
This will be replaced with a custom optimized function.
Over 2x speedup.
Change-Id: I5b908d2a73d264e9810d6022bbff73207a3055dd
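A two-pass approach of this kind can be sketched in C as follows. This is a simplified model, not the libvpx implementation: it assumes 8-tap filters, `(sum + 64) >> 7` rounding, a `src` pointer that already addresses the top-left of the (w + 7) x (h + 7) input region, and hypothetical names and size limits. The horizontal pass writes h + 7 rows into a temporary buffer so the vertical pass has the context rows it needs.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS 8
#define MAX_W 64
#define MAX_H 64

static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Apply one 8-tap filter along a line whose samples are `step` apart.
 * The shift assumes an arithmetic right shift for negative sums. */
static uint8_t filter8(const uint8_t *p, int step, const int16_t *f) {
  int sum = 0;
  for (int k = 0; k < TAPS; ++k) sum += p[k * step] * f[k];
  return clip_pixel((sum + 64) >> 7);
}

static void convolve_horiz(const uint8_t *src, int src_stride, uint8_t *dst,
                           int dst_stride, const int16_t *f, int w, int h) {
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x)
      dst[y * dst_stride + x] = filter8(src + y * src_stride + x, 1, f);
}

static void convolve_vert(const uint8_t *src, int src_stride, uint8_t *dst,
                          int dst_stride, const int16_t *f, int w, int h) {
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x)
      dst[y * dst_stride + x] =
          filter8(src + y * src_stride + x, src_stride, f);
}

/* 2-D filter built from the two 1-D passes and an intermediate buffer. */
static void convolve_2d(const uint8_t *src, int src_stride, uint8_t *dst,
                        int dst_stride, const int16_t *fx, const int16_t *fy,
                        int w, int h) {
  uint8_t temp[MAX_W * (MAX_H + TAPS - 1)];
  assert(w > 0 && w <= MAX_W && h > 0 && h <= MAX_H);
  /* Horizontal pass over h + 7 rows, then vertical pass over the result. */
  convolve_horiz(src, src_stride, temp, MAX_W, fx, w, h + TAPS - 1);
  convolve_vert(temp, MAX_W, dst, dst_stride, fy, w, h);
}
```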
|
|
|
|
Change the mbfilter NEON code so that it no longer executes both branches
when all vectors follow the same branch.
The code is about 5% faster when executing only one branch and about
1% slower when executing both branches.
-PS5: Remove local stack space from mbfilter.
Change-Id: I6a23f9b318a9f4568a2718b4c9348db988fe2182
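The idea of checking whether every lane takes the same branch can be sketched in scalar C. The two path functions below are hypothetical stand-ins, not the real filters, and `flat` is assumed to hold a per-lane mask of 0x00 or 0xff. The function returns how many paths it executed so the shortcut is observable.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Hypothetical stand-ins for the two filter paths. */
static uint8_t strong_path(uint8_t p) { return (uint8_t)(p >> 1); }
static uint8_t weak_path(uint8_t p) { return (uint8_t)(p ^ 0x0f); }

/* flat[i] is 0xff if lane i takes the strong path, 0x00 otherwise.
 * Returns the number of paths that were actually computed. */
static int filter_lanes(const uint8_t *in, const uint8_t *flat, uint8_t *out) {
  uint8_t all = 0xff, any = 0x00;
  for (int i = 0; i < LANES; ++i) {
    assert(flat[i] == 0x00 || flat[i] == 0xff);
    all &= flat[i];
    any |= flat[i];
  }
  if (all == 0xff) { /* every lane is flat: strong path only */
    for (int i = 0; i < LANES; ++i) out[i] = strong_path(in[i]);
    return 1;
  }
  if (any == 0x00) { /* no lane is flat: weak path only */
    for (int i = 0; i < LANES; ++i) out[i] = weak_path(in[i]);
    return 1;
  }
  /* Mixed lanes: compute both paths and pick per lane. */
  for (int i = 0; i < LANES; ++i) {
    uint8_t s = strong_path(in[i]);
    uint8_t w = weak_path(in[i]);
    out[i] = flat[i] ? s : w;
  }
  return 2;
}
```

The extra check costs a little in the mixed case, which is consistent with the small slowdown reported above when both branches still run.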
|
|
Super basic conversion from the other implementations. Any changes to
one should be trivial to copy over to keep them in sync.
Change-Id: I1720b4128e0aba4b2779e3761f6494f8a09d3ea8
|
|
Independent horizontal and vertical implementations.
Requires that blocks be built from 4x4 and that [xy]_step_q4 == 16.
6-10% improvement. CIF improved the least.
Change-Id: I137f5ceae4440adc0960bf88e4453e55a618bcda
|
|
Change-Id: Iae84ab945cc9662a0ddd839aa2b9ca59f2ae5423
|
|
- The vp9 mbfilter C code branches on flat and mask. This CL
performs both branches and combines the data. A later CL will
add a check to see whether the whole patch takes one branch.
- These functions are about 1.75 times faster than the C code on
Nexus 7.
PS #3
- Changed all functions to dup limit, blimit, and thresh from
vld {dx[]}, freeing up r4-r6.
- Changed code to use vbif to reduce one instruction and free
up a d register.
Change-Id: I028dae0e434dc9891c3677bdb182e201ffb04777
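Performing both branches and combining the data amounts to a branchless select driven by a per-lane mask. The C sketch below shows the bitwise form of that select, similar in spirit to what a NEON bitwise insert such as vbif does across a whole register. The filter itself is a hypothetical placeholder, and `mask` is assumed to hold 0x00 or 0xff per lane.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise select: take bits from a where mask is 1 and from b where it is 0. */
static uint8_t bit_select(uint8_t mask, uint8_t a, uint8_t b) {
  return (uint8_t)((a & mask) | (b & (uint8_t)~mask));
}

/* Compute the filtered value for every lane, then keep it only where the
 * mask is set. No data-dependent branch is taken. */
static void filter_both_paths(const uint8_t *in, const uint8_t *mask,
                              uint8_t *out, int n) {
  assert(n >= 0);
  for (int i = 0; i < n; ++i) {
    uint8_t filtered = (uint8_t)(in[i] >> 1); /* placeholder filter */
    uint8_t unfiltered = in[i];
    out[i] = bit_select(mask[i], filtered, unfiltered);
  }
}
```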
|
|
- Added vp9_loop_filter_horizontal_edge_neon and
vp9_loop_filter_vertical_edge_neon.
- The functions are based on the vp8 loopfilter
functions.
- Matches x86 md5 checksum.
Change-Id: Id1c4dddb03584227e5ecd29f574a6ac27738fdd0
|