Age | Commit message (Collapse) | Author |
|
Multiply by 3 was on 8bit vectors when it should have been on
16bit vectors.
Change-Id: I248c1429b3134dfd171dfab0ebb109fd2437e1fc
|
|
This patch followed "Add filter_selectively_vert_row2 to enable
parallel loopfiltering" commit, and added x86 SSE2 optimization
to do 16-pixel filtering in parallel. For other optimizations
(neon and dspr2), current 16-pixel functions were done by calling
8-pixel functions twice, and real 16-pixel functions could be added
later.
Decoder speedup:
tulip clip: 2% speed gain;
old_town_cross: 1.2% speed gain;
bus: 2% speed gain.
Change-Id: I4818a0c72f84b34f5fe678e496cf4a10238574b7
|
|
The change caused mismatches with some test vectors on neon.
Original CL: https://gerrit.chromium.org/gerrit/#/c/67863/
Change-Id: I913891636d53783e93cb1865ca78ded1821dc4b0
|
|
Add support to do 16 pixel horizontal filtering in Neon.
Nexus devices saw about 0.5% decode speed increase.
Change-Id: I2993f6c2d49f31fa74976879eeaa289fd3f4e15d
|
|
This patch followed "Rewrite filter_selectively_horiz for parallel
loopfiltering" commit, and added x86 SSE2 optimization to do
16-pixel filtering in parallel. Also, corrected the declaration
of aligned arrays. For 8-pixel-in-parallel case, improved the
calculation of the masks and filters. Updated the threshold loading
since the thresholds were already duplicated. Updated neon C functions
to call neon loopfilters twice.
Using tulip clip, tests showed it gave a ~1.5% decoder speed gain.
Change-Id: Id02638626ac27a4b0e0b09d71792a24c0499bd35
|
|
iOS doesn't recognize B:
bad instruction `B idct32_pass_loop'
Change-Id: I3cf6aede4639f1d9efa97f7962fa287ba6feaaef
|
|
Change-Id: Ic416e3f8a11e82ee298e6f709b2119a9ddf1e2f8
|
|
cleanup I63df79a13cf62aa2c9360a7a26933c100f9ebda3.
Change-Id: I034848cf05031618818f7df2e7f9c35102686948
|
|
Also renaming dest_stride to stride in some places.
Change-Id: I75f602b623a5a7071d4922b747c45fa0b7d7a940
|
|
Renames:
vp9_short_iht4x4_add -> vp9_iht4x4_16_add
vp9_short_iht8x8_add -> vp9_iht8x8_64_add
vp9_short_iht16x16_add_c -> vp9_iht16x16_256_add
Change-Id: Ibca7a188fd062b196787ac5efc1ea545e7f166c0
|
|
Renames:
vp9_short_idct32x32_add -> vp9_idct32x32_1024_add
vp9_short_idct32x32_1_add -> vp9_idct32x32_1_add
vp9_idct_add_32x32 -> vp9_idct32x32_add
Change-Id: Id85306f5814bac6c47463a6b5901a93082510666
|
|
Renames:
vp9_short_idct16x16_add -> vp9_idct16x16_256_add
vp9_short_idct16x16_10_add -> vp9_idct16x16_10_add
vp9_short_idct16x16_1_add -> vp9_idct16x16_1_add
vp9_idct_add_16x16 -> vp9_idct16x16_add
Change-Id: Ief8a3904de78deab0f4ede944c4d0339c228cfc3
|
|
Renames:
vp9_short_idct8x8_add -> vp9_idct8x8_64_add
vp9_short_idct8x8_1_add -> vp9_idct8x8_1_add
vp9_short_idct8x8_10_add -> vp9_idct8x8_10_add
vp9_idct_add_8x8 -> vp9_idct8x8_add
Change-Id: Ifb8d3a45b4c0397aa805b30463f3d14581bf72c1
|
|
The idea is to have the following names for each transform size:
vp9_idct4x4_add
vp9_idct4x4_1_add
vp9_idct4x4_10_add
vp9_idct4x4_16_add
vp9_idct8x8_add
vp9_idct8x8_1_add
vp9_idct8x8_10_add
vp9_idct8x8_64_add
etc for 16x16, 32x32
The actual list of renames in this patch:
vp9_idct_add_lossless -> vp9_iwht4x4_add
vp9_short_iwalsh4x4_add -> vp9_iwht4x4_16_add
vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add
vp9_idct_add -> vp9_idct4x4_add
vp9_short_idct4x4_add -> vp9_idct4x4_16_add
vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add
Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1
|
|
Making name consistent with vp9_short_idct8x8 and vp9_short_idct8x8_1.
Change-Id: I99e0be040ec893f9571dcf090e18f98dc58339f5
|
|
Replace current code which corrupts the stack by
duplicate of vp8 code to save and restore neon
registers.
Change-Id: Ibb0220b9aa985d10533befa0a455ebce57a2891a
|
|
|
|
Making function name consistent with vp9_short_idct16x16 and
vp9_short_idct16x16_1.
Change-Id: I70e54be9e6b9a1dddab0de470686591e96d05517
|
|
- full ASM version, no more C gateway file.
- integrate combine-add with last step of 2nd pass.
- remove a few push/pop pairs.
- some instruction reordering to hide latency.
Change-Id: Ic9d9933c908b65d1bf7ba8fd47b524cda808c9c6
|
|
The iOS compiler does not recognize BLE:
bad instruction `BLE idct32_transpose_pair_loop'
Change-Id: I7426694c66bc31caf939a2d5000968da1222c15b
|
|
Speed improves from 282% to 302% faster based on assembly-perf.
Change-Id: I08c5c1a542d43361611198f750b725e4303d19e2
|
|
|
|
Change-Id: I963dd4a6e8671957403ccbb9a16ea7de703e3530
|
|
Lots of TODO which will be taken care in upcoming changes. As is,
about 6x faster than C version.
Change-Id: Ie2557b72fd2d8edca376dbf400a4d173aa5e63e0
|
|
Speed improve from 376% to 400% faster base on assembly-perf.
Change-Id: If0b2eccc39d5793dc101ce9feb7fcadf88396ea2
|
|
Speed improve from 264% ~ 270% to 280% ~ 300% base on assembly-perf.
Change-Id: I3e2cc818ec14b432204ff43732f39b6438db685d
|
|
Change-Id: I42c497b68ae1ee645b59c9968ad805db0a43e37e
|
|
Change-Id: Ib9354c1d975d03e8081df20d50b6a77dfe2dc7e5
|
|
Change-Id: I0b15d5e3b0eb97abb9ab5ec08e88b61f8723aaf4
|
|
Change-Id: I6ecb5c4a1a472feb8e84e9f3352b536d5e28a4a5
|
|
instruction scheduling."
|
|
vp9_short_idct10_16x16_add is used to handle the block that only have valid data
at top left 4x4 block. All the other datas are 0. So we could cut many
unnecessary calculations in order to save instructions.
Change-Id: I6e30a3fee1ece5af7f258532416d0bfddd1143f0
|
|
to improve instruction scheduling.
Change-Id: I5ea881a6e419f9e8ed4b3b619406403b4de24134
|
|
vp9_short_idct10_8x8_add is used to handle the block that only have valid data
at top left 4x4 block. All the other datas are 0. So we could cut several
unnecessary calculations in order to save instructions.
Change-Id: I34fda95e29082b789aded97c2df193991c2d9195
|
|
of D registers."
|
|
|
|
|
|
Change-Id: Ia26a2526804e7e2f656b0051618a615fca8fc79d
|
|
saving and restoring of D registers.
Change-Id: Id3630c90fcb160ef939fef55411342608af5f990
|
|
The destination is block-aligned so it is safe to use aligned
stores.
Change-Id: I38261e4fa40bc60e6472edffece59e372908da7e
|
|
|
|
|
|
Break up long dependency chains to improve instruction scheduling.
Change-Id: I0e0cb66943df24af920767bb4167b25c38af9630
|
|
|
|
Change-Id: I27134b9a5cace2bdad53534562c91d829b48838d
|
|
Change-Id: I33cff9ac4f2234558f6f87729f9b2e88a33fbf58
|
|
Change-Id: I15adbbda15d1842e9f15f21878a5ffbb75c3c0c9
|
|
|
|
Invert loops to operate vertically in the inner loop. This allows
removing redundant loads.
Also add preloading of data.
Change-Id: I4fa85c0ab1735bcb1dd6ea58937efac949172bdc
|
|
Each iteration of the horizontal loop reuses 7 of the 11 source
values. Loading only the 4 new values saves some time.
Also add preload for source data.
Overall 4% faster on Chromebook.
Change-Id: I8f69e749f2b7f79e9734620dcee51dbfcd716b44
|