path: root/vp9/common/x86
Age | Commit message | Author
2014-02-20 | Merge "vp9_subpixel_8t_intrin_ssse3.c: make some tables static" | James Zern
2014-02-18 | vp9_subpixel_8t_intrin_ssse3.c: make some tables static | James Zern
+ fix formatting
Change-Id: I344d4de089d03e403f0c7b3e64aeb7086cce86ac
2014-02-18 | vp9_subpixel_8t_intrin_avx2.c: make some tables static | James Zern
+ fix formatting
Change-Id: Ia62610bff3d63855104366d7860749b6a3cf4577
2014-02-18 | Merge "SSSE3 convolution optimization" | Yunqing Wang
2014-02-14 | Merge "minor spelling cleanup in comments" | Yaowu Xu
2014-02-14 | SSSE3 convolution optimization | levytamar82
Optimizing all SSSE3 assembly for convolution:
1. vp9_filter_block1d4_h8_sse2
2. vp9_filter_block1d8_h8_sse2
3. vp9_filter_block1d16_h8_sse2
4. vp9_filter_block1d4_v8_sse2
5. vp9_filter_block1d8_v8_sse2
6. vp9_filter_block1d16_v8_sse2
My optimizations include:
- processing 2x8 elements in one 128-bit register instead of 8 elements in one 128-bit register.
- removing unnecessary loads.
This optimization gives a 2.4% user-level gain for 480p input and a 1.6% user-level gain for 720p. It is done only for 64-bit.
Change-Id: Ic07fce2f9360329b4f2d956efda1480ae958766b
2014-02-12 | AVX2 Convolve Optimization | levytamar82
Two convolve functions were optimized for AVX2:
1. vp9_filter_block1d16_h8
2. vp9_filter_block1d16_v8
vp9_filter_block1d16_v8 was optimized for AVX2 by halving the number of loop strides: two strides are processed in parallel. vp9_filter_block1d16_v8 was also optimized in the same way, and some of the loads are now done outside the loop, which prevents redundant loads. This optimization gives a 43% function-level gain and a 1.3% user-level gain. The code can now be compiled on Windows.
Change-Id: I2714124cfb0c14a77d7a0ce126a20db92ffbf92c
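The idea of processing two strides per loop iteration and avoiding redundant loads can be illustrated with a scalar sketch. This is not the AVX2 code from the commit: the function names, the `TAPS` constant, the 7-bit rounding, and the sample kernel are all assumptions made for illustration. The sketch compares a baseline vertical 8-tap filter with a version that produces two output rows per iteration and loads each shared input row once.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAPS 8 /* assumed filter length */

/* Rounds a 7-bit fixed-point sum (rounding offset already added)
   and clamps the result to 8 bits. The 7-bit precision is an assumption. */
static uint8_t round_and_clip(int sum) {
  if (sum < 0) return 0;
  sum >>= 7;
  return (uint8_t)(sum > 255 ? 255 : sum);
}

/* Baseline: one output row per loop iteration. */
static void vfilter_one_row(const uint8_t *src, int stride, uint8_t *dst,
                            int w, int h, const int16_t *k) {
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      int sum = 64;
      for (int t = 0; t < TAPS; ++t) sum += k[t] * src[(y + t) * stride + x];
      dst[y * w + x] = round_and_clip(sum);
    }
  }
}

/* Two output rows per iteration. Input rows y+1 .. y+7 feed both
   outputs, so each of them is loaded once instead of twice. */
static void vfilter_two_rows(const uint8_t *src, int stride, uint8_t *dst,
                             int w, int h, const int16_t *k) {
  assert(h % 2 == 0);
  for (int y = 0; y < h; y += 2) {
    for (int x = 0; x < w; ++x) {
      int sum0 = 64 + k[0] * src[y * stride + x];
      int sum1 = 64 + k[TAPS - 1] * src[(y + TAPS) * stride + x];
      for (int t = 1; t < TAPS; ++t) {
        const int p = src[(y + t) * stride + x]; /* shared load */
        sum0 += k[t] * p;
        sum1 += k[t - 1] * p;
      }
      dst[y * w + x] = round_and_clip(sum0);
      dst[(y + 1) * w + x] = round_and_clip(sum1);
    }
  }
}

/* Returns 1 when both versions agree on a small synthetic block. */
static int two_rows_match_baseline(void) {
  enum { W = 4, H = 4 };
  /* Hypothetical kernel whose taps sum to 128. */
  static const int16_t k[TAPS] = { -1, 3, -10, 72, 72, -10, 3, -1 };
  uint8_t src[(H + TAPS - 1) * W], a[H * W], b[H * W];
  for (int i = 0; i < (int)sizeof(src); ++i) src[i] = (uint8_t)(i * 37 + 11);
  vfilter_one_row(src, W, a, W, H, k);
  vfilter_two_rows(src, W, b, W, H, k);
  return memcmp(a, b, sizeof(a)) == 0;
}
```

Calling `two_rows_match_baseline()` checks that halving the iteration count does not change the output; a SIMD version would apply the same restructuring across a full register of columns at once.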
2014-02-12 | minor spelling cleanup in comments | Andrew Russell
Change-Id: Ia91c6c406273345b08505097ffe1af3896980f06
2014-02-10 | vp9/common/x86: Silence MSVC warnings in vp9_asm_stubs.c. | Tom Finegan
Update filter_1dfunction definition to match usage.
Change-Id: Ie3cae13dc1ec3f5838c5f29d1c76a1a98a9217fa
2014-02-04 | Optimize bilinear sub-pixel filters in ssse3 | Yunqing Wang
This patch added ssse3 optimization of bilinear sub-pixel filters. The real-time encoder was sped up by ~1%.
Change-Id: Ie82e98976f411183cb8c61ab8d2ba0276e55a338
2014-02-03 | Optimize bilinear sub-pixel filters in sse2 | Yunqing Wang
Using bilinear filters could speed up the codec in real-time mode. This patch added sse2 optimizations of bilinear filters that operate on different-sized blocks. Tests showed that the real-time encoder was sped up by 3%.
Change-Id: If99a7ee4385fcc225c3ee7445d962d5752e57c3f
2014-01-28 | Add macros for convolve functions | Yunqing Wang
Added macros to reduce the code duplication.
Change-Id: I1916aa5a386ea07d961d4ec439ab09bb8c45487d
2014-01-27 | Removing _1d suffix from transform names. | Dmitry Kovalev
It is enough to specify (e.g.) idct16, it is obviously different from idct16x16.
Change-Id: I6b408a37a945de3162429380b59a775b03b95db0
2014-01-23 | vp9/common: add extern "C" to headers | James Zern
Change-Id: Ic334da9aee968e33762c2b25d9fbad24c844b411
2014-01-16 | Revert "Revert "Revert "SSSE3 convolution optimization""" | Yunqing Wang
This reverts commit f9404f240642222775a371acde8fc0721b3812df. This patch caused some ASAN error.
Change-Id: If15b7e581310e19061d111c69f2931809662ed19
2014-01-13 | Revert "Revert "SSSE3 convolution optimization"" | Yunqing Wang
This reverts commit b645257121da20b422dbbebf02aae0fc6dff95d4.
Change-Id: I60d1bf57ae8e9eb6127f42f2d5a780124ac51b45
2014-01-10 | Revert "SSSE3 convolution optimization" | Paul Wilkins
This reverts commit 511d218c60b9b6c1ab9383db746815e907af0359. In their current form, the intrinsics break the borg build.
Change-Id: Ied37936af841250ecff449802e69a3d3761c91b9
2014-01-09 | Merge "Optimze inv 16x16 DCT with 10 non-zero coeffs - P2" | Jingning Han
2014-01-09 | Merge "Optimze inv 16x16 DCT with 10 non-zero coeffs - P1" | Jingning Han
2014-01-09 | Optimze inv 16x16 DCT with 10 non-zero coeffs - P2 | Jingning Han
This commit further optimizes SSE2 operations in the second 1-D inverse 16x16 DCT, with (<10) non-zero coefficients. The average runtime of this module goes down from 779 cycles -> 725 cycles.
Change-Id: Iac31b123640d9b1e8f906e770702936b71f0ba7f
2014-01-09 | Merge "SSSE3 convolution optimization" | Yunqing Wang
2014-01-09 | SSSE3 convolution optimization | levytamar82
Optimizing all SSSE3 assembly for convolution:
1. vp9_filter_block1d4_h8_sse2
2. vp9_filter_block1d8_h8_sse2
3. vp9_filter_block1d16_h8_sse2
4. vp9_filter_block1d4_v8_sse2
5. vp9_filter_block1d8_v8_sse2
6. vp9_filter_block1d16_v8_sse2
My optimizations include:
- processing 2x8 elements in one 128-bit register instead of 8 elements in one 128-bit register.
- removing unnecessary loads.
This optimization gives a 2.4% user-level gain for 480p input and a 1.6% user-level gain for 720p. It is done only for 64-bit.
Change-Id: Icb586dc0c938b56699864fcee6c52fd43b36b969
2014-01-08 | Optimze inv 16x16 DCT with 10 non-zero coeffs - P1 | Jingning Han
This commit is the first patch optimizing the SSE2 implementation of inverse 16x16 DCT with <10 non-zero coefficients. It focuses on the first 1-D (row) transformation. It exploits the fact that only the top-left 4x4 block contains non-zero coefficients in a 2-D inverse 16x16 DCT with <10 coefficients. The average runtime of the idct16x16_10 unit is reduced from 883 cycles -> 779 cycles (12% faster). For pedestrian_area_1080p 300 frames at 4000 kbps, the speed 2 runtime goes down from 310651 ms -> 305910 ms. The decoding speed goes up from 80.37 fps -> 80.87 fps.
Change-Id: Ic6f3ac5a637a76c07ba73ddaafe318a699fea645
2014-01-03 | Tune IDCT8_1D macro function interface | Jingning Han
This commit adds input/output ports for the IDCT8_1D macro function to provide more flexibility in variable use. This allows several buffer swap operations to be skipped.
Change-Id: I21f3450509537322293043b3281bfd3949868677
2014-01-03 | Reduce num of buffer swap calls in idct8_1d_sse2 | Jingning Han
This commit merges the initial buffer swap operations in idct8_1d_sse2 into the array transpose step, hence reducing the number of instructions therein.
Change-Id: I219f6f50813390d2ec3ee37eecf2a4a2b44ae479
2014-01-03 | Rework idct8x8_10 SSE2 implementation | Jingning Han
This commit optimizes the SSE2 implementation of idct8x8_10. It exploits the fact that only the top-left 4x4 block contains non-zero coefficients, and hence reduces the instructions needed. The runtime of idct8x8_10_sse2 goes down from 216 to 198 CPU cycles, estimated by averaging over 100000 runs. For pedestrian_area_1080p 300 frames coded at 4000 kbps, the average decoding speed goes up from 79.3 fps to 79.7 fps.
Change-Id: I6d277bbaa3ec9e1562667906975bae06904cb180
2013-12-20 | Merge "Code clean up" | Yunqing Wang
2013-12-19 | Code clean up | Yunqing Wang
Removed unused filter coefficients.
Change-Id: Ib395a51305e23ff41ab69c1808d56946d25961cd
2013-12-17 | rename loop filter functions | Jim Bankoski
This renames all the loop filter functions so that they no longer refer to mb.
Change-Id: I8a58a8c7fd253d835cb619bde13913e896ece90b
2013-12-02 | Improve idct16x16: _256_add_sse2(x1.107)&_10_add_sse2(x1.012) | Abo Talib Mahfoodh
The performance gain of the idct16x16_10_add_sse2 function is not noticeable. However, since both functions use IDCT16_1D, idct16x16_10_add_sse2 should be modified as well.
Tested with: park_joy_420_720p50.y4m
Change-Id: I02b957e36fcf997c677d15baf496533895271bff
2013-12-02 | Merge "improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2" | Yunqing Wang
2013-11-26 | improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2 | Abo Talib Mahfoodh
vp9_idct32x32_34_add_sse2: speedup 1.472. IDCT32_1D_34 and MULTIPLICATION_AND_ADD_2 are optimized based on the fact that only the upper-left 8x8 block has non-zero values.
vp9_idct32x32_1024_add_sse2: speedup 1.032.
Tested with: park_joy_420_720p50.y4m
Change-Id: I8670ce547552b48695049de298e2fc46ce28dfbc
2013-11-22 | Do vertical loopfiltering in parallel | Yunqing Wang
This patch follows the "Add filter_selectively_vert_row2 to enable parallel loopfiltering" commit and adds x86 SSE2 optimization to do 16-pixel filtering in parallel. For the other optimizations (neon and dspr2), the current 16-pixel functions call the 8-pixel functions twice; real 16-pixel functions could be added later.
Decoder speedup: tulip clip: 2% speed gain; old_town_cross: 1.2% speed gain; bus: 2% speed gain.
Change-Id: I4818a0c72f84b34f5fe678e496cf4a10238574b7
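The fallback described above, where a 16-pixel function is built by calling an 8-pixel function twice, can be sketched as follows. The names `toy_filter_8` and `toy_filter_16`, the clamping operation, and the per-half `limit0`/`limit1` parameters are hypothetical stand-ins, not the actual loop filter kernels or their signatures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 8-pixel kernel: clamps each pixel to `limit`.
   It stands in for a real 8-pixel loop filter. */
static void toy_filter_8(uint8_t *s, uint8_t limit) {
  for (int i = 0; i < 8; ++i) {
    if (s[i] > limit) s[i] = limit;
  }
}

/* 16-pixel wrapper: runs the 8-pixel kernel on each half.
   Each half gets its own parameter (an assumption of this sketch). */
static void toy_filter_16(uint8_t *s, uint8_t limit0, uint8_t limit1) {
  assert(s != 0);
  toy_filter_8(s, limit0);
  toy_filter_8(s + 8, limit1);
}
```

A wrapper of this shape keeps one interface for all platforms, so a true 16-pixel implementation can later replace the two calls without changing the callers.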
2013-11-20 | Correct ssse3 8/16-pixel wide sub-pixel filter calculation | Yunqing Wang
Although no mismatch was indicated for 8/16-wide sub-pixel filters in issue 661, they had similar problems that could potentially cause a mismatch. This patch fixed calculations in HORIZx8/16 and VERTx8/16.
Change-Id: I169961c9d40a20340995b7d22aafc89ccf30bfca
2013-11-20 | Fix stack pointer in sub-pixel filters | Yunqing Wang
In commit "3d50da5397d20abc932d81453b26cde758293a40", the stack pointer was modified while aligning the stack, and it needed to be popped at the end.
Change-Id: I062971e195f1f2ab9d0ab5fb84dcf215a0fcaa67
2013-11-19 | Fix decoder mismatch with ssse3 enabled | Yunqing Wang
This patch fixed issue 661: "Decoder produces mismatched outputs with ssse3 enabled and disabled." In sub-pixel filters, a pixel value was multiplied by a filter coefficient, and the results were added up. The order of adding up these multiplications had to be arranged carefully to prevent incorrect overflowing.
Change-Id: Id08af4200fea9e1b896fc40157b8651c2c7e80f2
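Why the order of additions matters can be shown with a scalar sketch. It assumes the partial products are accumulated with 16-bit saturating additions (as packed instructions such as `paddsw` do); the commit message does not state which instructions were involved, and the helper names and sample values below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Signed 16-bit saturating add: the scalar analogue of a packed
   saturating add. */
static int16_t sat_add16(int16_t a, int16_t b) {
  const int32_t sum = (int32_t)a + (int32_t)b;
  if (sum > INT16_MAX) return INT16_MAX;
  if (sum < INT16_MIN) return INT16_MIN;
  return (int16_t)sum;
}

/* Accumulates partial sums left to right with saturation. */
static int16_t accumulate_in_order(const int16_t *p, int n) {
  int16_t acc = 0;
  assert(n >= 0);
  for (int i = 0; i < n; ++i) acc = sat_add16(acc, p[i]);
  return acc;
}

/* Reference: accumulates in 32 bits, so no intermediate overflow. */
static int32_t accumulate_wide(const int16_t *p, int n) {
  int32_t acc = 0;
  assert(n >= 0);
  for (int i = 0; i < n; ++i) acc += p[i];
  return acc;
}
```

With the hypothetical partial sums {30000, 10000, -20000}, adding in that order saturates at the second step and ends at 12767, while the order {30000, -20000, 10000} gives the exact result 20000. Interleaving positive and negative terms keeps intermediate sums in range whenever the final sum is.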
2013-11-18 | Improve vp9_iht4x4_16_add_sse2 (x1.341) | Abo Talib Mahfoodh
This rebase is a better implementation than the previous ones. Modifications are done to reduce the total number of clock cycles.
Speedup: 1.341
Compiled with -O3
Tested with: park_joy_420_720p50.y4m
Change-Id: I940eaf283f60597ca0d9d2e13d518878d55ff02d
2013-11-15 | Do horizontal loopfiltering in parallel | Yunqing Wang
This patch follows the "Rewrite filter_selectively_horiz for parallel loopfiltering" commit and adds x86 SSE2 optimization to do 16-pixel filtering in parallel. It also corrects the declaration of aligned arrays. For the 8-pixel-in-parallel case, it improves the calculation of the masks and filters. The threshold loading is updated since the thresholds were already duplicated. The neon C functions are updated to call the neon loopfilters twice. Using the tulip clip, tests showed a ~1.5% decoder speed gain.
Change-Id: Id02638626ac27a4b0e0b09d71792a24c0499bd35
2013-11-08 | Merge "Improve vp9_idct4x4_1_add_sse2" | Yunqing Wang
2013-11-01 | vp9 ssse3 d207_predictor_32x32: add missing GLOBAL() | James Zern
removes a textrel for sh_b23456789abcdefff
Change-Id: I80cb9dfd8e49a0fe884c8ff76472275b3a00cb57
2013-10-31 | mb_lpf_horizontal_edge AVX2 optimization | Tamar Levy
This CL contains two AVX2 optimized loop filter functions, mb_lpf_horizontal_edge_w_avx2_8 and mb_lpf_horizontal_edge_w_avx2_16.
Change-Id: I604e4fe6e99752b7800c2ea98721d97f7e0b931b
2013-10-25 | Merge "Add 32x32 idct function for eob<=34 case" | Yunqing Wang
2013-10-24 | Add 32x32 idct function for eob<=34 case | Yunqing Wang
When only the upper-left 8x8 area has non-zero dct coefficients, we can skip the 1D IDCT for the 9th to 32nd rows to save operations. This function is called when eob <= 34.
Change-Id: I9684b75947bdde346cfe3720f08a953aa7a13fb5
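The row-skipping argument can be checked with a small sketch. It uses a toy linear 1-D transform (a running sum) in place of the real IDCT, since any linear transform maps an all-zero row to an all-zero row. All names here are hypothetical, and the link between `eob <= 34` and "non-zero coefficients only in the upper-left 8x8" is taken from the commit message rather than derived from the scan order.

```c
#include <assert.h>
#include <string.h>

#define N 32

/* Hypothetical stand-in for a linear 1-D transform: a running sum.
   Like a real IDCT, it turns an all-zero row into an all-zero row. */
static void toy_transform_1d(const int *in, int *out) {
  int acc = 0;
  for (int i = 0; i < N; ++i) {
    acc += in[i];
    out[i] = acc;
  }
}

/* Row pass over the first `rows` rows; the remaining rows are set to
   zero without running the transform. */
static void row_pass(const int *in, int *out, int rows) {
  assert(rows >= 0 && rows <= N);
  memset(out, 0, sizeof(int) * N * N);
  for (int r = 0; r < rows; ++r) toy_transform_1d(in + r * N, out + r * N);
}

/* Picks the number of rows to transform from the end-of-block value. */
static int rows_for_eob(int eob) { return eob <= 34 ? 8 : N; }

/* Returns 1 when the reduced and full row passes agree on an input
   whose non-zero coefficients all lie in the upper-left 8x8 area. */
static int reduced_matches_full(void) {
  static int in[N * N], full[N * N], reduced[N * N];
  memset(in, 0, sizeof(in));
  for (int r = 0; r < 8; ++r) {
    for (int c = 0; c < 8; ++c) in[r * N + c] = r * 8 + c + 1;
  }
  row_pass(in, full, N);
  row_pass(in, reduced, rows_for_eob(34));
  return memcmp(full, reduced, sizeof(full)) == 0;
}
```

In this sketch the reduced pass runs 8 row transforms instead of 32 and yields the same intermediate block, which is the saving the commit describes for the first 1-D pass.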
2013-10-23 | Renaming vp9_short_fdct8x8 to vp9_fdct8x8. | Dmitry Kovalev
For consistency with idct function names.
Change-Id: I7b6af2f92c66eff56f84ed29edc3a66af8dc421f
2013-10-22 | Improve vp9_idct4x4_1_add_sse2 | Abo Talib Mahfoodh
Simple modification to reduce the number of cycles in the function.
Original function number of cycles: 973
Modified function number of cycles: 835
Improvement factor: 1.165
Tested with: park_joy_420_720p50.y4m
Change-Id: Ic5857272ea3aafe21d5ef9a69258d78c688f69bd
2013-10-18 | Fix d207 intra prediction SSSE3 functions | Yunqing Wang
This patch fixed a bug that caused a 32-bit PIC build mismatch. The stack pointer was modified after "GET_GOT", so loading the left pointer from a hard-coded position gave the wrong result.
Change-Id: Iea0aec6f917b12a6b3393ffc986bad74510248cc
2013-10-15 | Merge "Fix a few indent format issues in buffer defs" | Jingning Han
2013-10-15 | Fix a few indent format issues in buffer defs | Jingning Han
Change-Id: Iac55891ac9e6f13718c9f822aa099b5ca491832a
2013-10-11 | Making input pointer of any inverse transform constant. | Dmitry Kovalev
Also renaming dest_stride to stride in some places.
Change-Id: I75f602b623a5a7071d4922b747c45fa0b7d7a940
2013-10-11 | Consistent names for inverse hybrid transforms (1 of 2). | Dmitry Kovalev
Renames:
vp9_short_iht4x4_add -> vp9_iht4x4_16_add
vp9_short_iht8x8_add -> vp9_iht8x8_64_add
vp9_short_iht16x16_add_c -> vp9_iht16x16_256_add
Change-Id: Ibca7a188fd062b196787ac5efc1ea545e7f166c0