libvpx.git

Age	Commit message	Author
2014-01-28	Add macros for convolve functions	Yunqing Wang
	Added macros to reduce the code duplication. Change-Id: I1916aa5a386ea07d961d4ec439ab09bb8c45487d
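The commit message does not show the macros themselves. As a rough illustration of the general idea only, the sketch below uses one macro to stamp out a horizontal and a vertical variant of a toy 2-tap averaging filter. The macro name, the function names, and the 2-tap kernel are all hypothetical and are not taken from libvpx.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro (not from libvpx): generates a 2-tap averaging filter
 * whose second tap sits `tap_offset` bytes after the first one. */
#define DEFINE_AVG2_FILTER(name, tap_offset)                      \
  void name(const uint8_t *src, ptrdiff_t src_stride,             \
            uint8_t *dst, ptrdiff_t dst_stride, int w, int h) {   \
    for (int y = 0; y < h; ++y) {                                 \
      for (int x = 0; x < w; ++x) {                               \
        const int a = src[x];                                     \
        const int b = src[x + (tap_offset)];                      \
        dst[x] = (uint8_t)((a + b + 1) >> 1); /* rounded mean */  \
      }                                                           \
      src += src_stride;                                          \
      dst += dst_stride;                                          \
    }                                                             \
  }

/* Horizontal variant: the second tap is the next pixel in the row. */
DEFINE_AVG2_FILTER(avg2_horiz, 1)
/* Vertical variant: the second tap is the pixel one row below. */
DEFINE_AVG2_FILTER(avg2_vert, src_stride)
```

Both functions share one loop body, so a fix to the body applies to every generated variant. The caller must make sure the source buffer has one extra column (horizontal) or one extra row (vertical) beyond the filtered area.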
2014-01-27	Removing _1d suffix from transform names.	Dmitry Kovalev
	It is enough to specify (e.g.) idct16; it is obviously different from idct16x16. Change-Id: I6b408a37a945de3162429380b59a775b03b95db0
2014-01-23	vp9/common: add extern "C" to headers	James Zern
	Change-Id: Ic334da9aee968e33762c2b25d9fbad24c844b411
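For context, adding extern "C" to a C header usually means wrapping its declarations in a linkage guard so that C++ callers link against unmangled C symbols. The sketch below shows that common pattern; the guard name and the function `example_clip_pixel` are hypothetical and do not come from the vp9/common headers.

```c
#ifndef EXAMPLE_COMMON_H_
#define EXAMPLE_COMMON_H_

#ifdef __cplusplus
extern "C" {
#endif

/* Hypothetical declaration: clamps a value to the 8-bit pixel range. */
int example_clip_pixel(int val);

#ifdef __cplusplus
}  /* extern "C" */
#endif

#endif  /* EXAMPLE_COMMON_H_ */

/* Definition; in a real project this would live in a separate .c file. */
int example_clip_pixel(int val) {
  return val < 0 ? 0 : (val > 255 ? 255 : val);
}
```

When the header is compiled as C, `__cplusplus` is undefined and the guard disappears, so C translation units are unaffected.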
2014-01-16	Revert "Revert "Revert "SSSE3 convolution optimization"""	Yunqing Wang
	This reverts commit f9404f240642222775a371acde8fc0721b3812df. This patch caused ASAN errors. Change-Id: If15b7e581310e19061d111c69f2931809662ed19
2014-01-13	Revert "Revert "SSSE3 convolution optimization""	Yunqing Wang
	This reverts commit b645257121da20b422dbbebf02aae0fc6dff95d4. Change-Id: I60d1bf57ae8e9eb6127f42f2d5a780124ac51b45
2014-01-10	Revert "SSSE3 convolution optimization"	Paul Wilkins
	This reverts commit 511d218c60b9b6c1ab9383db746815e907af0359. In their current form, the intrinsics break the borg build. Change-Id: Ied37936af841250ecff449802e69a3d3761c91b9
2014-01-09	Merge "Optimze inv 16x16 DCT with 10 non-zero coeffs - P2"	Jingning Han

2014-01-09	Merge "Optimze inv 16x16 DCT with 10 non-zero coeffs - P1"	Jingning Han

2014-01-09	Optimze inv 16x16 DCT with 10 non-zero coeffs - P2	Jingning Han
	This commit further optimizes SSE2 operations in the second 1-D inverse 16x16 DCT, with (<10) non-zero coefficients. The average runtime of this module goes down from 779 cycles -> 725 cycles. Change-Id: Iac31b123640d9b1e8f906e770702936b71f0ba7f
2014-01-09	Merge "SSSE3 convolution optimization"	Yunqing Wang

2014-01-09	SSSE3 convolution optimization	levytamar82
	Optimizing all SSSE3 assembly for convolution:
	1. vp9_filter_block1d4_h8_sse2
	2. vp9_filter_block1d8_h8_sse2
	3. vp9_filter_block1d16_h8_sse2
	4. vp9_filter_block1d4_v8_sse2
	5. vp9_filter_block1d8_v8_sse2
	6. vp9_filter_block1d16_v8_sse2
	The optimizations include:
	- processing 2x8 elements in one 128-bit register instead of 8 elements in one 128-bit register;
	- removing unnecessary loads.
	This optimization gives a user-level gain of 2.4% for 480p input and 1.6% for 720p. It was done only for 64-bit. Change-Id: Icb586dc0c938b56699864fcee6c52fd43b36b969
2014-01-08	Optimze inv 16x16 DCT with 10 non-zero coeffs - P1	Jingning Han
	This commit is the first patch optimizing the SSE2 implementation of the inverse 16x16 DCT with <10 non-zero coefficients. It focuses on the first 1-D (row) transformation. It exploits the fact that only the top-left 4x4 block contains non-zero coefficients in a 2-D inverse 16x16 DCT with <10 coefficients. The average runtime of the idct16x16_10 unit is reduced from 883 cycles -> 779 cycles (12% faster). For pedestrian_area_1080p 300 frames at 4000 kbps, the speed 2 runtime goes down from 310651 ms -> 305910 ms. The decoding speed goes up from 80.37 fps -> 80.87 fps. Change-Id: Ic6f3ac5a637a76c07ba73ddaafe318a699fea645
2014-01-03	Tune IDCT8_1D macro function interface	Jingning Han
	This commit adds input/output ports to the IDCT8_1D macro function to provide more flexibility in variable use. This allows several buffer swap operations to be skipped. Change-Id: I21f3450509537322293043b3281bfd3949868677
2014-01-03	Reduce num of buffer swap calls in idct8_1d_sse2	Jingning Han
	This commit merges the initial buffer swap operations in idct8_1d_sse2 into the array transpose step, hence reducing number of instructions therein. Change-Id: I219f6f50813390d2ec3ee37eecf2a4a2b44ae479
2014-01-03	Rework idct8x8_10 SSE2 implementation	Jingning Han
	This commit optimizes the SSE2 implementation of idct8x8_10. It exploits the fact that only the top-left 4x4 block contains non-zero coefficients, and hence reduces the instructions needed. The runtime of idct8x8_10_sse2 goes down from 216 to 198 CPU cycles, estimated by averaging over 100000 runs. For pedestrian_area_1080p 300 frames coded at 4000 kbps, the average decoding speed goes up from 79.3 fps to 79.7 fps. Change-Id: I6d277bbaa3ec9e1562667906975bae06904cb180
2013-12-20	Merge "Code clean up"	Yunqing Wang

2013-12-19	Code clean up	Yunqing Wang
	Removed unused filter coefficients. Change-Id: Ib395a51305e23ff41ab69c1808d56946d25961cd
2013-12-17	rename loop filter functions	Jim Bankoski
	This renames all the loop filter functions so that they no longer refer to mb. Change-Id: I8a58a8c7fd253d835cb619bde13913e896ece90b
2013-12-02	Improve idct16x16: _256_add_sse2(x1.107)&_10_add_sse2(x1.012)	Abo Talib Mahfoodh
	The performance gain of idct16x16_10_add_sse2 function is not noticeable. However since both functions use the IDCT16_1D, idct16x16_10_add_sse2 should be modified as well. Tested with: park_joy_420_720p50.y4m Change-Id: I02b957e36fcf997c677d15baf496533895271bff
2013-12-02	Merge "improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2"	Yunqing Wang

2013-11-26	improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2	Abo Talib Mahfoodh
	vp9_idct32x32_34_add_sse2: speedup 1.472. IDCT32_1D_34 and MULTIPLICATION_AND_ADD_2 are optimized based on the fact that only the upper-left 8x8 has non-zero values.
	vp9_idct32x32_1024_add_sse2: speedup 1.032.
	Tested with: park_joy_420_720p50.y4m Change-Id: I8670ce547552b48695049de298e2fc46ce28dfbc
2013-11-22	Do vertical loopfiltering in parallel	Yunqing Wang
	This patch followed "Add filter_selectively_vert_row2 to enable parallel loopfiltering" commit, and added x86 SSE2 optimization to do 16-pixel filtering in parallel. For other optimizations (neon and dspr2), current 16-pixel functions were done by calling 8-pixel functions twice, and real 16-pixel functions could be added later. Decoder speedup: tulip clip: 2% speed gain; old_town_cross: 1.2% speed gain; bus: 2% speed gain. Change-Id: I4818a0c72f84b34f5fe678e496cf4a10238574b7
2013-11-20	Correct ssse3 8/16-pixel wide sub-pixel filter calculation	Yunqing Wang
	Although no mismatch was indicated for 8/16 wide sub-pixel filters in issue 661, they had similar problems that could cause mismatch potentially. This patch fixed calculations in HORIZx8/16 and VERTx8/16. Change-Id: I169961c9d40a20340995b7d22aafc89ccf30bfca
2013-11-20	Fix stack pointer in sub-pixel filters	Yunqing Wang
	In commit "3d50da5397d20abc932d81453b26cde758293a40", the stack pointer was modified while aligning the stack, and it needed to be restored (popped) at the end. Change-Id: I062971e195f1f2ab9d0ab5fb84dcf215a0fcaa67
2013-11-19	Fix decoder mismatch with ssse3 enabled	Yunqing Wang
	This patch fixed issue 661: "Decoder produces mismatched outputs with ssse3 enabled and disabled." In sub-pixel filters, a pixel value was multiplied by a filter coefficient, and the results were added up. The order of adding up these multiplications had to be arranged carefully to prevent incorrect overflowing. Change-Id: Id08af4200fea9e1b896fc40157b8651c2c7e80f2
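To illustrate why the order of accumulation can matter, the sketch below assumes the partial products are summed with saturating 16-bit additions; the commit message does not state this, so treat it as an assumption. With saturation, addition is no longer associative: an intermediate sum that clips changes the final result even when the true total fits in 16 bits. The helper names and input values are hypothetical.

```c
#include <stdint.h>

/* Saturating 16-bit addition (assumed model of the accumulation step). */
static int16_t sat_add16(int16_t a, int16_t b) {
  int32_t s = (int32_t)a + (int32_t)b;
  if (s > INT16_MAX) s = INT16_MAX;
  if (s < INT16_MIN) s = INT16_MIN;
  return (int16_t)s;
}

/* Adds the two large positive terms first: the intermediate sum clips. */
int16_t sum_left_to_right(int16_t a, int16_t b, int16_t c) {
  return sat_add16(sat_add16(a, b), c);
}

/* Pairs a positive term with the negative one first, so no intermediate
 * sum leaves the 16-bit range for the inputs used below. */
int16_t sum_reordered(int16_t a, int16_t b, int16_t c) {
  return sat_add16(sat_add16(a, c), b);
}
```

For a = 30000, b = 10000, c = -15000 the exact total is 25000. `sum_reordered` returns 25000, while `sum_left_to_right` clips a + b to 32767 and returns 17767.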
2013-11-18	Improve vp9_iht4x4_16_add_sse2 (x1.341)	Abo Talib Mahfoodh
	This rebase is a better implementation than the previous ones. Modifications were made to reduce the total number of clock cycles. Speedup: 1.341. Compiled with -O3. Tested with: park_joy_420_720p50.y4m Change-Id: I940eaf283f60597ca0d9d2e13d518878d55ff02d
2013-11-15	Do horizontal loopfiltering in parallel	Yunqing Wang
	This patch followed "Rewrite filter_selectively_horiz for parallel loopfiltering" commit, and added x86 SSE2 optimization to do 16-pixel filtering in parallel. Also, corrected the declaration of aligned arrays. For 8-pixel-in-parallel case, improved the calculation of the masks and filters. Updated the threshold loading since the thresholds were already duplicated. Updated neon C functions to call neon loopfilters twice. Using tulip clip, tests showed it gave a ~1.5% decoder speed gain. Change-Id: Id02638626ac27a4b0e0b09d71792a24c0499bd35
2013-11-08	Merge "Improve vp9_idct4x4_1_add_sse2"	Yunqing Wang

2013-11-01	vp9 ssse3 d207_predictor_32x32: add missing GLOBAL()	James Zern
	removes a textrel for sh_b23456789abcdefff Change-Id: I80cb9dfd8e49a0fe884c8ff76472275b3a00cb57
2013-10-31	mb_lpf_horizontal_edge AVX2 optimization	Tamar Levy
	This CL contains two AVX2 optimized loop filter functions, mb_lpf_horizontal_edge_w_avx2_8 and mb_lpf_horizontal_edge_w_avx2_16. Change-Id: I604e4fe6e99752b7800c2ea98721d97f7e0b931b
2013-10-25	Merge "Add 32x32 idct function for eob<=34 case"	Yunqing Wang

2013-10-24	Add 32x32 idct function for eob<=34 case	Yunqing Wang
	When only the upper-left 8x8 area has non-zero dct coefficients, we could skip the 1D IDCT for the 9th to 32nd rows to save operations. This function is called when eob <= 34. Change-Id: I9684b75947bdde346cfe3720f08a953aa7a13fb5
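The row-skipping idea can be sketched with a toy separable transform. The 1-D step below is a running sum, chosen only because it is linear (an all-zero row produces an all-zero row), which is the property the skip relies on; it is not the real IDCT, and all names here are hypothetical.

```c
#include <string.h>

#define N 32       /* block size */
#define NONZERO 8  /* rows/columns that may hold non-zero coefficients */

/* Toy linear 1-D step (running sum), standing in for a 1-D IDCT. */
static void toy_linear_1d(const int *in, int *out) {
  int acc = 0;
  for (int i = 0; i < N; ++i) {
    acc += in[i];
    out[i] = acc;
  }
}

/* Row pass over the first `rows_to_process` rows, then a full column pass.
 * Rows that are skipped stay zero in the intermediate buffer. */
static void transform_2d(const int *input, int *output, int rows_to_process) {
  int tmp[N * N];
  int col_in[N], col_out[N];
  memset(tmp, 0, sizeof(tmp));
  for (int r = 0; r < rows_to_process; ++r)
    toy_linear_1d(input + r * N, tmp + r * N);
  for (int c = 0; c < N; ++c) {
    for (int r = 0; r < N; ++r) col_in[r] = tmp[r * N + c];
    toy_linear_1d(col_in, col_out);
    for (int r = 0; r < N; ++r) output[r * N + c] = col_out[r];
  }
}

/* Processes every row. */
void transform_2d_full(const int *input, int *output) {
  transform_2d(input, output, N);
}

/* Valid only when rows NONZERO..N-1 of the input are all zero. */
void transform_2d_sparse(const int *input, int *output) {
  transform_2d(input, output, NONZERO);
}
```

When the input is non-zero only in its upper-left 8x8 corner, both entry points produce identical output, but the sparse one runs 8 row passes instead of 32.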
2013-10-23	Renaming vp9_short_fdct8x8 to vp9_fdct8x8.	Dmitry Kovalev
	For consistency with idct function names. Change-Id: I7b6af2f92c66eff56f84ed29edc3a66af8dc421f
2013-10-22	Improve vp9_idct4x4_1_add_sse2	Abo Talib Mahfoodh
	Simple modification to reduce the number of cycles in the function.
	Original function number of cycles: 973
	Modified function number of cycles: 835
	Improvement factor: 1.165
	Tested with: park_joy_420_720p50.y4m Change-Id: Ic5857272ea3aafe21d5ef9a69258d78c688f69bd
2013-10-18	Fix d207 intra prediction SSSE3 functions	Yunqing Wang
	This patch fixed a bug that caused 32bit PIC build mismatch. The stack pointer was modified after "GET_GOT". Loading left pointer from a hard-coded position gave wrong result. Change-Id: Iea0aec6f917b12a6b3393ffc986bad74510248cc
2013-10-15	Merge "Fix a few indent format issues in buffer defs"	Jingning Han

2013-10-15	Fix a few indent format issues in buffer defs	Jingning Han
	Change-Id: Iac55891ac9e6f13718c9f822aa099b5ca491832a
2013-10-11	Making input pointer of any inverse transform constant.	Dmitry Kovalev
	Also renaming dest_stride to stride in some places. Change-Id: I75f602b623a5a7071d4922b747c45fa0b7d7a940
2013-10-11	Consistent names for inverse hybrid transforms (1 of 2).	Dmitry Kovalev
	Renames:
	vp9_short_iht4x4_add -> vp9_iht4x4_16_add
	vp9_short_iht8x8_add -> vp9_iht8x8_64_add
	vp9_short_iht16x16_add_c -> vp9_iht16x16_256_add
	Change-Id: Ibca7a188fd062b196787ac5efc1ea545e7f166c0
2013-10-11	Merge "Removing vp9_idct4_1d_sse2 function."	Dmitry Kovalev

2013-10-11	Code cleanup	Yunqing Wang
	Minor code cleanup. Change-Id: I47c1f794842d4570bb39cfd23b80f54f5606bba6
2013-10-11	Merge "SSE2 8-tap sub-pixel filter optimization"	Yunqing Wang

2013-10-10	Removing vp9_idct4_1d_sse2 function.	Dmitry Kovalev
	We have two SSE2-optimized functions for idct4_1d:
	vp9_idct4_1d_sse2 <-- removing this one
	idct4_1d_sse2
	vp9_idct4_1d_sse2 was used only by the following functions, which already have SSE2-optimized variants:
	vp9_idct4x4_16_add_c -> vp9_idct4x4_16_add_sse2
	idct8_1d -> vp9_idct8x8_{16, 10, 1}_sse2
	vp9_short_iht4x4_add_c -> vp9_short_iht4x4_add_sse2
	Change-Id: Ib0a7f6d1373dbaf7a4a41208cd9d0671fdf15edb
2013-10-10	d207 intra prediction ssse3 using bytes	Scott LaVarnway
	Byte version of Ronald's d207 ssse3 optimizations (commit: f891f84d3ba9345b0074e682f0fea09b8ddf4f1e). Change-Id: If15f71a589ea16f78ac86a501b0c5c6231dc9af1
2013-10-10	Merge "Giving consistent names to IDCT 32x32 functions."	Dmitry Kovalev

2013-10-10	Merge "d153 intra prediction (32x32) ssse3 using bytes"	Yunqing Wang

2013-10-10	SSE2 8-tap sub-pixel filter optimization	Yunqing Wang
	To ensure fast encoding/decoding on devices without ssse3 support, SSE2 optimization of sub-pixel filters was done. A test using a 1080p clip showed decoder speeds of ~70 fps with ssse3 filters, ~60 fps with sse2 filters, and ~15 fps with C filters. Change-Id: Ie2088f87d83a889fba80a613e4d0e287aadd785c
2013-10-10	Giving consistent names to IDCT 32x32 functions.	Dmitry Kovalev
	Renames:
	vp9_short_idct32x32_add -> vp9_idct32x32_1024_add
	vp9_short_idct32x32_1_add -> vp9_idct32x32_1_add
	vp9_idct_add_32x32 -> vp9_idct32x32_add
	Change-Id: Id85306f5814bac6c47463a6b5901a93082510666
2013-10-07	Giving consistent names to IDCT 16x16 functions.	Dmitry Kovalev
	Renames:
	vp9_short_idct16x16_add -> vp9_idct16x16_256_add
	vp9_short_idct16x16_10_add -> vp9_idct16x16_10_add
	vp9_short_idct16x16_1_add -> vp9_idct16x16_1_add
	vp9_idct_add_16x16 -> vp9_idct16x16_add
	Change-Id: Ief8a3904de78deab0f4ede944c4d0339c228cfc3
2013-10-07	Merge "Giving consistent names to IDCT 8x8 functions."	Dmitry Kovalev