libvpx.git

Age	Commit message	Author
2014-05-07	Revert "Add an MMX fwht4x4"	Paul Wilkins
	Includes changes that are not compatible with VS windows builds. Amongst other things stdint.h is not supported in VS. This reverts commit 89fbf3de501b5d7fd90047192521eae3198705cd. Change-Id: Ifa86d7df250578d1ada9b539c9ff12ed0c523cdd
2014-05-06	Merge "Add an MMX fwht4x4"	Alex Converse

2014-05-05	Add an MMX fwht4x4	Alex Converse
	7% faster encoding a desktop lossless at RT speed 4. Change-Id: I41627f5b737752616b6512bb91a36ec45995bf64
2014-05-05	SSSE3 implementation of full inverse 8x8 2D-DCT	Jingning Han
	This commit enables the SSSE3 version of the full inverse 8x8 2D-DCT and reconstruction. It brings the runtime of vp9_idct8x8_64_add down from 256 cycles (SSE2) to 246 cycles. Change-Id: I0600feac894d6a443a3c9d18daf34156d4e225c3
2014-05-01	Merge "Removing half-variance asm functions which are not used."	Dmitry Kovalev

2014-04-30	Merge "Enable SSSE3 implementation of 8x8 forward 2D-DCT"	Jingning Han

2014-04-30	Removing half-variance asm functions which are not used.	Dmitry Kovalev
	Corresponding C functions were removed in I99695564a3aa9bc8c79ac0a551d257e2ff3ad3c3 Change-Id: I50a5575065a7a9e41904eb2161afd739def927db
2014-04-29	Enable SSSE3 implementation of 8x8 forward 2D-DCT	Jingning Han
	Assembly implementation of ssse3 8x8 forward 2D-DCT. The current version is turned on only for x86_64. The average unit runtime goes from 157 cycles down to 136 cycles, i.e., about 12.8% faster. This translates into about 1.5% speed-up for pedestrian_area 1080p at speed 2. Change-Id: I0f12435857e9425ed7ce12541344dfa16837f4f4
2014-04-25	Removing unused vp9_variance_halfpixvar*() functions.	Dmitry Kovalev
	Change-Id: I99695564a3aa9bc8c79ac0a551d257e2ff3ad3c3
2014-04-14	Removing unused vp9_mcomp_x86.h file.	Dmitry Kovalev
	We don't use declarations from this file. The real declarations (differently named) are in vp9_rtcd_defs.pl, e.g. vp9_full_search_sad. Change-Id: I73cbf064305710ba20747233cfdbe67366f069a0
2014-03-21	AVX2 SAD Optimization:	levytamar82
	Two functions were optimized for AVX2 by using the full 256-bit registers to handle 32 elements in parallel instead of only 16: 1. vp9_sad32x32x4d; 2. vp9_sad64x64x4d. The function-level gain is 66% and the user-level gain is ~1%. Change-Id: I4efbb3bc7d8bc03b64b6c98f5cd5c4a9dd3212cb
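For readers unfamiliar with SAD kernels, here is a plain scalar sketch of the quantity being accelerated. It is not the AVX2 code from this commit. The names `sad_block` and `sad_block_x4` and their parameter lists are hypothetical, and the reading of "x4d" as "one source block compared against four reference blocks" is an assumption that should be checked against the libvpx sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar helper: sum of absolute differences over a w x h
 * block. Strides are counted in elements (bytes here, since samples are
 * 8-bit). A SIMD version would handle 16 or 32 samples of a row at once. */
static unsigned int sad_block(const uint8_t *src, int src_stride,
                              const uint8_t *ref, int ref_stride,
                              int w, int h) {
  unsigned int sad = 0;
  assert(w > 0 && h > 0 && src_stride >= w && ref_stride >= w);
  for (int r = 0; r < h; ++r) {
    for (int c = 0; c < w; ++c) {
      sad += (unsigned int)abs(src[c] - ref[c]);
    }
    src += src_stride;
    ref += ref_stride;
  }
  return sad;
}

/* Assumed "x4d" shape: the same source block is compared against four
 * candidate reference blocks and four SAD values are written out. */
static void sad_block_x4(const uint8_t *src, int src_stride,
                         const uint8_t *const refs[4], int ref_stride,
                         int w, int h, unsigned int sads[4]) {
  for (int i = 0; i < 4; ++i) {
    sads[i] = sad_block(src, src_stride, refs[i], ref_stride, w, h);
  }
}
```

For a 32x32 block, a caller would pass `w = h = 32` together with the frame strides; the scalar loops above only define the expected result that a vectorized version has to reproduce.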
2014-03-17	Removed several unused functions.	Yaowu Xu
	Change-Id: Ib9e27298c575afc02a98b593bc6ad60762064d9b
2014-03-05	Merge "improved speed of 4x4 sse2 fdct."	Andrew Russell

2014-03-03	improved speed of 4x4 sse2 fdct.	Andrew Russell
	* speed improvement of 30 percent achieved
	* multiplies and adds remain the same
	* non-arithmetic instructions minimized by hand, by:
	  - expanding 2 pass loop
	  - removing irrelevant "shuffles"
	  - combining last two rounding steps
	* further improvements may be possible
	Change-Id: Idec2c3f52910c48e6a0e0f9aefed5cae31b0b8c0
2014-02-28	AVX2 SubPixel AVG Variance Optimization	levytamar82
	Optimizing two functions to process 32 elements in parallel instead of 16: 1. vp9_sub_pixel_avg_variance64x64; 2. vp9_sub_pixel_avg_variance32x32. Both functions previously called vp9_sub_pixel_avg_variance16xh_ssse3; they now call vp9_sub_pixel_avg_variance32xh_avx2, which is written in AVX2 and processes 32 elements in parallel. This optimization gave an 80% function-level gain and a 2% user-level gain. Change-Id: Iea694654e1b7612dc6ed11e2626208c2179502c8
2014-02-18	vp9_subpel_variance_impl_intrin_avx2.c: make some tables static	James Zern
	+ fix formatting Change-Id: I7b4ec11b7b46d8926750e0b69f7a606f3ab80895
2014-02-14	AVX2 SubPixel Variance Optimization	levytamar82
	Optimizing two functions to process 32 elements in parallel instead of 16: 1. vp9_sub_pixel_variance64x64; 2. vp9_sub_pixel_variance32x32. Both functions previously called vp9_sub_pixel_variance16xh_ssse3; they now call vp9_sub_pixel_variance32xh_avx2, which is written in AVX2 and processes 32 elements in parallel. This optimization gave a 70% function-level gain and a 2% user-level gain. Change-Id: I4f5cb386b346ff6c878a094e1c3b37e418e50bde
2014-02-12	minor spelling cleanup in comments	Andrew Russell
	Change-Id: Ia91c6c406273345b08505097ffe1af3896980f06
2014-02-07	Bug fix in ssse3 quantize function	Yunqing Wang
	A bug was reported in Issue 702: "SIGILL (Illegal instruction) when transcoding with vp9 - using FFmpeg". It was reproduced and fixed. Change-Id: Ie32c149a89af02856084aeaf289e848a905c7700
2014-02-06	Finally removing "short" from transform names.	Dmitry Kovalev
	Change-Id: I5259b68dc1bcceb153e3ffe638a79a59a3019e9d
2014-01-27	Removing _1d suffix from transform names.	Dmitry Kovalev
	It is enough to specify (e.g.) idct16, it is obviously different from idct16x16. Change-Id: I6b408a37a945de3162429380b59a775b03b95db0
2014-01-23	vp9/encoder: add extern "C" to headers	James Zern
	Change-Id: I4f51ce859a97bf1b8fd2b37ac585b7c643232b69
2014-01-08	AVX2 Variance Optimization	levytamar82
	Optimizing the variance functions vp9_variance16x16, vp9_variance32x32, vp9_variance64x64, vp9_variance32x16, vp9_variance64x32 and vp9_mse16x16 by migrating them to AVX2. Some of the functions were optimized by processing 32 elements instead of 16. Others were optimized by processing 2 loop strides of 16 elements in a single 256-bit register. This optimization gives a user-level performance gain of between 2.4% and 2.7% and a function-level gain of 42%. Change-Id: I265ae08a2b0196057a224a86450153ef3aebd85d
2013-12-16	vp9: normalize include guards	James Zern
	Change-Id: If4ddbdcfb3ab387cbca6910b42cf4df8111e6879
2013-11-27	Merge "vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2"	Yaowu Xu

2013-11-21	vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2	levytamar82
	Change-Id: I6366e84490883b72362f762369d7e5bccb64f02f
2013-11-21	Improve vp9_fdct4x4_sse2 (x1.2)	Abo Talib Mahfoodh
	Modifications reduce the total clock cycle count. Speedup: 1.2x. Tested with: park_joy_420_720p50.y4m Change-Id: Ia36b87e62e2f80a5fadaf5628729aedc80f38f3f
2013-11-13	Fix an overflow issue in SSE2 forward ADST	Jingning Han
	The step that sums three input samples could cause the intermediate result to exceed the 16-bit limit when operating as the second 1-D transform. This commit fixes the issue. Change-Id: Iaf512449ac2d25ddd8a806d760afab362c62a516
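The failure mode described above can be reproduced in scalar C. This is only an illustration of 16-bit lane wraparound (the behaviour of a packed 16-bit add such as `_mm_add_epi16`), not the ADST code itself. The helper names `sum3_wrap16` and `sum3_wide` are hypothetical, and widening to 32 bits is shown as one possible remedy, not necessarily the one this commit applied.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: sum three samples the way a 16-bit SIMD lane does,
 * i.e. modulo 2^16, then reinterpret the result as a signed value. */
static int sum3_wrap16(int16_t a, int16_t b, int16_t c) {
  /* Unsigned arithmetic wraps by definition, so this is well defined. */
  uint16_t u = (uint16_t)((uint16_t)a + (uint16_t)b + (uint16_t)c);
  return u >= 0x8000 ? (int)u - 0x10000 : (int)u;
}

/* Hypothetical helper: the same sum kept in 32 bits, which cannot overflow
 * for three 16-bit inputs (|sum| <= 3 * 32768). */
static int32_t sum3_wide(int16_t a, int16_t b, int16_t c) {
  int32_t s = (int32_t)a + b + c;
  assert(s >= -3 * 32768 && s <= 3 * 32767);
  return s;
}
```

With three inputs of 20000, the wide sum is 60000 while the 16-bit lane yields -5536. Small first-pass inputs do not trigger this, which is consistent with the problem only showing up in the second 1-D pass, where the inputs are larger.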
2013-11-07	Remove TEXTREL from 32bit encoder	Yunqing Wang
	This patch fixes the issue reported in "Issue 655: remove textrel's from 32-bit vp9 encoder". The set of vp9_subpel_variance functions that use the x86inc.asm ABI didn't build correctly for 32-bit PIC. The fix had to be done carefully because there were not enough registers. After the change, we got:
	$ eu-findtextrel libvpx.so
	eu-findtextrel: no text relocations reported in 'libvpx.so'
	Change-Id: I1b176311dedaf48eaee0a1e777588043c97cea82
2013-10-24	Making input pointer constant for all fdct/fht functions.	Dmitry Kovalev
	Change-Id: I78f7012f967a777ddd39bae6671eb501df6bbfe8
2013-10-23	Renaming vp9_short_fdct4x4 and vp9_short_walsh4x4.	Dmitry Kovalev
	For consistency with idct function names. Renames: vp9_short_fdct4x4 -> vp9_fdct4x4; vp9_short_walsh4x4 -> vp9_fwht4x4. Change-Id: Id15497cc1270acca626447d846f0ce9199770f58
2013-10-23	Renaming vp9_short_fdct32x32 to vp9_fdct32x32.	Dmitry Kovalev
	For consistency with idct function names. Change-Id: Ie77b7178e0894c57cd5cb9243c949eb9224ece18
2013-10-23	Merge "Renaming vp9_short_fdct16x16 to vp9_fdct16x16."	Dmitry Kovalev

2013-10-23	Renaming vp9_short_fdct16x16 to vp9_fdct16x16.	Dmitry Kovalev
	For consistency with idct function names. Change-Id: I5ca355ba99fdba04f09254be95cf79808b534f71
2013-10-23	Renaming vp9_short_fdct8x8 to vp9_fdct8x8.	Dmitry Kovalev
	For consistency with idct function names. Change-Id: I7b6af2f92c66eff56f84ed29edc3a66af8dc421f
2013-10-22	Merge "Using stride (# of elements) instead of pitch (bytes) in fdct4x4."	Dmitry Kovalev

2013-10-22	Merge "Using stride (# of elements) instead of pitch (bytes) in fdct8x8."	Dmitry Kovalev

2013-10-21	Using stride (# of elements) instead of pitch (bytes) in fdct4x4.	Dmitry Kovalev
	Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: I0ba3c52513a5fdd194f1e7e2901092671398985b
2013-10-18	Using stride (# of elements) instead of pitch (bytes) in fdct8x8.	Dmitry Kovalev
	Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: Ibc944952a192e6c7b2b6a869ec2894c01da82ed1
2013-10-18	Using stride (# of elements) instead of pitch (bytes) in fdct16x16.	Dmitry Kovalev
	Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: I2d95fdcbba96aaa0ed24a80870cb38f53487a97d
2013-10-17	Using stride (# of elements) instead of pitch (bytes) in fdct32x32.	Dmitry Kovalev
	Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: Id623c5113262655fa50f7c9d6cec9a91fcb20bb4
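The stride/pitch distinction in this series of commits can be sketched as follows. The helpers `row_by_stride` and `row_by_pitch` are hypothetical and exist only to show that, for `int16_t` input, a stride of N elements corresponds to a pitch of 2N bytes; they are not libvpx functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: row r when the row step is a stride, i.e. a count
 * of int16_t elements. Pointer arithmetic scales by the element size. */
static const int16_t *row_by_stride(const int16_t *base, int r, int stride) {
  return base + (ptrdiff_t)r * stride;
}

/* Hypothetical helper: row r when the row step is a pitch in bytes. The
 * caller's value must be converted to elements before it can be used. */
static const int16_t *row_by_pitch(const int16_t *base, int r,
                                   int pitch_bytes) {
  assert(pitch_bytes % (int)sizeof(int16_t) == 0);
  return base + (ptrdiff_t)r * (pitch_bytes / (int)sizeof(int16_t));
}
```

Passing a byte pitch where an element stride is expected would skip twice as far per row for 16-bit data, which is the kind of mismatch a single convention across fdct/iht/idct/fht avoids.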
2013-10-15	Removing unused 8x4 transform from the encoder.	Dmitry Kovalev
	Change-Id: Icbcf68b5b685a56f255ebc3859c9692accdadf9e
2013-10-09	Merge "Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16"	Jingning Han

2013-10-07	Merge "cpplint vp9_variance_sse2.c"	Jim Bankoski

2013-10-05	Merge "added nolint to function that doesn't seem easy to breakup"	Jim Bankoski

2013-10-04	cpplint issues resolved in vp9_variance_mmx.c	Jim Bankoski
	Change-Id: Idbfabe427fbeab44210f13fec8b6f63f7a4eb0dd
2013-10-04	added nolint to function that doesn't seem easy to breakup	Jim Bankoski
	Change-Id: I5489b116aea7c510ea5ebbed3c1445f321b05f3e
2013-10-04	cpplint vp9_variance_sse2.c	Jim Bankoski
	Change-Id: Ifce8f5b57a1ea8952e8a67c5b92a127a061899fa
2013-10-02	Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16	A.Mahfoodh
	Simplify k_cvtlo_epi16 and k_cvthi_epi16 to only two instructions, then inline them. Quoting from Intel's MMX_App_Compute_16bit_Vector.pdf: "The PMADDWD instruction multiplies four pairs of 16-bit numbers and produces partial sums of the results and can do so once per clock (with a three-clock latency)." So I am assuming that there will be a three-clock overhead after the last _mm_madd_pi16 command. Even with the overhead, the number of clocks in general should be smaller. I am not sure, though, because I could not find information about the number of clocks required for the instructions in k_cvtlo_epi16 and k_cvthi_epi16. I will run a test and compare the execution time. Change-Id: Ieda4aa338f69ad3dd196ac6e7892da3cf1b47ea7
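To illustrate how a multiply-add can widen 16-bit values in two instructions, here is a scalar model of one 32-bit PMADDWD lane. It assumes the simplified conversion interleaves each sample with zero and multiply-adds against a vector of ones; that reading, and the names `madd_lane` and `widen_via_madd`, are assumptions made for this sketch and are not taken from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit output lane of PMADDWD: two signed 16-bit
 * products are added together. The single wrapping case of the real
 * instruction (all four inputs equal to -32768) is excluded by the assert
 * rather than modelled. */
static int32_t madd_lane(int16_t a_lo, int16_t a_hi,
                         int16_t b_lo, int16_t b_hi) {
  int64_t s = (int64_t)a_lo * b_lo + (int64_t)a_hi * b_hi;
  assert(s >= INT32_MIN && s <= INT32_MAX);
  return (int32_t)s;
}

/* Assumed two-step widening: pair x with 0 (the interleave step), then
 * multiply-add against (1, 1). The lane computes x * 1 + 0 * 1, which is x
 * sign-extended to 32 bits. */
static int32_t widen_via_madd(int16_t x) {
  return madd_lane(x, 0, 1, 1);
}
```

Because the multiply treats its inputs as signed, negative samples keep their sign without a separate compare-and-mask step.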
2013-09-23	Number of instructions in fdct4_1d_sse2 reduced by two.	A.Mahfoodh
	Mathematically the results are the same. Change-Id: I1c5126cd3ca64e8515ca6331e0989c6f7dd651a0