summaryrefslogtreecommitdiff
path: root/vp9/encoder/x86
AgeCommit message (Collapse)Author
2014-03-21AVX2 SAD Optimization:levytamar82
2 functions were optimized for avx2 by using full 256 bit register In order to handle 32 elements in parallel instead of only 16 in parallel: 1. vp9_sad32x32x4d 2. vp9_sad64x64x4d The function level gain is 66% and the user level gain is ~1%. Change-Id: I4efbb3bc7d8bc03b64b6c98f5cd5c4a9dd3212cb
2014-03-17Removed several unused functions.Yaowu Xu
Change-Id: Ib9e27298c575afc02a98b593bc6ad60762064d9b
2014-03-05Merge "improved speed of 4x4 sse2 fdct."Andrew Russell
2014-03-03improved speed of 4x4 sse2 fdct.Andrew Russell
* speed improvment of 30 percent achieved * multiplies and adds remain the same * non-arithmetic instructions minimized by hand, by: -expanding 2 pass loop -removing irrelivant "shuffles" -combining last two rounding steps * further improvments may be possible Change-Id: Idec2c3f52910c48e6a0e0f9aefed5cae31b0b8c0
2014-02-28AVX2 SubPixel AVG Variance Optimizationlevytamar82
Optimizing 2 functions to process 32 elements in parallel instead of 16: 1. vp9_sub_pixel_avg_variance64x64 2. vp9_sub_pixel_avg_variance32x32 both of those function were calling vp9_sub_pixel_avg_variance16xh_ssse3 instead of calling that function, it calls vp9_sub_pixel_avg_variance32xh_avx2 that is written in avx2 and process 32 elements in parallel. This Optimization gave 80% function level gain and 2% user level gain Change-Id: Iea694654e1b7612dc6ed11e2626208c2179502c8
2014-02-18vp9_subpel_variance_impl_intrin_avx2.c: make some tables staticJames Zern
+ fix formatting Change-Id: I7b4ec11b7b46d8926750e0b69f7a606f3ab80895
2014-02-14AVX2 SubPixel Variance Optimizationlevytamar82
Optimizing 2 functions to process 32 elements in parallel instead of 16: 1. vp9_sub_pixel_variance64x64 2. vp9_sub_pixel_variance32x32 both of those function were calling vp9_sub_pixel_variance16xh_ssse3 instead of calling that function, it calls vp9_sub_pixel_variance32xh_avx2 that is written in avx2 and process 32 elements in parallel. This Optimization gave 70% function level gain and 2% user level gain Change-Id: I4f5cb386b346ff6c878a094e1c3b37e418e50bde
2014-02-12minor spelling cleanup in commentsAndrew Russell
Change-Id: Ia91c6c406273345b08505097ffe1af3896980f06
2014-02-07Bug fix in ssse3 quantize functionYunqing Wang
A bug was reported in Issue 702: "SIGILL (Illegal instruction) when transcoding with vp9 - using FFmpeg". It was reproduced and fixed. Change-Id: Ie32c149a89af02856084aeaf289e848a905c7700
2014-02-06Finally removing "short" from transform names.Dmitry Kovalev
Change-Id: I5259b68dc1bcceb153e3ffe638a79a59a3019e9d
2014-01-27Removing _1d suffix from transform names.Dmitry Kovalev
It is enough to specify (e.g.) idct16, it is obviously different from idct16x16. Change-Id: I6b408a37a945de3162429380b59a775b03b95db0
2014-01-23vp9/encoder: add extern "C" to headersJames Zern
Change-Id: I4f51ce859a97bf1b8fd2b37ac585b7c643232b69
2014-01-08AVX2 Variance Optimizationlevytamar82
Optimizing the variance functions: vp9_variance16x16, vp9_variance32x32, vp9_variance64x64, vp9_variance32x16, vp9_variance64x32, vp9_mse16x16 by migrating to AVX2 some of the functions were optimized by processing 32 elements instead of 16. some of the functions were optimized by processing 2 loop strides of 16 elements in a single 256 bit register This optimization gives between 2.4% - 2.7% user level performance gain and 42% function level gain. Change-Id: I265ae08a2b0196057a224a86450153ef3aebd85d
2013-12-16vp9: normalize include guardsJames Zern
Change-Id: If4ddbdcfb3ab387cbca6910b42cf4df8111e6879
2013-11-27Merge "vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2"Yaowu Xu
2013-11-21vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2levytamar82
Change-Id: I6366e84490883b72362f762369d7e5bccb64f02f
2013-11-21Improve vp9_fdct4x4_sse2 (x1.2)Abo Talib Mahfoodh
Modifications are done to reduce the total clock cycle. Speedup: 1.2 Tested with: park_joy_420_720p50.y4m Change-Id: Ia36b87e62e2f80a5fadaf5628729aedc80f38f3f
2013-11-13Fix an overflow issue in SSE2 forward ADSTJingning Han
The step that sums three input samples could potentially cause the intermediate result go beyond 16 bit limit, when operating as the second 1-D transform. This commit fixes the issue. Change-Id: Iaf512449ac2d25ddd8a806d760afab362c62a516
2013-11-07Remove TEXTREL from 32bit encoderYunqing Wang
This patch fixed the issue reported in "Issue 655: remove textrel's from 32-bit vp9 encoder". The set of vp9_subpel_variance functions that used x86inc.asm ABI didn't build correctly for 32bit PIC. The fix was carefully done under the situation that there was not enough registers. After the change, we got $ eu-findtextrel libvpx.so eu-findtextrel: no text relocations reported in 'libvpx.so' Change-Id: I1b176311dedaf48eaee0a1e777588043c97cea82
2013-10-24Making input pointer constant for all fdct/fht functions.Dmitry Kovalev
Change-Id: I78f7012f967a777ddd39bae6671eb501df6bbfe8
2013-10-23Renaming vp9_short_fdct4x4 and vp9_short_walsh4x4.Dmitry Kovalev
For consistency with idct function names. Renames: vp9_short_fdct4x4 -> vp9_fdct4x4 vp9_short_walsh4x4 -> vp9_fwht4x4 Change-Id: Id15497cc1270acca626447d846f0ce9199770f58
2013-10-23Renaming vp9_short_fdct32x32 to vp9_fdct32x32.Dmitry Kovalev
For consistency with idct function names. Change-Id: Ie77b7178e0894c57cd5cb9243c949eb9224ece18
2013-10-23Merge "Renaming vp9_short_fdct16x16 to vp9_fdct16x16."Dmitry Kovalev
2013-10-23Renaming vp9_short_fdct16x16 to vp9_fdct16x16.Dmitry Kovalev
For consistency with idct function names. Change-Id: I5ca355ba99fdba04f09254be95cf79808b534f71
2013-10-23Renaming vp9_short_fdct8x8 to vp9_fdct8x8.Dmitry Kovalev
For consistency with idct function names. Change-Id: I7b6af2f92c66eff56f84ed29edc3a66af8dc421f
2013-10-22Merge "Using stride (# of elements) instead of pitch (bytes) in fdct4x4."Dmitry Kovalev
2013-10-22Merge "Using stride (# of elements) instead of pitch (bytes) in fdct8x8."Dmitry Kovalev
2013-10-21Using stride (# of elements) instead of pitch (bytes) in fdct4x4.Dmitry Kovalev
Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: I0ba3c52513a5fdd194f1e7e2901092671398985b
2013-10-18Using stride (# of elements) instead of pitch (bytes) in fdct8x8.Dmitry Kovalev
Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: Ibc944952a192e6c7b2b6a869ec2894c01da82ed1
2013-10-18Using stride (# of elements) instead of pitch (bytes) in fdct16x16.Dmitry Kovalev
Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: I2d95fdcbba96aaa0ed24a80870cb38f53487a97d
2013-10-17Using stride (# of elements) instead of pitch (bytes) in fdct32x32.Dmitry Kovalev
Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: Id623c5113262655fa50f7c9d6cec9a91fcb20bb4
2013-10-15Removing unused 8x4 transform from the encoder.Dmitry Kovalev
Change-Id: Icbcf68b5b685a56f255ebc3859c9692accdadf9e
2013-10-09Merge "Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16"Jingning Han
2013-10-07Merge "cpplint vp9_variance_sse2.c"Jim Bankoski
2013-10-05Merge "added nolint to function that doesn't seem easy to breakup"Jim Bankoski
2013-10-04cpplint issues resolved in vp9_variance_mmx.cJim Bankoski
Change-Id: Idbfabe427fbeab44210f13fec8b6f63f7a4eb0dd
2013-10-04added nolint to function that doesn't seem easy to breakupJim Bankoski
Change-Id: I5489b116aea7c510ea5ebbed3c1445f321b05f3e
2013-10-04cpplint vp9_variance_sse2.cJim Bankoski
Change-Id: Ifce8f5b57a1ea8952e8a67c5b92a127a061899fa
2013-10-02Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16A.Mahfoodh
Simplify the k_cvtlo_epi16 and k_cvthi_epi16 to only two instructions. Then inlined them. quoting from intel MMX_App_Compute_16bit_Vector.pdf‎ "The PMADDWD instruction multiplies four pairs of 16-bit numbers and produces partial sums of the results and can do so once per clock (with a three-clock latency)." so I am assuming that there will be three clock overhead after the last _mm_madd_pi16 command. Even with the overhead the number of clocks in general should be smaller. I am not sure though becasue I could not find information about number of clocks required for instructions in k_cvtlo_epi16 and k_cvthi_epi16. I will run a test and compare the execution time. Change-Id: Ieda4aa338f69ad3dd196ac6e7892da3cf1b47ea7
2013-09-23Number of instructions in fdct4_1d_sse2 reduced by two.A.Mahfoodh
Mathematically the results are the same. Change-Id: I1c5126cd3ca64e8515ca6331e0989c6f7dd651a0
2013-09-06Fix overflow issue in 16x16 quantization SSSE3Jingning Han
The 16x16 transform unit test suggested that the peak coefficient value can reach 32639. This could cause potential overflow issue in the SSSE3 implmentation of 16x16 block quantization. This commit fixes this issue by replacing addition with saturated addition. Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e
2013-09-05Use saturated addition in SSSE3 of 32x32 quantJingning Han
The 32x32 forward transform can potentially reach peak coefficient value close to 32700, while the rounding factor can go upto 610. This could cause overflow issue in the SSSE3 implementation of 32x32 quantization process. This commit resolves this issue by replacing the addition operations with saturated addition operations in 32x32 block quantization. Change-Id: Id6b98996458e16c5b6241338ca113c332bef6e70
2013-08-31Fix 32x32 forward transform SSE2 versionJingning Han
This commit fixed the potential overflow issue in the SSE2 implementation of 32x32 forward DCT. It resolved the corrupted coded frames in the border of scenes. Change-Id: If87eef2d46209269f74ef27e7295b6707fbf56f9
2013-08-29Merge "Fix overflow issue in SSSE3 32x32 quantization"Jingning Han
2013-08-29Fix overflow issue in SSSE3 32x32 quantizationJingning Han
The 32x32 quantization process can potentially have the intermediate stacks over 16-bit range, thereby causing enc/dec mismatch. This commit fixes this overflow issue in the SSSE3 implementation, as well as the prototype, of 32x32 quantization. This fixes issue 607 from webm@googlecode. Change-Id: I85635e6ca236b90c3dcfc40d449215c7b9caa806
2013-08-27fixed the reading too many bytesYaowu Xu
In subpel_avg_variance functions, code similar to the following punpkldq m2, [addr] actually reads 8 bytes. For functions that are supposed to work on buffers only have less 8 bytes a line, this caused valgrind error of reading uninitialized memory. Change-Id: I2a4c079dbdbc747829bd9e2ed85f0018ad2a3a34
2013-08-26Fix the reading of too many input pixelsYaowu Xu
in VP9_get4x4var_mmx Change-Id: I4b4a8f45f25ebdfad281f169cc87aba5e2d6f227
2013-08-12SSE2 high precision 32x32 forward DCTJingning Han
Enable SSE2 implementation of high precision 32x32 forward DCT. The intermediate stacks are of 32-bits. The run-time goes down from 32126 cycles to 13442 cycles. Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56
2013-08-06Merge "Place holder for high-precision 32x32 fdct"Jingning Han
2013-08-06variance x86inc guardsJim Bankoski
also fixed bug in sad calcs Change-Id: I6571fcbe37556c16ae32be66dc0fd879852aac1d