path: root/vp9/encoder/x86
Age | Commit message | Author
2014-01-27 | Removing _1d suffix from transform names. | Dmitry Kovalev
It is enough to specify (e.g.) idct16; it is obviously different from idct16x16. Change-Id: I6b408a37a945de3162429380b59a775b03b95db0
2014-01-23 | vp9/encoder: add extern "C" to headers | James Zern
Change-Id: I4f51ce859a97bf1b8fd2b37ac585b7c643232b69
2014-01-08 | AVX2 Variance Optimization | levytamar82
Optimize the variance functions vp9_variance16x16, vp9_variance32x32, vp9_variance64x64, vp9_variance32x16, vp9_variance64x32, and vp9_mse16x16 by migrating them to AVX2. Some of the functions were optimized by processing 32 elements instead of 16; others were optimized by processing 2 loop strides of 16 elements in a single 256-bit register. This optimization gives a 2.4% - 2.7% user-level performance gain and a 42% function-level gain. Change-Id: I265ae08a2b0196057a224a86450153ef3aebd85d
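For orientation, here is a scalar sketch of the quantity these variance functions compute: the sum of squared differences minus the squared sum divided by the block area. The name block_variance and its signature are hypothetical and are not taken from libvpx; an AVX2 version would accumulate the same two sums 32 pixels at a time.

```c
#include <stdint.h>

/* Scalar model of a block variance (hypothetical helper, not libvpx code).
 * Accumulates the sum and the sum of squares of the pixel differences,
 * stores the sum of squares in *sse, and returns sse - sum^2 / (w * h). */
static unsigned int block_variance(const uint8_t *src, int src_stride,
                                   const uint8_t *ref, int ref_stride,
                                   int w, int h, unsigned int *sse) {
  int64_t sum = 0;
  uint64_t sq = 0;
  for (int r = 0; r < h; ++r) {
    for (int c = 0; c < w; ++c) {
      const int diff = src[r * src_stride + c] - ref[r * ref_stride + c];
      sum += diff;
      sq += (uint64_t)(diff * diff);
    }
  }
  *sse = (unsigned int)sq;
  return (unsigned int)(sq - (uint64_t)((sum * sum) / (w * h)));
}
```

Calling it with w = h = 16 models the 16x16 case; a block whose differences are all equal has zero variance.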
2013-12-16 | vp9: normalize include guards | James Zern
Change-Id: If4ddbdcfb3ab387cbca6910b42cf4df8111e6879
2013-11-27 | Merge "vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2" | Yaowu Xu
2013-11-21 | vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2 | levytamar82
Change-Id: I6366e84490883b72362f762369d7e5bccb64f02f
2013-11-21 | Improve vp9_fdct4x4_sse2 (x1.2) | Abo Talib Mahfoodh
Modifications reduce the total clock cycle count. Speedup: 1.2x. Tested with: park_joy_420_720p50.y4m. Change-Id: Ia36b87e62e2f80a5fadaf5628729aedc80f38f3f
2013-11-13 | Fix an overflow issue in SSE2 forward ADST | Jingning Han
The step that sums three input samples could cause the intermediate result to go beyond the 16-bit limit when operating as the second 1-D transform. This commit fixes the issue. Change-Id: Iaf512449ac2d25ddd8a806d760afab362c62a516
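The hazard can be illustrated in scalar form: three values that each fit in a signed 16-bit lane may have a sum that does not. The helper below is a hypothetical sketch and is not the code from this commit; it forms the sum in 32 bits and reports whether it would still fit in 16.

```c
#include <stdint.h>

/* Returns 1 when a + b + c can be stored in a signed 16-bit lane, 0 otherwise.
 * The sum is formed in 32 bits, so the check itself cannot overflow.
 * Hypothetical helper for illustration only. */
static int sum3_fits_int16(int16_t a, int16_t b, int16_t c) {
  const int32_t sum = (int32_t)a + b + c;
  return sum >= INT16_MIN && sum <= INT16_MAX;
}
```

For example, three inputs of 15000 sum to 45000, which exceeds the signed 16-bit maximum of 32767.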
2013-11-07 | Remove TEXTREL from 32bit encoder | Yunqing Wang
This patch fixes the issue reported in "Issue 655: remove textrel's from 32-bit vp9 encoder". The set of vp9_subpel_variance functions that used the x86inc.asm ABI didn't build correctly for 32-bit PIC. The fix had to be done carefully because there were not enough registers. After the change, we got:
$ eu-findtextrel libvpx.so
eu-findtextrel: no text relocations reported in 'libvpx.so'
Change-Id: I1b176311dedaf48eaee0a1e777588043c97cea82
2013-10-24 | Making input pointer constant for all fdct/fht functions. | Dmitry Kovalev
Change-Id: I78f7012f967a777ddd39bae6671eb501df6bbfe8
2013-10-23 | Renaming vp9_short_fdct4x4 and vp9_short_walsh4x4. | Dmitry Kovalev
For consistency with idct function names. Renames:
vp9_short_fdct4x4 -> vp9_fdct4x4
vp9_short_walsh4x4 -> vp9_fwht4x4
Change-Id: Id15497cc1270acca626447d846f0ce9199770f58
2013-10-23 | Renaming vp9_short_fdct32x32 to vp9_fdct32x32. | Dmitry Kovalev
For consistency with idct function names. Change-Id: Ie77b7178e0894c57cd5cb9243c949eb9224ece18
2013-10-23 | Merge "Renaming vp9_short_fdct16x16 to vp9_fdct16x16." | Dmitry Kovalev
2013-10-23 | Renaming vp9_short_fdct16x16 to vp9_fdct16x16. | Dmitry Kovalev
For consistency with idct function names. Change-Id: I5ca355ba99fdba04f09254be95cf79808b534f71
2013-10-23 | Renaming vp9_short_fdct8x8 to vp9_fdct8x8. | Dmitry Kovalev
For consistency with idct function names. Change-Id: I7b6af2f92c66eff56f84ed29edc3a66af8dc421f
2013-10-22 | Merge "Using stride (# of elements) instead of pitch (bytes) in fdct4x4." | Dmitry Kovalev
2013-10-22 | Merge "Using stride (# of elements) instead of pitch (bytes) in fdct8x8." | Dmitry Kovalev
2013-10-21 | Using stride (# of elements) instead of pitch (bytes) in fdct4x4. | Dmitry Kovalev
Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: I0ba3c52513a5fdd194f1e7e2901092671398985b
2013-10-18 | Using stride (# of elements) instead of pitch (bytes) in fdct8x8. | Dmitry Kovalev
Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: Ibc944952a192e6c7b2b6a869ec2894c01da82ed1
2013-10-18 | Using stride (# of elements) instead of pitch (bytes) in fdct16x16. | Dmitry Kovalev
Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: I2d95fdcbba96aaa0ed24a80870cb38f53487a97d
2013-10-17 | Using stride (# of elements) instead of pitch (bytes) in fdct32x32. | Dmitry Kovalev
Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: Id623c5113262655fa50f7c9d6cec9a91fcb20bb4
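The difference between the two conventions in the stride/pitch commits above can be sketched as follows. Both helper names are hypothetical and the code is not taken from libvpx; it only shows that a pitch in bytes must be converted to an element count before indexing an int16_t buffer, while a stride can be used directly.

```c
#include <stdint.h>

/* Row step given as a stride: counted in int16_t elements. */
static int16_t load_by_stride(const int16_t *input, int stride, int r, int c) {
  return input[r * stride + c];
}

/* Row step given as a pitch: counted in bytes, so convert before indexing. */
static int16_t load_by_pitch(const int16_t *input, int pitch, int r, int c) {
  const int stride = pitch / (int)sizeof(int16_t);
  return input[r * stride + c];
}
```

For a buffer with 8 int16_t elements per row, a stride of 8 and a pitch of 16 address the same element.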
2013-10-15 | Removing unused 8x4 transform from the encoder. | Dmitry Kovalev
Change-Id: Icbcf68b5b685a56f255ebc3859c9692accdadf9e
2013-10-09 | Merge "Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16" | Jingning Han
2013-10-07 | Merge "cpplint vp9_variance_sse2.c" | Jim Bankoski
2013-10-05 | Merge "added nolint to function that doesn't seem easy to breakup" | Jim Bankoski
2013-10-04 | cpplint issues resolved in vp9_variance_mmx.c | Jim Bankoski
Change-Id: Idbfabe427fbeab44210f13fec8b6f63f7a4eb0dd
2013-10-04 | added nolint to function that doesn't seem easy to breakup | Jim Bankoski
Change-Id: I5489b116aea7c510ea5ebbed3c1445f321b05f3e
2013-10-04 | cpplint vp9_variance_sse2.c | Jim Bankoski
Change-Id: Ifce8f5b57a1ea8952e8a67c5b92a127a061899fa
2013-10-02 | Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16 | A.Mahfoodh
Simplify k_cvtlo_epi16 and k_cvthi_epi16 to only two instructions, then inline them. Quoting from Intel's MMX_App_Compute_16bit_Vector.pdf: "The PMADDWD instruction multiplies four pairs of 16-bit numbers and produces partial sums of the results and can do so once per clock (with a three-clock latency)." So I am assuming that there will be a three-clock overhead after the last _mm_madd_pi16 command. Even with the overhead, the number of clocks in general should be smaller. I am not sure, though, because I could not find information about the number of clocks required for the instructions in k_cvtlo_epi16 and k_cvthi_epi16. I will run a test and compare the execution time. Change-Id: Ieda4aa338f69ad3dd196ac6e7892da3cf1b47ea7
2013-09-23 | Number of instructions in fdct4_1d_sse2 reduced by two. | A.Mahfoodh
Mathematically the results are the same. Change-Id: I1c5126cd3ca64e8515ca6331e0989c6f7dd651a0
2013-09-06 | Fix overflow issue in 16x16 quantization SSSE3 | Jingning Han
The 16x16 transform unit test suggested that the peak coefficient value can reach 32639. This could cause an overflow in the SSSE3 implementation of 16x16 block quantization. This commit fixes the issue by replacing addition with saturated addition. Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e
2013-09-05 | Use saturated addition in SSSE3 of 32x32 quant | Jingning Han
The 32x32 forward transform can potentially reach a peak coefficient value close to 32700, while the rounding factor can go up to 610. This could cause an overflow in the SSSE3 implementation of the 32x32 quantization process. This commit resolves the issue by replacing the addition operations with saturated addition operations in 32x32 block quantization. Change-Id: Id6b98996458e16c5b6241338ca113c332bef6e70
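A scalar model of signed 16-bit saturated addition shows why it avoids the overflow described in the two quantization commits above. The function name sat_add16 is hypothetical; on x86 the per-lane equivalent is the paddsw instruction, exposed in C as the SSE2 intrinsic _mm_adds_epi16.

```c
#include <stdint.h>

/* Scalar model of one lane of a signed saturating 16-bit add: the result is
 * clamped to [INT16_MIN, INT16_MAX] instead of wrapping around.
 * Hypothetical helper for illustration only. */
static int16_t sat_add16(int16_t a, int16_t b) {
  const int32_t sum = (int32_t)a + b;
  if (sum > INT16_MAX) return INT16_MAX;
  if (sum < INT16_MIN) return INT16_MIN;
  return (int16_t)sum;
}
```

With the figures quoted above, a coefficient of 32700 plus a rounding factor of 610 saturates to 32767, whereas a wrapping 16-bit add would produce a negative value.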
2013-08-31 | Fix 32x32 forward transform SSE2 version | Jingning Han
This commit fixed the potential overflow issue in the SSE2 implementation of 32x32 forward DCT. It resolved the corrupted coded frames in the border of scenes. Change-Id: If87eef2d46209269f74ef27e7295b6707fbf56f9
2013-08-29 | Merge "Fix overflow issue in SSSE3 32x32 quantization" | Jingning Han
2013-08-29 | Fix overflow issue in SSSE3 32x32 quantization | Jingning Han
The 32x32 quantization process can potentially have the intermediate stacks over 16-bit range, thereby causing enc/dec mismatch. This commit fixes this overflow issue in the SSSE3 implementation, as well as the prototype, of 32x32 quantization. This fixes issue 607 from webm@googlecode. Change-Id: I85635e6ca236b90c3dcfc40d449215c7b9caa806
2013-08-27 | fixed the reading too many bytes | Yaowu Xu
In subpel_avg_variance functions, code similar to the following: punpkldq m2, [addr] actually reads 8 bytes. For functions that are supposed to work on buffers that have fewer than 8 bytes per line, this caused a valgrind error about reading uninitialized memory. Change-Id: I2a4c079dbdbc747829bd9e2ed85f0018ad2a3a34
2013-08-26 | Fix the reading of too many input pixels | Yaowu Xu
in VP9_get4x4var_mmx Change-Id: I4b4a8f45f25ebdfad281f169cc87aba5e2d6f227
2013-08-12 | SSE2 high precision 32x32 forward DCT | Jingning Han
Enable SSE2 implementation of high precision 32x32 forward DCT. The intermediate stacks are of 32-bits. The run-time goes down from 32126 cycles to 13442 cycles. Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56
2013-08-06 | Merge "Place holder for high-precision 32x32 fdct" | Jingning Han
2013-08-06 | variance x86inc guards | Jim Bankoski
also fixed bug in sad calcs Change-Id: I6571fcbe37556c16ae32be66dc0fd879852aac1d
2013-08-06 | Place holder for high-precision 32x32 fdct | Jingning Han
Resolve compile warnings on re-define FDCT32x32_2D template. Change-Id: Idb3a54ef8d2710ce7245b726379a0e5c875f5cad
2013-08-06 | Move fdct32x32 SSE2 implementation in separate file. | Christian Duvivier
This is in preparation for the SSE2 version of the high-precision 32x32 forward DCT which will share a lot of code with the existing low precision version used for rate-distortion search. Change-Id: I7084b6bdfb480b1fabb8493fb14e3f7fcc7888c0
2013-07-10 | Remove unused fwalsh/fdct x86 SIMD implementations. | Ronald S. Bultje
Change-Id: Ia942e56cf322821d42ba06178672791eeee2847e
2013-07-10 | SSE2 16x16 ADST/DCT hybrid transform | Jingning Han
This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2 operations. It reduces the runtime from 5433 cycles to 1621 cycles, at no compression performance loss. Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230
2013-07-05 | Merge "Refactor SSE2 8x8 functional units" | Jingning Han
2013-07-03 | Refactor SSE2 8x8 functional units | Jingning Han
These serve as building blocks for SSE2 8x8 and 16x16 ADST/DCT hybrid transform coding. Change-Id: I4089a754c66e0c986f67d9b8ec4dfb9627ad430d
2013-07-02 | Use pmovmskb to skip quantize loops over empty coefficients. | Ronald S. Bultje
If none of the 16 coefficients that we quantize per loop iteration are larger than the zbin, directly skip to the next round of coeffs, rather than doing a full quantize loop that will eventually result in 16 zeroes. This incurs a jump cost, but saves a lot of other work. 32x32 quant goes from 1349 -> 1184 cycles. The same approach yielded no significantly positive results for smaller transforms, so is not used there (8x8: 103 -> 101 cycles; 16x16: 302 -> 306 cycles). Change-Id: I8fca17dc2543fc8eed1dbcd5100145e3c3a9b647
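The early-out described above can be sketched in scalar C: test a group of 16 coefficients against the zbin first, and only run the full quantize path when at least one of them exceeds it. All names here are hypothetical, and the divide-by-step quantizer is a simplified stand-in, not the libvpx quantizer; in the SIMD version the group test would be a vector compare followed by pmovmskb.

```c
#include <stdint.h>
#include <stdlib.h>

enum { GROUP = 16 }; /* coefficients handled per loop iteration */

/* Returns 1 when no coefficient in the group is larger than zbin. */
static int group_is_empty(const int16_t *coeff, int zbin) {
  for (int i = 0; i < GROUP; ++i) {
    if (abs(coeff[i]) > zbin) return 0;
  }
  return 1;
}

/* Quantizes n coefficients (n must be a multiple of GROUP) with a simplified
 * stand-in quantizer and returns how many groups took the full path. */
static int quantize_with_skip(const int16_t *coeff, int n, int zbin, int step,
                              int16_t *out) {
  int full_groups = 0;
  for (int g = 0; g < n; g += GROUP) {
    if (group_is_empty(coeff + g, zbin)) {
      /* Early-out: the whole group quantizes to zero. */
      for (int i = 0; i < GROUP; ++i) out[g + i] = 0;
      continue;
    }
    ++full_groups;
    for (int i = 0; i < GROUP; ++i) {
      const int v = coeff[g + i];
      const int mag = abs(v) > zbin ? abs(v) / step : 0;
      out[g + i] = (int16_t)(v < 0 ? -mag : mag);
    }
  }
  return full_groups;
}
```

The returned count makes the saving visible: for sparse input, most groups never reach the inner quantize loop.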
2013-07-01 | Update quantize SSSE3 SIMD to cover 32x32 transform case also. | Ronald S. Bultje
Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to 2min10.1, i.e. a 2.3% overall speed increase. Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
2013-07-01 | Quantize (64-bit only, for now) SSSE3 SIMD. | Ronald S. Bultje
Total encoding time for the first 50 frames of bus (speed 0) @ 1500kbps goes from 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is x86-64 only; it needs some minor modifications to be 32-bit compatible, because it uses 15 xmm registers, whereas 32-bit only has 8. Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
2013-06-29 | Merge "Enable SSE2 4x4 ADST/DCT transform" | Jingning Han