path: root/vp9/encoder/x86
Age | Commit message | Author
2015-02-05 | Fix high bit depth assembly function bugs | Yunqing Wang
The high bit depth build failed for 32-bit targets. The bugs were in the functions in vp9_highbd_subpel_variance.asm and vp9_highbd_sad4d_sse2.asm. This patch fixes them so that the 32-bit build works. Change-Id: Idc8e5e1b7965bb70d4afba140c6583c5d9666b75
2015-01-27 | Fix issues in 32bit PIC enabled build | Yunqing Wang
This patch fixes issue 924: https://code.google.com/p/webm/issues/detail?id=924 The SECTION_RODATA macro was modified to support the macho32 format. The sub-pixel functions were modified to take 2 more parameters that handle the global offsets for PIC builds. Change-Id: I3bfcd336bcae945edf300bca4ab40376a2628cd4
2014-12-22 | Revert "Revert "Removal of legacy zbin_extra / zbin_oq_value."" | Jingning Han
This reverts commit 9946ee23e0a4c158e26a505b162a072f81b8a3be. Fix the ssse3 asm function. Change-Id: I07f77a63aa98087626e45c4e87aa5dcafc0b0b07
2014-12-19 | Revert "Removal of legacy zbin_extra / zbin_oq_value." | Paul Wilkins
This reverts commit e9b586e21bb899e247346e82bccf5afb42604910. Change-Id: I5b36e6727da6c05278d97e2c37b80c109f79bed4
2014-12-18 | Removal of legacy zbin_extra / zbin_oq_value. | Paul Wilkins
zbin_extra / zbin_oq_value were widely passed around, so their removal touches a lot of code. Change-Id: Idc94359735b60c38a160e4385ae09d5ca8b6b8e5
2014-12-08 | Merge "Changes to assembler for NASM on mac." | James Zern
2014-12-03 | sse2 visual studio build fix | Deb Mukherjee
Change-Id: Id8c8c3be882bcd92afea3ccec6ebdf3f208d28ef
2014-12-03 | Enable non-rd mode coding on key frame, for speed 6. | Marco
For key frames at speed 6: enable the non-rd mode selection in the speed setting and use the (non-rd) variance_based partition. Adjust some logic/thresholds in variance partition selection for key frames only (no change to delta frames), mainly to bias towards selecting smaller prediction blocks, and also set the max tx size to 16x16. Loss in key frame quality (~0.6-0.7dB) compared to rd coding, but speeds up key frame encoding by at least 6x. Average PSNR/SSIM metrics over RTC clips go down by ~1-2% for speed 6. Change-Id: Ie4845e0127e876337b9c105aa37e93b286193405
2014-12-02 | Added high bitdepth sse2 transform functions | Peter de Rivaz
Also removes some spurious changes in common/vp9_blockd.h which were introduced by a rebase issue between the nextgen and master branches. Change-Id: If359f0e9a71bca9c2ba685a87a355873536bb282 (cherry picked from commit 005d80cd05269a299cd2f7ddbc3d4d8b791aebba) (cherry picked from commit 08d2f548007fd8d6fd41da8ef7fdb488b6485af3) (cherry picked from commit 4230c2306c194c058f56433a5275aa02a2e71d56)
2014-11-24 | Changes to assembler for NASM on mac. | John Stark
fixes non-Apple nasm part of issue #755 Change-Id: I11955d270c4ee55e3c00e99f568de01b95e7ea9a
2014-11-21 | Merge "Added highbitdepth sse2 acceleration for quantize" | Debargha Mukherjee
2014-11-19 | Added highbitdepth sse2 acceleration for quantize | Peter de Rivaz
Also includes block error. (This patch is mostly cherry picked from commit db7192e0b014a331a1dcb102c8a1148e9f0e1081) Change-Id: Idef18f90b111a0d0c9546543d3347e551908fd78
2014-11-19 | Enable ssse3 version of vp9_fdct8x8_quant | Jingning Han
It improves on the speed of vp9_fdct8x8_quant_sse2 by about 5%. Change-Id: I74b093ba4d81df64caf71ac7693f3d917f673097
2014-11-18 | Combine fdct8x8 and quantization process | Jingning Han
This commit reworks the forward transform and quantization process for 8x8 block coding. It combines the two operations in a single function to save a store/load stage of the original transform coefficients. Overall, speed -6 is slightly faster (around 1%). The compression performance of speed -6 improves by 3.4%. Change-Id: Id6628daef123f3e4649248735ec2ad7423629387
2014-11-18 | Add sse2 version for vp9_quantize_fp | Jingning Han
vp9_quantize_fp is the quantization process used by the rtc coding mode. This commit adds an sse2 implementation of it, adapted from vp9_quantize_b_sse2. There is no speed difference from the ssse3 version. Change-Id: I24949c5b27df160b4f35117d28858d269454e64a
2014-11-14 | Added sse2 acceleration for highbitdepth variance | Peter de Rivaz
Change-Id: I446bdf3a405e4e9d2aa633d6281d66ea0cdfd79f (cherry picked from commit d7422b2b1eb9f0011a8c379c2be680d6892b16bc) (cherry picked from commit 6d741e4d76a7d9ece69ca117d1d9e2f9ee48ef8c)
2014-11-12 | Added highbitdepth sse2 SAD acceleration and tests | Peter de Rivaz
Change-Id: I1a74a1b032b198793ef9cc526327987f7799125f (cherry picked from commit b1a6f6b9cb47eafe0ce86eaf0318612806091fe5)
2014-11-05 | Fix visual studio 2013 compiler warnings | Yaowu Xu
For builds configured with --enable-vp9-highbitdepth. Change-Id: I2b181519d7192f8d7a241ad5760c3578255f24e6
2014-10-28 | vp9_denoiser_sse2: refactor the code. | JackyChen
Combined vp9_denoiser_8xM_sse2 and vp9_denoiser_4xM_sse2 into one function vp9_denoiser_NxM_sse2_small and passed the bitexact testing. Changed the name of the function vp9_denoiser_64_32_16xM_sse2 to vp9_denoiser_NxM_sse2_big. Change-Id: Ib22478df585994dd347ebae04202c0b701e7f451
2014-10-22 | Merge "vp9_denoiser_sse2.c: improve code style." | JackyChen
2014-10-22 | vp9_denoiser_sse2.c: improve code style. | JackyChen
denoiser_sse2.c: fix typos in comment. Change-Id: Ic0fb102331b0e533c058da3cab1fbc30de9a0070
2014-10-20 | Merge "SAD32xh and SAD64xh for AVX2" | Yunqing Wang
2014-10-19 | SAD32xh and SAD64xh for AVX2 | levytamar82
All SAD functions that process 32 or more consecutive elements are optimized for AVX2: vp9_sad64x64, vp9_sad64x32, vp9_sad32x64, vp9_sad32x32, vp9_sad32x16, vp9_sad64x64_avg, vp9_sad64x32_avg, vp9_sad32x64_avg, vp9_sad32x32_avg, vp9_sad32x16_avg. The functions that appeared as hotspots are vp9_sad32x32 and vp9_sad64x64. vp9_sad32x32 was optimized by 68% and vp9_sad64x64 by 90%; together they gave an overall ~2.3% user-level gain. Change-Id: Iccf86b375a2b54c5fbbe685902ead0c9a561b9fd
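For readers unfamiliar with the operation being optimized in the commit above, SAD is the sum of absolute differences between a source block and a reference block. The following is a minimal scalar sketch, not the libvpx code: the name `sad_ref` and its parameter list are assumptions made for illustration, and an AVX2 version would be expected to produce the same totals while handling 32 bytes per instruction.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar reference, not the libvpx implementation:
 * sum of absolute differences over a width x height block.
 * Strides are in bytes and may exceed the block width. */
static unsigned int sad_ref(const uint8_t *src, int src_stride,
                            const uint8_t *ref, int ref_stride,
                            int width, int height) {
  unsigned int sad = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      /* uint8_t operands promote to int, so the difference can be negative. */
      sad += (unsigned int)abs(src[x] - ref[x]);
    }
    src += src_stride;
    ref += ref_stride;
  }
  return sad;
}
```

A scalar routine like this is the usual baseline for checking that a SIMD version is bit-exact, for example by calling it with width 32 and height 32 on the same buffers.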
2014-10-17 | vp9_denoiser_sse2.c: solve windows build error. | JackyChen
Change-Id: Ib5df91c8580d5dbeb0b3554edc9c2ca906ba4c4d
2014-10-17 | Merge "vp9_denoiser_sse2.c: eliminate gcc warnings" | James Zern
2014-10-17 | vp9_denoiser_sse2.c: eliminate gcc warnings | JackyChen
Change-Id: I5f63f48e11e31ea9951223c5b18f42a2471e4560
2014-10-14 | Add a 32-bit friendly sse2 quantizer. | Alex Converse
This is based on the 64-bit ssse3 quantizer. 1.1x speedup for screen content at speed 7. Change-Id: I57d15415ef97c49165954bbe3daaaf9318e37448
2014-10-10 | vp9_avg_intrin_sse2: correct intrinsics include | James Zern
immintrin.h -> emmintrin.h. Fixes the build where newer intrinsics are unavailable. Change-Id: I79311b39bfa782fc2abeb45884ecb417050cb9f8
2014-10-07 | experimental : partition using 1/8 x 1/8 image | Jim Bankoski
The concept: there is too much noise in the source pixels for variance, and at low bitrate the reconstructed frame looks nothing like the source, so we have problems getting good partitionings with either. This skirts the issue by using a box-blurred, scaled-down version for variance calculations. To compare against source_var_, moved keyframe to be rd based like source_var. Change-Id: Ie3babdbfadae324b7b5a76bea192893af27f0624
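The box-blur downscale described in the commit above can be pictured as replacing every 8x8 block of source pixels with its mean, which yields an image 1/8 the width and 1/8 the height. The sketch below is an assumption about the general technique, not the code from this commit; `downscale_8x8_box` and its signature are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not taken from libvpx: each output pixel is the
 * rounded mean of one 8x8 block of input pixels (a box blur followed by
 * 8x decimation in each dimension). */
static void downscale_8x8_box(const uint8_t *src, int src_stride,
                              int width, int height,
                              uint8_t *dst, int dst_stride) {
  /* For simplicity this sketch only handles dimensions divisible by 8. */
  assert(width % 8 == 0 && height % 8 == 0);
  for (int by = 0; by < height / 8; ++by) {
    for (int bx = 0; bx < width / 8; ++bx) {
      const uint8_t *block = src + (by * 8) * src_stride + bx * 8;
      unsigned int sum = 0;
      for (int y = 0; y < 8; ++y) {
        for (int x = 0; x < 8; ++x) {
          sum += block[y * src_stride + x];
        }
      }
      /* 64 pixels per block: add 32 for rounding, then divide by 64. */
      dst[by * dst_stride + bx] = (uint8_t)((sum + 32) >> 6);
    }
  }
}
```

Averaging 64 pixels suppresses per-pixel noise, which is the property the commit message relies on when it computes variance on the reduced image.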
2014-10-06 | Add SSE2 code and unit test for VP9 denoiser. | JackyChen
This SSE2 code is based on the VP8 denoiser's SSE2 code. In VP8, the denoiser only handles 16x16 blocks, while in VP9 there are 13 different block sizes. Adding this SSE2 code improves encoder speed by around 20% (C code vs. SSE2 code), varying across clips. The unit test for the VP9 denoiser confirms that the SSE2 code is bit-exact with the C code. The unit test covers all block sizes. Change-Id: Ic8d8ac26db4ea40a5f146b5678a065af07eaaa3d
2014-09-06 | Replacing vp9_get_mb_ss_sse2 asm implementation with intrinsics. | Dmitry Kovalev
Change-Id: Ib4f5dd733eb2939b108070a01e83da5d9990bac0
2014-09-03 | Adding sse2 variant for vp9_mse{8x8, 8x16, 16x8}. | Dmitry Kovalev
Change-Id: I6786d25ce4f32b8d8912f2d239a45ca15b310c4b
2014-09-03 | Merge "Replacing asm 16x16 variance calculation with intrinsics." | Dmitry Kovalev
2014-09-03 | Merge "Cleaning up vp9_variance_avx2.c." | Dmitry Kovalev
2014-09-02 | Removing duplicated code. | Dmitry Kovalev
Change-Id: I7b5c776d5e6f5ca428b87fa9411ae4012a9538ba
2014-09-02 | Removing MMX SAD calculation code. | Dmitry Kovalev
Removed functions: * vp9_sad_16x16_mmx * vp9_sad_8x16_mmx * vp9_sad_16x8_mmx * vp9_sad_8x8_mmx * vp9_sad_4x4_mmx Change-Id: Ic5174b93b64d65d846f0c11e72cab149e9472bc3
2014-09-02 | Replacing asm 16x16 variance calculation with intrinsics. | Dmitry Kovalev
New code is 20% faster for 64-bit and 15% faster for 32-bit. Compiled using clang. Change-Id: Icfea461238411001fd093561293dbfedfbf8d0bb
2014-09-02 | Cleaning up vp9_variance_avx2.c. | Dmitry Kovalev
Change-Id: I75eb47dd21f87015efd673dbd2aa71f4386afdf5
2014-08-29 | Replacing asm 8x8 variance calculation with intrinsics. | Dmitry Kovalev
New code is 10% faster for 64-bit and 25% faster for 32-bit. Compiled using clang. Change-Id: I8ba1544c30dd6f3ca479db806384317549650dfc
2014-08-29 | Removing variance MMX code. | Dmitry Kovalev
Removed functions: * vp9_mse16x16_mmx * vp9_get_mb_ss_mmx * vp9_get4x4var_mmx * vp9_get8x8var_mmx * vp9_variance4x4_mmx * vp9_variance8x8_mmx * vp9_variance16x16_mmx * vp9_variance16x8_mmx * vp9_variance8x16_mmx They all have SSE2 equivalents. Change-Id: I3796f2477c4f59b35b4828f46a300c16e62a2615
2014-08-28 | Implementing 4x4 variance calculation with SSE2. | Dmitry Kovalev
The new SSE2 function is three times faster than the MMX one. Change-Id: I4f387ce9f75b88379176ec7bdc62d86eb5f70fbe
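As background for the commit above, a 4x4 block variance is commonly computed from two accumulators over the 16 pixel differences: the sum of squared differences (SSE) and the plain sum. The sketch below shows that arithmetic in scalar C under stated assumptions; `variance4x4_ref` is a hypothetical name, and the code is a reference illustration rather than the SSE2 function the commit adds.

```c
#include <stdint.h>

/* Hypothetical scalar reference, not the libvpx SSE2 code.
 * Returns SSE - sum^2 / 16 for a 4x4 block and stores SSE in *sse. */
static unsigned int variance4x4_ref(const uint8_t *src, int src_stride,
                                    const uint8_t *ref, int ref_stride,
                                    unsigned int *sse) {
  int sum = 0;          /* signed: individual differences may be negative */
  unsigned int sq = 0;  /* sum of squared differences */
  for (int y = 0; y < 4; ++y) {
    for (int x = 0; x < 4; ++x) {
      const int diff = src[x] - ref[x];
      sum += diff;
      sq += (unsigned int)(diff * diff);
    }
    src += src_stride;
    ref += ref_stride;
  }
  *sse = sq;
  /* 16 pixels, so dividing sum^2 by the pixel count is a shift by 4.
   * |sum| <= 16 * 255, so sum * sum fits comfortably in an int. */
  return sq - (unsigned int)((sum * sum) >> 4);
}
```

With a single difference of 4 and fifteen zeros, SSE is 16, the sum is 4, and the returned value is 16 - (16 >> 4) = 15.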
2014-08-20 | Fix def pairs in 32x32 2D-DCT sse2 | Jingning Han
Properly pair the def/undef order. Change-Id: I9736a6f8d2efc075b1d72dafc75b9350d055cf65
2014-08-14 | 32 Align Load bug | levytamar82
In sub_pixel_avg_variance, the parameter sec was also read with an aligned load; it is changed to an unaligned load. Change-Id: I4d4966e0291059ea4d705baed1503dc58444fcb7
2014-08-07 | Fix bug 807 | levytamar82
In the sub_pixel_*variance* functions, dst is aligned to 16 bytes and not to 32 bytes; the data is now loaded unaligned. Change-Id: I2e0b9745543697efc56fefa32857ea10117af135
2014-08-07 | Fix bug 806 | levytamar82
In the functions sad32x32x4d and sad64x64x4d, the source is aligned to 16 bytes and not to 32 bytes; the load is now unaligned. Change-Id: I922fdba56d0936b5cf72e4503519f185645a168c
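The alignment fixes above come down to one fact: an address that is a multiple of 16 is not necessarily a multiple of 32, and aligned 256-bit loads (for instance AVX2's `_mm256_load_si256`, as opposed to `_mm256_loadu_si256`) require the latter. The portable sketch below only demonstrates the address arithmetic; `is_aligned` is a hypothetical helper, not a libvpx function, and it assumes the usual flat mapping from pointers to `uintptr_t`.

```c
#include <stdint.h>

/* Hypothetical helper: returns nonzero when p sits on an `alignment`-byte
 * boundary. `alignment` must be a power of two. */
static int is_aligned(const void *p, uintptr_t alignment) {
  return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

For a buffer that starts on a 32-byte boundary, the address 16 bytes in passes `is_aligned(p, 16)` but fails `is_aligned(p, 32)`, which is exactly the case where an aligned 32-byte load would fault and an unaligned load is needed.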
2014-07-28 | Fix bug 805 | levytamar82
Remove all the redundant dct functions (dct4x4, dct8x8) in avx2 except dct32x32. Those functions were originally copied from dct_sse2. Change-Id: I742576fbf5175f3ac09f2076976a9247b259323e
2014-07-08 | Re-design quantization process for 32x32 transform block | Jingning Han
This commit enables a new quantization process for 32x32 2D-DCT transform coefficient blocks. It improves the compression performance of speed 5 by 1.4%. The overall compression gain of speed 5 due to the new quantization scheme is 4.7%. It also includes the SSSE3 implementation of the 32x32 quantization process. Change-Id: I0855b124fd6462418683f783f5bcb44255c9993b
2014-07-07 | Tune SSSE3 implementation of fast path quantization | Jingning Han
This commit further simplifies the SSSE3 implementation of the fast path quantization process. Change-Id: I5be3286ec0f1bd81d1cf5be3168fece6384fb9ca
2014-07-01 | Re-design quantization process | Jingning Han
This commit re-designs the quantization process for transform coefficient blocks of size 4x4 to 16x16. It improves compression performance for speed 7 by 3.85%. The SSSE3 version for the new quantization process is included. The average runtime of the 8x8 block quantization is reduced from 285 cycles -> 255 cycles, i.e., over 10% faster. Change-Id: I61278aa02efc70599b962d3314671db5b0446a50
2014-06-12 | Merge "Fast computation path for forward transform and quantization" | Jingning Han