summaryrefslogtreecommitdiff
path: root/vp9/encoder/x86
AgeCommit message (Collapse)Author
2015-04-01Merge "Optimize quantization simd implementation"Jingning Han
2015-04-01Refactor block_yrd function for RTC coding modeJingning Han
This commit separates Hadamard transform/quantization operations from rate and distortion computation in block_yrd. This allows one to skip SATD computation when all transform blocks are quantized to zero. It also uses a new block error function that skips repeated computation of sum of squared residuals. It reduces the CPU cycles spent on block error calculation in block_yrd by 40%. Change-Id: I726acb2454b44af1c3bd95385abecac209959b10
2015-04-01Optimize quantization simd implementationJingning Han
This commit allows the quantizer to compare the AC coefficients to the quantization step size to determine if further multiplication operations are needed. It makes the quantization process 20% faster without coding statistics change. Change-Id: I735aaf6a9c0874c82175bb565b20e131464db64a
2015-03-31Use aligned copy in 8x8 Hadamard transform SSE2Jingning Han
This reduces the 8x8 Hadamard transform cycles by 20%. Change-Id: If34c5e02f3afa42244c6efabe121f7cf5d2df41b
2015-03-30Fix 8x8 Hadamard SSE2 implementationJingning Han
This commit fixes the SSE2 version 8x8 Hadamard transform alignment and makes it consistent with the C version. Change-Id: I1304e5f97e0e5ef2d798fe38081609c39f5bfe74
2015-03-30Enable 16x16 Hadamard transform in SATD based mode decisionJingning Han
This commit replaces the 16x16 2D-DCT transform with Hadamard transform for RTC coding mode. It reduces the CPU cycles cost on 16x16 transform by 5X. Overall it makes the speed -6 encoding speed 1.5% faster without compromise on compression performance. Change-Id: If6c993831dc4c678d841edc804ff395ed37f2a1b
2015-03-30Hadamard transform based coding mode decision processJingning Han
This commit uses Hadamard transform based rate-distortion cost estimate for rtc coding mode decision. It improves the compression performance of speed -6 for many hard clips at lower bit-rates. For example, 5.5% for jimredvga, 6.7% for mmmoving, 6.1% for niklas720p. This will introduce extra encoding cycle costs at this point. Change-Id: Iaf70634fa2417a705ee29f2456175b981db3d375
2015-03-18vp9_fdct8x8_quant_ssse3: quiet a static analysis warningJames Zern
add an assert to validate 'in' array size Change-Id: Ie5a24275c066d9dd59714f6104510abbd4850dc5
2015-03-18vp9_fdct8x8_quant_sse2: quiet a static analysis warningJames Zern
add an assert to validate 'in' array size Change-Id: Ib72946a86f34e1ce8a69954e8e3e4fe1a0f18a91
2015-03-16Refactor column integral projection computationJingning Han
Move the scaling factor outside column projection. This avoids repeated calculation of the same scaling factor. Profiling shows that the percentage of vp9_int_pro_col_sse2 of overall cycles goes from 2.29% down to 1.88%. Change-Id: I5ac4e324ab2d7f33ba2de66dd2a12e04e04dfd66
2015-03-12Fix fdct8x8_quant ssse3 overflow issueJingning Han
This resolves webm issue 968. Change-Id: Ieb363129b1e135a561141c68211d413226aba754
2015-03-11Apply fast motion search to golden reference frameJingning Han
This commit enables the rtc coding mode to run integral projection based motion search for golden reference frame. It improves the speed -6 compression performance by 1.1% on average, 3.46% for jimred_vga, 6.46% for tacomascmvvga, and 0.5% for vidyo clips. The speed -6 is about 6% slower. Change-Id: I0fe402ad2edf0149d0349ad304ab9b2abdf0c804
2015-03-03Scale the normalization factor depending on the block sizeJingning Han
Change-Id: I0a26994bf65ea224e496b09af2ce71e1a4210433
2015-03-01Use variance metric for integral projection vector matchJingning Han
This commit replaces the SAD with variance as metric for the integral projection vector match. It improves the search accuracy in the presence of slight light change. The average speed -6 compression performance for rtc set is improved by 1.7%. No speed changes are observed for the test clips. Change-Id: I71c1d27e42de2aa429fb3564e6549bba1c7d6d4d
2015-02-26Refactor integral projection based motion estimationJingning Han
Support variable block size integral projection based motion estimation. Change-Id: Iee6d65e44df4480aa13fb7b84b9c91914b89caa1
2015-02-25Merge "Fix ssse3 quantize_fp functions while skip=1"Yunqing Wang
2015-02-24Fix fwd transform sse2 build issue on older gcc versionJingning Han
Change-Id: I3e0e53d129552babf29e6c5d047483733983973c
2015-02-24Fix ssse3 quantize_fp functions while skip=1Yunqing Wang
In ssse3 functions, DEFINE_ARGS macro hard codes qcoeff and dqcoeff to r3 and r4. If skip is 1, qcoeff and dqcoeff need to be loaded from the stack, which doesn't work because of the above definitions. Currently, skip=1 case is not used in the encoder. This patch fixed the issue, so it can be turned on later. Change-Id: I998d696b1a7a85dca2b3bcee790b21c21e039147
2015-02-19Integral projection based motion estimationJingning Han
This commit introduces a new block match motion estimation using integral projection measurement. The 2-D block and the nearby region is projected onto the horizontal and vertical 1-D vectors, respectively. It then runs vector match, instead of block match, over the two separate 1-D vectors to locate the motion compensated reference block. This process is run per 64x64 block to align the reference before choosing partitioning in speed 6. The overall CPU cycle cost due to this additional 64x64 block match (SSE2 version) takes around 2% at low bit-rate rtc speed 6. When strong motion activities exist in the video sequence, it substantially improves the partition selection accuracy, thereby achieving better compression performance and lower CPU cycles. The experiments were tested in RTC speed -6 setting: cloud 1080p 500 kbps 17006 b/f, 37.086 dB, 5386 ms -> 16669 b/f, 37.970 dB, 5085 ms (>0.9dB gain and 6% faster) pedestrian_area 1080p 500 kbps 53537 b/f, 36.771 dB, 18706 ms -> 51897 b/f, 36.792 dB, 18585 ms (4% bit-rate savings) blue_sky 1080p 500 kbps 70214 b/f, 33.600 dB, 13979 ms -> 53885 b/f, 33.645 dB, 10878 ms (30% bit-rate savings, 25% faster) jimred 400 kbps 13380 b/f, 36.014 dB, 5723 ms -> 13377 b/f, 36.087 dB, 5831 ms (2% bit-rate savings, 2% slower) Change-Id: Iffdb6ea5b16b77016bfa3dd3904d284168ae649c
2015-02-05Fix high bit depth assembly function bugsYunqing Wang
The high bit depth build failed while building for 32bit target. The bugs were in vp9_highbd_subpel_variance.asm and vp9_highbd_sad4d_sse2.asm functions. This patch fixed the bugs, and made 32bit build work. Change-Id: Idc8e5e1b7965bb70d4afba140c6583c5d9666b75
2015-01-27Fix issues in 32bit PIC enabled buildYunqing Wang
This patch was to fix issue 924: https://code.google.com/p/webm/issues/detail?id=924 The SECTION_RODATA macro was modified to support macho32 format. The sub-pixel functions were modified to pass in 2 more parameters to handle the global offsets for PIC build. Change-Id: I3bfcd336bcae945edf300bca4ab40376a2628cd4
2014-12-22Revert "Revert "Removal of legacy zbin_extra / zbin_oq_value.""Jingning Han
This reverts commit 9946ee23e0a4c158e26a505b162a072f81b8a3be. Fix the ssse3 asm function. Change-Id: I07f77a63aa98087626e45c4e87aa5dcafc0b0b07
2014-12-19Revert "Removal of legacy zbin_extra / zbin_oq_value."Paul Wilkins
This reverts commit e9b586e21bb899e247346e82bccf5afb42604910. Change-Id: I5b36e6727da6c05278d97e2c37b80c109f79bed4
2014-12-18Removal of legacy zbin_extra / zbin_oq_value.Paul Wilkins
zbin extra / zbin_oq_value was widely passed around, hence removal touches a lot of code. Change-Id: Idc94359735b60c38a160e4385ae09d5ca8b6b8e5
2014-12-08Merge "Changes to assembler for NASM on mac."James Zern
2014-12-03sse2 visual studio build fixDeb Mukherjee
Change-Id: Id8c8c3be882bcd92afea3ccec6ebdf3f208d28ef
2014-12-03Enable non-rd mode coding on key frame, for speed 6.Marco
For key frame at speed 6: enable the non-rd mode selection in speed setting and use the (non-rd) variance_based partition. Adjust some logic/thresholds in variance partition selection for key frame only (no change to delta frames), mainly to bias to selecting smaller prediction blocks, and also set max tx size of 16x16. Loss in key frame quality (~0.6-0.7dB) compared to rd coding, but speeds up key frame encoding by at least 6x. Average PNSR/SSIM metrics over RTC clips go down by ~1-2% for speed 6. Change-Id: Ie4845e0127e876337b9c105aa37e93b286193405
2014-12-02Added high bitdepth sse2 transform functionsPeter de Rivaz
Also removes some spurious changes in common/vp9_blockd.h which was introduced by a rebase issue between nextgen and master branches. Change-Id: If359f0e9a71bca9c2ba685a87a355873536bb282 (cherry picked from commit 005d80cd05269a299cd2f7ddbc3d4d8b791aebba) (cherry picked from commit 08d2f548007fd8d6fd41da8ef7fdb488b6485af3) (cherry picked from commit 4230c2306c194c058f56433a5275aa02a2e71d56)
2014-11-24Changes to assembler for NASM on mac.John Stark
fixes non-Apple nasm part of issue #755 Change-Id: I11955d270c4ee55e3c00e99f568de01b95e7ea9a
2014-11-21Merge "Added highbitdepth sse2 acceleration for quantize"Debargha Mukherjee
2014-11-19Added highbitdepth sse2 acceleration for quantizePeter de Rivaz
Also includes block error. (This patch is mostly cherry picked from commit db7192e0b014a331a1dcb102c8a1148e9f0e1081) Change-Id: Idef18f90b111a0d0c9546543d3347e551908fd78
2014-11-19Enable ssse3 version of vp9_fdct8x8_quantJingning Han
It improves the speed performance of vp9_fdct8x8_quant_sse2 by about 5%. Change-Id: I74b093ba4d81df64caf71ac7693f3d917f673097
2014-11-18Combine fdct8x8 and quantization processJingning Han
This commit reworks the forward transform and quantization process for 8x8 block coding. It combines the two operations in a single function to save a store/load stage of the original transform coefficients. Overall the speed -6 is slightly faster (around 1% range). The compression performance of speed -6 is improved by 3.4%. Change-Id: Id6628daef123f3e4649248735ec2ad7423629387
2014-11-18Add sse2 version for vp9_quantize_fpJingning Han
vp9_quantize_fp is the quantization process used by rtc coding mode. This commit adds a sse2 implementation of it. The implementation is modified based on vp9_quantize_b_sse2. No speed difference from ssse3 version. Change-Id: I24949c5b27df160b4f35117d28858d269454e64a
2014-11-14Added sse2 acceleration for highbitdepth variancePeter de Rivaz
Change-Id: I446bdf3a405e4e9d2aa633d6281d66ea0cdfd79f (cherry picked from commit d7422b2b1eb9f0011a8c379c2be680d6892b16bc) (cherry picked from commit 6d741e4d76a7d9ece69ca117d1d9e2f9ee48ef8c)
2014-11-12Added highbitdepth sse2 SAD acceleration and testsPeter de Rivaz
Change-Id: I1a74a1b032b198793ef9cc526327987f7799125f (cherry picked from commit b1a6f6b9cb47eafe0ce86eaf0318612806091fe5)
2014-11-05Fix visual studio 2013 compiler warningsYaowu Xu
For configured with --enable-vp9-highbitdepth Change-Id: I2b181519d7192f8d7a241ad5760c3578255f24e6
2014-10-28vp9_denoiser_sse2: refactor the code.JackyChen
Combined vp9_denoiser_8xM_sse2 and vp9_denoiser_4xM_sse2 into one function vp9_denoiser_NxM_sse2_small and passed the bitexact testing. Changed the name of the function vp9_denoiser_64_32_16xM_sse2 to vp9_denoiser_NxM_sse2_big. Change-Id: Ib22478df585994dd347ebae04202c0b701e7f451
2014-10-22Merge "vp9_denoiser_sse2.c: improve code style."JackyChen
2014-10-22vp9_denoiser_sse2.c: improve code style.JackyChen
denoiser_sse2.c: fix typos in comment. Change-Id: Ic0fb102331b0e533c058da3cab1fbc30de9a0070
2014-10-20Merge "SAD32xh and SAD64xh for AVX2"Yunqing Wang
2014-10-19SAD32xh and SAD64xh for AVX2levytamar82
All sad function that process above 32 consecutive elements are optimized for AVX2: vp9_sad64x64 vp9_sad64x32 vp9_sad32x64 vp9_sad32x32 vp9_sad32x16 vp9_sad64x64_avg vp9_sad64x32_avg vp9_sad32x64_avg vp9_sad32x32_avg vp9_sad32x16_avg The functions that appeared as a hotspot is vp9_sad32x32 and vp9_sad64x64 vp9_sad32x32 was optimized by 68% and vp9_sad64x64 was optimized by 90% both of them gave and overall ~2.3% user level gain Change-Id: Iccf86b375a2b54c5fbbe685902ead0c9a561b9fd
2014-10-17vp9_denoiser_sse2.c: solve windows build error.JackyChen
Change-Id: Ib5df91c8580d5dbeb0b3554edc9c2ca906ba4c4d
2014-10-17Merge "vp9_denoiser_sse2.c: eliminate gcc warnings"James Zern
2014-10-17vp9_denoiser_sse2.c: eliminate gcc warningsJackyChen
Change-Id: I5f63f48e11e31ea9951223c5b18f42a2471e4560
2014-10-14Add a 32-bit friendly sse2 quantizer.Alex Converse
This is based on the 64-bit ssse3 quantizer. 1.1x speedup for screen content at speed 7. Change-Id: I57d15415ef97c49165954bbe3daaaf9318e37448
2014-10-10vp9_avg_intrin_sse2: correct intrinsics includeJames Zern
immintrin.h -> emmintrin.h fixes build where newer intrinsics are unavailable Change-Id: I79311b39bfa782fc2abeb45884ecb417050cb9f8
2014-10-07experimental : partition using 1/8 x 1/8 imageJim Bankoski
The concept: There's too much noise in source pixels for variance and at low bitrate the reconstructed looks nothing like the source so we have problems getting good partitionings with either. This skirts the issue by using a box blur scaled down version for variance calculations. To compare against source_var_ moved keyframe to be rd based like source_var. Change-Id: Ie3babdbfadae324b7b5a76bea192893af27f0624
2014-10-06Add SSE2 code and unit test for VP9 denoiser.JackyChen
This SSE2 is based on VP8 denoiser's SSE2 code. In VP8, there are only 16x16 blocks in denoiser, while in VP9, there are 13 different block sizes. By adding this SSE2 code, the improvement of encoder speed is around 20%(using C code vs using SSE2 code), vary for different clips. The unit test for VP9 denoiser is to confirm that the SSE2 code is bit-exact with the C code. The unit test covers all block size. Change-Id: Ic8d8ac26db4ea40a5f146b5678a065af07eaaa3d
2014-09-06Replacing vp9_get_mb_ss_sse2 asm implementation with intrinsics.Dmitry Kovalev
Change-Id: Ib4f5dd733eb2939b108070a01e83da5d9990bac0