summaryrefslogtreecommitdiff
path: root/vp9/encoder/x86
AgeCommit message (Collapse)Author
2013-09-23Number of instructions in fdct4_1d_sse2 reduced by two.A.Mahfoodh
Mathematically the results are the same. Change-Id: I1c5126cd3ca64e8515ca6331e0989c6f7dd651a0
2013-09-06Fix overflow issue in 16x16 quantization SSSE3Jingning Han
The 16x16 transform unit test suggested that the peak coefficient value can reach 32639. This could cause potential overflow issue in the SSSE3 implmentation of 16x16 block quantization. This commit fixes this issue by replacing addition with saturated addition. Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e
2013-09-05Use saturated addition in SSSE3 of 32x32 quantJingning Han
The 32x32 forward transform can potentially reach peak coefficient value close to 32700, while the rounding factor can go upto 610. This could cause overflow issue in the SSSE3 implementation of 32x32 quantization process. This commit resolves this issue by replacing the addition operations with saturated addition operations in 32x32 block quantization. Change-Id: Id6b98996458e16c5b6241338ca113c332bef6e70
2013-08-31Fix 32x32 forward transform SSE2 versionJingning Han
This commit fixed the potential overflow issue in the SSE2 implementation of 32x32 forward DCT. It resolved the corrupted coded frames in the border of scenes. Change-Id: If87eef2d46209269f74ef27e7295b6707fbf56f9
2013-08-29Merge "Fix overflow issue in SSSE3 32x32 quantization"Jingning Han
2013-08-29Fix overflow issue in SSSE3 32x32 quantizationJingning Han
The 32x32 quantization process can potentially have the intermediate stacks over 16-bit range, thereby causing enc/dec mismatch. This commit fixes this overflow issue in the SSSE3 implementation, as well as the prototype, of 32x32 quantization. This fixes issue 607 from webm@googlecode. Change-Id: I85635e6ca236b90c3dcfc40d449215c7b9caa806
2013-08-27fixed the reading too many bytesYaowu Xu
In subpel_avg_variance functions, code similar to the following punpkldq m2, [addr] actually reads 8 bytes. For functions that are supposed to work on buffers only have less 8 bytes a line, this caused valgrind error of reading uninitialized memory. Change-Id: I2a4c079dbdbc747829bd9e2ed85f0018ad2a3a34
2013-08-26Fix the reading of too many input pixelsYaowu Xu
in VP9_get4x4var_mmx Change-Id: I4b4a8f45f25ebdfad281f169cc87aba5e2d6f227
2013-08-12SSE2 high precision 32x32 forward DCTJingning Han
Enable SSE2 implementation of high precision 32x32 forward DCT. The intermediate stacks are of 32-bits. The run-time goes down from 32126 cycles to 13442 cycles. Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56
2013-08-06Merge "Place holder for high-precision 32x32 fdct"Jingning Han
2013-08-06variance x86inc guardsJim Bankoski
also fixed bug in sad calcs Change-Id: I6571fcbe37556c16ae32be66dc0fd879852aac1d
2013-08-06Place holder for high-precision 32x32 fdctJingning Han
Resolve compile warnings on re-define FDCT32x32_2D template. Change-Id: Idb3a54ef8d2710ce7245b726379a0e5c875f5cad
2013-08-06Move fdct32x32 SSE2 implementation in separate file.Christian Duvivier
This is in preparation for the SSE2 version of the high-precision 32x32 forward DCT which will share a lot of code with the existing low precision version used for rate-distortion search. Change-Id: I7084b6bdfb480b1fabb8493fb14e3f7fcc7888c0
2013-07-10Remove unused fwalsh/fdct x86 SIMD implementations.Ronald S. Bultje
Change-Id: Ia942e56cf322821d42ba06178672791eeee2847e
2013-07-10SSE2 16x16 ADST/DCT hybrid transformJingning Han
This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2 operations. It reduces the runtime from 5433 cycles to 1621 cycles, at no compression performance loss. Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230
2013-07-05Merge "Refactor SSE2 8x8 functional units"Jingning Han
2013-07-03Refactor SSE2 8x8 functional unitsJingning Han
These serve as building blocks for SSE2 8x8 and 16x16 ADST/DCT hybrid transform coding. Change-Id: I4089a754c66e0c986f67d9b8ec4dfb9627ad430d
2013-07-02Use pmovmskb to skip quantize loops over empty coefficients.Ronald S. Bultje
If none of the 16 coefficients that we quantize per loop iteration are larger than the zbin, directly skip to the next round of coeffs, rather than doing a full quantize loop that will eventually result in 16 zeroes. This incurs a jump cost, but saves a lot of other work. 32x32 quant goes from 1349 -> 1184 cycles. The same approach yielded no significantly positive results for smaller transforms, so is not used there (8x8: 103 -> 101 cycles; 16x16: 302 -> 306 cycles). Change-Id: I8fca17dc2543fc8eed1dbcd5100145e3c3a9b647
2013-07-01Update quantize SSSE3 SIMD to cover 32x32 transform case also.Ronald S. Bultje
Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to 2min10.1, i.e. a 2.3% overall speed increase. Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
2013-07-01Quantize (64-bit only, for now) SSSE3 SIMD.Ronald S. Bultje
Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is x86-64 only, it needs some minor modifications to be 32bit compatible, because it uses 15 xmm registers, whereas 32bit only has 8. Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
2013-06-29Merge "Enable SSE2 4x4 ADST/DCT transform"Jingning Han
2013-06-29SSE2 version of vp9_short_fdct32x32_rd.Christian Duvivier
43,000 -> 5,750 cycles, about 7.5x faster. Change-Id: Ibfd92821b9603f4ed9c256e0ececec14fa4565d0
2013-06-28Enable SSE2 4x4 ADST/DCT transformJingning Han
This commit enables SSE2 4x4 foward hybrid transform. The runtime goes from 249 cycles down to 74 cycles. Overall around 2% speed-up at no compression performance change. Change-Id: Iad4d526346e05c7be896466c05500711bb763660
2013-06-28Fix switch statement in 8x8 transformJingning Han
Change-Id: I7c46354c4983feb5f6202c3ab4a1d9534da7e30f
2013-06-28Make coefficient skip condition an explicit RD choice.Ronald S. Bultje
This commit replaces zrun_zbin_boost, a method of biasing non-zero coefficients following runs of zero-coefficients to be rounded towards zero, with an explicit skip-block choice in the RD loop. The logic is basically that if individual coefficients should be rounded towards zero (from a RD point of view), the trellis/optimize loop should take care of it. If whole blocks should be zero (from a RD point of view), a single RD check is much more efficient than a complete serialization of the quantization loop. Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim. SIMD for quantize will follow in a separate patch. Results for other test sets pending. Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4
2013-06-26fixed a compiling problem with MSVC win32 buildYaowu Xu
The aligned array in parameter list caused win32 build to report c2719 error. This commit fixed the issue by make the parameter type a pointer instead of an array. Change-Id: I4ed654ce4eba2db4995d9cdc136c68e9a6acc992
2013-06-25Merge "Add averaging-SAD functions for 8-point comp-inter motion search."Ronald S. Bultje
2013-06-25Add averaging-SAD functions for 8-point comp-inter motion search.Ronald S. Bultje
Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2, i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc the variance of the averaging predictor. This is slightly suboptimal because the function is subpixel-position-aware, but it will (at least for the SSE2 version) not actually use a bilinear filter for a full-pixel position, thus leading to approximately the same performance compared to if we implemented an actual average-aware full-pixel variance function. That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus leading to a total gain of 2.7%. Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd
2013-06-25Tune the rounding operations in 8x8 ADST/DCT sse2Jingning Han
Improve the round-trip precision to meet the unit test setttings. Change-Id: I303febae56b4b990ea3798b8ebed94c0510ecf79
2013-06-25Merge "Use aligned buffer operations in 8x8/16x16 2D-DCT"Jingning Han
2013-06-25Merge "Enable sse2 implmentation of 8x8 ADST/DCT"Yaowu Xu
2013-06-24Use aligned buffer operations in 8x8/16x16 2D-DCTJingning Han
This reduces 16x16 2D-DCT runtime from 865 cycles to 837 cycles. Change-Id: I137758b81cd127b936175284310e81378db64552
2013-06-24Enable sse2 implmentation of 8x8 ADST/DCTJingning Han
This commit makes use of the butterfly structure to enable the sse2 version implementation of 8x8 ADST/DCT hybrid transform coding. The runtime of hybrid transform module goes down from 1170 cycles to 245 cycles. Overall speed-up around 1.5%. Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
2013-06-21Remove emms - that shouldn't be there.Ronald S. Bultje
Change-Id: I8fcab81e390f93dc17e9666bbf8f77883b5aa897
2013-06-21Add missing SECTION .text marker in assembly file.Ronald S. Bultje
Fixes a crash on Windows when building with MSVC. Change-Id: I124ac756a1be55d190fadda5fcc46d23b1445dbf
2013-06-21Implement SSE2 block_error.Ronald S. Bultje
Change vp9_block_error() to return a 64bit error variable, change all callers to expect a 64bit return value (this will prevent overflows, which we basically don't check for at all right now). Remove duplicate block_error() function, which fixed that through truncation. Remove old (incompatible) mmx/sse2 block_error SIMD versions and replace with a new one that returns a 64bit value. Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to 3min23, i.e. a 3% overall speedup. Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68
2013-06-21Add subtract_block SSE2 version and unit test.Ronald S. Bultje
3% faster overall (3min35.0 to 3min28.5). Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e
2013-06-20SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance().Ronald S. Bultje
Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to 3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't perfectly interleaved, and can probably be improved further in the future. I've marked this with a few TODOs/FIXMEs in the code. Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9
2013-06-20Implement sse2 and ssse3 versions for all sub_pixel_variance sizes.Ronald S. Bultje
Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 -> 3min58). Specific changes to timings for each function compared to original assembly-optimized versions (or just new version timings if no previous assembly-optimized version was available): sse2 4x4: 99 -> 82 cycles sse2 4x8: 128 cycles sse2 8x4: 121 cycles sse2 8x8: 149 -> 129 cycles sse2 8x16: 235 -> 245 cycles (?) sse2 16x8: 269 -> 203 cycles sse2 16x16: 441 -> 349 cycles sse2 16x32: 641 cycles sse2 32x16: 643 cycles sse2 32x32: 1733 -> 1154 cycles sse2 32x64: 2247 cycles sse2 64x32: 2323 cycles sse2 64x64: 6984 -> 4442 cycles ssse3 4x4: 100 cycles (?) ssse3 4x8: 103 cycles ssse3 8x4: 71 cycles ssse3 8x8: 147 cycles ssse3 8x16: 158 cycles ssse3 16x8: 188 -> 162 cycles ssse3 16x16: 316 -> 273 cycles ssse3 16x32: 535 cycles ssse3 32x16: 564 cycles ssse3 32x32: 973 cycles ssse3 32x64: 1930 cycles ssse3 64x32: 1922 cycles ssse3 64x64: 3760 cycles Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d
2013-06-17Move subpixel variance function from common/ to encoder/.Ronald S. Bultje
This seems to only be used in the encoder. Also remove an empty wrapper file that contained forward declarations for this function, but didn't actually define any actual functions. Change-Id: Ifc561eef7ebe374a7d03698055e51e105f6d614b
2013-06-14Merge "Enable sse2 version of sad8x4/4x8"Jingning Han
2013-06-13Enable sse2 version of sad8x4/4x8Jingning Han
The encoding time for bus at CIF goes from 661s to 625s. This commit also enabled unit test of sad8x4/4x8 in sad_test.cc. Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
2013-06-12Implement SSE version for sad4x8x4d and SSE2 version for sad8x4x4d.Ronald S. Bultje
Encoding time of crew (CIF, first 50 frames) @ 1500kbps goes from 4min56 to 4min42. Change-Id: I92c0c8b32980d2ae7c6dafc8b883a2c7fcd14a9f
2013-06-11Merge branch 'master' into experimentalJohn Koleszar
Change-Id: Ie648398b82f7311143709f55c0e30ba452f50eff
2013-05-22Optimize variance functionsYunqing Wang
Added SSE2 version of variance functions for super blocks. Change-Id: Ibeaae8771ca21c99d41dd74067574a51e97b412d
2013-04-30Remove unused quantize optimizations.Johann
Files were copied from vp8 and never maintained. Change-Id: I9659a8755985da73e8c19c3c984423b6666d8871
2013-04-26Merge branch 'master' into experimentalJohann
Conflicts: vp9/common/vp9_findnearmv.c vp9/common/vp9_rtcd_defs.sh vp9/decoder/vp9_decodframe.c vp9/decoder/x86/vp9_dequantize_sse2.c vp9/encoder/vp9_rdopt.c vp9/vp9_common.mk Resolve file name changes in favor of master. Resolve rdopt changes in favor of experimental, preserving the newer experiments. Change-Id: If51ed8f457470281c7b20a5c1a2f4ce2cf76c20f
2013-04-26Whitespace nitJohann
Change-Id: I7486970c57cda75d26ec2c6d1f36bd668c955f66
2013-04-25Normalize more intrinsic filenamesJohann
vp9_dequantize_x86 has only sse2 functions. vp9_dct_sse2_intrinsics has no namespace collision and can drop _intrinsics. vp9_idct_mmx.h is unused. Change-Id: Ic16e31fb372a1d1e841a62ecb4189fe8f95808ec
2013-04-25Move dequant from BLOCKD to per-plane MACROBLOCKDJohn Koleszar
This data can vary per-plane, but not per-block. Change-Id: I1971b0b2c2e697d2118e38b54ef446e52f63c65a