path: root/vp9/encoder/x86
Age | Commit message | Author
2013-06-26 | fixed a compiling problem with MSVC win32 build | Yaowu Xu
The aligned array in the parameter list caused the win32 build to report error C2719. This commit fixes the issue by making the parameter type a pointer instead of an array. Change-Id: I4ed654ce4eba2db4995d9cdc136c68e9a6acc992
2013-06-25 | Merge "Add averaging-SAD functions for 8-point comp-inter motion search." | Ronald S. Bultje
2013-06-25 | Add averaging-SAD functions for 8-point comp-inter motion search. | Ronald S. Bultje
Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2, i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc the variance of the averaging predictor. This is slightly suboptimal because the function is subpixel-position-aware, but it will (at least for the SSE2 version) not actually use a bilinear filter for a full-pixel position, thus leading to approximately the same performance compared to if we implemented an actual average-aware full-pixel variance function. That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus leading to a total gain of 2.7%. Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd
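As an illustration of what an averaging SAD computes, here is a minimal scalar sketch in C. It is not the libvpx code: the function name `sad_avg_c`, its parameter list, and the assumption that the second predictor is stored contiguously (stride equal to the block width) are all hypothetical. It measures the sum of absolute differences between the source block and the rounded average of two predictors, which is the quantity a compound-inter motion search would compare.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar reference (not the libvpx implementation):
 * SAD between a source block and the rounded average of two predictors. */
static unsigned int sad_avg_c(const uint8_t *src, int src_stride,
                              const uint8_t *ref, int ref_stride,
                              const uint8_t *second_pred,
                              int width, int height) {
  unsigned int sad = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      /* Compound prediction: round-to-nearest average of both predictors. */
      const int avg = (ref[x] + second_pred[x] + 1) >> 1;
      sad += (unsigned int)abs(src[x] - avg);
    }
    src += src_stride;
    ref += ref_stride;
    second_pred += width; /* assumption: second predictor is contiguous */
  }
  return sad;
}
```

Folding the average into the SAD loop avoids writing the averaged predictor to a temporary buffer before measuring it, which is presumably where part of the reported speedup comes from.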
2013-06-25 | Tune the rounding operations in 8x8 ADST/DCT sse2 | Jingning Han
Improve the round-trip precision to meet the unit test settings. Change-Id: I303febae56b4b990ea3798b8ebed94c0510ecf79
2013-06-25 | Merge "Use aligned buffer operations in 8x8/16x16 2D-DCT" | Jingning Han
2013-06-25 | Merge "Enable sse2 implmentation of 8x8 ADST/DCT" | Yaowu Xu
2013-06-24 | Use aligned buffer operations in 8x8/16x16 2D-DCT | Jingning Han
This reduces 16x16 2D-DCT runtime from 865 cycles to 837 cycles. Change-Id: I137758b81cd127b936175284310e81378db64552
2013-06-24 | Enable sse2 implmentation of 8x8 ADST/DCT | Jingning Han
This commit makes use of the butterfly structure to enable the sse2 version implementation of 8x8 ADST/DCT hybrid transform coding. The runtime of hybrid transform module goes down from 1170 cycles to 245 cycles. Overall speed-up around 1.5%. Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
2013-06-21 | Remove emms - that shouldn't be there. | Ronald S. Bultje
Change-Id: I8fcab81e390f93dc17e9666bbf8f77883b5aa897
2013-06-21 | Add missing SECTION .text marker in assembly file. | Ronald S. Bultje
Fixes a crash on Windows when building with MSVC. Change-Id: I124ac756a1be55d190fadda5fcc46d23b1445dbf
2013-06-21 | Implement SSE2 block_error. | Ronald S. Bultje
Change vp9_block_error() to return a 64bit error variable, change all callers to expect a 64bit return value (this will prevent overflows, which we basically don't check for at all right now). Remove duplicate block_error() function, which fixed that through truncation. Remove old (incompatible) mmx/sse2 block_error SIMD versions and replace with a new one that returns a 64bit value. Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to 3min23, i.e. a 3% overall speedup. Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68
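To make the overflow concern concrete, the following is a minimal scalar sketch of a block error with a 64-bit accumulator. The name `block_error_c` and its signature are assumptions for illustration, not the actual `vp9_block_error()` prototype. With 16-bit coefficients, a single squared difference can reach 65535², which already exceeds a signed 32-bit value, so both the product and the running sum are widened.

```c
#include <stdint.h>

/* Hypothetical scalar sketch: sum of squared differences between the
 * original and dequantized coefficients, accumulated in 64 bits. */
static int64_t block_error_c(const int16_t *coeff, const int16_t *dqcoeff,
                             int block_size) {
  int64_t error = 0;
  for (int i = 0; i < block_size; ++i) {
    const int diff = coeff[i] - dqcoeff[i];
    /* Widen before multiplying: diff * diff can exceed INT32_MAX. */
    error += (int64_t)diff * diff;
  }
  return error;
}
```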
2013-06-21 | Add subtract_block SSE2 version and unit test. | Ronald S. Bultje
3% faster overall (3min35.0 to 3min28.5). Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e
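For reference, block subtraction is the step that produces the residual fed to the transform. A minimal scalar sketch follows; the name `subtract_block_c` and the argument order are assumptions rather than the libvpx prototype.

```c
#include <stdint.h>

/* Hypothetical scalar sketch: residual = source - prediction, per pixel.
 * The result needs a signed 16-bit type because it spans [-255, 255]. */
static void subtract_block_c(int rows, int cols,
                             int16_t *diff, int diff_stride,
                             const uint8_t *src, int src_stride,
                             const uint8_t *pred, int pred_stride) {
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      diff[c] = (int16_t)(src[c] - pred[c]);
    }
    diff += diff_stride;
    src += src_stride;
    pred += pred_stride;
  }
}
```

Each output pixel depends only on the two inputs at the same position, which is why this loop maps naturally onto SIMD.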
2013-06-20 | SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance(). | Ronald S. Bultje
Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to 3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't perfectly interleaved, and can probably be improved further in the future. I've marked this with a few TODOs/FIXMEs in the code. Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9
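The role of the `x_offset & 7 || y_offset & 7` condition can be sketched in scalar C. Everything here is an assumption for illustration: the function names, the 1/8-pel offsets in [0, 7], the two-tap weights summing to 128, and the contiguous layout of the second predictor. The bilinear filter runs only in a direction with a nonzero offset, so a full-pixel position reads the source pixels directly. The sketch is not bit-exact with libvpx.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed two-tap bilinear filter for a 1/8-pel offset in [0, 7];
 * the weights sum to 128, hence the rounding shift by 7. */
static int bilinear_tap(int a, int b, int offset) {
  return (a * (128 - 16 * offset) + b * (16 * offset) + 64) >> 7;
}

/* Hypothetical scalar sketch of a sub-pixel (optionally averaging) variance.
 * src must have one extra column/row available when x_offset/y_offset != 0.
 * second_pred may be NULL; if set, it is assumed contiguous (stride == w). */
static unsigned int sub_pixel_avg_variance_c(
    const uint8_t *src, int src_stride, int x_offset, int y_offset,
    const uint8_t *ref, int ref_stride, const uint8_t *second_pred,
    int w, int h, unsigned int *sse) {
  int64_t sum = 0;
  uint64_t sum_sq = 0;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const uint8_t *p = src + y * src_stride + x;
      /* Horizontal filter only when the x position is fractional. */
      int pred = x_offset ? bilinear_tap(p[0], p[1], x_offset) : p[0];
      if (y_offset) {
        /* Vertical filter between this row and the row below. */
        const uint8_t *q = p + src_stride;
        const int below = x_offset ? bilinear_tap(q[0], q[1], x_offset) : q[0];
        pred = bilinear_tap(pred, below, y_offset);
      }
      if (second_pred) pred = (pred + second_pred[y * w + x] + 1) >> 1;
      const int diff = pred - ref[y * ref_stride + x];
      sum += diff;
      sum_sq += (uint64_t)(diff * diff);
    }
  }
  *sse = (unsigned int)sum_sq;
  /* variance * N = SSE - sum^2 / N */
  return (unsigned int)(sum_sq - (uint64_t)((sum * sum) / (w * h)));
}
```

Passing `NULL` for `second_pred` gives a plain sub-pixel variance. Passing a second predictor averages it with the filtered prediction before the difference is taken.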
2013-06-20 | Implement sse2 and ssse3 versions for all sub_pixel_variance sizes. | Ronald S. Bultje
Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 -> 3min58). Specific changes to timings for each function compared to original assembly-optimized versions (or just new version timings if no previous assembly-optimized version was available):
sse2 4x4: 99 -> 82 cycles
sse2 4x8: 128 cycles
sse2 8x4: 121 cycles
sse2 8x8: 149 -> 129 cycles
sse2 8x16: 235 -> 245 cycles (?)
sse2 16x8: 269 -> 203 cycles
sse2 16x16: 441 -> 349 cycles
sse2 16x32: 641 cycles
sse2 32x16: 643 cycles
sse2 32x32: 1733 -> 1154 cycles
sse2 32x64: 2247 cycles
sse2 64x32: 2323 cycles
sse2 64x64: 6984 -> 4442 cycles
ssse3 4x4: 100 cycles (?)
ssse3 4x8: 103 cycles
ssse3 8x4: 71 cycles
ssse3 8x8: 147 cycles
ssse3 8x16: 158 cycles
ssse3 16x8: 188 -> 162 cycles
ssse3 16x16: 316 -> 273 cycles
ssse3 16x32: 535 cycles
ssse3 32x16: 564 cycles
ssse3 32x32: 973 cycles
ssse3 32x64: 1930 cycles
ssse3 64x32: 1922 cycles
ssse3 64x64: 3760 cycles
Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d
2013-06-17 | Move subpixel variance function from common/ to encoder/. | Ronald S. Bultje
This seems to only be used in the encoder. Also remove an empty wrapper file that contained forward declarations for this function, but didn't actually define any actual functions. Change-Id: Ifc561eef7ebe374a7d03698055e51e105f6d614b
2013-06-14 | Merge "Enable sse2 version of sad8x4/4x8" | Jingning Han
2013-06-13 | Enable sse2 version of sad8x4/4x8 | Jingning Han
The encoding time for bus at CIF goes from 661s to 625s. This commit also enabled unit test of sad8x4/4x8 in sad_test.cc. Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
2013-06-12 | Implement SSE version for sad4x8x4d and SSE2 version for sad8x4x4d. | Ronald S. Bultje
Encoding time of crew (CIF, first 50 frames) @ 1500kbps goes from 4min56 to 4min42. Change-Id: I92c0c8b32980d2ae7c6dafc8b883a2c7fcd14a9f
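The "x4d" variants compute the SAD of one source block against four candidate reference blocks in a single call. A minimal scalar sketch, with a hypothetical name and signature, looks like this:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar sketch: SAD of one source block against four
 * reference candidates, loading each source pixel only once. */
static void sad_x4d_c(const uint8_t *src, int src_stride,
                      const uint8_t *const ref[4], int ref_stride,
                      int width, int height, unsigned int sad_array[4]) {
  for (int i = 0; i < 4; ++i) sad_array[i] = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      const int s = src[y * src_stride + x];
      for (int i = 0; i < 4; ++i) {
        sad_array[i] += (unsigned int)abs(s - ref[i][y * ref_stride + x]);
      }
    }
  }
}
```

Sharing the source loads across the four candidates is the main appeal of this shape for a motion search that probes several neighbouring positions at once.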
2013-06-11 | Merge branch 'master' into experimental | John Koleszar
Change-Id: Ie648398b82f7311143709f55c0e30ba452f50eff
2013-05-22 | Optimize variance functions | Yunqing Wang
Added SSE2 version of variance functions for super blocks. Change-Id: Ibeaae8771ca21c99d41dd74067574a51e97b412d
2013-04-30 | Remove unused quantize optimizations. | Johann
Files were copied from vp8 and never maintained. Change-Id: I9659a8755985da73e8c19c3c984423b6666d8871
2013-04-26 | Merge branch 'master' into experimental | Johann
Conflicts:
    vp9/common/vp9_findnearmv.c
    vp9/common/vp9_rtcd_defs.sh
    vp9/decoder/vp9_decodframe.c
    vp9/decoder/x86/vp9_dequantize_sse2.c
    vp9/encoder/vp9_rdopt.c
    vp9/vp9_common.mk
Resolve file name changes in favor of master. Resolve rdopt changes in favor of experimental, preserving the newer experiments.
Change-Id: If51ed8f457470281c7b20a5c1a2f4ce2cf76c20f
2013-04-26 | Whitespace nit | Johann
Change-Id: I7486970c57cda75d26ec2c6d1f36bd668c955f66
2013-04-25 | Normalize more intrinsic filenames | Johann
vp9_dequantize_x86 has only sse2 functions. vp9_dct_sse2_intrinsics has no namespace collision and can drop _intrinsics. vp9_idct_mmx.h is unused. Change-Id: Ic16e31fb372a1d1e841a62ecb4189fe8f95808ec
2013-04-25 | Move dequant from BLOCKD to per-plane MACROBLOCKD | John Koleszar
This data can vary per-plane, but not per-block. Change-Id: I1971b0b2c2e697d2118e38b54ef446e52f63c65a
2013-04-23 | Move src_diff to per-plane MACROBLOCK data | John Koleszar
First in a series of commits making certain MACROBLOCK members addressable per-plane. This commit also refactors the block subtraction functions vp9_subtract_b, vp9_subtract_sby_c, etc to be loops-over-planes and variable subsampling aware. Change-Id: I371d092b914ae0a495dfd852ea1a3d2467be6ec3
2013-04-18 | Make the use of pred buffers consistent in MB/SB | Jingning Han
Use in-place buffers (dst of MACROBLOCKD) for macroblock prediction. This makes the macroblock buffer handling consistent with those of superblock. Remove predictor buffer MACROBLOCKD. Change-Id: Id1bcd898961097b1e6230c10f0130753a59fc6df
2013-04-17 | Add SSE2 versions for rectangular sad and sad4d functions. | Ronald S. Bultje
About 11% overall encoder speedup with the sbsegment experiment enabled. Change-Id: Iffb1bdba6932d9f11a6c791cda8697ccf9327183
2013-04-16 | Faster vp9_short_fdct4x4 and vp9_short_fdct8x4. | Christian Duvivier
Scalar path is about 1.3x faster (2.1% overall encoder speedup). SSE2 path is about 5.0x faster (8.4% overall encoder speedup). Change-Id: I360d167b5ad6f387bba00406129323e2fe6e7dda
2013-04-16 | Faster vp9_short_fdct4x4 and vp9_short_fdct8x4. | Christian Duvivier
Scalar path is about 1.3x faster (2.1% overall encoder speedup). SSE2 path is about 5.0x faster (8.4% overall encoder speedup). Change-Id: I360d167b5ad6f387bba00406129323e2fe6e7dda
2013-04-10 | Make RD superblock mode search size-agnostic. | Ronald S. Bultje
Merge various super_block_yrd and super_block_uvrd versions into one common function that works for all sizes. Make transform size selection size-agnostic also. This fixes a slight bug in the intra UV superblock code where it used the wrong transform size for txsz > 8x8, and stores the txsz selection for superblocks properly (instead of forgetting it). Lastly, it removes the trellis search that was done for 16x16 intra predictors, since trellis is relatively expensive and should thus only be done after RD mode selection. Gives basically identical results on derf (+0.009%). Change-Id: If4485c6f0a0fe4038b3172f7a238477c35a6f8d3
2013-04-04 | Move qcoeff, dqcoeff from BLOCKD to per-plane data | John Koleszar
Start grouping data per-plane, as part of refactoring to support additional planes, and chroma planes with other-than 4:2:0 subsampling. Change-Id: Idb76a0e23ab239180c818025bae1f36f1608bb23
2013-03-18 | Optimize 8x8 idct function | Yunqing Wang
Wrote sse2 functions of vp9_short_idct8x8 and vp9_short_idct10_8x8. Compared to c version, the sse2 version is 2X faster. The decoder test didn't show noticeable gain since 8x8 idct doesn't take much of decoding time (less than 1% in my test). Change-Id: I56313e18cd481700b3b52c4eda5ca204ca6365f3
2013-03-15 | Faster vp9_short_fdct16x16. | Christian Duvivier
Scalar path is about 1.5x faster (3.1% overall encoder speedup). SSE2 path is about 7.2x faster (7.8% overall encoder speedup). Change-Id: I06da5ad0cdae2488431eabf002b0d898d66d8289
2013-03-01 | Merge master branch into experimental | John Koleszar
Picks up some build system changes, compiler warning fixes, etc. Change-Id: I2712f99e653502818a101a72696ad54018152d4e
2013-02-28 | Merge "mv dct_sse2.c dct_sse2_intrinsics.c to avoid collision" into experimental | Jim Bankoski
2013-02-28 | mv dct_sse2.c dct_sse2_intrinsics.c to avoid collision | Jim Bankoski
Change-Id: Id786be31da3c91d95d2955aa569ecdc6e66650df
2013-02-28 | this commit converts all sad ptrs to uint32 | Jim Bankoski
sse4_1 code used uint16_t for returning sad, but that won't work for 32x32 or 64x64. This code fixes the assembly for those and also reenables sse4_1 on linux Change-Id: I5ce7288d581db870a148e5f7c5092826f59edd81
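The arithmetic behind this fix is easy to check: with 8-bit pixels, the worst-case SAD of a block is width × height × 255. The helper names below are hypothetical and serve only to work through the numbers.

```c
#include <stdint.h>

/* Largest possible SAD for a width x height block of 8-bit pixels. */
static uint32_t max_sad(uint32_t width, uint32_t height) {
  return width * height * 255u;
}

/* Whether a value can be returned through a uint16_t without truncation. */
static int fits_in_uint16(uint32_t value) {
  return value <= UINT16_MAX;
}
```

A 16x16 block tops out at 65,280, which still fits in 16 bits. A 32x32 block can reach 261,120 and a 64x64 block 1,044,480, so both need a 32-bit return type.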
2013-02-27 | Faster vp9_short_fdct8x8. | Christian Duvivier
Scalar path is about 1.4x faster (4% overall encoder speedup). SSE2 path is about 7x faster (13% overall encoder speedup). Change-Id: I7e85d8225a914a74c61ea370210414696560094d
2013-02-27 | Remove unused vp9_copy32xn | John Koleszar
This function was part of an optimization used in VP8 that required caching two macroblocks. This is unused in VP9, and might not survive refactoring to support superblocks, so removing it for now. Change-Id: I744e585206ccc1ef9a402665c33863fc9fb46f0d
2013-02-27 | Fix --as=nasm compatibility for new asm code. | Jan Kratochvil
s/movd/movq/ Change-Id: Id1a56de91551f8dc796f14f1056c565dfc1ba626
2013-02-15 | Remove some Y2-related code. | Ronald S. Bultje
Change-Id: I4f46d142c2a8d1e8a880cfac63702dcbfb999b78
2013-02-08 | Port sadNxNx4d functions to x86inc.asm. | Ronald S. Bultje
Change-Id: Ic639f5742f7a007753d7a3fa5c66235172eb31d8
2013-02-08 | Add sad64x64 and sad32x32 SSE2 versions. | Ronald S. Bultje
Also port the 4x4, 16x16, 8x16 and 16x8 versions to x86inc.asm; this makes them all slightly faster, particularly on x86-64. Remove SSE3 sad16x16 version, since the SSE2 version is now faster. About 1.5% overall encoding speedup. Change-Id: Id4011a78cce7839f554b301d0800d5ca021af797
2013-02-06 | Add sse2 versions of sub_pixel_variance{32x32,64x64}. | Ronald S. Bultje
7.5% faster overall encoding. Change-Id: Ie9bb7f9fdf93659eda106404cb342525df1ba02f
2013-02-05 | Add SSE3 versions for sad{32x32,64x64}x4d functions. | Ronald S. Bultje
Overall encoding about 15% faster. Change-Id: I176a775c704317509e32eee83739721804120ff2
2013-01-31 | Add support for x64 and win64 yasm flags. | Frank Galligan
Some projects must define only win64 for Windows 64bit builds using yasm. Change-Id: I1d09590d66a7bfc8b4412e1cc8685978ac60b748
2013-01-14 | fix a number issues that cause failures | Yaowu Xu
During master Jenkins verification process. Change-Id: I3722b8753eaf39f99b45979ce407a8ea0bea0b89
2012-12-26 | Build fixes to merge vp9-preview into master | John Koleszar
Various fixups to resolve issues when building vp9-preview under the more stringent checks placed on the experimental branch. Change-Id: I21749de83552e1e75c799003f849e6a0f1a35b07
2012-12-20 | add private to assembly files to insure proper chromebuild | Jim Bankoski
Change-Id: I6e43ca73f35401a974ed8ee27738d4318f09fd37