path: root/vpx_dsp
Age  Commit message  Author
2015-12-07  Merge "VP9: Add ssse3 version of vpx_idct32x32_135_add()"  Scott LaVarnway
2015-12-05  Revert "MMX in intra 4x4 prediction replaced with SSE2"  James Zern
This reverts commit 89a1efa4c436c58c101c8b3de866e3014be7d77a. This causes a segfault when decoding vp8, in both 32-bit and 64-bit builds. Change-Id: Idbb9bb28ab897e1d055340497c47b49a12231367
2015-12-04  Speed up h_predictor_16x16  Jian Zhou
Relocate the function from SSSE3 to SSE2, unroll loop from 8 to 4, and reduce mem access to left. Speed up by >20% in ./test_intra_pred_speed. Change-Id: Ie48229c2e32404706b722442942c84983bda74cc
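For context, the horizontal predictor that these h_predictor commits speed up fills each row of the block with that row's left neighbor. The following is a minimal scalar sketch of that operation, not the libvpx source and not the SSE2 code from the commit; the name h_predictor_ref and its parameter list are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar reference for a horizontal intra predictor.
 * Every pixel in row r of the bs x bs block is set to left[r]. */
static void h_predictor_ref(uint8_t *dst, ptrdiff_t stride, int bs,
                            const uint8_t *left) {
  assert(bs > 0);
  for (int r = 0; r < bs; ++r) {
    memset(dst, left[r], (size_t)bs); /* one left pixel per row */
    dst += stride;
  }
}
```

A SIMD version would typically broadcast each left pixel across a register and store whole rows at once; the scalar loop above only shows what is being computed.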
2015-12-04  Speed up h_predictor_8x8  Jian Zhou
Relocate the function from SSSE3 to SSE2, unroll loop from 4 to 2, and reduce mem access to left. Speed up by >20% in ./test_intra_pred_speed. Change-Id: Ib9f1846819783b6e05e2a310c930eb844b2b4d2e
2015-12-03  MMX in intra 8x8 prediction replaced with SSE2  Jian Zhou
8x8 Intra predictor implemented with MMX is replaced with SSE2. Change-Id: I0c90e7c1e1e6942489ac2bfe58903b728aac7a52
2015-12-03  MMX in intra 4x4 prediction replaced with SSE2  Jian Zhou
4x4 Intra predictor implemented with MMX is replaced with SSE2. Change-Id: Id57da2a7c38832d0356bc998790fc1989d39eafc
2015-12-02  Merge "SSE2 speed up of h_predictor_4x4"  Jian Zhou
2015-12-02  VP9: Add ssse3 version of vpx_idct32x32_135_add()  Scott LaVarnway
Change-Id: I9a780131efaad28cf1ad233ae64c5c319a329727
2015-11-30  Merge "VPX: x86 asm version of vpx_idct32x32_1024_add()"  Scott LaVarnway
2015-11-30  SSE2 speed up of h_predictor_4x4  Jian Zhou
Relocate h_predictor_4x4 from SSSE3 to SSE2 with XMM registers. Speed up by ~25% in ./test_intra_pred_speed. Change-Id: I64e14c13b482a471449be3559bfb0da45cf88d9d
2015-11-25  VPX: x86 asm version of vpx_idct32x32_1024_add()  Scott LaVarnway
Change-Id: I3ba4ede553e068bf116dce59d1317347988b3542
2015-11-25  Merge "Speed up tm_predictor_8x8"  Jian Zhou
2015-11-24  Speed up tm_predictor_8x8  Jian Zhou
Left neighbor read from memory only once. Speed up by ~20% in ./test_intra_pred_speed. Change-Id: Ia1388630df6fed0dce9a6eeded6cb855bbc43505
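For context, a TM (TrueMotion) predictor computes each pixel as left + above - top_left, clipped to the 8-bit range. The sketch below is a hypothetical scalar reference, not the libvpx assembly; the names tm_predictor_ref and clip_pixel_ref are illustrative, and it assumes the top-left pixel is stored at above[-1]. Hoisting left[r] out of the inner loop is only a loose analogue of reading the left neighbor once; the actual commit may achieve this differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit pixel range. */
static uint8_t clip_pixel_ref(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar TM predictor: dst[r][c] = left[r] + above[c] - top_left.
 * Assumes above[-1] holds the top-left neighbor. */
static void tm_predictor_ref(uint8_t *dst, ptrdiff_t stride, int bs,
                             const uint8_t *above, const uint8_t *left) {
  assert(bs > 0);
  const int top_left = above[-1];
  for (int r = 0; r < bs; ++r) {
    const int base = left[r] - top_left; /* left[r] is read once per row */
    for (int c = 0; c < bs; ++c) dst[c] = clip_pixel_ref(base + above[c]);
    dst += stride;
  }
}
```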
2015-11-24  Merge "bitreader/writer: Change shift to signed"  Alex Converse
2015-11-23  VPX: Removed unnecessary pmulhrsw in IDCT32X32_34  Scott LaVarnway
and fixed macro name. Change-Id: I306b98a2b4ec80b130ae80290b4cd9c7a5363311
2015-11-20  Revert "Speed up h_predictor_4x4"  James Zern
This reverts commit d76032ae87e535be5b924d9e88bbd67189380534. It breaks 32-bit builds. Change-Id: If6266ec2a405b5a21d615112f0f37e8a71193858
2015-11-21  Merge "Speed up h_predictor_4x4"  James Zern
2015-11-20  Merge "Fix a signed shift overflow in vpx_rb_read_inv_signed_literal."  Alex Converse
2015-11-20  Merge "VPX: x86 asm version of vpx_idct32x32_34_add()"  Scott LaVarnway
2015-11-19  bitreader/writer: Change shift to signed  Alex Converse
Silences several legal but suspicious unsigned overflows found with clang -fsanitize=integer. Change-Id: I69399751492a183167932b0a10751c433c32ca7b
2015-11-19  Fix a signed shift overflow in vpx_rb_read_inv_signed_literal.  Alex Converse
Found with clang -fsanitize=integer. Change-Id: I17cb2166c06ff463abfaf9b0e6bc749d0d6fdf94
2015-11-19  Speed up h_predictor_4x4  Jian Zhou
Modify h_predictor_4x4 with XMM registers. Speed up by ~25% in ./test_intra_pred_speed. Change-Id: Id01c34c48e75b9d56dfc2e93af12cf0c0326a279
2015-11-18  Speed up tm_predictor_4x4  Jian Zhou
tm_predictor_4x4 is implemented with SSE2 using XMM registers. Speed up by ~25% in ./test_intra_pred_speed. Change-Id: I25074b78d476a2cb17f81cf654bdfd80df2070e0
2015-11-17  VPX: x86 asm version of vpx_idct32x32_34_add()  Scott LaVarnway
Change-Id: Ic81f38998fb1b8d33f5a5d7424c2c41002786cef
2015-11-11  Revert "VPX: x86 asm version of vpx_idct32x32_34_add()"  James Zern
This reverts commit 9aeaa2016e7470c4e316d90da33d883098eed6f4. This causes some test vectors to fail. Change-Id: I3659a2068404ec5a0591fba5c88b1bec0c9059a4
2015-11-10  Merge "convolve_copy_sse2: replace SSE w/SSE2 code"  James Zern
2015-11-10  Merge "VPX: x86 asm version of vpx_idct32x32_34_add()"  Scott LaVarnway
2015-11-10  VPX: x86 asm version of vpx_idct32x32_34_add()  Scott LaVarnway
Change-Id: I8a933c63b7fbf3c65e2c06dbdca9646cadd0b7cb
2015-11-09  convolve_copy_sse2: replace SSE w/SSE2 code  James Zern
This should be neutral or slightly faster on modern (P4+) architectures. Change-Id: Iec4c080275941eb8c9e05a66a2daf0405d86a69b
2015-10-26  Merge "Optimize vpx_quantize_{b,b_32x32} assembler."  Debargha Mukherjee
2015-10-21  vp10: merge ext_ipred_bltr experiment into misc_fixes.  Ronald S. Bultje
Change-Id: I2f2deb700748408b8278b7f5c29ee1f2e39785ec
2015-10-20  Optimize vpx_quantize_{b,b_32x32} assembler.  Geza Lore
Added optimization of the 8 bit assembly quantizer routines. This makes these functions up to 100% faster, depending on encoding parameters. This patch makes the encoder faster in both the high bitdepth and 8bit configurations. In the high bitdepth configuration, it affects profile 0 only. Based on my profiling using 1080p input the net gain is between 1-3% for the 8 bit config, and around 2.5-4.5% for the high bitdepth config, depending on target bitrate. The difference between the 8 bit and high bitdepth configurations for the same encoder run is reduced by 1% in all cases I have profiled. Change-Id: I86714a6b7364da20cd468cd784247009663a5140
2015-10-16  vp10: add extended-intra prediction edges experiment.  Ronald S. Bultje
This experiment allows using full above/right edges for all transform sizes whenever available (for d45/d63), and adds bottom/left edges for d207. See issue 1043. Change-Id: I5cf7f345e783e8539bb6b6d2c9972fb1d6d0a78b
2015-10-14  Upstream Mozilla fix for older Apple clang builds  Johann
Also use the _mm_broadcastsi128_si256 intrinsic for Apple clang versions 4.[012]. See https://bugzilla.mozilla.org/show_bug.cgi?id=1085607 and https://code.google.com/p/webm/issues/detail?id=1082 Change-Id: I6bc821d8163387194ef663e94bfed91fa7281d88
2015-10-13  Fix compiler warnings  hui su
Change-Id: I761256a8100d83abf1b937f3739580237e3fad2a
2015-10-09  Add vpx_highbd_convolve_{copy,avg}_sse2  Alex Converse
Single-threaded: swanky (silvermont): ~1% faster overall; peppy (celeron, haswell): ~1.5% faster overall. Change-Id: Ib74f014374c63c9eaf2d38191cbd8e2edcc52073
2015-10-09  Remove 4 mova insts from quantize_ssse3_x86_64.asm  Geza Lore
Change-Id: If3cb9345b44162e600e6c74873e0cb4c207fc7fb
2015-10-06  SSSE3 optimisation for quantize in high bit depth  Julia Robson
When configured with high bit depth enabled, the 8bit quantize function stopped using optimised code. This made 8bit content decode slowly. This commit re-enables the SSSE3 optimisations. Change-Id: I194b505dd3f4c494e5c5e53e020f5d94534b16b5
2015-10-06  Merge "VPX: refactor vpx_idct32x32_1_add_sse2()"  Scott LaVarnway
2015-10-05  SSE2 optimisation for quantize in high bit depth  Julia Robson
When configured with high bit depth enabled, the 8bit quantize function stopped using optimised code. This made 8bit content decode slowly. This commit re-enables the SSE2 optimisation (but not the SSSE3 optimisation). Change-Id: Id015fe3c1c44580a4bff3f4bd985170f2806a9d9
2015-10-05  VPX: refactor vpx_idct32x32_1_add_sse2()  Scott LaVarnway
Change-Id: Ia1a2cac0e9dc05f3207b3433a6c1589fa7f2aee3
2015-10-02  Merge "vp10: reimplement d45/4x4 to match vp8 instead of vp9."  Ronald S. Bultje
2015-10-02  Merge "Accelerated transform in high bit depth"  Debargha Mukherjee
2015-10-01  vp10: reimplement d45/4x4 to match vp8 instead of vp9.  Ronald S. Bultje
This is more a proof of concept than anything else. The problem here isn't so much how to code it, but rather where to place the resulting code. All intrapred DSP code lives in vpx_dsp, so do we want the vp10 specific intra pred functions to live there, or in vp10/? See issue 1015. Change-Id: I675f7badcc8e18fd99a9553910ecf3ddf81f0a05
2015-09-30  vp8: change build_intra4x4_predictors() to use vpx_dsp.  Ronald S. Bultje
I've added a few new functions (d45e, d63e, he, ve) to cover the filtered h/v 4x4 predictors that are vp8-specific, the "correct" d45 with the correctly filtered bottom-right pixel (as opposed to the unfiltered version in vp9), and the "broken" d63 with weirdly filtered bottom-right pixels (which is correctly filtered in vp9). There may be a minor performance impact on all systems because we have to do an extra copy of the Above pixel array to incorporate the topleft pixel in the same array (thus fitting the vpx_dsp API). In addition, armv6 will have a more serious performance impact b/c I removed the armv6/vp8-specific assembly. I'm not sure anyone cares... Change-Id: I7f9e5ebee11d8e21aca2cd517a69eefc181b2e86
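The extra copy mentioned above can be pictured as packing the top-left pixel and the above row into one contiguous buffer, so that a predictor can read the top-left neighbor at index -1 of the above pointer. This is a minimal sketch under that assumption; pack_above and its parameters are hypothetical names, not the vp8 or vpx_dsp code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: lay out [top_left, above[0..n-1]] in buf and return a
 * pointer to the above row, so that result[-1] is the top-left pixel.
 * buf must have room for n + 1 bytes. */
static const uint8_t *pack_above(uint8_t *buf, uint8_t top_left,
                                 const uint8_t *above, size_t n) {
  assert(buf != NULL && above != NULL);
  buf[0] = top_left;
  memcpy(buf + 1, above, n); /* the extra copy the commit message refers to */
  return buf + 1;
}
```

The cost is one small memcpy per block, which matches the "minor performance impact" described in the commit message.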
2015-09-30  vp8: change build_intra_predictors_mby_s to use vpx_dsp.  Ronald S. Bultje
Change-Id: I2000820e0c04de2c975d370a0cf7145330289bb2
2015-09-28  Accelerated transform in high bit depth  Julia Robson
When configured with high bitdepth enabled, the 8bit transform stopped using optimised code. This made 8bit content decode slowly. Change-Id: I67d91f9b212921d5320f949fc0a0d3f32f90c0ea
2015-09-18  Remove vpx_filter_block1d16_v8_intrin_ssse3  Johann
This was rewritten and moved to vpx_dsp/x86/vpx_subpixel_8t_ssse3.asm in 195883023bb39b5ee5c6811a316ab96d9225034d. Change-Id: I117ce983dae12006e302679ba7f175573dd9e874
2015-09-17  vpx_subpixel_8t_ssse3: fix reg counts/access  James Zern
Fixes build on Windows x64; previously 'heightq', i.e., the 64-bit register, was accessed when only the 32-bit value was needed. Given this is from a stack variable, the upper bits were undefined. Also bump register/xmm counts; users of SETUP_LOCAL_VARS touch xmm13 in 64-bit builds and filter_block1d16_v* uses one extra temp variable. Change-Id: I9c768c0b2047481d1d3b11c2e16b2f8de6eb0d80
2015-09-16  vp10: code sign bit before absolute value in non-arithcoded header.  Ronald S. Bultje
For reading, this makes the operation branchless, although it still requires two shifts. For writing, this makes the operation as fast as writing an unsigned value, branchlessly. This is also how other codecs typically code signed, non-arithcoded bitstream elements. See issue 1039. Change-Id: I6a8182cc88a16842fb431688c38f6b52d7f24ead
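One way to read a signed field without branching, using two shifts as the message describes, is to move the field's sign bit to the top of the word and then shift back arithmetically. The sketch below illustrates that general technique only; it is not taken from the vp10 source. The names encode_field and sign_extend are hypothetical, bits here counts the sign bit, and the code assumes two's complement with an arithmetic right shift on signed values (implementation-defined in C, but the common behavior).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical writer side: keep the low `bits` bits of value, so the field
 * can be written exactly like an unsigned literal. */
static uint32_t encode_field(int value, int bits) {
  assert(bits > 0 && bits < 32);
  return (uint32_t)value & ((UINT32_C(1) << bits) - 1);
}

/* Hypothetical reader side: sign-extend a `bits`-wide field with two shifts.
 * The left shift is done on an unsigned value to avoid signed overflow. */
static int sign_extend(uint32_t raw, int bits) {
  assert(bits > 0 && bits < 32);
  const int shift = 32 - bits;
  const uint32_t up = raw << shift;     /* field's sign bit is now bit 31 */
  return (int)((int32_t)up >> shift);   /* arithmetic shift restores sign */
}
```

For example, with a 5-bit field, encode_field(-3, 5) yields 0x1D and sign_extend(0x1D, 5) recovers -3 with no conditional on the sign.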