Age | Commit message (Collapse) | Author |
|
|
|
This reverts commit 89a1efa4c436c58c101c8b3de866e3014be7d77a.
This causes a segfault when decoding vp8, in both 32-bit and 64-bit builds.
Change-Id: Idbb9bb28ab897e1d055340497c47b49a12231367
|
|
Relocate the function from SSSE3 to SSE2, unroll the loop from 8 to 4,
and reduce memory accesses to left.
Speed up by >20% in ./test_intra_pred_speed.
Change-Id: Ie48229c2e32404706b722442942c84983bda74cc
|
|
Relocate the function from SSSE3 to SSE2, unroll the loop from 4 to 2,
and reduce memory accesses to left.
Speed up by >20% in ./test_intra_pred_speed.
Change-Id: Ib9f1846819783b6e05e2a310c930eb844b2b4d2e
|
|
8x8 Intra predictor implemented with MMX is replaced with SSE2.
Change-Id: I0c90e7c1e1e6942489ac2bfe58903b728aac7a52
|
|
4x4 Intra predictor implemented with MMX is replaced with SSE2.
Change-Id: Id57da2a7c38832d0356bc998790fc1989d39eafc
|
|
|
|
Change-Id: I9a780131efaad28cf1ad233ae64c5c319a329727
|
|
|
|
Relocate h_predictor_4x4 from SSSE3 to SSE2 with XMM registers.
Speed up by ~25% in ./test_intra_pred_speed.
Change-Id: I64e14c13b482a471449be3559bfb0da45cf88d9d
|
|
Change-Id: I3ba4ede553e068bf116dce59d1317347988b3542
|
|
|
|
Left neighbor read from memory only once.
Speed up by ~20% in ./test_intra_pred_speed.
Change-Id: Ia1388630df6fed0dce9a6eeded6cb855bbc43505
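
The message above does not name the predictor it changes. As an illustration of the idea only, the sketch below shows a horizontal 4x4 predictor in portable C that gathers the four left-neighbor pixels once and then works from a register copy. The function name and signature are hypothetical, and this is not the SSE2 code from the commit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name and signature, for illustration only. */
static void h_predictor_4x4_sketch(uint8_t *dst, ptrdiff_t stride,
                                   const uint8_t *left) {
  /* Gather the four left pixels once; compilers typically fuse these
   * byte reads into a single 32-bit load on little-endian targets. */
  const uint32_t packed = (uint32_t)left[0] | ((uint32_t)left[1] << 8) |
                          ((uint32_t)left[2] << 16) |
                          ((uint32_t)left[3] << 24);
  for (int r = 0; r < 4; ++r) {
    /* Extract pixel r from the register copy, not from memory. */
    const uint32_t pixel = (packed >> (8 * r)) & 0xffu;
    /* Replicate the byte across all four lanes of the row. */
    const uint32_t row = pixel * 0x01010101u;
    memcpy(dst + r * stride, &row, sizeof(row));
  }
}
```

Because every byte of `row` is identical, the store does not depend on byte order.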
|
|
|
|
and fixed macro name.
Change-Id: I306b98a2b4ec80b130ae80290b4cd9c7a5363311
|
|
This reverts commit d76032ae87e535be5b924d9e88bbd67189380534.
breaks 32-bit builds
Change-Id: If6266ec2a405b5a21d615112f0f37e8a71193858
|
|
|
|
|
|
|
|
Silences several legal but suspicious unsigned overflows found with
clang -fsanitize=integer.
Change-Id: I69399751492a183167932b0a10751c433c32ca7b
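
The message does not say how the overflows were silenced. One common pattern, shown here as a sketch with a hypothetical helper name, is to reorder an unsigned subtraction so it never wraps. Wraparound is well defined for unsigned types in C, but the unsigned-overflow check in clang's -fsanitize=integer group still reports it.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 *
 * A wrapping form such as
 *     uint32_t d = a - b;        // wraps when a < b
 * is legal C but is reported by the sanitizer. Ordering the operands
 * first keeps every intermediate value in range. */
static uint32_t abs_diff_u32(uint32_t a, uint32_t b) {
  return a > b ? a - b : b - a;
}
```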
|
|
Found with clang -fsanitize=integer
Change-Id: I17cb2166c06ff463abfaf9b0e6bc749d0d6fdf94
|
|
Modify h_predictor_4x4 with XMM registers.
Speed up by ~25% in ./test_intra_pred_speed.
Change-Id: Id01c34c48e75b9d56dfc2e93af12cf0c0326a279
|
|
tm_predictor_4x4 is implemented with SSE2 using XMM registers.
Speed up by ~25% in ./test_intra_pred_speed.
Change-Id: I25074b78d476a2cb17f81cf654bdfd80df2070e0
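
For context, a TM (TrueMotion) predictor computes each pixel as left + above - top-left, clamped to the 8-bit range. The scalar reference below is a sketch of that formula, not the SSE2 code from this commit. The function name and the explicit `top_left` parameter are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit pixel range. */
static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar reference:
 * pred[r][c] = clip(left[r] + above[c] - top_left). */
static void tm_predictor_4x4_ref(uint8_t *dst, ptrdiff_t stride,
                                 const uint8_t *above, const uint8_t *left,
                                 uint8_t top_left) {
  for (int r = 0; r < 4; ++r) {
    for (int c = 0; c < 4; ++c) {
      dst[r * stride + c] = clip_pixel(left[r] + above[c] - top_left);
    }
  }
}
```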
|
|
Change-Id: Ic81f38998fb1b8d33f5a5d7424c2c41002786cef
|
|
This reverts commit 9aeaa2016e7470c4e316d90da33d883098eed6f4.
This causes some test vectors to fail.
Change-Id: I3659a2068404ec5a0591fba5c88b1bec0c9059a4
|
|
|
|
|
|
Change-Id: I8a933c63b7fbf3c65e2c06dbdca9646cadd0b7cb
|
|
this should be neutral or slightly faster on modern (P4+) architectures
Change-Id: Iec4c080275941eb8c9e05a66a2daf0405d86a69b
|
|
|
|
Change-Id: I2f2deb700748408b8278b7f5c29ee1f2e39785ec
|
|
Added optimization of the 8 bit assembly quantizer routines. This makes
these functions up to 100% faster, depending on encoding parameters.
This patch makes the encoder faster in both the high bitdepth and 8bit
configurations. In the high bitdepth configuration, it affects profile 0
only.
Based on my profiling using 1080p input the net gain is between 1% and 3% for
the 8 bit config, and around 2.5-4.5% for the high bitdepth config,
depending on target bitrate. The difference between the 8 bit and high
bitdepth configurations for the same encoder run is reduced by 1% in all
cases I have profiled.
Change-Id: I86714a6b7364da20cd468cd784247009663a5140
|
|
This experiment allows using full above/right edges for all transform
sizes whenever available (for d45/d63), and adds bottom/left edges for
d207.
See issue 1043.
Change-Id: I5cf7f345e783e8539bb6b6d2c9972fb1d6d0a78b
|
|
Also use the _mm_broadcastsi128_si256 intrinsic for
Apple clang versions 4.[012]
https://bugzilla.mozilla.org/show_bug.cgi?id=1085607
https://code.google.com/p/webm/issues/detail?id=1082
Change-Id: I6bc821d8163387194ef663e94bfed91fa7281d88
|
|
Change-Id: I761256a8100d83abf1b937f3739580237e3fad2a
|
|
single-threaded:
swanky (silvermont): ~1% faster overall
peppy (celeron,haswell): ~1.5% faster overall
Change-Id: Ib74f014374c63c9eaf2d38191cbd8e2edcc52073
|
|
Change-Id: If3cb9345b44162e600e6c74873e0cb4c207fc7fb
|
|
When configured with high bit depth enabled, the 8bit quantize
function stopped using optimised code. This made 8bit content
decode slowly. This commit re-enables the SSSE3 optimisations.
Change-Id: I194b505dd3f4c494e5c5e53e020f5d94534b16b5
|
|
|
|
When configured with high bit depth enabled, the 8bit quantize
function stopped using optimised code. This made 8bit content
decode slowly. This commit re-enables the SSE2 optimisation
(but not the SSSE3 optimisation).
Change-Id: Id015fe3c1c44580a4bff3f4bd985170f2806a9d9
|
|
Change-Id: Ia1a2cac0e9dc05f3207b3433a6c1589fa7f2aee3
|
|
|
|
|
|
This is more a proof of concept than anything else. The problem here
isn't so much how to code it, but rather where to place the resulting
code. All intrapred DSP code lives in vpx_dsp, so do we want the vp10
specific intra pred functions to live there, or in vp10/?
See issue 1015.
Change-Id: I675f7badcc8e18fd99a9553910ecf3ddf81f0a05
|
|
I've added a few new functions (d45e, d63e, he, ve) to cover the
filtered h/v 4x4 predictors that are vp8-specific, the "correct"
d45 with the correctly filtered bottom-right pixel (as opposed to
the unfiltered version in vp9), and the "broken" d63 with weirdly
filtered bottom-right pixels (which is correctly filtered in vp9).
There may be a minor performance impact on all systems because we
have to do an extra copy of the Above pixel array to incorporate
the topleft pixel in the same array (thus fitting the vpx_dsp API).
In addition, armv6 will have a more serious performance impact b/c
I removed the armv6/vp8-specific assembly. I'm not sure anyone
cares...
Change-Id: I7f9e5ebee11d8e21aca2cd517a69eefc181b2e86
|
|
Change-Id: I2000820e0c04de2c975d370a0cf7145330289bb2
|
|
When configured with high bitdepth enabled, the 8bit transform
stopped using optimised code. This made 8bit content decode slowly.
Change-Id: I67d91f9b212921d5320f949fc0a0d3f32f90c0ea
|
|
This was rewritten and moved to vpx_dsp/x86/vpx_subpixel_8t_ssse3.asm
in 195883023bb39b5ee5c6811a316ab96d9225034d
Change-Id: I117ce983dae12006e302679ba7f175573dd9e874
|
|
fixes build on windows x64; previously 'heightq' i.e., the 64-bit register
was accessed when only the 32-bit value was needed. given this is from a
stack variable the upper bits were undefined.
+ bump register/xmm counts; users of SETUP_LOCAL_VARS touch xmm13 in
64-bit builds and filter_block1d16_v* uses one extra temp variable
Change-Id: I9c768c0b2047481d1d3b11c2e16b2f8de6eb0d80
|
|
For reading, this makes the operation branchless, although it still
requires two shifts. For writing, this makes the operation as fast
as writing an unsigned value, branchlessly. This is also how other
codecs typically code signed, non-arithcoded bitstream elements.
See issue 1039.
Change-Id: I6a8182cc88a16842fb431688c38f6b52d7f24ead
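
The branchless read and write described above can be sketched as follows. This assumes the value is stored as a (bits + 1)-wide two's-complement field, which the message does not state. The bit reader and writer types and all function names are hypothetical stand-ins, not the library's API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit reader/writer, for illustration only. */
typedef struct { const uint8_t *buf; size_t bit_pos; } bit_reader;
typedef struct { uint8_t *buf; size_t bit_pos; } bit_writer;

static uint32_t read_literal(bit_reader *br, int bits) {
  uint32_t v = 0;
  for (int i = 0; i < bits; ++i) {
    const size_t p = br->bit_pos++;
    v = (v << 1) | ((br->buf[p >> 3] >> (7 - (p & 7))) & 1u);
  }
  return v;
}

static void write_literal(bit_writer *bw, uint32_t v, int bits) {
  for (int i = bits - 1; i >= 0; --i) {
    const size_t p = bw->bit_pos++;
    const uint8_t mask = (uint8_t)(0x80u >> (p & 7));
    if ((v >> i) & 1u) {
      bw->buf[p >> 3] |= mask;
    } else {
      bw->buf[p >> 3] &= (uint8_t)~mask;
    }
  }
}

/* Read a (bits + 1)-wide field and sign-extend it with two shifts
 * instead of branching on the sign bit. Assumes 0 <= bits <= 31 and
 * that right-shifting a negative int32_t is arithmetic, which holds
 * on mainstream compilers but is implementation-defined in C. */
static int32_t read_signed_literal(bit_reader *br, int bits) {
  const int shift = 32 - (bits + 1);
  const uint32_t raw = read_literal(br, bits + 1) << shift;
  return (int32_t)raw >> shift;
}

/* Writing is the same as writing an unsigned value: emit the low
 * (bits + 1) bits of the two's-complement representation. */
static void write_signed_literal(bit_writer *bw, int32_t v, int bits) {
  write_literal(bw, (uint32_t)v, bits + 1);
}
```

The left shift moves the field's sign bit into bit 31, and the arithmetic right shift then replicates it across the upper bits.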
|