|
+ synchronize filter function signatures
this makes any intrinsics filters available for inlining and has the
side-effect of making those filters static, quieting missing-prototype
warnings.
Change-Id: I1908875caffa585bd4fc65aaf10d17a5e20cfb46
|
|
+ synchronize filter function signatures
this makes any intrinsics filters available for inlining and has the
side-effect of making those filters static, quieting missing-prototype
warnings.
Change-Id: I1cd55c9d52547793ad65aa90c7620f0e426edaa2
|
|
collect the vp9_convolve function definition macros there; this will
allow some relocation of functions from vp9_asm_stubs.c
Change-Id: Idadd117fa256dd48748379856973fd985b8204e8
|
|
reorder includes to avoid:
warning C4985: 'ceil': attributes not present on previous declaration.
this is the same workaround used in vp9/common/vp9_systemdependent.h
Change-Id: Ia10dd63de24f96fa1507a6179220e9d6ec774db6
|
|
silences a missing declaration warning
Change-Id: I59a34e1a1377cf3529b678d7ec0122bd43ab1bf1
|
|
With the sad functions, and hopefully the variance functions soon,
moving to the vpx_dsp location, place the defines used in the
reference C code in a common location.
Change-Id: I4c8ce7778eb38a0a3ee674d2f1c488eda01cfeca
|
|
this macro was used inconsistently and only differs in behavior from
DECLARE_ALIGNED when an alignment attribute is unavailable. this macro
is used with calls to assembly, while generic c-code doesn't rely on it,
so in a c-only build without an alignment attribute the code will
function as expected.
Change-Id: Ie9d06d4028c0de17c63b3a27e6c1b0491cc4ea79
|
|
vp9_dc_left_predictor_16x16
vp9_dc_top_predictor_32x32
vp9_dc_left_predictor_32x32
vp9_dc_128_predictor_32x32
Change-Id: Ib9861deefd01c3527235b92ff6b3d571ef6b4bc6
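A minimal scalar sketch of what the listed DC predictors compute. It assumes the usual DC rounding (sum plus half the block size, divided by the block size). The function names and the (dst, stride, bs, above, left) signature are illustrative assumptions, not the libvpx sources:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: hypothetical scalar reference for the DC predictors.
 * bs is the block size (e.g. 16 or 32); rows of dst are 'stride' apart. */

/* Fill a bs x bs block with a single value, one row at a time. */
static void fill_block(uint8_t *dst, ptrdiff_t stride, int bs, uint8_t value) {
  for (int r = 0; r < bs; ++r) {
    memset(dst, value, (size_t)bs);
    dst += stride;
  }
}

/* Rounded average of bs neighbouring pixels (assumed rounding rule). */
static uint8_t rounded_avg(const uint8_t *px, int bs) {
  int sum = 0;
  for (int i = 0; i < bs; ++i) sum += px[i];
  return (uint8_t)((sum + bs / 2) / bs);
}

/* dc_left: predict from the left column only. */
static void dc_left_predictor(uint8_t *dst, ptrdiff_t stride, int bs,
                              const uint8_t *above, const uint8_t *left) {
  (void)above;
  fill_block(dst, stride, bs, rounded_avg(left, bs));
}

/* dc_top: predict from the row above only. */
static void dc_top_predictor(uint8_t *dst, ptrdiff_t stride, int bs,
                             const uint8_t *above, const uint8_t *left) {
  (void)left;
  fill_block(dst, stride, bs, rounded_avg(above, bs));
}

/* dc_128: no neighbours available, use the mid-grey value 128. */
static void dc_128_predictor(uint8_t *dst, ptrdiff_t stride, int bs,
                             const uint8_t *above, const uint8_t *left) {
  (void)above;
  (void)left;
  fill_block(dst, stride, bs, 128);
}
```

An SSE2 version would compute the same values and mainly speed up the summation and the row stores.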
|
|
widen the loads and stores to 128-bit.
this was added, but not enabled in:
493a857 Add some sse2 code for intra prediction.
Change-Id: I277d7db608a7db7d75cc0bde86f48fa66ad487e4
|
|
+ fix some whitespace
Change-Id: Id61b739282014288a7e5d3c17a9d6448d9d4cda2
|
|
offsetting by a variable stride prevents instruction reordering,
resulting in poor assembly
Change-Id: Id62d6b3299cdd23f8c44f97b630abf4fea241446
|
|
offsetting by a variable stride prevents instruction reordering,
resulting in poor assembly.
additionally reroll 16x16/32x32 loops to reduce register spill with this
new format
Change-Id: I0635b8ba21ecdb88116e927dbdab53acdf256e11
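One reading of the two changes above, sketched in plain C. It assumes the "new format" advances the destination pointer by stride after each row instead of addressing every row as dst + r * stride. Both helpers are hypothetical and store the same bytes. Whether the compiler actually schedules the second form better depends on the compiler and target:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: each row address is an offset by a run-time stride
 * multiple, the pattern the commit message says produced poor assembly. */
static void store_rows_offset(uint8_t *dst, ptrdiff_t stride, int rows,
                              const uint8_t *row, size_t width) {
  for (int r = 0; r < rows; ++r) memcpy(dst + r * stride, row, width);
}

/* Hypothetical helper: same stores, but the pointer is bumped by stride
 * after each row (assumed to be the rewritten form). */
static void store_rows_advance(uint8_t *dst, ptrdiff_t stride, int rows,
                               const uint8_t *row, size_t width) {
  for (int r = 0; r < rows; ++r) {
    memcpy(dst, row, width);
    dst += stride;
  }
}
```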
|
|
Change-Id: I16c0a62e52dab62837c547345df31e7518620ed4
|
|
The rotation computation using 2X of cos(pi/16) can potentially
overflow 32 bits; this commit disables the function to allow further
investigation and optimization.
Change-Id: I4a9803bc71303d459cb1ec5bbd7c4aaf8968e5cf
|
|
Change-Id: Idb14b9a285f8098126f967c5e2750221d6a58f69
|
|
Change-Id: I6728b69bb3dff1daa64ff7142f691e80a089f1c4
|
|
The intrinsic _mm_subs_epi16() should take an immediate.
Passing a variable as its input argument causes a compile failure
in older versions of gcc.
Change-Id: I6a71efcc8d3b16b84715e0a9bcfa818494eea3f4
|
|
Change-Id: I39f56f60425836f2e1ec07da71edd4810a4c78bb
|
|
by saving xmm8; cglobal's xmm reg arg is 0-based
Change-Id: Ic8426ec9ac59ab4478716aa812452a6406794dcb
|
|
The SSE2 code is taken from VP8 MFQE and reused in VP9, with no change
on the VP8 side. In our testing, this change achieves a 2X speedup.
Change-Id: Ib2b14144ae57c892005c1c4b84e3379d02e56716
|
|
This CL showed about a 3% gain in performance on some systems.
Change-Id: Id27e7e0b8e69068aa364e67859436da852669250
|
|
This CL showed a modest gain in performance on some systems.
Change-Id: Iad636a89a1a9804ab7a0dea302bf2c6a4d1653a4
|
|
Change-Id: I964d25cc91c8e4864d73b142d9c7a1b39cb6cfbb
|
|
The 8x8 DCT uses a fast version whenever possible.
There was a mistake in the checking code which
meant sometimes the fast version was used when it
was not safe to do so.
Change-Id: I154c84c9e2d836764768a11082947ca30f4b5ab7
(cherry picked from commit fd05fb0c21e253b4d6f92d7e0b752850ff8ab188)
|
|
selection and ordering"
|
|
ordering
The function vp9_filter_block1d16_h8_ssse3 uses the PSHUFB instruction, which has a 3-cycle latency and slows execution when issued in blocks of 5 or more on Atom processors.
Performance can be improved by replacing the PSHUFB instructions with more efficient single-cycle instructions (PUNPCKLBW + PUNPCKHBW + PALIGNR).
In the original code, PSHUFB uses every byte, copying consecutive bytes into overlapping pairs.
PUNPCKLBW and PUNPCKHBW do this more efficiently, with PALIGNR concatenating the intermediate results and shifting right to extract the next consecutive 16 bytes for the final result.
For example:
filter = 0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8
Reg = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
REG1 = PUNPCKLBW Reg, Reg = 0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7
REG2 = PUNPCKHBW Reg, Reg = 8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15
PALIGNR REG2, REG1, 1 = 0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8
This optimization improved the function's performance by 23% and produced a 3% user-level gain on 1080p content on Atom processors.
There was no observed performance impact on Core processors (as expected).
Change-Id: I3cec701158993d95ed23ff04516942b5a4a461c0
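The register example can be checked with scalar models of the four instructions (128-bit forms only). These helpers are hypothetical illustrations, not libvpx code. The matching SSE intrinsics are _mm_shuffle_epi8, _mm_unpacklo_epi8, _mm_unpackhi_epi8 and _mm_alignr_epi8:

```c
#include <stdint.h>
#include <string.h>

/* Scalar models of 16-byte register operations; 'out' must not alias
 * the inputs. */

/* PSHUFB (_mm_shuffle_epi8): a mask byte with bit 7 set yields 0,
 * otherwise its low 4 bits select a source byte. */
static void pshufb(uint8_t out[16], const uint8_t src[16],
                   const uint8_t mask[16]) {
  for (int i = 0; i < 16; ++i)
    out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0f];
}

/* PUNPCKLBW (_mm_unpacklo_epi8): interleave the low 8 bytes a0,b0,a1,b1,... */
static void punpcklbw(uint8_t out[16], const uint8_t a[16],
                      const uint8_t b[16]) {
  for (int i = 0; i < 8; ++i) { out[2 * i] = a[i]; out[2 * i + 1] = b[i]; }
}

/* PUNPCKHBW (_mm_unpackhi_epi8): interleave the high 8 bytes a8,b8,a9,b9,... */
static void punpckhbw(uint8_t out[16], const uint8_t a[16],
                      const uint8_t b[16]) {
  for (int i = 0; i < 8; ++i) {
    out[2 * i] = a[8 + i];
    out[2 * i + 1] = b[8 + i];
  }
}

/* PALIGNR (_mm_alignr_epi8(hi, lo, n)): form the 32-byte value hi:lo
 * (hi in the upper half), shift right by n bytes, keep the low 16 bytes.
 * Bytes shifted in from beyond the top are zero. */
static void palignr(uint8_t out[16], const uint8_t hi[16],
                    const uint8_t lo[16], int n) {
  uint8_t cat[32];
  memcpy(cat, lo, 16);
  memcpy(cat + 16, hi, 16);
  for (int i = 0; i < 16; ++i) out[i] = (i + n < 32) ? cat[i + n] : 0;
}
```

Applying pshufb with the mask 0,1,1,2,...,7,8 gives the same 16 bytes as punpcklbw(Reg, Reg), punpckhbw(Reg, Reg) and palignr(REG2, REG1, 1). This is the equivalence the commit relies on.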
|
|
Also removes some spurious changes in common/vp9_blockd.h that
were introduced by a rebase issue between the nextgen and master branches.
Change-Id: If359f0e9a71bca9c2ba685a87a355873536bb282
(cherry picked from commit 005d80cd05269a299cd2f7ddbc3d4d8b791aebba)
(cherry picked from commit 08d2f548007fd8d6fd41da8ef7fdb488b6485af3)
(cherry picked from commit 4230c2306c194c058f56433a5275aa02a2e71d56)
|
|
fixes non-Apple nasm part of issue #755
Change-Id: I11955d270c4ee55e3c00e99f568de01b95e7ea9a
|
|
For builds configured with --enable-vp9-highbitdepth
Change-Id: I2b181519d7192f8d7a241ad5760c3578255f24e6
|
|
In the function mb_lpf_horizontal_edge_w_avx2_16, the intrinsic
_mm256_cvtepu8_epi16 triggers a compiler bug in gcc 4.9.1.
Until it is fixed, I created a workaround that performs the up-conversion
using broadcast128+shuffle.
The bug was reported here:
https://code.google.com/p/webm/issues/detail?id=867
Change-Id: I73452e6806f42e0fadcde96b804ea3afa7eeb351
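A scalar model of the workaround. The commit does not give the exact mask, so this assumes the shuffle mask places source bytes 0..7 in lane 0 and bytes 8..15 in lane 1, with 0x80 entries zeroing each high byte. The hypothetical helpers compare the result against plain zero-extension, which is what _mm256_cvtepu8_epi16 computes:

```c
#include <stdint.h>
#include <string.h>

/* Reference: zero-extend 16 bytes to 16 words (_mm256_cvtepu8_epi16). */
static void cvtepu8_epi16_ref(uint16_t out[16], const uint8_t in[16]) {
  for (int i = 0; i < 16; ++i) out[i] = in[i];
}

/* Workaround model: broadcast the 128-bit input into both 128-bit lanes
 * (VBROADCASTI128-style), then apply a per-lane byte shuffle
 * (_mm256_shuffle_epi8). The mask layout is an assumption. */
static void broadcast_shuffle_epi16(uint16_t out[16], const uint8_t in[16]) {
  uint8_t lanes[32], bytes[32], mask[32];
  memcpy(lanes, in, 16);      /* lane 0 */
  memcpy(lanes + 16, in, 16); /* lane 1: broadcast copy */
  for (int i = 0; i < 8; ++i) {
    mask[2 * i] = (uint8_t)i;            /* lane 0 takes bytes 0..7 */
    mask[2 * i + 1] = 0x80;              /* zero high byte */
    mask[16 + 2 * i] = (uint8_t)(8 + i); /* lane 1 takes bytes 8..15 */
    mask[16 + 2 * i + 1] = 0x80;
  }
  /* VPSHUFB ymm indexes only within each 128-bit lane. */
  for (int i = 0; i < 32; ++i) {
    int lane = i / 16;
    bytes[i] = (mask[i] & 0x80) ? 0 : lanes[lane * 16 + (mask[i] & 0x0f)];
  }
  /* Reassemble little-endian 16-bit words. */
  for (int i = 0; i < 16; ++i)
    out[i] = (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
}
```

The broadcast is needed because the byte shuffle cannot move data across lanes, so lane 1 needs its own copy of the upper 8 source bytes.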
|
|
Uses highbd_ prefix convention consistently.
Change-Id: I58f7f799a7ff8e32701bcd71c955bcf1cdd4581e
|
|
Adds high-bitdepth loopfilter, temporal filter and postproc functions
Change-Id: I81c8a9176890784686bc4f2af0d550d243b3b2d3
|
|
Fixes Visual Studio build failures
Change-Id: I233719cd63b3ad0db16e2834bf1d7ea1df805880
|
|
Change-Id: Ie51c352a6b250547207cbc1ebba833a01ed053e3
|
|
The decoder performance improved up to 1% for the
test clips used.
Change-Id: I4621112bdccfba01640322facfa4ba8da8290ea5
|
|
Change-Id: I6f5cb101e2dc57c3d3f4d7e0ffb4ddbed027d111
|
|
If optimizations use more than one cpu feature, allow
specifying them so that '--disable-X' still works
https://code.google.com/p/webm/issues/detail?id=854
Change-Id: I3108ea37b397371a2be84dd5f2380b304db23f18