Age | Commit message (Collapse) | Author |
|
The SSE2 code is from VP8 MFQE, reuse it in VP9. No change on VP8
side. In our testing, we achieve 2X speed by adopting this change.
Change-Id: Ib2b14144ae57c892005c1c4b84e3379d02e56716
|
|
|
|
|
|
This CL showed about a 3% gain in performance on some systems.
Change-Id: Id27e7e0b8e69068aa364e67859436da852669250
|
|
This CL showed a modest gain in performance on some systems.
Change-Id: Iad636a89a1a9804ab7a0dea302bf2c6a4d1653a4
|
|
Change-Id: I964d25cc91c8e4864d73b142d9c7a1b39cb6cfbb
|
|
The 8x8 DCT uses a fast version whenever possible.
There was a mistake in the checking code which
meant sometimes the fast version was used when it
was not safe to do so.
Change-Id: I154c84c9e2d836764768a11082947ca30f4b5ab7
(cherry picked from commit fd05fb0c21e253b4d6f92d7e0b752850ff8ab188)
|
|
selection and ordering"
|
|
|
|
ordering
The function vp9_filter_block1d16_h8_ssse3 uses the PSHUFB instruction which has a 3 cycle latency and slows execution when done in blocks of 5 or more on Atom processors.
By replacing the PSHUFB instructions with other more efficient single cycle instructions (PUNPCKLBW + PUNPCHBW + PALIGNR) performance can be improved.
In the original code, the PSHUBF uses every byte and is consecutively copied.
This is done more efficiently by PUNPCKLBW and PUNPCHBW, using PALIGNR to concatenate the intermediate result and then shift right the next consecutive 16 bytes for the final result.
For example:
filter = 0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8
Reg = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
REG1 = PUNPCKLBW Reg, Reg = 0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7
REG2 = PUNPCHBW Reg, Reg = 8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15
PALIGNR REG2, REG1, 1 = 0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8
This optimization improved the function performance by 23% and produced a 3% user level gain on 1080p content on Atom processors.
There was no observed performance impact on Core processors (expected).
Change-Id: I3cec701158993d95ed23ff04516942b5a4a461c0
|
|
Also removes some spurious changes in common/vp9_blockd.h which
was introduced by a rebase issue between nextgen and master branches.
Change-Id: If359f0e9a71bca9c2ba685a87a355873536bb282
(cherry picked from commit 005d80cd05269a299cd2f7ddbc3d4d8b791aebba)
(cherry picked from commit 08d2f548007fd8d6fd41da8ef7fdb488b6485af3)
(cherry picked from commit 4230c2306c194c058f56433a5275aa02a2e71d56)
|
|
fixes non-Apple nasm part of issue #755
Change-Id: I11955d270c4ee55e3c00e99f568de01b95e7ea9a
|
|
For configured with --enable-vp9-highbitdepth
Change-Id: I2b181519d7192f8d7a241ad5760c3578255f24e6
|
|
In the function mb_lpf_horizontal_edge_w_avx2_16 the usage of the intrinsic
_mm256_cvtepu8_epi16 cause a compiler bug in gcc 4.9.1.
until it will be fixed I created a workaround that create the up convert by
using broadcast128+shuffle.
The bug was reported here:
https://code.google.com/p/webm/issues/detail?id=867
Change-Id: I73452e6806f42e0fadcde96b804ea3afa7eeb351
|
|
Uses highbd_ prefix convention consistently.
Change-Id: I58f7f799a7ff8e32701bcd71c955bcf1cdd4581e
|
|
|
|
Adds high-bitdepth loopfilter, temporal filter and postproc functions
Change-Id: I81c8a9176890784686bc4f2af0d550d243b3b2d3
|
|
|
|
Fixes Visual Studio build failures
Change-Id: I233719cd63b3ad0db16e2834bf1d7ea1df805880
|
|
|
|
Change-Id: Ie51c352a6b250547207cbc1ebba833a01ed053e3
|
|
|
|
The decoder performance improved up to 1% for the
test clips used.
Change-Id: I4621112bdccfba01640322facfa4ba8da8290ea5
|
|
Change-Id: I6f5cb101e2dc57c3d3f4d7e0ffb4ddbed027d111
|
|
If optimizations use more than one cpu feature, allow
specifying them so that '--disable-X' still works
https://code.google.com/p/webm/issues/detail?id=854
Change-Id: I3108ea37b397371a2be84dd5f2380b304db23f18
|
|
|
|
Change-Id: If91017b792572c9db6e257011ca307bef8428486
|
|
Removed functions:
* vp9_post_proc_down_and_across_mmx
* vp9_mbpost_proc_down_mmx
* vp9_plane_add_noise_mmx
They all have sse2 equivalent.
Change-Id: I59c1fac12b7c96ca4538d455e4400c2b7875feff
|
|
|
|
A bug in Microsoft compiler was found in the function
vp9_filter_block1d16_v8_avx2 and a workaround applied.
the bug occur when there was 4 consecutive maddubs + min + adds
intrinsic instructions.
Change-Id: I83499faeb70971e650e5663fd2490360ddb1a51b
|
|
This configuration has moved to vp9_rtcd_defs.pl
Change-Id: I71a31dbb8d79df226b60dd834324a5af69956c51
|
|
When building x86 assembly use lrand48 instead of the
undocumented inlined _rand function.
Android now supports rand()
https://android-review.googlesource.com/97731
but only for new versions. Original workaround:
https://gerrit.chromium.org/gerrit/15744
Change-Id: I130566837d5bfc9e54187ebe9807350d1a7dab2a
|
|
|
|
An overflow issue could potentially happen in the second round 1-D
transform of the SSSE3 full inverse 16x16 2D-DCT. This commit fixes
this issue.
Change-Id: Ia19e4888fda1cc929a28a5f89a5beec612d628dc
|
|
This commit enables SSSE3 implementation of the inverse 2D-DCT
with only first 10 coefficients non-zero. It reduces the runtime
of SSE2 version from 745 cycles to 538 cycles, i.e., 27% speed-up.
Change-Id: I18ba4128859b09c704a6ee361d69a86c09fe8dfe
|
|
Need to include math.h before tmmintrin.h in some versions of MSVS.
Change-Id: Ia6b83ae599316887ecf30c4e4b9e4355fb8a4219
|
|
|
|
The subpixel SSSE3 was fixed in this patch:
https://gerrit.chromium.org/gerrit/#/c/70283/
So the equivalent AVX2 is fixed accordingly.
Change-Id: Ieebbc1949c99d34b12b8b47692df71aca5001f3a
|
|
|
|
This commit enables the SSSE3 implementation of full inverse 16x16
2D-DCT. The unit runtime goes down from 1642 cycles to 1519 cycles,
about 7% speed-up.
Change-Id: I14d2fdf9da1fb4ed1e5db7ce24f77a1bfc8ea90d
|
|
In 8-tap filtering, to guarantee the intermediate results fit in
16 bits, the order of accumulating the products needs to be done
correctly, and the largest product should be added last. This
patch fixed the problem using the method in commit "Correct ssse3
8/16-pixel wide sub-pixel filter calculation".
Change-Id: I79d0ad60c057b15011ece84cda9648eee0809423
|
|
|
|
As mismatchs were found between the intrinsic version and c only. The
commit temporarily revert to use the matching assembly version to
allow further investigation.
Change-Id: I08436c47d4888b562c0eac8e8856d90a831442df
|
|
|
|
This did the same correction as the one in commit "Correct ssse3
8/16-pixel wide sub-pixel filter calculation" to avoid saturation
during filtering.
Change-Id: Ife9aa3f62daf9114eb24fe38f7baa3c3f361b2d6
|
|
Renames all x86_64 specific assembly files to consistently
end in _x86_64.asm. This will be useful for build systems to
handle these files differently.
All new 64-bit specific assembly files should use the new
naming convention.
Change-Id: I36c89584967c82ffc4088b1b5044ac15d2bb7536
|
|
The scanning order has the first 12 coefficients of the 8x8 2D-DCT
sitting in the top left 4x4 block. Hence the partial inverse 8x8
2D-DCT allows to handle cases with eob below 12.
The overall runtime of the inverse 8x8 2D-DCT unit is reduced from
166 cycles (using SSE2) to 150 cycles (using SSSE3).
Change-Id: I4514f9748042809ac84df4c14382c00f313f1cd2
|
|
This commit enables ssse3 assembly implementation of the 8x8
inverse 2D-DCT with only first 10 coefficients non-zero. The
average runtime for this unit goes down from 198 cycles to 129
cycles (34.8% faster).
Change-Id: Ie7fa4386f6d3a2fe0d47a2eb26fc2a6bbc592ac7
|
|
This commit enables SSSE3 version full inverse 8x8 2D-DCT and
reconstruction. It makes the runtime of vp9_idct8x8_64_add down
from 256 cycles (SSE2) to 246 cycles.
Change-Id: I0600feac894d6a443a3c9d18daf34156d4e225c3
|
|
|