path: root/vp9/common/x86
AgeCommit messageAuthor
2015-01-18SSE2 code for the filter in MFQE.JackyChen
The SSE2 code comes from VP8 MFQE and is reused in VP9. No change on the VP8 side. In our testing, this change gives a 2x speed-up. Change-Id: Ib2b14144ae57c892005c1c4b84e3379d02e56716
2014-12-12Merge "Remove redundant loads on 1d16_v8 filter."James Zern
2014-12-12Merge "Remove redundant loads on 1d8_v8 filter."James Zern
2014-12-12Remove redundant loads on 1d16_v8 filter.Frank Galligan
This CL showed about a 3% gain in performance on some systems. Change-Id: Id27e7e0b8e69068aa364e67859436da852669250
2014-12-12Remove redundant loads on 1d8_v8 filter.Frank Galligan
This CL showed a modest gain in performance on some systems. Change-Id: Iad636a89a1a9804ab7a0dea302bf2c6a4d1653a4
2014-12-12vp9_loopfilter_mmx: remove some unused tablesJames Zern
Change-Id: I964d25cc91c8e4864d73b142d9c7a1b39cb6cfbb
2014-12-11Corrected optimization of 8x8 DCT codePeter de Rivaz
The 8x8 DCT uses a fast version whenever possible. There was a mistake in the checking code which meant sometimes the fast version was used when it was not safe to do so. Change-Id: I154c84c9e2d836764768a11082947ca30f4b5ab7 (cherry picked from commit fd05fb0c21e253b4d6f92d7e0b752850ff8ab188)
2014-12-08Merge "SSSE3 Optimization for Atom processors using new instruction selection and ordering"Yunqing Wang
2014-12-08Merge "Changes to assembler for NASM on mac."James Zern
2014-12-08SSSE3 Optimization for Atom processors using new instruction selection and orderinglevytamar82
The function vp9_filter_block1d16_h8_ssse3 uses the PSHUFB instruction, which has a 3-cycle latency and slows execution when issued in blocks of 5 or more on Atom processors. Replacing the PSHUFB instructions with more efficient single-cycle instructions (PUNPCKLBW + PUNPCKHBW + PALIGNR) improves performance. In the original code, PSHUFB uses every byte and copies it consecutively. PUNPCKLBW and PUNPCKHBW do this more efficiently, with PALIGNR concatenating the intermediate results and shifting right to obtain the next consecutive 16 bytes for the final result. For example:
filter = 0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8
Reg = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
REG1 = PUNPCKLBW Reg, Reg = 0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7
REG2 = PUNPCKHBW Reg, Reg = 8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15
PALIGNR REG2, REG1, 1 = 0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8
This optimization improved the function performance by 23% and produced a 3% user level gain on 1080p content on Atom processors. There was no observed performance impact on Core processors (expected). Change-Id: I3cec701158993d95ed23ff04516942b5a4a461c0
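The byte shuffling described in this commit message can be checked with a small model. The sketch below is a plain Python simulation of the 128-bit byte semantics of the four instructions, not the actual assembly; the helper names and the test data are illustrative assumptions.

```python
# Byte-level model of the SSE instructions named in the commit message.
# Registers are lists of 16 bytes, index 0 being the lowest byte.

def punpcklbw(dst, src):
    """Interleave the low 8 bytes of dst and src: d0,s0,d1,s1,..."""
    out = []
    for d, s in zip(dst[:8], src[:8]):
        out += [d, s]
    return out

def punpckhbw(dst, src):
    """Interleave the high 8 bytes of dst and src: d8,s8,d9,s9,..."""
    out = []
    for d, s in zip(dst[8:], src[8:]):
        out += [d, s]
    return out

def palignr(dst, src, imm):
    """Concatenate dst:src (src in the low half), shift right by imm bytes,
    and keep the low 16 bytes."""
    return (src + dst)[imm:imm + 16]

def pshufb(src, mask):
    """Select src[mask[i] & 15] for each byte, or 0 if the mask's high bit is set."""
    return [0 if m & 0x80 else src[m & 0x0F] for m in mask]

# Shuffle pattern quoted in the commit message.
shuffle_mask = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8]

reg = list(range(16))
reg1 = punpcklbw(reg, reg)
reg2 = punpckhbw(reg, reg)
result = palignr(reg2, reg1, 1)

# Arbitrary (made-up) pixel bytes to show the equivalence is not tied to 0..15.
data = [(i * 37 + 5) % 256 for i in range(16)]
via_unpack = palignr(punpckhbw(data, data), punpcklbw(data, data), 1)
via_pshufb = pshufb(data, shuffle_mask)
```

With `reg = 0..15`, `result` reproduces the `filter` pattern from the message, and for the arbitrary `data` the unpack/align path and the PSHUFB path give the same 16 bytes.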
2014-12-02Added high bitdepth sse2 transform functionsPeter de Rivaz
Also removes some spurious changes in common/vp9_blockd.h which were introduced by a rebase issue between nextgen and master branches. Change-Id: If359f0e9a71bca9c2ba685a87a355873536bb282 (cherry picked from commit 005d80cd05269a299cd2f7ddbc3d4d8b791aebba) (cherry picked from commit 08d2f548007fd8d6fd41da8ef7fdb488b6485af3) (cherry picked from commit 4230c2306c194c058f56433a5275aa02a2e71d56)
2014-11-24Changes to assembler for NASM on mac.John Stark
Fixes the non-Apple nasm part of issue #755. Change-Id: I11955d270c4ee55e3c00e99f568de01b95e7ea9a
2014-11-05Fix visual studio 2013 compiler warningsYaowu Xu
For builds configured with --enable-vp9-highbitdepth. Change-Id: I2b181519d7192f8d7a241ad5760c3578255f24e6
2014-11-01WORKAROUND FIX FOR GCC4.9.1levytamar82
In the function mb_lpf_horizontal_edge_w_avx2_16, the use of the intrinsic _mm256_cvtepu8_epi16 triggers a compiler bug in gcc 4.9.1. Until it is fixed, this workaround performs the up-conversion using broadcast128+shuffle. The bug was reported here: https://code.google.com/p/webm/issues/detail?id=867 Change-Id: I73452e6806f42e0fadcde96b804ea3afa7eeb351
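One way to read "broadcast128+shuffle" is: copy the 16 source bytes into both 128-bit lanes, then use a per-lane byte shuffle whose mask interleaves each source byte with a zero. The sketch below models that idea in plain Python. The shuffle mask and helper names are assumptions for illustration, since the commit message does not show the actual constants.

```python
# Model of zero-extending 16 unsigned bytes to 16 unsigned 16-bit words
# (what _mm256_cvtepu8_epi16 computes) using a broadcast plus a per-lane shuffle.

ZERO = 0x80  # a mask byte with the high bit set makes the shuffle write 0

def broadcast128(src16):
    """Both 128-bit lanes of the 256-bit result hold the same 16 bytes."""
    return list(src16) + list(src16)

def shuffle_epi8_256(src32, mask32):
    """Per-lane byte shuffle: each lane indexes only within itself."""
    out = []
    for lane in range(2):
        base = lane * 16
        for m in mask32[base:base + 16]:
            out.append(0 if m & 0x80 else src32[base + (m & 0x0F)])
    return out

# Assumed mask: lane 0 spreads source bytes 0..7, lane 1 spreads bytes 8..15,
# each followed by a zero byte (little-endian high half of the word).
MASK = []
for i in range(8):
    MASK += [i, ZERO]
for i in range(8, 16):
    MASK += [i, ZERO]

def to_words(bytes32):
    """Reassemble little-endian 16-bit words from 32 bytes."""
    return [bytes32[2 * i] | (bytes32[2 * i + 1] << 8) for i in range(16)]

def widen_u8_to_u16(src16):
    return to_words(shuffle_epi8_256(broadcast128(src16), MASK))

pixels = [255, 0, 17, 128, 64, 200, 3, 99, 250, 1, 2, 77, 180, 31, 254, 8]
widened = widen_u8_to_u16(pixels)
```

Each word in `widened` equals the corresponding byte in `pixels`, which is the result a direct zero-extension would give.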
2014-10-09Rename highbitdepth functions to use highbd prefixDeb Mukherjee
Uses highbd_ prefix convention consistently. Change-Id: I58f7f799a7ff8e32701bcd71c955bcf1cdd4581e
2014-09-23Merge "High bit-depth loop/arf/postproc filter functions"Deb Mukherjee
2014-09-23High bit-depth loop/arf/postproc filter functionsDeb Mukherjee
Adds high-bitdepth loopfilter, temporal filter and postproc functions Change-Id: I81c8a9176890784686bc4f2af0d550d243b3b2d3
2014-09-18Merge "FIX: vp9_loopfilter_intrin_sse2.c"Frank Galligan
2014-09-18FIX: vp9_loopfilter_intrin_sse2.cScott LaVarnway
Fixes Visual Studio build failures Change-Id: I233719cd63b3ad0db16e2834bf1d7ea1df805880
2014-09-18Merge "Adds high bitdepth convolve, interpred & scaling"Deb Mukherjee
2014-09-18Adds high bitdepth convolve, interpred & scalingDeb Mukherjee
Change-Id: Ie51c352a6b250547207cbc1ebba833a01ed053e3
2014-09-17Merge "Improved mb_lpf_horizontal_edge_w_sse2_16() #2"Frank Galligan
2014-09-17Improved mb_lpf_horizontal_edge_w_sse2_16() #2Scott LaVarnway
The decoder performance improved up to 1% for the test clips used. Change-Id: I4621112bdccfba01640322facfa4ba8da8290ea5
2014-09-16Adding high-bitdepth intra prediction functionsDeb Mukherjee
Change-Id: I6f5cb101e2dc57c3d3f4d7e0ffb4ddbed027d111
2014-09-11Allow specifying opt dependenciesJohann
If an optimization uses more than one CPU feature, allow specifying all of them so that '--disable-X' still works. https://code.google.com/p/webm/issues/detail?id=854 Change-Id: I3108ea37b397371a2be84dd5f2380b304db23f18
2014-09-09Merge "Cleaning up and speeding up vp9_idct32x32_1024_add_sse2()."Dmitry Kovalev
2014-09-05Cleaning up and speeding up vp9_idct32x32_1024_add_sse2().Dmitry Kovalev
Change-Id: If91017b792572c9db6e257011ca307bef8428486
2014-09-05Removing postproc mmx code.Dmitry Kovalev
Removed functions:
* vp9_post_proc_down_and_across_mmx
* vp9_mbpost_proc_down_mmx
* vp9_plane_add_noise_mmx
They all have SSE2 equivalents. Change-Id: I59c1fac12b7c96ca4538d455e4400c2b7875feff
2014-08-21Merge "Fix bug 804"Yaowu Xu
2014-08-07Fix bug 804levytamar82
A bug in the Microsoft compiler was found in the function vp9_filter_block1d16_v8_avx2 and a workaround was applied. The bug occurs when there are 4 consecutive maddubs + min + adds intrinsic instructions. Change-Id: I83499faeb70971e650e5663fd2490360ddb1a51b
2014-08-05Remove vp9_postproc_x86.hJohann
This configuration has moved to vp9_rtcd_defs.pl Change-Id: I71a31dbb8d79df226b60dd834324a5af69956c51
2014-06-12Use lrand48 on AndroidJohann
When building x86 assembly use lrand48 instead of the undocumented inlined _rand function. Android now supports rand() https://android-review.googlesource.com/97731 but only for new versions. Original workaround: https://gerrit.chromium.org/gerrit/15744 Change-Id: I130566837d5bfc9e54187ebe9807350d1a7dab2a
2014-06-03Merge "Enable SSSE3 inverse 2D-DCT with 10 non-zero coeffs"Jingning Han
2014-05-29Fix a potential overflow issue in inverse 16x16 full 2D-DCTJingning Han
An overflow issue could potentially happen in the second round 1-D transform of the SSSE3 full inverse 16x16 2D-DCT. This commit fixes this issue. Change-Id: Ia19e4888fda1cc929a28a5f89a5beec612d628dc
2014-05-28Enable SSSE3 inverse 2D-DCT with 10 non-zero coeffsJingning Han
This commit enables the SSSE3 implementation of the inverse 2D-DCT with only the first 10 coefficients non-zero. It reduces the runtime from 745 cycles (SSE2 version) to 538 cycles, i.e., a 27% speed-up. Change-Id: I18ba4128859b09c704a6ee361d69a86c09fe8dfe
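The quoted percentage corresponds to the relative reduction in cycle count. A minimal sketch of that arithmetic, using only the two cycle counts given in the message (the function names are illustrative):

```python
def runtime_reduction_percent(before_cycles, after_cycles):
    """Share of the original runtime that was removed, in percent."""
    return 100.0 * (before_cycles - after_cycles) / before_cycles

def throughput_ratio(before_cycles, after_cycles):
    """How many times faster the new version runs per unit of work."""
    return before_cycles / after_cycles

reduction = runtime_reduction_percent(745, 538)
ratio = throughput_ratio(745, 538)
```

Here `reduction` comes out just under 28, matching the roughly 27% figure in the message, while `ratio` expresses the same change as a throughput gain of a little under 1.4x.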
2014-05-27Fix compiling error in MSVSJingning Han
Need to include math.h before tmmintrin.h in some versions of MSVS. Change-Id: Ia6b83ae599316887ecf30c4e4b9e4355fb8a4219
2014-05-27Merge "Fix decoder mismatch in sub-pixel AVX2 intrinsic filters"Yunqing Wang
2014-05-23Fix decoder mismatch in sub-pixel AVX2 intrinsic filterslevytamar82
The subpixel SSSE3 was fixed in this patch: https://gerrit.chromium.org/gerrit/#/c/70283/ So the equivalent AVX2 is fixed accordingly. Change-Id: Ieebbc1949c99d34b12b8b47692df71aca5001f3a
2014-05-23Merge "Inverse 16x16 2D-DCT SSSE3 implementation"Jingning Han
2014-05-23Inverse 16x16 2D-DCT SSSE3 implementationJingning Han
This commit enables the SSSE3 implementation of full inverse 16x16 2D-DCT. The unit runtime goes down from 1642 cycles to 1519 cycles, about 7% speed-up. Change-Id: I14d2fdf9da1fb4ed1e5db7ce24f77a1bfc8ea90d
2014-05-23Fix decoder mismatch in sub-pixel SSSE3 intrinsic filtersYunqing Wang
In 8-tap filtering, to guarantee the intermediate results fit in 16 bits, the order of accumulating the products needs to be done correctly, and the largest product should be added last. This patch fixed the problem using the method in commit "Correct ssse3 8/16-pixel wide sub-pixel filter calculation". Change-Id: I79d0ad60c057b15011ece84cda9648eee0809423
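Why the accumulation order matters can be shown with a model of 16-bit saturating addition. The sketch below assumes the products are formed in pairs (as a multiply-add of adjacent taps would produce them) and compares a left-to-right accumulation with one that adds the two outer pairs first, then the smaller middle pair, then the larger one last. The tap values and pixels are hypothetical, chosen to sum to 128 and to trigger saturation; they are not taken from the VP9 filter tables.

```python
INT16_MAX = 32767
INT16_MIN = -32768

def adds16(a, b):
    """Signed 16-bit saturating add."""
    return max(INT16_MIN, min(INT16_MAX, a + b))

def pair_products(pixels, taps):
    """Four partial sums, each the saturated sum of two adjacent products."""
    return [adds16(pixels[i] * taps[i], pixels[i + 1] * taps[i + 1])
            for i in range(0, 8, 2)]

def accumulate_left_to_right(x):
    acc = x[0]
    for v in x[1:]:
        acc = adds16(acc, v)
    return acc

def accumulate_largest_last(x):
    """Outer pairs first, then the smaller middle pair, then the larger one."""
    acc = adds16(x[0], x[3])
    acc = adds16(acc, min(x[1], x[2]))
    return adds16(acc, max(x[1], x[2]))

# Hypothetical symmetric 8-tap filter (sums to 128) and made-up pixel values.
taps = [-3, 7, -18, 78, 78, -18, 7, -3]
pixels = [0, 255, 0, 204, 203, 0, 0, 255]

x = pair_products(pixels, taps)
exact = sum(p * t for p, t in zip(pixels, taps))
naive = accumulate_left_to_right(x)
safe = accumulate_largest_last(x)
```

For this input the exact sum is 32766 and fits in 16 bits, yet the left-to-right order saturates at an intermediate step and ends at 32002. Adding the largest partial sum last reproduces the exact value.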
2014-05-23Merge "change to use assembly version of ssse3 filter code"Yaowu Xu
2014-05-22change to use assembly version of ssse3 filter codeYaowu Xu
Mismatches were found between the intrinsic version and the C-only version. This commit temporarily reverts to the matching assembly version to allow further investigation. Change-Id: I08436c47d4888b562c0eac8e8856d90a831442df
2014-05-22Merge "Fix a decoding mismatch in sub-pixel filters"Yunqing Wang
2014-05-22Fix a decoding mismatch in sub-pixel filtersYunqing Wang
This did the same correction as the one in commit "Correct ssse3 8/16-pixel wide sub-pixel filter calculation" to avoid saturation during filtering. Change-Id: Ife9aa3f62daf9114eb24fe38f7baa3c3f361b2d6
2014-05-21Renames x86_64 specific asm filesDeb Mukherjee
Renames all x86_64 specific assembly files to consistently end in _x86_64.asm. This will be useful for build systems to handle these files differently. All new 64-bit specific assembly files should use the new naming convention. Change-Id: I36c89584967c82ffc4088b1b5044ac15d2bb7536
2014-05-08Change eob threshold for partial inverse 8x8 2D-DCT to 12Jingning Han
The scanning order places the first 12 coefficients of the 8x8 2D-DCT in the top-left 4x4 block. Hence the partial inverse 8x8 2D-DCT can handle cases with eob below 12. The overall runtime of the inverse 8x8 2D-DCT unit is reduced from 166 cycles (using SSE2) to 150 cycles (using SSSE3). Change-Id: I4514f9748042809ac84df4c14382c00f313f1cd2
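The reasoning behind a partial inverse transform can be illustrated with a floating-point model: if every non-zero coefficient lies in the top-left 4x4 block, an inverse transform that reads only that block gives the same output as the full one while doing less work. The sketch below uses a textbook orthonormal inverse DCT rather than VP9's integer transform, and the coefficient values are made up.

```python
import math

N = 8

def idct_1d(coeffs, n_in=N):
    """Orthonormal inverse DCT of length N, reading only the first n_in inputs."""
    out = []
    for x in range(N):
        s = 0.0
        for k in range(n_in):
            scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
            s += scale * coeffs[k] * math.cos(math.pi * (2 * x + 1) * k / (2 * N))
        out.append(s)
    return out

def idct_2d(block, rows_in=N, cols_in=N):
    """Separable 2D inverse DCT that skips rows/columns known to be zero."""
    # Row pass: rows at or beyond rows_in are all zero, so skip them.
    tmp = [idct_1d(block[r], cols_in) if r < rows_in else [0.0] * N
           for r in range(N)]
    # Column pass: only the first rows_in entries of each column can be non-zero.
    out = [[0.0] * N for _ in range(N)]
    for c in range(N):
        col = idct_1d([tmp[r][c] for r in range(N)], rows_in)
        for r in range(N):
            out[r][c] = col[r]
    return out

# Made-up coefficients confined to the top-left 4x4 block.
block = [[0.0] * N for _ in range(N)]
for r in range(4):
    for c in range(4):
        block[r][c] = float((r + 1) * 10 - c * 3)

full = idct_2d(block)
partial = idct_2d(block, rows_in=4, cols_in=4)
max_diff = max(abs(full[r][c] - partial[r][c])
               for r in range(N) for c in range(N))
```

`max_diff` is zero up to rounding, so the cheaper path is safe whenever the end-of-block position guarantees the remaining coefficients are zero.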
2014-05-07SSSE3 8x8 inverse 2D-DCT with first 10 coeffs non-zeroJingning Han
This commit enables the SSSE3 assembly implementation of the 8x8 inverse 2D-DCT with only the first 10 coefficients non-zero. The average runtime for this unit goes down from 198 cycles to 129 cycles (34.8% faster). Change-Id: Ie7fa4386f6d3a2fe0d47a2eb26fc2a6bbc592ac7
2014-05-05SSSE3 implementation of full inverse 8x8 2D-DCTJingning Han
This commit enables the SSSE3 version of the full inverse 8x8 2D-DCT and reconstruction. It brings the runtime of vp9_idct8x8_64_add down from 256 cycles (SSE2) to 246 cycles. Change-Id: I0600feac894d6a443a3c9d18daf34156d4e225c3
2014-04-10Merge "Fix encoder uninitialized read errors reported by drmemory"Yunqing Wang