summaryrefslogtreecommitdiff
path: root/vp9/common/x86
AgeCommit message (Collapse)Author
2014-05-27Fix compiling error in MSVSJingning Han
Need to include math.h before tmmintrin.h in some versions of MSVS. Change-Id: Ia6b83ae599316887ecf30c4e4b9e4355fb8a4219
2014-05-27Merge "Fix decoder mismatch in sub-pixel AVX2 intrinsic filters"Yunqing Wang
2014-05-23Fix decoder mismatch in sub-pixel AVX2 intrinsic filterslevytamar82
The subpixel SSSE3 was fixed in this patch: https://gerrit.chromium.org/gerrit/#/c/70283/ So the equivalent AVX2 is fixed accordingly. Change-Id: Ieebbc1949c99d34b12b8b47692df71aca5001f3a
2014-05-23Merge "Inverse 16x16 2D-DCT SSSE3 implementation"Jingning Han
2014-05-23Inverse 16x16 2D-DCT SSSE3 implementationJingning Han
This commit enables the SSSE3 implementation of full inverse 16x16 2D-DCT. The unit runtime goes down from 1642 cycles to 1519 cycles, about 7% speed-up. Change-Id: I14d2fdf9da1fb4ed1e5db7ce24f77a1bfc8ea90d
2014-05-23Fix decoder mismatch in sub-pixel SSSE3 intrinsic filtersYunqing Wang
In 8-tap filtering, to guarantee the intermediate results fit in 16 bits, the order of accumulating the products needs to be done correctly, and the largest product should be added last. This patch fixed the problem using the method in commit "Correct ssse3 8/16-pixel wide sub-pixel filter calculation". Change-Id: I79d0ad60c057b15011ece84cda9648eee0809423
2014-05-23Merge "change to use assembly version of ssse3 filter code"Yaowu Xu
2014-05-22change to use assembly version of ssse3 filter codeYaowu Xu
As mismatchs were found between the intrinsic version and c only. The commit temporarily revert to use the matching assembly version to allow further investigation. Change-Id: I08436c47d4888b562c0eac8e8856d90a831442df
2014-05-22Merge "Fix a decoding mismatch in sub-pixel filters"Yunqing Wang
2014-05-22Fix a decoding mismatch in sub-pixel filtersYunqing Wang
This did the same correction as the one in commit "Correct ssse3 8/16-pixel wide sub-pixel filter calculation" to avoid saturation during filtering. Change-Id: Ife9aa3f62daf9114eb24fe38f7baa3c3f361b2d6
2014-05-21Renames x86_64 specific asm filesDeb Mukherjee
Renames all x86_64 specific assembly files to consistently end in _x86_64.asm. This will be useful for build systems to handle these files differently. All new 64-bit specific assembly files should use the new naming convention. Change-Id: I36c89584967c82ffc4088b1b5044ac15d2bb7536
2014-05-08Change eob threshold for partial inverse 8x8 2D-DCT to 12Jingning Han
The scanning order has the first 12 coefficients of the 8x8 2D-DCT sitting in the top left 4x4 block. Hence the partial inverse 8x8 2D-DCT allows to handle cases with eob below 12. The overall runtime of the inverse 8x8 2D-DCT unit is reduced from 166 cycles (using SSE2) to 150 cycles (using SSSE3). Change-Id: I4514f9748042809ac84df4c14382c00f313f1cd2
2014-05-07SSSE3 8x8 inverse 2D-DCT with first 10 coeffs non-zeroJingning Han
This commit enables ssse3 assembly implementation of the 8x8 inverse 2D-DCT with only first 10 coefficients non-zero. The average runtime for this unit goes down from 198 cycles to 129 cycles (34.8% faster). Change-Id: Ie7fa4386f6d3a2fe0d47a2eb26fc2a6bbc592ac7
2014-05-05SSSE3 implementation of full inverse 8x8 2D-DCTJingning Han
This commit enables SSSE3 version full inverse 8x8 2D-DCT and reconstruction. It makes the runtime of vp9_idct8x8_64_add down from 256 cycles (SSE2) to 246 cycles. Change-Id: I0600feac894d6a443a3c9d18daf34156d4e225c3
2014-04-10Merge "Fix encoder uninitialized read errors reported by drmemory"Yunqing Wang
2014-04-09Fix encoder uninitialized read errors reported by drmemoryYunqing Wang
This patch fixed the uninitialized read errors in Issue 748: "dr memory VP9 encode errors". In vp9_convolve_avg_sse2, when width is 4, pavgb reads 8 bytes from dst buffer that is out of range. An error is reported although the data is not actually used later. This issue was resolved by preventing uninitialized reads. Change-Id: I109a54910aa47139cb13119de86f2062cff207df
2014-04-08Fix avx builds on macosx with clang 5.0.Tom Finegan
The macosx release of clang v5.0 identifies itself as: Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn) This version of clang uses the older _mm_broadcastsi128_si256, like v3.3, as given away in the LLVM svn version above. Change-Id: I4d6d59d5454efd57d2ae9e75f5eb7486af7cbd0c
2014-03-06vp9_subpixel_8t_intrin_avx2: fix build w/clang 3.4+James Zern
clang reports gcc-4.2.1 in e.g., 3.3, 3.4; add a specific clang version check for _mm256_broadcastsi128_si256 fixes issue #720 Change-Id: I5c8e3c27fdea05d8a5b050e8cb74894b595f4709
2014-02-20Merge "vp9_subpixel_8t_intrin_ssse3.c: make some tables static"James Zern
2014-02-18vp9_subpixel_8t_intrin_ssse3.c: make some tables staticJames Zern
+ fix formatting Change-Id: I344d4de089d03e403f0c7b3e64aeb7086cce86ac
2014-02-18vp9_subpixel_8t_intrin_avx2.c: make some tables staticJames Zern
+ fix formatting Change-Id: Ia62610bff3d63855104366d7860749b6a3cf4577
2014-02-18Merge "SSSE3 convolution optimization"Yunqing Wang
2014-02-14Merge "minor spelling cleanup in comments"Yaowu Xu
2014-02-14SSSE3 convolution optimizationlevytamar82
Optimizing all SSSE3 assembly for convolution: 1. vp9_filter_block1d4_h8_sse2 2. vp9_filter_block1d8_h8_sse2 3. vp9_filter_block1d16_h8_sse2 4. vp9_filter_block1d4_v8_sse2 5. vp9_filter_block1d8_v8_sse2 6. vp9_filter_block1d16_v8_sse2 my optimization include: -processing 2x8 elements in one 128 bit register instead of processing 8 elements in one 128 bit register. -removing unecessary loads. This optimization gives between 2.4% user level gain for 480p input and 1.6% user level gain for 720p. This Optimization is done only for 64 bit Change-Id: Ic07fce2f9360329b4f2d956efda1480ae958766b
2014-02-12AVX2 Convolve Optimizationlevytamar82
Two convolve functions were optimized for AVX2: 1. vp9_filter_block1d16_h8 2. vp9_filter_block1d16_v8 vp9_filter_block1d16_v8 was optimized for AVX2 by reducing the number of loop strides by half, two strides were processed in parallel. vp9_filter_block1d16_v8 was also optimized in the same way also some of the loads were being done outside of the loop and by that preventing redundant loads. This Optimization gives 43% function level gain and 1.3% user level gain. Now can be compiled in Windows Change-Id: I2714124cfb0c14a77d7a0ce126a20db92ffbf92c
2014-02-12minor spelling cleanup in commentsAndrew Russell
Change-Id: Ia91c6c406273345b08505097ffe1af3896980f06
2014-02-10vp9/common/x86: Silence MSVC warnings in vp9_asm_stubs.c.Tom Finegan
Update filter_1dfunction definition to match usage. Change-Id: Ie3cae13dc1ec3f5838c5f29d1c76a1a98a9217fa
2014-02-04Optimize bilinear sub-pixel filters in ssse3Yunqing Wang
This patch added ssse3 optimization of bilinear sub-pixel filters. The real time encoder was speeded up by ~1%. Change-Id: Ie82e98976f411183cb8c61ab8d2ba0276e55a338
2014-02-03Optimize bilinear sub-pixel filters in sse2Yunqing Wang
Using bilinear filters could speed up the codec in real-time mode. This patch added sse2 optimizations of bilinear filters that operate on different-sized blocks. Tests showed that the real-time encoder was speeded up by 3%. Change-Id: If99a7ee4385fcc225c3ee7445d962d5752e57c3f
2014-01-28Add macros for convolve functionsYunqing Wang
Added macros to reduce the code duplication. Change-Id: I1916aa5a386ea07d961d4ec439ab09bb8c45487d
2014-01-27Removing _1d suffix from transform names.Dmitry Kovalev
It is enough to specify (e.g.) idct16, it is obviously different from idct16x16. Change-Id: I6b408a37a945de3162429380b59a775b03b95db0
2014-01-23vp9/common: add extern "C" to headersJames Zern
Change-Id: Ic334da9aee968e33762c2b25d9fbad24c844b411
2014-01-16Revert "Revert "Revert "SSSE3 convolution optimization"""Yunqing Wang
This reverts commit f9404f240642222775a371acde8fc0721b3812df. This patch caused some ASAN error. Change-Id: If15b7e581310e19061d111c69f2931809662ed19
2014-01-13Revert "Revert "SSSE3 convolution optimization""Yunqing Wang
This reverts commit b645257121da20b422dbbebf02aae0fc6dff95d4. Change-Id: I60d1bf57ae8e9eb6127f42f2d5a780124ac51b45
2014-01-10Revert "SSSE3 convolution optimization"Paul Wilkins
This reverts commit 511d218c60b9b6c1ab9383db746815e907af0359. In current form intrinsics break borg build. Change-Id: Ied37936af841250ecff449802e69a3d3761c91b9
2014-01-09Merge "Optimze inv 16x16 DCT with 10 non-zero coeffs - P2"Jingning Han
2014-01-09Merge "Optimze inv 16x16 DCT with 10 non-zero coeffs - P1"Jingning Han
2014-01-09Optimze inv 16x16 DCT with 10 non-zero coeffs - P2Jingning Han
This commit further optimizes SSE2 operations in the second 1-D inverse 16x16 DCT, with (<10) non-zero coefficients. The average runtime of this module goes down from 779 cycles -> 725 cycles. Change-Id: Iac31b123640d9b1e8f906e770702936b71f0ba7f
2014-01-09Merge "SSSE3 convolution optimization"Yunqing Wang
2014-01-09SSSE3 convolution optimizationlevytamar82
Optimizing all SSSE3 assembly for convolution: 1. vp9_filter_block1d4_h8_sse2 2. vp9_filter_block1d8_h8_sse2 3. vp9_filter_block1d16_h8_sse2 4. vp9_filter_block1d4_v8_sse2 5. vp9_filter_block1d8_v8_sse2 6. vp9_filter_block1d16_v8_sse2 my optimization include: -processing 2x8 elements in one 128 bit register instead of processing 8 elements in one 128 bit register. -removing unecessary loads. This optimization gives between 2.4% user level gain for 480p input and 1.6% user level gain for 720p. This Optimization done only for 64bit. Change-Id: Icb586dc0c938b56699864fcee6c52fd43b36b969
2014-01-08Optimze inv 16x16 DCT with 10 non-zero coeffs - P1Jingning Han
This commit is the first patch optimizing SSE2 implementation of inverse 16x16 DCT with <10 non-zero coefficients. It focused on the first 1-D (row) transformation. It exploits the fact that only top-left 4x4 block contains non-zero coefficients, in a 2-D inverse 16x16 DCT with <10 coeffients. The average runtime of idct16x16_10 unit is reduced from 883 cycles -> 779 cycles (12% faster). For pedestrian_area_1080p 300 frames at 4000 kbps, the speed 2 runtime goes down from 310651 ms -> 305910 ms. The decoding speed goes up from 80.37 fps -> 80.87 fps. Change-Id: Ic6f3ac5a637a76c07ba73ddaafe318a699fea645
2014-01-03Tune IDCT8_1D macro function interfaceJingning Han
This commit adds input/output ports for IDCT8_1D macro function to provide more flexibility in variable use. It allows to skip several buffer swap operations. Change-Id: I21f3450509537322293043b3281bfd3949868677
2014-01-03Reduce num of buffer swap calls in idct8_1d_sse2Jingning Han
This commit merges the initial buffer swap operations in idct8_1d_sse2 into the array transpose step, hence reducing number of instructions therein. Change-Id: I219f6f50813390d2ec3ee37eecf2a4a2b44ae479
2014-01-03Rework idct8x8_10 SSE2 implementationJingning Han
This commit optimizes the SSE2 implmentation of idct8x8_10. It exploits the fact that only top-left 4x4 block contains non-zero coefficients, and hence reduces the instructions needed. The runtime of idct8x8_10_sse2 goes down from 216 to 198 CPU cycles, estimated by averaging over 100000 runs. For pedestrian_area_1080p 300 frames coded at 4000kbps, the average decoding speed goes up from 79.3 fps to 79.7 fps. Change-Id: I6d277bbaa3ec9e1562667906975bae06904cb180
2013-12-20Merge "Code clean up"Yunqing Wang
2013-12-19Code clean upYunqing Wang
Removed unused filter coefficients. Change-Id: Ib395a51305e23ff41ab69c1808d56946d25961cd
2013-12-17rename loop filter functionsJim Bankoski
This renames all the loop filter functions so that they no longer refer to mb Change-Id: I8a58a8c7fd253d835cb619bde13913e896ece90b
2013-12-02Improve idct16x16: _256_add_sse2(x1.107)&_10_add_sse2(x1.012)Abo Talib Mahfoodh
The performance gain of idct16x16_10_add_sse2 function is not noticeable. However since both functions use the IDCT16_1D, idct16x16_10_add_sse2 should be modified as well. Tested with: park_joy_420_720p50.y4m Change-Id: I02b957e36fcf997c677d15baf496533895271bff
2013-12-02Merge "improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2"Yunqing Wang
2013-11-26improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2Abo Talib Mahfoodh
vp9_idct32x32_34_add_sse2: speedup: 1.472 IDCT32_1D_34 and MULTIPLICATION_AND_ADD_2 are optimized based on the fact that Only upper-left 8x8 has non-zero values. vp9_idct32x32_1024_add_sse2: speedup: 1.032 Tested with: park_joy_420_720p50.y4m Change-Id: I8670ce547552b48695049de298e2fc46ce28dfbc