path: root/vp9/common/x86
Age | Commit message | Author
2013-12-02 | Improve idct16x16: _256_add_sse2(x1.107)&_10_add_sse2(x1.012) | Abo Talib Mahfoodh
The performance gain of idct16x16_10_add_sse2 function is not noticeable. However since both functions use the IDCT16_1D, idct16x16_10_add_sse2 should be modified as well. Tested with: park_joy_420_720p50.y4m Change-Id: I02b957e36fcf997c677d15baf496533895271bff
2013-12-02 | Merge "improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2" | Yunqing Wang
2013-11-26 | improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2 | Abo Talib Mahfoodh
vp9_idct32x32_34_add_sse2: speedup 1.472. IDCT32_1D_34 and MULTIPLICATION_AND_ADD_2 are optimized based on the fact that only the upper-left 8x8 has non-zero values. vp9_idct32x32_1024_add_sse2: speedup 1.032. Tested with: park_joy_420_720p50.y4m Change-Id: I8670ce547552b48695049de298e2fc46ce28dfbc
2013-11-22 | Do vertical loopfiltering in parallel | Yunqing Wang
This patch followed "Add filter_selectively_vert_row2 to enable parallel loopfiltering" commit, and added x86 SSE2 optimization to do 16-pixel filtering in parallel. For other optimizations (neon and dspr2), current 16-pixel functions were done by calling 8-pixel functions twice, and real 16-pixel functions could be added later. Decoder speedup: tulip clip: 2% speed gain; old_town_cross: 1.2% speed gain; bus: 2% speed gain. Change-Id: I4818a0c72f84b34f5fe678e496cf4a10238574b7
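The fallback mentioned above for the non-SSE2 ports, where a 16-pixel function is built by calling an 8-pixel function twice, might look roughly like the sketch below. The filter `toy_lpf_vertical_8`, its `limit` parameter, and both function names are hypothetical stand-ins for illustration; this is not the VP9 loop filter or the libvpx API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for an 8-row vertical-edge filter (NOT the VP9 filter).
 * `s` points at the first pixel to the right of the edge in the top row.
 * For each of 8 rows, the two pixels straddling the edge are replaced by
 * their rounded average when they differ by less than `limit`. */
static void toy_lpf_vertical_8(uint8_t *s, int pitch, int limit) {
  assert(s != NULL);
  for (int r = 0; r < 8; ++r) {
    uint8_t *row = s + r * pitch;
    const int left = row[-1];
    const int right = row[0];
    if (abs(left - right) < limit) {
      const int avg = (left + right + 1) >> 1;
      row[-1] = (uint8_t)avg;
      row[0] = (uint8_t)avg;
    }
  }
}

/* 16-row variant built from two 8-row calls: rows 0..7, then rows 8..15. */
static void toy_lpf_vertical_16(uint8_t *s, int pitch, int limit) {
  toy_lpf_vertical_8(s, pitch, limit);
  toy_lpf_vertical_8(s + 8 * pitch, pitch, limit);
}
```

A dedicated 16-pixel routine would replace the two calls with one pass over all 16 rows, which is what the SSE2 version in this patch is described as doing.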
2013-11-20 | Correct ssse3 8/16-pixel wide sub-pixel filter calculation | Yunqing Wang
Although no mismatch was indicated for 8/16 wide sub-pixel filters in issue 661, they had similar problems that could cause mismatch potentially. This patch fixed calculations in HORIZx8/16 and VERTx8/16. Change-Id: I169961c9d40a20340995b7d22aafc89ccf30bfca
2013-11-20 | Fix stack pointer in sub-pixel filters | Yunqing Wang
In commit "3d50da5397d20abc932d81453b26cde758293a40", the stack pointer was modified while aligning the stack, and it needed to be popped at the end. Change-Id: I062971e195f1f2ab9d0ab5fb84dcf215a0fcaa67
2013-11-19 | Fix decoder mismatch with ssse3 enabled | Yunqing Wang
This patch fixed issue 661: "Decoder produces mismatched outputs with ssse3 enabled and disabled." In sub-pixel filters, a pixel value was multiplied by a filter coefficient, and the results were added up. The order of adding up these multiplications had to be arranged carefully to prevent incorrect overflowing. Change-Id: Id08af4200fea9e1b896fc40157b8651c2c7e80f2
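Why the order of additions matters can be seen with a small model of 16-bit saturating addition, which is what the SSE2 `paddsw` instruction performs in each lane. The sketch below is an illustration only: the helper names and the four partial-product values are hypothetical and are not taken from the libvpx filters. It shows that a saturating sum is not associative, so one ordering clips an intermediate result while another reaches the exact total.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating signed 16-bit add, modelling one lane of paddsw. */
static int16_t sat_add16(int16_t a, int16_t b) {
  const int32_t sum = (int32_t)a + (int32_t)b;
  if (sum > INT16_MAX) return INT16_MAX;
  if (sum < INT16_MIN) return INT16_MIN;
  return (int16_t)sum;
}

/* Accumulate four partial products strictly left to right. */
static int16_t sum_left_to_right(const int16_t p[4]) {
  assert(p != NULL);
  return sat_add16(sat_add16(sat_add16(p[0], p[1]), p[2]), p[3]);
}

/* Hypothetical alternative order: pair terms 0+2 and 1+3 first,
 * so that a large positive term meets a negative one early. */
static int16_t sum_interleaved(const int16_t p[4]) {
  assert(p != NULL);
  return sat_add16(sat_add16(p[0], p[2]), sat_add16(p[1], p[3]));
}
```

With the illustrative inputs `{30000, 10000, -20000, 0}`, the left-to-right order clips the first intermediate sum to 32767 and ends at 12767, while the interleaved order yields the exact total of 20000. Which order is safe in practice depends on the signs and magnitudes of the actual filter taps.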
2013-11-18 | Improve vp9_iht4x4_16_add_sse2 (x1.341) | Abo Talib Mahfoodh
This rebase is a better implementation of the previous ones. Modifications are done to reduce the total number of clock cycles. Speedup: 1.341. Compiled with -O3. Tested with: park_joy_420_720p50.y4m Change-Id: I940eaf283f60597ca0d9d2e13d518878d55ff02d
2013-11-15 | Do horizontal loopfiltering in parallel | Yunqing Wang
This patch followed "Rewrite filter_selectively_horiz for parallel loopfiltering" commit, and added x86 SSE2 optimization to do 16-pixel filtering in parallel. Also, corrected the declaration of aligned arrays. For 8-pixel-in-parallel case, improved the calculation of the masks and filters. Updated the threshold loading since the thresholds were already duplicated. Updated neon C functions to call neon loopfilters twice. Using tulip clip, tests showed it gave a ~1.5% decoder speed gain. Change-Id: Id02638626ac27a4b0e0b09d71792a24c0499bd35
2013-11-08 | Merge "Improve vp9_idct4x4_1_add_sse2" | Yunqing Wang
2013-11-01 | vp9 ssse3 d207_predictor_32x32: add missing GLOBAL() | James Zern
removes a textrel for sh_b23456789abcdefff Change-Id: I80cb9dfd8e49a0fe884c8ff76472275b3a00cb57
2013-10-31 | mb_lpf_horizontal_edge AVX2 optimization | Tamar Levy
This CL contains two AVX2 optimized loop filter functions, mb_lpf_horizontal_edge_w_avx2_8 and mb_lpf_horizontal_edge_w_avx2_16. Change-Id: I604e4fe6e99752b7800c2ea98721d97f7e0b931b
2013-10-25 | Merge "Add 32x32 idct function for eob<=34 case" | Yunqing Wang
2013-10-24 | Add 32x32 idct function for eob<=34 case | Yunqing Wang
When only the upper-left 8x8 area has non-zero dct coefficients, we could skip the 1D IDCT for the 9th to 32nd rows to save operations. This function is called when eob <= 34. Change-Id: I9684b75947bdde346cfe3720f08a953aa7a13fb5
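The selection rule described above can be sketched as a small dispatch on `eob`. The enum, the function name `choose_idct32`, and the treatment of `eob == 1` as a separate DC-only case are assumptions made for illustration (the log does list a `vp9_idct32x32_1_add` variant, but not how callers choose it); only the `eob <= 34` threshold comes from the commit message.

```c
#include <assert.h>

/* Hypothetical labels for the three 32x32 inverse-transform paths. */
typedef enum {
  IDCT32_DC_ONLY,        /* single coefficient (assumed eob == 1) */
  IDCT32_UPPER_LEFT_8X8, /* non-zero coefficients confined to the 8x8 corner */
  IDCT32_FULL            /* all 1024 coefficients may be non-zero */
} Idct32Kind;

/* Pick a path from the end-of-block position. `eob` is the number of
 * coefficients up to and including the last non-zero one in scan order. */
static Idct32Kind choose_idct32(int eob) {
  assert(eob >= 1 && eob <= 1024);
  if (eob == 1) return IDCT32_DC_ONLY;
  if (eob <= 34) return IDCT32_UPPER_LEFT_8X8;
  return IDCT32_FULL;
}
```

In the reduced path, the first 1D pass only needs to process the 8 rows that can hold non-zero input, which is where the saving over the full transform comes from.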
2013-10-23 | Renaming vp9_short_fdct8x8 to vp9_fdct8x8. | Dmitry Kovalev
For consistency with idct function names. Change-Id: I7b6af2f92c66eff56f84ed29edc3a66af8dc421f
2013-10-22 | Improve vp9_idct4x4_1_add_sse2 | Abo Talib Mahfoodh
Simple modification to reduce the number of cycles in the function. Original function number of cycles: 973. Modified function number of cycles: 835. Improvement factor: 1.165. Tested with: park_joy_420_720p50.y4m Change-Id: Ic5857272ea3aafe21d5ef9a69258d78c688f69bd
2013-10-18 | Fix d207 intra prediction SSSE3 functions | Yunqing Wang
This patch fixed a bug that caused 32bit PIC build mismatch. The stack pointer was modified after "GET_GOT". Loading left pointer from a hard-coded position gave wrong result. Change-Id: Iea0aec6f917b12a6b3393ffc986bad74510248cc
2013-10-15 | Merge "Fix a few indent format issues in buffer defs" | Jingning Han
2013-10-15 | Fix a few indent format issues in buffer defs | Jingning Han
Change-Id: Iac55891ac9e6f13718c9f822aa099b5ca491832a
2013-10-11 | Making input pointer of any inverse transform constant. | Dmitry Kovalev
Also renaming dest_stride to stride in some places. Change-Id: I75f602b623a5a7071d4922b747c45fa0b7d7a940
2013-10-11 | Consistent names for inverse hybrid transforms (1 of 2). | Dmitry Kovalev
Renames:
  vp9_short_iht4x4_add -> vp9_iht4x4_16_add
  vp9_short_iht8x8_add -> vp9_iht8x8_64_add
  vp9_short_iht16x16_add_c -> vp9_iht16x16_256_add
Change-Id: Ibca7a188fd062b196787ac5efc1ea545e7f166c0
2013-10-11 | Merge "Removing vp9_idct4_1d_sse2 function." | Dmitry Kovalev
2013-10-11 | Code cleanup | Yunqing Wang
Minor code cleanup. Change-Id: I47c1f794842d4570bb39cfd23b80f54f5606bba6
2013-10-11 | Merge "SSE2 8-tap sub-pixel filter optimization" | Yunqing Wang
2013-10-10 | Removing vp9_idct4_1d_sse2 function. | Dmitry Kovalev
We have two SSE2-optimized functions for idct4_1d:
  vp9_idct4_1d_sse2 <-- removing this one
  idct4_1d_sse2
vp9_idct4_1d_sse2 was used only by the following functions, which already have SSE2-optimized variants:
  vp9_idct4x4_16_add_c -> vp9_idct4x4_16_add_sse2
  idct8_1d -> vp9_idct8x8_{16, 10, 1}_sse2
  vp9_short_iht4x4_add_c -> vp9_short_iht4x4_add_sse2
Change-Id: Ib0a7f6d1373dbaf7a4a41208cd9d0671fdf15edb
2013-10-10 | d207 intra prediction ssse3 using bytes | Scott LaVarnway
Byte version of Ronald's d207 ssse3 optimizations (commit: f891f84d3ba9345b0074e682f0fea09b8ddf4f1e) Change-Id: If15f71a589ea16f78ac86a501b0c5c6231dc9af1
2013-10-10 | Merge "Giving consistent names to IDCT 32x32 functions." | Dmitry Kovalev
2013-10-10 | Merge "d153 intra prediction (32x32) ssse3 using bytes" | Yunqing Wang
2013-10-10 | SSE2 8-tap sub-pixel filter optimization | Yunqing Wang
To ensure fast encoding/decoding on devices without ssse3 support, SSE2 optimization of sub-pixel filters was done. Test using 1080p clip showed the decoder speeds were ~70fps with ssse3 filters, ~60fps with sse2 filters, and ~15fps with c filters. Change-Id: Ie2088f87d83a889fba80a613e4d0e287aadd785c
2013-10-10 | Giving consistent names to IDCT 32x32 functions. | Dmitry Kovalev
Renames:
  vp9_short_idct32x32_add -> vp9_idct32x32_1024_add
  vp9_short_idct32x32_1_add -> vp9_idct32x32_1_add
  vp9_idct_add_32x32 -> vp9_idct32x32_add
Change-Id: Id85306f5814bac6c47463a6b5901a93082510666
2013-10-07 | Giving consistent names to IDCT 16x16 functions. | Dmitry Kovalev
Renames:
  vp9_short_idct16x16_add -> vp9_idct16x16_256_add
  vp9_short_idct16x16_10_add -> vp9_idct16x16_10_add
  vp9_short_idct16x16_1_add -> vp9_idct16x16_1_add
  vp9_idct_add_16x16 -> vp9_idct16x16_add
Change-Id: Ief8a3904de78deab0f4ede944c4d0339c228cfc3
2013-10-07 | Merge "Giving consistent names to IDCT 8x8 functions." | Dmitry Kovalev
2013-10-07 | d153 intra prediction (32x32) ssse3 using bytes | Scott LaVarnway
Change-Id: Ie2c0d84ff9f6294084d65f4380e1f30c09e681c9
2013-10-06 | Merge changes I8a106dd6,Iec442603 | Jim Bankoski
* changes:
  d153 intra prediction (16x16) ssse3 using bytes
  d153 intra prediction ssse3 using bytes
2013-10-06 | Giving consistent names to IDCT 8x8 functions. | Dmitry Kovalev
Renames:
  vp9_short_idct8x8_add -> vp9_idct8x8_64_add
  vp9_short_idct8x8_1_add -> vp9_idct8x8_1_add
  vp9_short_idct8x8_10_add -> vp9_idct8x8_10_add
  vp9_idct_add_8x8 -> vp9_idct8x8_add
Change-Id: Ifb8d3a45b4c0397aa805b30463f3d14581bf72c1
2013-10-04 | Giving consistent names to IDCT/IWHT functions. | Dmitry Kovalev
The idea is to have the following names for each transform size:
  vp9_idct4x4_add
  vp9_idct4x4_1_add
  vp9_idct4x4_10_add
  vp9_idct4x4_16_add
  vp9_idct8x8_add
  vp9_idct8x8_1_add
  vp9_idct8x8_10_add
  vp9_idct8x8_64_add
  etc for 16x16, 32x32
The actual list of renames in this patch:
  vp9_idct_add_lossless -> vp9_iwht4x4_add
  vp9_short_iwalsh4x4_add -> vp9_iwht4x4_16_add
  vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add
  vp9_idct_add -> vp9_idct4x4_add
  vp9_short_idct4x4_add -> vp9_idct4x4_16_add
  vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add
Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1
2013-10-03 | Merge "Rewrite HORIZx4 and HORIZx8 in subpixel filter functions" | Yunqing Wang
2013-10-03 | Rewrite HORIZx4 and HORIZx8 in subpixel filter functions | Yunqing Wang
In subpixel filters, prefetched source data, unrolled loops, and interleaved instructions. In HORIZx4, integrated the idea in Scott's CL (commit: d22a504d11a15dc3eab666859db0046b5a7d75c5), which was suggested by Erik/Tamar from Intel. Further tweaking was done to combine row 0, 2, and row 1, 3 in registers to do more 2-row-in-1 operations until the last add. Test showed a ~2% decoder speedup. Change-Id: Ib53d04ede8166c38c3dc744da8c6f737ce26a0e3
2013-10-02 | d153 intra prediction (16x16) ssse3 using bytes | Scott LaVarnway
Change-Id: I8a106dd61b0a2520fae792d87d6348e662649b2d
2013-10-01 | Adding SSE2 optimized vp9_short_idct32x32_1_add function. | Dmitry Kovalev
Change-Id: I4b1c6bb9ff615f5872b96ed07dbf0f5e18e63643
2013-10-01 | Merge "Modify HORIZx16 macro in subpixel filter functions" | Yunqing Wang
2013-10-01 | Modify HORIZx16 macro in subpixel filter functions | Yunqing Wang
Interleaved the instructions, reduced register dependency, and prefetched the source data. This improved the decoder speed by 0.6% - 2%. Change-Id: I568067aa0c629b2e58219326899c82aedf7eccca
2013-10-01 | d153 intra prediction ssse3 using bytes | Scott LaVarnway
Byte version of Ronald's d153 ssse3 optimizations for 4x4 and 8x8 (commit: fc91a2a112238a1aee568f3b840585de4e928fca) Change-Id: Iec4426032311483f615fd9e0dceba3ee85ddebd7
2013-09-29 | fixed cpp lint issue in vp9_postproc_x86 | Jim Bankoski
Change-Id: I2b2af1dd9f5c29c05e28a4fd51fa58ccc4071477
2013-09-29 | nolintify intrinsic idct file | Jim Bankoski
Change-Id: Id2cc5c829399a2afdf7a8a82615a4e272c814986
2013-09-27 | Renaming vp9_short_idct10_8x8_add to vp9_short_idct8x8_10_add. | Dmitry Kovalev
Making name consistent with vp9_short_idct8x8 and vp9_short_idct8x8_1. Change-Id: I99e0be040ec893f9571dcf090e18f98dc58339f5
2013-09-27 | Merge "Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10." | Dmitry Kovalev
2013-09-26 | Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10. | Dmitry Kovalev
Making function name consistent with vp9_short_idct16x16 and vp9_short_idct16x16_1. Change-Id: I70e54be9e6b9a1dddab0de470686591e96d05517
2013-09-25 | d63 intra prediction ssse3 using bytes | Scott LaVarnway
Byte version of Ronald's d63 ssse3 optimizations (commit: c5a1c8cf3541cf3665fee981b36d22c9fbd4191e) Change-Id: Ifd3e6d454a2246085f23eabb38518a930321e807
2013-09-18 | Fix x86inc.asm to build PIC code correctly | Yunqing Wang
Current x86inc.asm didn't handle the 32bit PIC build properly. TEXTRELs were seen in the built library. The PIC macros from libvpx's x86_abi_support.asm were used to fix this problem. The assembly code was modified to use the macros. Notes: We need this fix for decoder building. Functions in the encoder will be fixed later. Change-Id: Ifa548d37b1d0bc7d0528db75009cc18cd5eb1838