path: root/vpx_dsp/x86
Age | Commit message | Author
2019-01-15 | Remove unnecessary calculation in 4-tap interpolation filter | chiyotsai
Reduces the number of rows calculated for the 2D 4-tap interpolation filter from h+7 rows to h+3 rows. Also fixes a bug in the avx2 function for 4-tap filters where the last row is computed incorrectly.
Performance:
           | Baseline | Result   | Pct Gain |
bitdepth lo| 4.00 fps | 4.02 fps | 0.5%     |
bitdepth 10| 1.90 fps | 1.91 fps | 0.5%     |
The performance is evaluated on speed 1 on jets.y4m br 500 over 100 frames. No BDBR loss is observed.
Change-Id: I90b0d4d697319b7bba599f03c5dc01abd85d13b1
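The h+7 and h+3 row counts in the commit above are consistent with how a separable 2D filter is commonly staged: the horizontal pass writes an intermediate buffer, and a vertical filter with k taps needs k-1 rows beyond the h output rows. The sketch below only illustrates that arithmetic for the unscaled case. `intermediate_rows` is a hypothetical helper introduced here, not a libvpx function.

```c
#include <stddef.h>

/* Hypothetical helper (not part of libvpx): number of rows the horizontal
 * pass must produce so that a vertical filter with `taps` taps can emit
 * `h` output rows. Assumes no scaling, i.e. one input row per output row. */
size_t intermediate_rows(size_t h, size_t taps) {
  return h + taps - 1;
}
```

With this helper, an 8-tap filter on a 64-row block needs 71 intermediate rows (h+7), while a 4-tap filter needs 67 (h+3), so four rows of horizontal filtering are saved per block.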
2019-01-07 | vpx_filter: resolve missing declarations | Johann
BUG=webm:1584 Change-Id: I1be768446b9304123da7b1ea0aed0db056db31c5
2018-12-24 | Merge "fwd_dct32x32 avx2: resolve missing declarations" | Johann Koenig
2018-12-21 | Merge "fwd_dct32x32 sse2: resolve missing declarations" | Johann Koenig
2018-12-21 | fwd_dct32x32 avx2: resolve missing declarations | Johann
BUG=webm:1584 Change-Id: Iaba854952534a95e710a985acfcab46e093872c2
2018-12-21 | fwd_dct32x32 sse2: resolve missing declarations | Johann
BUG=webm:1584 Change-Id: Ia2d9fcbccbad0c2142a3759e610670b86af0fef4
2018-12-21 | convolve avx2: resolve missing declarations | Johann
BUG=webm:1584 Change-Id: I5990c0100af83d13f7a4800147473bc997f5e5d1
2018-12-21 | Merge "subpixel_8t sse2: resolve missing declarations" | Johann Koenig
2018-12-21 | subpixel_8t ssse3: resolve missing declarations | Johann
BUG=webm:1584 Change-Id: I48b9a9cdcfe52536f685c41fb2d3c0f3e9192d34
2018-12-21 | subpixel_8t sse2: resolve missing declarations | Johann
vpx_asm_stubs.c only references these sse2 functions. Combine the files similar to the way the ssse3/avx2 files are set up. Mark the intrinsics as static because they are only used within the macros here. It is unfortunate that the assembly functions can not be marked static as well. BUG=webm:1584 Change-Id: I342687a1046ae6ca46ae58644a7c170440de1dfb
2018-12-21 | subpixel_8t avx2: resolve missing declarations | Johann
BUG=webm:1584 Change-Id: I92504ed4a2e54129c981b7380249962afb7966df
2018-12-21 | Merge "highbd quantize: resolve missing declarations" | Johann Koenig
2018-12-20 | highbd quantize: resolve missing declarations | Johann
BUG=webm:1584 Change-Id: Ia3f152bf2a37f8a1ea4178eeb1a6a262ea034a8d
2018-12-20 | highbd variance: resolve missing declarations | Johann
The optimizations were accidentally disabled during the move from vp9:
commit c3bdffb0a508ad08d5dfa613c029f368d4293d4c
author Johann <johannkoenig@google.com> Fri May 15 18:52:03 2015
    Move variance functions to vpx_dsp
    subpel functions will be moved in another patch.
BUG=webm:1584
Change-Id: Ia7899ee0cfad13a0e1516b89756552064846e81c
2018-12-08 | Merge "Add satd avx2 implementation" | Sai Deng
2018-12-07 | Add high bit Hadamard 32x32 avx2 implementation | sdeng
Speed test:
[ RUN ] C/HadamardHighbdTest.DISABLED_Speed/2
Hadamard32x32[ 10 runs]: 9 us
Hadamard32x32[ 10000 runs]: 8914 us
Hadamard32x32[ 10000000 runs]: 8991776 us
[ RUN ] AVX2/HadamardHighbdTest.DISABLED_Speed/2
Hadamard32x32[ 10 runs]: 5 us
Hadamard32x32[ 10000 runs]: 4582 us
Hadamard32x32[ 10000000 runs]: 4548203 us
Change-Id: Ied1b38b510bd033299f05869216d394e3b7f70f1
2018-12-06 | Add satd avx2 implementation | sdeng
Speed Test:
C/SatdHighbdTest
blocksize: 16 time: 138 us
blocksize: 64 time: 315 us
blocksize: 256 time: 1120 us
blocksize: 1024 time: 3955 us
AVX2/SatdHighbdTest
blocksize: 16 time: 89 us
blocksize: 64 time: 189 us
blocksize: 256 time: 590 us
blocksize: 1024 time: 1912 us
Change-Id: I6357174462fccd589a475b13d8114b853cab5383
2018-12-05 | Add high bit Hadamard 16x16 avx2 implementation | sdeng
Speed test:
[ RUN ] C/HadamardHighbdTest.DISABLED_Speed/1
Hadamard16x16[ 10 runs]: 2 us
Hadamard16x16[ 10000 runs]: 1836 us
Hadamard16x16[ 10000000 runs]: 1829451 us
[ RUN ] AVX2/HadamardHighbdTest.DISABLED_Speed/1
Hadamard16x16[ 10 runs]: 1 us
Hadamard16x16[ 10000 runs]: 1009 us
Hadamard16x16[ 10000000 runs]: 984856 us
Change-Id: I89b9cdbe19350815576d66e627df87e5025ed0a4
2018-12-03 | Add high bit Hadamard 8x8 avx2 implementation | sdeng
Speed tests:
[ RUN ] C/HadamardHighbdTest.DISABLED_Speed/0
Hadamard8x8[ 10 runs]: 0 us
Hadamard8x8[ 10000 runs]: 316 us
Hadamard8x8[ 10000000 runs]: 311749 us
[ OK ] C/HadamardHighbdTest.DISABLED_Speed/0 (371 ms)
[ RUN ] AVX2/HadamardHighbdTest.DISABLED_Speed/0
Hadamard8x8[ 10 runs]: 0 us
Hadamard8x8[ 10000 runs]: 161 us
Hadamard8x8[ 10000000 runs]: 156910 us
[ OK ] AVX2/HadamardHighbdTest.DISABLED_Speed/0 (160 ms)
Change-Id: I94f7324be20405ff55f8a02ad4651c4ab4c10202
2018-11-30 | quantize 32x32: saturate dqcoeff on x86 | Johann
This slows down low bitdepth builds but is necessary to obtain correct values. BUG=webm:1448 Change-Id: I4ca9145f576089bb8496fcfeedeb556dc8fe6574
2018-11-28 | quantize 32x32: fix dqcoeff | Johann
Calculate the high bits of dqcoeff and store them appropriately in high bit depth builds. Low bit depth builds still do not pass. C truncates the results after division. X86 only supports packing with saturation at this step. BUG=webm:1448 Change-Id: Ic80def575136c7ca37edf18d21e26925b475da98
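The difference between truncating and saturating when narrowing a 32-bit value to 16 bits can be shown in scalar C. This is only an illustrative sketch, not the libvpx quantizer: `narrow_truncate` and `narrow_saturate` are hypothetical names, and the saturating version models the behaviour of a signed saturating pack such as the SSE2 intrinsic `_mm_packs_epi32` (whether the commit uses that exact instruction is an assumption).

```c
#include <stdint.h>

/* Hypothetical helper: keep only the low 16 bits, as a plain C narrowing
 * conversion does in practice. The final cast of an out-of-range value is
 * implementation-defined in C, but wraps on mainstream two's-complement
 * compilers. */
int16_t narrow_truncate(int32_t v) {
  return (int16_t)(uint16_t)((uint32_t)v & 0xFFFFu);
}

/* Hypothetical helper: clamp to the int16_t range, as a signed saturating
 * pack does. */
int16_t narrow_saturate(int32_t v) {
  if (v > INT16_MAX) return INT16_MAX;
  if (v < INT16_MIN) return INT16_MIN;
  return (int16_t)v;
}
```

For in-range inputs the two functions agree. They diverge only when the dequantized value overflows 16 bits, which is the case the commit message says low bit depth builds do not pass.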
2018-11-28 | quantize: fix x86 hbd builds | Johann
Calculate the high bits of dqcoeff in high bit depth builds and store them appropriately. BUG=webm:1448 Change-Id: I61a2f8bfcf2e30765f10a94073c4d58321d2fa24
2018-11-27 | rename quantize_x86.h | Johann
Pave the way for new quantize_OPT.h helper files. Change-Id: Ice7225612983f5587a9660af3320c7d0c8bb1c2f
2018-11-05 | Merge "clang-tidy: fix vpx_dsp parameters" | Johann Koenig
2018-11-02 | Merge "vpx postproc: rewrite in intrinsics" | Johann Koenig
2018-11-01 | clang-tidy: fix vpx_dsp parameters | Johann
BUG=webm:1444 Change-Id: Iee19be068afc6c81396c79218a89c469d2e66207
2018-10-31 | clang-tidy: normalize variance functions | Johann
Always use src/ref and _ptr/_stride suffixes. Normalize to [xy]_offset and second_pred. Drop some stray source/recon_strides. BUG=webm:1444 Change-Id: I32362a50988eb84464ab78686348610ea40e5c80
2018-10-30 | Add SSE2 support for hbd 4-tap interpolation filter. | chiyotsai
Unit test performance on bitdepth 10:
    | 4X4 | 8X8 |16X16|64X64|
  2D|1.582|1.461|1.425|1.572|
HORZ|1.643|1.247|1.346|1.345|
VERT|1.378|1.695|2.020|1.763|
Unit test performance on bitdepth 12:
    | 4X4 | 8X8 |16X16|64X64|
  2D|1.578|1.409|1.426|1.497|
HORZ|1.625|1.153|1.323|1.259|
VERT|1.392|1.707|2.030|1.787|
Change-Id: I6df85330ac33fcb17d46e4302b41415dda1219f5
2018-10-29 | vpx postproc: rewrite in intrinsics | Johann
About 10% faster on 64-bit but about 10% slower on 32-bit. Removes the assembly usage of vpx_rv. Change-Id: I214698fb5677f615dee0a8f5f5bb8f64daf2565e
2018-10-29 | Add AVX2 support for hbd 4-tap interpolation filter. | chiyotsai
Speed gain:
BIT DEPTH | 8TAP FPS | 4TAP FPS | PCT INC |
10        | 1.69     | 1.85     | 9.46%   |
12        | 1.64     | 1.78     | 8.54%   |
Speed test is done on jet.y4m on speed 1 profile 2 over 100 frames with br=500.
Change-Id: I411e122553e2c466be7a26e64b4dd144efb884a9
2018-10-25 | vp8 bilinear: rewrite 4x4 | Johann
~20% faster than the MMX. Removes the last usage of vp8_bilinear_filters_x86_[48]. Change-Id: Iee976fab9655d0020440f26c4403ce50103af913
2018-10-25 | Merge "Add AVX2 support for 4-tap interpolation filter." | Chi Yo Tsai
2018-10-24 | Add AVX2 support for 4-tap interpolation filter. | chiyotsai
Performance:
     | 4X4 | 8X8 |16X16|64X64|
2 DIM|1.491|1.902|1.772|1.479|
 HORZ|1.145|1.521|1.757|1.497|
 VERT|1.176|1.614|1.707|1.467|
Each number in the chart above is 8-tap function time / 4-tap function time. The framerate tested on jets.y4m for 100 frames on speed 1 increased from 3.72 fps to 3.91 fps (about 5% increase).
Change-Id: Ic0ad275cf32fafeefd0a89811badd8adff2134a0
2018-10-23 | Clean up vpx_dsp/x86/convolve_sse2.h | chiyotsai
Removes unnecessary includes and rewords some functions/comments. Change-Id: Ied557d7faa9d845d38255e6e3e0e3fe1395276e1
2018-10-18 | Changes 4-tap SSSE3 filter to 8-tap AVX2 filter. | chiyotsai
AVX2's 8-tap filter is slightly faster than 4-tap SSSE3 filter. Change-Id: I5fc37c431670780108706b206b32c791828555c9
2018-10-18 | Add SSSE3 support for 4-tap interpolation filter | chiyotsai
Performance:
     | 4X4 | 8X8 |16X16|64X64|
2 DIM|1.526|1.827|1.844|1.906|
 HORZ|1.336|1.795|1.886|1.654|
 VERT|1.443|1.539|2.139|2.190|
The ratio is SSSE3 8-tap time / SSSE3 4-tap time.
Change-Id: I01ed2ab494428256e918875774a459afecc5ec6a
2018-10-17 | Adds SSE2 support for interpolation filter for width 4 and 8 | chiyotsai
Performance: The chart below shows the speed relative to baseline (baseline_time/new_time).
     | 4X4 | 8X8 |16X16|64X64|
2 DIM|1.889|1.780|1.811|1.963|
 HORZ|2.266|1.834|1.617|1.595|
 VERT|2.043|2.190|2.373|2.485|
Change-Id: Ic4262222db78f013b94a8c61b46efb8520722927
2018-10-17 | Refactor SSE2 Code for 4-tap interpolation filter on width 16. | chiyotsai
Some repeated code is refactored into inline functions. No performance degradation is observed. These inline functions can be used for width 8 and width 4. Change-Id: Ibf08cc9ebd2dd47bd2a6c2bcc1616f9d4c252d4d
2018-10-17 | Add SSE2 support for 4-tap interpolation filter for width 16. | chiyotsai
Horizontal filter on 64x64 block: 1.59 times as fast as baseline. Vertical filter on 64x64 block: 2.5 times as fast as baseline. 2D filter on 64x64 block: 1.96 times as fast as baseline. Change-Id: I12e46679f3108616d5b3475319dd38b514c6cb3c
2018-10-10 | subpel asm: fix whitespace | Johann
Change-Id: I7a3314a268cf6049a7260361043e76d4561085c6
2018-09-24 | Merge "sanitizer: sse2 - fix unaligned double stores" | Johann Koenig
2018-09-25 | sanitizer: sse2 - fix unaligned double stores | Matthias Räncker
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de> Change-Id: I838c8678e62f7cff13387b84d4f3ea42710a67ea
2018-09-21 | sanitizer: fix unaligned loads | Matthias Räncker
Another instance of unaligned 4-byte loads. Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de> Change-Id: I06afc5405bb074384eec7a8c8123e5803e522937
2018-09-20 | sanitizer: fix unaligned load/stores | Matthias Räncker
When built with -fsanitize=address,undefined, a number of tests, such as ByteAlignmentTest.SwitchByteAlignment, produce runtime errors about unaligned 4-byte loads/stores. While normally not really a problem, this does technically violate the language standard, and it is easy to fix in a standard-conforming way using memcpy, which does not produce inferior code. Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de> Change-Id: Ie1e97ab25fe874f864df48b473569f00563181ae
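The memcpy approach mentioned above can be sketched as follows. This is a minimal illustration of the general technique, not the code from the commit. `load_u32` and `store_u32` are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 4 bytes from a possibly unaligned address.
 * Dereferencing a misaligned uint32_t pointer is undefined behaviour in C;
 * copying the bytes with memcpy is well defined, and compilers typically
 * lower it to a single load on x86. */
uint32_t load_u32(const void *p) {
  uint32_t v;
  memcpy(&v, p, sizeof(v));
  return v;
}

/* Hypothetical helper: write 4 bytes to a possibly unaligned address. */
void store_u32(void *p, uint32_t v) {
  memcpy(p, &v, sizeof(v));
}
```

A caller would replace a cast such as `*(uint32_t *)ptr` with `load_u32(ptr)`, which removes the sanitizer report without changing the bytes read or written.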
2018-09-18 | Fix stack corruption with x86 and --enable-pic | Matthias Räncker
x86inc.asm's cglobal macro is frequently used to declare more arguments than the function actually has. Normally, this is done to acquire an alias to a register that would correspond to that positional function argument if it existed. This is safe when used in this manner. In the case fixed here, however, the alias is used to temporarily store addresses obtained through the GOT in memory. Because those extra arguments don't actually exist, those stores corrupt the caller's stack frame. SSE2/VpxHBDSubpelVarianceTest.Ref is a test that may fail as a result. As a simple fix, the space allocated to actual arguments that have already been loaded into registers is reused. This avoids having to allocate extra space for local variables. Also removed duplicate code while at it. Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de> Change-Id: I505281ecaa6be586185fe6a2d34d62bdf40c839f
2018-09-15 | cosmetics: normalize include guards | James Zern
use the recommended format [1] of: <PROJECT>_<PATH>_<FILE>_H_
[1] https://google.github.io/styleguide/cppguide.html#The__define_Guard
"All header files should have #define guards to prevent multiple inclusion. The format of the symbol name should be <PROJECT>_<PATH>_<FILE>_H_."
Change-Id: I2e8ab0b32fb23c30fa43cff5fec12d043c0d2037
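As an illustration of the <PROJECT>_<PATH>_<FILE>_H_ pattern, a header at the hypothetical path vpx_dsp/x86/example.h in the vpx project might be guarded as below. The file name and the `EXAMPLE_FILTER_TAPS` macro are assumptions made for this sketch and do not come from the commit.

```c
/* Guard symbol built as <PROJECT>_<PATH>_<FILE>_H_ for the hypothetical
 * file vpx_dsp/x86/example.h in project "vpx". */
#ifndef VPX_VPX_DSP_X86_EXAMPLE_H_
#define VPX_VPX_DSP_X86_EXAMPLE_H_

/* Hypothetical content, present only so the header is not empty. */
#define EXAMPLE_FILTER_TAPS 4

#endif /* VPX_VPX_DSP_X86_EXAMPLE_H_ */
```

Because the path is encoded in the symbol, two headers with the same file name in different directories get distinct guards.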
2018-08-07 | Merge "VPX: Improve HBD vpx_hadamard_32x32_sse2()" | Scott LaVarnway
2018-08-07 | vpx_highbd_d153_predictor_4x4_sse2: reduce load size | James Zern
this avoids reading 4 pixels into another block, which may be operated on by a different thread. quiets a tsan warning. Change-Id: Id27ad9d61819b0e5de0230647b4b510f7c265a71
2018-07-25 | VPX: Improve HBD vpx_hadamard_32x32_sse2() | Scott LaVarnway
BUG=webm:1546 Change-Id: I48224f047547b666c519e0cc23706dd0bda5df20
2018-07-24 | VPX: avg_intrin_sse2.c, avg_intrin_avx2.c cleanup | Scott LaVarnway
Change-Id: I710b66dc571a6bd38fbcc2528486d5e028a68b37