path: root/vpx_dsp
Age | Commit message | Author
2018-11-02 | Merge "vpx postproc: rewrite in intrinsics" | Johann Koenig
2018-11-02 | Merge "Add highbd Hadamard transform C implementations" | Sai Deng
2018-11-01 | Add highbd Hadamard transform C implementations | sdeng
Change-Id: Ibec078c80ca1dfe6fbbc4288db89d719dac453a7
2018-10-31 | clang-tidy: normalize variance functions | Johann
Always use src/ref and _ptr/_stride suffixes. Normalize to [xy]_offset and second_pred. Drop some stray source/recon_strides. BUG=webm:1444 Change-Id: I32362a50988eb84464ab78686348610ea40e5c80
2018-10-30 | Add SSE2 support for hbd 4-tap interpolation filter. | chiyotsai
Unit test performance on bitdepth 10:
    | 4X4 | 8X8 |16X16|64X64|
2D  |1.582|1.461|1.425|1.572|
HORZ|1.643|1.247|1.346|1.345|
VERT|1.378|1.695|2.020|1.763|
Unit test performance on bitdepth 12:
    | 4X4 | 8X8 |16X16|64X64|
2D  |1.578|1.409|1.426|1.497|
HORZ|1.625|1.153|1.323|1.259|
VERT|1.392|1.707|2.030|1.787|
Change-Id: I6df85330ac33fcb17d46e4302b41415dda1219f5
2018-10-29 | vpx postproc: rewrite in intrinsics | Johann
About 10% faster on 64-bit but about 10% slower on 32-bit. Removes the assembly usage of vpx_rv. Change-Id: I214698fb5677f615dee0a8f5f5bb8f64daf2565e
2018-10-29 | Add AVX2 support for hbd 4-tap interpolation filter. | chiyotsai
Speed gain:
BIT DEPTH | 8TAP FPS | 4TAP FPS | PCT INC |
10        | 1.69     | 1.85     | 9.46%   |
12        | 1.64     | 1.78     | 8.54%   |
Speed test is done on jet.y4m on speed 1 profile 2 over 100 frames with br=500. Change-Id: I411e122553e2c466be7a26e64b4dd144efb884a9
2018-10-25 | vp8 bilinear: rewrite 4x4 | Johann
~20% faster than the MMX. Removes the last usage of vp8_bilinear_filters_x86_[48]. Change-Id: Iee976fab9655d0020440f26c4403ce50103af913
2018-10-25 | Merge "Add AVX2 support for 4-tap interpolation filter." | Chi Yo Tsai
2018-10-24 | Add AVX2 support for 4-tap interpolation filter. | chiyotsai
Performance:
     | 4X4 | 8X8 |16X16|64X64|
2 DIM|1.491|1.902|1.772|1.479|
HORZ |1.145|1.521|1.757|1.497|
VERT |1.176|1.614|1.707|1.467|
Each number in the chart above is 8-tap function time / 4-tap function time. The framerate tested on jets.y4m for 100 frames on speed 1 increased from 3.72 fps to 3.91 fps (about 5% increase). Change-Id: Ic0ad275cf32fafeefd0a89811badd8adff2134a0
2018-10-23 | Clean up vpx_dsp/x86/convolve_sse2.h | chiyotsai
Removes unnecessary includes and rewords some functions/comments. Change-Id: Ied557d7faa9d845d38255e6e3e0e3fe1395276e1
2018-10-18 | Changes 4-tap SSSE3 filter to 8-tap AVX2 filter. | chiyotsai
AVX2's 8-tap filter is slightly faster than the 4-tap SSSE3 filter. Change-Id: I5fc37c431670780108706b206b32c791828555c9
2018-10-18 | Add SSSE3 support for 4-tap interpolation filter | chiyotsai
Performance:
     | 4X4 | 8X8 |16X16|64X64|
2 DIM|1.526|1.827|1.844|1.906|
HORZ |1.336|1.795|1.886|1.654|
VERT |1.443|1.539|2.139|2.190|
The ratio is SSSE3 8-tap time / SSSE3 4-tap time. Change-Id: I01ed2ab494428256e918875774a459afecc5ec6a
2018-10-17 | Adds SSE2 support for interpolation filter for width 4 and 8 | chiyotsai
Performance: The chart below shows the speed relative to baseline (baseline_time/new_time)
_____| 4X4 | 8X8 |16X16|64X64|
2 DIM|1.889|1.780|1.811|1.963|
HORZ |2.266|1.834|1.617|1.595|
VERT |2.043|2.190|2.373|2.485|
Change-Id: Ic4262222db78f013b94a8c61b46efb8520722927
2018-10-17 | Refactor SSE2 Code for 4-tap interpolation filter on width 16. | chiyotsai
Some repeated code is refactored into inline functions. No performance degradation is observed. These inline functions can be used for width 8 and width 4. Change-Id: Ibf08cc9ebd2dd47bd2a6c2bcc1616f9d4c252d4d
2018-10-17 | Add SSE2 support for 4-tap interpolation filter for width 16. | chiyotsai
Horizontal filter on 64x64 block: 1.59 times as fast as baseline. Vertical filter on 64x64 block: 2.5 times as fast as baseline. 2D filter on 64x64 block: 1.96 times as fast as baseline. Change-Id: I12e46679f3108616d5b3475319dd38b514c6cb3c
2018-10-16 | Fix the filter tap calculation in mips optimizations | Yunqing Wang
The interp filter tap calculation could not reliably tell the difference between 2 taps and 4 taps. This patch fixes the bug and resolves the Jenkins test failures in the mips sub-pel filter optimizations. BUG=webm:1568 Change-Id: I51eb8adb7ed194ef2ea7dd4aa57aa9870ee38cfc
2018-10-10 | subpel asm: fix whitespace | Johann
Change-Id: I7a3314a268cf6049a7260361043e76d4561085c6
2018-09-24 | clang-format v6.0.1 | Johann
Change-Id: I83c7e64fe70f7c49aa2492ed2d640c6756b7ebaa
2018-09-24 | Merge "sanitizer: sse2 - fix unaligned double stores" | Johann Koenig
2018-09-25 | sanitizer: sse2 - fix unaligned double stores | Matthias Räncker
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de> Change-Id: I838c8678e62f7cff13387b84d4f3ea42710a67ea
2018-09-23 | segfault: fix missing alignment declaration | Matthias Räncker
These variables are being fed to sse2 functions that use aligned loads. Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de> Change-Id: I796c3483c6f3425d63d9262b02b19da59d536600
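A minimal sketch of what an alignment declaration guarantees, assuming a C11 compiler. It is not the code touched by the commit above; the names `is_aligned_16` and `demo_aligned_buffer` are hypothetical and only illustrate why a buffer passed to aligned SSE2 loads needs an explicit 16-byte alignment.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p sits on a 16-byte boundary. */
static int is_aligned_16(const void *p) {
  return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical demo: a local buffer declared with 16-byte alignment.
 * Without alignas(16), an int16_t array is only guaranteed 2-byte
 * alignment, which is not enough for aligned 128-bit loads. */
static int demo_aligned_buffer(void) {
  alignas(16) int16_t coeffs[64] = { 0 };
  assert(is_aligned_16(coeffs));
  return is_aligned_16(coeffs);
}
```

An aligned 128-bit load from an address that fails this check can fault, which matches the segfault named in the commit title.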
2018-09-21 | sanitizer: fix unaligned loads | Matthias Räncker
Another instance of unaligned 4-byte loads. Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de> Change-Id: I06afc5405bb074384eec7a8c8123e5803e522937
2018-09-20 | sanitizer: fix unaligned load/stores | Matthias Räncker
When built with -fsanitize=address,undefined, a number of tests, such as ByteAlignmentTest.SwitchByteAlignment, produce runtime errors about unaligned 4-byte loads/stores. While normally not really a problem, this does technically violate the language standard, and it is easy to fix in a standard-conforming way using memcpy, which does not produce inferior code. Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de> Change-Id: Ie1e97ab25fe874f864df48b473569f00563181ae
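The memcpy approach described above can be sketched as follows. The helper names `load_u32_unaligned` and `store_u32_unaligned` are hypothetical and do not come from libvpx; the sketch only shows the general pattern of replacing a cast-and-dereference with a fixed-size memcpy.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 4 bytes from a possibly unaligned address.
 * Dereferencing (const uint32_t *)src would be undefined behavior when
 * src is not suitably aligned; memcpy has no such requirement. */
static uint32_t load_u32_unaligned(const void *src) {
  uint32_t v;
  assert(src != NULL);
  memcpy(&v, src, sizeof(v));
  return v;
}

/* Hypothetical helper: write 4 bytes to a possibly unaligned address. */
static void store_u32_unaligned(void *dst, uint32_t v) {
  assert(dst != NULL);
  memcpy(dst, &v, sizeof(v));
}
```

Optimizing compilers typically lower a fixed-size memcpy like this to a single load or store, which is consistent with the commit's remark that the fix does not produce inferior code.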
2018-09-18 | Fix stack corruption with x86 and --enable-pic | Matthias Räncker
x86inc.asm's cglobal macro is frequently used to declare more arguments than the function actually has. Normally, this is done to acquire an alias to a register that would correspond to that positional function argument if it existed. This is safe when used in this manner. In the case fixed here, however, the alias is used to temporarily store addresses obtained through the GOT in memory. Because those extra arguments don't actually exist, those stores corrupt the caller's stack frame. SSE2/VpxHBDSubpelVarianceTest.Ref is a test that may fail as a result. As a simple fix, the space allocated to actual arguments that have already been loaded into registers is reused. This avoids having to allocate extra space for local variables. Also removed duplicate code while at it. Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de> Change-Id: I505281ecaa6be586185fe6a2d34d62bdf40c839f
2018-09-15 | cosmetics: normalize include guards | James Zern
use the recommended format [1] of: <PROJECT>_<PATH>_<FILE>_H_ [1] https://google.github.io/styleguide/cppguide.html#The__define_Guard "All header files should have #define guards to prevent multiple inclusion. The format of the symbol name should be <PROJECT>_<PATH>_<FILE>_H_." Change-Id: I2e8ab0b32fb23c30fa43cff5fec12d043c0d2037
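A short sketch of the `<PROJECT>_<PATH>_<FILE>_H_` guard format quoted above. The header name `vpx_dsp/example_clamp.h`, the `VPX` project prefix, and the function `example_clamp` are all assumptions made for illustration; none of them is taken from the commit.

```c
/* Hypothetical header: vpx_dsp/example_clamp.h (illustrative only). */
#ifndef VPX_VPX_DSP_EXAMPLE_CLAMP_H_
#define VPX_VPX_DSP_EXAMPLE_CLAMP_H_

#include <assert.h>

/* Hypothetical function: clamp value into the range [low, high]. */
static inline int example_clamp(int value, int low, int high) {
  assert(low <= high);
  return value < low ? low : (value > high ? high : value);
}

#endif /* VPX_VPX_DSP_EXAMPLE_CLAMP_H_ */
```

Because the guard symbol is derived from the full path, two headers with the same file name in different directories do not collide.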
2018-08-07 | Merge "VPX: Improve HBD vpx_hadamard_32x32_sse2()" | Scott LaVarnway
2018-08-07 | vpx_highbd_d153_predictor_4x4_sse2: reduce load size | James Zern
this avoids reading 4 pixels into another block, which may be operated on by a different thread. quiets a tsan warning. Change-Id: Id27ad9d61819b0e5de0230647b4b510f7c265a71
2018-07-28 | arm: Consistently use unified syntax for asm | Martin Storsjo
The ".syntax unified" directives in a few source files aren't valid ADS assembly directives, and they break compilation for windows, since ads2armasm_ms.pl doesn't handle them. Explicitly add them via ads2gas.pl and ads2gas_apple.pl instead, and tweak one instruction to be valid unified syntax. Change-Id: I37f1709f163d11474597161fe02eb433859cb9b8
2018-07-26 | Add New Neon Assemblies for Motion Compensation | Venkatarama NG. Avadhani
Commit adds neon assemblies for motion compensation which show an improvement over the existing neon code.
Performance Improvement -
Platform    Resolution  1 Thread  4 Threads
Nexus 6     720p        12.16%    7.21%
@2.65 GHz   1080p       18.00%    15.28%
Change-Id: Ic0b0412eeb01c8317642b20bb99092c2f5baba37
2018-07-26 | Merge "vp9: fix OOB read in decoder_peek_si_internal" | James Zern
2018-07-25 | vp9: fix OOB read in decoder_peek_si_internal | James Zern
Profile 1 or 3 bitstreams may require 11 bytes for the header in the intra-only case. Additionally add a check on the bit reader's error handler callback to ensure it's non-NULL before calling to avoid future regressions. This has existed since at least (pre-1.4.0): 09bf1d61c Changes hdr for profiles > 1 for intraonly frames BUG=webm:1543 Change-Id: I23901e6e3a219170e8ea9efecc42af0be2e5c378
2018-07-25 | VPX: Improve HBD vpx_hadamard_32x32_sse2() | Scott LaVarnway
BUG=webm:1546 Change-Id: I48224f047547b666c519e0cc23706dd0bda5df20
2018-07-24 | VPX: avg_intrin_sse2.c, avg_intrin_avx2.c cleanup | Scott LaVarnway
Change-Id: I710b66dc571a6bd38fbcc2528486d5e028a68b37
2018-07-24 | VPX: Improve HBD vpx_hadamard_32x32_avx2() | Scott LaVarnway
~14% improvement. BUG=webm:1546 Change-Id: I0b25f62f053e13c2185e4e8bd54e52250251efd0
2018-07-23 | VPX: Add vpx_hadamard_32x32_avx2 | Scott LaVarnway
BUG=webm:1546 Change-Id: I64629ed83cb7acd0f2ac49b9c31f369d17a1aed2
2018-07-22 | Merge "VPX: Add vpx_hadamard_32x32_sse2" | Scott LaVarnway
2018-07-22 | Merge "VPX: Improve HBD vpx_hadamard_16x16_sse2()" | Scott LaVarnway
2018-07-21 | VPX: Add vpx_hadamard_32x32_sse2 | Scott LaVarnway
BUG=webm:1546 Change-Id: Ide5828b890c5c27cfcca2d5e318a914f7cde1158
2018-07-20 | VPX: Call vpx_hadamard_16x16_c() in vpx_hadamard_32x32_c() | Scott LaVarnway
instead of vpx_hadamard_16x16(). Change-Id: Ie16aacad39d7f429e282dd4c93e57c07000d0f29
2018-07-20 | VPX: Improve HBD vpx_hadamard_16x16_sse2() | Scott LaVarnway
~12% improvement. Change-Id: Ieca4d870a4c1c5ea2c689e27fc4550fcbab9f867
2018-07-17 | vpx_sum_squares_2d_i16_neon(): Make |s2| a uint64x1_t. | Raphael Kubo da Costa
This fixes the build with at least GCC 7.3, where it was previously failing with:
sum_squares_neon.c: In function 'vpx_sum_squares_2d_i16_neon':
sum_squares_neon.c: note: use -flax-vector-conversions to permit conversions between vectors with differing element types or numbers of subparts
  s2 = vpaddl_u32(s1);
  ^~
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
  s2 = vpaddl_u32(s1);
  ^
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
  s2 = vadd_u64(vget_low_u64(s1), vget_high_u64(s1));
  ^
sum_squares_neon.c: incompatible type for argument 1 of 'vget_lane_u64'
  return vget_lane_u64(s2, 0);
  ^~
The generated assembly was verified to remain identical with both GCC and LLVM. Bug: chromium:819249 Change-Id: I2778428ee1fee0a674d0d4910347c2a717de21ac
2018-07-11 | Add 32x32 Hadamard transform | Jingning Han
Add a C implementation of the 32x32 Hadamard transform. Replace the forward 32x32 2D-DCT in the tpl model with the Hadamard transform. This would reduce the encoding-time overhead of running the tpl model by ~3x. Change-Id: I1c743dab786b818d89f14928cc3998d056830aa9
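For readers unfamiliar with the transform, here is a textbook fast Walsh-Hadamard butterfly in one dimension. It is a generic sketch, not the vpx_hadamard_32x32_c implementation from this commit, which may differ in coefficient ordering, scaling, and data types; the function name `fwht` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: in-place, unnormalized fast Walsh-Hadamard
 * transform of n values, where n must be a power of two. Each pass
 * combines pairs that are len apart using only adds and subtracts. */
static void fwht(int32_t *v, int n) {
  assert(n > 0 && (n & (n - 1)) == 0);
  for (int len = 1; len < n; len <<= 1) {
    for (int i = 0; i < n; i += len << 1) {
      for (int j = i; j < i + len; ++j) {
        const int32_t a = v[j];
        const int32_t b = v[j + len];
        v[j] = a + b;
        v[j + len] = a - b;
      }
    }
  }
}
```

A 2D transform of a square block can be built by applying such a pass to every row and then to every column. The transform needs no multiplications, which is one plausible reason it is cheaper than a forward DCT.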
2018-07-08 | [VSX] Add support to Power9-only vec_absd | Luca Barbato
~5% gain for SAD. Change-Id: Ief7d7691f837474f5b6b582129628276fdcce319
2018-06-27 | Merge "[VSX] Drop the clang-4 workaround for vec_xxpermdi" | Luca Barbato
2018-06-27 | [VSX] Replace vec_pack and vec_perm with single vec_perm | Luc Trudeau
vpx_quantize_b: VP9QuantizeTest Speed Test (POWER8 Model 2.1) 32x32
Old VSX time = 8.1 ms, new VSX time = 7.9 ms
vp9_quantize_fp: VP9QuantizeTest Speed Test (POWER8 Model 2.1) 32x32
Old VSX time = 6.5 ms, new VSX time = 6.2 ms
Change-Id: Ic2183e8bd721bb69eaeb4865b542b656255a0870
2018-06-27 | VSX Version of fdct32x32_rd | Luc Trudeau
Low bit depth version only. Passes the Trans32x32Test test suite. Trans32x32Test Speed Test (POWER9 Model 2.2) 32x32 C time = 212.7 ms (±0.1 ms), VSX time = 82.3 ms (±0.0 ms) [2.6x] Change-Id: If906ec9b56ce3818cae0cc462c7277284ab29859
2018-06-25 | Merge "Add vpx_highbd_avg_8x8, vpx_highbd_avg_4x4" | Scott LaVarnway
2018-06-22 | Add vpx_highbd_avg_8x8, vpx_highbd_avg_4x4 | Scott LaVarnway
BUG=webm:1537 Change-Id: I5f216f35436189b67d9f350991f41ed31431d4fe
2018-06-22 | Merge changes I51e7ed32,I99a9535b,Id584d8f6 | Luca Barbato
* changes:
  ppc: add vp9_iht16x16_256_add_vsx
  ppc: add vp9_iht8x8_64_add_vsx
  ppc: add vp9_iht4x4_16_add_vsx