path: root/vpx_dsp
Age | Commit message | Author
2018-09-18 | Fix stack corruption with x86 and --enable-pic | Matthias Räncker
x86inc.asm's cglobal macro is frequently used to declare more arguments than the function actually has. Normally, this is done to acquire an alias to a register that would correspond to that positional function argument if it existed. This is safe when used in this manner. In the case fixed here, however, the alias is used to temporarily store addresses obtained through the GOT in memory. Because those extra arguments don't actually exist, those stores corrupt the caller's stack frame. SSE2/VpxHBDSubpelVarianceTest.Ref is a test that may fail as a result. As a simple fix, the space allocated to actual arguments that have already been loaded into registers is reused. This avoids having to allocate extra space for local variables. Also removed duplicate code while at it. Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de> Change-Id: I505281ecaa6be586185fe6a2d34d62bdf40c839f
2018-09-15 | cosmetics: normalize include guards | James Zern
use the recommended format [1] of: <PROJECT>_<PATH>_<FILE>_H_ [1] https://google.github.io/styleguide/cppguide.html#The__define_Guard "All header files should have #define guards to prevent multiple inclusion. The format of the symbol name should be <PROJECT>_<PATH>_<FILE>_H_." Change-Id: I2e8ab0b32fb23c30fa43cff5fec12d043c0d2037
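As an illustration of the guard format quoted above, a header at the hypothetical path vpx_dsp/example.h in a project named VPX could be guarded as follows. The file name and the function it declares are assumptions made for this sketch and are not taken from the commit.

```c
/* Hypothetical header vpx_dsp/example.h.
 * The guard follows <PROJECT>_<PATH>_<FILE>_H_ with PROJECT = VPX. */
#ifndef VPX_VPX_DSP_EXAMPLE_H_
#define VPX_VPX_DSP_EXAMPLE_H_

/* Illustrative helper only; not a real libvpx function. */
static inline int example_add(int a, int b) { return a + b; }

#endif  /* VPX_VPX_DSP_EXAMPLE_H_ */
```

Including such a header twice in one translation unit is harmless, because the second inclusion sees the macro already defined and skips the body.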
2018-08-07 | Merge "VPX: Improve HBD vpx_hadamard_32x32_sse2()" | Scott LaVarnway
2018-08-07 | vpx_highbd_d153_predictor_4x4_sse2: reduce load size | James Zern
this avoids reading 4 pixels into another block, which may be operated on by a different thread. quiets a tsan warning. Change-Id: Id27ad9d61819b0e5de0230647b4b510f7c265a71
2018-07-28 | arm: Consistently use unified syntax for asm | Martin Storsjo
The ".syntax unified" directives in a few source files aren't valid ADS assembly directives, and they break compilation for windows, since ads2armasm_ms.pl doesn't handle them. Explicitly add them via ads2gas.pl and ads2gas_apple.pl instead, and tweak one instruction to be valid unified syntax. Change-Id: I37f1709f163d11474597161fe02eb433859cb9b8
2018-07-26 | Add New Neon Assemblies for Motion Compensation | Venkatarama NG. Avadhani
Commit adds neon assemblies for motion compensation which show an improvement over the existing neon code.
Performance Improvement -
Platform     Resolution   1 Thread   4 Threads
Nexus 6      720p         12.16%     7.21%
@2.65 GHz    1080p        18.00%     15.28%
Change-Id: Ic0b0412eeb01c8317642b20bb99092c2f5baba37
2018-07-26 | Merge "vp9: fix OOB read in decoder_peek_si_internal" | James Zern
2018-07-25 | vp9: fix OOB read in decoder_peek_si_internal | James Zern
Profile 1 or 3 bitstreams may require 11 bytes for the header in the intra-only case. Additionally add a check on the bit reader's error handler callback to ensure it's non-NULL before calling to avoid future regressions. This has existed since at least (pre-1.4.0): 09bf1d61c Changes hdr for profiles > 1 for intraonly frames BUG=webm:1543 Change-Id: I23901e6e3a219170e8ea9efecc42af0be2e5c378
2018-07-25 | VPX: Improve HBD vpx_hadamard_32x32_sse2() | Scott LaVarnway
BUG=webm:1546 Change-Id: I48224f047547b666c519e0cc23706dd0bda5df20
2018-07-24 | VPX: avg_intrin_sse2.c, avg_intrin_avx2.c cleanup | Scott LaVarnway
Change-Id: I710b66dc571a6bd38fbcc2528486d5e028a68b37
2018-07-24 | VPX: Improve HBD vpx_hadamard_32x32_avx2() | Scott LaVarnway
~14% improvement. BUG=webm:1546 Change-Id: I0b25f62f053e13c2185e4e8bd54e52250251efd0
2018-07-23 | VPX: Add vpx_hadamard_32x32_avx2 | Scott LaVarnway
BUG=webm:1546 Change-Id: I64629ed83cb7acd0f2ac49b9c31f369d17a1aed2
2018-07-22 | Merge "VPX: Add vpx_hadamard_32x32_sse2" | Scott LaVarnway
2018-07-22 | Merge "VPX: Improve HBD vpx_hadamard_16x16_sse2()" | Scott LaVarnway
2018-07-21 | VPX: Add vpx_hadamard_32x32_sse2 | Scott LaVarnway
BUG=webm:1546 Change-Id: Ide5828b890c5c27cfcca2d5e318a914f7cde1158
2018-07-20 | VPX: Call vpx_hadamard_16x16_c() in vpx_hadamard_32x32_c() | Scott LaVarnway
instead of vpx_hadamard_16x16(). Change-Id: Ie16aacad39d7f429e282dd4c93e57c07000d0f29
2018-07-20 | VPX: Improve HBD vpx_hadamard_16x16_sse2() | Scott LaVarnway
~12% improvement. Change-Id: Ieca4d870a4c1c5ea2c689e27fc4550fcbab9f867
2018-07-17 | vpx_sum_squares_2d_i16_neon(): Make |s2| a uint64x1_t. | Raphael Kubo da Costa
This fixes the build with at least GCC 7.3, where it was previously failing with:
sum_squares_neon.c: In function 'vpx_sum_squares_2d_i16_neon':
sum_squares_neon.c: note: use -flax-vector-conversions to permit conversions between vectors with differing element types or numbers of subparts
    s2 = vpaddl_u32(s1);
    ^~
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
    s2 = vpaddl_u32(s1);
    ^
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
    s2 = vadd_u64(vget_low_u64(s1), vget_high_u64(s1));
    ^
sum_squares_neon.c: incompatible type for argument 1 of 'vget_lane_u64'
    return vget_lane_u64(s2, 0);
    ^~
The generated assembly was verified to remain identical with both GCC and LLVM. Bug: chromium:819249 Change-Id: I2778428ee1fee0a674d0d4910347c2a717de21ac
2018-07-11 | Add 32x32 Hadamard transform | Jingning Han
Add a C implementation of the 32x32 Hadamard transform. Replace the forward 32x32 2D-DCT in the tpl model with the Hadamard transform. This reduces the encoding-time overhead of running the tpl model by ~3x. Change-Id: I1c743dab786b818d89f14928cc3998d056830aa9
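For readers unfamiliar with the transform, the following is a minimal sketch of a generic, unnormalized fast Walsh-Hadamard transform applied to rows and then columns of a square block. It is not the libvpx code from this commit: the names fwht_1d, fwht_2d and FWHT_MAX_N are hypothetical, the output is in natural (Hadamard) order, and the scaling and coefficient ordering of the real vpx_hadamard_32x32 may differ.

```c
#include <stddef.h>
#include <stdint.h>

#define FWHT_MAX_N 32 /* largest block side this sketch supports (assumption) */

/* In-place, unnormalized fast Walsh-Hadamard transform of n values.
 * n must be a power of two. Uses only additions and subtractions. */
static void fwht_1d(int32_t *v, size_t n) {
  for (size_t half = 1; half < n; half <<= 1) {
    for (size_t i = 0; i < n; i += half << 1) {
      for (size_t j = i; j < i + half; ++j) {
        const int32_t a = v[j];
        const int32_t b = v[j + half];
        v[j] = a + b;        /* butterfly: sum */
        v[j + half] = a - b; /* butterfly: difference */
      }
    }
  }
}

/* 2-D transform of an n x n row-major block (n <= FWHT_MAX_N):
 * transform every row, then every column. */
static void fwht_2d(int32_t *block, size_t n) {
  int32_t col[FWHT_MAX_N];
  for (size_t r = 0; r < n; ++r) fwht_1d(block + r * n, n);
  for (size_t c = 0; c < n; ++c) {
    for (size_t r = 0; r < n; ++r) col[r] = block[r * n + c];
    fwht_1d(col, n);
    for (size_t r = 0; r < n; ++r) block[r * n + c] = col[r];
  }
}
```

Because the butterflies need no multiplications, a transform of this kind is cheaper than a DCT of the same size, which is consistent with the motivation stated in the commit message.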
2018-07-08 | [VSX] Add support to Power9-only vec_absd | Luca Barbato
~5% gain for SAD. Change-Id: Ief7d7691f837474f5b6b582129628276fdcce319
2018-06-27 | Merge "[VSX] Drop the clang-4 workaround for vec_xxpermdi" | Luca Barbato
2018-06-27 | [VSX] Replace vec_pack and vec_perm with single vec_perm | Luc Trudeau
vpx_quantize_b: VP9QuantizeTest Speed Test (POWER8 Model 2.1)
32x32 Old VSX time = 8.1 ms, new VSX time = 7.9 ms
vp9_quantize_fp: VP9QuantizeTest Speed Test (POWER8 Model 2.1)
32x32 Old VSX time = 6.5 ms, new VSX time = 6.2 ms
Change-Id: Ic2183e8bd721bb69eaeb4865b542b656255a0870
2018-06-27 | VSX Version of fdct32x32_rd | Luc Trudeau
Low bit depth version only. Passes the Trans32x32Test test suite. Trans32x32Test Speed Test (POWER9 Model 2.2) 32x32 C time = 212.7 ms (±0.1 ms), VSX time = 82.3 ms (±0.0 ms) [2.6x] Change-Id: If906ec9b56ce3818cae0cc462c7277284ab29859
2018-06-25 | Merge "Add vpx_highbd_avg_8x8, vpx_highbd_avg_4x4" | Scott LaVarnway
2018-06-22 | Add vpx_highbd_avg_8x8, vpx_highbd_avg_4x4 | Scott LaVarnway
BUG=webm:1537 Change-Id: I5f216f35436189b67d9f350991f41ed31431d4fe
2018-06-22 | Merge changes I51e7ed32,I99a9535b,Id584d8f6 | Luca Barbato
* changes:
  ppc: add vp9_iht16x16_256_add_vsx
  ppc: add vp9_iht8x8_64_add_vsx
  ppc: add vp9_iht4x4_16_add_vsx
2018-06-15 | [VSX] Drop the clang-4 workaround for vec_xxpermdi | Luca Barbato
clang-6 seems to support it out of the box. E.g. VP9SubtractBlockTest.DISABLED_Speed
With the workaround: [ BENCH ] 4x4 286.5 ms ( ±0.2 ms )
Without: [ BENCH ] 4x4 215.2 ms ( ±0.9 ms )
Change-Id: I28b3a2cc93c0d72f52f5a48cc06d8ed4ef26913f
2018-06-14 | ppc: add vp9_iht16x16_256_add_vsx | Alexandra Hájková
Change-Id: I51e7ed32d8d87c25ee126e8b4f8fc616d0327584
2018-06-14 | [VSX] Optimize PROCESS16 macro | Luc Trudeau
The PROCESS16 macro now uses 8-bit lanes instead of 16-bit lanes.
SADTest Speed Test (POWER8 Model 2.1)
16x8  Old VSX time = 16.7 ms, new VSX time = 9.1 ms [1.8x]
16x16 Old VSX time = 15.7 ms, new VSX time = 7.9 ms [2.0x]
16x32 Old VSX time = 14.4 ms, new VSX time = 7.2 ms [2.0x]
32x16 Old VSX time = 14.0 ms, new VSX time = 7.4 ms [1.9x]
32x32 Old VSX time = 13.4 ms, new VSX time = 6.5 ms [2.0x]
32x64 Old VSX time = 12.7 ms, new VSX time = 6.3 ms [2.0x]
64x32 Old VSX time = 12.6 ms, new VSX time = 6.3 ms [2.0x]
64x64 Old VSX time = 12.7 ms, new VSX time = 6.2 ms [2.0x]
Change-Id: I51776f0e428162e78edde8eac47f30ffd2379873
2018-06-13 | VSX Version of SAD8xN | Luc Trudeau
VSX versions of the SAD functions of width 8.
SADTest Speed Test (POWER8 Model 2.1)
8x4  C time = 68.7 ms (±0.3 ms), VSX time = 31.8 ms (±0.1 ms) [2.2x]
8x8  C time = 55.6 ms (±0.3 ms), VSX time = 18.3 ms (±0.1 ms) [3.0x]
8x16 C time = 46.5 ms (±0.1 ms), VSX time = 15.6 ms (±0.1 ms) [3.0x]
Change-Id: I843f3b34e103b72deeade4a939193d8b53cee460
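For context, SAD is the sum of absolute differences between a source block and a reference block. The scalar sketch below shows the quantity that the width-8 VSX functions compute in vectorized form; the name sad_ref and its parameter list are assumptions made for this illustration and are not the libvpx signatures.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference: sum of absolute differences over a width x height block.
 * Strides are the distances in bytes between consecutive rows. */
static unsigned int sad_ref(const uint8_t *src, int src_stride,
                            const uint8_t *ref, int ref_stride,
                            int width, int height) {
  unsigned int sad = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      /* uint8_t operands promote to int, so the difference may be negative. */
      sad += (unsigned int)abs(src[x] - ref[x]);
    }
    src += src_stride;
    ref += ref_stride;
  }
  return sad;
}
```

A call such as sad_ref(src, 8, ref, 8, 8, 4) would cover the 8x4 case from the speed test above, assuming tightly packed 8-byte rows.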
2018-06-08 | Implement subtract_block for VSX | Luca Barbato
~2x speedup or better.
[ RUN ] C/VP9SubtractBlockTest.Speed/0
[ BENCH ] 4x4 365.1 ms ( ±2.2 ms )
[ BENCH ] 8x4 258.5 ms ( ±0.3 ms )
[ BENCH ] 4x8 202.7 ms ( ±0.2 ms )
[ BENCH ] 8x8 162.2 ms ( ±0.5 ms )
[ BENCH ] 16x8 138.8 ms ( ±0.3 ms )
[ BENCH ] 8x16 121.5 ms ( ±0.4 ms )
[ BENCH ] 16x16 110.2 ms ( ±0.5 ms )
[ BENCH ] 32x16 104.8 ms ( ±0.1 ms )
[ BENCH ] 16x32 32.7 ms ( ±0.1 ms )
[ BENCH ] 32x32 30.0 ms ( ±0.0 ms )
[ BENCH ] 64x32 28.7 ms ( ±0.0 ms )
[ BENCH ] 32x64 20.1 ms ( ±0.0 ms )
[ BENCH ] 64x64 19.3 ms ( ±0.0 ms )
[ RUN ] VSX/VP9SubtractBlockTest.Speed/0
[ BENCH ] 4x4 155.3 ms ( ±0.9 ms )
[ BENCH ] 8x4 99.3 ms ( ±0.4 ms )
[ BENCH ] 4x8 77.2 ms ( ±0.1 ms )
[ BENCH ] 8x8 45.7 ms ( ±0.0 ms )
[ BENCH ] 16x8 34.1 ms ( ±0.0 ms )
[ BENCH ] 8x16 29.5 ms ( ±0.0 ms )
[ BENCH ] 16x16 19.9 ms ( ±0.0 ms )
[ BENCH ] 32x16 15.1 ms ( ±0.0 ms )
[ BENCH ] 16x32 16.7 ms ( ±0.0 ms )
[ BENCH ] 32x32 14.1 ms ( ±0.0 ms )
[ BENCH ] 64x32 12.6 ms ( ±0.0 ms )
[ BENCH ] 32x64 12.0 ms ( ±0.0 ms )
[ BENCH ] 64x64 11.2 ms ( ±0.0 ms )
Change-Id: I89ce12b6475871dc9e8fde84d0b6fe5c420c28c7
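The operation being benchmarked computes a residual block as source minus prediction. A scalar sketch is given below; the function name subtract_block_ref and the parameter order are assumptions for illustration, and only low bit depth (8-bit) input is covered.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: diff = src - pred over a rows x cols block.
 * Each stride is the element distance between consecutive rows. */
static void subtract_block_ref(int rows, int cols,
                               int16_t *diff, ptrdiff_t diff_stride,
                               const uint8_t *src, ptrdiff_t src_stride,
                               const uint8_t *pred, ptrdiff_t pred_stride) {
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      /* Result lies in [-255, 255], so it always fits in int16_t. */
      diff[c] = (int16_t)(src[c] - pred[c]);
    }
    diff += diff_stride;
    src += src_stride;
    pred += pred_stride;
  }
}
```

The inner loop is a natural fit for SIMD because every output element is independent, which is in line with the roughly 2x speedup reported above.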
2018-06-06 | Merge changes I3ba75c45,I97d26285 | James Zern
* changes:
  force-inline the convolve functions
  Unbreak the force inline directive for gcc
2018-06-05 | force-inline the convolve functions | Luca Barbato
Change-Id: I3ba75c459ed7c9591b7892e9f8f108146c04507d
2018-05-31 | VSX version of vpx_post_proc_down_and_across_mb_row | Luc Trudeau
Low bit depth version only. Passes the VpxPostProcDownAndAcrossMbRowTest. VpxMbPostProcAcrossIpTest Speed Test (POWER8 Model 2.1) C time = 121.3 ms (±4.0 ms), VSX time = 9.4 ms (±0.3 ms) [12.9x] Change-Id: I28300779e197ea3855cf30867d17a2805388b447
2018-05-31 | ppc: add vp9_iht8x8_64_add_vsx | Alexandra Hájková
Change-Id: I99a9535bf1ae58c494113fc88d9616bda202716a
2018-05-31 | ppc: add vp9_iht4x4_16_add_vsx | Alexandra Hájková
Change-Id: Id584d8f65fdda51b8680f41424074b4b0c979622
2018-05-29 | VSX version of vpx_mbpost_proc_ip | Luc Trudeau
Low bit depth version only. Passes the VpxMbPostProcAcrossIpTest. VpxMbPostProcAcrossIpTest Speed Test (POWER8 Model 2.1) C time = 188.5ms (±0.2ms), VSX time = 65.2ms (±0.1ms) [2.9x] Change-Id: I1cf72365d94a9d7f1e9323925a87a30e3bd5cfe2
2018-05-29 | VSX version of vpx_mbpost_proc_down | Luc Trudeau
Low bit depth version only. Passes the VpxMbPostProcDownTest. VpxMbPostProcDownTest Speed Test (POWER8 Model 2.1) Full calculations: C time = 195.4 ms, VSX time = 33.7 ms (5.8x) Change-Id: If1aca7c135de036a1ab7923c0d1e6733bfe27ef7
2018-05-15 | Merge "Add vpx_varianceNxM_vsx and vpx_mseNxM_vsx" | Luca Barbato
2018-05-15 | Add vpx_varianceNxM_vsx and vpx_mseNxM_vsx | Luca Barbato
Speedups:
64x64 5.9
64x32 6.2
32x64 5.8
32x32 6.2
32x16 5.1
16x32 3.3
16x16 2.6
16x8 2.6
8x16 2.4
8x8 2.3
8x4 2.1
4x8 1.6
4x4 1.6
Change-Id: Idfaab96c03d3d1f487301cf398da0dd47a34e887
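As background on what these functions measure, block variance is commonly computed from the sum of differences and the sum of squared differences (SSE) as SSE minus sum squared divided by the pixel count. The scalar sketch below follows that definition; variance_ref and its parameters are hypothetical names for this illustration, and the exact return conventions of the libvpx functions are not restated here.

```c
#include <stdint.h>

/* Scalar reference: writes the sum of squared differences to *sse and
 * returns sse - (sum * sum) / (w * h), using integer division. */
static uint32_t variance_ref(const uint8_t *src, int src_stride,
                             const uint8_t *ref, int ref_stride,
                             int w, int h, uint32_t *sse) {
  int64_t sum = 0; /* signed sum of differences */
  uint64_t sq = 0; /* sum of squared differences */
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const int d = src[x] - ref[x];
      sum += d;
      sq += (uint64_t)(d * d);
    }
    src += src_stride;
    ref += ref_stride;
  }
  *sse = (uint32_t)sq;
  return (uint32_t)(sq - (uint64_t)((sum * sum) / (w * h)));
}
```

For 8-bit input and blocks up to 64x64, the SSE is at most 4096 * 255 * 255, which fits in 32 bits.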
2018-05-14 | VSX version of vpx_quantize_b_32x32_vsx | Luc Trudeau
Low bit depth version only. Passes the VP9QuantizeTest. VP9QuantizeTest Speed Test (POWER8 Model 2.1) Full calculations: C time = 1456 ms, VSX time = 80 ms (18x) Change-Id: I1b1d6d03b1aeff63640efbdeb222cab857ddd95e
2018-05-11 | Merge "Faster VSX vpx_quantize_b" | Luc Trudeau
2018-05-10 | vpx_subtract_block_neon: add explicit cast | James Zern
quiets ptrdiff_t -> int conversion warning Change-Id: If6b545a736fc19e48e290961736b1618df97db3e
2018-05-11 | Merge "ppc: Add vpx_iwht4x4_16_add_vsx" | James Zern
2018-05-11 | Merge "Update vpx_subtract_block_neon()" | James Zern
2018-05-10 | Faster VSX vpx_quantize_b | Luc Trudeau
Process 16 coefficients on the first iteration (a full 4x4) and 24 coefficients on subsequent iterations.
VSX/VP9QuantizeTest.DISABLED_Speed
Before: 4x4 176 ms, 8x8 91 ms, 16x16 72 ms
After: 4x4 152 ms, 8x8 82 ms, 16x16 64 ms
Change-Id: I07cb130833504206ccdc5bc12ae5af369364999a
2018-05-10 | Update vpx_subtract_block_neon() | Linfeng Zhang
Change-Id: Ie2ac06c090c8f92268e9a799e96aa5192a1bdcd2
2018-05-10 | Merge "Update vpx_comp_avg_pred_neon()" | James Zern
2018-05-09 | VSX version of vpx_quantize_b_vsx | Luc Trudeau
Low bit depth version only. Passes the VP9QuantizeTest. Change-Id: I6546f872864bd404a7e353348b0554aab1de5bf0
2018-05-08 | Update vpx_comp_avg_pred_neon() | Linfeng Zhang
Separate width 4 and 8 cases to reduce jumps in loop in clang. Change-Id: I6ffc6f1555f2ad08b72a8dba35a78b9fd5f95a73