Age | Commit message | Author |
|
|
|
|
|
Change-Id: Ibec078c80ca1dfe6fbbc4288db89d719dac453a7
|
|
Always use src/ref and _ptr/_stride suffixes.
Normalize to [xy]_offset and second_pred.
Drop some stray source/recon_strides.
BUG=webm:1444
Change-Id: I32362a50988eb84464ab78686348610ea40e5c80
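As an illustration of the naming convention described above, here is a minimal sketch of a block-difference function whose parameters use the src/ref prefixes and the _ptr/_stride suffixes. The function name example_sad and its width/height parameters are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: sum of absolute differences between a source block
 * and a reference block. Parameter names follow the src/ref and
 * _ptr/_stride convention; the function itself is only an illustration. */
unsigned int example_sad(const uint8_t *src_ptr, int src_stride,
                         const uint8_t *ref_ptr, int ref_stride, int width,
                         int height) {
  unsigned int sad = 0;
  assert(src_ptr != NULL && ref_ptr != NULL);
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      sad += (unsigned int)abs((int)src_ptr[x] - (int)ref_ptr[x]);
    }
    /* Advance each pointer by its own stride to reach the next row. */
    src_ptr += src_stride;
    ref_ptr += ref_stride;
  }
  return sad;
}
```

Keeping the prefix (which buffer) and the suffix (pointer or stride) consistent makes it harder to pass a stride that belongs to the other buffer.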
|
|
Unit test performance on bitdepth 10:
____| 4X4 | 8X8 |16X16|64X64|
2D  |1.582|1.461|1.425|1.572|
HORZ|1.643|1.247|1.346|1.345|
VERT|1.378|1.695|2.020|1.763|
Unit test performance on bitdepth 12:
____| 4X4 | 8X8 |16X16|64X64|
2D  |1.578|1.409|1.426|1.497|
HORZ|1.625|1.153|1.323|1.259|
VERT|1.392|1.707|2.030|1.787|
Change-Id: I6df85330ac33fcb17d46e4302b41415dda1219f5
|
|
About 10% faster on 64-bit but about 10% slower on 32-bit.
Removes the assembly usage of vpx_rv.
Change-Id: I214698fb5677f615dee0a8f5f5bb8f64daf2565e
|
|
Speed gain:
BIT DEPTH | 8TAP FPS | 4TAP FPS | PCT INC |
10        | 1.69     | 1.85     | 9.46%   |
12        | 1.64     | 1.78     | 8.54%   |
Speed test is done on jet.y4m at speed 1, profile 2, over 100 frames with
br=500.
Change-Id: I411e122553e2c466be7a26e64b4dd144efb884a9
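The PCT INC column above can be reproduced from the two FPS columns. A minimal sketch of that arithmetic follows; the function name pct_increase is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: percentage increase from old_value to new_value. */
double pct_increase(double old_value, double new_value) {
  assert(old_value > 0.0);
  return (new_value - old_value) / old_value * 100.0;
}
```

For example, pct_increase(1.69, 1.85) evaluates to roughly 9.47 and pct_increase(1.64, 1.78) to roughly 8.54, in line with the table.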
|
|
~20% faster than the MMX version. Removes the last usage of
vp8_bilinear_filters_x86_[48].
Change-Id: Iee976fab9655d0020440f26c4403ce50103af913
|
|
|
|
Performance:
_____| 4X4 | 8X8 |16X16|64X64|
2 DIM|1.491|1.902|1.772|1.479|
HORZ |1.145|1.521|1.757|1.497|
VERT |1.176|1.614|1.707|1.467|
Each number in the chart above is 8-tap function time / 4-tap function time.
The frame rate measured on jets.y4m for 100 frames at speed 1 increased from
3.72 fps to 3.91 fps (about a 5% increase).
Change-Id: Ic0ad275cf32fafeefd0a89811badd8adff2134a0
|
|
Remove unnecessary includes and reword some functions/comments.
Change-Id: Ied557d7faa9d845d38255e6e3e0e3fe1395276e1
|
|
The AVX2 8-tap filter is slightly faster than the SSSE3 4-tap filter.
Change-Id: I5fc37c431670780108706b206b32c791828555c9
|
|
Performance:
_____| 4X4 | 8X8 |16X16|64X64|
2 DIM|1.526|1.827|1.844|1.906|
HORZ |1.336|1.795|1.886|1.654|
VERT |1.443|1.539|2.139|2.190|
The ratio is SSSE3 8-tap time / SSSE3 4-tap time.
Change-Id: I01ed2ab494428256e918875774a459afecc5ec6a
|
|
Performance:
The chart below shows the speed relative to baseline
(baseline_time/new_time)
_____| 4X4 | 8X8 |16X16|64X64|
2 DIM|1.889|1.780|1.811|1.963|
HORZ |2.266|1.834|1.617|1.595|
VERT |2.043|2.190|2.373|2.485|
Change-Id: Ic4262222db78f013b94a8c61b46efb8520722927
|
|
Some repeated code is refactored into inline functions. No performance
degradation is observed. These inline functions can be used for width 8
and width 4.
Change-Id: Ibf08cc9ebd2dd47bd2a6c2bcc1616f9d4c252d4d
|
|
Horizontal filter on 64x64 block: 1.59 times as fast as baseline.
Vertical filter on 64x64 block: 2.5 times as fast as baseline.
2D filter on 64x64 block: 1.96 times as fast as baseline.
Change-Id: I12e46679f3108616d5b3475319dd38b514c6cb3c
|
|
The interp filter tap calculation could not reliably tell 2-tap
filters from 4-tap filters. This patch fixes the bug and resolves the
Jenkins test failures in the MIPS sub-pel filter optimizations.
BUG=webm:1568
Change-Id: I51eb8adb7ed194ef2ea7dd4aa57aa9870ee38cfc
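One way to tell the tap counts apart is sketched below. It assumes, and this is an assumption rather than something stated in the commit, that every kernel is stored as 8 coefficients with the unused outer taps set to zero. The function name example_filter_taps is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: effective tap count of a kernel stored as 8
 * coefficients, found by checking the outermost non-zero pair. */
int example_filter_taps(const int16_t *filter) {
  assert(filter != NULL);
  if (filter[0] | filter[7]) return 8; /* outermost pair in use */
  if (filter[1] | filter[6]) return 6;
  if (filter[2] | filter[5]) return 4; /* this pair separates 4 from 2 */
  return 2;                            /* only the two center taps */
}
```

Checking the pairs from the outside in means a bilinear kernel such as {0, 0, 0, 64, 64, 0, 0, 0} is reported as 2 taps, while a kernel with non-zero coefficients at positions 2 and 5 is reported as 4 taps.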
|
|
Change-Id: I7a3314a268cf6049a7260361043e76d4561085c6
|
|
Change-Id: I83c7e64fe70f7c49aa2492ed2d640c6756b7ebaa
|
|
|
|
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: I838c8678e62f7cff13387b84d4f3ea42710a67ea
|
|
These variables are fed to SSE2 functions that use aligned
loads.
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: I796c3483c6f3425d63d9262b02b19da59d536600
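A minimal sketch of the general idea, using standard C11 alignas rather than whatever macro the patch itself uses: a buffer that will be read with 16-byte aligned loads is declared with an explicit 16-byte alignment. The names example_coeffs and example_is_aligned are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer that is meant to be read with 16-byte aligned loads.
 * Without alignas, only the natural alignment of int16_t is guaranteed. */
static alignas(16) int16_t example_coeffs[8] = { 0, 0, 0, 64, 64, 0, 0, 0 };

/* Hypothetical helper: returns non-zero if p is aligned to `alignment`,
 * which must be a power of two. */
int example_is_aligned(const void *p, size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

A check such as example_is_aligned(example_coeffs, 16) can be placed in a debug assertion before handing the pointer to code that uses aligned loads.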
|
|
Another instance of unaligned 4-byte loads.
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: I06afc5405bb074384eec7a8c8123e5803e522937
|
|
When built with -fsanitize=address,undefined, a number of tests,
such as ByteAlignmentTest.SwitchByteAlignment, produce runtime errors
about unaligned 4-byte loads/stores. While normally not really a
problem, this does technically violate the language standard, and it is
easy to fix in a standard-conforming way using memcpy, which does not
produce inferior code.
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: Ie1e97ab25fe874f864df48b473569f00563181ae
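The memcpy approach described above can be sketched as follows. The helper names are hypothetical and are not taken from the patch. Compilers typically turn a fixed-size 4-byte memcpy into a single load or store.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 4 bytes from a possibly unaligned address.
 * Going through memcpy avoids dereferencing a misaligned uint32_t pointer. */
uint32_t example_load_u32(const void *src) {
  uint32_t value;
  assert(src != NULL);
  memcpy(&value, src, sizeof(value));
  return value;
}

/* Hypothetical helper: write 4 bytes to a possibly unaligned address. */
void example_store_u32(void *dst, uint32_t value) {
  assert(dst != NULL);
  memcpy(dst, &value, sizeof(value));
}
```

By contrast, a cast such as *(const uint32_t *)src on a misaligned address is the pattern that the undefined-behavior sanitizer reports.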
|
|
x86inc.asm's cglobal macro is frequently used to declare more
arguments than the function actually has. Normally, this is
done to acquire an alias to a register that would correspond to
that positional function argument if it existed. This is safe
when used in this manner.
In the case fixed here, however, the alias is used to temporarily
store addresses obtained through the GOT in memory. Because those
extra arguments don't actually exist, those stores corrupt the
caller's stack frame.
SSE2/VpxHBDSubpelVarianceTest.Ref is a test that may fail as a
result.
As a simple fix, the space allocated to actual arguments that have
already been loaded into registers is reused.
This avoids having to allocate extra space for local variables.
Also removed duplicate code while at it.
Signed-off-by: Matthias Räncker <theonetruecamper@gmx.de>
Change-Id: I505281ecaa6be586185fe6a2d34d62bdf40c839f
|
|
use the recommended format [1] of:
<PROJECT>_<PATH>_<FILE>_H_
[1] https://google.github.io/styleguide/cppguide.html#The__define_Guard
"All header files should have #define guards to prevent multiple
inclusion. The format of the symbol name should be
<PROJECT>_<PATH>_<FILE>_H_."
Change-Id: I2e8ab0b32fb23c30fa43cff5fec12d043c0d2037
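A minimal sketch of a guard in the <PROJECT>_<PATH>_<FILE>_H_ format, for a hypothetical header at util/math/clamp.h in a project called "example". The project, path, and function are invented for illustration.

```c
/* Hypothetical header: util/math/clamp.h in project "example". */
#ifndef EXAMPLE_UTIL_MATH_CLAMP_H_
#define EXAMPLE_UTIL_MATH_CLAMP_H_

#include <assert.h>

/* Clamp v to the inclusive range [lo, hi]. */
static inline int example_clamp(int v, int lo, int hi) {
  assert(lo <= hi);
  return v < lo ? lo : (v > hi ? hi : v);
}

#endif /* EXAMPLE_UTIL_MATH_CLAMP_H_ */
```

Deriving the symbol from the full path keeps guards unique even when two directories contain headers with the same file name.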
|
|
|
|
this avoids reading 4 pixels into another block, which may be operated
on by a different thread. quiets a tsan warning.
Change-Id: Id27ad9d61819b0e5de0230647b4b510f7c265a71
|
|
The ".syntax unified" directives in a few source files aren't valid
ADS assembly directives, and they break compilation for Windows,
since ads2armasm_ms.pl doesn't handle them.
Explicitly add them via ads2gas.pl and ads2gas_apple.pl instead,
and tweak one instruction to be valid unified syntax.
Change-Id: I37f1709f163d11474597161fe02eb433859cb9b8
|
|
Adds NEON assembly for motion compensation, which shows an improvement
over the existing NEON code.
Performance improvement:
Platform           | Resolution | 1 Thread | 4 Threads |
Nexus 6 @ 2.65 GHz | 720p       | 12.16%   | 7.21%     |
Nexus 6 @ 2.65 GHz | 1080p      | 18.00%   | 15.28%    |
Change-Id: Ic0b0412eeb01c8317642b20bb99092c2f5baba37
|
|
|
|
Profile 1 or 3 bitstreams may require 11 bytes for the header in the
intra-only case.
Additionally add a check on the bit reader's error handler callback to
ensure it's non-NULL before calling to avoid future regressions.
This has existed since at least (pre-1.4.0):
09bf1d61c Changes hdr for profiles > 1 for intraonly frames
BUG=webm:1543
Change-Id: I23901e6e3a219170e8ea9efecc42af0be2e5c378
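The non-NULL check on the error handler can be sketched as below. The struct layout and all names are hypothetical simplifications and are not the library's actual bit reader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*example_error_handler)(void *data);

/* Hypothetical, simplified bit reader. */
typedef struct {
  const uint8_t *buf;
  size_t size;       /* buffer size in bytes */
  size_t bit_offset; /* next bit to read */
  example_error_handler error_handler;
  void *error_handler_data;
} example_bit_reader;

/* Example handler: counts how many times an error was reported. */
void example_count_error(void *data) { ++*(int *)data; }

/* Read one bit, most significant bit first. On reading past the end of
 * the buffer, report the error only if a handler is installed. */
int example_read_bit(example_bit_reader *rb) {
  assert(rb != NULL);
  const size_t byte = rb->bit_offset >> 3;
  const int shift = 7 - (int)(rb->bit_offset & 7);
  if (byte < rb->size) {
    ++rb->bit_offset;
    return (rb->buf[byte] >> shift) & 1;
  }
  if (rb->error_handler != NULL) rb->error_handler(rb->error_handler_data);
  return 0;
}
```

With the check in place, a reader that has no handler installed returns 0 on overrun instead of calling through a NULL function pointer.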
|
|
BUG=webm:1546
Change-Id: I48224f047547b666c519e0cc23706dd0bda5df20
|
|
Change-Id: I710b66dc571a6bd38fbcc2528486d5e028a68b37
|
|
~14% improvement.
BUG=webm:1546
Change-Id: I0b25f62f053e13c2185e4e8bd54e52250251efd0
|
|
BUG=webm:1546
Change-Id: I64629ed83cb7acd0f2ac49b9c31f369d17a1aed2
|
|
|
|
|
|
BUG=webm:1546
Change-Id: Ide5828b890c5c27cfcca2d5e318a914f7cde1158
|
|
instead of vpx_hadamard_16x16().
Change-Id: Ie16aacad39d7f429e282dd4c93e57c07000d0f29
|
|
~12% improvement.
Change-Id: Ieca4d870a4c1c5ea2c689e27fc4550fcbab9f867
|
|
This fixes the build with at least GCC 7.3, where it was previously failing
with:
sum_squares_neon.c: In function 'vpx_sum_squares_2d_i16_neon':
sum_squares_neon.c: note: use -flax-vector-conversions to permit conversions between vectors with differing element types or numbers of subparts
s2 = vpaddl_u32(s1);
^~
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
s2 = vpaddl_u32(s1);
^
sum_squares_neon.c: incompatible types when assigning to type 'int64x1_t' from type 'uint64x1_t'
s2 = vadd_u64(vget_low_u64(s1), vget_high_u64(s1));
^
sum_squares_neon.c: incompatible type for argument 1 of 'vget_lane_u64'
return vget_lane_u64(s2, 0);
^~
The generated assembly was verified to remain identical with both GCC and
LLVM.
Bug: chromium:819249
Change-Id: I2778428ee1fee0a674d0d4910347c2a717de21ac
|
|
Add a C implementation of the 32x32 Hadamard transform. Replace the
forward 32x32 2D-DCT in the tpl model with the Hadamard transform. This
would reduce the encoding-time overhead of running the tpl model by ~3x.
Change-Id: I1c743dab786b818d89f14928cc3998d056830aa9
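For readers unfamiliar with the transform, here is a generic sketch of an unnormalized 2-D Hadamard transform built from add/subtract butterflies. It is not the implementation added by this commit: the function names are hypothetical, and the coefficient ordering and scaling used in the codec may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* In-place 1-D fast Walsh-Hadamard transform over n elements spaced
 * `stride` apart. n must be a power of two. Uses only adds and subtracts. */
static void example_fwht_1d(int32_t *a, size_t n, size_t stride) {
  for (size_t h = 1; h < n; h *= 2) {
    for (size_t i = 0; i < n; i += 2 * h) {
      for (size_t j = i; j < i + h; ++j) {
        const int32_t x = a[j * stride];
        const int32_t y = a[(j + h) * stride];
        a[j * stride] = x + y;
        a[(j + h) * stride] = x - y;
      }
    }
  }
}

/* In-place 2-D transform of an n x n block stored row by row:
 * transform every row, then every column. */
void example_hadamard_2d(int32_t *block, size_t n) {
  assert(block != NULL && n != 0 && (n & (n - 1)) == 0);
  for (size_t r = 0; r < n; ++r) example_fwht_1d(block + r * n, n, 1);
  for (size_t c = 0; c < n; ++c) example_fwht_1d(block + c, n, n);
}
```

Because each butterfly needs no multiplications, a transform of this kind is cheaper than a DCT of the same size, which is consistent with the motivation given above.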
|
|
~5% gain for SAD.
Change-Id: Ief7d7691f837474f5b6b582129628276fdcce319
|
|
|
|
vpx_quantize_b:
VP9QuantizeTest Speed Test (POWER8 Model 2.1)
32x32 Old VSX time = 8.1 ms, new VSX time = 7.9 ms
vp9_quantize_fp:
VP9QuantizeTest Speed Test (POWER8 Model 2.1)
32x32 Old VSX time = 6.5 ms, new VSX time = 6.2 ms
Change-Id: Ic2183e8bd721bb69eaeb4865b542b656255a0870
|
|
Low bit depth version only. Passes the Trans32x32Test test suite.
Trans32x32Test Speed Test (POWER9 Model 2.2)
32x32 C time = 212.7 ms (±0.1 ms), VSX time = 82.3 ms (±0.0 ms) [2.6x]
Change-Id: If906ec9b56ce3818cae0cc462c7277284ab29859
|
|
|
|
BUG=webm:1537
Change-Id: I5f216f35436189b67d9f350991f41ed31431d4fe
|
|
* changes:
ppc: add vp9_iht16x16_256_add_vsx
ppc: add vp9_iht8x8_64_add_vsx
ppc: add vp9_iht4x4_16_add_vsx
|