Age | Commit message (Collapse) | Author |
|
|
|
* changes:
d153 intra prediction (16x16) ssse3 using bytes
d153 intra prediction ssse3 using bytes
|
|
Renames:
vp9_short_idct8x8_add -> vp9_idct8x8_64_add
vp9_short_idct8x8_1_add -> vp9_idct8x8_1_add
vp9_short_idct8x8_10_add -> vp9_idct8x8_10_add
vp9_idct_add_8x8 -> vp9_idct8x8_add
Change-Id: Ifb8d3a45b4c0397aa805b30463f3d14581bf72c1
|
|
The idea is to have the following names for each transform size:
vp9_idct4x4_add
vp9_idct4x4_1_add
vp9_idct4x4_10_add
vp9_idct4x4_16_add
vp9_idct8x8_add
vp9_idct8x8_1_add
vp9_idct8x8_10_add
vp9_idct8x8_64_add
etc for 16x16, 32x32
The actual list of renames in this patch:
vp9_idct_add_lossless -> vp9_iwht4x4_add
vp9_short_iwalsh4x4_add -> vp9_iwht4x4_16_add
vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add
vp9_idct_add -> vp9_idct4x4_add
vp9_short_idct4x4_add -> vp9_idct4x4_16_add
vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add
Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1
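The naming scheme above can be read as one dispatcher per transform size (vp9_idct8x8_add) in front of variants specialized by how many coefficients may be non-zero (_1, _10, _64). The sketch below illustrates that reading only. The enum, the function name, and the end-of-block thresholds are assumptions for illustration and are not taken from this patch.

```c
/* Hypothetical labels for the three 8x8 variants named in the commit. */
typedef enum {
  SKETCH_IDCT8X8_1,  /* only the DC coefficient is non-zero */
  SKETCH_IDCT8X8_10, /* at most the first 10 coefficients are non-zero */
  SKETCH_IDCT8X8_64  /* general case: any of the 64 coefficients */
} sketch_idct8x8_variant;

/*
 * Hypothetical selector: `eob` is assumed to be the end-of-block position,
 * i.e. the number of coefficients up to and including the last non-zero
 * one, with eob >= 1. The thresholds are assumptions.
 */
static sketch_idct8x8_variant sketch_pick_idct8x8(int eob) {
  if (eob == 1) return SKETCH_IDCT8X8_1;
  if (eob <= 10) return SKETCH_IDCT8X8_10;
  return SKETCH_IDCT8X8_64;
}
```

A dispatcher in the style of vp9_idct8x8_add would switch on such a result and call the matching specialized function.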
|
|
|
|
In subpixel filters, prefetched source data, unrolled loops,
and interleaved instructions.
In HORIZx4, integrated the idea in Scott's CL (commit:
d22a504d11a15dc3eab666859db0046b5a7d75c5), which was suggested by
Erik/Tamar from Intel. Further tweaking combined rows 0 and 2, and
rows 1 and 3, in registers so that more operations process two rows
at once until the last add.
Tests showed a ~2% decoder speedup.
Change-Id: Ib53d04ede8166c38c3dc744da8c6f737ce26a0e3
|
|
Change-Id: I8a106dd61b0a2520fae792d87d6348e662649b2d
|
|
Change-Id: I4b1c6bb9ff615f5872b96ed07dbf0f5e18e63643
|
|
|
|
Interleaved the instructions, reduced register dependency, and
prefetched the source data. This improved the decoder speed
by 0.6% - 2%.
Change-Id: I568067aa0c629b2e58219326899c82aedf7eccca
|
|
Byte version of Ronald's d153 SSSE3 optimizations for
4x4 and 8x8
(commit: fc91a2a112238a1aee568f3b840585de4e928fca)
Change-Id: Iec4426032311483f615fd9e0dceba3ee85ddebd7
|
|
Change-Id: I2b2af1dd9f5c29c05e28a4fd51fa58ccc4071477
|
|
Change-Id: Id2cc5c829399a2afdf7a8a82615a4e272c814986
|
|
Making name consistent with vp9_short_idct8x8 and vp9_short_idct8x8_1.
Change-Id: I99e0be040ec893f9571dcf090e18f98dc58339f5
|
|
|
|
Making function name consistent with vp9_short_idct16x16 and
vp9_short_idct16x16_1.
Change-Id: I70e54be9e6b9a1dddab0de470686591e96d05517
|
|
Byte version of Ronald's d63 SSSE3 optimizations
(commit: c5a1c8cf3541cf3665fee981b36d22c9fbd4191e)
Change-Id: Ifd3e6d454a2246085f23eabb38518a930321e807
|
|
The current x86inc.asm did not handle 32-bit PIC builds properly:
TEXTRELs were seen in the built library. The PIC macros from
libvpx's x86_abi_support.asm were used to fix this problem, and
the assembly code was modified to use the macros.
Note: this fix is needed for building the decoder. Functions in
the encoder will be fixed later.
Change-Id: Ifa548d37b1d0bc7d0528db75009cc18cd5eb1838
|
|
This is incompatible with most toolchains other than gcc.
Revert "Deleted #include <inttypes.h>"
This reverts commit 4d018be950ef8b056a7c797a22ee58012443df26.
This reverts commit d22a504d11a15dc3eab666859db0046b5a7d75c5.
Change-Id: I1751dc6831f4395ee064e6748281418e967e1dcf
|
|
This seems not to be needed and is not supported
in the Windows build.
Change-Id: Iaca3bbf8cca283aee6bc336cb31ba9dd4610322b
|
|
Reformatted version of a patch submitted by Erik/Tamar
from Intel. For the test clips used, the decoder
performance improved by ~2%.
Change-Id: Ifbc37ac6311bca9ff1cfefe3f2e9b7f13a4a511b
|
|
This patch is a reformatted version of optimizations done by
engineers at Intel (Erik/Tamar) who have been providing
performance feedback for VP9. For the test clips used (720p, 1080p),
up to 1.2% performance improvement was seen.
Change-Id: Ic1a7149098740079d5453b564da6fbfdd0b2f3d2
|
|
This commit exploits the sparsity of the quantized coefficient matrix.
It checks each 32x8 array and skips the corresponding inverse
transformation if all entries are zero.
For ped1080p at 8000 kbps, this reduces the average runtime of the
32x32 inverse 2D-DCT SSE2 function from 6256 cycles to 5200
cycles. It makes the overall encoding process about 2% faster at
speed 0. The speed-up is more pronounced for the decoding process.
Change-Id: If20056c3566bd117642a76f8884c83e8bc8efbcf
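The zero-band check described in this commit can be sketched in plain C as follows. This is a minimal illustration, not the SSE2 code from the commit. The function names are hypothetical, and the sketch assumes that a "32x8 array" means 8 consecutive rows of a row-major 32x32 coefficient block.

```c
#include <stdint.h>

#define SKETCH_BLOCK_DIM 32 /* 32x32 transform block */
#define SKETCH_BAND_ROWS 8  /* assumed: one band is 8 full rows (32x8) */

/* Hypothetical helper: returns 1 if every coefficient in the band is zero. */
static int sketch_band_is_all_zero(const int16_t *coeffs, int band) {
  const int16_t *p = coeffs + band * SKETCH_BAND_ROWS * SKETCH_BLOCK_DIM;
  int acc = 0;
  int i;
  /* OR all entries together; the result is zero only if all are zero. */
  for (i = 0; i < SKETCH_BAND_ROWS * SKETCH_BLOCK_DIM; ++i) acc |= p[i];
  return acc == 0;
}

/* Hypothetical driver: counts the bands that still need a transform pass. */
static int sketch_count_bands_to_transform(const int16_t *coeffs) {
  int band;
  int work = 0;
  for (band = 0; band < SKETCH_BLOCK_DIM / SKETCH_BAND_ROWS; ++band) {
    if (sketch_band_is_all_zero(coeffs, band)) continue; /* skip zero band */
    ++work; /* a real decoder would run the inverse transform pass here */
  }
  return work;
}
```

A SIMD version would typically perform the same OR-reduction on vector registers, so the check stays cheap relative to the transform pass it avoids.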
|
|
This commit provides special handling for the 16x16 inverse 2D-DCT
when only the DC coefficient is quantized to a non-zero value.
Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c
|
|
|
|
This commit enables special handling for the 8x8 inverse 2D-DCT
when only the DC coefficient is quantized to a non-zero value. For
bus_cif at 2000 kbps, it provides about 1% speed-up at speed 0.
Change-Id: I2523222359eec26b144cf8fd4c63a4ad63b1b011
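When only the DC coefficient is non-zero, every output sample of the inverse 2D-DCT has the same value, so the transform reduces to adding one constant to the prediction block. The sketch below shows that shortcut in plain C. All names are hypothetical, and the fixed-point constants and the final shift by 5 are assumptions chosen for illustration, not values confirmed by this commit.

```c
#include <stdint.h>

/* Assumed fixed-point setup: 14-bit precision, cos(pi/4) * 2^14. */
#define SKETCH_CONST_BITS 14
#define SKETCH_COSPI_16_64 11585

/* Rounding right shift; relies on arithmetic shift for negative x. */
static int sketch_round_shift(int x, int bits) {
  return (x + (1 << (bits - 1))) >> bits;
}

/* Clamp to the 8-bit pixel range. */
static uint8_t sketch_clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical DC-only 8x8 inverse transform added onto `dest`. */
static void sketch_idct8x8_dc_add(int16_t dc, uint8_t *dest, int stride) {
  int r, c, a1;
  /* Scale once per transform dimension (rows, then columns). */
  int out = sketch_round_shift(dc * SKETCH_COSPI_16_64, SKETCH_CONST_BITS);
  out = sketch_round_shift(out * SKETCH_COSPI_16_64, SKETCH_CONST_BITS);
  a1 = sketch_round_shift(out, 5); /* assumed final 8x8 output scaling */
  for (r = 0; r < 8; ++r) {
    for (c = 0; c < 8; ++c)
      dest[c] = sketch_clip_pixel(dest[c] + a1);
    dest += stride;
  }
}
```

The saving comes from replacing the row and column transform passes with two multiplications and a single add-and-clip loop.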
|
|
Change-Id: Ie48035ff4f93c41f8a9b3023e6444fd10432d8fb
|
|
Add an SSE2 implementation to handle the special case of the inverse
2D-DCT where only the DC coefficient is non-zero.
Change-Id: I2c6a59e21e5e77b8cf39a4af5eecf4d5ade32e2f
|
|
They share the same functionality, so they are merged.
Change-Id: I98a0386fcee052cb854f9ff90c283c1b844bcb79
|
|
I40454d26,I892e76d5,I865ab3f9,I4a4bec17,I61c4351e,I37eb3559,I1031c556,I8c8f1f42
* changes:
delete vp9_loopfilter_sse2.asm
vp9_loopfilter_intrin_sse2: cosmetics: fix indent
delete x86/vp9_loopfilter_x86.h
vp9_loopfilter_intrin_sse2: make some funcs static
vp9_loopfilter_intrin_sse2: remove unused uv funcs
vp9_loopfilter: remove uv function typedef
filter_block_plane: reuse some constants
vp9_loopfilter.c: make some functions static
|
|
sse2 functions are provided by vp9_loopfilter_intrin_sse2.c
Change-Id: I40454d26034e3ef915eeaf889937fe7d1b519b9b
|
|
Change-Id: I892e76d5ad1443b2ea0d1a7839fe26afe9c68ffb
|
|
also remove prototype_loopfilter{,_block} defines from vp9_loopfilter.h
Change-Id: I865ab3f9436c7b1ca166f76630328abf01389405
|
|
This commit enables an SSE2 implementation of the 16x16 inverse
ADST/DCT hybrid transform. The runtime drops from 5742 cycles to
1821 cycles. This provides about 1% encoding speed-up at speed 0.
Change-Id: I1678d0988bf30b9efd524877705bbb3645edb17b
|
|
+ drop 'vp9_'
Change-Id: I4a4bec175316aab8f65c3a23bacc8362399a1357
|
|
vp9_mbloop_filter_horizontal_edge_sse2 /
vp9_mbloop_filter_vertical_edge_uv_sse2
Change-Id: I61c4351ef0cce79fa4156a47ddace781f1566869
|
|
loop_filter_uvfunction is unused
Change-Id: I37eb3559e9eb2808f1f29dfea429441c94c9df2a
|
|
This commit enables an SSE2 implementation of the 8x8 inverse
ADST/DCT transform. The runtime drops from 1216 cycles to 266 cycles.
For bus_cif at 2000 kbps, the overall runtime drops from
253707 ms to 248430 ms, i.e., a 2% speed-up at speed 0.
Change-Id: Ib0372e17e9162d7b11a10d653b1c8be547c878fb
|
|
|
|
Independent horizontal and vertical implementations.
Requires that blocks be built from 4x4 and [xy]_step_q4 == 16.
6-10% improvement; CIF improved the least.
Change-Id: I137f5ceae4440adc0960bf88e4453e55a618bcda
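The two requirements stated in this commit can be expressed as a guard in front of the optimized path. The sketch below is illustrative only. The function name is hypothetical, and it assumes that "built from 4x4" means the block width and height are multiples of 4, and that a step of 16 in q4 (1/16 pixel) units means unscaled filtering.

```c
/*
 * Hypothetical guard: returns 1 when the optimized convolve path may be
 * used, 0 when the caller should fall back to a generic C implementation.
 */
static int sketch_can_use_fast_convolve(int x_step_q4, int y_step_q4,
                                        int w, int h) {
  /* A step of 16 in 1/16-pixel units advances one full pixel per output. */
  if (x_step_q4 != 16 || y_step_q4 != 16) return 0;
  /* Assumed reading of "built from 4x4": dimensions are multiples of 4. */
  if (w <= 0 || h <= 0 || (w % 4) != 0 || (h % 4) != 0) return 0;
  return 1;
}
```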
|
|
|
|
Enable SSE2 4x4 inverse ADST/DCT transform. The runtime goes from
292 cycles down to 89 cycles. Running bus_cif at 2000 kbps, the
overall runtime of speed 0 goes from 301s to 295s (2% speed-up).
Change-Id: I24098136e7fee7ab2fbf1c11755bdf2ca37f3628
|
|
Change-Id: I3ce849452ed4f08527de9565a9914d5ee36170aa
|
|
Where possible, do the 16 pixel wide filter while doing the horizontal
filtering pass. The same approach can be taken for the mbloop_filter
when that's implemented. Doing so on the vertical pass is a little more
involved, but possible.
Change-Id: I010cb505e623464247ae8f67fa25a0cdac091320
|
|
Change-Id: I2d22577911a37ed7d8c7e08cac20764842267652
|
|
Change-Id: I30a597c0cc366e34c9a3e2afe32d70e044f95ca4
|
|