path: root/vp9/common/x86
Age  Commit message  Author
2013-10-07  Merge "Giving consistent names to IDCT 8x8 functions."  Dmitry Kovalev
2013-10-06  Merge changes I8a106dd6,Iec442603  Jim Bankoski
* changes:
  d153 intra prediction (16x16) ssse3 using bytes
  d153 intra prediction ssse3 using bytes
2013-10-06  Giving consistent names to IDCT 8x8 functions.  Dmitry Kovalev
Renames:
  vp9_short_idct8x8_add -> vp9_idct8x8_64_add
  vp9_short_idct8x8_1_add -> vp9_idct8x8_1_add
  vp9_short_idct8x8_10_add -> vp9_idct8x8_10_add
  vp9_idct_add_8x8 -> vp9_idct8x8_add
Change-Id: Ifb8d3a45b4c0397aa805b30463f3d14581bf72c1
2013-10-04  Giving consistent names to IDCT/IWHT functions.  Dmitry Kovalev
The idea is to have the following names for each transform size:
  vp9_idct4x4_add
  vp9_idct4x4_1_add
  vp9_idct4x4_10_add
  vp9_idct4x4_16_add
  vp9_idct8x8_add
  vp9_idct8x8_1_add
  vp9_idct8x8_10_add
  vp9_idct8x8_64_add
  etc for 16x16, 32x32
The actual list of renames in this patch:
  vp9_idct_add_lossless -> vp9_iwht4x4_add
  vp9_short_iwalsh4x4_add -> vp9_iwht4x4_16_add
  vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add
  vp9_idct_add -> vp9_idct4x4_add
  vp9_short_idct4x4_add -> vp9_idct4x4_16_add
  vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add
Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1
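One way to read the naming scheme: the unsuffixed function acts as a wrapper, and the numeric suffix names a variant specialised for how many coefficients may be non-zero. The sketch below assumes that reading, which the commit message does not spell out. `demo_pick_idct8x8_variant` and its `eob` (end-of-block) parameter are hypothetical names for illustration, not libvpx's API.

```c
#include <assert.h>

/* Hypothetical helper: returns the numeric suffix (1, 10 or 64) of the 8x8
 * variant that a wrapper such as vp9_idct8x8_add might select.
 * Assumption: eob counts the coefficients up to and including the last
 * non-zero one in scan order. */
static int demo_pick_idct8x8_variant(int eob) {
  assert(eob >= 1 && eob <= 64);
  if (eob == 1) return 1;    /* DC coefficient only */
  if (eob <= 10) return 10;  /* only the first 10 coefficients can be non-zero */
  return 64;                 /* general case: full 8x8 block */
}
```

The thresholds mirror the suffixes in the rename list; the cut-offs used by the real wrapper may differ.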
2013-10-03  Merge "Rewrite HORIZx4 and HORIZx8 in subpixel filter functions"  Yunqing Wang
2013-10-03  Rewrite HORIZx4 and HORIZx8 in subpixel filter functions  Yunqing Wang
In subpixel filters, prefetched source data, unrolled loops, and interleaved instructions. In HORIZx4, integrated the idea in Scott's CL (commit: d22a504d11a15dc3eab666859db0046b5a7d75c5), which was suggested by Erik/Tamar from Intel. Further tweaking was done to combine rows 0 and 2, and rows 1 and 3, in registers to do more 2-row-in-1 operations until the last add. Tests showed a ~2% decoder speedup.
Change-Id: Ib53d04ede8166c38c3dc744da8c6f737ce26a0e3
2013-10-02  d153 intra prediction (16x16) ssse3 using bytes  Scott LaVarnway
Change-Id: I8a106dd61b0a2520fae792d87d6348e662649b2d
2013-10-01  Adding SSE2 optimized vp9_short_idct32x32_1_add function.  Dmitry Kovalev
Change-Id: I4b1c6bb9ff615f5872b96ed07dbf0f5e18e63643
2013-10-01  Merge "Modify HORIZx16 macro in subpixel filter functions"  Yunqing Wang
2013-10-01  Modify HORIZx16 macro in subpixel filter functions  Yunqing Wang
Interleaved the instructions, reduced register dependency, and prefetched the source data. This improved the decoder speed by 0.6% - 2%.
Change-Id: I568067aa0c629b2e58219326899c82aedf7eccca
2013-10-01  d153 intra prediction ssse3 using bytes  Scott LaVarnway
Byte version of Ronald's d153 ssse3 optimizations for 4x4 and 8x8 (commit: fc91a2a112238a1aee568f3b840585de4e928fca)
Change-Id: Iec4426032311483f615fd9e0dceba3ee85ddebd7
2013-09-29  fixed cpp lint issue in vp9_postproc_x86  Jim Bankoski
Change-Id: I2b2af1dd9f5c29c05e28a4fd51fa58ccc4071477
2013-09-29  nolintify intrinsic idct file  Jim Bankoski
Change-Id: Id2cc5c829399a2afdf7a8a82615a4e272c814986
2013-09-27  Renaming vp9_short_idct10_8x8_add to vp9_short_idct8x8_10_add.  Dmitry Kovalev
Making name consistent with vp9_short_idct8x8 and vp9_short_idct8x8_1.
Change-Id: I99e0be040ec893f9571dcf090e18f98dc58339f5
2013-09-27  Merge "Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10."  Dmitry Kovalev
2013-09-26  Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10.  Dmitry Kovalev
Making function name consistent with vp9_short_idct16x16 and vp9_short_idct16x16_1.
Change-Id: I70e54be9e6b9a1dddab0de470686591e96d05517
2013-09-25  d63 intra prediction ssse3 using bytes  Scott LaVarnway
Byte version of Ronald's d63 ssse3 optimizations (commit: c5a1c8cf3541cf3665fee981b36d22c9fbd4191e)
Change-Id: Ifd3e6d454a2246085f23eabb38518a930321e807
2013-09-18  Fix x86inc.asm to build PIC code correctly  Yunqing Wang
The current x86inc.asm did not handle 32-bit PIC builds properly: TEXTRELs were seen in the built library. The PIC macros from libvpx's x86_abi_support.asm were used to fix this problem, and the assembly code was modified to use the macros.
Notes: This fix is needed for building the decoder. Functions in the encoder will be fixed later.
Change-Id: Ifa548d37b1d0bc7d0528db75009cc18cd5eb1838
2013-09-13  Revert "Improved 8t filters"  James Zern
This is incompatible with most toolchains other than gcc.
Revert "Deleted #include <inttypes.h>"
This reverts commit 4d018be950ef8b056a7c797a22ee58012443df26.
This reverts commit d22a504d11a15dc3eab666859db0046b5a7d75c5.
Change-Id: I1751dc6831f4395ee064e6748281418e967e1dcf
2013-09-12  Deleted #include <inttypes.h>  Paul Wilkins
This seems not to be needed and is not supported in the Windows build.
Change-Id: Iaca3bbf8cca283aee6bc336cb31ba9dd4610322b
2013-09-11  Improved 8t filters  Scott LaVarnway
Reformatted version of a patch submitted by Erik/Tamar from Intel. For the test clips used, the decoder performance improved by ~2%.
Change-Id: Ifbc37ac6311bca9ff1cfefe3f2e9b7f13a4a511b
2013-08-29  Improved mb_lpf_horizontal_edge_w_sse2_8  Scott LaVarnway
This patch is a reformatted version of optimizations done by engineers at Intel (Erik/Tamar) who have been providing performance feedback for VP9. For the test clips used (720p, 1080p), up to 1.2% performance improvement was seen.
Change-Id: Ic1a7149098740079d5453b564da6fbfdd0b2f3d2
2013-07-31  Optimize 32x32 2D inverse DCT for speed-up  Jingning Han
This commit exploits the sparsity of the quantized coefficient matrix. It checks each 32x8 array and skips the corresponding inverse transformation if all entries are zero. For ped1080p at 8000 kbps, this on average reduces the runtime of the 32x32 inverse 2D-DCT SSE2 function from 6256 cycles to 5200 cycles. It makes the overall encoding process about 2% faster at speed 0. The speed-up is more pronounced for the decoding process.
Change-Id: If20056c3566bd117642a76f8884c83e8bc8efbcf
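The zero-slice check described above can be sketched in scalar C. This is a minimal illustration, not the SSE2 code from the commit: it assumes the 32x32 block is stored as four contiguous slices of 32x8 = 256 coefficients, and the `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_SLICE_COEFFS = 32 * 8, DEMO_NUM_SLICES = 4 };

/* Returns 1 if all n coefficients are zero, 0 otherwise. */
static int demo_all_zero(const int16_t *coeffs, size_t n) {
  size_t i;
  for (i = 0; i < n; ++i) {
    if (coeffs[i] != 0) return 0;
  }
  return 1;
}

/* Counts the 32x8 slices that still need an inverse transform.
 * Assumption: coeffs points at 1024 values laid out slice after slice. */
static int demo_count_slices_to_transform(const int16_t *coeffs) {
  int slice, needed = 0;
  assert(coeffs != NULL);
  for (slice = 0; slice < DEMO_NUM_SLICES; ++slice) {
    const int16_t *start = coeffs + slice * DEMO_SLICE_COEFFS;
    if (!demo_all_zero(start, DEMO_SLICE_COEFFS)) {
      ++needed;  /* a real implementation would transform this slice here */
    }
  }
  return needed;
}
```

For heavily quantized blocks most slices are all zero, so the cost of the scan is small compared with the transform work it avoids.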
2013-07-29  16x16 inverse 2D-DCT with DC only  Jingning Han
This commit adds special handling for the 16x16 inverse 2D-DCT where only the DC coefficient is quantized to a non-zero value.
Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c
2013-07-26  Merge "d45 intra prediction SSSE3 optimizations."  Ronald S. Bultje
2013-07-26  Special handle on DC only inverse 8x8 2D-DCT  Jingning Han
This commit enables special handling for the 8x8 inverse 2D-DCT where only the DC coefficient is quantized to a non-zero value. For bus_cif at 2000 kbps, it provides about 1% speed-up at speed 0.
Change-Id: I2523222359eec26b144cf8fd4c63a4ad63b1b011
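When only the DC coefficient is non-zero, the inverse DCT output is the same value at every position, so the transform reduces to adding one constant to each predicted pixel. The scalar sketch below shows that idea only. `dc_residual` is assumed to be the already-scaled output value (the scaling used by libvpx is not shown), and the `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamps a value to the 8-bit pixel range [0, 255]. */
static uint8_t demo_clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Adds the same residual to every pixel of an 8x8 block.
 * Assumption: dc_residual is the constant inverse-transform output. */
static void demo_dc_only_add_8x8(uint8_t *dest, int stride, int dc_residual) {
  int r, c;
  assert(dest != NULL && stride >= 8);
  for (r = 0; r < 8; ++r) {
    for (c = 0; c < 8; ++c) {
      uint8_t *p = &dest[r * stride + c];
      *p = demo_clip_pixel(*p + dc_residual);
    }
  }
}
```

A SIMD version can broadcast the constant once and apply a saturating add to a whole row at a time, which is where the speed-up over the general transform comes from.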
2013-07-26  d45 intra prediction SSSE3 optimizations.  Ronald S. Bultje
Change-Id: Ie48035ff4f93c41f8a9b3023e6444fd10432d8fb
2013-07-24  SSE2 inverse 4x4 2D-DCT with DC only  Jingning Han
Add SSE2 implementation to handle the special case of inverse 2D-DCT where only DC coefficient is non-zero.
Change-Id: I2c6a59e21e5e77b8cf39a4af5eecf4d5ade32e2f
2013-07-24  Merge vp9_dc_only_idct_add and vp9_short_idct4x4_1  Jingning Han
They share the same functionality, so merging together.
Change-Id: I98a0386fcee052cb854f9ff90c283c1b844bcb79
2013-07-16  Merge changes I40454d26,I892e76d5,I865ab3f9,I4a4bec17,I61c4351e,I37eb3559,I1031c556,I8c8f1f42  James Zern
* changes:
  delete vp9_loopfilter_sse2.asm
  vp9_loopfilter_intrin_sse2: cosmetics: fix indent
  delete x86/vp9_loopfilter_x86.h
  vp9_loopfilter_intrin_sse2: make some funcs static
  vp9_loopfilter_intrin_sse2: remove unused uv funcs
  vp9_loopfilter: remove uv function typedef
  filter_block_plane: reuse some constants
  vp9_loopfilter.c: make some functions static
2013-07-16  delete vp9_loopfilter_sse2.asm  James Zern
sse2 functions are provided by vp9_loopfilter_intrin_sse2.c
Change-Id: I40454d26034e3ef915eeaf889937fe7d1b519b9b
2013-07-16  vp9_loopfilter_intrin_sse2: cosmetics: fix indent  James Zern
Change-Id: I892e76d5ad1443b2ea0d1a7839fe26afe9c68ffb
2013-07-16  delete x86/vp9_loopfilter_x86.h  James Zern
also remove prototype_loopfilter{,_block} defines from vp9_loopfilter.h
Change-Id: I865ab3f9436c7b1ca166f76630328abf01389405
2013-07-16  SSE2 16x16 inverse ADST/DCT hybrid transform  Jingning Han
This commit enables SSE2 implementation of 16x16 inverse ADST/DCT hybrid transform. The runtime goes from 5742 cycles -> 1821 cycles. This provides about 1% encoding speed-up at speed 0.
Change-Id: I1678d0988bf30b9efd524877705bbb3645edb17b
2013-07-13  vp9_loopfilter_intrin_sse2: make some funcs static  James Zern
+ drop 'vp9_'
Change-Id: I4a4bec175316aab8f65c3a23bacc8362399a1357
2013-07-13  vp9_loopfilter_intrin_sse2: remove unused uv funcs  James Zern
vp9_mbloop_filter_horizontal_edge_sse2 / vp9_mbloop_filter_vertical_edge_uv_sse2
Change-Id: I61c4351ef0cce79fa4156a47ddace781f1566869
2013-07-13  vp9_loopfilter: remove uv function typedef  James Zern
loop_filter_uvfunction is unused
Change-Id: I37eb3559e9eb2808f1f29dfea429441c94c9df2a
2013-07-12  SSE2 8x8 inverse ADST/DCT transform  Jingning Han
This commit enables SSE2 implementation of 8x8 inverse ADST/DCT transform. The runtime goes from 1216 cycles -> 266 cycles. For bus_cif at 2000 kbps, the overall runtime reduces from 253707 ms -> 248430 ms, i.e., 2% speed-up at speed 0.
Change-Id: Ib0372e17e9162d7b11a10d653b1c8be547c878fb
2013-07-11  Merge "SSE2 4x4 inverse ADST/DCT transform"  Jingning Han
2013-07-11  convolve8 optimizations for neon  Johann
Independent horizontal and vertical implementations. Requires that blocks be built from 4x4 and [xy]_step_q4 == 16. 6-10% improvement. CIF improved the least.
Change-Id: I137f5ceae4440adc0960bf88e4453e55a618bcda
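The stated preconditions suggest a guard that selects the optimized path only when they hold and otherwise falls back to a generic routine. The sketch below is an assumption about how such a check could look, not the code from the commit: `demo_can_use_fast_convolve` is a hypothetical name, and a step of 16 is taken to mean one full pixel in 1/16-pel (q4) units, i.e. no scaling.

```c
#include <assert.h>

/* Returns 1 when the optimized path's preconditions hold, 0 otherwise.
 * Assumptions: steps are in 1/16-pel units, so 16 means unscaled, and
 * "built from 4x4" means width and height are multiples of 4. */
static int demo_can_use_fast_convolve(int x_step_q4, int y_step_q4,
                                      int w, int h) {
  assert(w > 0 && h > 0);
  return x_step_q4 == 16 && y_step_q4 == 16 && (w % 4) == 0 && (h % 4) == 0;
}
```

A caller would test this once per block and route scaled or oddly sized blocks to the C reference implementation.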
2013-07-11  Merge "Wide loopfilter 16 pix at a time"  Jim Bankoski
2013-07-10  SSE2 4x4 inverse ADST/DCT transform  Jingning Han
Enable SSE2 4x4 inverse ADST/DCT transform. The runtime goes from 292 cycles down to 89 cycles. Running bus_cif at 2000 kbps, the overall runtime of speed 0 goes from 301s to 295s (2% speed-up).
Change-Id: I24098136e7fee7ab2fbf1c11755bdf2ca37f3628
2013-07-10  Replace copy_memNxM functions with a generic copy/avg function.  Ronald S. Bultje
Change-Id: I3ce849452ed4f08527de9565a9914d5ee36170aa
2013-07-10  Wide loopfilter 16 pix at a time  John Koleszar
Where possible, do the 16 pixel wide filter while doing the horizontal filtering pass. The same approach can be taken for the mbloop_filter when that's implemented. Doing so on the vertical pass is a little more involved, but possible.
Change-Id: I010cb505e623464247ae8f67fa25a0cdac091320
2013-07-10  Remove unused iwalsh4x4 MMX/SSE2 functions.  Ronald S. Bultje
Change-Id: I2d22577911a37ed7d8c7e08cac20764842267652
2013-07-10  Remove unused 16x3/3x16 sad SSE2 functions.  Ronald S. Bultje
Change-Id: I30a597c0cc366e34c9a3e2afe32d70e044f95ca4
2013-07-10  Merge "SSSE3 assembly for 4x4/8x8/16x16/32x32 H intra prediction."  Ronald S. Bultje
2013-07-10  Merge "SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 TM intra prediction."  Ronald S. Bultje
2013-07-10  Merge "SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 V intra prediction."  Ronald S. Bultje
2013-07-10  Merge "SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 DC intra prediction."  Ronald S. Bultje