Age | Commit message (Collapse) | Author |
|
While at it, beef up the test suite for strnlen and add performance
tests for it, too.
|
|
Using the new SSE4.2 instructions is cool but not really the fastest.
Some older SSE instructions can do the trick faster.
|
|
|
|
|
|
|
|
This patch includes optimized 64bit memcpy/memmove for Atom, Core 2 and
Core i7. It improves memcpy by up to 3X on Atom, up to 4X on Core 2 and
up to 1X on Core i7. It also improves memmove by up to 3X on Atom, up to
4X on Core 2 and up to 2X on Core i7.
|
|
|
|
|
|
|
|
|
|
|
|
This is 64bit SSE4 optimized memcmp. It improves memcmp by upto 3X
on Intel Core i7.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
This patch defines bit_XXX and index_XXX and use them to check processor
feature in assembly code. It can prevent typos in processor feature
check.
|
|
|
|
|
|
definitions.
|
|
|
|
|
|
|
|
|
|
|
|
It turns that SSSE3 isn't slow on Atom. The problem is bsf. This patch
removes ENABLE_SSSE3_ON_ATOM.
|
|
|
|
|
|
If a signal arrived during a symbol lookup and the signal handler also
required a symbol lookup, the end of the lookup in the signal handler reset
the flag whether restoring AVX/SSE registers is needed. Resetting means
in this case that the tail part of the outer lookup code will try to
restore the registers and this can fail miserably. We now restore to the
previous value which makes nesting calls possible.
|
|
On 64-bit machines we should not split doubles into two 32 bit
integer and handle the words separately. We have wide registers.
This patch implements a 64-bit ceil version. Ideally all other
functions will be converted over time.
|
|
|
|
|
|
This patch fixes mixed SSE/AVX audit and checks AVX only once in
_dl_runtime_profile. When an AVX or SSE register value in pltenter is
modified, we have to make sure that the SSE part value is the same in both
lr_xmm and lr_vector fields so that pltexit will get the correct value
from either lr_xmm or lr_vector fields. AVX-enabled pltenter should
update both lr_xmm and lr_vector fields to support stacked AVX/SSE
pltenter functions.
|
|
|
|
|
|
|
|
|
|
The meaning of the 25-14 bits in EAX returned from cpuid with EAX = 4
has been changed from "the maximum number of threads sharing the cache"
to "the maximum number of addressable IDs for logical processors sharing
the cache" if cpuid takes EAX = 11. We need to use results from both
EAX = 4 and EAX = 11 to get the number of threads sharing the cache.
The 25-14 bits in EAX on Core i7 is 15 although the number of logical
processors is 8. Here is a white paper on this:
http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/
This patch correctly counts number of logical processors on Intel CPUs
with EAX = 11 support on cpuid. Tested on Dinnington, Core i7 and
Nehalem EX/EP.
It also fixed Pentium Ds workaround since EBX may not have the right
value returned from cpuid with EAX = 1.
|
|
This patch adds 32bit SSE4.2 string functions. It uses -16L instead of
0xfffffffffffffff0L, which works for both 32bit and 64bit long. Tested
on 32bit Core i7 and Core 2.
|
|
This patch adds multiarch support when configured for i686. I modified
some x86-64 functions to support 32bit. I will contribute 32bit SSE string
and memory functions later.
|
|
We use sigaltstack internally which on some systems is a syscall
and should be used as such. Move the x86-64 version to the Linux
specific directory and create in its place a file which always
causes compile errors.
|
|
|
|
|
|
The simple test previously used might trigger if the longjmp jumps
from the signal stack to the normal stack. We now explicitly test
for this case.
|