Age | Commit message | Author |
|
This patch enables libmvec on AArch64. The proposed change is mainly
implementing build infrastructure to add the new routines to ABI,
tests and benchmarks. I have demonstrated how this all fits together
by adding implementations for vector cos, in both single and double
precision, targeting both Advanced SIMD and SVE.
The implementations of the routines themselves are just loops over the
scalar routine from libm for now, as we are more concerned with
getting the plumbing right at this point. We plan to contribute vector
routines from the Arm Optimized Routines repo that are compliant with
requirements described in the libmvec wiki.
Building libmvec requires at least GCC 10 for the SVE ACLE. To avoid
raising the minimum GCC version by such a big jump, we allow users to
disable libmvec if their compiler is too old.
Note that at this point users have to manually call the vector math
functions. This seems to be acceptable to some downstream users.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
|
|
Adjust iteration counts so benchmarks don't run too slowly or quickly.
Ensure benchmarks take less than 10 seconds on older, slower cores and
more than 0.5 seconds on fast cores.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
To calculate geometric mean for string benchmark results.
Signed-off-by: Nisha Poyarekar <nisha.s.menon@gmail.com>
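As an illustration of the calculation named above, here is a minimal sketch of a geometric mean over timing ratios. The function name and the sample ratios are hypothetical, and this is not the code the commit adds.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    values = list(values)
    if not values:
        raise ValueError("geometric_mean requires at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric_mean requires positive values")
    # exp(mean(log(v))) avoids overflow from multiplying many values.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical new/old timing ratios for three benchmark configurations.
ratios = [0.5, 2.0, 1.0]
print(geometric_mean(ratios))
```

The geometric mean suits ratios because a 2x slowdown and a 2x speedup cancel out, which an arithmetic mean would not show.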
|
|
1. Subnormals: 128 inputs.
2. Normal numbers with large exponent difference (|x/y| > 2^8):
1024 inputs between FLT_MIN and FLT_MAX;
3. Close exponents (ey >= -103 and |x/y| < 2^8): 1024 inputs with
exponents between -10 and 10.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
|
|
Add three different datasets from random floating-point numbers:
1. Subnormals: 128 inputs.
2. Normal numbers with large exponent difference (|x/y| > 2^52):
1024 inputs between DBL_MIN and DBL_MAX;
3. Close exponents (ey >= -907 and |x/y| < 2^52): 1024 inputs with
exponents between -10 and 10.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
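The three input classes can be sketched roughly as follows. This is an assumption-laden illustration, not the generator used for the commit: the helper names, the seed, and the exact exponent ranges for the second class are hypothetical, and only the class sizes and the 2^52 ratio threshold come from the message.

```python
import math
import random
import sys

DBL_MIN = sys.float_info.min  # smallest positive normal double, 2**-1022


def random_subnormal(rng):
    # A nonzero mantissa below 2**52 scaled by 2**-1074 is always subnormal.
    return math.ldexp(rng.randrange(1, 1 << 52), -1074)


def random_normal(rng, min_exp, max_exp):
    # Mantissa in [1, 2] times 2**e, with e drawn from [min_exp, max_exp].
    return math.ldexp(1.0 + rng.random(), rng.randint(min_exp, max_exp))


def large_exponent_difference(rng):
    # Put x at least 54 binades above y so that |x/y| > 2**52 always holds.
    ey = rng.randint(-1022, 900)
    ex = rng.randint(ey + 54, 1022)
    return random_normal(rng, ex, ex), random_normal(rng, ey, ey)


def make_datasets(seed=0):
    rng = random.Random(seed)
    subnormals = [(random_subnormal(rng), random_subnormal(rng))
                  for _ in range(128)]
    far = [large_exponent_difference(rng) for _ in range(1024)]
    close = [(random_normal(rng, -10, 10), random_normal(rng, -10, 10))
             for _ in range(1024)]
    return subnormals, far, close
```

Keeping the classes separate lets a benchmark report a time per class, since each class is likely to exercise a different code path.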
|
|
This allows other targets to use the same inputs for their own libmvec
microbenchmarks without having to duplicate them in their own
subdirectory.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
|
|
Instead of benchmarking slow byte oriented loops, include the optimized generic
strchr and strrchr implementation. Adjust iteration count to reduce benchmark
time.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Remove the slow byte oriented loops. Adjust iteration count to reduce
benchmark time.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Remove the slow byte oriented simple_memcmp.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Remove simple_strcspn/strpbrk/strsep which are significantly slower than the
generic implementations. Also remove oldstrsep and oldstrtok since they are
practically identical to the generic implementation. Adjust iteration count
to reduce benchmark time.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Remove memchr_strnlen since it is now the same as generic_strnlen. Adjust
iteration count to reduce benchmark time. Keep memchr_strlen since the
generic strlen does not use memchr.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Instead of benchmarking slow byte oriented loops, include the optimized
generic memchr/memrchr implementation. Adjust iteration count to reduce
benchmark time.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Remove the slow byte oriented simple_strcpy_chk and simple_stpcpy_chk.
Adjust iteration count to increase benchmark time.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Instead of benchmarking slow byte oriented loops, include the optimized generic
strcmp/strncmp implementation. Adjust iteration count to reduce benchmark time.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Almost all uses of rawmemchr find the end of a string. Since most targets use
a generic implementation, replacing it with strchr is better since that is
optimized by compilers into strlen (s) + s. Also fix the generic rawmemchr
implementation to use a cast to unsigned char in the if statement.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
|
|
JSON output is easier to parse and most other benchmarks already do
the same.
|
|
len=0 is valid and fairly common so should be tested.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
|
|
Reviewed-by: Fangrui Song <maskray@google.com>
|
|
1. Add more complete coverage in the medium size range.
2. In strnlen remove the `1 << i` which was UB (`i` could go beyond
32/64)
|
|
Reuses infrastructure from previous pthread_mutex_lock benchmarks to
test other performance sensitive functions.
|
|
Current benchmarks are missing many cases in the mid-length range
which is often the hottest size range.
|
|
It shows both throughput (total bytes obtained in the test duration)
and latency for both arc4random and arc4random_buf with different
sizes.
Checked on x86_64-linux-gnu, aarch64-linux, and powerpc64le-linux-gnu.
|
|
Add a second iteration for memrchr to set `pos` starting from the end
of the buffer.
Previously `pos` was only set relative to the beginning of the
buffer. This isn't really useful for memrchr because the beginning
of the search space is (buf + len).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
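A small sketch of why the placement matters for a backward search. The helper below is hypothetical; it uses Python's `bytes.rfind` as a stand-in for memrchr and counts how many bytes a reverse scan has to inspect.

```python
def backward_scan_length(length, index):
    """Bytes a reverse search inspects before it hits a match at `index`."""
    buf = bytearray(b"a" * length)
    buf[index] = ord("x")
    found = bytes(buf).rfind(b"x")
    return length - found


LENGTH = 256
POS = 16
# Match placed POS bytes after the start of the buffer.
from_start = backward_scan_length(LENGTH, POS)
# Match placed POS bytes before the last byte of the buffer.
from_end = backward_scan_length(LENGTH, LENGTH - 1 - POS)
print(from_start, from_end)
```

With positions measured only from the start, a small `pos` always means a long backward scan, so short-scan cases are never measured.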
|
|
So it can show both reciprocal-throughput and latency.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
So it can show both reciprocal-throughput and latency.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
1. Use json_ctx for output to help standardize format across all
benchtests.
2. Add some additional tests to strstr and memchr expanding alignments
and adding more small values.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
1. Output results in JSON format so it's easier to parse.
2. Increase max alignment to `getpagesize () - 1` to make it possible
to test page cross cases.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
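To see why the larger maximum alignment is needed, the following sketch checks whether a buffer crosses a page boundary. It assumes a 4096-byte page, whereas the benchmark queries `getpagesize ()` at run time, and the function name is hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


def crosses_page(align, length, page_size=PAGE_SIZE):
    """True if [align, align + length) spans a page boundary.

    `align` is the buffer's offset from a page-aligned base address.
    """
    if length == 0:
        return False
    first_page = align // page_size
    last_page = (align + length - 1) // page_size
    return first_page != last_page


# A 16-byte string at a small offset stays within one page ...
print(crosses_page(63, 16))
# ... while the same string near the end of the page crosses into the next.
print(crosses_page(PAGE_SIZE - 1, 16))
```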
|
|
Unroll slightly and enforce good instruction scheduling. This improves
performance on out-of-order machines. The unrolling allows for
pipelined multiplies.
In addition, as an optional sysdep, reorder the operations and prevent
reassociation for better scheduling and higher ILP. This commit
only adds the barrier for x86, although it should be either no
change or a win for any architecture.
Unrolling further started to induce slowdowns for sizes [0, 4],
but it can help the loop, so if larger sizes are the target, further
unrolling can be beneficial.
Results for _dl_new_hash
Benchmarked on Tigerlake: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
Time as Geometric Mean of N=30 runs
Geometric mean of all benchmarks, New / Old: 0.674
type, length, New Time, Old Time, New Time / Old Time
fixed, 0, 2.865, 2.72, 1.053
fixed, 1, 3.567, 2.489, 1.433
fixed, 2, 2.577, 3.649, 0.706
fixed, 3, 3.644, 5.983, 0.609
fixed, 4, 4.211, 6.833, 0.616
fixed, 5, 4.741, 9.372, 0.506
fixed, 6, 5.415, 9.561, 0.566
fixed, 7, 6.649, 10.789, 0.616
fixed, 8, 8.081, 11.808, 0.684
fixed, 9, 8.427, 12.935, 0.651
fixed, 10, 8.673, 14.134, 0.614
fixed, 11, 10.69, 15.408, 0.694
fixed, 12, 10.789, 16.982, 0.635
fixed, 13, 12.169, 18.411, 0.661
fixed, 14, 12.659, 19.914, 0.636
fixed, 15, 13.526, 21.541, 0.628
fixed, 16, 14.211, 23.088, 0.616
fixed, 32, 29.412, 52.722, 0.558
fixed, 64, 65.41, 142.351, 0.459
fixed, 128, 138.505, 295.625, 0.469
fixed, 256, 291.707, 601.983, 0.485
random, 2, 12.698, 12.849, 0.988
random, 4, 16.065, 15.857, 1.013
random, 8, 19.564, 21.105, 0.927
random, 16, 23.919, 26.823, 0.892
random, 32, 31.987, 39.591, 0.808
random, 64, 49.282, 71.487, 0.689
random, 128, 82.23, 145.364, 0.566
random, 256, 152.209, 298.434, 0.51
Co-authored-by: Alexander Monakov <amonakov@ispras.ru>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
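The arithmetic behind a two-byte unroll can be sketched as below. This is an illustration rather than the committed code: it assumes the classic GNU hash recurrence (`h = h * 33 + c`, starting from 5381, truncated to 32 bits), and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def dl_new_hash_simple(data: bytes) -> int:
    # One dependent multiply-add per input byte.
    h = 5381
    for c in data:
        h = (h * 33 + c) & MASK32
    return h


def dl_new_hash_unrolled(data: bytes) -> int:
    h = 5381
    even_len = len(data) - (len(data) % 2)
    for i in range(0, even_len, 2):
        c0, c1 = data[i], data[i + 1]
        # (h * 33 + c0) * 33 + c1 expanded: h * 33 * 33 and c0 * 33 + c1
        # are independent, so the two multiplies can be pipelined.
        h = (h * 33 * 33 + (c0 * 33 + c1)) & MASK32
    if len(data) % 2:
        # Handle the trailing byte of an odd-length input.
        h = (h * 33 + data[-1]) & MASK32
    return h
```

Both functions return the same value for every input; the unrolled form only shortens the dependency chain between iterations.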
|
|
Benchtests are for throughput and include random / fixed size
benchmarks.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
|
|
Add a simple benchmark that measures wcrtomb performance with various
locales with 1-4 byte characters.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
|
|
Improve libmvec benchmark integration so that in future other
architectures may be able to run their libmvec benchmarks as well. This
now allows libmvec benchmarks to be run with `make BENCHSET=bench-math`.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
|
|
The libmvec benchmarks print a message indicating that a certain CPU
feature is unsupported and exit prematurely, which breaks the JSON in
bench.out.
Handle this more elegantly in the bench makefile target by adding
support for an UNSUPPORTED exit status (77) so that bench.out continues
to have output for valid tests.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
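The idea can be sketched as a small runner that skips benchmarks exiting with status 77. The actual change lives in the bench makefile target, so the Python below is only an analogy; the function names and the output layout are hypothetical.

```python
import json
import subprocess
import sys

UNSUPPORTED = 77  # exit status named in the commit message


def run_bench(argv):
    """Return the parsed JSON output, or None if the bench is unsupported."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode == UNSUPPORTED:
        return None
    proc.check_returncode()  # any other failure is still an error
    return json.loads(proc.stdout)


def collect(benches):
    """Run (name, argv) pairs and emit one valid JSON document."""
    results = {}
    for name, argv in benches:
        output = run_bench(argv)
        if output is not None:
            results[name] = output
    return json.dumps({"functions": results})


if __name__ == "__main__":
    # Two stand-in benchmarks: one succeeds, one reports UNSUPPORTED.
    ok = [sys.executable, "-c",
          "import json; print(json.dumps({'mean': 1.5}))"]
    skipped = [sys.executable, "-c",
               "import sys; print('feature missing'); sys.exit(77)"]
    print(collect([("ok", ok), ("skipped", skipped)]))
```

Dropping the unsupported benchmark's plain-text message keeps the combined output parseable for the benchmarks that did run.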
|
|
Benchmark for testing pthread mutex lock performance with different
thread counts and critical-section lengths.
The test configuration consists of 3 parts:
1. thread number
2. critical-section length
3. non-critical-section length
The thread count starts at 1 and doubles until it reaches the number of
CPU cores (nprocs). An additional over-saturation case (1.25 * nprocs)
is also included.
The critical section is represented by a loop of shared do_filler();
its length is determined by the loop iteration count.
The non-critical section is similar to the critical section, except
that it is based on non-shared do_filler().
Currently, adaptive pthread_mutex lock is tested.
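The thread-count schedule described above can be sketched as follows. How the benchmark rounds 1.25 * nprocs, and what it does when nprocs is not a power of two, is not stated in the message, so both choices here are assumptions, as is the function name.

```python
def thread_counts(nprocs):
    """Thread counts 1, 2, 4, ... up to nprocs, plus an over-saturated case."""
    counts = []
    n = 1
    while n <= nprocs:
        counts.append(n)
        n *= 2
    # Assumption: truncate 1.25 * nprocs to an integer.
    oversaturated = int(nprocs * 1.25)
    if oversaturated not in counts:
        counts.append(oversaturated)
    return counts


print(thread_counts(8))
```

The over-saturated case has more runnable threads than cores, so it also exercises contention that involves the scheduler.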
|
|
1. Use json-lib for printing results.
2. Expose all parameters (previously pos, seek_char, and max_char were
not printed).
3. Add benchmarks that test multiple occurrences of seek_char in the
string.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Just a QOL change to make parsing the output of the benchtests more
consistent.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Just a QOL change to make parsing the output of the benchtests more
consistent.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Just a QOL change to make parsing the output of the benchtests more
consistent.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Just a QOL change to make parsing the output of the benchtests more
consistent.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Add a benchmark that randomizes whether the return value should be NULL
or a pointer to CHAR. The rationale is that on many architectures there
is a choice between a predicated execution option (e.g. cmovcc on x86)
and a branch.
On x86 the results for cmovcc vs branch are something along the lines
of the following:
perc-zero, Br On Result, Time Br / Time cmov
0.10, 1, 0.983
0.10, 0, 1.246
0.25, 1, 1.035
0.25, 0, 1.49
0.33, 1, 1.016
0.33, 0, 1.579
0.50, 1, 1.228
0.50, 0, 1.739
0.66, 1, 1.039
0.66, 0, 1.764
0.75, 1, 0.996
0.75, 0, 1.642
0.90, 1, 1.071
0.90, 0, 1.409
1.00, 1, 0.937
1.00, 0, 0.999
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Just a QOL change to make parsing the output of the benchtests more
consistent.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Use "=" instead of ":=" to allow sysdeps Makefiles to add more benches
to bench and benchset. This fixes BZ #28970.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
|
|
Commit ac759b1fbf28a82d99afde9046f8b72c7cba5dae added attribute
"overlap" to bench-memmove-walk, whose value is a string. This change
makes compare_strings.py fail since benchout_strings.schema.json
requires the values of attributes to be numbers.
This patch relaxes that constraint.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
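The kind of relaxation described can be sketched with a hand-written type check. The contents of benchout_strings.schema.json are not shown here, so the two schema fragments, the helper, and the sample attribute value are all hypothetical. The sketch relies only on the JSON Schema rule that `type` may be a single name or a list of names.

```python
STRICT = {"type": "number"}               # attribute values must be numbers
RELAXED = {"type": ["number", "string"]}  # numbers or strings are accepted

TYPE_CHECKS = {
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}


def matches(value, schema):
    """Check `value` against the `type` keyword of a schema fragment."""
    allowed = schema["type"]
    if isinstance(allowed, str):
        allowed = [allowed]
    return any(TYPE_CHECKS[name](value) for name in allowed)


# A numeric attribute passes either way; a string one needs the relaxed form.
print(matches(64, STRICT), matches("yes", STRICT), matches("yes", RELAXED))
```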
|
|
1. Add all .o files to extra-objs.
2. Include ../Rules after extra-objs has been set.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
|
|
Remove one of 2 identical loops in bench-bzero-walk.c.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
|
|
Small sizes (<= 64) represent a large portion of memset usages with a
zero value. Add sizes (<= 64) to bench-bzero-walk.c to cover them.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
|
|
Zero is by far the most common value passed to memset (99%+ for
Python3 and GCC). Add bench-memset-zero-large.c,
bench-memset-zero-walk.c and bench-memset-zero.c to measure memset
implementations for zeroing.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
|
|
Add bench-bzero-large.c, bench-bzero-walk.c and bench-bzero.c.
|
|
Put one bench per line and sort them.
|
|
Zero is a relevant size for some workloads (roughly 5% of uses for
GCC) so we should be testing its performance as well.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|