When the algorithm cannot go any faster, you exploit the hardware
Multiple threads
Consider this scenario: you have two variables (say, ints or long longs) and you perform a long-running task on each of them. To speed things up, you run the two tasks on two threads, hoping the whole thing takes half the time.
#include <chrono>
#include <iostream>
#include <thread>

long long x = 0;
long long y = 0;

void increment(long long& a) {
    for (int i = 0; i < 100'000'000; i++) {
        a++;
    }
}

Now measure the time taken when increment is invoked on x and y on separate threads.
int main() {
    auto start = std::chrono::high_resolution_clock::now();
    std::thread t1([&](){ increment(x); });
    std::thread t2([&](){ increment(y); });
    t1.join();
    t2.join();
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "time: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms\n";
    return 0;
}

Run the program and note the time taken; on my machine this turned out to be close to 500 ms.
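For comparison, it can also help to time a sequential baseline, i.e. both increments on the main thread. The following is a minimal sketch of such a baseline (it is not part of the measurements quoted here); it reuses the increment function and the globals from above.

// Hypothetical sequential baseline: run both long-running loops on one thread.
// Reuses increment(), x and y defined above; build it as a separate main().
int main() {
    auto start = std::chrono::high_resolution_clock::now();
    increment(x);
    increment(y);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "sequential time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms\n";
    return 0;
}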
Another metric we can use here is the IPC, or instructions per cycle, which you can get using perf:
$ perf stat ./a.out
time: 503 ms
Performance counter stats for './a.out':
893.16 msec task-clock:u # 1.760 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
139 page-faults:u # 155.627 /sec
1,602,565,788 instructions:u # 0.67 insn per cycle
2,385,620,324 cycles:u # 2.671 GHz
200,447,733 branches:u # 224.424 M/sec
14,482 branch-misses:u # 0.01% of all branches
TopdownL1 # 68.2 % tma_backend_bound
# 11.9 % tma_bad_speculation
# 4.3 % tma_frontend_bound
# 15.6 % tma_retiring
0.507340864 seconds time elapsed
0.892937000 seconds user
0.000000000 seconds sys

We can see 0.67 insn per cycle, hmm ok.
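An IPC that low for a loop that does nothing but increment a counter suggests the two threads are interfering with each other. x and y are adjacent globals, so they almost certainly sit on the same cache line, and each core keeps stealing that line from the other just to bump its own counter. A quick way to check the layout (a sketch, assuming 64-byte cache lines, which is typical on x86) is to look at which line each address falls on:

#include <cstdint>
#include <iostream>

// Returns the index of the (assumed) 64-byte cache line an address falls on.
std::uintptr_t cache_line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / 64;
}

// Usage, with the globals x and y defined earlier:
//   std::cout << cache_line_of(&x) << " vs " << cache_line_of(&y) << "\n";
// If the two numbers match, both counters live on the same cache line.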
Struct instead of long long
Now, let us use this padded struct instead of the long longs we used earlier:

struct PaddedStruct {
    long long value;
    char pad[64 - sizeof(long long)];
};

PaddedStruct pa = {};
PaddedStruct pb = {};

Overload the increment function defined earlier to handle this struct as well.
void increment(PaddedStruct& a) {
    for (int i = 0; i < 100'000'000; i++) {
        a.value++;
    }
}

Now invoke the function on two threads, similar to what we did earlier.
std::thread t1([&](){ increment(pa); });
std::thread t2([&](){ increment(pb); });

This time, the time taken turns out to be roughly half of what we observed earlier. On my machine, the new time was about 300 ms.
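As a sanity check on the layout (a small sketch; the 64-byte line size is an assumption about the machine), you can assert at compile time that the padded struct spans a full cache line:

// Compile-time check that PaddedStruct covers one assumed 64-byte cache line.
static_assert(sizeof(PaddedStruct) == 64, "PaddedStruct should be exactly one cache line");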
Again, we can get the IPC using perf
$ perf stat ./a.out
time: 297 ms
Performance counter stats for './a.out':
594.66 msec task-clock:u # 1.975 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
138 page-faults:u # 232.066 /sec
1,602,565,643 instructions:u # 1.06 insn per cycle
1,508,069,432 cycles:u # 2.536 GHz
200,447,663 branches:u # 337.080 M/sec
14,506 branch-misses:u # 0.01% of all branches
TopdownL1 # 71.2 % tma_backend_bound
# 1.5 % tma_bad_speculation
# 2.8 % tma_frontend_bound
# 24.6 % tma_retiring
0.301146813 seconds time elapsed
0.594276000 seconds user
0.000000000 seconds sys

We can clearly see 1.06 insn per cycle, roughly double what we saw with the long longs.
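The effect we just sidestepped is usually called false sharing: both counters lived on one cache line, so the two cores kept bouncing that line between them even though they never touched the same variable. Note that the manual pad only sizes the struct to 64 bytes; it does not force 64-byte alignment, so an unluckily placed instance could still straddle a line boundary. A common alternative is to align the struct instead. The following is a sketch, assuming C++17 and a compiler that ships std::hardware_destructive_interference_size in <new> (support varies):

#include <new>  // std::hardware_destructive_interference_size (C++17)

// Each counter starts on its own cache line via alignment rather than manual padding.
struct alignas(std::hardware_destructive_interference_size) AlignedCounter {
    long long value = 0;
};

AlignedCounter ca;
AlignedCounter cb;

void increment(AlignedCounter& a) {
    for (int i = 0; i < 100'000'000; i++) {
        a.value++;
    }
}

Running the same two-thread experiment on ca and cb should give the improved timing and IPC without any hand-computed pad size.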