    $ gcc-11 -fopenmp hello.c
    $ ./a.out
    Hello world from 1 (total 4)
    Hello world from 0 (total 4)
    Hello world from 3 (total 4)
    Hello world from 2 (total 4)
The output order is jumbled: that is parallelism at work. If it isn't obvious enough, set the number of threads and run again:
    $ OMP_NUM_THREADS=100 ./a.out
    Hello world from 2 (total 100)
    Hello world from 11 (total 100)
    ...
    Hello world from 35 (total 100)
    ...
    Hello world from 0 (total 100)
    Hello world from 98 (total 100)
        #pragma omp parallel private(num_threads, thread_id)
        {
            thread_id = omp_get_thread_num();
            printf("Hello world from thread %d.\n", thread_id);

            if (thread_id == 0) {
                num_threads = omp_get_num_threads();
                printf("Total number of thread is: %d\n", num_threads);
            }
        }

        // The private copies are not written back, so after the region
        // the original variables hold indeterminate values:
        // printf("End of parallel: %d, %d\n", thread_id, num_threads);
        // End of parallel: 1, 61694048

        return 0;
    }
Compile and run:
    $ gcc-11 -fopenmp hello-v2.c; ./a.out
    Hello world from thread 0.
    Total number of thread is: 4
    Hello world from thread 3.
    Hello world from thread 2.
    Hello world from thread 1.
        #pragma omp parallel sections shared(x, sum, avg, sum2)
        {
            // section 0: minimum and maximum
            {
                // seed both with the first element
                double max = x[0], min = x[0];
                for (int i = 0; i < N; i++) {
                    if (x[i] < min) min = x[i];
                    if (x[i] > max) max = x[i];
                }
                printf("min: %f\nmax: %f\n", min, max);
            }

            #pragma omp section
            // section 1: sum and mean
            {
                for (int i = 0; i < N; i++) {
                    sum += x[i];
                }
                printf("sum: %f\n", sum);

                avg = sum / N;
                printf("avg: %f\n", avg);
            }

            #pragma omp section
            // section 2: sum of squares (for the mean of squares)
            {
                for (int i = 0; i < N; i++) {
                    sum2 += x[i] * x[i];
                }
                printf("sum2: %f\n", sum2);
            }
        }

        // variance = mean of squares - square of the mean
        double var = sum2 / N - avg * avg;
        printf("var: %f\n", var);

        return 0;
    }
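Note that sum and the other quantities are shared variables declared outside the parallel region, so their values survive after the region ends and can be used to compute the variance.
Compile and run. Compared with the version that has parallel removed, there seems to be some speedup (about 11.2 s versus 24.6 s of wall-clock time in the runs below):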
    $ gcc-11 -fopenmp sections.c; time ./a.out
    min: 0.000000
    max: 536870911.000000
    sum2: 51580834826121141939077120.000000
    sum: 144115187606093856.000000
    avg: 268435455.125000
    var: 24019198213991840.000000
    ./a.out  7.34s user 4.00s system 101% cpu 11.200 total

    $ gcc-11 -fopenmp no-sections.c; time ./a.out
    min: 0.000000
    max: 536870911.000000
    sum: 144115187606093856.000000
    avg: 268435455.125000
    sum2: 51580834826121141939077120.000000
    var: 24019198213991840.000000
    ./a.out  8.41s user 9.52s system 72% cpu 24.587 total
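OpenMP synchronization
Shared memory:
OpenMP's concurrent threads share global data
Values are exchanged without the send/recv message passing that concurrent processes would need
Synchronization mechanisms:
Coordinate the execution of the parallel threads in a parallel program
Control ordering: avoid races, which lead to conflicts
Implicit: the join barrier
Explicit: critical, master, barrier, single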
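The critical directive
critical is the critical-section synchronization directive: parallel threads access a shared variable under mutual exclusion, one thread at a time.

    #pragma omp critical
    { ... }

For example, let's try building a parallel counter: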
    #include <omp.h>
    #include <stdio.h>

    int main() {
        int n = 0;

        #pragma omp parallel for shared(n)
        for (int i = 0; i < 40000; i++) {
            #pragma omp critical
            n = n + 1;
        }

        printf("n: %d\n", n);
        return 0;
    }
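Compile and run. With critical the count was correct on every one of many runs; the version without critical loses some increments: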
    $ gcc-11 -fopenmp critical.c; time OMP_NUM_THREADS=1000 ./a.out
    n: 40000
    OMP_NUM_THREADS=1000 ./a.out  0.01s user 0.05s system 16% cpu 0.390 total

    $ gcc-11 -fopenmp no-critical.c; time OMP_NUM_THREADS=1000 ./a.out
    n: 39960
    OMP_NUM_THREADS=1000 ./a.out  0.01s user 0.06s system 63% cpu 0.104 total
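The master directive
master directive: only the master thread executes the block; other threads skip it when they reach it.
Master thread: executes the block
Other threads: carry straight on, without waiting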
    #pragma omp master
    { ... }
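The barrier directive
barrier directive: synchronizes all concurrent threads:
A thread that reaches the barrier stops and waits;
It may continue only once all threads have reached the barrier.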
#pragma omp barrier
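The single directive
single directive: a relaxed version of master, plus an implicit barrier:
An implicit barrier is placed after the code block ({ ... });
Any one thread, call it Foo, may execute the block;
The other threads skip the block but block at the barrier, and are released only after Foo has finished the block.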
    #pragma omp single
    { ... }
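The reduction clause
Reduction: combining a large number of values to produce a single result value.
Reduction: the action or fact of making a specified thing smaller or less in amount, degree, or size. (New Oxford American Dictionary)
Here, reduction simply means an operation that makes the number of values smaller. (Thinking back to Lisp makes this quite vivid.)
In OpenMP, a reduction is expressed with the reduction clause, attached to a directive such as parallel or for: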
    double result;

    #pragma omp parallel reduction(op : result)
    {
        result = ...;  // each thread's local copy of result
    }