c - Why does using OpenMP make the code slower to run? -


i have following code uses omp parallelize monte carlo method. question why serial version of code (monte_carlo_serial) run lot faster parallel version (monte_carlo_parallel). running code on machine 32 cores , following result printed console:

-bash-4.1$ gcc -fopenmp hello.c ;
-bash-4.1$ ./a.out
pi (serial): 3.140856
time taken 0 seconds 50 milliseconds
pi (parallel): 3.132103
time taken 127 seconds 990 milliseconds

#include <stdio.h> #include <stdlib.h> #include <math.h> #include <omp.h> #include <time.h>  int niter = 1000000;            //number of iterations per loop  int monte_carlo_parallel() {   double x,y;                     //x,y value random coordinate   int i;                          //loop counter   int count=0;                //count holds number of how many coordinates   double z;                       //used check if x^2+y^2<=1   double pi;                      //holds approx value of pi   int numthreads = 32;  #pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)   {     srand48((int)time(null) ^ omp_get_thread_num());    //give random() seed value     (i=0; i<niter; ++i)                 //main loop       {         x = (double)drand48();              //gets random x coordinate         y = (double)drand48();              //gets random y coordinate         z = ((x*x)+(y*y));              //checks see if number inside unit circle         if (z<=1)           {             ++count;                //if is, consider valid random point           }       }   }    pi = ((double)count/(double)(niter*numthreads))*4.0;   printf("pi (parallel): %f\n", pi);   return 0; }  int monte_carlo_serial(){   double x,y;                     //x,y value random coordinate   int i;                          //loop counter   int count=0;                //count holds number of how many coordinates   double z;                       //used check if x^2+y^2<=1   double pi;                      //holds approx value of pi    srand48((int)time(null) ^ omp_get_thread_num());  //give random() seed value    (i=0; i<niter; ++i)                   //main loop     {       x = (double)drand48();                //gets random x coordinate       y = (double)drand48();                //gets random y coordinate       z = ((x*x)+(y*y));                //checks see if number inside unit circle       if (z<=1)         {           ++count;              //if is, consider valid random point         }     }    pi = ((double)count/(double)(niter))*4.0;   printf("pi (serial): %f\n", pi);    return 0; }   void main(){   clock_t start = clock(), diff;    monte_carlo_serial();    diff = clock() - start;   int msec = diff * 1000 / clocks_per_sec;   printf("time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);      start = clock(), diff;    monte_carlo_parallel();    diff = clock() - start;   msec = diff * 1000 / clocks_per_sec;   printf("time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);  } 

the variable

count  

is shared across of spawned threads. each of them has lock count increment it. in addition if threads running on separate cpu's (and there's no possible win if they're not) have cost of sending value of count 1 core , again.

this textbook example of false sharing. accessing count in serial version in register , cost 1 cycle increment. in parallel version not in cache, have tell other cores invalidate cache line, fetch (l3 going take 66 cycles @ best) increment it, , store back. every time count migrates 1 cpu core have minimum ~125 cycle cost lot worse 1. threads never able run in parallel because depend on count.

try modify code each thread has own count, sum values of count threads @ end , /might/ see speedup.


Comments

Popular posts from this blog

resizing Telegram inline keyboard -

command line - How can a Python program background itself? -

php - "cURL error 28: Resolving timed out" on Wordpress on Azure App Service on Linux -