Python - cython.parallel: cannot see any difference in speed


I tried to use cython.parallel prange, but I can only see 2 cores at 50% being used. How can I make use of all the cores, i.e. send the loops to the cores simultaneously while sharing the arrays volume and mc_vol?

Edit: I also timed the purely sequential for-loop, which is about 30 seconds faster than the cython.parallel prange version. Both of them use only 1 core. Is there a way to parallelize this?

    cimport cython
    from cython.parallel import prange, parallel, threadid
    from libc.stdio cimport sprintf
    from libc.stdlib cimport malloc, free
    cimport numpy as np

    @cython.boundscheck(False)
    @cython.wraparound(False)
    cpdef mc_surface(np.ndarray[np.int_t, ndim=3] volume, np.ndarray[np.float32_t, ndim=3] mc_vol):
        cdef int vol_len = len(volume) - 1
        cdef int k, j, i
        cdef char* pattern  # string pointer - allocate later

        perm_area = {
            "00000000": 0.000000,
            ...
            "00011101": 1.515500
        }

        try:
            pattern = <char*>malloc(sizeof(char) * 260)
            for k in range(vol_len):
                for j in range(vol_len):
                    for i in range(vol_len):
                        sprintf(pattern, "%i%i%i%i%i%i%i%i",
                                volume[i, j, k],
                                volume[i, j + 1, k],
                                volume[i + 1, j, k],
                                volume[i + 1, j + 1, k],
                                volume[i, j, k + 1],
                                volume[i, j + 1, k + 1],
                                volume[i + 1, j, k + 1],
                                volume[i + 1, j + 1, k + 1]);

                        mc_vol[i, j, k] = perm_area[pattern]
                        # if perm_area[pattern] > 0:
                        #    print pattern, 'area: ', perm_area[pattern]
                        # total_area += perm_area[pattern]
        finally:
            free(pattern)
        return mc_vol

Edit: following DavidW's suggestion, but prange is considerably slower:

    cpdef mc_surface(np.ndarray[np.int_t, ndim=3] volume, np.ndarray[np.float32_t, ndim=3] mc_vol):
        cdef int vol_len = len(volume) - 1
        cdef int k, j, i
        cdef char* pattern  # string pointer - allocate later

        perm_area = {
            "00000000": 0.000000,
            ...
            "00011101": 1.515500
        }

        with nogil, parallel():
            try:
                pattern = <char*>malloc(sizeof(char) * 260)
                for k in prange(vol_len):
                    for j in range(vol_len):
                        for i in range(vol_len):
                            sprintf(pattern, "%i%i%i%i%i%i%i%i",
                                    volume[i, j, k],
                                    volume[i, j + 1, k],
                                    volume[i + 1, j, k],
                                    volume[i + 1, j + 1, k],
                                    volume[i, j, k + 1],
                                    volume[i, j + 1, k + 1],
                                    volume[i + 1, j, k + 1],
                                    volume[i + 1, j + 1, k + 1]);
                            with gil:
                                mc_vol[i, j, k] = perm_area[pattern]
                                # if perm_area[pattern] > 0:
                                #    print pattern, 'area: ', perm_area[pattern]
                                #    total_area += perm_area[pattern]
            finally:
                free(pattern)

        return mc_vol

My setup file looks like:

    setup(
        name='surfacearea',
        ext_modules=[
            Extension('c_marchsurf', ['c_marchsurf.pyx'],
                      include_dirs=[numpy.get_include()],
                      extra_compile_args=['-fopenmp'],
                      extra_link_args=['-fopenmp'],
                      language="c++")
        ],
        cmdclass={'build_ext': build_ext},
        requires=['cython', 'numpy', 'matplotlib', 'pathos', 'scipy', 'cython.parallel']
    )

The problem is the `with gil:`, which defines a block that can only run on 1 core at once. You aren't doing anything else inside the loop, so you shouldn't really expect any speed-up.
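As a rough analogy only (plain Python threads and an ordinary lock, not Cython or OpenMP), the sketch below shows why a loop body that sits entirely inside one exclusive block cannot overlap across threads. The names `max_concurrency`, `big_lock` and `counter_lock` are hypothetical and chosen for illustration; `big_lock` merely plays the role that the GIL plays in a `with gil:` block.

```python
import threading
import time


def max_concurrency(n_threads=4, iterations=5):
    """Return the largest number of threads seen inside the locked block at once."""
    big_lock = threading.Lock()      # hypothetical stand-in for the GIL
    counter_lock = threading.Lock()  # protects the bookkeeping below
    state = {"inside": 0, "peak": 0}

    def worker():
        for _ in range(iterations):
            with big_lock:  # plays the role of `with gil:` around the loop body
                with counter_lock:
                    state["inside"] += 1
                    state["peak"] = max(state["peak"], state["inside"])
                time.sleep(0.001)  # stand-in for the dictionary lookup
                with counter_lock:
                    state["inside"] -= 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


if __name__ == "__main__":
    print(max_concurrency())
```

However many threads are started, the peak never exceeds 1, because all of the work happens while the lock is held. The threads take turns instead of running side by side, and the locking itself adds overhead, which is consistent with the prange version being slower than the sequential one.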

In order to avoid using the GIL you need to avoid using Python features wherever possible. You already avoid it in the string formatting part by using C `sprintf` to create the string. For the dictionary lookup part, the easiest thing is probably to use the C++ standard library, which contains a `map` class with similar behaviour. (Note that you'll need to compile it in Cython's C++ mode.)

    # at the top of your file
    from libc.stdio cimport sprintf
    from libc.stdlib cimport malloc, free
    from libcpp.map cimport map
    from libcpp.string cimport string
    import numpy as np
    cimport numpy as np

    # ... code omitted ....
    cpdef mc_surface(np.ndarray[np.int_t, ndim=3] volume, np.ndarray[np.float32_t, ndim=3] mc_vol):
        # note that above I've defined volume as a numpy array so that
        # I can do fast, GIL-less direct array lookup
        cdef char* pattern  # a string pointer - allocate it later

        perm_area = {}  # some dictionary, as before

        # depending on the size of perm_area, this conversion to
        # a C++ object is potentially quite slow (it involves a lot
        # of string copies)
        cdef map[string, float] perm_area_m = perm_area

        # ... more code omitted ...
        with nogil, parallel():
            try:
                # assigning pattern here makes it thread local
                # it's assigned once per thread, which isn't too bad
                pattern = <char*>malloc(sizeof(char) * 50)
                # when you allocate pattern you need to make it big enough,
                # either by calculating the size, or by just making it overly big

                # ... more code omitted...

                    # later, inside your loops
                    sprintf(pattern, "%i%i%i%i%i%i%i%i",
                            volume[i, j, k],
                            volume[i, j + 1, k],
                            volume[i + 1, j, k],
                            volume[i + 1, j + 1, k],
                            volume[i, j, k + 1],
                            volume[i, j + 1, k + 1],
                            volume[i + 1, j, k + 1],
                            volume[i + 1, j + 1, k + 1]);
                    # and now do the dictionary lookup without the GIL
                    # because we're using the C++ class instead.
                    # Unfortunately, we also need a string copy (which might slow things down)
                    mc_vol[i, j, k] = perm_area_m[string(pattern)]
                    # be aware that, unlike a Python dict, map's operator[] does not
                    # raise on a missing key: it inserts a default 0.0 entry, which
                    # is also not safe to do from several threads at once.
                    # Use find() if a pattern may be missing.
            finally:
                free(pattern)

I've had to change volume to being a numpy array, since if it were a Python object I'd need the GIL to index its elements.

(Edit: changed to take the dictionary lookup out of the GIL block too, by using a C++ map.)
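A different technique from the map approach above, offered only as a sketch: if every voxel in volume is 0 or 1 (an assumption, suggested by keys such as "00011101"), the eight corner values can be packed into a single integer from 0 to 255 and used as an index into a flat table. That removes both the `sprintf` call and the string copy. The helpers `pattern_index` and `build_table` are hypothetical names, and only the two `perm_area` entries visible in the question are filled in. The rest are elided there and stay unset here.

```python
import math


def pattern_index(corners):
    """Pack eight 0/1 corner values into one integer, first corner = highest bit."""
    index = 0
    for bit in corners:
        index = (index << 1) | (1 if bit else 0)
    return index


def build_table(perm_area):
    """Turn a {"00011101": area} dict into a 256-entry list indexed by pattern_index."""
    table = [math.nan] * 256  # NaN marks patterns missing from perm_area
    for key, area in perm_area.items():
        table[int(key, 2)] = area  # "00011101" read as binary is 29
    return table


# Only the two entries shown in the question; the others are elided there.
perm_area = {"00000000": 0.000000, "00011101": 1.515500}
table = build_table(perm_area)

if __name__ == "__main__":
    # Same corner order as the sprintf arguments in the question.
    print(table[pattern_index((0, 0, 0, 1, 1, 1, 0, 1))])
```

In Cython the same idea would presumably become a C array or typed memoryview of floats, filled once before the `nogil` block, with the index computed from shifts inside the `prange` loop. That part is untested here.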

