

【CUDA】Prefix Sums with thrust

Following up on the previous article: when the prefix-sum scan is done through the API that CUDA provides, the first run comes out slower than the shared-memory version, and my guess was that global memory is involved.
First, look at the call site:

thrust::inclusive_scan(thrust::device, d_x, d_x + N, d_x);
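
To make that call concrete, here is a minimal self-contained version of it (the allocation of d_x and the value of N are my own scaffolding, not from the original timing code):

#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  const int N = 8;
  int h_x[N] = {1, 1, 1, 1, 1, 1, 1, 1};

  // Raw device buffer, matching the d_x in the call above (assumed setup).
  int* d_x = nullptr;
  cudaMalloc(&d_x, N * sizeof(int));
  cudaMemcpy(d_x, h_x, N * sizeof(int), cudaMemcpyHostToDevice);

  // In-place inclusive prefix sum executed on the device.
  thrust::inclusive_scan(thrust::device, d_x, d_x + N, d_x);

  cudaMemcpy(h_x, d_x, N * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < N; ++i) {
    printf("%d ", h_x[i]);  // prints: 1 2 3 4 5 6 7 8
  }
  printf("\n");
  cudaFree(d_x);
  return 0;
}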

The first argument to inclusive_scan specifies the execution policy (here, the device). Based on the number and types of the arguments, overload resolution selects the following function in scan.h:

template <typename DerivedPolicy, typename InputIterator, typename OutputIterator>
_CCCL_HOST_DEVICE OutputIterator inclusive_scan(
  const thrust::detail::execution_policy_base<DerivedPolicy>& exec,
  InputIterator first,
  InputIterator last,
  OutputIterator result);

Its CUDA implementation lives in

thrust\thrust\system\cuda\detail\scan.h

Note: this path may not match your installation exactly; under /usr/local/ you can use

find . -name xx

to locate the corresponding file.
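
For example, to find where this scan header lives (the concrete filename here is just an illustration):

cd /usr/local
find . -name scan.h 2>/dev/null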

template <typename Derived, typename InputIt, typename OutputIt>
_CCCL_HOST_DEVICE OutputIt inclusive_scan(
  thrust::cuda_cub::execution_policy<Derived>& policy, InputIt first, InputIt last, OutputIt result)
{
  return thrust::cuda_cub::inclusive_scan(policy, first, last, result, thrust::plus<>{});
}

This pins the scan operator to thrust::plus, and then it calls this function in the same file:

template <typename Derived, typename InputIt, typename OutputIt, typename ScanOp>
_CCCL_HOST_DEVICE OutputIt inclusive_scan(
  thrust::cuda_cub::execution_policy<Derived>& policy, InputIt first, InputIt last, OutputIt result, ScanOp scan_op)
{
  using diff_t           = typename thrust::iterator_traits<InputIt>::difference_type;
  diff_t const num_items = thrust::distance(first, last);
  return thrust::cuda_cub::inclusive_scan_n(policy, first, num_items, result, scan_op);
}
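
As a quick sanity check on semantics: an inclusive scan with plus maps {1, 2, 3, 4} to {1, 3, 6, 10}, and the scan_op parameter above means any associative operator can be substituted. A small sketch (the choice of thrust::maximum is mine, purely for illustration):

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/scan.h>
#include <cstdio>

int main() {
  int data[] = {1, 2, 3, 4};

  // Default operator (plus): running sum -> {1, 3, 6, 10}.
  thrust::device_vector<int> sum(data, data + 4);
  thrust::inclusive_scan(sum.begin(), sum.end(), sum.begin());

  // Explicit operator: running maximum -> {1, 2, 3, 4}.
  thrust::device_vector<int> mx(data, data + 4);
  thrust::inclusive_scan(mx.begin(), mx.end(), mx.begin(), thrust::maximum<int>{});

  printf("%d %d\n", (int)sum[3], (int)mx[3]);  // prints: 10 4
  return 0;
}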

Tracing the chain further, we finally reach the main execution logic:

_CCCL_EXEC_CHECK_DISABLE
template <typename Derived, typename InputIt, typename Size, typename OutputIt, typename ScanOp>
_CCCL_HOST_DEVICE OutputIt inclusive_scan_n_impl(
  thrust::cuda_cub::execution_policy<Derived>& policy, InputIt first, Size num_items, OutputIt result, ScanOp scan_op)
{
  using AccumT     = typename thrust::iterator_traits<InputIt>::value_type;
  using Dispatch32 = cub::DispatchScan<InputIt, OutputIt, ScanOp, cub::NullType, std::int32_t, AccumT>;
  using Dispatch64 = cub::DispatchScan<InputIt, OutputIt, ScanOp, cub::NullType, std::int64_t, AccumT>;

  cudaStream_t stream = thrust::cuda_cub::stream(policy);
  cudaError_t status;

  // Determine temporary storage requirements:
  size_t tmp_size = 0;
  {
    THRUST_INDEX_TYPE_DISPATCH2(
      status, Dispatch32::Dispatch, Dispatch64::Dispatch, num_items,
      (nullptr, tmp_size, first, result, scan_op, cub::NullType{}, num_items_fixed, stream));
    thrust::cuda_cub::throw_on_error(
      status, "after determining tmp storage requirements for inclusive_scan");
  }

  // Run scan:
  {
    // Allocate temporary storage:
    thrust::detail::temporary_array<std::uint8_t, Derived> tmp{policy, tmp_size};
    THRUST_INDEX_TYPE_DISPATCH2(
      status, Dispatch32::Dispatch, Dispatch64::Dispatch, num_items,
      (tmp.data().get(), tmp_size, first, result, scan_op, cub::NullType{}, num_items_fixed, stream));
    thrust::cuda_cub::throw_on_error(status, "after dispatching inclusive_scan kernel");
    thrust::cuda_cub::throw_on_error(
      thrust::cuda_cub::synchronize_optional(policy), "inclusive_scan failed to synchronize");
  }
  return result + num_items;
}
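
The structure above is CUB's standard two-phase dispatch: the first Dispatch call passes nullptr for the buffer and only computes tmp_size; the temporary_array is then allocated in global memory and the second call runs the real scan. The same pattern is visible when calling CUB directly through cub::DeviceScan::InclusiveSum, which is likewise backed by DispatchScan (a minimal sketch; allocation of d_in/d_out is omitted):

#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>

// d_in and d_out are device buffers holding num_items ints.
void cub_inclusive_sum(const int* d_in, int* d_out, int num_items) {
  void*  d_temp     = nullptr;
  size_t temp_bytes = 0;

  // Phase 1: with d_temp == nullptr, this call only reports the scratch size.
  cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, num_items);

  // Phase 2: allocate the scratch space in global memory, then scan for real.
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, num_items);
  cudaFree(d_temp);
}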

As you can see, Thrust ultimately hands the work to CUB's DispatchScan, and CUB works out of global memory (note the temporary storage allocated above), which is why it ended up less efficient than the hand-written algorithm that uses shared memory.
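
For reference, the kind of hand-written kernel being compared against looks roughly like this: a single-block Hillis-Steele inclusive scan that keeps all intermediate sums in shared memory. This sketch is my own generic version, only valid for arrays that fit in one block; the previous article's kernel may differ:

// Single-block inclusive scan in shared memory (Hillis-Steele).
// Assumes n <= blockDim.x; launch with n threads and n * sizeof(int)
// bytes of dynamic shared memory, e.g.:
//   scan_shared<<<1, n, n * sizeof(int)>>>(d_in, d_out, n);
__global__ void scan_shared(const int* in, int* out, int n) {
  extern __shared__ int s[];
  int tid = threadIdx.x;
  if (tid < n) s[tid] = in[tid];
  __syncthreads();

  for (int offset = 1; offset < n; offset <<= 1) {
    int v = 0;
    if (tid >= offset && tid < n) v = s[tid - offset];  // read before any write
    __syncthreads();
    if (tid >= offset && tid < n) s[tid] += v;
    __syncthreads();
  }
  if (tid < n) out[tid] = s[tid];
}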

Tags: CUDA

