NCCL拓扑管理 - Search模块

NCCL Search模块从节点连接和节点路径构造完毕的系统拓扑中搜索出各类逻辑拓扑（如环拓扑，树拓扑等）的通道（每个拓扑可能有多个通道，通道间可以并行运行集合通信算法）和节点（GPU节点和NET节点）。后续的connect模块将这些搜索出来的通道和节点根据拓扑的类型构建起来。

数据结构

ncclTopoGraph是search模块的主要结构体，它包含搜索的输入超参，比如id，pattern等，也包括了搜索结果的输出，比如通道数nChannels，每个通道的节点（非网络相连的节点信息保存在intra数组里，网络相连的节点保存在inter数组里）。
其中pattern包含以下5种。

#defineNCCL_TOPO_PATTERN_BALANCED_TREE1// Spread NIC traffic between two GPUs (Tree parent + one child on first GPU, second child on second GPU)#defineNCCL_TOPO_PATTERN_SPLIT_TREE2// Spread NIC traffic between two GPUs (Tree parent on first GPU, tree children on the second GPU)#defineNCCL_TOPO_PATTERN_TREE3// All NIC traffic going to/from the same GPU#defineNCCL_TOPO_PATTERN_RING4// Ring#defineNCCL_TOPO_PATTERN_NVLS5// NVLS+SHARP and NVLS+Tree

ncclTopoGraph的成员介绍参见注释。

structncclTopoGraph{// Input / output of the searchint id;// ring : 0, tree : 1, collnet : 2int pattern;// topo patternint crossNic;// if inbound Net and outbound Net use different NICs or notint collNet;// is CollNet or notint minChannels;// minChannel for searchint maxChannels;// maxChannel for search// Output of the searchint nChannels;// number of channels searched, can run in parallelfloat bwIntra;// intra-connect bandwidthfloat bwInter;// inter-connect bandwidthfloat latencyInter;// inter-connect latencyint typeIntra;// intra-connect typeint typeInter;// inter-connect typeint sameChannels;// 1 if all channels are the same type and bandwidth, 0 otherwiseint nHops;// Number of hops in the graphint intra[MAXCHANNELS*NCCL_TOPO_MAX_NODES];// intra[channel_id][node_id] int64_t inter[MAXCHANNELS*2];// inter[channel_id][2]，one for inbound and one for outbound};

接口

ncclTopoSearchInit

ncclTopoSearchInit接口负责搜索前完成系统拓扑maxBw和totalBw的初始化，这两个数据也作为搜索的参考值。

// Setup system->maxBw and system->totalBw// maxBw: maximum bandwidth between GPU and GPU or GPU and NET if cross NIC is required// totalBw: maximum bandwidth a GPU node can have(NVLink or PCI)
ncclResult_t ncclTopoSearchInit(structncclTopoSystem* system){
  system->maxBw =0.0;
  system->totalBw =0.0;int inter = system->nodes[NET].count;// If we have only one GPU, we can use local bandwidthif(inter ==0&& system->nodes[GPU].count ==1){
    system->maxBw = LOC_BW;
    system->totalBw = LOC_BW;return ncclSuccess;}for(int g=0; g<system->nodes[GPU].count; g++){structncclTopoNode* gpu = system->nodes[GPU].nodes+g;// Find the maximum bandwidth between GPU and GPU or GPU and NET
    system->maxBw = std::max(system->maxBw,getMaxBw(system, gpu, inter ? NET : GPU));
    system->totalBw = std::max(system->totalBw,getTotalBw(system, gpu));}return ncclSuccess;}

ncclTopoCompute

ncclTopoCompute接口接收连接和路径构造完毕的ncclTopoSystem，搜索参数填写在ncclTopoGraph结构体中，将根据ncclTopoGraph中的预置信息展开搜索。
搜索速度变化的数值列举在以下数组中，数组下标越小速度越快。针对sm90有独立的速度数组。

float speedArrayIntra[]={40.0,30.0,20.0,18.0,15.0,12.0,10.0,9.0,7.0,6.0,5.0,4.0,3.0};float speedArrayInter[]={48.0,30.0,28.0,24.0,20.0,18.0,15.0,12.0,10.0,9.0,7.0,6.0,5.0,4.0,3.0,2.4,1.2,0.24,0.12};float sm90SpeedArrayIntra[]={60.0,50.0,40.0,30.0,24.0,20.0,15.0,12.0,11.0,6.0,3.0};float sm90SpeedArrayInter[]={48.0,45.0,42.0,40.0,30.0,24.0,22.0,20.0,17.5,15.0,12.0,6.0,3.0,2.4,1.2,0.24,0.12};

搜索算法使用贪心法和回溯法，整体而言一开始根据ncclTopoSystem中的maxBw和totalBw从速度数组里挑选一个最接近的数值作为搜索的开始，如果以此条件能搜索出结果则尝试再提升速度进行搜索，否则逐渐放松搜索要求进行搜索。可以发现代码中有大量的goto search语句，该语句前即是各类不同的搜索条件调整(搜索不到则降低条件，搜索到则提高条件)。

ncclResult_t ncclTopoCompute(ncclTopoSystem* system,structncclTopoGraph* graph){int ngpus = system->nodes[GPU].count;// crossNic require more than 1 NET,// pattern need to be NCCL_TOPO_PATTERN_RING || NCCL_TOPO_PATTERN_BALANCED_TREE || NCCL_TOPO_PATTERN_SPLIT_TREE,// could be disabled using ncclParamCrossNicint crossNic =(system->nodes[NET].count >1)&&(graph->pattern == NCCL_TOPO_PATTERN_RING ||
          graph->pattern == NCCL_TOPO_PATTERN_BALANCED_TREE ||
          graph->pattern == NCCL_TOPO_PATTERN_SPLIT_TREE)?ncclParamCrossNic():0;
  
  graph->crossNic = crossNic ==1?1:0;
  graph->bwIntra = graph->bwInter =0;
  graph->latencyInter =0;// init with fastest intra connection type, will slow it down when searching
  graph->typeIntra = ngpus ==1? PATH_LOC : PATH_NVL;// init with fastest inter connection type, will slow it down when searching
  graph->typeInter = PATH_PIX;
  
  graph->nChannels =0;// Only try Heterogeneous channels in when connected wiht NVLS maybe because NVLS is configableint trySameChannels = graph->pattern == NCCL_TOPO_PATTERN_NVLS ?0:1;
  graph->sameChannels = trySameChannels;int cpuArch, cpuVendor, cpuModel;NCCLCHECK(ncclTopoCpuType(system,&cpuArch,&cpuVendor,&cpuModel));// First try to load a graph from an XML file, search result can be cached in an XML fileconstchar* str =ncclGetEnv("NCCL_GRAPH_FILE");if(str){INFO(NCCL_ENV,"NCCL_GRAPH_FILE set by environment to %s", str);structncclXml* xml;NCCLCHECK(xmlAlloc(&xml, NCCL_GRAPH_XML_MAX_NODES));NCCLCHECK(ncclTopoGetXmlGraphFromFile(str, xml));int nChannels;NCCLCHECK(ncclTopoGetGraphFromXml(xml->nodes, system, graph,&nChannels));INFO(NCCL_GRAPH,"Search %d : %d channels loaded from XML graph", graph->id, nChannels);free(xml);if(graph->nChannels >0)return ncclSuccess;}int ccMin;NCCLCHECK(ncclTopoGetCompCap(system,&ccMin,NULL));if(graph->pattern == NCCL_TOPO_PATTERN_NVLS &&(system->nodes[NVS].count ==0|| ccMin <90))return ncclSuccess;// NVLS search must have ngpus heads at most.// Think about fully connected case, each GPU has nGPU links and each channel use 1 link of the GPUif(graph->pattern == NCCL_TOPO_PATTERN_NVLS) graph->maxChannels = system->nodes[GPU].count;// No intra nodes, use NCCL_TOPO_PATTERN_TREEif(ngpus ==1)if(graph->pattern != NCCL_TOPO_PATTERN_RING) graph->pattern = NCCL_TOPO_PATTERN_TREE;// force minChannels == maxChannels to force maximum channels searching to ensure maximum pallalismif(system->nodes[NET].count ==0&& graph->pattern == NCCL_TOPO_PATTERN_NVLS){// Force intra-node NVLS algorithm to pull evenly from all GPUs.
    graph->minChannels = graph->maxChannels = system->nodes[GPU].count;}// save graph data for traceback algorithm to recovery laterstructncclTopoGraph tmpGraph;memcpy(&tmpGraph, graph,sizeof(structncclTopoGraph));// First try crossnic, then decrease bw and finally increase bwIntra.// Initialize bwIntra, bwInter using speedArray, choose the maximum speed close to system maxBw and totalBw.// Initialize gobalTimeout, each search will consume time based on different pattern.int nspeeds =0;float* speedArray =NULL;if(system->nodes[NET].count ==0){
    nspeeds = ccMin >=90? NSPEEDSINTRA_SM90 : NSPEEDSINTRA;
    speedArray = ccMin >=90? sm90SpeedArrayIntra : speedArrayIntra;}else{
    nspeeds = ccMin >=90? NSPEEDSINTER_SM90 : NSPEEDSINTER;
    speedArray = ccMin >=90? sm90SpeedArrayInter : speedArrayInter;}int pass =1;int speedIndex =0;float maxBw = system->maxBw;float totalBw = system->totalBw;if(ngpus >1&& graph->pattern != NCCL_TOPO_PATTERN_RING) totalBw *= ngpus*1.0/(ngpus-1);while((speedArray[speedIndex]> maxBw || speedArray[speedIndex]*graph->minChannels > totalBw)&& speedIndex < nspeeds-1) speedIndex++;
  tmpGraph.bwIntra = tmpGraph.bwInter = speedArray[speedIndex];int64_t globalTimeout = NCCL_SEARCH_GLOBAL_TIMEOUT;

search:int time = tmpGraph.sameChannels ? NCCL_SEARCH_TIMEOUT_SAMECHANNELS :
    tmpGraph.pattern == NCCL_TOPO_PATTERN_TREE ? NCCL_SEARCH_TIMEOUT_TREE : NCCL_SEARCH_TIMEOUT;
  tmpGraph.nChannels =0;
  globalTimeout -= time;NCCLCHECK(ncclTopoSearchRec(system,&tmpGraph, graph,&time));// Optimal solution, stop hereif(time ==-1)goto done;if(graph->nChannels*graph->bwInter >= system->totalBw)goto done;if(pass ==1){// First pass, we don't have a solution yet ; try other options// Try having different channels (except when going through AMD CPUs)if(tmpGraph.sameChannels ==1&&!(cpuArch == NCCL_TOPO_CPU_ARCH_X86 && cpuVendor == NCCL_TOPO_CPU_VENDOR_AMD && tmpGraph.typeIntra == PATH_SYS)]{
      tmpGraph.sameChannels =0;goto search;}
    tmpGraph.sameChannels = trySameChannels;if(time !=-1) globalTimeout += time;// No result, traceback and search againelse globalTimeout = NCCL_SEARCH_GLOBAL_TIMEOUT;// hasif(globalTimeout <0&& graph->nChannels)goto done;// Try a simpler treeif(ccMin >=90&& tmpGraph.pattern == NCCL_TOPO_PATTERN_BALANCED_TREE){
      tmpGraph.pattern = NCCL_TOPO_PATTERN_TREE;goto search;}
    tmpGraph.pattern = graph->pattern;int maxTypeIntra = system->nodes[NET].count >0? tmpGraph.typeInter : PATH_SYS;if(tmpGraph.typeIntra < maxTypeIntra &&(graph->nChannels ==0|| tmpGraph.typeIntra < graph->typeIntra)){
      tmpGraph.typeIntra +=1;goto search;}
    tmpGraph.typeIntra = ngpus ==1? PATH_LOC : PATH_NVL;if(system->nodes[NET].count >0&& tmpGraph.typeInter < PATH_SYS &&(graph->nChannels ==0|| tmpGraph.typeInter < graph->typeInter || tmpGraph.typeInter < PATH_PXN)){
      tmpGraph.typeInter +=1;goto search;}
    tmpGraph.typeInter = PATH_PIX;if(crossNic ==2&& tmpGraph.crossNic ==0){// Try again with crossNic if permitted
      tmpGraph.crossNic =1;goto search;}
    tmpGraph.crossNic = crossNic ==1?1:0;// Decrease bw until we find a solutionif((speedIndex < nspeeds-1)&&(graph->nChannels ==0||(speedArray[speedIndex+1]/graph->bwInter >.49))){
      tmpGraph.bwInter = tmpGraph.bwIntra = speedArray[++speedIndex];goto search;}
    speedIndex =0;while(speedArray[speedIndex]> maxBw && speedIndex < nspeeds-1) speedIndex++;
    tmpGraph.bwIntra = tmpGraph.bwInter = speedArray[speedIndex];}

done:// We have a solution. Start from that solution and move to pass 2.if(pass ==1){
    time =-1;NCCLCHECK(ncclTopoDupChannels(graph, ccMin, ngpus));memcpy(&tmpGraph, graph,sizeof(tmpGraph));
    speedIndex =0;while(speedArray[speedIndex]> graph->bwInter && speedIndex < nspeeds-1) speedIndex++;
    tmpGraph.bwIntra = tmpGraph.bwInter = speedArray[speedIndex];
    tmpGraph.minChannels = graph->nChannels;
    pass =2;}if(pass ==2){// See if we can increase bwif(time !=0&& speedIndex >0){if(graph->pattern == NCCL_TOPO_PATTERN_RING){// increase bw for Ring
        tmpGraph.bwIntra = tmpGraph.bwInter = speedArray[--speedIndex];goto search;}elseif(graph->pattern == NCCL_TOPO_PATTERN_NVLS && tmpGraph.bwInter == graph->bwInter && tmpGraph.bwInter < tmpGraph.bwIntra*2){
        tmpGraph.minChannels = tmpGraph.maxChannels = graph->nChannels;
        tmpGraph.bwInter = speedArray[--speedIndex];goto search;}elseif(tmpGraph.bwIntra == graph->bwIntra && tmpGraph.bwIntra < tmpGraph.bwInter*2){// increase bwIntra for trees (2 nodes or collnet)
        tmpGraph.bwIntra = speedArray[--speedIndex];goto search;}}
    time =-1;memcpy(&tmpGraph, graph,sizeof(tmpGraph));}if(graph->nChannels ==0&& graph->collNet ==0&& graph->pattern != NCCL_TOPO_PATTERN_NVLS){WARN("Could not find a path for pattern %d, falling back to simple order", graph->pattern);for(int i=0; i<ngpus; i++) graph->intra[i]= system->nodes[GPU].nodes[i].gpu.rank;
    graph->inter[0]= graph->inter[1]=0;
    graph->bwIntra = graph->bwInter =0.1;
    graph->typeIntra = graph->typeInter = PATH_SYS;
    graph->nChannels =1;}return ncclSuccess;}

ncclTopoSearchRec

ncclTopoSearchRec执行ncclTopoCompute触发的每一次搜索。根据是否有网络节点，ncclTopoSearchRec内部通过调用ncclTopoSearchRecNet或者ncclTopoSearchTryGpu展开搜索。

/* Search Patterns
 *
 *     Intra-node
 * Ring            : GPU a -> GPU b -> .. -> GPU x -> GPU a
 * (=Split Tree Loop)
 * Tree            : GPU a -> GPU b -> .. -> GPU x
 * (=Split Tree)
 *
 *     Inter-node
 * Ring            : NET n -> GPU a -> GPU b -> .. -> GPU x -> NET n (or m if crossNic)
 * Tree            : NET n -> GPU a -> GPU b -> .. -> GPU x
 *                              `--> NET n (or m if crossNic)
 * Split Tree      : NET n -> GPU a -> GPU b -> .. -> GPU x
 *                                       `--> NET n (or m if crossNic)
 * Split Tree Loop : NET n -> GPU a -> GPU b -> .. -> GPU x -> GPU a
 *                                       `--> NET n (or m if crossNic)
 */
ncclResult_t ncclTopoSearchRec(structncclTopoSystem* system,structncclTopoGraph* graph,structncclTopoGraph* saveGraph,int* time){// Initialize backToNet and backToFirstRank based on the search criteriaint backToNet, backToFirstRank;NCCLCHECK(ncclTopoSearchParams(system, graph->pattern,&backToNet,&backToFirstRank));if(system->nodes[NET].count){// Start from NETncclTopoSearchRecNet(system, graph, saveGraph, backToNet, backToFirstRank, time);}else{// Intra-node only.if(graph->pattern == NCCL_TOPO_PATTERN_NVLS){NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph,0, backToNet, backToFirstRank,0, time,-1,-1, graph->nChannels));return ncclSuccess;}elseif(graph->nChannels ==0){// Try PCI order firstNCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph,0, backToNet, backToFirstRank, FORCED_ORDER_PCI, time,-1,-1,0));}else{// Also try to replay previous channelint g;NCCLCHECK(ncclTopoReplayGetGpu(system, graph,-1,&g));NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph,0, backToNet, backToFirstRank, FORCED_ORDER_REPLAY, time,-1,-1, g));}if(graph->sameChannels ==0|| graph->nChannels ==0){// Finally, try all other possibilities unless we are forced to use the same channelsfor(int g=0; g<system->nodes[GPU].count; g++){NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph,0, backToNet, backToFirstRank,0, time,-1,-1, g));}}}return ncclSuccess;}

ncclTopoSearchRecNet

ncclTopoSearchRecNet是有NET节点时的搜索入口。该函数将graph->inter[graph->nChannels*2]设置好后启动ncclTopoSearchTryGpu。

ncclResult_t ncclTopoSearchRecNet(structncclTopoSystem* system,structncclTopoGraph* graph,structncclTopoGraph* saveGraph,int backToNet,int backToFirstRank,int* time){constint bw = graph->bwInter;int* nets;NCCLCHECK(ncclCalloc(&nets, system->nodes[NET].count));int netCount;NCCLCHECK(ncclTopoSelectNets(system, graph->typeInter,-1, nets,&netCount));for(int i=0; i<netCount; i++){if(graph->pattern == NCCL_TOPO_PATTERN_NVLS && i>0)continue;// skip other NETs for NVLSint n = nets[(graph->nChannels+i)%netCount];structncclTopoNode* net = system->nodes[NET].nodes+n;if(graph->collNet && net->net.collSupport ==0)continue;// skip NETs without collectives support if graph->collNet is setif(net->net.bw < bw)continue;// skip NETs with insufficient bandwidthif(graph->crossNic &&(graph->nChannels &1)&& net->id != graph->inter[(graph->nChannels-1)*2+1])continue;// if crossNic, in and out must be on different NICs

    graph->inter[graph->nChannels*2]= net->id;
    graph->latencyInter = net->net.latency;for(int i=0; i<system->nodes[NET].count; i++){if((system->nodes[NET].nodes[i].net.asic == net->net.asic)&&(system->nodes[NET].nodes[i].net.port == net->net.port)){
        system->nodes[NET].nodes[i].net.bw -= bw;}}if(graph->pattern == NCCL_TOPO_PATTERN_NVLS){// NVLS search only tries to find NIC:GPU combinations to compute the heads.if(graph->nChannels < netCount){int gpu;NCCLCHECK(ncclTopoGetLocalGpu(system, net->id,&gpu));if(gpu !=-1)NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph,0, backToNet, backToFirstRank,0, time, NET, n, gpu));}}else{if(graph->nChannels >0){// Try to replay the last channelint g;NCCLCHECK(ncclTopoReplayGetGpu(system, graph,-1,&g));NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph,0, backToNet, backToFirstRank, FORCED_ORDER_REPLAY, time, NET, n, g));}if(graph->nChannels ==0|| graph->sameChannels ==0){if(graph->nChannels ==0&& system->nodes[NVS].count ==0){// Always try the PCI order first to set a reference, but don't count in the timeout nor let it run for longint t =1<<10;NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph,0, backToNet, backToFirstRank, FORCED_ORDER_PCI,&t, NET, n,0));if(t ==-1)*time =-1;}// Then try the most local GPUsfloat maxBw =0;int minHops =0xfffffff;structncclTopoLinkList* paths = net->paths[GPU];for(int g=0; g<system->nodes[GPU].count; g++){if(paths[g].bw > maxBw){
            maxBw = paths[g].bw;
            minHops = paths[g].count;}elseif(paths[g].bw == maxBw && paths[g].count < minHops){
            minHops = paths[g].count;}}if(maxBw >= bw){for(int i=0; i<system->nodes[GPU].count; i++){int g =(graph->nChannels+i)%system->nodes[GPU].count;if(paths[g].bw == maxBw && paths[g].count == minHops){NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph,0, backToNet, backToFirstRank,0, time, NET, n, g));}}}}}for(int i=0; i<system->nodes[NET].count; i++){if((system->nodes[NET].nodes[i].net.asic == net->net.asic)&&(system->nodes[NET].nodes[i].net.port == net->net.port)){
        system->nodes[NET].nodes[i].net.bw += bw;}}}free(nets);return ncclSuccess;}

ncclTopoSearchTryGpu

ncclTopoSearchTryGpu是搜索GPU节点时的搜索入口。该函数调用递归函数ncclTopoSearchRecGpu完成搜索过程。

ncclResult_t ncclTopoSearchTryGpu(structncclTopoSystem* system,structncclTopoGraph* graph,structncclTopoGraph* saveGraph,int step,int backToNet,int backToFirstRank,int forcedOrder,int*time,int type,int index,int g){constuint64_t flag =1ULL<<(graph->nChannels);structncclTopoNode* gpu;NCCLCHECK(ncclTopoFollowPath(system, graph, type, index, GPU, g,1,&gpu));if(gpu){
    gpu->used ^= flag;NCCLCHECK(ncclTopoSearchRecGpu(system, graph, saveGraph, gpu, step, backToNet, backToFirstRank, forcedOrder, time));
    gpu->used ^= flag;NCCLCHECK(ncclTopoFollowPath(system, graph, type, index, GPU, g,-1,&gpu));}return ncclSuccess;}

ncclTopoSearchRecGpu

ncclTopoSearchTryGpu逻辑的主要实现在ncclTopoSearchRecGpu中。该函数不断递归搜索，递归终止条件是step==ngpu，即需要遍历所有GPU节点。当终止条件到达时，通过ncclTopoCompareGraphs函数判断是否找到更优的方案，如果graph->nChannels < graph->maxChannels此时针对新通道展开新一轮搜索。搜索过程中，不同的step处理方式不同，一般而言便是不断寻找下一个GPU，但比如遇到step == backToFirstRank等特殊情况则需要找到初始的GPU节点完成环的回路。在搜索推进过程中会不断调用ncclTopoFollowPath函数进行剪枝，该函数判断下一跳是否会引入带宽瓶颈，如果不会则继续往前搜索，否则可以提前结束搜索。

ncclResult_t ncclTopoSearchRecGpu(structncclTopoSystem* system,structncclTopoGraph* graph,structncclTopoGraph* saveGraph,structncclTopoNode* gpu,int step,int backToNet,int backToFirstRank,int forcedOrder,int*time){if((*time)<=0)return ncclSuccess;(*time)--;int ngpus = system->nodes[GPU].count;if(step == ngpus){// Determine whether we found a better solution or notint copy =0;
    graph->nChannels++;NCCLCHECK(ncclTopoCompareGraphs(system, graph, saveGraph,&copy));if(copy){memcpy(saveGraph, graph,sizeof(structncclTopoGraph));if(graph->nChannels == graph->maxChannels)*time =-1;}if(graph->nChannels < graph->maxChannels){// not reaching maxChannels, continue to searchNCCLCHECK(ncclTopoSearchRec(system, graph, saveGraph, time));}
    graph->nChannels--;return ncclSuccess;}
  graph->intra[graph->nChannels*ngpus+step]= gpu->gpu.rank;int g = gpu - system->nodes[GPU].nodes;if(step == backToNet){// first get back to NICif(system->nodes[NET].count){int startNetIndex;NCCLCHECK(getNetIndex(system, graph->inter[graph->nChannels*2],&startNetIndex));structncclTopoNode* startNet = system->nodes[NET].nodes+startNetIndex;int netCount;int* nets;NCCLCHECK(ncclCalloc(&nets, system->nodes[NET].count));NCCLCHECK(ncclTopoSelectNets(system, graph->typeInter, g, nets,&netCount));for(int i=0; i<netCount; i++){int n = nets[i];structncclTopoNode* net = system->nodes[NET].nodes+n;if(graph->pattern == NCCL_TOPO_PATTERN_TREE && net->id != startNet->id)continue;// Trees are symmetricif(graph->crossNic !=1&&(net->net.asic != startNet->net.asic || net->net.port != startNet->net.port))continue;if(graph->crossNic &&(graph->nChannels &1)&& net->id != graph->inter[(graph->nChannels-1)*2])continue;// Balanced Tree : count half of the bandwidth on first two GPUsint nextBackToNet =-1;float bwInterSave = graph->bwInter;if(graph->pattern == NCCL_TOPO_PATTERN_BALANCED_TREE){// Count half of the bandwidth on each of the first two GPUsif(step ==0) nextBackToNet =1;elseif(net->id != graph->inter[graph->nChannels*2+1])continue;
          graph->bwInter /=2;}NCCLCHECK(ncclTopoFollowPath(system, graph, GPU, g, NET, n,1,&net));
        graph->bwInter = bwInterSave;if(net){
          graph->inter[graph->nChannels*2+1]= net->id;NCCLCHECK(ncclTopoSearchRecGpu(system, graph, saveGraph, gpu, step, nextBackToNet, backToFirstRank, forcedOrder, time));if(graph->pattern == NCCL_TOPO_PATTERN_BALANCED_TREE) graph->bwInter /=2;NCCLCHECK(ncclTopoFollowPath(system, graph, GPU, g, NET, n,-1,&net));
          graph->bwInter = bwInterSave;}}free(nets);}}elseif(graph->pattern == NCCL_TOPO_PATTERN_NVLS){NCCLCHECK(ncclTopoSearchTryNvls(system, graph, saveGraph, g, ngpus, time));}elseif(step < system->nodes[GPU].count-1){// Go to next GPUint next[NCCL_TOPO_MAX_NODES];int count;if(forcedOrder == FORCED_ORDER_PCI){// Try the PCI order
      next[0]= step+1;
      count =1;}elseif(forcedOrder == FORCED_ORDER_REPLAY){// Try last channel orderNCCLCHECK(ncclTopoReplayGetGpu(system, graph, step, next));
      count =1;}else{// Normal searchNCCLCHECK(ncclTopoSearchNextGpuSort(system, graph, gpu, next,&count, backToNet ==-1?0: backToNet == step+1?1:-1));}for(int i=0; i<count; i++){NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, step+1, backToNet, backToFirstRank, forcedOrder, time, GPU, g, next[i]));}}elseif(step == backToFirstRank){// Find first GPU and loop back to itint p;NCCLCHECK(getGpuIndex(system, graph->intra[graph->nChannels*ngpus],&p));structncclTopoNode* firstGpu;NCCLCHECK(ncclTopoFollowPath(system, graph, GPU, g, GPU, p,1,&firstGpu));if(firstGpu){NCCLCHECK(ncclTopoSearchRecGpu(system, graph, saveGraph, firstGpu, step+1, backToNet,-1, forcedOrder, time));NCCLCHECK(ncclTopoFollowPath(system, graph, GPU, g, GPU, p,-1,&firstGpu));}}else{// Next pathNCCLCHECK(ncclTopoSearchRecGpu(system, graph, saveGraph, gpu, ngpus,-1,-1, forcedOrder, time));}return ncclSuccess;}

ncclTopoFollowPath

ncclTopoFollowPath是搜索过程中的剪枝函数，其寻找type1/index1节点至type2/index2节点的路径类型和带宽是否满足ncclTopoGraph中typeintra，bwIntra，bwInter以及typeinter的要求。首先排除路径类型不匹配的情况（比如graph要求NVL而路径却是PCI），如路径类型判断通过，还需调用followPath判断路径带宽是否满足要求。

// Try to go from node type1/index1 to no type2/index2. mult indicates whether we are counting the bandwidth (1) or undoing (-1).// Set node to the destination node if successful, NULL otherwise.static ncclResult_t ncclTopoFollowPath(structncclTopoSystem* system,structncclTopoGraph* graph,int type1,int index1,int type2,int index2,int mult,structncclTopoNode** node){// First handle easy cases*node = system->nodes[type2].nodes+index2;if(type1 ==-1)return ncclSuccess;structncclTopoNode* node1 = system->nodes[type1].nodes+index1;structncclTopoLinkList* path = node1->paths[type2]+index2;structncclTopoNode* node2 = system->nodes[type2].nodes+index2;structncclTopoLinkList* revPath = node2->paths[type1]+index1;if(path ==NULL){WARN("No path computed to go from %s/%d to %s/%d", topoNodeTypeStr[type1], index1, topoNodeTypeStr[type2], index2);return ncclInternalError;}// Now check link type*node =NULL;int intra =(type1 == GPU || type1 == NVS)&&(type2 == GPU || type2 == NVS);float bw = intra ? graph->bwIntra : graph->bwInter;int type = intra ? graph->typeIntra : graph->typeInter;if(mult ==1&&(path->type > type))return ncclSuccess;if(mult ==1&&(graph->pattern == NCCL_TOPO_PATTERN_BALANCED_TREE ||
        graph->pattern == NCCL_TOPO_PATTERN_TREE ||
        graph->pattern == NCCL_TOPO_PATTERN_SPLIT_TREE)&&(revPath->type > type))return ncclSuccess;

  bw *= mult;// Check there is enough bandwidth on paths.int step =0;NCCLCHECK(followPath(path, node1, path->count, bw,&step));if(step < path->count)goto rewind;// Enough bandwidth : return destination node.
  graph->nHops += mult*path->count;*node = system->nodes[type2].nodes+index2;return ncclSuccess;

rewind:// Not enough bandwidth : rewind and exit.NCCLCHECK(followPath(path, node1, step,-bw,&step));return ncclSuccess;}

followPath

followPath函数便是判断一条路径带宽是否满足要求的函数。

// Check if there is enough bandwidth on the path starting from start node with maxSteps steps and bw bandwidth.static ncclResult_t followPath(structncclTopoLinkList* path,structncclTopoNode* start,int maxSteps,float bw,int* steps){float pciBw = bw;// Check if there is INTEL CPU in the path, if so, we need to consider the overhead of P2P between INTEL CPUfor(int step=0; step<path->count; step++){structncclTopoNode* node = path->list[step]->remNode;if(node->type == CPU){// Account for P2P inefficiency through Intel CPU RCif(path->type == PATH_PHB && start->type == GPU &&
          node->cpu.arch == NCCL_TOPO_CPU_ARCH_X86 &&
          node->cpu.vendor == NCCL_TOPO_CPU_VENDOR_INTEL){
        pciBw =INTEL_P2P_OVERHEAD(bw);}}}// Check if there is enough bandwidth on the pathstructncclTopoNode* node = start;for(int step=0; step<maxSteps; step++){structncclTopoLink* link = path->list[step];structncclTopoLink* revLink =NULL;float fwBw = link->type == LINK_PCI ? pciBw : bw;// Update reverse bandwidth if needed, based on the type of the link and the type of the nodesfloat revBw =0;if(link->remNode->type == GPU && link->remNode->gpu.cudaCompCap <80&& start->type != GPU){if(revLink ==NULL)NCCLCHECK(findRevLink(node, link->remNode, link->type,&revLink));
      revBw += fwBw/8;}if(link->remNode->type == CPU && link->remNode->cpu.arch == NCCL_TOPO_CPU_ARCH_POWER && link->type == LINK_NVL){if(revLink ==NULL)NCCLCHECK(findRevLink(node, link->remNode, link->type,&revLink));
      revBw += fwBw;}// Check if there is enough forward bandwidth on the link and backward bandwidth on the reverse link, return the step if notif(link->bw < fwBw ||(revBw && revLink->bw < revBw)){*steps = step;return ncclSuccess;}SUB_ROUND(link->bw, fwBw);if(revBw)SUB_ROUND(revLink->bw, revBw);
    node = link->remNode;}// we have enough bandwidth on the whole path if we checked all steps*steps = maxSteps;return ncclSuccess;}

标签：人工智能

本文转载自: https://blog.csdn.net/weixin_42371021/article/details/140458256
版权归原作者慢河所有，如有侵权，请联系我们删除。

NCCL拓扑管理 - Search模块

数据结构

接口

ncclTopoSearchInit

ncclTopoCompute

ncclTopoSearchRec

ncclTopoSearchRecNet

ncclTopoSearchTryGpu

ncclTopoSearchRecGpu

ncclTopoFollowPath

followPath

发表评论

“NCCL拓扑管理 - Search模块”的评论:

关于作者

overfit同步小助手

相关阅读

文章导航