dremio FragmentExecutor 的执行顺序简单说明

管理员

dremio 在执行计划物理计划转换之后,对于执行计划会包含不同的fragment,fragment 会组成一颗树,包含了PlanFragmentMajor以及PlanFragmentMinor
对于组成的树之后dremio 就需要调度执行了(里边会包含资源分配,优先级,运算操作,大致处理可以参考drill 查询执行介绍)

FragmentExecutors 中的启动Fragments

参考如下,通过QueryStarterImpl 包装进行处理

 
 public void startFragments(final InitializeFragments fragments, final FragmentExecutorBuilder builder,
                             final StreamObserver<Empty> sender, final NodeEndpoint identity) {
    final SchedulingInfo schedulingInfo = fragments.hasSchedulingInfo() ? fragments.getSchedulingInfo() : null;
    QueryStarterImpl queryStarter = new QueryStarterImpl(fragments, builder, sender, identity, schedulingInfo);
    // 通过FragmentExecutorBuilder 构建FragmentExecutor,内部使用了QueriesClerk 支持工作负载以及查询级别的分配处理,包装完成
   // 之后还是调用的QueryStarterImpl 内部的buildAndStartQuery
    builder.buildAndStartQuery(queryStarter.getFirstFragment(), schedulingInfo, queryStarter);
  }

执行顺序处理

QueryStarterImpl 中的buildAndStartQuery 
....
public void buildAndStartQuery(final QueryTicket queryTicket) {
      QueryId queryId = queryTicket.getQueryId();
 
      /**
       * To avoid race conditions between creation and deletion of phase/fragment tickets,
       * build all the fragments first (creates the tickets) and then, start the fragments (can
       * delete tickets).
       */
      List<FragmentExecutor> fragmentExecutors = new ArrayList<>();
      UserRpcException userRpcException = null;
      Set<FragmentHandle> fragmentHandlesForQuery = Sets.newHashSet();
      try {
        if (!maestroProxy.tryStartQuery(queryId, queryTicket, initializeFragments.getQuerySentTime())) {
          boolean isDuplicateStart = maestroProxy.isQueryStarted(queryId);
          if (isDuplicateStart) {
            // duplicate op, do nothing.
            return;
          } else {
            throw new IllegalStateException("query already cancelled");
          }
        }
 
        Map<Integer, Integer> priorityToWeightMap = buildPriorityToWeightMap();
 
        for (PlanFragmentFull fragment : fullFragments) {
          // 构建FragmentExecutor
          FragmentExecutor fe = buildFragment(queryTicket, fragment, priorityToWeightMap.getOrDefault(fragment.getMajor().getFragmentExecWeight(), 1), schedulingInfo);
          fragmentHandlesForQuery.add(fe.getHandle());
          fragmentExecutors.add(fe);
        }
      } catch (UserRpcException e) {
        userRpcException = e;
      } catch (Exception e) {
        userRpcException = new UserRpcException(NodeEndpoint.getDefaultInstance(), "Remote message leaked.", e);
      } finally {
        if (fragmentHandlesForQuery.size() > 0) {
          maestroProxy.initFragmentHandlesForQuery(queryId, fragmentHandlesForQuery);
        }
        // highest weight first with leaf fragments first bottom up
       // 进行排序处理,然后执行FragmentExecutor,实际的处理是可以参考drill 的查询执行介绍的
        fragmentExecutors.sort(weightBasedComparator());
        for (FragmentExecutor fe : fragmentExecutors) {
          startFragment(fe);
        }
        queryTicket.release();
 
        if (userRpcException == null) {
          sender.onNext(Empty.getDefaultInstance());
          sender.onCompleted();
        } else {
          sender.onError(userRpcException);
        }
      }
 
      // if there was a cancel while the start was in-progress, clean up.
      if (maestroProxy.isQueryCancelled(queryId)) {
        cancelFragments(queryId, builder.getClerk());
      }
    }

buildFragment 构建FragmentExecutor

 private FragmentExecutor buildFragment(final QueryTicket queryTicket, final PlanFragmentFull fragment,
      final int schedulingWeight, final SchedulingInfo schedulingInfo) throws UserRpcException {
 
      if (fragment.getMajor().getFragmentExecWeight() <= 0) {
        logger.info("Received remote fragment start instruction for {}", QueryIdHelper.getQueryIdentifier(fragment.getHandle()));
      } else {
        logger.info("Received remote fragment start instruction for {} with assigned weight {} and scheduling weight {}",
          QueryIdHelper.getQueryIdentifier(fragment.getHandle()), fragment.getMajor().getFragmentExecWeight(), schedulingWeight);
      }
 
      try {
        final EventProvider eventProvider = getEventProvider(fragment.getHandle());
        return builder.build(queryTicket, fragment, schedulingWeight, useMemoryArbiter ? memoryArbiter : null, eventProvider, schedulingInfo, fragmentReader);
      } catch (final Exception e) {
        throw new UserRpcException(identity, "Failure while trying to start remote fragment", e);
      } catch (final OutOfMemoryError t) {
        if (t.getMessage().startsWith("Direct buffer")) {
          throw new UserRpcException(identity, "Out of direct memory while trying to start remote fragment", t);
        } else {
          throw t;
        }
      }
    }

权重比较

 // 通过PlanFragmentFull 获取执行的权重,此生成是根据,dremio 的简单并行化处理的(SimpleParallelizer,内部会通过资源的处理进行评估,后边详细介绍)
private Map<Integer, Integer> buildPriorityToWeightMap() {
      Map<Integer, Integer> priorityToWeightMap = new HashMap<>();
      if (fullFragments.get(0).getMajor().getFragmentExecWeight() <= 0) {
        return priorityToWeightMap;
      }
 
      for(PlanFragmentFull fragment : fullFragments) {
        int fragmentWeight = fragment.getMajor().getFragmentExecWeight();
        priorityToWeightMap.put(fragmentWeight, Math.min(fragmentWeight, QueryTicket.MAX_EXPECTED_SIZE));
      }
      return priorityToWeightMap;
    }
   // 进行比较的,核心是处理root ,leaf 已经一些FragmentExecutor 的顺序,然后进行执行
    private Comparator<FragmentExecutor> weightBasedComparator() {
      return (e1, e2) -> {
        if (e2.getFragmentWeight() < e1.getFragmentWeight()) {
          // higher priority first
          return -1;
        }
        if (e2.getFragmentWeight() == e1.getFragmentWeight()) {
          // when priorities are equal, order the fragments based on leaf first and then descending order
          // of major fragment number. This ensures fragments are started in order of dependency
          if (e2.isLeafFragment() == e1.isLeafFragment()) {
            return Integer.compare(e2.getHandle().getMajorFragmentId(), e1.getHandle().getMajorFragmentId());
          } else {
            return e1.isLeafFragment() ? -1 : 1;
          }
        }
        return 1;
      };
    }

说明

实际上通过阅读drill 的查询执行介绍可以可以看出来顺序,但是通过源码的阅读可以更好的学习了解内部处理

参考资料

sabot/kernel/src/main/java/com/dremio/sabot/exec/fragment/FragmentExecutor.java
sabot/kernel/src/main/java/com/dremio/sabot/exec/FragmentExecutors.java
sabot/kernel/src/main/java/com/dremio/exec/planner/fragment/PlanFragmentFull.java
sabot/kernel/src/main/java/com/dremio/sabot/exec/WorkloadTicket.java
sabot/kernel/src/main/java/com/dremio/sabot/exec/fragment/FragmentExecutorBuilder.java
sabot/kernel/src/main/java/com/dremio/exec/planner/fragment/SimpleParallelizer.java
https://drill.apache.org/docs/drill-query-execution/
https://www.cnblogs.com/rongfengliang/p/17008637.html