Troubleshooting Record | Yarn NodeManager Registration Failure

2025-09-17

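Preface

Background of the failure:

Colleagues on the project had previously deployed 8 nodes, all of which worked normally
They later expanded the cluster by another 8 nodes, all of which failed to register
Yarn ResourceManager is configured for HA
Hadoop version: 3.1.1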

Exception details

2025-09-17 09:30:00,869 INFO  client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm2
2025-09-17 09:30:01,043 INFO  retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - org.apache.hadoop.security.authorize.AuthorizationException: User nm/indata-192-168-1-3.indata.com@INDATA.COM  (auth:KERBEROS) is not authorized for protocol interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB: this service is only accessible by nm/192.168.1.3@INDATA.COM, while invoking ResourceTrackerPBClientImpl.registerNodeManager over rm2 after 1 failover attempts. Trying to failover after sleeping for 21498ms.
2025-09-17 09:30:22,541 INFO  client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm1
2025-09-17 09:30:22,545 INFO  retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - java.net.ConnectException: Call From indata-192-168-1-3.indata.com/192.168.1.3 to indata-192-168-1-1.indata.com:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ResourceTrackerPBClientImpl.registerNodeManager over rm1 after 2 failover attempts. Trying to failover after sleeping for 33407ms.
2025-09-17 09:30:55,953 INFO  client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm2
2025-09-17 09:30:55,992 INFO  retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - org.apache.hadoop.security.authorize.AuthorizationException: User nm/indata-192-168-1-3.indata.com@INDATA.COM  (auth:KERBEROS) is not authorized for protocol interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB: this service is only accessible by nm/192.168.1.3@INDATA.COM, while invoking ResourceTrackerPBClientImpl.registerNodeManager over rm2 after 3 failover attempts. Trying to failover after sleeping for 44974ms.
2025-09-17 09:31:40,967 INFO  client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm1
2025-09-17 09:31:40,971 INFO  retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - java.net.ConnectException: Call From indata-192-168-1-3.indata.com/192.168.1.3 to indata-192-168-1-1.indata.com:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ResourceTrackerPBClientImpl.registerNodeManager over rm1 after 4 failover attempts. Trying to failover after sleeping for 15164ms.
2025-09-17 09:31:56,136 INFO  client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm2
2025-09-17 09:31:56,181 INFO  retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - org.apache.hadoop.security.authorize.AuthorizationException: User nm/indata-192-168-1-3.indata.com@INDATA.COM  (auth:KERBEROS) is not authorized for protocol interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB: this service is only accessible by nm/192.168.1.3@INDATA.COM, while invoking ResourceTrackerPBClientImpl.registerNodeManager over rm2 after 5 failover attempts. Trying to failover after sleeping for 27554ms.
2025-09-17 09:32:23,741 INFO  client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm1
2025-09-17 09:32:23,749 INFO  retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - java.net.ConnectException: Call From indata-192-168-1-3.indata.com/192.168.1.3 to indata-192-168-1-1.indata.com:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ResourceTrackerPBClientImpl.registerNodeManager over rm1 after 6 failover attempts. Trying to failover after sleeping for 35229ms.

Summary of the exceptions:

Authorization exception

Error message: org.apache.hadoop.security.authorize.AuthorizationException: User nm/indata-192-168-1-3.indata.com@INDATA.COM (auth:KERBEROS) is not authorized for protocol interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB: this service is only accessible by nm/192.168.1.3@INDATA.COM
Meaning: the Kerberos-authenticated user nm/indata-192-168-1-3.indata.com@INDATA.COM is denied access; the service accepts only nm/192.168.1.3@INDATA.COM
When it occurs: every time the NodeManager tries to register with rm2
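When many nodes fail at once, it can help to pull the two principals out of each log line and compare them side by side. The following is a minimal sketch, not part of Hadoop: the class name AuthzLogSketch and the regular expression are assumptions derived only from the message format shown in the log above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the presented and expected principals from a log line. */
class AuthzLogSketch {

    /** The two principals named in one authorization failure. */
    static final class Mismatch {
        final String actual;   // principal the NodeManager presented
        final String expected; // principal the ResourceManager would accept

        Mismatch(String actual, String expected) {
            this.actual = actual;
            this.expected = expected;
        }
    }

    // Pattern assumed from the message text in the log excerpt above.
    private static final Pattern AUTHZ_FAILURE = Pattern.compile(
        "User (\\S+)\\s+\\(auth:\\w+\\) is not authorized for protocol interface "
            + "[\\w.$]+: this service is only accessible by ([^,\\s]+)");

    /** Returns the mismatch described by the line, or empty if the line is something else. */
    static Optional<Mismatch> parse(String logLine) {
        Matcher m = AUTHZ_FAILURE.matcher(logLine);
        return m.find()
            ? Optional.of(new Mismatch(m.group(1), m.group(2)))
            : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "User nm/indata-192-168-1-3.indata.com@INDATA.COM  (auth:KERBEROS) "
            + "is not authorized for protocol interface "
            + "org.apache.hadoop.yarn.server.api.ResourceTrackerPB: "
            + "this service is only accessible by nm/192.168.1.3@INDATA.COM, while invoking";
        parse(line).ifPresent(mm ->
            System.out.println("presented=" + mm.actual + " expected=" + mm.expected));
    }
}
```

Seeing a host name on one side and a bare IP on the other is the clue that the two ends disagree on the instance part of the principal.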

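Connection refused exception

Error message: java.net.ConnectException: Call From indata-192-168-1-3.indata.com/192.168.1.3 to indata-192-168-1-1.indata.com:8031 failed on connection exception: java.net.ConnectException: Connection refused
Meaning: the connection from node indata-192-168-1-3.indata.com to indata-192-168-1-1.indata.com:8031 is refused
When it occurs: every time the NodeManager tries to register with rm1

Overall picture:

The NodeManager keeps failing over between rm1 and rm2
Connections to rm2 fail on authorization
Connections to rm1 cannot be established at all (port 8031 refuses the connection)
The two exceptions alternate in a loop, and the NodeManager never manages to register with either ResourceManager
After many retries, the NodeManager finally fails to start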

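Explanation

rm1 was in Standby state at this point. A Standby ResourceManager does not listen on port 8031, so the connection is refused.
rm2 was Active at this point, and registration with it failed because of the authorization problem.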
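To confirm which ResourceManager is actually listening on the resource-tracker port, a plain TCP probe from the NodeManager host is enough. The sketch below is illustrative only: the class name PortProbeSketch, the 2-second timeout, and passing host names as command-line arguments are assumptions; port 8031 is taken from the log above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether something accepts TCP connections on host:port. */
class PortProbeSketch {

    /** Returns true if a TCP connection can be opened within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection refused, timeouts, and unresolvable host names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the ResourceManager host names as arguments.
        for (String host : args) {
            System.out.println(host + ":8031 listening=" + isListening(host, 8031, 2000));
        }
    }
}
```

A refused connection on the Standby node is expected and is not itself the fault; only the failure against the Active node needs further investigation.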

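First, switch rm1 to Active

The logs then showed no authorization error when connecting to rm1, and the NodeManager registered successfully.
It is therefore reasonable to conclude that the problem is caused by the authorization failure. What remained unclear was why authorization failed on rm2 but succeeded on rm1.

Final resolution
After various rounds of analysis and experiments, reading the source code finally revealed the root cause.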

Source code

Searching the source for the keywords in the log leads to ServiceAuthorizationManager. The code that rejects the request is in its authorize method:

public void authorize(UserGroupInformation user, 
                             Class<?> protocol,
                             Configuration conf,
                             InetAddress addr
                             ) throws AuthorizationException {
  AccessControlList[] acls = protocolToAcls.get(protocol);
  MachineList[] hosts = protocolToMachineLists.get(protocol);
  if (acls == null || hosts == null) {
    throw new AuthorizationException("Protocol " + protocol + 
                                     " is not known.");
  }
  
  // get client principal key to verify (if available)
  KerberosInfo krbInfo = SecurityUtil.getKerberosInfo(protocol, conf);
  String clientPrincipal = null; 
  if (krbInfo != null) {
    String clientKey = krbInfo.clientPrincipal();
    if (clientKey != null && !clientKey.isEmpty()) {
      try {
        clientPrincipal = SecurityUtil.getServerPrincipal(
            conf.get(clientKey), addr);
      } catch (IOException e) {
        throw (AuthorizationException) new AuthorizationException(
            "Can't figure out Kerberos principal name for connection from "
                + addr + " for user=" + user + " protocol=" + protocol)
            .initCause(e);
      }
    }
  }
  if((clientPrincipal != null && !clientPrincipal.equals(user.getUserName())) || 
     acls.length != 2  || !acls[0].isUserAllowed(user) || acls[1].isUserAllowed(user)) {
    String cause = clientPrincipal != null ?
        ": this service is only accessible by " + clientPrincipal :
        ": denied by configured ACL";
    AUDITLOG.warn(AUTHZ_FAILED_FOR + user
        + " for protocol=" + protocol + cause);
    throw new AuthorizationException("User " + user +
        " is not authorized for protocol " + protocol + cause);
  }
  if (addr != null) {
    String hostAddress = addr.getHostAddress();
    if (hosts.length != 2 || !hosts[0].includes(hostAddress) ||
        hosts[1].includes(hostAddress)) {
      AUDITLOG.warn(AUTHZ_FAILED_FOR + " for protocol=" + protocol
          + " from host = " +  hostAddress);
      throw new AuthorizationException("Host " + hostAddress +
          " is not authorized for protocol " + protocol) ;
    }
  }
  AUDITLOG.info(AUTHZ_SUCCESSFUL_FOR + user + " for protocol="+protocol);
}

Error trigger condition
In the code, clientPrincipal != null && !clientPrincipal.equals(user.getUserName()) evaluates to true, that is:

clientPrincipal: the "principal allowed to access", which the ResourceManager computes from its configuration (nm/192.168.1.3@INDATA.COM)
user.getUserName(): the Kerberos principal the NodeManager actually uses (nm/indata-192-168-1-3.indata.com@INDATA.COM)
The two are not equal, so an AuthorizationException is thrown immediately, with the message "this service is only accessible by nm/192.168.1.3@INDATA.COM"
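The principal check can be isolated from the rest of authorize and run with the values from this incident. This is a simplified sketch, not Hadoop's code: the class PrincipalCheckSketch and the method denialCause are hypothetical names, and the ACL and host checks are deliberately left out.

```java
/** Hypothetical, stripped-down version of the principal comparison in authorize(). */
class PrincipalCheckSketch {

    /** Returns the denial cause, or null when the principal comparison passes. */
    static String denialCause(String clientPrincipal, String userName) {
        if (clientPrincipal != null && !clientPrincipal.equals(userName)) {
            return ": this service is only accessible by " + clientPrincipal;
        }
        return null;
    }

    public static void main(String[] args) {
        String expected = "nm/192.168.1.3@INDATA.COM";                    // computed on rm2
        String presented = "nm/indata-192-168-1-3.indata.com@INDATA.COM"; // from the keytab
        // prints ": this service is only accessible by nm/192.168.1.3@INDATA.COM"
        System.out.println(denialCause(expected, presented));
    }
}
```

The comparison is a plain string equality, so an IP address in one principal and a host name in the other can never match, even though both refer to the same machine.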

Where clientPrincipal comes from

clientPrincipal is produced by the following code:

KerberosInfo krbInfo = SecurityUtil.getKerberosInfo(protocol, conf);
String clientKey = krbInfo.clientPrincipal(); // config key that holds the client principal for this protocol
clientPrincipal = SecurityUtil.getServerPrincipal(conf.get(clientKey), addr);

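protocol: ResourceTrackerPB (the NodeManager registration protocol)
krbInfo.clientPrincipal(): reads, from the @KerberosInfo annotation, the configuration key that holds the client principal for the protocol. For ResourceTrackerPB this key is yarn.nodemanager.principal (defined by an annotation built into Hadoop)
SecurityUtil.getServerPrincipal(conf.get(clientKey), addr): builds clientPrincipal from the value of yarn.nodemanager.principal and the client IP (addr)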

Why is clientPrincipal nm/192.168.1.3@INDATA.COM?

SecurityUtil.getServerPrincipal parses the value of yarn.nodemanager.principal:

yarn.nodemanager.principal is configured as nm/_HOST@INDATA.COM
_HOST was replaced with the client IP (192.168.1.3) rather than the host name, which produced nm/192.168.1.3@INDATA.COM
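The substitution itself is simple string work. The sketch below re-creates it outside Hadoop to show how the same configured value yields two different principals depending on what the server resolves the client address to. The class HostPatternSketch and the method substituteHost are hypothetical names, and lower-casing the host name is an assumption about Hadoop's behavior rather than something shown in the excerpt above.

```java
import java.util.Locale;

/** Illustrative re-creation of the _HOST substitution; not Hadoop's implementation. */
class HostPatternSketch {

    static final String HOSTNAME_PATTERN = "_HOST";

    /** Replaces the _HOST instance component of primary/instance@REALM with hostName. */
    static String substituteHost(String principalConfig, String hostName) {
        String[] parts = principalConfig.split("[/@]");
        if (parts.length != 3 || !parts[1].equals(HOSTNAME_PATTERN)) {
            return principalConfig; // no _HOST placeholder, use the value as configured
        }
        return parts[0] + "/" + hostName.toLowerCase(Locale.ROOT) + "@" + parts[2];
    }

    public static void main(String[] args) {
        String config = "nm/_HOST@INDATA.COM";
        // What rm2 computed: the client address resolved to nothing but the IP.
        System.out.println(substituteHost(config, "192.168.1.3"));
        // prints nm/192.168.1.3@INDATA.COM
        // What rm1 computed: the client address resolved to its host name.
        System.out.println(substituteHost(config, "indata-192-168-1-3.indata.com"));
        // prints nm/indata-192-168-1-3.indata.com@INDATA.COM
    }
}
```

Only the second result matches the principal in the NodeManager's keytab, which is why the outcome depends on name resolution on the ResourceManager side.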

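Why was _HOST not replaced with the host name?
Checking rm2 showed that its /etc/hosts had no entry mapping 192.168.1.3 to a host name. From this, the _HOST replacement logic is inferred to be:

The ResourceManager performs a reverse lookup of the NodeManager's IP address on its own node, which in this environment means consulting its local /etc/hosts
If a mapping exists, _HOST is replaced with the host name
If no mapping exists, _HOST is replaced with the IP

This matches SecurityUtil.getServerPrincipal(String, InetAddress), which substitutes addr.getCanonicalHostName() for _HOST; that JDK method returns the textual IP address when the reverse lookup finds no name.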
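The inference above can be checked directly on each ResourceManager node with the JDK call that performs the reverse lookup. This is a minimal sketch under stated assumptions: the class ReverseLookupSketch and the helper fellBackToIp are hypothetical names, and the IP addresses to test are passed as command-line arguments.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical check: reports IPs that this host cannot map back to a name. */
class ReverseLookupSketch {

    /** True when the canonical name is just the IP text, i.e. no reverse mapping was found. */
    static boolean fellBackToIp(String hostAddress, String canonicalName) {
        return hostAddress.equals(canonicalName);
    }

    public static void main(String[] args) throws UnknownHostException {
        for (String ip : args) {
            // For a literal IP, getByName only parses the text and does no lookup.
            InetAddress addr = InetAddress.getByName(ip);
            // The reverse lookup happens here, using this host's resolver configuration.
            String name = addr.getCanonicalHostName();
            String note = fellBackToIp(addr.getHostAddress(), name) ? "  (no reverse mapping)" : "";
            System.out.println(ip + " -> " + name + note);
        }
    }
}
```

Running it on rm1 and rm2 with the IPs of the new NodeManagers should show host names on the node whose /etc/hosts is complete and bare IPs on the node where entries are missing.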

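Root cause

The /etc/hosts files on the 8 newly added nodes contained IP-to-host-name mappings for all 16 nodes. Of the 8 previously deployed nodes, however, only the rm1 node had been updated with the mappings for the new nodes
When a NodeManager registers, the ResourceManager builds clientPrincipal from the value of yarn.nodemanager.principal and the client IP (addr). The configured value is nm/_HOST@INDATA.COM, and the ResourceManager looks up the NodeManager's IP in the /etc/hosts of its own node: if a host name is mapped, _HOST becomes the host name; if not, _HOST becomes the IP
As a result, rm2 accepted only nm/192.168.1.3@INDATA.COM, which does not match the principal nm/indata-192-168-1-3.indata.com@INDATA.COM in the keytab, so authorization failed
Because /etc/hosts on rm1 was configured correctly, authorization succeeded once rm1 was switched to Active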