Writing a Prometheus exporter in Python: monitoring Hadoop YARN

Before writing an exporter in Python, you need to know that Prometheus provides four metric types.

They are Counter, Gauge, Summary, and Histogram.

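A Counter can only increase, and it is reset to 0 when the program restarts. It is typically used for values that never go down, such as the number of tasks, total processing time, or the number of errors.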
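To make the Counter behaviour concrete, here is a minimal sketch. The metric name is hypothetical and chosen only for illustration, and a private CollectorRegistry is used so the example does not touch the default registry.

```python
from prometheus_client import CollectorRegistry, Counter

# A private registry keeps this sketch isolated from the global default one.
registry = CollectorRegistry()

# Hypothetical metric name, used only for illustration.
failed_rounds = Counter(
    'yarn_exporter_failed_rounds',
    'Number of failed collection rounds (illustrative)',
    registry=registry,
)

failed_rounds.inc()    # add 1
failed_rounds.inc(2)   # add 2; a negative amount raises ValueError

# The Python client exposes counter samples with a _total suffix.
value = registry.get_sample_value('yarn_exporter_failed_rounds_total')
print(value)  # 3.0
```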

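A Gauge is similar to a Counter, except that its value can also decrease and can be set directly. It is typically used for values such as temperature or utilization.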
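A Gauge, by contrast, can be set, incremented, and decremented. The sketch below uses a hypothetical metric name and a private registry, purely for illustration.

```python
from prometheus_client import CollectorRegistry, Gauge

# A private registry keeps this sketch isolated from the global default one.
registry = CollectorRegistry()

# Hypothetical metric name, used only for illustration.
running_apps = Gauge(
    'yarn_running_apps_example',
    'Applications currently running (illustrative)',
    registry=registry,
)

running_apps.set(5)   # set an absolute value
running_apps.inc()    # 5 -> 6
running_apps.dec(2)   # 6 -> 4

value = registry.get_sample_value('yarn_running_apps_example')
print(value)  # 4.0
```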

Summary and Histogram are more involved. I have no use case for them at the moment, so I have not looked into them yet.
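For readers who are curious anyway, the following is a small sketch of what a Histogram records. It is not used anywhere in this exporter; the metric name and bucket boundaries are assumptions made up for the example.

```python
from prometheus_client import CollectorRegistry, Histogram

# A private registry keeps this sketch isolated from the global default one.
registry = CollectorRegistry()

# Hypothetical metric and bucket boundaries, used only for illustration.
round_seconds = Histogram(
    'yarn_collect_round_seconds',
    'Duration of one collection round (illustrative)',
    buckets=(0.5, 1.0, 5.0),
    registry=registry,
)

for duration in (0.2, 0.7, 3.0):
    round_seconds.observe(duration)

# A histogram exposes an observation count, a sum, and cumulative buckets.
count = registry.get_sample_value('yarn_collect_round_seconds_count')
total = registry.get_sample_value('yarn_collect_round_seconds_sum')
le_1 = registry.get_sample_value('yarn_collect_round_seconds_bucket', {'le': '1.0'})
print(count, total, le_1)
```

The bucket with le="1.0" counts every observation at or below one second, which is why buckets are described as cumulative.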

The modules we need

pip install prometheus_client requests

import json
import subprocess
import time
import requests
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway, start_http_server

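Code outline

def push_yarn():
    # Monitor the ResourceManager state stored in ZooKeeper
    Yarn_zkRMAppRoot()
    # Monitor YARN application information
    Yarn_AppsInfo()


def run():
    start_http_server(8006)  # serve the metrics on port 8006
    while True:
        push_yarn()
        time.sleep(10)


if __name__ == '__main__':
    run()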

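push_yarn() collects the monitoring data. Despite the name, nothing is pushed: start_http_server(8006) exposes the metrics over HTTP on port 8006, and Prometheus scrapes them from there.

run() calls push_yarn() in a loop, refreshing the metrics every 10 seconds.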
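One thing to keep in mind with a loop like this is that an uncaught exception in a collection round would stop the exporter. The sketch below shows one way to guard against that. The function collect_forever and its parameters are hypothetical names introduced for this example, and max_rounds and sleep exist only so the loop can be exercised in a test.

```python
import logging
import time


def collect_forever(collect, interval_seconds=10, max_rounds=None, sleep=time.sleep):
    """Call collect() repeatedly and log failures instead of stopping.

    Returns the number of failed rounds once max_rounds is reached
    (with max_rounds=None the loop never returns).
    """
    rounds = 0
    failures = 0
    while max_rounds is None or rounds < max_rounds:
        try:
            collect()
        except Exception:
            # Keep the exporter alive; the stale metric values stay exposed.
            failures += 1
            logging.exception("collection round failed")
        rounds += 1
        sleep(interval_seconds)
    return failures
```

In the exporter above this would be used as collect_forever(push_yarn) after start_http_server(8006).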

We use Gauge instances

Note ⚠️: unlike a Counter, a Gauge can go down as well as up, which suits values such as temperature or utilization.

Create the Gauge instances

yarn_zkRMAppRoot_code = Gauge('yarn_zkRMAppRoot', 'yarn_zkRMAppRoot_num', ['instance'])
started_time_gauge = Gauge('yarn_started_time', 'started_time', ['application'])
launch_time_gauge = Gauge('yarn_launch_time', 'launch_time', ['application'])
finished_time_gauge = Gauge('yarn_finished_time', 'finished_time', ['application'])
memory_seconds_gauge = Gauge('yarn_memory_seconds', 'memory_seconds', ['application'])
vcore_seconds_gauge = Gauge('yarn_vcore_seconds', 'vcore_seconds', ['application'])

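yarn_zkRMAppRoot_code: a Gauge that records the number of znodes under the YARN ResourceManager application root (RMAppRoot) in ZooKeeper. Its instance label identifies the host.

yarn_started_time: a Gauge that records when the application started. The YARN REST API reports this value in milliseconds since the epoch. The application label distinguishes one application from another.

yarn_launch_time: a Gauge that records when the application was launched, which the YARN REST API reports separately from the start time, also in milliseconds since the epoch. It carries the same application label.

yarn_finished_time: a Gauge that records when the application finished, in milliseconds since the epoch. It carries the same application label.

yarn_memory_seconds: a Gauge that records the memory allocated to the application multiplied by the time it was held (megabyte-seconds). It carries the same application label.

yarn_vcore_seconds: a Gauge that records the number of virtual CPU cores allocated to the application multiplied by the time they were held (vcore-seconds). It carries the same application label.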
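To show what a labelled Gauge looks like on the /metrics page, here is a small sketch. It uses a separate registry and an illustrative metric name (yarn_memory_seconds_demo) so that it does not clash with the real metrics, and the application IDs are made up. It also shows remove(), which drops the series for one label value. That may be worth considering because every application ID becomes its own time series.

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest

# A private registry keeps this sketch isolated from the global default one.
registry = CollectorRegistry()

# Illustrative copy of a labelled gauge; the name is hypothetical.
memory_seconds_demo = Gauge(
    'yarn_memory_seconds_demo',
    'memory_seconds (illustrative)',
    ['application'],
    registry=registry,
)

# Made-up application IDs.
memory_seconds_demo.labels(application='application_0000000000000_0001').set(204800)
memory_seconds_demo.labels(application='application_0000000000000_0002').set(512000)

# Drop the series of an application that is no longer of interest.
memory_seconds_demo.remove('application_0000000000000_0001')

# generate_latest() renders the same text format that /metrics serves.
text = generate_latest(registry).decode('utf-8')
print(text)
```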

Implementing the metrics we want to monitor

# --------yarn-------- #####
def Yarn_zkRMAppRoot():
    # zkCli.sh prints the children of RMAppRoot as one comma-separated list;
    # awk prints the number of fields on that line, i.e. the number of znodes.
    list_cmd = "echo 'ls /rmstore/ZKRMStateRoot/RMAppRoot' | /opt/dtstack/DTBase/zookeeper/bin/zkCli.sh | grep application_ | awk -F , '{print NF}'"
    if kerberos_switch:
        # Kerberos enabled: zkCli.sh needs the JAAS and krb5 settings
        command = f'''
            export CLIENT_JVMFLAGS="$CLIENT_JVMFLAGS -Djava.security.auth.login.config=/opt/dtstack/DTBase/zookeeper/conf/jaas.conf -Djava.security.krb5.conf=/opt/dtstack/Kerberos/kerberos_pkg/conf/krb5.conf -Dzookeeper.server.principal=zookeeper/{hostname}@DTSTACK.COM"
            {list_cmd}
            '''
    else:
        command = list_cmd
    # Run the pipeline through the shell; returns (exit_status, output), e.g. (0, '455')
    status, output = subprocess.getstatusoutput(command)
    if status == 0 and output.strip().isdigit():
        yarn_zkRMAppRoot_code.labels('yarn_' + hostname).set(int(output))
    else:
        print(f"Failed to execute command: {command}")
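As an alternative to the grep and awk part of the pipeline, the counting could also be done in Python. This is only a sketch: count_app_znodes is a hypothetical helper, and the sample text assumes that zkCli.sh prints the children as a bracketed, comma-separated list on one line. Unlike the awk version, it counts only entries that start with application_.

```python
import re


def count_app_znodes(zkcli_output):
    """Count application_* children in the bracketed list printed by `ls`."""
    for line in zkcli_output.splitlines():
        match = re.search(r'\[(.*)\]', line)
        if match and 'application_' in line:
            children = [c.strip() for c in match.group(1).split(',')]
            return sum(1 for c in children if c.startswith('application_'))
    # No line with application entries was found.
    return 0


# Made-up sample of what the command output might look like.
sample = (
    "WATCHER::\n"
    "\n"
    "WatchedEvent state:SyncConnected type:None path:null\n"
    "[application_0000000000000_0001, application_0000000000000_0002, "
    "application_0000000000000_0003]\n"
)
print(count_app_znodes(sample))  # 3
```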
def Yarn_AppsInfo():
    # Ask whether rm1 is the active ResourceManager; yarn_rm1 and yarn_rm2
    # hold the addresses of the two ResourceManagers
    command = "yarn rmadmin -getServiceState rm1"
    apps_url = "http://{}/ws/v1/cluster/apps"
    status, state = subprocess.getstatusoutput(command)
    if status != 0:
        print(f"Failed to execute command: {command}")
        return
    rm_host = yarn_rm1 if state == 'active' else yarn_rm2
    response = requests.get(url=apps_url.format(rm_host), timeout=10)
    data = response.json()
    # Guard against a missing or null "apps" entry (no applications to report)
    apps = (data.get('apps') or {}).get('app') or []
    # memorySeconds is in MB-seconds: keep apps above 102400 MB-seconds (100 GB-seconds)
    heavy_apps = [app for app in apps if app['memorySeconds'] > 102400]
    heavy_apps.sort(key=lambda app: (app['memorySeconds'], app['vcoreSeconds']))
    for app in heavy_apps:
        application = app['id']
        started_time_gauge.labels(application=application).set(app['startedTime'])
        launch_time_gauge.labels(application=application).set(app['launchTime'])
        finished_time_gauge.labels(application=application).set(app['finishedTime'])
        memory_seconds_gauge.labels(application=application).set(app['memorySeconds'])
        vcore_seconds_gauge.labels(application=application).set(app['vcoreSeconds'])
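The filtering and sorting step can be tried out without a cluster by pulling it into a pure function. The sketch below does that. select_heavy_apps is a hypothetical helper, and the payload is made-up data shaped like the apps response used above.

```python
def select_heavy_apps(payload, min_memory_seconds=102400):
    """Return apps above the memorySeconds threshold, smallest first."""
    # "apps" may be missing or null when there is nothing to report.
    apps = (payload.get('apps') or {}).get('app') or []
    heavy = [app for app in apps if app.get('memorySeconds', 0) > min_memory_seconds]
    return sorted(heavy, key=lambda app: (app['memorySeconds'], app['vcoreSeconds']))


# Made-up payload for illustration.
payload = {
    'apps': {
        'app': [
            {'id': 'application_0000000000000_0001', 'memorySeconds': 900000, 'vcoreSeconds': 300},
            {'id': 'application_0000000000000_0002', 'memorySeconds': 5000, 'vcoreSeconds': 2},
            {'id': 'application_0000000000000_0003', 'memorySeconds': 150000, 'vcoreSeconds': 40},
        ]
    }
}

selected_ids = [app['id'] for app in select_heavy_apps(payload)]
print(selected_ids)
```

Only the first and third applications pass the threshold, and the third comes first because its memorySeconds value is smaller.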