skywalking简单入门使用

简介

skywalking是一个优秀的国产开源框架，2015年由个人吴晟（华为开发者）开源， 2017年加入Apache孵化器。短短两年就被Apache收入麾下，实力可见一斑。 skywalking支持dubbo，SpringCloud，SpringBoot集成，代码无侵入，通信方式采用GRPC，性能较好，实现方式是java探针，支持告警，支持JVM监控，支持全局调用统计等等，功能较完善。

官网：https://skywalking.apache.org/

Skywalking架构

SkyWalking 逻辑上分为四部分: 探针, 平台后端, 存储和用户界面。

探针基于不同的来源可能是不一样的, 但作用都是收集数据, 将数据格式化为 SkyWalking 适用的格式.
平台后端, 支持数据聚合, 数据分析以及驱动数据流从探针到用户界面的流程。分析包括 Skywalking 原生追踪和性能指标以及第三方来源，包括 Istio 及 Envoy telemetry , Zipkin 追踪格式化等。
存储通过开放的插件化的接口存放 SkyWalking 数据. 你可以选择一个既有的存储系统, 如 ElasticSearch, H2 或 MySQL 集群(Sharding-Sphere 管理),也可以选择自己实现一个存储系统. 当然, 我们非常欢迎你贡献新的存储系统实现。
UI 一个基于接口高度定制化的Web系统，用户可以可视化查看和管理 SkyWalking 数据。

部署安装

多种方式，只给出docker部署的docker-compose例子，其他的自行在官网查找。

存储使用的是es7，同时将oap注册到了nacos

version:'2'services:elasticsearch:image: elasticsearch:7.14.2
    container_name: elasticsearch
    restart: always
    ports:- 9200:9200environment:-"TAKE_FILE_OWNERSHIP=true"# volumes 挂载权限 如果不想要挂载es文件改配置可以删除-"discovery.type=single-node"#单机模式启动-"TZ=Asia/Shanghai"# 设置时区-"ES_JAVA_OPTS=-Xms512m -Xmx512m"# 设置jvm内存大小volumes:- /data/skywalking/elasticsearch/logs:/usr/share/elasticsearch/logs
      - /data/skywalking/elasticsearch/data:/usr/share/elasticsearch/data
    ulimits:memlock:soft:-1hard:-1skywalking-oap-server:image: apache/skywalking-oap-server:8.9.1
    container_name: skywalking-oap-server
    depends_on:- elasticsearch
    links:- elasticsearch
    restart: always
    ports:- 11800:11800- 12800:12800environment:#此处的参数为容器里/skywalking/config/application.yml的配置SW_STORAGE: elasticsearch  # 指定ES版本SW_STORAGE_ES_CLUSTER_NODES: elasticsearch:9200SW_CLUSTER: nacos
      SW_SERVICE_NAME: skywalking-oap-server
      SW_CLUSTER_NACOS_HOST_PORT: 192.168.0.200
      SW_CLUSTER_NACOS_NAMESPACE: dev
      TZ: Asia/Shanghai
  skywalking-ui:image: apache/skywalking-ui:8.9.1
    container_name: skywalking-ui
    depends_on:- skywalking-oap-server
    links:- skywalking-oap-server
    restart: always
    ports:- 8080:8080environment:SW_OAP_ADDRESS: http://skywalking-oap-server:12800TZ: Asia/Shanghai

UI界面说明

仪表盘：查看被监控服务的运行状态
拓扑图：以拓扑图的方式展现服务直接的关系，并以此为入口查看相关信息
追踪：以接口列表的方式展现，追踪接口内部调用过程
性能剖析：单独端点进行采样分析，并可查看堆栈信息
告警：触发告警的告警列表，包括实例，请求超时等。
事件：
调试：

仪表盘

仪表盘分为：全局(Global)、服务(Service)、实例(Instance)、端点(Endpoint)四个维度

Global全局维度

Services load：服务每分钟请求数

Slow Services：慢响应服务，单位ms

Un-Health services(Apdex):Apdex性能指标，1为满分。

Global Response Latency：百分比响应延时，不同百分比的延时时间，单位ms

Global Heatmap：服务响应时间热力分布图，根据时间段内不同响应时间的数量显示颜色深度

Service服务维度

Service Apdex（数字）:当前服务的评分

Service Apdex（折线图）：不同时间的Apdex评分

Successful Rate（数字）：请求成功率

Successful Rate（折线图）：不同时间的请求成功率

Servce Load（数字）：每分钟请求数

Servce Load（折线图）：不同时间的每分钟请求数

Service Avg Response Times：平均响应延时，单位ms

Global Response Time Percentile：百分比响应延时

Servce Instances Load：每个服务实例的每分钟请求数

Show Service Instance：每个服务实例的最大延时

Service Instance Successful Rate：每个服务实例的请求成功率

Instance实例维度

Service Instance Load：当前实例的每分钟请求数

Service Instance Successful Rate：当前实例的请求成功率

Service Instance Latency：当前实例的响应延时

JVM CPU:jvm占用CPU的百分比

JVM Memory：JVM内存占用大小，单位m

JVM GC Time：JVM垃圾回收时间，包含YGC和OGC

JVM GC Count：JVM垃圾回收次数，包含YGC和OGC

CLR XX：类似JVM虚拟机，这里用不上就不做解释了

Endpoint端点维度

Endpoint Load in Current Service：每个端点的每分钟请求数

Slow Endpoints in Current Service：每个端点的最慢请求时间，单位ms

Successful Rate in Current Service：每个端点的请求成功率

Endpoint Load：当前端点每个时间段的请求数据

Endpoint Avg Response Time：当前端点每个时间段的请求行响应时间

Endpoint Response Time Percentile：当前端点每个时间段的响应时间占比

Endpoint Successful Rate：当前端点每个时间段的请求成功率

界面上的指标可以做一些自定义的调整(更换图表、统计数量等)，也可以增加自定义指标(应该是要二开)

拓扑图

点击具体的服务，右侧将显示服务的状态，服务周边出现触点；点击连线，将会显示服务之间的连接情况

追踪

左侧：api接口列表，红色-异常请求，蓝色-正常请求
右侧：api追踪列表，api请求连接各端点的先后顺序和时间

性能剖析

会报错，暂时不知道怎么用

日志

点击可查看具体的日志，同一追踪id表明是同一个请求

告警

对于服务的异常信息，比如接口有较长延迟，skywalking也做出了告警功能，如下图：

skywalking中有一些默认的告警规则，如下：

最近3分钟内服务的平均响应时间超过1秒
最近2分钟服务成功率低于80%
最近3分钟90%服务响应时间超过1秒
最近2分钟内服务实例的平均响应时间超过1秒

当然除了以上四种，随着Skywalking不断迭代也会新增其他规则，这些规则的配置在

config/alarm-settings.yml

配置文件中

# Licensed to the Apache Software Foundation (ASF) under one# or more contributor license agreements.  See the NOTICE file# distributed with this work for additional information# regarding copyright ownership.  The ASF licenses this file# to you under the Apache License, Version 2.0 (the# "License"); you may not use this file except in compliance# with the License.  You may obtain a copy of the License at##     http://www.apache.org/licenses/LICENSE-2.0## Unless required by applicable law or agreed to in writing, software# distributed under the License is distributed on an "AS IS" BASIS,# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.# See the License for the specific language governing permissions and# limitations under the License.# Sample alarm rules.rules:# Rule unique name, must be ended with `_rule`.service_resp_time_rule:metrics-name: service_resp_time
    op:">"threshold:1000period:10count:3silence-period:5message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:# Metrics value need to be long, double or intmetrics-name: service_sla
    op:"<"threshold:8000# The length of time to evaluate the metricsperiod:10# How many times after the metrics match the condition, will trigger alarmcount:2# How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.silence-period:3message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:# Metrics value need to be long, double or intmetrics-name: service_percentile
    op:">"threshold:1000,1000,1000,1000,1000period:10count:3silence-period:5message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:metrics-name: service_instance_resp_time
    op:">"threshold:1000period:10count:2silence-period:5message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:metrics-name: database_access_resp_time
    threshold:1000op:">"period:10count:2message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:metrics-name: endpoint_relation_resp_time
    threshold:1000op:">"period:10count:2message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes

#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.#  Because the number of endpoint is much more than service and instance.##  endpoint_avg_rule:#    metrics-name: endpoint_avg#    op: ">"#    threshold: 1000#    period: 10#    count: 2#    silence-period: 5#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minuteswebhooks:#  - http://127.0.0.1/notify/#  - http://127.0.0.1/go-wechat/

规则中的参数属性解释
Rule name。在告警信息中显示的唯一名称。必须以_rule结尾。规则命名用户自定义
属性含义metrics-nameoal脚本中的度量名称，不能随便定义threshold阈值，与metrics-name和下面的比较符号相匹配op比较操作符，可以设定>,<,=period多久检查一次当前的指标数据是否符合告警规则，单位分钟count达到多少次后，发送告警消息silence-period在多久之内，忽略相同的告警消息message告警消息内容include-names本规则告警生效的服务列表
metrics-name是oal脚本中的度量名。详细参数可见\config\oal\core.oal文件：

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */

// For services using protocols HTTP 1/2, gRPC, RPC, etc., the cpm metrics means "calls per minute",
// for services that are built on top of TCP, the cpm means "packages per minute".

// All scope metrics
all_percentile = from(All.latency).percentile(10);  // Multiple values including p50, p75, p90, p95, p99
all_heatmap = from(All.latency).histogram(100, 20);

// Service scope metrics
service_resp_time = from(Service.latency).longAvg();
service_sla = from(Service.*).percent(status == true);
service_cpm = from(Service.*).cpm();
service_percentile = from(Service.latency).percentile(10); // Multiple values including p50, p75, p90, p95, p99
service_apdex = from(Service.latency).apdex(name, status);
service_mq_consume_count = from(Service.*).filter(type == RequestType.MQ).count();
service_mq_consume_latency = from((str->long)Service.tag["transmission.latency"]).filter(type == RequestType.MQ).filter(tag["transmission.latency"] != null).longAvg();

// Service relation scope metrics for topology
service_relation_client_cpm = from(ServiceRelation.*).filter(detectPoint == DetectPoint.CLIENT).cpm();
service_relation_server_cpm = from(ServiceRelation.*).filter(detectPoint == DetectPoint.SERVER).cpm();
service_relation_client_call_sla = from(ServiceRelation.*).filter(detectPoint == DetectPoint.CLIENT).percent(status == true);
service_relation_server_call_sla = from(ServiceRelation.*).filter(detectPoint == DetectPoint.SERVER).percent(status == true);
service_relation_client_resp_time = from(ServiceRelation.latency).filter(detectPoint == DetectPoint.CLIENT).longAvg();
service_relation_server_resp_time = from(ServiceRelation.latency).filter(detectPoint == DetectPoint.SERVER).longAvg();
service_relation_client_percentile = from(ServiceRelation.latency).filter(detectPoint == DetectPoint.CLIENT).percentile(10); // Multiple values including p50, p75, p90, p95, p99
service_relation_server_percentile = from(ServiceRelation.latency).filter(detectPoint == DetectPoint.SERVER).percentile(10); // Multiple values including p50, p75, p90, p95, p99

// Service Instance relation scope metrics for topology
service_instance_relation_client_cpm = from(ServiceInstanceRelation.*).filter(detectPoint == DetectPoint.CLIENT).cpm();
service_instance_relation_server_cpm = from(ServiceInstanceRelation.*).filter(detectPoint == DetectPoint.SERVER).cpm();
service_instance_relation_client_call_sla = from(ServiceInstanceRelation.*).filter(detectPoint == DetectPoint.CLIENT).percent(status == true);
service_instance_relation_server_call_sla = from(ServiceInstanceRelation.*).filter(detectPoint == DetectPoint.SERVER).percent(status == true);
service_instance_relation_client_resp_time = from(ServiceInstanceRelation.latency).filter(detectPoint == DetectPoint.CLIENT).longAvg();
service_instance_relation_server_resp_time = from(ServiceInstanceRelation.latency).filter(detectPoint == DetectPoint.SERVER).longAvg();
service_instance_relation_client_percentile = from(ServiceInstanceRelation.latency).filter(detectPoint == DetectPoint.CLIENT).percentile(10); // Multiple values including p50, p75, p90, p95, p99
service_instance_relation_server_percentile = from(ServiceInstanceRelation.latency).filter(detectPoint == DetectPoint.SERVER).percentile(10); // Multiple values including p50, p75, p90, p95, p99

// Service Instance Scope metrics
service_instance_sla = from(ServiceInstance.*).percent(status == true);
service_instance_resp_time= from(ServiceInstance.latency).longAvg();
service_instance_cpm = from(ServiceInstance.*).cpm();

// Endpoint scope metrics
endpoint_cpm = from(Endpoint.*).cpm();
endpoint_avg = from(Endpoint.latency).longAvg();
endpoint_sla = from(Endpoint.*).percent(status == true);
endpoint_percentile = from(Endpoint.latency).percentile(10); // Multiple values including p50, p75, p90, p95, p99
endpoint_mq_consume_count = from(Endpoint.*).filter(type == RequestType.MQ).count();
endpoint_mq_consume_latency = from((str->long)Endpoint.tag["transmission.latency"]).filter(type == RequestType.MQ).filter(tag["transmission.latency"] != null).longAvg();

// Endpoint relation scope metrics
endpoint_relation_cpm = from(EndpointRelation.*).filter(detectPoint == DetectPoint.SERVER).cpm();
endpoint_relation_resp_time = from(EndpointRelation.rpcLatency).filter(detectPoint == DetectPoint.SERVER).longAvg();
endpoint_relation_sla = from(EndpointRelation.*).filter(detectPoint == DetectPoint.SERVER).percent(status == true);
endpoint_relation_percentile = from(EndpointRelation.rpcLatency).filter(detectPoint == DetectPoint.SERVER).percentile(10); // Multiple values including p50, p75, p90, p95, p99

database_access_resp_time = from(DatabaseAccess.latency).longAvg();
database_access_sla = from(DatabaseAccess.*).percent(status == true);
database_access_cpm = from(DatabaseAccess.*).cpm();
database_access_percentile = from(DatabaseAccess.latency).percentile(10);

如果想要调整默认的规则，比如监控返回的信息，监控的参数等等，只需要改动上述配置文件中的参数即可。

当然除了以上默认的几种规则，skywalking还适配了一些钩子（webhooks）。其实就是相当于一个回调，一旦触发了上述规则告警，skywalking则会调用配置的webhook，这样开发者就可以定制一些处理方法，比如发送邮件、微信、钉钉通知运维人员处理。

当然这个钩子也是有些规则的，如下：

POST请求
application/json 接收数据
接收的参数必须是AlarmMessage中指定的参数。

注意：AlarmMessage这个类随着skywalking版本的迭代可能出现不同，一定要到对应版本源码中去找到这个类，拷贝其中的属性。这个类在源码的路径：

org.apache.skywalking.oap.server.core.alarm

，如下图：

事件

…

调试

…

日志对接

在skywalking的UI端有一个日志的模块，用于收集客户端的日志，默认是没有数据的，那么需要如何将日志数据传输到skywalking中呢？

日志框架的种类很多，比较出名的有log4j，logback，log4j2，以springboot默认的logback为例子介绍一下如何配置，官方文档如下：

log4j：java-agent/application-toolkit-log4j-1.x/
log4j2：java-agent/application-toolkit-log4j-2.x/
logback：java-agent/application-toolkit-logback-1.x/

1、添加依赖

根据官方文档，需要先添加依赖，如下：

<dependency><groupId>org.apache.skywalking</groupId><artifactId>apm-toolkit-logback-1.x</artifactId><version>8.10.0</version></dependency>

2、添加配置文件

新建一个

logback-spring.xml

放在resource目录下，配置如下：

<?xml version="1.0" encoding="UTF-8"?><configurationscan="true"scanPeriod=" 5 seconds"><appendername="STDOUT"class="ch.qos.logback.core.ConsoleAppender"><encoderclass="ch.qos.logback.core.encoder.LayoutWrappingEncoder"><layoutclass="org.apache.skywalking.apm.toolkit.log.logback.v1.x.mdc.TraceIdMDCPatternLogbackLayout"><Pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%X{tid}] [%thread] %-5level %logger{36} -%msg%n</Pattern></layout></encoder></appender><appendername="ASYNC"class="ch.qos.logback.classic.AsyncAppender"><discardingThreshold>0</discardingThreshold><queueSize>1024</queueSize><neverBlock>true</neverBlock><appender-refref="STDOUT"/></appender><appendername="grpc-log"class="org.apache.skywalking.apm.toolkit.log.logback.v1.x.log.GRPCLogClientAppender"><encoderclass="ch.qos.logback.core.encoder.LayoutWrappingEncoder"><layoutclass="org.apache.skywalking.apm.toolkit.log.logback.v1.x.mdc.TraceIdMDCPatternLogbackLayout"><Pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%X{tid}] [%thread] %-5level %logger{36} -%msg%n</Pattern></layout></encoder></appender><rootlevel="INFO"><appender-refref="grpc-log"/><appender-refref="ASYNC"/></root></configuration>

3、接入探针agent

1）下载探针

https://archive.apache.org/dist/skywalking/java-agent/

根据Skywalking版本进行下载

2）idea使用探针

配置VM options

-javaagent:D:\Java\plugin\apache-skywalking-java-agent-8.9.0\skywalking-agent\skywalking-agent.jar -Dskywalking.agent.service_name=log-demo-service2 -Dskywalking.collector.backend_service=192.168.0.203:11800

参数说明-javaagent1）中下载的探针jar包位置-Dskywalking.agent.service_name在Skywalking中的服务名称，默认值为Your_ApplicationName-Dskywalking.collector.backend_serviceSkywalking地址，默认值为127.0.0.1:11800
如果在本地起的Skywalking，则没必要配置此参数

启动程序会发现Skywalking里已经有了日志，与idea控制台的日志一致。
调用接口

3）jar包使用探针

与idea一样，启动时配置VM options即可

java -javaagent:D:\Java\plugin\apache-skywalking-java-agent-8.9.0\skywalking-agent\skywalking-agent.jar -Dskywalking.agent.service_name=log-demo-service2 -Dskywalking.collector.backend_service=192.168.0.203:11800 -jar app.jar

4）docker使用探针

其实原理都一样，配置探针包的地址、配置skywalking服务端地址、配置服务名称。

有几种方式：

1、将外部的agent包，挂载到docker容器中，然后指定vm参数

2、构建docker时，把agent包打进docker镜像里，dockerfile修改enrtypoint，添加-javaagent参数

3、将agent包打到基础镜像里，dockerfile直接引用这个基础镜像

我采用第三种方式，而且这个基础镜像其实官方是有发布的，直接引用即可，但是只有几个版本，如果你的程序对jdk有要求，还是要采用其他方案，或者自己构建一个基础镜像。

在docker仓库里选择对应版本即可，镜像地址：https://hub.docker.com/r/apache/skywalking-java-agent，这个仓库是apache自己的，但是概述里有说明，这个不是ASF( Apache 软件基金会Apache Software Foundation)的官方版本

把这个替换到dockerfile中的FROM即可

FROM apache/skywalking-java-agent:8.9.0-java8
..

这个镜像自带探针包，无需指定探针包地址，那么其他两个参数怎么配置呢？

我是在docker-compose文件中配置了环境变量SW_AGENT_NAME与SW_AGENT_COLLECTOR_BACKEND_SERVICES

version:'2'services:log-demo-service1:image: log-service1:0.0.1-SKYWALKING
    container_name: log-demo-service-skywalking1
    restart: always
    ports:-"8090:8090"environment:TZ: Asia/Shanghai
      SW_AGENT_NAME: log-demo-service1
      SW_AGENT_COLLECTOR_BACKEND_SERVICES: 192.168.0.203:11800log-demo-service2:image: log-service2:0.0.1-SKYWALKING
    container_name: log-demo-service-skywalking2
    restart: always
    ports:-"8091:8091"environment:TZ: Asia/Shanghai
      SW_AGENT_NAME: log-demo-service2
      SW_AGENT_COLLECTOR_BACKEND_SERVICES: 192.168.0.203:11800

这两个变量是取自探针的配置文件/agent/config/agent.config

agent.service_name=${SW_AGENT_NAME:Your_ApplicationName}
collector.backend_service=${SW_AGENT_COLLECTOR_BACKEND_SERVICES:127.0.0.1:11800}

其实就对应着2）和3）中配置的VM options的-Dskywalking.agent.service_name和-Dskywalking.collector.backend_service，所以2）和3）中配置环境变量也是可以的。

启动docker-compose，会发现比idea中多了一行，就是这个基础镜像自带的配置，定义了一个JAVA_TOOL_OPTIONS，在这个变量里指定了agent包的位置。

参考：

SkyWalking UI指标使用说明(3)

SkyWalking之告警

SkyWalking解决分布式链路追踪，真香！

skywalking指南—agent日志采集和插件

标签： java 云原生容器

本文转载自: https://blog.csdn.net/qq_42445433/article/details/125998649
版权归原作者 阿狸尬多 所有，如有侵权，请联系我们删除。

skywalking简单入门使用

简介

Skywalking架构

部署安装

UI界面说明

仪表盘

拓扑图

追踪

性能剖析

日志

告警

事件

调试

日志对接

1、添加依赖

2、添加配置文件

3、接入探针agent

1）下载探针

2）idea使用探针

3）jar包使用探针

4）docker使用探针

发表评论

“skywalking简单入门使用”的评论:

关于作者

overfit同步小助手

相关阅读

文章导航