[Cloud-Native Learning] The Most Complete Prometheus Study Notes
Using Prometheus
1. Prometheus Basics
Prometheus is a time-series database suited to monitoring and alerting. It is not suited to use cases that require 100% accuracy, such as billing, because the collected samples are not guaranteed to be complete; it is mainly used to monitor and collect memory, CPU, and disk metrics.
1.1 Features
- A multi-dimensional data model (a time series is identified by a metric name and a set of key/value labels)
- A flexible query language (PromQL) over those dimensions
- No reliance on distributed storage; single server nodes are autonomous
- Time-series collection via a pull model over HTTP
- Pushing time series is supported through an intermediary gateway
- Targets are found via service discovery or static configuration
- Multiple modes of graphing and dashboard support
1.2 Components
The Prometheus ecosystem consists of multiple components, many of which are optional:
1. Prometheus Server: scrapes and stores time-series data and provides the query interface
2. Client Libraries: instrument application code to expose metrics
3. Push Gateway: short-term storage for metrics, mainly for short-lived jobs
4. Exporters: collect metrics from existing third-party systems and expose them at /metrics
5. Alertmanager: handles alerts
6. Web UI: a simple built-in web console
1.3 Architecture
1.4 The Four Metric Types
- Counter: only ever increases; commonly used for totals such as HTTP request counts or orders placed.
- Gauge: can increase and decrease; commonly used for values such as CPU, memory, or currently online users.
- Histogram: buckets samples to record their distribution, e.g. which latency range a request falls into.
- Summary: reports quantiles computed from samples on the client side, which Prometheus then scrapes, e.g. the response-time bounds for the 99th/90th/85th/70th/60th percentiles.
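To make the four types concrete, here is a minimal sketch using the official Python client library (prometheus_client); the metric names and the simulated workload are assumptions made up for this example:

```python
# A minimal sketch of the four metric types with the official Python client
# (pip install prometheus-client). Names and values here are illustrative.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter("demo_http_requests_total", "Total HTTP requests")      # only ever increases
ONLINE = Gauge("demo_online_users", "Users currently online")              # can go up and down
LATENCY_H = Histogram("demo_latency_seconds", "Latency, bucketed")         # distribution via buckets
LATENCY_S = Summary("demo_latency_summary_seconds", "Latency, summarized") # client-side aggregation

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics on :8000 for Prometheus to scrape
    while True:
        REQUESTS.inc()                      # Counter: increment only
        ONLINE.set(random.randint(0, 100))  # Gauge: set to an arbitrary value
        duration = random.random()
        LATENCY_H.observe(duration)         # Histogram: record into buckets
        LATENCY_S.observe(duration)         # Summary: record for client-side aggregation
        time.sleep(1)
```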
1.5 The Prometheus Data Model
- Prometheus stores all data as time series; samples with the same metric name and the same set of labels belong to the same series.
- Every time series is uniquely identified by its metric name and a set of key-value pairs, also known as labels.
Time series notation:
<metric name>{<label name>=<label value>, ...}
Example: api_http_requests_total{method="POST", handler="/messages"}
Metric types:
1. Counter: a monotonically increasing counter
2. Gauge: a value that can go up or down arbitrarily
3. Histogram: samples observations over a period, counting them in buckets and summing the observed values
4. Summary: similar to Histogram, but exposes client-side quantiles
1.5.1 Jobs and Instances
Instance: an endpoint that can be scraped is called an instance.
Job: a collection of instances with the same purpose is called a job.
Example:
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.10:9090']
2. Prometheus Deployment
- Binary deployment: https://prometheus.io/docs/prometheus/latest/getting_started/
- Docker deployment: https://prometheus.io/docs/prometheus/latest/installation/
2.1 Server-Side Setup
2.1.1 Download and install
[root@master01 ltp]# wget https://github.com/prometheus/prometheus/releases/download/v2.6.1/prometheus-2.6.1.linux-amd64.tar.gz
[root@master01 ltp]# tar zxvf prometheus-2.6.1.linux-amd64.tar.gz
[root@master01 ltp]# mv prometheus-2.6.1.linux-amd64 /usr/local/prometheus
[root@master01 ltp]# cd /usr/local/prometheus
[root@master01 prometheus]# ls
console_libraries  consoles  LICENSE  NOTICE  prometheus  prometheus.yml  promtool
[root@master01 prometheus]# vim prometheus.yml
# ...(partially omitted)
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']    # the scrape job: monitor port 9090 on this host
[root@master01 prometheus]# ./prometheus --help                          # show command-line help
[root@master01 prometheus]# ./prometheus --config.file="prometheus.yml"  # start Prometheus; it blocks in the foreground
2.1.2 Manage with systemd
[root@master01 prometheus]# cd /usr/lib/systemd/system
[root@master01 system]# cp -p sshd.service prometheus.service
[root@master01 system]# vim prometheus.service
[Unit]
Description=Prometheus
After=network.target

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml
# ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --web.read-timeout=5m --web.max-connections=10 --storage.tsdb.retention=15d --storage.tsdb.path=/prometheus/data --query.max-concurrency=20 --query.timeout=2m

[Install]
WantedBy=multi-user.target

[root@master01 system]# systemctl daemon-reload
[root@master01 system]# systemctl start prometheus.service
# Visit http://192.168.10.10:9090/metrics to see the metrics Prometheus itself exposes
# Visit http://192.168.10.10:9090 for the web console
---------------- Supplement: running Prometheus with Docker --------------------------
[root@node01 ~]# docker run -d -p 9090:9090 -v /tmp/prometheus.yaml:/etc/prometheus.yaml prom/prometheus   # -d runs the container in the background
---------------------------------------------------------------------------------------
2.1.3 Verify the Prometheus configuration
[root@master01 prometheus]# vim prometheus.yml
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
        labels:            # attach an extra label to this target
          idc: beijing
[root@master01 prometheus]# ps -ef | grep prometheus
root 1075 1 0 05:51 ? 00:00:01 /usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml
[root@master01 prometheus]# kill -HUP 1075    # reload Prometheus: kill -HUP <pid>
[root@master01 prometheus]# ./promtool check config prometheus.yml    # validate the config with promtool; if the check fails, a reload will not take effect
Checking prometheus.yml
  FAILED: parsing YAML file prometheus.yml: yaml: unmarshal errors:
  line 30: field label not found in type struct { Targets []string "yaml:\"targets\""; Labels model.LabelSet "yaml:\"labels\"" }
# (the FAILED output above is what promtool reports when the field is misspelled `label` instead of `labels`)
# Visit 192.168.10.10:9090
# On the targets page, the localhost:9090 entry now shows the newly added label
2.1.4 File-Based Service Discovery
Example:
  file_sd_configs:
    - files: ['/usr/local/prometheus/sd_config/*.yml']  # where to find target files; matching files are read automatically
      refresh_interval: 5s                              # how often to re-read them
[root@master01 prometheus]# mkdir /usr/local/prometheus/sd_config/
[root@master01 prometheus]# cd sd_config/
[root@master01 sd_config]# vim test.yml
- targets: ['192.168.10.10:9090']
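In practice these target files are often generated by a script rather than edited by hand. A minimal sketch follows; the host inventory, label value, and output path are assumptions, and since JSON is a subset of YAML, the generated file still matches the *.yml glob above:

```python
# A minimal sketch of generating a file_sd target file from a host inventory.
# The inventory, label value, and output path are illustrative assumptions.
import json

inventory = ["192.168.10.20:9100", "192.168.10.21:9100"]  # hypothetical hosts

groups = [{"targets": inventory, "labels": {"idc": "beijing"}}]

# JSON is valid YAML, so writing JSON into a .yml file keeps the glob happy.
with open("/usr/local/prometheus/sd_config/nodes.yml", "w") as f:
    json.dump(groups, f, indent=2)
# Prometheus re-reads matching files every refresh_interval; no reload is needed.
```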
2.2 Configure the Monitored Nodes
2.2.1 Configure the nodes to be monitored and start node_exporter
[root@node01 ~]# tar zxvf node_exporter-0.17.0.linux-amd64.tar.gz
[root@node02 ~]# tar zxvf node_exporter-0.17.0.linux-amd64.tar.gz
[root@node01 ~]# mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter
[root@node02 ~]# mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter
[root@node01 ~]# cd /usr/local/node_exporter
[root@node01 node_exporter]# ls
LICENSE  node_exporter  NOTICE
[root@node01 node_exporter]# vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter.service

[Service]
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target

[root@node01 node_exporter]# systemctl daemon-reload
[root@node01 node_exporter]# systemctl start node_exporter.service
[root@node01 node_exporter]# systemctl enable node_exporter.service
[root@node01 node_exporter]# ps -ef | grep node_exporter
root 1972 1 0 07:55 ? 00:00:00 /usr/local/node_exporter/node_exporter
[root@node01 node_exporter]# netstat -anput | grep 9100
tcp6  0  0 :::9100  :::*  LISTEN  1972/node_exporter
# Visit http://192.168.10.20:9100/metrics to see the metrics the node exposes
2.2.2 On the master, configure scraping of the node
Add the following configuration:
[root@master01 prometheus]# vim prometheus.yml
  - job_name: 'node01'
    file_sd_configs:
      - files: ['/usr/local/prometheus/sd_config/node01.yml']  # location of the service-discovery file
        refresh_interval: 5s
[root@master01 prometheus]# vim sd_config/node01.yml
- targets:
  - 192.168.10.20:9100
[root@master01 prometheus]# systemctl restart prometheus
Useful expressions:
CPU usage: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)
Memory usage: 100 - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100
Disk usage: 100 - node_filesystem_free_bytes{mountpoint="/",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{mountpoint="/",fstype=~"ext4|xfs"} * 100
2.2.3 Monitor service state on the node by editing node_exporter.service
The --collector.systemd.unit-whitelist parameter restricts which systemd units are monitored (here the ssh and docker services), and --collector.textfile.directory enables the textfile collector:
[root@node02 node_exporter]# cat /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter.service

[Service]
ExecStart=/usr/local/node_exporter/node_exporter \
  --web.listen-address=:9100 \
  --collector.systemd \
  --collector.systemd.unit-whitelist=(ssh|docker).service \
  --collector.textfile.directory=/usr/local/node_exporter/textfile.collected
Restart=on-failure

[Install]
WantedBy=multi-user.target
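The textfile collector picks up *.prom files from the configured directory, which is a convenient way to publish metrics from cron jobs or scripts. A minimal sketch, with a made-up metric name and value:

```python
# A minimal sketch of feeding node_exporter's textfile collector.
# The metric name/value are illustrative; the directory matches the
# --collector.textfile.directory flag configured above.
import os

TEXTFILE_DIR = "/usr/local/node_exporter/textfile.collected"

def write_gauge(name: str, value: float, help_text: str = "") -> None:
    tmp = os.path.join(TEXTFILE_DIR, f"{name}.prom.tmp")
    final = os.path.join(TEXTFILE_DIR, f"{name}.prom")
    with open(tmp, "w") as f:
        if help_text:
            f.write(f"# HELP {name} {help_text}\n")
        f.write(f"# TYPE {name} gauge\n")
        f.write(f"{name} {value}\n")
    # rename is atomic on the same filesystem, so a scrape never sees a half-written file
    os.rename(tmp, final)

write_gauge("backup_last_success_timestamp_seconds", 1718000000,
            "Unix time of the last successful backup")
```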
2.3 Install Grafana for Dashboards
Official site: https://grafana.com/grafana/download?pg=get&plcmt=selfmanaged-box1-cta1
2.3.1 Install grafana on the server
[root@prometheus ~]# cd /usr/local/src/
[root@prometheus src]# wget https://dl.grafana.com/oss/release/grafana-7.5.5-1.x86_64.rpm
[root@prometheus src]# yum localinstall grafana-7.5.5-1.x86_64.rpm
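After installation, start the service with systemctl start grafana-server, log in at http://<server-ip>:3000 (Grafana's default port; the initial credentials are admin/admin), and add Prometheus as a data source to start building dashboards.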
2.4 The Prometheus Configuration File and Core Features
Official reference: https://prometheus.io/docs/prometheus/latest/configuration/configuration/

1. Annotated global configuration:

global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]      # how often to scrape monitored targets; one minute by default
  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]      # scrape timeout; a target that has not answered within 10s times out
  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]  # how often alerting rules are evaluated; one minute by default
  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]
  # File to which PromQL queries are logged.
  # Reloading the configuration will reopen the file.
  [ query_log_file: <string> ]

# Rule files specifies a list of globs. Rules and alerts are read from
# all matching files.
rule_files:    # alerting rules
  [ - <filepath_glob> ... ]

# A list of scrape configurations.
scrape_configs:    # the metrics to scrape from monitored targets
  [ - <scrape_config> ... ]

# Alerting specifies settings related to the Alertmanager.
alerting:    # alerting setup
  alert_relabel_configs:    # relabeling applied to alerts
    [ - <relabel_config> ... ]
  alertmanagers:    # Alertmanager endpoints
    [ - <alertmanager_config> ... ]

# Settings related to the remote write feature.
remote_write:    # remote storage writes
  [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:    # remote storage reads
  [ - <remote_read> ... ]

================================================================================
2. scrape_configs: configuring jobs (following the official reference)

# The job name assigned to scraped metrics by default.
job_name: <job_name>    # job name; multiple jobs may be defined

# How frequently to scrape targets from this job.
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]    # per-job scrape interval; falls back to the global value

# Per-scrape timeout when scraping this job.
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]    # per-job timeout; falls back to the global value

# The HTTP resource path on which to fetch metrics from targets.
[ metrics_path: <path> | default = /metrics ]    # metrics endpoint path

# Controls how label conflicts are handled when a time series already has a given label.
[ honor_labels: <boolean> | default = false ]    # whether scraped labels override server-side labels; off by default

# honor_timestamps controls whether Prometheus respects the timestamps present in scraped data.
# If set to "true", the timestamps of the metrics exposed by the target are used.
# If set to "false", those timestamps are ignored.
[ honor_timestamps: <boolean> | default = true ]

# Configures the protocol scheme used for requests.
[ scheme: <scheme> | default = http ]    # how targets are scraped; http by default

# Optional HTTP URL parameters.
params:    # parameters appended to scrape requests, if needed
  [ <string>: [<string>, ...] ]

# Sets the `Authorization` header on every scrape request with the configured
# username and password. password and password_file are mutually exclusive.
basic_auth:    # basic authentication for the monitored targets
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

# Sets the `Authorization` header on every scrape request with the configured credentials.
authorization:
  [ type: <string> | default: Bearer ]    # the authentication type of the request
  [ credentials: <secret> ]               # mutually exclusive with credentials_file
  [ credentials_file: <filename> ]        # mutually exclusive with credentials

# Configure whether scrape requests follow HTTP 3xx redirects.
[ follow_redirects: <bool> | default = true ]

# Configures the scrape request's TLS settings.
tls_config:
  [ <tls_config> ]

# Optional proxy URL.
[ proxy_url: <string> ]

# Service discovery configurations, one list per mechanism:
azure_sd_configs:           # Azure
  [ - <azure_sd_config> ... ]
consul_sd_configs:          # Consul
  [ - <consul_sd_config> ... ]
digitalocean_sd_configs:    # DigitalOcean
  [ - <digitalocean_sd_config> ... ]
dockerswarm_sd_configs:     # Docker Swarm
  [ - <dockerswarm_sd_config> ... ]
dns_sd_configs:             # DNS
  [ - <dns_sd_config> ... ]
ec2_sd_configs:             # EC2
  [ - <ec2_sd_config> ... ]
eureka_sd_configs:          # Eureka
  [ - <eureka_sd_config> ... ]
file_sd_configs:            # file-based
  [ - <file_sd_config> ... ]
gce_sd_configs:             # GCE
  [ - <gce_sd_config> ... ]
hetzner_sd_configs:         # Hetzner
  [ - <hetzner_sd_config> ... ]
kubernetes_sd_configs:      # Kubernetes
  [ - <kubernetes_sd_config> ... ]
marathon_sd_configs:        # Marathon
  [ - <marathon_sd_config> ... ]
nerve_sd_configs:           # AirBnB's Nerve
  [ - <nerve_sd_config> ... ]
openstack_sd_configs:       # OpenStack
  [ - <openstack_sd_config> ... ]
scaleway_sd_configs:        # Scaleway
  [ - <scaleway_sd_config> ... ]
serverset_sd_configs:       # Zookeeper Serverset
  [ - <serverset_sd_config> ... ]
triton_sd_configs:          # Triton
  [ - <triton_sd_config> ... ]

# List of labeled statically configured targets for this job.
static_configs:    # statically configured instances
  [ - <static_config> ... ]

# List of target relabel configurations.
relabel_configs:    # relabeling applied to targets before scraping
  [ - <relabel_config> ... ]

# List of metric relabel configurations.
metric_relabel_configs:    # relabeling applied to samples after scraping
  [ - <relabel_config> ... ]

# Per-scrape limit on the number of scraped samples that will be accepted.
# If more samples than this are present after metric relabeling, the entire
# scrape is treated as failed. 0 means no limit.
[ sample_limit: <int> | default = 0 ]

# Per-scrape-config limit on the number of unique targets accepted. If more
# targets than this are present after target relabeling, Prometheus marks them
# as failed without scraping them. 0 means no limit (experimental; this
# behaviour could change in the future).
[ target_limit: <int> | default = 0 ]

================================================================================
3. relabel_configs: modify any target and its labels before scraping. Relabeling is used to (1) rename labels, (2) drop labels, and (3) filter targets.

[ source_labels: '[' <labelname> [, ...] ']' ]      # the source labels
[ separator: <string> | default = ; ]               # separator placed between concatenated source label values
[ target_label: <labelname> ]                       # label the result is written to; mandatory for replace actions; regex capture groups are available
[ regex: <regex> | default = (.*) ]                 # regular expression matched against the extracted value; matches everything by default
[ modulus: <int> ]                                  # modulus to take of the hash of the source label values
[ replacement: <string> | default = $1 ]            # replacement value when the regex matches; capture groups are referenced as $1, $2, $3
[ action: <relabel_action> | default = replace ]    # action performed on a regex match; replace by default

Example:
relabel_configs:
  - action: replace
    source_labels: ['test']
    regex: (.*)
    replacement: $1
    target_label: idc

The relabel actions:
1. replace: the default; match the source_labels value with regex and write replacement (which may reference capture groups) to target_label
2. keep: drop targets whose source_labels do not match regex (keep the matches)
3. drop: drop targets whose source_labels match regex (keep the non-matches)
4. labeldrop: remove every label whose name matches regex
5. labelkeep: remove every label whose name does not match regex
6. hashmod: set target_label to the modulus of a hash of the concatenated source_labels
7. labelmap: match regex against all label names, then copy the values of the matching labels to the label names given by replacement (capture groups referenced as ${1}, ${2}, ...)

Sources that support service discovery:
• azure_sd_configs
• consul_sd_configs
• dns_sd_configs
• ec2_sd_configs
• openstack_sd_configs
• file_sd_configs ***    (file-based dynamic discovery of monitored targets)
• gce_sd_configs
• kubernetes_sd_configs *****
• marathon_sd_configs
• nerve_sd_configs
• serverset_sd_configs
• triton_sd_configs
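To demystify how source_labels, regex, replacement, and target_label interact in a replace action, here is a rough Python illustration of the documented semantics. This is not Prometheus code; note that Python spells capture-group references \1 where Prometheus uses $1:

```python
# A rough illustration of the `replace` relabel action's semantics.
# Not Prometheus source code; it just mimics the documented behaviour.
import re

def relabel_replace(labels, source_labels, target_label,
                    regex=r"(.*)", replacement=r"\1", separator=";"):
    """Join the source label values with the separator; if the regex matches
    the whole string, expand the replacement into target_label."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # relabel regexes are fully anchored
    if match:
        labels = dict(labels)
        labels[target_label] = match.expand(replacement)
    return labels

# Mirrors the example above: copy the value of label `test` into label `idc`.
print(relabel_replace({"test": "beijing"}, ["test"], "idc"))
# -> {'test': 'beijing', 'idc': 'beijing'}
```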
3. Prometheus API
3.1 Expression Queries
GET /api/v1/query
3.1.1 Query parameters
- query=<string>: the PromQL expression to evaluate
- time=<rfc3339 | unix_timestamp>: evaluation timestamp; optional
- timeout=<duration>: evaluation timeout; optional; defaults to the value of -query.timeout
Example request URL:
http://ip:port/api/v1/query?query=up&time=2018-07-01T20:10:51.781Z
Response:
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "up",
          "app": "prometheus",
          "instance": "10.244.1.84:9090",
          "job": "kubernetes-service-endpoints",
          "kubernetes_name": "prometheus-service",
          "kubernetes_namespace": "ns-monitor"
        },
        "value": [1530497003.491, "1"]
      }
    ]
  }
}
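The same query can be issued from code; a minimal sketch using Python's requests package (the server address is a placeholder assumption):

```python
# A minimal sketch of calling the instant-query endpoint from Python
# (pip install requests). The server address is a placeholder assumption.
import requests

PROM = "http://192.168.10.10:9090"

resp = requests.get(f"{PROM}/api/v1/query",
                    params={"query": "up"},  # any PromQL expression works here
                    timeout=10)
body = resp.json()
assert body["status"] == "success"
for series in body["data"]["result"]:
    # each result carries a label set plus a [timestamp, value] pair
    print(series["metric"].get("instance"), "=", series["value"][1])
```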
3.1.2 Range queries
GET /api/v1/query_range
- query=<string>: the PromQL expression to evaluate
- start=<rfc3339 | unix_timestamp>: start timestamp
- end=<rfc3339 | unix_timestamp>: end timestamp
- step=<duration>: query resolution step width
- timeout=<duration>: evaluation timeout; optional; defaults to the value of -query.timeout
Request URL:
http://ip:port/api/v1/query_range?query=up&start=2018-07-01T20:10:30.781Z&end=2018-07-02T20:11:00.781Z&step=100s
Response:
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "__name__": "up",
          "app": "prometheus",
          "instance": "10.244.1.84:9090",
          "job": "kubernetes-service-endpoints",
          "kubernetes_name": "prometheus-service",
          "kubernetes_namespace": "ns-monitor"
        },
        "values": [
          [1530496830.781, "1"],
          [1530496930.781, "1"]
        ]
      }
    ]
  }
}
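A minimal sketch of the same call from Python, querying the last hour at 100-second resolution (server address assumed as before):

```python
# A minimal sketch of a range query over the last hour
# (pip install requests; the server address is a placeholder assumption).
import time

import requests

PROM = "http://192.168.10.10:9090"

end = time.time()
resp = requests.get(f"{PROM}/api/v1/query_range",
                    params={"query": "up",
                            "start": end - 3600,  # one hour ago
                            "end": end,
                            "step": "100s"},
                    timeout=10)
for series in resp.json()["data"]["result"]:
    # "values" holds one [timestamp, value] pair per step
    print(series["metric"].get("instance"), len(series["values"]), "samples")
```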
3.2 Querying Metadata
3.2.1 Finding series by label matchers
GET /api/v1/series
Parameters:
- match[]=<series_selector>: repeated series selector argument; at least one match[] argument must be provided
- start=<rfc3339 | unix_timestamp>: start timestamp
- end=<rfc3339 | unix_timestamp>: end timestamp
Request URL:
http://ip:port/api/v1/series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}
Response:
{
  "status": "success",
  "data": [
    {
      "__name__": "process_start_time_seconds",
      "instance": "localhost:9090",
      "job": "prometheus"
    }
  ]
}
3.2.2 Querying label values
GET /api/v1/label/<label_name>/values
Request URL:
http://ip:port/api/v1/label/job/values
Response:
{
  "status": "success",
  "data": [
    "grafana",
    "kubernetes-apiservers",
    "kubernetes-cadvisor",
    "kubernetes-nodes",
    "kubernetes-service-endpoints",
    "prometheus"
  ]
}
3.3 Expression Result Formats
3.3.1 Range vectors
[
  {
    "metric": { "<label_name>": "<label_value>", ... },
    "values": [ [ <unix_time>, "<sample_value>" ], ... ]
  },
  ...
]
Instant vectors:
[
  {
    "metric": { "<label_name>": "<label_value>", ... },
    "value": [ <unix_time>, "<sample_value>" ]
  },
  ...
]
3.3.2 Scalars
[ <unix_time>, "<scalar_value>" ]
3.3.3 Strings
[ <unix_time>, "<string_value>" ]
3.4 Querying Targets
GET /api/v1/targets
Request URL:
http://ip:port/api/v1/targets
Response:
{
  "status": "success",
  "data": {
    "activeTargets": [
      {
        "discoveredLabels": {
          "__address__": "localhost:9090",
          "__metrics_path__": "/metrics",
          "__scheme__": "http",
          "job": "prometheus"
        },
        "labels": {
          "instance": "localhost:9090",
          "job": "prometheus"
        },
        "scrapeUrl": "http://localhost:9090/metrics",
        "lastError": "",
        "lastScrape": "2018-07-02T02:39:45.398712723Z",
        "health": "up"
      }
    ]
  }
}
3.5 Alertmanagers
GET /api/v1/alertmanagers
Request URL:
http://ip:port/api/v1/alertmanagers
Response:
{
  "status": "success",
  "data": {
    "activeAlertmanagers": [
      { "url": "http://127.0.0.1:9090/api/v1/alerts" }
    ],
    "droppedAlertmanagers": [
      { "url": "http://127.0.0.1:9093/api/v1/alerts" }
    ]
  }
}
3.6 Config
GET /api/v1/status/config
Request URL:
http://ip:port/api/v1/status/config
Response (the yaml string is abridged in the original):
{
  "status": "success",
  "data": {
    "yaml": "global:\n scrape_interval: 15s\n scrape_timeout: 10s\n evaluation_interval: 15s\nalerting:\n alertmanagers:\n - static_configs:\n - targets: []\n scheme: http\n timeout: 10s\nscrape_configs:\n- job_name: prometheus\n acement: $1\n action: replace\n - source_labels: [__meta_kubernetes_pod_name]\n separator: ;\n regex: (.*)\n target_label: kubernetes_pod_name\n replacement: $1\n action: replace\n"
  }
}
3.7 Flags
GET /api/v1/status/flags (returns the flag values of the running server)
Request URL:
http://ip:port/api/v1/status/flags
Response:
{
  "status": "success",
  "data": {
    "alertmanager.notification-queue-capacity": "10000",
    "alertmanager.timeout": "10s",
    "config.file": "/etc/prometheus/prometheus.yml",
    "log.level": "info",
    "query.lookback-delta": "5m",
    "query.max-concurrency": "20",
    "query.timeout": "2m",
    "storage.tsdb.max-block-duration": "36h",
    "storage.tsdb.min-block-duration": "2h",
    "storage.tsdb.no-lockfile": "false",
    "storage.tsdb.path": "/prometheus",
    "storage.tsdb.retention": "15d",
    "web.console.libraries": "/usr/share/prometheus/console_libraries",
    "web.console.templates": "/usr/share/prometheus/consoles",
    "web.enable-admin-api": "false",
    "web.enable-lifecycle": "false",
    "web.external-url": "",
    "web.listen-address": "0.0.0.0:9090",
    "web.max-connections": "512",
    "web.read-timeout": "5m",
    "web.route-prefix": "/",
    "web.user-assets": ""
  }
}
4. Prometheus Admin API
In the Prometheus container spec of the pod, add the --web.enable-admin-api argument shown below; it enables the admin API:
containers:
- image: prom/prometheus:v2.0.0
  name: prometheus
  command:
  - "/bin/prometheus"
  args:
  - "--config.file=/etc/prometheus/prometheus.yml"
  - "--storage.tsdb.path=/prometheus"
  - "--storage.tsdb.retention=24h"
  - "--web.enable-admin-api"
  ports:
  - containerPort: 9090
    protocol: TCP
  volumeMounts:
  - mountPath: "/prometheus"
    name: data
  - mountPath: "/etc/prometheus"
    name: config-volume
4.1 Snapshot
POST /api/v1/admin/tsdb/snapshot?skip_head=<bool>
Request URL:
http://ip:port/api/v1/admin/tsdb/snapshot
Response:
{
  "status": "success",
  "data": {
    "name": "20180702T033639Z-28bcb561ec57373"
  }
}
The snapshot data now lives under <data-dir>/snapshots/20180702T033639Z-28bcb561ec57373.
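From code, the call looks like this; a minimal sketch (the admin API must be enabled, and the server address is an assumption):

```python
# A minimal sketch of taking a TSDB snapshot through the admin API
# (requires --web.enable-admin-api; the address is a placeholder assumption).
import requests

PROM = "http://192.168.10.10:9090"

resp = requests.post(f"{PROM}/api/v1/admin/tsdb/snapshot")
resp.raise_for_status()
print("snapshot directory:", resp.json()["data"]["name"])  # relative to <data-dir>/snapshots/
```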
4.2 Delete Series
POST /api/v1/admin/tsdb/delete_series
Parameters:
- match[]=<series_selector>: repeated label matcher argument that selects the series to delete; at least one match[] argument must be provided
- start=<rfc3339 | unix_timestamp>: start timestamp; optional, defaults to the minimum possible time
- end=<rfc3339 | unix_timestamp>: end timestamp; optional, defaults to the maximum possible time
Request:
http://ip:port/api/v1/admin/tsdb/delete_series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}
Return code: 204
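A minimal sketch of the same deletion from Python (the selector and server address are assumptions; a 204 status code indicates success):

```python
# A minimal sketch of deleting series through the admin API
# (requires --web.enable-admin-api; address and selector are assumptions).
import requests

PROM = "http://192.168.10.10:9090"

resp = requests.post(f"{PROM}/api/v1/admin/tsdb/delete_series",
                     params={"match[]": 'up{job="node01"}'})
print(resp.status_code)  # 204 on success; run clean_tombstones afterwards to free disk space
```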
4.3 Clean Tombstones
Removes tombstone files and frees the disk space of series deleted via delete_series.
POST /api/v1/admin/tsdb/clean_tombstones
Request:
http://ip:port/api/v1/admin/tsdb/clean_tombstones
Return code: 204
5. Prometheus Metrics Reference
5.1 Container CPU
Metric | Description |
---|---|
container_cpu_load_average_10s | CPU load average over the last 10 seconds |
container_cpu_system_seconds_total | system-mode CPU time |
container_cpu_usage_seconds_total | total CPU time consumed |
container_cpu_user_seconds_total | user-mode CPU time |
5.2 Container Filesystem
Metric | Description |
---|---|
container_fs_inodes_free | free inodes |
container_fs_inodes_total | total inodes |
container_fs_io_current | I/Os currently in progress |
container_fs_io_time_seconds_total | time spent doing I/O |
container_fs_io_time_weighted_seconds_total | weighted time spent doing I/O |
container_fs_limit_bytes | filesystem size limit |
container_fs_read_seconds_total | time spent reading |
container_fs_reads_bytes_total | bytes read |
container_fs_reads_total | total read requests |
container_fs_usage_bytes | bytes in use |
container_fs_write_seconds_total | time spent writing |
container_fs_writes_bytes_total | bytes written |
container_fs_writes_total | total write requests |
5.3 Container Memory
Metric | Description |
---|---|
container_memory_cache | page cache |
container_memory_max_usage_bytes | maximum recorded memory usage |
container_memory_failcnt | count of failed memory allocations |
container_memory_failures_total | memory allocation failures |
container_memory_rss | RSS: the physical memory used by the process |
container_memory_swap | swap usage |
container_memory_usage_bytes | current memory usage |
container_memory_working_set_bytes | working set: the physical memory the process can use (but does not necessarily use) |
5.4 Container Network
Metric | Description |
---|---|
container_network_receive_bytes_total | bytes received |
container_network_receive_errors_total | receive errors |
container_network_receive_packets_dropped_total | received packets dropped |
container_network_receive_packets_total | total packets received |
container_network_tcp_usage_total | TCP usage |
container_network_transmit_bytes_total | total bytes transmitted |
container_network_transmit_errors_total | transmit errors |
container_network_transmit_packets_dropped_total | transmitted packets dropped |
container_network_transmit_packets_total | total packets transmitted |
container_network_udp_usage_total | UDP usage |
container_spec_cpu_period | CPU period: how often CPU time is redistributed |
container_spec_cpu_shares | relative CPU weight |
container_spec_memory_limit_bytes | memory limit |
container_spec_memory_reservation_limit_bytes | reserved memory limit |
container_spec_memory_swap_limit_bytes | swap limit |
container_start_time_seconds | container start time |
5.5 Kubelet
Metric | Description |
---|---|
kubelet_containers_per_pod_count | number of containers per pod |
kubelet_containers_per_pod_count_count | number of pods observed |
kubelet_containers_per_pod_count_sum | total number of containers across pods |
kubelet_network_plugin_operations_latency_microseconds | network plugin operation latency |
kubelet_network_plugin_operations_latency_microseconds_count | number of latency observations |
kubelet_network_plugin_operations_latency_microseconds_sum | total latency |
kubelet_pod_start_latency_microseconds | pod start latency |
kubelet_pod_start_latency_microseconds_count | number of pod start latency observations |
kubelet_pod_start_latency_microseconds_sum | total pod start latency |
kubelet_running_container_count | number of running containers |
kubelet_running_pod_count | number of running pods |
kubernetes_build_info | build information |

Metric | Description |
---|---|
machine_cpu_cores | number of CPU cores |
machine_memory_bytes | machine memory size |
5.6 Node
Metric | Description |
---|---|
node_boot_time_seconds | boot time |
node_cpu_seconds_total | total CPU time |
node_load1 | 1-minute load average |
node_load15 | 15-minute load average |
node_load5 | 5-minute load average |
node_disk_io_now | disk I/Os currently in progress |
node_disk_io_time_seconds_total | total time spent on disk I/O |
node_disk_io_time_weighted_seconds_total | weighted total time spent on disk I/O |
node_disk_read_bytes_total | total bytes read from disk |
node_disk_read_time_seconds_total | total time spent reading |
node_disk_reads_completed_total | total reads completed |
node_disk_reads_merged_total | total read requests merged |
node_disk_write_time_seconds_total | total time spent writing |
node_disk_writes_completed_total | total writes completed |
node_disk_writes_merged_total | total write requests merged |
node_disk_written_bytes_total | total bytes written |
node_filefd_maximum | maximum number of file descriptors |
node_filesystem_avail_bytes | filesystem bytes available |
node_filesystem_device_error | filesystem device errors |
node_filesystem_files | total number of inodes |
node_filesystem_files_free | free inodes |
node_filesystem_free_bytes | free bytes |
node_filesystem_readonly | whether the filesystem is read-only |
node_filesystem_size_bytes | filesystem size |
node_forks_total | total number of forks |
node_memory_Buffers_bytes | buffer size |
node_memory_Cached_bytes | page cache size |
node_memory_CommitLimit_bytes | commit limit |
node_memory_Committed_AS_bytes | memory currently committed by the system |
node_memory_KernelStack_bytes | kernel stack size |
node_memory_Mapped_bytes | memory-mapped size |
node_memory_MemAvailable_bytes | available memory |
node_memory_MemFree_bytes | free memory |
node_memory_MemTotal_bytes | total usable RAM |
node_memory_Mlocked_bytes | memory locked in RAM |
node_memory_NFS_Unstable_bytes | unstable NFS pages |
node_memory_PageTables_bytes | memory used by the lowest-level page tables |
node_memory_SReclaimable_bytes | the part of Slab that can be reclaimed under memory pressure |
node_memory_SUnreclaim_bytes | the part of Slab that cannot be reclaimed |
node_memory_Shmem_bytes | shared memory size |
node_memory_Slab_bytes | kernel data-structure cache size |
node_memory_SwapCached_bytes | swap cache |
node_memory_SwapFree_bytes | free swap |
node_memory_SwapTotal_bytes | total swap |
node_memory_Unevictable_bytes | pages that cannot be swapped out |
node_memory_VmallocChunk_bytes | largest contiguous free block in the vmalloc area |
node_memory_VmallocTotal_bytes | total vmalloc address space |
node_memory_VmallocUsed_bytes | vmalloc space in use |
node_memory_WritebackTmp_bytes | temporary data being written back |
node_memory_Writeback_bytes | data currently being written back |
node_network_receive_bytes_total | bytes received |
node_network_receive_compressed_total | compressed packets received |
node_network_receive_drop_total | received packets dropped |
node_network_receive_errs_total | receive errors |
node_network_receive_fifo_total | FIFO buffer errors on receive |
node_network_receive_frame_total | framing errors |
node_network_receive_multicast_total | multicast frames sent or received by the device driver |
node_network_receive_packets_total | total packets received |
node_network_transmit_bytes_total | total bytes transmitted |
node_network_transmit_carrier_total | carrier losses detected by the device driver |
node_network_transmit_colls_total | collisions detected on the interface |
node_network_transmit_compressed_total | compressed packets transmitted |
node_network_transmit_drop_total | transmitted packets dropped |
node_network_transmit_errs_total | transmit errors |
node_network_transmit_fifo_total | FIFO buffer errors on transmit |
node_network_transmit_packets_total | total packets transmitted |
node_procs_blocked | number of blocked processes |
node_procs_running | number of running processes |
node_uname_info | uname (system identification) information |
node_vmstat_pgfault | page faults |
node_vmstat_pgmajfault | major page faults |
node_vmstat_pgpgin | kilobytes paged in from disk or swap |
node_vmstat_pgpgout | kilobytes paged out to disk or swap |
node_vmstat_pswpin | swap pages swapped in |
node_vmstat_pswpout | swap pages swapped out |
5.7 Process
Metric | Description |
---|---|
process_cpu_seconds_total | total CPU time used by the process |
process_resident_memory_bytes | resident memory of the process |
process_start_time_seconds | process start time |
process_virtual_memory_bytes | virtual memory of the process |