Using Prometheus

1. Prometheus Basics

Prometheus is a time-series database suited to monitoring and alerting. It is not suitable for 100%-accurate billing, because the scraped data is not guaranteed to be complete. Its main job is monitoring and collecting memory, CPU, and disk data.

1.1 Features
  • Multi-dimensional data model (a time series is identified by a metric name plus a set of key/value labels)
  • A flexible query language over those dimensions (PromQL)
  • No reliance on distributed storage; single server nodes are autonomous
  • Time series collection via a pull model over HTTP
  • Pushing time series is supported through an intermediary gateway (Pushgateway)
  • Targets are found via service discovery or static configuration
  • Multiple modes of graphing and dashboarding support
1.2 Components

The Prometheus ecosystem consists of multiple components, many of which are optional:

1. Prometheus Server: scrapes and stores time series data and serves queries
2. Client libraries: for instrumenting application code
3. Pushgateway: short-lived storage for metrics pushed by ephemeral jobs
4. Exporters: collect metrics from existing third-party systems and expose them at a metrics endpoint
5. Alertmanager: handles alerts
6. Web UI: a basic web console

1.3 Architecture

(Architecture diagram omitted.)

1.4 The four metric types
  • Counter
    A counter only ever increases. Typical uses: total HTTP requests, total orders placed.

  • Gauge
    A gauge can go up or down. Typical uses: CPU usage, memory usage, current online users.

  • Histogram
    A histogram buckets observed samples to approximate their distribution, e.g. which latency range each request falls into.

  • Summary
    A summary reports quantiles computed on the client side, which Prometheus then scrapes, e.g. the response time at the 99th, 90th, 85th, 70th, or 60th percentile.
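The four types can be illustrated with a toy Python sketch. This is not the official prometheus_client library, just a model of the semantics, and the bucket boundaries and latencies below are made up:

```python
class Counter:
    """Monotonically increasing value, e.g. total HTTP requests served."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount


class Gauge:
    """Value that can rise and fall, e.g. memory in use or online users."""
    def __init__(self):
        self.value = 0.0

    def set(self, v):
        self.value = v

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Counts observations into cumulative 'le' (less-or-equal) buckets."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, float("inf"))):
        self.buckets = buckets
        self.counts = [0] * len(buckets)  # cumulative count per bucket
        self.sum = 0.0                    # total of all observed values

    def observe(self, v):
        self.sum += v
        for i, le in enumerate(self.buckets):
            if v <= le:
                self.counts[i] += 1  # v lands in every bucket whose bound >= v


h = Histogram()
for latency in (0.05, 0.3, 0.7, 2.0):
    h.observe(latency)
print(h.counts)  # [1, 2, 3, 4]
```

A Summary differs in that the client itself computes quantiles over a sliding window and exposes them directly, so the server does no bucketing.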

1.5 The Prometheus data model
  • Prometheus stores all data as time series; streams that share a metric name and the same set of labels belong to the same series.
  • Every time series is uniquely identified by its metric name and a set of key-value pairs (also known as labels).

Time series format:

<metric name>{<label name>=<label value>, ...}

Example: api_http_requests_total{method="POST", handler="/messages"}
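The notation can be taken apart mechanically. A small sketch (the `parse_series` helper is hypothetical, and its naive comma split would break on label values containing commas):

```python
import re

# Matches <metric name>{<label name>=<label value>, ...}
SERIES_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?$')

def parse_series(s):
    """Split a series string into (metric_name, label_dict)."""
    m = SERIES_RE.match(s)
    if not m:
        raise ValueError("not a valid series: %r" % s)
    labels = {}
    if m.group("labels"):
        for pair in m.group("labels").split(","):  # naive: breaks if a value contains ','
            k, v = pair.split("=", 1)
            labels[k.strip()] = v.strip().strip('"')
    return m.group("name"), labels

print(parse_series('api_http_requests_total{method="POST", handler="/messages"}'))
# ('api_http_requests_total', {'method': 'POST', 'handler': '/messages'})
```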

Metric types:
1. Counter: a monotonically increasing counter
2. Gauge: a value that can change arbitrarily
3. Histogram: samples observations over a time window and reports both the sum and the count per bucket
4. Summary: similar to Histogram, but with client-side quantiles

1.5.1 Jobs and instances
  • Instance: an endpoint you can scrape is called an instance.

  • Job: a collection of instances with the same purpose is called a job.

Example:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.10:9090']

2. Deploying Prometheus

  • Binary install: https://prometheus.io/docs/prometheus/latest/getting_started/
  • Docker install: https://prometheus.io/docs/prometheus/latest/installation/
2.1 Server-side setup
2.1.1 Download and unpack
[root@master01 ltp]# wget https://github.com/prometheus/prometheus/releases/download/v2.6.1/prometheus-2.6.1.linux-amd64.tar.gz

[root@master01 ltp]# tar zxvf prometheus-2.6.1.linux-amd64.tar.gz

[root@master01 ltp]# mv prometheus-2.6.1.linux-amd64 /usr/local/prometheus

[root@master01 ltp]# cd /usr/local/prometheus

[root@master01 prometheus]# ls
console_libraries  consoles  LICENSE  NOTICE  prometheus  prometheus.yml  promtool


[root@master01 prometheus]# vim prometheus.yml
... (part omitted)
21 scrape_configs:
22   # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
23   - job_name: 'prometheus'
24 
25     # metrics_path defaults to '/metrics'
26     # scheme defaults to 'http'.
27 
28     static_configs:
29     - targets: ['localhost:9090']                             # job config: scrape this host on port 9090


[root@master01 prometheus]# ./prometheus --help                   # show command help

[root@master01 prometheus]# ./prometheus --config.file="prometheus.yml"    # start Prometheus in the foreground (blocks the terminal)

2.1.2 Managing with systemd
[root@master01 prometheus]# cd /usr/lib/systemd/system
[root@master01 system]# cp -p sshd.service prometheus.service

[root@master01 system]# vim prometheus.service
[Unit]
Description=Prometheus
After=network.target

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml
# ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --web.read-timeout=5m  --web.max-connections=10 --storage.tsdb.retention=15d --storage.tsdb.path=/prometheus/data --query.max-concurrency=20 --query.timeout=2m

[Install]
WantedBy=multi-user.target


[root@master01 system]# systemctl daemon-reload

[root@master01 system]# systemctl start prometheus.service


# Visit http://192.168.10.10:9090/metrics to view the exposed metrics

# Visit http://192.168.10.10:9090 for the web console
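What /metrics returns is plain text in the Prometheus exposition format. A minimal sketch of reading such text (the sample payload below is abbreviated and invented, not a real scrape):

```python
sample = """\
# HELP up Whether the last scrape of the target succeeded.
# TYPE up gauge
up 1
# TYPE prometheus_http_requests_total counter
prometheus_http_requests_total{code="200",handler="/metrics"} 42
"""

samples = {}
for line in sample.splitlines():
    line = line.strip()
    if not line or line.startswith("#"):   # skip blanks and HELP/TYPE comments
        continue
    series, value = line.rsplit(None, 1)   # the value is the last whitespace-separated field
    samples[series] = float(value)

print(samples["up"])  # 1.0
```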


---------------- Supplement: running Prometheus with Docker --------------------------
[root@node01 ~]# docker run -d -p 9090:9090 -v /tmp/prometheus.yaml:/etc/prometheus.yaml prom/prometheus    # -d runs it in the background
-------------------------------------------------------------------------------------------

2.1.3 Verifying the Prometheus configuration
[root@master01 prometheus]# vim prometheus.yml
scrape_configs:
 # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
 - job_name: 'prometheus'

   # metrics_path defaults to '/metrics'
   # scheme defaults to 'http'.

   static_configs:
   - targets: ['localhost:9090']
     labels:                                    # attach the label  idc: beijing
       idc: beijing


[root@master01 prometheus]# ps -ef |grep promethe
root       1075      1  0 05:51 ?        00:00:01 /usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml

[root@master01 prometheus]# kill -HUP 1075                                                           # reload Prometheus: kill -HUP <pid>

[root@master01 prometheus]# ./promtool check config prometheus.yml                                   # validate the config with promtool; if the check fails, a reload will not take effect
Checking prometheus.yml
 FAILED: parsing YAML file prometheus.yml: yaml: unmarshal errors:
 line 30: field label not found in type struct { Targets []string "yaml:\"targets\""; Labels model.LabelSet "yaml:\"labels\"" }


# Visit 192.168.10.10:9090
On the targets page, the localhost:9090 entry now carries the new label.

2.1.4 File-based service discovery
Example:
file_sd_configs:
 - files: ['/usr/local/prometheus/sd_config/*.yml']                # where to look for discovery files; matching files are read automatically
   refresh_interval: 5s                                            # how often to re-read them

[root@master01 prometheus]# mkdir /usr/local/prometheus/sd_config/
[root@master01 prometheus]#  cd sd_config/
[root@master01 sd_config]# vim test.yml
- targets: ['192.168.10.10:9090']
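Discovery files may also attach labels to their targets; they accept the same shape as a static_config entry. A sketch reusing the idc: beijing label from earlier:

```yaml
# /usr/local/prometheus/sd_config/test.yml
- targets:
    - '192.168.10.10:9090'
  labels:
    idc: beijing
```

Edits to the file are picked up within refresh_interval, without restarting Prometheus.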

2.2 Monitored node setup
2.2.1 Install and start node_exporter on each monitored node
[root@node01 ~]# tar zxvf node_exporter-0.17.0.linux-amd64.tar.gz
[root@node02 ~]# tar zxvf node_exporter-0.17.0.linux-amd64.tar.gz 

[root@node01 ~]# mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter
[root@node02 ~]# mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter

[root@node01 ~]# cd /usr/local/node_exporter

[root@node01 node_exporter]# ls
LICENSE  node_exporter  NOTICE


[root@node01 node_exporter]# vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter.service

[Service]
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target


[root@node01 node_exporter]# systemctl daemon-reload 
[root@node01 node_exporter]# systemctl start node_exporter.service
[root@node01 node_exporter]# systemctl enable node_exporter.service


[root@node01 node_exporter]# ps -ef |grep node_exporter
root       1972      1  0 07:55 ?        00:00:00 /usr/local/node_exporter/node_exporter

[root@node01 node_exporter]# netstat -anput |grep 9100
tcp6       0      0 :::9100                 :::*                    LISTEN      1972/node_exporter  

# Visit http://192.168.10.20:9100/metrics to view the metrics exposed by the node

2.2.2 Configure the master to scrape the nodes
Add the following:

[root@master01 prometheus]# vim prometheus.yml
 - job_name: 'node01'
   file_sd_configs:
   - files: ['/usr/local/prometheus/sd_config/node01.yml']                # discovery file location
     refresh_interval: 5s


[root@master01 prometheus]# vim sd_config/node01.yml
- targets:
 - 192.168.10.20:9100

[root@master01 prometheus]# systemctl restart prometheus

CPU usage:    100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)
Memory usage: 100 - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100
Disk usage:   100 - (node_filesystem_free_bytes{mountpoint="/",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{mountpoint="/",fstype=~"ext4|xfs"} * 100)

2.2.3 Monitoring services on a node: extend node_exporter.service
[root@node02 node_exporter]# cat /usr/lib/systemd/system/node_exporter.service 
[Unit]
Description=node_exporter.service

[Service]
# --collector.systemd.unit-whitelist limits the systemd collector to the ssh and docker units
ExecStart=/usr/local/node_exporter/node_exporter \
--web.listen-address=:9100 \
--collector.systemd \
--collector.systemd.unit-whitelist=(ssh|docker).service \
--collector.textfile.directory=/usr/local/node_exporter/textfile.collected
Restart=on-failure

[Install]
WantedBy=multi-user.target

2.3 Installing Grafana for dashboards

Official site: https://grafana.com/grafana/download?pg=get&plcmt=selfmanaged-box1-cta1

2.3.1 Install Grafana on the server

[root@prometheus ~]# cd /usr/local/src/
[root@prometheus src]# wget https://dl.grafana.com/oss/release/grafana-7.5.5-1.x86_64.rpm
[root@prometheus src]# yum localinstall grafana-7.5.5-1.x86_64.rpm


2.4 The Prometheus configuration file and its core features
Official reference: https://prometheus.io/docs/prometheus/latest/configuration/configuration/

1. Annotated global configuration:
global:
 # How frequently to scrape targets by default.
 [ scrape_interval: <duration> | default = 1m ]                        # how often to scrape targets; default every minute

 # How long until a scrape request times out.
 [ scrape_timeout: <duration> | default = 10s ]                        # scrape timeout; a target that does not respond within 10s times out

 # How frequently to evaluate rules.
 [ evaluation_interval: <duration> | default = 1m ]                    # how often to evaluate rules; default 1m

 # The labels to add to any time series or alerts when communicating with
 # external systems (federation, remote storage, Alertmanager).
 external_labels:
   [ <labelname>: <labelvalue> ... ]

 # File to which PromQL queries are logged.
 # Reloading the configuration will reopen the file.
 [ query_log_file: <string> ]

# Rule files specifies a list of globs. Rules and alerts are read from
# all matching files.
rule_files:                                                               # alerting/recording rule files
 [ - <filepath_glob> ... ]  

# A list of scrape configurations.
scrape_configs:                                                           # scrape job definitions for monitored targets
 [ - <scrape_config> ... ]

# Alerting specifies settings related to the Alertmanager.
alerting:                                                                 # alerting configuration
 alert_relabel_configs:                                                  # relabeling applied to alerts
   [ - <relabel_config> ... ]
 alertmanagers:                                                          # Alertmanager endpoints
   [ - <alertmanager_config> ... ]                  

# Settings related to the remote write feature.
remote_write:                                                             # remote-storage write configuration
 [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:                                                             # remote-storage read configuration
 [ - <remote_read> ... ]


================================================================================================================================
# Annotated scrape_configs (job configuration)
# Based on the official reference:
# The job name assigned to scraped metrics by default.
job_name: <job_name>                                                                      # job name; multiple jobs may be defined

# How frequently to scrape targets from this job.
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]               # per-job scrape interval; defaults to the global setting

# Per-scrape timeout when scraping this job.
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]                 # per-job scrape timeout; defaults to the global setting

# The HTTP resource path on which to fetch metrics from targets.
[ metrics_path: <path> | default = /metrics ]                                             # metrics endpoint path

# honor_labels controls whether labels already present in the scraped data
# are kept when they conflict with server-side labels.
[ honor_labels: <boolean> | default = false ]                                             # whether scraped labels override server-side labels; off by default

# honor_timestamps controls whether Prometheus respects the timestamps present
# in scraped data.
# If honor_timestamps is set to "true", the timestamps of the metrics exposed
# by the target will be used.
#
# If honor_timestamps is set to "false", the timestamps of the metrics exposed
# by the target will be ignored.
[ honor_timestamps: <boolean> | default = true ]
# ---------------------------------------------
# Configures the protocol scheme used for requests.
[ scheme: <scheme> | default = http ]                                                      # protocol scheme for scraping; http by default
# Optional HTTP URL parameters.
params: 
 [ <string>: [<string>, ...] ]                                                            # optional HTTP URL query parameters
# ------------------------------------------
# Sets the `Authorization` header on every scrape request with the
# configured username and password.
# password and password_file are mutually exclusive.
basic_auth:                                                                                # basic auth credentials for the scrape target
 [ username: <string> ]
 [ password: <secret> ]
 [ password_file: <string> ]
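For instance, a job scraping an exporter protected by basic auth might look like this (the job name, credentials, and target address are invented for illustration):

```yaml
scrape_configs:
  - job_name: 'secured-exporter'
    basic_auth:
      username: prom
      password_file: /etc/prometheus/exporter.pass   # mutually exclusive with 'password'
    static_configs:
      - targets: ['192.168.10.30:9100']
```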
# Sets the `Authorization` header on every scrape request with
# the configured credentials.
authorization:
 # Sets the authentication type of the request.
 [ type: <string> | default: Bearer ]
 # Sets the credentials of the request. It is mutually exclusive with
 # `credentials_file`.
 [ credentials: <secret> ]
 # Sets the credentials of the request with the credentials read from the
 # configured file. It is mutually exclusive with `credentials`.
 [ credentials_file: <filename> ]
# Configure whether scrape requests follow HTTP 3xx redirects.
[ follow_redirects: <bool> | default = true ]
# Configures the scrape request's TLS settings.
tls_config:
 [ <tls_config> ]
# Optional proxy URL.
[ proxy_url: <string> ]
# --------------------------------------

# List of Azure service discovery configurations.
azure_sd_configs:
 [ - <azure_sd_config> ... ]

# ---------------------------------------
# List of Consul service discovery configurations.
consul_sd_configs:                                                                           # Consul service discovery
 [ - <consul_sd_config> ... ]                                    

# List of DigitalOcean service discovery configurations.
digitalocean_sd_configs:
 [ - <digitalocean_sd_config> ... ]

# List of Docker Swarm service discovery configurations.
dockerswarm_sd_configs:
 [ - <dockerswarm_sd_config> ... ]

# List of DNS service discovery configurations.
dns_sd_configs:
 [ - <dns_sd_config> ... ]

# List of EC2 service discovery configurations.
ec2_sd_configs:
 [ - <ec2_sd_config> ... ]

# List of Eureka service discovery configurations.
eureka_sd_configs:
 [ - <eureka_sd_config> ... ]

# List of file service discovery configurations.
file_sd_configs:
 [ - <file_sd_config> ... ]

# List of GCE service discovery configurations.
gce_sd_configs:
 [ - <gce_sd_config> ... ]

# List of Hetzner service discovery configurations.
hetzner_sd_configs:
 [ - <hetzner_sd_config> ... ]

# List of Kubernetes service discovery configurations.
kubernetes_sd_configs:
 [ - <kubernetes_sd_config> ... ]

# List of Marathon service discovery configurations.
marathon_sd_configs:
 [ - <marathon_sd_config> ... ]

# List of AirBnB's Nerve service discovery configurations.
nerve_sd_configs:
 [ - <nerve_sd_config> ... ]

# List of OpenStack service discovery configurations.
openstack_sd_configs:
 [ - <openstack_sd_config> ... ]

# List of Scaleway service discovery configurations.
scaleway_sd_configs:
 [ - <scaleway_sd_config> ... ]

# List of Zookeeper Serverset service discovery configurations.
serverset_sd_configs:
 [ - <serverset_sd_config> ... ]

# List of Triton service discovery configurations.
triton_sd_configs:
 [ - <triton_sd_config> ... ]

# ----------------------------------------
# List of labeled statically configured targets for this job.
static_configs:                                                                    # statically configured targets
 [ - <static_config> ... ]

# List of target relabel configurations.
relabel_configs:                                                                   # relabel targets before scraping
 [ - <relabel_config> ... ]

# List of metric relabel configurations.
metric_relabel_configs:                                                            # relabel samples after scraping
 [ - <relabel_config> ... ]

# Per-scrape limit on number of scraped samples that will be accepted.
# If more than this number of samples are present after metric relabeling
# the entire scrape will be treated as failed. 0 means no limit.
[ sample_limit: <int> | default = 0 ]                                              # max samples per scrape; if exceeded, the whole scrape is dropped

# Per-scrape config limit on number of unique targets that will be
# accepted. If more than this number of targets are present after target
# relabeling, Prometheus will mark the targets as failed without scraping them.
# 0 means no limit. This is an experimental feature, this behaviour could
# change in the future.
[ target_limit: <int> | default = 0 ]        

# ==================================================================================================
# relabel_configs: modify any target and its labels before scraping

Uses of relabeling: 1. Rename labels
                    2. Drop labels
                    3. Filter targets

[ source_labels: '[' <labelname> [, ...] ']' ]                                    # the source labels

# Separator placed between concatenated source label values.
[ separator: <string> | default = ; ]                                             # separator used when joining multiple source label values

# Label to which the resulting value is written in a replace action.
# It is mandatory for replace actions. Regex capture groups are available.
[ target_label: <labelname> ]                                                     # label the result is written to

# Regular expression against which the extracted value is matched.
[ regex: <regex> | default = (.*) ]                                               # regex matched against the source label values; matches everything by default

# Modulus to take of the hash of the source label values.
[ modulus: <int> ]

# Replacement value against which a regex replace is performed if the
# regular expression matches. Regex capture groups are available.
[ replacement: <string> | default = $1 ]                                          # replacement for matched regex groups, referenced as $1, $2, $3

# Action to perform based on regex matching.
[ action: <relabel_action> | default = replace ]                                  # action performed on a regex match; replace by default

# --------------------------------------------
relabel_configs:
- action: replace                # the relabel action
  source_labels: ['test']
  regex: (.*)
  replacement: $1
  target_label: idc


## Relabel actions: 1. replace: the default; match source_labels against regex and write replacement (with capture-group references) to target_label
                    2. keep: drop targets whose joined source_labels do NOT match regex (keep the matches)
                    3. drop: drop targets whose joined source_labels DO match regex (keep the non-matches)
                    4. labeldrop: remove every label whose name matches regex
                    5. labelkeep: remove every label whose name does NOT match regex
                    6. hashmod: set target_label to the hash of the joined source_labels modulo modulus
                    7. labelmap: match regex against all label names; copy the values of matching labels to names given by replacement (capture groups ${1}, ${2}, ...)
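Two of these actions combined in one hypothetical job (the env label and address pattern are invented for illustration): first discard staging targets, then strip the port off the address into a node label:

```yaml
relabel_configs:
  # drop: discard any target whose 'env' label matches the regex
  - source_labels: ['env']
    regex: staging
    action: drop
  # replace: capture the host part of __address__ and write it to 'node'
  - source_labels: ['__address__']
    regex: '(.*):\d+'
    target_label: node
    replacement: $1
    action: replace
```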
                  
## Supported service discovery sources:
• azure_sd_configs
• consul_sd_configs
• dns_sd_configs
• ec2_sd_configs
• openstack_sd_configs
• file_sd_configs           ***      # file-based dynamic target discovery
• gce_sd_configs
• kubernetes_sd_configs     *****
• marathon_sd_configs
• nerve_sd_configs
• serverset_sd_configs
• triton_sd_configs

3. Prometheus API

3.1 Expression queries

GET /api/v1/query

3.1.1 Query parameters

· query=<string>: the PromQL expression to evaluate

· time=<rfc3339 | unix_timestamp>: evaluation timestamp, optional

· timeout=<duration>: evaluation timeout, optional; defaults to the value of --query.timeout

Example
Request URL: http://ip:port/api/v1/query?query=up&time=2018-07-01T20:10:51.781Z
Response:
   {
   "status": "success",
   "data": {
       "resultType": "vector",
       "result": [
           {
               "metric": {
                   "__name__": "up",
                   "app": "prometheus",
                   "instance": "10.244.1.84:9090",
                   "job": "kubernetes-service-endpoints",
                   "kubernetes_name": "prometheus-service",
                   "kubernetes_namespace": "ns-monitor"
               },
               "value": [
                   1530497003.491,
                   "1"
               ]
           }]
         }
}
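Unpacking such a response in code is straightforward; note that sample values arrive as strings. The body below is a condensed copy of the example response, not a live query:

```python
import json

body = '''{"status": "success",
           "data": {"resultType": "vector",
                    "result": [{"metric": {"__name__": "up",
                                           "instance": "10.244.1.84:9090"},
                                "value": [1530497003.491, "1"]}]}}'''

resp = json.loads(body)
assert resp["status"] == "success"
for series in resp["data"]["result"]:
    ts, val = series["value"]               # timestamp is a float, the value a string
    print(series["metric"]["__name__"], ts, float(val))  # up 1530497003.491 1.0
```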
3.1.2 Range queries

GET /api/v1/query_range

· query=<string>: the PromQL expression to evaluate

· start=<rfc3339 | unix_timestamp>: start time

· end=<rfc3339 | unix_timestamp>: end time

· step=<duration>: query resolution step

· timeout=<duration>: evaluation timeout, optional; defaults to the value of --query.timeout

Request URL: http://ip:port/api/v1/query_range?query=up&start=2018-07-01T20:10:30.781Z
&end=2018-07-02T20:11:00.781Z&step=100s
Response:
 {
   "status": "success",
   "data": {
       "resultType": "matrix",
       "result": [
           {
               "metric": {
                   "__name__": "up",
                   "app": "prometheus",
                   "instance": "10.244.1.84:9090",
                   "job": "kubernetes-service-endpoints",
                   "kubernetes_name": "prometheus-service",
                   "kubernetes_namespace": "ns-monitor"
               },
               "values": [
                   [
                       1530496830.781,
                       "1"
                   ],
                   [
                       1530496930.781,
                       "1"
                   ]
               ]
           }]
    }
}
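Building such a URL by hand is error-prone because of the timestamps; urlencode handles the escaping (the localhost base URL is an assumption):

```python
from urllib.parse import urlencode

params = {
    "query": "up",
    "start": "2018-07-01T20:10:30.781Z",
    "end": "2018-07-02T20:11:00.781Z",
    "step": "100s",
}
url = "http://localhost:9090/api/v1/query_range?" + urlencode(params)
print(url)
```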
3.2 Querying metadata
3.2.1 Finding series by label matchers

GET /api/v1/series

Parameters

· match[]=<series_selector>: repeated series selector; at least one match[] argument must be provided

· start=<rfc3339 | unix_timestamp>: start time

· end=<rfc3339 | unix_timestamp>: end time

Request URL: 
http://ip:port/api/v1/series?match[]=up
&match[]=process_start_time_seconds{job="prometheus"}

Response:
   {
   "status": "success",
   "data": [
       {
           "__name__": "process_start_time_seconds",
           "instance": "localhost:9090",
           "job": "prometheus"
       }]
}
3.2.2 Querying label values

GET /api/v1/label/<label_name>/values

Request URL: 
http://ip:port/api/v1/label/job/values

Response:
    {
   "status": "success",
   "data": [
       "grafana",
       "kubernetes-apiservers",
       "kubernetes-cadvisor",
       "kubernetes-nodes",
       "kubernetes-service-endpoints",
       "prometheus"
   ]
}
3.3 Expression result formats
3.3.1 Range vectors
[ { "metric": { "<label_name>": "<label_value>", ... }, "values": [ [ <unix_time>, "<sample_value>" ], ... ] }, ... ]
Instant vectors
[ { "metric": { "<label_name>": "<label_value>", ... }, "value": [ <unix_time>, "<sample_value>" ] }, ... ]
3.3.2 Scalars
[ <unix_time>, "<scalar_value>" ]
3.3.3 Strings
[ <unix_time>, "<string_value>" ]
3.4 Querying targets

GET /api/v1/targets

Request URL:  http://ip:port/api/v1/targets
 
Response: {
   "status": "success",
   "data": {
       "activeTargets": [
           {
               "discoveredLabels": {
                   "__address__": "localhost:9090",
                   "__metrics_path__": "/metrics",
                   "__scheme__": "http",
                   "job": "prometheus"
               },
               "labels": {
                   "instance": "localhost:9090",
                   "job": "prometheus"
               },
               "scrapeUrl": "http://localhost:9090/metrics",
               "lastError": "",
               "lastScrape": "2018-07-02T02:39:45.398712723Z",
               "health": "up"
           }]
}}
3.5 Alertmanager discovery

GET /api/v1/alertmanagers

Request URL: http://ip:port/api/v1/alertmanagers

Response:
{
   "status": "success",
   "data": {
       "activeAlertmanagers": [{ "url":"http://127.0.0.1:9090/api/v1/alerts" }],
       "droppedAlertmanagers": [{ "url":"http://127.0.0.1:9093/api/v1/alerts" }]
   }
}
3.6 Runtime configuration

GET /api/v1/status/config

Request URL: http://ip:port/api/v1/status/config
Response:

{
   "status": "success",
   "data": {
       "yaml": "global:\n  scrape_interval: 15s\n  scrape_timeout: 10s\n  evaluation_interval: 15s\nalerting:\n  alertmanagers:\n  - static_configs:\n    - targets: []\n    scheme: http\n    timeout: 10s\nscrape_configs:\n- job_name: prometheus\n  acement: $1\n    action: replace\n  - source_labels: [__meta_kubernetes_pod_name]\n    separator: ;\n    regex: (.*)\n    target_label: kubernetes_pod_name\n    replacement: $1\n    action: replace\n"
   }
}
3.7 Flags

GET /api/v1/status/flags   (returns the flag values the server was started with)

Request URL:
  http://ip:port/api/v1/status/flags

Response:
{
   "status": "success",
   "data": {
       "alertmanager.notification-queue-capacity": "10000",
       "alertmanager.timeout": "10s",
       "config.file": "/etc/prometheus/prometheus.yml",
       "log.level": "info",
       "query.lookback-delta": "5m",
       "query.max-concurrency": "20",
       "query.timeout": "2m",
       "storage.tsdb.max-block-duration": "36h",
       "storage.tsdb.min-block-duration": "2h",
       "storage.tsdb.no-lockfile": "false",
       "storage.tsdb.path": "/prometheus",
       "storage.tsdb.retention": "15d",
       "web.console.libraries": "/usr/share/prometheus/console_libraries",
       "web.console.templates": "/usr/share/prometheus/consoles",
       "web.enable-admin-api": "false",
       "web.enable-lifecycle": "false",
       "web.external-url": "",
       "web.listen-address": "0.0.0.0:9090",
       "web.max-connections": "512",
       "web.read-timeout": "5m",
       "web.route-prefix": "/",
       "web.user-assets": ""
   }
}

4. Prometheus Admin API

In the Prometheus pod spec, find the container definition below and add the
--web.enable-admin-api flag, which enables the admin API:
containers:
     - image: prom/prometheus:v2.0.0
       name: prometheus
       command:
       - "/bin/prometheus"
       args:
       - "--config.file=/etc/prometheus/prometheus.yml"
       - "--storage.tsdb.path=/prometheus"
       - "--storage.tsdb.retention=24h"
       - "--web.enable-admin-api"
       ports:
       - containerPort: 9090
         protocol: TCP
       volumeMounts:
       - mountPath: "/prometheus"
         name: data
       - mountPath: "/etc/prometheus"
         name: config-volume
4.1 Snapshot

POST /api/v1/admin/tsdb/snapshot?skip_head=<bool>

Request URL:  http://ip:port/api/v1/admin/tsdb/snapshot 

Response:
{
   "status": "success",
   "data": {
       "name": "20180702T033639Z-28bcb561ec57373"
   }
}
The snapshot data now lives under <data-dir>/snapshots/20180702T033639Z-28bcb561ec57373
4.2 Delete Series
POST /api/v1/admin/tsdb/delete_series

Parameters:
match[]=<series_selector>: repeated label matcher selecting the series to delete; at least one match[] argument must be provided
start=<rfc3339 | unix_timestamp>: start timestamp; optional, defaults to the minimum possible time
end=<rfc3339 | unix_timestamp>: end timestamp; optional, defaults to the maximum possible time

Request:
http://ip:port/api/v1/admin/tsdb/delete_series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}

Response code: 204
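The repeated match[] parameter is easiest to produce from a list of pairs (the localhost base URL is an assumption, and the server must have been started with --web.enable-admin-api):

```python
from urllib.parse import urlencode

params = [
    ("match[]", "up"),
    ("match[]", 'process_start_time_seconds{job="prometheus"}'),
]
url = "http://localhost:9090/api/v1/admin/tsdb/delete_series?" + urlencode(params)
print(url)
```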
4.3 Clean Tombstones

Removes the tombstone files created by delete_series and frees the disk space.

POST /api/v1/admin/tsdb/clean_tombstones

Request:
 http://ip:port/api/v1/admin/tsdb/clean_tombstones
Response code: 204

5. Prometheus Metrics Reference

5.1 Container CPU
Metric                                            Description
container_cpu_load_average_10s                    10-second load average
container_cpu_system_seconds_total                system (kernel) CPU time
container_cpu_usage_seconds_total                 total CPU time consumed
container_cpu_user_seconds_total                  user CPU time
5.2 Container filesystem
Metric                                            Description
container_fs_inodes_free                          free inodes
container_fs_inodes_total                         total inodes
container_fs_io_current                           I/Os currently in progress
container_fs_io_time_seconds_total                total time spent doing I/O
container_fs_io_time_weighted_seconds_total       weighted time spent doing I/O
container_fs_limit_bytes                          filesystem size limit
container_fs_read_seconds_total                   time spent reading
container_fs_reads_bytes_total                    bytes read
container_fs_reads_total                          total read requests
container_fs_usage_bytes                          bytes used
container_fs_write_seconds_total                  time spent writing
container_fs_writes_bytes_total                   bytes written
container_fs_writes_total                         total write requests
5.3 Container memory
Metric                                            Description
container_memory_cache                            page cache
container_memory_max_usage_bytes                  peak memory usage
container_memory_failcnt                          failed memory allocation attempts
container_memory_failures_total                   memory allocation failures
container_memory_rss                              RSS: physical memory used by the process
container_memory_swap                             swap usage
container_memory_usage_bytes                      current memory usage
container_memory_working_set_bytes                working set: physical memory the process can use (but does not necessarily use)
5.4 Container network
Metric                                            Description
container_network_receive_bytes_total             bytes received
container_network_receive_errors_total            receive errors
container_network_receive_packets_dropped_total   received packets dropped
container_network_receive_packets_total           packets received
container_network_tcp_usage_total                 TCP usage
container_network_transmit_bytes_total            bytes transmitted
container_network_transmit_errors_total           transmit errors
container_network_transmit_packets_dropped_total  transmitted packets dropped
container_network_transmit_packets_total          packets transmitted
container_network_udp_usage_total                 UDP usage
container_spec_cpu_period                         CPU period (interval at which CPU time is reallocated)
container_spec_cpu_shares                         CPU share weight
container_spec_memory_limit_bytes                 memory limit
container_spec_memory_reservation_limit_bytes     memory reservation limit
container_spec_memory_swap_limit_bytes            swap limit
container_start_time_seconds                      container start time
5.5 kubelet-related
Metric                                                         Description
kubelet_containers_per_pod_count                               containers per pod
kubelet_containers_per_pod_count_count                         number of pods sampled
kubelet_containers_per_pod_count_sum                           total containers across pods
kubelet_network_plugin_operations_latency_microseconds         network plugin operation latency
kubelet_network_plugin_operations_latency_microseconds_count   number of latency samples
kubelet_network_plugin_operations_latency_microseconds_sum     total latency
kubelet_pod_start_latency_microseconds                         pod start latency
kubelet_pod_start_latency_microseconds_count                   number of pod start latency samples
kubelet_pod_start_latency_microseconds_sum                     total pod start latency
kubelet_running_container_count                                running containers
kubelet_running_pod_count                                      running pods
kubernetes_build_info                                          build info
machine_cpu_cores                                              number of CPU cores
machine_memory_bytes                                           machine memory size
5.6 Node-related
Metric                                            Description
node_boot_time_seconds                            boot time
node_cpu_seconds_total                            total CPU time
node_load1                                        1-minute load average
node_load15                                       15-minute load average
node_load5                                        5-minute load average
node_disk_io_now                                  I/Os currently in progress
node_disk_io_time_seconds_total                   total time spent doing I/O
node_disk_io_time_weighted_seconds_total          weighted time spent doing I/O
node_disk_read_bytes_total                        bytes read
node_disk_read_time_seconds_total                 time spent reading
node_disk_reads_completed_total                   reads completed
node_disk_reads_merged_total                      read requests merged
node_disk_write_time_seconds_total                time spent writing
node_disk_writes_completed_total                  writes completed
node_disk_writes_merged_total                     write requests merged
node_disk_written_bytes_total                     bytes written
node_filefd_maximum                               maximum file descriptors
node_filesystem_avail_bytes                       filesystem bytes available
node_filesystem_device_error                      filesystem device errors
node_filesystem_files                             total file nodes (inodes)
node_filesystem_files_free                        free file nodes
node_filesystem_free_bytes                        free bytes
node_filesystem_readonly                          filesystem mounted read-only
node_filesystem_size_bytes                        filesystem size
node_forks_total                                  total forks
node_memory_Buffers_bytes                         buffer memory
node_memory_Cached_bytes                          page cache size
node_memory_CommitLimit_bytes                     commit limit
node_memory_Committed_AS_bytes                    memory currently committed by the system
node_memory_KernelStack_bytes                     kernel stack size
node_memory_Mapped_bytes                          memory-mapped size
node_memory_MemAvailable_bytes                    available memory
node_memory_MemFree_bytes                         free memory
node_memory_MemTotal_bytes                        total usable RAM
node_memory_Mlocked_bytes                         memory locked with mlock
node_memory_NFS_Unstable_bytes                    unstable NFS pages
node_memory_PageTables_bytes                      memory used by the lowest-level page tables
node_memory_SReclaimable_bytes                    slab memory that can be reclaimed under memory pressure
node_memory_SUnreclaim_bytes                      slab memory that cannot be reclaimed
node_memory_Shmem_bytes                           shared memory
node_memory_Slab_bytes                            kernel data-structure cache size
node_memory_SwapCached_bytes                      swap cache
node_memory_SwapFree_bytes                        free swap
node_memory_SwapTotal_bytes                       total swap
node_memory_Unevictable_bytes                     pages that cannot be swapped out
node_memory_VmallocChunk_bytes                    largest contiguous free block in the vmalloc area
node_memory_VmallocTotal_bytes                    total vmalloc address space
node_memory_VmallocUsed_bytes                     used vmalloc space
node_memory_WritebackTmp_bytes                    temporary data being written back
node_memory_Writeback_bytes                       data being written back
node_network_receive_bytes_total                  bytes received
node_network_receive_compressed_total             compressed packets received
node_network_receive_drop_total                   received packets dropped
node_network_receive_errs_total                   receive errors
node_network_receive_fifo_total                   FIFO buffer errors on receive
node_network_receive_frame_total                  frame alignment errors
node_network_receive_multicast_total              multicast frames sent or received by the driver
node_network_receive_packets_total                packets received
node_network_transmit_bytes_total                 bytes transmitted
node_network_transmit_carrier_total               carrier losses detected by the driver
node_network_transmit_colls_total                 collisions detected on the interface
node_network_transmit_compressed_total            compressed packets transmitted
node_network_transmit_drop_total                  transmitted packets dropped
node_network_transmit_errs_total                  transmit errors
node_network_transmit_fifo_total                  FIFO buffer errors on transmit
node_network_transmit_packets_total               packets transmitted
node_procs_blocked                                blocked processes
node_procs_running                                processes in a runnable state
node_uname_info                                   uname (system) info
node_vmstat_pgfault                               page faults
node_vmstat_pgmajfault                            major page faults
node_vmstat_pgpgin                                KB paged in from disk or swap
node_vmstat_pgpgout                               KB paged out to disk or swap
node_vmstat_pswpin                                swap pages swapped in
node_vmstat_pswpout                               swap pages swapped out
5.7 Process-related
Metric                                            Description
process_cpu_seconds_total                         total process CPU time
process_resident_memory_bytes                     process resident memory
process_start_time_seconds                        process start time
process_virtual_memory_bytes                      process virtual memory