[Cloud-Native Learning] The Most Complete Prometheus Study Notes
Using Prometheus
1. Prometheus Basics
Prometheus is a time-series database suited to monitoring and alerting. It is not suited to use cases that require 100% accuracy, such as billing, because the collected samples are not guaranteed to be complete; it is mainly used to monitor and collect memory, CPU, and disk metrics.
1.1 Features
- A multi-dimensional data model (a time series is identified by a metric name and a set of key/value labels)
- A flexible query language (PromQL) over those dimensions
- No reliance on distributed storage; single server nodes are autonomous
- Time-series collection via a pull model over HTTP
- Pushing time series is supported through an intermediary gateway
- Targets are found via service discovery or static configuration
- Multiple modes of graphing and dashboard support
1.2 Components
The Prometheus ecosystem consists of multiple components, many of which are optional:
1. Prometheus Server: scrapes and stores time-series data and provides the query interface
2. Client Libraries: instrument application code to expose metrics
3. Push Gateway: short-term storage for metrics, mainly for short-lived jobs
4. Exporters: collect metrics from existing third-party systems and expose them at /metrics
5. Alertmanager: handles alerts
6. Web UI: a simple built-in web console
1.3 Architecture
1.4 The Four Metric Types
- Counter: only ever increases; commonly used for totals such as HTTP request counts or orders placed.
- Gauge: can increase and decrease; commonly used for values such as CPU, memory, or currently online users.
- Histogram: buckets samples to record their distribution, e.g. which latency range a request falls into.
- Summary: reports quantiles computed from samples on the client side, which Prometheus then scrapes, e.g. the response-time bounds for the 99th/90th/85th/70th/60th percentiles.
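To make the four types concrete, here is a minimal sketch using the official Python client library (prometheus_client); the metric names and the simulated workload are assumptions made up for this example:

```python
# A minimal sketch of the four metric types with the official Python client
# (pip install prometheus-client). Names and values here are illustrative.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter("demo_http_requests_total", "Total HTTP requests")      # only ever increases
ONLINE = Gauge("demo_online_users", "Users currently online")              # can go up and down
LATENCY_H = Histogram("demo_latency_seconds", "Latency, bucketed")         # distribution via buckets
LATENCY_S = Summary("demo_latency_summary_seconds", "Latency, summarized") # client-side aggregation

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics on :8000 for Prometheus to scrape
    while True:
        REQUESTS.inc()                      # Counter: increment only
        ONLINE.set(random.randint(0, 100))  # Gauge: set to an arbitrary value
        duration = random.random()
        LATENCY_H.observe(duration)         # Histogram: record into buckets
        LATENCY_S.observe(duration)         # Summary: record for client-side aggregation
        time.sleep(1)
```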
1.5 The Prometheus Data Model
- Prometheus stores all data as time series; samples with the same metric name and the same set of labels belong to the same series.
- Every time series is uniquely identified by its metric name and a set of key-value pairs, also known as labels.
Time series notation:
<metric name>{<label name>=<label value>, ...}
Example: api_http_requests_total{method="POST", handler="/messages"}
Metric types:
1. Counter: a monotonically increasing counter
2. Gauge: a value that can go up or down arbitrarily
3. Histogram: samples observations over a period, counting them in buckets and summing the observed values
4. Summary: similar to Histogram, but exposes client-side quantiles
1.5.1 Jobs and Instances
Instance: an endpoint that can be scraped is called an instance.
Job: a collection of instances with the same purpose is called a job.
Example:
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.10:9090']
2. Prometheus Deployment
- Binary deployment: https://prometheus.io/docs/prometheus/latest/getting_started/
- Docker deployment: https://prometheus.io/docs/prometheus/latest/installation/
2.1 Server-Side Setup
2.1.1 Download and install
[root@master01 ltp]# wget https://github.com/prometheus/prometheus/releases/download/v2.6.1/prometheus-2.6.1.linux-amd64.tar.gz
[root@master01 ltp]# tar zxvf prometheus-2.6.1.linux-amd64.tar.gz
[root@master01 ltp]# mv prometheus-2.6.1.linux-amd64 /usr/local/prometheus
[root@master01 ltp]# cd /usr/local/prometheus
[root@master01 prometheus]# ls
console_libraries  consoles  LICENSE  NOTICE  prometheus  prometheus.yml  promtool
[root@master01 prometheus]# vim prometheus.yml
# ...(partially omitted)
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']    # the scrape job: monitor port 9090 on this host
[root@master01 prometheus]# ./prometheus --help                          # show command-line help
[root@master01 prometheus]# ./prometheus --config.file="prometheus.yml"  # start Prometheus; it blocks in the foreground
2.1.2 Manage with systemd
[root@master01 prometheus]# cd /usr/lib/systemd/system
[root@master01 system]# cp -p sshd.service prometheus.service
[root@master01 system]# vim prometheus.service
[Unit]
Description=Prometheus
After=network.target

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml
# ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --web.read-timeout=5m --web.max-connections=10 --storage.tsdb.retention=15d --storage.tsdb.path=/prometheus/data --query.max-concurrency=20 --query.timeout=2m

[Install]
WantedBy=multi-user.target

[root@master01 system]# systemctl daemon-reload
[root@master01 system]# systemctl start prometheus.service
# Visit http://192.168.10.10:9090/metrics to see the metrics Prometheus itself exposes
# Visit http://192.168.10.10:9090 for the web console
---------------- Supplement: running Prometheus with Docker --------------------------
[root@node01 ~]# docker run -d -p 9090:9090 -v /tmp/prometheus.yaml:/etc/prometheus.yaml prom/prometheus   # -d runs the container in the background
---------------------------------------------------------------------------------------
2.1.3 Verify the Prometheus configuration
[root@master01 prometheus]# vim prometheus.yml
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
        labels:            # attach an extra label to this target
          idc: beijing
[root@master01 prometheus]# ps -ef | grep prometheus
root 1075 1 0 05:51 ? 00:00:01 /usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml
[root@master01 prometheus]# kill -HUP 1075    # reload Prometheus: kill -HUP <pid>
[root@master01 prometheus]# ./promtool check config prometheus.yml    # validate the config with promtool; if the check fails, a reload will not take effect
Checking prometheus.yml
  FAILED: parsing YAML file prometheus.yml: yaml: unmarshal errors:
  line 30: field label not found in type struct { Targets []string "yaml:\"targets\""; Labels model.LabelSet "yaml:\"labels\"" }
# (the FAILED output above is what promtool reports when the field is misspelled `label` instead of `labels`)
# Visit 192.168.10.10:9090
# On the targets page, the localhost:9090 entry now shows the newly added label
2.1.4 File-Based Service Discovery
Example:
  file_sd_configs:
    - files: ['/usr/local/prometheus/sd_config/*.yml']  # where to find target files; matching files are read automatically
      refresh_interval: 5s                              # how often to re-read them
[root@master01 prometheus]# mkdir /usr/local/prometheus/sd_config/
[root@master01 prometheus]# cd sd_config/
[root@master01 sd_config]# vim test.yml
- targets: ['192.168.10.10:9090']
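In practice these target files are often generated by a script rather than edited by hand. A minimal sketch follows; the host inventory, label value, and output path are assumptions, and since JSON is a subset of YAML, the generated file still matches the *.yml glob above:

```python
# A minimal sketch of generating a file_sd target file from a host inventory.
# The inventory, label value, and output path are illustrative assumptions.
import json

inventory = ["192.168.10.20:9100", "192.168.10.21:9100"]  # hypothetical hosts

groups = [{"targets": inventory, "labels": {"idc": "beijing"}}]

# JSON is valid YAML, so writing JSON into a .yml file keeps the glob happy.
with open("/usr/local/prometheus/sd_config/nodes.yml", "w") as f:
    json.dump(groups, f, indent=2)
# Prometheus re-reads matching files every refresh_interval; no reload is needed.
```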
2.2 Configure the Monitored Nodes
2.2.1 Configure the nodes to be monitored and start node_exporter
[root@node01 ~]# tar zxvf node_exporter-0.17.0.linux-amd64.tar.gz
[root@node02 ~]# tar zxvf node_exporter-0.17.0.linux-amd64.tar.gz
[root@node01 ~]# mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter
[root@node02 ~]# mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter
[root@node01 ~]# cd /usr/local/node_exporter
[root@node01 node_exporter]# ls
LICENSE  node_exporter  NOTICE
[root@node01 node_exporter]# vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter.service

[Service]
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target

[root@node01 node_exporter]# systemctl daemon-reload
[root@node01 node_exporter]# systemctl start node_exporter.service
[root@node01 node_exporter]# systemctl enable node_exporter.service
[root@node01 node_exporter]# ps -ef | grep node_exporter
root 1972 1 0 07:55 ? 00:00:00 /usr/local/node_exporter/node_exporter
[root@node01 node_exporter]# netstat -anput | grep 9100
tcp6  0  0 :::9100  :::*  LISTEN  1972/node_exporter
# Visit http://192.168.10.20:9100/metrics to see the metrics the node exposes
2.2.2 On the master, configure scraping of the node
Add the following configuration:
[root@master01 prometheus]# vim prometheus.yml
  - job_name: 'node01'
    file_sd_configs:
      - files: ['/usr/local/prometheus/sd_config/node01.yml']  # location of the service-discovery file
        refresh_interval: 5s
[root@master01 prometheus]# vim sd_config/node01.yml
- targets:
  - 192.168.10.20:9100
[root@master01 prometheus]# systemctl restart prometheus
Useful expressions:
CPU usage: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)
Memory usage: 100 - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100
Disk usage: 100 - node_filesystem_free_bytes{mountpoint="/",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{mountpoint="/",fstype=~"ext4|xfs"} * 100
2.2.3 Monitor service state on the node by editing node_exporter.service
The --collector.systemd.unit-whitelist parameter restricts which systemd units are monitored (here the ssh and docker services), and --collector.textfile.directory enables the textfile collector:
[root@node02 node_exporter]# cat /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter.service

[Service]
ExecStart=/usr/local/node_exporter/node_exporter \
  --web.listen-address=:9100 \
  --collector.systemd \
  --collector.systemd.unit-whitelist=(ssh|docker).service \
  --collector.textfile.directory=/usr/local/node_exporter/textfile.collected
Restart=on-failure

[Install]
WantedBy=multi-user.target
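The textfile collector picks up *.prom files from the configured directory, which is a convenient way to publish metrics from cron jobs or scripts. A minimal sketch, with a made-up metric name and value:

```python
# A minimal sketch of feeding node_exporter's textfile collector.
# The metric name/value are illustrative; the directory matches the
# --collector.textfile.directory flag configured above.
import os

TEXTFILE_DIR = "/usr/local/node_exporter/textfile.collected"

def write_gauge(name: str, value: float, help_text: str = "") -> None:
    tmp = os.path.join(TEXTFILE_DIR, f"{name}.prom.tmp")
    final = os.path.join(TEXTFILE_DIR, f"{name}.prom")
    with open(tmp, "w") as f:
        if help_text:
            f.write(f"# HELP {name} {help_text}\n")
        f.write(f"# TYPE {name} gauge\n")
        f.write(f"{name} {value}\n")
    # rename is atomic on the same filesystem, so a scrape never sees a half-written file
    os.rename(tmp, final)

write_gauge("backup_last_success_timestamp_seconds", 1718000000,
            "Unix time of the last successful backup")
```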
2.3 Install Grafana for Dashboards
Official site: https://grafana.com/grafana/download?pg=get&plcmt=selfmanaged-box1-cta1
2.3.1 Install grafana on the server
[root@prometheus ~]# cd /usr/local/src/
[root@prometheus src]# wget https://dl.grafana.com/oss/release/grafana-7.5.5-1.x86_64.rpm
[root@prometheus src]# yum localinstall grafana-7.5.5-1.x86_64.rpm
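After installation, start the service with systemctl start grafana-server, log in at http://<server-ip>:3000 (Grafana's default port; the initial credentials are admin/admin), and add Prometheus as a data source to start building dashboards.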
2.4 The Prometheus Configuration File and Core Features
Official reference: https://prometheus.io/docs/prometheus/latest/configuration/configuration/

1. Annotated global configuration:

global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]      # how often to scrape monitored targets; one minute by default
  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]      # scrape timeout; a target that has not answered within 10s times out
  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]  # how often alerting rules are evaluated; one minute by default
  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]
  # File to which PromQL queries are logged.
  # Reloading the configuration will reopen the file.
  [ query_log_file: <string> ]

# Rule files specifies a list of globs. Rules and alerts are read from
# all matching files.
rule_files:    # alerting rules
  [ - <filepath_glob> ... ]

# A list of scrape configurations.
scrape_configs:    # the metrics to scrape from monitored targets
  [ - <scrape_config> ... ]

# Alerting specifies settings related to the Alertmanager.
alerting:    # alerting setup
  alert_relabel_configs:    # relabeling applied to alerts
    [ - <relabel_config> ... ]
  alertmanagers:    # Alertmanager endpoints
    [ - <alertmanager_config> ... ]

# Settings related to the remote write feature.
remote_write:    # remote storage writes
  [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:    # remote storage reads
  [ - <remote_read> ... ]

================================================================================
2. scrape_configs: configuring jobs (following the official reference)

# The job name assigned to scraped metrics by default.
job_name: <job_name>    # job name; multiple jobs may be defined

# How frequently to scrape targets from this job.
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]    # per-job scrape interval; falls back to the global value

# Per-scrape timeout when scraping this job.
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]    # per-job timeout; falls back to the global value

# The HTTP resource path on which to fetch metrics from targets.
[ metrics_path: <path> | default = /metrics ]    # metrics endpoint path

# Controls how label conflicts are handled when a time series already has a given label.
[ honor_labels: <boolean> | default = false ]    # whether scraped labels override server-side labels; off by default

# honor_timestamps controls whether Prometheus respects the timestamps present in scraped data.
# If set to "true", the timestamps of the metrics exposed by the target are used.
# If set to "false", those timestamps are ignored.
[ honor_timestamps: <boolean> | default = true ]

# Configures the protocol scheme used for requests.
[ scheme: <scheme> | default = http ]    # how targets are scraped; http by default

# Optional HTTP URL parameters.
params:    # parameters appended to scrape requests, if needed
  [ <string>: [<string>, ...] ]

# Sets the `Authorization` header on every scrape request with the configured
# username and password. password and password_file are mutually exclusive.
basic_auth:    # basic authentication for the monitored targets
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

# Sets the `Authorization` header on every scrape request with the configured credentials.
authorization:
  [ type: <string> | default: Bearer ]    # the authentication type of the request
  [ credentials: <secret> ]               # mutually exclusive with credentials_file
  [ credentials_file: <filename> ]        # mutually exclusive with credentials

# Configure whether scrape requests follow HTTP 3xx redirects.
[ follow_redirects: <bool> | default = true ]

# Configures the scrape request's TLS settings.
tls_config:
  [ <tls_config> ]

# Optional proxy URL.
[ proxy_url: <string> ]

# Service discovery configurations, one list per mechanism:
azure_sd_configs:           # Azure
  [ - <azure_sd_config> ... ]
consul_sd_configs:          # Consul
  [ - <consul_sd_config> ... ]
digitalocean_sd_configs:    # DigitalOcean
  [ - <digitalocean_sd_config> ... ]
dockerswarm_sd_configs:     # Docker Swarm
  [ - <dockerswarm_sd_config> ... ]
dns_sd_configs:             # DNS
  [ - <dns_sd_config> ... ]
ec2_sd_configs:             # EC2
  [ - <ec2_sd_config> ... ]
eureka_sd_configs:          # Eureka
  [ - <eureka_sd_config> ... ]
file_sd_configs:            # file-based
  [ - <file_sd_config> ... ]
gce_sd_configs:             # GCE
  [ - <gce_sd_config> ... ]
hetzner_sd_configs:         # Hetzner
  [ - <hetzner_sd_config> ... ]
kubernetes_sd_configs:      # Kubernetes
  [ - <kubernetes_sd_config> ... ]
marathon_sd_configs:        # Marathon
  [ - <marathon_sd_config> ... ]
nerve_sd_configs:           # AirBnB's Nerve
  [ - <nerve_sd_config> ... ]
openstack_sd_configs:       # OpenStack
  [ - <openstack_sd_config> ... ]
scaleway_sd_configs:        # Scaleway
  [ - <scaleway_sd_config> ... ]
serverset_sd_configs:       # Zookeeper Serverset
  [ - <serverset_sd_config> ... ]
triton_sd_configs:          # Triton
  [ - <triton_sd_config> ... ]

# List of labeled statically configured targets for this job.
static_configs:    # statically configured instances
  [ - <static_config> ... ]

# List of target relabel configurations.
relabel_configs:    # relabeling applied to targets before scraping
  [ - <relabel_config> ... ]

# List of metric relabel configurations.
metric_relabel_configs:    # relabeling applied to samples after scraping
  [ - <relabel_config> ... ]

# Per-scrape limit on the number of scraped samples that will be accepted.
# If more samples than this are present after metric relabeling, the entire
# scrape is treated as failed. 0 means no limit.
[ sample_limit: <int> | default = 0 ]

# Per-scrape-config limit on the number of unique targets accepted. If more
# targets than this are present after target relabeling, Prometheus marks them
# as failed without scraping them. 0 means no limit (experimental; this
# behaviour could change in the future).
[ target_limit: <int> | default = 0 ]

================================================================================
3. relabel_configs: modify any target and its labels before scraping. Relabeling is used to (1) rename labels, (2) drop labels, and (3) filter targets.

[ source_labels: '[' <labelname> [, ...] ']' ]      # the source labels
[ separator: <string> | default = ; ]               # separator placed between concatenated source label values
[ target_label: <labelname> ]                       # label the result is written to; mandatory for replace actions; regex capture groups are available
[ regex: <regex> | default = (.*) ]                 # regular expression matched against the extracted value; matches everything by default
[ modulus: <int> ]                                  # modulus to take of the hash of the source label values
[ replacement: <string> | default = $1 ]            # replacement value when the regex matches; capture groups are referenced as $1, $2, $3
[ action: <relabel_action> | default = replace ]    # action performed on a regex match; replace by default

Example:
relabel_configs:
  - action: replace
    source_labels: ['test']
    regex: (.*)
    replacement: $1
    target_label: idc

The relabel actions:
1. replace: the default; match the source_labels value with regex and write replacement (which may reference capture groups) to target_label
2. keep: drop targets whose source_labels do not match regex (keep the matches)
3. drop: drop targets whose source_labels match regex (keep the non-matches)
4. labeldrop: remove every label whose name matches regex
5. labelkeep: remove every label whose name does not match regex
6. hashmod: set target_label to the modulus of a hash of the concatenated source_labels
7. labelmap: match regex against all label names, then copy the values of the matching labels to the label names given by replacement (capture groups referenced as ${1}, ${2}, ...)

Sources that support service discovery:
• azure_sd_configs
• consul_sd_configs
• dns_sd_configs
• ec2_sd_configs
• openstack_sd_configs
• file_sd_configs ***    (file-based dynamic discovery of monitored targets)
• gce_sd_configs
• kubernetes_sd_configs *****
• marathon_sd_configs
• nerve_sd_configs
• serverset_sd_configs
• triton_sd_configs
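To demystify how source_labels, regex, replacement, and target_label interact in a replace action, here is a rough Python illustration of the documented semantics. This is not Prometheus code; note that Python spells capture-group references \1 where Prometheus uses $1:

```python
# A rough illustration of the `replace` relabel action's semantics.
# Not Prometheus source code; it just mimics the documented behaviour.
import re

def relabel_replace(labels, source_labels, target_label,
                    regex=r"(.*)", replacement=r"\1", separator=";"):
    """Join the source label values with the separator; if the regex matches
    the whole string, expand the replacement into target_label."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # relabel regexes are fully anchored
    if match:
        labels = dict(labels)
        labels[target_label] = match.expand(replacement)
    return labels

# Mirrors the example above: copy the value of label `test` into label `idc`.
print(relabel_replace({"test": "beijing"}, ["test"], "idc"))
# -> {'test': 'beijing', 'idc': 'beijing'}
```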
3. Prometheus API
3.1 Expression Queries
GET /api/v1/query
3.1.1 Query parameters
- query=<string>: the PromQL expression to evaluate
- time=<rfc3339 | unix_timestamp>: evaluation timestamp; optional
- timeout=<duration>: evaluation timeout; optional; defaults to the value of -query.timeout
Example request URL:
http://ip:port/api/v1/query?query=up&time=2018-07-01T20:10:51.781Z
Response:
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "up",
          "app": "prometheus",
          "instance": "10.244.1.84:9090",
          "job": "kubernetes-service-endpoints",
          "kubernetes_name": "prometheus-service",
          "kubernetes_namespace": "ns-monitor"
        },
        "value": [1530497003.491, "1"]
      }
    ]
  }
}
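The same query can be issued from code; a minimal sketch using Python's requests package (the server address is a placeholder assumption):

```python
# A minimal sketch of calling the instant-query endpoint from Python
# (pip install requests). The server address is a placeholder assumption.
import requests

PROM = "http://192.168.10.10:9090"

resp = requests.get(f"{PROM}/api/v1/query",
                    params={"query": "up"},  # any PromQL expression works here
                    timeout=10)
body = resp.json()
assert body["status"] == "success"
for series in body["data"]["result"]:
    # each result carries a label set plus a [timestamp, value] pair
    print(series["metric"].get("instance"), "=", series["value"][1])
```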
3.1.2 Range queries
GET /api/v1/query_range
- query=<string>: the PromQL expression to evaluate
- start=<rfc3339 | unix_timestamp>: start timestamp
- end=<rfc3339 | unix_timestamp>: end timestamp
- step=<duration>: query resolution step width
- timeout=<duration>: evaluation timeout; optional; defaults to the value of -query.timeout
Request URL:
http://ip:port/api/v1/query_range?query=up&start=2018-07-01T20:10:30.781Z&end=2018-07-02T20:11:00.781Z&step=100s
Response:
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "__name__": "up",
          "app": "prometheus",
          "instance": "10.244.1.84:9090",
          "job": "kubernetes-service-endpoints",
          "kubernetes_name": "prometheus-service",
          "kubernetes_namespace": "ns-monitor"
        },
        "values": [
          [1530496830.781, "1"],
          [1530496930.781, "1"]
        ]
      }
    ]
  }
}
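A minimal sketch of the same call from Python, querying the last hour at 100-second resolution (server address assumed as before):

```python
# A minimal sketch of a range query over the last hour
# (pip install requests; the server address is a placeholder assumption).
import time

import requests

PROM = "http://192.168.10.10:9090"

end = time.time()
resp = requests.get(f"{PROM}/api/v1/query_range",
                    params={"query": "up",
                            "start": end - 3600,  # one hour ago
                            "end": end,
                            "step": "100s"},
                    timeout=10)
for series in resp.json()["data"]["result"]:
    # "values" holds one [timestamp, value] pair per step
    print(series["metric"].get("instance"), len(series["values"]), "samples")
```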
3.2 Querying Metadata
3.2.1 Finding series by label matchers
GET /api/v1/series
Parameters:
- match[]=<series_selector>: repeated series selector argument; at least one match[] argument must be provided
- start=<rfc3339 | unix_timestamp>: start timestamp
- end=<rfc3339 | unix_timestamp>: end timestamp
Request URL:
http://ip:port/api/v1/series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}
Response:
{
  "status": "success",
  "data": [
    {
      "__name__": "process_start_time_seconds",
      "instance": "localhost:9090",
      "job": "prometheus"
    }
  ]
}
3.2.2 Querying label values
GET /api/v1/label/<label_name>/values
Request URL:
http://ip:port/api/v1/label/job/values
Response:
{
  "status": "success",
  "data": [
    "grafana",
    "kubernetes-apiservers",
    "kubernetes-cadvisor",
    "kubernetes-nodes",
    "kubernetes-service-endpoints",
    "prometheus"
  ]
}
3.3 Expression Result Formats
3.3.1 Range vectors
[
  {
    "metric": { "<label_name>": "<label_value>", ... },
    "values": [ [ <unix_time>, "<sample_value>" ], ... ]
  },
  ...
]
Instant vectors:
[
  {
    "metric": { "<label_name>": "<label_value>", ... },
    "value": [ <unix_time>, "<sample_value>" ]
  },
  ...
]
3.3.2 Scalars
[ <unix_time>, "<scalar_value>" ]
3.3.3 Strings
[ <unix_time>, "<string_value>" ]
3.4 Querying Targets
GET /api/v1/targets
Request URL:
http://ip:port/api/v1/targets
Response:
{
  "status": "success",
  "data": {
    "activeTargets": [
      {
        "discoveredLabels": {
          "__address__": "localhost:9090",
          "__metrics_path__": "/metrics",
          "__scheme__": "http",
          "job": "prometheus"
        },
        "labels": {
          "instance": "localhost:9090",
          "job": "prometheus"
        },
        "scrapeUrl": "http://localhost:9090/metrics",
        "lastError": "",
        "lastScrape": "2018-07-02T02:39:45.398712723Z",
        "health": "up"
      }
    ]
  }
}
3.5 Alertmanagers
GET /api/v1/alertmanagers
Request URL:
http://ip:port/api/v1/alertmanagers
Response:
{
  "status": "success",
  "data": {
    "activeAlertmanagers": [
      { "url": "http://127.0.0.1:9090/api/v1/alerts" }
    ],
    "droppedAlertmanagers": [
      { "url": "http://127.0.0.1:9093/api/v1/alerts" }
    ]
  }
}
3.6 Config
GET /api/v1/status/config
Request URL:
http://ip:port/api/v1/status/config
Response (the yaml string is abridged in the original):
{
  "status": "success",
  "data": {
    "yaml": "global:\n scrape_interval: 15s\n scrape_timeout: 10s\n evaluation_interval: 15s\nalerting:\n alertmanagers:\n - static_configs:\n - targets: []\n scheme: http\n timeout: 10s\nscrape_configs:\n- job_name: prometheus\n acement: $1\n action: replace\n - source_labels: [__meta_kubernetes_pod_name]\n separator: ;\n regex: (.*)\n target_label: kubernetes_pod_name\n replacement: $1\n action: replace\n"
  }
}
3.7 Flags
GET /api/v1/status/flags (returns the flag values of the running server)
Request URL:
http://ip:port/api/v1/status/flags
Response:
{
  "status": "success",
  "data": {
    "alertmanager.notification-queue-capacity": "10000",
    "alertmanager.timeout": "10s",
    "config.file": "/etc/prometheus/prometheus.yml",
    "log.level": "info",
    "query.lookback-delta": "5m",
    "query.max-concurrency": "20",
    "query.timeout": "2m",
    "storage.tsdb.max-block-duration": "36h",
    "storage.tsdb.min-block-duration": "2h",
    "storage.tsdb.no-lockfile": "false",
    "storage.tsdb.path": "/prometheus",
    "storage.tsdb.retention": "15d",
    "web.console.libraries": "/usr/share/prometheus/console_libraries",
    "web.console.templates": "/usr/share/prometheus/consoles",
    "web.enable-admin-api": "false",
    "web.enable-lifecycle": "false",
    "web.external-url": "",
    "web.listen-address": "0.0.0.0:9090",
    "web.max-connections": "512",
    "web.read-timeout": "5m",
    "web.route-prefix": "/",
    "web.user-assets": ""
  }
}
4. Prometheus Admin API
In the Prometheus container spec of the pod, add the --web.enable-admin-api argument shown below; it enables the admin API:
containers:
- image: prom/prometheus:v2.0.0
  name: prometheus
  command:
  - "/bin/prometheus"
  args:
  - "--config.file=/etc/prometheus/prometheus.yml"
  - "--storage.tsdb.path=/prometheus"
  - "--storage.tsdb.retention=24h"
  - "--web.enable-admin-api"
  ports:
  - containerPort: 9090
    protocol: TCP
  volumeMounts:
  - mountPath: "/prometheus"
    name: data
  - mountPath: "/etc/prometheus"
    name: config-volume
4.1 Snapshot
POST /api/v1/admin/tsdb/snapshot?skip_head=<bool>
Request URL:
http://ip:port/api/v1/admin/tsdb/snapshot
Response:
{
  "status": "success",
  "data": {
    "name": "20180702T033639Z-28bcb561ec57373"
  }
}
The snapshot data now lives under <data-dir>/snapshots/20180702T033639Z-28bcb561ec57373.
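From code, the call looks like this; a minimal sketch (the admin API must be enabled, and the server address is an assumption):

```python
# A minimal sketch of taking a TSDB snapshot through the admin API
# (requires --web.enable-admin-api; the address is a placeholder assumption).
import requests

PROM = "http://192.168.10.10:9090"

resp = requests.post(f"{PROM}/api/v1/admin/tsdb/snapshot")
resp.raise_for_status()
print("snapshot directory:", resp.json()["data"]["name"])  # relative to <data-dir>/snapshots/
```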
4.2 Delete Series
POST /api/v1/admin/tsdb/delete_series
Parameters:
- match[]=<series_selector>: repeated label matcher argument that selects the series to delete; at least one match[] argument must be provided
- start=<rfc3339 | unix_timestamp>: start timestamp; optional, defaults to the minimum possible time
- end=<rfc3339 | unix_timestamp>: end timestamp; optional, defaults to the maximum possible time
Request:
http://ip:port/api/v1/admin/tsdb/delete_series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}
Return code: 204
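A minimal sketch of the same deletion from Python (the selector and server address are assumptions; a 204 status code indicates success):

```python
# A minimal sketch of deleting series through the admin API
# (requires --web.enable-admin-api; address and selector are assumptions).
import requests

PROM = "http://192.168.10.10:9090"

resp = requests.post(f"{PROM}/api/v1/admin/tsdb/delete_series",
                     params={"match[]": 'up{job="node01"}'})
print(resp.status_code)  # 204 on success; run clean_tombstones afterwards to free disk space
```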
4.3 Clean Tombstones
Removes tombstone files and frees the disk space of series deleted via delete_series.
POST /api/v1/admin/tsdb/clean_tombstones
Request:
http://ip:port/api/v1/admin/tsdb/clean_tombstones
Return code: 204
5. Prometheus Metrics Reference
5.1 Container CPU
Metric | Description |
---|---|
container_cpu_load_average_10s | CPU load average over the last 10 seconds |
container_cpu_system_seconds_total | system-mode CPU time |
container_cpu_usage_seconds_total | total CPU time consumed |
container_cpu_user_seconds_total | user-mode CPU time |
5.2 Container Filesystem
Metric | Description |
---|---|
container_fs_inodes_free | free inodes |
container_fs_inodes_total | total inodes |
container_fs_io_current | I/Os currently in progress |
container_fs_io_time_seconds_total | time spent doing I/O |
container_fs_io_time_weighted_seconds_total | weighted time spent doing I/O |
container_fs_limit_bytes | filesystem size limit |
container_fs_read_seconds_total | time spent reading |
container_fs_reads_bytes_total | bytes read |
container_fs_reads_total | total read requests |
container_fs_usage_bytes | bytes in use |
container_fs_write_seconds_total | time spent writing |
container_fs_writes_bytes_total | bytes written |
container_fs_writes_total | total write requests |
5.3 Container Memory
Metric | Description |
---|---|
container_memory_cache | page cache |
container_memory_max_usage_bytes | maximum recorded memory usage |
container_memory_failcnt | count of failed memory allocations |
container_memory_failures_total | memory allocation failures |
container_memory_rss | RSS: the physical memory used by the process |
container_memory_swap | swap usage |
container_memory_usage_bytes | current memory usage |
container_memory_working_set_bytes | working set: the physical memory the process can use (but does not necessarily use) |
5.4 Container Network
Metric | Description |
---|---|
container_network_receive_bytes_total | bytes received |
container_network_receive_errors_total | receive errors |
container_network_receive_packets_dropped_total | received packets dropped |
container_network_receive_packets_total | total packets received |
container_network_tcp_usage_total | TCP usage |
container_network_transmit_bytes_total | total bytes transmitted |
container_network_transmit_errors_total | transmit errors |
container_network_transmit_packets_dropped_total | transmitted packets dropped |
container_network_transmit_packets_total | total packets transmitted |
container_network_udp_usage_total | UDP usage |
container_spec_cpu_period | CPU period: how often CPU time is redistributed |
container_spec_cpu_shares | relative CPU weight |
container_spec_memory_limit_bytes | memory limit |
container_spec_memory_reservation_limit_bytes | reserved memory limit |
container_spec_memory_swap_limit_bytes | swap limit |
container_start_time_seconds | container start time |
5.5 Kubelet
Metric | Description |
---|---|
kubelet_containers_per_pod_count | number of containers per pod |
kubelet_containers_per_pod_count_count | number of pods observed |
kubelet_containers_per_pod_count_sum | total number of containers across pods |
kubelet_network_plugin_operations_latency_microseconds | network plugin operation latency |
kubelet_network_plugin_operations_latency_microseconds_count | number of latency observations |
kubelet_network_plugin_operations_latency_microseconds_sum | total latency |
kubelet_pod_start_latency_microseconds | pod start latency |
kubelet_pod_start_latency_microseconds_count | number of pod start latency observations |
kubelet_pod_start_latency_microseconds_sum | total pod start latency |
kubelet_running_container_count | number of running containers |
kubelet_running_pod_count | number of running pods |
kubernetes_build_info | build information |

Metric | Description |
---|---|
machine_cpu_cores | number of CPU cores |
machine_memory_bytes | machine memory size |
5.6 Node
Metric | Description |
---|---|
node_boot_time_seconds | boot time |
node_cpu_seconds_total | total CPU time |
node_load1 | 1-minute load average |
node_load15 | 15-minute load average |
node_load5 | 5-minute load average |
node_disk_io_now | disk I/Os currently in progress |
node_disk_io_time_seconds_total | total time spent on disk I/O |
node_disk_io_time_weighted_seconds_total | weighted total time spent on disk I/O |
node_disk_read_bytes_total | total bytes read from disk |
node_disk_read_time_seconds_total | total time spent reading |
node_disk_reads_completed_total | total reads completed |
node_disk_reads_merged_total | total read requests merged |
node_disk_write_time_seconds_total | total time spent writing |
node_disk_writes_completed_total | total writes completed |
node_disk_writes_merged_total | total write requests merged |
node_disk_written_bytes_total | total bytes written |
node_filefd_maximum | maximum number of file descriptors |
node_filesystem_avail_bytes | filesystem bytes available |
node_filesystem_device_error | filesystem device errors |
node_filesystem_files | total number of inodes |
node_filesystem_files_free | free inodes |
node_filesystem_free_bytes | free bytes |
node_filesystem_readonly | whether the filesystem is read-only |
node_filesystem_size_bytes | filesystem size |
node_forks_total | total number of forks |
node_memory_Buffers_bytes | buffer size |
node_memory_Cached_bytes | page cache size |
node_memory_CommitLimit_bytes | commit limit |
node_memory_Committed_AS_bytes | memory currently committed by the system |
node_memory_KernelStack_bytes | kernel stack size |
node_memory_Mapped_bytes | memory-mapped size |
node_memory_MemAvailable_bytes | available memory |
node_memory_MemFree_bytes | free memory |
node_memory_MemTotal_bytes | total usable RAM |
node_memory_Mlocked_bytes | memory locked in RAM |
node_memory_NFS_Unstable_bytes | unstable NFS pages |
node_memory_PageTables_bytes | memory used by the lowest-level page tables |
node_memory_SReclaimable_bytes | the part of Slab that can be reclaimed under memory pressure |
node_memory_SUnreclaim_bytes | the part of Slab that cannot be reclaimed |
node_memory_Shmem_bytes | shared memory size |
node_memory_Slab_bytes | kernel data-structure cache size |
node_memory_SwapCached_bytes | swap cache |
node_memory_SwapFree_bytes | free swap |
node_memory_SwapTotal_bytes | total swap |
node_memory_Unevictable_bytes | pages that cannot be swapped out |
node_memory_VmallocChunk_bytes | largest contiguous free block in the vmalloc area |
node_memory_VmallocTotal_bytes | total vmalloc address space |
node_memory_VmallocUsed_bytes | vmalloc space in use |
node_memory_WritebackTmp_bytes | temporary data being written back |
node_memory_Writeback_bytes | data currently being written back |
node_network_receive_bytes_total | bytes received |
node_network_receive_compressed_total | compressed packets received |
node_network_receive_drop_total | received packets dropped |
node_network_receive_errs_total | receive errors |
node_network_receive_fifo_total | FIFO buffer errors on receive |
node_network_receive_frame_total | framing errors |
node_network_receive_multicast_total | multicast frames sent or received by the device driver |
node_network_receive_packets_total | total packets received |
node_network_transmit_bytes_total | total bytes transmitted |
node_network_transmit_carrier_total | carrier losses detected by the device driver |
node_network_transmit_colls_total | collisions detected on the interface |
node_network_transmit_compressed_total | compressed packets transmitted |
node_network_transmit_drop_total | transmitted packets dropped |
node_network_transmit_errs_total | transmit errors |
node_network_transmit_fifo_total | FIFO buffer errors on transmit |
node_network_transmit_packets_total | total packets transmitted |
node_procs_blocked | number of blocked processes |
node_procs_running | number of running processes |
node_uname_info | uname (system identification) information |
node_vmstat_pgfault | page faults |
node_vmstat_pgmajfault | major page faults |
node_vmstat_pgpgin | kilobytes paged in from disk or swap |
node_vmstat_pgpgout | kilobytes paged out to disk or swap |
node_vmstat_pswpin | swap pages swapped in |
node_vmstat_pswpout | swap pages swapped out |
5.7 Process
Metric | Description |
---|---|
process_cpu_seconds_total | total CPU time used by the process |
process_resident_memory_bytes | resident memory of the process |
process_start_time_seconds | process start time |
process_virtual_memory_bytes | virtual memory of the process |