Nagios plugin to check Patroni

check_patroni

A Nagios plugin for Patroni.

Features

  • Check the presence of leaders, replicas and node counts.
  • Check each node for its replication status.

Usage: check_patroni [OPTIONS] COMMAND [ARGS]...

  Nagios plugin that uses Patroni's REST API to monitor a Patroni cluster.

Options:
  --config FILE         Read option defaults from the specified INI file
                        [default: config.ini]
  -e, --endpoints TEXT  Patroni API endpoint. Can be specified multiple times
                        or as a list of comma separated addresses. The node
                        services check the status of one node, therefore if
                        several addresses are specified they should point to
                        different interfaces on the same node. The cluster
                        services check the status of the cluster, therefore
                        it's better to give a list of all Patroni node
                        addresses.  [default: http://127.0.0.1:8008]
  --cert_file PATH      File with the client certificate.
  --key_file PATH       File with the client key.
  --ca_file PATH        The CA certificate.
  -v, --verbose         Increase verbosity -v (info)/-vv (warning)/-vvv
                        (debug)
  --version
  --timeout INTEGER     Timeout in seconds for the API queries (0 to disable)
                        [default: 2]
  --help                Show this message and exit.

Commands:
  cluster_config_has_changed    Check if the hash of the configuration...
  cluster_has_leader            Check if the cluster has a leader.
  cluster_has_replica           Check if the cluster has healthy replicas...
  cluster_has_scheduled_action  Check if the cluster has a scheduled...
  cluster_is_in_maintenance     Check if the cluster is in maintenance...
  cluster_node_count            Count the number of nodes in the cluster.
  node_is_alive                 Check if the node is alive ie patroni is...
  node_is_leader                Check if the node is a leader node.
  node_is_pending_restart       Check if the node is in pending restart...
  node_is_primary               Check if the node is the primary with the...
  node_is_replica               Check if the node is a replica with no...
  node_patroni_version          Check if the version is equal to the input
  node_tl_has_changed           Check if the timeline has changed.

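Install

check_patroni is licensed under the PostgreSQL license.

$ pip install git+https://github.com/dalibo/check_patroni.git

check_patroni works on Python 3.6. We keep it that way because Patroni also supports it and there are still many RH 7 variants around. That said, Python 3.6 reached end of life long ago and is not covered by the GitHub CI.

Support

If you hit a bug or need help, open a GitHub issue. Dalibo makes no commitment on response time for public free support. Thanks for your contribution!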

Config file

All global and service-specific parameters can be specified via a config file, as follows:

[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008
cert_file = ./ssl/my-cert.pem
key_file = ./ssl/my-key.pem
ca_file = ./ssl/CA-cert.pem
timeout = 0

[options.node_is_replica]
lag=100
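
As an illustration of how such a file can be read, here is a minimal sketch using Python's standard `configparser`. It is not the plugin's own loader: the function name `load_defaults` and the rule that a `[options.<command>]` section overrides `[options]` are assumptions made for this example.

```python
import configparser
from typing import Any, Dict

SAMPLE = """
[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008
timeout = 0

[options.node_is_replica]
lag=100
"""


def load_defaults(text: str, command: str) -> Dict[str, Any]:
    """Merge the global [options] section with [options.<command>] (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    defaults: Dict[str, Any] = {}
    if parser.has_section("options"):
        defaults.update(parser["options"])
    section = f"options.{command}"
    if parser.has_section(section):
        # Assumption: service-specific values take precedence over global ones.
        defaults.update(parser[section])
    if "endpoints" in defaults:
        # The endpoint list is comma separated; surrounding spaces are ignored.
        defaults["endpoints"] = [
            item.strip() for item in defaults["endpoints"].split(",") if item.strip()
        ]
    return defaults


print(load_defaults(SAMPLE, "node_is_replica"))
```

All values are returned as strings, except `endpoints`, which is split into a list.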

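Thresholds

The format for the threshold parameters is [@][start:][end].

  • start: may be omitted if start is 0.
  • ~: means that start is negative infinity.
  • If end is omitted, infinity is assumed.
  • To invert the match condition, prefix the range expression with @.

A value is inside the range when: start <= VALUE <= end. Without @, an alert is raised when the value falls outside the range.

For example, the following command will raise:

  • a warning if there are fewer than 2 healthy replicas, which translates to outside of the range [2;+INF[
  • a critical if there are no healthy replicas, which translates to outside of the range [1;+INF[

check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning 2: --critical 1: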
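
The range rules above can be sketched in a few lines of Python. This is an illustrative parser following the Nagios range convention described in this section, not the code the plugin uses; `parse_range`, `is_alert` and `status` are names invented for the example.

```python
import math
from typing import Tuple


def parse_range(spec: str) -> Tuple[float, float, bool]:
    """Parse '[@][start:][end]' into (start, end, inverted)."""
    inverted = spec.startswith("@")
    if inverted:
        spec = spec[1:]
    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
    else:
        # No colon: the whole expression is the end, start defaults to 0.
        start_text, end_text = "", spec
    if start_text == "~":
        start = -math.inf
    elif start_text == "":
        start = 0.0
    else:
        start = float(start_text)
    end = float(end_text) if end_text else math.inf
    return start, end, inverted


def is_alert(spec: str, value: float) -> bool:
    """Return True when `value` should raise an alert for this threshold."""
    start, end, inverted = parse_range(spec)
    inside = start <= value <= end
    # Without '@' the alert fires outside the range; with '@' it fires inside.
    return inside if inverted else not inside


def status(value: float, warning: str, critical: str) -> str:
    """Combine both thresholds; critical wins over warning."""
    if is_alert(critical, value):
        return "CRITICAL"
    if is_alert(warning, value):
        return "WARNING"
    return "OK"


for replicas in (2, 1, 0):
    print(replicas, status(replicas, warning="2:", critical="1:"))
```

With `--warning 2: --critical 1:`, two replicas give OK, one gives WARNING and none gives CRITICAL.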

SSL

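Several options are available:

  • The server's CA certificate is not available or not trusted by the client system:
    • --ca_file: your certification chain, e.g. cat CA-certificate server-certificate > cabundle
  • You have a client certificate for authenticating with Patroni's REST API:
    • --cert_file: your certificate, or the concatenation of your certificate and private key
    • --key_file: your private key (optional)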
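
To make the relationship between the three options concrete, the sketch below maps them to the `verify` and `cert` keyword arguments accepted by the `requests` HTTP library. That the plugin passes them in exactly this way is an assumption, and `tls_kwargs` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def tls_kwargs(
    ca_file: Optional[str] = None,
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
) -> Dict[str, Any]:
    """Build requests-style TLS keyword arguments from the three options."""
    kwargs: Dict[str, Any] = {}
    if ca_file:
        # A path given to `verify` is used as the CA bundle.
        kwargs["verify"] = ca_file
    if cert_file and key_file:
        # Separate certificate and private key files.
        kwargs["cert"] = (cert_file, key_file)
    elif cert_file:
        # A single file holding the certificate followed by the key.
        kwargs["cert"] = cert_file
    return kwargs


print(tls_kwargs("./ssl/CA-cert.pem", "./ssl/my-cert.pem", "./ssl/my-key.pem"))
```

The result could then be passed on as `requests.get(url, **tls_kwargs(...))`.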

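Shell completion

We use the click library, which supports shell completion natively.

Shell completion can be enabled by typing the following command or by adding it to the startup file of your shell of choice.

  • For Bash (add to ~/.bashrc):
    eval "$(_CHECK_PATRONI_COMPLETE=bash_source check_patroni)"

  • For Zsh (add to ~/.zshrc):
    eval "$(_CHECK_PATRONI_COMPLETE=zsh_source check_patroni)"

  • For Fish (add to ~/.config/fish/completions/check_patroni.fish):
    eval "$(_CHECK_PATRONI_COMPLETE=fish_source check_patroni)"


Please note that shell completion is not supported for all shell versions. For example, only Bash versions 4.4 and newer are supported.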

Cluster services

cluster_config_has_changed

Usage: check_patroni cluster_config_has_changed [OPTIONS]

  Check if the hash of the configuration has changed.

  Note: either a hash or a state file must be provided for this service to
  work.

  Check:
  * `OK`: The hash didn't change
  * `CRITICAL`: The hash of the configuration has changed compared to the input (`--hash`) or last time (`--state-file`)

  Perfdata:
  * `is_configuration_changed` is 1 if the configuration has changed

Options:
  --hash TEXT            A hash to compare with.
  -s, --state-file TEXT  A state file to store the hash of the configuration.
  --save                 Set the current configuration hash as the reference
                         for future calls.
  --help                 Show this message and exit.
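
The comparison behind this service can be sketched as follows. The hashing scheme is an assumption made for illustration (SHA-256 over a canonical JSON dump); the plugin may compute its hash differently, so hashes produced here are not interchangeable with `--hash` values. `config_hash` and `is_configuration_changed` are hypothetical helpers.

```python
import hashlib
import json
from typing import Any, Dict


def config_hash(config: Dict[str, Any]) -> str:
    """Hash a configuration dict independently of key order (assumed scheme)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_configuration_changed(config: Dict[str, Any], reference_hash: str) -> int:
    """Return the perfdata value: 1 if the configuration changed, 0 otherwise."""
    return int(config_hash(config) != reference_hash)


reference = config_hash({"ttl": 30, "loop_wait": 10})
print(is_configuration_changed({"loop_wait": 10, "ttl": 30}, reference))
print(is_configuration_changed({"loop_wait": 10, "ttl": 60}, reference))
```

Sorting the keys before hashing keeps the result stable when the same settings are returned in a different order.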

cluster_has_leader

Usage: check_patroni cluster_has_leader [OPTIONS]

  Check if the cluster has a leader.

  This check applies to any kind of leaders including standby leaders.

  A leader is a node with the "leader" role and a "running" state.

  A standby leader is a node with a "standby_leader" role and a "streaming" or
  "in archive recovery" state. Please note that log shipping could be stuck
  because the WAL are not available or applicable. Patroni doesn't provide
  information about the origin cluster (timeline or lag), so we cannot check
  if there is a problem in that particular case. That's why we issue a warning
  when the node is "in archive recovery". We suggest using other supervision
  tools to do this (e.g. check_pgactivity).

  Check:
  * `OK`: if there is a leader node.
  * `WARNING`: if there is a standby leader in archive recovery.
  * `CRITICAL`: otherwise.

  Perfdata:
  * `has_leader` is 1 if there is any kind of leader node, 0 otherwise
  * `is_standby_leader_in_arc_rec` is 1 if the standby leader node is "in
     archive recovery", 0 otherwise
  * `is_standby_leader` is 1 if there is a standby leader node, 0 otherwise
  * `is_leader` is 1 if there is a "classical" leader node, 0 otherwise

Options:
  --help  Show this message and exit.
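
The decision rules above can be expressed as a small function over the member list. The sketch assumes each member is a dict with `role` and `state` keys, as in the member list of Patroni's cluster endpoint; `leader_status` is a hypothetical name and the function only mirrors the rules stated in this help text.

```python
from typing import Dict, List


def leader_status(members: List[Dict[str, str]]) -> str:
    """Return OK, WARNING or CRITICAL following the rules described above."""
    has_leader = any(
        m["role"] == "leader" and m["state"] == "running" for m in members
    )
    standby_states = {m["state"] for m in members if m["role"] == "standby_leader"}
    if has_leader or "streaming" in standby_states:
        return "OK"
    if "in archive recovery" in standby_states:
        # Log shipping might be stuck, which cannot be verified from here.
        return "WARNING"
    return "CRITICAL"


print(leader_status([{"role": "leader", "state": "running"}]))
print(leader_status([{"role": "standby_leader", "state": "in archive recovery"}]))
print(leader_status([{"role": "replica", "state": "streaming"}]))
```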

cluster_has_replica

Usage: check_patroni cluster_has_replica [OPTIONS]

  Check if the cluster has healthy replicas and/or if some are sync standbies

  For patroni (and this check):
  * a replica is `streaming` if `pg_stat_wal_receiver` says so.
  * a replica is `in archive recovery`, if it's not `streaming` and has a `restore_command`.

  A healthy replica:
  * has a `replica` or `sync_standby` role
  * has the same timeline as the leader and
    * is in `running` state (patroni < V3.0.4)
    * is in `streaming` or `in archive recovery` state (patroni >= V3.0.4)
  * has a lag lower or equal to `max_lag`

  Please note that a replica `in archive recovery` could be stuck because the
  WAL are not available or applicable (the server's timeline has diverged from
  the leader's). We already detect the latter but we will miss the former.
  Therefore, it's preferable to check for the lag in addition to the healthy
  state if you rely on log shipping to help lagging standbies to catch up.

  Since we require a healthy replica to have the same timeline as the leader,
  it's possible that we raise alerts when the cluster is performing a
  switchover or failover and the standbies are in the process of catching up
  with the new leader. The alert shouldn't last long.

  Check:
  * `OK`: if the healthy_replica count and their lag are compatible with the replica count threshold,
          and if the sync_replica count is compatible with the sync replica count threshold.
  * `WARNING` / `CRITICAL`: otherwise

  Perfdata:
  * healthy_replica & unhealthy_replica count
  * the number of sync_replica, they are included in the previous count
  * the lag of each replica labelled with "member name"_lag
  * the timeline of each replica labelled with "member name"_timeline
  * a boolean to tell if the node is a sync standby labelled with "member name"_sync

Options:
  -w, --warning TEXT    Warning threshold for the number of healthy replica
                        nodes.
  -c, --critical TEXT   Critical threshold for the number of healthy replica
                        nodes.
  --sync-warning TEXT   Warning threshold for the number of sync replica.
  --sync-critical TEXT  Critical threshold for the number of sync replica.
  --max-lag TEXT        maximum allowed lag
  --help                Show this message and exit.
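
As a rough sketch of the "healthy replica" definition above, the function below filters a member list for Patroni >= 3.0.4 states. It assumes members are dicts with `name`, `role`, `state`, `timeline` and `lag` keys and that `max_lag` is already a number of bytes; `healthy_replicas` is a hypothetical helper, not the plugin's implementation.

```python
from typing import Any, Dict, List, Optional


def healthy_replicas(
    members: List[Dict[str, Any]], max_lag: Optional[int] = None
) -> List[str]:
    """Return the names of the replicas considered healthy."""
    leader_timeline = next(
        (m.get("timeline") for m in members if m["role"] in ("leader", "standby_leader")),
        None,
    )
    if leader_timeline is None:
        # Without a leader timeline nothing can be compared.
        return []
    healthy = []
    for member in members:
        if member["role"] not in ("replica", "sync_standby"):
            continue
        if member.get("timeline") != leader_timeline:
            continue
        if member["state"] not in ("streaming", "in archive recovery"):
            continue
        lag = member.get("lag")
        # A non-numeric lag (for example "unknown") fails the lag check.
        if max_lag is not None and (not isinstance(lag, int) or lag > max_lag):
            continue
        healthy.append(member["name"])
    return healthy


cluster = [
    {"name": "p1", "role": "leader", "state": "running", "timeline": 5},
    {"name": "p2", "role": "replica", "state": "streaming", "timeline": 5, "lag": 0},
    {"name": "p3", "role": "sync_standby", "state": "streaming", "timeline": 5, "lag": 200},
    {"name": "p4", "role": "replica", "state": "in archive recovery", "timeline": 4, "lag": 0},
]
print(healthy_replicas(cluster))
print(healthy_replicas(cluster, max_lag=100))
```

In the sample cluster, `p4` is excluded because its timeline differs from the leader's, and `p3` is excluded once a maximum lag of 100 is requested.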

cluster_has_scheduled_action

Usage: check_patroni cluster_has_scheduled_action [OPTIONS]

  Check if the cluster has a scheduled action (switchover or restart)

  Check:
  * `OK`: If the cluster has no scheduled action
  * `CRITICAL`: otherwise.

  Perfdata:
  * `scheduled_actions` is 1 if the cluster has scheduled actions.
  * `scheduled_switchover` is 1 if the cluster has a scheduled switchover.
  * `scheduled_restart` counts the number of scheduled restarts in the cluster.

Options:
  --help  Show this message and exit.

cluster_is_in_maintenance

Usage: check_patroni cluster_is_in_maintenance [OPTIONS]

  Check if the cluster is in maintenance mode or paused.

  Check:
  * `OK`: If the cluster is not in maintenance mode.
  * `CRITICAL`: otherwise.

  Perfdata:
  * `is_in_maintenance` is 1 if the cluster is in maintenance mode, 0 otherwise

Options:
  --help  Show this message and exit.

cluster_node_count

Usage: check_patroni cluster_node_count [OPTIONS]

  Count the number of nodes in the cluster.

  The role refers to the role of the server in the cluster. Possible values
  are:
  * master or leader
  * replica
  * standby_leader
  * sync_standby
  * demoted
  * promoted
  * uninitialized

  The state refers to the state of PostgreSQL. Possible values are:
  * initializing new cluster, initdb failed
  * running custom bootstrap script, custom bootstrap failed
  * starting, start failed
  * restarting, restart failed
  * running, streaming, in archive recovery
  * stopping, stopped, stop failed
  * creating replica
  * crashed

  The "healthy" check only ensures that:
  * a leader has the running state
  * a standby_leader has the running or streaming (V3.0.4) state
  * a replica or sync-standby has the running or streaming (V3.0.4) state

  Since we don't check the lag or timeline, "in archive recovery" is not
  considered a valid state for this service. See cluster_has_leader and
  cluster_has_replica for specialized checks.

  Check:
  * Compares the number of nodes against the normal and healthy nodes warning and critical thresholds.
  * `OK`:  If they are not provided.

  Perfdata:
  * `members`: the member count.
  * `healthy_members`: the running and streaming member count.
  * all the roles of the nodes in the cluster with their count (start with "role_").
  * all the statuses of the nodes in the cluster with their count (start with "state_").

Options:
  -w, --warning TEXT       Warning threshold for the number of nodes.
  -c, --critical TEXT      Critical threshold for the number of nodes.
  --healthy-warning TEXT   Warning threshold for the number of healthy nodes
                           (running + streaming).
  --healthy-critical TEXT  Critical threshold for the number of healthy nodes
                           (running + streaming).
  --help                   Show this message and exit.
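
The perfdata described above can be approximated with a couple of counters. The sketch assumes members are dicts with `role` and `state` keys; `is_healthy` and `node_count_perfdata` are hypothetical names, and the health rules are the ones listed in this help text.

```python
from collections import Counter
from typing import Dict, List


def is_healthy(member: Dict[str, str]) -> bool:
    """Apply the per-role state rules of the "healthy" check."""
    if member["role"] in ("leader", "master"):
        return member["state"] == "running"
    if member["role"] in ("standby_leader", "replica", "sync_standby"):
        return member["state"] in ("running", "streaming")
    return False


def node_count_perfdata(members: List[Dict[str, str]]) -> Dict[str, int]:
    """Build the member, role and state counters."""
    perfdata = {
        "members": len(members),
        "healthy_members": sum(1 for m in members if is_healthy(m)),
    }
    for role, count in Counter(m["role"] for m in members).items():
        perfdata[f"role_{role}"] = count
    for state, count in Counter(m["state"] for m in members).items():
        perfdata[f"state_{state}"] = count
    return perfdata


nodes = [
    {"role": "leader", "state": "running"},
    {"role": "replica", "state": "streaming"},
    {"role": "replica", "state": "start failed"},
]
print(node_count_perfdata(nodes))
```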

Node services

node_is_alive

Usage: check_patroni node_is_alive [OPTIONS]

  Check if the node is alive, i.e. patroni is running. This is a liveness check
  as defined in Patroni's documentation.

  Check:
  * `OK`: If patroni is running.
  * `CRITICAL`: otherwise.

  Perfdata:
  * `is_running` is 1 if patroni is running, 0 otherwise

Options:
  --help  Show this message and exit.

node_is_pending_restart

Usage: check_patroni node_is_pending_restart [OPTIONS]

  Check if the node is in pending restart state.

  This situation can arise if the configuration has been modified but requires
  a restart of PostgreSQL to take effect.

  Check:
  * `OK`: if the node has no pending restart tag.
  * `CRITICAL`: otherwise

  Perfdata: `is_pending_restart` is 1 if the node has pending restart tag, 0
  otherwise.

Options:
  --help  Show this message and exit.

node_is_leader

Usage: check_patroni node_is_leader [OPTIONS]

  Check if the node is a leader node.

  This check applies to any kind of leaders including standby leaders. To
  check explicitly for a standby leader use the `--is-standby-leader` option.

  Check:
  * `OK`: if the node is a leader.
  * `CRITICAL`: otherwise

  Perfdata: `is_leader` is 1 if the node is a leader node, 0 otherwise.

Options:
  --is-standby-leader  Check for a standby leader
  --help               Show this message and exit.

node_is_primary

Usage: check_patroni node_is_primary [OPTIONS]

  Check if the node is the primary with the leader lock.

  This service is not valid for a standby leader, because this kind of node is
  not a primary.

  Check:
  * `OK`: if the node is a primary with the leader lock.
  * `CRITICAL`: otherwise

  Perfdata: `is_primary` is 1 if the node is a primary with the leader lock, 0
  otherwise.

Options:
  --help  Show this message and exit.

node_is_replica

Usage: check_patroni node_is_replica [OPTIONS]

  Check if the node is a replica with no noloadbalance tag.

  It is possible to check if the node is synchronous or asynchronous. If
  nothing is specified any kind of replica is accepted.  When checking for a
  synchronous replica, it's not possible to specify a lag.

  This service is using the following Patroni endpoints: replica, asynchronous
  and synchronous. The first two implement the `lag` tag. For these endpoints
  the state of a replica node doesn't reflect the replication state
  (`streaming` or `in archive recovery`), we only know if it's `running`. The
  timeline is also not checked.

  Therefore, if a cluster is using asynchronous replication, it is recommended
  to check for the lag to detect a divergence as soon as possible.

  Check:
  * `OK`: if the node is a running replica with no noloadbalance tag and the lag is under the maximum threshold.
  * `CRITICAL`:  otherwise

  Perfdata: `is_replica` is 1 if the node is a running replica with no
  noloadbalance tag and the lag is under the maximum threshold, 0 otherwise.

Options:
  --max-lag TEXT  maximum allowed lag
  --is-sync       check if the replica is synchronous
  --is-async      check if the replica is asynchronous
  --help          Show this message and exit.
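
To show how the options relate to the three Patroni endpoints named above, here is a small URL builder. It is a sketch under assumptions: `replica_check_url` is a hypothetical helper, and the way the plugin actually assembles its requests may differ.

```python
from typing import Optional
from urllib.parse import urlencode


def replica_check_url(
    endpoint: str,
    max_lag: Optional[str] = None,
    is_sync: bool = False,
    is_async: bool = False,
) -> str:
    """Pick the Patroni endpoint matching the requested kind of replica."""
    if is_sync and is_async:
        raise ValueError("--is-sync and --is-async are mutually exclusive")
    if is_sync and max_lag is not None:
        # A lag cannot be specified when checking for a synchronous replica.
        raise ValueError("a lag cannot be given for a synchronous replica")
    if is_sync:
        path = "synchronous"
    elif is_async:
        path = "asynchronous"
    else:
        path = "replica"
    query = "?" + urlencode({"lag": max_lag}) if max_lag is not None else ""
    return f"{endpoint.rstrip('/')}/{path}{query}"


print(replica_check_url("http://127.0.0.1:8008", max_lag="100"))
print(replica_check_url("http://127.0.0.1:8008", is_sync=True))
```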

node_patroni_version

Usage: check_patroni node_patroni_version [OPTIONS]

  Check if the version is equal to the input

  Check:
  * `OK`: The version is the same as the input `--patroni-version`
  * `CRITICAL`: otherwise.

  Perfdata:
  * `is_version_ok` is 1 if version is ok, 0 otherwise

Options:
  --patroni-version TEXT  Patroni version to compare to  [required]
  --help                  Show this message and exit.

node_tl_has_changed

Usage: check_patroni node_tl_has_changed [OPTIONS]

  Check if the timeline has changed.

  Note: either a timeline or a state file must be provided for this service to
  work.

  Check:
  * `OK`: The timeline is the same as last time (`--state-file`) or the given timeline (`--timeline`)
  * `CRITICAL`: The tl is not the same.

  Perfdata:
  * `is_timeline_changed` is 1 if the tl has changed, 0 otherwise
  * the timeline

Options:
  --timeline TEXT        A timeline number to compare with.
  -s, --state-file TEXT  A state file to store the last tl number into.
  --save                 Set the current timeline number as the reference for
                         future calls.
  --help                 Show this message and exit.
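
The state-file mechanism can be sketched as follows. The JSON layout of the file, the behaviour when no reference exists yet (reported as unchanged) and the name `timeline_changed` are all assumptions made for this example; the plugin's own state file format is not described here.

```python
import json
import tempfile
from pathlib import Path


def timeline_changed(current_tl: int, state_file: Path, save: bool = False) -> int:
    """Return 1 if the timeline differs from the stored reference, 0 otherwise."""
    stored = None
    if state_file.exists():
        stored = json.loads(state_file.read_text())["timeline"]
    if save:
        # Record the current timeline as the reference for future calls.
        state_file.write_text(json.dumps({"timeline": current_tl}))
    return int(stored is not None and stored != current_tl)


# Demonstration with a throwaway state file.
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "timeline.state"
    first = timeline_changed(3, state, save=True)  # no reference yet, one is saved
    same = timeline_changed(3, state)              # matches the saved reference
    changed = timeline_changed(4, state)           # differs from the reference
print(first, same, changed)
```

Since the last call does not pass `save=True`, the reference stays at timeline 3 and the change keeps being reported until the reference is saved again.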
