PostgreSQL: Repost — PostgreSQL High Availability Test Series: Offline Deployment of Patroni + etcd + HAProxy + Keepalived
This article is reposted here for backup purposes only. Original: https://blog.csdn.net/u010692693/article/details/121123843
Contents:
- PostgreSQL HA Test Series: Offline Deployment of Patroni + etcd + HAProxy + Keepalived (Part 1)
- PostgreSQL HA Test Series: Offline Deployment of Patroni + etcd + HAProxy + Keepalived (Part 2)
- PostgreSQL HA Test Series: Offline Deployment of Patroni + etcd + HAProxy + Keepalived (Part 3)
- PostgreSQL HA Test Series: Offline Deployment of Patroni + etcd + HAProxy + Keepalived (Part 4)
- PostgreSQL HA Test Series: Offline Deployment of Patroni + etcd + HAProxy + Keepalived (Part 5)
- PostgreSQL HA Test Series: Offline Deployment of Patroni + etcd + HAProxy + Keepalived (Part 6)
- PostgreSQL HA Test Series: Offline Deployment of Patroni + etcd + HAProxy + Keepalived (Part 7)
- PostgreSQL HA Test Series: Offline Deployment of Patroni + etcd + HAProxy + Keepalived (Part 8)
Content:
PostgreSQL HA Test Series: Offline Deployment of Patroni + etcd + HAProxy + Keepalived (Part 1)
Note: This series only shares my own study results and has not been proven on a production system. The notes and the usage of some components will be updated from time to time; they are for reference only. If you spot a mistake, feel free to leave a comment so we can learn together.
This series skips the theory; search for it yourself if needed. Because most production environments are isolated intranets without Internet access, everything in this HA test is deployed offline to simulate production conditions.
All required packages are bundled and uploaded to Baidu Netdisk: https://pan.baidu.com/s/1Tb7GPMvj4kfKEIh8iyvdbA code: n9w2, file: PostgreSQL_HA.tar.gz
Chapter 1: Test Environment
Chapter 2: PostgreSQL + replication deployment
Chapter 3: Etcd deployment and management
Chapter 4: Patroni deployment and management
Chapter 5: HAProxy + Keepalived deployment and management
Chapter 6: HA failure-simulation test cases
Chapter 7: Prometheus + Grafana monitoring deployment
Chapter 8: HA administration
Chapter 1: Test Environment
1. Architecture diagram of the test environment (home-made)
2. Node information
Hostname | IP address | OS version | PostgreSQL | Etcd | Patroni | HAProxy | Keepalived | Grafana | Prometheus |
---|---|---|---|---|---|---|---|---|---|
pgtest1 | 192.168.58.10 | CentOS 7.4.1708 | Primary | √ | √ | √ | √ | - | - |
pgtest2 | 192.168.58.11 | CentOS 7.4.1708 | Standby | √ | √ | √ | √ | - | - |
pgtest3 | 192.168.58.12 | CentOS 7.4.1708 | Standby | √ | √ | - | - | √ | √ |
3. Required software versions and download links
- PostgreSQL: version 13.3, download: https://ftp.postgresql.org/pub/source/v13.3/postgresql-13.3.tar.gz
- Python: version 3.6.8. Building from source takes too long, so RPMs were downloaded from the Alibaba open-source mirror and turned into a YUM repository; available on Baidu Netdisk: https://pan.baidu.com/s/1Tb7GPMvj4kfKEIh8iyvdbA code: n9w2, file: python3-rpm.tar.gz
- Etcd: version 3.5.1, download: https://github.com/etcd-io/etcd file: etcd-v3.5.1-linux-amd64.tar.gz
- Patroni: version 2.1.1, downloaded from the Alibaba open-source mirror; available on Baidu Netdisk: https://pan.baidu.com/s/1Tb7GPMvj4kfKEIh8iyvdbA code: n9w2, file: patroni_etcd_2.1.1.tar.gz
- HAProxy: version 2.4.7, download: https://www.haproxy.org/#down file: haproxy-2.4.7.tar.gz
- Keepalived: version 2.2.4, download: https://www.keepalived.org/download.html file: keepalived-2.2.4.tar.gz
- Grafana: version 8.2.2, download: https://grafana.com/grafana/download file: grafana-enterprise-8.2.2.linux-amd64.tar.gz
- Prometheus: version 2.30.3, download: https://github.com/prometheus/prometheus file: prometheus-2.30.3.linux-amd64.tar.gz
4. Other small tools used along the way
- benchmarksql: purpose: PostgreSQL stress testing; download: https://udomain.dl.sourceforge.net/project/benchmarksql/benchmarksql-5.0.zip
- etcd-browser: purpose: web UI for etcd; download: https://github.com/henszey/etcd-browser
- nodejs: purpose: runs etcd-browser; download: https://nodejs.org/zh-cn/download/
- jq: purpose: pretty-print command-line output as JSON; download: https://stedolan.github.io/jq/download/
- JsonView: purpose: pretty-print JSON results in Chrome; download: https://github.com/gildas-lormeau/JSONView-for-Chrome
All of the above are bundled in the Baidu Netdisk archive mentioned earlier (PostgreSQL_HA.tar.gz).
5. Ports to open if a firewall is configured
Program | Port | Purpose |
---|---|---|
PostgreSQL | 5432 | database listen port |
Patroni | 8008 | REST API listen port |
etcd | 2379/2380 | client access port and peer communication port |
haproxy | 1080/5000/5001 | web stats port and database service ports |
grafana | 3000 | web login port |
prometheus | 9090 | web login port |
node_exporter | 9100 | metrics port |
postgres_exporter | 9187 | metrics port |
haproxy_exporter | 9101 | metrics port |
Command to check listening ports on Linux:
[root@pgtest3 data]# netstat -lntup
Firewall usage reference: https://blog.csdn.net/s_p_j/article/details/80979450
6. Packaging python3, python3-devel, and patroni[etcd]
On a Linux host with Internet access, download python3 and python3-devel together with all of their dependencies.
There are two approaches: yum's --downloadonly option (the one I used), and yumdownloader.
With the first approach, packages already installed on the current system will not be downloaded again, so they must be uninstalled first; I don't know of an option to override this.
[root@yum ~]# yum remove python3 python3-devel    # removes only the named packages, not their dependencies
[root@yum ~]# yum history list python3-devel
Loaded plugins: fastestmirror
ID | Command line | Date and time | Action(s) | Altered
-------------------------------------------------------------------------------
6 | remove python3 python3-d | 2021-10-16 14:48 | Erase | 6
5 | install python3-devel | 2021-10-16 14:31 | Install | 34
history list
[root@yum ~]# yum history undo 5    # fully undoes the install, dependencies included
Download the packages and their dependencies into a target directory:
[root@yum ~]# mkdir /root/python3-rpm
[root@yum ~]# yum install python3 python3-devel --downloadonly --downloaddir=/root/python3-rpm
With the second approach it does not matter whether the packages are already installed, but yum-utils must be installed first, and i686 packages get downloaded as well; the --archlist option is supposed to restrict the architecture, but it did not work for me.
[root@yum ~]# yum -y install yum-utils
[root@yum ~]# yumdownloader python3 python3-devel --resolve --destdir=/root/python3-rpm
Create a yum repository with the createrepo command
# install createrepo
[root@yum ~]# yum install http://createrepo.baseurl.org/download/createrepo-0.4.4-1.noarch.rpm -y
# create the repository metadata
[root@yum /]# cd /root/python3-rpm/
[root@yum python3-rpm]# createrepo -v ./
# create the tarball
[root@yum python3-rpm]# cd /root
[root@yum ~]# tar -zcvf /root/python3-rpm.tar.gz python3-rpm
Packaging patroni[etcd]
[root@yum ~]# yum install python3 python3-devel -y
[root@yum ~]# mkdir /root/patroni_etcd_2.1.1
# --download: download only, do not install (note: newer pip versions drop this flag in favor of `pip download -d`)
[root@yum ~]# pip3 install patroni[etcd] -i https://mirrors.aliyun.com/pypi/simple/ --download /root/patroni_etcd_2.1.1
[root@yum ~]# pip3 install psycopg2-binary -i https://mirrors.aliyun.com/pypi/simple/ --download /root/patroni_etcd_2.1.1
[root@yum ~]# pip3 install psycopg2 -i https://mirrors.aliyun.com/pypi/simple/ --download /root/patroni_etcd_2.1.1
or
[root@yum ~]# wget -P /root/patroni_etcd_2.1.1 https://mirrors.aliyun.com/pypi/packages/14/65/223a5b4146b1d5d5ab66f16ef194916a1ed9720da1f118d7bfb60b8f2bea/psycopg2-binary-2.9.1.tar.gz
[root@yum ~]# wget -P /root/patroni_etcd_2.1.1 https://mirrors.aliyun.com/pypi/packages/aa/8a/7c80e7e44fb1b4277e89bd9ca509aefdd4dd1b2c547c6f293afe9f7ffd04/psycopg2-2.9.1.tar.gz
[root@yum ~]# ll /root/patroni_etcd_2.1.1
total 3308
-rw-r--r-- 1 root root 97516 Nov 1 05:13 click-8.0.3-py3-none-any.whl
-rw-r--r-- 1 root root 241997 Nov 1 05:13 dnspython-2.1.0-py3-none-any.whl
-rw-r--r-- 1 root root 17798 Nov 1 05:13 importlib_metadata-4.8.1-py3-none-any.whl
-rw-r--r-- 1 root root 219752 Nov 1 05:13 patroni-2.1.1-py3-none-any.whl
-rw-r--r-- 1 root root 23591 Nov 1 05:13 prettytable-2.2.1-py3-none-any.whl
-rw-r--r-- 1 root root 470886 Nov 1 05:13 psutil-5.8.0.tar.gz
-rw-r--r-- 1 root root 379972 Nov 1 05:19 psycopg2-2.9.1.tar.gz
-rw-r--r-- 1 root root 380033 Nov 1 05:17 psycopg2-binary-2.9.1.tar.gz
-rw-r--r-- 1 root root 247702 Nov 1 05:13 python_dateutil-2.8.2-py2.py3-none-any.whl
-rw-r--r-- 1 root root 37270 Nov 1 05:13 python-etcd-0.4.5.tar.gz
-rw-r--r-- 1 root root 603963 Nov 1 05:13 PyYAML-6.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl
-rw-r--r-- 1 root root 11053 Nov 1 05:13 six-1.16.0-py2.py3-none-any.whl
-rw-r--r-- 1 root root 26288 Nov 1 05:13 typing_extensions-3.10.0.2-py3-none-any.whl
-rw-r--r-- 1 root root 138764 Nov 1 05:13 urllib3-1.26.7-py2.py3-none-any.whl
-rw-r--r-- 1 root root 30763 Nov 1 05:13 wcwidth-0.2.5-py2.py3-none-any.whl
-rw-r--r-- 1 root root 42808 Nov 1 05:13 ydiff-1.2.tar.gz
-rw-r--r-- 1 root root 5313 Nov 1 05:13 zipp-3.6.0-py3-none-any.whl
# create the tarball
[root@yum ~]# tar -zcvf patroni_etcd_2.1.1.tar.gz patroni_etcd_2.1.1
PostgreSQL HA Test Series: Offline Deployment of Patroni + etcd + HAProxy + Keepalived (Part 2)
Chapter 2: PostgreSQL + replication deployment
1. On all nodes, disable the firewall, NetworkManager, and SELinux. If a firewall is required, enable it only after setup is complete and open the ports listed in Chapter 1.
systemctl stop firewalld.service
systemctl disable firewalld.service
systemctl stop NetworkManager.service
systemctl disable NetworkManager.service
sed -i "s/SELINUX=enforcing/SELINUX=disabled/g" /etc/selinux/config
setenforce 0
2. Set the hostname on every node
hostnamectl set-hostname pgtest1
3. Set the time on every node; keep time and time zone in sync across nodes, ideally against a time server
timedatectl
timedatectl list-timezones
timedatectl set-timezone Asia/Shanghai
date -s "20211001 00:00:00"
ntpdate time.windows.com && hwclock -w
4. On all nodes, install the packages PostgreSQL needs
# mount the OS installation image
mkdir /media/cdrom
mount /dev/cdrom /media/cdrom
# configure YUM
mkdir /etc/yum.repos.d/bak
mv /etc/yum.repos.d/*.repo /etc/yum.repos.d/bak
cat >> /etc/yum.repos.d/enmo.repo << EOF
[Server]
name=Server
baseurl=file:///media/cdrom
enabled=yes
gpgcheck=0
EOF
# install with yum
yum -y install wget flex libselinux-devel readline-devel zlib zlib-devel openssl-devel pam-devel libxml2-devel libxslt-devel python python-devel tcl-devel systemd-devel pcre-devel gcc gcc-c++ make tree psmisc
yum -y groupinstall "Development Tools"
5. Configure /etc/hosts on all nodes
cat >> /etc/hosts << EOF
192.168.58.10 pgtest1
192.168.58.11 pgtest2
192.168.58.12 pgtest3
EOF
6. Create the user and group on all nodes
groupadd -g 5432 postgres
useradd -u 5432 -g postgres postgres
echo postgres | passwd --stdin postgres
7. Create the directories on all nodes
mkdir -p /enmo/soft # software packages
mkdir -p /enmo/app/pg13/13.3 # PG_HOME
mkdir /enmo/pgdata # PGDATA
mkdir /enmo/pgwal # WAL files
mkdir /enmo/pgarch # archived WAL files
mkdir /enmo/app/pglog # PostgreSQL server log files
chown -R postgres:postgres /enmo
chmod 0700 /enmo/pgdata /enmo/pgwal /enmo/pgarch
# symlink the install directory to ease future software upgrades
ln -s /enmo/app/pg13 /enmo/app/pgsql
8. Configure kernel parameters on all nodes
cat >> /etc/sysctl.conf << EOF
#for postgres db 13.3
kernel.shmall = 966327 # expr `free |grep Mem|awk '{print $2 *1024}'` / `getconf PAGE_SIZE`
kernel.shmmax = 3958075392 # free |grep Mem|awk '{print $2 *1024}'
kernel.shmmni = 4096
kernel.sem = 50100 64128000 50100 1280
fs.file-max = 76724200
net.ipv4.ip_local_port_range = 9000 65000
net.core.rmem_default = 1048576
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 4096
net.core.netdev_max_backlog = 10000
vm.overcommit_memory = 0
fs.aio-max-nr = 40960000
net.ipv4.tcp_timestamps = 0
vm.dirty_ratio=20
vm.dirty_background_ratio=3
vm.dirty_writeback_centisecs=100
vm.dirty_expire_centisecs=500
vm.swappiness=10
vm.min_free_kbytes=524288
EOF
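The shmall/shmmax comments in the block above encode a simple rule: shmmax is the total RAM in bytes (the `free` figure × 1024), and shmall is that divided by the page size. A minimal sketch of the same arithmetic; the memory figure below is back-derived from the values above, purely for illustration:

```python
# Reproduce the arithmetic behind kernel.shmmax / kernel.shmall above:
# shmmax = total RAM in bytes, shmall = shmmax / page size (in pages).
def shm_params(mem_kb: int, page_size: int = 4096) -> dict:
    mem_bytes = mem_kb * 1024  # `free` reports kB
    return {
        "kernel.shmmax": mem_bytes,               # largest single segment, bytes
        "kernel.shmall": mem_bytes // page_size,  # total shared memory, pages
    }

# ~3.7 GiB of RAM, matching the 3958075392 / 966327 values used above
print(shm_params(mem_kb=3865308))
```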
9. Configure resource limits on all nodes
cat >> /etc/security/limits.conf << EOF
#for postgres db 13.3
* soft nofile 1048576
* hard nofile 1048576
* soft nproc unlimited
* hard nproc unlimited
* soft core unlimited
* hard core unlimited
* soft memlock unlimited
* hard memlock unlimited
EOF
cat >> /etc/pam.d/login << EOF
#for postgres db 13.3
session required pam_limits.so
EOF
10. Configure environment variables on all nodes
cat >> /etc/profile << EOF
#for postgres db 13.3
export LANG=en_US.utf8
export PGHOME=/enmo/app/pgsql/13.3/
export PGUSER=postgres
export PGPORT=5432
export PGDATA=/enmo/pgdata
export PATH=\$PGHOME/bin:\$PATH:\$HOME/bin
export LD_LIBRARY_PATH=\$PGHOME/lib:\$LD_LIBRARY_PATH
EOF
source /etc/profile # apply the new variables
11. On all nodes, unpack the PostgreSQL source and build and install it
Source downloads are on the official site: https://www.postgresql.org/
# unpack the archives
cd /enmo/soft
[root@pgtest1 soft]# tar -xvf PostgreSQL_HA.tar.gz
[root@pgtest1 soft]# tar -xzvf postgresql-13.3.tar.gz
# compile and install
cd /enmo/soft/postgresql-13.3
./configure --prefix=/enmo/app/pgsql/13.3 # --with-pgport=6000
make -j 8 && make install
# build and install the contrib modules
cd /enmo/soft/postgresql-13.3/contrib
make -j 8 && make install
# check the version to confirm the install
postgres --version # postgres (PostgreSQL) 13.3
12. Initialize the database on the primary node
su - postgres
$ initdb --pgdata=/enmo/pgdata --waldir=/enmo/pgwal --encoding=UTF8 --allow-group-access --data-checksums --username=postgres --pwprompt --wal-segsize=32
# since PG 11, initdb can set the WAL segment size: --wal-segsize=32
13. Configure database parameters on the primary node
[postgres@pgtest1 ~]$ vi $PGDATA/postgresql.auto.conf
log_destination='stderr'
logging_collector=on
log_directory = '/enmo/app/pglog'
log_filename='postgresql-%Y-%m-%d.log'
log_duration=on
log_error_verbosity = default
log_line_prefix = '%m'
log_statement = 'all'
log_file_mode=0600
log_truncate_on_rotation=on
log_rotation_age=1d
log_rotation_size=0
log_checkpoints=on
log_lock_waits=on
log_min_duration_statement=500ms
log_min_messages=warning
idle_in_transaction_session_timeout=300000
autovacuum = 'on'
autovacuum_max_workers = 3
full_page_writes = 'on'
log_autovacuum_min_duration = -1
seq_page_cost = 1
# https://pgtune.leopard.in.ua/#/
superuser_reserved_connections = 10
shared_buffers = 1GB
effective_cache_size = 3GB
maintenance_work_mem = 256MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 1048kB
min_wal_size = 2GB
max_wal_size = 8GB
max_parallel_workers_per_gather = 1
max_parallel_workers = 2
max_parallel_maintenance_workers = 1
# the parameters below will be managed by patroni; they can be removed once patroni is configured later
max_connections = 3000
superuser_reserved_connections = 100
max_locks_per_transaction = 64
max_worker_processes = 8
max_prepared_transactions = 0
wal_level = 'logical'
wal_log_hints = 'on'
track_commit_timestamp = 'off'
max_wal_senders = 10
max_replication_slots = 10
wal_keep_size = '4096MB' # wal_keep_size = wal_keep_segments * wal_segment_size = 128 * 32MB
hot_standby = 'on'
listen_addresses = '*'
port = 5432
cluster_name = 'pg_cluster'
archive_mode = 'on'
archive_command = 'cp %p /enmo/pgarch/%f'
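The wal_keep_size comment above is simple arithmetic; a quick sketch double-checking it:

```python
# Check the wal_keep_size comment: wal_keep_size (PG 13+) replaces the old
# wal_keep_segments, so wal_keep_size = wal_keep_segments * wal_segment_size.
wal_segment_size_mb = 32  # set via initdb --wal-segsize=32 in step 12
wal_keep_segments = 128   # the pre-PG13 segment count the comment refers to
wal_keep_size_mb = wal_keep_segments * wal_segment_size_mb
print(f"wal_keep_size = '{wal_keep_size_mb}MB'")  # wal_keep_size = '4096MB'
```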
14. Configure the client access policy (pg_hba.conf) on the primary node
vi $PGDATA/pg_hba.conf
# add the following entry:
host all all 192.168.58.0/24 md5
15. Create a systemd service for PostgreSQL on the primary node and start the database
[root@pgtest1 ~]# vi /usr/lib/systemd/system/postgres-13.service
[Unit]
Description=PostgreSQL 13 database server
After=syslog.target network.target
[Service]
Type=forking
TimeoutSec=120
User=postgres
Environment="PGHOME=/enmo/app/pgsql/13.3"
Environment="PGDATA=/enmo/pgdata"
Environment="PGPORT=5432"
Environment="LD_LIBRARY_PATH=/enmo/app/pgsql/13.3/lib:/usr/local/lib:/usr/local/lib64:/usr/lib64"
ExecStart=/bin/bash -c '${PGHOME}/bin/pg_ctl start -w -D ${PGDATA} -l /enmo/app/pglog/startup.log'
ExecStop=/bin/bash -c '${PGHOME}/bin/pg_ctl stop -m fast -w -D ${PGDATA}'
ExecReload=/bin/bash -c '${PGHOME}/bin/pg_ctl reload -D ${PGDATA}'
[Install]
WantedBy=multi-user.target
[root@pgtest1 ~]# systemctl daemon-reload
[root@pgtest1 ~]# systemctl enable postgres-13.service
[root@pgtest1 ~]# systemctl start postgres-13.service
# confirm it is listening
[root@pgtest1 ~]# netstat -nltp|grep 5432
16. Install pg_stat_statements on the primary node
# compile and install; step 11 already built all contrib modules, so this can be skipped
[root@pgtest1 ~]# cd /enmo/soft/postgresql-13.3/contrib/pg_stat_statements
[root@pgtest1 pg_stat_statements]# make && make install
# configure parameters
[postgres@pgtest1 ~]$ vi $PGDATA/postgresql.auto.conf
# pg_stat_statements
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 1000
pg_stat_statements.track = top
pg_stat_statements.track_utility = true
pg_stat_statements.save = true
# restart the postgres-13 service
[root@pgtest1 ~]# systemctl restart postgres-13.service
# connect to PostgreSQL and install the extension
[root@pgtest1 ~]# psql
postgres=# create extension pg_stat_statements;
# confirm the extension works
postgres=# select * from pg_stat_statements;
17. Deploy streaming replication to the two standbys
# create a replication user on the primary
postgres=# create user replica WITH REPLICATION ENCRYPTED PASSWORD 'replica';
# allow replication connections on the primary
[root@pgtest1 ~]# vi $PGDATA/pg_hba.conf
host replication replica 192.168.58.0/24 md5
[root@pgtest1 ~]# systemctl reload postgres-13.service
# recommended: configure the password file on all nodes
[root@pgtest2 ~]# su - postgres
[postgres@pgtest2 ~]$ cat >> ~/.pgpass << EOF
# hostname:port:database:username:password
192.168.58.10:5432:replication:replica:replica
192.168.58.11:5432:replication:replica:replica
192.168.58.12:5432:replication:replica:replica
EOF
[postgres@pgtest1 ~]$ chmod 0600 .pgpass
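How libpq uses the .pgpass file above: the first line whose host, port, database, and username fields all match the connection supplies the password. A simplified sketch of that matching, my own illustration (real libpq additionally supports backslash-escaping of `:` and `\`):

```python
# Simplified .pgpass lookup: first line whose host:port:database:username
# fields all match wins; '*' acts as a wildcard. split(":", 4) keeps any
# extra colons inside the password intact.
def lookup_pgpass(lines, host, port, database, user):
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        h, p, d, u, password = line.split(":", 4)
        if all(pat in ("*", val) for pat, val in
               ((h, host), (p, str(port)), (d, database), (u, user))):
            return password
    return None

pgpass = [
    "# hostname:port:database:username:password",
    "192.168.58.10:5432:replication:replica:replica",
    "192.168.58.11:5432:replication:replica:replica",
]
print(lookup_pgpass(pgpass, "192.168.58.10", 5432, "replication", "replica"))  # replica
```

Note the `replication` database field: streaming-replication connections match the literal string `replication`, which is why the entries above use it rather than a real database name.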
# on every standby node, run pg_basebackup to initialize the data directory
[root@pgtest2 ~]# su - postgres
[postgres@pgtest2 ~]$ pg_basebackup -h 192.168.58.10 -p 5432 -U replica -D $PGDATA -Fp -P -X stream -R -v -l replica_20211016
# on every standby node, create the same postgres-13.service as on the primary, then start all standbys
[root@pgtest2 ~]# systemctl daemon-reload
[root@pgtest2 ~]# systemctl enable postgres-13.service
[root@pgtest2 ~]# systemctl start postgres-13.service
# confirm it is listening
[root@pgtest2 ~]# netstat -nltp|grep 5432
18. Test primary/standby synchronization
# on the primary, check the replication status
postgres=# select pid,state,client_addr,sync_priority,sync_state from pg_stat_replication;
pid | state | client_addr | sync_priority | sync_state
-------+-----------+---------------+---------------+------------
22100 | streaming | 192.168.58.11 | 0 | async
22101 | streaming | 192.168.58.12 | 0 | async
# on the primary, create a test table and insert a row
postgres=# create table test_1 (id int4,create_time timestamp(0) without time zone);
postgres=# insert into test_1 values (1,now());
# on a standby, query the test table to confirm replication works
postgres=# select * from test_1;
That completes the PostgreSQL + replication deployment.
19. Extra: how to tell a PostgreSQL primary from a standby
# 1. via pg_controldata output:
[root@pgtest1 ~]# pg_controldata
Database cluster state: in production # primary
Database cluster state: in archive recovery # standby
# 2. via the pg_stat_replication view: the primary returns rows, a standby returns none
postgres=# select pid,state,client_addr,sync_priority,sync_state from pg_stat_replication;
pid | state | client_addr | sync_priority | sync_state
-------+-----------+---------------+---------------+------------
22100 | streaming | 192.168.58.11 | 0 | async
22101 | streaming | 192.168.58.12 | 0 | async
# 3. via the WAL processes: walsender runs on the primary, walreceiver on a standby
[root@pgtest1 ~]# ps -ef |grep wal
postgres 11047 11042 0 00:19 ? 00:00:01 postgres: pg_cluster: walwriter
postgres 12686 11042 0 17:17 ? 00:00:00 postgres: pg_cluster: walsender replica 192.168.58.12(31644) streaming 0/18000250
postgres 13166 11042 0 17:54 ? 00:00:00 postgres: pg_cluster: walsender replica 192.168.58.11(44964) streaming 0/18000250
[root@pgtest2 ~]# ps -ef |grep wal
postgres 11494 11488 0 17:54 ? 00:00:00 postgres: pg_cluster: walreceiver streaming 0/18000250
# 4. via the built-in function: select pg_is_in_recovery();
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f # primary
t # standby
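Check 1 above is easy to script; a small helper that parses pg_controldata output (the helper is my own sketch, not from the original):

```python
# Map the "Database cluster state" line of `pg_controldata` output to a
# role, per check 1 above: "in production" -> primary,
# "in archive recovery" -> standby.
def role_from_controldata(output: str) -> str:
    for line in output.splitlines():
        if line.startswith("Database cluster state:"):
            state = line.split(":", 1)[1].strip()
            return {"in production": "primary",
                    "in archive recovery": "standby"}.get(state, f"unknown ({state})")
    raise ValueError("no 'Database cluster state' line found")

print(role_from_controldata("Database cluster state:   in production"))  # primary
```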
PostgreSQL HA Test Series: Offline Deployment of Patroni + etcd + HAProxy + Keepalived (Part 3)
Chapter 3: Etcd deployment and management
Etcd download: https://github.com/etcd-io/etcd file: etcd-v3.5.1-linux-amd64.tar.gz
Etcd official docs: https://etcd.io/docs/v3.5/
- Etcd is a distributed key-value store built on the Raft consensus algorithm and written in Go. Patroni monitors the local PostgreSQL instance and writes its state into Etcd; every Patroni instance can read and write keys in Etcd and thereby learn about the PostgreSQL instances on the other nodes.
- Run an odd number of Etcd members, ideally three or five and no more than seven: every write must be replicated to more machines as the cluster grows, which hurts write performance. If possible, spread the members across different failure domains.
- A standard 3-node etcd cluster tolerates at most 1 member failure; a 5-node cluster tolerates at most 2.
- etcd persists data to disk, so its performance depends heavily on disk performance; SSDs are recommended.
- etcd is most battle-tested in production on amd64 Linux.
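The failure-tolerance figures above follow directly from Raft majorities; a quick sketch of the arithmetic:

```python
# Raft majority math behind the sizing advice above: an n-member cluster
# needs a quorum of n // 2 + 1 members, so it tolerates n - quorum failures.
def quorum(n: int) -> int:
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    return n - quorum(n)

for n in (3, 4, 5, 7):
    print(f"{n} members: quorum {quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
# An even size buys nothing: 4 members tolerate 1 failure, same as 3,
# which is one reason odd cluster sizes are recommended.
```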
1. On all nodes, unpack etcd and configure environment variables
# unpack and create a symlink
[root@pgtest1 ~]# tar -zxvf /enmo/soft/etcd-v3.5.1-linux-amd64.tar.gz -C /enmo/app
[root@pgtest1 ~]# mv /enmo/app/etcd-v3.5.1-linux-amd64 /enmo/app/etcd-v3.5.1
[root@pgtest1 ~]# ln -s /enmo/app/etcd-v3.5.1 /enmo/app/etcd
# configure environment variables
[root@pgtest1 ~]# sed -i "s;:\$PATH:;:/enmo/app/etcd:\$PATH:;g" /etc/profile
[root@pgtest1 ~]# source /etc/profile
2. Create the etcd startup script on all nodes
# startup script on the primary node
[root@pgtest1 ~]# vi /enmo/app/etcd/start_etcd.sh
/enmo/app/etcd/etcd --data-dir=data.etcd \
--name etcd_pgtest01 \
--listen-peer-urls http://192.168.58.10:2380 \
--listen-client-urls http://192.168.58.10:2379,http://127.0.0.1:2379 \
--initial-advertise-peer-urls http://192.168.58.10:2380 \
--advertise-client-urls http://192.168.58.10:2379 \
--initial-cluster-token etcd-cluster-pgtest \
--initial-cluster etcd_pgtest01=http://192.168.58.10:2380,etcd_pgtest02=http://192.168.58.11:2380,etcd_pgtest03=http://192.168.58.12:2380 \
--initial-cluster-state new \
--enable-v2
# startup script on standby node 1
[root@pgtest2 ~]# vi /enmo/app/etcd/start_etcd.sh
/enmo/app/etcd/etcd --data-dir=data.etcd \
--name etcd_pgtest02 \
--listen-peer-urls http://192.168.58.11:2380 \
--listen-client-urls http://192.168.58.11:2379,http://127.0.0.1:2379 \
--initial-advertise-peer-urls http://192.168.58.11:2380 \
--advertise-client-urls http://192.168.58.11:2379 \
--initial-cluster-token etcd-cluster-pgtest \
--initial-cluster etcd_pgtest01=http://192.168.58.10:2380,etcd_pgtest02=http://192.168.58.11:2380,etcd_pgtest03=http://192.168.58.12:2380 \
--initial-cluster-state new \
--enable-v2
# startup script on standby node 2
[root@pgtest3 ~]# vi /enmo/app/etcd/start_etcd.sh
/enmo/app/etcd/etcd --data-dir=data.etcd \
--name etcd_pgtest03 \
--listen-peer-urls http://192.168.58.12:2380 \
--listen-client-urls http://192.168.58.12:2379,http://127.0.0.1:2379 \
--initial-advertise-peer-urls http://192.168.58.12:2380 \
--advertise-client-urls http://192.168.58.12:2379 \
--initial-cluster-token etcd-cluster-pgtest \
--initial-cluster etcd_pgtest01=http://192.168.58.10:2380,etcd_pgtest02=http://192.168.58.11:2380,etcd_pgtest03=http://192.168.58.12:2380 \
--initial-cluster-state new \
--enable-v2
- The URLs given in --initial-cluster are each member's --initial-advertise-peer-urls.
- It is strongly recommended to give every cluster a unique --initial-cluster-token; etcd then derives unique cluster and member IDs even when two clusters share an otherwise identical configuration.
- etcd accepts client traffic on listen-client-urls and announces the advertise-client-urls to the other members, proxies, and clients.
- Port 2379 serves clients; 2380 is for peer communication.
3. Make the startup script executable on all nodes
# chmod +x /enmo/app/etcd/start_etcd.sh
4. Create and start the etcd service on all nodes
# create the unit file
# vi /usr/lib/systemd/system/etcd.service
[Unit]
Description=etcd
After=network.target remote-fs.target nss-lookup.target
[Service]
Type=forking
ExecStart=/bin/bash -c "/enmo/app/etcd/start_etcd.sh >> /enmo/app/etcd/start_etcd.log 2>&1 &"
ExecStop=/usr/bin/killall etcd
[Install]
WantedBy=multi-user.target
# start the service
systemctl daemon-reload
systemctl start etcd.service
systemctl enable etcd.service
5. Check the etcd cluster status
# status of every cluster endpoint
[root@pgtest1 ~]# etcdctl endpoint status --cluster -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.58.11:2379 | 3f414532c235ce16 | 3.5.1 | 20 kB | false | false | 4 | 16 | 16 | |
| http://192.168.58.12:2379 | 41cf8a739c2e9b50 | 3.5.1 | 20 kB | false | false | 4 | 16 | 16 | |
| http://192.168.58.10:2379 | caef4208a95efee8 | 3.5.1 | 25 kB | true | false | 4 | 16 | 16 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
# list all members of the cluster
[root@pgtest1 ~]# etcdctl member list -w table
+------------------+---------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+---------------------------+---------------------------+------------+
| 3f414532c235ce16 | started | etcd_pgtest02 | http://192.168.58.11:2380 | http://192.168.58.11:2379 | false |
| 41cf8a739c2e9b50 | started | etcd_pgtest03 | http://192.168.58.12:2380 | http://192.168.58.12:2379 | false |
| caef4208a95efee8 | started | etcd_pgtest01 | http://192.168.58.10:2380 | http://192.168.58.10:2379 | false |
+------------------+---------+---------------+---------------------------+---------------------------+------------+
# check the health of every endpoint
[root@pgtest1 ~]# etcdctl endpoint health --cluster -w table
+---------------------------+--------+-------------+-------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+---------------------------+--------+-------------+-------+
| http://192.168.58.10:2379 | true | 21.945446ms | |
| http://192.168.58.12:2379 | true | 22.555558ms | |
| http://192.168.58.11:2379 | true | 13.002743ms | |
+---------------------------+--------+-------------+-------+
That completes the etcd deployment; the rest of this chapter covers etcd administration and is optional reading.
6. Managing etcd: etcdctl is the command-line client for talking to the etcd server
6.1 Show the etcdctl help
[root@pgtest1 ~]# etcdctl --help
6.2 Transfer leadership to another etcd cluster member; run this on the leader node
[root@pgtest1 ~]# etcdctl endpoint status --cluster -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.58.11:2379 | 3f414532c235ce16 | 3.5.1 | 20 kB | false | false | 4 | 16 | 16 | |
| http://192.168.58.12:2379 | 41cf8a739c2e9b50 | 3.5.1 | 20 kB | false | false | 4 | 16 | 16 | |
| http://192.168.58.10:2379 | caef4208a95efee8 | 3.5.1 | 25 kB | true | false | 4 | 16 | 16 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
# 192.168.58.10 is currently the leader; to move leadership to 192.168.58.11, run the following on the leader (192.168.58.10)
[root@pgtest1 ~]# etcdctl move-leader 3f414532c235ce16
Leadership transferred from caef4208a95efee8 to 3f414532c235ce16
[root@pgtest1 ~]# etcdctl endpoint status --cluster -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.58.11:2379 | 3f414532c235ce16 | 3.5.1 | 20 kB | true | false | 5 | 20 | 20 | |
| http://192.168.58.12:2379 | 41cf8a739c2e9b50 | 3.5.1 | 20 kB | false | false | 5 | 20 | 20 | |
| http://192.168.58.10:2379 | caef4208a95efee8 | 3.5.1 | 25 kB | false | false | 5 | 20 | 20 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
6.3 Save a database snapshot (backup); run on one node only
[root@pgtest1 ~]# etcdctl snapshot save etcd_bak.db
{"level":"info","ts":1635648427.3933637,"caller":"snapshot/v3_snapshot.go:68","msg":"created temporary db file","path":"etcd_bak.db.part"}
{"level":"info","ts":1635648427.3941739,"logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1635648427.3941944,"caller":"snapshot/v3_snapshot.go:76","msg":"fetching snapshot","endpoint":"127.0.0.1:2379"}
{"level":"info","ts":1635648427.3956425,"logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
{"level":"info","ts":1635648427.396033,"caller":"snapshot/v3_snapshot.go:91","msg":"fetched snapshot","endpoint":"127.0.0.1:2379","size":"25 kB","took":"now"}
{"level":"info","ts":1635648427.3960974,"caller":"snapshot/v3_snapshot.go:100","msg":"saved","path":"etcd_bak.db"}
Snapshot saved at etcd_bak.db
# inspect the snapshot
[root@pgtest1 ~]# etcdctl snapshot status etcd_bak.db -w table
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 230fea56 | 0 | 8 | 25 kB |
+----------+----------+------------+------------+
6.4 Updating an existing member
1. Changing the advertise client URLs
Edit the member's startup script /enmo/app/etcd/start_etcd.sh, set --advertise-client-urls to the new value, and restart that member's etcd (systemctl restart etcd.service).
The restarted member publishes the updated URLs on its own. An incorrectly updated client URL does not affect the health of the etcd cluster.
2. Changing the advertise peer URLs
To change an existing member's advertise peer URLs, first update it with `etcdctl member update`, then restart the member.
The `etcdctl member update` step is required because changing a peer URL is a cluster-wide configuration change and can affect cluster health.
First, find the target member's ID by listing all members with etcdctl:
[root@pgtest1 ~]# etcdctl member list -w table
+------------------+---------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+---------------------------+---------------------------+------------+
| 3f414532c235ce16 | started | etcd_pgtest02 | http://192.168.58.11:2380 | http://192.168.58.11:2379 | false |
| 41cf8a739c2e9b50 | started | etcd_pgtest03 | http://192.168.58.12:2380 | http://192.168.58.12:2379 | false |
| caef4208a95efee8 | started | etcd_pgtest01 | http://192.168.58.10:2380 | http://192.168.58.10:2379 | false |
+------------------+---------+---------------+---------------------------+---------------------------+------------+
In this example we update member caef4208a95efee8 and change its peer URL to http://10.0.1.10:2380.
[root@pgtest1 ~]# etcdctl member update caef4208a95efee8 http://10.0.1.10:2380
Updated member with ID caef4208a95efee8 in cluster
Finally, edit the member's startup script /enmo/app/etcd/start_etcd.sh to match and restart its etcd (systemctl restart etcd.service).
6.5 Removing a member
Suppose the member to remove has ID caef4208a95efee8:
[root@pgtest1 ~]# etcdctl member remove caef4208a95efee8
Removed member caef4208a95efee8 from cluster
It is safe to remove the leader, but the cluster is inactive while a new leader is elected, typically for the election timeout plus the duration of the voting round.
6.6 Adding a new member
Adding a member takes two steps:
- add the new member to the cluster with `etcdctl member add`;
- start the new member with the new configuration, including the updated member list (the previous members plus the new one).
Add the new member by passing its name and advertised peer URLs to etcdctl:
[root@pgtest1 ~]# etcdctl member add etcd_pgtest04 http://192.168.58.13:2380
added member 9bf1b35fc7761a23 to cluster
ETCD_NAME="etcd_pgtest04"
ETCD_INITIAL_CLUSTER="etcd_pgtest01=http://192.168.58.10:2380,etcd_pgtest02=http://192.168.58.11:2380,etcd_pgtest03=http://192.168.58.12:2380,etcd_pgtest04=http://192.168.58.13:2380"
ETCD_INITIAL_CLUSTER_STATE=existing
Warning: if the new member's disk still holds an old data directory, delete it before starting (--data-dir defaults to ${name}.etcd).
Create the startup script on the new member:
[root@pgtest4 ~]# vi /enmo/app/etcd/start_etcd.sh
/enmo/app/etcd/etcd --data-dir=data.etcd \
--name etcd_pgtest04 \
--listen-peer-urls http://192.168.58.13:2380 \
--listen-client-urls http://192.168.58.13:2379,http://127.0.0.1:2379 \
--initial-advertise-peer-urls http://192.168.58.13:2380 \
--advertise-client-urls http://192.168.58.13:2379 \
--initial-cluster-token etcd-cluster-pgtest \
--initial-cluster etcd_pgtest01=http://192.168.58.10:2380,etcd_pgtest02=http://192.168.58.11:2380,etcd_pgtest03=http://192.168.58.12:2380,etcd_pgtest04=http://192.168.58.13:2380 \
--initial-cluster-state existing \
--enable-v2
Note that initial-cluster-state is set to existing.
Create the unit file and start the service:
[root@pgtest4 ~]# systemctl start etcd.service
The new member runs as part of the cluster and immediately starts catching up with the other members.
Best practice is to add one member at a time and verify it starts correctly before adding more. This step-by-step approach matters: a misconfigured new member (e.g. a wrong peer URL) still counts toward quorum even while it is unreachable by the existing members, so the cluster can lose quorum. Quorum loss can likewise occur during connectivity or operational problems.
6.7 For everything else, see the official documentation
Etcd官方文档: https://etcd.io/docs/v3.5/
7. etcd-browser, a web UI for etcd
Download: https://github.com/henszey/etcd-browser
Running etcd-browser also requires nodejs; download: https://nodejs.org/zh-cn/download/
# unpack etcd-browser
[root@pgtest3 ~]# unzip /enmo/soft/etcd-browser-master.zip -d /enmo/app/etcd
# unpack nodejs
[root@pgtest3 ~]# tar -xvf /enmo/soft/node-v16.13.0-linux-x64.tar.xz -C /enmo/app/etcd
[root@pgtest3 ~]# mv /enmo/app/etcd/node-v16.13.0-linux-x64 /enmo/app/etcd/nodejs
# edit the etcd-browser configuration
[root@pgtest3 ~]# vi /enmo/app/etcd/etcd-browser-master/server.js
var etcdHost = process.env.ETCD_HOST || '192.168.58.12';
var etcdPort = process.env.ETCD_PORT || 2379;
var serverPort = process.env.SERVER_PORT || 8000;
# start etcd-browser
[root@pgtest3 ~]# cd /enmo/app/etcd/etcd-browser-master
[root@pgtest3 etcd-browser-master]# /enmo/app/etcd/nodejs/bin/node server.js
proxy /api requests to etcd on 192.168.58.12:2379
etc-browser listening on port 8000
Open http://192.168.58.12:8000/ in Chrome.
PostgreSQL HA Test Series: Offline Deployment of Patroni + etcd + HAProxy + Keepalived (Part 4)
Chapter 4: Patroni deployment and management
- Patroni is a Python-based template: combined with a DCS (e.g. ZooKeeper, etcd, Consul), it lets you build a customized PostgreSQL HA solution.
- Patroni is not a turnkey PostgreSQL HA product; it involves a fair amount of configuration and customization.
- Patroni takes over starting and stopping the PostgreSQL instance, monitors the local database, and writes its state into the DCS.
- Primary and standby roles are decided by the leader key: the Patroni instance that holds the leader key runs the primary; all others are standbys.
- Patroni supports cascading replication, synchronous and asynchronous modes, failover, switchover, cluster reinitialization, and more.
- Patroni official docs: https://patroni.readthedocs.io/en/latest/index.html
1. 所有节点安装python3
Patroni 基于 Python 开发的模板,需要运行在 Python 环境下。
# Unpack the Python 3 RPMs
[root@pgtest1 ~]# cd /enmo/soft
[root@pgtest1 soft]# tar -zxvf python3-rpm.tar.gz
# Configure the yum repository
[root@pgtest1 ~]# cat > /etc/yum.repos.d/enmo.repo << EOF
[Server]
name=Server
baseurl=file:///media/cdrom
enabled=yes
gpgcheck=0
[python3]
name=python3
baseurl=file:///enmo/soft/python3-rpm
enabled=yes
gpgcheck=0
EOF
# Install via yum
[root@pgtest1 ~]# yum clean all
[root@pgtest1 ~]# yum install python3 python3-devel -y
# Update the python symlink
[root@pgtest1 ~]# rm -f /usr/bin/python
[root@pgtest1 ~]# ln -s /usr/bin/python3 /usr/bin/python
# Check the version to confirm the installation
[root@pgtest1 ~]# python -V
Python 3.6.8
# Whole-word replacement: with /usr/bin/python now pointing at Python 3, 'yum' breaks, so point these scripts back at python2
[root@pgtest1 ~]# sed -i "s:\<python\>:python2:g" /usr/bin/yum
[root@pgtest1 ~]# sed -i "s:\<python\>:python2:g" /usr/libexec/urlgrabber-ext-down
2. Install Patroni with pip3 on all nodes
Online installation (with internet access):
# pip3 install patroni[etcd] -i https://mirrors.aliyun.com/pypi/simple/
# pip3 install psycopg2-binary -i https://mirrors.aliyun.com/pypi/simple/
# pip3 install psycopg2 -i https://mirrors.aliyun.com/pypi/simple/
The steps below use offline installation.
Package downloads; the PostgreSQL_HA.tar.gz bundle on the network drive already contains all of the following packages (patroni_etcd_2.1.1.tar.gz):
psutil-5.8.0.tar.gz https://mirrors.aliyun.com/pypi/simple/psutil/
ydiff-1.2.tar.gz https://mirrors.aliyun.com/pypi/simple/ydiff/
click-8.0.3-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/click/
dnspython-2.1.0-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/dnspython/
importlib_metadata-4.8.1-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/importlib-metadata/
patroni-2.1.1-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/patroni/
prettytable-2.2.1-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/prettytable/
python_dateutil-2.8.2-py2.py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/python-dateutil/
PyYAML-6.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl https://mirrors.aliyun.com/pypi/simple/pyyaml/
six-1.16.0-py2.py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/six/
typing_extensions-3.10.0.2-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/typing-extensions/
urllib3-1.26.7-py2.py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/urllib3/
wcwidth-0.2.5-py2.py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/wcwidth/
zipp-3.6.0-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/zipp/
python-etcd-0.4.5.tar.gz https://mirrors.aliyun.com/pypi/simple/python-etcd/
psycopg2-binary-2.9.1.tar.gz https://mirrors.aliyun.com/pypi/simple/psycopg2-binary/
psycopg2-2.9.1.tar.gz https://mirrors.aliyun.com/pypi/simple/psycopg2/
Unpack the archive and install Patroni with pip3
[root@pgtest1 ~]# cd /enmo/soft
[root@pgtest1 soft]# tar -zxvf patroni_etcd_2.1.1.tar.gz
[root@pgtest1 soft]# cd patroni_etcd_2.1.1
# Install in this order
[root@pgtest1 patroni_etcd_2.1.1]# pip3 install psutil-5.8.0.tar.gz
[root@pgtest1 patroni_etcd_2.1.1]# pip3 install ydiff-1.2.tar.gz
[root@pgtest1 patroni_etcd_2.1.1]# pip3 install *.whl
[root@pgtest1 patroni_etcd_2.1.1]# pip3 install python-etcd-0.4.5.tar.gz
[root@pgtest1 patroni_etcd_2.1.1]# pip3 install psycopg2-binary-2.9.1.tar.gz
[root@pgtest1 patroni_etcd_2.1.1]# pip3 install psycopg2-2.9.1.tar.gz
# List the installed packages
[root@pgtest3 patroni_etcd_2.1.1]# pip3 list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
click (8.0.3)
dnspython (2.1.0)
importlib-metadata (4.8.1)
patroni (2.1.1)
pip (9.0.3)
prettytable (2.2.1)
psutil (5.8.0)
psycopg2 (2.9.1)
psycopg2-binary (2.9.1)
python-dateutil (2.8.2)
python-etcd (0.4.5)
PyYAML (6.0)
setuptools (39.2.0)
six (1.16.0)
typing-extensions (3.10.0.2)
urllib3 (1.26.7)
wcwidth (0.2.5)
ydiff (1.2)
zipp (3.6.0)
3. Configure the Patroni parameter file on all nodes
3.1 Create the directory for the parameter file and log files
[root@pgtest1 ~]# mkdir /enmo/app/patroni
3.2 Create patroni_config.yml on the primary node
Mind the strict YAML formatting rules: indentation must use spaces, never the Tab key. Get the indentation wrong and the parameter configuration will fail in all sorts of ways.
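As a quick pre-flight check, a short Python helper (hypothetical; not part of Patroni) can flag lines whose indentation contains a Tab character:

```python
def find_tab_indented_lines(text: str):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        indent = line[:len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(lineno)
    return bad

# Line 3 below is indented with a tab instead of spaces:
sample = "log:\n  level: INFO\n\tdir: /enmo/app/patroni\n"
print(find_tab_indented_lines(sample))  # [3]
```

Run it over patroni_config.yml before starting the service to catch accidental tabs early.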
[root@pgtest1 ~]# vi /enmo/app/patroni/patroni_config.yml
scope: pg_cluster
namespace: /service
name: pgtest1
log:
  level: INFO
  traceback_level: ERROR
  dir: /enmo/app/patroni
  file_num: 10
  file_size: 104857600
restapi:
  listen: 192.168.58.10:8008
  connect_address: 192.168.58.10:8008
etcd:
  host: 192.168.58.10:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    max_timelines_history: 0
    master_start_timeout: 300
    master_stop_timeout: 0
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        max_connections: 3000
        superuser_reserved_connections: 100
        max_locks_per_transaction: 64
        max_worker_processes: 2
        max_prepared_transactions: 0
        wal_level: logical
        wal_log_hints: on
        track_commit_timestamp: off
        max_wal_senders: 10
        max_replication_slots: 10
        wal_keep_size: 4096MB # wal_keep_size = wal_keep_segments * wal_segment_size = 128 * 32MB
        hot_standby: "on"
        listen_addresses: "*"
        port: 5432
        cluster_name: "pg_cluster"
        archive_mode: on
        archive_command: "cp %p /enmo/pgarch/%f"
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.58.10:5432
  data_dir: /enmo/pgdata
  pgpass: /home/postgres/.pgpass
  pg_ctl_timeout: 60
  use_pg_rewind: true
  remove_data_directory_on_rewind_failure: false
  remove_data_directory_on_diverged_timelines: true
  authentication:
    replication:
      username: replica
      password: replica
    superuser:
      username: postgres
      password: postgres
tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
3.3 Create patroni_config.yml on standby node 1
[root@pgtest2 ~]# vi /enmo/app/patroni/patroni_config.yml
scope: pg_cluster
namespace: /service
name: pgtest2
log:
  level: INFO
  traceback_level: ERROR
  dir: /enmo/app/patroni
  file_num: 10
  file_size: 104857600
restapi:
  listen: 192.168.58.11:8008
  connect_address: 192.168.58.11:8008
etcd:
  host: 192.168.58.11:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    max_timelines_history: 0
    master_start_timeout: 300
    master_stop_timeout: 0
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        max_connections: 3000
        superuser_reserved_connections: 100
        max_locks_per_transaction: 64
        max_worker_processes: 2
        max_prepared_transactions: 0
        wal_level: logical
        wal_log_hints: on
        track_commit_timestamp: off
        max_wal_senders: 10
        max_replication_slots: 10
        wal_keep_size: 4096MB # wal_keep_size = wal_keep_segments * wal_segment_size = 128 * 32MB
        hot_standby: "on"
        listen_addresses: "*"
        port: 5432
        cluster_name: "pg_cluster"
        archive_mode: on
        archive_command: "cp %p /enmo/pgarch/%f"
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.58.11:5432
  data_dir: /enmo/pgdata
  pgpass: /home/postgres/.pgpass
  pg_ctl_timeout: 60
  use_pg_rewind: true
  remove_data_directory_on_rewind_failure: false
  remove_data_directory_on_diverged_timelines: true
  authentication:
    replication:
      username: replica
      password: replica
    superuser:
      username: postgres
      password: postgres
tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
3.4 Create patroni_config.yml on standby node 2
[root@pgtest3 ~]# vi /enmo/app/patroni/patroni_config.yml
scope: pg_cluster
namespace: /service
name: pgtest3
log:
  level: INFO
  traceback_level: ERROR
  dir: /enmo/app/patroni
  file_num: 10
  file_size: 104857600
restapi:
  listen: 192.168.58.12:8008
  connect_address: 192.168.58.12:8008
etcd:
  host: 192.168.58.12:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    max_timelines_history: 0
    master_start_timeout: 300
    master_stop_timeout: 0
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        max_connections: 3000
        superuser_reserved_connections: 100
        max_locks_per_transaction: 64
        max_worker_processes: 2
        max_prepared_transactions: 0
        wal_level: logical
        wal_log_hints: on
        track_commit_timestamp: off
        max_wal_senders: 10
        max_replication_slots: 10
        wal_keep_size: 4096MB # wal_keep_size = wal_keep_segments * wal_segment_size = 128 * 32MB
        hot_standby: "on"
        listen_addresses: "*"
        port: 5432
        cluster_name: "pg_cluster"
        archive_mode: on
        archive_command: "cp %p /enmo/pgarch/%f"
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.58.12:5432
  data_dir: /enmo/pgdata
  pgpass: /home/postgres/.pgpass
  pg_ctl_timeout: 60
  use_pg_rewind: true
  remove_data_directory_on_rewind_failure: false
  remove_data_directory_on_diverged_timelines: true
  authentication:
    replication:
      username: replica
      password: replica
    superuser:
      username: postgres
      password: postgres
tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
3.5 PostgreSQL parameters recommended to be managed by Patroni:
- Parameters that must hold the same value on the primary and standbys. For these, values set in the local Patroni configuration file or through environment variables have no effect; to change them, you must modify the shared configuration in the DCS.
max_connections: 3000
max_locks_per_transaction: 64
max_worker_processes: 2
max_prepared_transactions: 0
wal_level: logical
wal_log_hints: on
track_commit_timestamp: off
- For the parameters below, PostgreSQL does not require identical values on the primary and standbys. However, given that any standby can become the primary at any time, setting them to different values makes little practical sense, so Patroni places them in the dynamic configuration.
max_wal_senders: 10
max_replication_slots: 10
wal_keep_segments: 8
# PostgreSQL 13 redefined wal_keep_segments as wal_keep_size, which determines how much WAL is retained for standbys. Version 13 expresses it as a byte size rather than a number of retained files; convert with: wal_keep_size = wal_keep_segments * wal_segment_size
wal_keep_size: 4096MB
- A few other Postgres parameters are also controlled by Patroni:
listen_addresses: "*"
port: 5432
cluster_name: "pg_cluster"
hot_standby: "on"
For safety, the parameters in the list above are not written to postgresql.conf; instead they are passed to pg_ctl start as a command-line parameter list, which gives them the highest priority, higher even than ALTER SYSTEM.
Parameter file precedence:
- The parameters would be applied in the following order (run-time are given the highest priority):
- load parameters from file postgresql.base.conf (or from a custom_conf file, if set)
- load parameters from file postgresql.conf
- load parameters from file postgresql.auto.conf
- run-time parameters passed using -o --name=value
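The wal_keep_segments to wal_keep_size conversion used in the configs above is simple arithmetic:

```python
# PostgreSQL 13 replaced wal_keep_segments (a file count) with
# wal_keep_size (a byte size): wal_keep_size = wal_keep_segments * wal_segment_size.
wal_segment_size_mb = 32   # this cluster was initdb'ed with --wal-segsize=32
wal_keep_segments = 128    # equivalent number of retained WAL files
wal_keep_size_mb = wal_keep_segments * wal_segment_size_mb
print(f"wal_keep_size = {wal_keep_size_mb}MB")  # wal_keep_size = 4096MB
```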
3.6 Fix directory ownership
# chown -R postgres.postgres /enmo/app/patroni
3.7 Parameter reference
See the official documentation: https://patroni.readthedocs.io/en/latest/SETTINGS.html
[root@pgtest1 patroni]# cat patroni_config.yml
scope: pg_cluster # cluster name
namespace: /service # path under which Patroni stores information about the cluster in etcd. Default value: "/service"
name: pgtest1 # host name; must be unique within the cluster
log:
  level: INFO # log level
  traceback_level: ERROR
  dir: /enmo/app/patroni/ # directory the logs are written to
  file_num: 10 # number of log files to keep
  file_size: 104857600 # size of patroni.log, in bytes, that triggers log rotation
restapi:
  listen: 192.168.58.10:8008
  connect_address: 192.168.58.10:8008
etcd:
  # Provide host to do the initial discovery of the cluster topology:
  # one of host, hosts, url, proxy or srv must be specified
  host: 192.168.58.10:2379 # host:port of an etcd endpoint
bootstrap:
  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  # and all other cluster members will use it as a `global configuration`
  dcs: # dynamic configuration settings; the dynamic configuration is stored in the DCS (Distributed Configuration Store) and applied on all cluster nodes
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576 # ensures a failover does not happen if a standby is more than this many bytes behind the primary
    maximum_lag_on_syncnode: -1
    max_timelines_history: 0 # maximum number of timeline-history items kept in the DCS. Default: 0, which keeps the full history in the DCS.
    master_start_timeout: 300 # time (in seconds) the primary is allowed to recover from a failure before failover is triggered
    # the worst-case failover time after a primary failure is: loop_wait + master_start_timeout + loop_wait
    master_stop_timeout: 0 # number of seconds Patroni is allowed to wait when stopping Postgres; effective only when synchronous_mode is enabled. If exceeded, Patroni sends SIGKILL to the postmaster.
    synchronous_mode: false # turns on synchronous replication mode. One standby is then chosen as the synchronous standby, and only the latest leader and the synchronous standby can take part in leader elections.
    # synchronous_mode_strict # prevents synchronous replication from being disabled when no synchronous replica is available, which blocks all client writes to the primary.
    # https://patroni.readthedocs.io/en/latest/replication_modes.html
    postgresql:
      use_pg_rewind: true # whether to use pg_rewind
      use_slots: true # whether to use replication slots
      parameters:
        max_connections: 3000
        superuser_reserved_connections: 100 # Patroni needs superuser access to the database to operate correctly
        max_locks_per_transaction: 64
        max_worker_processes: 2
        max_prepared_transactions: 0
        wal_level: logical
        wal_log_hints: on
        track_commit_timestamp: off
        max_wal_senders: 10
        max_replication_slots: 10
        wal_keep_size: 4096MB # wal_keep_size = wal_keep_segments * wal_segment_size = 128 * 32MB
        hot_standby: "on"
        listen_addresses: "*"
        port: 5432
        cluster_name: "pg_cluster"
        archive_mode: on
        archive_command: "cp %p /enmo/pgarch/%f"
    # standby_cluster: # define this section to bootstrap a standby cluster
    #   host: 127.0.0.1 # address of the remote host
    #   port: 1111 # port of the remote host
    #   primary_slot_name: patroni # replication slot on the remote primary to use for replication. Optional; defaults to the instance name
    #   create_replica_methods
    #   restore_command
    #   archive_cleanup_command
    #   recovery_min_apply_delay
  initdb: # list of options to pass to initdb
  - data-checksums # must be enabled when pg_rewind is needed on 9.3
  - encoding: UTF8 # default encoding for new databases
  - locale: UTF8 # default locale for new databases
  - wal-segsize: 32
  - allow-group-access
  - pgdata: /enmo/pgdata
  - waldir: /enmo/pgwal
  pg_hba: # list of lines that should be added to pg_hba.conf
  - host all all 0.0.0.0/0 md5
  - host replication replicator 127.0.0.1/32 md5
  users: # additional users to create after the new cluster is initialized
    admin: # the user name is admin
      password: zalando
      options: # list of options for the CREATE USER statement
      - createrole
      - createdb
postgresql:
  listen: 0.0.0.0:5432 # IP address + port that Postgres listens on
  connect_address: 192.168.58.10:5432
  data_dir: /enmo/pgdata # location of the Postgres data directory
  # bin_dir: /software/pgsql/bin # path to the PostgreSQL binaries; defaults to an empty string, meaning the PATH environment variable is used to find the executables
  # config_dir: /software/pgsql/data # location of the Postgres configuration directory; defaults to data_dir
  pgpass: /home/postgres/.pgpass # path to the password file
  pg_ctl_timeout: 60 # how long pg_ctl should wait on start, stop or restart. Default: 60 seconds.
  use_pg_rewind: true # try pg_rewind on the former leader when it rejoins the cluster as a replica
  remove_data_directory_on_rewind_failure: false # if enabled, Patroni removes the PostgreSQL data directory and recreates the replica; otherwise it tries to follow the new leader. Default: false
  remove_data_directory_on_diverged_timelines: false # if Patroni notices diverging timelines and the former master cannot start streaming from the new master, it removes the PostgreSQL data directory and recreates the replica. Useful when pg_rewind cannot be used. Default: false.
  authentication: # user authentication
    replication: # replication user
      username: replica
      password: replica
    superuser: # superuser
      username: postgres
      password: postgres
    # rewind: # pg_rewind user
    #   username:
    #   password:
tags:
  nofailover: false # controls whether this node may take part in the leader race and become the leader
  noloadbalance: false
  clonefrom: false
  nosync: false
4. Create and start the patroni service on all nodes
# Create the service. The environment variables must be set, otherwise startup fails with an error like: FATAL: Patroni requires psycopg2>=2.5.4 or psycopg2-binary
# Requires= makes startup depend on etcd.service succeeding; adjust as needed. If etcd is not running, Patroni will not automatically bring the database up after it starts.
[root@pgtest1 ~]# vi /usr/lib/systemd/system/patroni.service
[Unit]
Description=patroni
After=network.target remote-fs.target nss-lookup.target etcd.service
Requires=etcd.service
[Service]
Type=forking
User=postgres
Group=postgres
Environment="PGHOME=/enmo/app/pgsql/13.3"
Environment="PGDATA=/enmo/pgdata"
Environment="PGPORT=5432"
Environment="LD_LIBRARY_PATH=/enmo/app/pgsql/13.3/lib"
Environment="PATH=/enmo/app/pgsql/13.3/bin:/usr/local/bin"
ExecStart=/bin/bash -c "patroni /enmo/app/patroni/patroni_config.yml >> /enmo/app/patroni/patroni.log 2>&1 &"
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/usr/bin/killall patroni
KillMode=process
TimeoutSec=30
Restart=no
[Install]
WantedBy=multi-user.target
# Start the service
[root@pgtest1 ~]# systemctl daemon-reload
[root@pgtest1 ~]# systemctl restart patroni.service
[root@pgtest1 ~]# systemctl enable patroni.service
# Patroni checks whether PostgreSQL is running and starts it automatically if it is not, so consider disabling the PostgreSQL service and letting Patroni manage database start/stop
[root@pgtest1 ~]# systemctl disable postgres-13.service
Note: does killing the patroni process take the database down?
# 1. kill -9 on the patroni process: the database stays up
[root@pgtest3 ~]# ps -ef |grep patroni
root 22695 22009 0 15:09 pts/1 00:00:00 tail -f /enmo/app/patroni/patroni.log
postgres 22754 1 0 15:13 ? 00:00:00 /usr/bin/python3 /usr/local/bin/patroni /enmo/app/patroni/patroni_config.yml
root 22790 22720 0 15:18 pts/3 00:00:00 grep --color=auto patroni
[root@pgtest3 ~]# kill -9 22754
# 2. killall patroni: the database on this node goes down
[root@pgtest3 ~]# /usr/bin/killall patroni
# 3. with auto failover disabled (paused), killall patroni does not take the database down
[root@pgtest3 ~]# patronictl -c /enmo/app/patroni/patroni_config.yml pause
Success: cluster management is paused
[root@pgtest3 ~]# /usr/bin/killall patroni
After starting Patroni, it is advisable to disable auto failover with patronictl first, and re-enable it selectively only once startup and tuning are confirmed healthy. With auto failover enabled, killall patroni takes down the database on that node; if the primary is serving production, the consequences could be disastrous.
# Disable auto failover
# patronictl pause fails if patroni is not running
[root@pgtest1 ~]# patronictl -c /enmo/app/patroni/patroni_config.yml pause
Success: cluster management is paused
# Resume auto failover
[root@pgtest1 ~]# patronictl -c /enmo/app/patroni/patroni_config.yml resume
Success: cluster management is resumed
5. Set a patronictl alias on all nodes for easier administration
[root@pgtest1 ~]# cat >> /etc/profile << EOF
alias patronictl='patronictl -c /enmo/app/patroni/patroni_config.yml'
EOF
[root@pgtest1 ~]# source /etc/profile
6. Query the Patroni cluster state
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+-----------------+
| pgtest1 | 192.168.58.10 | Leader | running | 5 | | |
| pgtest2 | 192.168.58.11 | Replica | running | 4 | 64 | * |
| pgtest3 | 192.168.58.12 | Replica | running | 4 | 64 | * |
+---------+---------------+---------+---------+----+-----------+-----------------+
Patroni deployment is now complete. What follows covers Patroni administration and is optional reading.
7. Adjusting Patroni dynamic configuration
7.1 Use etcdctl to view the Patroni dynamic configuration stored in etcd
etcdctl returns JSON; for readability, download and install the jq command.
jq download: https://stedolan.github.io/jq/download/ (jq 1.6 binaries for 64-bit or 32-bit)
Upload it to the server and make it executable: chmod +x /enmo/soft/jq-linux64
[root@pgtest1 ~]# export ETCDCTL_API=2
[root@pgtest1 ~]# etcdctl get /service/pg_cluster/config |/enmo/soft/jq-linux64 .
{
"ttl": 30,
"loop_wait": 10,
"retry_timeout": 10,
"maximum_lag_on_failover": 1048576,
"max_timelines_history": 0,
"master_start_timeout": 300,
"master_stop_timeout": 0,
"synchronous_mode": false,
"postgresql": {
"use_pg_rewind": true,
"use_slots": true,
"parameters": {
"max_connections": 3000,
"superuser_reserved_connections": 100,
"max_locks_per_transaction": 64,
"max_worker_processes": 2,
"max_prepared_transactions": 0,
"wal_level": "logical",
"wal_log_hints": true,
"track_commit_timestamp": false,
"max_wal_senders": 10,
"max_replication_slots": 10,
"wal_keep_size": "4096MB",
"hot_standby": "on",
"listen_addresses": "*",
"port": 5432,
"cluster_name": "pg_cluster",
"archive_mode": true,
"archive_command": "cp %p /enmo/pgarch/%f"
}
}
}
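Since etcdctl returns plain JSON, the configuration can also be inspected with Python's standard library instead of jq; the fragment below is abbreviated from the document shown above:

```python
import json

# An abbreviated fragment of /service/pg_cluster/config as returned by etcdctl.
raw = '''
{
  "ttl": 30,
  "loop_wait": 10,
  "retry_timeout": 10,
  "postgresql": {
    "use_pg_rewind": true,
    "parameters": {"max_connections": 3000, "wal_level": "logical"}
  }
}
'''
cfg = json.loads(raw)
print(cfg["ttl"], cfg["postgresql"]["parameters"]["wal_level"])  # 30 logical
```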
7.2 View the dynamic configuration with patronictl
[root@pgtest1 ~]# patronictl show-config
loop_wait: 10
master_start_timeout: 300
master_stop_timeout: 0
max_timelines_history: 0
maximum_lag_on_failover: 1048576
postgresql:
parameters:
archive_command: cp %p /enmo/pgarch/%f
archive_mode: true
cluster_name: pg_cluster
hot_standby: 'on'
listen_addresses: '*'
max_connections: '3000'
max_locks_per_transaction: 64
max_prepared_transactions: 0
max_replication_slots: 10
max_wal_senders: 10
max_worker_processes: 2
port: 5432
superuser_reserved_connections: 100
track_commit_timestamp: false
wal_keep_size: 4096MB
wal_level: logical
wal_log_hints: true
use_pg_rewind: true
use_slots: true
retry_timeout: 10
synchronous_mode: false
ttl: 30
7.3 Adjust configuration with patronictl: a change made on one node is propagated to all other nodes automatically, and Patroni performs the reload itself
[postgres@pgtest1 ~]$ patronictl edit-config
# edit the configuration in the editor that opens
Apply these changes? [y/N]: y
Configuration changed
7.4 For parameters that only take effect after a database restart, restart nodes one at a time to limit production impact, or restart the whole cluster during a maintenance window with patronictl restart
# Restart a single member (--any)
[root@pgtest1 ~]# patronictl restart pg_cluster --any
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | Leader | running | 6 | |
| pgtest3 | 192.168.58.12 | Replica | running | 6 | 0 |
+---------+---------------+---------+---------+----+-----------+
When should the restart take place (e.g. 2021-10-31T21:00) [now]:
Are you sure you want to restart members pgtest1, pgtest2, pgtest3? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Success: restart on member pgtest1
# With --pending, a member is restarted only if it is in the pending-restart state
[root@pgtest1 ~]# patronictl restart pg_cluster --any --pending
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | Leader | running | 6 | |
| pgtest3 | 192.168.58.12 | Replica | running | 6 | 0 |
+---------+---------------+---------+---------+----+-----------+
When should the restart take place (e.g. 2021-10-31T21:01) [now]:
Are you sure you want to restart members pgtest1, pgtest2, pgtest3? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Failed: restart for member pgtest2, status code=503, (restart conditions are not satisfied)
# Restart all members
[root@pgtest1 ~]# patronictl restart pg_cluster
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | Leader | running | 6 | |
| pgtest3 | 192.168.58.12 | Replica | running | 6 | 0 |
+---------+---------------+---------+---------+----+-----------+
When should the restart take place (e.g. 2021-10-31T21:02) [now]:
Are you sure you want to restart members pgtest1, pgtest2, pgtest3? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Success: restart on member pgtest1
Success: restart on member pgtest2
Success: restart on member pgtest3
7.5 Adjusting parameters on one node only
As noted above, PostgreSQL parameters are managed centrally by Patroni, so editing a node's postgresql.conf directly no longer takes effect.
To configure one node differently from the others, set the parameter dynamically with ALTER SYSTEM, as follows:
[root@pgtest1 ~]# psql
postgres=# show wal_keep_segments ;
postgres=# ALTER SYSTEM SET wal_keep_segments =200;
postgres=# SELECT pg_reload_conf();
postgres=# show wal_keep_segments ;
8. Viewing cluster state in a browser
With curl on Linux, install jq beforehand to format the output; in Chrome on Windows, install the JsonView extension, otherwise the result is hard to read.
JsonView download: https://github.com/gildas-lormeau/JSONView-for-Chrome
JsonView reference: https://www.cnblogs.com/songyanan/p/9224347.html
http://192.168.58.10:8008/cluster
http://192.168.58.10:8008/patroni
Get the current dynamic configuration: http://192.168.58.10:8008/config
[root@pgtest1 ~]# curl -s "http://192.168.58.10:8008/cluster" |/enmo/soft/jq-linux64 .
{
"members": [
{
"name": "pgtest1",
"role": "leader",
"state": "running",
"api_url": "http://192.168.58.10:8008/patroni",
"host": "192.168.58.10",
"port": 5432,
"timeline": 5
},
{
"name": "pgtest2",
"role": "replica",
"state": "running",
"api_url": "http://192.168.58.11:8008/patroni",
"host": "192.168.58.11",
"port": 5432,
"timeline": 5,
"pending_restart": true,
"lag": 0
},
{
"name": "pgtest3",
"role": "replica",
"state": "running",
"api_url": "http://192.168.58.12:8008/patroni",
"host": "192.168.58.12",
"port": 5432,
"timeline": 5,
"pending_restart": true,
"lag": 0
}
]
}
9. Patroni REST API
9.1 Monitoring endpoint
[root@pgtest1 ~]# curl -s "http://192.168.58.10:8008/patroni" |/enmo/soft/jq-linux64 .
9.2 Cluster status endpoints
[root@pgtest1 ~]# curl -s "http://192.168.58.10:8008/cluster" |/enmo/soft/jq-linux64 .
[root@pgtest1 ~]# curl -s "http://192.168.58.10:8008/history" |/enmo/soft/jq-linux64 .
9.3 Config endpoint
- Get the current dynamic configuration
[root@pgtest1 ~]# curl -s "http://192.168.58.10:8008/config" |/enmo/soft/jq-linux64 .
- Modify the existing configuration
[root@pgtest1 ~]# curl -s -XPATCH -d '{"loop_wait":9,"ttl":25,"postgresql":{"parameters":{"max_connections":"2500"}}}' http://192.168.58.10:8008/config |/enmo/soft/jq-linux64 .
{
"ttl": 25,
"loop_wait": 9,
"retry_timeout": 10,
"maximum_lag_on_failover": 1048576,
"max_timelines_history": 0,
"master_start_timeout": 300,
"master_stop_timeout": 0,
"synchronous_mode": false,
"postgresql": {
"use_pg_rewind": true,
"use_slots": true,
"parameters": {
"max_connections": "2500",
"superuser_reserved_connections": 100,
"max_locks_per_transaction": 64,
"max_worker_processes": 2,
"max_prepared_transactions": 0,
"wal_level": "logical",
"wal_log_hints": true,
"track_commit_timestamp": false,
"max_wal_senders": 10,
"max_replication_slots": 10,
"wal_keep_size": "4096MB",
"hot_standby": "on",
"listen_addresses": "*",
"port": 5432,
"cluster_name": "pg_cluster",
"archive_mode": true,
"archive_command": "cp %p /enmo/pgarch/%f"
}
}
}
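The PATCH body is ordinary JSON and can be assembled programmatically; the helper below is a hypothetical sketch, not part of Patroni:

```python
import json

def build_config_patch(loop_wait=None, ttl=None, pg_parameters=None):
    """Build a partial dynamic-configuration document for PATCH /config.

    Patroni merges the patch into the existing configuration; keys that are
    omitted stay unchanged, and a PostgreSQL parameter patched to None
    (JSON null) is removed, as shown further below.
    """
    patch = {}
    if loop_wait is not None:
        patch["loop_wait"] = loop_wait
    if ttl is not None:
        patch["ttl"] = ttl
    if pg_parameters is not None:
        patch["postgresql"] = {"parameters": pg_parameters}
    return json.dumps(patch)

body = build_config_patch(loop_wait=9, ttl=25,
                          pg_parameters={"max_connections": "2500"})
# The body can then be sent exactly as in the curl example above:
#   curl -s -XPATCH -d "$body" http://192.168.58.10:8008/config
```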
patronictl now shows that parameters requiring a restart to take effect have been modified (Pending restart):
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+-----------------+
| pgtest1 | 192.168.58.10 | Leader | running | 5 | | * |
| pgtest2 | 192.168.58.11 | Replica | running | 5 | 0 | * |
| pgtest3 | 192.168.58.12 | Replica | running | 5 | 0 | * |
+---------+---------------+---------+---------+----+-----------+-----------------+
- Restart the databases so the parameters take effect
[root@pgtest1 ~]# patronictl restart pg_cluster
+---------+---------------+---------+---------+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+-----------------+
| pgtest1 | 192.168.58.10 | Leader | running | 5 | | * |
| pgtest2 | 192.168.58.11 | Replica | running | 5 | 0 | * |
| pgtest3 | 192.168.58.12 | Replica | running | 5 | 0 | * |
+---------+---------------+---------+---------+----+-----------+-----------------+
When should the restart take place (e.g. 2021-10-31T18:55) [now]:
Are you sure you want to restart members pgtest1, pgtest2, pgtest3? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Success: restart on member pgtest1
Success: restart on member pgtest2
Success: restart on member pgtest3
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Leader | running | 5 | |
| pgtest2 | 192.168.58.11 | Replica | running | 5 | 0 |
| pgtest3 | 192.168.58.12 | Replica | running | 5 | 0 |
+---------+---------------+---------+---------+----+-----------+
- Remove (reset) a setting by patching it to null
$ curl -s -XPATCH -d '{"postgresql":{"parameters":{"max_connections":null}}}' http://192.168.58.10:8008/config |/enmo/soft/jq-linux64 .
9.4 Switchover and failover endpoints
- The failover endpoint allows a manual failover to be performed even when there is no healthy leader, but it does not allow a switchover.
- The switchover endpoint is the opposite: it works only when the cluster is healthy (has a leader) and allows the switchover to be scheduled for a specified time.
- When all nodes are healthy, a failover to a specific node behaves effectively the same as a switchover.
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Leader | running | 5 | |
| pgtest2 | 192.168.58.11 | Replica | running | 5 | 0 |
| pgtest3 | 192.168.58.12 | Replica | running | 5 | 0 |
+---------+---------------+---------+---------+----+-----------+
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/failover -XPOST -d '{"candidate":"pgtest2"}'
Successfully failed over to "pgtest2"
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | Leader | running | 6 | |
| pgtest3 | 192.168.58.12 | Replica | running | 6 | 0 |
+---------+---------------+---------+---------+----+-----------+
[root@pgtest1 ~]# psql
psql (13.3)
Type "help" for help.
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)
- Schedule a switchover from the leader to any other healthy member of the cluster at a specified time
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | Leader | running | 6 | |
| pgtest3 | 192.168.58.12 | Replica | running | 6 | 0 |
+---------+---------------+---------+---------+----+-----------+
# A switchover must specify at least the leader or candidate field, plus an optional scheduled_at field.
# 2021-10-31T18:42+08 reads year-month-dayThour:minute+UTC-offset (+08)
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/switchover -XPOST -d '{"leader":"pgtest2","scheduled_at":"2021-10-31T18:42+08"}'
Switchover scheduled
# patronictl list shows the scheduled switchover ("Switchover scheduled at")
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | Leader | running | 6 | |
| pgtest3 | 192.168.58.12 | Replica | running | 6 | 0 |
+---------+---------------+---------+---------+----+-----------+
Switchover scheduled at: 2021-10-31T18:42:00+00:00
from: pgtest2
# the patroni log also shows a countdown
2021-10-31 18:40:51,002 INFO: no action. I am (pgtest2) the leader with the lock
2021-10-31 18:40:59,995 INFO: Lock owner: pgtest2; I am pgtest2
2021-10-31 18:40:59,997 INFO: Awaiting failover at 2021-10-31T18:42:00+00:00 (in 28860 seconds)
2021-10-31 18:41:00,000 INFO: no action. I am (pgtest2) the leader with the lock
2021-10-31 18:41:08,996 INFO: Lock owner: pgtest2; I am pgtest2
2021-10-31 18:41:08,998 INFO: Awaiting failover at 2021-10-31T18:42:00+00:00 (in 28851 seconds)
# Delete the scheduled switchover
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/switchover -XDELETE
scheduled switchover deleted
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | Leader | running | 6 | |
| pgtest3 | 192.168.58.12 | Replica | running | 6 | 0 |
+---------+---------------+---------+---------+----+-----------+
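The scheduled_at value is an ISO 8601 timestamp with a UTC offset; Python's standard library can produce one in the same form as the example above:

```python
from datetime import datetime, timedelta, timezone

tz_plus8 = timezone(timedelta(hours=8))  # the +08 offset used in the example
when = datetime(2021, 10, 31, 18, 42, tzinfo=tz_plus8)
scheduled_at = when.isoformat(timespec="minutes")
print(scheduled_at)  # 2021-10-31T18:42+08:00
```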
POST /switchover and POST /failover correspond to the patronictl switchover and patronictl failover commands respectively. DELETE /switchover corresponds to the patronictl flush switchover command.
# Switchover
[postgres@pgtest1 ~]$ patronictl switchover
Master [pgtest1]:
Candidate ['pgtest2', 'pgtest3'] []: pgtest2
When should the switchover take place (e.g. 2021-10-28T04:45 ) [now]:
Current cluster topology
... ...
Are you sure you want to switchover cluster pg_cluster, demoting current master pgtest1? [y/N]: y
2021-10-28 03:45:35.91763 Successfully switched over to "pgtest2"
... ...
# Failover
[postgres@pgtest1 ~]$ patronictl failover
Candidate ['pgtest1', 'pgtest3'] []: pgtest1
Current cluster topology
... ...
Are you sure you want to failover cluster pg_cluster, demoting current master pgtest2? [y/N]: y
2021-10-28 03:47:56.13486 Successfully failed over to "pgtest1"
... ...
9.5 Restart endpoint
- POST /restart accepts several parameters:
restart_pending: boolean, if set to true Patroni will restart PostgreSQL only when restart is pending in order to apply some changes in the PostgreSQL config.
role: perform restart only if the current role of the node matches with the role from the POST request.
postgres_version: perform restart only if the current version of postgres is smaller than specified in the POST request.
timeout: how long we should wait before PostgreSQL starts accepting connections. Overrides master_start_timeout.
schedule: timestamp with time zone, schedule the restart somewhere in the future.
- POST /restart restarts only the database on the node it is sent to
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/restart -XPOST
restarted successfully
- POST /restart with a schedule restarts the node's database at the given time
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/restart -XPOST -d '{"schedule":"2021-10-31T19:18+08"}'
Restart scheduled
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+---------------------------+
| Member | Host | Role | State | TL | Lag in MB | Scheduled restart |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+---------------------------+
| pgtest1 | 192.168.58.10 | Replica | running | 6 | 0 | 2021-10-31T19:18:00+08:00 |
| pgtest2 | 192.168.58.11 | Leader | running | 6 | | |
| pgtest3 | 192.168.58.12 | Replica | running | 6 | 0 | |
+---------+---------------+---------+---------+----+-----------+---------------------------+
- DELETE /restart 删除定时重启任务
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/restart -XDELETE
POST /restart and DELETE /restart endpoints are used by patronictl restart and patronictl flush restart respectively.
8.6 重新加载 Reload endpoint
POST /reload makes Patroni re-read and apply its configuration file; it is equivalent to sending the Patroni process a SIGHUP signal. If you changed parameters that require a PostgreSQL restart (such as shared_buffers), you must still restart Postgres explicitly by calling POST /restart or running patronictl restart.
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/reload -XPOST
reload scheduled
The reload endpoint is used by patronictl reload
[postgres@pgtest1 ~]$ patronictl reload pg_cluster
... ...
Are you sure you want to reload members pgtest3, pgtest2, pgtest1? [y/N]: y
Reload request received for member pgtest3 and will be processed within 10 seconds
Reload request received for member pgtest2 and will be processed within 10 seconds
Reload request received for member pgtest1 and will be processed within 10 seconds
8.7 重新初始化 Reinitialize endpoint
Reinitialization is only allowed on replica nodes; once called, it deletes the data directory and runs pg_basebackup to reinitialize the PostgreSQL data directory on the specified node.
The call might fail if Patroni is in a loop trying to recover (restart) a failed Postgres. To overcome this, specify {"force": true} in the request body.
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/reinitialize -XPOST
reinitialize started
The reinitialize endpoint is used by patronictl reinit.
[postgres@pgtest1 ~]$ patronictl reinit pg_cluster
... ...
Which member do you want to reinitialize [pgtest3, pgtest2, pgtest1]? []: pgtest3
Are you sure you want to reinitialize members pgtest3? [y/N]: y
Success: reinitialize for member pgtest3
9. 使用 patronictl 执行数据库查询操作
[postgres@pgtest1 ~]$ cat aa.sql
select * from test_1;
[postgres@pgtest1 ~]$ patronictl -c /enmo/app/patroni/pg_test01.yml query -f aa.sql --password
Password:
id create_time
1 2021-10-16 17:47:34
2 2021-10-16 17:55:06
[postgres@pgtest1 ~]$ patronictl -c /enmo/app/patroni/pg_test01.yml query -c "select * from test_1;" --password
Password:
id create_time
1 2021-10-16 17:47:34
2 2021-10-16 17:55:06
10. 其他参考官方文档
Patroni 官方文档:https://patroni.readthedocs.io/en/latest/index.html
10.1 将独立设备转换为 Patroni 集群
https://patroni.readthedocs.io/en/latest/existing_data.html
10.2 PostgreSQL 版本重大升级
https://patroni.readthedocs.io/en/latest/existing_data.html
PostgreSQL高可用测试系列之Patroni + etcd + HAProxy + Keepalived 离线部署(五)
说明:本系列文章仅用于共享我的学习成果,未经过生产系统考验,对于知识点和一些组件的使用会不定时更新,仅供参考,如有错误的地方,欢迎留言共同学习。
本高可用系列测试不说理论知识,如有需要自行百度,因生产环境大多数是内网环境,无法连接互联网,为模拟生产环境安装,PostgreSQL高可用测试均采用离线部署。
所需软件包均以打包上传百度网盘,如有需要自行下载:https://pan.baidu.com/s/1Tb7GPMvj4kfKEIh8iyvdbA 提取码:n9w2 文件名:PostgreSQL_HA.tar.gz
第一章: 介绍测试环境
第二章: PostgreSQL + replication 部署
第三章: Etcd 部署和管理
第四章: Patroni 部署和管理
第五章: HAProxy + Keepalived 部署和管理
第六章: 高可用模拟故障测试用例
第七章: Prometheus + Grafana 监控部署
第八章: 高可用管理
第五章: HAProxy + Keepalived 部署和管理
- HAProxy is open-source software written in C: a high-concurrency (10,000+ connections), high-performance TCP/HTTP load balancer with cookie-based persistence, automatic failover, regular-expression support, and a web statistics page.
- HAProxy is used here to expose separate ports for connections to the primary (5000) and the read-only standbys (5001), with load balancing across the two standbys on the read-only port.
- HAProxy website: https://www.haproxy.org/
- Keepalived is free, open-source software written in C that provides failover between machines via the VRRP protocol.
- Keepalived documentation: https://www.keepalived.org/manpage.html
1. Install HAProxy on the primary node (pgtest1) and one standby node (pgtest2)
HAProxy can be installed in several ways: the distribution RPM (yum -y install haproxy) provides version 1.5.18, or you can build from source, as done below.
Source download: https://www.haproxy.org/#down
[root@pgtest1 ~]# cd /enmo/soft/
[root@pgtest1 soft]# tar -zxvf haproxy-2.4.7.tar.gz
[root@pgtest1 soft]# mkdir /enmo/app/haproxy-2.4.7
[root@pgtest1 soft]# ln -s /enmo/app/haproxy-2.4.7 /enmo/app/haproxy
[root@pgtest1 soft]# cd haproxy-2.4.7
[root@pgtest1 haproxy-2.4.7]# make TARGET=linux-glibc ARCH=x86_64 PREFIX=/enmo/app/haproxy USE_PCRE=1 USE_OPENSSL=1 USE_ZLIB=1 USE_SYSTEMD=1 USE_CPU_AFFINITY=1
[root@pgtest1 haproxy-2.4.7]# make install PREFIX=/enmo/app/haproxy
# Build options (see `make help`)
TARGET=linux-glibc # target platform (Linux with glibc)
ARCH=x86_64 # CPU architecture
PREFIX=/enmo/app/haproxy # HAProxy install path
USE_PCRE=1 # enable PCRE regular-expression support
USE_OPENSSL=1 # enable OpenSSL support
USE_ZLIB=1 # enable zlib compression support
USE_SYSTEMD=1 # support running under systemd
USE_CPU_AFFINITY=1 # enable CPU affinity
Add the haproxy command to the environment PATH:
[root@pgtest1 ~]# sed -i "s;:\$PATH:;:/enmo/app/haproxy/sbin:\$PATH:;g" /etc/profile
[root@pgtest1 ~]# source /etc/profile
[root@pgtest1 ~]# haproxy -v
HAProxy version 2.4.7-b5e51a5 2021/10/04 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2026.
Known bugs: http://www.haproxy.org/bugs/bugs-2.4.7.html
Running on: Linux 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
2. Create the configuration file on every node where HAProxy is installed
The RPM install's default configuration file is /etc/haproxy/haproxy.cfg.
For the source install, create /enmo/app/haproxy/haproxy.cfg:
# Add the HAProxy configuration on pgtest1 and pgtest2
[root@pgtest1 ~]# vi /enmo/app/haproxy/haproxy.cfg
global
log 127.0.0.1 local2
chroot /enmo/app/haproxy
pidfile /var/run/haproxy.pid
maxconn 5000
user root
group root
daemon
nbproc 2
defaults
mode tcp
log 127.0.0.1 local2 err
option tcplog
option dontlognull
option redispatch
retries 3
maxconn 5000
timeout queue 1m
timeout connect 10s
timeout client 1m
timeout server 1m
timeout check 5s
listen status
bind *:1080
mode http
log global
stats enable
stats refresh 30s
stats uri /
stats realm Private\ lands
stats auth admin:admin
listen master
bind *:5000
mode tcp
option tcplog
balance roundrobin
option httpchk OPTIONS /master
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
server pgtest1 192.168.58.10:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
server pgtest2 192.168.58.11:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
server pgtest3 192.168.58.12:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
listen replicas
bind *:5001
mode tcp
option tcplog
balance roundrobin
option httpchk OPTIONS /replica
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
server pgtest1 192.168.58.10:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
server pgtest2 192.168.58.11:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
server pgtest3 192.168.58.12:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
If the cluster has only two nodes, the /replica health check above should be changed to /read-only; otherwise, once the standby fails there is no read-only service at all. With /read-only, however, the primary also serves reads, so its read load is not fully separated.
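For a two-node cluster, the read-only listen section might look like this (a sketch based on the config above, swapping the health check to Patroni's /read-only endpoint; untested):

```
listen replicas
    bind *:5001
    mode tcp
    option tcplog
    balance roundrobin
    option httpchk OPTIONS /read-only
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pgtest1 192.168.58.10:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
    server pgtest2 192.168.58.11:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
```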
Parameter notes
# 全局定义
global
# log 127.0.0.1 local0 info # 全局的日志配置,使用log关键字,指定使用127.0.0.1上的syslog服务中的local0日志设备,记录日志等级为info的日志
log 127.0.0.1 local2 # 日志输出配置,所有日志都记录在本机,通过local0输出,需要在rsyslog做配置
chroot /var/lib/haproxy # 改变当前工作目录
pidfile /var/run/haproxy.pid # 进程PID文件
maxconn 3000 # 最大连接数
user haproxy # 所属用户
group haproxy # 所属组
daemon # 以后台形式运行haproxy
nbproc 1 # haproxy 启动时的进程数,<=CPU的核数,创建多个进程数,可以减少每个进程的任务队列,但是过多的进程数也可能会导致进程的崩溃。
stats socket /var/lib/haproxy/stats
# 默认部分的定义
defaults
mode tcp # 模式 mode {tcp|http|health}, tcp是4层, http是7层, health是健康检测, 只会返回ok
log 127.0.0.1 local2 err # 使用 127.0.0.1 上的 syslog 服务的 local2 设备记录错误信息
option tcplog # 如果将 mode 设置为 http,那么您必须将 tcplog 更改为 httplog
option dontlognull # 启用该项,日志中将不会记录空连接。所谓空连接就是在上游的负载均衡器或者监控系统为了探测该服务是否存活可用时,需要定期的连接或者获取某一固定的组件或页面,或者探测扫描端口是否在监听或开放等动作被称为空连接;官方文档中标注,如果该服务上游没有其他的负载均衡器的话,建议不要使用该参数,因为互联网上的恶意扫描或其他动作就不会被记录下来。
option redispatch # 当 serverId 对应的服务器挂掉后,强制定向到其他健康的服务器
option abortonclose # 当服务器负载很高的时候,自动结束掉当队列处理比较久的链接
retries 3 # 定义连接后端服务器的失败重连次数,连接失败次数超过此值后将会将对应后端服务器标记为不可用
maxconn 3000 # 默认最大连接数
timeout queue 1m # 当达到服务器的 maxconn 时,连接等待最大时长
timeout connect 10s # 连接超时
timeout client 1m # 客户端非活动状态的超时时长
timeout server 1m # 服务器超时
timeout check 5s # 心跳检测超时
# 配置haproxy web监控,查看统计信息
listen status
bind *:1080 # 定义统计页面的端口
mode http
log global
stats enable # 通过web看状态信息
stats refresh 30s # 统计页面自动刷新时间
maxconn 10 # 最大连接数
stats uri / # 统计页面url,http//ip:1080/ 访问
stats realm Private lands # 设置统计页面认证时的提示内容
stats auth admin:Admin2021 # 设置统计页面认证的用户和密码,如果要设置多个,另起一行写入即可
stats hide-version # 隐藏统计页面上HAProxy的版本信息
listen master
bind *:5000 # 定义haproxy前端部分监听的端口
mode tcp
option tcplog
balance roundrobin # 设置负载算法为:轮询算法
option httpchk OPTIONS /master
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions # inter:间隔3秒做一个检测,fall:3次失败会被踢掉,rise:检查2次
server pgtest1 192.168.58.10:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
server pgtest2 192.168.58.11:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
server pgtest3 192.168.58.12:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
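The health checks above lean on Patroni's REST API semantics: /master returns HTTP 200 only on the leader, /replica only on a running replica, and 503 otherwise. A simplified model of the routing decision this produces (an illustration, not Patroni's or HAProxy's actual code):

```python
# Simplified model of the Patroni health-check responses HAProxy consumes:
# /master -> 200 only on the leader; /replica -> 200 only on a replica; else 503.
def patroni_check(role, endpoint):
    healthy = (endpoint == "/master" and role == "master") or \
              (endpoint == "/replica" and role == "replica")
    return 200 if healthy else 503

# HAProxy keeps a backend in the pool only while its check returns 200,
# so port 5000 ends up routing to the leader and 5001 to the replicas.
def backend_pool(roles, endpoint):
    return [host for host, role in roles.items()
            if patroni_check(role, endpoint) == 200]

roles = {"pgtest1": "master", "pgtest2": "replica", "pgtest3": "replica"}
print(backend_pool(roles, "/master"))   # the primary only (port 5000)
print(backend_pool(roles, "/replica"))  # the two replicas (port 5001)
```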
3. Create the service on every HAProxy node and start HAProxy
The RPM install ships its own unit file, so no extra configuration is needed.
# 创建服务
[root@pgtest1 ~]# vi /usr/lib/systemd/system/haproxy.service
[Unit]
Description=HAProxy Load Balancer
After=syslog.target network.target
[Service]
ExecStartPre=/enmo/app/haproxy/sbin/haproxy -f /enmo/app/haproxy/haproxy.cfg -c -q
ExecStart=/enmo/app/haproxy/sbin/haproxy -Ws -f /enmo/app/haproxy/haproxy.cfg -p /var/run/haproxy.pid
ExecReload=/bin/kill -USR2 $MAINPID
[Install]
WantedBy=multi-user.target
# 启动服务
systemctl daemon-reload
systemctl start haproxy
systemctl status haproxy
systemctl enable haproxy
4. Access HAProxy from a browser
Browse to http://192.168.58.10:1080 and log in with username admin, password admin.
5. Test database connections on ports 5000 and 5001
Port 5000 provides the read-write service and port 5001 the read-only service. Without Keepalived installed, writes only need to be directed at 192.168.58.10/11 on port 5000. You can simulate a primary failure by stopping the current master node and verifying that automatic failover occurs.
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Leader | running | 7 | | # 主库
| pgtest2 | 192.168.58.11 | Replica | running | 7 | 0 | # 备库1
| pgtest3 | 192.168.58.12 | Replica | running | 7 | 0 | # 备库2
+---------+---------------+---------+---------+----+-----------+
# 192.168.58.10 + 5000 连接主库
[root@pgtest3 ~]# psql "host=192.168.58.10 port=5000 user=postgres dbname=postgres password=postgres" -c 'select inet_server_addr(),pg_is_in_recovery()'
inet_server_addr | pg_is_in_recovery
------------------+-------------------
192.168.58.10 | f
(1 row)
# 192.168.58.11 + 5000 连接主库
[root@pgtest3 ~]# psql "host=192.168.58.11 port=5000 user=postgres dbname=postgres password=postgres" -c 'select inet_server_addr(),pg_is_in_recovery()'
inet_server_addr | pg_is_in_recovery
------------------+-------------------
192.168.58.10 | f
(1 row)
# 192.168.58.12 + 5000 不能连接数据库,因为 192.168.58.12 上没有部署 haproxy
[root@pgtest3 ~]# psql "host=192.168.58.12 port=5000 user=postgres dbname=postgres password=postgres" -c 'select inet_server_addr(),pg_is_in_recovery()'
psql: error: could not connect to server: Connection refused
Is the server running on host "192.168.58.12" and accepting
TCP/IP connections on port 5000?
# 192.168.58.10 + 5001 连接两个备库,备库两个节点负载均衡
[root@pgtest3 ~]# psql "host=192.168.58.10 port=5001 user=postgres dbname=postgres password=postgres" -c 'select inet_server_addr(),pg_is_in_recovery()'
inet_server_addr | pg_is_in_recovery
------------------+-------------------
192.168.58.12 | t
(1 row)
[root@pgtest3 ~]# psql "host=192.168.58.10 port=5001 user=postgres dbname=postgres password=postgres" -c 'select inet_server_addr(),pg_is_in_recovery()'
inet_server_addr | pg_is_in_recovery
------------------+-------------------
192.168.58.11 | t
(1 row)
# 192.168.58.11 + 5001 连接备库,备库两个节点负载均衡
[root@pgtest3 ~]# psql "host=192.168.58.11 port=5001 user=postgres dbname=postgres password=postgres" -c 'select inet_server_addr(),pg_is_in_recovery()'
inet_server_addr | pg_is_in_recovery
------------------+-------------------
192.168.58.12 | t
(1 row)
[root@pgtest3 ~]# psql "host=192.168.58.11 port=5001 user=postgres dbname=postgres password=postgres" -c 'select inet_server_addr(),pg_is_in_recovery()'
inet_server_addr | pg_is_in_recovery
------------------+-------------------
192.168.58.11 | t
(1 row)
# 192.168.58.12 + 5001 不能连接数据库,因为 192.168.58.12 上没有部署 haproxy
[root@pgtest3 ~]# psql "host=192.168.58.12 port=5001 user=postgres dbname=postgres password=postgres" -c 'select inet_server_addr(),pg_is_in_recovery()'
psql: error: could not connect to server: Connection refused
Is the server running on host "192.168.58.12" and accepting
TCP/IP connections on port 5001?
HAProxy deployment is now complete.
6. Install Keepalived on the two HAProxy nodes
Keepalived can be installed from the distribution RPM (yum -y install keepalived, version 1.3.5) or from source; the source build is used below.
Source download: https://www.keepalived.org/download.html
Note: the source build prints the warning below; install libnl beforehand if IPv6 support is needed.
*** WARNING - this build will not support IPVS with IPv6. Please install libnl/libnl-3 dev libraries to support IPv6 with IPVS.
# yum install libnl* -y
[root@pgtest1 ~]# cd /enmo/soft/
[root@pgtest1 soft]# tar -zxvf keepalived-2.2.4.tar.gz
[root@pgtest1 soft]# mkdir /enmo/app/keepalived-2.2.4
[root@pgtest1 soft]# ln -s /enmo/app/keepalived-2.2.4 /enmo/app/keepalived
[root@pgtest1 soft]# cd keepalived-2.2.4
[root@pgtest1 keepalived-2.2.4]# ./configure --prefix=/enmo/app/keepalived
[root@pgtest1 keepalived-2.2.4]# make && make install
Add the keepalived command to the environment PATH:
[root@pgtest1 ~]# sed -i "s;:\$PATH:;:/enmo/app/keepalived/sbin:\$PATH:;g" /etc/profile
[root@pgtest1 ~]# source /etc/profile
[root@pgtest1 ~]# keepalived -v
Keepalived v2.2.4 (08/21,2021)
Copyright(C) 2001-2021 Alexandre Cassen, <[email protected]>
Built with kernel headers for Linux 3.10.0
Running on Linux 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017
Distro: CentOS Linux 7 (Core)
configure options: --prefix=/enmo/app/keepalived
Config options: LVS VRRP VRRP_AUTH VRRP_VMAC OLD_CHKSUM_COMPAT INIT=systemd SYSTEMD_NOTIFY
System options: VSYSLOG LIBNL1 RTA_ENCAP RTA_EXPIRES FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK IFA_FLAGS NET_LINUX_IF_H_COLLISION NET_LINUX_IF_ETHER_H_COLLISION LIBIPTC_LINUX_NET_IF_H_COLLISION LIBIPVS_NETLINK IFLA_LINK_NETNSID GLOB_BRACE GLOB_ALTDIRFUNC INET6_ADDR_GEN_MODE SO_MARK
7. Create the configuration file on all Keepalived nodes; the VIP is 192.168.58.20
The RPM install's configuration file is /etc/keepalived/keepalived.conf.
For the source install, create /enmo/app/keepalived/etc/keepalived/keepalived.conf:
# Find the NIC name in use on the keepalived master node
[root@pgtest1 ~]# ip a |grep 192.168.58.10 |awk '{print $NF}'
ens33
# keepalived master node configuration file
[root@pgtest1 ~]# mv /enmo/app/keepalived/etc/keepalived/keepalived.conf /enmo/app/keepalived/etc/keepalived/keepalived.conf.bak
[root@pgtest1 ~]# vi /enmo/app/keepalived/etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
router_id pgtest1
script_user root
enable_script_security
}
vrrp_script chk_haproxy {
script "/usr/bin/killall -0 haproxy"
interval 2
weight 5
fall 3
rise 5
timeout 2
}
vrrp_instance VI_1 {
state MASTER
interface ens33
virtual_router_id 88
priority 100
advert_int 5
authentication {
auth_type PASS
auth_pass postgres
}
virtual_ipaddress {
192.168.58.20/24 dev ens33 label ens33:1
}
track_script {
chk_haproxy
}
}
# Find the NIC name in use on the keepalived backup node
[root@pgtest2 ~]# ip a |grep 192.168.58.11 |awk '{print $NF}'
ens32
# keepalived backup node configuration file
[root@pgtest2 ~]# mv /enmo/app/keepalived/etc/keepalived/keepalived.conf /enmo/app/keepalived/etc/keepalived/keepalived.conf.bak
[root@pgtest2 ~]# vi /enmo/app/keepalived/etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
router_id pgtest2
script_user root
enable_script_security
}
vrrp_script chk_haproxy {
script "/usr/bin/killall -0 haproxy"
interval 2
weight 5
fall 3
rise 5
timeout 2
}
vrrp_instance VI_1 {
state BACKUP
interface ens32
virtual_router_id 88
priority 99
advert_int 5
authentication {
auth_type PASS
auth_pass postgres
}
virtual_ipaddress {
192.168.58.20/24 dev ens32 label ens32:1
}
track_script {
chk_haproxy
}
}
Parameter notes
! Configuration File for keepalived
# 全局定义块
global_defs {
# 标识服务器的字符串,在局域网内应该是唯一的,不必是主机名,默认是本地主机名
router_id pgtest1
# 设置运行脚本默认用户和组
script_user root
# 如果脚本路径的任一部分对于非root用户来说,都具有可写权限,则不会以root身份运行脚本。
enable_script_security
}
# 周期性检查脚本
vrrp_script chk_haproxy {
# 指定要执行的脚本的路径或命令
# 通过向进程 haproxy 发送信号 0 ,然后根据返回值来判断 haproxy 进程是否存在
script "/usr/bin/killall -0 haproxy"
# 脚本调用间隔的秒数,(默认值:1s)
interval 2
# 指定在多少秒后,脚本被认为执行失败
timeout 2
# 调整权重优先级,默认为2
# keepalived 启动时就做权重运算,priority + weight ,主备端哪边权重大,VIP就在哪边启动
weight 5
# 执行失败多少次才认为失败
fall 3
# 执行成功多少次才认为是成功
rise 5
}
# VRRP实例定义块
vrrp_instance VI_1 {
# 指定该keepalived节点的初始状态
state MASTER
# vrrp实例绑定的接口,用于发送VRRP包
interface ens33
# 指定VRRP实例ID,范围是0-255,主备机保持一致
virtual_router_id 88
# 指定优先级,优先级高的将成为MASTER,备机请填写小于主机的值
priority 100
# 指定发送VRRP通告的间隔。单位是秒
advert_int 5
# 指定认证方式
authentication {
auth_type PASS # PASS简单密码认证(推荐),AH:IPSEC认证(不推荐)
auth_pass postgres # 指定认证所使用的密码,最多8位。
}
# 指定VIP地址,主备机保持一致
virtual_ipaddress {
192.168.58.20/24 dev ens33 label ens33:1
}
# 添加一个 track 脚本( vrrp_script 配置的脚本)
track_script {
chk_haproxy
}
}
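The interaction between priority and the chk_haproxy weight noted above can be modeled simply: with a positive weight, a succeeding track script adds the weight to the node's priority, and the node with the highest effective priority holds the VIP. This is a sketch of VRRP behavior under those assumptions, not keepalived's actual code:

```python
# Effective VRRP priority with a positive-weight track script:
# a succeeding script adds `weight`; a failing one leaves the bare priority.
def effective_priority(priority, weight, script_ok):
    return priority + weight if script_ok else priority

def vip_owner(nodes):
    # nodes: dict name -> (priority, weight, script_ok); highest effective wins
    return max(nodes, key=lambda n: effective_priority(*nodes[n]))

# Normal operation: pgtest1 (100+5) beats pgtest2 (99+5) and holds the VIP.
print(vip_owner({"pgtest1": (100, 5, True), "pgtest2": (99, 5, True)}))
# haproxy dies on pgtest1: 100 vs 104, so the VIP moves to pgtest2.
print(vip_owner({"pgtest1": (100, 5, False), "pgtest2": (99, 5, True)}))
```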
keepalived 配置文件参数详解 https://blog.csdn.net/mofiu/article/details/76644012
8. Create the service on all Keepalived nodes and start Keepalived
The RPM install ships its own unit file, so no extra configuration is needed.
# 创建服务
[root@pgtest1 ~]# vi /usr/lib/systemd/system/keepalived.service
[Unit]
Description=LVS and VRRP High Availability Monitor
After=network-online.target syslog.target haproxy.service
Requires=haproxy.service
Wants=network-online.target
Documentation=man:keepalived(8)
Documentation=man:keepalived.conf(5)
Documentation=man:genhash(1)
Documentation=https://keepalived.org
[Service]
Type=forking
PIDFile=/var/run/keepalived.pid
KillMode=process
EnvironmentFile=/enmo/app/keepalived/etc/sysconfig/keepalived
ExecStart=/enmo/app/keepalived/sbin/keepalived -f /enmo/app/keepalived/etc/keepalived/keepalived.conf $KEEPALIVED_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/usr/bin/killall keepalived
[Install]
WantedBy=multi-user.target
# 启动服务
systemctl daemon-reload
systemctl start keepalived
systemctl enable keepalived
9. Verify on the Keepalived master node that the VIP (192.168.58.20) has been created
[root@pgtest1 ~]# ip -4 a show ens33
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
inet 192.168.58.10/24 brd 192.168.58.255 scope global ens33
valid_lft forever preferred_lft forever
inet 192.168.58.20/24 scope global secondary ens33:1
valid_lft forever preferred_lft forever
10. Test database connections through the VIP (192.168.58.20) on ports 5000/5001
# 192.168.58.20 + 5000 connects only to the primary, providing the read-write service
[root@pgtest3 ~]# psql "host=192.168.58.20 port=5000 user=postgres dbname=postgres password=postgres" -c 'select inet_server_addr(),pg_is_in_recovery()'
inet_server_addr | pg_is_in_recovery
------------------+-------------------
192.168.58.10 | f
(1 row)
# 192.168.58.20 + 5001 load-balances across the two standbys, providing the read-only service
[root@pgtest3 ~]# psql "host=192.168.58.20 port=5001 user=postgres dbname=postgres password=postgres" -c 'select inet_server_addr(),pg_is_in_recovery()'
inet_server_addr | pg_is_in_recovery
------------------+-------------------
192.168.58.11 | t
(1 row)
[root@pgtest3 ~]# psql "host=192.168.58.20 port=5001 user=postgres dbname=postgres password=postgres" -c 'select inet_server_addr(),pg_is_in_recovery()'
inet_server_addr | pg_is_in_recovery
------------------+-------------------
192.168.58.12 | t
(1 row)
HAProxy and Keepalived both log to /var/log/messages.
This completes the PostgreSQL HA offline deployment with Patroni + etcd + HAProxy + Keepalived.
11. Keepalived management
11.1 Proactively switching the VIP by adjusting keepalived priority
Note: raise priority on the keepalived backup above the master's value:
[root@pgtest2 ~]# vi /enmo/app/keepalived/etc/keepalived/keepalived.conf
priority 101 # the master is 100
Restart keepalived on the backup node; seeing that its priority now exceeds the master's, it automatically takes over the VIP. If all goes smoothly, the impact on the VIP is minimal.
systemctl restart keepalived is not recommended for this; it did not behave reliably in testing, so stop and start instead:
[root@pgtest2 ~]# systemctl stop keepalived
[root@pgtest2 ~]# systemctl start keepalived
PostgreSQL高可用测试系列之Patroni + etcd + HAProxy + Keepalived 离线部署(六)
第六章: 高可用模拟故障测试用例
借用Patroni官方文档的一句话:
Testing an HA solution is a time consuming process, with many variables. This is particularly true considering a cross-platform application. You need a trained system administrator or a consultant to do this work. It is not something we can cover in depth in the documentation.
Testing HA is a time-consuming process with many variables and cannot cover every case. Only some test types are listed below; if you have other test scenarios, feel free to leave a comment, and this article will be updated with more scenarios from time to time. The test commands are lengthy and are not reproduced here.
1. 高可用测试 Keepalived
测试类型 | 测试方式 | 测试命令 | 测试结果 |
---|---|---|---|
进程故障 | 1. 主端killall进程 | killall keepalived | VIP从主端自动转移到备端,5000端口和5001端口连接正常 |
- | - | systemctl start keepalived | VIP从备端自动转移到主端,5000端口和5001端口连接正常 |
- | 2. 备端killall进程 | systemctl stop keepalived | VIP在主端正常运行,5000端口和5001端口连接正常 |
- | - | systemctl start keepalived | VIP在主端正常运行,5000端口和5001端口连接正常 |
- | 3. 主端同时kill所有进程 | - | 主端VIP未卸掉,备端也启动VIP,此时主备端均存在VIP(异常现象),5000端口和5001端口连接正常 |
- | - | systemctl start keepalived | VIP转移到主端正常运行,备端无VIP,5000端口和5001端口连接正常 |
- | 4. 主端只kill主进程 | kill -9 | VIP从主端自动转移到备端,VIP只在备端启动,5000端口和5001端口连接正常 |
- | - | systemctl start keepalived | VIP转移到主端正常运行,备端无VIP,5000端口和5001端口连接正常 |
- | 5. 主端只kill子进程 | - | VIP从主端自动转移到备端,等待主进程自动生成子进程后,VIP从备端自动转移到主端,5000端口和5001端口连接正常 |
- | 6. 备端kill 进程 | - | IP在主端正常运行,5000端口和5001端口连接正常 |
网卡故障 | 1. 主端down网卡 | ifdown ens33 | VIP从主端自动转移到备端,PostgreSQL发生故障转移到其中一个备库,5000端口和5001端口连接正常,patroni和etcd均不显示故障节点 |
- | 2. 主端up网卡 | ifup ens33 | VIP从备端自动转移到主端,故障节点以备库角色添加到集群,patroni和etcd节点状态显示正常,5000端口和5001端口连接正常 |
- | 3. 备端down网卡 | ifdown ens32 | VIP在主端正常运行,5000端口和5001端口连接正常,patroni和etcd均不显示故障节点,故障节点上的各个进程还在运行 |
- | 4. 备端up网卡 | ifup ens32 | patroni和etcd节点状态显示正常 |
2. 高可用测试 HAProxy
测试类型 | 测试方式 | 测试命令 | 测试结果 |
---|---|---|---|
进程故障 | 1. 主端killall进程 | killall haproxy | keepalived 未检测 haproxy 进程,自动将VIP从主端转移到备端,5000端口和5001端口连接正常 |
- | - | systemctl start haproxy | keepalived 检测到 haproxy 进程,自动将VIP从备端转移到主端,5000端口和5001端口连接正常 |
- | 2. 备端killall进程 | killall haproxy | VIP在主端正常运行,5000端口和5001端口连接正常 |
- | - | systemctl start haproxy | VIP在主端正常运行,5000端口和5001端口连接正常 |
- | 3. 主端同时kill所有进程 | - | keepalived 未检测 haproxy 进程,自动将VIP从主端转移到备端,5000端口和5001端口连接正常 |
- | - | systemctl start haproxy | keepalived 检测到 haproxy 进程,自动将VIP从备端转移到主端,5000端口和5001端口连接正常 |
- | 4. 主端只kill主进程 | - | keepalived 未检测 haproxy 进程,自动将VIP从主端转移到备端,5000端口和5001端口连接正常 |
- | - | systemctl start haproxy | keepalived 检测到 haproxy 进程,自动将VIP从备端转移到主端,5000端口和5001端口连接正常 |
- | 5. 主端只kill子进程 | - | haproxy 的所有进程都死了,keepalived 未检测 haproxy 进程,自动将VIP从主端转移到备端,5000端口和5001端口连接正常 |
- | - | systemctl start haproxy | keepalived 检测到 haproxy 进程,自动将VIP从备端转移到主端,5000端口和5001端口连接正常 |
3. 高可用测试 Patroni
以下是在Patroni开启了auto failover的情况下进行测试
[root@pgtest3 ~]# patronictl resume
测试类型 | 测试方式 | 测试命令 | 测试结果 |
---|---|---|---|
进程故障 | 1. 主端killall进程 | killall patroni | 1. 触发故障切换到备库其中一个节点,备库另一个节点同步新主库,切换时间在30秒内 2. 原主库(pgtest1)的 PostgreSQL 被关闭 3. etcd haproxy keepalived 在原主库正常运行,VIP 运行在原主库 4. VIP + 5000端口连接切换后的新主库,VIP + 5001端口连接另一个备库 |
- | - | systemctl start patroni | 原主库(pgtest1)变成新主库(pgtest2)的备库 |
- | 2. 主库kill patroni 进程 | kill -9 | 1. 触发故障切换到备库其中一个节点,备库另一个节点同步新主库,切换时间在30秒内 2. 原主库(pgtest1)的 PostgreSQL 还在运行,并且是读写模式 3. etcd haproxy keepalived 在原主库正常运行,VIP 运行在原主库 4. VIP + 5000端口连接切换后的新主库,VIP + 5001端口连接另一个备库 |
- | - | systemctl start patroni | 原主库(pgtest1)被 pg_rewind 成新主库(pgtest2)的备库 |
- | 3. 一个备库kill patroni 进程 | - | 1. 使用killall,将会同时关闭备库,使用kill,此备库的 PostgreSQL 还在以只读模式运行,且与主库正常同步数据 2. VIP + 5000端口正常连接主库,VIP+5001端口不能连接此备库,可以连接另一个备库 3. 主库与另一个备库不受影响 4. 此备库上的 etcd haproxy keepalived 正常运行 |
- | - | systemctl start patroni | 自动恢复正常状态,与主库保持同步 |
- | 4. 两个备库kill patroni 进程 | - | 1. 使用killall,将会同时关闭备库,使用kill,两个备库的 PostgreSQL 还在以只读模式运行,且与主库正常同步数据 2. VIP + 5000端口只连接主库,VIP + 5001端口连接失败 3. 主库不受影响 4. 备库上的 etcd haproxy keepalived 正常运行 |
- | - | systemctl start patroni | 自动恢复正常状态,与主库保持同步 |
4. 高可用测试 etcd
测试类型 | 测试方式 | 测试命令 | 测试结果 |
---|---|---|---|
进程故障 | 1. 主库kill etcd 进程 | - | 不影响主库和备库, patroni 会连接其它节点上的etcd,VIP+5000/5001端口连接正常 |
- | 2. 一个备库停止 etcd 进程 | - | 不影响主库和备库, patroni 会连接其它节点上的etcd,VIP+5000/5001端口连接正常 |
- | 3. 两个备库停止 etcd 进程 | - | 此时超过了etcd的最大允许故障节点数,主备库3个节点均以只读模式运行,VIP + 5000端口连接失败,VIP + 5001端口轮询连接主备库3个节点 |
- | - | 先启动第一个备库的 etcd 进程 | 主库从只读模式切换成读写模式,主从数据同步恢复正常,VIP + 5000/5001端口连接正常 |
- | - | 再启动第二个备库的 etcd 进程 | 自动恢复正常状态,与主库保持同步 |
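The etcd results above follow from Raft quorum arithmetic: a cluster of n members needs n//2 + 1 live members and therefore tolerates (n-1)//2 failures, so the 3-node cluster survives one etcd outage but not two. A quick check:

```python
# Raft quorum math for an etcd cluster of n members.
def quorum(n):
    return n // 2 + 1          # live members needed to keep a leader

def tolerated_failures(n):
    return (n - 1) // 2        # members that can fail without losing quorum

for n in (1, 3, 5):
    print(n, quorum(n), tolerated_failures(n))

# With 3 nodes: losing 1 etcd keeps quorum (tests 1-2 above); losing 2 breaks
# it, so Patroni demotes the leader and all nodes end up read-only (test 3).
```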
5. 高可用测试 PostgreSQL
测试类型 | 测试方式 | 测试命令 | 测试结果 |
---|---|---|---|
- | 停主库PostgreSQL实例 | - | 主库被Patroni自动拉起,VIP + 5000/5001端口连接正常 |
- | 停备库PostgreSQL实例 | - | 备库被Patroni自动拉起,VIP + 5000/5001端口连接正常 |
6. 高可用测试 操作系统
测试类型 | 测试方式 | 测试命令 | 测试结果 |
---|---|---|---|
- | 停PostgreSQL主库主机(同时是haproxy + keepalived 的主机) | reboot | 1. 触发故障切换到备库其中一个节点,备库另一个节点同步新主库,切换时间在30秒内 2. VIP漂移到备库 3. VIP + 5000端口连接切换后的新主库,VIP + 5001端口连接另一个备库 |
- | - | 启动 | 原主库(pgtest1)变成新主库(pgtest2)的备库,VIP从keepalived的备端自动转移到主端,5000端口和5001端口连接正常 |
- | 停备库的主机就不测试了 | - | - |
PostgreSQL高可用测试系列之Patroni + etcd + HAProxy + Keepalived 离线部署(七)
第七章: Prometheus + Grafana 监控部署
- prometheus github 下载地址:https://github.com/prometheus/prometheus 文件名:prometheus-2.30.3.linux-amd64.tar.gz
- Grafana 官网下载地址:https://grafana.com/grafana/download 文件名:grafana-enterprise-8.2.2.linux-amd64.tar.gz
prometheus用到的几个客户端数据采集工具:
- node_exporter github 下载地址:https://github.com/prometheus/node_exporter 文件名:node_exporter-1.2.2.linux-amd64.tar.gz
- postgres_exporter github 下载地址:https://github.com/prometheus-community/postgres_exporter 文件名:postgres_exporter-0.10.0.linux-amd64.tar.gz
- haproxy_exporter github 下载地址:https://github.com/prometheus/haproxy_exporter 文件名:haproxy_exporter-0.12.0.linux-amd64.tar.gz
1. 在节点3上安装并启动prometheus
# 解压即是安装
[root@pgtest3 ~]# tar -zxvf /enmo/soft/prometheus-2.30.3.linux-amd64.tar.gz -C /enmo/app
[root@pgtest3 ~]# mv /enmo/app/prometheus-2.30.3.linux-amd64 /enmo/app/prometheus-2.30.3
[root@pgtest3 ~]# ln -s /enmo/app/prometheus-2.30.3 /enmo/app/prometheus
# 将prometheus的命令加入到环境变量
[root@pgtest3 ~]# sed -i "s;:\$PATH:;:/enmo/app/prometheus:\$PATH:;g" /etc/profile
[root@pgtest3 ~]# source /etc/profile
# 创建服务
[root@pgtest3 ~]# vi /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target
[Service]
User=root
Group=root
Type=simple
Restart=on-failure
WorkingDirectory=/enmo/app/prometheus
ExecStart=/enmo/app/prometheus/prometheus --storage.tsdb.retention=30d --web.enable-lifecycle --web.enable-admin-api --config.file=/enmo/app/prometheus/prometheus.yml --storage.tsdb.path=/enmo/app/prometheus/data
ExecReload=/bin/kill -HUP $MAINPID
RuntimeDirectory=prometheus
RuntimeDirectoryMode=0750
LimitNOFILE=10000
TimeoutStopSec=20
[Install]
WantedBy=multi-user.target
# --web.enable-lifecycle: later config changes can be re-read via the HTTP API without restarting the service.
# 启动 prometheus
[root@pgtest3 ~]# systemctl daemon-reload
[root@pgtest3 ~]# systemctl restart prometheus
[root@pgtest3 ~]# systemctl enable prometheus
2. 浏览器登录prometheus
登录地址:http://192.168.58.12:9090
查看现有的监控主机,当前只有本机localhost
3. 所有监控客户端安装 node_exporter,用于获取主机的监控指标
# 解压即是安装
[root@pgtest1 ~]# tar -zxvf /enmo/soft/node_exporter-1.2.2.linux-amd64.tar.gz -C /enmo/app/
[root@pgtest1 ~]# mv /enmo/app/node_exporter-1.2.2.linux-amd64 /enmo/app/node_exporter-1.2.2
[root@pgtest1 ~]# ln -s /enmo/app/node_exporter-1.2.2 /enmo/app/node_exporter
# 创建服务
[root@pgtest1 ~]# vi /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=Prometheus node_exporter
Documentation=https://github.com/prometheus/node_exporter
After=network.target
[Service]
User=nobody
ExecStart=/enmo/app/node_exporter/node_exporter --log.level=error
ExecStop=/usr/bin/killall node_exporter
[Install]
WantedBy=multi-user.target
# 启动服务
[root@pgtest1 ~]# systemctl daemon-reload
[root@pgtest1 ~]# systemctl restart node_exporter
[root@pgtest1 ~]# systemctl enable node_exporter
4. Install postgres_exporter on all monitored clients to collect PostgreSQL database metrics
# 解压即是安装
[root@pgtest1 ~]# tar -zxvf /enmo/soft/postgres_exporter-0.10.0.linux-amd64.tar.gz -C /enmo/app/
[root@pgtest1 ~]# mv /enmo/app/postgres_exporter-0.10.0.linux-amd64 /enmo/app/postgres_exporter-0.10.0
[root@pgtest1 ~]# ln -s /enmo/app/postgres_exporter-0.10.0 /enmo/app/postgres_exporter
# 创建服务
[root@pgtest1 ~]# vi /usr/lib/systemd/system/postgres_exporter.service
[Unit]
Description=Prometheus postgres_exporter
Documentation=https://github.com/wrouesnel/postgres_exporter
After=network.target
[Service]
Type=simple
User=postgres
Environment="DATA_SOURCE_NAME=postgresql://postgres:postgres@localhost:5432/postgres?sslmode=disable"
ExecStart=/enmo/app/postgres_exporter/postgres_exporter --log.level=error
# --extend.query-path=queries.yaml # optionally point at a file of custom metric queries
Restart=on-failure
[Install]
WantedBy=multi-user.target
# 启动服务
[root@pgtest1 ~]# systemctl daemon-reload
[root@pgtest1 ~]# systemctl restart postgres_exporter
[root@pgtest1 ~]# systemctl enable postgres_exporter
5. 所有监控客户端安装 haproxy_exporter,用于获取haproxy的监控指标
# 解压即是安装
[root@pgtest1 ~]# tar -zxvf /enmo/soft/haproxy_exporter-0.12.0.linux-amd64.tar.gz -C /enmo/app/
[root@pgtest1 ~]# mv /enmo/app/haproxy_exporter-0.12.0.linux-amd64 /enmo/app/haproxy_exporter-0.12.0
[root@pgtest1 ~]# ln -s /enmo/app/haproxy_exporter-0.12.0 /enmo/app/haproxy_exporter
# 创建服务
[root@pgtest1 ~]# vi /usr/lib/systemd/system/haproxy_exporter.service
[Unit]
Description=Prometheus haproxy_exporter
Documentation=https://github.com/prometheus/haproxy_exporter
After=network.target
[Service]
User=nobody
ExecStart=/enmo/app/haproxy_exporter/haproxy_exporter --haproxy.scrape-uri=http://admin:[email protected]:1080/?stats;csv --log.level=error
ExecStop=/usr/bin/killall haproxy_exporter
[Install]
WantedBy=multi-user.target
# 启动服务
[root@pgtest1 ~]# systemctl daemon-reload
[root@pgtest1 ~]# systemctl start haproxy_exporter
[root@pgtest1 ~]# systemctl enable haproxy_exporter
6. Add the clients to the Prometheus server's configuration file prometheus.yml; YAML is strict about indentation
[root@pgtest3 ~]# vi /enmo/app/prometheus/prometheus.yml
# Append to the end of the file; keep the leading spaces on each line intact
- job_name: "linux_monitor"
static_configs:
- targets:
- "192.168.58.10:9100"
- "192.168.58.11:9100"
- "192.168.58.12:9100"
- job_name: "postgres_monitor"
static_configs:
- targets:
- "192.168.58.10:9187"
- "192.168.58.11:9187"
- "192.168.58.12:9187"
- job_name: "haproxy_monitor"
static_configs:
- targets:
- "192.168.58.10:9101"
- "192.168.58.11:9101"
- job_name: "etcd_monitor"
static_configs:
- targets:
- "192.168.58.10:2379"
- "192.168.58.11:2379"
- "192.168.58.12:2379"
# etcd itself exposes Prometheus-format metrics on /metrics, so no separate etcd exporter is needed
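Because prometheus.yml is indentation-sensitive, it can help to generate the job blocks rather than hand-edit them. A small hypothetical helper whose rendered shape matches the snippet above:

```python
# Render one Prometheus scrape job in the same shape as the snippet above.
# Indentation is two spaces per YAML level, starting under scrape_configs.
def scrape_job(job_name, targets):
    lines = [f'  - job_name: "{job_name}"',
             "    static_configs:",
             "      - targets:"]
    lines += [f'        - "{t}"' for t in targets]
    return "\n".join(lines)

print(scrape_job("linux_monitor",
                 ["192.168.58.10:9100", "192.168.58.11:9100", "192.168.58.12:9100"]))
```

The output can be appended to prometheus.yml and then validated with `promtool check config` as shown below.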
# 检查配置文件是否有效
[root@pgtest3 ~]# promtool check config /enmo/app/prometheus/prometheus.yml
Checking /enmo/app/prometheus/prometheus.yml
SUCCESS: 0 rule files found
# 重新载入配置文件或重启prometheus
[root@pgtest3 ~]# curl -X POST http://192.168.58.12:9090/-/reload
# systemctl restart prometheus
# 查看是否能获取到监控数据
[root@pgtest3 ~]# curl http://192.168.58.10:9100/metrics
[root@pgtest3 ~]# curl http://192.168.58.11:9100/metrics
[root@pgtest3 ~]# curl http://192.168.58.12:9100/metrics
[root@pgtest3 ~]# curl http://192.168.58.10:9187/metrics
[root@pgtest3 ~]# curl http://192.168.58.11:9187/metrics
[root@pgtest3 ~]# curl http://192.168.58.12:9187/metrics
[root@pgtest3 ~]# curl http://192.168.58.10:2379/metrics
[root@pgtest3 ~]# curl http://192.168.58.11:2379/metrics
[root@pgtest3 ~]# curl http://192.168.58.12:2379/metrics
[root@pgtest3 ~]# curl http://192.168.58.10:9101/metrics
[root@pgtest3 ~]# curl http://192.168.58.11:9101/metrics
通过浏览器也可以看到新添加的客户端状态
通过浏览器可以查看到监控数据
这种图形界面显然不太直观和美观,所以引入Grafana。
7. 在节点3上安装并启动 Grafana
# 解压即是安装
[root@pgtest3 ~]# tar -zxvf /enmo/soft/grafana-enterprise-8.2.2.linux-amd64.tar.gz -C /enmo/app/
[root@pgtest3 ~]# ln -s /enmo/app/grafana-8.2.2 /enmo/app/grafana
# 修改defaults.ini文件部分参数如下
sed -i "s:\<data = data\>:data = /enmo/app/grafana/data:g" /enmo/app/grafana/conf/defaults.ini
sed -i "s:\<logs = data/log\>:logs = /enmo/app/grafana/data/log:g" /enmo/app/grafana/conf/defaults.ini
sed -i "s:\<plugins = data/plugins\>:plugins = /enmo/app/grafana/data/plugins:g" /enmo/app/grafana/conf/defaults.ini
sed -i "s:\<provisioning = conf/provisioning\>:provisioning = /enmo/app/grafana/conf/provisioning:g" /enmo/app/grafana/conf/defaults.ini
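Before touching the real defaults.ini, the word-boundary (`\<` `\>`) patterns above can be sanity-checked on a scratch copy of the stock values. A purely illustrative dry-run:

```shell
#!/bin/sh
# Apply the four rewrites to a scratch file holding the stock
# defaults.ini values, then print the result for inspection.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
data = data
logs = data/log
plugins = data/plugins
provisioning = conf/provisioning
EOF
sed -i "s:\<data = data\>:data = /enmo/app/grafana/data:g" "$tmp"
sed -i "s:\<logs = data/log\>:logs = /enmo/app/grafana/data/log:g" "$tmp"
sed -i "s:\<plugins = data/plugins\>:plugins = /enmo/app/grafana/data/plugins:g" "$tmp"
sed -i "s:\<provisioning = conf/provisioning\>:provisioning = /enmo/app/grafana/conf/provisioning:g" "$tmp"
result=$(cat "$tmp")
rm -f "$tmp"
echo "$result"
```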
# Add the grafana binaries to the PATH environment variable
[root@pgtest3 ~]# sed -i "s;:\$PATH:;:/enmo/app/grafana/bin:\$PATH:;g" /etc/profile
[root@pgtest3 ~]# source /etc/profile
# Create a systemd service
[root@pgtest3 ~]# vi /usr/lib/systemd/system/grafana.service
[Unit]
Description=Grafana Server
Documentation=https://grafana.com/grafana/download
After=network.target
[Service]
User=root
Group=root
Type=simple
Restart=on-failure
WorkingDirectory=/enmo/app/grafana
ExecStart=/enmo/app/grafana/bin/grafana-server --config=/enmo/app/grafana/conf/defaults.ini --homepath=/enmo/app/grafana
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/usr/bin/killall grafana-server
[Install]
WantedBy=multi-user.target
# Start and enable the service
[root@pgtest3 ~]# systemctl daemon-reload
[root@pgtest3 ~]# systemctl start grafana
[root@pgtest3 ~]# systemctl enable grafana
8. Log in to Grafana from a browser
Login URL: http://192.168.58.12:3000 with the default credentials admin/admin; the first login prompts for a password change
9. Add Prometheus as a data source in Grafana
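Besides adding the data source through the web UI, Grafana can provision it from a file on startup. A hedged sketch (the file name and whether you want it as the default are assumptions; the URL is the Prometheus address from this setup), placed under the provisioning directory configured in defaults.ini above:

```yaml
# /enmo/app/grafana/conf/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://192.168.58.12:9090
    isDefault: true
```

Restart Grafana after dropping the file in place; the data source then appears without any manual clicking, which suits the offline, repeatable deployments this series targets.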
10. Import dashboard templates
Grafana dashboard templates can be downloaded from: https://grafana.com/grafana/dashboards/
Find the template you want and use Download JSON; if the Grafana server can reach the internet, Copy ID to Clipboard works as well.
Import the template from the Grafana web UI.
With this, the Prometheus + Grafana monitoring deployment is complete.
PostgreSQL High Availability Test Series: Patroni + etcd + HAProxy + Keepalived Offline Deployment (Part 8)
Chapter 8: High Availability Management
[root@pgtest1 ~]# etcdctl endpoint status --cluster -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.58.11:2379 | 3f414532c235ce16 | 3.5.1 | 20 kB | false | false | 6 | 489460 | 489460 | |
| http://192.168.58.12:2379 | 41cf8a739c2e9b50 | 3.5.1 | 20 kB | true | false | 6 | 489460 | 489460 | |
| http://192.168.58.10:2379 | caef4208a95efee8 | 3.5.1 | 25 kB | false | false | 6 | 489460 | 489460 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Leader | running | 18 | |
| pgtest2 | 192.168.58.11 | Replica | running | 18 | 0 |
| pgtest3 | 192.168.58.12 | Replica | running | 18 | 0 |
+---------+---------------+---------+---------+----+-----------+
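The cluster view shown by patronictl can also be fetched from Patroni's REST API (the /cluster endpoint on port 8008 in this setup), which is convenient for scripting. A hedged sketch; on a live node you would replace the captured sample JSON with `curl -s http://192.168.58.10:8008/cluster`:

```shell
#!/bin/sh
# Summarize Patroni's /cluster JSON as one "name role state" line per
# member, using only basic shell tools (no jq). The sample below is a
# captured stand-in so the parsing is reproducible offline.
sample='{"members":[{"name":"pgtest1","role":"leader","state":"running"},{"name":"pgtest2","role":"replica","state":"running"},{"name":"pgtest3","role":"replica","state":"running"}]}'
summary=$(echo "$sample" | tr ',' '\n' \
  | grep -o '"\(name\|role\|state\)":"[^"]*"' \
  | sed 's/"//g' | paste -d' ' - - -)
echo "$summary"
```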
1. Shutdown and startup order of components when rebooting servers for planned maintenance
1. Planned reboot of the PostgreSQL primary node only (pgtest1)
Consider whether to switch the database over to a standby first. In this architecture the PostgreSQL primary is also the HAProxy and Keepalived master, so the VIP runs on this node.
# 1. Switch the database over from pgtest1 to pgtest2
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/switchover -XPOST -d '{"leader":"pgtest1","candidate":"pgtest2"}'
Successfully switched over to "pgtest2"
# The database can also be switched with the patronictl switchover command (non-interactively, e.g. patronictl switchover --master pgtest1 --candidate pgtest2 --force; flag names vary by Patroni version)
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Replica | running | 19 | 0 |
| pgtest2 | 192.168.58.11 | Leader | running | 19 | |
| pgtest3 | 192.168.58.12 | Replica | running | 19 | 0 |
+---------+---------------+---------+---------+----+-----------+
# 2. Stop keepalived on pgtest1; the VIP fails over automatically to the keepalived backup node (pgtest2). Verify the VIP and that client connections still work.
[root@pgtest1 ~]# systemctl stop keepalived
# 3. Stop haproxy on pgtest1
[root@pgtest1 ~]# systemctl stop haproxy
# 4. Stop patroni and etcd on pgtest1; patroni shuts PostgreSQL down automatically
[root@pgtest1 ~]# systemctl stop patroni
[root@pgtest1 ~]# systemctl stop etcd
# 5. Reboot the pgtest1 server
[root@pgtest1 ~]# reboot
# 6. After the reboot, etcd, patroni, haproxy, keepalived and all exporters start automatically. When patroni starts and finds the database down, it brings it back up; once keepalived starts, the VIP floats back to this node. Verify the VIP and that client connections work.
[root@pgtest1 ~]# etcdctl endpoint status --cluster -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.58.11:2379 | 3f414532c235ce16 | 3.5.1 | 20 kB | false | false | 6 | 473973 | 473973 | |
| http://192.168.58.12:2379 | 41cf8a739c2e9b50 | 3.5.1 | 20 kB | true | false | 6 | 473973 | 473973 | |
| http://192.168.58.10:2379 | caef4208a95efee8 | 3.5.1 | 25 kB | false | false | 6 | 473973 | 473973 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Replica | running | 19 | 0 |
| pgtest2 | 192.168.58.11 | Leader | running | 19 | |
| pgtest3 | 192.168.58.12 | Replica | running | 19 | 0 |
+---------+---------------+---------+---------+----+-----------+
[root@pgtest1 ~]# ip -4 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
inet 192.168.58.10/24 brd 192.168.58.255 scope global ens33
valid_lft forever preferred_lft forever
inet 192.168.58.20/24 scope global secondary ens33:1
valid_lft forever preferred_lft forever
# 7. Decide whether to switch the PostgreSQL primary back to pgtest1
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/switchover -XPOST -d '{"leader":"pgtest2","candidate":"pgtest1"}'
2. Planned reboot of the PostgreSQL standby node (pgtest2) only
In this architecture, the PostgreSQL standby (pgtest2) is also the HAProxy and Keepalived backup node.
# 1. Stop keepalived and haproxy on pgtest2
[root@pgtest2 ~]# systemctl stop keepalived
[root@pgtest2 ~]# systemctl stop haproxy
# 2. Stop patroni and etcd on pgtest2; patroni shuts PostgreSQL down automatically
[root@pgtest2 ~]# systemctl stop patroni
[root@pgtest2 ~]# systemctl stop etcd
# 3. Reboot the pgtest2 server
[root@pgtest2 ~]# reboot
# 4. After the reboot, etcd, patroni, haproxy, keepalived and all exporters start automatically; when patroni starts and finds the database down, it brings it back up. Verify the database.
[root@pgtest2 ~]# etcdctl endpoint status --cluster -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.58.11:2379 | 3f414532c235ce16 | 3.5.1 | 20 kB | false | false | 6 | 489460 | 489460 | |
| http://192.168.58.12:2379 | 41cf8a739c2e9b50 | 3.5.1 | 20 kB | true | false | 6 | 489460 | 489460 | |
| http://192.168.58.10:2379 | caef4208a95efee8 | 3.5.1 | 25 kB | false | false | 6 | 489460 | 489460 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pgtest2 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Leader | running | 20 | |
| pgtest2 | 192.168.58.11 | Replica | running | 20 | 0 |
| pgtest3 | 192.168.58.12 | Replica | running | 20 | 0 |
+---------+---------------+---------+---------+----+-----------+
3. Planned reboot of the PostgreSQL standby node (pgtest3) only
# 1. Stop patroni and etcd on pgtest3; patroni shuts PostgreSQL down automatically
[root@pgtest3 ~]# systemctl stop patroni
[root@pgtest3 ~]# systemctl stop etcd
# 2. Reboot the pgtest3 server
[root@pgtest3 ~]# reboot
# 3. After the reboot, etcd, patroni, prometheus, grafana and all exporters start automatically; when patroni starts and finds the database down, it brings it back up. Verify the database.
[root@pgtest3 ~]# etcdctl endpoint status --cluster -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.58.11:2379 | 3f414532c235ce16 | 3.5.1 | 20 kB | false | false | 7 | 489858 | 489858 | |
| http://192.168.58.12:2379 | 41cf8a739c2e9b50 | 3.5.1 | 20 kB | false | false | 7 | 489858 | 489858 | |
| http://192.168.58.10:2379 | caef4208a95efee8 | 3.5.1 | 25 kB | true | false | 7 | 489858 | 489858 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pgtest3 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Leader | running | 20 | |
| pgtest2 | 192.168.58.11 | Replica | running | 20 | 0 |
| pgtest3 | 192.168.58.12 | Replica | running | 20 | 0 |
+---------+---------------+---------+---------+----+-----------+
4. Planned reboot of two of the three servers
With three etcd members, the cluster tolerates at most one member failure. If the etcd services on two servers are stopped, quorum is lost and the database falls back to read-only operation (Patroni demotes the leader when it can no longer update the DCS).
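The tolerance follows from Raft quorum arithmetic: an N-member etcd cluster keeps accepting writes only while floor(N/2)+1 members are up. A quick illustration:

```shell
#!/bin/sh
# Quorum = floor(N/2)+1, so tolerated failures = N - quorum.
# Note that 4 members tolerate no more failures than 3 do, which is
# why odd cluster sizes are the norm.
table=$(for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done)
echo "$table"
```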
5. Planned reboot of all servers
# 1. First stop the Keepalived VIP, disconnecting all client traffic
On the Keepalived backup node: systemctl stop keepalived
On the Keepalived master node: systemctl stop keepalived
# 2. On all servers: systemctl stop haproxy
# 3. Disable auto failover in patroni
[root@pgtest1 ~]# patronictl pause
# 4. On all servers: systemctl stop patroni
[root@pgtest2 ~]# patronictl list
+--------+------+------+-------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) ----+
+--------+------+------+-------+----+-----------+
Maintenance mode: on
# 5. On all servers: systemctl stop etcd
# 6. Stop the PostgreSQL primary:
[root@pgtest1 ~]# su - postgres
Last login: Sun Oct 31 14:17:52 CST 2021 on pts/4
[postgres@pgtest1 ~]$ pg_ctl stop -m fast -w -D $PGDATA
waiting for server to shut down.... done
server stopped
[postgres@pgtest1 ~]$
# 7. Stop the PostgreSQL standbys: pg_ctl stop -m fast -w -D $PGDATA
# 8. Power off all servers: init 0
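The eight shutdown steps above can be codified in a small wrapper. This is a hypothetical sketch, not part of the original setup: the run helper and DRY_RUN switch are illustrative, and each node would only run the steps that apply to it. With DRY_RUN=1 (the default) the commands are printed rather than executed:

```shell
#!/bin/sh
# Ordered full-cluster shutdown, following the steps above.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "WOULD RUN: $*"; else "$@"; fi
}
run systemctl stop keepalived   # 1. drop the VIP (backup node first, then master)
run systemctl stop haproxy      # 2. stop routing new connections (all servers)
run patronictl pause            # 3. disable auto failover (once, from any node)
run systemctl stop patroni      # 4. all servers
run systemctl stop etcd         # 5. all servers
# 6./7. stop PostgreSQL, primary first, then the standbys
run su - postgres -c 'pg_ctl stop -m fast -w -D $PGDATA'
run init 0                      # 8. power off
```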
Starting all servers
# 1. Start the primary's server first; etcd, patroni, haproxy, keepalived (VIP) and all exporters start automatically, but the database is not brought up while auto failover is still paused
# 2. Then start the two standby servers: node 2 auto-starts etcd, patroni, haproxy, keepalived and the exporters, node 3 auto-starts etcd, patroni, prometheus, grafana and the exporters; the databases are not brought up
# 3. If the databases should not serve traffic as soon as they come up, consider stopping all keepalived instances (VIP) and re-enabling them when needed
# 4. Resume auto failover; patroni then brings up all the databases, but which node ends up as Leader is not guaranteed. Patroni member tags (e.g. nofailover: true on a node) can be used to keep specific nodes from being promoted.
[root@pgtest1 ~]# patronictl resume
Success: cluster management is resumed
[root@pgtest1 ~]# patronictl list
+---------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg_cluster (7025023477017500881) --+----+-----------+
| pgtest1 | 192.168.58.10 | Replica | running | 23 | 0 |
| pgtest2 | 192.168.58.11 | Leader | running | 23 | |
| pgtest3 | 192.168.58.12 | Replica | running | 23 | 0 |
+---------+---------------+---------+---------+----+-----------+