
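Keepalived is written in pure C. The software is built around a central I/O multiplexer that provides realtime networking. Its design focuses on modularity between all elements. To ensure stability and robustness, the daemon is split into three separate processes. The overall design is based on a minimal parent process that forks and monitors its children. There are two child processes: one runs the VRRP framework, the other runs healthchecking.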


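Each child process has its own I/O scheduling multiplexer. This optimizes VRRP scheduling, because VRRP scheduling is more time-sensitive than healthchecking. The split design also minimizes the healthchecker's use of external libraries, keeps its own activity to a minimum, and leaves its main loop mostly idle, which avoids self-inflicted failures.

The parent's monitoring framework is called the watchdog. It works as follows: each child opens a UNIX domain socket and waits for requests; once the daemon has started, the parent connects to these sockets and periodically (every 5 s) sends hello packets to the children. If the parent cannot send a hello packet to a child's UNIX socket, it simply restarts that child. The watchdog design has two benefits. First, the hello packets from the parent to the connected child are scheduled through the I/O multiplexer, so the watchdog can detect a dead loop in a child's scheduling framework. Second, sysV signals can be used to detect whether a child has died. After the service starts, the process list shows three keepalived processes: the parent, the VRRP child, and the healthchecking child.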
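The watchdog logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not keepalived's C implementation: the function names (send_hello, child_alive, watchdog_tick), the hello payload, and the restart callback are all hypothetical.

```python
import os
import socket

HELLO = b"hello"  # assumed payload; the real wire format is not described here


def send_hello(sock: socket.socket) -> bool:
    """Try to send one hello; return False if the peer end is gone."""
    try:
        sock.sendall(HELLO)
        return True
    except (BrokenPipeError, ConnectionResetError):
        return False


def child_alive(pid: int) -> bool:
    """Probe a process with signal 0, which checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True


def watchdog_tick(sock: socket.socket, pid: int, restart) -> str:
    """One watchdog period: send hello, and restart the child if that fails.

    `restart` is a hypothetical callback that receives whether the old
    child process still exists, so the caller can clean it up first.
    """
    if send_hello(sock):
        return "ok"
    restart(child_alive(pid))
    return "restarted"
```

In a real daemon, watchdog_tick would be called from a 5-second timer in the parent's event loop, once per child.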

Control Plane
Keepalived is configured through the file keepalived.conf. Parsing follows a compiler design: the parser works with a keyword tree hierarchy that maps each configuration keyword to a specific handler. A central multi-level recursive function reads the configuration file and traverses the keyword tree. During parsing, the configuration file is translated into an internal memory representation.
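To make the keyword tree idea concrete, here is a minimal Python sketch. It assumes a simplified line-oriented syntax (one keyword per line, blocks opened by a trailing "{" and closed by a lone "}") and ignores quoted strings. KEYWORD_TREE, parse_block, apply_tree, and load are hypothetical names, and only two real keywords are registered.

```python
def parse_block(lines, pos=0):
    """Recursively parse lines[pos:] up to the matching '}'.

    Returns (entries, next_pos); each entry is (keyword, args, children).
    """
    block = []
    while pos < len(lines):
        words = lines[pos].split("#", 1)[0].split()  # strip comments
        pos += 1
        if not words:
            continue
        if words == ["}"]:
            return block, pos
        if words[-1] == "{":
            children, pos = parse_block(lines, pos)
            block.append((words[0], words[1:-1], children))
        else:
            block.append((words[0], words[1:], []))
    return block, pos


# Keyword tree: keyword -> (handler, sub-keywords). Handlers fill the
# in-memory configuration (a plain dict in this sketch).
KEYWORD_TREE = {
    "global_defs": (lambda cfg, args: None, {
        "router_id": (lambda cfg, args: cfg.update(router_id=args[0]), {}),
        "smtp_connect_timeout": (
            lambda cfg, args: cfg.update(smtp_connect_timeout=int(args[0])), {}),
    }),
}


def apply_tree(entries, keywords, config):
    """Walk the parsed entries and the keyword tree together."""
    for name, args, children in entries:
        if name not in keywords:
            raise ValueError(f"unknown keyword: {name}")
        handler, sub_keywords = keywords[name]
        handler(config, args)
        apply_tree(children, sub_keywords, config)


def load(text):
    """Translate configuration text into the in-memory representation."""
    entries, _ = parse_block(text.splitlines())
    config = {}
    apply_tree(entries, KEYWORD_TREE, config)
    return config
```

The design choice worth noting is that the handlers, not the parser, know what each keyword means, so adding a keyword only requires a new entry in the tree.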
Scheduler - I/O Multiplexer
All events are scheduled within the same process; each Keepalived process is single-threaded. Keepalived is network routing software, so it is very close to I/O. The design used here is a central select(...) that is in charge of scheduling all internal tasks. POSIX thread libraries are NOT used. This framework provides its own thread abstraction optimized for networking purposes.
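A select-based scheduler of this kind can be sketched as follows. This is a simplified Python illustration, not keepalived's scheduler: the Scheduler class and its methods are hypothetical, and only read events and one-shot timers are modelled. It assumes a POSIX system.

```python
import heapq
import select
import time


class Scheduler:
    """Single-threaded event loop: one select() call drives every task."""

    def __init__(self):
        self.readers = {}  # socket -> one-shot callback
        self.timers = []   # heap of (deadline, sequence, callback)
        self._seq = 0      # tie-breaker so callbacks are never compared

    def add_read(self, sock, callback):
        self.readers[sock] = callback

    def add_timer(self, delay, callback):
        self._seq += 1
        heapq.heappush(self.timers, (time.monotonic() + delay, self._seq, callback))

    def run_once(self):
        if not self.readers and not self.timers:
            return  # nothing scheduled; avoid blocking forever
        timeout = None
        if self.timers:
            # Sleep in select() no longer than the nearest timer deadline.
            timeout = max(0.0, self.timers[0][0] - time.monotonic())
        ready, _, _ = select.select(list(self.readers), [], [], timeout)
        for sock in ready:
            self.readers.pop(sock)(sock)
        now = time.monotonic()
        while self.timers and self.timers[0][0] <= now:
            _, _, callback = heapq.heappop(self.timers)
            callback()
```

Each callback plays the role of a lightweight "thread": it runs to completion and re-registers itself if it wants to be scheduled again.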
Memory Management
This framework provides access to generic memory management functions such as allocation, reallocation, and release. It can be used in two modes: normal_mode and debug_mode. In debug_mode it provides a strong way to track down and eradicate memory leaks. This low-level environment provides buffer under-run protection by tracking allocated and released memory. All buffers are fixed-length to guard against buffer overflows.
Core component
This framework defines common, global libraries that are used throughout the code: HTML parsing, linked list, timer, vector, string formatting, buffer dump, networking utilities, daemon management, PID handling, and a low-level TCP layer 4. The goal is to factor out as much code as possible, limiting duplication and increasing modularity.
WatchDog
This framework provides monitoring of the child processes (VRRP and healthchecking). Each child accepts connections on its own watchdog UNIX domain socket. The parent process sends "hello" messages to this socket. Hello messages are sent through the I/O multiplexer on the parent side and accepted and processed through the I/O multiplexer on the child side. If the parent detects a broken pipe, it uses a sysV signal to test whether the child is still alive, and restarts it.
Checkers
This is one of Keepalived's main functions. Checkers are in charge of realserver healthchecking. A checker tests whether a realserver is alive, and the test ends in a binary decision: remove the realserver from the LVS topology or add it. The internal checker design is realtime networking software that uses a fully multi-threaded FSM (Finite State Machine) design. The checker stack manipulates the LVS topology according to layer 4 to layer 5/7 test results. It runs in an independent process monitored by the parent process.
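The binary decision made by a checker can be illustrated with a layer 4 test. The Python sketch below is an assumption-laden simplification: tcp_check and apply_decision are hypothetical names, and a plain set stands in for the LVS topology that keepalived manipulates through IPVS.

```python
import socket


def tcp_check(host, port, timeout=3.0):
    """Layer 4 test: the realserver is alive if a TCP connection succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def apply_decision(pool, server, alive):
    """Binary decision: add the realserver to the pool or remove it."""
    if alive:
        pool.add(server)
    else:
        pool.discard(server)
```

A caller would run tcp_check for every realserver at each polling interval and pass the result to apply_decision.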
VRRP Stack
The other main Keepalived function. VRRP (Virtual Router Redundancy Protocol, RFC2338) focuses on director takeover; it provides a low-level design for router backup. Keepalived implements the full IETF RFC2338 standard with some provisions and extensions for LVS and firewall designs. It implements the vrrp_sync_group extension, which guarantees a persistent routing path after protocol takeover. It implements IPSEC-AH with MD5-96bit crypto provision to secure the exchange of protocol adverts. For more information on VRRP, please read the RFC. Important: the VRRP code can be used without LVS support; it has been designed for independent use. It runs in an independent process monitored by the parent process.
System Call
This framework offers the ability to launch extra system scripts. It is mainly used by the MISC checker. In the VRRP framework it provides the ability to launch extra scripts during protocol state transitions. The system call is made in a forked process so that it does not perturb the global scheduling timer.
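The idea of running a script in a separate process, so that the scheduler never blocks on it, can be sketched with Python's subprocess module. This is an illustration only; launch_script and poll_script are hypothetical names, and keepalived itself uses fork() in C.

```python
import subprocess
import sys


def launch_script(argv):
    """Start the script in a child process and return immediately."""
    return subprocess.Popen(argv)


def poll_script(proc):
    """Non-blocking check: the exit code if finished, None if still running."""
    return proc.poll()


if __name__ == "__main__":
    # Demo with a trivial command; a real setup would pass a script path.
    proc = launch_script([sys.executable, "-c", "raise SystemExit(0)"])
    proc.wait(timeout=30)
    print("exit code:", poll_script(proc))
```

The event loop would call poll_script from a timer instead of waiting, which keeps its scheduling timers accurate while the script runs.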
SMTP
The SMTP protocol is used for administrative notification. It implements IETF RFC821 using a multi-threaded FSM design. Administrative notifications are sent for healthchecker activity and VRRP protocol state transitions. SMTP is commonly used and can be interfaced with any other notification sub-system such as GSM-SMS or pagers.
IPVS wrapper
This framework is used for sending rules to the Kernel IPVS code. It provides translation between Keepalived internal data representation and IPVS rule_user representation. It uses the IPVS libipvs to keep generic integration with IPVS code.
Netlink Reflector
Same as the IPVS wrapper. Keepalived works with its own network interface representation. IP addresses and interface flags are set and monitored through the kernel Netlink channel. The Netlink messaging sub-system is used for setting VRRP VIPs. In addition, the Netlink kernel broadcast capability is used to reflect any interface-related event into Keepalived's internal userspace data representation. So any Netlink manipulation done by another userspace program is reflected in the Keepalived data representation via Netlink kernel broadcast (RTMGRP_LINK & RTMGRP_IPV4_IFADDR).
IPVS: the Linux kernel code provided by Wensong from the LinuxVirtualServer.org open source project.
Netlink: the Linux kernel code provided by Alexey Kuznetsov, with its very nice advanced routing framework and sub-system capabilities.
The global configuration has two sub-sections: global definitions (global_defs) and static address/route configuration (static_ipaddress / static_routes).

global_defs {
    notification_email {
    }
    notification_email_from
    smtp_server
    smtp_connect_timeout 30
    router_id LVS_DEVEL
}

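notification_email: the email addresses to notify when keepalived sees an event such as a failover. Multiple addresses are allowed, one per line.
smtp_server: the address of the SMTP server used to send mail. A local sendmail can be used here.
smtp_connect_timeout: the timeout for connecting to the SMTP server.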


static_ipaddress {
    brd + dev eth0 scope global
    brd + dev eth1 scope global
}
static_routes {
    src $SRC_IP to $DST_IP dev $SRC_DEVICE
    src $SRC_IP to $DST_IP via $GW dev $SRC_DEVICE
}

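These entries work the same way as configuring IP addresses and routes with system commands. For example, with the address 192.168.1.1/24, the entry ending in brd + dev eth0 scope global is equivalent to ip addr add 192.168.1.1/24 brd + dev eth0 scope global, which assigns the address to eth0. Routes work the same way.
This section configures the server's real IP addresses and routes. It may be needed in complex environments, but usually it is left out.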

VRRP configuration has three parts: VRRP synchronization groups, VRRP instances, and VRRP scripts.
1. VRRP synchronization group configuration

vrrp_sync_group VG_1 {
    group {
        http
        mysql
    }
    notify_master /path/to/
    notify_backup /path_to/
    notify_fault "/path/ VG_1"
    notify /path/to/
    smtp_alert
}

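group: the members of the VRRP group. http and mysql are instance names and must match the vrrp_instance names defined below.
notify_master: the script to run when the group transitions to the Master state.
notify_backup: the script to run when the group transitions to the Backup state.
smtp_alert: send an email notification to the addresses defined in global_defs when a transition happens.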

2. VRRP instance configuration

vrrp_instance http {
    state MASTER
    interface eth0
    dont_track_primary
    track_interface {
        eth0
        eth1
    }
    mcast_src_ip
    garp_master_delay 10
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    virtual_ipaddress {
        #/ brd dev scope label
    }
}

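state: the initial state of the instance. After startup, the Master is still chosen by priority election.
interface: the network interface the instance binds to. Virtual IPs must be added on an existing interface.
dont_track_primary: ignore VRRP interface faults.
track_interface: additional interfaces to monitor. If any one of them fails, the instance enters the FAULT state. For example, when nginx is used as a load balancer, both the internal and the external network must be working; if the internal network fails, the LB cannot operate, so both networks need to be health-checked.
mcast_src_ip: the source IP address for multicast packets, that is, the address VRRP advertisements are sent from. This is very important: choose a stable interface, since it plays the same role as the heartbeat port in heartbeat. If it is not set, the IP of the bound interface (the one given by interface) is used.
garp_master_delay: the delay before sending gratuitous ARP after the transition to the Master state.
virtual_router_id: the VRID. Nodes with the same VRID form one group, and the VRID determines the virtual router MAC address.
priority 100: the priority of this node.
advert_int: the VRRP advertisement interval, 1 second by default.
virtual_ipaddress: the virtual IP addresses (the LVS VIPs).
lvs_sync_daemon_interface: the interface the LVS sync daemon (syncd) binds to.
auth_type: the authentication type, either PASS or AH.
nopreempt: disable preemption. It can only be set on a node whose state is BACKUP, and it is meant for the node with the higher priority, so that this node does not take the Master role back when it recovers.
debug: the debug level.
notify_master: same meaning as in the synchronization group; it can also be set per instance.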
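The roles of virtual_router_id and priority can be illustrated with a short Python sketch based on RFC 2338 rather than on keepalived's code. The function names are hypothetical. The MAC format 00-00-5E-00-01-{VRID} and the tie-break on the higher primary IP address come from the RFC; by default keepalived may not install this MAC on the interface at all, so treat virtual_mac as protocol background only.

```python
import ipaddress


def virtual_mac(vrid):
    """Virtual router MAC address defined by RFC 2338: 00-00-5E-00-01-{VRID}."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be in the range 1-255")
    return f"00:00:5e:00:01:{vrid:02x}"


def elect_master(routers):
    """Pick the master among (name, priority, primary_ip) tuples.

    The highest priority wins; on a tie, the higher primary IP address wins.
    """
    best = max(routers, key=lambda r: (r[1], int(ipaddress.ip_address(r[2]))))
    return best[0]
```

For the instance above (virtual_router_id 51, priority 100), virtual_mac(51) gives 00:00:5e:00:01:33, and a peer with priority 90 would lose the election.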

3. VRRP scripts

vrrp_script check_running {
    script "/usr/local/bin/check_running"   # the script to run
    interval 10                             # interval between script runs
    weight 10                               # priority adjustment: 10 means priority +10, -10 means priority -10
    fall 2                                  # number of failed checks before the server is considered down
    rise 1                                  # number of successful checks before the server is considered up
}
vrrp_instance http {
    state BACKUP
    smtp_alert
    interface eth0
    virtual_router_id 101
    priority 90
    advert_int 3
    authentication {
        auth_type PASS
        auth_pass whatever
    }
    virtual_ipaddress {
    }
    track_script {
        check_running weight 20
    }
}

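First define the script name, the run interval, and the priority adjustment in a vrrp_script block, then reference the script from the instance. Note: vrrp_script blocks and vrrp_instance blocks sit at the same level of the configuration.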
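How rise, fall, and weight interact can be sketched in Python. This is a simplified model under stated assumptions, not keepalived's implementation: ScriptState and effective_priority are hypothetical names, a positive weight is assumed to be added while the script is up, and a negative weight is assumed to be applied while it is down.

```python
class ScriptState:
    """Track a script's up/down state using rise and fall thresholds."""

    def __init__(self, rise, fall, up=True):
        self.rise = rise
        self.fall = fall
        self.up = up
        self.streak = 0  # consecutive results that contradict the current state

    def record(self, ok):
        """Record one script result and return the resulting state."""
        if ok == self.up:
            self.streak = 0
            return self.up
        self.streak += 1
        needed = self.rise if ok else self.fall
        if self.streak >= needed:
            self.up = ok
            self.streak = 0
        return self.up


def effective_priority(base, weight, script_up):
    """Apply the script weight to the instance's base priority."""
    if weight > 0 and script_up:
        return base + weight
    if weight < 0 and not script_up:
        return base + weight
    return base
```

With the values above (priority 90, weight 20 in track_script), this model yields 110 while check_running succeeds and 90 once it has failed fall times in a row.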

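The LVS configuration is used for keepalived + LVS integration; if LVS is not used, this part can be skipped. Configuring LVS here does not mean setting up LVS and then configuring it with ipvsadm; LVS is configured through the keepalived configuration file.
The LVS configuration also has two parts: virtual server groups and virtual servers.
The virtual server group configuration is optional. Its main purpose is to let one service on a realserver belong to several virtual servers while being health-checked only once.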

virtual_server_group {
    # VIP port
    fwmark
}


A virtual server can be declared in any one of the following three ways:

1. virtual_server IP port
2. virtual_server fwmark int
3. virtual_server group string

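virtual_server 80 {                      # declare a virtual server: VIP:PORT
    delay_loop 6                         # polling interval for the service
    lb_algo rr|wrr|lc|wlc|lblc|sh|dh     # LVS scheduling algorithm
    lb_kind NAT|DR|TUN                   # LVS forwarding mode
    nat_mask
    persistence_timeout 50               # session persistence time (seconds)
    persistence_granularity              # LVS persistence granularity, the -M option of ipvsadm; the default is 0xffffffff, i.e. persistence per client
    protocol TCP                         # protocol
    sorry_server                         # fallback server, used when all backend realservers are unavailable
    real_server {                        # settings for a backend real server, such as its weight
        weight 1
        inhibit_on_failure               # on failure, set the weight to 0 instead of removing the server from IPVS
        notify_up |                      # script to run when the server is detected as up
        notify_down |                    # script to run when the server is detected as down
        HTTP_GET|SSL_GET {               # health check method
            url {                        # URL to check; several url blocks are allowed
                path                     # URI path of the test page
                digest                   # expected digest of the page
                status_code              # expected status code
            }
            connect_port                 # port to connect to for the check
            bindto                       # local address to bind to when originating the check connection
            connect_timeout              # connection timeout
            nb_get_retry                 # number of retries
            delay_before_retry           # delay between retries
        }
        TCP_CHECK {
            connect_port
            bindto
            connect_timeout
        }
        SMTP_CHECK {
            host {
                connect_ip
                connect_port
                bindto
            }
            connect_timeout
            retry
            delay_before_retry
            helo_name |
        }
        MISC_CHECK {
            misc_path |                  # external program or script
            misc_timeout                 # timeout for the script or program
            misc_dynamic                 # adjust the weight dynamically from the exit code of the program or script, so that the weight follows the real backend load
                                         # exit code 0: check passed, weight unchanged
                                         # exit code 1: check failed, weight set to 0
                                         # exit code 2-255: check passed, weight set to exit code - 2
        }
    }
}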
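The misc_dynamic exit code rules listed in the comments can be written as a small function. This Python sketch only restates those three rules; apply_misc_exit_code is a hypothetical name and is not part of keepalived.

```python
def apply_misc_exit_code(exit_code, current_weight):
    """Map a MISC_CHECK exit code to (healthy, weight) under misc_dynamic."""
    if exit_code == 0:
        return True, current_weight   # check passed, weight unchanged
    if exit_code == 1:
        return False, 0               # check failed, weight set to 0
    if 2 <= exit_code <= 255:
        return True, exit_code - 2    # check passed, weight derived from the exit code
    raise ValueError(f"exit code out of range: {exit_code}")
```

For example, a script that exits with 12 keeps the realserver in service with a weight of 10, which lets the script lower the weight as backend load rises.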
