Focus on Linux and System Architecture: 08/08/13

Thursday, August 8, 2013

Linux monitoring programs: how to restart a program automatically

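1) The exec family of functions replaces the current process image with a new program, specified by the path or file argument. exec can therefore be used to switch execution from one program to another.

2) fork creates a new process by adding a new entry to the process table. The creator (the parent process) carries on with its original flow, while the child follows its own flow of control.

3) wait: once fork has started a child process, the child has its own life cycle and runs independently. The parent can call wait to block until the child terminates.

With that background the solution should be apparent: 1) call fork to create a child process; 2) in the child, call exec to run the program that needs to be restarted automatically; 3) in the parent, call wait to block until the child terminates, then create a new child.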


#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
    int status;
    char **child_argv;
    pid_t pid;

    if (argc < 2) {
        fprintf(stderr, "Usage:%s <exe_path> <args...>\n", argv[0]);
        return -1;
    }
    /* argv[argc] is always NULL, so argv + 1 is already a NULL-terminated
     * argument vector for the child; no copy (and no fixed-size array) is needed. */
    child_argv = argv + 1;

    while (1) {
        pid = fork();
        if (pid == -1) {
            fprintf(stderr, "fork() error. errno:%d error:%s\n", errno, strerror(errno));
            break;
        }
        if (pid == 0) {
            /* Child: replace this process image with the monitored program. */
            execv(child_argv[0], child_argv);
            /* execv returns only on failure. Exit here; continuing the loop
             * would make the child start forking as well. */
            fprintf(stderr, "execv errno:%d error:%s\n", errno, strerror(errno));
            _exit(127);
        }
        /* Parent: block until the child exits, then loop and start it again. */
        pid = wait(&status);
        fprintf(stdout, "wait return\n");
    }

    return 0;
}
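
For comparison, the fork/exec/wait cycle that the C program implements can also be sketched directly in shell, where `cmd &` starts the child and `wait` blocks until it exits. This is only an illustrative sketch: the function name `supervise` and the restart limit `max_restarts` are assumptions introduced here so the example terminates, whereas the C program loops forever.

```shell
#!/bin/sh
# supervise MAX CMD [ARGS...]
# Run CMD, wait for it to exit, and start it again, at most MAX times.
# MAX is a hypothetical safety limit added for this sketch.
supervise() {
    max_restarts=$1
    shift
    runs=0
    while [ "$runs" -lt "$max_restarts" ]; do
        "$@" &                 # start the child in the background (fork + exec)
        child=$!
        wait "$child"          # block until that child exits
        echo "child $child exited with status $?"
        runs=$((runs + 1))
    done
}

# Example: a child that exits immediately is started three times.
supervise 3 sh -c 'echo "child running"'
```

Because the parent blocks in `wait`, the restart happens as soon as the child exits rather than at the next polling interval.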

The shell-script version is as follows:


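# Function: CheckProcess
# Purpose:  check whether a process exists
# Argument: $1 --- name of the process to check
# Returns:  0 if exactly one instance is running, otherwise 1.
#------------------------------------------------------------------------------
CheckProcess()
{
  # Validate the argument
  if [ "$1" = "" ];
  then
    return 1
  fi

  # PROCESS_NUM is the number of processes matching the given name. Exactly 1
  # means healthy (return 0); anything else is an error that needs a restart (return 1).
  PROCESS_NUM=`ps -ef | grep "$1" | grep -v "grep" | wc -l`
  if [ "$PROCESS_NUM" -eq 1 ];
  then
    return 0
  else
    return 1
  fi
}


# Check whether an instance of test is already running
while true ; do
  CheckProcess "test"
  CheckQQ_RET=$?
  if [ $CheckQQ_RET -eq 1 ];
  then
    # Kill all test processes; replace this with whatever action you need
    killall -9 test
    ./test &
  fi
  sleep 1
done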

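Linux shell process monitoring and automatic restart: a clear approach

Notes:
(1) ps aux    lists every process on the system, one per line
(2) grep "abc"    reads from standard input and prints the lines that contain the string "abc"
(3) grep -v "abc"    reads from standard input and prints the lines that do not contain the string "abc"
(4) wc -l    reads from standard input and prints the number of lines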

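Checking whether the httpd process exists

The procedure is:
(1) List all processes on the system
(2) Check whether any entry contains the given process name
Connected with pipes, the command is:

ps axu | grep "httpd" | grep -v "grep" | wc -l
all processes -> keep the lines containing "httpd" -> drop the grep process itself -> print the final line count

If the command prints 0 the process is not running; any other value means it exists.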
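
To see what each stage of that pipeline contributes, here is a small sketch that runs the same `grep | grep -v | wc -l` chain on a canned listing instead of live `ps` output. The function names `count_procs` and `sample_listing` and the sample data are made up for illustration.

```shell
#!/bin/sh
# count_procs NAME: read a process listing on stdin and print how many
# lines mention NAME, ignoring the grep command itself.
count_procs() {
    grep "$1" | grep -v "grep" | wc -l | tr -d ' '
}

# Canned listing standing in for `ps axu` output (made-up sample data).
sample_listing() {
    cat <<'EOF'
root      1001  0.0  0.1 /usr/sbin/httpd -k start
www       1002  0.0  0.1 /usr/sbin/httpd -k start
user      2001  0.0  0.0 grep httpd
root       900  0.0  0.0 /usr/sbin/sshd
EOF
}

sample_listing | count_procs httpd    # two httpd lines; the grep line is dropped
sample_listing | count_procs nginx    # nothing matches, so the count is 0
```

The `tr -d ' '` stage only strips the padding that some `wc` implementations add, so the result can be compared as a plain number.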

The script:
#!/bin/sh
count=`ps axu | grep "httpd" | grep -v "grep" | wc -l`
if [ "$count" -lt 1 ]; then
sudo /home/proudboy/apache/admin/restart.sh
fi

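Note: you can also run ps axu | grep "httpd" | grep -v "grep" and test whether its exit status is 0 to learn whether it printed anything (the last grep exits with a non-zero status when it outputs no lines):
#!/bin/sh
count=`ps axu | grep "httpd" | grep -v "grep"`
if [ "$?" != "0" ]; then
sudo /home/proudboy/apache/admin/restart.sh
fi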
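
A common variation avoids the extra `grep -v "grep"` stage by wrapping the first character of the name in brackets. The sketch below shows the idea on canned input; the helper names `bracket_pattern` and `is_listed` are hypothetical, and it assumes the process name is non-empty and starts with an ordinary letter.

```shell
#!/bin/sh
# bracket_pattern NAME: turn "httpd" into "[h]ttpd". The regex still matches
# the text "httpd", but a grep command line that contains "[h]ttpd" is not
# matched by it, so grep no longer finds itself in the listing.
bracket_pattern() {
    rest=${1#?}              # NAME without its first character
    first=${1%"$rest"}       # the first character only
    printf '[%s]%s\n' "$first" "$rest"
}

# is_listed NAME: succeed if a listing on stdin contains NAME.
is_listed() {
    grep -q "$(bracket_pattern "$1")"
}

# Made-up listing: one real httpd line plus the grep command itself.
printf '%s\n' 'root 1001 /usr/sbin/httpd' 'user 2001 grep [h]ttpd' | is_listed httpd \
    && echo "httpd is running"

# Only the grep command is present here, so the check fails.
printf '%s\n' 'user 2001 grep [h]ttpd' | is_listed httpd \
    || echo "httpd is not running"
```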

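The next question is how to run the shell script periodically. There are two ways:
(1) Loop inside the shell script, for example:
#!/bin/sh
while true; do
ps axu | grep "httpd" | grep -v "grep" > /dev/null
if [ "$?" != "0" ]; then
sudo /home/proudboy/apache/admin/restart.sh
fi
sleep 2
done
(2) Add the shell script to crontab or at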

The following shell script monitors the tomcat6 process and restarts it automatically if it is not running.

#!/bin/sh
pid=`ps aux | grep "tomcat6" | grep -v grep | sed -n '1p' | awk '{print $2}'`
if [ -z "$pid" ]; then
        echo "begin restart, please wait..."
        sudo /etc/init.d/tomcat6 restart
        exit 1
else
        echo "exists, no restart needed"
fi
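
The PID extraction in that script (`sed -n '1p'` keeps the first matching line, `awk '{print $2}'` prints its second column) can be tried on canned input. This is a sketch with made-up sample data, and `first_pid` is a hypothetical helper name.

```shell
#!/bin/sh
# first_pid NAME: read a `ps aux`-style listing on stdin and print the PID
# (second column) of the first line that mentions NAME, skipping grep itself.
first_pid() {
    grep "$1" | grep -v grep | sed -n '1p' | awk '{print $2}'
}

# Made-up listing standing in for real `ps aux` output.
listing='tomcat6   4321  1.2  9.8 /usr/bin/java -Dcatalina.base=/var/lib/tomcat6
tomcat6   4400  0.0  0.1 /usr/bin/java -Dcatalina.base=/var/lib/tomcat6
user      5000  0.0  0.0 grep tomcat6'

pid=$(printf '%s\n' "$listing" | first_pid tomcat6)
if [ -z "$pid" ]; then
    echo "not running"
else
    echo "running with PID $pid"
fi
```

Quoting `"$pid"` in the `-z` test matters: with an empty, unquoted value the test would be evaluated with a missing operand.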

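Linux process monitoring and automatic restart: a simple implementation

Goal: a server program on Linux can crash (core dump) for all sorts of reasons, which affects users. This post provides a simple process monitoring and restart facility.

How it works: a crontab entry runs a script periodically; the script uses ps to check whether the process exists and, if it does not, restarts it and writes a log entry.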

1. Edit the crontab

chen@IED_40_125_sles10sp1:~/CandyAT/Bin> crontab -e
*/1 * * * * /home/chen/CandyAT/Bin/monitor.sh

The entry above runs monitor.sh once every minute.

2. Implementing monitor.sh

#!/bin/sh

host_dir=$HOME                                          # current user's home directory
proc_name="CandyGameServer"                             # process name
file_name="/Candy/log/cron.log"                         # log file, relative to $host_dir
pid=0

proc_num()                                              # count matching processes
{
    # Print the count instead of using "return", whose value is limited to 0-255
    ps -ef | grep "$proc_name" | grep -v grep | wc -l
}

proc_id()                                               # look up the process ID
{
    pid=`ps -ef | grep "$proc_name" | grep -v grep | awk '{print $2}'`
}

number=`proc_num`
if [ "$number" -eq 0 ]                                  # is the process missing?
then
    cd "$host_dir/CandyAT/Bin/" && ./candy.sh -DZone    # restart command; adjust as needed
    proc_id                                             # get the new process ID
    echo "${pid}, `date`" >> "$host_dir$file_name"      # log the new PID and the restart time
fi

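Linux monitoring programs: how to restart a program automatically (repost)

Reposted from: http://www.cnblogs.com/zhy113/archive/2013/03/15/2960910.html

When writing a server, no matter how robust you make it, abnormal exits such as core dumps still happen regularly. In most cases the server must be restarted automatically, without human intervention, so that it keeps serving users. That calls for a monitoring program that can restart it. I ran into this problem while writing portmap and, after searching online, found a reasonably reliable solution based on exec+fork.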

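Restarting automatically with a script

The simplest thing that comes to mind is a shell script. The rough idea:

ps -ef | grep "$1" | grep -v "grep" | wc -l returns the number of processes matching $1 (the process name), and the script decides what to do next based on that count. An infinite loop checks the count for the given program once per second; crontab can be used instead.

This method is crude but basically solves the problem, at the cost of a delay of up to 1 s. I did not use it in my application; for the shell script itself, see the code attached at the end of this article.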

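The exec+fork approach

I ended up using exec+fork. The idea is as follows:

1. The exec family of functions replaces the current process image with a new program, specified by the path or file argument. exec can therefore be used to switch execution from one program to another.

2. fork creates a new process by adding a new entry to the process table. The creator (the parent process) carries on with its original flow, while the child follows its own flow of control.

3. wait: once fork has started a child process, the child has its own life cycle and runs independently. The parent can call wait to block until the child terminates.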

Usage:

# ./portmap  path of the program to monitor
# args       arguments that portmap needs
$ ./supervisor ./portmap  args.....

The code:

/**
 *
 * supervisor 
 *
 * author: liyangguang (liyangguang@software.ict.ac.cn)
 *
 * date: 2011-01-21 21:04:01
 *
 * changes
 * 1, execl to execv
 */
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int status;
    char **child_argv;
    pid_t pid;

    if (argc < 2) {
        fprintf(stderr, "Usage:%s <exe_path> <args...>\n", argv[0]);
        return -1;
    }
    /* argv[argc] is always NULL, so argv + 1 is already a NULL-terminated
     * argument vector for the child; no copy (and no fixed-size array) is needed. */
    child_argv = argv + 1;

    while (1) {
        pid = fork();
        if (pid == -1) {
            fprintf(stderr, "fork() error. errno:%d error:%s\n", errno, strerror(errno));
            break;
        }
        if (pid == 0) {
            /* Child: replace this process image with the monitored program. */
            execv(child_argv[0], child_argv);
            /* execv returns only on failure. Exit here; continuing the loop
             * would make the child start forking as well. */
            fprintf(stderr, "execv errno:%d error:%s\n", errno, strerror(errno));
            _exit(127);
        }
        /* Parent: block until the child exits, then loop and start it again. */
        pid = wait(&status);
        fprintf(stdout, "wait return\n");
    }
    return 0;
}

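The shell-script version is as follows:

# Function: CheckProcess
# Purpose:  check whether a process exists
# Argument: $1 --- name of the process to check
# Returns:  0 if exactly one instance is running, otherwise 1.
#------------------------------------------------------------------------------
CheckProcess()
{
  # Validate the argument
  if [ "$1" = "" ];
  then
    return 1
  fi

  # PROCESS_NUM is the number of processes matching the given name. Exactly 1
  # means healthy (return 0); anything else is an error that needs a restart (return 1).
  PROCESS_NUM=`ps -ef | grep "$1" | grep -v "grep" | wc -l`
  if [ "$PROCESS_NUM" -eq 1 ];
  then
    return 0
  else
    return 1
  fi
}

# Check whether an instance of test is already running
while true ; do
  CheckProcess "test"
  CheckQQ_RET=$?
  if [ $CheckQQ_RET -eq 1 ];
  then
    # Kill all test processes; replace this with whatever action you need
    killall -9 test
    ./test &
  fi
  sleep 1
done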

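A shell script that watches a process on Linux and restarts it after it dies

The shell script in this post uses a while-do loop and ps -ef | grep to check whether the loader process is running. If it is not, the script starts it, so that a process that has crashed is restarted promptly.

The script: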


#!/bin/sh
while :
do
echo "Current DIR is " $PWD
stillRunning=$(ps -ef |grep "$PWD/loader" |grep -v "grep")
if [ "$stillRunning" ] ; then
echo "TWS service was already started by another way"
echo "Kill it and then startup by this shell, otherwise this shell will loop out this message annoyingly"
kill -9 $(pidof "$PWD/loader")
else
echo "TWS service was not started"
echo "Starting service ..."
$PWD/loader
echo "TWS service was exited!"
fi
sleep 10
done

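Notes:
1. When grepping ps output for a process, include its path; otherwise the match is unreliable, because unrelated command lines that merely contain the name are matched too.
2. Use -v to remove the grep command itself from the result; otherwise the result is never empty.

If the script finds the process already running when it starts, the process was launched by some other means rather than by this script, and the script would keep reporting that it found the process.
Solution:
Start the service only through this script, or kill any instance that was started another way as soon as it is found. The following line in the script above does that:
kill -9 $(pidof "$PWD/loader")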

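Monitoring and restarting system processes on Linux

1. Monitoring system processes with monit

monit is a powerful tool for *nix platforms that monitors system status, processes, files, directories, and devices. It can automatically restart programs that have died, which makes it well suited to watching critical processes and resources (it ships with a web interface), such as nginx, apache, mysql, and CPU usage. For monitoring and managing Python processes, supervisor and zdaemon are commonly used.

Installation, configuration, and startup of monit are covered below.

Installation

On Debian or Ubuntu, installing monit is easy. Run

sudo apt-get install monit

and you are done. On other *nix systems it is also simple: download the source and go through the usual three steps.

./configure
make
make install

After installation the default configuration file is /etc/monit/monitrc.

Configuration

Add the processes and other items you want monitored to monit's configuration file. See the sample file below for details.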

##
## Sample monit configuration file. Notes:
## 1. example.com is used as the domain name.
## 2. Names ending in xxx are placeholders; change them to suit your setup.
##
###############################################################################
## Monit control file
###############################################################################
#
# Check interval. The default is 2 minutes, which is a bit long for a website;
# adjust as needed. Here it is set to 30 seconds.
set daemon  30

# Log file
set logfile /var/log/monit.log

#
# Mail server for notifications
#
#set mailserver mail.example.com
set mailserver localhost

#
# Format of the notification mail; the default format is shown below for reference
#
## Monit by default uses the following alert mail format:
##
## --8<--
## From: monit@$HOST                         # sender
## Subject: monit alert --  $EVENT $SERVICE  # subject
##
## $EVENT Service $SERVICE                   #
##                                           #
##  Date:        $DATE                       #
##  Action:      $ACTION                     #
##  Host:        $HOST                       # body
##  Description: $DESCRIPTION                #
##                                           #
## Your faithful employee,                   #
## monit                                     #
## --8<--
##
## You can override the alert message format or its parts such as subject
## or sender using the MAIL-FORMAT statement. Macros such as $DATE, etc.
## are expanded on runtime. For example to override the sender:
#
# Kept simple: only the sender is changed here; modify other parts if you need to.
set mail-format { from: webmaster@example.com }

# Recipient of alert mail. Sending to Gmail is suggested because it makes filtering easy.
set alert userxxx@gmail.com

set httpd port 2812 and            # port of the HTTP monitoring page
     use address www.example.com   # IP or domain name of the HTTP monitoring page
     allow localhost               # allow local access
     allow 58.68.78.0/24           # allow this IP range
     ##allow 0.0.0.0/0.0.0.0       # allow any IP range; not recommended
     allow userxxx:passwordxxx     # user name and password for access

###############################################################################
## Services
###############################################################################
#
# Overall system health monitoring. The defaults are fine; fine-tune them yourself if needed.
#
# System name; can be an IP address or a domain name
check system www.example.com
    if loadavg (1min) > 4 then alert
    if loadavg (5min) > 2 then alert
    if memory usage > 75% then alert
    if cpu usage (user) > 70% then alert
    if cpu usage (system) > 30% then alert
    if cpu usage (wait) > 20% then alert

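#
# Monitor nginx
#
# The pid file of the process must be given
check process nginx with pidfile /var/run/nginx.pid
    # Command line that starts the process. Note: it must be the full path
    start program = "/etc/init.d/nginx start"
    # Command line that stops the process
    stop program  = "/etc/init.d/nginx stop"
    # nginx health test: restart automatically when nginx cannot be reached
    if failed host www.example.com port 80 protocol http then restart
    # After repeated failed restarts, stop trying; this indicates a serious system error
    if 3 restarts within 5 cycles then timeout
    # Optional: group membership
    group server
#   Optional monitoring of the SSL port, if there is one
#    if failed port 443 type tcpssl protocol http
#       with timeout 15 seconds
#       then restart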

#
# Monitor apache
#
check process apache with pidfile /var/run/apache2.pid
    start program = "/etc/init.d/apache2 start"
    stop program  = "/etc/init.d/apache2 stop"
    # apache is heavy on CPU and memory, so add some extra checks for those
    if cpu > 50% for 2 cycles then alert
    if cpu > 70% for 5 cycles then restart
    if totalmem > 1500 MB for 10 cycles then restart
    if children > 250 then restart
    if loadavg(5min) greater than 10 for 20 cycles then stop
    if failed host www.example.com port 8080 protocol http then restart
    if 3 restarts within 5 cycles then timeout
    group server
    # Optional: depends on nginx
    depends on nginx

#
# Monitor the spawn-fcgi process (that is, the FastCGI process)
#
check process spawn-fcgi with pidfile /var/run/spawn-fcgi.pid
    # spawn-fcgi writes a pid file only when given the -P option; by default there is none
    start program = "/usr/bin/spawn-fcgi -a 127.0.0.1 -p 8081 -C 10 -u userxxx -g groupxxx -P /var/run/spawn-fcgi.pid -f /usr/bin/php-cgi"
    stop program = "/usr/bin/killall /usr/bin/php-cgi"
    # FastCGI does not speak HTTP and monit's protocol parameter has no CGI setting, so simply leave out protocol http here.
    if failed host 127.0.0.1 port 8081 then restart
    if 3 restarts within 5 cycles then timeout
    group server
    depends on nginx

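Although the comments explain things in detail, a few points deserve emphasis:

Commands in the start and stop program parameters must use full paths, otherwise monit cannot start properly. For example, killall should be written as /usr/bin/killall.
Many people use spawn-fcgi to manage PHP FastCGI processes, but spawn-fcgi itself can also die, so it still needs to be monitored by monit. spawn-fcgi writes a pid file only when given the -P option. Also, FastCGI does not speak HTTP and monit's protocol parameter has no CGI setting, so the check only works once the protocol http setting is removed.
After repeated failed restarts monit stops trying to restart the process. Receiving such a notification mail means the system has a serious problem that deserves attention and prompt manual intervention.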
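
The pid files mentioned above are how a `check process ... with pidfile` entry identifies the process to watch. As a rough illustration of that kind of check (not monit's actual implementation), the following shell sketch reads a pid file and uses `kill -0` to test whether the process is alive. The function name `pidfile_alive` is hypothetical, and the sketch assumes the process belongs to the current user, since `kill -0` also fails when permission is denied.

```shell
#!/bin/sh
# pidfile_alive FILE: succeed if FILE holds the PID of a live process.
# `kill -0` sends no signal; it only tests whether the PID can be signalled.
pidfile_alive() {
    [ -r "$1" ] || return 1            # no pid file: treat as not running
    pid=$(cat "$1")
    case $pid in
        ''|*[!0-9]*) return 1 ;;       # empty or non-numeric content
    esac
    kill -0 "$pid" 2>/dev/null
}

# Example with a temporary pid file that records this shell's own PID.
tmp_pidfile=$(mktemp)
echo "$$" > "$tmp_pidfile"
if pidfile_alive "$tmp_pidfile"; then
    echo "process $(cat "$tmp_pidfile") is alive"
fi
rm -f "$tmp_pidfile"
```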

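Besides managing processes, monit can also monitor files, directories, devices, and more. That is outside the scope of this post; see the official monit documentation for the configuration details.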

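Start, stop, restart

The standard start, stop, and restart commands:

sudo /etc/init.d/monit start

sudo /etc/init.d/monit stop

sudo /etc/init.d/monit restart

If you see the expected status message, all is well. If you run into problems, check the log file named in the configuration, for example /var/log/monit.log.

Judging from how my servers have run over the past few years (based on the notification mails monit sent), nginx almost never died, but problems with apache or FastCGI were fairly common. Put monit in charge of your servers to improve their stability and say goodbye to errors like 502 Bad Gateway.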


2. Monitoring system processes with process-monitor

Git repository: https://github.com/russells/process-monitor

Installation:

cd russells-process-monitor

make

make install

Running:

process-monitor -d /etc/init.d/network start

You can watch its status with tail -f /var/log/messages

Help

# process-monitor -h
Usage: process-monitor [args] [--] childpath [child_args...]
       process-monitor -P <pipe> --command=stop|start|exit|hup|int
  -C|--clear-env              Clear the environment before setting the vars
                              specified with -E
  -c|--command <command>      Make a running process-monitor react to
                              <command>
  -D|--dir <dirname>          Change to <dirname> before starting child
  -d|--daemon                 Go into the background
                                (changes some signal handling behaviour)
  -E|--env <var=value>        Environment var for child process
                                (can use multiple times)
  -e|--email <addr>           Email when child restarts
                                (not implemented)
  -h|--help                   This message
  -L|--child-log-name <name>  Name to use in messages that come from the
                               child process
  -l|--log-name <name>        Name to use in our own messages
  -M|--max-wait-time <time>   Maximum time between child starts
  -m|--min-wait-time <time>   Minimum time between child starts
                                (seconds, cannot be less than 1)
  -P|--command-pipe <pipe>    Open named pipe <pipe> to receive commands
  -p|--pid-file <file>        Write PID to <file>, if in the background
  -u|--user <user>            User to run child as (name or uid)
                                (can be user:group)
  -- is required if childpath or any of child_args begin with -

Reference: http://feilong.me/2011/02/monitor-core-processes-with-monit
