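Troubleshooting 504 errors on production requests

Our production system recently started returning occasional 504 timeouts and felt increasingly sluggish. These notes record how the problem was investigated.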

Linux & Nginx

Linux

The Nginx error log contained the following warning:

worker_connections exceed open file resource limit: 1024

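The warning means that worker_connections is larger than the open-file limit Linux imposes on the process (1024 here).

Environment:

CentOS 7.4+, Amazon Linux

Check the maximum number of files a process may open with ulimit -n; it printed 1024.

To raise the per-process open-file limit, open /etc/security/limits.conf and append the following lines: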

* soft nproc 65535
* hard nproc 65535
* soft nofile 65535
* hard nofile 65535

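This raises both the maximum number of processes (nproc) and the maximum number of open files (nofile) to 65535.

limits.conf is applied at the next login. To raise the limit right away, also run ulimit -n 65535, which only affects the current shell session and the processes started from it.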
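
To confirm which values are actually in force, it can help to print the soft and hard limits side by side. A minimal sketch; show_nofile_limits is only an illustrative helper name, not a standard command:

```shell
# Print the soft and hard open-file limits of the current shell.
# The soft limit is the one enforced; it can be raised up to the hard limit.
show_nofile_limits() {
    echo "soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
}

show_nofile_limits
```

Running it once before and once after logging in again shows whether the limits.conf change was picked up.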

Nginx

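After raising the Linux limit, Nginx's own per-process open-file limit must be adjusted as well, together with worker_processes and worker_connections.

worker_processes: the number of worker processes Nginx starts. A running instance normally consists of one master process and n worker processes.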

ps -elf | grep nginx

Example output:

[root@localhost nginx]#  ps -elf | grep nginx
4 S root       2203   2031  0  80   0 - 46881 wait   22:18 pts/0    00:00:00 su nginx
4 S nginx      2204   2203  0  80   0 - 28877 wait   22:18 pts/0    00:00:00 bash
5 S root       2252      1  0  80   0 - 11390 sigsus 22:20 ?        00:00:00 nginx: master process /usr/local/nginx/sbin/nginx -c /usr/local/nginx/conf/nginx.conf
5 S nobody     2291   2252  0  80   0 - 11498 ep_pol 22:23 ?        00:00:00 nginx: worker process
5 S nobody     2292   2252  0  80   0 - 11498 ep_pol 22:23 ?        00:00:00 nginx: worker process
5 S nobody     2293   2252  0  80   0 - 11498 ep_pol 22:23 ?        00:00:00 nginx: worker process
5 S nobody     2294   2252  0  80   0 - 11498 ep_pol 22:23 ?        00:00:00 nginx: worker process
0 R root       2312   2299  0  80   0 - 28166 -      22:24 pts/0    00:00:00 grep --color=auto nginx

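The output above shows one nginx master process and four nginx worker processes. The master process reads the configuration, opens the listening sockets, and supervises the workers; the worker processes accept and handle the actual requests. worker_processes is usually set to the number of CPU cores or a multiple of it.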
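
A quick way to get a starting value for worker_processes is to ask the operating system how many CPU cores are online. This is a sketch that assumes nproc or getconf is available; nginx also accepts worker_processes auto; which detects the core count itself.

```shell
# Number of online CPU cores; fall back to getconf if nproc is missing.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# Print a matching nginx directive as a suggestion.
echo "worker_processes $cores;"
```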

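worker_connections: the maximum number of connections a single worker process may hold open at the same time. The count includes every connection, whether opened by a client or opened by Nginx itself (for example to a proxied backend). Note that each connection a worker opens consumes a file descriptor, so this value is also bounded by the operating system's per-process open-file limit.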
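
The relationship between the two directives can be made concrete with some rough arithmetic. The sketch below is a rule of thumb, not an exact formula: it assumes that, when Nginx works as a reverse proxy, each client request also holds one upstream connection, so the number of clients served is about half the total. estimate_capacity is a hypothetical helper name.

```shell
# estimate_capacity WORKER_PROCESSES WORKER_CONNECTIONS
# Prints the upper bound on simultaneous connections and a rough
# estimate of proxied clients (client + upstream = 2 connections each).
estimate_capacity() {
    procs=$1
    conns=$2
    total=$((procs * conns))
    echo "total_connections=$total proxied_clients=$((total / 2))"
}

estimate_capacity 4 65535
# prints: total_connections=262140 proxied_clients=131070
```

Whatever value is chosen, the per-process open-file limit has to be at least as large, otherwise the warning from the error log comes back.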

Set the __maximum open files per process__ at the nginx level:

user root root; 
worker_processes 4; 
worker_rlimit_nofile 65535;

events { 
        use epoll; 
        worker_connections 65535; 
}

Run nginx -s reload to reload the configuration and apply the change.
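
Because a syntax error in nginx.conf makes the reload fail, it is safer to test the configuration first. A minimal sketch using the standard nginx -t check; safe_reload is a hypothetical wrapper name, and it assumes nginx is on the PATH and the caller has sufficient privileges:

```shell
# Reload nginx only if the configuration passes the syntax test.
safe_reload() {
    if ! command -v nginx >/dev/null 2>&1; then
        echo "nginx not found in PATH" >&2
        return 1
    fi
    # nginx -t validates the config; reload runs only when it succeeds.
    nginx -t && nginx -s reload
}
```

Call safe_reload after editing the configuration instead of running nginx -s reload directly.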

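Verifying the change

First run ulimit -n to check that the system limit has changed.

Next, verify that the nginx setting has taken effect. On Linux, the kernel exposes the resource limits of every process in /proc/<pid>/limits.

Start by listing the processes: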

[root@localhost nginx]#  ps -elf | grep nginx
4 S root       2203   2031  0  80   0 - 46881 wait   22:18 pts/0    00:00:00 su nginx
4 S nginx      2204   2203  0  80   0 - 28877 wait   22:18 pts/0    00:00:00 bash
5 S root       2252      1  0  80   0 - 11390 sigsus 22:20 ?        00:00:00 nginx: master process /usr/local/nginx/sbin/nginx -c /usr/local/nginx/conf/nginx.conf
5 S nobody     2291   2252  0  80   0 - 11498 ep_pol 22:23 ?        00:00:00 nginx: worker process
5 S nobody     2292   2252  0  80   0 - 11498 ep_pol 22:23 ?        00:00:00 nginx: worker process
5 S nobody     2293   2252  0  80   0 - 11498 ep_pol 22:23 ?        00:00:00 nginx: worker process
5 S nobody     2294   2252  0  80   0 - 11498 ep_pol 22:23 ?        00:00:00 nginx: worker process
0 R root       2318   2299  0  80   0 - 28166 -      22:42 pts/0    00:00:00 grep --color=auto nginx

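The nginx worker processes have the PIDs 2291, 2292, 2293, and 2294. Pick one worker and inspect its limits. (The sample below was captured from a worker with PID 2351; worker PIDs change whenever nginx is reloaded, so use the PID of a worker that is currently running.)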

[root@localhost conf]# cat /proc/2351/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             3829                 3829                 processes
Max open files            65535                65535                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       3829                 3829                 signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
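
Checking every worker by hand gets tedious, so the lookup can be scripted. This is a sketch that assumes a Linux /proc filesystem and that pgrep is installed; max_open_files and check_nginx_workers are illustrative names.

```shell
# Print the soft "Max open files" limit of the given PID.
max_open_files() {
    awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# Print the limit of every running nginx worker process.
check_nginx_workers() {
    for pid in $(pgrep -f 'nginx: worker' 2>/dev/null); do
        echo "worker $pid: $(max_open_files "$pid")"
    done
}

# Demonstration on the current shell, which always exists.
max_open_files $$
```

Running check_nginx_workers after a reload should print 65535 for each worker if worker_rlimit_nofile was applied.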

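Application-layer troubleshooting

After the changes above, the sluggishness decreased noticeably, but the 504 timeouts still recurred. At this point it was fairly clear that the timeouts happened when nginx forwarded frontend requests to the backend servers. We later found that they only occurred when nginx forwarded requests to one particular gateway, so we deployed a new gateway on another server to replace it. No timeouts have occurred since. We will keep monitoring.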
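
One possible way to narrow 504s down to a single backend is to count them per upstream in the access log. The sketch below is not the method used in this investigation; it assumes a hypothetical log_format of "$status $upstream_addr $upstream_response_time", and the addresses in the sample input are made up.

```shell
# Read access-log lines on stdin and count 504 responses per upstream,
# most frequent first. Field 1 = status, field 2 = upstream address.
count_504_by_upstream() {
    awk '$1 == 504 { hits[$2]++ } END { for (u in hits) print hits[u], u }' | sort -rn
}

# Sample input in the assumed format (fabricated for illustration only).
printf '%s\n' \
    '200 10.0.0.1:8080 0.012' \
    '504 10.0.0.2:8080 60.001' \
    '504 10.0.0.2:8080 60.000' \
    '504 10.0.0.1:8080 60.002' | count_504_by_upstream
```

With a real log the function would be fed from the access log file instead of the printf sample; an upstream that dominates the list is a candidate for closer inspection.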