
Troubleshooting a 504 Gateway Time-out: a case study

1. Environment

  • Nginx
  • PHP-FPM

2. Background

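Requests in the production environment occasionally fail with 504 Gateway Time-out. An initial look suggested the timeouts were tied to the Nginx and PHP-FPM configuration. The possibly relevant settings on the production server, with their actual values, are listed below (all values are in seconds):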

2.1 Nginx

  • fastcgi_connect_timeout 3000;
    Defines a timeout for establishing a connection with a FastCGI server. It should be noted that this timeout cannot usually exceed 75 seconds.
  • fastcgi_send_timeout 3000;
    Sets a timeout for transmitting a request to the FastCGI server.
  • fastcgi_read_timeout 3000;
    Defines a timeout for reading a response from the FastCGI server.

2.2 PHP-FPM

  • request_terminate_timeout = 100
    The timeout for serving a single request after which the worker process will be killed.
  • max_execution_time = 300
    Maximum execution time of each script, in seconds.
  • The two settings do much the same thing. The difference:

    Apparently they're both doing the same thing at different levels. max_execution_time is honored by PHP itself and request_terminate_timeout is handled by the FPM process control mechanism. So whichever is set to the lowest value will kick in first. Also Apache has the idle-timeout parameter that it observes and will give up on the PHP process after that time.

3. Troubleshooting process

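  • 3.1 Test file

    <?php
    // Sleep for 90 seconds, then print the PHP configuration. 90 seconds is below
    // the timeouts configured above, so in theory the request should not time out.
    sleep(90);

    phpinfo(); // prints its output directly, so no var_dump() wrapper is needed
    ?>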
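  • 3.2 Simulate a normal request by fetching the test file through the domain name. After roughly 30 or 60 seconds the request returns 504 Gateway Time-out, every time.
    Both 30 and 60 seconds are also below the configured values. So why does the request time out?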
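To make the "30 or 60 seconds" observation repeatable, the request can be timed from a script rather than a browser. The following is a minimal sketch, assuming the curl extension is available; `timedGet` and `summarize` are hypothetical helper names, and the URL is supplied on the command line.

```php
<?php
// Time a single GET request and report the HTTP status and elapsed seconds.
// Assumption: the curl extension is loaded; function names are illustrative.

function timedGet(string $url, int $maxSeconds = 120): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // capture the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, $maxSeconds); // client-side limit, kept above the 90 s sleep
    $start = microtime(true);
    $body = curl_exec($ch);
    $elapsed = microtime(true) - $start;
    // A status of 0 means curl itself failed (for example, its own timeout fired).
    $status = $body === false ? 0 : (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ['status' => $status, 'seconds' => $elapsed];
}

function summarize(array $result): string
{
    return sprintf('HTTP %d after %.1f s', $result['status'], $result['seconds']);
}

// Only runs when a URL is passed: php timed_get.php http://<domain>/web/test.php
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    echo summarize(timedGet($argv[1])), PHP_EOL;
}
```

Keeping the client-side limit above 90 seconds matters here: otherwise the script's own timeout could be mistaken for one imposed somewhere along the request path.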

  • 3.3 Check the PHP log

    [01-Feb-2020 23:43:31] WARNING: [pool www] child 6639, script '/www/wwwroot/youxuan/core/web/test.php' (request: "GET /web/test.php") executing too slow (36.118452 sec), logging

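There is only a warning and no error, so PHP-FPM itself did not fail. The warning appears once the script has run for more than 30 seconds because of this PHP-FPM setting:

  • request_slowlog_timeout = 30
    PHP-FPM is ruled out for now.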
  • 3.3 Check the Nginx log
    [01/Feb/2020:21:08:17 +0800] "GET /web/test.php HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36" 114.247.177.131 "50.875" "50.875"

The Nginx log records 499 rather than a 5xx status such as 502 or 504, which is suspicious.
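When many such lines need checking, the status and timing fields can be pulled out with a small parser. This is a sketch that assumes the custom log layout of the line above; treating the two trailing quoted numbers as `$request_time` and `$upstream_response_time` is a guess, since the `log_format` directive itself is not shown. `parseAccessLine` is a hypothetical name.

```php
<?php
// Parse one access-log line in the layout shown above:
// [time] "request" status bytes ... "seconds" "seconds"
function parseAccessLine(string $line): ?array
{
    $pattern = '/^\[([^\]]+)\] "([^"]*)" (\d{3}) (\d+) .* "([\d.]+)" "([\d.-]+)"$/';
    if (!preg_match($pattern, trim($line), $m)) {
        return null; // line does not follow the assumed layout
    }
    return [
        'time'          => $m[1],
        'request'       => $m[2],
        'status'        => (int) $m[3],
        'request_time'  => (float) $m[5], // assumed to be $request_time
        'upstream_time' => (float) $m[6], // assumed to be $upstream_response_time; "-" casts to 0.0
    ];
}
```

Filtering the parsed entries for `status === 499` and looking at `request_time` shows how long Nginx had been working on each request when the client went away.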

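  • 3.4 Nginx HTTP status code 499

    Nginx logs status 499 when it detects that the client has closed the connection before a response was sent (for example, when the user closes the browser).

    499 therefore means the client side dropped the request. But I never closed my browser, so why 499?
    Note on telling client from server for a given request: whichever side initiates the request is the client, and whichever side receives it is the server.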

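  • 3.5 A breakthrough
    On expert advice, I accessed the server directly by IP instead of through the domain name. Sure enough, requests by IP do not time out, while requests through the domain time out every time. That narrows the problem down to the domain.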
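A variant of the IP-versus-domain comparison keeps the original URL and Host header but forces the host name to resolve to the origin server, using curl's `CURLOPT_RESOLVE` option. This is a sketch, not the method used above (which simply requested the IP directly); `pinEntry` and `statusViaOrigin` are hypothetical names, and `example.com` and `203.0.113.10` are placeholder values.

```php
<?php
// Build a CURLOPT_RESOLVE entry ("host:port:address") that pins a host name to one IP.
function pinEntry(string $host, int $port, string $ip): string
{
    return sprintf('%s:%d:%s', $host, $port, $ip);
}

// Fetch $url while forcing $host to resolve to $originIp instead of using public DNS.
// Returns the HTTP status code, or 0 if the transfer itself failed.
function statusViaOrigin(string $url, string $host, string $originIp, int $port = 80): int
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 120); // above the 90 s sleep in the test file
    curl_setopt($ch, CURLOPT_RESOLVE, [pinEntry($host, $port, $originIp)]);
    curl_exec($ch);
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}

// Example call (placeholder values):
// statusViaOrigin('http://example.com/web/test.php', 'example.com', '203.0.113.10');
```

If the pinned request succeeds while the normal request through the domain fails, whatever sits between DNS and the origin server is the likely cause.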

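  • 3.6 Investigate the domain
    Pinging the domain returns an address that is not the cloud host's. A CDN service had been purchased earlier and the cloud host sits behind it, so the CDN is the likely suspect.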
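The same check can be scripted: resolve the domain and see whether the origin server's address is among the results. A minimal sketch follows; `resolvesToOrigin` is a hypothetical helper, and the host name and addresses in the example are placeholders.

```php
<?php
// Return true if $originIp is among the IPv4 addresses that $host resolves to.
// $resolved can be passed in explicitly (useful for testing); otherwise DNS is queried.
function resolvesToOrigin(string $host, string $originIp, ?array $resolved = null): bool
{
    // gethostbynamel() returns false on failure, which is treated as "no addresses".
    $resolved = $resolved ?? (gethostbynamel($host) ?: []);
    return in_array($originIp, $resolved, true);
}

// Example call (placeholder values):
// var_dump(resolvesToOrigin('example.com', '203.0.113.10'));
```

A `false` result means the domain points somewhere other than the origin server, for example at a CDN or another proxy in front of it.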

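  • 3.7 Investigate the CDN service
    Going through the CDN settings reveals the real culprit: the CDN has an origin-fetch (back-to-origin) timeout of 30 seconds.
    (Figure: CDN origin fetch)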

  • 3.8 After raising the origin-fetch timeout, requests through the domain no longer time out either.
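The whole investigation comes down to one rule: a request is cut off by the smallest timeout anywhere on its path, including layers outside the server. The sketch below uses the values quoted in this article; `firstToFire` and the `cdn_origin_timeout` key are hypothetical names, not real configuration directives.

```php
<?php
// Return the name of the smallest limit that a request of the given duration
// exceeds, or null if the request fits within every limit.
function firstToFire(array $limits, float $requestSeconds): ?string
{
    asort($limits); // smallest limit first, keys preserved
    foreach ($limits as $name => $seconds) {
        if ($seconds < $requestSeconds) {
            return $name;
        }
    }
    return null;
}

// Limits on the server itself, in seconds, as listed in section 2.
$server = [
    'fastcgi_read_timeout'      => 3000,
    'request_terminate_timeout' => 100,
    'max_execution_time'        => 300,
];

var_export(firstToFire($server, 90)); // null: nothing on the server is below 90 s
echo PHP_EOL;

// Add the CDN in front of the server (hypothetical key name, value from section 3.7).
$withCdn = $server + ['cdn_origin_timeout' => 30];
echo firstToFire($withCdn, 90), PHP_EOL; // cdn_origin_timeout
```

This also fits the 499 in the Nginx log: from Nginx's point of view the CDN is the client, so when the CDN gives up, Nginx records a client that closed the connection.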

4. References