Troubleshooting Disk I/O Problems on Linux

Disk I/O

Posted by qin4zhang on February 22, 2020

Note

Capture ideas as they come; the implementation can follow later.

Background

An environment is showing signs of a possible disk I/O problem. How do we confirm what is causing it?

Locating the Problem

Let's start by looking at the data with top, as shown below (a batch-mode capture sketch follows the output):

top - 03:15:27 up 125 days, 23:40,  4 users,  load average: 19.27, 22.32, 18.89
Tasks: 231 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu0  :  4.0 us,  1.3 sy,  0.0 ni,  0.0 id, 94.6 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  2.7 us,  0.7 sy,  0.0 ni,  0.0 id, 96.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  3.3 us,  0.0 sy,  0.0 ni,  0.0 id, 96.3 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu3  :  2.0 us,  5.7 sy,  0.0 ni,  0.7 id, 91.3 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem : 16425620 total,  6617872 free,  3443584 used,  6364164 buff/cache
KiB Swap:  4194300 total,  4130100 free,    64200 used. 13874908 avail Mem
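As a side note, the same summary can be captured non-interactively, which is handy for tickets or scripts. A minimal sketch using top's batch mode (the head line count is arbitrary):

# One non-interactive top snapshot, keeping only the summary area.
# (Pressing "1" in interactive top toggles the per-CPU rows shown above.)
top -b -n 1 | head -n 12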

On this 4-core machine the load average is very high and the wa share of CPU time is excessive, so the guess is a disk read/write problem. From man top we learn (a per-CPU sampling sketch follows this excerpt):

As a default, percentages for these individual categories are displayed.  Where two labels are shown below, those for more recent kernel versions are shown first.
           us, user    : time running un-niced user processes
           sy, system  : time running kernel processes
           ni, nice    : time running niced user processes
           id, idle    : time spent in the kernel idle handler
           wa, IO-wait : time waiting for I/O completion
           hi : time spent servicing hardware interrupts
           si : time spent servicing software interrupts
           st : time stolen from this vm by the hypervisor

       In the alternate cpu states display modes, beyond the first tasks/threads line, an abbreviated summary is shown consisting of these elements:
                      a    b     c    d
           %Cpu(s):  75.0/25.0  100[ ...

       Where: a) is the combined us and ni percentage; b) is the sy percentage; c) is the total; and d) is one of two visual graphs of those representations.
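Before dropping down to the device level, it can help to confirm that the high wa is sustained across all cores rather than a momentary spike. A minimal sketch using mpstat from the sysstat package (interval and count are arbitrary):

# Per-CPU utilization, including %iowait, sampled once per second, five times.
mpstat -P ALL 1 5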

wa is I/O wait, the share of time the CPU spends waiting for I/O to complete, so the next step is to confirm at the device level with iostat -x (a continuous-sampling sketch follows the output):

Linux 4.15.0-117-generic (webhost-dev124) 	01/19/2021 	_x86_64_	(4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.78    0.00    0.65    94.20    0.00   0.70

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
loop0            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
loop1            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
loop2            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
loop3            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sda              1.00    5.00     28.00   2560.00     0.00     0.00   0.00   0.00 5320.00 34168.80 154.48    28.00   512.00 166.67 100.00
sdb              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-0             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-1             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
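Note that the first report iostat prints is an average since boot; for a live view it is more useful to sample at an interval and restrict the output to the suspect device. A minimal sketch (device name, interval, and count are assumptions for this environment):

# Extended stats for sda only: the first report is the since-boot average,
# the following four cover one-second intervals each.
iostat -x sda 1 5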

First, let's see what each of these metrics means; from man iostat:

CPU Utilization Report

%user
     Show the percentage of CPU utilization that occurred while executing at the user level (application).
     That is, the percentage of time the CPU spent running in user mode.

%nice
     Show the percentage of CPU utilization that occurred while executing at the user level with nice priority.

%system
     Show the percentage of CPU utilization that occurred while executing at the system level (kernel).
     That is, the percentage of time the CPU spent running in kernel mode.

%iowait
     Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
     That is, the percentage of time the CPU was idle while a disk I/O request was still outstanding.

%steal
     Show the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.

%idle
     Show the percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
     That is, the percentage of time the CPU was idle with no outstanding disk I/O request.

Device Utilization Report

Device
     This column gives the device (or partition) name as listed in the /dev directory.

r/s
     The number (after merges) of read requests completed per second for the device.
     Read requests completed per second on the device.

w/s
     The number (after merges) of write requests completed per second for the device.
     Write requests completed per second on the device.

rsec/s (rkB/s, rMB/s)
     The number of sectors (kilobytes, megabytes) read from the device per second.
     Sectors (or kB/MB) read from the device per second.

wsec/s (wkB/s, wMB/s)
     The number of sectors (kilobytes, megabytes) written to the device per second.
     Sectors (or kB/MB) written to the device per second.

rrqm/s
     The number of read requests merged per second that were queued to the device.
     Read requests merged per second.

wrqm/s
     The number of write requests merged per second that were queued to the device.
     Write requests merged per second.

%rrqm
     The percentage of read requests merged together before being sent to the device.

%wrqm
     The percentage of write requests merged together before being sent to the device.

r_await
     The average time (in milliseconds) for read requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing
     them.

w_await
     The  average time (in milliseconds) for write requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servic‐
     ing them.

aqu-sz
     The average queue length of the requests that were issued to the device.
     Note: In previous versions, this field was known as avgqu-sz.

rareq-sz
     The average size (in kilobytes) of the read requests that were issued to the device.

wareq-sz
     The average size (in kilobytes) of the write requests that were issued to the device.

svctm
     The average service time (in milliseconds) for I/O requests that were issued to the device. Warning! Do not trust this field any more.  This field will be  removed  in  a
     future sysstat version.
     Average service time, in milliseconds, for I/O requests issued to the device; the field is deprecated and will be removed in a future sysstat release.

%util
     Percentage  of  elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close
     to 100% for devices serving requests serially.  But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their per‐
     formance limits.
     Percentage of elapsed time during which I/O requests were issued to the device (its bandwidth utilization). For devices that serve requests serially, values close to 100% mean the device is saturated; for devices that serve requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
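For reference, %util is derived from the per-device "time spent doing I/Os" counter in /proc/diskstats. A rough sketch of the same calculation over a one-second window, assuming the usual /proc/diskstats layout (device name in field 3, cumulative I/O milliseconds in field 13):

# Sample the cumulative I/O time twice, one second apart; the delta over the
# 1000 ms window approximates iostat's %util for the device.
t1=$(awk '$3 == "sda" {print $13}' /proc/diskstats)
sleep 1
t2=$(awk '$3 == "sda" {print $13}' /proc/diskstats)
echo "sda %util ~ $(( (t2 - t1) * 100 / 1000 ))%"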

The sda disk is running at essentially 100% utilization, so the disk really is the main problem here. But what is driving it? iotop -oP shows the following (a pidstat-based alternative is sketched after the output):

Total DISK READ :       0.00 B/s | Total DISK WRITE :      2520.35 K/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       0.00 B/s
  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 2292 be/4 td-agent    0.00 B/s   2600.20 K/s  0.00 %  0.00 % ruby -Eascii-8bit:ascii-8bit /opt/td-agent/embedded/bin/fluentd --l~agent.log --daemon /var/run/td-agent/td-agent.pid --under-supervisor
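If iotop is not available, pidstat from the sysstat package gives a similar per-process view of disk activity. A minimal sketch (interval and count are arbitrary):

# Per-process disk read/write rates (kB_rd/s, kB_wr/s), once per second, five times.
pidstat -d 1 5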

Process 2292 is fluentd, the log collector: every service's logs are shipped to ES through this process. A sudden large-scale service migration made the collection volume spike, and the local disk could not keep up with the reads and writes. To restore service quickly, log collection can be paused first; restarting the fluentd service also works.

/etc/init.d/td-agent status
/etc/init.d/td-agent restart
/etc/init.d/td-agent stop
/etc/init.d/td-agent start

The commands above were enough to relieve this disk I/O problem.
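After stopping or restarting td-agent, it is worth re-checking with the same tools used above. A minimal sketch (device name assumed to be sda, as in this environment):

# %util for sda and the wa column in top should both drop back to normal.
iostat -x sda 1 3
top -b -n 1 | head -n 12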