I'm currently running a website in 2 different locations (datacenters) on identical machines. Over the last few months the overall performance has been degrading and I haven't been able to find the culprit.

Both machines are running Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz (8 threads), 32GB of RAM, 2x120GB SSD disks on SoftRaid.

Both machines are running the following software:

  • php-fpm7.1
  • nginx
  • Percona MySQL (configured as master-master)
  • Redis

Both servers are running the same code and I'm using Amazon Route 53 to balance the traffic using DNS.

The servers used to run fine with ~2000 users navigating the website (data from Google Analytics), with load averages never going over 1.

Recently I've been seeing hugely degraded performance. Any task bumps the load average to 6-8, and sometimes it easily goes over 15-20. Even a single code deployment (a few bash tasks and a git clone, nothing heavy) takes forever, and I see the load average ramp up and slow down the whole machine and the website.

A couple of months ago I had to increase the MySQL connection limit, and at the same time I increased the open files limit. The MySQL connection limit is currently 2000, and I let MySQL determine the open files limit by itself (value=0 auto-detects it).

My main guess is that it is something related to the database configuration: I see slowness while replicating (it is a master-master replication), and every time there is an insert on the website I can see the load times jumping to 10-15 sec.

The weirdest thing is that I only have traffic on one server. With AWS Route 53 I have removed one of the servers from the pool, so only one of them is actually getting load, and even then the machine is way overloaded. Here is an example:

enter image description here

While this was happening, I tried to post a comment on the website, which is a simple INSERT into a table where there is 1 row, and it just inserts 3 values:

enter image description here

The thing is... this website works better on my Vagrant development machine with 1 CPU and 2GB than it actually does in production on huge boxes.

I am sure there are files that you would like to see in order to help; I just don't know which ones would be useful, so let me know and I'll show any config that you may require.

Thanks in advance!

Update #1

# sar
Linux 3.14.32-xxxx-grs-ipv6-64 (freud.rbx.host.net)     11/10/17    _x86_64_    (8 CPU)

22:06:51          LINUX RESTART (8 CPU)

22:08:16        CPU     %user     %nice   %system   %iowait    %steal     %idle
22:10:01        all      1.64      0.00      1.92      6.67      0.00     89.77
22:12:02        all      1.04      0.00      0.39      8.88      0.00     89.69
22:14:17        all      0.87      0.00      0.35     11.41      0.00     87.37
Average:        all      1.15      0.00      0.82      9.18      0.00     88.84

# sar
Linux 3.14.32-xxxx-grs-ipv6-64 (bandura.bhs.infra.host.net)     11/10/17    _x86_64_    (8 CPU)

22:17:02          LINUX RESTART (8 CPU)

22:18:01        CPU     %user     %nice   %system   %iowait    %steal     %idle
22:20:01        all      9.31      0.00      1.34     13.29      0.00     76.06
22:22:01        all      9.05      0.00      1.52     10.25      0.00     79.17
22:24:01        all      8.99      0.00      1.32     20.63      0.00     69.06
22:26:17        all     12.08      0.00      1.41     20.26      0.00     66.25
22:28:10        all     10.03      0.00      4.00     18.50      0.00     67.48
22:30:01        all      9.76      0.00     10.22     11.67      0.00     68.35
22:32:01        all      9.24      0.00     10.09     15.82      0.00     64.85
22:34:01        all      9.94      0.00      6.98     17.14      0.00     65.93
22:36:01        all      9.05      0.00      1.28     11.73      0.00     77.94
22:38:18        all      8.48      0.00      1.27     21.18      0.00     69.08
22:40:01        all      9.49      0.00      1.54     13.81      0.00     75.16
22:42:01        all      8.70      0.00      1.43      9.35      0.00     80.52
22:44:01        all      9.79      0.00      1.30      7.45      0.00     81.46
22:46:01        all      8.53      0.00      1.08      4.61      0.00     85.78
22:48:01        all      8.84      0.00      1.06      0.20      0.00     89.91
22:50:12        all      8.37      0.00      0.97      4.41      0.00     86.25
22:52:01        all      9.39      0.00      1.09      0.11      0.00     89.41
22:54:01        all      9.19      0.00      1.11      0.08      0.00     89.63
22:56:01        all      9.92      0.00      1.18      6.37      0.00     82.53
22:58:01        all      9.75      0.00      1.13      0.29      0.00     88.84
23:00:01        all      8.61      0.00      1.05      0.35      0.00     89.99
23:02:01        all      9.49      0.00      1.18      4.99      0.00     84.34
23:04:01        all      8.79      0.00      1.07      0.19      0.00     89.95
23:06:01        all      9.72      0.00      1.18      0.23      0.00     88.87
23:08:01        all      9.27      0.00      1.15      5.80      0.00     83.78
23:10:01        all      9.81      0.00      1.16      0.09      0.00     88.94

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      9.37      0.00      2.20      8.50      0.00     79.93

Update #2

# iostat -x -k 60 5
Linux 3.14.32-xxxx-grs-ipv6-64 (bandura.bhs.infra.host.net)     12/10/17    _x86_64_    (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.56    0.00    0.63    3.14    0.00   93.67

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.01     5.53    1.21    7.97    77.15   298.97    81.90     0.03    2.83    5.98    2.35   0.23   0.21
sdc               0.01     5.76    1.19    7.74    77.04   298.97    84.18     2.02  225.40    7.29  259.06  20.38  18.21
sda               0.01     5.53    1.30    7.98    78.75   298.97    81.43     0.02    2.67    4.85    2.31   0.23   0.21
md1               0.00     0.00    0.12   12.04     1.85   296.01    49.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.18    0.00    0.94    8.88    0.00   83.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00    93.98    0.00   22.88     0.00  1233.77   107.83     0.20    8.81    0.00    8.81   0.28   0.64
sdc               0.00   106.15    0.00    8.35     0.00  1048.48   251.13    62.91 6245.87    0.00 6245.87 112.61  94.03
sda               0.00    93.93    0.00   22.93     0.00  1233.77   107.60     0.20    8.53    0.00    8.53   0.27   0.63
md1               0.00     0.00    0.00  119.93     0.00  1233.33    20.57     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.71    0.00    0.98   10.37    0.00   79.94

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00   148.23    0.00    8.33     0.00   938.97   225.35     0.16   18.67    0.00   18.67   0.57   0.47
sdc               0.00   152.42    0.00    4.38     0.00  1003.43   457.84    54.61 14547.22    0.00 14547.22 228.14 100.00
sda               0.00   147.82    0.00    8.82     0.00   938.97   213.00     0.15   17.29    0.00   17.29   0.53   0.47
md1               0.00     0.00    0.00  161.60     0.00   939.00    11.62     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.36    0.00    1.96   30.78    0.00   57.90

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00   238.60    0.00   14.18     0.00  1348.97   190.22     0.15   10.74    0.00   10.74   0.49   0.70
sdc               0.00   245.12    0.00    7.12     0.00  1002.38   281.70    22.86 2694.77    0.00 2694.77 140.52 100.00
sda               0.00   238.65    0.00   14.13     0.00  1348.97   190.89     0.15   10.71    0.00   10.71   0.50   0.71
md1               0.00     0.00    0.00  261.57     0.00  1373.07    10.50     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.54    0.00    9.82   17.06    0.00   60.58

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00    76.65    0.00    6.60     0.00   643.98   195.15     0.12   18.35    0.00   18.35   0.53   0.35
sdc               0.00    80.23    0.00    3.57     0.00   994.98   557.93    46.11 11933.87    0.00 11933.87 280.37 100.00
sda               0.00    76.45    0.00    6.82     0.00   643.98   188.94     0.12   16.86    0.00   16.86   0.49   0.33
md1               0.00     0.00    0.00   88.82     0.00   659.87    14.86     0.00    0.00    0.00    0.00   0.00   0.00

# ps ax|grep D
  PID TTY      STAT   TIME COMMAND
  235 ?        D      6:01 [jbd2/md1-8]
 2856 ?        Ss     0:05 /usr/sbin/sshd -D
27197 ?        D      0:00 redis-rdb-bgsave 127.0.0.1:6379
27201 pts/1    S+     0:00 grep D
29218 ?        D      0:13 [kworker/u16:0]

# iostat -x -k 60 5
Linux 3.14.32-xxxx-grs-ipv6-64 (freud.rbx.host.net)     12/10/17    _x86_64_    (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.09    0.00    0.26    0.99    0.00   97.66

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     3.57    0.69    5.98    41.38   135.49    53.02     0.01    1.85    5.03    1.49   0.21   0.14
sdc               0.00     3.64    0.63    5.91    40.62   135.49    53.81     0.45   69.43    6.72   76.12   3.66   2.39
sdb               0.01     3.65    0.63    5.90    40.65   135.49    53.89     0.55   84.47    6.06   92.90   9.92   6.49
md1               0.00     0.00    0.06    7.91     0.83   132.65    33.50     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.64    0.00    0.15   21.01    0.00   78.21

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   163.15    0.00    5.05     0.00   981.53   388.73     0.27   52.73    0.00   52.73   1.33   0.67
sdc               0.00   164.90    0.00    3.52     0.00   997.41   567.25    67.67 13897.55    0.00 13897.55 284.36 100.00
sdb               0.00   165.32    0.00    3.13     0.00   997.54   636.73    67.30 15544.66    0.00 15544.66 319.11  99.99
md1               0.00     0.00    0.00  175.33     0.00   968.80    11.05     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.25    0.00    0.40   14.61    0.00   82.73

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   159.37    0.00   14.42     0.00  1019.41   141.42     0.08    5.61    0.00    5.61   0.42   0.60
sdc               0.00   161.68    0.00   12.02     0.00  1082.34   180.14    24.31 4169.90    0.00 4169.90  74.98  90.11
sdb               0.00   161.68    0.00   12.00     0.00  1082.21   180.37    24.23 4152.64    0.00 4152.64  75.09  90.11
md1               0.00     0.00    0.00  173.33     0.00  1025.40    11.83     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.90    0.00    2.22   26.51    0.00   68.37

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   120.45    0.02   16.13     0.27   795.08    98.50     0.07    4.49    0.00    4.50   0.31   0.49
sdc               0.00   125.63    0.00   10.75     0.00  1000.96   186.22    30.75 2925.46    0.00 2925.46  92.99  99.96
sdb               0.00   125.65    0.00   10.73     0.00  1000.96   186.51    30.78 2932.71    0.00 2932.71  93.15  99.98
md1               0.00     0.00    0.02  136.77     0.27   785.80    11.49     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.29    0.00    9.46   17.60    0.00   67.66

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   152.48    0.00   27.47     0.00  1250.92    91.09     0.24    8.67    0.00    8.67   0.30   0.83
sdc               0.00   166.90    0.00   10.93     0.00  1001.18   183.14    45.73 3911.24    0.00 3911.24  91.46  99.99
sdb               0.00   166.92    0.00   10.92     0.00  1001.18   183.42    45.22 3885.74    0.00 3885.74  91.56  99.95
md1               0.00     0.00    0.00  183.17     0.00  1250.80    13.66     0.00    0.00    0.00    0.00   0.00   0.00

# ps ax|grep D
  PID TTY      STAT   TIME COMMAND
  234 ?        D     11:35 [jbd2/md1-8]
 1389 ?        Ss     0:00 /usr/sbin/sshd -D
24696 pts/2    S+     0:00 grep D
29904 ?        D      0:01 [kworker/u16:0]

Update #3

root@bandura:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sda1[0] sdb1[1] sdc1[2]
      155238336 blocks [3/3] [UUU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
root@bandura:~# dmesg|grep sdc
root@bandura:~#

Update #4

# smartctl --all /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-3.14.32-xxxx-grs-ipv6-64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model:     INTEL SSDSC2BB160G4
Serial Number:    BTWL322504JG160MGN
LU WWN Device Id: 5 001517 8f3633dfe
Firmware Version: D2010370
User Capacity:    160,041,885,696 bytes [160 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Oct 12 18:37:52 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        (    2) seconds.
Offline data collection
capabilities:            (0x79) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    (   2) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   099   099   000    Old_age   Always       -       5
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       34065
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       33
170 Available_Reservd_Space 0x0033   099   099   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       3
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       30
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       636 (197 5654)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Temperature_Case        0x0022   080   065   000    Old_age   Always       -       20 (Min/Max 5/35)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       30
194 Temperature_Internal    0x0022   100   100   000    Old_age   Always       -       27
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       17086779
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       65535
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       65535
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       65535
232 Available_Reservd_Space 0x0033   099   099   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   001   001   000    Old_age   Always       -       0
234 Thermal_Throttle        0x0032   100   100   000    Old_age   Always       -       0/0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       17086779
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       733141

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20497         -
# 2  Short offline       Completed without error       00%     20494         -
# 3  Short offline       Completed without error       00%     20494         -
# 4  Short offline       Completed without error       00%     20365         -
# 5  Short offline       Completed without error       00%     20365         -
# 6  Short offline       Completed without error       00%         4         -
# 7  Short offline       Completed without error       00%         2         -
# 8  Short offline       Completed without error       00%         1         -
# 9  Short offline       Completed without error       00%         1         -
#10  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@bandura:~# smartctl --all /dev/sdc
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-3.14.32-xxxx-grs-ipv6-64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model:     INTEL SSDSC2BB160G4
Serial Number:    BTWL344401UY160MGN
LU WWN Device Id: 5 5cd2e4 04b511f40
Firmware Version: D2010370
User Capacity:    160,041,885,696 bytes [160 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Oct 12 18:37:57 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        (    2) seconds.
Offline data collection
capabilities:            (0x79) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    (   2) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       22162
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       30
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       628 (128 5695)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Temperature_Case        0x0022   078   063   000    Old_age   Always       -       22 (Min/Max 8/37)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       30
194 Temperature_Internal    0x0022   100   100   000    Old_age   Always       -       29
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       17847559
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       102359
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       0
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       1329513
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   001   001   000    Old_age   Always       -       0
234 Thermal_Throttle        0x0032   100   100   000    Old_age   Always       -       0/0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       17847559
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       144838

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      8594         -
# 2  Short offline       Completed without error       00%      8591         -
# 3  Short offline       Completed without error       00%      8591         -
# 4  Short offline       Completed without error       00%      8462         -
# 5  Short offline       Completed without error       00%      8462         -
# 6  Short offline       Completed without error       00%      8432         -
# 7  Short offline       Completed without error       00%      8431         -
# 8  Short offline       Completed without error       00%      3880         -
# 9  Short offline       Completed without error       00%      3879         -
#10  Short offline       Completed without error       00%      3879         -
#11  Short offline       Completed without error       00%         1         -
#12  Short offline       Completed without error       00%         0         -
#13  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
    • It seems like an I/O performance problem to me. What is the server's iowait time? Can you post the output of sar?
    • @shodanshok I've added the sar output of one of the servers. I'm running an upgrade on the other one to see if it fixes anything... (poor man's solution?). I'll post the graph once the upgrade has completed, as I am sure the current results won't be relevant. I'll ping you again. Thanks!
    • @ceejayoz the data set is not massive. The POST I'm showing above performs an insert that is only the 2nd row in the table. Selects don't seem to be slowing down the website either; inserting data is, though. I'll turn on the slow query log and test, but I am afraid the problem is not there.
    • Can you also post the output of iostat -x -k 60 5 (it will run for 5 minutes) and ps ax | grep D?

Your problem is due to extremely low write speed on one of the RAID1 legs, namely on sdc. This is in turn caused by full and very stressed/used SSDs.
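As a rough illustration of how the slow leg shows up in the numbers, here is a minimal Python sketch that parses one `iostat -x -k` sample copied from the question and flags any device whose `w_await` is far above the fastest leg. The column order matches the sysstat output shown in the question; the function names and the 10x threshold are assumptions chosen for illustration, not a standard rule.

```python
# Column layout of `iostat -x -k` as printed in the question (older sysstat).
COLUMNS = ["device", "rrqm/s", "wrqm/s", "r/s", "w/s", "rkB/s", "wkB/s",
           "avgrq-sz", "avgqu-sz", "await", "r_await", "w_await", "svctm", "%util"]

# One 60-second sample copied from the question (host bandura).
SAMPLE = """\
sdb 0.00 93.98 0.00 22.88 0.00 1233.77 107.83 0.20 8.81 0.00 8.81 0.28 0.64
sdc 0.00 106.15 0.00 8.35 0.00 1048.48 251.13 62.91 6245.87 0.00 6245.87 112.61 94.03
sda 0.00 93.93 0.00 22.93 0.00 1233.77 107.60 0.20 8.53 0.00 8.53 0.27 0.63
"""


def parse(text):
    """Return {device: {column: value}} for every well-formed data line."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue
        try:
            values = [float(f) for f in fields[1:]]
        except ValueError:
            continue  # header line or other non-numeric row
        rows[fields[0]] = dict(zip(COLUMNS[1:], values))
    return rows


def slow_legs(rows, factor=10.0):
    """Devices whose write latency exceeds `factor` times the fastest leg.

    `factor` is a hypothetical threshold; the 1 ms floor avoids flagging
    everything when the fastest device reports a w_await near zero.
    """
    fastest = min(r["w_await"] for r in rows.values())
    limit = factor * max(fastest, 1.0)
    return sorted(dev for dev, r in rows.items() if r["w_await"] > limit)


print(slow_legs(parse(SAMPLE)))
```

In this sample sda and sdb complete writes in under 10 ms while sdc needs several seconds, and because a RAID1 write has to reach every member, the array ends up waiting for sdc.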

Based on your smartctl output, you are using 2x 160 GB Intel DC S3500. While enterprise-level, they are read-optimized drives with limited NAND spare space. This is exacerbated by the extremely high amount of writes: during its lifetime, sdc wrote over 500 TB of data, and this explains the very low Media_Wearout_Indicator value (001). sdb is not in a better state, as it has a non-zero Reallocated_Sector_Ct.

I strongly suggest replacing both disks. If this is not possible, consider repartitioning them to a lower capacity (e.g. 96 GB) and/or running fstrim regularly.
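The write volume can be checked directly from the SMART attributes. A small Python sketch, using the `Host_Writes_32MiB` raw values from the smartctl output in the question; the attribute name itself says each unit is 32 MiB, and the helper name is just illustrative:

```python
MIB = 1024 ** 2  # bytes in one MiB


def host_writes_bytes(raw_value, unit_mib=32):
    """Convert a Host_Writes_32MiB raw value into bytes written by the host."""
    return raw_value * unit_mib * MIB


# Raw values of SMART attribute 241 (Host_Writes_32MiB) from the question.
RAW = {"sdb": 17086779, "sdc": 17847559}

for dev, raw in RAW.items():
    total = host_writes_bytes(raw)
    # Report both decimal terabytes and binary tebibytes.
    print(f"{dev}: {total / 1e12:.1f} TB ({total / 2**40:.1f} TiB)")
```

Both drives come out well above 500 TB of host writes, which is a very large amount for a 160 GB read-optimized drive.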
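To put a number on the repartitioning suggestion, the following Python sketch computes how much of the drive would stay unpartitioned. The function name is hypothetical, and it assumes, as is generally understood for SSDs, that the controller can only use that area as extra spare space if it has never been written or has been trimmed.

```python
def unpartitioned_fraction(drive_gb, partition_gb):
    """Fraction of the drive left outside the partition."""
    if not 0 < partition_gb <= drive_gb:
        raise ValueError("partition size must be between 0 and the drive size")
    return (drive_gb - partition_gb) / drive_gb


# 160 GB drive from the smartctl output, 96 GB partition as suggested above.
print(f"{unpartitioned_fraction(160, 96):.0%} of the drive left unpartitioned")
```

With the 96 GB example, 64 GB of each 160 GB drive would no longer be exposed to the filesystem.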

    • Thank you. I suspected that long ago, but SoYouStart/OVH told me the disks were okay; I ran their test suite and it also said they were okay. I guess they are not. I'll see if I can repartition them to make this work while we upgrade to different servers. Thank you!