• 14
name

A PHP Error was encountered

Severity: Notice

Message: Undefined index: userid

Filename: views/question.php

Line Number: 191

Backtrace:

File: /home/prodcxja/public_html/questions/application/views/question.php
Line: 191
Function: _error_handler

File: /home/prodcxja/public_html/questions/application/controllers/Questions.php
Line: 433
Function: view

File: /home/prodcxja/public_html/questions/index.php
Line: 315
Function: require_once

name Punditsdkoslkdosdkoskdo

ZFS reads on FreeBSD incredibly slow

I have a FreeBSD 10.0 system on which I have MySQL running with its InnoDB database files stored on ZFS. It seemed to have been going fine for months, but lately, and seemingly suddenly, performance has bottomed out. After debugging a while surrounding MySQL, I finally decided to just see if just reading the database files from the filesystem was slow.

I picked one table whose database file was about 16GB, ran:

time cat table.ibd > /dev/null

and got:

cat table.ibd > /dev/null 1.24s user 64.35s system 0% cpu 1:00:34.65 total

In comparison, a copy of the file (with some changes, I'm sure: the database is live) living on UFS on the same system gives me:

cat table.ibd > /dev/null 0.20s user 9.34s system 5% cpu 9.550 total

Here are non-default ZFS options on this system:

/boot/loader.conf:
  vfs.zfs.arc_max=17179869184

-

/etc/sysctl.conf:
  vfs.zfs.prefetch_disable=1

-

zfs get:
  recordsize 16K
  compression on
  atime off
  primarycache metadata
  zfs:zfs_nocacheflush 1

There are 12 snapshots on this filesystem right now. (Which seems excessive; I'm going to see if any of them can be deleted.)

There's no L2ARC for this pool (or any other pool on the system).

I've tried all three values for primarycache, and I've tried reenabling the prefetch, none of which seem to have had any significant effect.

The zpool is four 2-disk mirrors:

% zpool status mysqlrot
  pool: mysqlrot
state: ONLINE
  scan: scrub repaired 0 in 9h45m with 0 errors on Fri Jun 26 12:46:33 2015
config:

    NAME        STATE     READ WRITE CKSUM
    mysqlrot    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        mfid9   ONLINE       0     0     0
        mfid10  ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        mfid11  ONLINE       0     0     0
        mfid12  ONLINE       0     0     0
      mirror-2  ONLINE       0     0     0
        mfid13  ONLINE       0     0     0
        mfid14  ONLINE       0     0     0
      mirror-3  ONLINE       0     0     0
        mfid15  ONLINE       0     0     0
        mfid16  ONLINE       0     0     0
    spares
      mfid19    AVAIL

errors: No known data errors

One oddity is how the raw devices are set up. The storage controller is an MFI controller, and each raw disk is actually configured in the MFI controller as a 1-disk RAID0 volume:

% sudo mfiutil show volumes
mfi0 Volumes:
  Id     Size    Level   Stripe  State   Cache   Name
 mfid0 (  185G) RAID-1      64K OPTIMAL Disabled <OS>
 mfid1 (  558G) RAID-0      64K OPTIMAL Disabled <DB0A>
 mfid2 (  558G) RAID-0      64K OPTIMAL Disabled <DB0B>
 mfid3 (  558G) RAID-0      64K OPTIMAL Disabled <DB1A>
 mfid4 (  558G) RAID-0      64K OPTIMAL Disabled <DB1B>
 mfid5 (  558G) RAID-0      64K OPTIMAL Disabled <DB2A>
 mfid6 (  558G) RAID-0      64K OPTIMAL Disabled <DB2B>
 mfid7 (  558G) RAID-0      64K OPTIMAL Disabled <DB3A>
 mfid8 (  558G) RAID-0      64K OPTIMAL Disabled <DB3B>
 mfid9 (  558G) RAID-0      64K OPTIMAL Disabled <DB4A>
mfid10 (  558G) RAID-0      64K OPTIMAL Disabled <DB4B>
mfid11 (  558G) RAID-0      64K OPTIMAL Disabled <DB5A>
mfid12 (  558G) RAID-0      64K OPTIMAL Disabled <DB5B>
mfid13 (  558G) RAID-0      64K OPTIMAL Disabled <DB6A>
mfid14 (  558G) RAID-0      64K OPTIMAL Disabled <DB6B>
mfid15 (  558G) RAID-0      64K OPTIMAL Disabled <DB7A>
mfid16 (  558G) RAID-0      64K OPTIMAL Disabled <DB7B>
mfid17 (  558G) RAID-0      64K OPTIMAL Disabled <DB8A>
mfid18 (  558G) RAID-0      64K OPTIMAL Disabled <DB8B>
mfid19 (  558G) RAID-0      64K OPTIMAL Disabled <SPARE0>

No errors that I've been able to find in any log file or utility.

Anyone have any ideas where to look?


Various data:

% zpool list mysqlrot
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mysqlrot  2.17T  1.49T   701G    68%  1.00x  ONLINE  -

Copying 15.3GB file to the affected filesystem (from a UFS filesystem):

% time sudo cp test.file /var/lib/mysql/mysqlrot/test.file
sudo cp test.file /var/lib/mysql/mysqlrot/test.file  0.02s user 44.23s system 1% cpu 1:06.93 total

(That's 66.93 seconds.)

Reading same file from the affected filesystem:

# time cat test.file > /dev/null
cat test.file > /dev/null  4.23s user 268.50s system 0% cpu 25:29.27 total

(That's 1529.27 seconds: almost 23 times longer than the write.)

Interestingly, cp performs far better than cat:

% time sudo cp /var/lib/mysql/v4netrot/test.file /dev/null
sudo cp /var/lib/mysql/v4netrot/test.file /dev/null  0.03s user 33.63s system 0% cpu 3:05.99 total

(185.99 seconds)


iostat data during copy to:

                        extended device statistics
device     r/s    w/s    kr/s    kw/s qlen svc_t  %b
mfid0      0.0    0.6     0.0     8.8    0   0.6   0
mfid1      0.4   12.6     2.7    70.5    0   0.5   0
mfid2      1.8   12.8    10.8    70.5    0   1.6   2
mfid3      0.2   10.0     1.5    92.4    0   0.7   0
mfid4      0.0   10.4     0.0    92.4    0   0.5   0
mfid5      0.2    9.4     1.2    39.7    0   0.5   0
mfid6      0.6    9.8     3.9    39.7    0   0.6   0
mfid7      0.6    4.8     0.3    46.1    0   0.9   0
mfid8      1.8    4.8    11.4    46.1    0   0.8   0
mfid9      0.4 1327.2     2.9 26686.5    0   0.5  23
mfid10     0.8 1328.2     1.8 26686.5    0   0.5  20
mfid11     1.4 1304.8     8.4 26357.6    0   0.5  23
mfid12     1.4 1304.6     2.6 26357.6    0   0.6  31
mfid13     1.6 1120.6     3.5 26194.2    0   0.6  25
mfid14     0.4 1122.6     2.7 26194.2    0   0.5  22
mfid15     0.8 1406.6     5.5 26188.5    0   0.5  22
mfid16     1.0 1174.6     2.0 21534.3   10   4.9  74
mfid17     5.8   24.2   152.9   300.6    0   0.3   0
mfid18     4.0   23.6    76.7   300.6    0   0.3   0
mfid19     0.0    0.0     0.0     0.0    0   0.0   0

The kw/s data ranged from about 17k to about 25k, and was pretty consistent between drives.

iostat when catting from:

                        extended device statistics
device      r/s   w/s    kr/s    kw/s qlen svc_t  %b
mfid0       0.0   0.0     0.0     0.0    0   0.0   0
mfid1       0.4  17.2    11.4    63.4    0   0.5   0
mfid2       0.0  17.0     0.0    63.4    0   0.4   0
mfid3       0.0  14.0     0.0    56.4    0   0.4   0
mfid4       0.4  13.6     0.2    56.4    0   0.4   0
mfid5       0.8   9.6     4.8    37.3    0   0.8   0
mfid6       0.0   9.8     0.0    37.3    0   0.4   0
mfid7       0.2   3.8    17.2    11.9    0   0.6   0
mfid8       0.2   3.8     1.4    11.9    0   0.5   0
mfid9    1208.8   0.0  6831.4     0.0    0   0.1  11
mfid10    129.4   0.0   780.7     0.0    0   0.2   2
mfid11    906.4   0.0  5858.5     0.0    0   0.1  10
mfid12    600.5   0.0  2673.0     0.0    0   0.1   5
mfid13    136.2   0.0   803.9     0.0    0   0.2   3
mfid14    316.1   0.0  1895.3     0.0    0   0.1   4
mfid15    243.6   0.0  1414.5     0.0    0   0.1   2
mfid16    129.0   0.0   768.8     0.0    0   0.2   2
mfid17      3.8  25.8    29.8   274.1    0   0.2   0
mfid18      6.0  25.6    96.6   274.1    0   0.2   0
mfid19      0.0   0.0     0.0     0.0    0   0.0   0

the kr/s numbers were wildly inconsistent, but those are representative numbers.

iostat while cping from (to /dev/null):

                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
mfid0      0.0   0.0     0.0     0.0    0   0.0   0
mfid1     21.0  66.6   107.6  2351.7    0   0.9   6
mfid2     17.6  66.8   106.8  2351.7    0   1.0   6
mfid3     17.6  39.0   116.9  2111.3    0   1.1   6
mfid4     18.8  39.6    99.8  2111.3    0   1.3   7
mfid5     23.2  62.4   172.2  2076.1    0   1.1   7
mfid6     23.0  62.0   130.0  2076.1    0   1.4   9
mfid7     16.2  62.6   112.6  2125.3    0   1.0   6
mfid8     17.4  63.0   107.6  2125.3    0   0.7   4
mfid9    237.5  44.6  5140.6   807.0    0   3.1  22
mfid10   263.7  43.6  5530.5   807.0    0   1.5  14
mfid11   252.7  55.8  5297.6   802.4    0   2.6  20
mfid12   298.1  55.6  5361.9   802.4    0   2.5  21
mfid13   275.3  46.2  5116.4   801.4    0   2.8  22
mfid14   252.9  42.4  5107.7   801.4    2   3.1  21
mfid15   270.9  43.8  4546.5   943.7    0   1.2  12
mfid16   257.7  44.0  5642.5   943.7    0   2.5  19
mfid17     7.8  23.0    73.1   244.9    0   0.3   0
mfid18     0.8  24.2    44.4   244.9    0   0.2   0
mfid19     0.0   0.0     0.0     0.0    0   0.0   0

It does not seem CPU-bound. top shows that a cp from the affected filesystem to /dev/null consumes about 18% of one core (out of 48) and the rest of the cores are showing over 95% idle.

last pid: 12474;  load averages:  1.65,  1.26,  1.14                                                                                                                                        up 39+05:42:19  14:29:08
147 processes: 1 running, 146 sleeping
CPU 0:   0.0% user,  0.0% nice,  1.6% system,  0.0% interrupt, 98.4% idle
CPU 1:   0.0% user,  0.0% nice,  1.2% system,  0.0% interrupt, 98.8% idle
CPU 2:   0.4% user,  0.0% nice,  0.8% system,  0.0% interrupt, 98.8% idle
CPU 3:   0.0% user,  0.0% nice,  1.2% system,  0.0% interrupt, 98.8% idle
CPU 4:   0.0% user,  0.0% nice,  2.3% system,  0.0% interrupt, 97.7% idle
CPU 5:   0.0% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.6% idle
CPU 6:   0.0% user,  0.0% nice,  1.2% system,  0.0% interrupt, 98.8% idle
CPU 7:   0.0% user,  0.0% nice,  2.7% system,  0.0% interrupt, 97.3% idle
CPU 8:   0.4% user,  0.0% nice,  1.9% system,  0.0% interrupt, 97.7% idle
CPU 9:   0.0% user,  0.0% nice,  2.3% system,  0.0% interrupt, 97.7% idle
CPU 10:  0.4% user,  0.0% nice,  1.9% system,  1.2% interrupt, 96.5% idle
CPU 11:  0.0% user,  0.0% nice,  3.9% system,  0.0% interrupt, 96.1% idle
CPU 12:  0.0% user,  0.0% nice,  1.6% system,  0.8% interrupt, 97.7% idle
CPU 13:  0.0% user,  0.0% nice,  2.7% system,  0.0% interrupt, 97.3% idle
CPU 14:  0.0% user,  0.0% nice,  1.6% system,  0.0% interrupt, 98.4% idle
CPU 15:  0.4% user,  0.0% nice,  1.2% system,  0.0% interrupt, 98.4% idle
CPU 16:  0.0% user,  0.0% nice,  1.6% system,  0.0% interrupt, 98.4% idle
CPU 17:  0.4% user,  0.0% nice,  3.1% system,  0.0% interrupt, 96.5% idle
CPU 18:  0.0% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.6% idle
CPU 19:  0.8% user,  0.0% nice,  0.8% system,  0.0% interrupt, 98.4% idle
CPU 20:  0.0% user,  0.0% nice,  2.7% system,  0.0% interrupt, 97.3% idle
CPU 21:  0.4% user,  0.0% nice,  1.9% system,  0.0% interrupt, 97.7% idle
CPU 22:  0.0% user,  0.0% nice,  2.3% system,  0.0% interrupt, 97.7% idle
CPU 23:  0.4% user,  0.0% nice,  1.2% system,  0.0% interrupt, 98.4% idle
CPU 24:  0.0% user,  0.0% nice,  1.2% system,  0.0% interrupt, 98.8% idle
CPU 25:  0.8% user,  0.0% nice,  0.8% system,  0.0% interrupt, 98.4% idle
CPU 26:  0.4% user,  0.0% nice,  0.8% system,  0.0% interrupt, 98.8% idle
CPU 27:  0.0% user,  0.0% nice,  4.7% system,  0.0% interrupt, 95.3% idle
CPU 28:  0.0% user,  0.0% nice,  0.8% system,  0.0% interrupt, 99.2% idle
CPU 29:  0.4% user,  0.0% nice,  0.8% system,  0.0% interrupt, 98.8% idle
CPU 30:  0.0% user,  0.0% nice,  2.7% system,  0.0% interrupt, 97.3% idle
CPU 31:  0.0% user,  0.0% nice,  1.2% system,  0.0% interrupt, 98.8% idle
CPU 32:  0.0% user,  0.0% nice,  0.8% system,  0.0% interrupt, 99.2% idle
CPU 33:  0.0% user,  0.0% nice,  3.5% system,  0.0% interrupt, 96.5% idle
CPU 34:  0.0% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.6% idle
CPU 35:  0.0% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.6% idle
CPU 36:  1.2% user,  0.0% nice,  1.6% system,  0.0% interrupt, 97.3% idle
CPU 37:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 38:  0.4% user,  0.0% nice,  1.2% system,  0.4% interrupt, 98.1% idle
CPU 39:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 40:  0.0% user,  0.0% nice,  1.9% system,  0.0% interrupt, 98.1% idle
CPU 41:  0.0% user,  0.0% nice,  1.2% system,  0.0% interrupt, 98.8% idle
CPU 42:  0.4% user,  0.0% nice,  1.9% system,  0.0% interrupt, 97.7% idle
CPU 43:  0.0% user,  0.0% nice,  4.7% system,  0.0% interrupt, 95.3% idle
CPU 44:  0.0% user,  0.0% nice,  2.3% system,  0.0% interrupt, 97.7% idle
CPU 45:  0.4% user,  0.0% nice,  2.7% system,  0.0% interrupt, 96.9% idle
CPU 46:  0.4% user,  0.0% nice,  3.5% system,  0.0% interrupt, 96.1% idle
CPU 47:  0.4% user,  0.0% nice,  1.6% system,  0.0% interrupt, 98.1% idle
Mem: 82G Active, 23G Inact, 15G Wired, 3340K Cache, 1655M Buf, 4858M Free
ARC: 12G Total, 527M MFU, 11G MRU, 4375K Anon, 377M Header, 89M Other
Swap: 4096M Total, 12M Used, 4084M Free

  PID USERNAME     PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
12461 root          31    0 12268K  2552K zio->i 30   0:07  18.16% cp
 3151 mysql         24    0 92309M 92073M uwait   3  22:06   7.47% mysqld{mysqld}
 3151 mysql         22    0 92309M 92073M select 33  15:30   4.79% mysqld{mysqld}
 3151 mysql         20    0 92309M 92073M uwait  46 895:41   0.49% mysqld{mysqld}
12175 wfaulk        20    0 23864K  6404K CPU25  25   0:03   0.29% top
 6074 root          20    0 84348K 40372K kqread 25   0:11   0.20% vc-aggregator{vc-aggregator}

(The rest of the processes show 0.00% CPU utilization.)

Did you try to restart the machine? Did it any difference?

If a restart did not help, what you are facing can be a problem with excessive fragmentation, which is #1 enemy for rotating media (read: HDDs). The high number of snapshot can exacerbate this situation.

To confirm the issue, try the following:

  1. create a new similarly sized file with the command fallocate testfile.raw -l <size>
  2. try to read the newly-allocated files. If it read fast, then fragmentation of the old file is probably the culprit.

If you confirm that this is a frag issue, follow these steps:

  1. stop MySQL
  2. take a backup of your table.ibd file
  3. rename it to table.ibd.old (mv table.ibd table.ibd.old)
  4. copy it to the previous file name (cp -a table.ibd.old table.ibd)
  5. restart MySQL

EDIT after iostat update

Thank you for the iostat number.

You moved to the ZFS array a 15 GB file in about 67s, this means an ingestion rate of 223 MB/s or 55 MB/s per disk (excluding the mirrored ones). On the other hand, your iostat seems to report about that (about 25 MB/s), so I attribute that discrepancy to a compression ration of about 2:1.

OK, this is good. However, during read strange things happen...

Discarding the cat result (cat use a very small buffer by default, and with disabled prefetcher it surely is way slower then cp), the cp command is way to slow: you copied a 15 GB file in about 1530s, which mean an extraction rate of only 10 MB/s. And that is already factoring the compression advantage. On the other hand, your iostat number shows over 5 MB/s reads per disks, which add up to about 40 MB/s per array. Factoring the 2:1 compression ratio, it should give you a transfer rate north of 80 MB/s. This means that you are at about 1/8 your read potential.

The question is: why? It really seems as the CPU was maxed out during the transfer. Can you run a top and a dstat session during reading the affected file? If possible, configure top so show the per-CPU load.

  • 1
Reply Report
    • I cannot currently restart the machine. It's working well enough that bringing it down outside a maintenance window would be worse.
      • 2
    • Good idea. I was able to copy the UFS-backed copy of the file onto the ZFS filesystem in 66.93s. I immediately started reading it back off. It's been at least 10m now and it's still not done. So it's not a problem with individual file fragmentation.

Trending Tags