GlusterFS Split Brain issue

I have been facing performance issues with my GlusterFS setup. We took a new build of our application live, and suddenly all the GlusterFS clients and master servers started showing high CPU utilization. This is causing real pain. My setup is as follows:

I have two GlusterFS master servers running version 3.7.4:

[root@gfs1 glusterfs]# gluster volume info

Volume Name: repl-vol
Type: Replicate
Volume ID: 7535cfad-6bb9-4147-9fea-e869e7b8d565
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs1.myhost.com:/GlusterFS/repl-data
Brick2: gfs2.myhost.com:/GlusterFS/repl-data
Options Reconfigured:
cluster.self-heal-window-size: 100
performance.cache-max-file-size: 2MB
performance.cache-size: 256MB
performance.write-behind-window-size: 4MB
performance.io-thread-count: 32
cluster.data-self-heal-algorithm: diff
nfs.disable: off

[root@gfs2 ec2-user]# gluster volume info

Volume Name: repl-vol
Type: Replicate
Volume ID: 7535cfad-6bb9-4147-9fea-e869e7b8d565
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs1.myhost.com:/GlusterFS/repl-data
Brick2: gfs2.myhost.com:/GlusterFS/repl-data
Options Reconfigured:
cluster.self-heal-window-size: 100
nfs.disable: off
cluster.data-self-heal-algorithm: diff
performance.io-thread-count: 32
performance.write-behind-window-size: 4MB
performance.cache-size: 256MB
performance.cache-max-file-size: 2MB
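
For context, the reconfigured options above were applied with the standard gluster volume set command, along these lines:

gluster volume set repl-vol performance.cache-size 256MB
gluster volume set repl-vol performance.cache-max-file-size 2MB
gluster volume set repl-vol performance.write-behind-window-size 4MB
gluster volume set repl-vol performance.io-thread-count 32
gluster volume set repl-vol cluster.data-self-heal-algorithm diff
gluster volume set repl-vol cluster.self-heal-window-size 100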

I have around 14 clients mounting GlusterFS. The volume hosts about 1.2 TB of data, which is basically static content (JS/CSS/images). We have been seeing sudden spikes in server CPU utilization, and network I/O is very high too, 125 MB/s to 250 MB/s. I checked the logs and mainly found the problem below, repeatedly:

[2015-09-09 03:13:33.797655] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <3fd13508-b29e-4d52-8c9c-14ccd2f24b9f/100000130641_4.jpg>, ed715d52-4a39-46db-901b-16ae13f01898 on repl-vol-client-1 and 0bc0c058-b6a7-4f0d-9d46-96f7fcded0f3 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 03:13:36.074219] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <3fd13508-b29e-4d52-8c9c-14ccd2f24b9f/100000132992_4.jpg>, 8b67cc38-df53-43c7-ad42-b9c616b980b1 on repl-vol-client-1 and 41f393de-9d83-4f52-bfcf-832e31a27a87 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 03:13:36.076681] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <3fd13508-b29e-4d52-8c9c-14ccd2f24b9f/100000132995_4.jpg>, b1dd578b-3dfe-43dc-ad3a-d54c86298278 on repl-vol-client-1 and bd7c42b9-575f-46bc-9f56-804994f27ab0 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 04:00:50.975933] I [MSGID: 108026] [afr-self-heal-entry.c:589:afr_selfheal_entry_do] 0-repl-vol-replicate-0: performing entry selfheal on cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3
[2015-09-09 04:00:51.005409] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3/100000160597.jpg>, 68c6fd47-6edc-46fe-8992-2d662bc698e8 on repl-vol-client-1 and 43e1a033-ad08-495b-b762-757cb2f566c0 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 04:00:51.011467] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-repl-vol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2015-09-09 04:00:51.014205] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-repl-vol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2015-09-09 04:00:51.046092] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3/100000160597.jpg>, 68c6fd47-6edc-46fe-8992-2d662bc698e8 on repl-vol-client-1 and 43e1a033-ad08-495b-b762-757cb2f566c0 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 04:10:53.125065] I [MSGID: 108026] [afr-self-heal-entry.c:589:afr_selfheal_entry_do] 0-repl-vol-replicate-0: performing entry selfheal on cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3
[2015-09-09 04:10:53.225256] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3/100000160597.jpg>, 68c6fd47-6edc-46fe-8992-2d662bc698e8 on repl-vol-client-1 and 43e1a033-ad08-495b-b762-757cb2f566c0 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 04:10:53.232229] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-repl-vol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2015-09-09 04:10:53.236203] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-repl-vol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2015-09-09 04:10:53.343344] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3/100000160597.jpg>, 68c6fd47-6edc-46fe-8992-2d662bc698e8 on repl-vol-client-1 and 43e1a033-ad08-495b-b762-757cb2f566c0 on repl-vol-client-0. Skipping conservative merge on the file.

The two primary errors are "remote operation failed" and "Gfid mismatch". I even tried resolving the split-brain, but it seems either I am doing something wrong or it is not working.
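
To sanity-check the GFID mismatch directly, the trusted.gfid xattr can be read off each brick with getfattr. The path below is pieced together from the log above, on the assumption that the parent directory cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3 corresponds to /media/klevu_images/1/0 (as the heal info below suggests), so treat it as an illustration:

# run on both gfs1 and gfs2; in split-brain the trusted.gfid values differ between the bricks
getfattr -d -m . -e hex /GlusterFS/repl-data/media/klevu_images/1/0/100000160597.jpg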

Steps I took to recover:

[root@gfs2 ec2-user]# gluster volume heal repl-vol info split-brain
Brick gfs1.myhost.com:/GlusterFS/repl-data/
<gfid:cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3>
/media/klevu_images/1/0
Number of entries in split-brain: 2

Brick gfs2.myhost.com:/GlusterFS/repl-data/
/media/klevu_images/1/0
<gfid:cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3>
Number of entries in split-brain: 2

So I simply deleted the files above and then ran gluster volume heal repl-vol.
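
Roughly what that looked like, per affected file (run on the brick whose copy is being discarded, here gfs2; the file path is pieced together from the log and heal info above, and the .glusterfs hard-link removal follows the usual manual GFID split-brain cleanup, so this is a sketch rather than the exact commands):

rm /GlusterFS/repl-data/media/klevu_images/1/0/100000160597.jpg
# remove the matching hard-link under .glusterfs (68c6fd47... is the GFID this brick reported)
rm /GlusterFS/repl-data/.glusterfs/68/c6/68c6fd47-6edc-46fe-8992-2d662bc698e8
# then trigger a heal and re-check
gluster volume heal repl-vol
gluster volume heal repl-vol info split-brain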

I am not really sure that resolving the split-brain will address my performance issue; moreover, new split-brain entries keep appearing. My primary objective is to fix the performance.
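
If it helps in answering, I can collect heal-backlog and FOP-profile numbers with the stock gluster CLI while a CPU spike is happening, along these lines:

gluster volume heal repl-vol statistics heal-count
gluster volume profile repl-vol start
gluster volume profile repl-vol info
gluster volume profile repl-vol stop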
