GlusterFS Split Brain issue

I have been facing performance issues with my GlusterFS setup. We took a new build of our application live, and suddenly all GlusterFS clients and both master servers started showing high CPU utilization. This is causing real pain. My setup is as follows:

I have two GlusterFS master servers running version 3.7.4:

[root@gfs1 glusterfs]# gluster volume info

Volume Name: repl-vol
Type: Replicate
Volume ID: 7535cfad-6bb9-4147-9fea-e869e7b8d565
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs1.myhost.com:/GlusterFS/repl-data
Brick2: gfs2.myhost.com:/GlusterFS/repl-data
Options Reconfigured:
cluster.self-heal-window-size: 100
performance.cache-max-file-size: 2MB
performance.cache-size: 256MB
performance.write-behind-window-size: 4MB
performance.io-thread-count: 32
cluster.data-self-heal-algorithm: diff
nfs.disable: off

[root@gfs2 ec2-user]# gluster volume info

Volume Name: repl-vol
Type: Replicate
Volume ID: 7535cfad-6bb9-4147-9fea-e869e7b8d565
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs1.myhost.com:/GlusterFS/repl-data
Brick2: gfs2.myhost.com:/GlusterFS/repl-data
Options Reconfigured:
cluster.self-heal-window-size: 100
nfs.disable: off
cluster.data-self-heal-algorithm: diff
performance.io-thread-count: 32
performance.write-behind-window-size: 4MB
performance.cache-size: 256MB
performance.cache-max-file-size: 2MB
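
The two listings above show the reconfigured options in a different order, which makes them hard to compare by eye. A minimal sketch, assuming the `gluster volume info` output has been captured as text, that compares the option sets regardless of order. The function name `parse_reconfigured_options` and the constants are illustrative only, not part of any Gluster tooling.

```python
def parse_reconfigured_options(volume_info: str) -> dict:
    """Return the key/value pairs listed under 'Options Reconfigured:'."""
    options = {}
    in_options = False
    for raw in volume_info.splitlines():
        line = raw.strip()
        if line == "Options Reconfigured:":
            in_options = True
            continue
        if in_options:
            if not line:
                break  # a blank line ends the options block
            key, _, value = line.partition(": ")
            options[key] = value
    return options


# Option blocks copied from the gfs1 and gfs2 output above.
GFS1_INFO = """
Options Reconfigured:
cluster.self-heal-window-size: 100
performance.cache-max-file-size: 2MB
performance.cache-size: 256MB
performance.write-behind-window-size: 4MB
performance.io-thread-count: 32
cluster.data-self-heal-algorithm: diff
nfs.disable: off
"""

GFS2_INFO = """
Options Reconfigured:
cluster.self-heal-window-size: 100
nfs.disable: off
cluster.data-self-heal-algorithm: diff
performance.io-thread-count: 32
performance.write-behind-window-size: 4MB
performance.cache-size: 256MB
performance.cache-max-file-size: 2MB
"""

if __name__ == "__main__":
    same = parse_reconfigured_options(GFS1_INFO) == parse_reconfigured_options(GFS2_INFO)
    print("options identical on both servers:", same)
```

Comparing dictionaries rather than raw text ignores ordering, so only a real difference in a key or value would show up.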

I have around 14 clients using GlusterFS. The volume hosts 1.2TB of data, which is basically static content (JS/CSS/images). We have been seeing sudden spikes in server CPU utilization. Network I/O is also very high, at 125MB/s-250MB/s. I checked the logs and found the following problems repeatedly:

[2015-09-09 03:13:33.797655] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <3fd13508-b29e-4d52-8c9c-14ccd2f24b9f/100000130641_4.jpg>, ed715d52-4a39-46db-901b-16ae13f01898 on repl-vol-client-1 and 0bc0c058-b6a7-4f0d-9d46-96f7fcded0f3 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 03:13:36.074219] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <3fd13508-b29e-4d52-8c9c-14ccd2f24b9f/100000132992_4.jpg>, 8b67cc38-df53-43c7-ad42-b9c616b980b1 on repl-vol-client-1 and 41f393de-9d83-4f52-bfcf-832e31a27a87 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 03:13:36.076681] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <3fd13508-b29e-4d52-8c9c-14ccd2f24b9f/100000132995_4.jpg>, b1dd578b-3dfe-43dc-ad3a-d54c86298278 on repl-vol-client-1 and bd7c42b9-575f-46bc-9f56-804994f27ab0 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 04:00:50.975933] I [MSGID: 108026] [afr-self-heal-entry.c:589:afr_selfheal_entry_do] 0-repl-vol-replicate-0: performing entry selfheal on cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3
[2015-09-09 04:00:51.005409] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3/100000160597.jpg>, 68c6fd47-6edc-46fe-8992-2d662bc698e8 on repl-vol-client-1 and 43e1a033-ad08-495b-b762-757cb2f566c0 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 04:00:51.011467] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-repl-vol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2015-09-09 04:00:51.014205] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-repl-vol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2015-09-09 04:00:51.046092] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3/100000160597.jpg>, 68c6fd47-6edc-46fe-8992-2d662bc698e8 on repl-vol-client-1 and 43e1a033-ad08-495b-b762-757cb2f566c0 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 04:10:53.125065] I [MSGID: 108026] [afr-self-heal-entry.c:589:afr_selfheal_entry_do] 0-repl-vol-replicate-0: performing entry selfheal on cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3
[2015-09-09 04:10:53.225256] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3/100000160597.jpg>, 68c6fd47-6edc-46fe-8992-2d662bc698e8 on repl-vol-client-1 and 43e1a033-ad08-495b-b762-757cb2f566c0 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 04:10:53.232229] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-repl-vol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2015-09-09 04:10:53.236203] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-repl-vol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2015-09-09 04:10:53.343344] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3/100000160597.jpg>, 68c6fd47-6edc-46fe-8992-2d662bc698e8 on repl-vol-client-1 and 43e1a033-ad08-495b-b762-757cb2f566c0 on repl-vol-client-0. Skipping conservative merge on the file.
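
Since the same file shows up several times in the excerpt above, it can help to count mismatches per file. A minimal sketch, assuming the log lines follow exactly the "Gfid mismatch detected" format shown here; the regular expression and the function name `summarize_mismatches` are illustrative and may need adjusting for other Gluster versions.

```python
import re
from collections import Counter

# Matches the AFR "Gfid mismatch detected" lines in the format shown above.
MISMATCH_RE = re.compile(
    r"Gfid mismatch detected for <(?P<parent>[0-9a-f-]{36})/(?P<name>[^>]+)>, "
    r"(?P<gfid_a>[0-9a-f-]{36}) on (?P<sub_a>\S+) and "
    r"(?P<gfid_b>[0-9a-f-]{36}) on (?P<sub_b>\S+)\."
)


def summarize_mismatches(lines):
    """Count mismatch messages per (parent directory gfid, file name)."""
    counts = Counter()
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            counts[(match.group("parent"), match.group("name"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py /path/to/self-heal-log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for (parent, name), count in summarize_mismatches(handle).most_common():
            print(f"{count:6d}  {parent}/{name}")
```

Lines that do not match, such as the "remote operation failed" warnings, are skipped, so the output lists only files with conflicting gfids.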

The two primary errors are "remote operation failed" and "Gfid mismatch". I even tried resolving the split brain, but either I am doing something wrong or it's not working.

Steps for recovering:

[root@gfs2 ec2-user]# gluster volume heal repl-vol info split-brain
Brick gfs1.myhost.com:/GlusterFS/repl-data/
<gfid:cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3>
/media/klevu_images/1/0
Number of entries in split-brain: 2

Brick gfs2.myhost.com:/GlusterFS/repl-data/
/media/klevu_images/1/0
<gfid:cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3>
Number of entries in split-brain: 2
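
To check whether both bricks report the same entries, the output above can be grouped per brick. A minimal sketch, assuming the text layout of `gluster volume heal <volume> info split-brain` matches what is shown here; `parse_split_brain_info` is an illustrative name, not a Gluster API.

```python
def parse_split_brain_info(text: str) -> dict:
    """Map each brick to the list of entries reported in split-brain."""
    bricks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Brick "):
            current = line[len("Brick "):]
            bricks[current] = []
        elif line.startswith("Number of entries"):
            current = None  # the summary line closes the brick section
        elif current is not None:
            bricks[current].append(line)
    return bricks


# Output copied from the heal command above.
HEAL_INFO = """
Brick gfs1.myhost.com:/GlusterFS/repl-data/
<gfid:cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3>
/media/klevu_images/1/0
Number of entries in split-brain: 2

Brick gfs2.myhost.com:/GlusterFS/repl-data/
/media/klevu_images/1/0
<gfid:cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3>
Number of entries in split-brain: 2
"""

if __name__ == "__main__":
    per_brick = parse_split_brain_info(HEAL_INFO)
    # Entries present on every brick, regardless of listing order.
    common = set.intersection(*(set(entries) for entries in per_brick.values()))
    for entry in sorted(common):
        print(entry)
```

For the output above, both bricks report the same two entries: the path `/media/klevu_images/1/0` and the gfid that also appears as the parent in the mismatch log lines.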

So I simply deleted the files listed above and then ran gluster volume heal repl-data.
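
For context on the deletion step: a brick normally keeps a link for each gfid under `.glusterfs/<first two hex characters>/<next two>/<gfid>`, so removing only the visible file may leave that link behind. A minimal sketch that just computes where such a link would live; it deletes nothing, `gfid_backend_path` is a hypothetical helper, and the layout should be verified against the GlusterFS documentation for the version in use.

```python
import posixpath
import uuid


def gfid_backend_path(brick_root: str, gfid: str) -> str:
    """Return the expected .glusterfs link path for a gfid on a brick."""
    # uuid.UUID validates the string; str() yields the canonical lower-case form.
    canonical = str(uuid.UUID(gfid))
    return posixpath.join(
        brick_root, ".glusterfs", canonical[0:2], canonical[2:4], canonical
    )


if __name__ == "__main__":
    # Brick path and gfid taken from the output above.
    print(gfid_backend_path("/GlusterFS/repl-data",
                            "cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3"))
```

The sketch only builds a path string, so it is safe to run anywhere; inspecting or removing the link itself would have to be done on the brick server.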

I am not really sure that resolving the split brain will address my performance issue. Moreover, new split-brain entries keep appearing. My primary objective is to fix the performance.
