A Christmas letter to Gluster developers

Where I talk about Gluster resiliency

Dear Gluster developers,

The last few days have been tense, since a replica 3 (R3) Gluster 3.8.5 cluster that I built has been plagued by problems.

The first symptom was a continuous stream of messages in the client logs like:

[2016-12-17 15:55:02.047508] E [MSGID: 108009] [afr-open.c:187:afr_openfd_fix_open_cbk] 0-prod-1-replicate-0: Failed to open /galaxy/java/lib/java/jre1.7.0_51/jre/lib/rt.jar on subvolume prod-1-client-2 [Transport endpoint is not connected]

These came together with very frequent peer disconnections and reconnections, and a steady stream of files to be healed on several volumes.
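For anyone chasing the same symptom, here is a minimal sketch of how such messages could be tallied per subvolume, to see which peer is the one dropping off. The regular expression is modeled only on the single log line quoted above, so the layout of other Gluster messages is an assumption, and the names `LOG_RE` and `count_failed_opens` are made up for this illustration.

```python
import re
from collections import Counter

# Pattern modeled on the one log line quoted in this letter.
# Assumption: other "Failed to open" messages share the same layout.
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"\[MSGID:\s*(?P<msgid>\d+)\]\s+"
    r"\[(?P<src>[^\]]+)\]\s+"
    r"(?P<xlator>\S+):\s+"
    r"Failed to open (?P<path>.+) on subvolume (?P<subvol>\S+)\s+"
    r"\[(?P<reason>[^\]]+)\]\s*$"
)


def count_failed_opens(lines):
    """Count 'Failed to open' messages per subvolume (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match:
            counts[match.group("subvol")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_failed_opens.py < client.log
    for subvol, n in count_failed_opens(sys.stdin).most_common():
        print(f"{subvol}\t{n}")
```

If one subvolume dominates the count, as prod-1-client-2 would in the line above, that points at a single peer rather than at the volume as a whole.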

The problem was finally traced back to a flaky X540-T2 10GbE NIC embedded in the motherboard of one of the peers. The thing was unable to hold the correct 10 Gbit/s speed negotiation with the switch.

The motherboard on that peer was replaced and, after that, the volumes quickly healed back to complete health. In the meantime, the users kept running some heavy-duty bioinformatics applications (NGS data analysis) on top of Gluster. No user noticed anything, despite a major hardware problem and a peer being taken offline.

This is a RESILIENT system, in my book.

Despite the constant stream of problem reports and requests for help that you see on both the ML and IRC, rest assured that you are building a nice piece of software, at least in my experience.

Keep up the good work and Merry Christmas.

Ivan Rossi

PS: I have a strong feeling that people running Gluster in a cloud environment will have experiences like this one more often than they would like. But that is not Gluster's fault, IMHO.