GlusterFS - 800TB and growing.

Discussion in 'Storage & Backup' started by elvis, Apr 1, 2013.

  1. ^catalyst

    ^catalyst Member

    Joined:
    Jun 27, 2001
    Messages:
    11,987
    Location:
    melbourne
    Great video!

    Love the 'Pretty lights looks like money well spent!' hahahah. Kudos, really, for being so open and sharing all of this.
     
  2. Daemon

    Daemon Member

    Joined:
    Jun 27, 2001
    Messages:
    5,471
    Location:
    qld.au
    Interesting interview elvis, thanks for sharing. It's always funny how different a business can be yet how similar the problems are.

    High availability, scalability and high performance.... on a beer budget ;)

    We've just deployed a distributed compute and storage system (proprietary) on commodity hardware yet with high performance and high availability. Once we have a few more nodes online I'll run a few benchmarks and write up a mini elvis report :)
     
  3. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    45,081
    Location:
    Brisbane
    Good stuff. Sharing this sort of thing is good for the world. :thumbup:
     
  4. Mac

    Mac Member

    Joined:
    Aug 1, 2001
    Messages:
    762
    Great thread. Will read again. :thumbup:

    I skimmed through most parts but... The headline is 800TB, which is split into production/dev environments, and there's replication between the nodes, so of the headline 800TB, you have 'only' 200TB of 'actual' storage right?

    This is where we find out that I can't read/comprehend...
     
  5. gbp

    gbp Member

    Joined:
    Dec 8, 2011
    Messages:
    6
    I found this thread really interesting.. thanks for sharing elvis :thumbup:
     
  6. fastcloud

    fastcloud New Member

    Joined:
    Feb 8, 2015
    Messages:
    3
    Is the gluster still running?

    Hi Elvis, thanks for sharing such an insightful thread. Is the Gluster setup still running now? How much has the capacity expanded?

    Would love to hear your update. We're planning to deploy Gluster in a production environment too.
     
  7. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    45,081
    Location:
    Brisbane
    Yes, still running. It's seen us through half a dozen major features now, and we've got three very large films and TV series being pushed through it currently.

    Sadly, no. Our technical team is fighting hard for more space, but for some reason that's proving to be an uphill battle. I won't go into the details, but our storage requirements continue to grow, and I really hope our management can green-light some upgrades soon.

    As a project, Gluster continues to mature. The development lists show a quality of discussion that's much improved, and the Gluster developers are having discussions around features that are definitely headed in the right direction, IMHO.

    Gluster is still more maintenance heavy under load than some other file systems, but if you've got the right use case, it's a great platform.
     
  8. fastcloud

    fastcloud New Member

    Joined:
    Feb 8, 2015
    Messages:
    3
    Great to know that. I'm trying to get Gluster up and running with VMs here. Like you said, it's pretty easy to configure. I really love the way it works to expand/shrink a volume (roughly the workflow sketched below).
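
    In case it's useful to anyone else, the grow/shrink workflow I've been playing with looks roughly like this. Sketch only: the volume name, hostnames and brick paths are made up, and on a replica 2 volume bricks go in and out in multiples of two.

        # grow: add a new pair of bricks, then rebalance existing data onto them
        gluster volume add-brick myvol node5:/data/brick node6:/data/brick
        gluster volume rebalance myvol start

        # shrink: drain a pair of bricks, check progress, then commit the removal
        gluster volume remove-brick myvol node5:/data/brick node6:/data/brick start
        gluster volume remove-brick myvol node5:/data/brick node6:/data/brick status
        gluster volume remove-brick myvol node5:/data/brick node6:/data/brick commit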

    However, I'm comparing it with FreeNAS at the same time, and expanding/shrinking a volume with ZFS is also pretty easy. Do you have anything to share on ZFS?

    Also, I'm trying to get in touch with the Red Hat guys, but they aren't as reachable as other vendors such as Nexenta, Quantum, etc. I could even use 'bad' to describe it, because my emails have had no response for more than two weeks. Have you ever tried to contact them? It has me a bit worried that I wouldn't get support from them. In your case, would that worry you as well?
     
  9. liam821

    liam821 Member

    Joined:
    Nov 8, 2013
    Messages:
    3
    Just an update from me. I ditched Gluster because the performance on top of ZFS was awful. Now I run a master/slave ZFS setup, currently 700T with more on the way. Basically I have two 700T systems, and replication ships the changes from the master to the slave every few minutes (roughly the loop sketched below).
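
    The replication loop is nothing fancy, something like this (sketch only - the pool, snapshot and host names here are made up):

        # one-off seed of the slave with a full replication stream
        zfs snapshot -r tank@rep-0
        zfs send -R tank@rep-0 | ssh slave zfs receive -F tank

        # then every few minutes: take a new snapshot, ship only the delta
        zfs snapshot -r tank@rep-1
        zfs send -R -i tank@rep-0 tank@rep-1 | ssh slave zfs receive -F tank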

    Here are a few things about ZFS...

    The good:

    Performance is awesome
    Replication actually works and is fast
    Rebuilding a bad drive doesn't involve re-reading the whole array
    Block level checksums, compression, snapshots
    iSCSI, Fibre Channel, NFS, SMB, direct file access
    POSIX compliant

    The bad:

    Single monolithic system
    Expansion can be difficult because of the above-mentioned single system
    If you do lose the system, you lose a lot of data at once
     
  10. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    45,081
    Location:
    Brisbane
    The point of GlusterFS is that it can expand to volumes larger than a single block storage system. ZFS can't do that.

    We require many hundreds of TB of online storage in a single namespace. With ZFS on the hardware we purchase, we couldn't get volumes big enough (not a limitation of ZFS - we couldn't buy hardware large enough to put ZFS on). We'd then be forced to buy larger vendor storage like a SAN, which is well outside of our budget (by an order of magnitude, easily).

    Never had a problem with contacting RedHat sales.

    http://www.redhat.com/en/about/contact/sales

    Did you try one of their main sales offices?
     
  11. tgt

    tgt Member

    Joined:
    Oct 19, 2013
    Messages:
    4
    Gluster runs pretty well on top of ZFS. It also has its own erasure translator now as of version 3.6, which means you can effectively do RAID5/6 across entire bricks (8 servers, and you can take any 2 offline and still have access to your data).

    The speed is improved over 3.3/3.4. Geo-replication is also much quicker.

    ZFS is hack-y to scale: usually you do a span of raid-zX vdevs, but you have to grow the pool by whole raid-zX vdevs at a time and they have to be the exact same size, and creating zpools across servers is more complicated.
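
    To put that in concrete terms, growth looks something like this (sketch only, hypothetical pool and device names) - you bolt on another whole vdev of the same shape rather than adding single disks:

        # pool starts life as a single 6-disk raidz2 vdev
        zpool create tank raidz2 sdb sdc sdd sde sdf sdg

        # growing means adding another whole raidz2 vdev, ideally identical
        zpool add tank raidz2 sdh sdi sdj sdk sdl sdm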
     
  12. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    45,081
    Location:
    Brisbane
    Having been through some real-world outages, one of the things I like about the replicate+distribute approach is that every brick has real, complete files on it, not just partial data. When shit hits the fan and you need to rescue data, that's worth the outlay IMHO.

    This is good news. We're still on 3.5, waiting for a few known bugs to clear out of the current 3.6 tree before upgrading.
     
  13. fastcloud

    fastcloud New Member

    Joined:
    Feb 8, 2015
    Messages:
    3
    I've heard that running Gluster on top of ZFS is terrible, so the ZFS I'm exploring is on BSD or OpenSolaris. I did some research on Nexenta; they can do high availability (dual controller), so I guess that would solve the disadvantage of being a single monolithic system. They also have a similar replicated mode, I think, but it's not cheap at all. For FreeNAS-based ZFS, AFAIK there is no HA setup.

    Yup, we like the single namespace too. I've read a lot about the downside of using a SAN due to the separate metadata database, so we wouldn't go for a SAN either.

    I didn't try the phone number, just email. A colleague eventually reached them through a personal contact of a friend. They should respond to email; who isn't using email nowadays? :rolleyes:

    Do you mean ZFS is configured on the Gluster nodes themselves, or that the nodes are connected via NFS to a ZFS volume?

    I believe you mean that to achieve a certain level of performance you have to do N raid-zX vdevs, where the more disks or RAID groups you have, the more performance you get.
     
  14. Foliage

    Foliage Member

    Joined:
    Jan 22, 2002
    Messages:
    32,083
    Location:
    Sleepwithyourdadelaide
    The upsides are obvious but what are the downsides of this feature?
     
  15. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    45,081
    Location:
    Brisbane
    Downside is you need double the storage for distributed+replicate.

    Our nodes are 37TB each, and there's 6 of them. That's just over 200TB of raw storage, but in our 3x2 setup we end up with just over 100TB usable per cluster.

    Think RAID10 versus RAID6 (important to note that Gluster works nothing like RAID, but the resulting storage volumes are what I'm talking about here). If you had 6 disks, RAID10 would give you 3 disks' worth of space, RAID6 would give you 4 disks' worth, and RAID5 would give you 5 disks' worth.

    Again, Gluster isn't like RAID. Distribute+replicate round-robins your file writes around the cluster, and ensures at least two copies exist on different bricks. The files are written as whole files (not chunked up and split around the cluster), so that in the event of catastrophic failure, you can always fetch whole files off the backing storage even if your cluster is completely non-functional.
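
    For the curious, the layout falls out of the brick ordering at volume create time. Something like this (a sketch with made-up hostnames and brick paths, not our actual config) gives a 3x2 distribute+replicate volume, with adjacent bricks in the list forming the replica pairs:

        gluster volume create prodvol replica 2 transport tcp \
            gfs01:/bricks/b1 gfs02:/bricks/b1 \
            gfs03:/bricks/b1 gfs04:/bricks/b1 \
            gfs05:/bricks/b1 gfs06:/bricks/b1
        gluster volume start prodvol
        mkdir -p /mnt/prodvol
        mount -t glusterfs gfs01:/prodvol /mnt/prodvol

    Six bricks at replica 2 is exactly why usable space ends up at roughly half the raw capacity.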

    With all of that said, Gluster runs extremely well on commodity storage. We buy Supermicro chassis and motherboards with battery-backed LSI controllers, fill them with disk and RAM, ensure they have redundant power supplies and UPS. They run either Myricom or Intel 10GbE with SFP+ connectors to Dell fibre switching. Reliability of all of that kit is very high, and even with all of those parts and the fact that we need double the disks, we still end up an order of magnitude cheaper than any SAN vendor.

    NAS vendors can match us on price, but don't offer the unlimited scalability that Gluster does, nor the ability to scale up performance as the cluster gets bigger (nor the freedom to mix and match components from different vendors, but I'd never expect that from competing vendors).

    There is some kick-arse SAN tech out there today. 3Par is an example of something I'd *love* to deploy in my organisation. But for us and our real-world usage requirements, the cost is just insane.
     
  16. ics

    ics Member

    Joined:
    Jul 8, 2002
    Messages:
    209
    So I've decided to jump on the glusterfs bandwagon. Configuration:

    3 x N54L HP Microservers running Ubuntu Linux 14.04.2

    Physical Hardware:
    4GB of non-ECC RAM
    4 x 4TB Western Digital Red hard drives

    Software-wise, each box is configured using mdadm software RAID5 with LVM and ext4 (rough commands below).
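
    Per box, the brick preparation was along these lines (a sketch only - device names, volume group names and mount points are just examples):

        # 4-disk RAID5 array, carved into a single LV, formatted ext4
        mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]
        pvcreate /dev/md0
        vgcreate vg_brick /dev/md0
        lvcreate -n brick -l 100%FREE vg_brick
        mkfs.ext4 /dev/vg_brick/brick
        mkdir -p /export/brick
        mount /dev/vg_brick/brick /export/brick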

    Each microserver is part of a disperse GlusterFS volume (created roughly as shown below). A disperse volume is a new volume type that works a lot like RAID5. In effect I've lost one entire microserver to redundancy.
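
    The volume itself is one create command across the three boxes (again a sketch with example hostnames; disperse 3 with redundancy 1 means two bricks' worth of data plus one of redundancy, so any single server can drop out):

        gluster volume create dispvol disperse 3 redundancy 1 \
            ms1:/export/brick/gv0 ms2:/export/brick/gv0 ms3:/export/brick/gv0
        gluster volume start dispvol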

    I can now tolerate a significant amount of failure before any data loss should occur. I can also tolerate bit rot, as GlusterFS detects this and fixes it on read.

    Thus far I'm very happy with the outcome.

    I can now shut down or upgrade an entire microserver without the data going offline. I've tested shutting down the microservers to confirm how it responds; in each case GlusterFS has handled it without issue.

    Trav
     
  17. aza2001

    aza2001 Member

    Joined:
    Sep 14, 2002
    Messages:
    2,016
    Location:
    Northmead
    sorry to dig up an old thread...

    just wondering how the system is going along? :)
     
  18. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    45,081
    Location:
    Brisbane
    Our demand for storage has grown exponentially. Since this thread was started, we moved from 800TB of primary storage, and around 300TB of nearline/secondary storage, to 2.5PB of primary storage with around 1PB of secondary storage.

    Gluster served us well during the middle portion of our growth, but without the on-site IT resources to manage it, we moved to a vendor-provided solution instead.

    The good news is that, compared to when we decided to proceed with Gluster, vendor-provided storage has come down quite a lot in price. We're talking an order of magnitude less per unit volume storage.

    This has all come about from new players in the market. Many of the traditional names are still there, and still spruiking their old (and IMHO grossly outdated) technologies. New players understand that storage needs to be not only more flexible but a hell of a lot cheaper in the modern world of very high resolution and high bitrate audio and video.

    Still, there are new open source players out there that I find fascinating, and hope to experiment with too. OrangeFS is an open source offshoot of PVFS, which was becoming quite a big player in parallel file systems. It seems to address some of the performance issues we had with GlusterFS at scale, and would be interesting to run up and use in anger. But again, that's well beyond my workload at the moment, so with vendor storage being both as capable and financially viable as it is today, it'll have to wait until current vendors inevitably get greedy and force us to switch again.
     
  19. Diode

    Diode Member

    Joined:
    Jun 17, 2011
    Messages:
    1,736
    Location:
    Melbourne
    Boss: Hey Elvis the Mac guys are pissed at your storage again, can you go help?
    Elvis: I'm right on it!
    Boss: How come you're the only one that can fix this thing? Why are we paying you 6 figures for level 1 support?
    Elvis: My level 1 plebs are too stupid. I told you to hire better plebs fool!
    Boss: Did you just call me...
    Elvis: Sorry just placing an order for Pho...
    Boss: Well either way I have Dell in the room
    Elvis: God damn vendors!
    Boss: Elvis, don't forget who your daddy is. See you in 5, and don't forget my flat macchiato.

    Ok enough fun. ;)

    So what wasn't working for you beyond 2.5PB? Technical limitations? Support? Would be interested to hear what you're using now. What challenges did you face with Gluster, and could the vendor solution overcome them?
     
    Last edited: Jun 16, 2017
  20. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    45,081
    Location:
    Brisbane
    GlusterFS uses distributed metadata. Anything dealing with raw streaming of medium sized objects (i.e.: our DPX or EXR image sequence frames) was great. Where it fell over was with applications that wanted to constantly fiddle with metadata on things.

    In the early stages, we used a pretty sane amount of software. In recent years, the variety of software we use in the business has exploded. Worse, a lot of it is very low quality. Adobe, for example, have always been a mid-tier vendor, not coming anywhere near the likes of Autodesk, SideFX or The Foundry for quality of written software (and if you know how much I hate Autodesk, you know how much of an insult that is). Adobe's applications constantly want to do very non-standard things with storage, which makes it nigh impossible for them to function well with GlusterFS's distributed metadata.

    Other clustered filesystems get around this by having dedicated metadata servers (Ceph, for example). But the complexity overhead of Ceph compared to GlusterFS is substantial, and I don't have a team capable of managing Ceph (they're not stupid people by any means, but their expertise is in supporting an enormous array of end-user applications, not in managing complex Linux based clustered storage).

    We ended up going with several Oracle ZS3-2 appliances. These are not much more exciting than a large Xeon-based chassis running Solaris with 0.5TB of RAM connected to a lot of spindles. What they offer, however, is essentially a "dumb NAS" at the implementation level - giving us access to a lot of NFSv3 without needing to manage extra servers in front of them like a traditional SAN. (Although we can still export both iSCSI and FC LUNs if we want, which we do for some edge-case tools.)

    Problems it solved: the metadata issue above. Being one non-distributed file system, metadata was sane, even for utterly retarded applications. There are still a handful of edge cases that break (certain sharing modes in Microsoft Excel still don't work, but that's because Microsoft do very, very stupid things particular to Excel to allow multiple people to write to the same spreadsheet at the same time - something we've moved to Google Sheets to overcome, even if I think "death by spreadsheet" is still the bane of modern business, but that's a whole other rant).

    And of course, ZFS is a full blown Next Gen Filesystem. GlusterFS did have some options in newer versions, like block checksumming at the GlusterFS level (and if you use BtrFS below it, at the block level). But our primary performance issue was the metadata lookup problem (especially how slow negative lookups were, which Apple MacOSX is famous for with its ._ resource fork files - utterly retarded in 2017, but it's how they continue to work, proving their complete lack of focus on business).

    We could have built a similar solution (ZFS is open source, and there are plenty of ways to build your own via either ready-made distros or bake-your-own options). However with the number of spindles we purchased, we were looking for support just to handle the drive failures more elegantly. Oracle came in with a great price that wasn't a whole lot more than something we could build ourselves, and offered a sane support model that didn't blow out in price like other vendors do after the 4th year.

    The other "nice to have" on the ZS3 appliance was Oracle's built in reporting. It's basically a graphing tool built on top of dtrace that gives you instant stats on your storage, without needing to train your team up in dtrace (which is necessarily complex). That wasn't the deal-maker by any means, but it has helped to answer some arguments internally about why performance takes a beating some days, and has allowed us to make more sensible decisions about how we spread network and CPU resources around our business.
     
