
GlusterFS - 800TB and growing.

Discussion in 'Storage & Backup' started by elvis, Apr 1, 2013.

  1. OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    44,852
    Location:
    Brisbane
    Hi John,

    Thanks for jumping in and saying hi! I'll chat to my CEO about it, and let you know. Happy to put something on gluster.org if the boss is cool with it.

    Well I really appreciate it. Thanks so much.

    Excellent news. I'll build this into the backup cluster tomorrow, do some testing, and migrate it to production based on the results. That'll be a huge boost for the Mac users.

    I'm not sure. Gluster does allow for a node to run multiple bricks, but I'm not sure if it can import a brick that was previously owned by another node.

    I don't really see the problem. You could definitely scrub the old brick, assign it to a new node, and initiate a self heal. You'd be running on 1 copy of the data until the heal is complete, but you could do all that hot on a live system.
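
    A rough sketch of what that looks like from the CLI (the volume name and brick paths here are made up):

        # point the volume at the brick's new home, then kick off a full self-heal
        gluster volume replace-brick bigvol oldnode:/bricks/b01 newnode:/bricks/b01 commit force
        gluster volume heal bigvol full
        gluster volume heal bigvol info    # watch progress; you're on one copy until it finishes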

    The redundancy is still there. Also, Gluster is more of a scale-out NAS concept. You're talking about LUNs from central SAN-style block level storage being assigned to nodes. I'm not sure I'd back Gluster with SAN storage, to be honest. That seems kind of pointless (although technically doable).

    Data is not striped on my setup. You can do that, but I don't.

    Files are written whole, in a distributed manner around the cluster. If I'm connected to Node 0 via NFS, and request a file, if it's on Node 0 I get it direct. If it's on Node 6, it travels to Node 0 via GlusterFS, and then to me via NFS.

    By comparison, if I connect via FUSE+GlusterFS, I get the file direct from the node it lives on, as I have connections to all nodes and the ability to query the DHT directly.
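
    In mount terms (server and volume names are examples only), the two access paths are just different mounts:

        # NFS: everything funnels through the node you mounted from
        mount -t nfs -o vers=3 node0:/vfxvol /mnt/vfxvol

        # FUSE+GlusterFS: the client holds connections to every node and pulls files direct
        mount -t glusterfs node0:/vfxvol /mnt/vfxvol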

    10GbE Ethernet over fibre, with SFP+ transceivers.

    http://en.wikipedia.org/wiki/Small_form-factor_pluggable_transceiver#SFP.2B

    We had a single box in the whole place that was CX-4, but thankfully it's been decommissioned. We had more problems with random packet loss on that one system than anything I've ever worked on. It was terrible. The replacement box is now Fibre 10GbE SFP+ like everything else.
     
  2. Davo1111

    Davo1111 Member

    Joined:
    Mar 5, 2009
    Messages:
    3,018
    Location:
    Sydney
  3. digital_ecstacy

    digital_ecstacy Member

    Joined:
    Jul 27, 2006
    Messages:
    324
    Location:
    Coffs Harbour, NSW; 2450
    You planning on downloading the internet? :p
     
  4. KonMan

    KonMan Member

    Joined:
    Jun 28, 2001
    Messages:
    589
    Location:
    Melbourne
    Did you consider using Arista for your switches? I read some really good things about them... I was going to get them for a 50+ seat architectural office who are doing funky stuff with Revit, but I got overruled and now they are suffering for it....
     
  5. OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    44,852
    Location:
    Brisbane
    Current estimates are that the VFX guys will fill it by June.

    My next job is a complete overhaul of their tape backup and archive system using Bacula. That includes being able to quickly offline portions of a job and retrieve them later. All for the tidy sum of $0 in software.

    No pressure! :)
     
  6. OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    44,852
    Location:
    Brisbane
    Not for this location at present. Added complexity, added infrastructure, etc. If the budget were higher or there was existing InfiniBand in the place, then I'd give it a go.

    Covered in the first post - it needed to be simple, and able to start small. A lack of senior Linux staff on site combined with a restriction on the number of starting nodes limited the scope.

    Compiling kernels and the like is way above the skillset of the staff on site. I've built the Gluster setup purely from RPM installs, and deployed it via Puppet. Adding a new Gluster node is as simple as installing CentOS 6 and pointing puppet at the puppet master. It'll install and configure the node with the necessary software. From there, grow the volume and done.
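
    Roughly, the "grow the volume" step looks like this (hostnames and brick paths are examples; bricks go in as a pair on a replica-2 volume):

        gluster peer probe node08                  # join the new boxes to the trusted pool
        gluster peer probe node09
        gluster volume add-brick bigvol node08:/bricks/b01 node09:/bricks/b01
        gluster volume rebalance bigvol start      # spread existing data across the new bricks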

    Gluster runs entirely in user space. Very simple to get up and running on a set of VMs if you want to test pure functionality.
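
    A two-VM replica volume is only a handful of commands once the glusterfs-server RPMs are installed (names here are hypothetical):

        # run on vm1, with the gluster daemon running on both VMs
        gluster peer probe vm2
        gluster volume create testvol replica 2 vm1:/data/brick1 vm2:/data/brick1
        gluster volume start testvol

        # then from any client with the gluster client packages installed
        mount -t glusterfs vm1:/testvol /mnt/testvol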

    Hrm... I dunno. I'm pretty happy with how it went given the limitations in place. Perhaps I'd consider smaller nodes and more of them instead of huge nodes and fewer. Although with that said, I can see us buying more nodes within 6 months.

    The single biggest bottleneck by an order of magnitude is the lookup on files internally to Gluster. That and negative lookups (looking for a file that doesn't exist). A negative lookup cache would speed things up considerably here (HINT HINT REDHAT). But all the same, the file lookup latency per file adds far more overhead than FUSE's limitations.

    Indeed, if MacFUSE was still in development, I'd seriously consider using it over AFP for the clustered benefits.

    All good. :thumbup:

    And awesome questions by the way. Cheers!

    We're responsible for around 10 minutes of footage from a feature film. Next time you go and see a movie, consider just how much disk went to making what you see on screen. :)

    The previous IT manager wanted to go with Arista. We have quite a lot of Dell switching in production already. We added a 48 port fibre switch to our existing collection of 24 port fibre switches for this storage upgrade, and the performance from them has been pretty good considering the price.

    At this stage, I'm pretty happy to keep the switching as "dumb" as possible (the "simplicity" goal is one I keep coming back to). If it genuinely becomes a bottleneck, then we'll definitely look at other brands.
     
    Last edited: Apr 4, 2013
  7. splbound

    splbound Member

    Joined:
    Feb 23, 2004
    Messages:
    538
    Location:
    London / Sydney
    A truly fantastic post elvis. Have always liked reading your insights, hope you keep them coming.
     
  8. cdis

    cdis Member

    Joined:
    Oct 25, 2006
    Messages:
    29
    Location:
    Brisbane
    That was a great read, thanks!
     
  9. Diode

    Diode Member

    Joined:
    Jun 17, 2011
    Messages:
    1,736
    Location:
    Melbourne
    It would be good to investigate, since you're basically replicating the data, which as you mentioned causes overhead because data is written to another brick. Rather than replicating data that already has RAID-level redundancy on your storage, it would be better to simply reassign a brick to another active node. It would also halve the amount of storage you need to buy.

    In the case of the IBRIX system, it is backed by one of HP's high density storage blocks and plugs into a blade chassis. From there the disks are split into multiple LUNs, which are assigned to each node. IBRIX is a scale-out NAS system that goes up to 16PB! Multiple storage blocks and nodes. In no way am I trying to plug this thing, just comparing your solution to something I've worked with. I bet your system doesn't suffer from many many many many disk failures... ahem.

    Oh you only experimented with striping, which as you said wasn't recommended.

    By "travels to Node 0 via GlusterFS". basically using the network back end? Which I think you said was 10Gbit? If so, exactly the same way as the IBRIX. Clearly though you try to avoid this sort of activity which has a performance hit which you do with the FUSE, we did not have this luxury with IBRIX.
     
  10. Gecko

    Gecko Member

    Joined:
    Jul 3, 2004
    Messages:
    2,715
    Location:
    Sydney
    I had similar experiences with dual-port 10gbit NICs in our cluster. I ended up taking one port to switch A and the other port to switch B, putting a link between the two switches (in case one node was to fail over to the B switch while the others stayed on A) and then using failover mode. Yes, it means that 1 port of each NIC (and an entire switch) sits around virtually unused, but it does mean that if anything on the A side dies, the B side is ready to go at all times. Having the extra hot redundancy in place was more important to us than having a bit more performance.
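
    If that failover mode is plain Linux active-backup bonding, the config on a RHEL/CentOS box looks roughly like this (device names and addresses are examples only):

        # /etc/sysconfig/network-scripts/ifcfg-bond0
        DEVICE=bond0
        BOOTPROTO=none
        ONBOOT=yes
        IPADDR=10.0.0.21
        NETMASK=255.255.255.0
        BONDING_OPTS="mode=active-backup miimon=100"

        # /etc/sysconfig/network-scripts/ifcfg-eth2 (and the same again for the second 10GbE port)
        DEVICE=eth2
        BOOTPROTO=none
        ONBOOT=yes
        MASTER=bond0
        SLAVE=yes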

    Great writeup though, always cool to see these projects :)
     
  11. OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    44,852
    Location:
    Brisbane
    If I lost a node, there'd be a period of time (even a few seconds while HA kicks in) where the migration of the LUN from one node to another would take place. With GlusterFS and a replication level of 2, you can take a node down and there's zero impact to the end user experience. Indeed, I've rebooted nodes during production hours, and it's had no effect on end user performance.
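
    It's worth a quick sanity check either side of a reboot, something like this (volume name is an example):

        gluster volume heal bigvol info      # make sure nothing is pending before pulling a node
        gluster volume status bigvol         # after it comes back, confirm all bricks are online again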

    In your example, yes, technically I'd only need to buy half the storage, but then I'd have to add the complexity of LUN management and exporting LUNs via iSCSI/FC to the hosts. In your example you're talking about SAN-backed storage, yeah? All of that comes at substantial dollar cost and complexity overhead.

    As I've mentioned a few times, I needed a solution that was as cheap and as simple as possible. Indeed, simplicity was the single biggest requirement. This business has no dedicated network staff, no dedicated storage staff, and a single senior Linux sysadmin (me). The rest of IT are spread pretty thin across a business that is extremely varied in what it does across multiple industries, and are frequently offsite for weeks on end dealing with customers.

    As it stands, this solution has cost us around $100K per cluster, which I think is pretty good. It's scalable at $0 software/licensing fee, and just requires more disk and plugging it in.

    What's the dollar cost for 800TB of IBRIX storage across 24 nodes? That would be the first thing my CEO would ask. :)

    Gluster's maximum volume size according to gluster.org:

    "Gluster supports the XFS file system on each brick, using XFS each brick can be as large as 9 million terabytes [1], limiting the total possible size of a single Gluster namespace to 6.75 X 10^9 TB's. (that's really big!)"

    So that's 6,750,000PB. I think that will sort us for a while. :)

    The FUSE part is only at the client (user) level when you want to mount the cluster and use standard POSIX file semantics (which is why I stipulate "FUSE+GlusterFS"). Inter-node communication (including file transfer to another node) is not done via FUSE, but via Gluster's own internal native protocols. To be blunt: all server-to-server communication is native Gluster; server-to-client for POSIX file system access is FUSE+GlusterFS. On the servers, you don't need to mount the GlusterFS file system at all. The only reason I do is to re-export the data via layered protocols like CIFS and AFP.
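
    The re-export itself is nothing fancy, roughly along these lines (volume and share names are examples):

        # mount the volume locally on the server via FUSE...
        mount -t glusterfs localhost:/vfxvol /mnt/vfxvol

        # ...then share that mount out via Samba for the CIFS clients, e.g. in /etc/samba/smb.conf:
        #   [vfx]
        #       path = /mnt/vfxvol
        #       read only = no
        # and reload Samba to pick it up:
        service smb reload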

    People are getting caught up on this FUSE business I think. I've pulled 1GB/s over FUSE on our production cluster for a large file to a grunty, fibre-attached workstation. Co-workers have reported 2GB/s over CIFS, which is loopback mounting the GlusterFS+FUSE file system, adding even more overhead. These sorts of real-world numbers suggest FUSE isn't the problem.

    The overhead I talk about is on file LOOKUP, not RETRIEVAL. It's the process where the cluster needs to work out where a file is, and query each and every node in the cluster if the location of the file is not known to the DHT (you simply cannot keep a map in memory of every file on an enormous file system, and keep it constantly updated on every node in real time). In particular if someone looks for a file that isn't there (a user or program tries to stat/ls a file, and gets a "file not found" return). This is Gluster's weakest point for speed (or more accurately, latency).
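
    You can see the difference easily enough from a client (paths here are made up):

        time stat /mnt/vfxvol/shots/sh010/plate.0001.exr    # the lookup cost is paid once, then reads are cheap
        time stat /mnt/vfxvol/no/such/file                  # worst case: nodes get asked around before ENOENT comes back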

    FUSE is not a bottleneck for Gluster, simply because the aggregate speed of the cluster as a whole is the power of Gluster. Additionally the "spread" of IO around the cluster means that no one node is saturated, even by someone copying entire trees of data about.
     
    Last edited: Apr 5, 2013
  12. Oblong Cheese

    Oblong Cheese Member

    Joined:
    Aug 31, 2001
    Messages:
    10,595
    Location:
    Brisbane
    Not if you implement Virtual Port Channel. ;)
     
  13. OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    44,852
    Location:
    Brisbane
    Yeah, that has been sitting in the back of my mind. The switching here isn't terribly well laid out, again from a budgetary point of view. If I lost the switch that GlusterFS is on, there'd be a bunch of other non-redundant stuff that vanished off the network too, which would cause people to down tools, and IT to have to manually re-jig things anyway.

    As such, the day to day performance is currently a higher priority than switch redundancy. Of course, I say that now while it's working. As is typical, when it breaks, C-level execs will ask IT to reconsider that. :)

    Yeah, as mentioned a few posts back, better switching would allow us the best of both worlds. Of course, that would mean a minimum of 3-5 times the budget for new switches, which just isn't there.

    Consider this thread as much a thread about cool tech as it is an example of how to build an enterprise on the barest of budgets. And to be fair, that's kind of what my whole career has been. I get an odd kick out of delivering 90% of what a big-name vendor can for 10% the price. :)

    Also, seeing big vendors seethe with rage when I show them the dollar cost and what we achieve is just priceless. Nothing pisses a sales guy off more than telling him his price is 900% too high. :lol:
     
    Last edited: Apr 5, 2013
  14. Davo1111

    Davo1111 Member

    Joined:
    Mar 5, 2009
    Messages:
    3,018
    Location:
    Sydney
  15. LinX

    LinX Member

    Joined:
    Jan 17, 2002
    Messages:
    510
    Location:
    Digital Nomad .. I go where my whimsy takes me.
    Poetry in motion Elvis.

    I worked for a managed services team that had a big VFX client; Disney was included in their client portfolio. I suggested moving from vendor hardware to Gluster for their storage (they had only 20TB and had to archive old projects as soon as they were sent to their client). However, the boss rejected it, mainly due to lack of support (a load of codswallop really). Apparently their solution made more money / probably cost more / is causing more headaches now. Can you tell I don't work there anymore?

    Anyway.. Awesome work.
     
  16. cry

    cry Member

    Joined:
    Dec 24, 2001
    Messages:
    404
    Location:
    Sydney
    Before getting into the guts of my reply - I'd like to firstly say well done. Building high scale storage is a tricky game and people often discover the outcome is very different to their expectations. It's been a hobby of mine for a while and I've learnt many things the hard way.

    Which is why I'm going to ask: how will you maintain integrity?

    Unless you do consistency checking on a regular basis (which would take a long time with slow 3TB drives), how will you ensure bit rot doesn't lead to data corruption? I can't count the number of times I've lost data due to a single corrupted block.
    Block remapping is often only done upon a write and is abstracted away from the controller, which can lead to two scenarios:
    a) The bad data is read from the disk and returned to the controller as if it were clean.
    b) The disk copies the corrupted data to a new block transparently to the controller, so the next controller request will get bad data but still think that it's clean.
    This might sound like a far-out situation, but with the number and type of drives that you have, I'd suggest you have an excellent chance of being bitten. And with large files the probability of data loss increases significantly...
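
    One crude approach to that regular consistency checking is a periodic checksum manifest on each brick, along the lines of (paths are examples):

        # build the manifest once, then re-verify from cron; anything that fails gets restored from the replica
        find /data/brick1 -type f -print0 | xargs -0 sha256sum > /var/tmp/brick1.sha256
        sha256sum -c --quiet /var/tmp/brick1.sha256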
     
  17. CordlezToaster

    CordlezToaster Member

    Joined:
    Nov 3, 2006
    Messages:
    4,080
    Location:
    Melbourne
    Fantastic read elvis!
    As with many other people, I love reading your posts.

    Can you estimate how much time you spent on this project?
     
  18. OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    44,852
    Location:
    Brisbane
    Yeah, our hardware engineers have been looking at them for a while over on the hardware rentals side of the business. Looks quite interesting.

    I've always wanted to go full-ghetto and do an Addonics + Backblaze + Linux + GlusterFS project. I'd love to see just how big I could scale this for reasonable dollars.

    Haha yeah, that sums up a few businesses I've worked for. Hence why I'm working where I am now, and loving it.

    The combination of my CEO and the head of the big VFX project have made it a dream project. Both were on side, and understood the technical requirements well. While no project is ever perfect, they're both pleased with the result, and I'm pleased with how they've supported this since day one.

    I've been with the company 11 months. If you were to count just the time spent on this, you're looking at probably 1 man-month of pure R&D, and 1 man-month of implementation.

    I get pretty slammed at this job. In that same 11 month window I've also R&D'd their entire Linux workflow, configuration management via puppet, colour correction workflow, interim storage pre-Gluster, VMWare->KVM replacement, a complete teardown and rebuild of their core network switching, implemented weekly internal Linux training, and dealt with two different IT managers with their own quirks (to put it lightly). On top of that, I've had to support two offices full of VFX artists on a feature film, plus the usual TV commercials, web development and audio workflows the business does, plus be general Linux support for the offsite guys.

    I'm pretty pleased with what I've achieved in that time. I know a lot of it is more about the type of company this place is than me personally. They're very open to change, as long as that change is worth the investment and sustainable. There's a lot of ideas that never make it, of course, for practical reasons. And likewise a lot of things I still want to do. But the checklist for core stuff that's been improved in the last 11 months is pretty damned good.
     
    Last edited by a moderator: Apr 5, 2013
  19. agrov8

    agrov8 Member

    Joined:
    Apr 5, 2013
    Messages:
    2
    Cool article ...

    Imagine my surprise on researching gluster to find a major article written by "e1vis" :)

    Currently I'm building a VM cluster with no shared storage (SAN) available, so I'm testing provisioning using Gluster spread across 3-4 x 1.5TB (6x300GB RAID5) local storage nodes.


    Andrew.
    Puppet Guy.
     
  20. frenchfries

    frenchfries Member

    Joined:
    Apr 5, 2013
    Messages:
    101
    Try FUSE for OS X instead

    @elvis
    MacFUSE has been superseded by FUSE for OS X: http://osxfuse.github.com/

    Basically it's just been updated for newer kernels.
     
