Overclockers Australia Forums - Storage & Backup
1st April 2013, 12:38 PM   #1
elvis (Thread Starter) - Member - Join Date: Jun 2001 - Location: Brisbane - Posts: 22,757

GlusterFS - 800TB and growing.

Background:

My current employer (media/VFX industry) has some pretty hefty storage requirements, along with budgetary limitations that really narrow down the list of possible solutions.

Most places tend to roll out pre-baked solutions like HDS, BlueArc, Isilon and other big-name storage. Anyone following the news of late will also have seen that VFX studios are dropping like flies, despite winning big contracts:

http://en.wikipedia.org/wiki/Rhythm_...ios#Bankruptcy
http://mumbrella.com.au/fuel-vfx-goe...tration-112240
http://vfxsoldier.wordpress.com/

It's an expensive business to be in. You need the latest and greatest computing power, storage, network bandwidth (local, WAN and Internet) and good staff, and profit margins are thin.

My CEO is a big fan of open source solutions for a number of different reasons. The up-front price is clearly a winner, but the long-term support costs are also much lower. Additionally, you're not tied to any one vendor, particularly when it comes to running open source software on commodity x86 hardware.

When I first spoke to my current CEO roughly a year ago (casually, as I wasn't employed there yet), he was aware that the business' current storage solution (ad hoc NAS storage) wasn't cutting it for larger projects (neither the performance nor the size was adequate). Moving up to enterprise SAN storage was a massive financial investment - much larger than the company could realistically afford, and it would put them at risk of becoming yet another failed VFX studio that couldn't manage costs. He asked me if I had any ideas on what they could do. I named a few of them, and he offered me a job.

Enter GlusterFS:

http://www.gluster.org/
http://www.redhat.com/products/storage-server/

GlusterFS is a clustered file system, born out of an Indian think tank funded by Californian money under the name Gluster Inc, and recently acquired by open source powerhouse Red Hat. The name "Gluster" is a play on the "G" of GNU and "cluster". Clustered file systems aren't new - there are plenty of them around. The problem with most is that they take an extraordinary amount of effort to build and maintain, and they introduce complexity in that special types of servers need to exist for each part of the cluster.

A quick lesson in file system theory: data on your hard disk is broken down into a few different parts. Firstly, the file system defines where blocks of data may begin and end, how they are contained on the disk, and how they are indexed and referenced by the operating system. Above that is the file layer, which is itself broken down into a few different parts. For the purposes of this discussion, the data and metadata are the most important bits. Data is just the raw 0s and 1s that make up the contents of the file. Metadata is the "data about the data" - file names, file permissions, access/create/modify timestamps, etc. Most popular clustered file systems (Oracle Lustre, Ceph, etc) require separate servers for the data (or "block") portions of files and the metadata portions. On top of this they require cluster monitors that hold information about the cluster: which parts are "masters", how locking is handled, and so on.
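
To make the data/metadata split concrete, here's a quick look on any Linux box (nothing Gluster-specific, just standard tools):

# The data is what you get when you read the file:
cat /etc/hostname

# The metadata is everything "about" the file - name, size, owner,
# permissions and timestamps - which stat shows directly:
stat /etc/hostname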

GlusterFS simplifies the process greatly. Each node in a GlusterFS cluster does the same job - it holds portions of data, metadata and monitoring services. This means that GlusterFS can be rolled out on a minimum of two (technically even one) servers, unlike others that require a minimum of 5-6 servers to even start. Additionally, GlusterFS works on top of an existing file system (XFS is currently recommended, due to bugs in EXT4 and speed issues with EXT3), which has other benefits I'll cover later.

GlusterFS works over standard Ethernet (10gbit/s is recommended, though bonded 1gbit/s works for smaller rollouts), as well as Infiniband/RDMA.

Terminology:

* Peer - A single server, or "node" in the cluster
* Brick - A single file system that GlusterFS can use. A minimum of one brick per peer is required, but peers can have multiple bricks if you wish.

GlusterFS offers several modes of operation - Distribute, Replicate and Stripe. These can trip people up because they sound like the RAID levels of the same name, but there are important differences.

* Distribute - A single entire file is written to a single brick on a peer. The next entire file is written to another brick somewhere in the cluster (if you have multiple bricks per peer, it can land on the same peer as the first).

* Replicate - A replication "level" is defined. For each write of a file, that file must be written to another brick as well, somewhat like a mirror.

* Stripe - Files can be broken up into pieces, and the "chunks" are written to different bricks.

Combinations of these levels can be used to achieve different outcomes.
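
To make the modes concrete, here's roughly what the volume creation commands look like (a sketch only - peer and brick names like node01:/bricks/brick1 are made up, not our real layout):

# Distribute (the default): whole files land on one brick or another
gluster volume create dist-vol transport tcp \
    node01:/bricks/brick1 node02:/bricks/brick1

# Replicate: every file is written to two bricks, mirror-style
gluster volume create repl-vol replica 2 transport tcp \
    node01:/bricks/brick1 node02:/bricks/brick1

# Stripe: files are chopped into chunks spread over two bricks
gluster volume create stripe-vol stripe 2 transport tcp \
    node01:/bricks/brick1 node02:/bricks/brick1

# Distribute + replicate: four bricks with replica 2 gives two
# mirrored pairs, with whole files distributed across the pairs
gluster volume create dr-vol replica 2 transport tcp \
    node01:/bricks/brick1 node02:/bricks/brick1 \
    node03:/bricks/brick1 node04:/bricks/brick1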

Many people (myself included) tend to jump on stripe+replicate as a quasi "RAID10" solution. Some quick experimentation shows this to be a poor choice, however. Initial lookups in a clustered file system are expensive, and cause performance delays. Striping data in Gluster across multiple nodes/bricks causes multiple lookups, and can often slow things down. The GlusterFS documentation only recommends striping as a solution if a single file will grow beyond the bounds of a single brick (for example, large VM images or large backup tar files, etc).

Additionally, GlusterFS's use of a simple POSIX file system below the clustered file system adds another benefit: in the event of a complete cluster failure, whole files can be retrieved directly from the file system below. With this in mind, the "distribute" model of GlusterFS becomes a far better option. It's faster to access individual files, and when the cluster is offline, repair/recovery can be performed on files below the clustered layer. Add the "replicate" option to this and you gain high availability, at the expense of half the available data storage.

GlusterFS allows hot (while users are on the system and working) addition of new bricks/nodes, replacement of bricks/nodes, removal of old bricks/nodes, and rebalancing of files. With "replicate" mode on, you can lose one brick out of any replicated pair, and when it comes back online, GlusterFS will "heal" the files, correcting itself back to an optimal state. The heal runs at a lower IO priority, and doesn't impact performance for clients.
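
All of that is driven and monitored from the same gluster CLI - for example (volume name made up):

# Cluster membership and volume health
gluster peer status
gluster volume status prod-vol

# After a node comes back: what still needs healing, and anything split-brain
gluster volume heal prod-vol info
gluster volume heal prod-vol info split-brain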

Experimentation:

I began by building 8 nodes using some out-of-service render nodes (Intel Core i7 CPUs, 12GB RAM, 2x250GB hard disks in a RAID0 stripe). GlusterFS is extremely easy to configure - run a "peer probe" to connect the peers, then a "volume create" to make the volume with arguments of your choosing, then a "volume start" to make it available to mount. This was trivially scriptable, and allowed me to write a script that would endlessly create and destroy different GlusterFS configurations, benchmarking each one in between.
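
The test harness was nothing fancy - something in the spirit of the sketch below (node names, paths and the benchmark itself are placeholders, not the real script; --mode=script just suppresses the CLI's y/n prompts):

#!/bin/bash
# Build a volume in a given layout, benchmark it, tear it down, repeat.
NODES="node01 node02 node03 node04"
BRICKS=""
for n in $NODES; do BRICKS="$BRICKS ${n}:/data/brick1"; done

for layout in "" "replica 2" "stripe 2" "stripe 2 replica 2"; do
    gluster --mode=script volume create testvol $layout transport tcp $BRICKS
    gluster volume start testvol
    mount -t glusterfs node01:/testvol /mnt/test

    echo "=== layout: ${layout:-distribute} ==="
    run_benchmark "$layout"   # placeholder for the real read/write tests

    umount /mnt/test
    gluster --mode=script volume stop testvol
    gluster --mode=script volume delete testvol
done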

What I found was the following (due to NDA/IP limitations I can't go into detail, but I can offer a high-level overview):

In general:

* GlusterFS read speeds are fantastic; write speeds are good but not amazing, mostly due to replication. Without replication, write speeds are much better, but you add the risk of an outage if any one node goes down.

* GlusterFS speeds are better with larger (10MB+) files, and very poor on small files (100KB and less).

Scaling:

* 2x GlusterFS nodes in "replicate" are slower than a single NFS server on the same hardware.

* 4x GlusterFS nodes in "distribute + replicate" are roughly the same speed as a single NFS server on the same hardware.

* 6x GlusterFS nodes in "distribute + replicate" scale linearly up from 4 nodes. Likewise 8 scales up from 6, and so on. Documentation and discussions with other cluster users suggest that this continues up into the hundreds of nodes.

* The benefit comes from many clients, not from one. Testing GlusterFS with single-threaded "dd" style reads/writes is pointless. I ran tests from our render farm, pulling data in across 128 threads, and the scalability under parallel workloads is quite obvious (see the sketch after this list).

* More nodes = more disk and more aggregate bandwidth. Performance goes up as storage goes up (as opposed to some cheaper SAN solutions, where more storage = less bandwidth per GB if controllers and interconnects don't scale).
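
If you want to see that for yourself, compare a single dd against a pile of parallel readers (ideally spread over several client machines) - a crude sketch, with made-up file names:

# Single stream: only ever talking to one brick pair at a time
dd if=/mnt/gluster/frames/frame_0001.dpx of=/dev/null bs=1M

# 32 streams in parallel: many bricks and many nodes working at once,
# which is where a distributed volume pulls away from a single NFS box
for i in $(seq -w 1 32); do
    dd if=/mnt/gluster/frames/frame_00${i}.dpx of=/dev/null bs=1M &
done
wait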

Production rollout:

After presenting my findings to the CEO and the board, they were satisfied it was worth taking the next step. We purchased 24 of the following from a local SuperMicro reseller:

* SuperMicro 4RU boxes with redundant power supplies (to separate UPS rails)
* 16x 3TB SATA drives per node, configured in RAID6 +1 hot spare for a total of ~37TB real world per node
* LSI MegaRAID controller with BBU
* 2x 64GB SSD in RAID1 for OS (CentOS6)
* 2x 512GB SSD in RAID1 for LSI MegaRAID CacheCade+FastPath (SSD caching and random IO re-ordering for the RAID volume, with intelligent passthrough when large streaming files are detected)
* Myricom 10gbit/s Ethernet card with 2 active interfaces, bonded for a max theoretical 20gbit/s per node (a theoretical 120gbit/s per cluster) - see the bonding sketch below this list.
* GlusterFS 3.3.1 GA
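
On the CentOS 6 side, the 20gbit/s bond is just standard kernel bonding - roughly the below. The bonding mode is site- and switch-dependent, so treat this as a sketch rather than our literal config:

# /etc/sysconfig/network-scripts/ifcfg-bond0 (sketch)
DEVICE=bond0
IPADDR=10.0.0.101
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"

# /etc/sysconfig/network-scripts/ifcfg-eth2 (and the same again for eth3)
DEVICE=eth2
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none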

Nodes were organised into 4 clusters of 6. Two in our Brisbane data centre, two in Sydney. The clusters were then configured as production and backup clusters.

Each cluster gives a total of 214TB raw space (856TB total across the entire business). I set a replication level of 2, meaning data gets written twice. GlusterFS reports only usable space via tools like "df" (or the summaries in Windows and MacOSX), so users never see the raw figure. What they do see is a usable 107TB of space in production in each location, and 107TB of space on the backup cluster (which they can only access read-only).

Nodes are numbered 0 to 5, and GlusterFS picks them up in replication pairs (0+1, 2+3, 4+5). Up to three nodes (in the right circumstances - one from each pair) can be down and the cluster will continue to function as normal.
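
The pairing simply falls out of the order the bricks are listed in at volume creation time - with replica 2, each consecutive pair of bricks becomes a mirrored set. A sketch of what that looks like for one 6-node cluster (host and brick names made up):

# Bricks 1+2 form a pair, 3+4 form a pair, 5+6 form a pair
gluster volume create prod-vol replica 2 transport tcp \
    gl-node0:/bricks/brick0 gl-node1:/bricks/brick0 \
    gl-node2:/bricks/brick0 gl-node3:/bricks/brick0 \
    gl-node4:/bricks/brick0 gl-node5:/bricks/brick0
gluster volume start prod-vol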

From a client point of view, our office is split up roughly into (figures are workstations only - if you include the Linux renderfarm, figures are skewed more heavily in favour of Linux):
* 50% Linux
* 30% MacOSX
* 25% Windows
* 5% other speciality OSes

GlusterFS itself offers native mounting via FUSE on Linux. Linux clients mount the cluster this way, and get access to all six nodes in the production cluster directly, for maximum bandwidth distribution.
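
The FUSE mount is a one-liner (or an fstab entry). The client only needs one server name to fetch the volume layout; after that it talks to every node directly (host and volume names are examples):

# Manual mount
mount -t glusterfs gl-node0:/prod-vol /mnt/prod

# Or the /etc/fstab equivalent
gl-node0:/prod-vol  /mnt/prod  glusterfs  defaults,_netdev  0 0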

The GlusterFS daemon also exports data via legacy NFSv3, which I use for older Linux and speciality machines in the business. Clients refer to a DNS round robin name, which effectively hands each client a random node to connect to.
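
Gluster's built-in NFS server only speaks NFSv3 over TCP, so clients force that explicitly; "storage.example.com" below stands in for the round robin DNS name, and exact mount options vary by client:

# The round robin name hands out a different node to each client
mount -t nfs -o vers=3,proto=tcp storage.example.com:/prod-vol /mnt/prod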

For Windows users, I load up Samba (with AD authentication) on each GlusterFS node. I use the same DNS round robin for Windows users, which works well. Samba needed some tweaking to deal with file locking, but that's largely resolved. There are unresolved issues with two applications - Photoshop for Windows and Microsoft Office for Windows - where file locking cannot be granted, which breaks them. The workaround put in place was to have Photoshop users copy working files to scratch directories (and we only have one serious Windows+Photoshop user, as the rest are on Mac). MS Office users were migrated to LibreOffice 4, which doesn't have the problem, and there's a company-wide rollout of Google Apps coming next month that will reduce the reliance on MS Office even further. Users were happy to use alternative office software, and we've had none of the usual complaints I hear on forums like these when people lose MS Office.
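
I won't reproduce our exact smb.conf, but the shape of it is a normal AD member server pointed at the FUSE mount, plus locking tweaks along these lines. The specific lock settings below are illustrative only - test against your own applications before copying them:

# /etc/samba/smb.conf (sketch)
[global]
    workgroup = EXAMPLE
    realm = EXAMPLE.COM
    security = ads
    winbind use default domain = yes

    # locking-related tweaks of the sort needed on clustered storage
    kernel oplocks = no
    oplocks = no
    level2 oplocks = no

[production]
    path = /mnt/prod
    read only = no
    browseable = yes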

MacOSX has been the biggest headache by far - in particular the MacOSX Finder. Apple broke Finder around 10.6, and haven't bothered to fix it despite pleas from millions of users. Things Finder does wrong:

* NFS is completely broken. Accessing NFS shares from Finder causes random lockups. Command-line access is fine, however.

* SMB/CIFS access through Finder looks for a "._filename" resource fork file for every matching "filename". Gluster is very slow at "negative lookups", as it needs to be 100% sure that a file doesn't exist by asking every node in the cluster, rather than just referring to its cache/DHT. A folder with ~1000 files in it (typical for our use case) can take 5+ minutes to load in Finder when accessed on GlusterFS. By comparison, this takes a few seconds on a standalone SMB/CIFS server. If I pre-seed the files (for i in `ls` ; do touch ._${i} ; done), the problem goes away. However this litters the file system with ._files, which is messy. There are hacks to tell Finder not to write ._files, but there is no hack to stop Finder requesting them. Again, the CLI is fine, but Finder behaviour is broken.

As such, I've loaded up Netatalk3 on the first node in each cluster, and exported data via AFP to the Macs. Netatalk requires exclusive locking, and cannot be clustered (unlike Samba). Netatalk2 (bundled with CentOS6) caused lockups of the first node (due to its requirement to write .AppleDB files directly onto the GlusterFS file system), which led to some split-brain problems (GlusterFS not knowing which version of a file was the most recent). After removing Netatalk2 and manually cleaning up the split-brain files, Netatalk3 has proved far superior, as it moves the .AppleDB folders off the clustered file system and onto local cache, and delivers the resource fork information over the wire.
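
For reference, the Netatalk3 side is tiny - the important bit is pointing the CNID/AppleDB databases at local disk instead of the Gluster volume (paths and share names are examples):

# /etc/afp.conf (Netatalk 3, sketch)
[Global]
    ; keep the CNID databases on local SSD, not on GlusterFS
    vol dbpath = /var/netatalk/CNID

[Production]
    path = /mnt/prod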

Each night, the backup cluster uses rsync to replicate the entire production cluster. This is split up across the nodes (one script is pushed out via Puppet, and syncs different parts of the file system based on hostname), so all 6 backup nodes execute smaller rsyncs and back up production in parallel. With about 70TB of production data currently online, it took ~2 days for the initial sync, and since then it's been single-digit hours each night.
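
The nightly job is deliberately dumb - every backup node gets the same script from Puppet and picks its slice of the tree based on its own hostname, roughly like this (hostnames and directory names are placeholders):

#!/bin/bash
# Runs on each backup node; each node syncs a different set of
# top-level directories, so the six rsyncs run in parallel.
case $(hostname -s) in
    bk-node0) DIRS="projects_a projects_b" ;;
    bk-node1) DIRS="projects_c projects_d" ;;
    bk-node2) DIRS="assets library" ;;
    bk-node3) DIRS="renders_current" ;;
    bk-node4) DIRS="renders_archive" ;;
    bk-node5) DIRS="editorial delivery" ;;
    *) exit 0 ;;
esac

for d in $DIRS; do
    # --delete only gets added on the weekly run
    rsync -a "/mnt/prod/$d/" "/mnt/backup/$d/"
done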

Performance:

With only 6 nodes per cluster, performance is better than a standard NAS setup, but not amazing. This is somewhat intentional from a budget point of view. The goal was to roll out a slightly smaller cluster early, and add nodes down the track as bigger paying jobs come in. The CEO understands this is a medium to long term investment. And indeed, the particular benefit of this solution is its "scale out" approach - we can grow the storage as bigger jobs come in. We can do this very quickly, and the performance goes up as new nodes come online. Likewise, when these nodes hit the 3-5 year mark and fall out of warranty, we can replace them with bigger/faster nodes live, without interrupting production.

The best real-world speed I've attained from this setup was ~40gbit/s (4GB/s), which was spread across many render nodes. The best single-machine result was one of the AVID guys pulling ~20gbit/s (2GB/s) from the backup cluster for a transcode over bonded fibre Ethernet. That was quite a beefy box, and I'm quite certain it was doing multi-threaded operations to achieve that. Real-world per-machine, single-threaded operations aren't that fast, but they can quite easily saturate a copper gigabit connection, which covers 90% of our workstation rollout (only a handful of users sit on 10gbit/s fibre).

Things that didn't work:

For our Linux workstations, we had remote-mounted applications and home directories. GlusterFS's expensive small-file lookups made these unusable. Applications and shared libraries tend to open only the chunks of a file they need, rather than reading the whole file into memory. The end result is very poor performance and millions of locks, which are expensive in a clustered system. As such, I chose to migrate the home and application shares back to a smaller NFS-based NAS. The load on that NAS is now far lower, since production files aren't on it, so it can be scaled down.

As mentioned, MacOSX Finder was a problem, as was Netatalk2. Netatalk3 solved this somewhat, although it cannot be clustered. As such, I've forced all MacOSX users to the first node in each cluster, and changed the DNS round robins to only include the last 5 nodes for SMB/CIFS and NFS clients. With our business breakdown by OS, this still works out fine for performance. Linux native mounting via Gluster-FUSE still accesses the entire cluster, however.

Also as mentioned, Windows+MSOffice (particularly Excel with xls files) and Windows+Photoshop had file locking problems. Red Hat have mentioned they are considering migrating an SMB/CIFS stack into GlusterFS itself to solve this issue (they did the same with NFS and added a DLM, which solved the problem for NFS). As mentioned, we're not big MSOffice users, so LibreOffice 4 and Google Apps are functional replacements with high levels of acceptance from staff. Windows+Photoshop users are happy to adopt a "work local and sync to the file stores later" approach. Mac+Photoshop users are unaffected.

The final price:

Again, without going into details, the business had spent a lot of time looking at vendor-supplied alternatives. Most were around the AU$1M mark per site (AU$2M all up). The GlusterFS solution with EVERYTHING considered (R&D time, networking gear, support contracts, etc) came to a grand total of AU$400K.

That's $1.6 million saved for the business that can go into better things, like more staff, more software licenses, more hardware, etc.

I'm quite happy to say I stole a $2M sale from EMC and Hitachi. People who know me know how anti-vendor I am, and this is one I love to rub their noses in. If they could deliver better solutions and service for lower dollars, they'd stand a chance. But they couldn't, so tough.

Last edited by elvis; 20th April 2013 at 11:21 AM. Reason: spelling

1st April 2013, 12:41 PM   #2
elvis (Thread Starter) - Member - Join Date: Jun 2001 - Location: Brisbane - Posts: 22,757

Just for laughs, when the first 6x node cluster was built, I did a single-threaded rsync to it. You can see each pair of nodes writing files together, and the distribution at work as files bounce around the cluster.

Distribution is chosen based on free space, so it's not true round robin. It's important to keep that in mind, and do rebalance operations every now and then to fix the problem of large file deletes causing "holes" in the space, and biasing writes to particular nodes.
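
The rebalance itself is just a couple of commands, run while the volume stays online (volume name made up):

# Even out free space after adding bricks or deleting a pile of large files
gluster volume rebalance prod-vol start

# Check progress; files keep being served while it runs
gluster volume rebalance prod-vol status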

Turn your volume down to avoid server room fan scream blasting out of your speakers.

1st April 2013, 1:09 PM   #3
cvidler - Member - Join Date: Jun 2001 - Location: Canberra - Posts: 8,623

subbed, as I look forward to hearing how a live upgrade/migration works.
__________________
We might eviscerate your arguments, but we won't hurt you. Honest! - Lucifers Mentor
⠠⠵
[#]

1st April 2013, 1:21 PM   #4
elvis (Thread Starter) - Member - Join Date: Jun 2001 - Location: Brisbane - Posts: 22,757

Quote:
Originally Posted by cvidler View Post
subbed, as I look forward to hearing how a live upgrade/migration works.
I've done an expansion+rebalance during the R&D phase, and it was painless - almost uneventful. Without checking logs and stats, you'd be forgiven for thinking nothing happened (other than suddenly there was more space reported by df).

I'm not sure when we'll need to do one in production, but the design is that the backup cluster in each site gets the upgrade first, and then production later. The same goes for software upgrades. We're on 3.3.1 GA at the moment (latest stable production release). 3.4.0 looks to be the next production release, so when it goes GA I'll upgrade the backup cluster first, and production at a later date.

I did a 3.3.0 -> 3.3.1 upgrade during R&D, and that worked flawlessly.

One thing I'm hoping to see in future releases is a negative lookup cache. As mentioned, looking for a file that doesn't exist is an expensive operation, as it needs to ask every node in the cluster (even mirrored pairs, in case the file hasn't been healed yet). Caching that result for a period of time (even a few seconds) would dramatically speed up certain operations. There's a demo "translator" (GlusterFS's terminology for its various plugins and caches) out on the web, but it's not production ready yet.

Last edited by elvis; 1st April 2013 at 1:27 PM.

1st April 2013, 2:45 PM   #5
geniesis - Member - Join Date: Aug 2007 - Posts: 185

Nice work. I've been looking for an in-depth, first-hand write-up of GlusterFS.

By chance did you look into any of the other clustered file systems that are around?

Such as Ceph, FhGFS and Lustre? What did you think of those implementations compared to Gluster?

1st April 2013, 2:48 PM   #6
davros123 - Member - Join Date: Jun 2008 - Posts: 2,281

Nice writeup... must be very satisfying to see it from idea to delivery. Well done.

PS: I think I need to get more LEDs on my server!
__________________
Want a NAS? You may find my ESXi/Solaris ZFS NAS build thread of interest.
Quote:
Originally Posted by Stanza View Post
yeah well I just reported my own post...ferk....
Quote:
Originally Posted by Blinky View Post
If you have become content with the size of your e-penis, sticking clear of rack mounted stuff will save you heaps of $$$.

1st April 2013, 3:18 PM   #7
elvis (Thread Starter) - Member - Join Date: Jun 2001 - Location: Brisbane - Posts: 22,757

Quote:
Originally Posted by geniesis View Post
Nice work. I've been looking for an in-depth, first-hand write-up of GlusterFS.

By chance did you look into any of the other clustered file systems that are around?

Such as Ceph, FhGFS and Lustre? What did you think of those implementations compared to Gluster?
As mentioned, the other clustered systems introduced a lot of complexity. Part of the problem with my current position is that there are no other senior Linux guys on staff, so part of my role is to train the other 10 guys on the IT team (and another 5 or so in the web dev team) up in Linux.

This was an important factor in the decision. If I was going to implement something, it needed to be budget sensitive as well as easy for on site staff to administer, and/or external support to be brought in.

The reason for choosing CentOS and a GA release of Gluster was based on the same idea. If worst comes to worst (I get hit by a bus), the business can seek support from Red Hat with a simple migration to RHEL and their "Storage Server" product (which is GlusterFS).

GlusterFS is quite easy to administer, troubleshoot and tweak, particularly in the 3.X release where they've moved to dynamic configuration via glusterd. I feel better knowing that if I'm away for any reason, the team on site can still look after the installation.

The other systems you mention didn't even make it to R&D simply because of the time, budget and on site staff constraints.

GlusterFS also suits our needs pretty well, as we move around a lot of large files. Most of the media we work on is stored as individual frames. So for instance, a sequence of DPX or EXR files (standard image files used in high-end media - think of them like JPG or PNG files, but with a lot more colour/light/frame/camera information) at 5K res (5120x2700 pixels) clock in at anywhere from 30MB to 50MB in size. Each. 24 of these make up 1 second of footage (over 1GB for 1 second). Then there are multiple takes, multiple shots, etc, etc. The volume of media we create for just a few seconds of what you see on screen is enormous, and GlusterFS deals quite well with that volume of files in that format of storage.

Likewise with 3D information, there are texture files, geometry files, node files, scripts, simulation caches and all sorts of stuff, ranging from tens of MB to multiple gigabytes in size, that need to be pulled in by every single render node to render a single frame out. Previously, when an artist hit the "render" button, the whole network would grind to a halt. Now with GlusterFS in place, the impact of many renders running concurrently is no longer felt by workstation users around the offices.

Quote:
Originally Posted by davros123 View Post
Nice writeup... must be very satisfying to see it from idea to delivery.
Very much so. I've enjoyed the process immensely, even considering the enormous pressure of go-live week (not to mention the fact that one of my kids ended up in hospital with an infection on the second day, which added to the ordeal). But I survived it, thanks to the efforts of the rest of the team, which was great to see.

Quote:
Originally Posted by davros123 View Post
PS: I think I need to get more LEDs on my server!
Achtung! Das Blinkenlights!

Last edited by elvis; 1st April 2013 at 3:23 PM.

1st April 2013, 3:29 PM   #8
ewok85 - Member - Join Date: Jul 2002 - Location: Tokyo, Japan - Posts: 7,953

Just when I was feeling smug about my little 24TB storage setup...

Fantastic write-up, looking forward to hearing more about this.
__________________
Half for your own happiness, half for the happiness of others
http://www.leonjp.com - Rants and info about living in Japan
http://forums.expatjapan.net - The Expat Japan Network!

2nd April 2013, 2:02 AM   #9
Damianb - Member - Join Date: Feb 2007 - Location: Melbourne - Posts: 1,860

Any issues with replication clogging up the network? Are all of the nodes in the same area, or are they spread across different locations?

2nd April 2013, 8:06 AM   #10
elvis (Thread Starter) - Member - Join Date: Jun 2001 - Location: Brisbane - Posts: 22,757

Quote:
Originally Posted by Damianb View Post
Any issues with replication clogging up the network? Are all of the nodes in the same area, or are they spread across different locations?
* 4 clusters
* 6 nodes per cluster
* BNE gets 2 clusters (prod and backup - 12 nodes in total)
* SYD gets 2 clusters (prod and backup - 12 nodes in total)

We're using Dell 8164F 48-port fibre switches with SFP+ transceivers where the cluster connects in. The clusters sit on different switches (a limitation of the bonding and the cheap switches means I can't spread them across two switches, so the loss of a switch means manually moving ports, which is fine for this business).

Previously we had a series of ad-hoc NAS devices around the office, so we're somewhat consolidating our storage here. Additionally, I spent a weekend in the server room about 2 months ago and found that the switching here was shockingly set up. The previous admin had put in a series of loops between various switches, and misconfigured the link aggregation groups between them. Spanning tree was kicking in, and disabling most of the paths. I spent all weekend tearing it all out and rebuilding it from scratch. The end result was much faster network performance for workstation users, as well as between servers on different switches.

So with all that in mind, the network performance is fine. Gluster can easily saturate all the copper 1gbps workstations, and the dozen or so fibre 10gbps workstations in each location have actually seen a speedup compared to the previous storage. The switches are reporting a little more load than before, but not to the point of being "clogged".

Last edited by elvis; 2nd April 2013 at 8:45 AM.

2nd April 2013, 8:30 AM   #11
PabloEscobar - Member - Join Date: Jan 2008 - Posts: 5,354

If Prod is replicated to Backup, how exactly is it a backup rather than a mirror of the existing data? If I delete importantfile.jpg, it's gone from both.

What is in place as an actual backup (if required)?

Is any long term storage of the data required?

Thanks for the write-up though. I do like that the data is stored on disk in a natively readable format.

2nd April 2013, 8:42 AM   #12
elvis (Thread Starter) - Member - Join Date: Jun 2001 - Location: Brisbane - Posts: 22,757

Quote:
Originally Posted by PabloEscobar View Post
If Prod is replicated to Backup, how exactly is it a backup rather than a mirror of the existing data? If I delete importantfile.jpg, it's gone from both.
Daily rsync without delete. Weekly rsync with delete.
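
In rsync terms, roughly (paths are examples):

# Daily: copy new and changed files, never remove anything from the backup
rsync -a /mnt/prod/ /mnt/backup/

# Weekly: the same, but also drop files that were deleted in production
rsync -a --delete /mnt/prod/ /mnt/backup/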

Quote:
Originally Posted by PabloEscobar View Post
What is in place as an actual backup (if required)?
There's an existing tape backup and archive solution here that I'm not happy with. Now that GlusterFS is rolled out and in production, I'm turning my eye to rebuilding their entire tape-based storage system from scratch. The CEO recently purchased a 96 drive Dell LTO array, so I get to hook that up this week and start building a Bacula based backup and archive system, including using one of the old legacy NAS devices as a VTL.

The business couldn't afford to wait for me to complete both of these projects before rolling out, so we're in a slightly sub-optimal state with a limping backup system until then.

The purpose of the backup cluster is more than traditional file backup. It allows me to test software upgrades, test configuration changes, and other things before I start whacking commands and software into production. Likewise when I train other team members, they get trained on the backup system.

Quote:
Originally Posted by PabloEscobar View Post
Is any long term storage of the data required?
Yes. I'll be aiming for a 90 day rotating backup schedule, but there's also a requirement to archive complete projects to "WORM" (Write Once, Read Many) media. Bacula will eventually perform both of these operations once I'm done.

Quote:
Originally Posted by PabloEscobar View Post
Thanks for the write-up though. I do like that the data is stored on disk in a natively readable format.
As do I. It's nice to know that even if all available methods fail, writing a script to scrape non-zero files off the raw XFS partition is still an available option as a final resort.

2nd April 2013, 9:17 AM   #13
Onthax - Member - Join Date: Nov 2003 - Posts: 372

(Agg edit: jebus, don't quote an entire massive post to add a two line question to the end )

Have you thought about RAID 10 to increase your write IOPS? You would get 3 times the IOPS while losing a few disks of space. A lot more affordable since you are using commodity drives.

Last edited by Agg; 2nd April 2013 at 9:39 AM.

2nd April 2013, 9:33 AM   #14
elvis (Thread Starter) - Member - Join Date: Jun 2001 - Location: Brisbane - Posts: 22,757

Quote:
Originally Posted by Onthax View Post
Have you thought about RAID 10 to increase your write IOPS? You would get 3 times the IOPS while losing a few disks of space. A lot more affordable since you are using commodity drives.
The IOPS limitation of GlusterFS has little to do with the lower level storage. It is entirely to do with the overhead required to fetch a file from the cluster when the location of that file is not in the DHT, and every node needs to be queried.

During the R&D phase I tested single 250GB drives, and then 2x250GB drives in a RAID0 stripe. The difference in performance for Gluster was almost nothing, despite there being double the spindles available, and despite local (non-Gluster, standard local XFS) per-node IOPS testing reflecting the extra spindles.

Additionally, each system has three levels of cache:

* Linux file system cache (32GB RAM per node)
* LSI Controller RAM cache (1GB RAM per controller)
* LSI Controller SSD cache (512GB Intel SSD per node)

All three of these buffer random IOPS and serialise them to stream them to the platter drives on page flush. The OS itself is set to use the scheduler elevator=noop for all disks, allowing the controller to get data directly without interference or processing from the OS, which works better on file storage systems (doing the same on your laptop or desktop would be worse, but it's a different use case).
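
Setting that is one line per device at runtime, or a kernel boot parameter to make it stick (sdb is just an example device):

# Runtime, per disk
echo noop > /sys/block/sdb/queue/scheduler

# Or permanently for every disk, via the kernel command line in grub.conf:
#   elevator=noop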

With 16 drives per node, RAID6+1 is purely for redundancy and maximising available disk. RAID10 wouldn't give us much benefit at all, given the little benefit to Gluster, the existing triple cache layer, and the waste of disk with 16 drives per node.

2nd April 2013, 1:03 PM   #15
Onthax - Member - Join Date: Nov 2003 - Posts: 372

Quote:
Originally Posted by elvis View Post
The IOPS limitation of GlusterFS has little to do with the lower level storage. It is entirely to do with the overhead required to fetch a file from the cluster when the location of that file is not in the DHT, and every node needs to be queried.

[snip]
Fair enough, that makes sense. Just a query though: with that much cache, how do you handle corruption on a power outage, or have you got time to write the cache out to disk automatically?