Discussion in 'Business & Enterprise Computing' started by GooSE, Oct 4, 2013.
Wow I was gonna mention the Dell thing but thought I'd get shot down by you mob!
I think the unwritten rules for interacting with technical forums as a company representative should include something like: if it's brought up by a (non-astroturfing) forum member, it's fair game for you to expand upon. If it's not mentioned or questioned, then you shouldn't bring it up, lest you get accused of marketing drivel.
So I've been living with Nutanix as my production environment for about 6 months now and I have to say I really really like it, the honeymoon hasn't worn off yet.
I've had a couple of hardware failures, a disk and a PSU, but they were replaced immediately and without fuss. While this isn't great for brand-new kit, it's just commodity hardware that's easily replaced, and it comes down to the luck of the draw a bit.
Here is a summary of how the cluster looks as of this morning;
Most of those VMs are View desktops of various persuasions, with our server infrastructure, like SQL, filling out the others.
We used a mix of Nutanix nodes: 1x 3060, plus 2x 7110 GPU nodes with Nvidia GRID cards for our design team's CAD workloads.
Some quick stats: we replaced 63 RU of physical servers with 6 RU of Nutanix nodes. Our baseline power draw went from 54 amps to 7 amps.
So right off the bat we ticked the consolidation and consumption boxes. The speed at which we deploy new services and desktops has dramatically increased. I can restart all 100+ desktops and no one feels the boot storm.
I have done a number of in-place Nutanix cluster updates now and not once have we had to bring down production to do it. You can also expand the cluster live with no interruption to your services. That in itself makes you a rock star to management.
My users have all transitioned off PCs to Samsung thin clients, so the cluster is paying for itself: I have been able to redirect the salary I would be paying a field tech to other parts of the business.
This all sounds very fanboi, I know, but it's been a long time since something has come along that actually makes my life better, brings massive gains, and is technologically solid without strings attached.
Also, dealing with the people at Nutanix is a breath of fresh air. I am sooo over dealing with vendors like IBM, HP and Dell. I'd rather have teeth pulled than listen to another sales pitch full of vapour.
In case you're wondering, this is what our entire infrastructure now looks like:
It's not the best photo; that's it in the aux rack before we moved it all to its new home at Polaris DC. Something Nutanix also made affordable.
Are you using the Teradici thin clients, or the Win7 Embedded ones?
How do you find management of them? I was pretty underwhelmed by the PCoIP Management Console.
I am using the Teradici zero thin clients. The MC could certainly use some work, but the devices are working out really well for us. I even have them deployed to a satellite site that's serviced with just a 4G connection.
Being able to fine-tune some image settings directly on the device has been a godsend in that respect.
In terms of Management the group structure in the MC is okay but not all that clever, I am hoping it gets some extra work over time.
OK ask away - if I don't know the answer I can get Nutanix Engineering to respond. By the way, I was a customer of Nutanix as well prior to joining them...
(and I've deleted the event links too! )
Exactly - I hope it continues as well. This thread has generated its fair share of reported posts, so it's clear people don't want the marketing/spam side of it, but are interested enough in the content here to want to see it cleaned up. As you've said, a good product is marketing in its own right, and Tully is doing a good job of assisting people with their enquiries (which is good marketing in its own right).
Deployed ScaleIO in a site to move away from a traditional SAN, and wasn't too impressed with the VSAN pricing.
The Nutanix offering sounded like it would have hit the mark, and we had Michael Webster (LongWhiteClouds) in for an update, but it was a few months too late.
If they get a NetApp style "Project Shift" movement going, with their SMB v3.0 support it should keep VMware honest during ELA renewals
As I'm in the UK, none of you would ever have been buying from me anyway; it was purely educational from my point of view. So whatever you need to know, ask away, as I use it every day.
Lots of buzz around new Nutanix / Azure ref architecture but I daren't post a link so you'll just have to Google it
Too kind... and who in IT doesn't like hearing about new tech? It's old-school storage's time to get disrupted (whether it be 'software defined storage' or 'public cloud' providers)...
Voice guys had their stuff disrupted when VoIP turned up, server guys had to get used to a virtual life from physical, networking arguably 'evolves' rather than gets disrupted (SDN may change this)....but the point is that you need to roll with the changes in IT. Countless other examples of course abound...
IMO the advantage of being 100% 'software defined' is agility as well as performance. Back in the day, with version 2.6.2 of Nutanix software on a 2400 model, we could perhaps get 20k IOPS out of a block of 4 nodes. Now, using the *same* hardware (the discontinued 2400 model), with just evolving software upgrades we can get 3x that (or more) with the latest 4.0.x code... and it will continue to get better. Customers who are happy with the older model h/w can still get the benefit of the newer code (incl. features and performance). Don't like the h/w anymore? Fine - add newer h/w to the cluster and then remove the old nodes, no downtime... maybe redeploy it for DR or a test/dev env or whatever floats your boat.
Software is certainly eating the world....and it's a good thing. You get a chance to learn new things and get your weekends back
Actually one of the things that has been eating at me is this.
It has been stated [millennia] that "if you have a 2TB SQL Server that regularly trawls its whole DB, then the working set size will be over the SSD limit of a node in Nutanix and performance will tank".
So what is going to happen when I do a full backup every ... [day|two days|week]? Either those full backups cause data eviction from the SSD (unlikely, since it should be single-read for non-duplicated data sets) or performance will be restricted to that of 4-12 SATA disks. And 12 SATA disks won't feed an LTO6, nor can they likely spool 4TB of data to a disk store in 6 hours (~200MBps) unless it's sequential access, and I doubt that too.
Since we can all agree that a replica is not a backup, what am I missing? Or is it more that people don't do daily offsite tapes any more? (Yes I'm comfortable being a dinosaur).
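The backup-window maths in the question above can be sanity-checked quickly. This is just a back-of-the-envelope sketch; the per-spindle figures (70 random IOPS, 64kB per IO) are assumptions taken from the discussion, not vendor specs.

```python
# Rough check of the backup-window arithmetic raised above.
TB = 10**12
MB = 10**6

def required_throughput_mbps(data_bytes, window_hours):
    """MB/s needed to move data_bytes within window_hours."""
    return data_bytes / (window_hours * 3600) / MB

# 4 TB spooled to disk in a 6-hour window:
need = required_throughput_mbps(4 * TB, 6)
print(f"{need:.0f} MB/s required")  # ~185 MB/s

# 12 SATA spindles at an assumed 70 purely-random IOPS each, 64 kB per IO:
random_mbps = 12 * 70 * 64 * 1024 / MB
print(f"{random_mbps:.0f} MB/s from 12 spindles, fully random")  # ~55 MB/s
```

So the question holds together: a purely random read pattern off 12 spindles falls well short of the ~185 MB/s the 6-hour window demands, which is why the sequential-vs-random distinction matters.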
I can't help but feel that if you have a 2TB SQL DB you either 1) have enough money that Nutanix isn't for you or 2) you are horribly abusing SQL.
I don't think he is saying he has a 2TB DB, just that Nutanix have acknowledged that reading 2TB of data will cause performance issues (in the context of SQL) and that this also has implications for backups.
Full disclosure.. I am another Nutanix guy.
We won't evict data from SSD on a backup read. To quote the good book (the Nutanix Bible): "Upon a read request of data not in the cache ... the data will be placed in to the single-touch pool of the content cache which completely sits in memory where it will use LRU (least recently used) until it is ejected from the cache. .... Any subsequent read request will “move” (no data is actually moved, just cache metadata) the data into the memory portion of the multi-touch pool which consists of both memory and SSD."
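The two-pool behaviour quoted above can be illustrated with a toy sketch. This is not Nutanix's implementation, just a minimal model of the eviction policy being described: first-touch data churns in its own LRU pool, so a one-off scan (like a backup read) never pushes out data that has been read more than once. Pool sizes here are made-up.

```python
# Toy model of a "single-touch" / "multi-touch" two-pool read cache.
from collections import OrderedDict

class TwoPoolCache:
    def __init__(self, single_size=4, multi_size=4):
        self.single = OrderedDict()  # first-touch data, LRU-evicted
        self.multi = OrderedDict()   # re-read (hot) data, LRU-evicted separately
        self.single_size, self.multi_size = single_size, multi_size

    def read(self, key, backing_store):
        if key in self.multi:                 # hot data: refresh LRU position
            self.multi.move_to_end(key)
            return self.multi[key]
        if key in self.single:                # second touch: promote to multi pool
            value = self.single.pop(key)
            self.multi[key] = value
            if len(self.multi) > self.multi_size:
                self.multi.popitem(last=False)
            return value
        value = backing_store[key]            # miss: enters the single-touch pool
        self.single[key] = value
        if len(self.single) > self.single_size:
            self.single.popitem(last=False)   # a one-off scan only churns here
        return value
```

The key point for the backup question: a full sequential read touches every block once, so it only cycles data through the single-touch pool and leaves the multi-touch (hot) pool untouched.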
Where you say 4TB = 6 hours, are you assuming we store data on a single SATA drive? We store the data across the whole cluster in small extents, so we get the collective IO capacity of multiple SATA drives. Our sequential numbers are very healthy and would leave an LTO6 drive nicely fed, but even random IO is nicely served.
On a side note...
Disk to Tape is becoming less common now, and Nutanix does work really nicely with scale out backup tools like Veeam (Reference Architecture).
Replicas are also quickly becoming a method of backup: customers are more and more choosing to store backups inside replicas, and maybe replicate offsite to a second Nutanix cluster, as a way of improving recovery time. I would of course recommend that they continue to use the 3-2-1 rule, but if you have a snafu that needs you to restore VMs, it's nice when they are still on your system in a previous state.
It also depends on the working set size of that 2TB DB. You may have a big DB, but only a small portion of it may be big, IO-active data. It just comes down to sizing it correctly for Nutanix... and it is handy that Michael Webster, one of our performance engineers, actually wrote the book on virtualising SQL (he really did, check Amazon).
Actually, I was assuming D2D of 4TB of changes on a 20TB VM data set, stored on a 4 node 3000 series appliance. 4TB in 6h would be two LTO6 drives in parallel and they potentially wouldn't stream unless you delivered 320MBps+.
You have 16 x 1TB spindles. Assuming 2 copies of any given block, that's 8TB raw storage. Assuming 2.5:1 dedupe+compression, that's potentially 20TB of VMs. Some will be low rate of change (file server), others high (Exchange/SQL/Oracle).
A single drive will deliver (guarantee) 70 completely random IOPS. Let's be kind and assume 200 mixed IOPS at 64kB average for backup, per spindle. Not unreasonable, I think, as there's going to be some SSD assistance. We have total throughput of 200MBps using all 16 spindles - and no capacity for the VMs to do much else unless it's in cache/SSD.
Now it could simply be that this is an inflexion point - that bigger and bigger clusters would resolve this. But I'd argue that the larger the cluster, the larger the daily change set. And add to that, backup for DR can be hard to get right. Plus if you look at doing a full backup, even at 200MBps you need 100,000 seconds (~1.2 days if my maths brain is working). Yes, many customers have DR sites and large links. Some don't. And if you're going to consider an option, DR is one thing you have to factor in - hence the question.
I'd also argue that the data-dense nodes are going to make this problem worse (a pair of 6000 series appliances - 4 x 6080 nodes - could have up to 80TB of data on it with the dedupe/compression numbers I've pulled out of my ass, across 16 spindles and 8 SSDs). Back up a 20% change set (incremental/differential) in 8 hours? You need 2TB/hr or 600MBps sustained semi-random IO, from 16 SATA drives and SSD cache. Better hope the SSD cache can serve 50% of the day's changes.
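The dense-node scenario above reduces to a short calculation. All the inputs (80TB logical after the assumed 2.5:1 dedupe/compression, 20% daily change, 8-hour window) are the poster's own assumptions, not measured figures:

```python
# Reproducing the dense-node backup arithmetic above.
TB = 10**12
MB = 10**6

logical_data = 80 * TB        # 4x 6080 nodes after an assumed 2.5:1 reduction
change_rate = 0.20            # assumed daily change set
window_s = 8 * 3600           # 8-hour backup window

change_set = logical_data * change_rate            # 16 TB to move
throughput = change_set / window_s / MB            # sustained MB/s required
print(f"{change_set / TB:.0f} TB changed, {throughput:.0f} MB/s needed")
# → 16 TB changed, 556 MB/s needed
```

That ~556 MB/s sustained, semi-random, from 16 SATA spindles plus SSD cache is the crux of the concern: it only works if a large fraction of the day's changes are still being served from flash.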
I think I sense designing a scenario to induce a failure rather than likely real world use cases for Nutanix - at least the ones I've seen so far. Where 6000 models have been used it is mainly to be able to store large amounts of nearline data that is needed instantly from the main storage environment but is not subject to large scale change.
Once you get to a point of requiring a system to account for 20% of its entire capacity changing daily, you ought to be looking at all-flash anyway IMO, and any off-SAN backup solution is going to be a taxing problem as you try to extract data through the same controllers you are accessing for your live VMs.
At least with Nutanix every node is a controller, so the larger it is the more controllers you have to spread the load. NOS 4.0.1 has made some significant changes to the way it does IO, improving write IO 50% with no change in hardware, and a lot of this is to do with utilising all the flash in the cluster if the local flash hasn't the capacity.
Guess it's time for a bit of an update, since my last post I have now added 2 new 6020 storage nodes to my Nutanix cluster. This brings my node count up to 5 in 4 blocks.
I have now retired the last SAN I still had in production so all our workloads are now running directly off the cluster. That's SQL, ERP, Apps, VDI + 3D etc etc.
Installing the new nodes wasn't as straightforward as the initial deployment, as all new nodes ship with KVM as the hypervisor. Thankfully, Nutanix has this covered, and the nodes are easily imaged with ESX using the "Foundation" toolset. This took all of 45 minutes: you basically feed it some config info about how you want the nodes set up and supply your ESX ISO. At this point in time this procedure is carried out for you by the SE.
I am told this is a temporary measure, and a future NOS release will have built-in functionality to do this without the need for external assistance.
After the nodes were imaged with the correct hypervisor, it's literally a 3-minute job to expand the existing cluster to include them. Once they are added, you just have to mount the datastore on the new nodes and add them to vCenter.
vSphere then did its thing and vMotioned the workloads to load-balance the cluster.
If I wasn't sold on this stuff before, I certainly am now. I have never had a rollout go so smoothly, hassle-free or as quickly. You can scale on demand in literally under an hour; gone are the days of spending hours in front of a whiteboard in planning sessions.
Would like to point out that performance to the end users has increased while all the pain points have decreased.
What version are you using, 3.5 or 4?
Currently on 3.5.4 but will be upgrading to 4 very soon. I have held off as I am leaving for VMworld this Friday and didn't want to do it just before I left.
I will be upgrading to 4 the week I get back from the US. Lots of goodies and performance gains with NOS 4.