Virgin Blue computer outage

Discussion in 'Business & Enterprise Computing' started by IntelInside, Oct 1, 2010.

  1. IntelInside

    IntelInside Member

    Joined:
    Aug 10, 2004
    Messages:
    82
    Virgin Blue have blamed their recent computer outage on Navitaire, the company their reservations system is outsourced to.

    This has highlighted the reality that the budget airlines outsource nearly everything, even their staff.
     
  2. NSanity

    NSanity Member

    Joined:
    Mar 11, 2002
    Messages:
    17,413
    Location:
    Canberra
    To be fair, it's not really a budget airline, unless you count every domestic airline in Australia outside of Qantas as "budget".

    If you do, you'd also have to say that it's impossible to make a profit as a domestic airline unless you are "budget" (Jetstar, whilst being "part" of Qantas, uses separate systems for everything).
     
  3. AzzKikr

    AzzKikr Member

    Joined:
    Aug 25, 2002
    Messages:
    1,078
    Location:
    .au
  4. MikHail

    MikHail Member

    Joined:
    Feb 8, 2003
    Messages:
    434
    Location:
    Sydney
    Guess Virgin got the cheaper deal without system redundancy.
     
  5. Doc-of-FC

    Doc-of-FC Member

    Joined:
    Aug 30, 2001
    Messages:
    3,244
    Location:
    Canberra
    Last edited: Oct 1, 2010
  6. thetron

    thetron Member

    Joined:
    Dec 23, 2001
    Messages:
    8,167
    Location:
    Somewhere over the Rainbo
    That's because they outsource everything to external providers.

    For some time, maybe two years ago, Virgin Blue was doing the same thing with one company in Brisbane, but kept giving them strange, weird requests outside the realm of support, and the external provider disliked their working relationship with Virgin. Not sure if they terminated the contract though, or just told Virgin to "f--- off, idiots".
     
  7. [AOB]TommO

    [AOB]TommO Member

    Joined:
    Mar 11, 2002
    Messages:
    1,738
    Location:
    Sydney, Australia
    It's interesting that it all started with a failed disk:

    Linkage: http://www.channelregister.co.uk/2010/09/29/netapp_virgin_blue/

    Looks like the drive died, the database became corrupted... and, well, I'm guessing a long restore process from there :) I'm sure someone else will be able to provide more details in here eventually.
     
  8. brayway

    brayway Member

    Joined:
    Nov 29, 2008
    Messages:
    6,721
    Location:
    Dun - New Zealand
    I hope it doesn't happen again next Sunday, as that's when we fly out :wired:
     
  9. Hive

    Hive Member

    Joined:
    Jul 8, 2010
    Messages:
    5,064
    Location:
    ( ͡° ͜ʖ ͡°)
    How surprising, but I thought SSDs were the most reliable drives ever? :|
     
  10. elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    34,999
    Location:
    Brisbane
    In this modern age of "professional" IT, it's far more important to have someone to blame than it is to have a system that works.

    Outsourcing is a wonderful model for this new era of ITIL-focussed service managers, because it means they can put 100% of their faith in an SLA (which itself is generally not worth the paper it's printed on) and have someone to blame when inevitably it all goes wrong.

    Bang for buck, buying quality, having control of systems, information and data - these are concepts of an era gone by. Today it's about two things:

    1) Lowest dollar cost
    2) Outsourcing the blame.

    And when it all goes wrong, make sure you have that all important "blamestorming" session so that the shit sticks to the vendor, and not your own technically-ignorant and utterly incompetent decision making processes.

    Putting business people in charge of technology has yet again proven to be a recipe for disaster. Any company that continues to put blind faith in SLAs, process and procedure ahead of recognising good individuals and talented people gets everything it deserves when it all inevitably turns to custard.
     
  11. chook

    chook Member

    Joined:
    Apr 9, 2002
    Messages:
    913
    Are you sure you don't work at the same place as me? :p
     
  12. money_killer

    money_killer Member

    Joined:
    Apr 10, 2010
    Messages:
    2,173
    Location:
    Sunshine Coast
    I don't understand either how such a "big important" company can have such an outage. Don't they have a backup server that would just take over from the one that shit itself?
     
  13. Bangers

    Bangers Member

    Joined:
    Dec 25, 2001
    Messages:
    7,254
    Location:
    Silicon Valley
    Threads like this are always fun. It's usually the ones with the least knowledge and experience that find it the easiest to comment.

    It sounds like the articles on The Register are on the money.

    Right on the money. In the real world, it's more cost-effective to sign off on a [cheaper] system with design problems and gamble on the N-year ROI versus investing the extra money upfront.
     
    Last edited: Oct 2, 2010
  14. elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    34,999
    Location:
    Brisbane
    I would have expected by now that I'd stop being surprised at how bad non-technical people are at risk management. Yet I'm sitting here shaking my head at this whole Virgin Blue thing (and not just Virgin Blue themselves) and how, yet again, there was clearly inadequate investment in proper disaster recovery planning and infrastructure.

    It's pretty clear that "business reputation" is less and less of a concern to senior executives these days. Or perhaps it isn't, and much like the company I work for, the truth about the real level of risk doesn't make it to senior executives thanks to multiple layers of budget-saving middle managers who all add a slightly rosier tinge to the story on the way up.

    Typically for us it goes something like:

    Tech: "It's all fucked and falling apart"
    Team leader: "We have some issues to work through"
    Middle Manager 1: "There are some minor issues, but nothing we can't handle"
    Middle Manager 2: "Standard procedures are in place and we're working as normal"
    CIO: "We're on track for excellence"
    CEO: "We're awesome, and an industry leader"

    And then the poor tech at the bottom is left scratching his head as to why the DR budget has been slashed for the fourth year in a row.
     
  15. Bangers

    Bangers Member

    Joined:
    Dec 25, 2001
    Messages:
    7,254
    Location:
    Silicon Valley
    I'm staying out of this thread, but to be fair this isn't VBA's fault, and no information has pointed blame back to them, or to a lack of DR/infrastructure on their part. This is identical to GOOG/MSFT hosted cloud solutions (email/docs) going offline and people then ripping into the companies that depend on that infrastructure. The problem here is that the cloud going offline didn't affect a small business, it affected an airline. Anyone with contacts in the airline industry knows it's a very small vendor base with not a lot of movement. Qantas and everyone else also use hosted solutions (from another vendor), so they are exposed to the same problem.
     
  16. elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    34,999
    Location:
    Brisbane
    I realise it's the vendor who had the technical screw-up, and not VBA. But at the same time, VBA have put all their eggs in one basket with said vendor, so in my eyes part of the blame lies with them for thinking they can outsource 100% of their core business to a single vendor.

    If I were put in charge of putting a similarly sized business "in the cloud" for their core applications, I would at least consider using two different vendors in a live-live (if possible, if not then standard failover) setup.
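    Purely as an illustration of what I mean (the vendor names and the whole booking interface below are made up, and a real implementation would sit behind the airline's own front end), the routing logic itself doesn't need to be complicated:

    Code:
    # Illustrative sketch only: route bookings to whichever reservation
    # vendor currently passes a health check. Vendor names, the interface
    # and the flight number are all invented for the example.
    class ReservationVendor:
        """Stand-in for a hosted reservation system."""
        def __init__(self, name, up=True):
            self.name = name
            self.up = up
            self.bookings = []

        def healthy(self):
            # A real check would be a cheap API probe with a short timeout.
            return self.up

        def create_booking(self, passenger, flight):
            self.bookings.append((passenger, flight))
            return "%s-%d" % (self.name, len(self.bookings))

    def place_booking(vendors, passenger, flight):
        """Route to the first vendor that passes its health check."""
        for vendor in vendors:
            if vendor.healthy():
                return vendor.create_booking(passenger, flight)
        raise RuntimeError("no reservation vendor available")

    # Primary vendor is down, so the booking lands on the secondary.
    primary = ReservationVendor("vendor-a", up=False)
    secondary = ReservationVendor("vendor-b")
    print(place_booking([primary, secondary], "J. Bloggs", "DJ404"))

    The hard part, of course, isn't the routing; it's keeping the two vendors' inventory in sync, and that's where the real cost is.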

    Yes, that means it costs more. But what's the cost to VBA's reputation now? I'm already hearing of so many people cancelling flights with them or avoiding them for the upcoming Christmas season. Their technology "savings" are going to cost them big bikkies in lost revenue due to a shoddy reputation in the coming months/years, regardless of whether it was they themselves or their vendor who screwed up. In the general public's eye, it's all the same thing.

    [edit]

    I also understand that any large "cloud" vendor is going to have detailed DR documentation and SLAs that probably cover multiple vendors themselves, but again it comes down to trust, confidence and experience. There's obviously extra effort (read: staff/cost) in maintaining two sites with two different "cloud" (boy, I hate that word) vendors versus outsourcing the lot to a single mob and assuming they'll stay true to their word. But again, anyone with half a brain knows deep in their gut that it'll all go bad one day.

    I can only hope VBA had an SLA that included financial compensation for service levels not being met. But again, that probably won't come close to the losses they'll incur due to a poor reputation over the next 12 months.
     
    Last edited: Oct 2, 2010
  17. Bangers

    Bangers Member

    Joined:
    Dec 25, 2001
    Messages:
    7,254
    Location:
    Silicon Valley
    That's not how it works. To use a token analogy, it would be like using Google Docs and Microsoft Exchange Online at the same time.

    Check out both The Register articles; they are really on the money. We all love throwing blame around, but this is one of those truly unique situations where the third party is completely at fault and nothing could have been done on the customer's side to prevent it happening. The outage window was also unique in the sense that in the airline industry, once you build up a backlog, it takes roughly three times the length of the outage to catch up. Other less physical businesses (other cloud services) instantly catch up/replicate/sync once the problems are resolved.

    Your point about investing (with both brains and money) beforehand does make a massive difference. I'm not questioning that at all. I'm only playing devil's advocate to ensure everyone is on the same page about the technical problem at hand.
     
  18. elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    34,999
    Location:
    Brisbane
    I realise, too, that the systems they are using are non-trivial. Where I currently work, the core financial system is a similar batch-driven system that is near impossible to replicate in real time without massive cost (and even then, there's no guaranteed transaction safety or ACID compliance, so failover would result in a guaranteed need to check and repair the consistency of the entire data set).
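    (To give a feel for what "check and repair the entire data set" means in practice, here's a toy version of the kind of reconciliation you'd be up for after an unclean failover. The schema, table and column names are invented; the real thing runs across thousands of tables.)

    Code:
    # Toy illustration: after failing over a non-transactional batch system,
    # every derived total has to be recomputed and compared before the data
    # can be trusted again. All names are invented for the example.
    import sqlite3

    def find_inconsistent_accounts(conn):
        """Return accounts whose stored balance no longer matches the
        total re-derived from their journal lines."""
        return conn.execute("""
            SELECT a.account_id, a.balance, COALESCE(SUM(j.amount), 0) AS derived
            FROM accounts a
            LEFT JOIN journal_lines j ON j.account_id = a.account_id
            GROUP BY a.account_id, a.balance
            HAVING a.balance != COALESCE(SUM(j.amount), 0)
        """).fetchall()

    # Tiny in-memory demo: a journal line went missing in the failover.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE accounts (account_id TEXT, balance REAL);
        CREATE TABLE journal_lines (account_id TEXT, amount REAL);
        INSERT INTO accounts VALUES ('ACME', 100.0);
        INSERT INTO journal_lines VALUES ('ACME', 60.0);
    """)
    print(find_inconsistent_accounts(conn))  # [('ACME', 100.0, 60.0)]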

    All the same, if the powers that be came to me and said we were outsourcing the data centre to a single vendor, I'd have some serious words to share with them. And Virgin Blue's current dramas would be on the top of my list of case studies.
     
  19. Doc-of-FC

    Doc-of-FC Member

    Joined:
    Aug 30, 2001
    Messages:
    3,244
    Location:
    Canberra
    One point still rings true: Navitaire are using NetApp heads which are SnapMirroring the data to a backup datacentre. They had the DR nailed; it's the BCP that failed. They should have cut across to the DR site and had operations restored in under 90 minutes.

    Why it took them 21 hours to make the call to switch to the DR site, instead of continuing to attempt to repair the primary, is where the spotlight should be.

    Whoever the incident manager was is ultimately at fault for the extended outage.
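    (The numbers below are invented, and the 90-minute RTO is just the figure from above, but the BCP call itself really is this simple: the moment repairing the primary can no longer beat your recovery time objective, you stop repairing and declare DR.)

    Code:
    # Illustrative only: a trivial version of the cut-over decision.
    RTO_MINUTES = 90  # assumed recovery time objective for the booking system

    def should_fail_over(elapsed_minutes, estimated_repair_minutes):
        """Fail over once repairing the primary can no longer beat the RTO."""
        return elapsed_minutes + estimated_repair_minutes > RTO_MINUTES

    # e.g. an hour in, the vendor still estimates "a few more hours" to repair:
    print(should_fail_over(elapsed_minutes=60, estimated_repair_minutes=180))  # True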
     
    Last edited: Oct 2, 2010
  20. reunig

    reunig Member

    Joined:
    Sep 7, 2005
    Messages:
    254
    Location:
    Newcastle
