Titanic Effect: “The severity
with which an on-line operation fails is directly proportional to
an organization’s belief that it cannot. In other words, technology
plus arrogance spells disaster.”
The project life-cycle consists of 6 stages. Each chapter covers one
stage.
Stage 1 in the project life-cycle is "Defining your strategy"
Stage 2 in the project life-cycle is "Mapping your strategy"
Stage 3 in the project life-cycle is "Constructing your goods"
Stage 4 in the project life-cycle is "Planning your test"
Stage 5 in the project life-cycle is "Testing your plan"
Stage 6 in the project life-cycle is "Delivering your goods"
There is no stage 7 as the first iteration of the project is now complete.
This chapter covers the period when the solution is in operation (production).
This chapter is a summary chapter.
Avoiding Project Disaster
Each chapter of the book is summarized in three parts:
a description of the activities related to that stage in the project life-cycle, the historical
case study, and project best practices.
- “The rush to win passengers” - The chapter
takes a look at why organizations establish on-line business services
and mission critical service delivery environments in the first place.
The chapter defines the importance of establishing business requirements
in the evolution of a business service (new solution), and the availability
characteristics to look for in a business service. Establishing a business
case for the loss of service is critical as this simplifies defining
the availability requirements in subsequent stages. Business executives
and managers need to be fully aware of the investment required for availability.
They also need to know the expected level of service. The chapter continues
by defining the “dimensions” of a business service, which
help explain the various intangibles that create a service. These are
used in identifying meaningful metrics for availability, and setting
up Service Level Agreements.
The Titanic case study examines the origins of the Titanic
project, how it was conceived, and the business drivers behind White Star’s
decision and business strategy to win passengers across the 3 classes.
This includes a background to liners 1900-1914, competition, and the
business rationale of comfort over speed. The study outlines a sample
business case with a cost/benefit analysis for the project.
The Best Practices section looks at business service metrics,
the importance of “User Outage Minutes” (UOMs), and
measuring availability from the customer's perspective. It then
looks at the significance of evolving business services with an understanding
of what the loss of services would mean to the organization. This requires
creating a business case to accurately establish the potential lost
revenue. The organization can then start to define mitigating strategies
and where to apply resources to limit the impact of unavailability. It
contemplates how to take advantage of outages and technology failures
to determine the true downtime costs and re-evaluate the value of technology
within the service delivery environment, i.e., what is and is not mission
critical. The reader can then use two of the software tools included
with the book to calculate UOMs, the true costs of an outage, and a business
case for service availability.
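As a rough illustration of the kind of calculation described above, the sketch below computes UOMs and an estimated downtime cost. The summary does not give the book's formula, so the definition used here (users affected multiplied by minutes of outage) and the revenue-per-user-minute figure are assumptions, and all names are hypothetical. It is not the software tool included with the book.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One service outage. Field names are hypothetical, for illustration only."""
    users_affected: int      # users who could not use the service
    duration_minutes: float  # how long the service was unavailable to them


def user_outage_minutes(outages):
    """Assumed definition: sum of users_affected * duration_minutes."""
    return sum(o.users_affected * o.duration_minutes for o in outages)


def estimated_lost_revenue(outages, revenue_per_user_minute):
    """Rough downtime cost: UOMs times an assumed revenue per user-minute."""
    return user_outage_minutes(outages) * revenue_per_user_minute


# Two made-up outages: 200 users for 30 minutes, 50 users for 120 minutes.
outages = [
    Outage(users_affected=200, duration_minutes=30),
    Outage(users_affected=50, duration_minutes=120),
]
total_uoms = user_outage_minutes(outages)           # 200*30 + 50*120
lost_revenue = estimated_lost_revenue(outages, 0.5)  # assumed 0.5 per user-minute
print(total_uoms, lost_revenue)
```

Measuring in user-minutes rather than clock minutes weights each outage by how many customers actually felt it, which is the customer's perspective the section argues for.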
- “Life-boat in itself” - The chapter examines
architecting and designing high availability into a service delivery
environment that meets the business and functional requirements. It
also reviews how these translate into the technology and implementation
requirements. Business executives and managers need to understand the
risks associated with the service delivery environment. This includes
current levels of protection, potential environmental dependencies
and their impact.
The Titanic case study looks at the design
and the choice of safety features available at the time. It examines
how competitive business pressures led to a spotlight on luxury and
splendour over everything else in the quest to design a palatial
hotel. The lavish attention and investments paid to passenger comfort
implied there was an equivalent investment in the safety and operations
features.
Best Practices introduces functional models
that describe mission critical environments, e.g., Environmental Architecture,
Inventory, Critical Areas, and Transactional Flow. From these the
project can identify, to the component level, the availability requirements.
This is more feasible than trying to achieve high availability across
the whole service delivery environment, which is difficult, expensive,
and unnecessary. These models provide a rapid way to evaluate the
risks within the environment. They crystallize the objectives of the
project and influence the project team through the later stages. High
availability strategies can then be “architected” into
the environment based on mitigating risk and protecting key applications.
These models also set business expectations of what the likely investments
will be to achieve the desired levels of availability, the incremental
cost of increased up-time, and where to make the investments.
- “Quest to build a palatial hotel” - The chapter
takes a look at constructing a working technology that can further
demonstrate the functionality and the availability requirements for
the business services. The output is a working prototype, typically
a working version of the solution. Timing defines the sophistication
and completeness of the prototype, which provides a very useful tool
to present to the business user or recipient of the solution. It confirms
the proposed functions and features early enough for any changes to be
made at a lower cost. Business executives and managers need to understand
any risks associated with a solution, and have confidence in the constructor.
They also need to understand the availability features of the solution
and the levels of protection.
The Titanic case study looks at the construction
techniques and compares the selection issues of proven versus new
technology. Safety features were built in; however, this created an
overconfidence that nothing could go wrong. In fact, there was an unshakable
belief in the safety of the ship, a lifeboat in itself. As a result,
as the construction approached completion, esthetic factors were allowed
to compromise the safety features, and the design was fundamentally flawed
in a number of areas. For example, the bulkhead walls were
too short, the double skin bottom was under the water line, and the ship
carried the minimum number of lifeboats based on regulations. The study
also highlights how maritime legislation was hopelessly outdated by the
rapid evolution of shipping technology. The whole construction effort now
seems very misdirected.
The Best Practices section examines how construction
in today’s environments principally consists of integrating
technologies and using off-the-shelf products and solutions. It
continues by reviewing a number of techniques for improving the availability
of an identified critical component (the output from the architecture
and design stage), which may cause the greatest problems if unavailable.
These techniques are also used for constructing availability into solutions,
e.g., check-pointing, auditing, redundancy, etc. This also includes
looking at the advantages, disadvantages, and best circumstances
for each high availability technique used to increase up-time.
- “Those who fail to plan, plan to fail” - The
chapter examines the integrity, resilience, and reliability of the end
solution and the preparation for its implementation into the service delivery
environment. In this stage the change management structure is developed
and used to evaluate how closely the business criteria should be met
by the services provided. This presents a basis for user acceptance
to begin. By the end of this stage the solution is ready for pre-production
testing. Business executives and managers need to understand the risks
associated with the incoming change and the potential impact to current
business services.
The Titanic case study looks at the testing
or sea trials undertaken, specifically the limited operational and
safety testing. Only one lifeboat drill was performed, the outcome
of which underlined the poor operational readiness of the ship. With
the Olympic already established in service, extensive sea trials and
testing were not considered as critical and the pressure was on to
get the Titanic into operation.
The Best Practices section introduces the requirements
for sound change management. This is not just for major projects
but for the implementation of any kind of change into the service delivery
environment. It outlines a comprehensive 2-phase change management
methodology of “planning and controlling” through a 9-step
change model. Change planning covers the need to adequately assess
the risk and determine an appropriate change strategy to maximize
the efficiency and minimize the duration of the testing. The majority
of recorded outages are related to inadequacies in Change Management
structures. This includes planning for the level of testing required
and selecting the right kind of tests, e.g., from a battery of up to
17 tests including Integration, Security, Stress, Load, Functional,
Operational, and Simulation.
- “A chain is as strong as the weakest link”
- The chapter takes a look at successfully implementing the service
into production, with the least possible risk, and meeting the service
delivery criteria. In this stage the change management structure
is tested for the first time and the user acceptance is completed.
By the end of this stage the solution is ready to be implemented into
the service delivery environment. There should be a high degree of
confidence within the operations that no disruption will occur. Business
executives and managers need to know the outcome of the testing and the
business risk of the implementation. They also need to know how safe
the solution is and the risk of going live.
The Titanic case study examines how Titanic’s
sister ship, the Olympic, had gone into dry dock for repairs following
its collision with HMS Hawke. As a result, the Titanic was rushed
into production without adequate testing through sea trials and with poor
crew preparation for the maiden voyage. On leaving the port the Titanic
had a near collision with the steamer New York to the consternation of
passengers and crew. This highlights the challenges the crew had in navigating
what was, for its time, a very large ocean liner. Operationally the crew was not ready.
The lookouts’ binoculars, vital operational tools, were missing.
The very experienced Officer Lightoller later testified that it took him
three days to get acquainted with the ship’s massive layout.
The Best Practices section further reviews the
comprehensive change management methodology and focuses on change
controlling. This covers the pitfalls in limited and inadequate testing
and outlines the importance of setting up a battery of tests. The change
control phase consists of test plan creation, testing, business reviews
and assessments. It also discusses why the “Berlin Wall”
approach to change management is not feasible in today’s business
climate. The reader can use electronic templates included with
the book to create a Change Management structure.
- “All hands on deck” - The chapter takes a
look at ensuring that the service delivery environment continuously
delivers the newly developed business services. This is done in accordance
with written Service Level Agreements agreed with customer representatives.
The chapter looks at the organizational aspects required to create a
support infrastructure and maintain a smooth running operation. The
activities include maintaining the stability of the service delivery
environment successfully, preventing disruptions from faults occurring,
or minimizing these through a quick recovery method. This is based on
a rapid and accurate problem management process oriented around a “clock”.
Business executives and managers need to know the impact of the implementation
on business services and the risk of remaining live with it.
The Titanic case study reviews one of the most
poignant segments of the story, examining the operational aspects of
the ship through a detailed “flight clock” of events and
decisions taken leading to the disaster. A multitude of blunders
culminated in an inevitable outcome. The Titanic had a number of
built-in feedback mechanisms that were discounted, fudged, or just
ignored. For example, the officers kept their own binoculars and did
not share them with the lookouts; radio operators overloaded with commercial
traffic (noise) did not pass ice warnings (signal) along in a timely
fashion; ice warning information eventually communicated through the
hierarchy to Captain Smith wasn’t adequately acted on; Captain Smith
succumbed to pressure to sail at full speed through the danger area;
and Officer Murdoch, responsible for navigating and steering, made
a very questionable series of decisions in reversing the engines and
steering hard.
The Best Practices section highlights the importance
of organizing support around a rapid and accurate approach to problem
management, and including the operational requirements of a new service
in the life cycle of development projects. The project should not
end as soon as the service is operational but should continue until a proven level of
stability is attained. The chapter also examines requirements for understanding
the complex structure of a service delivery environment. It looks at
some approaches for organizing and managing it in both a proactive and
reactive way to maximize availability. This includes strategies for Early
Warning Systems and Automation. There are many automated tools available,
but without a carefully laid operations foundation most tools are
ineffective and even dangerous. The section also applies some current thinking
to the Titanic case study and the ship’s operation by applying the
4-step “mean time to recovery” model.
- “We will remain afloat till help comes” -
This chapter is a continuation of Chapter 6, Operations Management.
The focus is on the recovery of service(s) to an alternate service
delivery environment, that is, the resumption of the original business
service to the end-user from an alternate location. If normal problem
recovery is not possible, contingency plans are invoked and a disaster
is declared. The chapter introduces the “disaster cycle”
which outlines how the interaction of humans typically follows a certain
pattern in disasters. Business executives and managers need to know what
the current business continuity plan is, how the plan will address the
incoming implementation, and what the risks are in the plan.
The Titanic case study continues with the detailed
“flight clock” examination of the recovery stage. It
reviews the flow of information and how the Titanic’s hierarchical
organization (3 classes) inhibited the flow through the structure,
and the impact of this. Much precious time was lost in the first hour
after the collision, as the disaster was assessed. Poor communication
impeded time for passengers and crew to react. Many passengers got up
and then went back to bed with the perception that they were safe. As
a result, the first lifeboat left only half full because of the reluctance
of passengers to get in. Effectively, the “impact and stocktaking
phase” was untypically long as senior members of the crew operated
in a state of disbelief. In addition, the launch of 16 lifeboats took
over 2 hours because the crew was not adequately trained. Even if more
lifeboats had been in place it is likely that there would not have been
enough time to launch all of these.
The Best Practices section takes a “Why-What-How”
approach and examines why disaster recovery is critical, what disaster
recovery entails, and how disaster recovery is completed. Within this
structure Best Practices looks at business continuity planning and issues
such as application selection, recovery windows, and cost justification.
It also reviews alternatives from hot to cold to on-line sites, and
some of the techniques available through extended mirroring and remote
replication.
- “Titanic effect” - The chapter reviews the
highlights of each chapter, concludes the case studies, summarizes
significant discoveries made, and then draws the major Lessons Learned.
The Titanic case study starts with
a review of the post-disaster consequences. It examines the subsequent
inquiries, the new legislation and regulations implemented, and the
everlasting changes made to the shipping trade. Many historians argue
that the Titanic marked the end of the 19th century and of humanity’s
unshakable belief in the progress of technology. The chapter continues
by looking at all the Lessons Learned chapter by chapter. It sums up
with the “Titanic Effect”: the severity with which a system
fails is directly proportional to the intensity of the designer’s
belief that it cannot.
Best Practices are reexamined from previous
chapters and reviewed in the context of mission critical environments.
The chapter concludes by indicating how some organizations have been
able to master availability and this is the starting point for creating
better mission critical environments.