Thursday, November 12, 2009

High-Level Availability -- continued

Taking precautions against unplanned downtime

Although one cannot prevent failures, one can take all reasonable precautions. The most obvious and most important precaution is a well-designed and well-implemented backup regime that takes into account the special requirements of the application. That subject is beyond the scope of this article, but its importance is worth emphasizing nonetheless. Also…

Process failure precautions

• Control access to the server and server room.
• Ensure that all of your team clearly understand their roles and responsibilities.
• Implement change controls to ensure that all software and hardware changes to a production server are documented. Because change control systems require sign-off by several team specialists, they give those specialists a chance to check for potential problems.
• Document and map all of your SQL Server instances, being particularly careful to record application relationships such as replication or log-shipping, data-feeds, message routes, links, remoting, and file transfer routes.
• Make sure your Test servers are identical in configuration to your Production servers.
• Before applying patches, hot fixes, and service packs, test them first on a Test Server.

Change failure precautions

• Document all proposed changes.
• List the expected impact on the production system.
• Gain consensus and sign-off for the changes, as appropriate.
• Test the effect of the changes in terms of functionality and stress/load.
• Document the rollback/reversion plan and test it out on those who are likely to be 'early responders' to a system failure.
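
The checklist above can be turned into a simple gate that refuses a change until every item is satisfied. The following is only a minimal sketch: the `ChangeRequest` class, its field names, and the example roles are all hypothetical and are not part of any change control product.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """Hypothetical record of one proposed production change."""
    description: str
    expected_impact: str
    rollback_plan: str
    required_signoffs: set[str]
    signoffs: set[str] = field(default_factory=set)
    functional_test_passed: bool = False
    load_test_passed: bool = False
    rollback_rehearsed: bool = False

    def blockers(self) -> list[str]:
        """Return the reasons this change is not yet ready to deploy."""
        problems = []
        if not self.expected_impact.strip():
            problems.append("expected impact not documented")
        if not self.rollback_plan.strip():
            problems.append("rollback plan not documented")
        missing = sorted(self.required_signoffs - self.signoffs)
        if missing:
            problems.append("missing sign-off: " + ", ".join(missing))
        if not self.functional_test_passed:
            problems.append("functional test not passed")
        if not self.load_test_passed:
            problems.append("stress/load test not passed")
        if not self.rollback_rehearsed:
            problems.append("rollback plan not rehearsed")
        return problems


# Example values are invented for illustration.
change = ChangeRequest(
    description="Add index on Orders.CustomerID",
    expected_impact="Brief blocking during index build",
    rollback_plan="Drop the new index",
    required_signoffs={"dba", "app-owner"},
    signoffs={"dba"},
    functional_test_passed=True,
)
for problem in change.blockers():
    print(problem)
```

An empty list from `blockers()` means every item on the checklist has been ticked; it says nothing about whether the change itself is sound.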

Natural and man-made disaster precautions

• Arrange for the service to be mirrored, or held at 'Warm stand-by', at a geographically distant site. Test the ability to switch the service over remotely. SAN replication is a popular solution, but mirroring is very effective.

Hardware failure precautions

• Simulate failure in all likely places to check that secondary hardware 'kicks in' as expected.
• Make sure there is an architecture diagram, and clear instructions for all hardware recovery routines, which are easily understandable to the 'first responder'.
• Provide generous battery-backup.
• Use redundant power supplies.
• Use hardware and software monitoring tools: hardware often gives out warning signs before 'letting go'.
• Use a RAID array or SAN for storing your data, with hot-swappable drives and available spares. A 'Stripe of Mirrors' (RAID 10) is probably best practice.
• Install redundancy in storage controllers.
• Place the databases of your server on a different RAID array from the transaction log. Locate TempDB on a high-performance RAID array, because SQL Server cannot function without TempDB.
• Provide both network card and router redundancy.
• Ensure at least 'Warm Standby' fallback servers by using clustering, database mirroring, synchronization or log shipping.

Software Failure precautions

• Software failure can happen due to software changes, but also when data changes. Even date changes can cause failure. 'Code Rot' is the common term for software system failure when no recent software changes have been made.
• Use change control and source control (see 'Change failure precautions' above).
• Before rolling out a production release, do strict 'limit' testing: test under the extremes of data or throughput, and with hardware components randomly unplugged, to assess whether software degradation is 'graceful' or not.
• Perform regular regression testing on the test server with different simulated loads.
• Avoid overlapping jobs in the SQL Server Agent; do routine DBCC checks and re-indexes of tables at off-peak times.
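
One way to check the last point is to compare planned run windows before committing to a schedule. This is a rough sketch only: the `JobRun` type, the job names, and the durations are hypothetical, and nothing here reads from SQL Server Agent itself. It assumes you have estimated each job's start time and duration, for example from past runs.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class JobRun(NamedTuple):
    """Hypothetical planned run of one scheduled job."""
    name: str
    start: datetime
    duration: timedelta

    @property
    def end(self) -> datetime:
        return self.start + self.duration


def find_overlaps(runs: list[JobRun]) -> list[tuple[str, str]]:
    """Return pairs of job names whose planned run windows intersect."""
    ordered = sorted(runs, key=lambda r: r.start)
    overlaps = []
    for i, earlier in enumerate(ordered):
        for later in ordered[i + 1:]:
            if later.start >= earlier.end:
                break  # sorted by start, so no later run can overlap either
            overlaps.append((earlier.name, later.name))
    return overlaps


# Invented schedule for illustration.
night = datetime(2009, 11, 12, 1, 0)
runs = [
    JobRun("full_backup", night, timedelta(minutes=90)),
    JobRun("dbcc_checkdb", night + timedelta(hours=1), timedelta(minutes=45)),
    JobRun("reindex", night + timedelta(hours=3), timedelta(minutes=30)),
]
print(find_overlaps(runs))  # [('full_backup', 'dbcc_checkdb')]
```

Durations vary with data volume, so it is safer to pad the estimates than to schedule jobs back to back.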

Network Failure precautions

TCP/IP was designed from the outset to be resilient in the event of disaster, but that resilience relies on the network infrastructure being able to route packets via alternative pathways when one pathway fails.

• Secondary DNS/WINS servers must be provided.
• The system must not be reliant on a single domain controller or Active Directory server.
• Provide redundant routers and switches.
• Redundant WAN/Internet connections are generally important.
• Ensure that there is no single point of failure in the network by regular 'limit-testing'.
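
Before unplugging real equipment, the same question can be asked of the network diagram. The sketch below assumes the topology has been written down as a list of links between named devices; the device names and both helper functions are hypothetical. It removes each intermediate device in turn and checks whether the clients can still reach the server.

```python
from collections import deque


def build_links(pairs):
    """Build a two-way adjacency map from (device, device) link pairs."""
    links = {}
    for a, b in pairs:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def reachable(links, source, target, removed):
    """Breadth-first search from source to target, skipping the removed device."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbour in links.get(node, set()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(links, source, target):
    """Return devices whose loss alone disconnects source from target."""
    devices = set(links) - {source, target}
    return sorted(d for d in devices if not reachable(links, source, target, removed=d))


# Invented topology: two switches, but only one router.
links = build_links([
    ("clients", "switch_a"),
    ("clients", "switch_b"),
    ("switch_a", "router"),
    ("switch_b", "router"),
    ("router", "sql_server"),
])
print(single_points_of_failure(links, "clients", "sql_server"))  # ['router']
```

A check like this only covers what is on the diagram; it complements physical 'limit-testing' rather than replacing it.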

Security Failure precautions

• Ensure the physical security of each SQL Server.
• Create alerts and reports for any unusual patterns of user activity on the server, and investigate them (SQL Data Compare is very handy for this).
• Give users the fewest permissions they need to perform their job.
• Audit all login and logout events.
• Use DDL triggers to log and notify all changes to the security configuration of the server.
• Adopt all current security best practices when implementing the server.
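
As one illustration of an 'unusual pattern' worth alerting on, the sketch below flags logins with a burst of failed attempts. It is a sketch under stated assumptions: the `(timestamp, login, succeeded)` tuple shape, the function name, and the default threshold and window are all hypothetical, and the events would have to be exported from whatever login auditing you already have in place.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def failed_login_bursts(events, threshold=5, window=timedelta(minutes=10)):
    """Return logins with at least `threshold` failed attempts inside any `window`.

    `events` is an iterable of (timestamp, login, succeeded) tuples,
    a hypothetical shape for rows exported from a login audit.
    """
    failures = defaultdict(list)
    for timestamp, login, succeeded in events:
        if not succeeded:
            failures[login].append(timestamp)

    flagged = []
    for login, times in failures.items():
        times.sort()
        start = 0
        for end, moment in enumerate(times):
            # Slide the left edge forward until the span fits in the window.
            while moment - times[start] > window:
                start += 1
            if end - start + 1 >= threshold:
                flagged.append(login)
                break
    return sorted(flagged)


# Invented audit rows for illustration.
base = datetime(2009, 11, 12, 3, 0)
events = [(base + timedelta(minutes=i), "sa", False) for i in range(5)]
events += [(base + timedelta(hours=i), "report_user", False) for i in range(5)]
events.append((base, "app_login", True))
print(failed_login_bursts(events))  # ['sa']
```

The threshold and window need tuning for each server; what counts as 'unusual' depends on the normal pattern of activity.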
