Archive

The Dulin Report

Browsable archive from the WordPress export.

Results (45)

The future is bright Mar 30, 2025 On Amazon Prime Video’s move to a monolith May 14, 2023 One size does not fit all: neither cloud nor on-prem Apr 10, 2023 Some thoughts on the latest LastPass fiasco Mar 5, 2023 Comparing AWS SQS, SNS, and Kinesis: A Technical Breakdown for Enterprise Developers Feb 11, 2023 There is no such thing as one grand unified full-stack programming language May 27, 2022 Which AWS messaging and queuing service to use? Jan 25, 2019 Using Markov Chain Generator to create Donald Trump's state of union speech Jan 20, 2019 Adobe Creative Cloud is an example of iPad replacing a laptop Jan 3, 2019 Facebook is the new Microsoft Apr 14, 2018 Leaving Facebook and Twitter: here are the alternatives Mar 25, 2018 Rather than innovating Walmart bullies their tech vendors to leave AWS Jun 27, 2017 Architecting API ecosystems: my interview with Anthony Brovchenko of R. Culturi Jun 5, 2017 TDWI 2017, Chicago, IL: Architecting Modern Big Data API Ecosystems May 30, 2017 Online grocers have an additional burden to be reliable Jan 5, 2017 Windows 10: a confession from an iOS traitor Jan 4, 2017 What I learned from using Amazon Alexa for a month Sep 7, 2016 Why I switched to Android and Google Project Fi and why should you Aug 28, 2016 Amazon Alexa is eating the retailers alive Jun 22, 2016 In search for the mythical neutrality among top-tier public cloud providers Jun 18, 2016 What can we learn from the last week's salesforce.com outage ? May 15, 2016 Why it makes perfect sense for Dropbox to leave AWS May 7, 2016 Our civilization has a single point of failure Dec 16, 2015 IT departments must transform in the face of the cloud revolution Nov 9, 2015 Setting Up Cross-Region Replication of AWS RDS for PostgreSQL Sep 12, 2015 Top Ten Differences Between ActiveMQ and Amazon SQS Sep 5, 2015 What Every College Computer Science Freshman Should Know Aug 14, 2015 Ten Questions to Consider Before Choosing Cassandra Aug 8, 2015 Big Data Should Be Used To Make Ads More Relevant Jul 29, 2015 Book Review: "Shop Class As Soulcraft" By Matthew B. Crawford Jul 5, 2015 Attracting STEM Graduates to Traditional Enterprise IT Jul 4, 2015 Smart IT Departments Own Their Business API and Take Ownership of Data Governance May 13, 2015 Guaranteeing Delivery of Messages with AWS SQS May 9, 2015 We Need a Cloud Version of Cassandra May 7, 2015 The Clarkson School Class of 2015 Commencement speech May 5, 2015 Building a Supercomputer in AWS: Is it even worth it ? Apr 13, 2015 Ordered Sets and Logs in Cassandra vs SQL Apr 8, 2015 Microsoft and Apple Have Everything to Lose if Chromebooks Succeed Mar 31, 2015 Where AWS Elastic BeanStalk Could be Better Mar 3, 2015 Trying to Replace Cassandra with DynamoDB ? Not so fast Feb 2, 2015 Why I am Tempted to Replace Cassandra With DynamoDB Nov 13, 2014 Infrastructure in the cloud vs on-premise Aug 25, 2014 Cassandra: a key puzzle piece in a design for failure Aug 18, 2014 Cassandra: Lessons Learned Jun 6, 2014 Things I wish Apache Cassandra was better at Feb 12, 2014

What can we learn from the last week's salesforce.com outage ?

May 15, 2016

On May 10 Salesforce experienced a day-long outage and lost four hours of customer data. As of May 14, Salesforce is still in degraded state. There is a number of lessons we can learn from this.

You give some, you gain some at each level of the cloud stack


At the lowest level is private on-premise cloud where the customer retains all control over the well being of the system. If the system is down the responsibility is with the IT department to bring it back up.

At IaaS and PaaS level, the cloud provider such as AWS or Azure is responsible for the maintenance and up-time of the infrastructure, but the reliable performance of applications is up to the customer. The IaaS vendor offers tools like multi-AZ redundancy and database snapshots but it is up to the customer to take advantage of these tools.

Salesforce.com operates at the SaaS level where the responsibility for the up-time is entirely with the vendor. Salesforce routinely uses slogans like “No hardware. No software. No headaches. Salesforce developer training professionals.” For the most part, their claim is true.



[caption id="attachment_404" align="alignnone" width="2086"]no_software “No hardware. No software. No headaches” screenshot pulled from here.[/caption]


Yes, headaches!


Salesforce’s NA14 instance was down for four hours on May 10th, and in degraded state ever since.



[caption id="attachment_412" align="alignnone" width="2510"]salesforce_na14 Screenshot of Salesforce NA14 instance state, as of May 14th, 2016, pulled from "trust" report[/caption]

Headaches come when a software architecture flaw with your SaaS vendor’s system results in you losing hours of data and more scheduled outages. Sales, ERP and CRM software is mission critical for many enterprises as SaaS outages can result in missed sales opportunities and damaged relationships with their own customers.

Customers should be on separate instances, each with its own up-time guarantees


Salesforce maintenance schedule is showing more outages over the weekend:



[caption id="attachment_417" align="alignnone" width="2300"]emergency_maintenance Screenshot of Salesforce NA14 instance maintenance schedule, as of May 14th, 2016[/caption]

Why does Salesforce have so many customers on a single database instance is a question their customers should ask of them. Customer should insist on having separate database instances with provisioned performance and up-time guarantees. Such separation solves a number of problems and creates a more positive experience for our cloud customers.

  1. The first, is a noisy-neighbor problem. This is a situation where one customer’s application and workload degrades performance of another customer.

  2. By having each customer on their own dedicated database instance with provisioned perfomance the SaaS vendor can address each customer’s performance needs individually, including automatically scaled read-replicas.

  3. Since each customer has a dedicated database instance, maintenance activities can be scheduled when it is convenient to the customer.

  4. Outages can and do happen with database instances, but they are isolated to individual customers.


What about database failover ?


A report in eWeek states that the outage was caused by a “file integrity issue”:
The failure was caused by a “file integrity issue” and was resolved by restoring the database from an earlier backup.

Salesforce also noted, “We have determined that data written to the NA14 instance between 9:53 UTC and 14:53 UTC on May 10, 2016, could not be restored.”

File integrity issues can happen due to a number of causes, but many of them can be mitigated by having multi-AZ failover. In a master-slave configuration, the slave can be promoted in the event of the master failure and all new queries can be routed to the new master.

AWS RDS Multi-AZ deployment is one such mechanism:
In case of an infrastructure failure (for example, instance hardware failure, storage failure, or network disruption), Amazon RDS performs an automatic failover to the standby, so that you can resume database operations as soon as the failover is complete. Since the endpoint for your DB Instance remains the same after a failover, your application can resume database operation without the need for manual administrative intervention.

One problem that multi-AZ failover doesn’t solve is a situation where the data corruption was caused by a software bug – the only way to recover is from a backup. However, this isn’t likely the case with Salesforce since there was no indication that new software was pushed to the affected NA14 instance.

SaaS customers should insist on off-line capability


Customers should treat Internet and SaaS outages as disasters and develop contingency plans. This has little to do with the cloud itself, and everything to do with the general preparedness of the enterprise to continue to do business.

On-premise infrastructure is at least as likely to face outages as cloud infrastructure. Internet providers may have problems, there could be hardware issues, and someone in the machine room could step on a cable and unplug a critical server.

Writing for CIO.com John Brandon recommends that customers develop contingency plans so that knowledge workers can continue to be productive in the event of cloud outages. Offline sync plays a critical role in allowing users to continue working with minimal disruption and sync with the system when it is available again:
“The two most prevalent business productivity and collaboration tools – Microsoft Office 365 and Google Apps – now enable offline functionality in their desktop applications, enabling document creation and editing with transparent syncing of files when connected,” adds Sumeet Sabharwal, vice president and general manager of NaviSite. “These same offline features also extended into their mobile application which also enable a rich user experience on tablets and smartphone devices while providing the users the convenience of portability.”

Outages come in many forms, they can be simple LTE or Wifi issues, and they can be system-wide problems. Just like an electric utility company may offer some of their most sensitive customers automatic standby generators, a SaaS vendor should offer optional off-line sync capability.

With well thought-out off-line sync users can continue work on most of their data on their mobile devices during an outage. While they may not be able to get some real-time data, they can at least operate with what they have on their mobile devices. The system can synchronize all devices with the cloud once it becomes available again.

Software escrows don’t fit the cloud model


Salesforce.com is unlikely to go out of business anytime soon, but as a general rule this is not a reason to be complacent. There is no other SaaS platform that requires a tighter vendor lock-in than Salesforce.com with their proprietary programming language and SQL dialect.

Writing for CloudPro Frank Jennings points out:
Cloud is different and escrow is not the cure-all that many think. If a customer has opted for a public cloud SaaS offering, that software will likely be a standardized version offered to all customers. The supplier won’t want to enter into escrow arrangements with all its customers who are paying rock-bottom prices. Even if it’s a bespoke software solution, by the very nature of cloud, there is likely to be very little installed on the client’s equipment. Perhaps it will be accessible by via a web browser. Therefore, the customer won’t even have the object code but will have access to software run on the supplier’s infrastructure.

The recommended approach is prevention: customers should consider their Plan B options in the event of SaaS vendor failure. One of the options involves considering competitors and a possibility that a migration might be necessary.

For the migration option to be viable, the customer should have a plan to backup all of their data outside of the SaaS vendor. SaaS vendor should offer batch export API and tools that the customer can use on a routine basis to export all of the data into a 3rd backup system. Salesforce does offer such API and customers should use it.

Another area to be on the lookout for is proprietary languages, tools and protocols that a SaaS vendor may impose. Salesforce’s Apex is a proprietary Java-like language that cannot run on any platform other than Salesforce. Too much investment in Apex code will most certainly prevent a customer from using an alternative SaaS.

Conclusion


When a cloud vendor experiences problems, everyone talks about it. It is easy to overlook the problems with traditional on-premise software. For instance, not a week goes by when I don’t hear of at least one on-premise customer having an SAP HANA crash, or a SAP ERP outage for several hours at a time.

Problems with SaaS can, do and will happen again. The key is making sure that the vendor offers tools and capabilities to help customers mitigate the impact of outages.




Feature image photo credit Belgapixels via Flickr