Archive

The Dulin Report

Browsable archive from the WordPress export.

Results (57)

Strategic activity mapping for software architects May 25, 2025 On the role of Distinguished Engineer and CTO Mindset Apr 27, 2025 The future is bright Mar 30, 2025 Software Engineering is here to stay Mar 3, 2024 Some thoughts on recent RTO announcements Jun 22, 2023 Comparing AWS SQS, SNS, and Kinesis: A Technical Breakdown for Enterprise Developers Feb 11, 2023 Should today’s developers worry about AI code generators taking their jobs? Dec 11, 2022 Things to be Thankful for Nov 24, 2022 Book review: Clojure for the Brave and True Oct 2, 2022 Monolithic repository vs a monolith Aug 23, 2022 Scripting languages are tools for tying APIs together, not building complex systems Jun 8, 2022 There is no such thing as one grand unified full-stack programming language May 27, 2022 Most terrifying professional artifact May 14, 2022 Best practices for building a microservice architecture Apr 25, 2022 True identity verification should require a human Mar 16, 2020 On elephant graveyards Feb 15, 2020 TDWI 2019: Architecting Modern Big Data API Ecosystems May 30, 2019 Returning security back to the user Feb 2, 2019 Which AWS messaging and queuing service to use? Jan 25, 2019 Using Markov Chain Generator to create Donald Trump's state of union speech Jan 20, 2019 The religion of JavaScript Nov 26, 2018 Leaving Facebook and Twitter: here are the alternatives Mar 25, 2018 When politics and technology intersect Mar 24, 2018 TypeScript starts where JavaScript leaves off Aug 2, 2017 Node.js is a perfect enterprise application platform Jul 30, 2017 Rather than innovating Walmart bullies their tech vendors to leave AWS Jun 27, 2017 Architecting API ecosystems: my interview with Anthony Brovchenko of R. Culturi Jun 5, 2017 TDWI 2017, Chicago, IL: Architecting Modern Big Data API Ecosystems May 30, 2017 Apple’s recent announcements have been underwhelming Oct 29, 2016 Why I switched to Android and Google Project Fi and why should you Aug 28, 2016 Amazon Alexa is eating the retailers alive Jun 22, 2016 What can we learn from the last week's salesforce.com outage ? May 15, 2016 Why it makes perfect sense for Dropbox to leave AWS May 7, 2016 JEE in the cloud era: building application servers Apr 22, 2016 Managed IT is not the future of the cloud Apr 9, 2016 JavaScript as the language of the cloud Feb 20, 2016 OAuth 2.0: the protocol at the center of the universe Jan 1, 2016 Operations costs are the Achille's heel of NoSQL Nov 23, 2015 IT departments must transform in the face of the cloud revolution Nov 9, 2015 Banking Technology is in Dire Need of Standartization and Openness Sep 28, 2015 Top Ten Differences Between ActiveMQ and Amazon SQS Sep 5, 2015 We Live in a Mobile Device Notification Hell Aug 22, 2015 What Every College Computer Science Freshman Should Know Aug 14, 2015 The Three Myths About JavaScript Simplicity Jul 10, 2015 Book Review: "Shop Class As Soulcraft" By Matthew B. Crawford Jul 5, 2015 Your IT Department's Kodak Moment Jun 17, 2015 The longer the chain of responsibility the less likely there is anyone in the hierarchy who can actually accept it Jun 7, 2015 Smart IT Departments Own Their Business API and Take Ownership of Data Governance May 13, 2015 We Need a Cloud Version of Cassandra May 7, 2015 Building a Supercomputer in AWS: Is it even worth it ? Apr 13, 2015 Ordered Sets and Logs in Cassandra vs SQL Apr 8, 2015 Exploration of the Software Engineering as a Profession Apr 8, 2015 What can Evernote Teach Us About Enterprise App Architecture Apr 2, 2015 Why I am Tempted to Replace Cassandra With DynamoDB Nov 13, 2014 Infrastructure in the cloud vs on-premise Aug 25, 2014 Wall St. wakes up to underinvestment in OMS Aug 21, 2014 Cassandra: Lessons Learned Jun 6, 2014

What can we learn from the last week's salesforce.com outage ?

May 15, 2016

On May 10 Salesforce experienced a day-long outage and lost four hours of customer data. As of May 14, Salesforce is still in degraded state. There is a number of lessons we can learn from this.

You give some, you gain some at each level of the cloud stack


At the lowest level is private on-premise cloud where the customer retains all control over the well being of the system. If the system is down the responsibility is with the IT department to bring it back up.

At IaaS and PaaS level, the cloud provider such as AWS or Azure is responsible for the maintenance and up-time of the infrastructure, but the reliable performance of applications is up to the customer. The IaaS vendor offers tools like multi-AZ redundancy and database snapshots but it is up to the customer to take advantage of these tools.

Salesforce.com operates at the SaaS level where the responsibility for the up-time is entirely with the vendor. Salesforce routinely uses slogans like “No hardware. No software. No headaches. Salesforce developer training professionals.” For the most part, their claim is true.



[caption id="attachment_404" align="alignnone" width="2086"]no_software “No hardware. No software. No headaches” screenshot pulled from here.[/caption]


Yes, headaches!


Salesforce’s NA14 instance was down for four hours on May 10th, and in degraded state ever since.



[caption id="attachment_412" align="alignnone" width="2510"]salesforce_na14 Screenshot of Salesforce NA14 instance state, as of May 14th, 2016, pulled from "trust" report[/caption]

Headaches come when a software architecture flaw with your SaaS vendor’s system results in you losing hours of data and more scheduled outages. Sales, ERP and CRM software is mission critical for many enterprises as SaaS outages can result in missed sales opportunities and damaged relationships with their own customers.

Customers should be on separate instances, each with its own up-time guarantees


Salesforce maintenance schedule is showing more outages over the weekend:



[caption id="attachment_417" align="alignnone" width="2300"]emergency_maintenance Screenshot of Salesforce NA14 instance maintenance schedule, as of May 14th, 2016[/caption]

Why does Salesforce have so many customers on a single database instance is a question their customers should ask of them. Customer should insist on having separate database instances with provisioned performance and up-time guarantees. Such separation solves a number of problems and creates a more positive experience for our cloud customers.

  1. The first, is a noisy-neighbor problem. This is a situation where one customer’s application and workload degrades performance of another customer.

  2. By having each customer on their own dedicated database instance with provisioned perfomance the SaaS vendor can address each customer’s performance needs individually, including automatically scaled read-replicas.

  3. Since each customer has a dedicated database instance, maintenance activities can be scheduled when it is convenient to the customer.

  4. Outages can and do happen with database instances, but they are isolated to individual customers.


What about database failover ?


A report in eWeek states that the outage was caused by a “file integrity issue”:
The failure was caused by a “file integrity issue” and was resolved by restoring the database from an earlier backup.

Salesforce also noted, “We have determined that data written to the NA14 instance between 9:53 UTC and 14:53 UTC on May 10, 2016, could not be restored.”

File integrity issues can happen due to a number of causes, but many of them can be mitigated by having multi-AZ failover. In a master-slave configuration, the slave can be promoted in the event of the master failure and all new queries can be routed to the new master.

AWS RDS Multi-AZ deployment is one such mechanism:
In case of an infrastructure failure (for example, instance hardware failure, storage failure, or network disruption), Amazon RDS performs an automatic failover to the standby, so that you can resume database operations as soon as the failover is complete. Since the endpoint for your DB Instance remains the same after a failover, your application can resume database operation without the need for manual administrative intervention.

One problem that multi-AZ failover doesn’t solve is a situation where the data corruption was caused by a software bug – the only way to recover is from a backup. However, this isn’t likely the case with Salesforce since there was no indication that new software was pushed to the affected NA14 instance.

SaaS customers should insist on off-line capability


Customers should treat Internet and SaaS outages as disasters and develop contingency plans. This has little to do with the cloud itself, and everything to do with the general preparedness of the enterprise to continue to do business.

On-premise infrastructure is at least as likely to face outages as cloud infrastructure. Internet providers may have problems, there could be hardware issues, and someone in the machine room could step on a cable and unplug a critical server.

Writing for CIO.com John Brandon recommends that customers develop contingency plans so that knowledge workers can continue to be productive in the event of cloud outages. Offline sync plays a critical role in allowing users to continue working with minimal disruption and sync with the system when it is available again:
“The two most prevalent business productivity and collaboration tools – Microsoft Office 365 and Google Apps – now enable offline functionality in their desktop applications, enabling document creation and editing with transparent syncing of files when connected,” adds Sumeet Sabharwal, vice president and general manager of NaviSite. “These same offline features also extended into their mobile application which also enable a rich user experience on tablets and smartphone devices while providing the users the convenience of portability.”

Outages come in many forms, they can be simple LTE or Wifi issues, and they can be system-wide problems. Just like an electric utility company may offer some of their most sensitive customers automatic standby generators, a SaaS vendor should offer optional off-line sync capability.

With well thought-out off-line sync users can continue work on most of their data on their mobile devices during an outage. While they may not be able to get some real-time data, they can at least operate with what they have on their mobile devices. The system can synchronize all devices with the cloud once it becomes available again.

Software escrows don’t fit the cloud model


Salesforce.com is unlikely to go out of business anytime soon, but as a general rule this is not a reason to be complacent. There is no other SaaS platform that requires a tighter vendor lock-in than Salesforce.com with their proprietary programming language and SQL dialect.

Writing for CloudPro Frank Jennings points out:
Cloud is different and escrow is not the cure-all that many think. If a customer has opted for a public cloud SaaS offering, that software will likely be a standardized version offered to all customers. The supplier won’t want to enter into escrow arrangements with all its customers who are paying rock-bottom prices. Even if it’s a bespoke software solution, by the very nature of cloud, there is likely to be very little installed on the client’s equipment. Perhaps it will be accessible by via a web browser. Therefore, the customer won’t even have the object code but will have access to software run on the supplier’s infrastructure.

The recommended approach is prevention: customers should consider their Plan B options in the event of SaaS vendor failure. One of the options involves considering competitors and a possibility that a migration might be necessary.

For the migration option to be viable, the customer should have a plan to backup all of their data outside of the SaaS vendor. SaaS vendor should offer batch export API and tools that the customer can use on a routine basis to export all of the data into a 3rd backup system. Salesforce does offer such API and customers should use it.

Another area to be on the lookout for is proprietary languages, tools and protocols that a SaaS vendor may impose. Salesforce’s Apex is a proprietary Java-like language that cannot run on any platform other than Salesforce. Too much investment in Apex code will most certainly prevent a customer from using an alternative SaaS.

Conclusion


When a cloud vendor experiences problems, everyone talks about it. It is easy to overlook the problems with traditional on-premise software. For instance, not a week goes by when I don’t hear of at least one on-premise customer having an SAP HANA crash, or a SAP ERP outage for several hours at a time.

Problems with SaaS can, do and will happen again. The key is making sure that the vendor offers tools and capabilities to help customers mitigate the impact of outages.




Feature image photo credit Belgapixels via Flickr