The Dulin Report

Browsable archive from the WordPress export.

Results (86)

Strategic activity mapping for software architects May 25, 2025 On the role of Distinguished Engineer and CTO Mindset Apr 27, 2025 The future is bright Mar 30, 2025 2024 Reflections Dec 31, 2024 My giant follows me wherever I go Sep 20, 2024 The day I became an architect Sep 11, 2024 Are developer jobs truly in decline? Jun 29, 2024 Leadership is About "We," Not "I" Jun 9, 2024 Form follows fiasco Mar 31, 2024 Software Engineering is here to stay Mar 3, 2024 Some thoughts on recent RTO announcements Jun 22, 2023 On Amazon Prime Video’s move to a monolith May 14, 2023 One size does not fit all: neither cloud nor on-prem Apr 10, 2023 Some thoughts on the latest LastPass fiasco Mar 5, 2023 Comparing AWS SQS, SNS, and Kinesis: A Technical Breakdown for Enterprise Developers Feb 11, 2023 Working from home works as well as any distributed team Nov 25, 2022 Why you should question the “database per service” pattern Oct 5, 2022 Stop Shakespearizing Sep 16, 2022 Why don’t they tell you that in the instructions? Aug 31, 2022 Monolithic repository vs a monolith Aug 23, 2022 Automation and coding tools for pet projects on the Apple hardware May 28, 2022 There is no such thing as one grand unified full-stack programming language May 27, 2022 Most terrifying professional artifact May 14, 2022 If you haven’t done it already, get yourself a Raspberry Pi and install Linux on it May 9, 2022 Good idea fairy strikes when you least expect it May 2, 2022 Kitchen table conversations Nov 7, 2021 Application developers like to think their app is the only one Apr 5, 2021 A year of COVID taught us all how to work remotely Feb 10, 2021 What programming language to use for a brand new project? Feb 18, 2020 The religion of JavaScript Nov 26, 2018 Teleportation can corrupt your data Sep 29, 2018 Let’s talk cloud neutrality Sep 17, 2018 What does a Chief Software Architect do? Jun 23, 2018 Nobody wants your app Aug 2, 2017 TypeScript starts where JavaScript leaves off Aug 2, 2017 Singletons in TypeScript Jul 16, 2017 Emails, politics, and common sense Jan 14, 2017 Online grocers have an additional burden to be reliable Jan 5, 2017 Collaborative work in the cloud: what I learned teaching my daughter how to code Dec 10, 2016 Apple’s recent announcements have been underwhelming Oct 29, 2016 What I learned from using Amazon Alexa for a month Sep 7, 2016 Why I switched to Android and Google Project Fi and why should you Aug 28, 2016 Amazon Alexa is eating the retailers alive Jun 22, 2016 In search for the mythical neutrality among top-tier public cloud providers Jun 18, 2016 In Support Of Gary Johnson Jun 13, 2016 Files and folders: apps vs documents May 26, 2016 What can we learn from the last week's salesforce.com outage ? May 15, 2016 Why it makes perfect sense for Dropbox to leave AWS May 7, 2016 JEE in the cloud era: building application servers Apr 22, 2016 Let's stop letting tools get in the way of results Apr 10, 2016 JavaScript as the language of the cloud Feb 20, 2016 LinkedIn needs a reset Feb 13, 2016 In memory of Ed Yourdon Jan 23, 2016 Our civilization has a single point of failure Dec 16, 2015 IT departments must transform in the face of the cloud revolution Nov 9, 2015 I Stand With Ahmed Sep 19, 2015 Setting Up Cross-Region Replication of AWS RDS for PostgreSQL Sep 12, 2015 Top Ten Differences Between ActiveMQ and Amazon SQS Sep 5, 2015 We Live in a Mobile Device Notification Hell Aug 22, 2015 What Every College Computer Science Freshman Should Know Aug 14, 2015 On Maintaining Personal Brand as a Software Engineer Aug 2, 2015 The Three Myths About JavaScript Simplicity Jul 10, 2015 Book Review: "Shop Class As Soulcraft" By Matthew B. Crawford Jul 5, 2015 Attracting STEM Graduates to Traditional Enterprise IT Jul 4, 2015 Your IT Department's Kodak Moment Jun 17, 2015 The longer the chain of responsibility the less likely there is anyone in the hierarchy who can actually accept it Jun 7, 2015 Big Data is not all about Hadoop May 30, 2015 Smart IT Departments Own Their Business API and Take Ownership of Data Governance May 13, 2015 The Clarkson School Class of 2015 Commencement speech May 5, 2015 My Brief Affair With Android Apr 25, 2015 Exploration of the Software Engineering as a Profession Apr 8, 2015 What can Evernote Teach Us About Enterprise App Architecture Apr 2, 2015 Microsoft and Apple Have Everything to Lose if Chromebooks Succeed Mar 31, 2015 Do not apply data science methods without understanding them Mar 25, 2015 On apprenticeship Feb 13, 2015 On Managing Stress, Multitasking and Other New Year's Resolutions Jan 1, 2015 Why I am Tempted to Replace Cassandra With DynamoDB Nov 13, 2014 Software Engineering and Domain Area Expertise Nov 7, 2014 Docker can fundamentally change how you think of server deployments Aug 26, 2014 Wall St. wakes up to underinvestment in OMS Aug 21, 2014 Software Engineers Are Not Doctors Aug 3, 2014 Thanking MIT Scratch Sep 14, 2013 Have computers become too complicated for teaching ? Jan 1, 2013 Thoughts on Wall Street Technology Aug 11, 2012 Scripting News: After X years programming Jun 5, 2012 Java, Linux and UNIX: How much things have progressed Dec 7, 2010

What can we learn from the last week's salesforce.com outage ?

May 15, 2016

On May 10 Salesforce experienced a day-long outage and lost four hours of customer data. As of May 14, Salesforce is still in degraded state. There is a number of lessons we can learn from this.

You give some, you gain some at each level of the cloud stack

At the lowest level is private on-premise cloud where the customer retains all control over the well being of the system. If the system is down the responsibility is with the IT department to bring it back up.

At IaaS and PaaS level, the cloud provider such as AWS or Azure is responsible for the maintenance and up-time of the infrastructure, but the reliable performance of applications is up to the customer. The IaaS vendor offers tools like multi-AZ redundancy and database snapshots but it is up to the customer to take advantage of these tools.

Salesforce.com operates at the SaaS level where the responsibility for the up-time is entirely with the vendor. Salesforce routinely uses slogans like “No hardware. No software. No headaches. Salesforce developer training professionals.” For the most part, their claim is true.

Yes, headaches!

Salesforce’s NA14 instance was down for four hours on May 10th, and in degraded state ever since.

Headaches come when a software architecture flaw with your SaaS vendor’s system results in you losing hours of data and more scheduled outages. Sales, ERP and CRM software is mission critical for many enterprises as SaaS outages can result in missed sales opportunities and damaged relationships with their own customers.

Customers should be on separate instances, each with its own up-time guarantees

Salesforce maintenance schedule is showing more outages over the weekend:

Why does Salesforce have so many customers on a single database instance is a question their customers should ask of them. Customer should insist on having separate database instances with provisioned performance and up-time guarantees. Such separation solves a number of problems and creates a more positive experience for our cloud customers.

The first, is a noisy-neighbor problem. This is a situation where one customer’s application and workload degrades performance of another customer.

By having each customer on their own dedicated database instance with provisioned perfomance the SaaS vendor can address each customer’s performance needs individually, including automatically scaled read-replicas.

Since each customer has a dedicated database instance, maintenance activities can be scheduled when it is convenient to the customer.

Outages can and do happen with database instances, but they are isolated to individual customers.

What about database failover ?

A report in eWeek states that the outage was caused by a “file integrity issue”:

The failure was caused by a “file integrity issue” and was resolved by restoring the database from an earlier backup.

Salesforce also noted, “We have determined that data written to the NA14 instance between 9:53 UTC and 14:53 UTC on May 10, 2016, could not be restored.”

File integrity issues can happen due to a number of causes, but many of them can be mitigated by having multi-AZ failover. In a master-slave configuration, the slave can be promoted in the event of the master failure and all new queries can be routed to the new master.

AWS RDS Multi-AZ deployment is one such mechanism:

In case of an infrastructure failure (for example, instance hardware failure, storage failure, or network disruption), Amazon RDS performs an automatic failover to the standby, so that you can resume database operations as soon as the failover is complete. Since the endpoint for your DB Instance remains the same after a failover, your application can resume database operation without the need for manual administrative intervention.

One problem that multi-AZ failover doesn’t solve is a situation where the data corruption was caused by a software bug – the only way to recover is from a backup. However, this isn’t likely the case with Salesforce since there was no indication that new software was pushed to the affected NA14 instance.

SaaS customers should insist on off-line capability

Customers should treat Internet and SaaS outages as disasters and develop contingency plans. This has little to do with the cloud itself, and everything to do with the general preparedness of the enterprise to continue to do business.

On-premise infrastructure is at least as likely to face outages as cloud infrastructure. Internet providers may have problems, there could be hardware issues, and someone in the machine room could step on a cable and unplug a critical server.

Writing for CIO.com John Brandon recommends that customers develop contingency plans so that knowledge workers can continue to be productive in the event of cloud outages. Offline sync plays a critical role in allowing users to continue working with minimal disruption and sync with the system when it is available again:

“The two most prevalent business productivity and collaboration tools – Microsoft Office 365 and Google Apps – now enable offline functionality in their desktop applications, enabling document creation and editing with transparent syncing of files when connected,” adds Sumeet Sabharwal, vice president and general manager of NaviSite. “These same offline features also extended into their mobile application which also enable a rich user experience on tablets and smartphone devices while providing the users the convenience of portability.”

Outages come in many forms, they can be simple LTE or Wifi issues, and they can be system-wide problems. Just like an electric utility company may offer some of their most sensitive customers automatic standby generators, a SaaS vendor should offer optional off-line sync capability.

With well thought-out off-line sync users can continue work on most of their data on their mobile devices during an outage. While they may not be able to get some real-time data, they can at least operate with what they have on their mobile devices. The system can synchronize all devices with the cloud once it becomes available again.

Software escrows don’t fit the cloud model

Salesforce.com is unlikely to go out of business anytime soon, but as a general rule this is not a reason to be complacent. There is no other SaaS platform that requires a tighter vendor lock-in than Salesforce.com with their proprietary programming language and SQL dialect.

Writing for CloudPro Frank Jennings points out:

Cloud is different and escrow is not the cure-all that many think. If a customer has opted for a public cloud SaaS offering, that software will likely be a standardized version offered to all customers. The supplier won’t want to enter into escrow arrangements with all its customers who are paying rock-bottom prices. Even if it’s a bespoke software solution, by the very nature of cloud, there is likely to be very little installed on the client’s equipment. Perhaps it will be accessible by via a web browser. Therefore, the customer won’t even have the object code but will have access to software run on the supplier’s infrastructure.

The recommended approach is prevention: customers should consider their Plan B options in the event of SaaS vendor failure. One of the options involves considering competitors and a possibility that a migration might be necessary.

For the migration option to be viable, the customer should have a plan to backup all of their data outside of the SaaS vendor. SaaS vendor should offer batch export API and tools that the customer can use on a routine basis to export all of the data into a 3rd backup system. Salesforce does offer such API and customers should use it.

Another area to be on the lookout for is proprietary languages, tools and protocols that a SaaS vendor may impose. Salesforce’s Apex is a proprietary Java-like language that cannot run on any platform other than Salesforce. Too much investment in Apex code will most certainly prevent a customer from using an alternative SaaS.

Conclusion

When a cloud vendor experiences problems, everyone talks about it. It is easy to overlook the problems with traditional on-premise software. For instance, not a week goes by when I don’t hear of at least one on-premise customer having an SAP HANA crash, or a SAP ERP outage for several hours at a time.

Problems with SaaS can, do and will happen again. The key is making sure that the vendor offers tools and capabilities to help customers mitigate the impact of outages.

Feature image photo credit Belgapixels via Flickr