Archive

The Dulin Report

Browsable archive from the WordPress export.

Results (56)

On Amazon Prime Video’s move to a monolith May 14, 2023 One size does not fit all: neither cloud nor on-prem Apr 10, 2023 Comparing AWS SQS, SNS, and Kinesis: A Technical Breakdown for Enterprise Developers Feb 11, 2023 Stop Shakespearizing Sep 16, 2022 Using GNU Make with JavaScript and Node.js to build AWS Lambda functions Sep 4, 2022 Monolithic repository vs a monolith Aug 23, 2022 Keep your caching simple and inexpensive Jun 12, 2022 Java is no longer relevant May 29, 2022 There is no such thing as one grand unified full-stack programming language May 27, 2022 Best practices for building a microservice architecture Apr 25, 2022 TypeScript is a productivity problem in and of itself Apr 20, 2022 In most cases, there is no need for NoSQL Apr 18, 2022 Node.js and Lambda deployment size restrictions Mar 1, 2021 Should we abolish Section 230 ? Feb 1, 2021 TDWI 2019: Architecting Modern Big Data API Ecosystems May 30, 2019 Microsoft acquires Citus Data Jan 26, 2019 Which AWS messaging and queuing service to use? Jan 25, 2019 Using Markov Chain Generator to create Donald Trump's state of union speech Jan 20, 2019 Let’s talk cloud neutrality Sep 17, 2018 A conservative version of Facebook? Aug 30, 2018 TypeScript starts where JavaScript leaves off Aug 2, 2017 Design patterns in TypeScript: Chain of Responsibility Jul 22, 2017 I built an ultimate development environment for iPad Pro. Here is how. Jul 21, 2017 Rather than innovating Walmart bullies their tech vendors to leave AWS Jun 27, 2017 Emails, politics, and common sense Jan 14, 2017 Don't trust your cloud service until you've read the terms Sep 27, 2016 I am addicted to Medium, and I am tempted to move my entire blog to it Sep 9, 2016 What I learned from using Amazon Alexa for a month Sep 7, 2016 Amazon Alexa is eating the retailers alive Jun 22, 2016 In search for the mythical neutrality among top-tier public cloud providers Jun 18, 2016 What can we learn from the last week's salesforce.com outage ? May 15, 2016 Why it makes perfect sense for Dropbox to leave AWS May 7, 2016 Managed IT is not the future of the cloud Apr 9, 2016 JavaScript as the language of the cloud Feb 20, 2016 Our civilization has a single point of failure Dec 16, 2015 Operations costs are the Achille's heel of NoSQL Nov 23, 2015 IT departments must transform in the face of the cloud revolution Nov 9, 2015 Setting Up Cross-Region Replication of AWS RDS for PostgreSQL Sep 12, 2015 Top Ten Differences Between ActiveMQ and Amazon SQS Sep 5, 2015 Ten Questions to Consider Before Choosing Cassandra Aug 8, 2015 The Three Myths About JavaScript Simplicity Jul 10, 2015 Big Data is not all about Hadoop May 30, 2015 Smart IT Departments Own Their Business API and Take Ownership of Data Governance May 13, 2015 Guaranteeing Delivery of Messages with AWS SQS May 9, 2015 We Need a Cloud Version of Cassandra May 7, 2015 Building a Supercomputer in AWS: Is it even worth it ? Apr 13, 2015 Ordered Sets and Logs in Cassandra vs SQL Apr 8, 2015 Exploration of the Software Engineering as a Profession Apr 8, 2015 Finding Unused Elastic Load Balancers Mar 24, 2015 Where AWS Elastic BeanStalk Could be Better Mar 3, 2015 Trying to Replace Cassandra with DynamoDB ? Not so fast Feb 2, 2015 Why I am Tempted to Replace Cassandra With DynamoDB Nov 13, 2014 How We Overcomplicated Web Design Oct 8, 2014 Infrastructure in the cloud vs on-premise Aug 25, 2014 Cassandra: a key puzzle piece in a design for failure Aug 18, 2014 Cassandra: Lessons Learned Jun 6, 2014

Ten Questions to Consider Before Choosing Cassandra

August 8, 2015

 

[caption id="attachment_165" align="aligncenter" width="584"]I spent several years working at a financial company in Jersey City, NJ Pavonia-Newport area. On my way to work, over lunch breaks, or otherwise any time I need to get some fresh air, I'd take my camera with me and take some pictures. These are the sights, the views, and people that I have seen and observed in that neighborhood. These are commuters, workers, residence. Things to Consider.[/caption]

1. Do you know what your queries will look like ?


In traditional SQL you design your data model to represent your business objects. Your queries can then evolve over time and can be ad-hoc. You can even create views, materialized or otherwise, to facilitate even more complex analytical queries.

Cassandra does not offer the flexibility of traditional SQL. While your data model can evolve over time and you are not tied to a hard schema, you have to design your data model around the queries you plan to run. The problem with that approach is that it is very rare for end users to say with certainty what they want. Over time their needs change and so do the queries they want to run. Changes to the storage model in Cassandra involve running massive data reloads.

2. What is your team's skillset ?


Consider the human factor. Ability to get to the application's data and build reports and run analytical queries is critical to developer and business user productivity. This is often overlooked but it can mean a difference between delivering features in days vs. weeks.

Not all developers are created equal. SQL is a widely accepted and simple query language that business users should be capable of learning and using. Yet, many have trouble with even the simplest SQL. Introducing a whole new mechanism for querying their data, even if it is as mockingly similar to SQL as CQL, could be a problem.

Traditional SQL databases have well established libraries and tool sets. While other platforms are supported, Java is the primary target platform for Cassandra. Cassandra itself is written in Java, and the vast majority of tools and client libraries for it are written in Java. Running and operating a Cassandra cluster requires understanding of JVM intracacies, such as garbage collection.

3. What is your anticipated amount of data ?


Consider whether the amount of data you expect to store justifies entirely new way of thinking about your data. Cassandra was conceived at the time when multi-core SSD-backed servers were expensive, and Amazon EC2 was in its infancy.

A single modern multi-core SSD-backed server running PostgreSQL or MySQL can offer a good balance of performance and query flexibility. AWS RDS is available up to 3 Terabytes. Both PostgreSQL and MySQL can easily handle tables hundreds of gigabytes in size.

4. What are your anticipated write performance requirements ?


In Cassandra, all writes are O(1) and do not require a read-before-write pattern. You write your data, it gets stored into commit log and control is returned back to your application. Eventually it is available for reads, usually very quickly.

In SQL, things are little more complex. If the tables are indexed all writes require a lookup in the index, which in most cases is a O(log(n)) operation (n is the number of unique values in the index). Over time this is going to slow down writes. The writes are equally performant as reads.

5. What is the type of data you plan on storing ?


Cassandra has some advantages over traditional SQL when it comes to storing certain types of data. Ordered sets and logs are one such case. Set-style data structures in SQL require a read-before-write (aka upsert). Logs to be analyzed at a later point using another set of tools can take advantage of Cassandra's high write throughput.

One has to be cautious about frequently updated data in Cassandra. Compactions are un-predictable and repeatedly updating your data is going to introduce read performance penalties and increase disk storage costs.

6. What are your anticipated read performance requirements ?


Depending on the type of data you are storing, Cassandra may or may not hold advantages. Cassandra is eventually consistent, meaning that the data becomes available at some later point than when it was written. Typically it happens very quickly, but workloads where data is read almost immediately after it was written with expectation that it is exactly what was written are not appropriate for Cassandra. If that is your requirement, you need an ACID-capable database.

Any data that can be easily referred to by primary key at all times is a good fit for Cassandra. A traditional SQL database requires an index scan before it takes you to the right row. Cassandra primary key lookups are essentially O(1) operations and can work equally fast across extremely large data sets. On the other hand, secondary indices present a challenge.

7. Are you prepared for the operations costs ?


Since Cassandra is scaled by adding more nodes to the cluster operations can become quite expensive. Using an SQL database per-se doesn't solve this problem, however. The question becomes managed cloud service vs. rolling your own cluster. On-premise there is no difference between a multi-node Cassandra cluster vs an RDBMS with multiple read replicas in terms of operations and administrative costs. On-premise vs cloud is a topic for another discussion.

Amazon RDS is a managed RDBMS service. Changing the type of server, number of cores, RAM, and storage, involves simply modifying it and letting it automatically get upgraded during the maintenance window. The backups and monitoring are automatic and it requires very little of human interaction other than using the database itself. You do not need a DBA to manage it.

There are some services out there that will manage a Cassandra cluster in the cloud for you. You can and should consider these vs. rolling your own cluster. You should also consider not using Cassandra at all and see if DynamoDB meets your needs.

8. What are your burst requirements ?


The one area where Cassandra is better than DynamoDB (or other similar services) is burst performance. Cassandra's "0-60", so to speak, is instantaneous. It can handle and sustain thousands of operations per second on relatively cheap hardware.

Compared to the scaling profile of DynamoDB in AWS, Cassandra has an upper hand. While DynamoDB can be autoscaled, the autoscaling action can take minutes or hours to complete. In the meantime, your users suffer. With DynamoDB you are stuck with either paying a premium for always-on capacity or you have to come up with clever ways to work around it.

A relational OLTP database may meet your requirements here, however. Even managed service like AWS RDS can handle short bursts of read/write IOPS that are over the provisioned limit and then they get gracefully throttled back.

9. Are you installing on-premise or in the cloud ?


Compared to cloud environments, your options on-premises are limited. You don't have managed services like DynamoDB, RDS, or RedShift. Hardware and maintenance costs of a Cassandra cluster compared to a similarly sized RDBMS cluster are going to be about the same.

In the cloud environment, however, you are in a much better position to make the right decision. You have managed options like Google Big-Table, DynamoDB and RDS.

10. What are your disaster recovery requirements ?


Cassandra can play a crucial role in your design for failure because all data centers can be kept hot without the need for a master-slave failover. That type of a configuration is a lot more complex to achieve with an SQL database. SQL gives you consistency, while Cassandra gives you partition-tolerance.

Conclusion


Cassandra is constantly evolving, as are traditional databases and managed cloud services. What I see happening is convergence of functionality. There is a lot of cross pollination of ideas going on in the industry with NoSQL databases adopting some of the SQL functionality (think: Cassandra CQL and SQL) and SQL databases adopting some of the NoSQL functionality (think: PostgreSQL NoSQL features). It is important to keep a cool head and not jump on any new tech without understanding your use cases and skill sets.