The Dulin Report

Cassandra: Lessons Learned

June 6, 2014

After using Cassandra for 3 years since version 0.8.5, I thought I'd put together a blurb on lessons learned. Here it goes!

Use Cases


What works


Anything that involves high-speed collection of data for analysis in the background or via batch. For example:

  • Logging and data collection

    • Web servers

    • Mobile devices

    • Internet of things

    • Sensors

    • Finance

      • Market data logging

      • Transaction logging

      • Trading activity

      • Record keeping for compliance





  • Telecommunications

    • Call logs



  • Application servers

    • Sharing session data

    • Shopping carts

    • User profiles and preferences

    • Metrics, metering and monitoring



  • Lucene-style document indexing

  • Expandable, redundant media storage


What doesn't work



  • Anything that requires real time analytics and aggregation

  • Relational queries

  • Reliable counters


Data model


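If you are a Java developer, the Cassandra data model is best described by the following pseudocode:
import java.util.HashMap;
import java.util.TreeMap;

// Column name -> value, sorted by name (String keys are a simplification)
class Row extends TreeMap<String, byte[]> { }

// Row key -> row
class ColumnFamily extends HashMap<String, Row> { }

class Keyspace extends HashMap<String, ColumnFamily> { }

class Cassandra extends HashMap<String, Keyspace> { }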

A keyspace is made up of column families. A column family is made up of rows. Rows are referred to by keys. Each key is unique within a column family. Rows are made up of columns.

Columns within a row are sorted by column name. Sort order is configured at the time the column family is created and may not be changed. Column names can be composite and made up of multiple parts.
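To make the sorted-column idea concrete, here is a minimal Java sketch that continues the TreeMap analogy. The composite name format (`sensor:timestamp`), the class name `SliceSketch`, and the sample values are assumptions made for illustration; this is not the Cassandra client API. Because columns are kept in name order, a range of columns can be read as one contiguous slice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration: a row modeled as a sorted map of column name -> value.
final class SliceSketch {

    // Build a composite column name from two parts (assumed format: "part1:part2").
    static String composite(String sensor, long timestamp) {
        // Zero-pad the number so that string order matches numeric order.
        return String.format(Locale.ROOT, "%s:%013d", sensor, timestamp);
    }

    // Return the values of all columns for one sensor with timestamp in [from, to).
    static List<String> slice(TreeMap<String, String> row, String sensor, long from, long to) {
        SortedMap<String, String> range =
                row.subMap(composite(sensor, from), composite(sensor, to));
        return new ArrayList<>(range.values());
    }

    public static void main(String[] args) {
        TreeMap<String, String> row = new TreeMap<>();
        // Insert out of order; the map keeps columns sorted by name.
        row.put(composite("temp", 300), "21.5");
        row.put(composite("temp", 100), "20.1");
        row.put(composite("humidity", 200), "40");
        row.put(composite("temp", 200), "20.9");
        System.out.println(slice(row, "temp", 100, 300));
    }
}
```

The first part of the composite name groups related columns together, and the second part orders them within the group, which is why a slice never needs to scan the whole row.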

Column values can be just about anything, including binary. Values can be distributed counters, an important and useful feature. Columns can have a TTL and expire automatically, which is very useful for managing data retention.
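The TTL behavior can be pictured with a small Java sketch in the same spirit as the pseudocode above. The `TtlSketch` and `Column` names and the explicit `nowMillis` parameter are assumptions for illustration; this only mimics the visible effect of expiry and says nothing about how Cassandra implements it internally.

```java
import java.util.TreeMap;

// Hypothetical illustration of column expiry; not Cassandra's storage engine.
final class TtlSketch {

    // A column value plus the time (epoch millis) at which it stops being visible.
    static final class Column {
        final byte[] value;
        final long expiresAtMillis;

        Column(byte[] value, long writtenAtMillis, long ttlSeconds) {
            this.value = value;
            this.expiresAtMillis = writtenAtMillis + ttlSeconds * 1000L;
        }

        boolean isLive(long nowMillis) {
            return nowMillis < expiresAtMillis;
        }
    }

    // Read a column, treating expired columns as if they were absent.
    static byte[] read(TreeMap<String, Column> row, String name, long nowMillis) {
        Column c = row.get(name);
        return (c != null && c.isLive(nowMillis)) ? c.value : null;
    }

    public static void main(String[] args) {
        TreeMap<String, Column> row = new TreeMap<>();
        // Written at t=0 with a 60 second TTL.
        row.put("session", new Column(new byte[] {1, 2, 3}, 0L, 60L));
        System.out.println(read(row, "session", 59_000L) != null); // still visible
        System.out.println(read(row, "session", 60_000L) != null); // expired
    }
}
```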

Client API libraries


Thrift


Thrift is a low-level RPC protocol that Cassandra uses to expose its API. There is a multitude of client libraries, such as Pelops, Hector, and Astyanax. I have been using Thrift on my projects. Note that the Cassandra team considers Thrift to be feature complete, and therefore it has not seen a single new feature in at least 2 years.

CQL


Cassandra supports an SQL-like language called CQL. If you are looking for an equivalent of SQL you are going to be disappointed.

In some cases it is simpler and easier to use than the lower-level Thrift API, and certainly many people swear by it. My humble opinion is that if you are looking for SQL, save yourself the hassle and use an SQL database. However, at least evaluate CQL if you are starting a new Cassandra implementation from scratch.

Hardware and infrastructure requirements


One major mistake that those new to Cassandra make is spending a lot of money on expensive hardware. In fact, Cassandra can run on a reasonably configured modern machine.

Commodity hardware with smaller SSD storage


In my experience the optimal configuration is 16–32 GB of RAM, a 256–512 GB SSD, and at least four CPU cores. It is OK to virtualize, but make sure that each VM is on separate physical hardware using separate physical storage.

It is best to start off with no more than 512 GB of SSD storage per node and expand by adding more nodes, rather than adding more storage to the same hardware.

For example, if I were to configure Cassandra on Amazon, I would pick either the c3.2xlarge or c3.4xlarge instance type and combine the two drives using RAID 0. As my needs grow, I would add more nodes rather than move to larger nodes.
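A back-of-the-envelope way to apply the "add nodes, not disk" advice is sketched below in Java. The `ClusterSizing` class, the replication factor, and especially the `usableFraction` parameter (how much of each disk you are willing to fill) are assumptions for illustration, not figures from my clusters; the sketch also assumes data is spread evenly across nodes.

```java
// Hypothetical sizing helper; the inputs below are assumptions, not measured values.
final class ClusterSizing {

    // Nodes needed = ceil(rawData * replicationFactor / (diskPerNode * usableFraction)).
    static int nodesNeeded(double rawDataGb, int replicationFactor,
                           double diskPerNodeGb, double usableFraction) {
        double totalStoredGb = rawDataGb * replicationFactor;
        double usablePerNodeGb = diskPerNodeGb * usableFraction;
        int nodes = (int) Math.ceil(totalStoredGb / usablePerNodeGb);
        // A cluster needs at least as many nodes as there are replicas of each row.
        return Math.max(nodes, replicationFactor);
    }

    public static void main(String[] args) {
        // 2 TB of raw data, 3 replicas, 512 GB SSD per node, half of each disk kept free.
        System.out.println(nodesNeeded(2000, 3, 512, 0.5));
        // When the data doubles, the answer is more nodes of the same size.
        System.out.println(nodesNeeded(4000, 3, 512, 0.5));
    }
}
```

Keeping the per-node disk fixed and letting the node count absorb growth is the design choice the sketch encodes.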

Networking


The faster the better. Slow connections between nodes will result in replication delays.

Operations


Do not attempt to hire a traditional DBA to support Cassandra as knowledge of both Linux and Java is required.

While reasonably performant out of the box with default settings, Cassandra is not an easy system to tune for optimal performance. Doing that requires a thorough understanding of core Java and Java memory management parameters. Outside of the Java ecosystem this can be a turn-off for some.

Storage, redundancy, and performance are expanded by adding more nodes. This can happen during normal business hours as long as consistency parameters are met. The same applies to node replacements.
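One way to read "as long as consistency parameters are met" is that enough replicas of each row stay online to satisfy the consistency level your clients use. The Java sketch below works through that arithmetic for quorum operations, where a quorum is floor(RF / 2) + 1 replicas. The `MaintenanceWindow` class and its method names are hypothetical, and this interpretation is an assumption rather than a complete operational checklist.

```java
// Hypothetical helper: how many replicas of a row can be offline during maintenance.
final class MaintenanceWindow {

    // Replicas that must respond for a quorum operation: floor(rf / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Replicas that can be down while operations needing `required` replicas still succeed.
    static int replicasThatCanBeDown(int replicationFactor, int required) {
        return Math.max(0, replicationFactor - required);
    }

    public static void main(String[] args) {
        for (int rf = 1; rf <= 5; rf++) {
            int q = quorum(rf);
            System.out.println("RF=" + rf + " quorum=" + q
                    + " can lose " + replicasThatCanBeDown(rf, q));
        }
    }
}
```

With a replication factor of 3 and quorum operations, one replica of any row can be offline, which is what makes a rolling node replacement during business hours feasible.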

As the number of servers grows, be prepared to hire a devops army or look for a managed solution. The DataStax offering helps but is still not enough. Even in the cloud, we found no good managed solution. Cassandra.io requires that you give up Thrift and CQL, and Instaclustr, as of this writing, does not use third-generation SSD-backed instance types.

Technically speaking, backups are not strictly needed because data is replicated. In fact, backup mechanisms in Cassandra are limited. You need to come up with your own backup mechanism. Point-in-time backups are possible but require creative scripting.

Pros and Cons


Pros



  • Powerful and flexible data model

  • Perfect for use cases where you can refer to your stored data directly by primary key, need a fast data collection mechanism, and have a batch process to analyze the data

  • Replication is trivial to configure

  • Once set up, it can run unattended for long periods of time

  • Fixed cost of a Cassandra cluster in Amazon AWS can be an advantage vs. variable cost of DynamoDB


Cons



  • Point-in-time backups aren't possible without clever scripting

  • Can't utilize common DBA skills for operations

  • Can be a devops nightmare

    • A regular repair process is required, but it is very taxing on the system, requires babysitting, and may leave the node in an inconsistent state



  • Some advertised features are impractical to use in real life

    • Distributed counters can become inaccurate under heavy load

    • Wide rows are supported but not handled gracefully




Lessons learned



  • Do not spend money to make your life difficult. Use off the shelf hardware rather than spending on enterprise grade iron

  • Use smaller SSDs on each node and expand capacity by adding nodes

  • Keep all nodes hot by having clients on all nodes. This reduces the need for regular repairs.

  • Cassandra is not necessarily your solution to Big Data

    • Is your data really Big?

    • Does your use case fit Cassandra's strengths?

    • Modern SQL databases can handle millions of records

    • If you are in the Amazon environment, RDS supports dual redundancy

    • What constitutes Big Data anyway?

    • Consider your redundancy needs. Do you feel the probability of losing a server warrants the devops hassle of having more of them?



  • In Amazon AWS cloud I would seriously consider alternatives

    • DynamoDB is much more cost-effective to use and operate if your workload is predictable. Since DynamoDB charges per use, costs can be variable. Cassandra, on the other hand, results in a fixed cost.

    • RDS offers dual redundancy with MySQL and PostgreSQL. Postgres support for JSON documents makes it a good alternative to Cassandra and MongoDB



  • Some data structures are antithetical to Cassandra. Queues are problematic because Cassandra can't handle frequently updated data gracefully. Read-before-write workloads are very taxing on the system. Writes followed immediately by reads are unpredictable, especially when the replication factor is higher than 2.

  • Wide rows can be a challenge even though Cassandra does support up to 2 billion columns. Wide rows can create a load imbalance and present a challenge for compactions and slice queries.

  • If you need to do complex joins or real-time aggregations, save yourself the trouble and use SQL, while reserving Cassandra for what it is really good at.
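The read-after-write caveat in the list above can be made concrete with the usual replica-overlap rule: a read is only guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The Java sketch below is a minimal illustration of that rule; the `ReadAfterWrite` class and its method name are hypothetical, and the rule is offered as a likely explanation of the behavior rather than a diagnosis of any particular cluster.

```java
// Hypothetical check of the replica-overlap rule: R + W > RF.
final class ReadAfterWrite {

    // True when every set of read replicas must share a node with every set of write replicas.
    static boolean readSeesLatestWrite(int replicationFactor, int writeReplicas, int readReplicas) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        // RF=3, write acknowledged by one replica, read from one replica:
        // the read may land on a replica that has not received the write yet.
        System.out.println(readSeesLatestWrite(3, 1, 1));
        // RF=3, quorum writes and quorum reads (2 + 2 > 3): the sets always overlap.
        System.out.println(readSeesLatestWrite(3, 2, 2));
    }
}
```

The higher the replication factor, the more replicas a single-replica read can miss, which fits the observation that the problem gets worse above a replication factor of 2.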