The Dulin Report

Ordered Sets and Logs in Cassandra vs SQL

April 8, 2015

I've written before that Cassandra's Achilles' heel is devops:
Storage, redundancy and performance are expanded by adding more nodes. This can happen during normal business hours as long as consistency parameters are met. The same applies to node replacements.

As the number of servers grows, be prepared to hire a devops army or look for a managed solution. DataStax's offering helps, but it is still not enough. Even in the cloud, we have not found a good managed solution. Cassandra.io requires that you give up Thrift and CQL, and Instaclustr, as of this moment, does not use third-generation SSD-backed instance types.

Since I have about two dozen Cassandra nodes to upgrade, plus an application to regression-test, I am not looking forward to that exercise. So, one of the things I am working on is finding out whether we actually need Cassandra at all.

Having eliminated all relational-style queries (which are not a use case for Cassandra) by moving some of the data to PostgreSQL (which, by the way, in some cases outperforms Cassandra by at least a factor of two on larger data sets [1]), I am now looking at some of the more obscure data structures that may be perfect for Cassandra but do not map easily onto SQL.

Frequently Updated Ordered Sets of Tuples


In Cassandra, there is a very convenient mechanism for maintaining an ordered set. You create a column family for a collection of ordered sets, with row key being a set identifier and column name being the name of an element in the set. Since columns are sorted and column names are unique within a row, you get an ordered set. In Cassandra a column family with ordered sets of tuples looks like this:
OrderedSetsCF ->
    SetA -> (X=1),(Y=2),(Z=3)
    SetB -> (Y=10),(Z=4)

In SQL, such a data structure would look like a table where the set identifier and the set element name are parts of the primary key, and the element value is the third column. In this approach, the set identifier is denormalized and repeated for each set element. For large sets this can become a very big table, especially if you are using UUIDs for primary keys (a common technique with Cassandra):
OrderedSetsTable

SetID(PK)  ElementName(PK)  ElementValue
----------------------------------------
SetA       X                1
SetA       Y                2
SetA       Z                3
SetB       Y                10
SetB       Z                4

Note that you can mitigate the problem of primary key denormalization when your keys are much bigger than the universe of your data actually needs (i.e. UUIDs). Create a separate lookup table that maps an integer code onto each UUID, and then use the integer code for the primary keys in the sets table.
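The table plus lookup-table approach can be sketched like this, using SQLite as a stand-in for a relational database. The table names follow the example above; the set_code helper and the UUID strings are hypothetical, not any particular product's API:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Lookup table: maps a wide external key (e.g. a UUID) to a compact integer code.
    CREATE TABLE SetLookup (
        SetCode INTEGER PRIMARY KEY,   -- small surrogate key
        SetUUID TEXT    UNIQUE NOT NULL
    );

    -- The sets table repeats only the small integer code, not the full UUID.
    CREATE TABLE OrderedSetsTable (
        SetCode      INTEGER NOT NULL REFERENCES SetLookup(SetCode),
        ElementName  TEXT    NOT NULL,
        ElementValue INTEGER NOT NULL,
        PRIMARY KEY (SetCode, ElementName)
    );
""")

def set_code(set_uuid):
    """Return the integer code for a set identifier, creating one on first use."""
    conn.execute("INSERT OR IGNORE INTO SetLookup (SetUUID) VALUES (?)", (set_uuid,))
    return conn.execute("SELECT SetCode FROM SetLookup WHERE SetUUID = ?",
                        (set_uuid,)).fetchone()[0]

a = set_code("uuid-for-set-a")  # hypothetical UUID strings
b = set_code("uuid-for-set-b")
conn.executemany("INSERT INTO OrderedSetsTable VALUES (?, ?, ?)",
                 [(a, "X", 1), (a, "Y", 2), (a, "Z", 3), (b, "Y", 10), (b, "Z", 4)])

# Reading a set back in element order mirrors Cassandra's sorted columns.
rows = conn.execute("""SELECT ElementName, ElementValue FROM OrderedSetsTable
                       WHERE SetCode = ? ORDER BY ElementName""", (a,)).fetchall()
print(rows)  # [('X', 1), ('Y', 2), ('Z', 3)]
```

The composite primary key keeps each (set, element) pair unique, and the integer code keeps the denormalized key small.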

Setting the value of a tuple in Cassandra is an O(1) operation, because Cassandra writes into a commit log and a memtable and returns immediately. If you are writing into your set very rapidly, the values may not be readable until noticeably later, especially if you have a high replication factor and your reads do not always hit the same node as your writes. Eventually, over some period of time, your data becomes consistent across the cluster.

Since Cassandra is not ACID, a column does not need to be inserted before you can update it; in Cassandra, a create is effectively the same as an update. No SQL database provides a standard UPSERT operation (the closest are the SQL MERGE statement and non-standard vendor extensions such as MySQL's ON DUPLICATE KEY UPDATE), and there is a lot of academic debate on why.

In SQL, an insert into such a table is an O(log(n)) operation (roughly speaking, depending on the type of index used), where n is the number of (SetID, ElementName) combinations. As the number and size of such sets grows, and if the size of the keys grows, inserts become increasingly slower. Depending on the rate of your data ingestion this may or may not be much of an issue, considering that the upside of using SQL is that your data is immediately readable and you don't have an eventual consistency problem [2].

Since there is no UPSERT in SQL, and even if there were one it would require a read-before-write pattern, updating a value incurs a 2*O(log(n)) performance penalty -- once to look up the existing row and once to update it, depending on how you do this and what your RDBMS query planner does. Again, this is the price you pay for ACID.
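The read-before-write pattern can be sketched as follows, using SQLite as a stand-in. The upsert helper name is hypothetical; both branches traverse the primary key index once, which is where the doubled O(log(n)) cost comes from:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE OrderedSetsTable (
    SetID TEXT, ElementName TEXT, ElementValue INTEGER,
    PRIMARY KEY (SetID, ElementName))""")
conn.execute("INSERT INTO OrderedSetsTable VALUES ('SetA', 'X', 1)")

def upsert(set_id, name, value):
    """Emulated upsert: one index lookup to check existence, a second to write."""
    with conn:  # one transaction, so the check-then-write pair is atomic
        exists = conn.execute(
            "SELECT 1 FROM OrderedSetsTable WHERE SetID = ? AND ElementName = ?",
            (set_id, name)).fetchone()
        if exists:
            conn.execute("UPDATE OrderedSetsTable SET ElementValue = ? "
                         "WHERE SetID = ? AND ElementName = ?",
                         (value, set_id, name))
        else:
            conn.execute("INSERT INTO OrderedSetsTable VALUES (?, ?, ?)",
                         (set_id, name, value))

upsert("SetA", "X", 4)  # update path: two index traversals
upsert("SetA", "W", 7)  # insert path: also two index traversals
```

Wrapping the pair in a transaction is what makes the emulation safe under concurrency; without it, two writers could both take the insert branch.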

Updates and deletes complicate the situation for both Cassandra and an SQL database, but differently. Cassandra uses append-only SSTable files, meaning that all writes are O(1). However, in the background it must perform a compaction, and until that happens you may use more disk space than you actually need, and your reads become orders of magnitude slower because they have to skip over tombstones.

To illustrate the tombstone penalty, let's pretend that value X in SetA above has been updated 3 times and value Y was deleted. Until compaction happens, what is actually stored on disk is this:
OrderedSetsCF ->
    SetA -> (X=1),(Y=2),(Z=3)
         -> (X=4) // (X=1) tombstone
         -> (X=5) // (X=4) tombstone
         -> (X=6) // (X=5) tombstone
         -> {Y=2} // (Y=2) tombstone

Your set of 3 elements now takes up 4 extra units of storage until compaction happens. Furthermore, to get to the latest value of X and to tell that Y has been deleted, Cassandra has to skip over the tombstones. So, what could have been an O(1)+O(log(n)) operation (first getting to SetA using the hash key, then getting to X using the column index) now becomes O(1)+O(log(n))+O(m), where n is the number of columns (elements in your set) and m is the number of times you updated X. If you just did a few hundred thousand updates to the same value over a relatively short period of time, you have created a serious bottleneck for your read operations.
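The arithmetic above can be illustrated with a toy model of an append-only store -- plain Python, not Cassandra's actual storage engine: every write appends a cell, reads must scan past superseded cells, and "compaction" rewrites only the live ones:

```python
from collections import OrderedDict

# Append-only log, like an SSTable before compaction. None marks a delete.
log = []
for op in [("X", 1), ("Y", 2), ("Z", 3), ("X", 4), ("X", 5), ("X", 6), ("Y", None)]:
    log.append(op)  # every write is O(1): just append a new cell

def read(name):
    """Latest value wins; the scan count models skipping over tombstones."""
    scanned, value = 0, None
    for n, v in log:
        if n == name:
            scanned += 1
            value = v
    return value, scanned

def compact(entries):
    """Rewrite keeping only the newest cell per column, dropping deletes."""
    latest = OrderedDict()
    for n, v in entries:
        latest[n] = v
    return [(n, v) for n, v in latest.items() if v is not None]

print(len(log))           # 7 cells on "disk" for a 3-element set
print(read("X"))          # (6, 4): latest value found after scanning 4 cells
print(len(compact(log)))  # 2 live cells after compaction (Y was deleted)
```

Note how the read cost for X grows with every update until compaction runs, which is exactly the O(m) term.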

A Cassandra row compaction itself requires additional storage and significant I/O capacity. Without getting into the intricacies of the mechanism, consider the fact that Cassandra's SSTables are immutable: they are written once and never modified in place. So, in order to compact the SetA row from the above example, Cassandra scans the row, skips over the tombstones, and writes the result into a new SSTable file. Once that file is written, the original fragmented SSTable is deleted. As a result, Cassandra may temporarily double the storage it requires. Furthermore, you have little control over when compaction actually happens.

Now, SQL databases are not immune to the problem of frequently updated data. For example, PostgreSQL's VACUUM operation does exactly that -- a very similar mechanism to what Cassandra does with compactions. There is really no escaping that problem. SQL databases like PostgreSQL may give you better control over storage reclamation than Cassandra, though, and because of the differences between VACUUM and compactions you do not incur the tombstone penalty.

The reason one would want to use SQL for this type of structure is simple: a requirement to perform real-time analytics over your data. It is as simple as that -- are you accumulating rapidly changing data to process it later (in which case Cassandra or DynamoDB is very appropriate), or are you accumulating the same data in order to gain meaningful insights from it in real time (in which case SQL is more appropriate)?
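As a sketch of the real-time analytics case, here is the kind of aggregate query that is trivial in SQL but has no direct equivalent over a Cassandra column family. SQLite stands in for the relational database, and the table layout follows the ordered-sets example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE OrderedSetsTable (
    SetID TEXT, ElementName TEXT, ElementValue INTEGER,
    PRIMARY KEY (SetID, ElementName))""")
conn.executemany("INSERT INTO OrderedSetsTable VALUES (?, ?, ?)",
                 [("SetA", "X", 1), ("SetA", "Y", 2), ("SetA", "Z", 3),
                  ("SetB", "Y", 10), ("SetB", "Z", 4)])

# An ad-hoc aggregation over fresh data: one declarative query, no batch job.
totals = dict(conn.execute(
    "SELECT SetID, SUM(ElementValue) FROM OrderedSetsTable GROUP BY SetID"))
print(totals)  # {'SetA': 6, 'SetB': 14}
```

In Cassandra, the same answer would require reading every row and aggregating client-side, or maintaining precomputed counters.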

Log-style Structures


Let's suppose you want to store events from a multitude of remote sensors. These could be temperature sensors in different locations, market data by ticker symbol, or clicks in apps across thousands or millions of users. Suppose you only want to retain this data for 2 days.

Each event is identified by its source, time (in milliseconds), event type, event id, and some event value.

In a Cassandra column family, one would store it like this: the source is your row key, time+eventType+eventId is your composite column name, and eventValue is your column value. Each column has a TTL of 2 days (expressed in seconds). It would look like this:
EventsCF ->
    SourceA -> (0:TypeA:1, X, ttl=172800), (1:TypeA:2, Y, ttl=172800), (1:TypeB:3, X, ttl=172800)
    SourceB -> (0:TypeB:1, X, ttl=172800), (1:TypeA:2, Y, ttl=172800)

and so on and so forth. In SQL, it would look like this:
EventsTable

source(PK)  time(PK)  eventType(PK)  eventId(PK)  eventValue
------------------------------------------------------------
SourceA     0         TypeA          1            X
SourceA     1         TypeA          2            Y
SourceA     1         TypeB          3            X

and so on. You would need a regularly running batch job to delete rows older than 2 days.
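A minimal sketch of that retention job, again using SQLite as a stand-in; the schema follows the table above, and the timestamps are made up for the example:

```python
import sqlite3

TWO_DAYS_MS = 2 * 24 * 60 * 60 * 1000

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE EventsTable (
    source TEXT, time INTEGER, eventType TEXT, eventId INTEGER, eventValue TEXT,
    PRIMARY KEY (source, time, eventType, eventId))""")

# Six days of synthetic events, one per day, at millisecond timestamps.
now_ms = 5 * 24 * 60 * 60 * 1000  # pretend "now" is the start of day 5
conn.executemany("INSERT INTO EventsTable VALUES (?, ?, ?, ?, ?)",
                 [("SourceA", day * 24 * 60 * 60 * 1000, "TypeA", day, "X")
                  for day in range(6)])

# The periodic retention job: drop everything outside the 2-day window.
with conn:
    conn.execute("DELETE FROM EventsTable WHERE time < ?",
                 (now_ms - TWO_DAYS_MS,))

remaining = conn.execute("SELECT COUNT(*) FROM EventsTable").fetchone()[0]
print(remaining)  # 3 rows left: days 3, 4 and 5
```

Unlike Cassandra's TTL, the DELETE takes effect immediately from the reader's point of view, though the underlying pages are only reclaimed later (by VACUUM in PostgreSQL's case).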

In Cassandra, all writes are O(1). In SQL, all writes are O(log(n)), where n is the number of primary key combinations. As the size of your log grows, so will the time it takes to insert a row. One could mitigate this in SQL by not using a primary key or indexes, but then querying this table would become nearly impossible.

Cassandra has the concept of a TTL on values, meaning that they logically disappear when the TTL is up. However, that does not mean the disk space is reclaimed. This too suffers from the compaction problem, and until compaction happens this data structure may consume an enormous amount of disk space. Suppose you accumulate 1 million log entries per day per source. Five days later, unless compaction has happened, you are actually storing 3 more days of data than you require.

Retrieval of this data out of Cassandra becomes a bit tricky. If you naively assume that by reading the entire row you are only reading the last two days' worth of data, you are wrong. Until compaction happens, Cassandra has to scan over tombstones -- in this example, three days' worth of them! Even if you optimize a bit and use a slice query starting at two days ago, the best you will get out of Cassandra is O(log(n)), where n is the total number of log entries you made in the last five days (until compaction happens).

The disk storage problem is further exacerbated here. Since data with an expired TTL won't actually get deleted until compaction happens, and compaction itself may temporarily double the disk storage requirement, you need to make sure you leave extra space on each node. Furthermore, this type of structure in Cassandra may create an imbalance in the cluster if the amount of data varies a lot between sources.

Cassandra is a clear winner here from the performance perspective if the goal is to collect immense amounts of data, especially if that data never expires. However, in a cloud environment like AWS I'd use Amazon's facilities such as DynamoDB, EMR, or Redshift. Cassandra, as it grows, does become a devops nightmare. Over time you may end up with dozens or hundreds of nodes if you never expire or delete data.

Conclusion


So what am I really getting at here? Well, Cassandra really is a devops nightmare. I know I am going to stir up some debate on Twitter with what I just said. I'd love nothing more than to stop using it. However, it continues to be a useful tool for some of the use cases I deal with, and for all its flaws I have not found a better option yet. As I keep saying, all I want is a Cassandra that is a managed SaaS like DynamoDB, where I don't have to worry about devops.






  1. Yes, I know I need to provide a benchmark. In this post I wanted to spark a conversation; if I find the time, I'll post a benchmark.

  2. This is negated by the use of read replicas, which may experience lag.