
@rbranson
Last active September 29, 2016 17:44
Disaggregated Proxy & Storage Nodes

The point of this is to use cheap machines with small/slow storage to coordinate client requests, while dedicating the machines with big, fast storage to doing what they do best. I found that request coordination was contributing to about half the CPU usage on our Cassandra nodes, on average, and solid state storage is expensive enough to nearly double the cost of typical hardware, so burning those machines' cycles on coordination is a waste. Disaggregating also means that if people have control over hardware placement within the network, they can place proxy nodes closer to the clients without impacting their storage footprint or fault tolerance characteristics.

This is accomplished in Cassandra by passing the -Dcassandra.join_ring=false option when the process is started. These nodes connect to the seeds, cache the gossip data, load the schema, and begin listening for client requests, but they never take ownership of any token ranges, so they store no data and act purely as coordinators. Messages like "/x.x.x.x is now UP!" will appear on the other nodes.
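For reference, a sketch of the two usual ways to pass the flag, assuming the stock bin/cassandra launcher and cassandra-env.sh (paths and file names vary by packaging):

```sh
# One-off: pass the property on the command line; the bin/cassandra
# launcher forwards -D arguments straight to the JVM.
bin/cassandra -Dcassandra.join_ring=false

# Permanent, for nodes meant to be proxy-only: append it to JVM_OPTS
# in conf/cassandra-env.sh (or wherever your packaging builds JVM_OPTS).
JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"
```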

There are also some more practical benefits. Handling client requests is what caused us to push the NewSize of the heap up very high on read-heavy clusters. We observed that ParNew is much more efficient with a large NewSize (16GB) when a node only has proxy duties; handling storage duties drags ParNew down, so for storage duties the NewSize is best reduced to half of that or less. We saw latency benefits from tuning each role separately.

After splitting the proxy and storage duties onto different nodes, ParNew times diverged according to each node's role: on the proxy nodes they were about half what the combined proxy+storage nodes had seen, while on the now-dedicated storage nodes they nearly doubled. Cutting the NewSize in half on the storage nodes brought their ParNew times back in line with the proxy nodes, dropping overall ParNew times by about 40%. I believe this is because ParNew cost scales with the amount of live data surviving the collection. Very little data survives a GC on the proxy nodes, so ParNew is much faster per GB of heap there. And since tracing GCs are more efficient with larger heaps, the NewSize on proxy nodes can be raised to a size that would have caused latency problems under combined proxy+storage duties.
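To make the split concrete, here is roughly what it looks like in cassandra-env.sh, using the NewSize numbers from the paragraphs above. The MAX_HEAP_SIZE values are placeholders, since the gist doesn't state total heap sizes, and cassandra-env.sh typically requires the two settings to be set as a pair:

```sh
# Proxy-only node: a large new generation is fine because almost no data
# survives a ParNew collection on the coordinator path.
MAX_HEAP_SIZE="24G"   # placeholder -- total heap size not stated above
HEAP_NEWSIZE="16G"

# Dedicated storage node: roughly half the proxy NewSize, which brought
# ParNew pause times back in line with the proxy tier.
MAX_HEAP_SIZE="24G"   # placeholder
HEAP_NEWSIZE="8G"
```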

The split has also eliminated the latency spikes associated with the start of streaming operations, and significantly reduced the spikes from rolling restarts, node repairs, and "disruptive compactions." Handling client requests makes it difficult for storage nodes to recover from request storms or general badness: if some kind of cluster-wide event triggers the clients to "DDoS" the cluster, the problem should get bottlenecked at the proxy tier, leaving the storage nodes more room to recover. We've seen this kind of behavior during major network issues and expect the disaggregation to improve recovery times.


mihasya commented May 6, 2015

This is what was formerly known as "fat clients," right? Fucking great that it's now this easy; it used to be a huge pain. I have long wanted to do the exact same thing for all the reasons you outline - letting dumb little nodes with shitty storage do all the CPU work, moving that work out of the GC equation.

👍 thanks for taking the time to write this


akonkol commented Sep 29, 2016

Was very cool to come across this gist. I've been trying to separate parts of the read path from our main nodes using this join_ring:false idea. The problem I'm running into is pinning the driver to a set of proxy nodes (using a whitelist policy with round-robin). If I pin the driver to a single proxy node it works; if I add more proxy nodes to the whitelist I get:

[main] WARN com.datastax.driver.core.ControlConnection - Found invalid row in system.peers: [peer=/172.25.69.166, tokens=null]. This is likely a gossip or snitch issue, this host will be ignored.

Cannot use HostFilterPolicy where the filter allows none of the contacts points ([node02we2/172.25.69.166:9042, node01we2/172.25.66.160:9042, node03we2/172.25.66.163:9042])

How are you configuring your client driver to talk to only proxy nodes?
