@wrobstory
Last active September 28, 2015 21:56
Interesting Things About Kudu
  • It supports real primary key constraints, unlike Google BigQuery or Amazon Redshift. Redshift lets you declare primary key constraints but does not enforce them; it only uses them in the query planner. If the key values are not actually unique, Redshift can return incorrect DISTINCT results.
  • There are no multi-row transactions. 1 mutation = 1 transaction.
  • Reads are scans, unless you're doing something like an equality predicate on a primary key. From @toddlipcon:

...if you put an equality predicate on the primary key, it doesn't actually "scan" data, it just goes to the correct row. One of our community contributors has been working on a Get API to make it a bit easier to do random reads (and will go through a more optimized code path on the backend).

  • Two types of predicates: equality (column value == scalar) and ranges.
  • User-defined partitioning schemes route requests, with a lot of flexibility in how a table is partitioned.
  • The Kudu team made some small improvements to the Raft algorithm
  • Storage layout is decoupled from higher level APIs (yay!). I recently talked about this! https://github.com/wrobstory/ds4ds_2015
  • VACUUM-like flushes from in-memory MemRowSets to DiskRowSets are automatically managed.
  • MemRowSets are concurrent B-trees with optimistic locking
  • DiskRowSets are sorted
  • DiskRowSets support dictionary, bitshuffle, and front coding encodings, in addition to LZ4/gzip/bzip2 compression.
  • DiskRowSets are considered immutable once encoded, so updates go to a delta store, similar to many other columnar database systems.
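
To make the storage notes above a bit more concrete, here is a toy Python sketch of the ideas: an enforced primary key, an in-memory row set that flushes into sorted row sets, a delta store for updates to flushed rows, a point lookup for a primary key equality predicate, and a scan for a range predicate. This is not Kudu's implementation or client API. `ToyTablet`, `flush_threshold`, and every method name are hypothetical, a plain dict stands in for the concurrent B-tree, and the half-open range semantics (`lower <= value < upper`) are an assumption made for the example.

```python
import bisect


class ToyTablet:
    """Toy model of the notes above. NOT Kudu's implementation or API."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical knob
        self.mem_row_set = {}    # primary key -> row (mutable, in memory)
        self.disk_row_sets = []  # list of (sorted_keys, base_rows, deltas)

    def _find(self, key):
        """Locate key in a flushed row set by binary search, or return None."""
        for i, (keys, _rows, _deltas) in enumerate(self.disk_row_sets):
            pos = bisect.bisect_left(keys, key)
            if pos < len(keys) and keys[pos] == key:
                return i, pos
        return None

    def insert(self, key, row):
        # Enforce the primary key constraint: duplicates are rejected.
        if key in self.mem_row_set or self._find(key) is not None:
            raise KeyError(f"duplicate primary key: {key!r}")
        self.mem_row_set[key] = dict(row)
        if len(self.mem_row_set) >= self.flush_threshold:
            self.flush()  # managed automatically, no manual VACUUM step

    def flush(self):
        # Freeze in-memory rows into a row set sorted by primary key.
        keys = tuple(sorted(self.mem_row_set))
        rows = tuple(self.mem_row_set[k] for k in keys)
        self.disk_row_sets.append((keys, rows, {}))
        self.mem_row_set = {}

    def update(self, key, changes):
        if key in self.mem_row_set:
            self.mem_row_set[key].update(changes)  # still mutable in memory
            return
        found = self._find(key)
        if found is None:
            raise KeyError(key)
        # Flushed base rows are never rewritten; record a delta instead.
        self.disk_row_sets[found[0]][2].setdefault(key, {}).update(changes)

    def get(self, key):
        """Equality predicate on the primary key: seek to the row, no scan."""
        if key in self.mem_row_set:
            return dict(self.mem_row_set[key])
        found = self._find(key)
        if found is None:
            return None
        i, pos = found
        _keys, rows, deltas = self.disk_row_sets[i]
        return {**rows[pos], **deltas.get(key, {})}

    def _all_rows(self):
        """Yield (key, row) for every row, with deltas applied to base rows."""
        for keys, rows, deltas in self.disk_row_sets:
            for key, base in zip(keys, rows):
                yield key, {**base, **deltas.get(key, {})}
        for key, row in self.mem_row_set.items():
            yield key, dict(row)

    def scan(self, column, lower=None, upper=None):
        """Range predicate on any column (assumed lower <= value < upper)."""
        return [
            (key, row)
            for key, row in self._all_rows()
            if (lower is None or row[column] >= lower)
            and (upper is None or row[column] < upper)
        ]
```

In this sketch, `get(key)` only binary-searches the sorted keys of each flushed row set, while `scan(...)` visits every row, which mirrors the "reads are scans, unless you use an equality predicate on the primary key" point in spirit only.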
@toddlipcon

Hey, thanks for writing up these notes. Todd from the Kudu team here. A couple corrections (is it possible to pull-request a gist? :) )

  • Reads are scans - true, but if you put an equality predicate on the primary key, it doesn't actually "scan" data, it just goes to the correct row. One of our community contributors has been working on a Get API to make it a bit easier to do random reads (and will go through a more optimized code path on the backend). Hope this merges soon.
  • on predicates, we also support ranges on non-primary key columns
