Rob Story wrobstory

PostgreSQL Data Types	AWS DMS Data Types	Redshift Data Types
INTEGER	INT4	INT4
SMALLINT	INT2	INT2
BIGINT	INT8	INT8
NUMERIC (p,s)	If precision is 39 or greater, then use STRING.	If the scale is >= 0 and <= 37, then: NUMERIC (p,s). If the scale is >= 38 and <= 127, then: VARCHAR (Length).
DECIMAL (p,s)	If precision is 39 or greater, then use STRING.	If the scale is >= 0 and <= 37, then: NUMERIC (p,s). If the scale is >= 38 and <= 127, then: VARCHAR (Length).
REAL	REAL4	FLOAT4
DOUBLE	REAL8	FLOAT8
SMALLSERIAL	INT2	INT2
SERIAL	INT4	INT4
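
The mapping above can be sketched as a small lookup, which is handy when checking a target schema by hand. This is only an illustration of the PostgreSQL-to-Redshift columns of the table: `SIMPLE_TYPES` and `redshift_type` are hypothetical names, not part of AWS DMS, and since the table does not say how the VARCHAR length is chosen, the sketch returns a bare `VARCHAR`.

```python
# Fixed mappings copied from the table above (PostgreSQL -> Redshift).
SIMPLE_TYPES = {
    "INTEGER": "INT4",
    "SMALLINT": "INT2",
    "BIGINT": "INT8",
    "REAL": "FLOAT4",
    "DOUBLE": "FLOAT8",
    "SMALLSERIAL": "INT2",
    "SERIAL": "INT4",
}


def redshift_type(pg_type, precision=None, scale=None):
    """Return the Redshift type the table lists for a PostgreSQL type.

    Hypothetical helper for illustration; it is not an AWS DMS API.
    """
    name = pg_type.upper()
    if name in SIMPLE_TYPES:
        return SIMPLE_TYPES[name]
    if name in ("NUMERIC", "DECIMAL"):
        if precision is None or scale is None:
            raise ValueError("NUMERIC/DECIMAL needs both precision and scale")
        if 0 <= scale <= 37:
            return f"NUMERIC({precision},{scale})"
        if 38 <= scale <= 127:
            # The table says VARCHAR (Length) but not how Length is derived,
            # so no length is filled in here.
            return "VARCHAR"
        raise ValueError(f"scale {scale} is outside the ranges in the table")
    raise KeyError(f"{pg_type} is not listed in the table")
```

For example, `redshift_type("NUMERIC", 10, 2)` follows the first NUMERIC rule and `redshift_type("DECIMAL", 100, 40)` follows the second.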

You're the first data engineer and find yourself in the following scenario:

Your company has three user-facing clients: Web, iOS, and Android. Your data science team is interested in analyzing the following data:

Support messages
Client interactions (clicks, touches, how they move through the app, etc.)

The data scientists need to be able to join these two data streams together on a common user_id to perform their analysis. Currently the support messages are going to a service owned by the backend team; they go through standard HTTP endpoints and are getting written to PostgreSQL. You're going to be responsible for the service receiving the client interactions.

Q1: Knowing that you're going to be in charge of getting this to some sort of data store downstream, what would your schemas look like? The only hard requirement is that support messages must have the message body, and client interactions have to have event and target fields to represent actions like click on login button and t
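
One possible shape for the two schemas in Q1, as a minimal sketch rather than a reference answer. Only `user_id`, the message body, and the `event` and `target` fields come from the scenario; the `client`, `received_at`, and `occurred_at` fields and the `join_on_user_id` helper are assumptions added for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SupportMessage:
    user_id: str            # shared join key
    body: str               # required by the scenario
    received_at: datetime   # assumed field


@dataclass(frozen=True)
class ClientInteraction:
    user_id: str            # shared join key
    client: str             # assumed field: "web", "ios", or "android"
    event: str              # required, e.g. "click"
    target: str             # required, e.g. "login_button"
    occurred_at: datetime   # assumed field


def join_on_user_id(messages, interactions):
    """Pair every support message with every interaction by the same user."""
    by_user = defaultdict(list)
    for interaction in interactions:
        by_user[interaction.user_id].append(interaction)
    return [
        (message, interaction)
        for message in messages
        for interaction in by_user.get(message.user_id, [])
    ]
```

Keeping `user_id` under the same name and type in both records is what makes the downstream join straightforward, whichever data store ends up holding them.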

Always include the timestamp when a field was written
If Elasticsearch drops an index, it will keep writing data dynamically. This is very bad.
Using one library for critical API logic (like reading from Kafka) lets you update all of your various consuming services with a version bump.
ALWAYS use the ESCAPE option when unloading from Redshift.
Immutable append-only tables always and forever. It's so hard to reason about tables with updates.
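
The timestamp and append-only points above can be sketched together: rows are only ever appended, each carries the time it was written, and the "current" value is derived by reading the log. The in-memory list, the `key`/`value`/`written_at` field names, and both function names are assumptions for illustration, not a real table API.

```python
from datetime import datetime, timezone


def append_row(table, key, value, written_at=None):
    """Append a new row; existing rows are never updated or deleted."""
    row = {
        "key": key,
        "value": value,
        # Always record when the row was written.
        "written_at": written_at or datetime.now(timezone.utc),
    }
    table.append(row)
    return row


def latest_by_key(table):
    """Derive the current state: the most recently written row per key."""
    latest = {}
    for row in table:
        current = latest.get(row["key"])
        if current is None or row["written_at"] >= current["written_at"]:
            latest[row["key"]] = row
    return latest
```

Because nothing is overwritten, the full history stays available and the state as of any earlier time can be recomputed by filtering on `written_at`.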

Kudu supports real primary key constraints, unlike Google BigQuery or Amazon Redshift. Redshift lets you specify primary key constraints, but only uses them in the query planner. If your row values are not actually unique, Redshift will give you incorrect DISTINCT results.
There are no multi-row transactions. 1 mutation = 1 transaction.
Reads are scans, unless you're doing something like an equality predicate on a primary key. From @toddlipcon:

...if you put an equality predicate on the primary key, it doesn't actually "scan" data, it just goes to the correct row. One of our community contributors has been working on a Get API to make it a bit easier to do random reads (and will go through a more optimized code path on the backend).

Two types of predicates: equality (column value == scalar) and ranges.
User-defined partitioning schemes for request routing, with lots of flexibility in partitioning schemes.
The Kudu team made some small improvements to the Raft algorithm
Stor

	.DISCUSSION...Today through next Tuesday...Well, we mentioned a
	nonzero chance of lowland snow in the last few discussions, and that
	appears to be coming to fruition in what will be an extremely
	challenging forecast for the lowlands north of about Salem. The
	addition of high resolution guidance has significantly increased the
	probabilities of snow accumulation for these areas, including for the
	greater Portland and Vancouver metro area. Several inches of snow are
	likely in the Columbia River Gorge east of Multnomah Falls, with over
	a foot likely for the Cascades and upper portions of the Hood River
	Valley by the time snow diminishes late Thursday or early Friday.

	fn into_result(input: &i32) -> Result<&i32, String> {
	    Ok(input)
	}

	fn main() {
	    let numbers: Vec<i32> = vec![1, 2, 3, 4, 5];
	    let mapper = numbers.iter().map(|x| into_result(x));
	    let vector_of_results = mapper.collect::<Vec<Result<&i32, String>>>();
	    println!("{:?}", vector_of_results);
	    // [Ok(1), Ok(2), Ok(3), Ok(4), Ok(5)]
	}

	In [1]: %paste
	from sklearn.datasets import load_iris
	from sklearn import tree
	import coremltools

	iris = load_iris()
	clf = tree.DecisionTreeClassifier()
	clf = clf.fit(iris.data, iris.target)

	## -- End pasted text --

	curl -XPUT 'http://localhost:9200/test_index_1/dates/1?pretty' -d '{"when_received": "2016-04-25T13:21:24.000Z"}'
	curl -XPUT 'http://localhost:9200/test_index_1/dates/2?pretty' -d '{"when_received": "2016-05-28T14:21:24.000Z"}'
	curl -XPUT 'http://localhost:9200/test_index_1/dates/3?pretty' -d '{"when_received": "2016-06-28T17:21:24.000Z"}'
	curl -XPUT 'http://localhost:9200/test_index_1/dates/4?pretty' -d '{"when_received": "2016-06-29T17:21:24.000Z"}'

	curl -XPUT 'http://localhost:9200/test_index_2/dates/1?pretty' -d '{"when_recorded": "2016-04-25T13:21:24.000Z", "when_received": "2015-04-25T13:21:24.000Z"}'
	curl -XPUT 'http://localhost:9200/test_index_2/dates/2?pretty' -d '{"when_recorded": "2016-05-28T14:21:24.000Z", "when_received": "2015-05-28T14:21:24.000Z"}'
	curl -XPUT 'http://localhost:9200/test_index_2/dates/3?pretty' -d '{"when_recorded": "2016-06-28T17:21:24.000Z", "when_received": "2015-06-28T17:21:24.000Z"}'
	curl -XPUT 'http://localhost:9200/test_index_2/dates/4?pretty' -d '{"when_recorded": "2016

	{:correlation_id 12345 :_id "abcde" :when_recorded "2015-01-01"}
	{:correlation_id 12345 :_id "fjhij" :when_recorded "2015-01-02"}
	{:correlation_id 12345 :_id "klmno" :when_recorded "2015-01-03"}
	{:correlation_id 12345 :_id "pqrst" :when_recorded "2015-01-04"}
	{:correlation_id 12345 :_id "uvwxy" :when_recorded "2015-01-05"}

	class Pats(object):

	    @staticmethod
	    def suck():
	        return True

	assert Pats.suck() == True