Rob Story wrobstory

@wrobstory
wrobstory / forecast.txt
Created February 14, 2024 15:17
Feb 14 2024 Forecast
.DISCUSSION...Today through next Tuesday...Well, we mentioned a
nonzero chance of lowland snow in the last few discussions, and that
appears to be coming to fruition in what will be an extremely
challenging forecast for the lowlands north of about Salem. The
addition of high resolution guidance has significantly increased the
probabilities of snow accumulation for these areas, including for the
greater Portland and Vancouver metro area. Several inches of snow are
likely in the Columbia River Gorge east of Multnomah Falls, with over
a foot likely for the Cascades and upper portions of the Hood River
Valley by the time snow diminishes late Thursday or early Friday.
wrobstory / traverse.rs
Last active April 26, 2020 21:16
Rust Traverse
fn into_result(input: &i32) -> Result<&i32, String> {
    Ok(input)
}

fn main() {
    let numbers: Vec<i32> = vec![1, 2, 3, 4, 5];
    let mapper = numbers.iter().map(|x| into_result(x));
    let vector_of_results = mapper.collect::<Vec<Result<&i32, String>>>();
    println!("{:?}", vector_of_results);
    // [Ok(1), Ok(2), Ok(3), Ok(4), Ok(5)]
wrobstory / coreml.py
Created June 6, 2017 17:11
CoreML Serialization
In [1]: %paste
from sklearn.datasets import load_iris
from sklearn import tree
import coremltools
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
## -- End pasted text --
PostgreSQL Data Types | AWS DMS Data Types | Redshift Data Types
INTEGER | INT4 | INT4
SMALLINT | INT2 | INT2
BIGINT | INT8 | INT8
NUMERIC (p,s) | If precision is 39 or greater, then use STRING. | If the scale is >= 0 and <= 37, then NUMERIC (p,s). If the scale is >= 38 and <= 127, then VARCHAR (Length).
DECIMAL (p,s) | If precision is 39 or greater, then use STRING. | If the scale is >= 0 and <= 37, then NUMERIC (p,s). If the scale is >= 38 and <= 127, then VARCHAR (Length).
REAL | REAL4 | FLOAT4
DOUBLE | REAL8 | FLOAT8
SMALLSERIAL | INT2 | INT2
SERIAL | INT4 | INT4
wrobstory / dataeng.md
Last active September 24, 2023 16:14
Data Engineering Problem

You're the first data engineer and find yourself with the following scenario:

Your company has three user-facing clients: Web, iOS, and Android. Your data science team is interested in analyzing the following data:

  1. Support messages
  2. Client interactions (clicks, touches, how they move through the app, etc)

The data scientists need to be able to join these two data streams together on a common user_id to perform their analysis. Currently the support messages are going to a service owned by the backend team; they go through standard HTTP endpoints and are getting written to PostgreSQL. You're going to be responsible for the service receiving the client interactions.

Q1: Knowing that you're going to be in charge of getting this to some sort of data store downstream, what would your schemas look like? The only hard requirement is that support messages must have the message body, and client interactions have to have event and target fields to represent actions like click on login button and t
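
One possible starting point for Q1, sketched in Python. Only `user_id`, the message body, and the `event` and `target` fields come from the problem statement. Every other field name, and the `join_on_user_id` helper, are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SupportMessage:
    user_id: str           # join key shared by both streams
    body: str              # hard requirement from the problem statement
    created_at: datetime   # assumed field


@dataclass(frozen=True)
class ClientInteraction:
    user_id: str           # join key shared by both streams
    event: str             # hard requirement, e.g. "click"
    target: str            # hard requirement, e.g. "login_button"
    client: str            # assumed field: "web", "ios", or "android"
    occurred_at: datetime  # assumed field


def join_on_user_id(
    messages: List[SupportMessage],
    interactions: List[ClientInteraction],
) -> List[Tuple[SupportMessage, ClientInteraction]]:
    """Inner join of the two streams on user_id (hypothetical helper)."""
    by_user: Dict[str, List[ClientInteraction]] = defaultdict(list)
    for interaction in interactions:
        by_user[interaction.user_id].append(interaction)
    return [
        (message, interaction)
        for message in messages
        for interaction in by_user.get(message.user_id, [])
    ]


messages = [SupportMessage("u1", "I can't log in", datetime(2023, 9, 1, 12, 0))]
interactions = [
    ClientInteraction("u1", "click", "login_button", "ios", datetime(2023, 9, 1, 11, 58)),
    ClientInteraction("u2", "touch", "signup_button", "android", datetime(2023, 9, 1, 11, 59)),
]
joined = join_on_user_id(messages, interactions)
```

Keeping `user_id` in the same type and format in both schemas is what makes the downstream join cheap, whichever data store ends up holding the two streams.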

wrobstory / esbug.sh
Created June 20, 2016 23:56
Elasticsearch Bug
curl -XPUT 'http://localhost:9200/test_index_1/dates/1?pretty' -d '{"when_received": "2016-04-25T13:21:24.000Z"}'
curl -XPUT 'http://localhost:9200/test_index_1/dates/2?pretty' -d '{"when_received": "2016-05-28T14:21:24.000Z"}'
curl -XPUT 'http://localhost:9200/test_index_1/dates/3?pretty' -d '{"when_received": "2016-06-28T17:21:24.000Z"}'
curl -XPUT 'http://localhost:9200/test_index_1/dates/4?pretty' -d '{"when_received": "2016-06-29T17:21:24.000Z"}'
curl -XPUT 'http://localhost:9200/test_index_2/dates/1?pretty' -d '{"when_recorded": "2016-04-25T13:21:24.000Z", "when_received": "2015-04-25T13:21:24.000Z"}'
curl -XPUT 'http://localhost:9200/test_index_2/dates/2?pretty' -d '{"when_recorded": "2016-05-28T14:21:24.000Z", "when_received": "2015-05-28T14:21:24.000Z"}'
curl -XPUT 'http://localhost:9200/test_index_2/dates/3?pretty' -d '{"when_recorded": "2016-06-28T17:21:24.000Z", "when_received": "2015-06-28T17:21:24.000Z"}'
curl -XPUT 'http://localhost:9200/test_index_2/dates/4?pretty' -d '{"when_recorded": "2016
wrobstory / lessons.md
Last active July 18, 2016 22:57
Lessons Learned
  • Always include the timestamp when a field was written
  • If an Elasticsearch index is dropped, incoming writes will silently recreate it with dynamically inferred mappings. This is very bad.
  • Using one library for critical API logic (like reading from Kafka) lets you update all of your various consuming services with a version bump.
  • ALWAYS use the ESCAPE option when unloading from Redshift.
  • Immutable append-only tables always and forever. It's so hard to reason about tables with updates.
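
The first and last lessons combine into one pattern: instead of updating a row in place, append a new version stamped with the time it was written, and derive the current state at read time. A minimal in-memory sketch in Python follows. The names `Row`, `AppendOnlyTable`, and `written_at` are hypothetical and not taken from any particular library or from the original notes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Row:
    """One immutable version of a record (hypothetical layout)."""
    key: str
    value: Any
    written_at: datetime  # always record when the value was written


class AppendOnlyTable:
    """Illustrative table: rows are appended, never updated or deleted."""

    def __init__(self) -> None:
        self._rows: List[Row] = []

    def append(self, key: str, value: Any,
               written_at: Optional[datetime] = None) -> Row:
        # Default to the current UTC time when no timestamp is supplied.
        row = Row(key, value, written_at or datetime.now(timezone.utc))
        self._rows.append(row)
        return row

    def history(self, key: str) -> List[Row]:
        """Every version ever written for a key, oldest first."""
        versions = [row for row in self._rows if row.key == key]
        return sorted(versions, key=lambda row: row.written_at)

    def latest(self) -> Dict[str, Any]:
        """Current state, derived at read time from the newest version per key."""
        newest: Dict[str, Row] = {}
        for row in self._rows:
            seen = newest.get(row.key)
            if seen is None or row.written_at >= seen.written_at:
                newest[row.key] = row
        return {key: row.value for key, row in newest.items()}


table = AppendOnlyTable()
table.append("user-1", "free", datetime(2016, 7, 1, tzinfo=timezone.utc))
table.append("user-1", "paid", datetime(2016, 7, 18, tzinfo=timezone.utc))
current = table.latest()
```

Because nothing is overwritten, `history("user-1")` still returns both versions, which makes it possible to answer "what did this row look like last week?" without guessing.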
wrobstory / docs.txt
Created December 4, 2015 23:40
ES Window fn
{:correlation_id 12345 :_id "abcde" :when_recorded "2015-01-01"}
{:correlation_id 12345 :_id "fjhij" :when_recorded "2015-01-02"}
{:correlation_id 12345 :_id "klmno" :when_recorded "2015-01-03"}
{:correlation_id 12345 :_id "pqrst" :when_recorded "2015-01-04"}
{:correlation_id 12345 :_id "uvwxy" :when_recorded "2015-01-05"}
wrobstory / 4thdown.py
Created October 2, 2015 02:27
4thDownBotTest
class Pats(object):
    @staticmethod
    def suck():
        return True

assert Pats.suck() == True
wrobstory / kudu.md
Last active September 28, 2015 21:56
Interesting Things About Kudu
  • It supports real primary key constraints, unlike Google BigQuery or Amazon Redshift. Redshift lets you declare primary key constraints, but it does not enforce them and only uses them in the query planner. If the key values are not actually unique, Redshift can return incorrect results for DISTINCT queries.
  • There are no multi-row transactions. 1 mutation = 1 transaction.
  • Reads are scans, unless you're doing something like an equality predicate on a primary key. From @toddlipcon:

...if you put an equality predicate on the primary key, it doesn't actually "scan" data, it just goes to the correct row. One of our community contributors has been working on a Get API to make it a bit easier to do random reads (and will go through a more optimized code path on the backend).

  • Two types of predicates: Equality (col value == scalar) and ranges
  • User-defined partitioning schemes for request routing, with a lot of flexibility in how they are defined.
  • The Kudu team made some small improvements to the Raft algorithm
  • Stor