Data Engineering Problem

You're the first data engineer and find yourself in the following scenario:

Your company has three user-facing clients: Web, iOS, and Android. Your data science team is interested in analyzing the following data:

  1. Support messages
  2. Client interactions (clicks, touches, how they move through the app, etc.)

The data scientists need to be able to join these two data streams together on a common user_id to perform their analysis. Currently the support messages go to a service owned by the backend team: they come in through standard HTTP endpoints and get written to PostgreSQL. You're going to be responsible for the service receiving the client interactions.

Q1: Knowing that you're going to be in charge of getting this data to some sort of data store downstream, what would your schemas look like? The only hard requirements are that support messages must include the message body, and client interactions must include event and target fields to represent actions like clicking the login button.

What you're looking for in their specification:

  • Client type
  • user_id
  • Timestamp (what format? why?)
    • Looking for discussion on UTC and ISO8601.
  • Some discussion of serialization format. Do they recommend JSON? Why? Avro/Protobuf? Why? There isn't a wrong answer here, but they should be aware of some of the tradeoffs. (A schema sketch follows this list.)
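One way to capture those fields, sketched as Python dataclasses. The field names, the `ClientType` enum, and the choice to carry timestamps as ISO 8601 strings are illustrative assumptions, not part of the prompt:

```python
from dataclasses import dataclass
from enum import Enum


class ClientType(str, Enum):
    WEB = "web"
    IOS = "ios"
    ANDROID = "android"


@dataclass
class SupportMessage:
    user_id: str
    client_type: ClientType
    sent_at: str      # UTC, ISO 8601, e.g. "2023-09-24T16:14:00Z"
    body: str         # hard requirement: the message body


@dataclass
class ClientInteraction:
    user_id: str
    client_type: ClientType
    occurred_at: str  # UTC, ISO 8601; set by the client, so treat the clock with suspicion
    event: str        # e.g. "click"
    target: str       # e.g. "login_button"
```

Whether these travel over the wire as JSON, Avro, or Protobuf is exactly the tradeoff above: JSON is easy to inspect and evolve informally, while Avro/Protobuf buy schema enforcement and smaller payloads.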

Q2: Now that we have our schemas, let's look downstream: the backend service receives support messages and immediately writes them to Postgres, while the client interactions still need API endpoints. Both streams need to end up in a datastore where analysts can query and join them. It's OK to drop a certain percentage of client events, but support messages cannot be dropped.

Discuss/write up/diagram your ideas for an architecture to make this happen. There are no limits on how you can approach this; if you want to write everything to Postgres and let analysts log into the production database, that's a solution!

Again, there's no right answer here; there are lots of ways to achieve this. What you're looking for them to discuss:

  • What kind of volume are we seeing for each data stream? You'll need to be able to give them a number.
    • As the interviewer, it will be more interesting to assume 5 msgs/s for the support messages and 5,000 msgs/s for the client messages. Those two volumes dictate very different architectures.
  • How to get data out of the backend service (a publish-after-commit sketch follows this list)
    • Put the data on a message queue before/after (why?) persisting to Postgres?
    • Write the data directly to a data warehouse before/after (why?) persisting to Postgres?
    • Replicate the data out of Postgres to a message queue?
    • Batch write data from Postgres to something else each night?
    • What are the error handling/failure scenarios for the above?
    • What message queues would you use? Why?
    • Give analysts access to Postgres directly?
  • How to build an API for the client messages (an endpoint sketch follows this list)
    • 5,000 msgs/s is a pretty good load. How would they address that?
      • Multiple instances?
      • Load balancer?
    • What happens once we have the message?
      • Do we need to persist it to a local DB? Why?
      • Put it on a queue?
  • The downstream data store
    • Did they choose to just keep their data in local Postgres instances? Why?
      • If so, do analysts have prod data access? Is this a good thing?
    • If using a data warehouse like Redshift, etc, how are they getting the data there?
      • Something needs to read from the queue and get it into the DB (a loader sketch follows this list)
        • Does the data schema need to change? (did they mention a processing-time timestamp?)
      • Are they doing batch writes? Individual writes?
      • Error handling and failure cases?
      • Which datastore did they choose and why?
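One of the options above, "put the data on a message queue after persisting to Postgres", might look like this minimal sketch, assuming psycopg2 and kafka-python; the DSN, table, and topic names are hypothetical. The comments call out the failure mode it leaves open:

```python
import json

import psycopg2
from kafka import KafkaProducer

# Hypothetical DSN, broker, topic, and table names.
conn = psycopg2.connect("dbname=support user=backend")
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def save_support_message(msg: dict) -> None:
    # Persist first: support messages must never be dropped.
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO support_messages (user_id, client_type, sent_at, body) "
                "VALUES (%s, %s, %s, %s)",
                (msg["user_id"], msg["client_type"], msg["sent_at"], msg["body"]),
            )
    # Then publish, blocking until the broker acknowledges the write.
    # If this fails after the commit, the message is in Postgres but not on
    # the queue -- which is why the list above also asks about replicating
    # out of Postgres (CDC) or nightly batch copies as alternatives.
    producer.send("support_messages", msg).get(timeout=10)
```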
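For the client-interaction endpoint, a minimal sketch of the "validate it, then put it on a queue" shape, assuming FastAPI, pydantic, and kafka-python; the route, topic, and broker address are hypothetical:

```python
import json

from fastapi import FastAPI
from kafka import KafkaProducer
from pydantic import BaseModel

app = FastAPI()

# Hypothetical broker and topic; in practice these come from config.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


class ClientInteraction(BaseModel):
    user_id: str
    client_type: str
    occurred_at: str  # UTC, ISO 8601
    event: str
    target: str


@app.post("/v1/interactions", status_code=202)
def ingest(interaction: ClientInteraction):
    # Fire-and-forget: the prompt allows dropping some client events,
    # so the request doesn't block on the broker acknowledging the write.
    producer.send("client_interactions", interaction.dict())
    return {"status": "accepted"}
```

Because the handler is stateless, you can run many instances behind a load balancer to absorb the 5,000 msgs/s; the queue is the durable hand-off.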
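For the downstream store, a sketch of the "something reads from the queue and gets it into the DB" piece, assuming kafka-python and psycopg2 against a Postgres-compatible warehouse; the topic, group, table, and batch size are assumptions. It adds the processing-time timestamp mentioned above and does batched writes:

```python
import json
from datetime import datetime, timezone

import psycopg2
from psycopg2.extras import execute_values
from kafka import KafkaConsumer

# Hypothetical topic, group, DSN, and table names.
consumer = KafkaConsumer(
    "client_interactions",
    bootstrap_servers="kafka:9092",
    group_id="warehouse-loader",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,
)
warehouse = psycopg2.connect("dbname=warehouse user=loader")

BATCH_SIZE = 500
batch = []

for record in consumer:
    evt = record.value
    # Processing-time timestamp, added alongside the client-supplied one.
    processed_at = datetime.now(timezone.utc).isoformat()
    batch.append(
        (evt["user_id"], evt["client_type"], evt["occurred_at"],
         evt["event"], evt["target"], processed_at)
    )
    if len(batch) >= BATCH_SIZE:
        with warehouse, warehouse.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO client_interactions "
                "(user_id, client_type, occurred_at, event, target, processed_at) "
                "VALUES %s",
                batch,
            )
        consumer.commit()  # acknowledge only after the batch is safely written
        batch = []
```

For Redshift specifically, the more common pattern is to stage larger batches to S3 and load them with COPY, but the shape is the same: read, enrich, batch, write, then commit offsets.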