You're the first data engineer and find yourself in the following scenario:
Your company has three user-facing clients: Web, iOS, and Android. Your data science team is interested in analyzing the following data:
- Support messages
- Client interactions (clicks, touches, how users move through the app, etc.)
The data scientists need to be able to join these two data streams together on a common user_id
to perform their analysis. Currently, support messages go to a service owned by the backend team: they arrive via standard HTTP endpoints and are written to PostgreSQL. You're going to be responsible for the service receiving the client interactions.
Q1: Knowing that you're going to be in charge of getting this to some sort of data store downstream, what would your schemas look like? The only hard requirements are that support messages must have the message body, and client interactions must have `event` and `target` fields to represent actions like `click` on `login button` and the like.
What you're looking for in their specification:
- Client type
- user_id
- Timestamp (what format? why?)
  - Looking for discussion of UTC and ISO 8601.
- Some discussion of serialization format. Do they recommend JSON? Why? Avro/Protobuf? Why? There isn't a wrong answer here, but they should be aware of some of the tradeoffs.
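One reasonable shape for the two schemas (a sketch only — every field beyond the hard requirements is an assumption the candidate would justify, and the JSON serialization is just one of the acceptable choices):

```python
from datetime import datetime, timezone
import json

def utc_now_iso():
    # UTC + ISO 8601 avoids timezone ambiguity when analysts later
    # join the two streams on time ranges.
    return datetime.now(timezone.utc).isoformat()

# Hypothetical support-message schema: only "body" is required by the
# prompt; user_id and timestamp are needed for the downstream join.
support_message = {
    "user_id": "u-12345",
    "timestamp": utc_now_iso(),  # ISO 8601, UTC
    "body": "I can't log in on my phone",
}

# Hypothetical client-interaction schema: "event" and "target" are the
# required fields; client_type distinguishes Web/iOS/Android.
client_interaction = {
    "user_id": "u-12345",
    "client_type": "ios",        # web | ios | android
    "timestamp": utc_now_iso(),
    "event": "click",
    "target": "login_button",
}

print(json.dumps(client_interaction))
```

JSON keeps the example readable; a candidate arguing for Avro or Protobuf instead (smaller payloads, enforced schemas) is making an equally valid tradeoff.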
Q2: Now that we have our schemas, let's look downstream: the backend service receives support messages, which are processed and immediately written to Postgres. The client interactions need API endpoints. Both streams need to end up in a datastore where analysts can query and join them together. It's OK to drop a certain percentage of client events, but support messages cannot be dropped.
Discuss/write up/diagram your ideas for an architecture to make this happen. There are no limits on how you can approach this; if you want to write everything to Postgres and let analysts log into the production database, that's a solution!
Again, no right answer here, lots of ways to achieve this. What you're looking for them to discuss:
- What kind of volume are we seeing for each data stream? You'll need to be able to give them a number.
  - As the interviewer, it will be more interesting to assume 5 msgs/s for support messages and 5,000 msgs/s for client messages. Those two volumes dictate very different architectures.
- How to get data out of the backend service
  - Put the data on a message queue before/after (why?) persisting to Postgres?
  - Write the data directly to a data warehouse before/after (why?) persisting to Postgres?
  - Replicate the data out of Postgres to a message queue?
  - Batch-write data from Postgres to something else each night?
  - What are the error handling/failure scenarios for the above?
  - What message queues would you use? Why?
  - Give analysts access to Postgres directly?
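The "before/after persisting" question above boils down to ordering. A minimal sketch of one defensible answer — persist the support message durably first, then publish downstream — using in-memory stand-ins for Postgres and the queue (in production these would be Postgres and something like Kafka):

```python
import queue

db = []                      # stands in for Postgres (source of truth)
events_out = queue.Queue()   # stands in for a message queue

def handle_support_message(msg):
    # 1. Durably persist first: support messages must never be dropped,
    #    so the source of truth is written before anything else.
    db.append(msg)
    # 2. Then publish downstream. If this step fails, the message is
    #    still safe in Postgres and can be replayed or backfilled.
    try:
        events_out.put_nowait(msg)
    except queue.Full:
        pass  # acceptable here: downstream can re-read from Postgres

handle_support_message({"user_id": "u-1", "body": "help!"})
```

A strong candidate might go further and mention the outbox pattern or Postgres logical replication, which close the gap where a crash between the two steps leaves the queue behind the database.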
- How to build an API for the client messages
  - 5,000 msgs/s is a pretty good load. How would they address that?
    - Multiple instances?
    - Load balancer?
  - What happens once we have the message?
    - Do we need to persist it to a local DB? Why?
    - Put it on a queue?
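At 5,000 msgs/s, per-message writes downstream get expensive, so one answer worth probing for is batching in the API tier. A sketch (the class and threshold are illustrative, not a real library API) — losing a partial in-memory batch on a crash is acceptable precisely because client events may be dropped:

```python
class EventBatcher:
    """Buffers client events and ships them downstream in batches."""

    def __init__(self, flush, max_batch=500):
        self.flush = flush          # callback that ships one batch (e.g. to a queue)
        self.max_batch = max_batch
        self.buffer = []

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.max_batch:
            self.flush(self.buffer)
            self.buffer = []

shipped = []
batcher = EventBatcher(shipped.append, max_batch=3)
for i in range(7):
    batcher.add({"event": "click", "target": f"button_{i}"})
# Two full batches shipped; one partial batch still buffered.
```

A production version would also flush on a time interval so low-traffic periods don't strand events in the buffer.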
- The downstream data store
  - Did they choose to just keep the data in local Postgres instances? Why?
    - If so, do analysts have prod data access? Is this a good thing?
  - If using a data warehouse like Redshift, how are they getting the data there?
    - Something needs to read from a queue and get it into the DB.
    - Does the data schema need to change? (Did they mention a processing-time timestamp?)
    - Are they doing batch writes? Individual writes?
    - Error handling and failure cases?
  - Which datastore did they choose and why?
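The "something reads from a queue and gets it into the DB" step can be sketched as a loader that stamps a processing-time timestamp (distinct from the client-side event timestamp) and writes in batches. The queue and warehouse here are in-memory stand-ins for, say, Kafka and Redshift:

```python
from datetime import datetime, timezone
import queue

incoming = queue.Queue()   # stands in for the message queue
warehouse = []             # stands in for a warehouse table

def drain(batch_size=100):
    batch = []
    while not incoming.empty() and len(batch) < batch_size:
        event = incoming.get_nowait()
        # Processing time lets analysts measure pipeline lag and
        # reason about late-arriving client events.
        event["processed_at"] = datetime.now(timezone.utc).isoformat()
        batch.append(event)
    if batch:
        warehouse.extend(batch)   # one batch insert, not N single inserts
    return len(batch)

for i in range(5):
    incoming.put({"event": "click", "target": "login_button"})
drain()
```

Batch inserts are the point to probe: most warehouses handle one bulk load far better than thousands of single-row inserts per second.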