@eetuko
Created November 23, 2020 09:00
Method summary for a systems perspective on exploratory data analysis.

In data science, exploratory data analysis can go on forever. Here is a quick summary of the process I follow to get it done. It is easy to get sidetracked and lost in the details; committing to a short list of bullet points helps you stick to the main goal: getting a global overview. The important shift is to iterate at least three times, with a different focus at each iteration:

  • Function: What is the purpose of the system? Is it to satisfy customers? How? Something else?
  • Structure: How are the building blocks arranged in space?
  • Processes: How are the building blocks arranged in time?

Depending on the type of question, it may make sense to adjust the balance between time and space, but both are important.


The next few points follow a classical analytical process: you break the system down into blocks. Focus on function, structure and process sequentially, one at each iteration:

  1. Frame the question: what is it that you are curious about?

  2. Would you be able to get data to answer your question?

  • Think about who has access to the data and an interest in hosting it: organisations, companies, etc.
  • Do you have the right to use the data for your purpose (license, etc.)?
  • Once you get the data, keep the raw data separate from the data you will be playing with. The raw data is only ever copied; explore the copy.
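The raw/working split above can be sketched in a few lines. This is a minimal sketch, assuming a hypothetical project layout with `data/raw/` and `data/working/` directories; adapt the paths to your own setup.

```python
import shutil
from pathlib import Path

# Assumed layout: data/raw/ holds untouched source files,
# data/working/ holds the copies the analysis actually uses.
RAW_DIR = Path("data/raw")
WORK_DIR = Path("data/working")

def stage_working_copy(filename: str) -> Path:
    """Copy a raw file into the working directory and return the copy's path.

    The raw file is never modified; all exploration happens on the copy.
    """
    WORK_DIR.mkdir(parents=True, exist_ok=True)
    dst = WORK_DIR / filename
    shutil.copy2(RAW_DIR / filename, dst)
    return dst

# Demo: create a tiny raw file, then stage a working copy of it.
RAW_DIR.mkdir(parents=True, exist_ok=True)
(RAW_DIR / "sample.csv").write_text("a,b\n1,2\n")
copy_path = stage_working_copy("sample.csv")
```

The point of the helper is discipline, not machinery: any transformation you apply touches only the file under `data/working/`, so you can always re-copy from the untouched original.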

  3. [Analytical] Start with data analysis; try to understand the basics:
  • What is the expected average behaviour?
  • What are the extreme values?
  • Is there missing data? Why?
  • etc.
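These basic checks can be done in a few lines of pandas. A minimal sketch, using toy data that stands in for whatever dataset you are exploring; the column names and the 95th-percentile cutoff are illustrative assumptions, not a rule.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the dataset under study;
# note the missing value and the extreme value in "revenue".
df = pd.DataFrame({
    "revenue": [10.0, 12.5, 11.0, np.nan, 250.0],
    "region": ["north", "south", "north", "east", "south"],
})

summary = df["revenue"].describe()   # count, mean, std, min, quartiles, max
missing = df.isna().sum()            # missing values per column
# One (arbitrary) way to flag candidate extremes: above the 95th percentile.
outliers = df[df["revenue"] > df["revenue"].quantile(0.95)]
```

`describe()` gives the expected average behaviour and the extremes in one call; `isna().sum()` tells you where the missing data is, which is usually the first "why?" to chase.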
  4. [Bibliography] Look for background: people who know the subject better than you do:

  • Colleagues and friends
  • Academic databases/search engines (Google Scholar, arXiv, PubMed, etc.)
  • Consulting firm reports (McKinsey Analytics, Gartner, Deloitte, etc.)
  • Generic search engines (Google, DuckDuckGo, Qwant, ...)

In this step, focus on understanding what you just saw in the data. The point is not necessarily to be comprehensive but to initiate a process.

Use different sources and cross-check them against each other.

Write down every reference that brings you some knowledge or understanding of the topic.

  5. Visualise! Use whatever tool you are used to and try to actually see the patterns in the data. When building a chart, every element should make sense: question your scales, shapes and wording. From univariate to multivariate analysis, try to build visuals that reflect the situation honestly. Ask a friend for a fresh look, or let the chart rest for a few hours or days before looking at it again.
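A univariate starting point might look like the sketch below. This is a minimal matplotlib example under assumed toy data; the variable name, labels and output file are placeholders for your own.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
from pathlib import Path

# Placeholder values for the variable under study
values = [10.0, 12.5, 11.0, 250.0]

fig, ax = plt.subplots()
ax.hist(values, bins=20)
# Every element should make sense: labelled axes, an honest scale, a clear title.
ax.set_xlabel("revenue")
ax.set_ylabel("count")
ax.set_title("Distribution of revenue (univariate)")

out_path = Path("revenue_hist.png")
fig.savefig(out_path)
```

Saving the chart to a file also makes the "let it rest" advice practical: come back to the image later, or send it to a friend, without rerunning the analysis.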

Conclusion:

Now, to make it worth the effort, take a pen and a piece of paper and draw. Draw diagrams of how you understand things to be connected; explain it to yourself. Once this is done, you can write the report you intended to, or start data modelling efficiently.
