Want to know more about exploratory data analysis? Here are the 5 core concepts you need to know.
Doing an analysis project can be super time-consuming when you have no clue where to start. With a background in business analytics and experience as a digital analytics consultant, I am giving you the guidance you need to do your own exploratory data analysis.
You will learn all about exploratory data analysis, from its types to its steps, methods, tools, and more.
After learning all of these fundamentals, you will have a good idea of how to start conducting your analysis project.
This post is all about the fundamentals of exploratory data analysis.
Define Exploratory Data Analysis
What exactly is exploratory data analysis? If you look it up on the internet, you will see that many define it as an analysis approach that helps you find patterns and trends in your data.
How I like to describe it: this is the step where you can think outside the box and play around with your data. Here, you get a good overview of your data set without needing to do a deep dive. For instance, you will know whether your data set is skewed towards a demographic group, a social media platform, or even a particular product.
To give you a good idea of when you should conduct exploratory data analysis, this is the step RIGHT before you start coming up with any insights (aka the ANALYSIS step mentioned in my FREE e-book here).
For the more advanced readers in the house, this is also the step that you MUST complete before feeding the dataset into a model (e.g. CART, regression, neural networks, and many more). This step allows you to understand the underlying structure, identify the more important variables, and detect outliers and anomalies so that you can identify the most suitable models.
Types of Exploratory Data Analysis
There are 3 main types of exploratory data analysis.
1. Univariate
Judging from the name, this is where you only look at 1 variable (aka column/dimension) at a time. This helps you understand the nature of the variable (e.g. categorical/continuous), see whether the data is skewed, and identify any outliers. This is the easiest type of exploratory data analysis.
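Here is a minimal sketch of a univariate check in Python with pandas. The `quantity` column and its values are made up for illustration, and the 1.5 x IQR rule used to flag outliers is just one common convention, not the only way to do it.

```python
import pandas as pd

# Hypothetical order quantities; the last value is deliberately extreme.
quantity = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], name="quantity")

# Summary statistics for a single variable.
print(quantity.describe())

# Positive skewness suggests a long right tail.
skewness = quantity.skew()

# Flag outliers with the 1.5 * IQR rule (an assumed convention).
q1, q3 = quantity.quantile(0.25), quantity.quantile(0.75)
iqr = q3 - q1
outliers = quantity[(quantity < q1 - 1.5 * iqr) | (quantity > q3 + 1.5 * iqr)]

print(f"skewness: {skewness:.2f}")
print("outliers:", outliers.tolist())
```

With this made-up data, only the value 40 is flagged, and the skewness comes out positive.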
2. Bivariate
This is where you look at 2 variables together. This enables you to understand the relationship between variables A and B, for instance, whether they are independent of or correlated with one another.
Common examples are:
- Comparing variable X against time to see whether it changes over time.
- Comparing 2 categorical variables to see whether they are associated.
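For the second example, one way to compare 2 categorical variables is a cross-tabulation. This is a rough sketch with pandas; the `age_group` and `platform` columns and all the rows are hypothetical.

```python
import pandas as pd

# Hypothetical survey rows: which platform each respondent uses, by age group.
df = pd.DataFrame({
    "age_group": ["18-24", "18-24", "18-24", "25-34",
                  "25-34", "25-34", "35-44", "35-44"],
    "platform": ["TikTok", "TikTok", "Instagram", "Instagram",
                 "Instagram", "TikTok", "Facebook", "Facebook"],
})

# Count how often each pair of categories occurs together.
counts = pd.crosstab(df["age_group"], df["platform"])

# Row shares make it easier to compare groups of different sizes.
shares = pd.crosstab(df["age_group"], df["platform"], normalize="index")

print(counts)
print(shares.round(2))
```

If the row shares look very different from one age group to the next, the two variables are probably related and worth a closer look.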
3. Multivariate
The last type is multivariate where you look at 3 or more variables at a time. You can view this as a more advanced level of bivariate, just that this time round, you are looking at more variables and the relationships between them become more complicated. You will need the ability to really deep dive to identify patterns and outliers.
Exploratory Data Analysis Methods
You can explore your dataset with 2 main methods.
1. Graphical Method
The first one is to explore data through visual representations such as graphs and charts. Common examples are line graphs, box plots, pie charts, word clouds, bar charts, and scatterplots.
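As a quick illustration of the graphical method, here is a small sketch that draws a line graph and a box plot with Matplotlib. The sales figures and the output file name `eda_charts.png` are made up for this example.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt

# Hypothetical daily sales figures, made up for illustration.
days = list(range(1, 11))
sales = [12, 15, 14, 18, 21, 19, 24, 26, 25, 60]

fig, (ax_line, ax_box) = plt.subplots(1, 2, figsize=(9, 3.5))

# Line graph: how the variable moves over time.
ax_line.plot(days, sales, marker="o")
ax_line.set_xlabel("day")
ax_line.set_ylabel("sales")
ax_line.set_title("Sales over time")

# Box plot: spread of the variable, with extreme values drawn as points.
ax_box.boxplot(sales)
ax_box.set_title("Sales distribution")

fig.tight_layout()
fig.savefig("eda_charts.png")  # hypothetical output file name
```

The line graph shows the trend over time, while the box plot makes the extreme value on the last day easy to spot.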
2. Non-graphical Method
The second method is through statistical techniques, meaning you describe the data with numbers instead of visuals. Examples of metrics used are mean, median, mode, max, min, standard deviation, RMS, skewness, and kurtosis.
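To make these metrics concrete, here is a small sketch that computes them with Python's standard library. The sample values are made up, and the skewness and kurtosis below use the simple population formulas (kurtosis is reported as excess kurtosis), which may differ slightly from what your tool of choice reports.

```python
import math
import statistics

# Hypothetical sample, made up for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
std_dev = statistics.pstdev(values)  # population standard deviation

# Root mean square: square root of the mean of the squared values.
rms = math.sqrt(sum(v ** 2 for v in values) / len(values))

# Skewness and excess kurtosis from standardised moments (population form).
n = len(values)
skewness = sum(((v - mean) / std_dev) ** 3 for v in values) / n
kurtosis = sum(((v - mean) / std_dev) ** 4 for v in values) / n - 3

print(mean, median, mode, min(values), max(values))
print(round(std_dev, 2), round(rms, 2), round(skewness, 2), round(kurtosis, 2))
```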
Exploratory Data Analysis Steps
These are the 4 main steps I take personally to explore my data sets.
Step 1: Understand the meaning of each column + Univariate
The first step to start exploring data is to understand what your variables are. These are indicated by the column header names.
You need to have a good understanding of the following:
- What the column is and its definition. Make sure to have no repeated column names.
- Is the variable categorical or continuous? As a rule of thumb, calculations such as sums and averages only make sense for continuous variables. For instance, take product ID vs quantity: both are numbers, but product ID is categorical while quantity is continuous.
Once you have got these 2 sorted out, you can then perform a quick univariate analysis for each variable to check for skewed data and outliers.
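The two checks in this step could look something like this in pandas. It is only a sketch: the `orders` table, its column names, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical orders table; column names and values are made up.
orders = pd.DataFrame({
    "product_id": [1001, 1002, 1001, 1003],
    "quantity": [2, 1, 5, 3],
})

# Check for repeated column names before anything else.
duplicated_columns = orders.columns[orders.columns.duplicated()].tolist()

# product_id is a label, not a measurement, so store it as a category.
orders["product_id"] = orders["product_id"].astype("category")

# Averaging quantity is meaningful; for product_id, counting is the sensible summary.
average_quantity = orders["quantity"].mean()
orders_per_product = orders["product_id"].value_counts()

print(duplicated_columns)
print(average_quantity)
print(orders_per_product)
```

Marking `product_id` as a category also stops you from accidentally averaging it later on.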
Step 2: Find columns that have a relationship to each other + Bivariate
Next, identify the columns you think may be related to each other and conduct bivariate analysis for each combination. Some tools allow you to do this fast.
For continuous variables, the correlation coefficient is usually used here to check the strength and direction of the relationship between variables. A common rule of thumb is that if it falls between 0.7 and 1.0 or between -0.7 and -1.0, the relationship is strong. A positive value indicates a positive relationship while a negative value indicates a negative relationship. 0 means no linear relationship.
In reality, depending on your dataset, you sometimes have to relax the criteria a little to 0.6 or -0.6. You will inevitably face situations where no variables fall in these ranges. In such scenarios, you still have to continue with the analysis project, and this is when you have to be flexible and readjust.
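As a rough sketch of this step, the snippet below computes the Pearson correlation for every pair of columns with pandas and lists the pairs that pass the 0.7 rule of thumb. The data, the column names, and the `THRESHOLD` constant are all made up for illustration; lower the threshold to 0.6 if nothing passes.

```python
import pandas as pd

# Hypothetical marketing data, made up for illustration.
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50, 60],
    "visits": [110, 190, 320, 390, 510, 600],
    "discount": [5, 3, 6, 2, 4, 5],
    "returns": [30, 26, 22, 17, 13, 8],
})

THRESHOLD = 0.7  # assumed cut-off for a "strong" relationship

corr = df.corr()  # Pearson correlation for every pair of numeric columns

# Keep each pair once and collect those at or beyond the threshold.
columns = list(corr.columns)
strong_pairs = []
for i, a in enumerate(columns):
    for b in columns[i + 1:]:
        r = float(corr.loc[a, b])
        if abs(r) >= THRESHOLD:
            strong_pairs.append((a, b, round(r, 2)))

for a, b, r in strong_pairs:
    direction = "positive" if r > 0 else "negative"
    print(f"{a} vs {b}: r = {r} ({direction})")
```

In this made-up data, `discount` is not strongly related to anything, so it drops out of the list.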
Step 3: Deep dive into interesting findings
After steps 1 and 2, you should have a good feel for your data set and this is when you can go crazy and test out whatever you want.
Usually, at this stage, you will already have hypotheses that you want to test out or questions that you want to verify. This is where you make those deep dives into the data to check those theories you have.
While this step is super fun, do set a time limit for it so it doesn’t get out of hand and you still hit the deadline of the analysis project.
Step 4: Multivariate, if needed
The last step is optional for you. I usually skip this step when I deal with smaller projects because:
- The concepts for multivariate are difficult to understand and explain
- Steps 1 to 3 give me sufficient information to generate valuable insights
That being said, I still run this from time to time. Here are the things to take note of if you are planning to do multivariate analysis.
- Have a good understanding of the models you want to use. For instance, what each model is suitable for, its limitations, and so on.
- Know what model your data set is suitable for.
- Try out your dataset with more than 1 model and take the one that performs the best. Sometimes, you will think that Model A performs the best but the reality might turn out different.
- Set KPIs to compare model performances. For example, RMSE, R-squared, and more.
- Format and clean the data (e.g. handle outliers) before feeding it into the model.
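To illustrate comparing models with KPIs, here is a minimal sketch that computes RMSE and R-squared by hand for two sets of predictions. The actual values and the predictions of `model_a` and `model_b` are made up; in a real project they would come from your own models and a hold-out data set.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: lower is better."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical hold-out values and predictions from two made-up models.
actual = [10, 12, 15, 18, 20]
predictions = {
    "model_a": [11, 12, 14, 19, 20],
    "model_b": [9, 14, 13, 20, 23],
}

scores = {
    name: {"rmse": rmse(actual, pred), "r2": r_squared(actual, pred)}
    for name, pred in predictions.items()
}
best = min(scores, key=lambda name: scores[name]["rmse"])

for name, s in scores.items():
    print(f"{name}: RMSE = {s['rmse']:.2f}, R-squared = {s['r2']:.2f}")
print("lowest RMSE:", best)
```

Picking the model with the lowest RMSE is just one possible rule here. Decide on the KPI before you run the comparison so you are not tempted to pick whichever metric flatters your favourite model.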
Exploratory Data Analysis Tools
Below are 3 common tools that can help you during the exploratory stage. I loved using R and Excel.
1. Exploratory Data Analysis in Python
Python is one of the most commonly used and preferred programming languages for data analysis because of its versatility and its extensive libraries.
The commonly used libraries for exploratory data analysis are:
- Pandas
- Matplotlib and Seaborn
- NumPy
Resources for you: EDA in Python | Step-by-Step EDA using Python | Exploratory Data Analysis in Python
2. Exploratory Data Analysis in R
Another popular programming language that I used for exploratory data analysis is R. I am more familiar with this as I learned it when I was in university.
Similar to Python, R also has its own packages for exploratory data analysis:
- ggplot2
- tidyverse
- corrplot
- caTools
- car
- data.table
Resources for you: EDA in R | EDA in R using tidyverse and ggplot2 | R for data science | ggplot2 cheat sheet | data.table cheat sheet
3. Exploratory Data Analysis in Excel
Though Excel is not as versatile or as powerful as Python and R, it is still one of the favourite choices for exploratory data analysis, especially for people who do not have coding knowledge or don’t like coding (for example, ME).
Not only does Excel have a user-friendly interface, it can also perform basic univariate and bivariate analyses with ease.
Common features used include:
- Simple formula calculation
- PivotTables
- Charts and Graphs
- Analysis ToolPak
Resources for you: Download Analysis ToolPak in Excel | Visualisation & Descriptive Statistics in Excel | Introduction to Analysis ToolPack
This post is all about giving you a headstart in conducting your own exploratory data analysis.
If you would love to see more content like this, follow me on social media to stay updated with the latest information.