How to do Exploratory Data Analysis with Pandas Profile Report
This tutorial will teach you how to quickly and efficiently explore your dataset with a single line of code using the pandas-profiling ProfileReport.
When you first start working with a dataset, the first thing you want to do is explore it and build some intuition for the data before you start hacking away at it.
You need to study and understand your data because your understanding of the dataset could be the determining factor between failure and success of a data science project. Everything you do from data cleaning, to modeling, to feature engineering hinges on your understanding of the dataset.
So, when you start a new data science project, make sure you spend a lot of time understanding your dataset.
There are manual techniques you can use to explore and understand your data, but I have found that the EASIEST way is to get a PROFILE REPORT OF THE DATASET.
So, here we go. Let’s dive into learning how to create a pandas profile report.
Step 1: Install pandas-profiling
- To install with pip, run: pip install pandas-profiling
- To install with Anaconda, run: conda install -c conda-forge pandas-profiling
- To build from source or use any other installation method, see the pandas-profiling documentation here: https://pypi.org/project/pandas-profiling/
Step 2: Import pandas and pandas-profiling.
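The imports might look something like the sketch below. One assumption to flag: newer releases of this package are published under the name ydata-profiling, so the sketch tries the name used in this tutorial first and falls back to the newer one. Setting ProfileReport to None when neither is installed is just a convenience for this example, not something the library does.

```python
import pandas as pd

try:
    # Package name used throughout this tutorial.
    from pandas_profiling import ProfileReport
except ImportError:
    try:
        # Newer releases are published as ydata-profiling.
        from ydata_profiling import ProfileReport
    except ImportError:
        # Neither package is installed; go back to Step 1.
        ProfileReport = None

print("pandas version:", pd.__version__)
print("ProfileReport available:", ProfileReport is not None)
```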
Step 3: Load the dataset you want to create a profile report for.
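This tutorial does not name a specific dataset, so here is a minimal sketch that uses a small made-up CSV held in memory. The column names and values are purely illustrative. With your own data you would pass a file path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for your own file.
csv_text = """name,age,city,income
Ana,34,Lisbon,52000
Ben,45,Madrid,61000
Cara,,Lisbon,48000
Dev,29,Paris,
Eli,45,Lisbon,75000
"""

# pd.read_csv accepts a file path or any file-like object.
# With a real file this would be: df = pd.read_csv("your_file.csv")
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```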
Step 4: Verify that the data is loaded correctly.
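A quick way to check the load is to look at the first rows, the shape, the column types, and the missing-value counts. The sketch below does this on a small hypothetical DataFrame. Swap in the DataFrame you loaded in Step 3.

```python
import pandas as pd

# Hypothetical data standing in for the dataset loaded in Step 3.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cara", "Dev", "Eli"],
        "age": [34, 45, None, 29, 45],
        "city": ["Lisbon", "Madrid", "Lisbon", "Paris", "Lisbon"],
        "income": [52000, 61000, 48000, None, 75000],
    }
)

print(df.head())        # first five rows
print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # missing values per column
```

If the shape or the column types are not what you expect, fix the loading step before you move on to the profile report.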
Step 5: Use the pandas profile report to get a profile analysis of the dataset you just loaded.
HINT: Don’t be alarmed if the report takes a while to generate. The pandas profile report is a bit slow, which makes sense because it is doing a lot of number crunching and data analysis behind the scenes.
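Here is a minimal sketch of generating the report. The DataFrame, the report title, and the output file name are all hypothetical. The import is guarded so the example still runs, and simply skips the report, when the profiling package is not installed.

```python
import pandas as pd

try:
    from pandas_profiling import ProfileReport
except ImportError:
    try:
        from ydata_profiling import ProfileReport  # newer package name
    except ImportError:
        ProfileReport = None  # profiling package not installed

# Hypothetical data standing in for the dataset loaded in Step 3.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cara", "Dev", "Eli"],
        "age": [34, 45, None, 29, 45],
        "city": ["Lisbon", "Madrid", "Lisbon", "Paris", "Lisbon"],
        "income": [52000, 61000, 48000, None, 75000],
    }
)

if ProfileReport is not None:
    # Build the report and save it as a standalone HTML file.
    profile = ProfileReport(df, title="Profile Report")
    profile.to_file("profile_report.html")
else:
    profile = None
    print("Install pandas-profiling (or ydata-profiling) to build the report.")
```

Inside a Jupyter notebook you can display the report inline with profile.to_notebook_iframe() instead of writing a file.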
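If the wait gets too long on a big dataset, two things that may help are profiling a random sample of the rows and passing minimal=True, which switches off some of the more expensive calculations. The sketch below uses synthetic data. The sample size and file name are arbitrary choices for illustration.

```python
import pandas as pd

try:
    from pandas_profiling import ProfileReport
except ImportError:
    try:
        from ydata_profiling import ProfileReport  # newer package name
    except ImportError:
        ProfileReport = None  # profiling package not installed

# Hypothetical larger dataset: 10,000 rows of synthetic values.
df = pd.DataFrame(
    {
        "id": range(10_000),
        "value": [i % 97 for i in range(10_000)],
    }
)

# Profile a random subset instead of every row.
sample = df.sample(n=1_000, random_state=0)

if ProfileReport is not None:
    # minimal=True skips some of the heavier computations.
    profile = ProfileReport(sample, title="Sampled Profile Report", minimal=True)
    profile.to_file("sampled_profile_report.html")
else:
    profile = None
    print("Install pandas-profiling (or ydata-profiling) to build the report.")
```

A sampled report is only an approximation of the full dataset, so treat rare values and exact counts with some caution.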
Step 6: This is a sample of what the profile report looks like.
Step 7: What information does the profile report provide? The following is quoted word for word from the pandas-profiling documentation:
“For each column the following statistics – if relevant for the column type – are presented in an interactive HTML report:
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Correlations highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
- Missing values matrix, count, heatmap and dendrogram of missing values”
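To get a feel for how much work the report saves, here is a rough sketch of computing a few of these statistics by hand with plain pandas. The data is made up, and this is not how the library computes them internally. It only illustrates what some of the numbers mean.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the calculations.
df = pd.DataFrame(
    {
        "age": [34, 45, None, 29, 45],
        "income": [52000, 61000, 48000, None, 75000],
        "city": ["Lisbon", "Madrid", "Lisbon", "Paris", "Lisbon"],
    }
)

# Essentials: type, unique values, missing values for every column.
essentials = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "unique": df.nunique(),
        "missing": df.isna().sum(),
    }
)

# Quantile statistics for one numeric column (missing values are skipped).
age = df["age"]
q1, median, q3 = age.quantile([0.25, 0.5, 0.75])
quantile_stats = {
    "min": age.min(),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": age.max(),
    "range": age.max() - age.min(),
    "iqr": q3 - q1,
}

# Descriptive statistics for the same column.
descriptive_stats = {
    "mean": age.mean(),
    "std": age.std(),
    "sum": age.sum(),
    "skewness": age.skew(),
    "kurtosis": age.kurt(),
}

# Most frequent values of a categorical column.
top_cities = df["city"].value_counts().head(3)

# Correlation matrices over the numeric columns only.
numeric = df.select_dtypes("number")
pearson = numeric.corr(method="pearson")
spearman = numeric.corr(method="spearman")

print(essentials)
print(quantile_stats)
print(descriptive_stats)
print(top_cities)
print(pearson)
print(spearman)
```

That is already a fair amount of code for three columns, and it leaves out the mode, the median absolute deviation, the Kendall matrix, and all of the missing-value plots.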
As you can see, the pandas profile report is a single magical line of code that can save you a lot of time, energy, and resources. It gives you a quick analysis and snapshot of your data.
It is one of the easiest and fastest ways to do exploratory data analysis and build an intuition for your dataset before you start cleaning and eventually modeling your data.
Now that you know the basics of the pandas profiling report, go here to read the documentation and learn more: https://pypi.org/project/pandas-profiling/