What does EDA stand for? What should I do if I’m asked to perform EDA?
Answer
EDA stands for Exploratory Data Analysis. You should always perform EDA before carrying out any inferential statistical analysis. EDA typically involves computing descriptive statistics (e.g. measures of center and spread) and/or constructing different types of graphs.
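As a rough illustration of the descriptive-statistics side of EDA, here is a minimal sketch using only Python's standard `statistics` module. The `heights_cm` values and the `describe` helper are made up for this example; in practice you would run the same kind of summary on each variable in your own dataset.

```python
import statistics

# Hypothetical sample of heights in centimetres (made-up values for illustration).
heights_cm = [162.0, 175.5, 168.2, 181.0, 158.7, 170.3, 177.8]


def describe(values):
    """Return basic measures of center and spread for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = describe(heights_cm)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Comparing the mean with the median, and the minimum and maximum with what is physically plausible, is often enough to spot skew or suspicious values before any plot is drawn.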
EDA is important because it allows you to “discover patterns, spot anomalies, test a hypothesis, or check assumptions.” Here are a few situations where EDA is helpful:
- You check the range of a variable for human height and find several values of zero. This allows you to fix what is likely a data entry error.
- You plan to fit a simple linear regression model. After building a scatterplot, you discover that the relationship between the two variables is actually quadratic and consider the addition of a higher order term to the model.
- You plan to fit a multiple linear regression model. You build a heatmap to evaluate the relationship between all variables in the dataset and discover that several of the predictors are strongly correlated with each other. This leads you to consider how multicollinearity will affect the final results.
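The three situations above can be sketched numerically. This is only an illustration under stated assumptions: all data are made up, the `pearson` helper is written here for the example, the 0.8 cutoff is an arbitrary choice, and the correlation checks are a plain-Python stand-in for the scatterplot and heatmap, which you would normally draw with a plotting library.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# 1. Range check: a human height of zero is almost certainly a data entry error.
heights_cm = [172.0, 0.0, 165.4, 180.1, 0.0, 158.9]  # made-up values
suspect_rows = [i for i, h in enumerate(heights_cm) if h <= 0]
print("Rows with impossible heights:", suspect_rows)

# 2. Curvature check (numeric stand-in for a scatterplot): y depends on x
#    quadratically, so the linear correlation is weak while the correlation
#    with x squared is strong.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 + 1 for v in x]
r_linear = pearson(x, y)
r_quadratic = pearson([v ** 2 for v in x], y)
print("r(x, y):", r_linear, " r(x^2, y):", r_quadratic)

# 3. Multicollinearity check (numeric stand-in for a heatmap): flag pairs of
#    predictors whose absolute correlation exceeds an assumed cutoff.
predictors = {
    "height_cm": [150, 160, 170, 180, 190],
    "arm_span_cm": [152, 161, 169, 182, 191],
    "age_years": [34, 21, 45, 29, 38],
}
CUTOFF = 0.8  # arbitrary threshold chosen for this sketch
high_pairs = [
    (a, b)
    for a, b in combinations(predictors, 2)
    if abs(pearson(predictors[a], predictors[b])) > CUTOFF
]
print("Strongly correlated predictor pairs:", high_pairs)
```

A weak linear correlation alone does not prove a quadratic relationship, so a check like the second one is best read alongside a plot of the data.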
EDA is the statistical equivalent of gathering clues before giving a suspect a lie detector test. If the results of the test match your clues, you can be reasonably confident in them. However, if the results don’t match your clues, you need to consider whether the test was the correct approach to the situation in the first place.