When working with data, especially in the fields of data science and machine learning, we frequently encounter missing values. Understanding the reasons why data is missing is critical because the method used to handle missing data depends on the type of missingness. In statistics and data analysis, we categorize missing data into three main types:
- MCAR (Missing Completely at Random)
- MAR (Missing at Random)
- MNAR (Missing Not at Random)
These terms can sound complex, but they are crucial in determining how we should handle missing values.
In this article, we will explain these concepts in simple terms and demonstrate how to identify and handle them using Python.
1. What Are Missing Data Mechanisms?
Before we dive into the three types of missing data, let’s define what “missing data mechanisms” are:
A missing data mechanism is the process or reason that causes data to be missing. The type of missingness affects both the analysis and the strategy for dealing with it. The three main mechanisms are:
- MCAR (Missing Completely at Random): Missingness occurs entirely by chance.
- MAR (Missing at Random): The missingness is related to other observed data, but not to the missing value itself.
- MNAR (Missing Not at Random): The missingness is related to the missing value itself.
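To make the three mechanisms concrete, here is a minimal simulation sketch. The data is synthetic, and the column names, the missingness probabilities, and the age and income thresholds are all assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 1000

# Synthetic survey data (illustrative values only): income rises with age.
age = rng.integers(20, 70, size=n)
income = 20000 + 800 * (age - 20) + rng.normal(0, 5000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of a missing income depends only on the observed age.
mar_mask = rng.random(n) < np.where(df["age"] < 35, 0.4, 0.1)

# MNAR: the chance of a missing income depends on the income value itself.
high_income = df["income"] > df["income"].quantile(0.8)
mnar_mask = rng.random(n) < np.where(high_income, 0.6, 0.05)

# Series.mask replaces the entries where the mask is True with NaN.
income_mcar = df["income"].mask(mcar_mask)
income_mar = df["income"].mask(mar_mask)
income_mnar = df["income"].mask(mnar_mask)

print("True mean:         ", round(df["income"].mean()))
print("Observed mean MCAR:", round(income_mcar.mean()))
print("Observed mean MAR: ", round(income_mar.mean()))
print("Observed mean MNAR:", round(income_mnar.mean()))
```

In this setup the MNAR version hides mostly high incomes, so the mean of the values that remain understates the true mean, and nothing in the observed columns reveals by how much.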
2. Types of Missing Data
A. MCAR (Missing Completely at Random)
Definition: In MCAR, missing data occurs completely randomly. There is no systematic reason behind the missingness. The absence of data does not depend on any observed or unobserved values.
Real-World Example:
- Imagine you’re conducting a survey about people’s favorite food. Some participants accidentally spill coffee on their responses, causing random entries to be missing. The missing data is unrelated to any specific food preference or any other characteristic of the survey.
Key Points:
- The missing data does not depend on any observed or unobserved data.
- Under MCAR, you can usually drop the rows with missing values without biasing the analysis, because the remaining rows are still a random sample of the data. The cost is a smaller sample size.
Python Example:
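A minimal sketch, assuming a small hypothetical table of names and ages in which Charlie's age is the one missing entry. The names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Charlie's age was lost by accident (MCAR).
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 32, np.nan, 41],
})

# Count the missing values in each column.
print(df.isna().sum())

# Under MCAR, dropping incomplete rows does not bias the result;
# it only shrinks the sample.
df_complete = df.dropna(subset=["age"])
print(df_complete)
```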
In this example, the missing value in Charlie’s age is entirely random and independent of other factors.
B. MAR (Missing at Random)
Definition: In MAR, missingness is related to other observed data, but not to the value of the missing data itself. That is, if we know certain observed values, we can predict whether the data will be missing or not.
Real-World Example:
- You’re conducting a survey about people’s income and age. Younger people are less likely to disclose their income, but the missingness of income does not depend on the income value itself—it depends on age. In this case, age is an observed factor that explains the missingness in income data.
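One way to look for the pattern in this example is to compare the share of missing income values across age groups. The sketch below uses hypothetical data and an assumed cut-off of 30 years. A difference between groups shows that missingness is related to age, but it cannot rule out that missingness also depends on income itself.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses (illustrative values only).
df = pd.DataFrame({
    "age": [22, 24, 27, 29, 35, 41, 48, 52],
    "income": [np.nan, 31000, np.nan, np.nan, 52000, 58000, np.nan, 70000],
})

# Flag missing incomes and bucket respondents by an assumed age cut-off.
df["income_missing"] = df["income"].isna()
df["age_group"] = np.where(df["age"] < 30, "under_30", "30_plus")

# Share of missing income values within each age group.
missing_rate = df.groupby("age_group")["income_missing"].mean()
print(missing_rate)
```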
Key Points:
- The missingness is related to some other observed data.
- You can handle this type of missing data by using techniques like imputation or by modeling the missingness using observed data.
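Python Example:
A minimal sketch of group-wise mean imputation, assuming a hypothetical table in which Bob and David did not report their income and an `age_group` column is available. The names, groups, and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two younger respondents did not disclose their income.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "age_group": ["young", "young", "young", "young", "older", "older"],
    "income": [30000, np.nan, 34000, np.nan, 60000, 64000],
})

# Mean income of each age group, aligned row by row (NaN values are skipped).
group_mean = df.groupby("age_group")["income"].transform("mean")

# Fill each missing income with the mean of that person's age group.
df["income_filled"] = df["income"].fillna(group_mean)
print(df)
```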
Here, the missing income values for Bob and David were filled with the average income of people in their age group (since the missingness was related to the observed age variable).
C. MNAR (Missing Not at Random)
Definition: In MNAR, missingness is related to the value of the missing data itself. This means that the reason why data is missing depends on the unobserved data. In simpler terms, the fact that a value is missing tells you something about the value itself.
Real-World Example:
- Suppose you’re collecting data on people’s income, and only people with very high or very low incomes refuse to disclose their income. The missing data (income) depends on the actual value of the income, making this MNAR.
Key Points:
- The missing data depends on the value of the missing data itself.
- MNAR is the hardest type of missingness to handle because the bias it introduces cannot be corrected from the observed data alone. You might need specialized techniques, such as explicitly modeling the missingness or using external data to adjust.
Python Example:
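A minimal sketch of median imputation on a hypothetical income column. The names and values are illustrative, and the extra `income_missing` flag is an added assumption that keeps a record of which values were imputed.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two respondents withheld their income.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "income": [42000, np.nan, 51000, np.nan, 47000],
})

# Record which values were missing before filling them in.
df["income_missing"] = df["income"].isna()

# The median is computed only from the observed (disclosed) incomes.
median_income = df["income"].median()
df["income_filled"] = df["income"].fillna(median_income)
print(df)
```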
Here, we filled the missing values with the median income. This is a simple baseline, but it does not correct the bias that MNAR introduces, because the median is computed only from the people who disclosed their income. In a real-world scenario, more sophisticated methods would be needed, such as models that explicitly account for the missingness mechanism.
3. Conclusion
Understanding MCAR, MAR, and MNAR is essential when working with missing data. Here’s a quick summary:
- MCAR (Missing Completely at Random): Data is missing completely by chance. Dropping the incomplete rows usually does not bias the analysis, though it reduces the sample size.
- MAR (Missing at Random): Missingness is related to observed data, but not to the missing data itself. You can use techniques like imputation.
- MNAR (Missing Not at Random): Missingness is related to the missing data itself. This is the most challenging type and often requires advanced handling.
Knowing the type of missing data helps you decide the best strategy to handle it, leading to more accurate and reliable analysis. With Python’s powerful libraries like pandas, you can effectively identify and manage missing data in your datasets.