DDC: Tackling Missing or Inconsistent Data

Data Science
R
DDC
missing
inconsistent
Author

I. Muliterno

Published

April 24, 2023

Hello and welcome to the Data Driven Chronicles (DDC) series, where I share small doses of my life as a data professional. Today we will talk about a common task: dealing with missing or inconsistent data.

Missing or inconsistent data is an everyday challenge in data science, and the quality of your insights, predictions, and models depends heavily on the quality of the data you use. In this second post of our series on everyday data science challenges, we’ll explore strategies for handling missing or inconsistent data using R, and how to make an informed decision about the best approach for your situation.

  1. Understand the nature of the missing or inconsistent data

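Before diving into any solutions, it’s essential to understand the nature of the missing or inconsistent data you’re dealing with. In R, you can use the summary() function to get an overview of your dataset. It reports the number of missing values (NA's) for numeric and logical columns, but for character columns it only shows length and class, so the NA in col2 and the impossible date in date_column do not appear in the output below: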

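# Load the packages used in this post
library(dplyr)        # %>%, mutate(), if_else()
library(mice)         # mice(), complete()
# library(tidyverse)  # optional: also attaches dplyr
# library(stringdist)
# library(DMwR2)      # only needed for knnImputation()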

dataset <- data.frame(col1 = c(1:3, NA),
                      col2 = c("one", NA, "cool", "text"),
                      col3 = c(TRUE, FALSE, TRUE, TRUE),
                      col4 = c(0.5, 4.7, 3.2, NA),
                      date_column = c("2000/1/1", "2000/2/1", "2000/3/1", "2023/13/40"),
                      stringsAsFactors = FALSE)

summary(dataset)
      col1         col2              col3              col4     
 Min.   :1.0   Length:4           Mode :logical   Min.   :0.50  
 1st Qu.:1.5   Class :character   FALSE:1         1st Qu.:1.85  
 Median :2.0   Mode  :character   TRUE :3         Median :3.20  
 Mean   :2.0                                      Mean   :2.80  
 3rd Qu.:2.5                                      3rd Qu.:3.95  
 Max.   :3.0                                      Max.   :4.70  
 NA's   :1                                        NA's   :1     
 date_column       
 Length:4          
 Class :character  
 Mode  :character  
                   
                   
                   
                   
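Because summary() says nothing about missing values in character columns, it can help to count them directly and to check the date strings explicitly. The following is a minimal base R sketch, not part of the original workflow. The names na_counts, parsed_dates, and invalid_dates are made up for illustration, and the date format "%Y/%m/%d" is an assumption based on the example values.

```r
# Rebuild the example data so this sketch runs on its own.
dataset <- data.frame(
  col1 = c(1:3, NA),
  col2 = c("one", NA, "cool", "text"),
  col3 = c(TRUE, FALSE, TRUE, TRUE),
  col4 = c(0.5, 4.7, 3.2, NA),
  date_column = c("2000/1/1", "2000/2/1", "2000/3/1", "2023/13/40"),
  stringsAsFactors = FALSE
)

# Count NAs in every column, whatever its type.
na_counts <- colSums(is.na(dataset))
print(na_counts)

# Parse the dates with an explicit format (assumed here to be year/month/day).
# Strings that are not valid calendar dates come back as NA.
parsed_dates <- as.Date(dataset$date_column, format = "%Y/%m/%d")

# Keep the original strings that failed to parse, for inspection.
invalid_dates <- dataset$date_column[is.na(parsed_dates)]
print(invalid_dates)
```

Here the per-column count picks up the NA in col2, and the parsing step flags "2023/13/40" as an inconsistent entry even though it is not NA in the raw data.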
  2. Data Imputation

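One common approach for dealing with missing data is imputation, which estimates the missing values from the other available data. The code below shows two options: replacing the missing value in col4 with the column mean, and regression-based imputation of the remaining numeric gap with the mice package. A k-nearest-neighbours alternative, knnImputation() from DMwR2, is left as a commented-out line at the end.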

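# Mean imputation: replace missing col4 values with the column mean
dataset <- dataset %>%
  mutate(col4 = if_else(is.na(col4), mean(col4, na.rm = TRUE), col4))
# Regression imputation with mice: 'norm.predict' fills gaps with predicted values
imputed_data <- mice(dataset, method = 'norm.predict', m = 5)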

 iter imp variable
  1   1  col1
  1   2  col1
  1   3  col1
  1   4  col1
  1   5  col1
  2   1  col1
  2   2  col1
  2   3  col1
  2   4  col1
  2   5  col1
  3   1  col1
  3   2  col1
  3   3  col1
  3   4  col1
  3   5  col1
  4   1  col1
  4   2  col1
  4   3  col1
  4   4  col1
  4   5  col1
  5   1  col1
  5   2  col1
  5   3  col1
  5   4  col1
  5   5  col1
complete_data <- complete(imputed_data)
complete_data
      col1 col2  col3 col4 date_column
1 1.000000  one  TRUE  0.5    2000/1/1
2 2.000000 <NA> FALSE  4.7    2000/2/1
3 3.000000 cool  TRUE  3.2    2000/3/1
4 2.703704 text  TRUE  2.8  2023/13/40
# kNN alternative from DMwR2 (left commented out: k = 5 exceeds the 4 rows in this toy dataset)
# imputed_data <- knnImputation(dataset, k = 5)

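It’s important to note that imputation can introduce bias or distort the underlying data distribution, so it should be used with caution. Mean imputation, for example, keeps the column mean unchanged but shrinks its variance, because every filled-in value sits exactly at the centre of the distribution.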
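To make the distortion concrete, here is a small base R sketch that compares the mean and variance of a vector before and after mean imputation. It reuses the col4 values from the example. The variable names are illustrative only, and ifelse() stands in for the dplyr if_else() call so the snippet needs no packages.

```r
# Same values as col4 in the example dataset.
x <- c(0.5, 4.7, 3.2, NA)

# Mean imputation: fill the gap with the mean of the observed values.
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# The mean is preserved by construction...
mean_before <- mean(x, na.rm = TRUE)
mean_after  <- mean(x_imputed)

# ...but the variance shrinks, because the imputed value adds no spread
# while the number of observations goes up.
var_before <- var(x, na.rm = TRUE)
var_after  <- var(x_imputed)

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(var_before = var_before, var_after = var_after))
```

The more values are imputed this way, the stronger the shrinkage, which in turn can make downstream estimates look more certain than they really are.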

  3. Removing missing or inconsistent data

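In some cases, it may be appropriate to remove rows or columns containing missing or inconsistent data. The simplest option is na.omit(), which drops every row that contains at least one NA. Keep in mind that it only reacts to values R recognises as missing: an inconsistent entry such as the date 2023/13/40 is an ordinary string and would not be flagged by na.omit() on its own.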

dataset <- na.omit(dataset)
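Dropping every row with any NA can discard more data than necessary. As a hedged sketch of a more targeted alternative, the base R code below removes rows only when chosen columns are missing, and separately removes rows whose date string fails to parse. It starts from the original example data rather than the imputed version. The names required_cols, kept_rows, and valid_date_rows, and the date format, are assumptions made for illustration.

```r
# Rebuild the original example data so this sketch runs on its own.
dataset <- data.frame(
  col1 = c(1:3, NA),
  col2 = c("one", NA, "cool", "text"),
  col3 = c(TRUE, FALSE, TRUE, TRUE),
  col4 = c(0.5, 4.7, 3.2, NA),
  date_column = c("2000/1/1", "2000/2/1", "2000/3/1", "2023/13/40"),
  stringsAsFactors = FALSE
)

# Baseline: na.omit() drops any row with at least one NA.
all_complete <- na.omit(dataset)

# Targeted removal: only require the columns the analysis actually needs.
required_cols <- c("col1", "col4")
kept_rows <- dataset[complete.cases(dataset[, required_cols]), ]

# Inconsistent data: drop rows whose date string is not a valid date
# (format assumed to be year/month/day).
parsed_dates <- as.Date(dataset$date_column, format = "%Y/%m/%d")
valid_date_rows <- dataset[!is.na(parsed_dates), ]

print(nrow(all_complete))
print(nrow(kept_rows))
print(nrow(valid_date_rows))
```

On this toy dataset the targeted filter keeps the row where only col2 is missing, which na.omit() would throw away.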