Copyright @2014.
All Rights Reserved.
StatsToDo : Handling Missing Data in R

This page explains and provides example code for handling missing data using R. For those who are unfamiliar with R but wish to start using it, explanations of how to set up and start using R are provided in the R Explained Page.

In research and data handling, missing data are a common occurrence. Subjects are lost, errors are made in collecting and transcribing information, and a whole host of other reasons create holes in the data table.

R provides options for handling missing data in nearly all its functions, but this requires the analyst to be familiar with how missing data may affect a particular procedure, and with the options each procedure provides for handling them.

For the sake of simplicity, all the code provided in StatsToDo assumes that the data are already clean and contain no missing values. This separates the procedures for handling missing data (which are described in this page) from the statistical algorithms.

This page therefore provides the algorithms for handling missing data at the final stages of data preparation, to produce a complete set of data for analysis.

How missing data are represented in R

Within a data frame, missing values are shown as NA in numerical columns and <NA> in text columns. In data input, however, the following conventions apply:
  • When data are presented directly in the R code as a text table, or read in from a comma-delimited file (.csv), missing data are represented by NA. Any other representation is interpreted by R as a value in the data (the exception is an empty field in a numerical column, which is also read as NA)
  • When data are read in from an Excel worksheet using the package xlsx, missing data are blank cells in the worksheet. Anything else is interpreted as an actual value and processed accordingly
When deciding how data are to be input into R, how missing values are represented in the different input media should be carefully considered. Testing with a small data set first will avoid errors later.
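As a quick check of the conventions above, the following minimal sketch reads a small comma-delimited table from text and counts the missing values in each column. The data and the column names (Sex, Age, Weight) are invented for illustration; na.strings is the read.csv argument that lists which strings are read as missing.

```r
# Small invented data set; NA marks the missing values
csv_text <- "Sex,Age,Weight
M,34,70.5
F,NA,62.0
NA,41,NA
F,29,58.3"

# na.strings lists the text codes to treat as missing ("NA" is the default)
df <- read.csv(text = csv_text, na.strings = "NA", stringsAsFactors = FALSE)

print(df)                                  # text column shows <NA>, numerical columns show NA
missing_per_column <- colSums(is.na(df))   # number of missing values in each column
print(missing_per_column)
rows_with_missing <- which(!complete.cases(df))  # rows containing any missing value
print(rows_with_missing)
```

If a data source uses another code for missing values (for example "." or -99), adding that code to na.strings is one way to have it read as NA rather than as a value.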

Different options in dealing with missing data

R provides an extensive collection of methods for handling missing data. Only a few of the more commonly used ones are presented in this page. This panel discusses the options conceptually. The complete code and an explanation of how it works are presented in the other two panels (R Codes and Code Explained).

Option 1. Casewise deletion

This is the easiest and most widely used method. All records containing missing values are deleted.

This method is appropriate if the analyst is confident that the data are missing at random, so that removing records containing missing data would not create a bias leading to misinterpretation. The amount of missing data should also be small, say affecting less than 1% of the cases.
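Casewise deletion can be sketched in base R as follows. The data frame is invented for illustration; complete.cases and na.omit are standard base R functions. Checking the proportion of incomplete records first shows how much data would be lost.

```r
# Invented example: 6 records, 2 of which contain a missing value
df <- data.frame(
  group = c("A", "B", "A", NA, "B", "A"),
  score = c(10.2, 11.5, NA, 9.8, 12.1, 10.9)
)

# Proportion of records that would be lost by casewise deletion
prop_incomplete <- mean(!complete.cases(df))
print(prop_incomplete)

# Remove every record that contains at least one missing value
df_complete <- na.omit(df)
print(df_complete)
```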

Option 2. K Nearest Neighbour

For each missing value, the program searches for the k complete records that are nearest to it (nearest in similarity, not in position within the table), and replaces the missing value with their average for a numerical column, or their most frequent value for a text column. k can be specified in the function call. If not specified, the default k=10 is used.

This is a robust method, and can be used even if some biasing process is implicated in the data loss, as the missing value is replaced by values from similar records.

There are, however, some issues involved. Firstly, every missing value requires k (10 by default) complete records. Secondly, the whole database is searched for the nearest records, which is time consuming if the database is large and the missing values numerous.

The method was devised by those working on big data and artificial intelligence, where thousands or even millions of records are available and the data can be analysed using powerful computers over prolonged periods.

Clinical data are caught between two constraints: the database is often not large enough, so that k has to be reduced, and processing on a desktop computer can take a long time.

Although the method is excellent in theory, it cannot always be used successfully in the clinical setting. However, it is worth a try. If the program crashes or takes too long to run, k can be progressively reduced until the program works. Be aware, however, that as k is reduced, the risk of producing biased replacements increases.
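To make the idea concrete, here is a minimal base R sketch of nearest-neighbour replacement for numerical columns only. It is a simplified illustration under stated assumptions, not the function used in the R Codes panel: the helper name knn_impute is hypothetical, distances are Euclidean on standardised columns, and text columns are not handled.

```r
# Hypothetical helper for illustration; all columns must be numerical
knn_impute <- function(df, k = 10) {
  complete <- df[complete.cases(df), , drop = FALSE]
  if (nrow(complete) < k) stop("fewer than k complete records; reduce k")

  # Standardise with the complete records so no column dominates the distance
  centre <- colMeans(complete)
  spread <- apply(complete, 2, sd)
  z_complete <- scale(complete, center = centre, scale = spread)

  out <- df
  for (i in which(!complete.cases(df))) {
    rec   <- unlist(df[i, ])
    miss  <- is.na(rec)
    z_rec <- (rec - centre) / spread
    # Euclidean distance to each complete record, using observed columns only
    diffs   <- sweep(z_complete[, !miss, drop = FALSE], 2, z_rec[!miss])
    nearest <- order(sqrt(rowSums(diffs^2)))[seq_len(k)]
    # Replace each missing value with the mean of its k nearest neighbours
    for (j in which(miss)) out[i, j] <- mean(complete[nearest, j])
  }
  out
}

# Invented data: y is exactly 2 * x, and the last record has y missing
df <- data.frame(x = c(1:10, 5), y = c(2 * (1:10), NA))
imputed <- knn_impute(df, k = 3)
print(imputed[11, ])   # y is the mean of the neighbours' values 8, 10 and 12
```

The stop() call mirrors the first issue noted above: when fewer than k complete records exist, k has to be reduced before the method can run.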

Option 3. General Imputation

The program randomly selects a missing value and replaces it with an estimate obtained from the available data by multiple regression. The estimate is then included in the available data to estimate the next randomly selected missing value. This process is repeated until all missing values are replaced by estimated (imputed) values.

As later estimates are influenced by earlier estimated values, the results differ slightly depending on the random sequence. The program copes with this by repeating the process a number of times (m) and averaging the results. The number of iterations (m) can be specified by the user. If not specified, the default is m=5. Controversy exists as to what m should be, and some statisticians argue that m should be the same as the number of missing values in the data.

This method is most suited to the small data sets that are common in clinical studies, especially surveys and clinical trials where the sample size is around 100.

The only proviso is that at least one (1) numerical column must exist in the data set for the algorithm to work.

Users should also be aware that the same program and data will produce slightly different results when repeated, as the random sequence is generated at run time and so is different each time.
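The process described above can be sketched in base R as follows. This is a simplified illustration, not the routine used in the R Codes panel: the helper name impute_once and the data are hypothetical, the sketch works column by column rather than value by value, and it handles numerical columns only. Calling set.seed fixes the random sequence, which is one way to make repeated runs give the same result.

```r
set.seed(42)  # fix the random sequence so that repeated runs agree

# Invented data set with one missing value in each column
df <- data.frame(
  age = c(25, 32, 47, 51, NA, 38, 29, 60),
  sbp = c(118, 121, 135, NA, 128, 125, 119, 142),
  wt  = c(60, 72, NA, 80, 77, 70, 65, 85)
)

# Hypothetical helper: one pass of regression imputation over all columns
impute_once <- function(df) {
  filled <- df
  # Start from column means so every regression has complete predictors
  for (j in names(df)) {
    filled[[j]][is.na(df[[j]])] <- mean(df[[j]], na.rm = TRUE)
  }
  for (j in sample(names(df))) {            # visit the columns in random order
    miss <- is.na(df[[j]])
    if (!any(miss)) next
    model <- reformulate(setdiff(names(df), j), response = j)
    fit   <- lm(model, data = filled[!miss, , drop = FALSE])
    pred  <- predict(fit, newdata = filled[miss, , drop = FALSE])
    # Add residual noise so imputed values carry realistic scatter
    filled[[j]][miss] <- pred + rnorm(sum(miss), 0, summary(fit)$sigma)
  }
  filled
}

m <- 5                                       # number of repetitions
runs <- lapply(seq_len(m), function(i) impute_once(df))
averaged <- Reduce(`+`, runs) / m            # average the m completed tables
print(averaged)
```

Observed values are never altered, so they are identical in every run and in the average; only the imputed cells vary between runs.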

Option 4. Numerical Imputation

The program uses the existing values in the same column of the data set to calculate a replacement for each missing value. The methods available are mean, median, mode, and interpolation (the average of the available values on the two sides of the missing value).

The method only works in columns of numerical data, and ignores missing values in text columns. It is quick to implement and the results are easy to interpret. It can be used if such mathematical replacement is appropriate to the analyst's needs.

The interpolation method is especially useful in time series data such as continuous monitoring, as the interpolation result is close to what the missing data should be.
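A minimal base R sketch of mean, median and interpolation replacement on a single numerical column is shown below. The heart rate values are invented, and the sketch uses linear interpolation through the base function approx, which may differ in detail from the routine in the R Codes panel.

```r
# Invented monitoring series with three missing readings
hr <- c(72, 75, NA, 81, 80, NA, NA, 74)

# Replace missing values with the column mean or median
mean_filled   <- ifelse(is.na(hr), mean(hr, na.rm = TRUE), hr)
median_filled <- ifelse(is.na(hr), median(hr, na.rm = TRUE), hr)

# Linear interpolation between the observed values on either side
time_index <- seq_along(hr)
observed   <- !is.na(hr)
interp <- approx(time_index[observed], hr[observed], xout = time_index)$y

print(rbind(mean_filled, median_filled, interp))
```

With its default settings approx does not extrapolate, so missing values at the very start or end of a series would remain NA in this sketch.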

Additional information

Checking the results

It is important to check the results of fixing missing data before the data set is used for analysis. There are numerous methods for doing so, but they are not covered in this page. The example code provides basic comparisons using the summary command, which counts the different values in text columns (when these are stored as factors) and reports the minimum, maximum, quartiles, median and mean of numerical columns, together with the number of missing values. Standard deviations are not part of the summary output and can be obtained separately with the sd function.
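A minimal sketch of such a before-and-after comparison is shown below, using invented data and mean replacement as the example fix. Comparing the standard deviations illustrates why checking matters: replacing missing values with the mean leaves the mean unchanged but shrinks the spread.

```r
# Invented data: two of the six scores are missing
before <- data.frame(
  group = factor(c("A", "B", "A", "B", "A", "B")),
  score = c(10, 12, NA, 15, 11, NA)
)

# Example fix: replace missing scores with the mean of the observed scores
after <- before
after$score[is.na(after$score)] <- mean(before$score, na.rm = TRUE)

print(summary(before))   # includes the count of NA values in score
print(summary(after))

# Standard deviation is not reported by summary, so compute it separately
sd_before <- sd(before$score, na.rm = TRUE)
sd_after  <- sd(after$score)
print(c(before = sd_before, after = sd_after))
```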

