As you may know, my experience with data analytics is in behavioral health and general health care on the periphery of academia. IBM’s SPSS has always been the primary program I have used to run analytics, and as a point-and-click program it has done the job well. I now have some time on my hands, and it is no secret that one of the primary languages used in analytics these days is not a new one: the open-source language R.
The National Health and Nutrition Examination Survey (NHANES) is a program of studies that assesses the health and nutrition of children and adults in the United States. It ran as a series of periodic surveys between 1971 and 1994 and has been conducted continuously since 1999. The survey data are well curated and come with a codebook.
I started with the Body-Measurements Data (DS12), which come as a tab-delimited .tsv file, and converted that file to .csv (comma-separated values) before importing it into R.
> getwd()
[1] "/Users/David/Desktop/Working Directory - R"
> DS0012 <- read.csv("ICPSR_25505/DS0012/25505-0012-Data.csv", header=TRUE)
I first checked my working directory with getwd() to make sure I knew where my files were coming from. I then pointed read.csv() at my .csv file (25505-0012-Data.csv) and its location, told R that the data have headers (header=TRUE), and assigned the result to a data set named DS0012.
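As an aside, the conversion to .csv is optional, since base R can read tab-delimited files directly with read.delim(). The sketch below is only an illustration: it writes a tiny made-up file to a temporary location (the values are invented, not NHANES data) and reads it back.

```r
# Write a tiny tab-delimited file to a temporary location.
# The values are made up for illustration; they are not NHANES data.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("SEQN\tBMXWT\tBMXHT",
             "1\t70.5\t172.3",
             "2\t\t160.1"),      # the empty weight field is read as NA
           tsv_path)

# read.delim() is read.table() with sep = "\t" and header = TRUE as defaults.
demo <- read.delim(tsv_path)

dim(demo)            # rows, then columns
is.na(demo$BMXWT)    # shows which weights are missing
```

The same call pointed at the original .tsv file should give the same data frame as the read.csv() route, assuming the file has a header row.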
> View(DS0012)
> dim(DS0012)
[1] 9762   65
With the View(DS0012) command, we can pull the data set open in RStudio.
And we can see that we have 9,762 records with 65 variables per record. We are not going to look at all of the variables at once; instead, we will keep only the variables that answer our questions of interest.
> DS0012 <- DS0012[, c("SEQN", "RIAGENDR", "RIDAGEYR", "BMXWT", "BMXHT")]
> dim(DS0012)
[1] 9762    5
> summary(DS0012)
> DS0012 <- subset(DS0012, !(is.na(BMXWT) | is.na(BMXHT)))
> dim(DS0012)
[1] 8861    5
Here we reduce the data set to our five variables of interest (SEQN: Sequence; RIAGENDR: Gender; RIDAGEYR: Age; BMXWT: Weight in kg; BMXHT: Standing Height in cm) and drop the records that are missing weight or height. That costs us 901 records, about 9.2% of our “data”, and leaves us with 8,861 “clean” records.
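To see the column selection and missing-data filter in isolation, here is a small sketch on a made-up data frame (the toy values and the OTHER column are invented for illustration). It uses complete.cases() as an alternative to the is.na() test; both should keep the same rows. The last lines recompute the share of records lost from the two dim() results, 901 of 9,762.

```r
# Toy data frame with made-up values; OTHER stands in for the unused columns.
toy <- data.frame(
  SEQN  = 1:4,
  BMXWT = c(70.5, NA, 82.0, 55.3),
  BMXHT = c(172.3, 160.1, NA, 158.0),
  OTHER = c("a", "b", "c", "d")
)

# Keep only the columns of interest.
toy <- toy[, c("SEQN", "BMXWT", "BMXHT")]

# Keep rows where both weight and height are present.
toy_clean <- toy[complete.cases(toy[, c("BMXWT", "BMXHT")]), ]
nrow(toy_clean)   # two of the four toy rows survive

# Share of the real records dropped, from the dim() output: 9762 before, 8861 after.
pct_lost <- 100 * (9762 - 8861) / 9762
round(pct_lost, 1)
```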
We can also use the following commands to rename our variables in a more user-friendly way:
> names(DS0012) <- c("id", "gender", "age", "weight", "height")
> summary(DS0012)
> DS0012 = transform(DS0012, BMI = DS0012$weight / (DS0012$height/100)^2)
Finally, the transform() command above creates a new variable named “BMI”, which is equal to one’s weight in kilograms divided by the square of one’s height in meters. As our height variable is in centimeters, we need to divide it by 100 to convert to meters (god bless the metric system).
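As a quick sanity check on the formula, here is a minimal sketch with made-up numbers. The helper name bmi_from_cm is my own invention for illustration and is not part of the commands above.

```r
# Hypothetical helper: BMI from weight in kg and height in cm.
bmi_from_cm <- function(weight_kg, height_cm) {
  height_m <- height_cm / 100   # convert centimeters to meters
  weight_kg / height_m^2        # kg divided by meters squared
}

# A 70 kg person who is 175 cm tall: 70 / 1.75^2
bmi_from_cm(70, 175)

# The function is vectorized, so it works on whole columns as well.
bmi_from_cm(c(70, 90), c(175, 180))
```

The first call should come out just under 22.9, which is a handy value to compare against when checking the BMI column in the real data.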
I will be back tomorrow to start running some analysis and generating some graphics from this cleaned data!