Introduction to R: Importing Data

As you may know, my experience with data analytics is in behavioral health and general health care, on the periphery of academia. IBM’s SPSS has always been my primary analytics program, and as a point-and-click program it has done the job well. I now have some time on my hands, and it is no secret that one of the primary languages used in analytics these days is R, an open-source language that is hardly new.

The National Health and Nutrition Examination Survey (NHANES) is a group of studies that assess the health and nutrition of children and adults in the United States, with data collected between 1971 and 1994. The data from this survey have been well curated and come with a codebook.

I downloaded the Body Measurements data (DS12), which come as a tab-delimited .tsv file, and converted that file to a .csv (comma-separated values) file before importing it into R.
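The conversion step is optional, since base R can read tab-delimited files directly. A minimal sketch, assuming a tiny made-up file written to a temporary location (the values are invented for illustration and are not NHANES records; `tsv_path` and `toy` are illustrative names):

```r
# Write a tiny tab-delimited file to a temporary location (made-up values).
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("SEQN\tBMXWT\tBMXHT",
             "1\t70.5\t172.0",
             "2\t\t160.3"),
           tsv_path)

# read.delim() is read.table() with sep = "\t" and header = TRUE as defaults.
toy <- read.delim(tsv_path)

dim(toy)          # number of rows and columns
is.na(toy$BMXWT)  # the empty weight field in row 2 is read as NA
```

For the real download, the same call would take the path to the .tsv file in place of `tsv_path`.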

> getwd()
[1] "/Users/David/Desktop/Working Directory - R"
> DS0012<-read.csv("ICPSR_25505/DS0012/25505-0012-Data.csv",header=TRUE)

I first checked my working directory with getwd() to make sure I knew where my files were coming from. I then gave read.csv() the location of my .csv file (25505-0012-Data.csv) relative to that directory, told R that my data have headers (header=TRUE), and stored the result as a data set labeled DS0012.
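One way to catch a wrong path before calling read.csv() is to test it with file.exists(). A small sketch, using a temporary file with invented contents in place of the real download (`csv_path` and `toy` are illustrative names):

```r
# Create a small CSV in a temporary directory (invented contents).
csv_path <- file.path(tempdir(), "toy-data.csv")
writeLines(c("SEQN,BMXWT", "1,70.5", "2,61.2"), csv_path)

# Confirm the file is where we think it is before reading it.
file.exists(csv_path)

# header = TRUE tells read.csv() that the first line holds column names.
toy <- read.csv(csv_path, header = TRUE)
names(toy)
```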

> View(DS0012)
> dim(DS0012)
[1] 9762 65

With the View(DS0012) command, we can pull the data set open in RStudio.

We can see that we have 9,762 records with 65 variables per record. We are not going to look at all of the variables at once; instead, we will keep only the variables that answer our questions of interest.

> DS0012 = DS0012[,c("SEQN", "RIAGENDR", "RIDAGEYR", "BMXWT", "BMXHT")]
> dim(DS0012)
[1] 9762 5
> summary(DS0012)
> DS0012 = subset(DS0012, !(is.na(BMXWT) | is.na(BMXHT)))
> dim(DS0012)
[1] 8861 5

Here we reduce the data set to our five variables of interest (SEQN: Sequence; RIAGENDR: Gender; RIDAGEYR: Age; BMXWT: Weight in kg; BMXHT: Standing Height in cm) and drop the records that are missing weight or height. That removes 901 of the 9,762 records, about 9.2% of our “data”, and leaves us with 8,861 “clean” records.
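The same two steps can be tried on a toy data frame. This is only a sketch with invented values: `complete.cases()` is swapped in as an alternative to the `is.na()` test used above, and `toy`, `keep`, `EXTRA`, and `pct_lost` are illustrative names.

```r
# Toy data with one extra column and some missing measurements (made-up values).
toy <- data.frame(
  SEQN     = 1:5,
  RIAGENDR = c(1, 2, 2, 1, 2),
  RIDAGEYR = c(34, 8, 61, 45, 19),
  BMXWT    = c(80.1, NA, 65.0, 90.2, 55.4),
  BMXHT    = c(178.0, 130.5, NA, 182.3, 160.0),
  EXTRA    = c(30.1, 20.2, 28.0, 33.5, 25.0)  # hypothetical column to discard
)

# Keep only the variables of interest.
keep <- c("SEQN", "RIAGENDR", "RIDAGEYR", "BMXWT", "BMXHT")
toy <- toy[, keep]

# Drop rows where weight or height is missing.
n_before <- nrow(toy)
toy <- toy[complete.cases(toy[, c("BMXWT", "BMXHT")]), ]
n_after <- nrow(toy)

# Percentage of records lost to missing data.
pct_lost <- 100 * (n_before - n_after) / n_before
pct_lost
```

The last formula works the same way on the row counts that dim() reports before and after the subset.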

We can also use the following command to rename our variables in a more user-friendly way:

> names(DS0012) <- c("id", "gender", "age", "weight", "height")
> summary(DS0012)
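Assigning to names() relies on the columns being in exactly the order listed. A sketch of a name-based alternative, using a lookup vector (`new_names` is an illustrative name, and the toy values are invented):

```r
# Toy data frame with the five original column names (made-up values).
toy <- data.frame(SEQN = 1:2, RIAGENDR = c(1, 2), RIDAGEYR = c(34, 8),
                  BMXWT = c(80.1, 30.2), BMXHT = c(178.0, 130.5))

# Map each original name to its friendlier replacement.
new_names <- c(SEQN = "id", RIAGENDR = "gender", RIDAGEYR = "age",
               BMXWT = "weight", BMXHT = "height")

# Look up each current column name, so column order does not matter.
names(toy) <- unname(new_names[names(toy)])
names(toy)
```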

> DS0012 = transform(DS0012, BMI = DS0012$weight / (DS0012$height/100)^2)

Finally, we can use the equation above to create a new variable named “BMI”, which is equal to one’s weight in kilograms divided by the square of one’s height in meters. Because our height variable is in centimeters, we need to divide it by 100 to convert to meters (god bless the metric system).
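As a quick check of the formula, here is a sketch with one made-up person weighing 70 kg and standing 175 cm tall (`toy` is an illustrative name):

```r
toy <- data.frame(weight = 70, height = 175)  # kg and cm, invented values

# Inside transform(), columns can be referred to by name.
toy <- transform(toy, BMI = weight / (height / 100)^2)

round(toy$BMI, 1)  # 70 / 1.75^2, which rounds to 22.9
```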

I will be back tomorrow to start running some analyses and generating some graphics from these cleaned data!

About dwmaasberg

Memories are physical connections between neurons. I think that is pretty cool!
This entry was posted in R, Statistics.