I am going to work with a little bit of voter data from this election cycle, courtesy of the Pew Research Center, a nonpartisan think tank that allows downloads of its proprietary data for academic and public use.
The April 2016 Politics and Foreign Policy Survey data is what I downloaded from Pew. It comes in a compressed file that has the transcript from the phone questionnaire, the .sav data set (SPSS Package), and some other documentation.
> require(foreign)
Loading required package: foreign
> pew.data <- read.spss("/Users/David/Desktop/ Working Directory - R/PewData/April16public.sav", use.value.labels = TRUE, to.data.frame = TRUE)
> summary(pew.data)
> View(pew.data)
> dim(pew.data)
[1] 2008  197
As the dimensions of the imported data set indicate, there are 2008 rows (respondents) and 197 columns (survey response options).
That is a significant amount of information to collect in one telephone survey. It covers domestic policy (e.g., healthcare, economy, jobs, immigration, climate change), foreign policy (e.g., terrorism, international trade, foreign relations), approval ratings, and public perception.
In an effort to organize our data, I would like to create a new data frame called pew.data.health that contains the parts of the questionnaire related to healthcare as well as some demographic data (region, state, education, income, party affiliation, etc.).
> pew.data.health = pew.data[,c("respid", "cregion", "state", "usr", "educ", "q1", "q2", "q10a", "q10b", "q54", "q55", "q56", "q57", "q58")]
> View(pew.data.health)
> dim(pew.data.health)
[1] 2008   14
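When selecting columns by name like this, it can help to confirm that every requested column actually exists before subsetting. The following is only a minimal sketch on a made-up data frame: `toy`, `wanted`, and the `party` column are hypothetical stand-ins, not names taken from the Pew file.

```r
# Toy stand-in for pew.data; the real survey has 197 columns.
toy <- data.frame(
  respid = 1:4,
  state  = c("UT", "NY", "UT", "CA"),
  income = c("<$10,000", "$50-75k", "$50-75k", "<$10,000"),
  q1     = c("Approve", "Disapprove", "Approve", "Approve")
)

# Columns we would like to keep; "party" is deliberately absent from toy.
wanted <- c("respid", "state", "income", "party", "q1")

# Report any requested columns that are missing instead of erroring out.
missing.cols <- setdiff(wanted, names(toy))
print(missing.cols)

# Keep only the requested columns that actually exist.
toy.sub <- toy[, intersect(wanted, names(toy)), drop = FALSE]
print(dim(toy.sub))
```

Checking with `setdiff()` first makes it obvious when a demographic column you meant to carry along did not make it into the subset.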
We can easily narrow our view of the data set, for example by cross-tabulating two of its categorical variables, such as income and approval ratings for then-President Barack Obama (miss you already).
> table(pew.data$income, pew.data$q1)  # income was not copied into pew.data.health, so use the full data set
> length(unique(pew.data.health$state))
[1] 51
> any(is.na(pew.data.health$state))
[1] FALSE
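Raw counts from table() are hard to compare across income groups of different sizes, so it can be easier to read the cross-tabulation as row percentages. Here is a small sketch using prop.table() on invented data; the `toy` data frame and its "Low"/"High" income labels are assumptions for illustration only, not the survey's actual categories.

```r
# Made-up responses: two income groups and an approval answer.
toy <- data.frame(
  income = c("Low", "Low", "Low", "High", "High", "High", "High", "Low"),
  q1     = c("Approve", "Approve", "Disapprove", "Approve",
             "Disapprove", "Disapprove", "Approve", "Approve")
)

# Cross-tabulate counts, then convert each row to percentages.
counts  <- table(toy$income, toy$q1)
row.pct <- round(100 * prop.table(counts, margin = 1), 1)
print(row.pct)
```

With margin = 1, each row sums to 100, so every income group can be read as "percent who approve" versus "percent who disapprove."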
This is a sample of 2008 respondents, and I would imagine that responses vary significantly depending on the state in which a respondent resides. (As you can see above, there are 51 unique values in the state column: 50 states and… something else. I checked for any “NA” values, and nothing came up. Moving on.)
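One way to track down the extra value would be to compare the state column against R's built-in state.abb vector of the 50 state abbreviations. This sketch assumes the column stores two-letter postal abbreviations; the `states` vector is invented, and the "DC" entry in it is purely an illustrative guess, not something confirmed from the Pew file.

```r
# Hypothetical state column: a few abbreviations plus one non-state value.
states <- c("UT", "NY", "DC", "CA", "UT")

# Anything left after removing the 50 built-in state abbreviations
# is the "something else".
extra <- setdiff(unique(as.character(states)), state.abb)
print(extra)
```

On the real data, the same idea would be applied to pew.data.health$state; as.character() is there so the comparison also works if the column was imported as a factor.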
Additionally, the table above does not really suggest any variation in approval ratings based on income (unless income is below $10,000, where 70.3% “Approve”). So let's see if there are variations based on state:
> UT.Data <- subset(pew.data, state == "UT")
> table(pew.data.health$q1)
> table(UT.Data$q1)
There are huge differences when we take state into account! Overall, 49.2% of respondents “Approve” of the way Barack Obama “is handling his job as President”. Only 22.5% of Utah residents polled feel the same way.
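The comparison above can be wrapped in a small helper that reports the approval share for any subset, along with how many respondents that subset contains. This is a sketch on fabricated data: `toy`, `approve.share`, and the answers in it are hypothetical and are not meant to reproduce the survey figures.

```r
# Fabricated respondents from three states.
toy <- data.frame(
  state = c("UT", "UT", "UT", "UT", "NY", "NY", "NY", "NY", "CA", "CA"),
  q1    = c("Disapprove", "Disapprove", "Disapprove", "Approve",
            "Approve", "Approve", "Disapprove", "Approve",
            "Approve", "Disapprove")
)

# Percentage of non-missing q1 answers that equal "Approve".
approve.share <- function(df) {
  round(100 * mean(df$q1 == "Approve", na.rm = TRUE), 1)
}

overall <- approve.share(toy)
utah    <- approve.share(subset(toy, state == "UT"))
n.utah  <- sum(toy$state == "UT")

print(c(overall = overall, utah = utah, n.utah = n.utah))
```

Printing the subset size next to the percentage is a useful habit, since a single state's share in a national sample may rest on only a handful of respondents.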