Week 4 - Project - Exploring the BRFSS data by Akshay Kotha

Refer to /brfss_codebook.html for details on the BRFSS variables.

0.13 Setup

0.13.1 Load packages

library(ggplot2)
library(dplyr)
library(statsr)

0.13.2 Load data

load("brfss2013.RData")

0.14 Part 1: Data

The data contain two types of observations: (1) data collected via landline telephone interviews and (2) data collected via cellular phone interviews. All the responses were self-reported, which is similar to volunteering to answer the questionnaire. For the landline interviews, disproportionate stratified sampling (DSS) was used, presumably to make sure that the entire population is represented. For the cellular phone interviews, it is mentioned that random sampling took place. Based on this, the results of the analysis can be generalized to the US population, or to a population with similar characteristics, because the respondents were randomly sampled across all the states of the US. In both cases causality cannot be inferred: this is an observational study, it is subject to non-response bias, and nowhere is it mentioned that random assignment was done.
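Under disproportionate stratified sampling the strata are sampled at different rates, so an estimate only reflects the population when each stratum is weighted back to its population share. The sketch below illustrates this with made-up numbers; the stratum sizes, sample sizes and means are illustrative assumptions, not BRFSS figures.

```r
# Hypothetical strata: all numbers are invented for illustration
strata <- data.frame(
  stratum     = c("A", "B"),
  pop_size    = c(9000, 1000),  # assumed population size per stratum
  sample_size = c(100, 100),    # equal samples, so stratum B is oversampled
  sample_mean = c(10, 20)       # assumed sample mean of some measurement
)

# Naive estimate: pools all sampled units and ignores the unequal sampling rates
unweighted_mean <- with(strata, sum(sample_size * sample_mean) / sum(sample_size))

# Weighted estimate: each stratum counts in proportion to its population share
weighted_mean <- with(strata, sum(pop_size * sample_mean) / sum(pop_size))

print(c(unweighted = unweighted_mean, weighted = weighted_mean))
```

The analysis in this report works with unweighted counts, so its percentages describe the respondents rather than weighted population estimates.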

0.15 Part 2: Research questions

0.15.1 Research question 1:

What is the relation between height (htin4), weight (wtkg3) and ‘joinpain’ (how bad was the joint pain)? Describe the distributions of height in inches (htin4) and of weight (wtkg3) across the levels of ‘joinpain’, and state which of them are skewed (positively or negatively, i.e. right or left).

This is of interest because it helps decide which variables to use for predictive modelling. If a variable is highly skewed, it may have to be transformed before it is used, in order to get accurate predictions.
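One way to make "highly skewed" concrete is a moment-based skewness coefficient, compared before and after a transformation. This is only a sketch: the `skewness` helper and the values in `x` are hypothetical and are not taken from the BRFSS data, and the log transform is just one common choice.

```r
# Moment-based sample skewness: positive values indicate a longer right tail
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Invented right-skewed values (one large observation), not BRFSS data
x <- c(1, 2, 2, 3, 3, 3, 4, 20)

skew_raw <- skewness(x)
skew_log <- skewness(log(x))  # the log transform pulls in the right tail

print(c(raw = skew_raw, log = skew_log))
```

A quicker check that needs no helper is to compare the mean with the median: for these values the mean lies above the median, which is typical of right skew.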

0.15.2 Research question 2:

What is the relation between having coronary heart disease (cvdcrhd4) and having been diagnosed with a heart attack (cvdinfr4), and does it differ between the states with the maximum and the minimum number of respondents?

Knowing whether two ailments of the same organ are associated would help decide whether they have to be treated separately or together. Checking whether the association varies across states shows whether it matters that people are located in one state rather than another, that is, whether the ‘_state’ variable has any association with it. For instance, if the association really varies between states, more variables can be considered, from within the data or externally, during the causal analysis.

0.15.3 Research question 3:

Are the frequency of feeling depressed in the past 30 days (misdeprd) and difficulty in concentrating or remembering (decide) associated or dependent?

Finding an association is useful even though this is an observational study, because it might eventually help in finding stronger evidence for causality (only after carrying out randomized experiments, not from this study alone).

0.16 Part 3: Exploratory data analysis

0.16.1 Research question 1: Code

#checking the type of variable
str(brfss2013$wtkg3)

##  int [1:491775] 11340 5761 7257 5806 12020 10206 4808 NA 10659 7711 ...

str(brfss2013$htin4)

##  int [1:491775] 67 70 64 64 72 63 60 65 74 65 ...

str(brfss2013$joinpain)

##  int [1:491775] 7 NA 5 NA NA NA 3 8 4 NA ...

#creating a new df so that there are no 'NA' values in the variables under consideration
brfss_joinpain <- brfss2013 %>%
  filter(!is.na(joinpain), !is.na(htin4), !is.na(wtkg3)) %>%
  #Assumption: the calculated weight variable wtkg3 divided by 100 gives the weight in kg, hence the new variable wtkg3_actual
  mutate(wtkg3_actual = wtkg3/100)

#converting the int values of 'joinpain' to a factor so that its levels are treated as ordered categories
brfss_joinpain[, 'joinpain']  <- factor(brfss_joinpain[,'joinpain'])
str(brfss_joinpain$joinpain)

##  Factor w/ 11 levels "0","1","2","3",..: 8 6 4 5 9 6 5 3 8 8 ...

#To get an idea of summary statistics of heights
brfss_joinpain %>%
  group_by(joinpain) %>%
  summarise(count = n(),
            mean_height = mean(htin4), median_ht = median(htin4),
            min_ht = min(htin4), max_ht = max(htin4),
            iqr_ht = IQR(htin4), sd_ht = sd(htin4), var_ht = var(htin4))

## # A tibble: 11 x 9
##    joinpain count mean_height median_ht min_ht max_ht iqr_ht sd_ht var_ht
##    <fct>    <int>       <dbl>     <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>
##  1 0        10658        66.4        66     40     87      7  4.18   17.5
##  2 1         8716        66.7        66     48     83      6  4.09   16.7
##  3 2        14330        66.7        66      2     81      6  4.08   16.6
##  4 3        17800        66.5        66     36     92      6  4.03   16.3
##  5 4        15440        66.2        66     48     83      6  4.03   16.2
##  6 5        22675        65.8        65     41     87      5  4.00   16.0
##  7 6        12169        66.0        66     45     85      6  4.10   16.8
##  8 7        13073        66.4        66     48   6123      6 53.1  2823. 
##  9 8        14739        65.6        65     38     86      5  4.08   16.7
## 10 9         3728        65.5        65     50     84      5  4.07   16.6
## 11 10        8252        65.0        64     48     90      5  3.95   15.6

#To get an idea of summary stats of weights (kg)
brfss_joinpain %>%
  group_by(joinpain) %>%
  summarise(count = n(),
            mean_wt = mean(wtkg3_actual), median_wt = median(wtkg3_actual),
            min_wt = min(wtkg3_actual), max_wt = max(wtkg3_actual),
            iqr_wt = IQR(wtkg3_actual), sd_wt = sd(wtkg3_actual), var_wt = var(wtkg3_actual))

## # A tibble: 11 x 9
##    joinpain count mean_wt median_wt min_wt max_wt iqr_wt sd_wt var_wt
##    <fct>    <int>   <dbl>     <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>
##  1 0        10658    79.7      77.1  25.0    231.   25.4  20.3   410.
##  2 1         8716    79.4      77.1  36.3    209.   25.0  19.2   368.
##  3 2        14330    81.1      79.4   0.02   213.   22.7  19.4   378.
##  4 3        17800    81.7      79.4  31.8    290.   24.5  20.1   403.
##  5 4        15440    82.0      79.4  25.0    250    24.9  21.0   441.
##  6 5        22675    81.3      78.5  22.7    272.   24.0  20.9   435.
##  7 6        12169    83.9      81.6  24.5    261.   27.2  22.6   511.
##  8 7        13073    85.3      81.6   0.02   272.   29.5  23.1   532.
##  9 8        14739    84.8      81.6  24.5    263.   29.5  23.7   562.
## 10 9         3728    86.8      82.6  36.3    207.   29.9  24.6   604.
## 11 10        8252    84.4      81.2  24.0    272.   29.0  24.0   578.

#plot probability distributions by categorical level of the variable 'joinpain' for height
#dim(brfss_joinpain)
ggplot(brfss_joinpain, aes(x = htin4, colour = joinpain)) + geom_density() + labs(x = "height (inches)", y = "prob_density")

#plot probability distributions by categorical level of the variable 'joinpain' for weight

ggplot(brfss_joinpain, aes(x = wtkg3_actual, colour = joinpain)) + geom_density() + labs(x = "weight (kg)", y = "prob_density")

0.16.2 Narrative of question 1:

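On their own scales, the two plots above suggest that the distributions of heights and weights across the categories of the joinpain variable are both right skewed. The summary statistics calculated before the plots support this for weight, where the mean exceeds the median at every pain level. For height the means and medians are close, and much of the long right tail comes from implausible values such as the maximum of 6123 inches at pain level 7, which are most likely data-entry errors. Both variables need further adjustment, via normalization (not understood completely, out of scope for this course) or other techniques, before they are used in predictive models for higher accuracy.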
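The summary tables above contain extremes such as a maximum height of 6123 inches, a minimum height of 2 inches and a minimum weight of 0.02 kg, which stretch the axes of the density plots. A minimal sketch of screening such values before plotting follows; the data frame `toy` and the limits in `ht_range` and `wt_range` are hypothetical choices for illustration, not values from the codebook.

```r
# Toy rows mimicking the kind of extremes seen in the summary tables; not real BRFSS rows
toy <- data.frame(
  htin4        = c(66, 64, 2, 70, 6123, 68),
  wtkg3_actual = c(77.1, 0.02, 81.6, 90.7, 79.4, 85.3)
)

# Hypothetical plausibility limits; adjust after consulting the codebook
ht_range <- c(36, 96)    # inches
wt_range <- c(20, 300)   # kg

# Keep a row only if both height and weight fall inside their limits
plausible <- toy$htin4 >= ht_range[1] & toy$htin4 <= ht_range[2] &
  toy$wtkg3_actual >= wt_range[1] & toy$wtkg3_actual <= wt_range[2]

toy_clean <- toy[plausible, ]
print(toy_clean)
```

The same condition could be added to the `filter()` call that builds `brfss_joinpain`, so that the summaries and the plots are based on the screened rows.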

0.16.3 Research question 2: Code

#check variable type
str(brfss2013$cvdinfr4)

##  Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...

str(brfss2013$cvdcrhd4)

##  Factor w/ 2 levels "Yes","No": NA 2 2 2 2 2 2 1 2 2 ...

str(brfss2013$X_state)

##  Factor w/ 55 levels "0","Alabama",..: 2 2 2 2 2 2 2 2 2 2 ...

#'_state' is not allowed as a variable name in R, so look up the name it was replaced with
grep("state", names(brfss2013), value = TRUE)

## [1] "X_state"  "stateres" "cstate"

# number of respondents grouped by state
statewise_count <- brfss2013 %>%
  group_by(X_state) %>%
  summarise(count = n())

#sorting the states in descending order of respondents
statewise_count %>%
  arrange(desc(count))

## # A tibble: 55 x 2
##    X_state       count
##    <fct>         <int>
##  1 Florida       33668
##  2 Kansas        23282
##  3 Nebraska      17139
##  4 Massachusetts 15071
##  5 Minnesota     14340
##  6 New Jersey    13776
##  7 Colorado      13649
##  8 Maryland      13011
##  9 Utah          12769
## 10 Michigan      12761
## # ... with 45 more rows

# new dataframe creation for the purpose of answering this specific question
brfss_heartattack <- brfss2013 %>%
  filter(!is.na(cvdinfr4),!is.na(cvdcrhd4), X_state %in% c("Florida","Guam")) 

#New variable to get the percentage of heart attack diagnoses within each state and coronary heart disease group
brfss_heartattack %>%
  #cvdinfr4 goes last: summarise() drops the last grouping variable, so sum(count) below is taken within each (X_state, cvdcrhd4) group
  group_by(X_state, cvdcrhd4, cvdinfr4) %>%
  summarise(count = n()) %>%
  mutate(percentage_count = 100 * count/sum(count)) %>%
  #plot coronary heart disease along the x axis, cvdinfr4 in a different colour
  ggplot(aes(x = cvdcrhd4, y = percentage_count, fill = cvdinfr4)) +
  #plot the cvdinfr4 bars side by side for each value of cvdcrhd4 ("dodge", alternatively use "stack")
  geom_bar(stat = "identity", position = "dodge") +
  # To split across the states considered
  facet_wrap(~X_state) +
  # Beautify with color codes
  scale_fill_manual("Condition", values = alpha( c("firebrick", "dodgerblue4"), 1) ) +
  labs(x = "Ever Diagnosed With Angina Or Coronary Heart Disease", y = "percentage diagnosed with heart attack based on x-variable")

0.16.4 Narrative of question 2:

The X_state variable does not appear to be associated with the link between coronary heart disease and heart attack: whether the state has the maximum number of respondents (Florida) or the minimum (Guam), the proportion of those diagnosed with coronary heart disease who were also diagnosed with a heart attack is almost the same. Implicitly, there is an association between coronary heart disease status and having had a heart attack. Hence the x-variable (cvdcrhd4) is associated with the heart attack condition, but the state variable is not associated with the relation between the two. Summary statistics are out of scope for this question because it is an analysis between categorical variables.
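The comparison described above rests on percentages computed within each combination of state and coronary heart disease status. A base R sketch of that calculation is given here; the counts and the state labels are invented for illustration and are not BRFSS results.

```r
# Invented counts for two hypothetical states; not BRFSS results
counts <- data.frame(
  state    = rep(c("State A", "State B"), each = 4),
  cvdcrhd4 = rep(rep(c("Yes", "No"), each = 2), times = 2),
  cvdinfr4 = rep(c("Yes", "No"), times = 4),
  count    = c(40, 60, 30, 870, 4, 6, 3, 87)
)

# Percentage of each heart attack answer within every (state, cvdcrhd4) group
counts$pct <- with(counts, 100 * count / ave(count, state, cvdcrhd4, FUN = sum))

print(counts)
```

In this toy table the two states have very different numbers of respondents but identical percentages within each coronary heart disease group, which is the pattern the narrative describes for Florida and Guam.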

0.16.5 Research question 3: Code

#checking variable type
str(brfss2013$misdeprd)

##  Factor w/ 5 levels "All","Most","Some",..: 5 5 5 5 NA NA NA NA NA NA ...

str(brfss2013$decide)

##  Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...

#data frame without NAs for the variables under consideration
brfss_dep_conc <- brfss2013 %>%
  filter(!is.na(misdeprd),!is.na(decide)) 
  
  #group_by(misdeprd, decide) %>%
  #summarise(n())

#stacked bar plot
ggplot(brfss_dep_conc, aes(x = misdeprd)) + geom_bar(aes(fill = decide), position = 'fill') + labs(x = "How Often Feel Depressed Past 30 Days", y = "proportion of Yes/No for having difficulty concentrating")

0.16.6 Narrative of question 3:

The stacked bar plot above shows a clear association (dependency) between decide and misdeprd. The proportion of people who did not face difficulty in concentrating or remembering (decide) increases from left to right across the categories of the variable on the x-axis (how often people felt depressed in the last 30 days). Summary statistics are out of scope for this question because it is an analysis between categorical variables.
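The proportions that a `position = 'fill'` bar plot draws can also be read off a table of row proportions. The sketch below uses invented counts, and the labels "Level4" and "Level5" are placeholders for the two misdeprd levels that the `str()` output above truncates.

```r
# Invented counts; rows follow the misdeprd levels, columns the decide answers
tab <- matrix(
  c(60, 40,
    50, 50,
    35, 65,
    20, 80,
    10, 90),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    misdeprd = c("All", "Most", "Some", "Level4", "Level5"),  # last two are placeholder labels
    decide   = c("Yes", "No")
  )
)

# Row proportions: share of Yes/No answers within each misdeprd level
row_prop <- prop.table(tab, margin = 1)
print(round(row_prop, 2))
```

With the real data, `table(brfss_dep_conc$misdeprd, brfss_dep_conc$decide)` would supply the counts in place of the invented matrix.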