Cancer Mortality Analysis

We used a data set from data.world called the OLS Regression Challenge, which was aggregated from the 2013 American Community Survey (census.gov), ClinicalTrials.gov, and Cancer.gov. The data contain the average cancer mortality across United States counties from 2010 through 2016, along with demographic attributes such as income, ethnicity, education, and birth rate, to name a few.

Pre-processing

The original data set contained 3,047 observations representing the different counties in the United States, along with 33 demographic attributes. Since our project scope covers the states of California, South Carolina, and Illinois, we filtered the data to those states and integer-encoded the state variable. This reduced the data to 205 observations.

In addition, we:

  • removed binnedinc, because median income per capita binned by decile is neither a numeric nor a factor variable.

  • removed pctsomecol18_24, pctemployed16_over, and pctprivatecoveragealone, because they contained a large number of NA values.

  • extracted only the state information from geography, since the original source stored the county and state separated by a comma.
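The filtering and column cleanup above can be sketched with pandas. The lowercase column names follow the data set's convention; the particular integer codes assigned to the states are an arbitrary illustration:

```python
import pandas as pd

# Integer codes for the three in-scope states (the assignment is arbitrary).
STATES = {"California": 0, "Illinois": 1, "South Carolina": 2}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # geography holds "County, State"; keep only the state part.
    state = df["geography"].str.split(",").str[-1].str.strip()
    keep = state.isin(list(STATES))
    out = df[keep].copy()
    out["state"] = state[keep].map(STATES)
    # Drop the binned-income column and the mostly-NA predictors.
    return out.drop(columns=["geography", "binnedinc", "pctsomecol18_24",
                             "pctemployed16_over", "pctprivatecoveragealone"])
```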

Model Fitting

Since there were many candidate variables, we used AIC and BIC stepwise selection to find an optimal model under each criterion. We then searched for influential points in each model using Cook's distance and removed them from the data set.
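A minimal numpy sketch of the influential-point screen, using the common 4/n cutoff for Cook's distance (the project's exact cutoff is not stated, so that threshold is an assumption):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of an OLS fit of y on X (intercept added)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    p = X1.shape[1]
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    mse = resid @ resid / (n - p)
    # Leverage: diagonal of the hat matrix X1 (X1'X1)^{-1} X1'.
    h = np.einsum("ij,ji->i", X1, np.linalg.pinv(X1))
    return resid ** 2 / (p * mse) * h / (1 - h) ** 2

def drop_influential(X, y, cutoff=None):
    """Remove rows whose Cook's distance exceeds the cutoff (default 4/n)."""
    cutoff = cutoff if cutoff is not None else 4 / len(y)
    keep = cooks_distance(X, y) <= cutoff
    return X[keep], y[keep]
```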

We used Fitted vs. Residual and Q-Q plots to visually determine which model would be preferable. Based purely on the graphs, there is no clear winner, so we deferred to quantitative testing, comparing the models' RMSE, adjusted $R^2$, and the results of the Breusch-Pagan and Shapiro-Wilk tests.
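The two diagnostics can be sketched as follows. Breusch-Pagan is implemented directly from its definition (LM = n·R² from regressing the squared residuals on the predictors); Shapiro-Wilk comes from scipy:

```python
import numpy as np
from scipy import stats

def breusch_pagan(resid, X):
    """LM statistic and p-value for the Breusch-Pagan heteroskedasticity test."""
    n = len(resid)
    X1 = np.column_stack([np.ones(n), np.asarray(X).reshape(n, -1)])
    u = resid ** 2
    beta, *_ = np.linalg.lstsq(X1, u, rcond=None)
    fitted = X1 @ beta
    r2 = 1 - ((u - fitted) ** 2).sum() / ((u - u.mean()) ** 2).sum()
    lm = n * r2
    return lm, stats.chi2.sf(lm, X1.shape[1] - 1)

# Normality of residuals: stats.shapiro(resid) gives (statistic, p-value).
```

A small Breusch-Pagan p-value indicates heteroskedastic residuals; a small Shapiro-Wilk p-value indicates non-normal residuals.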

[Figure: Fitted vs. Residual and Q-Q plots]

In comparing the results, we found that most of the quantitative test results were the same. The AIC model's $R^2$ was higher, which is expected for a model with more predictors. Comparing their adjusted $R^2$ instead, we found a very small difference of 0.0073698.
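For reference, adjusted $R^2$ corrects plain $R^2$ for the number of predictors $p$:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},$$

where $n$ is the number of observations. Adding a predictor only raises $\bar{R}^2$ if it improves the fit by more than chance would, which is why it is the fairer comparison between the AIC and BIC models.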

Seeing that the models are very close, we opted for the BIC model, as it has fewer predictors than the more complex AIC model. Since the two models are in fact nested, we also ran an ANOVA test, which further confirmed that the smaller BIC model is preferred.
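The nested-model comparison that ANOVA performs is a partial F-test. A sketch from the residual sums of squares and residual degrees of freedom of the two fits (the numbers in the test are invented for illustration):

```python
from scipy import stats

def nested_f_test(rss_small, df_small, rss_big, df_big):
    """Partial F-test comparing a reduced model to a fuller model it is nested in.

    rss_* are residual sums of squares, df_* residual degrees of freedom.
    """
    f = ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
    p = stats.f.sf(f, df_small - df_big, df_big)
    return f, p
```

A small p-value says the extra predictors earn their keep; a large one favors the smaller model.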


Prediction and Discussion


We predicted three observations from different states and found the predictions very close to the actual target_deathrate; the remaining observations were also fairly close. We therefore consider this model useful for predicting mean per capita (per 100,000) cancer mortality.

The observed data show that counties with higher median income and education have lower mean per capita (per 100,000) cancer mortality. It may be that residents with higher incomes can afford expensive treatments, reducing mortality relative to lower-income residents. One might also extrapolate that education helps people lead healthier lifestyles.

From the coefficients we can infer that mean per capita (per 100,000) cancer diagnoses, the percentage of county residents who identify as Black, and the percentage of county residents with employee-provided private health coverage have a positive correlation with mean per capita (per 100,000) cancer mortality.

Similarly, from the coefficients we can infer that median income and the percentage of county residents ages 18-24 whose highest educational attainment is a bachelor's degree have a negative correlation with mean per capita (per 100,000) cancer mortality.

Early Flu Detection Using Social Media (Twitter Feed)

In this project, we explored the potential of using social media to detect a spike in influenza earlier (or later) than what would have been deemed flu season. View the GitHub repository here.

The general idea was to use tweets from previous years as a baseline to determine whether there is unusually high chatter on Twitter about the flu. Once collected, the tweets are analyzed for sentiment: positive, neutral, or negative.
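A toy illustration of the scoring step. The project used nltk's default tools; the word lists below are invented purely for the example:

```python
# Toy lexicons -- illustrative only, not the nltk defaults the project used.
POSITIVE = {"good", "great", "better", "recovered", "healthy"}
NEGATIVE = {"sick", "fever", "flu", "cough", "terrible", "worse"}

def sentiment(tweet: str) -> str:
    """Classify a tweet as positive, neutral, or negative by lexicon word counts."""
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```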

After gathering and analyzing historical data, we captured a live stream of tweets, which was then exported and analyzed to see whether there were significant differences from previous years. The ideal implementation of the system would be a dashboard that is updated every day.
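The comparison against previous years can be sketched as a simple threshold on daily tweet counts. The two-standard-deviation cutoff is an assumption; the project's exact criterion is not stated:

```python
import statistics

def unusual_chatter(baseline_counts, todays_count, z=2.0):
    """True if today's flu-tweet count exceeds the historical mean by z standard deviations."""
    mean = statistics.mean(baseline_counts)
    sd = statistics.stdev(baseline_counts)
    return todays_count > mean + z * sd
```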

Our findings indicated that the usefulness of a tool like this is very limited. First, the accuracy of the sentiment analysis is a big driver of the results; as expected, relying on the nltk library's defaults rather than a better corpus and training data did not yield satisfactory results. Second, social media posts are not a reliable source of truth, since most tweets and similar posts follow trends. However, there is potential to use the tool as a symptom marker, or as a trigger to examine whether there is indeed an increase in reported influenza cases from more reliable sources (hospitals, CDC reports, etc.).

[Figure: Comparison of Historical Data and New Streams]

Medicare Drug Spending

We created a dashboard that displays the cost of drugs under Medicare over the years 2011-2015. The tool was built to surface insights on sudden changes in cost through charts. Some of the preprocessing (aggregating generic drug data) was done with Hadoop to demonstrate knowledge of cloud computing applications.

This tool would be useful for anyone researching the increase in Medicare drug spending (Part D) and how it relates to the annual Part D premium increases published by the Social Security Administration.

Pre-processing Data and Dashboard Setup

The data we acquired were mostly clean and ready to use. However, we wanted to subset the data by generic and non-generic brands, and Hadoop was used for this step.
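The Hadoop step can be sketched in the Hadoop Streaming style as a mapper/reducer pair over tab-separated lines. The field layout here is an assumption about the source file, made up for illustration:

```python
from collections import defaultdict

def map_line(line):
    """Emit a (drug-year key, cost) pair for generic-drug rows; None otherwise."""
    drug, is_generic, year, cost = line.rstrip("\n").split("\t")
    if is_generic == "Y":
        return f"{drug}\t{year}", float(cost)
    return None

def reduce_pairs(pairs):
    """Sum costs per drug-year key, as the reduce phase would."""
    totals = defaultdict(float)
    for key, cost in pairs:
        totals[key] += cost
    return dict(totals)
```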

To handle on-demand user requests from the dropdown boxes, we created an API that returns the information for the chart visualizations. The dashboard and back end were set up on AWS EC2 and EBS, where the filtered data was stored, behind a load balancer. The front end was created using basic HTML, PHP, and D3.js for visualization.

[Some photos of dashboard here…I think they’re in my SSD]