CS 109 ended on December 10, 2014. What’s on the docket for December 11?
Given more time, we would have enjoyed expanding our analysis in several dimensions. This section contains some notes on further topics and questions that we are interested in analyzing.
Due primarily to limitations of the Twitter API, our analysis covers only recent months of social media activity: our Twitter data extends back to October 1, and our Facebook data to September 1. Continuing to collect data for several more months would strengthen the analysis considerably.
Given our time constraints and the nature of our data, we had little trouble storing everything in a series of JSON files and joining them together for later analysis. But this process simply would not scale: storing hundreds of thousands of records in flat files, especially where records may overlap significantly (as with progressively updated Twitter data), raises concerns about reliability, stability, and processing speed.
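The record-overlap concern above can be illustrated with a small sketch: when the same tweet appears in multiple progressively updated JSON dumps, merging on a unique id keeps only the latest copy, mimicking a database upsert. The `id_str` field name is an assumption (it is the identifier Twitter's API uses for tweets); the file layout here is hypothetical.

```python
import json


def merge_record_files(paths, key="id_str"):
    """Merge JSON record files into one deduplicated list.

    Later files win for overlapping records, so progressively
    updated dumps behave like an upsert. `key` names a unique-id
    field (Twitter tweets carry `id_str`).
    """
    merged = {}
    for path in paths:
        with open(path) as f:
            for record in json.load(f):
                merged[record[key]] = record  # overwrite earlier duplicates
    return list(merged.values())
```

Keying records by id is essentially what a document store does for us automatically; doing it by hand across many large files is where the reliability and speed concerns arise.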
A NoSQL document store would be a natural fit for our retrieved data. MongoDB appears to fit the bill, since its document model maps directly onto the JSON we already collect. We eschewed MongoDB because of its learning curve: our team has some experience with relational databases but none with MongoDB, and the conceptual shift to a NoSQL database made it appear infeasible given the time available and the potential benefit.
Much of our social media analysis revolves around likes, shares, retweets, favorites, and bitly click counts. We did not look closely at how post length affects these metrics, nor at how attaching media such as images and videos changes them. Exploring these effects would deepen our understanding of how readers consume digital news through social media platforms.
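A first pass at the media-attachment question could be as simple as grouping posts by a feature and comparing average engagement. The sketch below assumes hypothetical field names (`has_media`, `shares`); a real analysis would control for outlet, time of day, and follower count.

```python
def mean_metric_by_group(posts, group_key, metric):
    """Average a social metric within each group of posts.

    `group_key` might be a boolean media flag or a binned post
    length; `metric` might be shares, retweets, or bitly clicks.
    Field names here are hypothetical.
    """
    totals = {}
    for post in posts:
        group = post[group_key]
        total, count = totals.get(group, (0, 0))
        totals[group] = (total + post[metric], count + 1)
    return {g: total / count for g, (total, count) in totals.items()}
```

The same helper would serve the post-length question by binning lengths into the grouping key.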
We explored the social media relationships between news organizations. A natural extension would be to ask whether there are identifiable clusters of articles: perhaps a cluster of strong performers that tend to go viral and another of articles that generate only lukewarm interest. Such an analysis could help a news organization see what proportion of its articles falls into each category and adjust its content strategy accordingly.
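The viral-versus-lukewarm split described above could be sketched with a tiny one-dimensional k-means over a single engagement score per article. This is only an illustration under that simplifying assumption; a real analysis would use several features and a library implementation such as scikit-learn's.

```python
def kmeans_1d(values, k=2, iters=50):
    """Tiny 1-D k-means: cluster articles by one engagement score.

    Returns (centers, labels). Centers are seeded from evenly
    spaced sorted values; labels[i] is the cluster of values[i].
    """
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    labels = [0] * len(values)
    for _ in range(iters):
        # assign each value to its nearest center
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # move each center to the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels
```

With k = 2, the cluster proportions directly answer the content-strategy question: what share of an outlet's articles lands in the high-engagement cluster.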
Our predictive models used a fairly straightforward selection of features and employed only two common regression algorithms. The modeling process, the feature set, and the breadth of outcomes considered all offer avenues for future improvement.
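One concrete improvement to the modeling process would be comparing feature sets and algorithms under k-fold cross-validation rather than a single train/test split. A minimal index-splitting helper, sketched in plain Python:

```python
def kfold_indices(n, k=5):
    """Yield (train, test) index lists for k-fold cross-validation.

    Every observation appears in exactly one test fold; fold sizes
    differ by at most one when k does not divide n.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

Averaging a model's error across folds would give a fairer basis for choosing among richer feature sets and additional regression algorithms.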
While our topic analysis proved enlightening, it was far more challenging to implement than expected, which limited our ability to use topic modeling in predictive analysis or to examine the relationship between a post's subject matter and its sentiment. In theory, the subject matter of a social media post (and its linked article) ought to be a critical component of its popularity. This should hold even more strongly for specialized news organizations (e.g., the Wall Street Journal and its business coverage) or for organizations thought to cultivate a more partisan audience (e.g., Fox News and MSNBC).
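Short of full topic modeling (e.g., LDA), a crude keyword-matching tagger could still feed subject matter into the predictive models. The topic names and keyword lists below are hypothetical placeholders, not output from our actual analysis.

```python
def tag_topic(text, topic_keywords):
    """Assign a post to the topic whose keywords it mentions most.

    `topic_keywords` maps topic name -> list of lowercase keywords.
    Returns None when no keyword matches. A crude stand-in for a
    proper topic model, useful as a baseline feature.
    """
    words = text.lower().split()
    counts = {
        topic: sum(words.count(word) for word in keywords)
        for topic, keywords in topic_keywords.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Even this baseline would let us test whether a business-heavy outlet's posts behave differently from a politics-heavy one's.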
One extension that would tie topic modeling to the social media metrics we gathered from Facebook and Twitter is to cluster articles, with each article as an observation, and color each observation by its identified topic. This would make for an interesting visualization, particularly alongside our identified top 10 articles.
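Preparing that visualization mostly means pairing two article features with a topic-to-color mapping before handing the triples to a plotting library. The axis fields (`engagement`, `sentiment`) and palette below are assumptions for illustration; any pair of article features would do.

```python
def scatter_by_topic(articles, palette, default="gray"):
    """Build (x, y, color) triples for a topic-colored scatter plot.

    Each article is one observation; its color comes from the
    palette entry for its identified topic, falling back to a
    default color for unrecognized topics.
    """
    return [
        (a["engagement"], a["sentiment"], palette.get(a["topic"], default))
        for a in articles
    ]
```

The resulting triples plug directly into, for example, matplotlib's `scatter`, with the top 10 articles annotated separately.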