CS 109 ended on December 10, 2014. What’s on the docket for December 11?
Given more time, we would have enjoyed expanding our analysis in several dimensions. This section contains some notes on further topics and questions that we are interested in analyzing.
Due primarily to limitations of the Twitter API, our analysis covers only recent months of social media activity: our Twitter data extends back to October 1, and our Facebook data to September 1. Continuing to collect data for several more months would strengthen the analysis considerably.
Given our time constraints and the nature of our data, we had little trouble storing everything in a series of JSON files and joining them together for later analysis. But this process simply would not scale: storing hundreds of thousands of records in flat files, especially where records may overlap significantly (as with progressively updated Twitter data), raises concerns about reliability, stability, and processing speed.
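The record-overlap concern above can be illustrated with a small sketch: when the same tweet appears in multiple progressively updated JSON dumps, merging on a unique id keeps only the latest copy, mimicking a database upsert. The `id_str` field name is an assumption (it is the identifier Twitter's API uses for tweets); the file layout here is hypothetical.

```python
import json


def merge_record_files(paths, key="id_str"):
    """Merge JSON record files into one deduplicated list.

    Later files win for overlapping records, so progressively
    updated dumps behave like an upsert. `key` names a unique-id
    field (Twitter tweets carry `id_str`).
    """
    merged = {}
    for path in paths:
        with open(path) as f:
            for record in json.load(f):
                merged[record[key]] = record  # overwrite earlier duplicates
    return list(merged.values())
```

Keying records by id is essentially what a document store does for us automatically; doing it by hand across many large files is where the reliability and speed concerns arise.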
A NoSQL document store would be a natural fit for our retrieved data. MongoDB appears to fit the bill, since its document model maps directly onto the JSON we already collect. We eschewed MongoDB because of its learning curve: our team has some experience with relational databases but none with MongoDB, and the conceptual shift to a NoSQL database made it appear infeasible given the time available and the potential benefit.
Much of our social media analysis revolves around likes, shares, retweets, favorites, and bitly click counts. We did not look closely at how post length affects these metrics, nor at how attaching media such as images and videos changes them. Exploring these effects would deepen our understanding of how readers consume digital news through social media platforms.
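A first pass at the media-attachment question could be as simple as grouping posts by a feature and comparing average engagement. The sketch below assumes hypothetical field names (`has_media`, `shares`); a real analysis would control for outlet, time of day, and follower count.

```python
def mean_metric_by_group(posts, group_key, metric):
    """Average a social metric within each group of posts.

    `group_key` might be a boolean media flag or a binned post
    length; `metric` might be shares, retweets, or bitly clicks.
    Field names here are hypothetical.
    """
    totals = {}
    for post in posts:
        group = post[group_key]
        total, count = totals.get(group, (0, 0))
        totals[group] = (total + post[metric], count + 1)
    return {g: total / count for g, (total, count) in totals.items()}
```

The same helper would serve the post-length question by binning lengths into the grouping key.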
We explored the social media relationships between news organizations. A natural extension would be to ask whether there are identifiable clusters of articles: perhaps a cluster of strong performers that tend to go viral and another of articles that generate only lukewarm interest. Such an analysis could help a news organization see what proportion of its articles falls into each category and adjust its content strategy accordingly.
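The viral-versus-lukewarm split described above could be sketched with a tiny one-dimensional k-means over a single engagement score per article. This is only an illustration under that simplifying assumption; a real analysis would use several features and a library implementation such as scikit-learn's.

```python
def kmeans_1d(values, k=2, iters=50):
    """Tiny 1-D k-means: cluster articles by one engagement score.

    Returns (centers, labels). Centers are seeded from evenly
    spaced sorted values; labels[i] is the cluster of values[i].
    """
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    labels = [0] * len(values)
    for _ in range(iters):
        # assign each value to its nearest center
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # move each center to the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels
```

With k = 2, the cluster proportions directly answer the content-strategy question: what share of an outlet's articles lands in the high-engagement cluster.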
Our predictive models used a fairly straightforward selection of features and employed only two common regression algorithms. The modeling process, the feature set, and the breadth of outcomes considered all offer avenues for future improvement.
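One concrete improvement to the modeling process would be comparing feature sets and algorithms under k-fold cross-validation rather than a single train/test split. A minimal index-splitting helper, sketched in plain Python:

```python
def kfold_indices(n, k=5):
    """Yield (train, test) index lists for k-fold cross-validation.

    Every observation appears in exactly one test fold; fold sizes
    differ by at most one when k does not divide n.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

Averaging a model's error across folds would give a fairer basis for choosing among richer feature sets and additional regression algorithms.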
While our topic analysis proved enlightening, it was far more challenging to implement than expected, which limited our ability to use topic modeling in predictive analysis or to examine the relationship between a post's subject matter and its sentiment. In theory, the subject matter of a social media post (and its linked article) ought to be a critical component of its popularity. This should hold even more strongly for specialized news organizations (e.g., the Wall Street Journal and its business coverage) or for organizations thought to cultivate a more partisan audience (e.g., Fox News and MSNBC).
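Short of full topic modeling (e.g., LDA), a crude keyword-matching tagger could still feed subject matter into the predictive models. The topic names and keyword lists below are hypothetical placeholders, not output from our actual analysis.

```python
def tag_topic(text, topic_keywords):
    """Assign a post to the topic whose keywords it mentions most.

    `topic_keywords` maps topic name -> list of lowercase keywords.
    Returns None when no keyword matches. A crude stand-in for a
    proper topic model, useful as a baseline feature.
    """
    words = text.lower().split()
    counts = {
        topic: sum(words.count(word) for word in keywords)
        for topic, keywords in topic_keywords.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Even this baseline would let us test whether a business-heavy outlet's posts behave differently from a politics-heavy one's.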
One extension that would tie topic modeling to the social media metrics we gathered from Facebook and Twitter is to cluster articles, with each article as an observation, and color each observation by its identified topic. This would make for an interesting visualization, particularly alongside our identified top 10 articles.
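Preparing that visualization mostly means pairing two article features with a topic-to-color mapping before handing the triples to a plotting library. The axis fields (`engagement`, `sentiment`) and palette below are assumptions for illustration; any pair of article features would do.

```python
def scatter_by_topic(articles, palette, default="gray"):
    """Build (x, y, color) triples for a topic-colored scatter plot.

    Each article is one observation; its color comes from the
    palette entry for its identified topic, falling back to a
    default color for unrecognized topics.
    """
    return [
        (a["engagement"], a["sentiment"], palette.get(a["topic"], default))
        for a in articles
    ]
```

The resulting triples plug directly into, for example, matplotlib's `scatter`, with the top 10 articles annotated separately.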