Business Goals
Exploratory Data Analysis (EDA) Goals
1. Determine overall posting frequency and distinct-user trends for subreddit posts and comments.
We start by visualizing each of the nine subreddits' posts by grouping the data on subreddit and posted date (year and month features engineered from created_utc) and then aggregating the counts of posts over time. Doing so allows us to gauge periods of intense activity in specific subreddits, which could be attributed to significant political and economic events. We use the same grouped data but now aggregate on the author column to get distinct users over time for each subreddit, giving us a sense of unique user participation and of differences in user diversity and engagement. Finally, we repeat the same process for comments.
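A minimal PySpark sketch of this monthly aggregation is shown below; the submissions DataFrame name is illustrative, and created_utc is assumed to hold Unix seconds, while subreddit, author, and created_utc come from the Reddit data itself.

```python
from pyspark.sql import functions as F

# Monthly post counts and distinct authors per subreddit (sketch; `submissions`
# is an assumed DataFrame loaded from the collected Reddit submissions data).
monthly_activity = (
    submissions
    .withColumn("posted_ts", F.from_unixtime("created_utc").cast("timestamp"))
    .withColumn("year", F.year("posted_ts"))
    .withColumn("month", F.month("posted_ts"))
    .groupBy("subreddit", "year", "month")
    .agg(
        F.count("*").alias("num_posts"),                     # overall activity
        F.countDistinct("author").alias("distinct_authors")  # unique participation
    )
    .orderBy("subreddit", "year", "month")
)
```

The same aggregation applied to the comments data yields the corresponding comment trends.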
2. Gain an understanding of distinguished, gilded, and controversial posts and comments among all subreddits as well as the users of each subreddit.
Use data grouped on subreddit to gather the normalized counts of posts and comments that are gilded (the number of times a post/comment received Reddit gold), distinguished (whether the post/comment was made by moderators or admins), and controversial (whether a post/comment received a large number of both upvotes and downvotes). The higher these counts, the more likely a Reddit user is to come across and interact with those posts or comments. The same data wrangling is then done with author added as a grouping variable. The counts of users posting or commenting across multiple subreddits are calculated, and a table with the top 50 posts (based on upvotes) from each subreddit is presented. Doing so allows us to identify the most active users across subreddits.
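A hedged sketch of the subreddit-level normalization, assuming a comments DataFrame with the standard gilded, distinguished, and controversiality fields (the same pattern applies to submissions where those fields exist):

```python
from pyspark.sql import functions as F

# Share of comments in each subreddit that are gilded, distinguished, or
# controversial (sketch; column semantics follow the standard Reddit schema).
engagement_rates = (
    comments
    .groupBy("subreddit")
    .agg(
        F.mean((F.col("gilded") > 0).cast("double")).alias("pct_gilded"),
        F.mean(F.col("distinguished").isNotNull().cast("double")).alias("pct_distinguished"),
        F.mean(F.col("controversiality").cast("double")).alias("pct_controversial"),
    )
)
```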
3. Determine and visualize the 50 highest scored posts across political subreddits.
Each submission and comment receives a score, which is calculated as the number of upvotes minus the number of downvotes as voted by the respective subreddit's users. We use the data grouped by subreddit to gather the 50 posts with the highest scores from each subreddit. After obtaining the data, we create a HighChartJS Packed Bubble chart, where the size of each small bubble corresponds to the score of an individual post and the size of each larger bubble represents the cumulative score of posts from that subreddit. This visualization allows us to identify which subreddits' content resonates the most with their users.
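One way to pull the 50 highest-scoring posts per subreddit is a window function, sketched below (score is the provided field; everything else is illustrative):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank posts within each subreddit by score and keep the top 50.
by_score = Window.partitionBy("subreddit").orderBy(F.col("score").desc())

top_50_posts = (
    submissions
    .withColumn("rank", F.row_number().over(by_score))
    .filter(F.col("rank") <= 50)
    .select("subreddit", "title", "score")
)
```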
4. Understand the temporal effects of the number of posts across political subreddits.
Use data grouped on subreddit, month, and day of the week, and then aggregate the counts of posts. Create a HighChartJS Heatmap chart for each subreddit, where a greater color intensity corresponds with more posts. Doing so helps us identify distinct seasonal patterns and understand when users are most active in each subreddit.
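A short sketch of the month-by-day-of-week counts behind each heatmap (the day-name formatting is one possible choice):

```python
from pyspark.sql import functions as F

# Post counts by subreddit, calendar month, and day of week for the heatmaps.
heatmap_counts = (
    submissions
    .withColumn("posted_ts", F.from_unixtime("created_utc").cast("timestamp"))
    .withColumn("month", F.month("posted_ts"))
    .withColumn("day_of_week", F.date_format("posted_ts", "E"))  # e.g. Mon, Tue
    .groupBy("subreddit", "month", "day_of_week")
    .count()
)
```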
6. Measure the similarity between political and finance/macroeconomic topics discussed among grouped political and economics subreddits.
We first group politics and economics subreddit submissions and filter the grouped subreddits for post body (selftext) and for post title (a two-part analysis). Then, using regex, we identify posts that mention keywords pertaining to politics and economics and gather the counts of those keywords for each subreddit group. Next, we independently calculate the cosine similarity and the Jaccard similarity for the counts of both post titles and post bodies, resulting in two tables (one for post bodies and another for post titles). A high cosine similarity implies that the two groups have similar distributions of words, even though the counts of individual words may differ. As a complementary measure of the overlap between the keyword sets present in the two groups, we also compute the Jaccard index. By selecting the keywords that are most similar across the two groups, we can identify contextual similarities between them and compare how the keywords are distributed across the two groups.
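A hedged sketch of the two similarity measures applied to the keyword-count vectors; the keyword list and counts shown are purely illustrative placeholders, not project results:

```python
import numpy as np

keywords = ["inflation", "election", "fed", "tax"]  # illustrative keyword list
politics_counts = [120, 450, 60, 210]               # placeholder counts, politics group
economics_counts = [480, 90, 350, 300]              # placeholder counts, economics group

def cosine_similarity(a, b):
    # Compare the *distribution* of keyword counts between the two groups.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard_index(a, b):
    # Compare the *sets* of keywords that actually appear in each group.
    a_present = {k for k, c in zip(keywords, a) if c > 0}
    b_present = {k for k, c in zip(keywords, b) if c > 0}
    return len(a_present & b_present) / len(a_present | b_present)

print(cosine_similarity(politics_counts, economics_counts))
print(jaccard_index(politics_counts, economics_counts))
```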
7. Explore the activity of r/Economics submissions based on the state of the economy.
Combine data on the number of posts and comment scores (derived from upvotes and downvotes) with external GDP data gathered via the FRED Python package. Perform a time-window shift of the GDP data (since it is lagged by three months). Visualize the subreddit post activity alongside the GDP data with HighChartJS. With the help of this time-series visualization, we can identify patterns in GDP growth and its effects on the activity of the r/Economics subreddit.
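A minimal sketch of the GDP retrieval and three-month shift, assuming the fredapi package, the "GDP" series id, and a monthly_posts DataFrame of post counts indexed by month; all three are assumptions rather than the project's exact code:

```python
import pandas as pd
from fredapi import Fred  # assumed FRED client; requires a (hypothetical) API key

fred = Fred(api_key="YOUR_FRED_API_KEY")
gdp = fred.get_series("GDP").rename("gdp")  # quarterly GDP, indexed by date

# Shift the series by three months to account for the reporting lag, then
# forward-fill to a monthly index so it lines up with monthly Reddit activity.
gdp_shifted = gdp.shift(freq=pd.DateOffset(months=3))
gdp_monthly = gdp_shifted.resample("MS").ffill()

# `monthly_posts` is an assumed DataFrame of r/Economics post counts per month.
combined = monthly_posts.join(gdp_monthly, how="left")
```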
Natural Language Processing (NLP) Goals
8. Basic NLP and Text Checks: Tabulate the most discussed and the most important words. Visualize text lengths as well as the average number of words over time in the combined political and economics subreddits.
Uncover the most discussed words: We create four tables of the top 100 common words corresponding to politics subreddits submissions, economics subreddits submissions, politics subreddits comments, and economics subreddits comments. For analyzing submissions, we group politics and economics subreddits separately and count common words on a feature-engineered column that combines the post body (selftext) and the post title. We also use regex to remove certain stopwords, such as "https", "www", and "jpg", that the NLP pipeline could not remove. Once the new feature is obtained, it is turned into an array of words, flattened, grouped by word to get the counts, and sorted and filtered to output a .csv file of the top 100 words. For analyzing comments, we group politics and economics subreddits separately and count common words in the comment body, so no new column is created for comments. We use the same regex to remove those stopwords and follow the same process to obtain the counts of common words.
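A sketch of the word-count step for one subreddit group, assuming a grouped_text DataFrame whose nlpbodytext column holds the combined selftext and title text:

```python
from pyspark.sql import functions as F

# Tokens the NLP pipeline left behind, removed with a regex (illustrative list).
leftover_stopwords = r"\b(https|www|jpg)\b"

top_100_words = (
    grouped_text
    .withColumn("clean_text", F.regexp_replace(F.lower("nlpbodytext"), leftover_stopwords, ""))
    .withColumn("word", F.explode(F.split("clean_text", r"\s+")))  # array of words, flattened
    .filter(F.length("word") > 0)
    .groupBy("word")
    .count()
    .orderBy(F.col("count").desc())
    .limit(100)
)
top_100_words.toPandas().to_csv("top_100_words.csv", index=False)
```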
Uncover the most important words: We create three tables of the top 10 most important words, measured by TF-IDF score, corresponding to politics subreddits submissions, economics subreddits submissions, and politics subreddits comments (economics subreddits comments are left out due to the allocation of compute capacity). Similar to the process for obtaining the most discussed words, we group politics and economics subreddits separately and run the HashingTF and IDF functions on the feature-engineered column of combined post body (selftext) and post title. For analyzing comments, we group politics and economics subreddits separately and run the HashingTF and IDF functions on the comment body, so no new column is created for comments.
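A sketch of the TF-IDF step, assuming a tokenized_df DataFrame whose tokens column is the tokenized combined text (the number of hash features is illustrative):

```python
from pyspark.ml.feature import HashingTF, IDF

# Term frequencies via feature hashing, then inverse-document-frequency weighting.
hashing_tf = HashingTF(inputCol="tokens", outputCol="raw_features", numFeatures=10000)
featurized = hashing_tf.transform(tokenized_df)

idf = IDF(inputCol="raw_features", outputCol="tfidf")
tfidf_df = idf.fit(featurized).transform(featurized)
```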
Visualize text lengths and average number of words over time: Similar to the process for obtaining the most discussed words, we group politics and economics subreddits separately and find the length of the selftext column for submissions and the body column for comments. We then create four matplotlib histogram visualizations of the distribution of text lengths corresponding to politics subreddits submissions, economics subreddits submissions, politics subreddits comments, and economics subreddits comments. Finally, we also create a matplotlib line-plot visualization of the average number of words over time for politics subreddits submissions, economics subreddits submissions, politics subreddits comments, and economics subreddits comments in the same plot. To obtain the average number of words over time, we group politics and economics subreddits separately and find the size after performing a split of the selftext column for submissions and the body column for comments. We then group the data on year and month and aggregate each month's average number of words.
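A sketch of the length and word-count features for submissions (comments would use the body column instead); the year and month columns are assumed to have been engineered earlier:

```python
from pyspark.sql import functions as F

# Character length of each post body and its word count from a whitespace split.
with_lengths = (
    submissions
    .withColumn("text_length", F.length("selftext"))
    .withColumn("num_words", F.size(F.split("selftext", r"\s+")))
)

# Monthly average number of words, for the line plot over time.
avg_words_by_month = (
    with_lengths
    .groupBy("year", "month")
    .agg(F.avg("num_words").alias("avg_num_words"))
    .orderBy("year", "month")
)
```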
9. Assess the sentiment in posts and comments for the combined political and economics subreddits in relation to U.S. GDP.
Using SageMaker processing jobs and Spark NLP, we generated positive, neutral, or negative sentiment indicators. Considering that Reddit is a widely used social media site and forum, we used a pre-trained sentiment detection model typically used for Twitter. For submissions, the post's title and body were concatenated into a concat_submissions column, which was used for sentiment generation. Similarly, for comments, the comment's body was used for sentiment generation.
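A hedged sketch of the sentiment step with Spark NLP; the pretrained pipeline name below is an assumption based on the Twitter-trained sentiment models, not necessarily the exact model used:

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()

# Pretrained Twitter-style sentiment pipeline (the pipeline name is an assumption).
pipeline = PretrainedPipeline("analyze_sentimentdl_use_twitter", lang="en")

# The pipeline expects a `text` column; point it at the concatenated title + body.
scored = pipeline.transform(submissions.withColumnRenamed("concat_submissions", "text"))
# `scored` carries a sentiment result column with positive / neutral / negative labels.
```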
With the generated sentiment data for submissions and comments from each subreddit, we combined the political subreddits’ data with the U.S. GDP. We then generated plots showing changes in user activity and overall sentiments from each political subreddit of interest alongside changes in the U.S. GDP.
10. Determine prevalent subjects or themes in combined political and economics subreddits submissions.
By utilizing both the pyspark and gensim libraries, we ensure a comprehensive topic modeling analysis, and the interactive visualization powered by the pyLDAvis package provides a clear representation of the prevalent subjects and themes, enabling informed insights into the content across the targeted subreddits. We specifically chose to perform topic modeling on the submissions instead of comments to reveal which topics users want to initiate a discussion on rather than what they go on to discuss and possibly divert from the initial post. Topic modeling for submissions was performed with gensim by grouping politics and economics subreddits separately and combining the selftext and title columns to create a new column called nlpbodytext, as we did for Business Goal 8. We then create a gensim dictionary and corpus and run the LdaModel function to obtain the topics and their corresponding keywords. Finally, we create an interactive visualization of the topics and their corresponding keywords using the pyLDAvis package.
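A sketch of the gensim LDA step, assuming docs is a Python list of tokenized nlpbodytext documents collected from the grouped Spark DataFrame (the number of topics is illustrative):

```python
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models  # exposed as pyLDAvis.gensim in older releases

# Build the dictionary and bag-of-words corpus from the tokenized documents.
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit LDA and inspect the topic keywords.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=5, random_state=42)
print(lda.print_topics())

# Interactive topic visualization saved as an HTML page.
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")
```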
Machine Learning (ML) Goals
11. Regression: Estimate the expected score (number of up-votes minus the number of down-votes) of a submission using provided and engineered features.
Each submission (and comment) receives a score derived from the number of up-votes minus the number of down-votes as voted by subreddit readers. These scores drive engagement, as more popular posts are pushed to the top by Reddit's algorithm. Given this, we will create supervised learning models, specifically Linear Regression and Random Forest Regressor, to predict the score of a submission based on provided features, such as whether a submission has adult content and the subreddit the submission belongs to, and engineered features, such as the submission title_length, day, month, and year, to name a few. We are not concerned with comments for this business goal since we aim to analyze the initial impact and reception of a post on the platform. A score prediction model could help Reddit users improve their posts by predicting their potential engagement, enabling authors to edit their posts before publishing to maximize visibility and interaction.
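A hedged sketch of this regression setup in PySpark ML; the feature list mirrors the columns named above, and the adult-content flag (over_18) is assumed to be cast to 0/1 beforehand:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Encode the subreddit and assemble the provided + engineered features.
indexer = StringIndexer(inputCol="subreddit", outputCol="subreddit_idx")
assembler = VectorAssembler(
    inputCols=["subreddit_idx", "over_18", "title_length", "day", "month", "year"],
    outputCol="features",
)

train, test = submissions.randomSplit([0.8, 0.2], seed=42)
evaluator = RegressionEvaluator(labelCol="score", metricName="rmse")

for regressor in [LinearRegression(labelCol="score"), RandomForestRegressor(labelCol="score")]:
    model = Pipeline(stages=[indexer, assembler, regressor]).fit(train)
    print(type(regressor).__name__, "RMSE:", evaluator.evaluate(model.transform(test)))
```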
12. Classification: Predict the controversiality of a comment using provided and engineered features.
Unlike submissions, the controversiality feature of a comment is determined by both the interactions a comment receives (its up-votes and down-votes) and the balance of these interactions. We will create supervised learning models, specifically Random Forest Classifier, Logistic Regression, and Gradient Boosting Classifier, to predict the controversiality of a comment based on provided features, including the score of the comment, the subreddit the comment belongs to, and whether the comment is distinguished (made by moderators or admins) or gilded (the number of times the comment received Reddit gold), and engineered features, including day, month, and year. This analysis aids in understanding the dynamics of comment engagement and controversy, offering Reddit a powerful tool for content moderation and user engagement strategies.
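A hedged sketch of the classification setup, following the same indexing/assembly pattern as the regression sketch; the 0/1 is_distinguished flag is an assumed engineered column:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

indexer = StringIndexer(inputCol="subreddit", outputCol="subreddit_idx")
assembler = VectorAssembler(
    inputCols=["score", "subreddit_idx", "is_distinguished", "gilded", "day", "month", "year"],
    outputCol="features",
)

train, test = comments.randomSplit([0.8, 0.2], seed=42)
evaluator = BinaryClassificationEvaluator(labelCol="controversiality", metricName="areaUnderROC")

for clf in [LogisticRegression(labelCol="controversiality"),
            RandomForestClassifier(labelCol="controversiality"),
            GBTClassifier(labelCol="controversiality")]:
    model = Pipeline(stages=[indexer, assembler, clf]).fit(train)
    print(type(clf).__name__, "AUC:", evaluator.evaluate(model.transform(test)))
```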
13. Clustering: Determine what subreddit a submission belongs to based on word embeddings of the submission body and other features.
Using the Word2Vec function, we create word embeddings of the cleaned_body column, which is the selftext column with stopwords removed and its text lemmatized. We then create a VectorAssembler to combine the word embeddings with other provided and engineered columns, including the day, month, year, distinguished, score, gilded, num_comments, over_18, is_spoiler, and is_video columns. We then perform PCA on the combined features and condense them into two principal components, facilitating visualization. Then, using the elbow method to obtain an initial estimate of k, the number of clusters, we create a KMeans model and fit the data to obtain the clusters. A better criterion than the elbow method is the silhouette score, which measures how similar an object is to its own cluster compared with other clusters, so we use the silhouette score to determine the optimal number of clusters, a hyperparameter-tuning exercise. Finally, we create a KMeans model with the optimal number of clusters and fit the data to obtain the final clusters. This analysis is valuable for Reddit and its users as it systematically segments submissions into clusters based on their content. We capture the semantic meaning of submission bodies by leveraging word embeddings created with the Word2Vec function. Including various features, such as temporal information, user distinctions, scores, and content characteristics, enriches the clustering process, offering a nuanced understanding of subreddit categorization.
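A hedged sketch of the full clustering pipeline; the cleaned_body column is assumed to be tokenized (an array of words), and the categorical/boolean flags (distinguished, over_18, is_spoiler, is_video) are assumed to have been cast to 0/1 before assembly:

```python
from pyspark.ml.feature import Word2Vec, VectorAssembler, PCA
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Word embeddings of the cleaned (tokenized) body text.
w2v = Word2Vec(vectorSize=100, inputCol="cleaned_body", outputCol="body_vec")
embedded = w2v.fit(submissions).transform(submissions)

# Combine the embeddings with the other provided and engineered columns
# (flag columns are assumed already cast to 0/1).
assembler = VectorAssembler(
    inputCols=["body_vec", "day", "month", "year", "distinguished", "score",
               "gilded", "num_comments", "over_18", "is_spoiler", "is_video"],
    outputCol="all_features",
)
assembled = assembler.transform(embedded)

# Reduce to two principal components for visualization.
pca = PCA(k=2, inputCol="all_features", outputCol="pca_features")
reduced = pca.fit(assembled).transform(assembled)

# Pick k with the silhouette score, then fit the final KMeans model.
evaluator = ClusteringEvaluator(featuresCol="pca_features", predictionCol="prediction")
best_k, best_silhouette = 2, -1.0
for k in range(2, 9):
    preds = KMeans(k=k, featuresCol="pca_features", seed=42).fit(reduced).transform(reduced)
    silhouette = evaluator.evaluate(preds)
    if silhouette > best_silhouette:
        best_k, best_silhouette = k, silhouette

clusters = KMeans(k=best_k, featuresCol="pca_features", seed=42).fit(reduced).transform(reduced)
```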
Acknowledgment
We would like to disclose that we employed Grammarly, Inc. to assist with grammar and proofreading for this section.