EDA
Executive Summary
Our exploratory data analysis (EDA) on Reddit’s major political and economic subreddits has provided a nuanced view of online political discourse, offering valuable context for understanding user behavior. The analysis delved into various aspects of these online subreddit communities, shedding light on the dynamics of discussions and the interplay of user interests.
Preliminary analysis showed that r/Conservative is the most active subreddit, with the highest number of posts and comments, while r/AskPolitics has the most distinct posts and comments, indicating a more diverse user base. The r/Conservative subreddit also has the highest average post score, suggesting that its content resonates strongly with its user base, and it shares the largest number of users with other political subreddits, indicating that these users engage in multiple political communities. Subreddits such as r/socialism and r/Liberal show consistent activity with fewer fluctuations in the number of comments and submissions, suggesting a more stable user base.
Additionally, weekly posting patterns showed that Thursday is the most active day for posting in the r/Libertarian, r/socialism, and r/centrist subreddits, whereas r/Conservative sees its highest activity on Wednesdays. Posting activity generally decreases as the year progresses, with the lowest numbers appearing in December. Saturdays and Sundays consistently have fewer posts than weekdays, in line with typical online engagement trends.
Finally, we incorporated U.S. GDP data from January 2021 to March 2023, retrieved with the FRED Python package, to compare economic conditions with posting activity on the r/Economics subreddit. The analysis revealed a relationship between the state of the economy and the level of engagement and sentiment on r/Economics. During times of economic uncertainty, the number of submissions increases and the average submission score decreases. This suggests that people are more likely to turn to online communities to discuss and seek information about the economy during difficult times. Additionally, the types of posts that are upvoted during these times suggest that people are looking for content that resonates with their concerns and anxieties.
Analysis Report
Business Goal #1: Determine overall and distinct user frequency trends of subreddit posts and comments
Figure 1 presents a line chart that tracks posting activity across various political and economic subreddits from January 2021 to March 2023. The subreddit r/Conservative shows the most pronounced variability and has the highest peaks, suggesting periods of intense activity that might be attributed to specific political events or discussions that resonated strongly with its user base. Notably, the peaks for r/Conservative spike sharply above all other subreddits, indicating that certain topics or times drove engagement significantly more than usual.
In contrast, other subreddits such as r/Economics, r/Finance, and r/ChangeMyView maintain relatively stable, low-level activity over time, with r/Economics occasionally showing slight increases that could correspond to economic events or policy discussions. Subreddits like r/Liberal, r/socialism, r/Libertarian, and r/centrist exhibit moderate activity with fewer fluctuations, reflecting a consistent level of engagement without the sharp spikes observed in the r/Conservative subreddit. The data suggests that while political subreddits can experience bursts of heightened activity, forums centered around economics and finance tend to have steadier, more predictable participation rates.
Figure 2 tracks the volume of comments across various subreddits, highlighting how user interaction ebbs and flows over time. The r/Conservative subreddit shows the most pronounced fluctuations, with sharp spikes indicative of intense conversational bursts. The r/Libertarian subreddit also experiences notable, albeit less extreme, surges in activity, potentially aligning with key events or discussions. Other subreddits, including r/Liberal, r/socialism, and r/centrist, present more consistent comment patterns, with occasional upticks that may correspond to specific events. The steady increase in r/Economics comments towards the end of the timeline suggests growing engagement. This visualization reflects the dynamic nature of online discourse, with political forums often at the mercy of the news cycle, while more thematic forums like r/Finance or r/Economics maintain a steadier level of dialogue.
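The monthly counts behind Figures 1 and 2 can be sketched as a simple aggregation. This is a minimal illustration, not the pipeline used for this report: the field names (`subreddit`, `author`, `created_utc`) and the reading of "distinct" as distinct authors are assumptions, and the sample records are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone


def monthly_activity(posts):
    """Return {(subreddit, 'YYYY-MM'): (n_posts, n_distinct_authors)}."""
    counts = defaultdict(int)
    authors = defaultdict(set)
    for post in posts:
        # created_utc is assumed to be a Unix timestamp in seconds.
        month = datetime.fromtimestamp(
            post["created_utc"], tz=timezone.utc
        ).strftime("%Y-%m")
        key = (post["subreddit"], month)
        counts[key] += 1
        authors[key].add(post["author"])
    return {key: (counts[key], len(authors[key])) for key in counts}


# Hypothetical records, for illustration only.
sample_posts = [
    {"subreddit": "Conservative", "author": "a", "created_utc": 1609459200},
    {"subreddit": "Conservative", "author": "a", "created_utc": 1609545600},
    {"subreddit": "Conservative", "author": "b", "created_utc": 1612137600},
    {"subreddit": "Economics", "author": "c", "created_utc": 1609459200},
]
activity = monthly_activity(sample_posts)
```

The same function applies unchanged to comments, provided each comment record carries the same three fields.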
Note: Similar plots for distinct posts, comment counts, and distinct comments can be found in the appendix.
Business Goal #2: Gain an understanding of distinguished, gilded, and controversial posts and comments among all subreddits as well as the users of each subreddit.
The subreddits were ranked based on the counts of measures that correspond with controversial posts and comments. As Figures 1 and 2 have already shown, the r/Conservative subreddit has a greater number of posts and comments, so the counts shown in Table 1 have been normalized. Unsurprisingly, the r/ChangeMyView subreddit had the greatest number of distinguished posts and comments (nearly 2 standard deviations above the mean), which may hint at heavy moderation, potentially as a result of touchy subjects being discussed. Ironically, the subreddits typically associated with the authoritarian left and right on the political compass (r/socialism and r/Conservative, respectively) have fewer distinguished submissions and comments than the libertarian left and right subreddits (r/Liberal and r/Libertarian, respectively), an inversion of what one might expect of subreddit moderation. Additionally, the r/Conservative subreddit had the greatest number of gilded posts and comments. At a glance, this may look like users incentivizing each other financially with the prospect of Reddit Gold, which would tend to promote an echo chamber of ideas in an online political space; however, this doesn’t seem to be the case, because the subreddit also has the highest count of controversial comments (over 2 standard deviations above the mean). Instead, this may suggest that users on this subreddit are simply more willing to monetarily support posts and comments that they find appealing.
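Statements such as "2 standard deviations from the mean" are consistent with z-score standardization of the per-subreddit counts. The sketch below assumes that reading and uses the sample standard deviation; the report does not state which normalization or which estimator was used, and the subreddit names and counts here are placeholders.

```python
from statistics import mean, stdev


def z_scores(counts):
    """Standardize per-subreddit counts: (value - mean) / sample std dev."""
    mu = mean(counts.values())
    sigma = stdev(counts.values())  # assumption: sample (n - 1) estimator
    return {name: (value - mu) / sigma for name, value in counts.items()}


# Placeholder counts, not values from Table 1.
gilded_counts = {"sub_a": 10, "sub_b": 12, "sub_c": 11, "sub_d": 40}
standardized = z_scores(gilded_counts)
top_subreddit = max(standardized, key=standardized.get)
```

A z-score puts measures with very different raw magnitudes (distinguished, gilded, controversial) on one scale, which is what makes a cross-measure ranking meaningful.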
Next, we looked at individual users and ranked them based on the counts of measures that correspond with controversial posts and comments. Based on the distinguished submissions and comments, only one of the top 20 controversial users (user: “ultimis”) is a moderator. Another finding is that the top 20 users all come from r/centrist, r/Libertarian, or r/Conservative. Much like the findings from Table 1, Table 2 also seems to suggest that these three political subreddits have a large amount of intra-subreddit discourse among their posts and comments.
Business Goal #3: Determine and visualize the 50 highest scored posts across political subreddits.
Figure 3 is a bubble chart illustrating the average scores of the top 50 posts from various political subreddits, where the score is calculated by subtracting the number of downvotes from upvotes. The size of each subreddit’s bubble reflects the average score of its posts. The chart shows that the subreddit represented by the largest bubble has the highest average post score, while the smallest bubble corresponds to the subreddit with the lowest average score.
The subreddit with the largest bubble (r/Conservative) has significantly outpaced the others in terms of average score, suggesting that posts in this community resonate strongly with its members and receive more upvotes. Conversely, the subreddit with the smallest bubble (r/centrist) has the lowest average score, indicating either a smaller community, less engagement, or a tendency towards more divisive content that doesn’t amass as many upvotes.
Within each of these larger subreddit bubbles are smaller bubbles, each representing an individual post; the title of the post is indicated within these smaller bubbles. Additionally, interacting with these smaller bubbles by clicking on them will redirect the viewer to the respective post’s link on Reddit for further exploration.
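The bubble sizes described above can be sketched as the mean score of each subreddit's highest-scored posts. This sketch assumes the top 50 are selected per subreddit (the alternative reading, top 50 overall, would need a different grouping); the field names and sample scores are hypothetical.

```python
from heapq import nlargest
from statistics import mean


def average_top_scores(posts, top_n=50):
    """Mean score of the top_n highest-scored posts in each subreddit."""
    scores_by_subreddit = {}
    for post in posts:
        scores_by_subreddit.setdefault(post["subreddit"], []).append(post["score"])
    return {
        subreddit: mean(nlargest(top_n, scores))
        for subreddit, scores in scores_by_subreddit.items()
    }


# Hypothetical scores; top_n is reduced to 2 to keep the example small.
sample_posts = [
    {"subreddit": "Conservative", "score": 900},
    {"subreddit": "Conservative", "score": 700},
    {"subreddit": "Conservative", "score": 10},
    {"subreddit": "centrist", "score": 120},
    {"subreddit": "centrist", "score": 80},
]
bubble_sizes = average_top_scores(sample_posts, top_n=2)
```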
Business Goal #4: Understand the temporal effects of number of posts across political subreddits
Figure 4 depicts the daily and monthly number of posts on the r/Conservative subreddit. There’s a clear trend of higher activity on weekdays, with Wednesdays being the peak, and lower activity on weekends, particularly on Sundays. The data indicates a seasonal trend, with post volumes generally higher in March and April and declining towards the end of the year, in November and December. The color gradient emphasizes these variations, with darker hues representing more posts, allowing for a visual representation of user engagement on Reddit throughout the different months and days of the week.
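The grid underlying a month-by-weekday heatmap like Figure 4 can be sketched as a count keyed by (month, weekday). The input is assumed to be a list of Unix timestamps in seconds, interpreted in UTC; the timestamps below are illustrative, and the month and weekday names depend on the default locale.

```python
from collections import Counter
from datetime import datetime, timezone


def weekday_month_counts(timestamps):
    """Count posts per (month name, weekday name) cell."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        counts[(dt.strftime("%B"), dt.strftime("%A"))] += 1
    return counts


# Illustrative timestamps: 2021-01-01 (Friday) and two on 2021-01-06 (Wednesday).
sample_timestamps = [1609459200, 1609891200, 1609894800]
heatmap_cells = weekday_month_counts(sample_timestamps)
```

Each cell value then maps to a color intensity, with darker hues for larger counts.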
Note: A calendar heatmap for different subreddits can be found in the appendix section.
Business Goal #6: Measure the similarity between political and finance/macroeconomic topics discussed among grouped political and economics subreddits
We present the Cosine and Jaccard similarity scores for post bodies and titles across grouped political and economics subreddits. The political subreddits used for the two tables below are r/Conservative, r/socialism, r/centrist, and r/Libertarian, and the economics subreddits are r/Finance and r/Economics. Initially, we used regex patterns on the post bodies to obtain counts of three words, [recession, inflation, unemployment], for each subreddit group and computed Cosine and Jaccard Similarity scores on those counts. The Cosine Similarity score between the two groups was 0.9 and the Jaccard Similarity score was 0.04, which meant that the two groups have similar distributions of words even though the counts of individual words differ significantly. This finding led to two hypotheses: first, that the grouped economics subreddits contain a large number of “shitposts”, probably related to cryptocurrencies and NFTs; and second, that post bodies in the economics subreddits mainly contain links to online news articles, with the key words appearing in the post titles instead. Consequently, we updated the word list to include the words [crypto, blockchain, nft] found in these flippant posts and also ran the similarity algorithms on the post titles.
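The following sketch illustrates how a high Cosine score can coexist with a low Jaccard score on keyword counts. It assumes the Jaccard score was computed on count vectors as a weighted (min/max) Jaccard, which the report does not confirm; the regex, helper names, and sample numbers are illustrative only.

```python
import math
import re
from collections import Counter

KEYWORDS = ["recession", "inflation", "unemployment"]
PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)


def keyword_counts(texts):
    """Count case-insensitive whole-word keyword matches, in KEYWORDS order."""
    counts = Counter()
    for text in texts:
        counts.update(match.lower() for match in PATTERN.findall(text))
    return [counts[word] for word in KEYWORDS]


def cosine_similarity(a, b):
    """Compare direction only: proportional count vectors score 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_jaccard(a, b):
    """Compare magnitudes: sum of element-wise minima over sum of maxima."""
    denominator = sum(max(x, y) for x, y in zip(a, b))
    return sum(min(x, y) for x, y in zip(a, b)) / denominator if denominator else 0.0


# Made-up count vectors with the same proportions but very different sizes.
group_a = [2, 4, 2]
group_b = [100, 200, 100]
cosine = cosine_similarity(group_a, group_b)
jaccard = weighted_jaccard(group_a, group_b)
example_counts = keyword_counts(["Inflation is up, inflation again", "Recession fears"])
```

Because Cosine similarity ignores vector length while the weighted Jaccard does not, identical word proportions with a 50-fold difference in volume give a Cosine score near 1 and a Jaccard score near 0.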
In addition to the economics word list, we also made a political word list to obtain grouped subreddit counts for the words [trump, biden, election, fed, powell] and ran the similarity algorithms on the post bodies and titles. The results and discussion are presented below.
Post Body
Post Title
Table 4 supports our hypotheses: the Jaccard Similarity score is highest across the post titles of the grouped subreddits when we filter for the economics word list. This means that the post titles not only have a similar distribution of counts across the grouped subreddits but also have similar raw counts for each individual word. This is because the post titles contain the key words that we are looking for, while the post bodies contain links to online news articles. On the other hand, political words in either the post bodies or titles do not have a similar distribution of counts across the grouped subreddits, as shown in both Table 3 and Table 4.
Business Goal #7: Explore the activity of r/Economics submissions based on the state of the economy
GDP vs r/Economics Submissions
Figure 6 compares the United States’ Real Gross Domestic Product (GDP) with the number of submissions in the r/Economics subreddit from January 2021 to March 2023. As the plot shows, the submission count hovers around 1,000 posts per month, reflecting a steady engagement level within the subreddit community.
A notable deviation occurs in July 2022, marked by an annotation that points out the release of news indicating that the U.S. economy had contracted for two consecutive quarters, often considered a technical indicator of a recession. Following this announcement, there is a marked spike in subreddit activity in August and September 2022, with submissions surging to well over four times the usual number. This spike in activity suggests a heightened collective concern and a rush to discuss and understand the implications of the economic news.
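The two signals discussed above, two consecutive quarterly GDP contractions and a spike in monthly submissions, can be sketched as follows. All values are placeholders rather than FRED or Reddit data, and the spike rule (more than four times the median month) is an assumed way to operationalize "four times the usual number".

```python
from statistics import median


def consecutive_contractions(gdp_by_quarter, run=2):
    """Return quarters that complete `run` consecutive quarter-over-quarter declines."""
    quarters = sorted(gdp_by_quarter)  # 'YYYYQn' keys sort chronologically
    flagged = []
    declines = 0
    for previous, current in zip(quarters, quarters[1:]):
        if gdp_by_quarter[current] < gdp_by_quarter[previous]:
            declines += 1
        else:
            declines = 0
        if declines >= run:
            flagged.append(current)
    return flagged


def spike_months(counts_by_month, factor=4):
    """Return months whose count exceeds `factor` times the median month."""
    baseline = median(counts_by_month.values())
    return [
        month
        for month, count in sorted(counts_by_month.items())
        if count > factor * baseline
    ]


# Placeholder index values and counts, for illustration only.
gdp = {"2021Q4": 100.0, "2022Q1": 99.5, "2022Q2": 99.3, "2022Q3": 100.0}
submissions = {
    "2022-06": 1000,
    "2022-07": 1100,
    "2022-08": 4500,
    "2022-09": 4800,
    "2022-10": 1000,
}
contraction_quarters = consecutive_contractions(gdp)
spikes = spike_months(submissions)
```

The median is used as the baseline because, unlike the mean, it is not inflated by the spike months themselves.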
GDP vs r/Economics Submissions Scores
Figure 7 again plots the United States’ Real Gross Domestic Product (GDP), but this time against the average submission score in the r/Economics subreddit from January 2021 to March 2023. In contrast to the submission counts, the average submission score drops sharply during the same period of economic concern. This significant downturn in average score might also be interpreted as a barometer of public sentiment towards the economy, rather than simply a measure of post quality. In times of economic uncertainty, it’s common for sentiment to sour, and this shift is often reflected in the collective mood of discussions and the types of content that get upvoted. As real GDP contracted for two consecutive quarters, the mood likely shifted from cautiously optimistic to more pessimistic and critical, leading to a preference for upvoting posts that resonate with the community’s concerns and anxieties. However, because the number of posts also increased significantly in that period, lower per-post activity (comments and upvotes) may also have pulled down the average score.