EDA

Executive Summary

Our exploratory data analysis (EDA) on Reddit’s major political and economic subreddits has provided a nuanced view of online political discourse, offering valuable context for understanding user behavior. The analysis delved into various aspects of these online subreddit communities, shedding light on the dynamics of discussions and the interplay of user interests.

Preliminary analysis showed that r/Conservative is the most active subreddit, with the highest number of posts and comments, while r/AskPolitics has the most distinct posts and comments, indicating a more diverse user base. r/Conservative also has the highest average post score, suggesting that its content resonates strongly with its user base, and it shares the largest number of users with other political subreddits, indicating that its users engage in multiple political communities. Subreddits such as r/socialism and r/Liberal show steadier activity, with fewer fluctuations in the number of comments and submissions, suggesting a more stable user base.

Additionally, weekly posting patterns showed that Thursday is the most active day for posting on r/Libertarian, r/socialism, and r/centrist, whereas r/Conservative sees its highest activity on Wednesdays. Posting activity generally decreases as the year progresses, with the lowest numbers appearing in December. Saturdays and Sundays consistently have fewer posts than weekdays, in line with typical online engagement trends.

Finally, we incorporated U.S. GDP data for January 2021 to March 2023, retrieved with the FRED Python package, to compare economic conditions with posting activity on the r/Economics subreddit. The analysis suggests a relationship between the state of the economy and the level of engagement and sentiment on r/Economics: during times of economic uncertainty, the number of submissions increases and the average submission score decreases. This suggests that people are more likely to turn to online communities to discuss and seek information about the economy during difficult times. Additionally, the types of posts that are upvoted during these times suggest that people are looking for content that resonates with their concerns and anxieties.

Analysis Report

Business Goal #2: Gain an understanding of distinguished, gilded, and controversial posts and comments among all subreddits as well as the users of each subreddit.

The subreddits were ranked by the counts of measures that correspond to controversial posts and comments. As Figures 1 and 2 have already shown, r/Conservative has a greater number of posts and comments than the other subreddits, so the counts shown in Table 1 have been normalized. Unsurprisingly, the r/ChangeMyView subreddit had the greatest number of distinguished posts and comments (nearly 2 standard deviations above the mean), which may hint at heavy moderation, potentially as a result of the touchy subjects being discussed. Ironically, the subreddits typically associated with the authoritarian left and right on the political compass (r/socialism and r/Conservative, respectively) have fewer distinguished posts and comments than the libertarian left and right subreddits (r/Liberal and r/Libertarian, respectively), an inversion of what one would expect of subreddit moderation. Additionally, the r/Conservative subreddit had the greatest number of gilded posts and comments. At a glance, this may look like users rewarding each other financially through Reddit Gold, which would promote an echo chamber of ideas in an online political space; however, this doesn’t seem to be the case, given that the subreddit also has the highest count of controversial comments (over 2 standard deviations above the mean). Instead, it may suggest that users on this subreddit are simply more willing to monetarily support posts and comments that they find appealing.

Table 1: Subreddits with the Highest Number of Controversial Posts and Comments (Normalized)
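The report does not spell out how the counts were normalized. One plausible reading, sketched below, is to convert each raw count into a per-item rate (so that larger subreddits are not favored) and then express each rate as a z-score, which is what a statement such as "2 standard deviations from the mean" refers to. The function names, subreddit labels, and numbers are illustrative assumptions, not the report's data or method.

```python
from statistics import mean, pstdev

def flag_rates(flagged, totals):
    """Share of each subreddit's items that carry a flag (e.g. distinguished)."""
    return {name: flagged[name] / totals[name] for name in flagged}

def z_scores(values):
    """Standardize values as (value - mean) / population standard deviation."""
    mu = mean(values.values())
    sigma = pstdev(values.values())
    if sigma == 0:
        # All subreddits are identical, so none deviates from the mean.
        return {name: 0.0 for name in values}
    return {name: (value - mu) / sigma for name, value in values.items()}

# Hypothetical counts (assumption): sub_a is by far the largest subreddit.
flagged = {"sub_a": 50, "sub_b": 30, "sub_c": 10}
totals = {"sub_a": 10000, "sub_b": 1000, "sub_c": 1000}

scores = z_scores(flag_rates(flagged, totals))
most_flagged = max(scores, key=scores.get)
```

In this toy example sub_a has the highest raw count but the lowest rate, so it ranks last after normalization; this is the effect the normalization is meant to capture.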

Next, we looked at individual users and ranked them by the counts of measures that correspond to controversial posts and comments. Based on the distinguished submissions and comments, only one of the top 20 controversial users (user: “ultimis”) is a moderator. Another finding is that the top 20 users all come from r/centrist, r/Libertarian, or r/Conservative. Consistent with Table 1, Table 2 suggests that these three political subreddits have a large amount of intra-subreddit discourse among their posts and comments.

Table 2: Top 20 Users with the Highest Number of Controversial Posts and Comments

Business Goal #3: Determine and visualize the 50 highest scored posts across political subreddits.

Figure 3: Bubble Chart showing the 50 highest scored posts per subreddit.

Figure 3 is a bubble chart illustrating the average scores of the top 50 posts from various political subreddits, where a post’s score is the number of upvotes minus the number of downvotes. The size of each subreddit’s bubble reflects the average score of its posts: the largest bubble corresponds to the subreddit with the highest average post score, and the smallest bubble to the subreddit with the lowest.

The subreddit with the largest bubble (r/Conservative) significantly outpaces the others in average score, suggesting that posts in this community resonate strongly with its members and receive more upvotes. Conversely, the subreddit with the smallest bubble (r/centrist) has the lowest average score, indicating a smaller community, less engagement, or a tendency towards more divisive content that doesn’t amass as many upvotes.

Within each of these larger subreddit bubbles are smaller bubbles, each representing an individual post; the title of the post is indicated within these smaller bubbles. Additionally, interacting with these smaller bubbles by clicking on them will redirect the viewer to the respective post’s link on Reddit for further exploration.
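The bubble sizes described above rest on a simple aggregation: take each subreddit's highest-scored posts and average their scores. A minimal sketch follows, assuming the posts are available as (subreddit, score) pairs; the function name and the toy data are hypothetical and do not reproduce the report's pipeline.

```python
import heapq
from collections import defaultdict

def top_n_average(posts, n=50):
    """Average score of the n highest-scored posts in each subreddit.

    posts: iterable of (subreddit, score) pairs, where score is
    upvotes minus downvotes.
    """
    scores_by_subreddit = defaultdict(list)
    for subreddit, score in posts:
        scores_by_subreddit[subreddit].append(score)

    averages = {}
    for subreddit, scores in scores_by_subreddit.items():
        top = heapq.nlargest(n, scores)  # at most n scores, highest first
        averages[subreddit] = sum(top) / len(top)
    return averages

# Toy data (assumption), using n=2 to keep the example small.
posts = [("sub_a", 10), ("sub_a", 30), ("sub_a", 2), ("sub_b", 5)]
averages = top_n_average(posts, n=2)
```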

Business Goal #4: Understand the temporal effects of number of posts across political subreddits

Figure 4: Calendar Heatmap of r/Conservative Reddit Posts

Figure 4 depicts the daily and monthly number of posts on the r/Conservative subreddit. There’s a clear trend of higher activity on weekdays, with Wednesdays being the peak, and lower activity on weekends, particularly on Sundays. The data also indicates a seasonal trend, with post volumes generally higher in March and April and declining towards the end of the year, in November and December. The color gradient emphasizes these variations, with darker hues representing more posts, allowing for a visual representation of user engagement on Reddit throughout the different months and days of the week.
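The weekday and month patterns read off the heatmap can also be checked numerically by bucketing post timestamps. The sketch below assumes each post carries a Unix timestamp in UTC (named `created_utc` here as an assumption about the dataset's schema); the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def posts_by_weekday_and_month(created_utc):
    """Count posts per weekday and per month from Unix timestamps (UTC)."""
    weekday_counts = Counter()
    month_counts = Counter()
    for ts in created_utc:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        weekday_counts[WEEKDAYS[dt.weekday()]] += 1   # weekday(): Monday == 0
        month_counts[MONTHS[dt.month - 1]] += 1
    return weekday_counts, month_counts

# Made-up timestamps (assumption): two Wednesdays, one Sunday, one Thursday.
sample = [
    datetime(2022, 3, 2, 12, tzinfo=timezone.utc).timestamp(),
    datetime(2022, 3, 9, 12, tzinfo=timezone.utc).timestamp(),
    datetime(2022, 3, 6, 12, tzinfo=timezone.utc).timestamp(),
    datetime(2022, 12, 1, 12, tzinfo=timezone.utc).timestamp(),
]
weekday_counts, month_counts = posts_by_weekday_and_month(sample)
```

Using UTC throughout is a simplification; bucketing by a U.S. time zone instead could shift posts made near midnight into the adjacent day.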

Note: Calendar heatmaps for other subreddits can be found in the appendix.

Business Goal #5: Shared user interaction across political subreddits

Figure 5: Dependency Wheel of Users in different Subreddits

The dependency wheel depicted in Figure 5 above provides a graphical representation of the shared user base between various political subreddits. The thickness of the connecting bands is directly proportional to the number of users that participate in both subreddits at each end of the band. A notable observation from the wheel is the significant interconnectivity between certain subreddits, which could suggest a shared ideological proximity or an interest overlap among their user bases. For instance, if there is a thick band between the subreddits labeled r/Liberal and r/socialism, this would indicate a large common audience, possibly due to similar political leanings or discussions that appeal to both groups.

Additionally, the visualization reveals subreddits that serve as common grounds for diverse political discourse, such as r/ChangeMyView or r/AskPolitics, where the number of interlinking bands suggests a wide range of users from different political backgrounds converging for debate or inquiry. This could imply that these platforms are more neutral or open-ended, attracting a varied audience seeking to engage with multiple perspectives. The overall pattern highlights not only the segmentation within the political discourse on Reddit but also the points of intersection where cross-ideological conversations are occurring.
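The band thickness in a dependency wheel of this kind comes down to the number of users two subreddits have in common. A minimal sketch of that computation, assuming the set of distinct authors per subreddit has already been collected; the subreddit labels and usernames below are hypothetical.

```python
from itertools import combinations

def shared_user_counts(users_by_subreddit):
    """Number of users active in both subreddits, for every subreddit pair.

    users_by_subreddit: dict mapping subreddit name -> set of usernames.
    Returns a dict keyed by (subreddit_a, subreddit_b) in sorted order.
    """
    overlaps = {}
    for a, b in combinations(sorted(users_by_subreddit), 2):
        overlaps[(a, b)] = len(users_by_subreddit[a] & users_by_subreddit[b])
    return overlaps

# Hypothetical author sets (assumption).
users = {
    "sub_a": {"u1", "u2", "u3"},
    "sub_b": {"u2", "u3"},
    "sub_c": {"u3", "u9"},
}
overlaps = shared_user_counts(users)
```

Each resulting count would map to one band of the wheel. Deleted or bot accounts would likely need to be filtered out of the sets beforehand, since they can inflate overlaps.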

Business Goal #6: Measure the similarity between political and finance/macroeconomic topics discussed among grouped political and economics subreddits

We present the Cosine and Jaccard similarity scores for posts’ bodies and titles across grouped political and economics subreddits. The political subreddits used for the two tables below are r/Conservative, r/socialism, r/centrist, and r/Libertarian, and the economics subreddits are r/Finance and r/Economics. Initially, we used regex patterns on the posts’ bodies to obtain the counts of three words, [recession, inflation, unemployment], for each subreddit group and computed Cosine and Jaccard similarity scores on the counts. The Cosine similarity between the two groups was 0.9 and the Jaccard similarity was 0.04, meaning that the two groups have similar distributions of words even though the counts of individual words differ significantly. This finding led to two hypotheses: first, that the grouped economics subreddits likely contain a large number of “shitposts”, probably related to cryptocurrencies and NFTs; and second, that post bodies in the economics subreddits consist mainly of links to online news articles, with the key words appearing in the post titles instead. Consequently, we extended the word list with [crypto, blockchain, nft] to account for the words contained in these flippant posts, and also ran the similarity algorithms on the posts’ titles.
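To make the two measures concrete, the sketch below counts keyword matches with a regex and compares the resulting count vectors. Cosine similarity depends only on the direction of the vectors, so it stays high when the word proportions match. For Jaccard, the report does not state which variant was applied to counts; the weighted form (sum of element-wise minima over sum of element-wise maxima) is assumed here because it penalizes differences in magnitude, which matches the interpretation given above. All names and numbers are illustrative.

```python
import math
import re
from collections import Counter

ECON_WORDS = ["recession", "inflation", "unemployment"]

def count_words(texts, words):
    """Count whole-word, case-insensitive matches of each word across texts."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b", re.IGNORECASE
    )
    counts = Counter()
    for text in texts:
        counts.update(match.lower() for match in pattern.findall(text))
    return [counts[w] for w in words]

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def weighted_jaccard(a, b):
    """Weighted Jaccard: sum of minima over sum of maxima (assumed variant)."""
    denominator = sum(max(x, y) for x, y in zip(a, b))
    if denominator == 0:
        return 0.0
    return sum(min(x, y) for x, y in zip(a, b)) / denominator

# Made-up count vectors (assumption): same proportions, ten times the volume.
political_counts = [10, 20, 5]
economics_counts = [100, 200, 50]
cos = cosine_similarity(political_counts, economics_counts)
jac = weighted_jaccard(political_counts, economics_counts)
```

With identical proportions but a tenfold difference in volume, the cosine similarity is 1 while the weighted Jaccard is only 0.1, the same qualitative pattern as the 0.9 versus 0.04 result.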

In addition to the economics word list, we also built a political word list, [trump, biden, election, fed, powell], obtained the grouped subreddit counts for these words, and ran the similarity algorithms on the posts’ bodies and titles. The results are presented and discussed below.

Post Body


Table 3: Cosine and Jaccard Similarity of Post Body for combined Political and Economics subreddits

Post Title


Table 4: Cosine and Jaccard Similarity of Post Title for combined Political and Economics subreddits

Table 4 supports our hypotheses: the Jaccard similarity score is highest across the post titles of the grouped subreddits when we filter for the economics word list. This means that the posts’ titles not only have a similar distribution of counts across the grouped subreddits but also have similar raw counts for each individual word. This is because the posts’ titles contain the key words that we are looking for, while the posts’ bodies contain links to online news articles. On the other hand, political words in either the posts’ bodies or titles do not have a similar distribution of counts across the grouped subreddits, as shown in both Table 3 and Table 4.

Business Goal #7: Explore the activity of r/Economics submissions based on the state of the economy

GDP vs r/Economics Submissions

Figure 6: Number of r/Economics Submissions vs U.S. Real GDP

Figure 6 compares the United States’ Real Gross Domestic Product (GDP) with the number of submissions in the r/Economics subreddit from January 2021 to March 2023. As the plot shows, the submission count hovers around 1,000 posts per month, reflecting a steady level of engagement within the subreddit community.

A notable deviation occurs in July 2022, marked by an annotation pointing to the release of news that the U.S. economy had contracted for two consecutive quarters, which is often considered a technical indicator of a recession. Following this announcement, there is a marked spike in subreddit activity in August and September 2022, with submissions surging to well over four times the usual number. This spike suggests heightened collective concern and a rush to discuss and understand the implications of the economic news.

GDP vs r/Economics Submissions Scores

Figure 7: Average Score of r/Economics Submissions vs U.S. Real GDP

Figure 7 again plots the United States’ Real Gross Domestic Product (GDP), this time against the average submission score in the r/Economics subreddit over the same period, January 2021 to March 2023. In contrast to the spike in submissions, this visualization reveals a stark drop in the average score of submissions during the same period of economic concern. This downturn in average score might be interpreted as a barometer of public sentiment towards the economy, rather than simply a measure of post quality. In times of economic uncertainty, it’s common for sentiment to sour, and this shift is often reflected in the collective mood of discussions and the types of content that get upvoted. As real GDP contracted for two consecutive quarters, the mood likely shifted from cautiously optimistic to more pessimistic and critical, leading to a preference for upvoting posts that resonate with the community’s concerns and anxieties. However, it is also worth noting that because the number of posts increased significantly in that period, lower activity per post (comments/upvotes) could also have lowered the average score.
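The caveat above can be illustrated with a small sketch that derives both monthly series (submission count and average score) from the same records and flags months whose volume exceeds a multiple of the median month. The record layout, the function names, the `factor=4.0` threshold (chosen to echo "over four times the usual number"), and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

def monthly_stats(submissions):
    """Submission count and average score per month.

    submissions: iterable of (month, score) pairs, month formatted "YYYY-MM".
    """
    scores_by_month = defaultdict(list)
    for month, score in submissions:
        scores_by_month[month].append(score)
    return {
        month: {"count": len(scores), "avg_score": sum(scores) / len(scores)}
        for month, scores in sorted(scores_by_month.items())
    }

def spike_months(stats, factor=4.0):
    """Months whose submission count exceeds factor times the median month."""
    baseline = median(month["count"] for month in stats.values())
    return [name for name, month in stats.items() if month["count"] > factor * baseline]

# Toy data (assumption): every month collects 100 points in total, but the
# last month spreads them over five times as many submissions.
toy = (
    [("2022-06", 40), ("2022-06", 60)]
    + [("2022-07", 50), ("2022-07", 50)]
    + [("2022-08", 10)] * 10
)
stats = monthly_stats(toy)
spikes = spike_months(stats)
```

In the toy data the total score is identical in every month, yet the average falls from 50 to 10 in the spike month purely because the same attention is divided among more submissions. A lower average score is therefore not, on its own, evidence of a change in sentiment.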

Appendix

Appendix 1: Line Chart of number of posts by distinct number of users (per subreddit)
Appendix 2: Line Chart of number of comments by distinct number of users (per subreddit)
Appendix 3: Calendar Heatmap of r/Libertarian Reddit Posts
Appendix 4: Calendar Heatmap of r/centrist Reddit Posts
Appendix 5: Calendar Heatmap of r/socialism Reddit Posts