Exploratory Data Analysis

Analysis Report

In this analysis, we used Reddit submissions data stored in parquet format in an Azure Machine Learning workspace blob store, employing PySpark for distributed data processing and Pandas for local data manipulation and analysis.

To kick off our exploration of entertainment subreddits, we first aimed for a broad understanding of how submissions are distributed across entertainment-related subreddits. To achieve this, we created a lollipop chart with post counts on the x-axis and the entertainment subreddits on the y-axis.

Figure-1: Count of each subreddit in the dataset from 2021-2023

From Figure 1, it’s evident that the ‘anime’ subreddit is the most active with 404,298 posts, showcasing a community with a strong interest in anime-related content. Following closely is the ‘movies’ subreddit with 382,085 posts. Both the ‘anime’ and ‘movies’ subreddits outshine the ‘television’ subreddit in post counts, revealing higher user engagement in these topics. A similar trend can be observed among the subreddits dedicated to suggestions: ‘AnimeSuggestions’ has the highest post count, indicating substantial interest in anime recommendations, followed by ‘MovieSuggestions’ with a moderate number of posts. ‘televisionsuggestions’ has the fewest, reflecting lower user engagement for television recommendations on the Reddit platform.

Next, we filtered the data to keep only the subset of columns relevant to our analysis. The selected columns include information about the submission, such as the subreddit, author, title, selftext, creation timestamp, number of comments, score, and various other attributes related to the submission’s characteristics. After selecting specific columns, we aimed to understand the dataset’s structure and quality. To achieve this, we obtained the count of missing values in each column of the dataset. Table 1 represents a summary of missing values in our dataset, where each row corresponds to a different feature, and the ‘Missing Values’ column quantifies the number of missing values for each feature.

Table-1: Count of missing values of all the columns for submissions dataset

Table-1 shows that columns such as ‘author’, ‘title’, ‘selftext’, ‘created_utc’, ‘num_comments’, ‘score’, ‘over_18’, ‘pinned’, and ‘locked’ have zero missing values, suggesting they are mandatory fields in the platform’s post submission process. In contrast, a significant number of missing values are observed in the ‘disable_comments’, ‘distinguished’, and ‘media’ columns. This could indicate that these fields are optional or applicable only to certain posts. For instance, ‘distinguished’ could denote a special status assigned to certain posts, which would naturally be less common. The columns with missing values were not relevant to our business goals, and dropping rows with missing values in these columns would remove posts that simply lack optional features. We therefore decided to retain the rows with missing values, which also preserves the completeness of the dataset.

Moving further, new columns were added to the data through three transformations of the original DataFrame:
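As a rough illustration of how a missing-value summary like Table-1 can be produced, the sketch below uses Pandas on a tiny invented DataFrame. The rows and the name `missing_counts` are assumptions for illustration only and are not taken from the project code.

```python
import pandas as pd

# Toy stand-in for the submissions data; all values are invented.
posts = pd.DataFrame(
    {
        "author": ["a", "b", "c"],
        "title": ["t1", "t2", "t3"],
        "distinguished": [None, "moderator", None],
        "media": [None, None, "video"],
    }
)

# isna() flags missing cells; summing the booleans gives a per-column count.
missing_counts = posts.isna().sum().rename("Missing Values")
print(missing_counts)
```

Counting rather than dropping keeps every row available, which matches the decision to retain posts whose optional fields are empty.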

  • Firstly, we extracted various time-related features like ‘hour_of_day’, ‘day_of_week’, ‘day_of_month’, ‘month’, and ‘year’ from the ‘created_utc’ timestamp. These features can be useful for analyzing trends or patterns in the data based on temporal information.
  • Secondly, we wanted to analyze the overall length of posts, so we added a new column ‘post_length’ to the data. The values in this column are calculated based on the sum of the lengths of the ‘title’ and ‘selftext’ columns for each row.
  • Thirdly, we created a new column called ‘has_media’ that is set to “True” if a post has media content (like images or videos) and “False” if it doesn’t. This makes it easier to identify which posts include some form of media and which do not.
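The three transformations above can be sketched in Pandas as follows. This is a minimal sketch under stated assumptions: the sample rows are invented, ‘created_utc’ is assumed to hold Unix timestamps in seconds, and ‘has_media’ is assumed to mean that the ‘media’ field is not missing. The project itself may have done this in PySpark.

```python
import pandas as pd

# Invented sample rows; created_utc is assumed to be Unix seconds.
posts = pd.DataFrame(
    {
        "title": ["Hello", "New trailer"],
        "selftext": ["world!", ""],
        "created_utc": [1609459200, 1615939200],
        "media": [None, "video"],
    }
)

# 1) Time-related features derived from the creation timestamp (UTC).
ts = pd.to_datetime(posts["created_utc"], unit="s", utc=True)
posts["hour_of_day"] = ts.dt.hour
posts["day_of_week"] = ts.dt.day_name()
posts["day_of_month"] = ts.dt.day
posts["month"] = ts.dt.month
posts["year"] = ts.dt.year

# 2) Post length = length of title + length of selftext.
#    fillna("") guards against missing text so the sum is always defined.
posts["post_length"] = (
    posts["title"].fillna("").str.len() + posts["selftext"].fillna("").str.len()
)

# 3) Media flag: True when the media field is present (assumed definition).
posts["has_media"] = posts["media"].notna()

# Drop source columns that are no longer needed after the transformations.
posts = posts.drop(columns=["media", "created_utc"])
print(posts)
```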

After these transformations, columns such as “media,” “created_utc,” “disable_comments,” and “distinguished,” which were no longer necessary for our analysis, were dropped.

The following table shows the final columns and their descriptions that were used for our analysis.

Table-2: Description of key variables in the Reddit data

Moving forward, we performed data manipulation to generate structured datasets encompassing top comments for specific subreddits, author post counts, and time series analysis. These datasets will serve as a foundation for extracting insights into user engagement, content popularity, temporal evolution, and author contributions on the Reddit platform through the creation of meaningful visualizations.

The detailed process and code can be found here for data cleaning and here for data manipulation.

Understanding trends in the engagement of posts from 2021-2023

Next, we were curious to see how engagement across the prominent subreddits ‘anime,’ ‘movies,’ and ‘television’ evolved over time. To do so, we first analysed how the number of posts and the scores varied over a period of a little more than two years.

Figure-2: This graph illustrates the normalized post count trends across three subreddits (anime, movies, and television) over a little more than two years, from January 2021 to April 2023. The peaks and troughs indicate fluctuations in user engagement and post frequency within the Reddit community.

Figure-2 captures the trend in the number of posts made across the three subreddits over the period 2021-2023. Normalized post counts were used so that relative fluctuations in user engagement and post frequency can be compared across subreddits of different sizes. Overall, the subreddits exhibit a gradual decline, implying a slow yet consistent decrease in the number of posts and interactions. Comparing the three subreddits, however, the decrease is most visible in the movies and anime subreddits, while engagement, in terms of posts made, remains relatively constant for the television subreddit.

The overall decline raises questions about the factors influencing these changes. Shifting platform dynamics, evolving content consumption patterns, or the rise of alternative entertainment forums could be contributing to this trend. It may indicate a shift in the digital landscape in which traditional subreddit forums no longer command the same level of attention as before. Certain high peaks could correlate with significant new releases in any of the three categories, resulting in increased engagement and discussion on the platform. For instance, we can see a peak starting from March 2021, which could correspond to the release of Zack Snyder’s much-hyped movie, Justice League.
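The report does not state which normalization was applied for Figure-2. One plausible sketch, assuming each subreddit's monthly post count is divided by that subreddit's own maximum monthly count, is shown below on invented data. The column names `post_count` and `normalized_count` are hypothetical.

```python
import pandas as pd

# Invented rows: one row per post, with the subreddit and posting month.
posts = pd.DataFrame(
    {
        "subreddit": ["anime", "anime", "anime", "movies", "movies", "movies"],
        "year": [2021, 2021, 2021, 2021, 2021, 2021],
        "month": [1, 1, 2, 1, 2, 2],
    }
)

# Count posts per subreddit and month.
monthly = (
    posts.groupby(["subreddit", "year", "month"])
    .size()
    .rename("post_count")
    .reset_index()
)

# Assumed normalization: scale each subreddit by its own peak month,
# so every series has a maximum of 1.0 and the shapes become comparable.
peak = monthly.groupby("subreddit")["post_count"].transform("max")
monthly["normalized_count"] = monthly["post_count"] / peak
print(monthly)
```

Scaling each series by its own peak removes the size difference between subreddits, so only the relative rise and fall remains visible.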

Figure-3: This graph illustrates the average scores across three subreddits (anime, movies, and television) over a little more than two years, from January 2021 to April 2023. The peaks and troughs indicate fluctuations in user engagement and scores within the Reddit community.

Figure 3 shows how average scores changed from 2021 to 2023. Comparing Figures 2 and 3, we observe that scores for the anime subreddit remain relatively constant during this period, despite a considerable, fluctuating decline in the number of posts. For movies and television, however, there is a slight upward trend in scores, in contrast to the steep downward trend in the number of posts. This suggests that, for these two categories, the number of posts is weakly inversely related to the score. One possible explanation is that, with fewer posts available to view, the posts written during that period each received more votes.

Next, we wanted to see whether the number of posts and the average score follow any trend depending on the time of the month, that is, whether there is a day of the month on which engagement on the platform is consistently high or low.

Figure-4: Number of posts submitted on different days of the month from 2021-2023

Figure 4 compares the number of posts across the three subreddits by the day of the month. In this figure, it is challenging to pinpoint any specific day of the month where the number of posts is particularly high. The downward trend at the end of the month is attributed to the variation in the number of days each month, as not every month has 31 days.
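The calendar effect described above can be made concrete by counting how many months in a date range actually contain each day of the month. The sketch below uses only the Python standard library. The range January 2021 through March 2023 is an assumption for illustration, since the exact end date of the data is not stated here, and the function name is hypothetical.

```python
import calendar


def months_containing_day(start, end):
    """For each day of month 1..31, count the months in [start, end] that contain it.

    start and end are (year, month) tuples, both inclusive.
    """
    counts = {day: 0 for day in range(1, 32)}
    year, month = start
    while (year, month) <= end:
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            counts[day] += 1
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return counts


# Assumed range: complete months from January 2021 through March 2023.
counts = months_containing_day((2021, 1), (2023, 3))
print(counts[28], counts[29], counts[30], counts[31])  # 27 24 24 16
```

Under this assumption, day 31 occurs in only 16 of 27 months, so its raw post total would be expected to be lower even if daily activity were perfectly flat. Dividing each day's total by these counts would give a per-occurrence average that removes the effect.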

Figure-5: Average score of posts submitted on different days of the month from 2021-2023

When examining Figure 5, which compares the average scores for each day of the month, it is apparent that the scores generally remain constant, with occasional bumps, in all three categories. However, television shows the highest scores despite having the fewest posts. This observation is consistent with an inverse relationship between the number of posts and the scores. Furthermore, we refined our analysis to understand how the number of posts and the scores vary across days of the week and hours of the day.

Figure-6: Number of posts during the week and time of the day for the subreddits

Figure-6 presents the distribution of posts across days of the week and hours of the day for the subreddits ‘anime,’ ‘movies,’ and ‘television.’ This visualization suggests that the ‘television’ subreddit has its own pattern of user interaction, potentially reflecting the airing schedule of popular shows or weekly events that drive discussions, especially during the late afternoon hours from 3 PM to 5 PM. Across the days of the week, all three subreddits show fairly evenly spread activity, indicating a consistent level of engagement without significant day-to-day peaks or troughs.

Figure-7: Average Score of posts during the week and time of the day for the subreddits

Figure-7 illustrates the distribution of average scores across days of the week and hours of the day for the subreddits ‘anime,’ ‘movies,’ and ‘television.’ The plot suggests that, on every day of the week, people are most active in reading and voting on posts during the late afternoon hours of 2 PM to 4 PM. This pattern mirrors the number of posts made during those hours.
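The day-of-week by hour-of-day grids behind plots like Figure-6 and Figure-7 can be built with a group-by followed by an unstack. The sketch below is one way to do it in Pandas on invented rows. The variable names are hypothetical, and the original plotting code may aggregate differently.

```python
import pandas as pd

# Invented rows; in the report these columns come from the time features.
posts = pd.DataFrame(
    {
        "day_of_week": ["Monday", "Monday", "Monday", "Tuesday"],
        "hour_of_day": [15, 15, 16, 15],
        "score": [10, 30, 5, 8],
    }
)

grouped = posts.groupby(["day_of_week", "hour_of_day"])["score"]

# Rows = day of week, columns = hour of day.
post_counts = grouped.size().unstack(fill_value=0)  # number of posts per cell
avg_scores = grouped.mean().unstack()               # mean score per cell (NaN if empty)

print(post_counts)
print(avg_scores)
```

Empty cells are filled with 0 for counts but left as NaN for averages, since an average over zero posts is undefined.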

Overall, we can assert that interaction activities peak on these Reddit threads during the late afternoon hours. These insights underscore the importance of timing in Reddit community engagement. For marketers and content creators, aligning content releases with these observed peaks could enhance user engagement. For platform moderators, it could inform the scheduling of events or AMAs (Ask Me Anything sessions) to ensure higher participation rates. Additionally, this data could assist in the strategic placement of advertisements during specific hours of a particular day, ensuring they reach the most active audience segments.

Exploring how Reddit post engagement dynamics are influenced by subreddit characteristics

Next, we delved into the diverse aspects of user engagement dynamics on Reddit posts, investigating variations related to characteristics such as score, post length, number of comments, and the inclusion of media content. For this analysis, we created a bubble plot comparing post engagement by length and score across the three subreddits. Additionally, we compared posts with media against those without media within the anime, movies, and television subreddits.

Figure-8: Correlation between post lengths and scores within anime, movies, and television subreddits on Reddit.

Figure 8 illustrates the engagement dynamics of Reddit posts, comparing post length to user engagement across subreddits dedicated to anime, movies, and television. Each dot represents an individual post, with its size corresponding to the score, which is a proxy for the post’s popularity or engagement level.

This analysis reveals a dense clustering of posts with scores under 10,000 across all subreddits, with post lengths predominantly falling below 2000 characters. This suggests that shorter posts are more common, regardless of the subreddit category. Interestingly, there is a visible trend where posts with higher scores tend to have shorter lengths, indicating that more concise posts may resonate better with the audience or are more likely to be engaged with.

The larger-sized dots, indicating posts with exceptionally high scores, are predominantly short and are scattered across all three subreddits. This may indicate that highly engaging posts, which are likely to go viral or hit the front page of Reddit, tend to be concise.

This insight can inform content creators and marketers about optimal post lengths for audience engagement in different entertainment categories on Reddit.

Figure-9: Correlation between post lengths and scores within submissions that have media as well as those that don't have media.

Figure 9 illustrates the engagement dynamics of Reddit posts, comparing post length to user engagement within submissions that have media as well as those that don’t have media. Each dot represents an individual post, with its size corresponding to the score, which is a proxy for the post’s popularity or engagement level.

From Figure 9, it seems that posts with media tend to have a higher score regardless of post length, as indicated by the concentration of orange dots across all lengths, especially in the higher score region. There are some high-scoring posts without media, but these are less frequent as the score increases. This suggests that including media in a post could potentially lead to higher engagement on Reddit.

We further analysed the percentage of posts with and without media across the three subreddits.

Figure-10: Grouped bar plot showing the percentage of posts with media and those without media within the anime, movies, and television subreddits on Reddit.

From Figure 10 we can see that the percentage of posts with media is significantly higher than those without across all three subreddits. The “movies” subreddit has the highest percentage of posts with media (89.75%), followed by “television” (83.64%), and “anime” (82.56%). Conversely, the percentage of posts without media is the lowest in the “movies” subreddit (10.25%), which correlates with the earlier observation that media content is prevalent in high-engagement posts.
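Percentages like those in Figure 10 can be derived directly from the boolean ‘has_media’ column, because the mean of a boolean column is the share of True values. The sketch below uses invented rows, and the output column names are hypothetical.

```python
import pandas as pd

# Invented rows: four movies posts (three with media), two anime posts (one with media).
posts = pd.DataFrame(
    {
        "subreddit": ["movies"] * 4 + ["anime"] * 2,
        "has_media": [True, True, True, False, True, False],
    }
)

# Mean of a boolean column = fraction of True values; scale to a percentage.
pct_with_media = posts.groupby("subreddit")["has_media"].mean().mul(100)

media_share = pd.DataFrame(
    {
        "with_media_pct": pct_with_media,
        "without_media_pct": 100 - pct_with_media,
    }
)
print(media_share)
```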

Overall, these three graphs give us insights into engagement on Reddit, suggesting that both the presence of media and the length of the post play a role. Content creators could use this information to optimize their posts for better engagement by creating concise posts that include media, and by targeting their efforts towards subreddits where these types of posts perform well.

Identifying the most engaging and active authors

Furthermore, we wanted to analyze the post counts of the authors with the most-commented and highest-scoring posts, as well as the post counts of the most active authors. We did this for the three subreddits separately, as the subreddits are distinct and warrant their own analysis. This analysis would be beneficial for various purposes, including recognizing community leaders, understanding content popularity, and potentially collaborating with influential contributors to enhance community engagement or promotional activities.

To do so, we first identified the authors of the posts with the highest number of comments in each of the three subreddits.

Table-3: Ranks of authors from a subreddit based on the number of comments their posts receive, highlighting the most engaging content creators

Table 3 presents the top 10 posts with the highest number of comments, categorizing them based on movie, television, and anime subreddits. High comment counts typically signify that an author’s contributions resonate well with the community, sparking active discussions among the subreddit’s users. The table not only identifies the most engaging authors across subreddits but also emphasizes the importance of creating content that encourages active participation within the community. Furthermore, this data offers insights into the types of posts that cultivate robust discussions on Reddit.

Now that we have identified the authors receiving the highest number of comments on their posts, we wanted to look at how active these authors have been on Reddit.

Figure-11: Bar chart ranking the post counts of authors who have received the highest number of comments on their posts

Figure-11 presents the post counts of the authors who received the highest number of comments on their posts within the movies, anime, and television subreddits. These counts are a direct indicator of the quantity of content contributed by individual authors rather than the quality or engagement level of the posts.

The movies subreddit exhibits a concentration of activity from a few highly active authors, potentially impacting the diversity of topics and perspectives within the community. Notably, authors like “LETS_MAKE_IT_AWKWARD” and “officialtobeymaguire” reached the top 10 by comment count with only one post on Reddit, showcasing the influence individual contributors can have.

Similarly, in the television subreddit, the distribution among the top ten authors is skewed, with ‘MarvelsGrantMan136’ and ‘Neo2199’ being the most active, which suggests a reliance on a small group for a significant portion of content. On the other hand, authors like “ewzetf”, “Midnight_OIL_”, and “thetanhausergate” reached the top 10 by comment count with fewer than four posts, indicating impactful contributions despite lower post counts.
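The two-step lookup described here, first finding the most-commented posts per subreddit and then attaching each author's overall post count, can be sketched in Pandas as below. The rows, the author names, and the value of `TOP_N` are invented for illustration (the report uses the top 10).

```python
import pandas as pd

# Invented rows: one prolific author and two occasional authors.
posts = pd.DataFrame(
    {
        "subreddit": ["movies"] * 5,
        "author": ["alice", "alice", "alice", "bob", "carol"],
        "num_comments": [5, 7, 500, 900, 40],
    }
)

TOP_N = 2  # illustrative; the report ranks the top 10

# Step 1: the TOP_N most-commented posts within each subreddit.
top_posts = (
    posts.sort_values("num_comments", ascending=False)
    .groupby("subreddit")
    .head(TOP_N)
)

# Step 2: how many posts each author made in each subreddit overall.
author_post_counts = (
    posts.groupby(["subreddit", "author"])
    .size()
    .rename("post_count")
    .reset_index()
)

# Attach the post count to each top post's author.
top_author_activity = top_posts.merge(
    author_post_counts, on=["subreddit", "author"], how="left"
)
print(top_author_activity)
```

In this toy data the author of the most-commented post has a single post, while the runner-up has three, which is the kind of contrast the bar chart is meant to surface.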

In contrast, the anime subreddit experiences a unique scenario with bots as primary content contributors. Despite their automated nature, these bots engage the community more effectively than typical human-authored posts, potentially due to timely, consistently formatted, and relevant content. This challenges the notion that engaging content must be human-generated, showcasing the potential for well-tuned bots to contribute meaningfully to online discourse.

Furthermore, we aimed to assess what kind of content receives the highest scores. A post’s score is calculated as the number of upvotes minus the number of downvotes.

Table-4: Ranks of authors from a subreddit based on the scores their posts receive, highlighting the most engaging content creators

Table 4 lists the top 10 posts by score for each subreddit. For the movies subreddit, the highest-scoring post is an AMA with Keanu Reeves, indicating that interactive sessions with famous actors are highly valued in the community. This is further emphasized by the presence of another AMA, with Tobey Maguire, in second position. The third-highest post, by Nicolas Cage, suggests that personal engagement from actors or filmmakers garners significant attention. Posts regarding industry news and opinions, such as Brendan Fraser’s win at the Academy Awards and a call to prioritize voice actors over celebrities in animated features, reflect the community’s interest in both celebrating achievements and critiquing industry trends. A post about a ‘Dune’ sequel hints at the community’s enthusiasm for science fiction and blockbuster franchises.

For the television subreddit, the most popular post highlights a public figure, LeVar Burton, expressing his desire to become the new host of “Jeopardy!”, which is indicative of the community’s interest in show hosts and the personalities associated with them. This is further emphasized by another post in which Burton encourages the producers of “Jeopardy!” to consider him for the role, showcasing the community’s support for his candidacy. Other posts indicate strong community interest in television show news and updates, like the revival of “Futurama” and a potential third season for “Mindhunter.” The active engagement in these topics suggests that news about television series renewals and continuations is particularly resonant with the audience.

Moving to the anime subreddit, the top post is about the death of Kentaro Miura, the creator of “Berserk,” which underscores the impact that influential figures in the anime industry have on the community. The high score indicates that news about significant personalities, especially those who have made a profound impact on the genre, resonates deeply with the audience. Several posts relate to announcements of new seasons for popular anime series like “The Devil is a Part-Timer,” “Spice and Wolf,” and “Konosuba.” This suggests that the subreddit is a hub for fans to discuss and share their excitement about continuing series and new developments within their favorite anime. There is also a post about a prediction tournament for “Best Girl,” a common discussion topic in anime communities where fans vote for their favorite female characters. The popularity of such a post indicates that interactive and participatory events are engaging for the community members.

Now that we have identified authors receiving the highest scores on their posts, we want to examine how active these authors have been on Reddit.

Figure-12: Bar chart ranking the post counts of authors who have received the highest scores on their posts

Figure 12 presents the post counts of the authors who received the top scores on their posts within the movies, anime, and television subreddits.

In the movies subreddit, ‘MarvelsGrantMan136’ dominates but is followed by a group of authors with notable contributions, reflecting a community with diverse yet influential voices. The television subreddit shows ‘MarvelsGrantMan136’ leading, with a narrower margin over ‘chanma50’, suggesting a broader spread of engagement across various authors. ‘AutoLovepon’ leads in the anime subreddit with an exceptionally high post count, indicating a significant impact on the community, likely due to consistent, high-quality content, which could be automated. These patterns highlight different engagement dynamics, with a single dominant presence in anime and a more varied influence among authors in the movies and television subreddits.

Looking at Table 4 and Figure 12, we can also say that a high score does not imply a high post count. MarvelsGrantMan136 has both some of the highest engagement and one of the highest post counts, whereas lionsgate authored the post with the highest engagement but has fewer than 1/3000 as many posts. A similar pattern can be observed for anime and television, so these examples suggest no clear relationship between an author’s scores and post count.

Finally, we wanted to identify active authors in the anime, movies, and television subreddits to recognize the authors who consistently engage with and contribute to these communities.

Figure-13: This bar chart ranks the top 10 authors in each subreddit based on the number of posts they have made

Figure 13 provides an overview of the top 10 active authors in the anime, movies, and television subreddits, showing a clear disparity in activity levels across these categories.

For the movies subreddit, the top author has an exceptionally high post count, exceeding 20,000, which is significantly higher than that of the subsequent authors. This could imply a central role in the subreddit, possibly as a source of extensive information or regular discussion threads. The drop in post counts among the following authors is stark, suggesting that, apart from this single dominant contributor, content comes from a more varied set of authors within the community.

In the television subreddit, the post counts among the top authors are more evenly distributed, with ‘MarvelsGrantMan136’ leading but with a smaller margin compared to the leaders in the anime and movies subreddits. This indicates a more balanced contribution from the top authors, which could result in a diverse range of discussions and viewpoints being represented in the subreddit.

Moving to the anime subreddit, the leading author, presumably a bot named ‘AutoLovepon’, has a post count that far exceeds that of any human contributor, with over 6000 posts. This suggests a high level of automation in content generation, which is indicative of a structured and consistent posting strategy, possibly catering to the demand for updates, recommendations, or scheduled discussion threads. Given the nature of bot-generated content, this level of activity might provide a streamlined experience for users seeking information without the variability of human-authored posts.

In conclusion, these engagement patterns reflect the unique cultural dynamics of each subreddit, providing valuable insights for content creators, moderators, and marketers on how to approach each community effectively.

The source code for generating the plots can be found here.

Content 2023 by [Project Team 34]
All content licensed under a Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0)