Introduction

Figure 1: Image generated with OpenAI DALL·E 3 with prompt “a group of four friends around a campfire sharing stories in modern digital illustration style”

The Project

The most powerful person in the world is the storyteller.

— Steve Jobs

Since ancient times, storytelling has captivated the human imagination. It has been a powerful tool for imparting wisdom and shaping cultural identities. From the oral traditions of ancient civilizations to the written narratives of modern times, stories continually evolve to reflect the complexities of human emotions and values. In the social media era, society generates vast amounts of text data containing great stories that often go unnoticed. In this project, we take a deep dive into storytelling on Reddit: how it comes to be, how it engages the audience, and what lessons we can learn to become better storytellers.

With its extensive following, open discourse, and strong communities, Reddit offers a substantial opportunity to extract insight from valuable data. However, the sheer volume of data, combined with nuanced posts and niche communities, makes it challenging to understand. We combine Exploratory Data Analysis (EDA), Natural Language Processing (NLP), and Machine Learning (ML) models to manage this volume of data and present insights in the most digestible manner possible.

Our exploration spans several aspects. For example, we analyze the language used in several storytelling subreddits: is storytelling contextual enough that we can classify posts into specific subreddits? We examine the influence of NSFW posts on story engagement and the correlation between a story’s score and its number of comments. We use NLP techniques to unearth trends within storytelling communities, predict “flairs,” and perform sentiment analysis of stories. We also use machine learning techniques to distill meaningful insights from vast quantities of storytelling text data.

We selected the following subreddits related to storytelling for the project:

Figure 2: Shows the banners of the twelve storytelling subreddits chosen for the project

The selected subreddits have large followings, each with millions of subscribed members. Table 1 lists the number of subscribers for each subreddit from which we obtained our Reddit posts.

Table 1: The subreddits chosen for the project and the number of members of each as of November 23, 2023.
Subreddit Members
r/relationship_advice 10,380,573
r/socialskills 3,823,567
r/NoStupidQuestions 4,066,116
r/AskMen 5,763,889
r/TrueOffMyChest 2,198,361
r/explainlikeimfive 22,635,152
r/AITA 11,969,361
r/tifu 18,480,872
r/antiwork 2,775,125
r/OutOfTheLoop 3,234,415
r/unpopularopinion 4,002,725
r/AskWomen 5,519,321

Project Goals

In this work, we explore various ideas centered around storytelling on Reddit. We subset our data to contain only the high-membership storytelling subreddits detailed in Table 1. First, we perform EDA on the data. Our first idea is to develop a new engagement metric by correlating the number of comments with post scores. Next, we compare NSFW and non-NSFW posts to assess how not-safe-for-work content affects user engagement. Another goal is to identify peak engagement times by analyzing the temporal patterns of posts. We also investigate the link between engagement and the controversial nature of comments. We then use NLP to extract the age and gender of post authors and to analyze the sentiment of posts in specific subreddits, like r/AmItheA**hole, relating it to user engagement and post flairs. We also create feature sets using NLP for ML tasks such as predicting the subreddit a post belongs to, determining post flairs based on text content, and generating Reddit-style stories from given prompts. Finally, we summarize top comments on specific topics using a pre-trained model and evaluate the summaries with accuracy metrics.

For a complete list of analytical goals and technical proposals, see Section 1.3.

Appendix A: Analytical Goals and Technical Proposals

Topics

The questions we address in this work are:

Idea 1

  • Business goal: Explore the relationship between the number of comments and the score of Reddit posts to develop a unified measure for better assessing a story’s engagement, aiming to create a more holistic understanding of audience interaction.
  • Technical approach: We plan to analyze the relationship between the number of comments and post scores on Reddit storytelling posts. This involves collecting data grouped by subreddits and using pyspark.sql.functions to calculate the correlation between these two metrics. We will develop an interaction_score feature based on our findings, averaging the number of comments and post scores. This new metric aims to provide a unified measure of user engagement across various posts and platforms.
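
The interaction_score idea above can be sketched in a few lines. This is a minimal plain-Python illustration under stated assumptions, not the project's PySpark code: the sample posts are made up, and the helper names pearson and interaction_score are hypothetical. In PySpark, the correlation step would more likely rely on a built-in such as pyspark.sql.functions.corr.

```python
from math import sqrt

# Hypothetical sample posts; real values would come from the Reddit data.
posts = [
    {"num_comments": 10, "score": 100},
    {"num_comments": 20, "score": 180},
    {"num_comments": 5, "score": 60},
    {"num_comments": 40, "score": 390},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def interaction_score(num_comments, score):
    """Average of comment count and post score, as described in the text."""
    return (num_comments + score) / 2


comments = [p["num_comments"] for p in posts]
scores = [p["score"] for p in posts]

r = pearson(comments, scores)
interaction_scores = [interaction_score(c, s) for c, s in zip(comments, scores)]
print(round(r, 3), interaction_scores)
```

A simple average lets the larger of the two quantities dominate, so rescaling both inputs before averaging may be worth considering if scores are typically much larger than comment counts.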

Idea 2

  • Business goal: Determine if the presence of not-safe-for-work (NSFW) content in a story significantly boosts user engagement levels to better recognize the impact of content nature on audience interaction and participation.
  • Technical approach: To analyze the influence of NSFW content on user engagement, we will utilize a dataset of stories, focusing on those marked with the over_18 flag. Considering the limited proportion of such posts, we’ll create a balanced dataset by randomly selecting an equivalent number of submissions without the NSFW tag. This approach ensures a fair comparison between NSFW and non-NSFW content. We will employ the interaction_score from Idea 1 as a primary measure of user engagement. We aim to generate a boxplot to visualize the distribution of interaction scores for both NSFW and non-NSFW posts.
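
The balancing step can be illustrated as follows. This is a sketch only: the function name balance_by_flag, the fixed seed, and the toy records are assumptions made for illustration, and the project itself may sample differently (for example with Spark's sampling utilities).

```python
import random

# Hypothetical toy data: 3 NSFW posts and 10 non-NSFW posts.
posts = [{"id": i, "over_18": i < 3} for i in range(13)]


def balance_by_flag(records, flag="over_18", seed=42):
    """Keep every flagged record and an equally sized random sample of the rest."""
    flagged = [r for r in records if r[flag]]
    unflagged = [r for r in records if not r[flag]]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(unflagged, k=min(len(flagged), len(unflagged)))
    return flagged + sampled


balanced = balance_by_flag(posts)
print(len(balanced), sum(p["over_18"] for p in balanced))
```

Downsampling the majority class discards data, so repeating the comparison over several seeds is one way to check that the boxplot does not depend on a single draw.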

Idea 3

  • Business goal: Identify the specific hours of the day during which posts usually receive the highest levels of user engagement to better understand the patterns and preferences of the audience’s activity.
  • Technical approach: Implement a data analysis process that focuses on the temporal patterns of user engagement on social media posts. Use the created_utc column from the dataset to create two new variables: week_of_the_year and hour_of_the_day. Exclude the first two days of 2022 to maintain accurate weekly categorization, as these days belong to the final week of 2021 rather than the first week of 2022. Aggregate and analyze the data based on these new variables to reveal patterns in user engagement across different times of the day and weeks of the year. The outcome will be visualized through a plot illustrating the times when posts receive the most engagement.
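
The derivation of the two time variables from created_utc can be sketched as below. The sketch assumes created_utc holds Unix epoch seconds and that weeks follow ISO 8601 numbering; neither assumption is confirmed by the text, and the function name time_features is hypothetical. Returning the ISO year alongside the week makes it easy to filter out days at the start of January that still belong to the previous year's last week.

```python
from datetime import datetime, timezone


def time_features(created_utc):
    """Derive ISO week and UTC hour from a Unix timestamp in seconds."""
    dt = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    iso_year, iso_week, _ = dt.isocalendar()
    return {
        "iso_year": iso_year,          # year the ISO week belongs to
        "week_of_the_year": iso_week,  # ISO 8601 week number
        "hour_of_the_day": dt.hour,    # 0-23, in UTC
    }


def in_analysis_year(created_utc, year=2022):
    """True when the timestamp's ISO week belongs to the given year."""
    return time_features(created_utc)["iso_year"] == year


new_years_day = 1640995200    # 2022-01-01 00:00:00 UTC
first_monday = 1641222000     # 2022-01-03 15:00:00 UTC
print(time_features(new_years_day), time_features(first_monday))
```

Hours here are in UTC; converting to a local time zone would shift the apparent peak hours.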

Idea 4

  • Business goal: Determine the relationship between engagement and the “controversial” metric for posts.
  • Technical approach: Perform the necessary NLP to isolate the “controversial” metric within the data. We expect that comments that are considered controversial will show more engagement. With this prepared, use comparison metrics as well as visualizations to determine if there is a relationship between the “controversial” metric and other engagement metrics.

Idea 5

  • Business goal: Extract the age and gender of the author of a Reddit post, with the goal of understanding demographic characteristics of language usage within Reddit.
  • Technical approach: Use complex regular expressions to extract the age and gender of the author of a post, if available (as an example: “My brother (24M) and I (23M) went to the store…”).
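
A possible shape for such a regular expression is shown below. It is a simplified sketch, not the pattern used in the project: it only handles first-person tags of the form "I (23M)" or "me [F31]", and the names AUTHOR_PATTERN and extract_author_demographics are hypothetical.

```python
import re

# Matches a first-person pronoun followed by an age/gender tag in () or [],
# in either order: "I (23M)", "me [F31]". Simplified for illustration.
AUTHOR_PATTERN = re.compile(
    r"\b(?:I|me)\b\s*[\(\[]\s*"
    r"(?:(?P<age1>\d{1,2})\s*(?P<g1>[MF])|(?P<g2>[MF])\s*(?P<age2>\d{1,2}))"
    r"\s*[\)\]]",
    re.IGNORECASE,
)


def extract_author_demographics(text):
    """Return (age, gender) for the first first-person tag, or None if absent."""
    match = AUTHOR_PATTERN.search(text)
    if match is None:
        return None
    age = match.group("age1") or match.group("age2")
    gender = match.group("g1") or match.group("g2")
    return int(age), gender.upper()


example = "My brother (24M) and I (23M) went to the store"
print(extract_author_demographics(example))
```

Anchoring on the pronoun is what separates the author's tag from tags describing other people in the story, such as the brother in the example sentence.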

Idea 6

  • Business goal: Analyze the sentiments of r/AmItheA**hole posts with respect to user engagement and “flairs” to understand how Redditors judge each other’s stories.
  • Technical approach: Apply a pre-trained sentiment model to all non-empty text posts in r/AmItheA**hole created in 2022. Compare and contrast sentiments by the levels of the four primary “flairs” (A**hole, Not the A-hole, Everyone Sucks, No A-holes here) and by various measures of user engagement (e.g., number of comments). Create visualizations displaying these comparisons and draw conclusions.

Idea 7

  • Business goal: Explore the textual components of a Reddit post by finding commonly used words and phrases to better understand language usage across Reddit.
  • Technical approach: Use NLP techniques, including CountVectorizer, to construct a feature set from the textual components of a post. These features may be word “dummy variables” (the existence or non-existence of a word), word counts, n-grams (sequences of n words), or other types of features.
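
The feature types listed above (word presence, word counts, n-grams) can be illustrated without any library. The sketch below is a standard-library stand-in for what a CountVectorizer produces on a single document; the tokenizer and the function names are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(text, n=1):
    """Count n-grams (sequences of n consecutive tokens) in a text."""
    tokens = tokenize(text)
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def presence_features(text):
    """Binary 'dummy variable' per word: 1 if the word occurs at all."""
    return {word: 1 for word in ngram_counts(text, n=1)}


sentence = "The cat sat on the mat"
unigrams = ngram_counts(sentence, n=1)
bigrams = ngram_counts(sentence, n=2)
print(unigrams.most_common(2), bigrams.most_common(2))
```

On the full corpus, a shared vocabulary across documents is needed so that every post maps to a vector of the same length, which is the part a vectorizer handles.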

Idea 8

  • Business goal: Identify if storytelling language usage differs between communities, with the goal of categorizing certain types of posts as belonging to a certain community.
  • Technical approach: Using a feature set as a result of NLP goals, build a multi-class classification model capable of classifying the subreddit community to which a post belongs. This model must be capable of handling multiple, imbalanced classes and cannot be limited to simple binary classification.
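
One common way to handle imbalanced classes is to weight each class inversely to its frequency. The sketch below shows that heuristic; it is an assumption about how imbalance could be handled, not a description of the model actually built, and the label counts are invented.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical label distribution across three subreddits.
labels = ["tifu"] * 6 + ["AskMen"] * 3 + ["antiwork"] * 1
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, errors on rare subreddits cost more during training, which discourages the classifier from always predicting the largest community.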

Idea 9

  • Business goal: Predict the “flairs” of Reddit posts in r/AITA to understand why Redditors judge other Redditors’ stories the way they do.
  • Technical approach: Focusing primarily on the subreddit r/AITA and its flairs (Asshole, Not the A-hole, Everyone Sucks, No A-holes here), apply NLP techniques such as tokenization on text-based submissions to extract the important contents of each post. Apply a multi-class classification model trained on labeled r/AITA posts (i.e., text posts with a flair) to predict which flair a given post will receive based on its text content. Present confusion matrices on the testing dataset and identify the words/phrases most commonly associated with each flair type.
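
The confusion matrix mentioned above can be computed directly from true and predicted flairs. The following is a minimal sketch with invented labels; the function name and the toy predictions are assumptions for illustration only.

```python
from collections import Counter

FLAIRS = ["Asshole", "Not the A-hole", "Everyone Sucks", "No A-holes here"]


def confusion_matrix(y_true, y_pred, labels=FLAIRS):
    """Rows are true flairs, columns are predicted flairs."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Hypothetical test-set labels and model predictions.
y_true = ["Asshole", "Not the A-hole", "Not the A-hole", "Everyone Sucks"]
y_pred = ["Asshole", "Not the A-hole", "Asshole", "Everyone Sucks"]

matrix = confusion_matrix(y_true, y_pred)
for flair, row in zip(FLAIRS, matrix):
    print(f"{flair:>16}: {row}")
```

Off-diagonal cells show which flairs the model confuses, for example a true "Not the A-hole" post predicted as "Asshole".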

Idea 10

  • Business goal: For a provided input prompt, generate the text of a story that can be posted on Reddit, aiming to craft content that resonates with and engages the platform’s audience.
  • Technical approach: We will subset, preprocess, and tokenize both Reddit stories and content from externally sourced books, preparing them to be fed into a Recurrent Neural Network. Using PyTorch, we aim to train the model by minimizing the loss, tracking loss and perplexity as evaluation measures. The trained model should be capable of accepting an input prompt and generating a new story text accordingly.
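
Loss and perplexity are closely related: perplexity is the exponential of the average per-token cross-entropy loss. The sketch below shows that relationship in plain Python, assuming the loss is measured in nats (natural logarithm); the probabilities are invented, and the function names are hypothetical.

```python
import math


def mean_negative_log_likelihood(token_probs):
    """Average cross-entropy (in nats) given the probability of each true token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is exp of the mean negative log-likelihood."""
    return math.exp(mean_negative_log_likelihood(token_probs))


# Hypothetical probabilities a model assigned to the correct next tokens.
uniform_over_four = [0.25, 0.25, 0.25, 0.25]
print(perplexity(uniform_over_four))
```

A perplexity of k can be read as the model being, on average, as uncertain as if it were choosing uniformly among k tokens, so lower values indicate a better fit.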

Idea 11

  • Business goal: Summarize top comments of interest for a particular topic, with the objective of obtaining the key pieces of information from the text.
  • Technical approach: Identify an open-source model that can accurately summarize top comments related to a particular topic of interest. Since we don’t want to label summaries by hand and accuracy can be subjective, we will use an API call to GPT-4 to obtain reference summaries and then compute ROUGE-1, ROUGE-2, and ROUGE-L scores against those references to evaluate the model. We will use hyperparameter tuning to keep both summaries close to 25% of the length of the original text.
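
To make the evaluation concrete, the sketch below computes ROUGE-N precision, recall, and F1 from clipped n-gram overlap between a candidate summary and a reference summary. It is a simplified illustration with whitespace tokenization and invented sentences; it does not cover ROUGE-L, and an established ROUGE package would normally be used in practice.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) based on clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # min count per shared n-gram
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical model summary and reference summary.
model_summary = "the cat sat"
reference_summary = "the cat sat on the mat"
rouge_1 = rouge_n(model_summary, reference_summary, n=1)
rouge_2 = rouge_n(model_summary, reference_summary, n=2)
print(rouge_1, rouge_2)
```

Recall rewards covering the reference's content while precision penalizes padding, which is why both summaries being held to a similar length makes the scores easier to compare.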