Executive Summary

Our comprehensive machine learning analysis focuses on three primary objectives within the anime industry: determining the controversiality of comments, enhancing community engagement on social media platforms, and predicting stock prices of anime studios, with a focus on Nintendo.

Determination of controversial comments

Our first goal is to predict whether a comment is controversial. The highest prediction accuracy, about 74%, is achieved by the Decision Tree model. This means a reliable determination of controversy is possible using only limited information about the comment: the number of words in the text, whether the comment is stickied, the sentiment of the text, and the comment's score.

Enhancing Community Engagement through Content Analysis

Our second goal involved predicting submission scores on anime-related subreddits, aiming to identify key factors that drive community engagement and popularity. By employing Gradient Boosted Regression Trees (GBRT) with hyperparameter tuning, we successfully captured approximately 53%-54% of the variance in submission scores. This analysis highlighted the significant influence of the number of comments and the depth of selftext content on submission scores. Such insights are invaluable for businesses in the anime industry, providing strategic directions to boost post scores and attract larger audiences. For anime enthusiasts, our findings offer a blueprint to foster engaging discussions and create comprehensive content, essential for sustaining an active community.

Forecasting Stock Prices: A Challenge Beyond Subreddit Metrics

Our third objective was to forecast the stock prices of anime studios, particularly Nintendo, by analyzing popularity trends on franchise-related subreddits. Contrary to expectations, our analysis revealed a surprising disconnect between subreddit engagement metrics and Nintendo’s stock prices. The initial linear regression model, designed to correlate subreddit activities with stock prices, faced significant challenges. The model’s negative R-squared value (-25.96) and high RMSE (2.24) indicated a stark misalignment between our selected predictors and the actual stock price trends. This suggested a complex dynamic at play, not effectively captured by subreddit metrics alone.

Seeking a more refined approach, we turned to the ARIMA model. Despite its advanced capabilities, the model struggled to accurately predict market movements. The discrepancy in the predicted and actual stock price trends in early April 2023, indicated by an RMSE of 0.28, emphasized the unpredictable and complex nature of stock market dynamics. This outcome further highlighted the limitations of relying solely on social media metrics for stock price predictions.

Moving Forward: Integrating Insights for Future Strategies

These findings, spanning both content engagement and financial forecasting, underscore the multifaceted challenges and complexities inherent in the anime industry. They emphasize the need for more comprehensive, multifactorial modeling approaches that go beyond simple social media metrics. As we move forward, our focus will shift toward integrating a broader array of market influences and exploring diverse modeling techniques to enhance our predictive capabilities. This project, though challenging, has substantially advanced our understanding of the intricate interplay between social media trends, community engagement, and financial market dynamics in the anime industry.

Topic 8: Comments’ Controversiality

Predicting Comments’ Controversiality

Business goal: Develop predictive models to estimate the controversiality of comments in the r/anime subreddit, helping to identify comments that are likely to generate debate and discussion based on various content features.

Technical proposal: Train classification models using features selected for the target variable controversiality. The predictors include variables related to the sentiment of the published text, such as the sentiment category, and basic information about the text, such as the number of words, stickied status, etc. Evaluate the models' performance using the ROC curve and confusion matrix.

Controversial topics and comments attract a lot of attention. It is useful to be able to determine whether a post is controversial from its basic data, without manually reading the text content. Of the variables that already exist for comments, after removing fields such as IDs and timestamps, a few usable predictors remain: the number of words in each comment, the score of each comment, and the comment's stickied status. Intuitively, the sentiment of a text and its controversiality are related to some extent, so the sentiment analysis results obtained via natural language processing in the previous section were added as a new predictor. The target variable is categorical, taking the values 0 and 1, where 0 means the comment is not controversial and 1 means it is. Classification algorithms are therefore appropriate here.

Make Balanced Dataset

As described in the NLP section, the number of comments classified as not controversial is over sixty times the number classified as controversial. Such an extremely unbalanced dataset can lead to misclassification in the machine learning models. We therefore sampled roughly the same number of non-controversial comments as controversial ones to create a dataset balanced with respect to the target variable, as sketched below.
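A minimal sketch of this downsampling step, assuming a Spark DataFrame named comments_df with a 0/1 controversiality column (the actual variable names may differ):

```python
# Count each class, then downsample the majority class to roughly match
# the minority class.
n_pos = comments_df.filter("controversiality = 1").count()
n_neg = comments_df.filter("controversiality = 0").count()

# Keep every controversial comment and a random sample of roughly the same
# number of non-controversial ones.
neg_sample = (comments_df.filter("controversiality = 0")
                         .sample(fraction=n_pos / n_neg, seed=42))
balanced_df = comments_df.filter("controversiality = 1").unionByName(neg_sample)
```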

Models Comparison

ROC Curve for Naive Bayes Model

ROC Curve for Decision Tree Model

Confusion Matrix for Naive Bayes Model

Confusion Matrix for Decision Tree Model
Model ROC Accuracy
Naive Bayes 0.52 0.52
Decision Tree 0.735 0.736
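For reference, both models in the comparison above could be trained and scored roughly as follows. This is a hedged sketch: the predictor column names, the balanced_df sample from the previous subsection, the numeric encoding of the sentiment category, and the Gaussian Naive Bayes variant are all assumptions.

```python
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import NaiveBayes, DecisionTreeClassifier
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

# Cast the boolean stickied flag to an integer so it can be assembled.
data = balanced_df.withColumn("stickied", F.col("stickied").cast("int"))

assembler = VectorAssembler(
    inputCols=["word_count", "score", "stickied", "sentiment_index"],
    outputCol="features")

train, test = data.randomSplit([0.8, 0.2], seed=42)

# Gaussian Naive Bayes tolerates negative feature values (e.g. comment scores);
# the report does not specify which Naive Bayes variant was used.
models = [NaiveBayes(labelCol="controversiality", featuresCol="features",
                     modelType="gaussian"),
          DecisionTreeClassifier(labelCol="controversiality", featuresCol="features")]

for clf in models:
    fitted = Pipeline(stages=[assembler, clf]).fit(train)
    preds = fitted.transform(test)
    auc = BinaryClassificationEvaluator(labelCol="controversiality",
                                        metricName="areaUnderROC").evaluate(preds)
    acc = MulticlassClassificationEvaluator(labelCol="controversiality",
                                            metricName="accuracy").evaluate(preds)
    print(type(clf).__name__, round(auc, 3), round(acc, 3))
```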

Conclusion

The Naive Bayes model shows very limited predictive power, with an ROC of 0.52 and accuracy of 0.52, only slightly better than random guessing. The Decision Tree model displayed acceptable performance, with an ROC of 0.735 and accuracy of 0.736. The Decision Tree model likely outperforms the Naive Bayes model because of its ability to capture non-linear relationships among predictors. With respect to the aim of this prediction, these results show that it is possible to predict controversy from the basic quantitative data of comments, without further text analysis. Analyzing the text content itself would certainly yield a more accurate determination of controversy; the analysis here simply demonstrates the feasibility of using the basic quantitative data as a low-resource, low-cost alternative.

Topic 9: Submissions’ Score

Predicting Reddit Post Scores

Business goal: Determine and predict the score of submissions in the Anime subreddit.

Technical proposal: Apply a vectorizer and other transformers to the selected features. Log-transform the target variable score to avoid negative predictions. Create a regression model using features such as word counts and number of comments. Evaluate the model's performance using metrics such as mean absolute error and R-squared to determine how effectively the score can be forecast from the selected features.

The score of a submission is a crucial indicator of its popularity and the intensity of discussion around it. Understanding which content features affect, or are relevant to, the score allows us to predict the potential score of a post from its other content features, contributing to the achievement of our project goal. This capability aids in better maintaining community dynamics and engagement. Moreover, it helps stakeholders understand which content features are more widely discussed and welcomed, enabling them to make informed decisions when marketing their anime and to avoid ineffective submissions.

As for our target variable score, the score distribution histogram above is highly skewed and non-linear, and a model trained without addressing this may lack precision during prediction. Consequently, we apply a logarithmic transformation to the target variable score, which also restricts predictions to positive values.
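A one-line sketch of this transformation in PySpark, assuming the submissions DataFrame is named submissions_df; log1p keeps zero-score posts defined, and predictions on the log scale can be mapped back with expm1.

```python
from pyspark.sql import functions as F

# Train on log_score instead of the raw score; exponentiating the
# predictions later guarantees non-negative score estimates.
submissions_df = submissions_df.withColumn("log_score", F.log1p(F.col("score")))
```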

In line with our earlier analysis of the distribution of selftext word counts in the submissions dataset during the EDA section, selftext provides additional context beyond a post's title, allowing authors to express their thoughts or issues more clearly. Unlike titles, selftext typically has a higher word count, and the variability in word counts is substantial. To utilize the full potential of the selftext word count feature, we apply the Bucketizer transformation, which maps a column of continuous values, in this case selftext word counts, into discrete feature buckets. Examining the word count distribution of selftext in the following plot, we observe a varied but relatively evenly distributed word count range. Based on this distribution, we choose the following buckets: [1,30], [31,60], [61,100], [100,∞], enabling effective bucketization of the selftext word count feature.
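A sketch of the Bucketizer step under these split points; the input column name is an assumption, and the output name mirrors the feature-importance table below.

```python
from pyspark.ml.feature import Bucketizer

bucketizer = Bucketizer(
    splits=[0, 31, 61, 101, float("inf")],   # ~ [1,30], [31,60], [61,100], [100,inf)
    inputCol="selftext_wordCount",
    outputCol="bucked_selftext_wordCount")

# Quick sanity check of the resulting bucket sizes
bucketizer.transform(submissions_df).groupBy("bucked_selftext_wordCount").count().show()
```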

In addition, we apply VectorAssembler, a data transformation step, to all features, including the bucketed selftext word count produced in the previous step. This step merges all the selected features into a single feature vector for ML model training. We also apply a VectorIndexer transformation to identify categorical features within the raw feature vector generated by the VectorAssembler step; it indexes them and creates the final “features” column used for modeling.
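These two stages might look like the following sketch; the input column list matches the feature-importance table below, and maxCategories is an assumption.

```python
from pyspark.ml.feature import VectorAssembler, VectorIndexer

# over_18 is assumed to have been cast to a numeric type beforehand.
assembler = VectorAssembler(
    inputCols=["num_comments", "num_crossposts", "over_18",
               "title_wordCount", "bucked_selftext_wordCount"],
    outputCol="raw_features")

# Columns with few distinct values (over_18, the word-count buckets) are
# detected as categorical and indexed; the rest stay continuous.
indexer = VectorIndexer(inputCol="raw_features", outputCol="features",
                        maxCategories=8)
```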

To avoid rerunning large parts of the code multiple times on the dataset for comparing hyperparameters, we apply a pipeline to assemble all data transformation steps.
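The assembled pipeline might simply chain the stages defined above (a sketch):

```python
from pyspark.ml import Pipeline

# Reusable preprocessing pipeline; an estimator (the GBT model below) can be
# appended as a final stage when fitting, so the same transformations are
# shared across hyperparameter runs.
prep_pipeline = Pipeline(stages=[bucketizer, assembler, indexer])
```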

Since our goal is to predict the score based on other features in Reddit submissions, we have chosen to apply Gradient Boosted Regression Trees (GBRT) with hyperparameter tuning to build and fine-tune the model. We selected this model because it is an ensemble method built from decision trees. Unlike linear models, it captures non-linear relationships well, which aligns with the characteristics of our target variable score.

The data has been split into a training set (70%), testing set (20%), and validation set (10%). We use multiple evaluation metrics on the model and hyperparameter options to compare their performance in prediction. The metrics include Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (\(R^2\)).
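A hedged sketch of the training and evaluation loop; the split ratios and metric names follow the text, while the gbtr1-style hyperparameters and the seed are assumptions.

```python
from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator

train_df, test_df, val_df = submissions_df.randomSplit([0.7, 0.2, 0.1], seed=42)

gbtr1 = GBTRegressor(labelCol="log_score", featuresCol="features",
                     maxIter=10, maxDepth=10)

model = Pipeline(stages=[bucketizer, assembler, indexer, gbtr1]).fit(train_df)
preds = model.transform(test_df)

# Report MSE, MAE, RMSE, and R-squared on the held-out test split.
evaluator = RegressionEvaluator(labelCol="log_score", predictionCol="prediction")
for metric in ["mse", "mae", "rmse", "r2"]:
    print(metric, round(evaluator.evaluate(preds, {evaluator.metricName: metric}), 6))
```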

Model Comparison

Model MSE MAE RMSE R^2
GBT 1 1.448826 0.923860 1.203672 0.533944
GBT 2 1.440557 0.927095 1.200232 0.536604

The gbtr2 model tunes several hyperparameters relative to gbtr1: maxIter is increased from 10 to 30, maxDepth is decreased from 10 to 5, maxBins is set to 64, and stepSize to 0.01.
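In code, the second configuration might be specified like this (other parameters keep Spark's defaults):

```python
from pyspark.ml.regression import GBTRegressor

# gbtr2: more boosting rounds, shallower trees, wider bins, smaller step size.
gbtr2 = GBTRegressor(labelCol="log_score", featuresCol="features",
                     maxIter=30, maxDepth=5, maxBins=64, stepSize=0.01)
```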

From the evaluation metrics, we can observe that both models (gbtr1 and gbtr2) have very similar performance, with slight differences in \(R^2\) and MAE. The gbtr1 model has a slightly lower MAE, while gbtr2 has a higher \(R^2\).

Overall, these two GBRT models capture around 53-54% of the variance in the log_score, and the errors (MSE, MAE, RMSE) are at a moderate level.

Feature Importance

                           importance
num_comments                 0.601587
num_crossposts               0.056616
over_18                      0.001983
title_wordCount              0.126110
bucked_selftext_wordCount    0.213705
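The table above can be read off the fitted pipeline roughly as follows (a sketch; model is the fitted pipeline from the training step and the GBT stage is assumed to be its last stage):

```python
import pandas as pd

gbt_model = model.stages[-1]          # the fitted GBTRegressionModel
feature_names = ["num_comments", "num_crossposts", "over_18",
                 "title_wordCount", "bucked_selftext_wordCount"]

importances = pd.Series(gbt_model.featureImportances.toArray(),
                        index=feature_names, name="importance")
print(importances)
```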

The identified feature importance reflects their correlation with our target variable score and their respective roles in optimizing the model’s ability to predict Reddit submission scores.

num_comments has the highest importance, approximately 60.16%, indicating that the number of comments on a submission strongly influences its score; higher engagement is a significant indicator of popularity and community interest. Additionally, the bucketed selftext word count, with an importance of 21.37%, has a substantial impact on score prediction. This suggests that the depth or length of the content within the submission plays a crucial role: in-depth discussions or detailed content may contribute positively to the overall score.

The identified features align with the project goal of predicting Reddit submission scores to enhance community dynamics and engagement. Content creators and stakeholders can focus on encouraging discussions (num_comments) and crafting comprehensive selftext content (bucked_selftext_wordCount) to potentially improve submission scores. Moreover, whether the content is over 18 or not has limited impact on the overall score prediction.

Conclusion

The gradient boosted regression tree models yield moderately accurate predictions of Reddit post scores based on content features. Both gbtr1 and gbtr2 capture around 53-54% of the variance and demonstrate reasonable error rates. These models provide a valuable starting point for predicting community engagement, surpassing baseline regression. Feature importance analysis reveals the key content areas that drive user discussions.

Given the unique dynamics of social platforms, further refinements tailored to Reddit data could enhance score predictions:

  • Incorporating temporal features, such as the time of day or day of week a post is made, since social media engagement varies with these cycles.
  • Employing embeddings to represent text semantics beyond simple counts; deeper encodings can improve generalization.
  • Incorporating NLP model outputs as additional features to expand the exploration.

While the current models offer a solid foundation, Reddit-specific customization leveraging domain knowledge could enhance score predictions. The implementation of these advanced extensions has the potential to unlock additional performance gains.

Topic 10: Stock Predictions

Predicting Stock Prices Based on Franchise Popularity

Business goal: Our objective is to forecast the stock prices of anime studios by analyzing the popularity trends of related franchise subreddits. Specifically, we aim to use metrics such as the number of comments and submissions on these subreddits to predict stock market fluctuations over time.

Technical proposal: To achieve this, we propose to implement time series analysis techniques that capture both trends and seasonal variations in subreddit engagement metrics. Our analysis will include relevant features such as the time of day, day of the week, and historical popularity data. We will then integrate these factors with the external stock price data of corresponding anime studios. Our primary modeling approach involves utilizing linear regression and ARIMA models to establish a relationship between subreddit popularity metrics and stock prices.

General Trend

In analyzing the time series plot, it becomes evident that there is no significant correlation between the stock prices of NTDOY or TYO and the trends observed in the number of comments and submissions on the Pokemon subreddit. This suggests that the stock prices of Nintendo, the company behind Pokemon, may not be directly influenced by these particular subreddit metrics. It’s possible that other factors, beyond just the volume of comments and submissions, could have a more pronounced correlation with Nintendo’s stock price.

Features Correlation

The correlation heatmap reveals a surprising finding: Sony’s stock price exhibits the highest positive correlation with NTDOY, despite both companies operating in similar market segments. Conversely, there is a significant negative correlation between NTDOY and the overall Japanese stock market index. This analysis also indicates that neither the number of comments nor submissions on the Pokemon subreddit, including more detailed metrics like crossposts and word count, show a substantial positive correlation with Nintendo’s stock price. Notably, there are relatively strong negative correlations between Nintendo’s stock price and the number of comments and submissions, as well as the scores of these comments and submissions. Therefore, in our subsequent machine learning modeling, we will consider these metrics as potential predictive features (X) for forecasting.

Summary statistics

Statistic NTDOY SONY TYO Year Month DayOfWeek WeekOfYear total_submissions avg_submission_score ... total_submission_crossposts avg_num_crossposts median_num_crossposts max_num_crossposts avg_title_wordCount avg_selftext_wordCount total_comments avg_comment_score avg_body_wordCount NTDOY_diff
count 565.0 565.0 565.0 565.0 565.0 565.0 565.0 565.0 565.0 ... 565.0 565.0 565.0 565.0 565.0 565.0 565.0 565.0 565.0 564.0
mean 12.1 96.6 10.0 2021.7 6.1 4.0 24.4 92.0 31.2 ... 2.1 0.0 0.0 0.9 8.6 135.3 6839.2 9.7 24.2 -0.0
std 1.8 15.2 1.9 0.7 3.5 1.4 15.4 64.9 61.4 ... 6.7 0.1 0.1 1.9 0.7 29.3 3211.2 2.8 4.7 0.2
min 9.3 63.2 7.7 2021.0 1.0 2.0 1.0 32.0 1.9 ... 0.0 0.0 0.0 0.0 6.3 59.7 0.0 0.0 0.0 -0.7
25% 10.6 85.2 8.4 2021.0 3.0 3.0 10.0 60.0 3.7 ... 0.0 0.0 0.0 0.0 8.1 116.2 4844.0 7.8 21.5 -0.1
50% 11.6 98.2 9.1 2022.0 6.0 4.0 23.0 76.0 6.6 ... 1.0 0.0 0.0 1.0 8.6 132.4 5854.0 9.4 23.4 -0.0
75% 13.6 107.1 11.7 2022.0 9.0 5.0 38.0 98.0 27.7 ... 2.0 0.0 0.0 1.0 9.1 148.8 7965.0 11.2 26.3 0.1
max 16.3 128.6 14.4 2023.0 12.0 6.0 52.0 897.0 743.1 ... 72.0 1.1 1.0 29.0 11.1 410.0 33657.0 23.4 40.9 1.0

8 rows × 24 columns

Linear Regression

Model Evaluation

Indicator Value
R-squared -25.959
AIC 195.807
RMSE 2.235

Insight for Linear Regression

Often, the level of audience engagement with online social media can serve as an indicator of trending popularity. Therefore, we have decided to predict Nintendo’s stock price, the company behind Pokemon, by utilizing variables identified as potentially relevant factors in the correlation heatmap. These selected features are believed to be influential in determining the stock price dynamics of Nintendo.
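A hedged sketch of how such a regression and the reported metrics might be computed with statsmodels; the feature list, the daily DataFrame df, and the chronological train/test split are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative predictors suggested by the correlation heatmap; the actual
# selection may differ.
features = ["total_comments", "total_submissions",
            "avg_comment_score", "avg_submission_score"]

train, test = df.iloc[:-60], df.iloc[-60:]            # hold out the last ~60 days

ols = sm.OLS(train["NTDOY"], sm.add_constant(train[features])).fit()
pred = ols.predict(sm.add_constant(test[features]))

# Out-of-sample R-squared can be negative when the model fits worse than the mean.
ss_res = ((test["NTDOY"] - pred) ** 2).sum()
ss_tot = ((test["NTDOY"] - test["NTDOY"].mean()) ** 2).sum()
print("R-squared:", 1 - ss_res / ss_tot)
print("AIC:", ols.aic)
print("RMSE:", np.sqrt(((test["NTDOY"] - pred) ** 2).mean()))
```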

Our attempt to predict Nintendo's stock price (‘NTDOY’) from subreddit metrics using linear regression yielded instructive, though unsatisfactory, results. We observed a strongly negative R-squared value of -25.96, meaning the model fits worse than a simple horizontal line; the subreddit metrics we selected as predictors do not effectively capture the variability in Nintendo's stock price. Additionally, the model's RMSE of 2.24 implies a notable average deviation from the actual stock prices, signaling substantial prediction errors.

From the AIC value of 195.81, we recognize that while it helps in comparing different models, its standalone value offers limited insight without a comparative context.

Overall, these results imply that the relationship between subreddit metrics and Nintendo’s stock price might be more complex or non-linear than our model can account for. This leads us to consider the possibility that other, more relevant factors or external market dynamics might be at play, which are not encapsulated by subreddit activities. Moving forward, we will explore different modeling approaches and feature selections to better capture the nuances of stock price prediction.

ARIMA Time Series Model

Comparison between Actual Values and Predicted Values

Predicted_date Predicted_stock_price True_stock_price
4/1/2023 9.67 9.69
4/2/2023 9.66 9.69
4/3/2023 9.65 9.74
4/4/2023 9.64 10.18
4/5/2023 9.63 9.94

Insight for ARIMA Time Series Trend

Recognizing the potential of audience interactions on social media platforms to indicate trends, we opted to use ARIMA (AutoRegressive Integrated Moving Average), a sophisticated time series forecasting model, to predict Nintendo’s stock price. Our aim was to capture the intricate patterns in stock price movements influenced by various online engagement metrics.

However, our ARIMA model's performance revealed some intriguing insights. Initially, the Augmented Dickey-Fuller test indicated that the ‘NTDOY’ series was non-stationary, leading us to apply differencing. The ARIMA(1, 1, 1) model shows coefficients for both the AR and MA components with significant p-values, indicating a meaningful contribution to the model, and its AIC of -169.735 is relatively low, generally a good sign in model selection. Despite this, the predicted stock values displayed a decreasing trend from 2023-04-01 to 2023-04-05, which starkly contrasted with the actual rising prices, including a notable jump on 2023-04-04. This discrepancy, reflected in an RMSE of 0.28, points to the model's limitations in forecasting sudden market movements. On the scale of the stock price this error might appear small; however, in the context of stock trading even small deviations can be significant, and the model clearly failed to capture the upward trend in prices during this period.
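The workflow described above can be sketched with statsmodels as follows; the series source df, the five-day holdout, and the fixed (1, 1, 1) order are assumptions based on the text.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

series = df["NTDOY"]

# Augmented Dickey-Fuller test: a large p-value suggests non-stationarity,
# which the d=1 differencing term in ARIMA(1, 1, 1) addresses.
adf_stat, p_value, *_ = adfuller(series.dropna())
print("ADF p-value:", p_value)

train, test = series.iloc[:-5], series.iloc[-5:]       # hold out five days
fit = ARIMA(train, order=(1, 1, 1)).fit()
print(fit.summary())                                   # coefficients, p-values, AIC

forecast = fit.forecast(steps=len(test))
rmse = np.sqrt(np.mean((forecast.values - test.values) ** 2))
print("RMSE:", rmse)
```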

Moreover, the negative R-squared value from our previous linear regression attempts further underscores the complexity of predicting stock prices based solely on subreddit metrics. These outcomes suggest that Nintendo’s stock price is influenced by factors beyond what our selected metrics and ARIMA model can capture. Consequently, we will be exploring more comprehensive modeling approaches to incorporate a wider array of influences that might impact stock price dynamics.