Reddit Interview Insights

Contents

  • Exploratory Data Analysis
    • Data Preparation
      • 1) Data Filtering
      • 2) Data Cleaning
      • 3) Data Tagging: Regex Searches
        • Regex Search 1: is_tech_comments
        • Regex Search 2: is_asking_for_help_comments
        • Regex Search 3: is_asking_a_question_comments
        • Regex Search 4: is_FAANG_comments
      • 4) Data Additions: Creating New Variables
        • New Variable 1: Facebook Mentions
        • New Variable 2: Amazon Mentions
        • New Variable 3: Apple Mentions
        • New Variable 4: Netflix Mentions
        • New Variable 5: Google Mentions
        • New Variable 6: Multiple FAANG Companies Mentions
      • 5) Data Quality Checks
        • Checking the Names of the Columns
        • Checking the Number of Rows
        • Checking for Nulls
        • Checking Length of Comments
        • Removing Corrupted Rows
      • 6) Data Reformatting
      • 7) Final Data Set Information
      • 8) External Data Set Preparation
    • Summary Tables
      • Table 1: Summary Of Regex Searches
      • Table 2: New Variables Table
      • Table 3: Frequency Counts for Regex Searches
    • Visualizations
      • TOPIC 1: INTERVIEW POST AND COMMENT FREQUENCY OVER TIME
      • TOPIC 2: POPULAR TECHNICAL INTERVIEW QUESTIONS AND TOPICS
      • TOPIC 3: FAANG COMPANY BREAKDOWN

Exploratory Data Analysis

The following page is dedicated to the exploratory data analysis performed to gain further insight into the Reddit data, specifically the subreddits relating to interviews and technical discussions around job applications and technical career preparation. The page is broken down into the following sections:

  1. Data Preparation (broken into eight different sub-sections)

  2. Summary Tables

  3. Visualizations (broken down by three EDA topics)

  4. External Data Set Preparation (covered at the end of the Data Preparation section)

Data Preparation

1) Data Filtering

Before exploring the topics above, the data needs to be filtered for the subreddit topics of interest and cleaned appropriately. This section describes and shows how the data was cleaned and prepared. The following plot illustrates the distribution of submissions and comments across the different subreddits that data was pulled from. These included:

  1. leetcode

  2. interviewpreparations

  3. codinginterview

  4. InterviewTips

  5. csinterviewproblems

  6. interviews

  7. big_tech_interviews

The most prominent subreddit is clearly leetcode, which shows that the dataset is largely concerned with technical/coding interview experiences. However, the second-most prominent subreddit is interviews, which introduces some necessary discussion of behavioral/professional interview experiences into the dataset.

Code
# Subreddit breakdown
import pyspark.sql.functions as F
import pandas as pd
import datetime

# Combine submissions and comments
submissions_subreddits = submissions.select('subreddit')
comments_subreddits = comments.select('subreddit')

subreddits = submissions_subreddits.union(comments_subreddits).cache()

# Count total submissions and comments by subreddit
subreddit_counts = subreddits.groupBy('subreddit').count().cache()

# Convert to pandas dataframe
subreddit_counts_pd = subreddit_counts.toPandas()

# subreddit_counts.count()

import matplotlib.pyplot as plt

# Plot the data using matplotlib
fig, ax = plt.subplots(figsize=(10, 6), dpi=300)
bars = ax.bar(
    subreddit_counts_pd['subreddit'], 
    subreddit_counts_pd['count'],
    color=['#40B389', '#4A80EB', '#5E9751', '#163D6B']
)

# Add title and labels to the plot
ax.set_title('Breakdown of Comment and Submission Counts by Subreddit')
ax.set_xlabel('Subreddit')
ax.set_ylabel('Comment and Submission Count')
ax.xaxis.set_tick_params(rotation=45)

# Annotate the height of each bar
for bar in bars:
    height = bar.get_height()
    ax.annotate('{}'.format(height),
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 1),  # 1 point vertical offset
                textcoords="offset points",
                ha='center', va='bottom')

plt.savefig('../data/plots/subreddit_breakdown.png', dpi=300)

# Show the plot
plt.show()

Subreddit Breakdown
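
The subreddit filter itself is not shown in the code above; a minimal sketch of restricting already-loaded submissions and comments dataframes to the interview-related subreddits (assuming the dataframe and column names used throughout this page) would be:

Code
from pyspark.sql.functions import col

# Interview-related subreddits of interest (listed above)
subreddits_of_interest = [
    "leetcode", "interviewpreparations", "codinginterview", "InterviewTips",
    "csinterviewproblems", "interviews", "big_tech_interviews",
]

# Keep only submissions and comments posted in these subreddits
submissions = submissions.filter(col("subreddit").isin(subreddits_of_interest))
comments = comments.filter(col("subreddit").isin(subreddits_of_interest))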

2) Data Cleaning

This section covers the code used to clean and organize the Reddit posts and comments for further analysis. The data is then broken out into tables, highlighting key columns such as IDs, subreddit information, and textual content. This cleaning process prepares the data in a way that makes it more accessible and suitable for future in-depth analysis with regard to online social commentary on technical interviews.

Code
# DATA CLEANING
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
import time
%%time
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
print(f"Shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
print(f"Shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
from pyspark.sql.functions import concat, lit, lower, col, when, array
# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making a "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))

# Regex 1
keywords_is_tech = ["tech", "technology", "code", "coding", "computer", "data", "python","java", "javascript", "sql", "database", "statistics", "azure", "cloud computing", "progamming", "leetcode", "saas", "faang"]
pattern = "(?i)" + "|(?i)".join(keywords_is_tech)

comments_tagged = comments_subset_col \
                        .withColumn('is_tech_comments', col('body').rlike(pattern)) \
                        .cache()

submissions_tagged = submissions_subset_col \
                        .withColumn('is_tech_submissions', col('searchable_text').rlike(pattern))
# Regex 2
keywords_help = ["help me", "struggle", "struggling", "anyone", "advice", "tips", "help"]
pattern = "(?i)" + "|(?i)".join(keywords_help)

submissions_tagged = submissions_tagged \
                        .withColumn('is_asking_for_help_submissions', col('searchable_text').rlike(pattern))

comments_tagged = comments_tagged \
                        .withColumn('is_asking_for_help_comments', col('body').rlike(pattern)) \
                        .cache()

# Regex 3
submissions_tagged = submissions_tagged \
                        .withColumn('is_asking_a_question_submissions', col('searchable_text').rlike(r'\?'))

comments_tagged = comments_tagged \
                        .withColumn('is_asking_a_question_comments', col('body').rlike(r'\?')) \
                        .cache()

# Regex 4
keywords_faang = ["facebook", "amazon", "apple", "netflix", "google"]
pattern = "(?i)" + "|(?i)".join(keywords_faang)

submissions_tagged = submissions_tagged \
                        .withColumn('is_FAANG_submissions', col('searchable_text').rlike(pattern)) \
                        .cache()

comments_tagged = comments_tagged \
                        .withColumn('is_FAANG_comments', col('body').rlike(pattern)) \
                        .cache()

# Join comments and submissions
# (comments_new_id is assumed to be the tagged comments dataframe with link_id aligned to the submission id format; its construction is not shown in this excerpt)
joined_subset = comments_new_id.join(submissions_tagged, comments_new_id['link_id'] == submissions_tagged['id'], 'outer')

3) Data Tagging: Regex Searches

The following section incorporates regex searches to create new information in the data, particularly attributes that indicate whether Reddit posts/comments fall into the following categories:

  • is_tech
  • is_asking_for_help
  • is_asking_a_question
  • is_FAANG

Each subsection below provides a glimpse into actual posts/comments that fall in each Regex search pattern.

Regex Search 1: is_tech_comments

Keywords: tech, technology, code, coding, computer, data, python, java, javascript, sql, database, statistics, azure, cloud computing, progamming, leetcode, saas, faang

Code
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
import time
%%time
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
#print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
#print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
from pyspark.sql.functions import concat, lit, lower, col, when, array
# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making a "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))
keywords_is_tech = ["tech", "technology", "code", "coding", "computer", "data", "python","java", "javascript", "sql", "database", "statistics", "azure", "cloud computing", "progamming", "leetcode", "saas", "faang"]
pattern = "(?i)" + "|(?i)".join(keywords_is_tech)

comments_tagged = comments_subset_col \
                        .withColumn('is_tech_comments', col('body').rlike(pattern)) \
                        .cache()
id body TECHNICAL REDDIT
t3_100a8f0 I got it too. See… True
t3_100a8f0 I got this too, i… True
t3_100a8f0 Most leetcode use… True
Regex Search 2: is_asking_for_help_comments

Keywords: help me, struggle, struggling, anyone, advice, tips, help

Code
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
import time
%%time
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
#print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
#print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
from pyspark.sql.functions import concat, lit, lower, col, when, array
# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making a "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))
keywords_help = ["help me", "struggle", "struggling", "anyone", "advice", "tips", "help"]
pattern = "(?i)" + "|(?i)".join(keywords_help)

# comments_tagged carries over from the Regex Search 1 cell above
comments_tagged = comments_tagged \
                        .withColumn('is_asking_for_help_comments', col('body').rlike(pattern)) \
                        .cache()
id body comment asking for help
t3_100a8f0 I have the 300+… True
t3_100ebu9 A lot of people j… True
t3_100ebu9 Solving quest… True
Regex Search 3: is_asking_a_question_comments

Pattern: includes a “?”

Code
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
import time
%%time
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
#print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
#print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
from pyspark.sql.functions import concat, lit, lower, col, when, array
# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making a "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))
# comments_tagged carries over from the previous regex cells
comments_tagged = comments_tagged \
                        .withColumn('is_asking_a_question_comments', col('body').rlike(r'\?')) \
                        .cache()
id body comment asking a question
t3_1009w0e I interviewed rec… True
t3_100a8f0 What do we ch… True
t3_100a8f0 Heck yeah do … True
Regex Search 4: is_FAANG_comments

Keywords: facebook, amazon, apple, netflix, google

Code
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
import time
%%time
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
#print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
#print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
from pyspark.sql.functions import concat, lit, lower, col, when, array
# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making a "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))
keywords_faang = ["facebook", "amazon", "apple", "netflix", "google"]
pattern = "(?i)" + "|(?i)".join(keywords_faang)

# comments_tagged carries over from the previous regex cells
comments_tagged = comments_tagged \
                        .withColumn('is_FAANG_comments', col('body').rlike(pattern)) \
                        .cache()
id body FAANG comment
t3_1010kbv When I go for int… True
t3_1010kbv I appreciate all … True
t3_103ppfi I am interested a… True

4) Data Additions: Creating New Variables

Six new variables were created using the columns produced from the regex searches (explained in the previous section). Five of these variables check for a mention of one of the FAANG companies in the comments: Facebook, Amazon, Apple, Netflix, and Google. A sixth 'Multiple' variable accounts for cases where more than one of these companies is mentioned in a single comment.

Code
%%time
session = sagemaker.Session()
bucket = 'tm1450-project'
#output_prefix_data_submissions = f"project/submissions/yyyy=*"
output_prefix_data_submissions = f"final_tagged_data.parquet/"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
df = spark.read.parquet(s3_path, header=True)
New Variable 1: Facebook Mentions

The following provides a glimpse of the data when Facebook is mentioned. If a comment mentions Facebook, the new variable is true; otherwise it is false.

Code
from pyspark.sql.functions import col, lower, when

# Subset to comments tagged as mentioning a FAANG company, and lowercase the body text
df_FAANG_True = df.filter(col("is_FAANG_comments") == True)
df_FAANG_True = df_FAANG_True.withColumn("body", lower(col("body")))
df_new_faang = df_FAANG_True.withColumn("Facebook", when(col("body").like("%facebook%"), True).otherwise(False))
df_new_faang = df_new_faang.select("link_id", "body", "is_FAANG_comments", "Facebook")
df_new_faang.where(col("Facebook") == True).show(20, truncate=True)
ID BODY FAANG COMMENT FACEBOOK
t3_117gmt1 [https://leetcode… True True
t3_11afkew hey, went thr… True True
t3_jck8os Omg, stop say… True True
New Variable 2: Amazon Mentions

If Amazon is mentioned, this variable will be true, otherwise it’s false.

Code
df_new_faang = df_FAANG_True.withColumn("Amazon", when(col("body").like("%amazon%"), True).otherwise(False))
df_new_faang=df_new_faang.select("link_id", "body","is_FAANG_comments", "Amazon")
df_new_faang.where(col("Amazon") == True).show(20, truncate=True)
ID BODY FAANG COMMENT AMAZON
t3_10cccra Here is what I wo… True True
t3_10cccra I think the real … True True
t3_10h69zq Here are some lin… True True
New Variable 3: Apple Mentions

If Apple is mentioned, this variable will be true, otherwise it’s false.

Code
df_new_faang = df_FAANG_True.withColumn("Apple", when(col("body").like("%apple%"), True).otherwise(False))
df_new_faang=df_new_faang.select("link_id", "body","is_FAANG_comments", "Apple")
df_new_faang.where(col("Apple") == True).show(20, truncate=True)
ID BODY FAANG COMMENT APPLE
t3_109g5hw just use some… True True
t3_11ohugl It was personal… True True
t3_120284u I have heard it… True True
New Variable 4: Netflix Mentions

If Netflix is mentioned, this variable will be true, otherwise it’s false.

Code
df_new_faang = df_FAANG_True.withColumn("Netflix", when(col("body").like("%netflix%"), True).otherwise(False))
df_new_faang=df_new_faang.select("link_id", "body","is_FAANG_comments", "Netflix")
df_new_faang.where(col("Netflix") == True).show(20, truncate=True)
id body FAANG comment Netflix
t3_n09dlg without netflix i… True True
t3_pa6044 If you want a tim… True True
t3_u74ujy [https://imgur.co… True True
New Variable 5: Google Mentions

If Google is mentioned, this variable will be true, otherwise it’s false.

Code
df_new_faang = df_FAANG_True.withColumn("Google", when(col("body").like("%google%"), True).otherwise(False))
df_new_faang=df_new_faang.select("link_id", "body","is_FAANG_comments", "Google")
df_new_faang.where(col("Google") == True).show(20, truncate=True)
id body FAANG comment Google
t3_1010kbv When I go for int… True True
t3_1010kbv I appreciate all … True True
t3_105fbkg I think this is a… True True
New Variable 6: Multiple FAANG Companies Mentions

If more than one FAANG company is mentioned, this variable will be true. If one or no FAANG company is mentioned, this variable is false.

Code
df_faang_multiple = df_FAANG_True.withColumn("Multiple", 
                              when((col("body").like("%facebook%") & col("body").like("%amazon%")) |
                                   (col("body").like("%facebook%") & col("body").like("%apple%")) |
                                   (col("body").like("%facebook%") & col("body").like("%netflix%")) |
                                   (col("body").like("%facebook%") & col("body").like("%google%")) |
                                   (col("body").like("%amazon%") & col("body").like("%apple%")) |
                                   (col("body").like("%amazon%") & col("body").like("%netflix%")) |
                                   (col("body").like("%amazon%") & col("body").like("%google%")) |
                                   (col("body").like("%apple%") & col("body").like("%netflix%")) |
                                   (col("body").like("%apple%") & col("body").like("%google%")) |
                                   (col("body").like("%netflix%") & col("body").like("%google%")),
                              True).otherwise(False))
               
df_faang_multiple.where(col("Multiple") == True).show(20, truncate=True)
id body FAANG comment Multiple
t3_108dzhl yea, i agree with… True True
t3_10k842y yes i agree, only… True True
t3_113wq8t you can take a… True True
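
As an aside, the same Multiple flag can be computed more compactly by counting how many of the five company names appear in each lower-cased comment body. The sketch below assumes the df_FAANG_True dataframe from the cells above and is equivalent in intent to the pairwise when() chain.

Code
from pyspark.sql import functions as F

companies = ["facebook", "amazon", "apple", "netflix", "google"]

# 1 if the company name appears in the comment body, 0 otherwise, summed across the five companies
mention_count = sum(F.when(F.col("body").like(f"%{c}%"), 1).otherwise(0) for c in companies)

# Flag comments that mention two or more distinct FAANG companies
df_faang_multiple_alt = df_FAANG_True.withColumn("Multiple", mention_count >= 2)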

5) Data Quality Checks

At this point in the exploratory data analysis, the data has been cleaned and new variables have been created. This next section performs quality checks on the first draft of a final clean dataset. Performing QC checks and taking the time to actually observe the data is crucial: simple awareness of a dataset's attributes, such as column names, missing values, and dataframe shape, helps identify any inconsistencies, errors, or missing information that might impact the outcomes of future analysis. Therefore, the following subsections provide a sample of the data for these types of observations.

Checking the Names of the Columns
Code
joined_subset.columns

The code above checks the column names; however, please see the schema in section 7 below for the complete list of columns.

Checking the Number of Rows
Code
joined_subset.count()

114334

Checking for Nulls
Code
from pyspark.sql.functions import count, when, col
joined_subset.select([count(when(col(c).isNull(), c)).alias(c) for c in joined_subset.columns]).show()

There are 8,111 rows with NULL in the comment-related columns. This is to be expected, and indicates that there are 8,111 submissions in the dataset that have no comments, and therefore have missing data in the comment-related fields.

Similarly, there are 934 rows with NULL in the submission-related columns. This indicates that there are 934 comments that respond to submissions outside of the dataset, likely because the connected submission was posted outside the window of time that this dataset tracks. In any case, these comments are still of interest and are kept in the dataset.
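
These two null patterns can also be isolated directly from the outer join; a small sketch using the joined_subset dataframe from section 2 (with the expected counts from the discussion above noted in comments) is shown below.

Code
from pyspark.sql.functions import col

# Submissions that never received a comment: the comment side of the outer join is null
joined_subset.filter(col("link_id").isNull()).count()   # expected ~8,111 per the discussion above

# Comments whose parent submission falls outside the collected window: the submission side is null
joined_subset.filter(col("id").isNull()).count()        # expected ~934 per the discussion above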

Checking Length of Comments
Code
from pyspark.sql.functions import length
comment_length = joined_subset.filter(joined_subset.link_id.isNotNull()) \
                                .select(length('body').alias('comment_len'), 'body')

comment_length.select('comment_len').describe().show()
SUMMARY COMMENT LENGTH
Count 106223.0
Mean 190.8
Standard Deviation 304.1
Minimum 1.0
Maximum 8183.0

The distribution of the length of comments seems reasonable. The only result that may raise an eyebrow is the minimum comment length of 1. However, some of the comments are primarily emojis, which contain their own special significance in the world of NLP. For this reason, all of the comments will be kept.
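
As a quick sanity check on that minimum, the shortest comments can be inspected directly; a small sketch using the comment_length dataframe defined above:

Code
# Look at the shortest comments (many are emoji-only) before deciding to keep them
comment_length.orderBy('comment_len').select('comment_len', 'body').show(5, truncate=False)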

Removing Corrupted Rows

Corrupted rows were filtered out while reading in the data.
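
The exact mechanism is not shown here; a hypothetical sketch of one common Spark approach (ignoring unreadable parquet files at read time, then dropping rows that carry no usable text) is below.

Code
# Skip parquet files that cannot be read rather than failing the whole job
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# Drop rows that have neither comment text nor submission text
joined_subset = joined_subset.na.drop(how="all", subset=["body", "searchable_text"])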

6) Data Reformatting

Finally, the data from above needed to be reformatted so that the text is all in the same column and submissions and comments are concatenated rather than joined. The code below shows this logic, and the table shows the top three rows for a selection of the columns.

Code
%%time
import sagemaker
session = sagemaker.Session()
output_prefix_data_submissions = "final_tagged_data.parquet/"
s3_path = f"s3a://{bucket}/project/{output_prefix_data_submissions}"
print(f"reading tagged data from {s3_path}")
df = spark.read.parquet(s3_path, header=True)
print(f"shape of the submissions dataframe is {df.count():,}x{len(df.columns)}")

import pyspark.sql.functions as F
import pandas as pd
import datetime

# Separate into two dataframes (comments and submissions)
df_submissions = df.filter(df.created_utc_submission.isNotNull()) \
                    .select(
                        F.col('id').alias('link_id'),
                        F.col('created_utc_submission').alias('created_utc'),
                        F.col('searchable_text').alias('body'),
                        F.col('is_tech_submissions').alias('is_tech'),
                        F.col('is_asking_for_help_submissions').alias('is_asking_for_help'),
                        F.col('is_asking_a_question_submissions').alias('is_asking_a_question'),
                        F.col('is_FAANG_submissions').alias('is_FAANG'),
                        F.lit(True).alias('isSubmission')
                    ) \
                    .distinct()

print("Number of unique submissions:", df_submissions.count())

df_comments = df.filter(df.created_utc_comment.isNotNull()) \
                    .select(
                        'link_id',
                        F.col('created_utc_comment').alias('created_utc'),
                        'body',
                        F.col('is_tech_comments').alias('is_tech'),
                        F.col('is_asking_for_help_comments').alias('is_asking_for_help'),
                        F.col('is_asking_a_question_comments').alias('is_asking_a_question'),
                        F.col('is_FAANG_comments').alias('is_FAANG'),
                        F.lit(False).alias('isSubmission')
                    ) \
                    .distinct()

print("Number of unique comments:", df_comments.count())


# Concatenate comments and submissions
df_clean = df_submissions.union(df_comments)

df_clean.show(10)
Body Technical Reddit Asking for help? Asking a question? FAANG Reddit Submission
Amazon Inter… False False True True True
Coffee in a… False True True False True
Attire for… False False True False True

7) Final Data Set Information

Interesting Columns:

  • link_id
  • body
  • created_utc_comment
  • subreddit
  • subreddit_id
  • title
  • selftext
  • num_comments

Number of rows: the submissions dataframe is 22,122 x 68 and the comments dataframe is 106,223 x 21.

Schema:

root
|-- link_id: string (nullable = true)  
|-- body: string (nullable = true)  
|-- created_utc_comment: timestamp (nullable = true)  
|-- is_tech_comments: boolean (nullable = true)  
|-- is_asking_for_help_comments: boolean (nullable = true)  
|-- is_asking_a_question_comments: boolean (nullable = true)  
|-- is_FAANG_comments: boolean (nullable = true)  
|-- id: string (nullable = true)  
|-- subreddit: string (nullable = true)  
|-- subreddit_id: string (nullable = true)  
|-- title: string (nullable = true)  
|-- selftext: string (nullable = true)  
|-- created_utc_submission: timestamp (nullable = true)  
|-- num_comments: long (nullable = true)  
|-- searchable_text: string (nullable = true)  
|-- is_tech_submissions: boolean (nullable = true)  
|-- is_asking_for_help_submissions: boolean (nullable = true)  
|-- is_asking_a_question_submissions: boolean (nullable = true)  
|-- is_FAANG_submissions: boolean (nullable = true)  

8) External Data Set Preparation

External Data Source: link

This data set was combined by the author of the GitHub repository above from two sources: the LEVELS.FYI API and a Kaggle dataset on data science and STEM salaries (see the README in the repository linked above). The original data set contains yearly compensation (total, base salary, bonuses, and stock grants), location, years of experience, and years at the company for STEM and data science roles at over 1,500 companies. The data ranges from 2017 to 2021.

For the project’s purposes, the cleaned data set averages yearly compensation, base salary, bonus, years of experience, and years at company for each company and role type (tech or non-tech). From there, subsets can be made for particular companies of interest. Additionally, a focus on tech vs. non-tech roles can easily be achieved through filtering. As seen above, the Reddit data has been tagged for mentions of the FAANG companies (Facebook, Amazon, Apple, Netflix, and Google). Plot 5 in Topic 3 below was created by merging the main data set with this cleaned external data set to allow for a comparison between the Reddit comment frequency for the FAANG companies and the average yearly compensation and average base salaries at these companies.

Code
# Import libraries
import os
import openpyxl
import pandas as pd
import numpy as np
# Read in the data
company_names = pd.read_csv("../data/external_data/company_name_matching_RAW.csv")
salaries = pd.read_excel("../data/external_data/levels_fyi_salary_data_RAW.xlsx", sheet_name = "data")

# Subset columns
# (.copy() avoids pandas SettingWithCopy warnings when adding columns below)
salaries_subset = salaries[["timestamp", "company", "title", "totalyearlycompensation", "location", 
                            "yearsofexperience", "yearsatcompany", "basesalary", "stockgrantvalue", "bonus"]].copy()

# Add a year column and drop timestamp column
salaries_subset["year"] = pd.to_datetime(salaries_subset["timestamp"]).dt.year
salaries_subset = salaries_subset.drop(["timestamp"], axis = 1)

# Change data types
data_types = {"company": str, "title": str, "totalyearlycompensation": float, "location": str, "yearsofexperience": float, 
    "yearsatcompany": float, "basesalary": float, "stockgrantvalue": float, "bonus": float, "year": int}
salaries_subset = salaries_subset.astype(data_types)
salaries_subset["year"] = pd.to_datetime(salaries_subset["year"], format = "%Y")

# Create clean company name matching
company_mapping = dict(zip(company_names["Variant"].str.strip(), company_names["Clean Company"]))

# Make a new column with the clean company names
salaries_subset = salaries_subset.rename(columns = {"company": "old_company"})
salaries_subset["company"] = salaries_subset["old_company"].map(company_mapping)
salaries_subset = salaries_subset.drop(["old_company"], axis = 1)

# Categorize available titles as "tech" or "non tech"
role_type_dict = {'Product Manager': 'tech', 'Software Engineer': 'tech',
       'Software Engineering Manager': 'tech', 'Data Scientist': 'tech',
       'Solution Architect': 'tech', 'Technical Program Manager': 'tech',
       'Human Resources': 'non tech', 'Product Designer': 'tech', 'Marketing': 'non tech',
       'Business Analyst': 'tech', 'Hardware Engineer': 'tech', 'Sales': 'non tech', 
       'Recruiter': 'non tech', 'Mechanical Engineer': 'tech', 'Management Consultant': 'non tech'}

salaries_subset["role_type"] = salaries_subset["title"].map(role_type_dict)

# Rearrange columns
salaries_subset = salaries_subset[["company", "year", "role_type", "title", "totalyearlycompensation", "location", "yearsofexperience",
                                    "yearsatcompany", "basesalary", "stockgrantvalue", "bonus"]]
# Get a summary table of averages by company and role type
summary_df = salaries_subset.groupby(["company", "role_type"]).agg({
    "totalyearlycompensation": "mean", "yearsofexperience": "mean",
    "yearsatcompany": "mean", "basesalary": "mean",
    "stockgrantvalue": "mean", "bonus": "mean"}).reset_index()
summary_df.columns = ["company", "role_type", "average_yearly_compensation", "average_years_experience", "average_years_at_company", 
                        "average_base_salary", "average_stock_grant_value", "average_bonus"]

# Subset for just FAANG (Facebook, Amazon, Apple, Netflix, Google)
faang_df = summary_df[(summary_df["company"] == "Facebook") | (summary_df["company"] == "Amazon") | (summary_df["company"] == "Apple") | (summary_df["company"] == "Netflix") | (summary_df["company"] == "Google")]

summary_df.head(20)
#faang_df.head(20)
company role type average yearly compensation average years experience average years at company average base salary average stock grant value average bonus
10x Genomics Tech 193333.333333 3.166667 0.666667 149666.666667 27666.666667 15166.666667
23andMe Non Tech 160000.000000 10.000000 4.000000 160000.000000 0.000000 0.000000
23andMe Tech 152333.333333 2.000000 0.333333 109166.666667 19833.333333 0.000000

Summary Tables

After conducting the necessary quality checks on the dataset, the information is summarized from several different perspectives to enhance the team's overall understanding of the data. This approach allows for a more comprehensive analysis and interpretation of the dataset's characteristics and trends.

Table 1: Summary Of Regex Searches

The following table shows the results of the regex searches performed. For certain submissions/comments, several regex searches can be true at the same time. Only the first few rows are printed for demonstration.

Code
df.select('link_id', 'body', 'is_tech_comments', 'is_asking_for_help_comments','is_asking_a_question_comments','is_FAANG_comments').show(20,True)
ID BODY TECHNICAL REDDIT COMMENT ASKING FOR HELP COMMENT ASKING A QUESTION FAANG COMMENT
t3_1004fo3 I mean this in… False False False False
t3_1004fyy Dark mode… False False False False
t3_1004fyy Leetcode does… True False False False

Table 2: New Variables Table

The table below provides an overview of the new variables created. It focuses on the mentions of FAANG companies in the comments. Since this website's focus is on technical interviews, this allows the team to separate 'the big guns' on the market from the rest. It is important to note that other companies could be included, but in order to narrow down the search, FAANG alone was chosen.

Code
df.select('link_id', 'body','is_FAANG_comments','Facebook', 'Amazon', 'Apple', 'Netflix', 'Google', 'Multiple').show(20,True)
ID BODY FAANG COMMENT FACEBOOK AMAZON APPLE NETFLIX GOOGLE MULTIPLE
t3_1005e3k not to discount… True False True False False False False
t3_1005e3k i am aware… True False True False False True True
t3_1005e3k you must mean… True False False True False False False

Table 3: Frequency Counts for Regex Searches

Since there are significantly more comments than submissions, a summary of the frequency counts for each of the regex searches has been formatted in a table.

Code
# Table 3
frequency_is_tech_comments = df.groupBy('is_tech_comments').count()
frequency_is_asking_for_help_comments = df.groupBy('is_asking_for_help_comments').count()
frequency_is_asking_a_question_comments = df.groupBy('is_asking_a_question_comments').count()
frequency_is_FAANG_comments = df.groupBy('is_FAANG_comments').count()
frequency_is_tech_comments.show()
frequency_is_asking_for_help_comments.show()
frequency_is_asking_a_question_comments.show()
frequency_is_FAANG_comments.show()
REGEX TRUE FALSE
Technical Comment 20955 85268
Comment Asking For Help 8969 97254
Comment Asking A Question 21703 84520
FAANG Comment 3797 102426
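
For reference, the same frequency counts could be assembled into a single table in one pass rather than four separate show() calls; a sketch assuming the tagged dataframe df and the comment-level flag columns:

Code
import pandas as pd
from pyspark.sql import functions as F

flags = ['is_tech_comments', 'is_asking_for_help_comments',
         'is_asking_a_question_comments', 'is_FAANG_comments']

# Sum each boolean flag (and its negation) cast to int; null flags on submission-only rows are ignored
row = df.select(
    [F.sum(F.col(c).cast("int")).alias(f"{c}_true") for c in flags]
    + [F.sum((~F.col(c)).cast("int")).alias(f"{c}_false") for c in flags]
).collect()[0]

summary = pd.DataFrame({
    "regex": flags,
    "true": [row[f"{c}_true"] for c in flags],
    "false": [row[f"{c}_false"] for c in flags],
})
print(summary)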

Visualizations

In the section below, topics specifically relating to EDA are explored. The visualizations are broken down by each of the three topics.

TOPIC 1: INTERVIEW POST AND COMMENT FREQUENCY OVER TIME

Goal: Determine the time periods throughout the year with the most submissions (posts) and comments on Reddit regarding technical/coding interviews. Do these windows fall during typical hiring windows? This line of analysis could explain spikes or troughs in job application volume in the big tech industry.

The line graph below illustrates the monthly frequency of submissions and comments over a span of three years (2021, 2022, and 2023), emphasizing user interview-engagement trends on Reddit. There is a year-over-year increase, likely influenced by post-COVID interest in technology sectors like NFTs and AI. The 2022 peak around late summer coincides with common hiring cycles, whereas the more consistent 2021 data may reflect the erratic job market during COVID-19. Data up to August 2023 suggests an ongoing rise in engagement, indicative of either growing user participation or platform expansion.

Code
# Plot 1

import sagemaker
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_submissions = "project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading comments from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
#print(f"shape of the comments dataframe is {submissions.count():,}x{len(submissions)
from pyspark.sql.functions import year, month
import matplotlib.pyplot as plt

# Extract year and month from the 'created_utc' column
sub_df_with_month = submissions.withColumn("year", year("created_utc")).withColumn("month", month("created_utc"))

# Group the data by year and month and count the number of comments
sub_counts = sub_df_with_month.groupBy("year", "month").count().orderBy("year", "month").cache()

# Convert the result to a Pandas DataFrame for easier plotting
sub_counts_pd = sub_counts.toPandas()
%%time
import sagemaker
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
#print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
from pyspark.sql.functions import year, month
import matplotlib.pyplot as plt

# Extract year and month from the 'created_utc' column
comm_df_with_month = comments.withColumn("year", year("created_utc")).withColumn("month", month("created_utc"))

# Group the data by year and month and count the number of comments
comment_counts = comm_df_with_month.groupBy("year", "month").count().orderBy("year", "month").cache()

# Convert the result to a Pandas DataFrame for easier plotting
comment_counts_pd = comment_counts.toPandas()
import pandas as pd
combined_df_pd = pd.merge(sub_counts_pd, comment_counts_pd, on=['year', 'month'], how='outer', suffixes=('_submissions', '_comments'))
#combined_df_pd = combined_df_pd.fillna(0)
import calendar

# Filter the data for each year (copies avoid pandas SettingWithCopy warnings below)
data_2021 = combined_df_pd[combined_df_pd['year'] == 2021].copy()
data_2022 = combined_df_pd[combined_df_pd['year'] == 2022].copy()
data_2023 = combined_df_pd[combined_df_pd['year'] == 2023].copy()

# Convert numeric months to month names
data_2021['month'] = data_2021['month'].apply(lambda x: calendar.month_name[x])
data_2022['month'] = data_2022['month'].apply(lambda x: calendar.month_name[x])
data_2023['month'] = data_2023['month'].apply(lambda x: calendar.month_name[x])

# Create a line plot with month names on the x-axis and a separate line for each year
plt.figure(figsize=(10, 6))
plt.plot(data_2021['month'], data_2021['count_submissions'], marker='o', linestyle='-', label='Submissions 2021', color='#204321')
plt.plot(data_2022['month'], data_2022['count_submissions'], marker='o', linestyle='-', label='Submissions 2022', color='#5E9751')
plt.plot(data_2023['month'], data_2023['count_submissions'], marker='o', linestyle='-', label='Submissions 2023', color='#82B8B5')
plt.plot(data_2021['month'], data_2021['count_comments'], marker='o', linestyle='dotted', label='Comments 2021', color='#163D6B')
plt.plot(data_2022['month'], data_2022['count_comments'], marker='o', linestyle='dotted', label='Comments 2022', color='#4A80EB')
plt.plot(data_2023['month'], data_2023['count_comments'], marker='o', linestyle='dotted', label='Comments 2023', color='#251279')
plt.xlabel('Month')
plt.ylabel('Number of Submissions & Comments')
plt.title('Submissions & Comments Over Time by Year')
plt.xticks(rotation=45)  # Rotate x-axis labels for readability
plt.grid(False)
plt.legend()
plt.tight_layout()

plt.savefig('../data/plots/sub_com_over_time.png', dpi=300) 

plt.show()

TOPIC 2: POPULAR TECHNICAL INTERVIEW QUESTIONS AND TOPICS

Goal: Determine the most popular interview questions and/or topics to gauge the specific areas of interest regarding technical and coding-related interviews. Questions that are being discussed more frequently can help future applicants streamline how they study for interviews to focus on the topics with the most traction on Reddit. Highlighting content matter that’s seen most frequently can direct students in their studying.

The wordcloud below was created using the body column in the dataset, which contains all the text from the submissions and comments. Since the focus is on both submissions and comments, the two datasets were joined together so that all of the text could be used to produce an accurate wordcloud. The text was then tokenized, and all stopwords and special characters were removed. Finally, the vocabulary was flattened, and the wordcloud was created from this flattened vocabulary.

Code
# Plot 2
# get WordCloud library
!pip install matplotlib wordcloud
# import libraries
from pyspark.sql.functions import col, lower, regexp_replace, array_except, array, lit, when
from pyspark.ml.feature import Tokenizer, StopWordsRemover
from pyspark.sql import functions as F
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

# preprocess the text: clean up, tokenize, and remove stop words
# make text lowercase, and remove special characters
df_clean = df.withColumn('body', lower(col('body'))) \
             .withColumn('body', regexp_replace('body', r"[^a-zA-Z\s]", "")) \
             .withColumn('body', regexp_replace('body', r"\b\w{1,2}\b", ""))
# tokenize text
tokenizer = Tokenizer(inputCol="body", outputCol="words")
# transform tokenized text
df_words = tokenizer.transform(df_clean)
# remove stopwords
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
df_filtered = remover.transform(df_words)
#remove empty strings
empty_string_array = array(lit(''))
df_filtered = df_filtered.withColumn("filtered", array_except(col("filtered"), empty_string_array))
#flatten the vocabulary list
words_list = df_filtered.select('filtered').rdd.flatMap(lambda x: x).collect()
flat_list = [word for sublist in words_list for word in sublist]
# join the list to a single string, as wordcloud takes a string as input
word_string = " ".join(flat_list)
# adjust figure size
wordcloud = WordCloud(width=1600, height=800, collocations=True, background_color='white').generate(word_string)
plt.figure(figsize=(20, 10), dpi=300) 
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In the word cloud above, the most prominent words are ‘one’, ‘problem’, and ‘interview’; this is expected given that the Reddit posts are focused on asking questions about technical interviews. The most significant term observed here is ‘leetcode’ because it indicates a focus on coding or technical interviews, as LeetCode is a platform that provides coding challenges for such purposes. This suggests that the Reddit posts have a strong representation of technical and coding questions, with potential sources or solutions on the LeetCode website. This supports the decision to carry out a regex search for technical Reddit posts.

This next plot illustrates the percent of comments and/or submissions in the dataset that pose questions. Submissions and comments were identified as posing a question with a simple regex search that matched a question mark (?) character. In a casual forum like Reddit, it is expected to miss a few questions in which users forget to add a question mark, but the examination of the dataset showed that the accuracy of this regex match was adequately reliable.
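
As a side note, the simple "?" match could be broadened to also catch questions whose authors omit punctuation. The hypothetical sketch below (not the approach used for the plots that follow) adds a few common interrogative openers, assuming the cleaned dataframe df with a body column.

Code
from pyspark.sql import functions as F

# Match a literal "?" OR a common question opener at a word boundary (case-insensitive)
question_pattern = r"(?i)(\?|\b(how do|how can|what is|why does|should i|does anyone)\b)"

df_questions_broad = df.withColumn("is_asking_a_question_broad",
                                   F.col("body").rlike(question_pattern))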

Code
# Question counts
import pyspark.sql.functions as F
import pandas as pd
import datetime

# Add date column
df_time = df.withColumn(
    "monthyear",
    F.concat(
        F.month("created_utc"), 
        F.lit('-'), 
        F.year("created_utc")
    )
)

# Question counts
df_question_counts = df_time.filter('is_asking_a_question').groupBy('monthyear').count()

# Submission/comment counts
df_total_counts = df_time.groupBy('monthyear')\
                    .count()\
                    .select(F.col('monthyear').alias('monthyear2'), F.col('count').alias('total_count'))

# Convert to pandas dataframe
pd_df1 = df_question_counts.join(
    df_total_counts, 
    df_question_counts.monthyear == df_total_counts.monthyear2, 
    'left'
).select('monthyear', 'count', 'total_count').toPandas()

# Reorder pandas dataframe
pd_df1['month'] = pd_df1['monthyear'].apply(lambda x: int(x.split('-')[0]))
pd_df1['year'] = pd_df1['monthyear'].apply(lambda x: int(x.split('-')[1]))

pd_df1 = pd_df1.sort_values(['year', 'month']).reset_index(drop=True)

# Add ratio of questions to all posts as new column
pd_df1['pct_questions'] = pd_df1['count'] / pd_df1['total_count'] * 100
pd_df1.head()

# Get question counts for submissions
# Question counts
df_question_counts_sub = df_time.filter('is_asking_a_question').filter('isSubmission').groupBy('monthyear').count()

# Submission/comment counts
df_sub_counts = df_time.filter('isSubmission')\
                    .groupBy('monthyear')\
                    .count()\
                    .select(F.col('monthyear').alias('monthyear2'), F.col('count').alias('total_count'))

# Convert to pandas dataframe
pd_df_sub = df_question_counts_sub.join(
    df_sub_counts, 
    df_question_counts_sub.monthyear == df_sub_counts.monthyear2, 
    'left'
).select('monthyear', 'count', 'total_count').toPandas()

# Reorder pandas dataframe
pd_df_sub['month'] = pd_df_sub['monthyear'].apply(lambda x: int(x.split('-')[0]))
pd_df_sub['year'] = pd_df_sub['monthyear'].apply(lambda x: int(x.split('-')[1]))

pd_df_sub = pd_df_sub.sort_values(['year', 'month']).reset_index(drop=True)

# Add question ratio as new column
pd_df_sub['pct_questions'] = pd_df_sub['count'] / pd_df_sub['total_count'] * 100

# Get question counts for comments
# Question counts
df_question_counts_com = df_time.filter('is_asking_a_question')\
                    .filter(F.col('isSubmission') == False)\
                    .groupBy('monthyear')\
                    .count()

# Submission/comment counts
df_com_counts = df_time.filter(F.col('isSubmission') == False)\
                    .groupBy('monthyear')\
                    .count()\
                    .select(F.col('monthyear').alias('monthyear2'), F.col('count').alias('total_count'))

# Convert to pandas dataframe
pd_df_com = df_question_counts_com.join(
    df_com_counts, 
    df_question_counts_com.monthyear == df_com_counts.monthyear2, 
    'left'
).select('monthyear', 'count', 'total_count').toPandas()

# Reorder pandas dataframe
pd_df_com['month'] = pd_df_com['monthyear'].apply(lambda x: int(x.split('-')[0]))
pd_df_com['year'] = pd_df_com['monthyear'].apply(lambda x: int(x.split('-')[1]))

pd_df_com = pd_df_com.sort_values(['year', 'month']).reset_index(drop=True)

# Add question ratio as new column
pd_df_com['pct_questions'] = pd_df_com['count'] / pd_df_com['total_count'] * 100

# Line plot
plt.figure(figsize=(10, 6))

# Plot aggregated question percent
plt.plot(
    pd_df1['monthyear'],
    pd_df1['pct_questions'],
    color='#204321',
    marker='o',
    linestyle='-',
    label="Combined (all posts)"
)

# Plot submission question percent
plt.plot(
    pd_df_sub['monthyear'],
    pd_df_sub['pct_questions'],
    color='#5E9751',
    marker='o',
    linestyle='dotted',
    label="Submissions"
)

# Plot comment question percent
plt.plot(
    pd_df_com['monthyear'],
    pd_df_com['pct_questions'],
    color='#82B8B5',
    marker='o',
    linestyle='dotted',
    label="Comments"
)

# Annotate
plt.xlabel('Month')
plt.ylabel('Percent of Total Submissions/Comments')
plt.title('Questions as a Percent of Total Submissions/Comments Over Time')
plt.xticks(rotation=45) 
plt.ylim(0, 100)
plt.grid(False)
plt.legend()
plt.tight_layout()

plt.savefig('../../website-source/images/question_pcts.png', dpi=300)

plt.show()

In the plot above, it’s clear from the solid line that the percent of questions out of all comments and submissions (combined) is roughly constant at about 35% over time. This line suggests that the dataset is primarily comprised of Q&A-type discourse, in which roughly 35% of posts ask questions and the other 65% of posts include (but may not be limited to) direct answers to these questions. This follows the general trend of most Q&A-based Reddit communities, where a user asks a question in the form of a submission, and usually several users respond in the form of comments.

The two dotted lines in the plot seem to confirm this pattern, showing that only about 30% of comments consist of questions (presumably in the form of follow-up questions to the submission) while the remainder of comments likely express statements of fact or opinion that respond to the question posed in the submission. Although the percentage of comments posing questions remains constant over time, the plot shows that the percentage of submissions that pose questions is more volatile. For most of the time, roughly 60% of all submissions in the data pose questions. But from the Fall of 2021 and into the Spring of 2022, the percentage of submissions that pose questions drops to about 40%. Based on the plot in Topic 1, this change in question percentage does not clearly coincide with a change in submission frequency. This suggests that the number of submissions is constant over this time period, but users are posting fewer questions. This sudden drop and rise in questions could be due to a number of possible factors, such as hiring freezes following the COVID-19 hiring surge, changes in subreddit moderator activity, or even random chance associated with smaller sample sizes (there are roughly 500 submissions per month, while comments per month are consistently in the thousands). Regardless of the reason behind this temporary drop in questions, the plot shows that, by and large, the percentage of questions over time is generally constant.

TOPIC 3: FAANG COMPANY BREAKDOWN

Goal: FAANG is an acronym that refers to five of the largest and most influential tech companies in the world: Facebook, Apple, Amazon, Netflix, and Google. Consequently, these companies are often seeking many skilled data scientists and analysts. Observe the data related to the frequency with which these companies are mentioned in posts and comments so individuals can gauge their prevalence in discussions related to job interviews in the tech industry. Use an external dataset containing salary information for each FAANG company to enable insight into how public interest in these companies correlates with compensation.

The bar plot below was created using the new variable produced by a regex search: is_FAANG_comments. All comments were searched, and those that mention one of the FAANG companies (Facebook, Amazon, Apple, Netflix, Google) were identified. For this plot, the goal was to compare the frequency of comments that mention FAANG against the frequency of comments that do not mention it at all, so a bar plot was chosen to show how many comments mention FAANG and how many do not. As expected, there are significantly more comments that do not mention any of the FAANG companies. In fact, the comments that do include FAANG mentions make up less than 4% of the comments data set.

Code
# Plot 3
# make a copy of data
df_faang = df
#select columns of  interest
df_faang = df_faang.select("link_id", "body", "is_tech_comments", "is_asking_for_help_comments","is_asking_a_question_comments", "is_FAANG_comments")
# group by 'is_FAANG_comments' and count each group
faang_comments_freq = df_faang.groupBy("is_FAANG_comments").count()
# collect the data to the driver
faang_comments_data = faang_comments_freq.collect()
# prepare the data for plotting
labels = [str(row['is_FAANG_comments']) for row in faang_comments_data]  # Convert boolean to string for labels
counts = [row['count'] for row in faang_comments_data]
# plot the data using matplotlib
fig, ax = plt.subplots(dpi=300)
bars = ax.bar(labels, counts, color=['#40B389', '#4A80EB'])
# add title and labels to the plot
ax.set_title('Frequency of FAANG Comments')
ax.set_xlabel('Has a Mention of FAANG')
ax.set_ylabel('Frequency')
# set the x-ticks to correspond to the labels
ax.set_xticks([0, 1], labels)
# annotate the height of each bar
for bar in bars:
    height = bar.get_height()
    ax.annotate('{}'.format(height),
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 1),  # 1 point vertical offset
                textcoords="offset points",
                ha='center', va='bottom')
# show the plot
plt.show()

The following plot is a direct extension of the previous plot. Here, the focus is on all the comments that were tagged as true for is_FAANG_comments, and the goal was to observe how the mentions are divided among the FAANG companies. Therefore, five bars were created, one for each of the companies (Facebook, Amazon, Apple, Netflix, Google), along with an additional bar for comments mentioning multiple companies. The Multiple variable accounts for all cases where two or more of the FAANG companies were mentioned in one comment. From the visual, Google is mentioned significantly more than the other companies. On the other hand, Netflix is hardly ever mentioned. The hypothesis is that this is because Netflix, as a streaming platform, requires a very specific type of skill set, while the skills required at the other companies are more general.

Code
# Plot 4
#get only rows where is_FAANG_comments == True
df_FAANG_True = df_faang.filter(df_faang.is_FAANG_comments == True)
#select columns of interest
df_FAANG_True = df_FAANG_True.select("link_id","body","is_FAANG_comments")
#make all text lowercase
df_FAANG_True= df_FAANG_True.withColumn("body", lower(col("body")))
#add new columns based on company
df_new_faang = df_FAANG_True.withColumn("Facebook", when(col("body").like("%facebook%"), True).otherwise(False)) 
                  .withColumn("Amazon", when(col("body").like("%amazon%"), True).otherwise(False)) 
                  .withColumn("Apple", when(col("body").like("%apple%"), True).otherwise(False)) 
                  .withColumn("Netflix", when(col("body").like("%netflix%"), True).otherwise(False)) 
                  .withColumn("Google", when(col("body").like("%google%"), True).otherwise(False)) 
                  .withColumn("Multiple", 
                              when((col("body").like("%facebook%") & col("body").like("%amazon%")) |
                                   (col("body").like("%facebook%") & col("body").like("%apple%")) |
                                   (col("body").like("%facebook%") & col("body").like("%netflix%")) |
                                   (col("body").like("%facebook%") & col("body").like("%google%")) |
                                   (col("body").like("%amazon%") & col("body").like("%apple%")) |
                                   (col("body").like("%amazon%") & col("body").like("%netflix%")) |
                                   (col("body").like("%amazon%") & col("body").like("%google%")) |
                                   (col("body").like("%apple%") & col("body").like("%netflix%")) |
                                   (col("body").like("%apple%") & col("body").like("%google%")) |
                                   (col("body").like("%netflix%") & col("body").like("%google%")),
                              True).otherwise(False))

faang_sums = df_new_faang.select(
    [F.sum(F.col(c).cast("int")).alias(c) for c in ['Facebook', 'Amazon', 'Apple', 'Netflix', 'Google', 'Multiple']]
).collect()[0].asDict()
#define a list of colors for the bars
colors = ['#5E9751', '#40B389', '#251279', '#6CACED', '#82B8B5', '#4A80EB']
#make sure we have as many colors as we have companies
assert len(colors) == len(faang_sums.keys()), "Number of colors must match the number of companies"
#create a bar plot with the specified colors
plt.figure(dpi=300 )
plt.bar(faang_sums.keys(), faang_sums.values(), color=colors)
#add titles and labels
plt.title('Frequency of FAANG Mentions')
plt.xlabel('Company')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
#add value labels on top of each bar
for i, (key, value) in enumerate(faang_sums.items()):
    plt.text(i, value, str(value), ha='center', va='bottom', color='black')
#show the plot
plt.tight_layout()
plt.show()

The plot below was created using the frequency of FAANG mentions in the Reddit post comments broken down by each company within “FAANG.” Comments that included multiple FAANG companies were excluded from this visualization. The external data set (described in more detail at the end of this page) provided the average yearly compensation and base salary information for each of the 5 companies once it was filtered and cleaned. For more information on the cleaning process of the external data set, please see the last section on this page.

There are several interesting takeaways from the plot. First, there is no direct association between how frequently a company is mentioned in this selection of Reddit comments and the average compensation or average base salary at that company. Netflix is hardly mentioned, yet its average yearly compensation and average base salary are higher than those of the other FAANG companies according to the additional data set. Interestingly, the two companies mentioned most in the Reddit comments (Google and Amazon) do not have the highest compensation metrics. Of course, this is only a subset of Reddit comments and the compensation metrics come from an external data set, so these takeaways should be interpreted with caution.

NOTE: The compensation and base salary averages were calculated using data from 2017 to 2021 for each of these companies.

Code
# Plot 5
# Filter out comments with multiple FAANG mentions - just get the ones that specifically mention 1 FAANG
from pyspark.sql.functions import col, when, lit
import numpy as np
df_one_faang = df.filter(col("Multiple") == False)
# Check new row count
df_one_faang.count()
# Prep for merge with external by making a company column
df_one_faang = df_one_faang.withColumn("company", 
    when(col("Facebook") == True, lit("Facebook"))
    .when(col("Apple") == True, lit("Apple"))
    .when(col("Amazon") == True, lit("Amazon"))
    .when(col("Google") == True, lit("Google"))
    .when(col("Netflix") == True, lit("Netflix"))
    .otherwise(lit("Other"))
)
# Convert to pandas since small, dont need spark
josh_df_pandas = df_one_faang.select("link_id", "body", "company").toPandas()

# Read in the external data set
all_salary_info = pd.read_csv("../data/external_data/all_company_salary_info_CLEAN.csv")
faang_salary_info = pd.read_csv("../data/external_data/faang_salary_info_CLEAN.csv")
# Drop index column
faang_salary_info = faang_salary_info.drop(["Unnamed: 0"], axis = 1)
# Round float values
faang_salary_info = faang_salary_info.apply(lambda x: round(x, 2) if x.dtype == 'float64' else x)
# Average tech and non-tech data
faang_df = faang_salary_info.groupby(["company"]).agg({
    "average_yearly_compensation": "mean", "average_years_experience": "mean",
    "average_years_at_company": "mean", "average_base_salary": "mean",
    "average_stock_grant_value": "mean", "average_bonus": "mean"}).reset_index()

merged_df = josh_df_pandas.merge(faang_df, on = "company", how = "left")

company_counts = merged_df["company"].value_counts()
average_compensation = merged_df.groupby("company")["average_yearly_compensation"].mean()
average_base = merged_df.groupby("company")["average_base_salary"].mean()
average_experience = merged_df.groupby("company")["average_years_experience"].mean()
average_stock_grant = merged_df.groupby("company")["average_stock_grant_value"].mean()

fig, ax1 = plt.subplots(figsize = (10, 6))
ax2 = ax1.twinx()

# Set the bar width and positions
bar_width = 0.2
x = np.arange(len(company_counts.index))

ax1.bar(x - bar_width/3*2, company_counts, bar_width, label='Mention Frequency', alpha=1, color='#5164A1')
ax2.bar(x + bar_width/3, average_compensation, bar_width, label='Average Yearly Compensation', alpha=1, color='#82B8B5')
ax2.bar(x + bar_width/3*4, average_base, bar_width, label='Average Base Salary', alpha=1, color='#5E9751')

# Set labels
ax1.set_xticks(x)
ax1.set_xticklabels(company_counts.index, rotation=0)

ax1.set_ylabel('Mention Frequency', color='black')
ax2.set_ylabel('Average Yearly Compensation/Average Base Salary', color='black')

# Combine legend and put outside of plot
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines + lines2, labels + labels2, loc='upper left', bbox_to_anchor=(0.27, 1))

# Set title and footer
plt.title('Mention Frequency vs. Average Yearly Compensation and Base Salary by FAANG Company')
plt.figtext(0.1, 0.01, 'Note: Comments and submissions mentioning multiple FAANG companies were not included.', fontsize=10, color='black')

# Save figure
plt.savefig('../data/imgs/faang_salary_breakdown.png', dpi = 300)

plt.show()

Source Code
## Exploratory Data Analysis

The following page is dedicated solely to the different exploratory data analytics methods performed to gain further insight into the Reddit data, specifically the Reddits relating to interviews and technical discussions around job applications and technical career preparation. The page is broken down into the following sections: 

1) Data Preparation (broken into eight different sub-sections)

2) Summary Tables

3) Visualizations (broken down by three EDA topics)

4) Additional Data Set

### Data Preparation

#### 1) Data Filtering

Before exploring the topics above, the data needs to be filtered for the subreddit topics of interest and cleaned appropriately. This section describes and shows how the data was cleaned and prepared. The following plot illustrates the distribution of submissions and comments across the different subreddits that data was pulled from. These included:

1. `leetcode`

2. `interviewpreparations`

3. `codinginterview`

4. `InterviewTips`

5. `csinterviewproblems`

6. `interviews`

7. `big_tech_interviews`

The most prominent subreddit is clearly `leetcode`, which shows that the dataset is largely concerned with technical/coding interview experiences. However, the second-most prominent subreddit is `interviews`, which introduces some necessary discussion of behavioral/professional interview experiences into the dataset.

```{python}
#| eval: False
# Subreddit breakdown
import pyspark.sql.functions as F
import pandas as pd
import datetime

# Combine submissions and comments
submissions_subreddits = submissions.select('subreddit')
comments_subreddits = comments.select('subreddit')

subreddits = submissions_subreddits.union(comments_subreddits).cache()

# Count total submissions and comments by subreddit
subreddit_counts = subreddits.groupBy('subreddit').count().cache()

# Convert to pandas dataframe
import pandas as pd
subreddit_counts_pd = subreddit_counts.toPandas()

# subreddit_counts.count()

import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))

# Plot the data using matplotlib
fig, ax = plt.subplots(dpi=300)
bars = ax.bar(
    subreddit_counts_pd['subreddit'], 
    subreddit_counts_pd['count'],
    color=['#40B389', '#4A80EB', '#5E9751', '#163D6B']
)

# Add title and labels to the plot
ax.set_title('Breakdown of Comment and Submission Counts by Subreddit')
ax.set_xlabel('Subreddit')
ax.set_ylabel('Comment and Submission Count')
ax.xaxis.set_tick_params(rotation=45)

# Annotate the height of each bar
for bar in bars:
    height = bar.get_height()
    ax.annotate('{}'.format(height),
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 1),  # 3 points vertical offset
                textcoords="offset points",
                ha='center', va='bottom')

plt.savefig('../data/plots/subreddit_breakdown.png', dpi=300)

# Show the plot
plt.show()

```

<!-- ![](images//subreddit_breakdown.png) -->
<img src="images/subreddit_breakdown.png" alt="Subreddit Breakdown" style="width: 70%; height: auto;"/>


#### 2) Data Cleaning

This section covers code used to clean and organize the Reddit posts and comments for further analysis. The data is then broken out into tables, highlighting key columns such as IDs, subreddit information, and textual content. This process of cleaning involves preparing the data in a way that makes it more accessible and suitable for future in-depth analysis with regards to online social commentary on technical interviews.

```{python}
#| eval: False
# DATA CLEANING
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
import time
%%time
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
print(f"Shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
print(f"Shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
from pyspark.sql.functions import concat, lit, lower, col, when, array
# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making an "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))

# Regex 1
keywords_is_tech = ["tech", "technology", "code", "coding", "computer", "data", "python","java", "javascript", "sql", "database", "statistics", "azure", "cloud computing", "progamming", "leetcode", "saas", "faang"]
pattern = "(?i)" + "|(?i)".join(keywords_is_tech)

comments_tagged = comments_subset_col \
                        .withColumn('is_tech_comments', col('body').rlike(pattern)) \
                        .cache()

submissions_tagged = submissions_subset_col \
                        .withColumn('is_tech_submissions', col('searchable_text').rlike(pattern))
# Regex 2
keywords_help = ["help me", "struggle", "struggling", "anyone", "advice", "tips", "help"]
pattern = "(?i)" + "|(?i)".join(keywords_help)

submissions_tagged = submissions_tagged \
                        .withColumn('is_asking_for_help_submissions', col('searchable_text').rlike(pattern))

comments_tagged = comments_tagged \
                        .withColumn('is_asking_for_help_comments', col('body').rlike(pattern)) \
                        .cache()

# Regex 3
submissions_tagged = submissions_tagged \
                        .withColumn('is_asking_a_question_submissions', col('searchable_text').rlike(r'(\?)+'))

comments_tagged = comments_tagged \
                        .withColumn('is_asking_a_question_comments', col('body').rlike(r'(\?)+')) \
                        .cache()

# Regex 4
keywords_faang = ["facebook", "amazon", "apple", "netflix", "google"]
pattern = "(?i)" + "|(?i)".join(keywords_faang)

submissions_tagged = submissions_tagged \
                        .withColumn('is_FAANG_submissions', col('searchable_text').rlike(pattern)) \
                        .cache()

comments_tagged = comments_tagged \
                        .withColumn('is_FAANG_comments', col('body').rlike(pattern)) \
                        .cache()

# Join comments and submissions
joined_subset = comments_new_id.join(submissions_tagged, comments_new_id['link_id'] == submissions_tagged['id'], 'outer')

```

#### 3) Data Tagging: Regex Searches

The following section incorporates regex searches to create new information in the data, specifically attributes that indicate whether Reddit posts/comments fall into the following categories:

-   is_tech
-   is_asking_for_help
-   is_asking_a\_question
-   is_FAANG

Each subsection below provides a glimpse into actual posts/comments that match each regex search pattern.
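
Before diving into the individual searches, the mechanic behind all of them is the same: join the keyword list into a single case-insensitive pattern and flag any row whose text matches it with `rlike`. The sketch below illustrates this with a shortened, purely illustrative keyword list (it assumes the `comments_subset_col` dataframe from the data-cleaning step above).

```{python}
#| eval: False
# Illustrative sketch of the shared tagging mechanic (shortened keyword list)
from pyspark.sql.functions import col

keywords = ["leetcode", "python", "sql"]       # illustrative subset only
pattern = "(?i)" + "|(?i)".join(keywords)      # case-insensitive alternation

tagged_example = comments_subset_col.withColumn(
    "matches_keywords", col("body").rlike(pattern)
)
tagged_example.filter("matches_keywords").select("link_id", "body").show(5)
```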

##### Regex Search 1: is_tech_comments

**Keywords:** tech, technology, code, coding, computer, data, python, java, javascript, sql, database, statistics, azure, cloud computing, progamming, leetcode, saas, faang

```{python}
#| eval: False
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
import time
%%time
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
#print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
#print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
from pyspark.sql.functions import concat, lit, lower, col, when, array
# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making a "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))
keywords_is_tech = ["tech", "technology", "code", "coding", "computer", "data", "python","java", "javascript", "sql", "database", "statistics", "azure", "cloud computing", "progamming", "leetcode", "saas", "faang"]
pattern = "(?i)" + "|(?i)".join(keywords_is_tech)

comments_tagged = comments_subset_col \
                        .withColumn('is_tech_comments', col('body').rlike(pattern)) \
                        .cache()
```

| link_id    | body                 | is_tech_comments |
|------------|----------------------|------------------|
| t3_100a8f0 | I got it too. See... | true             |
| t3_100a8f0 | I got this too, i... | true             |
| t3_100a8f0 | Most leetcode use... | true             |
| t3_100dubf | Only .2% of leetc... | true             |
| t3_100ebu9 | Competitive codin... | true             |

##### Regex Search 2: is_asking_for_help_comments

**Keywords:** help me, struggle, struggling, anyone, advice, tips, help


```{python}
#| eval: False
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
import time
%%time
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
#print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
#print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
from pyspark.sql.functions import concat, lit, lower, col, when, array
# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making a "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))
keywords_help = ["help me", "struggle", "struggling", "anyone", "advice", "tips", "help"]
pattern = "(?i)" + "|(?i)".join(keywords_help)

comments_tagged = comments_tagged \
                        .withColumn('is_asking_for_help_comments', col('body').rlike(pattern)) \
                        .cache()
```

| link_id    | body               | is_asking_for_help_comments |
|------------|--------------------|-----------------------------|
| t3_100a8f0 | I have the 300+…   | true                        |
| t3_100ebu9 | A lot of people j… | true                        |
| t3_100ebu9 | Solving quest…     | true                        |
| t3_100ebu9 | I have 300 lc pro… | true                        |
| t3_100onp5 | Dynamic Progra…    | true                        |


##### Regex Search 3: is_asking_a\_question_comments

**Pattern:** includes a "?"

```{python}
#| eval: False
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
import time
%%time
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
#print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
#print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
from pyspark.sql.functions import concat, lit, lower, col, when, array
# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making a "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))
comments_tagged = comments_tagged \
                        .withColumn('is_asking_a_question_comments', col('body').rlike(r'(\?)+')) \
                        .cache()
```

| link_id    | body               | is_asking_a\_question_comments |
|------------|--------------------|--------------------------------|
| t3_1009w0e | I interviewed rec… | true                           |
| t3_100a8f0 | What do we ch…     | true                           |
| t3_100a8f0 | Heck yeah do …     | true                           |
| t3_100a8f0 | Just curious, wil… | true                           |
| t3_100a8f0 | I got this too, i… | true                           |
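
As a quick local sanity check of the question-mark pattern (independent of Spark), the same regex can be exercised with Python's `re` module on a few made-up example strings:

```{python}
# Standalone sanity check of the question-mark pattern; the example strings are invented
import re

question_pattern = re.compile(r"(\?)+")
examples = ["Any tips for the phone screen?", "I passed the onsite!", "Why??"]
print([bool(question_pattern.search(s)) for s in examples])  # [True, False, True]
```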

##### Regex Search 4: is_FAANG_comments

**Keywords:** facebook, amazon, apple, netflix, google


```{python}
#| eval: False
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
import time
%%time
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
#print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
#print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
from pyspark.sql.functions import concat, lit, lower, col, when, array
# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making a "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))
keywords_faang = ["facebook", "amazon", "apple", "netflix", "google"]
pattern = "(?i)" + "|(?i)".join(keywords_faang)

comments_tagged = comments_tagged \
                        .withColumn('is_FAANG_comments', col('body').rlike(pattern)) \
                        .cache()
```

| link_id    | body                 | is_FAANG_comments |
|------------|----------------------|-------------------|
| t3_1010kbv | When I go for int... | true              |
| t3_1010kbv | I appreciate all ... | true              |
| t3_103ppfi | I am interested a... | true              |
| t3_104wf74 | SOURCE: https://s... | true              |
| t3_105fbkg | I think this is a... | true              |

#### 4) Data Additions: Creating New Variables

Five new variables were created using the columns produced from the regex searches (explained in the previous section). Each of these variables checks for a mention of one of the FAANG companies in the comments: Facebook, Amazon, Apple, Netflix, and Google. Additionally, a new 'Multiple' variable has been created to account for cases where more than one of these companies is mentioned in a single comment.

```{python}
#| eval: False
%%time
session = sagemaker.Session()
bucket = 'tm1450-project'
#output_prefix_data_submissions = f"project/submissions/yyyy=*"
output_prefix_data_submissions = f"final_tagged_data.parquet/"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
df = spark.read.parquet(s3_path, header=True)
```

##### New Variable 1: Facebook Mentions

The following provides a glimpse of the data when Facebook is mentioned. If so, the new variable will be true, otherwise it's false.

```{python}
#| eval: False
df_FAANG_True= df_FAANG_True.withColumn("body", lower(col("body")))
df_new_faang = df_FAANG_True.withColumn("Facebook", when(col("body").like("%facebook%"), True).otherwise(False))
df_new_faang=df_new_faang.select("link_id", "body","is_FAANG_comments", "Facebook")
df_new_faang.where(col("Facebook") == True).show(20, truncate=True)
```

| link_id    | body                | is_FAANG_comments | Facebook |
|------------|---------------------|-------------------|----------|
| t3_117gmt1 | \[https://leetcode… | true              | true     |
| t3_11afkew | hey, went through…  | true              | true     |
| t3_jck8os  | Omg, stop saying …  | true              | true     |
| t3_jck8os  | \[facebookrecruiti… | true              | true     |
| t3_jck8os  | You’ll notice \[fa… | true              | true     |

##### New Variable 2: Amazon Mentions

If Amazon is mentioned, this variable will be true, otherwise it's false.

```{python}
#| eval: False
df_new_faang = df_FAANG_True.withColumn("Amazon", when(col("body").like("%amazon%"), True).otherwise(False))
df_new_faang=df_new_faang.select("link_id", "body","is_FAANG_comments", "Amazon")
df_new_faang.where(col("Amazon") == True).show(20, truncate=True)
```

| link_id    | body                 | is_FAANG_comments | Amazon |
|------------|----------------------|-------------------|--------|
| t3_10cccra | Here is what I wo…   | true              | true   |
| t3_10cccra | I think the real …   | true              | true   |
| t3_10h69zq | Here are some lin…   | true              | true   |
| t3_111t0kf | Just saw \[this\](h… | true              | true   |
| t3_112xb4g | Daily challenge +…   | true              | true   |

##### New Variable 3: Apple Mentions

If Apple is mentioned, this variable will be true, otherwise it's false.

```{python}
#| eval: False
df_new_faang = df_FAANG_True.withColumn("Apple", when(col("body").like("%apple%"), True).otherwise(False))
df_new_faang=df_new_faang.select("link_id", "body","is_FAANG_comments", "Apple")
df_new_faang.where(col("Apple") == True).show(20, truncate=True)
```

| link_id    | body              | is_FAANG_comments | Apple |
|------------|-------------------|-------------------|-------|
| t3_109g5hw | just use some…    | true              | true  |
| t3_11ohugl | It was personal…  | true              | true  |
| t3_120284u | I have heard it…  | true              | true  |
| t3_120284u | An apple recruit… | true              | true  |
| t3_120284u | Hey, really curi… | true              | true  |

##### New Variable 4: Netflix Mentions

If Netflix is mentioned, this variable will be true, otherwise it's false.

```{python}
#| eval: False
df_new_faang = df_FAANG_True.withColumn("Netflix", when(col("body").like("%netflix%"), True).otherwise(False))
df_new_faang=df_new_faang.select("link_id", "body","is_FAANG_comments", "Netflix")
df_new_faang.where(col("Netflix") == True).show(20, truncate=True)
```

| link_id   | body                | is_FAANG_comments | Netflix |
|-----------|---------------------|-------------------|---------|
| t3_n09dlg | without netflix i…  | true              | true    |
| t3_pa6044 | If you want a tim…  | true              | true    |
| t3_u74ujy | \[https://imgur.co… | true              | true    |
| t3_uksgf5 | go and watch ne…    | true              | true    |
| t3_vrjr16 | to also add to th…  | true              | true    |

##### New Variable 5: Google Mentions

If Google is mentioned, this variable will be true, otherwise it's false.

```{python}
#| eval: False
df_new_faang = df_FAANG_True.withColumn("Google", when(col("body").like("%google%"), True).otherwise(False))
df_new_faang=df_new_faang.select("link_id", "body","is_FAANG_comments", "Google")
df_new_faang.where(col("Google") == True).show(20, truncate=True)
```

| link_id    | body               | is_FAANG_comments | Google |
|------------|--------------------|-------------------|--------|
| t3_1010kbv | When I go for int… | true              | true   |
| t3_1010kbv | I appreciate all … | true              | true   |
| t3_105fbkg | I think this is a… | true              | true   |
| t3_107056x | I am 16, however … | true              | true   |
| t3_10775b6 | Bro is an ex-go…   | true              | true   |

##### New Variable 6: Multiple FAANG Company Mentions

If more than one FAANG company is mentioned, this variable will be true. If only one FAANG company (or none) is mentioned, this variable is false.

```{python}
#| eval: False
df_faang_Apple = df_FAANG_True.withColumn("Multiple", 
                              when((col("body").like("%facebook%") & col("body").like("%amazon%")) |
                                   (col("body").like("%facebook%") & col("body").like("%apple%")) |
                                   (col("body").like("%facebook%") & col("body").like("%netflix%")) |
                                   (col("body").like("%facebook%") & col("body").like("%google%")) |
                                   (col("body").like("%amazon%") & col("body").like("%apple%")) |
                                   (col("body").like("%amazon%") & col("body").like("%netflix%")) |
                                   (col("body").like("%amazon%") & col("body").like("%google%")) |
                                   (col("body").like("%apple%") & col("body").like("%netflix%")) |
                                   (col("body").like("%apple%") & col("body").like("%google%")) |
                                   (col("body").like("%netflix%") & col("body").like("%google%")),
                              True).otherwise(False))
               
df_faang_Apple.where(col("Multiple") == True).show(20, truncate=True)
```

| link_id    | body               | is_FAANG_comments | Multiple |
|------------|--------------------|-------------------|----------|
| t3_108dzhl | yea, i agree with… | true              | true     |
| t3_10k842y | yes i agree, only… | true              | true     |
| t3_113wq8t | you can take a…    | true              | true     |
| t3_11afkew | hey, went thro…    | true              | true     |
| t3_11bl3rd | because it se …    | true              | true     |

#### 5) Data Quality Checks
At this point in the exploratory data analysis, the data has been cleaned and new variables have been created. This next section performs quality checks on the first draft of a final clean dataset. Performing QC checks and taking the time to actually observe the data is crucial: simple awareness of a dataset's attributes, such as column names, missing values, and dataframe shape, helps identify any inconsistencies, errors, or missing information that might impact the outcomes of future analysis. Therefore, the following subsections provide a sample of the data for these types of observations.

##### Checking the Names of the Columns

```{python}
#| eval: False
joined_subset.columns
```

The code above checks the column names; however, please see the schema in section 7 below for the complete list of columns.

##### Checking the Number of Rows

```{python}
#| eval: False
joined_subset.count()
```

114334

##### Checking for Nulls

```{python}
#| eval: False
from pyspark.sql.functions import col, count, when

joined_subset.select([count(when(col(c).isNull(), c)).alias(c) for c in joined_subset.columns]).show()
```

| column                            | number of nulls |
|-----------------------------------|-----------------|
| link_id                           | 8111            |
| body                              | 8111            |
| created_utc_comment               | 8111            |
| is_tech_comments                  | 8111            |
| is_asking_for_help_comments       | 8111            |
| is_asking_a\_question_comments    | 8111            |
| is_FAANG_comments                 | 8111            |
| id                                | 934             |
| subreddit                         | 934             |
| subreddit_id                      | 934             |
| title                             | 934             |
| selftext                          | 934             |
| created_utc_submission            | 934             |
| num_comments                      | 934             |
| searchable_text                   | 934             |
| is_tech_submissions               | 934             |
| is_asking_for_help_submissions    | 934             |
| is_asking_a\_question_submissions | 934             |
| is_FAANG_submissions              | 934             |

There are 8,111 rows with NULL in the comment-related columns. This is to be expected, and indicates that there are 8,111 submissions in the dataset that have no comments, and therefore have missing data in the comment-related fields.

Similarly, there are 934 rows with NULL in the submission-related columns. This indicates that there are 934 comments that respond to submissions outside of the dataset, likely because the connected submission was posted outside the window of time that this dataset tracked. In any case, these comments are still of interest and will be kept in the dataset.
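
One way to confirm this reading is to count the rows that are missing each side of the join directly; the counts should line up with the null table above (a sketch using the `joined_subset` dataframe):

```{python}
#| eval: False
from pyspark.sql.functions import col

# Submissions with no comments: the comment-side key is null
print(joined_subset.filter(col("link_id").isNull()).count())   # expected 8,111
# Comments whose parent submission is outside the dataset: the submission-side key is null
print(joined_subset.filter(col("id").isNull()).count())        # expected 934
```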

##### Checking Length of Comments

```{python}
#| eval: False
from pyspark.sql.functions import length

comment_length = joined_subset.filter(joined_subset.link_id.isNotNull()) \
                                .select(length('body').alias('comment_len'), 'body')

comment_length.select('comment_len').describe().show()
```

| summary | comment_len |
|---------|-------------|
| count   | 106223.0    |
| mean    | 190.8       |
| stddev  | 304.1       |
| min     | 1.0         |
| max     | 8183.0      |

The distribution of the length of comments seems reasonable. The only result that may raise an eyebrow is the minimum comment length of 1. However, some of the comments are primarily emojis, which contain their own special significance in the world of NLP. For this reason, all of the comments will be kept.
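
For anyone curious about those length-1 comments, the shortest bodies can be pulled up directly before deciding to keep them (a sketch using the `comment_length` dataframe defined above):

```{python}
#| eval: False
# Peek at the shortest comments (often emojis or single characters)
comment_length.filter("comment_len <= 2").select("body").show(10, truncate=False)
```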

##### Removing Corrupted Rows

Corrupted rows were filtered out while reading in the data.

#### 6) Data Reformatting

Finally, the data from above needed to be reformatted so that the text is all in the same column and submissions and comments are concatenated rather than joined. The code below shows this logic, and the table shows the top three rows for a selection of the columns.

```{python}
#| eval: False
%%time
import sagemaker
session = sagemaker.Session()
output_prefix_data_submissions = "final_tagged_data.parquet/"
s3_path = f"s3a://{bucket}/project/{output_prefix_data_submissions}"
print(f"reading tagged data from {s3_path}")
df = spark.read.parquet(s3_path, header=True)
print(f"shape of the submissions dataframe is {df.count():,}x{len(df.columns)}")

import pyspark.sql.functions as F
import pandas as pd
import datetime

# Separate into two dataframes (comments and submissions)
df_submissions = df.filter(df.created_utc_submission.isNotNull()) \
                    .select(
                        F.col('id').alias('link_id'), 
                        F.col('created_utc_submission').alias('created_utc'),
                        F.col('searchable_text').alias('body'),
                        F.col('is_tech_submissions').alias('is_tech'),
                        F.col('is_asking_for_help_submissions').alias('is_asking_for_help'),
                        F.col('is_asking_a_question_submissions').alias('is_asking_a_question'),
                        F.col('is_FAANG_submissions').alias('is_FAANG'),
                        F.lit(True).alias('isSubmission')
                    ) \
                    .distinct()

print("Number of unique submissions:", df_submissions.count())

df_comments = df.filter(df.created_utc_comment.isNotNull()) \
                    .select(
                        'link_id', 
                        F.col('created_utc_comment').alias('created_utc'),
                        'body',
                        F.col('is_tech_comments').alias('is_tech'),
                        F.col('is_asking_for_help_comments').alias('is_asking_for_help'),
                        F.col('is_asking_a_question_comments').alias('is_asking_a_question'),
                        F.col('is_FAANG_comments').alias('is_FAANG'),
                        F.lit(False).alias('isSubmission')
                    ) \
                    .distinct()

print("Number of unique comments:", df_comments.count())


# Concatenate comments and submissions
df_clean = df_submissions.union(df_comments)

df_clean.show(10)

```

| body                 | is_tech | is_asking_for_help | is_asking_a\_question | is_FAANG | isSubmission |
|----------|----------|----------|----------|----------|----------|
| Amazon Inter...      | false   | false              | true                  | true     | true         |
| Coffee in a 2nd i... | false   | true               | true                  | false    | true         |
| Attire for Coach ... | false   | false              | true                  | false    | true         |


#### 7) Final Data Set Information

**Interesting Columns:**

-   link_id
-   body
-   created_utc_comment
-   subreddit
-   subreddit_id
-   title
-   selftext
-   num_comments

**Number of rows:** The shape of the submissions dataframe is 22,122 x 68, and the shape of the comments dataframe is 106,223 x 21.

**Schema:**

```         
root
|-- link_id: string (nullable = true)  
|-- body: string (nullable = true)  
|-- created_utc_comment: timestamp (nullable = true)  
|-- is_tech_comments: boolean (nullable = true)  
|-- is_asking_for_help_comments: boolean (nullable = true)  
|-- is_asking_a_question_comments: boolean (nullable = true)  
|-- is_FAANG_comments: boolean (nullable = true)  
|-- id: string (nullable = true)  
|-- subreddit: string (nullable = true)  
|-- subreddit_id: string (nullable = true)  
|-- title: string (nullable = true)  
|-- selftext: string (nullable = true)  
|-- created_utc_submission: timestamp (nullable = true)  
|-- num_comments: long (nullable = true)  
|-- searchable_text: string (nullable = true)  
|-- is_tech_submissions: boolean (nullable = true)  
|-- is_asking_for_help_submissions: boolean (nullable = true)  
|-- is_asking_a_question_submissions: boolean (nullable = true)  
|-- is_FAANG_submissions: boolean (nullable = true)  
```


#### 8) External Data Set Preparation

**External Data Source:** [link](https://github.com/yuenherny/um-wqd7003-ds-salary-analysis/tree/main)

This data set was combined by the author of the GitHub repository above from two sources: LEVELS.FYI API and a Kaggle dataset on data science and stem salaries (see README in repository linked above). The original data set contains yearly compensation (total, base salary, bonuses, and stock grants), location, years of experience, and years at the company information for STEM and data science roles within over 1500 companies. This data ranges from 2017 to 2021.

For the project's purposes, the cleaned data set averages yearly compensation, base salary, bonus, years of experience, and years at company for each company and role type (tech or non-tech). From there, subsets can be made for particular companies of interest. Additionally, the focus on tech vs. non-tech roles can easily be achieved through filtering. As seen above, the data has been tagged for containing info on the FAANG companies (Facebook, Amazon, Apple, Netflix, and Google). Plot 5 above was created by merging the main data set with this cleaned external data set to allow for a comparison between the Reddit comment frequency for the FAANG companies and the average yearly compensation and average base salaries at these companies.

```{python}
#| warning: False
# Import libraries
import os
import openpyxl
import pandas as pd
import numpy as np
# Read in the data
company_names = pd.read_csv("../data/external_data/company_name_matching_RAW.csv")
salaries = pd.read_excel("../data/external_data/levels_fyi_salary_data_RAW.xlsx", sheet_name = "data")

# Subset columns
salaries_subset = salaries[["timestamp", "company", "title", "totalyearlycompensation", "location", 
                            "yearsofexperience", "yearsatcompany", "basesalary", "stockgrantvalue", "bonus"]]

# Add a year column and drop timestamp column
salaries_subset["year"] = pd.to_datetime(salaries_subset["timestamp"]).dt.year
salaries_subset = salaries_subset.drop(["timestamp"], axis = 1)

# Change data types
data_types = {"company": str, "title": str, "totalyearlycompensation": float, "location": str, "yearsofexperience": float, 
    "yearsatcompany": float, "basesalary": float, "stockgrantvalue": float, "bonus": float, "year": int}
salaries_subset = salaries_subset.astype(data_types)
salaries_subset["year"] = pd.to_datetime(salaries_subset["year"], format = "%Y")

# Create clean company name matching
company_mapping = dict(zip(company_names["Variant"].str.strip(), company_names["Clean Company"]))

# Make a new column with the clean company names
salaries_subset = salaries_subset.rename(columns = {"company": "old_company"})
salaries_subset["company"] = salaries_subset["old_company"].map(company_mapping)
salaries_subset = salaries_subset.drop(["old_company"], axis = 1)

# Categorize available titles as "tech" or "non tech"
role_type_dict = {'Product Manager': 'tech', 'Software Engineer': 'tech',
       'Software Engineering Manager': 'tech', 'Data Scientist': 'tech',
       'Solution Architect': 'tech', 'Technical Program Manager': 'tech',
       'Human Resources': 'non tech', 'Product Designer': 'tech', 'Marketing': 'non tech',
       'Business Analyst': 'tech', 'Hardware Engineer': 'tech', 'Sales': 'non tech', 
       'Recruiter': 'non tech', 'Mechanical Engineer': 'tech', 'Management Consultant': 'non tech'}

salaries_subset["role_type"] = salaries_subset["title"].map(role_type_dict)

# Rearrange columns
salaries_subset = salaries_subset[["company", "year", "role_type", "title", "totalyearlycompensation", "location", "yearsofexperience",
                                    "yearsatcompany", "basesalary", "stockgrantvalue", "bonus"]]
# Get a summary table of averages by company and role type
summary_df = salaries_subset.groupby(["company", "role_type"]).agg({
    "totalyearlycompensation": "mean", "yearsofexperience": "mean",
    "yearsatcompany": "mean", "basesalary": "mean",
    "stockgrantvalue": "mean", "bonus": "mean"}).reset_index()
summary_df.columns = ["company", "role_type", "average_yearly_compensation", "average_years_experience", "average_years_at_company", 
                        "average_base_salary", "average_stock_grant_value", "average_bonus"]

# Subset for just FAANG (Facebook, Amazon, Apple, Netflix, Google)
faang_df = summary_df[(summary_df["company"] == "Facebook") | (summary_df["company"] == "Amazon") | (summary_df["company"] == "Apple") | (summary_df["company"] == "Netflix") | (summary_df["company"] == "Google")]

summary_df.head(20)
#faang_df.head(20)
```

### Summary Tables

After conducting the necessary quality checks on the dataset, the information is summarized from several different perspectives to enhance the team's overall understanding of the data. This approach allows for a more comprehensive analysis and interpretation of the dataset's characteristics and trends.

#### Table 1: Summary Of Regex Searches

The following table shows the results of the regex searches. For certain submissions/comments, several regex searches can be true at the same time. Only the first 8 rows are printed for demonstration.

```{python}
#| eval: False
df.select('link_id', 'body', 'is_tech_comments', 'is_asking_for_help_comments','is_asking_a_question_comments','is_FAANG_comments').show(20,True)
```

| link_id    | body               | is_tech_comments | is_asking_for_help_comments | is_asking_a\_question_comments | is_FAANG_comments |
|------------|------------|------------|------------|------------|------------|
| t3_1004fo3 | I mean this in a…  | false            | false                       | false                          | false             |
| t3_1004fyy | Dark mode…         | false            | false                       | false                          | false             |
| t3_1004fyy | Leetcode does…     | true             | false                       | false                          | false             |
| t3_1005b4u | Ok that’s what I…  | false            | false                       | true                           | false             |
| t3_1005b4u | This, since it’s…  | false            | false                       | false                          | false             |
| t3_1005b4u | Well the total ru… | false            | false                       | false                          | false             |
| t3_1005b4u | O(n\^2)            | false            | false                       | false                          | false             |
| t3_1005e3k | In my opinion…     | false            | false                       | false                          | false             |
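
To quantify the overlap mentioned above, the boolean flags can be summed per row and the rows with two or more true flags counted. The following sketch assumes the tagged dataframe `df` used for the table:

```{python}
#| eval: False
from pyspark.sql import functions as F

flag_cols = ['is_tech_comments', 'is_asking_for_help_comments',
             'is_asking_a_question_comments', 'is_FAANG_comments']

# Number of regex flags that are True for each comment
n_flags = sum(F.col(c).cast('int') for c in flag_cols)
df.withColumn('n_flags', n_flags).filter('n_flags >= 2').count()
```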

#### Table 2: New Variables Table

The table below provides an overview of the new variables created. It focuses on the mentions of FAANG companies in the comments. Since this website's focus is on technical interviews, this allows the team to separate 'the big guns' of the market from the rest. It is important to note that there are other companies that could be included, but in order to narrow down the search, FAANG alone was chosen.

```{python}
#| eval: False
df.select('link_id', 'body','is_FAANG_comments','Facebook', 'Amazon', 'Apple', 'Netflix', 'Google', 'Multiple').show(20,True)
```

| link_id    | body              | is_FAANG_comments | Facebook | Amazon | Apple | Netflix | Google | Multiple |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| t3_1005e3k | not to discount…  | true              | false    | true   | false | false   | false  | false    |
| t3_1005e3k | i am aware that…  | true              | false    | true   | false | false   | true   | true     |
| t3_1005e3k | you must mean…    | true              | false    | false  | true  | false   | false  | false    |
| t3_1005e3k | i had around…     | true              | false    | true   | false | false   | false  | false    |
| t3_1005e3k | contest ratings…  | true              | false    | true   | false | false   | false  | false    |
| t3_1005e3k | i also have 550…  | true              | false    | true   | false | false   | true   | true     |
| t3_1005e3k | i’ve interviewed… | true              | false    | true   | false | false   | true   | true     |
| t3_1005e3k | thanks. when…     | true              | false    | true   | false | false   | false  | false    |
| t3_100ivcc | lesson 1: google… | true              | false    | false  | false | false   | true   | false    |
| t3_100wpsk | hey, just want…   | true              | false    | true   | false | false   | false  | false    |

#### Table 3: Frequency Counts for Regex Searches

Since there are significantly more comments than submissions, a summary of the frequency counts for each of the regex searches has been formatted in a table.

```{python}
#| eval: False
# Table 3
frequency_is_tech_comments = df.groupBy('is_tech_comments').count()
frequency_is_asking_for_help_comments = df.groupBy('is_asking_for_help_comments').count()
frequency_is_asking_a_question_comments = df.groupBy('is_asking_a_question_comments').count()
frequency_is_FAANG_comments = df.groupBy('is_FAANG_comments').count()
frequency_is_tech_comments.show()
frequency_is_asking_for_help_comments.show()
frequency_is_asking_a_question_comments.show()
frequency_is_FAANG_comments.show()

```

| regex                          | True  | False  |
|--------------------------------|-------|--------|
| is_tech_comments               | 20955 | 85268  |
| is_asking_for_help_comments    | 8969  | 97254  |
| is_asking_a\_question_comments | 21703 | 84520  |
| is_FAANG_comments              | 3797  | 102426 |
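
A consolidated table like the one above can also be assembled programmatically from the four groupBy results. The sketch below (assuming the tagged dataframe `df`) collects the True/False counts into a single pandas dataframe:

```{python}
#| eval: False
import pandas as pd

flag_cols = ['is_tech_comments', 'is_asking_for_help_comments',
             'is_asking_a_question_comments', 'is_FAANG_comments']

rows = []
for c in flag_cols:
    # Collect the True/False counts for this flag into a dictionary
    counts = {row[c]: row['count'] for row in df.groupBy(c).count().collect()}
    rows.append({'regex': c, 'True': counts.get(True, 0), 'False': counts.get(False, 0)})

summary_counts = pd.DataFrame(rows)
print(summary_counts)
```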

### Visualizations

In the section below, topics specifically relating to EDA are explored. The visualizations are broken down by each of the three topics.

#### TOPIC 1: INTERVIEW POST AND COMMENT FREQUENCY OVER TIME

*Goal:* Determine the time periods throughout the year with the most submissions (posts) and comments on Reddit regarding technical/coding interviews. Do these windows fall during typical hiring windows? This line of analysis could explain spikes or troughs in job application volume in the big tech industry.

The line graph below illustrates the monthly frequency of submissions and comments over a span of three years (2021, 2022, and 2023), emphasizing user engagement trends around interviews on Reddit. There is a year-over-year increase, likely influenced by post-COVID interest in technology sectors such as NFTs and AI. The peak in 2022 around late summer coincides with common hiring cycles, whereas the more consistent 2021 data may reflect the erratic job market during COVID-19. Data up to August 2023 suggests an ongoing rise in engagement, indicative of either growing user participation or platform expansion.

```{python}
#| eval: False
# Plot 1

import sagemaker
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_submissions = "project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading comments from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
#print(f"shape of the comments dataframe is {submissions.count():,}x{len(submissions)
from pyspark.sql.functions import year, month
import matplotlib.pyplot as plt

# Extract year and month from the 'created_utc' column
sub_df_with_month = submissions.withColumn("year", year("created_utc")).withColumn("month", month("created_utc"))

# Group the data by year and month and count the number of comments
sub_counts = sub_df_with_month.groupBy("year", "month").count().orderBy("year", "month").cache()

# Convert the result to a Pandas DataFrame for easier plotting
sub_counts_pd = sub_counts.toPandas()
%%time
import sagemaker
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
#print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
from pyspark.sql.functions import year, month
import matplotlib.pyplot as plt

# Extract year and month from the 'created_utc' column
comm_df_with_month = comments.withColumn("year", year("created_utc")).withColumn("month", month("created_utc"))

# Group the data by year and month and count the number of comments
comment_counts = comm_df_with_month.groupBy("year", "month").count().orderBy("year", "month").cache()

# Convert the result to a Pandas DataFrame for easier plotting
comment_counts_pd = comment_counts.toPandas()
import pandas as pd
combined_df_pd = pd.merge(sub_counts_pd, comment_counts_pd, on=['year', 'month'], how='outer', suffixes=('_submissions', '_comments'))
#combined_df_pd = combined_df_pd.fillna(0)
import calendar

# Filter the data for each year
data_2021 = combined_df_pd[combined_df_pd['year'] == 2021]
data_2022 = combined_df_pd[combined_df_pd['year'] == 2022]
data_2023 = combined_df_pd[combined_df_pd['year'] == 2023]

# Convert numeric months to month names
data_2021['month'] = data_2021['month'].apply(lambda x: calendar.month_name[x])
data_2022['month'] = data_2022['month'].apply(lambda x: calendar.month_name[x])
data_2023['month'] = data_2023['month'].apply(lambda x: calendar.month_name[x])

# Create a line plot with month names on the x-axis and a separate line for each year
plt.figure(figsize=(10, 6))
plt.plot(data_2021['month'], data_2021['count_submissions'], marker='o', linestyle='-', label='Submissions 2021', color='#204321')
plt.plot(data_2022['month'], data_2022['count_submissions'], marker='o', linestyle='-', label='Submissions 2022', color='#5E9751')
plt.plot(data_2023['month'], data_2023['count_submissions'], marker='o', linestyle='-', label='Submissions 2023', color='#82B8B5')
plt.plot(data_2021['month'], data_2021['count_comments'], marker='o', linestyle='dotted', label='Comments 2021', color='#163D6B')
plt.plot(data_2022['month'], data_2022['count_comments'], marker='o', linestyle='dotted', label='Comments 2022', color='#4A80EB')
plt.plot(data_2023['month'], data_2023['count_comments'], marker='o', linestyle='dotted', label='Comments 2023', color='#251279')
plt.xlabel('Month')
plt.ylabel('Number of Submissions & Comments')
plt.title('Submissions & Comments Over Time by Year')
plt.xticks(rotation=45)  # Rotate x-axis labels for readability
plt.grid(False)
plt.legend()
plt.tight_layout()

plt.savefig('../data/plots/sub_com_over_time.png', dpi=300) 

plt.show()

```

<!-- ![](images/sub_com_over_time.png) -->
<img src="images/sub_com_over_time.png" style="width: 80%; height: auto;"/>


#### TOPIC 2: POPULAR TECHNICAL INTERVIEW QUESTIONS AND TOPICS

*Goal:* Determine the most popular interview questions and/or topics to gauge the specific areas of interest regarding technical and coding-related interviews. Questions that are being discussed more frequently can help future applicants streamline how they study for interviews to focus on the topics with the most traction on Reddit. Highlighting content matter that’s seen most frequently can direct students in their studying.

The word cloud below was created using the body column in the dataset, which contains all the text from the Reddit submissions and comments. Since the focus is on both submissions and comments, the two datasets were combined so that all of the text was available to produce an accurate word cloud. The text was then tokenized, stopwords and special characters were removed, and the resulting vocabulary was flattened into a single list from which the word cloud was generated.

```{python}
#| eval: False
# Plot 2
# get WordCloud library
!pip install matplotlib wordcloud
# import libraries
from pyspark.sql.functions import col, lower, regexp_replace, array_except, array, lit, when
from pyspark.ml.feature import Tokenizer, StopWordsRemover
from pyspark.sql import functions as F
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

# preprocess the text: clean up, tokenize, and remove stop words
# make text lowercase, and remove special characters
df_clean = df.withColumn('body', lower(col('body'))) \
             .withColumn('body', regexp_replace('body', r"[^a-zA-Z\s]", "")) \
             .withColumn('body', regexp_replace('body', r"\b\w{1,2}\b", ""))
# tokenize text
tokenizer = Tokenizer(inputCol="body", outputCol="words")
# transform tokenized text
df_words = tokenizer.transform(df_clean)
# remove stopwords
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
df_filtered = remover.transform(df_words)
#remove empty strings
empty_string_array = array(lit(''))
df_filtered = df_filtered.withColumn("filtered", array_except(col("filtered"), empty_string_array))
#flatten the vocabulary list
words_list = df_filtered.select('filtered').rdd.flatMap(lambda x: x).collect()
flat_list = [word for sublist in words_list for word in sublist]
# join the list to a single string, as wordcloud takes a string as input
word_string = " ".join(flat_list)
# adjust figure size
wordcloud = WordCloud(width=1600, height=800, collocations=True, background_color='white').generate(word_string)
plt.figure(figsize=(20, 10), dpi=300) 
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
```

<!-- ![](images/high_resolution_wordcloud_updated.png) -->
<img src="images/high_resolution_wordcloud_updated.png" style="width: 75%; height: auto;"/>


In the word cloud above, the most prominent words are ‘one’, ‘problem’, and ‘interview’; this is expected given that the Reddit posts are focused on asking questions about technical interviews. The most significant term observed here is ‘LeetCode’ because it indicates a focus on coding or technical interviews, as LeetCode is a platform that provides coding challenges for such purposes. This suggests that the Reddit posts have a strong representation of technical and coding questions, with potential sources or solutions on the LeetCode website. This supports the decision to carry out a regex search for technical Reddit posts.

This next plot illustrates the percent of comments and/or submissions in the dataset that pose questions. Submissions and comments were identified as posing a question with a simple regex search that matched a question mark (?) character. In a casual forum like Reddit, it is expected to miss a few questions in which users forget to add a question mark, but the examination of the dataset showed that the accuracy of this regex match was adequately reliable.
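
One simple way to perform that kind of spot check is to sample a handful of rows on each side of the flag and read them manually (a sketch; it assumes the combined dataframe `df` used in the plot code below):

```{python}
#| eval: False
from pyspark.sql import functions as F

# A few rows flagged as questions, and a few that were not, for manual review
df.filter('is_asking_a_question').select('body').show(5, truncate=80)
df.filter(F.col('is_asking_a_question') == False).select('body').show(5, truncate=80)
```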

```{python}
#| eval: False
# Question counts
import pyspark.sql.functions as F
import pandas as pd
import datetime

# Add date column
df_time = df.withColumn(
    "monthyear",
    F.concat(
        F.month("created_utc"), 
        F.lit('-'), 
        F.year("created_utc")
    )
)

# Question counts
df_question_counts = df_time.filter('is_asking_a_question').groupBy('monthyear').count()

# Submission/comment counts
df_total_counts = df_time.groupBy('monthyear')\
                    .count()\
                    .select(F.col('monthyear').alias('monthyear2'), F.col('count').alias('total_count'))

# Convert to pandas dataframe
pd_df1 = df_question_counts.join(
    df_total_counts, 
    df_question_counts.monthyear == df_total_counts.monthyear2, 
    'left'
).select('monthyear', 'count', 'total_count').toPandas()

# Reorder pandas dataframe
pd_df1['month'] = pd_df1['monthyear'].apply(lambda x: int(x.split('-')[0]))
pd_df1['year'] = pd_df1['monthyear'].apply(lambda x: int(x.split('-')[1]))

pd_df1 = pd_df1.sort_values(['year', 'month']).reset_index(drop=True)

# Add ratio of questions to all posts as new column
pd_df1['pct_questions'] = pd_df1['count'] / pd_df1['total_count'] * 100
pd_df1.head()

# Get question counts for submissions
# Question counts
df_question_counts_sub = df_time.filter('is_asking_a_question').filter('isSubmission').groupBy('monthyear').count()

# Submission/comment counts
df_sub_counts = df_time.filter('isSubmission')\
                    .groupBy('monthyear')\
                    .count()\
                    .select(F.col('monthyear').alias('monthyear2'), F.col('count').alias('total_count'))

# Convert to pandas dataframe
pd_df_sub = df_question_counts_sub.join(
    df_sub_counts, 
    df_question_counts_sub.monthyear == df_sub_counts.monthyear2, 
    'left'
).select('monthyear', 'count', 'total_count').toPandas()

# Reorder pandas dataframe
pd_df_sub['month'] = pd_df_sub['monthyear'].apply(lambda x: int(x.split('-')[0]))
pd_df_sub['year'] = pd_df_sub['monthyear'].apply(lambda x: int(x.split('-')[1]))

pd_df_sub = pd_df_sub.sort_values(['year', 'month']).reset_index(drop=True)

# Add question ratio as new column
pd_df_sub['pct_questions'] = pd_df_sub['count'] / pd_df_sub['total_count'] * 100

# Get question counts for comments
# Question counts
df_question_counts_com = df_time.filter('is_asking_a_question')\
                    .filter(F.col('isSubmission') == False)\
                    .groupBy('monthyear')\
                    .count()

# Submission/comment counts
df_com_counts = df_time.filter(F.col('isSubmission') == False)\
                    .groupBy('monthyear')\
                    .count()\
                    .select(F.col('monthyear').alias('monthyear2'), F.col('count').alias('total_count'))

# Convert to pandas dataframe
pd_df_com = df_question_counts_com.join(
    df_com_counts, 
    df_question_counts_com.monthyear == df_com_counts.monthyear2, 
    'left'
).select('monthyear', 'count', 'total_count').toPandas()

# Reorder pandas dataframe
pd_df_com['month'] = pd_df_com['monthyear'].apply(lambda x: int(x.split('-')[0]))
pd_df_com['year'] = pd_df_com['monthyear'].apply(lambda x: int(x.split('-')[1]))

pd_df_com = pd_df_com.sort_values(['year', 'month']).reset_index(drop=True)

# Add question ratio as new column
pd_df_com['pct_questions'] = pd_df_com['count'] / pd_df_com['total_count'] * 100

# Line plot
plt.figure(figsize=(10, 6))

# Plot aggregated question percent
plt.plot(
    pd_df1['monthyear'],
    pd_df1['pct_questions'],
    color='#204321',
    marker='o',
    linestyle='-',
    label="Combined (all posts)"
)

# Plot submission question percent
plt.plot(
    pd_df_sub['monthyear'],
    pd_df_sub['pct_questions'],
    color='#5E9751',
    marker='o',
    linestyle='dotted',
    label="Submissions"
)

# Plot comment question percent
plt.plot(
    pd_df_com['monthyear'],
    pd_df_com['pct_questions'],
    color='#82B8B5',
    marker='o',
    linestyle='dotted',
    label="Comments"
)

# Annotate
plt.xlabel('Month')
plt.ylabel('Percent of Total Submissions/Comments')
plt.title('Questions as a Percent of Total Submissions/Comments Over Time')
plt.xticks(rotation=45) 
plt.ylim(0, 100)
plt.grid(False)
plt.legend()
plt.tight_layout()

plt.savefig('../../website-source/images/question_pcts.png', dpi=300)

plt.show()
```

<!-- ![](images//question_counts.png) -->

<img src="images/question_pcts.png" style="width: 80%; height: auto;"/>

In the plot above, it's clear from the solid line that the percent of questions out of all comments and submissions (combined) is roughly constant at about 35% over time. This line suggests that the dataset is primarily comprised of Q&A-type discourse, in which roughly 35% of posts ask questions and the other 65% of posts include (but may not be limited to) direct answers to these questions. This follows the general trend of most Q&A-based Reddit communities, where a user asks a question in the form of a submission, and usually *several* users respond in the form of comments.

The two dotted lines in the plot seem to confirm this pattern, showing that only about 30% of comments pose questions (presumably follow-up questions to the submission), while the remainder of comments likely express statements of fact or opinion in response to the question posed in the submission. Although the percentage of comments posing questions remains constant over time, the plot shows that the percentage of submissions that pose questions is more volatile. For most of the period, roughly 60% of all submissions in the data pose questions, but from the fall of 2021 into the spring of 2022 that share drops to about 40%. Based on the plot in Topic 1, this change in question percentage does not clearly coincide with a change in submission frequency, which suggests that the number of submissions stayed constant over this stretch while users simply posted fewer questions. This temporary drop and recovery could be due to a number of factors, such as hiring freezes following the COVID-19 hiring surge, changes in subreddit moderator activity, or simply random variation associated with smaller sample sizes (there are roughly 500 submissions per month, while comments per month are consistently in the thousands). Regardless of the reason, the plot shows that, by and large, the percentage of questions over time is generally constant.


#### TOPIC 3: FAANG COMPANY BREAKDOWN

*Goal:* FAANG is an acronym that refers to five of the largest and most influential tech companies in the world: Facebook, Apple, Amazon, Netflix, and Google. Because these companies hire large numbers of skilled data scientists and analysts, examine how frequently each one is mentioned in posts and comments so that individuals can gauge its prevalence in discussions about job interviews in the tech industry. Then use an external dataset containing salary information for each FAANG company to explore how public interest in these companies relates to compensation.

The bar plot below was created using the new variable built from the regex search is_FAANG_comments. All comments were searched, and those that mention at least one of the FAANG companies (Facebook, Amazon, Apple, Netflix, Google) were identified. For this plot, the aim was to compare the frequency of comments that mention FAANG against the frequency of comments that do not mention any of these companies, so a bar plot was chosen to show how many comments fall into each group. As expected, there are significantly more comments that do not mention any of the FAANG companies; in fact, the comments that do include FAANG mentions make up less than 4% of the comments dataset.
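
As a point of reference, the tagging step could look roughly like the minimal sketch below; the pattern shown and the `body` column name are assumptions, and the actual regex used for is_FAANG_comments is documented in the Data Tagging section above.

```{python}
#| eval: False
# Hypothetical sketch: flag comments that mention any FAANG company (case-insensitive)
import pyspark.sql.functions as F

faang_pattern = r"(?i)\b(facebook|amazon|apple|netflix|google)\b"

df = df.withColumn(
    "is_FAANG_comments",
    F.col("body").rlike(faang_pattern)
)
```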

```{python}
#| eval: False
# Plot 3
# make a copy of data
df_faang = df
#select columns of  interest
df_faang = df_faang.select("link_id", "body", "is_tech_comments", "is_asking_for_help_comments","is_asking_a_question_comments", "is_FAANG_comments")
# group by 'is_FAANG_comments' and count each group
faang_comments_freq = df_faang.groupBy("is_FAANG_comments").count()
# collect the data to the driver
faang_comments_data = faang_comments_freq.collect()
# prepare the data for plotting
labels = [str(row['is_FAANG_comments']) for row in faang_comments_data]  # Convert boolean to string for labels
counts = [row['count'] for row in faang_comments_data]
# plot the data using matplotlib
fig, ax = plt.subplots(dpi=300)
bars = ax.bar(labels, counts, color=['#40B389', '#4A80EB'])
# add title and labels to the plot
ax.set_title('Frequency of FAANG Comments')
ax.set_xlabel('Has a Mention of FAANG')
ax.set_ylabel('Frequency')
# set the x-ticks to correspond to the labels
ax.set_xticks([0, 1], labels)
# annotate the height of each bar
for bar in bars:
    height = bar.get_height()
    ax.annotate('{}'.format(height),
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 1),  # 1 point vertical offset
                textcoords="offset points",
                ha='center', va='bottom')
# show the plot
plt.show()
```

<!-- ![](images/faang_frequency.png) -->
<img src="images/faang_frequency.png" style="width: 70%; height: auto;"/>


The following plot is a direct extension of the previous one. Here the focus is on all comments tagged as true for is_FAANG_comments, with the goal of observing how mention frequency is divided across the FAANG companies. Therefore, five bars were created, one for each company (Facebook, Amazon, Apple, Netflix, Google), plus an additional bar for comments mentioning multiple companies. The Multiple variable accounts for all cases where two or more of the FAANG companies were mentioned in a single comment. From the visual, Google is mentioned significantly more often than the other companies, while Netflix is hardly ever mentioned. One hypothesis is that Netflix, as a streaming platform, requires a very specific type of skills, while the skills sought by the other companies are more general.

```{python}
#| eval: False
# Plot 4
from pyspark.sql.functions import col, when, lower
import pyspark.sql.functions as F

#get only rows where is_FAANG_comments == True
df_FAANG_True = df_faang.filter(df_faang.is_FAANG_comments == True)
#select columns of interest
df_FAANG_True = df_FAANG_True.select("link_id","body","is_FAANG_comments")
#make all text lowercase
df_FAANG_True= df_FAANG_True.withColumn("body", lower(col("body")))
#add one boolean column per company, plus a flag for comments mentioning multiple companies
df_new_faang = (
    df_FAANG_True
    .withColumn("Facebook", when(col("body").like("%facebook%"), True).otherwise(False))
    .withColumn("Amazon", when(col("body").like("%amazon%"), True).otherwise(False))
    .withColumn("Apple", when(col("body").like("%apple%"), True).otherwise(False))
    .withColumn("Netflix", when(col("body").like("%netflix%"), True).otherwise(False))
    .withColumn("Google", when(col("body").like("%google%"), True).otherwise(False))
    .withColumn(
        "Multiple",
        when(
            (col("body").like("%facebook%") & col("body").like("%amazon%")) |
            (col("body").like("%facebook%") & col("body").like("%apple%")) |
            (col("body").like("%facebook%") & col("body").like("%netflix%")) |
            (col("body").like("%facebook%") & col("body").like("%google%")) |
            (col("body").like("%amazon%") & col("body").like("%apple%")) |
            (col("body").like("%amazon%") & col("body").like("%netflix%")) |
            (col("body").like("%amazon%") & col("body").like("%google%")) |
            (col("body").like("%apple%") & col("body").like("%netflix%")) |
            (col("body").like("%apple%") & col("body").like("%google%")) |
            (col("body").like("%netflix%") & col("body").like("%google%")),
            True
        ).otherwise(False)
    )
)

faang_sums = df_new_faang.select(
    [F.sum(F.col(c).cast("int")).alias(c) for c in ['Facebook', 'Amazon', 'Apple', 'Netflix', 'Google', 'Multiple']]
).collect()[0].asDict()
#define a list of colors for the bars
colors = ['#5E9751', '#40B389', '#251279', '#6CACED', '#82B8B5', '#4A80EB']
#make sure we have as many colors as we have companies
assert len(colors) == len(faang_sums.keys()), "Number of colors must match the number of companies"
#create a bar plot with the specified colors
plt.figure(dpi=300 )
plt.bar(faang_sums.keys(), faang_sums.values(), color=colors)
#add titles and labels
plt.title('Frequency of FAANG Mentions')
plt.xlabel('Company')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
#add value labels on top of each bar
for i, (key, value) in enumerate(faang_sums.items()):
    plt.text(i, value, str(value), ha='center', va='bottom', color='black')
#show the plot
plt.tight_layout()
plt.show()
```

<!-- ![](images/faang_mentions.png) -->
<img src="images/faang_mentions.png" style="width: 70%; height: auto;"/>


The plot below was created using the frequency of FAANG mentions in the Reddit post comments broken down by each company within "FAANG." Comments that included multiple FAANG companies were excluded from this visualization. The external data set (described in more detail at the end of this page) provided the average yearly compensation and base salary information for each of the 5 companies once it was filtered and cleaned. For more information on the cleaning process of the external data set, please see the last section on this page.

There are several interesting takeaways from the plot. First, there is clearly no direct association between how frequently a company is mentioned in this selection of Reddit comments and its average compensation or average base salary. Netflix is hardly mentioned, yet its average yearly compensation and average base salary are higher than those of the rest of the FAANG companies according to the additional dataset. Interestingly, the two companies mentioned most often in the Reddit comments (Google and Amazon) do not have the highest compensation metrics. Of course, this is only a subset of Reddit comments and the compensation metrics come from an external dataset, so these takeaways should be interpreted with caution.

**NOTE:** The compensation and base salary averages were calculated using data from 2017 to 2021 for each of these companies.

```{python}
#| eval: False
# Plot 5
# Filter out comments with multiple FAANG mentions - keep only comments that mention exactly one FAANG company
from pyspark.sql.functions import col, when, lit
import numpy as np
# Use the per-company flags created above (they live on df_new_faang, not the original df)
df_one_faang = df_new_faang.filter(col("Multiple") == False)
# Check new row count
df_one_faang.count()
# Prep for merge with external by making a company column
df_one_faang = df_one_faang.withColumn("company", 
    when(col("Facebook") == True, lit("Facebook"))
    .when(col("Apple") == True, lit("Apple"))
    .when(col("Amazon") == True, lit("Amazon"))
    .when(col("Google") == True, lit("Google"))
    .when(col("Netflix") == True, lit("Netflix"))
    .otherwise(lit("Other"))
)
# Convert to pandas since small, dont need spark
josh_df_pandas = df_one_faang.select("link_id", "body", "company").toPandas()

# Read in the external data set
all_salary_info = pd.read_csv("../data/external_data/all_company_salary_info_CLEAN.csv")
faang_salary_info = pd.read_csv("../data/external_data/faang_salary_info_CLEAN.csv")
# Drop index column
faang_salary_info = faang_salary_info.drop(["Unnamed: 0"], axis = 1)
# Round float values
faang_salary_info = faang_salary_info.apply(lambda x: round(x, 2) if x.dtype == 'float64' else x)
# Average tech and non-tech data
faang_df = faang_salary_info.groupby(["company"]).agg({
    "average_yearly_compensation": "mean", "average_years_experience": "mean",
    "average_years_at_company": "mean", "average_base_salary": "mean",
    "average_stock_grant_value": "mean", "average_bonus": "mean"}).reset_index()

merged_df = josh_df_pandas.merge(faang_df, on = "company", how = "left")

company_counts = merged_df["company"].value_counts()
# Reindex the groupby results to company_counts order so the salary bars line up with the tick labels
average_compensation = merged_df.groupby("company")["average_yearly_compensation"].mean().reindex(company_counts.index)
average_base = merged_df.groupby("company")["average_base_salary"].mean().reindex(company_counts.index)
average_experience = merged_df.groupby("company")["average_years_experience"].mean().reindex(company_counts.index)
average_stock_grant = merged_df.groupby("company")["average_stock_grant_value"].mean().reindex(company_counts.index)

fig, ax1 = plt.subplots(figsize = (10, 6))
ax2 = ax1.twinx()

# Set the bar width and positions
bar_width = 0.2
x = np.arange(len(company_counts.index))

ax1.bar(x - bar_width/3*2, company_counts, bar_width, label='Mention Frequency', alpha=1, color='#5164A1')
ax2.bar(x + bar_width/3, average_compensation, bar_width, label='Average Yearly Compensation', alpha=1, color='#82B8B5')
ax2.bar(x + bar_width/3*4, average_base, bar_width, label='Average Base Salary', alpha=1, color='#5E9751')

# Set labels
ax1.set_xticks(x)
ax1.set_xticklabels(company_counts.index, rotation=0)

ax1.set_ylabel('Mention Frequency', color='black')
ax2.set_ylabel('Average Yearly Compensation/Average Base Salary', color='black')

# Combine legend and put outside of plot
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines + lines2, labels + labels2, loc='upper left', bbox_to_anchor=(0.27, 1))

# Set title and footer
plt.title('Mention Frequency vs. Average Yearly Compensation and Base Salary by FAANG Company')
plt.figtext(0.1, 0.01, 'Note: Comments and submissions mentioning multiple FAANG companies were not included.', fontsize=10, color='black')

# Save figure
plt.savefig('../data/imgs/faang_salary_breakdown.png', dpi = 300)

plt.show()
```

<!-- ![](images//faang_salary_breakdown.png) -->
<img src="images/faang_salary_breakdown.png" style="width: 80%; height: auto;"/>