Reddit Interview Insights

Contents

  • Natural Language Processing
    • 1) External Data Set
      • Cleaning the External Data Set
      • Preparing the Reddit Data for Merging
      • Data Quality Checks
      • Topic 4: Market Salary Trend Analysis
        • Top 20 Companies Mentioned with Compensation Comparison
    • 2) Conducting Natural Language Processing Work
      • Topic 5: NLP Trends for Reddit Interview Content
        • Most Common Words Overall
        • Distribution of Text Lengths
        • Important Words According to TF-IDF
        • Cleaning the Text Using 5+ NLP Techniques
      • Topic 6: Identification of Most Important Skills for Technical Interviewing
        • Creating New Dummy Variables
    • 3) Building a Sentiment Model
      • Topic 7: Sentiment Analysis of Interview Posts & Comments
      • Tables of Summary Statistics
      • Full Dataframe Sentiment Counts
        • Sentiment Counts for is_tech Column
        • Sentiment Counts for is_asking_for_help Column
        • Sentiment Counts for is_asking_a_question Column
        • Sentiment Counts for is_FAANG Column
        • Plot of Summary Statistics

Natural Language Processing

This webpage focuses on using Natural Language Processing (NLP) techniques to dive into the world of Reddit and uncover insights related to interview subreddits. Main accomplishments include identifying and analyzing the significant themes and sentiments within the data. Using NLP techniques, the most commonly discussed topics were extracted, providing a clear understanding of the dominant conversations surrounding technical interview attributes. In addition, external data made it possible to extract financial insights related to the types of companies seeking job applicants. These efforts support the business goals for the target audience: job seekers and students.

Furthermore, sentiment analysis was employed to provide a perspective on the tone of the Reddit discussions. This aspect of the project was particularly enlightening, as it revealed the mood and attitudes of subreddit participants towards certain interview aspects. These insights are especially valuable for helping students prepare for technical interviews by highlighting common concerns and the overall mood surrounding the interview process. The integration of NLP in bridging data and decision making showcases how technology can elevate business intelligence and enhance student engagement.

1) External Data Set

Cleaning the External Data Set

External Data Source: https://github.com/yuenherny/um-wqd7003-ds-salary-analysis/tree/main

This data set was assembled by the author of the GitHub repository above from two sources: the LEVELS.FYI API and a Kaggle dataset on data science and STEM salaries (see the README in the repository linked above). The original data set contains yearly compensation (total, base salary, bonuses, and stock grants), location, years of experience, and years at company for STEM and data science roles at over 1,500 companies. The data ranges from 2017 to 2021.

For the project’s purposes, the cleaned data set averages yearly compensation, base salary, bonus, years of experience, and years at company for each company and role type (tech or non-tech). The role type column was created by classifying each title as either “tech” or “non tech”; for example, a software engineer is classified as “tech”, while sales is classified as “non tech”. The data was then filtered to keep just the tech roles. Below is the code used to clean the external data set, along with a preview of the resulting summary table.

Code
# Import libraries
import os
import openpyxl
import pandas as pd
import numpy as np
# Read in the data
company_names = pd.read_csv("../data/external_data/company_name_matching_RAW.csv")
salaries = pd.read_excel("../data/external_data/levels_fyi_salary_data_RAW.xlsx", sheet_name = "data")

# Subset columns
salaries_subset = salaries[["timestamp", "company", "title", "totalyearlycompensation", "location", 
                            "yearsofexperience", "yearsatcompany", "basesalary", "stockgrantvalue", "bonus"]]

# Add a year column and drop timestamp column
salaries_subset["year"] = pd.to_datetime(salaries_subset["timestamp"]).dt.year
salaries_subset = salaries_subset.drop(["timestamp"], axis = 1)

# Change data types
data_types = {"company": str, "title": str, "totalyearlycompensation": float, "location": str, "yearsofexperience": float, 
    "yearsatcompany": float, "basesalary": float, "stockgrantvalue": float, "bonus": float, "year": int}
salaries_subset = salaries_subset.astype(data_types)
salaries_subset["year"] = pd.to_datetime(salaries_subset["year"], format = "%Y")

# Create clean company name matching
company_mapping = dict(zip(company_names["Variant"].str.strip(), company_names["Clean Company"]))

# Make a new column with the clean company names
salaries_subset = salaries_subset.rename(columns = {"company": "old_company"})
salaries_subset["company"] = salaries_subset["old_company"].map(company_mapping)
salaries_subset = salaries_subset.drop(["old_company"], axis = 1)

# Categorize available titles as "tech" or "non tech"
role_type_dict = {'Product Manager': 'tech', 'Software Engineer': 'tech',
       'Software Engineering Manager': 'tech', 'Data Scientist': 'tech',
       'Solution Architect': 'tech', 'Technical Program Manager': 'tech',
       'Human Resources': 'non tech', 'Product Designer': 'tech', 'Marketing': 'non tech',
       'Business Analyst': 'tech', 'Hardware Engineer': 'tech', 'Sales': 'non tech', 
       'Recruiter': 'non tech', 'Mechanical Engineer': 'tech', 'Management Consultant': 'non tech'}

salaries_subset["role_type"] = salaries_subset["title"].map(role_type_dict)

# Rearrange columns
salaries_subset = salaries_subset[["company", "year", "role_type", "title", "totalyearlycompensation", "location", "yearsofexperience",
                                    "yearsatcompany", "basesalary", "stockgrantvalue", "bonus"]]
# Get a summary table of averages by company and role type
summary_df = salaries_subset.groupby(["company", "role_type"]).agg({
    "totalyearlycompensation": "mean", "yearsofexperience": "mean",
    "yearsatcompany": "mean", "basesalary": "mean",
    "stockgrantvalue": "mean", "bonus": "mean"}).reset_index()
summary_df.columns = ["company", "role_type", "average_yearly_compensation", "average_years_experience", "average_years_at_company", 
                        "average_base_salary", "average_stock_grant_value", "average_bonus"]

# Subset for just FAANG (Facebook, Amazon, Apple, Netflix, Google)
# faang_df = summary_df[(summary_df["company"] == "Facebook") | (summary_df["company"] == "Amazon") | (summary_df["company"] == "Apple") | (summary_df["company"] == "Netflix") | (summary_df["company"] == "Google")]

summary_df.head(10)
| company | role type | average yearly compensation | average years experience | average years at company | average base salary | average stock grant value | average bonus |
|---|---|---|---|---|---|---|---|
| 10x Genomics | Tech | 193333.333333 | 3.166667 | 0.666667 | 149666.666667 | 27666.666667 | 15166.666667 |
| 23andMe | Non Tech | 160000.000000 | 10.000000 | 4.000000 | 160000.000000 | 0.000000 | 0.000000 |
| 23andMe | Tech | 152333.333333 | 2.000000 | 0.333333 | 109166.666667 | 19833.333333 | 0.000000 |

The plots below depict the distribution of the six numeric variables in the external data set: average yearly compensation, average base salary, average years spent at the company, average years of experience, average stock grant value, and average bonus. From this point forward, the focus is on “tech” roles only; hence, the figures show the distribution of each metric for tech roles.
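
These distribution figures are rendered as images on the site; below is a minimal sketch of how they could be produced from the cleaned summary table, assuming summary_df from the previous chunk (the published figures may have been generated with different styling).

Code
import matplotlib.pyplot as plt

# Focus on tech roles only, as described above
tech_only = summary_df[summary_df["role_type"] == "tech"]

metrics = ["average_yearly_compensation", "average_base_salary", "average_years_at_company",
           "average_years_experience", "average_stock_grant_value", "average_bonus"]

# One histogram per metric
fig, axes = plt.subplots(2, 3, figsize = (15, 8))
for ax, metric in zip(axes.flatten(), metrics):
    ax.hist(tech_only[metric].dropna(), bins = 30, color = "#251279")
    ax.set_title(metric.replace("_", " ").title())
plt.tight_layout()
plt.show()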

In the EDA portion, the data set was tagged based on whether or not a FAANG company was mentioned (FAANG includes Facebook, Apple, Amazon, Netflix, and Google). To take it further, all of the companies present in the external data set were searched for and tagged accordingly. The two data sets were then merged based on the company mentioned in each row. The merge did not apply to cases where multiple companies were mentioned in a post or comment or where no company was mentioned (null). Below is the code used to tag the data set with additional companies and merge it with the external data set.

Preparing the Reddit Data for Merging

Code
# Import packages
import time
import sagemaker
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import matplotlib.pyplot as plt
import pyspark.sql.functions as F
import pandas as pd
import datetime
from pyspark.sql import SparkSession
from s3fs import S3FileSystem
from sagemaker.spark.processing import PySparkProcessor
from pyspark.sql.functions import concat, lit, lower, col, when, array, expr, count, isnan, length, udf
from pyspark.sql.types import ArrayType, StringType

# Build Spark session
spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

print(spark.version)

# Read in the filtered submissions
%%time
session = sagemaker.Session()
bucket = "sagemaker-us-east-1-158894274492"
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")

# Read in the filtered comments
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")

# Read in the external data set
all_salary_info = pd.read_csv("../data/external_data/all_company_salary_info_CLEAN.csv")
# Drop index column
all_salary_info = all_salary_info.drop(["Unnamed: 0"], axis = 1)
# Round float values
all_salary_info = all_salary_info.apply(lambda x: round(x, 2) if x.dtype == 'float64' else x)
# Get just the tech salaries
tech_df = all_salary_info[all_salary_info["role_type"] == "tech"]

# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making a "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))

## TAGGING SUBMISSIONS
keywords_is_tech = ["tech", "technology", "code", "coding", "computer", "data", "python","java", "javascript", "sql", "database", "statistics", "azure", "cloud computing", "progamming", "leetcode", "saas", "faang"]
pattern = "(?i)" + "|(?i)".join(keywords_is_tech)
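# The joined pattern looks like "(?i)tech|(?i)technology|..." so each keyword matches case-insensitively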

submissions_tagged = submissions_subset_col\
                        .withColumn('is_tech_submissions', col('searchable_text').rlike(pattern))

keywords_help = ["help me", "struggle", "struggling", "anyone", "advice", "tips", "help"]
pattern = "(?i)" + "|(?i)".join(keywords_help)

submissions_tagged = submissions_tagged\
                        .withColumn('is_asking_for_help_submissions', col('searchable_text').rlike(pattern))

submissions_tagged = submissions_tagged\
                        .withColumn('is_asking_a_question_submissions', col('searchable_text').rlike(r'(\?)+'))

keywords_faang = ["facebook", "amazon", "apple", "netflix", "google"]
pattern = "(?i)" + "|(?i)".join(keywords_faang)

submissions_tagged = submissions_tagged\
                        .withColumn('is_FAANG_submissions', col('searchable_text').rlike(pattern))\
                        .cache()

# Store companies mentioned in Reddit text from companies present in the external data set
companies = list(tech_df.company.unique())
# Make a udf to get mentioned companies 
@udf(ArrayType(StringType()))
def get_company_mentions(text):
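    # Pad each name with spaces so short company names (e.g. "Gap") do not match inside longer words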
    return [company for company in companies if " " + company.lower() + " " in text.lower()]
submissions_tagged = submissions_tagged.withColumn("company_mentions", get_company_mentions(col("searchable_text")))

# Make another UDF for the final tagged column (None if no company found, "Multiple" if 2+ found, else the single company mentioned)
@udf(StringType())
def clean_company_mentions(company_list):
    if not company_list:
        return None
    elif len(company_list) > 1:
        return "Multiple"
    else:
        return company_list[0]
submissions_tagged = submissions_tagged.withColumn("final_company_mentioned", clean_company_mentions(col("company_mentions")))

## TAGGING COMMENTS
keywords_is_tech = ["tech", "technology", "code", "coding", "computer", "data", "python","java", "javascript", "sql", "database", "statistics", "azure", "cloud computing", "progamming", "leetcode", "saas", "faang"]
pattern = "(?i)" + "|(?i)".join(keywords_is_tech)

comments_tagged = comments_subset_col\
                        .withColumn('is_tech_comments', col('body').rlike(pattern))

keywords_help = ["help me", "struggle", "struggling", "anyone", "advice", "tips", "help"]
pattern = "(?i)" + "|(?i)".join(keywords_help)

comments_tagged = comments_tagged\
                        .withColumn('is_asking_for_help_comments', col('body').rlike(pattern))

comments_tagged = comments_tagged\
                        .withColumn('is_asking_a_question_comments', col('body').rlike(r'(\?)+'))

keywords_faang = ["facebook", "amazon", "apple", "netflix", "google"]
pattern = "(?i)" + "|(?i)".join(keywords_faang)

comments_tagged = comments_tagged\
                        .withColumn('is_FAANG_comments', col('body').rlike(pattern))\
                        .cache()

# Store companies mentioned in Reddit text from companies present in the external data set
companies = list(tech_df.company.unique())
# Make a udf to get mentioned companies 
@udf(ArrayType(StringType()))
def get_company_mentions(text):
    return [company for company in companies if " " + company.lower() + " " in text.lower()]
comments_tagged = comments_tagged.withColumn("company_mentions", get_company_mentions(col("body")))

# Make another UDF for the final tagged column (None if no company found, "Multiple" if 2+ found, else the single company mentioned)
@udf(StringType())
def clean_company_mentions(company_list):
    if not company_list:
        return None
    elif len(company_list) > 1:
        return "Multiple"
    else:
        return company_list[0]
comments_tagged = comments_tagged.withColumn("final_company_mentioned", clean_company_mentions(col("company_mentions")))

## JOIN COMMENTS AND SUBMISSIONS
# Rename some columns for this reformatting
comments_tagged = (comments_tagged
                   .withColumnRenamed("company_mentions", "company_mentions_comments")
                   .withColumnRenamed("final_company_mentioned", "final_company_mentioned_comments"))
submissions_tagged = (submissions_tagged
                   .withColumnRenamed("company_mentions", "company_mentions_submissions")
                   .withColumnRenamed("final_company_mentioned", "final_company_mentioned_submissions"))


# Clean comments link_id "t3_#####" to match submissions id "#####"
comments_new_id = comments_tagged.withColumn(
    'link_id',
    expr("substring(link_id, 4, length(link_id)-3)")
)

# Join comments with submissions
df = comments_new_id.join(submissions_tagged, comments_new_id['link_id'] == submissions_tagged['id'], 'outer')

# Reformatting

# Separate into two dataframes (comments and submissions)
df_submissions = df.filter(df.created_utc_submission.isNotNull())\
                    .select(
                        F.col('id').alias('link_id'), 
                        F.col('created_utc_submission').alias('created_utc'),
                        F.col('searchable_text').alias('body'),
                        F.col('is_tech_submissions').alias('is_tech'),
                        F.col('is_asking_for_help_submissions').alias('is_asking_for_help'),
                        F.col('is_asking_a_question_submissions').alias('is_asking_a_question'),
                        F.col('is_FAANG_submissions').alias('is_FAANG'),
                        F.col('company_mentions_submissions').alias('company_mentions'),
                        F.col('final_company_mentioned_submissions').alias('final_company_mentioned'),
                        F.lit(True).alias('isSubmission')
                    )\
                    .distinct()

df_comments = df.filter(df.created_utc_comment.isNotNull())\
                    .select(
                        'link_id', 
                        F.col('created_utc_comment').alias('created_utc'),
                        'body',
                        F.col('is_tech_comments').alias('is_tech'),
                        F.col('is_asking_for_help_comments').alias('is_asking_for_help'),
                        F.col('is_asking_a_question_comments').alias('is_asking_a_question'),
                        F.col('is_FAANG_comments').alias('is_FAANG'),
                        F.col('company_mentions_comments').alias('company_mentions'),
                        F.col('final_company_mentioned_comments').alias('final_company_mentioned'),
                        F.lit(False).alias('isSubmission')
                    )\
                    .distinct()

# Concatenate comments and submissions
df_clean = df_submissions.union(df_comments)

## MERGE WITH EXTERNAL DATA SET
# Rename for merge
df_clean = (df_clean
            .withColumnRenamed("final_company_mentioned", "company"))
# Convert external to spark data frame
tech_spark_df = spark.createDataFrame(tech_df)
# Join on final_company_mentioned column
merged_df = df_clean.join(tech_spark_df, on = "company", how = "left")
| id | comment timestamp | body | technical comment | asking for help? | asking a question? | FAANG comment | company mentions |
|---|---|---|---|---|---|---|---|
| wghtce | 2022-08-05 01:11:58 | Applying for Simi… | False | False | False | False | [] |
| sbtver | 2022-01-24 19:13:22 | Leetcode Premium … | True | False | False | False | [] |
| ue9ggm | 2022-04-29 00:32:14 | Explanation of Le… | True | False | False | False | [] |

(table continued)

| submission | role type | average yearly compensation | average years experience | average years at company | average base salary | average stock grant value | average bonus |
|---|---|---|---|---|---|---|---|
| True | Null | Null | Null | Null | Null | Null | Null |
| True | Null | Null | Null | Null | Null | Null | Null |
| True | Null | Null | Null | Null | Null | Null | Null |

Data Quality Checks

At this point, the data has been cleaned and new variables have been created. This next section aims to perform quality checks on said dataset. Being aware of key dataset characteristics like schema, overall structure, and missing values helps in detecting any inconsistencies, errors, or gaps that could influence subsequent analyses.
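
The schema shown next comes from a standard Spark call; a minimal sketch, assuming the merged Spark DataFrame is named merged_df as above:

Code
# Print the schema of the merged, tagged dataframe
merged_df.printSchema()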

Schema:

root
 |-- company: string (nullable = true)
 |-- link_id: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- body: string (nullable = true)
 |-- is_tech: boolean (nullable = true)
 |-- is_asking_for_help: boolean (nullable = true)
 |-- is_asking_a_question: boolean (nullable = true)
 |-- is_FAANG: boolean (nullable = true)
 |-- company_mentions: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- isSubmission: boolean (nullable = false)
 |-- role_type: string (nullable = true)
 |-- average_yearly_compensation: double (nullable = true)
 |-- average_years_experience: double (nullable = true)
 |-- average_years_at_company: double (nullable = true)
 |-- average_base_salary: double (nullable = true)
 |-- average_stock_grant_value: double (nullable = true)
 |-- average_bonus: double (nullable = true)

Code
# Print number of rows
merged_df.count()

Number of Rows: 128339

Code
# Check for nulls
merged_df.select([count(when(col(c).isNull(), c)).alias(c) for c in merged_df.columns]).show()

Null Count by Column:

column number of nulls
company 8111
body 8111
is_tech 8111
is_asking_for_help 8111
is_asking_a_question 8111
is_FAANG 934
company_mentions 934
isSubmission 934
role_type 934
average_yearly_compensation 934
average_years_experience 934

Most of the company salary information is null in the data set after the merge, which is expected: many posts and comments mention no companies at all, some mention multiple companies, and some mention companies not listed in the external data set. For the visualizations below, the nulls were removed to create a smaller, less sparse data set.

Topic 4: Market Salary Trend Analysis

Goal: Connect salary information on various firms with the firms discussed in the Reddit submissions and comments. Identify the firms that offer the highest salaries and contrast them with the firms that are discussed most frequently in the context of job hunting and interviewing. Also devote particular attention to the firms represented by the FAANG acronym to build off prior examination of these firms from the EDA stage.

Below are two bar plots showing the distribution of submission counts and comment counts by company. Rows tagged as null, “Other”, or “Multiple” were removed to isolate the submissions and comments that mention exactly one company. Abnormal company names such as “Here” and “Test” were removed as well. Additionally, companies with five or fewer mentions were removed to keep the plots readable.

Code
# Separate the merged dataframe into submissions and comments
filtered_df = merged_df.filter(merged_df['company'].isNotNull()) # take out nulls
submissions = filtered_df.filter(filtered_df.isSubmission == True)
comments = filtered_df.filter(filtered_df.isSubmission == False)

# Get the frequency of posts and comments by company
submission_freqs = submissions.groupBy('company').count().orderBy('company')
comments_freqs = comments.groupBy('company').count().orderBy('company')

# Convert counts to pandas
submission_freqs = submission_freqs.toPandas()
comments_freqs = comments_freqs.toPandas()

# Filter out "Other" and "None" and "Multiple" to just get individual companies
exclude = ['Other', 'None', 'Multiple', 'Test', 'Here']
submission_freqs = submission_freqs[~submission_freqs['company'].isin(exclude)]
comments_freqs = comments_freqs[~comments_freqs['company'].isin(exclude)]

# Filter out companies with small counts
submission_freqs = submission_freqs[submission_freqs['count'] > 5]
comments_freqs = comments_freqs[comments_freqs['count'] > 5]

# Plotting
plt.figure(figsize = (60, 60))

submission_freqs = submission_freqs.sort_values(by = 'count', ascending = True)
comments_freqs = comments_freqs.sort_values(by = 'count', ascending = True)

plt.subplot(1, 2, 1)
bars = plt.barh(submission_freqs['company'].astype(str), submission_freqs['count'].astype(int), color = '#251279')
plt.xlabel('No. of Submissions', fontsize = 60)
plt.tick_params(axis = 'x', which = 'major', labelsize = 50) 
plt.tick_params(axis = 'y', which = 'major', labelsize = 50) 
plt.title('Number of Submissions by Company', fontsize = 60)
plt.bar_label(bars, labels = submission_freqs['count'].astype(int), padding = 3, fontsize = 50)


plt.subplot(1, 2, 2)
bars = plt.barh(comments_freqs['company'].astype(str), comments_freqs['count'].astype(int), color = '#40B389')
plt.xlabel('No. of Comments', fontsize = 60)
plt.tick_params(axis = 'x', which = 'major', labelsize = 50) 
plt.tick_params(axis = 'y', which = 'major', labelsize = 50) 
plt.title('Number of Comments by Company', fontsize = 60)
plt.bar_label(bars, labels = comments_freqs['count'].astype(int), padding = 3, fontsize = 50)

plt.figtext(0.05, -0.02, 'Note: Companies with less than 5 mentions are not included. Additionally, posts mentioning multiple or "Other" companies are not included.', fontsize = 50, color='black')

plt.tight_layout()

# Save figure
plt.savefig('../../website-source/images/submissions_comments_by_company.png', dpi = 300)

plt.show()

As expected, Amazon and Google are the clear front-runners, surpassing other “FAANG” companies like Apple, Netflix, and Facebook. Medium is an interesting front-runner as well; Medium hosts many informative articles and code samples, which might explain the high number of submissions and comments mentioning it.

Top 20 Companies Mentioned with Compensation Comparison

Below are the total mention counts for the top twenty companies.

Code
exclude = ['Other', 'None', 'Multiple', 'Test', 'Here']
filtered_df = filtered_df[~filtered_df['company'].isin(exclude)]
company_counts = filtered_df.groupBy('company').count()
top_20_companies = company_counts.orderBy(F.col('count').desc()).limit(20)
top_20_companies.show()
company count
Google 1279
Medium 993
Move 967
Amazon 859
NICE 528
Wish 442
Discord 370
zoom 258
Target 253
Reddit 210
Match 175
GitHub 163
Microsoft 125
Indeed 121
HackerRank 112
Gap 109
Button 106
Facebook 98
Wise 92
Intuitive 87

With this list of top twenty companies, the compensation metrics can be compared to see whether any relationships emerge.

Code
# Note: column_list, title_list, x_axis_list, and color_list (one entry per compensation
# metric) are assumed to be defined in an earlier chunk that is not shown here.
import altair as alt

# Aggregate data by company and calculate the mean for each metric
data_by_company = tech_df.groupby('company')[column_list].mean().reset_index()

company_list = ["Google", "Medium", "Move", "Amazon", "NICE", "Wish", "Discord", "zoom", 
"Target", "Reddit", "Match", "GitHub", "Microsoft", "Indeed", "HackerRank", "Gap", 
"Button", "Facebook", "Wise", "Intuitive"]

data_by_company["top20"] = data_by_company["company"].isin(company_list)

top20 = data_by_company[data_by_company["top20"] == True]

ordered_top20 = {
    "average_yearly_compensation": list(top20.sort_values('average_yearly_compensation', ascending = False)['company']),
    "average_years_experience": list(top20.sort_values('average_years_experience', ascending = False)['company']),
    "average_years_at_company": list(top20.sort_values('average_years_at_company', ascending = False)['company']),
    "average_base_salary": list(top20.sort_values('average_base_salary', ascending = False)['company']),
    "average_stock_grant_value": list(top20.sort_values('average_stock_grant_value', ascending = False)['company']),
    "average_bonus": list(top20.sort_values('average_bonus', ascending = False)['company'])
}

# Create bar plots for each metric
bar_charts = []
for i in range(len(title_list)):
    chart = alt.Chart(top20).mark_bar().encode(
        x = alt.X('company', title = 'Company', sort = ordered_top20[column_list[i]]),
        y = alt.Y(column_list[i], title = x_axis_list[i]),
        color = alt.ColorValue(color_list[i])
    ).properties(
        title = f"{title_list[i]} for Top 20 Mentioned Companies",
        width = 400,
        height = 220
    )
    bar_charts.append(chart)

# Concatenate bar charts in a grid layout
final_bars = alt.vconcat(
    alt.hconcat(bar_charts[0], bar_charts[1]),
    alt.hconcat(bar_charts[2], bar_charts[3]),
    alt.hconcat(bar_charts[4], bar_charts[5])
)

FAANG companies like Facebook, Amazon, and Google tend to be in the top half of the twenty most mentioned companies on these compensation metrics. Overall, the charts suggest that a mixture of FAANG and non-FAANG companies offer high compensation. Interestingly, the companies with the highest compensation do not necessarily have the most years of experience or years at the company, suggesting that other factors play a role in compensation packages. The plots also show variability in how long employees stay with these companies, which implies that factors other than compensation may influence tenure.

2) Conducting Natural Language Processing Work

Topic 5: NLP Trends for Reddit Interview Content

Goal: Identify topics within the posts/comments of the interview subreddits to uncover general themes for students to study further. This can highlight what types of subject matter are being discussed the most, whether coding topics or behavioral questions.

From a technical perspective, this portion of the analysis begins with identifying the most common words across the dataset, using techniques such as tokenization, stop word removal, and word count aggregation. Next, the analysis explores the distribution of text lengths, offering a statistical overview of the dataset’s text size. The third step identifies important words using Term Frequency-Inverse Document Frequency (TF-IDF), a method that highlights words that are unique to specific documents within the corpus. Transforming the text with multiple NLP techniques, including URL removal, tokenization, normalization, and stemming, created a cleaner version of the dataset for improved text analysis. Finally, dummy variables based on regex patterns allow texts to be categorized into technical domains such as programming languages and coding libraries.

From a business perspective, the technical analysis in this section accomplishes goals such as insight into key topics and trends as well as data-driven decision making. By identifying the most common words and important terms (via TF-IDF), the analysis uncovers the main themes and topics prevalent in the dataset. This can be crucial for job seekers in understanding prevalent interview trends or job seeker concerns.

Most Common Words Overall

The table below shows that the most common words include “leetcode” and “questions”, which might help steer job applicants toward focusing their study efforts on these subjects.

Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import length, explode, split
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF
from pyspark.sql.functions import col
import matplotlib.pyplot as plt

# tokenize 
tokenizer = Tokenizer(inputCol="body", outputCol="words")
df_words = tokenizer.transform(df)

# remove stop words 
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
df_filtered = remover.transform(df_words)

# explode the array into  words
df_exploded = df_filtered.select(explode(col("filtered_words")).alias("word"))

# filter out single character words
df_filtered_single_char = df_exploded.filter(length(col("word")) > 1)

# count 
df_word_count = df_filtered_single_char.groupBy("word").count()
df_word_count.orderBy(col("count").desc()).show()
word count
Interview 22631
Like 18719
Get 17015
Questions 13818
Time 13330
One 13245
Good 12364
Know 11881
Job 11718
Leetcode 11557
Think 11144
Also 9331
Want 9048
Really 8620
Need 8564
Problems 8560
[deleted] 8453
Problem 8444
Work 8017
First 7521

Distribution of Text Lengths

These statistics confirm that the dataset contains a large number of posts/comments, with an average length of about 234 characters. The standard deviation is quite high, indicating large variability in text length.

Code
# calc length of each text
df = df.withColumn("text_length", length(df["body"]))
df.describe("text_length").show()
summary text length
Count 128339
Mean 233.57049688715043
Standard Deviation 480.0254950591896
Minimum 1
Maximum 38271

Important Words According to TF-IDF

The list of words and their TF-IDF scores provides insight into the topics and concerns that are most prominent in the Reddit posts and comments surrounding technical interviews. The most important term, ‘spaced’, is an unusual observation, as it is unclear why it would be a key concept in technical interview subreddits. Other terms such as ‘interview’, ‘simple’, and ‘subscribe’ are more in line with expectations, as they indicate a focus on interview processes, seeking straightforward advice or resources, and encouragement to follow or subscribe to threads or channels for updates on interview strategies. The presence of emotionally charged words like ‘ugh’ and ‘crushing’ may reflect the anxiety and difficulty of technical interview preparation.
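
Conceptually, each TF-IDF score is a term’s frequency in a document multiplied by its inverse document frequency. A minimal toy computation, assuming the smoothed IDF formula that Spark ML’s IDF estimator uses:

Code
import math

def tf_idf(term_count_in_doc, total_docs, docs_containing_term):
    # term frequency: raw count of the term in this document
    tf = term_count_in_doc
    # smoothed inverse document frequency: log((|D| + 1) / (df + 1))
    idf = math.log((total_docs + 1) / (docs_containing_term + 1))
    return tf * idf

# a term used 3 times but present in almost every document scores low
print(tf_idf(3, 10000, 9500))   # ~0.15
# a term used 3 times but present in only a handful of documents scores high
print(tf_idf(3, 10000, 5))      # ~22.3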

Code
# term frequency
df_sampled = df_filtered.sample(False, 0.1)
cv = CountVectorizer(inputCol="filtered_words", outputCol="raw_features")
cv_model = cv.fit(df_sampled)
df_featurized = cv_model.transform(df_sampled)

# idf
idf = IDF(inputCol="raw_features", outputCol="features")
idf_model = idf.fit(df_featurized)
df_rescaled = idf_model.transform(df_featurized)

tf_idf_pandas_df = df_rescaled.select("features").toPandas()

def extract_top_words(features, vocabulary, n_top_words=3):
    dense_vector = features.toArray()
    top_indices = dense_vector.argsort()[-n_top_words:][::-1]
    top_words = [(vocabulary[i], dense_vector[i]) for i in top_indices]
    return top_words

vocabulary = cv_model.vocabulary

# extract top words for each row
n_top_words = 1 
top_words_list = [extract_top_words(row['features'], vocabulary, n_top_words) for index, row in tf_idf_pandas_df.iterrows()]
top_words_list[35:45]
Word TF-IDF Score
Spaced 13.378421024061316
Bud 8.768652053710493
Pill 8.768652053710493
Interview 8.768652053710493
Simple 9.794902085605205
Ugh 7.670039765042384
Rated 7.6700397650423845
Crushing 7.2645746569342196
[deleted] 2.7537151504173694
Subscribe 7.159214141276394

Cleaning the Text Using 5+ NLP Techniques

In this section, more detailed text-cleaning techniques are applied. The following code uses Spark NLP methods such as tokenizing, normalizing (removing punctuation), removing stop words, and stemming to clean the Reddit content.

Code
# Imports needed for the cleaning pipeline
import re
from pyspark.ml import Pipeline
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer, StopWordsCleaner, Stemmer

# remove https links 
def remove_https_links(text):
    return re.sub(r'https?:\/\/\S+', '', text)

remove_https_links_udf = udf(remove_https_links, StringType())

df_no_url = df.withColumn("body_no_url", remove_https_links_udf("body"))

df_no_url.select('body_no_url').show(4, truncate=False)

# set up spark pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("body_no_url") \
    .setOutputCol("document")

# tokenize
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")


# normalize words to remove unnecessary punctuation
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True) \
    .setCleanupPatterns(["[^A-Za-z0-9 ]"]) 


# remove stop words
stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

# stem words
stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")

# sentence_detector = SentenceDetector() \
#     .setInputCols(["cleanTokens"]) \
#     .setOutputCol("sentence")

finisher = Finisher() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCols(["finished_ngrams"]) \
    .setOutputAsArray(False) \
    .setCleanAnnotations(True)


# Define the pipeline
pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    normalizer,
    stopwords_cleaner,
    stemmer,
    finisher
])
# run the pipeline 
model = pipeline.fit(df_no_url)
clean_df = model.transform(df_no_url)

from pyspark.sql.functions import regexp_replace
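# The Finisher joined the cleaned tokens into a single string delimited by "@", so replace the delimiter with spaces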
clean_df_nlp = clean_df.withColumn("clean_body", regexp_replace("finished_ngrams", "@", " "))
clean_df_nlp.select("clean_body").show(5,truncate=False)

Preview of posts and comments after cleaning:

clean body
messaging 21 days 20230128 065831 utc remind link click link send pm also reminded reduce spam parent commenter delete message hide others info reminders sorry didnt see thank howd get change anything prep strategy

Topic 6: Identification of Most Important Skills for Technical Interviewing

Goal: Hone in on the most important and most frequent skills discussed on Reddit for technical interviews. Explore mention frequency of various programming languages, tools, and libraries. Also identify the most important stages of the technical interview.

Creating New Dummy Variables

To focus on topics such as programming languages, tools, and libraries, new variables are required.

Code
from pyspark.sql.functions import regexp_extract, col

language_pattern = r'(?i)(Python|SQL|SparkSQL|PySpark|Java|Scala|Julia|MATLAB|C\+\+|SAS|NoSQL|Hadoop|Hive|BigQuery)'
library_pattern = r'(?i)(Pandas|Tensorflow|Pytorch|sklearn|Keras|NumPy|SciPy|Matplotlib|Seaborn|Dask|NLTK|Gensim|Scrapy|Statsmodels|XGBoost|LightGBM|Jupyter|FastAI|Spacy|Theano|Yolo)'
tool_pattern = r'(?i)(Tableau|Shiny|Sagemaker|Azure|PowerBI|Databricks|Alteryx|Qlik|Looker|Google Cloud Platform|AWS|Dataiku|Docker|Kubernetes|Airflow|ML)'
stage_pattern = r'(?i)(initial screening|technical round|HR round|final round|offer stage)'

clean_df_nlp_var = clean_df_nlp.withColumn("programming_language_mentioned", (regexp_extract(col("clean_body"), language_pattern, 0) != '').cast('int'))
clean_df_nlp_var = clean_df_nlp_var.withColumn("coding_library_mentioned", (regexp_extract(col("clean_body"), library_pattern, 0) != '').cast('int'))
clean_df_nlp_var = clean_df_nlp_var.withColumn("ds_tool_mentioned", (regexp_extract(col("clean_body"), tool_pattern, 0) != '').cast('int'))
clean_df_nlp_var = clean_df_nlp_var.withColumn("interview_stage_mentioned", (regexp_extract(col("clean_body"), stage_pattern, 0) != '').cast('int'))

from pyspark.sql.functions import col, count, mean, sum

summary_stats = clean_df_nlp_var.agg(
    count("*").alias("total_count"),
    sum(col("programming_language_mentioned")).alias("sum_programming_language_mentioned"),
    sum(col("coding_library_mentioned")).alias("sum_coding_library_mentioned"),
    sum(col("ds_tool_mentioned")).alias("sum_ds_tool_mentioned"),
    sum(col("interview_stage_mentioned")).alias("sum_interview_stage_mentioned")
)

summary_stats.show(truncate=False)

Statistical summary table for regex patterns from new variables:

| total count | sum of programming languages mentioned | sum of coding libraries mentioned | sum of ds tools mentioned | sum of interview stages mentioned |
|---|---|---|---|---|
| 128339 | 5188 | 53 | 1004 | 476 |

Visualization for regex patterns from new variables:

The extraction and visualization of patterns through regex-based categorization, covering elements like programming languages and coding libraries, offers additional guidance. The sunburst visualization, for example, highlights the prevalence of programming languages, especially Python, within the text. This observation can streamline a job applicant’s study plan by emphasizing Python, thereby optimizing interview preparation and enhancing their chances of success.

Code
# filter out row where no programming language is mentioned
df_w_languages = clean_df_nlp_var.filter(col("programming_language_mentioned")==1)
df_w_libraries = clean_df_nlp_var.filter(col("coding_library_mentioned")==1)
df_w_tools = clean_df_nlp_var.filter(col("ds_tool_mentioned")==1)
df_w_stage = clean_df_nlp_var.filter(col("interview_stage_mentioned")==1)

# write function to extract language frequency in each text 
def extract_pattern(pattern,text):
    return re.findall(pattern, text) 

# create udf
extract_pattern_udf = udf(extract_pattern, ArrayType(StringType()))

# add a column with the pattern to the DataFrames
df_w_languages = df_w_languages.withColumn("pattern", lit(language_pattern))
df_w_libraries = df_w_libraries.withColumn("pattern", lit(library_pattern))
df_w_tools = df_w_tools.withColumn("pattern", lit(tool_pattern))
df_w_stage = df_w_stage.withColumn("pattern", lit(stage_pattern))

# apply the UDF 
df_w_languages = df_w_languages.withColumn("mentioned_languages", extract_pattern_udf(col("pattern"), col("clean_body")))
df_w_libraries = df_w_libraries.withColumn("mentioned_libraries", extract_pattern_udf(col("pattern"), col("clean_body")))
df_w_tools = df_w_tools.withColumn("mentioned_tools", extract_pattern_udf(col("pattern"), col("clean_body")))
df_w_stage = df_w_stage.withColumn("mentioned_stages", extract_pattern_udf(col("pattern"), col("clean_body")))

# explode the mentioned languages
exploded_languages = df_w_languages.withColumn("language", explode(col("mentioned_languages")))
exploded_libraries = df_w_libraries.withColumn("library", explode(col("mentioned_libraries")))
exploded_tools = df_w_tools.withColumn("tool", explode(col("mentioned_tools")))
exploded_stages = df_w_stage.withColumn("stage", explode(col("mentioned_stages")))

# group by language and count
language_cts = exploded_languages.groupBy("language").count()
library_cts = exploded_libraries.groupBy("library").count()
tool_cts = exploded_tools.groupBy("tool").count()
stage_cts = exploded_stages.groupBy("stage").count()

# add col describing pattern
language_cts = language_cts.withColumn("pattern_topic", lit("programming language"))
library_cts = library_cts.withColumn("pattern_topic", lit("coding library"))
tool_cts = tool_cts.withColumn("pattern_topic", lit("data science tool"))
stage_cts = stage_cts.withColumn("pattern_topic", lit("interview stage"))

language_cts = language_cts.withColumnRenamed("language", "pattern")
library_cts = library_cts.withColumnRenamed("library", "pattern")
tool_cts = tool_cts.withColumnRenamed("tool", "pattern")
stage_cts = stage_cts.withColumnRenamed("stage", "pattern")

# convert to pandas for future visualization
language_cts_pd = language_cts.toPandas()
library_cts_pd = library_cts.toPandas()
tool_cts_pd = tool_cts.toPandas()
stage_cts_pd = stage_cts.toPandas()

# union pd dataframes for visualizations later
combined_cts_pd = pd.concat([language_cts_pd, library_cts_pd, tool_cts_pd, stage_cts_pd], ignore_index=True)

# visualize
import plotly.express as px
color_map = {
    'programming language': '#163d6b', 
    'coding library': '#204321',
    'data science tool': '#82B8B5' ,
    'interview stage' :'#5E9751'}
    
fig = px.sunburst(
    combined_cts_pd,
    path=['pattern_topic', 'pattern'],
    values='count',
    color='pattern_topic',
    color_discrete_map=color_map, 
    title='Distribution of RegEx Topic Patterns'
)

fig.update_layout(width=650, height=650)
fig.show()

3) Building a Sentiment Model

Topic 7: Sentiment Analysis of Interview Posts & Comments

Goal: Observe the overall sentiment of posts and comments related to the filtered dataset that focuses on interviews. This will help identify how candidates are feeling about the interview process, which can draw insight into a specific employer or topic interview difficulty level.

For the sentiment analysis, the Spark NLP package from John Snow Labs was used (loaded through the spark.jars.packages configuration). The John Snow Labs package includes several pre-trained models; the one best suited for this analysis is sentimentdl_use_twitter. The pipeline consists of four parts. The first part, the documentAssembler, takes the text column (in this case, the ‘body’ column) as input. The second part applies the Universal Sentence Encoder from TensorFlow Hub, using the output of the documentAssembler as its input. The third part is the sentiment deep learning pretrained model, which takes the sentence embeddings as input and applies the John Snow Labs sentimentdl_use_twitter model for English; it outputs a sentiment for each row of the ‘body’ column. The final part is the NLP Pipeline, which chains the previous stages together so the data frame can be fed through them. The pipeline produces a new sentiment column with three options - positive, neutral, and negative. The last step of the sentiment analysis is transforming the output of the pipeline into a results dataframe.

Code
! pip install sparknlp
# Import packages
import json
import sparknlp
import numpy as np
import pandas as pd
import sagemaker
import pyspark.sql.functions as F
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sagemaker.spark.processing import PySparkProcessor
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, SentimentDLModel
from sparknlp.pretrained import PretrainedPipeline

sess = sagemaker.Session()

# Initialize Spark session for Spark NLP
spark = SparkSession.builder\
    .appName("PySparkApp")\
    .master("local[*]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3,org.apache.hadoop:hadoop-aws:3.2.2")\
    .config("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.ContainerCredentialsProvider")\
    .getOrCreate()

%%time
import sagemaker

session = sagemaker.Session()
#s3://sagemaker-us-east-1-131536150362/project/data_reformat_clean_nlp.parquet/
bucket = "tm1450-project"
output_prefix_data_submissions = "data_reformat_clean.parquet/"

s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading tagged data from {s3_path}")
df = spark.read.parquet(s3_path, header=True)
#print(f"shape of the submissions dataframe is {df.count():,}x{len(df.columns)}")
# Define model names and pipeline components
MODEL_NAME = 'sentimentdl_use_twitter'

documentAssembler = DocumentAssembler()\
    .setInputCol("body")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

sentimentdl = SentimentDLModel.pretrained(name=MODEL_NAME, lang="en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

nlpPipeline = Pipeline(
      stages = [
          documentAssembler,
          use,
          sentimentdl
      ])
pipelineModel = nlpPipeline.fit(df)
results = pipelineModel.transform(df)

Schema:

root
 |-- link_id: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- body: string (nullable = true)
 |-- is_tech: boolean (nullable = true)
 |-- is_asking_for_help: boolean (nullable = true)
 |-- is_asking_a_question: boolean (nullable = true)
 |-- is_FAANG: boolean (nullable = true)
 |-- isSubmission: boolean (nullable = true)
 |-- body_no_url: string (nullable = true)
 |-- finished_ngrams: string (nullable = true)
 |-- clean_body: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentiment: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)

Tables of Summary Statistics

Below is a table that provides a glimpse of the data, showing how the sentiment analysis created a new variable that allows further analysis, such as what sentiment looks like for the entire dataset or for specific attributes such as posts/comments that ask for help or mention FAANG companies.

Code
from pyspark.sql.functions import explode, col

# Explode the sentiment array and extract the result field
results = results.withColumn('sentiment', F.explode('sentiment'))
results = results.withColumn('sentiment', F.col('sentiment.result'))

# Selecting columns for the final DataFrame
final_data = results.select(
    'is_tech', 'is_asking_for_help', 'is_asking_a_question', 
    'is_FAANG', 'clean_body', 'sentiment'
)

# Persist and show the DataFrame
final_data.persist()
final_data.show()
| technical reddit | asking for help? | asking a question? | FAANG reddit | clean body | sentiment |
|---|---|---|---|---|---|
| True | False | False | False | messaging 21 days… | Neutral |
| True | False | True | False | sorry didnt see t… | Positive |
| False | False | False | False | howd get | Positive |

Full Dataframe Sentiment Counts

Positive sentiment dominates the full dataset, although the sizable negative count shows that a substantial number of negative experiences and thoughts are also being shared.

Code
sentiment_counts = final_data.groupBy("sentiment").count()
sentiment_counts.show()
sentiment count
Positive 81780
Neutral 4164
Negative 42395

Sentiment Counts for is_tech Column

The is_tech column represents a subset of the data where the focus is on technical aspects of interviews. The sentiment summary for this subset shows positive sentiments far outnumbering negative ones, suggesting that discussions in this category are generally positive. This is surprising, as the technical aspects of interviews are usually the most stressful, so it offers a bit of hope to those starting the long road of technical interview preparation.

Code
tech_data = final_data.filter(col('is_tech') == True)
sentiment_counts = tech_data.groupBy('sentiment').count()
print('Sentiment Counts for is_tech Column')
sentiment_counts.show()
sentiment count
Positive 20776
Neutral 979
Negative 7629

Sentiment Counts for is_asking_for_help Column

The is_asking_for_help column represents a data subset focusing on posts where individuals are seeking assistance. Positive sentiment is more than double the negative sentiment, which may signify a supportive environment for those asking for help, again providing hope for those starting the challenge of preparing for technical interviews.

Code
help_data = final_data.filter(col('is_asking_for_help') == True)
sentiment_counts = help_data.groupBy('sentiment').count()
print('Sentiment Counts for is_asking_for_help Column')
sentiment_counts.show()
sentiment count
Positive 9839
Neutral 346
Negative 4602

Sentiment Counts for is_asking_a_question Column

The table below shows a higher number of positive sentiments than negative ones for posts/comments that ask a question. This might suggest that the Reddit community is usually supportive when individuals ask questions, creating an environment for information exchange and learning. Knowing this may encourage new users seeking technical interview advice to be less afraid of asking questions.
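
These counts can be reproduced with the same pattern used for the other flags; a minimal sketch mirroring the neighboring subsections:

Code
question_data = final_data.filter(col('is_asking_a_question') == True)
sentiment_counts = question_data.groupBy('sentiment').count()
print('Sentiment Counts for is_asking_a_question Column')
sentiment_counts.show()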

sentiment count
Positive 20014
Neutral 867
Negative 13028

Sentiment Counts for is_FAANG Column

The is_FAANG posts/comments, which discuss interviews at FAANG companies (Facebook, Amazon, Apple, Netflix, Google), show that sentiment is more positive than negative. This indicates that conversations about interviews at these top tech companies lean positive, which could be due to successful interview experiences or helpful insights posted by users who have interviewed with these companies.

Code
faang_data = final_data.filter(col('is_FAANG') == True)
sentiment_counts = faang_data.groupBy('sentiment').count()
print('Sentiment Counts for is_FAANG Column')
sentiment_counts.show()
sentiment count
Positive 3488
Neutral 171
Negative 1554

Plot of Summary Statistics

The plot below displays a more succinct summary of the count of posts (submissions and comments) for each sentiment classification broken down by post type. Again, the text content types include:

  1. A post asking for help

  2. A post asking a question

  3. A post mentioning a FAANG company

  4. A technical post

These topics were identified using regex, as described above and in previous parts of the website. In general, for the full dataset, sentiments for submissions and comments are mostly positive: positive submissions and comments are almost double the number with negative sentiment. For texts where the is_tech column is true, the ratio of positive to negative sentiments is higher than for any other column; the opposite is true for rows where the is_asking_for_help column is true. Neutral sentiment is the least common overall and within each of the column_name = True subsets.

Code
import pyspark.sql.functions as F
import pandas as pd

# Get sentiment counts for technical submissions/comments
df_tech_counts = final_data\
    .filter(F.col('is_tech'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Technical'))

# Get sentiment counts for submissions/comments that ask for help
df_help_counts = final_data\
    .filter(F.col('is_asking_for_help'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Asking for help'))

# Get sentiment counts for submissions/comments that ask a question
df_question_counts = final_data\
    .filter(F.col('is_asking_a_question'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Asking question'))

# Get sentiment counts for submissions/comments that mention a FAANG
# company
df_FAANG_counts = final_data\
    .filter(F.col('is_FAANG'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Mentions FAANG'))

# Concatenate the count dfs together and convert to pandas
df_count_summary = df_tech_counts\
    .union(df_help_counts)\
    .union(df_question_counts)\
    .union(df_FAANG_counts)\
    .toPandas()

# Pivot dataframe
df_plot = df_count_summary.pivot(index='category', columns='sentiment', values='count').reset_index()

# Code referenced: https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_colors.html
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(len(df_plot['category']))  # the label locations
width = 0.25  # the width of the bars
multiplier = 0

color_dict = {
    'negative': '#251279',
    'neutral': '#6CACED',
    'positive': '#5E9751'
}

fig, ax = plt.subplots(layout='constrained')

# Cycle through each sentiment label (negative, neutral, positive)
for sentiment_label in df_plot.columns:
    # Skip the category column
    if sentiment_label != 'category':
        offset = width * multiplier
        rects = ax.bar(
            x + offset,
            df_plot[sentiment_label],
            width,
            label=sentiment_label.title(),
            color=color_dict[sentiment_label]
        )
        ax.bar_label(rects, padding=3)
        multiplier += 1

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Number of Submissions/Comments')
ax.set_title('Distribution of Sentiment')
ax.set_xticks(x + width, df_plot['category'])
ax.legend(loc='upper left')
ax.set_ylim(0, 22000)

plt.savefig('../../website-source/images/sentiment_dist.png', dpi=300)
plt.show()

                                    "yearsatcompany", "basesalary", "stockgrantvalue", "bonus"]]
# Get a summary table of averages by company and role type
summary_df = salaries_subset.groupby(["company", "role_type"]).agg({
    "totalyearlycompensation": "mean", "yearsofexperience": "mean",
    "yearsatcompany": "mean", "basesalary": "mean",
    "stockgrantvalue": "mean", "bonus": "mean"}).reset_index()
summary_df.columns = ["company", "role_type", "average_yearly_compensation", "average_years_experience", "average_years_at_company", 
                        "average_base_salary", "average_stock_grant_value", "average_bonus"]

# Subset for just FAANG (Facebook, Amazon, Apple, Netflix, Google)
# faang_df = summary_df[(summary_df["company"] == "Facebook") | (summary_df["company"] == "Amazon") | (summary_df["company"] == "Apple") | (summary_df["company"] == "Netflix") | (summary_df["company"] == "Google")]

summary_df.head(10)
```

The plots below depict the distribution of the six numeric variables in the external data set: average yearly compensation, average base salary, average years spent at the company, average years of experience, average stock grant value, and average bonus. From this point forward, the only focus will be on "tech" roles; hence, the figures below depict the distributions for each metric for the tech roles only.
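The plotting code for these histograms is not reproduced on this page; a minimal sketch of how they could be generated from the `summary_df` built above (an assumed reconstruction, not the original script) is shown below.

```{python}
#| eval: False
# Assumed sketch (not the original plotting code): histograms of the six
# numeric compensation metrics, restricted to tech roles
import matplotlib.pyplot as plt

tech_only = summary_df[summary_df["role_type"] == "tech"]
metrics = ["average_yearly_compensation", "average_base_salary",
           "average_years_at_company", "average_years_experience",
           "average_stock_grant_value", "average_bonus"]

fig, axes = plt.subplots(3, 2, figsize=(12, 14))
for ax, metric in zip(axes.flatten(), metrics):
    ax.hist(tech_only[metric].dropna(), bins=30)
    ax.set_title(metric)
plt.tight_layout()
plt.show()
```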

![](images/external_histograms.png){width="1000" height="1100"}

In the EDA portion, the data set was tagged based on whether or not a FAANG company was mentioned. FAANG includes Facebook, Apple, Amazon, Netflix, and Google. To take it further, all of the companies present in the external data set were searched for and tagged accordingly. Then, the two data sets were merged based on the company mentioned in each row. This did not apply to cases where multiple companies were mentioned in a post or comment or where no company was mentioned (null). Below is the code used to tag the data set with additional companies and merge it with the external data set.

#### Preparing the Reddit Data for Merging

```{python}
#| eval: False

# Import packages
import time
import sagemaker
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import matplotlib.pyplot as plt
import pyspark.sql.functions as F
import pandas as pd
import datetime
from pyspark.sql import SparkSession
from s3fs import S3FileSystem
from sagemaker.spark.processing import PySparkProcessor
from pyspark.sql.functions import concat, lit, lower, col, when, array, expr, count, isnan, length, udf
from pyspark.sql.types import ArrayType, StringType

# Build Spark session
spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

print(spark.version)

# Read in the filtered submissions
%%time
session = sagemaker.Session()
bucket = "sagemaker-us-east-1-158894274492"
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")

# Read in the filtered comments
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")

# Read in the external data set
all_salary_info = pd.read_csv("../data/external_data/all_company_salary_info_CLEAN.csv")
# Drop index column
all_salary_info = all_salary_info.drop(["Unnamed: 0"], axis = 1)
# Round float values
all_salary_info = all_salary_info.apply(lambda x: round(x, 2) if x.dtype == 'float64' else x)
# Get just the tech salaries
tech_df = all_salary_info[all_salary_info["role_type"] == "tech"]

# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making a "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))

## TAGGING SUBMISSIONS
keywords_is_tech = ["tech", "technology", "code", "coding", "computer", "data", "python","java", "javascript", "sql", "database", "statistics", "azure", "cloud computing", "programming", "leetcode", "saas", "faang"]
pattern = "(?i)" + "|(?i)".join(keywords_is_tech)

submissions_tagged = submissions_subset_col\
                        .withColumn('is_tech_submissions', col('searchable_text').rlike(pattern))

keywords_help = ["help me", "struggle", "struggling", "anyone", "advice", "tips", "help"]
pattern = "(?i)" + "|(?i)".join(keywords_help)

submissions_tagged = submissions_tagged\
                        .withColumn('is_asking_for_help_submissions', col('searchable_text').rlike(pattern))

submissions_tagged = submissions_tagged\
                        .withColumn('is_asking_a_question_submissions', col('searchable_text').rlike(r'(\?)+'))

keywords_faang = ["facebook", "amazon", "apple", "netflix", "google"]
pattern = "(?i)" + "|(?i)".join(keywords_faang)

submissions_tagged = submissions_tagged\
                        .withColumn('is_FAANG_submissions', col('searchable_text').rlike(pattern))\
                        .cache()

# Store companies mentioned in Reddit text from companies present in the external data set
companies = list(tech_df.company.unique())
# Make a udf to get mentioned companies 
@udf(ArrayType(StringType()))
def get_company_mentions(text):
    return [company for company in companies if " " + company.lower() + " " in text.lower()]
submissions_tagged = submissions_tagged.withColumn("company_mentions", get_company_mentions(col("searchable_text")))

# Make another UDF for the final tagged column (None if no company is found, "Multiple" if 2+ are found, else the single company mentioned)
@udf(StringType())
def clean_company_mentions(company_list):
    if not company_list:
        return None
    elif len(company_list) > 1:
        return "Multiple"
    else:
        return company_list[0]
submissions_tagged = submissions_tagged.withColumn("final_company_mentioned", clean_company_mentions(col("company_mentions")))

## TAGGING COMMENTS
keywords_is_tech = ["tech", "technology", "code", "coding", "computer", "data", "python","java", "javascript", "sql", "database", "statistics", "azure", "cloud computing", "programming", "leetcode", "saas", "faang"]
pattern = "(?i)" + "|(?i)".join(keywords_is_tech)

comments_tagged = comments_subset_col\
                        .withColumn('is_tech_comments', col('body').rlike(pattern))

keywords_help = ["help me", "struggle", "struggling", "anyone", "advice", "tips", "help"]
pattern = "(?i)" + "|(?i)".join(keywords_help)

comments_tagged = comments_tagged\
                        .withColumn('is_asking_for_help_comments', col('body').rlike(pattern))

comments_tagged = comments_tagged\
                        .withColumn('is_asking_a_question_comments', col('body').rlike(r'(\?)+'))

keywords_faang = ["facebook", "amazon", "apple", "netflix", "google"]
pattern = "(?i)" + "|(?i)".join(keywords_faang)

comments_tagged = comments_tagged\
                        .withColumn('is_FAANG_comments', col('body').rlike(pattern))\
                        .cache()

# Store companies mentioned in Reddit text from companies present in the external data set
companies = list(tech_df.company.unique())
# Make a udf to get mentioned companies 
@udf(ArrayType(StringType()))
def get_company_mentions(text):
    return [company for company in companies if " " + company.lower() + " " in text.lower()]
comments_tagged = comments_tagged.withColumn("company_mentions", get_company_mentions(col("body")))

# Make another UDF for the final tagged column (None if no company is found, "Multiple" if 2+ are found, else the single company mentioned)
@udf(StringType())
def clean_company_mentions(company_list):
    if not company_list:
        return None
    elif len(company_list) > 1:
        return "Multiple"
    else:
        return company_list[0]
comments_tagged = comments_tagged.withColumn("final_company_mentioned", clean_company_mentions(col("company_mentions")))

## JOIN COMMENTS AND SUBMISSIONS
# Rename some columns for this reformatting
comments_tagged = (comments_tagged
                   .withColumnRenamed("company_mentions", "company_mentions_comments")
                   .withColumnRenamed("final_company_mentioned", "final_company_mentioned_comments"))
submissions_tagged = (submissions_tagged
                   .withColumnRenamed("company_mentions", "company_mentions_submissions")
                   .withColumnRenamed("final_company_mentioned", "final_company_mentioned_submissions"))


# Clean comments link_id "t3_#####" to match submissions id "#####"
comments_new_id = comments_tagged.withColumn(
    'link_id',
    expr("substring(link_id, 4, length(link_id)-3)")
)

# Join comments with submissions
df = comments_new_id.join(submissions_tagged, comments_new_id['link_id'] == submissions_tagged['id'], 'outer')

# Reformatting

# Separate into two dataframes (comments and submissions)
df_submissions = df.filter(df.created_utc_submission.isNotNull())\
                    .select(
                        F.col('id').alias('link_id'), 
                        F.col('created_utc_submission').alias('created_utc'),
                        F.col('searchable_text').alias('body'),
                        F.col('is_tech_submissions').alias('is_tech'),
                        F.col('is_asking_for_help_submissions').alias('is_asking_for_help'),
                        F.col('is_asking_a_question_submissions').alias('is_asking_a_question'),
                        F.col('is_FAANG_submissions').alias('is_FAANG'),
                        F.col('company_mentions_submissions').alias('company_mentions'),
                        F.col('final_company_mentioned_submissions').alias('final_company_mentioned'),
                        F.lit(True).alias('isSubmission')
                    )\
                    .distinct()

df_comments = df.filter(df.created_utc_comment.isNotNull())\
                    .select(
                        'link_id', 
                        F.col('created_utc_comment').alias('created_utc'),
                        'body',
                        F.col('is_tech_comments').alias('is_tech'),
                        F.col('is_asking_for_help_comments').alias('is_asking_for_help'),
                        F.col('is_asking_a_question_comments').alias('is_asking_a_question'),
                        F.col('is_FAANG_comments').alias('is_FAANG'),
                        F.col('company_mentions_comments').alias('company_mentions'),
                        F.col('final_company_mentioned_comments').alias('final_company_mentioned'),
                        F.lit(False).alias('isSubmission')
                    )\
                    .distinct()

# Concatenate comments and submissions
df_clean = df_submissions.union(df_comments)

## MERGE WITH EXTERNAL DATA SET
# Rename for merge
df_clean = (df_clean
            .withColumnRenamed("final_company_mentioned", "company"))
# Convert external to spark data frame
tech_spark_df = spark.createDataFrame(tech_df)
# Join on final_company_mentioned column
merged_df = df_clean.join(tech_spark_df, on = "company", how = "left")
```

|      | link_id | created_utc         | body                 | is_tech | is_asking_for_help | is_asking_a\_question | is_FAANG | company_mentions | isSubmission | role_type | average_yearly_compensation | average_years_experience | average_years_at_company | average_base_salary | average_stock_grant_value | average_bonus |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| null | wghtce  | 2022-08-05 01:11:58 | Applying for Simi... | false   | false              | false                 | false    | \[\]             | true         | null      | null                        | null                     | null                     | null                | null                      | null          |
| null | sbtver  | 2022-01-24 19:13:22 | Leetcode Premium ... | true    | false              | false                 | false    | \[\]             | true         | null      | null                        | null                     | null                     | null                | null                      | null          |
| null | ue9ggm  | 2022-04-29 00:32:14 | Explanation of Le... | true    | false              | false                 | false    | \[\]             | true         | null      | null                        | null                     | null                     | null                | null                      | null          |

#### Data Quality Checks
At this point, the data has been cleaned and new variables have been created. This next section aims to perform quality checks on said dataset. Being aware of key dataset characteristics like schema, overall structure, and missing values helps in detecting any inconsistencies, errors, or gaps that could influence subsequent analyses. 
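
The schema shown below is the standard output of Spark's `printSchema()`; a minimal sketch of the call (assuming the `merged_df` produced by the join above):

```{python}
#| eval: False
# Inspect the structure of the merged dataframe (output reproduced below)
merged_df.printSchema()
```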

**Schema:**

root\
\|-- company: string (nullable = true)\
\|-- link_id: string (nullable = true)\
\|-- created_utc: timestamp (nullable = true)\
\|-- body: string (nullable = true)\
\|-- is_tech: boolean (nullable = true)\
\|-- is_asking_for_help: boolean (nullable = true)\
\|-- is_asking_a\_question: boolean (nullable = true)\
\|-- is_FAANG: boolean (nullable = true)\
\|-- company_mentions: array (nullable = true)\
\| \|-- element: string (containsNull = true)\
\|-- isSubmission: boolean (nullable = false)\
\|-- role_type: string (nullable = true)\
\|-- average_yearly_compensation: double (nullable = true)\
\|-- average_years_experience: double (nullable = true)\
\|-- average_years_at_company: double (nullable = true)\
\|-- average_base_salary: double (nullable = true)\
\|-- average_stock_grant_value: double (nullable = true)\
\|-- average_bonus: double (nullable = true)

```{python}
#| eval: False
# Print number of rows
merged_df.count()
```

**Number of Rows:** 128339

```{python}
#| eval: False
# Check for nulls
merged_df.select([count(when(col(c).isNull(), c)).alias(c) for c in merged_df.columns]).show()
```

**Null Count by Column:**

| column                      | number of nulls |
|-----------------------------|-----------------|
| company                     | 8111            |
| body                        | 8111            |
| is_tech                     | 8111            |
| is_asking_for_help          | 8111            |
| is_asking_a\_question       | 8111            |
| is_FAANG                    | 934             |
| company_mentions            | 934             |
| isSubmission                | 934             |
| role_type                   | 934             |
| average_yearly_compensation | 934             |
| average_years_experience    | 934             |

Most of the company salary information is null in the data set after the merge, which is expected. Many submissions and comments mention no companies at all, some mention multiple, and some mention companies not listed in the external data set. For the visualizations below, the nulls were removed to create a smaller, less sparse data set.

#### **TOPIC 4: MARKET SALARY TREND ANALYSIS**

*Goal:* Connect salary information on various firms with the firms discussed in the Reddit submissions and comments. Identify the firms that offer the highest salaries and contrast them with the firms that are discussed most frequently in the context of job hunting and interviewing. Also devote particular attention to the firms represented by the FAANG acronym to build off prior examination of these firms from the EDA stage.

Below are two bar plots showing the distribution of submission counts and comment counts by company. Rows tagged as null, "Other", or "Multiple" were removed to isolate the submissions and comments that mention exactly one company. Abnormal company names such as "Here" and "Test" were removed as well. Additionally, companies with **five or fewer** mentions were removed so the plots did not become too cluttered and unreadable.

```{python}
#| eval: False
# Separate df into submissions and comments
filtered_df = df.filter(df['company'].isNotNull()) # take out Nulls
submissions = filtered_df.filter(filtered_df.isSubmission == True)
comments = filtered_df.filter(filtered_df.isSubmission == False)

# Get the frequency of posts and comments by company
submission_freqs = submissions.groupBy('company').count().orderBy('company')
comments_freqs = comments.groupBy('company').count().orderBy('company')

# Convert counts to pandas
submission_freqs = submission_freqs.toPandas()
comments_freqs = comments_freqs.toPandas()

# Filter out "Other" and "None" and "Multiple" to just get individual companies
exclude = ['Other', 'None', 'Multiple', 'Test', 'Here']
submission_freqs = submission_freqs[~submission_freqs['company'].isin(exclude)]
comments_freqs = comments_freqs[~comments_freqs['company'].isin(exclude)]

# Filter out companies with small counts
submission_freqs = submission_freqs[submission_freqs['count'] > 5]
comments_freqs = comments_freqs[comments_freqs['count'] > 5]

# Plotting
plt.figure(figsize = (60, 60))

submission_freqs = submission_freqs.sort_values(by = 'count', ascending = True)
comments_freqs = comments_freqs.sort_values(by = 'count', ascending = True)

plt.subplot(1, 2, 1)
bars = plt.barh(submission_freqs['company'].astype(str), submission_freqs['count'].astype(int), color = '#251279')
plt.xlabel('No. of Submissions', fontsize = 60)
plt.tick_params(axis = 'x', which = 'major', labelsize = 50) 
plt.tick_params(axis = 'y', which = 'major', labelsize = 50) 
plt.title('Number of Submissions by Company', fontsize = 60)
plt.bar_label(bars, labels = submission_freqs['count'].astype(int), padding = 3, fontsize = 50)


plt.subplot(1, 2, 2)
bars = plt.barh(comments_freqs['company'].astype(str), comments_freqs['count'].astype(int), color = '#40B389')
plt.xlabel('No. of Comments', fontsize = 60)
plt.tick_params(axis = 'x', which = 'major', labelsize = 50) 
plt.tick_params(axis = 'y', which = 'major', labelsize = 50) 
plt.title('Number of Comments by Company', fontsize = 60)
plt.bar_label(bars, labels = comments_freqs['count'].astype(int), padding = 3, fontsize = 50)

plt.figtext(0.05, -0.02, 'Note: Companies with less than 5 mentions are not included. Additionally, posts mentioning multiple or "Other" companies are not included.', fontsize = 50, color='black')

plt.tight_layout()

# Save figure
plt.savefig('../../website-source/images/submissions_comments_by_company.png', dpi = 300)

plt.show()
```

![](images/submissions_comments_by_company.png){width="1150" height="1300"}

As expected, Amazon and Google are the clear front-runners, surpassing other "FAANG" companies like Apple, Netflix, and Facebook. Medium is an interesting front-runner as well; it hosts many informative articles and code samples, which might explain the high count of submissions and comments mentioning it.

##### Top 20 Companies Mentioned with Compensation Comparison

Below are the total mention counts for the top twenty companies.

```{python}
#| eval: False
exclude = ['Other', 'None', 'Multiple', 'Test', 'Here']
filtered_df = filtered_df[~filtered_df['company'].isin(exclude)]
company_counts = filtered_df.groupBy('company').count()
top_20_companies = company_counts.orderBy(F.col('count').desc()).limit(20)
top_20_companies.show()
```

| company    | count |
|------------|-------|
| Google     | 1279  |
| Medium     | 993   |
| Move       | 967   |
| Amazon     | 859   |
| NICE       | 528   |
| Wish       | 442   |
| Discord    | 370   |
| zoom       | 258   |
| Target     | 253   |
| Reddit     | 210   |
| Match      | 175   |
| GitHub     | 163   |
| Microsoft  | 125   |
| Indeed     | 121   |
| HackerRank | 112   |
| Gap        | 109   |
| Button     | 106   |
| Facebook   | 98    |
| Wise       | 92    |
| Intuitive  | 87    |

With this list of the top twenty companies, the compensation metrics can be compared to see if any possible relationships arise.

```{python}
#| eval: False
import altair as alt

# Note: column_list, title_list, x_axis_list, and color_list (the six compensation
# metric column names, their chart titles, axis labels, and colors) are defined in an
# earlier code cell and reused here.

# Aggregate data by company and calculate the mean for each metric
data_by_company = tech_df.groupby('company')[column_list].mean().reset_index()

company_list = ["Google", "Medium", "Move", "Amazon", "NICE", "Wish", "Discord", "zoom", 
"Target", "Reddit", "Match", "GitHub", "Microsoft", "Indeed", "HackerRank", "Gap", 
"Button", "Facebook", "Wise", "Intuitive"]

data_by_company["top10"] = data_by_company["company"].isin(company_list)

top20 = data_by_company[data_by_company["top10"] == True]

ordered_top20 = {
    "average_yearly_compensation": list(top20.sort_values('average_yearly_compensation', ascending = False)['company']),
    "average_years_experience": list(top20.sort_values('average_years_experience', ascending = False)['company']),
    "average_years_at_company": list(top20.sort_values('average_years_at_company', ascending = False)['company']),
    "average_base_salary": list(top20.sort_values('average_base_salary', ascending = False)['company']),
    "average_stock_grant_value": list(top20.sort_values('average_stock_grant_value', ascending = False)['company']),
    "average_bonus": list(top20.sort_values('average_bonus', ascending = False)['company'])
}

# Create bar plots for each metric
bar_charts = []
for i in range(len(title_list)):
    chart = alt.Chart(top20).mark_bar().encode(
        x = alt.X('company', title = 'Company', sort = ordered_top20[column_list[i]]),
        y = alt.Y(column_list[i], title = x_axis_list[i]),
        color = alt.ColorValue(color_list[i])
    ).properties(
        title = f"{title_list[i]} for Top 10 Mentioned Companies",
        width = 400,
        height = 220
    )
    bar_charts.append(chart)

# Concatenate bar charts in a grid layout
final_bars = alt.vconcat(
    alt.hconcat(bar_charts[0], bar_charts[1]),
    alt.hconcat(bar_charts[2], bar_charts[3]),
    alt.hconcat(bar_charts[4], bar_charts[5])
)
```

![](images/top20compensation.png){width="1150" height="1100"}

FAANG companies like Facebook, Amazon, and Google tend to be in the top half of the top twenty most mentioned companies on these compensation metrics. Overall, the charts suggest that a mixture of FAANG and non-FAANG companies offer high compensation. It is interesting that the companies with the highest compensation do not necessarily have the most years of experience or years at the company, suggesting other factors play a role in compensation packages. The plot also shows variability in how long employees stay with these companies, which implies that factors other than compensation may influence tenure.

### 2) Conducting Natural Language Processing Work

#### **Topic 5: NLP Trends for Reddit Interview Content**

*Goal:* Identify topics within the posts/comments of the interview subreddits to uncover general themes for students to study further. This can highlight what types of subject matter are being discussed the most, whether coding topics or behavioral questions.
 

From a technical perspective, this portion of the analysis begins with identifying the most common words across the dataset, using techniques such as tokenization, stop word removal, and word count aggregation. Next, the analysis explores the distribution of text lengths, offering a statistical overview of the dataset's text size. The third step identifies important words using Term Frequency-Inverse Document Frequency (TF-IDF), a method that highlights words that are unique to specific documents within the corpus. Transforming the text data with multiple NLP techniques, including URL removal, tokenization, normalization, and stemming, then creates a cleaner version of the dataset for improved text analysis. Finally, the creation of dummy variables based on regex patterns enables the categorization of texts into technical domains such as programming languages and coding libraries.

From a business perspective, the technical analysis in this section supports goals such as surfacing key topics and trends and enabling data-driven decision making. By identifying the most common words and important terms (via TF-IDF), the analysis uncovers the main themes prevalent in the dataset, which is crucial for job seekers trying to understand prevalent interview trends and common concerns.


##### Most Common Words Overall
The table below shows that the most common terms include "leetcode" and "questions", which might help steer job applicants toward focusing their study efforts on those subjects.

```{python}
#| eval: False
from pyspark.sql import SparkSession
from pyspark.sql.functions import length, explode, split
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF
from pyspark.sql.functions import col
import matplotlib.pyplot as plt

# tokenize 
tokenizer = Tokenizer(inputCol="body", outputCol="words")
df_words = tokenizer.transform(df)

# remove stop words 
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
df_filtered = remover.transform(df_words)

# explode the array into  words
df_exploded = df_filtered.select(explode(col("filtered_words")).alias("word"))

# filter out single character words
df_filtered_single_char = df_exploded.filter(length(col("word")) > 1)

# count 
df_word_count = df_filtered_single_char.groupBy("word").count()
df_word_count.orderBy(col("count").desc()).show()
```

| word | count |
|--------------|-------|
| interview | 22631 |
| like | 18719 |
| get | 17015 |
| questions | 13818 |
| time | 13330 |
| one | 13245 |
| good | 12364 |
| know | 11881 |
| job | 11718 |
| leetcode | 11557 |
| think | 11144 |
| also | 9331 |
| want | 9048 |
| really | 8620 |
| need | 8564 |
| problems | 8560 |
| \[deleted\] | 8453 |
| problem | 8444 |
| work | 8017 |
| first | 7521 |

##### Distribution of Text Lengths
These statistics confirm that the dataset contains a large number of posts/comments, with an average length of about 234 characters. The standard deviation is quite high, indicating large variability in the length of texts; the sketch after the summary table below probes this spread further.

```{python}
#| eval: False
# calc length of each text
df = df.withColumn("text_length", length(df["body"]))
df.describe("text_length").show()

```

| summary | text_length |
|---------|---------------------|
| count | 128339 |
| mean | 233.57049688715043 |
| stddev | 480.0254950591896 |
| min | 1 |
| max | 38271 |
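
One way to probe this spread a bit further (a small sketch using Spark's `approxQuantile`; this was not part of the original analysis) is to pull approximate quartiles of the text length:

```{python}
#| eval: False
# Approximate 25th/50th/75th percentiles of text length (illustrative sketch only)
q25, q50, q75 = df.approxQuantile("text_length", [0.25, 0.5, 0.75], 0.01)
print(f"25th percentile: {q25}, median: {q50}, 75th percentile: {q75}")
```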

##### Important Words According to TF-IDF
The list of words and their TF-IDF scores below provides insight into the topics and concerns that are most prominent in the Reddit posts and comments surrounding technical interviews. The most important term, 'spaced', is an unusual observation, as it is not obvious why it would be a key concept in technical interview subreddits. Other terms such as 'interview', 'simple', and 'subscribe' are more in line with expectations, as they indicate a focus on the interview process, a search for simple advice or interview resources, and encouragement to follow or subscribe to threads or channels for updates on interview strategies. The presence of emotionally charged words like 'ugh' and 'crushing' may reflect the anxiety and difficulty of technical interview preparation.
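
As a reminder of how these scores arise, Spark ML's `CountVectorizer` and `IDF` stages (used in the code below) weight a term $t$ in document $d$ roughly as

$$
\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N + 1}{\text{df}(t) + 1},
$$

where $N$ is the number of documents in the (sampled) corpus and $\text{df}(t)$ is the number of documents containing $t$, so terms concentrated in few documents receive the highest weights.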

```{python}
#| eval: False

# term frequency
df_sampled = df_filtered.sample(False, 0.1)
cv = CountVectorizer(inputCol="filtered_words", outputCol="raw_features")
cv_model = cv.fit(df_sampled)
df_featurized = cv_model.transform(df_sampled)

# idf
idf = IDF(inputCol="raw_features", outputCol="features")
idf_model = idf.fit(df_featurized)
df_rescaled = idf_model.transform(df_featurized)

tf_idf_pandas_df = df_rescaled.select("features").toPandas()

# Return the n_top_words highest-scoring (word, score) pairs for one TF-IDF row
def extract_top_words(features, vocabulary, n_top_words=3):
    # Convert the sparse TF-IDF vector to a dense array
    dense_vector = features.toArray()
    # Indices of the largest scores, in descending order
    top_indices = dense_vector.argsort()[-n_top_words:][::-1]
    top_words = [(vocabulary[i], dense_vector[i]) for i in top_indices]
    return top_words

vocabulary = cv_model.vocabulary

# extract top words for each row
n_top_words = 1 
top_words_list = [extract_top_words(row['features'], vocabulary, n_top_words) for index, row in tf_idf_pandas_df.iterrows()]
top_words_list[35:45]
```

Below is the output for ten sampled posts/comments (rows 35 through 44): the single top word in each, along with its TF-IDF score.

| word | TF-IDF score |
|------|--------------|
| spaced | 13.378421024061316 |
| bud. | 8.768652053710493 |
| pill. | 8.768652053710493 |
| interview | 8.768652053710493 |
| simple | 9.794902085605205 |
| ugh | 7.670039765042384 |
| rated | 7.670039765042384 |
| crushing | 7.2645746569342196 |
| \[deleted\] | 2.7537151504173694 |
| subscribe | 7.159214141276394 |

##### Cleaning the Text Using 5+ NLP Techniques

In this section, more detailed text cleaning techniques are put in place. The following code uses Spark NLP methods such as tokenization, normalization (lowercasing and punctuation removal), stop word removal, and stemming, along with URL removal, to clean the Reddit content.

```{python}
#| eval: False
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer, StopWordsCleaner, Stemmer

# remove https links from the raw text
def remove_https_links(text):
    return re.sub(r'https?:\/\/\S+', '', text)

remove_https_links_udf = udf(remove_https_links, StringType())

df_no_url = df.withColumn("body_no_url", remove_https_links_udf("body"))

df_no_url.select('body_no_url').show(4, truncate=False)

# set up spark pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("body_no_url") \
    .setOutputCol("document")

# tokenize
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")


# normalize words to remove unnecessary punctuation
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True) \
    .setCleanupPatterns(["[^A-Za-z0-9 ]"]) 


# remove stop words
stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

# stem words
stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")

# sentence_detector = SentenceDetector() \
#     .setInputCols(["cleanTokens"]) \
#     .setOutputCol("sentence")

finisher = Finisher() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCols(["finished_ngrams"]) \
    .setOutputAsArray(False) \
    .setCleanAnnotations(True)


# Define the pipeline
pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    normalizer,
    stopwords_cleaner,
    stemmer,
    finisher
])
# run the pipeline 
model = pipeline.fit(df_no_url)
clean_df = model.transform(df_no_url)

from pyspark.sql.functions import regexp_replace
clean_df_nlp = clean_df.withColumn("clean_body", regexp_replace("finished_ngrams", "@", " "))
clean_df_nlp.select("clean_body").show(5,truncate=False)

```

**Preview of posts and comments after cleaning:**

| clean_body |
|------------|
| messaging 21 days 20230128 065831 utc remind link click link send pm also reminded reduce spam parent commenter delete message hide others info reminders sorry didnt see thank howd get change anything prep strategy |

#### **Topic 6: Identification of Most Important Skills for Technical Interviewing**

*Goal:* Hone in on the most important and most frequent skills discussed on Reddit for technical interviews. Explore mention frequency of various programming languages, tools, and libraries. Also identify the most important stages of the technical interview.

##### Creating New Dummy Variables
To focus on topics such as programming languages, tools, and libraries, new variables are required. 

```{python}
#| eval: False
from pyspark.sql.functions import regexp_extract, col

language_pattern = r'(?i)(Python|SQL|SparkSQL|PySpark|Java|Scala|Julia|MATLAB|C\+\+|SAS|NoSQL|Hadoop|Hive|BigQuery)'
library_pattern = r'(?i)(Pandas|Tensorflow|Pytorch|sklearn|Keras|NumPy|SciPy|Matplotlib|Seaborn|Dask|NLTK|Gensim|Scrapy|Statsmodels|XGBoost|LightGBM|Jupyter|FastAI|Spacy|Theano|Yolo)'
tool_pattern = r'(?i)(Tableau|Shiny|Sagemaker|Azure|PowerBI|Databricks|Alteryx|Qlik|Looker|Google Cloud Platform|AWS|Dataiku|Docker|Kubernetes|Airflow|ML)'
stage_pattern = r'(?i)(initial screening|technical round|HR round|final round|offer stage)'

clean_df_nlp_var = clean_df_nlp.withColumn("programming_language_mentioned", (regexp_extract(col("clean_body"), language_pattern, 0) != '').cast('int'))
clean_df_nlp_var = clean_df_nlp_var.withColumn("coding_library_mentioned", (regexp_extract(col("clean_body"), library_pattern, 0) != '').cast('int'))
clean_df_nlp_var = clean_df_nlp_var.withColumn("ds_tool_mentioned", (regexp_extract(col("clean_body"), tool_pattern, 0) != '').cast('int'))
clean_df_nlp_var = clean_df_nlp_var.withColumn("interview_stage_mentioned", (regexp_extract(col("clean_body"), stage_pattern, 0) != '').cast('int'))

from pyspark.sql.functions import col, count, mean, sum

summary_stats = clean_df_nlp_var.agg(
    count("*").alias("total_count"),
    sum(col("programming_language_mentioned")).alias("sum_programming_language_mentioned"),
    sum(col("coding_library_mentioned")).alias("sum_coding_library_mentioned"),
    sum(col("ds_tool_mentioned")).alias("sum_ds_tool_mentioned"),
    sum(col("interview_stage_mentioned")).alias("sum_interview_stage_mentioned")
)

summary_stats.show(truncate=False)
```

**Statistical summary table for regex patterns from new variables:**

+-------------+------------------------------------+------------------------------+-----------------------+-------------------------------+
| total_count | sum_programming_language_mentioned | sum_coding_library_mentioned | sum_ds_tool_mentioned | sum_interview_stage_mentioned |
+-------------+------------------------------------+------------------------------+-----------------------+-------------------------------+
| 128339      | 5188                               | 53                           | 1004                  | 476                           |
+-------------+------------------------------------+------------------------------+-----------------------+-------------------------------+

**Visualization for regex patterns from new variables:**

The extraction and visualization of patterns through regex-based categorization, encompassing elements like programming languages and coding libraries, offers additional guidance. The sunburst visualization, for example, highlights the prevalence of programming languages, especially Python, within the text. This observation can streamline a job applicant's study plan by emphasizing Python, thereby optimizing interview preparation and enhancing their chances of success.
```{python}
#| eval: False
import re
import plotly.express as px
from pyspark.sql.functions import col, explode, lit, udf
from pyspark.sql.types import ArrayType, StringType

# filter out rows where no programming language is mentioned
df_w_languages = clean_df_nlp_var.filter(col("programming_language_mentioned")==1)
df_w_libraries = clean_df_nlp_var.filter(col("coding_library_mentioned")==1)
df_w_tools = clean_df_nlp_var.filter(col("ds_tool_mentioned")==1)
df_w_stage = clean_df_nlp_var.filter(col("interview_stage_mentioned")==1)

# write function to extract language frequency in each text 
def extract_pattern(pattern,text):
    return re.findall(pattern, text) 

# create udf
extract_pattern_udf = udf(extract_pattern, ArrayType(StringType()))

# add a column with the pattern to the DataFrames
df_w_languages = df_w_languages.withColumn("pattern", lit(language_pattern))
df_w_libraries = df_w_libraries.withColumn("pattern", lit(library_pattern))
df_w_tools = df_w_tools.withColumn("pattern", lit(tool_pattern))
df_w_stage = df_w_stage.withColumn("pattern", lit(stage_pattern))

# apply the UDF 
df_w_languages = df_w_languages.withColumn("mentioned_languages", extract_pattern_udf(col("pattern"), col("clean_body")))
df_w_libraries = df_w_libraries.withColumn("mentioned_libraries", extract_pattern_udf(col("pattern"), col("clean_body")))
df_w_tools = df_w_tools.withColumn("mentioned_tools", extract_pattern_udf(col("pattern"), col("clean_body")))
df_w_stage = df_w_stage.withColumn("mentioned_stages", extract_pattern_udf(col("pattern"), col("clean_body")))

# explode the mentioned languages
exploded_languages = df_w_languages.withColumn("language", explode(col("mentioned_languages")))
exploded_libraries = df_w_libraries.withColumn("library", explode(col("mentioned_libraries")))
exploded_tools = df_w_tools.withColumn("tool", explode(col("mentioned_tools")))
exploded_stages = df_w_stage.withColumn("stage", explode(col("mentioned_stages")))

# group by language and count
language_cts = exploded_languages.groupBy("language").count()
library_cts = exploded_libraries.groupBy("library").count()
tool_cts = exploded_tools.groupBy("tool").count()
stage_cts = exploded_stages.groupBy("stage").count()

# add col describing pattern
language_cts = language_cts.withColumn("pattern_topic", lit("programming language"))
library_cts = library_cts.withColumn("pattern_topic", lit("coding library"))
tool_cts = tool_cts.withColumn("pattern_topic", lit("data science tool"))
stage_cts = stage_cts.withColumn("pattern_topic", lit("interview stage"))

language_cts = language_cts.withColumnRenamed("language", "pattern")
library_cts = library_cts.withColumnRenamed("library", "pattern")
tool_cts = tool_cts.withColumnRenamed("tool", "pattern")
stage_cts = stage_cts.withColumnRenamed("stage", "pattern")

# convert to pandas for future visualization
language_cts_pd = language_cts.toPandas()
library_cts_pd = library_cts.toPandas()
tool_cts_pd = tool_cts.toPandas()
stage_cts_pd = stage_cts.toPandas()

# union pd dataframes for visualizations later
combined_cts_pd = pd.concat([language_cts_pd, library_cts_pd, tool_cts_pd, stage_cts_pd], ignore_index=True)

# visualize
color_map = {
    'programming language': '#163d6b', 
    'coding library': '#204321',
    'data science tool': '#82B8B5' ,
    'interview stage' :'#5E9751'}
    
fig = px.sunburst(
    combined_cts_pd,
    path=['pattern_topic', 'pattern'],
    values='count',
    color='pattern_topic',
    color_discrete_map=color_map, 
    title='Distribution of RegEx Topic Patterns'
)

fig.update_layout(width=650, height=650)
fig.show()

```


<iframe src="images/sundial.html" width="300%" height="600px">

</iframe>

### 3) Building a Sentiment Model

#### **Topic 7: Sentiment Analysis of Interview Posts & Comments**

*Goal:* Observe the overall sentiment of posts and comments related to the filtered dataset that focuses on interviews. This will help identify how candidates are feeling about the interview process and can provide insight into specific employers or the difficulty level of particular interview topics.

For the sentiment analysis, the Spark jars packages and the John Snow Labs Spark NLP package were used. The John Snow Labs package includes several pre-trained models; the one best suited for this analysis is sentimentdl_use_twitter. The pipeline consists of four parts. The first part, the DocumentAssembler, takes the text column (in this case the 'body' column) as its input. The second part, the Universal Sentence Encoder, uses the TensorFlow Hub USE model and takes the output of the DocumentAssembler as its input. The third part is the pretrained sentiment deep learning model, which takes the sentence embeddings as its input and applies the John Snow Labs sentimentdl_use_twitter model for English, outputting a sentiment for each row of the 'body' column. The final part is the NLP Pipeline, which chains the previous stages together so the data frame can be passed through them. The pipeline runs through all the previously mentioned parts and produces a new sentiment column with three options - positive, neutral, and negative. The last step in the sentiment analysis is transforming the output of the pipeline into a results dataframe.

```{python}
#| eval: False
! pip install spark-nlp
import json
import sparknlp
import numpy as np
import pandas as pd
import sagemaker
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from sagemaker.spark.processing import PySparkProcessor
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, SentimentDLModel
from sparknlp.pretrained import PretrainedPipeline

sess = sagemaker.Session()

# Initialize Spark session for Spark NLP
spark = SparkSession.builder\
    .appName("PySparkApp")\
    .master("local[*]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3,org.apache.hadoop:hadoop-aws:3.2.2")\
    .config("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.ContainerCredentialsProvider")\
    .getOrCreate()

%%time
import sagemaker

session = sagemaker.Session()
#s3://sagemaker-us-east-1-131536150362/project/data_reformat_clean_nlp.parquet/
bucket = "tm1450-project"
output_prefix_data_submissions = "data_reformat_clean.parquet/"

s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading tagged data from {s3_path}")
df = spark.read.parquet(s3_path, header=True)
#print(f"shape of the submissions dataframe is {df.count():,}x{len(df.columns)}")
# Define model names and pipeline components
MODEL_NAME = 'sentimentdl_use_twitter'

documentAssembler = DocumentAssembler()\
    .setInputCol("body")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

sentimentdl = SentimentDLModel.pretrained(name=MODEL_NAME, lang="en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

nlpPipeline = Pipeline(
      stages = [
          documentAssembler,
          use,
          sentimentdl
      ])
pipelineModel = nlpPipeline.fit(df)
results = pipelineModel.transform(df)
```
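
The `results` dataframe keeps the original columns and adds the Spark NLP annotation structs. The schema below is the standard `printSchema()` output; a minimal sketch of the call (assuming the `results` dataframe from the pipeline above):

```{python}
#| eval: False
# Inspect the structure of the pipeline output (schema reproduced below)
results.printSchema()
```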

**Schema:**

root\
\|-- link_id: string (nullable = true)\
\|-- created_utc: timestamp (nullable = true)\
\|-- body: string (nullable = true)\
\|-- is_tech: boolean (nullable = true)\
\|-- is_asking_for_help: boolean (nullable = true)\
\|-- is_asking_a\_question: boolean (nullable = true)\
\|-- is_FAANG: boolean (nullable = true)\
\|-- isSubmission: boolean (nullable = true)\
\|-- body_no_url: string (nullable = true)\
\|-- finished_ngrams: string (nullable = true)\
\|-- clean_body: string (nullable = true)\
\|-- document: array (nullable = true)\
\| \|-- element: struct (containsNull = true)\
\| \| \|-- annotatorType: string (nullable = true)\
\| \| \|-- begin: integer (nullable = false)\
\| \| \|-- end: integer (nullable = false)\
\| \| \|-- result: string (nullable = true)\
\| \| \|-- metadata: map (nullable = true)\
\| \| \| \|-- key: string\
\| \| \| \|-- value: string (valueContainsNull = true)\
\| \| \|-- embeddings: array (nullable = true)\
\| \| \| \|-- element: float (containsNull = false)\
\|-- sentence_embeddings: array (nullable = true)\
\| \|-- element: struct (containsNull = true)\
\| \| \|-- annotatorType: string (nullable = true)\
\| \| \|-- begin: integer (nullable = false)\
\| \| \|-- end: integer (nullable = false)\
\| \| \|-- result: string (nullable = true)\
\| \| \|-- metadata: map (nullable = true)\
\| \| \| \|-- key: string\
\| \| \| \|-- value: string (valueContainsNull = true)\
\| \| \|-- embeddings: array (nullable = true)\
\| \| \| \|-- element: float (containsNull = false)\
\|-- sentiment: array (nullable = true)\
\| \|-- element: struct (containsNull = true)\
\| \| \|-- annotatorType: string (nullable = true)\
\| \| \|-- begin: integer (nullable = false)\
\| \| \|-- end: integer (nullable = false)\
\| \| \|-- result: string (nullable = true)\
\| \| \|-- metadata: map (nullable = true)\
\| \| \| \|-- key: string\
\| \| \| \|-- value: string (valueContainsNull = true)\
\| \| \|-- embeddings: array (nullable = true)\
\| \| \| \|-- element: float (containsNull = false)\

#### Tables of Summary Statistics
Below is a table that provides a glimpse of the data after the sentiment analysis created a new variable that allows further analysis, such as what sentiment looks like for the entire dataset or for specific attributes such as posts/comments that focus on asking for help or on FAANG companies.

```{python}
#| eval: False
from pyspark.sql.functions import explode, col

# Explode the sentiment array and extract the result field
results = results.withColumn('sentiment', F.explode('sentiment'))
results = results.withColumn('sentiment', F.col('sentiment.result'))

# Selecting columns for the final DataFrame
final_data = results.select(
    'is_tech', 'is_asking_for_help', 'is_asking_a_question', 
    'is_FAANG', 'clean_body', 'sentiment'
)

# Persist and show the DataFrame
final_data.persist()
final_data.show()
```

| is_tech | is_asking_for_help | is_asking_a_question | is_FAANG | clean_body | sentiment |
|---------|--------------------|----------------------|----------|------------|-----------|
| true | false | false | false |  | neutral |
| true | false | true | false | messaging 21 days... | positive |
| false | false | false | false | sorry didnt see t... | positive |
| false | false | true | false | howd get | positive |
| false | false | true | false | change anything p... | positive |
| false | false | false | false | one 180 question ... | negative |
| false | true | true | true | approach solving ... | positive |
| false | false | false | false | places wanting pe... | positive |
| false | true | true | false | principles mean s... | positive |
| false | false | false | false | interested | positive |
| true | false | false | false | thanks exactly fo... | positive |
| false | false | false | false | okay thats good i... | positive |
| false | false | false | false | wait take back sa... | neutral |
| true | false | false | false | map contains inte... | positive |
| false | false | false | false | practice together | positive |
| false | false | false | false |  | positive |
| true | true | true | false | gtdo think solvin... | positive |
| false | false | false | false | bro need work sof... | negative |
| false | false | false | false | settle sugarlips | positive |
| false | false | false | false | 25 years almost | positive |

#### Full Dataframe Sentiment Counts
Positive sentiment dominates the full dataset, but alongside the strong presence of positive discourse there is also a substantial number of negative experiences and thoughts being shared.

```{python}
#| eval: False
sentiment_counts = final_data.groupBy("sentiment").count()
sentiment_counts.show()
```

| sentiment | count |
|-----------|-------|
| positive | 81780 |
| neutral | 4164 |
| negative | 42395 |

##### Sentiment Counts for is_tech Column
The is_tech column represents a subset of the data where the focus is on technical aspects of interviews. The sentiment analysis summary for this subset shows that positive sentiments far outnumber the negatives, suggesting that discussions in this category are generally more positive. This is surprising, as the technical aspects of interviews are usually the most stressful, so it provides a bit of hope to those starting the long road of technical interview preparation.

```{python}
#| eval: False
tech_data = final_data.filter(col('is_tech') == True)
sentiment_counts = tech_data.groupBy('sentiment').count()
print('Sentiment Counts for is_tech Column')
sentiment_counts.show()
```

| sentiment | count |
|-----------|-------|
| positive | 20776 |
| neutral | 979 |
| negative | 7629 |

##### Sentiment Counts for is_asking_for_help Column
The is_asking_for_help column represents a data subset focusing on posts where individuals are seeking assistance. Positive sentiment is almost double the negative sentiment, which may signify a supportive environment for those asking for help, again providing hope for those starting the challenge of preparing for technical interviews. 
```{python}
#| eval: False
help_data = final_data.filter(col('is_asking_for_help') == True)
sentiment_counts = help_data.groupBy('sentiment').count()
print('Sentiment Counts for is_asking_for_help Column')
sentiment_counts.show()
```

| sentiment | count |
|-----------|-------|
| positive | 9839 |
| neutral | 346 |
| negative | 4602 |

##### Sentiment Counts for is_asking_a\_question Column
The table below shows a higher number of positive sentiments than negative ones for posts/comments that ask a question. This might suggest that the Reddit community is usually supportive when individuals ask questions, creating an environment for information exchange and learning. Knowing this may encourage new users seeking advice for technical interviews to be less afraid to ask questions.

```{python}
#| eval: False
question_data = final_data.filter(col('is_asking_a_question') == True)
sentiment_counts = question_data.groupBy('sentiment').count()
print('Sentiment Counts for is_asking_a_question Column')
sentiment_counts.show()
```

| sentiment | count |
|-----------|-------|
| positive | 20014 |
| neutral | 867 |
| negative | 13028 |

##### Sentiment Counts for is_FAANG Column
The is_FAANG subset contains posts/comments discussing interviews at FAANG companies (Facebook, Amazon, Apple, Netflix, Google), and here sentiment is also more positive than negative. This indicates that conversations around interviews at these top tech companies lean positive, which could be due to successful interview experiences or helpful insights posted by users who have interviewed with these companies.
```{python}
#| eval: False
faang_data = final_data.filter(col('is_FAANG') == True)
sentiment_counts = faang_data.groupBy('sentiment').count()
print('Sentiment Counts for is_FAANG Column')
sentiment_counts.show()
```

+-----------+-------+
| sentiment | count |
+===========+=======+
| positive  | 3488  |
+-----------+-------+
| neutral   | 171   |
+-----------+-------+
| negative  | 1554  |
+-----------+-------+

##### Plot of Summary Statistics

The plot below displays a more succinct summary of the count of posts (submissions and comments) for each sentiment classification broken down by post type. Again, the text content types include:

1) A post asking for help

2) A post asking a question

3) A post mentioning a FAANG company

4) A technical post

These topics were identified using regular expressions, as described in previous parts of the website. In general, submissions and comments in the full dataset are mostly positive; positive texts number almost double the negative ones. For texts where the is_tech column is true, the ratio of positive to negative sentiment is the highest of the four categories, while texts flagged as asking a question show the narrowest gap between positive and negative counts, consistent with the tables above. Neutral sentiment is the least common label, both overall and within each subset where the flag is true.
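
For readers landing on this section directly, the sketch below illustrates the general idea behind these regex flags. The pattern and the column name (`text`) are placeholders for illustration only, not the exact expressions used in this project.

```{python}
#| eval: False
import pyspark.sql.functions as F

# Illustrative only: flag texts that mention a FAANG company.
# The real patterns and the text column name are defined in earlier sections.
faang_pattern = r"(?i)\b(facebook|meta|amazon|apple|netflix|google)\b"
flagged = final_data.withColumn("is_FAANG_example", F.col("text").rlike(faang_pattern))
```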

```{python}
#| eval: False

import pyspark.sql.functions as F
import pandas as pd

# Get sentiment counts for submissions/comments that are technical
df_tech_counts = df\
    .filter(F.col('is_tech'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Technical'))

# Get sentiment counts for submissions/comments that ask for help
df_help_counts = df\
    .filter(F.col('is_asking_for_help'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Asking for help'))

# Get sentiment counts for submissions/comments that ask a question
df_question_counts = df\
    .filter(F.col('is_asking_a_question'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Asking question'))

# Get sentiment counts for submissions/comments that mention a FAANG
# company
df_FAANG_counts = df\
    .filter(F.col('is_FAANG'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Mentions FAANG'))

# Concatenate the count dfs together and convert to pandas
df_count_summary = df_tech_counts\
    .union(df_help_counts)\
    .union(df_question_counts)\
    .union(df_FAANG_counts)\
    .toPandas()

# Pivot dataframe
df_plot = df_count_summary.pivot(index='category', columns='sentiment', values='count').reset_index()

# Code referenced: https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_colors.html
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(len(df_plot['category']))  # the label locations
width = 0.25  # the width of the bars
multiplier = 0

color_dict = {
    'negative': '#251279',
    'neutral': '#6CACED',
    'positive': '#5E9751'
}

fig, ax = plt.subplots(layout='constrained')

# Cycle through each sentiment label (negative, neutral, positive)
for sentiment_label in df_plot.columns:
    # Skip the category column
    if sentiment_label != 'category':
        offset = width * multiplier
        rects = ax.bar(
            x + offset,
            df_plot[sentiment_label],
            width,
            label=sentiment_label.title(),
            color=color_dict[sentiment_label]
        )
        ax.bar_label(rects, padding=3)
        multiplier += 1

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Number of Submissions/Comments')
ax.set_title('Distribution of Sentiment')
ax.set_xticks(x + width, df_plot['category'])
ax.legend(loc='upper left')
ax.set_ylim(0, 22000)

plt.savefig('../../website-source/images/sentiment_dist.png', dpi=300)
plt.show()
```

<img src="images/sentiment_dist.png" style="width: 75%; height: auto;"/>
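
To make the ratio comparison above concrete, a short follow-up could compute the positive-to-negative ratio for each category, reusing the `df_count_summary` dataframe built in the plotting code. This is a sketch rather than part of the original pipeline:

```{python}
#| eval: False
# Sketch: positive-to-negative ratio per category, using the
# df_count_summary dataframe created in the plotting code above.
ratios = (
    df_count_summary
    .pivot(index='category', columns='sentiment', values='count')
    .assign(pos_neg_ratio=lambda d: d['positive'] / d['negative'])
    .sort_values('pos_neg_ratio', ascending=False)
)
print(ratios[['positive', 'negative', 'pos_neg_ratio']])
```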