Reddit Interview Insights

Contents

  • Natural Language Processing
    • 1) External Data Set
      • Cleaning the External Data Set
      • Preparing the Reddit Data for Merging
      • Data Quality Checks
      • Topic 4: Market Salary Trend Analysis
        • Top 20 Companies Mentioned with Compensation Comparison
    • 2) Conducting Natural Language Processing Work
      • Topic 5: NLP Trends for Reddit Interview Content
        • Most Common Words Overall
        • Distribution of Text Lengths
        • Important Words According to TF-IDF
        • Cleaning the Text Using 5+ NLP Techniques
      • Topic 6: Identification of Most Important Skills for Technical Interviewing
        • Creating New Dummy Variables
    • 3) Building a Sentiment Model
      • Topic 7: Sentiment Analysis of Interview Posts & Comments
      • Tables of Summary Statistics
      • Full Dataframe Sentiment Counts
        • Sentiment Counts for is_tech Column
        • Sentiment Counts for is_asking_for_help Column
        • Sentiment Counts for is_asking_a_question Column
        • Sentiment Counts for is_FAANG Column
        • Plot of Summary Statistics

Natural Language Processing

This webpage focuses on using Natural Language Processing (NLP) techniques to dive into the world of Reddit and uncover insights related to interview subreddits. Main accomplishments include identifying and analyzing the significant themes and sentiments within the data. Using NLP techniques, the most commonly discussed topics were extracted, providing a clear understanding of the dominant conversations surrounding technical interview attributes. In addition, external data made it possible to extract financial insights related to the types of companies seeking job applicants. These efforts support the business goals for the target audience: job seekers and students.

Furthermore, sentiment analysis was employed to provide a perspective on the tone of the Reddit discussions. This aspect of the project was particularly enlightening, as it revealed the mood and attitudes of subreddit participants towards certain interview aspects. These insights are especially valuable for helping students prepare for technical interviews by highlighting common concerns and the overall mood surrounding the interview process. The integration of NLP in bridging data and decision making showcases how technology can elevate business intelligence and enhance student engagement.

1) External Data Set

Cleaning the External Data Set

External Data Source: https://github.com/yuenherny/um-wqd7003-ds-salary-analysis/tree/main

This data set was assembled by the author of the GitHub repository above from two sources: the LEVELS.FYI API and a Kaggle dataset on data science and STEM salaries (see the README in the repository linked above). The original data set contains yearly compensation (total, base salary, bonuses, and stock grants), location, years of experience, and years at company for STEM and data science roles at over 1,500 companies. The data ranges from 2017 to 2021.

For the project’s purposes, the cleaned data set averages yearly compensation, base salary, bonus, years of experience, and years at company for each company and role type (tech or non-tech). The role type column was created by classifying each title as either “tech” or “non tech”; for example, a software engineer is classified as “tech”, while sales is classified as “non tech”. The data was then filtered to keep just the tech roles. Below is the code used to clean the external data set, along with a preview of the resulting summary table.

Code
# Import libraries
import os
import openpyxl
import pandas as pd
import numpy as np
# Read in the data
company_names = pd.read_csv("../data/external_data/company_name_matching_RAW.csv")
salaries = pd.read_excel("../data/external_data/levels_fyi_salary_data_RAW.xlsx", sheet_name = "data")

# Subset columns
salaries_subset = salaries[["timestamp", "company", "title", "totalyearlycompensation", "location", 
                            "yearsofexperience", "yearsatcompany", "basesalary", "stockgrantvalue", "bonus"]]

# Add a year column and drop timestamp column
salaries_subset["year"] = pd.to_datetime(salaries_subset["timestamp"]).dt.year
salaries_subset = salaries_subset.drop(["timestamp"], axis = 1)

# Change data types
data_types = {"company": str, "title": str, "totalyearlycompensation": float, "location": str, "yearsofexperience": float, 
    "yearsatcompany": float, "basesalary": float, "stockgrantvalue": float, "bonus": float, "year": int}
salaries_subset = salaries_subset.astype(data_types)
salaries_subset["year"] = pd.to_datetime(salaries_subset["year"], format = "%Y")

# Create clean company name matching
company_mapping = dict(zip(company_names["Variant"].str.strip(), company_names["Clean Company"]))

# Make a new column with the clean company names
salaries_subset = salaries_subset.rename(columns = {"company": "old_company"})
salaries_subset["company"] = salaries_subset["old_company"].map(company_mapping)
salaries_subset = salaries_subset.drop(["old_company"], axis = 1)

# Categorize available titles as "tech" or "non tech"
role_type_dict = {'Product Manager': 'tech', 'Software Engineer': 'tech',
       'Software Engineering Manager': 'tech', 'Data Scientist': 'tech',
       'Solution Architect': 'tech', 'Technical Program Manager': 'tech',
       'Human Resources': 'non tech', 'Product Designer': 'tech', 'Marketing': 'non tech',
       'Business Analyst': 'tech', 'Hardware Engineer': 'tech', 'Sales': 'non tech', 
       'Recruiter': 'non tech', 'Mechanical Engineer': 'tech', 'Management Consultant': 'non tech'}

salaries_subset["role_type"] = salaries_subset["title"].map(role_type_dict)

# Rearrange columns
salaries_subset = salaries_subset[["company", "year", "role_type", "title", "totalyearlycompensation", "location", "yearsofexperience",
                                    "yearsatcompany", "basesalary", "stockgrantvalue", "bonus"]]
# Get a summary table of averages by company and role type
summary_df = salaries_subset.groupby(["company", "role_type"]).agg({
    "totalyearlycompensation": "mean", "yearsofexperience": "mean",
    "yearsatcompany": "mean", "basesalary": "mean",
    "stockgrantvalue": "mean", "bonus": "mean"}).reset_index()
summary_df.columns = ["company", "role_type", "average_yearly_compensation", "average_years_experience", "average_years_at_company", 
                        "average_base_salary", "average_stock_grant_value", "average_bonus"]

# Subset for just FAANG (Facebook, Amazon, Apple, Netflix, Google)
# faang_df = summary_df[(summary_df["company"] == "Facebook") | (summary_df["company"] == "Amazon") | (summary_df["company"] == "Apple") | (summary_df["company"] == "Netflix") | (summary_df["company"] == "Google")]

summary_df.head(10)
| company | role type | average yearly compensation | average years experience | average years at company | average base salary | average stock grant value | average bonus |
|---|---|---|---|---|---|---|---|
| 10x Genomics | Tech | 193333.333333 | 3.166667 | 0.666667 | 149666.666667 | 27666.666667 | 15166.666667 |
| 23andMe | Non Tech | 160000.000000 | 10.000000 | 4.000000 | 160000.000000 | 0.000000 | 0.000000 |
| 23andMe | Tech | 152333.333333 | 2.000000 | 0.333333 | 109166.666667 | 19833.333333 | 0.000000 |

The plots below depict the distribution of the six numeric variables in the external data set: average yearly compensation, average base salary, average years spent at the company, average years of experience, average stock grant value, and average bonus. From this point forward, the focus is on “tech” roles only; hence, the figures show the distribution of each metric for tech roles.
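
These distribution figures are rendered as images on the site; below is a minimal sketch of how they could be produced from the cleaned summary table, assuming summary_df from the previous chunk (the published figures may have been generated with different styling).

Code
import matplotlib.pyplot as plt

# Focus on tech roles only, as described above
tech_only = summary_df[summary_df["role_type"] == "tech"]

metrics = ["average_yearly_compensation", "average_base_salary", "average_years_at_company",
           "average_years_experience", "average_stock_grant_value", "average_bonus"]

# One histogram per metric
fig, axes = plt.subplots(2, 3, figsize = (15, 8))
for ax, metric in zip(axes.flatten(), metrics):
    ax.hist(tech_only[metric].dropna(), bins = 30, color = "#251279")
    ax.set_title(metric.replace("_", " ").title())
plt.tight_layout()
plt.show()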

In the EDA portion, the data set was tagged based on whether or not a FAANG company was mentioned (FAANG includes Facebook, Apple, Amazon, Netflix, and Google). To take it further, all of the companies present in the external data set were searched for and tagged accordingly. The two data sets were then merged based on the company mentioned in each row. The merge did not apply to cases where multiple companies were mentioned in a post or comment or where no company was mentioned (null). Below is the code used to tag the data set with additional companies and merge it with the external data set.

Preparing the Reddit Data for Merging

Code
# Import packages
import time
import sagemaker
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import matplotlib.pyplot as plt
import pyspark.sql.functions as F
import pandas as pd
import datetime
from pyspark.sql import SparkSession
from s3fs import S3FileSystem
from sagemaker.spark.processing import PySparkProcessor
from pyspark.sql.functions import concat, lit, lower, col, when, array, expr, count, isnan, length, udf
from pyspark.sql.types import ArrayType, StringType

# Build Spark session
spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

print(spark.version)

# Read in the filtered submissions
%%time
session = sagemaker.Session()
bucket = "sagemaker-us-east-1-158894274492"
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")

# Read in the filtered comments
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")

# Read in the external data set
all_salary_info = pd.read_csv("../data/external_data/all_company_salary_info_CLEAN.csv")
# Drop index column
all_salary_info = all_salary_info.drop(["Unnamed: 0"], axis = 1)
# Round float values
all_salary_info = all_salary_info.apply(lambda x: round(x, 2) if x.dtype == 'float64' else x)
# Get just the tech salaries
tech_df = all_salary_info[all_salary_info["role_type"] == "tech"]

# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making a "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))

## TAGGING SUBMISSIONS
keywords_is_tech = ["tech", "technology", "code", "coding", "computer", "data", "python","java", "javascript", "sql", "database", "statistics", "azure", "cloud computing", "progamming", "leetcode", "saas", "faang"]
pattern = "(?i)" + "|(?i)".join(keywords_is_tech)
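# The joined pattern looks like "(?i)tech|(?i)technology|..." so each keyword matches case-insensitively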

submissions_tagged = submissions_subset_col\
                        .withColumn('is_tech_submissions', col('searchable_text').rlike(pattern))

keywords_help = ["help me", "struggle", "struggling", "anyone", "advice", "tips", "help"]
pattern = "(?i)" + "|(?i)".join(keywords_help)

submissions_tagged = submissions_tagged\
                        .withColumn('is_asking_for_help_submissions', col('searchable_text').rlike(pattern))

submissions_tagged = submissions_tagged\
                        .withColumn('is_asking_a_question_submissions', col('searchable_text').rlike(r'(\?)+'))

keywords_faang = ["facebook", "amazon", "apple", "netflix", "google"]
pattern = "(?i)" + "|(?i)".join(keywords_faang)

submissions_tagged = submissions_tagged\
                        .withColumn('is_FAANG_submissions', col('searchable_text').rlike(pattern))\
                        .cache()

# Store companies mentioned in Reddit text from companies present in the external data set
companies = list(tech_df.company.unique())
# Make a udf to get mentioned companies 
@udf(ArrayType(StringType()))
def get_company_mentions(text):
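    # Pad each name with spaces so short company names (e.g. "Gap") do not match inside longer words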
    return [company for company in companies if " " + company.lower() + " " in text.lower()]
submissions_tagged = submissions_tagged.withColumn("company_mentions", get_company_mentions(col("searchable_text")))

# Make another UDF for the final tagged column (None if no company found, "Multiple" if 2+ found, else the single company mentioned)
@udf(StringType())
def clean_company_mentions(company_list):
    if not company_list:
        return None
    elif len(company_list) > 1:
        return "Multiple"
    else:
        return company_list[0]
submissions_tagged = submissions_tagged.withColumn("final_company_mentioned", clean_company_mentions(col("company_mentions")))

## TAGGING COMMENTS
keywords_is_tech = ["tech", "technology", "code", "coding", "computer", "data", "python","java", "javascript", "sql", "database", "statistics", "azure", "cloud computing", "progamming", "leetcode", "saas", "faang"]
pattern = "(?i)" + "|(?i)".join(keywords_is_tech)

comments_tagged = comments_subset_col\
                        .withColumn('is_tech_comments', col('body').rlike(pattern))

keywords_help = ["help me", "struggle", "struggling", "anyone", "advice", "tips", "help"]
pattern = "(?i)" + "|(?i)".join(keywords_help)

comments_tagged = comments_tagged\
                        .withColumn('is_asking_for_help_comments', col('body').rlike(pattern))

comments_tagged = comments_tagged\
                        .withColumn('is_asking_a_question_comments', col('body').rlike(r'(\?)+'))

keywords_faang = ["facebook", "amazon", "apple", "netflix", "google"]
pattern = "(?i)" + "|(?i)".join(keywords_faang)

comments_tagged = comments_tagged\
                        .withColumn('is_FAANG_comments', col('body').rlike(pattern))\
                        .cache()

# Store companies mentioned in Reddit text from companies present in the external data set
companies = list(tech_df.company.unique())
# Make a udf to get mentioned companies 
@udf(ArrayType(StringType()))
def get_company_mentions(text):
    return [company for company in companies if " " + company.lower() + " " in text.lower()]
comments_tagged = comments_tagged.withColumn("company_mentions", get_company_mentions(col("body")))

# Make another UDF for the final tagged column (None if no company found, "Multiple" if 2+ found, else the single company mentioned)
@udf(StringType())
def clean_company_mentions(company_list):
    if not company_list:
        return None
    elif len(company_list) > 1:
        return "Multiple"
    else:
        return company_list[0]
comments_tagged = comments_tagged.withColumn("final_company_mentioned", clean_company_mentions(col("company_mentions")))

## JOIN COMMENTS AND SUBMISSIONS
# Rename some columns for this reformatting
comments_tagged = (comments_tagged
                   .withColumnRenamed("company_mentions", "company_mentions_comments")
                   .withColumnRenamed("final_company_mentioned", "final_company_mentioned_comments"))
submissions_tagged = (submissions_tagged
                   .withColumnRenamed("company_mentions", "company_mentions_submissions")
                   .withColumnRenamed("final_company_mentioned", "final_company_mentioned_submissions"))


# Clean comments link_id "t3_#####" to match submissions id "#####"
comments_new_id = comments_tagged.withColumn(
    'link_id',
    expr("substring(link_id, 4, length(link_id)-3)")
)

# Join comments with submissions
df = comments_new_id.join(submissions_tagged, comments_new_id['link_id'] == submissions_tagged['id'], 'outer')

# Reformatting

# Separate into two dataframes (comments and submissions)
df_submissions = df.filter(df.created_utc_submission.isNotNull())\
                    .select(
                        F.col('id').alias('link_id'), 
                        F.col('created_utc_submission').alias('created_utc'),
                        F.col('searchable_text').alias('body'),
                        F.col('is_tech_submissions').alias('is_tech'),
                        F.col('is_asking_for_help_submissions').alias('is_asking_for_help'),
                        F.col('is_asking_a_question_submissions').alias('is_asking_a_question'),
                        F.col('is_FAANG_submissions').alias('is_FAANG'),
                        F.col('company_mentions_submissions').alias('company_mentions'),
                        F.col('final_company_mentioned_submissions').alias('final_company_mentioned'),
                        F.lit(True).alias('isSubmission')
                    )\
                    .distinct()

df_comments = df.filter(df.created_utc_comment.isNotNull())\
                    .select(
                        'link_id', 
                        F.col('created_utc_comment').alias('created_utc'),
                        'body',
                        F.col('is_tech_comments').alias('is_tech'),
                        F.col('is_asking_for_help_comments').alias('is_asking_for_help'),
                        F.col('is_asking_a_question_comments').alias('is_asking_a_question'),
                        F.col('is_FAANG_comments').alias('is_FAANG'),
                        F.col('company_mentions_comments').alias('company_mentions'),
                        F.col('final_company_mentioned_comments').alias('final_company_mentioned'),
                        F.lit(False).alias('isSubmission')
                    )\
                    .distinct()

# Concatenate comments and submissions
df_clean = df_submissions.union(df_comments)

## MERGE WITH EXTERNAL DATA SET
# Rename for merge
df_clean = (df_clean
            .withColumnRenamed("final_company_mentioned", "company"))
# Convert external to spark data frame
tech_spark_df = spark.createDataFrame(tech_df)
# Join on final_company_mentioned column
merged_df = df_clean.join(tech_spark_df, on = "company", how = "left")
| id | comment timestamp | body | technical comment | asking for help? | asking a question? | FAANG comment | company mentions |
|---|---|---|---|---|---|---|---|
| wghtce | 2022-08-05 01:11:58 | Applying for Simi… | False | False | False | False | [] |
| sbtver | 2022-01-24 19:13:22 | Leetcode Premium … | True | False | False | False | [] |
| ue9ggm | 2022-04-29 00:32:14 | Explanation of Le… | True | False | False | False | [] |

(table continued)

| submission | role type | average yearly compensation | average years experience | average years at company | average base salary | average stock grant value | average bonus |
|---|---|---|---|---|---|---|---|
| True | Null | Null | Null | Null | Null | Null | Null |
| True | Null | Null | Null | Null | Null | Null | Null |
| True | Null | Null | Null | Null | Null | Null | Null |

Data Quality Checks

At this point, the data has been cleaned and new variables have been created. This next section aims to perform quality checks on said dataset. Being aware of key dataset characteristics like schema, overall structure, and missing values helps in detecting any inconsistencies, errors, or gaps that could influence subsequent analyses.
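
The schema shown next comes from a standard Spark call; a minimal sketch, assuming the merged Spark DataFrame is named merged_df as above:

Code
# Print the schema of the merged, tagged dataframe
merged_df.printSchema()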

Schema:

root
 |-- company: string (nullable = true)
 |-- link_id: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- body: string (nullable = true)
 |-- is_tech: boolean (nullable = true)
 |-- is_asking_for_help: boolean (nullable = true)
 |-- is_asking_a_question: boolean (nullable = true)
 |-- is_FAANG: boolean (nullable = true)
 |-- company_mentions: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- isSubmission: boolean (nullable = false)
 |-- role_type: string (nullable = true)
 |-- average_yearly_compensation: double (nullable = true)
 |-- average_years_experience: double (nullable = true)
 |-- average_years_at_company: double (nullable = true)
 |-- average_base_salary: double (nullable = true)
 |-- average_stock_grant_value: double (nullable = true)
 |-- average_bonus: double (nullable = true)

Code
# Print number of rows
merged_df.count()

Number of Rows: 128339

Code
# Check for nulls
merged_df.select([count(when(col(c).isNull(), c)).alias(c) for c in merged_df.columns]).show()

Null Count by Column:

column number of nulls
company 8111
body 8111
is_tech 8111
is_asking_for_help 8111
is_asking_a_question 8111
is_FAANG 934
company_mentions 934
isSubmission 934
role_type 934
average_yearly_compensation 934
average_years_experience 934

Most of the company salary information is null in the data set after the merge, which is expected: many posts and comments mention no companies at all, some mention multiple companies, and some mention companies not listed in the external data set. For the visualizations below, the nulls were removed to create a smaller, less sparse data set.

Topic 4: Market Salary Trend Analysis

Goal: Connect salary information on various firms with the firms discussed in the Reddit submissions and comments. Identify the firms that offer the highest salaries and contrast them with the firms that are discussed most frequently in the context of job hunting and interviewing. Also devote particular attention to the firms represented by the FAANG acronym to build off prior examination of these firms from the EDA stage.

Below are two bar plots showing the distribution of submission counts and comment counts by company. Rows tagged as null, “Other”, or “Multiple” were removed to isolate the submissions and comments that mention exactly one company. Abnormal company names such as “Here” and “Test” were removed as well. Additionally, companies with five or fewer mentions were removed to keep the plots readable.

Code
# Separate the merged dataframe into submissions and comments
filtered_df = merged_df.filter(merged_df['company'].isNotNull()) # take out nulls
submissions = filtered_df.filter(filtered_df.isSubmission == True)
comments = filtered_df.filter(filtered_df.isSubmission == False)

# Get the frequency of posts and comments by company
submission_freqs = submissions.groupBy('company').count().orderBy('company')
comments_freqs = comments.groupBy('company').count().orderBy('company')

# Convert counts to pandas
submission_freqs = submission_freqs.toPandas()
comments_freqs = comments_freqs.toPandas()

# Filter out "Other" and "None" and "Multiple" to just get individual companies
exclude = ['Other', 'None', 'Multiple', 'Test', 'Here']
submission_freqs = submission_freqs[~submission_freqs['company'].isin(exclude)]
comments_freqs = comments_freqs[~comments_freqs['company'].isin(exclude)]

# Filter out companies with small counts
submission_freqs = submission_freqs[submission_freqs['count'] > 5]
comments_freqs = comments_freqs[comments_freqs['count'] > 5]

# Plotting
plt.figure(figsize = (60, 60))

submission_freqs = submission_freqs.sort_values(by = 'count', ascending = True)
comments_freqs = comments_freqs.sort_values(by = 'count', ascending = True)

plt.subplot(1, 2, 1)
bars = plt.barh(submission_freqs['company'].astype(str), submission_freqs['count'].astype(int), color = '#251279')
plt.xlabel('No. of Submissions', fontsize = 60)
plt.tick_params(axis = 'x', which = 'major', labelsize = 50) 
plt.tick_params(axis = 'y', which = 'major', labelsize = 50) 
plt.title('Number of Submissions by Company', fontsize = 60)
plt.bar_label(bars, labels = submission_freqs['count'].astype(int), padding = 3, fontsize = 50)


plt.subplot(1, 2, 2)
bars = plt.barh(comments_freqs['company'].astype(str), comments_freqs['count'].astype(int), color = '#40B389')
plt.xlabel('No. of Comments', fontsize = 60)
plt.tick_params(axis = 'x', which = 'major', labelsize = 50) 
plt.tick_params(axis = 'y', which = 'major', labelsize = 50) 
plt.title('Number of Comments by Company', fontsize = 60)
plt.bar_label(bars, labels = comments_freqs['count'].astype(int), padding = 3, fontsize = 50)

plt.figtext(0.05, -0.02, 'Note: Companies with less than 5 mentions are not included. Additionally, posts mentioning multiple or "Other" companies are not included.', fontsize = 50, color='black')

plt.tight_layout()

# Save figure
plt.savefig('../../website-source/images/submissions_comments_by_company.png', dpi = 300)

plt.show()

As expected, Amazon and Google are the clear front-runners, surpassing other “FAANG” companies like Apple, Netflix, and Facebook. Medium is an interesting front-runner as well; Medium hosts many informative articles and code samples, which might explain the high number of submissions and comments mentioning it.

Top 20 Companies Mentioned with Compensation Comparison

Below are the total mention counts for the top twenty companies.

Code
exclude = ['Other', 'None', 'Multiple', 'Test', 'Here']
filtered_df = filtered_df[~filtered_df['company'].isin(exclude)]
company_counts = filtered_df.groupBy('company').count()
top_20_companies = company_counts.orderBy(F.col('count').desc()).limit(20)
top_20_companies.show()
company count
Google 1279
Medium 993
Move 967
Amazon 859
NICE 528
Wish 442
Discord 370
zoom 258
Target 253
Reddit 210
Match 175
GitHub 163
Microsoft 125
Indeed 121
HackerRank 112
Gap 109
Button 106
Facebook 98
Wise 92
Intuitive 87

With this list of top twenty companies, the compensation metrics can be compared to see whether any relationships emerge.

Code
# Note: column_list, title_list, x_axis_list, and color_list (one entry per compensation
# metric) are assumed to be defined in an earlier chunk that is not shown here.
import altair as alt

# Aggregate data by company and calculate the mean for each metric
data_by_company = tech_df.groupby('company')[column_list].mean().reset_index()

company_list = ["Google", "Medium", "Move", "Amazon", "NICE", "Wish", "Discord", "zoom", 
"Target", "Reddit", "Match", "GitHub", "Microsoft", "Indeed", "HackerRank", "Gap", 
"Button", "Facebook", "Wise", "Intuitive"]

data_by_company["top20"] = data_by_company["company"].isin(company_list)

top20 = data_by_company[data_by_company["top20"] == True]

ordered_top20 = {
    "average_yearly_compensation": list(top20.sort_values('average_yearly_compensation', ascending = False)['company']),
    "average_years_experience": list(top20.sort_values('average_years_experience', ascending = False)['company']),
    "average_years_at_company": list(top20.sort_values('average_years_at_company', ascending = False)['company']),
    "average_base_salary": list(top20.sort_values('average_base_salary', ascending = False)['company']),
    "average_stock_grant_value": list(top20.sort_values('average_stock_grant_value', ascending = False)['company']),
    "average_bonus": list(top20.sort_values('average_bonus', ascending = False)['company'])
}

# Create bar plots for each metric
bar_charts = []
for i in range(len(title_list)):
    chart = alt.Chart(top20).mark_bar().encode(
        x = alt.X('company', title = 'Company', sort = ordered_top20[column_list[i]]),
        y = alt.Y(column_list[i], title = x_axis_list[i]),
        color = alt.ColorValue(color_list[i])
    ).properties(
        title = f"{title_list[i]} for Top 20 Mentioned Companies",
        width = 400,
        height = 220
    )
    bar_charts.append(chart)

# Concatenate bar charts in a grid layout
final_bars = alt.vconcat(
    alt.hconcat(bar_charts[0], bar_charts[1]),
    alt.hconcat(bar_charts[2], bar_charts[3]),
    alt.hconcat(bar_charts[4], bar_charts[5])
)

FAANG companies like Facebook, Amazon, and Google tend to be in the top half of the twenty most mentioned companies on these compensation metrics. Overall, the charts suggest that a mixture of FAANG and non-FAANG companies offer high compensation. Interestingly, the companies with the highest compensation do not necessarily have the most years of experience or years at the company, suggesting that other factors play a role in compensation packages. The plots also show variability in how long employees stay with these companies, which implies that factors other than compensation may influence tenure.

2) Conducting Natural Language Processing Work

Topic 5: NLP Trends for Reddit Interview Content

Goal: Identify topics within the posts/comments of the interview subreddits to uncover general themes for students to study further. This can highlight what types of subject matter are being discussed the most, whether coding topics or behavioral questions.

From a technical perspective, this portion of the analysis begins with identifying the most common words across the dataset, using techniques such as tokenization, stop word removal, and word count aggregation. Next, the analysis explores the distribution of text lengths, offering a statistical overview of the dataset’s text size. The third step identifies important words using Term Frequency-Inverse Document Frequency (TF-IDF), a method that highlights words that are unique to specific documents within the corpus. Transforming the text with multiple NLP techniques, including URL removal, tokenization, normalization, and stemming, created a cleaner version of the dataset for improved text analysis. Finally, dummy variables based on regex patterns allow texts to be categorized into technical domains such as programming languages and coding libraries.

From a business perspective, the technical analysis in this section accomplishes goals such as insight into key topics and trends as well as data-driven decision making. By identifying the most common words and important terms (via TF-IDF), the analysis uncovers the main themes and topics prevalent in the dataset. This can be crucial for job seekers in understanding prevalent interview trends or job seeker concerns.

Most Common Words Overall

The table below shows that the most common words include “leetcode” and “questions”, which might help steer job applicants toward focusing their study efforts on these subjects.

Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import length, explode, split
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF
from pyspark.sql.functions import col
import matplotlib.pyplot as plt

# tokenize 
tokenizer = Tokenizer(inputCol="body", outputCol="words")
df_words = tokenizer.transform(df)

# remove stop words 
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
df_filtered = remover.transform(df_words)

# explode the array into  words
df_exploded = df_filtered.select(explode(col("filtered_words")).alias("word"))

# filter out single character words
df_filtered_single_char = df_exploded.filter(length(col("word")) > 1)

# count 
df_word_count = df_filtered_single_char.groupBy("word").count()
df_word_count.orderBy(col("count").desc()).show()
word count
Interview 22631
Like 18719
Get 17015
Questions 13818
Time 13330
One 13245
Good 12364
Know 11881
Job 11718
Leetcode 11557
Think 11144
Also 9331
Want 9048
Really 8620
Need 8564
Problems 8560
[deleted] 8453
Problem 8444
Work 8017
First 7521

Distribution of Text Lengths

These statistics confirm that the dataset contains a large number of posts/comments, with an average length of about 234 characters. The standard deviation is quite high, indicating large variability in text length.

Code
# calc length of each text
df = df.withColumn("text_length", length(df["body"]))
df.describe("text_length").show()
summary text length
Count 128339
Mean 233.57049688715043
Standard Deviation 480.0254950591896
Minimum 1
Maximum 38271

Important Words According to TF-IDF

The list of words and their TF-IDF scores provides insight into the topics and concerns that are most prominent in the Reddit posts and comments surrounding technical interviews. The most important term, ‘spaced’, is an unusual observation, as it is unclear why it would be a key concept in technical interview subreddits. Other terms such as ‘interview’, ‘simple’, and ‘subscribe’ are more in line with expectations, as they indicate a focus on interview processes, seeking straightforward advice or resources, and encouragement to follow or subscribe to threads or channels for updates on interview strategies. The presence of emotionally charged words like ‘ugh’ and ‘crushing’ may reflect the anxiety and difficulty of technical interview preparation.
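
Conceptually, each TF-IDF score is a term’s frequency in a document multiplied by its inverse document frequency. A minimal toy computation, assuming the smoothed IDF formula that Spark ML’s IDF estimator uses:

Code
import math

def tf_idf(term_count_in_doc, total_docs, docs_containing_term):
    # term frequency: raw count of the term in this document
    tf = term_count_in_doc
    # smoothed inverse document frequency: log((|D| + 1) / (df + 1))
    idf = math.log((total_docs + 1) / (docs_containing_term + 1))
    return tf * idf

# a term used 3 times but present in almost every document scores low
print(tf_idf(3, 10000, 9500))   # ~0.15
# a term used 3 times but present in only a handful of documents scores high
print(tf_idf(3, 10000, 5))      # ~22.3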

Code
# term frequency
df_sampled = df_filtered.sample(False, 0.1)
cv = CountVectorizer(inputCol="filtered_words", outputCol="raw_features")
cv_model = cv.fit(df_sampled)
df_featurized = cv_model.transform(df_sampled)

# idf
idf = IDF(inputCol="raw_features", outputCol="features")
idf_model = idf.fit(df_featurized)
df_rescaled = idf_model.transform(df_featurized)

tf_idf_pandas_df = df_rescaled.select("features").toPandas()

def extract_top_words(features, vocabulary, n_top_words=3):
    dense_vector = features.toArray()
    top_indices = dense_vector.argsort()[-n_top_words:][::-1]
    top_words = [(vocabulary[i], dense_vector[i]) for i in top_indices]
    return top_words

vocabulary = cv_model.vocabulary

# extract top words for each row
n_top_words = 1 
top_words_list = [extract_top_words(row['features'], vocabulary, n_top_words) for index, row in tf_idf_pandas_df.iterrows()]
top_words_list[35:45]
Word TF-IDF Score
Spaced 13.378421024061316
Bud 8.768652053710493
Pill 8.768652053710493
Interview 8.768652053710493
Simple 9.794902085605205
Ugh 7.670039765042384
Rated 7.6700397650423845
Crushing 7.2645746569342196
[deleted] 2.7537151504173694
Subscribe 7.159214141276394

Cleaning the Text Using 5+ NLP Techniques

In this section, more detailed text-cleaning techniques are applied. The following code uses Spark NLP methods such as tokenizing, normalizing (removing punctuation), removing stop words, and stemming to clean the Reddit content.

Code
# Imports needed for the cleaning pipeline
import re
from pyspark.ml import Pipeline
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer, StopWordsCleaner, Stemmer

# remove https links 
def remove_https_links(text):
    return re.sub(r'https?:\/\/\S+', '', text)

remove_https_links_udf = udf(remove_https_links, StringType())

df_no_url = df.withColumn("body_no_url", remove_https_links_udf("body"))

df_no_url.select('body_no_url').show(4, truncate=False)

# set up spark pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("body_no_url") \
    .setOutputCol("document")

# tokenize
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")


# normalize words to remove unnecessary punctuation
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True) \
    .setCleanupPatterns(["[^A-Za-z0-9 ]"]) 


# remove stop words
stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

# stem words
stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")

# sentence_detector = SentenceDetector() \
#     .setInputCols(["cleanTokens"]) \
#     .setOutputCol("sentence")

finisher = Finisher() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCols(["finished_ngrams"]) \
    .setOutputAsArray(False) \
    .setCleanAnnotations(True)


# Define the pipeline
pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    normalizer,
    stopwords_cleaner,
    stemmer,
    finisher
])
# run the pipeline 
model = pipeline.fit(df_no_url)
clean_df = model.transform(df_no_url)

from pyspark.sql.functions import regexp_replace
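# The Finisher joined the cleaned tokens into a single string delimited by "@", so replace the delimiter with spaces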
clean_df_nlp = clean_df.withColumn("clean_body", regexp_replace("finished_ngrams", "@", " "))
clean_df_nlp.select("clean_body").show(5,truncate=False)

Preview of posts and comments after cleaning:

clean body
messaging 21 days 20230128 065831 utc remind link click link send pm also reminded reduce spam parent commenter delete message hide others info reminders sorry didnt see thank howd get change anything prep strategy

Topic 6: Identification of Most Important Skills for Technical Interviewing

Goal: Hone in on the most important and most frequent skills discussed on Reddit for technical interviews. Explore mention frequency of various programming languages, tools, and libraries. Also identify the most important stages of the technical interview.

Creating New Dummy Variables

To focus on topics such as programming languages, tools, and libraries, new variables are required.

Code
from pyspark.sql.functions import regexp_extract, col

language_pattern = r'(?i)(Python|SQL|SparkSQL|PySpark|Java|Scala|Julia|MATLAB|C\+\+|SAS|NoSQL|Hadoop|Hive|BigQuery)'
library_pattern = r'(?i)(Pandas|Tensorflow|Pytorch|sklearn|Keras|NumPy|SciPy|Matplotlib|Seaborn|Dask|NLTK|Gensim|Scrapy|Statsmodels|XGBoost|LightGBM|Jupyter|FastAI|Spacy|Theano|Yolo)'
tool_pattern = r'(?i)(Tableau|Shiny|Sagemaker|Azure|PowerBI|Databricks|Alteryx|Qlik|Looker|Google Cloud Platform|AWS|Dataiku|Docker|Kubernetes|Airflow|ML)'
stage_pattern = r'(?i)(initial screening|technical round|HR round|final round|offer stage)'

clean_df_nlp_var = clean_df_nlp.withColumn("programming_language_mentioned", (regexp_extract(col("clean_body"), language_pattern, 0) != '').cast('int'))
clean_df_nlp_var = clean_df_nlp_var.withColumn("coding_library_mentioned", (regexp_extract(col("clean_body"), library_pattern, 0) != '').cast('int'))
clean_df_nlp_var = clean_df_nlp_var.withColumn("ds_tool_mentioned", (regexp_extract(col("clean_body"), tool_pattern, 0) != '').cast('int'))
clean_df_nlp_var = clean_df_nlp_var.withColumn("interview_stage_mentioned", (regexp_extract(col("clean_body"), stage_pattern, 0) != '').cast('int'))

from pyspark.sql.functions import col, count, mean, sum

summary_stats = clean_df_nlp_var.agg(
    count("*").alias("total_count"),
    sum(col("programming_language_mentioned")).alias("sum_programming_language_mentioned"),
    sum(col("coding_library_mentioned")).alias("sum_coding_library_mentioned"),
    sum(col("ds_tool_mentioned")).alias("sum_ds_tool_mentioned"),
    sum(col("interview_stage_mentioned")).alias("sum_interview_stage_mentioned")
)

summary_stats.show(truncate=False)

Statistical summary table for regex patterns from new variables:

| total count | sum of programming languages mentioned | sum of coding libraries mentioned | sum of ds tools mentioned | sum of interview stages mentioned |
|---|---|---|---|---|
| 128339 | 5188 | 53 | 1004 | 476 |

Visualization for regex patterns from new variables:

The extraction and visualization of patterns through regex-based categorization, covering elements like programming languages and coding libraries, offers additional guidance. The sunburst visualization, for example, highlights the prevalence of programming languages, especially Python, within the text. This observation can streamline a job applicant’s study plan by emphasizing Python, thereby optimizing interview preparation and enhancing their chances of success.

Code
# filter out row where no programming language is mentioned
df_w_languages = clean_df_nlp_var.filter(col("programming_language_mentioned")==1)
df_w_libraries = clean_df_nlp_var.filter(col("coding_library_mentioned")==1)
df_w_tools = clean_df_nlp_var.filter(col("ds_tool_mentioned")==1)
df_w_stage = clean_df_nlp_var.filter(col("interview_stage_mentioned")==1)

# write function to extract language frequency in each text 
def extract_pattern(pattern,text):
    return re.findall(pattern, text) 

# create udf
extract_pattern_udf = udf(extract_pattern, ArrayType(StringType()))

# add a column with the pattern to the DataFrames
df_w_languages = df_w_languages.withColumn("pattern", lit(language_pattern))
df_w_libraries = df_w_libraries.withColumn("pattern", lit(library_pattern))
df_w_tools = df_w_tools.withColumn("pattern", lit(tool_pattern))
df_w_stage = df_w_stage.withColumn("pattern", lit(stage_pattern))

# apply the UDF 
df_w_languages = df_w_languages.withColumn("mentioned_languages", extract_pattern_udf(col("pattern"), col("clean_body")))
df_w_libraries = df_w_libraries.withColumn("mentioned_libraries", extract_pattern_udf(col("pattern"), col("clean_body")))
df_w_tools = df_w_tools.withColumn("mentioned_tools", extract_pattern_udf(col("pattern"), col("clean_body")))
df_w_stage = df_w_stage.withColumn("mentioned_stages", extract_pattern_udf(col("pattern"), col("clean_body")))

# explode the mentioned languages
exploded_languages = df_w_languages.withColumn("language", explode(col("mentioned_languages")))
exploded_libraries = df_w_libraries.withColumn("library", explode(col("mentioned_libraries")))
exploded_tools = df_w_tools.withColumn("tool", explode(col("mentioned_tools")))
exploded_stages = df_w_stage.withColumn("stage", explode(col("mentioned_stages")))

# group by language and count
language_cts = exploded_languages.groupBy("language").count()
library_cts = exploded_libraries.groupBy("library").count()
tool_cts = exploded_tools.groupBy("tool").count()
stage_cts = exploded_stages.groupBy("stage").count()

# add col describing pattern
language_cts = language_cts.withColumn("pattern_topic", lit("programming language"))
library_cts = library_cts.withColumn("pattern_topic", lit("coding library"))
tool_cts = tool_cts.withColumn("pattern_topic", lit("data science tool"))
stage_cts = stage_cts.withColumn("pattern_topic", lit("interview stage"))

language_cts = language_cts.withColumnRenamed("language", "pattern")
library_cts = library_cts.withColumnRenamed("library", "pattern")
tool_cts = tool_cts.withColumnRenamed("tool", "pattern")
stage_cts = stage_cts.withColumnRenamed("stage", "pattern")

# convert to pandas for future visualization
language_cts_pd = language_cts.toPandas()
library_cts_pd = library_cts.toPandas()
tool_cts_pd = tool_cts.toPandas()
stage_cts_pd = stage_cts.toPandas()

# union pd dataframes for visualizations later
combined_cts_pd = pd.concat([language_cts_pd, library_cts_pd, tool_cts_pd, stage_cts_pd], ignore_index=True)

# visualize
import plotly.express as px
color_map = {
    'programming language': '#163d6b', 
    'coding library': '#204321',
    'data science tool': '#82B8B5' ,
    'interview stage' :'#5E9751'}
    
fig = px.sunburst(
    combined_cts_pd,
    path=['pattern_topic', 'pattern'],
    values='count',
    color='pattern_topic',
    color_discrete_map=color_map, 
    title='Distribution of RegEx Topic Patterns'
)

fig.update_layout(width=650, height=650)
fig.show()

3) Building a Sentiment Model

Topic 7: Sentiment Analysis of Interview Posts & Comments

Goal: Observe the overall sentiment of posts and comments related to the filtered dataset that focuses on interviews. This will help identify how candidates are feeling about the interview process, which can draw insight into a specific employer or topic interview difficulty level.

For the sentiment analysis, the Spark NLP package from John Snow Labs was used (loaded through the spark.jars.packages configuration). The John Snow Labs package includes several pre-trained models; the one best suited for this analysis is sentimentdl_use_twitter. The pipeline consists of four parts. The first part, the documentAssembler, takes the text column (in this case, the ‘body’ column) as input. The second part applies the Universal Sentence Encoder from TensorFlow Hub, using the output of the documentAssembler as its input. The third part is the sentiment deep learning pretrained model, which takes the sentence embeddings as input and applies the John Snow Labs sentimentdl_use_twitter model for English; it outputs a sentiment for each row of the ‘body’ column. The final part is the NLP Pipeline, which chains the previous stages together so the data frame can be fed through them. The pipeline produces a new sentiment column with three options - positive, neutral, and negative. The last step of the sentiment analysis is transforming the output of the pipeline into a results dataframe.

Code
! pip install sparknlp
# Import packages
import json
import sparknlp
import numpy as np
import pandas as pd
import sagemaker
import pyspark.sql.functions as F
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sagemaker.spark.processing import PySparkProcessor
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, SentimentDLModel
from sparknlp.pretrained import PretrainedPipeline

sess = sagemaker.Session()

# Initialize Spark session for Spark NLP
spark = SparkSession.builder\
    .appName("PySparkApp")\
    .master("local[*]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3,org.apache.hadoop:hadoop-aws:3.2.2")\
    .config("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.ContainerCredentialsProvider")\
    .getOrCreate()

%%time
import sagemaker

session = sagemaker.Session()
#s3://sagemaker-us-east-1-131536150362/project/data_reformat_clean_nlp.parquet/
bucket = "tm1450-project"
output_prefix_data_submissions = "data_reformat_clean.parquet/"

s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading tagged data from {s3_path}")
df = spark.read.parquet(s3_path, header=True)
#print(f"shape of the submissions dataframe is {df.count():,}x{len(df.columns)}")
# Define model names and pipeline components
MODEL_NAME = 'sentimentdl_use_twitter'

documentAssembler = DocumentAssembler()\
    .setInputCol("body")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

sentimentdl = SentimentDLModel.pretrained(name=MODEL_NAME, lang="en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

nlpPipeline = Pipeline(
      stages = [
          documentAssembler,
          use,
          sentimentdl
      ])
pipelineModel = nlpPipeline.fit(df)
results = pipelineModel.transform(df)

Schema:

root
 |-- link_id: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- body: string (nullable = true)
 |-- is_tech: boolean (nullable = true)
 |-- is_asking_for_help: boolean (nullable = true)
 |-- is_asking_a_question: boolean (nullable = true)
 |-- is_FAANG: boolean (nullable = true)
 |-- isSubmission: boolean (nullable = true)
 |-- body_no_url: string (nullable = true)
 |-- finished_ngrams: string (nullable = true)
 |-- clean_body: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentiment: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)

Tables of Summary Statistics

Below is a table that provides a glimpse of the data, showing how the sentiment analysis created a new variable that allows further analysis, such as what sentiment looks like for the entire dataset or for specific attributes such as posts/comments that ask for help or mention FAANG companies.

Code
from pyspark.sql.functions import explode, col

# Explode the sentiment array and extract the result field
results = results.withColumn('sentiment', F.explode('sentiment'))
results = results.withColumn('sentiment', F.col('sentiment.result'))

# Selecting columns for the final DataFrame
final_data = results.select(
    'is_tech', 'is_asking_for_help', 'is_asking_a_question', 
    'is_FAANG', 'clean_body', 'sentiment'
)

# Persist and show the DataFrame
final_data.persist()
final_data.show()
| technical reddit | asking for help? | asking a question? | FAANG reddit | clean body | sentiment |
|---|---|---|---|---|---|
| True | False | False | False | messaging 21 days… | Neutral |
| True | False | True | False | sorry didnt see t… | Positive |
| False | False | False | False | howd get | Positive |

Full Dataframe Sentiment Counts

Positive sentiment dominates the full dataset, although the sizable negative count shows that a substantial number of negative experiences and thoughts are also being shared.

Code
sentiment_counts = final_data.groupBy("sentiment").count()
sentiment_counts.show()
sentiment count
Positive 81780
Neutral 4164
Negative 42395

Sentiment Counts for is_tech Column

The is_tech column represents a subset of the data where the focus is on technical aspects of interviews. The sentiment summary for this subset shows positive sentiments far outnumbering negative ones, suggesting that discussions in this category are generally positive. This is surprising, as the technical aspects of interviews are usually the most stressful, so it offers a bit of hope to those starting the long road of technical interview preparation.

Code
tech_data = final_data.filter(col('is_tech') == True)
sentiment_counts = tech_data.groupBy('sentiment').count()
print('Sentiment Counts for is_tech Column')
sentiment_counts.show()
sentiment count
Positive 20776
Neutral 979
Negative 7629

Sentiment Counts for is_asking_for_help Column

The is_asking_for_help column represents a data subset focusing on posts where individuals are seeking assistance. Positive sentiment is more than double the negative sentiment, which may signify a supportive environment for those asking for help, again providing hope for those starting the challenge of preparing for technical interviews.

Code
help_data = final_data.filter(col('is_asking_for_help') == True)
sentiment_counts = help_data.groupBy('sentiment').count()
print('Sentiment Counts for is_asking_for_help Column')
sentiment_counts.show()
sentiment count
Positive 9839
Neutral 346
Negative 4602

Sentiment Counts for is_asking_a_question Column

The table below shows a higher number of positive sentiments than negative ones for posts/comments that ask a question. This might suggest that the Reddit community is usually supportive when individuals ask questions, creating an environment for information exchange and learning. Knowing this may encourage new users seeking technical interview advice to be less afraid of asking questions.
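
These counts can be reproduced with the same pattern used for the other flags; a minimal sketch mirroring the neighboring subsections:

Code
question_data = final_data.filter(col('is_asking_a_question') == True)
sentiment_counts = question_data.groupBy('sentiment').count()
print('Sentiment Counts for is_asking_a_question Column')
sentiment_counts.show()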

sentiment count
Positive 20014
Neutral 867
Negative 13028

Sentiment Counts for is_FAANG Column

The is_FAANG posts/comments, which discuss interviews at FAANG companies (Facebook, Amazon, Apple, Netflix, Google), show that sentiment is more positive than negative. This indicates that conversations about interviews at these top tech companies lean positive, which could be due to successful interview experiences or helpful insights posted by users who have interviewed with these companies.

Code
faang_data = final_data.filter(col('is_FAANG') == True)
sentiment_counts = faang_data.groupBy('sentiment').count()
print('Sentiment Counts for is_FAANG Column')
sentiment_counts.show()
sentiment count
Positive 3488
Neutral 171
Negative 1554

Plot of Summary Statistics

The plot below displays a more succinct summary of the count of posts (submissions and comments) for each sentiment classification broken down by post type. Again, the text content types include:

  1. A post asking for help

  2. A post asking a question

  3. A post mentioning a FAANG company

  4. A technical post

These topics were identified using regex, as described above and in previous parts of the website. In general, for the full dataset, sentiments for submissions and comments are mostly positive: positive submissions and comments are almost double the number with negative sentiment. For texts where the is_tech column is true, the ratio of positive to negative sentiments is higher than for any other column; the opposite is true for rows where the is_asking_for_help column is true. Neutral sentiment is the least common overall and within each of the column_name = True subsets.

Code
import pyspark.sql.functions as F
import pandas as pd

# Get sentiment counts for technical submissions/comments
df_tech_counts = final_data\
    .filter(F.col('is_tech'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Technical'))

# Get sentiment counts for submissions/comments that ask for help
df_help_counts = final_data\
    .filter(F.col('is_asking_for_help'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Asking for help'))

# Get sentiment counts for submissions/comments that ask a question
df_question_counts = final_data\
    .filter(F.col('is_asking_a_question'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Asking question'))

# Get sentiment counts for submissions/comments that mention a FAANG
# company
df_FAANG_counts = final_data\
    .filter(F.col('is_FAANG'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Mentions FAANG'))

# Concatenate the count dfs together and convert to pandas
df_count_summary = df_tech_counts\
    .union(df_help_counts)\
    .union(df_question_counts)\
    .union(df_FAANG_counts)\
    .toPandas()

# Pivot dataframe
df_plot = df_count_summary.pivot(index='category', columns='sentiment', values='count').reset_index()

# Code referenced: https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_colors.html
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(len(df_plot['category']))  # the label locations
width = 0.25  # the width of the bars
multiplier = 0

color_dict = {
    'negative': '#251279',
    'neutral': '#6CACED',
    'positive': '#5E9751'
}

fig, ax = plt.subplots(layout='constrained')

# Cycle through each sentiment label (negative, neutral, positive)
for sentiment_label in df_plot.columns:
    # Skip the category column
    if sentiment_label != 'category':
        offset = width * multiplier
        rects = ax.bar(
            x + offset,
            df_plot[sentiment_label],
            width,
            label=sentiment_label.title(),
            color=color_dict[sentiment_label]
        )
        ax.bar_label(rects, padding=3)
        multiplier += 1

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Number of Submissions/Comments')
ax.set_title('Distribution of Sentiment')
ax.set_xticks(x + width, df_plot['category'])
ax.legend(loc='upper left')
ax.set_ylim(0, 22000)

plt.savefig('../../website-source/images/sentiment_dist.png', dpi=300)
plt.show()

                                    "yearsatcompany", "basesalary", "stockgrantvalue", "bonus"]]
# Get a summary table of averages by company and role type
summary_df = salaries_subset.groupby(["company", "role_type"]).agg({
    "totalyearlycompensation": "mean", "yearsofexperience": "mean",
    "yearsatcompany": "mean", "basesalary": "mean",
    "stockgrantvalue": "mean", "bonus": "mean"}).reset_index()
summary_df.columns = ["company", "role_type", "average_yearly_compensation", "average_years_experience", "average_years_at_company", 
                        "average_base_salary", "average_stock_grant_value", "average_bonus"]

# Subset for just FAANG (Facebook, Amazon, Apple, Netflix, Google)
# faang_df = summary_df[(summary_df["company"] == "Facebook") | (summary_df["company"] == "Amazon") | (summary_df["company"] == "Apple") | (summary_df["company"] == "Netflix") | (summary_df["company"] == "Google")]

summary_df.head(10)
```

The plots below depict the distribution of the six numeric variables in the external data set: average yearly compensation, average base salary, average years spent at the company, average years of experience, average stock grant value, and average bonus. From this point forward, the only focus will be on "tech" roles; hence, the figures below depict the distributions for each metric for the tech roles only.
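The plotting code for these histograms is not reproduced on this page; a minimal sketch of how they could be generated from the `summary_df` built above (an assumed reconstruction, not the original script) is shown below.

```{python}
#| eval: False
# Assumed sketch (not the original plotting code): histograms of the six
# numeric compensation metrics, restricted to tech roles
import matplotlib.pyplot as plt

tech_only = summary_df[summary_df["role_type"] == "tech"]
metrics = ["average_yearly_compensation", "average_base_salary",
           "average_years_at_company", "average_years_experience",
           "average_stock_grant_value", "average_bonus"]

fig, axes = plt.subplots(3, 2, figsize=(12, 14))
for ax, metric in zip(axes.flatten(), metrics):
    ax.hist(tech_only[metric].dropna(), bins=30)
    ax.set_title(metric)
plt.tight_layout()
plt.show()
```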

![](images/external_histograms.png){width="1000" height="1100"}

In the EDA portion, the data set was tagged based on whether or not a FAANG company was mentioned. FAANG includes Facebook, Apple, Amazon, Netflix, and Google. To take it further, all of the companies present in the external data set were searched for and tagged accordingly. Then, the two data sets were merged based on the company mentioned in each row. This did not apply to cases where multiple companies were mentioned in a post or comment or where no company was mentioned (null). Below is the code used to tag the data set with additional companies and merge it with the external data set.

#### Preparing the Reddit Data for Merging

```{python}
#| eval: False

# Import packages
import time
import sagemaker
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import matplotlib.pyplot as plt
import pyspark.sql.functions as F
import pandas as pd
import datetime
from pyspark.sql import SparkSession
from s3fs import S3FileSystem
from sagemaker.spark.processing import PySparkProcessor
from pyspark.sql.functions import concat, lit, lower, col, when, array, expr, count, isnan, length, udf
from pyspark.sql.types import ArrayType, StringType

# Build Spark session
spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

print(spark.version)

# Read in the filtered submissions
%%time
session = sagemaker.Session()
bucket = "sagemaker-us-east-1-158894274492"
output_prefix_data_submissions = f"project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")

# Read in the filtered comments
%%time
output_prefix_data_comments = "project/comments/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")

# Read in the external data set
all_salary_info = pd.read_csv("../data/external_data/all_company_salary_info_CLEAN.csv")
# Drop index column
all_salary_info = all_salary_info.drop(["Unnamed: 0"], axis = 1)
# Round float values
all_salary_info = all_salary_info.apply(lambda x: round(x, 2) if x.dtype == 'float64' else x)
# Get just the tech salaries
tech_df = all_salary_info[all_salary_info["role_type"] == "tech"]

# Getting only the columns we need for submissions and comments
submissions_subset_col = submissions.select("id", "subreddit", "subreddit_id", "title", "selftext", col("created_utc").alias("created_utc_submission"), "num_comments")
comments_subset_col = comments.select("link_id", "body", col("created_utc").alias("created_utc_comment"))
# Making a "searchable_text" column for tagging
submissions_subset_col = submissions_subset_col.withColumn("searchable_text", concat(submissions_subset_col["title"], lit(" "), submissions_subset_col["selftext"]))

## TAGGING SUBMISSIONS
keywords_is_tech = ["tech", "technology", "code", "coding", "computer", "data", "python","java", "javascript", "sql", "database", "statistics", "azure", "cloud computing", "programming", "leetcode", "saas", "faang"]
pattern = "(?i)" + "|(?i)".join(keywords_is_tech)

submissions_tagged = submissions_subset_col\
                        .withColumn('is_tech_submissions', col('searchable_text').rlike(pattern))

keywords_help = ["help me", "struggle", "struggling", "anyone", "advice", "tips", "help"]
pattern = "(?i)" + "|(?i)".join(keywords_help)

submissions_tagged = submissions_tagged\
                        .withColumn('is_asking_for_help_submissions', col('searchable_text').rlike(pattern))

submissions_tagged = submissions_tagged\
                        .withColumn('is_asking_a_question_submissions', col('searchable_text').rlike(r'(\?)+'))

keywords_faang = ["facebook", "amazon", "apple", "netflix", "google"]
pattern = "(?i)" + "|(?i)".join(keywords_faang)

submissions_tagged = submissions_tagged\
                        .withColumn('is_FAANG_submissions', col('searchable_text').rlike(pattern))\
                        .cache()

# Store companies mentioned in Reddit text from companies present in the external data set
companies = list(tech_df.company.unique())
# Make a udf to get mentioned companies 
@udf(ArrayType(StringType()))
def get_company_mentions(text):
    return [company for company in companies if " " + company.lower() + " " in text.lower()]
submissions_tagged = submissions_tagged.withColumn("company_mentions", get_company_mentions(col("searchable_text")))

# Make another UDF for the final tagged column (None if no company is found, "Multiple" if 2+ are found, else the single company mentioned)
@udf(StringType())
def clean_company_mentions(company_list):
    if not company_list:
        return None
    elif len(company_list) > 1:
        return "Multiple"
    else:
        return company_list[0]
submissions_tagged = submissions_tagged.withColumn("final_company_mentioned", clean_company_mentions(col("company_mentions")))

## TAGGING COMMENTS
keywords_is_tech = ["tech", "technology", "code", "coding", "computer", "data", "python","java", "javascript", "sql", "database", "statistics", "azure", "cloud computing", "programming", "leetcode", "saas", "faang"]
pattern = "(?i)" + "|(?i)".join(keywords_is_tech)

comments_tagged = comments_subset_col\
                        .withColumn('is_tech_comments', col('body').rlike(pattern))

keywords_help = ["help me", "struggle", "struggling", "anyone", "advice", "tips", "help"]
pattern = "(?i)" + "|(?i)".join(keywords_help)

comments_tagged = comments_tagged\
                        .withColumn('is_asking_for_help_comments', col('body').rlike(pattern))

comments_tagged = comments_tagged\
                        .withColumn('is_asking_a_question_comments', col('body').rlike(r'(\?)+'))

keywords_faang = ["facebook", "amazon", "apple", "netflix", "google"]
pattern = "(?i)" + "|(?i)".join(keywords_faang)

comments_tagged = comments_tagged\
                        .withColumn('is_FAANG_comments', col('body').rlike(pattern))\
                        .cache()

# Store companies mentioned in Reddit text from companies present in the external data set
companies = list(tech_df.company.unique())
# Make a udf to get mentioned companies 
@udf(ArrayType(StringType()))
def get_company_mentions(text):
    return [company for company in companies if " " + company.lower() + " " in text.lower()]
comments_tagged = comments_tagged.withColumn("company_mentions", get_company_mentions(col("body")))

# Make another UDF for the final tagged column (None if no company is found, "Multiple" if 2+ are found, else the single company mentioned)
@udf(StringType())
def clean_company_mentions(company_list):
    if not company_list:
        return None
    elif len(company_list) > 1:
        return "Multiple"
    else:
        return company_list[0]
comments_tagged = comments_tagged.withColumn("final_company_mentioned", clean_company_mentions(col("company_mentions")))

## JOIN COMMENTS AND SUBMISSIONS
# Rename some columns for this reformatting
comments_tagged = (comments_tagged
                   .withColumnRenamed("company_mentions", "company_mentions_comments")
                   .withColumnRenamed("final_company_mentioned", "final_company_mentioned_comments"))
submissions_tagged = (submissions_tagged
                   .withColumnRenamed("company_mentions", "company_mentions_submissions")
                   .withColumnRenamed("final_company_mentioned", "final_company_mentioned_submissions"))


# Clean comments link_id "t3_#####" to match submissions id "#####"
comments_new_id = comments_tagged.withColumn(
    'link_id',
    expr("substring(link_id, 4, length(link_id)-3)")
)

# Join comments with submissions
df = comments_new_id.join(submissions_tagged, comments_new_id['link_id'] == submissions_tagged['id'], 'outer')

# Reformatting

# Separate into two dataframes (comments and submissions)
df_submissions = df.filter(df.created_utc_submission.isNotNull())\
                    .select(
                        F.col('id').alias('link_id'), 
                        F.col('created_utc_submission').alias('created_utc'),
                        F.col('searchable_text').alias('body'),
                        F.col('is_tech_submissions').alias('is_tech'),
                        F.col('is_asking_for_help_submissions').alias('is_asking_for_help'),
                        F.col('is_asking_a_question_submissions').alias('is_asking_a_question'),
                        F.col('is_FAANG_submissions').alias('is_FAANG'),
                        F.col('company_mentions_submissions').alias('company_mentions'),
                        F.col('final_company_mentioned_submissions').alias('final_company_mentioned'),
                        F.lit(True).alias('isSubmission')
                    )\
                    .distinct()

df_comments = df.filter(df.created_utc_comment.isNotNull())\
                    .select(
                        'link_id', 
                        F.col('created_utc_comment').alias('created_utc'),
                        'body',
                        F.col('is_tech_comments').alias('is_tech'),
                        F.col('is_asking_for_help_comments').alias('is_asking_for_help'),
                        F.col('is_asking_a_question_comments').alias('is_asking_a_question'),
                        F.col('is_FAANG_comments').alias('is_FAANG'),
                        F.col('company_mentions_comments').alias('company_mentions'),
                        F.col('final_company_mentioned_comments').alias('final_company_mentioned'),
                        F.lit(False).alias('isSubmission')
                    )\
                    .distinct()

# Concatenate comments and submissions
df_clean = df_submissions.union(df_comments)

## MERGE WITH EXTERNAL DATA SET
# Rename for merge
df_clean = (df_clean
            .withColumnRenamed("final_company_mentioned", "company"))
# Convert external to spark data frame
tech_spark_df = spark.createDataFrame(tech_df)
# Join on final_company_mentioned column
merged_df = df_clean.join(tech_spark_df, on = "company", how = "left")
```

|      | link_id | created_utc         | body                 | is_tech | is_asking_for_help | is_asking_a\_question | is_FAANG | company_mentions | isSubmission | role_type | average_yearly_compensation | average_years_experience | average_years_at_company | average_base_salary | average_stock_grant_value | average_bonus |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| null | wghtce  | 2022-08-05 01:11:58 | Applying for Simi... | false   | false              | false                 | false    | \[\]             | true         | null      | null                        | null                     | null                     | null                | null                      | null          |
| null | sbtver  | 2022-01-24 19:13:22 | Leetcode Premium ... | true    | false              | false                 | false    | \[\]             | true         | null      | null                        | null                     | null                     | null                | null                      | null          |
| null | ue9ggm  | 2022-04-29 00:32:14 | Explanation of Le... | true    | false              | false                 | false    | \[\]             | true         | null      | null                        | null                     | null                     | null                | null                      | null          |

#### Data Quality Checks
At this point, the data has been cleaned and new variables have been created. This next section aims to perform quality checks on said dataset. Being aware of key dataset characteristics like schema, overall structure, and missing values helps in detecting any inconsistencies, errors, or gaps that could influence subsequent analyses. 
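
The schema shown below is the standard output of Spark's `printSchema()`; a minimal sketch of the call (assuming the `merged_df` produced by the join above):

```{python}
#| eval: False
# Inspect the structure of the merged dataframe (output reproduced below)
merged_df.printSchema()
```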

**Schema:**

root\
\|-- company: string (nullable = true)\
\|-- link_id: string (nullable = true)\
\|-- created_utc: timestamp (nullable = true)\
\|-- body: string (nullable = true)\
\|-- is_tech: boolean (nullable = true)\
\|-- is_asking_for_help: boolean (nullable = true)\
\|-- is_asking_a\_question: boolean (nullable = true)\
\|-- is_FAANG: boolean (nullable = true)\
\|-- company_mentions: array (nullable = true)\
\| \|-- element: string (containsNull = true)\
\|-- isSubmission: boolean (nullable = false)\
\|-- role_type: string (nullable = true)\
\|-- average_yearly_compensation: double (nullable = true)\
\|-- average_years_experience: double (nullable = true)\
\|-- average_years_at_company: double (nullable = true)\
\|-- average_base_salary: double (nullable = true)\
\|-- average_stock_grant_value: double (nullable = true)\
\|-- average_bonus: double (nullable = true)

```{python}
#| eval: False
# Print number of rows
merged_df.count()
```

**Number of Rows:** 128339

```{python}
#| eval: False
# Check for nulls
merged_df.select([count(when(col(c).isNull(), c)).alias(c) for c in merged_df.columns]).show()
```

**Null Count by Column:**

| column                      | number of nulls |
|-----------------------------|-----------------|
| company                     | 8111            |
| body                        | 8111            |
| is_tech                     | 8111            |
| is_asking_for_help          | 8111            |
| is_asking_a\_question       | 8111            |
| is_FAANG                    | 934             |
| company_mentions            | 934             |
| isSubmission                | 934             |
| role_type                   | 934             |
| average_yearly_compensation | 934             |
| average_years_experience    | 934             |

Most of the company salary information is null in the data set after the merge, which is expected. Many submissions and comments mention no companies at all, some mention multiple, and some mention companies not listed in the external data set. For the visualizations below, the nulls were removed to create a smaller, less sparse data set.

#### **TOPIC 4: MARKET SALARY TREND ANALYSIS**

*Goal:* Connect salary information on various firms with the firms discussed in the Reddit submissions and comments. Identify the firms that offer the highest salaries and contrast them with the firms that are discussed most frequently in the context of job hunting and interviewing. Also devote particular attention to the firms represented by the FAANG acronym to build off prior examination of these firms from the EDA stage.

Below are two bar plots showing the distribution of submission counts and comment counts by company. Rows tagged as null, "Other", or "Multiple" were removed to isolate the submissions and comments that mention exactly one company. Abnormal company names such as "Here" and "Test" were removed as well. Additionally, companies with **five or fewer** mentions were removed so the plots did not become too cluttered and unreadable.

```{python}
#| eval: False
# Separate df into submissions and comments
filtered_df = df.filter(df['company'].isNotNull()) # take out Nulls
submissions = filtered_df.filter(filtered_df.isSubmission == True)
comments = filtered_df.filter(filtered_df.isSubmission == False)

# Get the frequency of posts and comments by company
submission_freqs = submissions.groupBy('company').count().orderBy('company')
comments_freqs = comments.groupBy('company').count().orderBy('company')

# Convert counts to pandas
submission_freqs = submission_freqs.toPandas()
comments_freqs = comments_freqs.toPandas()

# Filter out "Other" and "None" and "Multiple" to just get individual companies
exclude = ['Other', 'None', 'Multiple', 'Test', 'Here']
submission_freqs = submission_freqs[~submission_freqs['company'].isin(exclude)]
comments_freqs = comments_freqs[~comments_freqs['company'].isin(exclude)]

# Filter out companies with small counts
submission_freqs = submission_freqs[submission_freqs['count'] > 5]
comments_freqs = comments_freqs[comments_freqs['count'] > 5]

# Plotting
plt.figure(figsize = (60, 60))

submission_freqs = submission_freqs.sort_values(by = 'count', ascending = True)
comments_freqs = comments_freqs.sort_values(by = 'count', ascending = True)

plt.subplot(1, 2, 1)
bars = plt.barh(submission_freqs['company'].astype(str), submission_freqs['count'].astype(int), color = '#251279')
plt.xlabel('No. of Submissions', fontsize = 60)
plt.tick_params(axis = 'x', which = 'major', labelsize = 50) 
plt.tick_params(axis = 'y', which = 'major', labelsize = 50) 
plt.title('Number of Submissions by Company', fontsize = 60)
plt.bar_label(bars, labels = submission_freqs['count'].astype(int), padding = 3, fontsize = 50)


plt.subplot(1, 2, 2)
bars = plt.barh(comments_freqs['company'].astype(str), comments_freqs['count'].astype(int), color = '#40B389')
plt.xlabel('No. of Comments', fontsize = 60)
plt.tick_params(axis = 'x', which = 'major', labelsize = 50) 
plt.tick_params(axis = 'y', which = 'major', labelsize = 50) 
plt.title('Number of Comments by Company', fontsize = 60)
plt.bar_label(bars, labels = comments_freqs['count'].astype(int), padding = 3, fontsize = 50)

plt.figtext(0.05, -0.02, 'Note: Companies with less than 5 mentions are not included. Additionally, posts mentioning multiple or "Other" companies are not included.', fontsize = 50, color='black')

plt.tight_layout()

# Save figure
plt.savefig('../../website-source/images/submissions_comments_by_company.png', dpi = 300)

plt.show()
```

![](images/submissions_comments_by_company.png){width="1150" height="1300"}

As expected, Amazon and Google are the clear front-runners, surpassing other "FAANG" companies like Apple, Netflix, and Facebook. Medium is an interesting front-runner as well; it hosts many informative articles and code samples, which might explain the high count of submissions and comments mentioning it.

##### Top 20 Companies Mentioned with Compensation Comparison

Below are the total mention counts for the top twenty companies.

```{python}
#| eval: False
exclude = ['Other', 'None', 'Multiple', 'Test', 'Here']
filtered_df = filtered_df[~filtered_df['company'].isin(exclude)]
company_counts = filtered_df.groupBy('company').count()
top_20_companies = company_counts.orderBy(F.col('count').desc()).limit(20)
top_20_companies.show()
```

| company    | count |
|------------|-------|
| Google     | 1279  |
| Medium     | 993   |
| Move       | 967   |
| Amazon     | 859   |
| NICE       | 528   |
| Wish       | 442   |
| Discord    | 370   |
| zoom       | 258   |
| Target     | 253   |
| Reddit     | 210   |
| Match      | 175   |
| GitHub     | 163   |
| Microsoft  | 125   |
| Indeed     | 121   |
| HackerRank | 112   |
| Gap        | 109   |
| Button     | 106   |
| Facebook   | 98    |
| Wise       | 92    |
| Intuitive  | 87    |

With this list of the top twenty companies, the compensation metrics can be compared to see if any possible relationships arise.

```{python}
#| eval: False
import altair as alt

# Note: column_list, title_list, x_axis_list, and color_list (the six compensation
# metric column names, their chart titles, axis labels, and colors) are defined in an
# earlier code cell and reused here.

# Aggregate data by company and calculate the mean for each metric
data_by_company = tech_df.groupby('company')[column_list].mean().reset_index()

company_list = ["Google", "Medium", "Move", "Amazon", "NICE", "Wish", "Discord", "zoom", 
"Target", "Reddit", "Match", "GitHub", "Microsoft", "Indeed", "HackerRank", "Gap", 
"Button", "Facebook", "Wise", "Intuitive"]

data_by_company["top10"] = data_by_company["company"].isin(company_list)

top20 = data_by_company[data_by_company["top10"] == True]

ordered_top20 = {
    "average_yearly_compensation": list(top20.sort_values('average_yearly_compensation', ascending = False)['company']),
    "average_years_experience": list(top20.sort_values('average_years_experience', ascending = False)['company']),
    "average_years_at_company": list(top20.sort_values('average_years_at_company', ascending = False)['company']),
    "average_base_salary": list(top20.sort_values('average_base_salary', ascending = False)['company']),
    "average_stock_grant_value": list(top20.sort_values('average_stock_grant_value', ascending = False)['company']),
    "average_bonus": list(top20.sort_values('average_bonus', ascending = False)['company'])
}

# Create bar plots for each metric
bar_charts = []
for i in range(len(title_list)):
    chart = alt.Chart(top20).mark_bar().encode(
        x = alt.X('company', title = 'Company', sort = ordered_top20[column_list[i]]),
        y = alt.Y(column_list[i], title = x_axis_list[i]),
        color = alt.ColorValue(color_list[i])
    ).properties(
        title = f"{title_list[i]} for Top 10 Mentioned Companies",
        width = 400,
        height = 220
    )
    bar_charts.append(chart)

# Concatenate bar charts in a grid layout
final_bars = alt.vconcat(
    alt.hconcat(bar_charts[0], bar_charts[1]),
    alt.hconcat(bar_charts[2], bar_charts[3]),
    alt.hconcat(bar_charts[4], bar_charts[5])
)
```

![](images/top20compensation.png){width="1150" height="1100"}

FAANG companies like Facebook, Amazon, and Google tend to be in the top half of the top twenty most mentioned companies on these compensation metrics. Overall, the charts suggest that a mixture of FAANG and non-FAANG companies offer high compensation. It is interesting that the companies with the highest compensation do not necessarily have the most years of experience or years at the company, suggesting other factors play a role in compensation packages. The plot also shows variability in how long employees stay with these companies, which implies that factors other than compensation may influence tenure.

### 2) Conducting Natural Language Processing Work

#### **Topic 5: NLP Trends for Reddit Interview Content**

*Goal:* Identify topics within the posts/comments of the interview subreddits to uncover general themes for students to study further. This can highlight what types of subject matter are being discussed the most, whether coding topics or behavioral questions.
 

From a technical perspective, this portion of the analysis begins with identifying the most common words across the dataset, using techniques such as tokenization, stop word removal, and word count aggregation. Next, the analysis explores the distribution of text lengths, offering a statistical overview of the dataset's text size. The third step identifies important words using Term Frequency-Inverse Document Frequency (TF-IDF), a method that highlights words that are unique to specific documents within the corpus. Transforming the text data with multiple NLP techniques, including URL removal, tokenization, normalization, and stemming, then creates a cleaner version of the dataset for improved text analysis. Finally, the creation of dummy variables based on regex patterns enables the categorization of texts into technical domains such as programming languages and coding libraries.

From a business perspective, the technical analysis in this section supports goals such as surfacing key topics and trends and enabling data-driven decision making. By identifying the most common words and important terms (via TF-IDF), the analysis uncovers the main themes prevalent in the dataset, which is crucial for job seekers trying to understand prevalent interview trends and common concerns.


##### Most Common Words Overall
The table below shows that the most common terms include "leetcode" and "questions", which might help steer job applicants toward focusing their study efforts on those subjects.

```{python}
#| eval: False
from pyspark.sql import SparkSession
from pyspark.sql.functions import length, explode, split
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF
from pyspark.sql.functions import col
import matplotlib.pyplot as plt

# tokenize 
tokenizer = Tokenizer(inputCol="body", outputCol="words")
df_words = tokenizer.transform(df)

# remove stop words 
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
df_filtered = remover.transform(df_words)

# explode the array into  words
df_exploded = df_filtered.select(explode(col("filtered_words")).alias("word"))

# filter out single character words
df_filtered_single_char = df_exploded.filter(length(col("word")) > 1)

# count 
df_word_count = df_filtered_single_char.groupBy("word").count()
df_word_count.orderBy(col("count").desc()).show()
```

| word | count |
|--------------|-------|
| interview | 22631 |
| like | 18719 |
| get | 17015 |
| questions | 13818 |
| time | 13330 |
| one | 13245 |
| good | 12364 |
| know | 11881 |
| job | 11718 |
| leetcode | 11557 |
| think | 11144 |
| also | 9331 |
| want | 9048 |
| really | 8620 |
| need | 8564 |
| problems | 8560 |
| \[deleted\] | 8453 |
| problem | 8444 |
| work | 8017 |
| first | 7521 |

##### Distribution of Text Lengths
These statistics confirm that the dataset contains a large number of posts/comments, with an average length of about 234 characters. The standard deviation is quite high, indicating large variability in the length of texts; the sketch after the summary table below probes this spread further.

```{python}
#| eval: False
# calc length of each text
df = df.withColumn("text_length", length(df["body"]))
df.describe("text_length").show()

```

| summary | text_length |
|---------|---------------------|
| count | 128339 |
| mean | 233.57049688715043 |
| stddev | 480.0254950591896 |
| min | 1 |
| max | 38271 |
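
One way to probe this spread a bit further (a small sketch using Spark's `approxQuantile`; this was not part of the original analysis) is to pull approximate quartiles of the text length:

```{python}
#| eval: False
# Approximate 25th/50th/75th percentiles of text length (illustrative sketch only)
q25, q50, q75 = df.approxQuantile("text_length", [0.25, 0.5, 0.75], 0.01)
print(f"25th percentile: {q25}, median: {q50}, 75th percentile: {q75}")
```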

##### Important Words According to TF-IDF
The list of words and their TF-IDF scores below provides insight into the topics and concerns that are most prominent in the Reddit posts and comments surrounding technical interviews. The most important term, 'spaced', is an unusual observation, as it is not obvious why it would be a key concept in technical interview subreddits. Other terms such as 'interview', 'simple', and 'subscribe' are more in line with expectations, as they indicate a focus on the interview process, a search for simple advice or interview resources, and encouragement to follow or subscribe to threads or channels for updates on interview strategies. The presence of emotionally charged words like 'ugh' and 'crushing' may reflect the anxiety and difficulty of technical interview preparation.
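
As a reminder of how these scores arise, Spark ML's `CountVectorizer` and `IDF` stages (used in the code below) weight a term $t$ in document $d$ roughly as

$$
\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N + 1}{\text{df}(t) + 1},
$$

where $N$ is the number of documents in the (sampled) corpus and $\text{df}(t)$ is the number of documents containing $t$, so terms concentrated in few documents receive the highest weights.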

```{python}
#| eval: False

# term frequency
df_sampled = df_filtered.sample(False, 0.1)
cv = CountVectorizer(inputCol="filtered_words", outputCol="raw_features")
cv_model = cv.fit(df_sampled)
df_featurized = cv_model.transform(df_sampled)

# idf
idf = IDF(inputCol="raw_features", outputCol="features")
idf_model = idf.fit(df_featurized)
df_rescaled = idf_model.transform(df_featurized)

tf_idf_pandas_df = df_rescaled.select("features").toPandas()

# Return the n_top_words highest-scoring (word, score) pairs for one TF-IDF row
def extract_top_words(features, vocabulary, n_top_words=3):
    # Convert the sparse TF-IDF vector to a dense array
    dense_vector = features.toArray()
    # Indices of the largest scores, in descending order
    top_indices = dense_vector.argsort()[-n_top_words:][::-1]
    top_words = [(vocabulary[i], dense_vector[i]) for i in top_indices]
    return top_words

vocabulary = cv_model.vocabulary

# extract top words for each row
n_top_words = 1 
top_words_list = [extract_top_words(row['features'], vocabulary, n_top_words) for index, row in tf_idf_pandas_df.iterrows()]
top_words_list[35:45]
```

Below is the output for ten sampled posts/comments (rows 35 through 44): the single top word in each, along with its TF-IDF score.

| word | TF-IDF score |
|------|--------------|
| spaced | 13.378421024061316 |
| bud. | 8.768652053710493 |
| pill. | 8.768652053710493 |
| interview | 8.768652053710493 |
| simple | 9.794902085605205 |
| ugh | 7.670039765042384 |
| rated | 7.670039765042384 |
| crushing | 7.2645746569342196 |
| \[deleted\] | 2.7537151504173694 |
| subscribe | 7.159214141276394 |

##### Cleaning the Text Using 5+ NLP Techniques

In this section, more detailed text cleaning techniques are put in place. The following code uses Spark NLP methods such as tokenization, normalization (lowercasing and punctuation removal), stop word removal, and stemming, along with URL removal, to clean the Reddit content.

```{python}
#| eval: False
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer, StopWordsCleaner, Stemmer

# remove https links from the raw text
def remove_https_links(text):
    return re.sub(r'https?:\/\/\S+', '', text)

remove_https_links_udf = udf(remove_https_links, StringType())

df_no_url = df.withColumn("body_no_url", remove_https_links_udf("body"))

df_no_url.select('body_no_url').show(4, truncate=False)

# set up spark pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("body_no_url") \
    .setOutputCol("document")

# tokenize
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")


# normalize words to remove unnecessary punctuation
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True) \
    .setCleanupPatterns(["[^A-Za-z0-9 ]"]) 


# remove stop words
stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

# stem words
stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")

# sentence_detector = SentenceDetector() \
#     .setInputCols(["cleanTokens"]) \
#     .setOutputCol("sentence")

finisher = Finisher() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCols(["finished_ngrams"]) \
    .setOutputAsArray(False) \
    .setCleanAnnotations(True)


# Define the pipeline
pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    normalizer,
    stopwords_cleaner,
    stemmer,
    finisher
])
# run the pipeline 
model = pipeline.fit(df_no_url)
clean_df = model.transform(df_no_url)

from pyspark.sql.functions import regexp_replace
clean_df_nlp = clean_df.withColumn("clean_body", regexp_replace("finished_ngrams", "@", " "))
clean_df_nlp.select("clean_body").show(5,truncate=False)

```

**Preview of posts and comments after cleaning:**

| clean_body |
|------------|
| messaging 21 days 20230128 065831 utc remind link click link send pm also reminded reduce spam parent commenter delete message hide others info reminders sorry didnt see thank howd get change anything prep strategy |

#### **Topic 6: Identification of Most Important Skills for Technical Interviewing**

*Goal:* Hone in on the most important and most frequent skills discussed on Reddit for technical interviews. Explore mention frequency of various programming languages, tools, and libraries. Also identify the most important stages of the technical interview.

##### Creating New Dummy Variables
To focus on topics such as programming languages, tools, and libraries, new variables are required. 

```{python}
#| eval: False
from pyspark.sql.functions import regexp_extract, col

language_pattern = r'(?i)(Python|SQL|SparkSQL|PySpark|Java|Scala|Julia|MATLAB|C\+\+|SAS|NoSQL|Hadoop|Hive|BigQuery)'
library_pattern = r'(?i)(Pandas|Tensorflow|Pytorch|sklearn|Keras|NumPy|SciPy|Matplotlib|Seaborn|Dask|NLTK|Gensim|Scrapy|Statsmodels|XGBoost|LightGBM|Jupyter|FastAI|Spacy|Theano|Yolo)'
tool_pattern = r'(?i)(Tableau|Shiny|Sagemaker|Azure|PowerBI|Databricks|Alteryx|Qlik|Looker|Google Cloud Platform|AWS|Dataiku|Docker|Kubernetes|Airflow|ML)'
stage_pattern = r'(?i)(initial screening|technical round|HR round|final round|offer stage)'

clean_df_nlp_var = clean_df_nlp.withColumn("programming_language_mentioned", (regexp_extract(col("clean_body"), language_pattern, 0) != '').cast('int'))
clean_df_nlp_var = clean_df_nlp_var.withColumn("coding_library_mentioned", (regexp_extract(col("clean_body"), library_pattern, 0) != '').cast('int'))
clean_df_nlp_var = clean_df_nlp_var.withColumn("ds_tool_mentioned", (regexp_extract(col("clean_body"), tool_pattern, 0) != '').cast('int'))
clean_df_nlp_var = clean_df_nlp_var.withColumn("interview_stage_mentioned", (regexp_extract(col("clean_body"), stage_pattern, 0) != '').cast('int'))

from pyspark.sql.functions import col, count, mean, sum

summary_stats = clean_df_nlp_var.agg(
    count("*").alias("total_count"),
    sum(col("programming_language_mentioned")).alias("sum_programming_language_mentioned"),
    sum(col("coding_library_mentioned")).alias("sum_coding_library_mentioned"),
    sum(col("ds_tool_mentioned")).alias("sum_ds_tool_mentioned"),
    sum(col("interview_stage_mentioned")).alias("sum_interview_stage_mentioned")
)

summary_stats.show(truncate=False)
```

**Statistical summary table for regex patterns from new variables:**

+-------------+------------------------------------+------------------------------+-----------------------+-------------------------------+
| total_count | sum_programming_language_mentioned | sum_coding_library_mentioned | sum_ds_tool_mentioned | sum_interview_stage_mentioned |
+-------------+------------------------------------+------------------------------+-----------------------+-------------------------------+
| 128339      | 5188                               | 53                           | 1004                  | 476                           |
+-------------+------------------------------------+------------------------------+-----------------------+-------------------------------+

**Visualization for regex patterns from new variables:**

The extraction and visualization of patterns through regex-based categorization, encompassing elements like programming languages and coding libraries, offers additional guidance. The sunburst visualization, for example, highlights the prevalence of programming languages, especially Python, within the text. This observation can streamline a job applicant's study plan by emphasizing Python, thereby optimizing interview preparation and enhancing their chances of success.
```{python}
#| eval: False
import re
import plotly.express as px
from pyspark.sql.functions import col, explode, lit, udf
from pyspark.sql.types import ArrayType, StringType

# filter out rows where no programming language is mentioned
df_w_languages = clean_df_nlp_var.filter(col("programming_language_mentioned")==1)
df_w_libraries = clean_df_nlp_var.filter(col("coding_library_mentioned")==1)
df_w_tools = clean_df_nlp_var.filter(col("ds_tool_mentioned")==1)
df_w_stage = clean_df_nlp_var.filter(col("interview_stage_mentioned")==1)

# write function to extract language frequency in each text 
def extract_pattern(pattern,text):
    return re.findall(pattern, text) 

# create udf
extract_pattern_udf = udf(extract_pattern, ArrayType(StringType()))

# add a column with the pattern to the DataFrames
df_w_languages = df_w_languages.withColumn("pattern", lit(language_pattern))
df_w_libraries = df_w_libraries.withColumn("pattern", lit(library_pattern))
df_w_tools = df_w_tools.withColumn("pattern", lit(tool_pattern))
df_w_stage = df_w_stage.withColumn("pattern", lit(stage_pattern))

# apply the UDF 
df_w_languages = df_w_languages.withColumn("mentioned_languages", extract_pattern_udf(col("pattern"), col("clean_body")))
df_w_libraries = df_w_libraries.withColumn("mentioned_libraries", extract_pattern_udf(col("pattern"), col("clean_body")))
df_w_tools = df_w_tools.withColumn("mentioned_tools", extract_pattern_udf(col("pattern"), col("clean_body")))
df_w_stage = df_w_stage.withColumn("mentioned_stages", extract_pattern_udf(col("pattern"), col("clean_body")))

# explode the mentioned languages
exploded_languages = df_w_languages.withColumn("language", explode(col("mentioned_languages")))
exploded_libraries = df_w_libraries.withColumn("library", explode(col("mentioned_libraries")))
exploded_tools = df_w_tools.withColumn("tool", explode(col("mentioned_tools")))
exploded_stages = df_w_stage.withColumn("stage", explode(col("mentioned_stages")))

# group by language and count
language_cts = exploded_languages.groupBy("language").count()
library_cts = exploded_libraries.groupBy("library").count()
tool_cts = exploded_tools.groupBy("tool").count()
stage_cts = exploded_stages.groupBy("stage").count()

# add col describing pattern
language_cts = language_cts.withColumn("pattern_topic", lit("programming language"))
library_cts = library_cts.withColumn("pattern_topic", lit("coding library"))
tool_cts = tool_cts.withColumn("pattern_topic", lit("data science tool"))
stage_cts = stage_cts.withColumn("pattern_topic", lit("interview stage"))

language_cts = language_cts.withColumnRenamed("language", "pattern")
library_cts = library_cts.withColumnRenamed("library", "pattern")
tool_cts = tool_cts.withColumnRenamed("tool", "pattern")
stage_cts = stage_cts.withColumnRenamed("stage", "pattern")

# convert to pandas for future visualization
language_cts_pd = language_cts.toPandas()
library_cts_pd = library_cts.toPandas()
tool_cts_pd = tool_cts.toPandas()
stage_cts_pd = stage_cts.toPandas()

# union pd dataframes for visualizations later
combined_cts_pd = pd.concat([language_cts_pd, library_cts_pd, tool_cts_pd, stage_cts_pd], ignore_index=True)

# visualize
color_map = {
    'programming language': '#163d6b', 
    'coding library': '#204321',
    'data science tool': '#82B8B5' ,
    'interview stage' :'#5E9751'}
    
fig = px.sunburst(
    combined_cts_pd,
    path=['pattern_topic', 'pattern'],
    values='count',
    color='pattern_topic',
    color_discrete_map=color_map, 
    title='Distribution of RegEx Topic Patterns'
)

fig.update_layout(width=650, height=650)
fig.show()

```


<iframe src="images/sundial.html" width="300%" height="600px">

</iframe>

### 3) Building a Sentiment Model

#### **Topic 7: Sentiment Analysis of Interview Posts & Comments**

*Goal:* Observe the overall sentiment of posts and comments related to the filtered dataset that focuses on interviews. This will help identify how candidates are feeling about the interview process and can provide insight into specific employers or the difficulty level of particular interview topics.

For the sentiment analysis, the Spark jars packages and the John Snow Labs Spark NLP package were used. The John Snow Labs package includes several pre-trained models; the one best suited for this analysis is sentimentdl_use_twitter. The pipeline consists of four parts. The first part, the DocumentAssembler, takes the text column (in this case the 'body' column) as its input. The second part, the Universal Sentence Encoder, uses the TensorFlow Hub USE model and takes the output of the DocumentAssembler as its input. The third part is the pretrained sentiment deep learning model, which takes the sentence embeddings as its input and applies the John Snow Labs sentimentdl_use_twitter model for English, outputting a sentiment for each row of the 'body' column. The final part is the NLP Pipeline, which chains the previous stages together so the data frame can be passed through them. The pipeline runs through all the previously mentioned parts and produces a new sentiment column with three options - positive, neutral, and negative. The last step in the sentiment analysis is transforming the output of the pipeline into a results dataframe.

```{python}
#| eval: False
! pip install spark-nlp
import json
import sparknlp
import numpy as np
import pandas as pd
import sagemaker
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from sagemaker.spark.processing import PySparkProcessor
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, SentimentDLModel
from sparknlp.pretrained import PretrainedPipeline

sess = sagemaker.Session()

# Initialize Spark session for Spark NLP
spark = SparkSession.builder\
    .appName("PySparkApp")\
    .master("local[*]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3,org.apache.hadoop:hadoop-aws:3.2.2")\
    .config("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.ContainerCredentialsProvider")\
    .getOrCreate()

%%time
import sagemaker

session = sagemaker.Session()
#s3://sagemaker-us-east-1-131536150362/project/data_reformat_clean_nlp.parquet/
bucket = "tm1450-project"
output_prefix_data_submissions = "data_reformat_clean.parquet/"

s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
#print(f"reading tagged data from {s3_path}")
df = spark.read.parquet(s3_path, header=True)
#print(f"shape of the submissions dataframe is {df.count():,}x{len(df.columns)}")
# Define model names and pipeline components
MODEL_NAME = 'sentimentdl_use_twitter'

documentAssembler = DocumentAssembler()\
    .setInputCol("body")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

sentimentdl = SentimentDLModel.pretrained(name=MODEL_NAME, lang="en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

nlpPipeline = Pipeline(
      stages = [
          documentAssembler,
          use,
          sentimentdl
      ])
pipelineModel = nlpPipeline.fit(df)
results = pipelineModel.transform(df)
```
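
The `results` dataframe keeps the original columns and adds the Spark NLP annotation structs. The schema below is the standard `printSchema()` output; a minimal sketch of the call (assuming the `results` dataframe from the pipeline above):

```{python}
#| eval: False
# Inspect the structure of the pipeline output (schema reproduced below)
results.printSchema()
```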

**Schema:**

root\
\|-- link_id: string (nullable = true)\
\|-- created_utc: timestamp (nullable = true)\
\|-- body: string (nullable = true)\
\|-- is_tech: boolean (nullable = true)\
\|-- is_asking_for_help: boolean (nullable = true)\
\|-- is_asking_a\_question: boolean (nullable = true)\
\|-- is_FAANG: boolean (nullable = true)\
\|-- isSubmission: boolean (nullable = true)\
\|-- body_no_url: string (nullable = true)\
\|-- finished_ngrams: string (nullable = true)\
\|-- clean_body: string (nullable = true)\
\|-- document: array (nullable = true)\
\| \|-- element: struct (containsNull = true)\
\| \| \|-- annotatorType: string (nullable = true)\
\| \| \|-- begin: integer (nullable = false)\
\| \| \|-- end: integer (nullable = false)\
\| \| \|-- result: string (nullable = true)\
\| \| \|-- metadata: map (nullable = true)\
\| \| \| \|-- key: string\
\| \| \| \|-- value: string (valueContainsNull = true)\
\| \| \|-- embeddings: array (nullable = true)\
\| \| \| \|-- element: float (containsNull = false)\
\|-- sentence_embeddings: array (nullable = true)\
\| \|-- element: struct (containsNull = true)\
\| \| \|-- annotatorType: string (nullable = true)\
\| \| \|-- begin: integer (nullable = false)\
\| \| \|-- end: integer (nullable = false)\
\| \| \|-- result: string (nullable = true)\
\| \| \|-- metadata: map (nullable = true)\
\| \| \| \|-- key: string\
\| \| \| \|-- value: string (valueContainsNull = true)\
\| \| \|-- embeddings: array (nullable = true)\
\| \| \| \|-- element: float (containsNull = false)\
\|-- sentiment: array (nullable = true)\
\| \|-- element: struct (containsNull = true)\
\| \| \|-- annotatorType: string (nullable = true)\
\| \| \|-- begin: integer (nullable = false)\
\| \| \|-- end: integer (nullable = false)\
\| \| \|-- result: string (nullable = true)\
\| \| \|-- metadata: map (nullable = true)\
\| \| \| \|-- key: string\
\| \| \| \|-- value: string (valueContainsNull = true)\
\| \| \|-- embeddings: array (nullable = true)\
\| \| \| \|-- element: float (containsNull = false)\

#### Tables of Summary Statistics
Below is a table that provides a glimpse of the data after the sentiment analysis created a new variable that allows further analysis, such as what sentiment looks like for the entire dataset or for specific attributes such as posts/comments that focus on asking for help or on FAANG companies.

```{python}
#| eval: False
from pyspark.sql.functions import explode, col

# Explode the sentiment array and extract the result field
results = results.withColumn('sentiment', F.explode('sentiment'))
results = results.withColumn('sentiment', F.col('sentiment.result'))

# Selecting columns for the final DataFrame
final_data = results.select(
    'is_tech', 'is_asking_for_help', 'is_asking_a_question', 
    'is_FAANG', 'clean_body', 'sentiment'
)

# Persist and show the DataFrame
final_data.persist()
final_data.show()
```

| is_tech | is_asking_for_help | is_asking_a_question | is_FAANG | clean_body | sentiment |
|---------|--------------------|----------------------|----------|------------|-----------|
| true | false | false | false |  | neutral |
| true | false | true | false | messaging 21 days... | positive |
| false | false | false | false | sorry didnt see t... | positive |
| false | false | true | false | howd get | positive |
| false | false | true | false | change anything p... | positive |
| false | false | false | false | one 180 question ... | negative |
| false | true | true | true | approach solving ... | positive |
| false | false | false | false | places wanting pe... | positive |
| false | true | true | false | principles mean s... | positive |
| false | false | false | false | interested | positive |
| true | false | false | false | thanks exactly fo... | positive |
| false | false | false | false | okay thats good i... | positive |
| false | false | false | false | wait take back sa... | neutral |
| true | false | false | false | map contains inte... | positive |
| false | false | false | false | practice together | positive |
| false | false | false | false |  | positive |
| true | true | true | false | gtdo think solvin... | positive |
| false | false | false | false | bro need work sof... | negative |
| false | false | false | false | settle sugarlips | positive |
| false | false | false | false | 25 years almost | positive |

#### Full Dataframe Sentiment Counts
Positive sentiment dominates the full dataset, but alongside the strong presence of positive discourse there is also a substantial number of negative experiences and thoughts being shared.

```{python}
#| eval: False
sentiment_counts = final_data.groupBy("sentiment").count()
sentiment_counts.show()
```

| sentiment | count |
|-----------|-------|
| positive | 81780 |
| neutral | 4164 |
| negative | 42395 |

##### Sentiment Counts for is_tech Column
The is_tech column represents a subset of the data where the focus is on technical aspects of interviews. The sentiment analysis summary for this subset shows that positive sentiments far outnumber the negatives, suggesting that discussions in this category are generally more positive. This is surprising, as the technical aspects of interviews are usually the most stressful, so it provides a bit of hope to those starting the long road of technical interview preparation.

```{python}
#| eval: False
tech_data = final_data.filter(col('is_tech') == True)
sentiment_counts = tech_data.groupBy('sentiment').count()
print('Sentiment Counts for is_tech Column')
sentiment_counts.show()
```

| sentiment | count |
|-----------|-------|
| positive | 20776 |
| neutral | 979 |
| negative | 7629 |

##### Sentiment Counts for is_asking_for_help Column
The is_asking_for_help column represents a data subset focusing on posts where individuals are seeking assistance. Positive sentiment is almost double the negative sentiment, which may signify a supportive environment for those asking for help, again providing hope for those starting the challenge of preparing for technical interviews. 
```{python}
#| eval: False
help_data = final_data.filter(col('is_asking_for_help') == True)
sentiment_counts = help_data.groupBy('sentiment').count()
print('Sentiment Counts for is_asking_for_help Column')
sentiment_counts.show()
```

| sentiment | count |
|-----------|-------|
| positive | 9839 |
| neutral | 346 |
| negative | 4602 |

##### Sentiment Counts for is_asking_a\_question Column
The table below shows a higher number of positive sentiments than negative ones for posts/comments that ask a question. This might suggest that the Reddit community is usually supportive when individuals ask questions, creating an environment for information exchange and learning. Knowing this may encourage new users seeking advice for technical interviews to be less afraid to ask questions.

```{python}
#| eval: False
question_data = final_data.filter(col('is_asking_a_question') == True)
sentiment_counts = question_data.groupBy('sentiment').count()
print('Sentiment Counts for is_asking_a_question Column')
sentiment_counts.show()
```

| sentiment | count |
|-----------|-------|
| positive | 20014 |
| neutral | 867 |
| negative | 13028 |

##### Sentiment Counts for is_FAANG Column
The is_FAANG subset contains posts/comments discussing interviews at FAANG companies (Facebook, Amazon, Apple, Netflix, Google), and here sentiment is also more positive than negative. This indicates that conversations around interviews at these top tech companies lean positive, which could be due to successful interview experiences or helpful insights posted by users who have interviewed with these companies.
```{python}
#| eval: False
faang_data = final_data.filter(col('is_FAANG') == True)
sentiment_counts = faang_data.groupBy('sentiment').count()
print('Sentiment Counts for is_FAANG Column')
sentiment_counts.show()
```

+-----------+-------+
| sentiment | count |
+===========+=======+
| positive  | 3488  |
+-----------+-------+
| neutral   | 171   |
+-----------+-------+
| negative  | 1554  |
+-----------+-------+

##### Plot of Summary Statistics

The plot below displays a more succinct summary of the count of posts (submissions and comments) for each sentiment classification broken down by post type. Again, the text content types include:

1) A post asking for help

2) A post asking a question

3) A post mentioning a FAANG company

4) A technical post

These topics were identified using regular expressions, as described in previous parts of the website. In general, submissions and comments in the full dataset are mostly positive; positive texts number almost double the negative ones. For texts where the is_tech column is true, the ratio of positive to negative sentiment is the highest of the four categories, while texts flagged as asking a question show the narrowest gap between positive and negative counts, consistent with the tables above. Neutral sentiment is the least common label, both overall and within each subset where the flag is true.
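
For readers landing on this section directly, the sketch below illustrates the general idea behind these regex flags. The pattern and the column name (`text`) are placeholders for illustration only, not the exact expressions used in this project.

```{python}
#| eval: False
import pyspark.sql.functions as F

# Illustrative only: flag texts that mention a FAANG company.
# The real patterns and the text column name are defined in earlier sections.
faang_pattern = r"(?i)\b(facebook|meta|amazon|apple|netflix|google)\b"
flagged = final_data.withColumn("is_FAANG_example", F.col("text").rlike(faang_pattern))
```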

```{python}
#| eval: False

import pyspark.sql.functions as F
import pandas as pd

# Get sentiment counts for submissions/comments that are technical
df_tech_counts = df\
    .filter(F.col('is_tech'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Technical'))

# Get sentiment counts for submissions/comments that ask for help
df_help_counts = df\
    .filter(F.col('is_asking_for_help'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Asking for help'))

# Get sentiment counts for submissions/comments that ask a question
df_question_counts = df\
    .filter(F.col('is_asking_a_question'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Asking question'))

# Get sentiment counts for submissions/comments that mention a FAANG
# company
df_FAANG_counts = df\
    .filter(F.col('is_FAANG'))\
    .groupBy('sentiment')\
    .count()\
    .withColumn('category', F.lit('Mentions FAANG'))

# Concatenate the count dfs together and convert to pandas
df_count_summary = df_tech_counts\
    .union(df_help_counts)\
    .union(df_question_counts)\
    .union(df_FAANG_counts)\
    .toPandas()

# Pivot dataframe
df_plot = df_count_summary.pivot(index='category', columns='sentiment', values='count').reset_index()

# Code referenced: https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_colors.html
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(len(df_plot['category']))  # the label locations
width = 0.25  # the width of the bars
multiplier = 0

color_dict = {
    'negative': '#251279',
    'neutral': '#6CACED',
    'positive': '#5E9751'
}

fig, ax = plt.subplots(layout='constrained')

# Cycle through each sentiment label (negative, neutral, positive)
for sentiment_label in df_plot.columns:
    # Skip the category column
    if sentiment_label != 'category':
        offset = width * multiplier
        rects = ax.bar(
            x + offset,
            df_plot[sentiment_label],
            width,
            label=sentiment_label.title(),
            color=color_dict[sentiment_label]
        )
        ax.bar_label(rects, padding=3)
        multiplier += 1

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Number of Submissions/Comments')
ax.set_title('Distribution of Sentiment')
ax.set_xticks(x + width, df_plot['category'])
ax.legend(loc='upper left')
ax.set_ylim(0, 22000)

plt.savefig('../../website-source/images/sentiment_dist.png', dpi=300)
plt.show()
```

<img src="images/sentiment_dist.png" style="width: 75%; height: auto;"/>
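
To make the ratio comparison above concrete, a short follow-up could compute the positive-to-negative ratio for each category, reusing the `df_count_summary` dataframe built in the plotting code. This is a sketch rather than part of the original pipeline:

```{python}
#| eval: False
# Sketch: positive-to-negative ratio per category, using the
# df_count_summary dataframe created in the plotting code above.
ratios = (
    df_count_summary
    .pivot(index='category', columns='sentiment', values='count')
    .assign(pos_neg_ratio=lambda d: d['positive'] / d['negative'])
    .sort_values('pos_neg_ratio', ascending=False)
)
print(ratios[['positive', 'negative', 'pos_neg_ratio']])
```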