Job Application Insights with Reddit: Navigating Interviews and Technical Discussions
Team 03 Members
- Amelia Baier: ab3868@georgetown.edu
- Joshua Gladwell: jeg307@georgetown.edu
- Tereza Martinkova: tm1450@georgetown.edu
- Mia Mayerhofer: mmm552@georgetown.edu
Course Information
DSAN 6000 (Big Data and Cloud Computing) - Fall 2023
Project Goal
In the world of applying for jobs, understanding how to prepare for hiring interviews can be difficult. Even a strong applicant's confidence can be derailed by an unexpected event during the interview, such as a tricky question or a surprise coding demonstration. Ideally, an applicant could prepare by asking an HR representative what questions or tasks to expect in an upcoming interview, but to some this may seem taboo. Reddit is an ideal forum for sharing job interview experiences with fellow applicants: its anonymity helps users feel comfortable sharing the details of their interview experiences. This project examines the language surrounding discussion of job interviews on Reddit.
Subreddits
- leetcode
- interviewpreparations
- codinginterview
- InterviewTips
- csinterviewproblems
- interviews
- big_tech_interview
Ten Topics for Exploration
Below is a set of ten topics that will be explored further throughout this website. The topics are broken down into three sections: Exploratory Data Analysis (EDA) topics, Natural Language Processing (NLP) topics, and Machine Learning (ML) topics.
Exploratory Data Analysis
The importance of exploratory data analysis to any type of data science project or task cannot be overstated. Processing, preparing, and understanding the data that will be used for the more complex tasks, algorithms, and visualizations is the key to gaining successful insights. The three topics below explain the EDA performed for this project in more detail.
Topic 1: Interview Post and Comment Frequency Over Time
Business Goal: Determine the time periods throughout the year with the most submissions (posts) and comments on Reddit regarding technical/coding interviews. Do these periods fall within typical hiring windows? This line of analysis could explain spikes or troughs in job application volume in the big tech industry.
Technical Proposal: Generate a clear and visually appealing line plot that highlights the frequency of interview posts and comments during each month of the calendar year using available data for 2021-2023. Use counts to determine the month with the most interview discussion on Reddit. Conduct a brief analysis of the frequency data over time to identify any patterns or trends present in the data.
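As a rough illustration of the counting step, the sketch below aggregates comment volume by month with PySpark, assuming a Spark DataFrame named `comments_df` with a Reddit-style `created_utc` epoch-seconds column (both names are assumptions, not the project's actual schema):

```python
from pyspark.sql import functions as F

# Bucket each comment into a year-month period and count rows per period.
monthly_counts = (
    comments_df
    .withColumn("created_ts", F.from_unixtime("created_utc").cast("timestamp"))
    .withColumn("year_month", F.date_format("created_ts", "yyyy-MM"))
    .groupBy("year_month")
    .count()
    .orderBy("year_month")
)

# The aggregate is small, so collect it to pandas for the line plot.
pdf = monthly_counts.toPandas()
ax = pdf.plot(x="year_month", y="count", kind="line", marker="o",
              title="Interview posts/comments per month")
ax.figure.savefig("monthly_comment_counts.png")
```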
Topic 2: Popular Technical Interview Questions and Topics
Business Goal: Determine the most popular interview questions and/or topics to gauge the specific areas of interest regarding technical and coding-related interviews. Questions that are discussed most frequently can help future applicants streamline their interview preparation, directing their studying toward the content with the most traction on Reddit.
Technical Proposal: Generate a word cloud highlighting the most common words appearing in interview-related Reddit comments. Use regex to categorize comments/posts by whether they focus on technical interviews, ask for help with an interview they are struggling with, or simply ask a question about an interview topic, and count each category. Generate a line/bar plot that visualizes the most common questions asked in comments/posts.
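The snippet below is a hedged sketch of the regex flagging and word cloud, assuming a Spark DataFrame `comments_df` with a `body` text column; the patterns shown are illustrative placeholders rather than the project's final regexes:

```python
from pyspark.sql import functions as F
from wordcloud import WordCloud

# Flag comments by category using simple case-insensitive regex patterns.
flagged = (
    comments_df
    .withColumn("is_technical", F.col("body").rlike(r"(?i)\b(leetcode|algorithm|coding challenge)\b"))
    .withColumn("needs_help", F.col("body").rlike(r"(?i)\b(help|struggling|stuck)\b"))
    .withColumn("asks_question", F.col("body").rlike(r"\?"))
)
flagged.groupBy("is_technical", "needs_help", "asks_question").count().show()

# Build a word cloud from a small sample of comment text (collected to the driver).
sample_text = " ".join((r["body"] or "") for r in comments_df.select("body").sample(0.01).collect())
WordCloud(width=800, height=400).generate(sample_text).to_file("wordcloud.png")
```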
Topic 3: FAANG Company Breakdown
Business Goal: FAANG is an acronym that refers to five of the world’s largest and most influential tech companies: Facebook, Apple, Amazon, Netflix, and Google. These companies are consistently recruiting skilled data scientists and analysts. Examine how frequently these companies are mentioned in posts and comments so individuals can gauge their prevalence in discussions related to job interviews in the tech industry. Use an external dataset containing salary information for each FAANG company to enable insight into how public interest in these companies correlates with compensation.
Technical Proposal: Generate bar plots that compare the volume of comments that reference a FAANG company to those that do not. Subsequently, for comments that do mention a FAANG company, generate a bar plot that specifies the number of comments associated with each FAANG company. Merge the FAANG frequency dataset with the external salary dataset on the company and visualize results via a bar plot.
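A minimal sketch of the FAANG breakdown is shown below, assuming Spark DataFrames `comments_df` (with a `body` column) and `salary_df` (with a `company` key); all names are assumptions:

```python
from pyspark.sql import functions as F

faang = ["Facebook", "Apple", "Amazon", "Netflix", "Google"]

# Tag each comment with the first FAANG company it mentions (null if none).
mentions = comments_df.withColumn(
    "company",
    F.coalesce(*[F.when(F.lower("body").contains(c.lower()), c) for c in faang])
)

# Comments that mention a FAANG company vs. those that do not.
mentions.groupBy(F.col("company").isNotNull().alias("mentions_faang")).count().show()

# Per-company counts, joined to the external salary dataset for the final bar plot.
company_counts = mentions.filter(F.col("company").isNotNull()).groupBy("company").count()
company_counts.join(salary_df, on="company", how="left").show()
```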
Natural Language Processing
Because the Reddit data is predominantly text, a variety of NLP methods and models will be used to gain further insight into the Reddit discourse regarding interviews and technical job preparation. The following four topics describe the methods and models used in the NLP section of the website.
Topic 4: Market Salary Trends Analysis
Business Goal: Connect salary information on various firms with the firms discussed in the Reddit submissions and comments. Identify the firms that offer the highest salaries and contrast them with the firms that are discussed most frequently in the context of job hunting and interviewing. Also devote particular attention to the firms represented by the FAANG acronym to build on the prior examination of these firms from the EDA stage.
Technical Proposal: Locate external salary data to merge onto the existing dataset. Create plots to compare the frequency with which firms are mentioned against the salaries they offer. Perform data quality checks on the newly merged dataset.
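For the merge and data quality checks, one possible approach is sketched below, assuming a mention-frequency DataFrame `firm_counts` (columns `company` and `count`), an active `spark` session, and an external salary CSV; the path and column names are placeholders:

```python
from pyspark.sql import functions as F

# Load the external salary data and join it onto the firm-mention counts.
salary_df = spark.read.csv("data/external/salaries.csv", header=True, inferSchema=True)
merged = firm_counts.join(salary_df, on="company", how="left")

# Data quality checks: null counts per column and row counts before/after the merge.
merged.select([F.count(F.when(F.col(c).isNull(), c)).alias(f"{c}_nulls")
               for c in merged.columns]).show()
print("rows before:", firm_counts.count(), "rows after merge:", merged.count())
```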
Topic 5: NLP Trends for Reddit Interview Content
Business Goal: Identify topics within the posts/comments of the interview subreddits to uncover general themes for students to study further. This can highlight what types of subject matter are being discussed the most, whether coding topics or behavioral questions.
Technical Proposal: Utilize PySpark NLP techniques such as Tokenizer and StopWordsRemover to process the data. Implement a CountVectorizer to identify and count the most frequently occurring words within the subreddit posts and comments. Apply Term Frequency-Inverse Document Frequency (TF-IDF) weighting to locate the most important terms in the comment and submission texts.
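The pipeline named above maps fairly directly onto Spark ML stages; a minimal sketch follows, assuming a DataFrame `posts_df` with a `body` text column (the column name is an assumption):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF

tokenizer = Tokenizer(inputCol="body", outputCol="tokens")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
cv = CountVectorizer(inputCol="filtered", outputCol="tf", vocabSize=10000, minDF=5)
idf = IDF(inputCol="tf", outputCol="tfidf")

pipeline = Pipeline(stages=[tokenizer, remover, cv, idf])
model = pipeline.fit(posts_df)
features = model.transform(posts_df)

# The fitted CountVectorizer's vocabulary is ordered by corpus frequency,
# so its head gives the most common (non-stopword) terms.
vocab = model.stages[2].vocabulary
print(vocab[:20])
```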
Topic 6: Identification of Most Important Skills for Technical Interviewing
Business Goal: Home in on the most important and most frequently discussed skills on Reddit for technical interviews. Explore how often various programming languages, tools, and libraries are mentioned. Also identify the most important stages of the technical interview.
Technical Proposal: Create regex patterns to categorize text into predetermined technical and interview-related categories encompassing programming languages, technical tools, and interview stages. Create dummy variables to distinguish these categorizations in the cleaned text dataset. Employ visual tools like sunburst charts to represent the distribution and frequency of categorized terms.
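A sketch of the regex categorization and dummy variables is below, assuming a Spark DataFrame `comments_df` with a `body` column; the category patterns are small illustrative examples, not the project's full lists:

```python
from pyspark.sql import functions as F
import plotly.express as px

categories = {
    "python": r"(?i)\bpython\b",
    "java": r"(?i)\bjava\b",
    "sql": r"(?i)\bsql\b",
    "system_design": r"(?i)system design",
    "behavioral": r"(?i)behaviou?ral",
}

# One 0/1 dummy column per category.
flagged = comments_df
for name, pattern in categories.items():
    flagged = flagged.withColumn(name, F.col("body").rlike(pattern).cast("int"))

# Aggregate mention counts and draw a simple one-level sunburst of the categories.
counts = flagged.select([F.sum(c).alias(c) for c in categories]).toPandas().T.reset_index()
counts.columns = ["category", "mentions"]
px.sunburst(counts, path=["category"], values="mentions").write_html("skills_sunburst.html")
```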
Topic 7: Sentiment Analysis of Interview Posts & Comments
Business Goal: Observe the overall sentiment of posts and comments in the interview-focused filtered dataset. This will help identify how candidates feel about the interview process, which can offer insight into specific employers or the difficulty of particular interview topics.
Technical Proposal: Use a pre-trained sentiment analysis model such as TextBlob or VADER on the body of the Reddit comments/posts to categorize them as positive, negative, or neutral. Segment the data by various categories defined from the EDA stage and compare the sentiment between discussion of these categories. Create a plot to illustrate this comparison.
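A hedged sketch of VADER scoring inside Spark is shown below, assuming a DataFrame `comments_df` with a `body` column; the 0.05 thresholds follow VADER's commonly cited convention, and the names are illustrative:

```python
from pyspark.sql import functions as F, types as T
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

@F.udf(T.DoubleType())
def vader_compound(text):
    # Instantiated inside the UDF so the analyzer is available on the executors.
    return float(SentimentIntensityAnalyzer().polarity_scores(text or "")["compound"])

scored = comments_df.withColumn("compound", vader_compound("body")).withColumn(
    "sentiment",
    F.when(F.col("compound") >= 0.05, "positive")
     .when(F.col("compound") <= -0.05, "negative")
     .otherwise("neutral"),
)
scored.groupBy("sentiment").count().show()
```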
Machine Learning
With the data explored and curated through the EDA and NLP tasks described above, the final area of analysis involves machine learning models. Machine learning models, both supervised and unsupervised, enable even deeper insight into the dataset and continued exploration of the Reddit discussions.
Topic 8: Text Classification
Business Goal: Use text classification to label posts as useful, semi-useful, or not useful. This gives job seekers guidance as to which characteristics contribute to a post’s engagement and, therefore, which discussions are most valuable for their study and preparation, allowing them to focus on high-value content and shift their attention away from the rest.
Technical Proposal: Engineer features that may correlate with the usefulness of a post, such as length of text, number of comments, and outputs generated from prior NLP topics, and label the data accordingly. Choose an appropriate algorithm for text classification, such as Naive Bayes or Random Forest, and train the model. Evaluate model performance using accuracy, precision, recall, and F1-score metrics, and visualize the results using a confusion matrix.
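One way the training and evaluation step could look is sketched below, assuming a DataFrame `labeled_df` that already carries a numeric `label` column (0/1/2 for not/semi-/useful) and a `tfidf` feature vector from the earlier NLP pipeline; these names are assumptions:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

train, test = labeled_df.randomSplit([0.8, 0.2], seed=42)

# Train a Random Forest on the engineered/TF-IDF features.
rf = RandomForestClassifier(featuresCol="tfidf", labelCol="label", numTrees=100)
model = rf.fit(train)
preds = model.transform(test)

# Report the evaluation metrics named above.
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
for metric in ["accuracy", "weightedPrecision", "weightedRecall", "f1"]:
    print(metric, evaluator.setMetricName(metric).evaluate(preds))
```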
Topic 9: Topic Modeling
Business Goal: Enhance the understanding of content themes in technical interview subreddits by applying Latent Dirichlet Allocation (LDA). This goal focuses on extracting concise and meaningful topics from the large amount of text data from Reddit, revealing trends in online discussions surrounding technical interviews.
Technical Proposal: Use LDA to bin the text data for each post (comments and submissions) into topics. Run LDA several times with different numbers of topics to see which configuration produces the most coherent and logical themes, and compare the topic sets produced by each choice of the number-of-topics parameter.
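A minimal sketch of the LDA sweep with Spark ML is below, assuming a DataFrame `features` with a `tf` count-vector column and a `vocab` list from the fitted CountVectorizer in the NLP pipeline (both names are assumptions):

```python
from pyspark.ml.clustering import LDA

for k in [5, 10, 15]:
    lda_model = LDA(k=k, maxIter=20, featuresCol="tf", seed=42).fit(features)
    print(f"k={k}, log perplexity={lda_model.logPerplexity(features):.2f}")

    # Top words per topic, mapped back through the CountVectorizer vocabulary.
    for row in lda_model.describeTopics(maxTermsPerTopic=8).collect():
        print(row["topic"], [vocab[i] for i in row["termIndices"]])
```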
Topic 10: Post Clustering
Business Goal: Discover underlying themes in subreddit discussions by applying K-means clustering. This method aims to segment posts and comments into distinct clusters based on the content of the text, offering a new perspective on prominent themes and conversation patterns within the technical interview Reddit community.
Technical Proposal: Apply K-means clustering to the textual data from the subreddit posts and comments. Tokenize, clean, and vectorize the text to convert it into a suitable format for clustering. Vary the number of clusters by changing the value of k to observe the different natural clusters that form. Analyze the resulting clusters to understand the commonalities within each cluster and how they differ from one another.
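The sketch below shows how the k sweep might be run with Spark ML, assuming a DataFrame `features` with a `tfidf` vector column from the NLP pipeline; the k values and names are illustrative:

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator(featuresCol="tfidf", predictionCol="prediction")
for k in [4, 6, 8, 10]:
    km_model = KMeans(k=k, featuresCol="tfidf", seed=42).fit(features)
    clustered = km_model.transform(features)
    print(f"k={k}, silhouette={evaluator.evaluate(clustered):.3f}")

# Inspect cluster sizes for the last k fitted to see how posts are distributed.
clustered.groupBy("prediction").count().orderBy("prediction").show()
```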