Setup¶
In [ ]:
# Setup - Run only once per Kernel App
%conda install openjdk -y
# install PySpark
%pip install pyspark==3.3.0
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
Collecting package metadata (current_repodata.json): done
Solving environment: done

==> WARNING: A newer version of conda exists. <==
  current version: 23.3.1
  latest version: 23.10.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

    conda install conda=23.10.0

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - openjdk

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2023.08.22 |       h06a4308_0         123 KB
    certifi-2023.7.22          |   py310h06a4308_0        153 KB
    openjdk-11.0.13            |       h87a67e3_0       341.0 MB
    ------------------------------------------------------------
                                           Total:       341.3 MB

The following NEW packages will be INSTALLED:

  openjdk            pkgs/main/linux-64::openjdk-11.0.13-h87a67e3_0

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2023.7.2~ --> pkgs/main::ca-certificates-2023.08.22-h06a4308_0

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            conda-forge/noarch::certifi-2023.7.22~ --> pkgs/main/linux-64::certifi-2023.7.22-py310h06a4308_0

Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.

Collecting pyspark==3.3.0
  Using cached pyspark-3.3.0-py2.py3-none-any.whl
Collecting py4j==0.10.9.5 (from pyspark==3.3.0)
  Using cached py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip

Note: you may need to restart the kernel to use updated packages.
Out[ ]:
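Optionally, once the kernel has restarted, you can confirm that the JDK and PySpark installed above are visible to the new kernel. This is a minimal sanity check, not part of the original pipeline:
In [ ]:
# optional sanity check after the kernel restart (assumes the installs above succeeded)
!java -version
import pyspark
print(pyspark.__version__)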
In [ ]:
# Import pyspark and build Spark session
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    # note: Spark ignores bare Hadoop keys passed this way (see the warning in the output below);
    # the usual fix is to prefix the key with "spark.hadoop."
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)
print(spark.version)
Warning: Ignoring non-Spark config property: fs.s3a.aws.credentials.provider
:: loading settings :: url = jar:file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-fde6f6f4-d468-4615-80c7-dd7f3e709851;1.0
    confs: [default]
    found org.apache.hadoop#hadoop-aws;3.2.2 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.563 in central
:: resolution report :: resolve 439ms :: artifacts dl 54ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.563 from central in [default]
    org.apache.hadoop#hadoop-aws;3.2.2 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-fde6f6f4-d468-4615-80c7-dd7f3e709851
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/42ms)
23/11/07 21:30:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
3.3.0
In [ ]:
import sagemaker
session = sagemaker.Session()
bucket = session.default_bucket()
print(bucket)
# Create or retrieve a Spark session
spark = SparkSession.builder.appName("ReadS3Parquet").getOrCreate()
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker-us-east-1-224518912016
23/11/07 21:30:45 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
Data¶
In [ ]:
# S3 directory path
s3_directory = f"s3a://{bucket}/project/cleaned/comments/"
# Read all the Parquet files in the directory into a DataFrame
df_comments_conservative_finance = spark.read.parquet(s3_directory)
23/11/07 21:30:46 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
In [ ]:
# S3 directory path
s3_directory = f"s3a://{bucket}/project/cleaned/submissions/"
# Read all the Parquet files in the directory into a DataFrame
df_submissions_conservative_finance = spark.read.parquet(s3_directory)
23/11/07 21:30:52 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
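As an optional sanity check, we can confirm that both DataFrames loaded and see which subreddits they contain. `count()` and `distinct()` trigger full scans, so this sketch may be slow on large data:
In [ ]:
# optional: verify the loads and list the subreddits present (full-scan actions)
print(df_comments_conservative_finance.count())
print(df_submissions_conservative_finance.count())
df_submissions_conservative_finance.select("subreddit").distinct().show()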
Import Data from other Subreddits¶
In [ ]:
# Tegveer subreddits
tegveer_bucket = 'sagemaker-us-east-1-433974840707'
s3_directory_tegveer = f"s3a://{tegveer_bucket}/project/cleaned/submissions/"
s3_directory_tegveer_comments = f"s3a://{tegveer_bucket}/project/cleaned/comments/"
# Read all the Parquet files in the directory into a DataFrame
df_submissions_centrist_liberterian = spark.read.parquet(s3_directory_tegveer)
df_comments_centrist_liberterian = spark.read.parquet(s3_directory_tegveer_comments)
# Eric subreddits
eric_bucket = 'sagemaker-us-east-1-395393721134'
s3_directory_eric = f"s3a://{eric_bucket}/project/cleaned/submissions/"
s3_directory_eric_comments = f"s3a://{eric_bucket}/project/cleaned/comments/"
# Read all the Parquet files in the directory into a DataFrame
df_submissions_socialism_economics_liberal = spark.read.parquet(s3_directory_eric)
df_submissions_socialism_economics_liberal_comments = spark.read.parquet(s3_directory_eric_comments)
# Anthony subreddits
anthony_bucket = 'sagemaker-us-east-1-711387073580'
s3_directory_anthony_submissions = f"s3a://{anthony_bucket}/project/cleaned/submissions/"
s3_directory_anthony_comments = f"s3a://{anthony_bucket}/project/cleaned/comments/"
# Read all the Parquet files in the directory into a DataFrame
df_submissions_askpolitics_changemyview = spark.read.parquet(s3_directory_anthony_submissions)
df_comments_askpolitics_changemyview = spark.read.parquet(s3_directory_anthony_comments)
In [ ]:
# extract other subreddits
import pyspark.sql.functions as f
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
df_submissions_finance = df_submissions_conservative_finance.filter(col("subreddit")=="finance")
df_submissions_economics = df_submissions_socialism_economics_liberal.filter(col("subreddit")=="Economics")
df_submissions_conservative = df_submissions_conservative_finance.filter(col("subreddit")=="Conservative")
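A quick optional check that each filter actually matched rows; these are Spark actions, so they may take a moment:
In [ ]:
# optional: confirm each filtered DataFrame is non-empty
for name, df_sub in [
    ("finance", df_submissions_finance),
    ("Economics", df_submissions_economics),
    ("Conservative", df_submissions_conservative),
]:
    print(name, df_sub.count())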
In [ ]:
df_submissions_economics.printSchema()
root
 |-- author: string (nullable = true)
 |-- title: string (nullable = true)
 |-- selftext: string (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- score: long (nullable = true)
 |-- num_comments: long (nullable = true)
 |-- permalink: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- url: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- is_video: boolean (nullable = true)
 |-- is_self: boolean (nullable = true)
 |-- is_reddit_media_domain: boolean (nullable = true)
 |-- spoiler: boolean (nullable = true)
 |-- over_18: boolean (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- thumbnail: string (nullable = true)
 |-- media: struct (nullable = true)
 |    |-- event_id: string (nullable = true)
 |    |-- oembed: struct (nullable = true)
 |    |    |-- author_name: string (nullable = true)
 |    |    |-- author_url: string (nullable = true)
 |    |    |-- cache_age: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- html: string (nullable = true)
 |    |    |-- provider_name: string (nullable = true)
 |    |    |-- provider_url: string (nullable = true)
 |    |    |-- thumbnail_height: long (nullable = true)
 |    |    |-- thumbnail_url: string (nullable = true)
 |    |    |-- thumbnail_width: long (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- reddit_video: struct (nullable = true)
 |    |    |-- dash_url: string (nullable = true)
 |    |    |-- duration: long (nullable = true)
 |    |    |-- fallback_url: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- hls_url: string (nullable = true)
 |    |    |-- is_gif: boolean (nullable = true)
 |    |    |-- scrubber_media_url: string (nullable = true)
 |    |    |-- transcoding_status: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- type: string (nullable = true)
 |-- secure_media: struct (nullable = true)
 |    |-- event_id: string (nullable = true)
 |    |-- oembed: struct (nullable = true)
 |    |    |-- author_name: string (nullable = true)
 |    |    |-- author_url: string (nullable = true)
 |    |    |-- cache_age: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- html: string (nullable = true)
 |    |    |-- provider_name: string (nullable = true)
 |    |    |-- provider_url: string (nullable = true)
 |    |    |-- thumbnail_height: long (nullable = true)
 |    |    |-- thumbnail_url: string (nullable = true)
 |    |    |-- thumbnail_width: long (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- type: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- archived: boolean (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- crosspost_parent: string (nullable = true)
 |-- crosspost_parent_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- approved_at_utc: string (nullable = true)
 |    |    |-- approved_by: string (nullable = true)
 |    |    |-- archived: boolean (nullable = true)
 |    |    |-- author: string (nullable = true)
 |    |    |-- author_flair_css_class: string (nullable = true)
 |    |    |-- author_flair_text: string (nullable = true)
 |    |    |-- banned_at_utc: string (nullable = true)
 |    |    |-- banned_by: string (nullable = true)
 |    |    |-- brand_safe: boolean (nullable = true)
 |    |    |-- can_gild: boolean (nullable = true)
 |    |    |-- can_mod_post: boolean (nullable = true)
 |    |    |-- clicked: boolean (nullable = true)
 |    |    |-- contest_mode: boolean (nullable = true)
 |    |    |-- created: double (nullable = true)
 |    |    |-- created_utc: double (nullable = true)
 |    |    |-- distinguished: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- downs: long (nullable = true)
 |    |    |-- edited: boolean (nullable = true)
 |    |    |-- gilded: long (nullable = true)
 |    |    |-- hidden: boolean (nullable = true)
 |    |    |-- hide_score: boolean (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- is_crosspostable: boolean (nullable = true)
 |    |    |-- is_reddit_media_domain: boolean (nullable = true)
 |    |    |-- is_self: boolean (nullable = true)
 |    |    |-- is_video: boolean (nullable = true)
 |    |    |-- likes: string (nullable = true)
 |    |    |-- link_flair_css_class: string (nullable = true)
 |    |    |-- link_flair_text: string (nullable = true)
 |    |    |-- locked: boolean (nullable = true)
 |    |    |-- media: string (nullable = true)
 |    |    |-- mod_reports: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- num_comments: long (nullable = true)
 |    |    |-- num_crossposts: long (nullable = true)
 |    |    |-- num_reports: string (nullable = true)
 |    |    |-- over_18: boolean (nullable = true)
 |    |    |-- parent_whitelist_status: string (nullable = true)
 |    |    |-- permalink: string (nullable = true)
 |    |    |-- pinned: boolean (nullable = true)
 |    |    |-- quarantine: boolean (nullable = true)
 |    |    |-- removal_reason: string (nullable = true)
 |    |    |-- report_reasons: string (nullable = true)
 |    |    |-- saved: boolean (nullable = true)
 |    |    |-- score: long (nullable = true)
 |    |    |-- secure_media: string (nullable = true)
 |    |    |-- selftext: string (nullable = true)
 |    |    |-- selftext_html: string (nullable = true)
 |    |    |-- spoiler: boolean (nullable = true)
 |    |    |-- stickied: boolean (nullable = true)
 |    |    |-- subreddit: string (nullable = true)
 |    |    |-- subreddit_id: string (nullable = true)
 |    |    |-- subreddit_name_prefixed: string (nullable = true)
 |    |    |-- subreddit_type: string (nullable = true)
 |    |    |-- suggested_sort: string (nullable = true)
 |    |    |-- thumbnail: string (nullable = true)
 |    |    |-- thumbnail_height: string (nullable = true)
 |    |    |-- thumbnail_width: string (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- ups: long (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- user_reports: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- view_count: string (nullable = true)
 |    |    |-- visited: boolean (nullable = true)
 |    |    |-- whitelist_status: string (nullable = true)
In [ ]:
from pyspark.sql.functions import year, month, count
# group by year and month extracted from created_utc and count the number of rows
df_grouped_by_year_month = df_submissions_economics \
    .groupBy(year("created_utc").alias("year"), month("created_utc").alias("month")) \
    .count()
# order by year and month
df_grouped_by_year_month = df_grouped_by_year_month.orderBy("year", "month")
# show the result
df_grouped_by_year_month.show()
+----+-----+-----+
|year|month|count|
+----+-----+-----+
|2021|    1| 1358|
|2021|    2| 1348|
|2021|    3| 1431|
|2021|    4| 1455|
|2021|    5| 1132|
|2021|    6| 1076|
|2021|    7|  939|
|2021|    8|  828|
|2021|    9| 1061|
|2021|   10|  852|
|2021|   11|  765|
|2021|   12| 1048|
|2022|    1| 1177|
|2022|    2|  936|
|2022|    3| 1234|
|2022|    4|  984|
|2022|    5| 1110|
|2022|    6| 1223|
|2022|    7| 1256|
|2022|    8| 4480|
+----+-----+-----+
only showing top 20 rows
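An equivalent way to do this grouping (an optional sketch) keeps a single month-start date column via `date_trunc`, which avoids reassembling the date from separate year and month columns later in pandas; `df_monthly_counts` is a name introduced here for illustration:
In [ ]:
from pyspark.sql.functions import date_trunc
# group by the first day of each month instead of separate year/month columns
df_monthly_counts = (
    df_submissions_economics
    .groupBy(date_trunc("month", "created_utc").alias("month_start"))
    .count()
    .orderBy("month_start")
)
df_monthly_counts.show(5)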
In [ ]:
# order by count of submissions in descending order
df_grouped_by_year_month = df_grouped_by_year_month.orderBy("count", ascending=False)
# show the result
df_grouped_by_year_month.show()
+----+-----+-----+
|year|month|count|
+----+-----+-----+
|2022|    9| 6675|
|2022|    8| 4480|
|2023|    3| 1644|
|2022|   10| 1593|
|2021|    4| 1455|
|2021|    3| 1431|
|2022|   11| 1374|
|2021|    1| 1358|
|2021|    2| 1348|
|2023|    1| 1301|
|2022|    7| 1256|
|2022|    3| 1234|
|2022|   12| 1224|
|2022|    6| 1223|
|2022|    1| 1177|
|2021|    5| 1132|
|2022|    5| 1110|
|2023|    2| 1100|
|2021|    6| 1076|
|2021|    9| 1061|
+----+-----+-----+
only showing top 20 rows
In [ ]:
from pyspark.sql.functions import year, month, avg
# Group by year and month extracted from created_utc and calculate the average score
df_avg_score_by_year_month = df_submissions_economics \
    .groupBy(year("created_utc").alias("year"), month("created_utc").alias("month")) \
    .agg(avg("score").alias("average_score"))
# Order by year and month
df_avg_score_by_year_month = df_avg_score_by_year_month.orderBy("year", "month")
# Show the result
df_avg_score_by_year_month.show()
+----+-----+------------------+
|year|month|     average_score|
+----+-----+------------------+
|2021|    1|246.18114874815905|
|2021|    2|138.58679525222553|
|2021|    3| 171.7393431167016|
|2021|    4| 153.3786941580756|
|2021|    5|155.07508833922262|
|2021|    6|128.57620817843866|
|2021|    7|110.50053248136315|
|2021|    8|148.81521739130434|
|2021|    9| 144.0923656927427|
|2021|   10|176.42723004694835|
|2021|   11|176.98431372549018|
|2021|   12|132.48664122137404|
|2022|    1|145.80798640611724|
|2022|    2|313.80555555555554|
|2022|    3| 296.3095623987034|
|2022|    4| 271.3973577235772|
|2022|    5| 317.7477477477477|
|2022|    6|282.42191332788224|
|2022|    7|282.11544585987264|
|2022|    8| 66.63995535714285|
+----+-----+------------------+
only showing top 20 rows
In [ ]:
from pyspark.sql.functions import year, month, min, max
# get year and month
df_with_year_month = df_submissions_economics \
    .withColumn("year", year("created_utc")) \
    .withColumn("month", month("created_utc"))
# get range
min_date = df_with_year_month.select(min("year"), min("month")).first()
max_date = df_with_year_month.select(max("year"), max("month")).first()
print(f"Earliest date: {min_date['min(year)']}-{min_date['min(month)']:02d}")
print(f"Latest date: {max_date['max(year)']}-{max_date['max(month)']:02d}")
Earliest date: 2021-01
Latest date: 2023-12
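One caveat: `min("year")` and `min("month")` above are computed independently, so the month reported for the latest date can come from a different year than the year reported. A sketch (optional) that derives the range from a single year-month string instead:
In [ ]:
from pyspark.sql.functions import date_format, min as min_, max as max_
# take min/max over one combined year-month value rather than over year and month separately
date_range = df_submissions_economics.select(
    min_(date_format("created_utc", "yyyy-MM")).alias("earliest"),
    max_(date_format("created_utc", "yyyy-MM")).alias("latest"),
).first()
print(f"Earliest date: {date_range['earliest']}")
print(f"Latest date: {date_range['latest']}")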
Import External Data¶
In [ ]:
import os
import pandas as pd
os.getcwd()
Out[ ]:
'/root/project/fall-2023-reddit-project-team-01/code/eda'
In [ ]:
unemployment = pd.read_csv("../../data/csv/unemployment.csv", index_col = 0)
fed_funds = pd.read_csv("../../data/csv/fed_funds.csv", index_col = 0)
gdp = pd.read_csv("../../data/csv/gdp.csv", index_col = 0)
In [ ]:
real_gdp = pd.read_csv("../../data/csv/real_gdp.csv", index_col = 0)
In [ ]:
real_gdp.head()
Out[ ]:
 | value | date |
---|---|---|
0 | 19055.655 | 2021-01 |
1 | 19358.176 | 2021-04 |
2 | 19465.195 | 2021-07 |
3 | 19805.962 | 2021-10 |
4 | 19727.918 | 2022-01 |
In [ ]:
gdp['date'] = pd.to_datetime(gdp['date'], format = "%Y/%m")
gdp.dtypes
Out[ ]:
value           float64
date     datetime64[ns]
dtype: object
In [ ]:
df_grouped_by_year_month.printSchema()
root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- count: long (nullable = false)
In [ ]:
type(df_grouped_by_year_month)
Out[ ]:
pyspark.sql.dataframe.DataFrame
In [ ]:
# convert Spark DataFrame to Pandas DataFrame
df_grouped_by_year_month_pd = df_grouped_by_year_month.toPandas()
# combine 'year' and 'month' into a 'date' column and convert to datetime
df_grouped_by_year_month_pd['date'] = pd.to_datetime(df_grouped_by_year_month_pd.assign(day=1)[['year', 'month', 'day']])
In [ ]:
df_avg_score_by_year_month
# convert Spark DataFrame to Pandas DataFrame
df_avg_score_by_year_month_pd = df_avg_score_by_year_month.toPandas()
# combine 'year' and 'month' into a 'date' column and convert to datetime
df_avg_score_by_year_month_pd['date'] = pd.to_datetime(df_avg_score_by_year_month_pd.assign(day=1)[['year', 'month', 'day']])
In [ ]:
df_avg_score_by_year_month_pd.head()
Out[ ]:
 | year | month | average_score | date |
---|---|---|---|---|
0 | 2021 | 1 | 246.181149 | 2021-01-01 |
1 | 2021 | 2 | 138.586795 | 2021-02-01 |
2 | 2021 | 3 | 171.739343 | 2021-03-01 |
3 | 2021 | 4 | 153.378694 | 2021-04-01 |
4 | 2021 | 5 | 155.075088 | 2021-05-01 |
In [ ]:
combined_gdp_score = pd.merge(df_avg_score_by_year_month_pd, real_gdp, on='date', how='left')
In [ ]:
combined_gdp_score.head()
Out[ ]:
 | year | month | average_score | date | value |
---|---|---|---|---|---|
0 | 2021 | 1 | 246.181149 | 2021-01-01 | 22187.2805 |
1 | 2021 | 2 | 138.586795 | 2021-02-01 | NaN |
2 | 2021 | 3 | 171.739343 | 2021-03-01 | NaN |
3 | 2021 | 4 | 153.378694 | 2021-04-01 | 22906.8410 |
4 | 2021 | 5 | 155.075088 | 2021-05-01 | NaN |
In [ ]:
# lag the quarterly real GDP values by one quarter (three monthly rows)
combined_gdp_score['value_shifted'] = combined_gdp_score['value'].shift(3)
In [ ]:
combined_gdp_score.head(10)
Out[ ]:
 | year | month | average_score | date | value | value_shifted |
---|---|---|---|---|---|---|
0 | 2021 | 1 | 246.181149 | 2021-01-01 | 19055.655 | NaN |
1 | 2021 | 2 | 138.586795 | 2021-02-01 | NaN | NaN |
2 | 2021 | 3 | 171.739343 | 2021-03-01 | NaN | NaN |
3 | 2021 | 4 | 153.378694 | 2021-04-01 | 19358.176 | 19055.655 |
4 | 2021 | 5 | 155.075088 | 2021-05-01 | NaN | NaN |
5 | 2021 | 6 | 128.576208 | 2021-06-01 | NaN | NaN |
6 | 2021 | 7 | 110.500532 | 2021-07-01 | 19465.195 | 19358.176 |
7 | 2021 | 8 | 148.815217 | 2021-08-01 | NaN | NaN |
8 | 2021 | 9 | 144.092366 | 2021-09-01 | NaN | NaN |
9 | 2021 | 10 | 176.427230 | 2021-10-01 | 19805.962 | 19465.195 |
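Because real GDP is quarterly, only the first month of each quarter has a value after the merge, and `shift(3)` moves each figure three monthly rows (one quarter) later. If instead every month should carry its own quarter's GDP, a forward fill is one option; this is a sketch, and `value_filled` is a column name made up here:
In [ ]:
# optional alternative: carry each quarterly GDP value forward across the months of its quarter
combined_gdp_score['value_filled'] = combined_gdp_score['value'].ffill()
combined_gdp_score[['date', 'value', 'value_filled']].head(6)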
In [ ]:
import plotly.graph_objs as go
from plotly.subplots import make_subplots
# create a subplot with 2 y-axes
fig = make_subplots(specs=[[{"secondary_y": True}]])
# add the bar plot for the average submission score
fig.add_trace(
    go.Bar(x=combined_gdp_score['date'], y=combined_gdp_score['average_score'], name='Average Submission Score'),
    secondary_y=False,
)
# add the line plot for real GDP, using the shifted values from above
fig.add_trace(
    go.Scatter(x=combined_gdp_score['date'], y=combined_gdp_score['value_shifted'], name='Real GDP', mode='lines+markers'),
    secondary_y=True,
)
# set x-axis title
fig.update_xaxes(title_text="Date")
# set y-axes titles
fig.update_yaxes(title_text="<b>Average Score</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>GDP</b>", secondary_y=True)
# set plot title and apply the 'simple_white' template
fig.update_layout(
    title_text="Average Post Score and Real GDP Over Time",
    template="simple_white"
)
# show plot
fig.show()
In [ ]:
# Merge on the 'date' column
combined_gdp = pd.merge(df_grouped_by_year_month_pd, gdp, on='date', how='left')
In [ ]:
combined_gdp.head()
Out[ ]:
 | year | month | count | date | value |
---|---|---|---|---|---|
0 | 2022 | 9 | 6675 | 2022-09-01 | NaN |
1 | 2022 | 8 | 4480 | 2022-08-01 | NaN |
2 | 2023 | 3 | 1644 | 2023-03-01 | NaN |
3 | 2022 | 10 | 1593 | 2022-10-01 | 26205.95275 |
4 | 2021 | 4 | 1455 | 2021-04-01 | 22906.84100 |
In [ ]:
real_gdp['date'] = pd.to_datetime(real_gdp['date'], format = "%Y/%m")
real_gdp.dtypes
Out[ ]:
value           float64
date     datetime64[ns]
dtype: object
In [ ]:
real_gdp.head(10)
Out[ ]:
 | value | date |
---|---|---|
0 | 19055.655 | 2021-01-01 |
1 | 19358.176 | 2021-04-01 |
2 | 19465.195 | 2021-07-01 |
3 | 19805.962 | 2021-10-01 |
4 | 19727.918 | 2022-01-01 |
5 | 19681.682 | 2022-04-01 |
6 | 20021.721 | 2022-07-01 |
7 | 20182.491 | 2022-10-01 |
8 | 20235.878 | 2023-01-01 |
9 | 20386.467 | 2023-04-01 |
In [ ]:
combined_real_gdp = pd.merge(df_grouped_by_year_month_pd, real_gdp, on='date', how='left')
In [ ]:
combined_real_gdp = combined_real_gdp.sort_values(by='date')
In [ ]:
# lag the quarterly real GDP values by one quarter (three monthly rows)
combined_real_gdp['value_shifted'] = combined_real_gdp['value'].shift(3)
In [ ]:
combined_real_gdp.head(20)
Out[ ]:
 | year | month | count | date | value | value_shifted |
---|---|---|---|---|---|---|
7 | 2021 | 1 | 1358 | 2021-01-01 | 19055.655 | NaN |
8 | 2021 | 2 | 1348 | 2021-02-01 | NaN | NaN |
5 | 2021 | 3 | 1431 | 2021-03-01 | NaN | NaN |
4 | 2021 | 4 | 1455 | 2021-04-01 | 19358.176 | 19055.655 |
15 | 2021 | 5 | 1132 | 2021-05-01 | NaN | NaN |
18 | 2021 | 6 | 1076 | 2021-06-01 | NaN | NaN |
22 | 2021 | 7 | 939 | 2021-07-01 | 19465.195 | 19358.176 |
25 | 2021 | 8 | 828 | 2021-08-01 | NaN | NaN |
19 | 2021 | 9 | 1061 | 2021-09-01 | NaN | NaN |
24 | 2021 | 10 | 852 | 2021-10-01 | 19805.962 | 19465.195 |
26 | 2021 | 11 | 765 | 2021-11-01 | NaN | NaN |
20 | 2021 | 12 | 1048 | 2021-12-01 | NaN | NaN |
14 | 2022 | 1 | 1177 | 2022-01-01 | 19727.918 | 19805.962 |
23 | 2022 | 2 | 936 | 2022-02-01 | NaN | NaN |
11 | 2022 | 3 | 1234 | 2022-03-01 | NaN | NaN |
21 | 2022 | 4 | 984 | 2022-04-01 | 19681.682 | 19727.918 |
16 | 2022 | 5 | 1110 | 2022-05-01 | NaN | NaN |
13 | 2022 | 6 | 1223 | 2022-06-01 | NaN | NaN |
10 | 2022 | 7 | 1256 | 2022-07-01 | 20021.721 | 19681.682 |
1 | 2022 | 8 | 4480 | 2022-08-01 | NaN | NaN |
In [ ]:
import plotly.graph_objs as go
from plotly.subplots import make_subplots
# create a subplot with 2 y-axes
fig = make_subplots(specs=[[{"secondary_y": True}]])
# add the bar plot for the number of Reddit submissions
fig.add_trace(
    go.Bar(x=combined_real_gdp['date'], y=combined_real_gdp['count'], name='Reddit Submissions'),
    secondary_y=False,
)
# add the line plot for real GDP, using the shifted values from above
fig.add_trace(
    go.Scatter(x=combined_real_gdp['date'], y=combined_real_gdp['value_shifted'], name='Real GDP', mode='lines+markers'),
    secondary_y=True,
)
# set x-axis title
fig.update_xaxes(title_text="Date")
# set y-axes titles
fig.update_yaxes(title_text="<b>Reddit Submissions</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>GDP</b>", secondary_y=True)
# set plot title and apply the 'simple_white' template
fig.update_layout(
    title_text="Reddit Submissions and Real GDP Over Time",
    template="simple_white"
)
# show plot
fig.show()