Setup

In [ ]:
# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.3.0

# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 23.3.1
  latest version: 23.10.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.10.0



## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - openjdk


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2023.08.22 |       h06a4308_0         123 KB
    certifi-2023.7.22          |  py310h06a4308_0         153 KB
    openjdk-11.0.13            |       h87a67e3_0       341.0 MB
    ------------------------------------------------------------
                                           Total:       341.3 MB

The following NEW packages will be INSTALLED:

  openjdk            pkgs/main/linux-64::openjdk-11.0.13-h87a67e3_0 

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2023.7.2~ --> pkgs/main::ca-certificates-2023.08.22-h06a4308_0 

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            conda-forge/noarch::certifi-2023.7.22~ --> pkgs/main/linux-64::certifi-2023.7.22-py310h06a4308_0 



Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.
Collecting pyspark==3.3.0
  Using cached pyspark-3.3.0-py2.py3-none-any.whl
Collecting py4j==0.10.9.5 (from pyspark==3.3.0)
  Using cached py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Out[ ]:
In [ ]:
# Import pyspark and build Spark session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

print(spark.version)
Warning: Ignoring non-Spark config property: fs.s3a.aws.credentials.provider
:: loading settings :: url = jar:file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-fde6f6f4-d468-4615-80c7-dd7f3e709851;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.2.2 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.563 in central
:: resolution report :: resolve 439ms :: artifacts dl 54ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.11.563 from central in [default]
	org.apache.hadoop#hadoop-aws;3.2.2 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-fde6f6f4-d468-4615-80c7-dd7f3e709851
	confs: [default]
	0 artifacts copied, 2 already retrieved (0kB/42ms)
23/11/07 21:30:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
3.3.0
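
The "Warning: Ignoring non-Spark config property: fs.s3a.aws.credentials.provider" line above appears because Hadoop settings passed through the builder are only forwarded to the S3A filesystem when they carry the "spark.hadoop." prefix; the S3 reads below still succeed, presumably because the SageMaker container credentials are resolved through the default provider chain anyway. A minimal sketch of the same builder with the prefixed key (otherwise identical to the cell above; it only takes effect when a fresh session is created, so it is not run here):

In [ ]:
# Same session as above; the "spark.hadoop." prefix makes Spark pass the
# credentials-provider setting through to the S3A filesystem configuration.
spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)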
In [ ]:
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()
print(bucket)

# Retrieve the existing Spark session (getOrCreate reuses the session built above,
# as the warning below notes, so the new appName has no effect)
spark = SparkSession.builder.appName("ReadS3Parquet").getOrCreate()
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker-us-east-1-224518912016
23/11/07 21:30:45 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.

Data

In [ ]:
# S3 directory path
s3_directory = f"s3a://{bucket}/project/cleaned/comments/"

# Read all the Parquet files in the directory into a DataFrame
df_comments_conservative_finance = spark.read.parquet(s3_directory)
23/11/07 21:30:46 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
In [ ]:
# S3 directory path
s3_directory = f"s3a://{bucket}/project/cleaned/submissions/"

# Read all the Parquet files in the directory into a DataFrame
df_submissions_conservative_finance = spark.read.parquet(s3_directory)
23/11/07 21:30:52 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

Import Data from Other Subreddits

In [ ]:
# Tegveer subreddits
tegveer_bucket = 'sagemaker-us-east-1-433974840707'
s3_directory_tegveer = f"s3a://{tegveer_bucket}/project/cleaned/submissions/"
s3_directory_tegveer_comments = f"s3a://{tegveer_bucket}/project/cleaned/comments/"

# Read all the Parquet files in the directory into a DataFrame
df_submissions_centrist_liberterian = spark.read.parquet(s3_directory_tegveer)
df_comments_centrist_liberterian = spark.read.parquet(s3_directory_tegveer_comments)

# Eric subreddits
eric_bucket = 'sagemaker-us-east-1-395393721134'
s3_directory_eric = f"s3a://{eric_bucket}/project/cleaned/submissions/"
s3_directory_eric_comments = f"s3a://{eric_bucket}/project/cleaned/comments/"

# Read all the Parquet files in the directory into a DataFrame
df_submissions_socialism_economics_liberal = spark.read.parquet(s3_directory_eric)
df_submissions_socialism_economics_liberal_comments = spark.read.parquet(s3_directory_eric_comments)

# Anthony subreddits
anthony_bucket = 'sagemaker-us-east-1-711387073580'
s3_directory_anthony_submissions = f"s3a://{anthony_bucket}/project/cleaned/submissions/"
s3_directory_anthony_comments = f"s3a://{anthony_bucket}/project/cleaned/comments/"

# Read all the Parquet files in the directory into a DataFrame
df_submissions_askpolitics_changemyview = spark.read.parquet(s3_directory_anthony_submissions)
df_comments_askpolitics_changemyview = spark.read.parquet(s3_directory_anthony_comments)
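
These teammate buckets belong to other AWS accounts, so the reads above assume each bucket's policy grants this notebook's execution role access. A minimal sketch (boto3 ships with the SageMaker image) for confirming a prefix is reachable before kicking off the heavier Parquet read:

In [ ]:
import boto3

# list a few keys under one teammate prefix to confirm cross-account access works
s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket=tegveer_bucket, Prefix="project/cleaned/submissions/", MaxKeys=5
)
print([obj["Key"] for obj in resp.get("Contents", [])])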
In [ ]:
# extract other subreddits

import pyspark.sql.functions as f
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

df_submissions_finance = df_submissions_conservative_finance.filter(col("subreddit")=="finance")
df_submissions_economics = df_submissions_socialism_economics_liberal.filter(col("subreddit")=="Economics")
df_submissions_conservative = df_submissions_conservative_finance.filter(col("subreddit")=="Conservative")
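
A quick sanity check that each filter picked up the intended community is to count rows per filtered DataFrame; a minimal sketch using the variables defined above (not run here):

In [ ]:
# row counts for each filtered subreddit DataFrame
for name, df in [
    ("finance", df_submissions_finance),
    ("Economics", df_submissions_economics),
    ("Conservative", df_submissions_conservative),
]:
    print(name, df.count())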
In [ ]:
df_submissions_economics.printSchema()
root
 |-- author: string (nullable = true)
 |-- title: string (nullable = true)
 |-- selftext: string (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- score: long (nullable = true)
 |-- num_comments: long (nullable = true)
 |-- permalink: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- url: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- is_video: boolean (nullable = true)
 |-- is_self: boolean (nullable = true)
 |-- is_reddit_media_domain: boolean (nullable = true)
 |-- spoiler: boolean (nullable = true)
 |-- over_18: boolean (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- thumbnail: string (nullable = true)
 |-- media: struct (nullable = true)
 |    |-- event_id: string (nullable = true)
 |    |-- oembed: struct (nullable = true)
 |    |    |-- author_name: string (nullable = true)
 |    |    |-- author_url: string (nullable = true)
 |    |    |-- cache_age: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- html: string (nullable = true)
 |    |    |-- provider_name: string (nullable = true)
 |    |    |-- provider_url: string (nullable = true)
 |    |    |-- thumbnail_height: long (nullable = true)
 |    |    |-- thumbnail_url: string (nullable = true)
 |    |    |-- thumbnail_width: long (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- reddit_video: struct (nullable = true)
 |    |    |-- dash_url: string (nullable = true)
 |    |    |-- duration: long (nullable = true)
 |    |    |-- fallback_url: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- hls_url: string (nullable = true)
 |    |    |-- is_gif: boolean (nullable = true)
 |    |    |-- scrubber_media_url: string (nullable = true)
 |    |    |-- transcoding_status: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- type: string (nullable = true)
 |-- secure_media: struct (nullable = true)
 |    |-- event_id: string (nullable = true)
 |    |-- oembed: struct (nullable = true)
 |    |    |-- author_name: string (nullable = true)
 |    |    |-- author_url: string (nullable = true)
 |    |    |-- cache_age: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- html: string (nullable = true)
 |    |    |-- provider_name: string (nullable = true)
 |    |    |-- provider_url: string (nullable = true)
 |    |    |-- thumbnail_height: long (nullable = true)
 |    |    |-- thumbnail_url: string (nullable = true)
 |    |    |-- thumbnail_width: long (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- type: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- archived: boolean (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- crosspost_parent: string (nullable = true)
 |-- crosspost_parent_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- approved_at_utc: string (nullable = true)
 |    |    |-- approved_by: string (nullable = true)
 |    |    |-- archived: boolean (nullable = true)
 |    |    |-- author: string (nullable = true)
 |    |    |-- author_flair_css_class: string (nullable = true)
 |    |    |-- author_flair_text: string (nullable = true)
 |    |    |-- banned_at_utc: string (nullable = true)
 |    |    |-- banned_by: string (nullable = true)
 |    |    |-- brand_safe: boolean (nullable = true)
 |    |    |-- can_gild: boolean (nullable = true)
 |    |    |-- can_mod_post: boolean (nullable = true)
 |    |    |-- clicked: boolean (nullable = true)
 |    |    |-- contest_mode: boolean (nullable = true)
 |    |    |-- created: double (nullable = true)
 |    |    |-- created_utc: double (nullable = true)
 |    |    |-- distinguished: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- downs: long (nullable = true)
 |    |    |-- edited: boolean (nullable = true)
 |    |    |-- gilded: long (nullable = true)
 |    |    |-- hidden: boolean (nullable = true)
 |    |    |-- hide_score: boolean (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- is_crosspostable: boolean (nullable = true)
 |    |    |-- is_reddit_media_domain: boolean (nullable = true)
 |    |    |-- is_self: boolean (nullable = true)
 |    |    |-- is_video: boolean (nullable = true)
 |    |    |-- likes: string (nullable = true)
 |    |    |-- link_flair_css_class: string (nullable = true)
 |    |    |-- link_flair_text: string (nullable = true)
 |    |    |-- locked: boolean (nullable = true)
 |    |    |-- media: string (nullable = true)
 |    |    |-- mod_reports: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- num_comments: long (nullable = true)
 |    |    |-- num_crossposts: long (nullable = true)
 |    |    |-- num_reports: string (nullable = true)
 |    |    |-- over_18: boolean (nullable = true)
 |    |    |-- parent_whitelist_status: string (nullable = true)
 |    |    |-- permalink: string (nullable = true)
 |    |    |-- pinned: boolean (nullable = true)
 |    |    |-- quarantine: boolean (nullable = true)
 |    |    |-- removal_reason: string (nullable = true)
 |    |    |-- report_reasons: string (nullable = true)
 |    |    |-- saved: boolean (nullable = true)
 |    |    |-- score: long (nullable = true)
 |    |    |-- secure_media: string (nullable = true)
 |    |    |-- selftext: string (nullable = true)
 |    |    |-- selftext_html: string (nullable = true)
 |    |    |-- spoiler: boolean (nullable = true)
 |    |    |-- stickied: boolean (nullable = true)
 |    |    |-- subreddit: string (nullable = true)
 |    |    |-- subreddit_id: string (nullable = true)
 |    |    |-- subreddit_name_prefixed: string (nullable = true)
 |    |    |-- subreddit_type: string (nullable = true)
 |    |    |-- suggested_sort: string (nullable = true)
 |    |    |-- thumbnail: string (nullable = true)
 |    |    |-- thumbnail_height: string (nullable = true)
 |    |    |-- thumbnail_width: string (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- ups: long (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- user_reports: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- view_count: string (nullable = true)
 |    |    |-- visited: boolean (nullable = true)
 |    |    |-- whitelist_status: string (nullable = true)
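
The submissions schema is wide, but the aggregations below only touch a few top-level fields. Spark prunes unused Parquet columns on its own, so an explicit select is mainly for readability; a minimal sketch (df_econ_slim is an illustrative name; the columns come from the schema above):

In [ ]:
# keep only the fields used in the time-series EDA below
df_econ_slim = df_submissions_economics.select(
    "created_utc", "score", "num_comments", "subreddit"
)
df_econ_slim.show(5, truncate=False)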

In [ ]:
from pyspark.sql.functions import year, month, count

# group by year and month extracted from created_utc and count the number of rows
df_grouped_by_year_month = df_submissions_economics \
    .groupBy(year("created_utc").alias("year"), month("created_utc").alias("month")) \
    .count()

# order by year and month
df_grouped_by_year_month = df_grouped_by_year_month.orderBy("year", "month")

# show the result
df_grouped_by_year_month.show()
+----+-----+-----+
|year|month|count|
+----+-----+-----+
|2021|    1| 1358|
|2021|    2| 1348|
|2021|    3| 1431|
|2021|    4| 1455|
|2021|    5| 1132|
|2021|    6| 1076|
|2021|    7|  939|
|2021|    8|  828|
|2021|    9| 1061|
|2021|   10|  852|
|2021|   11|  765|
|2021|   12| 1048|
|2022|    1| 1177|
|2022|    2|  936|
|2022|    3| 1234|
|2022|    4|  984|
|2022|    5| 1110|
|2022|    6| 1223|
|2022|    7| 1256|
|2022|    8| 4480|
+----+-----+-----+
only showing top 20 rows

In [ ]:
# order by count of submissions in descending order
df_grouped_by_year_month = df_grouped_by_year_month.orderBy("count", ascending=False)

# show the result
df_grouped_by_year_month.show()
+----+-----+-----+
|year|month|count|
+----+-----+-----+
|2022|    9| 6675|
|2022|    8| 4480|
|2023|    3| 1644|
|2022|   10| 1593|
|2021|    4| 1455|
|2021|    3| 1431|
|2022|   11| 1374|
|2021|    1| 1358|
|2021|    2| 1348|
|2023|    1| 1301|
|2022|    7| 1256|
|2022|    3| 1234|
|2022|   12| 1224|
|2022|    6| 1223|
|2022|    1| 1177|
|2021|    5| 1132|
|2022|    5| 1110|
|2023|    2| 1100|
|2021|    6| 1076|
|2021|    9| 1061|
+----+-----+-----+
only showing top 20 rows

In [ ]:
from pyspark.sql.functions import year, month, avg

# Group by year and month extracted from created_utc and calculate the average score
df_avg_score_by_year_month = df_submissions_economics \
    .groupBy(year("created_utc").alias("year"), month("created_utc").alias("month")) \
    .agg(avg("score").alias("average_score"))

# Order by year and month
df_avg_score_by_year_month = df_avg_score_by_year_month.orderBy("year", "month")

# Show the result
df_avg_score_by_year_month.show()
+----+-----+------------------+
|year|month|     average_score|
+----+-----+------------------+
|2021|    1|246.18114874815905|
|2021|    2|138.58679525222553|
|2021|    3| 171.7393431167016|
|2021|    4| 153.3786941580756|
|2021|    5|155.07508833922262|
|2021|    6|128.57620817843866|
|2021|    7|110.50053248136315|
|2021|    8|148.81521739130434|
|2021|    9| 144.0923656927427|
|2021|   10|176.42723004694835|
|2021|   11|176.98431372549018|
|2021|   12|132.48664122137404|
|2022|    1|145.80798640611724|
|2022|    2|313.80555555555554|
|2022|    3| 296.3095623987034|
|2022|    4| 271.3973577235772|
|2022|    5| 317.7477477477477|
|2022|    6|282.42191332788224|
|2022|    7|282.11544585987264|
|2022|    8| 66.63995535714285|
+----+-----+------------------+
only showing top 20 rows

In [ ]:
from pyspark.sql.functions import year, month, min, max, date_format

# get year and month extracted from created_utc
df_with_year_month = df_submissions_economics \
    .withColumn("year", year("created_utc")) \
    .withColumn("month", month("created_utc"))

# get the date range directly from created_utc (taking min/max of year and month
# independently can pair a year with a month that never occurs in the data)
date_range = df_with_year_month.select(
    date_format(min("created_utc"), "yyyy-MM").alias("earliest"),
    date_format(max("created_utc"), "yyyy-MM").alias("latest"),
).first()

print(f"Earliest date: {date_range['earliest']}")
print(f"Latest date: {date_range['latest']}")
Earliest date: 2021-01
Latest date: 2023-12

Import External Data

In [ ]:
import os
import pandas as pd
os.getcwd()
Out[ ]:
'/root/project/fall-2023-reddit-project-team-01/code/eda'
In [ ]:
unemployment = pd.read_csv("../../data/csv/unemployment.csv", index_col = 0)
fed_funds = pd.read_csv("../../data/csv/fed_funds.csv", index_col = 0)
gdp = pd.read_csv("../../data/csv/gdp.csv", index_col = 0)
In [ ]:
real_gdp = pd.read_csv("../../data/csv/real_gdp.csv", index_col = 0)
In [ ]:
real_gdp.head()
Out[ ]:
value date
0 19055.655 2021-01
1 19358.176 2021-04
2 19465.195 2021-07
3 19805.962 2021-10
4 19727.918 2022-01
In [ ]:
gdp['date'] = pd.to_datetime(gdp['date'], format = "%Y/%m")
gdp.dtypes
Out[ ]:
value           float64
date     datetime64[ns]
dtype: object
In [ ]:
df_grouped_by_year_month.printSchema()
root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- count: long (nullable = false)

In [ ]:
type(df_grouped_by_year_month)
Out[ ]:
pyspark.sql.dataframe.DataFrame
In [ ]:
# convert Spark DataFrame to Pandas DataFrame
df_grouped_by_year_month_pd = df_grouped_by_year_month.toPandas()

# combine 'year' and 'month' into a 'date' column and convert to datetime
df_grouped_by_year_month_pd['date'] = pd.to_datetime(df_grouped_by_year_month_pd.assign(day=1)[['year', 'month', 'day']])
In [ ]:
# convert Spark DataFrame to Pandas DataFrame
df_avg_score_by_year_month_pd = df_avg_score_by_year_month.toPandas()

# combine 'year' and 'month' into a 'date' column and convert to datetime
df_avg_score_by_year_month_pd['date'] = pd.to_datetime(df_avg_score_by_year_month_pd.assign(day=1)[['year', 'month', 'day']])
In [ ]:
df_avg_score_by_year_month_pd.head()
Out[ ]:
year month average_score date
0 2021 1 246.181149 2021-01-01
1 2021 2 138.586795 2021-02-01
2 2021 3 171.739343 2021-03-01
3 2021 4 153.378694 2021-04-01
4 2021 5 155.075088 2021-05-01
In [ ]:
combined_gdp_score = pd.merge(df_avg_score_by_year_month_pd, real_gdp, on='date', how='left')
In [ ]:
combined_gdp_score.head()
Out[ ]:
year month average_score date value
0 2021 1 246.181149 2021-01-01 22187.2805
1 2021 2 138.586795 2021-02-01 NaN
2 2021 3 171.739343 2021-03-01 NaN
3 2021 4 153.378694 2021-04-01 22906.8410
4 2021 5 155.075088 2021-05-01 NaN
In [ ]:
# real GDP is quarterly, so values land once every three monthly rows; shifting by 3
# pairs each quarter-start month with the previous quarter's GDP
combined_gdp_score['value_shifted'] = combined_gdp_score['value'].shift(3)
In [ ]:
combined_gdp_score.head(10)
Out[ ]:
year month average_score date value value_shifted
0 2021 1 246.181149 2021-01-01 19055.655 NaN
1 2021 2 138.586795 2021-02-01 NaN NaN
2 2021 3 171.739343 2021-03-01 NaN NaN
3 2021 4 153.378694 2021-04-01 19358.176 19055.655
4 2021 5 155.075088 2021-05-01 NaN NaN
5 2021 6 128.576208 2021-06-01 NaN NaN
6 2021 7 110.500532 2021-07-01 19465.195 19358.176
7 2021 8 148.815217 2021-08-01 NaN NaN
8 2021 9 144.092366 2021-09-01 NaN NaN
9 2021 10 176.427230 2021-10-01 19805.962 19465.195
In [ ]:
import plotly.graph_objs as go
from plotly.subplots import make_subplots

# create a subplot with 2 y-axes
fig = make_subplots(specs=[[{"secondary_y": True}]])

# add the bar plot for the average submission score
fig.add_trace(
    go.Bar(x=combined_gdp_score['date'], y=combined_gdp_score['average_score'], name='Average Submission Score'),
    secondary_y=False,
)

# add the line plot for real GDP, lagged by one quarter via the shift above
fig.add_trace(
    go.Scatter(x=combined_gdp_score['date'], y=combined_gdp_score['value_shifted'], name='Real GDP', mode='lines+markers'),
    secondary_y=True,
)

# set x-axis title
fig.update_xaxes(title_text="Date")

# set y-axes titles
fig.update_yaxes(title_text="<b>Average Score</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Real GDP</b>", secondary_y=True)

# set plot title and apply the 'simple_white' template
fig.update_layout(
    title_text="Average Post Score and Real GDP Over Time",
    template="simple_white"
)

# show plot
fig.show()
In [ ]:
# Merge on the 'date' column
combined_gdp = pd.merge(df_grouped_by_year_month_pd, gdp, on='date', how='left')
In [ ]:
combined_gdp.head()
Out[ ]:
year month count date value
0 2022 9 6675 2022-09-01 NaN
1 2022 8 4480 2022-08-01 NaN
2 2023 3 1644 2023-03-01 NaN
3 2022 10 1593 2022-10-01 26205.95275
4 2021 4 1455 2021-04-01 22906.84100
In [ ]:
real_gdp['date'] = pd.to_datetime(real_gdp['date'], format = "%Y/%m")
real_gdp.dtypes
Out[ ]:
value           float64
date     datetime64[ns]
dtype: object
In [ ]:
real_gdp.head(10)
Out[ ]:
value date
0 19055.655 2021-01-01
1 19358.176 2021-04-01
2 19465.195 2021-07-01
3 19805.962 2021-10-01
4 19727.918 2022-01-01
5 19681.682 2022-04-01
6 20021.721 2022-07-01
7 20182.491 2022-10-01
8 20235.878 2023-01-01
9 20386.467 2023-04-01
In [ ]:
combined_real_gdp = pd.merge(df_grouped_by_year_month_pd, real_gdp, on='date', how='left')
In [ ]:
combined_real_gdp = combined_real_gdp.sort_values(by='date')
In [ ]:
# as above, lag the quarterly real GDP by one quarter (three monthly rows)
combined_real_gdp['value_shifted'] = combined_real_gdp['value'].shift(3)
In [ ]:
combined_real_gdp.head(20)
Out[ ]:
year month count date value value_shifted
7 2021 1 1358 2021-01-01 19055.655 NaN
8 2021 2 1348 2021-02-01 NaN NaN
5 2021 3 1431 2021-03-01 NaN NaN
4 2021 4 1455 2021-04-01 19358.176 19055.655
15 2021 5 1132 2021-05-01 NaN NaN
18 2021 6 1076 2021-06-01 NaN NaN
22 2021 7 939 2021-07-01 19465.195 19358.176
25 2021 8 828 2021-08-01 NaN NaN
19 2021 9 1061 2021-09-01 NaN NaN
24 2021 10 852 2021-10-01 19805.962 19465.195
26 2021 11 765 2021-11-01 NaN NaN
20 2021 12 1048 2021-12-01 NaN NaN
14 2022 1 1177 2022-01-01 19727.918 19805.962
23 2022 2 936 2022-02-01 NaN NaN
11 2022 3 1234 2022-03-01 NaN NaN
21 2022 4 984 2022-04-01 19681.682 19727.918
16 2022 5 1110 2022-05-01 NaN NaN
13 2022 6 1223 2022-06-01 NaN NaN
10 2022 7 1256 2022-07-01 20021.721 19681.682
1 2022 8 4480 2022-08-01 NaN NaN
In [ ]:
import plotly.graph_objs as go
from plotly.subplots import make_subplots

# create a subplot with 2 y-axes
fig = make_subplots(specs=[[{"secondary_y": True}]])

# add the bar plot for the number of Reddit submissions
fig.add_trace(
    go.Bar(x=combined_real_gdp['date'], y=combined_real_gdp['count'], name='Reddit Submissions'),
    secondary_y=False,
)

# add the line plot for real GDP, lagged by one quarter via the shift above
fig.add_trace(
    go.Scatter(x=combined_real_gdp['date'], y=combined_real_gdp['value_shifted'], name='Real GDP', mode='lines+markers'),
    secondary_y=True,
)

# set x-axis title
fig.update_xaxes(title_text="Date")

# set y-axes titles
fig.update_yaxes(title_text="<b>Reddit Submissions</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Real GDP</b>", secondary_y=True)

# set plot title and apply the 'simple_white' template
fig.update_layout(
    title_text="Reddit Submissions and Real GDP Over Time",
    template="simple_white"
)

# show plot
fig.show()
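
Plotly renders these figures interactively in the notebook, so they do not survive a static export like this one. If the charts need to be kept, they can be written out; a minimal sketch with hypothetical output paths (write_image additionally requires the kaleido package):

In [ ]:
# persist the interactive figure as standalone HTML and as a static PNG
fig.write_html("reddit_submissions_vs_real_gdp.html")
fig.write_image("reddit_submissions_vs_real_gdp.png")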
In [ ]: