Code: EDA-Anime

Setup

AWS

# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.2.0

# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

# Import pyspark and build Spark session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    # Hadoop options need the "spark.hadoop." prefix; without it Spark
    # ignores them as non-Spark config properties
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

print(spark.version)
Collecting package metadata (current_repodata.json): done
Solving environment: done


## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - openjdk


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2023.08.22 |       h06a4308_0         123 KB
    certifi-2023.11.17         |  py310h06a4308_0         158 KB
    openjdk-11.0.13            |       h87a67e3_0       341.0 MB
    ------------------------------------------------------------
                                           Total:       341.3 MB

The following NEW packages will be INSTALLED:

  openjdk            pkgs/main/linux-64::openjdk-11.0.13-h87a67e3_0 

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2023.7.2~ --> pkgs/main::ca-certificates-2023.08.22-h06a4308_0 
  certifi            conda-forge/noarch::certifi-2023.7.22~ --> pkgs/main/linux-64::certifi-2023.11.17-py310h06a4308_0 



Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.
Collecting pyspark==3.2.0
  Using cached pyspark-3.2.0-py2.py3-none-any.whl
Collecting py4j==0.10.9.2 (from pyspark==3.2.0)
  Using cached py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.2 pyspark-3.2.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Note: you may need to restart the kernel to use updated packages.
Warning: Ignoring non-Spark config property: fs.s3a.aws.credentials.provider
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/spark-unsafe_2.12-3.2.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
:: loading settings :: url = jar:file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-8afe1d5f-af95-49e4-af32-ce0b0801deb1;1.0
    confs: [default]
    found org.apache.hadoop#hadoop-aws;3.2.2 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.563 in central
:: resolution report :: resolve 428ms :: artifacts dl 27ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.563 from central in [default]
    org.apache.hadoop#hadoop-aws;3.2.2 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-8afe1d5f-af95-49e4-af32-ce0b0801deb1
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/22ms)
23/12/05 07:31:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
3.2.0

Import packages

# Import PySpark SQL functions used throughout the EDA
import pyspark.sql.functions as F
from pyspark.sql.functions import (
    sum as _sum, mean, stddev, max as _max, min as _min, count,
    percentile_approx, year, month, dayofmonth, ceil, col, dayofweek,
    hour, explode, date_format, lower, size, split, regexp_replace,
    isnan, when,
)

Azure

spark
StatementMeta(bdbc6c25-e1fb-4fd1-9ce0-b2c60e73c096, 5, 6, Finished, Available)

SparkSession - hive

SparkContext
  Version: v3.2.2.5.1-100879434
  Master: yarn
  AppName: Azure ML Experiment

Read full Reddit data

blob_account_name = "marckvnonprodblob"
blob_container_name = "bigdata"
# read only
blob_sas_token = "?sv=2021-10-04&st=2023-10-04T01%3A42%3A59Z&se=2024-01-02T02%3A42%3A00Z&sr=c&sp=rlf&sig=w3CH9MbCOpwO7DtHlrahc7AlRPxSZZb8MOgS6TaXLzI%3D"

wasbs_base_url = (
    f"wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/"
)
spark.conf.set(
    f"fs.azure.sas.{blob_container_name}.{blob_account_name}.blob.core.windows.net",
    blob_sas_token,
)
comments_path = "reddit-parquet/comments/"
submissions_path = "reddit-parquet/submissions/"
comments_df = spark.read.parquet(f"{wasbs_base_url}{comments_path}")
submissions_df = spark.read.parquet(f"{wasbs_base_url}{submissions_path}")

Filter Reddit data based on the selected subreddit(s)

# Single subreddit
submissions = submissions_df.filter(submissions_df.subreddit == "anime")
# Multiple subreddits
submissions_multiple = submissions_df.filter(
    (col("subreddit") == "anime") | (col("subreddit") == "pokemon")
)
comments = comments_df.filter(comments_df.subreddit == "anime")
comments_multiple = comments_df.filter(
    (col("subreddit") == "anime") | (col("subreddit") == "pokemon")
)

Read the filtered data

AWS

%%time
import sagemaker
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_comments = "project/comments/yyyy=*"

s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#s3_path = "s3a://sagemaker-us-east-1-038932893404/project/comments/yyyy=2021/part-00000-90796409-5783-4705-92c0-27c27eda8c4c-c000.snappy.parquet"
print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path)  # parquet carries its own schema; no header option needed
print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
reading comments from s3a://sagemaker-us-east-1-300610919477/project/comments/yyyy=*
23/12/05 05:49:36 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
shape of the comments dataframe is 7,356,042x21
CPU times: user 3.1 s, sys: 364 ms, total: 3.47 s
Wall time: 20min 20s
%%time
output_prefix_data_submissions = "project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path)
print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
reading submissions from s3a://sagemaker-us-east-1-300610919477/project/submissions/yyyy=*
23/12/05 06:10:21 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
shape of the submissions dataframe is 404,298x68
CPU times: user 187 ms, sys: 33.1 ms, total: 221 ms
Wall time: 8min 37s

Basic Data Information

  1. Data counts
  2. Data schema
  3. Interesting columns (data selection)
submissions.groupBy('subreddit').count().show()
+---------+------+
|subreddit| count|
+---------+------+
|    anime|404298|
+---------+------+
submissions.printSchema()
root
 |-- adserver_click_url: string (nullable = true)
 |-- adserver_imp_pixel: string (nullable = true)
 |-- archived: boolean (nullable = true)
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- author_id: string (nullable = true)
 |-- brand_safe: boolean (nullable = true)
 |-- contest_mode: boolean (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- crosspost_parent: string (nullable = true)
 |-- crosspost_parent_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- approved_at_utc: string (nullable = true)
 |    |    |-- approved_by: string (nullable = true)
 |    |    |-- archived: boolean (nullable = true)
 |    |    |-- author: string (nullable = true)
 |    |    |-- author_flair_css_class: string (nullable = true)
 |    |    |-- author_flair_text: string (nullable = true)
 |    |    |-- banned_at_utc: string (nullable = true)
 |    |    |-- banned_by: string (nullable = true)
 |    |    |-- brand_safe: boolean (nullable = true)
 |    |    |-- can_gild: boolean (nullable = true)
 |    |    |-- can_mod_post: boolean (nullable = true)
 |    |    |-- clicked: boolean (nullable = true)
 |    |    |-- contest_mode: boolean (nullable = true)
 |    |    |-- created: double (nullable = true)
 |    |    |-- created_utc: double (nullable = true)
 |    |    |-- distinguished: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- downs: long (nullable = true)
 |    |    |-- edited: boolean (nullable = true)
 |    |    |-- gilded: long (nullable = true)
 |    |    |-- hidden: boolean (nullable = true)
 |    |    |-- hide_score: boolean (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- is_crosspostable: boolean (nullable = true)
 |    |    |-- is_reddit_media_domain: boolean (nullable = true)
 |    |    |-- is_self: boolean (nullable = true)
 |    |    |-- is_video: boolean (nullable = true)
 |    |    |-- likes: string (nullable = true)
 |    |    |-- link_flair_css_class: string (nullable = true)
 |    |    |-- link_flair_text: string (nullable = true)
 |    |    |-- locked: boolean (nullable = true)
 |    |    |-- media: string (nullable = true)
 |    |    |-- mod_reports: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- num_comments: long (nullable = true)
 |    |    |-- num_crossposts: long (nullable = true)
 |    |    |-- num_reports: string (nullable = true)
 |    |    |-- over_18: boolean (nullable = true)
 |    |    |-- parent_whitelist_status: string (nullable = true)
 |    |    |-- permalink: string (nullable = true)
 |    |    |-- pinned: boolean (nullable = true)
 |    |    |-- quarantine: boolean (nullable = true)
 |    |    |-- removal_reason: string (nullable = true)
 |    |    |-- report_reasons: string (nullable = true)
 |    |    |-- saved: boolean (nullable = true)
 |    |    |-- score: long (nullable = true)
 |    |    |-- secure_media: string (nullable = true)
 |    |    |-- selftext: string (nullable = true)
 |    |    |-- selftext_html: string (nullable = true)
 |    |    |-- spoiler: boolean (nullable = true)
 |    |    |-- stickied: boolean (nullable = true)
 |    |    |-- subreddit: string (nullable = true)
 |    |    |-- subreddit_id: string (nullable = true)
 |    |    |-- subreddit_name_prefixed: string (nullable = true)
 |    |    |-- subreddit_type: string (nullable = true)
 |    |    |-- suggested_sort: string (nullable = true)
 |    |    |-- thumbnail: string (nullable = true)
 |    |    |-- thumbnail_height: string (nullable = true)
 |    |    |-- thumbnail_width: string (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- ups: long (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- user_reports: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- view_count: string (nullable = true)
 |    |    |-- visited: boolean (nullable = true)
 |    |    |-- whitelist_status: string (nullable = true)
 |-- disable_comments: boolean (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- domain_override: string (nullable = true)
 |-- edited: string (nullable = true)
 |-- embed_type: string (nullable = true)
 |-- embed_url: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- hidden: boolean (nullable = true)
 |-- hide_score: boolean (nullable = true)
 |-- href_url: string (nullable = true)
 |-- id: string (nullable = true)
 |-- imp_pixel: string (nullable = true)
 |-- is_crosspostable: boolean (nullable = true)
 |-- is_reddit_media_domain: boolean (nullable = true)
 |-- is_self: boolean (nullable = true)
 |-- is_video: boolean (nullable = true)
 |-- link_flair_css_class: string (nullable = true)
 |-- link_flair_text: string (nullable = true)
 |-- locked: boolean (nullable = true)
 |-- media: struct (nullable = true)
 |    |-- event_id: string (nullable = true)
 |    |-- oembed: struct (nullable = true)
 |    |    |-- author_name: string (nullable = true)
 |    |    |-- author_url: string (nullable = true)
 |    |    |-- cache_age: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- html: string (nullable = true)
 |    |    |-- provider_name: string (nullable = true)
 |    |    |-- provider_url: string (nullable = true)
 |    |    |-- thumbnail_height: long (nullable = true)
 |    |    |-- thumbnail_url: string (nullable = true)
 |    |    |-- thumbnail_width: long (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- reddit_video: struct (nullable = true)
 |    |    |-- dash_url: string (nullable = true)
 |    |    |-- duration: long (nullable = true)
 |    |    |-- fallback_url: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- hls_url: string (nullable = true)
 |    |    |-- is_gif: boolean (nullable = true)
 |    |    |-- scrubber_media_url: string (nullable = true)
 |    |    |-- transcoding_status: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- type: string (nullable = true)
 |-- media_embed: struct (nullable = true)
 |    |-- content: string (nullable = true)
 |    |-- height: long (nullable = true)
 |    |-- scrolling: boolean (nullable = true)
 |    |-- width: long (nullable = true)
 |-- mobile_ad_url: string (nullable = true)
 |-- num_comments: long (nullable = true)
 |-- num_crossposts: long (nullable = true)
 |-- original_link: string (nullable = true)
 |-- over_18: boolean (nullable = true)
 |-- parent_whitelist_status: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- pinned: boolean (nullable = true)
 |-- post_hint: string (nullable = true)
 |-- preview: struct (nullable = true)
 |    |-- enabled: boolean (nullable = true)
 |    |-- images: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |-- variants: struct (nullable = true)
 |    |    |    |    |-- gif: struct (nullable = true)
 |    |    |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |-- mp4: struct (nullable = true)
 |    |    |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |-- nsfw: struct (nullable = true)
 |    |    |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |-- obfuscated: struct (nullable = true)
 |    |    |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |-- width: long (nullable = true)
 |-- promoted: boolean (nullable = true)
 |-- promoted_by: string (nullable = true)
 |-- promoted_display_name: string (nullable = true)
 |-- promoted_url: string (nullable = true)
 |-- retrieved_on: timestamp (nullable = true)
 |-- score: long (nullable = true)
 |-- secure_media: struct (nullable = true)
 |    |-- event_id: string (nullable = true)
 |    |-- oembed: struct (nullable = true)
 |    |    |-- author_name: string (nullable = true)
 |    |    |-- author_url: string (nullable = true)
 |    |    |-- cache_age: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- html: string (nullable = true)
 |    |    |-- provider_name: string (nullable = true)
 |    |    |-- provider_url: string (nullable = true)
 |    |    |-- thumbnail_height: long (nullable = true)
 |    |    |-- thumbnail_url: string (nullable = true)
 |    |    |-- thumbnail_width: long (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- type: string (nullable = true)
 |-- secure_media_embed: struct (nullable = true)
 |    |-- content: string (nullable = true)
 |    |-- height: long (nullable = true)
 |    |-- media_domain_url: string (nullable = true)
 |    |-- scrolling: boolean (nullable = true)
 |    |-- width: long (nullable = true)
 |-- selftext: string (nullable = true)
 |-- spoiler: boolean (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)
 |-- suggested_sort: string (nullable = true)
 |-- third_party_trackers: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- third_party_tracking: string (nullable = true)
 |-- third_party_tracking_2: string (nullable = true)
 |-- thumbnail: string (nullable = true)
 |-- thumbnail_height: long (nullable = true)
 |-- thumbnail_width: long (nullable = true)
 |-- title: string (nullable = true)
 |-- url: string (nullable = true)
 |-- whitelist_status: string (nullable = true)
sub = submissions.select(
    "subreddit", "author", "author_flair_text", "created_utc", "title",
    "selftext", "num_comments", "num_crossposts", "over_18", "score",
    "stickied", "id",
)
sub.show()
+---------+--------------------+--------------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+
|subreddit|              author|   author_flair_text|        created_utc|               title|            selftext|num_comments|num_crossposts|over_18|score|stickied|    id|
+---------+--------------------+--------------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+
|    anime|Affectionate_Tutor_3|                null|2021-04-19 20:34:43|Invincible - Epis...|                    |           1|             0|  false|    1|   false|mu9vpw|
|    anime|           [deleted]|                null|2021-04-19 20:37:02|Spotted this gem ...|           [deleted]|           1|             0|  false|    9|   false|mu9xfr|
|    anime|           [deleted]|                null|2021-04-19 20:38:33|Do you watch show...|           [deleted]|          12|             0|  false|    0|   false|mu9ykt|
|    anime|PsychologicalGift299|                null|2021-04-19 20:42:46|Anime movies for ...|So as my fellow o...|          12|             0|  false|    0|   false|mua1uo|
|    anime|        Tuttles4ever|                null|2021-04-19 20:48:42|I need a very spe...|Are there any ani...|           7|             0|  false|    0|   false|mua6g3|
|    anime|          xXdamaneXx|                null|2021-04-19 20:50:25|            Funny 😂|                    |           1|             0|  false|    1|   false|mua7s3|
|    anime|   Ok-Dragonfly-9999|                null|2021-04-19 20:50:45|Anybody got the s...|                    |           0|             0|  false|    1|   false|mua815|
|    anime|           [deleted]|                null|2021-04-19 20:52:26|Anime for the Family|           [deleted]|           8|             0|  false|    1|   false|mua9bd|
|    anime|          nemifloras|                null|2021-04-19 20:52:42|Any atmospheric a...|i finished reassi...|           9|             0|  false|    0|   false|mua9iu|
|    anime|           [deleted]|                null|2021-04-19 20:54:30|         ZeroTwo🥵🥵|           [deleted]|           1|             0|  false|    1|   false|muaavr|
|    anime|           [deleted]|                null|2021-04-19 20:55:30|Please recommend ...|           [deleted]|           8|             0|  false|    0|   false|muabnv|
|    anime|       YOungwolfzzzz|                null|2021-04-19 20:55:46|a question about ...|           [removed]|           2|             0|  false|    1|   false|muabvm|
|    anime|         _Ualien_591|                null|2021-04-19 20:55:48|What anime’s shou...|           [removed]|           1|             0|  false|    1|   false|muabwt|
|    anime|            sznsteve|                null|2021-04-19 20:58:08|Fate Series in Order|           [removed]|           4|             0|  false|    1|   false|muadmx|
|    anime|           [deleted]|                null|2021-04-19 20:59:26|         FBI OPEN UP|           [deleted]|           1|             0|  false|    1|   false|muaemq|
|    anime|            Gaporigo|http://myanimelis...|2021-04-19 21:00:25|Best Girl of Wint...|##[Vote here!](ht...|          91|             1|  false|  153|   false|muafhy|
|    anime|    CommanderJake724|                null|2021-04-19 21:00:32|Does anyone know ...|Ok I forgot this ...|           4|             0|  false|    1|   false|muafln|
|    anime|           [deleted]|                null|2021-04-19 21:01:12|Is Black Clover w...|           [deleted]|          17|             0|  false|    1|   false|muag59|
|    anime|         AustrianBen|                null|2021-04-19 21:02:15|Was thinking abou...|Hello! I think an...|          10|             0|  false|    0|   false|muagz4|
|    anime|           [deleted]|                null|2021-04-19 21:04:21|To Whom This May ...|           [removed]|           1|             0|  false|    2|   false|muainr|
+---------+--------------------+--------------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+
only showing top 20 rows
comments.groupBy('subreddit').count().show()
+---------+-------+
|subreddit|  count|
+---------+-------+
|    anime|7356042|
+---------+-------+
comments.printSchema()
root
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- body: string (nullable = true)
 |-- can_gild: boolean (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- edited: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- id: string (nullable = true)
 |-- is_submitter: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- retrieved_on: timestamp (nullable = true)
 |-- score: long (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)
com = comments.select(
    "subreddit", "author", "author_flair_text", "created_utc", "body",
    "controversiality", "score", "parent_id", "stickied", "link_id", "id",
)
com.show()
+---------+-----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+
|subreddit|           author|   author_flair_text|        created_utc|                body|controversiality|score| parent_id|stickied|  link_id|     id|
+---------+-----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+
|    anime|    DonaldJenkins|                null|2021-11-14 04:39:47|  I sent it to ya ;)|               0|    1|t1_hk0whi9|   false|t3_ov07rq|hkjr7uj|
|    anime|         DonMo999|:MAL:https://myan...|2021-11-14 04:40:25|Displate has some...|               0|    1| t3_qtgc12|   false|t3_qtgc12|hkjralc|
|    anime|   OrangeBanana38|:AMQ::STAR::AL:ht...|2021-11-14 04:41:01|That sounds like ...|               0|    3|t1_hkjq6wn|   false|t3_qryjfm|hkjrd4w|
|    anime|         ClBanjai|                null|2021-11-14 04:41:03|what kind of ques...|               0|    1| t3_qth8ql|   false|t3_qth8ql|hkjrdae|
|    anime|      helsaabiart|                null|2021-11-14 04:42:02|Today on Shokugek...|               0|    4| t3_qt8p0u|   false|t3_qt8p0u|hkjrhg6|
|    anime|           Lezoux|:MAL:https://myan...|2021-11-14 04:42:08|   It's easy enough.|               0|    3|t1_hkjrd4w|   false|t3_qryjfm|hkjrhv3|
|    anime|    AutoModerator|                null|2021-11-14 04:42:39|Hello! If you eve...|               0|    1| t3_qti7iu|   false|t3_qti7iu|hkjrk63|
|    anime|    AutoModerator|                null|2021-11-14 04:42:39|Hi xxcile, it see...|               0|    1| t3_qti7iu|   false|t3_qti7iu|hkjrk6q|
|    anime|         Terra246|                null|2021-11-14 04:42:47|I did see amagi b...|               0|    4|t1_hkjm3z4|   false|t3_qtgzu3|hkjrkr1|
|    anime|ZaphodBeebblebrox|:S3::AL:https://a...|2021-11-14 04:43:39|Which is your Mad...|               0|    3|t1_hkjpckv|   false|t3_qryjfm|hkjrogp|
|    anime|     GreekFire242|                null|2021-11-14 04:43:47|        Demon Slayer|               0|    2|t1_hkjpe43|   false|t3_qtgcp8|hkjrp3s|
|    anime|         Terra246|                null|2021-11-14 04:43:52|I mean, it is one...|               0|    2|t1_hkjpa0q|   false|t3_qtgzu3|hkjrpgf|
|    anime|     MakotoPrince|                null|2021-11-14 04:44:32|Yet another good ...|               0|    3| t3_qtg0z3|   false|t3_qtg0z3|hkjrsa2|
|    anime|   Gryse_Blacolar|                null|2021-11-14 04:44:41|That's basically ...|               0|    2| t3_qsz91x|   false|t3_qsz91x|hkjrsvy|
|    anime|     Junnielocked|                null|2021-11-14 04:44:47|Looked up the ani...|               0|    2| t3_qt7yff|   false|t3_qt7yff|hkjrtas|
|    anime|        kubabubba|                null|2021-11-14 04:45:01|      How about now?|               0|   24|t1_hkihkib|   false|t3_qt7yff|hkjrual|
|    anime|    alotmorealots|                null|2021-11-14 04:45:08|Your post could d...|               0|    1| t3_qtgpcl|   false|t3_qtgpcl|hkjrutb|
|    anime|        heimdal77|                null|2021-11-14 04:45:17|Depends is it lik...|               0|    2|t1_hkj9jju|   false|t3_qtfmin|hkjrvfv|
|    anime|    jackofslayers|                null|2021-11-14 04:45:35|I have my own sus...|               0|    7|t1_hkjcx61|   false|t3_qt5igg|hkjrwsz|
|    anime| SarcasmUndefined|                null|2021-11-14 04:45:59|Looking submissiv...|               0|    4|t1_hkhunza|   false|t3_qt3ovl|hkjrykr|
+---------+-----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+
only showing top 20 rows

Data Quality Check

Before diving into the analysis, we first check the data quality and clean the data based on what we find.

The quality checks we performed are:

  • Checking for missing values
    • Counting the number of missing values for each feature.
  • Checking for duplicates
    • Verifying whether the data contains any duplicate rows.
  • Checking for corrupted data points
    • When users delete their posts or accounts, the affected submissions or comments are replaced with [deleted] or [removed] placeholders, so we flag those rows as corrupted.

1. Check missing values

# functions used throughout this section
from pyspark.sql.functions import (
    col, count, when, lower, regexp_replace, size, split,
    date_format, hour, dayofweek, month, year,
)

sub_miss = sub.select([count(when((col(c).isNull()) | (col(c) == ''), c)).alias(c) for c in sub.columns])
sub_miss.show()
+---------+------+-----------------+-----------+-----+--------+------------+--------------+-------+-----+--------+---+
|subreddit|author|author_flair_text|created_utc|title|selftext|num_comments|num_crossposts|over_18|score|stickied| id|
+---------+------+-----------------+-----------+-----+--------+------------+--------------+-------+-----+--------+---+
|        0|     0|           384489|          0|    0|  105322|           0|             0|      0|    0|       0|  0|
+---------+------+-----------------+-----------+-----+--------+------------+--------------+-------+-----+--------+---+
                                                                                
com_miss = com.select([count(when(((col(c).isNull()) | (col(c) == '')), c)).alias(c) for c in com.columns])

com_miss.show()
+---------+------+-----------------+-----------+----+----------------+-----+---------+--------+-------+---+
|subreddit|author|author_flair_text|created_utc|body|controversiality|score|parent_id|stickied|link_id| id|
+---------+------+-----------------+-----------+----+----------------+-----+---------+--------+-------+---+
|        0|     0|          5036992|          0|  63|               0|    0|        0|       0|      0|  0|
+---------+------+-----------------+-----------+----+----------------+-----+---------+--------+-------+---+
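The per-column logic above (count the rows where a column is null or the empty string) can be sketched without Spark on a toy sample; the rows and column names below are made up for illustration:

```python
# Toy illustration of the per-column missing-value count (sample rows are made up)
rows = [
    {"author": "a1", "selftext": "hello", "author_flair_text": None},
    {"author": "a2", "selftext": "", "author_flair_text": None},
    {"author": "a3", "selftext": "world", "author_flair_text": ":MAL:"},
]

def missing_counts(rows):
    """Count null/empty values per column, mirroring count(when(isNull | == '')) in Spark."""
    cols = rows[0].keys()
    return {c: sum(1 for r in rows if r[c] is None or r[c] == "") for c in cols}

print(missing_counts(rows))  # {'author': 0, 'selftext': 1, 'author_flair_text': 2}
```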
                                                                                

2. Check duplicates

duplicate_sub = sub.count() - sub.dropDuplicates().count()
print(f"Number of duplicate rows for submissions: {duplicate_sub}")

# sub.dropDuplicates()
Number of duplicate rows for submissions: 0
                                                                                
duplicate_com = com.count() - com.dropDuplicates().count()
print(f"Number of duplicate rows for comments: {duplicate_com}")

# com.dropDuplicates()
Number of duplicate rows for comments: 0
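The duplicate count above is just the difference between the row count and the deduplicated row count. For hashable rows the same idea can be sketched in plain Python with a set (the sample rows are made up):

```python
# Toy analogue of df.count() - df.dropDuplicates().count() (sample rows are made up)
rows = [
    ("anime", "user1", "great episode"),
    ("anime", "user2", "agreed"),
    ("anime", "user1", "great episode"),  # exact duplicate of the first row
]

n_duplicates = len(rows) - len(set(rows))  # set() keeps one copy of each distinct row
print(n_duplicates)  # 1
```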
                                                                                

3. Check corrupted data points

corrupted_sub = sub.filter((col('title')=='') | (col('title')=='[deleted]') | (col('title')=='[removed]') | (col('author')=='[deleted]') | (col('author')=='[removed]') | (col('selftext')=='') | (col('selftext')=='[deleted]') | (col('selftext')=='[removed]'))
corrupted_sub.show()
+---------+--------------------+-----------------+-------------------+--------------------+---------+------------+--------------+-------+-----+--------+------+
|subreddit|              author|author_flair_text|        created_utc|               title| selftext|num_comments|num_crossposts|over_18|score|stickied|    id|
+---------+--------------------+-----------------+-------------------+--------------------+---------+------------+--------------+-------+-----+--------+------+
|    anime|Affectionate_Tutor_3|             null|2021-04-19 20:34:43|Invincible - Epis...|         |           1|             0|  false|    1|   false|mu9vpw|
|    anime|           [deleted]|             null|2021-04-19 20:37:02|Spotted this gem ...|[deleted]|           1|             0|  false|    9|   false|mu9xfr|
|    anime|           [deleted]|             null|2021-04-19 20:38:33|Do you watch show...|[deleted]|          12|             0|  false|    0|   false|mu9ykt|
|    anime|          xXdamaneXx|             null|2021-04-19 20:50:25|            Funny 😂|         |           1|             0|  false|    1|   false|mua7s3|
|    anime|   Ok-Dragonfly-9999|             null|2021-04-19 20:50:45|Anybody got the s...|         |           0|             0|  false|    1|   false|mua815|
|    anime|           [deleted]|             null|2021-04-19 20:52:26|Anime for the Family|[deleted]|           8|             0|  false|    1|   false|mua9bd|
|    anime|           [deleted]|             null|2021-04-19 20:54:30|         ZeroTwo🥵🥵|[deleted]|           1|             0|  false|    1|   false|muaavr|
|    anime|           [deleted]|             null|2021-04-19 20:55:30|Please recommend ...|[deleted]|           8|             0|  false|    0|   false|muabnv|
|    anime|       YOungwolfzzzz|             null|2021-04-19 20:55:46|a question about ...|[removed]|           2|             0|  false|    1|   false|muabvm|
|    anime|         _Ualien_591|             null|2021-04-19 20:55:48|What anime’s shou...|[removed]|           1|             0|  false|    1|   false|muabwt|
|    anime|            sznsteve|             null|2021-04-19 20:58:08|Fate Series in Order|[removed]|           4|             0|  false|    1|   false|muadmx|
|    anime|           [deleted]|             null|2021-04-19 20:59:26|         FBI OPEN UP|[deleted]|           1|             0|  false|    1|   false|muaemq|
|    anime|           [deleted]|             null|2021-04-19 21:01:12|Is Black Clover w...|[deleted]|          17|             0|  false|    1|   false|muag59|
|    anime|           [deleted]|             null|2021-04-19 21:04:21|To Whom This May ...|[removed]|           1|             0|  false|    2|   false|muainr|
|    anime|           [deleted]|             null|2021-04-19 21:04:22|  Not all men but...|[deleted]|           1|             0|  false|    1|   false|muaioj|
|    anime|Equivalent-Income576|             null|2021-04-19 21:05:37|    Anime’s to watch|[removed]|           1|             0|  false|    1|   false|muajo2|
|    anime|     ShotinTodorokin|             null|2021-04-19 21:06:26|How to watch the ...|[removed]|           3|             0|  false|    0|   false|muakbu|
|    anime|           [deleted]|             null|2021-04-19 21:07:52|Wonder Egg Priori...|[removed]|           2|             0|  false|    1|   false|mualh6|
|    anime|Equivalent-Income576|             null|2021-04-19 21:08:03|Next anime recomm...|[removed]|           1|             0|  false|    1|   false|mualna|
|    anime|           [deleted]|             null|2021-04-21 02:53:53|Your Sword Can't ...|[deleted]|           0|             0|  false|    1|   false|mv6rk0|
+---------+--------------------+-----------------+-------------------+--------------------+---------+------------+--------------+-------+-----+--------+------+
only showing top 20 rows
                                                                                
corrupted_sub.count()
                                                                                
294051
corrupted_com = com.filter((col('body')=='') | (col('body')=='[deleted]') | (col('body')=='[removed]') | (col('author')=='[deleted]') | (col('author')=='[removed]'))
corrupted_com.show()
+---------+---------+-----------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+
|subreddit|   author|author_flair_text|        created_utc|                body|controversiality|score| parent_id|stickied|  link_id|     id|
+---------+---------+-----------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+
|    anime|[deleted]|             null|2021-11-14 04:52:37|           [removed]|               0|    1|t1_hkjcx61|   false|t3_qt5igg|hkjsr7k|
|    anime|[deleted]|             null|2021-11-14 04:54:55|           [deleted]|               0|    5| t3_qthqwb|   false|t3_qthqwb|hkjt0u4|
|    anime|[deleted]|             null|2021-11-14 04:56:49|           [deleted]|               0|    5| t3_qti7iu|   false|t3_qti7iu|hkjt95w|
|    anime|[deleted]|             null|2021-12-19 05:34:39|           [deleted]|               1|    0|t1_hp4xzt9|   false|t3_rjpi2a|hp4y917|
|    anime|[deleted]|             null|2021-12-19 05:35:01|Sayounara, Jahy-s...|               0|    3| t3_ri3u54|   false|t3_ri3u54|hp4yaf8|
|    anime|[deleted]|             null|2021-12-19 05:36:04|I'd say Funimatio...|               0|    2| t3_rjpfed|   false|t3_rjpfed|hp4yea4|
|    anime|[deleted]|             null|2021-12-19 05:36:09|           [deleted]|               0|    1|t1_hp4y9u1|   false|t3_rjotam|hp4yemy|
|    anime|[deleted]|             null|2021-12-19 05:36:37|           [deleted]|               0|    6|t1_hp2wequ|   false|t3_rjc0jz|hp4ygcy|
|    anime|[deleted]|             null|2021-12-19 05:36:49|           [deleted]|               0|   -1|t1_hp4ycgc|   false|t3_rjpi2a|hp4yh3p|
|    anime|[deleted]|             null|2021-12-19 05:36:51|It's Samurai Berserk|               0|    1|t1_hp4yft0|   false|t3_rjit5c|hp4yh90|
|    anime|[deleted]|             null|2021-12-19 05:36:57|           [deleted]|               0|    0|t1_hp4yd4i|   false|t3_rjpvio|hp4yhlq|
|    anime|[deleted]|             null|2021-12-19 05:37:26|           [removed]|               0|    1|t1_hp4y6yp|   false|t3_rjit5c|hp4yjd1|
|    anime|[deleted]|             null|2021-12-19 05:38:20|A fantastic trail...|               0|   15| t3_rjpuxa|   false|t3_rjpuxa|hp4ymkd|
|    anime|[deleted]|             null|2021-12-19 05:38:31|           [deleted]|               0|    1| t3_rjpvb5|   false|t3_rjpvb5|hp4yn6l|
|    anime|[deleted]|             null|2021-12-19 05:38:58|           [deleted]|               0|    1|t1_hp4y6gi|   false|t3_rjp7mc|hp4yosn|
|    anime|[deleted]|             null|2021-12-19 05:39:07|Surprisingly outs...|               0|   10| t3_rjpvio|   false|t3_rjpvio|hp4ypap|
|    anime|[deleted]|             null|2021-12-19 05:39:55|It’s still lame a...|               0|  -21|t1_hp4skgi|   false|t3_rjm887|hp4ys7w|
|    anime|[deleted]|             null|2021-12-19 05:40:02|           [deleted]|               1|    0|t1_hp4w9fj|   false|t3_rjpi2a|hp4ysmx|
|    anime|[deleted]|             null|2021-12-19 05:40:30|             Thanks!|               0|    2|t1_hp4tai7|   false|t3_rjnfnk|hp4yu7l|
|    anime|[deleted]|             null|2021-12-19 05:40:30|           [deleted]|               0|   -2| t3_rjax2d|   false|t3_rjax2d|hp4yu95|
+---------+---------+-----------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+
only showing top 20 rows
                                                                                
corrupted_com.count()
                                                                                
476923

Data Cleaning

  1. Filter out corrupted rows that contain [deleted] or [removed]
  2. Clean the text data: remove punctuation and symbols, collapse extra whitespace, and convert everything to lowercase
  3. Create a dummy variable (contain_pokemon) with a regular expression
  4. Create new columns (date, year, hour, week, month, cleaned text, wordCount)
    • Compute a word count on the cleaned text of each data point
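Steps 2–4 above can be sketched for a single string with Python's `re` module; the sample text is made up. Note that the punctuation-stripping regex also removes accented characters such as é, which is why a "pokémon" match should run before (or instead of) the stripping step:

```python
import re

def clean_text(text: str):
    """Mirror the Spark cleaning chain: lowercase -> drop non-alphanumerics -> collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)   # keep only letters, digits, whitespace
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    word_count = len(text.split(" "))            # same idea as size(split(col, ' ')) in Spark
    contains_pokemon = bool(re.search(r"(?i)pok[eé]mon", text))
    return text, word_count, contains_pokemon

cleaned, n_words, flag = clean_text("Which Pokemon  movie is BEST?!")
print(cleaned)  # 'which pokemon movie is best'
print(n_words)  # 5
print(flag)     # True
```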
# filter out corrupted data

sub = sub.filter(
    (col('title')!='') & \
    (col('title')!='[deleted]') & \
    (col('title')!='[removed]') & \
    (col('selftext')!='') & \
    (col('selftext')!='[deleted]') & \
    (col('selftext')!='[removed]') & \
    (col('author')!='[deleted]') & \
    (col('author')!='[removed]')
)
print(f'Filtered submissions count: {sub.count()}')
Filtered submissions count: 110247
                                                                                
# filter out corrupted data

com = com.filter(
    (col('body')!='') & \
    (col('body')!='[deleted]') & \
    (col('body')!='[removed]') & \
    (col('author')!='[deleted]') & \
    (col('author')!='[removed]')
)
print(f'Filtered comments count: {com.count()}')
Filtered comments count: 6879119
                                                                                
# s_com = com.limit(2000)
# s_sub = sub.limit(2000)
sub_cleaned = (
    sub
    .withColumn("created_date", date_format("created_utc", "yyyy-MM-dd")) # create date column
    .withColumn("created_hour", hour("created_utc")) # create hour column
    .withColumn("created_week", dayofweek("created_utc")) # create day of the week column
    .withColumn("created_month", month("created_utc")) # create month of the year column
    .withColumn("created_year", year("created_utc")) # create the year column
    .withColumn("title", lower(col('title'))) # text cleaning: lowercase
    .withColumn("selftext", lower(col('selftext'))) # text cleaning: lowercase
    .withColumn("cleaned_title", regexp_replace(col('title'), r'[^a-zA-Z0-9\s]', '')) # text cleaning: only contain words or number
    .withColumn("cleaned_title", regexp_replace(col('cleaned_title'), r'\s+', ' ')) # text cleaning: remove extra space in text
    .withColumn('title_wordCount', size(split(col('cleaned_title'), ' '))) # word count
    .withColumn("cleaned_selftext", regexp_replace(col('selftext'), r'[^a-zA-Z0-9\s]', '')) # text cleaning: only contain words or number
    .withColumn("cleaned_selftext", regexp_replace(col('cleaned_selftext'), r'\s+', ' ')) # text cleaning: remove extra space in text
    .withColumn('selftext_wordCount', size(split(col('cleaned_selftext'), ' '))) # word count
    .withColumn('contain_pokemon', col("title").rlike("""(?i)pok[eé]mon""")) # create dummy variable column; match on the lowercased title, since the cleaning regex strips accented characters like é
)
com_cleaned = (
    com
    .withColumn("created_date", date_format("created_utc", "yyyy-MM-dd")) # create date column
    .withColumn("created_hour", hour("created_utc")) # create hour column
    .withColumn("created_week", dayofweek("created_utc")) # create day of the week column
    .withColumn("created_month", month("created_utc")) # create month of the year column
    .withColumn("created_year", year("created_utc")) # create the year column
    .withColumn("body", lower(col('body'))) # text cleaning: lowercase
    .withColumn("cleaned", regexp_replace(col('body'), r'[^a-zA-Z0-9\s]', '')) # text cleaning: only contain words or number
    .withColumn("cleaned", regexp_replace(col('cleaned'), r'\s+', ' ')) # text cleaning: remove extra space in text
    .withColumn('body_wordCount', size(split(col('cleaned'), ' '))) # word count
    .withColumn('contain_pokemon', col("body").rlike("""(?i)pok[eé]mon""")) # create dummy variable column
)
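The derived time columns above can be sanity-checked with Python's `datetime`; note that Spark's `dayofweek` is 1-based starting on Sunday, unlike `datetime.weekday()`, which is 0-based starting on Monday. The timestamp below is taken from the sample output (2021-06-23 was a Wednesday, hence `created_week` = 4):

```python
from datetime import datetime

ts = datetime(2021, 6, 23, 8, 32, 59)  # timestamp taken from the sample output

created_date = ts.strftime("%Y-%m-%d")
created_hour = ts.hour
# Spark's dayofweek(): 1 = Sunday ... 7 = Saturday; map from isoweekday() (1 = Monday ... 7 = Sunday)
created_week = ts.isoweekday() % 7 + 1
created_month = ts.month
created_year = ts.year

print(created_date, created_hour, created_week, created_month, created_year)
# 2021-06-23 8 4 6 2021
```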
# cache() the full cleaned dataset if it will be reused
# sub_cleaned.cache()
sub_cleaned.show()
+---------+-------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+-------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
|subreddit|             author|author_flair_text|        created_utc|               title|            selftext|num_comments|num_crossposts|over_18|score|stickied|     id|created_date|created_hour|created_week|created_month|created_year|       cleaned_title|title_wordCount|    cleaned_selftext|selftext_wordCount|contain_pokemon|
+---------+-------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+-------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
|    anime|    SpicyDuckNugget|             null|2023-02-22 21:22:18|mya question: has...|i seem to remembe...|           7|             0|  false|    0|   false|119dsui|  2023-02-22|          21|           4|            2|        2023|mya question has ...|             10|i seem to remembe...|                40|          false|
|    anime|Routine_Quality6886|             null|2023-02-22 21:23:54|what if the spide...|do any anime exis...|          14|             0|  false|    0|   false|119dv9y|  2023-02-22|          21|           4|            2|        2023|what if the spide...|              9|do any anime exis...|                27|          false|
|    anime|       CarmenRose_m|             null|2023-02-21 06:00:13|ouran high school...|okay so like \nth...|           5|             0|  false|    0|   false|117w2eq|  2023-02-21|           6|           3|            2|        2023|ouran high school...|              6|okay so like this...|               303|          false|
|    anime|  Dismal-Shake-6725|             null|2023-02-21 06:14:49|what is your favo...|whether its the f...|          25|             0|  false|    1|   false|117wbud|  2023-02-21|           6|           3|            2|        2023|what is your favo...|              8|whether its the f...|               157|          false|
|    anime|     WPScorpionWind|             null|2023-02-21 06:18:38|high school dxd: ...|so i'm just tryin...|           3|             0|  false|    0|   false|117we9z|  2023-02-21|           6|           3|            2|        2023|high school dxd i...|             11|so im just trying...|                43|          false|
|    anime|     iNomNomAwesome|             null|2023-02-21 06:26:30|code geass is an ...|i give up.\n\ni'v...|          35|             0|  false|    0|   false|117wj3w|  2023-02-21|           6|           3|            2|        2023|code geass is an ...|              6|i give up ive wat...|               155|          false|
|    anime|           MWGhost2|             null|2023-02-21 05:15:50|did i miss someth...|i watched all of ...|           4|             0|  false|    0|   false|117va1n|  2023-02-21|           5|           3|            2|        2023|did i miss someth...|              7|i watched all of ...|                41|          false|
|    anime|         fluxyweber|             null|2023-02-21 19:42:52|help me find this...|does anyone know ...|           4|             0|  false|    1|   false|118d9r6|  2023-02-21|          19|           3|            2|        2023|help me find this...|             10|does anyone know ...|               149|          false|
|    anime|            dhochoy|                 |2023-02-21 19:43:37|is anime more mai...|for the longest t...|          34|             0|  false|    0|   false|118da23|  2023-02-21|          19|           3|            2|        2023|is anime more mai...|              7|for the longest t...|                82|          false|
|    anime|       Holofan4life|             null|2023-02-21 20:00:11|the familiar of z...|hello everyone! i...|          53|             0|  false|   20|   false|118djrq|  2023-02-21|          20|           3|            2|        2023|the familiar of z...|             11|hello everyone i ...|               336|          false|
|    anime|       MaeBorrowski|             null|2023-02-21 20:03:51|was haibane renme...|title, because i ...|          20|             0|  false|    1|   false|118dmu9|  2023-02-21|          20|           3|            2|        2023|was haibane renme...|             10|title because i f...|                80|          false|
|    anime|  the_penis_taker69|             null|2023-02-21 20:13:13|what are some thi...|my friend and i l...|          11|             0|  false|    0|   false|118dtp2|  2023-02-21|          20|           3|            2|        2023|what are some thi...|             11|my friend and i l...|                22|          false|
|    anime|    Dark-_-Angel151|             null|2021-02-23 08:03:59|darling in the franx|is there darlingl...|           4|             0|  false|    0|   false| lqczfg|  2021-02-23|           8|           3|            2|        2021|darling in the franx|              4|is there darlingl...|                 6|          false|
|    anime|       HellaSausage|             null|2021-02-23 08:04:07|light should have...|so i'm watching a...|          13|             0|  false|    0|   false| lqczie|  2021-02-23|           8|           3|            2|        2021|light should have...|              5|so im watching an...|                51|          false|
|    anime|     bigpotatoboy64|             null|2021-02-23 08:08:40|i just finished w...|after satoru wake...|          17|             0|  false|    2|   false| lqd22k|  2021-02-23|           8|           3|            2|        2021|i just finished w...|             23|after satoru wake...|                82|          false|
|    anime|  kittycatninja9000|             null|2021-02-23 08:16:47|i’m trying to wat...|i like anything w...|          29|             0|  false|    0|   false| lqd79y|  2021-02-23|           8|           3|            2|        2021|im trying to watc...|             11|i like anything w...|                43|          false|
|    anime|           Novalcia|             null|2021-02-23 08:17:46|anyone else excit...|(first time posti...|           1|             0|  false|    1|   false| lqd7vg|  2021-02-23|           8|           3|            2|        2021|anyone else excit...|             10|first time postin...|                61|          false|
|    anime|    ThatBeardedBean|             null|2021-02-23 08:24:10|looking for a nic...|i’ve recently bee...|           5|             0|  false|    1|   false| lqdbav|  2021-02-23|           8|           3|            2|        2021|looking for a nic...|              7|ive recently been...|                35|          false|
|    anime|              XOIIO|             null|2021-02-23 08:26:20|so, i've finally ...|so, i've seen thi...|           5|             0|  false|   27|   false| lqdcj6|  2021-02-23|           8|           3|            2|        2021|so ive finally wa...|             26|so ive seen this ...|               446|          false|
|    anime|     Patricio124678|             null|2021-02-23 08:40:00|question regardin...|so, me and my fri...|           4|             0|  false|    0|   false| lqdl1o|  2021-02-23|           8|           3|            2|        2021|question regardin...|              4|so me and my frie...|                35|          false|
+---------+-------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+-------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
only showing top 20 rows
# cache() the full cleaned dataset if it will be reused
# com_cleaned.cache()
com_cleaned.show()
+---------+----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|subreddit|          author|   author_flair_text|        created_utc|                body|controversiality|score| parent_id|stickied|  link_id|     id|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|contain_pokemon|
+---------+----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|    anime|      Azevedo128|https://myanimeli...|2021-06-23 08:32:59|   always has been🔫|               0|   30|t1_h2qu7hw|   false|t3_o61em9|h2qxmw6|  2021-06-23|           8|           4|            6|        2021|     always has been|             3|          false|
|    anime|          Worm38|https://myanimeli...|2021-06-23 08:33:05|&gt; x is literal...|               0|   12| t3_o2blj8|   false|t3_o2blj8|h2qxn4d|  2021-06-23|           8|           4|            6|        2021|gt x is literally...|            15|          false|
|    anime|      PaperSonic|                null|2021-06-23 08:33:17|people mess up th...|               0|    1|t1_h2qsl8l|   false|t3_o5yfyf|h2qxnkh|  2021-06-23|           8|           4|            6|        2021|people mess up th...|            36|          false|
|    anime|         Dio5000|https://myanimeli...|2021-06-23 08:33:27|you literally hav...|               0|    3| t3_o68jss|   false|t3_o68jss|h2qxnxj|  2021-06-23|           8|           4|            6|        2021|you literally hav...|            70|          false|
|    anime|        ayakuweb|                null|2021-06-23 08:34:27|tsukishima from h...|               0|    7| t3_o66ymm|   false|t3_o66ymm|h2qxq8q|  2021-06-23|           8|           4|            6|        2021|tsukishima from h...|             3|          false|
|    anime|  VergelCayabyab|                null|2021-06-23 08:34:32|*ann, happy eight...|               0|    1| t3_o5nrc3|   false|t3_o5nrc3|h2qxqfj|  2021-06-23|           8|           4|            6|        2021|ann happy eighth ...|            15|          false|
|    anime| DeliciousDebris|                null|2021-06-23 08:35:02|"the balls i **th...|               0|    1|t1_h2quyfy|   false|t3_nss2lt|h2qxrik|  2021-06-23|           8|           4|            6|        2021|the balls i threw...|            39|          false|
|    anime| Anime_Sucks_Ass|                null|2021-06-23 08:35:27|   ill ( ill) smith.|               0|    2|t1_h2qufsk|   false|t3_o5ourc|h2qxsh9|  2021-06-23|           8|           4|            6|        2021|       ill ill smith|             3|          false|
|    anime|    punkgibson11|                null|2021-06-23 08:35:31|there are many bu...|               0|    2| t3_o66ymm|   false|t3_o66ymm|h2qxslt|  2021-06-23|           8|           4|            6|        2021|there are many bu...|            31|          false|
|    anime|   CompleteAster|                null|2021-06-23 08:35:46|i have a shocking...|               0|    3|t1_h2qxqrr|   false|t3_o61em9|h2qxt77|  2021-06-23|           8|           4|            6|        2021|i have a shocking...|            10|          false|
|    anime|        zairaner|https://myanimeli...|2021-06-23 08:35:49|&gt;"last order" ...|               0|    2|t1_h2qux20|   false|t3_o5yfyf|h2qxtbk|  2021-06-23|           8|           4|            6|        2021|gtlast order is a...|            41|          false|
|    anime|    IllHoneydew6|                null|2021-06-23 08:36:36|kara no kyoukai/g...|               0|    3| t3_o687kz|   false|t3_o687kz|h2qxv55|  2021-06-23|           8|           4|            6|        2021|kara no kyoukaiga...|            14|          false|
|    anime|Theboyandthefrog|                null|2021-06-23 08:37:14|people here final...|               0|   10| t3_o68c5n|   false|t3_o68c5n|h2qxwku|  2021-06-23|           8|           4|            6|        2021|people here final...|            18|          false|
|    anime|       duncandun|                null|2021-06-23 08:38:00|lesser known… one...|               0|   14|t1_h2qgayn|   false|t3_o5yxkr|h2qxy8w|  2021-06-23|           8|           4|            6|        2021|lesser known one ...|             4|          false|
|    anime|      AnnaShock2|                null|2021-06-23 08:38:02|now and then here...|               0|    3| t3_o66fv7|   false|t3_o66fv7|h2qxybn|  2021-06-23|           8|           4|            6|        2021|now and then here...|             8|          false|
|    anime|        Fartikus|https://myanimeli...|2021-06-23 08:38:28|&gt; still salty ...|               0|   12|t1_h2qt5qq|   false|t3_o61krt|h2qxz9x|  2021-06-23|           8|           4|            6|        2021|gt still salty ab...|           126|          false|
|    anime|    _mandal4real|                null|2021-06-23 08:38:39|         u/savevideo|               0|    1| t3_o5nrc3|   false|t3_o5nrc3|h2qxzok|  2021-06-23|           8|           4|            6|        2021|          usavevideo|             1|          false|
|    anime|        J-Hobber|                null|2021-06-23 08:39:29|plastic neesan\n\...|               0|    2| t3_o688a3|   false|t3_o688a3|h2qy1he|  2021-06-23|           8|           4|            6|        2021|plastic neesan ka...|             5|          false|
|    anime|           Sodra|https://myanimeli...|2021-06-23 08:40:01|astro boy was the...|               0|    5| t3_o68ine|   false|t3_o68ine|h2qy2mq|  2021-06-23|           8|           4|            6|        2021|astro boy was the...|            36|          false|
|    anime|       deheckman|                null|2021-06-23 08:40:24|noragami is prett...|               0|    0| t3_o688a3|   false|t3_o688a3|h2qy3ii|  2021-06-23|           8|           4|            6|        2021|noragami is prett...|             7|          false|
+---------+----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
only showing top 20 rows

Save cleaned data

AWS

output_sub = "project/cleaned/sub"

s3_path_sub = f"s3a://{bucket}/{output_sub}"
print(f"writing cleaned submission to {s3_path_sub}")
sub_cleaned.write.parquet(s3_path_sub, mode="overwrite")
writing cleaned submission to s3a://sagemaker-us-east-1-300610919477/project/cleaned/sub
                                                                                
output_com = "project/cleaned/com"

s3_path_com = f"s3a://{bucket}/{output_com}"
print(f"writing cleaned comments to {s3_path_com}")
com_cleaned.write.parquet(s3_path_com, mode="overwrite")
writing cleaned comments to s3a://sagemaker-us-east-1-300610919477/project/cleaned/com
                                                                                
# Read cleaned data from parquet
import sagemaker
session = sagemaker.Session()
bucket = session.default_bucket()
# bucket = 'sagemaker-us-east-1-315969085594'

sub_bucket_path = f"s3a://{bucket}/project/cleaned/sub"
com_bucket_path = f"s3a://{bucket}/project/cleaned/com"

print(f"reading submissions from {sub_bucket_path}")
sub = spark.read.parquet(sub_bucket_path)  # Parquet files carry their own schema; no header option needed
print(f"shape of the sub dataframe is {sub.count():,}x{len(sub.columns)}")

print(f"reading comments from {com_bucket_path}")
com = spark.read.parquet(com_bucket_path)
print(f"shape of the com dataframe is {com.count():,}x{len(com.columns)}")
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
reading submissions from s3a://sagemaker-us-east-1-300610919477/project/cleaned/sub
23/12/05 07:31:23 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
shape of the sub dataframe is 110,247x22
reading comments from s3a://sagemaker-us-east-1-300610919477/project/cleaned/com
shape of the com dataframe is 6,879,119x19
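The shape strings printed above come from an f-string pattern; as a hypothetical helper (the `shape_str` name is ours, not part of the notebook), the formatting can be isolated and reused:

```python
def shape_str(n_rows: int, n_cols: int) -> str:
    # Thousands-separated row count, "x", column count
    # (matches the f"{df.count():,}x{len(df.columns)}" pattern used above)
    return f"{n_rows:,}x{n_cols}"
```

For example, `shape_str(110247, 22)` reproduces the `110,247x22` shown for the submissions frame.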

Azure

datastore = 'azureml://datastores/workspaceblobstore/paths'

# https://<STORAGE-ACCOUNT>.blob.core.windows.net/<CONTAINER-NAME>
Storage_URI="https://group09astorage08f5ea16c.blob.core.windows.net/azureml-blobstore-600c08e7-3c4d-4e17-a310-86a7327468a9"

workspace_default_storage_account = "group09astorage08f5ea16c"
workspace_default_container = "azureml-blobstore-600c08e7-3c4d-4e17-a310-86a7327468a9"

workspace_wasbs_base_url = (
    f"wasbs://{workspace_default_container}@{workspace_default_storage_account}.blob.core.windows.net/")
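The base-URL construction above can be wrapped in a small helper. This is a sketch (the `wasbs_url` function name is ours, not Azure API); it only concatenates strings in the `wasbs://<container>@<account>.blob.core.windows.net/<path>` pattern shown above and performs no validation:

```python
def wasbs_url(container: str, account: str, path: str = "") -> str:
    # Build a wasbs:// URL for an Azure Blob Storage container.
    # `container` and `account` are assumed to be valid Azure resource names.
    base = f"wasbs://{container}@{account}.blob.core.windows.net"
    return f"{base}/{path.lstrip('/')}"
```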
# sub_cleaned.write.mode("overwrite").parquet(f"{datastore}/anime_sub_cleaned.parquet")
sub_cleaned.write.mode("overwrite").parquet(f"{workspace_wasbs_base_url}/anime_sub_cleaned.parquet")
# com_cleaned.write.parquet(f"{datastore}/anime_com_cleaned.parquet")
com_cleaned.write.parquet(f"{workspace_wasbs_base_url}/anime_com_cleaned.parquet")
datastore = 'azureml://datastores/workspaceblobstore/paths'

sub = spark.read.parquet(f"{datastore}/anime_sub_cleaned.parquet")
com = spark.read.parquet(f"{datastore}/anime_com_cleaned.parquet")

# sub = spark.read.parquet(f"{workspace_wasbs_base_url}/anime_sub_cleaned.parquet")
# com = spark.read.parquet(f"{workspace_wasbs_base_url}/anime_com_cleaned.parquet")
com.show()
+---------+--------------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|subreddit|              author|   author_flair_text|        created_utc|                body|controversiality|score| parent_id|stickied|  link_id|     id|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|contain_pokemon|
+---------+--------------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|    anime|        cocksmongler|                null|2022-12-20 22:50:59|        she does too|               0|   16|t1_j11628o|   false|t3_zqtkyg|j11989t|  2022-12-20|          22|           3|           12|        2022|        she does too|             3|          false|
|    anime|           Raiking02|:Id::Ie::If::M:ht...|2022-12-20 22:51:14|*finally decides ...|               0|    8| t3_zn0m5r|   false|t3_zn0m5r|j1199j8|  2022-12-20|          22|           3|           12|        2022|finally decides t...|            32|          false|
|    anime|            MonoFauz|                null|2022-12-20 22:52:34|this bette be an ...|               0|    1| t3_zqchel|   false|t3_zqchel|j119gf5|  2022-12-20|          22|           3|           12|        2022|this bette be an ...|            15|          false|
|    anime|           eruciform|                null|2022-12-20 22:52:46|cardcaptor sakura...|               0|    6| t3_zr280b|   false|t3_zr280b|j119hfj|  2022-12-20|          22|           3|           12|        2022|cardcaptor sakura...|            14|          false|
|    anime|             Karkava|                null|2022-12-20 22:52:46|and even then, hi...|               0|    3|t1_j0zxz09|   false|t3_zqtkyg|j119hfr|  2022-12-20|          22|           3|           12|        2022|and even then his...|            14|          false|
|    anime|              HelioA|:MAL:https://myan...|2022-12-20 22:53:00|                no u|               0|    2|t1_j1188rf|   false|t3_zn0m5r|j119ip2|  2022-12-20|          22|           3|           12|        2022|                no u|             2|          false|
|    anime| Random_Useless_Tips|                null|2022-12-20 22:53:06|in english it can...|               0|   29|t1_j0zxnir|   false|t3_zqtkyg|j119j7e|  2022-12-20|          22|           3|           12|        2022|in english it can...|            83|          false|
|    anime|   ComfortablyRotten|:BC::BD::U::M:htt...|2022-12-20 22:53:11|[i think i have m...|               0|    4|t1_j1162s1|   false|t3_zn0m5r|j119jnw|  2022-12-20|          22|           3|           12|        2022|i think i have mo...|            95|          false|
|    anime|       Eventhorrizon|                null|2022-12-20 22:53:26|"started watching...|               0|    8| t3_zr280b|   false|t3_zr280b|j119l1k|  2022-12-20|          22|           3|           12|        2022|started watching ...|            58|          false|
|    anime| JustAnswerAQuestion|:CS::CT::I::M:htt...|2022-12-20 22:53:28|i meant to reply ...|               0|    3|t1_j1190s6|   false|t3_zn0m5r|j119l6e|  2022-12-20|          22|           3|           12|        2022|i meant to reply ...|             6|          false|
|    anime|   electric_anteater|                null|2022-12-20 22:53:38|            bet what|               0|    6|t1_j116w30|   false|t3_zqtkyg|j119m3c|  2022-12-20|          22|           3|           12|        2022|            bet what|             2|          false|
|    anime|    CrewOrdinary8872|                null|2022-12-20 22:53:40|not the same swor...|               0|   15|t1_j118awx|   false|t3_zqtkyg|j119m92|  2022-12-20|          22|           3|           12|        2022|not the same swor...|            17|          false|
|    anime|       AutoModerator|                null|2022-12-20 22:53:57|hi reason-local! ...|               0|    1| t3_zr2dxt|   false|t3_zr2dxt|j119noi|  2022-12-20|          22|           3|           12|        2022|hi reasonlocal yo...|           151|          false|
|    anime|equalopurtunityotter|                null|2022-12-20 22:54:03|are there any ani...|               0|    1| t3_zqkjyo|   false|t3_zqkjyo|j119o93|  2022-12-20|          22|           3|           12|        2022|are there any ani...|            51|          false|
|    anime|               mutDD|                null|2022-12-20 22:54:07|see my reply to s...|               0|    0|t1_j114rml|   false|t3_zr1fxi|j119olf|  2022-12-20|          22|           3|           12|        2022|see my reply to s...|            32|          false|
|    anime|            Vikkio92|:K:https://kitsu....|2022-12-20 22:54:36|i have no idea wh...|               0|    6| t3_zqtkyg|   false|t3_zqtkyg|j119r1s|  2022-12-20|          22|           3|           12|        2022|i have no idea wh...|            11|          false|
|    anime|             Karkava|                null|2022-12-20 22:54:41|[the midway point...|               0|    2|t1_j10dofz|   false|t3_zqtkyg|j119rf6|  2022-12-20|          22|           3|           12|        2022|the midway point ...|            11|          false|
|    anime|              HelioA|:MAL:https://myan...|2022-12-20 22:55:33|any particular re...|               0|    2|t1_j1188rf|   false|t3_zn0m5r|j119vue|  2022-12-20|          22|           3|           12|        2022|any particular re...|             8|          false|
|    anime|      LilyGinnyBlack|                null|2022-12-20 22:55:34|fruits basket (re...|               0|    2| t3_zqy8ii|   false|t3_zqy8ii|j119vw8|  2022-12-20|          22|           3|           12|        2022|fruits basket reb...|             5|          false|
|    anime|           Raiking02|:Id::Ie::If::M:ht...|2022-12-20 22:55:48|oh god, it's over...|               0|    8|t1_j118mxd|   false|t3_zn0m5r|j119x1q|  2022-12-20|          22|           3|           12|        2022|oh god its over 8...|            14|          false|
+---------+--------------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
only showing top 20 rows

EDA on Anime

%pip install plotly
%pip install wordcloud

import pyspark.sql.types as T
from pyspark.sql import functions as F
from pyspark.sql.functions import count, mean, percentile_approx, max as _max
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import plotly.graph_objects as go
import plotly.subplots as sp
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"
import seaborn as sns
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
Requirement already satisfied: plotly in /opt/conda/lib/python3.10/site-packages (5.9.0)
Requirement already satisfied: tenacity>=6.2.0 in /opt/conda/lib/python3.10/site-packages (from plotly) (8.0.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Collecting wordcloud
  Obtaining dependency information for wordcloud from https://files.pythonhosted.org/packages/22/0d/bb4eccd60d272b33cbc79c661c60acc604f1688cfc922deb9d3eb5be640a/wordcloud-1.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached wordcloud-1.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.3 kB)
Requirement already satisfied: numpy>=1.6.1 in /opt/conda/lib/python3.10/site-packages (from wordcloud) (1.26.0)
Requirement already satisfied: pillow in /opt/conda/lib/python3.10/site-packages (from wordcloud) (10.0.1)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.10/site-packages (from wordcloud) (3.5.2)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (4.25.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (1.4.2)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (21.3)
Requirement already satisfied: pyparsing>=2.2.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (2.8.2)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib->wordcloud) (1.16.0)
Using cached wordcloud-1.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (455 kB)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.9.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True

The number of comments and submissions for each day

sub_daily_count = sub\
    .groupBy("created_date")\
    .agg(
        count("*").alias("total_submissions"),
        mean("num_comments").alias("avg_num_comments"),
        percentile_approx("num_comments", 0.5).alias("median_num_comments"),
        _max("num_comments").alias("max_num_comments"),
        mean("score").alias("avg_score"),
        mean("title_wordCount").alias("avg_title_wordCount"),
        mean("selftext_wordCount").alias("avg_selftext_wordCount")
        ).toPandas()
sub_daily_count['created_date'] = pd.to_datetime(sub_daily_count['created_date'])
sub_daily_count = sub_daily_count.sort_values(by='created_date')
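The same daily roll-up can be sketched in pandas on a toy frame (column names mirror the Spark aggregation above; the data values are made up, and `median` stands in for Spark's approximate `percentile_approx(..., 0.5)`):

```python
import pandas as pd

# Toy data standing in for the submissions frame (values are made up)
toy = pd.DataFrame({
    "created_date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "num_comments": [10, 30, 5],
    "score": [3, 7, 4],
})

daily = (
    toy.groupby("created_date")
       .agg(total_submissions=("num_comments", "size"),
            avg_num_comments=("num_comments", "mean"),
            median_num_comments=("num_comments", "median"),
            max_num_comments=("num_comments", "max"),
            avg_score=("score", "mean"))
       .reset_index()
       .sort_values(by="created_date")
)
```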
com_daily_count = com\
    .groupBy("created_date")\
    .agg(
        count("*").alias("total_comments"),
        mean("score").alias("avg_score"),
        mean("body_wordCount").alias("avg_body_wordCount")
        ).toPandas()
com_daily_count['created_date'] = pd.to_datetime(com_daily_count['created_date'])
com_daily_count = com_daily_count.sort_values(by='created_date')
com_daily_count
created_date total_comments avg_score avg_body_wordCount
288 2021-01-01 9013 5.926107 34.627538
238 2021-01-02 8711 6.291356 39.110665
164 2021-01-03 9572 6.185541 38.544191
275 2021-01-04 9145 7.188409 37.068125
243 2021-01-05 9353 7.122207 32.787662
... ... ... ... ...
761 2023-03-27 5300 9.527358 40.971509
816 2023-03-28 4901 6.703122 41.518466
797 2023-03-29 4539 8.152236 43.081956
780 2023-03-30 4603 7.773626 41.344775
543 2023-03-31 5492 8.339221 36.546249

819 rows × 4 columns

sub_daily_count.to_csv("../data/csv/sub_daily.csv", index=False)
com_daily_count.to_csv("../data/csv/com_daily.csv", index=False)
sub_daily_count = pd.read_csv("Users/kw823/fall-2023-reddit-project-team-09/data/csv/sub_daily.csv")
com_daily_count = pd.read_csv("Users/kw823/fall-2023-reddit-project-team-09/data/csv/com_daily.csv")
# Create a subplot with two y-axes
fig = sp.make_subplots(specs=[[{"secondary_y": True}]])

# Add traces for submissions and comments to the respective y-axes
fig.add_trace(go.Scatter(x=sub_daily_count['created_date'], y=sub_daily_count['total_submissions'], marker_color='#42a1b9', opacity=.65, name="submissions"), secondary_y=False)
fig.add_trace(go.Scatter(x=com_daily_count['created_date'], y=com_daily_count['total_comments'], marker_color='#d13a47', opacity=.65, name="comments"), secondary_y=True)

# Update the layout for the whole figure
fig.update_layout(
    title='The number of comments and submissions for each day',
    xaxis={'title': 'Date'},
)

# Update the y-axis labels
fig.update_yaxes(title_text="# Submissions", secondary_y=False)
fig.update_yaxes(title_text="# Comments", secondary_y=True)

# Show the figure
fig.show()
sub_daily_count_top10 = sub_daily_count.sort_values(by='total_submissions', ascending=False).iloc[0:10, :]
sub_daily_count_top10.to_csv("../data/csv/sub_daily_count_top10.csv", index=False)
sub_daily_count_top10
created_date total_submissions
277 2021-02-21 263
630 2021-02-03 253
17 2021-02-13 250
26 2021-02-10 249
531 2021-01-18 247
269 2021-01-21 245
481 2021-01-31 243
347 2021-02-01 237
72 2021-01-12 236
236 2021-02-28 236
com_daily_count_top10 = com_daily_count.sort_values(by='total_comments', ascending=False).iloc[0:10, :]
com_daily_count_top10.to_csv("../data/csv/com_daily_count_top10.csv", index=False)
com_daily_count_top10
created_date total_comments
265 2021-07-30 16787
245 2021-01-31 15742
282 2021-03-25 15616
162 2021-03-26 15509
147 2021-02-05 14624
249 2021-03-24 14375
272 2021-03-28 14268
273 2021-01-18 14139
134 2021-02-21 14021
148 2021-02-14 13720

Heatmap by Hour and Day of Week for r/Anime

Submission

def custom_xaxis_formatter(x, pos):
    return int(x)

sub_grp_hour_week = sub.groupBy("created_hour","created_week").count().toPandas()

sub_piv = pd.pivot_table(sub_grp_hour_week, columns=["created_hour"], index=["created_week"], values=['count'], fill_value=0)
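The pivot step can be sketched on toy (hour, day-of-week, count) rows; the made-up data below has duplicate (hour, week) pairs, so we pass `aggfunc="sum"` explicitly (in the notebook the groupBy already yields unique pairs, so the default aggregation is harmless):

```python
import pandas as pd

# Toy rows standing in for the Spark groupBy("created_hour", "created_week") output
toy = pd.DataFrame({
    "created_hour": [0, 0, 1, 0],
    "created_week": [1, 2, 1, 1],
    "count": [5, 3, 2, 4],
})

# Hours become columns, days of week become rows; missing cells are filled with 0
piv = pd.pivot_table(toy, columns=["created_hour"], index=["created_week"],
                     values=["count"], fill_value=0, aggfunc="sum")
```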

#plot pivot table as heatmap using seaborn
palette1 = sns.cubehelix_palette(as_cmap=True)
ax = sns.heatmap(sub_piv, cmap=palette1, yticklabels=['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'])
plt.setp(ax.xaxis.get_majorticklabels(), rotation=0)
plt.setp(ax.yaxis.get_majorticklabels(), rotation=0)
plt.xlabel("Hour of Day", fontsize = 12)
plt.ylabel("Day of Week", fontsize = 12)
plt.gca().xaxis.set_major_formatter(FuncFormatter(custom_xaxis_formatter))
plt.title("Heatmap by Hour and Day of Week for Submission")
plt.tight_layout()
plt.savefig("../website-source/images/anime_submission_heatmap.png")
plt.show()

sub_piv.transpose().to_csv("../data/csv/sub_pivot_hours_day_of_week.csv")
sub_piv.transpose()
created_week 1 2 3 4 5 6 7
created_hour
0 807 855 812 703 748 848 726
1 771 714 686 679 629 692 610
2 890 737 655 665 669 678 741
3 671 692 674 712 667 681 657
4 627 666 646 651 658 621 620
5 630 613 593 548 578 519 522
6 572 582 553 490 528 518 495
7 458 509 498 507 519 463 483
8 485 474 458 453 473 425 426
9 567 431 386 397 438 381 502
10 511 480 475 431 492 428 504
11 435 428 443 418 475 428 447
12 538 525 494 526 463 453 500
13 622 573 522 578 548 537 602
14 776 618 663 714 672 686 611
15 717 720 770 834 790 637 728
16 832 769 782 799 728 867 894
17 852 747 682 740 810 738 903
18 866 770 689 691 715 821 879
19 851 750 667 686 649 746 735
20 843 701 763 675 683 688 783
21 985 875 790 791 793 821 960
22 950 778 846 807 780 842 885
23 886 787 800 817 809 795 812

Comments

com_grp_hour_week = com.groupBy("created_hour","created_week").count().toPandas()

com_piv = pd.pivot_table(com_grp_hour_week, columns=["created_hour"], index=["created_week"], values=['count'], fill_value=0)

#plot pivot table as heatmap using seaborn
palette2 = sns.cubehelix_palette(start=.5, rot=-.5, as_cmap=True)

ax = sns.heatmap(com_piv, cmap=palette2, yticklabels=['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'])
plt.setp(ax.xaxis.get_majorticklabels(), rotation=0)
plt.setp(ax.yaxis.get_majorticklabels(), rotation=0)
plt.xlabel("Hour of Day", fontsize = 12)
plt.ylabel("Day of Week", fontsize = 12)
plt.gca().xaxis.set_major_formatter(FuncFormatter(custom_xaxis_formatter))
plt.title("Heatmap by Hour and Day of Week for Comments")
plt.tight_layout()
plt.savefig("../website-source/images/anime_comments_heatmap.png")
plt.show()

com_piv.transpose().to_csv("../data/csv/com_pivot_hours_day_of_week.csv")
com_piv.transpose()
created_week 1 2 3 4 5 6 7
created_hour
0 47651 51633 47777 46405 47692 62720 46911
1 45103 48403 44398 44038 44427 52377 41927
2 44963 47158 42658 42636 41943 43371 40079
3 42748 43122 40810 40382 41116 41792 38848
4 38780 39527 37307 36229 38238 38314 36195
5 36152 35026 33592 32644 34165 33889 32429
6 32815 32001 30028 28909 30898 30516 29615
7 29169 28721 26484 26619 27311 27145 27898
8 27207 25805 24416 24382 24639 24757 25377
9 26832 23897 24065 23533 23368 24232 25158
10 28262 25019 24043 23794 24020 23792 26762
11 28484 27084 26051 26479 25824 26487 28357
12 29978 29552 29725 29655 28447 28418 31457
13 37198 33515 32184 33822 31727 36384 35141
14 43539 39039 37562 38800 36290 43784 41212
15 46864 43910 45864 49317 42725 56274 48052
16 55765 49009 53828 55681 48112 49522 55314
17 58151 51640 52969 54625 50297 53138 62519
18 55055 48914 49153 51302 50117 50345 59030
19 56457 48696 46254 48969 47736 51018 56331
20 56576 47935 46099 48002 47060 48690 51222
21 63528 48848 48319 49404 49683 48863 52715
22 63254 52500 51358 53852 52320 55184 54477
23 58728 51049 50377 51925 53791 53674 53474

Wordcloud for both comment body and submission title

def top_n_words(n, word_counts):
    # take the n most frequent words (word_counts is sorted by count, descending)
    top_words = word_counts.limit(n).select("word", "count")
    # collect to a dict {word: count}
    word_dict = {row["word"]: row["count"] for row in top_words.collect()}
    # render a word cloud from the frequencies
    wordcloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(word_dict)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
stop_words = set(stopwords.words('english'))

# comment body
word_counts_com = (
    com
    .withColumn("word", F.explode(F.split(F.col("body"), r"\s+")))
    .withColumn("word", F.regexp_replace("word", r"[^\w]", ""))
    .groupBy("word")
    .count()
    .sort("count", ascending=False)
)

filtered_word_counts_com = word_counts_com.filter(~word_counts_com["word"].isin(stop_words))
filtered_word_counts_com.show(10)
+------+-------+
|  word|  count|
+------+-------+
|      |2303336|
|  like|1514254|
| anime|1416479|
|   one| 900388|
| watch| 743528|
|  dont| 733973|
|really| 704311|
|    im| 669780|
|  good| 627644|
|  also| 616659|
+------+-------+
only showing top 10 rows
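The Spark pipeline above (split on whitespace, strip non-word characters, drop stopwords) can be sketched in plain Python. Two deliberate differences, both our additions: this sketch lowercases input and drops empty tokens (the Spark version keeps them, which is why a blank word tops the table above), and `STOP` is a tiny stand-in for NLTK's English stopword list:

```python
import re
from collections import Counter

STOP = {"the", "a", "is", "and"}  # stand-in for NLTK's English stopwords

def word_counts(texts):
    counts = Counter()
    for text in texts:
        for token in re.split(r"\s+", text.lower()):
            word = re.sub(r"[^\w]", "", token)   # strip non-word characters
            if word and word not in STOP:        # drop empties and stopwords
                counts[word] += 1
    return counts
```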
top_n_words(200, filtered_word_counts_com)

# submission title
word_counts_sub = (
    sub
    .withColumn("word", F.explode(F.split(F.col("title"), r"\s+")))
    .withColumn("word", F.regexp_replace("word", r"[^\w]", ""))
    .groupBy("word")
    .count()
    .sort("count", ascending=False)
)

filtered_word_counts_sub = word_counts_sub.filter(~word_counts_sub["word"].isin(stop_words))
filtered_word_counts_sub.show(10)
+----------+-----+
|      word|count|
+----------+-----+
|     anime|47850|
|          |17981|
|   episode|11728|
|discussion|10424|
|     watch| 9941|
|      help| 6317|
|   looking| 5950|
|      need| 5654|
|      like| 5503|
|      good| 5471|
+----------+-----+
only showing top 10 rows
top_n_words(200, filtered_word_counts_sub)

Boxplot of word count by stickied or not for submission

# Cast boolean columns to 0/1 integers, filling nulls with 0,
# so Plotly can group on them cleanly
bool_columns = [name for name, dtype in sub.dtypes if dtype == 'boolean']
for col_name in bool_columns:
    sub = sub.withColumn(col_name, F.coalesce(F.col(col_name).cast('int'), F.lit(0)))

sub_df = sub.toPandas()

# Create a box plot using Plotly
fig = px.box(sub_df, x="stickied", y="title_wordCount", title="Box Plot of title_wordCount by stickied")
fig.show()
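The same null-safe boolean-to-int conversion can be sketched in pandas on a toy column (a stand-in for `stickied`; the chained-cast approach is our suggestion, not the notebook's):

```python
import pandas as pd

# Toy column standing in for a nullable boolean Spark column
df = pd.DataFrame({"stickied": [True, None, False]})

# nullable booleans -> 0/1 ints, nulls treated as 0
# (mirrors F.coalesce(F.col(c).cast('int'), F.lit(0)) used above)
df["stickied"] = df["stickied"].astype("boolean").fillna(False).astype(int)
```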

Word Count vs. Score

Submissions: Title Word Count vs. Score

ax = sub_df.plot.scatter('score', 'title_wordCount', s=3, c="#d13a47")
ax.set_title("Scatter plot of title word count and score for submissions")
ax.set_xscale('log')
plt.savefig("../website-source/images/anime_submissions_wordcount_score_scatterplot.png")
plt.show()

Comments: Body Word Count vs. Score

com_wordcount_score_df = com.select("body_wordCount", 'score').toPandas()
ax = com_wordcount_score_df.plot.scatter('score', 'body_wordCount', s=3, c="#42a1b9")
ax.set_title("Scatter plot of body word count and score for comments")
ax.set_xscale('log')
ax.set_yscale('log')
plt.savefig("../website-source/images/anime_comments_wordcount_score_scatterplot.png")
plt.show()