Code: EDA-Anime

Setup

AWS

# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.2.0

# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

# Import pyspark and build Spark session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    # Hadoop options need the "spark.hadoop." prefix; without it Spark
    # ignores them as non-Spark config properties
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

print(spark.version)
Collecting package metadata (current_repodata.json): done
Solving environment: done


## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - openjdk


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2023.08.22 |       h06a4308_0         123 KB
    certifi-2023.11.17         |  py310h06a4308_0         158 KB
    openjdk-11.0.13            |       h87a67e3_0       341.0 MB
    ------------------------------------------------------------
                                           Total:       341.3 MB

The following NEW packages will be INSTALLED:

  openjdk            pkgs/main/linux-64::openjdk-11.0.13-h87a67e3_0 

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2023.7.2~ --> pkgs/main::ca-certificates-2023.08.22-h06a4308_0 
  certifi            conda-forge/noarch::certifi-2023.7.22~ --> pkgs/main/linux-64::certifi-2023.11.17-py310h06a4308_0 



Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.
Collecting pyspark==3.2.0
  Using cached pyspark-3.2.0-py2.py3-none-any.whl
Collecting py4j==0.10.9.2 (from pyspark==3.2.0)
  Using cached py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.2 pyspark-3.2.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Note: you may need to restart the kernel to use updated packages.
Warning: Ignoring non-Spark config property: fs.s3a.aws.credentials.provider
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/spark-unsafe_2.12-3.2.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
:: loading settings :: url = jar:file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-8afe1d5f-af95-49e4-af32-ce0b0801deb1;1.0
    confs: [default]
    found org.apache.hadoop#hadoop-aws;3.2.2 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.563 in central
:: resolution report :: resolve 428ms :: artifacts dl 27ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.563 from central in [default]
    org.apache.hadoop#hadoop-aws;3.2.2 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-8afe1d5f-af95-49e4-af32-ce0b0801deb1
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/22ms)
23/12/05 07:31:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
3.2.0

Import packages

# Import PySpark SQL functions used throughout the EDA
import pyspark.sql.functions as F
from pyspark.sql.functions import (
    sum as _sum, mean, stddev, max as _max, min as _min, count,
    percentile_approx, year, month, dayofmonth, ceil, col, dayofweek,
    hour, explode, date_format, lower, size, split, regexp_replace,
    isnan, when,
)

Azure

spark
StatementMeta(bdbc6c25-e1fb-4fd1-9ce0-b2c60e73c096, 5, 6, Finished, Available)

SparkSession - hive

SparkContext
  Version: v3.2.2.5.1-100879434
  Master: yarn
  AppName: Azure ML Experiment

Read full Reddit data

blob_account_name = "marckvnonprodblob"
blob_container_name = "bigdata"
# read only
blob_sas_token = "?sv=2021-10-04&st=2023-10-04T01%3A42%3A59Z&se=2024-01-02T02%3A42%3A00Z&sr=c&sp=rlf&sig=w3CH9MbCOpwO7DtHlrahc7AlRPxSZZb8MOgS6TaXLzI%3D"

wasbs_base_url = (
    f"wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/"
)
spark.conf.set(
    f"fs.azure.sas.{blob_container_name}.{blob_account_name}.blob.core.windows.net",
    blob_sas_token,
)
comments_path = "reddit-parquet/comments/"
submissions_path = "reddit-parquet/submissions/"
comments_df = spark.read.parquet(f"{wasbs_base_url}{comments_path}")
submissions_df = spark.read.parquet(f"{wasbs_base_url}{submissions_path}")

Filter Reddit data based on the selected subreddit(s)

# Single subreddit
submissions = submissions_df.filter(submissions_df.subreddit == "anime")
# Multiple subreddits
submissions_multiple = submissions_df.filter(
    (col("subreddit") == "anime") | (col("subreddit") == "pokemon")
)
comments = comments_df.filter(comments_df.subreddit == "anime")
comments_multiple = comments_df.filter(
    (col("subreddit") == "anime") | (col("subreddit") == "pokemon")
)

Read the filtered data

AWS

%%time
import sagemaker
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_data_comments = "project/comments/yyyy=*"

s3_path = f"s3a://{bucket}/{output_prefix_data_comments}"
#s3_path = "s3a://sagemaker-us-east-1-038932893404/project/comments/yyyy=2021/part-00000-90796409-5783-4705-92c0-27c27eda8c4c-c000.snappy.parquet"
print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path)  # parquet carries its own schema; no header option needed
print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
reading comments from s3a://sagemaker-us-east-1-300610919477/project/comments/yyyy=*
23/12/05 05:49:36 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
shape of the comments dataframe is 7,356,042x21
CPU times: user 3.1 s, sys: 364 ms, total: 3.47 s
Wall time: 20min 20s
%%time
output_prefix_data_submissions = "project/submissions/yyyy=*"
s3_path = f"s3a://{bucket}/{output_prefix_data_submissions}"
print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path)
print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
reading submissions from s3a://sagemaker-us-east-1-300610919477/project/submissions/yyyy=*
23/12/05 06:10:21 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
shape of the submissions dataframe is 404,298x68
CPU times: user 187 ms, sys: 33.1 ms, total: 221 ms
Wall time: 8min 37s

Basic Data Information

  1. Data counts
  2. Data schema
  3. Interesting columns (data selection)
submissions.groupBy('subreddit').count().show()
+---------+------+
|subreddit| count|
+---------+------+
|    anime|404298|
+---------+------+
submissions.printSchema()
root
 |-- adserver_click_url: string (nullable = true)
 |-- adserver_imp_pixel: string (nullable = true)
 |-- archived: boolean (nullable = true)
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- author_id: string (nullable = true)
 |-- brand_safe: boolean (nullable = true)
 |-- contest_mode: boolean (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- crosspost_parent: string (nullable = true)
 |-- crosspost_parent_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- approved_at_utc: string (nullable = true)
 |    |    |-- approved_by: string (nullable = true)
 |    |    |-- archived: boolean (nullable = true)
 |    |    |-- author: string (nullable = true)
 |    |    |-- author_flair_css_class: string (nullable = true)
 |    |    |-- author_flair_text: string (nullable = true)
 |    |    |-- banned_at_utc: string (nullable = true)
 |    |    |-- banned_by: string (nullable = true)
 |    |    |-- brand_safe: boolean (nullable = true)
 |    |    |-- can_gild: boolean (nullable = true)
 |    |    |-- can_mod_post: boolean (nullable = true)
 |    |    |-- clicked: boolean (nullable = true)
 |    |    |-- contest_mode: boolean (nullable = true)
 |    |    |-- created: double (nullable = true)
 |    |    |-- created_utc: double (nullable = true)
 |    |    |-- distinguished: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- downs: long (nullable = true)
 |    |    |-- edited: boolean (nullable = true)
 |    |    |-- gilded: long (nullable = true)
 |    |    |-- hidden: boolean (nullable = true)
 |    |    |-- hide_score: boolean (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- is_crosspostable: boolean (nullable = true)
 |    |    |-- is_reddit_media_domain: boolean (nullable = true)
 |    |    |-- is_self: boolean (nullable = true)
 |    |    |-- is_video: boolean (nullable = true)
 |    |    |-- likes: string (nullable = true)
 |    |    |-- link_flair_css_class: string (nullable = true)
 |    |    |-- link_flair_text: string (nullable = true)
 |    |    |-- locked: boolean (nullable = true)
 |    |    |-- media: string (nullable = true)
 |    |    |-- mod_reports: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- num_comments: long (nullable = true)
 |    |    |-- num_crossposts: long (nullable = true)
 |    |    |-- num_reports: string (nullable = true)
 |    |    |-- over_18: boolean (nullable = true)
 |    |    |-- parent_whitelist_status: string (nullable = true)
 |    |    |-- permalink: string (nullable = true)
 |    |    |-- pinned: boolean (nullable = true)
 |    |    |-- quarantine: boolean (nullable = true)
 |    |    |-- removal_reason: string (nullable = true)
 |    |    |-- report_reasons: string (nullable = true)
 |    |    |-- saved: boolean (nullable = true)
 |    |    |-- score: long (nullable = true)
 |    |    |-- secure_media: string (nullable = true)
 |    |    |-- selftext: string (nullable = true)
 |    |    |-- selftext_html: string (nullable = true)
 |    |    |-- spoiler: boolean (nullable = true)
 |    |    |-- stickied: boolean (nullable = true)
 |    |    |-- subreddit: string (nullable = true)
 |    |    |-- subreddit_id: string (nullable = true)
 |    |    |-- subreddit_name_prefixed: string (nullable = true)
 |    |    |-- subreddit_type: string (nullable = true)
 |    |    |-- suggested_sort: string (nullable = true)
 |    |    |-- thumbnail: string (nullable = true)
 |    |    |-- thumbnail_height: string (nullable = true)
 |    |    |-- thumbnail_width: string (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- ups: long (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- user_reports: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- view_count: string (nullable = true)
 |    |    |-- visited: boolean (nullable = true)
 |    |    |-- whitelist_status: string (nullable = true)
 |-- disable_comments: boolean (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- domain_override: string (nullable = true)
 |-- edited: string (nullable = true)
 |-- embed_type: string (nullable = true)
 |-- embed_url: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- hidden: boolean (nullable = true)
 |-- hide_score: boolean (nullable = true)
 |-- href_url: string (nullable = true)
 |-- id: string (nullable = true)
 |-- imp_pixel: string (nullable = true)
 |-- is_crosspostable: boolean (nullable = true)
 |-- is_reddit_media_domain: boolean (nullable = true)
 |-- is_self: boolean (nullable = true)
 |-- is_video: boolean (nullable = true)
 |-- link_flair_css_class: string (nullable = true)
 |-- link_flair_text: string (nullable = true)
 |-- locked: boolean (nullable = true)
 |-- media: struct (nullable = true)
 |    |-- event_id: string (nullable = true)
 |    |-- oembed: struct (nullable = true)
 |    |    |-- author_name: string (nullable = true)
 |    |    |-- author_url: string (nullable = true)
 |    |    |-- cache_age: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- html: string (nullable = true)
 |    |    |-- provider_name: string (nullable = true)
 |    |    |-- provider_url: string (nullable = true)
 |    |    |-- thumbnail_height: long (nullable = true)
 |    |    |-- thumbnail_url: string (nullable = true)
 |    |    |-- thumbnail_width: long (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- reddit_video: struct (nullable = true)
 |    |    |-- dash_url: string (nullable = true)
 |    |    |-- duration: long (nullable = true)
 |    |    |-- fallback_url: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- hls_url: string (nullable = true)
 |    |    |-- is_gif: boolean (nullable = true)
 |    |    |-- scrubber_media_url: string (nullable = true)
 |    |    |-- transcoding_status: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- type: string (nullable = true)
 |-- media_embed: struct (nullable = true)
 |    |-- content: string (nullable = true)
 |    |-- height: long (nullable = true)
 |    |-- scrolling: boolean (nullable = true)
 |    |-- width: long (nullable = true)
 |-- mobile_ad_url: string (nullable = true)
 |-- num_comments: long (nullable = true)
 |-- num_crossposts: long (nullable = true)
 |-- original_link: string (nullable = true)
 |-- over_18: boolean (nullable = true)
 |-- parent_whitelist_status: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- pinned: boolean (nullable = true)
 |-- post_hint: string (nullable = true)
 |-- preview: struct (nullable = true)
 |    |-- enabled: boolean (nullable = true)
 |    |-- images: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |-- variants: struct (nullable = true)
 |    |    |    |    |-- gif: struct (nullable = true)
 |    |    |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |-- mp4: struct (nullable = true)
 |    |    |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |-- nsfw: struct (nullable = true)
 |    |    |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |-- obfuscated: struct (nullable = true)
 |    |    |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |-- width: long (nullable = true)
 |-- promoted: boolean (nullable = true)
 |-- promoted_by: string (nullable = true)
 |-- promoted_display_name: string (nullable = true)
 |-- promoted_url: string (nullable = true)
 |-- retrieved_on: timestamp (nullable = true)
 |-- score: long (nullable = true)
 |-- secure_media: struct (nullable = true)
 |    |-- event_id: string (nullable = true)
 |    |-- oembed: struct (nullable = true)
 |    |    |-- author_name: string (nullable = true)
 |    |    |-- author_url: string (nullable = true)
 |    |    |-- cache_age: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- html: string (nullable = true)
 |    |    |-- provider_name: string (nullable = true)
 |    |    |-- provider_url: string (nullable = true)
 |    |    |-- thumbnail_height: long (nullable = true)
 |    |    |-- thumbnail_url: string (nullable = true)
 |    |    |-- thumbnail_width: long (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- type: string (nullable = true)
 |-- secure_media_embed: struct (nullable = true)
 |    |-- content: string (nullable = true)
 |    |-- height: long (nullable = true)
 |    |-- media_domain_url: string (nullable = true)
 |    |-- scrolling: boolean (nullable = true)
 |    |-- width: long (nullable = true)
 |-- selftext: string (nullable = true)
 |-- spoiler: boolean (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)
 |-- suggested_sort: string (nullable = true)
 |-- third_party_trackers: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- third_party_tracking: string (nullable = true)
 |-- third_party_tracking_2: string (nullable = true)
 |-- thumbnail: string (nullable = true)
 |-- thumbnail_height: long (nullable = true)
 |-- thumbnail_width: long (nullable = true)
 |-- title: string (nullable = true)
 |-- url: string (nullable = true)
 |-- whitelist_status: string (nullable = true)
sub = submissions.select(
    "subreddit", "author", "author_flair_text", "created_utc", "title",
    "selftext", "num_comments", "num_crossposts", "over_18", "score",
    "stickied", "id",
)
sub.show()
+---------+--------------------+--------------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+
|subreddit|              author|   author_flair_text|        created_utc|               title|            selftext|num_comments|num_crossposts|over_18|score|stickied|    id|
+---------+--------------------+--------------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+
|    anime|Affectionate_Tutor_3|                null|2021-04-19 20:34:43|Invincible - Epis...|                    |           1|             0|  false|    1|   false|mu9vpw|
|    anime|           [deleted]|                null|2021-04-19 20:37:02|Spotted this gem ...|           [deleted]|           1|             0|  false|    9|   false|mu9xfr|
|    anime|           [deleted]|                null|2021-04-19 20:38:33|Do you watch show...|           [deleted]|          12|             0|  false|    0|   false|mu9ykt|
|    anime|PsychologicalGift299|                null|2021-04-19 20:42:46|Anime movies for ...|So as my fellow o...|          12|             0|  false|    0|   false|mua1uo|
|    anime|        Tuttles4ever|                null|2021-04-19 20:48:42|I need a very spe...|Are there any ani...|           7|             0|  false|    0|   false|mua6g3|
|    anime|          xXdamaneXx|                null|2021-04-19 20:50:25|            Funny 😂|                    |           1|             0|  false|    1|   false|mua7s3|
|    anime|   Ok-Dragonfly-9999|                null|2021-04-19 20:50:45|Anybody got the s...|                    |           0|             0|  false|    1|   false|mua815|
|    anime|           [deleted]|                null|2021-04-19 20:52:26|Anime for the Family|           [deleted]|           8|             0|  false|    1|   false|mua9bd|
|    anime|          nemifloras|                null|2021-04-19 20:52:42|Any atmospheric a...|i finished reassi...|           9|             0|  false|    0|   false|mua9iu|
|    anime|           [deleted]|                null|2021-04-19 20:54:30|         ZeroTwo🥵🥵|           [deleted]|           1|             0|  false|    1|   false|muaavr|
|    anime|           [deleted]|                null|2021-04-19 20:55:30|Please recommend ...|           [deleted]|           8|             0|  false|    0|   false|muabnv|
|    anime|       YOungwolfzzzz|                null|2021-04-19 20:55:46|a question about ...|           [removed]|           2|             0|  false|    1|   false|muabvm|
|    anime|         _Ualien_591|                null|2021-04-19 20:55:48|What anime’s shou...|           [removed]|           1|             0|  false|    1|   false|muabwt|
|    anime|            sznsteve|                null|2021-04-19 20:58:08|Fate Series in Order|           [removed]|           4|             0|  false|    1|   false|muadmx|
|    anime|           [deleted]|                null|2021-04-19 20:59:26|         FBI OPEN UP|           [deleted]|           1|             0|  false|    1|   false|muaemq|
|    anime|            Gaporigo|http://myanimelis...|2021-04-19 21:00:25|Best Girl of Wint...|##[Vote here!](ht...|          91|             1|  false|  153|   false|muafhy|
|    anime|    CommanderJake724|                null|2021-04-19 21:00:32|Does anyone know ...|Ok I forgot this ...|           4|             0|  false|    1|   false|muafln|
|    anime|           [deleted]|                null|2021-04-19 21:01:12|Is Black Clover w...|           [deleted]|          17|             0|  false|    1|   false|muag59|
|    anime|         AustrianBen|                null|2021-04-19 21:02:15|Was thinking abou...|Hello! I think an...|          10|             0|  false|    0|   false|muagz4|
|    anime|           [deleted]|                null|2021-04-19 21:04:21|To Whom This May ...|           [removed]|           1|             0|  false|    2|   false|muainr|
+---------+--------------------+--------------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+
only showing top 20 rows
comments.groupBy('subreddit').count().show()
+---------+-------+
|subreddit|  count|
+---------+-------+
|    anime|7356042|
+---------+-------+
comments.printSchema()
root
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- body: string (nullable = true)
 |-- can_gild: boolean (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- edited: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- id: string (nullable = true)
 |-- is_submitter: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- retrieved_on: timestamp (nullable = true)
 |-- score: long (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)
com = comments.select(
    "subreddit", "author", "author_flair_text", "created_utc", "body",
    "controversiality", "score", "parent_id", "stickied", "link_id", "id",
)
com.show()
+---------+-----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+
|subreddit|           author|   author_flair_text|        created_utc|                body|controversiality|score| parent_id|stickied|  link_id|     id|
+---------+-----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+
|    anime|    DonaldJenkins|                null|2021-11-14 04:39:47|  I sent it to ya ;)|               0|    1|t1_hk0whi9|   false|t3_ov07rq|hkjr7uj|
|    anime|         DonMo999|:MAL:https://myan...|2021-11-14 04:40:25|Displate has some...|               0|    1| t3_qtgc12|   false|t3_qtgc12|hkjralc|
|    anime|   OrangeBanana38|:AMQ::STAR::AL:ht...|2021-11-14 04:41:01|That sounds like ...|               0|    3|t1_hkjq6wn|   false|t3_qryjfm|hkjrd4w|
|    anime|         ClBanjai|                null|2021-11-14 04:41:03|what kind of ques...|               0|    1| t3_qth8ql|   false|t3_qth8ql|hkjrdae|
|    anime|      helsaabiart|                null|2021-11-14 04:42:02|Today on Shokugek...|               0|    4| t3_qt8p0u|   false|t3_qt8p0u|hkjrhg6|
|    anime|           Lezoux|:MAL:https://myan...|2021-11-14 04:42:08|   It's easy enough.|               0|    3|t1_hkjrd4w|   false|t3_qryjfm|hkjrhv3|
|    anime|    AutoModerator|                null|2021-11-14 04:42:39|Hello! If you eve...|               0|    1| t3_qti7iu|   false|t3_qti7iu|hkjrk63|
|    anime|    AutoModerator|                null|2021-11-14 04:42:39|Hi xxcile, it see...|               0|    1| t3_qti7iu|   false|t3_qti7iu|hkjrk6q|
|    anime|         Terra246|                null|2021-11-14 04:42:47|I did see amagi b...|               0|    4|t1_hkjm3z4|   false|t3_qtgzu3|hkjrkr1|
|    anime|ZaphodBeebblebrox|:S3::AL:https://a...|2021-11-14 04:43:39|Which is your Mad...|               0|    3|t1_hkjpckv|   false|t3_qryjfm|hkjrogp|
|    anime|     GreekFire242|                null|2021-11-14 04:43:47|        Demon Slayer|               0|    2|t1_hkjpe43|   false|t3_qtgcp8|hkjrp3s|
|    anime|         Terra246|                null|2021-11-14 04:43:52|I mean, it is one...|               0|    2|t1_hkjpa0q|   false|t3_qtgzu3|hkjrpgf|
|    anime|     MakotoPrince|                null|2021-11-14 04:44:32|Yet another good ...|               0|    3| t3_qtg0z3|   false|t3_qtg0z3|hkjrsa2|
|    anime|   Gryse_Blacolar|                null|2021-11-14 04:44:41|That's basically ...|               0|    2| t3_qsz91x|   false|t3_qsz91x|hkjrsvy|
|    anime|     Junnielocked|                null|2021-11-14 04:44:47|Looked up the ani...|               0|    2| t3_qt7yff|   false|t3_qt7yff|hkjrtas|
|    anime|        kubabubba|                null|2021-11-14 04:45:01|      How about now?|               0|   24|t1_hkihkib|   false|t3_qt7yff|hkjrual|
|    anime|    alotmorealots|                null|2021-11-14 04:45:08|Your post could d...|               0|    1| t3_qtgpcl|   false|t3_qtgpcl|hkjrutb|
|    anime|        heimdal77|                null|2021-11-14 04:45:17|Depends is it lik...|               0|    2|t1_hkj9jju|   false|t3_qtfmin|hkjrvfv|
|    anime|    jackofslayers|                null|2021-11-14 04:45:35|I have my own sus...|               0|    7|t1_hkjcx61|   false|t3_qt5igg|hkjrwsz|
|    anime| SarcasmUndefined|                null|2021-11-14 04:45:59|Looking submissiv...|               0|    4|t1_hkhunza|   false|t3_qt3ovl|hkjrykr|
+---------+-----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+
only showing top 20 rows

Data Quality Check

Before diving into the analysis, we first check the data quality and clean the data based on what we find.

The quality checks we performed are:

  • Checking for missing values
    • Counting the number of missing values for each feature.
  • Checking for duplicates
    • Verifying whether the data contains any duplicate rows.
  • Checking for corrupted data points
    • When users delete their posts or accounts, the affected submissions or comments are replaced with [deleted] or [removed] placeholders, so we flag those rows as corrupted.

1. Check missing values

# functions used throughout this section
from pyspark.sql.functions import (
    col, count, when, lower, regexp_replace, size, split,
    date_format, hour, dayofweek, month, year,
)

sub_miss = sub.select([count(when((col(c).isNull()) | (col(c) == ''), c)).alias(c) for c in sub.columns])
sub_miss.show()
+---------+------+-----------------+-----------+-----+--------+------------+--------------+-------+-----+--------+---+
|subreddit|author|author_flair_text|created_utc|title|selftext|num_comments|num_crossposts|over_18|score|stickied| id|
+---------+------+-----------------+-----------+-----+--------+------------+--------------+-------+-----+--------+---+
|        0|     0|           384489|          0|    0|  105322|           0|             0|      0|    0|       0|  0|
+---------+------+-----------------+-----------+-----+--------+------------+--------------+-------+-----+--------+---+
                                                                                
com_miss = com.select([count(when(((col(c).isNull()) | (col(c) == '')), c)).alias(c) for c in com.columns])

com_miss.show()
+---------+------+-----------------+-----------+----+----------------+-----+---------+--------+-------+---+
|subreddit|author|author_flair_text|created_utc|body|controversiality|score|parent_id|stickied|link_id| id|
+---------+------+-----------------+-----------+----+----------------+-----+---------+--------+-------+---+
|        0|     0|          5036992|          0|  63|               0|    0|        0|       0|      0|  0|
+---------+------+-----------------+-----------+----+----------------+-----+---------+--------+-------+---+
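The per-column logic above (count the rows where a column is null or the empty string) can be sketched without Spark on a toy sample; the rows and column names below are made up for illustration:

```python
# Toy illustration of the per-column missing-value count (sample rows are made up)
rows = [
    {"author": "a1", "selftext": "hello", "author_flair_text": None},
    {"author": "a2", "selftext": "", "author_flair_text": None},
    {"author": "a3", "selftext": "world", "author_flair_text": ":MAL:"},
]

def missing_counts(rows):
    """Count null/empty values per column, mirroring count(when(isNull | == '')) in Spark."""
    cols = rows[0].keys()
    return {c: sum(1 for r in rows if r[c] is None or r[c] == "") for c in cols}

print(missing_counts(rows))  # {'author': 0, 'selftext': 1, 'author_flair_text': 2}
```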
                                                                                

2. Check duplicates

duplicate_sub = sub.count() - sub.dropDuplicates().count()
print(f"Number of duplicate rows for submissions: {duplicate_sub}")

# sub.dropDuplicates()
Number of duplicate rows for submissions: 0
                                                                                
duplicate_com = com.count() - com.dropDuplicates().count()
print(f"Number of duplicate rows for comments: {duplicate_com}")

# com.dropDuplicates()
Number of duplicate rows for comments: 0
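The duplicate count above is just the difference between the row count and the deduplicated row count. For hashable rows the same idea can be sketched in plain Python with a set (the sample rows are made up):

```python
# Toy analogue of df.count() - df.dropDuplicates().count() (sample rows are made up)
rows = [
    ("anime", "user1", "great episode"),
    ("anime", "user2", "agreed"),
    ("anime", "user1", "great episode"),  # exact duplicate of the first row
]

n_duplicates = len(rows) - len(set(rows))  # set() keeps one copy of each distinct row
print(n_duplicates)  # 1
```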
                                                                                

3. Check corrupted data points

corrupted_sub = sub.filter((col('title')=='') | (col('title')=='[deleted]') | (col('title')=='[removed]') | (col('author')=='[deleted]') | (col('author')=='[removed]') | (col('selftext')=='') | (col('selftext')=='[deleted]') | (col('selftext')=='[removed]'))
corrupted_sub.show()
+---------+--------------------+-----------------+-------------------+--------------------+---------+------------+--------------+-------+-----+--------+------+
|subreddit|              author|author_flair_text|        created_utc|               title| selftext|num_comments|num_crossposts|over_18|score|stickied|    id|
+---------+--------------------+-----------------+-------------------+--------------------+---------+------------+--------------+-------+-----+--------+------+
|    anime|Affectionate_Tutor_3|             null|2021-04-19 20:34:43|Invincible - Epis...|         |           1|             0|  false|    1|   false|mu9vpw|
|    anime|           [deleted]|             null|2021-04-19 20:37:02|Spotted this gem ...|[deleted]|           1|             0|  false|    9|   false|mu9xfr|
|    anime|           [deleted]|             null|2021-04-19 20:38:33|Do you watch show...|[deleted]|          12|             0|  false|    0|   false|mu9ykt|
|    anime|          xXdamaneXx|             null|2021-04-19 20:50:25|            Funny 😂|         |           1|             0|  false|    1|   false|mua7s3|
|    anime|   Ok-Dragonfly-9999|             null|2021-04-19 20:50:45|Anybody got the s...|         |           0|             0|  false|    1|   false|mua815|
|    anime|           [deleted]|             null|2021-04-19 20:52:26|Anime for the Family|[deleted]|           8|             0|  false|    1|   false|mua9bd|
|    anime|           [deleted]|             null|2021-04-19 20:54:30|         ZeroTwo🥵🥵|[deleted]|           1|             0|  false|    1|   false|muaavr|
|    anime|           [deleted]|             null|2021-04-19 20:55:30|Please recommend ...|[deleted]|           8|             0|  false|    0|   false|muabnv|
|    anime|       YOungwolfzzzz|             null|2021-04-19 20:55:46|a question about ...|[removed]|           2|             0|  false|    1|   false|muabvm|
|    anime|         _Ualien_591|             null|2021-04-19 20:55:48|What anime’s shou...|[removed]|           1|             0|  false|    1|   false|muabwt|
|    anime|            sznsteve|             null|2021-04-19 20:58:08|Fate Series in Order|[removed]|           4|             0|  false|    1|   false|muadmx|
|    anime|           [deleted]|             null|2021-04-19 20:59:26|         FBI OPEN UP|[deleted]|           1|             0|  false|    1|   false|muaemq|
|    anime|           [deleted]|             null|2021-04-19 21:01:12|Is Black Clover w...|[deleted]|          17|             0|  false|    1|   false|muag59|
|    anime|           [deleted]|             null|2021-04-19 21:04:21|To Whom This May ...|[removed]|           1|             0|  false|    2|   false|muainr|
|    anime|           [deleted]|             null|2021-04-19 21:04:22|  Not all men but...|[deleted]|           1|             0|  false|    1|   false|muaioj|
|    anime|Equivalent-Income576|             null|2021-04-19 21:05:37|    Anime’s to watch|[removed]|           1|             0|  false|    1|   false|muajo2|
|    anime|     ShotinTodorokin|             null|2021-04-19 21:06:26|How to watch the ...|[removed]|           3|             0|  false|    0|   false|muakbu|
|    anime|           [deleted]|             null|2021-04-19 21:07:52|Wonder Egg Priori...|[removed]|           2|             0|  false|    1|   false|mualh6|
|    anime|Equivalent-Income576|             null|2021-04-19 21:08:03|Next anime recomm...|[removed]|           1|             0|  false|    1|   false|mualna|
|    anime|           [deleted]|             null|2021-04-21 02:53:53|Your Sword Can't ...|[deleted]|           0|             0|  false|    1|   false|mv6rk0|
+---------+--------------------+-----------------+-------------------+--------------------+---------+------------+--------------+-------+-----+--------+------+
only showing top 20 rows
                                                                                
corrupted_sub.count()
                                                                                
294051
corrupted_com = com.filter((col('body')=='') | (col('body')=='[deleted]') | (col('body')=='[removed]') | (col('author')=='[deleted]') | (col('author')=='[removed]'))
corrupted_com.show()
+---------+---------+-----------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+
|subreddit|   author|author_flair_text|        created_utc|                body|controversiality|score| parent_id|stickied|  link_id|     id|
+---------+---------+-----------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+
|    anime|[deleted]|             null|2021-11-14 04:52:37|           [removed]|               0|    1|t1_hkjcx61|   false|t3_qt5igg|hkjsr7k|
|    anime|[deleted]|             null|2021-11-14 04:54:55|           [deleted]|               0|    5| t3_qthqwb|   false|t3_qthqwb|hkjt0u4|
|    anime|[deleted]|             null|2021-11-14 04:56:49|           [deleted]|               0|    5| t3_qti7iu|   false|t3_qti7iu|hkjt95w|
|    anime|[deleted]|             null|2021-12-19 05:34:39|           [deleted]|               1|    0|t1_hp4xzt9|   false|t3_rjpi2a|hp4y917|
|    anime|[deleted]|             null|2021-12-19 05:35:01|Sayounara, Jahy-s...|               0|    3| t3_ri3u54|   false|t3_ri3u54|hp4yaf8|
|    anime|[deleted]|             null|2021-12-19 05:36:04|I'd say Funimatio...|               0|    2| t3_rjpfed|   false|t3_rjpfed|hp4yea4|
|    anime|[deleted]|             null|2021-12-19 05:36:09|           [deleted]|               0|    1|t1_hp4y9u1|   false|t3_rjotam|hp4yemy|
|    anime|[deleted]|             null|2021-12-19 05:36:37|           [deleted]|               0|    6|t1_hp2wequ|   false|t3_rjc0jz|hp4ygcy|
|    anime|[deleted]|             null|2021-12-19 05:36:49|           [deleted]|               0|   -1|t1_hp4ycgc|   false|t3_rjpi2a|hp4yh3p|
|    anime|[deleted]|             null|2021-12-19 05:36:51|It's Samurai Berserk|               0|    1|t1_hp4yft0|   false|t3_rjit5c|hp4yh90|
|    anime|[deleted]|             null|2021-12-19 05:36:57|           [deleted]|               0|    0|t1_hp4yd4i|   false|t3_rjpvio|hp4yhlq|
|    anime|[deleted]|             null|2021-12-19 05:37:26|           [removed]|               0|    1|t1_hp4y6yp|   false|t3_rjit5c|hp4yjd1|
|    anime|[deleted]|             null|2021-12-19 05:38:20|A fantastic trail...|               0|   15| t3_rjpuxa|   false|t3_rjpuxa|hp4ymkd|
|    anime|[deleted]|             null|2021-12-19 05:38:31|           [deleted]|               0|    1| t3_rjpvb5|   false|t3_rjpvb5|hp4yn6l|
|    anime|[deleted]|             null|2021-12-19 05:38:58|           [deleted]|               0|    1|t1_hp4y6gi|   false|t3_rjp7mc|hp4yosn|
|    anime|[deleted]|             null|2021-12-19 05:39:07|Surprisingly outs...|               0|   10| t3_rjpvio|   false|t3_rjpvio|hp4ypap|
|    anime|[deleted]|             null|2021-12-19 05:39:55|It’s still lame a...|               0|  -21|t1_hp4skgi|   false|t3_rjm887|hp4ys7w|
|    anime|[deleted]|             null|2021-12-19 05:40:02|           [deleted]|               1|    0|t1_hp4w9fj|   false|t3_rjpi2a|hp4ysmx|
|    anime|[deleted]|             null|2021-12-19 05:40:30|             Thanks!|               0|    2|t1_hp4tai7|   false|t3_rjnfnk|hp4yu7l|
|    anime|[deleted]|             null|2021-12-19 05:40:30|           [deleted]|               0|   -2| t3_rjax2d|   false|t3_rjax2d|hp4yu95|
+---------+---------+-----------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+
only showing top 20 rows
                                                                                
corrupted_com.count()
                                                                                
476923

Data Cleaning

  1. Filter out corrupted rows that contain [deleted] or [removed]
  2. Clean the text data: remove punctuation and symbols, collapse extra whitespace, and convert everything to lowercase
  3. Create a dummy variable (contain_pokemon) with a regular expression
  4. Create new columns (date, year, hour, week, month, cleaned text, wordCount)
    • Compute a word count on the cleaned text of each data point
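Steps 2–4 above can be sketched for a single string with Python's `re` module; the sample text is made up. Note that the punctuation-stripping regex also removes accented characters such as é, which is why a "pokémon" match should run before (or instead of) the stripping step:

```python
import re

def clean_text(text: str):
    """Mirror the Spark cleaning chain: lowercase -> drop non-alphanumerics -> collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)   # keep only letters, digits, whitespace
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    word_count = len(text.split(" "))            # same idea as size(split(col, ' ')) in Spark
    contains_pokemon = bool(re.search(r"(?i)pok[eé]mon", text))
    return text, word_count, contains_pokemon

cleaned, n_words, flag = clean_text("Which Pokemon  movie is BEST?!")
print(cleaned)  # 'which pokemon movie is best'
print(n_words)  # 5
print(flag)     # True
```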
# filter out corrupted data

sub = sub.filter(
    (col('title')!='') & \
    (col('title')!='[deleted]') & \
    (col('title')!='[removed]') & \
    (col('selftext')!='') & \
    (col('selftext')!='[deleted]') & \
    (col('selftext')!='[removed]') & \
    (col('author')!='[deleted]') & \
    (col('author')!='[removed]')
)
print(f'Filtered submissions count: {sub.count()}')
Filtered submissions count: 110247
                                                                                
# filter out corrupted data

com = com.filter(
    (col('body')!='') & \
    (col('body')!='[deleted]') & \
    (col('body')!='[removed]') & \
    (col('author')!='[deleted]') & \
    (col('author')!='[removed]')
)
print(f'Filtered comments count: {com.count()}')
Filtered comments count: 6879119
                                                                                
# s_com = com.limit(2000)
# s_sub = sub.limit(2000)
sub_cleaned = (
    sub
    .withColumn("created_date", date_format("created_utc", "yyyy-MM-dd")) # create date column
    .withColumn("created_hour", hour("created_utc")) # create hour column
    .withColumn("created_week", dayofweek("created_utc")) # create day of the week column
    .withColumn("created_month", month("created_utc")) # create month of the year column
    .withColumn("created_year", year("created_utc")) # create the year column
    .withColumn("title", lower(col('title'))) # text cleaning: lowercase
    .withColumn("selftext", lower(col('selftext'))) # text cleaning: lowercase
    .withColumn("cleaned_title", regexp_replace(col('title'), r'[^a-zA-Z0-9\s]', '')) # text cleaning: only contain words or number
    .withColumn("cleaned_title", regexp_replace(col('cleaned_title'), r'\s+', ' ')) # text cleaning: remove extra space in text
    .withColumn('title_wordCount', size(split(col('cleaned_title'), ' '))) # word count
    .withColumn("cleaned_selftext", regexp_replace(col('selftext'), r'[^a-zA-Z0-9\s]', '')) # text cleaning: only contain words or number
    .withColumn("cleaned_selftext", regexp_replace(col('cleaned_selftext'), r'\s+', ' ')) # text cleaning: remove extra space in text
    .withColumn('selftext_wordCount', size(split(col('cleaned_selftext'), ' '))) # word count
    .withColumn('contain_pokemon', col("title").rlike("""(?i)pok[eé]mon""")) # create dummy variable column; match on the lowercased title, since the cleaning regex strips accented characters like é
)
com_cleaned = (
    com
    .withColumn("created_date", date_format("created_utc", "yyyy-MM-dd")) # create date column
    .withColumn("created_hour", hour("created_utc")) # create hour column
    .withColumn("created_week", dayofweek("created_utc")) # create day of the week column
    .withColumn("created_month", month("created_utc")) # create month of the year column
    .withColumn("created_year", year("created_utc")) # create the year column
    .withColumn("body", lower(col('body'))) # text cleaning: lowercase
    .withColumn("cleaned", regexp_replace(col('body'), r'[^a-zA-Z0-9\s]', '')) # text cleaning: only contain words or number
    .withColumn("cleaned", regexp_replace(col('cleaned'), r'\s+', ' ')) # text cleaning: remove extra space in text
    .withColumn('body_wordCount', size(split(col('cleaned'), ' '))) # word count
    .withColumn('contain_pokemon', col("body").rlike("""(?i)pok[eé]mon""")) # create dummy variable column
)
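The derived time columns above can be sanity-checked with Python's `datetime`; note that Spark's `dayofweek` is 1-based starting on Sunday, unlike `datetime.weekday()`, which is 0-based starting on Monday. The timestamp below is taken from the sample output (2021-06-23 was a Wednesday, hence `created_week` = 4):

```python
from datetime import datetime

ts = datetime(2021, 6, 23, 8, 32, 59)  # timestamp taken from the sample output

created_date = ts.strftime("%Y-%m-%d")
created_hour = ts.hour
# Spark's dayofweek(): 1 = Sunday ... 7 = Saturday; map from isoweekday() (1 = Monday ... 7 = Sunday)
created_week = ts.isoweekday() % 7 + 1
created_month = ts.month
created_year = ts.year

print(created_date, created_hour, created_week, created_month, created_year)
# 2021-06-23 8 4 6 2021
```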
# cache() the full cleaned dataset if it will be reused
# sub_cleaned.cache()
sub_cleaned.show()
+---------+-------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+-------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
|subreddit|             author|author_flair_text|        created_utc|               title|            selftext|num_comments|num_crossposts|over_18|score|stickied|     id|created_date|created_hour|created_week|created_month|created_year|       cleaned_title|title_wordCount|    cleaned_selftext|selftext_wordCount|contain_pokemon|
+---------+-------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+-------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
|    anime|    SpicyDuckNugget|             null|2023-02-22 21:22:18|mya question: has...|i seem to remembe...|           7|             0|  false|    0|   false|119dsui|  2023-02-22|          21|           4|            2|        2023|mya question has ...|             10|i seem to remembe...|                40|          false|
|    anime|Routine_Quality6886|             null|2023-02-22 21:23:54|what if the spide...|do any anime exis...|          14|             0|  false|    0|   false|119dv9y|  2023-02-22|          21|           4|            2|        2023|what if the spide...|              9|do any anime exis...|                27|          false|
|    anime|       CarmenRose_m|             null|2023-02-21 06:00:13|ouran high school...|okay so like \nth...|           5|             0|  false|    0|   false|117w2eq|  2023-02-21|           6|           3|            2|        2023|ouran high school...|              6|okay so like this...|               303|          false|
|    anime|  Dismal-Shake-6725|             null|2023-02-21 06:14:49|what is your favo...|whether its the f...|          25|             0|  false|    1|   false|117wbud|  2023-02-21|           6|           3|            2|        2023|what is your favo...|              8|whether its the f...|               157|          false|
|    anime|     WPScorpionWind|             null|2023-02-21 06:18:38|high school dxd: ...|so i'm just tryin...|           3|             0|  false|    0|   false|117we9z|  2023-02-21|           6|           3|            2|        2023|high school dxd i...|             11|so im just trying...|                43|          false|
|    anime|     iNomNomAwesome|             null|2023-02-21 06:26:30|code geass is an ...|i give up.\n\ni'v...|          35|             0|  false|    0|   false|117wj3w|  2023-02-21|           6|           3|            2|        2023|code geass is an ...|              6|i give up ive wat...|               155|          false|
|    anime|           MWGhost2|             null|2023-02-21 05:15:50|did i miss someth...|i watched all of ...|           4|             0|  false|    0|   false|117va1n|  2023-02-21|           5|           3|            2|        2023|did i miss someth...|              7|i watched all of ...|                41|          false|
|    anime|         fluxyweber|             null|2023-02-21 19:42:52|help me find this...|does anyone know ...|           4|             0|  false|    1|   false|118d9r6|  2023-02-21|          19|           3|            2|        2023|help me find this...|             10|does anyone know ...|               149|          false|
|    anime|            dhochoy|                 |2023-02-21 19:43:37|is anime more mai...|for the longest t...|          34|             0|  false|    0|   false|118da23|  2023-02-21|          19|           3|            2|        2023|is anime more mai...|              7|for the longest t...|                82|          false|
|    anime|       Holofan4life|             null|2023-02-21 20:00:11|the familiar of z...|hello everyone! i...|          53|             0|  false|   20|   false|118djrq|  2023-02-21|          20|           3|            2|        2023|the familiar of z...|             11|hello everyone i ...|               336|          false|
|    anime|       MaeBorrowski|             null|2023-02-21 20:03:51|was haibane renme...|title, because i ...|          20|             0|  false|    1|   false|118dmu9|  2023-02-21|          20|           3|            2|        2023|was haibane renme...|             10|title because i f...|                80|          false|
|    anime|  the_penis_taker69|             null|2023-02-21 20:13:13|what are some thi...|my friend and i l...|          11|             0|  false|    0|   false|118dtp2|  2023-02-21|          20|           3|            2|        2023|what are some thi...|             11|my friend and i l...|                22|          false|
|    anime|    Dark-_-Angel151|             null|2021-02-23 08:03:59|darling in the franx|is there darlingl...|           4|             0|  false|    0|   false| lqczfg|  2021-02-23|           8|           3|            2|        2021|darling in the franx|              4|is there darlingl...|                 6|          false|
|    anime|       HellaSausage|             null|2021-02-23 08:04:07|light should have...|so i'm watching a...|          13|             0|  false|    0|   false| lqczie|  2021-02-23|           8|           3|            2|        2021|light should have...|              5|so im watching an...|                51|          false|
|    anime|     bigpotatoboy64|             null|2021-02-23 08:08:40|i just finished w...|after satoru wake...|          17|             0|  false|    2|   false| lqd22k|  2021-02-23|           8|           3|            2|        2021|i just finished w...|             23|after satoru wake...|                82|          false|
|    anime|  kittycatninja9000|             null|2021-02-23 08:16:47|i’m trying to wat...|i like anything w...|          29|             0|  false|    0|   false| lqd79y|  2021-02-23|           8|           3|            2|        2021|im trying to watc...|             11|i like anything w...|                43|          false|
|    anime|           Novalcia|             null|2021-02-23 08:17:46|anyone else excit...|(first time posti...|           1|             0|  false|    1|   false| lqd7vg|  2021-02-23|           8|           3|            2|        2021|anyone else excit...|             10|first time postin...|                61|          false|
|    anime|    ThatBeardedBean|             null|2021-02-23 08:24:10|looking for a nic...|i’ve recently bee...|           5|             0|  false|    1|   false| lqdbav|  2021-02-23|           8|           3|            2|        2021|looking for a nic...|              7|ive recently been...|                35|          false|
|    anime|              XOIIO|             null|2021-02-23 08:26:20|so, i've finally ...|so, i've seen thi...|           5|             0|  false|   27|   false| lqdcj6|  2021-02-23|           8|           3|            2|        2021|so ive finally wa...|             26|so ive seen this ...|               446|          false|
|    anime|     Patricio124678|             null|2021-02-23 08:40:00|question regardin...|so, me and my fri...|           4|             0|  false|    0|   false| lqdl1o|  2021-02-23|           8|           3|            2|        2021|question regardin...|              4|so me and my frie...|                35|          false|
+---------+-------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+-------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
only showing top 20 rows
# cache() the full cleaned dataset if it will be reused
# com_cleaned.cache()
com_cleaned.show()
+---------+----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|subreddit|          author|   author_flair_text|        created_utc|                body|controversiality|score| parent_id|stickied|  link_id|     id|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|contain_pokemon|
+---------+----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|    anime|      Azevedo128|https://myanimeli...|2021-06-23 08:32:59|   always has been🔫|               0|   30|t1_h2qu7hw|   false|t3_o61em9|h2qxmw6|  2021-06-23|           8|           4|            6|        2021|     always has been|             3|          false|
|    anime|          Worm38|https://myanimeli...|2021-06-23 08:33:05|&gt; x is literal...|               0|   12| t3_o2blj8|   false|t3_o2blj8|h2qxn4d|  2021-06-23|           8|           4|            6|        2021|gt x is literally...|            15|          false|
|    anime|      PaperSonic|                null|2021-06-23 08:33:17|people mess up th...|               0|    1|t1_h2qsl8l|   false|t3_o5yfyf|h2qxnkh|  2021-06-23|           8|           4|            6|        2021|people mess up th...|            36|          false|
|    anime|         Dio5000|https://myanimeli...|2021-06-23 08:33:27|you literally hav...|               0|    3| t3_o68jss|   false|t3_o68jss|h2qxnxj|  2021-06-23|           8|           4|            6|        2021|you literally hav...|            70|          false|
|    anime|        ayakuweb|                null|2021-06-23 08:34:27|tsukishima from h...|               0|    7| t3_o66ymm|   false|t3_o66ymm|h2qxq8q|  2021-06-23|           8|           4|            6|        2021|tsukishima from h...|             3|          false|
|    anime|  VergelCayabyab|                null|2021-06-23 08:34:32|*ann, happy eight...|               0|    1| t3_o5nrc3|   false|t3_o5nrc3|h2qxqfj|  2021-06-23|           8|           4|            6|        2021|ann happy eighth ...|            15|          false|
|    anime| DeliciousDebris|                null|2021-06-23 08:35:02|"the balls i **th...|               0|    1|t1_h2quyfy|   false|t3_nss2lt|h2qxrik|  2021-06-23|           8|           4|            6|        2021|the balls i threw...|            39|          false|
|    anime| Anime_Sucks_Ass|                null|2021-06-23 08:35:27|   ill ( ill) smith.|               0|    2|t1_h2qufsk|   false|t3_o5ourc|h2qxsh9|  2021-06-23|           8|           4|            6|        2021|       ill ill smith|             3|          false|
|    anime|    punkgibson11|                null|2021-06-23 08:35:31|there are many bu...|               0|    2| t3_o66ymm|   false|t3_o66ymm|h2qxslt|  2021-06-23|           8|           4|            6|        2021|there are many bu...|            31|          false|
|    anime|   CompleteAster|                null|2021-06-23 08:35:46|i have a shocking...|               0|    3|t1_h2qxqrr|   false|t3_o61em9|h2qxt77|  2021-06-23|           8|           4|            6|        2021|i have a shocking...|            10|          false|
|    anime|        zairaner|https://myanimeli...|2021-06-23 08:35:49|&gt;"last order" ...|               0|    2|t1_h2qux20|   false|t3_o5yfyf|h2qxtbk|  2021-06-23|           8|           4|            6|        2021|gtlast order is a...|            41|          false|
|    anime|    IllHoneydew6|                null|2021-06-23 08:36:36|kara no kyoukai/g...|               0|    3| t3_o687kz|   false|t3_o687kz|h2qxv55|  2021-06-23|           8|           4|            6|        2021|kara no kyoukaiga...|            14|          false|
|    anime|Theboyandthefrog|                null|2021-06-23 08:37:14|people here final...|               0|   10| t3_o68c5n|   false|t3_o68c5n|h2qxwku|  2021-06-23|           8|           4|            6|        2021|people here final...|            18|          false|
|    anime|       duncandun|                null|2021-06-23 08:38:00|lesser known… one...|               0|   14|t1_h2qgayn|   false|t3_o5yxkr|h2qxy8w|  2021-06-23|           8|           4|            6|        2021|lesser known one ...|             4|          false|
|    anime|      AnnaShock2|                null|2021-06-23 08:38:02|now and then here...|               0|    3| t3_o66fv7|   false|t3_o66fv7|h2qxybn|  2021-06-23|           8|           4|            6|        2021|now and then here...|             8|          false|
|    anime|        Fartikus|https://myanimeli...|2021-06-23 08:38:28|&gt; still salty ...|               0|   12|t1_h2qt5qq|   false|t3_o61krt|h2qxz9x|  2021-06-23|           8|           4|            6|        2021|gt still salty ab...|           126|          false|
|    anime|    _mandal4real|                null|2021-06-23 08:38:39|         u/savevideo|               0|    1| t3_o5nrc3|   false|t3_o5nrc3|h2qxzok|  2021-06-23|           8|           4|            6|        2021|          usavevideo|             1|          false|
|    anime|        J-Hobber|                null|2021-06-23 08:39:29|plastic neesan\n\...|               0|    2| t3_o688a3|   false|t3_o688a3|h2qy1he|  2021-06-23|           8|           4|            6|        2021|plastic neesan ka...|             5|          false|
|    anime|           Sodra|https://myanimeli...|2021-06-23 08:40:01|astro boy was the...|               0|    5| t3_o68ine|   false|t3_o68ine|h2qy2mq|  2021-06-23|           8|           4|            6|        2021|astro boy was the...|            36|          false|
|    anime|       deheckman|                null|2021-06-23 08:40:24|noragami is prett...|               0|    0| t3_o688a3|   false|t3_o688a3|h2qy3ii|  2021-06-23|           8|           4|            6|        2021|noragami is prett...|             7|          false|
+---------+----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
only showing top 20 rows

Save cleaned data

AWS

output_sub = "project/cleaned/sub"

s3_path_sub = f"s3a://{bucket}/{output_sub}"
print(f"writing cleaned submission to {s3_path_sub}")
sub_cleaned.write.parquet(s3_path_sub, mode="overwrite")
writing cleaned submission to s3a://sagemaker-us-east-1-300610919477/project/cleaned/sub
                                                                                
output_com = "project/cleaned/com"

s3_path_com = f"s3a://{bucket}/{output_com}"
print(f"writing cleaned comments to {s3_path_com}")
com_cleaned.write.parquet(s3_path_com, mode="overwrite")
writing cleaned comments to s3a://sagemaker-us-east-1-300610919477/project/cleaned/com
                                                                                
# Read cleaned data from parquet
import sagemaker
session = sagemaker.Session()
bucket = session.default_bucket()
# bucket = 'sagemaker-us-east-1-315969085594'

sub_bucket_path = f"s3a://{bucket}/project/cleaned/sub"
com_bucket_path = f"s3a://{bucket}/project/cleaned/com"

print(f"reading submissions from {sub_bucket_path}")
sub = spark.read.parquet(sub_bucket_path)  # Parquet files carry their own schema; no header option needed
print(f"shape of the sub dataframe is {sub.count():,}x{len(sub.columns)}")

print(f"reading comments from {com_bucket_path}")
com = spark.read.parquet(com_bucket_path)
print(f"shape of the com dataframe is {com.count():,}x{len(com.columns)}")
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
reading submissions from s3a://sagemaker-us-east-1-300610919477/project/cleaned/sub
23/12/05 07:31:23 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
shape of the sub dataframe is 110,247x22
reading comments from s3a://sagemaker-us-east-1-300610919477/project/cleaned/com
shape of the com dataframe is 6,879,119x19
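The shape strings printed above come from an f-string pattern; as a hypothetical helper (the `shape_str` name is ours, not part of the notebook), the formatting can be isolated and reused:

```python
def shape_str(n_rows: int, n_cols: int) -> str:
    # Thousands-separated row count, "x", column count
    # (matches the f"{df.count():,}x{len(df.columns)}" pattern used above)
    return f"{n_rows:,}x{n_cols}"
```

For example, `shape_str(110247, 22)` reproduces the `110,247x22` shown for the submissions frame.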

Azure

datastore = 'azureml://datastores/workspaceblobstore/paths'

# https://<STORAGE-ACCOUNT>.blob.core.windows.net/<CONTAINER-NAME>
Storage_URI="https://group09astorage08f5ea16c.blob.core.windows.net/azureml-blobstore-600c08e7-3c4d-4e17-a310-86a7327468a9"

workspace_default_storage_account = "group09astorage08f5ea16c"
workspace_default_container = "azureml-blobstore-600c08e7-3c4d-4e17-a310-86a7327468a9"

workspace_wasbs_base_url = (
    f"wasbs://{workspace_default_container}@{workspace_default_storage_account}.blob.core.windows.net/")
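The base-URL construction above can be wrapped in a small helper. This is a sketch (the `wasbs_url` function name is ours, not Azure API); it only concatenates strings in the `wasbs://<container>@<account>.blob.core.windows.net/<path>` pattern shown above and performs no validation:

```python
def wasbs_url(container: str, account: str, path: str = "") -> str:
    # Build a wasbs:// URL for an Azure Blob Storage container.
    # `container` and `account` are assumed to be valid Azure resource names.
    base = f"wasbs://{container}@{account}.blob.core.windows.net"
    return f"{base}/{path.lstrip('/')}"
```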
# sub_cleaned.write.mode("overwrite").parquet(f"{datastore}/anime_sub_cleaned.parquet")
sub_cleaned.write.mode("overwrite").parquet(f"{workspace_wasbs_base_url}/anime_sub_cleaned.parquet")
# com_cleaned.write.parquet(f"{datastore}/anime_com_cleaned.parquet")
com_cleaned.write.parquet(f"{workspace_wasbs_base_url}/anime_com_cleaned.parquet")
datastore = 'azureml://datastores/workspaceblobstore/paths'

sub = spark.read.parquet(f"{datastore}/anime_sub_cleaned.parquet")
com = spark.read.parquet(f"{datastore}/anime_com_cleaned.parquet")

# sub = spark.read.parquet(f"{workspace_wasbs_base_url}/anime_sub_cleaned.parquet")
# com = spark.read.parquet(f"{workspace_wasbs_base_url}/anime_com_cleaned.parquet")
com.show()
+---------+--------------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|subreddit|              author|   author_flair_text|        created_utc|                body|controversiality|score| parent_id|stickied|  link_id|     id|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|contain_pokemon|
+---------+--------------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|    anime|        cocksmongler|                null|2022-12-20 22:50:59|        she does too|               0|   16|t1_j11628o|   false|t3_zqtkyg|j11989t|  2022-12-20|          22|           3|           12|        2022|        she does too|             3|          false|
|    anime|           Raiking02|:Id::Ie::If::M:ht...|2022-12-20 22:51:14|*finally decides ...|               0|    8| t3_zn0m5r|   false|t3_zn0m5r|j1199j8|  2022-12-20|          22|           3|           12|        2022|finally decides t...|            32|          false|
|    anime|            MonoFauz|                null|2022-12-20 22:52:34|this bette be an ...|               0|    1| t3_zqchel|   false|t3_zqchel|j119gf5|  2022-12-20|          22|           3|           12|        2022|this bette be an ...|            15|          false|
|    anime|           eruciform|                null|2022-12-20 22:52:46|cardcaptor sakura...|               0|    6| t3_zr280b|   false|t3_zr280b|j119hfj|  2022-12-20|          22|           3|           12|        2022|cardcaptor sakura...|            14|          false|
|    anime|             Karkava|                null|2022-12-20 22:52:46|and even then, hi...|               0|    3|t1_j0zxz09|   false|t3_zqtkyg|j119hfr|  2022-12-20|          22|           3|           12|        2022|and even then his...|            14|          false|
|    anime|              HelioA|:MAL:https://myan...|2022-12-20 22:53:00|                no u|               0|    2|t1_j1188rf|   false|t3_zn0m5r|j119ip2|  2022-12-20|          22|           3|           12|        2022|                no u|             2|          false|
|    anime| Random_Useless_Tips|                null|2022-12-20 22:53:06|in english it can...|               0|   29|t1_j0zxnir|   false|t3_zqtkyg|j119j7e|  2022-12-20|          22|           3|           12|        2022|in english it can...|            83|          false|
|    anime|   ComfortablyRotten|:BC::BD::U::M:htt...|2022-12-20 22:53:11|[i think i have m...|               0|    4|t1_j1162s1|   false|t3_zn0m5r|j119jnw|  2022-12-20|          22|           3|           12|        2022|i think i have mo...|            95|          false|
|    anime|       Eventhorrizon|                null|2022-12-20 22:53:26|"started watching...|               0|    8| t3_zr280b|   false|t3_zr280b|j119l1k|  2022-12-20|          22|           3|           12|        2022|started watching ...|            58|          false|
|    anime| JustAnswerAQuestion|:CS::CT::I::M:htt...|2022-12-20 22:53:28|i meant to reply ...|               0|    3|t1_j1190s6|   false|t3_zn0m5r|j119l6e|  2022-12-20|          22|           3|           12|        2022|i meant to reply ...|             6|          false|
|    anime|   electric_anteater|                null|2022-12-20 22:53:38|            bet what|               0|    6|t1_j116w30|   false|t3_zqtkyg|j119m3c|  2022-12-20|          22|           3|           12|        2022|            bet what|             2|          false|
|    anime|    CrewOrdinary8872|                null|2022-12-20 22:53:40|not the same swor...|               0|   15|t1_j118awx|   false|t3_zqtkyg|j119m92|  2022-12-20|          22|           3|           12|        2022|not the same swor...|            17|          false|
|    anime|       AutoModerator|                null|2022-12-20 22:53:57|hi reason-local! ...|               0|    1| t3_zr2dxt|   false|t3_zr2dxt|j119noi|  2022-12-20|          22|           3|           12|        2022|hi reasonlocal yo...|           151|          false|
|    anime|equalopurtunityotter|                null|2022-12-20 22:54:03|are there any ani...|               0|    1| t3_zqkjyo|   false|t3_zqkjyo|j119o93|  2022-12-20|          22|           3|           12|        2022|are there any ani...|            51|          false|
|    anime|               mutDD|                null|2022-12-20 22:54:07|see my reply to s...|               0|    0|t1_j114rml|   false|t3_zr1fxi|j119olf|  2022-12-20|          22|           3|           12|        2022|see my reply to s...|            32|          false|
|    anime|            Vikkio92|:K:https://kitsu....|2022-12-20 22:54:36|i have no idea wh...|               0|    6| t3_zqtkyg|   false|t3_zqtkyg|j119r1s|  2022-12-20|          22|           3|           12|        2022|i have no idea wh...|            11|          false|
|    anime|             Karkava|                null|2022-12-20 22:54:41|[the midway point...|               0|    2|t1_j10dofz|   false|t3_zqtkyg|j119rf6|  2022-12-20|          22|           3|           12|        2022|the midway point ...|            11|          false|
|    anime|              HelioA|:MAL:https://myan...|2022-12-20 22:55:33|any particular re...|               0|    2|t1_j1188rf|   false|t3_zn0m5r|j119vue|  2022-12-20|          22|           3|           12|        2022|any particular re...|             8|          false|
|    anime|      LilyGinnyBlack|                null|2022-12-20 22:55:34|fruits basket (re...|               0|    2| t3_zqy8ii|   false|t3_zqy8ii|j119vw8|  2022-12-20|          22|           3|           12|        2022|fruits basket reb...|             5|          false|
|    anime|           Raiking02|:Id::Ie::If::M:ht...|2022-12-20 22:55:48|oh god, it's over...|               0|    8|t1_j118mxd|   false|t3_zn0m5r|j119x1q|  2022-12-20|          22|           3|           12|        2022|oh god its over 8...|            14|          false|
+---------+--------------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
only showing top 20 rows

EDA on Anime

%pip install plotly
%pip install wordcloud

import pyspark.sql.types as T
from pyspark.sql import functions as F
from pyspark.sql.functions import count, mean, percentile_approx, max as _max
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import plotly.graph_objects as go
import plotly.subplots as sp
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"
import seaborn as sns
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
Requirement already satisfied: plotly in /opt/conda/lib/python3.10/site-packages (5.9.0)
Requirement already satisfied: tenacity>=6.2.0 in /opt/conda/lib/python3.10/site-packages (from plotly) (8.0.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Collecting wordcloud
  Obtaining dependency information for wordcloud from https://files.pythonhosted.org/packages/22/0d/bb4eccd60d272b33cbc79c661c60acc604f1688cfc922deb9d3eb5be640a/wordcloud-1.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached wordcloud-1.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.3 kB)
Requirement already satisfied: numpy>=1.6.1 in /opt/conda/lib/python3.10/site-packages (from wordcloud) (1.26.0)
Requirement already satisfied: pillow in /opt/conda/lib/python3.10/site-packages (from wordcloud) (10.0.1)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.10/site-packages (from wordcloud) (3.5.2)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (4.25.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (1.4.2)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (21.3)
Requirement already satisfied: pyparsing>=2.2.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (2.8.2)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib->wordcloud) (1.16.0)
Using cached wordcloud-1.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (455 kB)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.9.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True

The number of comments and submissions for each day

sub_daily_count = sub\
    .groupBy("created_date")\
    .agg(
        count("*").alias("total_submissions"),
        mean("num_comments").alias("avg_num_comments"),
        percentile_approx("num_comments", 0.5).alias("median_num_comments"),
        _max("num_comments").alias("max_num_comments"),
        mean("score").alias("avg_score"),
        mean("title_wordCount").alias("avg_title_wordCount"),
        mean("selftext_wordCount").alias("avg_selftext_wordCount")
        ).toPandas()
sub_daily_count['created_date'] = pd.to_datetime(sub_daily_count['created_date'])
sub_daily_count = sub_daily_count.sort_values(by='created_date')
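The same daily roll-up can be sketched in pandas on a toy frame (column names mirror the Spark aggregation above; the data values are made up, and `median` stands in for Spark's approximate `percentile_approx(..., 0.5)`):

```python
import pandas as pd

# Toy data standing in for the submissions frame (values are made up)
toy = pd.DataFrame({
    "created_date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "num_comments": [10, 30, 5],
    "score": [3, 7, 4],
})

daily = (
    toy.groupby("created_date")
       .agg(total_submissions=("num_comments", "size"),
            avg_num_comments=("num_comments", "mean"),
            median_num_comments=("num_comments", "median"),
            max_num_comments=("num_comments", "max"),
            avg_score=("score", "mean"))
       .reset_index()
       .sort_values(by="created_date")
)
```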
com_daily_count = com\
    .groupBy("created_date")\
    .agg(
        count("*").alias("total_comments"),
        mean("score").alias("avg_score"),
        mean("body_wordCount").alias("avg_body_wordCount")
        ).toPandas()
com_daily_count['created_date'] = pd.to_datetime(com_daily_count['created_date'])
com_daily_count = com_daily_count.sort_values(by='created_date')
com_daily_count
created_date total_comments avg_score avg_body_wordCount
288 2021-01-01 9013 5.926107 34.627538
238 2021-01-02 8711 6.291356 39.110665
164 2021-01-03 9572 6.185541 38.544191
275 2021-01-04 9145 7.188409 37.068125
243 2021-01-05 9353 7.122207 32.787662
... ... ... ... ...
761 2023-03-27 5300 9.527358 40.971509
816 2023-03-28 4901 6.703122 41.518466
797 2023-03-29 4539 8.152236 43.081956
780 2023-03-30 4603 7.773626 41.344775
543 2023-03-31 5492 8.339221 36.546249

819 rows × 4 columns

sub_daily_count.to_csv("../data/csv/sub_daily.csv", index=False)
com_daily_count.to_csv("../data/csv/com_daily.csv", index=False)
sub_daily_count = pd.read_csv("Users/kw823/fall-2023-reddit-project-team-09/data/csv/sub_daily.csv")
com_daily_count = pd.read_csv("Users/kw823/fall-2023-reddit-project-team-09/data/csv/com_daily.csv")
# Create a subplot with two y-axes
fig = sp.make_subplots(specs=[[{"secondary_y": True}]])

# Add traces for submissions and comments to the respective y-axes
fig.add_trace(go.Scatter(x=sub_daily_count['created_date'], y=sub_daily_count['total_submissions'], marker_color='#42a1b9', opacity=.65, name="submissions"), secondary_y=False)
fig.add_trace(go.Scatter(x=com_daily_count['created_date'], y=com_daily_count['total_comments'], marker_color='#d13a47', opacity=.65, name="comments"), secondary_y=True)

# Update the layout for the whole figure
fig.update_layout(
    title='The number of comments and submissions for each day',
    xaxis={'title': 'Date'},
)

# Update the y-axis labels
fig.update_yaxes(title_text="# Submissions", secondary_y=False)
fig.update_yaxes(title_text="# Comments", secondary_y=True)

# Show the figure
fig.show()
sub_daily_count_top10 = sub_daily_count.sort_values(by='total_submissions', ascending=False).iloc[0:10, :]
sub_daily_count_top10.to_csv("../data/csv/sub_daily_count_top10.csv", index=False)
sub_daily_count_top10
created_date total_submissions
277 2021-02-21 263
630 2021-02-03 253
17 2021-02-13 250
26 2021-02-10 249
531 2021-01-18 247
269 2021-01-21 245
481 2021-01-31 243
347 2021-02-01 237
72 2021-01-12 236
236 2021-02-28 236
com_daily_count_top10 = com_daily_count.sort_values(by='total_comments', ascending=False).iloc[0:10, :]
com_daily_count_top10.to_csv("../data/csv/com_daily_count_top10.csv", index=False)
com_daily_count_top10
created_date total_comments
265 2021-07-30 16787
245 2021-01-31 15742
282 2021-03-25 15616
162 2021-03-26 15509
147 2021-02-05 14624
249 2021-03-24 14375
272 2021-03-28 14268
273 2021-01-18 14139
134 2021-02-21 14021
148 2021-02-14 13720

Heatmap by Hour and Day of Week for r/Anime

Submission

def custom_xaxis_formatter(x, pos):
    return int(x)

sub_grp_hour_week = sub.groupBy("created_hour","created_week").count().toPandas()

sub_piv = pd.pivot_table(sub_grp_hour_week, columns=["created_hour"], index=["created_week"], values=['count'], fill_value=0)
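The pivot step can be sketched on toy (hour, day-of-week, count) rows; the made-up data below has duplicate (hour, week) pairs, so we pass `aggfunc="sum"` explicitly (in the notebook the groupBy already yields unique pairs, so the default aggregation is harmless):

```python
import pandas as pd

# Toy rows standing in for the Spark groupBy("created_hour", "created_week") output
toy = pd.DataFrame({
    "created_hour": [0, 0, 1, 0],
    "created_week": [1, 2, 1, 1],
    "count": [5, 3, 2, 4],
})

# Hours become columns, days of week become rows; missing cells are filled with 0
piv = pd.pivot_table(toy, columns=["created_hour"], index=["created_week"],
                     values=["count"], fill_value=0, aggfunc="sum")
```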

#plot pivot table as heatmap using seaborn
palette1 = sns.cubehelix_palette(as_cmap=True)
ax = sns.heatmap(sub_piv, cmap=palette1, yticklabels=['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'])
plt.setp(ax.xaxis.get_majorticklabels(), rotation=0)
plt.setp(ax.yaxis.get_majorticklabels(), rotation=0)
plt.xlabel("Hour of Day", fontsize = 12)
plt.ylabel("Day of Week", fontsize = 12)
plt.gca().xaxis.set_major_formatter(FuncFormatter(custom_xaxis_formatter))
plt.title("Heatmap by Hour and Day of Week for Submission")
plt.tight_layout()
plt.savefig("../website-source/images/anime_submission_heatmap.png")
plt.show()

sub_piv.transpose().to_csv("../data/csv/sub_pivot_hours_day_of_week.csv")
sub_piv.transpose()
created_week 1 2 3 4 5 6 7
created_hour
0 807 855 812 703 748 848 726
1 771 714 686 679 629 692 610
2 890 737 655 665 669 678 741
3 671 692 674 712 667 681 657
4 627 666 646 651 658 621 620
5 630 613 593 548 578 519 522
6 572 582 553 490 528 518 495
7 458 509 498 507 519 463 483
8 485 474 458 453 473 425 426
9 567 431 386 397 438 381 502
10 511 480 475 431 492 428 504
11 435 428 443 418 475 428 447
12 538 525 494 526 463 453 500
13 622 573 522 578 548 537 602
14 776 618 663 714 672 686 611
15 717 720 770 834 790 637 728
16 832 769 782 799 728 867 894
17 852 747 682 740 810 738 903
18 866 770 689 691 715 821 879
19 851 750 667 686 649 746 735
20 843 701 763 675 683 688 783
21 985 875 790 791 793 821 960
22 950 778 846 807 780 842 885
23 886 787 800 817 809 795 812

Comments

com_grp_hour_week = com.groupBy("created_hour","created_week").count().toPandas()

com_piv = pd.pivot_table(com_grp_hour_week, columns=["created_hour"], index=["created_week"], values=['count'], fill_value=0)

#plot pivot table as heatmap using seaborn
palette2 = sns.cubehelix_palette(start=.5, rot=-.5, as_cmap=True)

ax = sns.heatmap(com_piv, cmap=palette2, yticklabels=['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'])
plt.setp(ax.xaxis.get_majorticklabels(), rotation=0)
plt.setp(ax.yaxis.get_majorticklabels(), rotation=0)
plt.xlabel("Hour of Day", fontsize = 12)
plt.ylabel("Day of Week", fontsize = 12)
plt.gca().xaxis.set_major_formatter(FuncFormatter(custom_xaxis_formatter))
plt.title("Heatmap by Hour and Day of Week for Comments")
plt.tight_layout()
plt.savefig("../website-source/images/anime_comments_heatmap.png")
plt.show()

com_piv.transpose().to_csv("../data/csv/com_pivot_hours_day_of_week.csv")
com_piv.transpose()
created_week 1 2 3 4 5 6 7
created_hour
0 47651 51633 47777 46405 47692 62720 46911
1 45103 48403 44398 44038 44427 52377 41927
2 44963 47158 42658 42636 41943 43371 40079
3 42748 43122 40810 40382 41116 41792 38848
4 38780 39527 37307 36229 38238 38314 36195
5 36152 35026 33592 32644 34165 33889 32429
6 32815 32001 30028 28909 30898 30516 29615
7 29169 28721 26484 26619 27311 27145 27898
8 27207 25805 24416 24382 24639 24757 25377
9 26832 23897 24065 23533 23368 24232 25158
10 28262 25019 24043 23794 24020 23792 26762
11 28484 27084 26051 26479 25824 26487 28357
12 29978 29552 29725 29655 28447 28418 31457
13 37198 33515 32184 33822 31727 36384 35141
14 43539 39039 37562 38800 36290 43784 41212
15 46864 43910 45864 49317 42725 56274 48052
16 55765 49009 53828 55681 48112 49522 55314
17 58151 51640 52969 54625 50297 53138 62519
18 55055 48914 49153 51302 50117 50345 59030
19 56457 48696 46254 48969 47736 51018 56331
20 56576 47935 46099 48002 47060 48690 51222
21 63528 48848 48319 49404 49683 48863 52715
22 63254 52500 51358 53852 52320 55184 54477
23 58728 51049 50377 51925 53791 53674 53474

Wordcloud for both comment body and submission title

def top_n_words(n, word_counts):
    # take the n most frequent words (word_counts is sorted by count, descending)
    top_words = word_counts.limit(n).select("word", "count")
    # collect to a dict {word: count}
    word_dict = {row["word"]: row["count"] for row in top_words.collect()}
    # render a word cloud from the frequencies
    wordcloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(word_dict)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
stop_words = set(stopwords.words('english'))

# comment body
word_counts_com = (
    com
    .withColumn("word", F.explode(F.split(F.col("body"), r"\s+")))
    .withColumn("word", F.regexp_replace("word", r"[^\w]", ""))
    .groupBy("word")
    .count()
    .sort("count", ascending=False)
)

filtered_word_counts_com = word_counts_com.filter(~word_counts_com["word"].isin(stop_words))
filtered_word_counts_com.show(10)
+------+-------+
|  word|  count|
+------+-------+
|      |2303336|
|  like|1514254|
| anime|1416479|
|   one| 900388|
| watch| 743528|
|  dont| 733973|
|really| 704311|
|    im| 669780|
|  good| 627644|
|  also| 616659|
+------+-------+
only showing top 10 rows
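The Spark pipeline above (split on whitespace, strip non-word characters, drop stopwords) can be sketched in plain Python. Two deliberate differences, both our additions: this sketch lowercases input and drops empty tokens (the Spark version keeps them, which is why a blank word tops the table above), and `STOP` is a tiny stand-in for NLTK's English stopword list:

```python
import re
from collections import Counter

STOP = {"the", "a", "is", "and"}  # stand-in for NLTK's English stopwords

def word_counts(texts):
    counts = Counter()
    for text in texts:
        for token in re.split(r"\s+", text.lower()):
            word = re.sub(r"[^\w]", "", token)   # strip non-word characters
            if word and word not in STOP:        # drop empties and stopwords
                counts[word] += 1
    return counts
```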
top_n_words(200, filtered_word_counts_com)

# submission title
word_counts_sub = (
    sub
    .withColumn("word", F.explode(F.split(F.col("title"), r"\s+")))
    .withColumn("word", F.regexp_replace("word", r"[^\w]", ""))
    .groupBy("word")
    .count()
    .sort("count", ascending=False)
)

filtered_word_counts_sub = word_counts_sub.filter(~word_counts_sub["word"].isin(stop_words))
filtered_word_counts_sub.show(10)
+----------+-----+
|      word|count|
+----------+-----+
|     anime|47850|
|          |17981|
|   episode|11728|
|discussion|10424|
|     watch| 9941|
|      help| 6317|
|   looking| 5950|
|      need| 5654|
|      like| 5503|
|      good| 5471|
+----------+-----+
only showing top 10 rows
top_n_words(200, filtered_word_counts_sub)

Boxplot of word count by stickied or not for submission

# Cast boolean columns to 0/1 integers, filling nulls with 0,
# so Plotly can group on them cleanly
bool_columns = [name for name, dtype in sub.dtypes if dtype == 'boolean']
for col_name in bool_columns:
    sub = sub.withColumn(col_name, F.coalesce(F.col(col_name).cast('int'), F.lit(0)))

sub_df = sub.toPandas()

# Create a box plot using Plotly
fig = px.box(sub_df, x="stickied", y="title_wordCount", title="Box Plot of title_wordCount by stickied")
fig.show()
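The same null-safe boolean-to-int conversion can be sketched in pandas on a toy column (a stand-in for `stickied`; the chained-cast approach is our suggestion, not the notebook's):

```python
import pandas as pd

# Toy column standing in for a nullable boolean Spark column
df = pd.DataFrame({"stickied": [True, None, False]})

# nullable booleans -> 0/1 ints, nulls treated as 0
# (mirrors F.coalesce(F.col(c).cast('int'), F.lit(0)) used above)
df["stickied"] = df["stickied"].astype("boolean").fillna(False).astype(int)
```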

Word Count vs. Score

Submissions: Title Word Count vs. Score

ax = sub_df.plot.scatter('score', 'title_wordCount', s=3, c="#d13a47")
ax.set_title("Scatter plot of title word count and score for submissions")
ax.set_xscale('log')
plt.savefig("../website-source/images/anime_submissions_wordcount_score_scatterplot.png")
plt.show()

Comments: Body Word Count vs. Score

com_wordcount_score_df = com.select("body_wordCount", 'score').toPandas()
ax = com_wordcount_score_df.plot.scatter('score', 'body_wordCount', s=3, c="#42a1b9")
ax.set_title("Scatter plot of body word count and score for comments")
ax.set_xscale('log')
ax.set_yscale('log')
plt.savefig("../website-source/images/anime_comments_wordcount_score_scatterplot.png")
plt.show()