Code: NLP-Topic 5

Set up

# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.4.0

# install spark-nlp (the `spark-nlp` wheel provides the `sparknlp` Python module;
# the separate `sparknlp` package on PyPI is an unrelated stub and is not needed)
%pip install spark-nlp==5.1.3
%pip install plotly

# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 23.3.1
  latest version: 23.10.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.10.0



## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - openjdk


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2023.08.22 |       h06a4308_0         123 KB
    certifi-2023.11.17         |  py310h06a4308_0         158 KB
    openjdk-11.0.13            |       h87a67e3_0       341.0 MB
    ------------------------------------------------------------
                                           Total:       341.3 MB

The following NEW packages will be INSTALLED:

  openjdk            pkgs/main/linux-64::openjdk-11.0.13-h87a67e3_0 

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2023.7.2~ --> pkgs/main::ca-certificates-2023.08.22-h06a4308_0 
  certifi            conda-forge/noarch::certifi-2023.7.22~ --> pkgs/main/linux-64::certifi-2023.11.17-py310h06a4308_0 



Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.
Collecting pyspark==3.4.0
  Using cached pyspark-3.4.0-py2.py3-none-any.whl
Collecting py4j==0.10.9.7 (from pyspark==3.4.0)
  Using cached py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.7 pyspark-3.4.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Collecting spark-nlp==5.1.3
  Obtaining dependency information for spark-nlp==5.1.3 from https://files.pythonhosted.org/packages/cd/7d/bc0eca4c9ec4c9c1d9b28c42c2f07942af70980a7d912d0aceebf8db32dd/spark_nlp-5.1.3-py2.py3-none-any.whl.metadata
  Using cached spark_nlp-5.1.3-py2.py3-none-any.whl.metadata (53 kB)
Using cached spark_nlp-5.1.3-py2.py3-none-any.whl (537 kB)
Installing collected packages: spark-nlp
Successfully installed spark-nlp-5.1.3
Requirement already satisfied: plotly in /opt/conda/lib/python3.10/site-packages (5.9.0)
Requirement already satisfied: tenacity>=6.2.0 in /opt/conda/lib/python3.10/site-packages (from plotly) (8.0.1)
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.1.3.jar
--2023-11-21 04:02:11--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.1.3.jar
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.62.112, 52.216.53.0, 52.216.57.72, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.62.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 708534094 (676M) [application/java-archive]
Saving to: ‘spark-nlp-assembly-5.1.3.jar’

spark-nlp-assembly- 100%[===================>] 675.71M  80.0MB/s    in 8.8s    

2023-11-21 04:02:23 (76.8 MB/s) - ‘spark-nlp-assembly-5.1.3.jar’ saved [708534094/708534094]
## Import packages
import json
import sparknlp
import numpy as np
import pandas as pd
from sparknlp.base import *
from pyspark.ml import Pipeline
from sparknlp.annotator import *
import pyspark.sql.functions as F
from pyspark.sql.functions import (
    mean, stddev, max, min, count, percentile_approx,
    year, month, dayofmonth, ceil, col, dayofweek, hour,
    explode, date_format, lower, size, split, regexp_replace, isnan, when,
)
from pyspark.sql import SparkSession
from sparknlp.pretrained import PretrainedPipeline
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
from py4j.java_gateway import java_import

spark = SparkSession.builder \
    .appName("Spark NLP")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3,org.apache.hadoop:hadoop-aws:3.2.2")\
    .config(
        # use the spark.hadoop. prefix so Spark passes the property to the S3A filesystem
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )\
    .getOrCreate()

print(f"Spark version: {spark.version}")
print(f"sparknlp version: {sparknlp.version()}")
:: loading settings :: url = jar:file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-a0107b46-ddad-4865-93b5-de0a60b32d78;1.0
    confs: [default]
    found com.johnsnowlabs.nlp#spark-nlp_2.12;5.1.3 in central
    found org.apache.hadoop#hadoop-aws;3.2.2 in central
:: resolution report :: resolve 4943ms :: artifacts dl 897ms
    :: evicted modules:
    com.google.protobuf#protobuf-java-util;3.0.0-beta-3 by [com.google.protobuf#protobuf-java-util;3.21.12] in [default]
    com.google.protobuf#protobuf-java;3.0.0-beta-3 by [com.google.protobuf#protobuf-java;3.21.12] in [default]
    com.google.code.gson#gson;2.3 by [com.google.code.gson#gson;2.10.1] in [default]
    com.amazonaws#aws-java-sdk-bundle;1.11.563 by [com.amazonaws#aws-java-sdk-bundle;1.11.828] in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   77  |   0   |   0   |   4   ||   73  |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-a0107b46-ddad-4865-93b5-de0a60b32d78
    confs: [default]
    0 artifacts copied, 73 already retrieved (0kB/295ms)
23/11/21 04:02:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark version: 3.4.0
sparknlp version: 5.1.3

Data preparation

## Read cleaned data from parquet

### Anime subreddits
import sagemaker
# session = sagemaker.Session()
# bucket = session.default_bucket()
bucket = 'sagemaker-us-east-1-315969085594'

sub_bucket_path = f"s3a://{bucket}/project/cleaned/sub"
com_bucket_path = f"s3a://{bucket}/project/cleaned/com"

print(f"reading submissions from {sub_bucket_path}")
sub = spark.read.parquet(sub_bucket_path)  # note: header= is a CSV option and has no effect on parquet
print(f"shape of the sub dataframe is {sub.count():,}x{len(sub.columns)}")

print(f"reading comments from {com_bucket_path}")
com = spark.read.parquet(com_bucket_path)
print(f"shape of the com dataframe is {com.count():,}x{len(com.columns)}")
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
reading submissions from s3a://sagemaker-us-east-1-315969085594/project/cleaned/sub
23/11/19 02:26:40 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
                                                                                
shape of the sub dataframe is 110,247x22
reading comments from s3a://sagemaker-us-east-1-315969085594/project/cleaned/com
shape of the com dataframe is 6,879,119x19
sub.groupBy('subreddit').count().show()
+---------+------+
|subreddit| count|
+---------+------+
|    anime|110247|
+---------+------+
                                                                                
sub.printSchema()
root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- title: string (nullable = true)
 |-- selftext: string (nullable = true)
 |-- num_comments: long (nullable = true)
 |-- num_crossposts: long (nullable = true)
 |-- over_18: boolean (nullable = true)
 |-- score: long (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- id: string (nullable = true)
 |-- created_date: string (nullable = true)
 |-- created_hour: integer (nullable = true)
 |-- created_week: integer (nullable = true)
 |-- created_month: integer (nullable = true)
 |-- created_year: integer (nullable = true)
 |-- cleaned_title: string (nullable = true)
 |-- title_wordCount: integer (nullable = true)
 |-- cleaned_selftext: string (nullable = true)
 |-- selftext_wordCount: integer (nullable = true)
 |-- contain_pokemon: boolean (nullable = true)
sub.show(3)
+---------+--------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
|subreddit|              author|author_flair_text|        created_utc|               title|            selftext|num_comments|num_crossposts|over_18|score|stickied|    id|created_date|created_hour|created_week|created_month|created_year|       cleaned_title|title_wordCount|    cleaned_selftext|selftext_wordCount|contain_pokemon|
+---------+--------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
|    anime|PsychologicalGift299|             null|2021-04-19 20:42:46|anime movies for ...|so as my fellow o...|          12|             0|  false|    0|   false|mua1uo|  2021-04-19|          20|           2|            4|        2021|anime movies for 420|              4|so as my fellow o...|                64|          false|
|    anime|        Tuttles4ever|             null|2021-04-19 20:48:42|i need a very spe...|are there any ani...|           7|             0|  false|    0|   false|mua6g3|  2021-04-19|          20|           2|            4|        2021|i need a very spe...|             15|are there any ani...|                42|          false|
|    anime|          nemifloras|             null|2021-04-19 20:52:42|any atmospheric a...|i finished reassi...|           9|             0|  false|    0|   false|mua9iu|  2021-04-19|          20|           2|            4|        2021|any atmospheric a...|             10|i finished reassi...|                18|          false|
+---------+--------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
only showing top 3 rows
                                                                                
com.groupBy('subreddit').count().show()
+---------+-------+
|subreddit|  count|
+---------+-------+
|    anime|6879119|
+---------+-------+
                                                                                
com.printSchema()
root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- body: string (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- score: long (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- created_date: string (nullable = true)
 |-- created_hour: integer (nullable = true)
 |-- created_week: integer (nullable = true)
 |-- created_month: integer (nullable = true)
 |-- created_year: integer (nullable = true)
 |-- cleaned: string (nullable = true)
 |-- body_wordCount: integer (nullable = true)
 |-- contain_pokemon: boolean (nullable = true)
com.show(3)
+---------+--------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|subreddit|        author|   author_flair_text|        created_utc|                body|controversiality|score| parent_id|stickied|  link_id|     id|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|contain_pokemon|
+---------+--------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|    anime| DonaldJenkins|                null|2021-11-14 04:39:47|  i sent it to ya ;)|               0|    1|t1_hk0whi9|   false|t3_ov07rq|hkjr7uj|  2021-11-14|           4|           1|           11|        2021|    i sent it to ya |             6|          false|
|    anime|      DonMo999|:MAL:https://myan...|2021-11-14 04:40:25|displate has some...|               0|    1| t3_qtgc12|   false|t3_qtgc12|hkjralc|  2021-11-14|           4|           1|           11|        2021|displate has some...|            16|          false|
|    anime|OrangeBanana38|:AMQ::STAR::AL:ht...|2021-11-14 04:41:01|that sounds like ...|               0|    3|t1_hkjq6wn|   false|t3_qryjfm|hkjrd4w|  2021-11-14|           4|           1|           11|        2021|that sounds like ...|             6|          false|
+---------+--------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
only showing top 3 rows
                                                                                

Text Cleaning Pipeline

Build a SparkNLP Pipeline

# Step 1: Transforms raw texts to `document` annotation
# documentAssembler = DocumentAssembler()\
#     .setInputCol("text")\
#     .setOutputCol("document")\
#     .setCleanupMode("shrink") # shrink: removes new lines and tabs, plus merging multiple spaces and blank lines to a single space.

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")


# step 2: Remove unwanted characters from the text using a regex pattern

cleanUpPatterns = [r"[^a-zA-Z\s]+"]  # keep letters and whitespace only; use r"[^\w\d\s]" to remove punctuation but keep alphanumeric chars

# emoji_pat = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
# clean_pat = '[^a-zA-Z\s]+'
# cleanUpPatterns = [r"({})|({})".format(emoji_pat, clean_pat)]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)
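As a quick sanity check, the cleanup pattern can be exercised outside Spark with plain Python `re`. This is only a rough sketch of the normalizer's `clean` action with lowercasing and a `" "` replacement; `normalize` is a hypothetical helper, not part of Spark NLP:

```python
import re

# the same cleanup pattern used by the DocumentNormalizer stage:
# strip every character that is not a letter or whitespace
clean_pat = re.compile(r"[^a-zA-Z\s]+")

def normalize(text: str) -> str:
    """Lowercase and replace non-letter runs with a single space,
    roughly mirroring the normalizer's behavior on this pattern."""
    cleaned = clean_pat.sub(" ", text.lower())
    # collapse runs of whitespace left behind by the substitution
    return re.sub(r"\s+", " ", cleaned).strip()

print(normalize("I sent it to ya ;) #420!!"))  # -> i sent it to ya
```

Emoji are covered by the same pattern, since they are neither ASCII letters nor whitespace; the commented-out `emoji_pat` above would only be needed with a looser base pattern.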

# step 3: Identifies tokens with tokenization open standards
tokenizer = Tokenizer() \
    .setInputCols(["normalizedDocument"]) \
    .setOutputCol("token") \
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])

# # step *: 
# spellChecker = ContextSpellCheckerApproach() \
#     .setInputCols("token") \
#     .setOutputCol("corrected") \
#     .setWordMaxDistance(3) \
#     .setBatchSize(24) \
#     .setEpochs(8) \
#     .setLanguageModelClasses(1650)  # dependent on vocabulary size

# step 4: Find lemmas out of words with the objective of returning a base dictionary word
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

# step 5: Drops all the stop words from the input sequences
stopwords_cleaner = StopWordsCleaner()\
    .setInputCols("lemma")\
    .setOutputCol("cleanTokens")\
    .setCaseSensitive(False)

# step 6: Reconstructs a DOCUMENT type annotation from tokens
tokenassembler = TokenAssembler()\
    .setInputCols(["document", "cleanTokens"]) \
    .setOutputCol("clean_text")


nlpPipeline = Pipeline(
    stages=[
        documentAssembler,
        documentNormalizer,
        tokenizer,
        lemmatizer,
        stopwords_cleaner,
        tokenassembler
     ])
lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
Download done! Loading the resource.
[OK!]
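Conceptually, the assembled pipeline lowercases and strips non-letters, tokenizes, lemmatizes, drops stop words, and reassembles a cleaned string. A toy plain-Python sketch of that flow (the stopword set and lemma dictionary here are made-up stand-ins for the pretrained Spark NLP resources, and `clean_text` is a hypothetical helper):

```python
import re

# toy stand-ins for the pretrained components (illustrative only)
STOPWORDS = {"i", "a", "the", "to", "it", "so", "are", "there", "any", "be"}
LEMMAS = {"movies": "movie", "sent": "send", "finished": "finish"}

def clean_text(text: str) -> str:
    # steps 1-2: document assembly + normalization (lowercase, letters/whitespace only)
    doc = re.sub(r"[^a-zA-Z\s]+", " ", text.lower())
    # step 3: tokenize on whitespace
    tokens = doc.split()
    # step 4: lemmatize via a (toy) dictionary lookup
    lemmas = [LEMMAS.get(t, t) for t in tokens]
    # step 5: drop stop words
    kept = [t for t in lemmas if t not in STOPWORDS]
    # step 6: reassemble a single cleaned string (TokenAssembler analogue)
    return " ".join(kept)

print(clean_text("I sent it to ya ;)"))  # -> send ya
```

The real annotators do far more (dictionary-backed lemmas, configurable stopword lists, annotation metadata), but the data flow between stages is the same as above.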
# rename the column that needs text cleaning to `text` to match the nlpPipeline input
body_com = com.withColumnRenamed('body', 'text')
title_sub = sub.withColumnRenamed('title', 'text')
selftext_sub = sub.withColumnRenamed('selftext', 'text')
# fit and transform to run the text cleaning
# (repeat for each dataframe: body_com, title_sub, selftext_sub)
data = body_com
pipelineModel = nlpPipeline.fit(data)
result = pipelineModel.transform(data)
# result.selectExpr("clean_text.result").show(truncate=False)
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.util.SizeEstimator$ (file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/spark-core_2.12-3.4.0.jar) to field java.util.regex.Pattern.pattern
WARNING: Please consider reporting this to the maintainers of org.apache.spark.util.SizeEstimator$
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
result.show()
+---------+-----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|subreddit|           author|   author_flair_text|        created_utc|                text|controversiality|score| parent_id|stickied|  link_id|     id|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|contain_pokemon|            document|  normalizedDocument|               token|               lemma|         cleanTokens|          clean_text|
+---------+-----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    anime|    DonaldJenkins|                null|2021-11-14 04:39:47|  i sent it to ya ;)|               0|    1|t1_hk0whi9|   false|t3_ov07rq|hkjr7uj|  2021-11-14|           4|           1|           11|        2021|    i sent it to ya |             6|          false|[{document, 0, 17...|[{document, 0, 14...|[{token, 0, 0, i,...|[{token, 0, 0, i,...|[{token, 2, 5, se...|[{document, 0, 3,...|
|    anime|         DonMo999|:MAL:https://myan...|2021-11-14 04:40:25|displate has some...|               0|    1| t3_qtgc12|   false|t3_qtgc12|hkjralc|  2021-11-14|           4|           1|           11|        2021|displate has some...|            16|          false|[{document, 0, 93...|[{document, 0, 91...|[{token, 0, 7, di...|[{token, 0, 7, di...|[{token, 0, 7, di...|[{document, 0, 74...|
|    anime|   OrangeBanana38|:AMQ::STAR::AL:ht...|2021-11-14 04:41:01|that sounds like ...|               0|    3|t1_hkjq6wn|   false|t3_qryjfm|hkjrd4w|  2021-11-14|           4|           1|           11|        2021|that sounds like ...|             6|          false|[{document, 0, 40...|[{document, 0, 34...|[{token, 0, 3, th...|[{token, 0, 3, th...|[{token, 5, 10, s...|[{document, 0, 28...|
|    anime|         ClBanjai|                null|2021-11-14 04:41:03|what kind of ques...|               0|    1| t3_qth8ql|   false|t3_qth8ql|hkjrdae|  2021-11-14|           4|           1|           11|        2021|what kind of ques...|            10|          false|[{document, 0, 51...|[{document, 0, 48...|[{token, 0, 3, wh...|[{token, 0, 3, wh...|[{token, 5, 8, ki...|[{document, 0, 25...|
|    anime|      helsaabiart|                null|2021-11-14 04:42:02|today on shokugek...|               0|    4| t3_qt8p0u|   false|t3_qt8p0u|hkjrhg6|  2021-11-14|           4|           1|           11|        2021|today on shokugek...|            28|          false|[{document, 0, 16...|[{document, 0, 16...|[{token, 0, 4, to...|[{token, 0, 4, to...|[{token, 0, 4, to...|[{document, 0, 11...|
|    anime|           Lezoux|:MAL:https://myan...|2021-11-14 04:42:08|   it's easy enough.|               0|    3|t1_hkjrd4w|   false|t3_qryjfm|hkjrhv3|  2021-11-14|           4|           1|           11|        2021|     its easy enough|             3|          false|[{document, 0, 16...|[{document, 0, 15...|[{token, 0, 1, it...|[{token, 0, 1, it...|[{token, 5, 8, ea...|[{document, 0, 10...|
|    anime|    AutoModerator|                null|2021-11-14 04:42:39|hello! if you eve...|               0|    1| t3_qti7iu|   false|t3_qti7iu|hkjrk63|  2021-11-14|           4|           1|           11|        2021|hello if you ever...|           160|          false|[{document, 0, 16...|[{document, 0, 15...|[{token, 0, 4, he...|[{token, 0, 4, he...|[{token, 0, 4, he...|[{document, 0, 11...|
|    anime|    AutoModerator|                null|2021-11-14 04:42:39|hi xxcile, it see...|               0|    1| t3_qti7iu|   false|t3_qti7iu|hkjrk6q|  2021-11-14|           4|           1|           11|        2021|hi xxcile it seem...|            82|          false|[{document, 0, 57...|[{document, 0, 54...|[{token, 0, 1, hi...|[{token, 0, 1, hi...|[{token, 0, 1, hi...|[{document, 0, 37...|
|    anime|         Terra246|                null|2021-11-14 04:42:47|i did see amagi b...|               0|    4|t1_hkjm3z4|   false|t3_qtgzu3|hkjrkr1|  2021-11-14|           4|           1|           11|        2021|i did see amagi b...|             9|          false|[{document, 0, 46...|[{document, 0, 45...|[{token, 0, 0, i,...|[{token, 0, 0, i,...|[{token, 6, 8, se...|[{document, 0, 36...|
|    anime|ZaphodBeebblebrox|:S3::AL:https://a...|2021-11-14 04:43:39|which is your mad...|               0|    3|t1_hkjpckv|   false|t3_qryjfm|hkjrogp|  2021-11-14|           4|           1|           11|        2021|which is your madoka|             4|          false|[{document, 0, 20...|[{document, 0, 19...|[{token, 0, 4, wh...|[{token, 0, 4, wh...|[{token, 14, 19, ...|[{document, 0, 5,...|
|    anime|     GreekFire242|                null|2021-11-14 04:43:47|        demon slayer|               0|    2|t1_hkjpe43|   false|t3_qtgcp8|hkjrp3s|  2021-11-14|           4|           1|           11|        2021|        demon slayer|             2|          false|[{document, 0, 11...|[{document, 0, 11...|[{token, 0, 4, de...|[{token, 0, 4, de...|[{token, 0, 4, de...|[{document, 0, 11...|
|    anime|         Terra246|                null|2021-11-14 04:43:52|i mean, it is one...|               0|    2|t1_hkjpa0q|   false|t3_qtgzu3|hkjrpgf|  2021-11-14|           4|           1|           11|        2021|i mean it is one ...|             8|          false|[{document, 0, 33...|[{document, 0, 31...|[{token, 0, 0, i,...|[{token, 0, 0, i,...|[{token, 2, 5, me...|[{document, 0, 14...|
|    anime|     MakotoPrince|                null|2021-11-14 04:44:32|yet another good ...|               0|    3| t3_qtg0z3|   false|t3_qtg0z3|hkjrsa2|  2021-11-14|           4|           1|           11|        2021|yet another good ...|            60|          false|[{document, 0, 30...|[{document, 0, 29...|[{token, 0, 2, ye...|[{token, 0, 2, ye...|[{token, 0, 2, ye...|[{document, 0, 18...|
|    anime|   Gryse_Blacolar|                null|2021-11-14 04:44:41|that's basically ...|               0|    2| t3_qsz91x|   false|t3_qsz91x|hkjrsvy|  2021-11-14|           4|           1|           11|        2021|thats basically s...|            15|          false|[{document, 0, 86...|[{document, 0, 84...|[{token, 0, 3, th...|[{token, 0, 3, th...|[{token, 7, 15, b...|[{document, 0, 55...|
|    anime|     Junnielocked|                null|2021-11-14 04:44:47|looked up the ani...|               0|    2| t3_qt7yff|   false|t3_qt7yff|hkjrtas|  2021-11-14|           4|           1|           11|        2021|looked up the ani...|            29|          false|[{document, 0, 15...|[{document, 0, 15...|[{token, 0, 5, lo...|[{token, 0, 5, lo...|[{token, 0, 5, lo...|[{document, 0, 93...|
|    anime|        kubabubba|                null|2021-11-14 04:45:01|      how about now?|               0|   24|t1_hkihkib|   false|t3_qt7yff|hkjrual|  2021-11-14|           4|           1|           11|        2021|       how about now|             3|          false|[{document, 0, 13...|[{document, 0, 12...|[{token, 0, 2, ho...|[{token, 0, 2, ho...|                  []|[{document, 0, -1...|
|    anime|    alotmorealots|                null|2021-11-14 04:45:08|your post could d...|               0|    1| t3_qtgpcl|   false|t3_qtgpcl|hkjrutb|  2021-11-14|           4|           1|           11|        2021|your post could d...|             9|          false|[{document, 0, 48...|[{document, 0, 47...|[{token, 0, 3, yo...|[{token, 0, 3, yo...|[{token, 5, 8, po...|[{document, 0, 20...|
|    anime|        heimdal77|                null|2021-11-14 04:45:17|depends is it lik...|               0|    2|t1_hkj9jju|   false|t3_qtfmin|hkjrvfv|  2021-11-14|           4|           1|           11|        2021|depends is it lik...|             5|          false|[{document, 0, 26...|[{document, 0, 25...|[{token, 0, 6, de...|[{token, 0, 6, de...|[{token, 0, 6, de...|[{document, 0, 18...|
|    anime|    jackofslayers|                null|2021-11-14 04:45:35|i have my own sus...|               0|    7|t1_hkjcx61|   false|t3_qt5igg|hkjrwsz|  2021-11-14|           4|           1|           11|        2021|i have my own sus...|            13|          false|[{document, 0, 59...|[{document, 0, 59...|[{token, 0, 0, i,...|[{token, 0, 0, i,...|[{token, 14, 23, ...|[{document, 0, 29...|
|    anime| SarcasmUndefined|                null|2021-11-14 04:45:59|looking submissiv...|               0|    4|t1_hkhunza|   false|t3_qt3ovl|hkjrykr|  2021-11-14|           4|           1|           11|        2021|looking submissiv...|             4|          false|[{document, 0, 31...|[{document, 0, 31...|[{token, 0, 6, lo...|[{token, 0, 6, lo...|[{token, 0, 6, lo...|[{document, 0, 24...|
+---------+-----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows
                                                                                
result.printSchema()
root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- text: string (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- score: long (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- created_date: string (nullable = true)
 |-- created_hour: integer (nullable = true)
 |-- created_week: integer (nullable = true)
 |-- created_month: integer (nullable = true)
 |-- created_year: integer (nullable = true)
 |-- cleaned: string (nullable = true)
 |-- body_wordCount: integer (nullable = true)
 |-- contain_pokemon: boolean (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- normalizedDocument: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- lemma: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- cleanTokens: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- clean_text: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
# drop columns created within the pipeline
result = result.drop("document","normalizedDocument","lemma","cleanTokens")
result.printSchema()
root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- text: string (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- score: long (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- created_date: string (nullable = true)
 |-- created_hour: integer (nullable = true)
 |-- created_week: integer (nullable = true)
 |-- created_month: integer (nullable = true)
 |-- created_year: integer (nullable = true)
 |-- cleaned: string (nullable = true)
 |-- body_wordCount: integer (nullable = true)
 |-- contain_pokemon: boolean (nullable = true)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- clean_text: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
result.select(
    "subreddit",
    "created_utc",
    "text",
    "controversiality",
    "score",
    "id",
    "created_date",
    "created_hour",
    "created_week",
    "created_month",
    "created_year",
    "cleaned",
    "body_wordCount",
    "contain_pokemon"
).show()
                                                                                
+---------+-------------------+--------------------+----------------+-----+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|subreddit|        created_utc|                text|controversiality|score|     id|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|contain_pokemon|
+---------+-------------------+--------------------+----------------+-----+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|    anime|2021-11-14 04:39:47|  i sent it to ya ;)|               0|    1|hkjr7uj|  2021-11-14|           4|           1|           11|        2021|    i sent it to ya |             6|          false|
|    anime|2021-11-14 04:40:25|displate has some...|               0|    1|hkjralc|  2021-11-14|           4|           1|           11|        2021|displate has some...|            16|          false|
|    anime|2021-11-14 04:41:01|that sounds like ...|               0|    3|hkjrd4w|  2021-11-14|           4|           1|           11|        2021|that sounds like ...|             6|          false|
|    anime|2021-11-14 04:41:03|what kind of ques...|               0|    1|hkjrdae|  2021-11-14|           4|           1|           11|        2021|what kind of ques...|            10|          false|
|    anime|2021-11-14 04:42:02|today on shokugek...|               0|    4|hkjrhg6|  2021-11-14|           4|           1|           11|        2021|today on shokugek...|            28|          false|
|    anime|2021-11-14 04:42:08|   it's easy enough.|               0|    3|hkjrhv3|  2021-11-14|           4|           1|           11|        2021|     its easy enough|             3|          false|
|    anime|2021-11-14 04:42:39|hello! if you eve...|               0|    1|hkjrk63|  2021-11-14|           4|           1|           11|        2021|hello if you ever...|           160|          false|
|    anime|2021-11-14 04:42:39|hi xxcile, it see...|               0|    1|hkjrk6q|  2021-11-14|           4|           1|           11|        2021|hi xxcile it seem...|            82|          false|
|    anime|2021-11-14 04:42:47|i did see amagi b...|               0|    4|hkjrkr1|  2021-11-14|           4|           1|           11|        2021|i did see amagi b...|             9|          false|
|    anime|2021-11-14 04:43:39|which is your mad...|               0|    3|hkjrogp|  2021-11-14|           4|           1|           11|        2021|which is your madoka|             4|          false|
|    anime|2021-11-14 04:43:47|        demon slayer|               0|    2|hkjrp3s|  2021-11-14|           4|           1|           11|        2021|        demon slayer|             2|          false|
|    anime|2021-11-14 04:43:52|i mean, it is one...|               0|    2|hkjrpgf|  2021-11-14|           4|           1|           11|        2021|i mean it is one ...|             8|          false|
|    anime|2021-11-14 04:44:32|yet another good ...|               0|    3|hkjrsa2|  2021-11-14|           4|           1|           11|        2021|yet another good ...|            60|          false|
|    anime|2021-11-14 04:44:41|that's basically ...|               0|    2|hkjrsvy|  2021-11-14|           4|           1|           11|        2021|thats basically s...|            15|          false|
|    anime|2021-11-14 04:44:47|looked up the ani...|               0|    2|hkjrtas|  2021-11-14|           4|           1|           11|        2021|looked up the ani...|            29|          false|
|    anime|2021-11-14 04:45:01|      how about now?|               0|   24|hkjrual|  2021-11-14|           4|           1|           11|        2021|       how about now|             3|          false|
|    anime|2021-11-14 04:45:08|your post could d...|               0|    1|hkjrutb|  2021-11-14|           4|           1|           11|        2021|your post could d...|             9|          false|
|    anime|2021-11-14 04:45:17|depends is it lik...|               0|    2|hkjrvfv|  2021-11-14|           4|           1|           11|        2021|depends is it lik...|             5|          false|
|    anime|2021-11-14 04:45:35|i have my own sus...|               0|    7|hkjrwsz|  2021-11-14|           4|           1|           11|        2021|i have my own sus...|            13|          false|
|    anime|2021-11-14 04:45:59|looking submissiv...|               0|    4|hkjrykr|  2021-11-14|           4|           1|           11|        2021|looking submissiv...|             4|          false|
+---------+-------------------+--------------------+----------------+-----+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
only showing top 20 rows

Topics in dummy variables

Important Keywords:

- OnePiece: One?Piece|Luffy|Zoro|Roronoa|Sanji|Usopp|Tony?Tony?Chopper|Nico?Robin|Portgas?D?Ace|Straw?Hat?Pirates|Devil?Fruit
- Pokemon: Pok[eé]mon|Ash|Pikachu|Pichu|Bulbasaur|Charmander|Squirtle|Ivysaur|Venusaur|Charmeleon|Charizard|Wartortle|Blastoise|Raichu|Eevee|Vaporeon|Jolteon|Flareon|Snorlax|Espeon|Umbreon|Leafeon|Glaceon|Sylveon|Absol
- Naruto: Naruto|Sasuke?Uchiha|Sakura?Haruno|Kakashi?Hatake|Itachi?Uchiha|Hinata?Hyuga|Jiraiya|Orochimaru|Sharingan|Konoha|Akatsuki
- OnePunchMan: One?Punch?Man|Saitama|Genos|Fubuki|Hero?Association|Monster?Association
- YuGiOh: Yu[-]?Gi[-]?Oh|Yugi?Muto|Seto?Kaiba|Joey?Wheeler|Maximillion?Pegasus|Duel?Monsters|Shadow?Games
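These alternation patterns can be sanity-checked locally before running them through Spark. A small illustrative check with Python's `re` module (the patterns below are abbreviated from the full lists above, for brevity):

```python
import re

# Abbreviated keyword patterns, for illustration only.
patterns = {
    "Pokemon": r"(?i)(Pok[eé]mon|Pikachu|Charizard|Eevee|Snorlax)",
    "Naruto": r"(?i)(Naruto|Sharingan|Konoha|Akatsuki)",
}

def match_topics(text):
    # return every topic whose pattern appears anywhere in the text
    return [name for name, pat in patterns.items() if re.search(pat, text)]

print(match_topics("today my pikachu plush arrived"))  # -> ['Pokemon']
print(match_topics("rewatching naruto this week"))     # -> ['Naruto']
print(match_topics("demon slayer"))                    # -> []
```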

Model used: sentimentdl_use_twitter

MODEL_NAME='sentimentdl_use_twitter'
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

patterns = {
    "OnePiece": r"(?i)(One\s?Piece|Luffy|Zoro|Roronoa|Sanji|Usopp|Tony\s?Tony\s?Chopper|Nico\s?Robin|Portgas\s?D\s?Ace|Straw\s?Hat\s?Pirates|Devil\s?Fruit)",
    "Pokemon": r"(?i)(Pok[eé]mon|Ash\s?Ketchum|Pikachu|Pichu|Bulbasaur|Charmander|Squirtle|Ivysaur|Venusaur|Charmeleon|Charizard|Wartortle|Blastoise|Raichu|Eevee|Vaporeon|Jolteon|Flareon|Snorlax|Espeon|Umbreon|Leafeon|Glaceon|Sylveon|Absol)",
    "Naruto": r"(?i)(Naruto|Sasuke\s?Uchiha|Sakura\s?Haruno|Kakashi\s?Hatake|Itachi\s?Uchiha|Hinata\s?Hyuga|Jiraiya|Orochimaru|Sharingan|Konoha|Akatsuki)",
    "OnePunchMan": r"(?i)(One\s?Punch\s?Man|Saitama|Genos|Fubuki|Hero\s?Association|Monster\s?Association)",
    "YuGiOh": r"(?i)(Yu[\s-]?Gi[\s-]?Oh|Yugi\s?Muto|Seto\s?Kaiba|Joey\s?Wheeler|Maximillion\s?Pegasus|Duel\s?Monsters|Shadow\s?Games)"
}
# Creating dummy variables
for dummy, pattern in patterns.items():
    result = result.withColumn(dummy, (col("text").rlike(pattern)).cast("int"))
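`rlike` produces a boolean per row, and `.cast("int")` turns it into a 0/1 flag. The same per-row logic, sketched in plain Python with `re.search` standing in for Spark's `rlike` (abbreviated patterns, illustration only):

```python
import re

# Abbreviated versions of two of the keyword patterns above.
patterns = {
    "OnePiece": r"(?i)(One\s?Piece|Luffy|Zoro)",
    "Pokemon": r"(?i)(Pok[eé]mon|Pikachu)",
}

def dummy_flags(text):
    # mirrors: result.withColumn(dummy, col("text").rlike(pattern).cast("int"))
    return {name: int(bool(re.search(pat, text)))
            for name, pat in patterns.items()}

print(dummy_flags("luffy vs pikachu would be a weird crossover"))
# -> {'OnePiece': 1, 'Pokemon': 1}
```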

# Show the resulting DataFrame
result.show()
23/11/19 02:27:27 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
+---------+-----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+--------------------+--------------------+--------+-------+------+-----------+------+
|subreddit|           author|   author_flair_text|        created_utc|                text|controversiality|score| parent_id|stickied|  link_id|     id|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|contain_pokemon|               token|          clean_text|OnePiece|Pokemon|Naruto|OnePunchMan|YuGiOh|
+---------+-----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+--------------------+--------------------+--------+-------+------+-----------+------+
|    anime|    DonaldJenkins|                null|2021-11-14 04:39:47|  i sent it to ya ;)|               0|    1|t1_hk0whi9|   false|t3_ov07rq|hkjr7uj|  2021-11-14|           4|           1|           11|        2021|    i sent it to ya |             6|          false|[{token, 0, 0, i,...|[{document, 0, 3,...|       0|      0|     0|          0|     0|
|    anime|         DonMo999|:MAL:https://myan...|2021-11-14 04:40:25|displate has some...|               0|    1| t3_qtgc12|   false|t3_qtgc12|hkjralc|  2021-11-14|           4|           1|           11|        2021|displate has some...|            16|          false|[{token, 0, 7, di...|[{document, 0, 74...|       0|      0|     0|          0|     0|
|    anime|   OrangeBanana38|:AMQ::STAR::AL:ht...|2021-11-14 04:41:01|that sounds like ...|               0|    3|t1_hkjq6wn|   false|t3_qryjfm|hkjrd4w|  2021-11-14|           4|           1|           11|        2021|that sounds like ...|             6|          false|[{token, 0, 3, th...|[{document, 0, 28...|       0|      0|     0|          0|     0|
|    anime|         ClBanjai|                null|2021-11-14 04:41:03|what kind of ques...|               0|    1| t3_qth8ql|   false|t3_qth8ql|hkjrdae|  2021-11-14|           4|           1|           11|        2021|what kind of ques...|            10|          false|[{token, 0, 3, wh...|[{document, 0, 25...|       0|      0|     0|          0|     0|
|    anime|      helsaabiart|                null|2021-11-14 04:42:02|today on shokugek...|               0|    4| t3_qt8p0u|   false|t3_qt8p0u|hkjrhg6|  2021-11-14|           4|           1|           11|        2021|today on shokugek...|            28|          false|[{token, 0, 4, to...|[{document, 0, 11...|       0|      1|     0|          0|     0|
|    anime|           Lezoux|:MAL:https://myan...|2021-11-14 04:42:08|   it's easy enough.|               0|    3|t1_hkjrd4w|   false|t3_qryjfm|hkjrhv3|  2021-11-14|           4|           1|           11|        2021|     its easy enough|             3|          false|[{token, 0, 1, it...|[{document, 0, 10...|       0|      0|     0|          0|     0|
|    anime|    AutoModerator|                null|2021-11-14 04:42:39|hello! if you eve...|               0|    1| t3_qti7iu|   false|t3_qti7iu|hkjrk63|  2021-11-14|           4|           1|           11|        2021|hello if you ever...|           160|          false|[{token, 0, 4, he...|[{document, 0, 11...|       0|      0|     0|          0|     0|
|    anime|    AutoModerator|                null|2021-11-14 04:42:39|hi xxcile, it see...|               0|    1| t3_qti7iu|   false|t3_qti7iu|hkjrk6q|  2021-11-14|           4|           1|           11|        2021|hi xxcile it seem...|            82|          false|[{token, 0, 1, hi...|[{document, 0, 37...|       0|      0|     0|          0|     0|
|    anime|         Terra246|                null|2021-11-14 04:42:47|i did see amagi b...|               0|    4|t1_hkjm3z4|   false|t3_qtgzu3|hkjrkr1|  2021-11-14|           4|           1|           11|        2021|i did see amagi b...|             9|          false|[{token, 0, 0, i,...|[{document, 0, 36...|       0|      0|     0|          0|     0|
|    anime|ZaphodBeebblebrox|:S3::AL:https://a...|2021-11-14 04:43:39|which is your mad...|               0|    3|t1_hkjpckv|   false|t3_qryjfm|hkjrogp|  2021-11-14|           4|           1|           11|        2021|which is your madoka|             4|          false|[{token, 0, 4, wh...|[{document, 0, 5,...|       0|      0|     0|          0|     0|
|    anime|     GreekFire242|                null|2021-11-14 04:43:47|        demon slayer|               0|    2|t1_hkjpe43|   false|t3_qtgcp8|hkjrp3s|  2021-11-14|           4|           1|           11|        2021|        demon slayer|             2|          false|[{token, 0, 4, de...|[{document, 0, 11...|       0|      0|     0|          0|     0|
|    anime|         Terra246|                null|2021-11-14 04:43:52|i mean, it is one...|               0|    2|t1_hkjpa0q|   false|t3_qtgzu3|hkjrpgf|  2021-11-14|           4|           1|           11|        2021|i mean it is one ...|             8|          false|[{token, 0, 0, i,...|[{document, 0, 14...|       0|      0|     0|          0|     0|
|    anime|     MakotoPrince|                null|2021-11-14 04:44:32|yet another good ...|               0|    3| t3_qtg0z3|   false|t3_qtg0z3|hkjrsa2|  2021-11-14|           4|           1|           11|        2021|yet another good ...|            60|          false|[{token, 0, 2, ye...|[{document, 0, 18...|       0|      0|     0|          0|     0|
|    anime|   Gryse_Blacolar|                null|2021-11-14 04:44:41|that's basically ...|               0|    2| t3_qsz91x|   false|t3_qsz91x|hkjrsvy|  2021-11-14|           4|           1|           11|        2021|thats basically s...|            15|          false|[{token, 0, 3, th...|[{document, 0, 55...|       0|      0|     0|          0|     0|
|    anime|     Junnielocked|                null|2021-11-14 04:44:47|looked up the ani...|               0|    2| t3_qt7yff|   false|t3_qt7yff|hkjrtas|  2021-11-14|           4|           1|           11|        2021|looked up the ani...|            29|          false|[{token, 0, 5, lo...|[{document, 0, 93...|       0|      0|     0|          0|     0|
|    anime|        kubabubba|                null|2021-11-14 04:45:01|      how about now?|               0|   24|t1_hkihkib|   false|t3_qt7yff|hkjrual|  2021-11-14|           4|           1|           11|        2021|       how about now|             3|          false|[{token, 0, 2, ho...|[{document, 0, -1...|       0|      0|     0|          0|     0|
|    anime|    alotmorealots|                null|2021-11-14 04:45:08|your post could d...|               0|    1| t3_qtgpcl|   false|t3_qtgpcl|hkjrutb|  2021-11-14|           4|           1|           11|        2021|your post could d...|             9|          false|[{token, 0, 3, yo...|[{document, 0, 20...|       0|      0|     0|          0|     0|
|    anime|        heimdal77|                null|2021-11-14 04:45:17|depends is it lik...|               0|    2|t1_hkj9jju|   false|t3_qtfmin|hkjrvfv|  2021-11-14|           4|           1|           11|        2021|depends is it lik...|             5|          false|[{token, 0, 6, de...|[{document, 0, 18...|       0|      0|     0|          0|     0|
|    anime|    jackofslayers|                null|2021-11-14 04:45:35|i have my own sus...|               0|    7|t1_hkjcx61|   false|t3_qt5igg|hkjrwsz|  2021-11-14|           4|           1|           11|        2021|i have my own sus...|            13|          false|[{token, 0, 0, i,...|[{document, 0, 29...|       0|      0|     0|          0|     0|
|    anime| SarcasmUndefined|                null|2021-11-14 04:45:59|looking submissiv...|               0|    4|t1_hkhunza|   false|t3_qt3ovl|hkjrykr|  2021-11-14|           4|           1|           11|        2021|looking submissiv...|             4|          false|[{token, 0, 6, lo...|[{document, 0, 24...|       0|      0|     0|          0|     0|
+---------+-----------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+--------------------+--------------------+--------+-------+------+-----------+------+
only showing top 20 rows
                                                                                
# Show results
result.select(
    "subreddit",
    "created_utc",
    "text",
    "controversiality",
    "score",
    "created_date",
    "created_hour",
    "created_week",
    "created_month",
    "created_year",
    "cleaned",
    "body_wordCount",
    "OnePiece",
    "Pokemon",
    "Naruto",
    "OnePunchMan",
    "YuGiOh"
).show()
+---------+-------------------+--------------------+----------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+--------+-------+------+-----------+------+
|subreddit|        created_utc|                text|controversiality|score|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|OnePiece|Pokemon|Naruto|OnePunchMan|YuGiOh|
+---------+-------------------+--------------------+----------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+--------+-------+------+-----------+------+
|    anime|2021-11-14 04:39:47|  i sent it to ya ;)|               0|    1|  2021-11-14|           4|           1|           11|        2021|    i sent it to ya |             6|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:40:25|displate has some...|               0|    1|  2021-11-14|           4|           1|           11|        2021|displate has some...|            16|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:41:01|that sounds like ...|               0|    3|  2021-11-14|           4|           1|           11|        2021|that sounds like ...|             6|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:41:03|what kind of ques...|               0|    1|  2021-11-14|           4|           1|           11|        2021|what kind of ques...|            10|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:42:02|today on shokugek...|               0|    4|  2021-11-14|           4|           1|           11|        2021|today on shokugek...|            28|       0|      1|     0|          0|     0|
|    anime|2021-11-14 04:42:08|   it's easy enough.|               0|    3|  2021-11-14|           4|           1|           11|        2021|     its easy enough|             3|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:42:39|hello! if you eve...|               0|    1|  2021-11-14|           4|           1|           11|        2021|hello if you ever...|           160|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:42:39|hi xxcile, it see...|               0|    1|  2021-11-14|           4|           1|           11|        2021|hi xxcile it seem...|            82|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:42:47|i did see amagi b...|               0|    4|  2021-11-14|           4|           1|           11|        2021|i did see amagi b...|             9|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:43:39|which is your mad...|               0|    3|  2021-11-14|           4|           1|           11|        2021|which is your madoka|             4|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:43:47|        demon slayer|               0|    2|  2021-11-14|           4|           1|           11|        2021|        demon slayer|             2|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:43:52|i mean, it is one...|               0|    2|  2021-11-14|           4|           1|           11|        2021|i mean it is one ...|             8|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:44:32|yet another good ...|               0|    3|  2021-11-14|           4|           1|           11|        2021|yet another good ...|            60|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:44:41|that's basically ...|               0|    2|  2021-11-14|           4|           1|           11|        2021|thats basically s...|            15|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:44:47|looked up the ani...|               0|    2|  2021-11-14|           4|           1|           11|        2021|looked up the ani...|            29|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:45:01|      how about now?|               0|   24|  2021-11-14|           4|           1|           11|        2021|       how about now|             3|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:45:08|your post could d...|               0|    1|  2021-11-14|           4|           1|           11|        2021|your post could d...|             9|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:45:17|depends is it lik...|               0|    2|  2021-11-14|           4|           1|           11|        2021|depends is it lik...|             5|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:45:35|i have my own sus...|               0|    7|  2021-11-14|           4|           1|           11|        2021|i have my own sus...|            13|       0|      0|     0|          0|     0|
|    anime|2021-11-14 04:45:59|looking submissiv...|               0|    4|  2021-11-14|           4|           1|           11|        2021|looking submissiv...|             4|       0|      0|     0|          0|     0|
+---------+-------------------+--------------------+----------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+--------+-------+------+-----------+------+
only showing top 20 rows
from pyspark.sql.functions import sum as _sum

# Calculate counts for each dummy variable
dummy_counts = result.agg(
    _sum("OnePiece").alias("OnePiece_count"),
    _sum("Pokemon").alias("Pokemon_count"),
    _sum("Naruto").alias("Naruto_count"),
    _sum("OnePunchMan").alias("OnePunchMan_count"),
    _sum("YuGiOh").alias("YuGiOh_count"),
)
# Show results
dummy_counts.show()
+--------------+-------------+------------+-----------------+------------+
|OnePiece_count|Pokemon_count|Naruto_count|OnePunchMan_count|YuGiOh_count|
+--------------+-------------+------------+-----------------+------------+
|         49421|       105686|       50357|            15037|        5193|
+--------------+-------------+------------+-----------------+------------+
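A quick sanity check on these counts: they sum to more mentions than there are matching comments (the filtered data read back later reports 210,695 rows), because a single comment can mention several series, as the multi-hot rows in the grouped output further down confirm. A minimal check using the totals printed above:

```python
# Totals copied from the dummy_counts output above
counts = {
    "OnePiece": 49421,
    "Pokemon": 105686,
    "Naruto": 50357,
    "OnePunchMan": 15037,
    "YuGiOh": 5193,
}
total_mentions = sum(counts.values())
filtered_rows = 210_695  # row count of the filtered data, reported further down

# The gap is the number of "extra" mentions contributed by
# comments that name more than one series
print(total_mentions, total_mentions - filtered_rows)  # 225694 14999
```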
                                                                                

Filter the data - using dummy variables

from pyspark.sql.functions import col

# Filter the dataset
filtered_result = result.filter(
    (col("OnePiece") == 1) |
    (col("Pokemon") == 1) |
    (col("Naruto") == 1) |
    (col("OnePunchMan") == 1) |
    (col("YuGiOh") == 1)
)
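The five OR-ed conditions above can also be collapsed into a single "any dummy is set" test; in Spark this could be written, for instance, as `greatest(*[col(c) for c in dummies]) == 1`. A minimal sketch of the same row-mask logic on a toy pandas frame (column names match the dummies, the data is made up):

```python
import pandas as pd

dummies = ["OnePiece", "Pokemon", "Naruto", "OnePunchMan", "YuGiOh"]

# Toy rows (made-up data): only rows 1 and 2 mention a tracked series
toy = pd.DataFrame({
    "OnePiece":    [0, 1, 0],
    "Pokemon":     [0, 0, 1],
    "Naruto":      [0, 0, 0],
    "OnePunchMan": [0, 0, 0],
    "YuGiOh":      [0, 0, 0],
})

mask = toy[dummies].any(axis=1)  # True when any dummy is 1
filtered = toy[mask]
print(len(filtered))  # 2
```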

# Show the results
filtered_result.select(
    "subreddit",
    "created_utc",
    "text",
    "controversiality",
    "score",
    "created_date",
    "created_hour",
    "created_week",
    "created_month",
    "created_year",
    "cleaned",
    "body_wordCount",
    "OnePiece",
    "Pokemon",
    "Naruto",
    "OnePunchMan",
    "YuGiOh"
).show()
+---------+-------------------+--------------------+----------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+--------+-------+------+-----------+------+
|subreddit|        created_utc|                text|controversiality|score|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|OnePiece|Pokemon|Naruto|OnePunchMan|YuGiOh|
+---------+-------------------+--------------------+----------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+--------+-------+------+-----------+------+
|    anime|2021-11-14 04:42:02|today on shokugek...|               0|    4|  2021-11-14|           4|           1|           11|        2021|today on shokugek...|            28|       0|      1|     0|          0|     0|
|    anime|2021-12-19 05:41:05|kubo does emotion...|               0|   16|  2021-12-19|           5|           1|           12|        2021|kubo does emotion...|            50|       1|      0|     0|          0|     0|
|    anime|2021-12-19 05:45:14|oh sweet, thanks!...|               0|    2|  2021-12-19|           5|           1|           12|        2021|oh sweet thanks t...|            17|       0|      1|     0|          0|     0|
|    anime|2021-12-19 05:48:51|furuhashi being b...|               0|   77|  2021-12-19|           5|           1|           12|        2021|furuhashi being b...|            43|       0|      1|     0|          0|     0|
|    anime|2021-12-19 05:49:59|it would’ve been ...|               0|    3|  2021-12-19|           5|           1|           12|        2021|it wouldve been m...|            80|       0|      0|     1|          0|     0|
|    anime|2021-12-19 05:50:42|main story: \n\n*...|               1|   -2|  2021-12-19|           5|           1|           12|        2021|main story fatest...|           202|       0|      1|     0|          0|     0|
|    anime|2021-12-19 05:50:54|the sakuga looks ...|               0|   30|  2021-12-19|           5|           1|           12|        2021|the sakuga looks ...|            36|       0|      1|     0|          0|     0|
|    anime|2021-12-19 05:55:17|first i ever watc...|               0|    2|  2021-12-19|           5|           1|           12|        2021|first i ever watc...|            42|       0|      0|     1|          0|     1|
|    anime|2021-10-29 02:08:20|have i talked abo...|               0|    3|  2021-10-29|           2|           6|           10|        2021|have i talked abo...|            16|       0|      1|     0|          0|     0|
|    anime|2021-10-29 02:12:53|you don't need to...|               0|    1|  2021-10-29|           2|           6|           10|        2021|you dont need to ...|            65|       0|      1|     0|          0|     0|
|    anime|2021-10-29 02:16:27|lmao i wasn’t exp...|               0|    1|  2021-10-29|           2|           6|           10|        2021|lmao i wasnt expe...|            79|       0|      1|     0|          0|     0|
|    anime|2021-10-29 02:16:49|i like watching s...|               0|    3|  2021-10-29|           2|           6|           10|        2021|i like watching s...|           221|       0|      1|     0|          0|     0|
|    anime|2021-10-29 02:22:13|this. i absolutel...|               0|    6|  2021-10-29|           2|           6|           10|        2021|this i absolutely...|            29|       0|      1|     0|          0|     0|
|    anime|2021-10-29 02:24:00|i did the trainin...|               0|    3|  2021-10-29|           2|           6|           10|        2021|i did the trainin...|           328|       0|      1|     0|          0|     0|
|    anime|2021-10-03 22:41:12|holy shit that wa...|               0|    3|  2021-10-03|          22|           1|           10|        2021|holy shit that wa...|           133|       1|      0|     0|          0|     0|
|    anime|2021-10-03 22:41:19|**general comment...|               0|    5|  2021-10-03|          22|           1|           10|        2021|general commentar...|           938|       0|      0|     1|          0|     0|
|    anime|2021-10-03 22:43:31|welp, seems like ...|               0|    1|  2021-10-03|          22|           1|           10|        2021|welp seems like w...|            78|       0|      1|     0|          0|     0|
|    anime|2021-10-03 22:49:09|yeah the pacing i...|               0|    1|  2021-10-03|          22|           1|           10|        2021|yeah the pacing i...|            52|       1|      0|     0|          0|     0|
|    anime|2021-10-03 22:52:06|&gt; out of time ...|               0|    4|  2021-10-03|          22|           1|           10|        2021|gt out of time ag...|           309|       0|      0|     1|          0|     0|
|    anime|2021-10-02 08:46:47|i don't rewatch a...|               0|    1|  2021-10-02|           8|           7|           10|        2021|i dont rewatch an...|            18|       1|      0|     0|          0|     0|
+---------+-------------------+--------------------+----------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+--------+-------+------+-----------+------+
only showing top 20 rows
                                                                                

Save the filtered data

output = "project/nlp/filtered"
my_bucket = 'sagemaker-us-east-1-216384626106'
s3_path = f"s3a://{my_bucket}/{output}"

print(f"writing filtered comments to {s3_path}")
filtered_result.write.parquet(s3_path, mode="overwrite")
writing filtered comments to s3a://sagemaker-us-east-1-216384626106/project/nlp/filtered
                                                                                

Build a Spark NLP pipeline to assign sentiment labels (positive/negative/neutral) to the filtered data

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, SentimentDLModel
from pyspark.ml import Pipeline

# Document Assembling
documentAssembler = DocumentAssembler()\
    .setInputCol("cleaned")\
    .setOutputCol("document")
    
# Embedding with Universal Sentence Encoder
use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# Sentiment Analysis (using the pre-trained sentimentdl_use_twitter model)
sentimentdl = SentimentDLModel.pretrained(name="sentimentdl_use_twitter", lang="en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

# Building the Pipeline
nlpPipeline = Pipeline(
    stages = [
        documentAssembler,
        use,
        sentimentdl
    ])
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
Download done! Loading the resource.
2023-11-19 02:36:03.668395: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-19 02:36:08.088743: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 60236800 exceeds 10% of free system memory.
[OK!]
sentimentdl_use_twitter download started this may take some time.
Approximate size to download 11.4 MB
Download done! Loading the resource.
[OK!]
# Fit and transform the data using the pipeline
nlp_model = nlpPipeline.fit(filtered_result)
processed_data = nlp_model.transform(filtered_result)

# Display results
processed_data.select(
    "cleaned", 
    "OnePiece",
    "Pokemon",
    "Naruto",
    "OnePunchMan",
    "YuGiOh",
    "sentiment.result"
).show()
+--------------------+--------+-------+------+-----------+------+----------+
|             cleaned|OnePiece|Pokemon|Naruto|OnePunchMan|YuGiOh|    result|
+--------------------+--------+-------+------+-----------+------+----------+
|today on shokugek...|       0|      1|     0|          0|     0|[positive]|
|kubo does emotion...|       1|      0|     0|          0|     0|[positive]|
|oh sweet thanks t...|       0|      1|     0|          0|     0|[positive]|
|furuhashi being b...|       0|      1|     0|          0|     0|[negative]|
|it wouldve been m...|       0|      0|     1|          0|     0|[negative]|
|main story fatest...|       0|      1|     0|          0|     0|[positive]|
|the sakuga looks ...|       0|      1|     0|          0|     0|[positive]|
|first i ever watc...|       0|      0|     1|          0|     1|[positive]|
|have i talked abo...|       0|      1|     0|          0|     0|[positive]|
|you dont need to ...|       0|      1|     0|          0|     0|[positive]|
|lmao i wasnt expe...|       0|      1|     0|          0|     0|[negative]|
|i like watching s...|       0|      1|     0|          0|     0|[positive]|
|this i absolutely...|       0|      1|     0|          0|     0|[negative]|
|i did the trainin...|       0|      1|     0|          0|     0|[negative]|
|holy shit that wa...|       1|      0|     0|          0|     0|[positive]|
|general commentar...|       0|      0|     1|          0|     0|[negative]|
|welp seems like w...|       0|      1|     0|          0|     0|[negative]|
|yeah the pacing i...|       1|      0|     0|          0|     0|[positive]|
|gt out of time ag...|       0|      0|     1|          0|     0|[negative]|
|i dont rewatch an...|       1|      0|     0|          0|     0|[positive]|
+--------------------+--------+-------+------+-----------+------+----------+
only showing top 20 rows
                                                                                
from pyspark.sql.functions import col

# Extracting sentiment value from the result array
processed_result = processed_data.withColumn(
    "sentiment", 
    col("sentiment.result").getItem(0)
)
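One detail worth noting: `getItem(0)` returns null when the annotator emits an empty annotation array, so downstream groupings may see a null sentiment. The same first-element extraction, sketched on a toy pandas frame with made-up rows:

```python
import pandas as pd

# Made-up annotation results: the last row has an empty array
toy = pd.DataFrame({"result": [["positive"], ["negative"], []]})

# First element per row, mirroring col("sentiment.result").getItem(0);
# an empty array yields a missing value (null in Spark, NaN here)
toy["sentiment"] = toy["result"].str[0]
print(toy["sentiment"].tolist())
```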

processed_result.select(
    "subreddit",
    "created_utc",
    "controversiality",
    "score",
    "created_date",
    "created_hour",
    "created_week",
    "created_month",
    "created_year",
    "cleaned",
    "body_wordCount",
    "OnePiece",
    "Pokemon",
    "Naruto",
    "OnePunchMan",
    "YuGiOh",
    "sentiment"
).show()
+---------+-------------------+----------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+--------+-------+------+-----------+------+---------+
|subreddit|        created_utc|controversiality|score|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|OnePiece|Pokemon|Naruto|OnePunchMan|YuGiOh|sentiment|
+---------+-------------------+----------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+--------+-------+------+-----------+------+---------+
|    anime|2021-11-14 04:42:02|               0|    4|  2021-11-14|           4|           1|           11|        2021|today on shokugek...|            28|       0|      1|     0|          0|     0| positive|
|    anime|2021-12-19 05:41:05|               0|   16|  2021-12-19|           5|           1|           12|        2021|kubo does emotion...|            50|       1|      0|     0|          0|     0| positive|
|    anime|2021-12-19 05:45:14|               0|    2|  2021-12-19|           5|           1|           12|        2021|oh sweet thanks t...|            17|       0|      1|     0|          0|     0| positive|
|    anime|2021-12-19 05:48:51|               0|   77|  2021-12-19|           5|           1|           12|        2021|furuhashi being b...|            43|       0|      1|     0|          0|     0| negative|
|    anime|2021-12-19 05:49:59|               0|    3|  2021-12-19|           5|           1|           12|        2021|it wouldve been m...|            80|       0|      0|     1|          0|     0| negative|
|    anime|2021-12-19 05:50:42|               1|   -2|  2021-12-19|           5|           1|           12|        2021|main story fatest...|           202|       0|      1|     0|          0|     0| positive|
|    anime|2021-12-19 05:50:54|               0|   30|  2021-12-19|           5|           1|           12|        2021|the sakuga looks ...|            36|       0|      1|     0|          0|     0| positive|
|    anime|2021-12-19 05:55:17|               0|    2|  2021-12-19|           5|           1|           12|        2021|first i ever watc...|            42|       0|      0|     1|          0|     1| positive|
|    anime|2021-10-29 02:08:20|               0|    3|  2021-10-29|           2|           6|           10|        2021|have i talked abo...|            16|       0|      1|     0|          0|     0| positive|
|    anime|2021-10-29 02:12:53|               0|    1|  2021-10-29|           2|           6|           10|        2021|you dont need to ...|            65|       0|      1|     0|          0|     0| positive|
|    anime|2021-10-29 02:16:27|               0|    1|  2021-10-29|           2|           6|           10|        2021|lmao i wasnt expe...|            79|       0|      1|     0|          0|     0| negative|
|    anime|2021-10-29 02:16:49|               0|    3|  2021-10-29|           2|           6|           10|        2021|i like watching s...|           221|       0|      1|     0|          0|     0| positive|
|    anime|2021-10-29 02:22:13|               0|    6|  2021-10-29|           2|           6|           10|        2021|this i absolutely...|            29|       0|      1|     0|          0|     0| negative|
|    anime|2021-10-29 02:24:00|               0|    3|  2021-10-29|           2|           6|           10|        2021|i did the trainin...|           328|       0|      1|     0|          0|     0| negative|
|    anime|2021-10-03 22:41:12|               0|    3|  2021-10-03|          22|           1|           10|        2021|holy shit that wa...|           133|       1|      0|     0|          0|     0| positive|
|    anime|2021-10-03 22:41:19|               0|    5|  2021-10-03|          22|           1|           10|        2021|general commentar...|           938|       0|      0|     1|          0|     0| negative|
|    anime|2021-10-03 22:43:31|               0|    1|  2021-10-03|          22|           1|           10|        2021|welp seems like w...|            78|       0|      1|     0|          0|     0| negative|
|    anime|2021-10-03 22:49:09|               0|    1|  2021-10-03|          22|           1|           10|        2021|yeah the pacing i...|            52|       1|      0|     0|          0|     0| positive|
|    anime|2021-10-03 22:52:06|               0|    4|  2021-10-03|          22|           1|           10|        2021|gt out of time ag...|           309|       0|      0|     1|          0|     0| negative|
|    anime|2021-10-02 08:46:47|               0|    1|  2021-10-02|           8|           7|           10|        2021|i dont rewatch an...|            18|       1|      0|     0|          0|     0| positive|
+---------+-------------------+----------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+--------+-------+------+-----------+------+---------+
only showing top 20 rows
                                                                                

Save the processed data

output = "project/nlp/processed"
my_bucket = 'sagemaker-us-east-1-216384626106'
s3_path = f"s3a://{my_bucket}/{output}"

print(f"writing processed comments to {s3_path}")
processed_result.write.parquet(s3_path, mode="overwrite")
writing processed comments to s3a://sagemaker-us-east-1-216384626106/project/nlp/processed
                                                                                
import sagemaker

# session = sagemaker.Session()
# bucket = session.default_bucket()
my_bucket = 'sagemaker-us-east-1-216384626106'
nlp_bucket_path = f"s3a://{my_bucket}/project/nlp/processed"

print(f"reading processed comments from {nlp_bucket_path}")
processed_result = spark.read.parquet(nlp_bucket_path)  # parquet is self-describing; no header option needed
print(f"shape of the dataframe is {processed_result.count():,}x{len(processed_result.columns)}")
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
reading processed comments from s3a://sagemaker-us-east-1-216384626106/project/nlp/processed
23/11/21 04:02:46 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
shape of the dataframe is 210,695x29
                                                                                

Aggregate the data by anime dummy variables and sentiment

from pyspark.sql.functions import count

# Grouping by the anime indicators and sentiment, then counting the number of texts
df_grouped = processed_result.groupBy(
    "OnePiece", 
    "Pokemon", 
    "Naruto", 
    "OnePunchMan", 
    "YuGiOh", 
    "sentiment"
).agg(count("*").alias("regex_text_count"))

# Showing the grouped DataFrame
df_grouped.show()
+--------+-------+------+-----------+------+---------+----------------+
|OnePiece|Pokemon|Naruto|OnePunchMan|YuGiOh|sentiment|regex_text_count|
+--------+-------+------+-----------+------+---------+----------------+
|       1|      1|     0|          0|     1| positive|              43|
|       0|      0|     0|          0|     1| negative|            1212|
|       0|      1|     0|          0|     1| negative|             244|
|       1|      0|     0|          0|     1| negative|              27|
|       1|      0|     0|          0|     1| positive|              53|
|       0|      0|     1|          1|     0| positive|             564|
|       1|      1|     1|          0|     1| positive|              65|
|       0|      0|     0|          1|     0|  neutral|             543|
|       1|      0|     0|          0|     0| negative|           10537|
|       0|      1|     0|          0|     1| positive|             530|
|       0|      1|     1|          0|     1| negative|              36|
|       0|      0|     1|          0|     1| positive|             108|
|       1|      0|     1|          0|     0| positive|            5233|
|       1|      1|     1|          1|     0| positive|              34|
|       0|      0|     1|          1|     0| negative|              92|
|       1|      0|     1|          1|     0| positive|             250|
|       1|      0|     0|          1|     0| negative|              52|
|       0|      0|     1|          0|     0| positive|           26756|
|       1|      1|     1|          1|     0| negative|               4|
|       1|      0|     1|          0|     1| positive|              55|
+--------+-------+------+-----------+------+---------+----------------+
only showing top 20 rows
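The grouped output above is in long format: one row per (dummy combination, sentiment). For eyeballing, it can be pivoted so each sentiment becomes a column; in Spark the same reshaping would use `groupBy(...).pivot("sentiment")`. A sketch on made-up counts:

```python
import pandas as pd

# Made-up long-format counts, shaped like the grouped output above
toy = pd.DataFrame({
    "anime":     ["OnePiece", "OnePiece", "Pokemon"],
    "sentiment": ["positive", "negative", "positive"],
    "n":         [5, 2, 7],
})

# One row per anime, one column per sentiment; missing cells filled with 0
wide = toy.pivot_table(index="anime", columns="sentiment", values="n", fill_value=0)
print(wide)
```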
                                                                                
from pyspark.sql.functions import col, sum as _sum

# Step 1: Count sentiments per category
sentiment_counts = processed_result.groupBy("OnePiece", "Pokemon", "Naruto", "OnePunchMan", "YuGiOh", "sentiment").agg(count("*").alias("sentiment_count"))

# Step 2: Calculate total counts per category
category_totals = sentiment_counts.groupBy("OnePiece", "Pokemon", "Naruto", "OnePunchMan", "YuGiOh").agg(_sum("sentiment_count").alias("category_total_count"))

# Step 3: Join to calculate percentages
result_with_percentage = sentiment_counts.join(category_totals, ["OnePiece", "Pokemon", "Naruto", "OnePunchMan", "YuGiOh"])

# Adding percentage column
result_with_percentage = result_with_percentage.withColumn("percentage", col("sentiment_count") / col("category_total_count"))

# Show result
result_with_percentage.select("OnePiece", "Pokemon", "Naruto", "OnePunchMan", "YuGiOh", "sentiment", "sentiment_count", "percentage").show()
+--------+-------+------+-----------+------+---------+---------------+-------------------+
|OnePiece|Pokemon|Naruto|OnePunchMan|YuGiOh|sentiment|sentiment_count|         percentage|
+--------+-------+------+-----------+------+---------+---------------+-------------------+
|       0|      1|     0|          0|     1|  neutral|             84| 0.0979020979020979|
|       0|      1|     0|          0|     1| positive|            530| 0.6177156177156177|
|       0|      1|     0|          0|     1| negative|            244|0.28438228438228436|
|       0|      1|     1|          0|     0| negative|            501| 0.3150943396226415|
|       0|      1|     1|          0|     0|  neutral|            139|0.08742138364779874|
|       0|      1|     1|          0|     0| positive|            950| 0.5974842767295597|
|       0|      1|     1|          1|     1| negative|              1|               0.25|
|       0|      1|     1|          1|     1| positive|              3|               0.75|
|       1|      0|     1|          1|     0|  neutral|             15|0.05102040816326531|
|       1|      0|     1|          1|     0| negative|             29|0.09863945578231292|
|       1|      0|     1|          1|     0| positive|            250| 0.8503401360544217|
|       1|      1|     0|          1|     1| positive|              1|                1.0|
|       1|      1|     1|          0|     0|  neutral|             36|0.09651474530831099|
|       1|      1|     1|          0|     0| negative|             90|0.24128686327077747|
|       1|      1|     1|          0|     0| positive|            247| 0.6621983914209115|
|       1|      0|     0|          1|     0|  neutral|             15|0.03614457831325301|
|       1|      0|     0|          1|     0| positive|            348| 0.8385542168674699|
|       1|      0|     0|          1|     0| negative|             52|0.12530120481927712|
|       0|      1|     0|          1|     1| positive|              3|                0.6|
|       0|      1|     0|          1|     1| negative|              2|                0.4|
+--------+-------+------+-----------+------+---------+---------------+-------------------+
only showing top 20 rows
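Steps 1-3 above compute a groupwise share via an explicit join back onto the category totals. The same percentage can be computed without the join, e.g. with a window function in Spark (`F.sum("sentiment_count").over(Window.partitionBy(...))`) or a groupby-transform in pandas; sketched here on toy data:

```python
import pandas as pd

# Made-up per-(category, sentiment) counts
toy = pd.DataFrame({
    "category":        ["A", "A", "B"],
    "sentiment":       ["positive", "negative", "positive"],
    "sentiment_count": [3, 1, 2],
})

# groupby-transform plays the role of the join on category totals
totals = toy.groupby("category")["sentiment_count"].transform("sum")
toy["percentage"] = toy["sentiment_count"] / totals
print(toy["percentage"].tolist())  # [0.75, 0.25, 1.0]
```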
                                                                                

Save the grouped data to CSV

# Define the path where you want to save the CSV
output_path = "grouped_category_sentiment.csv"

# Save the DataFrame as a CSV
df_grouped.write.csv(output_path, header=True, mode="overwrite")
                                                                                
# Define the path where you want to save the CSV
output_path = "result_with_percentage_category_sentiment.csv"

# Save the DataFrame as a CSV
result_with_percentage.write.csv(output_path, header=True, mode="overwrite")
                                                                                
# !pip install plotly
Requirement already satisfied: plotly in /opt/conda/lib/python3.10/site-packages (5.9.0)
Requirement already satisfied: tenacity>=6.2.0 in /opt/conda/lib/python3.10/site-packages (from plotly) (8.0.1)

import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"
from pyspark.sql import functions as F

# Group by time interval and sentiment, and count occurrences for each anime series
time_agg = processed_result.groupBy("created_year", "created_month", "sentiment").agg(
    F.sum("OnePiece").alias("OnePiece_Count"),
    F.sum("Pokemon").alias("Pokemon_Count"),
    F.sum("Naruto").alias("Naruto_Count"),
    F.sum("OnePunchMan").alias("OnePunchMan_Count"),
    F.sum("YuGiOh").alias("YuGiOh_Count")
)

# Convert to Pandas DataFrame for plotting
pandas_time_agg = time_agg.toPandas()
                                                                                
pandas_time_agg.to_csv('pandas_time_agg.csv', index=False)

Plot: Time Series of Sentiment Analysis Across Anime Series

import plotly.graph_objects as go
import pandas as pd

def create_time_series_scatter_connected(df, anime_column, title):
    fig = go.Figure()

    # Define custom colors for sentiments
    colors = {'positive': '#42a63c', 'neutral': '#42a1b9', 'negative': '#d13a47'}

    for sentiment in df['sentiment'].unique():
        # Filter data for each sentiment and sort by date
        sentiment_df = df[df['sentiment'] == sentiment].copy()
        sentiment_df['date'] = pd.to_datetime(sentiment_df['created_year'].astype(str) + '-' + sentiment_df['created_month'].astype(str))
        sentiment_df.sort_values(by='date', inplace=True)

        fig.add_trace(go.Scatter(
            x=sentiment_df['date'],
            y=sentiment_df[anime_column],
            mode='lines+markers',
            name=sentiment,
            line=dict(color=colors[sentiment]),
            marker=dict(color=colors[sentiment])
        ))

    fig.update_layout(
        title=title,
        xaxis_title='Date',
        yaxis_title='Count',
        xaxis=dict(
            rangeselector=dict(
                buttons=list([
                    dict(count=1, label='1m', step='month', stepmode='backward'),
                    dict(count=6, label='6m', step='month', stepmode='backward'),
                    dict(step='all')
                ])
            ),
            type='date'
        )
    )
    return fig

# Creating connected scatter plots for each anime series
fig_onepiece_connected = create_time_series_scatter_connected(pandas_time_agg, 'OnePiece_Count', 'One Piece Sentiment Over Time')
fig_pokemon_connected = create_time_series_scatter_connected(pandas_time_agg, 'Pokemon_Count', 'Pokemon Sentiment Over Time')
fig_onepunchman_connected = create_time_series_scatter_connected(pandas_time_agg, 'OnePunchMan_Count', 'One Punch Man Sentiment Over Time')
fig_naruto_connected = create_time_series_scatter_connected(pandas_time_agg, 'Naruto_Count', 'Naruto Sentiment Over Time')
fig_yugioh_connected = create_time_series_scatter_connected(pandas_time_agg, 'YuGiOh_Count', 'YuGiOh Sentiment Over Time')

# Display the connected scatter plots
fig_onepiece_connected.show()
fig_pokemon_connected.show()
fig_onepunchman_connected.show()
fig_naruto_connected.show()
fig_yugioh_connected.show()
import pandas as pd

# Load the data
file_path = '../../data/csv/grouped_category_sentiment.csv'
data = pd.read_csv(file_path)
data
OnePiece Pokemon Naruto OnePunchMan YuGiOh sentiment regex_text_count
0 1 1 0 0 1 positive 43
1 0 0 0 0 1 negative 1212
2 0 1 0 0 1 negative 244
3 1 0 0 0 1 negative 27
4 1 0 0 0 1 positive 53
... ... ... ... ... ... ... ...
76 0 1 1 1 1 negative 1
77 1 0 0 1 1 negative 2
78 0 1 0 1 1 positive 3
79 1 0 1 0 1 neutral 5
80 0 0 1 1 1 negative 1

81 rows × 7 columns

Aggregate the data and generate a table with percentages

# Aggregate the regex_text_count for each sentiment for each anime series
aggregated_data = data.groupby('sentiment').agg({
    'OnePiece': lambda x: data.loc[x.index, 'regex_text_count'][x == 1].sum(),
    'Pokemon': lambda x: data.loc[x.index, 'regex_text_count'][x == 1].sum(),
    'Naruto': lambda x: data.loc[x.index, 'regex_text_count'][x == 1].sum(),
    'OnePunchMan': lambda x: data.loc[x.index, 'regex_text_count'][x == 1].sum(),
    'YuGiOh': lambda x: data.loc[x.index, 'regex_text_count'][x == 1].sum()
}).reset_index()
# Calculate the percentage for each sentiment in each anime series
for anime in ['OnePiece', 'Pokemon', 'Naruto', 'OnePunchMan', 'YuGiOh']:
    total_count = aggregated_data[anime].sum()
    aggregated_data[anime + '_Perc (%)'] = (aggregated_data[anime] / total_count) * 100
aggregated_data
sentiment OnePiece Pokemon Naruto OnePunchMan YuGiOh OnePiece_Perc (%) Pokemon_Perc (%) Naruto_Perc (%) OnePunchMan_Perc (%) YuGiOh_Perc (%)
0 negative 12461 39728 12535 2726 1616 25.213978 37.590599 24.892269 18.128616 31.118814
1 neutral 3367 7497 3373 652 396 6.812893 7.093655 6.698175 4.335971 7.625650
2 positive 33593 58461 34449 11659 3181 67.973129 55.315747 68.409556 77.535413 61.255536
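The lambda-based `groupby` above repeats the same expression for every column. Since each series column is a 0/1 indicator, an equivalent (and arguably clearer) formulation multiplies the indicators by `regex_text_count` and sums within each sentiment group; a sketch, assuming the same column names as the loaded CSV:

```python
import pandas as pd

series_cols = ['OnePiece', 'Pokemon', 'Naruto', 'OnePunchMan', 'YuGiOh']

def aggregate_counts(data):
    # Weight each 0/1 indicator by the row's regex_text_count,
    # then sum the weighted counts within each sentiment group.
    weighted = data[series_cols].multiply(data['regex_text_count'], axis=0)
    return weighted.groupby(data['sentiment']).sum().reset_index()
```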

Plot: Grouped Bar Chart of Sentiment Percentages Across Anime Series

import plotly.graph_objects as go

# Define custom colors for each anime series
custom_colors = ['#42a63c', '#42a1b9', '#d13a47', '#f7c200', '#967bb6']

# Creating Grouped Bar Chart using percentages
fig_grouped_bar_perc = go.Figure()

# Adding a bar for each anime series using percentages with custom colors
for idx, anime in enumerate(['OnePiece_Perc (%)', 'Pokemon_Perc (%)', 'Naruto_Perc (%)', 'OnePunchMan_Perc (%)', 'YuGiOh_Perc (%)']):
    fig_grouped_bar_perc.add_trace(go.Bar(
        x=aggregated_data['sentiment'],
        y=aggregated_data[anime],
        name=anime.split('_')[0],  # Strip the '_Perc (%)' suffix for the legend
        marker_color=custom_colors[idx]  # Apply custom color
    ))

# Updating layout for Grouped Bar Chart with Percentages
fig_grouped_bar_perc.update_layout(
    title="Grouped Bar Chart of Sentiment Percentages Across Anime Series",
    xaxis_title="Sentiment",
    yaxis_title="Percentage",
    barmode='group'
)

# Show the Grouped Bar Chart with Percentages
fig_grouped_bar_perc.show()

Plot: Grouped Bar Chart of Sentiment Counts Across Anime Series

# Creating Grouped Bar Chart
fig_grouped_bar = go.Figure()

# Adding a bar for each anime series
for idx, anime in enumerate(['OnePiece', 'Pokemon', 'Naruto', 'OnePunchMan', 'YuGiOh']):
    fig_grouped_bar.add_trace(go.Bar(
        x=aggregated_data['sentiment'],
        y=aggregated_data[anime],
        name=anime,
        marker_color=custom_colors[idx]
    ))

# Updating layout for Grouped Bar Chart
fig_grouped_bar.update_layout(
    title="Grouped Bar Chart of Sentiment Counts Across Anime Series",
    xaxis_title="Sentiment",
    yaxis_title="Regex Text Count",
    barmode='group'
)

# Show the Grouped Bar Chart
fig_grouped_bar.show()

Plot: Bubble Plot of Sentiment Percentages Across All Anime Series

import plotly.graph_objects as go

# Assuming aggregated_data is your DataFrame with sentiment percentages
fig_bubble_all_color = go.Figure()

# Custom colors for each anime series
bubble_colors = ['#42a63c', '#42a1b9', '#d13a47', '#f7c200', '#967bb6']

# Iterating over each sentiment to add bubbles for each anime series with custom colors
for i, sentiment in enumerate(aggregated_data['sentiment'].unique()):
    sentiment_data = aggregated_data[aggregated_data['sentiment'] == sentiment]

    for idx, anime in enumerate(['OnePiece_Perc (%)', 'Pokemon_Perc (%)', 'Naruto_Perc (%)', 'OnePunchMan_Perc (%)', 'YuGiOh_Perc (%)']):
        fig_bubble_all_color.add_trace(go.Scatter(
            x=[sentiment],
            y=[sentiment_data[anime].values[0]],
            mode='markers',
            marker=dict(
                size=sentiment_data[anime].values[0],
                sizemode='diameter',
                sizeref=100 / 100**2,  # Adjust the size reference as needed
                sizemin=4,
                color=bubble_colors[idx]  # Apply custom color
            ),
            name=anime.split('_')[0],  # Strip the '_Perc (%)' suffix for the legend
            legendgroup=anime,         # Group traces so each series toggles together
            showlegend=(i == 0)        # Show each series only once in the legend
        ))

# Update layout
fig_bubble_all_color.update_layout(
    title='Bubble Plot of Sentiment Percentages Across All Anime Series',
    xaxis_title="Sentiment",
    yaxis_title="Percentage",
    xaxis={'type': 'category'}  # Setting x-axis as category for sentiment labels
)

# Display the bubble plot
fig_bubble_all_color.show()

Plot: Line Plot of Sentiment Percentages Across Anime Series

fig_line_perc = go.Figure()

custom_colors = ['#42a63c', '#42a1b9', '#d13a47', '#f7c200', '#967bb6']

# Adding a line for each anime series
for idx, anime in enumerate(['OnePiece_Perc (%)', 'Pokemon_Perc (%)', 'Naruto_Perc (%)', 'OnePunchMan_Perc (%)', 'YuGiOh_Perc (%)']):
    fig_line_perc.add_trace(go.Scatter(
        x=aggregated_data['sentiment'],
        y=aggregated_data[anime],
        mode='lines+markers',
        name=anime.split('_')[0],  # Strip the '_Perc (%)' suffix for the legend
        marker_color=custom_colors[idx]
    ))

# Update the layout
fig_line_perc.update_layout(
    title="Line Plot of Sentiment Percentages Across Anime Series",
    xaxis_title="Sentiment",
    yaxis_title="Percentage"
)

# Display the plot
fig_line_perc.show()

Plot: Area Chart of Sentiment Percentages Across Anime Series

fig_area_perc = go.Figure()

# Adding an area trace for each anime series using percentages
for idx, anime in enumerate(['OnePiece_Perc (%)', 'Pokemon_Perc (%)', 'Naruto_Perc (%)', 'OnePunchMan_Perc (%)', 'YuGiOh_Perc (%)']):
    fig_area_perc.add_trace(go.Scatter(
        x=aggregated_data['sentiment'],
        y=aggregated_data[anime],
        mode='lines',
        fill='tozeroy',
        name=anime.split('_')[0],  # Strip the '_Perc (%)' suffix for the legend
        marker_color=custom_colors[idx]
    ))

# Update the layout
fig_area_perc.update_layout(
    title="Area Chart of Sentiment Percentages Across Anime Series",
    xaxis_title="Sentiment",
    yaxis_title="Percentage"
)

# Display the area plot
fig_area_perc.show()

Plot: Radar Chart of Sentiment Percentages Across Anime Series

fig_radar_perc = go.Figure()

# Iterating through each anime series to add to the radar chart
for idx, anime in enumerate(['OnePiece_Perc (%)', 'Pokemon_Perc (%)', 'Naruto_Perc (%)', 'OnePunchMan_Perc (%)', 'YuGiOh_Perc (%)']):
    fig_radar_perc.add_trace(go.Scatterpolar(
        r=aggregated_data[anime],
        theta=aggregated_data['sentiment'],
        fill='toself',
        name=anime.split('_')[0],  # Strip the '_Perc (%)' suffix for the legend
        marker_color=custom_colors[idx]
    ))

# Update the layout for the radar chart
fig_radar_perc.update_layout(
    title='Radar Chart of Sentiment Percentages Across Anime Series',
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 100]  # Since it's percentage, the range is from 0 to 100
        )
    ),
    showlegend=True
)

fig_radar_perc.show()

Plot: Sentiment Distribution Across Anime Series

import plotly.express as px
import pandas as pd

# Define custom colors for sentiments
color_map = {'positive': '#42a63c', 'neutral': '#42a1b9', 'negative': '#d13a47'}

# Reshaping the DataFrame for Sunburst Chart
sunburst_data = pd.melt(aggregated_data, id_vars=['sentiment'], 
                        value_vars=['OnePiece', 'Pokemon', 'Naruto', 
                                    'OnePunchMan', 'YuGiOh'],
                        var_name='Anime', value_name='Count')

# Creating Sunburst Chart with custom colors
fig_sunburst = px.sunburst(
    sunburst_data, 
    path=['Anime', 'sentiment'], 
    values='Count',
    color='sentiment', 
    color_discrete_map=color_map,
    title='Sentiment Distribution Across Anime Series'
)

fig_sunburst.show()
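The `pd.melt` call above unpivots the wide aggregate into one row per (Anime, sentiment) pair, which is the long format `px.sunburst` expects for its `path` argument. A tiny self-contained illustration of that reshape, using made-up counts:

```python
import pandas as pd

wide = pd.DataFrame({
    'sentiment': ['negative', 'positive'],
    'OnePiece': [3, 10],
    'Pokemon': [2, 7],
})

# Each anime column becomes rows under 'Anime', with its values under 'Count'
long = pd.melt(wide, id_vars=['sentiment'],
               value_vars=['OnePiece', 'Pokemon'],
               var_name='Anime', value_name='Count')
# long has 4 rows: (negative, OnePiece, 3), (positive, OnePiece, 10), ...
```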
import plotly.express as px
import pandas as pd

# Define custom colors for anime series
anime_color_map = {
    'OnePiece': '#42a63c',
    'Pokemon': '#42a1b9',
    'Naruto': '#d13a47',
    'OnePunchMan': '#f7c200',
    'YuGiOh': '#967bb6' 
}

# Creating Sunburst Chart with custom colors for the inner ring (anime series)
fig_sunburst = px.sunburst(
    sunburst_data, 
    path=['Anime', 'sentiment'], 
    values='Count',
    color='Anime',  # Change this to 'Anime' to color based on anime series
    color_discrete_map=anime_color_map,
    title='Sentiment Distribution Across Anime Series'
)

fig_sunburst.show()