Code: NLP-Topic 6

Set up

# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.4.0

# install spark-nlp
%pip install spark-nlp==5.1.3
# note: the separate `sparknlp` PyPI package is an unrelated 1.0.0 stub;
# the spark-nlp wheel above already provides the `sparknlp` Python module
%pip install sparknlp
# install plotly
%pip install plotly

# install yfinance for external data
%pip install yfinance

# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 23.3.1
  latest version: 23.10.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.10.0



## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - openjdk


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2023.08.22 |       h06a4308_0         123 KB
    certifi-2023.7.22          |  py310h06a4308_0         153 KB
    openjdk-11.0.13            |       h87a67e3_0       341.0 MB
    ------------------------------------------------------------
                                           Total:       341.3 MB

The following NEW packages will be INSTALLED:

  openjdk            pkgs/main/linux-64::openjdk-11.0.13-h87a67e3_0 

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2023.7.2~ --> pkgs/main::ca-certificates-2023.08.22-h06a4308_0 

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            conda-forge/noarch::certifi-2023.7.22~ --> pkgs/main/linux-64::certifi-2023.7.22-py310h06a4308_0 



Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.
Collecting pyspark==3.4.0
  Using cached pyspark-3.4.0-py2.py3-none-any.whl
Collecting py4j==0.10.9.7 (from pyspark==3.4.0)
  Using cached py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.7 pyspark-3.4.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Collecting spark-nlp==5.1.3
  Obtaining dependency information for spark-nlp==5.1.3 from https://files.pythonhosted.org/packages/cd/7d/bc0eca4c9ec4c9c1d9b28c42c2f07942af70980a7d912d0aceebf8db32dd/spark_nlp-5.1.3-py2.py3-none-any.whl.metadata
  Using cached spark_nlp-5.1.3-py2.py3-none-any.whl.metadata (53 kB)
Using cached spark_nlp-5.1.3-py2.py3-none-any.whl (537 kB)
Installing collected packages: spark-nlp
Successfully installed spark-nlp-5.1.3
Note: you may need to restart the kernel to use updated packages.
Collecting sparknlp
  Using cached sparknlp-1.0.0-py3-none-any.whl (1.4 kB)
Requirement already satisfied: spark-nlp in /opt/conda/lib/python3.10/site-packages (from sparknlp) (5.1.3)
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from sparknlp) (1.26.0)
Installing collected packages: sparknlp
Successfully installed sparknlp-1.0.0
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: plotly in /opt/conda/lib/python3.10/site-packages (5.9.0)
Requirement already satisfied: tenacity>=6.2.0 in /opt/conda/lib/python3.10/site-packages (from plotly) (8.0.1)
Note: you may need to restart the kernel to use updated packages.
Collecting yfinance
  Obtaining dependency information for yfinance from https://files.pythonhosted.org/packages/1c/19/bf19123baf16a55fd38cbb100b5a49380b9b6db7279987034689d11254c7/yfinance-0.2.32-py2.py3-none-any.whl.metadata
  Using cached yfinance-0.2.32-py2.py3-none-any.whl.metadata (11 kB)
Requirement already satisfied: pandas>=1.3.0 in /opt/conda/lib/python3.10/site-packages (from yfinance) (1.4.4)
Requirement already satisfied: numpy>=1.16.5 in /opt/conda/lib/python3.10/site-packages (from yfinance) (1.26.0)
Requirement already satisfied: requests>=2.31 in /opt/conda/lib/python3.10/site-packages (from yfinance) (2.31.0)
Collecting multitasking>=0.0.7 (from yfinance)
  Using cached multitasking-0.0.11-py3-none-any.whl (8.5 kB)
Requirement already satisfied: lxml>=4.9.1 in /opt/conda/lib/python3.10/site-packages (from yfinance) (4.9.3)
Requirement already satisfied: appdirs>=1.4.4 in /opt/conda/lib/python3.10/site-packages (from yfinance) (1.4.4)
Collecting pytz>=2022.5 (from yfinance)
  Obtaining dependency information for pytz>=2022.5 from https://files.pythonhosted.org/packages/32/4d/aaf7eff5deb402fd9a24a1449a8119f00d74ae9c2efa79f8ef9994261fc2/pytz-2023.3.post1-py2.py3-none-any.whl.metadata
  Using cached pytz-2023.3.post1-py2.py3-none-any.whl.metadata (22 kB)
Collecting frozendict>=2.3.4 (from yfinance)
  Using cached frozendict-2.3.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (115 kB)
Collecting peewee>=3.16.2 (from yfinance)
  Using cached peewee-3.17.0-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: beautifulsoup4>=4.11.1 in /opt/conda/lib/python3.10/site-packages (from yfinance) (4.11.1)
Collecting html5lib>=1.1 (from yfinance)
  Using cached html5lib-1.1-py2.py3-none-any.whl (112 kB)
Requirement already satisfied: soupsieve>1.2 in /opt/conda/lib/python3.10/site-packages (from beautifulsoup4>=4.11.1->yfinance) (2.3.1)
Requirement already satisfied: six>=1.9 in /opt/conda/lib/python3.10/site-packages (from html5lib>=1.1->yfinance) (1.16.0)
Requirement already satisfied: webencodings in /opt/conda/lib/python3.10/site-packages (from html5lib>=1.1->yfinance) (0.5.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.10/site-packages (from pandas>=1.3.0->yfinance) (2.8.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests>=2.31->yfinance) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests>=2.31->yfinance) (3.3)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests>=2.31->yfinance) (2.0.6)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests>=2.31->yfinance) (2023.7.22)
Using cached yfinance-0.2.32-py2.py3-none-any.whl (68 kB)
Using cached pytz-2023.3.post1-py2.py3-none-any.whl (502 kB)
Installing collected packages: pytz, peewee, multitasking, html5lib, frozendict, yfinance
  Attempting uninstall: pytz
    Found existing installation: pytz 2022.1
    Uninstalling pytz-2022.1:
      Successfully uninstalled pytz-2022.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spyder 5.3.3 requires pyqt5<5.16, which is not installed.
spyder 5.3.3 requires pyqtwebengine<5.16, which is not installed.
jupyterlab-server 2.10.3 requires jupyter-server~=1.4, but you have jupyter-server 2.7.3 which is incompatible.
sagemaker-datawrangler 0.4.3 requires sagemaker-data-insights==0.4.0, but you have sagemaker-data-insights 0.3.3 which is incompatible.
spyder 5.3.3 requires ipython<8.0.0,>=7.31.1, but you have ipython 8.16.1 which is incompatible.
spyder 5.3.3 requires pylint<3.0,>=2.5.0, but you have pylint 3.0.1 which is incompatible.
Successfully installed frozendict-2.3.8 html5lib-1.1 multitasking-0.0.11 peewee-3.17.0 pytz-2023.3.post1 yfinance-0.2.32
Note: you may need to restart the kernel to use updated packages.
# download the Spark NLP fat jar as a local fallback (the SparkSession below also resolves spark-nlp via Maven)
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.1.3.jar
--2023-11-20 01:09:26--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.1.3.jar
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.165.152, 52.217.134.0, 52.216.209.32, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.165.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 708534094 (676M) [application/java-archive]
Saving to: ‘spark-nlp-assembly-5.1.3.jar’

spark-nlp-assembly- 100%[===================>] 675.71M  94.4MB/s    in 7.1s    

2023-11-20 01:09:38 (95.3 MB/s) - ‘spark-nlp-assembly-5.1.3.jar’ saved [708534094/708534094]
## Import packages
import json
import sparknlp
import numpy as np
import pandas as pd
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import (
    mean, stddev, max, min, count, percentile_approx, year, month, dayofmonth,
    ceil, col, dayofweek, hour, explode, date_format, lower, size, split,
    regexp_replace, isnan, when,
)
from py4j.java_gateway import java_import
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns

spark = SparkSession.builder \
    .appName("Spark NLP")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3,org.apache.hadoop:hadoop-aws:3.2.2")\
    .config(
        # Hadoop options need the "spark.hadoop." prefix to be forwarded to the
        # Hadoop configuration; without it Spark warns "Ignoring non-Spark
        # config property" and silently drops the setting
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )\
    .getOrCreate()

print(f"Spark version: {spark.version}")
print(f"sparknlp version: {sparknlp.version()}")
:: loading settings :: url = jar:file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-4e368465-39ec-476f-a5a5-a031f862515f;1.0
    confs: [default]
    found com.johnsnowlabs.nlp#spark-nlp_2.12;5.1.3 in central
    found com.typesafe#config;1.4.2 in central
    found org.rocksdb#rocksdbjni;6.29.5 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.828 in central
    found com.github.universal-automata#liblevenshtein;3.0.0 in central
    found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
    found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
    found com.google.code.gson#gson;2.3 in central
    found it.unimi.dsi#fastutil;7.0.12 in central
    found org.projectlombok#lombok;1.16.8 in central
    found com.google.cloud#google-cloud-storage;2.20.1 in central
    found com.google.guava#guava;31.1-jre in central
    found com.google.guava#failureaccess;1.0.1 in central
    found com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava in central
    found com.google.errorprone#error_prone_annotations;2.18.0 in central
    found com.google.j2objc#j2objc-annotations;1.3 in central
    found com.google.http-client#google-http-client;1.43.0 in central
    found io.opencensus#opencensus-contrib-http-util;0.31.1 in central
    found com.google.http-client#google-http-client-jackson2;1.43.0 in central
    found com.google.http-client#google-http-client-gson;1.43.0 in central
    found com.google.api-client#google-api-client;2.2.0 in central
    found commons-codec#commons-codec;1.15 in central
    found com.google.oauth-client#google-oauth-client;1.34.1 in central
    found com.google.http-client#google-http-client-apache-v2;1.43.0 in central
    found com.google.apis#google-api-services-storage;v1-rev20220705-2.0.0 in central
    found com.google.code.gson#gson;2.10.1 in central
    found com.google.cloud#google-cloud-core;2.12.0 in central
    found io.grpc#grpc-context;1.53.0 in central
    found com.google.auto.value#auto-value-annotations;1.10.1 in central
    found com.google.auto.value#auto-value;1.10.1 in central
    found javax.annotation#javax.annotation-api;1.3.2 in central
    found commons-logging#commons-logging;1.2 in central
    found com.google.cloud#google-cloud-core-http;2.12.0 in central
    found com.google.http-client#google-http-client-appengine;1.43.0 in central
    found com.google.api#gax-httpjson;0.108.2 in central
    found com.google.cloud#google-cloud-core-grpc;2.12.0 in central
    found io.grpc#grpc-alts;1.53.0 in central
    found io.grpc#grpc-grpclb;1.53.0 in central
    found org.conscrypt#conscrypt-openjdk-uber;2.5.2 in central
    found io.grpc#grpc-auth;1.53.0 in central
    found io.grpc#grpc-protobuf;1.53.0 in central
    found io.grpc#grpc-protobuf-lite;1.53.0 in central
    found io.grpc#grpc-core;1.53.0 in central
    found com.google.api#gax;2.23.2 in central
    found com.google.api#gax-grpc;2.23.2 in central
    found com.google.auth#google-auth-library-credentials;1.16.0 in central
    found com.google.auth#google-auth-library-oauth2-http;1.16.0 in central
    found com.google.api#api-common;2.6.2 in central
    found io.opencensus#opencensus-api;0.31.1 in central
    found com.google.api.grpc#proto-google-iam-v1;1.9.2 in central
    found com.google.protobuf#protobuf-java;3.21.12 in central
    found com.google.protobuf#protobuf-java-util;3.21.12 in central
    found com.google.api.grpc#proto-google-common-protos;2.14.2 in central
    found org.threeten#threetenbp;1.6.5 in central
    found com.google.api.grpc#proto-google-cloud-storage-v2;2.20.1-alpha in central
    found com.google.api.grpc#grpc-google-cloud-storage-v2;2.20.1-alpha in central
    found com.google.api.grpc#gapic-google-cloud-storage-v2;2.20.1-alpha in central
    found com.fasterxml.jackson.core#jackson-core;2.14.2 in central
    found com.google.code.findbugs#jsr305;3.0.2 in central
    found io.grpc#grpc-api;1.53.0 in central
    found io.grpc#grpc-stub;1.53.0 in central
    found org.checkerframework#checker-qual;3.31.0 in central
    found io.perfmark#perfmark-api;0.26.0 in central
    found com.google.android#annotations;4.1.1.4 in central
    found org.codehaus.mojo#animal-sniffer-annotations;1.22 in central
    found io.opencensus#opencensus-proto;0.2.0 in central
    found io.grpc#grpc-services;1.53.0 in central
    found com.google.re2j#re2j;1.6 in central
    found io.grpc#grpc-netty-shaded;1.53.0 in central
    found io.grpc#grpc-googleapis;1.53.0 in central
    found io.grpc#grpc-xds;1.53.0 in central
    found com.navigamez#greex;1.0 in central
    found dk.brics.automaton#automaton;1.11-8 in central
    found com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.4.4 in central
    found com.microsoft.onnxruntime#onnxruntime;1.15.0 in central
    found org.apache.hadoop#hadoop-aws;3.2.2 in central
:: resolution report :: resolve 3784ms :: artifacts dl 441ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.828 from central in [default]
    com.fasterxml.jackson.core#jackson-core;2.14.2 from central in [default]
    com.github.universal-automata#liblevenshtein;3.0.0 from central in [default]
    com.google.android#annotations;4.1.1.4 from central in [default]
    com.google.api#api-common;2.6.2 from central in [default]
    com.google.api#gax;2.23.2 from central in [default]
    com.google.api#gax-grpc;2.23.2 from central in [default]
    com.google.api#gax-httpjson;0.108.2 from central in [default]
    com.google.api-client#google-api-client;2.2.0 from central in [default]
    com.google.api.grpc#gapic-google-cloud-storage-v2;2.20.1-alpha from central in [default]
    com.google.api.grpc#grpc-google-cloud-storage-v2;2.20.1-alpha from central in [default]
    com.google.api.grpc#proto-google-cloud-storage-v2;2.20.1-alpha from central in [default]
    com.google.api.grpc#proto-google-common-protos;2.14.2 from central in [default]
    com.google.api.grpc#proto-google-iam-v1;1.9.2 from central in [default]
    com.google.apis#google-api-services-storage;v1-rev20220705-2.0.0 from central in [default]
    com.google.auth#google-auth-library-credentials;1.16.0 from central in [default]
    com.google.auth#google-auth-library-oauth2-http;1.16.0 from central in [default]
    com.google.auto.value#auto-value;1.10.1 from central in [default]
    com.google.auto.value#auto-value-annotations;1.10.1 from central in [default]
    com.google.cloud#google-cloud-core;2.12.0 from central in [default]
    com.google.cloud#google-cloud-core-grpc;2.12.0 from central in [default]
    com.google.cloud#google-cloud-core-http;2.12.0 from central in [default]
    com.google.cloud#google-cloud-storage;2.20.1 from central in [default]
    com.google.code.findbugs#jsr305;3.0.2 from central in [default]
    com.google.code.gson#gson;2.10.1 from central in [default]
    com.google.errorprone#error_prone_annotations;2.18.0 from central in [default]
    com.google.guava#failureaccess;1.0.1 from central in [default]
    com.google.guava#guava;31.1-jre from central in [default]
    com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava from central in [default]
    com.google.http-client#google-http-client;1.43.0 from central in [default]
    com.google.http-client#google-http-client-apache-v2;1.43.0 from central in [default]
    com.google.http-client#google-http-client-appengine;1.43.0 from central in [default]
    com.google.http-client#google-http-client-gson;1.43.0 from central in [default]
    com.google.http-client#google-http-client-jackson2;1.43.0 from central in [default]
    com.google.j2objc#j2objc-annotations;1.3 from central in [default]
    com.google.oauth-client#google-oauth-client;1.34.1 from central in [default]
    com.google.protobuf#protobuf-java;3.21.12 from central in [default]
    com.google.protobuf#protobuf-java-util;3.21.12 from central in [default]
    com.google.re2j#re2j;1.6 from central in [default]
    com.johnsnowlabs.nlp#spark-nlp_2.12;5.1.3 from central in [default]
    com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.4.4 from central in [default]
    com.microsoft.onnxruntime#onnxruntime;1.15.0 from central in [default]
    com.navigamez#greex;1.0 from central in [default]
    com.typesafe#config;1.4.2 from central in [default]
    commons-codec#commons-codec;1.15 from central in [default]
    commons-logging#commons-logging;1.2 from central in [default]
    dk.brics.automaton#automaton;1.11-8 from central in [default]
    io.grpc#grpc-alts;1.53.0 from central in [default]
    io.grpc#grpc-api;1.53.0 from central in [default]
    io.grpc#grpc-auth;1.53.0 from central in [default]
    io.grpc#grpc-context;1.53.0 from central in [default]
    io.grpc#grpc-core;1.53.0 from central in [default]
    io.grpc#grpc-googleapis;1.53.0 from central in [default]
    io.grpc#grpc-grpclb;1.53.0 from central in [default]
    io.grpc#grpc-netty-shaded;1.53.0 from central in [default]
    io.grpc#grpc-protobuf;1.53.0 from central in [default]
    io.grpc#grpc-protobuf-lite;1.53.0 from central in [default]
    io.grpc#grpc-services;1.53.0 from central in [default]
    io.grpc#grpc-stub;1.53.0 from central in [default]
    io.grpc#grpc-xds;1.53.0 from central in [default]
    io.opencensus#opencensus-api;0.31.1 from central in [default]
    io.opencensus#opencensus-contrib-http-util;0.31.1 from central in [default]
    io.opencensus#opencensus-proto;0.2.0 from central in [default]
    io.perfmark#perfmark-api;0.26.0 from central in [default]
    it.unimi.dsi#fastutil;7.0.12 from central in [default]
    javax.annotation#javax.annotation-api;1.3.2 from central in [default]
    org.apache.hadoop#hadoop-aws;3.2.2 from central in [default]
    org.checkerframework#checker-qual;3.31.0 from central in [default]
    org.codehaus.mojo#animal-sniffer-annotations;1.22 from central in [default]
    org.conscrypt#conscrypt-openjdk-uber;2.5.2 from central in [default]
    org.projectlombok#lombok;1.16.8 from central in [default]
    org.rocksdb#rocksdbjni;6.29.5 from central in [default]
    org.threeten#threetenbp;1.6.5 from central in [default]
    :: evicted modules:
    com.google.protobuf#protobuf-java-util;3.0.0-beta-3 by [com.google.protobuf#protobuf-java-util;3.21.12] in [default]
    com.google.protobuf#protobuf-java;3.0.0-beta-3 by [com.google.protobuf#protobuf-java;3.21.12] in [default]
    com.google.code.gson#gson;2.3 by [com.google.code.gson#gson;2.10.1] in [default]
    com.amazonaws#aws-java-sdk-bundle;1.11.563 by [com.amazonaws#aws-java-sdk-bundle;1.11.828] in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   77  |   0   |   0   |   4   ||   73  |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-4e368465-39ec-476f-a5a5-a031f862515f
    confs: [default]
    0 artifacts copied, 73 already retrieved (0kB/142ms)
23/11/20 01:09:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark version: 3.4.0
sparknlp version: 5.1.3

Data preparation

## Read cleaned data from parquet

### Anime subreddits
import sagemaker
# session = sagemaker.Session()
# bucket = session.default_bucket()
bucket = 'sagemaker-us-east-1-216384626106'

sub_bucket_path = f"s3a://{bucket}/project/cleaned/submissions"
com_bucket_path = f"s3a://{bucket}/project/cleaned/comments"

print(f"reading submissions from {sub_bucket_path}")
sub = spark.read.parquet(sub_bucket_path)  # parquet carries its own schema; `header` is a CSV-only option
print(f"shape of the sub dataframe is {sub.count():,}x{len(sub.columns)}")

print(f"reading comments from {com_bucket_path}")
com = spark.read.parquet(com_bucket_path)
print(f"shape of the com dataframe is {com.count():,}x{len(com.columns)}")
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
reading submissions from s3a://sagemaker-us-east-1-216384626106/project/cleaned/submissions
23/11/20 01:10:00 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
shape of the sub dataframe is 384,218x7
reading comments from s3a://sagemaker-us-east-1-216384626106/project/cleaned/comments
shape of the com dataframe is 21,131,502x8
sub.groupBy('subreddit').count().show()
+--------------------+------+
|           subreddit| count|
+--------------------+------+
|NeonGenesisEvange...|  1101|
|         Kaguya_sama|  3339|
|             pokemon| 80192|
|   StardustCrusaders| 18419|
|              yugioh| 21596|
|     ShokugekiNoSoma|   616|
|            OnePiece|141708|
|          TokyoGhoul|  2838|
|       attackontitan| 13346|
|    OneTruthPrevails|  3468|
|      swordartonline|  4977|
|              Gundam|  9416|
|                 dbz|  9904|
|         OnePunchMan| 14847|
|          KillLaKill|   832|
|             digimon| 11363|
|      DetectiveConan|   972|
|              Naruto| 41188|
|    DemonSlayerAnime|  4096|
+--------------------+------+
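For intuition, `groupBy('subreddit').count()` is the distributed analogue of tallying labels locally. A minimal pure-Python sketch over a hypothetical handful of rows (the sample data here is illustrative, not from the actual DataFrame):

```python
from collections import Counter

# hypothetical (subreddit, title) rows standing in for the submissions DataFrame
rows = [
    ("pokemon", "worried about the remakes"),
    ("OnePiece", "chapter discussion"),
    ("pokemon", "brilliant diamond trailer"),
    ("Naruto", "rewatch thread"),
]

# tally one count per row, keyed by subreddit -- what groupBy().count() computes
counts = Counter(subreddit for subreddit, _ in rows)
print(counts.most_common())  # [('pokemon', 2), ('OnePiece', 1), ('Naruto', 1)]
```

Spark does the same thing per partition and then merges the partial tallies, which is why the stage bar above shows many tasks for one small result table.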
sub.printSchema()
root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- title: string (nullable = true)
 |-- selftext: string (nullable = true)
 |-- score: long (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- num_comments: long (nullable = true)
sub.show(5)
+---------+-----------------+--------------------+--------------------+-----+-------------------+------------+
|subreddit|           author|               title|            selftext|score|        created_utc|num_comments|
+---------+-----------------+--------------------+--------------------+-----+-------------------+------------+
|  pokemon|       Lssjgaming|Worried about the...|In the presents t...|   39|2021-02-26 15:37:20|          53|
|  pokemon|      ZombieTorch|Brilliant Diamond...|[https://kapwi.ng...|   10|2021-02-26 15:37:39|           1|
|  pokemon|    Ndwith-urlife|After watching th...|[removed]\n\n[Vie...|    1|2021-02-26 15:37:42|           0|
|  pokemon|           dpol27|Is “Legends” the ...|I get the vibe fr...| 1267|2021-02-26 15:38:03|         280|
|  pokemon|Unlikelyusername3|Did the remake st...|This felt like a ...|   13|2021-02-26 15:38:19|          19|
+---------+-----------------+--------------------+--------------------+-----+-------------------+------------+
only showing top 5 rows
com.groupBy('subreddit').count().show()
+--------------------+-------+
|           subreddit|  count|
+--------------------+-------+
|NeonGenesisEvange...|  29216|
|         Kaguya_sama| 219819|
|             pokemon|5386818|
|   StardustCrusaders| 707070|
|              yugioh|1149720|
|     ShokugekiNoSoma|  17983|
|            OnePiece|6726891|
|          TokyoGhoul|  77437|
|       attackontitan| 602335|
|    OneTruthPrevails|  64187|
|      swordartonline| 202304|
|              Gundam| 808568|
|                 dbz| 504733|
|         OnePunchMan|1719628|
|          KillLaKill|  78023|
|             digimon| 509032|
|      DetectiveConan|  10769|
|              Naruto|2006663|
|    DemonSlayerAnime| 310306|
+--------------------+-------+
com.printSchema()
root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- body: string (nullable = true)
 |-- score: long (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- link_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
com.show(5)
+-------------+-----------------+--------------------+-----+----------+---------+-------+-------------------+
|    subreddit|           author|                body|score| parent_id|  link_id|     id|        created_utc|
+-------------+-----------------+--------------------+-----+----------+---------+-------+-------------------+
|  OnePunchMan|Creed_is_the_best|      Hmm...tiddies.|    6| t3_kr3c9v|t3_kr3c9v|gi7ri80|2021-01-05 19:57:11|
|attackontitan|        Lesbi4nna|                  mE|    2| t3_kqsgua|t3_kqsgua|gi7ri8r|2021-01-05 19:57:11|
|     OnePiece|         elite710|                 Lol|    3|t1_gi72t9c|t3_kr1jkq|gi7ries|2021-01-05 19:57:13|
|      digimon|       alsaucerer|This game was alr...|    3|t1_gi7r7dr|t3_kr3d1z|gi7rjbe|2021-01-05 19:57:21|
|     OnePiece|         Yackberg|First episode was...|    1| t3_kqznde|t3_kqznde|gi7rjop|2021-01-05 19:57:24|
+-------------+-----------------+--------------------+-----+----------+---------+-------+-------------------+
only showing top 5 rows

Data cleaning

sub_cleaned = (
    sub
    .withColumn("created_date", date_format("created_utc", "yyyy-MM-dd")) # date column
    .withColumn("created_hour", hour("created_utc")) # hour of day
    .withColumn("created_week", dayofweek("created_utc")) # day of week (Spark: 1 = Sunday ... 7 = Saturday)
    .withColumn("created_month", month("created_utc")) # month of year
    .withColumn("created_year", year("created_utc")) # year
    .withColumn("title", lower(col('title'))) # text cleaning: lowercase
    .withColumn("selftext", lower(col('selftext'))) # text cleaning: lowercase
    .withColumn("cleaned_title", regexp_replace(col('title'), r'[^a-zA-Z0-9\s]', '')) # keep only letters, digits, and whitespace
    .withColumn("cleaned_title", regexp_replace(col('cleaned_title'), r'\s+', ' ')) # collapse repeated whitespace
    .withColumn('title_wordCount', size(split(col('cleaned_title'), ' '))) # word count
    .withColumn("cleaned_selftext", regexp_replace(col('selftext'), r'[^a-zA-Z0-9\s]', '')) # keep only letters, digits, and whitespace
    .withColumn("cleaned_selftext", regexp_replace(col('cleaned_selftext'), r'\s+', ' ')) # collapse repeated whitespace
    .withColumn('selftext_wordCount', size(split(col('cleaned_selftext'), ' '))) # word count
    .withColumn('contain_pokemon', col("title").rlike("pokemon|pokémon")) # dummy flag; matched on `title` because the cleaning step strips the accented é (and `title` is already lowercased, so (?i) is unnecessary)
)
sub_cleaned.show(5)
+---------+-----------------+--------------------+--------------------+-----+-------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
|subreddit|           author|               title|            selftext|score|        created_utc|num_comments|created_date|created_hour|created_week|created_month|created_year|       cleaned_title|title_wordCount|    cleaned_selftext|selftext_wordCount|contain_pokemon|
+---------+-----------------+--------------------+--------------------+-----+-------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
|  pokemon|       Lssjgaming|worried about the...|in the presents t...|   39|2021-02-26 15:37:20|          53|  2021-02-26|          15|           6|            2|        2021|worried about the...|              8|in the presents t...|                98|          false|
|  pokemon|      ZombieTorch|brilliant diamond...|[https://kapwi.ng...|   10|2021-02-26 15:37:39|           1|  2021-02-26|          15|           6|            2|        2021|brilliant diamond...|             12|httpskapwingcjxbo...|                 1|          false|
|  pokemon|    Ndwith-urlife|after watching th...|[removed]\n\n[vie...|    1|2021-02-26 15:37:42|           0|  2021-02-26|          15|           6|            2|        2021|after watching th...|             11|removed view poll...|                 3|          false|
|  pokemon|           dpol27|is “legends” the ...|i get the vibe fr...| 1267|2021-02-26 15:38:03|         280|  2021-02-26|          15|           6|            2|        2021|is legends the st...|             11|i get the vibe fr...|               102|          false|
|  pokemon|Unlikelyusername3|did the remake st...|this felt like a ...|   13|2021-02-26 15:38:19|          19|  2021-02-26|          15|           6|            2|        2021|did the remake st...|              9|this felt like a ...|               170|          false|
+---------+-----------------+--------------------+--------------------+-----+-------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
only showing top 5 rows
com_cleaned = (
    com
    .withColumn("created_date", date_format("created_utc", "yyyy-MM-dd")) # create date column
    .withColumn("created_hour", hour("created_utc")) # create hour column
    .withColumn("created_week", dayofweek("created_utc")) # create day of the week column
    .withColumn("created_month", month("created_utc")) # create month of the year column
    .withColumn("created_year", year("created_utc")) # create the year column
    .withColumn("body", lower(col('body'))) # text cleaning: lowercase
    .withColumn("cleaned", regexp_replace(col('body'), r'[^a-zA-Z0-9\s]', '')) # text cleaning: only contain words or number
    .withColumn("cleaned", regexp_replace(col('cleaned'), r'\s+', ' ')) # text cleaning: remove extra space in text
    .withColumn('body_wordCount', size(split(col('cleaned'), ' '))) # word count
    .withColumn('contain_pokemon', col("body").rlike("""(?i)pokemon|(?i)pokémon""")) # create dummy variable column
)
com_cleaned.show(5)
+-------------+-----------------+--------------------+-----+----------+---------+-------+-------------------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|    subreddit|           author|                body|score| parent_id|  link_id|     id|        created_utc|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|contain_pokemon|
+-------------+-----------------+--------------------+-----+----------+---------+-------+-------------------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|  OnePunchMan|Creed_is_the_best|      hmm...tiddies.|    6| t3_kr3c9v|t3_kr3c9v|gi7ri80|2021-01-05 19:57:11|  2021-01-05|          19|           3|            1|        2021|          hmmtiddies|             1|          false|
|attackontitan|        Lesbi4nna|                  me|    2| t3_kqsgua|t3_kqsgua|gi7ri8r|2021-01-05 19:57:11|  2021-01-05|          19|           3|            1|        2021|                  me|             1|          false|
|     OnePiece|         elite710|                 lol|    3|t1_gi72t9c|t3_kr1jkq|gi7ries|2021-01-05 19:57:13|  2021-01-05|          19|           3|            1|        2021|                 lol|             1|          false|
|      digimon|       alsaucerer|this game was alr...|    3|t1_gi7r7dr|t3_kr3d1z|gi7rjbe|2021-01-05 19:57:21|  2021-01-05|          19|           3|            1|        2021|this game was alr...|            16|          false|
|     OnePiece|         Yackberg|first episode was...|    1| t3_kqznde|t3_kqznde|gi7rjop|2021-01-05 19:57:24|  2021-01-05|          19|           3|            1|        2021|first episode was...|            54|          false|
+-------------+-----------------+--------------------+-----+----------+---------+-------+-------------------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
only showing top 5 rows
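The Spark SQL functions chained above have direct plain-Python `re` analogues. As a sanity check on the cleaning logic, the same steps applied to a single string look like this (helper names are illustrative, not from the notebook):

```python
import re

def clean_text(text):
    """Mirror the chained withColumn steps: lowercase, strip everything
    except letters/digits/whitespace, then collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text)

def word_count(cleaned):
    """Mirror size(split(col, ' ')): split on single spaces and count."""
    return len(cleaned.split(' '))

def contains_pokemon(text):
    """Mirror rlike("(?i)pokemon|pokémon"): case-insensitive match."""
    return re.search(r'(?i)pokemon|pokémon', text) is not None

print(clean_text("Hmm...tiddies."))         # hmmtiddies
print(word_count("this game was alright"))  # 4
print(contains_pokemon("I love Pokémon"))   # True
```

This matches the first row of the table above: `hmm...tiddies.` cleans to `hmmtiddies` with a word count of 1.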

Text Cleaning Pipeline

Build a SparkNLP Pipeline

# Step 1: Transforms raw texts to `document` annotation
# documentAssembler = DocumentAssembler()\
#     .setInputCol("text")\
#     .setOutputCol("document")\
#     .setCleanupMode("shrink") # shrink: removes new lines and tabs, plus merging multiple spaces and blank lines to a single space.

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")


# Step 2: Remove unwanted characters from the text according to regex patterns

cleanUpPatterns = [r"[^a-zA-Z\s]+"] # alternative: [r"[^\w\d\s]"] removes punctuation (keeps alphanumeric chars)

# emoji_pat = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
# clean_pat = '[^a-zA-Z\s]+'
# cleanUpPatterns = [r"({})|({})".format(emoji_pat, clean_pat)]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)

# Step 3: Identify tokens with open tokenization standards
tokenizer = Tokenizer() \
    .setInputCols(["normalizedDocument"]) \
    .setOutputCol("token") \
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])

# # Optional step: context-aware spell correction
# spellChecker = ContextSpellCheckerApproach() \
#     .setInputCols("token") \
#     .setOutputCol("corrected") \
#     .setWordMaxDistance(3) \
#     .setBatchSize(24) \
#     .setEpochs(8) \
#     .setLanguageModelClasses(1650)  # dependent on vocabulary size

# Step 4: Reduce each word to its base dictionary form (lemma)
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

stemmer = Stemmer() \
    .setInputCols(["lemma"]) \
    .setOutputCol("stem")

# Step 5: Drop all stop words from the input sequences
stopwords_cleaner = StopWordsCleaner() \
    .setInputCols("stem") \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

# Step 6: Reconstruct a DOCUMENT type annotation from the cleaned tokens
tokenassembler = TokenAssembler()\
    .setInputCols(["document", "cleanTokens"]) \
    .setOutputCol("clean_text")


nlpPipeline = Pipeline(
    stages=[
        documentAssembler,
        documentNormalizer,
        tokenizer,
        lemmatizer,
        stemmer,
        stopwords_cleaner,
        tokenassembler
     ])
lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
Download done! Loading the resource.
[OK!]
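Note that the normalizer's pattern differs from the earlier `regexp_replace` cleaning: digits are dropped (no `0-9` in the keep-set) and matches are replaced with a space rather than deleted outright. A rough plain-`re` sketch of that substitution, assuming Java and Python regex semantics agree for this simple pattern (the `.lower()` call mirrors `setLowercase(True)`):

```python
import re

# the clean pattern and replacement configured on DocumentNormalizer above
pattern, replacement = r"[^a-zA-Z\s]+", " "

raw = "Don't stop!!! 123 go"
normalized = re.sub(pattern, replacement, raw).lower()
print(normalized)  # apostrophes and digits become spaces: "don t stop    go"
```

So "don't" splits into "don t" here, whereas the earlier cleaning step would have produced "dont"; which behavior is preferable depends on the downstream tokenizer.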
com = com_cleaned
sub = sub_cleaned
# rename the columns that need text cleaning to `text` to match the nlpPipeline input
body_com = com.withColumnRenamed('body','text')
title_sub = sub.withColumnRenamed('title','text')
selftext_sub = sub.withColumnRenamed('selftext','text')

# fit the pipeline and run text cleaning on each DataFrame:
# body_com, title_sub, selftext_sub
pipelineModel = nlpPipeline.fit(body_com)
body_cleaned = pipelineModel.transform(body_com)
body_cleaned = body_cleaned.drop("document","normalizedDocument","lemma","stem","cleanTokens")


pipelineModel = nlpPipeline.fit(title_sub)
title_cleaned = pipelineModel.transform(title_sub)
title_cleaned = title_cleaned.drop("document","normalizedDocument","lemma","stem","cleanTokens")


pipelineModel = nlpPipeline.fit(selftext_sub)
selftext_cleaned = pipelineModel.transform(selftext_sub)
selftext_cleaned = selftext_cleaned.drop("document","normalizedDocument","lemma","stem","cleanTokens")
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.util.SizeEstimator$ (file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/spark-core_2.12-3.4.0.jar) to field java.util.regex.Pattern.pattern
WARNING: Please consider reporting this to the maintainers of org.apache.spark.util.SizeEstimator$
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
body_cleaned.printSchema()
root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- text: string (nullable = true)
 |-- score: long (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- link_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- created_date: string (nullable = true)
 |-- created_hour: integer (nullable = true)
 |-- created_week: integer (nullable = true)
 |-- created_month: integer (nullable = true)
 |-- created_year: integer (nullable = true)
 |-- cleaned: string (nullable = true)
 |-- body_wordCount: integer (nullable = false)
 |-- contain_pokemon: boolean (nullable = true)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- clean_text: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
body_cleaned.select(
    "subreddit",
    "created_utc",
    "text",
    "score",
    "id",
    "created_date",
    "created_hour",
    "created_week",
    "created_month",
    "created_year",
    "cleaned",
    "body_wordCount").show()
+-----------------+-------------------+--------------------+-----+-------+------------+------------+------------+-------------+------------+--------------------+--------------+
|        subreddit|        created_utc|                text|score|     id|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|
+-----------------+-------------------+--------------------+-----+-------+------------+------------+------------+-------------+------------+--------------------+--------------+
|      OnePunchMan|2021-01-05 19:57:11|      hmm...tiddies.|    6|gi7ri80|  2021-01-05|          19|           3|            1|        2021|          hmmtiddies|             1|
|    attackontitan|2021-01-05 19:57:11|                  me|    2|gi7ri8r|  2021-01-05|          19|           3|            1|        2021|                  me|             1|
|         OnePiece|2021-01-05 19:57:13|                 lol|    3|gi7ries|  2021-01-05|          19|           3|            1|        2021|                 lol|             1|
|          digimon|2021-01-05 19:57:21|this game was alr...|    3|gi7rjbe|  2021-01-05|          19|           3|            1|        2021|this game was alr...|            16|
|         OnePiece|2021-01-05 19:57:24|first episode was...|    1|gi7rjop|  2021-01-05|          19|           3|            1|        2021|first episode was...|            54|
|           Naruto|2021-01-05 19:57:25|well obviously, i...|    2|gi7rjss|  2021-01-05|          19|           3|            1|        2021|well obviously if...|            10|
|         OnePiece|2021-01-05 19:57:34|i started one pie...|    2|gi7rkpv|  2021-01-05|          19|           3|            1|        2021|i started one pie...|            52|
|           Naruto|2021-01-05 19:57:37|so many uchiha re...|    2|gi7rl23|  2021-01-05|          19|           3|            1|        2021|so many uchiha re...|            36|
|      OnePunchMan|2021-01-05 19:57:39|  what is hurricane?|    3|gi7rlcw|  2021-01-05|          19|           3|            1|        2021|   what is hurricane|             3|
|           Naruto|2021-01-05 19:57:41|really? that’d be...|   11|gi7rll9|  2021-01-05|          19|           3|            1|        2021|really thatd be c...|            31|
|          pokemon|2021-01-05 19:57:42|absolutely love t...|    2|gi7rllh|  2021-01-05|          19|           3|            1|        2021|absolutely love t...|             3|
|          pokemon|2021-01-05 19:57:42|and so does kukui...|   34|gi7rlo9|  2021-01-05|          19|           3|            1|        2021|and so does kukui...|            14|
|          digimon|2021-01-05 19:57:44| that's just a dude.|    6|gi7rluv|  2021-01-05|          19|           3|            1|        2021|   thats just a dude|             4|
|         OnePiece|2021-01-05 19:57:55|just like zoro, s...|    5|gi7rn3y|  2021-01-05|          19|           3|            1|        2021|just like zoro sh...|             8|
|         OnePiece|2021-01-05 19:57:56|i don’t think it ...|    3|gi7rn7r|  2021-01-05|          19|           3|            1|        2021|i dont think it m...|            20|
|         OnePiece|2021-01-05 19:57:57|you make some ver...|    4|gi7rnbp|  2021-01-05|          19|           3|            1|        2021|you make some ver...|           120|
|         OnePiece|2021-01-05 19:58:08|hi oases-dragon, ...|    0|gi7roke|  2021-01-05|          19|           3|            1|        2021|hi oasesdragon yo...|           166|
|          pokemon|2021-01-05 19:58:11|no cap on that dr...|   -1|gi7rovi|  2021-01-05|          19|           3|            1|        2021|no cap on that dr...|             5|
|StardustCrusaders|2021-01-05 19:58:15|sheer heart attac...|    1|gi7rpd1|  2021-01-05|          19|           3|            1|        2021|sheer heart attac...|            12|
|         OnePiece|2021-01-05 19:58:17|hi timmyanz, your...|    1|gi7rpmw|  2021-01-05|          19|           3|            1|        2021|hi timmyanz your ...|           130|
+-----------------+-------------------+--------------------+-----+-------+------------+------------+------------+-------------+------------+--------------------+--------------+
only showing top 20 rows

Filter the comments data where subreddit is “pokemon”

from pyspark.sql.functions import col

# Filter the DataFrame for rows where subreddit is "pokemon"
pokemon_data = body_cleaned.filter(col("subreddit") == "pokemon")
# Show the filtered data
pokemon_data.select(
    "subreddit",
    "created_utc",
    "text",
    "score",
    "id",
    "created_date",
    "created_hour",
    "created_week",
    "created_month",
    "created_year",
    "cleaned",
    "body_wordCount").show()
+---------+-------------------+--------------------+-----+-------+------------+------------+------------+-------------+------------+--------------------+--------------+
|subreddit|        created_utc|                text|score|     id|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|
+---------+-------------------+--------------------+-----+-------+------------+------------+------------+-------------+------------+--------------------+--------------+
|  pokemon|2021-01-05 19:57:42|absolutely love t...|    2|gi7rllh|  2021-01-05|          19|           3|            1|        2021|absolutely love t...|             3|
|  pokemon|2021-01-05 19:57:42|and so does kukui...|   34|gi7rlo9|  2021-01-05|          19|           3|            1|        2021|and so does kukui...|            14|
|  pokemon|2021-01-05 19:58:11|no cap on that dr...|   -1|gi7rovi|  2021-01-05|          19|           3|            1|        2021|no cap on that dr...|             5|
|  pokemon|2021-01-05 19:58:26|to judge the card...|    1|gi7rqkk|  2021-01-05|          19|           3|            1|        2021|to judge the card...|             9|
|  pokemon|2021-01-05 19:58:31|gods this games p...|    2|gi7rr3n|  2021-01-05|          19|           3|            1|        2021|gods this games p...|            26|
|  pokemon|2021-01-05 19:58:32|why they made gen...|   25|gi7rraa|  2021-01-05|          19|           3|            1|        2021|why they made gen...|            45|
|  pokemon|2021-01-05 19:58:40|minimalistic and ...|    1|gi7rs5i|  2021-01-05|          19|           3|            1|        2021|minimalistic and ...|            19|
|  pokemon|2021-01-05 19:58:51|nah blues'theme i...|    4|gi7rt91|  2021-01-05|          19|           3|            1|        2021|nah bluestheme in...|            16|
|  pokemon|2021-01-05 19:58:56|also do you have ...|    2|gi7rtvn|  2021-01-05|          19|           3|            1|        2021|also do you have ...|            14|
|  pokemon|2021-01-05 19:58:59|and we have a win...|    2|gi7ru52|  2021-01-05|          19|           3|            1|        2021|and we have a win...|            19|
|  pokemon|2021-01-05 19:59:04|that's awesome. i...|    5|gi7rupb|  2021-01-05|          19|           3|            1|        2021|thats awesome its...|            12|
|  pokemon|2021-01-05 19:59:04|even without the ...|   42|gi7ruqg|  2021-01-05|          19|           3|            1|        2021|even without the ...|            30|
|  pokemon|2021-01-05 19:59:32|one down, 150 to go!|    3|gi7rxs0|  2021-01-05|          19|           3|            1|        2021|  one down 150 to go|             5|
|  pokemon|2021-01-05 19:59:35|oh yeah, i forgot...|    2|gi7ry40|  2021-01-05|          19|           3|            1|        2021|oh yeah i forgot ...|            12|
|  pokemon|2021-01-05 19:59:40|gen 3 and 4 are a...|    2|gi7rymu|  2021-01-05|          19|           3|            1|        2021|gen 3 and 4 are a...|            37|
|  pokemon|2021-01-05 19:59:48|i get poison/drag...|    1|gi7rzjr|  2021-01-05|          19|           3|            1|        2021|i get poisondrago...|             4|
|  pokemon|2021-01-05 20:00:20|the special stat ...|   12|gi7s38p|  2021-01-05|          20|           3|            1|        2021|the special stat ...|            48|
|  pokemon|2021-01-05 20:00:32|awwww it is a tot...|    2|gi7s4jm|  2021-01-05|          20|           3|            1|        2021|awwww it is a tot...|             7|
|  pokemon|2021-01-05 20:00:33|all i see is regi...|    2|gi7s4om|  2021-01-05|          20|           3|            1|        2021|all i see is regi...|             5|
|  pokemon|2021-01-05 20:00:42|          very cool!|    1|gi7s5ri|  2021-01-05|          20|           3|            1|        2021|           very cool|             2|
+---------+-------------------+--------------------+-----+-------+------------+------------+------------+-------------+------------+--------------------+--------------+
only showing top 20 rows
pokemon_data.count()
                                                                                
5386818

Build a SparkNLP Pipeline to assign positive/negative sentiment to the comments data

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, SentimentDLModel
from pyspark.ml import Pipeline

MODEL_NAME='sentimentdl_use_twitter'

# Document Assembling
documentAssembler = DocumentAssembler()\
    .setInputCol("cleaned")\
    .setOutputCol("document")
    
# Embedding with Universal Sentence Encoder
use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# Sentiment Analysis (using a pre-trained model)
sentimentdl = SentimentDLModel.pretrained(name=MODEL_NAME, lang="en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

# Building the Pipeline
nlpPipeline = Pipeline(
    stages = [
        documentAssembler,
        use,
        sentimentdl
    ])
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
Download done! Loading the resource.
2023-11-20 01:11:56.192789: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[OK!]
sentimentdl_use_twitter download started this may take some time.
Approximate size to download 11.4 MB
Download done! Loading the resource.
[OK!]
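The pretrained model emits one of three string labels (positive, neutral, negative). For later aggregation, e.g. averaging sentiment per day, it can help to map those labels to signed numeric scores. This mapping is an illustrative assumption, not part of the notebook:

```python
# Hypothetical label-to-score mapping for SentimentDL output
# (assumed convention; not defined anywhere in this notebook).
SENTIMENT_SCORE = {"positive": 1, "neutral": 0, "negative": -1}

def to_score(label):
    """Map a sentiment label to +1/0/-1; unknown labels default to 0."""
    return SENTIMENT_SCORE.get(label, 0)

print(to_score("positive"))  # 1
print(to_score("negative"))  # -1
```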
# Fit and transform the data using the pipeline
nlp_model = nlpPipeline.fit(pokemon_data)
processed_pokemon = nlp_model.transform(pokemon_data)

# Display results
processed_pokemon.select(
    "created_utc",
    "score",
    "created_date",
    "created_hour",
    "created_week",
    "created_month",
    "created_year",
    "cleaned",
    "body_wordCount",
    "sentiment.result"
).show()
23/11/20 01:12:13 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
+-------------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+----------+
|        created_utc|score|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|    result|
+-------------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+----------+
|2021-01-05 19:57:42|    2|  2021-01-05|          19|           3|            1|        2021|absolutely love t...|             3|[positive]|
|2021-01-05 19:57:42|   34|  2021-01-05|          19|           3|            1|        2021|and so does kukui...|            14|[positive]|
|2021-01-05 19:58:11|   -1|  2021-01-05|          19|           3|            1|        2021|no cap on that dr...|             5|[negative]|
|2021-01-05 19:58:26|    1|  2021-01-05|          19|           3|            1|        2021|to judge the card...|             9|[positive]|
|2021-01-05 19:58:31|    2|  2021-01-05|          19|           3|            1|        2021|gods this games p...|            26|[negative]|
|2021-01-05 19:58:32|   25|  2021-01-05|          19|           3|            1|        2021|why they made gen...|            45|[negative]|
|2021-01-05 19:58:40|    1|  2021-01-05|          19|           3|            1|        2021|minimalistic and ...|            19|[positive]|
|2021-01-05 19:58:51|    4|  2021-01-05|          19|           3|            1|        2021|nah bluestheme in...|            16|[negative]|
|2021-01-05 19:58:56|    2|  2021-01-05|          19|           3|            1|        2021|also do you have ...|            14| [neutral]|
|2021-01-05 19:58:59|    2|  2021-01-05|          19|           3|            1|        2021|and we have a win...|            19|[positive]|
|2021-01-05 19:59:04|    5|  2021-01-05|          19|           3|            1|        2021|thats awesome its...|            12|[positive]|
|2021-01-05 19:59:04|   42|  2021-01-05|          19|           3|            1|        2021|even without the ...|            30|[negative]|
|2021-01-05 19:59:32|    3|  2021-01-05|          19|           3|            1|        2021|  one down 150 to go|             5|[positive]|
|2021-01-05 19:59:35|    2|  2021-01-05|          19|           3|            1|        2021|oh yeah i forgot ...|            12|[negative]|
|2021-01-05 19:59:40|    2|  2021-01-05|          19|           3|            1|        2021|gen 3 and 4 are a...|            37|[negative]|
|2021-01-05 19:59:48|    1|  2021-01-05|          19|           3|            1|        2021|i get poisondrago...|             4|[positive]|
|2021-01-05 20:00:20|   12|  2021-01-05|          20|           3|            1|        2021|the special stat ...|            48|[negative]|
|2021-01-05 20:00:32|    2|  2021-01-05|          20|           3|            1|        2021|awwww it is a tot...|             7|[positive]|
|2021-01-05 20:00:33|    2|  2021-01-05|          20|           3|            1|        2021|all i see is regi...|             5|[negative]|
|2021-01-05 20:00:42|    1|  2021-01-05|          20|           3|            1|        2021|           very cool|             2|[positive]|
+-------------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+----------+
only showing top 20 rows
                                                                                
# Extracting sentiment value from the result array
processed_pokemon_result = processed_pokemon.withColumn(
    "sentiment", 
    col("sentiment.result").getItem(0)
)

processed_pokemon_result.select(
    "created_utc",
    "score",
    "created_date",
    "created_hour",
    "created_week",
    "created_month",
    "created_year",
    "cleaned",
    "body_wordCount",
    "sentiment"
).show()
+-------------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+---------+
|        created_utc|score|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|sentiment|
+-------------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+---------+
|2021-01-05 19:57:42|    2|  2021-01-05|          19|           3|            1|        2021|absolutely love t...|             3| positive|
|2021-01-05 19:57:42|   34|  2021-01-05|          19|           3|            1|        2021|and so does kukui...|            14| positive|
|2021-01-05 19:58:11|   -1|  2021-01-05|          19|           3|            1|        2021|no cap on that dr...|             5| negative|
|2021-01-05 19:58:26|    1|  2021-01-05|          19|           3|            1|        2021|to judge the card...|             9| positive|
|2021-01-05 19:58:31|    2|  2021-01-05|          19|           3|            1|        2021|gods this games p...|            26| negative|
|2021-01-05 19:58:32|   25|  2021-01-05|          19|           3|            1|        2021|why they made gen...|            45| negative|
|2021-01-05 19:58:40|    1|  2021-01-05|          19|           3|            1|        2021|minimalistic and ...|            19| positive|
|2021-01-05 19:58:51|    4|  2021-01-05|          19|           3|            1|        2021|nah bluestheme in...|            16| negative|
|2021-01-05 19:58:56|    2|  2021-01-05|          19|           3|            1|        2021|also do you have ...|            14|  neutral|
|2021-01-05 19:58:59|    2|  2021-01-05|          19|           3|            1|        2021|and we have a win...|            19| positive|
|2021-01-05 19:59:04|    5|  2021-01-05|          19|           3|            1|        2021|thats awesome its...|            12| positive|
|2021-01-05 19:59:04|   42|  2021-01-05|          19|           3|            1|        2021|even without the ...|            30| negative|
|2021-01-05 19:59:32|    3|  2021-01-05|          19|           3|            1|        2021|  one down 150 to go|             5| positive|
|2021-01-05 19:59:35|    2|  2021-01-05|          19|           3|            1|        2021|oh yeah i forgot ...|            12| negative|
|2021-01-05 19:59:40|    2|  2021-01-05|          19|           3|            1|        2021|gen 3 and 4 are a...|            37| negative|
|2021-01-05 19:59:48|    1|  2021-01-05|          19|           3|            1|        2021|i get poisondrago...|             4| positive|
|2021-01-05 20:00:20|   12|  2021-01-05|          20|           3|            1|        2021|the special stat ...|            48| negative|
|2021-01-05 20:00:32|    2|  2021-01-05|          20|           3|            1|        2021|awwww it is a tot...|             7| positive|
|2021-01-05 20:00:33|    2|  2021-01-05|          20|           3|            1|        2021|all i see is regi...|             5| negative|
|2021-01-05 20:00:42|    1|  2021-01-05|          20|           3|            1|        2021|           very cool|             2| positive|
+-------------------+-----+------------+------------+------------+-------------+------------+--------------------+--------------+---------+
only showing top 20 rows
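The `getItem(0)` call above takes the first element of the `sentiment.result` array; in Spark it yields null when the array is empty or missing rather than raising an error. A plain-Python analogue of that behavior (the helper name is illustrative):

```python
def first_or_none(result):
    """Analogue of col("sentiment.result").getItem(0): return the first
    element, or None when the array is empty or missing (Spark returns
    null in those cases)."""
    return result[0] if result else None

print(first_or_none(["positive"]))  # positive
print(first_or_none([]))            # None
```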
                                                                                
title_cleaned.select(
    "subreddit",
    "score",
    "created_utc",
    "text",
    "selftext",
    "num_comments",
    "created_date",
    "created_hour",
    "created_week",
    "created_month",
    "created_year",
    "cleaned_title",
    "title_wordCount",
    "cleaned_selftext",
    "selftext_wordCount").show(5)
+---------+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+
|subreddit|score|        created_utc|                text|            selftext|num_comments|created_date|created_hour|created_week|created_month|created_year|       cleaned_title|title_wordCount|    cleaned_selftext|selftext_wordCount|
+---------+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+
|  pokemon|   39|2021-02-26 15:37:20|worried about the...|in the presents t...|          53|  2021-02-26|          15|           6|            2|        2021|worried about the...|              8|in the presents t...|                98|
|  pokemon|   10|2021-02-26 15:37:39|brilliant diamond...|[https://kapwi.ng...|           1|  2021-02-26|          15|           6|            2|        2021|brilliant diamond...|             12|httpskapwingcjxbo...|                 1|
|  pokemon|    1|2021-02-26 15:37:42|after watching th...|[removed]\n\n[vie...|           0|  2021-02-26|          15|           6|            2|        2021|after watching th...|             11|removed view poll...|                 3|
|  pokemon| 1267|2021-02-26 15:38:03|is “legends” the ...|i get the vibe fr...|         280|  2021-02-26|          15|           6|            2|        2021|is legends the st...|             11|i get the vibe fr...|               102|
|  pokemon|   13|2021-02-26 15:38:19|did the remake st...|this felt like a ...|          19|  2021-02-26|          15|           6|            2|        2021|did the remake st...|              9|this felt like a ...|               170|
+---------+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+
only showing top 5 rows

Concatenate the title with the selftext for the submissions data

from pyspark.sql.functions import concat_ws

# Combine 'cleaned_title' and 'cleaned_selftext' into a new column 'cleaned_all'
title_cleaned = title_cleaned.withColumn("cleaned_all", concat_ws(" ", title_cleaned.cleaned_title, title_cleaned.cleaned_selftext))
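As a reference point for the null handling here: Spark's `concat_ws` skips null inputs rather than propagating null, so submissions with no selftext still get a usable `cleaned_all`. A minimal pandas sketch (toy, made-up rows) that mirrors this behavior:

```python
import pandas as pd

# Toy stand-in for title_cleaned (hypothetical rows; the second post has no selftext)
df = pd.DataFrame({
    "cleaned_title": ["worried about the remakes", "brilliant diamond"],
    "cleaned_selftext": ["in the presents they look off", None],
})

# Like concat_ws(" ", ...): treat missing parts as empty instead of propagating NaN
df["cleaned_all"] = (
    df["cleaned_title"].fillna("") + " " + df["cleaned_selftext"].fillna("")
).str.strip()

print(df["cleaned_all"].tolist())
```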
# Show the first 5 rows of the updated DataFrame
title_cleaned.select("subreddit",
    "score",
    "created_utc",
    "text",
    "selftext",
    "num_comments",
    "created_date",
    "created_hour",
    "created_week",
    "created_month",
    "created_year",
    "cleaned_all",
    "title_wordCount",
    "selftext_wordCount").show(5)
+---------+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+------------------+
|subreddit|score|        created_utc|                text|            selftext|num_comments|created_date|created_hour|created_week|created_month|created_year|         cleaned_all|title_wordCount|selftext_wordCount|
+---------+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+------------------+
|  pokemon|   39|2021-02-26 15:37:20|worried about the...|in the presents t...|          53|  2021-02-26|          15|           6|            2|        2021|worried about the...|              8|                98|
|  pokemon|   10|2021-02-26 15:37:39|brilliant diamond...|[https://kapwi.ng...|           1|  2021-02-26|          15|           6|            2|        2021|brilliant diamond...|             12|                 1|
|  pokemon|    1|2021-02-26 15:37:42|after watching th...|[removed]\n\n[vie...|           0|  2021-02-26|          15|           6|            2|        2021|after watching th...|             11|                 3|
|  pokemon| 1267|2021-02-26 15:38:03|is “legends” the ...|i get the vibe fr...|         280|  2021-02-26|          15|           6|            2|        2021|is legends the st...|             11|               102|
|  pokemon|   13|2021-02-26 15:38:19|did the remake st...|this felt like a ...|          19|  2021-02-26|          15|           6|            2|        2021|did the remake st...|              9|               170|
+---------+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+------------------+
only showing top 5 rows

Filter the submissions data where subreddit is “pokemon”

from pyspark.sql.functions import col

# Filter the DataFrame for rows where subreddit is "pokemon"
pokemon_sub = title_cleaned.filter(col("subreddit") == "pokemon")
# Show the filtered data
pokemon_sub.select(
    "score",
    "created_utc",
    "text",
    "selftext",
    "num_comments",
    "created_date",
    "created_hour",
    "created_week",
    "created_month",
    "created_year",
    "cleaned_all",
    "title_wordCount",
    "selftext_wordCount").show(5)
+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+------------------+
|score|        created_utc|                text|            selftext|num_comments|created_date|created_hour|created_week|created_month|created_year|         cleaned_all|title_wordCount|selftext_wordCount|
+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+------------------+
|   39|2021-02-26 15:37:20|worried about the...|in the presents t...|          53|  2021-02-26|          15|           6|            2|        2021|worried about the...|              8|                98|
|   10|2021-02-26 15:37:39|brilliant diamond...|[https://kapwi.ng...|           1|  2021-02-26|          15|           6|            2|        2021|brilliant diamond...|             12|                 1|
|    1|2021-02-26 15:37:42|after watching th...|[removed]\n\n[vie...|           0|  2021-02-26|          15|           6|            2|        2021|after watching th...|             11|                 3|
| 1267|2021-02-26 15:38:03|is “legends” the ...|i get the vibe fr...|         280|  2021-02-26|          15|           6|            2|        2021|is legends the st...|             11|               102|
|   13|2021-02-26 15:38:19|did the remake st...|this felt like a ...|          19|  2021-02-26|          15|           6|            2|        2021|did the remake st...|              9|               170|
+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+------------------+
only showing top 5 rows
pokemon_sub.count()
                                                                                
80192

Build a Spark NLP pipeline to label the submissions data with positive/negative sentiment

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, SentimentDLModel
from pyspark.ml import Pipeline

MODEL_NAME='sentimentdl_use_twitter'

# Document Assembling
documentAssembler = DocumentAssembler()\
    .setInputCol("cleaned_all")\
    .setOutputCol("document")
    
# Embedding with Universal Sentence Encoder
use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# Sentiment Analysis (using a pre-trained model)
sentimentdl = SentimentDLModel.pretrained(name=MODEL_NAME, lang="en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

# Building the Pipeline
nlpPipeline = Pipeline(
    stages = [
        documentAssembler,
        use,
        sentimentdl
    ])
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentimentdl_use_twitter download started this may take some time.
Approximate size to download 11.4 MB
[OK!]
# Fit and transform the data using the pipeline
nlp_model = nlpPipeline.fit(pokemon_sub)
processed_pokemon_sub = nlp_model.transform(pokemon_sub)

# Display results
processed_pokemon_sub.select(
    "score",
    "created_utc",
    "text",
    "selftext",
    "num_comments",
    "created_date",
    "created_hour",
    "created_week",
    "created_month",
    "created_year",
    "cleaned_all",
    "title_wordCount",
    "selftext_wordCount",
    "sentiment.result"
).show()
+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+------------------+----------+
|score|        created_utc|                text|            selftext|num_comments|created_date|created_hour|created_week|created_month|created_year|         cleaned_all|title_wordCount|selftext_wordCount|    result|
+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+------------------+----------+
|   39|2021-02-26 15:37:20|worried about the...|in the presents t...|          53|  2021-02-26|          15|           6|            2|        2021|worried about the...|              8|                98|[negative]|
|   10|2021-02-26 15:37:39|brilliant diamond...|[https://kapwi.ng...|           1|  2021-02-26|          15|           6|            2|        2021|brilliant diamond...|             12|                 1|[positive]|
|    1|2021-02-26 15:37:42|after watching th...|[removed]\n\n[vie...|           0|  2021-02-26|          15|           6|            2|        2021|after watching th...|             11|                 3|[positive]|
| 1267|2021-02-26 15:38:03|is “legends” the ...|i get the vibe fr...|         280|  2021-02-26|          15|           6|            2|        2021|is legends the st...|             11|               102|[positive]|
|   13|2021-02-26 15:38:19|did the remake st...|this felt like a ...|          19|  2021-02-26|          15|           6|            2|        2021|did the remake st...|              9|               170|[negative]|
|   20|2021-02-26 15:38:29|reactions to the ...|personally the dp...|          12|  2021-02-26|          15|           6|            2|        2021|reactions to the ...|              4|                92|[negative]|
|  544|2021-02-26 15:38:32|congratulations g...|as a sinnoh die h...|          83|  2021-02-26|          15|           6|            2|        2021|congratulations g...|             12|                52|[positive]|
|    0|2021-02-26 15:38:35|guys you are allo...|to all the people...|          17|  2021-02-26|          15|           6|            2|        2021|guys you are allo...|             11|               304|[positive]|
|    1|2021-02-26 15:38:56|yo we got dp rema...|[removed]\n\n[vie...|           0|  2021-02-26|          15|           6|            2|        2021|yo we got dp rema...|             17|                 3|[negative]|
|   60|2021-02-26 15:39:03|openworld pokemon...|i said multiple t...|           5|  2021-02-26|          15|           6|            2|        2021|openworld pokemon...|             10|                73|[negative]|
| 2099|2021-02-26 15:39:05|the most shocking...|https://i.imgur.c...|         206|  2021-02-26|          15|           6|            2|        2021|the most shocking...|             19|                65|[negative]|
| 1246|2021-02-26 15:39:18|forget the remake...|as i watched the ...|         241|  2021-02-26|          15|           6|            2|        2021|forget the remake...|             10|               107|[negative]|
|    6|2021-02-26 15:39:46|gen 4 remakes and...|1) pokemon snap-2...|           2|  2021-02-26|          15|           6|            2|        2021|gen 4 remakes and...|              9|                62|[positive]|
|   11|2021-02-26 15:39:46|the sinnoh remake...|i don’t know what...|          22|  2021-02-26|          15|           6|            2|        2021|the sinnoh remake...|             10|               219|[negative]|
|   22|2021-02-26 15:39:56|do you think they...|hi there everyone...|          14|  2021-02-26|          15|           6|            2|        2021|do you think they...|             14|                81|[positive]|
|    7|2021-02-26 15:40:02|an interesting th...|now i am happy ab...|           7|  2021-02-26|          15|           6|            2|        2021|an interesting th...|              7|               121|[negative]|
|    1|2021-02-26 15:40:25|    remake's graphic|[removed]\n\n[vie...|           0|  2021-02-26|          15|           6|            2|        2021|remakes graphic r...|              2|                 3|[negative]|
|    2|2021-02-26 15:40:44|something i want ...|please don’t knoc...|           4|  2021-02-26|          15|           6|            2|        2021|something i want ...|              7|                66|[negative]|
|   29|2021-02-26 15:40:48|ancient/sinnoh va...|first off let me ...|          20|  2021-02-26|          15|           6|            2|        2021|ancientsinnoh var...|              5|               109|[positive]|
|   22|2021-02-26 15:41:02|they were never g...|let’s be real non...|          14|  2021-02-26|          15|           6|            2|        2021|they were never g...|              7|               185|[negative]|
+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+------------------+----------+
only showing top 20 rows
                                                                                
# Extracting sentiment value from the result array
processed_pokemon_result_sub = processed_pokemon_sub.withColumn(
    "sentiment", 
    col("sentiment.result").getItem(0)
)
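`col("sentiment.result").getItem(0)` yields null for any row where the annotator produced an empty array, so no extra guard is needed on the Spark side. A toy pandas version (hypothetical values) of the same extraction:

```python
import pandas as pd

# Hypothetical 'sentiment.result' arrays; the last one is empty
results = pd.Series([["negative"], ["positive"], []])

# Take the first element if present, otherwise None (what getItem(0) gives for empty arrays)
sentiment = results.apply(lambda r: r[0] if r else None)

print(sentiment.tolist())  # ['negative', 'positive', None]
```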

processed_pokemon_result_sub.select(
    "score",
    "created_utc",
    "text",
    "selftext",
    "num_comments",
    "created_date",
    "created_hour",
    "created_week",
    "created_month",
    "created_year",
    "cleaned_all",
    "title_wordCount",
    "selftext_wordCount",
    "sentiment"
).show()
+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+------------------+---------+
|score|        created_utc|                text|            selftext|num_comments|created_date|created_hour|created_week|created_month|created_year|         cleaned_all|title_wordCount|selftext_wordCount|sentiment|
+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+------------------+---------+
|   39|2021-02-26 15:37:20|worried about the...|in the presents t...|          53|  2021-02-26|          15|           6|            2|        2021|worried about the...|              8|                98| negative|
|   10|2021-02-26 15:37:39|brilliant diamond...|[https://kapwi.ng...|           1|  2021-02-26|          15|           6|            2|        2021|brilliant diamond...|             12|                 1| positive|
|    1|2021-02-26 15:37:42|after watching th...|[removed]\n\n[vie...|           0|  2021-02-26|          15|           6|            2|        2021|after watching th...|             11|                 3| positive|
| 1267|2021-02-26 15:38:03|is “legends” the ...|i get the vibe fr...|         280|  2021-02-26|          15|           6|            2|        2021|is legends the st...|             11|               102| positive|
|   13|2021-02-26 15:38:19|did the remake st...|this felt like a ...|          19|  2021-02-26|          15|           6|            2|        2021|did the remake st...|              9|               170| negative|
|   20|2021-02-26 15:38:29|reactions to the ...|personally the dp...|          12|  2021-02-26|          15|           6|            2|        2021|reactions to the ...|              4|                92| negative|
|  544|2021-02-26 15:38:32|congratulations g...|as a sinnoh die h...|          83|  2021-02-26|          15|           6|            2|        2021|congratulations g...|             12|                52| positive|
|    0|2021-02-26 15:38:35|guys you are allo...|to all the people...|          17|  2021-02-26|          15|           6|            2|        2021|guys you are allo...|             11|               304| positive|
|    1|2021-02-26 15:38:56|yo we got dp rema...|[removed]\n\n[vie...|           0|  2021-02-26|          15|           6|            2|        2021|yo we got dp rema...|             17|                 3| negative|
|   60|2021-02-26 15:39:03|openworld pokemon...|i said multiple t...|           5|  2021-02-26|          15|           6|            2|        2021|openworld pokemon...|             10|                73| negative|
| 2099|2021-02-26 15:39:05|the most shocking...|https://i.imgur.c...|         206|  2021-02-26|          15|           6|            2|        2021|the most shocking...|             19|                65| negative|
| 1246|2021-02-26 15:39:18|forget the remake...|as i watched the ...|         241|  2021-02-26|          15|           6|            2|        2021|forget the remake...|             10|               107| negative|
|    6|2021-02-26 15:39:46|gen 4 remakes and...|1) pokemon snap-2...|           2|  2021-02-26|          15|           6|            2|        2021|gen 4 remakes and...|              9|                62| positive|
|   11|2021-02-26 15:39:46|the sinnoh remake...|i don’t know what...|          22|  2021-02-26|          15|           6|            2|        2021|the sinnoh remake...|             10|               219| negative|
|   22|2021-02-26 15:39:56|do you think they...|hi there everyone...|          14|  2021-02-26|          15|           6|            2|        2021|do you think they...|             14|                81| positive|
|    7|2021-02-26 15:40:02|an interesting th...|now i am happy ab...|           7|  2021-02-26|          15|           6|            2|        2021|an interesting th...|              7|               121| negative|
|    1|2021-02-26 15:40:25|    remake's graphic|[removed]\n\n[vie...|           0|  2021-02-26|          15|           6|            2|        2021|remakes graphic r...|              2|                 3| negative|
|    2|2021-02-26 15:40:44|something i want ...|please don’t knoc...|           4|  2021-02-26|          15|           6|            2|        2021|something i want ...|              7|                66| negative|
|   29|2021-02-26 15:40:48|ancient/sinnoh va...|first off let me ...|          20|  2021-02-26|          15|           6|            2|        2021|ancientsinnoh var...|              5|               109| positive|
|   22|2021-02-26 15:41:02|they were never g...|let’s be real non...|          14|  2021-02-26|          15|           6|            2|        2021|they were never g...|              7|               185| negative|
+-----+-------------------+--------------------+--------------------+------------+------------+------------+------------+-------------+------------+--------------------+---------------+------------------+---------+
only showing top 20 rows
                                                                                

Aggregate the results data by created date and sentiment

Convert the dataframes to Pandas

from pyspark.sql.functions import col, count

# Group by date and sentiment, then count occurrences
pokemon_agg = processed_pokemon_result.groupBy("created_date", "sentiment").agg(count("*").alias("count"))

# Convert to Pandas DataFrame for plotting
pandas_pokemon_agg = pokemon_agg.toPandas()
from pyspark.sql.functions import col, count

# Group by date and sentiment, then count occurrences
pokemon_agg_sub = processed_pokemon_result_sub.groupBy("created_date", "sentiment").agg(count("*").alias("count"))

# Convert to Pandas DataFrame for plotting
pandas_pokemon_agg_sub = pokemon_agg_sub.toPandas()
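The same group-and-count can be reproduced in pandas after collection, which is a handy way to spot-check the Spark aggregation on a small sample (toy data below, not the real submissions):

```python
import pandas as pd

# Toy rows standing in for the collected submissions (hypothetical data)
df = pd.DataFrame({
    "created_date": ["2021-01-05", "2021-01-05", "2021-01-06"],
    "sentiment": ["positive", "negative", "positive"],
})

# Equivalent of groupBy("created_date", "sentiment").agg(count("*").alias("count"))
agg = df.groupby(["created_date", "sentiment"]).size().reset_index(name="count")

print(agg.to_string(index=False))
```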
                                                                                
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"

Plot: Subreddit “pokemon” Comments+Submissions Sentiment Over Time

import plotly.graph_objects as go
import pandas as pd

# Convert 'created_date' to datetime and sort by this column for comments
pandas_pokemon_agg['created_date'] = pd.to_datetime(pandas_pokemon_agg['created_date'])
pandas_pokemon_agg.sort_values(by='created_date', inplace=True)

# Convert 'created_date' to datetime and sort by this column for submissions
pandas_pokemon_agg_sub['created_date'] = pd.to_datetime(pandas_pokemon_agg_sub['created_date'])
pandas_pokemon_agg_sub.sort_values(by='created_date', inplace=True)

def create_pokemon_time_series_scatter(df, title):
    fig = go.Figure()

    # Define custom colors for sentiments
    sentiment_order = ['positive', 'neutral', 'negative']  # Define the order of sentiments
    colors = {'positive': '#42a63c', 'neutral': '#42a1b9', 'negative': '#d13a47'}

    # Add traces in the order of positive, neutral, and negative
    for sentiment in sentiment_order:
        sentiment_df = df[df['sentiment'] == sentiment]
        
        fig.add_trace(go.Scatter(
            x=sentiment_df['created_date'],
            y=sentiment_df['count'],
            mode='lines+markers',
            name=sentiment,
            line=dict(color=colors[sentiment]),
            marker=dict(color=colors[sentiment], size=4)  # Adjusting marker size to 4 for smaller points
        ))

    fig.update_layout(
        title=title,
        xaxis_title='Date',
        yaxis_title='Count',
        xaxis=dict(
            rangeselector=dict(
                buttons=list([
                    dict(count=1, label='1m', step='month', stepmode='backward'),
                    dict(count=6, label='6m', step='month', stepmode='backward'),
                    dict(step='all')
                ])
            ),
            type='date'
        ),
        legend=dict(traceorder='normal')  # Ensure the legend follows the trace order
    )
    return fig

# Creating a connected scatter plot for the Pokemon subreddit - comments
fig_pokemon_subreddit = create_pokemon_time_series_scatter(pandas_pokemon_agg, 'Pokemon Comments Subreddit Sentiment Over Time')
# Creating a connected scatter plot for the Pokemon subreddit - submissions
fig_pokemon_subreddit_sub = create_pokemon_time_series_scatter(pandas_pokemon_agg_sub, 'Pokemon Submissions Subreddit Sentiment Over Time')

# Display the plot
fig_pokemon_subreddit.show()
fig_pokemon_subreddit_sub.show()

Table: Sentiment by Day of the Week

from pyspark.sql.functions import date_format

# Group by day of the week and sentiment, then count occurrences
sentiment_by_weekofday_table = processed_pokemon_result_sub.groupBy('created_week', 'sentiment').agg(
    count('sentiment').alias('Count')
).orderBy('created_week')

# Show the table
sentiment_by_weekofday_table.show()
+------------+---------+-----+
|created_week|sentiment|Count|
+------------+---------+-----+
|           1| negative| 6553|
|           1|  neutral|  783|
|           1| positive| 6262|
|           2|  neutral|  662|
|           2| positive| 6177|
|           2| negative| 5657|
|           3| positive| 4884|
|           3|  neutral|  589|
|           3| negative| 4708|
|           4|  neutral|  585|
|           4| positive| 4962|
|           4| negative| 4754|
|           5| positive| 4808|
|           5| negative| 4671|
|           5|  neutral|  532|
|           6|  neutral|  650|
|           6| negative| 5319|
|           6| positive| 5226|
|           7|  neutral|  641|
|           7| positive| 5826|
+------------+---------+-----+
only showing top 20 rows
                                                                                

Table: Sentiment by Hour

from pyspark.sql.functions import col, count

# Group by hour and sentiment, then count occurrences
sentiment_by_hour_table = processed_pokemon_result_sub.groupBy('created_hour', 'sentiment').agg(
    count('sentiment').alias('Count')
).orderBy('created_hour')

# Show the table
sentiment_by_hour_table.show()
+------------+---------+-----+
|created_hour|sentiment|Count|
+------------+---------+-----+
|           0|  neutral|  209|
|           0| positive| 1748|
|           0| negative| 1931|
|           1| negative| 1707|
|           1|  neutral|  203|
|           1| positive| 1738|
|           2|  neutral|  190|
|           2| positive| 1717|
|           2| negative| 1717|
|           3| positive| 1548|
|           3|  neutral|  174|
|           3| negative| 1668|
|           4|  neutral|  193|
|           4| negative| 1498|
|           4| positive| 1516|
|           5| positive| 1287|
|           5| negative| 1311|
|           5|  neutral|  163|
|           6|  neutral|  130|
|           6| negative| 1129|
+------------+---------+-----+
only showing top 20 rows
                                                                                

Table: Monthly Sentiment Summary

from pyspark.sql.functions import count

monthly_sentiment_summary_table = processed_pokemon_result_sub.groupBy(
    "created_month",
    "created_year",
    "sentiment"
).agg(
    count("sentiment").alias("Count")
).orderBy("created_year", "created_month")

monthly_sentiment_summary_table.show()
+-------------+------------+---------+-----+
|created_month|created_year|sentiment|Count|
+-------------+------------+---------+-----+
|            1|        2021| positive| 1305|
|            1|        2021| negative| 1176|
|            1|        2021|  neutral|  157|
|            2|        2021| positive| 1837|
|            2|        2021| negative| 2007|
|            2|        2021|  neutral|  251|
|            3|        2021| negative| 1733|
|            3|        2021| positive| 1752|
|            3|        2021|  neutral|  240|
|            4|        2021| negative| 1081|
|            4|        2021| positive| 1167|
|            4|        2021|  neutral|  137|
|            5|        2021|  neutral|  149|
|            5|        2021| positive| 1184|
|            5|        2021| negative| 1215|
|            6|        2021| positive| 1049|
|            6|        2021| negative|  989|
|            6|        2021|  neutral|  136|
|            7|        2021|  neutral|  129|
|            7|        2021| positive| 1298|
+-------------+------------+---------+-----+
only showing top 20 rows
                                                                                

Convert the Table to Pandas and Save to CSV

import plotly.express as px

# Convert to Pandas DataFrames
pandas_sentiment_by_day_of_week = sentiment_by_weekofday_table.toPandas()
pandas_sentiment_by_hour = sentiment_by_hour_table.toPandas()
pandas_monthly_sentiment_summary = monthly_sentiment_summary_table.toPandas()
                                                                                
# Save the Pandas DataFrames to CSV files in local environment
pandas_sentiment_by_day_of_week.to_csv('sentiment_by_day_of_week.csv', index=False)
pandas_sentiment_by_hour.to_csv('sentiment_by_hour.csv', index=False)
pandas_monthly_sentiment_summary.to_csv('monthly_sentiment_summary.csv', index=False)
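Passing `index=False` matters here: without it, the row index is written as an unnamed first column and comes back as `Unnamed: 0` on reload. A minimal round trip with toy data:

```python
import pandas as pd
from io import StringIO

# Toy summary table (hypothetical counts)
df = pd.DataFrame({
    "created_week": [1, 2],
    "sentiment": ["positive", "negative"],
    "Count": [6262, 5657],
})

buf = StringIO()
df.to_csv(buf, index=False)   # same call as above, written to an in-memory buffer
buf.seek(0)
reloaded = pd.read_csv(buf)

print(list(reloaded.columns))  # ['created_week', 'sentiment', 'Count']
```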
import pandas as pd

# Load the data
day_file_path = '../../data/csv/pokemon_sentiment_by_day_of_week.csv'
hour_file_path = '../../data/csv/pokemon_sentiment_by_hour.csv'
pandas_sentiment_by_day_of_week = pd.read_csv(day_file_path)
pandas_sentiment_by_hour = pd.read_csv(hour_file_path)

Plot: Sentiment Trend by Day of the Week

# Define custom colors for sentiments
colors = {'positive': '#42a63c', 'neutral': '#42a1b9', 'negative': '#d13a47'}

# Mapping the days of the week to their names
day_of_week_mapping = {
    1: 'Monday', 
    2: 'Tuesday', 
    3: 'Wednesday', 
    4: 'Thursday', 
    5: 'Friday', 
    6: 'Saturday', 
    7: 'Sunday'
}
pandas_sentiment_by_day_of_week['Day_Name'] = pandas_sentiment_by_day_of_week['created_week'].map(day_of_week_mapping)
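One caveat with `Series.map`: codes missing from the dict map to NaN silently, so if `created_week` used a different encoding (for example Sunday = 1, as Spark's `dayofweek` does) some day names would quietly go missing. A quick toy check:

```python
import pandas as pd

day_of_week_mapping = {1: 'Monday', 2: 'Tuesday', 3: 'Wednesday',
                       4: 'Thursday', 5: 'Friday', 6: 'Saturday', 7: 'Sunday'}

codes = pd.Series([1, 3, 7, 0])        # 0 is not a key in the mapping
names = codes.map(day_of_week_mapping)

print(names.tolist())                  # ['Monday', 'Wednesday', 'Sunday', nan]
```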

fig_line = px.line(
    pandas_sentiment_by_day_of_week, 
    x='Day_Name', 
    y='Count', 
    color='sentiment',
    color_discrete_map=colors,
    labels={'Day_Name': 'Day of the Week', 'Count': 'Total Count', 'sentiment': 'Sentiment'},
    title='Sentiment Trend by Day of the Week'
)
fig_line.show()

Plot: Sentiment Distribution by Day of the Week

import plotly.express as px
import pandas as pd

# Convert sentiment to a categorical type with specific order
pandas_sentiment_by_day_of_week['sentiment'] = pd.Categorical(
    pandas_sentiment_by_day_of_week['sentiment'],
    categories=['neutral', 'negative', 'positive'],  # Order of categories
    ordered=True
)

# Make the day names an ordered categorical too; a plain string sort would put Friday first
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
pandas_sentiment_by_day_of_week['Day_Name'] = pd.Categorical(
    pandas_sentiment_by_day_of_week['Day_Name'], categories=day_order, ordered=True
)

# Sort the DataFrame by 'Day_Name' and then by 'sentiment'
pandas_sentiment_by_day_of_week.sort_values(by=['Day_Name', 'sentiment'], inplace=True)

# Create the area plot with custom colors
fig_area = px.area(
    pandas_sentiment_by_day_of_week, 
    x='Day_Name', 
    y='Count', 
    color='sentiment',
    color_discrete_map=colors,
    labels={'Day_Name': 'Day of the Week', 'Count': 'Total Count', 'sentiment': 'Sentiment'},
    title='Sentiment Distribution by Day of the Week'
)

# Show the plot
fig_area.show()
import plotly.express as px

# Create the bar plot with custom colors and day names
fig_day_of_week = px.bar(
    pandas_sentiment_by_day_of_week, 
    x='Day_Name', 
    y='Count', 
    color='sentiment',
    color_discrete_map=colors,
    labels={'Day_Name': 'Day of the Week', 'Count': 'Total Count', 'sentiment': 'Sentiment'},
    title='Sentiment by Day of the Week',
    category_orders={"Day_Name": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]}  # Ensuring the days are in the correct order
)

# Show the plot
fig_day_of_week.show()

Plot: Sentiment Trend by Hour of the Day

fig_line_hour = px.line(
    pandas_sentiment_by_hour, 
    x='created_hour', 
    y='Count', 
    color='sentiment',
    color_discrete_map=colors,
    labels={'created_hour': 'Hour of the Day', 'Count': 'Total Count', 'sentiment': 'Sentiment'},
    title='Sentiment Trend by Hour of the Day'
)
fig_line_hour.show()

Plot: Sentiment by Hour of the Day

# Convert sentiment to a categorical type with specific order
pandas_sentiment_by_hour['sentiment'] = pd.Categorical(
    pandas_sentiment_by_hour['sentiment'],
    categories=['neutral', 'negative', 'positive'],  # Order of categories
    ordered=True
)

# Sort by 'sentiment' and then 'created_hour' so the traces stack consistently
pandas_sentiment_by_hour.sort_values(by=['sentiment', 'created_hour'], inplace=True)
colors = {'positive': '#42a63c', 'neutral': '#42a1b9', 'negative': '#d13a47'}

fig_area_hour = px.area(
    pandas_sentiment_by_hour, 
    x='created_hour', 
    y='Count', 
    color='sentiment',
    color_discrete_map=colors,
    labels={'created_hour': 'Hour of the Day', 'Count': 'Total Count', 'sentiment': 'Sentiment'},
    title='Sentiment Distribution by Hour of the Day'
)
fig_area_hour.show()
fig_hour = px.bar(
    pandas_sentiment_by_hour, 
    x='created_hour', 
    y='Count', 
    color='sentiment',
    color_discrete_map=colors,
    labels={'created_hour': 'Hour of the Day', 'Count': 'Total Count', 'sentiment': 'Sentiment'},
    title='Sentiment by Hour of the Day'
)
fig_hour.show()
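The ordered pd.Categorical above is what controls the trace stacking: sort_values follows the declared category order rather than alphabetical order. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical miniature of pandas_sentiment_by_hour
df = pd.DataFrame({'sentiment': ['positive', 'neutral', 'negative'],
                   'Count': [5, 3, 2]})

# Without an ordered categorical, sorting would be alphabetical:
# negative, neutral, positive
df['sentiment'] = pd.Categorical(
    df['sentiment'],
    categories=['neutral', 'negative', 'positive'],  # stacking order
    ordered=True,
)
df = df.sort_values('sentiment')
```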

Plot: Monthly Sentiment Summary

import plotly.express as px
import pandas as pd

# Combine year and zero-padded month so 'Year-Month' sorts chronologically
pandas_monthly_sentiment_summary['Year-Month'] = (
    pandas_monthly_sentiment_summary['created_year'].astype(str)
    + '-'
    + pandas_monthly_sentiment_summary['created_month'].astype(str).str.zfill(2)
)

# Define custom colors for sentiments
colors = {'positive': '#42a63c', 'neutral': '#42a1b9', 'negative': '#d13a47'}

# Create the line plot with custom colors
fig_monthly = px.line(
    pandas_monthly_sentiment_summary, 
    x='Year-Month', 
    y='Count', 
    color='sentiment',
    color_discrete_map=colors,
    labels={'Year-Month': 'Month of the Year', 'Count': 'Total Count'}, 
    title='Monthly Sentiment Summary'
)

# Show the plot
fig_monthly.show()
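Note that concatenating the raw month number produces labels like '2023-2' that sort lexically after '2023-10', scrambling the x-axis. Zero-padding the month keeps the string order chronological:

```python
# Demonstrate why the month must be zero-padded before concatenation
months = [1, 2, 10, 11]
year = 2023

# Lexicographic order puts '2023-10' before '2023-2'
unpadded = sorted(f"{year}-{m}" for m in months)

# Zero-padding restores chronological order
padded = sorted(f"{year}-{m:02d}" for m in months)
```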
output = "project/nlp/processed/pokemon/comments"
my_bucket = 'sagemaker-us-east-1-216384626106'
s3_path = f"s3a://{my_bucket}/{output}"

print(f"writing cleaned comments to {s3_path}")
processed_pokemon_result.write.parquet(s3_path, mode="overwrite")
output = "project/nlp/processed/pokemon/submissions"
my_bucket = 'sagemaker-us-east-1-216384626106'
s3_path = f"s3a://{my_bucket}/{output}"

print(f"writing cleaned submissions to {s3_path}")
processed_pokemon_result_sub.write.parquet(s3_path, mode="overwrite")
writing cleaned submissions to s3a://sagemaker-us-east-1-216384626106/project/nlp/processed/pokemon/submissions
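To pull the cleaned data back for later analysis, the same s3a:// URI can be rebuilt and read with Spark. A sketch, where `s3a_path` is a hypothetical helper and `spark` is assumed to be an active SparkSession with the S3A connector configured:

```python
def s3a_path(bucket: str, key: str) -> str:
    """Build an s3a:// URI for Spark I/O (hypothetical helper)."""
    return f"s3a://{bucket}/{key.lstrip('/')}"

comments_path = s3a_path('sagemaker-us-east-1-216384626106',
                         'project/nlp/processed/pokemon/comments')

# Requires a running SparkSession:
# comments_df = spark.read.parquet(comments_path)
```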