Code: NLP - EDA & Pipeline

# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.4.0

# install spark-nlp
%pip install spark-nlp==5.1.3
%pip install sparknlp

# install plotly
%pip install plotly

# install yfinance for external data
%pip install yfinance

# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 23.3.1
  latest version: 23.10.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.10.0



## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - openjdk


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2023.08.22 |       h06a4308_0         123 KB
    certifi-2023.11.17         |  py310h06a4308_0         158 KB
    openjdk-11.0.13            |       h87a67e3_0       341.0 MB
    ------------------------------------------------------------
                                           Total:       341.3 MB

The following NEW packages will be INSTALLED:

  openjdk            pkgs/main/linux-64::openjdk-11.0.13-h87a67e3_0 

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2023.7.2~ --> pkgs/main::ca-certificates-2023.08.22-h06a4308_0 
  certifi            conda-forge/noarch::certifi-2023.7.22~ --> pkgs/main/linux-64::certifi-2023.11.17-py310h06a4308_0 



Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.
Collecting pyspark==3.4.0
  Using cached pyspark-3.4.0-py2.py3-none-any.whl
Collecting py4j==0.10.9.7 (from pyspark==3.4.0)
  Using cached py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.7 pyspark-3.4.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Collecting spark-nlp==5.1.3
  Obtaining dependency information for spark-nlp==5.1.3 from https://files.pythonhosted.org/packages/cd/7d/bc0eca4c9ec4c9c1d9b28c42c2f07942af70980a7d912d0aceebf8db32dd/spark_nlp-5.1.3-py2.py3-none-any.whl.metadata
  Using cached spark_nlp-5.1.3-py2.py3-none-any.whl.metadata (53 kB)
Using cached spark_nlp-5.1.3-py2.py3-none-any.whl (537 kB)
Installing collected packages: spark-nlp
Successfully installed spark-nlp-5.1.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Collecting sparknlp
  Using cached sparknlp-1.0.0-py3-none-any.whl (1.4 kB)
Requirement already satisfied: spark-nlp in /opt/conda/lib/python3.10/site-packages (from sparknlp) (5.1.3)
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from sparknlp) (1.26.0)
Installing collected packages: sparknlp
Successfully installed sparknlp-1.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: plotly in /opt/conda/lib/python3.10/site-packages (5.9.0)
Requirement already satisfied: tenacity>=6.2.0 in /opt/conda/lib/python3.10/site-packages (from plotly) (8.0.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Collecting yfinance
  Obtaining dependency information for yfinance from https://files.pythonhosted.org/packages/1c/19/bf19123baf16a55fd38cbb100b5a49380b9b6db7279987034689d11254c7/yfinance-0.2.32-py2.py3-none-any.whl.metadata
  Using cached yfinance-0.2.32-py2.py3-none-any.whl.metadata (11 kB)
Requirement already satisfied: pandas>=1.3.0 in /opt/conda/lib/python3.10/site-packages (from yfinance) (1.4.4)
Requirement already satisfied: numpy>=1.16.5 in /opt/conda/lib/python3.10/site-packages (from yfinance) (1.26.0)
Requirement already satisfied: requests>=2.31 in /opt/conda/lib/python3.10/site-packages (from yfinance) (2.31.0)
Collecting multitasking>=0.0.7 (from yfinance)
  Using cached multitasking-0.0.11-py3-none-any.whl (8.5 kB)
Requirement already satisfied: lxml>=4.9.1 in /opt/conda/lib/python3.10/site-packages (from yfinance) (4.9.3)
Requirement already satisfied: appdirs>=1.4.4 in /opt/conda/lib/python3.10/site-packages (from yfinance) (1.4.4)
Collecting pytz>=2022.5 (from yfinance)
  Obtaining dependency information for pytz>=2022.5 from https://files.pythonhosted.org/packages/32/4d/aaf7eff5deb402fd9a24a1449a8119f00d74ae9c2efa79f8ef9994261fc2/pytz-2023.3.post1-py2.py3-none-any.whl.metadata
  Using cached pytz-2023.3.post1-py2.py3-none-any.whl.metadata (22 kB)
Collecting frozendict>=2.3.4 (from yfinance)
  Obtaining dependency information for frozendict>=2.3.4 from https://files.pythonhosted.org/packages/bf/e8/6eb098234b607ed660501a951b4b9190bf7bceff10a66cda828f32ad6e1a/frozendict-2.3.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading frozendict-2.3.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Collecting peewee>=3.16.2 (from yfinance)
  Using cached peewee-3.17.0-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: beautifulsoup4>=4.11.1 in /opt/conda/lib/python3.10/site-packages (from yfinance) (4.11.1)
Collecting html5lib>=1.1 (from yfinance)
  Using cached html5lib-1.1-py2.py3-none-any.whl (112 kB)
Requirement already satisfied: soupsieve>1.2 in /opt/conda/lib/python3.10/site-packages (from beautifulsoup4>=4.11.1->yfinance) (2.3.1)
Requirement already satisfied: six>=1.9 in /opt/conda/lib/python3.10/site-packages (from html5lib>=1.1->yfinance) (1.16.0)
Requirement already satisfied: webencodings in /opt/conda/lib/python3.10/site-packages (from html5lib>=1.1->yfinance) (0.5.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.10/site-packages (from pandas>=1.3.0->yfinance) (2.8.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests>=2.31->yfinance) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests>=2.31->yfinance) (3.3)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests>=2.31->yfinance) (2.0.6)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests>=2.31->yfinance) (2023.11.17)
Using cached yfinance-0.2.32-py2.py3-none-any.whl (68 kB)
Downloading frozendict-2.3.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (115 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 115.9/115.9 kB 1.4 MB/s eta 0:00:00
Using cached pytz-2023.3.post1-py2.py3-none-any.whl (502 kB)
Installing collected packages: pytz, peewee, multitasking, html5lib, frozendict, yfinance
  Attempting uninstall: pytz
    Found existing installation: pytz 2022.1
    Uninstalling pytz-2022.1:
      Successfully uninstalled pytz-2022.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spyder 5.3.3 requires pyqt5<5.16, which is not installed.
spyder 5.3.3 requires pyqtwebengine<5.16, which is not installed.
jupyterlab-server 2.10.3 requires jupyter-server~=1.4, but you have jupyter-server 2.7.3 which is incompatible.
sagemaker-datawrangler 0.4.3 requires sagemaker-data-insights==0.4.0, but you have sagemaker-data-insights 0.3.3 which is incompatible.
spyder 5.3.3 requires ipython<8.0.0,>=7.31.1, but you have ipython 8.16.1 which is incompatible.
spyder 5.3.3 requires pylint<3.0,>=2.5.0, but you have pylint 3.0.1 which is incompatible.
Successfully installed frozendict-2.3.10 html5lib-1.1 multitasking-0.0.11 peewee-3.17.0 pytz-2023.3.post1 yfinance-0.2.32
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.1.3.jar
--2023-11-29 23:38:06--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.1.3.jar
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.200.72, 52.217.137.208, 52.216.106.246, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.200.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 708534094 (676M) [application/java-archive]
Saving to: ‘spark-nlp-assembly-5.1.3.jar’

spark-nlp-assembly- 100%[===================>] 675.71M  26.6MB/s    in 26s     

2023-11-29 23:38:35 (26.4 MB/s) - ‘spark-nlp-assembly-5.1.3.jar’ saved [708534094/708534094]
## Import packages
import json
import sparknlp
import numpy as np
import pandas as pd
from sparknlp.base import *
from pyspark.ml import Pipeline
from sparknlp.annotator import *
import pyspark.sql.functions as F
from pyspark.sql.functions import mean, stddev, max, min, count, percentile_approx, year, month, dayofmonth, ceil, col, dayofweek, hour, explode, date_format, lower, size, split, regexp_replace, isnan, when
from pyspark.sql import SparkSession
from sparknlp.pretrained import PretrainedPipeline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import

spark = SparkSession.builder \
    .appName("Spark NLP")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3,org.apache.hadoop:hadoop-aws:3.2.2")\
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )\
    .getOrCreate()

print(f"Spark version: {spark.version}")
print(f"sparknlp version: {sparknlp.version()}")
Warning: Ignoring non-Spark config property: fs.s3a.aws.credentials.provider
:: loading settings :: url = jar:file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-f3336de6-f72a-4c19-9174-5b0599cd1773;1.0
    confs: [default]
    found com.johnsnowlabs.nlp#spark-nlp_2.12;5.1.3 in central
    found com.typesafe#config;1.4.2 in central
    found org.rocksdb#rocksdbjni;6.29.5 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.828 in central
    found com.github.universal-automata#liblevenshtein;3.0.0 in central
    found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
    found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
    found com.google.code.gson#gson;2.3 in central
    found it.unimi.dsi#fastutil;7.0.12 in central
    found org.projectlombok#lombok;1.16.8 in central
    found com.google.cloud#google-cloud-storage;2.20.1 in central
    found com.google.guava#guava;31.1-jre in central
    found com.google.guava#failureaccess;1.0.1 in central
    found com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava in central
    found com.google.errorprone#error_prone_annotations;2.18.0 in central
    found com.google.j2objc#j2objc-annotations;1.3 in central
    found com.google.http-client#google-http-client;1.43.0 in central
    found io.opencensus#opencensus-contrib-http-util;0.31.1 in central
    found com.google.http-client#google-http-client-jackson2;1.43.0 in central
    found com.google.http-client#google-http-client-gson;1.43.0 in central
    found com.google.api-client#google-api-client;2.2.0 in central
    found commons-codec#commons-codec;1.15 in central
    found com.google.oauth-client#google-oauth-client;1.34.1 in central
    found com.google.http-client#google-http-client-apache-v2;1.43.0 in central
    found com.google.apis#google-api-services-storage;v1-rev20220705-2.0.0 in central
    found com.google.code.gson#gson;2.10.1 in central
    found com.google.cloud#google-cloud-core;2.12.0 in central
    found io.grpc#grpc-context;1.53.0 in central
    found com.google.auto.value#auto-value-annotations;1.10.1 in central
    found com.google.auto.value#auto-value;1.10.1 in central
    found javax.annotation#javax.annotation-api;1.3.2 in central
    found commons-logging#commons-logging;1.2 in central
    found com.google.cloud#google-cloud-core-http;2.12.0 in central
    found com.google.http-client#google-http-client-appengine;1.43.0 in central
    found com.google.api#gax-httpjson;0.108.2 in central
    found com.google.cloud#google-cloud-core-grpc;2.12.0 in central
    found io.grpc#grpc-alts;1.53.0 in central
    found io.grpc#grpc-grpclb;1.53.0 in central
    found org.conscrypt#conscrypt-openjdk-uber;2.5.2 in central
    found io.grpc#grpc-auth;1.53.0 in central
    found io.grpc#grpc-protobuf;1.53.0 in central
    found io.grpc#grpc-protobuf-lite;1.53.0 in central
    found io.grpc#grpc-core;1.53.0 in central
    found com.google.api#gax;2.23.2 in central
    found com.google.api#gax-grpc;2.23.2 in central
    found com.google.auth#google-auth-library-credentials;1.16.0 in central
    found com.google.auth#google-auth-library-oauth2-http;1.16.0 in central
    found com.google.api#api-common;2.6.2 in central
    found io.opencensus#opencensus-api;0.31.1 in central
    found com.google.api.grpc#proto-google-iam-v1;1.9.2 in central
    found com.google.protobuf#protobuf-java;3.21.12 in central
    found com.google.protobuf#protobuf-java-util;3.21.12 in central
    found com.google.api.grpc#proto-google-common-protos;2.14.2 in central
    found org.threeten#threetenbp;1.6.5 in central
    found com.google.api.grpc#proto-google-cloud-storage-v2;2.20.1-alpha in central
    found com.google.api.grpc#grpc-google-cloud-storage-v2;2.20.1-alpha in central
    found com.google.api.grpc#gapic-google-cloud-storage-v2;2.20.1-alpha in central
    found com.fasterxml.jackson.core#jackson-core;2.14.2 in central
    found com.google.code.findbugs#jsr305;3.0.2 in central
    found io.grpc#grpc-api;1.53.0 in central
    found io.grpc#grpc-stub;1.53.0 in central
    found org.checkerframework#checker-qual;3.31.0 in central
    found io.perfmark#perfmark-api;0.26.0 in central
    found com.google.android#annotations;4.1.1.4 in central
    found org.codehaus.mojo#animal-sniffer-annotations;1.22 in central
    found io.opencensus#opencensus-proto;0.2.0 in central
    found io.grpc#grpc-services;1.53.0 in central
    found com.google.re2j#re2j;1.6 in central
    found io.grpc#grpc-netty-shaded;1.53.0 in central
    found io.grpc#grpc-googleapis;1.53.0 in central
    found io.grpc#grpc-xds;1.53.0 in central
    found com.navigamez#greex;1.0 in central
    found dk.brics.automaton#automaton;1.11-8 in central
    found com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.4.4 in central
    found com.microsoft.onnxruntime#onnxruntime;1.15.0 in central
    found org.apache.hadoop#hadoop-aws;3.2.2 in central
:: resolution report :: resolve 5861ms :: artifacts dl 1134ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.828 from central in [default]
    com.fasterxml.jackson.core#jackson-core;2.14.2 from central in [default]
    com.github.universal-automata#liblevenshtein;3.0.0 from central in [default]
    com.google.android#annotations;4.1.1.4 from central in [default]
    com.google.api#api-common;2.6.2 from central in [default]
    com.google.api#gax;2.23.2 from central in [default]
    com.google.api#gax-grpc;2.23.2 from central in [default]
    com.google.api#gax-httpjson;0.108.2 from central in [default]
    com.google.api-client#google-api-client;2.2.0 from central in [default]
    com.google.api.grpc#gapic-google-cloud-storage-v2;2.20.1-alpha from central in [default]
    com.google.api.grpc#grpc-google-cloud-storage-v2;2.20.1-alpha from central in [default]
    com.google.api.grpc#proto-google-cloud-storage-v2;2.20.1-alpha from central in [default]
    com.google.api.grpc#proto-google-common-protos;2.14.2 from central in [default]
    com.google.api.grpc#proto-google-iam-v1;1.9.2 from central in [default]
    com.google.apis#google-api-services-storage;v1-rev20220705-2.0.0 from central in [default]
    com.google.auth#google-auth-library-credentials;1.16.0 from central in [default]
    com.google.auth#google-auth-library-oauth2-http;1.16.0 from central in [default]
    com.google.auto.value#auto-value;1.10.1 from central in [default]
    com.google.auto.value#auto-value-annotations;1.10.1 from central in [default]
    com.google.cloud#google-cloud-core;2.12.0 from central in [default]
    com.google.cloud#google-cloud-core-grpc;2.12.0 from central in [default]
    com.google.cloud#google-cloud-core-http;2.12.0 from central in [default]
    com.google.cloud#google-cloud-storage;2.20.1 from central in [default]
    com.google.code.findbugs#jsr305;3.0.2 from central in [default]
    com.google.code.gson#gson;2.10.1 from central in [default]
    com.google.errorprone#error_prone_annotations;2.18.0 from central in [default]
    com.google.guava#failureaccess;1.0.1 from central in [default]
    com.google.guava#guava;31.1-jre from central in [default]
    com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava from central in [default]
    com.google.http-client#google-http-client;1.43.0 from central in [default]
    com.google.http-client#google-http-client-apache-v2;1.43.0 from central in [default]
    com.google.http-client#google-http-client-appengine;1.43.0 from central in [default]
    com.google.http-client#google-http-client-gson;1.43.0 from central in [default]
    com.google.http-client#google-http-client-jackson2;1.43.0 from central in [default]
    com.google.j2objc#j2objc-annotations;1.3 from central in [default]
    com.google.oauth-client#google-oauth-client;1.34.1 from central in [default]
    com.google.protobuf#protobuf-java;3.21.12 from central in [default]
    com.google.protobuf#protobuf-java-util;3.21.12 from central in [default]
    com.google.re2j#re2j;1.6 from central in [default]
    com.johnsnowlabs.nlp#spark-nlp_2.12;5.1.3 from central in [default]
    com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.4.4 from central in [default]
    com.microsoft.onnxruntime#onnxruntime;1.15.0 from central in [default]
    com.navigamez#greex;1.0 from central in [default]
    com.typesafe#config;1.4.2 from central in [default]
    commons-codec#commons-codec;1.15 from central in [default]
    commons-logging#commons-logging;1.2 from central in [default]
    dk.brics.automaton#automaton;1.11-8 from central in [default]
    io.grpc#grpc-alts;1.53.0 from central in [default]
    io.grpc#grpc-api;1.53.0 from central in [default]
    io.grpc#grpc-auth;1.53.0 from central in [default]
    io.grpc#grpc-context;1.53.0 from central in [default]
    io.grpc#grpc-core;1.53.0 from central in [default]
    io.grpc#grpc-googleapis;1.53.0 from central in [default]
    io.grpc#grpc-grpclb;1.53.0 from central in [default]
    io.grpc#grpc-netty-shaded;1.53.0 from central in [default]
    io.grpc#grpc-protobuf;1.53.0 from central in [default]
    io.grpc#grpc-protobuf-lite;1.53.0 from central in [default]
    io.grpc#grpc-services;1.53.0 from central in [default]
    io.grpc#grpc-stub;1.53.0 from central in [default]
    io.grpc#grpc-xds;1.53.0 from central in [default]
    io.opencensus#opencensus-api;0.31.1 from central in [default]
    io.opencensus#opencensus-contrib-http-util;0.31.1 from central in [default]
    io.opencensus#opencensus-proto;0.2.0 from central in [default]
    io.perfmark#perfmark-api;0.26.0 from central in [default]
    it.unimi.dsi#fastutil;7.0.12 from central in [default]
    javax.annotation#javax.annotation-api;1.3.2 from central in [default]
    org.apache.hadoop#hadoop-aws;3.2.2 from central in [default]
    org.checkerframework#checker-qual;3.31.0 from central in [default]
    org.codehaus.mojo#animal-sniffer-annotations;1.22 from central in [default]
    org.conscrypt#conscrypt-openjdk-uber;2.5.2 from central in [default]
    org.projectlombok#lombok;1.16.8 from central in [default]
    org.rocksdb#rocksdbjni;6.29.5 from central in [default]
    org.threeten#threetenbp;1.6.5 from central in [default]
    :: evicted modules:
    com.google.protobuf#protobuf-java-util;3.0.0-beta-3 by [com.google.protobuf#protobuf-java-util;3.21.12] in [default]
    com.google.protobuf#protobuf-java;3.0.0-beta-3 by [com.google.protobuf#protobuf-java;3.21.12] in [default]
    com.google.code.gson#gson;2.3 by [com.google.code.gson#gson;2.10.1] in [default]
    com.amazonaws#aws-java-sdk-bundle;1.11.563 by [com.amazonaws#aws-java-sdk-bundle;1.11.828] in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   77  |   0   |   0   |   4   ||   73  |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-f3336de6-f72a-4c19-9174-5b0599cd1773
    confs: [default]
    0 artifacts copied, 73 already retrieved (0kB/348ms)
23/11/29 23:38:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark version: 3.4.0
sparknlp version: 5.1.3

External data

# Importing the yfinance package
import datetime
import yfinance as yf
 
# Set the start and end date
start_date = '2021-01-01'
end_date = '2023-04-01'
 
# Add multiple space separated tickers here
ticker = 'NTDOY TYO SONY'

# Get the data
stocks = yf.download(ticker, start_date, end_date)['Adj Close']
stocks["Date"] = stocks.index
stocks.reset_index(drop=True, inplace=True)
# Print the last 5 rows
print(stocks.tail())

# Export data to a CSV file
# stocks.to_csv("../../data/csv/stocks.csv")
[*********************100%%**********************]  3 of 3 completed
     NTDOY       SONY        TYO       Date
560   9.65  86.639999  11.965896 2023-03-27
561   9.61  85.820000  11.985530 2023-03-28
562   9.77  87.870003  12.044427 2023-03-29
563   9.63  89.309998  12.054242 2023-03-30
564   9.69  90.650002  11.818654 2023-03-31
# Read the saved data
daily_sub = pd.read_csv("../../data/csv/sub_daily.csv")
daily_com = pd.read_csv("../../data/csv/com_daily.csv")
stocks = pd.read_csv("../../data/csv/stocks.csv")
# merge the datasets together based on date
merged_df = pd.concat([daily_sub.set_index("created_date"), daily_com.set_index("created_date"), stocks.set_index("Date")], axis=1)
merged_df["Date"] = merged_df.index
merged_df.reset_index(drop=True, inplace=True)
# convert date column to datetime and subtract one week
merged_df['week'] = pd.to_datetime(merged_df['Date']) - pd.to_timedelta(7, unit='d')
# merged_df[["Date","total_submissions", "avg_num_comments", "total_comments","NTDOY", "SONY", "TYO"]].tail(10)


# Group by week and calculate sum and mean
agg_columns = {
    'total_submissions': 'sum',
    'total_comments': 'sum',
    'NTDOY': 'mean',
    'SONY': 'mean',
    'TYO': 'mean',
}
weekly_df = merged_df.groupby([pd.Grouper(key='week', freq='W')]).agg(agg_columns).reset_index()

weekly_df.tail(10)
          week  total_submissions  total_comments   NTDOY       SONY        TYO
108 2023-01-22                707         48227.0  10.734  89.502000  11.827641
109 2023-01-29                640         37987.0  10.772  91.054001  11.796592
110 2023-02-05                634         52538.0  10.146  90.430002  12.403984
111 2023-02-12                626         50620.0  10.036  87.999998  12.794034
112 2023-02-19                598         51840.0   9.840  82.777500  13.246667
113 2023-02-26                524         47569.0   9.394  83.927998  13.407248
114 2023-03-05                248         39966.0   9.402  86.634000  13.191847
115 2023-03-12                249         39501.0   9.486  85.529999  12.037221
116 2023-03-19                267         44612.0   9.570  88.206000  11.796666
117 2023-03-26                159         24835.0   9.670  88.058000  11.973750
# Create subplots with a 2x1 grid
fig = make_subplots(rows=2, cols=1, specs=[[{"secondary_y": True}],[{"secondary_y": True}]])


# Add traces for the first subplot (Submissions and Stock Prices)
fig.add_trace(go.Scatter(x=weekly_df['week'], y=weekly_df['total_submissions'], marker_color='#d13a47', opacity=.65, name="Submissions"), row=1, col=1, secondary_y=False)
fig.add_trace(go.Scatter(x=weekly_df['week'], y=weekly_df['NTDOY'], marker_color='#f7c200', opacity=.65, name="NTDOY"), row=1, col=1, secondary_y=True)
fig.add_trace(go.Scatter(x=weekly_df['week'], y=weekly_df['TYO'], marker_color='#42a63c', opacity=.65, name="TYO"), row=1, col=1, secondary_y=True)

# Add traces for the second subplot (Comments and Stock Prices)
fig.add_trace(go.Scatter(x=weekly_df['week'], y=weekly_df['total_comments'], marker_color='#42a1b9', opacity=.65, name="Comments"), row=2, col=1, secondary_y=False)
fig.add_trace(go.Scatter(x=weekly_df['week'], y=weekly_df['NTDOY'], marker_color='#f7c200', opacity=.65, name="NTDOY", showlegend=False), row=2, col=1, secondary_y=True)
fig.add_trace(go.Scatter(x=weekly_df['week'], y=weekly_df['TYO'], marker_color='#42a63c', opacity=.65, name="TYO", showlegend=False), row=2, col=1, secondary_y=True)


# Update the y-axis labels
fig.update_yaxes(title_text="Submissions", secondary_y=False, row=1, col=1)
# fig.update_yaxes(title_text="Mean Stock Prices", secondary_y=True, row=1, col=1, matches='y')
fig.update_yaxes(title_text="Comments", secondary_y=False, row=2, col=1)
fig.update_yaxes(title_text="Mean Stock Prices", secondary_y=True, row=2, col=1)

# Update the layout for the whole figure
fig.update_layout(
    title='The number of comments, submissions, and stock prices for each week',
    # xaxis={'title': 'Date by week'},
    paper_bgcolor='#FFFFFF', 
    plot_bgcolor='rgba(0,0,0,0)',
)

# Show the figure
fig.show()

Data preparation for NLP pipeline

## Read cleaned data from parquet

### Anime subreddits
import sagemaker
# session = sagemaker.Session()
# bucket = session.default_bucket()
bucket = 'sagemaker-us-east-1-315969085594'

sub_bucket_path = f"s3a://{bucket}/project/cleaned/sub"
com_bucket_path = f"s3a://{bucket}/project/cleaned/com"

print(f"reading submissions from {sub_bucket_path}")
sub = spark.read.parquet(sub_bucket_path)  # parquet stores its own schema, so no header option is needed
print(f"shape of the sub dataframe is {sub.count():,}x{len(sub.columns)}")

print(f"reading comments from {com_bucket_path}")
com = spark.read.parquet(com_bucket_path)
print(f"shape of the com dataframe is {com.count():,}x{len(com.columns)}")
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
reading submissions from s3a://sagemaker-us-east-1-315969085594/project/cleaned/sub
23/11/29 23:39:01 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
                                                                                
shape of the sub dataframe is 110,247x22
reading comments from s3a://sagemaker-us-east-1-315969085594/project/cleaned/com
shape of the com dataframe is 6,879,119x19
                                                                                
sub.groupBy('subreddit').count().show()
+---------+------+
|subreddit| count|
+---------+------+
|    anime|110247|
+---------+------+
                                                                                
sub.printSchema()
root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- title: string (nullable = true)
 |-- selftext: string (nullable = true)
 |-- num_comments: long (nullable = true)
 |-- num_crossposts: long (nullable = true)
 |-- over_18: boolean (nullable = true)
 |-- score: long (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- id: string (nullable = true)
 |-- created_date: string (nullable = true)
 |-- created_hour: integer (nullable = true)
 |-- created_week: integer (nullable = true)
 |-- created_month: integer (nullable = true)
 |-- created_year: integer (nullable = true)
 |-- cleaned_title: string (nullable = true)
 |-- title_wordCount: integer (nullable = true)
 |-- cleaned_selftext: string (nullable = true)
 |-- selftext_wordCount: integer (nullable = true)
 |-- contain_pokemon: boolean (nullable = true)
sub.show(3)
+---------+--------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
|subreddit|              author|author_flair_text|        created_utc|               title|            selftext|num_comments|num_crossposts|over_18|score|stickied|    id|created_date|created_hour|created_week|created_month|created_year|       cleaned_title|title_wordCount|    cleaned_selftext|selftext_wordCount|contain_pokemon|
+---------+--------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
|    anime|PsychologicalGift299|             null|2021-04-19 20:42:46|anime movies for ...|so as my fellow o...|          12|             0|  false|    0|   false|mua1uo|  2021-04-19|          20|           2|            4|        2021|anime movies for 420|              4|so as my fellow o...|                64|          false|
|    anime|        Tuttles4ever|             null|2021-04-19 20:48:42|i need a very spe...|are there any ani...|           7|             0|  false|    0|   false|mua6g3|  2021-04-19|          20|           2|            4|        2021|i need a very spe...|             15|are there any ani...|                42|          false|
|    anime|          nemifloras|             null|2021-04-19 20:52:42|any atmospheric a...|i finished reassi...|           9|             0|  false|    0|   false|mua9iu|  2021-04-19|          20|           2|            4|        2021|any atmospheric a...|             10|i finished reassi...|                18|          false|
+---------+--------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
only showing top 3 rows
                                                                                
com.groupBy('subreddit').count().show()
+---------+-------+
|subreddit|  count|
+---------+-------+
|    anime|6879119|
+---------+-------+
                                                                                
com.printSchema()
root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- body: string (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- score: long (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- created_date: string (nullable = true)
 |-- created_hour: integer (nullable = true)
 |-- created_week: integer (nullable = true)
 |-- created_month: integer (nullable = true)
 |-- created_year: integer (nullable = true)
 |-- cleaned: string (nullable = true)
 |-- body_wordCount: integer (nullable = true)
 |-- contain_pokemon: boolean (nullable = true)
com.show(3)
+---------+--------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|subreddit|        author|   author_flair_text|        created_utc|                body|controversiality|score| parent_id|stickied|  link_id|     id|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|contain_pokemon|
+---------+--------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|    anime| DonaldJenkins|                null|2021-11-14 04:39:47|  i sent it to ya ;)|               0|    1|t1_hk0whi9|   false|t3_ov07rq|hkjr7uj|  2021-11-14|           4|           1|           11|        2021|    i sent it to ya |             6|          false|
|    anime|      DonMo999|:MAL:https://myan...|2021-11-14 04:40:25|displate has some...|               0|    1| t3_qtgc12|   false|t3_qtgc12|hkjralc|  2021-11-14|           4|           1|           11|        2021|displate has some...|            16|          false|
|    anime|OrangeBanana38|:AMQ::STAR::AL:ht...|2021-11-14 04:41:01|that sounds like ...|               0|    3|t1_hkjq6wn|   false|t3_qryjfm|hkjrd4w|  2021-11-14|           4|           1|           11|        2021|that sounds like ...|             6|          false|
+---------+--------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
only showing top 3 rows
                                                                                

Text Cleaning Pipeline

Build a SparkNLP Pipeline

# Step 1: Transforms raw texts to `document` annotation
# documentAssembler = DocumentAssembler()\
#     .setInputCol("text")\
#     .setOutputCol("document")\
#     .setCleanupMode("shrink") # shrink: removes new lines and tabs, plus merging multiple spaces and blank lines to a single space.

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")


# step 2: Removes unwanted characters from the text according to a regex pattern

cleanUpPatterns = [r"[^a-zA-Z\s]+"] # [r"[^\w\d\s]"] : remove punctuation (keep alphanumeric chars)

# emoji_pat = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
# clean_pat = '[^a-zA-Z\s]+'
# cleanUpPatterns = [r"({})|({})".format(emoji_pat, clean_pat)]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)

# step 3: Identifies tokens with tokenization open standards
tokenizer = Tokenizer() \
    .setInputCols(["normalizedDocument"]) \
    .setOutputCol("token")\
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])

# # step *: 
# spellChecker = ContextSpellCheckerApproach() \
#     .setInputCols("token") \
#     .setOutputCol("corrected") \
#     .setWordMaxDistance(3) \
#     .setBatchSize(24) \
#     .setEpochs(8) \
#     .setLanguageModelClasses(1650)  # dependant on vocabulary size

# step 4: Find lemmas out of words with the objective of returning a base dictionary word
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

stemmer = Stemmer() \
    .setInputCols(["lemma"]) \
    .setOutputCol("stem")

# step 5: Drops all the stop words from the input sequences
stopwords_cleaner = StopWordsCleaner()\
    .setInputCols("stem")\
    .setOutputCol("cleanTokens")\
    .setCaseSensitive(False)

# step 6: Reconstructs a DOCUMENT type annotation from tokens
tokenassembler = TokenAssembler()\
    .setInputCols(["document", "cleanTokens"]) \
    .setOutputCol("clean_text")


nlpPipeline = Pipeline(
    stages=[
        documentAssembler,
        documentNormalizer,
        tokenizer,
        lemmatizer,
        stemmer,
        stopwords_cleaner,
        tokenassembler
     ])
lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
Download done! Loading the resource.
[OK!]
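
Before fitting the pipeline on millions of rows, it can help to sanity-check the stages on a single raw string. Below is a minimal sketch using Spark NLP's LightPipeline; the empty fitting DataFrame and the sample sentence are made up for illustration:

# Optional sanity check: fit on an empty frame, then annotate a hypothetical sample string
from sparknlp.base import LightPipeline

empty_df = spark.createDataFrame([[""]]).toDF("text")
light_model = LightPipeline(nlpPipeline.fit(empty_df))
light_model.annotate("I REALLY loved this episode!!! 10/10, would rewatch :)")["clean_text"]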
# rename the columns that need text cleaning to `text` to match the nlpPipeline
body_com = com.withColumnRenamed('body','text')
title_sub = sub.withColumnRenamed('title','text')
selftext_sub = sub.withColumnRenamed('selftext','text')

# fit the dataframe to process the text cleaning
# body_com, title_sub, selftext_sub
pipelineModel = nlpPipeline.fit(body_com)
body_cleaned = pipelineModel.transform(body_com)
body_cleaned = body_cleaned.drop("document","normalizedDocument","lemma","stem","cleanTokens")


pipelineModel = nlpPipeline.fit(title_sub)
title_cleaned = pipelineModel.transform(title_sub)
title_cleaned = title_cleaned.drop("document","normalizedDocument","lemma","stem","cleanTokens")


pipelineModel = nlpPipeline.fit(selftext_sub)
selftext_cleaned = pipelineModel.transform(selftext_sub)
selftext_cleaned = selftext_cleaned.drop("document","normalizedDocument","lemma","stem","cleanTokens")
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.util.SizeEstimator$ (file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/spark-core_2.12-3.4.0.jar) to field java.util.regex.Pattern.pattern
WARNING: Please consider reporting this to the maintainers of org.apache.spark.util.SizeEstimator$
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
body_cleaned.printSchema()
root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- text: string (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- score: long (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- created_date: string (nullable = true)
 |-- created_hour: integer (nullable = true)
 |-- created_week: integer (nullable = true)
 |-- created_month: integer (nullable = true)
 |-- created_year: integer (nullable = true)
 |-- cleaned: string (nullable = true)
 |-- body_wordCount: integer (nullable = true)
 |-- contain_pokemon: boolean (nullable = true)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- clean_text: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
body_cleaned.select("token").show()
+--------------------+
|               token|
+--------------------+
|[{token, 0, 0, i,...|
|[{token, 0, 7, di...|
|[{token, 0, 3, th...|
|[{token, 0, 3, wh...|
|[{token, 0, 4, to...|
|[{token, 0, 1, it...|
|[{token, 0, 4, he...|
|[{token, 0, 1, hi...|
|[{token, 0, 0, i,...|
|[{token, 0, 4, wh...|
|[{token, 0, 4, de...|
|[{token, 0, 0, i,...|
|[{token, 0, 2, ye...|
|[{token, 0, 3, th...|
|[{token, 0, 5, lo...|
|[{token, 0, 2, ho...|
|[{token, 0, 3, yo...|
|[{token, 0, 6, de...|
|[{token, 0, 0, i,...|
|[{token, 0, 6, lo...|
+--------------------+
only showing top 20 rows
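
Note that the annotation columns produced by the pipeline are arrays of structs, so the cleaned strings live in the nested `result` field. If plain string columns are needed downstream, they can be pulled out with ordinary Spark functions; a small sketch follows (the flattened column names are made up for illustration), and Spark NLP's Finisher transformer can do the same job as a pipeline stage:

# Flatten the annotation structs into plain columns (illustrative names)
body_flat = body_cleaned \
    .withColumn("clean_text_str", F.col("clean_text").getItem(0).getField("result")) \
    .withColumn("token_list", F.expr("transform(token, t -> t.result)"))

body_flat.select("clean_text_str", "token_list").show(3, truncate=60)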
                                                                                

Basic text checks

What are the most common words overall or over time? What is the distribution of text lengths? What are important words according to TF-IDF?

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True

Text Lengths Distribution

# sub title_wordCount
sub_title_length = sub \
    .select("title_wordCount") \
    .withColumn("text_length",F.when(sub.title_wordCount<=5,"<=05") \
    .when(sub.title_wordCount.between(6,10),"<=10") \
    .when(sub.title_wordCount.between(11,15),"<=15") \
    .when(sub.title_wordCount.between(16,20),"<=20") \
    .when(sub.title_wordCount.between(21,25),"<=25") \
    .when(sub.title_wordCount.between(26,30),"<=30") \
    .when(sub.title_wordCount.between(31,40),"<=40") \
    .when(sub.title_wordCount.between(41,50),"<=50") \
    .otherwise(">50"))

length_title = sub_title_length.groupBy("text_length").count().sort(F.asc("text_length")).toPandas()
length_title
                                                                                
  text_length  count
0        <=05  28704
1        <=10  52709
2        <=15  18678
3        <=20   6262
4        <=25   2171
5        <=30    838
6        <=40    596
7        <=50    183
8         >50    106
plt.rcParams['figure.dpi'] = 360
plt.figure(figsize=(12, 8))
plt.bar(length_title['text_length'], length_title['count'], color='#f7c200')  # Updated color
plt.xlabel('Length')
plt.ylabel('Counts')
plt.title("Submissions Title Text Length Distribution")
# plt.gca().invert_yaxis()  # To display the highest count at the top
plt.savefig(f"../../website-source/images/anime_title_length.png")
plt.savefig(f"../../img/anime_title_length.png")
plt.show()

# sub selftext_wordCount
sub_selftext_length = sub \
    .select("selftext_wordCount") \
    .withColumn("text_length",F.when(sub.selftext_wordCount<=10,"<=10") \
    .when(sub.selftext_wordCount.between(11,20),"<=20") \
    .when(sub.selftext_wordCount.between(21,30),"<=30") \
    .when(sub.selftext_wordCount.between(31,40),"<=40") \
    .when(sub.selftext_wordCount.between(41,50),"<=50") \
    .when(sub.selftext_wordCount.between(51,60),"<=60") \
    .otherwise(">60"))

length_selftext = sub_selftext_length.groupBy("text_length").count().sort(F.asc("text_length")).toPandas()
length_selftext
                                                                                
  text_length  count
0        <=10   7524
1        <=20  11460
2        <=30  12833
3        <=40  11887
4        <=50   9311
5        <=60   7986
6         >60  49246
plt.rcParams['figure.dpi'] = 360
plt.figure(figsize=(12, 8))
plt.bar(length_selftext['text_length'], length_selftext['count'], color='#d13a47')  # Updated color
plt.xlabel('Length')
plt.ylabel('Counts')
plt.title("Submissions Selftext Length Distribution")
# plt.gca().invert_yaxis()  # To display the highest count at the top
plt.savefig(f"../../website-source/images/anime_selftext_length.png")
plt.savefig(f"../../img/anime_selftext_length.png")
plt.show()

# com body_wordCount
com_length = com \
    .select("body_wordCount") \
    .withColumn("text_length",F.when(com.body_wordCount<=10,"<=010") \
    .when(com.body_wordCount.between(11,30),"<=030") \
    .when(com.body_wordCount.between(31,50),"<=050") \
    .when(com.body_wordCount.between(51,70),"<=070") \
    .when(com.body_wordCount.between(71,100),"<=100") \
    .when(com.body_wordCount.between(101,140),"<=140") \
    .when(com.body_wordCount.between(141,200),"<=200") \
    .otherwise(">200"))

length_body = com_length.groupBy("text_length").count().sort(F.asc("text_length")).toPandas()
length_body
                                                                                
  text_length    count
0       <=010  2384465
1       <=030  2282496
2       <=050   881447
3       <=070   414939
4       <=100   365355
5       <=140   232981
6       <=200   182853
7        >200   134583
plt.rcParams['figure.dpi'] = 360
plt.figure(figsize=(12, 8))
plt.bar(length_body['text_length'], length_body['count'], color='#cccccc')  # Updated color
plt.xlabel('Length')
plt.ylabel('Counts')
plt.title("Comments Body Text Length Distribution")
# plt.gca().invert_yaxis()  # To display the highest count at the top
plt.savefig(f"../../website-source/images/anime_body_length.png")
plt.savefig(f"../../img/anime_body_length.png")
plt.show()

Top common words & important words via TF-IDF
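
For reference, the TFIDF function defined further down scores each term with the standard formulation, using the same names as the code: TF is how often a term appears in a document, DF is the number of documents containing the term, and N is the total number of documents, so that

TFIDF = TF * log(N / DF)

Terms that are frequent within a document but rare across the corpus therefore get the highest scores; the notebook then keeps each term's maximum score across documents.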

def top_n_words(df, n=20):
    # Explode the tokens column to get one row per token,
    # then count each token and keep the n most frequent
    top_words = df \
        .select("token") \
        .withColumn("word", explode("token")) \
        .groupBy("word.result") \
        .count() \
        .orderBy("count", ascending=False).limit(n)

    return top_words

# Get the top 20 words for each cleaned dataframe
top_words_body = top_n_words(body_cleaned)
stop_words = set(stopwords.words('english'))
stop_words = stop_words.union({'', 'im', 'dont', 'also'})

def top_n_words(data, column, n, stop_words):
    
    word_counts = (
        data
        .withColumn("word", F.explode(F.split(F.col(column), r"\s+")))
        .withColumn("word", F.regexp_replace("word", r"[^\w]", ""))
        .withColumn("word", F.regexp_replace("word", r"\d", ""))
        .groupBy("word")
        .count()
        .filter(F.col("word") != "")
        .sort("count", ascending=False)
    )
    top_n_words = word_counts.filter(~word_counts["word"].isin(stop_words)).limit(n).select("word", "count").toPandas()

    plt.rcParams['figure.dpi'] = 360
    plt.figure(figsize=(12, 8))
    plt.barh(top_n_words['word'], top_n_words['count'], color='#42a1b9')  # Updated color
    plt.xlabel('Word Counts')
    plt.ylabel('Word')
    plt.title(f'Top {n} Words Counts')
    plt.gca().invert_yaxis()  # To display the highest count at the top
    plt.savefig(f"../../website-source/images/anime_{column}_top20_words.png")
    plt.show()
def TFIDF(data,column,n):
    # calculate the term frequency of each comment
    tf = data.select("token","clean_text") \
        .withColumn("doc_id", F.monotonically_increasing_id()) \
        .withColumn('token', F.explode(F.col('token'))) \
        .groupBy('doc_id','token.result') \
        .agg(F.count('clean_text').alias("TF"))

    # calculate the document frequency: tf has one row per (doc_id, result) pair,
    # so counting doc_id per result gives the number of documents containing each term
    df = tf.groupby('result') \
           .agg(F.count('doc_id').alias('DF'))
    # df = tf.groupBy('result').agg(F.countDistinct('doc_id').alias('DF'))
    N = tf.select('doc_id').distinct().count()
    # calculate the tf-idf
    tfidf = tf.join(df, 'result').select(tf.result, tf.TF, df.DF) \
           .withColumn('TFIDF',(tf.TF * F.log(N / df.DF))) \
           .groupby('result') \
           .agg(F.max("TFIDF").alias("TFIDF")) \
           .sort('TFIDF', ascending=False)
    
    top_n_tfidf = tfidf.select('result','TFIDF').limit(n).toPandas()
    
    plt.rcParams['figure.dpi'] = 360
    plt.figure(figsize=(12, 8))
    plt.barh(top_n_tfidf['result'], top_n_tfidf['TFIDF'], color='#42a63c')  # Updated color
    plt.xlabel('Word Counts')
    plt.ylabel('Word')
    plt.title(f'Top {n} Words via TFIDF')
    plt.gca().invert_yaxis()  # To display the highest count at the top
    plt.savefig(f"../../website-source/images/anime_{column}_top20_words_tfidf.png")
    plt.show()
# submission title
title_top = top_n_words(sub,"title", 20, stop_words)
title_top
                                                                                

[Figure: Top 20 Words Counts (submission titles)]

The most common words in submission titles are highly related to recommendations, for example: like, good, rewatch, recommendations, best.

It is not surprising that anime, episode, watch and season are among the top 20 common words, since they relate directly to anime itself.

title_tfidf = TFIDF(title_cleaned, "title", 20)
title_tfidf
                                                                                

[Figure: Top 20 Words via TFIDF (submission titles)]

The top words surfaced by TF-IDF are highly related to Japanese words and culture; because TF-IDF up-weights terms that appear in only a few documents, niche Japanese vocabulary rises to the top. The following are some of the most prominent words:

- Dango: a Japanese dish.
- Muri (Japanese): “muri” (無理) means “impossible” or “unreasonable”.
- Toaru (Japanese): from Toaru Majutsu no Indekkusu (A Certain Magical Index), a popular light novel series.
- Kougeki (Japanese): “kougeki” (攻撃) means “attack”.
- Zutto (Japanese): “zutto” (ずっと) means “always” or “forever”.

# submission selftext
selftext_top = top_n_words(sub,"selftext", 20, stop_words)
selftext_top
                                                                                

[Figure: Top 20 Words Counts (submission selftext)]

The most common words from submission selftext are very similar to those in the titles; they are also highly related to recommendations, with additional words like ‘please’.

selftext_tfidf = TFIDF(selftext_cleaned, "selftext", 20)
selftext_tfidf
                                                                                

[Figure: Top 20 Words via TFIDF (submission selftext)]

It also contains many Japanese words:

- Kita (Japanese): “kita” (来た) means “came” or “arrived”.
- Ryo (Japanese): “ryo” (良) is used as a given name; “ryō” (両) was also a gold currency unit.
- Gagumber (Japanese): a character in “Gagumber the Gale” (疾風のガガンバー).
- Saori (Japanese): “Saori” is a Japanese given name.

In addition, it includes some descriptive words such as “irony” and “situational”.

# comment body
body_top = top_n_words(com,"body", 20, stop_words)
body_top
                                                                                

[Figure: Top 20 Words Counts (comment body)]

The comment body text, by contrast, is more general; only a few words are highly related to anime, such as ‘anime’, ‘watch’, and ‘episode’. These are similar to the submission titles.

body_tfidf = TFIDF(body_cleaned, "body", 20)
body_tfidf
                                                                                

[Figure: Top 20 Words via TFIDF (comment body)]