Code: NLP - EDA & Pipeline

# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.4.0

# install spark-nlp
%pip install spark-nlp==5.1.3
%pip install sparknlp

# install plotly
%pip install plotly

# install yfinance for external data
%pip install yfinance

# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 23.3.1
  latest version: 23.10.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.10.0



## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - openjdk


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2023.08.22 |       h06a4308_0         123 KB
    certifi-2023.11.17         |  py310h06a4308_0         158 KB
    openjdk-11.0.13            |       h87a67e3_0       341.0 MB
    ------------------------------------------------------------
                                           Total:       341.3 MB

The following NEW packages will be INSTALLED:

  openjdk            pkgs/main/linux-64::openjdk-11.0.13-h87a67e3_0 

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2023.7.2~ --> pkgs/main::ca-certificates-2023.08.22-h06a4308_0 
  certifi            conda-forge/noarch::certifi-2023.7.22~ --> pkgs/main/linux-64::certifi-2023.11.17-py310h06a4308_0 



Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.
Collecting pyspark==3.4.0
  Using cached pyspark-3.4.0-py2.py3-none-any.whl
Collecting py4j==0.10.9.7 (from pyspark==3.4.0)
  Using cached py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.7 pyspark-3.4.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Collecting spark-nlp==5.1.3
  Obtaining dependency information for spark-nlp==5.1.3 from https://files.pythonhosted.org/packages/cd/7d/bc0eca4c9ec4c9c1d9b28c42c2f07942af70980a7d912d0aceebf8db32dd/spark_nlp-5.1.3-py2.py3-none-any.whl.metadata
  Using cached spark_nlp-5.1.3-py2.py3-none-any.whl.metadata (53 kB)
Using cached spark_nlp-5.1.3-py2.py3-none-any.whl (537 kB)
Installing collected packages: spark-nlp
Successfully installed spark-nlp-5.1.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Collecting sparknlp
  Using cached sparknlp-1.0.0-py3-none-any.whl (1.4 kB)
Requirement already satisfied: spark-nlp in /opt/conda/lib/python3.10/site-packages (from sparknlp) (5.1.3)
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from sparknlp) (1.26.0)
Installing collected packages: sparknlp
Successfully installed sparknlp-1.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: plotly in /opt/conda/lib/python3.10/site-packages (5.9.0)
Requirement already satisfied: tenacity>=6.2.0 in /opt/conda/lib/python3.10/site-packages (from plotly) (8.0.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Collecting yfinance
  Obtaining dependency information for yfinance from https://files.pythonhosted.org/packages/1c/19/bf19123baf16a55fd38cbb100b5a49380b9b6db7279987034689d11254c7/yfinance-0.2.32-py2.py3-none-any.whl.metadata
  Using cached yfinance-0.2.32-py2.py3-none-any.whl.metadata (11 kB)
Requirement already satisfied: pandas>=1.3.0 in /opt/conda/lib/python3.10/site-packages (from yfinance) (1.4.4)
Requirement already satisfied: numpy>=1.16.5 in /opt/conda/lib/python3.10/site-packages (from yfinance) (1.26.0)
Requirement already satisfied: requests>=2.31 in /opt/conda/lib/python3.10/site-packages (from yfinance) (2.31.0)
Collecting multitasking>=0.0.7 (from yfinance)
  Using cached multitasking-0.0.11-py3-none-any.whl (8.5 kB)
Requirement already satisfied: lxml>=4.9.1 in /opt/conda/lib/python3.10/site-packages (from yfinance) (4.9.3)
Requirement already satisfied: appdirs>=1.4.4 in /opt/conda/lib/python3.10/site-packages (from yfinance) (1.4.4)
Collecting pytz>=2022.5 (from yfinance)
  Obtaining dependency information for pytz>=2022.5 from https://files.pythonhosted.org/packages/32/4d/aaf7eff5deb402fd9a24a1449a8119f00d74ae9c2efa79f8ef9994261fc2/pytz-2023.3.post1-py2.py3-none-any.whl.metadata
  Using cached pytz-2023.3.post1-py2.py3-none-any.whl.metadata (22 kB)
Collecting frozendict>=2.3.4 (from yfinance)
  Obtaining dependency information for frozendict>=2.3.4 from https://files.pythonhosted.org/packages/bf/e8/6eb098234b607ed660501a951b4b9190bf7bceff10a66cda828f32ad6e1a/frozendict-2.3.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading frozendict-2.3.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Collecting peewee>=3.16.2 (from yfinance)
  Using cached peewee-3.17.0-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: beautifulsoup4>=4.11.1 in /opt/conda/lib/python3.10/site-packages (from yfinance) (4.11.1)
Collecting html5lib>=1.1 (from yfinance)
  Using cached html5lib-1.1-py2.py3-none-any.whl (112 kB)
Requirement already satisfied: soupsieve>1.2 in /opt/conda/lib/python3.10/site-packages (from beautifulsoup4>=4.11.1->yfinance) (2.3.1)
Requirement already satisfied: six>=1.9 in /opt/conda/lib/python3.10/site-packages (from html5lib>=1.1->yfinance) (1.16.0)
Requirement already satisfied: webencodings in /opt/conda/lib/python3.10/site-packages (from html5lib>=1.1->yfinance) (0.5.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.10/site-packages (from pandas>=1.3.0->yfinance) (2.8.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests>=2.31->yfinance) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests>=2.31->yfinance) (3.3)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests>=2.31->yfinance) (2.0.6)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests>=2.31->yfinance) (2023.11.17)
Using cached yfinance-0.2.32-py2.py3-none-any.whl (68 kB)
Downloading frozendict-2.3.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (115 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 115.9/115.9 kB 1.4 MB/s eta 0:00:00
Using cached pytz-2023.3.post1-py2.py3-none-any.whl (502 kB)
Installing collected packages: pytz, peewee, multitasking, html5lib, frozendict, yfinance
  Attempting uninstall: pytz
    Found existing installation: pytz 2022.1
    Uninstalling pytz-2022.1:
      Successfully uninstalled pytz-2022.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spyder 5.3.3 requires pyqt5<5.16, which is not installed.
spyder 5.3.3 requires pyqtwebengine<5.16, which is not installed.
jupyterlab-server 2.10.3 requires jupyter-server~=1.4, but you have jupyter-server 2.7.3 which is incompatible.
sagemaker-datawrangler 0.4.3 requires sagemaker-data-insights==0.4.0, but you have sagemaker-data-insights 0.3.3 which is incompatible.
spyder 5.3.3 requires ipython<8.0.0,>=7.31.1, but you have ipython 8.16.1 which is incompatible.
spyder 5.3.3 requires pylint<3.0,>=2.5.0, but you have pylint 3.0.1 which is incompatible.
Successfully installed frozendict-2.3.10 html5lib-1.1 multitasking-0.0.11 peewee-3.17.0 pytz-2023.3.post1 yfinance-0.2.32
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.1.3.jar
--2023-11-29 23:38:06--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.1.3.jar
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.200.72, 52.217.137.208, 52.216.106.246, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.200.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 708534094 (676M) [application/java-archive]
Saving to: ‘spark-nlp-assembly-5.1.3.jar’

spark-nlp-assembly- 100%[===================>] 675.71M  26.6MB/s    in 26s     

2023-11-29 23:38:35 (26.4 MB/s) - ‘spark-nlp-assembly-5.1.3.jar’ saved [708534094/708534094]
## Import packages
import json
import sparknlp
import numpy as np
import pandas as pd
from sparknlp.base import *
from pyspark.ml import Pipeline
from sparknlp.annotator import *
import pyspark.sql.functions as F
from pyspark.sql.functions import mean, stddev, max, min, count, percentile_approx, year, month, dayofmonth, ceil, col, dayofweek, hour, explode, date_format, lower, size, split, regexp_replace, isnan, when
from pyspark.sql import SparkSession
from sparknlp.pretrained import PretrainedPipeline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import

spark = SparkSession.builder \
    .appName("Spark NLP")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3,org.apache.hadoop:hadoop-aws:3.2.2")\
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )\
    .getOrCreate()

print(f"Spark version: {spark.version}")
print(f"sparknlp version: {sparknlp.version()}")
Warning: Ignoring non-Spark config property: fs.s3a.aws.credentials.provider
:: loading settings :: url = jar:file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-f3336de6-f72a-4c19-9174-5b0599cd1773;1.0
    confs: [default]
    found com.johnsnowlabs.nlp#spark-nlp_2.12;5.1.3 in central
    found com.typesafe#config;1.4.2 in central
    found org.rocksdb#rocksdbjni;6.29.5 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.828 in central
    found com.github.universal-automata#liblevenshtein;3.0.0 in central
    found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
    found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
    found com.google.code.gson#gson;2.3 in central
    found it.unimi.dsi#fastutil;7.0.12 in central
    found org.projectlombok#lombok;1.16.8 in central
    found com.google.cloud#google-cloud-storage;2.20.1 in central
    found com.google.guava#guava;31.1-jre in central
    found com.google.guava#failureaccess;1.0.1 in central
    found com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava in central
    found com.google.errorprone#error_prone_annotations;2.18.0 in central
    found com.google.j2objc#j2objc-annotations;1.3 in central
    found com.google.http-client#google-http-client;1.43.0 in central
    found io.opencensus#opencensus-contrib-http-util;0.31.1 in central
    found com.google.http-client#google-http-client-jackson2;1.43.0 in central
    found com.google.http-client#google-http-client-gson;1.43.0 in central
    found com.google.api-client#google-api-client;2.2.0 in central
    found commons-codec#commons-codec;1.15 in central
    found com.google.oauth-client#google-oauth-client;1.34.1 in central
    found com.google.http-client#google-http-client-apache-v2;1.43.0 in central
    found com.google.apis#google-api-services-storage;v1-rev20220705-2.0.0 in central
    found com.google.code.gson#gson;2.10.1 in central
    found com.google.cloud#google-cloud-core;2.12.0 in central
    found io.grpc#grpc-context;1.53.0 in central
    found com.google.auto.value#auto-value-annotations;1.10.1 in central
    found com.google.auto.value#auto-value;1.10.1 in central
    found javax.annotation#javax.annotation-api;1.3.2 in central
    found commons-logging#commons-logging;1.2 in central
    found com.google.cloud#google-cloud-core-http;2.12.0 in central
    found com.google.http-client#google-http-client-appengine;1.43.0 in central
    found com.google.api#gax-httpjson;0.108.2 in central
    found com.google.cloud#google-cloud-core-grpc;2.12.0 in central
    found io.grpc#grpc-alts;1.53.0 in central
    found io.grpc#grpc-grpclb;1.53.0 in central
    found org.conscrypt#conscrypt-openjdk-uber;2.5.2 in central
    found io.grpc#grpc-auth;1.53.0 in central
    found io.grpc#grpc-protobuf;1.53.0 in central
    found io.grpc#grpc-protobuf-lite;1.53.0 in central
    found io.grpc#grpc-core;1.53.0 in central
    found com.google.api#gax;2.23.2 in central
    found com.google.api#gax-grpc;2.23.2 in central
    found com.google.auth#google-auth-library-credentials;1.16.0 in central
    found com.google.auth#google-auth-library-oauth2-http;1.16.0 in central
    found com.google.api#api-common;2.6.2 in central
    found io.opencensus#opencensus-api;0.31.1 in central
    found com.google.api.grpc#proto-google-iam-v1;1.9.2 in central
    found com.google.protobuf#protobuf-java;3.21.12 in central
    found com.google.protobuf#protobuf-java-util;3.21.12 in central
    found com.google.api.grpc#proto-google-common-protos;2.14.2 in central
    found org.threeten#threetenbp;1.6.5 in central
    found com.google.api.grpc#proto-google-cloud-storage-v2;2.20.1-alpha in central
    found com.google.api.grpc#grpc-google-cloud-storage-v2;2.20.1-alpha in central
    found com.google.api.grpc#gapic-google-cloud-storage-v2;2.20.1-alpha in central
    found com.fasterxml.jackson.core#jackson-core;2.14.2 in central
    found com.google.code.findbugs#jsr305;3.0.2 in central
    found io.grpc#grpc-api;1.53.0 in central
    found io.grpc#grpc-stub;1.53.0 in central
    found org.checkerframework#checker-qual;3.31.0 in central
    found io.perfmark#perfmark-api;0.26.0 in central
    found com.google.android#annotations;4.1.1.4 in central
    found org.codehaus.mojo#animal-sniffer-annotations;1.22 in central
    found io.opencensus#opencensus-proto;0.2.0 in central
    found io.grpc#grpc-services;1.53.0 in central
    found com.google.re2j#re2j;1.6 in central
    found io.grpc#grpc-netty-shaded;1.53.0 in central
    found io.grpc#grpc-googleapis;1.53.0 in central
    found io.grpc#grpc-xds;1.53.0 in central
    found com.navigamez#greex;1.0 in central
    found dk.brics.automaton#automaton;1.11-8 in central
    found com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.4.4 in central
    found com.microsoft.onnxruntime#onnxruntime;1.15.0 in central
    found org.apache.hadoop#hadoop-aws;3.2.2 in central
:: resolution report :: resolve 5861ms :: artifacts dl 1134ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.828 from central in [default]
    com.fasterxml.jackson.core#jackson-core;2.14.2 from central in [default]
    com.github.universal-automata#liblevenshtein;3.0.0 from central in [default]
    com.google.android#annotations;4.1.1.4 from central in [default]
    com.google.api#api-common;2.6.2 from central in [default]
    com.google.api#gax;2.23.2 from central in [default]
    com.google.api#gax-grpc;2.23.2 from central in [default]
    com.google.api#gax-httpjson;0.108.2 from central in [default]
    com.google.api-client#google-api-client;2.2.0 from central in [default]
    com.google.api.grpc#gapic-google-cloud-storage-v2;2.20.1-alpha from central in [default]
    com.google.api.grpc#grpc-google-cloud-storage-v2;2.20.1-alpha from central in [default]
    com.google.api.grpc#proto-google-cloud-storage-v2;2.20.1-alpha from central in [default]
    com.google.api.grpc#proto-google-common-protos;2.14.2 from central in [default]
    com.google.api.grpc#proto-google-iam-v1;1.9.2 from central in [default]
    com.google.apis#google-api-services-storage;v1-rev20220705-2.0.0 from central in [default]
    com.google.auth#google-auth-library-credentials;1.16.0 from central in [default]
    com.google.auth#google-auth-library-oauth2-http;1.16.0 from central in [default]
    com.google.auto.value#auto-value;1.10.1 from central in [default]
    com.google.auto.value#auto-value-annotations;1.10.1 from central in [default]
    com.google.cloud#google-cloud-core;2.12.0 from central in [default]
    com.google.cloud#google-cloud-core-grpc;2.12.0 from central in [default]
    com.google.cloud#google-cloud-core-http;2.12.0 from central in [default]
    com.google.cloud#google-cloud-storage;2.20.1 from central in [default]
    com.google.code.findbugs#jsr305;3.0.2 from central in [default]
    com.google.code.gson#gson;2.10.1 from central in [default]
    com.google.errorprone#error_prone_annotations;2.18.0 from central in [default]
    com.google.guava#failureaccess;1.0.1 from central in [default]
    com.google.guava#guava;31.1-jre from central in [default]
    com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava from central in [default]
    com.google.http-client#google-http-client;1.43.0 from central in [default]
    com.google.http-client#google-http-client-apache-v2;1.43.0 from central in [default]
    com.google.http-client#google-http-client-appengine;1.43.0 from central in [default]
    com.google.http-client#google-http-client-gson;1.43.0 from central in [default]
    com.google.http-client#google-http-client-jackson2;1.43.0 from central in [default]
    com.google.j2objc#j2objc-annotations;1.3 from central in [default]
    com.google.oauth-client#google-oauth-client;1.34.1 from central in [default]
    com.google.protobuf#protobuf-java;3.21.12 from central in [default]
    com.google.protobuf#protobuf-java-util;3.21.12 from central in [default]
    com.google.re2j#re2j;1.6 from central in [default]
    com.johnsnowlabs.nlp#spark-nlp_2.12;5.1.3 from central in [default]
    com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.4.4 from central in [default]
    com.microsoft.onnxruntime#onnxruntime;1.15.0 from central in [default]
    com.navigamez#greex;1.0 from central in [default]
    com.typesafe#config;1.4.2 from central in [default]
    commons-codec#commons-codec;1.15 from central in [default]
    commons-logging#commons-logging;1.2 from central in [default]
    dk.brics.automaton#automaton;1.11-8 from central in [default]
    io.grpc#grpc-alts;1.53.0 from central in [default]
    io.grpc#grpc-api;1.53.0 from central in [default]
    io.grpc#grpc-auth;1.53.0 from central in [default]
    io.grpc#grpc-context;1.53.0 from central in [default]
    io.grpc#grpc-core;1.53.0 from central in [default]
    io.grpc#grpc-googleapis;1.53.0 from central in [default]
    io.grpc#grpc-grpclb;1.53.0 from central in [default]
    io.grpc#grpc-netty-shaded;1.53.0 from central in [default]
    io.grpc#grpc-protobuf;1.53.0 from central in [default]
    io.grpc#grpc-protobuf-lite;1.53.0 from central in [default]
    io.grpc#grpc-services;1.53.0 from central in [default]
    io.grpc#grpc-stub;1.53.0 from central in [default]
    io.grpc#grpc-xds;1.53.0 from central in [default]
    io.opencensus#opencensus-api;0.31.1 from central in [default]
    io.opencensus#opencensus-contrib-http-util;0.31.1 from central in [default]
    io.opencensus#opencensus-proto;0.2.0 from central in [default]
    io.perfmark#perfmark-api;0.26.0 from central in [default]
    it.unimi.dsi#fastutil;7.0.12 from central in [default]
    javax.annotation#javax.annotation-api;1.3.2 from central in [default]
    org.apache.hadoop#hadoop-aws;3.2.2 from central in [default]
    org.checkerframework#checker-qual;3.31.0 from central in [default]
    org.codehaus.mojo#animal-sniffer-annotations;1.22 from central in [default]
    org.conscrypt#conscrypt-openjdk-uber;2.5.2 from central in [default]
    org.projectlombok#lombok;1.16.8 from central in [default]
    org.rocksdb#rocksdbjni;6.29.5 from central in [default]
    org.threeten#threetenbp;1.6.5 from central in [default]
    :: evicted modules:
    com.google.protobuf#protobuf-java-util;3.0.0-beta-3 by [com.google.protobuf#protobuf-java-util;3.21.12] in [default]
    com.google.protobuf#protobuf-java;3.0.0-beta-3 by [com.google.protobuf#protobuf-java;3.21.12] in [default]
    com.google.code.gson#gson;2.3 by [com.google.code.gson#gson;2.10.1] in [default]
    com.amazonaws#aws-java-sdk-bundle;1.11.563 by [com.amazonaws#aws-java-sdk-bundle;1.11.828] in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   77  |   0   |   0   |   4   ||   73  |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-f3336de6-f72a-4c19-9174-5b0599cd1773
    confs: [default]
    0 artifacts copied, 73 already retrieved (0kB/348ms)
23/11/29 23:38:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark version: 3.4.0
sparknlp version: 5.1.3

External data

# Importing the yfinance package
import datetime
import yfinance as yf
 
# Set the start and end date
start_date = '2021-01-01'
end_date = '2023-04-01'
 
# Add multiple space separated tickers here
ticker = 'NTDOY TYO SONY'

# Get the data
stocks = yf.download(ticker, start_date, end_date)['Adj Close']
stocks["Date"] = stocks.index
stocks.reset_index(drop=True, inplace=True)
# Print the last 5 rows
print(stocks.tail())

# Export data to a CSV file
# stocks.to_csv("../../data/csv/stocks.csv")
[*********************100%%**********************]  3 of 3 completed
     NTDOY       SONY        TYO       Date
560   9.65  86.639999  11.965896 2023-03-27
561   9.61  85.820000  11.985530 2023-03-28
562   9.77  87.870003  12.044427 2023-03-29
563   9.63  89.309998  12.054242 2023-03-30
564   9.69  90.650002  11.818654 2023-03-31
# Read the saved data
daily_sub = pd.read_csv("../../data/csv/sub_daily.csv")
daily_com = pd.read_csv("../../data/csv/com_daily.csv")
stocks = pd.read_csv("../../data/csv/stocks.csv")
# merge the datasets together based on date
merged_df = pd.concat([daily_sub.set_index("created_date"), daily_com.set_index("created_date"), stocks.set_index("Date")], axis=1)
merged_df["Date"] = merged_df.index
merged_df.reset_index(drop=True, inplace=True)
# convert date column to datetime and subtract one week
merged_df['week'] = pd.to_datetime(merged_df['Date']) - pd.to_timedelta(7, unit='d')
# merged_df[["Date","total_submissions", "avg_num_comments", "total_comments","NTDOY", "SONY", "TYO"]].tail(10)


# Group by week and calculate sum and mean
agg_columns = {
    'total_submissions': 'sum',
    'total_comments': 'sum',
    'NTDOY': 'mean',
    'SONY': 'mean',
    'TYO': 'mean',
}
weekly_df = merged_df.groupby([pd.Grouper(key='week', freq='W')]).agg(agg_columns).reset_index()

weekly_df.tail(10)
          week  total_submissions  total_comments   NTDOY       SONY        TYO
108 2023-01-22                707         48227.0  10.734  89.502000  11.827641
109 2023-01-29                640         37987.0  10.772  91.054001  11.796592
110 2023-02-05                634         52538.0  10.146  90.430002  12.403984
111 2023-02-12                626         50620.0  10.036  87.999998  12.794034
112 2023-02-19                598         51840.0   9.840  82.777500  13.246667
113 2023-02-26                524         47569.0   9.394  83.927998  13.407248
114 2023-03-05                248         39966.0   9.402  86.634000  13.191847
115 2023-03-12                249         39501.0   9.486  85.529999  12.037221
116 2023-03-19                267         44612.0   9.570  88.206000  11.796666
117 2023-03-26                159         24835.0   9.670  88.058000  11.973750
# Create subplots with a 2x1 grid
fig = make_subplots(rows=2, cols=1, specs=[[{"secondary_y": True}],[{"secondary_y": True}]])


# Add traces for the first subplot (Submissions and Stock Prices)
fig.add_trace(go.Scatter(x=weekly_df['week'], y=weekly_df['total_submissions'], marker_color='#d13a47', opacity=.65, name="Submissions"), row=1, col=1, secondary_y=False)
fig.add_trace(go.Scatter(x=weekly_df['week'], y=weekly_df['NTDOY'], marker_color='#f7c200', opacity=.65, name="NTDOY"), row=1, col=1, secondary_y=True)
fig.add_trace(go.Scatter(x=weekly_df['week'], y=weekly_df['TYO'], marker_color='#42a63c', opacity=.65, name="TYO"), row=1, col=1, secondary_y=True)

# Add traces for the second subplot (Comments and Stock Prices)
fig.add_trace(go.Scatter(x=weekly_df['week'], y=weekly_df['total_comments'], marker_color='#42a1b9', opacity=.65, name="Comments"), row=2, col=1, secondary_y=False)
fig.add_trace(go.Scatter(x=weekly_df['week'], y=weekly_df['NTDOY'], marker_color='#f7c200', opacity=.65, name="NTDOY", showlegend=False), row=2, col=1, secondary_y=True)
fig.add_trace(go.Scatter(x=weekly_df['week'], y=weekly_df['TYO'], marker_color='#42a63c', opacity=.65, name="TYO", showlegend=False), row=2, col=1, secondary_y=True)


# Update the y-axis labels
fig.update_yaxes(title_text="Submissions", secondary_y=False, row=1, col=1)
# fig.update_yaxes(title_text="Mean Stock Prices", secondary_y=True, row=1, col=1, matches='y')
fig.update_yaxes(title_text="Comments", secondary_y=False, row=2, col=1)
fig.update_yaxes(title_text="Mean Stock Prices", secondary_y=True, row=2, col=1)

# Update the layout for the whole figure
fig.update_layout(
    title='The number of comments, submissions, and stock prices for each week',
    # xaxis={'title': 'Date by week'},
    paper_bgcolor='#FFFFFF', 
    plot_bgcolor='rgba(0,0,0,0)',
)

# Show the figure
fig.show()

Data preparation for NLP pipeline

## Read cleaned data from parquet

### Anime subreddits
import sagemaker
# session = sagemaker.Session()
# bucket = session.default_bucket()
bucket = 'sagemaker-us-east-1-315969085594'

sub_bucket_path = f"s3a://{bucket}/project/cleaned/sub"
com_bucket_path = f"s3a://{bucket}/project/cleaned/com"

print(f"reading submissions from {sub_bucket_path}")
sub = spark.read.parquet(sub_bucket_path)  # parquet stores its own schema, so no header option is needed
print(f"shape of the sub dataframe is {sub.count():,}x{len(sub.columns)}")

print(f"reading comments from {com_bucket_path}")
com = spark.read.parquet(com_bucket_path)
print(f"shape of the com dataframe is {com.count():,}x{len(com.columns)}")
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
reading submissions from s3a://sagemaker-us-east-1-315969085594/project/cleaned/sub
23/11/29 23:39:01 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
                                                                                
shape of the sub dataframe is 110,247x22
reading comments from s3a://sagemaker-us-east-1-315969085594/project/cleaned/com
shape of the com dataframe is 6,879,119x19
                                                                                
sub.groupBy('subreddit').count().show()
+---------+------+
|subreddit| count|
+---------+------+
|    anime|110247|
+---------+------+
                                                                                
sub.printSchema()
root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- title: string (nullable = true)
 |-- selftext: string (nullable = true)
 |-- num_comments: long (nullable = true)
 |-- num_crossposts: long (nullable = true)
 |-- over_18: boolean (nullable = true)
 |-- score: long (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- id: string (nullable = true)
 |-- created_date: string (nullable = true)
 |-- created_hour: integer (nullable = true)
 |-- created_week: integer (nullable = true)
 |-- created_month: integer (nullable = true)
 |-- created_year: integer (nullable = true)
 |-- cleaned_title: string (nullable = true)
 |-- title_wordCount: integer (nullable = true)
 |-- cleaned_selftext: string (nullable = true)
 |-- selftext_wordCount: integer (nullable = true)
 |-- contain_pokemon: boolean (nullable = true)
sub.show(3)
+---------+--------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
|subreddit|              author|author_flair_text|        created_utc|               title|            selftext|num_comments|num_crossposts|over_18|score|stickied|    id|created_date|created_hour|created_week|created_month|created_year|       cleaned_title|title_wordCount|    cleaned_selftext|selftext_wordCount|contain_pokemon|
+---------+--------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
|    anime|PsychologicalGift299|             null|2021-04-19 20:42:46|anime movies for ...|so as my fellow o...|          12|             0|  false|    0|   false|mua1uo|  2021-04-19|          20|           2|            4|        2021|anime movies for 420|              4|so as my fellow o...|                64|          false|
|    anime|        Tuttles4ever|             null|2021-04-19 20:48:42|i need a very spe...|are there any ani...|           7|             0|  false|    0|   false|mua6g3|  2021-04-19|          20|           2|            4|        2021|i need a very spe...|             15|are there any ani...|                42|          false|
|    anime|          nemifloras|             null|2021-04-19 20:52:42|any atmospheric a...|i finished reassi...|           9|             0|  false|    0|   false|mua9iu|  2021-04-19|          20|           2|            4|        2021|any atmospheric a...|             10|i finished reassi...|                18|          false|
+---------+--------------------+-----------------+-------------------+--------------------+--------------------+------------+--------------+-------+-----+--------+------+------------+------------+------------+-------------+------------+--------------------+---------------+--------------------+------------------+---------------+
only showing top 3 rows
                                                                                
com.groupBy('subreddit').count().show()
+---------+-------+
|subreddit|  count|
+---------+-------+
|    anime|6879119|
+---------+-------+
                                                                                
com.printSchema()
root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- body: string (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- score: long (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- created_date: string (nullable = true)
 |-- created_hour: integer (nullable = true)
 |-- created_week: integer (nullable = true)
 |-- created_month: integer (nullable = true)
 |-- created_year: integer (nullable = true)
 |-- cleaned: string (nullable = true)
 |-- body_wordCount: integer (nullable = true)
 |-- contain_pokemon: boolean (nullable = true)
com.show(3)
+---------+--------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|subreddit|        author|   author_flair_text|        created_utc|                body|controversiality|score| parent_id|stickied|  link_id|     id|created_date|created_hour|created_week|created_month|created_year|             cleaned|body_wordCount|contain_pokemon|
+---------+--------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
|    anime| DonaldJenkins|                null|2021-11-14 04:39:47|  i sent it to ya ;)|               0|    1|t1_hk0whi9|   false|t3_ov07rq|hkjr7uj|  2021-11-14|           4|           1|           11|        2021|    i sent it to ya |             6|          false|
|    anime|      DonMo999|:MAL:https://myan...|2021-11-14 04:40:25|displate has some...|               0|    1| t3_qtgc12|   false|t3_qtgc12|hkjralc|  2021-11-14|           4|           1|           11|        2021|displate has some...|            16|          false|
|    anime|OrangeBanana38|:AMQ::STAR::AL:ht...|2021-11-14 04:41:01|that sounds like ...|               0|    3|t1_hkjq6wn|   false|t3_qryjfm|hkjrd4w|  2021-11-14|           4|           1|           11|        2021|that sounds like ...|             6|          false|
+---------+--------------+--------------------+-------------------+--------------------+----------------+-----+----------+--------+---------+-------+------------+------------+------------+-------------+------------+--------------------+--------------+---------------+
only showing top 3 rows
                                                                                

Text Cleaning Pipeline

Build a SparkNLP Pipeline

# Step 1: Transforms raw texts to `document` annotation
# documentAssembler = DocumentAssembler()\
#     .setInputCol("text")\
#     .setOutputCol("document")\
#     .setCleanupMode("shrink") # shrink: removes new lines and tabs, plus merging multiple spaces and blank lines to a single space.

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")


# step 2: Removes unwanted characters from the text according to a regex pattern

cleanUpPatterns = [r"[^a-zA-Z\s]+"] # [r"[^\w\d\s]"] : remove punctuation (keep alphanumeric chars)

# emoji_pat = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
# clean_pat = '[^a-zA-Z\s]+'
# cleanUpPatterns = [r"({})|({})".format(emoji_pat, clean_pat)]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)

# step 3: Identifies tokens with tokenization open standards
tokenizer = Tokenizer() \
    .setInputCols(["normalizedDocument"]) \
    .setOutputCol("token")\
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])

# # step *: 
# spellChecker = ContextSpellCheckerApproach() \
#     .setInputCols("token") \
#     .setOutputCol("corrected") \
#     .setWordMaxDistance(3) \
#     .setBatchSize(24) \
#     .setEpochs(8) \
#     .setLanguageModelClasses(1650)  # dependant on vocabulary size

# step 4: Find lemmas out of words with the objective of returning a base dictionary word
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

stemmer = Stemmer() \
    .setInputCols(["lemma"]) \
    .setOutputCol("stem")

# step 5: Drops all the stop words from the input sequences
stopwords_cleaner = StopWordsCleaner()\
    .setInputCols("stem")\
    .setOutputCol("cleanTokens")\
    .setCaseSensitive(False)

# step 6: Reconstructs a DOCUMENT type annotation from tokens
tokenassembler = TokenAssembler()\
    .setInputCols(["document", "cleanTokens"]) \
    .setOutputCol("clean_text")


nlpPipeline = Pipeline(
    stages=[
        documentAssembler,
        documentNormalizer,
        tokenizer,
        lemmatizer,
        stemmer,
        stopwords_cleaner,
        tokenassembler
     ])
lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
Download done! Loading the resource.
[OK!]
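
Before fitting the pipeline on millions of rows, it can help to sanity-check the stages on a single raw string. Below is a minimal sketch using Spark NLP's LightPipeline; the empty fitting DataFrame and the sample sentence are made up for illustration:

# Optional sanity check: fit on an empty frame, then annotate a hypothetical sample string
from sparknlp.base import LightPipeline

empty_df = spark.createDataFrame([[""]]).toDF("text")
light_model = LightPipeline(nlpPipeline.fit(empty_df))
light_model.annotate("I REALLY loved this episode!!! 10/10, would rewatch :)")["clean_text"]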
# rename the columns that need text cleaning to `text` to match the nlpPipeline
body_com = com.withColumnRenamed('body','text')
title_sub = sub.withColumnRenamed('title','text')
selftext_sub = sub.withColumnRenamed('selftext','text')

# fit the dataframe to process the text cleaning
# body_com, title_sub, selftext_sub
pipelineModel = nlpPipeline.fit(body_com)
body_cleaned = pipelineModel.transform(body_com)
body_cleaned = body_cleaned.drop("document","normalizedDocument","lemma","stem","cleanTokens")


pipelineModel = nlpPipeline.fit(title_sub)
title_cleaned = pipelineModel.transform(title_sub)
title_cleaned = title_cleaned.drop("document","normalizedDocument","lemma","stem","cleanTokens")


pipelineModel = nlpPipeline.fit(selftext_sub)
selftext_cleaned = pipelineModel.transform(selftext_sub)
selftext_cleaned = selftext_cleaned.drop("document","normalizedDocument","lemma","stem","cleanTokens")
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.util.SizeEstimator$ (file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/spark-core_2.12-3.4.0.jar) to field java.util.regex.Pattern.pattern
WARNING: Please consider reporting this to the maintainers of org.apache.spark.util.SizeEstimator$
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
body_cleaned.printSchema()
root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- text: string (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- score: long (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- created_date: string (nullable = true)
 |-- created_hour: integer (nullable = true)
 |-- created_week: integer (nullable = true)
 |-- created_month: integer (nullable = true)
 |-- created_year: integer (nullable = true)
 |-- cleaned: string (nullable = true)
 |-- body_wordCount: integer (nullable = true)
 |-- contain_pokemon: boolean (nullable = true)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- clean_text: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
body_cleaned.select("token").show()
+--------------------+
|               token|
+--------------------+
|[{token, 0, 0, i,...|
|[{token, 0, 7, di...|
|[{token, 0, 3, th...|
|[{token, 0, 3, wh...|
|[{token, 0, 4, to...|
|[{token, 0, 1, it...|
|[{token, 0, 4, he...|
|[{token, 0, 1, hi...|
|[{token, 0, 0, i,...|
|[{token, 0, 4, wh...|
|[{token, 0, 4, de...|
|[{token, 0, 0, i,...|
|[{token, 0, 2, ye...|
|[{token, 0, 3, th...|
|[{token, 0, 5, lo...|
|[{token, 0, 2, ho...|
|[{token, 0, 3, yo...|
|[{token, 0, 6, de...|
|[{token, 0, 0, i,...|
|[{token, 0, 6, lo...|
+--------------------+
only showing top 20 rows
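
Note that the annotation columns produced by the pipeline are arrays of structs, so the cleaned strings live in the nested `result` field. If plain string columns are needed downstream, they can be pulled out with ordinary Spark functions; a small sketch follows (the flattened column names are made up for illustration), and Spark NLP's Finisher transformer can do the same job as a pipeline stage:

# Flatten the annotation structs into plain columns (illustrative names)
body_flat = body_cleaned \
    .withColumn("clean_text_str", F.col("clean_text").getItem(0).getField("result")) \
    .withColumn("token_list", F.expr("transform(token, t -> t.result)"))

body_flat.select("clean_text_str", "token_list").show(3, truncate=60)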
                                                                                

Basic text checks

What are the most common words overall or over time? What is the distribution of text lengths? What are important words according to TF-IDF?

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True

Text Lengths Distribution

# sub title_wordCount
sub_title_length = sub \
    .select("title_wordCount") \
    .withColumn("text_length",F.when(sub.title_wordCount<=5,"<=05") \
    .when(sub.title_wordCount.between(6,10),"<=10") \
    .when(sub.title_wordCount.between(11,15),"<=15") \
    .when(sub.title_wordCount.between(16,20),"<=20") \
    .when(sub.title_wordCount.between(21,25),"<=25") \
    .when(sub.title_wordCount.between(26,30),"<=30") \
    .when(sub.title_wordCount.between(31,40),"<=40") \
    .when(sub.title_wordCount.between(41,50),"<=50") \
    .otherwise(">50"))

length_title = sub_title_length.groupBy("text_length").count().sort(F.asc("text_length")).toPandas()
length_title
                                                                                
  text_length  count
0        <=05  28704
1        <=10  52709
2        <=15  18678
3        <=20   6262
4        <=25   2171
5        <=30    838
6        <=40    596
7        <=50    183
8         >50    106
plt.rcParams['figure.dpi'] = 360
plt.figure(figsize=(12, 8))
plt.bar(length_title['text_length'], length_title['count'], color='#f7c200')  # Updated color
plt.xlabel('Length')
plt.ylabel('Counts')
plt.title("Submissions Title Text Length Distribution")
# plt.gca().invert_yaxis()  # To display the highest count at the top
plt.savefig(f"../../website-source/images/anime_title_length.png")
plt.savefig(f"../../img/anime_title_length.png")
plt.show()

# sub selftext_wordCount
sub_selftext_length = sub \
    .select("selftext_wordCount") \
    .withColumn("text_length",F.when(sub.selftext_wordCount<=10,"<=10") \
    .when(sub.selftext_wordCount.between(11,20),"<=20") \
    .when(sub.selftext_wordCount.between(21,30),"<=30") \
    .when(sub.selftext_wordCount.between(31,40),"<=40") \
    .when(sub.selftext_wordCount.between(41,50),"<=50") \
    .when(sub.selftext_wordCount.between(51,60),"<=60") \
    .otherwise(">60"))

length_selftext = sub_selftext_length.groupBy("text_length").count().sort(F.asc("text_length")).toPandas()
length_selftext
                                                                                
  text_length  count
0        <=10   7524
1        <=20  11460
2        <=30  12833
3        <=40  11887
4        <=50   9311
5        <=60   7986
6         >60  49246
plt.rcParams['figure.dpi'] = 360
plt.figure(figsize=(12, 8))
plt.bar(length_selftext['text_length'], length_selftext['count'], color='#d13a47')  # Updated color
plt.xlabel('Length')
plt.ylabel('Counts')
plt.title("Submissions Selftext Length Distribution")
# plt.gca().invert_yaxis()  # To display the highest count at the top
plt.savefig(f"../../website-source/images/anime_selftext_length.png")
plt.savefig(f"../../img/anime_selftext_length.png")
plt.show()

# com body_wordCount
com_length = com \
    .select("body_wordCount") \
    .withColumn("text_length",F.when(com.body_wordCount<=10,"<=010") \
    .when(com.body_wordCount.between(11,30),"<=030") \
    .when(com.body_wordCount.between(31,50),"<=050") \
    .when(com.body_wordCount.between(51,70),"<=070") \
    .when(com.body_wordCount.between(71,100),"<=100") \
    .when(com.body_wordCount.between(101,140),"<=140") \
    .when(com.body_wordCount.between(141,200),"<=200") \
    .otherwise(">200"))

length_body = com_length.groupBy("text_length").count().sort(F.asc("text_length")).toPandas()
length_body
                                                                                
  text_length    count
0       <=010  2384465
1       <=030  2282496
2       <=050   881447
3       <=070   414939
4       <=100   365355
5       <=140   232981
6       <=200   182853
7        >200   134583
plt.rcParams['figure.dpi'] = 360
plt.figure(figsize=(12, 8))
plt.bar(length_body['text_length'], length_body['count'], color='#cccccc')  # Updated color
plt.xlabel('Length')
plt.ylabel('Counts')
plt.title("Comments Body Text Length Distribution")
# plt.gca().invert_yaxis()  # To display the highest count at the top
plt.savefig(f"../../website-source/images/anime_body_length.png")
plt.savefig(f"../../img/anime_body_length.png")
plt.show()

Top common words & important words via TF-IDF
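
For reference, the TFIDF function defined further down scores each term with the standard formulation, using the same names as the code: TF is how often a term appears in a document, DF is the number of documents containing the term, and N is the total number of documents, so that

TFIDF = TF * log(N / DF)

Terms that are frequent within a document but rare across the corpus therefore get the highest scores; the notebook then keeps each term's maximum score across documents.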

def top_n_words(df, n=20):
    # Explode the tokens column to get one row per token,
    # then count each token and keep the n most frequent
    top_words = df \
        .select("token") \
        .withColumn("word", explode("token")) \
        .groupBy("word.result") \
        .count() \
        .orderBy("count", ascending=False).limit(n)

    return top_words

# Get the top 20 words for each cleaned dataframe
top_words_body = top_n_words(body_cleaned)
stop_words = set(stopwords.words('english'))
stop_words = stop_words.union({'', 'im', 'dont', 'also'})

def top_n_words(data, column, n, stop_words):
    
    word_counts = (
        data
        .withColumn("word", F.explode(F.split(F.col(column), r"\s+")))
        .withColumn("word", F.regexp_replace("word", r"[^\w]", ""))
        .withColumn("word", F.regexp_replace("word", r"\d", ""))
        .groupBy("word")
        .count()
        .filter(F.col("word") != "")
        .sort("count", ascending=False)
    )
    top_n_words = word_counts.filter(~word_counts["word"].isin(stop_words)).limit(n).select("word", "count").toPandas()

    plt.rcParams['figure.dpi'] = 360
    plt.figure(figsize=(12, 8))
    plt.barh(top_n_words['word'], top_n_words['count'], color='#42a1b9')  # Updated color
    plt.xlabel('Word Counts')
    plt.ylabel('Word')
    plt.title(f'Top {n} Words Counts')
    plt.gca().invert_yaxis()  # To display the highest count at the top
    plt.savefig(f"../../website-source/images/anime_{column}_top20_words.png")
    plt.show()
def TFIDF(data,column,n):
    # calculate the term frequency of each comment
    tf = data.select("token","clean_text") \
        .withColumn("doc_id", F.monotonically_increasing_id()) \
        .withColumn('token', F.explode(F.col('token'))) \
        .groupBy('doc_id','token.result') \
        .agg(F.count('clean_text').alias("TF"))

    # calculate the document frequency: tf has one row per (doc_id, result) pair,
    # so counting doc_id per result gives the number of documents containing each term
    df = tf.groupby('result') \
           .agg(F.count('doc_id').alias('DF'))
    # df = tf.groupBy('result').agg(F.countDistinct('doc_id').alias('DF'))
    N = tf.select('doc_id').distinct().count()
    # calculate the tf-idf
    tfidf = tf.join(df, 'result').select(tf.result, tf.TF, df.DF) \
           .withColumn('TFIDF',(tf.TF * F.log(N / df.DF))) \
           .groupby('result') \
           .agg(F.max("TFIDF").alias("TFIDF")) \
           .sort('TFIDF', ascending=False)
    
    top_n_tfidf = tfidf.select('result','TFIDF').limit(n).toPandas()
    
    plt.rcParams['figure.dpi'] = 360
    plt.figure(figsize=(12, 8))
    plt.barh(top_n_tfidf['result'], top_n_tfidf['TFIDF'], color='#42a63c')  # Updated color
    plt.xlabel('Word Counts')
    plt.ylabel('Word')
    plt.title(f'Top {n} Words via TFIDF')
    plt.gca().invert_yaxis()  # To display the highest count at the top
    plt.savefig(f"../../website-source/images/anime_{column}_top20_words_tfidf.png")
    plt.show()
# submission title
title_top = top_n_words(sub,"title", 20, stop_words)
title_top
                                                                                

[Figure: Top 20 Words Counts (submission titles)]

The most common words in submission titles are highly related to recommendations, for example: like, good, rewatch, recommendations, best.

It is not surprising that anime, episode, watch and season are among the top 20 common words, since they relate directly to anime itself.

title_tfidf = TFIDF(title_cleaned, "title", 20)
title_tfidf
                                                                                

[Figure: Top 20 Words via TFIDF (submission titles)]

The top words surfaced by TF-IDF are highly related to Japanese words and culture; because TF-IDF up-weights terms that appear in only a few documents, niche Japanese vocabulary rises to the top. The following are some of the most prominent words:

- Dango: a Japanese dish.
- Muri (Japanese): “muri” (無理) means “impossible” or “unreasonable”.
- Toaru (Japanese): from Toaru Majutsu no Indekkusu (A Certain Magical Index), a popular light novel series.
- Kougeki (Japanese): “kougeki” (攻撃) means “attack”.
- Zutto (Japanese): “zutto” (ずっと) means “always” or “forever”.

# submission selftext
selftext_top = top_n_words(sub,"selftext", 20, stop_words)
selftext_top
                                                                                

[Figure: Top 20 Words Counts (submission selftext)]

The most common words from submission selftext are very similar to those in the titles; they are also highly related to recommendations, with additional words like ‘please’.

selftext_tfidf = TFIDF(selftext_cleaned, "selftext", 20)
selftext_tfidf
                                                                                

[Figure: Top 20 Words via TFIDF (submission selftext)]

It also contains many Japanese words:

- Kita (Japanese): “kita” (来た) means “came” or “arrived”.
- Ryo (Japanese): “ryo” (良) is used as a given name; “ryō” (両) was also a gold currency unit.
- Gagumber (Japanese): a character in “Gagumber the Gale” (疾風のガガンバー).
- Saori (Japanese): “Saori” is a Japanese given name.

In addition, it includes some descriptive words such as “irony” and “situational”.

# comment body
body_top = top_n_words(com,"body", 20, stop_words)
body_top
                                                                                

[Figure: Top 20 Words Counts (comment body)]

The comment body text, by contrast, is more general; only a few words are highly related to anime, such as ‘anime’, ‘watch’, and ‘episode’. These are similar to the submission titles.

body_tfidf = TFIDF(body_cleaned, "body", 20)
body_tfidf
                                                                                

[Figure: Top 20 Words via TFIDF (comment body)]