# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.3.0

# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 23.3.1
  latest version: 23.9.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.9.0


## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - openjdk


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2023.08.22 |       h06a4308_0         123 KB
    certifi-2023.7.22          |  py310h06a4308_0         153 KB
    openjdk-11.0.13            |       h87a67e3_0       341.0 MB
    ------------------------------------------------------------
                                           Total:       341.3 MB

The following NEW packages will be INSTALLED:

  openjdk            pkgs/main/linux-64::openjdk-11.0.13-h87a67e3_0 

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2023.7.2~ --> pkgs/main::ca-certificates-2023.08.22-h06a4308_0 

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            conda-forge/noarch::certifi-2023.7.22~ --> pkgs/main/linux-64::certifi-2023.7.22-py310h06a4308_0 


Downloading and Extracting Packages
ca-certificates-2023 | 123 KB    |                                       |   0% 
openjdk-11.0.13      | 341.0 MB  |                                       |   0% 

certifi-2023.7.22    | 153 KB    |                                       |   0% 
ca-certificates-2023 | 123 KB    | ##################################### | 100% 

certifi-2023.7.22    | 153 KB    | ##################################### | 100% 

certifi-2023.7.22    | 153 KB    | ##################################### | 100% 
openjdk-11.0.13      | 341.0 MB  | 8                                     |   2% 
openjdk-11.0.13      | 341.0 MB  | ##4                                   |   7% 
openjdk-11.0.13      | 341.0 MB  | ####                                  |  11% 
openjdk-11.0.13      | 341.0 MB  | #####2                                |  14% 
openjdk-11.0.13      | 341.0 MB  | ######7                               |  18% 
openjdk-11.0.13      | 341.0 MB  | ########                              |  22% 
openjdk-11.0.13      | 341.0 MB  | #########3                            |  25% 
openjdk-11.0.13      | 341.0 MB  | ##########6                           |  29% 
openjdk-11.0.13      | 341.0 MB  | ###########8                          |  32% 
openjdk-11.0.13      | 341.0 MB  | #############3                        |  36% 
openjdk-11.0.13      | 341.0 MB  | ##############6                       |  40% 
openjdk-11.0.13      | 341.0 MB  | ###############8                      |  43% 
openjdk-11.0.13      | 341.0 MB  | #################                     |  46% 
openjdk-11.0.13      | 341.0 MB  | ##################3                   |  50% 
openjdk-11.0.13      | 341.0 MB  | ###################5                  |  53% 
openjdk-11.0.13      | 341.0 MB  | ####################9                 |  57% 
openjdk-11.0.13      | 341.0 MB  | ######################4               |  61% 
openjdk-11.0.13      | 341.0 MB  | #######################9              |  65% 
openjdk-11.0.13      | 341.0 MB  | #########################3            |  68% 
openjdk-11.0.13      | 341.0 MB  | ##########################7           |  72% 
openjdk-11.0.13      | 341.0 MB  | ############################2         |  76% 
openjdk-11.0.13      | 341.0 MB  | #############################8        |  81% 
openjdk-11.0.13      | 341.0 MB  | ###############################4      |  85% 
openjdk-11.0.13      | 341.0 MB  | #################################     |  89% 
openjdk-11.0.13      | 341.0 MB  | ##################################5   |  93% 
openjdk-11.0.13      | 341.0 MB  | ####################################  |  98% 
                                                                                
                                                                                
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.
Collecting pyspark==3.3.0
  Using cached pyspark-3.3.0-py2.py3-none-any.whl
Collecting py4j==0.10.9.5 (from pyspark==3.3.0)
  Using cached py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.

# Import pyspark and build Spark session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

print(spark.version)

Warning: Ignoring non-Spark config property: fs.s3a.aws.credentials.provider

:: loading settings :: url = jar:file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml

Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-ab9af583-0d7a-4fbf-8918-31b37b436133;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.2.2 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.563 in central
:: resolution report :: resolve 511ms :: artifacts dl 70ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.11.563 from central in [default]
	org.apache.hadoop#hadoop-aws;3.2.2 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-ab9af583-0d7a-4fbf-8918-31b37b436133
	confs: [default]
	0 artifacts copied, 2 already retrieved (0kB/36ms)

23/10/30 20:48:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

3.3.0

#!mkdir -p ./code

!pwd

/root/project/fall-2023-reddit-project-team-01/code/preprocessing

%%writefile ./process_conservative_finance.py

import os
import logging
import argparse

# Import pyspark and build Spark session
from pyspark.sql.functions import *
from pyspark.sql.types import (
    DoubleType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

logging.basicConfig(format='%(asctime)s,%(levelname)s,%(module)s,%(filename)s,%(lineno)d,%(message)s', level=logging.DEBUG)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

def main():
    parser = argparse.ArgumentParser(description="app inputs and outputs")
    parser.add_argument("--s3_dataset_path", type=str, help="Path of dataset in S3")    
    parser.add_argument("--s3_output_bucket", type=str, help="s3 output bucket")
    parser.add_argument("--s3_output_prefix", type=str, help="s3 output prefix")
    parser.add_argument("--col_name_for_filtering", type=str, help="Name of the column to filter")
    parser.add_argument("--values_to_keep", type=str, help="comma separated list of values to keep in the filtered set")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("PySparkApp").getOrCreate()
    logger.info(f"spark version = {spark.version}")
    
    # This is needed to save RDDs which is the only way to write nested Dataframes into CSV format
    sc = spark.sparkContext
    sc._jsc.hadoopConfiguration().set(
        "mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter"
    )

   
    # Downloading the data from S3 into a Dataframe
    logger.info(f"going to read {args.s3_dataset_path}")
    df = spark.read.parquet(args.s3_dataset_path, header=True)
    logger.info(f"finished reading files...")
    

    
    # filter the dataframe to only keep the values of interest
    vals = [s.strip() for s in args.values_to_keep.split(",")]
    df_filtered = df.where(col(args.col_name_for_filtering).isin(vals))
    
    # save the filtered dataframes so that these files can now be used for future analysis
    s3_path = f"s3://{args.s3_output_bucket}/{args.s3_output_prefix}"
    logger.info(f"going to write data for {vals} in {s3_path}")
    logger.info(f"shape of the df_filtered dataframe is {df_filtered.count():,}x{len(df_filtered.columns)}")
    df_filtered.write.mode("overwrite").parquet(s3_path)
    
    logger.info(f"all done...")
    
if __name__ == "__main__":
    main()

Overwriting ./process_conservative_finance.py

import sagemaker
from sagemaker.spark.processing import PySparkProcessor

# Setup the PySpark processor to run the job. Note the instance type and instance count parameters. SageMaker will create these many instances of this type for the spark job.
role = sagemaker.get_execution_role()
spark_processor = PySparkProcessor(
    base_job_name="sm-spark-project",
    framework_version="3.3",
    role=role,
    instance_count=8,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=7200,
)

# s3 paths
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_logs = f"spark_logs"
col_name_for_filtering = "subreddit"

# modify this comma separated list to choose the subreddits of interest
subreddits = "Conservative, finance" # "Conservative, Libertarian, centrist, changemyview, Ask_Politics, finance"
configuration = [
    {
        "Classification": "spark-defaults",
        "Properties": {"spark.executor.memory": "12g", "spark.executor.cores": "4"},
    }
]

years = [2021, 2022, 2023]

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml

import time

for year in years:    
    # comments
    print(f"Working on Comments for year {year}")
    
    s3_dataset_path_commments = f"s3://bigdatateaching/reddit-parquet/comments/year={year}/month=*/*.parquet" 
    output_prefix_data_comments = f"project/comments/year={year}"

    spark_processor.run(
        submit_app="./process_conservative_finance.py",
        arguments=[
            "--s3_dataset_path",
            s3_dataset_path_commments,
            "--s3_output_bucket",
            bucket,
            "--s3_output_prefix",
            output_prefix_data_comments,
            "--col_name_for_filtering",
            col_name_for_filtering,
            "--values_to_keep",
            subreddits,
        ],
        spark_event_logs_s3_uri="s3://{}/{}/spark_event_logs".format(bucket, output_prefix_logs),
        logs=False,
        configuration=configuration
    )

    time.sleep(60)
    
    # submissions
    print(f"Working on Submissions for year {year}")
    
    s3_dataset_path_submissions = f"s3://bigdatateaching/reddit-parquet/submissions/year={year}/month=*/*.parquet"
    output_prefix_data_submissions = f"project/submissions/year={year}"

    spark_processor.run(
        submit_app="./process_conservative_finance.py",
        arguments=[
             "--s3_dataset_path",
            s3_dataset_path_submissions,
            "--s3_output_bucket",
            bucket,
            "--s3_output_prefix",
            output_prefix_data_submissions,
            "--col_name_for_filtering",
            col_name_for_filtering,
            "--values_to_keep",
            subreddits,
        ],
        spark_event_logs_s3_uri="s3://{}/{}/spark_event_logs".format(bucket, output_prefix_logs),
        logs=False,
        configuration=configuration
    )
    
    time.sleep(60)

Working on Comments for year 2021
............................................................................................................................................................................................................!Working on Submissions for year 2021
..........................................................................................................................................................................................................................!Working on Comments for year 2022
...................................................................................................................................................................................................................!Working on Submissions for year 2022
.................................................................................................................................................................................................................................................................!Working on Comments for year 2023
..................................................................................................................!Working on Submissions for year 2023
..................................................................................................................................!

%%time
s3_path = f"s3a://{bucket}/project/comments"
print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")

reading comments from s3a://sagemaker-us-east-1-224518912016/project/comments
23/10/30 23:18:46 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties

[Stage 1:======================================================>(244 + 2) / 246]

shape of the comments dataframe is 7,522,539x22
CPU times: user 470 ms, sys: 138 ms, total: 608 ms
Wall time: 18min 26s

# check counts (ensuring all needed subreddits exist)
comments.groupBy('subreddit').count().show()

[Stage 17:===================================================>     (9 + 1) / 10]

+------------+------+
|   subreddit| count|
+------------+------+
|changemyview|158144|
|     finance|  8476|
|   socialism| 17527|
| Libertarian|186529|
|Ask_Politics|  8282|
|    centrist| 50853|
|Conservative|570355|
|   Economics| 50770|
+------------+------+

comments.printSchema()

root
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- body: string (nullable = true)
 |-- can_gild: boolean (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- edited: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- id: string (nullable = true)
 |-- is_submitter: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- retrieved_on: timestamp (nullable = true)
 |-- score: long (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)

# display a subset of columns
comments.select("subreddit", "author", "body", "parent_id", "link_id", "id", "created_utc").show()

+------------+-----------------+--------------------+----------+---------+-------+-------------------+
|   subreddit|           author|                body| parent_id|  link_id|     id|        created_utc|
+------------+-----------------+--------------------+----------+---------+-------+-------------------+
|Conservative|   Thrownaway1211|         2nd dumbest|t1_gjyohnx|t3_l19aok|gjzhisd|2021-01-20 20:44:29|
|Conservative|        [deleted]|           [deleted]|t1_gjzfpyh|t3_l1hhgw|gjzhite|2021-01-20 20:44:30|
|Conservative|        [deleted]|           [removed]|t1_gjzdkd4|t3_l1dlf1|gjzhiuy|2021-01-20 20:44:30|
|Conservative|        premer777|God helps those w...|t1_gjzd3i6|t3_l19aok|gjzhivc|2021-01-20 20:44:30|
|Conservative|    Barnyard_Rich|&gt; This country...|t1_gjzax9z|t3_l1g3b9|gjzhiwu|2021-01-20 20:44:31|
|Conservative|     sailor-jackn|We’re not just ge...|t1_gjzb8mv|t3_l1dlf1|gjzhixn|2021-01-20 20:44:31|
|Conservative|        [deleted]|You just might be...|t1_gjzgw3m|t3_l1fxyh|gjzhiyc|2021-01-20 20:44:31|
|Conservative|     lulskadoodle|These are all goo...| t3_l19aok|t3_l19aok|gjzhj0m|2021-01-20 20:44:32|
| Libertarian| No_Consequences_|This is not what ...|t1_gjxrxoy|t3_l0zgze|gjzhj1n|2021-01-20 20:44:32|
|Conservative|        [deleted]|           [deleted]|t1_gjzh1a0|t3_l1h8v7|gjzhj1u|2021-01-20 20:44:32|
| Libertarian|        [deleted]|           [deleted]|t1_gjzhcc4|t3_l0oyxu|gjzhj5a|2021-01-20 20:44:34|
|Conservative|    AutoModerator|Looking for debat...| t3_l1hv88|t3_l1hv88|gjzhj5u|2021-01-20 20:44:34|
|Conservative|        [deleted]|Please stop sayin...|t1_gjyhdqz|t3_l199d1|gjzhj7r|2021-01-20 20:44:34|
|Conservative|        [deleted]|           [removed]| t3_l1d0r7|t3_l1d0r7|gjzhj7w|2021-01-20 20:44:34|
|Conservative|    DanPlaysMusic|Why do you suppor...|t1_gjzel12|t3_l1dlf1|gjzhj9s|2021-01-20 20:44:35|
| Libertarian|    iushciuweiush|Ignorance is blis...|t1_gjz5cl0|t3_l1efor|gjzhjan|2021-01-20 20:44:35|
|Conservative|          mk21dvr|You forgot the "/s".|t1_gjzejva|t3_l1eoiy|gjzhjb5|2021-01-20 20:44:36|
|Conservative|        [deleted]|           [removed]|t1_gjz9dbo|t3_l1e03j|gjzhjeq|2021-01-20 20:44:37|
|Conservative|    CastleBravo45|There are problem...|t1_gjzh695|t3_l1d0r7|gjzhjgz|2021-01-20 20:44:38|
|Conservative|KilgoreTroutsAnus|The point of Trum...|t1_gjzb2la|t3_l1ftsv|gjzhjh1|2021-01-20 20:44:38|
+------------+-----------------+--------------------+----------+---------+-------+-------------------+
only showing top 20 rows

%%time
s3_path = f"s3a://{bucket}/{output_prefix_data}/submissions"
print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")

reading submissions from s3a://sagemaker-us-east-1-433974840707/project/submissions
23/10/21 21:57:46 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

[Stage 22:===========================================>              (3 + 1) / 4]

shape of the submissions dataframe is 36,353x68
CPU times: user 17 ms, sys: 1.33 ms, total: 18.3 ms
Wall time: 16.5 s

# check counts (ensuring all needed subreddits exist)
submissions.groupBy('subreddit').count().show()

[Stage 25:===========================================>              (3 + 1) / 4]

+------------+-----+
|   subreddit|count|
+------------+-----+
|changemyview| 3507|
|     finance|  950|
|   socialism| 2349|
| Libertarian| 4448|
|Ask_Politics| 1027|
|    centrist| 1194|
|Conservative|21520|
|   Economics| 1358|
+------------+-----+

submissions.printSchema()

root
 |-- adserver_click_url: string (nullable = true)
 |-- adserver_imp_pixel: string (nullable = true)
 |-- archived: boolean (nullable = true)
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- author_id: string (nullable = true)
 |-- brand_safe: boolean (nullable = true)
 |-- contest_mode: boolean (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- crosspost_parent: string (nullable = true)
 |-- crosspost_parent_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- approved_at_utc: string (nullable = true)
 |    |    |-- approved_by: string (nullable = true)
 |    |    |-- archived: boolean (nullable = true)
 |    |    |-- author: string (nullable = true)
 |    |    |-- author_flair_css_class: string (nullable = true)
 |    |    |-- author_flair_text: string (nullable = true)
 |    |    |-- banned_at_utc: string (nullable = true)
 |    |    |-- banned_by: string (nullable = true)
 |    |    |-- brand_safe: boolean (nullable = true)
 |    |    |-- can_gild: boolean (nullable = true)
 |    |    |-- can_mod_post: boolean (nullable = true)
 |    |    |-- clicked: boolean (nullable = true)
 |    |    |-- contest_mode: boolean (nullable = true)
 |    |    |-- created: double (nullable = true)
 |    |    |-- created_utc: double (nullable = true)
 |    |    |-- distinguished: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- downs: long (nullable = true)
 |    |    |-- edited: boolean (nullable = true)
 |    |    |-- gilded: long (nullable = true)
 |    |    |-- hidden: boolean (nullable = true)
 |    |    |-- hide_score: boolean (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- is_crosspostable: boolean (nullable = true)
 |    |    |-- is_reddit_media_domain: boolean (nullable = true)
 |    |    |-- is_self: boolean (nullable = true)
 |    |    |-- is_video: boolean (nullable = true)
 |    |    |-- likes: string (nullable = true)
 |    |    |-- link_flair_css_class: string (nullable = true)
 |    |    |-- link_flair_text: string (nullable = true)
 |    |    |-- locked: boolean (nullable = true)
 |    |    |-- media: string (nullable = true)
 |    |    |-- mod_reports: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- num_comments: long (nullable = true)
 |    |    |-- num_crossposts: long (nullable = true)
 |    |    |-- num_reports: string (nullable = true)
 |    |    |-- over_18: boolean (nullable = true)
 |    |    |-- parent_whitelist_status: string (nullable = true)
 |    |    |-- permalink: string (nullable = true)
 |    |    |-- pinned: boolean (nullable = true)
 |    |    |-- quarantine: boolean (nullable = true)
 |    |    |-- removal_reason: string (nullable = true)
 |    |    |-- report_reasons: string (nullable = true)
 |    |    |-- saved: boolean (nullable = true)
 |    |    |-- score: long (nullable = true)
 |    |    |-- secure_media: string (nullable = true)
 |    |    |-- selftext: string (nullable = true)
 |    |    |-- selftext_html: string (nullable = true)
 |    |    |-- spoiler: boolean (nullable = true)
 |    |    |-- stickied: boolean (nullable = true)
 |    |    |-- subreddit: string (nullable = true)
 |    |    |-- subreddit_id: string (nullable = true)
 |    |    |-- subreddit_name_prefixed: string (nullable = true)
 |    |    |-- subreddit_type: string (nullable = true)
 |    |    |-- suggested_sort: string (nullable = true)
 |    |    |-- thumbnail: string (nullable = true)
 |    |    |-- thumbnail_height: string (nullable = true)
 |    |    |-- thumbnail_width: string (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- ups: long (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- user_reports: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- view_count: string (nullable = true)
 |    |    |-- visited: boolean (nullable = true)
 |    |    |-- whitelist_status: string (nullable = true)
 |-- disable_comments: boolean (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- domain_override: string (nullable = true)
 |-- edited: string (nullable = true)
 |-- embed_type: string (nullable = true)
 |-- embed_url: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- hidden: boolean (nullable = true)
 |-- hide_score: boolean (nullable = true)
 |-- href_url: string (nullable = true)
 |-- id: string (nullable = true)
 |-- imp_pixel: string (nullable = true)
 |-- is_crosspostable: boolean (nullable = true)
 |-- is_reddit_media_domain: boolean (nullable = true)
 |-- is_self: boolean (nullable = true)
 |-- is_video: boolean (nullable = true)
 |-- link_flair_css_class: string (nullable = true)
 |-- link_flair_text: string (nullable = true)
 |-- locked: boolean (nullable = true)
 |-- media: struct (nullable = true)
 |    |-- event_id: string (nullable = true)
 |    |-- oembed: struct (nullable = true)
 |    |    |-- author_name: string (nullable = true)
 |    |    |-- author_url: string (nullable = true)
 |    |    |-- cache_age: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- html: string (nullable = true)
 |    |    |-- provider_name: string (nullable = true)
 |    |    |-- provider_url: string (nullable = true)
 |    |    |-- thumbnail_height: long (nullable = true)
 |    |    |-- thumbnail_url: string (nullable = true)
 |    |    |-- thumbnail_width: long (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- reddit_video: struct (nullable = true)
 |    |    |-- dash_url: string (nullable = true)
 |    |    |-- duration: long (nullable = true)
 |    |    |-- fallback_url: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- hls_url: string (nullable = true)
 |    |    |-- is_gif: boolean (nullable = true)
 |    |    |-- scrubber_media_url: string (nullable = true)
 |    |    |-- transcoding_status: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- type: string (nullable = true)
 |-- media_embed: struct (nullable = true)
 |    |-- content: string (nullable = true)
 |    |-- height: long (nullable = true)
 |    |-- scrolling: boolean (nullable = true)
 |    |-- width: long (nullable = true)
 |-- mobile_ad_url: string (nullable = true)
 |-- num_comments: long (nullable = true)
 |-- num_crossposts: long (nullable = true)
 |-- original_link: string (nullable = true)
 |-- over_18: boolean (nullable = true)
 |-- parent_whitelist_status: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- pinned: boolean (nullable = true)
 |-- post_hint: string (nullable = true)
 |-- preview: struct (nullable = true)
 |    |-- enabled: boolean (nullable = true)
 |    |-- images: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |-- variants: struct (nullable = true)
 |    |    |    |    |-- gif: struct (nullable = true)
 |    |    |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |-- mp4: struct (nullable = true)
 |    |    |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |-- nsfw: struct (nullable = true)
 |    |    |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |-- obfuscated: struct (nullable = true)
 |    |    |    |    |    |-- resolutions: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |    |-- width: long (nullable = true)
 |    |    |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |    |    |-- height: long (nullable = true)
 |    |    |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |    |    |-- width: long (nullable = true)
 |-- promoted: boolean (nullable = true)
 |-- promoted_by: string (nullable = true)
 |-- promoted_display_name: string (nullable = true)
 |-- promoted_url: string (nullable = true)
 |-- retrieved_on: timestamp (nullable = true)
 |-- score: long (nullable = true)
 |-- secure_media: struct (nullable = true)
 |    |-- event_id: string (nullable = true)
 |    |-- oembed: struct (nullable = true)
 |    |    |-- author_name: string (nullable = true)
 |    |    |-- author_url: string (nullable = true)
 |    |    |-- cache_age: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- html: string (nullable = true)
 |    |    |-- provider_name: string (nullable = true)
 |    |    |-- provider_url: string (nullable = true)
 |    |    |-- thumbnail_height: long (nullable = true)
 |    |    |-- thumbnail_url: string (nullable = true)
 |    |    |-- thumbnail_width: long (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |    |-- type: string (nullable = true)
 |-- secure_media_embed: struct (nullable = true)
 |    |-- content: string (nullable = true)
 |    |-- height: long (nullable = true)
 |    |-- media_domain_url: string (nullable = true)
 |    |-- scrolling: boolean (nullable = true)
 |    |-- width: long (nullable = true)
 |-- selftext: string (nullable = true)
 |-- spoiler: boolean (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)
 |-- suggested_sort: string (nullable = true)
 |-- third_party_trackers: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- third_party_tracking: string (nullable = true)
 |-- third_party_tracking_2: string (nullable = true)
 |-- thumbnail: string (nullable = true)
 |-- thumbnail_height: long (nullable = true)
 |-- thumbnail_width: long (nullable = true)
 |-- title: string (nullable = true)
 |-- url: string (nullable = true)
 |-- whitelist_status: string (nullable = true)

# display a subset of columns
submissions.select("subreddit", "author", "title", "selftext", "created_utc", "num_comments").show()

+------------+-------------------+--------------------+--------------------+-------------------+------------+
|   subreddit|             author|               title|            selftext|        created_utc|num_comments|
+------------+-------------------+--------------------+--------------------+-------------------+------------+
|Conservative|       Foubar_ghost|Liberal lawyer De...|                    |2021-01-06 01:19:57|          44|
|changemyview|          [deleted]|CMV: CallMeCarson...|           [removed]|2021-01-06 01:20:24|          31|
|Conservative|             f1sh98|Hong Kong Police ...|                    |2021-01-06 01:21:27|          13|
|Conservative|          BluePath2|Georgia run off t...|                    |2021-01-06 01:24:40|           0|
| Libertarian|          [deleted]|Trump supporters ...|           [deleted]|2021-01-06 01:25:25|         288|
|Conservative|          [deleted]|I dont even need ...|           [deleted]|2021-01-06 01:25:31|           0|
| Libertarian|    GruntNumber9902|Learn from histor...|Libertarian: an a...|2021-01-06 01:31:23|           0|
|Conservative|      ChunkyArsenio|UK: Chief medical...|                    |2021-01-06 01:32:35|           6|
|Conservative|      Lionhearted09|Live Updates in G...|                    |2021-01-06 01:33:05|         847|
|Conservative|           1221Wood|Just a reminder t...|                    |2021-01-06 01:33:30|           0|
|Conservative|     3dprinteddildo|Most Georgia runo...|                    |2021-01-06 01:33:51|         263|
|Conservative|          weethomas|How much do you w...|           [removed]|2021-01-06 01:34:41|           0|
|Ask_Politics|            nicebol|What does your da...|           [removed]|2021-01-06 01:38:05|           1|
|Conservative|  joystickfantastic|Pence told Trump ...|                    |2021-01-06 01:39:35|           0|
| Libertarian|            rgshrey|Blatant plug for ...|                    |2021-01-06 01:40:13|           0|
|Conservative|          [deleted]|Liberal Law Profe...|           [deleted]|2021-01-06 01:40:42|           0|
| Libertarian|anonymous_man842740|Just today I real...|Just today I real...|2021-01-06 01:40:45|          68|
|Conservative|          Vimes3000|Where in Reddit i...|           [removed]|2021-01-06 01:40:52|           0|
|Conservative|             nimobo|CNN Kicks Off 202...|                    |2021-01-06 01:41:36|           3|
|Conservative|          [deleted]|Goodbye /r/Conser...|           [removed]|2021-01-06 01:42:19|           0|
+------------+-------------------+--------------------+--------------------+-------------------+------------+
only showing top 20 rows

Processing large datasets with Apache Spark and Amazon SageMaker¶

Our workflow for processing large amounts of data with SageMaker¶

In this notebook...¶

Setup¶

Utilize S3 Data within local PySpark¶

Process S3 data with SageMaker Processing Job `PySparkProcessor`¶

Read the filtered data¶

Processing large datasets with Apache Spark and Amazon SageMaker¶

Our workflow for processing large amounts of data with SageMaker¶

In this notebook...¶

Setup¶

Utilize S3 Data within local PySpark¶

Process S3 data with SageMaker Processing Job PySparkProcessor¶

Read the filtered data¶

Process S3 data with SageMaker Processing Job `PySparkProcessor`¶