Processing large datasets with Apache Spark and Amazon SageMaker¶
This notebook run on Data Science 3.0 - Python 3
kernel on a ml.t3.large
instance.
Amazon SageMaker Processing Jobs are used to analyze data and evaluate machine learning models on Amazon SageMaker. With Processing, you can use a simplified, managed experience on SageMaker to run your data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation. You can also use the Amazon SageMaker Processing APIs during the experimentation phase and after the code is deployed in production to evaluate performance.
The preceding diagram shows how Amazon SageMaker spins up a Processing job. Amazon SageMaker takes your script, copies your data from Amazon Simple Storage Service (Amazon S3), and then pulls a processing container. The processing container image can either be an Amazon SageMaker built-in image or a custom image that you provide. The underlying infrastructure for a Processing job is fully managed by Amazon SageMaker. Cluster resources are provisioned for the duration of your job, and cleaned up when a job completes. The output of the Processing job is stored in the Amazon S3 bucket you specified.
Our workflow for processing large amounts of data with SageMaker¶
We can divide our workflow into two steps:
Work with a small subset of the data with Spark running in local model in a SageMaker Studio Notebook.
Once we are able to work with the small subset of data we can provide the same code (as a Python script rather than a series of interactive steps) to SageMaker Processing which launched a Spark cluster, runs out code and terminates the cluster.
In this notebook...¶
We will analyze the Pushshift Reddit dataset to be used for the project and then we will run a SageMaker Processing Job to filter out the comments and submissions from subreddits of interest. The filtered data will be stored in your account's s3 bucket and it is this filtered data that you will be using for your project.
Setup¶
We need an available Java installation to run pyspark. The easiest way to do this is to install JDK and set the proper paths using conda
# Setup - Run only once per Kernel App
%conda install openjdk -y
# install PySpark
%pip install pyspark==3.3.0
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
Collecting package metadata (current_repodata.json): done Solving environment: done ==> WARNING: A newer version of conda exists. <== current version: 23.3.1 latest version: 23.9.0 Please update conda by running $ conda update -n base -c defaults conda Or to minimize the number of packages updated during conda update use conda install conda=23.9.0 # All requested packages already installed. Note: you may need to restart the kernel to use updated packages. Requirement already satisfied: pyspark==3.3.0 in /opt/conda/lib/python3.10/site-packages (3.3.0) Requirement already satisfied: py4j==0.10.9.5 in /opt/conda/lib/python3.10/site-packages (from pyspark==3.3.0) (0.10.9.5) WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv [notice] A new release of pip is available: 23.2.1 -> 23.3.1 [notice] To update, run: pip install --upgrade pip Note: you may need to restart the kernel to use updated packages.
# Import pyspark and build Spark session
from pyspark.sql import SparkSession
import time
spark = (
SparkSession.builder.appName("PySparkApp")
.config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
.config(
"fs.s3a.aws.credentials.provider",
"com.amazonaws.auth.ContainerCredentialsProvider",
)
.getOrCreate()
)
print(spark.version)
Warning: Ignoring non-Spark config property: fs.s3a.aws.credentials.provider
:: loading settings :: url = jar:file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache The jars for the packages stored in: /root/.ivy2/jars org.apache.hadoop#hadoop-aws added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-c0c02831-9312-41d7-a2f6-a926d4daf29a;1.0 confs: [default] found org.apache.hadoop#hadoop-aws;3.2.2 in central found com.amazonaws#aws-java-sdk-bundle;1.11.563 in central :: resolution report :: resolve 472ms :: artifacts dl 23ms :: modules in use: com.amazonaws#aws-java-sdk-bundle;1.11.563 from central in [default] org.apache.hadoop#hadoop-aws;3.2.2 from central in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 2 | 0 | 0 | 0 || 2 | 0 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent-c0c02831-9312-41d7-a2f6-a926d4daf29a confs: [default] 0 artifacts copied, 2 already retrieved (0kB/24ms)
23/10/30 16:06:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
3.3.0
Process S3 data with SageMaker Processing Job PySparkProcessor
¶
We are going to move the above processing code in a Python file and then submit that file to SageMaker Processing Job's PySparkProcessor
.
#!mkdir -p ./code
!pwd
/root/fall-2023-reddit-project-team-01/code/preprocessing
%%writefile ./process_askpolitics_changemyview.py
import os
import time
import logging
import argparse
# Import pyspark and build Spark session
from pyspark.sql.functions import *
from pyspark.sql.types import (
DoubleType,
IntegerType,
StringType,
StructField,
StructType,
)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
logging.basicConfig(format='%(asctime)s,%(levelname)s,%(module)s,%(filename)s,%(lineno)d,%(message)s', level=logging.DEBUG)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))
def main():
parser = argparse.ArgumentParser(description="app inputs and outputs")
parser.add_argument("--s3_dataset_path", type=str, help="Path of dataset in S3")
parser.add_argument("--s3_output_bucket", type=str, help="s3 output bucket")
parser.add_argument("--s3_output_prefix", type=str, help="s3 output prefix")
parser.add_argument("--col_name_for_filtering", type=str, help="Name of the column to filter")
parser.add_argument("--values_to_keep", type=str, help="comma separated list of values to keep in the filtered set")
args = parser.parse_args()
spark = SparkSession.builder.appName("PySparkApp").getOrCreate()
logger.info(f"spark version = {spark.version}")
# This is needed to save RDDs which is the only way to write nested Dataframes into CSV format
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set(
"mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter"
)
# Downloading the data from S3 into a Dataframe
logger.info(f"going to read {args.s3_dataset_path}")
df = spark.read.parquet(args.s3_dataset_path, header=True)
logger.info(f"finished reading files...")
# filter the dataframe to only keep the values of interest
vals = [s.strip() for s in args.values_to_keep.split(",")]
df_filtered = df.where(col(args.col_name_for_filtering).isin(vals))
# save the filtered dataframes so that these files can now be used for future analysis
s3_path = f"s3://{args.s3_output_bucket}/{args.s3_output_prefix}"
logger.info(f"going to write data for {vals} in {s3_path}")
logger.info(f"shape of the df_filtered dataframe is {df_filtered.count():,}x{len(df_filtered.columns)}")
df_filtered.write.mode("overwrite").parquet(s3_path)
logger.info(f"all done...")
if __name__ == "__main__":
main()
Overwriting ./process_centrist_libertarian.py
Now submit this code to SageMaker Processing Job.
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
# Setup the PySpark processor to run the job. Note the instance type and instance count parameters. SageMaker will create these many instances of this type for the spark job.
role = sagemaker.get_execution_role()
spark_processor = PySparkProcessor(
base_job_name="sm-spark-project",
framework_version="3.3",
role=role,
instance_count=8,
instance_type="ml.m5.xlarge",
max_runtime_in_seconds=7200,
)
# s3 paths
session = sagemaker.Session()
bucket = session.default_bucket()
output_prefix_logs = f"spark_logs"
col_name_for_filtering = "subreddit"
# modify this comma separated list to choose the subreddits of interest
subreddits = "changemyview, Ask_Politics"
configuration = [
{
"Classification": "spark-defaults",
"Properties": {"spark.executor.memory": "12g", "spark.executor.cores": "4"},
}
]
years = [2021, 2022, 2023]
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
for year in years:
# comments
print(f"Working on Comments for year {year}")
s3_dataset_path_commments = f"s3://bigdatateaching/reddit-parquet/comments/year={year}/month=*/*.parquet"
output_prefix_data_comments = f"anthony_project/comments/year={year}"
spark_processor.run(
submit_app="./process_askpolitics_changemyview.py",
arguments=[
"--s3_dataset_path",
s3_dataset_path_commments,
"--s3_output_bucket",
bucket,
"--s3_output_prefix",
output_prefix_data_comments,
"--col_name_for_filtering",
col_name_for_filtering,
"--values_to_keep",
subreddits,
],
spark_event_logs_s3_uri="s3://{}/{}/spark_event_logs".format(bucket, output_prefix_logs),
logs=False,
configuration=configuration
)
time.sleep(60)
# submissions
print(f"Working on Submissions for year {year}")
s3_dataset_path_submissions = f"s3://bigdatateaching/reddit-parquet/submissions/year={year}/month=*/*.parquet"
output_prefix_data_submissions = f"anthony_project/submissions/year={year}"
spark_processor.run(
submit_app="./process_askpolitics_changemyview.py",
arguments=[
"--s3_dataset_path",
s3_dataset_path_submissions,
"--s3_output_bucket",
bucket,
"--s3_output_prefix",
output_prefix_data_submissions,
"--col_name_for_filtering",
col_name_for_filtering,
"--values_to_keep",
subreddits,
],
spark_event_logs_s3_uri="s3://{}/{}/spark_event_logs".format(bucket, output_prefix_logs),
logs=False,
configuration=configuration
)
time.sleep(60)
Working on Comments for year 2021 ...............................................................................................................................................................................................!Working on Submissions for year 2021 .....................................................................................................................................................................................................................................!Working on Comments for year 2022 ...................................................................................................................................................................................................................!Working on Submissions for year 2022 ................................................................................................................................................................................................................................!Working on Comments for year 2023 ................................................................................................................!Working on Submissions for year 2023 ....................................................................................................................!
Read the filtered data¶
Now that we have filtered the data to only keep submissions and comments from subreddits of interest. Let us read data from the s3 path where we saved the filtered data.
%%time
s3_path = f"s3a://{bucket}/project/comments/year=*/"
print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path, header=True)
comments.persist()
#print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
reading comments from s3a://sagemaker-us-east-1-433974840707/project/comments/year=*/ 23/10/30 17:46:50 WARN CacheManager: Asked to cache already cached data. CPU times: user 5.63 ms, sys: 656 µs, total: 6.29 ms Wall time: 1.69 s
DataFrame[author: string, author_cakeday: boolean, author_flair_css_class: string, author_flair_text: string, body: string, can_gild: boolean, controversiality: bigint, created_utc: timestamp, distinguished: string, edited: string, gilded: bigint, id: string, is_submitter: boolean, link_id: string, parent_id: string, permalink: string, retrieved_on: timestamp, score: bigint, stickied: boolean, subreddit: string, subreddit_id: string]
# check counts (ensuring all needed subreddits exist)
comments.groupBy('subreddit').count().show()
23/10/30 17:46:51 WARN MemoryStore: Not enough space to cache rdd_61_1 in memory! (computed 3.4 MiB so far) 23/10/30 17:46:52 WARN MemoryStore: Not enough space to cache rdd_61_0 in memory! (computed 13.9 MiB so far) 23/10/30 17:46:52 WARN MemoryStore: Not enough space to cache rdd_61_2 in memory! (computed 12.5 MiB so far) 23/10/30 17:46:52 WARN MemoryStore: Not enough space to cache rdd_61_4 in memory! (computed 12.1 MiB so far)
[Stage 23:=> (8 + 2) / 244]
23/10/30 17:46:52 WARN MemoryStore: Not enough space to cache rdd_61_7 in memory! (computed 3.4 MiB so far) 23/10/30 17:46:52 WARN MemoryStore: Not enough space to cache rdd_61_8 in memory! (computed 3.4 MiB so far) 23/10/30 17:46:52 WARN MemoryStore: Not enough space to cache rdd_61_9 in memory! (computed 3.4 MiB so far) 23/10/30 17:46:52 WARN MemoryStore: Not enough space to cache rdd_61_11 in memory! (computed 3.4 MiB so far)
[Stage 23:===> (14 + 2) / 244]
23/10/30 17:46:52 WARN MemoryStore: Not enough space to cache rdd_61_13 in memory! (computed 10.9 MiB so far) 23/10/30 17:46:52 WARN MemoryStore: Not enough space to cache rdd_61_15 in memory! (computed 10.8 MiB so far) 23/10/30 17:46:53 WARN MemoryStore: Not enough space to cache rdd_61_17 in memory! (computed 10.7 MiB so far) 23/10/30 17:46:53 WARN MemoryStore: Not enough space to cache rdd_61_19 in memory! (computed 10.5 MiB so far) 23/10/30 17:46:53 WARN MemoryStore: Not enough space to cache rdd_61_21 in memory! (computed 3.4 MiB so far)
[Stage 23:====> (21 + 2) / 244]
23/10/30 17:46:53 WARN MemoryStore: Not enough space to cache rdd_61_23 in memory! (computed 3.4 MiB so far) 23/10/30 17:46:53 WARN MemoryStore: Not enough space to cache rdd_61_26 in memory! (computed 3.4 MiB so far) 23/10/30 17:46:53 WARN MemoryStore: Not enough space to cache rdd_61_25 in memory! (computed 10.2 MiB so far)
[Stage 23:======> (27 + 2) / 244]
23/10/30 17:46:53 WARN MemoryStore: Not enough space to cache rdd_61_29 in memory! (computed 10.0 MiB so far) 23/10/30 17:46:54 WARN MemoryStore: Not enough space to cache rdd_61_32 in memory! (computed 9.8 MiB so far)
[Stage 23:=======> (33 + 2) / 244]
23/10/30 17:46:54 WARN MemoryStore: Not enough space to cache rdd_61_36 in memory! (computed 3.4 MiB so far) 23/10/30 17:46:54 WARN MemoryStore: Not enough space to cache rdd_61_37 in memory! (computed 9.6 MiB so far) 23/10/30 17:46:54 WARN MemoryStore: Not enough space to cache rdd_61_38 in memory! (computed 9.6 MiB so far)
[Stage 23:========> (37 + 2) / 244]
23/10/30 17:46:55 WARN MemoryStore: Not enough space to cache rdd_61_41 in memory! (computed 9.5 MiB so far) 23/10/30 17:46:55 WARN MemoryStore: Not enough space to cache rdd_61_43 in memory! (computed 9.4 MiB so far)
[Stage 23:=========> (42 + 2) / 244]
23/10/30 17:46:55 WARN MemoryStore: Not enough space to cache rdd_61_46 in memory! (computed 3.4 MiB so far) 23/10/30 17:46:55 WARN MemoryStore: Not enough space to cache rdd_61_47 in memory! (computed 9.2 MiB so far) 23/10/30 17:46:55 WARN MemoryStore: Not enough space to cache rdd_61_49 in memory! (computed 9.1 MiB so far)
[Stage 23:===========> (52 + 2) / 244]
23/10/30 17:46:55 WARN MemoryStore: Not enough space to cache rdd_61_51 in memory! (computed 9.1 MiB so far) 23/10/30 17:46:55 WARN MemoryStore: Not enough space to cache rdd_61_53 in memory! (computed 9.0 MiB so far) 23/10/30 17:46:55 WARN MemoryStore: Not enough space to cache rdd_61_54 in memory! (computed 3.4 MiB so far)
[Stage 23:=============> (58 + 2) / 244]
23/10/30 17:46:56 WARN MemoryStore: Not enough space to cache rdd_61_57 in memory! (computed 3.4 MiB so far) 23/10/30 17:46:56 WARN MemoryStore: Not enough space to cache rdd_61_61 in memory! (computed 8.7 MiB so far)
[Stage 23:===============> (67 + 2) / 244]
23/10/30 17:46:56 WARN MemoryStore: Not enough space to cache rdd_61_65 in memory! (computed 8.6 MiB so far) 23/10/30 17:46:56 WARN MemoryStore: Not enough space to cache rdd_61_68 in memory! (computed 8.4 MiB so far) 23/10/30 17:46:57 WARN MemoryStore: Not enough space to cache rdd_61_69 in memory! (computed 8.4 MiB so far)
[Stage 23:================> (72 + 2) / 244]
23/10/30 17:46:57 WARN MemoryStore: Not enough space to cache rdd_61_73 in memory! (computed 3.4 MiB so far) 23/10/30 17:46:57 WARN MemoryStore: Not enough space to cache rdd_61_75 in memory! (computed 3.4 MiB so far)
[Stage 23:=================> (76 + 2) / 244]
23/10/30 17:46:57 WARN MemoryStore: Not enough space to cache rdd_61_77 in memory! (computed 8.1 MiB so far)
[Stage 23:==================> (84 + 2) / 244]
23/10/30 17:46:58 WARN MemoryStore: Not enough space to cache rdd_61_80 in memory! (computed 8.0 MiB so far)
[Stage 23:======================> (101 + 2) / 244]
23/10/30 17:46:58 WARN MemoryStore: Not enough space to cache rdd_61_96 in memory! (computed 3.4 MiB so far) 23/10/30 17:46:59 WARN MemoryStore: Not enough space to cache rdd_61_103 in memory! (computed 3.4 MiB so far) 23/10/30 17:46:59 WARN MemoryStore: Not enough space to cache rdd_61_106 in memory! (computed 3.4 MiB so far)
[Stage 23:===========================> (125 + 2) / 244]
23/10/30 17:46:59 WARN MemoryStore: Not enough space to cache rdd_61_121 in memory! (computed 3.4 MiB so far) 23/10/30 17:46:59 WARN MemoryStore: Not enough space to cache rdd_61_129 in memory! (computed 3.4 MiB so far) 23/10/30 17:47:00 WARN MemoryStore: Not enough space to cache rdd_61_134 in memory! (computed 3.4 MiB so far)
[Stage 23:==============================> (139 + 2) / 244]
23/10/30 17:47:00 WARN MemoryStore: Not enough space to cache rdd_61_144 in memory! (computed 3.4 MiB so far)
[Stage 23:====================================> (167 + 2) / 244]
23/10/30 17:47:01 WARN MemoryStore: Not enough space to cache rdd_61_171 in memory! (computed 3.4 MiB so far)
[Stage 23:========================================> (183 + 2) / 244]
23/10/30 17:47:01 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_188 in memory. 23/10/30 17:47:01 WARN MemoryStore: Not enough space to cache rdd_61_188 in memory! (computed 384.0 B so far) 23/10/30 17:47:01 WARN MemoryStore: Not enough space to cache rdd_61_191 in memory! (computed 3.3 MiB so far)
[Stage 23:===================================================> (234 + 2) / 244]
23/10/30 17:47:02 WARN MemoryStore: Not enough space to cache rdd_61_240 in memory! (computed 1694.5 KiB so far) +-----------+-------+ | subreddit| count| +-----------+-------+ |Libertarian|2706909| | centrist| 921871| +-----------+-------+
from pyspark.sql.functions import min, max
comments.select(min('created_utc').alias('min_created_utc'),
max('created_utc').alias('max_created_utc')).first()
23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_1 in memory! (computed 12.9 MiB so far) 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_0 in memory! (computed 13.9 MiB so far) 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_2 in memory! (computed 12.5 MiB so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_4 in memory. 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_5 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_4 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_5 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_6 in memory. 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_7 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_6 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_7 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_8 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_8 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_9 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_9 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_10 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_10 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_11 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_11 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_12 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_12 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_13 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_13 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_14 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_14 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_15 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_15 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_16 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_16 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_17 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_17 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_18 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_18 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_19 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_19 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_20 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_20 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_21 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_21 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_22 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_22 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_23 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_23 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_24 in memory. 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_25 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_24 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_25 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_26 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_26 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_27 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_27 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_28 in memory. 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_29 in memory.
[Stage 26:=======> (34 + 2) / 244]
23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_28 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_29 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_30 in memory. 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_31 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_30 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_31 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_32 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_32 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_33 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_33 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_34 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_34 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_35 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_35 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_36 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_36 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_37 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_37 in memory! (computed 384.0 B so far) 23/10/30 17:47:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_38 in memory. 23/10/30 17:47:03 WARN MemoryStore: Not enough space to cache rdd_61_38 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_39 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_39 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_40 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_40 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_41 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_41 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_42 in memory.
[Stage 26:=========> (44 + 2) / 244]
23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_42 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_43 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_43 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_44 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_44 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_45 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_45 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_46 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_46 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_47 in memory. 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_48 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_47 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_48 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_49 in memory. 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_50 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_49 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_50 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_51 in memory. 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_52 in memory.
[Stage 26:===========> (53 + 2) / 244]
23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_51 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_52 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_53 in memory. 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_54 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_53 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_54 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_55 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_55 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_56 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_56 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_57 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_57 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_58 in memory. 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_59 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_58 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_59 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_60 in memory. 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_61 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_60 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_61 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_62 in memory. 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_63 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_62 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_63 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_64 in memory. 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_65 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_64 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_65 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_66 in memory. 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_67 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_66 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_67 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_68 in memory. 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_69 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_68 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_69 in memory! (computed 384.0 B so far)
[Stage 26:=============> (62 + 2) / 244]
23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_70 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_70 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_71 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_71 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_72 in memory. 23/10/30 17:47:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_73 in memory. 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_72 in memory! (computed 384.0 B so far) 23/10/30 17:47:04 WARN MemoryStore: Not enough space to cache rdd_61_73 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_74 in memory. 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_75 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_74 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_75 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_76 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_76 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_77 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_77 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_78 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_78 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_79 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_79 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_80 in memory.
[Stage 26:==================> (83 + 2) / 244]
23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_80 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_81 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_81 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_82 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_82 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_83 in memory. 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_84 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_83 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_84 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_85 in memory. 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_86 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_85 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_86 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_87 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_87 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_88 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_88 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_89 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_89 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_90 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_90 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_91 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_91 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_92 in memory. 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_93 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_92 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_93 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_94 in memory. 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_95 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_94 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_95 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_96 in memory. 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_97 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_96 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_97 in memory! (computed 384.0 B so far)
[Stage 26:======================> (101 + 2) / 244]
23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_98 in memory. 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_99 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_98 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_99 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_100 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_100 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_101 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_101 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_102 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_102 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_103 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_103 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_104 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_104 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_105 in memory. 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_106 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_105 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_106 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_107 in memory. 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_108 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_107 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_108 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_110 in memory. 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_109 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_110 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_109 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_111 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_111 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_112 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_112 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_113 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_113 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_114 in memory. 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_115 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_115 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_114 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_121 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_121 in memory! (computed 384.0 B so far) 23/10/30 17:47:05 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_129 in memory. 23/10/30 17:47:05 WARN MemoryStore: Not enough space to cache rdd_61_129 in memory! (computed 384.0 B so far) 23/10/30 17:47:06 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_134 in memory.
[Stage 26:==============================> (136 + 2) / 244]
23/10/30 17:47:06 WARN MemoryStore: Not enough space to cache rdd_61_134 in memory! (computed 384.0 B so far) 23/10/30 17:47:06 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_144 in memory. 23/10/30 17:47:06 WARN MemoryStore: Not enough space to cache rdd_61_144 in memory! (computed 384.0 B so far)
[Stage 26:=========================================> (189 + 2) / 244]
23/10/30 17:47:06 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_171 in memory. 23/10/30 17:47:06 WARN MemoryStore: Not enough space to cache rdd_61_171 in memory! (computed 384.0 B so far) 23/10/30 17:47:06 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_188 in memory. 23/10/30 17:47:06 WARN MemoryStore: Not enough space to cache rdd_61_188 in memory! (computed 384.0 B so far) 23/10/30 17:47:06 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_191 in memory. 23/10/30 17:47:06 WARN MemoryStore: Not enough space to cache rdd_61_191 in memory! (computed 384.0 B so far)
[Stage 26:==============================================> (210 + 2) / 244]
23/10/30 17:47:06 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_61_240 in memory. 23/10/30 17:47:06 WARN MemoryStore: Not enough space to cache rdd_61_240 in memory! (computed 384.0 B so far)
Row(min_created_utc=datetime.datetime(2021, 1, 1, 0, 0, 8), max_created_utc=datetime.datetime(2023, 3, 31, 23, 59, 12))
%%time
s3_path = f"s3a://{bucket}/project/submissions/year=*/" # Use * as a wildcard to match all subdirectories
print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path, header=True)
submissions.persist()
#print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
reading submissions from s3a://sagemaker-us-east-1-433974840707/project/submissions/year=*/ 23/10/30 17:47:08 WARN CacheManager: Asked to cache already cached data. CPU times: user 4.77 ms, sys: 242 µs, total: 5.01 ms Wall time: 1.13 s
DataFrame[adserver_click_url: string, adserver_imp_pixel: string, archived: boolean, author: string, author_cakeday: boolean, author_flair_css_class: string, author_flair_text: string, author_id: string, brand_safe: boolean, contest_mode: boolean, created_utc: timestamp, crosspost_parent: string, crosspost_parent_list: array<struct<approved_at_utc:string,approved_by:string,archived:boolean,author:string,author_flair_css_class:string,author_flair_text:string,banned_at_utc:string,banned_by:string,brand_safe:boolean,can_gild:boolean,can_mod_post:boolean,clicked:boolean,contest_mode:boolean,created:double,created_utc:double,distinguished:string,domain:string,downs:bigint,edited:boolean,gilded:bigint,hidden:boolean,hide_score:boolean,id:string,is_crosspostable:boolean,is_reddit_media_domain:boolean,is_self:boolean,is_video:boolean,likes:string,link_flair_css_class:string,link_flair_text:string,locked:boolean,media:string,mod_reports:array<string>,name:string,num_comments:bigint,num_crossposts:bigint,num_reports:string,over_18:boolean,parent_whitelist_status:string,permalink:string,pinned:boolean,quarantine:boolean,removal_reason:string,report_reasons:string,saved:boolean,score:bigint,secure_media:string,selftext:string,selftext_html:string,spoiler:boolean,stickied:boolean,subreddit:string,subreddit_id:string,subreddit_name_prefixed:string,subreddit_type:string,suggested_sort:string,thumbnail:string,thumbnail_height:string,thumbnail_width:string,title:string,ups:bigint,url:string,user_reports:array<string>,view_count:string,visited:boolean,whitelist_status:string>>, disable_comments: boolean, distinguished: string, domain: string, domain_override: string, edited: string, embed_type: string, embed_url: string, gilded: bigint, hidden: boolean, hide_score: boolean, href_url: string, id: string, imp_pixel: string, is_crosspostable: boolean, is_reddit_media_domain: boolean, is_self: boolean, is_video: boolean, link_flair_css_class: string, link_flair_text: string, locked: boolean, media: struct<event_id:string,oembed:struct<author_name:string,author_url:string,cache_age:bigint,description:string,height:bigint,html:string,provider_name:string,provider_url:string,thumbnail_height:bigint,thumbnail_url:string,thumbnail_width:bigint,title:string,type:string,url:string,version:string,width:bigint>,reddit_video:struct<dash_url:string,duration:bigint,fallback_url:string,height:bigint,hls_url:string,is_gif:boolean,scrubber_media_url:string,transcoding_status:string,width:bigint>,type:string>, media_embed: struct<content:string,height:bigint,scrolling:boolean,width:bigint>, mobile_ad_url: string, num_comments: bigint, num_crossposts: bigint, original_link: string, over_18: boolean, parent_whitelist_status: string, permalink: string, pinned: boolean, post_hint: string, preview: struct<enabled:boolean,images:array<struct<id:string,resolutions:array<struct<height:bigint,url:string,width:bigint>>,source:struct<height:bigint,url:string,width:bigint>,variants:struct<gif:struct<resolutions:array<struct<height:bigint,url:string,width:bigint>>,source:struct<height:bigint,url:string,width:bigint>>,mp4:struct<resolutions:array<struct<height:bigint,url:string,width:bigint>>,source:struct<height:bigint,url:string,width:bigint>>,nsfw:struct<resolutions:array<struct<height:bigint,url:string,width:bigint>>,source:struct<height:bigint,url:string,width:bigint>>,obfuscated:struct<resolutions:array<struct<height:bigint,url:string,width:bigint>>,source:struct<height:bigint,url:string,width:bigint>>>>>>, promoted: boolean, promoted_by: string, promoted_display_name: string, promoted_url: string, retrieved_on: timestamp, score: bigint, secure_media: struct<event_id:string,oembed:struct<author_name:string,author_url:string,cache_age:bigint,description:string,height:bigint,html:string,provider_name:string,provider_url:string,thumbnail_height:bigint,thumbnail_url:string,thumbnail_width:bigint,title:string,type:string,url:string,version:string,width:bigint>,type:string>, secure_media_embed: struct<content:string,height:bigint,media_domain_url:string,scrolling:boolean,width:bigint>, selftext: string, spoiler: boolean, stickied: boolean, subreddit: string, subreddit_id: string, suggested_sort: string, third_party_trackers: array<string>, third_party_tracking: string, third_party_tracking_2: string, thumbnail: string, thumbnail_height: bigint, thumbnail_width: bigint, title: string, url: string, whitelist_status: string]
# check counts (ensuring all needed subreddits exist)
submissions.groupBy('subreddit').count().show()
[Stage 30:==================================================> (87 + 2) / 97]
+-----------+-----+ | subreddit|count| +-----------+-----+ |Libertarian|51153| | centrist|13594| +-----------+-----+
from pyspark.sql.functions import min, max
submissions.select(min('created_utc').alias('min_created_utc'),
max('created_utc').alias('max_created_utc')).first()
Row(min_created_utc=datetime.datetime(2021, 1, 1, 0, 0, 58), max_created_utc=datetime.datetime(2023, 3, 31, 23, 39, 22))