Code and Data

Code

All the code used in this project is available on GitHub. For a description of the code folders, click here.

Data

This project uses the Pushshift Reddit dataset [1]. The available features are listed below. Not all formal definitions are made publicly available by the dataset authors; those shown below were sourced from [1] and [2].

submissions Data Card

Field Type Description
adserver_click_url
adserver_imp_pixel
archived
author String The account name of the poster, e.g., “example_username”
author_cakeday
author_flair_css_class String The CSS class of the author’s flair. This field is specific to the subreddit.
author_flair_text String The text of the author’s flair. This field is specific to the subreddit.
author_id
brand_safe
contest_mode
created_utc Integer UNIX timestamp referring to the time of the submission’s creation, e.g., 1483228803
crosspost_parent
crosspost_parent_list
disable_comments
distinguished String Flag to determine whether the submission is distinguished by moderators. “null” means not distinguished
domain String The domain of the submission, e.g., self.AskReddit
domain_override
edited Long Indicates whether the submission has been edited. Either the UNIX timestamp at which the submission was edited, or “false”.
embed_type
embed_url
gilded Integer The number of times this submission received Reddit gold, e.g., 0
hidden Boolean true if the post is hidden by the logged in user. false if not logged in or not hidden.
hide_score Boolean Flag indicating if the submission’s score is hidden, e.g., false
href_url
id String The submission’s identifier, e.g., “5lcgjh”
imp_pixel
is_crosspostable
is_reddit_media_domain
is_self Boolean Flag that indicates whether the submission is a self post, e.g., true
is_video
link_flair_css_class String The CSS class of the link’s flair.
link_flair_text String The text of the link’s flair.
locked Boolean Flag indicating whether the submission is currently closed to new comments, e.g., false
media Object Used for streaming video. Detailed information about the video and its origins is placed here.
media_embed Object Used for streaming video. Technical embed specific information is found here.
mobile_ad_url
num_comments Integer The number of comments associated with this submission, e.g., 7
num_crossposts
original_link
over_18 Boolean Flag that indicates whether the submission is Not-Safe-For-Work, e.g., false
parent_whitelist_status
permalink String Relative URL of the permanent link that points to this specific submission, e.g., “/r/AskReddit/comments/5lcgj9/what_did_you_think_of_the_ending_of_rogue_one/”
pinned
post_hint
preview
promoted
promoted_by
promoted_display_name
promoted_url
retrieved_on Integer UNIX timestamp referring to the time we crawled the submission, e.g., 1483228803
score Integer The score that the submission has accumulated. The score is the number of upvotes minus the number of downvotes, e.g., 5. NB: Reddit fuzzes the real score to prevent spam bots.
secure_media
secure_media_embed
selftext String The text that is associated with the submission
spoiler
stickied Boolean Flag indicating whether the submission is set as sticky in the subreddit, e.g., false
subreddit String Name of the subreddit that the submission is posted in. Note that it excludes the prefix /r/. E.g., “AskReddit”
subreddit_id String The identifier of the subreddit, e.g., “t5_2qh1i”
suggested_sort
third_party_trackers
third_party_tracking
third_party_tracking_2
thumbnail String Full URL to the thumbnail for this link; “self” if this is a self post; “image” if this is a link to an image but has no thumbnail; “default” if a thumbnail is not available
thumbnail_height
thumbnail_width
title String The title that is associated with the submission, e.g., “What did you think of the ending of Rogue One?”
url String The URL that the submission is posting. This is the same as the permalink in cases where the submission is a self post. E.g., “https://www.reddit.com/r/AskReddit/
whitelist_status
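
The fields above mix several encodings: UNIX timestamps, an `edited` value that is either a timestamp or false, and a relative `permalink`. A minimal sketch of how one record might be normalized is given below. It assumes each submission is available as a JSON object keyed by the field names in the data card. The helper `normalize_submission`, the `REDDIT_BASE` constant, and the sample record are hypothetical and are not part of the project code.

```python
import json
from datetime import datetime, timezone

# Assumed base URL for turning the relative permalink into an absolute one.
REDDIT_BASE = "https://www.reddit.com"


def normalize_submission(record: dict) -> dict:
    """Return a small typed view of one submission record (hypothetical helper)."""
    edited = record.get("edited")
    # `edited` is either false or a UNIX timestamp. The bool check comes
    # first because bool is a subclass of int in Python.
    if edited is None or isinstance(edited, bool):
        edited_at = None
    else:
        edited_at = datetime.fromtimestamp(int(edited), tz=timezone.utc)
    return {
        "id": record["id"],
        "created_at": datetime.fromtimestamp(int(record["created_utc"]), tz=timezone.utc),
        "edited_at": edited_at,
        "is_self": bool(record.get("is_self", False)),
        "full_permalink": REDDIT_BASE + record["permalink"],
    }


# Illustrative record only; the values are made up from the examples in the card.
sample_line = json.dumps({
    "id": "5lcgjh",
    "created_utc": 1483228803,
    "edited": False,
    "is_self": True,
    "permalink": "/r/AskReddit/comments/5lcgjh/example_title/",
})
submission = normalize_submission(json.loads(sample_line))
print(submission["created_at"].isoformat())
print(submission["full_permalink"])
```

Keeping the timestamps timezone-aware (UTC) avoids silently mixing them with local times when submissions are later joined with comments.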

comments Data Card

Field Type Description
author String The account name of the poster, e.g., “example_username”
author_cakeday
author_flair_css_class String The CSS class of the author’s flair. This field is specific to the subreddit.
author_flair_text String The text of the author’s flair. This field is specific to the subreddit.
body String The comment’s text, e.g., “This is an example comment”
can_gild
controversiality Integer Number that indicates whether the comment is controversial, e.g., 0
created_utc Integer UNIX timestamp referring to the time of the comment’s creation, e.g., 1483228803
distinguished String Flag to determine whether the comment is distinguished by the moderators. “null” means not distinguished
edited Long Flag indicating if the comment has been edited. Either the UNIX timestamp that the comment was edited at, or “false”.
gilded Integer The number of times this comment received Reddit gold, e.g., 0
id String The comment’s identifier, e.g., “dbumnq8”
is_submitter
link_id String Identifier of the submission that this comment is in, e.g., “t3_5l954r”
parent_id String Identifier of the parent of this comment: the identifier of the submission if it is a top-level comment, or the identifier of another comment otherwise, e.g., “t1_dbu5bpp”
permalink String Relative URL of the permanent link that points to this specific submission, e.g., “/r/AskReddit/comments/5lcgj9/what_did_you_think_of_the_ending_of_rogue_one/”
retrieved_on Integer UNIX timestamp that refers to the time that we crawled the comment, e.g., 1483228803
score Integer The score of the comment. The score is the number of upvotes minus the number of downvotes, e.g., 5. Note that Reddit fuzzes the real score to prevent spam bots.
stickied Boolean Flag indicating whether the comment is set as sticky in the subreddit, e.g., false
subreddit String Name of the subreddit that the comment is posted in. Note that it excludes the prefix /r/. E.g., “AskReddit”
subreddit_id String The identifier of the subreddit where the comment is posted, e.g., “t5_2qh1i”
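
The `link_id`, `parent_id`, and `subreddit_id` fields carry a type prefix in front of the identifier. A minimal sketch of how these might be split, and how top-level comments might be detected, is given below. It assumes the identifiers use the underscore-separated form (e.g. `t3_5l954r`) and follows the Reddit API convention that `t1` marks a comment, `t3` a submission, and `t5` a subreddit; that mapping is not defined in the data card itself. The helpers `split_fullname` and `is_top_level` are hypothetical.

```python
from typing import Tuple

# Type prefixes per the Reddit API convention (an assumption, not from the card).
KIND_BY_PREFIX = {"t1": "comment", "t3": "submission", "t5": "subreddit"}


def split_fullname(fullname: str) -> Tuple[str, str]:
    """Split e.g. 't3_5l954r' into ('submission', '5l954r')."""
    prefix, sep, bare_id = fullname.partition("_")
    if not sep or prefix not in KIND_BY_PREFIX:
        raise ValueError(f"unrecognized identifier: {fullname!r}")
    return KIND_BY_PREFIX[prefix], bare_id


def is_top_level(comment: dict) -> bool:
    """A comment is top-level when its parent is the submission itself."""
    return comment["parent_id"] == comment["link_id"]


# Illustrative record built from the example values in the card.
comment = {"id": "dbumnq8", "link_id": "t3_5l954r", "parent_id": "t1_dbu5bpp"}
print(split_fullname(comment["parent_id"]))  # ('comment', 'dbu5bpp')
print(is_top_level(comment))                 # False
```

The bare identifier returned for a `link_id` can then be matched against the `id` field of the submissions data to join comments to their submission.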

References

[1] Baumgartner J, Zannettou S, Keegan B, Squire M, Blackburn J. The Pushshift Reddit Dataset. 2020 [accessed 2023 Nov 1]. http://arxiv.org/abs/2001.08435. doi:10.48550/arXiv.2001.08435
[2] The Reddit Archives. JSON. GitHub. 2016 [accessed 2023 Nov 2]. https://github.com/reddit-archive/reddit/wiki/JSON