Code and Data
Code
All the code used in this project is available on GitHub. For a description of the code folders, click here.
Data
This project uses the Pushshift Reddit dataset [1]. The available fields are listed below. The dataset authors do not publish formal definitions for every field; the definitions shown below were sourced from [1] and [2].
submissions Data Card
| Field | Type | Description |
|---|---|---|
| adserver_click_url | ||
| adserver_imp_pixel | ||
| archived | ||
| author | String | The account name of the poster, e.g., “example username” |
| author_cakeday | ||
| author_flair_css_class | String | The CSS class of the author’s flair. This field is specific to subreddit |
| author_flair_text | String | The text of the author’s flair. This field is specific to subreddit |
| author_id | ||
| brand_safe | ||
| contest_mode | ||
| created_utc | Integer | UNIX timestamp referring to the time of the submission’s creation, e.g., 1483228803 |
| crosspost_parent | ||
| crosspost_parent_list | ||
| disable_comments | ||
| distinguished | String | Flag indicating whether the submission is distinguished by moderators. “null” means not distinguished |
| domain | String | The domain of the submission, e.g., self.AskReddit |
| domain_override | ||
| edited | Long | Indicates whether the submission has been edited: either the UNIX timestamp at which the submission was edited, or “false”. |
| embed_type | ||
| embed_url | ||
| gilded | Integer | The number of times this submission received Reddit gold, e.g., 0 |
| hidden | Boolean | true if the post is hidden by the logged-in user; false if not logged in or not hidden. |
| hide_score | Boolean | Flag indicating if the submission’s score is hidden, e.g., false |
| href_url | ||
| id | String | The submission’s identifier, e.g., “5lcgjh” |
| imp_pixel | ||
| is_crosspostable | ||
| is_reddit_media_domain | ||
| is_self | Boolean | Flag that indicates whether the submission is a self post, e.g., true |
| is_video | ||
| link_flair_css_class | String | The CSS class of the link’s flair. |
| link_flair_text | String | The text of the link’s flair. |
| locked | Boolean | Flag indicating whether the submission is currently closed to new comments, e.g., false |
| media | Object | Used for streaming video. Detailed information about the video and its origins is placed here. |
| media_embed | Object | Used for streaming video. Technical embed specific information is found here. |
| mobile_ad_url | ||
| num_comments | Integer | The number of comments associated with this submission, e.g., 7 |
| num_crossposts | ||
| original_link | ||
| over_18 | Boolean | Flag that indicates whether the submission is Not-Safe-For-Work, e.g., false |
| parent_whitelist_status | ||
| permalink | String | Relative URL of the permanent link that points to this specific submission, e.g., “/r/AskReddit/comments/5lcgj9/what_did_you_think_of_the_ending_of_rogue_one/” |
| pinned | ||
| post_hint | ||
| preview | ||
| promoted | ||
| promoted_by | ||
| promoted_display_name | ||
| promoted_url | ||
| retrieved_on | Integer | UNIX timestamp referring to the time the submission was crawled, e.g., 1483228803 |
| score | Integer | The score that the submission has accumulated, i.e., the number of upvotes minus the number of downvotes, e.g., 5. NB: Reddit fuzzes the real score to prevent spam bots. |
| secure_media | ||
| secure_media_embed | ||
| selftext | String | The text that is associated with the submission |
| spoiler | ||
| stickied | Boolean | Flag indicating whether the submission is set as sticky in the subreddit, e.g., false |
| subreddit | String | Name of the subreddit that the submission is posted in. Note that it excludes the prefix /r/. E.g., “AskReddit” |
| subreddit_id | String | The identifier of the subreddit, e.g., “t5_2qh1i” |
| suggested_sort | ||
| third_party_trackers | ||
| third_party_tracking | ||
| third_party_tracking_2 | ||
| thumbnail | String | Full URL to the thumbnail for this link; “self” if this is a self post; “image” if this is a link to an image but has no thumbnail; “default” if a thumbnail is not available |
| thumbnail_height | ||
| thumbnail_width | ||
| title | String | The title that is associated with the submission, e.g., “What did you think of the ending of Rogue One?” |
| url | String | The URL that the submission is posting. This is the same as the permalink in cases where the submission is a self post. E.g., “https://www.reddit.com/r/AskReddit/ |
| whitelist_status | ||
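
Several fields in the table need care when loaded: `created_utc` is a UNIX timestamp, `edited` is either “false” or a timestamp, `distinguished` is null when the submission is not distinguished, and `thumbnail` mixes URLs with the sentinel values “self”, “image” and “default”. The Python sketch below shows one way these could be normalized. It assumes a record has already been decoded from JSON into a `dict`; the function names `parse_edited` and `normalize_submission` and the output keys are hypothetical and are not part of this project's code.

```python
from datetime import datetime, timezone
from typing import Any, Optional

# Sentinel thumbnail values listed in the data card (not real URLs).
THUMBNAIL_SENTINELS = {"self", "image", "default"}


def parse_edited(value: Any) -> Optional[datetime]:
    """Return the edit time as a UTC datetime, or None if there is no timestamp."""
    # bool is a subclass of int in Python, so booleans must be checked first.
    # Assumption: any boolean (or missing value) is treated as "no edit timestamp".
    if value is None or isinstance(value, bool):
        return None
    return datetime.fromtimestamp(float(value), tz=timezone.utc)


def normalize_submission(record: dict) -> dict:
    """Convert a raw submission record into a small dict with typed values."""
    thumbnail = record.get("thumbnail")
    return {
        "id": record["id"],
        "subreddit": record.get("subreddit"),
        # created_utc is a UNIX timestamp in seconds.
        "created": datetime.fromtimestamp(int(record["created_utc"]), tz=timezone.utc),
        "edited": parse_edited(record.get("edited", False)),
        # JSON null (None in Python) means "not distinguished".
        "is_distinguished": record.get("distinguished") is not None,
        # Only treat the thumbnail as a URL if it is not a sentinel value.
        "has_thumbnail_url": thumbnail is not None and thumbnail not in THUMBNAIL_SENTINELS,
    }


if __name__ == "__main__":
    # Illustrative record built from the example values in the table above.
    sample = {
        "id": "5lcgjh",
        "subreddit": "AskReddit",
        "created_utc": 1483228803,
        "edited": False,
        "distinguished": None,
        "thumbnail": "self",
    }
    print(normalize_submission(sample))
```

Checking for booleans before converting `edited` matters because `False` would otherwise be read as timestamp 0.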
References
[1] Baumgartner J, Zannettou S, Keegan B, Squire M, Blackburn J. The Pushshift Reddit Dataset. 2020 [accessed 2023 Nov 1]. http://arxiv.org/abs/2001.08435. doi:10.48550/arXiv.2001.08435
[2] The Reddit Archives. JSON. GitHub. 2016 [accessed 2023 Nov 2]. https://github.com/reddit-archive/reddit/wiki/JSON