Code and Data
Code
All the code used in this project is available on GitHub.
Data
- Reddit Archive data: submission and comment data from June 2023 to July 2024 for the subreddits r/tifu and r/confession. View the data structure on GitHub.
- EmoLex (Mohammad and Turney 2013): NRC Word-Emotion Association Lexicon. A list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were produced manually through crowdsourcing. View the data txt file on GitHub.
- DEBAGREEMENT (Pougué-Biyong et al. 2021): Developed in partnership with Oxford University, this dataset contains comment-reply interactions across five subreddits: Democrats, Republicans, Black Lives Matter, Brexit, and Climate. Each comment-reply interaction is annotated with an “agree,” “disagree,” “neutral,” or “unsure” label by at least three raters. Read more about the dataset on OpenReview.
  - 42,894 rows of data, used for training our agreement label table.
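As a rough illustration of how the EmoLex txt file listed above might be read, here is a minimal Python sketch. It assumes the word-level distribution format of the lexicon, where each line holds a word, a category, and a 0/1 association flag separated by tabs. The function name `load_emolex` and the sample rows are hypothetical and included only so the snippet runs without the real file; this is not necessarily how the project's own code loads the lexicon.

```python
import io
from collections import defaultdict


def load_emolex(lines):
    """Map each word to the set of categories it is flagged with.

    Assumes tab-separated lines of the form: word, category, flag (0 or 1).
    Lines that do not have exactly three fields are skipped.
    """
    lexicon = defaultdict(set)
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        word, category, flag = parts
        if flag == "1":
            lexicon[word].add(category)
    return dict(lexicon)


# Hypothetical sample rows, standing in for the real lexicon file.
sample = io.StringIO(
    "abandon\tfear\t1\n"
    "abandon\tjoy\t0\n"
    "abandon\tnegative\t1\n"
    "abandon\tsadness\t1\n"
)

emolex = load_emolex(sample)
print(sorted(emolex["abandon"]))  # ['fear', 'negative', 'sadness']
```

To use it on the actual file, an open file handle can be passed in place of `sample`, since the function only iterates over lines.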
References
Mohammad, Saif M., and Peter D. Turney. 2013. “Crowdsourcing a Word–Emotion Association Lexicon.” Computational Intelligence 29 (3): 436–65. https://doi.org/10.1111/j.1467-8640.2012.00460.x.
Pougué-Biyong, John, Valentina Semenova, Alexandre Matton, Rachel Han, Aerin Kim, Renaud Lambiotte, and Doyne Farmer. 2021. “DEBAGREEMENT: A Comment-Reply Dataset for (Dis)agreement Detection in Online Debates.” In Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=udVUN__gFO.