Using NLP to Analyze Social Media at Scale

Introduction:

Oxfam is a global organization committed to combating inequality, poverty, and injustice. As a campaigning organization, it has a strong interest in understanding public sentiment. Historically, Oxfam relied on polling and media representations to gauge public opinion, but these sources have faced recent challenges. At the same time, the rise of online discussion has created a new source of information about public sentiment. Analyzing this online discourse and extracting insights from it, however, presents difficulties of its own. I was honored to receive the Siegel PiTech Impact Fellowship, which enabled me to contribute my expertise to Oxfam in addressing them.

Discovery and Exploration Process:

This summer, I collaborated with Oxfam as a Siegel Family Endowment PiTech PhD Impact Fellow to explore whether recent advances in Natural Language Processing (NLP) and machine learning could enable the analysis of online discussions at scale. My work centered on an experimental analysis using these methods, focusing on the Amazon/Walmart warehouse campaign, which aims to improve how the two companies treat their workers.

The initial challenge was sourcing data. I gathered Reddit content related to Amazon and Walmart, with a specific focus on their warehouse employees. I meticulously preprocessed this data and transformed it into a structured database. I also streamlined this process and developed a pipeline that could be adapted to analyze discussions on any topic of interest to Oxfam.
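As a rough illustration of the preprocessing and storage step, the sketch below cleans raw post text and writes it to a database table. Everything specific here is an assumption made for the example rather than a description of the actual pipeline: the use of SQLite, the table schema, the field names (`id`, `subreddit`, `kind`, `body`), the cleaning rules, and the function names `clean_text` and `store_posts`.

```python
import re
import sqlite3

URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")
# Placeholder bodies Reddit shows for deleted or moderator-removed content.
PLACEHOLDER_BODIES = {"[deleted]", "[removed]"}


def clean_text(text):
    """Drop URLs and collapse whitespace; return '' for placeholder bodies."""
    text = text.strip()
    if text in PLACEHOLDER_BODIES:
        return ""
    text = URL_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def store_posts(conn, posts):
    """Write cleaned posts to a `posts` table and return how many were written.

    `posts` is an iterable of dicts with the (assumed) keys
    id, subreddit, kind ("submission" or "comment"), and body.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id TEXT PRIMARY KEY, subreddit TEXT, kind TEXT, body TEXT)"
    )
    written = 0
    for post in posts:
        body = clean_text(post["body"])
        if not body:
            continue  # skip deleted, removed, or empty posts
        conn.execute(
            "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)",
            (post["id"], post["subreddit"], post["kind"], body),
        )
        written += 1
    conn.commit()
    return written
```

Because the cleaning and storage functions take the posts as plain dictionaries, the same code could in principle be pointed at content on a different topic without changes, which is the kind of adaptability the pipeline is meant to provide.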

The second challenge was interpreting this dataset effectively. Recognizing that Oxfam would benefit most from a statistical approach, I employed topic modeling, an unsupervised machine learning technique that groups similar Reddit submissions and comments into clusters. To do this, I used BERTopic, implementing a pipeline that combined transformer-based embeddings, dimensionality reduction (UMAP), and HDBSCAN clustering. I then used KeyBERT to generate custom labels for the identified topics. The analysis aimed to characterize online discussions and assess the extent to which they expressed discontent with how these corporations treat their workers.
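For readers unfamiliar with BERTopic, a minimal sketch of how such a pipeline can be wired together is shown below. It is not the configuration used in this project: the embedding model name, all parameter values, and the function names `build_topic_model` and `fit_topics` are assumptions chosen for illustration. BERTopic's built-in `KeyBERTInspired` representation stands in here for the KeyBERT labeling step, which may have been done differently.

```python
def build_topic_model(min_cluster_size=15, n_neighbors=15, n_components=5,
                      random_state=42):
    """Assemble a BERTopic model from explicit sub-models (illustrative values)."""
    # Imported inside the function so the sketch can be loaded and read
    # without the heavy dependencies installed.
    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    # Transformer-based sentence embeddings (model name is an assumption).
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

    # Reduce the embeddings to a few dimensions before clustering.
    umap_model = UMAP(n_neighbors=n_neighbors, n_components=n_components,
                      min_dist=0.0, metric="cosine", random_state=random_state)

    # Density-based clustering; points that fit no cluster get topic -1.
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size,
                            metric="euclidean",
                            cluster_selection_method="eom",
                            prediction_data=True)

    return BERTopic(embedding_model=embedding_model,
                    umap_model=umap_model,
                    hdbscan_model=hdbscan_model,
                    representation_model=KeyBERTInspired())


def fit_topics(docs):
    """Fit the model on a list of strings; return the model and topic ids."""
    topic_model = build_topic_model()
    topics, _probabilities = topic_model.fit_transform(docs)
    return topic_model, topics
```

After fitting, `topic_model.get_topic_info()` returns a table of topics with their sizes and keyword-based names, which is a natural starting point for inspecting what each cluster is about.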

Additionally, I employed a range of data analysis tools and visualizations to further scrutinize and interpret the results.

Analysis:

This experiment successfully demonstrated the utility of Large Language Models (LLMs) and topic modeling for understanding social media conversations at scale. Oxfam now has the tools and pipeline necessary to replicate this analysis on topics relevant to its campaigns.

Impact and Future Directions:

Subsequent steps will involve in-depth data analysis and statistical exploration to provide Oxfam with actionable insights aligned with its goals. For instance, the findings could be formalized to show that a significant share of the social media discussion on this topic is critical of these corporations. Such a result could serve as an advocacy tool in calling for meaningful reforms. Furthermore, the versatile pipeline will allow Oxfam to apply these techniques to any topic of interest, so that it can continually assess public sentiment and tailor its advocacy efforts accordingly.
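One way a claim about the share of critical discussion could be formalized is to report the proportion of posts assigned to critical topics together with a confidence interval. The sketch below computes a Wilson score interval for such a proportion. It is an illustration only: the function name `wilson_interval` is hypothetical, and it assumes each post has already been marked as critical or not and that the posts can be treated as a simple random sample.

```python
import math


def wilson_interval(critical, total, z=1.96):
    """Wilson score interval for the proportion critical / total.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= critical <= total:
        raise ValueError("critical must be between 0 and total")

    p = critical / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = (z / denom) * math.sqrt(
        p * (1 - p) / total + z ** 2 / (4 * total ** 2)
    )
    return center - half_width, center + half_width
```

Given counts taken from the topic assignments, `wilson_interval(critical, total)` returns the lower and upper bounds of the interval. Reporting the range rather than a single percentage makes clear how much uncertainty the sample size leaves.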