1. Introduction
This piece inaugurates the Natural language processing (NLP) for Social Good column in this journal. The column aims to provide a regular forum for discussing how NLP technologies can, and sometimes fail to, contribute to positive social outcomes. My goal in this first instalment is to offer a landscape overview: where are we, what is missing, and where should we go?
The idea that NLP should serve the public good is not new. Hovy and Spruit (2016) raised the question of NLP’s social impact a decade ago. Since then, the field has seen the establishment of recurring workshop series such as NLP for Positive Impact (NLP4PI), which has run five editions between 2021 and 2026, and domain-specific venues including ClimateNLP (since 2024), BioNLP (since 2002), and ClinicalNLP (now in its seventh edition). Shared tasks have also proliferated across social good applications, from explainable detection of online sexism (SemEval 2023; Kirk et al. 2023) to multilingual persuasion technique detection in memes (SemEval 2024; Dimitrov et al. 2024) and clinical text generation (BioNLP 2024; Xu et al. 2024). Furthermore, in 2025, EMNLP included “NLP for Social Good” explicitly in its call for papers.
Several recent surveys have attempted to chart this research area systematically. Fortuna et al. (2021) proposed an early map of NLP4SG, analysing around 50,000 ACL Anthology papers using keyword matching and finding that explicit social good papers accounted for under 10% of publications, with healthcare dominating and areas such as environmental sustainability and language disorders largely neglected. Jin et al. (2021) took a complementary approach, evaluating NLP tasks through the lens of social impact and finding significant gaps between what the community works on and what would have the greatest positive effect. Adauto et al. (2023) scaled this analysis to over 76,000 papers using LLM-based annotation. Most recently, Karamolegkou et al. (2026) conducted the most comprehensive survey to date, analysing 47,000 ACL Anthology papers across nine domains aligned with UN Sustainable Development Goals and the World Economic Forum’s Global Risks framework.
I draw on these surveys, supplemented by additional research, to construct the overview that follows. The remainder of this column is organised as follows. Section 2 maps the current research landscape and identifies the domains that receive the most and least attention. Section 3 examines how large language models have reshaped the possibilities and perils of NLP4SG. Section 4 addresses the persistent challenge of linguistic diversity. Section 5 discusses evaluation. Section 6 proposes five directions that I believe will define the field’s next chapter. Section 7 offers concluding remarks.
2. The research landscape in NLP for social good
Karamolegkou et al. (2026) annotated 47,000 ACL Anthology papers (2019–2025) across nine NLP4SG domains using GPT-4.1 mini with zero-shot classification. The resulting map reveals that approximately 76.5% of papers received at least one social good domain label, but the distribution across domains is heavily skewed.
AI harms (bias, fairness, toxicity, interpretability) and inclusion/inequalities together account for the largest share of NLP4SG research, with over 15,000 and 19,000 papers, respectively. These are followed by education (around 13,700 papers), healthcare (around 5,400), digital violence (around 4,800), and misinformation (around 4,200). At the other end of the spectrum, peacebuilding (around 1,100 papers), poverty (around 1,000), and environmental protection (around 380) remain what the survey’s authors call areas that are “only starting to gain traction in response to real-world crises.”
Furthermore, the co-occurrence heatmap in Karamolegkou et al. (2026) shows that AI harms and inclusion research frequently overlap with multiple other domains, making them central hubs in the NLP4SG network, while poverty and peacebuilding appear relatively isolated. Research on environmental protection, which might seem naturally connected to sustainability discourse more broadly, has the weakest inter-domain connections of all.
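To make the idea of domain co-occurrence concrete, the following is a minimal sketch of how pairwise co-occurrence counts could be derived from multi-label annotations of the kind described above. The paper identifiers, label names, and label assignments are invented for illustration and are not taken from the survey; the function name `cooccurrence_counts` is likewise hypothetical and does not reflect the survey authors' actual pipeline.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(paper_labels):
    """Count how often each unordered pair of domain labels appears on the same paper."""
    counts = Counter()
    for labels in paper_labels.values():
        # De-duplicate and sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(labels)), 2):
            counts[pair] += 1
    return counts


# Toy annotations (hypothetical, for illustration only).
toy_annotations = {
    "paper_a": ["ai_harms", "inclusion"],
    "paper_b": ["ai_harms", "inclusion", "education"],
    "paper_c": ["healthcare", "ai_harms"],
    "paper_d": ["environment"],
}

pair_counts = cooccurrence_counts(toy_annotations)
for pair, n in pair_counts.most_common():
    print(pair, n)
```

In this toy data, a label that only ever appears alone (here `environment`) contributes no pairs at all, which is the kind of isolation a co-occurrence heatmap makes visible.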
Between 2019 and 2025, NLP4SG publications grew across nearly all domains, with particularly rapid growth in AI harms, inclusion, and education. The methodological landscape has also evolved significantly: dataset creation, model analysis, and interpretability have become the dominant tasks, while transfer learning, prompting, and in-context learning have risen to prominence as methods, gradually displacing older paradigms based on supervised learning and classical machine learning.
3. LLMs in NLP for social good
The rise of large language models since 2022 has fundamentally reshaped the NLP4SG landscape. On the opportunity side, zero-shot and few-shot transfer enables NLP applications without the large labelled datasets that were previously required, precisely the bottleneck that has historically limited NLP deployment in low-resource social good settings. Open-source models have made sophisticated NLP accessible to organisations without massive compute budgets.
Concrete LLM-powered social good applications have emerged across domains. In healthcare, LLMs now support clinical note summarisation, patient education, and diagnostic decision-making, with domain-specific models like Med-PaLM and MEDITRON showing strong performance. In mental health, human-AI collaboration has been shown to enable more empathic conversations in peer support settings (Sharma et al. 2023). In crisis response, lightweight LLM frameworks using parameter-efficient fine-tuning maintain over 99% of full fine-tuning performance at half the memory cost for disaster tweet classification. In education, LLM-powered tutors are being adapted for specialised populations, including learners with hearing impairments.
However, the risks are substantial and domain-specific. Hallucination in clinical settings is the most acute concern: a 2025 study in Communications Medicine found that six LLMs tested with physician-validated clinical vignettes repeated or elaborated on planted errors in up to 83% of cases. Even with explicit mitigation prompts, the error rate was only halved. A 2026 benchmark across 37 models found general hallucination rates of 15–52%, reaching 64% in healthcare contexts without safeguards. For NLP4SG applications in high-stakes domains such as healthcare, legal aid, and crisis response, these error rates are not merely inconvenient: they can cause direct harm to vulnerable populations.
The environmental cost of LLMs presents a further tension. Training GPT-3 consumed approximately 1,287,000 kWh and produced around 552 tonnes of CO2 equivalent. Inference costs are also increasingly dominant. For NLP4SG, the argument against reliance on large cloud-hosted models has particular force: many communities served by social good applications, such as those in crisis zones, rural healthcare settings, or developing economies, have poor or no internet connectivity, making cloud-dependent systems impractical regardless of their technical capabilities.
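As a quick sanity check on the GPT-3 training figures quoted above, the short calculation below derives the carbon intensity they imply. It uses only the two numbers given in the text; the variable names are illustrative, and the result says nothing about other models or about inference.

```python
# Figures quoted in the text for GPT-3 training.
training_energy_kwh = 1_287_000
training_emissions_tonnes = 552

# Implied carbon intensity in kg CO2e per kWh (1 tonne = 1,000 kg).
implied_intensity = training_emissions_tonnes * 1000 / training_energy_kwh
print(f"{implied_intensity:.3f} kg CO2e per kWh")
```

The two figures are thus mutually consistent with an electricity mix emitting roughly 0.43 kg of CO2 equivalent per kWh.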
4. The language gap in NLP4SG
The intersection of language resources and social good exposes a stark inequality: communities most in need of NLP4SG applications, such as those facing poverty, health crises, and conflict, disproportionately speak languages with the fewest NLP resources.
Several grassroots communities have emerged to address this gap, and their models deserve attention. Masakhane (“We build together” in isiZulu) has grown to over 2,000 researchers from more than 30 African countries, producing datasets for named entity recognition, sentiment analysis, and LLM evaluation across dozens of African languages. AI4Bharat, led from IIT Madras, has created the world’s largest Indic parallel corpus (BPCC, with 230 million sentence pairs across 22 Indian languages) alongside neural translation systems and speech datasets covering 13 languages. IndoNLP addresses Indonesia’s 700+ languages through resources like NusaX and IndoBERT. The LoResLM workshop series supports building language models for low-resource and indigenous languages (Hettiarachchi et al. 2026).
Multilingual models have improved, but they remain insufficient. Aya from Cohere for AI represents perhaps the most inclusive effort to date: Aya 101 was followed by Aya Expanse and, notably, Tiny Aya (February 2026), a 3.35-billion-parameter model designed to run locally on consumer devices, covering 70+ languages with regional specialisation.
Funding for low-resource NLP4SG also remains modest. The Lacuna Fund, the world’s first collaborative fund for creating ML datasets in low- and middle-income contexts, allocated approximately $1 million for NLP datasets in Africa and Latin America in 2024, a fraction of the billions invested in English-centric AI. Community events like the Deep Learning Indaba (held in Dakar in 2024 with over 600 attendees) provide vital infrastructure, but the structural funding asymmetry remains.
5. Evaluation in NLP4SG
The inadequacy of standard NLP metrics for evaluating NLP4SG systems is one of the field’s most persistent problems. NLP benchmarks lack clear alignment with user needs and encourage what has been called “pointless SOTA-chasing.” For NLP4SG, this means evaluation must demonstrate a connection to actual social outcomes, not merely performance on proxy tasks.
Several promising evaluation approaches are emerging. Disaggregated evaluation, as proposed by Pfohl et al. (2024), surfaces health equity harms through stratified assessments across patient subpopulations. They argued that equal performance across subgroups is an unreliable measure of fairness when data reflects real-world disparities. Participatory evaluation is also gaining traction: Wilson, Atabey, and Revans (2025) reported on participatory design engagements at the NLP4PI workshop, arguing that what constitutes a “good” NLP output depends entirely on the context of use, which is something standard metrics cannot capture.
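As an illustration of what a stratified assessment involves in practice, the following minimal sketch reports accuracy separately for each subgroup alongside the aggregate score. The subgroup names, labels, and predictions are hypothetical toy data, and the function `disaggregated_accuracy` is an illustrative helper rather than the procedure used by Pfohl et al.

```python
from collections import defaultdict


def disaggregated_accuracy(records):
    """Return overall accuracy and accuracy per subgroup.

    Each record is a (subgroup, gold_label, predicted_label) tuple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subgroup, gold, predicted in records:
        total[subgroup] += 1
        correct[subgroup] += int(gold == predicted)
    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group


# Toy predictions (hypothetical subgroups and labels).
toy_records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 0, 0),
]

overall, per_group = disaggregated_accuracy(toy_records)
gap = max(per_group.values()) - min(per_group.values())
print(f"overall={overall:.2f}", per_group, f"gap={gap:.2f}")
```

In this toy example, an aggregate accuracy of 0.75 conceals the fact that one subgroup is served perfectly while the other is served at chance level. Such a breakdown is a starting point rather than a verdict: as noted above, equal scores across subgroups would not by themselves establish fairness.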
On the institutional side, the ACL community has strengthened its ethical infrastructure. The Responsible NLP Research Checklist is mandatory for all ACL Rolling Review submissions and was updated in October 2024. Since December 2024, inappropriately completed checklists can result in desk rejection. The ACL Publication Ethics Policy requires disclosure of all generative AI use. The EU AI Act, the world’s first comprehensive AI law, classifies AI in healthcare and education as “high-risk,” requiring conformity assessments and human oversight, which directly affects many NLP4SG applications.
6. Future directions
Based on the analysis above, I identify five directions that I believe will define NLP4SG’s next chapter.
• Small language models on edge devices. This may be the most transformative direction. Many communities served by NLP4SG, such as crisis response teams, rural healthcare workers, and refugee settings, have poor or no internet connectivity. On-device processing keeps sensitive data local, eliminates prohibitive cloud costs, and enables offline operation. Small language models consuming 60–70% less energy than their larger counterparts align the social good mission with environmental responsibility. Tiny Aya’s 3.35-billion-parameter model running locally in 70+ languages signals this future.
• Addressing the neglected domains. Poverty, peacebuilding, and environmental protection lack the community infrastructure that healthcare NLP has built over two decades. The ClimateNLP workshop series demonstrates how quickly a dedicated venue can catalyse research. Poverty and peacebuilding need equivalent institutional support.
• Multilingual and multicultural grounding. NLP4SG systems must work in the languages spoken by the communities they aim to serve. This requires not only multilingual model development but also culturally grounded evaluation. The participatory models pioneered by Masakhane and AI4Bharat, in which communities are partners in research, offer templates for the broader NLP4SG community to adopt.
• Human-centred evaluation frameworks. We need evaluation that connects benchmark performance to real-world outcomes. This means field trials, participatory evaluation design, disaggregated fairness assessments, and explicit measurement of deployment impact.
• Cross-disciplinary partnerships. Successful NLP4SG collaborations, such as CLEAR Global’s humanitarian translation, the DEEP platform for crisis analysis, and GhanaNLP’s Khaya translation app, share a common feature: sustained engagement between NLP researchers and domain practitioners. However, NLP4SG papers tend to appear in lower-impact venues, suggesting a prestige penalty that discourages such work. The NLP community must actively counteract this through dedicated tracks, awards, and career incentives that value deployment and impact alongside technical novelty.
7. Concluding remarks
NLP for Social Good stands at a turning point. The field has achieved institutional recognition, built community infrastructure, and demonstrated real-world deployments. But three structural tensions will determine whether this momentum translates into genuine social impact. First, the domain imbalance: as long as poverty and peacebuilding lack the community infrastructure that AI harms and hate speech detection enjoy, NLP4SG will underserve the communities facing the greatest need. Second, the geographic and linguistic asymmetry: NLP4SG research remains overwhelmingly concentrated in institutions with the least proximity to the problems being addressed. Third, the evaluation gap: without frameworks that connect performance to outcomes, the field risks producing sophisticated tools that fail in deployment.
I hope this column will serve as a regular venue for advancing these conversations. I particularly invite contributions that address the underexplored domains identified here, that report on field deployments rather than benchmark improvements, and that centre the perspectives of communities that NLP4SG aims to serve. The technical capabilities of modern NLP are remarkable. The question is whether we can direct them where they are most needed.