Wétin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior in Nigerian Online Discussions

Innocent Ndubuisi-Obi, Sayan Ghosh, David Jurgens

Introduction

Multilingual individuals code switch between languages as a part of a complex communication process. However, most computational studies have examined only one or a handful of contextual factors predictive of switching. Here, we examine Naijá-English code switching in a rich contextual environment to understand the social and topical factors eliciting a switch. We introduce a new corpus of 330K articles and accompanying 389K comments labeled for code switching behavior. In modeling whether a comment will switch, we show that topic-driven variation, tribal affiliation, emotional valence, and audience design all play complementary roles in behavior.


As part of the paper, we are releasing our annotated dataset of articles from four Nigerian newspapers, along with all code and materials used to build the English-Naijá classifier.


Getting started (Code and Models)

  • All the code and resources for the language classifier are available on GitHub.

Data for download

1.   Articles and comments


This data contains the articles and comments from four Nigerian newspapers (The Guardian, The Nation, The Punch, Vanguard).


2.   Language classifier training data


This data contains the training data for the English-Naijá classifier.


Citing the paper, data, or classifier

Wétin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior in Nigerian Online Discussions. Innocent Ndubuisi-Obi, Sayan Ghosh, and David Jurgens. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 2019.

@inproceedings{obighosh2019naija,
  title={Wétin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior in Nigerian Online Discussions},
  author={Ndubuisi-Obi, Innocent and Ghosh, Sayan and Jurgens, David},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
  year={2019}
}

Highlights

1.   Topic matters

The topic of an article affected code-switching behavior in its comment section. In particular, articles about social tensions at the political, regional, or tribal level, as well as other social issues, were more likely to have comment sections displaying code-switching behavior. In contrast, articles about more general politics or specific economic issues were unlikely to have comment sections displaying code-switching behavior.


2.   Commenters adjust language for different target audiences

The use of Naijá was influenced by the audience a commenter intended to reach. Comments deeper in a reply thread, usually part of a conversation, were more likely to be in Naijá than higher-level comments, which were more general commentary on the article aimed at a wider audience.


3.   Sentiment affects code-switching behavior

Our work provides further evidence that sentiment affects whether an individual code-switches. When expressing any kind of sentiment (positive or negative), commenters were more likely to code-switch into Naijá. However, our work also shows that a reaction to emotional language does not necessarily elicit a code-switch.
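The audience-design highlight above depends on where a comment sits in its reply thread. As a rough illustration only, the sketch below computes reply depth from parent pointers. The thread layout (a mapping from comment id to parent id) and the name `reply_depth` are hypothetical, chosen for this example; they are not the released dataset's schema or the paper's code.

```python
from typing import Dict, Optional


def reply_depth(comment_id: str, parents: Dict[str, Optional[str]]) -> int:
    """Return 0 for a top-level comment, 1 for a direct reply, and so on.

    `parents` maps each comment id to its parent's id, or None for a
    top-level comment. This layout is an assumption made for illustration.
    """
    depth = 0
    seen = {comment_id}
    parent = parents[comment_id]
    while parent is not None:
        # Guard against malformed data where a reply chain loops back.
        if parent in seen:
            raise ValueError("cycle in reply chain")
        seen.add(parent)
        depth += 1
        parent = parents[parent]
    return depth


# Toy thread: c2 replies to c1, c3 replies to c2, c4 is a separate top-level comment.
thread = {"c1": None, "c2": "c1", "c3": "c2", "c4": None}
depths = {cid: reply_depth(cid, thread) for cid in thread}
```

A depth value like this could then serve as one input when comparing how often top-level and nested comments switch into Naijá.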


Bugs/Issues/Discussion

GitHub: The language classifier code is available on GitHub. For bug reports and patches on the code, or for any issues you run into with the data, please file a GitHub issue. We also welcome pull requests for new features or to make the pipeline work with other kinds of data.

David Jurgens

Site design courtesy of Will Hamilton via Jason Chuang via Jeffrey Pennington