Multilingual individuals code switch between languages as a part of a complex communication process. However, most computational studies have examined only one or a handful of contextual factors predictive of switching. Here, we examine Naijá-English code switching in a rich contextual environment to understand the social and topical factors eliciting a switch. We introduce a new corpus of 330K articles and accompanying 389K comments labeled for code switching behavior. In modeling whether a comment will switch, we show that topic-driven variation, tribal affiliation, emotional valence, and audience design all play complementary roles in behavior.
As part of the paper, we are releasing our annotated dataset of articles and comments from four Nigerian newspapers, along with all code and materials used to construct the English-Naijá language classifier.
Getting started (Code and Models)
-
All the code and resources for the language classifier are available on GitHub.
1. Articles and comments
This data contains the articles and comments from four Nigerian newspapers (The Guardian, The Nation, The Punch, Vanguard).
2. Language classifier training data
This data contains the training data for the English-Naijá classifier.
Citing the paper, data, or classifier
Wétin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior in Nigerian Online Discussions.
Innocent Ndubuisi-Obi, Sayan Ghosh, and David Jurgens. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 2019.
@inproceedings{obighosh2019naija,
title={Wétin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior
in Nigerian Online Discussions},
author={Ndubuisi-Obi, Innocent and Ghosh, Sayan and Jurgens, David},
booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers)},
year={2019}
}
1. Topic matters
The topic of an article affected code-switching behavior in its comment section. In particular, articles about social tensions at the political, regional, or tribal level, as well as other social issues, were more likely to have comment sections displaying code-switching behavior. In contrast, articles about more general politics or specific economic issues were unlikely to have comment sections displaying code-switching behavior.
2. Commenters adjust language for different target audiences
The use of Naijá was influenced by the audience a commenter intended to reach. Comments deeper in a reply thread, usually part of a conversation, were more likely to be in Naijá than higher-level comments, which were more general commentary on the article and targeted a wider audience.
3. Sentiment affects code-switching behavior
Our work provides further evidence that sentiment affects whether an individual code-switches. When expressing any kind of sentiment (positive or negative), commenters were more likely to code-switch into Naijá. However, our work also shows that a reaction to emotional language does not necessarily elicit a code-switch.
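The audience-design finding (comments deeper in a reply thread are more often in Naijá) can be explored with a simple per-depth tally. The snippet below is a minimal sketch, not the analysis code used in the paper: the field names `depth` and `is_naija` are hypothetical, the released data may use a different schema, and the toy records are invented purely to show the shape of the computation.

```python
from collections import defaultdict


def naija_rate_by_depth(comments):
    """Return {depth: fraction of comments at that depth labeled Naijá}.

    Each comment is a dict with two assumed (hypothetical) keys:
      "depth":    0 for a top-level comment, 1 for a reply to it, and so on
      "is_naija": True if the language classifier labeled the comment Naijá
    """
    totals = defaultdict(int)
    naija = defaultdict(int)
    for comment in comments:
        totals[comment["depth"]] += 1
        if comment["is_naija"]:
            naija[comment["depth"]] += 1
    return {depth: naija[depth] / totals[depth] for depth in sorted(totals)}


# Invented toy records for illustration only; not drawn from the corpus.
comments = [
    {"depth": 0, "is_naija": False},
    {"depth": 0, "is_naija": False},
    {"depth": 0, "is_naija": True},
    {"depth": 0, "is_naija": False},
    {"depth": 1, "is_naija": True},
    {"depth": 1, "is_naija": False},
    {"depth": 2, "is_naija": True},
    {"depth": 2, "is_naija": True},
]

for depth, rate in naija_rate_by_depth(comments).items():
    print(f"depth {depth}: {rate:.2f} of comments in Naijá")
```

A raw tally like this ignores confounds such as article topic or commenter identity, so it is only a first look; the paper's conclusions come from modeling these factors together.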
GitHub: The language classifier code is available on
GitHub. For bug reports and patches on the code, or for any issues you
might run into with the data, please file a GitHub issue. We also
welcome pull requests for new features or to make the pipeline work
with other kinds of data.