Wétin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior in Nigerian Online Discussions

Innocent Ndubuisi-Obi, Sayan Ghosh, David Jurgens

Introduction

Multilingual individuals code switch between languages as a part of a complex communication process. However, most computational studies have examined only one or a handful of contextual factors predictive of switching. Here, we examine Naijá-English code switching in a rich contextual environment to understand the social and topical factors eliciting a switch. We introduce a new corpus of 330K articles and accompanying 389K comments labeled for code switching behavior. In modeling whether a comment will switch, we show that topic-driven variation, tribal affiliation, emotional valence, and audience design all play complementary roles in behavior.

Alongside the paper, we are releasing our annotated dataset of articles and comments from four Nigerian newspapers, together with all code and materials used to build the English-Naijá language classifier.

Getting started (Code and Models)

All the code and resources for the language classifier are available on GitHub.

Data for download

1. Articles and comments

This data contains the articles and comments from four Nigerian newspapers (The Guardian, The Nation, The Punch, Vanguard).

2. Language classifier training data

This data contains the training data for the English-Naijá classifier.

Citing the paper, data, or classifier

Wétin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior in Nigerian Online Discussions. Innocent Ndubuisi-Obi, Sayan Ghosh, and David Jurgens. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 2019.

  @inproceedings{obighosh2019naija,
    title={Wétin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior
           in Nigerian Online Discussions},
    author={Ndubuisi-Obi, Innocent and Ghosh, Sayan and Jurgens, David},
    booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
               (Volume 2: Short Papers)},
    year={2019}
  }
  

Highlights

1. Topic matters

The topic of an article affected code-switching behavior in its comment section. In particular, articles about social tensions at the political, regional, or tribal level, as well as other social issues, were more likely to have comment sections displaying code-switching behavior. By contrast, articles about more general politics or specific economic issues were less likely to have comment sections displaying code-switching behavior.

2. Commenters adjust language for different target audiences

The use of Naijá was influenced by the audience a commenter intended to reach. Comments deeper in a reply thread, usually part of a conversation, were more likely to be in Naijá than higher-level comments, which were more general commentary on the article and targeted a wider audience.
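As a rough illustration of how reply depth could be related to language choice, the sketch below computes each comment's depth in its thread and the share of Naijá comments at each depth. It is not the analysis code from the paper: the field names `id`, `parent_id`, and `is_naija` are hypothetical, the released data may be structured differently, and the five toy comments are invented for the example.

```python
from collections import defaultdict


def reply_depths(comments):
    """Return {comment_id: depth}; top-level comments have depth 0.

    Assumes every non-null parent_id refers to a comment in `comments`.
    """
    parent_of = {c["id"]: c["parent_id"] for c in comments}
    depths = {}

    def depth(cid):
        # Memoized walk up the reply chain to the top-level comment.
        if cid not in depths:
            parent = parent_of[cid]
            depths[cid] = 0 if parent is None else depth(parent) + 1
        return depths[cid]

    for cid in parent_of:
        depth(cid)
    return depths


def naija_rate_by_depth(comments):
    """Return {depth: fraction of comments at that depth labeled Naijá}."""
    depths = reply_depths(comments)
    totals = defaultdict(int)
    naija = defaultdict(int)
    for c in comments:
        d = depths[c["id"]]
        totals[d] += 1
        naija[d] += int(c["is_naija"])
    return {d: naija[d] / totals[d] for d in sorted(totals)}


# Invented toy thread (hypothetical schema, not drawn from the corpus).
comments = [
    {"id": 1, "parent_id": None, "is_naija": False},
    {"id": 2, "parent_id": 1, "is_naija": False},
    {"id": 3, "parent_id": 2, "is_naija": True},
    {"id": 4, "parent_id": None, "is_naija": False},
    {"id": 5, "parent_id": 4, "is_naija": True},
]

print(naija_rate_by_depth(comments))  # {0: 0.0, 1: 0.5, 2: 1.0}
```

A per-depth rate like this is only descriptive; separating audience effects from topic or sentiment would require a model that includes those factors together.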

3. Sentiment affects code-switching behavior

Our work provides further evidence that sentiment affects whether an individual code-switches. When expressing any kind of sentiment (positive or negative), commenters were more likely to code-switch into Naijá. However, our work also shows that a reaction to emotional language does not necessarily elicit a code-switch.

Bugs/Issues/Discussion

GitHub: The language classifier code is available on GitHub. For bug reports and patches to the code, or for any issues you run into with the data, please file a GitHub issue. We also welcome pull requests that add new features or make the pipeline work with other kinds of data.

David Jurgens

Site design courtesy of Will Hamilton via Jason Chuang via Jeffrey Pennington