Questions are ubiquitous in our daily communications. Different questions hold various levels of intimacy and are used in different situations. Can we quantify intimacy in questions? What social norms regulate people's question intimacy in interpersonal communications?

In this study, we create (i) a new dataset and method for quantifying intimacy in language and (ii) an accurate NLP model to predict intimacy in language. We apply this model to 80.5M questions across both real (Reddit and Twitter) and imagined (books and films) conversations to study the pragmatic choices and social norms in interpersonal communication. Our analysis shows that 1) gender norms of masculinity persist across real and imagined conversations and are held by both male and female authors; 2) on Twitter, the most intimate interactions happen between close friends and between total strangers, mirroring the "strangers on a train" effect; and 3) using an anonymous account (e.g., throwaway123) is an effective strategy of audience design that allows people to ask more intimate questions without the constraints of social norms.

As a part of the paper, we are releasing our annotated dataset for intimacy in language, all code and materials used to construct the regressor for intimacy, a pre-trained version of the intimacy prediction model used in the paper, and, upon request, a massive new dataset of 80.5M questions across Reddit, Twitter, books, and films.

Getting started (Code and Models)

  • All the code and resources for the intimacy regressor are available on GitHub.
  • Our RoBERTa-based intimacy estimator can be installed with pip, and its code is on GitHub.
      pip3 install question-intimacy 
  • The model is also available through Hugging Face Transformers:
      from transformers import AutoTokenizer, AutoModelForSequenceClassification
      tokenizer = AutoTokenizer.from_pretrained("pedropei/question-intimacy")
      model = AutoModelForSequenceClassification.from_pretrained("pedropei/question-intimacy") 
  • Data for download

    1.   Annotated question intimacy data (Link)

    This data contains 2397 questions from Reddit, Twitter, Books and Movies, each annotated with an intimacy score. The train/test/dev split used in our paper is also provided here.

    2.   80.5M question data

    This data contains 80.5M questions from Reddit, Twitter, Books and Movies. The data is in preparation. Please contact us if interested.
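
    As a minimal sketch of how the Hugging Face checkpoint listed above might be used to score questions, the helper below splits the input into batches, runs each batch through the model, and reads one value per question from the logits. It assumes the checkpoint exposes a single regression output, so that the first logit column holds the intimacy score. The names `batched` and `score_questions`, the batch size, and `max_length=64` are illustrative choices, not part of the released code.

```python
from typing import Iterator, List, Sequence

MODEL_NAME = "pedropei/question-intimacy"  # checkpoint name from the list above


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def score_questions(questions: Sequence[str], batch_size: int = 32) -> List[float]:
    """Return one predicted intimacy score per question.

    Assumption: the model has a single regression head, so the first
    logit column is treated as the intimacy score.
    """
    # Imported lazily so the batching helper works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()  # disable dropout for inference

    scores: List[float] = []
    with torch.no_grad():
        for chunk in batched(questions, batch_size):
            inputs = tokenizer(chunk, padding=True, truncation=True,
                               max_length=64, return_tensors="pt")
            logits = model(**inputs).logits
            scores.extend(logits[:, 0].tolist())
    return scores
```

    Calling `score_questions(["What is your favorite color?", "Are you afraid of dying alone?"])` downloads the checkpoint on first use and returns a list of two floats, one per question.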


    1.   Most intimate questions are asked between close friends or total strangers

    The social psychology literature suggests that social distance regulates the norms of communication between people: intimate interactions are reserved for close relationships. However, the most intimate self-disclosure can also happen between total strangers, because they probably won't see each other again and are therefore temporarily relieved from social norms. This is known as the stranger-on-the-train phenomenon. Our study of 1M questions on the Twitter network suggests that both norms persist on the Internet: the most intimate questions are asked between close friends or total strangers.

    2.   Gender norms of masculinity persist across online and imagined conversations

    The hegemonic norms of masculinity suggest that males are supposed to be strong, rational, and inexpressive of personal emotions. Does such a norm persist in online and imagined conversations? Our answer is yes. Based on 64M questions across Reddit, Twitter, movies, and books, we found:

  • In both online (Reddit and Twitter) and imagined (books and movies) conversations, male-to-male questions are the least intimate, compared with dyads involving females.
  • Both female and male authors perpetuate such gender norms in their books.
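
    The dyad comparison above comes down to averaging predicted intimacy within each asker/addressee gender pair. The sketch below shows one way such an aggregation could be written. The record layout `(asker, addressee, score)` and the function name `mean_intimacy_by_dyad` are hypothetical, and any values passed in are placeholders rather than results from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (asker gender, addressee gender, intimacy score)
Record = Tuple[str, str, float]


def mean_intimacy_by_dyad(records: Iterable[Record]) -> Dict[str, float]:
    """Average the intimacy scores within each asker->addressee dyad."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for asker, addressee, score in records:
        dyad = f"{asker}->{addressee}"
        totals[dyad] += score
        counts[dyad] += 1
    return {dyad: totals[dyad] / counts[dyad] for dyad in totals}
```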

    3.   Wanna get relieved from social norms? Use an anonymous account!

    Given the strong norms of gender and social distance, is there a way to get relieved from them? It is indeed hard in real life; however, in online communities, you could create a completely anonymous identity that removes the constraints of social norms. Our study of 12M questions on Reddit suggests that anonymous accounts (e.g., throwaway123) ask much more intimate questions than other types of accounts, which we consider a special form of audience design: instead of changing the language, you could change your identity.
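
    To make the notion of an anonymous account concrete, here is a small sketch of a username heuristic. It assumes that such accounts can be recognized by the substring "throwaway" in the username, as in the example throwaway123. This rule and the function name `is_throwaway` are illustrative assumptions, not necessarily the criterion used in the study.

```python
import re

# Assumption: a username containing "throwaway" (in any casing) marks an anonymous account.
_THROWAWAY = re.compile(r"throwaway", re.IGNORECASE)


def is_throwaway(username: str) -> bool:
    """Return True if the username looks like a throwaway account."""
    return _THROWAWAY.search(username) is not None
```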

    4.   How to (linguistically) write more intimate questions?

    In daily communication, linguistic devices are actively employed to deliver social information like politeness, which is also true for intimacy! In our study of 20M questions across Reddit, Twitter, movies, and books, we found that hedging and swearing are associated with a higher level of intimacy but serve different purposes: (i) swearing expresses the speaker’s perceived solidarity with the audience; (ii) hedging lowers certainty, which reduces the potential risk of losing face and allows people to ask more intimate questions.

    Citing the paper, data, or model

    @inproceedings{pei2020quantifying,
      title={Quantifying Intimacy in Language},
      author={Pei, Jiaxin and Jurgens, David},
      booktitle={Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
      year={2020}
    }

    Jiaxin Pei & David Jurgens
