Paper Summary: Whose Opinions Do Language Models Reflect?
Tamanna Hossain-Kay / 2023-08-05
Paper Link: https://arxiv.org/pdf/2303.17548.pdf
Authors: Shibani Santurkar \(^1\), Esin Durmus\(^1\), Faisal Ladhak\(^2\), Cinoo Lee\(^1\), Percy Liang\(^1\), Tatsunori Hashimoto\(^1\) (\(^1\) Stanford, \(^2\) Columbia University)
Language models, or LMs, can give “opinionated” answers to open-ended, subjective questions. But whose opinions are these? This is important to understand as LMs become more integrated into open-ended applications.
Recent studies have shown that LMs can exhibit specific political stances and even mirror the beliefs of certain demographics. To investigate this further, the authors built a framework on top of public opinion surveys. This framework, using the OpinionQA dataset formed from Pew Research's American Trends Panel (ATP) surveys, offers insights via expertly curated topics, carefully worded questions, and standardized multiple-choice responses.
From the paper’s analysis of nine LMs from OpenAI and AI21 Labs:
- There’s a clear disparity between LMs’ “opinions” and the broader US public opinion.
- When models are fine-tuned using human feedback, this gap widens, leaning towards the views of the liberal, affluent, and well-educated.
- Some groups, notably those aged 65+ and the Mormon community, are consistently underrepresented.
- LMs can, to a limited degree, mimic the opinion distribution of specific groups when specifically prompted, but improvements are slight.
- The alignment between LMs and human groups varies across different subjects.
OpinionQA Dataset
Measuring Human-LM Alignment
Results: Whose views do current LMs express?
OpinionQA Dataset
When curating a dataset to discern the viewpoints of LMs, researchers face several challenges: selecting relevant topics, writing questions that effectively elicit the LM's views, and establishing a benchmark of human opinions for comparison. A promising solution is to use public opinion surveys, a proven tool for capturing human sentiment.
OpinionQA is based on the Pew American Trends Panel (ATP) survey:
- A panel of approximately 10,000 participants recruited over multiple years.
- To reduce the participation load, only a subset of panelists are invited each year.
- These participants are chosen from a random address-based sample of households, with efforts taken to ensure the sample’s representativeness.
- The survey includes households without internet access, reached via phone or tablets.
- Questionnaire design is rigorous: questions are piloted through focus groups, pre-interviews, and cognitive testing to ensure they are clear, unbiased, and easy to comprehend.
- The set of valid answer choices for each question is derived from open-ended pilot surveys.
- To maintain the integrity of the data, quality checks are performed, and sample weights are applied to the ATP data to counter sampling bias and non-response.
- Additionally, the ordering of ordinal option choices is kept consistent.
OpinionQA draws on 15 such ATP surveys covering diverse topics, from politics to health, with responses from thousands of US participants. The collected data, including individual answers, demographics, and participant weights, is used to build the human opinion distributions, and the questions are organized into coarse and fine-grained topic categories. Note that the OpinionQA dataset covers only English and the US population.
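To make the construction of the human reference distributions concrete, here is a minimal sketch of how a weighted opinion distribution could be computed from panelist answers and ATP sample weights. The `human_opinion_distribution` helper and the example values are my own illustration, not code or data from the paper.

```python
from collections import defaultdict

def human_opinion_distribution(responses, weights, ordered_choices):
    """Aggregate panelists' answers into a weighted opinion distribution.

    responses       : one answer choice per panelist
    weights         : ATP-provided sample weights, one per panelist
    ordered_choices : the ordinal answer options for this question
    """
    mass = defaultdict(float)
    for answer, w in zip(responses, weights):
        mass[answer] += w
    total = sum(mass[c] for c in ordered_choices)
    # Normalize so the distribution sums to 1 over this question's options.
    return [mass[c] / total for c in ordered_choices]

# Hypothetical example: a 4-option ordinal question answered by 3 panelists.
dist = human_opinion_distribution(
    ["Agree", "Strongly agree", "Agree"],
    [1.2, 0.8, 1.0],
    ["Strongly disagree", "Disagree", "Agree", "Strongly agree"],
)
print(dist)  # [0.0, 0.0, 0.733..., 0.266...]
```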
Measuring Human-LM Alignment
To facilitate the comparison between humans and LMs, LMs are queried using conventional question answering (QA) techniques, transforming each question into a specific format as demonstrated in Figure 1. The sequence in which options are presented follows the original design from the surveys, acknowledging the ordinal nature of the options.
The evaluation of the LMs is split into representativeness (where no context is given) and steerability (where contextual cues guide the LM to mimic a certain demographic). LMs are "steered" towards mimicking a specific demographic using three methods (illustrated with a prompt sketch after the list):
- QA: Group information is provided as the answer to a preceding multiple-choice survey question.
- Bio: Demographic information is provided through a free-text answer to a preceding biographical question.
- Portray: The LM is directly prompted to pretend to be a member of the specified group.
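To make the three steering formats concrete, the sketch below shows one plausible way such prompts could be assembled. The exact wording used in the paper differs, so treat these templates and the `build_steering_prompt` helper as illustrative assumptions.

```python
def build_steering_prompt(method, group, question, options):
    """Prepend group context to a survey question, following one of the three
    steering formats described above (wording is illustrative only)."""
    option_block = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    base = f"Question: {question}\n{option_block}\nAnswer:"

    if method == "qa":
        # Group membership revealed as the answer to a preceding multiple-choice item.
        context = ("Question: In politics today, how would you describe yourself?\n"
                   "A. Liberal\nB. Moderate\nC. Conservative\n"
                   f"Answer: {group}\n\n")
    elif method == "bio":
        # Group membership given as a free-text answer to a biographical question.
        context = ("Question: Please describe yourself briefly.\n"
                   f"Answer: I am {group}.\n\n")
    elif method == "portray":
        # Model is asked to answer as if it belonged to the group.
        context = f"Answer the following question as if you were {group}.\n\n"
    else:
        raise ValueError(f"unknown steering method: {method}")

    return context + base

print(build_steering_prompt("portray", "a moderate",
                            "How worried are you about climate change?",
                            ["Very", "Somewhat", "Not too", "Not at all"]))
```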
Models
The study evaluated models from both OpenAI and AI21 Labs (see Table 5 for the complete list). When a model is posed a question, the likelihood it assigns to each answer option is extracted and normalized into the model's opinion distribution. Due to API restrictions, OpenAI returns at most 100 log probabilities, while AI21 provides up to 10. If an answer option is not among the returned tokens, its probability is upper-bounded by the minimum of the remaining probability mass and the smallest returned token probability.
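Below is a minimal sketch of how the returned log probabilities could be turned into an opinion distribution, including the capping rule for missing options. The `lm_opinion_distribution` helper and its handling of the API output are assumptions for illustration, not the paper's code.

```python
import math

def lm_opinion_distribution(option_tokens, returned_logprobs):
    """Turn per-token log probabilities from an API into a distribution over
    the answer-option tokens (e.g. "A", "B", "C", ...).

    option_tokens     : list of answer-option tokens for this question
    returned_logprobs : dict token -> log probability, as returned by the API
                        (top-100 tokens for OpenAI, top-10 for AI21)
    """
    probs = {t: math.exp(lp) for t, lp in returned_logprobs.items()}
    smallest_returned = min(probs.values())
    leftover_mass = max(0.0, 1.0 - sum(probs.values()))

    scores = []
    for tok in option_tokens:
        if tok in probs:
            scores.append(probs[tok])
        else:
            # Missing options are capped by the smaller of the leftover mass
            # and the smallest returned probability.
            scores.append(min(leftover_mass, smallest_returned))

    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical example: options "A"/"B", but the API only returned "A".
print(lm_opinion_distribution(["A", "B"], {"A": -0.2, " the": -3.0}))
```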
Comparing human and LM opinion distributions
A metric that accounts for the ordinal nature of the survey answers is needed, so KL-divergence is not used. Instead, the 1-Wasserstein distance (WD) is used, defined as the minimum cost of transforming distribution \(D_1\) into distribution \(D_2\). To place the ordinal answers in a space appropriate for WD, they are mapped to consecutive positive integers.
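As a sketch, this comparison can be computed with scipy's `wasserstein_distance` after mapping the N ordinal options to the integers 1..N. The normalization to a [0, 1] alignment score via \(1 - \mathrm{WD}/(N-1)\) reflects how the paper's alignment metric is commonly described, but the exact form here, and the `opinion_alignment` helper, should be read as assumptions.

```python
from scipy.stats import wasserstein_distance

def opinion_alignment(lm_dist, human_dist):
    """1-Wasserstein distance between two opinion distributions over the same
    N ordinal options (mapped to integers 1..N), plus a [0, 1] alignment score."""
    n = len(lm_dist)
    support = list(range(1, n + 1))  # ordinal options -> positive integers
    wd = wasserstein_distance(support, support, lm_dist, human_dist)
    # The largest possible WD on this support is N - 1, so 1 - WD/(N - 1) lies in [0, 1].
    return wd, 1.0 - wd / (n - 1)

wd, alignment = opinion_alignment([0.1, 0.2, 0.7], [0.5, 0.3, 0.2])
print(wd, alignment)
```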
Results: Whose views do current LMs express?
Representativeness
Overall representativeness: for most models, alignment with the overall US public is only about as high as the alignment between agnostic and orthodox people on abortion, or between Democrats and Republicans on climate change (Figure 2).
Group representativeness (selected groups shown in Figure 3):
- Base LMs are most aligned with lower income, moderate, and Protestant or Roman Catholic groups
- OpenAI’s instruct-series models align more with liberal, high-income, well-educated, and non-religious groups. These groups match the crowdworker demographics reported in OpenAI’s InstructGPT paper.
- Several groups have low representativeness scores for all LMs, such as those aged 65+, widowed, and with high religious attendance.
Modal representativeness: text-davinci-003 has a sharp, low-entropy opinion distribution that converges to the modal views of liberals and moderates.
Steerability
- Most LMs (except ada) become more representative of subgroups post-steering (Fig 4b)
- In most cases the improvement is by a roughly constant factor, so different subgroups remain aligned to different degrees post-steering
- There is some variation in how adaptable models are, e.g., j1-grande-v2-beta is more adaptive for Southerners and text-davinci-002 for liberals (Fig 11)
- text-davinci-002 has the smallest alignment gap across groups after steering
Consistency
A consistency score \(C_m\) is defined as the fraction of topics on which an LM's best-aligned group matches its overall best-aligned group (across all topics). The score ranges from 0 to 1; a higher score means the model agrees with the same subgroup across topics.
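In symbols, one way to write this (the notation below is mine, not taken verbatim from the paper) is:

\[
C_m = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathbb{1}\!\left[ \arg\max_{g \in G} \mathcal{A}_t(m, g) = \arg\max_{g \in G} \mathcal{A}(m, g) \right]
\]

where \(\mathcal{T}\) is the set of topics, \(G\) the set of demographic groups, \(\mathcal{A}_t(m, g)\) the alignment of model \(m\) with group \(g\) restricted to topic \(t\), and \(\mathcal{A}(m, g)\) the alignment over all questions.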
- Overall consistency scores of current LMs are fairly low, indicating that they express a patchwork of disparate opinions (Fig 3)
- Base models from both providers and the RLHF-trained text-davinci-003 from OpenAI seem to be the most consistent, but towards different sets of groups (see Fig 5)
- None are perfectly consistent; even text-davinci-00{2,3} align with conservatives on religion (Fig 5)