Paper Summary: Whose Opinions Do Language Models Reflect?
Tamanna Hossain-Kay / 2023-08-05
Paper Link: https://arxiv.org/pdf/2303.17548.pdf
Authors: Shibani Santurkar \(^1\), Esin Durmus\(^1\), Faisal Ladhak\(^2\), Cinoo Lee\(^1\), Percy Liang\(^1\), Tatsunori Hashimoto\(^1\) (\(^1\) Stanford, \(^2\) Columbia University)
Language models, or LMs, can give “opinionated” answers to open-ended, subjective questions. But whose opinions are these? This is important to understand as LMs become more integrated into open-ended applications.
Recent studies have shown that LMs can exhibit specific political stances and even mirror the beliefs of certain demographics. To investigate this further, the authors built a framework on top of public opinion surveys. This framework, using the OpinionQA dataset formed from Pew Research's American Trends Panel (ATP) surveys, offers insights via expertly curated topics, carefully worded questions, and standardized multiple-choice responses.
From the paper’s analysis of nine LMs from OpenAI and AI21 Labs:
- There’s a clear disparity between LMs’ “opinions” and the broader US public opinion.
- When models are fine-tuned using human feedback, this gap widens, leaning towards the views of the liberal, affluent, and well-educated.
- Some groups, notably those aged 65+ and the Mormon community, are consistently underrepresented.
- LMs can, to a limited degree, mimic the opinion distribution of specific groups when specifically prompted, but improvements are slight.
- The alignment between LMs and human groups varies across different subjects.
OpinionQA Dataset
Measuring Human-LM Alignment
Results: Whose views do current LMs express?
OpinionQA Dataset
When curating a dataset to discern the viewpoints of LMs, researchers face several challenges: selecting relevant topics, writing questions that effectively elicit the LM's views, and establishing a benchmark of human opinions for comparison. A promising solution is to use public opinion surveys, a proven tool for capturing human sentiment.
OpinionQA is based on the Pew American Trends Panel (ATP) survey:
- A panel of approximately 10,000 participants recruited over multiple years.
- To reduce the participation load, only a subset of panelists are invited each year.
- These participants are chosen from a random address-based sample of households, with efforts taken to ensure the sample’s representativeness.
- The survey includes households without internet access, reached via phone or tablets.
- Questionnaire design is rigorous: questions are piloted through focus groups, pre-interviews, and cognitive testing to ensure they are clear, unbiased, and easy to comprehend.
- The set of valid answer choices for each question is derived from open-ended pilot surveys.
- To maintain the integrity of the data, quality checks are performed, and sample weights are applied to the ATP data to counter sampling bias and non-response.
- Additionally, the ordering of ordinal option choices is kept consistent.
OpinionQA draws on 15 such ATP surveys covering diverse topics, from politics to health, with responses from thousands of US participants. The collected data, including individual answers, demographics, and participant weights, is used to build the human opinion distributions, and the questions are organized into coarse and fine-grained topic categories. Note that the OpinionQA dataset covers only English and the US population.
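To make the construction of the human reference distributions concrete, here is a minimal sketch of how a weighted opinion distribution could be computed from panelist answers and ATP sample weights. The `human_opinion_distribution` helper and the example values are my own illustration, not code or data from the paper.

```python
from collections import defaultdict

def human_opinion_distribution(responses, weights, ordered_choices):
    """Aggregate panelists' answers into a weighted opinion distribution.

    responses       : one answer choice per panelist
    weights         : ATP-provided sample weights, one per panelist
    ordered_choices : the ordinal answer options for this question
    """
    mass = defaultdict(float)
    for answer, w in zip(responses, weights):
        mass[answer] += w
    total = sum(mass[c] for c in ordered_choices)
    # Normalize so the distribution sums to 1 over this question's options.
    return [mass[c] / total for c in ordered_choices]

# Hypothetical example: a 4-option ordinal question answered by 3 panelists.
dist = human_opinion_distribution(
    ["Agree", "Strongly agree", "Agree"],
    [1.2, 0.8, 1.0],
    ["Strongly disagree", "Disagree", "Agree", "Strongly agree"],
)
print(dist)  # [0.0, 0.0, 0.733..., 0.266...]
```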
Measuring Human-LM Alignment
To facilitate the comparison between humans and LMs, LMs are queried using conventional question answering (QA) techniques, transforming each question into a specific format as demonstrated in Figure 1. The sequence in which options are presented follows the original design from the surveys, acknowledging the ordinal nature of the options.
The evaluation of the LMs is split into representativeness (where no context is given) and steerability (where contextual cues guide the LM to mimic a certain demographic). LMs are "steered" towards mimicking a specific demographic using three methods (illustrated with a prompt sketch after the list):
- QA: Group information is provided as the answer to a preceding multiple-choice survey question.
- Bio: Demographic information is provided through a free-text answer to a preceding biographical question.
- Portray: The LM is directly prompted to pretend to be a member of the specified group.
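To make the three steering formats concrete, the sketch below shows one plausible way such prompts could be assembled. The exact wording used in the paper differs, so treat these templates and the `build_steering_prompt` helper as illustrative assumptions.

```python
def build_steering_prompt(method, group, question, options):
    """Prepend group context to a survey question, following one of the three
    steering formats described above (wording is illustrative only)."""
    option_block = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    base = f"Question: {question}\n{option_block}\nAnswer:"

    if method == "qa":
        # Group membership revealed as the answer to a preceding multiple-choice item.
        context = ("Question: In politics today, how would you describe yourself?\n"
                   "A. Liberal\nB. Moderate\nC. Conservative\n"
                   f"Answer: {group}\n\n")
    elif method == "bio":
        # Group membership given as a free-text answer to a biographical question.
        context = ("Question: Please describe yourself briefly.\n"
                   f"Answer: I am {group}.\n\n")
    elif method == "portray":
        # Model is asked to answer as if it belonged to the group.
        context = f"Answer the following question as if you were {group}.\n\n"
    else:
        raise ValueError(f"unknown steering method: {method}")

    return context + base

print(build_steering_prompt("portray", "a moderate",
                            "How worried are you about climate change?",
                            ["Very", "Somewhat", "Not too", "Not at all"]))
```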
Models
The study evaluated models from both OpenAI and AI21 Labs (see Table 5 for the complete list). When a model is posed a question, the likelihood it assigns to each answer option is extracted and normalized into the model's opinion distribution. Due to API restrictions, OpenAI returns at most 100 log probabilities, while AI21 provides up to 10. If an answer option is not among the returned tokens, its probability is upper-bounded by the minimum of the remaining probability mass and the smallest returned token probability.
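Below is a minimal sketch of how the returned log probabilities could be turned into an opinion distribution, including the capping rule for missing options. The `lm_opinion_distribution` helper and its handling of the API output are assumptions for illustration, not the paper's code.

```python
import math

def lm_opinion_distribution(option_tokens, returned_logprobs):
    """Turn per-token log probabilities from an API into a distribution over
    the answer-option tokens (e.g. "A", "B", "C", ...).

    option_tokens     : list of answer-option tokens for this question
    returned_logprobs : dict token -> log probability, as returned by the API
                        (top-100 tokens for OpenAI, top-10 for AI21)
    """
    probs = {t: math.exp(lp) for t, lp in returned_logprobs.items()}
    smallest_returned = min(probs.values())
    leftover_mass = max(0.0, 1.0 - sum(probs.values()))

    scores = []
    for tok in option_tokens:
        if tok in probs:
            scores.append(probs[tok])
        else:
            # Missing options are capped by the smaller of the leftover mass
            # and the smallest returned probability.
            scores.append(min(leftover_mass, smallest_returned))

    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical example: options "A"/"B", but the API only returned "A".
print(lm_opinion_distribution(["A", "B"], {"A": -0.2, " the": -3.0}))
```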
Comparing human and LM opinion distributions
A metric that accounts for the ordinal nature of the survey answers is needed, so KL-divergence is not used. Instead, the 1-Wasserstein distance (WD) is used, defined as the minimum cost of transforming distribution \(D_1\) into distribution \(D_2\). To place the ordinal answers in a space appropriate for WD, they are mapped to consecutive positive integers.
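As a sketch, this comparison can be computed with scipy's `wasserstein_distance` after mapping the N ordinal options to the integers 1..N. The normalization to a [0, 1] alignment score via \(1 - \mathrm{WD}/(N-1)\) reflects how the paper's alignment metric is commonly described, but the exact form here, and the `opinion_alignment` helper, should be read as assumptions.

```python
from scipy.stats import wasserstein_distance

def opinion_alignment(lm_dist, human_dist):
    """1-Wasserstein distance between two opinion distributions over the same
    N ordinal options (mapped to integers 1..N), plus a [0, 1] alignment score."""
    n = len(lm_dist)
    support = list(range(1, n + 1))  # ordinal options -> positive integers
    wd = wasserstein_distance(support, support, lm_dist, human_dist)
    # The largest possible WD on this support is N - 1, so 1 - WD/(N - 1) lies in [0, 1].
    return wd, 1.0 - wd / (n - 1)

wd, alignment = opinion_alignment([0.1, 0.2, 0.7], [0.5, 0.3, 0.2])
print(wd, alignment)
```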
Results: Whose views do current LMs express?
Representativeness
Overall representativeness: for most models, alignment with the overall US public is only about as high as the alignment between agnostic and orthodox people on abortion, or between Democrats and Republicans on climate change (Figure 2).
Group representativeness (selected groups shown in Figure 3):
- Base LMs are most aligned with lower income, moderate, and Protestant or Roman Catholic groups
- OpenAI’s instruct-series models align more with liberal, high-income, well-educated, and non-religious groups. These groups match the crowdworker demographics reported in OpenAI’s InstructGPT paper.
- Several groups have low representativeness scores for all LMs, such as those aged 65+, widowed, and with high religious attendance.
Modal representativeness: text-davinci-003 has a sharp, low-entropy opinion distribution that converges to the modal views of liberals and moderates.
Steerability
- Most LMs (except ada) become more representative of subgroups post-steering (Fig 4b)
- In most cases the improvement is by a roughly constant factor, so different subgroups remain aligned to different degrees post-steering
- There is some variation in how adaptable models are, e.g., j1-grande-v2-beta is more adaptive for Southerners and text-davinci-002 for liberals (Fig 11)
- text-davinci-002 has the smallest alignment gap across groups after steering
Consistency
A consistency score \(C_m\) is defined as the fraction of topics on which an LM's best-aligned group matches its overall best-aligned group (across all topics). The score ranges from 0 to 1; a higher score means the model agrees with the same subgroup across topics.
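In symbols, one way to write this (the notation below is mine, not taken verbatim from the paper) is:

\[
C_m = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathbb{1}\!\left[ \arg\max_{g \in G} \mathcal{A}_t(m, g) = \arg\max_{g \in G} \mathcal{A}(m, g) \right]
\]

where \(\mathcal{T}\) is the set of topics, \(G\) the set of demographic groups, \(\mathcal{A}_t(m, g)\) the alignment of model \(m\) with group \(g\) restricted to topic \(t\), and \(\mathcal{A}(m, g)\) the alignment over all questions.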
- Overall consistency scores of current LMs are fairly low, indicating that they express a patchwork of disparate opinions (Fig 3)
- Base models from both providers and the RLHF-trained text-davinci-003 from OpenAI seem to be the most consistent, but towards different sets of groups (see Fig 5)
- None are perfectly consistent; even text-davinci-00{2,3} align with conservatives on religion (Fig 5)