Is the Hawkometer a machine learning model?

Yes, as part of an editorial process rather than a fully automated one. As of June 2026, our research team reviews each speech and passes the transcript to Claude (a large language model by Anthropic), which returns a hawkishness score, a written rationale, and the key phrases that drove the score. The team checks the result before it is published. The original keyword library is retained as a reference baseline for transparency and to flag when the LLM and keyword approaches diverge significantly.

How are the phrase weights chosen?

Weights were calibrated against the observed market reaction to comparable language in past speeches: phrases that historically coincided with material moves in front-end rates received higher weights. The full table is published on this page.

Can the Hawkometer be gamed?

In principle yes — an official could deliberately use phrases known to score hawkish or dovish. In practice central bankers value being understood unambiguously, so the same phrases that move our index also move markets. We monitor for any drift in language patterns and update the phrase library quarterly.

Hawkometer Methodology — How We Score Central Bank Speeches

The scoring engine, the phrase library, and the limitations finance journalists need to know about

May 7, 2026 · Central Bank Watch Research · 7 min read

What the Hawkometer measures

The Hawkometer answers one question: based on what a senior central bank official has said in their public appearances, where do they sit on the hawkish-dovish spectrum, and is that position drifting?

It is not a measure of:

What rate they will vote for at the next meeting (that is what our probability and Taylor Rule tools are for).
Their long-run reaction function.
Their personal beliefs separate from the committee’s published guidance.

It is a measure of public communication tone, normalised across speakers and aggregated up to the committee level.

The scoring engine — editorial review assisted by Claude

As of June 2026, each speech is scored through an editorial process assisted by Claude, a large language model developed by Anthropic. Our research team passes the full text of each speech, interview, testimony or press conference to the model with a structured prompt that asks it to:

Score the text on a -10 (most dovish) to +10 (most hawkish) scale.
Write a short rationale explaining the score in plain English.
Extract the key hawkish and dovish phrases that most influenced the score.
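
The three outputs listed above can be checked mechanically before a human reviews them. The following is a minimal sketch, assuming the model is asked to reply in JSON; the field names (`score`, `rationale`, `hawkish_phrases`, `dovish_phrases`) and the `SpeechScore` type are hypothetical and are not the production schema.

```python
import json
from dataclasses import dataclass


@dataclass
class SpeechScore:
    score: float                 # -10 (most dovish) to +10 (most hawkish)
    rationale: str               # short plain-English explanation
    hawkish_phrases: list[str]   # phrases that pushed the score up
    dovish_phrases: list[str]    # phrases that pushed the score down


def parse_model_reply(reply_text: str) -> SpeechScore:
    """Parse and sanity-check a JSON reply (hypothetical field names)."""
    data = json.loads(reply_text)
    score = float(data["score"])
    # Reject anything outside the published scale.
    if not -10.0 <= score <= 10.0:
        raise ValueError(f"score {score} is outside the -10..+10 scale")
    rationale = str(data["rationale"]).strip()
    if not rationale:
        raise ValueError("rationale must not be empty")
    return SpeechScore(
        score=score,
        rationale=rationale,
        hawkish_phrases=list(data.get("hawkish_phrases", [])),
        dovish_phrases=list(data.get("dovish_phrases", [])),
    )
```

A check like this only catches malformed replies; whether the score is reasonable remains a matter for the editorial review.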

Because the model reads and reasons about the complete text rather than counting keyword matches, it handles linguistic context that a pure phrase matcher cannot. Negation is processed correctly — “the case for additional tightening has weakened materially” now scores dovish, as intended, rather than triggering on “additional tightening” as a hawkish phrase. Conditional statements (“if inflation were to re-accelerate…”) are treated as hypotheticals rather than forward guidance. Hedging language (“somewhat,” “gradually,” “patient”) is weighed in proportion to its role in the surrounding argument.

The LLM approach also handles novel phrasing. When an official coins a new turn of phrase that markets immediately read as a signal, the model can pick it up without requiring a phrase library update. Equally, the model understands that “patience” in a Bank of Japan context carries different weight than the same word in a Federal Reserve speech — institutional context matters, and the model has been trained on enough central bank communication to reflect those differences.

Every score is accompanied by a written rationale and a list of the phrases that drove the result. Readers can therefore see not just that a speech scored +3.5, but why — preserving the auditability that a black-box model would sacrifice.

Phrase library — keyword baseline reference

The original v1 keyword library is retained alongside the LLM scorer as a transparency reference. The LLM may identify the same phrases the library contains, different phrases, or combinations and shadings that the library does not cover. Where the two approaches diverge significantly on the same speech, we treat that as a signal that the phrase library needs updating — either because new language has emerged or because a library weight was mis-calibrated.

Hawkish phrases (selection)

Phrase	Weight
inflation persistent	+2.5
inflation remains elevated	+2.2
further tightening	+2.4
additional tightening	+2.4
more work to do	+2.0
premature to declare victory	+2.0
not the time to cut	+1.9
upside risks to inflation	+1.8
inflation expectations rising	+1.8
policy must remain restrictive	+1.7
sticky inflation	+1.6
overheating	+1.6
restrictive stance is appropriate	+1.5
vigilance / vigilant	+1.4
hawkish hold	+1.4

Dovish phrases (selection)

Phrase	Weight
appropriate to begin easing	-2.4
scope to ease	-2.2
disinflation is well advanced	-2.0
conditions for easing	-2.0
rate cuts on the table	-2.0
dovish pivot	-2.0
approaching neutral	-1.8
close to neutral	-1.6
policy is sufficiently restrictive	-1.6
disinflation continues	-1.5
labour market cooling	-1.4
recession risk	-1.4
downside risks	-1.4
financial conditions have tightened	-1.2
growth is slowing	-1.2

Some phrases are deliberately ambiguous — “data-dependent” is technically a process statement, but in current usage it skews mildly dovish, so it sits at -0.7. We document each judgement call rather than hide it.
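
To illustrate how a keyword baseline behaves, here is a minimal sketch using a handful of the weights from the tables above. The scoring rule itself (case-insensitive substring match, each phrase counted once, weights summed) is an assumption made for illustration, not a description of the v1 implementation.

```python
# Weights copied from the tables above (a subset only).
# The matching and summing rule below is an assumption.
PHRASE_WEIGHTS = {
    "additional tightening": 2.4,
    "further tightening": 2.4,
    "more work to do": 2.0,
    "scope to ease": -2.2,
    "disinflation continues": -1.5,
    "data-dependent": -0.7,
}


def keyword_score(text: str) -> tuple[float, list[str]]:
    """Sum the weights of every library phrase found in the text."""
    lowered = text.lower()
    hits = [phrase for phrase in PHRASE_WEIGHTS if phrase in lowered]
    total = sum(PHRASE_WEIGHTS[phrase] for phrase in hits)
    return round(total, 2), hits


print(keyword_score("The case for additional tightening has weakened materially."))
# (2.4, ['additional tightening'])
```

The example reproduces the negation problem described earlier: a matcher of this kind reads the sentence as hawkish because it cannot see that the case for tightening has weakened.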

From speeches to committee scores

Once individual speeches are scored, we aggregate in three layers:

1. Per-speaker rolling 90-day average

For each official we compute the simple average of their sentiment scores over the last 90 days. We also compute a 30-day average and a prior-60-day average so a shift indicator can be reported (last 30 days vs. the 60 days before that).
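
The windows can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, window boundaries are treated as half-open (start excluded, end included), and the shift indicator is taken to be the 30-day average minus the prior-60-day average, which the text above does not specify.

```python
from datetime import date, timedelta
from statistics import mean


def window_mean(scores, start, end):
    """Mean of scores dated in the window (start, end]; None if the window is empty."""
    values = [score for day, score in scores if start < day <= end]
    return mean(values) if values else None


def speaker_summary(scores, as_of):
    """scores: list of (date, score) pairs for one official."""
    d30 = as_of - timedelta(days=30)
    d90 = as_of - timedelta(days=90)
    avg_90 = window_mean(scores, d90, as_of)
    avg_30 = window_mean(scores, d30, as_of)
    prior_60 = window_mean(scores, d90, d30)  # the 60 days before the last 30
    # Assumed definition: positive shift means the speaker has turned more hawkish.
    shift = None if avg_30 is None or prior_60 is None else avg_30 - prior_60
    return {"avg_90": avg_90, "avg_30": avg_30, "prior_60": prior_60, "shift": shift}
```

Returning `None` for an empty window keeps a speaker with no recent appearances from being reported as neutral by accident.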

2. Per-bank, voter-weighted committee score

The committee score is the weighted average of each speaker’s 90-day score:

Status	Weight
Voter at the next meeting	1.0
Non-voter (e.g. non-voting Fed regional president, observer)	0.55

A non-voter is still part of the committee’s intellectual centre of gravity, so we don’t drop them entirely. A voter’s voice nonetheless counts roughly twice as much in the index (1.0 against 0.55, or about 1.8 times), which matches how markets price these officials.
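
As a worked illustration of the weighting, the sketch below computes a weighted average using the two weights from the table. The function name and input shape are hypothetical.

```python
VOTER_WEIGHT = 1.0
NON_VOTER_WEIGHT = 0.55


def committee_score(speakers):
    """speakers: list of (avg_90_score, is_voter) pairs for one bank."""
    weighted_sum = 0.0
    total_weight = 0.0
    for score, is_voter in speakers:
        weight = VOTER_WEIGHT if is_voter else NON_VOTER_WEIGHT
        weighted_sum += weight * score
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no speakers with a 90-day score")
    # Dividing by the total weight keeps the result on the -10..+10 scale.
    return weighted_sum / total_weight
```

For one voter at +2.0 and one non-voter at -1.0, this gives (2.0 - 0.55) / 1.55, or roughly +0.94, compared with +0.5 for an unweighted average.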

3. Cross-bank lean

Per-bank scores are translated into a verbal lean for human readers:

Score range	Lean
≥ +2.5	hawkish
+1.0 to +2.5	leaning hawkish
-1.0 to +1.0	neutral
-2.5 to -1.0	leaning dovish
≤ -2.5	dovish

These are descriptive labels, not predictions about the next decision.
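
The mapping can be written as a short function. Note that the table leaves the ±1.0 boundaries open to interpretation; this sketch assumes a score of exactly +1.0 reads as "leaning hawkish" and exactly -1.0 as "leaning dovish", which is an assumption rather than a documented rule.

```python
def lean_label(score: float) -> str:
    """Translate a per-bank score into the verbal lean from the table above."""
    if score >= 2.5:
        return "hawkish"
    if score >= 1.0:   # assumed: exactly +1.0 counts as leaning hawkish
        return "leaning hawkish"
    if score > -1.0:
        return "neutral"
    if score > -2.5:   # assumed: exactly -1.0 counts as leaning dovish
        return "leaning dovish"
    return "dovish"
```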

Limitations — read this before citing the index

The Hawkometer is a useful first cut, but it has real limitations. Anyone citing it in research or reporting should understand them.

LLM consistency. The model may score a borderline speech slightly differently across runs depending on sampling randomness. We mitigate this by using a fixed prompt template and caching results — once a speech is scored, the cached result is used for all subsequent builds rather than re-scoring.
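
The caching idea can be sketched as below. This is one possible approach, not the production pipeline: the cache key (a SHA-256 hash of a prompt version label plus the whitespace-normalised transcript), the `PROMPT_VERSION` constant and the `score_fn` callback are all assumptions for illustration.

```python
import hashlib

PROMPT_VERSION = "v1"  # hypothetical label for the fixed prompt template


def cache_key(transcript: str, prompt_version: str = PROMPT_VERSION) -> str:
    """Stable key: same text and same prompt version give the same key."""
    normalised = " ".join(transcript.split())  # ignore whitespace differences
    payload = f"{prompt_version}\n{normalised}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def score_once(transcript, cache, score_fn):
    """Return the cached result if present; otherwise score and store it."""
    key = cache_key(transcript)
    if key not in cache:
        cache[key] = score_fn(transcript)  # score_fn stands in for the LLM step
    return cache[key]
```

Including the prompt version in the key means a change to the prompt template would trigger re-scoring, while rebuilding the site with an unchanged prompt would not.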

Hallucination risk. Like any large language model, Claude could misread an unusual speech structure or assign weight to a phrase in a way that a careful human reader would not. The published rationale paragraph is there specifically so readers can spot-check the model’s reasoning and flag cases where it has gone wrong.

Hypotheticals and conditionals. Even with LLM scoring, the model may not always correctly weight a heavily conditional statement (“if inflation were to re-accelerate, additional tightening would be appropriate”). The rationale will usually flag this, but readers should not over-interpret a single high reading from a single appearance.

Translation effects. ECB speakers, BoJ officials and SNB Board members frequently speak in languages other than English. Our scorer currently runs only on the English version of those remarks. When the official version is non-English, scoring is delayed until a translated transcript is published; for press conferences we use the simultaneous interpretation transcript. This introduces a small lag and a small translation bias.

No causal claim. The index does not claim that a hawkish reading causes anything. It is an organised summary of what officials have said. Use it alongside market-implied probabilities, the Taylor Rule analysis and the rate path tools.

Sample data. While the production scrapers ramp up, some entries on the site may be representative samples rather than direct transcripts, and appearances derived from them should be read as illustrative examples only.

Update cadence

New speeches are scraped automatically as part of the daily data pipeline, but LLM scoring itself is an editorial step, not an automated backend job. Roughly every two weeks, our research team reviews the transcripts scraped over the previous 14 days and scores them via Claude, one speech at a time. Until a speech has been through that review, its automatically computed keyword-baseline score is shown and flagged as such. The phrase library is reviewed at the end of each quarter and any time a major central bank materially changes its communication style.

Reproducibility

Every score is accompanied by a written rationale and the key phrases identified by the model. A public export is not currently provided; use the on-page tables and methodology notes for now. If you find a scoring decision you disagree with, the rationale gives you a starting point to understand the model’s reasoning and tell us where you think it went wrong.

For methodology questions or to suggest a phrase to add or re-weight, see the about page.