A network of 1,637 economists built from Wikidata and Wikipedia, enriched with AI-generated summaries, ranked by PageRank, and organized around the intellectual debates that have defined the discipline.
"Economists, like any other discipline, do not produce ideas in a vacuum. Their contributions build upon and influence one another, forming a complex and evolving web of intellectual history."
The goal of Econograph is to uncover and visualize those interconnections: academic lineage, intellectual influence, and institutional affiliations, using structured data scraped from Wikipedia. Inspired by similar projects in the history of philosophy, this effort brings the tools of data scraping, network analysis, and machine learning to the history of economic thought. While some comparable projects exist, they tend toward either manual curation or simple lists. Here, the aim is to automate data collection at scale and create a reusable foundation for historical, institutional, or theoretical investigations into the economics profession.
One of the most meaningful courses in my training as an economist was the history of economic thought. It is the kind of course that forces you to step back and ask a harder set of questions: how did we come to think the way we think? What problems were economists actually trying to solve when they built the models we now take for granted? What did they assume away, and why? The history of the discipline is, in this sense, a history of the questions economists decided were worth asking. Econograph is an attempt to make that history navigable.
The graph you see is the result of a six-stage automated pipeline. None of the school assignments, summaries, or network scores were entered by hand. The source code lives at github.com/mmarteaga/econograph.
Six stages turn raw Wikidata and Wikipedia records into the interactive graph.
The dataset starts with Wikidata, Wikipedia's structured knowledge base. Every economist with a Wikipedia article has a Wikidata entry recording structured facts: birth and death dates, a photo, doctoral advisor relationships, doctoral student relationships, and intellectual influences.
The scraper queries Wikidata's SPARQL endpoint and the MediaWiki API to collect this data. The result is 1,637 economists and 1,800 direct connections.
Why Wikidata instead of scraping Wikipedia infoboxes? Wikidata represents structured facts with explicit property types. P184 is doctoral advisor, P185 is doctoral student, P737 is influenced by. This distinction matters: knowing that Paul Samuelson was Robert Solow's doctoral advisor is richer information than knowing they are "connected" in some undifferentiated sense.
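The SPARQL queries behind this stage can be sketched as follows. The property IDs (P184, P185, P737) and occupation property P106 with item Q188094 ("economist") are standard Wikidata identifiers, and the endpoint URL is Wikidata's public SPARQL service; treat the exact query shape and the function name as illustrative, not the project's literal code.

```python
# Illustrative sketch of a Wikidata SPARQL query for advisor edges.
# P106 = occupation, Q188094 = economist, P184 = doctoral advisor.
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def build_advisor_query(limit: int = 5000) -> str:
    """Return a SPARQL query pairing economists with their doctoral advisors."""
    return f"""
    SELECT ?economist ?economistLabel ?advisor ?advisorLabel WHERE {{
      ?economist wdt:P106 wd:Q188094 .   # occupation: economist
      ?economist wdt:P184 ?advisor .     # doctoral advisor relationship
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """
```

Analogous queries over P185 and P737 yield the student and influence edges; the three result sets are then merged into one node and edge list.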
For each economist with a Wikipedia page, the introductory section (the paragraphs before any section headers) is fetched using the Wikipedia Action API. This gives a plain-text summary of who each economist is and what they worked on.
This text serves two purposes. First, it is the raw material for LLM school classification in Stage 3. Second, it is embedded directly in the graph file so the Research Assistant can perform full-text search without additional API calls at browse time.
A note on batching: The Wikipedia Action API silently caps prop=extracts responses at 20 articles per request, regardless of how many titles are submitted. Batches are set to 20 accordingly. This cap is not prominently documented; it was discovered empirically during development.
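The batching logic is simple once the cap is known. A minimal sketch (the constant and function name are hypothetical, not taken from the project's code):

```python
from typing import Iterator

EXTRACTS_CAP = 20  # empirical per-request cap on prop=extracts responses

def batched(titles: list[str], size: int = EXTRACTS_CAP) -> Iterator[list[str]]:
    """Yield successive batches of article titles no larger than `size`,
    so no extracts are silently dropped by the API."""
    for i in range(0, len(titles), size):
        yield titles[i:i + size]
```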
Assigning an economist to a school of thought is genuinely hard. Two problems make traditional keyword matching unreliable.
The first is polysemy. The word "development" means something entirely different in "development economics" (growth in low-income countries) versus "financial development" (depth of capital markets). A keyword approach cannot resolve this.
The second is network contamination. Community detection (Louvain algorithm) groups economists by who they cite, but intellectual network proximity does not equal school membership. Raj Chetty co-publishes with behavioral economists but is primarily a labor economist. Naive community detection dragged the entire Harvard labor economics cluster (Autor, Katz, Diamond) into the wrong school because of their network adjacency to behavioral researchers.
The solution is to read what each economist's Wikipedia page actually says and classify from that text directly. Each Wikipedia intro is sent to Claude Haiku with a structured prompt listing all 20 valid schools and including explicit disambiguation rules.
156 economists are classified by hand as authoritative seeds and are never sent to the LLM. They serve as anchors: Keynes is Keynesian, Hayek is Austrian School, Samuelson is Classical/Neoclassical. The LLM classifies everyone else.
The system prompt gives the LLM explicit guidance on these edge cases, including disambiguation rules for ambiguous terms like "development" and instructions not to infer school membership from co-authorship alone.
Coverage: 1,388 of 1,637 economists had Wikipedia intro text available. All 1,388 were classified. 863 received a different school assignment than the prior keyword-based approach. The remaining 249 economists either had no Wikipedia URL or insufficient text; they retain their seed assignment or prior classification.
The LLM approach substantially outperforms keyword matching on edge cases. Tobias Adrian (an expert on financial stability and capital market risk, wrongly labeled "Development" by keyword matching) is correctly classified as "Finance." The Harvard labor economics cluster is no longer contaminated by its Louvain community membership.
The 1,637 economists and 1,800 connections form an undirected graph. NetworkX computes PageRank for each node, the same iterative algorithm Google introduced to rank web pages.
PageRank works recursively: a node receives a higher score when it is connected to many others, and when those others are themselves highly connected. The formal update equation is:

PR(u) = (1 − d)/N + d · Σ_{v ∈ B(u)} PR(v)/L(v)

where:
PR(u) = PageRank score of node u (an economist)
d = damping factor, set to 0.85 (probability of following a link rather than jumping randomly)
N = total number of nodes in the graph (1,637)
B(u) = set of nodes sharing an edge with u (the graph is undirected, so these are simply u's neighbors)
L(v) = number of outbound edges from node v
Scores converge through repeated iteration until the change between rounds falls below a tolerance threshold (10⁻⁶). The result is a continuous score for each economist reflecting both breadth of influence and the prestige of their peers.
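The update rule above can be implemented in a few lines of power iteration. The project uses NetworkX's built-in pagerank; this self-contained sketch exists only to make the equation concrete, and it assumes every node has at least one edge (no dangling nodes).

```python
def pagerank(adj: dict[str, set[str]], d: float = 0.85,
             tol: float = 1e-6, max_iter: int = 100) -> dict[str, float]:
    """Power iteration on an undirected graph. Each neighbor v
    contributes PR(v)/L(v) where L(v) is v's degree, damped by d,
    plus the (1 - d)/N random-jump term."""
    n = len(adj)
    pr = {u: 1.0 / n for u in adj}
    for _ in range(max_iter):
        new = {
            u: (1 - d) / n + d * sum(pr[v] / len(adj[v]) for v in adj[u])
            for u in adj
        }
        if sum(abs(new[u] - pr[u]) for u in adj) < tol:
            return new
        pr = new
    return pr
```

On a symmetric graph every node gets the same score; on a hub-and-spoke graph the hub scores highest, which is exactly the "prestige of peers" effect described above.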
PageRank scores are used throughout the interface. They rank search results, determine which connections appear first in the detail panel, size nodes in the mini network diagram, and weight the "Surprise me" selection toward historically significant figures.
Beyond the raw graph, Wikidata records the type of each relationship. These are resolved into three labeled categories in the detail panel: doctoral advisors, doctoral students, and intellectual influences. A fourth category ("Also Connected") captures all remaining edges that carry no specific relationship type in the data, typically colleagues or frequent co-authors.
For each economist with a Wikipedia article, Claude Haiku generates two things. The first is a one-paragraph contribution summary (three to five sentences) describing their most important ideas, theorems, and intellectual legacy in specific terms. The second is a set of eight identifying keywords drawn from the theories they created, the institutions they shaped, or the results they are best known for.
The summaries and keywords are generated offline as part of the build pipeline and stored directly in the graph data file. No API calls are made at browse time. This makes the site fully static and compatible with GitHub Pages hosting.
Wikipedia text (up to 3,000 characters of the article's introductory section) is fetched via the Wikipedia Action API in batches of 20. Each batch is then processed concurrently by Claude Haiku with up to 8 simultaneous API calls. A checkpoint file saves progress after each batch, making the process crash-safe and resumable.
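The concurrency-plus-checkpoint pattern can be sketched as below. The function name, checkpoint format, and `summarize` callable are hypothetical; the batch size of 20, the 8 concurrent calls, and the save-after-each-batch behavior come from the text.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable

def process_batch(names: list[str], summarize: Callable[[str], dict],
                  checkpoint: Path, max_workers: int = 8) -> dict[str, dict]:
    """Summarize one batch with up to `max_workers` concurrent calls,
    skipping names already in the checkpoint, then persist results so
    an interrupted run can resume where it left off."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    todo = [n for n in names if n not in done]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for name, result in zip(todo, pool.map(summarize, todo)):
            done[name] = result
    checkpoint.write_text(json.dumps(done))
    return done
```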
The model is instructed to respond with valid JSON only: a "summary" field containing the paragraph and a "keywords" array of exactly eight strings. Responses are parsed strictly; any output that does not parse as JSON is retried up to three times before being skipped.
1,636 of 1,637 economists received summaries and keywords. The single economist without coverage has no Wikipedia URL in the Wikidata record and could not be fetched. A URL-encoding issue discovered during the run (Wikipedia URLs containing percent-encoded characters such as G%C3%A9rard_Debreu were not being decoded before lookup) caused 97 economists to be missed on the first pass; this was corrected with urllib.parse.unquote() and a resume run covered the remaining cases.
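The fix itself is a one-liner around `urllib.parse.unquote()`; the helper name is hypothetical, but the decoding behavior is exactly what the standard library provides:

```python
from urllib.parse import unquote

def title_from_url(url: str) -> str:
    """Extract and decode the article title from a Wikipedia URL, so
    percent-encoded names match the API's canonical titles."""
    return unquote(url.rsplit("/", 1)[-1])
```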
Why not use Wikipedia bios for the same purpose? Wikipedia intros vary enormously in length, quality, and focus. The AI summaries are consistently structured and use active voice ("Arrow proved...," "Friedman argued..."), making them more useful for quick orientation. The Wikipedia bio text is preserved separately in the data for full-text search.
The Research Assistant uses MiniSearch, a lightweight BM25 full-text search library that runs entirely in the browser. On page load, it builds an in-memory index over every economist's name, school, Wikipedia bio, AI-generated summary, and keywords. No server is required.
The BM25 relevance score for a query q against a document d is:

score(q, d) = Σ_{t ∈ q} IDF(t) · f(t, d) · (k1 + 1) / ( f(t, d) + k1 · (1 − b + b · |d|/avgdl) )

where:
IDF(t) = inverse document frequency of term t (rare terms score higher)
f(t, d) = frequency of term t in document d
|d| = length of document d in tokens
avgdl = average document length across the corpus
k1 = term saturation parameter (1.2 by default; controls how much repeated mentions add)
b = length normalization parameter (0.75 by default; penalizes very long documents)
Field weights are applied on top of BM25: name matches receive a boost of 4, keywords 3, summary 2, school 2, and bio text 1. This ensures that an economist whose name matches the query outranks one who merely mentions the query term in passing.
The final score for each result is a weighted combination of MiniSearch relevance and the economist's PageRank score, so prominent economists surface above obscure figures with the same keyword match.
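The combination step reduces to a weighted sum. The weights below are purely illustrative (the actual values live in the site's search code); the scaling factor exists because PageRank scores sum to 1 across 1,637 nodes and would otherwise be negligible next to BM25 relevance.

```python
def final_score(relevance: float, pagerank: float,
                w_rel: float = 1.0, w_pr: float = 50.0) -> float:
    """Blend text relevance with network prominence. The PageRank
    weight is large because individual PageRank values are tiny
    (on the order of 1/1637). Weights here are hypothetical."""
    return w_rel * relevance + w_pr * pagerank
```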
For broader conceptual queries, Econograph fetches the Wikipedia article for the search query itself, extracts the first 600 characters of its intro, and runs a second MiniSearch pass using those expanded terms. A query for "history of corporate governance" pulls Wikipedia's intro for that topic, which mentions "agency theory," "hostile takeover," and "institutional investors." Those terms appear in relevant economists' bios and substantially improve recall. When this enrichment fires, a small badge appears on the results.
The debates view surfaces six recurring intellectual tensions that have driven the development of economics as a discipline. Rather than treating the history of economic thought as a sequence of settled questions, this view frames it as a set of ongoing arguments, each of which has shifted in character but never been fully resolved.
The selection criteria were that a debate had to meet three conditions. It had to be genuinely unresolved, meaning that thoughtful economists today still disagree about it rather than having reached a stable consensus. It had to be traceable through at least 150 years of the discipline's history, with identifiable figures on recognizable sides. It also had to connect to something visible in contemporary policy or research, so that a reader could see why it still matters.
Applying those criteria produces the six debates presented in the view.
The assignment process works in two layers. The first layer is a hand-curated set of named economists for each side of each debate. These are the figures most closely associated with each position in the historical literature. The second layer applies school-of-thought matching to extend coverage: economists in the Keynesian or Post-Keynesian schools are assigned to the "Discretion" side of the rules debate, for example, while Chicago School economists are assigned to the "Rules" side. Force-listed names always take precedence over school-based assignment. Economists whose work spans multiple debates appear in each one independently.
A note on simplification: Reducing an intellectual tradition to "two sides" necessarily loses nuance. Many economists occupy complex positions that shift across different policy questions. The debates view is designed to orient readers to the major fault lines, not to serve as a complete intellectual taxonomy.
English Wikipedia bias. The dataset is sourced entirely from English Wikipedia and Wikidata. Economists whose work is primarily documented in other languages are underrepresented or absent, particularly thinkers from Latin America, East Asia, and continental Europe outside major research universities.
The full pipeline is open source at github.com/mmarteaga/econograph.