Data and methods

Purpose. Consolidated walkthrough of how data was handled at each step of the China-HQ green FDI readiness diagnostic. Specifies the inputs, the filters and aggregations applied at each stage, and the reasoning behind each load-bearing choice. Companion notes treat individual decisions in more depth: capability measurement [1], HQ classification rule [2], supply-chain position taxonomy [3].

Overview

The diagnostic combines two lenses. Lens 1 asks where China-HQ FDI landed relative to a country's pre-FDI capability in the technology, yielding a five-category quadrant assignment. Lens 2 asks what kind of capability appears in the post-FDI trade panel, yielding a five-family supply-chain position. Both lenses are required: the same cell can read as confidently incumbent on Lens 1 and as midstream-feedstock on Lens 2, and that disagreement is the finding. The pipeline below covers the ingest, filtering, windowing, capability measurement, FDI signal classification, quadrant assignment, trade-outcome measurement, and family assignment that produce the scored 466-cell panel (Solar, Battery, and EV) and the fifteen-cell narrative batch.

Data sources

Five primary inputs.

Source	Local path	Used for
China-HQ green FDI panel	`data/raw/fdi_*.csv`	Firm-level outward FDI announcements 2020-2024, filtered on the source-panel HQ field
Predicted Competitiveness (PC) scores	`../../../goodenough/data/pc_scores.csv`	Country-technology capability level, 10 green technologies, 2003-2024
Broad-category RCA matrix	`../../../goodenough/data/rca_all.csv`	RCA by country, technology, and year across six broad green-category baskets
EV capability model outputs	`data/raw/ev/` (frozen copies)	EV predicted competitiveness (158 economies, 2018-2024) and EV broad-category RCA (2017-2024), from the dedicated EV model
BACI HS17 trade	`data/raw/baci/BACI_HS17_V202601.zip`	Bilateral merchandise trade flows by HS6, 2017-2019 and 2022-2024

Two derived green-product references are also used. The green dictionary at ../../../goodenough/data/green_dictionary.csv defines HS6 baskets per technology and per supply-chain stage. The cleaned per-cut basket file at data/processed/hs6_baskets.json (produced by scripts/build_trade_outcome_window.py) splits the dictionary into all-stage, manufacturing-only, factory-stage, and upstream-only cuts.

The full data inventory, including every produced intermediate table, lives at DATA_INVENTORY.md in the station root.

Windows and lags

Four windows define the diagnostic. The capability window is 2015-2019 (EV: 2018-2019, because the dedicated EV model's series starts in 2018; the deviation and its consequences are documented in the EV capability measurement note). The China-HQ FDI window is 2020-2024. The trade pre-window is 2017-2019. The trade post-window is 2022-2024.

The capability window does not overlap the FDI window: this is by construction, to avoid circularity. Same-period PC would feed back from FDI-driven output increases and contaminate the input. The trade pre-window deliberately tail-overlaps the capability window because it measures the export base immediately before the FDI window opens, while the capability window measures the deeper industrial substrate from which that base was generated. The trade post-window is centered on 2023, two to three years into the FDI window, which captures early output from FDI-window plants that ramped quickly but undershoots the largest 2023-2024 announcements (a known caveat below).
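The window layout can be written down as a small consistency check. This is a minimal sketch: the year bounds come from the text above, while the dictionary keys and the `overlaps` helper are illustrative names, not the pipeline's own.

```python
# Inclusive (start, end) year windows as stated in the text. Key names are illustrative.
WINDOWS = {
    "capability": (2015, 2019),
    "capability_ev": (2018, 2019),
    "fdi": (2020, 2024),
    "trade_pre": (2017, 2019),
    "trade_post": (2022, 2024),
}


def overlaps(a, b):
    """True if two inclusive (start, end) year ranges share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]


# The capability windows must not overlap the FDI window (avoids circularity).
assert not overlaps(WINDOWS["capability"], WINDOWS["fdi"])
assert not overlaps(WINDOWS["capability_ev"], WINDOWS["fdi"])
# The trade pre-window deliberately overlaps the tail of the capability window.
assert overlaps(WINDOWS["trade_pre"], WINDOWS["capability"])
```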

Pipeline at a glance

#	Script	Purpose
1	`ingest_china_fdi.py`	Filter the FDI panel on exact `Company Country (HQ) == China`, normalize project records, dedupe
1b	`build_ev_basket.py`	Validate the curated EV stage table against the BACI HS17 product list and write the EV HS6 basket cuts
2	`aggregate_fdi.py`	Aggregate to country-technology cells for the 2020-2024 window (Solar + Battery + EV scope)
3	`build_capability_window.py`	Compute pre-FDI capability vector from PC and broad RCA (2015-2019 lag; EV 2018-2019 from the dedicated model)
4	`build_trade_outcome_window.py`	Filter BACI to the Solar + Battery + EV HS6 baskets; compute pre/post export/RCA outcomes under three basket cuts
5	`join_worktable.py`	Join capability, FDI, and trade-outcome cells; write publish-gate QA
6	`build_quadrants.py`	Apply absolute-floor classification + composite ranking indices
7	`quadrants_pca_robustness.py`	Validate 50/25/25 composite weighting against PCA-derived weights
8	`build_case_card_evidence.py`	Compute HS6 decomposition + timing evidence for the narrative batch
9	`build_mex_destination_evidence.py`	Compute destination decomposition for Mexico/Battery
10	`audit_fdi_hq_classification.py`	Apply curated firm-level HQ classification to test panel sensitivity
11	`make_quadrant_chart.py`	Render the Lens-1 readiness quadrant chart
12	`make_supply_chain_spectrum.py`	Render the Lens-2 supply-chain spectrum
13	`render_method_appendices.py`	Render the linked Markdown method notes as HTML appendices

Step 1: Ingest the FDI panel

The source FDI panel ships as data/raw/fdi_*.csv. The ingest script applies an exact string filter against the source-panel Company Country (HQ) field, keeping only rows where the value equals "China". No control-based, beneficial-owner, or analyst-classification logic is layered on top at this stage. The rule is chosen for tractability and reproducibility; a control-based rule would require firm-level beneficial-ownership research beyond the scope of v1.

The audit at step 10 documents the cases where the source-panel HQ field is more permissive in practice than corporate HQ of record (Maxeon, AESC, Geely-Volvo, Geely-LEVC, Geely-Polestar, Geely-Proton, Canadian Solar, Xinyi Glass). The panel-as-ingested remains the canonical reproducible data layer; a documented-exclusions screen is reported alongside for narrative claims [2].

The script normalizes project records (sector, destination ISO3, year announced, capex, JV flag, status, source URLs), deduplicates on (project_id, sector), and writes the typed panel at data/processed/china_outward_fdi.json plus an ingest QA report at data/processed/_ingest_report.md. The output covers 427 China-HQ projects across Solar (167), EV (111), Battery (105), and Wind (44), with 50 destination countries and roughly $213B in disclosed capex at a project-level disclosure rate of 84%.
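The filter-then-dedupe logic can be sketched in a few lines. This is an illustration under stated assumptions, not the code in ingest_china_fdi.py: the `project_id` and `sector` key names, the dict-per-row shape, and the first-occurrence-wins dedupe policy are all assumed.

```python
HQ_FIELD = "Company Country (HQ)"  # source-panel column name, as given in the text


def ingest(rows):
    """Keep rows whose HQ field is exactly "China", then dedupe on (project_id, sector).

    `rows` is an iterable of dicts. The "project_id" and "sector" keys are
    assumed names; the first occurrence of each key wins (an assumption).
    """
    seen = set()
    kept = []
    for row in rows:
        if row.get(HQ_FIELD) != "China":  # exact match only; no control-based logic
            continue
        key = (row["project_id"], row["sector"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


# Hypothetical records for illustration.
sample = [
    {"project_id": "p1", "sector": "Solar", HQ_FIELD: "China"},
    {"project_id": "p1", "sector": "Solar", HQ_FIELD: "China"},      # duplicate
    {"project_id": "p2", "sector": "Battery", HQ_FIELD: "Germany"},  # filtered out
    {"project_id": "p3", "sector": "EV", HQ_FIELD: "China"},
]
print([r["project_id"] for r in ingest(sample)])  # ['p1', 'p3']
```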

Step 2: Scope to Solar, Battery, and EV

Quadrant scoring covers Solar, Battery, and EV. The v1 release scored only Solar and Battery, because the shared upstream PC model trains on 10 green technologies and does not include EV. EV joined the scored panel in July 2026 when a dedicated EV predicted-competitiveness model became available; the EV capability measurement note in this rail documents the model, its shortened 2018-2019 pre-window, the 101-code basket and its stage-type curation, and the EV-specific HQ-audit sensitivities.

Wind remains excluded because the 2020-2024 China-HQ FDI window holds only 15 wind projects with under $0.5B in disclosed capex, much thinner than the scored sectors; the wind cell appears in the narrative as prose ("China-HQ wind firms have not globalized like solar and battery firms"), not as a quadrant cell. Wind records remain in the underlying panel for ad-hoc analysis and do not influence any cell-level claim. The fifteen-cell narrative batch covers Solar, Battery, and EV; there is no Wind narrative cell.

Step 3: Aggregate FDI to country-technology cells

For each (destination ISO3, sector) cell in the 2020-2024 FDI window, the aggregator computes project count and disclosed capex sum, plus a status breakdown (completed, commenced, announced) and the top projects by capex. The output is data/processed/fdi_by_country_sector.csv. The aggregation treats project count and disclosed capex as independent inputs: project count is always known (every record counts), but capex is sometimes blank (capex disclosure is partial). Both are carried forward into the quadrant scoring so the chart can encode count as bubble area and capex as the y-axis position.
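A minimal sketch of the count-versus-capex distinction follows. The record keys (`iso3`, `sector`, `capex_usd_m`) and the use of `None` for undisclosed capex are assumptions made for illustration; the real aggregator also produces the status breakdown and top-project list, which are omitted here.

```python
from collections import defaultdict


def aggregate(projects):
    """Aggregate projects to (destination ISO3, sector) cells.

    Every record adds to the project count; only disclosed capex values
    (not None) add to the capex sum. Key names are assumed.
    """
    cells = defaultdict(
        lambda: {"project_count": 0, "disclosed_capex_usd_m": 0.0, "n_disclosed": 0}
    )
    for p in projects:
        cell = cells[(p["iso3"], p["sector"])]
        cell["project_count"] += 1
        if p["capex_usd_m"] is not None:
            cell["disclosed_capex_usd_m"] += p["capex_usd_m"]
            cell["n_disclosed"] += 1
    return dict(cells)


# Hypothetical projects: one cell has a record with undisclosed capex.
demo = aggregate([
    {"iso3": "IDN", "sector": "Battery", "capex_usd_m": 500.0},
    {"iso3": "IDN", "sector": "Battery", "capex_usd_m": None},
    {"iso3": "HUN", "sector": "EV", "capex_usd_m": 120.0},
])
```

Carrying `n_disclosed` alongside the sum is one way to keep a fully undisclosed cell (count above zero, sum of zero) distinguishable from a cell with genuinely small capex.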

Step 4: Classify the FDI signal

A three-category FDI signal sits between the raw aggregates and the quadrant logic:

real_disclosed          = disclosed capex >= $100m
active_below_or_unknown = at least one project and disclosed capex < $100m (or fully undisclosed)
absent                  = no projects

The $100m absolute floor is intentionally strict. Smaller projects are real but, in this dataset, often correspond to lab-scale, demonstration-scale, or unverified records that should not pin a cell to the "FDI-First Experiment" diagnostic. The active_below_or_unknown category lets the chart preserve cells with announcement signal but insufficient capex disclosure, instead of collapsing them into either Bypassed (absent) or FDI-First (above the floor).
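The three-category rule translates directly into a small function. The sketch below assumes capex is expressed in USD millions and that a fully undisclosed cell carries a disclosed-capex sum of 0.0; the function and constant names are illustrative.

```python
FLOOR_USD_M = 100.0  # the $100m absolute floor from the text


def classify_fdi_signal(project_count, disclosed_capex_usd_m):
    """Return the three-category FDI signal for one country-technology cell."""
    if project_count == 0:
        return "absent"
    if disclosed_capex_usd_m >= FLOOR_USD_M:
        return "real_disclosed"
    # At least one project, but capex below the floor or fully undisclosed (0.0).
    return "active_below_or_unknown"


print(classify_fdi_signal(0, 0.0))    # absent
print(classify_fdi_signal(3, 250.0))  # real_disclosed
print(classify_fdi_signal(2, 0.0))    # active_below_or_unknown
```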

Step 5: Measure capability (Lens 1)

Two capability measures are computed from the 2015-2019 window. The primary direct measure is pc_level_pre, the median PC score over 2015-2019 from the upstream goodenough model. The secondary measure is broad_rca_footprint_pre, the median RCA across the six broad green-category baskets per country-technology cell.

The decision to use median rather than mean across categories is load-bearing. Mean-across-categories is contaminated by undiversified-economy bias: a country with one extreme high-RCA category and five very low ones reads as having industrial breadth when it has the opposite. Median is robust to this bias and is what gets carried into the composite ranking index.
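A toy example makes the bias concrete. The six RCA values below are hypothetical, chosen to mimic one extreme category and five very low ones.

```python
from statistics import mean, median

# Hypothetical RCA values across the six broad categories for one cell.
rca_by_category = [6.0, 0.1, 0.1, 0.2, 0.1, 0.1]

print(round(mean(rca_by_category), 2))    # 1.1
print(round(median(rca_by_category), 2))  # 0.1
```

The mean lands above 1.0 and would read as specialization with breadth, while the median stays near the level of the five weak categories.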

The broad RCA measure is not a textbook Hidalgo-style phi-weighted product-space relatedness density. Building that properly requires running the full HS6 lift on BACI export data and is a v2 task. The pre-aggregated six-category cut captures the same intuition (industrial breadth into related categories) at far lower computational cost and is defensible as a v1 placeholder. The capability measurement note [1] documents this choice in full.

PC momentum is also computed (the year-on-year change in PC over the pre-window) but is used only as a composite-index input for within-quadrant ranking, not for quadrant membership.

Step 6: Assign quadrants

Quadrants are assigned from direct inputs, not composite scores:

Prepared Magnet                 pc_level_pre >= 0.10 and fdi_signal == real_disclosed
Untapped Candidate              pc_level_pre >= 0.10 and fdi_signal == absent
FDI-First Experiment            pc_level_pre <  0.10 and fdi_signal == real_disclosed
Bypassed / Lagging              pc_level_pre <  0.10 and fdi_signal == absent
Active, magnitude below/unknown fdi_signal == active_below_or_unknown

The PC = 0.10 floor is absolute, not a percentile. It corresponds to roughly the top decile of incumbent capability in each scored technology's universe (11 of 155 solar cells, 11 of 155 battery cells, and 16 of 158 EV cells sit at or above it). The floor was chosen intentionally strict so the label "Prepared Magnet" connotes confident incumbency, not "above median." A 30th-percentile threshold would have inflated the Prepared Magnet count and weakened the diagnostic distinction.

The trade-off is documented: cells with substantial adjacent industrial bases but weak direct technology exports fall below the floor by design. Mexico/Battery (PC 0.014) is the load-bearing example. The station handles this by treating supply-chain adjacency as a Lens-2 finding on the trade outcome side rather than a quadrant input.
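The assignment rules can be sketched as a single function over the two direct inputs. The function name is illustrative; the labels, the 0.10 floor, and the signal values are taken from the rules above.

```python
PC_FLOOR = 0.10  # absolute floor from the text, not a percentile


def assign_quadrant(pc_level_pre, fdi_signal):
    """Map the two direct inputs to one of the five Lens-1 categories."""
    if fdi_signal == "active_below_or_unknown":
        return "Active, magnitude below/unknown"
    capable = pc_level_pre >= PC_FLOOR
    if fdi_signal == "real_disclosed":
        return "Prepared Magnet" if capable else "FDI-First Experiment"
    if fdi_signal == "absent":
        return "Untapped Candidate" if capable else "Bypassed / Lagging"
    raise ValueError(f"unknown fdi_signal: {fdi_signal!r}")


# Mexico/Battery as described in the text: PC 0.014 and no China-HQ FDI.
print(assign_quadrant(0.014, "absent"))  # Bypassed / Lagging
```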

The chart applies a square-root display transform on the x-axis so the dense low-PC region does not collapse into the left edge; the underlying PC values and the PC = 0.10 floor are unchanged.

Step 7: Composite indices for ranking, not classification

Two composite indices are computed for within-quadrant ordering and case selection:

capability_momentum_index = 0.50 * PC level + 0.25 * PC momentum + 0.25 * broad RCA footprint
fdi_intensity_index       = combines project count and asinh(capex)

These indices never feed back into quadrant membership; that depends only on PC and the FDI signal.

The 50/25/25 weighting is pre-registered. PCA on the standardized capability matrix yields data-derived weights of 38/30/33 (rounded, so they sum to 101), and the pooled Spearman rho between the two weightings is 0.998. The ranking is therefore not sensitive to the pre-registered weights, which is why the simpler transparent weighting is used in public surfaces. The robustness check is reported at data/processed/pca_robustness_report.md.
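The robustness check compares the ranking produced by two weight vectors. The sketch below is a pure-Python illustration, not the code in quadrants_pca_robustness.py: it assumes the three inputs are already standardized to a common scale, uses hypothetical cell values, and implements Spearman rho as the Pearson correlation of average ranks.

```python
def capability_momentum_index(pc_level, pc_momentum, rca_footprint,
                              weights=(0.50, 0.25, 0.25)):
    """Weighted sum of three inputs assumed to be on a common scale."""
    w_level, w_mom, w_rca = weights
    return w_level * pc_level + w_mom * pc_momentum + w_rca * rca_footprint


def ranks(values):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical standardized inputs for four cells: (level, momentum, footprint).
cells = [(1.2, 0.3, 0.8), (0.1, 0.9, 0.2), (-0.5, -0.2, 0.1), (0.6, 0.1, 1.5)]
pre_registered = [capability_momentum_index(*c) for c in cells]
# Rounded PCA-derived weights quoted in the text.
pca_like = [capability_momentum_index(*c, weights=(0.38, 0.30, 0.33)) for c in cells]
rho = spearman_rho(pre_registered, pca_like)
```

On these four toy cells both weightings produce the same ordering, so `rho` is 1.0. The fdi_intensity_index is left out because the text does not give its exact functional form.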

Step 8: Measure trade outcomes (Lens 2)

The trade-outcome layer scans BACI HS17 V202601 across 2017-2019 and 2022-2024, filtering to the Solar + Battery + EV HS6 baskets. The output data/processed/trade_outcome_by_country_sector.csv carries 678 country-sector cells with three basket cuts:

All-stage: the full HS6 basket for each technology (raw materials through finished products).
Manufacturing-only: all-stage minus upstream raw materials. This is the canonical chart and worktable outcome because the story is about FDI-linked industrial capability, not raw-material exposure.
Factory-stage: keeps only Final Product, Product Component, and Process Equipment; excludes Processed Material. This is the stricter diagnostic for "did the country become a cell or component or equipment exporter?"

The default for the canonical export and RCA columns is the manufacturing-only cut. The factory-stage columns are kept alongside as a validation layer because midstream processed-material upgrading is itself a real industrial outcome that the factory-stage cut would discard. Indonesia/Battery is the load-bearing example: manufacturing-only RCA crosses 1.0 (0.64 to 1.58), but factory-stage RCA remains far below 1.0 (0.21 to 0.26). Public copy describes Indonesia/Battery as a midstream battery-materials platform, not a cell exporter.
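The sketch below shows how the same country can read differently under two basket cuts. It assumes the standard Balassa definition of RCA, which the text does not spell out, and the export values are hypothetical.

```python
def balassa_rca(country_basket, country_total, world_basket, world_total):
    """Balassa revealed comparative advantage for one basket (assumed definition).

    RCA = (basket share of the country's exports)
          / (basket share of world exports).
    """
    return (country_basket / country_total) / (world_basket / world_total)


# Hypothetical export values (USD m) for one country under two basket cuts.
country_total, world_total = 1_000.0, 100_000.0
manufacturing_only = balassa_rca(30.0, country_total, 2_000.0, world_total)
factory_stage = balassa_rca(3.0, country_total, 1_200.0, world_total)

print(round(manufacturing_only, 2))  # 1.5
print(round(factory_stage, 2))       # 0.25
```

A country whose basket exports are mostly processed material can sit above 1.0 on the manufacturing-only cut and far below it on the factory-stage cut, which is the pattern reported for Indonesia/Battery.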

466 of the 678 trade-outcome cells join cleanly to the capability-covered quadrant table; the remaining 212 are tiny territories outside the capability panels (Aruba, Anguilla, Andorra, Bahamas, and similar) and do not enter any narrative cell.

Step 9: Assign supply-chain position families

For the fifteen-cell narrative batch, each cell is assigned to one of five supply-chain families based on the HS6 decomposition of its post-window export delta:

Finished-product export platform: finished-product share at least 50% of the manufacturing-basket delta, and post-window factory-stage RCA at least 1.0.
Equipment / process-tools platform: dominant HS type in the delta is Process Equipment, and finished-product share is below 30%.
Midstream materials / feedstock platform: dominant HS type in the delta is Processed Material, and finished-product share is below 15%.
Hybrid / relative-share case: no single HS type accounts for more than 40% of the basket delta, and factory-stage RCA declines over the window. Hybrid is a fallback family; cells are assigned Finished, Equipment, or Midstream first by dominant HS type and finished-product share.
Vehicle and component export platform: dominant HS type in the delta is Product Component, and finished-product share is below 50%. This is the EV family: recipients enter the window with existing auto-industry bases, so growth is auto parts rather than newly created factory capacity, and the diagnostic is finished-vehicle entry (HS 870380 battery-electric vehicles) rather than factory-base creation. Two variants: established vehicle exporters with factory-stage RCA above 1.0 (Hungary, Mexico, Thailand) and the home-market producer whose output is absorbed domestically (Brazil). Indonesia's EV cell shares the home-market FDI story but assigns to the midstream family on its dominant HS type.

A negative RCA trajectory alone does not move a cell into Hybrid. USA/Solar and Saudi/Solar both have declining factory-stage RCA but their growth is dominated by Process Equipment and Processed Material respectively, so they remain in those families. The taxonomy note [3] documents the rationale and the full fifteen-cell assignment.
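The family criteria above can be sketched as an ordered rule cascade. Several things here are assumptions of the sketch rather than documented behavior: the evaluation order (Finished, Equipment, Midstream, Vehicle and component, then Hybrid as fallback), the `"Unassigned"` fallthrough, and applying the vehicle-and-component rule without an explicit EV-sector check. The share values in the example are hypothetical.

```python
def assign_family(shares, rca_factory_pre, rca_factory_post):
    """Assign a supply-chain family from HS-type shares of the export delta.

    `shares` maps HS type ("Final Product", "Product Component",
    "Process Equipment", "Processed Material") to its share of the
    manufacturing-basket delta.
    """
    finished = shares.get("Final Product", 0.0)
    dominant = max(shares, key=shares.get)
    if finished >= 0.50 and rca_factory_post >= 1.0:
        return "Finished-product export platform"
    if dominant == "Process Equipment" and finished < 0.30:
        return "Equipment / process-tools platform"
    if dominant == "Processed Material" and finished < 0.15:
        return "Midstream materials / feedstock platform"
    if dominant == "Product Component" and finished < 0.50:
        return "Vehicle and component export platform"
    # Fallback: no single type above 40% and factory-stage RCA declining.
    if max(shares.values()) <= 0.40 and rca_factory_post < rca_factory_pre:
        return "Hybrid / relative-share case"
    return "Unassigned"


# Hypothetical equipment-dominated cell with declining factory-stage RCA.
equipment_cell = {
    "Final Product": 0.012,
    "Product Component": 0.188,
    "Process Equipment": 0.60,
    "Processed Material": 0.20,
}
print(assign_family(equipment_cell, 1.2, 1.1))  # Equipment / process-tools platform
```

Because the dominant-type rules run before the Hybrid fallback, a declining RCA on its own never moves an equipment-dominated or material-dominated cell into Hybrid.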

Step 10: HQ audit and the documented-exclusions screen

A curated firm-level audit applies an HQ classification table to all 427 records. It assigns 81.5% of disclosed capex ($172.7B of $211.8B) to a specific HQ status and leaves an unverified residual of $39.2B across 98 records and 77 firms. The classified set includes seven control-linked non-China-HQ firms (Maxeon, AESC, Geely-Volvo, Geely-LEVC, Geely-Polestar, Geely-Proton, Canadian Solar), one Hong Kong-HQ firm (Xinyi Glass), and one joint-venture case with a material non-China-HQ partner (SAIC-GM-Wuling). Unlike the pre-refresh tail of small firms, the current residual is material in aggregate: the 2026-06-28 panel refresh added large mainland-Chinese firms (the largest is TBEA at $11.5B) that read as China-HQ by name but are not yet individually entered in the curated table, so they default to unverified. This is a curation-completeness gap rather than foreign-HQ capital, and it leaves the documented-exclusions screen used for narrative claims unaffected.

A documented-exclusions screen removes the control-linked non-China-HQ firms and the Hong Kong case from the panel for narrative claims. Four of the fifteen narrative cells change disclosed-capex magnitude under the screen, but none falls below the $100m FDI signal floor (IDN/Solar drops from $17.6B to $6.07B with Xinyi excluded; USA/Battery drops from $4.38B to $3.66B with Canadian Solar excluded; USA/Solar drops from $2.16B to $1.16B with Maxeon excluded; MEX/EV drops from $3.75B to $2.75B with the Geely-owned Volvo record excluded). In the current panel the screen removes no cell from the Solar + Battery FDI-First universe: every FDI-First cell keeps at least $100m of disclosed capex from non-excluded firms, so the universe is 24 cells under both the panel rule and the audited-HQ screen. The HQ rule note [2] documents the rule, the audit, and the cell-level sensitivity.
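The cell-level sensitivity test amounts to re-summing disclosed capex with the excluded firms dropped and re-checking the floor. In the sketch below the record shape and the split of a cell into two records are hypothetical; only the firm names and the USA/Solar totals ($2.16B before, $1.16B after excluding Maxeon) come from the text.

```python
FLOOR_USD_M = 100.0  # the $100m FDI signal floor

# Firms removed by the documented-exclusions screen, as named in the text.
EXCLUDED_FIRMS = {
    "Maxeon", "AESC", "Geely-Volvo", "Geely-LEVC", "Geely-Polestar",
    "Geely-Proton", "Canadian Solar", "Xinyi Glass",
}


def capex_under_screen(records, excluded_firms):
    """Sum disclosed capex (USD m) after dropping records from excluded firms."""
    return sum(
        r["capex_usd_m"]
        for r in records
        if r["firm"] not in excluded_firms and r["capex_usd_m"] is not None
    )


def stays_above_floor(records, excluded_firms):
    """True if the cell still clears the floor once the screen is applied."""
    return capex_under_screen(records, excluded_firms) >= FLOOR_USD_M


# Hypothetical two-record stand-in for USA/Solar, matching the quoted totals.
usa_solar = [
    {"firm": "Maxeon", "capex_usd_m": 1000.0},
    {"firm": "Other firms (combined)", "capex_usd_m": 1160.0},
]
print(capex_under_screen(usa_solar, set()))           # 2160.0
print(capex_under_screen(usa_solar, EXCLUDED_FIRMS))  # 1160.0
print(stays_above_floor(usa_solar, EXCLUDED_FIRMS))   # True
```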

The panel-as-ingested remains the canonical reproducible data layer. Documented-exclusions is the preferred public-facing screen for narrative claims.

Aggregate finding

Across the twenty-four FDI-First Experiment cells in the broader scored Solar + Battery panel, one crossed factory-stage RCA = 1 from below during the 2022-2024 trade window: KHM/Solar (0.04 to 2.83). That crossing is US-tariff transshipment, mainland Chinese cell and module makers relocating final assembly to Cambodia to route around US antidumping and countervailing duties, rather than indigenous factory-stage capability. Of the other twenty-three, two ended the window above 1.0 (DEU/Battery 2.03 to 1.76; USA/Solar 1.27 to 1.07) but started above and declined; three more (ARE/Solar 1.41 to 0.97; USA/Battery 1.19 to 0.94; GBR/Battery 0.99 to 0.98) started near or above and dropped below; the remaining eighteen started and stayed below 1.0. The finding holds under the documented-exclusions screen, where the FDI-First universe is unchanged at twenty-four cells and the sole crossing is still the Cambodia transshipment case. This is the load-bearing aggregate finding for the package: in this trade window, and setting aside one tariff-driven transshipment crossing, China-HQ FDI in FDI-First cells was not followed by factory-stage export specialization gains.
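The trajectory buckets in this tally can be reproduced with a strict threshold rule. The sketch uses the factory-stage RCA pairs quoted above; the function name and bucket labels are illustrative.

```python
def rca_trajectory(pre, post, threshold=1.0):
    """Bucket a factory-stage RCA path relative to the specialization threshold."""
    if pre < threshold <= post:
        return "crossed from below"
    if pre >= threshold and post >= threshold:
        return "stayed above"
    if pre >= threshold and post < threshold:
        return "dropped below"
    return "stayed below"


# Factory-stage RCA (pre, post) pairs quoted in the text.
cells = {
    "KHM/Solar": (0.04, 2.83),
    "DEU/Battery": (2.03, 1.76),
    "USA/Solar": (1.27, 1.07),
    "ARE/Solar": (1.41, 0.97),
    "USA/Battery": (1.19, 0.94),
}
buckets = {name: rca_trajectory(pre, post) for name, (pre, post) in cells.items()}
for name, bucket in buckets.items():
    print(name, bucket)
```

Under this strict rule GBR/Battery (0.99 to 0.98) would land in "stayed below"; the text groups it with the near-threshold decliners, which is a judgment about proximity to 1.0 that the sketch does not encode.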

Caveats

Announcement-to-output lag. Greenfield green-tech plants typically need two to four years from announcement to commercial output. The 2022-2024 trade post-window therefore measures a mix of output from plants operating before the FDI window opened and early output from FDI-window plants that ramped quickly. Most FDI-First cells fall in the former case; the 2022-2024 trade numbers do not yet include meaningful output from the 2020-2024 announcements. A 2027-onward trade-window refresh is pre-registered as the natural follow-up test.

HS17 Solar finished-product proxy. The green dictionary uses HS22 codes (854141 / 854142 / 854143 / 854149) for finished PV cells and modules. BACI HS17 buckets these under the broader HS17 code 854140, which also contains LEDs. For Vietnam/Solar, LED contamination is small relative to known module output and the bucket is a defensible module proxy. For Malaysia/Solar, the LED component is material relative to PV but the equipment-platform reading is robust because the finished-product share (1.2%) is already an upper bound well below the family threshold.

Asymmetric bias correction. Trade-outcome RCA uses the full-basket denominator without the median-across-categories correction applied on the capability side. Bias-correction symmetry is a v2 cleanup direction.

Documented v1 coverage exclusion. Serbia/Solar carries $300m in China-HQ FDI but has no goodenough capability coverage, so it is excluded from the scored quadrant table and reported in the publish-gate QA without blocking publication. The publish gate in join_worktable.py exits non-zero if a new undocumented unmatched FDI cell above $50m appears.
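The gate logic can be sketched as follows. This illustrates the rule as described, not the code in join_worktable.py: the input shape (a mapping from cell to disclosed capex in USD millions), the function names, and the strict greater-than comparison are assumptions.

```python
import sys

GATE_USD_M = 50.0  # threshold stated in the text
DOCUMENTED_EXCLUSIONS = {("SRB", "Solar")}  # Serbia/Solar, per the text


def undocumented_unmatched(unmatched_cells):
    """Return unmatched FDI cells above the gate that are not documented.

    `unmatched_cells` maps (iso3, sector) to disclosed capex in USD m for
    FDI cells with no capability coverage (assumed shape).
    """
    return sorted(
        cell
        for cell, capex in unmatched_cells.items()
        if capex > GATE_USD_M and cell not in DOCUMENTED_EXCLUSIONS
    )


def run_gate(unmatched_cells):
    """Return a process exit code: 0 if the gate passes, 1 otherwise."""
    offenders = undocumented_unmatched(unmatched_cells)
    if offenders:
        print(f"publish gate failed: {offenders}", file=sys.stderr)
        return 1
    return 0


# Serbia/Solar is documented, so it is reported but does not block.
exit_code = run_gate({("SRB", "Solar"): 300.0})
print(exit_code)  # 0
```

A real entry point would pass the returned code to `sys.exit` so that the pipeline runner sees the failure.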

Supply-chain adjacency is invisible to direct PC. Mexico/Battery shows +76% manufacturing export growth with PC 0.014 (below the 0.10 floor) and no China-HQ FDI in the panel. The HS6 decomposition shows the increase is dominated by process equipment, components, testing equipment, separators, and electrolyte, not finished cells. The destination-decomposition note for Mexico/Battery [4] confirms USA + Canada absorb 88.4% of the basket delta. The PC = 0.10 floor catches Mexico/Battery as Bypassed by design; a v2 supply-chain-adjacency flag from process-equipment, component, electronics, and auto HS6 RCA is the natural next addition, surfaced as a flag rather than as a quadrant input.

Outcome glyphs on the Lens-1 chart are manufacturing-rule glyphs. Indonesia/Battery receives an up-up glyph because manufacturing-only RCA crosses 1.0, but the stricter factory-stage RCA stays below 1.0. Mexico/Battery receives a star glyph for growth without China-HQ FDI but does not enter RCA specialization. A v1.1 glyph refinement would split midstream specialization from factory-stage export entry, and growth-only stars from growth-plus-RCA-entry stars.

References

[1] sources/capability_measurement.md. Seven decisions on the Lens-1 capability axis: PC as primary measure, broad RCA footprint using median-across-categories, PC = 0.10 absolute floor, composites for ranking only, Solar + Battery scope, non-overlapping time windows.

[2] sources/hq_classification_rule.md. Exact-string HQ rule, seven control-linked non-China-HQ firms, Hong Kong case, joint-venture cases, audit-status counts, cell-level sensitivity under the documented-exclusions screen.

[3] drafts/supply_chain_spectrum_v1.md. Five-family supply-chain position taxonomy with primary criteria, fifteen-cell assignment, aggregate finding, caveats.

[4] data/processed/mex_battery_destination_evidence.md. Destination decomposition supporting the Mexico/Battery card claim that the basket delta is absorbed primarily by USA + Canada.

[5] DATA_INVENTORY.md. Catalog of every input, intermediate, and produced output with source paths, use, and status.

[6] scripts/README.md. Pipeline-runner reference with classification rule and refresh instructions.

[7] BACI HS17 V202601. CEPII bilateral trade panel. Used for the trade-outcome layer.

[8] PC scores: upstream goodenough predicted-competitiveness model. Country-technology-year, 10 green technologies, 2003-2024.
