Metrics Debugger — PM Skills

Overview

A metric just moved unexpectedly. Your job is to figure out why — fast. Is it a real product problem, a data pipeline issue, an external factor, or a normal fluctuation? Panic is expensive; methodical debugging is cheap.

First Response

When a metric drops, resist the urge to hypothesize immediately. Follow this sequence:

Step 1: Verify the Data

Before investigating the cause, confirm the drop is real:

[ ] Is the data fresh? Check pipeline latency. Is today's data fully loaded?
[ ] Is the definition consistent? Did someone change how the metric is calculated?
[ ] Is the source stable? Any data pipeline failures, schema changes, or ETL errors?
[ ] Is sampling consistent? If the metric is computed from a sample, did the sampling rate or method change?
[ ] Check multiple sources. Does the drop appear in both your analytics tool AND the database? If only one shows it, it's likely a data issue.

If the data is wrong, stop. Fix the data. Don't debug a phantom problem.
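
Two of the checks above (freshness and cross-source agreement) can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the function names, the 2% tolerance, and the maximum lag are assumptions you would tune to your own pipeline.

```python
from datetime import datetime, timedelta


def is_fresh(last_loaded_at: datetime, now: datetime, max_lag: timedelta) -> bool:
    """Return True if the latest load finished within the allowed lag."""
    return now - last_loaded_at <= max_lag


def sources_agree(analytics_value: float, database_value: float,
                  tolerance: float = 0.02) -> bool:
    """Return True if the two readings differ by at most `tolerance` (relative).

    The 2% default is an assumed threshold, not a standard.
    """
    baseline = max(abs(analytics_value), abs(database_value))
    if baseline == 0:
        return True  # both sources report zero
    return abs(analytics_value - database_value) / baseline <= tolerance
```

If `sources_agree` returns False, treat the drop as a likely data issue and fix the data before going further.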

Step 2: Scope the Impact

## Impact Assessment

**Metric:** [name]
**Change:** [X → Y] ([Z% change])
**When:** [started date/time]
**Duration:** [how long]

**Affected segments:**
- By platform: [iOS / Android / Web — which shows the drop?]
- By geography: [specific regions?]
- By user type: [new vs. existing, free vs. paid?]
- By channel: [organic vs. paid, specific referrer?]

**Related metrics:**
- [Upstream metric]: [changed / stable]
- [Downstream metric]: [changed / stable]
- [Correlated metric]: [changed / stable]
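
To fill in the "Affected segments" part of the template, it can help to compute how much each segment contributes to the overall change. The sketch below assumes an additive metric (for example, a count of conversions that sums across segments); ratio metrics need a different decomposition. The function name and the sample numbers are illustrative only.

```python
def segment_breakdown(before: dict[str, float],
                      after: dict[str, float]) -> list[tuple[str, float, float]]:
    """Return (segment, absolute change, share of total change), biggest movers first."""
    segments = sorted(set(before) | set(after))
    deltas = {s: after.get(s, 0.0) - before.get(s, 0.0) for s in segments}
    total = sum(deltas.values())
    rows = []
    for segment, delta in deltas.items():
        # Share is undefined when the total change is zero; report 0.0 in that case.
        share = delta / total if total != 0 else 0.0
        rows.append((segment, delta, share))
    rows.sort(key=lambda row: abs(row[1]), reverse=True)
    return rows


# Hypothetical numbers for illustration.
before = {"iOS": 500, "Android": 300, "Web": 200}
after = {"iOS": 350, "Android": 295, "Web": 205}
rows = segment_breakdown(before, after)
```

A segment whose share is close to 1.0 while the others stay near zero points to a localized cause, such as a platform-specific release.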

Step 3: Systematic Root Cause Analysis

Work through these categories in order:

1. Internal product changes:

- Any deployments in the timeframe? (Check deploy logs)
- Feature flag changes?
- A/B test started or ended?
- Infrastructure changes (CDN, load balancer, database migration)?

2. External factors:

- Seasonality (holiday, weekend, end of month)?
- Competitor action (launch, outage, price change)?
- News event affecting user behavior?
- App store changes (algorithm, featuring, ranking)?
- Platform changes (iOS update, Chrome update, API deprecation)?

3. User behavior shift:

- Traffic mix change (different channels sending different users)?
- User segment shift (enterprise vs. SMB ratio)?
- Bot traffic change?

4. Upstream dependency:

- Third-party API outage?
- Payment provider issue?
- Email deliverability change?
- Ad platform targeting change?
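
Many of the questions above reduce to "what changed shortly before the metric moved?". If you can export deploys, flag changes, and experiment events into one list, a small filter surfaces the candidates. The `ChangeEvent` type, its fields, and the 24-hour lookback are assumptions for this sketch; real change logs will look different.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class ChangeEvent(NamedTuple):
    timestamp: datetime
    kind: str          # e.g. "deploy", "feature_flag", "ab_test"
    description: str


def candidate_events(events: list[ChangeEvent], drop_start: datetime,
                     lookback: timedelta = timedelta(hours=24)) -> list[ChangeEvent]:
    """Return events inside the lookback window before the drop, most recent first."""
    window_start = drop_start - lookback
    hits = [e for e in events if window_start <= e.timestamp <= drop_start]
    return sorted(hits, key=lambda e: e.timestamp, reverse=True)


# Hypothetical change log for illustration.
events = [
    ChangeEvent(datetime(2024, 3, 4, 9, 0), "deploy", "checkout v2"),
    ChangeEvent(datetime(2024, 3, 1, 9, 0), "feature_flag", "old flag"),
    ChangeEvent(datetime(2024, 3, 4, 20, 0), "ab_test", "pricing test ended"),
]
hits = candidate_events(events, drop_start=datetime(2024, 3, 5, 8, 0))
```

Timing alone does not prove causation; use the list to decide what to examine first, then confirm with segment data.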

Step 4: Quantify and Attribute

Once you've identified the likely cause, record it and estimate how much of the change it explains:

## Root Cause

**Primary cause:** [description]
**Confidence:** High / Medium / Low
**Evidence:** [what confirms this]

**Impact breakdown:**
- [X%] of the drop is explained by [cause 1]
- [Y%] of the drop is explained by [cause 2]
- [Z%] is unexplained / noise

**Is this a one-time event or ongoing?** [assessment]
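
The "Impact breakdown" percentages are simple arithmetic once you have an estimate of how much of the drop each cause accounts for. A minimal sketch, assuming the drop and each cause's contribution are expressed in the same units (the function name and the sample figures are hypothetical):

```python
def attribute_drop(total_drop: float, explained: dict[str, float]) -> dict[str, float]:
    """Return the percentage of the total drop per cause, plus an 'unexplained' remainder."""
    if total_drop == 0:
        raise ValueError("total_drop must be non-zero")
    shares = {cause: 100.0 * amount / total_drop for cause, amount in explained.items()}
    shares["unexplained"] = 100.0 - sum(shares.values())
    return shares


# Hypothetical example: 200 fewer daily orders, two suspected causes.
shares = attribute_drop(200, {"checkout bug": 120, "holiday": 50})
```

A large or negative "unexplained" share is a sign that the estimates overlap or that a cause is still missing, which should lower the stated confidence.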

Step 5: Recommend Response

| Urgency | Criteria | Action |
|---------|----------|--------|
| Critical | Revenue-impacting, user-facing, growing | Fix immediately, war room |
| High | Significant but contained | Fix this sprint, monitor hourly |
| Medium | Moderate impact, stable | Investigate and fix within 1-2 weeks |
| Low | Minor, self-correcting | Monitor, no action needed |
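
The table can be read as a decision rule. The sketch below is one possible encoding, not an official rubric: the `impact` labels and the boolean inputs are assumptions, and the criteria in the table remain judgment calls.

```python
def urgency(impact: str, revenue_impacting: bool, user_facing: bool, growing: bool) -> str:
    """Map an assessment to an urgency level.

    `impact` is assumed to be one of "minor", "moderate", or "significant".
    """
    if revenue_impacting and user_facing and growing:
        return "Critical"
    if impact == "significant":
        return "High"
    if impact == "moderate":
        return "Medium"
    return "Low"
```

For example, `urgency("significant", revenue_impacting=True, user_facing=True, growing=False)` returns "High", since the issue is significant but no longer growing.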

Output

# Metrics Debug — [Metric Name] [Date]

## Summary
[2-3 sentences: what happened, why, what to do]

## Investigation Log
[Step-by-step what you checked and found]

## Root Cause
[Identified cause with evidence and confidence]

## Recommended Action
[What to do + urgency level]

## Monitoring Plan
[What to watch to confirm the fix or detect recurrence]

Save as METRICS-DEBUG-[metric]-[date].md.
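
If you generate the report programmatically, a small helper keeps filenames consistent. The lowercase hyphenated slug and the ISO date format are assumptions in this sketch; the pattern above does not specify either.

```python
import re
from datetime import date


def report_filename(metric: str, day: date) -> str:
    """Build METRICS-DEBUG-[metric]-[date].md with a filesystem-safe metric slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", metric.lower()).strip("-")
    return f"METRICS-DEBUG-{slug}-{day.isoformat()}.md"


name = report_filename("Checkout Conversion", date(2024, 3, 5))
```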