
Claude Code for Data Science: EDA, Analysis, and Visualization Workflows

How to use Claude Code for data science — EDA, pandas feature engineering, sklearn model selection, matplotlib visualizations, and Jupyter integration.


Claude Code is a strong fit for data science work. Drop a CSV path into your terminal, describe your goal in plain English, and Claude Code writes the pandas EDA, feature engineering, sklearn pipeline, and matplotlib visualization code for you, immediately runnable in your existing environment. It doesn't replace your Python skills; it removes the friction of boilerplate so you spend time on interpretation, not syntax lookup. For teams running Jupyter notebooks, it integrates by writing and running .py scripts in the same project directory or by generating notebook-ready cells, covered in the Jupyter section below. This guide covers every major workflow with production-ready code examples.


EDA Workflow: Describe a CSV, Claude Generates the Analysis

The fastest way to start is to give Claude Code a description of your dataset and ask for a full EDA script.

Prompt to Claude Code:

I have a CSV at data/sales_2025.csv with columns:
date (YYYY-MM-DD), region, product_category, revenue, units_sold, discount_pct, customer_id

Write a complete EDA script that:
1. Loads the data and prints shape, dtypes, null counts
2. Plots revenue distribution and a monthly trend line
3. Shows correlation between discount_pct and revenue
4. Prints top 5 regions by total revenue

Generated output:

import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

Path("outputs").mkdir(exist_ok=True)  # the savefig calls below fail if outputs/ is missing

# --- Load & inspect ---
df = pd.read_csv("data/sales_2025.csv", parse_dates=["date"])
print("Shape:", df.shape)
print("\nDtypes:\n", df.dtypes)
print("\nNull counts:\n", df.isnull().sum())
print("\nSample:\n", df.head(3))

# --- Revenue distribution ---
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(df["revenue"], bins=50, edgecolor="white", color="#4A90D9")
axes[0].set_title("Revenue Distribution")
axes[0].set_xlabel("Revenue ($)")
axes[0].set_ylabel("Count")

# --- Monthly trend ---
monthly = df.resample("ME", on="date")["revenue"].sum().reset_index()
axes[1].plot(monthly["date"], monthly["revenue"], marker="o", linewidth=2, color="#E74C3C")
axes[1].set_title("Monthly Revenue Trend")
axes[1].set_xlabel("Month")
axes[1].set_ylabel("Total Revenue ($)")
axes[1].tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.savefig("outputs/eda_revenue.png", dpi=150)
plt.show()

# --- Discount vs Revenue correlation ---
print("\nCorrelation (discount_pct vs revenue):",
      df["discount_pct"].corr(df["revenue"]).round(3))

fig2, ax = plt.subplots(figsize=(7, 4))
ax.scatter(df["discount_pct"], df["revenue"], alpha=0.3, s=15, color="#2ECC71")
ax.set_xlabel("Discount %")
ax.set_ylabel("Revenue ($)")
ax.set_title("Discount vs Revenue")
plt.tight_layout()
plt.savefig("outputs/eda_discount_corr.png", dpi=150)
plt.show()

# --- Top regions ---
top_regions = (
    df.groupby("region")["revenue"]
    .sum()
    .sort_values(ascending=False)
    .head(5)
    .reset_index()
)
top_regions.columns = ["Region", "Total Revenue"]
print("\nTop 5 Regions:\n", top_regions.to_string(index=False))

Claude Code generates this in seconds and saves it to a file you can run immediately. If anything fails (missing column, wrong dtype), paste the error back and it fixes it.


Feature Engineering Assistance

Feature engineering is where many data science projects stall — you know what you want conceptually but translating it to pandas is tedious. Claude Code handles this well.

Prompt:

From my sales DataFrame, engineer these features:
- days_since_last_purchase (per customer_id, sorted by date)
- rolling_7d_revenue (per customer, 7-day rolling sum)
- is_high_value (1 if revenue > 75th percentile, else 0)
- discount_revenue_interaction (discount_pct * revenue)

Generated code:

import pandas as pd

df = pd.read_csv("data/sales_2025.csv", parse_dates=["date"])
df = df.sort_values(["customer_id", "date"])

# --- days_since_last_purchase ---
df["prev_purchase_date"] = df.groupby("customer_id")["date"].shift(1)
df["days_since_last_purchase"] = (
    df["date"] - df["prev_purchase_date"]
).dt.days.fillna(0).astype(int)

# --- rolling_7d_revenue (per customer, trailing 7 calendar days) ---
# ".values" is safe here because df is already sorted by customer_id, then date,
# which matches the row order of the grouped rolling result
df["rolling_7d_revenue"] = (
    df.groupby("customer_id")
    .rolling("7D", on="date")["revenue"]
    .sum()
    .values
)

# --- is_high_value ---
p75 = df["revenue"].quantile(0.75)
df["is_high_value"] = (df["revenue"] > p75).astype(int)

# --- interaction feature ---
df["discount_revenue_interaction"] = df["discount_pct"] * df["revenue"]

print("New features added:", ["days_since_last_purchase", "rolling_7d_revenue",
                               "is_high_value", "discount_revenue_interaction"])
print(df[["customer_id", "date", "revenue",
          "days_since_last_purchase", "rolling_7d_revenue",
          "is_high_value", "discount_revenue_interaction"]].head(8))

The key advantage: Claude Code understands business intent ("7-day rolling per customer") and writes a correct grouped, time-based window. This is a common source of bugs when written manually, because pandas' default rolling(window=7) counts the last 7 rows (transactions), not the last 7 calendar days.
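
To see why this matters, here is a minimal sketch on hypothetical toy data: three purchases for one customer spread over a month. The row-based window silently mixes purchases from weeks apart.

import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01", "2025-01-03", "2025-01-20"]),
    "revenue": [100.0, 50.0, 200.0],
})

# Row-based: sums the last 7 *rows*, however far apart in time they are
row_based = toy["revenue"].rolling(window=7, min_periods=1).sum()

# Time-based: sums only purchases within the trailing 7 calendar days
time_based = toy.rolling("7D", on="date")["revenue"].sum()

print(row_based.tolist())   # [100.0, 150.0, 350.0]
print(time_based.tolist())  # [100.0, 150.0, 200.0] -- Jan 20 window excludes Jan 1-3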


Model Selection Advice

Describe your prediction task and Claude Code recommends a model, explains the reasoning, and writes a baseline pipeline.

Prompt:

I want to predict is_high_value (binary target).
Features: days_since_last_purchase, rolling_7d_revenue, discount_pct,
          region (categorical, 8 values), product_category (6 values).
~50,000 rows. Which model should I start with?

Claude Code's response + code:

For a binary classification task with mixed numeric/categorical features and 50k rows, start with Random Forest as a baseline — it handles mixed types, is robust to outliers, and gives feature importances out of the box. Graduate to XGBoost or LightGBM for production if you need higher accuracy or faster inference.

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, roc_auc_score

df = pd.read_csv("data/sales_2025_features.csv")

FEATURES = [
    "days_since_last_purchase", "rolling_7d_revenue",
    "discount_pct", "region", "product_category"
]
TARGET = "is_high_value"

X = df[FEATURES]
y = df[TARGET]

numeric_features = ["days_since_last_purchase", "rolling_7d_revenue", "discount_pct"]
categorical_features = ["region", "product_category"]

preprocessor = ColumnTransformer(transformers=[
    ("num", "passthrough", numeric_features),
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     categorical_features),
])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(
        n_estimators=200,
        max_depth=12,
        class_weight="balanced",  # handles imbalanced target
        random_state=42,
        n_jobs=-1,
    )),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]

print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob).round(4))

# --- Feature importances ---
feature_names = numeric_features + categorical_features
importances = pipeline.named_steps["model"].feature_importances_
importance_df = pd.DataFrame({
    "feature": feature_names,
    "importance": importances
}).sort_values("importance", ascending=False)
print("\nFeature Importances:\n", importance_df.to_string(index=False))

Visualization: Describe What You Want, Get matplotlib/seaborn Code

Claude Code excels at translating visualization intent into clean chart code — especially for the kinds of multi-panel, annotated charts that take 30+ minutes to write from scratch.

Prompt:

Create a 2x2 dashboard figure showing:
- top-left: bar chart of revenue by region, sorted descending
- top-right: line chart of monthly units_sold, with a 3-month moving average overlay
- bottom-left: heatmap of avg revenue by region x product_category
- bottom-right: box plot of discount_pct by product_category
Use a clean, minimal style. Save as outputs/sales_dashboard.png at 200 dpi.

Generated code:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

Path("outputs").mkdir(exist_ok=True)  # ensure the save directory exists
df = pd.read_csv("data/sales_2025.csv", parse_dates=["date"])
sns.set_theme(style="whitegrid", palette="muted", font_scale=0.9)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("Sales Dashboard 2025", fontsize=14, fontweight="bold", y=1.01)

# --- Top-left: Revenue by region ---
region_rev = (
    df.groupby("region")["revenue"].sum()
    .sort_values(ascending=False)
    .reset_index()
)
axes[0, 0].barh(region_rev["region"], region_rev["revenue"], color=sns.color_palette("muted")[0])
axes[0, 0].invert_yaxis()
axes[0, 0].set_title("Revenue by Region")
axes[0, 0].set_xlabel("Total Revenue ($)")

# --- Top-right: Monthly units + 3-month MA ---
monthly = df.resample("ME", on="date")["units_sold"].sum().reset_index()
monthly["ma3"] = monthly["units_sold"].rolling(3).mean()
axes[0, 1].plot(monthly["date"], monthly["units_sold"],
                marker="o", linewidth=1.5, label="Monthly", alpha=0.7)
axes[0, 1].plot(monthly["date"], monthly["ma3"],
                linewidth=2.5, linestyle="--", label="3-Month MA", color="tomato")
axes[0, 1].set_title("Monthly Units Sold")
axes[0, 1].set_xlabel("")
axes[0, 1].tick_params(axis="x", rotation=30)
axes[0, 1].legend()

# --- Bottom-left: Heatmap region x category ---
pivot = df.pivot_table(
    values="revenue", index="region",
    columns="product_category", aggfunc="mean"
)
sns.heatmap(pivot, ax=axes[1, 0], cmap="YlOrRd", fmt=".0f",
            annot=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
axes[1, 0].set_title("Avg Revenue: Region × Category")
axes[1, 0].set_xlabel("")

# --- Bottom-right: Box plot discount by category ---
# sns.boxplot keeps the dashboard suptitle intact; pandas' df.boxplot(by=...)
# would overwrite it with its own "Boxplot grouped by ..." suptitle
sns.boxplot(data=df, x="product_category", y="discount_pct", ax=axes[1, 1])
axes[1, 1].set_title("Discount % by Category")
axes[1, 1].set_xlabel("Product Category")
axes[1, 1].set_ylabel("Discount %")
axes[1, 1].tick_params(axis="x", rotation=30)

plt.tight_layout()
plt.savefig("outputs/sales_dashboard.png", dpi=200, bbox_inches="tight")
plt.show()
print("Saved: outputs/sales_dashboard.png")

CLAUDE.md Setup for Data Science Projects

Adding a CLAUDE.md file to your data science project directory makes Claude Code context-aware — it knows your dataset names, column conventions, and environment constraints without you having to repeat them.

Example CLAUDE.md (placed at the project root) for a data science project:

# Project: Customer Revenue Analysis

## Environment
- Python 3.11, pandas 2.2, scikit-learn 1.5, matplotlib 3.9, seaborn 0.13
- Run scripts with: `python src/analysis.py`
- Data directory: `data/` (never modify raw files — write to `data/processed/`)
- Output charts: `outputs/` (always save at 200 dpi, PNG)

## Dataset: sales_2025.csv
Columns: date (YYYY-MM-DD), region (str, 8 unique), product_category (str, 6 unique),
         revenue (float, USD), units_sold (int), discount_pct (float 0-1),
         customer_id (str, UUID format)

## Coding conventions
- Use snake_case for all variable names
- Always use `parse_dates=["date"]` when loading this CSV
- Target variable for ML: `is_high_value` (derived, see features.py)
- Preferred model: RandomForest for baseline, LightGBM for production

## What NOT to do
- Don't hardcode file paths — use pathlib.Path(__file__).parent / "data"
- Don't use .iterrows() — vectorize with pandas operations
- Don't drop null rows without logging how many were removed

With this in place, every Claude Code session in this directory starts with full project context. You stop re-explaining column names and environment details.
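
In practice, those conventions translate into a loading preamble like this (a sketch; the src/ layout is an assumption based on the CLAUDE.md above):

import pandas as pd
from pathlib import Path

# Resolve data/ relative to this script (assumed to live in src/),
# per the "no hardcoded paths" rule in CLAUDE.md
DATA_DIR = Path(__file__).resolve().parent.parent / "data"
df = pd.read_csv(DATA_DIR / "sales_2025.csv", parse_dates=["date"])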


Jupyter Notebook Integration

Claude Code works alongside Jupyter in three main patterns.

Pattern 1: Write scripts, import results into notebooks

Claude Code writes .py scripts in your src/ directory. In your notebook:

# notebook cell
import subprocess
result = subprocess.run(["python", "src/features.py"], capture_output=True, text=True)
print(result.stdout)

Or simply run the generated script in your terminal and load the output artifacts:

import pandas as pd
df_features = pd.read_csv("data/processed/features.csv")
df_features.head()

Pattern 2: Ask Claude Code to generate complete notebook cells

Prompt: Write the code for a Jupyter cell that loads data/sales_2025.csv, shows a df.info() summary, and plots a revenue histogram — ready to paste into a .ipynb cell.

Claude Code returns a clean code block you paste directly. For iterative EDA, this is faster than tab-completing pandas method names.
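
For the prompt above, the returned cell looks roughly like this (illustrative, not verbatim output):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/sales_2025.csv", parse_dates=["date"])
df.info()

df["revenue"].plot.hist(bins=50, figsize=(8, 4), edgecolor="white",
                        title="Revenue Distribution")
plt.xlabel("Revenue ($)")
plt.tight_layout()
plt.show()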

Pattern 3: Percent-format scripts (Jupytext)

Claude Code can write a notebook-equivalent Python script using # %% cell markers, which the VS Code Jupyter extension runs cell-by-cell and Jupytext can convert into a real .ipynb:

# %% [markdown]
# ## Sales EDA — Auto-generated by Claude Code

# %%
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/sales_2025.csv", parse_dates=["date"])
print(f"Rows: {len(df):,} | Columns: {df.shape[1]}")
df.head()

# %%
# Monthly revenue trend
monthly = df.resample("ME", on="date")["revenue"].sum()
monthly.plot(title="Monthly Revenue", figsize=(10, 4))
plt.tight_layout()
plt.show()
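
To materialize an actual notebook from the percent-format script, Jupytext converts it in one command (the filename here is illustrative):

jupytext --to notebook src/eda.py   # writes src/eda.ipynb next to the script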

Real Example: Analyzing a Sales Dataset End-to-End

Here is a complete end-to-end session showing how a data scientist would use Claude Code on a real project. The goal: identify which customer segments are driving high-value purchases and build a simple classifier.

Step 1 — EDA prompt:

Load data/sales_2025.csv. Show me:
1. Value counts for region and product_category
2. Revenue percentiles (25th, 50th, 75th, 90th, 99th)
3. % of rows where discount_pct > 0.20
4. Monthly revenue trend as a quick chart

Step 2 — Feature engineering prompt (after reviewing EDA output):

Create is_high_value = 1 where revenue > 75th percentile.
Add: days_since_last_purchase, rolling_7d_revenue per customer.
Save the feature DataFrame to data/processed/features.csv.

Step 3 — Model training prompt:

Using data/processed/features.csv, train a RandomForestClassifier to predict is_high_value.
- 80/20 train/test split, stratified
- Print ROC-AUC and classification report
- Save the trained model to models/rf_high_value.pkl
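
The persistence step in the generated script typically uses joblib, the standard for sklearn objects (a sketch; assumes the fitted pipeline from the model section):

import joblib
from pathlib import Path

Path("models").mkdir(exist_ok=True)
joblib.dump(pipeline, "models/rf_high_value.pkl")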

Step 4 — Results interpretation prompt:

The model got ROC-AUC = 0.87. Feature importances show rolling_7d_revenue 
is the top feature at 42%, days_since_last_purchase at 28%.
What does this mean for the business, and what experiments should we run?

Claude Code answers: "The 7-day rolling revenue dominance suggests recency and purchase momentum matter more than historical frequency for high-value classification. Suggested experiments: (1) test a personalized discount intervention on customers with high rolling_7d_revenue but no recent high-value purchase, (2) A/B test urgency messaging for customers with days_since_last_purchase between 14-30 days — the model assigns them moderate high-value probability."

This closed loop of code generation, execution, interpretation, and business recommendation is where Claude Code creates the most leverage in a data science workflow.


Prompts That Drive Better Data Science Results

The examples above follow structured prompting patterns that consistently get better code from Claude Code. The P1 Power Prompts 300 collection includes 40+ data science–specific prompts covering EDA, feature engineering, model evaluation, and visualization — tested against standard sklearn datasets.

Get P1 Power Prompts 300 on Gumroad →


FAQ

Can Claude Code read my actual CSV files directly?

Yes. If your CSV is in the project directory, Claude Code can read it using its built-in file tools. It will inspect the file, identify columns and dtypes, and generate analysis code that references the real column names. For very large files (>100MB), it reads a sample to understand the schema and writes pandas code you run in your own environment — which is actually the safer pattern for production data.

How does Claude Code compare to GitHub Copilot for data science?

The key difference is context scope. Copilot autocompletes line-by-line in your editor. Claude Code understands multi-file project context, can read your CLAUDE.md, knows your dataset schema, and generates entire analysis scripts with explanations. For data science specifically — where you often need to explain "why this transformation" or "what does this result mean" — Claude Code's conversational reasoning is more useful than autocomplete. See Claude Code Complete Guide for a deeper comparison.

Which Claude model should I use for data science tasks?

For most data science workflows — writing EDA scripts, generating visualization code, and feature engineering — Claude Sonnet is the right default. It handles code generation and multi-step reasoning well without the latency or cost of Opus. Use Haiku for simple, repetitive tasks like batch-generating boilerplate column-rename operations. Reserve Opus for complex analytical decisions like architecture design for a production ML pipeline. See Haiku vs Sonnet vs Opus: Which Model for a full decision framework.

Does Claude Code work with scikit-learn pipelines and custom transformers?

Yes. Claude Code writes sklearn Pipeline and ColumnTransformer objects correctly, including custom BaseEstimator / TransformerMixin subclasses. Provide the column names and transformation intent in your prompt and it generates a complete, fit-able pipeline. It also handles edge cases like handle_unknown="use_encoded_value" for ordinal encoders encountering unseen categories at inference time.
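
As a concrete illustration, a custom transformer along these lines slots into the pipeline shown earlier (a hypothetical sketch, not output from a real session):

from sklearn.base import BaseEstimator, TransformerMixin

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip numeric columns to their 1st-99th percentiles."""

    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn per-column bounds on training data only
        self.bounds_ = X.quantile([self.lower, self.upper])
        return self

    def transform(self, X):
        # Align the learned bounds to columns (assumes X is a pandas DataFrame)
        return X.clip(lower=self.bounds_.iloc[0], upper=self.bounds_.iloc[1], axis=1)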

How do I avoid Claude Code generating code that runs on wrong column names?

Add a CLAUDE.md to your project with exact column names, dtypes, and sample values. Also include a note like "Always call df.columns.tolist() at the top of every generated script and assert expected columns are present." This makes generated code self-validating and eliminates the most common runtime failure mode. See How to Use Claude for Data Analysis for more prompt patterns.
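
The resulting guard at the top of a generated script looks like this (column set taken from the dataset described earlier):

EXPECTED = {"date", "region", "product_category", "revenue",
            "units_sold", "discount_pct", "customer_id"}
missing = EXPECTED - set(df.columns)
assert not missing, f"Missing expected columns: {sorted(missing)}"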


Summary

Claude Code accelerates data science workflows at every stage: EDA scripts from a CSV description, feature engineering from business intent, model selection with baseline code, multi-panel visualizations from a plain-English layout request, and closed-loop interpretation of results. The highest-leverage setup is a well-written CLAUDE.md that gives Claude Code your dataset schema and project conventions — once that is in place, you stop repeating context and start getting production-quality code on the first prompt.

For the prompting patterns that get the best code out of Claude Code across all 300 use cases — including 40+ data science–specific templates — the P1 Power Prompts 300 collection is the fastest way to get consistent, high-quality outputs.

Get P1 Power Prompts 300 →


Related: How to Use Claude for Data Analysis · Claude Code Complete Guide · Haiku vs Sonnet vs Opus: Which Model

AI Disclosure: Drafted with Claude Code; examples tested on standard datasets.
