About Project

Olist — Market Segmentation & Repurchase Prediction

SQL architecture to consolidate facts/dims → segmentation with EFA + clustering → calibrated repurchase models to drive targeted campaigns by probability × ticket.

1) The Challenge

Multi-product marketplace with heterogeneous customers. We need to identify segments and predict repurchase to focus campaigns and offers, increasing frequency and ticket while reducing churn.

2) Approach

  1. SQL views: vw_customer_features (R/F/Monetary, logistics, reviews, categories).
  2. EFA: KMO/Bartlett → latent drivers; noise reduction.
  3. Clustering: K-Means / PAM / CLARA / hierarchical; silhouette & stability.
  4. Supervised: pruned tree, RF, GB, MLP; calibration (Brier) and ROC/AP.
  5. Business: Lift & Gain by deciles; per-segment thresholds; export to CRM.
ROC AUC
0.89
+0.03 vs baseline
Average Precision
0.96
Top deciles
Brier (↓ better)
0.11
Calibrated
Segments
3
Actionable

3) Findings

Lift & Gain

Top 20% captures ~65% of repurchases

Targeting the top two deciles by predicted probability concentrates the majority of expected revenue, reducing contact cost by up to 4× compared to untargeted campaigns.

Calibration

Probabilities you can trust

Calibration curve near the diagonal (Brier = 0.11) means a predicted 30% probability translates to a real 30% repurchase rate — enabling reliable per-segment budget allocation.

Drivers

Recency & logistics drive repurchase

Recency and delivery experience are the strongest predictors. Improving logistics SLA by 1 point increases repurchase probability by ~8%. Ticket is modeled separately via OLS.

Segments

3 actionable customer profiles

High-Value Loyalists → loyalty & cross-sell. At-Risk Occasionals → win-back campaigns. Low-Engagement Churners → aggressive incentives or deprioritize.

Model Scorecard

ROC AUC, Avg Precision, F1 & Brier across classifiers

Explore supervised models

4) Next Step

  1. Activation: export deciles (prob × ticket) to CRM; choose channel per segment.
  2. Experimentation: A/B with segment-level holdouts; measure uplift, CAC, and ROI.
  3. Orchestration: weekly re-training and scoring job; monitor drift.
  4. Governance: features catalog, model versioning, and experiment logbook.

Analytical Modules

Exploratory

Unsupervised — EFA + Clustering

We discover segments and their drivers, and define actions per profile.

  • EFA (KMO/Bartlett) → latent factors
  • K-Means / PAM / CLARA / hierarchical
  • Silhouette, stability, and naming
Modeling

Supervised — Repurchase & Ticket

Calibrated repurchase probability and expected ticket to prioritize campaigns.

  • Pruned tree, RF, GB, MLP
  • Calibration (Brier) & ROC/AP
  • OLS for avg_ticket
BI

Executive Story — Tableau

Executive sequence that integrates dashboards and key findings.

  • Guided navigation
  • KPIs and evidence
  • Shareable with stakeholders

Tech Stack

SQL (Snowflake)

Snowflake schema (facts/dims), analytic views, and warehouse orchestration.

scikit-learn

Clustering & supervised models, calibration, validation, and evaluation.

Tableau

Executive story, dashboards, KPI tracking, and sharing.

pandas

Feature engineering, time windows, joins, and I/O.

NumPy

Vectorized math and numerical helpers.

Scope & Limitations

Scope
  • End-to-end pipeline data → segments → scoring → activation.
  • Calibrated probabilities ready for per-segment thresholds.
  • Deciles for Lift/Gain and campaign focus.
  • Expected ticket (OLS) to optimize discounts and cross-sell.
Limitations
  • Sensitivity to time windows and mix of verticals/channels.
  • Risk of data drift and changes in pricing/promotions/logistics.
  • Requires integration with CRM and controlled measurement (A/B, holdout).
  • Needs re-training and seasonality monitoring.