About Project

Olist — Market Segmentation & Repurchase Prediction

SQL architecture to consolidate facts/dims → segmentation with EFA + clustering → calibrated repurchase models to drive targeted campaigns by probability × ticket.

1) The Challenge

Multi-product marketplace with heterogeneous customers. We need to identify segments and predict repurchase to focus campaigns and offers, increasing frequency and ticket while reducing churn.

2) Approach

  1. SQL views: vw_customer_features (R/F/Monetary, logistics, reviews, categories).
  2. EFA: KMO/Bartlett → latent drivers; noise reduction.
  3. Clustering: K-Means / PAM / CLARA / hierarchical; silhouette & stability.
  4. Supervised: pruned tree, RF, GB, MLP; calibration (Brier) and ROC/AP.
  5. Business: Lift & Gain by deciles; per-segment thresholds; export to CRM.
ROC AUC
0.89
+0.03 vs baseline
Average Precision
0.96
Top deciles
Brier (↓ better)
0.11
Calibrated
Segments
4
Actionable

3) Findings

Lift & Gain

Concentration in top deciles

The top 20% capture most expected repurchases → prioritize contacts.

Calibration

Probabilities you can trust

Curve close to the perfect diagonal; per-segment thresholds maximize ROI.

Drivers

Recency & logistics stand out

Recency, logistics rating, and post-sale experience stand out; ticket is modeled separately.

Segments

Clear actions per segment

Messaging, discounts, and cross-sell vary by profile and probability × ticket.

4) Next Step

  1. Activation: export deciles (prob × ticket) to CRM; choose channel per segment.
  2. Experimentation: A/B with segment-level holdouts; measure uplift, CAC, and ROI.
  3. Orchestration: weekly re-training and scoring job; monitor drift.
  4. Governance: features catalog, model versioning, and experiment logbook.

Analytical Modules

Exploratory

Unsupervised — EFA + Clustering

We discover segments and their drivers, and define actions per profile.

  • EFA (KMO/Bartlett) → latent factors
  • K-Means / PAM / CLARA / hierarchical
  • Silhouette, stability, and naming
Modeling

Supervised — Repurchase & Ticket

Calibrated repurchase probability and expected ticket to prioritize campaigns.

  • Pruned tree, RF, GB, MLP
  • Calibration (Brier) & ROC/AP
  • OLS for avg_ticket
BI

Executive Story — Tableau

Executive sequence that integrates dashboards and key findings.

  • Guided navigation
  • KPIs and evidence
  • Shareable with stakeholders

Tech Stack

SQL (Snowflake)

Snowflake schema (facts/dims), analytic views, and warehouse orchestration.

scikit-learn

Clustering & supervised models, calibration, validation, and evaluation.

Tableau

Executive story, dashboards, KPI tracking, and sharing.

pandas

Feature engineering, time windows, joins, and I/O.

NumPy

Vectorized math and numerical helpers.

Scope & Limitations

Scope
  • End-to-end pipeline data → segments → scoring → activation.
  • Calibrated probabilities ready for per-segment thresholds.
  • Deciles for Lift/Gain and campaign focus.
  • Expected ticket (OLS) to optimize discounts and cross-sell.
Limitations
  • Sensitivity to time windows and mix of verticals/channels.
  • Risk of data drift and changes in pricing/promotions/logistics.
  • Requires integration with CRM and controlled measurement (A/B, holdout).
  • Needs re-training and seasonality monitoring.