About Project

Olist — Market Segmentation & Repurchase Prediction

SQL architecture to consolidate facts/dims → segmentation with EFA + clustering → calibrated repurchase models to drive targeted campaigns by probability × ticket.

High-Value Customer Strategy Download PDF

1) The Challenge

Multi-product marketplace with heterogeneous customers. We need to identify segments and predict repurchase to focus campaigns and offers, increasing frequency and ticket while reducing churn.

Data: unify events in a snowflake schema and expose analytic views.
Segmentation: understand drivers (recency, logistics, reviews, ticket).
Activation: calibrated probabilities & thresholds by segment/decile.

2) Approach

SQL views: vw_customer_features (R/F/Monetary, logistics, reviews, categories).
EFA: KMO/Bartlett → latent drivers; noise reduction.
Clustering: K-Means / PAM / CLARA / hierarchical; silhouette & stability.
Supervised: pruned tree, RF, GB, MLP; calibration (Brier) and ROC/AP.
Business: Lift & Gain by deciles; per-segment thresholds; export to CRM.

ROC AUC

0.89

+0.03 vs baseline

Average Precision

0.96

Top deciles

Brier (↓ better)

0.11

Calibrated

Segments

Actionable

3) Findings

Lift & Gain

Concentration in top deciles

The top 20% capture most expected repurchases → prioritize contacts.

Calibration

Probabilities you can trust

Curve close to the perfect diagonal; per-segment thresholds maximize ROI.

Drivers

Recency & logistics stand out

Recency, logistics rating, and post-sale experience stand out; ticket is modeled separately.

Segments

Clear actions per segment

Messaging, discounts, and cross-sell vary by profile and probability × ticket.

4) Next Step

Activation: export deciles (prob × ticket) to CRM; choose channel per segment.
Experimentation: A/B with segment-level holdouts; measure uplift, CAC, and ROI.
Orchestration: weekly re-training and scoring job; monitor drift.
Governance: features catalog, model versioning, and experiment logbook.

Analytical Modules

Exploratory

Unsupervised — EFA + Clustering

We discover segments and their drivers, and define actions per profile.

EFA (KMO/Bartlett) → latent factors
K-Means / PAM / CLARA / hierarchical
Silhouette, stability, and naming

Open

Modeling

Supervised — Repurchase & Ticket

Calibrated repurchase probability and expected ticket to prioritize campaigns.

Pruned tree, RF, GB, MLP
Calibration (Brier) & ROC/AP
OLS for avg_ticket

Open

Executive Story — Tableau

Executive sequence that integrates dashboards and key findings.

Guided navigation
KPIs and evidence
Shareable with stakeholders

Open

Tech Stack

SQL (Snowflake)

Snowflake schema (facts/dims), analytic views, and warehouse orchestration.

scikit-learn

Clustering & supervised models, calibration, validation, and evaluation.

Tableau

Executive story, dashboards, KPI tracking, and sharing.

pandas

Feature engineering, time windows, joins, and I/O.

NumPy

Vectorized math and numerical helpers.

Scope & Limitations

Scope

End-to-end pipeline data → segments → scoring → activation.
Calibrated probabilities ready for per-segment thresholds.
Deciles for Lift/Gain and campaign focus.
Expected ticket (OLS) to optimize discounts and cross-sell.

Limitations

Sensitivity to time windows and mix of verticals/channels.
Risk of data drift and changes in pricing/promotions/logistics.
Requires integration with CRM and controlled measurement (A/B, holdout).
Needs re-training and seasonality monitoring.