
Product Summary

A 50,000-row synthetic telecom churn dataset built for data scientists, instructors, and learners.

 

Every feature mirrors real-world telecom patterns—contracts, tenure, billing, and service combinations—without exposing any personal data.
Includes a trained scikit-learn logistic regression baseline with one-hot-encoded categorical features, scoring 78.5% accuracy on a held-out test split.


Ideal for teaching, benchmarking, and feature-engineering practice, all under a permissive MIT license.

 

Who It’s For

  • Data scientists testing churn prediction workflows.

  • Analysts needing structured telecom-like data for visualization.

  • Instructors teaching classification without privacy concerns.

  • Students practicing end-to-end ML pipelines.

  • Founders/PMs validating churn-related product ideas.

 

Problems We Solve

  • No access to real churn datasets because of privacy restrictions.

  • Need for a reproducible ML baseline for churn.

  • Lack of clear, labeled schema for teaching.

  • Cold-start research and proof-of-concept experiments.

 

What You Get

  • customer_churn_data.csv — 50,000 synthetic rows, telecom-style churn data.

  • customer_churn_model.pkl — trained logistic regression baseline.

  • README.md — schema, generation, pipeline, metrics, and usage instructions.

  • LICENSE — MIT License (open, permissive).

  • customer_churn_product.zip — full product bundle.
    (File sizes: Not specified in the files.)

 

Data Schema

Row count: 50,000
Churn label balance: No ≈ 78%, Yes ≈ 22% (estimated from the support counts reported with the model's metrics)

Columns → Type → Meaning:

  • customerID → string → unique identifier

  • gender → category → Male/Female

  • SeniorCitizen → integer → 1 if senior, 0 otherwise

  • Partner, Dependents → category → household info

  • tenure → integer → months with company

  • PhoneService, MultipleLines → category → phone plan details

  • InternetService → category → DSL/Fiber/None

  • OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies → category → add-on services

  • Contract → category → month-to-month/1-yr/2-yr

  • PaperlessBilling → category → Yes/No

  • PaymentMethod → category → payment type

  • MonthlyCharges, TotalCharges → numeric → billing amounts

  • Churn → target → Yes/No
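
The schema above can be checked in code once the CSV is loaded. The following is a minimal sketch with pandas, assuming the column names are spelled exactly as listed. `check_schema` is a hypothetical helper and is not part of the bundle.

```python
import pandas as pd

# Column groups as listed in the schema above.
NUMERIC_COLUMNS = ["tenure", "MonthlyCharges", "TotalCharges"]
CATEGORICAL_COLUMNS = [
    "gender", "Partner", "Dependents", "PhoneService", "MultipleLines",
    "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection",
    "TechSupport", "StreamingTV", "StreamingMovies", "Contract",
    "PaperlessBilling", "PaymentMethod",
]
EXPECTED_COLUMNS = (
    ["customerID", "SeniorCitizen"] + NUMERIC_COLUMNS + CATEGORICAL_COLUMNS + ["Churn"]
)


def check_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in NUMERIC_COLUMNS:
        if col in df.columns and not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"{col} is not numeric")
    if "SeniorCitizen" in df.columns and not set(df["SeniorCitizen"].unique()) <= {0, 1}:
        problems.append("SeniorCitizen has values other than 0/1")
    if "Churn" in df.columns and not set(df["Churn"].unique()) <= {"Yes", "No"}:
        problems.append("Churn has values other than Yes/No")
    return problems
```

Typical use would be `check_schema(pd.read_csv("customer_churn_data.csv"))`, which should return an empty list if the file matches the schema as described here.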
     

How the Data Was Generated

The dataset was synthetically created using programmatic logic:

  • Feature distributions were set to match telecom industry patterns (e.g., 15% seniors, a larger share of short-term contracts).

  • Churn was simulated using a logistic function combining contract type, tenure, senior status, payment method, and monthly charges.

  • Random noise was added for realism.
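
The generation steps above can be sketched as follows. This is an illustration only: every coefficient, the noise scale, and the category spellings ("month-to-month", "1-yr", "2-yr", "Electronic check") are assumptions made for the example, not the values used to build the dataset.

```python
import math
import random

# All numbers below are illustrative assumptions, not the dataset's real settings.
INTERCEPT = -1.5
CONTRACT_EFFECT = {"month-to-month": 1.0, "1-yr": -0.5, "2-yr": -1.5}
PAYMENT_EFFECT = {"Electronic check": 0.4}  # other payment methods contribute 0.0
TENURE_EFFECT = -0.03   # per month with the company
SENIOR_EFFECT = 0.3     # added when SeniorCitizen == 1
CHARGE_EFFECT = 0.01    # per unit of MonthlyCharges
NOISE_SD = 0.5          # standard deviation of the Gaussian noise term


def churn_probability(customer: dict, noise: float = 0.0) -> float:
    """Map customer features to a churn probability with a logistic function."""
    score = (
        INTERCEPT
        + CONTRACT_EFFECT[customer["Contract"]]
        + PAYMENT_EFFECT.get(customer["PaymentMethod"], 0.0)
        + TENURE_EFFECT * customer["tenure"]
        + SENIOR_EFFECT * customer["SeniorCitizen"]
        + CHARGE_EFFECT * customer["MonthlyCharges"]
        + noise
    )
    return 1.0 / (1.0 + math.exp(-score))


def simulate_churn(customer: dict, rng: random.Random) -> str:
    """Draw a Yes/No churn label from the noisy churn probability."""
    p = churn_probability(customer, noise=rng.gauss(0.0, NOISE_SD))
    return "Yes" if rng.random() < p else "No"
```

With these assumed weights, a senior customer on a new month-to-month contract with high monthly charges gets a much higher churn probability than a long-tenured customer on a two-year contract, which is the qualitative behavior the bullets describe.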

 

ML Baseline Model

  • Algorithm: Logistic Regression

  • Preprocessing: One-hot encoding of categorical features; numeric passthrough.

  • Split: 80% training / 20% testing

  • Accuracy: 78.51%

  • Precision/Recall/F1:

    • No → 0.79 / 0.99 / 0.88

    • Yes → 0.62 / 0.06 / 0.11

    • Weighted F1 → 0.71
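
The F1 figures above follow from the reported precision and recall, and the weighted F1 follows from the label balance. A short worked check, assuming the rounded values on this page and the approximate 78/22 class split:

```python
# Figures copied from the metrics above.
precision = {"No": 0.79, "Yes": 0.62}
recall = {"No": 0.99, "Yes": 0.06}
class_share = {"No": 0.78, "Yes": 0.22}  # approximate label balance


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


f1 = {label: f1_score(precision[label], recall[label]) for label in precision}
weighted_f1 = sum(class_share[label] * f1[label] for label in f1)

# Accuracy of a trivial model that always predicts the majority class "No".
majority_baseline = max(class_share.values())

print({label: round(value, 2) for label, value in f1.items()})  # {'No': 0.88, 'Yes': 0.11}
print(round(weighted_f1, 2))                                    # 0.71
print(majority_baseline)                                        # 0.78
```

Under these figures, always predicting "No" would already score about 78% accuracy, so when benchmarking other algorithms against this baseline the churn-class recall of 0.06 is likely the more informative number to beat.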

 

Quickstart

  • pip install pandas scikit-learn joblib

  • import joblib; model = joblib.load('customer_churn_model.pkl')

  • Prepare the input as a pandas DataFrame whose column names match the schema

  • prediction = model.predict(df)

  • Evaluate or extend for feature engineering
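
The steps above can be sketched end to end without the bundle by training a stand-in pipeline, saving it with joblib, and loading it back. This sketch assumes the pickled model is a full scikit-learn pipeline that accepts raw columns; the bundled file may instead expect already encoded input, so check README.md first. The tiny DataFrame, the category spellings, and the file name `stand_in_model.pkl` are invented for the example.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny invented frame with a subset of the schema's columns, only to make the
# sketch runnable; in practice use pd.read_csv("customer_churn_data.csv").
df = pd.DataFrame({
    "Contract": ["month-to-month", "2-yr", "month-to-month", "1-yr"] * 10,
    "PaperlessBilling": ["Yes", "No", "Yes", "No"] * 10,
    "tenure": [1, 60, 3, 24] * 10,
    "MonthlyCharges": [90.0, 40.0, 85.0, 55.0] * 10,
    "Churn": ["Yes", "No", "Yes", "No"] * 10,
})
categorical = ["Contract", "PaperlessBilling"]
features = df.drop(columns=["Churn"])
target = df["Churn"]

# One-hot encode the categorical columns; pass numeric columns through.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# 80% training / 20% testing, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=0, stratify=target
)
pipeline.fit(X_train, y_train)

# Save and reload with joblib, then predict on a DataFrame whose column
# names match the training data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stand_in_model.pkl")
    joblib.dump(pipeline, path)
    model = joblib.load(path)

prediction = model.predict(X_test)
accuracy = model.score(X_test, y_test)
```

Keeping the encoder inside the pipeline means the same DataFrame columns can be passed at training and prediction time, which is one reasonable way to extend the baseline with new engineered features.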

 

Perfect For

  • Machine learning courses and demos

  • Feature engineering practice

  • Model benchmarking across algorithms

  • Building ML pipelines from clean data

  • Designing churn dashboards or reports

 

Not For

  • Standing in for real customer data or production inference

  • Any PII-based or personalized targeting

  • Representing a specific telecom provider

 

Licensing & Usage Rights

License: MIT License
You may use, modify, and distribute this dataset and model, including in commercial or educational work, provided the copyright and license notice is kept with all copies.
 

Pricing & What’s Included

Price: US$49.99

Includes:

  • Full 50k-row CSV dataset

  • Trained logistic regression model

  • README (schema + usage)

  • MIT License

  • Product ZIP bundle

 

FAQs

Is PII included?
No, it’s fully synthetic.

 

How many rows?
50,000.

 

How is churn defined?
A “Yes” means the customer left; “No” means they stayed.

 

Which features are categorical vs numeric?
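tenure, MonthlyCharges, and TotalCharges are numeric. SeniorCitizen is stored as a 0/1 integer flag, customerID is a string identifier, and all other columns, including the Churn target, are categorical.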

 

How do I load the model?
Use joblib: model = joblib.load('customer_churn_model.pkl').

 

What accuracy does the model achieve?
78.51% on the test set.

 

Can I retrain or modify it?
Yes, the MIT License permits it.

 

Is there version tracking?
Not specified in the files.

 

Does the model generalize?
Not specified in the files.

 

Any missing data handling steps?
Not specified in the files.

 

Ethics & Disclaimers

All data are synthetic — no real individuals, records, or private information.
Intended solely for learning, demonstration, and non-sensitive analytics.

Customer Churn Dataset with ML Baseline (Synthetic Telecom Data)

$49.99 Regular Price
$0.00 Sale Price