Product Summary
A 50,000-row synthetic telecom churn dataset built for data scientists, instructors, and learners.
Every feature mirrors real-world telecom patterns—contracts, tenure, billing, and service combinations—without exposing any personal data.
Includes a trained scikit-learn logistic regression baseline, with one-hot encoding of categorical features, that reaches 78.5% accuracy on a held-out test set.
Ideal for teaching, benchmarking, and feature-engineering practice, all under a permissive MIT license.
Who It’s For
Data scientists testing churn prediction workflows.
Analysts needing structured telecom-like data for visualization.
Instructors teaching classification without privacy concerns.
Students practicing end-to-end ML pipelines.
Founders/PMs validating churn-related product ideas.
Problems We Solve
No access to real churn datasets because of privacy restrictions.
Need for a reproducible ML baseline for churn.
Lack of clear, labeled schema for teaching.
Cold-start research and proof-of-concept experiments.
What You Get
customer_churn_data.csv — 50,000 synthetic rows, telecom-style churn data.
customer_churn_model.pkl — trained logistic regression baseline.
README.md — schema, generation, pipeline, metrics, and usage instructions.
LICENSE — MIT License (open, permissive).
customer_churn_product.zip — full product bundle.
(File sizes: Not specified in the files.)
Data Schema
Row count: 50,000
Churn label balance: No ≈ 78%, Yes ≈ 22% (based on the per-class support counts in the model's evaluation report)
Columns → Type → Meaning:
customerID → string → unique identifier
gender → category → Male/Female
SeniorCitizen → integer → 1 if senior, 0 otherwise
Partner, Dependents → category → household info
tenure → integer → months with company
PhoneService, MultipleLines → category → phone plan details
InternetService → category → DSL/Fiber/None
OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies → category → add-on services
Contract → category → month-to-month/1-yr/2-yr
PaperlessBilling → category → Yes/No
PaymentMethod → category → payment type
MonthlyCharges, TotalCharges → numeric → billing amounts
Churn → target → Yes/No
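The schema above can be applied at load time. The following is a minimal sketch, assuming pandas is installed and that the CSV header uses exactly the column names listed; the helper name load_churn_csv is hypothetical and not part of the product files.

```python
import pandas as pd

# Column groups taken from the schema above.
NUMERIC_COLS = ["tenure", "MonthlyCharges", "TotalCharges"]
ID_COL = "customerID"
FLAG_COL = "SeniorCitizen"  # stored as a 0/1 integer


def load_churn_csv(source):
    """Read the CSV and cast each column to the type the schema lists.

    `source` can be a file path or any file-like object.
    """
    df = pd.read_csv(source, dtype={ID_COL: "string"})
    # Coerce tenure and billing columns to numbers; unparseable cells,
    # if any exist, become NaN (an assumption, the files do not say).
    for col in NUMERIC_COLS:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Every remaining column (including the Churn target) is categorical.
    skip = set(NUMERIC_COLS) | {ID_COL, FLAG_COL}
    for col in df.columns.difference(list(skip)):
        df[col] = df[col].astype("category")
    return df
```

A typical call would be df = load_churn_csv("customer_churn_data.csv"), after which df["Churn"].value_counts(normalize=True) can be compared against the label balance stated above.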
How the Data Was Generated
The dataset was synthetically created using programmatic logic:
Feature distributions matched telecom industry patterns (e.g., 15% seniors, more short-term contracts).
Churn was simulated using a logistic function combining contract type, tenure, senior status, payment method, and monthly charges.
Random noise was added for realism.
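The generation logic described above can be illustrated with a short sketch. It is not the actual generator: every weight, value range, and category label below is a hypothetical placeholder chosen for illustration, and only the 15% senior share comes from the description.

```python
import math
import random

# Hypothetical weights; the product's real coefficients are not published.
WEIGHTS = {
    "intercept": -1.5,
    "month_to_month": 1.2,    # short contracts raise churn risk
    "tenure": -0.03,          # per month with the company
    "senior": 0.4,
    "electronic_check": 0.5,  # assumed payment-method effect
    "monthly_charges": 0.01,  # per currency unit
}


def churn_probability(month_to_month, tenure, senior,
                      electronic_check, monthly_charges, noise=0.0):
    """Logistic function of a weighted feature sum plus optional noise."""
    z = (
        WEIGHTS["intercept"]
        + WEIGHTS["month_to_month"] * month_to_month
        + WEIGHTS["tenure"] * tenure
        + WEIGHTS["senior"] * senior
        + WEIGHTS["electronic_check"] * electronic_check
        + WEIGHTS["monthly_charges"] * monthly_charges
        + noise
    )
    return 1.0 / (1.0 + math.exp(-z))


def simulate_customer(rng):
    """Draw one synthetic customer and a churn label."""
    senior = int(rng.random() < 0.15)            # 15% seniors, as stated
    month_to_month = int(rng.random() < 0.55)    # assumed share
    tenure = rng.randint(0, 72)                  # assumed range in months
    electronic_check = int(rng.random() < 0.35)  # assumed share
    monthly_charges = round(rng.uniform(20.0, 120.0), 2)  # assumed range
    p = churn_probability(month_to_month, tenure, senior,
                          electronic_check, monthly_charges,
                          noise=rng.gauss(0.0, 0.5))  # random noise term
    return {
        "SeniorCitizen": senior,
        "Contract": "Month-to-month" if month_to_month else "Two year",
        "tenure": tenure,
        "MonthlyCharges": monthly_charges,
        "Churn": "Yes" if rng.random() < p else "No",
    }
```

Calling simulate_customer(random.Random(0)) repeatedly yields rows whose churn rate depends on the placeholder weights, so it will not reproduce the dataset's 22% churn share without tuning.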
ML Baseline Model
Algorithm: Logistic Regression
Preprocessing: One-hot encoding of categorical features; numeric passthrough.
Split: 80% training / 20% testing
Accuracy: 78.51%
Precision/Recall/F1:
No → 0.79 / 0.99 / 0.88
Yes → 0.62 / 0.06 / 0.11
Weighted F1 → 0.71
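The F1 figures above can be re-derived from the reported precision and recall. The sketch below does that arithmetic, assuming the approximate label balance (0.78 / 0.22) as class weights, since the exact test-set support counts are not given.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above.
f1_no = f1(0.79, 0.99)
f1_yes = f1(0.62, 0.06)

# Assumed class shares, taken from the approximate label balance.
share_no, share_yes = 0.78, 0.22
weighted_f1 = share_no * f1_no + share_yes * f1_yes

# Accuracy of a trivial model that always predicts the larger class.
majority_accuracy = max(share_no, share_yes)

print(round(f1_no, 2), round(f1_yes, 2), round(weighted_f1, 2))
print(majority_accuracy)
```

Under that assumed balance, always predicting "No" would score about 0.78, which is a useful reference point when benchmarking other algorithms against the 78.51% baseline and its 0.06 recall on the "Yes" class.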
Quickstart:
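Install dependencies: pip install pandas scikit-learn joblib
Load the model: import joblib; model = joblib.load('customer_churn_model.pkl')
Prepare the input as a pandas DataFrame whose column names match those the model was trained on (see Data Schema).
Predict: prediction = model.predict(df)
Evaluate the predictions, or extend the pipeline with engineered features.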
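For retraining or benchmarking, a comparable baseline can be rebuilt from the CSV. The sketch below follows the stated recipe (one-hot encoding, numeric passthrough, 80/20 split, logistic regression), but it is an approximation rather than the shipped model: the random seed, stratified split, max_iter value, and the decision to drop customerID are assumptions, and the function names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

NUMERIC_COLS = ["tenure", "MonthlyCharges", "TotalCharges"]


def build_baseline(categorical_cols, numeric_cols):
    """One-hot encode categorical columns, pass numeric ones through."""
    preprocess = ColumnTransformer(
        transformers=[
            ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
            ("numeric", "passthrough", numeric_cols),
        ]
    )
    return Pipeline([
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),  # assumed iteration cap
    ])


def train_and_score(df, target="Churn", id_col="customerID", seed=42):
    """Fit on 80% of the rows and return (model, test accuracy)."""
    X = df.drop(columns=[target, id_col])  # assumption: ID is not a feature
    y = df[target]
    numeric_cols = [c for c in NUMERIC_COLS if c in X.columns]
    categorical_cols = [c for c in X.columns if c not in numeric_cols]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = build_baseline(categorical_cols, numeric_cols)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```

Usage would look like model, acc = train_and_score(pd.read_csv("customer_churn_data.csv")). Because the split and solver settings are assumed, the resulting accuracy may differ slightly from the reported 78.51%.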
Perfect For
Machine learning courses and demos
Feature engineering practice
Model benchmarking across algorithms
Building ML pipelines from clean data
Designing churn dashboards or reports
Not For
Real customer data or production inference
Any PII-based or personalized targeting
Representing a specific telecom provider
Licensing & Usage Rights
License: MIT License
You may use, modify, distribute, and include this dataset and model in commercial or educational work, provided the copyright notice and license text are retained in copies.
Pricing & What’s Included
Price: US$49.99
Includes:
Full 50k-row CSV dataset
Trained logistic regression model
README (schema + usage)
MIT License
Product ZIP bundle
FAQs
Is PII included?
No, it’s fully synthetic.
How many rows?
50,000.
How is churn defined?
A “Yes” means the customer left; “No” means they stayed.
Which features are categorical vs numeric?
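tenure, MonthlyCharges, and TotalCharges are numeric. SeniorCitizen is a 0/1 integer flag, customerID is a string identifier, and all remaining columns are categorical.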
How do I load the model?
Use joblib: model = joblib.load('customer_churn_model.pkl').
What accuracy does the model achieve?
78.51% on the test set.
Can I retrain or modify it?
Yes, permitted by MIT License.
Is there version tracking?
Not specified in the files.
Does the model generalize?
Not specified in the files.
Any missing data handling steps?
Not specified in the files.
Ethics & Disclaimers
All data are synthetic — no real individuals, records, or private information.
Intended solely for learning, demonstration, and non-sensitive analytics.



