Twitter Classification & Diamond Price Prediction
Overview
This project covers two separate machine learning tasks. The first is Twitter text classification: a Random Forest trained on preprocessed Twitter data reaches ~87% accuracy across 3 classes. The second is diamond price prediction: regression models fit on a dataset of 53,940 diamonds. The Twitter classifier builds on earlier preprocessing work, while the diamond task applies regression to structured data with mixed feature types (numeric and ordinal). The project also covers feature importance analysis, separate handling of each feature type, and data exploration.
Key Features
Random Forest classification for Twitter data (~87% accuracy)
Multi-class classification with 3 classes
Feature importance analysis (top 20 features)
Diamond price prediction regression
Mixed data type handling (numeric and ordinal)
Data exploration and visualization
Structured preprocessing pipelines
Train/test splitting with optional validation
Comprehensive evaluation metrics
Model interpretability insights
Technical Highlights
Implemented Random Forest for Twitter classification with ~87% accuracy
Built regression pipeline for diamond price prediction on 53,940 samples
Analyzed feature importance for model interpretability
Handled mixed data types with separate preprocessing pipelines
Performed comprehensive data exploration and visualization
Demonstrated dual-task ML workflow (classification and regression)
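The classification half of the workflow might look roughly like the sketch below. The tweets, the label names, the TF-IDF vectorizer, and the hyperparameters are all assumptions for illustration; the project states only that a Random Forest was trained on preprocessed Twitter data with 3 classes.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the preprocessed tweets: three labels, uneven counts.
texts = (
    ["great phone love it"] * 12
    + ["terrible service very bad"] * 8
    + ["package arrived today"] * 6
)
labels = ["positive"] * 12 + ["negative"] * 8 + ["neutral"] * 6

# Stratify so each split keeps roughly the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Vectorizer choice is an assumption; the project does not name one.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f}")
print(classification_report(y_test, y_pred, zero_division=0))
```

The per-class report is worth reading alongside the overall accuracy, since a single accuracy figure can hide weak performance on the smallest class.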
Challenges and Solutions
Class Imbalance
Random Forest handled the uneven distribution across the 3 classes effectively
Mixed Data Types
Created separate preprocessing pipelines for numeric and ordinal features using ColumnTransformer
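A minimal sketch of that arrangement follows. The column names and grade orderings are those of the public diamonds dataset; the `StandardScaler` step and the three sample rows are assumptions added for illustration, not details confirmed by the project.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

numeric_features = ["carat", "depth", "table", "x", "y", "z"]
ordinal_features = ["cut", "color", "clarity"]

# Worst-to-best orderings of the quality grades, one list per ordinal column.
ordinal_categories = [
    ["Fair", "Good", "Very Good", "Premium", "Ideal"],
    ["J", "I", "H", "G", "F", "E", "D"],
    ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"],
]

# One pipeline per feature type (the scaler is an assumed choice).
numeric_pipeline = Pipeline([("scale", StandardScaler())])
ordinal_pipeline = Pipeline([
    ("encode", OrdinalEncoder(categories=ordinal_categories)),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("ord", ordinal_pipeline, ordinal_features),
])

# Illustrative rows only, not taken from the project's data files.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "depth": [61.5, 59.8, 56.9],
    "table": [55.0, 61.0, 65.0],
    "x": [3.95, 3.89, 4.05],
    "y": [3.98, 3.84, 4.07],
    "z": [2.43, 2.31, 2.31],
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "E"],
    "clarity": ["SI2", "SI1", "VS1"],
})

features = preprocessor.fit_transform(sample)
print(features.shape)  # (3, 9): six scaled numeric columns, three encoded ordinal columns
```

Keeping the two pipelines separate means a change to numeric scaling never touches the ordinal encoding, and the whole preprocessor can be placed in front of any regressor.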
Feature Importance Compatibility
Documented and handled scikit-learn version compatibility for feature importance attributes
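The project does not say which attribute differed between versions. One plausible case, assumed here, is the vectorizer method that supplies feature names: `get_feature_names_out` exists from scikit-learn 1.0, while the older `get_feature_names` was removed in 1.2. The sketch below uses hypothetical helper names (`feature_names`, `top_features`) and toy documents to rank the top 20 features either way.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer


def feature_names(vectorizer):
    """Return fitted feature names on both newer and older scikit-learn."""
    if hasattr(vectorizer, "get_feature_names_out"):  # scikit-learn >= 1.0
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())  # older releases


def top_features(model, names, k=20):
    """Return up to k (name, importance) pairs, most important first."""
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1][:k]
    return [(names[i], float(importances[i])) for i in order]


# Toy documents standing in for the preprocessed tweets.
docs = [
    "good great happy", "great happy love",
    "bad awful sad", "awful sad angry",
    "meeting today noon", "today noon schedule",
]
labels = ["pos", "pos", "neg", "neg", "neu", "neu"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)

ranking = top_features(model, feature_names(vectorizer), k=20)
for name, score in ranking[:5]:
    print(f"{name:10s} {score:.3f}")
```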
Price Distribution
Used data visualization and appropriate regression techniques for skewed price distribution
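The project does not name the specific technique used for the skewed target. One common option, shown here purely as an assumption, is to fit the regressor on log-transformed prices and convert predictions back to the original scale. The data below is synthetic and the linear model is a placeholder.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic right-skewed prices: exponential in carat, with multiplicative noise.
carat = rng.uniform(0.2, 2.5, size=200)
price = np.exp(6.0 + 1.5 * carat + rng.normal(0.0, 0.1, size=200))
print(f"mean {price.mean():.0f} vs median {np.median(price):.0f}")  # mean above median: right skew

X = carat.reshape(-1, 1)

# Fit on log1p(price); predictions are mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, price)
predictions = model.predict(X)

# Slope learned in log space; close to the 1.5 used to generate the data.
slope = float(np.ravel(model.regressor_.coef_)[0])
print(f"log-space slope: {slope:.2f}")
```

Wrapping the transform in `TransformedTargetRegressor` keeps predictions in price units, so evaluation metrics stay comparable with models trained on the raw target.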
Ordinal Feature Encoding
Applied OrdinalEncoder to preserve ordinal relationships in cut, color, and clarity features
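The ordering is only preserved when it is passed in explicitly; by default `OrdinalEncoder` sorts string categories alphabetically. The short comparison below uses the `cut` grades of the diamonds dataset (worst to best). The sample values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Quality order of the cut grades, worst to best.
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cuts = np.array(
    [["Ideal"], ["Fair"], ["Premium"], ["Good"], ["Very Good"]], dtype=object
)

# Default behaviour: categories sorted alphabetically, so "Ideal" lands mid-scale.
alphabetical = OrdinalEncoder().fit_transform(cuts).ravel()

# Explicit categories: codes follow the quality order.
ordered = OrdinalEncoder(categories=[cut_order]).fit_transform(cuts).ravel()

print(alphabetical.tolist())  # [2.0, 0.0, 3.0, 1.0, 4.0]
print(ordered.tolist())       # [4.0, 0.0, 3.0, 1.0, 2.0]
```

With the explicit list, a higher code always means a better grade, which is the relationship a regression model can use.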
Dimension Analysis
Explored relationships between dimensions (x, y, z) and price using scatter plots and histograms
Technologies
ML Models
Preprocessing
Pipeline
Analysis
Data
Environment
Project Information
- Status: Completed
- Year: 2024
- Architecture: Dual-Task ML Workflow with Classification and Regression Pipelines
- Category: Data Science