💎

Twitter Classification & Diamond Price Prediction

Completed · 2024 · Dual-Task ML Workflow with Classification and Regression Pipelines

This project covers two separate machine learning tasks: (1) three-class Twitter text classification with a Random Forest trained on preprocessed tweets, reaching about 87% accuracy, and (2) diamond price prediction with regression models on a dataset of 53,940 diamonds. The Twitter classifier builds on earlier preprocessing work, while the diamond task applies regression to structured data with mixed feature types (numeric and ordinal). The project also covers feature importance analysis, separate preprocessing for each feature type, and exploratory data analysis.

Data Science, Machine Learning, Python Development, Text Classification, Regression, Feature Engineering, Model Evaluation


Key Features

Random Forest classification for Twitter data (~87% accuracy)

Multi-class classification with 3 classes

Feature importance analysis (top 20 features)

Diamond price prediction regression

Mixed data type handling (numeric and ordinal)

Data exploration and visualization

Structured preprocessing pipelines

Train/test splitting with optional validation

Comprehensive evaluation metrics

Model interpretability insights
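The "train/test splitting with optional validation" feature listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name split_data, its parameters, and the toy arrays are all hypothetical, and stratified splitting is an assumption that only makes sense for the classification task.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_data(X, y, test_size=0.2, val_size=None, seed=42):
    """Hypothetical helper: train/test split plus an optional validation set.

    val_size is a fraction of the full dataset; None skips the validation split.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    if val_size is None:
        return (X_train, y_train), None, (X_test, y_test)
    # Rescale so val_size stays a fraction of the original data,
    # not of the already-reduced training portion.
    relative_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=relative_val, stratify=y_train, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Toy data with 3 unevenly sized classes (made up for this sketch).
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1, 2, 2, 1] * 20)

train, val, test = split_data(X, y, test_size=0.2, val_size=0.1)
train_only, no_val, test_only = split_data(X, y, test_size=0.2)
```

Passing val_size=None returns None in the validation slot, so calling code can use the same unpacking in both cases.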

Technical Highlights

Implemented Random Forest for Twitter classification with ~87% accuracy

Built regression pipeline for diamond price prediction on 53,940 samples

Analyzed feature importance for model interpretability

Handled mixed data types with separate preprocessing pipelines

Performed comprehensive data exploration and visualization

Demonstrated dual-task ML workflow (classification and regression)

Challenges and Solutions

Class Imbalance

Random Forest handled uneven class distribution across 3 classes effectively

Mixed Data Types

Created separate preprocessing pipelines for numeric and ordinal features using ColumnTransformer
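A minimal sketch of that arrangement, assuming MinMaxScaler for the numeric columns and OrdinalEncoder for the ordinal ones (both appear in the technology list). The column subset, the sample rows, the step names, and the choice of LinearRegression as the final estimator are illustrative assumptions; the source does not name the regression model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Illustrative rows only (made up for this sketch, not taken from the project data).
df = pd.DataFrame({
    "carat": [0.23, 0.31, 0.70, 1.01, 1.50, 0.90],
    "depth": [61.5, 63.3, 62.1, 60.8, 61.9, 62.4],
    "cut": ["Ideal", "Good", "Premium", "Fair", "Very Good", "Ideal"],
    "price": [330, 340, 2760, 4500, 9000, 3500],
})

numeric_features = ["carat", "depth"]
ordinal_features = ["cut"]
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]  # assumed worst-to-best

# One sub-pipeline per feature type; ColumnTransformer routes columns to each.
preprocess = ColumnTransformer([
    ("num", Pipeline([("scale", MinMaxScaler())]), numeric_features),
    ("ord", Pipeline([("encode", OrdinalEncoder(categories=[cut_order]))]), ordinal_features),
])

# LinearRegression is a stand-in; any regressor could take its place.
model = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])

features = numeric_features + ordinal_features
model.fit(df[features], df["price"])

transformed = model.named_steps["preprocess"].transform(df[features])
predictions = model.predict(df[features])
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted only on the data passed to fit, which avoids leaking test-set statistics into training.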

Feature Importance Compatibility

Documented and handled scikit-learn version compatibility for feature importance attributes
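The source does not say which attribute differed between versions. One well-known case is the vectorizer method get_feature_names, which was replaced by get_feature_names_out in scikit-learn 1.0 and later removed. The sketch below assumes that case and assumes a CountVectorizer feeds the Random Forest; the helper names, the toy texts, and the generic labels 0 to 2 are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer


def feature_names(vectorizer):
    """Return vocabulary terms in column order across scikit-learn versions."""
    if hasattr(vectorizer, "get_feature_names_out"):  # scikit-learn >= 1.0
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())  # older releases


def top_features(model, names, k=20):
    """Pair each of the k largest feature importances with its feature name."""
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1][:k]
    return [(names[i], float(importances[i])) for i in order]


# Toy corpus with three generic classes (made up for this sketch).
texts = [
    "great phone love the camera",
    "love this great update",
    "terrible battery hate it",
    "hate the terrible support",
    "meeting moved to monday",
    "the report is due monday",
]
labels = [0, 0, 1, 1, 2, 2]

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)

names = feature_names(vec)
top = top_features(clf, names, k=20)
for name, importance in top[:5]:
    print(f"{name}: {importance:.3f}")
```

When the vocabulary holds fewer than k terms, as in this toy corpus, top_features simply returns all of them.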

Price Distribution

Used data visualization and appropriate regression techniques for skewed price distribution
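The source does not name the regression technique used for the skewed target. One common option, shown here purely as an assumption, is to fit the model on log-transformed prices and map predictions back to the original scale with TransformedTargetRegressor. The synthetic carat and price values below are generated for illustration and are not the project's data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic right-skewed prices (made up for this sketch).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, size=200)
price = np.exp(6.0 + 1.5 * carat + rng.normal(0.0, 0.1, size=200))

# A mean well above the median is a quick sign of right skew.
print(f"mean={price.mean():.0f}, median={np.median(price):.0f}")

# Fit on log1p(price); predictions are mapped back with expm1 automatically.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
X = carat.reshape(-1, 1)
model.fit(X, price)
predicted = model.predict(X)  # already on the original price scale
```

Because the inverse transform is applied inside predict, evaluation metrics can be computed directly against the untransformed prices.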

Ordinal Feature Encoding

Applied OrdinalEncoder to preserve ordinal relationships in cut, color, and clarity features
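A short sketch of why an explicit category order matters. By default OrdinalEncoder sorts categories alphabetically, which does not match quality order. The orderings below follow the conventional worst-to-best grading for the widely used diamonds dataset; whether the project used exactly these lists is an assumption, and the three sample rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed worst-to-best orderings (not confirmed by the source).
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
color_order = ["J", "I", "H", "G", "F", "E", "D"]
clarity_order = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

df = pd.DataFrame({
    "cut": ["Ideal", "Fair", "Premium"],
    "color": ["D", "J", "G"],
    "clarity": ["IF", "I1", "VS2"],
})

# Explicit categories: larger codes mean better quality in every column.
encoder = OrdinalEncoder(categories=[cut_order, color_order, clarity_order])
encoded = encoder.fit_transform(df)

# Default behaviour sorts alphabetically: Fair=0, Ideal=1, Premium=2,
# which would rank Premium above Ideal.
default_cut = OrdinalEncoder().fit_transform(df[["cut"]])

print(encoded)
print(default_cut.ravel())
```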

Dimension Analysis

Explored relationships between dimensions (x, y, z) and price using scatter plots and histograms
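The kind of exploration described here can be sketched with Matplotlib as below. The data is synthetic and generated only so the example runs on its own; the grid layout, the bin count, and the output filename are illustrative choices, not the project's.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic dimensions and prices (made up for this sketch).
rng = np.random.default_rng(0)
x = rng.uniform(4.0, 9.0, size=300)
df = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(0.0, 0.05, size=300),
    "z": 0.62 * x + rng.normal(0.0, 0.05, size=300),
})
df["price"] = 15.0 * df["x"] ** 3 + rng.normal(0.0, 200.0, size=300)

# Top row: each dimension against price. Bottom row: distribution of each dimension.
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for col, dim in enumerate(["x", "y", "z"]):
    axes[0, col].scatter(df[dim], df["price"], s=5, alpha=0.4)
    axes[0, col].set_xlabel(f"{dim} (mm)")
    axes[0, col].set_ylabel("price")
    axes[1, col].hist(df[dim], bins=30)
    axes[1, col].set_xlabel(f"{dim} (mm)")
    axes[1, col].set_ylabel("count")

fig.tight_layout()
fig.savefig("dimension_analysis.png")  # hypothetical output filename
```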

Technologies

ML Models

RandomForestClassifier, Regression Models

Preprocessing

MinMaxScaler, OrdinalEncoder, Custom Transformers

Pipeline

Pipeline, ColumnTransformer

Analysis

Feature Importance, Data Visualization

Data

Pandas, NumPy, Matplotlib

Environment

Python, Jupyter Notebook, Joblib

Project Information

Status: Completed
Year: 2024
Architecture: Dual-Task ML Workflow with Classification and Regression Pipelines
Category: Data Science