👻 Spooky Author Identification
Who wrote this sentence? Classifying authors from a single line of text
“Let the words tell you who wrote them.”
📎 Full Analysis:
👉 View Jupyter Notebook on GitHub
📎 Kaggle Competition:
👉 Spooky Author Identification (2017 Halloween)
📌 One-Line Summary
Given a sentence, predict which of Edgar Allan Poe (EAP), H. P. Lovecraft (HPL), or Mary W. Shelley (MWS) wrote it.
With light preprocessing and BoW/TF‑IDF features + Naive Bayes, the model achieves strong accuracy and stable generalization.
1️⃣ Problem & Data
- Task: Multi-class author classification (EAP/HPL/MWS) for a single sentence.
- Metric: Multi-class Logarithmic Loss (submit probabilities for each author).
- Files
  - `train.csv`: id, text, author
  - `test.csv`: id, text
  - `sample_submission.csv`: example submission format
- Source: Public-domain fiction, sentence-split with CoreNLP’s MaxEnt tokenizer.
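The multi-class log loss metric above penalizes the model by the negative log of the probability it assigned to the true author. A minimal sketch with scikit-learn; the labels and probabilities below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical predictions for three sentences; columns ordered [EAP, HPL, MWS]
y_true = ["EAP", "HPL", "MWS"]
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
])

# Mean of -log(probability assigned to the true class)
loss = log_loss(y_true, y_prob, labels=["EAP", "HPL", "MWS"])
print(round(loss, 4))  # → 0.3635
```

Confident but wrong predictions are punished hard, which is why submitting well-spread probabilities matters more here than raw accuracy.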
2️⃣ How It Was Built
- Preprocessing
- Lowercasing, light cleanup of punctuation/numbers, minimal stopword handling.
- Feature Extraction
- Bag-of-Words / TF‑IDF vectors (suited for short sentences).
- Modeling
- Tried Random Forest, AdaBoost, SVM, Naive Bayes.
- Evaluated with 10‑fold Cross‑Validation for robustness.
- Final Choice
- Naive Bayes selected for best balance of accuracy and stability.
- Submission
- Output per-class probabilities and submit to Kaggle.
Architecture
[text] → [cleaning] → [BoW / TF‑IDF] → [Classifier] → P(EAP), P(HPL), P(MWS)
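The steps above can be sketched as a single scikit-learn pipeline. The sentences below are toy placeholders, not the competition data, and the fold count is reduced to fit them (the notebook uses 10-fold CV):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the text/author columns of train.csv
texts = [
    "the raven perched upon the bust", "a tell tale heart beat beneath",
    "the pendulum swung ever lower", "nameless horrors from beyond the stars",
    "the shoggoth lurched in darkness", "cyclopean ruins of elder things",
    "the creature gazed upon its maker", "a spark of being infused the frame",
    "my hideous progeny went forth",
]
authors = ["EAP", "EAP", "EAP", "HPL", "HPL", "HPL", "MWS", "MWS", "MWS"]

# cleaning → TF-IDF → Naive Bayes as one estimator
model = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())

# 3-fold only because the toy corpus is tiny
scores = cross_val_score(model, texts, authors, cv=3)

model.fit(texts, authors)
proba = model.predict_proba(["the black cat watched"])
print(model.classes_, proba.round(3))
```

Because the classifier and the vectorizer live in one pipeline, cross-validation refits the TF‑IDF vocabulary on each training fold, avoiding leakage from the held-out fold.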
3️⃣ Results
Model | Training Score | 10‑Fold CV (avg/total F1) | Notes |
---|---|---|---|
Random Forest | 0.4311 | 0.31 | Struggles with sparse text features |
AdaBoost | 0.6293 | 0.65 | Better than RF; still volatile |
SVM | 0.4035 | 0.23 | Overfits a single class in this setup |
Naive Bayes (Final) | 0.8329 | 0.90 | Fast, simple, strong on sparse TF‑IDF ✅ |
Kaggle Leaderboard
- LogLoss: 0.48767
- Rank: 793 / 1244 (top 63.7%)
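The submission itself is a CSV with one probability column per author. A sketch with toy data, assuming the `id, EAP, HPL, MWS` layout of `sample_submission.csv`:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for train.csv
train_texts = ["the raven spoke", "eldritch gods slumber", "my creature lives"]
train_authors = ["EAP", "HPL", "MWS"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_authors)

# Toy stand-ins for test.csv
test_ids = ["id001", "id002"]
test_texts = ["the raven returned", "the creature wept"]

# One probability column per author, in the order of model.classes_
submission = pd.DataFrame(model.predict_proba(test_texts), columns=model.classes_)
submission.insert(0, "id", test_ids)
submission.to_csv("submission.csv", index=False)
print(submission)
```

Since `classes_` is sorted alphabetically (EAP, HPL, MWS), the probability columns line up with the expected submission header without manual reordering.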
4️⃣ Why Naive Bayes?
- Excels with sparse word distributions and short texts.
- Few hyperparameters → stable generalization and fast iteration.
- Simple pipeline makes experimentation (n‑grams, char‑grams) easy.
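One such experiment: swapping the word-level vectorizer for character n-grams is a one-line change (the `ngram_range` values here are illustrative, not tuned):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# char_wb builds character n-grams only within word boundaries,
# which tends to capture authorial spelling and punctuation habits
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vec.fit_transform(["the raven", "the shoggoth"])

print(X.shape)                  # (2 documents, n character n-gram features)
print("th" in vec.vocabulary_)  # shared bigram appears in the vocabulary
```

Character features drop into the same pipeline unchanged, so comparing them against word-level TF‑IDF is just another cross-validation run.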
5️⃣ Real‑World Use
- Stylometry (author identification), plagiarism detection, style recommendation.
- Brand/author tone-of-voice classification and content routing.
🛠 Technologies Used
Step | Technology |
---|---|
Data / EDA | Python, Pandas, NumPy |
NLP Features | scikit-learn (Count / TF‑IDF) |
Models | scikit-learn (NB, SVM, RF, AdaBoost) |
Environment | Jupyter Notebook |
💡 Key Learnings
- TF‑IDF + Naive Bayes remains a strong baseline for short-text classification.
- Improving leaderboard LogLoss benefits from probability calibration and n‑gram / char‑gram features.
- For imbalanced, sparse text, simpler models can beat complex ensembles.
🔗 GitHub Repository
All articles on this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.