👻 Spooky Author Identification
Who wrote this sentence? Classifying authors from a single line of text
“Let the words tell you who wrote them.”
📎 Full Analysis:
👉 View Jupyter Notebook on GitHub
📎 Kaggle Competition:
👉 Spooky Author Identification (2017 Halloween)
📌 One-Line Summary
Given a sentence, predict which of Edgar Allan Poe (EAP), H. P. Lovecraft (HPL), or Mary W. Shelley (MWS) wrote it.
With light preprocessing and BoW/TF‑IDF features + Naive Bayes, the model achieves strong accuracy and stable generalization.
1️⃣ Problem & Data
- Task: Multi-class author classification (EAP/HPL/MWS) for a single sentence.
- Metric: Multi-class Logarithmic Loss (submit probabilities for each author).
- Files
  - `train.csv`: id, text, author
  - `test.csv`: id, text
  - `sample_submission.csv`: example submission format
- Source: Public-domain fiction, sentence-split with CoreNLP’s MaxEnt tokenizer.
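The multi-class log loss metric above penalizes the model by the negative log of the probability it assigned to the true author. A minimal sketch with scikit-learn; the labels and probabilities below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical predictions for three sentences; columns ordered [EAP, HPL, MWS]
y_true = ["EAP", "HPL", "MWS"]
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
])

# Mean of -log(probability assigned to the true class)
loss = log_loss(y_true, y_prob, labels=["EAP", "HPL", "MWS"])
print(round(loss, 4))  # → 0.3635
```

Confident but wrong predictions are punished hard, which is why submitting well-spread probabilities matters more here than raw accuracy.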
2️⃣ How It Was Built
- Preprocessing
- Lowercasing, light cleanup of punctuation/numbers, minimal stopword handling.
- Feature Extraction
- Bag-of-Words / TF‑IDF vectors (suited for short sentences).
- Modeling
- Tried Random Forest, AdaBoost, SVM, Naive Bayes.
- Evaluated with 10‑fold Cross‑Validation for robustness.
- Final Choice
- Naive Bayes selected for best balance of accuracy and stability.
- Submission
- Output per-class probabilities and submit to Kaggle.
Architecture
[text] → [cleaning] → [BoW / TF‑IDF] → [Classifier] → P(EAP), P(HPL), P(MWS)
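The steps above can be sketched as a single scikit-learn pipeline. The sentences below are toy placeholders, not the competition data, and the fold count is reduced to fit them (the notebook uses 10-fold CV):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the text/author columns of train.csv
texts = [
    "the raven perched upon the bust", "a tell tale heart beat beneath",
    "the pendulum swung ever lower", "nameless horrors from beyond the stars",
    "the shoggoth lurched in darkness", "cyclopean ruins of elder things",
    "the creature gazed upon its maker", "a spark of being infused the frame",
    "my hideous progeny went forth",
]
authors = ["EAP", "EAP", "EAP", "HPL", "HPL", "HPL", "MWS", "MWS", "MWS"]

# cleaning → TF-IDF → Naive Bayes as one estimator
model = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())

# 3-fold only because the toy corpus is tiny
scores = cross_val_score(model, texts, authors, cv=3)

model.fit(texts, authors)
proba = model.predict_proba(["the black cat watched"])
print(model.classes_, proba.round(3))
```

Because the classifier and the vectorizer live in one pipeline, cross-validation refits the TF‑IDF vocabulary on each training fold, avoiding leakage from the held-out fold.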
3️⃣ Results
Model | Training Score | 10‑Fold CV (avg/total F1) | Notes |
---|---|---|---|
Random Forest | 0.4311 | 0.31 | Struggles with sparse text features |
AdaBoost | 0.6293 | 0.65 | Better than RF; still volatile |
SVM | 0.4035 | 0.23 | Overfits a single class in this setup |
Naive Bayes (Final) | 0.8329 | 0.90 | Fast, simple, strong on sparse TF‑IDF ✅ |
Kaggle Leaderboard
- LogLoss: 0.48767
- Rank: 793 / 1244 (top 63.7%)
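The submission itself is a CSV with one probability column per author. A sketch with toy data, assuming the `id, EAP, HPL, MWS` layout of `sample_submission.csv`:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for train.csv
train_texts = ["the raven spoke", "eldritch gods slumber", "my creature lives"]
train_authors = ["EAP", "HPL", "MWS"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_authors)

# Toy stand-ins for test.csv
test_ids = ["id001", "id002"]
test_texts = ["the raven returned", "the creature wept"]

# One probability column per author, in the order of model.classes_
submission = pd.DataFrame(model.predict_proba(test_texts), columns=model.classes_)
submission.insert(0, "id", test_ids)
submission.to_csv("submission.csv", index=False)
print(submission)
```

Since `classes_` is sorted alphabetically (EAP, HPL, MWS), the probability columns line up with the expected submission header without manual reordering.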
4️⃣ Why Naive Bayes?
- Excels with sparse word distributions and short texts.
- Few hyperparameters → stable generalization and fast iteration.
- Simple pipeline makes experimentation (n‑grams, char‑grams) easy.
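One such experiment: swapping the word-level vectorizer for character n-grams is a one-line change (the `ngram_range` values here are illustrative, not tuned):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# char_wb builds character n-grams only within word boundaries,
# which tends to capture authorial spelling and punctuation habits
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vec.fit_transform(["the raven", "the shoggoth"])

print(X.shape)                  # (2 documents, n character n-gram features)
print("th" in vec.vocabulary_)  # shared bigram appears in the vocabulary
```

Character features drop into the same pipeline unchanged, so comparing them against word-level TF‑IDF is just another cross-validation run.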
5️⃣ Real‑World Use
- Stylometry (author identification), plagiarism detection, style recommendation.
- Brand/author tone-of-voice classification and content routing.
🛠 Technologies Used
Step | Technology |
---|---|
Data / EDA | Python, Pandas, NumPy |
NLP Features | scikit-learn (Count / TF‑IDF) |
Models | scikit-learn (NB, SVM, RF, AdaBoost) |
Environment | Jupyter Notebook |
💡 Key Learnings
- TF‑IDF + Naive Bayes remains a strong baseline for short-text classification.
- Improving leaderboard LogLoss benefits from probability calibration and n‑gram / char‑gram features.
- For imbalanced, sparse text, simpler models can beat complex ensembles.
🔗 GitHub Repository
All articles on this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.