👻 Spooky Author Identification

Who wrote this sentence? Classifying authors from a single line of text

“Let the words tell you who wrote them.”

📎 Full Analysis:
👉 View Jupyter Notebook on GitHub

📎 Kaggle Competition:
👉 Spooky Author Identification (2017 Halloween)

📌 One-Line Summary

Given a sentence, predict which of Edgar Allan Poe (EAP), H. P. Lovecraft (HPL), or Mary W. Shelley (MWS) wrote it.
With light preprocessing, BoW/TF‑IDF features, and a Naive Bayes classifier, the model reaches roughly 0.90 average F1 in 10‑fold cross‑validation and a 0.48767 log loss on the Kaggle leaderboard.


1️⃣ Problem & Data

  • Task: Multi-class author classification (EAP/HPL/MWS) for a single sentence.
  • Metric: Multi-class Logarithmic Loss (submit a probability for each author; a quick example follows this list).
  • Files
    • train.csv: id, text, author
    • test.csv: id, text
    • sample_submission.csv: example submission format
  • Source: Public-domain fiction, sentence-split with CoreNLP’s MaxEnt tokenizer.
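
For reference, the leaderboard metric can be checked with scikit-learn's log_loss; the labels and probabilities below are made up purely to illustrate the calculation.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example (made-up values): true authors and predicted probabilities
# in the submission's class order [EAP, HPL, MWS].
y_true = ["EAP", "HPL", "MWS", "EAP"]
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.3, 0.2],
])

# Multi-class log loss: -1/N * sum_i sum_j y_ij * log(p_ij)
score = log_loss(y_true, y_prob, labels=["EAP", "HPL", "MWS"])
print(f"log loss: {score:.5f}")
```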

2️⃣ How It Was Built

  1. Preprocessing
    • Lowercasing, light cleanup of punctuation/numbers, minimal stopword handling.
  2. Feature Extraction
    • Bag-of-Words / TF‑IDF vectors (suited for short sentences).
  3. Modeling
    • Tried Random Forest, AdaBoost, SVM, Naive Bayes.
    • Evaluated with 10‑fold Cross‑Validation for robustness (a minimal comparison sketch follows this list).
  4. Final Choice
    • Naive Bayes selected for best balance of accuracy and stability.
  5. Submission
    • Output per-class probabilities for each test sentence and submit to Kaggle.
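
A minimal sketch of the comparison step, assuming a train.csv with text and author columns and mostly default scikit-learn settings (LinearSVC stands in for the SVM here); the notebook's exact preprocessing and hyperparameters may differ.

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = pd.read_csv("train.csv")  # columns: id, text, author
X, y = train["text"], train["author"]

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "SVM": LinearSVC(),
    "Naive Bayes": MultinomialNB(),
}

for name, clf in candidates.items():
    # TfidfVectorizer lowercases by default; heavier cleanup is optional.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, X, y, cv=10, scoring="f1_macro")
    print(f"{name:>15}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```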

Architecture

[text] → [cleaning] → [BoW / TF‑IDF] → [Classifier] → P(EAP), P(HPL), P(MWS)
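
The same flow as code: a minimal end-to-end sketch using the assumed file names train.csv / test.csv and a default TF‑IDF + MultinomialNB pipeline, not the notebook's exact settings.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# [text] -> [cleaning] -> [TF-IDF] -> [classifier]
model = make_pipeline(
    TfidfVectorizer(lowercase=True, strip_accents="unicode"),
    MultinomialNB(),
)
model.fit(train["text"], train["author"])

# P(EAP), P(HPL), P(MWS) per test sentence, columns in model.classes_ order.
proba = model.predict_proba(test["text"])
submission = pd.DataFrame(proba, columns=model.classes_)
submission.insert(0, "id", test["id"])
submission.to_csv("submission.csv", index=False)
```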

3️⃣ Results

| Model | Training Score | 10‑Fold CV (avg/total F1) | Notes |
| --- | --- | --- | --- |
| Random Forest | 0.4311 | 0.31 | Struggles with sparse text features |
| AdaBoost | 0.6293 | 0.65 | Better than RF; still volatile |
| SVM | 0.4035 | 0.23 | Overfits a single class in this setup |
| Naive Bayes (Final) | 0.8329 | 0.90 | Fast, simple, strong on sparse TF‑IDF ✅ |

Kaggle Leaderboard

  • LogLoss: 0.48767
  • Rank: 793 / 1244 (top 63.7%)

4️⃣ Why Naive Bayes?

  • Excels with sparse word distributions and short texts.
  • Few hyperparameters → stable generalization and fast iteration.
  • A simple pipeline makes experimenting with word n‑grams and character n‑grams easy (see the sketch below).
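
For example, switching to word or character n‑gram features is a one-line change to the vectorizer; the ranges below are illustrative rather than tuned values from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Word unigrams + bigrams
word_ngrams = TfidfVectorizer(ngram_range=(1, 2))

# Character n-grams within word boundaries, often robust for stylometry
char_ngrams = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))

model = make_pipeline(char_ngrams, MultinomialNB())
```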

5️⃣ Real‑World Use

  • Stylometry (author identification), plagiarism detection, style recommendation.
  • Brand/author tone-of-voice classification and content routing.

🛠 Technologies Used

| Step | Technology |
| --- | --- |
| Data / EDA | Python, Pandas, NumPy |
| NLP Features | scikit-learn (Count / TF‑IDF) |
| Models | scikit-learn (NB, SVM, RF, AdaBoost) |
| Environment | Jupyter Notebook |

💡 Key Learnings

  • TF‑IDF + Naive Bayes remains a strong baseline for short-text classification.
  • Improving leaderboard LogLoss would likely benefit from probability calibration and n‑gram / char‑gram features (a calibration sketch follows this list).
  • For imbalanced, sparse text, simpler models can beat complex ensembles.
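
As one possible follow-up on the calibration point, a sketch that wraps the TF‑IDF + Naive Bayes pipeline in scikit-learn's CalibratedClassifierCV; whether it actually lowers log loss on this data would need to be verified.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Sigmoid (Platt) calibration on top of the TF-IDF + Naive Bayes pipeline;
# probabilities are calibrated on internal cross-validation folds.
base = make_pipeline(TfidfVectorizer(), MultinomialNB())
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
# calibrated.fit(train["text"], train["author"])
# Any log loss improvement should be checked on a held-out split or via CV.
```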

🔗 GitHub Repository

📂 View Project on GitHub