📘 Training Data & Feature Engineering

Why Training Data Matters

  • To build a reliable ML model, you need good-quality data.
  • Principle: Garbage In → Garbage Out. If your input data is messy
    or incorrect, your model will produce poor predictions.
  • Data preparation is the most critical stage of the ML workflow.
  • The way you model your data (e.g., labeled/unlabeled,
    structured/unstructured) directly determines which algorithms you
    can use.

👉 Exam Tip: Expect questions about labeled vs. unlabeled and
structured vs. unstructured data.


Labeled vs. Unlabeled Data

🔹 Labeled Data

  • Contains both input features and output labels.
  • Example: Animal images labeled as "cat" or "dog."
  • Used in Supervised Learning → the model learns to map inputs to
    outputs.
  • Powerful but expensive → requires manual labeling.

🔹 Unlabeled Data

  • Contains only input features, with no labels.
  • Example: A folder of animal pictures with no tags.
  • Used in Unsupervised Learning → the model finds hidden patterns
    or clusters.
  • Cheaper and more abundant, but harder to interpret.


Structured vs. Unstructured Data

🔹 Structured Data

Organized into rows and columns (like a spreadsheet or database).

  • Tabular Data: A customer database (Name, Age, Purchase Amount).

  • Time Series Data: Stock prices collected daily.

🔹 Unstructured Data

Doesn't follow a fixed format; often text-heavy or media-rich.

  • Text Data: Articles, social posts, product reviews.
  • Image Data: Photos, medical scans, etc.

👉 Exam Tip: AWS might test you on which algorithms handle
structured (tabular, time-series) vs. unstructured (text, image)
data.


Supervised Learning

  • Learns a mapping function from inputs to outputs, then predicts
    outputs for unseen inputs.
  • Requires labeled data.
  • Two main types: Regression (continuous values) and Classification
    (categories).

🔹 Regression

  • Predicts numeric values.
  • Examples:
    • House prices (based on size, location).
    • Stock price forecasting.
    • Weather prediction (temperature).
  • Output = continuous (any real value).
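As a sketch of what "continuous output" means in practice, the snippet below fits a one-feature linear regression with the closed-form least-squares solution. The house sizes and prices are made-up illustration values, not from any real dataset.

```python
# Minimal sketch: one-feature linear regression via closed-form least squares.
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical house sizes (sq ft) and prices ($1000s)
sizes = [1000, 1500, 2000, 2500]
prices = [200, 300, 400, 500]
slope, intercept = fit_line(sizes, prices)

# The model outputs a continuous value for an unseen input:
predicted = slope * 1800 + intercept  # 360.0 for this toy data
```

Real projects would reach for a library (e.g., scikit-learn) rather than hand-rolled math, but the idea is the same: learn a function, predict a number.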

🔹 Classification

  • Predicts discrete categories.
  • Examples:
    • Binary: Spam vs. Not Spam.
    • Multi-class: Mammal, Bird, Reptile.
    • Multi-label: A movie labeled as Action + Comedy.
  • Common algorithm: k-NN (k-Nearest Neighbors).
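A minimal sketch of how k-NN classification works, using plain Python and made-up 2-D points; a real workload would use a library implementation such as scikit-learn's KNeighborsClassifier.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (features, label) pairs; distance is Euclidean."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled 2-D points (binary classification)
train = [((1.0, 1.0), "spam"), ((1.2, 0.8), "spam"),
         ((4.0, 4.0), "not spam"), ((4.2, 3.9), "not spam"),
         ((3.8, 4.1), "not spam")]

label = knn_predict(train, (1.1, 0.9), k=3)  # "spam": 2 of 3 neighbors agree
```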

------------------------------------------------------------------------

Splitting the Dataset

  • Training Set: 60–80% (used to train the model).
  • Validation Set: 10–20% (used to tune hyperparameters).
  • Test Set: 10–20% (used to evaluate final performance).

👉 Exam Tip:

  • Training = learning.
  • Validation = tuning.
  • Test = evaluation.
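The split above can be sketched in a few lines of Python; the 70/15/15 fractions and the fixed seed are arbitrary choices within the ranges given.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows, then slice into train / validation / test sets."""
    rows = rows[:]                          # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)       # fixed seed makes the split reproducible
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    return (rows[:n_train],                 # training set: learning
            rows[n_train:n_train + n_val],  # validation set: tuning
            rows[n_train + n_val:])         # test set: final evaluation

train, val, test = split_dataset(list(range(100)))  # 70 / 15 / 15 rows
```

In practice a helper like scikit-learn's train_test_split does the same job, but the mechanics are just shuffle-and-slice.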


Feature Engineering

Transforming raw data into useful features → improves model accuracy.

Techniques

  • Feature Extraction: Convert raw values into meaningful ones.
    • Example: From a birth date → calculate age.
  • Feature Selection: Keep only the most relevant features.
    • Example: House price prediction → keep location & size, drop
      irrelevant columns.
  • Feature Transformation: Normalize or scale data to improve
    convergence.
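Two of the techniques above can be sketched directly: extraction (age from a birth date) and transformation (min-max scaling to [0, 1]). The dates and values are hypothetical.

```python
from datetime import date

# Feature extraction: derive a meaningful value (age) from a raw birth date.
def age_from_birthdate(birth, today):
    # Subtract 1 if this year's birthday hasn't happened yet.
    return today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))

# Feature transformation: min-max scaling maps values into [0, 1],
# which helps many models converge faster.
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

age = age_from_birthdate(date(1990, 6, 15), date(2024, 6, 14))  # 33: birthday not reached yet
scaled = min_max_scale([1000, 1500, 2000])  # [0.0, 0.5, 1.0]
```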


Feature Engineering Examples

🔹 On Structured Data

  • Predicting house prices:
    • Create a price-per-square-foot feature.
    • Normalize features like size and income.

🔹 On Unstructured Data

  • Text: Convert reviews into numbers using TF-IDF or word
    embeddings.
  • Images: Use CNNs to extract edges, shapes, or textures.

👉 Exam Tip: Know that feature engineering = boosting model
performance by transforming data.


✅ Quick Recap for Exam

  1. Good data is critical – Garbage In, Garbage Out.
  2. Labeled → Supervised | Unlabeled → Unsupervised.
  3. Structured vs. Unstructured: Tables vs. Text/Images.
  4. Regression = numeric predictions, Classification = categories.
  5. Data split: Train (learn), Validate (tune), Test (evaluate).
  6. Feature Engineering improves accuracy through extraction,
     selection, transformation.

👉 One-Liner Exam Tip:
Most AWS exam questions on ML basics test whether you can correctly
match the data type with the ML method (e.g., time-series →
supervised regression, unlabeled images → unsupervised clustering).

(Additional) 📌 What is TF-IDF?

TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical method used in Natural Language Processing (NLP) to evaluate how important a word is within a document relative to a collection of documents (called a corpus).
It is widely used in search engines, information retrieval, and text mining.


(Additional) ⚡ How It Works

1. Term Frequency (TF)

  • Measures how often a word appears in a document.
  • Formula:
    TF(t, d) = (number of times term t appears in document d) / (total terms in document d)

👉 Example: If the word "AI" appears 5 times in a 100-word document,
TF(AI) = 5 / 100 = 0.05.


2. Inverse Document Frequency (IDF)

  • Measures how important a word is across all documents in the corpus.
  • Common words (like "the", "is", "and") get lower scores, while rare words get higher scores.
  • Formula:
    IDF(t) = log( N / (1 + df(t)) )
    where:
    • N = total number of documents
    • df(t) = number of documents containing the term t

👉 Example: If the word "AI" appears in 2 out of 10 documents,
IDF(AI) = log(10 / (1 + 2)) ≈ 1.20 (using the natural logarithm).


3. TF-IDF Score

  • Combines TF and IDF to measure the importance of a term in a document relative to the whole corpus.
  • Formula:
    TF-IDF(t, d) = TF(t, d) × IDF(t)

👉 Example:
Using the previous numbers, TF(AI) = 0.05 and IDF(AI) = 1.20.
So, TF-IDF(AI) = 0.05 × 1.20 = 0.06.
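The three formulas above can be checked with a few lines of Python (natural log assumed, matching the ≈ 1.20 value in the IDF example):

```python
import math

def tf(term_count, total_terms):
    """Term Frequency: how often the term appears in one document."""
    return term_count / total_terms

def idf(n_docs, docs_with_term):
    """Inverse Document Frequency: log(N / (1 + df(t))), natural log."""
    return math.log(n_docs / (1 + docs_with_term))

def tf_idf(term_count, total_terms, n_docs, docs_with_term):
    """TF-IDF = TF x IDF."""
    return tf(term_count, total_terms) * idf(n_docs, docs_with_term)

# The "AI" worked example: 5 occurrences in a 100-word document,
# appearing in 2 of 10 documents in the corpus.
score = tf_idf(5, 100, 10, 2)  # 0.05 * log(10/3) ~= 0.06
```

Note that real implementations (e.g., scikit-learn's TfidfVectorizer) use slightly different smoothing variants of the IDF formula, so exact scores differ between libraries.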


🎯 Why Is TF-IDF Useful?

  • Search Engines: Helps rank documents by relevance to a query.
  • Text Mining: Identifies key terms in large text datasets.
  • Spam Filtering: Detects suspicious terms often used in spam messages.
  • Recommendation Systems: Finds similarities between documents or user profiles.

๐Ÿ“ Summary

  • TF → Frequency of a word in a document.
  • IDF → Importance of a word across all documents.
  • TF-IDF → Highlights words that are frequent in one document but rare across the corpus.

👉 In AWS or AI-related exams, TF-IDF often comes up as a classic feature extraction technique for text data before applying ML algorithms.