AWS Certified AI Practitioner (23) - Training Data & Feature Engineering
📘 Training Data & Feature Engineering
Why Training Data Matters
- To build a reliable ML model, you need good-quality data.
- Principle: Garbage In → Garbage Out. If your input data is messy or incorrect, your model will produce poor predictions.
- Data preparation is the most critical stage of ML.
- The way you model your data (e.g., labeled/unlabeled, structured/unstructured) directly impacts which algorithms you can use.
📌 Exam Tip: Expect questions about labeled vs. unlabeled and structured vs. unstructured data.
Labeled vs. Unlabeled Data
🔹 Labeled Data
- Contains both input features and output labels.
- Example: Animal images labeled as "cat" or "dog."
- Used in Supervised Learning → the model learns to map inputs to outputs.
- Strong but expensive → requires manual labeling.
🔹 Unlabeled Data
- Contains only input features, with no labels.
- Example: A folder of animal pictures with no tags.
- Used in Unsupervised Learning → the model finds hidden patterns or clusters.
- Cheaper and more abundant, but harder to interpret.
Structured vs. Unstructured Data
🔹 Structured Data
Organized into rows/columns (like Excel or databases).
- Tabular Data: Customer DB (Name, Age, Purchase Amount).
- Time-Series Data: Stock prices collected daily.
🔹 Unstructured Data
Doesn't follow a set format; often text-heavy or media-rich.
- Text Data: Articles, social posts, product reviews.
- Image Data: Photos, medical scans, etc.
📌 Exam Tip: AWS might test you on which algorithms handle structured (tabular, time-series) vs. unstructured (text, image) data.
Supervised Learning
- Learns a mapping function: predicts output for unseen inputs.
- Requires labeled data.
- Types: Regression (continuous values) and Classification (categories).
🔹 Regression
- Predicts numeric values.
- Examples:
  - House prices (based on size, location).
  - Stock price forecasting.
  - Weather prediction (temperature).
- Output = continuous (any real value).
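A minimal regression sketch in plain Python, fitting a least-squares line to made-up house data (sizes and prices below are illustrative, not from the notes) and predicting a continuous price for an unseen size:

```python
# Simple linear regression (least squares) on toy house data.
# Sizes are in sq ft, prices in $1000s -- values are made up for illustration.
sizes = [1000, 1500, 2000, 2500]
prices = [200, 290, 410, 500]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / \
        sum((x - mean_x) ** 2 for x in sizes)
intercept = mean_y - slope * mean_x

# Predict a continuous value for an unseen input (1,800 sq ft)
predicted = intercept + slope * 1800
print(round(predicted, 1))  # -> 360.2
```

The key point for the exam: the output is any real number, not a category.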
🔹 Classification
- Predicts discrete categories.
- Examples:
  - Binary: Spam vs. Not Spam.
  - Multi-class: Mammal, Bird, Reptile.
  - Multi-label: A movie labeled as Action + Comedy.
- Common Algorithm: k-NN (k-Nearest Neighbors).
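A minimal k-NN sketch in plain Python, using toy 2-D points with made-up "spam"/"not spam" labels: classify a query point by majority vote among its k nearest training examples.

```python
# k-Nearest Neighbors (k=3) on toy labeled points; data is illustrative only.
from collections import Counter
import math

train = [((1.0, 1.0), "spam"), ((1.2, 0.8), "spam"),
         ((5.0, 5.0), "not spam"), ((5.2, 4.8), "not spam"),
         ((4.9, 5.1), "not spam")]

def predict(point, k=3):
    # Sort training examples by Euclidean distance to the query point
    nearest = sorted(train, key=lambda ex: math.dist(point, ex[0]))
    # Majority vote among the k closest neighbors
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(predict((1.1, 0.9)))  # -> spam
```

This is supervised learning in miniature: the labeled examples drive the prediction, and the output is a discrete category.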
Splitting the Dataset
- Training Set: 60–80% (used to train).
- Validation Set: 10–20% (used to tune hyperparameters).
- Test Set: 10–20% (used to evaluate final performance).
📌 Exam Tip:
- Training = learning.
- Validation = tuning.
- Test = evaluation.
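The split above can be sketched in a few lines of Python; the 70/15/15 ratio and placeholder dataset here are illustrative choices within the ranges the notes give.

```python
# 70/15/15 train/validation/test split with a fixed shuffle seed.
import random

data = list(range(100))  # placeholder dataset of 100 examples
random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(data)     # shuffle before splitting to avoid ordering bias

n = len(data)
train = data[: int(0.70 * n)]                     # 70% -> learning
validation = data[int(0.70 * n): int(0.85 * n)]   # 15% -> tuning
test = data[int(0.85 * n):]                       # 15% -> final evaluation

print(len(train), len(validation), len(test))  # -> 70 15 15
```

Shuffling first matters: without it, a sorted dataset would put systematically different examples in each split.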
Feature Engineering
Transforming raw data into useful features → improves model accuracy.
Techniques
- Feature Extraction: Convert raw values into meaningful ones.
  - Example: From birth date → calculate age.
- Feature Selection: Keep only the most relevant features.
  - Example: House price prediction → keep location & size, drop irrelevant columns.
- Feature Transformation: Normalize or scale data to improve convergence.
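The three techniques can be sketched on a single toy record; the field names and value ranges below are made up for illustration.

```python
# Feature extraction, selection, and transformation on a toy record.
from datetime import date

record = {"birth_date": date(1990, 6, 1), "size_sqft": 2000,
          "location_score": 8, "owner_name": "A. Smith"}

# Feature extraction: raw birth date -> age in years (as of a fixed date)
age = (date(2024, 6, 1) - record["birth_date"]).days // 365

# Feature selection: keep only the relevant columns, drop the rest
features = {k: record[k] for k in ("size_sqft", "location_score")}

# Feature transformation: min-max scale size into [0, 1],
# assuming known bounds of 500-5000 sq ft
lo, hi = 500, 5000
features["size_scaled"] = (features["size_sqft"] - lo) / (hi - lo)

print(age, round(features["size_scaled"], 3))  # -> 34 0.333
```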
Feature Engineering Examples
🔹 On Structured Data
- Predicting house prices:
  - Create price per square foot.
  - Normalize features like size and income.
🔹 On Unstructured Data
- Text: Convert reviews into numbers using TF-IDF or word embeddings.
- Images: Use CNNs to extract edges, shapes, or textures.
📌 Exam Tip: Know that feature engineering = boosting model performance by transforming data.
✅ Quick Recap for Exam
- Good data is critical → Garbage In, Garbage Out.
- Labeled → Supervised | Unlabeled → Unsupervised.
- Structured vs. Unstructured: Tables vs. Text/Images.
- Regression = numeric predictions, Classification = categories.
- Data split: Train (learn), Validate (tune), Test (evaluate).
- Feature Engineering improves accuracy through extraction, selection, transformation.
📌 One-Liner Exam Tip:
Most AWS exam questions on ML basics test whether you can correctly match the data type with the ML method (e.g., time-series → supervised regression, unlabeled images → unsupervised clustering).
(Additional) 📖 What is TF-IDF?
TF-IDF is a statistical method used in Natural Language Processing (NLP) to evaluate how important a word is within a document relative to a collection of documents (called a corpus).
It is widely used in search engines, information retrieval, and text mining.
(Additional) ⚡ How It Works
1. Term Frequency (TF)
- Measures how often a word appears in a document.
- Formula:
( TF(t, d) = \frac{\text{Number of times term t appears in document d}}{\text{Total terms in document d}} )
👉 Example: If the word "AI" appears 5 times in a 100-word document,
( TF(AI) = 5 / 100 = 0.05 ).
2. Inverse Document Frequency (IDF)
- Measures how important a word is across all documents in the corpus.
- Common words (like "the", "is", "and") get lower scores, while rare words get higher scores.
- Formula:
( IDF(t) = \log\frac{N}{1 + df(t)} )
where:
- ( N ) = total number of documents
- ( df(t) ) = number of documents containing the term t
👉 Example: If the word "AI" appears in 2 out of 10 documents,
( IDF(AI) = \log(10 / (1+2)) ≈ 1.20 ) (using the natural logarithm).
3. TF-IDF Score
- Combines TF and IDF to measure the importance of a term in a document relative to the whole corpus.
- Formula:
( TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t) )
👉 Example:
Using the previous numbers, ( TF(AI) = 0.05 ) and ( IDF(AI) = 1.20 ).
So, ( TF\text{-}IDF(AI) = 0.05 \times 1.20 = 0.06 ).
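The worked example above can be checked with a few lines of Python (natural log, matching the IDF formula used here):

```python
# Reproducing the TF-IDF worked example for the term "AI".
import math

tf = 5 / 100                   # "AI" appears 5 times in a 100-word document
idf = math.log(10 / (1 + 2))   # 10 documents total, "AI" appears in 2
tfidf = tf * idf

print(round(tf, 2), round(idf, 2), round(tfidf, 2))  # -> 0.05 1.2 0.06
```

Note that real libraries vary the IDF formula slightly (e.g., smoothing terms), so exact scores differ between implementations, but the TF × IDF structure is the same.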
🎯 Why Is TF-IDF Useful?
- Search Engines: Helps rank documents by relevance to a query.
- Text Mining: Identifies key terms in large text datasets.
- Spam Filtering: Detects suspicious terms often used in spam messages.
- Recommendation Systems: Finds similarities between documents or user profiles.
📌 Summary
- TF → Frequency of a word in a document.
- IDF → Importance of a word across all documents.
- TF-IDF → Highlights words that are frequent in one document but rare across the corpus.
📌 In AWS or AI-related exams, TF-IDF often comes up as a classic feature-extraction technique for text data before applying ML algorithms.