AWS Certified AI Practitioner (32) - Amazon Polly & Rekognition
Amazon Polly
What It Does
- Amazon Polly is a Text-to-Speech (TTS) service that converts written text into lifelike speech using deep learning.
- With Polly, you can build applications that speak to users, such as an audiobook generator, a voice-enabled chatbot, or accessibility tools for visually impaired users.
Key Features (Exam Focus)
- Lexicons (Custom Pronunciation Dictionary)
- You can define how certain words should be pronounced.
- Example: AWS → Amazon Web Services, W3C → World Wide Web Consortium.
- Exam Tip: If a question asks how to control the way Polly pronounces abbreviations → the answer is Lexicons.
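Polly lexicons are XML documents in the W3C Pronunciation Lexicon Specification (PLS) format. The sketch below builds one with alias entries for the abbreviations above. The helper names `build_lexicon` and `upload_lexicon` are made up for this note, and `polly_client` is assumed to be a boto3 Polly client.

```python
from xml.sax.saxutils import escape

PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases, lang="en-US"):
    """Return a PLS lexicon that maps each written form to a spoken alias."""
    lexemes = "\n".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(grapheme)}</grapheme>\n"
        f"    <alias>{escape(alias)}</alias>\n"
        "  </lexeme>"
        for grapheme, alias in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NAMESPACE}" '
        f'alphabet="ipa" xml:lang="{lang}">\n'
        f"{lexemes}\n"
        "</lexicon>"
    )


def upload_lexicon(polly_client, name, aliases):
    """Store the lexicon in Polly; `name` must be alphanumeric, up to 20 characters."""
    polly_client.put_lexicon(Name=name, Content=build_lexicon(aliases))


lexicon_xml = build_lexicon({
    "AWS": "Amazon Web Services",
    "W3C": "World Wide Web Consortium",
})
```

Once uploaded, the lexicon is applied per request by passing its name in the `LexiconNames` list of `synthesize_speech`.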
- SSML (Speech Synthesis Markup Language)
- Markup language to fine-tune speech output: pauses, emphasis, pitch, volume, rate, etc.
- Example:
<speak>Hello, <break time="1s"/> how are you?</speak>
→ Polly will say “Hello,” pause for 1 second, then continue with “how are you?”
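To send SSML like this to Polly, the request has to declare the text type as SSML, otherwise the tags are treated as plain text. A minimal sketch, assuming a boto3 Polly client is passed in as `polly_client`; the helper names and the default voice `Joanna` are illustrative choices, not requirements.

```python
from xml.sax.saxutils import escape


def with_pause(before, after, pause="1s"):
    """Wrap two text fragments in SSML with a <break> between them."""
    return f'<speak>{escape(before)} <break time="{pause}"/> {escape(after)}</speak>'


def synthesize_ssml(polly_client, ssml, voice_id="Joanna"):
    """Return MP3 bytes for SSML input; polly_client is assumed to be boto3.client("polly")."""
    response = polly_client.synthesize_speech(
        Text=ssml,
        TextType="ssml",  # interpret the markup instead of treating it as plain text
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    return response["AudioStream"].read()


ssml = with_pause("Hello,", "how are you?")
```

Escaping the text fragments keeps characters such as `&` or `<` in user input from breaking the SSML document.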
- Voice Engines
- Standard: The original engine, based on concatenative synthesis; voices sound more robotic.
- Neural: More human-like and natural.
- Long-form: Designed for extended audio like podcasts or audiobooks.
- Generative: Latest engine using GenAI, capable of expressive, adaptive voices.
- Exam Tip: Know the difference between Standard vs Neural voices.
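In the API, the engine is chosen per request through the `Engine` parameter of `synthesize_speech`. The sketch below only builds and validates the request arguments; `build_request` is a hypothetical helper, and the voice `Joanna` is an assumed example (not every voice is available in every engine).

```python
VALID_ENGINES = ("standard", "neural", "long-form", "generative")


def build_request(text, voice_id, engine="neural"):
    """Build keyword arguments for Polly's synthesize_speech call."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    return {
        "Text": text,
        "VoiceId": voice_id,
        "Engine": engine,
        "OutputFormat": "mp3",
    }


request = build_request("Welcome back.", "Joanna", engine="neural")
# With a boto3 Polly client (assumed): polly_client.synthesize_speech(**request)
```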
- Speech Marks
- Metadata describing when each sentence, word, viseme, or SSML mark begins in the audio stream.
- Useful for lip-syncing or highlighting words in real-time transcripts.
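Speech marks come back as line-delimited JSON rather than audio. The sketch below parses such a payload and extracts word timings for highlighting. The sample payload follows the documented shape, but its timing values are made up, and the helper names are hypothetical.

```python
import json


def parse_speech_marks(raw):
    """Parse line-delimited JSON speech marks into a list of dicts."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def word_timings(marks):
    """Return (time_in_ms, word) pairs for highlighting words during playback."""
    return [(m["time"], m["value"]) for m in marks if m["type"] == "word"]


# Illustrative payload; "time" is milliseconds from the start of the audio,
# "start" and "end" are offsets into the input text.
sample = "\n".join([
    '{"time":0,"type":"sentence","start":0,"end":12,"value":"Hello there."}',
    '{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}',
    '{"time":370,"type":"word","start":6,"end":11,"value":"there"}',
])
timings = word_timings(parse_speech_marks(sample))
```

With a boto3 Polly client, a payload like this is requested by calling `synthesize_speech` with `OutputFormat="json"` and `SpeechMarkTypes=["sentence", "word"]`, then decoding the bytes read from `AudioStream`.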
Important Comparison
- Amazon Polly = Text → Speech
- Amazon Transcribe = Speech → Text
Amazon Rekognition
What It Does
- Amazon Rekognition analyzes images and videos with machine learning.
- It can identify objects, text, people, and activities, and it supports facial recognition and verification.
Core Use Cases (High Exam Relevance)
- Labeling – Automatically detect and categorize objects and scenes (e.g., “car,” “dog,” “mountain”).
- Text Detection – Extract text from images (e.g., license plates, signs).
- Face Detection & Analysis – Estimate attributes such as gender, age range, and emotions, plus details like whether the person is smiling or has their eyes open.
- Face Search & Verification – Match against a database of known faces (e.g., for access control).
- Celebrity Recognition – Identify famous people.
- Pathing / Tracking – Track movement (e.g., following a ball in a sports game).
- PPE Detection – Detect personal protective equipment like helmets, gloves, and masks.
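As an example of how these use cases look in code, here is a minimal labeling sketch. It assumes a boto3 Rekognition client is passed in as `rekognition_client` and that the image sits in S3; the helper names and the 90% threshold are illustrative assumptions.

```python
def s3_image(bucket, key):
    """Build the Image argument that points Rekognition at an object in S3."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def label_names(response, min_confidence=90.0):
    """Keep only the label names at or above the confidence threshold."""
    return [
        label["Name"]
        for label in response.get("Labels", [])
        if label["Confidence"] >= min_confidence
    ]


def detect_labels(rekognition_client, bucket, key, max_labels=10):
    """Call DetectLabels; rekognition_client is assumed to be boto3.client("rekognition")."""
    response = rekognition_client.detect_labels(
        Image=s3_image(bucket, key),
        MaxLabels=max_labels,
    )
    return label_names(response)


# Response-shaped sample with made-up confidence scores.
sample_response = {
    "Labels": [
        {"Name": "Car", "Confidence": 98.7},
        {"Name": "Dog", "Confidence": 61.2},
    ]
}
names = label_names(sample_response)
```

The other detection calls (text, faces, PPE) follow the same pattern: pass an `Image` and filter the response by confidence.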
Advanced Features
Custom Labels
- Train Rekognition to detect your own objects or logos.
- Example: The NFL uses Rekognition to automatically find its logo in social media photos.
- Only a few hundred training images are needed.
- Images are stored in Amazon S3, then Rekognition trains a custom model.
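Once a custom model is trained and started, it is queried much like the built-in label detection, except that the request names the model version. A sketch under stated assumptions: the ARN, bucket, key, and label name `company-logo` are hypothetical, and the helper names are made up for this note.

```python
def custom_label_request(model_arn, bucket, key, min_confidence=80.0):
    """Build keyword arguments for Rekognition's detect_custom_labels call."""
    return {
        "ProjectVersionArn": model_arn,
        "Image": {"S3Object": {"Bucket": bucket, "Name": key}},
        "MinConfidence": min_confidence,
    }


def contains_label(response, label):
    """Return True if the response includes the given custom label."""
    return any(item["Name"] == label for item in response.get("CustomLabels", []))


# Hypothetical ARN, bucket, and key, for illustration only.
request = custom_label_request(
    "arn:aws:rekognition:us-east-1:111122223333:project/logos/version/v1/1",
    "example-bucket",
    "photos/post-123.jpg",
)
# With a boto3 Rekognition client (assumed) and a model that has been started:
#   response = rekognition_client.detect_custom_labels(**request)
sample_response = {"CustomLabels": [{"Name": "company-logo", "Confidence": 93.4}]}
found = contains_label(sample_response, "company-logo")
```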
Exam Tip:
If you see “identify your company logo in images” → answer is Rekognition Custom Labels.
Content Moderation
- Automatically detect inappropriate or unsafe content (e.g., for social media platforms, ad campaigns, broadcasting).
- Reduces the share of content that needs human review to about 1–5% of the total volume.
- Integrated with Amazon Augmented AI (A2I) so humans can review edge cases.
- Supports Custom Moderation Adapters → you can supply your own labeled datasets to improve accuracy.
Exam Tip:
If a question asks about automatically filtering harmful content while still allowing human review when needed → the answer involves Rekognition Content Moderation + A2I.
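The "automatic filtering plus human review" pattern can be sketched as a simple triage over a moderation response. The two thresholds and the helper name `triage` are assumptions chosen for illustration; in practice the cutoffs depend on the platform's policy.

```python
def triage(response, reject_at=95.0, review_at=60.0):
    """Return "reject", "review", or "approve" for a DetectModerationLabels-shaped response."""
    top = max(
        (label["Confidence"] for label in response.get("ModerationLabels", [])),
        default=0.0,
    )
    if top >= reject_at:
        return "reject"   # confident enough to filter automatically
    if top >= review_at:
        return "review"   # edge case: hand off to a human reviewer
    return "approve"


# Response-shaped samples with made-up labels and scores.
clear_violation = {"ModerationLabels": [{"Name": "Violence", "Confidence": 99.1}]}
edge_case = {"ModerationLabels": [{"Name": "Suggestive", "Confidence": 72.5}]}
clean = {"ModerationLabels": []}

decisions = [triage(clear_violation), triage(edge_case), triage(clean)]
```

With boto3, the response would come from `detect_moderation_labels`, which also accepts a `HumanLoopConfig` argument that references an A2I flow definition, so the human review step can be started by the service rather than by custom code.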
Extra Details That Might Show Up on Exams
- Face Liveness Detection: Ensures the detected face is real (not a photo or video spoof).
- Image Properties: Extract dominant colors and quality measures (sharpness, brightness, contrast) for the whole image, the foreground, and the background.
- Integration with Other AWS Services:
- Works well with Amazon S3 (for image storage).
- Asynchronous video analysis publishes job completion status to Amazon SNS, which can fan out to Amazon SQS for event handling.
- Human-in-the-loop moderation integrates with Amazon A2I.
Quick Exam Summary
- Polly vs Transcribe → Polly = TTS, Transcribe = STT.
- Polly Key Features → Lexicons, SSML, Neural/Generative Voices, Speech Marks.
- Rekognition Key Features → Labeling, Text Detection, Face Analysis, Celebrity Recognition, PPE Detection.
- Rekognition Advanced → Custom Labels, Content Moderation (+ A2I integration).
- Remember: Rekognition = image/video analysis, Polly = text-to-speech.
All articles on this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.