1. What is Apache Spark?

👉 Apache Spark is a data processing engine built to process large-scale data quickly on a distributed cluster.

Spark handles the following workloads in a single unified framework:

  • Batch processing
  • Stream processing
  • Machine learning
  • Graph processing
  • SQL-based data analysis

📌 Exam points

  • Spark is not a database
  • Spark is not a storage system
  • Spark is a data processing engine

2. Main APIs Provided by Spark (Unified Framework)

Spark exposes several APIs on top of a single engine.

(1) Spark SQL & DataFrame API

  • SQL-based data processing
  • ANSI SQL compatible (strict ANSI behavior is controlled by the spark.sql.ansi.enabled setting)
  • The most widely used API ⭐⭐⭐

(2) Structured Streaming

  • Processes streaming data with the same model as batch data
  • Integrates with sources such as Kafka and Kinesis
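
The idea of treating a stream like a batch can be sketched in plain Python. This is only a conceptual model, not the Structured Streaming API: the function names, the event shape, and the sample micro-batches are all hypothetical. The same batch function is applied to each small batch as it arrives, and a running result is kept.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, str]


def count_by_page(events: Iterable[Event]) -> Counter:
    """Batch logic: count events per page."""
    return Counter(e["page"] for e in events)


def run_micro_batches(batches: Iterable[List[Event]]) -> Iterator[Dict[str, int]]:
    """Apply the same batch logic to each micro-batch and keep a running total."""
    totals: Counter = Counter()
    for batch in batches:
        totals.update(count_by_page(batch))
        yield dict(totals)  # snapshot of the result after this micro-batch


# Hypothetical stream, already cut into micro-batches (one may be empty).
micro_batches = [
    [{"page": "home"}, {"page": "cart"}],
    [{"page": "home"}],
    [],
    [{"page": "cart"}, {"page": "home"}],
]

snapshots = list(run_micro_batches(micro_batches))
all_events = [e for batch in micro_batches for e in batch]

print(snapshots[-1])
print(dict(count_by_page(all_events)))
```

The final streaming snapshot equals what the batch function returns over all events at once, which is the property the "stream as batch" model relies on.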

(3) MLlib

  • Machine learning library
  • Provides classification, regression, clustering, and more

(4) GraphX

  • Graph computation (nodes, edges)

📌 Exam points

  • The RDD API still exists but is not recommended for new code
  • The exam focuses on DataFrame / Spark SQL

3. Spark Architecture (Spark Stack)

How Spark is layered (bottom → top)

  1. Distributed Storage

    • HDFS
    • Amazon S3
    • Azure Data Lake Storage (ADLS)
    • Google Cloud Storage (GCS)
  2. Compute Cluster

    • A cluster made up of multiple servers
  3. Resource Manager (Cluster Manager)

    • YARN
    • Kubernetes
    • Standalone
    • Mesos (legacy)
  4. Spark Framework

    • Spark Core
    • Spark SQL
    • Streaming
    • MLlib
    • GraphX
  5. Programming API / DSL

    • Scala
    • Java
    • Python (PySpark)
    • R

📌 Exam points

  • On a cluster, Spark always runs under a cluster manager (local mode, used for development and testing, is the exception)
  • Spark's compute is decoupled from storage
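
In practice, the cluster manager is selected through the master URL passed to Spark (for example with `spark-submit --master`). The sketch below maps each manager from the list above to its master URL form. The helper name `master_url` and the host names are hypothetical; the URL shapes follow the forms documented for Spark.

```python
# Master URL shapes per cluster manager. Host and port values are placeholders.
MASTER_URL_FORMATS = {
    "local": "local[*]",                          # no cluster manager, all local cores
    "standalone": "spark://{host}:{port}",        # Spark's built-in manager
    "yarn": "yarn",                               # cluster located via Hadoop config
    "kubernetes": "k8s://https://{host}:{port}",  # Kubernetes API server
    "mesos": "mesos://{host}:{port}",             # legacy
}


def master_url(manager: str, host: str = "", port: int = 0) -> str:
    """Return the master URL for a given cluster manager (hypothetical helper)."""
    try:
        template = MASTER_URL_FORMATS[manager]
    except KeyError:
        raise ValueError(f"unknown cluster manager: {manager!r}") from None
    return template.format(host=host, port=port)


print(master_url("local"))
print(master_url("standalone", "spark-master", 7077))
print(master_url("yarn"))
print(master_url("kubernetes", "k8s-api", 6443))
```

The application code stays the same across managers; only the master URL (and deployment settings) changes.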

4. Spark Core & Language Support

Spark Core

  • Spark's core execution engine
  • Includes the RDD-based API

Supported languages

  • Scala (Spark's original language)
  • Java
  • Python (PySpark) ⭐
  • R

📌 Exam points

  • Spark Core = RDD API
  • In practice and on the exam, use the DataFrame API

5. Why Spark Is Popular

(1) High abstraction

  • Hides the complexity of distributed processing
  • Developers write only SQL or DataFrame code

(2) Easy to use

  • Accessible through SQL
  • Supports multiple languages

(3) Unified Platform

  • SQL + Batch + Streaming + ML + Graph
  • All handled by a single engine

(4) Open source & rich ecosystem

  • Used by a large number of companies
  • Used by roughly 80% of the Fortune 500

📌 Exam point

  • Spark's core strengths = Unified + Abstraction

6. What Apache Spark Is “Not” (Important ⭐⭐⭐)

Spark is powerful, but on its own it falls short as an enterprise solution.

(1) Built-in storage ❌

  • Spark does not persist data itself
  • It always needs external storage (S3, HDFS, etc.)
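
Because the data lives outside Spark, a job addresses it by URI, and the URI scheme tells you which storage system is behind it. The sketch below is a small plain-Python illustration. The helper name `storage_system` and the bucket, container, and host names are hypothetical; the schemes are the ones commonly used with the Hadoop-compatible connectors.

```python
from urllib.parse import urlparse

# URI schemes commonly used for the storage systems named in these notes.
SCHEME_TO_STORAGE = {
    "hdfs": "HDFS",
    "s3a": "Amazon S3",
    "abfss": "Azure Data Lake Storage",
    "gs": "Google Cloud Storage",
}


def storage_system(uri: str) -> str:
    """Name the external storage system behind a data path (hypothetical helper)."""
    scheme = urlparse(uri).scheme
    return SCHEME_TO_STORAGE.get(scheme, "unknown")


# Placeholder paths; none of these locations are real.
paths = [
    "hdfs://namenode:8020/warehouse/events",
    "s3a://example-bucket/events/",
    "abfss://container@account.dfs.core.windows.net/events",
    "gs://example-bucket/events/",
]
for path in paths:
    print(path, "->", storage_system(path))
```

The same read or write call works against any of these paths, as long as the matching connector is available on the cluster.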

(2) ACID transactions ❌

  • Spark by itself does not guarantee ACID
  • No built-in Atomicity, Consistency, Isolation, Durability guarantees

(3) Central metadata catalog ❌

  • Only a simple internal catalog exists
  • No enterprise-grade catalog

(4) Cluster management ❌

  • Spark cannot create or delete clusters
  • That is the role of the cluster manager

(5) Limited automation tooling ❌

  • Weak built-in support for automating deployment, monitoring, and operations

📌 Exam point

  • Spark is an engine, not a platform

7. Why Is a Spark “Platform” Needed?

Enterprise environments need the following:

  • ACID guarantees
  • Metadata management
  • Security
  • Automation
  • Operational convenience

👉 Hence the need for a Spark + Platform combination


8. Major Spark Platforms (Exam Favorite)

(1) Cloudera Hadoop

  • On-premises Hadoop platform
  • YARN-based
  • Can run Spark

(2) Amazon EMR

  • AWS managed Hadoop/Spark
  • Uses Hadoop + YARN internally

(3) Azure HDInsight

  • Azure-based Hadoop/Spark service

(4) Google Dataproc

  • GCP-based Hadoop/Spark service

📌 In common

  • All are Hadoop-based
  • All use YARN

9. What Sets Databricks Apart ⭐⭐⭐

Databricks characteristics

  • Not Hadoop-based
  • Does not use YARN
  • Cloud-native platform dedicated to Spark
  • Optimized for the cloud

What does Databricks provide?

  • Spark + ACID + Metadata + Automation
  • Built on Delta Lake
  • Supports implementing the Medallion Architecture

📌 Exam points

  • Databricks = Pure Spark Platform
  • On-Premise ❌, Cloud Only ⭕

10. One-Line Summaries That Often Appear on the Exam

  • Spark is a data processing engine
  • Spark does not include storage
  • Spark does not provide ACID out of the box
  • Spark runs on a cluster manager such as YARN or Kubernetes
  • Databricks is a Spark-based enterprise platform
  • Hadoop-based platforms ≠ Databricks

11. Medallion Architecture and Spark (Bonus)

  • Spark alone ❌
  • Spark + Delta Lake ⭕
  • The Medallion Architecture can be implemented on Databricks
    • Bronze
    • Silver
    • Gold
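
The three layers can be pictured with a plain-Python sketch. It uses neither Spark nor Delta Lake, so it only illustrates what each layer is usually responsible for (raw, cleaned, aggregated); the record shape and the functions `to_silver` and `to_gold` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Bronze: raw records kept as ingested, including bad and duplicate rows.
bronze: List[Dict[str, str]] = [
    {"order_id": "1", "customer": "alice", "amount": "30.0"},
    {"order_id": "2", "customer": "bob", "amount": "oops"},    # unparsable amount
    {"order_id": "3", "customer": "alice", "amount": "20.0"},
    {"order_id": "3", "customer": "alice", "amount": "20.0"},  # duplicate
]


def to_silver(rows: List[Dict[str, str]]) -> List[dict]:
    """Silver: validated, typed, de-duplicated records."""
    seen = set()
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # drop rows that fail validation
        if row["order_id"] in seen:
            continue  # drop duplicates by order_id
        seen.add(row["order_id"])
        clean.append(
            {
                "order_id": int(row["order_id"]),
                "customer": row["customer"],
                "amount": amount,
            }
        )
    return clean


def to_gold(rows: List[dict]) -> Dict[str, float]:
    """Gold: business-level aggregate (total amount per customer)."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

On Databricks, each layer would typically be a Delta table rather than an in-memory list, which is where the ACID guarantees come in.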

📌 Exam point

  • Medallion Architecture = Databricks + Delta Lake

✅ Closing Sentence (for the Exam)

Apache Spark is a Unified Data Processing Engine for processing large-scale data in a distributed environment,
and in enterprise settings it is used together with a platform such as Databricks.