About Me & General Questions

  • About me
  • Why are you changing jobs?
  • What is Boeing, and why did you apply?
  • Recent project
  • Issue: data volume growth to 3-4 TB
  • Issue: schema changes
  • Data warehouse design
  • Data warehouse latency problem and resolution
  • Data quality management: multi-point validation & "good data"
  • ETL (orchestration) design using CA7 and Glue

Data Quality & Security

  • Data quality management: unit testing
  • Security and compliance (e.g., AWS, Azure)

Technical Topics

  • Alteryx: duration and experience
  • Teradata: duration and experience
  • Teradata: definition and use cases
  • Why Teradata delivers high performance
  • Role of the Teradata Primary Index (PI)
  • How to resolve Teradata data skew
  • What is a Teradata Secondary Index?
  • Teradata partitioning
  • Teradata troubleshooting
  • Using Alteryx with Teradata
  • About Neo4j

Behavioral Questions

  • Making decisions when the manager is unavailable
  • Disagreement with teammates: 20-minute SLA delay, resolved via quality fixes
  • Disagreement with a manager: refresh only the changed partitions
  • Working with people of different styles
  • Helping a teammate succeed
  • Prioritizing urgent vs. important work: urgent first
  • Technical challenge: processing 1.5 TB
  • Improvement example: large data processing (same as above)
  • Production issue: connector stop
  • Proactive process improvement: connector stop
  • Kafka real-time data considerations: connector health checks
  • Quality issue: currency-unit error
  • Failure/mistake: missed column validation, currency-unit error
  • Project delay: schema change
  • Team lead & initiative: source data not delivered
  • Leadership example: data type mismatch
  • Going beyond assigned work: adding load_date (same as manager-disagreement example)
  • Tight schedule & pressure: split the work, communicate, track
  • Working with a multicultural team: sync meetings, meeting summaries
  • Cross-team collaboration: aligning terminology
  • Last-minute customer change requests: schema change
  • Frequent customer change requests: grouping requests
  • Can you explain technical topics to non-developers?
  • Issue: Spark memory problems

ETL / Orchestration

  • Data orchestration (CA7, Glue, Airflow)
  • ETL pipeline optimization: 6 AM SLA
  • AWS Glue experience: ETL services
  • Why we don't currently use Glue

Data Warehouse

  • What is Redshift?
  • Redshift columnar storage
  • Snowflake advantages (zero-copy cloning, time travel)
  • Databricks experience: anomaly detection
  • Partitioning strategy
  • Data modeling & architecture: Oracle range partitioning
  • Normalization vs. denormalization
  • Star & Snowflake schemas

Python/Spark/Hadoop

  • Experience with Python, SQL, Spark, and PySpark
  • Spark/Hadoop ingestion experience
  • Have you used AWS a lot?
  • Strengths and weaknesses
  • How do you relieve stress?
  • What is your life motto?
  • Machine learning model at Powertech
  • Final remarks
  • Python type comparison

About Me

Thank you for having me for this interview. My name is Sunghwan Ki, but you can call me Danny. I'm a Data Engineer with six years of experience building ETL processes, especially in the financial industry. Currently, I lead projects that use Kafka, Oracle, and Spark, focusing on near real-time data processing and optimization. I primarily use Python to build data pipelines, and I recently completed a project where I built a data warehouse using AWS Glue and Redshift. Before joining PNC, I spent roughly seven years in data analytics, primarily using Tableau and MySQL to analyze data.

To keep improving, I completed a Master's degree in Data Science last year. I also hold AWS certifications and continue to pursue additional cloud-related credentials to further strengthen my expertise.


Why Are You Changing Jobs?

I've truly enjoyed my time at PNC, where I've spent over six years working on meaningful projects and improving my technical skills. Now I feel ready for a new challenge that allows me to grow further. Technology is evolving faster than ever, and I want to keep learning and developing new skills. For me, it's not about leaving something behind; it's about taking the next step toward work I'm truly passionate about.


What Is Boeing, and Why Did You Apply?

Boeing is one of the world's largest aerospace and defense companies. It designs and builds commercial airplanes like the 737 and 777. Boeing's work also connects people, supports global transportation, and contributes to national security and space exploration. That's why I applied: I want to help build products with real-world impact.


Recent Project

Currently, I am building a near real-time pipeline that ingests Kafka topic data into Oracle Exadata and then into a Hadoop platform. In the past, stakeholders had to rely on the previous day's data to make decisions. Now, with this new pipeline, data from Kafka is ingested into Hadoop every 10 minutes and then visualized through Tableau dashboards. This project significantly reduced data latency and helped the business team make faster decisions.


Issue: Data Volume Growth to 3-4 TB

One of the biggest challenges I faced recently was with a Kafka-to-Hadoop data pipeline, where Oracle Exadata was used as a staging area.

Initially, the volume of data coming from Kafka was about 1 TB per day, but it suddenly increased to 3-4 TB per day. Even though the data was automatically deleted after being loaded into Hadoop, new data was arriving faster than it could be deleted, so Exadata started running out of space. To handle this, I increased the number of Spark jobs to speed up data movement into Hadoop, but that slowed Exadata down and created a bottleneck. Then I considered compressing the data on the Exadata side, and luckily discovered that Exadata has a built-in compression feature; best of all, the data doesn't need to be decompressed when it's moved to Hadoop. Using this compression, I reduced the data size in Exadata by almost 70%. After that, I scaled the Spark jobs back down, which helped Exadata run better and stabilized the pipeline.


Issue: Schema Changes

I remember a project where we were integrating data from multiple sources into a central data warehouse.

The challenge was that one of the upstream systems frequently changed its schema without notice, which caused our ETL jobs to fail and delayed reporting for business users. My responsibility was to make the pipeline resilient enough that these schema changes would not break the entire data flow. I implemented a schema validation and auto-adjustment process: the updated code compares incoming data schemas against our expected schema. If a non-critical change occurred, such as a new column being added, the pipeline adapted automatically without failing. For critical mismatches, the system flagged the issue, generated an incident, and fell back to logic that let processing continue. This reduced ETL job failures by more than 90% and ensured that the business team kept receiving data even when upstream systems changed their schemas unexpectedly.

Valid records are loaded into the main HDFS path, and invalid records are redirected to a separate reject HDFS path. Typical reject cases (a validation sketch follows the list):

  • A string value like "ABC" appears in a numeric column
  • A NULL value is provided for a NOT NULL column
  • A date column receives a value in a completely different or invalid format
  • The data type is correct but the length exceeds the limit (e.g., a 50-character string for a VARCHAR2(10) column)
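
A minimal PySpark sketch of this valid/reject split, assuming hypothetical HDFS paths and illustrative column names (the expected-schema set and validation rules are simplified, not the exact production logic):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-validation").getOrCreate()

# Read everything as strings first so malformed rows don't fail the job.
raw = spark.read.option("header", "true").csv("hdfs:///landing/accounts/")  # hypothetical path

# Schema comparison: tolerate added columns, treat missing ones as critical.
expected_cols = {"account_id", "amount", "load_dt"}
missing = expected_cols - set(raw.columns)
if missing:
    raise ValueError(f"Critical schema mismatch, raise an incident: {missing}")

# Row-level checks: a failed cast yields NULL, which marks the row invalid.
checked = (raw
    .withColumn("account_id_i", F.col("account_id").cast("int"))
    .withColumn("amount_i", F.col("amount").cast("int"))
    .withColumn("load_dt_d", F.to_date("load_dt", "yyyy-MM-dd")))

is_valid = (F.col("account_id_i").isNotNull()                            # NOT NULL key
            & (F.col("amount").isNull() | F.col("amount_i").isNotNull())
            & (F.col("load_dt").isNull() | F.col("load_dt_d").isNotNull()))

checked.filter(is_valid).write.mode("append").parquet("hdfs:///main/accounts/")     # main path
checked.filter(~is_valid).write.mode("append").parquet("hdfs:///reject/accounts/")  # reject path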

When data is stored in Oracle: a CLOB (Character Large Object) is a data type used to store very large text values.


Data Warehouse Design (Amazon Redshift, Snowflake Experience)

I have experience using Redshift to build cloud data warehouses. In one of my projects, I built an analytics pipeline to process and analyze mobile user login data. I set up a pipeline where the log data was first stored in Amazon S3; from there, AWS Glue processed the data and loaded it into Redshift. Once the data was in Redshift, I used Amazon QuickSight to build interactive dashboards that visualized key user activity such as session duration, clickstream patterns, and device usage. This solution provided business stakeholders with actionable insights.


Data Warehouse Latency Problem and Resolution

One of the challenges I ran into was that loading JSON files from S3 into Redshift was much slower than I expected.
Because the data was in JSON, Redshift had to parse every row, and the file sizes were all different.
This caused performance issues and even led to uneven data distribution across Redshift nodes.

To fix this, I redesigned the ingestion process in AWS Glue.
I converted the JSON data into Parquet and saved the files to S3 at a consistent size of around 128 MB.
Since Parquet is already a fully structured, columnar format, Redshift didn't have to do extra parsing during the load, which significantly sped up the loading process.
I also updated the DISTKEY and SORTKEY based on how the data was being queried, which helped prevent data skew and allowed Redshift to process data evenly across all nodes.

  • Why 128 MB files?
    I stored the Parquet files in 128 MB chunks because Redshift performs best when it reads multiple files of similar size in parallel. Consistent file sizes help avoid data skew, reduce S3 overhead, and allow Redshift to distribute the workload evenly across all nodes, which results in much faster COPY performance. (A conversion sketch follows.)
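
A minimal PySpark sketch of the Glue-side conversion, with hypothetical S3 paths; the dataset-size estimate is illustrative, and the repartition count is the knob that yields roughly 128 MB output files:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

df = spark.read.json("s3://my-bucket/raw/logins/")        # hypothetical source path

# Aim for ~128 MB per output file: estimated total size divided by 128.
# (The 50 GB figure is illustrative; in practice derive it from S3 object metadata.)
total_size_mb = 50 * 1024
num_files = max(1, total_size_mb // 128)

(df.repartition(int(num_files))
   .write.mode("overwrite")
   .parquet("s3://my-bucket/curated/logins/"))            # hypothetical target path

Redshift's COPY can then load these similar-size Parquet files in parallel across slices.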


Normalization vs. Denormalization

Normalized data is typically used in OLTP systems. It separates data into multiple related tables to reduce redundancy and maintain data integrity. This helps ensure consistency during insert, update, and delete operations, but often requires multiple joins to retrieve data. Denormalized data is more common in OLAP systems. It intentionally duplicates data by combining related fields into fewer tables, which improves read performance and speeds up complex analytical queries.


Star & Snowflake Schemas (Using the Star Schema in a Data Warehouse)

In most data warehouses, the Star Schema is used because it provides high query performance, especially for analytical workloads, and has a simple structure consisting of a central fact table connected to denormalized dimension tables. This simplicity also makes it well suited for BI tools like Tableau or Power BI. But the Snowflake Schema is also used—especially when storage efficiency or data normalization is a higher priority. It tends to introduce more joins, which can affect query performance. Therefore, Star Schema is generally preferred in data warehouse environments.


Data Quality Management: Multi-Point Validation & "Good Data"

When it comes to data quality, I apply validation at multiple points of the pipeline. During ingestion, I perform schema validation and basic checks such as null values, data types, and duplicates. As the data moves through transformations, I apply additional business-rule validations to ensure the results make sense before loading them into the data warehouse. In addition, I worked closely with the business team to define what "good data" means for their use cases and ensured that the Tableau dashboards reflected reliable information for decision-making.

In one project, I worked with the fraud prevention team, where my role was to deliver data they could fully trust. For them, "good data" meant accurate, up-to-date, and reliable information without duplication or errors. Because data quality directly impacted their fraud detection models, I focused not only on delivery but also on maintaining high quality through validation and monitoring. (A sketch of these checks follows.)
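
A minimal PySpark sketch of the kinds of checks applied at these points; the path, column names, and thresholds are illustrative assumptions:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
df = spark.read.parquet("hdfs:///staging/transactions/")   # hypothetical path

row_count = df.count()

# Ingestion-level checks: nulls and duplicates on the business key.
null_keys = df.filter(F.col("txn_id").isNull()).count()
dup_keys = row_count - df.dropDuplicates(["txn_id"]).count()

# Business-rule check: transaction amounts must be positive.
bad_amounts = df.filter(F.col("amount") <= 0).count()

if null_keys or dup_keys or bad_amounts > row_count * 0.01:
    # In production this raises an incident rather than silently failing.
    raise ValueError(f"DQ failed: nulls={null_keys}, dups={dup_keys}, bad={bad_amounts}")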


ETL (Orchestration) Design Using CA7 and Glue

I've designed and implemented both batch and near real-time ETL pipelines. For near real-time workloads, I built pipelines that ingest Kafka streaming data into Oracle Exadata and Hadoop every ten minutes. I used PySpark for transformations and CA7, a mainframe-based scheduler, to orchestrate the dependencies across these jobs. CA7 ensured that each PySpark workflow ran in the correct sequence and at the right time, which was critical for batch operations. I also have experience building cloud-native ETL solutions: in one project, I used AWS Glue Studio to design the ETL workflows, and Glue's built-in transformations and job orchestration features made the logic easier to manage.


Technical Topics

Alteryx: Duration and Experience

I used Alteryx for about a year in the past, mainly for data preparation and automation tasks such as joining datasets, performing aggregations, and creating analytical outputs. However, I didn’t use it as the primary tool in any large-scale enterprise projects. These days, I mainly work with AWS Glue Studio. When dealing with larger datasets, I noticed that Alteryx tends to slow down since it’s not optimized for big-data workloads. But AWS Glue Studio runs on Apache Spark, which provides much better performance and scalability for heavy ETL processing.


Teradata: Duration and Experience

I have around two years of experience with Teradata, mainly using it alongside a Hadoop system. In our setup, Hadoop stored the customers' account data, and Teradata accessed that data through QueryGrid, which allowed us to easily combine and query Hadoop datasets. We also connected Tableau to Teradata and set up hourly refreshes so the dashboards always reflected the latest account data.
  • Problem encountered: network latency
    Network latency was the biggest issue we faced. Because Teradata had to retrieve large volumes of detailed data from Hadoop over QueryGrid, there were many cases where the query didn’t return on time or failed altogether. To address this, we changed the approach so that heavy processing happened in Hadoop first. We aggregated and filtered the data using Spark, and then used QueryGrid only to bring back a much smaller dataset. This significantly reduced the amount of data being transferred, which helped avoid latency issues and made the overall query performance much more stable.


Teradata Troubleshooting

One of the most common issues I faced with Teradata was slow query performance on large tables, especially when the table wasn't partitioned. In those cases, Teradata had to scan the entire table, which made daily jobs take much longer than expected. To fix this, I added date-based partitions so Teradata only scanned the specific partition needed for each query. This small change made a big difference: queries became much faster and more stable, and it also reduced load on the system and improved overall performance.


Why Teradata Delivers High Performance

Teradata leverages an MPP architecture in which data is distributed across multiple AMPs (Access Module Processors). Each AMP works independently to store and process its portion of the data, enabling parallel execution. Because of this distribution mechanism, Teradata can handle large volumes of data with high performance.


Role of the Teradata Primary Index (PI)

A Primary Index determines how data is distributed across AMPs. Choosing the right PI is crucial because it ensures even data distribution, and a well-chosen PI improves join performance and overall query efficiency. (Internally, when Teradata stores a row, it applies a hash function to the Primary Index column to produce a hash value, and that value determines which AMP the row is stored on.)
CREATE TABLE customer (
    customer_id INTEGER,
    name VARCHAR(100)
)
PRIMARY INDEX (customer_id);


How to Resolve Teradata Data Skew

Data skew occurs when data is unevenly distributed across AMPs, causing some AMPs to process significantly more data than others, which slows query performance. To handle data skew, I typically review the PI selection and check for high-cardinality (nearly unique) columns. Sometimes, creating a multicolumn PI can help balance the distribution.
-- Rebuild the table with a better-distributing PI
CREATE TABLE customer_new
PRIMARY INDEX (new_column)
AS customer
WITH NO DATA;

INSERT INTO customer_new
SELECT * FROM customer;

DROP TABLE customer;

RENAME TABLE customer_new TO customer;
-- orders_stage : staging table
-- orders_nopi  : NoPI table (no Primary Index)
INSERT INTO orders_nopi
SELECT *
FROM orders_stage
HASH BY customer_id;


What Is a Teradata Secondary Index?

A Secondary Index is useful when frequently queried columns are not part of the Primary Index. It accelerates data access without re-distributing data. However, because Secondary Indexes require additional maintenance, I usually add them only when a business-critical query pattern consistently needs optimization.
-- Add a secondary index (Teradata uses CREATE INDEX rather than ALTER TABLE ... ADD INDEX)
CREATE INDEX (column_name) ON your_table_name;


Teradata Partitioning

Partitioning allows tables to be divided into manageable segments, usually based on date. This improves query performance because Teradata only scans relevant partitions instead of the whole table. I commonly used date-based partitioning.
CREATE TABLE sales_daily (
    order_id INTEGER,
    order_date DATE
)
PRIMARY INDEX (order_id)
PARTITION BY RANGE_N(order_date BETWEEN DATE '2023-01-01'
                     AND DATE '2025-12-31'
                     EACH INTERVAL '1' DAY);

-- Only the May 2023 partitions are scanned:
SELECT *
FROM sales_daily
WHERE order_date >= DATE '2023-05-01'
  AND order_date <  DATE '2023-06-01';


Using Alteryx with Teradata

I think ETL pipelines between Alteryx and Teradata are built using Alteryx’s In-DB tools. Alteryx generates SQL and pushes all heavy transformations to the Teradata MPP engine, which handles large-scale joins and aggregations efficiently. Alteryx simply orchestrates the workflow, while Teradata performs the actual processing. This approach combines the ease of use of Alteryx with the scalability of Teradata.

About Neo4j

Although I haven't used Neo4j in production, I'm interested in graph databases and would welcome the opportunity to learn and apply Neo4j in future projects.


Working with a Multicultural Team: Sync Meetings, Meeting Summaries

Currently, I work on a data integration project with team members from the U.S., India, and Europe. At first, coordination was difficult because of time zone differences and different communication styles. To improve collaboration, I organized short daily sync meetings that overlapped our working hours and encouraged open discussions so everyone could share progress or blockers. I also started sending clear written summaries after each meeting so teammates in different time zones could stay updated. As a result, we reduced misunderstandings and improved task handoffs between regions.


Cross-Team Collaboration: Aligning Terminology

In one of my projects, I worked closely with software engineers and business analysts to improve how we tracked and analyzed user behavior. The engineers were responsible for sending user activity data into our database, and my role was to clean and transform that data so it could be used for reporting and analysis. I noticed that each team had slightly different definitions for key metrics, like “active users” or “sessions,” which caused confusion in reports. So, I organized a short meeting to align on clear definitions and updated our data dictionary to make sure everyone used the same terms. After that, the reports became much more consistent, and the business team was able to make decisions faster and with more confidence. It was a great experience showing how clear communication and teamwork can really improve data quality and trust.


Tight Schedule & Pressure: Split the Work, Communicate, Track

When I face tight deadlines or high-pressure situations, I stay calm and break the work into smaller parts. For example, in one project, our team had to build a new ETL workflow in less than two weeks because of a last-minute client request. Instead of stressing out, I focused on what was most important, assigned tasks clearly, and set up short daily check-ins to track progress. I also kept open communication with both the team and stakeholders, making sure everyone understood what we could realistically deliver. By staying organized and working together, we completed the project on time with great results. This experience taught me that under pressure, clear priorities, steady communication, and teamwork are the keys to success.


Quality and Security

Data Quality Management: Unit Testing

I usually use pytest for unit testing in Python. It's simpler and more readable than the built-in unittest module, and it lets me write tests quickly without creating test classes. In pytest, test functions simply start with test_, and I use the assert statement to verify results.
import pytest
from calculator import add, divide  # calculator.py provides add() and divide()

def test_add():
    assert add(2, 3) == 5
    assert add(-1, 1) == 0

def test_divide():
    assert divide(10, 2) == 5
    with pytest.raises(ZeroDivisionError):  # assuming divide() lets the error propagate
        divide(1, 0)


Quality Issue: Currency-Unit Error

During a new ETL release, one of our reports was showing incorrect revenue numbers. After investigating, I found that the issue came from an incorrect currency conversion in the transformation logic. I quickly fixed the script, reprocessed the data, and added automated checks that compare daily results with historical trends. After that, the data became much more accurate, and the same issue never happened again. This experience reminded me how important it is to validate data thoroughly before going to production.


Security and Compliance (AWS, Azure)

In my current role, we have a dedicated security and compliance team that handles overall data governance, so if I need access to certain sensitive databases or tables, I first have to get approval from that team. This ensures only the right people have access to the data. On the data engineering side, I am responsible for protecting sensitive data during our ETL processes. That means identifying PII and masking it, so even if someone sees the data, it isn't readable. We also strictly follow the principle of least privilege and assign only the minimum required permissions. (A masking sketch follows.)
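
A minimal PySpark sketch of the kind of masking applied during ETL; the paths and column names are illustrative, and the choice between one-way hashing and partial redaction is a design decision, not the exact production rule:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pii-masking").getOrCreate()
df = spark.read.parquet("hdfs:///staging/customers/")      # hypothetical path

masked = (df
    # One-way hash for identifiers that must still join across tables.
    .withColumn("ssn", F.sha2(F.col("ssn"), 256))
    # Partial redaction for display fields: keep only the last four digits.
    .withColumn("phone", F.concat(F.lit("***-***-"), F.col("phone").substr(-4, 4))))

masked.write.mode("overwrite").parquet("hdfs:///curated/customers/")

Hashing preserves referential integrity (the same input always produces the same value), while redaction suits fields that humans read.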


Behavioral Interview Questions

Making Decisions When the Manager Is Unavailable

A few months ago, there was a configuration issue during a production release, and it delayed the data. At the time, my manager was not available, but the fix required his approval. I quickly analyzed the impact, documented the issue and solution, and escalated it to the department lead for temporary approval. I clearly communicated the risk and rollback plan, implemented the fix, and monitored results until the system was stable. When my manager returned, I shared a full summary and documentation for review. This taught me how to act responsibly under pressure, make quick but careful decisions, and keep communication clear.


Disagreement with Teammates: 20-Minute SLA Delay, Resolved via Quality Fixes

Yes, I've experienced disagreements within the team. One example was during a near real-time data pipeline project. We were loading Kafka data into Hadoop, and the pipeline often missed our 10-minute SLA; sometimes it took over 20 minutes. Some team members wanted to focus on improving Spark performance, while others, including me, thought the main issue was data quality because of inconsistent records and schema mismatches. To find the real cause, I checked the logs and monitoring reports and found that about 70% of the delays were due to data validation errors, not Spark processing speed. Based on that insight, I proposed a short proof-of-concept implementing stronger schema validation and fallback rules in the QA environment, and it worked. After we moved it to production, the number of failures dropped significantly and we regained our SLA. Once the team saw the data and results, everyone agreed to proceed with quality improvements first, then revisit performance tuning. This experience taught me that measurable data, clear communication, and structured tests are much more effective than letting opinions dominate technical decisions.


Disagreement with a Manager: Refresh Only the Changed Partitions

If I don't agree with my manager's opinion, I first make sure I fully understand their reasoning and goals. Then I share my perspective, supported by data or examples rather than emotion. For instance, in one project, my manager suggested running a full data reload every day to ensure data completeness. I understood his concern but also knew it would be inefficient, since most of the customer master data rarely changed. So I analyzed the update patterns and designed a process to refresh only the partitions containing changed customer records, instead of reloading the entire dataset. After testing it together in staging, we confirmed that this approach maintained accuracy while reducing runtime from over 3 hours to about 40 minutes. That experience taught me that presenting clear evidence and focusing on the common goal, not on who's right or wrong, helps turn disagreements into productive discussions.


Working with People of Different Styles

In one of my previous projects, I worked closely with a senior engineer whose personality was quite different from mine. I'm generally organized and prefer to plan tasks carefully before execution, but he preferred a more spontaneous, "just try it and fix it later" approach. At first, this difference caused some tension because I wanted to review the design and test cases before deployment, while he wanted to move fast to meet deadlines. Instead of arguing, I suggested we combine both styles: I would document the structure and validation rules, and he could focus on rapid prototyping. By dividing responsibilities that way, we delivered the pipeline faster and maintained quality. Over time, I learned to be more flexible and open to quick experiments, and he started to appreciate the value of planning and testing as well. That experience taught me that personality differences can actually strengthen a team when you focus on complementary strengths rather than conflicts.


Helping a Teammate Succeed

In one project, I worked with a junior data engineer who was new to our Spark-based ETL environment. She was struggling to understand how our partitioning and scheduling logic worked, and her jobs often failed in production. Instead of just fixing them for her, I scheduled a short session to walk her through how the Spark job read data from Oracle and wrote it into the Hadoop platform. I also helped her debug one of her failing jobs step by step and showed her how to check logs and handle schema mismatches. Within a few weeks, she became confident enough to manage her own pipelines and even automated some validation scripts. Seeing her grow and succeed made me realize that helping others not only strengthens the team but also improves overall project quality.


Prioritizing Urgent vs. Important Work: Urgent First

I usually start by understanding the impact of each task. If something is urgent and affects business operations or other teams, I handle it first. But I also make sure not to ignore important long-term work. For example, once our production ETL job failed right before a reporting deadline. I paused my ongoing optimization task, fixed the ETL issue immediately, and restored the pipeline so the business could get their reports on time. After that, I resumed my optimization work. I believe good prioritization means balancing immediate needs with long-term improvements.


Technical Challenge: Processing 1.5 TB

One of the most challenging technical problems I faced was optimizing a large Spark ETL job that processed about 1.5 TB every day. The job was taking more than 5 hours to complete, which caused delays in our downstream dashboards and reports. I started by analyzing the Spark UI and noticed heavy shuffling and many small output files. To fix it, I adjusted the partitioning strategy, used broadcast joins for smaller tables, and combined small files before writing to Hadoop. I also added data filtering early in the pipeline to reduce unnecessary computation. After these changes, the runtime dropped from 5 hours to under 3 hours, and cluster cost was reduced by almost 30%. That experience taught me how small technical optimizations can have a big business impact when working with large-scale data.


Improvement Example: Large Data Processing (Same as Above)

I remember that one of our daily ETL jobs was taking more than 5 hours to finish. It processed a large amount of log data from multiple sources, and sometimes it even failed because of memory issues. I reviewed the Spark job and found that it was using too many small files and unnecessary joins. I optimized the job by adjusting the partition size, adding proper filters early in the transformation, and combining small files before loading. After the changes, the job ran in less than 3 hours and became much more stable. This improvement not only saved computing costs but also made our data available earlier for reporting every morning.


Production Issue: Connector Stop

When a production issue happens, I stay calm and focus on finding the root cause quickly. For example, one night our Kafka-to-Hadoop pipeline failed, and the business dashboards in Tableau were missing data the next morning. I immediately checked the Kafka Connect logs and found that the sink connector had stopped due to a network issue. I manually restarted the connector and confirmed that the data started flowing again. Afterward, I created a monitoring script using curl commands that checks the connector status every 10 minutes. If it fails, the script automatically creates an incident and sends an alert to our team. This experience taught me the importance of not only fixing issues quickly but also building automation to prevent the same problem from happening again.


Proactive Process Improvement: Connector Stop

In one project, I noticed that one of our data pipelines sometimes failed overnight, but the team would only find out the next morning. This caused delays in daily reports and frustration for analysts waiting for updated data. Even though it wasn't part of my assigned tasks, I decided to create an automatic alert system. I built a small Python script that checked job completion logs and created an incident (INC) if a failure occurred. After testing it for a week, I presented it to the team, and we integrated it into our pipeline. Since then, we've been able to respond to failures immediately, reducing downtime and improving data reliability. That experience taught me the value of being proactive: small improvements can make a big difference to the whole team.


Kafka Real-Time Data Considerations: Connector Health Checks

When designing Kafka pipelines, I focus on a few key areas to ensure performance and reliability. First, I choose the right topic partitioning strategy based on data size. Then I make sure the Kafka connectors are properly configured with retry mechanisms in case of failures. For monitoring, I built a script that uses curl to check the status of all Kafka sink connectors every 10 minutes. If one of the connectors is down or there's an issue with the Kafka broker, the script automatically generates an incident and alerts my team. This setup helped us catch issues early and significantly reduced downtime. (A sketch follows.)
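
A minimal Python sketch of such a health check against the Kafka Connect REST API; the Connect URL and the incident-creation helper are hypothetical placeholders:

import requests

CONNECT_URL = "http://kafka-connect:8083"  # hypothetical Connect host

def create_incident(message: str) -> None:
    # Placeholder: in production this would call the ticketing system's API.
    print(f"INCIDENT: {message}")

def check_connectors() -> None:
    names = requests.get(f"{CONNECT_URL}/connectors", timeout=10).json()
    for name in names:
        status = requests.get(f"{CONNECT_URL}/connectors/{name}/status", timeout=10).json()
        state = status["connector"]["state"]
        failed_tasks = [t for t in status.get("tasks", []) if t["state"] == "FAILED"]
        if state != "RUNNING" or failed_tasks:
            create_incident(f"{name}: connector={state}, failed_tasks={len(failed_tasks)}")

if __name__ == "__main__":
    # Run from cron (or CA7) every 10 minutes.
    check_connectors()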


Failure/Mistake: Missed Column Validation, Currency-Unit Error

Yes, I made a mistake during a data validation process. I was in charge of checking the output of a new ETL job before it went live. I verified the total record count but forgot to double-check the column-level transformations. After deployment, we found that one column had an incorrect currency conversion rate, and it caused wrong numbers to show up in a few business reports. As soon as I realized the issue, I corrected the transformation logic, reprocessed the data, and updated the reports. After that, I added column-level validation rules and a simple Python script that automatically compares key fields between the source and target tables before deployment. That experience taught me how even a small mistake can affect business reports, so now I always check both the data structure and actual values carefully before sign-off.


Project Delay: Schema Change

Yes, I’ve experienced that before. In one project, our data ingestion pipeline was delayed because the upstream system changed its schema without notice. This caused our ETL jobs to fail and delayed daily reports for the client. As soon as the client raised concerns, I explained the issue clearly, shared the revised delivery plan, and sent daily updates so they could see our progress. Meanwhile, I worked with my team to add automatic schema validation and fallback logic in the pipeline, so future schema changes wouldn’t break the process again. After we implemented the fix, the pipeline became more stable, and the client appreciated our quick communication and the long-term solution we put in place.


Team Lead & Initiative: Source Data Not Delivered

In one project, we had a very tight deadline to deliver a new ETL workflow for daily reporting. However, a few tasks were delayed because some external data sources were not delivered on schedule, and that caused downstream jobs in Spark to fail during testing. To get things back on track, I took the initiative to organize short daily stand-up meetings and created a shared progress tracker in Confluence so everyone — including the data and QA teams — could see real-time task status. This helped us identify blockers early, communicate clearly, and reassign tasks based on team availability. Within a week, we recovered the lost time and successfully completed the workflow before the deadline. The reporting system went live as planned, and we avoided last-minute production issues. That experience taught me that strong coordination and clear communication are just as important as technical skills when leading a project under tight timelines.


Leadership Example: Data Type Mismatch

During a data migration project, we faced a serious issue when one of the ETL jobs started failing right before a major release. Everyone was under pressure, and the team was unsure how to proceed. Even though I wasn’t the official lead, I took the initiative to organize an emergency meeting with the data, QA, and infrastructure teams. I divided the investigation into parts — one team checked data source changes, another team looked at schema issues, and I focused on debugging the Spark job logic. After identifying that a data type mismatch in one column was causing the failure, we quickly fixed it and ran validation tests together. The release went smoothly, and my manager later recognized my leadership for coordinating the teams under tight deadlines. That experience taught me that real leadership often means stepping up and guiding the team toward a solution — even without having a formal title.


Going Beyond Assigned Work: Adding load_date (Same as the Manager-Disagreement Example)

A few months ago, I noticed that one of our nightly ETL jobs in production was running slower and occasionally failing, even though it wasn't part of the pipelines I was directly responsible for. Instead of ignoring it, I investigated on my own time because it was delaying downstream reports for the business team. After checking the logs, I found that the job was performing a full table scan on a very large dataset every night. To fix the issue, I first added a new LOAD_DATE column to the target table to track daily data loads. Then I rewrote the logic to process only new and updated records based on this column and created partitions on LOAD_DATE to improve query performance and data management efficiency. After validating the logic, I worked with the scheduler team to test and deploy the fix safely. The result was dramatic: runtime dropped from over 3 hours to under 40 minutes, and the business team could access their dashboards much earlier every morning. Even though it wasn't my assigned task, I took ownership because the issue was affecting overall business operations. That experience taught me that going above and beyond means proactively solving problems that impact the team, not just completing my own tickets.


Last-Minute Customer Change Requests: Schema Change

When a customer requests changes right before the final release, I believe it’s important to balance flexibility with stability. First, I listen carefully to understand why the change is needed — whether it’s a business-critical fix or just a nice-to-have improvement. Then I assess the impact on scope, timeline, and quality. If the change is minor and doesn’t risk the release, I coordinate quickly with the team to implement and test it. However, if the change is major or could affect stability, I clearly communicate the risks and propose alternatives — such as including it in the next patch or minor release. The most important thing is to be honest about the situation, and focus on finding solutions. This way, the customer knows you’re listening, and the project stays on track. For example, in one project, a client requested a schema change right before deployment. I analyzed the dependency, explained that it would delay the release by two days, and suggested deploying the current version first and adding the change in the next release. The client agreed, and we delivered on time without compromising quality.


Frequent Customer Change Requests: Grouping Requests

When customers often ask for changes, I try to handle it in a clear way. First, I listen carefully to understand why they want the change — maybe their business needs have changed or something wasn’t clear before. Then I explain what the change means for the project — like how it might affect the schedule or workload — so they can decide what’s most important. If there are many small requests, I suggest grouping them together or saving them for the next update. In one project, the customer kept asking for new data checks. I made a simple list to track all the requests and talked with them once a week to decide which ones to do first. That way, they felt listened to, and our team could work in an organized way without confusion.


Can You Explain Technical Topics to Non-Developers?

Yes, I always try to explain technical topics in a simple and clear way, especially when talking to non-technical people. I focus on using everyday language instead of technical terms, and I give real examples that relate to their work or daily life. For example, when explaining data pipelines, I might say it’s like a factory line — data comes in as raw material, goes through cleaning and transformation, and comes out as a finished product ready for analysis. I believe being able to translate complex ideas into simple concepts is an important skill for teamwork and communication.


Issue: Spark Memory Problems

One of the most common issues I encounter is out-of-memory (OOM) errors. To address this, I first review the PySpark code to identify operations like collect() or toPandas() that might pull too much data into the driver; if I find them, I remove or replace them. I also use broadcast joins when dealing with small tables to minimize shuffle operations, which reduces memory usage. Another important step is avoiding Python UDFs whenever native Spark SQL functions can do the job. Additionally, when I need to reuse intermediate results, I cache them with the MEMORY_AND_DISK storage level to avoid overwhelming memory. Finally, I adjust partition counts using coalesce() or repartition() to optimize resource usage during shuffle operations. By applying these techniques, I've been able to effectively prevent and troubleshoot memory-related issues in Spark jobs. (A combined sketch follows.)
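
A minimal PySpark sketch combining these mitigations; the paths, join keys, and column names are illustrative assumptions:

from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.appName("oom-mitigation").getOrCreate()

big = spark.read.parquet("hdfs:///warehouse/transactions/")   # large fact table (hypothetical)
small = spark.read.parquet("hdfs:///warehouse/dim_branch/")   # small dimension (hypothetical)

# Broadcast the small side so the join avoids shuffling the large table.
joined = big.join(F.broadcast(small), "branch_id")

# Native function instead of a Python UDF keeps the work inside the JVM.
enriched = joined.withColumn("amount_usd", F.round(F.col("amount") * F.col("fx_rate"), 2))

# Reused below for both a count and an aggregation, so persist with spill-to-disk.
enriched.persist(StorageLevel.MEMORY_AND_DISK)
print("records:", enriched.count())

daily = enriched.groupBy("branch_id", "load_date").agg(F.sum("amount_usd").alias("total"))

# coalesce() avoids writing many tiny output files without forcing a full shuffle.
daily.coalesce(32).write.mode("overwrite").parquet("hdfs:///marts/daily_totals/")
enriched.unpersist()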


ETL Pipeline Optimization: Meeting the 6 AM SLA

We had an SLA requiring the data to be fully available by 6 AM, but one day the amount of source data suddenly increased to almost three times the usual volume. Because of that, our Spark job didn't finish until 8 AM. I increased the number of partitions to allow more parallel processing, and I also checked our resource settings and made sure the job had enough memory and CPU by adjusting the scheduler pool and the YARN resource manager. After these changes, the job completed before 6 AM the next day, and we met the SLA again. This experience taught me how important it is to tune Spark jobs and monitor them carefully, especially when data volume suddenly increases. (A tuning sketch follows.)
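
A minimal sketch of the kind of Spark session tuning involved; the specific values are illustrative, not the exact production settings:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("nightly-etl")
         # More shuffle partitions for roughly 3x the usual data volume.
         .config("spark.sql.shuffle.partitions", "800")
         # Larger executors so wide transformations don't spill excessively.
         .config("spark.executor.memory", "8g")
         .config("spark.executor.cores", "4")
         # Let YARN scale executors up and down with the workload.
         .config("spark.dynamicAllocation.enabled", "true")
         .getOrCreate())

In practice these values come from observing the Spark UI (task counts, spill, and skew) rather than guesswork.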


AWS Glue Experience: ETL Services

I have experience building ETL workflows using AWS Glue. In one of my projects, I built an analytics pipeline to process and analyze mobile user log data. I used Glue to extract data from S3, then transform and load it into Redshift every 10 minutes. I also used Glue Crawlers to automatically detect schema changes and keep the Data Catalog updated for querying in Athena.


Why We Don't Currently Use Glue

We currently use the CA7 mainframe scheduler along with PySpark scripts for most of our ETL processes. CA7 is a mainframe-based job scheduling and workflow automation tool used to manage and schedule batch jobs, ensuring tasks run in the right order and at the right time. We haven't changed this orchestration tool because our data workflows have been integrated into the mainframe-based CA7 scheduling system for a long time, and switching would introduce additional operational costs. Our team continues to manage and monitor all ETL workflows within the CA7 environment.


What Is Redshift?

Redshift is designed for high-speed querying using massively parallel processing (MPP), which makes it great for analyzing large datasets quickly. We can start small and scale up by increasing the node size or the number of nodes as the data grows. Data is stored in columnar format, which speeds up analytical queries and reduces I/O.


Redshift Columnar Storage

When we execute queries like SUM(), AVG(), or filters on specific columns, the database only needs to read the relevant columns, not entire rows, which speeds up read performance in data warehousing and analytics. Since each column typically contains similar types of data, it compresses more efficiently than row-based storage. And because only the selected columns are read, the amount of data scanned is reduced.


Snowflake Advantages (Zero-Copy Cloning and Time Travel)

We can instantly clone entire databases or tables without duplicating data, which saves cost and time. There is also a Time Travel feature that lets us query or restore data from a previous point in time, which is useful for recovering data.


Databricks Experience: Anomaly Detection

On the Databricks side, I primarily work with the Azure-hosted version of Databricks. Recently, I developed an end-to-end scalable pipeline for computer vision anomaly detection; you can see the notebook and model on my portfolio website. I used PyTorch and Hugging Face to train and build the model.


Partitioning Strategy (Spark, Redshift, Snowflake)

Partitioning strategy depends on query patterns and data volume. In Oracle Exadata, I used range partitioning on a date column to support daily ingestion and to quickly delete old data by simply dropping partitions. In Spark, I used dynamic partition overwrite with partitionBy("date") when writing Parquet files, and adjusted the number of partitions with coalesce or repartition to avoid creating too many small files. In Redshift, I defined DISTKEY and SORTKEY based on the columns most frequently used in joins and filters, which improved query performance and reduced data movement across nodes. In Snowflake, I rely on its automatic micro-partitioning, which breaks data into small blocks (on the order of 16 MB compressed) and optimizes storage and query performance without manual intervention; however, for very large tables where queries frequently filter on specific columns, such as date, I define a cluster key to further improve performance. These approaches improved query speed as well. (A Spark sketch follows.)
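
A minimal PySpark sketch of the dynamic partition overwrite pattern mentioned above; the paths are hypothetical:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partitioned-write")
         # Overwrite only the partitions present in this batch, not the whole table.
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

df = spark.read.parquet("hdfs:///staging/events/")   # hypothetical source

(df.repartition("date")   # cluster rows by the partition column to limit small files
   .write.mode("overwrite")
   .partitionBy("date")
   .parquet("hdfs:///warehouse/events/"))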


Data Modeling & Architecture: Oracle Range Partitioning

I have strong experience in data modeling and architecture, especially in designing data pipelines at PNC. In Oracle Exadata, I designed the tables using range partitioning by day, so Kafka data was automatically separated into daily partitions. This made it much easier to manage large volumes of data, speed up queries, and improve overall performance. For example, instead of using a traditional delete command, we could simply drop an entire partition when the data was no longer needed. This not only optimized storage space but also kept query performance fast and efficient, since queries scanned only the relevant partitions rather than the entire table.


Python, SQL, Spark, and PySpark

I have worked with Python, SQL, Spark, and PySpark throughout my career as a data engineer. Python has been my primary programming language for building ETL pipelines; I've used it in both production and QA environments, including developing data ingestion frameworks. SQL is a core part of my daily workflow: I've written complex analytical queries and optimized SQL for performance on databases like Oracle Exadata. With Spark, I've built scalable data processing pipelines for both batch and near real-time use cases, working in distributed environments, primarily through PySpark, to perform transformations, aggregations, and joins on large datasets.


Experience: Spark/Hadoop Data Ingestion

On the Hadoop and Spark side, I designed frameworks to handle large-scale data ingestion and transformation. For example, data coming from Oracle first needed to be cleaned before it could be used for reporting. I built PySpark jobs that automatically parsed the data, removed duplicate records, handled missing values, converted the data into optimized formats like Parquet, and stored it on the Hadoop platform. At the same time, I added metadata and validation rules so we could easily track the data and confirm its accuracy.


Experience: Have You Used AWS a Lot?

If you take a look at my portfolio website, you'll see that most of my projects are built on AWS. I actively use AWS to quickly build and experiment with different data architectures. Since data tools are evolving so fast, I use the EMR service to easily install and try out big data tools like Spark, Hadoop, and Kafka. For storing data, I normally use RDS or S3. Overall, AWS has been a great platform for me to learn, experiment, and build end-to-end data pipelines.


Strengths and Weaknesses

My biggest strengths are my flexibility and adaptability. Work environments change daily, and some projects require individual attention while others take a teamwork approach; my flexibility and adaptability have allowed me to meet expectations and even go beyond them. I also get along well with the people around me, which makes the work environment more comfortable. As for my weaknesses, I sometimes put too much time into what I like to do. With my mentor's help, I started using a daily checklist to plan and prioritize my work. Now I pace myself better and focus on finishing the most important tasks first, which has helped me become more balanced and efficient.


How Do You Relieve Stress?

When I feel stressed, I try to handle it in a healthy and productive way. First, I take a short break to clear my mind — even a short walk or a few minutes of quiet time helps me refocus. I also like to organize my tasks and set priorities. Once I have a clear plan, the stress usually goes down because I can see what needs to be done first. Outside of work, I relieve stress by exercising and spending time with my family or friends. These activities help me recharge and come back to work with more energy and focus.


What Is Your Life Motto?

My life motto is “Stay curious, stay humble, and keep growing.” I believe learning never stops, no matter how much experience you have. Staying curious helps me discover new ideas, staying humble keeps me open to feedback, and continuous growth gives me purpose in both my career and personal life.


Machine Learning Model at Powertech

When I worked at Hyundai Powertech, we produced car transmissions. Each transmission needed a small gasket to fill the gap between parts, but the gap size differed for each transmission: sometimes 1 mm, 1.5 mm, or 2 mm. Because of this, the company had to keep every gasket size in stock, which wasted storage space and money. I collected production logs from the machines on the floor, analyzed them, and trained machine learning models to predict which gasket size would be needed for each transmission. Among several models, XGBoost performed the best. Using this prediction model, we reduced inventory levels and saved costs by ordering only the needed gasket sizes. (XGBoost is an efficient, high-performance boosting algorithm that combines many small decision trees to make strong, accurate predictions.)


Final Remarks (Questions for the Interviewer)

  • May I ask which technologies your team works with most often, and what types of projects are currently the main focus?
  • May I ask what qualities you think are most important to succeed in this position?
  • May I ask which projects are currently the highest priority?


Python Type Comparison

Strings are immutable but maintain order and allow duplicate characters. Lists are mutable, ordered, and allow duplicates. Tuples are similar to lists but immutable. Dictionaries are mutable and ordered (as of Python 3.7+), but their keys must be unique. Sets are mutable but unordered and do not allow duplicate elements.
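
A quick sketch illustrating these differences:

s = "hello"            # str: immutable, ordered, duplicates allowed
# s[0] = "H"           # TypeError: str does not support item assignment

lst = [1, 2, 2, 3]     # list: mutable, ordered, duplicates allowed
lst[0] = 99

tup = (1, 2, 2, 3)     # tuple: ordered, duplicates allowed, but immutable
# tup[0] = 99          # TypeError: tuple does not support item assignment

d = {"a": 1, "b": 2}   # dict: insertion-ordered (3.7+), keys must be unique
d["a"] = 10            # mutable: values can be replaced

st = {1, 2, 2, 3}      # set: mutable, unordered, duplicates removed -> {1, 2, 3}
print(lst, tup, d, st)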