Thank you for having me for this interview. My name is Sunghwan Ki, but you can call me Danny.
I work as a Data Engineer with six years of experience building ETL processes, especially in the financial industry.
Currently, I lead projects that use Kafka, Oracle, and Spark, where I focus on near real-time data processing and optimization.
I primarily use Python to build data pipelines, and I recently completed a project where I built a data warehouse using AWS Glue and Redshift.
Before joining PNC, I spent roughly seven years working in data analytics, where I primarily used Tableau and MySQL to analyze data.
To keep improving, I completed a Master’s degree in Data Science last year. I also hold AWS certifications and continue to pursue additional cloud-related credentials to further strengthen my expertise.
I’ve truly enjoyed my time at PNC. Over more than six years there, I’ve worked on meaningful projects and improved my technical skills. Now I feel ready for a new challenge that allows me to grow further.
Technology is evolving faster than ever, and I want to keep learning and developing new skills.
For me, it’s not about leaving something behind; it’s about taking the next step toward work I’m truly passionate about.
Boeing is one of the world’s largest aerospace and defense companies. It designs and builds commercial airplanes like the 737 and 777.
Also, Boeing’s work connects people, supports global transportation, and contributes to national security and space exploration.
That’s why I applied: I want to work for a company that builds products with real-world impact.
Currently, I am building a near real-time pipeline that ingests Kafka topic data into Oracle Exadata and then into the Hadoop platform.
In the past, stakeholders had to rely on the previous day’s data to make decisions. With this new pipeline, data from Kafka is ingested into Hadoop every 10 minutes and then visualized through Tableau dashboards.
This project cut data latency from roughly a day to about 10 minutes and helped the business team make faster decisions.
One of the biggest challenges I faced recently was with a Kafka-to-Hadoop data pipeline, where Oracle Exadata was used as a staging area.
Initially, the volume of data coming from Kafka was about 1 TB per day, but it suddenly increased to 3 or 4 TB per day. Even though the data was automatically deleted after being loaded into Hadoop, new data was arriving faster than it could be deleted, so Exadata started running out of space. To handle this, I first increased the number of Spark jobs to speed up data movement into Hadoop, but that slowed Exadata down and created a bottleneck. Then I thought about compressing the data on the Exadata side, and I discovered that Exadata has a built-in compression feature. The best part is that no separate decompression step is needed when the data is moved to Hadoop. Using this compression, I reduced the data size in Exadata by almost 70%. After that, I reduced the number of Spark jobs, which helped Exadata run better and stabilized the pipeline.
I remember a project where we were integrating data from multiple sources into a central data warehouse.
The challenge was that one of the upstream systems frequently changed its schema without notice, which caused our ETL jobs to fail and delayed reporting for business users. My responsibility was to make the pipeline more resilient so that these schema changes would not break the entire data flow. I implemented a schema validation and auto-adjustment process: I updated the code to compare each incoming data schema against our expected schema. If a non-critical change occurred, such as a new column being added, the pipeline adapted automatically without failing. For critical mismatches, the system flagged the issue, generated an incident, and applied fallback logic so the rest of the data continued to be processed. This reduced ETL job failures by more than 90% and ensured that the business team continued to receive data even when an upstream system changed its schema unexpectedly.
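To make the schema comparison concrete, here is a minimal sketch in plain Python. It is not the production code: the column names, the type names, and the choice of which columns count as critical are all hypothetical, and a real pipeline would read the schemas from Spark or a catalog instead of hard-coded dictionaries.

```python
from dataclasses import dataclass, field

# Hypothetical expected schema: column name -> type name.
EXPECTED_SCHEMA = {"account_id": "int", "amount": "decimal", "txn_date": "date"}
# Hypothetical set of columns the downstream jobs cannot run without.
CRITICAL_COLUMNS = {"account_id", "amount"}


@dataclass
class SchemaDiff:
    added: list = field(default_factory=list)         # new upstream columns
    missing: list = field(default_factory=list)       # expected but absent
    type_changed: list = field(default_factory=list)  # same name, new type

    @property
    def is_critical(self):
        """True when a critical column is missing or has changed type."""
        affected = set(self.missing) | set(self.type_changed)
        return bool(affected & CRITICAL_COLUMNS)


def compare_schemas(incoming, expected=EXPECTED_SCHEMA):
    """Compare an incoming schema dict against the expected one."""
    shared = set(incoming) & set(expected)
    return SchemaDiff(
        added=sorted(set(incoming) - set(expected)),
        missing=sorted(set(expected) - set(incoming)),
        type_changed=sorted(c for c in shared if incoming[c] != expected[c]),
    )


# A new column alone is non-critical; a type change on "amount" is critical.
extra_column = compare_schemas({**EXPECTED_SCHEMA, "channel": "string"})
changed_type = compare_schemas({**EXPECTED_SCHEMA, "amount": "string"})
print(extra_column.added, extra_column.is_critical)
print(changed_type.type_changed, changed_type.is_critical)
```

In this sketch the caller decides what to do with the result: continue the load when `is_critical` is false, or raise an incident and switch to fallback logic when it is true.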
Valid records are loaded into the main HDFS path, and invalid records are redirected to a separate reject HDFS path. A record is rejected in cases such as:
When a string value like “ABC” appears in a numeric column.
When a NULL value is provided for a NOT NULL column.
When a date column receives a value in a completely different or invalid format.
When the data type is correct but the length exceeds the limit (e.g., a 50-character string for a VARCHAR2(10) column).
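The reject cases above can be sketched as a small record validator. This is an illustration only: the column names and rules are hypothetical, the date format is assumed to be YYYY-MM-DD, and the length check counts characters as an approximation of a VARCHAR2 limit. In the real pipeline, the two result lists would correspond to the main and reject HDFS paths.

```python
from datetime import datetime

# Hypothetical column rules: (name, kind, nullable, max_length).
RULES = [
    ("customer_id", "number", False, None),
    ("name", "string", False, 10),      # mirrors a VARCHAR2(10) column
    ("open_date", "date", True, None),  # assumed format: YYYY-MM-DD
]


def validate_record(record):
    """Return a list of rule violations; an empty list means the record is valid."""
    errors = []
    for name, kind, nullable, max_len in RULES:
        value = record.get(name)
        if value is None:
            if not nullable:
                errors.append(f"{name}: NULL in NOT NULL column")
            continue
        if kind == "number":
            try:
                float(value)
            except (TypeError, ValueError):
                errors.append(f"{name}: non-numeric value {value!r}")
        elif kind == "date":
            try:
                datetime.strptime(str(value), "%Y-%m-%d")
            except ValueError:
                errors.append(f"{name}: invalid date {value!r}")
        elif kind == "string" and max_len is not None and len(str(value)) > max_len:
            errors.append(f"{name}: length {len(str(value))} exceeds {max_len}")
    return errors


def route(records):
    """Split records into (valid, rejected); rejected entries keep their reasons."""
    valid, rejected = [], []
    for rec in records:
        errors = validate_record(rec)
        if errors:
            rejected.append({"record": rec, "errors": errors})
        else:
            valid.append(rec)
    return valid, rejected
```

Keeping the rejection reasons next to each rejected record makes it easier to report back to the upstream team which rule was violated.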
When data is stored in Oracle: A CLOB (Character Large Object) is a data type used to store very large text data.
Data warehouse design (e.g., experience with Amazon Redshift, Snowflake)
I have experience using Redshift to build a cloud data warehouse. In one of my projects, I built an analytics pipeline to process and analyze mobile user login data.
To achieve this, I set up a pipeline where the log data was first stored in Amazon S3. From there, AWS Glue processed the data and loaded it into Redshift. Once the data was in Redshift, I used Amazon QuickSight to build interactive dashboards that visualized key user activity such as session duration, clickstream patterns, and device usage. This solution provided business stakeholders with actionable insights.
One of the challenges I ran into was that loading JSON files from S3 into Redshift was much slower than I expected. Because the data was in JSON, Redshift had to parse every row, and the file sizes were all different. This caused performance issues and even led to uneven data distribution across Redshift nodes.
To fix this, I redesigned the ingestion process in AWS Glue. I converted the JSON data into Parquet and wrote the files to S3 at a consistent size of around 128 MB. Because Parquet stores typed, columnar data, Redshift didn’t have to parse JSON text during the load, which significantly sped up the loading process. I also updated the DISTKEY and SORTKEY based on how the data was being queried, which helped prevent data skew and allowed Redshift to process data more evenly across all nodes.
Why create the data in 128 MB files? I stored the Parquet files in 128 MB chunks because Redshift performs best when it reads multiple files of similar size in parallel. Consistent file sizes help avoid data skew, reduce S3 overhead, and allow Redshift to distribute the workload evenly across all nodes, which results in much faster COPY performance.
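As a rough illustration of the sizing arithmetic, the sketch below picks an output file count for a dataset. It assumes a 128 MiB target and rounds the count up to a multiple of the cluster's slice count, following the general AWS loading guidance that the number of files be a multiple of the number of slices. The function name and the slice counts in the example are hypothetical.

```python
import math


def plan_output_files(total_bytes, slices, target_bytes=128 * 1024**2):
    """Return (file_count, average_file_bytes) for a dataset.

    The count starts from total/target and is rounded up to a multiple
    of `slices` so every slice gets the same number of files.
    """
    if total_bytes <= 0 or slices <= 0 or target_bytes <= 0:
        raise ValueError("total_bytes, slices and target_bytes must be positive")
    raw_count = math.ceil(total_bytes / target_bytes)
    file_count = math.ceil(raw_count / slices) * slices
    return file_count, total_bytes / file_count


# 10 GiB across 8 slices: 80 files of exactly 128 MiB each.
print(plan_output_files(10 * 1024**3, slices=8))
# 10 GiB across 12 slices: rounded up to 84 slightly smaller files.
print(plan_output_files(10 * 1024**3, slices=12))
```

In a Glue job, the resulting count would typically be passed to a repartition step before the Parquet write; that part is not shown here.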
Normalized data is typically used in OLTP systems. It separates data into multiple related tables to reduce redundancy and maintain data integrity. This helps ensure consistency during insert, update, and delete operations, but often requires multiple joins to retrieve data.
Denormalized data is more common in OLAP systems. It intentionally duplicates data by combining related fields into fewer tables, which improves read performance and speeds up complex analytical queries.
In most data warehouses, the Star Schema is used because it provides high query performance, especially for analytical workloads, and has a simple structure consisting of a central fact table connected to denormalized dimension tables. This simplicity also makes it well suited for BI tools like Tableau or Power BI.
The Snowflake Schema is also used, especially when storage efficiency or data normalization is a higher priority. However, it tends to introduce more joins, which can slow queries down, so the Star Schema is generally preferred in data warehouse environments.
When it comes to data quality, I apply validation at multiple points in the pipeline.
During ingestion, I perform schema validation and basic checks for null values, data types, and duplicates. As the data moves through transformations, I apply additional business-rule validations to ensure the results make sense before loading them into the data warehouse.
In addition, I worked closely with the business team to define what “good data” means for their use cases, and I made sure the Tableau dashboards reflected reliable information for decision-making.
In one project, I worked with the fraud prevention team, where my role was to deliver data they could fully trust. For them, “good data” meant accurate, up-to-date, and reliable information without duplication or errors. Because the quality of the data directly impacted their fraud detection models, I focused not only on data delivery but also on maintaining high quality through validation and monitoring.
I’ve designed and implemented both batch and near real-time ETL pipelines. For near real-time workloads, I built pipelines that ingest Kafka streaming data into Oracle Exadata and Hadoop every ten minutes. I used PySpark for transformations and CA7, a mainframe-based scheduler, to orchestrate the dependencies across these jobs. CA7 ensured that each PySpark workflow ran in the correct sequence and at the right time, which was critical for the batch operations.
I also have experience building cloud-native ETL solutions. In one project, I used AWS Glue Studio to design the ETL workflows. Glue’s built-in transformations and job orchestration features made it easier to manage the logic.
I used Alteryx for about a year in the past, mainly for data preparation and automation tasks such as joining datasets, performing aggregations, and creating analytical outputs. However, I didn’t use it as the primary tool in any large-scale enterprise project. With larger datasets, I noticed that Alteryx tends to slow down, since it’s not optimized for big-data workloads. These days, I mainly work with AWS Glue Studio, whose jobs run on Apache Spark and provide much better performance and scalability for heavy ETL processing.
I have around two years of experience with Teradata, mainly using it together with a Hadoop system. In our setup, Hadoop stored the customer account data, and Teradata accessed that data through QueryGrid, which allowed us to easily combine and query Hadoop datasets. We also connected Tableau to Teradata and set up hourly refreshes so the dashboards always reflected the latest customer account insights.
Problem in use: network latency. Network latency was the biggest issue we faced. Because Teradata had to retrieve large volumes of detailed data from Hadoop over QueryGrid, there were many cases where the query didn’t return on time or failed altogether. To address this, we changed the approach so that heavy processing happened in Hadoop first. We aggregated and filtered the data using Spark, and then used QueryGrid only to bring back a much smaller dataset. This significantly reduced the amount of data being transferred, which helped avoid latency issues and made overall query performance much more stable.
One of the most common issues I faced with Teradata was slow query performance when working with large tables, especially when the table wasn’t partitioned. In those cases, Teradata had to scan the entire table, which made daily jobs take much longer than expected.
To fix this, I added date-based partitions so Teradata only scanned the specific partition needed for each query. This small change made a big difference: the queries became much faster and more stable. It also helped reduce the load on the system and improved overall performance.
Teradata uses an MPP architecture in which data is distributed across multiple AMPs (Access Module Processors). Each AMP stores and processes its own portion of the data independently, which enables parallel execution. Because of this distribution mechanism, Teradata can handle large volumes of data with high performance.
A Primary Index determines how data is distributed across AMPs. Choosing the right PI is crucial because it ensures even data distribution, and a well-chosen PI improves join performance and overall query efficiency. (Internally, when Teradata stores a row, it applies a hash function to the Primary Index column to produce a hash value, and it uses that value to decide which AMP stores the row.)
CREATE TABLE customer (
    customer_id INTEGER,
    name VARCHAR(100)
) PRIMARY INDEX (customer_id);
Data skew occurs when data is unevenly distributed across AMPs, causing some AMPs to process significantly more data than others, which slows queries down. To handle data skew, I typically review the PI selection and look for columns with unique or near-unique values. Sometimes, creating a multicolumn PI can help balance the distribution.
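To show why a low-cardinality PI causes skew, here is a small simulation in plain Python. It is only an analogy: MD5 stands in for Teradata's internal row hash, the AMP count of 4 is arbitrary, and the skew measure (1 - average / maximum) is one commonly used convention, not an official formula.

```python
import hashlib
from collections import Counter


def amp_for(value, n_amps):
    """Map a Primary Index value to an AMP number using a stand-in hash."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_amps


def skew_percent(pi_values, n_amps):
    """Skew as (1 - avg rows per AMP / max rows on any AMP) * 100."""
    counts = Counter(amp_for(v, n_amps) for v in pi_values)
    if not counts:
        raise ValueError("pi_values must not be empty")
    per_amp = [counts.get(amp, 0) for amp in range(n_amps)]
    average = sum(per_amp) / n_amps
    return (1 - average / max(per_amp)) * 100


# Unique IDs spread almost evenly, so the skew is small.
print(f"unique PI skew: {skew_percent(range(10_000), 4):.1f}%")
# A single repeated value puts every row on one AMP: 75.0% with 4 AMPs.
print(f"single-value PI skew: {skew_percent(['US'] * 10_000, 4):.1f}%")
```

The same reasoning explains why adding a second column to the PI can help: it raises the number of distinct hash inputs, so rows spread across more AMPs.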
CREATE TABLE customer_new AS customer WITH NO DATA PRIMARY INDEX (new_column);
A Secondary Index is useful when frequently queried columns are not part of the Primary Index. It accelerates data access without re-distributing data. However, because Secondary Indexes require additional maintenance, I usually add them only when a business-critical query pattern consistently needs optimization.
-- Add a secondary index (SI)
CREATE INDEX (column_name) ON your_table_name;
Partitioning allows tables to be divided into manageable segments, usually based on date. This improves query performance because Teradata only scans relevant partitions instead of the whole table. I commonly used date-based partitioning.
I think ETL pipelines between Alteryx and Teradata are built using Alteryx’s In-DB tools. Alteryx generates SQL and pushes all heavy transformations to the Teradata MPP engine, which handles large-scale joins and aggregations efficiently. Alteryx simply orchestrates the workflow, while Teradata performs the actual processing. This approach combines the ease of use of Alteryx with the scalability of Teradata.
Regarding Neo4j
Although I haven’t used Neo4j in production, I’m interested in graph databases, and I would welcome the opportunity to learn and apply Neo4j in future projects.
Currently, I work on a data integration project with team members from the U.S., India, and Europe. At first, coordination was difficult because of time zone differences and different communication styles. To improve collaboration, I organized short daily sync meetings that overlapped our working hours and encouraged open discussions so everyone could share progress or blockers. I also started sending clear written summaries after each meeting so teammates in different time zones could stay updated. As a result, we reduced misunderstandings and improved task handoffs between regions.
In one of my projects, I worked closely with software engineers and business analysts to improve how we tracked and analyzed user behavior. The engineers were responsible for sending user activity data into our database, and my role was to clean and transform that data so it could be used for reporting and analysis. I noticed that each team had slightly different definitions for key metrics, like “active users” or “sessions,” which caused confusion in reports. So, I organized a short meeting to align on clear definitions and updated our data dictionary to make sure everyone used the same terms. After that, the reports became much more consistent, and the business team was able to make decisions faster and with more confidence. It was a great experience showing how clear communication and teamwork can really improve data quality and trust.
When I face tight deadlines or high-pressure situations, I stay calm and break the work into smaller parts. For example, in one project, our team had to build a new ETL workflow in less than two weeks because of a last-minute client request. Instead of stressing out, I focused on what was most important, assigned tasks clearly, and set up short daily check-ins to track progress. I also kept open communication with both the team and stakeholders, making sure everyone understood what we could realistically deliver. By staying organized and working together, we completed the project on time with great results. This experience taught me that under pressure, clear priorities, steady communication, and teamwork are the keys to success.
I usually use pytest for unit testing in Python. It’s simpler and more readable than the built-in unittest module, and it lets me write tests quickly without creating test classes. In pytest, test functions simply start with test_, and I use the assert statement to verify the results.
import pytest
from calculator import add, divide  # functions defined in calculator.py
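Since the tests themselves are not in my notes, here is a minimal sketch of what they could look like. The bodies of add and divide are assumptions about calculator.py, inlined so the example is self-contained, and the tests use only plain assert so they also run without pytest installed.

```python
# Assumed contents of calculator.py, inlined for a self-contained sketch.
def add(a, b):
    return a + b


def divide(a, b):
    if b == 0:
        raise ValueError("cannot divide by zero")
    return a / b


# pytest collects any function whose name starts with "test_".
def test_add():
    assert add(2, 3) == 5


def test_divide():
    assert divide(10, 4) == 2.5


def test_divide_by_zero():
    # With pytest available, `with pytest.raises(ValueError):` is the
    # more idiomatic way to write this check.
    try:
        divide(1, 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Running `pytest` in the directory that contains this file would discover and run the three test functions.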
During a new ETL release, one of our reports was showing incorrect revenue numbers. After investigating, I found that the issue came from an incorrect currency conversion in the transformation logic. I quickly fixed the script, reprocessed the data, and added automated checks that compare daily results with historical trends. After that, the data became much more accurate, and the same issue never happened again.
This experience reminded me how important it is to validate data thoroughly before going to production.
In my current role, we have a dedicated security and compliance team that handles overall data governance. If I need access to certain sensitive databases or tables, I first have to get approval from that team. This helps ensure that only the right people have access to the data. On the data engineering side, I am responsible for protecting sensitive data during our ETL processes. That means identifying PII and masking it, so even if someone sees the data, it’s not readable. We also strictly follow the principle of least privilege and assign only the minimum required permissions.
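As an illustration of masking, the sketch below shows two common approaches using only the Python standard library: partial masking for display, and a keyed hash so a value stays joinable without being readable. The function names, the sample account number, and the key are hypothetical, and this is not a description of our actual masking rules.

```python
import hashlib
import hmac


def mask_account(number, visible=4):
    """Keep the last `visible` characters and replace the rest with '*'."""
    text = str(number)
    if len(text) <= visible:
        return "*" * len(text)
    return "*" * (len(text) - visible) + text[-visible:]


def pseudonymize(value, secret_key):
    """Keyed SHA-256 hash: the same input and key always give the same token."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


print(mask_account("1234567890"))  # ******7890
token = pseudonymize("jane.doe@example.com", secret_key=b"demo-key-only")
print(len(token))  # 64 hex characters
```

In practice the key would come from a secrets manager, not from the code, because anyone holding the key can recompute tokens for guessed values.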
A few months ago, there was a configuration issue during a production release, and it delayed the data. At that time, my manager was not available, but the fix required his approval. I quickly analyzed the impact, documented the issue and solution, and escalated it to the department lead for temporary approval. I clearly communicated the risk and rollback plan, implemented the fix, and monitored the results until the system was stable. When my manager returned, I shared a full summary and documentation for review. This taught me how to act responsibly under pressure. I also learned to make quick but careful decisions and to keep communication clear with others.
Yes, I’ve experienced disagreements within the team. One example was during a near real-time data pipeline project. We were loading Kafka data into Hadoop, and the pipeline often missed our 10-minute SLA; sometimes it took over 20 minutes. Some team members wanted to focus on improving Spark performance, while others, including me, thought the main issue was data quality because of inconsistent records and schema mismatches. To find the real cause, I checked the logs and monitoring reports and found that about 70% of the delays were due to data validation errors, not Spark processing speed. Based on that insight, I proposed a short proof of concept in the QA environment that added stronger schema validation and fallback rules, and it worked. Once the team saw the data and the results, everyone agreed to proceed with the quality improvements first and revisit performance tuning afterward. After we implemented the change in production, the number of failures dropped significantly and we met our SLA again. This experience taught me that using measurable data, clear communication, and structured tests is much more effective than letting opinions dominate technical decisions.
If I don’t agree with my manager’s opinion, I first make sure I fully understand their reasoning and goals. Then I share my perspective, supported by data or examples rather than emotion. For instance, in one project, my manager suggested running a full data reload every day to ensure data completeness. I understood his concern but also knew it would be inefficient, since most of the customer master data rarely changed. So I analyzed the update patterns and designed a process to refresh only the partitions containing changed customer records, instead of reloading the entire dataset. After testing it together in staging, we confirmed that this approach maintained accuracy while reducing runtime from over 3 hours to about 40 minutes. That experience taught me that presenting clear evidence and focusing on the common goal, rather than on who’s right or wrong, helps turn disagreements into productive discussions.
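The idea of refreshing only the partitions that contain changed records can be sketched in a few lines of plain Python. The field names, partition keys, and dates below are hypothetical, and in the real job this selection would be done with a query against change metadata, not an in-memory list.

```python
from datetime import date


def partitions_to_refresh(changes, last_run):
    """Return the sorted partition keys holding rows updated after last_run."""
    return sorted({c["partition"] for c in changes if c["updated_on"] > last_run})


# Hypothetical change log: which partition each changed customer row lives in.
changes = [
    {"customer_id": 1, "partition": "2024-05", "updated_on": date(2024, 6, 2)},
    {"customer_id": 2, "partition": "2023-11", "updated_on": date(2024, 1, 9)},
    {"customer_id": 3, "partition": "2024-05", "updated_on": date(2024, 6, 3)},
    {"customer_id": 4, "partition": "2022-02", "updated_on": date(2024, 6, 1)},
]

stale = partitions_to_refresh(changes, last_run=date(2024, 6, 1))
print(stale)  # ['2024-05']
```

Only the partitions in the returned list are reloaded; every other partition is left untouched, which is where the runtime saving comes from.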
In one of my previous projects, I worked closely with a senior engineer whose personality was quite different from mine. I’m generally organized and prefer to plan tasks carefully before execution, but he preferred a more spontaneous, “just try it and fix it later” approach. At first, this difference caused some tension because I wanted to review the design and test cases before deployment, while he wanted to move fast to meet deadlines. Instead of arguing, I suggested we combine both styles: I would document the structure and validation rules, and he could focus on rapid prototyping. By dividing responsibilities that way, we were able to deliver the pipeline faster and maintain quality. Over time, I also learned to be more flexible and open to trying quick experiments, and he started to appreciate the value of planning and testing as well. That experience taught me that personality differences can actually strengthen a team when you focus on complementary strengths rather than conflicts.
In one project, I worked with a junior data engineer who was new to our Spark-based ETL environment. She was struggling to understand how our partitioning and scheduling logic worked, and her jobs often failed in production. Instead of just fixing it for her, I scheduled a short session to walk her through how the Spark job read data from Oracle and wrote it into the Hadoop platform. I also helped her debug one of her failing jobs step by step and showed her how to check logs and handle schema mismatches. Within a few weeks, she became confident enough to manage her own pipelines and even automated some validation scripts. Seeing her grow and succeed made me realize that helping others not only strengthens the team but also improves overall project quality.
I usually start by understanding the impact of each task. If something is urgent and affects business operations or other teams, I handle it first. But I also make sure not to ignore important long-term work. For example, once our production ETL job failed right before a reporting deadline. I paused my ongoing optimization task, fixed the ETL issue immediately, and restored the pipeline so the business could get their reports on time. After that, I resumed my optimization work. I believe good prioritization means balancing immediate needs with long-term improvements.
One of the most challenging technical problems I faced was optimizing a large Spark ETL job that processed about 1.5 TB every day. The job was taking more than 5 hours to complete, which caused delays in our downstream dashboards and reports. I started by analyzing the Spark UI and noticed heavy shuffling and many small output files. To fix it, I adjusted the partition strategy, used broadcast joins for smaller tables, and combined small files before writing to Hadoop. I also added data filtering early in the pipeline to reduce unnecessary computation. After these changes, the runtime dropped from 5 hours to under 3 hours, and the cluster cost was reduced by almost 30%. That experience taught me how small technical optimizations can have a big business impact when working with large-scale data.
I remember that one of our daily ETL jobs was taking more than 5 hours to finish. It processed a large amount of log data from multiple sources, and sometimes it even failed because of memory issues. I reviewed the Spark job and found that it was using too many small files and unnecessary joins. I optimized the job by adjusting the partition size, adding proper filters early in the transformation, and combining small files before loading. After the changes, the job ran in less than 3 hours and became much more stable. This improvement not only saved computing costs but also made our data available earlier for reporting every morning.
When a production issue happens, I stay calm and focus on finding the root cause quickly. For example, one night our Kafka-to-Hadoop pipeline failed, and the business dashboards in Tableau were missing data the next morning. I immediately checked the Kafka Connect logs and found that the sink connector had stopped due to a network issue. I manually restarted the connector and confirmed that the data started flowing again. Afterward, I created a monitoring script using curl commands that checks the connector status every 10 minutes. If it fails, the script automatically creates an incident and sends an alert to our team. This experience taught me the importance of not only fixing issues quickly but also building automation to prevent the same problem from happening again.
In one project, I noticed that one of our data pipelines sometimes failed overnight, but the team would only find out the next morning. This caused delays in daily reports and frustration for analysts waiting for updated data. Even though it wasn’t part of my assigned tasks, I decided to create an automatic alert system. I built a small Python script that checked job completion logs and created an incident (INC) if a failure occurred. After testing it for a week, I presented it to the team, and we integrated it into our pipeline. Since then, we’ve been able to respond to failures immediately, reducing downtime and improving data reliability. That experience taught me the value of being proactive: small improvements can make a big difference to the whole team.
When designing Kafka pipelines, I focus on a few key areas to ensure performance and reliability. First, I choose the right topic partitioning strategy based on data size. Then I make sure that the Kafka connectors are properly configured with retry mechanisms in case of failures. For monitoring, I built a script that uses a curl command to check the status of all Kafka sink connectors every 10 minutes. If one of the connectors is down or there’s an issue with the Kafka broker, the script automatically generates an incident, which triggers an alert to my team. This setup helped us catch issues early and significantly reduced downtime.
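The status check itself can be sketched in Python. The sketch assumes the JSON shape returned by the Kafka Connect REST API's `GET /connectors/{name}/status` endpoint (a connector state plus a list of task states). The connector name and worker addresses are made up, and both the HTTP call and the incident creation are left out, so only the evaluation logic is shown.

```python
import json


def failed_components(status):
    """List connector and task entries in a status payload that are not RUNNING."""
    problems = []
    name = status.get("name", "<unknown>")
    connector_state = status.get("connector", {}).get("state")
    if connector_state != "RUNNING":
        problems.append(f"{name}: connector is {connector_state}")
    for task in status.get("tasks", []):
        if task.get("state") != "RUNNING":
            problems.append(f"{name}: task {task.get('id')} is {task.get('state')}")
    return problems


# Hypothetical response body for a sink connector with one failed task.
sample = json.loads("""
{
  "name": "hdfs-sink",
  "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
  "tasks": [
    {"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    {"id": 1, "state": "FAILED", "worker_id": "10.0.0.2:8083"}
  ]
}
""")

for problem in failed_components(sample):
    print(problem)  # hdfs-sink: task 1 is FAILED
```

Checking task states as well as the connector state matters because a connector can report RUNNING while one of its tasks has failed. Note that this sketch also reports PAUSED components, which may or may not deserve an incident.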
Yes, I made a mistake during a data validation process. I was in charge of checking the output of a new ETL job before it went live. I verified the total record count but forgot to double-check the column-level transformations. After deployment, we found that one column had an incorrect currency conversion rate, and it caused wrong numbers to show up in a few business reports. As soon as I realized the issue, I corrected the transformation logic, reprocessed the data, and updated the reports. After that, I added column-level validation rules and a simple Python script that automatically compares key fields between the source and target tables before deployment. That experience taught me how even a small mistake can affect business reports, so now I always check both the data structure and actual values carefully before sign-off.
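A column-level check like the one described can be sketched as follows. The field names, the currencies, and especially the conversion rates are hypothetical values chosen for the example, and the real script compared source and target tables, not in-memory lists.

```python
def find_conversion_mismatches(source_rows, target_rows, rates, tolerance=0.01):
    """Compare each target amount with source amount * rate, matched on 'id'."""
    target_by_id = {row["id"]: row for row in target_rows}
    mismatches = []
    for src in source_rows:
        tgt = target_by_id.get(src["id"])
        if tgt is None:
            mismatches.append((src["id"], "missing in target"))
            continue
        expected = src["amount"] * rates[src["currency"]]
        if abs(expected - tgt["amount_usd"]) > tolerance:
            detail = f"expected {expected:.2f}, got {tgt['amount_usd']:.2f}"
            mismatches.append((src["id"], detail))
    return mismatches


# Hypothetical rates and rows; row 2 was loaded without conversion.
rates = {"EUR": 1.10, "GBP": 1.25}
source = [
    {"id": 1, "amount": 100.0, "currency": "EUR"},
    {"id": 2, "amount": 50.0, "currency": "GBP"},
    {"id": 3, "amount": 20.0, "currency": "EUR"},
]
target = [
    {"id": 1, "amount_usd": 110.0},
    {"id": 2, "amount_usd": 50.0},
]

for row_id, detail in find_conversion_mismatches(source, target, rates):
    print(row_id, detail)
```

A record-count check alone would have missed row 2 here, since the row exists in the target with a wrong value; that is exactly the gap the column-level comparison closes.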
Yes, I’ve experienced that before.
In one project, our data ingestion pipeline was delayed because the upstream system changed its schema without notice. This caused our ETL jobs to fail and delayed daily reports for the client. As soon as the client raised concerns, I explained the issue clearly, shared the revised delivery plan, and sent daily updates so they could see our progress. Meanwhile, I worked with my team to add automatic schema validation and fallback logic in the pipeline, so future schema changes wouldn’t break the process again. After we implemented the fix, the pipeline became more stable, and the client appreciated our quick communication and the long-term solution we put in place.
In one project, we had a very tight deadline to deliver a new ETL workflow for daily reporting. However, a few tasks were delayed because some external data sources were not delivered on schedule, and that caused downstream jobs in Spark to fail during testing. To get things back on track, I took the initiative to organize short daily stand-up meetings and created a shared progress tracker in Confluence so everyone — including the data and QA teams — could see real-time task status. This helped us identify blockers early, communicate clearly, and reassign tasks based on team availability. Within a week, we recovered the lost time and successfully completed the workflow before the deadline. The reporting system went live as planned, and we avoided last-minute production issues. That experience taught me that strong coordination and clear communication are just as important as technical skills when leading a project under tight timelines.
During a data migration project, we faced a serious issue when one of the ETL jobs started failing right before a major release. Everyone was under pressure, and the team was unsure how to proceed. Even though I wasn’t the official lead, I took the initiative to organize an emergency meeting with the data, QA, and infrastructure teams. I divided the investigation into parts — one team checked data source changes, another team looked at schema issues, and I focused on debugging the Spark job logic. After identifying that a data type mismatch in one column was causing the failure, we quickly fixed it and ran validation tests together. The release went smoothly, and my manager later recognized my leadership for coordinating the teams under tight deadlines. That experience taught me that real leadership often means stepping up and guiding the team toward a solution — even without having a formal title.
A few months ago, I noticed that one of our nightly ETL jobs in production was running slower and occasionally failing, even though it wasn’t part of the pipelines I was directly responsible for. Instead of ignoring it, I decided to investigate on my own time because it was delaying downstream reports for the business team. After checking the logs, I found that the job was performing a full table scan on a very large dataset every night. To fix the issue, I first added a new LOAD_DATE column to the target table to track daily data loads. Then I rewrote the logic to process only new and updated records based on this column, and I created partitions on LOAD_DATE to improve query performance and data management efficiency. After validating the logic, I worked with the scheduler team to test and deploy the fix safely. The result was dramatic: runtime dropped from over 3 hours to under 40 minutes, and the business team could access their dashboards much earlier every morning. Even though it wasn’t my assigned task, I took ownership because I knew the issue was affecting overall business operations. That experience taught me that going above and beyond means proactively solving problems that impact the team, not just completing my own tickets.
When a customer requests changes right before the final release, I believe it’s important to balance flexibility with stability. First, I listen carefully to understand why the change is needed — whether it’s a business-critical fix or just a nice-to-have improvement. Then I assess the impact on scope, timeline, and quality. If the change is minor and doesn’t risk the release, I coordinate quickly with the team to implement and test it. However, if the change is major or could affect stability, I clearly communicate the risks and propose alternatives — such as including it in the next patch or minor release. The most important thing is to be honest about the situation, and focus on finding solutions. This way, the customer knows you’re listening, and the project stays on track.
For example, in one project, a client requested a schema change right before deployment. I analyzed the dependency, explained that it would delay the release by two days, and suggested deploying the current version first and adding the change in the next release. The client agreed, and we delivered on time without compromising quality.
When customers often ask for changes, I try to handle it in a clear way. First, I listen carefully to understand why they want the change — maybe their business needs have changed or something wasn’t clear before. Then I explain what the change means for the project — like how it might affect the schedule or workload — so they can decide what’s most important. If there are many small requests, I suggest grouping them together or saving them for the next update.
In one project, the customer kept asking for new data checks. I made a simple list to track all the requests and talked with them once a week to decide which ones to do first. That way, they felt listened to, and our team could work in an organized way without confusion.
Yes, I always try to explain technical topics in a simple and clear way, especially when talking to non-technical people. I focus on using everyday language instead of technical terms, and I give real examples that relate to their work or daily life. For example, when explaining data pipelines, I might say it’s like a factory line — data comes in as raw material, goes through cleaning and transformation, and comes out as a finished product ready for analysis. I believe being able to translate complex ideas into simple concepts is an important skill for teamwork and communication.
One of the most common issues I encounter is out-of-memory (OOM) errors. To address them, I first review the PySpark code to identify any actions such as collect() or toPandas() that might be pulling too much data into the driver. If I find them, I either remove or replace them. I also use broadcast joins when one side of the join is a small table, which avoids shuffling the large table and can reduce memory usage. Another important step is avoiding Python UDFs whenever native Spark SQL functions can do the same work. Additionally, when I need to reuse intermediate results, I cache them with the MEMORY_AND_DISK storage level so that data that doesn’t fit in memory spills to disk instead of overwhelming it. Finally, I adjust the number of partitions using coalesce() or repartition() to optimize resource usage during shuffle operations. By applying these techniques, I’ve been able to effectively prevent and troubleshoot memory-related issues in Spark jobs.
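To explain why a broadcast join avoids a shuffle, I can use a small plain-Python illustration. This is a conceptual sketch, not Spark code: the function name, the list-of-lists "partitions", and the sample column names are all made up for the explanation. In PySpark itself I would simply wrap the small DataFrame in broadcast() from pyspark.sql.functions.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of a large table against a full copy of a small one."""
    # Build the lookup once; in Spark this copy is shipped to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition is joined where it already lives, so no rows of the
        # large table have to move between partitions (no shuffle).
        joined = [
            {**big, **small}
            for big in partition
            for small in lookup.get(big[key], [])
        ]
        joined_partitions.append(joined)
    return joined_partitions
```

The trade-off is that every worker holds the whole small table in memory, so this only helps when that table really is small.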
I had a situation where our SLA required the data to be fully available by 6 AM. One day, the volume of source data suddenly increased to almost three times the usual amount, and as a result our Spark job didn’t finish until 8 AM. To fix this, I increased the number of partitions to allow more parallel processing. I also reviewed our resource settings and made sure the job had enough memory and CPU by adjusting the scheduler pool and the YARN ResourceManager configuration.
After these changes, the job completed before 6 AM the next day, and we were able to meet the SLA again. This experience helped me understand how important it is to tune Spark jobs and monitor them carefully, especially when data volume suddenly increases.
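If I am asked how I decide on the number of partitions, I can describe a simple sizing rule like the sketch below. The function name and the 128 MB target per partition are my own assumptions (a common rule of thumb, not a Spark default I am quoting), and the input sizes used with it are illustrative rather than the real job's numbers.

```python
import math


def suggest_partitions(input_bytes, target_partition_bytes=128 * 1024**2, min_partitions=1):
    """Return a partition count that keeps each partition near the target size."""
    if input_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target must be positive")
    # Round up so the last partial chunk still gets its own partition.
    return max(min_partitions, math.ceil(input_bytes / target_partition_bytes))
```

With this rule, tripling the input size triples the suggested partition count, which matches what I did when the source volume jumped.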
I have experience building ETL workflows using AWS Glue. In one of my projects, I built an analytics pipeline to process and analyze mobile user log data. I used Glue to extract data from S3, transform it, and load it into Redshift every 10 minutes. I also used Glue Crawlers to automatically detect schema changes and keep the Data Catalog updated so the data could be queried in Athena as well.
We currently use CA7 on the mainframe, along with PySpark scripts, for most of our ETL processes. CA7 is a mainframe-based job scheduling and workflow automation tool. It's used to manage and schedule batch jobs, ensuring that tasks run in the right order and at the right time. We have not changed this orchestration tool because our data workflows have been integrated into the CA7 scheduling system for a long time, and switching would introduce additional operational costs. Our team continues to manage and monitor all ETL workflows within the CA7 environment.
Redshift is designed for high-speed querying using massively parallel processing (MPP), which makes it great for analyzing large datasets quickly. We can start small and scale up by increasing the node size or the number of nodes as the data grows. Data is stored in a columnar format, which speeds up analytical queries and reduces I/O.
When we execute queries that use SUM() or AVG(), or that filter on specific columns, the database only needs to read the relevant columns, not entire rows. This speeds up read performance in data warehousing and analytics. Since each column typically contains similar types of data, it compresses more efficiently than row-based data. And because only the selected columns are read, the amount of data scanned is reduced.
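A small plain-Python sketch helps me explain both points, reading only the needed column and compressing similar values. The sample rows, column names, and function names are invented for illustration, and run-length encoding is just one simple example of column compression, not a description of how Redshift stores any particular table.

```python
# Hypothetical sample data in row-oriented form.
ROWS = [
    {"user_id": 1, "country": "US", "amount": 10.0},
    {"user_id": 2, "country": "US", "amount": 20.0},
    {"user_id": 3, "country": "US", "amount": 5.0},
    {"user_id": 4, "country": "KR", "amount": 7.5},
    {"user_id": 5, "country": "KR", "amount": 2.5},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Compress runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(ROWS)
# An aggregate over "amount" touches one list and never reads the other columns.
total_amount = sum(columns["amount"])
# Similar values sitting next to each other compress into a few pairs.
country_encoded = run_length_encode(columns["country"])
```

In the row layout, the same SUM would have to walk through every full record just to pick out one field.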
In Snowflake, I can instantly clone entire databases or tables without duplicating the underlying data, which saves both cost and time. There is also a Time Travel feature that lets us query or restore data from a previous point in time, which is useful for recovering data.
On the Databricks side, I primarily work with the Azure-hosted version of Databricks. Recently, I developed an end-to-end scalable pipeline for computer vision anomaly detection. You can see its notebook and model on my portfolio website. I used PyTorch and Hugging Face to build and train the model.
Partitioning strategy depends on query patterns and data volume. In Oracle Exadata, I used range partitioning on a date column to support daily ingestion and to delete old data quickly by simply dropping a partition.
In Spark, I used dynamic partition overwrite with partitionBy("date") when writing Parquet files, and adjusted the number of partitions with coalesce or repartition commands to avoid creating too many small files.
In Redshift, I defined DISTKEY and SORTKEY based on the columns that were most frequently used in joins and filters, which helped improve query performance and reduce data movement across nodes.
In Snowflake, I rely on its automatic micro-partitioning feature, which divides data into small micro-partitions (roughly 50 to 500 MB of uncompressed data each, stored compressed) and optimizes storage and query performance without any manual intervention. However, for very large tables where queries frequently filter on specific columns, such as a date column, I define a clustering key to further improve performance, and this approach improved query speed as well.
I have strong experience in data modeling and architecture, especially in designing data pipelines at PNC. In Oracle Exadata, I designed the tables using range partitioning by day, so that Kafka data was automatically separated into daily partitions. This made it much easier to manage large volumes of data, speed up queries, and improve overall performance. For example, instead of using a traditional DELETE command, we could simply drop an entire partition when the data was no longer needed. This not only optimized storage space but also kept query performance fast and efficient, since queries scanned only the relevant partitions rather than the entire table.
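To illustrate daily partitioning without Oracle syntax, I can use a toy Python model like the one below. The class and method names are hypothetical and this is not how Exadata is implemented; it only shows the two behaviors I rely on, reading just the relevant days and dropping old days as whole units.

```python
from collections import defaultdict
from datetime import date


class DailyPartitionedTable:
    """Toy table that keeps rows in one bucket per calendar day."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, event_date, row):
        # Rows are routed to a bucket by date, like range partitioning by day.
        self._partitions[event_date].append(row)

    def query(self, start, end):
        # Partition pruning: rows are read only from buckets inside the range.
        return [
            row
            for day in sorted(self._partitions)
            if start <= day <= end
            for row in self._partitions[day]
        ]

    def drop_partitions_before(self, cutoff):
        # Dropping whole buckets replaces a row-by-row DELETE.
        old_days = [day for day in self._partitions if day < cutoff]
        for day in old_days:
            del self._partitions[day]
        return len(old_days)
```

The cost of removing old data depends on the number of partitions dropped, not on the number of rows inside them, which is why retention cleanup stays fast.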
I have worked with Python, SQL, and Spark (mainly through PySpark) throughout my career as a data engineer. Python has been my primary programming language for building ETL pipelines. I've used it in both production and QA environments, including developing data ingestion frameworks.
SQL is a core part of my daily workflow. I’ve written complex analytical queries and optimized SQL for performance on databases like Oracle Exadata.
With Spark, I’ve built scalable data processing pipelines for both batch and near real-time use cases. I’ve used Spark in distributed environments, primarily through PySpark, to perform transformations, aggregations, and joins on large datasets.
On the Hadoop and Spark side, I designed frameworks to handle large-scale data ingestion and transformation. For example, data coming from Oracle first needed to be cleaned before it could be used for reporting. I built PySpark jobs that automatically parsed the data, removed duplicate records, handled missing values, converted the data into optimized formats like Parquet, and stored it on the Hadoop platform. At the same time, I added metadata and validation rules so that we could easily track the data and confirm its accuracy.
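The cleaning and validation steps can be sketched in plain Python as below. The function names, the record layout, and the default values are hypothetical and only stand in for the real PySpark jobs, where the corresponding DataFrame methods would be dropDuplicates() and fillna().

```python
def clean_records(records, key, defaults):
    """Drop records with a repeated key (keeping the first) and fill missing values."""
    seen = set()
    cleaned = []
    for record in records:
        if record[key] in seen:
            continue
        seen.add(record[key])
        # Replace None with the column's default when one is configured.
        filled = {
            name: defaults.get(name) if value is None else value
            for name, value in record.items()
        }
        cleaned.append(filled)
    return cleaned


def validation_summary(cleaned, source_count, key):
    """Return simple load metadata that makes a run easy to audit."""
    return {
        "source_rows": source_count,
        "loaded_rows": len(cleaned),
        "duplicates_removed": source_count - len(cleaned),
        "keys_unique": len({r[key] for r in cleaned}) == len(cleaned),
    }
```

Storing a summary like this alongside each load is one simple way to track the data and confirm that row counts reconcile with the source.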
If you take a look at my portfolio website, you’ll see that most of my projects are built using AWS. I actively use AWS to quickly build and experiment with different data architectures. Since data tools are evolving so fast, I use the EMR service to easily install and try out big data tools like Spark, Hadoop, and Kafka. For storing data, I normally use RDS or S3. Overall, AWS has been a great platform for me to learn, experiment, and build end-to-end data pipelines.
My biggest strengths are my flexibility and adaptability. Wherever I work, the environment changes from day to day, and even throughout the day. Some projects require individual attention, and others call for a teamwork approach. My flexibility and adaptability have allowed me to meet expectations and even go beyond them. I also get along well with the people around me, which makes the work environment more comfortable and easier for everyone.
As for my weaknesses, I sometimes spend too much time on the tasks I enjoy most. With my mentor’s help, I started using a daily checklist to plan and prioritize my work. Now I make sure I pace myself better and focus on finishing the most important tasks first. It’s helped me become more balanced and efficient.
When I feel stressed, I try to handle it in a healthy and productive way. First, I take a short break to clear my mind — even a short walk or a few minutes of quiet time helps me refocus. I also like to organize my tasks and set priorities. Once I have a clear plan, the stress usually goes down because I can see what needs to be done first. Outside of work, I relieve stress by exercising and spending time with my family or friends. These activities help me recharge and come back to work with more energy and focus.
My life motto is “Stay curious, stay humble, and keep growing.” I believe learning never stops, no matter how much experience you have. Staying curious helps me discover new ideas, staying humble keeps me open to feedback, and continuous growth gives me purpose in both my career and personal life.
When I worked at Hyundai Powertech, we produced car transmissions. Each transmission needed a small gasket to fill the gap between parts, but the gap size was different for each transmission: sometimes 1 mm, sometimes 1.5 mm or 2 mm. Because of this, the company had to keep all gasket sizes in stock, which wasted storage space and money. I collected the production logs from the machines on the floor, analyzed them, and trained machine learning models to predict which gasket size would be needed for each transmission. Among several models, XGBoost performed the best. By using this prediction model, we reduced inventory levels and saved costs by ordering only the needed gasket sizes.
- XGBoost is an efficient and high-performance boosting algorithm that combines many small decision trees to make strong and accurate predictions.
May I ask which technologies your team works with most often, and what types of projects are currently the main focus?
May I ask what qualities you think are most important to succeed in this position?
May I ask which projects are currently the highest priority?
Strings are immutable but maintain order and allow duplicate characters.
Lists are mutable, ordered, and allow duplicates.
Tuples are similar to lists but immutable.
Dictionaries are mutable and preserve insertion order (as of Python 3.7), but their keys must be unique.
Sets are mutable but unordered and do not allow duplicate elements.
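If I am asked to demonstrate these differences, a short snippet like the following covers all five types. The variable names and sample values are arbitrary.

```python
# Strings: ordered and allow duplicate characters, but item assignment fails.
word = "hello"
try:
    word[0] = "H"
    string_mutable = True
except TypeError:
    string_mutable = False

# Lists: ordered, allow duplicates, and can be changed in place.
numbers = [3, 1, 3]
numbers.append(2)

# Tuples: like lists, but item assignment fails.
point = (3, 1, 3)
try:
    point[0] = 9
    tuple_mutable = True
except TypeError:
    tuple_mutable = False

# Dictionaries: insertion order is kept, and assigning to an existing key
# overwrites its value instead of adding a second key.
ages = {"amy": 30, "bob": 25}
ages["amy"] = 31

# Sets: duplicates collapse into one element and there is no positional order.
unique = {3, 1, 3}
unique.add(2)
```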