Description:
• Support the development and maintenance of data pipelines using Databricks, Spark, and similar technologies.
• Write and optimize SQL and Python scripts for data transformation, integration, and automation tasks.
• Develop automation scripts that populate metadata and comments on Databricks tables from structured definition files such as CSVs.
• Assist in building a proof-of-concept for an automated data dictionary that is kept up to date from existing Databricks metadata.
• Contribute to prototyping an AI-powered knowledge agent that uses internal data and documentation to answer common questions.
• Collaborate with team members to improve data quality, cataloging, and metadata management across the ecosystem.
• Participate in code reviews, design discussions, and sprint ceremonies to learn engineering best practices.
• Document findings, workflows, and automation processes for future reuse.
• Perform other duties as assigned.
Requirements:
• Actively pursuing a Bachelor’s or Master’s degree in Computer Science, Software Engineering, Information Systems, or a related technical field.
• Foundational knowledge of Python and SQL for data manipulation and analysis.
• Familiarity with ETL concepts and structured data formats such as CSV, JSON, and Parquet.
• Interest in cloud-based data platforms, with Azure preferred.
• Strong analytical and problem-solving skills with an eagerness to learn.
• Effective communication and teamwork skills.
• Exposure to Databricks, Apache Spark, or other distributed data frameworks is preferred.
• Familiarity with Git or version control practices is preferred.
• Interest in AI/LLM-based automation, data documentation, or metadata management is preferred.
• Prior project or internship experience in data engineering or cloud technologies is preferred.