Description:
• Lead and grow a multidisciplinary research team focused on LLM scaling, efficiency, and systems performance.
• Define and execute the scaling research roadmap in alignment with Databricks’ strategic objectives.
• Drive algorithmic innovations for large-scale training and inference, including optimizers, low-precision techniques, and model adaptation methods.
• Oversee the design and execution of large-scale experiments, and benchmark the results against state-of-the-art methods.
• Optimize distributed training, parallelism, memory management, and hardware utilization in collaboration with systems and infrastructure teams.
• Translate research breakthroughs into customer-facing capabilities in the Databricks AI platform.
• Establish metrics, evaluation protocols, and best practices for scaling-focused research and drive adoption across the organization.
• Champion responsible deployment by ensuring model behavior, reliability, and safety remain first-class considerations.
• Work hands-on with the team to develop high-quality Python and PyTorch code for research, prototyping, and production integration.
• Mentor and develop research scientists and engineers through technical guidance and career support.
Requirements:
• Proven ability to lead a research team developing novel techniques for foundation model efficiency or related topics.
• Strong track record of industry impact.
• Deep expertise in at least one of: generative AI, LLMs, distributed ML systems, model optimization, or responsible AI.
• Research experience with a strong emphasis on scaling and efficiency for large-scale neural networks.
• Strong programming skills and demonstrated ability to write high-quality, efficient code in Python and PyTorch.
• Demonstrated ability to translate research innovation into scalable product capabilities with product and engineering teams.
• Excellent communication, leadership, and stakeholder management skills.
• Experience influencing cross-functional roadmaps and aligning research with business impact.
• Prior work at the intersection of systems and ML, such as distributed training frameworks, compiler and kernel optimization, or memory-/compute-efficient model design (preferred).
• Strong industry and academic network in large-scale ML, with ongoing collaborations or conference service such as program committee (PC) member or area chair roles (preferred).
• First-author publications at top ML/systems conferences such as ICLR, ICML, NeurIPS, or MLSys, or influential open-source contributions or widely used deployed systems, especially in optimization or efficiency (preferred).
Benefits:
• Competitive base salary range of $280,000 to $350,000 USD.
• Eligibility for an annual performance bonus.
• Eligibility for equity as part of the total compensation package.
• Comprehensive benefits and perks, which vary by region.
• Compensation may be adjusted based on skills, experience, certifications, training, and work location.