Hamza Harkous | Senior Staff Research Scientist

Summary

cat ./summary.md

A 'zero-to-scale' Senior Staff Research Scientist working at the intersection of synthetic data, environment simulation, and agents. I own the full project lifecycle and operate with high agency across the stack.

Latest News

tail -f /var/log/news.log

[2025-05-01] Published workshop paper on Simula at ICLR.

Experience

history | grep "work"

Senior Staff Research Scientist, Google

Nov 2025 – Present

Staff Research Scientist, Google

Nov 2023 – Oct 2025

Founder & Lead, Simula Framework: Co-founded and grew Google's leading internal synthetic data framework (with a total of 2 tech FTEs) to serve 450 monthly active Googlers (>1k unique users in 2025). Enabled the generation of >1 billion data items while achieving the highest satisfaction score (93%) across all Google data tools.
Creator, Simula Agent: Built the most widely used internal data-generation agent (since 2024). Currently serving >200 monthly active users, the agent auto-generates hundreds of end-to-end Colabs per month, specialized for Googlers' data generation needs.
Gemma Ecosystem: Simula served as a primary data engine for the Gemma family, including the upcoming Gemma models, ShieldGemma, FunctionGemma, and MedGemma.
Gemini Ecosystem: The framework contributed to frontier model development and safety:
- Safety & Security: Powers all safety classifiers (server-side & on-device); has contributed to Cybersecurity, Red-Teaming, and Prompt Injection defenses.
- Distillation & Features: Supports Gemini Flash Lite distillation, Nano (e.g., i18n post-training), and Gemini app features.
- Evaluations: Enabled evaluations in specialized Gemini verticals.
Wider Impact: Instrumental in public launches, such as Android Call Scam and Messages Spam detection. The framework is also integrated into Vertex AI's GenAI Evaluation Service.

Senior Research Scientist, Google

Nov 2021 – Oct 2023

Founder, Internal Data-Curation Platform: Bootstrapped an internal web platform combining diversified retrieval, active learning, and LLM assistance. I grew the contributing team to 13 engineers, evolving the project into a foundational data engine for numerous modeling initiatives.
ML Lead & Architect, Google Checks: Continued to lead the ML strategy for Google's AI-powered privacy compliance platform, scaling the models to secure thousands of mobile apps.

Research Scientist, Google

Feb 2020 – Oct 2021

ML Architect, Google Checks: Architected the initial ML models for Google's privacy compliance platform. I designed the entire ML pipeline, including data labeling, model pre-training, and distillation.
Lead Researcher & Developer, Hark: Built the core ML models and infrastructure for a large-scale privacy-feedback analytics system used daily by 300+ triagers and processing tens of millions of reviews.

Applied Scientist, Amazon Alexa

Jul 2019 – Jan 2020

Developed DATATUNER, a neural data-to-text generation system with state-of-the-art semantic fidelity.

Machine-Learning & Privacy Consultant, Privately SA

Nov 2018 – May 2019

Shipped on-device classifiers for hate speech, toxicity, and emotion detection; the technology launched in a BBC-branded mobile keyboard.

Post-doctoral Researcher, EPFL (LSIR-Lab)

Jul 2017 – Sep 2018

Lead author and developer of Polisis, an AI tool that analyzed privacy policies for >45,000 users. The project was featured in major publications such as Wired and the WSJ.

Professional Highlights

git praise --all

Transformative Impact: Received Google's highest "Transformative" performance rating (top 4%) for the cross-organizational impact of the Simula project (2024).
3-Time Google Core Tech Impact Award Winner: Received three separate awards for a primary/lead role in three independent, high-impact projects: Checks, Simula, and a new data-curation platform (awarded to the top 5% of projects).
Top Code Contributor: Ranked #1 code contributor in Privacy, Safety, & Security Research at Google (2023-2025).

Education

less /etc/education_history

Ph.D. Computer, Communication & Information Sciences
EPFL
Thesis: “Data-Driven, Personalized, Usable Privacy”

M.Sc. Communication Systems
EPFL

B.E. Computer & Communications Engineering
American University of Beirut (Minor: Mathematics)

Awards & Recognition

cat /var/log/awards.log

Caspar Bowden Award, PETS (2017)
ISSS Excellence Award for Best Ph.D. Thesis, Switzerland (2017)
Outstanding Paper Award, ACM CODASPY (2017)
Best Dataset Award, ACM IMC (2015)
4-Year Merit Scholarship, AUB (2006-2010)
Dean’s Award for Creative Achievement, AUB (2010)

Selected Publications

ls -la /var/log/publications.log

Orchestrating Synthetic Data with Reasoning
SynthData Workshop at ICLR 2025.
ShieldGemma: Generative AI Content Moderation Based on Gemma
2024.
Automated Cookie Notice Analysis and Enforcement
USENIX Security 2023.
Hark: A Deep Learning System for Navigating Privacy Feedback at Scale
USENIX Security 2022.
PriSEC: A Privacy Settings Enforcement Controller
USENIX Security 2021.
Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity
COLING 2020.
The Privacy Policy Landscape After the GDPR
PoPETS 2019.
Polisis: Automated Analysis and Presentation of Privacy Policies Using Deep Learning
USENIX Security 2018.
If You Can’t Beat Them, Join Them: A Usability Approach to Interdependent Privacy in Cloud Apps
CODASPY 2017.
The Curious Case of the PDF Converter that Likes Mozart
PETS 2016.
