Director of AI Data

CiteWorks Studio is hiring a Director of AI Data to lead the development of datasets and data infrastructure used to study how large language models retrieve information, generate answers, and cite sources.

This leadership role focuses on building large-scale data pipelines that collect and analyze AI responses across systems such as ChatGPT, Claude, Gemini, Perplexity, and open-source large language models.

[ What Is AI Data Infrastructure? ]

AI data infrastructure refers to the systems used to collect, process, organize, and analyze the data that powers machine learning and artificial intelligence models.

For large language models, AI data infrastructure may include:

• prompt-response datasets

• model evaluation datasets

• citation extraction pipelines

• retrieval benchmarking datasets

• large-scale training data collections

These systems allow researchers to study how AI models generate answers and retrieve knowledge.

[ What Does a Director of AI Data Do? ]

A Director of AI Data leads the strategy and development of data systems used for machine learning research and AI analysis.

The role focuses on building the datasets and pipelines required to analyze the behavior of large language models.

This includes developing systems that collect and structure:

• AI-generated responses

• prompt testing datasets

• citation data

• entity recognition signals

• generative search outputs

The Director ensures that researchers and engineers have the data needed to analyze how AI systems retrieve, synthesize, and cite information.

[ About CiteWorks Studio ]

CiteWorks Studio is an AI research and generative engine optimization (GEO) firm focused on understanding how large language models retrieve and cite information.

Modern AI systems such as ChatGPT, Gemini, Claude, and Perplexity increasingly function as the primary interface for information discovery. Instead of ranking links like traditional search engines, these systems generate answers by retrieving and synthesizing knowledge from multiple sources.

CiteWorks Studio studies this transformation and helps organizations understand:

• how AI systems determine trusted sources

• how citation patterns appear inside AI-generated answers

• how knowledge graphs influence model responses

• how organizations become trusted references in generative search systems

Our research focuses on AI citation intelligence, generative search benchmarking, and LLM retrieval systems.

[ Key Responsibilities ]

The Director of AI Data will lead the development of large-scale datasets used to analyze how generative AI systems behave.

Responsibilities include:

• building data pipelines that collect AI responses across multiple LLM platforms

• designing datasets used to benchmark generative AI systems

• developing systems that extract citations from AI-generated answers

• creating structured datasets used to analyze retrieval patterns

• managing prompt testing datasets used in AI evaluation

• collaborating with machine learning researchers and engineers to support AI benchmarking systems

The role also involves developing the data infrastructure needed to analyze AI citation behavior and generative search systems at scale.

[ Why AI Data Infrastructure Matters ]

Large language models generate answers from patterns learned during training and, in retrieval-enabled systems, from information retrieved from external knowledge sources.

Understanding how these systems behave requires structured datasets that capture:

• model responses across prompts

• citations included in AI answers

• variability between models

• hallucination patterns

• knowledge retrieval behavior

AI data infrastructure enables researchers to analyze how generative AI systems retrieve and use information.

[ Data Systems This Role Will Build ]

The Director will help design data systems used to analyze the behavior of AI models.

Prompt Response Datasets

Large collections of prompts and AI-generated answers used to study model behavior.

Citation Extraction Systems

Pipelines that identify and record sources cited inside AI-generated responses.

Retrieval Benchmark Datasets

Datasets used to analyze how AI models retrieve information from different sources.

Cross-Model Comparison Data

Data used to compare outputs from multiple AI systems.

Knowledge Graph Signal Datasets

Structured datasets used to analyze how entities and sources appear in AI responses.
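
As a rough illustration of the citation extraction systems described above, the sketch below pulls URLs out of an AI-generated response and stores them as structured records. It is a minimal example under stated assumptions, not a description of CiteWorks Studio's actual pipeline: the `CitationRecord` fields, the `extract_citations` function, and the URL pattern are all hypothetical, and real responses may cite sources in other forms (footnotes, titles, structured API fields) that this does not cover.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from urllib.parse import urlparse


# Hypothetical record shape; the field names are illustrative assumptions.
@dataclass(frozen=True)
class CitationRecord:
    model: str   # label for the system that produced the response
    prompt: str  # prompt that was sent
    url: str     # cited URL as it appeared in the response
    domain: str  # lowercased host part of the URL


# Simple pattern for http(s) URLs; stops at whitespace and common closing
# punctuation. This is an assumption about how citations appear in text.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def extract_citations(model: str, prompt: str, response: str) -> list[CitationRecord]:
    """Return one record per distinct URL found in the response text."""
    records: list[CitationRecord] = []
    seen: set[str] = set()
    for match in URL_PATTERN.finditer(response):
        # Drop trailing sentence punctuation that the pattern may capture.
        url = match.group(0).rstrip(".,;:")
        if url in seen:
            continue
        seen.add(url)
        domain = urlparse(url).netloc.lower()
        records.append(CitationRecord(model, prompt, url, domain))
    return records
```

Records in this shape could feed the cross-model comparison and knowledge graph signal datasets listed above, since each row ties a cited source to a specific model and prompt.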

[ Qualifications ]

Required

• 8+ years of experience in data engineering, machine learning infrastructure, or AI systems

• experience building large-scale data pipelines or ML datasets

• strong understanding of large language models and AI systems

• experience working with distributed data systems and large datasets

• ability to lead technical data teams and collaborate with researchers


Preferred

• experience building datasets for machine learning evaluation or benchmarking

• familiarity with retrieval-augmented generation (RAG) systems

• experience analyzing large language model outputs or AI-generated responses

• background in NLP or information retrieval systems

[ Why Join CiteWorks Studio ]

This role sits at the frontier of AI search research and generative AI systems.

The Director of AI Data will build the infrastructure needed to analyze millions of AI-generated responses and study how models retrieve and cite information.

As generative AI becomes the primary interface for information discovery, understanding AI data pipelines and retrieval behavior will become increasingly important.

[ Key Terms ]

Large Language Model (LLM)

A machine learning model trained on massive datasets that can generate text, answer questions, and perform reasoning tasks.

AI Data Infrastructure

The systems used to collect, process, and organize data used by machine learning models and AI research.

Generative Search

A form of search where AI systems generate answers by synthesizing information instead of returning ranked links.

AI Citation Intelligence

The analysis of how frequently specific sources appear in AI-generated responses.
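
The frequency measure in the AI Citation Intelligence definition can be made concrete with a small sketch. The example below counts, for each source domain, how many responses cite it at least once. The function name `citation_frequency` and the input shape (one list of cited URLs per response) are assumptions made for illustration; other definitions of frequency, such as counting every mention, are equally plausible.

```python
from collections import Counter
from urllib.parse import urlparse


def citation_frequency(cited_urls_per_response):
    """Count, per domain, the number of responses citing it at least once.

    `cited_urls_per_response` is an iterable where each item is the list of
    URLs cited in a single AI-generated response (a hypothetical input shape).
    """
    counts = Counter()
    for urls in cited_urls_per_response:
        # A set ensures a domain is counted once per response, even when the
        # response cites several pages from the same site.
        domains = {urlparse(url).netloc.lower() for url in urls}
        counts.update(domains)
    return counts
```

Dividing each count by the total number of responses gives the share of responses that cite a given source, which makes results comparable across datasets of different sizes.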
