These are good practices to keep in mind when setting up GenAI solutions, but I'm not convinced that this part of the job will allow "data scientist" as a profession to thrive. Here's my pessimistic take.
Data scientists were appreciated largely because of their ability to create models that unlock business value. Model creation was a dark magic that you needed strong mathematical skills to perform - or at least that's the image, even if in reality you just slap XGBoost on a problem and call it a day. Data scientists were enablers and value creators.
With GenAI, value creation is apparently done by the LLM provider and whoever in your company calls the API, which could really be any engineering team. Coaxing the right behavior out of the LLM is a bit of black magic in itself, but it's not something that requires deep mathematical knowledge. Knowing how gradients are calculated in a decoder-only transformer doesn't really help you make the LLM follow instructions. In fact, all your business stakeholders are constantly prompting chatbots themselves, so even if you provide some expertise here they will just see you as someone doing the same thing they do when they summarize an email.
So that leaves the part the OP discusses: evaluation and monitoring. These are not sexy tasks, and from the point of view of business stakeholders they are not the primary value add. In fact, they are barriers that get in the way of taking the POC someone slapped together in Copilot (it works!) and putting that solution in production. They're not even strictly necessary if you just want to move fast and break things. Appreciation for this kind of work is most present in large, risk-averse companies, but even there it can be tricky to convince management that this is a job that needs to be done by a highly paid statistician with a graduate degree.
What's the way forward? Convince management that people with the job title "data scientist" should be allowed to gatekeep building LLM solutions? Maybe I'm overestimating how good the average AI-aware software engineer is at this stuff, but I don't see the professional moat.
I agree with your take.
I don't really see why evals are assumed to be exclusively in the domain of data scientists. In my experience SWEs-turned-AI Engineers are much better suited to building agents. Some struggle more than others, but "evals as automated tests" is, imo, such an obvious mental model, and one that good SWEs adapt to so readily, that data scientists have no real role on many "agent" projects.
I'm not saying this is good or bad, just that it's what I'm observing in practice.
For context, I'm a SWE-turned-AI Engineer, so I may be biased :)
I think there's a lot of methodological expertise that goes into collecting good eval data. For example, in many cases you need human labelers with the right expertise, well designed tasks, well defined constructs, and you need to hit interrater agreement targets and troubleshoot when you don't. Good label data is a prerequisite to the stuff that can probably be automated by the AI agent (improving the system to optimize a metric measured against ground truth labels). Data scientists and research scientists are more likely to have this skillset. And it takes time to pick up and learn the nuances.
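To make the interrater-agreement part concrete, here's a toy sketch using Cohen's kappa (the labels are made up, and real eval work would use far more items and often more than two raters):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Two hypothetical labelers judging the same 10 model answers.
a = ["good", "good", "bad", "good", "bad", "good", "bad", "bad", "good", "good"]
b = ["good", "bad",  "bad", "good", "bad", "good", "good", "bad", "good", "good"]

print(round(cohens_kappa(a, b), 3))  # -> 0.583, below a typical >= 0.6 target:
                                     # time to troubleshoot the task definition
```

Raw percent agreement here is 80%, but kappa discounts the agreement you'd get by chance, which is why it's the number people set targets against.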
I agree with your take that there isn’t a lot of specialist work for data scientists to do with off-the-shelf LLMs that can’t be done by an engineer. As an AI-aware software engineer myself… this stuff wasn’t that hard to pick up. Even a lot of the work on the evals side (creating an LLM judge etc.) isn’t that hard and doesn’t require serious ML or stats.
But aren’t there still plenty of opportunities for building ML models beyond LLMs, albeit a bit less sexy now? It’s not like you can run a business process like (say) Airbnb’s search rankings or Uber’s driver matching algorithms on an LLM; you need to build a custom model for that. Or am I missing something here? Or is the point that those opportunities are still there, but the pond has shrunk because so much new work is now LLM-related? I buy that.
> I agree with your take that there isn’t a lot of specialist work for data scientists to do with off-the-shelf LLMs that can’t be done by an engineer.
Conversely, data scientists are doing software engineering, including webdev. It’s an interesting time. I think it’s less about the job title demarcation now, and more about output.
I think most use cases will still use simpler models like XGBoost etc. rather than LLMs. Customer segmentation is a really common use case with no need for an LLM. Same for revenue/LTV forecasting.
Perhaps they can use the LLM to write and deploy these models without needing a Data Scientist but that seems risky to say the least.
In my company, the most Data Scientist-adjacent people are the Data Analysts but they tend not to have programming experience beyond SQL and basic Python and they aren't used to using the terminal etc.
Do those use cases need LLMs? Probably not. But if good results can be had with a day of prompting (on top of the stuff mentioned in the article, which you have to do anyway), and a smaller model like Haiku performs well, why would you build a classifier before you have literally millions of customers?
The LLM solution will be much more flexible because prompts can change more easily than training data and input tokens are cheap.
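A sketch of what that flexibility looks like in practice (the category list and the `call_llm` client are hypothetical stand-ins for whatever SDK you actually use): adding a category is a one-line prompt edit, not a relabel-and-retrain cycle.

```python
CATEGORIES = ["billing", "bug", "churn risk"]  # extend this list to add a class

PROMPT = (
    "Classify the customer message into exactly one of: "
    + ", ".join(CATEGORIES)
    + ". Reply with the category only.\n\nMessage: {message}"
)

def classify(message, call_llm):
    """call_llm: any function mapping a prompt string to the model's reply."""
    answer = call_llm(PROMPT.format(message=message)).strip().lower()
    # Guard against the model drifting off-list; never trust raw completions.
    return answer if answer in CATEGORIES else "unknown"

# Stub standing in for a real model call, just to show the shape:
print(classify("Why was I charged twice?", lambda prompt: "Billing"))  # -> billing
```

The "unknown" guard matters more than it looks: it's the cheapest form of the monitoring the article is talking about.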
> Do those use cases need LLMs? Probably not.
One of the points of the article is the importance of gathering data to support your conclusions.
> prompts can change more easily than training data
Training data is real, and prompts are not. I don’t think this is an apples to apples comparison.
I don't disagree that heavily numerical tasks like revenue forecasting are a poor fit for LLMs. But a lot of data scientists didn't concern themselves with such things anyway (compared to business analysts and the like). Software to achieve this has been commoditized.
I agree. It is difficult to convince leadership to do this work at all ("it works on my example, ship it"), and in my experience most DS don't even want to do it.
One of the key values of eval work is that it forces some thinking about what task you want to solve in the first place. In many cases that is difficult, if not impossible, to pin down, which implies the underlying product should not be built at all. But nobody wants to hear that.
Doing eval only makes sense if making the product better impacts something the business cares about, which is very difficult to do in practice.
I don’t actually even know what people are hinting at when they say that LLMs replace the need for building custom models. Regression models? People are using LLMs instead of say building a Bayesian hierarchical model? That’s not possible. Time series modeling using an LLM? Also ridiculous. Recommender systems? Ok maybe, still utterly ridiculous and abysmally slow.
For anything NLP, sure, it definitely wins. However, I’ve just recently used some big fancy OpenAI model to label thousands of text examples for me, just so I could build a classifier with CatBoost. Guess what: inference speed is at a guaranteed sub-100ms and it costs $0 in tokens. The “AI Engineer” solution here would be to just run every classification request through an LLM.
AI Engineering is going to have the same problem we had when Data Science as a term arrived and you had every Statistician saying they’re just re-inventing everything that exists in statistics, poorly.
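That label-with-an-LLM-then-distill pipeline can be sketched roughly as below. The tickets and labels are invented, and a tiny Naive Bayes stands in for CatBoost so the sketch stays dependency-free; the point is that the LLM is only in the loop at labeling time, never at serving time.

```python
import math
from collections import Counter, defaultdict

# Step 1 (offline, costs tokens once): an LLM labels raw text. These four
# examples stand in for thousands of LLM-labeled tickets.
labeled = [
    ("my card was charged twice", "billing"),
    ("refund has not arrived", "billing"),
    ("app crashes on startup", "bug"),
    ("login button does nothing", "bug"),
]

# Step 2 (cheap, fast): train a small classifier on those labels.
def train(data, alpha=1.0):
    word_counts, class_counts, vocab = defaultdict(Counter), Counter(), set()
    for text, label in data:
        tokens = text.split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab, alpha

def predict(model, text):
    word_counts, class_counts, vocab, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        logp = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for token in text.split():
            logp += math.log((word_counts[label][token] + alpha) / denom)
        scores[label] = logp
    return max(scores, key=scores.get)

# Step 3 (free per request): no LLM call at inference time.
model = train(labeled)
print(predict(model, "charged twice for a refund"))  # -> billing
```

Swap `train`/`predict` for CatBoost and the structure is the same: the marginal cost of a prediction is CPU time, not tokens.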
You're right. For years the real impediment to "AI" products at many companies was the sheer crappiness of ML frameworks which were built by and for grad students, not professional engineers.
When LLMs appeared it was just so much easier to use them as an uber model and leave behind the training and inference infrastructure (if you can even call it that).
Now that LLMs can code I expect we'll be coding up custom model pipelines more and more... but only when we stop subsidizing LLMs.
As an AI-aware software engineer currently building systems that integrate with LLM provider APIs for my company, who also has no idea what an eval is or how a data scientist thinks about RAG: I honestly don't see what value a data scientist would bring to the table for my team. Maybe someone would care to enlighten me?
One thing data scientists brought to the table was statistical rigor in the models, but that seems to have left the building at this point with LLM-based solutions.
I believe data scientists and ML engineers should not be conflated.
I believe that the massive splitting of data roles over the past decade is both a product of ZIRP and premature optimisation.
So you consider data scientist akin to QA? They are just validating LLM based solutions?
You recognize that you haven't really needed strong mathematical (or coding) skills to create models for some time. Data Scientists add value by knowing how to translate business speak into an XGBoost-type model, and interesting XGBoost model results back into business speak. And, frankly, often by being some of the smartest people in the room. The math is occasionally helpful for speaking the language of the model, and picking only people who are decent at math (and coding) helps ensure the smart factor. How much of that will really change with AI?

I've also seen Business stakeholders try to use the chatbot to bypass the Data Scientist. Typically it's not long before there is a design decision or an interesting result the Business stakeholders don't understand. That's why I think there will be demand for Data Scientists. Not exactly evaluation and monitoring, and definitely not gatekeeping the building of LLM solutions. Often the opposite: called in to explain and debug the Business stakeholders' slop.
> You recognize that you haven't really needed strong mathematical (or coding) skills to create models for some time.

And then you get something like this [1], where researchers failed to control for multiple comparisons: "In this particular setting, emergent abilities claims are possibly infected by a failure to control for multiple comparisons. In BIG-Bench alone, there are ≥220 tasks, ∼40 metrics per task, ∼10 model families, for a total of ∼10^6 task-metric-model family triplets, meaning probability that no task-metric-model family triplet exhibits an emergent ability by random chance might be small."

[1] https://arxiv.org/abs/2304.15004
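A quick back-of-envelope on the multiple-comparisons arithmetic in [1] (plugging in the per-factor counts; only the order of magnitude matters):

```python
# With N independent chances at per-test significance level p, the probability
# that NO triplet looks "emergent" by luck alone is (1 - p)**N, and the
# probability of at least one spurious "discovery" is 1 - (1 - p)**N.
n_triplets = 220 * 40 * 10      # tasks x metrics x model families
p = 0.05                        # conventional per-test threshold

p_none = (1 - p) ** n_triplets  # underflows to 0.0: spurious "emergence" is certain
p_any = 1 - p_none
print(p_none, p_any)

# A Bonferroni correction would instead demand a per-test threshold of p / N:
print(p / n_triplets)           # ~5.7e-7
```

Which is the whole point: at these scales, "at least one metric on at least one task jumped" is the expected outcome under pure noise.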
I say this quite a lot to data scientists who are now building agents:
1. think of the context data as training data for your requests (the LLM performs in-context learning based on your provided context data)
2. think of evals as test data to evaluate the performance of your agents. Collect them from agent traces and label them manually. If you want to "train" an LLM to act as a judge to label traces, then again, you will need lots of good-quality examples (training data), as the LLM-as-a-Judge does in-context learning as well.
From my book - https://www.amazon.com/Building-Machine-Learning-Systems-Fea...
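A minimal sketch of point 2 (the trace verdicts below are made up): before letting an LLM judge label traces unsupervised, measure how often it agrees with your hand labels.

```python
# Hand labels for a few agent traces, plus what a hypothetical LLM judge said.
ground_truth = ["pass", "fail", "pass", "pass", "fail", "pass"]
judge_says   = ["pass", "fail", "fail", "pass", "fail", "pass"]

agreement = sum(g == j for g, j in zip(ground_truth, judge_says)) / len(ground_truth)
print(round(agreement, 3))  # -> 0.833

# If this sits below whatever bar you set (say 0.9), the judge's prompt and
# few-shot examples need work before its labels are worth trusting at scale.
```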
Yup, agree. “Evaluations” = Tests
Gets pretty meta when you’re evaluating a model which needs to evaluate the output of another agent… gotta pin things down to ground truth somewhere.
I have a data science/engineering background. From my perspective, using AI is like mining the solution space for optimality. The solution space is the combinatorics of the billions of parameters and their cardinalities. You try to narrow down the search space with your prompt and hopefully guide your mining with more semantic-based heuristics towards your optimal solution.
You might hit a local maximum or go down a blind path. I tend to start my code base completely from scratch every week: I make things more generic, remove unnecessary complexity, or add new features, and hope that moves me past the local maximum.
I'm a data scientist now, and a fan of Claude Code for implementing things. But I have to say, I'm constantly surprised by how "dumb" ChatGPT is as a math research partner. I will ask it a math question I'm thinking about, get a confident answer back, only to realize hours to days later that it was 180 degrees backwards. I'm so frustrated right now that I'm almost ready to stop asking it such questions at all. I'm aware this seems to contrast strongly with other math people's enthusiasm, e.g. Terence Tao's. Unclear why my mileage varies.
Much of my work takes the form above -- in other words, figuring out what to do. Once I've decided, it can of course spit out the boilerplate code much faster than I could, and I appreciate that. But for the moment I think I still have some job security thanks to the first issue.
I can see cases like the recently mentioned pg_textsearch (https://news.ycombinator.com/item?id=47589856) being perfect cases for this kind of development style succeeding - where you have the clear test cases, benchmarks, etc you can meet.
Though for greenfield development, writing the test cases (like the spec) is as hard as, if not harder than, writing the code.
I also observe that LLMs tend to get trapped in local minima. Once the codebase architecture has solidified, very rarely will they consider larger refactors. In some ways it's very similar to overfitting in ML.
This matches what I've seen working with automated systems. The watching part is genuinely underrated. Evals give you a score. Watching gives you intuition about failure modes you didn't know to test for.
Sitting with a running system teaches you things you would never think to measure.
> The bulk of the work is setting up experiments to test how well the AI generalizes to unseen data, debugging stochastic systems, and designing good metrics.
In my experience, this is missing a big part of the work: confirming what the data actually is, sometimes despite what people think it is.
The monitoring and evaluation piece is underrated. In my experience the hardest part isn't building the initial LLM pipeline, it's knowing when the thing quietly broke. Domain expertise matters a lot there because you need to design evals that actually catch the failure modes that matter for your specific data distribution.
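One concrete way to notice the quiet breakage: track the distribution of the pipeline's outputs over time and alert on shift, e.g. with a population stability index. A rough sketch with invented label shares:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population stability index between two binned distributions
    (each given as fractions summing to 1)."""
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

# Share of traffic per predicted category: last month's baseline vs. today.
baseline = [0.50, 0.30, 0.20]
today    = [0.20, 0.30, 0.50]

score = psi(baseline, today)
print(round(score, 2))  # -> 0.55

# Rule of thumb: PSI < 0.1 is stable, 0.1-0.25 is worth a look, and > 0.25
# means the distribution moved enough that something upstream likely changed.
```

It's crude, but it fires even when no individual output looks wrong, which is exactly the failure mode that per-request checks miss.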
I don't understand the framing of the assumption.
Was the data scientist role only about building NLP models? Are the LLMs gonna build churn prediction models? Tell the PM why stopping the A/B test halfway through is a bad idea? Push back on loony ideas of applying ML to predicting sales from user horoscopes?
Maybe the role is a bit smaller in scope than 10 years ago, but I see that as a good thing. If you looked at DS positions on job search sites, the role descriptions were all over the place; maybe now at least we'll see it consolidate.
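The stopping-the-A/B-test-halfway point is easy to demonstrate with a simulation: run A/A tests (no true effect) and compare the false-positive rate of testing once at the end against testing at every interim peek. A rough sketch using a normal-approximation z-test on Bernoulli(0.5) conversions:

```python
import random

def ab_peeking_fpr(n_per_arm=2000, peeks=10, sims=500, z_crit=1.96, seed=0):
    """False-positive rates for an A/A test: final-look-only vs. peek-and-stop."""
    rng = random.Random(seed)
    fp_end = fp_peek = 0
    step = n_per_arm // peeks
    for _ in range(sims):
        wins_a = wins_b = 0
        hit_any = hit_final = False
        for k in range(1, peeks + 1):
            for _ in range(step):
                wins_a += rng.random() < 0.5
                wins_b += rng.random() < 0.5
            n = step * k
            pooled = (wins_a + wins_b) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            z = abs(wins_a - wins_b) / (n * se) if se else 0.0
            if z > z_crit:
                hit_any = True
                if k == peeks:
                    hit_final = True
        fp_peek += hit_any
        fp_end += hit_final
    return fp_end / sims, fp_peek / sims

end_rate, peek_rate = ab_peeking_fpr()
# end_rate lands near the nominal 5%; peek_rate comes out several times higher,
# because every interim look is another chance to "win" by noise alone.
print(end_rate, peek_rate)
```

That inflation is the whole argument the DS makes to the PM; sequential-testing corrections exist, but "just stop when it's significant" isn't one of them.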
Exactly - in my company we had some NLP models in Customer Service (bag-of-words for classifying tickets) but everywhere else it was just classification or regression problems.
So yeah, the bag-of-words model got replaced with a chatbot several years ago (when chatbots were all the rage back in like 2017) and will probably get replaced again with an LLM-enhanced chatbot soon. But the meat and potatoes are those classification and regression models and they aren't going anywhere.
So true... I get more mileage from just watching an agent work than building sophisticated LLM-as-judge workflows
I just spent yesterday applying Karpathy's autoresearch to an ML problem.
I teach ML for a living and was amazed with what the tokens gave back to me after many rounds of experiments. If Kaggle was still a thing, AI would generally beat it.
The challenge I've seen is that most data science/ML modeling work is quite weak. Folks don't even know the basic tools well. Not sure that giving them AI will really open many doors for them.
As always experts love minions of juniors doing their deeds. Non-experts get to wade through slop.
I agree AI could probably do a decent job on Kaggle problems. Of course, almost no DS job is building models with well-defined objectives and perfect data. The DS and MLE folks I work with mostly spend their time reframing ill-posed product requests into ML systems that can be maintained and improved with feedback loops.
A _huge_ part of a DS is saying "No" to bad ideas posed by non-experts. The issue with LLMs is all they ever say is "Yes" and "Wow, that's such a great idea!"
The data scientist is like in house lawyers in that respect.
Is Kaggle no longer a thing?
I mean it is a similar loop. Define what good looks like, measure how far off you are, iterate. I would say though that the people who've been doing that for years just have a head start that prompt engineers don't.