ChatGPT Health fails to recognise medical emergencies – study

(theguardian.com)

185 points | by simonebrunozzi 7 hours ago ago

138 comments

unstyledcontent 7 hours ago
I have had some incredible medical advice from ChatGPT. It has saved me from small mystery issues, like a rash on my face. Small enough issues that I probably wouldn't have bothered to go into a doctor. BUT it also failed to diagnose me with a medical issue that ended up with a trip to the ER and emergency surgery.
A few weeks before the ER, I was having stomach pain. I went to the doctor with theories from ChatGPT in hand, they checked me for those things and then didn't check me for what ended up being a pretty obvious issue. What's interesting is that I mentioned to the doctor that I used ChatGPT and that the doctor even seemed to value that opinion and did not consider other options (and what it ultimately ended up being was rare but really obvious in retrospect, I think most doctors would have checked for it). I do feel I actually biased the first doctors opinion with my "research."
[-]
- hwillis 6 hours ago
  > I do feel I actually biased the first doctors opinion with my "research."
  It may feel easy to say doctors should just consider all the options. But telling them an option is worse than just biasing their thinking; they are going to interpret that as information about your symptoms.
  If you feel pain in your abdomen but are only talking about your appendix, they are rightfully going to think the pain is in the region of your appendix. They are not going to treat you like you have kidney pain. How could they? If they have to treat all of your descriptions as all the things that you could be relating them to, then that information is practically useless.
  [-]
  - ljm 3 hours ago
    It sounds strange to me that you would use GPT to start consulting to your doc, as if you suddenly know better than them. You don't want to be doing their job for them.
    If I used GPT for my medical issue last year and everybody took my word for it, I would be dead.
    [-]
    - QuantumGood 2 hours ago
      Neither "the worst case would be" nor "everything is a sliding scale" are good single hueristics. There are rarely There are rarely good single hueristics, but implying them tends to color discussions strongly.
    - kevin_thibedeau 3 hours ago
      I've related self-dianoses of minor issues to a doctor, immediately followed up with a proviso that I don't put a lot of credence into non-professional opinions. The doctor was supportive that patient directed investigations had value. There is a threshold where an informed patient can be useful for treatment.
      [-]
      - Groxx 15 minutes ago
        Yeah, I personally know a couple people where self-research found the correct diagnosis, and I am one of them. We had a fantastic primary, who worked with us quite closely and did a lot of research after we found some new information from him.
        Doctors don't know everything and don't have access to everything, they are just quite a lot better than the alternatives in the vast majority of cases, so your default odds are much better following their recommendation than anything else. Training is worth a lot, and everyone also knows it's not perfect, and that's entirely fine.
  - thfuran 3 hours ago
    Any competent doctor is aware that patients are likely to misdescribe things. If you walk in and say your appendix hurts, they absolutely should try to clarify that rather than just assuming you have appendicitis.
- Aurornis 6 hours ago
  > I do feel I actually biased the first doctors opinion with my "research."
  This has been a big problem in medicine since the early days of WebMD: Each appointment has a limited time due to the limited supply of doctors and high demand for appointments.
  When someone arrives with their own research, the doctor has to make a choice: Do they work with what the patient brought and try to confirm or rule it out, or do they try to walk back their research and start from the beginning?
  When doctors appear to disregard the research patients arrive with many patients get very angry. It leads to negative reviews or even formal complaints being filed (usually from encouragement from some Facebook group or TikTok community they were in). There might even be bigger problems if the patient turns out to be correct and the doctor did not embrace the research, which can prompt lawsuits.
  So many doctors will err on the side of focusing on patient-provided theories first. Given the finite time available to see each patient (with waiting lists already extending months out in some places) this can crowd out time for getting a big picture discussion through the doctor's own diagnostic process.
  When I visit a doctor I try to ground myself to starting with symptoms first and try to avoid biasing toward my thoughts about what it might be. Only if the conversation is going nowhere do I bring out my research, and then only as questions rather than suggestions. This seems to be more helpful than what I did when I was younger, which is research everything for hours and then show up with an idea that I wanted them to confirm or disprove.
  [-]
  - bryanlarsen 5 hours ago
    > Each appointment has a limited time
    A doctor is typically scheduled at 6 patients/hour. In that time they also have to chart, walk between rooms, make up time for the other patients that inevitably went over time, et cetera. The doctor you're seeing probably has a goal of only talking to you for 3 minutes.
    [-]
    - Aurornis 3 hours ago
      > A doctor is typically scheduled at 6 patients/hour.
      This is untrue. General practice physicians are usually at 3 patients per hour. Some specialists can get in the range or 5 or more per hour if assistants handle most of the prep and work.
      The average across all specialties is around 3, though.
      > In that time they also have to chart, walk between rooms, make up time for the other patients that inevitably went over time, et cetera. The doctor you're seeing probably has a goal of only talking to you for 3 minutes.
      I've been through two different medical systems due to job changes/moving. Both of them gave me the option of a 20 minute or 40 minute appointment slot, with the latter requiring some pre-screening to be approved by the staff. I got the time every time I went.
      If your doctor is only giving you 3 minutes you need to find a new one.
      [-]
      - Calavar 3 hours ago
        I know you qualified your assertion of three patients an hour with general practice, but there are plenty of specialty practices where six patients an hour is common. Dermatology and ophthalmology clinics often run at that pace (at least in the US). Some surgical clinics can run at that pace for follow up visits (not for initial visits)
        [-]
        Aurornis 2 hours ago
        That's exactly what I said in my 3rd sentence.
  - bandrami 4 hours ago
    I'm annoyed enough by coworkers asking "is the server down?" that I try not to do the equivalent to other people at their jobs, particularly doctors.
  - tokai 5 hours ago
    My aunt died from this (my opinion). She spend two years confusion her diagnosis and treatment, and borderline harassing her doctors, by thinking her own research was on point and interpreting all her symptoms through that lens. In the end it wasn't borrelia, parasites, 5G, or any of the other fancies, but just lung cancer that was only diagnosed when it was very well developed.
    [-]
    - walletdrainer 4 hours ago
      There’s a difference between mental illness and active participation.
      People not suffering from mental illness will typically not blame 5G for their health concerns.
      [-]
      - ifyoubuildit 3 hours ago
        You're a lay person. You know there is a thing out there called 'foo'.
        You've read things that compellingly claim that foo causes xyz symptoms. You also know that some people that have obviously palpable disdain for you claim that foo could never cause these symptoms.
        You have xyz symptoms. Are you mentally ill if you think that foo could be the cause?
        [-]
        thfuran 2 hours ago
        Are the compelling claims from experts in foo or xyz? Is the disdain?
        [-]
        ifyoubuildit 2 hours ago
        Both present themselves to you as experts.
- SoftTalker 6 hours ago
  > what it ultimately ended up being was rare but really obvious in retrospect, I think most doctors would have checked for it
  I'm not so sure. Doctors are trained to check for the most common things that explain the symptoms. "When you hear hoofbeats, think horses not zebras" is a saying that is often heard in medicine.
  ChatGPT was trained on the same medical textbooks and research papers that doctors are.
  [-]
  - giraffe_lady 6 hours ago
    > ChatGPT was trained on the same medical textbooks and research papers that doctors are.
    Yeah hm I wonder what the difference could possibly be.
- boondongle 6 hours ago
  This is ultimately the same difference between a search engine and a professional. 10 years before this, Googling the symptoms was a thing.
  I have a family member who had a "rare but obvious" one but it took 5 doctors to get to the diagnosis. What we really need to see are attempts to blind studies and real statistical rigor. It's funny to paint a tunnel on a canvas and get a Tesla to drive into it, but there's a reason studies (and the more blind the better) are the standard.
- BloondAndDoom 5 hours ago
  The real story hear your doctor actually listened to you. I appreciate what a lot doctors do, but majority of them fucking irritating and don’t even listen your issues, I’m glad we have AI and less reliant on them.
  [-]
  - PearlRiver 5 hours ago
    It is not a doctor job to listen, smile or be nice. Their job is to fix you.
    [-]
    - boondongle 5 hours ago
      I mean - obviously if they're not listening their chance of the latter is pretty low.
      Doctors hate to hear this, but if you're so poor in communication and social skills that the patient can't/won't follow you any care you've given, your value is lost.
- bluSCALE4 4 hours ago
  Personally, I think the value in ChatGPT in health is not that it's right or wrong but that it encourages you to take an active role in your health and more importantly to try things. I've gone through similar issues with ChatGPT where it's convinced me that if A is true, therefore so must B though that may not be the case.
  In the future, I think I'll likely review things with ChatGPT and have an opinion and treat the doctor like a ChatGPT session as well--this is opposed to leading the doctor to what I believe I should be doing. I was dismissive about the doctor's advice because it seemed so obvious but more and more, I feel that most of our issues are caused by habitual, daily mistakes--little things that take hold seasonally or over periods of stress that appear like chronic health issues. At least for me.
- luke5441 4 hours ago
  We have the same kind of issue as software engineers. Users come to use with solutions to their problems and want us to implement the solution. At that point the lazy path would be to just do that. If you have bad management, software engineers might even be punished for questioning the customers.
  What you want instead is that the users just describe their problem, as unbiased as possible and with enough detail and then let the expert come up with an appropriate solution that solves the problem.
  I try to do that as well when going to the doctor.
- cmsp12 5 hours ago
  You should've let the doctor do its job. if he reached a different conclusion then you can tell him what you researched. and he will make a decision having already done his own research without biasing him
- soco 6 hours ago
  Which is exactly why the AI, at least the ones of today, should never be used beyond the level of (trusted or not) advisor. Yet not only many CxOs and boards, but even certain governments which shall not be named, are stubbornly trying, for cost or whatever other reasons, to throw entire populations (employees or nations) under the AI bus. And I sincerely don't believe anything short of an uprising will be able to stop them. Change my mind.
  [-]
  - qalmakka 6 hours ago
    I agree. AI right now is at a level of "knowledgeable friend", not of "professional with years of real world experience". You'd listen to what your friend has to say, but taking pills after one of their suggestions? Dumb idea. It's great to brainstorm things, but just like your knowledgeable friend that likes reading Wikipedia pages a bit too much you need to really check it's not reaching to conclusions too quickly
  - asdff 3 hours ago
    The sad truth is that it is because while we all appreciate hard work and a good job, that isn't what is needed to move forward in the world of business. Creaky leaky products held together under the hood by scotch tape and string are fine. You don't make more money having a better product. A more performant tool. Better benchmarks. End users, aside for writing tools for other engineers, don't care. They really don't. Word 95 probably opens faster than word today.
    Management has realized this. Hey I can outsource to bangalore/hyderabad/east europe/ai, get something that barely works, and just market the crap out of it. Look at the sort of companies, products, and services that dominate markets today. These aren't leaders in quality or engineering. They are leaders in marketing. Marketing is what sells. Marketing can sell billions of steaming turds. Nike shoes are pieces of shit but it's marketing that makes the brand and provides all value in the stock. The world doesn't value quality. It values noise and pretty feathers.
  - simonebrunozzi 6 hours ago
    > but even certain governments which shall not be named
    Why can't you name them, and give us some context? Is this based on public info, or not?
    [-]
    - _dwt 5 hours ago
      Not the original commenter, but you may have noticed a wee kerfluffle between a large nation-state's "Secretary of War" and a frontier model provider over whether the model's licensing would permit autonomous lethal weapon systems operated by said - and I cannot emphasize the middle word enough - large _language_ model.
WarmWash 7 hours ago
I'd greatly prefer a blind study comparing doctors to AI, rather than a study of doctors feeding AI scenarios and seeing if it matches their predetermined outcome.
Edit: People seem confused here. The study was feeding the AI structured clinical scenarios and seeing it's results. The study was not a live analyses of AI being used in the field to treat patients.
[-]
- riskassessment 6 hours ago
  I don't understand this reasoning. Randomizing people to AI vs standard of care is expensive and risky. Checking whether the AI can pass hypothetical scenarios seems like a perfectly reasonable approach to researching the safety of these models before running a clinical trial.
  [-]
  - selridge 4 hours ago
    The issue is that those hypothetical scenarios do not have to look like how patients actually interact with the tool.
    Real life use is full of ill posed questions open ended statements inaccurate assessment of symptoms, and conclusory remarks sprinkled in between. Real use of chat bots for Health by non-clinicians looks very different than scenario based evaluation.
  - WarmWash 6 hours ago
    You would pass those hypothetical scenarios to doctors too, and then the analyses of results would be done by doctors who don't know if it's an AI or doctor result.
    [-]
    - riskassessment 6 hours ago
      From the paper
      > Three physicians independently assigned gold-standard triage levels based on cited clinical guidelines and clinical expertise, with high inter-rater agreement
  - nick49488171 6 hours ago
    You can start by comparing "doctor" care vs "doctor who also uses AI" care
- GorbachevyChase an hour ago
  The number of people who die each year just in the United States for causes attributable to medical errors is believed to be in the hundreds of thousands. A doctor’s opinion is not the golden yardstick.
  It may be interesting to study if there is some kind of signal in general health outcomes in the US since the popularization of ChatGPT for this purpose. It may be a while before we have enough data to know. I could see it going either way.
- hwillis 6 hours ago
  We have standards of care for a reason. They are the most basic requirements of testing. Ignoring them is not just being a bad doctor, its unethical treatment. Its the absolute bare minimum of a medical system.
- dekoidal 5 hours ago
  You're joking right? This is the 'testing on mice' phase and it failed and your idea is to start dosing humans just to see what happens.
  [-]
  - selridge 4 hours ago
    Human use is already widespread. You might as well complain in 2015 about the use of Wikipedia among emergency room doctors. That ship has sailed.
- RandomLensman 5 hours ago
  Feeding scenarios is not without challenges as some things, for example, smell, would be "pre-processed" by humans before fed into the AI, I think.
- lmkg 6 hours ago
  That type of experimental set-up is forbidden due to ethical concerns. It goes against medical ethics to give patients treatment that you think might be worse.
- nradov 6 hours ago
  I don't understand what you're proposing. How would you design such a study in a way that would pass IRB?
  [-]
  - dec0dedab0de 5 hours ago
    I think the best would be an interface, where the patient isn't told if the doctor on the other end is human or AI. Tell them that they are going to do multiple remote exams with different care providers for the same illness in exchange for free treatment, and payment for the study.
    If you're worried about not catching a legit emergency, as in something that can't wait a day or two for them to complete the different sessions, you could have a doctor monitor the interactions with the ability to raise a flag and step in to send them to the ER.
    [-]
    - nradov 5 hours ago
      I'm pretty sure that wouldn't pass IRB.
  - selridge 4 hours ago
    You could absolutely randomize care between a doctor and an AI under an IRB. I’d be stunned if there aren’t a dozen studies doing something like this already.
    You have to justify it, but most places have sections in the document where you request review to justify it. It’s not any different from giving one patient heart medicine that you think works and another patient a sugar pill.
    [-]
    - nradov 4 hours ago
      Huh? Do you have any actual examples of such studies? I don't think you understand how IRB actually works.
      In actual heart medicine studies the control arm is typically treated with the current standard of care, not a placebo. So it seems pretty clear that you don't have any actual knowledge or experience in this area.
  - SoftTalker 6 hours ago
    Feed it randomly selected case histories? See if it came up with the same diagnosis as the doctors?
    [-]
    - nradov 6 hours ago
      I don't think that would tell us anything useful. The data quality in most patient charts is shockingly bad. I've seen a lot of them while working on clinical systems interoperability. Garbage in / garbage out. When human physicians make a diagnosis they typically rely on a lot of inputs that never appear in the patient chart.
      And in most cases the diagnosis is the easy part. I mean we see occasional horror stories about misdiagnosis but those are rare. The harder and more important part is coming up with an effective treatment plan which the patient will actually follow, and then monitoring progress while making adjustments as needed. So a focus on the diagnosis portion of clinical decision support seems fundamentally misguided.
      [-]
      - qsera 6 hours ago
        > When human physicians make a diagnosis they typically rely on a lot of inputs that never appear in the patient chart.
        Yea, like how rich the patient is or if they are on insurance etc. I wish I was kidding.
        [-]
        PearlRiver 5 hours ago
        This the real reason why some people go to chatGPT instead of a GP. I am glad to live in a country were going to the doctor is free.
  - dyauspitr 6 hours ago
    It’s all case histories and text no real person is affected by this.
- lkey 4 hours ago
  This 'preference' is sociopathic, illegal, and stupid.
- qsera 7 hours ago
  Yea, that is exactly why I don't like this.
  These "experts", they have no problem to tout anecdotes when it serves them..
traceroute66 3 hours ago
> ChatGPT was trained on the same medical textbooks and research papers that doctors are.
There is a reason why the majority of a doctor's 8 years of training is spent doing the rounds as a junior doctor in hospital wards ....
[-]
- tty456 3 hours ago
  Curious, what is learned doing rounds that isn't taught in med school, that ChatGPT could benefit from?
  [-]
  - kruffalon 3 hours ago
    People!
    Interacting with real people, facing a person trying to get help for something that they don't want to experience is vastly different than reading about a symptom or group of symptoms in a book.
    [-]
    - moffkalast 2 hours ago
      That just sounds like instruct tuning on user data with extra steps. They must've collected hundreds of millions of conversation examples of people asking for medical related things by now.
      [-]
      - kruffalon an hour ago
        You're thinking about nursing. This is a different field that I think doctors should study and practice too.
        Doctors, from what I can tell, receive no formal education in nursing.
    - tty456 2 hours ago
      Right. All doctors are 100% happy to help you, listen intently, and are never condescending! /s
      [-]
      - kruffalon an hour ago
        No, in my experience most are actually quite condescending, but I can't imagine they would treat me better if I was the first ever patient they had to deal with.
  - traceroute66 3 hours ago
    > Curious, what is learned doing rounds that isn't taught in med school, that ChatGPT could benefit from?
    Seriously ? ¯\_(ツ)_/¯
    The textbooks are the theory.
    The hospital wards are the practice.
    The hospital wards are what shows you that the human body is complex and many times things don't happen like the textbook says it will.
    And then there's the ICU, pediatric, geriatric and mental health wards where the patient often cannot even describe their symptoms ...
    [-]
    - yunnpp 2 hours ago
      The idea that theory does not match practice is very foreign to a software developer or somebody that works in the information processing field. We are spoiled to have mathematicians do the hard work, and then we just get to botch it all with software that doesn't even work. But people with real jobs doing actual science/engineering know that the practice never matches the theory and that one needs to harden their balls in the thick of battle to come to a true understanding of things. That's why only a software developer, in their full command of hubris and ignorance, would suggest that you can replace a doctor with a computer program that has digested every book in existence and then statistically regurgitates its contents.
  - ruszki 2 hours ago
    Just from coding: Clean Code. Most companies require this on principle. But nobody follows it. And there is a good reason, because if you follow it, your code will be completely unreadable and thus unmaintainable.
  - batshit_beaver 2 hours ago
    In theory, there's no difference between theory and practice.
    In practice...
  - dgxyz 3 hours ago
    Nothing. Need brain, body, hands, eyes.
- kledru 2 hours ago
  well, chatGPT only started its first year and probably has not even done an autopsy
iainctduncan 4 hours ago
I think the worse situation is the bad AI summaries from search on health issues.
We had a potential pet poisoning, so was naturally searching for resources. Google had a summary with a "dose of concern" that was an order of magnitude off. Someone could have read that and thought all was fine and had a dead cat.
(BTW cat is fine, turned out to be a false alarm, but public service announcement: cats are alergic to aspirin and peptobismal has aspirin. don't leave demented plastic chewing cats around those bottles, in case you too have a lovely but demented cat)
[-]
- cloud-oak 3 hours ago
  What's really worrying is seeing medical professionals starting to rely on these tools.
  My wife had a pretty bad cold during pregnancy and our GP proceeded to prescribe her cough syrup with high alcohol content, because that was what ChatGPT told him to prescribe. We only noticed it once she took the first dose and spit it out again...
  [-]
  - conception 2 hours ago
    The amount of alcohol in cough syrup will not affect a pregnancy.
- ep103 4 hours ago
  I have literally never seen a correct google summary. Maybe y'all are searching for different things than i am, but at this point I've started taking the viewpoint that if I don't know why the ai summary is wrong, then i also don't know enough about the topic to trust its summary enough to determine whether the summary is useful.
rendleflag 4 hours ago
There is a concept of “the burden or knowledge”, in that doctors know the worst thing that could happen, so they recommend the most cautious approach. My son had stomach pain one time when he was young. We took him to urgent care because it was a stomach ache. The doctor there said we needed to go to the ER because it could be an appendicitis. So we trucked to the ER. Close to $2000 later he was diagnosed with idiopathic stomach pain and told to wait it out at home.
So when I read “they then compared the platform’s recommendations with the doctors’ assessments” and see a mismatch, I wonder if it’s because human doctors are overly cautious or that the AI was wrong.
But that all pales in what could be the actual issue. I can’t read the original study, but if it use the USA, it’s understandable why people are turning to AI for Health advice. Healthcare is painfully expensive here. Even a simple trip to the ER (e.g. a $2000 stomach ache) is beyond a lot of people’s ability to spend. That’s just a reality.
With that in mind, the real questions “should I do nothing about my symptoms because I can’t afford healthcare or should I at least ask AI knowing it could be wrong”.
nerdjon 7 hours ago
Even though these tools are showing time and time again that they have serious reliability issues, somehow people still think it is a good idea to use them for critical decisions.
Still regularly get wrong information from google’s search AI.
Really starting to wonder if common sense is ever going to come back with new tech, but I fear it is going to require something truly catastrophic to happen.
[-]
- bubblewand 4 hours ago
  I’ve got a popcorn reserve at hand to watch the show when the massive security breaches happen and people start freaking out. And/or a lawsuit gets discovery of a company’s LLM history and it’s every bit as awful for them as we all know it will be and the rest of corporate America pumps the brakes.
  These systems are borderline useless if you don’t give them dangerous levels of access to data and generate tons of juicy chat history with them. What’s coming is very predictable.
- yodsanklai 6 hours ago
  It's a strange paradigm shift, where the tool is right and useful most of than not, but also make expensive mistakes that would have been spotted easily by an expert.
  [-]
  - nxm 5 hours ago
    Human experts make expensive mistakes all the time
- lkbm 6 hours ago
  > Still regularly get wrong information from google’s search AI.
  The fact that the model most hyper-optimized for cheap+fast makes mistakes is not a particular compelling argument.
  [-]
  - mayneack 4 hours ago
    Then Google shouldn't be using something so unreliable for anything important. Arguing that random users should know the difference between cheap and frontier models is also not compelling. It's all the same "AI" to most people.
  - raddan 5 hours ago
    You are mistaken. ChatGPT Health [1] is a model specifically designed for health applications and was co-developed with a benchmark suite, HealthBench [2], for testing against health conditions. This study suggests that the people working on HealthBench have some concerning external validity problems.
    [1] https://openai.com/index/introducing-chatgpt-health/
    [2] https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca65...
    [-]
    - GaryBluto 5 hours ago
      GP was referring to Google's search AI not ChatGPT Health.
- duskdozer 6 hours ago
  It's really the "common sense" i.e. believing things without thinking because they "sound right" or because it's what your parents told you a lot growing up or because you watched an ad saying it a hundred times that's the issue. People don't want "the truth" or uncomfortable realities; they want comfortable, easily digestible bullshit. Smooth talkers filled the role before and LLMs are filling that role now.
spicyusername 7 hours ago
And how often are we reviewing doctors performance?
I suspect many, many doctors also fail to regularly recognize medical emergencies.
[-]
- nradov 6 hours ago
  In the general case it's usually not possible to accurately review an individual physician's performance. The software developers here on HN like to think in simplistic binary terms but in the real world of clinical care there is usually no reliable source of truth to evaluate against. Occasionally we see egregious cases of malpractice or failure to follow established clinical practice guidelines but below that there's a huge gray area.
  If you look at online reviews, doctors are mostly rated based on being "nice" but that has little bearing on patient outcomes.
- emp17344 6 hours ago
  Amazing how you can just deflect any criticism of LLMs here by going “but humans suck too!” And the misanthropic HN userbase eats it up every time.
  We live during the healthiest period in human history due to the fact that doctors are highly reliable and well-trained. You simply would not be able to replace a real doctor with an LLM and get desirable results.
  [-]
  - KronisLV 6 hours ago
    > Amazing how you can just deflect any criticism of LLMs here by going “but humans suck too!” And the misanthropic HN userbase eats it up every time.
    I think it's rather people trying to keep grounded and suggest that it's not just the hallucination machine that's bad, but also that many doctors in real life also suck - in part because of the domain being complex, but also due to a plethora of human reasons, such as not listening to your patients properly or disregarding their experiences and being dismissive (seems to happen to women more for some reason), or sometimes just being overworked.
    > You simply would not be able to replace a real doctor with an LLM and get desirable results.
    I don't think people should be replaced with LLMs, but we should benchmark the relative performance of various approaches:
```
  A) the performance of doctors alone, no LLMs
  B) the performance of LLMs alone, no human in the loop
  C) the performance of doctors, using LLMs
```
    Problem is that historical cases where humans resolved the issue and not the ones where the patient died (or suffered in general as a consequence of the wrong calls being made) would be pre-selecting for the stuff that humans might be good at, and sometimes wouldn't even properly be known due to some of those being straight up malpractice on the behalf of humans, whereas benchmarking just LLMs against stuff like that wouldn't give enough visibility in the failings of humans either.
    Ideally you'd assess the weaknesses and utility of both at a meaningfully large scale, in search of blind spots and systemic issues, the problem being that benchmarking that in a vacuum without involving real cases might prove to be difficult and doing that on real cases would be unethical and a non-starter. And you'd also get issues with finding the truly shitty doctors to include in the sample set, sometimes even ones with good intentions but really overworked (other times because their results would suggest they shouldn't be practicing healthcare), otherwise you're skewing towards only the competent ones which is a misrepresentation of reality.
    Reminds me of an article that got linked on HN a while back: https://restofworld.org/2025/ai-chatbot-china-sick/
    The fact that someone would say stuff like "Doctors are more like machines." implies failure before we even get to basic medical competency. People willingly misdirect themselves and risk getting horrible advice because humans will not give better advice and the sycophantic machine is just nicer.
    [-]
    - slopinthebag 4 hours ago
      > I think it's rather people trying to keep grounded and suggest that it's not just the hallucination machine that's bad, but also that many doctors in real life also suck
      No, you see this line or argumentation on every post critical of LLM's deficiencies. "Humans also produce bad code", "Humans also make mistakes" etc etc.
      [-]
      - KronisLV 3 hours ago
        > No, you see this line or argumentation on every post critical of LLM's deficiencies. "Humans also produce bad code", "Humans also make mistakes" etc etc.
        So your reading of this is that it's a deflection of the shortcomings?
        My reading of it is that both humans and LLMs suck at all sorts of tasks, often in slightly different ways.
        One being bad at something doesn't immediately make the other good if it also sucks - it might, however, suggest that there are issues with the task itself (e.g. in regards to code: no proper tests and harnesses of various scripts that push whoever is writing new code in the direction of being correct and successful).
  - boondongle 6 hours ago
    Even in medicine, often the difference between drug A and drug B is the difference between the two in statistical terms. If drugs were held to the standard "works 100% of the time", no drug would ever be cleared for use. Feelings about AI and this administration are influencing this conversation far too much.
    It's like people want to remove the physician or current care from the discussion. It's weird because care is already too expensive and too error prone for the cost.
- MostlyStable 6 hours ago
  A friend of mine had such a bad experience with _multiple_ American doctors missing a major issue that nearly ended up killing her that she decided that, were she to have kids, she would go back to Russia rather than be pregnant in the American medical system.
  Now, I don't agree that this is a good decision, but the point is, human doctors also often miss major problems.
- SoftTalker 6 hours ago
  Medical errors are one of the leading causes of death. It's a real catch-22. If you're under medical care for something serious, there's a real chance that someone will make a mistake that kills you.
  [-]
  - sarchertech 5 hours ago
    https://www.mcgill.ca/oss/article/critical-thinking-health/m...
    The numbers that you see quoted are almost certainly wildly exaggerated.
- jerlam 6 hours ago
  Isn't this what malpractice is?
  [-]
  - ThrowawayTestr 6 hours ago
    It's only malpractice when there's negligence. If other doctors agree that they could have made the same mistake, it's not malpractice.
    [-]
    - lkbm 6 hours ago
      You also don't sue for malpractice unless something goes catastrophically wrong. I've had doctors make ludicrously bad diagnoses, and while it sucked until I found a competent doctor and got proper treatment, it wasn't something I was going to go to court over.
andersmurphy 4 hours ago
Is this unsurprising? It's a fancy markov chain. It's like using a slot machine to diagnose medical conditions. I guess it's a slot machine with really good marketing.
[-]
- atleastoptimal 4 hours ago
  Hey, 2021 called, they want their knee-jerk criticisms of LLMs back
  [-]
  - andersmurphy 3 hours ago
    Ask your LLM for a random number between 1 and 10. What do you get?
    Let me guess you got 7?
    [-]
    - cloud-oak 2 hours ago
      Picking up your original comment - Markov chains are much better at generating random numbers than either humans or LLMs!
- slopinthebag 4 hours ago
  People are reading way too much into it, talking about "emergence" and anthropomorphizing it to insane degrees.
SoftTalker 7 hours ago
I really only use ChatGPT as a better search engine. But it's often wrong, which has actually ended up costing me money. I don't put a lot of trust in it. Certainly would not try to use it as a doctor.
[-]
- steveBK123 5 hours ago
  I have found the LLMs to be wrong in random insidious ways, so trusting them with anything critical is terrifying.
  Recent (as in last few days/weeks) incidents using different models/tools:
  * Google AI search summary compare product A & B, call out a bunch of differences that are correct.. and then threw in features that didn't exist
  * Work (midsize company with big AI team / homebuilt GPT wrappers) PDF parsing for company headquarters address, it hallucinated an address that didn't exist in the document
  * Work, a team using frontier model from top 2 AI lab was using it to perform DevOps type tasks, requested "Restart XYZ service in DEV environment". It responded "OK, restarting ABC service in PROD environment". It then asked for confirmation AFTER actioning whether they meant XYZ in DEV or ABC in PROD... a little too late.
  They are very difficult tools to use correctly when the results are not automatically verifiable (like code can be with the right tests) and the answer might actually matter.
  [-]
  - rsynnott 4 hours ago
    > Work, a team using frontier model from top 2 AI lab was using it to perform DevOps type tasks, requested "Restart XYZ service in DEV environment". It responded "OK, restarting ABC service in PROD environment". It then asked for confirmation AFTER actioning whether they meant XYZ in DEV or ABC in PROD... a little too late.
    ... Wait, they gave the magic robot _access to modify their production environment_?!
    Bloody hell, there's no helping some people.
    [-]
    - steveBK123 4 hours ago
      Yes, at a fairly large company that should otherwise know better.
      The problem with all these orgs hiring "AI experts" is the adverse selection of finding the people who "know AI" but can't get a job at AI lab, startup, big tech, or literally any other job using AI that is better than "making excel do AI more good".
      It's like Big Data / Cybersecurity / DevOps / Big Agile / Cloud Evangelist / Data Science grifter playbook all over again.
Scoundreller 6 hours ago
Search engines and Dr. Google must be feeling like they’ve missed some major artillery level bullets in this debate.
[-]
- selridge 4 hours ago
  Fuckin WebMD just hunkering down in the corner.
hayleox 6 hours ago
I think there is so much potential for AI in healthcare, but we absolutely HAVE to go through the existing ruleset of conducting years of research and trials and approvals before pushing anything out to patients. Move fast and break things is simply not an option in healthcare.
[-]
- weatherlite 5 hours ago
  It depends; people actually get sicker and even die due to endless backlog and lack of doctors (in most developed countries). It's not as if everyone gets optimal care now. A.I can at least expedite things hopefully.
- selridge 4 hours ago
  Sure it is. How many trials did we have before ER doctors started using Wikipedia?
dipflow 6 hours ago
Adding normal lab results made the suicide crisis banner disappear? That's a weird failure mode. You'd expect unrelated context to be ignored, not to override the risk signal.
ben5 4 hours ago
I know this isn't always the best answer, but if you need real medical advice - see a doctor. Not the internet.
[-]
- AuryGlenz 4 hours ago
  No, see both. LLMs are great for second opinions, as long as you give it the relevant info and don't try to steer it. Even though we all know we're supposed to get second opinions on medical things, we usually don't bother because it's too expensive in both time and money.
  If it could be an emergency, see a doctor.
  [-]
  - SoftTalker an hour ago
    If you need a second opinion, ask another doctor. My question would be, do you tell him "Here's the first doctor's diagnosis, do you agree?" or do you go in cold and see what he comes up with?
- selridge 4 hours ago
  You gonna pay for it?
WalterBright 6 hours ago
Doctors also miss things.
A friend of mine had an accident. He was taken to the emergency room, but the doctors there thought his injuries were minor. My friend insisted that he was bleeding out internally. They finally checked for that, and it turns out he was minutes from dying.
AI wasn't involved in this case, but it's good to have both AI and a trained doctor in the decision loop.
[-]
- sarchertech 5 hours ago
  >AI wasn't involved in this case, but it's good to have both AI and a trained doctor in the decision loop.
  That doesn't necessarily follow from your story. The AI's specificity and sensitivity are important, which is why we need to study this stuff. An AI that produces too many false positives will send doctors off chasing zebras and they'll waste time, which will result in more deaths.
  An AI that produces too many false negatives will make doctors more likely to miss things they otherwise would have checked, which will result in more deaths.
  The other real problem with using AI in a medical setting is that AI is very very good at producing plausible sounding wrong information. Even an expert isn't immune to this. So it's even more important that we study how likely they are to be wrong.
francisofascii 4 hours ago
The reality is entering the healthcare system can result in thousands of dollars in bills. People make risk/cost judgement on going to the hospital or not.
josefritzishere 7 hours ago
It continues to amaze me how recklessly some people cram AI into spaces where it performs poorly and the consequences include death.
[-]
- y-c-o-m-b 4 hours ago
  As a software dev that uses it and observes the many errors it makes on a daily basis, I definitely treat the output with a much greater deal of skepticism than the average person I speak with. If you're used to it providing relatively accurate results based on surface level google-eqsue searches, then it makes sense why you'd place a higher weight on it being an "expert" vs a "tool that needs verification". I understand why people fall into this mindset.
  I used ChatGPT to do a valve adjustment on an engine; a task I've never done before. I didn't just accept the torque values and procedure it told me though, because I know better from my experience with it as a dev. I cross-referenced it all with Youtube videos, forum posts, instruction manuals (where available) to make sure the job was A) doable for a non-mechanic like me and B) done correctly. Thanks to the Youtube video (which I cross-referenced with other sources), I discovered the valve clearance values were slightly off with the ChatGPT recommendation.
  I think the average Joe would assume these values were correct and run with it.
- rectang 7 hours ago
  If the AI gets attached to a health insurer (not the case here as far as I know), I would expect it to make decisions that are aligned with the company’s incentive to weed out unprofitable patients. AI is not a human who takes a Hippocratic oath; it can be more easily manipulated to perform unethical acts.
  [-]
  - stvltvs 7 hours ago
    AI is an overloaded term, so I'm not sure whether insurers are using LLMs or more traditional ML, but they are already using "AI" to deny claims.
    https://www.liveinsurancenews.com/health-insurance-claims-de...
  - PUSH_AX 6 hours ago
    I don't think anyone would use an AI with such a severe conflict of interests, unless this was completely hidden from the user.
    [-]
    - rectang 5 hours ago
      With an integrated insurer/provider, they just have to make primary care scarce so that it takes months to get an appointment, and then offer AI Doctor as an option. Not all patients have to use it for it to be cost effective.
- TZubiri 6 hours ago
  But it doesn't perform poorly actually, it's just that the stakes are very high and it's a highly regulated environment.
  Most physicians I know use ChatGPT. Although of course it's usage guided by an expert, not by the patient, nor fully autonomous.
nilamo 3 hours ago
Amazing that some people thought a pseudorandom number generator would be good at diagnosing health issues it can't even see.
bsoles 4 hours ago
>> "securely" (my emphasis) connect medical records and wellness apps” to generate health advice and responses.
No, no, no, and no. Are we going to never learn. Sharing medical data with AI tools is going to come back and bite you.
system2 2 hours ago
Maybe because human interaction, part of a doctor's training, is not documented as internet blog posts, so ChatGPT didn't learn and failed because of it? LLM is just learning from what's written.
jbverschoor 6 hours ago
Sounds exactly like a GP in the Netherlands
nashashmi 6 hours ago
Has anyone tried to suggest sudoku puzzles? In the middle of a hard game I will submit the screenshot to copilot or Gemini and it hallucinates suggestions on next move.
ml_giant 7 hours ago
I’m not surprised.
dyauspitr 6 hours ago
I feel like these need to be run against case histories from already determined cases, not cases were the doctors set up the scenarios, knowing they’re going to be run against ChatGPT.
TZubiri 6 hours ago
How about we allow ChatGPT to be used alongside human MD diagnosis?
Win win right?
[-]
- nerevarthelame 5 hours ago
  That would need to be tested. If doctors get lazy, complacent, or overworked (!), a "doctor with access to ChatGPT Health" may be functionally equivalent to "just ChatGPT Health" in some cases.
- nradov 6 hours ago
  What do you mean "allow"? From a public policy perspective there's nothing prohibiting that today, as long as the human MD follows the HIPAA privacy rule.
selridge 4 hours ago
I’ve never heard of in my entire life a doctor failing to recognize a medical emergency. /s
One of the things that people need to come to grips with is that like Wikipedia people will use ChatGPT because it is there. And the alternative is to be rich and have a primary care doctor that you can reach out to at a moments notice. Until that is different people will use these web services. It’s the same thing as Wikipedia or WebMD.
[-]
- the_mar 3 hours ago
  it's a very simplistic take. the issue with ChatGPT is that it speaks with authority, vs webMD and such just provide information. to say that how the information is presented is irrelevant to the outcomes is reductionist at best
varispeed 7 hours ago
I find that 5.2 has been completely dumbed down. Feels more like talking to early versions of Gemini when it quickly enters into loop state.