I feel like I see this attitude a lot amongst devs: "If everyone just built it correctly, we wouldn't need these bandaids"
To me, it feels similar to "If everyone just cooperated perfectly and helped each other out, we wouldn't need laws/money/government/religion/etc."
Yes, you're probably right, but no, that won't happen the way you want it to, because we are part of a complex system, and everyone has their very different incentives.
Semantic web was a standard suggested by Google, but unless every browser got on board to break web pages that didn't conform to that standard, then people aren't going to fully follow it. Instead, browsers (correctly in my view) decided to be as flexible as possible to render pages in a best-effort way, because everyone had a slightly different way to build web pages.
I feel like people get too stuck on the "correct" way to do things, but the reality of computers, as is the reality of everything, is that there are lots of different ways to do things, and we need to have systems that are comfortable with handling that.
> Semantic web was a standard suggested by Google
Was this written by AI? I find it hard to believe anyone who was interested in Semantic Web would not have known its origin (or at least that its origin was not Google).
The concept of a Semantic Web was proposed by Tim Berners-Lee (who hopefully everyone recognizes as the father of HTTP, WWW, HTML) in 1999 [0]. Google, to my knowledge, had no direct development or even involvement in the early Semweb standards such as RDF [1] and OWL [2]. I worked with some of the people involved in the latter (not closely though), and at the time Google was still quite small.
0. https://archive.org/details/isbn_9780062515872/mode/2up
1. https://www.w3.org/TR/PR-rdf-syntax/Overview.html
2. https://www.w3.org/TR/owl-ref/
That was a human-generated hallucination, my apologies. I always associated semantic web with something Google was pushing to assist with web crawling, and my first exposure to it was during the Web 2.0 era (early 2010s) as HTML5 was seeing adoption, and I always associated it with Google trying to enhance the web as the application platform of the future.
W3C of course deserves credit for their hard work on this standard.
My main point was that regardless of the semantic "standard", nothing prevented us from putting everything in a generic div, so complaining that everyone's just "not on board" isn't a useful lament.
The association isn't without merit.
Google acquired Metaweb Technologies in 2010, and Freebase with it. Freebase was a semantic web knowledge base, and this became deeply integrated into Google's search technology. They did, in fact, want to push semantic web attributes to make the web more indexable, even though they originated neither the bigger idea nor the original implementation.
(one of my classmates ended up as an engineer at Metaweb, then Google)
"That was a human-generated hallucination"
Kudos. I respect this kind of honesty. I wish there was more of it.
"I always associated semantic web with something Google was pushing to assist with web crawling, and my first exposure to it was during the Web 2.0 era (early 2010s) as HTML5 was seeing adoption, and I always associated it with Google trying to enhance the web as the application platform of the future."
This sounds more like "indexing" than "crawling"
The "Sitemaps 0.84" protocol , e.g., sitemap.xml, was another standard that _was_ introduced by Google
Helpful for crawling and other things
(I convert sitemap.xml to rss; I also use it to download multiple pages in a single TCP connection)
Not every site includes a sitemap, some do not even publish a robots.txt
The phrase “if everyone just” is an automatic trigger for me. Everyone is never going to just. A different solution to whatever the problem is will be necessary.
I can't find a copy of the old "reasons your solution to email spam won't work" response checklist, but one of the line items was "fails to account for human nature".
eh I feel this but it's a lot simpler than that. Not "if everyone built everything correctly" but "if everyone's work was even slightly better than complete garbage". I do not see many examples of companies building things that are not complete embarrassing shit. I worked at some companies and the things we built were complete embarrassing shit. The reasons are obvious: nobody cares internally to do it, and nobody externally has any standards, and the money still flows if you do a bad job, so why do better?
What happens in practice is that the culture exterminates the drive for improvement: not only are things bad, but people look at you like you're crazy if you think things should be better. Like in 2025 people defend C, people defend Javascript, people write software without types, people write scripts in shell languages; debugging sometimes involves looking at actual bytes with your eyes; UIs are written in non-cross-platform ways; the same stupid software gets written over and over at a million companies; sending a large file to another person is still actually pretty hard; leaving comments on it is functionally impossible ... these are software problems, everything is shit, everything can be improved on, nothing should be hard anymore but everything still is; we are still missing a million abstractions that are necessary to make the world simple. Good lord, yesterday I spent two hours trying to resize a PDF. We truly live in the stone age; the only progress we've made is that there are now ads on every rock.
I really wish it was a much more ruthlessly competitive landscape. One in which if your software is bad, slow, hard to debug, hard to extend, not open source, not modernized, not built on the right abstractions, hard to migrate on or off of, not receptive to feedback, covered in ads... you'd be brutally murdered by the competition in short order. Not like today where you can just lie on your marketing materials and nobody can really do anything because the competition is just as weak. People would do a much better job if they had to to survive.
> the money still flows if you do a bad job so why do better?
I'll raise. The money flows because you do a bad job. Doing a good job is costly and takes time. The money cannot invest that much time and resources. Investing time and resources builds an ordinary business. The money is in for the casino effect, for the bangs. Pull the arm and see if it sticks. If yes, good. Keep pulling the arm. If not, continue with another slot machine.
I would argue that the money is short-termist, though. It just assumes short-term returns are the correct choice because it lacks the technical understanding of the long-term benefits of a good job.
In my experience many acquisitions set the peak of a given software product. The money then makes the argument that it's "good enough" and flogs the horse until it's dead, and a younger, more agile team of developers eventually builds a new product that makes it irrelevant. The only explanation I have for why so many applications fail to adapt is a cultural issue between the software and the money, one that always gets resolved by the money winning.
For example, I would suggest that the vast majority of desktop apps, especially those made by SMEs and originally written in MFC or something similar, fail to make the transition to the online services they need today because of this conflict. The company ends up dying, and the money never works out what it did wrong, because it's much harder to appreciate those long-term effects than the short-term ones that gave them more money at the start.
Today it's easy (finally we have Rust, Zig, Odin, Swift, Go, etc. that show marked improvements), but the OP is correct: a lot of progress was stalled because people defended suboptimal tools.
Every time somebody suggested "but C is wrong", the answer was "You should not do the wrong thing; be careful every time and things will be fine!"
P.S.: Yesterday we had Pascal/Ada/Eiffel/OCaml and others, but the major issue is that C/JS should have been improved to remove the bad things and add the good things. It's not as if what to do or why was a mystery; it was just fear and arrogance against change.
And this caused too much inertia against "reinventing the wheel" and too much hate against anybody who even tries.
He seems to have mistaken his personal opinions on which languages and language features are good for some objective truth.
Ironically, that’s part of why we can’t have nice things. People who aren’t open to other viewpoints and refuse to compromise when possible impede progress.
Well, sure, age is part of it. I would hope languages coming out 40-50 years after their predecessors (in the case of Rust following C/C++) would have improved upon those predecessors and learned from the ideas of computer science in the intermediate years.
(Coincidentally this is one of my chief complaints about Go: despite being about the same age as Rust, it took the avenue of discarding quite a lot of advancements in programming language theory and ergonomics since C)
Go has a much different set of design goals than Zig, Nim, or especially Rust. Go is really for people who want a few major improvements on C like easier builds, faster builds, higher-level standard string handling, direct support for some form of concurrency, an optional GC which defaults to on, and a few syntax tweaks - things that a modern C might have that were not going to make it into a standards revision. Rust, to support ownership and the borrow checker at compile time, had to build a useful language around that hugely helpful but quite restrictive requirement. They subsequently went different directions than the Go team on a lot of the other language features. Zig, Nim, and D are something in between those extremes in their own ways.
As someone with a background of a lot of time with Perl and the Pascal/Ada family who was rooting for Julia, Go strikes a good balance for me where I would have used Perl, Python, Julia, Ruby, or shell for system tasks. I haven’t done a lot of work in Rust yet, because in the infrastructure space speed to implement is usually more important than performance and Go is a lot faster already than Python or Ruby and because my team as a whole has much more experience with it. I certainly appreciate why the folks at work who do the performance-critical network and video software use mostly Rust to do what they used to do in C, Java, or C++.
we have to accept that the vast majority of people don't think like us. They don't think it's complete garbage because they can't look hard enough to appreciate that fidelity.
While it might be better if everyone thought like us and wanted things to be _fundamentally_ good, most people don't, and most people's money >> less people's money, and the difference in scale is vast.
We could try to build a little fief where we get to set the rules but network effects are against us. If anything our best shot is to try to reverse the centralisation of the internet because that's a big cause of enshittification.
The semantic web came out of work on Prolog and formal systems for AI which just didn't work well... LLMs and vector databases give us new tools that are pretty usable.
Neither broke web pages, honestly. XHTML requires a DTD named at the top of the document, and browsers will happily fall back to HTML 3, 4, or 5 as they can if there’s no DTD specified.
My interpretation of "break web pages" was serving XHTML with MIME type application/xhtml+xml, in which case browsers don't render anything when the XHTML isn't well-formed, which is really just a strict / validate-my-syntax mode you can opt into.
Fresh plums right off the tree taste significantly better than the ones you can get in the produce aisle, which are in turn better than canned, which are themselves still better than re-hydrated prunes.
In scaling out computation to the masses, we went from locally grown plums that took a lot of work and were only available to a small number of people that had a plum tree or knew someone that had one, to building near magical prune-cornucopia devices that everyone could carry around in their pockets, giving them an effectively unlimited supply of prunes.
LLMs re-hydrate these for us, making them significantly more palatable; if you're used to gnawing dried fruit, they seem amazing.
Perhaps, but we still failed, and not just at personal computing, nor just at the semantic web, but at computing and programming in general. The failure is the gap between the original intent (computing was originally more or less AI) and the theory on one side, and the actual result on the other, with every software project turning into unsustainable infinite regress. Things likely broke around ALGOL.
Also, LLMs are failing too, for different reasons, but IMO AI in general is unlikely to; it will correct a failure of 60 years or so in industrial computer science.
The most reprehensible knowledge-search-and-communication failure of all.
We gave people monetisation of drek instead. Then asked them to use it for education. Then trained bot systems on it. Then said that even those answers must be made to conform to transient propagandists.
The computers, yes. The experience of using them, no.
There is a joy to having a tool that lets you do what you never could before that has been buried along the way in layers of petty annoyances and digital microaggressions. As you say, computers today are better by so many metrics—including, unfortunately, their ability to track us, sell us things we neither need nor want, outrage and distract us.
I really don't like this analogy, and I really don't like the premise of this article.
Writing software is only so scalable. No matter what shortcuts we take, like Electron and JavaScript, there are only so many engineers with so much time, and there are abundantly many problems to solve.
A better analogy would be to look at what's happening to AI images and video. Those have 10,000x'd the fundamental cost savings, time savings, and personnel requirements. It's an industrialization moment. As a filmmaker who has made several photons-on-glass films, this is a game changer and lifts the entire creative industry to a level where individuals can outdo Pixar.
That is the lens with which to look at what AI will do to software. We're going from hand-carving stone wheels to the Model T.
This is all just getting started. We've barely used what the models of today offer us.
Totally agree with the core of your position. But the two perspectives are complementary, and perhaps even more deeply linked.
Initial attempts to alleviate any shortage are likely to result in a decrease in quality; initial attempts to improve quality are likely to reduce variability and thus variety. My point (and my reading of the article) is that we're at the stage where we've figured out how to turn out enough Hostess Twinkies that "let them eat cake!" is a viable option, and starvation is being driven back.
This is definite progress, but also highlights previous failures and future work.
This is a massive cope. AI image/video slop is still slop. Yes it's getting better, but it's still better .. slop. Unless radical new breakthroughs are made, the current LLM paradigm will not outdo Pixar or any other apex of human creativity. It'll always be instantly recognizable, as slop.
And if we allow it to take over society, we'll end up with a society that's also slop. Netflixification/marvelization only much much worse..
He didn't say LLMs can outdo Pixar. That's ridiculous and they are nowhere near that level.
He said that LLMs are at the point "where individuals can outdo Pixar." And that's very possible. The output of a talented individual with the assistance of AI is vastly better than the output of AI alone.
This is a very reductionist take that's to be expected from a software engineer but most definitely something that an artistic person would never utter. The creative process doesn't scale in the way that software engineers imagine. Coming up with genuine new ideas or magical moments of "synthesis" doesn't emerge from throwing lots of commodified tools together and calling it a day.
So far we haven't seen a single iota of creative art coming out of LLMs. Zip. Nada. It's all smoke and mirrors in that we get better and better veneers on top of bad copies of actual art that humans have previously created. The veneers are improving but there is no substance underneath. It's still slop. I don't want to live in a society that doesn't care about substance but instead worships the veneer. Yet this is the place that the current LLMs are taking us.
> This is a massive cope. AI image/video slop is still slop.
Slop content is slop content, AI or not. You don't need an AI to make slop. We've always had films like "The Room", it's just that the financial and time constraints put an upper bound on how much slop was created. AI makes creation more accessible. You've got Reddit for image and video now, essentially.
You are biased by media narratives and slop content you're encountering on social media. I work in the industry and professionals are using AI models in ways you aren't even aware of. I guarantee you can't identify all AI content.
> And if we allow it to take over society, we'll end up with a society that's also slop. Netflixification/marvelization only much much worse..
Auteurs and artists aren't going anywhere. These tools enable the 50,000 annual film students to sustainably find autonomy, where previously there wasn't any.
Scaling the production of almost-good things by individuals, things that used to take just as many people and just as big a budget as a really nice major feature film, is certainly full of use cases for education, training, portfolios of work, and pitches of the content.
I have often thought about how computers are significantly faster than they were in the early 2000s, but they are significantly harder to use. Using Linux for the first time in college was a revelation, because it gave me the tools to tell the computer "rename all of the files in this directory, keeping only the important parts of the name."
But instead of iterating on better interfaces to effectively utilize the N thousands of operations per second a computer is capable of, the powers that be behind the industry have decided to invest billions of dollars in GPUs to get a program that seems like it understands language, but is incapable of counting the number of B's in "blueberry."
They're not claiming AGI yet, so human intelligence is required to operate an LLM optimally. It's well known that LLMs process tokens rather than characters, so without space for "reasoning" there's no representation of the letter b in the prompt. Telling it to spell or think about it gives it room to spell it out, and from there it can "see" the letters and it's trivial to count.
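To make the token point concrete, here is a minimal sketch, assuming OpenAI's tiktoken package is installed; the model sees a couple of integer ids, while counting letters is trivial once you work on characters:

    import tiktoken  # OpenAI's open-source tokenizer library

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("blueberry")
    print(tokens)                                               # integer token ids, no letters in sight
    print([enc.decode_single_token_bytes(t) for t in tokens])   # byte chunks, e.g. something like b'blue', b'berry'
    print("blueberry".count("b"))                               # 2, trivial at the character level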
So long as I’m teaching the user how to speak to the computer for a specific edge case, which of these burn nearly as much power as your prompt? Maybe we should consider that certain problems are suitable to LLMs and certain ones should be handled differently, even if that means getting the LLM to recognize its own edge cases and run canned routines to produce answers.
Is counting the number of B's vital? Also, I'm pretty sure you can get an LLM to parse text the way you want it, it just doesn't see your text as you do, so that simple operation is not straightforward. Similarly, are you worthless because you seem like you understand language but are incapable of counting the number of octets in "blueberry"?
Let's say I hire a plumber because of his plumbing expertise and he bills me $35 and I pay him with a $50 bill and he gives me back $10 in change. He insists he's right about this.
I am now completely justified in worrying about whether the pipes he just installed were actually made of butter.
As shown by the GPT-5 reaction, a majority of people just have nothing better to ask the models than how many times does the letter "s" appear in "stupid".
I think this is a completely valid thing to do when you have Sam Altman going on the daily shows and describing it as a genius in your pocket and how it's smarter than any human alive. Deflating hype bubbles is an important service.
But the point is, why would you trust it for anything at all, when it can't do an incredibly simple thing reliably at all? (Yes, I understand the tokenizer makes this hard, but still, it's a quick demonstration that it's just bad technology.)
I mean, I think that anyone who understands UTF-8 will know that there are nine octets in blueberry when it is written on a web page. If you wanted to be tricky, you could have thrown a Β in there or something.
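The octet arithmetic is easy to check; a quick Python sketch, with the Greek capital beta being the parent's suggested trick:

    print(len("blueberry".encode("utf-8")))   # 9 octets, one per ASCII letter
    print(len("Βlueberry".encode("utf-8")))   # 10 octets: the Greek capital Β encodes to two bytes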
> Similarly, are you worthless because you seem like you understand language but are incapable of counting the number of octets in "blueberry"?
Well, I would say that if GP advertised themselves as being able to do so, and confidently gave an incorrect answer, their function as someone who is able to serve their advertised purpose is practically useless.
It is (maybe not directly but very insistently) advertised as taking many jobs soon.
And counting stuff you have in front of yourself is basic skill required everywhere. Counting letters in a word is just a representative task for counting boxes with goods, or money, or kids in a group, or rows on a list on some document, it comes up in all kinds of situations. Of course people insist that AI must do this right. The word bag perhaps can't do it but it can call some better tool, in this case literally one line of python. And that is actually the topic the article touches on.
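That "one line of python" really is about one line; a sketch of the kind of canned routine a tool-calling model could emit instead of guessing from tokens:

    # the better tool: exact counting at the character level, no tokens involved
    print("blueberry".count("b"))   # 2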
People always insist that any tool must do things right. They as well insist that people do things right.
Tools are not perfect, people are not perfect.
Thinking that LLMs must get right the things that people find simple is a common mistake, and it is common because we easily treat the machine as a person, while it is only acting like one.
> Thinking that LLMs must do things right, that people find simple, is a common mistake
Show me any publicly known visible figure that tries to rectify this. Everyone peddles hype, there's no more Babbage as in the "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" anecdote.
People and tools that don't do things right aren't useful. They get replaced. Making do with a shitty tool might make sense economically but not in any other way.
If you follow that reasoning, no person is useful and no tool is useful.
The little box I'm filling now is, compared to a lot of other interfaces, a shitty interface. That doesn't mean it isn't useful. Probably it is getting replaced, only with a slightly better interface.
The karma system is quite simplistic and far from perfect. I'm sure there are ways to go around it. The moderators make mistakes.
That doesn't mean the karma and moderation are not useful. I hope you get my point but it's fine if we disagree as well.
It is advertised as being able to "analyze data" and "answer complex questions" [0], so I'd hope for them to reliably determine when to use its data-analysis capabilities to answer questions, if nothing else.
IDK, I think there is something adorable about taking a system that over trillions of iterations always performs the same operation with the same result, reliability unmatched in all of the universe…
And making it more of “IDK what it answered the way it did, but it might be right!!”
Yeah, dream on. I'm an engineer and know what structured data is. And yet I miserably fail to store my private files in a way that lets me find them again without relying on search tools. So how on earth are we ever going to organize all the world's data and knowledge? Thank god we found this sub-optimal "band aid" called LLMs!
Having no names is not the biggest problem. You just have to come up with a name. The problem is when things have multiple names, or even worse, when people disagree on what names are appropriate for something. The world rarely allows you to neatly categorize large datasets. There are always outliers.
For example, you have a set of balls and you want to sort them by color. Where does orange stop and red begin? What about striped balls or ones with logos printed on them? What if it is a hypercolor ball that changes based on heat? It gets messy very fast.
Not everything has to be named once and put into a hierarchy like a directory tree. Tags work well for data. A system like an LLM that understands synonyms and antonyms should be able to find and even update tags for concepts that don’t have a full set already - as long as there are a few appropriate tags on the concept to start.
In practice if you're making up tags on the fly it's not much better than untagged data. A LLM that can figure out what the tags mean can probably just infer it from the data anyway.
> Remember Semantic Web? The web was supposed to evolve into semantically structured, linked, machine-readable data that would enable amazing opportunities. That never happened.
I think the lesson to be learned is in answering the question "Why didn't the semantic web happen?"
I have literally been doing web development since there was a web, and the companies I developed for are openly hostile to the idea of putting their valuable, or perceived valuable, information online in a format that could be easily scraped. Information doesn't want to be free, it wants to be paid for. Unless the information shared pulls visitors to the site it doesn't need to be public.
> Information doesn't want to be free, it wants to be paid for. Unless the information shared pulls visitors to the site it doesn't need to be public.
That's a cultural and societal problem, not a technology problem. The motivations (profit) are wrong, and don't lead to true innovations, only to financialization.
So long as people need to pay to eat, information will also want to continue to be paid for, and our motivations will continue to be misaligned with true innovations, especially if said innovations would make life easier but wouldn't result in profit.
I'd argue that resource availability is already high enough to alleviate scarcity for most people, and that most scarcity today is artificially generated, because of profit.
We won't achieve post scarcity, even with widespread automation (if AI ever brings that to fruition), because we haven't yet fixed the benefits that wealth brings, so the motivation to work toward a post-scarcity society just doesn't exist.
I've encountered a similar issue in academia - PI's don't want to make their data available to be scraped (or, at least not easily) because the amount of grant funding is limited, and a rival who has scraped one's data could get the grant money instead by using that scraped data to bolster their application.
To a degree re ads on pages, but why didn't big business end up publishing all of their products in JSON-LD or similar?
A lot did, to get aggregated, but not all.
But also because companies that produce web content wanted it to be seen by humans who would look at ads, not consumed by bots and synthesized with other info into a product owned by some other firm.
And yet today most websites are being scraped by LLM bots which don't look at ads and which synthesize with other info into a product owned by some other firm.
Optimistically, the semantic web is going to happen. Just that instead of the original plan of website owners willingly making data machine-readable, LLMs will be the ones turning non-machine-readable data machine-readable (which can then be processed by user agents), even if the website owner prefers you looked at the ads instead.
The semantic web was theoretically great for data scientists and metadata scrapers, but offered close to zero value for ordinary humans, both on the publishing side and the consumption side. Also, nobody did the hard work of defining all of the categories and protocols in a way that was actually usable.
The whole concept was too high-minded and they never got the implementation details down. Even if they did, it would have been horrendously complex and close to impossible to manage. Asking every single publisher to neatly categorize their data into this necessarily enormous scheme would have resulted in countless errors all over the web that would have seriously undercut the utility of the project anyway. Ultimately the semantic web doesn't scale very well. It failed for the same reason command economies fail: it's too overwhelming for the people in control to manage and drowns in its own bureaucracy.
Semantic web never existed. There was Google and Google had an API to get breadcrumbs to show on search results. And that's what people called "semantic web." A few years later they gave up and made everything look like a breadcrumb anyway. And that sums up the whole semantic web experience.
I find traditional web search and LLM search to be complementary technologies, and this is a good example of that. Both have their uses and if you get the information you need using one or the other, we are all happy.
I think the example query here actually shows a problem with the query languages used in web search rather than an intrinsic inability of web search. It contains what amounts to a natural language subquery starting with "in the same year". In other words, to evaluate this query properly, we need to first evaluate this subquery and then use that information to evaluate the overall query. Google Search and almost all other traditional web search engines use intentionally oversimplified query languages that disallow nested queries let alone subqueries, so this example really is just revealing a problem with the query language rather than a problem with web search overall. With a better query language, we might get better results.
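A rough sketch of that decomposition; search() here is a hypothetical single-fact lookup, not any real API. The point is only that the outer question depends on the inner answer, which a flat keyword query language cannot express:

    def answer(search):
        # the inner subquery has to be resolved first
        year = search("year Sweden's King Gustav IV Adolf declared war on France")
        # its result feeds the next lookup, and so on outward
        country = search(f"country where the first small British colony was established in {year}")
        return search(f"animal featured on the flag of {country}")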
> What animal is featured on a flag of a country where the first small British colony was established in the same year that Sweden's King Gustav IV Adolf declared war on France? ... My point is that if all knowledge were stored in a structured way with rich semantic linking, then very primitive natural language processing algorithms could parse question like the example at the beginning of the article, and could find the answer using orders of magnitude fewer computational resources.
So as well as people writing posts in English, they would need to provide semantic markup for all the information like dates, flags, animals, people, and countries? It's difficult enough to get people to use basic HTML tags and accessible markup properly, so what was the plan for how this would scale, specifically to non-techy people creating content?
This actually happened already, and it's part of why LLMs are so smart. I haven't tested this, but I venture a guess that without Wikipedia, Wikidata, Wikipedia clones, and stolen articles, LLMs would be quite a lot dumber. You can only get so far with reddit articles and embedded knowledge of basic info on higher-order articles.
My guess is that when fine-tuning and modifying weights, the lowest-hanging fruit is to overweight Wikipedia sources and reduce the weight of sources like reddit.
Only a relatively small part of Wikipedia has semantic markup though? Like if the article says "_Bob_ was born in _France_ in 1950" where the underlines are Wikipedia links, you'll get some semantic info from the use of links (Bob is a person, France is a country), but you'd be missing the "born" relationship and the "1950" date, as these are still only raw text.
Same with the rest of articles with much more complex relationships that would probably be daunting even for experts to markup in an objective and unambiguous way.
I can see how the semantic web might work for products and services like ordering food and booking flights, but not for more complex information like the above, or how semantic markup is going to get added to books, research articles, news stories etc. that are always coming out.
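To illustrate what the missing markup for the "Bob was born in France in 1950" example would amount to, here is a hedged sketch using the rdflib package with a made-up example.org vocabulary (a real deployment would use something like schema.org or Wikidata properties instead):

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, XSD

    EX = Namespace("http://example.org/")  # hypothetical vocabulary, purely for the sketch

    g = Graph()
    bob = URIRef("http://example.org/Bob")
    g.add((bob, RDF.type, EX.Person))
    g.add((bob, EX.birthPlace, URIRef("http://example.org/France")))
    g.add((bob, EX.birthYear, Literal("1950", datatype=XSD.gYear)))

    print(g.serialize(format="turtle"))  # the "born" relationship and the date, now machine-readable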
> The semantic information is first present not in markup but in natural language.
Accurate natural language processing is a very hard problem though, and is best handled by AI/LLMs today, but this goes against what the article was going for when it says we shouldn't need AI if the semantic web had been done properly?
Complex NLP is the opposite of what the semantic web was advocating. Imagine asking the computer to buy a certain product and it orders the wrong thing because the natural language parsed was ambiguous.
> Additionally infoboxes also hold relationships, you might find when a person was born in an infobox, or where they studied.
That's not a lot of semantic information compared to the contents of a Wikipedia article that's several pages long. Imagine a version of Wikipedia that only included the infoboxes and links within them.
> If all knowledge were stored in a structured way with rich semantic linking, then very primitive natural language processing algorithms could parse question like the example at the beginning of the article, and could find the answer using orders of magnitude fewer computational resources.
In vertical markets, can LLMs generate a "semantic web of linked data" knowledge graph to be parsed with efficient NLP algorithms?
Leveraging LLMs to build the special markup so that it can be applied towards other uses: some type of semantic web format, like JSON-LD or OWL, or some database that can process SPARQL queries. Palantir is using ontologies as guardrails to prevent LLM hallucinations.
Palantir was 90% about ontology creation back when I first used it in 2009. They knew at that point that without some level of structure that mapped to the specific business problem that the data input into the graph was difficult to contextualize.
The root cause of this is that HTML is not a language for the markup of hypertext.
Because everything has to be copied and/or compressed into a single layer document along with its markup, you almost never see actual markup in the web, just a lot of formatting and layout.
Because you can't have multiple markup of a given source material, adding multiple hierarchies doesn't happen. Any given information structure is necessarily oversimplified and thus only fit for limited use.
It's almost as if someone read Vannevar Bush's description of the memex and decided to actively prevent its reification.
He said the organization of knowledge was the largest challenge facing mankind post war. Clearly he was right, and we've failed miserably.
The thing LLMs provide is an impedance match between the data we do have, and the actual needs of mankind.
I think every field has its own version of this thought, where if we could just manage to categorise and tag things properly we could achieve anything. Our lack of a valid overarching ontology is what is holding us back from greatness.
It might be short lived, who knows, but it's interesting that the recent progress came from capturing/consuming rather than systematically eliminating the nuance in language.
I think I broadly agree with this. I am super frustrated that everything always wants me to search for things. As an example, Finder's default search while looking at a folder covers the whole machine instead of filtering the directory you are viewing, which seems totally insane to me. It's almost like they don't want me to know where my files are.
I can understand that it's a result, to a degree, of cloud services and people's primary mode swapping to opening an app and opening recents or searching, instead of opening a file to open an app, but it does mean that you're at the mercy of what I experience as some pretty crap search algorithms that don't seem to want you to find the information you're looking for. I keep encountering searches that rank fuzzy matches over exact matches or aren't stable as you continue to complete the same word, and I just don't understand how that's acceptable, once it's been pointed out, if search is what I'm supposed to be using.
> It's almost like they don't want me to know where my files are.
I think this might actually be true in some cases.
Especially where companies want your files on their cloud servers. It's better for them if you don't think about what's stored locally or remotely. It's better for them if you don't think at all and just ask them for whatever you want and let them decide what to show you or keep hidden from you. It's easier for them to get metrics on what you're doing if you type it out explicitly in a search box than it is for them to track you as you browse through a hierarchy you designed to get to what you want. You're supposed to feel increasingly helpless and dependent on them.
Uh, the author got so close to making the same realization I had while working on a project [0] for the Wikimedia Foundation: we wouldn't need search engines if we had better tooling to query semantic databases like Wikidata.
However, the thing that the author might be missing is that the semantic web exists. [1] The problem is that the tools that we can use to access it are not being developed by Big Tech. Remember Freebase? Remember that Google could have easily kept it around but decided to fold it and shove it into the structured query results? That's because Google is not interested in "organizing the world's information and making it universally accessible" unless it is done in a way that it can justify itself into being the data broker.
I'm completely out of time or energy for any side project at the moment, but if someone wants to steal my idea: please take an LLM and fine-tune it so that it can take any question and turn it into a SPARQL query for Wikidata. Also, make a web crawler that reads a page and turns it into a set of RDF triples or QuickStatements for any new facts that are presented. This would effectively be the "ultimate information organizer" and could potentially replace Wikidata as most people's entry page of the internet.
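For anyone tempted to steal the idea, the glue is small; a hedged sketch where question_to_sparql() is a placeholder for the proposed fine-tuned model (hypothetical, not a real API), while the Wikidata endpoint call is the real part:

    import requests

    def question_to_sparql(question: str) -> str:
        # placeholder for the proposed fine-tuned LLM; plug in your model of choice
        raise NotImplementedError

    def ask_wikidata(sparql: str) -> list:
        # Wikidata's public SPARQL endpoint; it expects a descriptive User-Agent
        r = requests.get(
            "https://query.wikidata.org/sparql",
            params={"query": sparql, "format": "json"},
            headers={"User-Agent": "semweb-sketch/0.1 (example)"},
        )
        r.raise_for_status()
        return r.json()["results"]["bindings"]

    # answer = ask_wikidata(question_to_sparql("What animal is on the flag of ...?"))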
DBpedia Spotlight and entity-fishing already do something similar to your idea - they extract structured data from text and link to knowledge bases. Combining these with LLM-based query translation to SPARQL could indeed bridge the gap between semantic web's structure and natural language interfaces.
ChatGPT etc does an OK job at SPARQL generation.
Try something like "generate a list of all supermarkets, including websites, country, description" and you get usable queries out.
In a much, much more limited way, this is what I was dabbling with in alltheprices - queries to pull data from wikidata, crawling sites to pull out the schema.org Product and offers, and publishing the aggregate.
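For a flavour of what that kind of prompt yields, a query along these lines runs against Wikidata (the P/Q identifiers here are from memory and worth double-checking before relying on them):

    SUPERMARKETS = """
    SELECT ?item ?itemLabel ?itemDescription ?website ?countryLabel WHERE {
      ?item wdt:P31/wdt:P279* wd:Q180846 .     # instance of (a subclass of) supermarket
      OPTIONAL { ?item wdt:P856 ?website . }   # official website
      OPTIONAL { ?item wdt:P17 ?country . }    # country
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 200
    """
    # feed this string to a SPARQL endpoint call like the ask_wikidata() sketch above,
    # or paste it into query.wikidata.org to inspect the results interactively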
I would argue that LLM is ultimately making true Semantic Web available, or irrelevant.
It can basically figure out all the semantic knowledge graphs automatically for us, and it's multi-modal to boot. That means it can infer the relationships between any node across text, audio, images, videos, and even across different languages.
The funny thing to me is that now more than ever structured data is important so that AI has a known good data set to train from, and so that search engines have contextual and semantically rich data to search.
AI isn't a solution to this; on the contrary, whatever insufficiency exists in the original data set will only be amplified, as compression artifacts can be amplified in audio.
We also can't trust any data that's created recently, because an LLM can be trained to provide correct-looking structured data that may or may not accurately model semantic structure.
The best source of structured, tagged, contextually-rich data suitable for training or searching in the future will be from people who currently AREN'T using generative AI.
"... if all knowledge were stored in a structured way with rich semantic linking..." this sounds a lot like Google's "Knowledge Graph". https://developers.google.com/knowledge-graph. (Disclosure: I work at Google.)
If you ask an LLM where you can find a structured database of knowledge with structured semantic links, they'll point you to this and other knowledge graphs. TIL about Diffbot!
In my experience, it's a lot more fun to imagine the perfect database like this than it is to work with the actual ones people have built.
My read is that the author is saying it would have been really nice if there had been a really good protocol for storing data in a rich semantically structured way and everyone had been really really good at adhering to that standard.
Ok, but Google's result summary got the answer wrong. So did Gemini, when I tried it (Lion, Sierra Leone). And so did ChatGPT when I tried it (Lion, Sri Lanka).
So... it's impressive, sure, because the author is correct that a search engine can't answer that question in a single query. But it's also wildly wrong.
I also vaguely agree with the author that Google Drive sucks, but I wish they'd mentioned that the solution to their problem - using search! - also fucking sucks in Google Drive.
If you really want to, you know, have this super generic indexing thing, why don't you go organize the web with HyperCard and semantic web crap and tell us how it worked out for you.
One can speak in their native language to a computer now and it mostly understands what is meant and can retrieve information or even throw together a scaffold of a project somewhat reliably.
It's not particularly good at writing software, however. Still feels like a human is needed to make sure it doesn't generate nonsense or apparently pretty insecure code.
So I'm not sure the author got the point across that they wished, but aren't vector databases basically a semantic storage/retrieval technology?
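In a loose sense, yes: embeddings map text into vectors where semantically close means geometrically close, and a vector database adds an approximate-nearest-neighbour index on top. A minimal sketch, assuming the sentence-transformers package:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    docs = [
        "Sweden declared war on France in the Napoleonic era.",
        "The national flag features a bird.",
        "How to bake sourdough bread.",
    ]
    doc_vecs = model.encode(docs, normalize_embeddings=True)

    query_vec = model.encode(["which animal appears on the flag?"], normalize_embeddings=True)[0]
    scores = doc_vecs @ query_vec          # cosine similarity, since the vectors are normalized
    print(docs[int(np.argmax(scores))])    # retrieves the flag sentence, no markup required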
At Engelbart's mother of all demos in 1968, which basically birthed what we call personal computing today, most computer scientists were convinced that AGI was right around the corner and "personal computing" wasn't worth it.
Now, back then AGI wasn't right around the corner, and personal computing was really, really necessary, but how did we forget the viewpoint that personal computing was seen as a competing way to use computing power vs AI?
Economic interests, walled gardens, lock-in effects. Providers learned to make it (initially) convenient for us to forget. Of course, once we’re hooked, enshittification ensues: https://news.ycombinator.com/item?id=44837367
Another angle: we're super over provisioned in compute resources because of reasonable limitations in how quickly we can collectively absorb and learn how to use computers. "AI" is simply a new paradigm in our understanding of how to program computers. This isn't a failure, it's just an evolution.
Leading in with what feels like a p-hacked question that almost certainly isn't typical kind of hurts the point. Reminds me of seeing clearly rehearsed demos for things that ignore the reason it worked is because it was rehearsed. A lot.
Am I wrong in that this was a completely organic question?
People have said this since Vannevar Bush theorized the Memex. The unanswered question, the problem unsolved, remains: Who does all the categorization work? How come they're never part of "we," such that the plaint is never "if only we had done the work..."?
Neither chat-gpt nor gemini got the same answer as the article for me
> What animal is featured on a flag of a country where the first small British colony was established in the same year that Sweden's King Gustav IV Adolf declared war on France?
chatgpt: Final answer: Lion
gemini: A lion is featured on the flag of Sri Lanka.
Gemini (2.5 pro) stands firm on its answer even when told that ChatGPT said it is a parrot. It does provide additional reasoning: https://g.co/gemini/share/45232dcc7278
Just complaining that the world is bad is a good way to waste your energy and end up being a cynic.
So why is not all information organized in structured, open formats? Because there's not enough of an incentive to label/structure your documents/data that way. That's if you even want to open your data to the public - paywalls fund business models.
There have been some smaller successes with semantic web, however. While a recipe site might not want to make it easy for everyone to scrape their recipes, people do want Twitter to generate a useful link preview from their sites' metadata. They do that with special tags Twitter recognizes, and other sites can use as well.
The good news is that LLMs can generate structured data from unstructured documents. It's not perfect, but has two advantages: it's cheaper than humans doing it manually, and you don't have to ask the author to do anything. The structuring can happen on the read side, not the write side - that's powerful. This means we could generate large corpuses of open data from previously-inaccessible opaque documents.
This massive conversion of unstructured to structured data has already been happening in private, with efforts like Google's internal Knowledge Graph. That project has probably seen billions in cumulative investment over the years.
What we need is for open data orgs like Wikipedia to pick up this mantle. They already have Wikidata, whose facts you can query with a graph querying language. The flag example in the article could be decomposed into motifs by an LLM and added to the flag's entry. And then you could use SPARQL to do the structured query. (And that structured query can be generated by LLMs, too!)
Cyc[0] tried for 40 years to make handwritten semantic rules a thing, but still doesn't have much to show for it. Humans just aren't good or fast at writing down the rather fuzzy semantics of the real world in a computer-readable format.
With RDF especially, there was also the issue of "WTF is this good for?". Semantic Web sounds lofty in theory, but was there ever even a clear plan for what the UI would look like? How would I explore all that semantic data if it ever came into existence? How would it deal with link rot?
And much like with RSS, I think a big failure of RDF is that it's some weird thing outside the Web, instead of just some additional HTML tags to enrich existing documents. If there is a failure, it's that. Even today, a lot of basic semantic tags are missing from HTML; we finally got <time> in 2011, but we still have nothing for names, cities, units, books[1], movies, GPS coordinates and a lot of other stuff we use daily.
Another big failure is that HTML has become a read-only format, the idea that one uses HTML as a source format to publish documents seems to have been completely abandoned. HTML is just a UI language for Web apps nowadays, Markdown, LaTeX or whatever is what one uses to write content.
How do LLMs help Google, an ad company, generate more ad revenue? LLMs drive traffic away from websites and give you answers directly without needing to bother with ads. How does that benefit Google's ad business?
Ad clicks are a trifle compared to the possibility of winning the game of displacing vast amounts of labor and delivering ads and narratives undisclosed through conversational AI output. At that point you've won the economy, period.
> winning the game of displacing vast amounts of labor and delivering ads and narratives undisclosed through conversational AI output. At that point you've won the economy, period.
In a functional democracy, all you should win by doing that is harsh antitrust action.
Alan Kay has also written about his disappointment with what personal computing delivered. I think he'd agree with this.
Most people can't use the power of the computers they have effectively. Maintaining the data in Excel spreadsheets for most people is like manual labour.
is it unreasonable to expect some kind of good-enough baseline guided by the priors of these LLMs to be the standard? Should google's priors (or their training dataset and recipe) be allowed to guide society's priors? same problem, different era?
This is a clickbait article one could've written when Google was new and people were getting used to the idea of "You can just search for things on the internet."
It's just dressed up with mentions of AI for better "engagement."
No, AI is impressive because we built a machine that finally understands what we are saying and has a concept of what creativity is. It's rudimentary but it's a milestone for humanity.
It's social media and the endless barrage of AI-generated content that's creating the illusion that AI isn't impressive. You're inundated every day with the same crap and it's just making it less and less impressive.
This take is very naive and should anticipate the obvious criticism: people did try, very very hard, to create structured information systems. They simply did not work. The sort of systems people tend to build are well-scoped for a particular problem, but fall apart completely when moving out of that domain. It is very easy to imagine an information system that results in correct answering of flag questions. It is a wide open problem to come up with a way of tagging information that works well for flag questions and for questions about memes, or art. It's not like Google didn't try!
I always appreciate our weekly Crankypants Take on LLMs.
> AI is not a triumph of elegant design, but a brute-force workaround
You can read and understand Attention Is All You Need in one hour, and then (after just scaling out by a few billion) a computer talks to you like a human. Pretty elegant, if you ask me.
> The web was supposed to evolve into semantically structured, linked, machine-readable data that would enable amazing opportunities.
I missed that memo. The web was, and forever shall be, a giant, unstructured, beautiful mess. In fact, LLMs show just how hopeless the semantic web approach was. Yes, it's useful to attach metadata to objects, but you will still need massive layering and recursion to derive higher-order, non-trivial information.
This entire article is someone unable to let go of an old idea that Did Not Work.
> AI is not a triumph of elegant design, but a brute-force workaround
I think the author is on to something here, but does not realize it, or applies the thinking to the wrong problem. The issue isn't the web; search was good enough, and it's a pretty big problem to solve. We need to go smaller. Applying AI to customer service, internal processes and bureaucracy, among other things, is an inelegant brute-force approach to not just fixing your shit.
The majority of customer service could be fixed by having better self-service, designing better UIs, writing better manuals, having better monitoring, better processes, better-trained staff, not insisting on pumping stock numbers, and actually caring about your product. AI will never fix the issues your customers are having; it'll just make customer service cheaper and worse because, just like for real humans, the content and levers it needs aren't available.
Same for much of the bureaucracy in companies and governments. Rather than having AI attempt to fill out forms or judge if someone is entitled to a pension, insurance payout or what have you, take the time to fix your damn processes and build self-service portals that actually work (some of those may have some sort of AI on the backend for things like scanning documents).
Forget the semantic web thing, it never worked: too much data, and AI-generated content certainly isn't making the problem easier. Let's work on that at some other time. Still, the author is correct, LLMs are a brute-force workaround, regardless of how elegant the design may be. They are applied to a host of problems that should just be eliminated, not hidden and stupefied by a prediction engine.
> They scan the unstructured web and build ephemeral semantic maps across everything. It's not knowledge in the classic sense.. or perhaps it is exactly what knowledge is?
Personal computing: at best it's now the flavor du jour, where we get to slam cloud-based approaches to everything as being what they always were. Wasteful, slow and privacy invasive... But we still can't just plainly say the cloud has been mostly bad.
I sympathize so much with the failure of personal computing to manifest!
> My point is that if all knowledge were stored in a structured way with rich semantic linking, then very primitive natural language processing algorithms could parse question like the example at the beginning of the article, and could find the answer using orders of magnitude fewer computational resources. And most importantly: the knowledge and the connections would remain accessible and comprehensible, not hidden within impenetrable AI models.
It's a pocket hope of mine that AI brings us back to the Semantic Web, or something very like it. In many ways these embeddings are giant non-human lexicons already. Distilling this information out seems so possible.
Even just making an AI to go markup (or uhh perhaps refine) a page with microdata seems conceptually very doable!
More broadly, looking at personal computing: my personal belief is that the failure is hugely because of apps. Instead of a broader personal computing that aggregates, that allows constructivism, that enables personal growth & development (barefoot developers & home-cooked software style), computing has been defined by massification, by big tech software. The dominant computing paradigm has been the mainframe pattern: we the user have an app that acts as a terminal to some far-off cloud-y data center. Whatever agency we get is hewn out for us a priori by product teams, and any behavior not explicitly built in is a "Felony Contempt of Business Model" (an oh so accurate Doctorow-ism)!
https://maggieappleton.com/home-cooked-software
https://news.ycombinator.com/item?id=40633029
https://pluralistic.net/2022/10/20/benevolent-dictators/#fel...
https://news.ycombinator.com/item?id=33279274
It is so so sad to see computing squandered so!
The good news is AI is changing this relationship with software. I'm sure we will have no end of AI models built into our software, and that companies will maintain their strict control (tyrant's grip) over software as long as they can! But for AI to flourish, it's going to need to work across systems. And that means tearing down some of the walls, walls that have forcibly kept computing (ardently) anti-personal.
I can go look at https://github.com/punkpeye/awesome-mcp-servers and take such great hope from it. Hundreds of ways that we have eked out a way to talk to and interface with systems that, before, people had no say in and no control over.
The last ten years of consumer web tech was dedicated to SEO slop and crypto. Google has destroyed everything else with its endless promotion of hindustantimes.com listicles.
The essential meaning of technological progress is the utter inadequacy of humanity. We build cars that we never figure out how to safely drive, but being too arrogant to admit defeat and reconstruct society around trams like a sensible animal would do, we wait a century for cars that can drive themselves while the bodies pile up on the freeways in the meantime.
Yes it's true, we can't make a UI. Or a personal computer, or anything else that isn't deeply opaque to its intended users. But it's so much worse than that. We humans can hardly do anything successfully besides build technology and beat each other in an ever-expanding array of ridiculous pissing contests we call society.
I feel like I see this attitude a lot amongst devs: "If everyone just built it correctly, we wouldn't need these bandaids"
To me, it feels similar to "If everyone just cooperated perfectly and helped each other out, we wouldn't need laws/money/government/religion/etc."
Yes, you're probably right, but no that won't happen the way you want to, because we are part of a complex system, and everyone has their very different incentives.
Semantic web was a standard suggested by Google, but unless every browser got on board to break web pages that didn't conform to that standard, then people aren't going to fully follow it. Instead, browsers (correctly in my view) decided to be as flexible as possible to render pages in a best-effort way, because everyone had a slightly different way to build web pages.
I feel like people get too stuck on the "correct" way to do things, but the reality of computers, as is the reality of everything, is that there are lots of different ways to do things, and we need to have systems that are comfortable with handling that.
> Semantic web was a standard suggested by Google
Was this written by AI? I find it hard to believe anyone who was interested in Semantic Web would have not known it's origin (or at least that it's origin was not Google).
The concept of a Semantic web was proposed by Tim Berners-Lee (who hopefully everyone recognizes as the father of HTTP, WWW, HTML) in 1999 [0]. Google, to my knowledge, had no direct development or even involvement in the early Semweb standards such as RDF [1] and OWL [2]. I worked with some of the people involved in the latter (not closely though), and at the time Google was still quite small.
0. https://archive.org/details/isbn_9780062515872/mode/2up
1. https://www.w3.org/TR/PR-rdf-syntax/Overview.html
2. https://www.w3.org/TR/owl-ref/
That was a human-generated hallucination, my apologies. I always associated semantic web with something Google was pushing to assist with web crawling, and my first exposure to it was during the Web 2.0 era (early 2010s) as HTML5 was seeing adoption, and I always associated it with Google trying to enhance the web as the application platform of the future.
W3C of course deserves credit for their hard work on this standard.
My main point was that regardless of the semantic "standard", nothing prevented us from putting everything in a generic div, so complaining that everyone's just "not on board" isn't a useful lament.
The association isn't without merit.
Google acquired Metaweb Technologies in 2010, acquiring Freebase with it. Freebase was a semantic web knowledge base and this became deeply integrated into Google's search technology. They did, in fact, want to push semantic web attributes to make the web more indexable, even though they originated neither the bigger idea nor the original implementation.
(one of my classmates ended up as an engineer at Metaweb, then Google)
"That was a human-generated hallucination"
Kudos. I respect this kind of honesty. I wish there was more of it.
"I always associated semantic web with something Google was pushing to assist with web crawling, and my first exposure to it was during the Web 2.0 era (early 2010s) as HTML5 was seeing adoption, and I always associated it with Google trying to enhance the web as the application platform of the future."
This sounds more like "indexing" than "crawling"
The "Sitemaps 0.84" protocol , e.g., sitemap.xml, was another standard that _was_ introduced by Google
Helpful for crawling and other things
(I convert sitemap.xml to rss; I also use it to download multiple pages in a single TCP connection)
Not every site includes a sitemap, some do not even publish a robots.txt
Some ideas going back even further than that, like 1994:
https://philip.greenspun.com/research/shame-and-war-old
>was proposed by Tim Berners-Lee
he's actually still working on it: https://solidproject.org/
"Semantic web was a standard suggested by Google", sorry that's false. They only contributed a bit towards it.
Tim Berners-Lee coined it in 1999 and further expanded on the concept in a 2001 Scientific American article by Berners-Lee, Hendler, and Lassila.
The phrase “if everyone just” is an automatic trigger for me. Everyone is never going to just. A different solution to whatever the problem is will be necessary.
I can't find a copy of the old "reasons your solution to email spam won't work" response checklist, but one of the line items was "fails to account for human nature".
eh I feel this but it's a lot simpler than that. Not "if everyone built everything correctly" but "if everyone's work was even slightly better than complete garbage". I do not see many examples of companies building things that are not complete embarrassing shit. I worked at some companies, and the things we built were complete embarrassing shit. The reasons are obvious: nobody internally cares enough to do it well, nobody externally has any standards, and the money still flows if you do a bad job, so why do better?
What happens in practice is that the culture exterminates the drive for improvement: not only are things bad, but people look at you like you're crazy if you think things should be better. Like in 2025 people defend C, people defend Javascript, people write software without types, people write scripts in shell languages; debugging sometimes involves looking at actual bytes with your eyes; UIs are written in non-cross-platform ways; the same stupid software gets written over and over at a million companies; sending a large file to another person is still actually pretty hard, and leaving comments on it is functionally impossible ... these are software problems, everything is shit, everything can be improved on, nothing should be hard anymore but everything still is; we are still missing a million abstractions that are necessary to make the world simple. Good lord, yesterday I spent two hours trying to resize a PDF. We truly live in the stone age; the only progress we've made is that there are now ads on every rock.
I really wish it was a much more ruthlessly competitive landscape. One in which, if your software is bad, slow, hard to debug, hard to extend, not open source, not modernized, not built on the right abstractions, hard to migrate on or off of, not receptive to feedback, covered in ads... you'd be brutally murdered by the competition in short order. Not like today, where you can just lie on your marketing materials and nobody can really do anything because the competition is just as weak. People would do a much better job if they had to in order to survive.
> the money still flows if you do a bad job so why do better?
I'll raise. The money flows because you do a bad job. Doing a good job is costly and takes time. The money cannot invest that much time and resources. Investing time and resources builds an ordinary business. The money is in for the casino effect, for the bangs. Pull the arm and see if it sticks. If yes, good. Keep pulling the arm. If not, continue with another slot machine.
I would argue that the money is just short-termism, though. It assumes short-term returns are the correct choice because it lacks the technical understanding of the long-term benefits of a good job.
In my experience many acquisitions mark the peak of a given software product. The money then makes the argument that it's "good enough" and flogs the horse until it's dead, and a younger, more agile team of developers eventually builds a new product that makes it irrelevant. The only explanation I have for why so many applications fail to adapt is a cultural conflict between the software and the money, one that always gets resolved by the money winning.
For example, I would suggest that the vast majority of desktop apps, especially those made by SMEs and originally written in MFC or something, fail to make the transition to the online services they need today because of this conflict. The company ends up dying and the money never works out what it did wrong, because it's much harder to appreciate those long-term effects than the short-term ones that gave them more money at the start.
Wait, which is the correct programming language to defend? C and Javascript are on pretty opposite sides of most spectra....
Languages which are not hard to use and dangerous. I work in C professionally right now but would much rather work in either Zig or Rust.
Today it is easy (finally we have Rust, Zig, Odin, Swift, Go, etc. that show marked improvements), but the OP is correct: a lot of progress was stalled because people defend suboptimal tools.
Every time somebody suggested "but C is wrong", the answer was "you should not do the wrong thing; be careful every time and things will be fine!"
P.S.: Back then we had Pascal/Ada/Eiffel/OCaml and others, but the major issue is that C/JS should have been improved to remove the bad things and add the good things. It's not as if what to do or why was a mystery; it was just fear and arrogance against change.
And this caused too much inertia against "reinventing the wheel" and too much hate toward anybody who even tries.
the new ones we should have made by now that fix their problems...
Zig and Rust and TS are a start
They are on the same side of the “number of serious design flaws due to historical incidents” spectrum.
Right, so, to reiterate the question, which language(s) are on the opposite side of that spectrum?
He seems to have mistaken his personal opinions on which languages and language features are good for some objective truth.
Ironically, that’s part of why we can’t have nice things. People who aren’t open to other viewpoints and refuse to compromise when possible impede progress.
well I was being flippant
but
what do we have other than our opinions? I think everything sucks so that's what I said.
Presumably the spectrum that would have Rust and Typescript at the other end.
Is that just age?
Well, sure, age is part of it. I would hope languages coming out 40-50 years after their predecessors (in the case of Rust following C/C++) would have improved upon those predecessors and learned from the ideas of computer science in the intermediate years.
(Coincidentally this is one of my chief complaints about Go: despite being about the same age as Rust, it took the avenue of discarding quite a lot of advancements in programming language theory and ergonomics since C)
Go has a much different set of design goals than Zig, Nim, or especially Rust. Go is really for people who want a few major improvements on C like easier builds, faster builds, higher-level standard string handling, direct support for some form of concurrency, an optional GC which defaults to on, and a few syntax tweaks - things that a modern C might have that were not going to make it into a standards revision. Rust, to support ownership and the borrow checker at compile time, had to build a useful language around that hugely helpful but quite restrictive requirement. They subsequently went different directions than the Go team on a lot of the other language features. Zig, Nim, and D are something in between those extremes in their own ways.
As someone with a background of a lot of time with Perl and the Pascal/Ada family who was rooting for Julia, Go strikes a good balance for me where I would have used Perl, Python, Julia, Ruby, or shell for system tasks. I haven’t done a lot of work in Rust yet, because in the infrastructure space speed to implement is usually more important than performance and Go is a lot faster already than Python or Ruby and because my team as a whole has much more experience with it. I certainly appreciate why the folks at work who do the performance-critical network and video software use mostly Rust to do what they used to do in C, Java, or C++.
Programming languages are not on a spectrum. I guess they dislike the syntax of JS and the low-level no-batteries of C.
Programming languages, like literally everything else, are on many spectra.
Did your comment always say "most spectra"? I swear it was "a spectra" before. Sorry, I must have misread.
It did, but it is an unusual plural to encounter in conversation so I don't blame you for missing it. Kudos for owning up!
We have to accept that the vast majority of people don't think like us. They don't think it's complete garbage because they can't look hard enough to appreciate that fidelity.
While it might be better if everyone thought like us and wanted things to be _fundamentally_ good, most people don't, and most people money >> less people money and the difference in scale is vast. We could try to build a little fief where we get to set the rules but network effects are against us. If anything our best shot is to try to reverse the centralisation of the internet because that's a big cause of enshittification.
The semantic web came out of work on Prolog and formal systems for AI, which just didn't work well... LLMs and vector databases give us new tools that are pretty usable.
Imagine how easy it would be to build and train an AI if it had semantically tagged input all over the Web.
I also think...
mom: "you need to clean up your room"
kid: "mom, just give up. The room will always be a mess, just use search"
I think you're confusing XHTML and semantic web on the "break web pages" part.
Neither broke web pages, honestly. XHTML requires a DTD named at the top of the document, and browsers will happily fall back to HTML 3, 4, or 5 as they can if there’s no DTD specified.
My interpretation of "break web pages" was serving XHTML with MIME type application/xhtml+xml, in which case browsers don't render anything when the XHTML isn't well-formed, which is really just a strict / validate-my-syntax mode you can opt into.
In either case, I agree!
Fresh plums right off the tree taste significantly better than the ones you can get in the produce aisle, which are in turn better than canned, which are themselves still better than re-hydrated prunes.
In scaling out computation to the masses, we went from locally grown plums that took a lot of work and were only available to a small number of people that had a plum tree or knew someone that had one, to building near magical prune-cornucopia devices that everyone could carry around in their pockets, giving them an effectively unlimited supply of prunes.
LLMs re-hydrate these for us, making them significantly more palatable; if you're used to gnawing dried fruit, they seem amazing.
But there's still a lot of work to be done.
Perhaps, but we still failed, and not just at personal computing or the semantic web, but at computing and programming in general. The failure is the gap between the original intent (computing was originally more or less AI) plus the theory, and the actual result, with every software project turning into unsustainable infinite regress. Things likely broke around ALGOL.
Also, LLMs are failing too, for different reasons, but IMO it's unlikely that AI in general will; it will correct a roughly 60-year failure in industrial computer science.
> we still failed [at] semantic web
The most reprehensible knowledge-search-and-communication failure of all.
We gave people monetisation of drek instead. Then asked them to use it for education. Then trained bot systems on it. Then said that even those answers must be made to conform to transient propagandists.
> LLMs re-hydrate these for us, making them significantly more palatable; if you're used to gnawing dried fruit, they seem amazing.
Except sometimes you're expecting a fresh plum, and then you bite into a fig, or an apple, or a banana, or a stick.
The analogy doesn't make any sense, computers today are better by any conceivable metric than computers before.
The computers, yes. The experience of using them, no.
There is a joy to having a tool that lets you do what you never could before, and it has been buried along the way in layers of petty annoyances and digital microaggressions. As you say, computers today are better by so many metrics, including, unfortunately, their ability to track us, sell us things we neither need nor want, and outrage and distract us.
But can it beat William Carlos Williams?
they were delicious
so sweet
and so cold
:)
Or slop with some plum aroma added. Seems like a good analogy.
Extruded synthetic plum substrate
I really don't like this analogy, and I really don't like the premise of this article.
Writing software is only so scalable. It doesn't matter how many shortcuts we take, like Electron and JavaScript. There are only so many engineers with so much time, and there are abundantly many problems to solve.
A better analogy would be to look at what's happening to AI images and video. Those have 10,000x'd the fundamental cost savings, time savings, and personnel requirements. It's an industrialization moment. As a filmmaker who has made several photons-on-glass films, this is a game changer and lifts the entire creative industry to a level where individuals can outdo Pixar.
That is the lens with which to look at what AI will do to software. We're going from hand-carving stone wheels to the Model T.
This is all just getting started. We've barely used what the models of today offer us.
Totally agree with the core of your position. But the two perspectives are complementary, and perhaps even more deeply linked.
Initial attempts to alleviate any shortage are likely to result in a decrease in quality; initial attempts to improve quality are likely to reduce variability and thus variety. My point (and my reading of the article) is that we're at the stage where we've figured out how to turn out enough Hostess Twinkies that "let them eat cake!" is a viable option, and starvation is being driven back.
This is definite progress, but also highlights previous failures and future work.
This is a massive cope. AI image/video slop is still slop. Yes it's getting better, but it's still better .. slop. Unless radical new breakthroughs are made, the current LLM paradigm will not outdo Pixar or any other apex of human creativity. It'll always be instantly recognizable, as slop.
And if we allow it to take over society, we'll end up with a society that's also slop. Netflixification/marvelization only much much worse..
He didn't say LLMs can outdo Pixar. That's ridiculous and they are nowhere near that level.
He said that LLMs are at the point "where individuals can outdo Pixar." And that's very possible. The output of a talented individual with the assistance of AI is vastly better than the output of AI alone.
This is a very reductionist take that's to be expected from a software engineer but most definitely something that an artistic person would never utter. The creative process doesn't scale in the way that software engineers imagine. Coming up with genuine new ideas or magical moments of "synthesis" doesn't emerge from throwing lots of commodified tools together and calling it a day.
So far we haven't seen a single iota of creative art coming out of LLMs. Zip. Nada. It's all smoke and mirrors in that we get better and better veneers on top of bad copies of actual art that humans have previously created. The veneers are improving but there is no substance underneath. It's still slop. I don't want to live in a society that doesn't care about substance but instead worships the veneer. Yet this is the place that the current LLMs are taking us.
I'm not talking about LLMs. I'm talking about image and video diffusion models.
Editors, VFX artists, and studios big and small are already using the tools.
I'm in this industry. They're widely deployed as we speak.
Improving every year, it approaches the asymptote of being almost any good.
> This is a massive cope. AI image/video slop is still slop.
Slop content is slop content, AI or not. You don't need an AI to make slop. We've always had films like "The Room", it's just that the financial and time constraints put an upper bound on how much slop was created. AI makes creation more accessible. You've got Reddit for image and video now, essentially.
You are biased by media narratives and slop content you're encountering on social media. I work in the industry and professionals are using AI models in ways you aren't even aware of. I guarantee you can't identify all AI content.
> And if we allow it to take over society, we'll end up with a society that's also slop. Netflixification/marvelization only much much worse..
Auteurs and artists aren't going anywhere. These tools enable the 50,000 annual film students to sustainably find autonomy, where previously there wasn't any.
Scaling the production of almost good things by individuals that used to take just as many people and just as big a budget as a really nice major feature film is certainly full of use cases for education, training, portfolios of work, and pitches of the content.
I have often thought about how computers are significantly faster than they were in the early 2000s, but they are significantly harder to use. Using Linux for the first time in college was a revelation, because it gave me the tools to tell the computer "rename all of the files in this directory, keeping only the important parts of the name."
But instead of iterating on better interfaces to effectively utilize the N thousands of operations per second a computer is capable of, the powers that be behind the industry have decided to invest billions of dollars in GPUs to get a program that seems like it understands language, but is incapable of counting the number of B's in "blueberry."
Prompt: "Spell blueberry and count the letter b".
They're not claiming AGI yet, so human intelligence is required to operate an LLM optimally. It's well known that LLMs process tokens rather than characters, so without space for "reasoning" there's no representation of the letter b in the prompt. Telling it to spell or think about it gives it room to spell the word out, and from there it can "see" the letters and it's trivial to count.
if you're going to need to learn how to use a tool, why not learn to use the efficient and precise one?
Because there aren't more efficient and precise tools capable of the same things?
perl -e 'print scalar grep {/b/} split //, "blueberry"'
echo blueberry | grep -o 'b' | wc -l
echo blueberry | perl -ne 'print scalar (() = m/(b)/g)'
echo blueberry | tr -d '\n' | tr b '\n' | wc -l
echo -n blueberry | tr b '\n' | wc -l
So long as I’m teaching the user how to speak to the computer for a specific edge case, which of these burn nearly as much power as your prompt? Maybe we should consider that certain problems are suitable to LLMs and certain ones should be handled differently, even if that means getting the LLM to recognize its own edge cases and run canned routines to produce answers.
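For what it's worth, the "recognize its own edge cases and run canned routines" part already exists as tool calling. A rough sketch with the OpenAI Python SDK; the model name and the tool schema here are placeholders of mine, not anything from the thread:

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Describe a deterministic helper the model may call instead of "counting" itself.
    tools = [{
        "type": "function",
        "function": {
            "name": "count_letter",
            "description": "Count occurrences of a letter in a word.",
            "parameters": {
                "type": "object",
                "properties": {
                    "word": {"type": "string"},
                    "letter": {"type": "string"},
                },
                "required": ["word", "letter"],
            },
        },
    }]

    messages = [{"role": "user", "content": "How many b's are in 'blueberry'?"}]
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)

    msg = resp.choices[0].message
    if msg.tool_calls:  # the model may also just answer directly
        call = msg.tool_calls[0]
        args = json.loads(call.function.arguments)
        result = str(args["word"].count(args["letter"]))  # the cheap, exact "canned routine"
        messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": result}]
        final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
        print(final.choices[0].message.content)
    else:
        print(msg.content)

The counting itself stays deterministic; the model only decides when to hand the work off.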
Is counting the number of B's vital? Also, I'm pretty sure you can get an LLM to parse text the way you want it; it just doesn't see your text as you do, so that simple operation is not straightforward. Similarly, are you worthless because you seem like you understand language but are incapable of counting the number of octets in "blueberry"?
Let's say I hire a plumber because of his plumbing expertise and he bills me $35 and I pay him with a $50 bill and he gives me back $10 in change. He insists he's right about this.
I am now completely justified in worrying about whether the pipes he just installed were actually made of butter.
Really? Is it that easy? It happens quite often that you really believe something and are wrong. Maybe you are both right and the $5 bill is on the floor?
As shown by the GPT-5 reaction, a majority of people just have nothing better to ask the models than how many times does the letter "s" appear in "stupid".
I think this is a completely valid thing to do when you have Sam Altman going on the daily shows and describing it as a genius in your pocket and how it's smarter than any human alive. Deflating hype bubbles is an important service.
Yeah: Like with self-driving vehicles, the characteristics of when and how something breaks are important, not just some average error-rate.
If users cannot anticipate what does or doesn't constitute risky usage or potential damages, things go Extra Wrong.
But the point is, why would you trust it for anything at all, when it can't do an incredibly simple thing reliably at all? (Yes, I understand the tokenizer makes this hard, but still, it's a quick demonstration that it's just bad technology.)
2 time(s)
I mean, I think that anyone who understands UTF-8 will know that there are nine octets in blueberry when it is written on a web page. If you wanted to be tricky, you could have thrown a Β in there or something.
> anyone who understands UTF-8
So not too many?
> Similarly, are you worthless because you seem like you understand language but are incapable of counting the number of octects in "blueberry"?
Well, I would say that if GP advertised themselves as being able to do so, and confidently gave an incorrect answer, their function as someone who is able to serve their advertised purpose is practically useless.
So ChatGPT was advertised as a letter counter?
Also, no matter what hype or marketing says: GPT is a statistical word bag with a mostly invisible middleman to give it a bias.
A car is a private transportation vehicle but companies still try to sell it as a lifestyle choice. It's still a car.
It is (maybe not directly but very insistently) advertised as taking many jobs soon.
And counting stuff you have in front of yourself is basic skill required everywhere. Counting letters in a word is just a representative task for counting boxes with goods, or money, or kids in a group, or rows on a list on some document, it comes up in all kinds of situations. Of course people insist that AI must do this right. The word bag perhaps can't do it but it can call some better tool, in this case literally one line of python. And that is actually the topic the article touches on.
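For the record, that one line of Python (nothing clever, which is rather the point):

    print("blueberry".count("b"))  # prints 2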
People always insist that any tool must do things right. They as well insist that people do things right.
Tools are not perfect, people are not perfect.
Thinking that LLMs must do things right that people find simple is a common mistake, and it is common because we easily treat the machine as a person, while it is only acting like one.
> Thinking that LLMs must do things right, that people find simple, is a common mistake
Show me any publicly known visible figure that tries to rectify this. Everyone peddles hype, there's no more Babbage as in the "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" anecdote.
People and tools that don't do things right aren't useful. They get replaced. Making do with a shitty tool might make sense economically but not in any other way.
If you follow that reasoning, no person is useful and no tool is useful.
The little box I'm filling in now is, compared to a lot of other interfaces, a shitty interface. That doesn't mean it isn't useful. Probably it is getting replaced, only with a slightly better interface.
The karma system is quite simplistic and far from perfect. I'm sure there are ways to go around it. The moderators make mistakes.
That doesn't mean the karma and moderation are not useful. I hope you get my point but it's fine if we disagree as well.
It is advertised as being able to "analyze data" and "answer complex questions" [0], so I'd hope for them to reliably determine when to use its data-analysis capabilities to answer questions, if nothing else.
[0] http://web.archive.org/web/20250729225834/https://openai.com...
Here I am, a mind the size of a planet and what am I doing? Parking cars. - Marvin
If I have to talk to it in a specific way, why not just use programming, i.e. the specific way we already talk to computers effectively...
IDK, I think there is something adorable about taking a system that over trillions of iterations always performs the same operation with the same result, reliability unmatched in all of the universe…
And making it more of “IDK what it answered the way it did, but it might be right!!”
Humans like games :)
Yeah, dream on. I'm an engineer and know what structured data is. And yet I miserably fail to store my private files in a way that lets me find them again without relying on search tools. So how on earth are we ever going to organize all the world's data and knowledge? Thank god we found this sub-optimal "band aid" called LLMs!
Librarians have succeeded in precisely this for a long time now.
Precisely this. This article might seem reasonable to anybody who has never tried to organize something as simple as a local music collection.
Made me think about John Wilkins' "philosophical language" which I first heard about in Neal Stephenson's book Quicksilver
https://en.wikipedia.org/wiki/An_Essay_Towards_a_Real_Charac...
I'm sure there have been countless similar attempts at categorizing knowledge
one of the more successful ones being the dewey decimal system
I have my doubts about whether the thing the OP alleges we have "failed" at is even possible at all
Well, this runs straight into one of the massive, concrete pillars of computing: naming things.
Because that’s what a lot of this falls into.
Overwhelming amount of stuff with no names. No categories, no nothing.
With extended file attributes we could hang all sorts of meta bits off of arbitrary files. But that’s very fragile.
So we ask the systems to make up names for data based on their content, which turns out to not necessarily work as well as we might like.
No names is not the biggest problem. You just have to come up with a name. The problem is when things have multiple names, or even worse when people disagree on what names are appropriate for something. The world rarely allows you to neatly categorize large datasets. There are always outliers.
For example, you have a set of balls and you want to sort them by color. Where does orange stop and red begin? What about striped balls or ones with logos printed on them? What if it is a hypercolor ball that changes based on heat? It gets messy very fast.
Not everything has to be named once and put into a hierarchy like a directory tree. Tags work well for data. A system like an LLM that understands synonyms and antonyms should be able to find and even update tags for concepts that don’t have a full set already - as long as there are a few appropriate tags on the concept to start.
In practice if you're making up tags on the fly it's not much better than untagged data. A LLM that can figure out what the tags mean can probably just infer it from the data anyway.
I'll go farther and say it's not even possible. Our brain wants to categorize things to make things simple, but unfortunately nothing is simple.
I think of the wholphin, and it took the SeaWorld era to discover it. Who would see that coming?
Expresses a longing for the semantic web.
> Remember Semantic Web? The web was supposed to evolve into semantically structured, linked, machine-readable data that would enable amazing opportunities. That never happened.
I think the lesson to be learned is in answering the question "Why didn't the semantic web happen?"
"Why didn't the semantic web happen?"
I have literally been doing web development since there was a web, and the companies I developed for are openly hostile to the idea of putting their valuable, or perceived valuable, information online in a format that could be easily scraped. Information doesn't want to be free, it wants to be paid for. Unless the information shared pulls visitors to the site, it doesn't need to be public.
> Information doesn't want to be free, it wants to be paid for. Unless the information shared pulls visitors to the site it doesn't need to be public.
That's a cultural and societal problem, not a technology problem. The motivations (profit) are wrong, and don't lead to true innovations, only to financialization.
So long as people need to pay to eat, then information will also want to continue to be paid for, and our motivations will continue to be misaligned with true innovations, especially if said innovations would make life easier but wouldn't result in profit.
You need profit or you need post-scarcity or nothing works at all
I'd argue that resource availability is already high enough to alleviate scarcity for most people, and that most scarcity today is artificially generated, because of profit.
We won't achieve post scarcity, even with widespread automation (if AI ever brings that to fruition), because we haven't yet fixed the benefits that wealth brings, so the motivation to work toward a post-scarcity society just doesn't exist.
Kind of a chicken and egg problem.
I've encountered a similar issue in academia - PI's don't want to make their data available to be scraped (or, at least not easily) because the amount of grant funding is limited, and a rival who has scraped one's data could get the grant money instead by using that scraped data to bolster their application.
I was thinking of that in terms of siloed web sites but your description of walling off information is broader and more appropriate.
> "Why didn't the semantic web happen?"
Advertising.
To a degree re ads on pages, but why didn't big business end up publishing all of their products in JSON-LD or similar? A lot did, to get aggregated, but not all.
>"Why didn't the semantic web happen?"
Because web content is generated by humans, not engineers.
But also because companies that produce web content wanted it to be seen by humans who would look at ads, not consumed by bots and synthesized with other info into a product owned by some other firm.
And yet today most websites are being scraped by LLM bots which don't look at ads and which synthesize with other info into a product owned by some other firm.
Optimistically, the semantic web is going to happen. Just that instead of the original plan of website owners willingly making data machine-readable, LLMs will be the ones turning non-machine-readable data machine-readable (which can then be processed by user agents), even if the website owner prefers you looked at the ads instead.
The semantic web was theoretically great for data scientists and metadata scrapers, but offered close to zero value for ordinary humans, both on the publishing side and the consumption side. Also, nobody did the hard work of defining all of the categories and protocols in a way that was actually usable.
The whole concept was too high-minded and they never got the implementation details down. Even if they had, it would have been horrendously complex and close to impossible to manage. Asking every single publisher to neatly categorize their data into this necessarily enormous scheme would have resulted in countless errors all over the web that would have seriously undercut the utility of the project anyway. Ultimately the semantic web doesn't scale very well. It failed for the same reason command economies fail: it's too overwhelming for the people in control to manage and drowns in its own bureaucracy.
Because you cannot build it with merchants. This is a job for monks.
because semantic web was more limited than language
Semantic web never existed. There was Google and Google had an API to get breadcrumbs to show on search results. And that's what people called "semantic web." A few years later they gave up and made everything look like a breadcrumb anyway. And that sums up the whole semantic web experience.
A pretty reductionist and a poor take.
"Standing on the shoulders of giants, it is clear that the giants failed to reach the heights we have reached."
I find traditional web search and LLM search to be complementary technologies, and this is a good example of that. Both have their uses and if you get the information you need using one or the other, we are all happy.
I think the example query here actually shows a problem with the query languages used in web search rather than an intrinsic inability of web search. It contains what amounts to a natural language subquery starting with "in the same year". In other words, to evaluate this query properly, we need to first evaluate this subquery and then use that information to evaluate the overall query. Google Search and almost all other traditional web search engines use intentionally oversimplified query languages that disallow nested queries let alone subqueries, so this example really is just revealing a problem with the query language rather than a problem with web search overall. With a better query language, we might get better results.
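Just to make the subquery point concrete, here is a toy Python sketch of the two-step evaluation that a richer query language could express in one go; web_search and extract_year are hypothetical stand-ins, not real APIs:

    from typing import List

    def web_search(query: str) -> List[str]:
        """Hypothetical stand-in for any keyword web search API; returns result snippets."""
        raise NotImplementedError

    def extract_year(snippets: List[str]) -> str:
        """Hypothetical stand-in for pulling a four-digit year out of the snippets."""
        raise NotImplementedError

    # Step 1: evaluate the nested natural-language subquery down to a concrete value.
    year = extract_year(web_search("Sweden Gustav IV Adolf declared war on France year"))

    # Step 2: substitute that value into the outer query.
    colony_hits = web_search(f"first small British colony established {year}")
    # ...then one more hop for the animal on that country's flag.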
> What animal is featured on a flag of a country where the first small British colony was established in the same year that Sweden's King Gustav IV Adolf declared war on France? ... My point is that if all knowledge were stored in a structured way with rich semantic linking, then very primitive natural language processing algorithms could parse question like the example at the beginning of the article, and could find the answer using orders of magnitude fewer computational resources.
So as well as people writing posts in English, they would need to provide semantic markup for all the information like dates, flags, animals, people, and countries? It's difficult enough to get people to use basic HTML tags and accessible markup properly, so what was the plan for how this would scale, specifically to non-techy people creating content?
So wikipedia and wikidata?
This actually happened already, and it's part of why LLMs are so smart. I haven't tested this, but I venture a guess that without Wikipedia, Wikidata, Wikipedia clones, and stolen articles, LLMs would be quite a lot dumber. You can only get so far with reddit articles and embedded knowledge of basic info in higher-order articles.
My guess is that when fine-tuning and modifying weights, the lowest-hanging fruit is to overweight Wikipedia sources and reduce the weight of sources like reddit.
Only a relatively small part of Wikipedia has semantic markup though? Like if the article says "_Bob_ was born in _France_ in 1950", where the underlines are Wikipedia links, you'll get some semantic info from the use of links (Bob is a person, France is a country), but you'd be missing the "born" relationship and the "1950" date, as these are still only raw text.
Same with the rest of articles with much more complex relationships that would probably be daunting even for experts to markup in an objective and unambiguous way.
I can see how the semantic web might work for products and services like ordering food and booking flights, but not for more complex information like the above, or how semantic markup is going to get added to books, research articles, news stories etc. that are always coming out.
The semantic information is first present not in markup but in natural language.
But it is also present inside the website: there are infoboxes that mark the type of object, place, person, or theory.
Additionally infoboxes also hold relationships, you might find when a person was born in an infobox, or where they studied.
> The semantic information is first present not in markup but in natural language.
Accurate natural language processing is a very hard problem though and is best processed by AI/LLMs today, but this goes against what the article was going for when it's saying we shouldn't need AI if the semantic web had been done properly?
For example, https://en.wikipedia.org/wiki/Resource_Description_Framework and https://en.wikipedia.org/wiki/Web_Ontology_Language are some markup approaches related to the semantic web.
Complex NLP is the opposite to what the semantic web was advocating? Imagine asking the computer to buy a certain product and it orders the wrong thing because the natural language parsed was ambiguous.
> Additionally infoboxes also hold relationships, you might find when a person was born in an infobox, or where they studied.
That's not a lot of semantic information compared to the contents of a Wikipedia article that's several pages long. Imagine a version of Wikipedia that only included the infoboxes and links within them.
Yeah. Wikidata
> If all knowledge were stored in a structured way with rich semantic linking, then very primitive natural language processing algorithms could parse question like the example at the beginning of the article, and could find the answer using orders of magnitude fewer computational resources.
In vertical markets, can LLMs generate a "semantic web of linked data" knowledge graph to be parsed with efficient NLP algorithms?
https://news.ycombinator.com/item?id=43914227#43926169
Palantir was 90% about ontology creation back when I first used it in 2009. They knew at that point that without some level of structure that mapped to the specific business problem that the data input into the graph was difficult to contextualize.
The root cause of this is that HTML is not a language for the markup of hypertext.
Because everything has to be copied and/or compressed into a single-layer document along with its markup, you almost never see actual markup on the web, just a lot of formatting and layout.
Because you can't have multiple markup of a given source material, adding multiple hierarchies doesn't happen. Any given information structure is necessarily oversimplified and thus only fit for limited use.
It's almost as if someone read Vannevar Bush's description of the memex and decided to actively prevent its reification.
He said the organization of knowledge was the largest challenge facing mankind post war. Clearly he was right, and we've failed miserably.
The thing LLMs provide is an impedance match between the data we do have, and the actual needs of mankind.
I think every field has its own version of this thought, where if we could just manage to categorise and tag things properly we could achieve anything. Our lack of a valid overarching ontology is what is holding us back from greatness.
It might be short lived, who knows, but it's interesting that the recent progress came from capturing/consuming rather than systematically eliminating the nuance in language.
Every field including library science? ;)
I think I broadly agree with this. I am super frustrated that everything always wants me to search for things. As an example, Finder's default search scope while looking at a folder is the whole machine instead of a filter on the directory you are viewing, which seems totally insane to me. It's almost like they don't want me to know where my files are.
I can understand that it's a result, to a degree, of cloud services and of people's primary mode swapping to opening an app and then opening recents or searching, instead of opening a file to open an app. But it does mean that you're at the mercy of what I experience as some pretty crap search algorithms that don't seem to want you to find the information you're looking for. I keep encountering searches that rank fuzzy matches over exact matches, or that aren't stable as you continue to complete the same word, and I just don't understand how that's acceptable, once it's been pointed out, if search is what I'm supposed to be using.
> It's almost like they don't want me to know where my files are.
I think this might actually be true in some cases. Especially where companies want your files on their cloud servers. It's better for them if you don't think about what's stored locally or remotely. It's better for them if you don't think at all and just ask them for whatever you want and let them decide what to show you or keep hidden from you. It's easier for them to get metrics on what you're doing if you type it out explicitly in a search box than it is for them to track you as you browse through a hierarchy you designed to get to what you want. You're supposed to feel increasingly helpless and dependent on them.
Uh, the author got so close to making the same realization I had while working on a project [0] for the Wikimedia Foundation: we wouldn't need search engines if we had better tooling to query semantic databases like Wikidata.
However, the thing the author might be missing is that the semantic web exists. [1] The problem is that the tools we could use to access it are not being developed by Big Tech. Remember Freebase? Remember that Google could easily have kept it around but decided to fold it and shove it into the structured query results? That's because Google is not interested in "organizing the world's information and making it universally accessible" unless it is done in a way that lets it install itself as the data broker.
I'm completely out of time and energy for any side project at the moment, but if someone wants to steal my idea: please take an LLM and fine-tune it so that it can take any question and turn it into a SPARQL query for Wikidata. Also, make a web crawler that reads a page and turns it into a set of RDF triples or QuickStatements for any new facts that are presented. This would effectively be the "ultimate information organizer" and could potentially replace Wikidata as most people's entry page of the internet.
[0]: https://meta.wikimedia.org/wiki/QuickStatements_3.0
[1] https://guides.library.ucla.edu/semantic-web/datasets
DBpedia Spotlight and entity-fishing already do something similar to your idea - they extract structured data from text and link to knowledge bases. Combining these with LLM-based query translation to SPARQL could indeed bridge the gap between semantic web's structure and natural language interfaces.
ChatGPT etc does an OK job at SPARQL generation. Try something like "generate a list of all supermarkets, including websites, country, description" and you get usable queries out.
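For what it's worth, the queries it hands back tend to look roughly like the sketch below; the supermarket Q-id and the property numbers are from memory, so treat them as assumptions worth checking against wikidata.org:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Wikidata asks for a descriptive User-Agent on its public endpoint.
    endpoint = SPARQLWrapper("https://query.wikidata.org/sparql", agent="sparql-sketch/0.1")

    endpoint.setQuery("""
    SELECT ?store ?storeLabel ?website ?countryLabel WHERE {
      ?store wdt:P31/wdt:P279* wd:Q180846 .      # instance of (a subclass of) supermarket
      OPTIONAL { ?store wdt:P856 ?website . }    # official website
      OPTIONAL { ?store wdt:P17 ?country . }     # country
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 50
    """)
    endpoint.setReturnFormat(JSON)

    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["storeLabel"]["value"],
              row.get("countryLabel", {}).get("value", ""),
              row.get("website", {}).get("value", ""))

The nice part is that the joins and filters stay inspectable, which is rather the article's point.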
In a much, much more limited way, this is what I was dabbling with with alltheprices - queries to pull data from wikidata, crawling sites to pull out the schema.org Product and offers, and publish the aggregate.
I would argue that LLM is ultimately making true Semantic Web available, or irrelevant.
It can basically figure out all the semantic knowledge graphs automatically for us, and it's multi-modal to boot. That means it can infer the relationships between any node across text, audio, images, videos, and even across different languages.
The question is whether humans specifying the relationships between nodes is more or less reliable than LLMs inferring those relationships.
So we need to compare the reliability of something hypothetical to something real? I'll take the real every time.
The funny thing to me is that now more than ever structured data is important so that AI has a known good data set to train from, and so that search engines have contextual and semantically rich data to search.
AI isn't a solution to this; on the contrary, whatever insufficiency exists in the original data set will only be amplified, as compression artifacts can be amplified in audio.
We also can't trust any data that's created recently, because an LLM can be trained to provide correct-looking structured data that may or may not accurately model semantic structure.
The best source of structured, tagged, contextually-rich data suitable for training or searching in the future will be from people who currently AREN'T using generative AI.
"... if all knowledge were stored in a structured way with rich semantic linking..." this sounds a lot like Google's "Knowledge Graph". https://developers.google.com/knowledge-graph. (Disclosure: I work at Google.)
If you ask an LLM where you can find a structured database of knowledge with structured semantic links, they'll point you to this and other knowledge graphs. TIL about Diffbot!
In my experience, it's a lot more fun to imagine the perfect database like this than it is to work with the actual ones people have built.
My read is that the author is saying it would have been really nice if there had been a really good protocol for storing data in a rich semantically structured way and everyone had been really really good at adhering to that standard.
Is that the main thrust of it?
> AI is impressive
Ok, but Google's result summary got the answer wrong. So did Gemini, when I tried it (Lion, Sierra Leone). And so did ChatGPT when I tried it (Lion, Sri Lanka).
So... it's impressive, sure, because the author is correct that a search engine can't answer that question in a single query. But it's also wildly wrong.
I also vaguely agree with the author that Google Drive sucks, but I wish they'd mentioned that the solution to their problem - using search! - also fucking sucks in Google Drive.
I saw a quote somewhere to the effect of “LLMs are lossy compression of the internet” and it seemed about right.
Yep, Ted Chiang wrote that for the New Yorker: https://www.newyorker.com/tech/annals-of-technology/chatgpt-...
> because we've failed at personal computing
?
If you really want to, you know, have this super generic indexing thing, why don't you go organize the web with HyperCard and semantic web crap and tell us how it works out for you.
I'm pretty sure AI _is_ "personal computing".
One can speak in their native language to a computer now and it mostly understands what is meant and can retrieve information or even throw together a scaffold of a project somewhat reliably.
It's not particularly good at writing software, however. Still feels like a human is needed to make sure it doesn't generate nonsense or apparently pretty insecure code.
So I'm not sure the author got the point across that they wished, but aren't vector databases basically a semantic storage/retrieval technology?
Yes, I rolled my eyes at the author's sneering.
By the way: his blog getting a few dozen HN comments is only impressive because he failed to write a better blog.
At Engelbart's mother of all demos in 1968, which basically birthed what we call personal computing today, most computer scientists were convinced that AGI was right around the corner and "personal computing" wasn't worth it.
Now, back then AGI wasn't right around the corner, and personal computing was really, really necessary. But how did we forget the viewpoint that personal computing was seen as a competing way to use computing power vs AI?
Economic interests, walled gardens, lock-in effects. Providers learned to make it (initially) convenient for us to forget. Of course, once we’re hooked, enshittification ensues: https://news.ycombinator.com/item?id=44837367
We weren't given enough time
Basically the same as decrying why we should have to learn foreign languages instead of everyone speaking Esperanto.
Which I still decry!
Another angle: we're super over provisioned in compute resources because of reasonable limitations in how quickly we can collectively absorb and learn how to use computers. "AI" is simply a new paradigm in our understanding of how to program computers. This isn't a failure, it's just an evolution.
Leading in with what feels like a p-hacked question that almost certainly isn't typical kind of hurts the point. Reminds me of seeing clearly rehearsed demos for things that ignore the reason it worked is because it was rehearsed. A lot.
Am I wrong in that this was a completely organic question?
People have said this since Vannevar Bush theorized the Memex. The unanswered question, the problem unsolved, remains: Who does all the categorization work? How come they're never part of "we," such that the plaint is never "if only we had done the work..."?
Neither chat-gpt nor gemini got the same answer as the article for me
` What animal is featured on a flag of a country where the first small British colony was established in the same year that Sweden's King Gustav IV Adolf declared war on France? `
chatgpt: Final answer: Lion
gemini: A lion is featured on the flag of Sri Lanka.
Gemini (2.5 pro) stands firm on its answer even when told that ChatGPT said it is a parrot. It does provide additional reasoning: https://g.co/gemini/share/45232dcc7278
> Neither chat-gpt nor gemini got the same answer as the article for me
getting wildly different and unpredictable answers from the same input is one of the features AI offers
Just complaining that the world is bad is a good way to waste your energy and end up being a cynic.
So why isn't all information organized in structured, open formats? Because there's not enough of an incentive to label/structure your documents/data that way. That's if you even want to open your data to the public - paywalls fund business models.
There have been some smaller successes with semantic web, however. While a recipe site might not want to make it easy for everyone to scrape their recipes, people do want Twitter to generate a useful link preview from their sites' metadata. They do that with special tags Twitter recognizes, and other sites can use as well.
The good news is that LLMs can generate structured data from unstructured documents. It's not perfect, but has two advantages: it's cheaper than humans doing it manually, and you don't have to ask the author to do anything. The structuring can happen on the read side, not the write side - that's powerful. This means we could generate large corpuses of open data from previously-inaccessible opaque documents.
This massive conversion of unstructured to structured data has already been happening in private, with efforts like Google's internal Knowledge Graph. That project has probably seen billions in cumulative investment over the years.
What we need is for open data orgs like Wikipedia to pick up this mantle. They already have Wikidata, whose facts you can query with a graph query language. The flag example in the article could be decomposed into motifs by an LLM and added to the flag's entry. And then you could use SPARQL to do the structured query. (And that structured query can be generated by LLMs, too!)
LLMs and structured data are friends.
Cyc[0] tried for 40 years to make handwritten semantic rules a thing, but still doesn't have much to show for it. Humans just aren't good or fast at writing down the rather fuzzy semantics of the real world in a computer-readable format.
With RDF especially, there was also the issue of "WTF is this good for?". The Semantic Web sounds lofty in theory, but was there ever even a clear plan for what the UI would look like? How would I explore all that semantic data if it ever came into existence? How would it deal with link rot?
And much like with RSS, I think a big failure of RDF is that it's some weird thing outside the Web, instead of just some additional HTML tags to enrich existing documents. If there is a failure, it's that. Even today, a lot of basic semantic tags are missing from HTML; we finally got <time> in 2011, but we still have nothing for names, cities, units, books[1], movies, GPS coordinates and a lot of other stuff we use daily.
Another big failure is that HTML has become a read-only format, the idea that one uses HTML as a source format to publish documents seems to have been completely abandoned. HTML is just a UI language for Web apps nowadays, Markdown, LaTeX or whatever is what one uses to write content.
0. https://en.wikipedia.org/wiki/Cyc
1. <a href="urn:isbn:..."> exists, but browsers don't support it natively
Building the semantic associations that LLMs can do is just building an LLM.
How do LLMs help Google, an ad company, generate more ad revenue? LLMs drive traffic away from websites and give you the answer directly without needing to bother with ads. How does that benefit Google's ad business?
Ad clicks are a trifle compared to the possibility of winning the game of displacing vast amounts of labor and delivering ads and narratives undisclosed through conversational AI output. At that point you've won the economy, period.
> winning the game of displacing vast amounts of labor and delivering ads and narratives undisclosed through conversational AI output. At that point you've won the economy, period.
In a functional democracy, all you should win by doing that is harsh antitrust action.
Yes that's why my opposition to LLMs is contextual. If it had been invented 30 years ago I'd have been optimistic about it.
Alan Kay has also written about his disappointment with what personal computing delivered. I think he'd agree with this.
Most people can't use the power of the computers they have effectively. Maintaining the data in Excel spreadsheets for most people is like manual labour.
How will user-generated content work in a semantic web world?
Personal computing failed because desktop operating systems started trying to work for mobile.
Everything that is bad in UI is a direct consequence of that.
1. No tooltips, right click, middle click behavior because touch doesn't have that. No ctrl+click either.
2. Large click areas wasting screen space with padding and margins.
3. Low density UI so it can shape-shift into mobile version.
4. Why type on a phone when you can talk? Make everything a search box.
5. Everything must be flat instead of skeuomorphic because it's easier to resize for other screen sizes.
6. Everything needs a swipe animation and views instead of dialogs because smartphones can't have windows.
ChatGPT does not build semantic maps. We know it has some attention maps, but it's really fuzzy how it uses them.
Is it unreasonable to expect some kind of good-enough baseline, guided by the priors of these LLMs, to become the standard? Should Google's priors (or their training dataset and recipe) be allowed to guide society's priors? Same problem, different era?
This is a clickbait article one could've written when Google was new and people were getting used to the idea of "You can just search for things on the internet."
It's just dressed up with mentions of AI for better "engagement."
"pile of divs" is so accurate
Isn't this what Reddit is for
No, AI is impressive because we built a machine that finally understands what we are saying and has a concept of what creativity is. It's rudimentary, but it's a milestone for humanity.
It's social media and the endless barrage of AI-generated content that's creating the illusion that AI isn't impressive. You're inundated every day with the same crap, and that's just making it less and less impressive.
This take is very naive and should anticipate the obvious criticism: people did try, very very hard, to create structured information systems. They simply did not work. The sort of systems people tend to build are well-scoped for a particular problem, but fall apart completely when moving out of that domain. It is very easy to imagine an information system that results in correct answering of flag questions. It is a wide open problem to come up with a way of tagging information that works well for flag questions and for questions about memes, or art. It's not like Google didn't try!
I always appreciate our weekly Crankypants Take on LLMs.
> AI is not a triumph of elegant design, but a brute-force workaround
You can read and understand Attention Is All You Need in one hour, and then (after just scaling out by a few billion) a computer talks to you like a human. Pretty elegant, if you ask me.
> The web was supposed to evolve into semantically structured, linked, machine-readable data that would enable amazing opportunities.
I missed that memo. The web was, and forever shall be, a giant, unstructured, beautiful mess. In fact, LLMs show just how hopeless the semantic web approach was. Yes, it's useful to attach metadata to objects, but you will still need massive layering and recursion to derive higher-order, non-trivial information.
This entire article is someone unable to let go of an old idea that Did Not Work.
> AI is not a triumph of elegant design, but a brute-force workaround
I think the author is on to something here, but doesn't realize it, or applies the thinking to the wrong problem. The issue isn't the web; search was good enough, and the web is a pretty big problem to solve. We need to go smaller. Applying AI to customer service, internal processes and bureaucracy, among other things, is an inelegant brute-force substitute for just fixing your shit.
The majority of customer service could be fixed by having better self-service, designing better UIs, writing better manuals, having better monitoring, better processes, better-trained staff, not insisting on pumping stock numbers, and actually caring about your product. AI will never fix the issues your customers are having; it'll just make customer service cheaper and worse because, just like for real humans, the content and levers it needs aren't available.
Same for much of the bureaucracy in companies and governments. Rather than having AI attempt to fill out forms or judge whether someone is entitled to a pension, insurance payout or what have you, take the time to fix your damn processes and build self-service portals that actually work (some of those may have some sort of AI on the backend for things like scanning documents).
Forget the semantic web thing; it never worked, there's too much data, and AI-generated content certainly isn't making the problem easier. Let's work on that some other time. Still, the author is correct: LLMs are a brute-force workaround, regardless of how elegant the design may be. They are applied to a host of problems that should just be eliminated, not hidden and stupefied by a prediction engine.
Except it's all just smoke and mirrors.
Explain? The algorithms are surprisingly easy to understand. There is no trickery.
And the conclusions are delusional, especially given how easy it is for anyone to see there is nothing in there even resembling intelligence.
"AI is impressive because it solves problems better than the previous solutions"? Yes?
What a terrible question in the opening example. Are we playing Jeopardy?
> They scan the unstructured web and build ephemeral semantic maps across everything. It's not knowledge in the classic sense.. or perhaps it is exactly what knowledge is?
Betteridge's law
Personal computing: at best, it's now the flavour du jour we get to use to slam cloud-based approaches to everything for being what they always were: wasteful, slow and privacy-invasive... But we still can't just plainly say the cloud has been mostly bad.
I sympathize so much with the failure of personal computing to manifest!
> My point is that if all knowledge were stored in a structured way with rich semantic linking, then very primitive natural language processing algorithms could parse questions like the example at the beginning of the article, and could find the answer using orders of magnitude fewer computational resources. And most importantly: the knowledge and the connections would remain accessible and comprehensible, not hidden within impenetrable AI models.
It's a pocket hope of mine that AI brings us back to the Semantic Web, or something very like it. In many ways these embeddings are giant non-human lexicons already. Distilling this information out seems so possible.
Even just making an AI go mark up (or, uh, perhaps refine) a page with microdata seems conceptually very doable!
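And very little machinery is needed on the consuming side: microdata-style markup can be read back with nothing but the standard library. A sketch (the markup and the property names are illustrative, not a real schema.org vocabulary):

    # Toy example: pull (itemprop, text) pairs out of microdata-style markup.
    from html.parser import HTMLParser

    ANNOTATED = """
    <div itemscope itemtype="https://schema.org/Country">
      <span itemprop="name">Japan</span>
      <span itemprop="flagDescription">red circle on a white field</span>
    </div>
    """

    class ItempropExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self._current = None   # itemprop waiting for its text
            self.props = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if "itemprop" in attrs:
                self._current = attrs["itemprop"]

        def handle_data(self, data):
            if self._current and data.strip():
                self.props.append((self._current, data.strip()))
                self._current = None

    parser = ItempropExtractor()
    parser.feed(ANNOTATED)
    print(parser.props)
    # [('name', 'Japan'), ('flagDescription', 'red circle on a white field')]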
More broadly, looking at personal computing: my personal belief is that that failure is hugely because of apps. Instead of a broader personal computing that aggregates, that allows constructivism, that enables personal growth & development (barefoot developers & home-cooked software style), computing has been defined by massification, by big tech software. The dominant computing paradigm has been the mainframe pattern: we the users have an app that acts as a terminal to some far-off cloud-y data center. Whatever agency we get is hewn out for us a priori by product teams, and any behavior not explicitly built in is a "Felony Contempt of Business Model" (an oh-so-accurate Doctorow-ism)! https://maggieappleton.com/home-cooked-software https://news.ycombinator.com/item?id=40633029 https://pluralistic.net/2022/10/20/benevolent-dictators/#fel... https://news.ycombinator.com/item?id=33279274
It is so so sad to see computing squandered so!
The good news is AI is changing this relationship with software. I'm sure we will have no end of AI models built into our software, and that companies will maintain their strict control (tyrant's grip) over software as long as they can! But for AI to flourish, it's going to need to work across systems. And that means tearing down some of the walls, walls that have forcibly kept computing (ardently) anti-personal.
I can go look at https://github.com/punkpeye/awesome-mcp-servers and take such great hope from it. Hundreds of ways that we have eked out an interface with systems that, before, people had no say in and no control over.
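For flavor, one of those servers can be a couple dozen lines. A sketch from my recollection of the Python MCP SDK's FastMCP quickstart (treat the exact import path and API as an assumption; check the SDK docs for the current shape):

    # Sketch of a tiny MCP server exposing a single tool.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("demo-tools")

    @mcp.tool()
    def add(a: int, b: int) -> int:
        """Add two numbers."""
        return a + b

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default

Point any MCP-capable client at it and the tool becomes something you decided the software should do, not something a product team carved out for you.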
dang, please fix the title, "semantic web" somehow got deleted.
if it's a space issue, "semantic web" is far more relevant to the article than "personal computing".
The last ten years of consumer web tech were dedicated to SEO slop and crypto. Google has destroyed everything else with its endless promotion of hindustantimes.com listicles.
The essential meaning of technological progress is the utter inadequacy of humanity. We build cars that we never figure out how to drive safely, but being too arrogant to admit defeat and reconstruct society around trams like a sensible animal would, we wait a century for cars that can drive themselves while the bodies pile up on the freeways in the meantime.
Yes, it's true, we can't make a UI. Or a personal computer, or anything else that isn't deeply opaque to its intended users. But it's so much worse than that. We humans can hardly do anything successfully besides build technology and beat each other in an ever-expanding array of ridiculous pissing contests we call society.