I am on the receiving end: I have to parse CSV generated by various (very expensive, very complicated) eCAD software packages. And it's often garbage. Those expensive software packages trip on things like escaping quotes. There is no way to recover a CSV line that has an unescaped double quote.
I can't point to a strict spec and say "you are doing this wrong", because there is no strict spec.
Then there are the TSV and semicolon-Separated V variants.
Did I mention that field quoting was optional?
And then there are banks, which take this to another level. My bank (mBank), which is known for levels of programmer incompetence never seen before (just try the mobile app) generates CSVs that are supposed to "look" like paper documents. So, the first 10 or so rows will be a "letterhead", with addresses and stuff in various random columns. Then there will be your data, but they will format currency values as prettified strings, for example "34 593,12 USD", instead of producing one column with a number and another with currency.
I used to be a data analyst at a Big 4 management consultancy, so I've seen an awful lot of this kind of thing. One thing I never understood is the inverse correlation between "cost of product" and "ability to do serialisation properly".
Free database like Postgres? Perfect every time.
Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting, escaping or the difference between \n and \r and who thinks it's clever to use 0xFF as a delimiter, because in the Windows-1252 code page it looks like a weird rune and therefore "it won't be in the data".
"Enterprise software" has been defined as software that is purchased based on the decisions of people that will not use it. I think that explains a lot.
Yep, we had a constant tug of war between techies who wanted to use open-source tools that actually work (Linux, Postgres, Python, Go etc.) and bigwigs who wanted impressive-sounding things in Powerpoint decks and were trying to force "enterprise" platforms like Palantir and IBM BigInsights on us.
Any time we were allowed to actually test one of the "enterprise" platforms, we'd break it in a few minutes. And I don't mean by being pathologically abusive, I mean stuff like "let's see if it can correctly handle a UTF-8 BOM...oh no, it can't".
> Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting...
It's because about a certain size, system projects are captured by the large consultancy shops, who eat the majority of the price in profit and management overhead...
... and then send the coding work to a lowest-cost someone who has never heard of quoting, etc.
And it's a vicious cycle, because the developers in those shops that do learn and mature quickly leave for better pay and management.
(Yes, there's usually a shit hot tiger team somewhere in these orgs, but they spend all their time bailing out dumpster fires or landing T10 customers. The average customer isn't getting them.)
Just a nitpick about consultancy shops -- I've had a chance of working in one in eastern europe and noticed that it's approach to quality was way better than client's. It also helped that client paid by hours, so consultancy company was incentivized to spend more time on refactorings, improvals and testing (with constant pushback from client).
So I don't buy the consultancy company sentiment, it always boils down to engineers and incentives.
In my experience, smaller ones tend to align incentives better.
Once they grow past a certain size though, it's a labor arbitrage game. Bill client X, staff with resources costing Y (and over-represented), profit = X-Y, minimize Y to maximize profit.
PwC / IBM Global Services wasn't offering the best and brightest. (Outside of aforementioned tiger teams)
I agree with you in general, although my case was the other way around. My company was 10k+ people. But my client was probably the most technically advanced company at that time, with famously hard interviews for their own employees. My employer also didn't want to lose the client (it was beginning of collaboration), and since everyone wanted to work there (and move to US+California) my shop applied pretty strong filter for their own heads, even before sending them to client's vendor-interview.
And client was very-very happy with the quality, and that we didn't fight for promotions and could maintain very important, but promotion-poor projects. Up to the point that client trusted to completely gave couple of projects fully to my shop. When you don't need to fight for promotions, code quality also improves.
Try to live in a country where "," is the decimal point. Of course this causes numerous interoperability issues or hidden mistakes in various data sets.
There would have been many better separators... but good idea to bring formatting into it as well...
There was a long period of my life that I thought .csv meant cemicolon separated because all I saw was cemicolon separated files and I had no idea of the pain.
Not sure if they still do this, but Klarna would send us ", " separated files. If there wasn't a space after the comma then it was to be read as a decimal point. Most of the CSV parser don't/didn't allow you to specify multi-character separators. In the end I just accepted that we had one field for krona and for öre and most fields would need to have a leading space removed.
Microsoft did this very extensively. Many Non-English versions of Excel do save CSV-files with a semicolon as a separator and it probably was handled differently too in normal Excel files.
But it goes even further, it affected their scripting languages even to this day with newer languages like their BI script (forgot the name of the language). For example, parameters of function calls aren't separated by ',' anymore and ';' is used instead. But only in the localized versions.
That of course means that you have to translate these scripts depending on the locale set in your office suite, otherwise they are full of syntax errors...
There are better separators included in ASCII, but not used as often: 28 File Separator, 29 Group Separator, 30 Record Separator and 31 Unit Separator.
> Then there will be your data, but they will format currency values as prettified strings, for example "34 593,12 USD", instead of producing one column with a number and another with currency.
To be fair, that's not a problem with CSV but with the provider's lack of data literacy.
I worked in a web shop which had to produce spreadsheets which people wanted to look at in Excel. I gave them so many options, and told each client to experiment and choose the option which worked for them. In the end, we had (a) UTF-8 CSV, (b) UTF-8 CSV with BOM, (c) UTF-16 TSV, (d) UTF-8 HTML table with a .xlsx file extension and a lying Content-Type header which claimed it was an Excel spreadsheet.
Option a worked fine so long as none of the names in the spreadsheet had any non-ASCII characters.
Option d was by some measures the worst (and was definitely the largest file size), but it did seem to consistently work in Excel and Libre Office. In fact, they all worked without any issue in Libre Office.
I agree and as a result I have completely abandoned CSV.
I use the industry standard that everyone understands: ECMA-376, ISO/IEC 29500 aka .xlsx.
Nobody has any problems producing or ingesting .xlsx files. The only real problem is the confusion between numbers and numeric text that happens when people use excel manually. For machine to machine communication .xlsx has never failed me.
Now you might argue that ECMA-376 accounts for this, because it has a `date1904` flag, which has to be 0 for 1900-based dates and 1 for 1904-based dates. But what does that really accomplish if you can’t be sure that vendors understand subtleties like that if they produce or consume it? Last time I checked (maybe 8 years ago), spreadsheets created on Windows and opened on Mac still shifted dates by four years, and the bug was already over twenty years old at that time.
And the year-1904 issue is just the one example that I happen to know.
I have absolutely zero confidence in anything that has touched, or might have touched, MS Excel with anything short of a ten-foot pole.
Parsing Excel files in simple data interchange use cases that don't involve anyone manually using spreadsheets is an instance of unnecessary complexity. There are plenty of alternatives to CSV that remain plaintext, have much broader support, and are more rigorous than Excel in ensuring data consistency. You can use JSON, XML, ProtoBuf, among many other options.
If only there were character codes specifically meant to separate fields and records.... we wouldn't have to worry so much about quoted commas or quoted quotes.
That isn't solving anything, just changing the problem. If I want to store a string containing 0x1C - 0x1F in one of the columns then we're back in the exact same situation while also losing the human readable/manually typeable aspect people seem to love about CSV. The real solution is a strict spec with mandatory escaping.
Not for text data. Those values are not text characters like , or " are, and have only one meaning. It would be like arguing that 0x41 isn't always the letter "A".
For binary files, yeah but you don't see CSV used there anyway.
The idea that binary data doesn't go in CSVs is debatable; people do all sorts of weird stuff. Part of the robustness of a format is coping with abuse.
But putting that aside, if the control chars are not text, then you sacrifice human-readability and human-writability. In which case, you may as well just use a binary format.
True, but very few people compose or edit CSV data in Notepad. You can, but it's very error-prone. Most people will use a spreadsheet and save as CSV, so field and record separator characters are not anything they would ever deal with.
I've dealt with a few cases of CSVs including base64-encoded binary data. It's an unusual scenario, but the tools for working with CSVs are robust enough that it was never an issue.
So in addition to losing human readability, we are also throwing away the ability to nest (pseudo-)CSVs? With comma delimiters, I can take an entire CSV document and put it in 1 column, but with 0x1C-0x1F delimiters and banning non-text valid utf-8 in columns I no longer can. This continues to be a step backwards.
There are lots of 8-bit mostly-ASCII character sets that assign printable glyphs to some or all of the codepoints that ASCII assigns to control characters. TeX defined one, and the IBM PC's "code page 437" defined another.
There are several ways how a control character might inadvertently end up inside a text corpus.
Given enough millions of lines, it’s bound to happen, and you absolutely don’t want it to trip up your whole export because of that one occurrence. So yes, you have to account for it in text data, too.
You can disallow all control characters (ASCII < 32) other than CR/LF/TAB, which is reasonable. I don't know of any data besides binary blobs which uses those. I've never heard of anyone inlining a binary file (like an image) into a "CSV" anyway.
If you disallow control characters so that you can use them as delimiters, then CSV itself becomes a "binary" data format - or to put it another way, you lose the ability to nest CSV.
It isn't good enough to say "but people don't/won't/shouldn't do that", because it will just happen regardless. I've seen nested CSV in real-life data.
Compare to the zero-terminated strings used by C, one legacy of which is that PostgreSQL doesn't quite support UTF-8 properly, because it can't handle a 0 byte in a string, because 0 is "special" in C.
Right, but the original point I was responding to is that control characters are disallowed in the data and therefore don't need to be escaped. If you're going to have an escaping mechanism then you can use "normal" characters like comma as delimiters, which is better because they can be read and written normally.
It's good for a delimiter to be uncommon in the data, so that you don't have to use your escaping mechanism too much.
This is a different thing altogether from using "disallowed" control characters, which is an attempt to avoid escaping altogether - an attempt which I was arguing is doomed to fail.
But it is terrible at that because there is no widely adhered to standard[1], the sender and receiver often disagree on the details of what exactly a CSV is.
[1]: yes, I know about RFC 4180. But csvs in the wild often don't follow it.
The only times I hated CSV was when it came from another system I had no control over. For example Windows and their encodings, or some other proprietary BS.
But CSV under controlled circumstances is very simple.
And speaking of Wintendo, the bonus is often that you can go straight from CSV to Excel presentation for the middle management.
The post should at least mention in passing the major problem with CSV: it is a "no spec" family of de-facto formats, not a single thing (it is an example of "historically grown").
And omission of that meams I'm going to have to call this our for its bias (but then it is a love letter, and love makes blind...).
Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files, and there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa. Quoting, escaping, UTF-8 support are particular problem areas, but also that you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess but there ambiguities in it that cannot be resolved by machine).
Having worked extensively with SGML for linguistic corpora, with XML for Web development and recently with JSON I would say programmatically, JSON is the most convenient to use regarding client code, but also its lack of types makes it useful less broadly than SGML, which is rightly used by e.g. airlines for technical documntation and digital humanities researchers to encode/annotate historic documents, for which it is very suitable, but programmatically puts more burden on developers. You can't have it all...
XML is simpler than SGML, has perhaps the broadest scope and good software support stack (mostly FOSS), but it has been abused a lot (nod to Java coders: Eclipse, Apache UIMA), but I guess a format is not responsible for how people use or abuse it. As usual, the best developers know the pros and cons and make good-taste judgments what to use each time, but some people go ideological.
(Waiting for someone to write a love letter to the infamous Windows INI file format...)
In fairness there are also several ambiguities with JSON. How do you handle multiple copies of the same key? Does the order of keys have semantic meaning?
jq supports several pseudo-JSON formats that are quite useful like record separator separated JSON, newline separated JSON. These are obviously out of spec, but useful enough that I've used them and sometimes piped them into a .json file for storage.
Also, encoding things like IEEE NaN/Infinity, and raw byte arrays has to be in proprietary ways.
> The JSON syntax does not impose any restrictions on the strings used as names, does not require that name strings be unique, and does not assign any significance to the ordering of name/value pairs.
That IS unambiguous.
And for more justification:
> Meaningful data interchange requires agreement between a producer and consumer on the semantics attached to a particular use of the JSON syntax. What JSON does provide is the syntactic framework to which such semantics can be attached
> JSON is agnostic about the semantics of numbers. In any programming language, there can be a variety of number types of various capacities and complements, fixed or floating, binary or decimal.
> It is expected that other standards will refer to this one, strictly adhering to the JSON syntax, while imposing semantics interpretation and restrictions on various encoding details. Such standards may require specific behaviours. JSON itself specifies no behaviour.
It all makes sense when you understand JSON is just a specification for a grammar, not for behaviours.
> and does not assign any significance to the ordering of name/value pairs.
I think this is outdated? I believe that the order is preserved when parsing into a JavaScript Object. (Yes, Objects have a well-defined key order. Please don't actually rely on this...)
> Valid JSON text is a subset of the ECMAScript PrimaryExpression syntax. Step 2 verifies that jsonString conforms to that subset, and step 10 asserts that that parsing and evaluation returns a value of an appropriate type.
And in the algorithm
c. Else,
i. Let keys be ? EnumerableOwnProperties(val, KEY).
ii. For each String P of keys, do
1. Let newElement be ? InternalizeJSONProperty(val, P, reviver).
2. If newElement is undefined, then
a. Perform ? val.[[Delete]](P).
3. Else,
a. Perform ? CreateDataProperty(val, P, newElement).
If you theoretically (not practically) parse a JSON file into a normal JS AST then loop over it this way, because JS preserves key order, it seems like this would also wind up preserving key order. And because it would add those keys to the final JS object in that same order, the order would be preserved in the output.
> (Yes, Object's have a well-defined key order. Please don't actually rely on this...)
JS added this in 2009 (ES5) because browsers already did it and loads of code depended on it (accidentally or not).
There is theoretically a performance hit to using ordered hashtables. That doesn't seem like such a big deal with hidden classes except that `{a:1, b:2}` is a different inline cache entry than `{b:2, a:1}` which makes it easier to accidentally make your function polymorphic.
In any case, you are paying for it, you might as well use it if (IMO) it makes things easier. For example, `let copy = {...obj, updatedKey: 123}` is relying on the insertion order of `obj` to keep the same hidden class.
I-JSON (short for "Internet JSON") is a restricted profile of JSON designed to maximize interoperability and increase confidence that software can process it successfully with predictable results.
So it's not JSON, but a restricted version of it.
I wonder if use of these restrictions is popular. I had never heard of I-JSON.
I think it's rare for them to be explicilty stated, but common for them to be present in practice. I-JSON is just an explicit list of these common implicit limits. For any given tool/service that describes itself as accepting JSON I would expect I-JSON documents to be more likely to work as expected than non-I-JSON.
> How do you handle multiple copies of the same key? Does the order of keys have semantic meaning?
This is also an issue, due to the way that order of keys are working in JavaScript, too.
> record separator separated JSON, newline separated JSON.
There is also JSON with no separators, although that will not work very well if any of the top-level values are numbers.
> Also, encoding things like IEEE NaN/Infinity, and raw byte arrays has to be in proprietary ways.
Yes, as well as non-Unicode text (including (but not limited to) file names on some systems), and (depending on the implementation) 64-bit integers and big integers. Possibly also date/time.
I think DER avoids these problems. You can specify whether or not the order matters, you can store Unicode and non-Unicode text, NaN and Infinity, raw byte arrays, big integers, and date/time. (It avoids some other problems as well, including canonization (DER is already in canonical form) and other issues. Although, I have a variant of DER that avoids some of the excessive date/time types and adds a few additional types, but this does not affect the framing, which can still be parsed in the same way.)
A variant called "Multi-DER" could be made up, which is simply concatenating any number of DER files together. Converting Multi-DER to BER is easy just by adding a constant prefix and suffix. Converting Multi-DER to DER is almost as easy; you will need the length (in bytes) of the Multi-DER file and then add a prefix to specify the length. (In none of these cases does it require parsing or inspecting or modifying the data at all. However, converting the JSON variants into ordinary JSON does require inspecting the data in order to figure out where to add the commas.)
`JSON.parse` actually does give you that option via the `reviver` parameter, which gives you access to the original string of digits (to pass to `BigInt` or the number type of your choosing) – so per this conversation fits the "good parser" criteria.
To be specific (if anyone was curious), you can force BigInt with something like this:
//MAX_SAFE_INTEGER is actually 9007199254740991 which is 16 digits
//you can instead check if exactly 16 and compare size one string digit at a time if absolute precision is desired.
const bigIntReviver = (key, value, context) => typeof value === 'number' && Math.floor(value) === value && context.source.length > 15 ? BigInt(context.source) : value
const jsonWithBigInt = x => JSON.parse(x, bigIntReviver)
Generally, I'd rather throw if a number is unexpectedly too big otherwise you will mess up the types throughout the system (the field may not be monomorphic) and will outright fail if you try to use math functions not available to BigInts.
Sorry yes, i was thinking of the context object with source parameter.
The issue it solves is a big one though, since without it the JSON.parse functionality cannot parse numbers that are larger than 64bit float numbers (f.ex. bigints).
“No one owns CSV. It has no real specification (yes, I know about the controversial ex-post RFC 4180), just a set of rules everyone kinda agrees to respect implicitly. It is, and will forever remain, an open and free collective idea.”
They even seem to think it is a good thing. But I don't see how not having a bunch of implementations that can't agree on the specifics of a file/interchange format is a good thing. And being free and open is completely orthogonal. There are many proprietary formats that don't have a spec, and many open formats that do have a spec (like, say, json).
Waiting for someone to write a love letter to the infamous Windows INI file format
I actually miss that. It was nice when settings were stored right alongside your software, instead of being left behind all over a bloated registry. And the format was elegant, if crude.
I wrote my own library for encoding/writing/reading various datatypes and structure into ini's, in a couple different languages, and it served me well for years.
> instead of being left behind all over a bloated registry
Really? I think the idea of a central, generic, key-value pair database for all the setting on a system is probably the most elegant reasonable implementation there could be.
The initial implementation of Windows Registry wasn't good. It was overly simplistic and pretty slow. Though the "bloat" (what ever that means) of registry hasn't been an actual issue in over 20 years. The only people invested in convincing you "it's an issue" are CCleaner type software that promise to "speed up your computer" if you just pay $6.99.
How many rows do you need in a sqlite database for it to be "bloated"?
> I feel like YAML is a spiritual successor to the .ini, since it shares a notable ideal of simple human readability/writability.
It doesn't feel that way to me: it's neither simple to read nor to write. I suppose that that's a builtin problem due to tree representation, which is something that INI files were never expected to represent.
TBH, I actually prefer the various tree representation workarounds used by INI files: using whitespace to indicate child nodes stops being readable once you have more than a screenful of children in a node.
YAML is readable? No way as there are too many ways to do the same thing and nested structures are unclear to the non trained eye (what is a list? What is nested?), let alone indentation in large files is an issue especially with the default 2 space unreadable standard so many people adhere to.
YAML simple? It's sepc is larger than XML... Parsing of numbers and strings is ambiguous, leading zeros are not strings but octal (implicit conversion...). List as keys? Oh ffs, and you said readable. And do not get me started about "Yes" being a boolean, reminds me of the MS Access localizations which had other decimal values for true and [local variant of true] (1 vs -1).
Writable? Even worse. I think I have never been able to write a YAML file without errors. But that might just be me, XML is fine though while unreadable.
YAML 1.2 leaves data types ambiguous, merely making the "Norway problem" optional and at the mercy of the application rather than, in the words of https://yaml.org/type/ (which has not been marked as deprecated), "strongly recommended".
Those schemas aren't part of the core schema, and you may interpret them if you are aiming for full 1.1 compatibility. If you're aiming for 1.1 compatibility, then you accept the Norway problem.
As someone living in a country where , is used as the decimal separator, I cannot begin to describe the number of times CSV data has caused me grief. This becomes especially common in an office environment where Excel is the de facto only data handling tool that most people can and will use. Here the behavior of loading data becomes specific to the individual machine and changes over time (e.g. when IT suddenly forces a reset of MS Office application languages to the local one).
That said, I don't really know of any alternative that won't be handled even worse by my colleagues...
To be honest, I'm wondering why you are rating JSON higher than CSV.
> Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files,
There is, actually, RFC 4180 IIRC.
> there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa.
"There are many flavours that deviate from the spec" is a JSON problem too.
> you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess but there ambiguities in it that cannot be resolved by machine).
Also a problem in JSON
> Quoting, escaping, UTF-8 support are particular problem areas,
Sure, but they are no more nor no less a problem in JSON as well.
Have you had to work with csv files from the wild much? I'm not being snarky but what you're talking about is night and day to what I've experienced over the years.
There aren't vast numbers of different JSON formats. There's practically one and realistically maybe two.
Headers are in each line, utf8 has never been an issue for me and quoting and escaping are well defined and obeyed.
This is because for datasets, almost exclusively, the file is machine written and rarely messed with.
Csv files have all kinds of separators, quote characters, some parsers don't accept multi lines and some do, people sort files which mostly works until there's a multi line. All kinds of line endings, encodings and mixed encodings where people have combined files.
I tried using ASCII record separators after dealing with so many issues with commas, semicolons, pipes, tabs etc and still data in the wild had these jammed into random fields.
Lots of these things don't break when you hit the issue either, the parsers happily churn on with garbage data, leading to further broken datasets.
Also they're broken for clients if the first character is a capital I.
This might be my old “space and network cost savings” reflex, which is a lot less necessary these days, kicking in, but the feels inefficient. It also gives rise to not knowing the whole schema until you read the whole dataset (which might be multiple files), unless some form of external schema definition is provided.
Having said that, I accept that JSON has advantages over CSV, even if all that is done is translating a data-table into an array of objects representing one row each.
> utf8 has never been an issue for me
The main problem with UTF8 isn't with CSV generally, it is usually, much like the “first column is called ID” issue, due to Excel. Unfortunately a lot of people interact with CSVs primarily with Excel, so it gets tarred with that brush by association. Unless Excel sees the BOM sequence at the start of a CSV file, which the Unicode standards recommend against for UTF8, it assumes its characters are using the Win1252 encoding (almost, but not quite, ISO-8859-1).
> Csv files have all kinds of separators
I've taken to calling them Character Separated Value files, rather than Comma, for this reason.
Yes, it's not great. Space is annoying, though compression pretty much removes that as a concern (zstd is good for this, you can even have a custom dictionary). And yes, missing keys is annoying.
JSONL is handy, JSON that's in the form {data: [...hundred megs of lines]} is annoying for various parsers.
I'm quite a fan of parquet, but never expect to receive that from a client (alas).
Parquet should get the praise. It's simply awesome.
It's what I'd pick for tabular data exchange.
A recent problem I solved with it and duckdb allowed me to query and share a 3M record dataset. The size? 50M. And my queries all ran subsecond. You just aren't going to get that sort of compression and query-ability with a csv.
I wonder if CSV is the trivial format, so you have many people picking it because they want the easiest, and still getting it wrong. JSON is harder, so very few people are going to roll their own serializer/deserializer, and those who do are more likely to focus on getting it right (or at least catching the really obvious bugs).
I've dealt with incorrect CSVs numerous times, never with incorrect JSON, but, of the times I know what was happening on the other system, each time the CSV was from some in house (or similar) implementation of dumping a SQL output (or similar) into a text file as an MVP. JSON was always using some library.
If so, that's all the more reason to love CSV as it stands guard for JSON. If CSV didn't exist, we would instead have broken JSON implementations. (JSON and XML would likely then share a similar relationship.)
Sometimes people interpret the term too generically and actually implement a high degree of non-trivial, very idiosyncratic complexity, while still calling it "CSV".
One project I worked on involved a vendor promising to send us data dumps in "CSV format". When we finally received their "CSV" we had to figure out how to deal with (a) global fields being defined in special rows above the header row, and (b) a two-level hierarchy of semicolon-delimited values nested within comma-delimited columns. We had to write a custom parser to complete the import.
Sure, I get your arguments and we're probably mostly in agreement, but in practice I see very few problems arising with using CSV.
I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.
I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.
And yet, they work.
The number of people uploading excel files to be processed or downloading excel files for local visualistation and processing would floor you. It's done multiple times a day, on multiple systems, in multiple companies.
And yet, they work.
I get your argument though - a JSON array of arrays can represent everything that CSV can, and is preferable to CSV, and is what I would choose when given the choice, but the issues with using that are not going to be fewer than issues with CSV using RFC1480.
>but in practice I see very few problems arising with using CSV
That is not my experience at all. I've been processing CSV files from financial institutions for many years. The likelihood of brokenness must be around 40%. It's unbelievable.
The main reason for this is not necessarily the CSV format as such. I believe the reason is that it is often the least experienced developers who are tasked with writing export code. And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.
JSON is better but it doesn't help with things like getting dates right. XML can help with that but it has complexities that people get wrong all the time (such as entities), so I think JSON is the best compromise.
> And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.
Can't they?
def excel_csv_of(rows):
for row in rows:
for i, field in enumerate(row):
if i:
yield ','
yield '"'
for c in field:
yield '""' if c == '"' else c
yield '"'
yield '\n'
I haven't tested this, even to see if the code parses. What did I screw up?
If my experience reflects a relevant sample then the answer is that most can but a very significant minority fails at the job (under the given working conditions).
Whether or not _you_ can is a separate question. I don't see anything wrong with your code. It does of course assume that whatever is contained in rows is correct. It also assumes that the result is correctly written to a file without making any encoding mistakes or forgetting to flush the stream.
Not using name value pairs makes CSV more prone to mistakes such as incorrect ordering or number of values in some rows, a header row that doesn't correspond with the data rows, etc. Some export files are merged from multiple sources or go through many iterations over many years, which makes such mistakes far more likely.
I have also seen files that end abruptly somewhere in the middle. This isn't specific to CSV but it is specific to not using libraries and not using libraries appears to be more prevalent when people generate CSV.
You'd be surprised how many CSV files are out there where the developer tried to guess incorrectly whether or not a column would ever have to be escaped. Maybe they were right initially and it didn't have to be escaped but then years later something causes a change in number formats (internationalisation) and bang, silent data corruption.
Prioritising correctness and robustness over efficiency as you have done is the best choice in most situations. Using a well tested library is another option to get the same result.
This forces each field to be quoted, and it assumes that each row has the same fields in the same order. A library can handle the quoting issues and fields more reliably. Not sure why you went with a generator for this either.
Most people expect something like
`12,,213,3`
instead of
`"12","213","3"`
which yours might give.
Forcing each field to be quoted is always correct, isn't it? How could something be "more reliable" than something that is always correct?
With respect to "the same fields in the same order", no, although you may or may not feed the CSV to an application that has such an expectation. But if you apply it to data like [("Points",),(),("x","y"),("3","4"),("6","8","10")] it will successfully preserve that wonky structure in a file Excel can ingest reliably. (As reliably as Excel can ingest anything, anyway, since Excel has its own Norway problem.)
It's true that it's possible to produce more optimized output, but I didn't claim that the output was optimal, just correct.
Using generators is necessary to be able to correctly output individual fields that are many times larger than physical memory.
I'll preface this that I think we are mostly in agreement, so that's the friendly tone of reply, part of this is just having flashbacks.
It's massively used, but the lack of adherence to a proper spec causes huge issues. If you have two systems that happen to talk properly to each other, great, but if you are as I was an entrypoint for all kinds of user generated files it's a nightmare.
CSV is the standard, sure, but it's easy to write code that produces it that looks right at first glance but breaks with some edge case. Or someone has just chosen a different separator, or quote, so you need to try and detect those before parsing (I had a list that I'd go through, then look for the most commonly appearing non-letter character).
The big problem is that the resulting semantically broken csv files often look pretty OK to someone scanning them and permissive parsers. So one system reads it in, splits something on lines and assumes missing columns are blank and suddenly you have the wrong number of rows, then it exports it. Worse if it's been sorted before the export.
Of course then there's also the issues around a lack of types, so numbers and strings are not distinguishable automatically leading to broken issues where you do want leading zeros. Again often not identified until later. Or auto type detection in a system breaking because it sees a lot of number-like things and assumes it's a number column. Without types there's no verification either.
So even properly formatted CSV files need a second place for metadata about what types there are in the file.
JSON has some of these problems too, it lacks dates, but far fewer.
> but the issues with using that are not going to be fewer than issues with CSV using RFC1480.
My only disagreement here is that I've had to deal with many ingest endpoints that don't properly support that.
Fundamentally I think nobody uses CSV files because they're a good format. They've big, slow to parse, lack proper typing, lack columnar reading, lack fast jumping to a particular place, etc.
They are ubiquitous, just not good, and they're very easy to screw up in hard to identify or fix ways.
Finally, lots of this comes up because RFC4180 is only from *2005*.
Oh, and if I'm reading the spec correctly, RFC4180 doesn't support UTF8. There was a proposed update maybe in 2022 but I can't see it being accepted as an RFC.
> I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.
And there are constant issues arising from that. You basically need a small team to deal with them in every institution that is processing them.
> I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.
Salesman of enterprise system do not care about issues programmers and clients have. They care about what they can sell to other businessmen. That teams on both sides then waste time and money on troubleshooting is no concern to the salesman. And I am saying that as someone who worked on the enterprise system that consumed a lot of csv. It does not work and process of handling them literally sometimes involved phone calls to admins of other systems. More often then would be sane.
> The number of people uploading excel files to be processed or downloading excel files for local visualistation and processing would floor you.
That is perfectly fine as long as it is a manager downloading data so that he can manually analyze them. It is pretty horrible when those files are then uploaded to other systems.
In practice, I have never ever received CSV to process that complied with RFC 4180, and in most cases it was completely incoherent and needed incredibly special handling to handle all the various problems like lack of escaping.
SAP has been by far the worst. I never managed to get data out of it that were not completely garbage and needed hand crafted parsers.
Through a lot of often-painful manual intervention. I've seen it first-hand.
If an organization really needs something to work, it's going to work somehow—or the organization wouldn't be around any more—but that is a low bar.
In a past role, I switched some internal systems from using CSV/TSV to using Parquet and the difference was amazing both in performance and stability. But hey, the CSV version worked too! It just wasted a ton of people's time and attention. The Parquet version was far better operationally, even given the fact that you had to use parquet-tools instead of just opening files in a text editor.
> There aren't vast numbers of different JSON formats.
Independent variations I have seen:
* Trailing commas allowed or not
* Comments allowed or not
* Multiple kinds of date serialization conventions
* Divergent conventions about distinguishing floating point types from integers
* Duplicated key names tolerated or not
* Different string escaping policies, such as, but not limited to "\n" vs "\x0a"
The json spec does not allow commas. Although there are jsom supersets that do.
> Comments allowed or not
The json spec does not allow comments. Although there are jsom supersets that do.
> Multiple kinds of date serialization conventions
Json spec doesn't say anything about dates. That is dependent on your application schema.
> Divergent conventions about distinguishing floating point types from integers
This is largely due to divergent ways different programming languages handle numbers. I won't say jsom handles this the best, but any file format used across multiple languages will run into problems with differences in how numbers are represented. At least there is a well defined difference between a number and a string, unlike csv.
> Duplicated key names tolerated or not
According to the spec, they are tolerated, although the semantics of such keys is implementation defined.
> Different string escaping policies, such as, but not limited to "\n" vs "\x0a"
Both of those are interpreted as the same thing, at least according to the spec. That is an implementation detail of the serializer, not a different language.
There are always many, but in comparison to csv I've received almost no differences. Json issues were rare but csv issues it was common to have a brand new issue per client.
Typically the big difference is there are different parsers that are less tolerant of in spec values. Clickhouse had a more restrictive parser, and recently I've dealt with matrix.
Maybe I've been lucky for json and unlucky for csv.
Basically, Excel uses the equivalent of ‘file’ (https://man7.org/linux/man-pages/man1/file.1.html), sees the magic “ID”, and decides a SYLK file, even though .csv files starting with “ID” have outnumbered .SYLK files by millions for decades.
Thanks. So I guess the easy compatible solution is to always quote the first item on the first line when writing CSV. Good to know. (Checking if the item starts with ID is more work. Possibly quote all items on the first line for simplicity.) (Reading SYLK is obviously irrelevant, so accepting unquoted ID when reading is the smarter way to go and will actually improve compatibility with writers that are not Excel. Also it takes no work.)
The byte for a capital I is the same as the start for an odd file format, slyk maybe? Excel has (or did if they finally fixed it) for years decided this was enough to assume the file (called .csv) cannot possibly be csv but must actually be slyk. It then parses it as such, and is shocked to find your slyk file is totally broken!
Yes but in practice CSV is defined by what Excel does.
As there is no standard to which Excel conforms as it predates standards and there would be an outcry if Excel started rejecting files that had worked for years.
There is a common misconception here. You can import CSV files into an excel sheet. You cannot open a CSV file with excel. That is a nonsense operation.
- “Each record is located on a separate line, delimited by a line break (CRLF)” ⇒ editing .csv files using your the typical Unix text editor is complicated.
- “Spaces are considered part of a field and should not be ignored”
- “Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes” ⇒ fields containing lone carriage returns or new lines need not be enclosed in double quotes.
INI was for a long time a seemingly preferable format in the Python community for configuration for a long time, as I recall it.
Haven’t been a full time Python dev in sometime though, it seems TOML has supplanted that, but I remember thinking how interesting it was that Python had a built in INI parser and serializer
While CSV isn't exactly grammared or standardised like XML I think if it as more schema:d than JSON. There might be data corruption or consistency issues, but there is implicitly a schema: every line is exactly n fields, and the first line might contain field names.
When a JSON API turns out to have optional fields it usually shows through trial and error, and unlike CSV it's typically not considered a bug you can expect the API owner to fix. In CSV 'missing data' is an empty string rather than nulls or their cousins because missing fields aren't allowed, which is nice.
I also like that I can write my own ad hoc CSV encoder in most programming languages that can do string concatenation, and probably also a suitable decoder. It helps a lot in some ETL tasks and debugging. Decent CSV also maps straight to RDBMS tables, if the database for some reason fails at immediate import (e.g. too strict expectations) into a newly created table it's almost trivial to write an importer that does it.
JSON is not schema’d per se and intentionally so. There’s jsonschema which has better expressiveness than inference of a tabular schema, as it can reflect relationships.
I lived through SOAP/WSDL horror with their numerous standards and the lack of compatibility between stacks in different programming languages. Having seen abused XML, CSV formats. CSV is preferable over XML. Human-readability matters. Relative simplicity matters.
Despite JSON may also be interpreted differently by different tools, it is a good default choice for communicating between programs
> lack of compatibility between stacks in different programming languages
Well, that sure beats OpenAPI lack of compatibility between stacks in the same programming language.
I think the fact one can't randomly concatenate strings and call it "valid XML" a huge bonus over the very common "join strings with comma and \r\n", non-rfc4180 compliant (therefore mostly unparseable without human/LLM interaction) garbage people often pretend is CSV.
There is no file format that works out of box under all extreme corner cases.
You would think that ie XML-defined WSDL with XSD schema is well battle proven. I've encountered 2 years ago (and still dealing with that) WSDL from a major banking vendor that is technically valid, but no open source library in Java (from all languages) was able to parse it successfully or generate binding classes out of box.
Heck, flat files can end up with extreme cases, just work enough with legacy banking or regulatory systems and you will see some proper shit.
The thing is, any sort of critical integration needs to be battle tested and continuously maintained, otherwise it will eventually go bad, even a decade after implementation and regular use without issues.
XML is a pretty good markup language. Using XML to store structured data is an abuse of it. All the features that are useful for markup are not useful for structured data and only add to the confusion.
> Waiting for someone to write a love letter to the infamous Windows INI file format...
Honestly, it’s fine. TOML is better if you can use it, but otherwise for simple applications, it’s fine. PgBouncer still uses INI, though that in particular makes me twitch a bit, due to discovering that if it fails to parse its config, it logs the failed line (reasonable), which can include passwords if it’s a DSN string.
Well, once you get over the fact that information on a TOML file can be out of order in any place, denominated by any mix of 3 different key encodings, and broken down in any random way... then yes, the rest of TOML is good.
CSV is ever so elegant but it has one fatal flaw - quoting has "non-local" effects, i.e. an extra or missing quote at byte 1 can change the meaning of a comma at byte 1000000. This has (at least) two annoying consequences:
1. It's tricky to parallelise processing of CSV.
2. A small amount of data corruption can have a big impact on the readability of a file (one missing or extra quote can bugger the whole thing up).
So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
> It’s that libraries and tools to parse CSV tend to suck. Whereas JSON is the lingua franca of data.
This isn't the case. An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity. Despite the lack of any sort of specification, it's easily the most widely supported data format in existence in terms of tools and language support.
> An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity.
Yea and it's still a partially-parseable shit show with guessed values. But we can and could have and should have done better by simply defining a format to use.
Well, Excel has a lot of common use-cases around processing numeric (and particularly financial) data. Since some locales use commas as decimal separators, using a character that's frequently present as a piece of data as a delimiter is a bit silly; it would be hard to think of a _worse_ character to use.
So, that means that Excel in those locales uses semicolons as separators rather than the more-frequently-used-in-data commas. Probably not the decision I'd make in retrospect, but not completely stupid.
Every time you cat a BSV file, your terminal beeps like it's throwing a tantrum. A record separator (RS) based file would be missing this feature! In other words, my previous comment was just a joke! :)
By the way, RS is decimal 30 (not octal '\030'). In octal, RS is '\036'. For example:
$ printf '\036' | xxd -p
1e
$ printf '\x1e' | xxd -p
1e
Can you point me to a language with any significant number of users that does NOT have a JSON library?
I went looking at some of the more niche languages like Prolog, COBOL, RPG, APL, Eiffel, Maple, MATLAB, tcl, and a few others. All of these and more had JSON libraries (most had one baked into the standard library).
The exceptions I found (though I didn't look too far) were: Bash (use jq with it), J (an APL variant), Scratch (not exposed to users, but scratch code itself is encoded in JSON), and Forth (I could find implementations, but it's very hard to pin down forth dialects).
CSV tooling has had to invest enormous amounts of effort to make a fragile, under-specified format half-useful. I would call it ubiquitous, I would call the tooling that we’ve built around it “impressive” but I would by no means call any of it “good”.
I do not miss dealing with csv files in the slightest.
> CSV tooling has had [...] to make a fragile, under-specified format half-useful
You get this backwards. Tabular structured data to store are ubiquitous. Text as a file format is also ubiquitous because it is accessible. The only actual decisions are about whether to encode your variables as rows or columns, what is the delimiter, and other rules such as escaping etc. Vars as columns makes sense because it makes appending easier. There is a bunch of stuff that can be used for delimeters, commas being the most common, none is perfect. But from this point onwards, decisions do not really matter, and "CSV" basically covers everything from now on. "CSV" is basically what comes naturally when you have tabular datasets and want to store them in text. CSV tooling is developed because there is a need for this way of formatting data. Whether CSV is "good" or "ugly" or whatever is irrelevant, handling data is complicated as much as the world itself is. The alternatives are either not structuring/storing the data in a tabular manner, or non-text (eg binary) formats. These alternative exist and are useful in their own right, but don't solve the same problems.
I think the issue is that CSV parsing is really easy to screw up. You mentioned delimiter choice and escaping, and I’d add header presence/absence to that list.
There are at least 3 knobs to turn every time you want to parse a CSV file. There’s reasonably good tooling around this (for example, Python’s CSV module has 8 parser parameters that let you select stuff), but the fact that you have to worry about these details is itself a problem.
You said “handling data is complicated as much as the world itself is”, and I 100% agree. But the really hard part is understanding what the data means, what it describes. Every second spent on figuring out which CSV parsing option I have to change could be better spent actually thinking about the data.
I am kind of amazed how people nag about having to parse practically a random file.
Having header or not should be specified up front and one should not parse some unknown file because that will always end up with failure.
If you have your own serialization and your own parsing working yeah this will simply work.
But then not pushing back to the user some errors and trying to deal with everything is going to be frustrating because amount of edge cases is almost infinite.
Handling random data is hard, saying it is a CSV and trying to support everything that comes with it is hard.
Microsoft Windows has had to invest enormous amounts...
Apple macOS has had to invest enormous amounts...
Pick your distro of Linux has had to invest enormous amounts...
None of them a perfect and any number of valid complaints can be said about any of them. None of the complaints make any of the things useless. Everyone has workarounds.
Hell, JSON has had to invest enormous amounts of effort...
I guess the point is that I can take a generic json parser and point it at just about any JSON I get my hands on, and have close to no issues parsing it.
Want to do the same with csv? Good luck. Delimiter? Configurable. Encoding? Configurable. Misplaced comma? No parse in JSON, in csv: might still parse, but is now semantically incorrect and you possibly won’t know until it’s too late, depending on your parser. The list goes on.
You claimed that CSV is "easily the most widely supported data format in existence in terms of tools and language support", which is a claim that CSV is better supported than JSON, which is a claim that JSON support is lacking.
Importing csvs in excel can be a huge pain due to how excel handles localisation. It can basically alter your data if you are not mindful about that, and I have seen it happening too many times.
Depends on what you mean by "better". I would rather software not handle a piece of data at all, than handle it erroneously and changing the data without me realising and thus causing all sorts of issues after.
Before you dismiss it as 'not a language, people have argued that it is. And you can definitely program stuff in it, and so that surely makes it a language
Excel can import and parse JSON, it's under the "Get Data" header. It doesn't have a direct GUI way to export to JSON, but it takes just a few lines in Office Scripts. You can even use embedded TypeScript to call JSON.stringify.
> it's easily the most widely supported data format in existence in terms of tools and language support.
Even better, the majority of the time I write/read CSV these days I don't need to use a library or tools at all. It'd be overkill. CSV libraries are best saved for when you're dealing with random CSV files (especially from multiple sources) since the library will handle the minor differences/issues that can pop up in the wild.
It's just that people tend to use specialized tools for encoding and decoding it instead of like ",".join(row) and row.split(",")
I have seen people try to build up JSON strings like that too, and then you have all the same problems.
So there is no problem with CSV except that maybe it's too deceptively simple. We also see people trying to build things like URLs and query strings without using a proper library.
There is a clear standard and it's usually written on an old Word '97 doc in a local file server. Using CSV means that you are the compatibility layer, and this is useful if you need firm control or understanding of your data.
If that sounds like a lot of edge-case work keep in mind that people have been doing this for more than half a century. Lots of examples and notes you can steal.
JSON has a clearly-defined standards: ISO/IEC 21778:2017, IETF RFC 7159, and ECMA-404. Additionally, Crockford has had a spec available on json.org since it's creation in 2001.
Do you have any examples of Python, Java, or any of the other Tiobe top 40 languages breaking the JSON spec in their standard library?
In contrast, for the few of those that have CSV libraries, how many of those libraries will simply fail to parse a large number of the .csv variations out there?
You need more than a standard; that standard has to be complete and unambiguous. What you're looking for is https://github.com/nst/JSONTestSuite
EDIT: The readme's results are from 2016, but there's more recent results (last updated 5 years ago). Of the 54 parsers /versions tested, 7 gave always the expected result per the spec (disregarding cases where the spec does not define a result).
JSON doesn't fail for very large values because they are sent over the wire as strings. Only parsers may fail if they or their backing language doesn't account for BigInts or floats larger than f64, but these problems exist when parsing any string to a number.
And indeed applies to CSV as well: it's just strings at the end of the day, its up to the parser to make sense of it into the data types one wants. There is nothing inherently stopping you from parsing a JSON string into a uint64: I've done so plenty!
Trailing commas and comments are plainly not standard JSON under any definition. There are standards that include them which extend JSON, sure, but I'm not aware of any JSON library that emits this kind of stuff by default.
> It's just that people tend to use specialized tools for encoding and decoding it instead of like ",".join(row) and row.split(",")
You really super can't just split on commas for csv. You need to handle the string encodings since records can have commas occur in a string, and you need to handle quoting since you need to know when a string ends and that string may have internal quote characters. For either format unless you know your data super well you need to use a library.
Until you have a large amount of data & need either random access or to work on multiple full columns at once. Duplicated keys names mean it's very easy for data in jsonlines format to be orders of magnitude larger than the same data as CSV, which is incredibly annoying if your processing for it isn't amenable to streaming.
Space-wise, as long as you compress it, it's not going to make any difference. I suspect a JSON parser is a bit slower than a CSV parser, but the slight extra CPU usage is probably worth the benefits that come with JSON.
Eh, it really isn't. The format does not lend itself to tabular data, instead the most natural way of representing data involves duplicating the keys N times for each record.
Almost, except the way Excel-style quoting works with newlines sucks - you end up with rows that span multiple lines, so you can't split on newline to get individual rows.
With JSON those new lines are \n characters which are much easier to work with.
I ended up parsing the XML format instead of the CSV format when handling paste from Excel due to the newlines issue.
CSV seemed so simple but after numerous issues, a cell with both newline and " made me realize I should keep the little hair I had left and put in the work to parse the XML.
It's not great either, with all its weird tags, but at least it's possible to parse reliably.
This is the way. jsonl where each row is a json list. It has well-defined standard quoting.
Just like csv you don't actually need the header row either, as long as there's convention about field ordering. Similar to proto bufs, where the field names are not included in the file itself.
This misses the point of standardization imo because it’s not possible to know a priori that the first line represents the variable names, that all the rows are supposed to have the same number of elements and in general that this is supposed to represent a table. An arbitrary parser or person wouldn’t know to guess since it's not standard or expected. Of course it would be parsed fine but the default result would be a kind of structure or multi-array rather than tabular.
Types at the type layer are not the same as types at the semantic layer. Sure every type in the JSON level has a "strong type" but the semantic meaning of the contents of e.g. a string are usually not expressable in pure JSON. So it is with CSV; you can think of every cell in CSV as containing a string (series of bytes) with it being up to you to enforce the semantics atop those bytes. JSON gives you a couple extra types, and if you can fit things into those types well, then that's great, but for most data concrete semantically meaningful data you won't be able to do that and you'll end up in a similar world to CSVs.
I see an array of arrays. The first and second arrays have two strings each, the last one has a float and a string. All those types are concrete.
Let's say those "1.1" and 7.4 values are supposed to be version strings. If your code is only sometimes putting quotes around the version string, the bug is in your code. You're outputting a float sometimes, but a string in others. Fix your shit. It's not your serialization format that's the problem.
If you have "7.4" as a string, and your serialization library is saying "Huh, that looks like a float, I'm going to make it a float", then get a new library, because it has a bug.
You're missing my point: basically nothing spits out data in that format because it's not ergonomic to do so. JSON is designed to represent object hierarchies, not tabular data.
JSON is lists of lists of any length and groups of key/value pairs (basically lisp S-expressions with lots of unnecessary syntax). This makes it a superset of CSV's capabilities.
JSON fundamentally IS made to represent tabular data, but it's made to represent key-value groups too.
Why make it able to represent tabular data if that's not an intended use?
> JSON is lists of lists of any length and groups of key/value pairs
The "top-level" structure of JSON is usually an object, but it can be a list.
> JSON fundamentally IS made to represent tabular data
No, it's really not. It's made to represent objects consisting of a few primitive types and exactly two aggregate types: lists and objects. It's a textual representation of the JavaScript data model and even has "Object" in the name.
> Why make it able to represent tabular data if that's not an intended use?
It's mostly a question of specialization and ergonomics, which was my original point. You can represent tabular data using JSON (as you can in JavaScript), but it was not made for it. Anything that can represent """data""" and at least 2 nesting levels of arbitrary-length sequences can represent tabular data, which is basically every data format ever regardless of how awkward actually working with it may be.
The fact that json can represent a superset of tabular data structures that csv is specifically designed to represent can be rephrased into that csv is more specialised than json in representing tabular data. The fact that json can also represent tabular data does not mean it is a better or more efficient way to represent that data instead of a format like csv.
In the same way, there are hierarchically structured datasets that can be represented by both json in hierarchical form and csv in tabular form by repeating certain variables, but if using csv would require repeating them too many times, it would be a bad idea to choose that instead of json. The fact that you can do sth does not always make it a good idea to do it. The question imo is about which way would be more natural, easy or efficient.
> The fact that json can represent a superset of tabular data structures that csv is specifically designed to represent can be rephrased into that csv is more specialised than json in representing tabular data. The fact that json can also represent tabular data does not mean it is a better or more efficient way to represent that data instead of a format like csv.
The reverse is true as well: being more specialized is a description of goals, not advantages.
> This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file
But of course, CSV is the wild west and there's no guarantee that any two encoders will do the same thing (sometimes, there's not even a guarantee that the same encoder will do the same thing with two different inputs).
Headers should have as many rows as possible that contain data items for their column and data items in a row should have a header for the respective columns, but real CSV files should be assumed to have incomplete or variable length lines.
"Any JSON primitive" does add a few requirements not semantically comparable to CSV, like numbers that are numbers, and keywords true, false, none.
When these syntaxes are parsed into objects, either the type info has to be
retaind, or some kind of attribute tag, so they can be output back to the same form.
> make it so any consumer can parse it by splitting on newline and then ...
There is something like that called JSON-lines. It has a .org domain 'n' everything:
JSON was designed to represent any data. There's plenty of systems that spit out data in exact that format because it's the natural way to represent tabular data using JSON serialization. And clearly if you're the one building the system you can choose to use it.
JSON is designed to represent JavaScript objects with literal notation. Guess what, an array of strings or an array of numbers or even an array of mixed strings and numbers is a commonly encountered format in JavaScript.
The new line character in a JSON string would always be \n. The new line in the record itself as whitespace would not be acceptable as that breaks the one line record contract.
Remember that this does not allow arbitrary representation of serialized JSON data. But it allows for any and all JSON data as you can always roundtrip valid JSON to a compact one line representation without extra whitespace.
Actually even whitespace-separated json would be a valid format and if you forbid json documents to be a single integer or float then even just concatenating json gives a valid format as JSON is a prefix free language.
That is[0] if a string s is a valid JSON then there is no substring s[0..i] for i < n that is a valid json.
So you could just consume as many bytes you need to produce a json and then start a new one when that one is complete. To handle malformed data you just need to throw out the partial data on syntax error and start from the following byte (and likely throw away data a few more times if the error was in the middle of a document)
That is [][]""[][]""[] is unambiguos to parse[1]
[0] again assuming that we restrict ourselves to string, null, boolean, array and objects at the root
[1] still this is not a good format as a single missing " can destroy the entire document.
In jsonl a modified chunk will lose you at most the removed lines and the two adjacent ones (unless the noise is randomly valid json), in particular a single byte edit can destry at most 2 lines.
utf-8 is also similarly self-correcting and so is html and many media formats.
My point was that in my made-up concatenated json format
[]"""[][][][][][][][][][][]"""[]
and
[]""[][][][][][][][][][][]""[]
are both valid but have differ only for 2 bytes but have entirely different structures.
Also it is a made-up format nobody uses (if somebody were to want this they would likely disallow strings at the root level).
When you need to encode the newline character in your data, you say \n in the JSON. Unlike (the RFC dialect of) CSV, JSON has an escape sequence denoting a newline and in fact requires its use. The only reason to introduce newlines into JSON data is prettyprinting.
It's tricky, but simple enough, RFC states that " must be used, inserting a " is done with "". This makes knowing what a record is difficult, since you must keep a variable that keeps the entire string.
How do you do this simply? you read each line, and if there's an uneven number of ", then you have an incomplete record and you will keep all lines until there is an odd number of ". after having the string, parsing the fields correctly is harder but you can do it in regex or PEGs or a disgusting state machine.
(...and many more.) "This comes up every single time someone mentions CSV. Without fail." - top reply from burntsushi in that last link, and it remains as true today as in 2017 :D
You're not wrong though, we just need some major text editor to get the ball rolling and start making some attempts to understand these characters, and the rest will follow suit. We're kinda stuck at a local optimum which is clearly not ideal but also not troublesome enough to easily drum up wide support for ADSV (ASCII Delimiter Separated Values).
>we just need some major text editor to get the ball rolling and start making some attempts to understand these characters
Many text editors offer extensions APIs, including Vim, Emacs, Notepad++. But the ideal behavior would be to auto-align record separators and treat unit separators as a special kind of newline. That would allow the file to actually look like a table within the text editor. Input record separator as shift+space and unit separator as shift+enter.
Hahah, I came here to make the comment about ASCII's control characters, so I'm glad someone else beat me to it, and also that someone further pointed out that this topic comes up every time someone mentions CSV!
And that’s why I tend to use tab delimited files more… when viewed with invisible characters shown, it’s pretty clear to read/write separate fields and have an easier to parse format.
This, of course, assumes that your input doesn’t include tabs or newlines… because then you’re still stuck with the same problem, just with a different delimiter.
As soon as you give those characters magic meanings then suddenly people will have reason to want to use them— it'll be a CSV containing localization strings for tooltips that contain that character and bam, we'll be back to escaping.
Except the usages of that character will be rare and so potentially way more scary. At least with quotes and commas, the breakages are everywhere so you confront them sooner rather than later.
Graphical representations of the control characters begin at U+2400 in the "Control Pictures" Unicode block. Instead of the actual U+001E Record Separator, you put the U+241E Symbol for Record Separator in the help text.
.... with a note underneath urging readers not to copy and paste the character because it's only the graphical representation of it, not the thing itself.
Perhaps a more salient example might be CSV nested in CSV. This happens all the time with XML (hello junit) and even JSON— when you plug a USB drive into my LG TV, it creates a metadata file on it that contains {"INFO":"{ \"thing\": true, <etc> }"}
The entire argument against ASCII Delimited Text boils down to "No one bothered to support it in popular editors back in 1984. Because I grew up without it, it is impossible to imagine supporting it today."
You need 4 new keyboard shortcuts. Use ctrl+, ctrl+. ctrl+[ ctrl+] You need 4 new character symbols. You need a bit of new formatting rules. Pretty much page breaks decorated with the new symbols. It's really not that hard.
But, like many problems in tech, the popular advice is "Everyone recognizes the problem and the solution. But, the problematic way is already widely used and the solution is not. Therefore everyone doing anything new should invest in continuing to support the problem forever."
> The entire argument against ASCII Delimited Text boils down to "No one bothered to support it in popular editors back in 1984. Because I grew up without it, it is impossible to imagine supporting it today."
There's also the argument of "Now you have two byte values that cannot be allowed to appear in a record under any circumstances. (E.g., incoming data from uncontrolled sources MUST be sanitized to reject or replace those bytes.)" Unless you add an escaping mechanism, in which case the argument shifts to "Why switch from CSV/TSV if the alternative still needs an escaping mechanism?"
Length-delimited binary formats do not need escaping. But the usual "ASCII Delimited Text" proposal just uses two unprintable bytes as record and line separators, and the signalling is all in-band.
This means that records must not contain either of those two bytes, or else the format of the table will be corrupted. And unless you're producing the data yourself, this means you have to sanitize the data before adding it, and have a policy for how to respond to invalid data. But maintaining a proper sanitization layer has historically been finicky: just look at all the XSS vulnerabilities out there.
If you're creating a binary format, you can easily design it to hold arbitrary data without escaping. But just taking a text format and swapping out the delimiters does not achieve this goal.
At least you don't need these values in your data, unlike the comma, which shows up in human-written text.
If you do need these values in your data, then don't use them as delimiters.
Something the industry has stopped doing, but maybe should do again, is restricting characters that can appear in data. "The first name must not contain a record separator" is a quite reasonable restriction. Even Elon Musk's next kid won't be able to violate that restriction.
In Windows (and DOS EDIT.COM and a few other similarly ancient tools) there have existed Alt+028, Alt+029, Alt+030, and Alt+031 for a long time. I vaguely recall some file format I was working with in QBASIC used some or all of them and I was editing those files for some reason. That was not quite as far back as 1984, but sometime in the early 1990s for sure. I believe EDIT.COM had basic glyphs for them too, but I don't recall what they were, might have been random Wingdings like the playing card suits.
Having keyboard shortcuts doesn't necessarily solve why people don't want to use that format, either.
> I believe EDIT.COM had basic glyphs for them too, but I don't recall what they were, might have been random Wingdings like the playing card suits.
That is not specific to EDIT.COM; they are the PC characters with the same codes as the corresponding control characters, so they appear as graphic characters. (They can be used in any program that can use PC character set.)
However, in EDIT.COM and QBASIC you can also prefix a control character with CTRL+P in order to enter it directly into the file (and they appear as graphic characters, since I think the only control characters they will handle as control characters are tabs and line breaks).
Suits are PC characters 3 to 6; these are PC characters 28 to 31 which are other shapes.
The keys would be something other than those, though. They would be: CTRL+\ for file separator, CTRL+] for group separator, CTRL+^ for record separator, CTRL+_ for unit separator. Other than that, it would work like you described, I think.
> But, like many problems in tech, the popular advice is "Everyone recognizes the problem and the solution. But, the problematic way is already widely used and the solution is not
This is unfortunately common. However, what else happens too, is disagreement about what is the problem and the solution.
It's more like, "because the industry grew up without it, other approaches gained critical mass."
Path dependence is a thing. Things that experience network effects don't get changed unless the alternative is far superior, and ASCII Delimited Text is not that superior.
Ignoring that and pushing for it anyway will at most achieve an xkcd 927.
But when looking for a picture to back up my (likely flawed) memory, Google helpfully told me that you can get a record separator character by hitting Ctrl-^ (caret). Who knew?
You’ll still find that sequence in data; it’ll just be rare enough that it won’t rear its ugly head until your solution has been in production for a while.
If there were visible well known characters that could be printed for those and keys on a keyboard for inputting them we would probably have RSV files. Because they are buried down in the nonprintable section of the ASCII chart they are a pain for people to deal with. All it would have taken is one more key on the keyboard, maybe splitting the tab key in half.
> If there were visible well known characters that could be printed...
...There would be datasets that include those characters, and so they wouldn't be as useful for record separators. Look into your heart and know it to be true.
Can't type them on a keyboard I guess, or generally work with them in the usual text-oriented tools? Part of the appeal of CSV is you can just open it up in Notepad or something if you need to. Maybe that's more a critique of text tools than it is of ASCII record separator characters.
I was really excited when I learned of these characters, but ultimately if it’s in ASCII then it’s in-band and will eventually require escaping leading to the same problem.
> I don't get how this is not widely used as standard.
It requires bespoke tools for edition, and while CSV is absolute garbage it can be ingested and produced by most spreadsheet software, as well as databases.
I couldn't find the story on it, but there was an instance of a config for some major service getting truncated, but since it was yaml it was more difficult to figure out that that was what happened. I think in AWS, but I can't find the story, so can't really remember.
And fully fair that you can have similar issues in other formats. I think the complaint here was that it was a bit harder, specifically because it did not trip up any of the loading code. With a big lesson learned that configs should probably either go pascal string style, where they have an expected number of items as the first part of the data, or xml style, where they have a closing tag.
Really, it is always amusing to find how many of the annoying parts of XML turned out to be somewhat more well thought out than people want to admit.
Really depends on how the CSV is generated/transferred. If the output of the faulting software is line-buffered then it's quite likely that a failure would terminate the file at a line break.
I don't understand why CSV became a thing when TSV, or a format using the nowadays weird ASCII control characters like start/end of text, start of heading, horizontal/vertical tab, file/group/record/unit separator.
It seems many possible designs would've avoided the quoting chaos and made parsing sort of trivial.
Any time you have a character with a special meaning you have to handle that character turning up in the data you're encoding. It's inevitable. No matter what obscure character you choose, you'll have to deal with it
I understand this argument in general. But basically everyone has some sort of spreatsheet application that can read CSV installed.
In some alternate worked where this "binary" format caught on it would be a very minor issue that it isn't human readable because everyone has a tool that is better at reading it than humans are. (See the above mentioned non-local property of quotes where you may think you are reading rows but are actually inside a single cell.)
Makes me also wonder if something like CBOR caught on early enough we would just be used to using something like `jq` to read it.
Except we have all these low ASCII characters specifically for this purpose that don't turn up in the data at all. But there is, of course, also an escape character specifically for escaping them if necessary.
Even if you find a character that really is never in the data - your encoded data will contain it. And it's inevitable that someone encodes the encoded data again. Like putting CSV in a CSV value.
You can't type any of those on a typewriter, or see them in old or simple simple editors, or no editor like just catting to a tty.
If you say those are contrived examples that don't matter any more then you have missed the point and will probably never acknowledge the point and there is no purpose in continuing to try to communicate.
One can only ever remember and type out just so many examples, and one can always contrive some response to any single or finite number of examples, but they are actually infinite, open-ended.
Having a least common denominator that is extremely low that works in all the infinite situations you never even thought of, vs just pretty low and pretty easy to meet in most common situations, is all the difference in the world.
The difference is that the coma and newline characters are much more common in text than 0x1F and 0x1E, which if you restrict your data to alphanumeric characters (which you really should) will never appear anywhere else.
Exactly. "Use a delimiter that's not in the data" is not real serialisation, it's fingers-crossed-hope-for-the-best stuff.
I have in the past does data extractions from systems which really can't serialise properly, where the only option is to concat all the fields with some "unlikely" string like @#~!$ as a separator, then pick it apart later. Ugh.
> Exactly. "Use a delimiter that's not in the data" is not real serialisation, it's fingers-crossed-hope-for-the-best stuff.
It's not doing just this, you pick something that's likely not in the data, and then escape things properly. When writing strings you can write a double quote within double quotes with \", and if you mean to type the designated escape character you just write it twice, \\.
The only reason you go for something likely not in the data is to keep things short and readable, but it's not impossible to deal with.
I agree that it's best to pick "unlikely" delimiters so that you don't have to pepper your data with escape chars.
But some people (plenty in this thread) really do think "pick a delimiter that won't be in the data" - and then forget quoting and/or escaping - is a viable solution.
The characters would likely be unique, maybe even by the spec.
Even if you wanted them, we use backslashes to escape strings in most common programming languages just fine, the problem CSV is that commas aren't easy to recognize because they might be within a single or double quote string, or might just be a separator.
Can strings in CSV have newlines? I bet parsers disagree since there's no spec really.
In TSV as commonly implemented (for example, the default output format of Postgres and MySQL), tab and newline are escaped, not quoted. This makes processing the data much easier. For example, you can skip to a certain record or field just by skipping literal newlines or tabs.
It's a lot easier to type comma than control characters and it's a lot easier to view comma than a tab (which might look like a space).
For automated serialization, plain text formats won out, because they are easy to implement a minimal working solution (both import and export) and more importantly, almost all systems agree on what plain text is.
We don't really have Apple formatted text, that will show up as binary on windows. Especially if you are just transferring id's and numbers, those will fall within ascii and that will work even if you are expecting unicode.
What issues would TSV have? As commonly implemented (for example, the default output format of Postgres and MySQL), in TSV, tab and newline are escaped, not quoted.
I want to push Sqlite as a data interchange format! it has the benefit of being well defined, and can store binary data, like images for product pictures inside the database. not a good idea if you're trying to serve users behind a web app, but as interchange, better than a zip file with filenames that have to be "relinked".
For context: I have a LOT of experience of interchange formats, like "full time job, every day, all day, hundreds of formats, for 20-years" experience.
Based on that experience I have come to one key, but maybe, counter-intuitive truth about interchange formats:
- Too much freedom is bad.
Why? Generating interchange data is cheaper than consuming it, because the creator only needs to consider the stuff they want to include, whereas the consumer needs to consider every single possible edge case and or scenario the format itself can support.
This is why XML is WAY more costly to ingest than CSV, because in XML someone is going to use: attributes, CDATA, namespaces, comments, different declaration, includes, et al. In CVS they're going to use rows, a format separator, and quotes (with or without escaping). That's it. That's all it supports.
Sqlite as an interchange format is a HORRIFYING suggestion, because every single feature Sqlite supports may need to be supported by consumers. Even if you curtailed Sqlite's vast feature set, you've still created something vastly more expensive to consume than XML, which itself is obnoxious.
My favorite interchange formats are, in order:
- CVS, JSON (inc. NDJSON), YAML, XML, BSON (due to type system), MessagePack, Protobuf, [Giant Gap] Sqlite, Excel (xlsx, et al)
More features mean more cost, more edge cases, more failures, more complex consumers. Keep in mind, this is ONLY about interchange formats between two parties, I have wildly different opinions about what I would use for my own application where I am only ever the creator/consumer, I actually love Sqlite for THAT.
Oh, this is interesting. Are you tying different systems together? If so, do you use some preferred intermediate format? Do you have a giant library of * -> intermediate -> * converters that you sprinkle between everything? Or maybe the intermediate format is in memory?
Not the person you were replying to, but from my experience CSV is a good place to define data types coming into a system and a safe way to dump data as long as I write everything down.
So I might do things like have every step in a pipeline begin development by reading from and writing to CSV. This helps with parallel dev work and debugging, and is easy to load into any intermediate format.
> do you use some preferred intermediate format?
This is usually dictated by speed vs money calculations, weird context issues, and familiarity. I think it's useful to look at both "why isn't this a file" and "why isn't this all in memory" perspectives.
For tabular/time-series greater than 100k rows I personally feel like parquet cannot be beat. It's self-describing, strongly-typed, relatively compact, supports a bunch of io/decode skipping, and is quite fast.
Orc also looks good, but isn't well supported. I think parquet is optimal for now for most analytical use-cases that don't require human readability.
It is an interchange format, so it is inter-system by virtue of that. If I am a self-creator/consumer the format I use can be literally anything even binary memory dumps.
Interesting! I've dealt with file interchange between closed source (and a couple open source) programs, but that was a while ago. I've also had to deal with csvs and xslts between SaaS vendors for import export of customer's data. I've done a bunch of reverse engineering of proprietary formats so we could import the vendor's files, which had more information than they were willing to export in an interchange format. Sometimes they're encrypted and you have to break it.
What you say is fair. Csv is underspecified though, there's no company called csv that's gonna sue for trademark enforcement, there's no official csv standard library that everyone uses. (They exist are some but there are so many naive importations because from first principles, because how hard could it be? output records and use a comma and newline (of which there are three possible options)).
How often do you deal with multiple Csv files to represent multiple tables that are actually what's used by vendors internally, vs one giant flattened Csv with hundreds of columns and lots of empty cells? I don't have your level of experience with csvs, but I've dealt with a them being a mess, where the other side implement whatever they think is reasonable given the name "comma separated values".
With sqlite, we're in the Internet age and so I presume this hypothetical developer would use the sqlite library and not implement their own library from scratch for funsies. This then leads to types, database normalization, multiple tables. I hear you that too many choices can bad, and xml is a great example of this, but sqlite isn't xml and isn't Csv.
It's hard to have this discussion in the abstract so I'll be forthcoming about where I'm coming from, which is Csv import export between vendors for stores, think like Doordash to UberEATS. the biggest problem we have is images of the items, and how to deal with that. It's an ongoing issue how to get them, but the failure mode, which does happen, is that when moving vendor, they just have to redo a lot of work that they shouldn't have to.
Ultimately the North Star I want to push towards is moving beyond csvs, because it'll let a people who currently have to hand edit the Csv so every row imports properly, not have to do that. They'd still exist, but instead have to deal with, well, what you see with XML files. which has its shortcomings, as you mention, but at least once how a vendor is using it is understood, individual records are generally understandable.
I was moved so I don't deal with import export currently, but it's because sqlite is so nice to work with on personal projects where it's appropriate that I want to push the notion of moving to sqlite over csvs.
One very minor problem is that you max out storing blobs of 2GB(? I think, maybe 4GB). Granted few will hit this, but this limit did kill one previous data transfer idea of mine.
> not a good idea if you're trying to serve users behind a web app
I use Sqlite for a static site! Generating those static pages out to individual pages would involve millions of individual files. So instead I serve up a sqlite database over http, and use a sqlite wasm driver [0] to load (database) pages as needed. Good indexing cuts down on the number of pages it grabs, and I can even get full text search!
Only feature I'm missing is compression, which is complicated because for popular extensions like sqlite-zstd written in Rust.
CSV's actual problem is that there's no single CSV, and you don't know what type you have (or even if you have single consistent type through the whole file) without trying to parse the whole file and seeing what breaks. Is there quoting? Is that quoting used consistently? Do you have five-digit ZIP codes, or have the East Coast ones been truncated to four digits because they began with zero? Spin the wheel!
> So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
For my own parser, I made everything `\` escaped: outside of a quote or double-quote delimited string, any character prefixed with a `\` is read verbatim. There are no special exceptions resulting in `\,` producing a comma while `\a` produces `\a`. This makes it a good rule, because it is only one rule with no exceptions.
I considered this but then went the other way - a \ before anything other than a \, newline or comma is treated as an error. This leaves room for adding features, e.g. \N to signify a SQL NULL.
Regarding quoting and escaping, there are two options that make sense to me - either use quoting, in which case quotes are self-escaped and that's that; or use escaping, in which case quotes aren't necessary at all.
A good way to parallelize CSV processing is to split datasets into multiple files, kinda like manual sharding. xan has a parallel command able to perform a wide variety of map-reduce tasks on splitted files.
I don't. If I ever have a dataset that requires newlines in a string, I use another method to store it.
I don't know why so many people think every solution needs to to be a perfect fit for every problem in order to be viable. CSV is good at certain things, so use it for those things! And for anything it's not good at, use something else!
True, but in that case I'm not the one choosing how to store it, until I ingest the data, and then I will store it in whatever format makes sense to me.
I don't think we do? It's more that a bunch of companies already have their data in CSV format and aren't willing to invest any effort in moving to a new format. Doesn't matter how much one extolls all the benefits, they know right? They're paying someone else to deal with it.
Of course. I’m not saying I roll my own parser for every project that uses a CSV file, I’m just describing my criteria for using CSV vs some other format when I have the option.
Eh, all you're really saying is "I'm not using CSV. Instead I'm using my CSV." Except that's all that anybody does.
CSV can just as easily support escaping as any other format, but there is no agreement for a CSV format.
After all, a missed escape can just as easily destroy a JSON or XML structure. And parallel processing of text is already a little sketchy simply because UTF-8 exists.
I'm not clear why quotes prevent parallel processing?
I mean, you don't usually parallelize reading a file in the first place, only processing what you've already read and parsed. So read each record in one process and then add it to a multiprocessing queue for multiple processes to handle.
And data corruption is data corruption. If a movie I'm watching has a corrupted bit I don't mind a visual glitch and I want it to keep playing. But with a CSV I want to fix the problem, not ignore a record.
Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!
> I'm not clear why quotes prevent parallel processing?
Because of the "non-local" effect of quotes, you can't just jump into the middle of a file and start reading it, because you can't tell whether you're inside a quoted section or not. If (big if) you know something about the structure of the data, you might be able to guess. So that's why I said "tricky" instead of "impossible".
Contrast to my escaping-only strategy, where you can jump into the middle of a file and fully understand your context by looking one char on either side.
> Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!
I used to be a data analyst at a management consultancy. A very common scenario would be that I'm handed a multi-gigabyte CSV and told to "import the data". No spec, no schema, no nothing. Data loss or corruption is totally unacceptable, because we were highly risk-sensitive. So step 1 is to go through the whole thing trying to determine field types by testing them. Does column 3 always parse as a timestamp? Great, we'll call it a timestamp. That kind of thing. In that case, it's great to be able to parallelise reading.
> And data corruption is data corruption
Agreed, but I prefer data corruption which messes up one field, not data corruption which makes my importer sit there for 5 minutes thinking the whole file is a 10GB string value and then throw "EOF in quoted field".
Doing sequential reading into a queue for workers to read is a lot more complicated than having a file format that supports parallel reading.
And the fix to allow parallel reading is pretty trivial: escape new lines so that you can just keep reading until the first unescaped new line and start at that record.
It is particularly helpful if you are distributing work across machines, but even in the single machine case, it's simpler to tell a bunch of workers their offset/limit in a file.
Tab-Separated Value, as implemented by many databases, solves these problems, because tab, newline and other control characters are escaped. For example, the default text serialization format of Postgres (`COPY <table> TO '<file>'` without any options) is this way.
Anyone with a love of CSV hasn't been asked to deal with CSV-injection prevention in an enterprise setting, without breaking various customer data formats.
That mostly breaks down to "excel is intentionally stupid with csv files if you don't use the import function to open them" along with the normal "don't trust customer input without stripping or escaping it" concerns you'd have with any input.
That was my initial reaction as well – it's a vulnerability in MS software, not ours, not our problem. Unfortunately, reality quickly came to bear: our customers and employees ubiquitously use excel and other similar spreadsheet software, which exposes us and them to risk regardless where the issue lies. We're inherently vulnerable because of the environment we're operating in, by using CSV.
"don't trust customer input without stripping or escaping it" feels obvious, but I don't think it stands up to scrutiny. What exactly do you strip or escape when you're trying to prevent an unknown multitude of legacy spreadsheet clients that you don't control from mishandling data in an unknown variety of ways? How do you know you're not disrupting downstream customer data flows with your escaping? The core issue, as I understand it, stems from possible unintended formula execution – which can be prevented by prefixing certain cells with a space or some invisible character (mentioned in the linked post above). This _does_ modify customer data, but hopefully in a way that unobtrusive enough to be acceptable. All in all, it seems to be a problem without a perfect solution.
Hey, I'm the author of the linked article, cool to see this is still getting passed around.
Definitely agree there's no perfect solution. There's some escaping that seems to work ok, but that's going to break CSV-imports.
An imperfect solutions is that applications should be designed with task-driven UIs so that they know the intended purpose of a CSV export and can make the decision to escape/not escape then. Libraries can help drive this by designing their interfaces in a similar manner. Something like `export_csv_for_eventual_import()`, `export_csv_for_spreadsheet_viewing()`.
Another imperfect solution would be to ... ugh...generate exports in Excel format rather than CSV. I know, I know, but it does solve the problem.
Or we could just get everyone in the world to switch to emacs csv-mode as a csv viewer. I'm down with that as well.
Appreciate your work! Your piece was pivotal in changing my mind about whether this should be considered in our purview to address.
The intention-based philosophy of all this makes a lot of sense, was eye opening, and I agree it should be the first approach. Unfortunately after considering our use cases, we quickly realized that we'd have no way of knowing how customers intend to use the csv exports they've requested - we've talked to some of them and it's a mix. We could approach things case by case but we really just want a setup which works well 99% of the time and mitigates known risk. We settled on the prefixing approach and have yet to receive any complaints about it, specifically using a space character with the mind that something unobtrusive (eg. easily strippable) but also visible, would be best - to avoid quirks stemming from something completely hidden.
Thank again for your writing and thoughts, like I said above I haven't found much else of quality on the topic.
I’ve almost always found the simple way around Excel users not knowing how to safely use CSV files is to just give the file another extension: I prefer .txt or .dat
Then, the user doesn’t have Excel has the default program for opening the file and has to jump through a couple safety hoops
If your customers and employees are using Excel then stop going against the grain with your niche software developer focused formats that need a lot of explanations.
I need to interface with a lot of non-technical people who exclusively use Excel. I give them .xlsx files. It's just as easy to export .xlsx as it is to export .CSV and my customers are happy.
How is .csv a niche dev-focused format? Our customers use our exports for a mix of purposes, some of them involving spreadsheet clients (not just excel) and some of them integrating with their own data pipelines. Csv conveniently works with these use cases across the board, without explanation, and is inconveniently saddled with these legacy security flaws in Excel (and probably other clients).
If xlsx works for all your use cases that's great, a much better solution that trying to sidestep these issues by lightly modifying the data. It's not an option for us, and (I'd imagine) a large contingent of export tools which can't make assumptions about downstream usage.
Someone filed a bug report on a project I work on, saying that it was a security vulnerability that we don't prefix cell values with a single quote (') when the cell content contains certain values like an equal sign (=). They said this can cause Excel to evaluate the content and potentially run unsafe code.
I responded that this was Excel's problem, not ours, and that nobody would assign a CVE to our product for such a "vulnerability". How naive I was! They forwarded me several such CVEs assigned to products that create CSVs that are "unsafe" for Excel.
I agree with the characterization ("security theater") of these bug reports. The problem is that the intentions of these reports don't make the potential risk less real, depending on the setting, and I worry that the "You're just looking for attention" reaction (a very fair one!) leads to a concerning downplaying of this issue across the web.
As a library author, I agree this very well may not be something that needs to be addressed. But as someone working in a company responsible for customers, employees, and their sensitive information, disregarding this issue disregards the reality of the tools these people will invariably use, downstream of software we _are_ responsible for. Aiming to make this downstream activity as safe as possible seems like a worthy goal.
I didn't know about the formula injection, I just knew that Excel and Sheets mangle my dates every time and it drives me bonkers. Why is that the default? It makes no sense.
I also love CSV for its simplicity. A key part of that love is that it comes from the perspective of me as a programmer.
Many of the criticisms of CSV I'm reading here boil down to something like: CSV has no authoritative standard, and everyone implements it differently, which makes it bad as a data interchange format.
I agree with those criticisms when I imagine them from the perspective of a user who is not also a programmer. If this user exports a CSV from one program, and then tries to load the CSV into a different program, but it fails, then what good is CSV to them?
But from the perspective of a programmer, CSV is great. If a client gives me data to load into some app I'm building for them, then I am very happy when it is in a CSV format, because I know I can quickly write a parser, not by reading some spec, but by looking at the actual CSV file.
Parsing CSV is quick and fun if you only care about parsing one specific file. And that's the key: It's so quick and fun, that it enables you to just parse anew each time you have to deal with some CSV file. It just doesn't take very long to look at the file, write a row-processing loop, and debug it against the file.
The beauty of CSV isn't that it's easy to write a General CSV Parser that parses every CSV file in the wild, but rather that its easy to write specific CSV parsers on the spot.
Going back to our non-programmer user's problem, and revisiting it as a programmer, the situation is now different. If I, a programmer, export a CSV file from one program, and it fails to import into some other program, then as long as I have an example of the CSV format the importing program wants, I can quickly write a translator program to convert between the formats.
There's something so appealing about to me about simple-to-parse-by-hand data formats. They are very empowering to a programmer.
Totally agree that its biggest strength is how approachable it is for quick, ad hoc tooling. Need to convert formats? Join two datasets? Normalize a weird export? CSV gives you just enough structure to work with and not so much that it gets in your way.
> I know I can quickly write a parser, not by reading some spec, but by looking at the actual CSV file
This is fine if you can hand-check all the data, or if you are okay if two offsetting errors happen to corrupt a portion of the data without affecting all of it.
Also I find it odd that you call it "easy" to write custom code to parse CSV files and translate between CSV formats. If somebody give you a JSON file that isn't valid JSON, you tell them it isn't valid, and they say "oh, sorry" and give you a new one. That's the standard for "easy." When there are many and diverse data formats that meet that standard, it seems perverse to use the word "easy" to talk about empirically discovering the quirks in various undocumented dialects and writing custom logic to accommodate them.
Like, I get that a farmer a couple hundred years ago would describe plowing a field with a horse as "easy," but given the emergence of alternatives, you wouldn't use the word in that context anymore.
> If somebody give you a JSON file that isn't valid JSON, you tell them it isn't valid, and they say "oh, sorry" and give you a new one. That's the standard for "easy."
But it isn't that reliably easy with JSON. Sometimes I have clients give me data that I just have to work with, as-is. Maybe it was invalid JSON spat out by some programmer or tool long ago. Maybe it's just from a different department than my contact, which might delay things for days before the bureaucracy gets me a (hopefully) valid JSON.
I consider CSV's level of "easy" more reliable.
And even valid JSON can be less easy. I've had experiences where writing the high-level parsing for some JSON file, in terms of a JSON library, was less easy and more time-consuming than writing a custom CSV parser.
Subjectively, I think programming a CSV parser from basic programming primitives is just more fun and appealing than programming in terms of a JSON library or XML library. And I find the CSV code is often simpler and quicker to write.
> When there are many and diverse data formats that meet that standard, it seems perverse to use the word "easy" to talk about empirically discovering the quirks in various undocumented dialects and writing custom logic to accommodate them.
But the premise of CSV is so simple, that there are only four quirks to empirically discover: cell delimiter, row delimiter, quote, escaped-quote.
I think it's "easy" to peek at the file and say, "Oh, they use semicolon cell delimiters."
And it's likewise "easy" to write the "custom logic", which is about as simple as parsing something directly from a text stream gets. I typically have to stop and think a minute about the quoting, but it's not that bad.
If a programmer is practiced at parsing from a text stream (a powerful, general skill that is worth exercising), than I think it is reasonable to think they might find parsing CSV by hand to be easier and quicker than parsing JSON (etc.) with a library.
The best part about csv, anyone can write a parser in 30 minutes meaning that I can take data from the early '90s and import it into a modern web service.
The worst part about CSV, anyone can ride a parser in about 30 minutes, meaning that it's very easy to get incorrect implementations, incorrect data, and other strange undefined behaviors. But to be clear json, and yaml also have issues with everyone trying to reinvent the wheel constantly. XML is rather ugly, but it seems to be the most resilient.
until you find someone abusing XSD schemas, or someone designing a "dynamically typed" XML... or sneaks in extra data in comments - happened to me way often than it should.
You know what grinds my gears about using XSD for message definitions? Namespaces. Namespaces are a good idea and were done well in XML, as far as I can see, but with XSD you run into this [problem][1]:
Namespaces are used to qualify tags and attributes in XML elements. But they're also used by XSD to qualify the names of types defined in the schema. A sequence element's type is indicated by the value of its "type" attribute. The attribute value is a string that is the namespace-qualified name of the type.
So, if you want to change the alias of an XML namespace in an XSD schema, you can't just use your XML library's facilities for namespace management. You also have to go find the "type" attributes (but not all of the "type" attributes), parse their values, and do the corresponding alias change in the type name.
Don't use a string for a thing that is not a string! I guess in XML attributes you have no choice. XAML improved on the situation a bit.
I like CSV for the same reasons I like INI files. It's simple, text based, and there's no typing encoded in the format, it's just strings. You don't need a library.
They're not without their drawbacks, like no official standards etc, but they do their job well.
Similarly I had once loved the schemaless datastorages. They are so much simpler!
Until I worked quite a bit with them and realized that there's always schema in the data, otherwise it's just random noise. The question is who maintains the schema, you or a dbms.
Re. formats -- the usefulness comes from features (like format enforcing). E.g. you may skip .ini at all and just go with lines on text files, but somewhere you still need to convert those lines to your data, there's no way around it, the question is who's going to do that (and report sane error messages).
Schemaless can be accomplished with well-formed formats like json, xml, yaml, toml, etc. from the producer side these are roughly equivalent interfaces. There's zero upside to using CSVs except to comfort your customer. Or maybe you have centered importing of CSVs into your actual business, in which case you should probably not exist.
My experience has indicated the exact opposite. CSVs are the only "structured" format nobody can claim to parse 100% (ok probably not true thinking about html etc, just take this as hyperbole.) Just use a well-specified format and save your brain-cells.
Occasionally, we must work with people who can only export to csv. This does not imply csv is a reasonable way to represent data compared to other options.
It's of quite large importance, and despite being difficult, it is well-specified, which is the point here. Importantly, there is also no competing HTML spec, either de facto or otherwise. CSV doesn't have anything of comparable authority.
> CSVs are the only "structured" format nobody can claim to parse 100%
You don't need to though since in most cases you just need to support whatever CSV format the tool you're handling, unless of course you're trying to write the next Excel/Google Sheets competitor.
CSV works because CSV is understood by non technical people who have to deal with some amount of technicality. CSV is the friendship bridge that prevents technical and non technical people from going to war.
I can tell an MBA guy to upload a CSV file and i'll take care of it. Imagine i tell him i need everything in a PARQUET file!!! I'm no longer a team player.
This is incorrect. Everyone uses Excel, not CSV. There are billions of people on this planet who know what to do with an .xlsx file.
Do the same with a .CSV file and you'll have to teach those people how to use the .CSV importer in Excel and also how to set up the data types for each column etc. It's a non trivial problem that forces you down to a few million people.
.CSV is a niche format for inexperienced software developers.
Indeed the my main use is most financial services will output your records in csv, although I mostly open that in excel which sometimes gets a bit confused.
Among the shit I have seen in CSV, no " for strings, including those with a return char, innovative SEP, date, numbers, no escape for " within strings, rows related to the reporting tools used to export to CSV etc
True. But most of those problems are pretty easy for the non-technical person to see, understand, and (often) fix. Which strengthens the "friendship bridge".
(I'm assuming the technical person can easily write a basic parsing script for the CSV data - which can flag, if not fix, most of the format problems.)
For a dataset of any size, my experience is that most of the time & effort goes into handling records which do not comply with the non-technical person's beliefs about their data. Which data came from (say) an old customer database - and between bugs in the db software, and abuse by frustrated, lazy, or just ill-trained CSR's, there are all sorts of "interesting" things, which need cleaning up.
I wish this was a joke. I'm always trying to convince data scientists with a foot in the open source world that their life will be so much better if they use parquet or Stata or Excel or any other kind of file but CSV.
On top of all the problems people mention here involving the precise definition of the format and quoting, it's outright shocking how long it takes to parse ASCII numbers into floating point. One thing that stuck with me from grad school was that you could do a huge number of FLOPS on a matrix in the time it would take to serialize and deserialize it to aSCII.
Accurate data typing (never confuse a string with a number)
Maybe be circular but: always loads correctly into Excel, if you want to load into a spreadsheet you can add text formatting and even formulas, checkboxes and stuff which can be a lot of fun.
Excel does that type coercion if you import from CSV. If you export pandas data to XLSX it adds proper type information and then it imports properly into Excel and you avoid those problems.
I've recently been developing a raspberry pi based solution which works with telemetry logs. First implementation used an SQLite database (with WAL log) – only to find it corrupted after just couple of days of extensive power on/off cycles.
I've since started looking at parquet files – which turned out to not be friendly to append-only operations. I've ended up implementing writing events into ipc files which then periodically get "flushed" into the parquet files. It works and it's efficient – but man is it non-trivial to implement properly!
My point here is: for a regular developer – CSV (or jsonl) is still the king.
> First implementation used an SQLite database (with WAL log) – only to find it corrupted after just couple of days of extensive power on/off cycles.
Did you try setting `PRAGMA synchronous=FULL` on your connection? This forces fsync() after writes.
That should be all that's required if you're using an NVMe SSD.
But I believe most microSD cards do not even respect fsync() calls properly and so there's technically no way to handle power offs safely, regardless of what software you use.
I use SanDisk High Endurance SD cards because I believe (but have not fully tested) that they handle fsync() properly. But I think you have to buy "industrial" SD cards to get real power fail protection.
Raspberry Pi uses microSD card. Just using fsync after every write would be a bit devastating, but batching might've worked ok in this case. Anyways, too late to check now.
There's definitely a place for it. I ran into the same problem with a battery powered event logger. Basically alternate between sleep-until-event and sample-until-event-over.
SQLite was fine until the realities of that environment hit.
0) I need to save the most data over time and my power budget is unpredictable due to environmentals.
1) When should I commit? SQLite commit per insert slows down, impacts battery life, impacts sample rate. Practically you could get away with batching all data for a small period.
2) SQLite is slow to repair databases. Partially written file would often take longer to repair than we had battery to run.
CSV based format filled that niche. First column was line-column count to support firmware upgrades. Last column is line-checksum. Another column indicating if this line was the last for an event. Parser skips corrupted lines/entries.
> I've since started looking at parquet files – which turned out to not be friendly to append-only operations. I've ended up implementing writing events into ipc files which then periodically get "flushed" into the parquet files. It works and it's efficient – but man is it non-trivial to implement properly!
I think the industry standard for supporting this is something like iceberg or delta, it's not very lightweight, but if you're doing anything non-trivial, it's the next logical move.
Thats true, in recent years its been less of a disaster with lots of good csv libraries for various languages. In the 90s csv was a constant footgun, perhaps thats why they went crazy and came up with XML
When I first started, installing packages which required compiling native code on either my work Windows machine and the old Unix servers was not easy.
So I largely stuck to the Python standard library where I could, and most of the operations I had at the time did not require data analysis on a server, that was mostly done in a database. Often the job was validating and transforming the data to then insert it into a database.
As the Python packaging ecosystem matured and I found I could easily use Pandas everywhere it just wasn't my first thing I'd reach to. And occasionally it'd be very helpful to iterate through them with the csv module, only taking a few MBs of memory, vs. loading the entire dataset into memory with Pandas.
The JSON version is only marginally bigger (just a few brackets), but those brackets represent the ability to be either simple or complex. This matters because you wind up with terrible ad-hoc nesting in CSV ranging from entries using query string syntax to some entirely custom arrangement.
And in these cases, JSON's objects are WAY better.
Because CSV is so simple, it's common for them to avoid using a parsing/encoding library. Over the years, I've run into this particular kind of issue a bunch.
//outputs `val1,val2,unexpected,comma,valN` which has one too many items
["val1", "val2", "unexpected,comma", "valN"].join(',')
JSON parsers will not only output the expected values every time, but your language likely uses one of the super-efficient SIMD-based parsers under the surface (probably faster than what you are doing with your custom CSV parser).
Another point is standardization. Does that .csv file use commas, spaces, semicolons, pipes, etc? Does it use CR,LF, or CRLF? Does it allow escaping quotations? Does it allow quotations to escape commas? Is it utf-8, UCS-2, or something different? JSON doesn't have these issues because these are all laid out in the spec.
JSON is typed. Sure, it's not a LOT of types, but 6 types is better than none.
While JSON isn't perfect (I'd love to see an official updated spec with some additional features), it's generally better than CSV in my experience.
While most people would prefer the second version, the first version is also valid JSON and will definitely see use when you want/need JSON but want to reduce data over the wire though you'd probably still see a wrapper object like:
Essentially nobody uses JSON without a library, but tons of people (maybe even most people) use CSV without a library.
Part of the problem here is standards. There's a TON of encoding variations all using the same .csv extension. Making a library that can accurately detect exactly which one is correct is a big problem once you leave the handful of most common variants. If you are doing subfield encoding, you are almost certainly on your own with decoding at least part of your system.
JSON has just one standard and everyone adheres to that standard which makes fast libraries possible.
If you want to stream large volumes of row-oriented data, you aren't reading yourself and you should be using a binary format which is going to be significantly smaller (especially for numeric data).
Yeah that would be the next step in optimization. In the meanwhile, raw text CSV streaming (for not purely numeric data) is still extremely fast and easy to set up
I am annoyed that comma won out as the separator. Tab would have been a massively better choice. Especially for those of us who have discovered and embraced elastic tabstops. Any slightly large CSV is unreadable and uneditable because you can't easily see where the commas are, but with tabs and elastic tabstops, the whole thing is displayed as a nice table.
(That is, of course, assuming the file doesn't contain newlines or other tabs inside of fields. The format should use \t \n etc for those. What a missed opportunity.)
I wrote a web scraper for some county government data and went for tabs as well. It's nice how the columns lined up in my editor (some of these files had hundreds of thousands of lines).
And all kinds of other weirdness, right in ascii. Vertical tabs, LOL. Put those in filenames on someone else's computer if you want to fuck with them. Linux and its common file systems are terrifyingly permissive in the character set they allow for file names.
CSV have multiple different separators.
Eg. Excel defaults to different separators based on locale. Like CZ locale, it uses commas in numbers instead of dot, so CSV uses semicolon as default separator.
They're completely skipping over the complications of header rows and front matter.
"8. Reverse CSV is still valid CSV" is not true if there are header rows for instance.
But really, whether or not CSV is a good format or not comes down to how much control you have over the input you'll be reading. If you have to deal with random CSV from "in the wild", it's pretty rough. If you have some sort of supplier agreement with someone that's providing the data, or you're always parsing data from the same source, it's pretty fine.
At various points in my career, I've had to oversee people creating data export features for research-focused apps. Eventually, I instituted a very simple rule:
As part of code review, the developer of the feature must be able to roundtrip export -> import a realistic test dataset using the same program and workflow that they expect a consumer of the data to use. They have up to one business day to accomplish this task, and are allowed to ask an end user for help. If they don't meet that goal, the PR is sent back to the developer.
What's fascinating about the exercise is that I've bounced as many "clever" hand-rolled CSV exporters (due to edge cases) as other more advanced file formats (due to total incompatibility with every COTS consuming program). All without having to say a word of judgment.
Data export is often a task anchored by humans at one end. Sometimes those humans can work with a better alternative, and it's always worth asking!
As someone who likes modern formats like parquet, when in doubt, I end up using CSV or JSONL (newline-delimited JSON). Mainly because they are plain-text (fast to find things with just `grep`) and can be streamed.
Most features listed in the document are also shared by JSONL, which is my favourite format. It compresses really well with gzip or zstd. Compression removes some plain-text advantages, but ripgrep can search compressed files too. Otherwise, you can:
zcat data.jsonl.gz | grep ...
Another advantage of JSONL is that it's easier to chunk into smaller files.
Too bad xz/lzma isn't supported in these formats. I often get pretty big improvements in compression ratio. It's slower, but it can be parallelized too.
JSONL as a replacement for CSV, you shouldn't be using CSV as format for long term storage or querying, it has so many downsides and nearly zero upsides.
JSONL when compressed with zstd, most of "expensive if large" disappears as well.
Generating and consuming JSONL can easily be in the GB/s range.
I mean on the querying side. Parquet's ability to skip rowgroups and even pages, and paired with iceberg or delta can make the difference between being able to run your queries at all versus needing to scale up dramatically.
I am saying JSONL is a lower bound format, if you can use something better you should. Data interchange, archiving, transmission, etc. It shouldn't be repeatedly queried.
Parquet, Arrow, sqlite, etc are all better formats.
One of my favorite tools. However, I don’t think that Visidata is a spreadsheet, even though it looks like one and is named after one. It is more spreadsheet adjacent. It is focused on row-based and column-based operations. It doesn’t support arbitrary inter-cell operation(s), like you get in Excel-like spreadsheets. It is great for “Tidy Data’, where each row represents a coherent set of information about an object or observation. This is very much like Awk, or other pipeline tools which are also line/row oriented.
CSV still quietly powers the majority of the world’s "data plumbing."
At any medium+ sized company, you’ll find huge amounts of CSVs being passed around, either stitched into ETL pipelines or sent manually between teams/departments.
It’s just so damn adaptable and easy to understand.
For example, one of the CSVs my company shovels around is our Azure billing data. There are several columns that I just have absolutely no idea what the data in them is. There are several columns we discovered are essentially nullable¹ The Hard Way when we got a bill for which, e.g., included a charge that I guess Azure doesn't know what day that charge occurred on? (Or almost anything else about it.)
(If this format is documented anywhere, well, I haven't found the docs.)
Values like "1/1/25" in a "date" column. I mean, I did say it was an Azure-generated CSV, so obviously the bar wasn't exactly high, but then it never is, because anyone wanting to build something with some modicum of reliability, or discoverability, is sending data in some higher-level format, like JSON or Protobuf or almost literally anything but CSV.
If I can never see the format "JSON-in-CSV-(but-we-fucked-up-the-CSV)" ever again, that would spark joy.
(¹after parsing, as CSV obviously lacks "null"; usually, "" is a serialized null.)
I work as a data engineer in the financial services industry, and I am still amazed that CSV remains the preferred delivery format for many of our customers. We're talking datasets that cost hundreds of thousands of dollar to subscribe to.
"You have a REST API? Parquet format available? Delivery via S3? Databricks, you say? No thanks, please send us daily files in zipped CSV format on FTP."
Requires AWS credentials (api access token and secret key? iam user console login? sso?), AWS SDK, manual text file configuration, custom tooling, etc. I guess with Cyberduck it's easier, but still...
> Databricks
I've never used it but I'm gonna say it's just as proprietary as AWS/S3 but worse.
Anybody with Windows XP can download, extract, and view a zipped CSV file over FTP, with just what comes with Windows. It's familiar, user-friendly, simple to use, portable to any system, compatible with any program. As an almost-normal human being, this is what I want out of computers. Yes the data you have is valuable; why does that mean it should be a pain in the ass?
I've recently written a library at work to run visitors on data models bound to data sets. One of these visitors is a CSV serializer that dumps a collection as a CSV document.
I've just checked and strings are escaped using the same mechanism for JSON, with backslashes. I should've double-checked against RFC 4180, but thankfully that mechanism isn't currently triggered anywhere for CSV (it's used for log exportation and no data for these triggers that code path). I've also checked the code from other teams and it's just handwritten C++ stream statements inside a loop that doesn't even try to escape data. It also happens to be fine for the same reason (log exportation).
I've also written serializers for JSON, BSON and YAML and they actually output spec-compliant documents, because there's only one spec to pay attention to. CSV isn't a specification, it's a bunch of loosely-related formats that look similar at a glance. There's a reason why fleshed-out CSV parsers usually have a ton of knobs to deal with all the dialects out there (and I've almost added my own by accident), that's simply not a thing for properly specified file formats.
This is particularly funny because I just received a ticket saying that the CSV import in our product doesn't work. I asked for the CSV, and it uses a semicolon as a delimiter. That's just what their Excel produced, apparently. I'm taking their word for it because... Excel.
To me, CSV is one of the best examples of why Postel's Law is scary. Being a liberal recipient means your work never ends because senders will always find fun new ideas for interpreting the format creatively and keeping you on your toes.
Of course, because there are locales which uses comma as decimal separator. So CSV in Excel then defaults to semicolon.
Another Microsoft BS, they should defaults to ENG locale in CSV, do a translation in background. And let user choose, if they want to save as different separator. Excel in every part of world should produce same CSV by default.
Bunch of idiots.
Yes. CSV is a data interchange format, it's meant to be written on one computer and read by another. Making the representation of data dependent on the locale in use is stupid af. Locales are for interaction with the user, not for data interchange.
There is a lot not to like about CSV, for all the reasons given here. The only real positive is that you can easily create, read and edit CSV in an editor.
Personally I think we missed a trick by not using the ASCII US and RS characters:
Columns separated by \u001F (ASCII unit separator).
Rows separated by \u001E (ASCII record separator).
Good idea, but probably a non-starter due to no keyboard keys for those characters. Even | would've been a better character to use since it almost never appears in common data.
It is probably unrealistic to expect keyboard keyboard vendors to add new keys. But editors could support adding them through keyboard shortcuts (Ctrl + something).
Pipe can be useful as a field delimiter. But what do you use as the record delimiter?
In abstract CSV is great. In reality, it is a nightmare not because of CSV, but because of all the legacy tools that product it in slightly different ways (different character encodings mostly - excel still produces some variant of latin1, some tools drop a BOM in your UTF8, etc).
Unless you control the producer of the data you are stuck trying to infer the character encoding and transcoding to your destination, and there's no foolproof way of doing that.
100% agree. TSV is under-rated. Tabs don't naturally occur in data nearly as often as commas so tabs are a great delimiter. Copy paste into Excel also works much better with tabs.
Code editors may convert tabs to spaces but are you really editing and saving TSV data files in your code editor?
TSV's big advantage is that, as commonly implemented, the separators are escaped, not quoted. This means that a literal newline (ASCII 0x0A) is always a record separator and a literal tab (ASCII 0x09) is always a field separator. This is the format many databases -- including Postgres -- use for text export by default.
The problem with using TSV is different user configuration. For ex. if I use vim then Tab might indeed be a '\t' character but in TextEdit on Mac it might be something different, so editing the file in different programs can yield different formatting. While ',' is a universal char present on all keyboards and formatted in a single way
The only situation I can think of where a tab is not a tab, is in a code editor that's been configured (possibly by default) to use spaces instead. But that's an easy enough configuration to change. And certainly wouldn't be a problem for something like TextEdit.
> IMO the delimiter should be a configurable option
It is. CSV has been character separated vs comma separated for probably decades now. Most tools you'd use to mess with them allow you to specify which separator character is being used.
Functionally the same. I'd prefer CSV if my content was likely to have whitespace in it and didn't want billion quotes. I'd prefer TSV if my content was unlikely to have whitespace, and more likely to contain commas.
The problem with TSV is what are you going to do about quotes. Some fields might contain them [1] or they might be needed to store fields with tabs inside them.
Because of this in order to read a plain simple TSV (fields separated by tabs, nothing more) with the Python csv module [2] you need to set the quote character to an improbable value, say € (using the euro sign because HN won't let me use U+1F40D), or just parse it by hand, e.g. row.split('\t').
Agreed, much easier to work with, especially if you can guarantee no embedded tabs or newlines. Otherwise you end up with backslash escaping, but that's still usually easier than quotes.
Parsing escapes is easier than parsing quoted text with field and record separators embedded in it. Every literal newline or literal tab is a separator. One can jump to the thousandth record, for example, just by skipping lines, without looking into the contents.
If have to use a dedicated tabular data editing program, I may as well use a spreadsheet application. What do the options you propose do better than libreoffice calc?
is pretty nice because you can inspect a file or dataset on a server (no X) before downloading, to see if you even want to bother. Or you can pass it along through a series of pipes to just get the rows you want.
I’m not sure where that quote is from but it is incorrect, CSVs aren’t easily viewed or modified in text editors in general (at least not in any way that takes advantage of their tabular nature).
There’s at least a slight chance that tab separated values will look ok in a text editor (although in general, nope).
There is at least a chance that the text fields will fit in a tab. If not, a tool like “column” on Linux can be used. There is no chance that a text field will fit inside the width of a comma.
I think for "untyped" files with records, using the ASCII file, (group) and record separators (hex 1C, 1D and 1E) work nicely. The only constraint is that the content cannot contain these characters, but I found that that is generally no problem in practice. Also the file is less human readable with a simple text editor.
For other use cases I would use newline separated JSON. Is has most of the benefits as written in the article, except the uncompressed file size.
I agree that JSONL is the spiritual successor of CSV with most of the benefits and almost none of the drawbacks.
It has a downside though: wherever JSON itself is used, it tends to be a few kilobytes at least (from an API response, for example). If you collect those in a JSONL file the lines tend to get verrrry long and difficult to edit. CSV files are more compact.
JSONL files are a lot easier to work with though. Less headaches.
Honestly yes. If text editors would have supported these codes from the start, we might not even have XML, JSON or similar today. If these codes weren't "binary" and all scary, we would live in much different world.
I wonder how much we have been hindered ourselves by reinventing plain text human-readable formats over the years. CSV -> XML -> JSON -> YAML and that's just the top-level lineage, not counting all the branches everywhere out from these. And the unix folks will be able to name plenty of formats predating all of this.
I'm not really sure why "Excel hates CSV". I import into Excel all the time. I'm sure the functionality could be expanded, but it seems to work fine. The bit of the process I would like improved is nothing to do with CSV - it's that the exporting programs sometimes rearrange the order of fields, and you have to accommodate that in Excel after the import. But since you can have named columns in Excel (make the data in to a table), it's not a big deal.
One problem is that Excel uses locale settings for parsing CSV files (and, to be fair, other text files). So if you're in e.g. Europe and you've configured Excel to use commas as decimal separators, Excel imports numbers with decimals (with points as decimal separator) as text. Or it thinks the point is a thousands separator. I forgot exactly which one of those incorrect options it chooses.
I don't know what they were thinking, using a UI setting for parsing an interchange format.
There's a way around, IIRC, with the "From text / csv" command, but that looses a lot of the convenience of double-clicking a CSV file in Explorer or whatever to open it in Excel.
It's really bad if your header row has less columns than the data rows. You really need to do the import vs just opening the file because it's not even obvious that it's dropping data unless you know what to expect from your file.
Excel is halfway decent if you do the 'import' but not if you just doubleclick on them. It seems to have been programmed to intentionally do stupid stuff with them if you just doubleclick on them.
I agree that the default way Excel handles CSV files is terrible. Using Power Query to manage them is the way to go. But it's the general Microsoft approach to backwards compatibility so very unlikely to change now.
In the past I remember that Excel not properly handling UTF-8 encoded text in a CSV. It would treat it as raw ASCII (or possibly code page 1252). So if you opened and saved a CSV, it would corrupt any Unicode text in the file. It's possible this has been fixed in newer versions, I haven't tried in a while.
It's related to how older versions of Windows/Office handled Unicode in general.
From what I have heard, it's still an issue with Excel, although I assume that Windows may handle plain text better these days (I haven't used it in a while)
You need to write an UTF-8 BOM at the beginning (0xEF, 0xBB, 0xBF), if you want to make sure it's recognized as UTF-8.
Ugh, UTF-8 BOM. Many apps can handle UTF-8 but will try to return those bytes as content; maybe ours in 2015 too
I was on the Power Query team when we were improving the encoding sniffing. An app can scan ahead i.e. 64kB, but ultimately the user needs to just say what the encoding is. All the Power Query data import dialogs should let you specify the encoding.
UTF-8 BOM is probably not a good idea for anything other than (maybe) plain text documents. For data, many (although not all) programs should not need to care about character encoding, and if they include something such as UTF-8 BOM then it will become necessary to consider the character encoding even though it shouldn't be necessary.
I have repeatedly seen people getting the spreadsheets altered by excel, and in general a lot of troubles due to localisation reasons. Sometimes these changes can be subtle and be hard to spot until somebody tries to troubleshoot what went wrong down the line.
It works better if you click to "import the data" instead of just opening the csv file with it, and if you then choose the right data types. But having to do this everytime to make it work is really annoying, esp when you have a lot of columns, plus people can easily get confused with the data types. I have never seen that much confusion eg with macos's numbers.
It's an ad hoc text format that is often abused and a last-chance format for interchange. While heuristics can frequently work at determining the structure, they can just as easily frequently fail. This is especially true when dealing with dates and times or other locale-specific formats. Then, people outright abuse it by embedding arrays or other such nonsense.
You can use CSV for interchange, but a duck db import script with the schema should accompany it.
"CSV" should die. The linked article makes critical ommisions and is wrong about some points. Goes to show just how awful "CSV" is.
For one thing, it talks about needing only to quote commas and newlines... qotes are usually fine... until they are on either side of the value. then you NEED to quote them as well.
Then there is the question about what exactly "text" is; with all the complications around Unicode, BOM markers, and LTR/RTL text.
CSV on the Web (CSVW) is a W3C standard designed to enable the description of CSV files in a machine-readable way.[1]
"Use the CSV on the Web (CSVW) standard to add metadata to describe the contents and structure of comma-separated values (CSV) data files." — UK Government
Digital Service[2][3]
One thing that has changed the game with how I work with CSVs is ClickHouse. It is trivially easy to run a local database, import CSV files into a table, and run blazing-fast queries on it. If you leave the data there, ClickHouse will gradually optimize the compression. It's pretty magical stuff if you work in data science.
Datasette is a wonderful tool that I've used before, and I have the highest admiration for its creator, but the underlying Sqlite3 database doesn't handle large datasets (i.e. hundreds of millions of rows) nearly as well as ClickHouse does.
It's worth noting that I only ran into this limitation when working with huge federal campaign finance datasets [1] and trying to do some compute-intensive querying. For 99% of use cases, datasette is a similarly magical piece of software for quickly exploring some CSV files.
How much easier would all of this be if whoever did CSV first had done the equivalent of "man ascii". There are all these wonderful codes there like FS, GS, RS, US that could have avoided all the hassle that quoting has brought generations of programmers and data users.
I wrote my own CSV parser in C++. I wasn't sure what to do in some edge cases, e.g. when character 1 is space and character 2 is a quote. So I tried importing the edge case CSV into both MS Excel and Apple Numbers. They parsed it differently!
CSV is so deceptively simple that people don't care understanding it. I wasted countless hours working around services providing non-escaped data that off the shelf parsers could not parse.
If CSV is indeed so horrible - and I do not deny that there can be an improvement - how about the clever data people spec out a format that
Does not require a bizarre C++ RPC struct definition library _both_ to write and to read
Does not invent a clever number encoding scheme that requires native code to decode at any normal speed
Does not use a fancy compression algorithm (or several!) that you need - again - native libraries to decompress
Does not, basically, require you be using C++, Java or Python to be able to do any meaningful work with it
It is not that hard, really - but CSV is better (even though it's terrible) exactly because it does not have all of these clever dependency requirements for clever features piled onto it. I do understand the utility of RLE, number encoding etc. I do not, and will not, understand the utility of Thrift/Avro, zstandard and brotli and whatnot over standard deflate, and custom integer encoding which requires you download half of Apache Commons and libboost to decode. Yes, those help the 5% to 10% of the use cases where massive savings can be realised. It absolutely ruins the experience for the other 90 to 95.
But they also give Parquet and its ilk a very high barrier of entry.
I have written a new database system that will convert CSV, JSON, and XML files into relational tables.
On of the biggest challenges to CSV files is the lack of data types on the header line that could help determine the schema for the table.
For example a file containing customer data might have a column for a Zip Code. Do you make the column type a number or a string? The first thousand rows might have just 5 digit numbers (e.g. 90210) but suddenly get to rows with the expanded format (e.g. 12345-1234) which can't be stored in an integer column.
I realize that. But when reading a header you have to imply the data types which might be wrong. I always thought it would have been great if the first line read something like:
name:STRING,address:STRING,zip code:INTEGER,ID:BIG_INT,...
# Mangle your data with dplyr, regular expressions, search and replace, drop NA's, you name it.
<code to sanitize all your data>
Multiple libraries exist for R to move data around, change the names of entire columns, change values in every single row with regular expressions, drop any values that have no assigned value, it's the swiss army knife of data. There are also all sorts of things you can do with data in R, from mapping with GPS coordinates to complex scientific graphing with ggplot2 and others.
I've found some use cases where CSV can be a good alternative to arrays for storage, search and retrieval. Storing and searching nested arrays in document databases tends to be complicated and require special queries (sometimes you don't want to create a separate collection/table when the arrays are short and 1D). Validating arrays is actually quite complicated; you have to impose limits not only on the number of elements in the array, but also on the type and size of elements within the array. Then it adds a ton of complexity if you need to pass around data because, at the end of the day, the transport protocol is either string or binary; so you need some way to indicate that something is an array if you serialize it to a string (hence why JSON exists).
Reminds me of how I built a simple query language which does not require quotation marks around strings, this means that you don't need to escape strings in user input anymore and it prevents a whole bunch of security vulnerabilities such as query injections. The only cost was to demand that each token in the query language be separated by a single space. Because if I type 2 spaces after an operator, then the second one will be treated as part of the string; meaning that the string begins with a space. If I see a quotation mark, it's just a normal quotation mark character which is part of the string; no need to escape. If you constrain user input based on its token position within a rigid query structure, you don't need special escape characters. It's amazing how much security has been sacrificed just to have programming languages which collapse space characters between tokens...
It's kind of crazy that we decided that quotation marks are OK to use as special characters within strings, but commas are totally out of bounds... That said, I think Tab Separated Values TSV are even more broadly applicable.
Funny how the "specification holds in a tweet" yet manages to miss at least three things: 1) character encoding, 2) BOM or not, 3) header or no header.
Great if you're the one producing the CSV yourself.
But if you're ingesting data from other organizations, they will, at one time or another, fuck up every single one of those (as well as the ones mentioned in TFA), no matter how clearly you specify them.
the simplicity is underappreciated because people don't realise how many dumb data engineers there are. i'm pretty sure most of them can't unpack an xml or json. people see a csv and think they can probably do it themselves, any other data format they think 'gee better buy some software with the integration for this'.
Excel hates CSV only if you don't use the "From text / csv" function (under the data tab).
For whatever reason, it flawlessly manages to import most CSV data using that functionality. It is the only way I can reliably import data to excel with datestamps / formats.
Just drag/dropping a CSV file onto a spreadsheet, or "open with excel" sucks.
… what concrete language are we talking about, here?
In literally any language I can think of, hassle(json) < hassle(CSV), esp. since CSV received is usually "CSV, but I've screwed it up in a specific, annoying way"
The Italian version of Excel uses a custom CSV style with ; as a column separator. This breaks many applications that accept CSVs. It's super annoying.
>This is so simple you might even invent it yourself without knowing it already exists while learning how to program.
This is a double-edged sword. The "you might even event it yourself" simplicity means that in practice lots of different people do end up just inventing their own version rather than standardizing to RFC-4180 or whatever when it comes to "quote values containing commas", values containing quotes, values containing newlines, etc. And the simplicity means these type of non-standard implementations can go completely undetectable until a problematic value happens to be used. Sometimes added complexity that forces paying more attention to standards and quickly surfaces a diversion from those standards is helpful.
Just last week I was bitten by a customer’s CSV that failed due to Windows‘ invisible BOM character that sometimes occurs at the beginning of unicode text files. The first column‘s title is not „First Title“ then but „&zwnbsp;First Title“. Imagine how long it takes before you catch that invisible character.
Aside from that: Yes, if CSV would be a intentional, defined format, most of us would do something different here and there. But it is not, it is more of a convention that came upon us. CSV „happened“, so to say. No need to defend it more passionate than the fact that we walk on two legs. It could have been much worse and it has surprising advantages against other things that were well thought out before we did it.
There was/is CSVY [0] which attempted to put column style and separator information in a standard header. It is supported by R lang.
I also asked W3C on theirGithub if there was any spec for CSV headers and they said there isn't [1].
Kind of defeats the point of the spec in my opinion.
CSV is the bane of my existence. There is no reason to use it outside of legacy use-cases, when so many alternatives are not so brittle that they require endless defensive hacks to avoid erring as soon as exposed to the universe. CSV must die.
CVS isn't brittle, and I'm not sure what "hacks" you're referring to. If you or your parser just follow RFC4180 (particularly quote every field, and double quoting to cancel-quote), that will get you 90%+ compatibility.
Surely you’ve come across situations where line number 10,000,021 of a 60m line CSV fails to parse because there aren’t enough fields in that line of the file…? The issue is that you can’t definitively know which of the 50 fields is missing, so you have to fail the line or worse the file.
In my experience (perhaps more niche than yours since you mentioned it has been your day job), the lack of fall back options makes for brittle integrations. Failing entire files due to a borked row can be expensive in terms of time.
Having to ingest large CSV files from legacy systems has made me rethink the value of XML, lol. Types and schemas add complexity for sure, but you get options for dealing with variances in structure and content.
That is a problem, but it is also a problem with XML. Parsing the XML file to discover e.g. unmatching tags is far more CPU and memory expensive than correctly passing a CVS.
In both cases you'd fail the entire file rather than partial recovery.
RFC4180 is a late attempt at CSV standardization, merely codifying a bunch of sane practices. It also provides a nice specification for generating CSV. But anyone taking care to code from a specification might as well use a proper file format.
The real specification for CSV is as follows: "Valid CSV is whatever is designated as CSV by its emitter". I wish I was joking.
There is literally an infinity of ways CSV can be broken. The developer will bump his head on each as he encounters them, and add a specific fix. After a while, his code will be robust against the local strains of CSV... Until the next mutation is encountered after acquiring yet another company with a bunch of ERP way past their last extended maintainance era, a history of local adaptations and CSV as a message bus.
CSV is awesome for front-end webapps needing to fetch A LOT of data from a server in order to display an information-dense rendering. For that use-case, one controls both sides so the usual serialization issues aren't a problem.
Working in clinical trial data processing I receive data in 1 of 3 formats:
csv, sas datasets, image scans of pdf pages showing spreadsheets
Of these 3 options sas datasets are my preference but I'll immediately convert to csv or excel, csv is a close 2nd once you confirm the quoting / seperator conventions it's very easy to parse. I understand why someone may find the csv format disagreeable but in my experience the alternatives can be so much worse I don't worry too much about csv files
Typical latin fonts divide characters into three heights: short like "e" or "m", tall like "l" or "P" and deep like "j" or "y". As you may notice, letters only use one or two of these three sections.
Pipe is unique in that it uses all three at the same time from the very top to the very bottom. No matter what latin character you put next to it, it remains distinct. This makes the separators relatively easy to spot.
Pipe is a particularly uncommon character in normal text while commas, spaces, semicolons, etc are quite common. This means you don't need to escape it very often. With an escapable pipe, an escapable newline, and unicode escapes ("\|", "\n", and "\uXXXX") you can handle pretty much everything tabular with minimal extra characters or parsing difficulty.
This in turn means that you can theoretically differentiate between different basic types of data stored within each entry without too much difficulty. You could even embed JSON inside it as long as you escape pipes and newlines.
I already prefer using pipe as separator in logging; now you're telling me there is a chance that my logs can be automatically ingested as tabular data? Sign me up for this branch of the multiverse :)
Look at how it handles escaping of special characters and particularly new lines (RFC 4180 doesn’t guarantee that a new line is a new record) and how it’s written in 2005 yet still doesn’t handle unicode other than via a comment about ”other character sets”.
Doesn't it make sense to have a common document in the usual format (RFC) which every newbie can consult when in doubt?
I much prefer that to any sort of "common institutional memory" that is nevertheless only talked about on random forums. People die, other people enter the field... hello subtle incompatibilities.
The fact that you can parse CSV in reverse is quite cool, but you can't necessarily use it for crash recovery (as suggested) because you can't be sure that the last thing written was a complete record.
The column count doesn't help because you don't know where the last record starts because you don't know whether you're in a string or not. Unless you scan the entire file from the beginning, which defeats the object.
I love CSV for a number of reasons. Not the least of which it’s super easy to write a program (code) in C to directly output all kinds of things to CSV. You can also write simple middleware to go from just about any database or just general “thing” to CSV. Very easily. Then toss CSV into excel and do literally anything you want.
It’s sort of like, the computing dream when I was growing up.
+1 to ini files. I like you can mess around with them yourself in notepad. Wish there was a general outline / structure to those though.
I have been just splitting my head to parse data from from a erp database to csv and then from csv to erp database again using the programming language user by erp system.
The first part of converting data to csv works fine with help of ai coding assistant.
The reverse part of csv to database is getting challenging and even claude sonnet 3.7 is not able to escape newline correctly.
I am now implementation the data format in json which is much simpler.
Items #6 to #9 sound like genuine trolling to me; item #8, reversing bytes because of course no other text encodings than ASCII exist, is particularly horrible.
the reversing bytes part is encoding agnostic. you just feed the reversed bytes to the csv parser then re-reverse both the yielded rows and the cells bytes and get the original order of the bytes themselves.
I think I understand the point being made, but all this reliance on text-based data means we require proper agreement on text encodings, etc. I don't think it's very useful for number-based data anyway, it's a massively bloated way to store float32s for instance and usually developers truncate the data losing about half of the precision in the process.
For numerical data, nothing beats packing floats into blobs.
I think binary formats have many advantages. Not only for numbers but other data as well, including data that contains text (to avoid needing escaping, etc; and to declare what character sets are being used if that is necessary), and other structures. (For some of my stuff I use a variant of DER, which adds a few new types such as key/value list type.)
I can‘t really understand the love for the format. Yes it’s simple but also not defined in a common spec. Same story with markdown. Yes GitHub tried to push for a spec but it still feels more like a flavor. I mean there is nothing wrong with not having a spec. But certain guarantees are not given. Will the document exported by X work with Y.
I love CSV when it's only me creating/using the CSV. It's a very useful spreadsheet/table interchange format.
But god help you if you have to accept CSVs from random people/places, or there's even minor corruption. Now you need an ELT pipeline and manual fix-ups. A real standard is way better for working with disparate groups.
CSV has caused me a lot of problems due to the weak type system. If I save a Dataframe to CSV and reload it, there is no guarantee that I'll end up with an identical dataframe.
I can depend on parquet. The only real disadvantages with parquet are that they aren't human-readable or mutable, but I can live with that since I can easily load and resave them.
Quick question while we’re on the topic of CSV files: is there a command-line tool you’d recommend for handling CSV files that are malformed, corrupted, or use unexpected encodings?
My experience with CSVs is mostly limited to personal projects, and I generally find the format very convenient. That said, I occasionally (about once a year) run into issues that are tricky to resolve.
I'm in on the "shit on microsoft for hard to use formats train" but as someone who did a LOT of .docx parsing - it turned into zen when I realized that I can just convert my docs into the easily parsed .html5 using something like pandoc.
This is a good blog post and Xan is a really neat terminal tool.
Excel won't import ISO 8601 timestamps either, which is crazy these days where it's the universal standard, and there's no excuse to use anything else.
You have to replace the "T" separator with a space and also any trailing "Z" UTC suffix (and I think any other timezone/offset as well?) for Excel to be able to parse as a time/date.
Have you tried using the "from text/csv" importer under the data tab? Where it will import your data into a table. Because that one will import ISO 8601 timestamps just fine.
This, it's dumb but Excel handles csv way better if you 'import' it vs just opening it. I use excel to quickly preview csv files, but never to edit them unless I'm OK only ever using it in Excel afterwards.
Even in that case I'd be hesitant to open a CSV file in excel. The problem is that it will automatically apply whatever transformation it thinks is appropriate the moment you open the file. Have a digit string that isn't semantically a number? Too bad, it's a number now, and we're gonna go ahead and round it. You didn't really need _all_ of the digits of that insurance policy number, did you?
They did finally add options to turn off the common offenders, but I have a deeply ingrained distrust at this point.
I don't get it - why the world, Excel can't just open the CSV, assume from the extension it's COMMA separated value and do the rest.
It does work slightly better when importing, just a little.
No, french systems also use comma to separate fields in CSV files. Excel uses semicolon to separate fields in France, meaning it generates semicolon-separated files rather than comma-separated files.
It's not the fault of CSV that Excel changes which file format it uses based on locale.
It's even worse than that. Office on my work computer is set to the English language, but my locale is French and so is my Windows language. It's saving semicolon-separated CSV files with the comma as a decimal point.
I need to uncheck File > Option Advanced > Use system separators and set the decimal separator to a dot to get Excel to generate English-style CSV files with semicolon-separated values. I can't be bothered to find out where Microsoft moved the CSV export dialog again in the latest version of Office to get it to spit out comma-separated fields.
Point is, CSV is a term for a bunch of loosely-related formats that depends among other things on the locale. In other words, it's a mess. Any sane file format either mandates a canonical textual representation for numbers independent of locale (like JSON) or uses binary (like BSON).
> It's saving semicolon-separated CSV files with the comma as a decimal point.
It's not though, is what I'm saying. It's saving semicolon-separated files, not CSV files. CSV files have commas separating the values. Saying that Excel saves "semicolon-separated CSV files" is nonsensical.
I can save binary data in a .txt file, that doesn't make it a "text file with binary data"; it's a binary file with a stupid name.
Sorry, but what Excel does is save to a file with a CSV extension. This format is well defined and includes ways to specify encoding and separator to be readable under different locales.
This format is not comma separated values. But Excel calls it CSV.
The headaches comes if people assume that a csv file must be comma separated.
That, bad specs, weird management/ timezone/ governance/ communications and random \n\r issues transformed a 2 day fun little project into a 4 week hell. I will never work with CSV in France ever again. Mostly because of Excel, normal CSV nice.
Most of the people most of the time aren't importing data from a different locale. A good assumption for defaults could be that the CSV file honors the current Windows regional settings.
It could, but it doesn't want to. The whole MS Office dominance came into being by making sure other tools can't properly open documents created by MS tools; plus being able to open standard formats but creating small incompatibilities all around, so that you share the document in MS format instead.
Probably Microsoft treats a pure-text, simply specified, human-readable and editable spreadsheet format that fosters interoperability with competing software as an existential threat.
This. I got burnt by the encoding and other issues with CSV in Excel back in the day, I've only used LibreOffice Calc (on Linux) for viewing / editing CSVs for many years now, it's almost always a trouble-free experience. Fortunately I don't deal much with CSVs that Excel-wielding non-devs also need to open these days - I assume that, for most folks, that's the source of most of their CSV woes.
I just wish Excel was a little less bad about copy-pasting CSVs as well. Every single time, without fail, it dumps them into a single column. Every single time I use "text to columns" it has insane defaults where it's fixed-width instead of delimited by, you know, commas. So I change that and finally it's fixed.
Then I go do it somewhere else and have to set it up all over again. Drives me nuts. How the default behavior isn't to just put them in the way you'd expect is mind-boggling.
Honestly if there is no comma to separate the values, then its not csv maybe Csv for character separate values or asv for anything separates values but you're right, this makes it hard how everyone is doing whatever. IMV supporting "" makes supporting anything else redundant.
It makes me a bit worried to read this thread, I would've thought its pretty common knowledge why CSV is horrible and widely agreed upon. I also have hard time taking anybody seriously who uses "specification" and "CSV" in the same sentence unironically.
I suspect its 1) people who worked with legacy systems AND LIKED IT, or 2) people who never worked with legacy systems before and need to rediscover painful old lessons for themselves.
It feels like trying to convince someone, why its a bad idea to store the year as a CHAR(2) in 1999, unsuccessfully.
The thread makes me worried for a different reason, especially how many avoid reading the RFC and even saying that the RFC doesn't matter, that there's no standard for csv. maybe there should be an extension for rfc-csv.
So easy to get data in and out of an application, opens seamlessly in Excel or your favor DB for further inspection. The only issue is the comma rather than a less used separator like | that occasionally causes issues.
Any recommendations for CSV editors on OSX? I was just looking around for this today. The "Numbers" app is pretty awful and I couldn't find any superb substitutes, only ones that were just OK.
I've said it before, CSV will still be used in 200 years. It's ugly, but it occupies an optimal niche between human readability, parsing simplicity, and universality.
JSON, XML, YAML are tree describing languages while CSV defines a single table. This is why CSV still works for a lot of things (sure there is JSON lines format of course).
and of course you can do that if you wish, as many CSV libraries allow arbitrary separators and escapes (though they usually default to the "excel compatible" format)
but at least in my case, I would not like to use those characters because they are cumbersome to work with in a text editor. I like very much to be able to type out CSV columns and rows quickly, when I need to.
It’s all a pros and cons.. the benefit of those characters are they are not used anywhere else, hence you never have to worry about escaping/quoting strings. But obviously most of my csv usage is automated in/out.
with quick and dirty bash stuff ive written the same csv parser so many times it lives in my head and i can write it from memory. no other format is like that. trying to parse json without jq or a library is much more difficult
Kudos for writing this, it's always worth flagging up the utility of a format that just is what it is, for the benefit of all. Commas can also create fun ambiguity, as that last sentence demonstrates. :P
CSV is lovely. It isn't trying to be cool or legendary. It works for the reasons the author proposes, but isn't trying to go further.
I work in a work of VERY low power devices and CSV sometimes is all you need for a good time.
If it doesn't need to be complicated, it shouldn't be. There are always times when I think to myself CSV fits and that is what makes it a legend. Are those times when I want to parallelise or deal with gigs of data in one sitting. Nope. There are more complex formats for that. CSV has a place in my heart too.
Thanks for reminding me of the beauty of this legendary format... :)
I'll repeat what I say every time I talk about CSV: I have never encountered a customer who insisted on integrating via CSV who was capable of producing valid CSV. Anybody who can reliably produce valid CSV will send you something else if you ask for it.
> CSV is not a binary format, can be opened with any text editor and does not require any specialized program to be read. This means, by extension, that it can both be read and edited by humans directly, somehow.
This is why you should run screaming when someone says they have to integrate via CSV. It's because they want to do this.
Nobody is "pretending CSV is dead." It'll never die, because some people insist on sending hand-edited, unvalidated data files to your system and not checking for the outcome until mid-morning the next day when they notice that the text selling their product is garbled. Then they will frantically demand that you fix it in the middle of the day, and they will demand that your system be "smarter" about processing their syntactically invalid files.
Seriously. I've worked on systems that took CSV files. I inherited a system in which close to twenty "enhancement requests" had been accepted, implemented, and deployed to production that were requests to ignore and fix up different syntactical errors, because the engineer who owned it was naive enough to take the customer complaints at face value. For one customer, he wrote code that guessed at where to insert a quote to make an invalid line valid. (This turned out to be a popular request, so it was enabled for multiple customers.) For another customer, he added code that ignored quoting on newlines. Seriously, if we encountered a properly quoted newline, we were supposed to ignore the quoting, interpret it as the end of the line, and implicitly append however many commas were required to make the number of fields correct. Since he actually was using a CSV parsing library, he did all of this in code that would pre-process each line, parse the line using the library, look at the error message, attempt to fix up the line, GOTO 10. All of these steps were heavily branched based on the customer id.
The first thing I did when I inherited that work was make it clear to my boss how much time we were spending on CSV parsing bullshit because customers were sending us invalid files and acting like we were responsible, and he started looking at how much revenue we were making from different companies and sending them ultimatums. No surprise, the customers who insisted on sending CSVs were mostly small-time, and the ones who decided to end their contracts rather than get their shit together were the least lucrative of all.
> column-oriented data formats ... are not able to stream files row by row
I so hate CSV.
I am on the receiving end: I have to parse CSV generated by various (very expensive, very complicated) eCAD software packages. And it's often garbage. Those expensive software packages trip on things like escaping quotes. There is no way to recover a CSV line that has an unescaped double quote.
I can't point to a strict spec and say "you are doing this wrong", because there is no strict spec.
Then there are the TSV and semicolon-Separated V variants.
Did I mention that field quoting was optional?
And then there are banks, which take this to another level. My bank (mBank), which is known for levels of programmer incompetence never seen before (just try the mobile app) generates CSVs that are supposed to "look" like paper documents. So, the first 10 or so rows will be a "letterhead", with addresses and stuff in various random columns. Then there will be your data, but they will format currency values as prettified strings, for example "34 593,12 USD", instead of producing one column with a number and another with currency.
I used to be a data analyst at a Big 4 management consultancy, so I've seen an awful lot of this kind of thing. One thing I never understood is the inverse correlation between "cost of product" and "ability to do serialisation properly".
Free database like Postgres? Perfect every time.
Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting, escaping or the difference between \n and \r and who thinks it's clever to use 0xFF as a delimiter, because in the Windows-1252 code page it looks like a weird rune and therefore "it won't be in the data".
"Enterprise software" has been defined as software that is purchased based on the decisions of people that will not use it. I think that explains a lot.
Yep, we had a constant tug of war between techies who wanted to use open-source tools that actually work (Linux, Postgres, Python, Go etc.) and bigwigs who wanted impressive-sounding things in Powerpoint decks and were trying to force "enterprise" platforms like Palantir and IBM BigInsights on us.
Any time we were allowed to actually test one of the "enterprise" platforms, we'd break it in a few minutes. And I don't mean by being pathologically abusive, I mean stuff like "let's see if it can correctly handle a UTF-8 BOM...oh no, it can't".
> Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting...
It's because about a certain size, system projects are captured by the large consultancy shops, who eat the majority of the price in profit and management overhead...
... and then send the coding work to a lowest-cost someone who has never heard of quoting, etc.
And it's a vicious cycle, because the developers in those shops that do learn and mature quickly leave for better pay and management.
(Yes, there's usually a shit hot tiger team somewhere in these orgs, but they spend all their time bailing out dumpster fires or landing T10 customers. The average customer isn't getting them.)
Just a nitpick about consultancy shops -- I've had a chance of working in one in eastern europe and noticed that it's approach to quality was way better than client's. It also helped that client paid by hours, so consultancy company was incentivized to spend more time on refactorings, improvals and testing (with constant pushback from client).
So I don't buy the consultancy company sentiment, it always boils down to engineers and incentives.
How big was the one you worked for?
In my experience, smaller ones tend to align incentives better.
Once they grow past a certain size though, it's a labor arbitrage game. Bill client X, staff with resources costing Y (and over-represented), profit = X-Y, minimize Y to maximize profit.
PwC / IBM Global Services wasn't offering the best and brightest. (Outside of aforementioned tiger teams)
I agree with you in general, although my case was the other way around. My company was 10k+ people. But my client was probably the most technically advanced company at that time, with famously hard interviews for their own employees. My employer also didn't want to lose the client (it was beginning of collaboration), and since everyone wanted to work there (and move to US+California) my shop applied pretty strong filter for their own heads, even before sending them to client's vendor-interview.
And client was very-very happy with the quality, and that we didn't fight for promotions and could maintain very important, but promotion-poor projects. Up to the point that client trusted to completely gave couple of projects fully to my shop. When you don't need to fight for promotions, code quality also improves.
Try to live in a country where "," is the decimal point. Of course this causes numerous interoperability issues or hidden mistakes in various data sets.
There would have been many better separators... but good idea to bring formatting into it as well...
There was a long period of my life that I thought .csv meant cemicolon separated because all I saw was cemicolon separated files and I had no idea of the pain.
Although it is spelled "semicolon," so that doesn't quite fit.
CMYK -- Cyan, Magenta, Yellow, blacK :)
(of course it originally stood for "key", but you don't see that much anymore)
Not sure if they still do this, but Klarna would send us ", " separated files. If there wasn't a space after the comma then it was to be read as a decimal point. Most of the CSV parser don't/didn't allow you to specify multi-character separators. In the end I just accepted that we had one field for krona and for öre and most fields would need to have a leading space removed.
Microsoft did this very extensively. Many Non-English versions of Excel do save CSV-files with a semicolon as a separator and it probably was handled differently too in normal Excel files. But it goes even further, it affected their scripting languages even to this day with newer languages like their BI script (forgot the name of the language). For example, parameters of function calls aren't separated by ',' anymore and ';' is used instead. But only in the localized versions.
That of course means that you have to translate these scripts depending on the locale set in your office suite, otherwise they are full of syntax errors...
Many English languages use ';' to end statements instead of '.'.
Many European languages use '.' to end statements, Prolog (France) for example, but use ';' to separate arguments.
There are better separators included in ASCII, but not used as often: 28 File Separator, 29 Group Separator, 30 Record Separator and 31 Unit Separator.
TSV should do it for you. Been there done that.
> Then there will be your data, but they will format currency values as prettified strings, for example "34 593,12 USD", instead of producing one column with a number and another with currency.
To be fair, that's not a problem with CSV but with the provider's lack of data literacy.
Yeah, you can also use Parquet/JSON/protobuf/XLSX and store numbers as strings in this format. CSV is just a container.
But somehow CSV is the PHP of serialization formats, attracts the wrong kind of developers and projects.
I definitely wouldn't say that. I saw a lot of weird stuff in Excel files, and there's the whole crowd only giving you data as PDFs.
I worked in a web shop which had to produce spreadsheets which people wanted to look at in Excel. I gave them so many options, and told each client to experiment and choose the option which worked for them. In the end, we had (a) UTF-8 CSV, (b) UTF-8 CSV with BOM, (c) UTF-16 TSV, (d) UTF-8 HTML table with a .xlsx file extension and a lying Content-Type header which claimed it was an Excel spreadsheet.
Option a worked fine so long as none of the names in the spreadsheet had any non-ASCII characters.
Option d was by some measures the worst (and was definitely the largest file size), but it did seem to consistently work in Excel and Libre Office. In fact, they all worked without any issue in Libre Office.
> I can't point to a strict spec and say "you are doing this wrong", because there is no strict spec.
Have you tried RFC 4180?
https://www.ietf.org/rfc/rfc4180.txt
I've written a commercial, point and click, data wrangling tool (Easy Data Transform) that can deal with a lot of these issues:
-different delimiters (comma, semi-colon, tab, pipe etc)
-different encodings (UTF8, UTF16 etc)
-different line ending (CR, LF, CR+LF)
-ragged rows
-splitting and merging columns
And much more besides.
However, if you have either:
-line feeds and/or carriage returns in data values, but no quoting
or
-quoting, but quotes in data values aren't properly handled
Then you are totally screwed and you have my sympathies!
I agree and as a result I have completely abandoned CSV.
I use the industry standard that everyone understands: ECMA-376, ISO/IEC 29500 aka .xlsx.
Nobody has any problems producing or ingesting .xlsx files. The only real problem is the confusion between numbers and numeric text that happens when people use excel manually. For machine to machine communication .xlsx has never failed me.
Off the top of my head:
https://learn.microsoft.com/en-us/office/troubleshoot/excel/...
Now you might argue that ECMA-376 accounts for this, because it has a `date1904` flag, which has to be 0 for 1900-based dates and 1 for 1904-based dates. But what does that really accomplish if you can’t be sure that vendors understand subtleties like that if they produce or consume it? Last time I checked (maybe 8 years ago), spreadsheets created on Windows and opened on Mac still shifted dates by four years, and the bug was already over twenty years old at that time.
And the year-1904 issue is just the one example that I happen to know.
I have absolutely zero confidence in anything that has touched, or might have touched, MS Excel with anything short of a ten-foot pole.
Parsing Excel files in simple data interchange use cases that don't involve anyone manually using spreadsheets is an instance of unnecessary complexity. There are plenty of alternatives to CSV that remain plaintext, have much broader support, and are more rigorous than Excel in ensuring data consistency. You can use JSON, XML, ProtoBuf, among many other options.
But everyone already has a GUI installed for editing xlsx files...
Which introduces even more problems when manually editing files is out of scope.
I was recently writing a parser for a weird CSV. It had multiple header column rows in it as well as other header rows indicating a folder.
This RFC maybe?
https://www.ietf.org/rfc/rfc4180.txt
The RFC explicitly does not define a standard
I hate CSV too. If I have to use it, I'll live with TSV or some other special-charter delimited format.
I'd much rather it be something that is neither used in normal* text/numbers, nor whitespace, thus non-printable delimiters wins for me.
* Don't mind me extending 'normal' here to include human-written numbers with thousand seperators.
If only there were character codes specifically meant to separate fields and records.... we wouldn't have to worry so much about quoted commas or quoted quotes.
That isn't solving anything, just changing the problem. If I want to store a string containing 0x1C - 0x1F in one of the columns then we're back in the exact same situation while also losing the human readable/manually typeable aspect people seem to love about CSV. The real solution is a strict spec with mandatory escaping.
Not for text data. Those values are not text characters like , or " are, and have only one meaning. It would be like arguing that 0x41 isn't always the letter "A".
For binary files, yeah but you don't see CSV used there anyway.
The idea that binary data doesn't go in CSVs is debatable; people do all sorts of weird stuff. Part of the robustness of a format is coping with abuse.
But putting that aside, if the control chars are not text, then you sacrifice human-readability and human-writability. In which case, you may as well just use a binary format.
True, but very few people compose or edit CSV data in Notepad. You can, but it's very error-prone. Most people will use a spreadsheet and save as CSV, so field and record separator characters are not anything they would ever deal with.
I've dealt with a few cases of CSVs including base64-encoded binary data. It's an unusual scenario, but the tools for working with CSVs are robust enough that it was never an issue.
So in addition to losing human readability, we are also throwing away the ability to nest (pseudo-)CSVs? With comma delimiters, I can take an entire CSV document and put it in 1 column, but with 0x1C-0x1F delimiters and banning non-text valid utf-8 in columns I no longer can. This continues to be a step backwards.
No reason you can't escape those special characters.
Then we're back to my original response of it doesn't solve anything: https://news.ycombinator.com/item?id=43495217
There are lots of 8-bit mostly-ASCII character sets that assign printable glyphs to some or all of the codepoints that ASCII assigns to control characters. TeX defined one, and the IBM PC's "code page 437" defined another.
There are several ways how a control character might inadvertently end up inside a text corpus. Given enough millions of lines, it’s bound to happen, and you absolutely don’t want it to trip up your whole export because of that one occurrence. So yes, you have to account for it in text data, too.
Sure, but that's orders of magnitude less likely to happen than a comma ending up inside a text corpus.
There's just no such thing as a delimiter which won't find its way into the data. Quoting and escaping really are the only robust way.
You can disallow all control characters (ASCII < 32) other than CR/LF/TAB, which is reasonable. I don't know of any data besides binary blobs which uses those. I've never heard of anyone inlining a binary file (like an image) into a "CSV" anyway.
If you disallow control characters so that you can use them as delimiters, then CSV itself becomes a "binary" data format - or to put it another way, you lose the ability to nest CSV.
It isn't good enough to say "but people don't/won't/shouldn't do that", because it will just happen regardless. I've seen nested CSV in real-life data.
Compare to the zero-terminated strings used by C, one legacy of which is that PostgreSQL doesn't quite support UTF-8 properly, because it can't handle a 0 byte in a string, because 0 is "special" in C.
Nested CSVs as you've seen in real-life data are a good counterexample, thanks for providing it.
So have a way to escape those control characters.
Right, but the original point I was responding to is that control characters are disallowed in the data and therefore don't need to be escaped. If you're going to have an escaping mechanism then you can use "normal" characters like comma as delimiters, which is better because they can be read and written normally.
But a comma is much more likely to need to be escaped.
It's good for a delimiter to be uncommon in the data, so that you don't have to use your escaping mechanism too much.
This is a different thing altogether from using "disallowed" control characters, which is an attempt to avoid escaping altogether - an attempt which I was arguing is doomed to fail.
CSV is a data-exchange format.
But it is terrible at that because there is no widely adhered to standard[1], the sender and receiver often disagree on the details of what exactly a CSV is.
[1]: yes, I know about RFC 4180. But csvs in the wild often don't follow it.
Agreed that that CSV construction isn't consistent even with standards.
That footprint seems to be dozens of variations to work with to find a library for?
CSV are universal though, text kind of like markdown, and that is my intended main point.
The only times I hated CSV was when it came from another system I had no control over. For example Windows and their encodings, or some other proprietary BS.
But CSV under controlled circumstances is very simple.
And speaking of Wintendo, the bonus is often that you can go straight from CSV to Excel presentation for the middle management.
The post should at least mention in passing the major problem with CSV: it is a "no spec" family of de-facto formats, not a single thing (it is an example of "historically grown"). And omission of that meams I'm going to have to call this our for its bias (but then it is a love letter, and love makes blind...).
Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files, and there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa. Quoting, escaping, UTF-8 support are particular problem areas, but also that you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess but there ambiguities in it that cannot be resolved by machine).
Having worked extensively with SGML for linguistic corpora, with XML for Web development and recently with JSON I would say programmatically, JSON is the most convenient to use regarding client code, but also its lack of types makes it useful less broadly than SGML, which is rightly used by e.g. airlines for technical documntation and digital humanities researchers to encode/annotate historic documents, for which it is very suitable, but programmatically puts more burden on developers. You can't have it all...
XML is simpler than SGML, has perhaps the broadest scope and good software support stack (mostly FOSS), but it has been abused a lot (nod to Java coders: Eclipse, Apache UIMA), but I guess a format is not responsible for how people use or abuse it. As usual, the best developers know the pros and cons and make good-taste judgments what to use each time, but some people go ideological.
(Waiting for someone to write a love letter to the infamous Windows INI file format...)
In fairness there are also several ambiguities with JSON. How do you handle multiple copies of the same key? Does the order of keys have semantic meaning?
jq supports several pseudo-JSON formats that are quite useful like record separator separated JSON, newline separated JSON. These are obviously out of spec, but useful enough that I've used them and sometimes piped them into a .json file for storage.
Also, encoding things like IEEE NaN/Infinity, and raw byte arrays has to be in proprietary ways.
JSON lines is not JSON It is built on top of it. .jsonl extension can be used to make it clear https://jsonlines.org/
Back in my day it was called NDJSON.
The industry is so chaotic now we keep giving the same patterns different names, adding to the chaos.
> How do you handle multiple copies of the same key
That’s unambiguously allowed by the JSON spec, because it’s just a grammar. The semantics are up to the implementation.
interestingly other people are answering the opposite in this thread.
They're wrong.
From ECMA-404[1] in section 6:
> The JSON syntax does not impose any restrictions on the strings used as names, does not require that name strings be unique, and does not assign any significance to the ordering of name/value pairs.
That IS unambiguous.
And for more justification:
> Meaningful data interchange requires agreement between a producer and consumer on the semantics attached to a particular use of the JSON syntax. What JSON does provide is the syntactic framework to which such semantics can be attached
> JSON is agnostic about the semantics of numbers. In any programming language, there can be a variety of number types of various capacities and complements, fixed or floating, binary or decimal.
> It is expected that other standards will refer to this one, strictly adhering to the JSON syntax, while imposing semantics interpretation and restrictions on various encoding details. Such standards may require specific behaviours. JSON itself specifies no behaviour.
It all makes sense when you understand JSON is just a specification for a grammar, not for behaviours.
[1]: https://ecma-international.org/wp-content/uploads/ECMA-404_2...
> and does not assign any significance to the ordering of name/value pairs.
I think this is outdated? I believe that the order is preserved when parsing into a JavaScript Object. (Yes, Objects have a well-defined key order. Please don't actually rely on this...)
In the JS spec, you'd be looking for 25.5.1
If I'm not mistaken, this is the primary point:
> Valid JSON text is a subset of the ECMAScript PrimaryExpression syntax. Step 2 verifies that jsonString conforms to that subset, and step 10 asserts that that parsing and evaluation returns a value of an appropriate type.
And in the algorithm
If you theoretically (not practically) parse a JSON file into a normal JS AST then loop over it this way, because JS preserves key order, it seems like this would also wind up preserving key order. And because it would add those keys to the final JS object in that same order, the order would be preserved in the output.> (Yes, Object's have a well-defined key order. Please don't actually rely on this...)
JS added this in 2009 (ES5) because browsers already did it and loads of code depended on it (accidentally or not).
There is theoretically a performance hit to using ordered hashtables. That doesn't seem like such a big deal with hidden classes except that `{a:1, b:2}` is a different inline cache entry than `{b:2, a:1}` which makes it easier to accidentally make your function polymorphic.
In any case, you are paying for it, you might as well use it if (IMO) it makes things easier. For example, `let copy = {...obj, updatedKey: 123}` is relying on the insertion order of `obj` to keep the same hidden class.
In JS maybe (I don't know tbh), but that's irrelevant to the JSON spec. Other implementations could make a different decision.
Ah, I thought the quote was from the JS spec. I didn't realize that ECMA published their own copy of the JSON spec.
Internet JSON (RRC 7493) forbids objects to have members with duplicate names.
As it says:
I-JSON (short for "Internet JSON") is a restricted profile of JSON designed to maximize interoperability and increase confidence that software can process it successfully with predictable results.
So it's not JSON, but a restricted version of it.
I wonder if use of these restrictions is popular. I had never heard of I-JSON.
I think it's rare for them to be explicilty stated, but common for them to be present in practice. I-JSON is just an explicit list of these common implicit limits. For any given tool/service that describes itself as accepting JSON I would expect I-JSON documents to be more likely to work as expected than non-I-JSON.
> How do you handle multiple copies of the same key? Does the order of keys have semantic meaning?
This is also an issue, due to the way that order of keys are working in JavaScript, too.
> record separator separated JSON, newline separated JSON.
There is also JSON with no separators, although that will not work very well if any of the top-level values are numbers.
> Also, encoding things like IEEE NaN/Infinity, and raw byte arrays has to be in proprietary ways.
Yes, as well as non-Unicode text (including (but not limited to) file names on some systems), and (depending on the implementation) 64-bit integers and big integers. Possibly also date/time.
I think DER avoids these problems. You can specify whether or not the order matters, you can store Unicode and non-Unicode text, NaN and Infinity, raw byte arrays, big integers, and date/time. (It avoids some other problems as well, including canonization (DER is already in canonical form) and other issues. Although, I have a variant of DER that avoids some of the excessive date/time types and adds a few additional types, but this does not affect the framing, which can still be parsed in the same way.)
A variant called "Multi-DER" could be made up, which is simply concatenating any number of DER files together. Converting Multi-DER to BER is easy just by adding a constant prefix and suffix. Converting Multi-DER to DER is almost as easy; you will need the length (in bytes) of the Multi-DER file and then add a prefix to specify the length. (In none of these cases does it require parsing or inspecting or modifying the data at all. However, converting the JSON variants into ordinary JSON does require inspecting the data in order to figure out where to add the commas.)
Plus the 64-bit integer problem, really 52-bit integers, due to JS not having integers.
JSON itself is not limited to neither 52 nor 64-bit integers.
https://json.org/That’s a JavaScript problem, not JSON.
Most good parsers have an option to parse to integers or arbitrary precision decimals.
Agreed. Which means that Javascript does not have a good parser.
`JSON.parse` actually does give you that option via the `reviver` parameter, which gives you access to the original string of digits (to pass to `BigInt` or the number type of your choosing) – so per this conversation fits the "good parser" criteria.
To be specific (if anyone was curious), you can force BigInt with something like this:
Generally, I'd rather throw if a number is unexpectedly too big otherwise you will mess up the types throughout the system (the field may not be monomorphic) and will outright fail if you try to use math functions not available to BigInts.Huh, TIL!
https://caniuse.com/mdn-javascript_builtins_json_parse_reviv...
Absent in Safari though
Sadly the reviver parameter is a new invention only recently available in FF and Node, not at all in Safari.
Naturally not that hard to write a custom JSON parser but the need itself is a bad thing.
Just use the polyfill
https://github.com/zloirock/core-js#jsonparse-source-text-ac...
No it's been there for ages. Finalized as part of ecmascript 5
What you are probably thinking of is the context parameter of the reviver callback. That is relatively recent and mostly a qol improvement
Sorry yes, i was thinking of the context object with source parameter.
The issue it solves is a big one though, since without it the JSON.parse functionality cannot parse numbers that are larger than 64bit float numbers (f.ex. bigints).
bigint exists
They do specifically mention this:
“No one owns CSV. It has no real specification (yes, I know about the controversial ex-post RFC 4180), just a set of rules everyone kinda agrees to respect implicitly. It is, and will forever remain, an open and free collective idea.”
They even seem to think it is a good thing. But I don't see how not having a bunch of implementations that can't agree on the specifics of a file/interchange format is a good thing. And being free and open is completely orthogonal. There are many proprietary formats that don't have a spec, and many open formats that do have a spec (like, say, json).
That's true of the vast majority of protocols that people use in real life to exchange date. We're using one of them right now, in fact.
Waiting for someone to write a love letter to the infamous Windows INI file format
I actually miss that. It was nice when settings were stored right alongside your software, instead of being left behind all over a bloated registry. And the format was elegant, if crude.
I wrote my own library for encoding/writing/reading various datatypes and structure into ini's, in a couple different languages, and it served me well for years.
TOML is nice like that... elegant like INI, only with lists.
> instead of being left behind all over a bloated registry
Really? I think the idea of a central, generic, key-value pair database for all the setting on a system is probably the most elegant reasonable implementation there could be.
The initial implementation of Windows Registry wasn't good. It was overly simplistic and pretty slow. Though the "bloat" (what ever that means) of registry hasn't been an actual issue in over 20 years. The only people invested in convincing you "it's an issue" are CCleaner type software that promise to "speed up your computer" if you just pay $6.99.
How many rows do you need in a sqlite database for it to be "bloated"?
I feel like YAML is a spiritual successor to the .ini, since it shares a notable ideal of simple human readability/writability.
Whenever I ask myself "should I use YAML?" I answer myself "Norway".
To be fair to YAML that's been solved in 1.2.
https://yaml.org/spec/1.2.2/#10212-boolean
> I feel like YAML is a spiritual successor to the .ini, since it shares a notable ideal of simple human readability/writability.
It doesn't feel that way to me: it's neither simple to read nor to write. I suppose that that's a builtin problem due to tree representation, which is something that INI files were never expected to represent.
TBH, I actually prefer the various tree representation workarounds used by INI files: using whitespace to indicate child nodes stops being readable once you have more than a screenful of children in a node.
Given how YAML does magic and sometimes accidental type conversions of potentially nested objects, I think TOML is the well-defined sucessor to .ini
YAML is readable? No way as there are too many ways to do the same thing and nested structures are unclear to the non trained eye (what is a list? What is nested?), let alone indentation in large files is an issue especially with the default 2 space unreadable standard so many people adhere to.
YAML simple? It's sepc is larger than XML... Parsing of numbers and strings is ambiguous, leading zeros are not strings but octal (implicit conversion...). List as keys? Oh ffs, and you said readable. And do not get me started about "Yes" being a boolean, reminds me of the MS Access localizations which had other decimal values for true and [local variant of true] (1 vs -1).
Writable? Even worse. I think I have never been able to write a YAML file without errors. But that might just be me, XML is fine though while unreadable.
I agree that one could make wild YAML if you get into advanced stuff, but I make YAML files that look like this:
Just because you can use it to create a monstrosity doesn't prevent it from being useful for simple configuration. Basically, it's just prettier JSON.Say "no" to YAML. As a string, if you can.
You can. YAML 1.2 is only 16 years old. Just old enough to drive. Norway problem has been solved for only 16 years.
YAML 1.2 leaves data types ambiguous, merely making the "Norway problem" optional and at the mercy of the application rather than, in the words of https://yaml.org/type/ (which has not been marked as deprecated), "strongly recommended".
Those schemas aren't part of the core schema, and you may interpret them if you are aiming for full 1.1 compatibility. If you're aiming for 1.1 compatibility, then you accept the Norway problem.
I've been looking in the specs and I can't find the link to the https://yaml.org/type/
I think GRON[1] would fit the bill better
[1] https://github.com/tomnomnom/gron
The post does mention it, as a positive:
https://github.com/medialab/xan/blob/master/docs/LOVE_LETTER...
People who say that CSV is "simpler" are talking about whatever format Excel exports.
Also these people have only ever had to deal with the American Excel localization.
So yeah, with the caveat of "only ever use Excel and only ever the American edition" CSV is pretty nice.
As someone living in a country where , is used as the decimal separator, I cannot begin to describe the number of times CSV data has caused me grief. This becomes especially common in an office environment where Excel is the de facto only data handling tool that most people can and will use. Here the behavior of loading data becomes specific to the individual machine and changes over time (e.g. when IT suddenly forces a reset of MS Office application languages to the local one).
That said, I don't really know of any alternative that won't be handled even worse by my colleagues...
Also keeping in mind all the locales where comma is the decimal point…tsv for the world.
And all the 'simple' formats start failing when dealing with blocks of text.
To be honest, I'm wondering why you are rating JSON higher than CSV.
> Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files,
There is, actually, RFC 4180 IIRC.
> there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa.
"There are many flavours that deviate from the spec" is a JSON problem too.
> you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess but there ambiguities in it that cannot be resolved by machine).
Also a problem in JSON
> Quoting, escaping, UTF-8 support are particular problem areas,
Sure, but they are no more nor no less a problem in JSON as well.
Have you had to work with csv files from the wild much? I'm not being snarky but what you're talking about is night and day to what I've experienced over the years.
There aren't vast numbers of different JSON formats. There's practically one and realistically maybe two.
Headers are in each line, utf8 has never been an issue for me and quoting and escaping are well defined and obeyed.
This is because for datasets, almost exclusively, the file is machine written and rarely messed with.
Csv files have all kinds of separators, quote characters, some parsers don't accept multi lines and some do, people sort files which mostly works until there's a multi line. All kinds of line endings, encodings and mixed encodings where people have combined files.
I tried using ASCII record separators after dealing with so many issues with commas, semicolons, pipes, tabs etc and still data in the wild had these jammed into random fields.
Lots of these things don't break when you hit the issue either, the parsers happily churn on with garbage data, leading to further broken datasets.
Also they're broken for clients if the first character is a capital I.
WRT JSON:
> Headers are in each line
This might be my old “space and network cost savings” reflex, which is a lot less necessary these days, kicking in, but the feels inefficient. It also gives rise to not knowing the whole schema until you read the whole dataset (which might be multiple files), unless some form of external schema definition is provided.
Having said that, I accept that JSON has advantages over CSV, even if all that is done is translating a data-table into an array of objects representing one row each.
> utf8 has never been an issue for me
The main problem with UTF8 isn't with CSV generally, it is usually, much like the “first column is called ID” issue, due to Excel. Unfortunately a lot of people interact with CSVs primarily with Excel, so it gets tarred with that brush by association. Unless Excel sees the BOM sequence at the start of a CSV file, which the Unicode standards recommend against for UTF8, it assumes its characters are using the Win1252 encoding (almost, but not quite, ISO-8859-1).
> Csv files have all kinds of separators
I've taken to calling them Character Separated Value files, rather than Comma, for this reason.
Yes, it's not great. Space is annoying, though compression pretty much removes that as a concern (zstd is good for this, you can even have a custom dictionary). And yes, missing keys is annoying.
JSONL is handy, JSON that's in the form {data: [...hundred megs of lines]} is annoying for various parsers.
I'm quite a fan of parquet, but never expect to receive that from a client (alas).
> JSON that's in the form {data: [...hundred megs of lines]} is annoying for various parsers.
One reason this became common was a simple protection against json hijacking: https://haacked.com/archive/2009/06/25/json-hijacking.aspx/
Parquet should get the praise. It's simply awesome.
It's what I'd pick for tabular data exchange.
A recent problem I solved with it and duckdb allowed me to query and share a 3M record dataset. The size? 50M. And my queries all ran subsecond. You just aren't going to get that sort of compression and query-ability with a csv.
I wonder if CSV is the trivial format, so you have many people picking it because they want the easiest, and still getting it wrong. JSON is harder, so very few people are going to roll their own serializer/deserializer, and those who do are more likely to focus on getting it right (or at least catching the really obvious bugs).
I've dealt with incorrect CSVs numerous times, never with incorrect JSON, but, of the times I know what was happening on the other system, each time the CSV was from some in house (or similar) implementation of dumping a SQL output (or similar) into a text file as an MVP. JSON was always using some library.
If so, that's all the more reason to love CSV as it stands guard for JSON. If CSV didn't exist, we would instead have broken JSON implementations. (JSON and XML would likely then share a similar relationship.)
Sometimes people interpret the term too generically and actually implement a high degree of non-trivial, very idiosyncratic complexity, while still calling it "CSV".
One project I worked on involved a vendor promising to send us data dumps in "CSV format". When we finally received their "CSV" we had to figure out how to deal with (a) global fields being defined in special rows above the header row, and (b) a two-level hierarchy of semicolon-delimited values nested within comma-delimited columns. We had to write a custom parser to complete the import.
Hi,
Yes, we chose ARFF format, which is idiosyncratic yet well-defined back in the old data mining days.
Sure, I get your arguments and we're probably mostly in agreement, but in practice I see very few problems arising with using CSV.
I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.
I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.
And yet, they work.
The number of people uploading excel files to be processed or downloading excel files for local visualistation and processing would floor you. It's done multiple times a day, on multiple systems, in multiple companies.
And yet, they work.
I get your argument though - a JSON array of arrays can represent everything that CSV can, and is preferable to CSV, and is what I would choose when given the choice, but the issues with using that are not going to be fewer than issues with CSV using RFC1480.
>but in practice I see very few problems arising with using CSV
That is not my experience at all. I've been processing CSV files from financial institutions for many years. The likelihood of brokenness must be around 40%. It's unbelievable.
The main reason for this is not necessarily the CSV format as such. I believe the reason is that it is often the least experienced developers who are tasked with writing export code. And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.
JSON is better but it doesn't help with things like getting dates right. XML can help with that but it has complexities that people get wrong all the time (such as entities), so I think JSON is the best compromise.
> And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.
Can't they?
I haven't tested this, even to see if the code parses. What did I screw up?>Can't they?
If my experience reflects a relevant sample then the answer is that most can but a very significant minority fails at the job (under the given working conditions).
Whether or not _you_ can is a separate question. I don't see anything wrong with your code. It does of course assume that whatever is contained in rows is correct. It also assumes that the result is correctly written to a file without making any encoding mistakes or forgetting to flush the stream.
Not using name value pairs makes CSV more prone to mistakes such as incorrect ordering or number of values in some rows, a header row that doesn't correspond with the data rows, etc. Some export files are merged from multiple sources or go through many iterations over many years, which makes such mistakes far more likely.
I have also seen files that end abruptly somewhere in the middle. This isn't specific to CSV but it is specific to not using libraries and not using libraries appears to be more prevalent when people generate CSV.
You'd be surprised how many CSV files are out there where the developer tried to guess incorrectly whether or not a column would ever have to be escaped. Maybe they were right initially and it didn't have to be escaped but then years later something causes a change in number formats (internationalisation) and bang, silent data corruption.
Prioritising correctness and robustness over efficiency as you have done is the best choice in most situations. Using a well tested library is another option to get the same result.
This forces each field to be quoted, and it assumes that each row has the same fields in the same order. A library can handle the quoting issues and fields more reliably. Not sure why you went with a generator for this either.
Most people expect something like `12,,213,3` instead of `"12","213","3"` which yours might give.
https://en.wikipedia.org/wiki/Comma-separated_values#Basic_r...
Forcing each field to be quoted is always correct, isn't it? How could something be "more reliable" than something that is always correct?
With respect to "the same fields in the same order", no, although you may or may not feed the CSV to an application that has such an expectation. But if you apply it to data like [("Points",),(),("x","y"),("3","4"),("6","8","10")] it will successfully preserve that wonky structure in a file Excel can ingest reliably. (As reliably as Excel can ingest anything, anyway, since Excel has its own Norway problem.)
It's true that it's possible to produce more optimized output, but I didn't claim that the output was optimal, just correct.
Using generators is necessary to be able to correctly output individual fields that are many times larger than physical memory.
I'll preface this that I think we are mostly in agreement, so that's the friendly tone of reply, part of this is just having flashbacks.
It's massively used, but the lack of adherence to a proper spec causes huge issues. If you have two systems that happen to talk properly to each other, great, but if you are as I was an entrypoint for all kinds of user generated files it's a nightmare.
CSV is the standard, sure, but it's easy to write code that produces it that looks right at first glance but breaks with some edge case. Or someone has just chosen a different separator, or quote, so you need to try and detect those before parsing (I had a list that I'd go through, then look for the most commonly appearing non-letter character).
The big problem is that the resulting semantically broken csv files often look pretty OK to someone scanning them and permissive parsers. So one system reads it in, splits something on lines and assumes missing columns are blank and suddenly you have the wrong number of rows, then it exports it. Worse if it's been sorted before the export.
Of course then there's also the issues around a lack of types, so numbers and strings are not distinguishable automatically leading to broken issues where you do want leading zeros. Again often not identified until later. Or auto type detection in a system breaking because it sees a lot of number-like things and assumes it's a number column. Without types there's no verification either.
So even properly formatted CSV files need a second place for metadata about what types there are in the file.
JSON has some of these problems too, it lacks dates, but far fewer.
> but the issues with using that are not going to be fewer than issues with CSV using RFC1480.
My only disagreement here is that I've had to deal with many ingest endpoints that don't properly support that.
Fundamentally I think nobody uses CSV files because they're a good format. They've big, slow to parse, lack proper typing, lack columnar reading, lack fast jumping to a particular place, etc.
They are ubiquitous, just not good, and they're very easy to screw up in hard to identify or fix ways.
Finally, lots of this comes up because RFC4180 is only from *2005*.
Oh, and if I'm reading the spec correctly, RFC4180 doesn't support UTF8. There was a proposed update maybe in 2022 but I can't see it being accepted as an RFC.
> I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.
And there are constant issues arising from that. You basically need a small team to deal with them in every institution that is processing them.
> I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.
Salesman of enterprise system do not care about issues programmers and clients have. They care about what they can sell to other businessmen. That teams on both sides then waste time and money on troubleshooting is no concern to the salesman. And I am saying that as someone who worked on the enterprise system that consumed a lot of csv. It does not work and process of handling them literally sometimes involved phone calls to admins of other systems. More often then would be sane.
> The number of people uploading excel files to be processed or downloading excel files for local visualistation and processing would floor you.
That is perfectly fine as long as it is a manager downloading data so that he can manually analyze them. It is pretty horrible when those files are then uploaded to other systems.
In practice, I have never ever received CSV to process that complied with RFC 4180, and in most cases it was completely incoherent and needed incredibly special handling to handle all the various problems like lack of escaping.
SAP has been by far the worst. I never managed to get data out of it that were not completely garbage and needed hand crafted parsers.
SAP only has to be SAP and MS Excel compatible. The rest is not needed so in their eyes it is probably to spec.
European quality™.
> And yet, they work.
Through a lot of often-painful manual intervention. I've seen it first-hand.
If an organization really needs something to work, it's going to work somehow—or the organization wouldn't be around any more—but that is a low bar.
In a past role, I switched some internal systems from using CSV/TSV to using Parquet and the difference was amazing both in performance and stability. But hey, the CSV version worked too! It just wasted a ton of people's time and attention. The Parquet version was far better operationally, even given the fact that you had to use parquet-tools instead of just opening files in a text editor.
> There aren't vast numbers of different JSON formats.
Independent variations I have seen:
* Trailing commas allowed or not * Comments allowed or not * Multiple kinds of date serialization conventions * Divergent conventions about distinguishing floating point types from integers * Duplicated key names tolerated or not * Different string escaping policies, such as, but not limited to "\n" vs "\x0a"
There are bazillions of JSON variations.
> Trailing commas allowed or not
The json spec does not allow commas. Although there are jsom supersets that do.
> Comments allowed or not
The json spec does not allow comments. Although there are jsom supersets that do.
> Multiple kinds of date serialization conventions
Json spec doesn't say anything about dates. That is dependent on your application schema.
> Divergent conventions about distinguishing floating point types from integers
This is largely due to divergent ways different programming languages handle numbers. I won't say jsom handles this the best, but any file format used across multiple languages will run into problems with differences in how numbers are represented. At least there is a well defined difference between a number and a string, unlike csv.
> Duplicated key names tolerated or not
According to the spec, they are tolerated, although the semantics of such keys is implementation defined.
> Different string escaping policies, such as, but not limited to "\n" vs "\x0a"
Both of those are interpreted as the same thing, at least according to the spec. That is an implementation detail of the serializer, not a different language.
And CSV parsers and serializers compliant with RFC 4180 are similarly reliable.
But many, perhaps most, parsers and serializers for CSV are not compliant with RFC 4180.
RFC 4180 is not an official standard. The text of the RFC itself states:
> This memo provides information for the Internet community. It does > not specify an Internet standard of any kind.
CSVs existed long before that RFC was written, and it is more a description of CSVs that are somewhat portable, not a definitive specification.
That RFC doesn't even support utf8.
It is, and accepts it is, codifing best practices rather than defining an authoritative standard.
There are always many, but in comparison to csv I've received almost no differences. Json issues were rare but csv issues it was common to have a brand new issue per client.
Typically the big difference is there are different parsers that are less tolerant of in spec values. Clickhouse had a more restrictive parser, and recently I've dealt with matrix.
Maybe I've been lucky for json and unlucky for csv.
What's the problem with capital I?
https://superuser.com/questions/210027/why-does-excel-think-... says itks not capital I but “ID”.
Basically, Excel uses the equivalent of ‘file’ (https://man7.org/linux/man-pages/man1/file.1.html), sees the magic “ID”, and decides a SYLK file, even though .csv files starting with “ID” have outnumbered .SYLK files by millions for decades.
Thanks. So I guess the easy compatible solution is to always quote the first item on the first line when writing CSV. Good to know. (Checking if the item starts with ID is more work. Possibly quote all items on the first line for simplicity.) (Reading SYLK is obviously irrelevant, so accepting unquoted ID when reading is the smarter way to go and will actually improve compatibility with writers that are not Excel. Also it takes no work.)
The byte for a capital I is the same as the start for an odd file format, slyk maybe? Excel has (or did if they finally fixed it) for years decided this was enough to assume the file (called .csv) cannot possibly be csv but must actually be slyk. It then parses it as such, and is shocked to find your slyk file is totally broken!
It sounds to me like as often the problem here is Excel, not CSV
Yes but in practice CSV is defined by what Excel does.
As there is no standard to which Excel conforms as it predates standards and there would be an outcry if Excel started rejecting files that had worked for years.
There is a common misconception here. You can import CSV files into an excel sheet. You cannot open a CSV file with excel. That is a nonsense operation.
Excel do not ask the user whether they want to import file, and tell user their file was broken.
Clients don't particularly make the distinction, and in a way nor should they - they can't open your file.
Probably referring to the "turkish i problem"
Not an unreasonable guess, but it turned out to be something different.
> There is, actually, RFC 4180 IIRC.
Does any software fully follow that spec (https://www.rfc-editor.org/rfc/rfc4180)? Some requirements that I doubt are commonly followed:
- “Each record is located on a separate line, delimited by a line break (CRLF)” ⇒ editing .csv files using your the typical Unix text editor is complicated.
- “Spaces are considered part of a field and should not be ignored”
- “Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes” ⇒ fields containing lone carriage returns or new lines need not be enclosed in double quotes.
> Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files
There is such a document: RFC 4180. It may not be a good document, but it does exist.
INI was for a long time a seemingly preferable format in the Python community for configuration for a long time, as I recall it.
Haven’t been a full time Python dev in sometime though, it seems TOML has supplanted that, but I remember thinking how interesting it was that Python had a built in INI parser and serializer
While CSV isn't exactly grammared or standardised like XML I think if it as more schema:d than JSON. There might be data corruption or consistency issues, but there is implicitly a schema: every line is exactly n fields, and the first line might contain field names.
When a JSON API turns out to have optional fields it usually shows through trial and error, and unlike CSV it's typically not considered a bug you can expect the API owner to fix. In CSV 'missing data' is an empty string rather than nulls or their cousins because missing fields aren't allowed, which is nice.
I also like that I can write my own ad hoc CSV encoder in most programming languages that can do string concatenation, and probably also a suitable decoder. It helps a lot in some ETL tasks and debugging. Decent CSV also maps straight to RDBMS tables, if the database for some reason fails at immediate import (e.g. too strict expectations) into a newly created table it's almost trivial to write an importer that does it.
JSON is not schema’d per se and intentionally so. There’s jsonschema which has better expressiveness than inference of a tabular schema, as it can reflect relationships.
Sure. I have yet to come across a data source with JSON Schema, I'll develop an opinion of it when I do.
I lived through SOAP/WSDL horror with their numerous standards and the lack of compatibility between stacks in different programming languages. Having seen abused XML, CSV formats. CSV is preferable over XML. Human-readability matters. Relative simplicity matters.
Despite JSON may also be interpreted differently by different tools, it is a good default choice for communicating between programs
> lack of compatibility between stacks in different programming languages
Well, that sure beats OpenAPI lack of compatibility between stacks in the same programming language.
I think the fact one can't randomly concatenate strings and call it "valid XML" a huge bonus over the very common "join strings with comma and \r\n", non-rfc4180 compliant (therefore mostly unparseable without human/LLM interaction) garbage people often pretend is CSV.
There is no file format that works out of box under all extreme corner cases.
You would think that ie XML-defined WSDL with XSD schema is well battle proven. I've encountered 2 years ago (and still dealing with that) WSDL from a major banking vendor that is technically valid, but no open source library in Java (from all languages) was able to parse it successfully or generate binding classes out of box.
Heck, flat files can end up with extreme cases, just work enough with legacy banking or regulatory systems and you will see some proper shit.
The thing is, any sort of critical integration needs to be battle tested and continuously maintained, otherwise it will eventually go bad, even a decade after implementation and regular use without issues.
What about S-expressions? Where do they break?
XML is a pretty good markup language. Using XML to store structured data is an abuse of it. All the features that are useful for markup are not useful for structured data and only add to the confusion.
> Waiting for someone to write a love letter to the infamous Windows INI file format...
Honestly, it’s fine. TOML is better if you can use it, but otherwise for simple applications, it’s fine. PgBouncer still uses INI, though that in particular makes me twitch a bit, due to discovering that if it fails to parse its config, it logs the failed line (reasonable), which can include passwords if it’s a DSN string.
Well, once you get over the fact that information on a TOML file can be out of order in any place, denominated by any mix of 3 different key encodings, and broken down in any random way... then yes, the rest of TOML is good.
I should write a love letter to JSON.
For as good as JSON is or is not, it's definitely not under-rated.
It does mention this. Point 2.
why you hate csv, not the program that is not able to properly create csv?
CSV is ever so elegant but it has one fatal flaw - quoting has "non-local" effects, i.e. an extra or missing quote at byte 1 can change the meaning of a comma at byte 1000000. This has (at least) two annoying consequences:
1. It's tricky to parallelise processing of CSV. 2. A small amount of data corruption can have a big impact on the readability of a file (one missing or extra quote can bugger the whole thing up).
So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
JSON serialized without extra white space with one line per record is superior to CSV.
If you want CSV-ish, enforce an array of strings for each record. Or go further with actual objects and non-string types.
You can even jump to an arbitrary point and then seek till you see an actual new line as it’s always a record boundary.
It’s not that CSV is an invalid format. It’s that libraries and tools to parse CSV tend to suck. Whereas JSON is the lingua franca of data.
> It’s that libraries and tools to parse CSV tend to suck. Whereas JSON is the lingua franca of data.
This isn't the case. An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity. Despite the lack of any sort of specification, it's easily the most widely supported data format in existence in terms of tools and language support.
> An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity.
Yea and it's still a partially-parseable shit show with guessed values. But we can and could have and should have done better by simply defining a format to use.
Meanwhile, Excel exports to CSV as “semicolon separated values” depending on your OS locale
Albeit for fairly justifiable reasons
Justifiable how?
Well, Excel has a lot of common use-cases around processing numeric (and particularly financial) data. Since some locales use commas as decimal separators, using a character that's frequently present as a piece of data as a delimiter is a bit silly; it would be hard to think of a _worse_ character to use.
So, that means that Excel in those locales uses semicolons as separators rather than the more-frequently-used-in-data commas. Probably not the decision I'd make in retrospect, but not completely stupid.
Decimal separators being commas in some locales?
They could have just ignored the locale altogether though. Put dots on the numbers when using csv, and assume it has dots when importing
This exactly. Numbers in XLS(X) are (hopefully) not locale-specific – why should they be in CSV?
CSV -> text/csv
Microsoft Excel -> application/vnd.ms-excel
CSV is a text format, xls[x], json, and (mostly) xml are not.
Commas are commonly used in text, too.
Clearly they should have gone with BEL as the delimiter.
I'm hoping no reasonable person would ever use BEL as punctuation or decimal separator.If one was going to use a non-printable character as a delimiter, why wouldn't they use the literal record separator "\030"?
Every time you cat a BSV file, your terminal beeps like it's throwing a tantrum. A record separator (RS) based file would be missing this feature! In other words, my previous comment was just a joke! :)
By the way, RS is decimal 30 (not octal '\030'). In octal, RS is '\036'. For example:
See also https://en.cppreference.com/w/cpp/language/ascii for confirmation.On the off chance you're not being facetious, why not ASCII 0 as a delimiter? (This is a rhetorical question.)
ASCII has characters more or less designed for this
0x1C - File Separator
0x1D - Group Separator
0x1E - Record Separator
0x1F - Unit Separator
So I guess 1F would be the "comma" and 1E would be the "newline."
https://stackoverflow.com/questions/8695118/what-are-the-fil...
I am pretty sure you shifted the meaning, the decimal separator is part of the atomic data it does not need a control character.
You would use 1F instead of the comma/semicolon/tab and 1E to split lines (record means line just like in SQL).
You could then use 1D to store multiple CSV tables in a single file.
Yes but then the text is not human readable or editable in a plain text editor.
This would confuse most users of csvs they are not programmers they at most use text editors and Excel.
I am not proposing to do this, but if you were to use ascii separators you would do it this way
There are some decent arguments for BEL over NUL, however given you posed that as a rhetorical question I feel I can say little other than
ding! ding! ding! winner winner, chicken dinner!
Although BEL would drive me up the wall if I broke out any of my old TTY hardware.
...and excel macros
Sure, let's put quotation marks around all number values.
Oh wait.
lol
Can you point me to a language with any significant number of users that does NOT have a JSON library?
I went looking at some of the more niche languages like Prolog, COBOL, RPG, APL, Eiffel, Maple, MATLAB, tcl, and a few others. All of these and more had JSON libraries (most had one baked into the standard library).
The exceptions I found (though I didn't look too far) were: Bash (use jq with it), J (an APL variant), Scratch (not exposed to users, but scratch code itself is encoded in JSON), and Forth (I could find implementations, but it's very hard to pin down forth dialects).
I made no claim about JSON libraries. I contested the claim that "CSV libraries and tools suck". They do not.
CSV tooling has had to invest enormous amounts of effort to make a fragile, under-specified format half-useful. I would call it ubiquitous, I would call the tooling that we’ve built around it “impressive” but I would by no means call any of it “good”.
I do not miss dealing with csv files in the slightest.
> CSV tooling has had [...] to make a fragile, under-specified format half-useful
You get this backwards. Tabular structured data to store are ubiquitous. Text as a file format is also ubiquitous because it is accessible. The only actual decisions are about whether to encode your variables as rows or columns, what is the delimiter, and other rules such as escaping etc. Vars as columns makes sense because it makes appending easier. There is a bunch of stuff that can be used for delimeters, commas being the most common, none is perfect. But from this point onwards, decisions do not really matter, and "CSV" basically covers everything from now on. "CSV" is basically what comes naturally when you have tabular datasets and want to store them in text. CSV tooling is developed because there is a need for this way of formatting data. Whether CSV is "good" or "ugly" or whatever is irrelevant, handling data is complicated as much as the world itself is. The alternatives are either not structuring/storing the data in a tabular manner, or non-text (eg binary) formats. These alternative exist and are useful in their own right, but don't solve the same problems.
I think the issue is that CSV parsing is really easy to screw up. You mentioned delimiter choice and escaping, and I’d add header presence/absence to that list.
There are at least 3 knobs to turn every time you want to parse a CSV file. There’s reasonably good tooling around this (for example, Python’s CSV module has 8 parser parameters that let you select stuff), but the fact that you have to worry about these details is itself a problem.
You said “handling data is complicated as much as the world itself is”, and I 100% agree. But the really hard part is understanding what the data means, what it describes. Every second spent on figuring out which CSV parsing option I have to change could be better spent actually thinking about the data.
I am kind of amazed how people nag about having to parse practically a random file.
Having header or not should be specified up front and one should not parse some unknown file because that will always end up with failure.
If you have your own serialization and your own parsing working yeah this will simply work.
But then not pushing back to the user some errors and trying to deal with everything is going to be frustrating because amount of edge cases is almost infinite.
Handling random data is hard, saying it is a CSV and trying to support everything that comes with it is hard.
Microsoft Windows has had to invest enormous amounts...
Apple macOS has had to invest enormous amounts...
Pick your distro of Linux has had to invest enormous amounts...
None of them a perfect and any number of valid complaints can be said about any of them. None of the complaints make any of the things useless. Everyone has workarounds.
Hell, JSON has had to invest enormous amounts of effort...
I guess the point is that I can take a generic json parser and point it at just about any JSON I get my hands on, and have close to no issues parsing it.
Want to do the same with csv? Good luck. Delimiter? Configurable. Encoding? Configurable. Misplaced comma? No parse in JSON, in csv: might still parse, but is now semantically incorrect and you possibly won’t know until it’s too late, depending on your parser. The list goes on.
Here is a quick test
The table of contents points to a single Json object that is 20ish gb compressed
https://www.anthem.com/machine-readable-file/search/
All stock libs will fail
You claimed that CSV is "easily the most widely supported data format in existence in terms of tools and language support", which is a claim that CSV is better supported than JSON, which is a claim that JSON support is lacking.
Can you import .jsonl files into Google sheets or excel natively?
Importing csvs in excel can be a huge pain due to how excel handles localisation. It can basically alter your data if you are not mindful about that, and I have seen it happening too many times.
Excel dropping leading zeros (as in ZIP codes) was a crazy design decision that has certainly cost many lifetimes of person-hours.
And forcing 16+ digits to be floats, destroying information.
Yeah have had similar struggles with social security numbers.
For example:
Scientists rename genes because Microsoft Excel reads them as dates (2020)
https://www.reddit.com/r/programming/comments/i57czq/scienti...
I was so glad of that story. It gave me something to point to to get my boss off my back.
But it handles it better than Json.
Depends on what you mean by "better". I would rather software not handle a piece of data at all, than handle it erroneously and changing the data without me realising and thus causing all sorts of issues after.
In practice, web browsers accept the tag soup that is sometimes called html and strict xml-based formats failed.
The browser are not a database (unlike excel). Modifying data before showing it is reversible, modifying it before storing it is not.
Excel.
Before you dismiss it as 'not a language, people have argued that it is. And you can definitely program stuff in it, and so that surely makes it a language
Excel can import and parse JSON, it's under the "Get Data" header. It doesn't have a direct GUI way to export to JSON, but it takes just a few lines in Office Scripts. You can even use embedded TypeScript to call JSON.stringify.
I’ve found that the number of parsers that don’t handle multiline records is pretty high though.
It's widely, but inconsistently, supported. The behavior of importers varies a lot, which is generally not the case for JSON.
> Despite the lack of any sort of specification
People keep saying this but RFC 4180 exists.
> it's easily the most widely supported data format in existence in terms of tools and language support.
Even better, the majority of the time I write/read CSV these days I don't need to use a library or tools at all. It'd be overkill. CSV libraries are best saved for when you're dealing with random CSV files (especially from multiple sources) since the library will handle the minor differences/issues that can pop up in the wild.
JSON is a textual encoding no different than CSV.
It's just that people tend to use specialized tools for encoding and decoding it instead of like ",".join(row) and row.split(",")
I have seen people try to build up JSON strings like that too, and then you have all the same problems.
So there is no problem with CSV except that maybe it's too deceptively simple. We also see people trying to build things like URLs and query strings without using a proper library.
The problem with CSV is that there's no clear standard, so even if you do reach for a library to parse it, that doesn't ensure compatibility.
There is a clear standard and it's usually written on an old Word '97 doc in a local file server. Using CSV means that you are the compatibility layer, and this is useful if you need firm control or understanding of your data.
If that sounds like a lot of edge-case work keep in mind that people have been doing this for more than half a century. Lots of examples and notes you can steal.
https://datatracker.ietf.org/doc/html/rfc4180 exists
And does Excel fully comply and more imprtantly tell you when the CSV file is wrong
No. Excel's fault. Not CSV. There are plenty of busted CSV parsers (and serializers) too.
Same for JSON though. What Python considers a valid JSON might not be that if you ask a Java library.
JSON has a clearly-defined standards: ISO/IEC 21778:2017, IETF RFC 7159, and ECMA-404. Additionally, Crockford has had a spec available on json.org since it's creation in 2001.
Do you have any examples of Python, Java, or any of the other Tiobe top 40 languages breaking the JSON spec in their standard library?
In contrast, for the few of those that have CSV libraries, how many of those libraries will simply fail to parse a large number of the .csv variations out there?
Not to mention that stuff like Excel loves to export CSV files in "locale specific ways".
Sometimes commas to delimiter, sometimes semicolons, floating point values might have dots or commas to separate fraction digits.
Not to mention text encodings, Ascii, western european character sets, or maybe utf-8 or whatever...
It's a bloody mess.
That’s why we just email the sheets around like it’s 1999 :)
You need more than a standard; that standard has to be complete and unambiguous. What you're looking for is https://github.com/nst/JSONTestSuite
EDIT: The readme's results are from 2016, but there's more recent results (last updated 5 years ago). Of the 54 parsers /versions tested, 7 gave always the expected result per the spec (disregarding cases where the spec does not define a result).
It falls down under very large integers -- think large valid uint64_t values.
JSON doesn't fail for very large values because they are sent over the wire as strings. Only parsers may fail if they or their backing language doesn't account for BigInts or floats larger than f64, but these problems exist when parsing any string to a number.
And indeed applies to CSV as well: it's just strings at the end of the day, its up to the parser to make sense of it into the data types one wants. There is nothing inherently stopping you from parsing a JSON string into a uint64: I've done so plenty!
Example? I know there's some ambiguity over whether literals like false are valid JSON, but I can't think of anything else.
That _shouldn't_ be ambiguous, `false` is a valid JSON document according to specification, but not all parsers are compliant.
There's some interesting examples of ambiguities here: https://seriot.ch/projects/parsing_json.html
Trailing commas, comments, duplicate key names, for a few examples.
Trailing commas and comments are plainly not standard JSON under any definition. There are standards that include them which extend JSON, sure, but I'm not aware of any JSON library that emits this kind of stuff by default.
I'm not aware of any CSV library that doesn't follow RFC4180 by default, and yet... this whole thread.
> It's just that people tend to use specialized tools for encoding and decoding it instead of like ",".join(row) and row.split(",")
You really super can't just split on commas for csv. You need to handle the string encodings since records can have commas occur in a string, and you need to handle quoting since you need to know when a string ends and that string may have internal quote characters. For either format unless you know your data super well you need to use a library.
Right, obviously.
Yes but people don't
Until you have a large amount of data & need either random access or to work on multiple full columns at once. Duplicated keys names mean it's very easy for data in jsonlines format to be orders of magnitude larger than the same data as CSV, which is incredibly annoying if your processing for it isn't amenable to streaming.
> JSON [...] with one line per record
Couple of standards that I know of that does this, primarily intended for logging:
https://jsonlines.org/
https://clef-json.org/
Really easy to work with in my experience.
Sure some space is usually wasted on keys but compression takes care of that.
You serialize the keys on every row which is a bit inefficient but it’s a text format anyway
Space-wise, as long as you compress it, it's not going to make any difference. I suspect a JSON parser is a bit slower than a CSV parser, but the slight extra CPU usage is probably worth the benefits that come with JSON.
This is simply not true, parsing json v csv is a difference of thousands of lines.
Eh, it really isn't. The format does not lend itself to tabular data, instead the most natural way of representing data involves duplicating the keys N times for each record.
You can easily represent it as an array:
That’s as tabular as CSV but you now have optional types. You can even have lists of lists. Lists of objects. Lists of lists of objects…Right - the JSON-newline equivalent of CSV can look like this:
Remove the [] characters and you've invented CSV with excel style quoting.
Almost, except the way Excel-style quoting works with newlines sucks - you end up with rows that span multiple lines, so you can't split on newline to get individual rows.
With JSON those new lines are \n characters which are much easier to work with.
I ended up parsing the XML format instead of the CSV format when handling paste from Excel due to the newlines issue.
CSV seemed so simple but after numerous issues, a cell with both newline and " made me realize I should keep the little hair I had left and put in the work to parse the XML.
It's not great either, with all its weird tags, but at least it's possible to parse reliably.
This is the way. jsonl where each row is a json list. It has well-defined standard quoting.
Just like csv you don't actually need the header row either, as long as there's convention about field ordering. Similar to proto bufs, where the field names are not included in the file itself.
This misses the point of standardization imo because it’s not possible to know a priori that the first line represents the variable names, that all the rows are supposed to have the same number of elements and in general that this is supposed to represent a table. An arbitrary parser or person wouldn’t know to guess since it's not standard or expected. Of course it would be parsed fine but the default result would be a kind of structure or multi-array rather than tabular.
application/jsonl+table
Typing isn't optional in JSON, every value has a concrete type, always.
Types at the type layer are not the same as types at the semantic layer. Sure every type in the JSON level has a "strong type" but the semantic meaning of the contents of e.g. a string are usually not expressable in pure JSON. So it is with CSV; you can think of every cell in CSV as containing a string (series of bytes) with it being up to you to enforce the semantics atop those bytes. JSON gives you a couple extra types, and if you can fit things into those types well, then that's great, but for most data concrete semantically meaningful data you won't be able to do that and you'll end up in a similar world to CSVs.
...and?
I see an array of arrays. The first and second arrays have two strings each, the last one has a float and a string. All those types are concrete.
Let's say those "1.1" and 7.4 values are supposed to be version strings. If your code is only sometimes putting quotes around the version string, the bug is in your code. You're outputting a float sometimes, but a string in others. Fix your shit. It's not your serialization format that's the problem.
If you have "7.4" as a string, and your serialization library is saying "Huh, that looks like a float, I'm going to make it a float", then get a new library, because it has a bug.
You're missing my point: basically nothing spits out data in that format because it's not ergonomic to do so. JSON is designed to represent object hierarchies, not tabular data.
CSV is lists of lists of fixed length.
JSON is lists of lists of any length and groups of key/value pairs (basically lisp S-expressions with lots of unnecessary syntax). This makes it a superset of CSV's capabilities.
JSON fundamentally IS made to represent tabular data, but it's made to represent key-value groups too.
Why make it able to represent tabular data if that's not an intended use?
> JSON is lists of lists of any length and groups of key/value pairs
The "top-level" structure of JSON is usually an object, but it can be a list.
> JSON fundamentally IS made to represent tabular data
No, it's really not. It's made to represent objects consisting of a few primitive types and exactly two aggregate types: lists and objects. It's a textual representation of the JavaScript data model and even has "Object" in the name.
> Why make it able to represent tabular data if that's not an intended use?
It's mostly a question of specialization and ergonomics, which was my original point. You can represent tabular data using JSON (as you can in JavaScript), but it was not made for it. Anything that can represent """data""" and at least 2 nesting levels of arbitrary-length sequences can represent tabular data, which is basically every data format ever regardless of how awkward actually working with it may be.
The fact that json can represent a superset of tabular data structures that csv is specifically designed to represent can be rephrased into that csv is more specialised than json in representing tabular data. The fact that json can also represent tabular data does not mean it is a better or more efficient way to represent that data instead of a format like csv.
In the same way, there are hierarchically structured datasets that can be represented by both json in hierarchical form and csv in tabular form by repeating certain variables, but if using csv would require repeating them too many times, it would be a bad idea to choose that instead of json. The fact that you can do sth does not always make it a good idea to do it. The question imo is about which way would be more natural, easy or efficient.
> The fact that json can represent a superset of tabular data structures that csv is specifically designed to represent can be rephrased into that csv is more specialised than json in representing tabular data. The fact that json can also represent tabular data does not mean it is a better or more efficient way to represent that data instead of a format like csv.
The reverse is true as well: being more specialized is a description of goals, not advantages.
It's hardly a bad idea to do a list of lists in JSON...
The big advantage of JSON is that it's standardized and you can reuse the JSON infrastructure for more than just tabular data.
> CSV is lists of lists of fixed length.
I'd definitely put that in my list of falsehoods programmers believe about CSV files.
It seems to be indicated by RCF-4180 which says
> This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file
But of course, CSV is the wild west and there's no guarantee that any two encoders will do the same thing (sometimes, there's not even a guarantee that the same encoder will do the same thing with two different inputs).
[0] https://www.ietf.org/rfc/rfc4180.txt
You should know that "should" isn't very binding.
Headers should have as many rows as possible that contain data items for their column and data items in a row should have a header for the respective columns, but real CSV files should be assumed to have incomplete or variable length lines.
NOTHING is very binding about the CSV spec and that's the biggest problem with CSV.
CSV is a text file that might have commas in it
yeah if I had a cookie for every time I've had to deal with this I'd have maybe 10 cookies - it's not a lot but it's more than it should be.
A format consisting of newline-terminated records, each containing comma-separated JSON strings would be superior to CSV.
It could use backslash escapes to denote control characters and Unicode points.
Everyone would agree exactly on what the format is, in contrast to the zoo of CSV variants.
It wouldn't have pitfalls in it, like spaces that defeat quotes
oops; add an innocuous-looking space, and the quotes are now literal."Any JSON primitive" does add a few requirements not semantically comparable to CSV, like numbers that are numbers, and keywords true, false, none.
When these syntaxes are parsed into objects, either the type info has to be retaind, or some kind of attribute tag, so they can be output back to the same form.
> make it so any consumer can parse it by splitting on newline and then ...
There is something like that called JSON-lines. It has a .org domain 'n' everything:
https://jsonlines.org/
JSON was designed to represent any data. There's plenty of systems that spit out data in exact that format because it's the natural way to represent tabular data using JSON serialization. And clearly if you're the one building the system you can choose to use it.
JSON is designed to represent JavaScript objects with literal notation. Guess what, an array of strings or an array of numbers or even an array of mixed strings and numbers is a commonly encountered format in JavaScript.
What happens when you need to encode the newline character in your data? That makes splitting _either_ CSV or LDJSON files difficult.
The new line character in a JSON string would always be \n. The new line in the record itself as whitespace would not be acceptable as that breaks the one line record contract.
Remember that this does not allow arbitrary representation of serialized JSON data. But it allows for any and all JSON data as you can always roundtrip valid JSON to a compact one line representation without extra whitespace.
Actually even whitespace-separated json would be a valid format and if you forbid json documents to be a single integer or float then even just concatenating json gives a valid format as JSON is a prefix free language.
That is[0] if a string s is a valid JSON then there is no substring s[0..i] for i < n that is a valid json.
So you could just consume as many bytes you need to produce a json and then start a new one when that one is complete. To handle malformed data you just need to throw out the partial data on syntax error and start from the following byte (and likely throw away data a few more times if the error was in the middle of a document)
That is [][]""[][]""[] is unambiguos to parse[1]
[0] again assuming that we restrict ourselves to string, null, boolean, array and objects at the root
[1] still this is not a good format as a single missing " can destroy the entire document.
« a single missing " can destroy the entire document » This is basically true for any data format, so really worse argument ever...
In jsonl a modified chunk will lose you at most the removed lines and the two adjacent ones (unless the noise is randomly valid json), in particular a single byte edit can destry at most 2 lines.
utf-8 is also similarly self-correcting and so is html and many media formats.
My point was that in my made-up concatenated json format
[]"""[][][][][][][][][][][]"""[]
and
[]""[][][][][][][][][][][]""[]
are both valid but have differ only for 2 bytes but have entirely different structures.
Also it is a made-up format nobody uses (if somebody were to want this they would likely disallow strings at the root level).
When you need to encode the newline character in your data, you say \n in the JSON. Unlike (the RFC dialect of) CSV, JSON has an escape sequence denoting a newline and in fact requires its use. The only reason to introduce newlines into JSON data is prettyprinting.
It's tricky, but simple enough, RFC states that " must be used, inserting a " is done with "". This makes knowing what a record is difficult, since you must keep a variable that keeps the entire string.
How do you do this simply? you read each line, and if there's an uneven number of ", then you have an incomplete record and you will keep all lines until there is an odd number of ". after having the string, parsing the fields correctly is harder but you can do it in regex or PEGs or a disgusting state machine.
That would be solved by using the ASCII control chars Record Separator / Unit Separator! I don't get how this is not widely used as standard.
I remembered seeing a comment like this before, and...
comment: https://news.ycombinator.com/item?id=26305052
comment: https://news.ycombinator.com/item?id=39679662
"ASCII Delimited Text – Not CSV or Tab Delimited Text" post [2014]: https://news.ycombinator.com/item?id=7474600
same post [2024]: https://news.ycombinator.com/item?id=42100499
comment: https://news.ycombinator.com/item?id=15440801
(...and many more.) "This comes up every single time someone mentions CSV. Without fail." - top reply from burntsushi in that last link, and it remains as true today as in 2017 :D
You're not wrong though, we just need some major text editor to get the ball rolling and start making some attempts to understand these characters, and the rest will follow suit. We're kinda stuck at a local optimum which is clearly not ideal but also not troublesome enough to easily drum up wide support for ADSV (ASCII Delimiter Separated Values).
>we just need some major text editor to get the ball rolling and start making some attempts to understand these characters
Many text editors offer extensions APIs, including Vim, Emacs, Notepad++. But the ideal behavior would be to auto-align record separators and treat unit separators as a special kind of newline. That would allow the file to actually look like a table within the text editor. Input record separator as shift+space and unit separator as shift+enter.
I think it would be enough for:
1. the field separator to be shown as a special character
2. the row separator to be (optionally) be interpreted as a linefeed
IIRC 1) is true for Notepad++, but not 2).
Excel needs to default its export to this. Unfortunately excel is proprietary software and therefore fucked.
Hahah, I came here to make the comment about ASCII's control characters, so I'm glad someone else beat me to it, and also that someone further pointed out that this topic comes up every time someone mentions CSV!
The _entire_ point of a CSV file is that it's fully human readable and write-able.
The characters you mention could be used in a custom delimiter variant of the format, but at that point it's back to a binary machine format.
And that’s why I tend to use tab delimited files more… when viewed with invisible characters shown, it’s pretty clear to read/write separate fields and have an easier to parse format.
This, of course, assumes that your input doesn’t include tabs or newlines… because then you’re still stuck with the same problem, just with a different delimiter.
As soon as you give those characters magic meanings then suddenly people will have reason to want to use them— it'll be a CSV containing localization strings for tooltips that contain that character and bam, we'll be back to escaping.
Except the usages of that character will be rare and so potentially way more scary. At least with quotes and commas, the breakages are everywhere so you confront them sooner rather than later.
Graphical representations of the control characters begin at U+2400 in the "Control Pictures" Unicode block. Instead of the actual U+001E Record Separator, you put the U+241E Symbol for Record Separator in the help text.
.... with a note underneath urging readers not to copy and paste the character because it's only the graphical representation of it, not the thing itself.
Perhaps a more salient example might be CSV nested in CSV. This happens all the time with XML (hello junit) and even JSON— when you plug a USB drive into my LG TV, it creates a metadata file on it that contains {"INFO":"{ \"thing\": true, <etc> }"}
You wouldn't use this format for that. It's not a universal format, but an application-specific one. It works in some applications, not in others.
Best to add deliberate breakage to the spec, then.
What do you mean? I just push the Record Separator key on my keyboard.
/s in case :)
The entire argument against ASCII Delimited Text boils down to "No one bothered to support it in popular editors back in 1984. Because I grew up without it, it is impossible to imagine supporting it today."
You need 4 new keyboard shortcuts. Use ctrl+, ctrl+. ctrl+[ ctrl+] You need 4 new character symbols. You need a bit of new formatting rules. Pretty much page breaks decorated with the new symbols. It's really not that hard.
But, like many problems in tech, the popular advice is "Everyone recognizes the problem and the solution. But, the problematic way is already widely used and the solution is not. Therefore everyone doing anything new should invest in continuing to support the problem forever."
> The entire argument against ASCII Delimited Text boils down to "No one bothered to support it in popular editors back in 1984. Because I grew up without it, it is impossible to imagine supporting it today."
There's also the argument of "Now you have two byte values that cannot be allowed to appear in a record under any circumstances. (E.g., incoming data from uncontrolled sources MUST be sanitized to reject or replace those bytes.)" Unless you add an escaping mechanism, in which case the argument shifts to "Why switch from CSV/TSV if the alternative still needs an escaping mechanism?"
One benefit of binary formats is not needing the escaping.
Length-delimited binary formats do not need escaping. But the usual "ASCII Delimited Text" proposal just uses two unprintable bytes as record and line separators, and the signalling is all in-band.
This means that records must not contain either of those two bytes, or else the format of the table will be corrupted. And unless you're producing the data yourself, this means you have to sanitize the data before adding it, and have a policy for how to respond to invalid data. But maintaining a proper sanitization layer has historically been finicky: just look at all the XSS vulnerabilities out there.
If you're creating a binary format, you can easily design it to hold arbitrary data without escaping. But just taking a text format and swapping out the delimiters does not achieve this goal.
I did mean length-delimited binary formats (rather than ASCII formats).
At least you don't need these values in your data, unlike the comma, which shows up in human-written text.
If you do need these values in your data, then don't use them as delimiters.
Something the industry has stopped doing, but maybe should do again, is restricting characters that can appear in data. "The first name must not contain a record separator" is a quite reasonable restriction. Even Elon Musk's next kid won't be able to violate that restriction.
Hear hear! Why is all editing done with text-based editors where humans can make syntax errors. Is it about job security?
In Windows (and DOS EDIT.COM and a few other similarly ancient tools) there have existed Alt+028, Alt+029, Alt+030, and Alt+031 for a long time. I vaguely recall some file format I was working with in QBASIC used some or all of them and I was editing those files for some reason. That was not quite as far back as 1984, but sometime in the early 1990s for sure. I believe EDIT.COM had basic glyphs for them too, but I don't recall what they were, might have been random Wingdings like the playing card suits.
Having keyboard shortcuts doesn't necessarily solve why people don't want to use that format, either.
> I believe EDIT.COM had basic glyphs for them too, but I don't recall what they were, might have been random Wingdings like the playing card suits.
That is not specific to EDIT.COM; they are the PC characters with the same codes as the corresponding control characters, so they appear as graphic characters. (They can be used in any program that can use PC character set.)
However, in EDIT.COM and QBASIC you can also prefix a control character with CTRL+P in order to enter it directly into the file (and they appear as graphic characters, since I think the only control characters they will handle as control characters are tabs and line breaks).
Suits are PC characters 3 to 6; these are PC characters 28 to 31 which are other shapes.
The keys would be something other than those, though. They would be: CTRL+\ for file separator, CTRL+] for group separator, CTRL+^ for record separator, CTRL+_ for unit separator. Other than that, it would work like you described, I think.
> But, like many problems in tech, the popular advice is "Everyone recognizes the problem and the solution. But, the problematic way is already widely used and the solution is not
This is unfortunately common. However, what else happens too, is disagreement about what is the problem and the solution.
It's more like, "because the industry grew up without it, other approaches gained critical mass."
Path dependence is a thing. Things that experience network effects don't get changed unless the alternative is far superior, and ASCII Delimited Text is not that superior.
Ignoring that and pushing for it anyway will at most achieve an xkcd 927.
I’m pretty sure those used to exist.
But when looking for a picture to back up my (likely flawed) memory, Google helpfully told me that you can get a record separator character by hitting Ctrl-^ (caret). Who knew?
Can't there be some magic sequence (like two unescaped newlines) to start a new record?
You’ll still find that sequence in data; it’ll just be rare enough that it won’t rear its ugly head until your solution has been in production for a while.
If there were visible well known characters that could be printed for those and keys on a keyboard for inputting them we would probably have RSV files. Because they are buried down in the nonprintable section of the ASCII chart they are a pain for people to deal with. All it would have taken is one more key on the keyboard, maybe splitting the tab key in half.
> If there were visible well known characters that could be printed...
...There would be datasets that include those characters, and so they wouldn't be as useful for record separators. Look into your heart and know it to be true.
I wouldn't feel too bad about blindly scrubbing those characters out of inputs unlike commas, tabs, and quotes.
Can't type them on a keyboard I guess, or generally work with them in the usual text-oriented tools? Part of the appeal of CSV is you can just open it up in Notepad or something if you need to. Maybe that's more a critique of text tools than it is of ASCII record separator characters.
I was really excited when I learned of these characters, but ultimately if it’s in ASCII then it’s in-band and will eventually require escaping leading to the same problem.
But what if one of your columns contains arbitrary binary data?
You'd likely need to uuencode it or similar as CSV isn't designed for binary data.
Perhaps the popularity,or lack thereof? More often than not, the bad standard wins the long term market
> I don't get how this is not widely used as standard.
It requires bespoke tools for edition, and while CSV is absolute garbage it can be ingested and produced by most spreadsheet software, as well as databases.
Reminds me of a fatal flaw of yaml. Turns out truncating a yaml file doesn't make it invalid. Which can lead to some rather non-obvious failures.
What is the failure mode where a yaml file gets truncated? They are normally config files in Git. Or uploaded to S3 or Kubernetes etc.
CSV has the same failure mode. As does HTML. (But not XML)
I couldn't find the story on it, but there was an instance of a config for some major service getting truncated, but since it was yaml it was more difficult to figure out that that was what happened. I think in AWS, but I can't find the story, so can't really remember.
And fully fair that you can have similar issues in other formats. I think the complaint here was that it was a bit harder, specifically because it did not trip up any of the loading code. With a big lesson learned that configs should probably either go pascal string style, where they have an expected number of items as the first part of the data, or xml style, where they have a closing tag.
Really, it is always amusing to find how many of the annoying parts of XML turned out to be somewhat more well thought out than people want to admit.
Bad merges.
Same is true of CSV/TSV.
I think you are a bit more likely to notice in a CSV/TSV, as it is unlikely to truncate at a newline?
Still, fair point. And is part of why I said it is a flaw, not the flaw. Plenty of other reasons to not like YAML, to me. :D
Not if it is split at a line e.g. if the source or target can only deal with a fixed number of lines.
Right, that is what I meant about that being unlikely? Most instances of truncated files that I have seen were because of size, not lines.
Still, a fair point.
Really depends on how the CSV is generated/transferred. If the output of the faulting software is line-buffered then it's quite likely that a failure would terminate the file at a line break.
I don't understand why CSV became a thing when TSV, or a format using the nowadays weird ASCII control characters like start/end of text, start of heading, horizontal/vertical tab, file/group/record/unit separator.
It seems many possible designs would've avoided the quoting chaos and made parsing sort of trivial.
Any time you have a character with a special meaning you have to handle that character turning up in the data you're encoding. It's inevitable. No matter what obscure character you choose, you'll have to deal with it
It's evitable by stating the number of bytes in a field and then the field. No escaping needed and faster parsing.
But not human editable/readable
I understand this argument in general. But basically everyone has some sort of spreatsheet application that can read CSV installed.
In some alternate worked where this "binary" format caught on it would be a very minor issue that it isn't human readable because everyone has a tool that is better at reading it than humans are. (See the above mentioned non-local property of quotes where you may think you are reading rows but are actually inside a single cell.)
Makes me also wonder if something like CBOR caught on early enough we would just be used to using something like `jq` to read it.
https://github.com/wader/fq is "jq for binary formats."
Except we have all these low ASCII characters specifically for this purpose that don't turn up in the data at all. But there is, of course, also an escape character specifically for escaping them if necessary.
Even if you find a character that really is never in the data - your encoded data will contain it. And it's inevitable that someone encodes the encoded data again. Like putting CSV in a CSV value.
You can't type any of those on a typewriter, or see them in old or simple simple editors, or no editor like just catting to a tty.
If you say those are contrived examples that don't matter any more then you have missed the point and will probably never acknowledge the point and there is no purpose in continuing to try to communicate.
One can only ever remember and type out just so many examples, and one can always contrive some response to any single or finite number of examples, but they are actually infinite, open-ended.
Having a least common denominator that is extremely low that works in all the infinite situations you never even thought of, vs just pretty low and pretty easy to meet in most common situations, is all the difference in the world.
The difference is that the coma and newline characters are much more common in text than 0x1F and 0x1E, which if you restrict your data to alphanumeric characters (which you really should) will never appear anywhere else.
Exactly. "Use a delimiter that's not in the data" is not real serialisation, it's fingers-crossed-hope-for-the-best stuff.
I have in the past does data extractions from systems which really can't serialise properly, where the only option is to concat all the fields with some "unlikely" string like @#~!$ as a separator, then pick it apart later. Ugh.
> Exactly. "Use a delimiter that's not in the data" is not real serialisation, it's fingers-crossed-hope-for-the-best stuff.
It's not doing just this, you pick something that's likely not in the data, and then escape things properly. When writing strings you can write a double quote within double quotes with \", and if you mean to type the designated escape character you just write it twice, \\.
The only reason you go for something likely not in the data is to keep things short and readable, but it's not impossible to deal with.
I agree that it's best to pick "unlikely" delimiters so that you don't have to pepper your data with escape chars.
But some people (plenty in this thread) really do think "pick a delimiter that won't be in the data" - and then forget quoting and/or escaping - is a viable solution.
The characters would likely be unique, maybe even by the spec.
Even if you wanted them, we use backslashes to escape strings in most common programming languages just fine, the problem CSV is that commas aren't easy to recognize because they might be within a single or double quote string, or might just be a separator.
Can strings in CSV have newlines? I bet parsers disagree since there's no spec really.
In TSV as commonly implemented (for example, the default output format of Postgres and MySQL), tab and newline are escaped, not quoted. This makes processing the data much easier. For example, you can skip to a certain record or field just by skipping literal newlines or tabs.
It's a lot easier to type comma than control characters and it's a lot easier to view comma than a tab (which might look like a space).
For automated serialization, plain text formats won out, because they are easy to implement a minimal working solution (both import and export) and more importantly, almost all systems agree on what plain text is.
We don't really have Apple formatted text, that will show up as binary on windows. Especially if you are just transferring id's and numbers, those will fall within ascii and that will work even if you are expecting unicode.
The ASCII control characters do not appear well or are editable in a plain text editor.
I did always use TSV and I think the original use of CSV could have used that.
But TSV would still have many issues.
What issues would TSV have? As commonly implemented (for example, the default output format of Postgres and MySQL), in TSV, tab and newline are escaped, not quoted.
I don't understand why CSV became a thing in the 70s when S-expressions existed since at least the 50s and are better in practically every way.
I want to push Sqlite as a data interchange format! it has the benefit of being well defined, and can store binary data, like images for product pictures inside the database. not a good idea if you're trying to serve users behind a web app, but as interchange, better than a zip file with filenames that have to be "relinked".
For context: I have a LOT of experience of interchange formats, like "full time job, every day, all day, hundreds of formats, for 20-years" experience.
Based on that experience I have come to one key, but maybe, counter-intuitive truth about interchange formats:
- Too much freedom is bad.
Why? Generating interchange data is cheaper than consuming it, because the creator only needs to consider the stuff they want to include, whereas the consumer needs to consider every single possible edge case and or scenario the format itself can support.
This is why XML is WAY more costly to ingest than CSV, because in XML someone is going to use: attributes, CDATA, namespaces, comments, different declaration, includes, et al. In CVS they're going to use rows, a format separator, and quotes (with or without escaping). That's it. That's all it supports.
Sqlite as an interchange format is a HORRIFYING suggestion, because every single feature Sqlite supports may need to be supported by consumers. Even if you curtailed Sqlite's vast feature set, you've still created something vastly more expensive to consume than XML, which itself is obnoxious.
My favorite interchange formats are, in order:
- CVS, JSON (inc. NDJSON), YAML, XML, BSON (due to type system), MessagePack, Protobuf, [Giant Gap] Sqlite, Excel (xlsx, et al)
More features mean more cost, more edge cases, more failures, more complex consumers. Keep in mind, this is ONLY about interchange formats between two parties, I have wildly different opinions about what I would use for my own application where I am only ever the creator/consumer, I actually love Sqlite for THAT.
Oh, this is interesting. Are you tying different systems together? If so, do you use some preferred intermediate format? Do you have a giant library of * -> intermediate -> * converters that you sprinkle between everything? Or maybe the intermediate format is in memory?
What about Parquet and the like?
Not the person you were replying to, but from my experience CSV is a good place to define data types coming into a system and a safe way to dump data as long as I write everything down.
So I might do things like have every step in a pipeline begin development by reading from and writing to CSV. This helps with parallel dev work and debugging, and is easy to load into any intermediate format.
> do you use some preferred intermediate format?
This is usually dictated by speed vs money calculations, weird context issues, and familiarity. I think it's useful to look at both "why isn't this a file" and "why isn't this all in memory" perspectives.
For tabular/time-series greater than 100k rows I personally feel like parquet cannot be beat. It's self-describing, strongly-typed, relatively compact, supports a bunch of io/decode skipping, and is quite fast.
Orc also looks good, but isn't well supported. I think parquet is optimal for now for most analytical use-cases that don't require human readability.
It is an interchange format, so it is inter-system by virtue of that. If I am a self-creator/consumer the format I use can be literally anything even binary memory dumps.
I love the wisdom in this comment!
I’m not sure you need to support every SQLite feature. I’m unconvinced of binary formats, but the .dump output is text and simple SQL.
Interesting! I've dealt with file interchange between closed source (and a couple open source) programs, but that was a while ago. I've also had to deal with csvs and xslts between SaaS vendors for import export of customer's data. I've done a bunch of reverse engineering of proprietary formats so we could import the vendor's files, which had more information than they were willing to export in an interchange format. Sometimes they're encrypted and you have to break it.
What you say is fair. Csv is underspecified though, there's no company called csv that's gonna sue for trademark enforcement, there's no official csv standard library that everyone uses. (They exist are some but there are so many naive importations because from first principles, because how hard could it be? output records and use a comma and newline (of which there are three possible options)).
How often do you deal with multiple Csv files to represent multiple tables that are actually what's used by vendors internally, vs one giant flattened Csv with hundreds of columns and lots of empty cells? I don't have your level of experience with csvs, but I've dealt with a them being a mess, where the other side implement whatever they think is reasonable given the name "comma separated values".
With sqlite, we're in the Internet age and so I presume this hypothetical developer would use the sqlite library and not implement their own library from scratch for funsies. This then leads to types, database normalization, multiple tables. I hear you that too many choices can bad, and xml is a great example of this, but sqlite isn't xml and isn't Csv.
It's hard to have this discussion in the abstract so I'll be forthcoming about where I'm coming from, which is Csv import export between vendors for stores, think like Doordash to UberEATS. the biggest problem we have is images of the items, and how to deal with that. It's an ongoing issue how to get them, but the failure mode, which does happen, is that when moving vendor, they just have to redo a lot of work that they shouldn't have to.
Ultimately the North Star I want to push towards is moving beyond csvs, because it'll let a people who currently have to hand edit the Csv so every row imports properly, not have to do that. They'd still exist, but instead have to deal with, well, what you see with XML files. which has its shortcomings, as you mention, but at least once how a vendor is using it is understood, individual records are generally understandable.
I was moved so I don't deal with import export currently, but it's because sqlite is so nice to work with on personal projects where it's appropriate that I want to push the notion of moving to sqlite over csvs.
One very minor problem is that you max out storing blobs of 2GB(? I think, maybe 4GB). Granted few will hit this, but this limit did kill one previous data transfer idea of mine.
> not a good idea if you're trying to serve users behind a web app
I use Sqlite for a static site! Generating those static pages out to individual pages would involve millions of individual files. So instead I serve up a sqlite database over http, and use a sqlite wasm driver [0] to load (database) pages as needed. Good indexing cuts down on the number of pages it grabs, and I can even get full text search!
Only feature I'm missing is compression, which is complicated because for popular extensions like sqlite-zstd written in Rust.
[0] https://github.com/mmomtchev/sqlite-wasm-http
CSV's actual problem is that there's no single CSV, and you don't know what type you have (or even if you have single consistent type through the whole file) without trying to parse the whole file and seeing what breaks. Is there quoting? Is that quoting used consistently? Do you have five-digit ZIP codes, or have the East Coast ones been truncated to four digits because they began with zero? Spin the wheel!
https://www.ietf.org/rfc/rfc4180.txt
> So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
For my own parser, I made everything `\` escaped: outside of a quote or double-quote delimited string, any character prefixed with a `\` is read verbatim. There are no special exceptions resulting in `\,` producing a comma while `\a` produces `\a`. This makes it a good rule, because it is only one rule with no exceptions.
I considered this but then went the other way - a \ before anything other than a \, newline or comma is treated as an error. This leaves room for adding features, e.g. \N to signify a SQL NULL.
Regarding quoting and escaping, there are two options that make sense to me - either use quoting, in which case quotes are self-escaped and that's that; or use escaping, in which case quotes aren't necessary at all.
A good way to parallelize CSV processing is to split datasets into multiple files, kinda like manual sharding. xan has a parallel command able to perform a wide variety of map-reduce tasks on splitted files.
https://github.com/medialab/xan
nice .. xsv is also very handy for wrangling csv files generally
xan is a maintained fork of xsv
I always treat CSVs as comma separated values with new line delimiters. If it’s a new line, it’s a new row.
Do you ever have CSV data that has newlines within a string?
I don't. If I ever have a dataset that requires newlines in a string, I use another method to store it.
I don't know why so many people think every solution needs to to be a perfect fit for every problem in order to be viable. CSV is good at certain things, so use it for those things! And for anything it's not good at, use something else!
> use something else
You don't always get to pick the format in which data is provided to you.
True, but in that case I'm not the one choosing how to store it, until I ingest the data, and then I will store it in whatever format makes sense to me.
I don't think we do? It's more that a bunch of companies already have their data in CSV format and aren't willing to invest any effort in moving to a new format. Doesn't matter how much one extolls all the benefits, they know right? They're paying someone else to deal with it.
No - that’s what I’m trying to say. If I have newlines I use something else.
Wouldn't work if csv is used as a exchange format with external companies.
Of course. I’m not saying I roll my own parser for every project that uses a CSV file, I’m just describing my criteria for using CSV vs some other format when I have the option.
Eh, all you're really saying is "I'm not using CSV. Instead I'm using my CSV." Except that's all that anybody does.
CSV can just as easily support escaping as any other format, but there is no agreement for a CSV format.
After all, a missed escape can just as easily destroy a JSON or XML structure. And parallel processing of text is already a little sketchy simply because UTF-8 exists.
How is this not true for every format that includes quote marks?
It is true for everything that uses quoting, I didn't mean to imply otherwise.
I'm not clear why quotes prevent parallel processing?
I mean, you don't usually parallelize reading a file in the first place, only processing what you've already read and parsed. So read each record in one process and then add it to a multiprocessing queue for multiple processes to handle.
And data corruption is data corruption. If a movie I'm watching has a corrupted bit I don't mind a visual glitch and I want it to keep playing. But with a CSV I want to fix the problem, not ignore a record.
Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!
> I'm not clear why quotes prevent parallel processing?
Because of the "non-local" effect of quotes, you can't just jump into the middle of a file and start reading it, because you can't tell whether you're inside a quoted section or not. If (big if) you know something about the structure of the data, you might be able to guess. So that's why I said "tricky" instead of "impossible".
Contrast to my escaping-only strategy, where you can jump into the middle of a file and fully understand your context by looking one char on either side.
> Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!
I used to be a data analyst at a management consultancy. A very common scenario would be that I'm handed a multi-gigabyte CSV and told to "import the data". No spec, no schema, no nothing. Data loss or corruption is totally unacceptable, because we were highly risk-sensitive. So step 1 is to go through the whole thing trying to determine field types by testing them. Does column 3 always parse as a timestamp? Great, we'll call it a timestamp. That kind of thing. In that case, it's great to be able to parallelise reading.
> And data corruption is data corruption
Agreed, but I prefer data corruption which messes up one field, not data corruption which makes my importer sit there for 5 minutes thinking the whole file is a 10GB string value and then throw "EOF in quoted field".
Doing sequential reading into a queue for workers to read is a lot more complicated than having a file format that supports parallel reading.
And the fix to allow parallel reading is pretty trivial: escape new lines so that you can just keep reading until the first unescaped new line and start at that record.
It is particularly helpful if you are distributing work across machines, but even in the single machine case, it's simpler to tell a bunch of workers their offset/limit in a file.
The practical solution is to generate several CSV files and distribute work at the granularity of files
Sure, now you need to do this statically ahead of time.
It's not unsolvable, but now you have a more complicated system.
A better file format would not have this problem.
The fix is also trivial (escape new lines into \n or similar) would also make the files easier to view with a text editor.
But in practice, you’ll receive a bag of similar-format CSVs.
Tab-Separated Value, as implemented by many databases, solves these problems, because tab, newline and other control characters are escaped. For example, the default text serialization format of Postgres (`COPY <table> TO '<file>'` without any options) is this way.
Anyone with a love of CSV hasn't been asked to deal with CSV-injection prevention in an enterprise setting, without breaking various customer data formats.
There's a dearth of good resources about this around the web, this is the best I've come across: https://georgemauer.net/2017/10/07/csv-injection.html
That mostly breaks down to "excel is intentionally stupid with csv files if you don't use the import function to open them" along with the normal "don't trust customer input without stripping or escaping it" concerns you'd have with any input.
That was my initial reaction as well – it's a vulnerability in MS software, not ours, not our problem. Unfortunately, reality quickly came to bear: our customers and employees ubiquitously use excel and other similar spreadsheet software, which exposes us and them to risk regardless where the issue lies. We're inherently vulnerable because of the environment we're operating in, by using CSV.
"don't trust customer input without stripping or escaping it" feels obvious, but I don't think it stands up to scrutiny. What exactly do you strip or escape when you're trying to prevent an unknown multitude of legacy spreadsheet clients that you don't control from mishandling data in an unknown variety of ways? How do you know you're not disrupting downstream customer data flows with your escaping? The core issue, as I understand it, stems from possible unintended formula execution – which can be prevented by prefixing certain cells with a space or some invisible character (mentioned in the linked post above). This _does_ modify customer data, but hopefully in a way that unobtrusive enough to be acceptable. All in all, it seems to be a problem without a perfect solution.
Hey, I'm the author of the linked article, cool to see this is still getting passed around.
Definitely agree there's no perfect solution. There's some escaping that seems to work ok, but that's going to break CSV-imports.
An imperfect solutions is that applications should be designed with task-driven UIs so that they know the intended purpose of a CSV export and can make the decision to escape/not escape then. Libraries can help drive this by designing their interfaces in a similar manner. Something like `export_csv_for_eventual_import()`, `export_csv_for_spreadsheet_viewing()`.
Another imperfect solution would be to ... ugh...generate exports in Excel format rather than CSV. I know, I know, but it does solve the problem.
Or we could just get everyone in the world to switch to emacs csv-mode as a csv viewer. I'm down with that as well.
Appreciate your work! Your piece was pivotal in changing my mind about whether this should be considered in our purview to address.
The intention-based philosophy of all this makes a lot of sense, was eye opening, and I agree it should be the first approach. Unfortunately after considering our use cases, we quickly realized that we'd have no way of knowing how customers intend to use the csv exports they've requested - we've talked to some of them and it's a mix. We could approach things case by case but we really just want a setup which works well 99% of the time and mitigates known risk. We settled on the prefixing approach and have yet to receive any complaints about it, specifically using a space character with the mind that something unobtrusive (eg. easily strippable) but also visible, would be best - to avoid quirks stemming from something completely hidden.
Thank again for your writing and thoughts, like I said above I haven't found much else of quality on the topic.
>Another imperfect solution would be to ... ugh...generate exports in Excel format rather than CSV. I know, I know, but it does solve the problem.
Or you could just use the ISO standard .xlsx, which is a widely supported format that is not Excel specific but has first class support in Excel.
>Another imperfect solution would be to ... ugh...generate exports in Excel format rather than CSV. I know, I know, but it does solve the problem.
That's honestly a good solution and the end users prefer it anyway.
I’ve almost always found the simple way around Excel users not knowing how to safely use CSV files is to just give the file another extension: I prefer .txt or .dat
Then, the user doesn’t have Excel has the default program for opening the file and has to jump through a couple safety hoops
>Then, the user doesn’t have Excel has the default program for opening the file and has to jump through a couple safety hoops
Nah, they just quickly learn to rename it .csv or .xls and excel will open it
If your customers and employees are using Excel then stop going against the grain with your niche software developer focused formats that need a lot of explanations.
I need to interface with a lot of non-technical people who exclusively use Excel. I give them .xlsx files. It's just as easy to export .xlsx as it is to export .CSV and my customers are happy.
How is .csv a niche dev-focused format? Our customers use our exports for a mix of purposes, some of them involving spreadsheet clients (not just excel) and some of them integrating with their own data pipelines. Csv conveniently works with these use cases across the board, without explanation, and is inconveniently saddled with these legacy security flaws in Excel (and probably other clients).
If xlsx works for all your use cases that's great, a much better solution that trying to sidestep these issues by lightly modifying the data. It's not an option for us, and (I'd imagine) a large contingent of export tools which can't make assumptions about downstream usage.
Someone filed a bug report on a project I work on, saying that it was a security vulnerability that we don't prefix cell values with a single quote (') when the cell content contains certain values like an equal sign (=). They said this can cause Excel to evaluate the content and potentially run unsafe code.
I responded that this was Excel's problem, not ours, and that nobody would assign a CVE to our product for such a "vulnerability". How naive I was! They forwarded me several such CVEs assigned to products that create CSVs that are "unsafe" for Excel.
Terrible precedent. Ridiculous security theater.
There are a lot of these sorts of bug reports running around, to the point that Google's bug bounty program has classified them as invalid: https://bughunters.google.com/learn/invalid-reports/google-p...
I agree with the characterization ("security theater") of these bug reports. The problem is that the intentions of these reports don't make the potential risk less real, depending on the setting, and I worry that the "You're just looking for attention" reaction (a very fair one!) leads to a concerning downplaying of this issue across the web.
As a library author, I agree this very well may not be something that needs to be addressed. But as someone working in a company responsible for customers, employees, and their sensitive information, disregarding this issue disregards the reality of the tools these people will invariably use, downstream of software we _are_ responsible for. Aiming to make this downstream activity as safe as possible seems like a worthy goal.
The next version of CVSS needs to add a metric for these kind of bullshit non-vulnerabilities so that we can ignore them at source.
I didn't know about the formula injection, I just knew that Excel and Sheets mangle my dates every time and it drives me bonkers. Why is that the default? It makes no sense.
I also love CSV for its simplicity. A key part of that love is that it comes from the perspective of me as a programmer.
Many of the criticisms of CSV I'm reading here boil down to something like: CSV has no authoritative standard, and everyone implements it differently, which makes it bad as a data interchange format.
I agree with those criticisms when I imagine them from the perspective of a user who is not also a programmer. If this user exports a CSV from one program, and then tries to load the CSV into a different program, but it fails, then what good is CSV to them?
But from the perspective of a programmer, CSV is great. If a client gives me data to load into some app I'm building for them, then I am very happy when it is in a CSV format, because I know I can quickly write a parser, not by reading some spec, but by looking at the actual CSV file.
Parsing CSV is quick and fun if you only care about parsing one specific file. And that's the key: It's so quick and fun, that it enables you to just parse anew each time you have to deal with some CSV file. It just doesn't take very long to look at the file, write a row-processing loop, and debug it against the file.
The beauty of CSV isn't that it's easy to write a General CSV Parser that parses every CSV file in the wild, but rather that its easy to write specific CSV parsers on the spot.
Going back to our non-programmer user's problem, and revisiting it as a programmer, the situation is now different. If I, a programmer, export a CSV file from one program, and it fails to import into some other program, then as long as I have an example of the CSV format the importing program wants, I can quickly write a translator program to convert between the formats.
There's something so appealing about to me about simple-to-parse-by-hand data formats. They are very empowering to a programmer.
Totally agree that its biggest strength is how approachable it is for quick, ad hoc tooling. Need to convert formats? Join two datasets? Normalize a weird export? CSV gives you just enough structure to work with and not so much that it gets in your way.
> I know I can quickly write a parser, not by reading some spec, but by looking at the actual CSV file
This is fine if you can hand-check all the data, or if you are okay if two offsetting errors happen to corrupt a portion of the data without affecting all of it.
Also I find it odd that you call it "easy" to write custom code to parse CSV files and translate between CSV formats. If somebody give you a JSON file that isn't valid JSON, you tell them it isn't valid, and they say "oh, sorry" and give you a new one. That's the standard for "easy." When there are many and diverse data formats that meet that standard, it seems perverse to use the word "easy" to talk about empirically discovering the quirks in various undocumented dialects and writing custom logic to accommodate them.
Like, I get that a farmer a couple hundred years ago would describe plowing a field with a horse as "easy," but given the emergence of alternatives, you wouldn't use the word in that context anymore.
> If somebody give you a JSON file that isn't valid JSON, you tell them it isn't valid, and they say "oh, sorry" and give you a new one. That's the standard for "easy."
But it isn't that reliably easy with JSON. Sometimes I have clients give me data that I just have to work with, as-is. Maybe it was invalid JSON spat out by some programmer or tool long ago. Maybe it's just from a different department than my contact, which might delay things for days before the bureaucracy gets me a (hopefully) valid JSON.
I consider CSV's level of "easy" more reliable.
And even valid JSON can be less easy. I've had experiences where writing the high-level parsing for some JSON file, in terms of a JSON library, was less easy and more time-consuming than writing a custom CSV parser.
Subjectively, I think programming a CSV parser from basic programming primitives is just more fun and appealing than programming in terms of a JSON library or XML library. And I find the CSV code is often simpler and quicker to write.
> When there are many and diverse data formats that meet that standard, it seems perverse to use the word "easy" to talk about empirically discovering the quirks in various undocumented dialects and writing custom logic to accommodate them.
But the premise of CSV is so simple, that there are only four quirks to empirically discover: cell delimiter, row delimiter, quote, escaped-quote.
I think it's "easy" to peek at the file and say, "Oh, they use semicolon cell delimiters."
And it's likewise "easy" to write the "custom logic", which is about as simple as parsing something directly from a text stream gets. I typically have to stop and think a minute about the quoting, but it's not that bad.
If a programmer is practiced at parsing from a text stream (a powerful, general skill that is worth exercising), than I think it is reasonable to think they might find parsing CSV by hand to be easier and quicker than parsing JSON (etc.) with a library.
The best part about csv, anyone can write a parser in 30 minutes meaning that I can take data from the early '90s and import it into a modern web service.
The worst part about CSV, anyone can ride a parser in about 30 minutes, meaning that it's very easy to get incorrect implementations, incorrect data, and other strange undefined behaviors. But to be clear json, and yaml also have issues with everyone trying to reinvent the wheel constantly. XML is rather ugly, but it seems to be the most resilient.
until you find someone abusing XSD schemas, or someone designing a "dynamically typed" XML... or sneaks in extra data in comments - happened to me way often than it should.
You know what grinds my gears about using XSD for message definitions? Namespaces. Namespaces are a good idea and were done well in XML, as far as I can see, but with XSD you run into this [problem][1]:
Namespaces are used to qualify tags and attributes in XML elements. But they're also used by XSD to qualify the names of types defined in the schema. A sequence element's type is indicated by the value of its "type" attribute. The attribute value is a string that is the namespace-qualified name of the type.
So, if you want to change the alias of an XML namespace in an XSD schema, you can't just use your XML library's facilities for namespace management. You also have to go find the "type" attributes (but not all of the "type" attributes), parse their values, and do the corresponding alias change in the type name.
Don't use a string for a thing that is not a string! I guess in XML attributes you have no choice. XAML improved on the situation a bit.
[1]: https://github.com/dgoffredo/stag/tree/master/src/stag/xsd-u...
My condolences. Any open standard runs the risk of this happening. It's not a problem I think we'll ever solve.
In principle, if you make your standard extensible enough, people should stop sneaking data into comments or strings.
... What makes the GP's problem so much more amusing. XML was the last place I'd expect to see it.
That's assuming they know how to use it properly.
Rest has this same issue.
I've seen this when trying to integrate with 3rd party apis.
Status Code 200 Body: Sorry bro, no data.
Even then, this is subject to debate. Should a 404 only be used when the endpoint doesn't exist ? When we have no data to return, etc.
i think that default REST is a bit problematic as it conflates transport protocol level errors with application logic ones.
At least unless you use application/problem+json or application/problem+xml MIME types but those are still just in draft stage
https://datatracker.ietf.org/doc/html/rfc9457#name-the-probl...
For some side projects, I would like use XML, but cannot read the spec, as that costs money.
So I cannot trust XML in depth, and depend on using a library that bought the spec and hopefully adheres to it.
I like CSV for the same reasons I like INI files. It's simple, text based, and there's no typing encoded in the format, it's just strings. You don't need a library.
They're not without their drawbacks, like no official standards etc, but they do their job well.
I will be bookmarking this like I have the ini critique of toml: https://github.com/madmurphy/libconfini/wiki/An-INI-critique...
I think the first line of the toml critique applies to CSV: it's a federation of dialects.
Similarly I had once loved the schemaless datastorages. They are so much simpler!
Until I worked quite a bit with them and realized that there's always schema in the data, otherwise it's just random noise. The question is who maintains the schema, you or a dbms.
Re. formats -- the usefulness comes from features (like format enforcing). E.g. you may skip .ini at all and just go with lines on text files, but somewhere you still need to convert those lines to your data, there's no way around it, the question is who's going to do that (and report sane error messages).
Schemaless can be accomplished with well-formed formats like json, xml, yaml, toml, etc. from the producer side these are roughly equivalent interfaces. There's zero upside to using CSVs except to comfort your customer. Or maybe you have centered importing of CSVs into your actual business, in which case you should probably not exist.
> It's simple
My experience has indicated the exact opposite. CSVs are the only "structured" format nobody can claim to parse 100% (ok probably not true thinking about html etc, just take this as hyperbole.) Just use a well-specified format and save your brain-cells.
Occasionally, we must work with people who can only export to csv. This does not imply csv is a reasonable way to represent data compared to other options.
The HTML 5 spec says exactly how you're supposed to deal with broken HTML files.
Yes, that is a single spec with correspondingly-small importance. Generally parsing html remains extremely difficult.
It's of quite large importance, and despite being difficult, it is well-specified, which is the point here. Importantly, there is also no competing HTML spec, either de facto or otherwise. CSV doesn't have anything of comparable authority.
> CSVs are the only "structured" format nobody can claim to parse 100%
You don't need to though since in most cases you just need to support whatever CSV format the tool you're handling, unless of course you're trying to write the next Excel/Google Sheets competitor.
CSV works because CSV is understood by non technical people who have to deal with some amount of technicality. CSV is the friendship bridge that prevents technical and non technical people from going to war.
I can tell an MBA guy to upload a CSV file and i'll take care of it. Imagine i tell him i need everything in a PARQUET file!!! I'm no longer a team player.
This is incorrect. Everyone uses Excel, not CSV. There are billions of people on this planet who know what to do with an .xlsx file.
Do the same with a .CSV file and you'll have to teach those people how to use the .CSV importer in Excel and also how to set up the data types for each column etc. It's a non trivial problem that forces you down to a few million people.
.CSV is a niche format for inexperienced software developers.
Indeed the my main use is most financial services will output your records in csv, although I mostly open that in excel which sometimes gets a bit confused.
"Friendship bridge" is the perfect phrase
This is so relatable to all data eng people from SWE background!
Thanks
Among the shit I have seen in CSV, no " for strings, including those with a return char, innovative SEP, date, numbers, no escape for " within strings, rows related to the reporting tools used to export to CSV etc
True. But most of those problems are pretty easy for the non-technical person to see, understand, and (often) fix. Which strengthens the "friendship bridge".
(I'm assuming the technical person can easily write a basic parsing script for the CSV data - which can flag, if not fix, most of the format problems.)
For a dataset of any size, my experience is that most of the time & effort goes into handling records which do not comply with the non-technical person's beliefs about their data. Which data came from (say) an old customer database - and between bugs in the db software, and abuse by frustrated, lazy, or just ill-trained CSR's, there are all sorts of "interesting" things, which need cleaning up.
I wish this was a joke. I'm always trying to convince data scientists with a foot in the open source world that their life will be so much better if they use parquet or Stata or Excel or any other kind of file but CSV.
On top of all the problems people mention here involving the precise definition of the format and quoting, it's outright shocking how long it takes to parse ASCII numbers into floating point. One thing that stuck with me from grad school was that you could do a huge number of FLOPS on a matrix in the time it would take to serialize and deserialize it to aSCII.
What advantages does excel give you over CSV?
Accurate data typing (never confuse a string with a number)
Maybe be circular but: always loads correctly into Excel, if you want to load into a spreadsheet you can add text formatting and even formulas, checkboxes and stuff which can be a lot of fun.
That is very much not true, Excel does type coercion, especially around things that happen to look like dates: https://www.theverge.com/2020/8/6/21355674/human-genes-renam...
Excel does that type coercion if you import from CSV. If you export pandas data to XLSX it adds proper type information and then it imports properly into Excel and you avoid those problems.
I've recently been developing a raspberry pi based solution which works with telemetry logs. First implementation used an SQLite database (with WAL log) – only to find it corrupted after just couple of days of extensive power on/off cycles.
I've since started looking at parquet files – which turned out to not be friendly to append-only operations. I've ended up implementing writing events into ipc files which then periodically get "flushed" into the parquet files. It works and it's efficient – but man is it non-trivial to implement properly!
My point here is: for a regular developer – CSV (or jsonl) is still the king.
> First implementation used an SQLite database (with WAL log) – only to find it corrupted after just couple of days of extensive power on/off cycles.
Did you try setting `PRAGMA synchronous=FULL` on your connection? This forces fsync() after writes.
That should be all that's required if you're using an NVMe SSD.
But I believe most microSD cards do not even respect fsync() calls properly and so there's technically no way to handle power offs safely, regardless of what software you use.
I use SanDisk High Endurance SD cards because I believe (but have not fully tested) that they handle fsync() properly. But I think you have to buy "industrial" SD cards to get real power fail protection.
Raspberry Pi uses microSD card. Just using fsync after every write would be a bit devastating, but batching might've worked ok in this case. Anyways, too late to check now.
There's definitely a place for it. I ran into the same problem with a battery powered event logger. Basically alternate between sleep-until-event and sample-until-event-over.
SQLite was fine until the realities of that environment hit.
0) I need to save the most data over time and my power budget is unpredictable due to environmentals. 1) When should I commit? SQLite commit per insert slows down, impacts battery life, impacts sample rate. Practically you could get away with batching all data for a small period. 2) SQLite is slow to repair databases. Partially written file would often take longer to repair than we had battery to run.
CSV based format filled that niche. First column was line-column count to support firmware upgrades. Last column is line-checksum. Another column indicating if this line was the last for an event. Parser skips corrupted lines/entries.
If sqlite ends up corrupted, why wouldn't a CSV? What happens if the system dies partway through a write?
It would, but it'd be very easy to skip corrupted lines. With SQLite, I ended up losing data since the corrupted entry.
> I've since started looking at parquet files – which turned out to not be friendly to append-only operations. I've ended up implementing writing events into ipc files which then periodically get "flushed" into the parquet files. It works and it's efficient – but man is it non-trivial to implement properly!
I think the industry standard for supporting this is something like iceberg or delta, it's not very lightweight, but if you're doing anything non-trivial, it's the next logical move.
What isn't fun about CSV is quickly written parsers and serializers repeatedly making the common mistake of not handling, or badly handling, quoting.
For a long time I was very wary of CSV until I learnt Python and started using it's excellent csv standard library module.
Thats true, in recent years its been less of a disaster with lots of good csv libraries for various languages. In the 90s csv was a constant footgun, perhaps thats why they went crazy and came up with XML
Even widely used libraries that you might expect get it right, don't. (Like Spark, which uses Java style backslash escaping)
Why not Pandas, since you're working with tabular data anyway?
When I first started, installing packages which required compiling native code on either my work Windows machine and the old Unix servers was not easy.
So I largely stuck to the Python standard library where I could, and most of the operations I had at the time did not require data analysis on a server, that was mostly done in a database. Often the job was validating and transforming the data to then insert it into a database.
As the Python packaging ecosystem matured and I found I could easily use Pandas everywhere it just wasn't my first thing I'd reach to. And occasionally it'd be very helpful to iterate through them with the csv module, only taking a few MBs of memory, vs. loading the entire dataset into memory with Pandas.
Because maybe they’re not doing something column oriented? Because it has a notoriously finicky API? A dozen other reasons?
The argument against JSON isn't very compelling. Adding a name to every field as they do in their strawman example isn't necessary.
Compare this CSV
To the directly-equivalent JSON The JSON version is only marginally bigger (just a few brackets), but those brackets represent the ability to be either simple or complex. This matters because you wind up with terrible ad-hoc nesting in CSV ranging from entries using query string syntax to some entirely custom arrangement. And in these cases, JSON's objects are WAY better.Because CSV is so simple, it's common for them to avoid using a parsing/encoding library. Over the years, I've run into this particular kind of issue a bunch.
JSON parsers will not only output the expected values every time, but your language likely uses one of the super-efficient SIMD-based parsers under the surface (probably faster than what you are doing with your custom CSV parser).Another point is standardization. Does that .csv file use commas, spaces, semicolons, pipes, etc? Does it use CR,LF, or CRLF? Does it allow escaping quotations? Does it allow quotations to escape commas? Is it utf-8, UCS-2, or something different? JSON doesn't have these issues because these are all laid out in the spec.
JSON is typed. Sure, it's not a LOT of types, but 6 types is better than none.
While JSON isn't perfect (I'd love to see an official updated spec with some additional features), it's generally better than CSV in my experience.
> the directly-equivalent JSON
I think it's a big stretch to use that JSON for comparison. In practice, one is much more likely to see this:While most people would prefer the second version, the first version is also valid JSON and will definitely see use when you want/need JSON but want to reduce data over the wire though you'd probably still see a wrapper object like:
> Because CSV is so simple, it's common for them to avoid using a parsing/encoding library.
A but unfair to compare CSV without parser library to JSON with library.
Essentially nobody uses JSON without a library, but tons of people (maybe even most people) use CSV without a library.
Part of the problem here is standards. There's a TON of encoding variations all using the same .csv extension. Making a library that can accurately detect exactly which one is correct is a big problem once you leave the handful of most common variants. If you are doing subfield encoding, you are almost certainly on your own with decoding at least part of your system.
JSON has just one standard and everyone adheres to that standard which makes fast libraries possible.
The flexibility of JSON is a downside when you just want to stream large volumes of row-oriented tabular data
If you want to stream large volumes of row-oriented data, you aren't reading yourself and you should be using a binary format which is going to be significantly smaller (especially for numeric data).
Yeah that would be the next step in optimization. In the meanwhile, raw text CSV streaming (for not purely numeric data) is still extremely fast and easy to set up
> JSON parsers will not only output the expected values every time
Unless you need appendability, but then you should probably just use NDJSON/JSONL for a lot of cases.
I am annoyed that comma won out as the separator. Tab would have been a massively better choice. Especially for those of us who have discovered and embraced elastic tabstops. Any slightly large CSV is unreadable and uneditable because you can't easily see where the commas are, but with tabs and elastic tabstops, the whole thing is displayed as a nice table.
(That is, of course, assuming the file doesn't contain newlines or other tabs inside of fields. The format should use \t \n etc for those. What a missed opportunity.)
I wrote a web scraper for some county government data and went for tabs as well. It's nice how the columns lined up in my editor (some of these files had hundreds of thousands of lines).
We have dedicated field separator characters :-/
And all kinds of other weirdness, right in ascii. Vertical tabs, LOL. Put those in filenames on someone else's computer if you want to fuck with them. Linux and its common file systems are terrifyingly permissive in the character set they allow for file names.
Nobody uses any of that stuff, though.
CSV have multiple different separators. Eg. Excel defaults to different separators based on locale. Like CZ locale, it uses commas in numbers instead of dot, so CSV uses semicolon as default separator.
> Excel defaults to different separators based on locale.
Which is absolutely awful for interop and does not deserve being hauled as a feature.
They're completely skipping over the complications of header rows and front matter.
"8. Reverse CSV is still valid CSV" is not true if there are header rows for instance.
But really, whether or not CSV is a good format or not comes down to how much control you have over the input you'll be reading. If you have to deal with random CSV from "in the wild", it's pretty rough. If you have some sort of supplier agreement with someone that's providing the data, or you're always parsing data from the same source, it's pretty fine.
At various points in my career, I've had to oversee people creating data export features for research-focused apps. Eventually, I instituted a very simple rule:
As part of code review, the developer of the feature must be able to roundtrip export -> import a realistic test dataset using the same program and workflow that they expect a consumer of the data to use. They have up to one business day to accomplish this task, and are allowed to ask an end user for help. If they don't meet that goal, the PR is sent back to the developer.
What's fascinating about the exercise is that I've bounced as many "clever" hand-rolled CSV exporters (due to edge cases) as other more advanced file formats (due to total incompatibility with every COTS consuming program). All without having to say a word of judgment.
Data export is often a task anchored by humans at one end. Sometimes those humans can work with a better alternative, and it's always worth asking!
As someone who likes modern formats like parquet, when in doubt, I end up using CSV or JSONL (newline-delimited JSON). Mainly because they are plain-text (fast to find things with just `grep`) and can be streamed.
Most features listed in the document are also shared by JSONL, which is my favourite format. It compresses really well with gzip or zstd. Compression removes some plain-text advantages, but ripgrep can search compressed files too. Otherwise, you can:
Another advantage of JSONL is that it's easier to chunk into smaller files.Too bad xz/lzma isn't supported in these formats. I often get pretty big improvements in compression ratio. It's slower, but it can be parallelized too.
You can cat Parquet and other formats into grep just as easily.
I switched to JSONL over a decade ago and I would recommend everyone else to also have switched then.
This whole thread is an uninformed rehash of bad ideas.
I think that might make sense ingest side, but that's very expensive to deal with if you're doing anything remotely large.
I think sinking into something like delta-lake or iceberg probably makes sense at scale.
But yeah, I definitely agree that CSV is not great.
JSONL as a replacement for CSV, you shouldn't be using CSV as format for long term storage or querying, it has so many downsides and nearly zero upsides.
JSONL when compressed with zstd, most of "expensive if large" disappears as well.
Generating and consuming JSONL can easily be in the GB/s range.
I mean on the querying side. Parquet's ability to skip rowgroups and even pages, and paired with iceberg or delta can make the difference between being able to run your queries at all versus needing to scale up dramatically.
Totally agree.
I am saying JSONL is a lower bound format, if you can use something better you should. Data interchange, archiving, transmission, etc. It shouldn't be repeatedly queried.
Parquet, Arrow, sqlite, etc are all better formats.
Essential CSV shell tools:
csvtk: https://bioinf.shenwei.me/csvtk/
gawk: https://www.gnu.org/software/gawk/manual/html_node/Comma-Sep...
awk: https://github.com/onetrueawk/awk?tab=readme-ov-file#csv
Also VisiData is an excellent TUI spreadsheet.
One of my favorite tools. However, I don’t think that Visidata is a spreadsheet, even though it looks like one and is named after one. It is more spreadsheet adjacent. It is focused on row-based and column-based operations. It doesn’t support arbitrary inter-cell operation(s), like you get in Excel-like spreadsheets. It is great for “Tidy Data’, where each row represents a coherent set of information about an object or observation. This is very much like Awk, or other pipeline tools which are also line/row oriented.
For CLI tools, I’m also a big fan of Miller (https://github.com/johnkerl/miller) to filter/modify CSV and other data sources.
Forgive me for promoting this that I wrote:
csvquote: https://github.com/dbro/csvquote
Especially for use with existing shell text processing tools, eg. cut, sort, wc, etc.
I would add xan to this list: https://github.com/medialab/xan
But of course, I am partial ;)
aaand xsv : https://github.com/BurntSushi/xsv
xan is a maintained fork of xsv
CSV still quietly powers the majority of the world’s "data plumbing."
At any medium+ sized company, you’ll find huge amounts of CSVs being passed around, either stitched into ETL pipelines or sent manually between teams/departments.
It’s just so damn adaptable and easy to understand.
> It's just so damn adaptable
Like a rapidly mutating virus, yes.
> and easy to understand.
Gotta disagree there.
For example, one of the CSVs my company shovels around is our Azure billing data. There are several columns that I just have absolutely no idea what the data in them is. There are several columns we discovered are essentially nullable¹ The Hard Way when we got a bill for which, e.g., included a charge that I guess Azure doesn't know what day that charge occurred on? (Or almost anything else about it.)
(If this format is documented anywhere, well, I haven't found the docs.)
Values like "1/1/25" in a "date" column. I mean, I did say it was an Azure-generated CSV, so obviously the bar wasn't exactly high, but then it never is, because anyone wanting to build something with some modicum of reliability, or discoverability, is sending data in some higher-level format, like JSON or Protobuf or almost literally anything but CSV.
If I can never see the format "JSON-in-CSV-(but-we-fucked-up-the-CSV)" ever again, that would spark joy.
(¹after parsing, as CSV obviously lacks "null"; usually, "" is a serialized null.)
Insurance. One of the core pillars of insurance tech is the CSV format. You'll never escape it.
>You'll never escape it.
I see what you did there.
I work as a data engineer in the financial services industry, and I am still amazed that CSV remains the preferred delivery format for many of our customers. We're talking datasets that cost hundreds of thousands of dollar to subscribe to.
"You have a REST API? Parquet format available? Delivery via S3? Databricks, you say? No thanks, please send us daily files in zipped CSV format on FTP."
> REST API
Requires a programmer
> Parquet format
Requires a data engineer
> S3
Requires AWS credentials (api access token and secret key? iam user console login? sso?), AWS SDK, manual text file configuration, custom tooling, etc. I guess with Cyberduck it's easier, but still...
> Databricks
I've never used it but I'm gonna say it's just as proprietary as AWS/S3 but worse.
Anybody with Windows XP can download, extract, and view a zipped CSV file over FTP, with just what comes with Windows. It's familiar, user-friendly, simple to use, portable to any system, compatible with any program. As an almost-normal human being, this is what I want out of computers. Yes the data you have is valuable; why does that mean it should be a pain in the ass?
Yes because users can read the data themselves and don't need a programmer.
Financial users live in Excel. If you stick to one locale (unfortunately it will have to be US) then you are OKish.
I've recently written a library at work to run visitors on data models bound to data sets. One of these visitors is a CSV serializer that dumps a collection as a CSV document.
I've just checked and strings are escaped using the same mechanism for JSON, with backslashes. I should've double-checked against RFC 4180, but thankfully that mechanism isn't currently triggered anywhere for CSV (it's used for log exportation and no data for these triggers that code path). I've also checked the code from other teams and it's just handwritten C++ stream statements inside a loop that doesn't even try to escape data. It also happens to be fine for the same reason (log exportation).
I've also written serializers for JSON, BSON and YAML and they actually output spec-compliant documents, because there's only one spec to pay attention to. CSV isn't a specification, it's a bunch of loosely-related formats that look similar at a glance. There's a reason why fleshed-out CSV parsers usually have a ton of knobs to deal with all the dialects out there (and I've almost added my own by accident), that's simply not a thing for properly specified file formats.
The joy of CSV is everyone knows roughly what you mean and the details can communicated succintly.
The python3 csv module basically does the job.
This is particularly funny because I just received a ticket saying that the CSV import in our product doesn't work. I asked for the CSV, and it uses a semicolon as a delimiter. That's just what their Excel produced, apparently. I'm taking their word for it because... Excel.
To me, CSV is one of the best examples of why Postel's Law is scary. Being a liberal recipient means your work never ends because senders will always find fun new ideas for interpreting the format creatively and keeping you on your toes.
Of course, because there are locales which uses comma as decimal separator. So CSV in Excel then defaults to semicolon.
Another Microsoft BS, they should defaults to ENG locale in CSV, do a translation in background. And let user choose, if they want to save as different separator. Excel in every part of world should produce same CSV by default. Bunch of idiots.
Yes. CSV is a data interchange format, it's meant to be written on one computer and read by another. Making the representation of data dependent on the locale in use is stupid af. Locales are for interaction with the user, not for data interchange.
There is a lot not to like about CSV, for all the reasons given here. The only real positive is that you can easily create, read and edit CSV in an editor.
Personally I think we missed a trick by not using the ASCII US and RS characters:
Columns separated by \u001F (ASCII unit separator).
Rows separated by \u001E (ASCII record separator).
No escaping needed.
More about this at:
https://successfulsoftware.net/2022/04/30/why-isnt-there-a-d...
Good idea, but probably a non-starter due to no keyboard keys for those characters. Even | would've been a better character to use since it almost never appears in common data.
It is probably unrealistic to expect keyboard keyboard vendors to add new keys. But editors could support adding them through keyboard shortcuts (Ctrl + something).
Pipe can be useful as a field delimiter. But what do you use as the record delimiter?
Welp, now I know my weekend project.
In abstract CSV is great. In reality, it is a nightmare not because of CSV, but because of all the legacy tools that product it in slightly different ways (different character encodings mostly - excel still produces some variant of latin1, some tools drop a BOM in your UTF8, etc).
Unless you control the producer of the data you are stuck trying to infer the character encoding and transcoding to your destination, and there's no foolproof way of doing that.
I greatly prefer TSV over CSV. https://en.wikipedia.org/wiki/Tab-separated_values
100% agree. TSV is under-rated. Tabs don't naturally occur in data nearly as often as commas so tabs are a great delimiter. Copy paste into Excel also works much better with tabs.
Code editors may convert tabs to spaces but are you really editing and saving TSV data files in your code editor?
TSV's big advantage is that, as commonly implemented, the separators are escaped, not quoted. This means that a literal newline (ASCII 0x0A) is always a record separator and a literal tab (ASCII 0x09) is always a field separator. This is the format many databases -- including Postgres -- use for text export by default.
There are some notes I put together about TSV a few years ago that expand on these points: https://github.com/solidsnack/tsv?tab=readme-ov-file#motivat...
Unix and AWK will parse it on the spot.
The problem with using TSV is different user configuration. For ex. if I use vim then Tab might indeed be a '\t' character but in TextEdit on Mac it might be something different, so editing the file in different programs can yield different formatting. While ',' is a universal char present on all keyboards and formatted in a single way
Well, back in my day, we used tabs for tabs.
But then some folks came along about 15 years ago screaming about spaces, and they won, so now tabs are 2 or 4 spaces.
The law of unintended consequences strikes again!
Note: not meant to denigrate you space supporters out there.
Csv suffers from similar encoding and platform specific gotchas surrounding newlines and delimiter escaping.
The only situation I can think of where a tab is not a tab, is in a code editor that's been configured (possibly by default) to use spaces instead. But that's an easy enough configuration to change. And certainly wouldn't be a problem for something like TextEdit.
CSV is a pseudo-standard anyway, IMO the delimiter should be a configurable option (like it is in Unix cut and those kinds of tools).
> IMO the delimiter should be a configurable option
It is. CSV has been character separated vs comma separated for probably decades now. Most tools you'd use to mess with them allow you to specify which separator character is being used.
I agree although it seems to render the distinction made by GP moot, right?
Functionally the same. I'd prefer CSV if my content was likely to have whitespace in it and didn't want billion quotes. I'd prefer TSV if my content was unlikely to have whitespace, and more likely to contain commas.
The problem with TSV is what are you going to do about quotes. Some fields might contain them [1] or they might be needed to store fields with tabs inside them.
Because of this in order to read a plain simple TSV (fields separated by tabs, nothing more) with the Python csv module [2] you need to set the quote character to an improbable value, say € (using the euro sign because HN won't let me use U+1F40D), or just parse it by hand, e.g. row.split('\t').
[1]: https://github.com/wireservice/csvkit/issues/1194
[2]: https://docs.python.org/3/library/csv.html
Quotes are just characters in TSV. Actual tab characters are banned so this should be simple. (Of course poor implementations may behave differently.)
Agreed, much easier to work with, especially if you can guarantee no embedded tabs or newlines. Otherwise you end up with backslash escaping, but that's still usually easier than quotes.
It's not really easier than CSV if you can guarantee no commas or newlines.
Parsing escapes is easier than parsing quoted text with field and record separators embedded in it. Every literal newline or literal tab is a separator. One can jump to the thousandth record, for example, just by skipping lines, without looking into the contents.
Much easier to require fields not have tabs than not have commas, though.
TSV looks incredibly ugly when opened in a text editor unless all values are 7 characters or less.
There are some specialized text editors for editing files with tabs. https://en.wikipedia.org/wiki/Tab_stop#Dynamic_tab_stops , https://nick-gravgaard.com/elastic-tabstops/ , https://tibleiz.net/code-browser/
If have to use a dedicated tabular data editing program, I may as well use a spreadsheet application. What do the options you propose do better than libreoffice calc?
What part of "the value of CSV is to have something that can be easily viewed and modified in any text editor" do you not understand
I’m not sure where that quote is from but it is incorrect, CSVs aren’t easily viewed or modified in text editors in general (at least not in any way that takes advantage of their tabular nature).
There’s at least a slight chance that tab separated values will look ok in a text editor (although in general, nope).
Set a bigger tab size in the editor, e.g. `:set ts=32` in vim.
vartabstop can help with per-column width. I found
https://sharats.me/posts/automating-the-vim-workplace-3/#usi...
which greatly improved my TSV editing experience in vim. Couple that with autocmd TextChangedI and you get realtime recalculation and alignment.
Aaaand now I can fit like 3 fields next to each other before they wrap and the field with arbitrary length text still misaligns the field after it
There is at least a chance that the text fields will fit in a tab. If not, a tool like “column” on Linux can be used. There is no chance that a text field will fit inside the width of a comma.
I was about to mention that column [1] is part of util-linux, so perhaps it's Linux specific, but then I noticed this in the FreeBSD man page [2]:
> The column command appeared in 4.3BSD-Reno.
[1]: https://man7.org/linux/man-pages/man1/column.1.html
[2]: https://man.freebsd.org/cgi/man.cgi?query=column&sektion=1
How does CSV look when the fields are all of different widths?
Like a list of lists of strings
meh. The whitespace makes it easier to eyeball IMHO.
Thanks for your input.
I think for "untyped" files with records, using the ASCII file, (group) and record separators (hex 1C, 1D and 1E) work nicely. The only constraint is that the content cannot contain these characters, but I found that that is generally no problem in practice. Also the file is less human readable with a simple text editor.
For other use cases I would use newline separated JSON. Is has most of the benefits as written in the article, except the uncompressed file size.
I agree that JSONL is the spiritual successor of CSV with most of the benefits and almost none of the drawbacks.
It has a downside though: wherever JSON itself is used, it tends to be a few kilobytes at least (from an API response, for example). If you collect those in a JSONL file the lines tend to get verrrry long and difficult to edit. CSV files are more compact.
JSONL files are a lot easier to work with though. Less headaches.
The drawbacks are quite substantial actually – uses much more data per record. For many cases it's a no-go.
Honestly yes. If text editors would have supported these codes from the start, we might not even have XML, JSON or similar today. If these codes weren't "binary" and all scary, we would live in much different world.
I wonder how much we have been hindered ourselves by reinventing plain text human-readable formats over the years. CSV -> XML -> JSON -> YAML and that's just the top-level lineage, not counting all the branches everywhere out from these. And the unix folks will be able to name plenty of formats predating all of this.
I'm not really sure why "Excel hates CSV". I import into Excel all the time. I'm sure the functionality could be expanded, but it seems to work fine. The bit of the process I would like improved is nothing to do with CSV - it's that the exporting programs sometimes rearrange the order of fields, and you have to accommodate that in Excel after the import. But since you can have named columns in Excel (make the data in to a table), it's not a big deal.
One problem is that Excel uses locale settings for parsing CSV files (and, to be fair, other text files). So if you're in e.g. Europe and you've configured Excel to use commas as decimal separators, Excel imports numbers with decimals (with points as decimal separator) as text. Or it thinks the point is a thousands separator. I forgot exactly which one of those incorrect options it chooses.
I don't know what they were thinking, using a UI setting for parsing an interchange format.
There's a way around, IIRC, with the "From text / csv" command, but that looses a lot of the convenience of double-clicking a CSV file in Explorer or whatever to open it in Excel.
It used to silently transform data on import. It used to silently drop columns.
That's it, but it's really bad.
It's really bad if your header row has less columns than the data rows. You really need to do the import vs just opening the file because it's not even obvious that it's dropping data unless you know what to expect from your file.
Excel is halfway decent if you do the 'import' but not if you just doubleclick on them. It seems to have been programmed to intentionally do stupid stuff with them if you just doubleclick on them.
I agree that the default way Excel handles CSV files is terrible. Using Power Query to manage them is the way to go. But it's the general Microsoft approach to backwards compatibility so very unlikely to change now.
In the past I remember that Excel not properly handling UTF-8 encoded text in a CSV. It would treat it as raw ASCII (or possibly code page 1252). So if you opened and saved a CSV, it would corrupt any Unicode text in the file. It's possible this has been fixed in newer versions, I haven't tried in a while.
It's related to how older versions of Windows/Office handled Unicode in general.
From what I have heard, it's still an issue with Excel, although I assume that Windows may handle plain text better these days (I haven't used it in a while)
You need to write an UTF-8 BOM at the beginning (0xEF, 0xBB, 0xBF), if you want to make sure it's recognized as UTF-8.
Ugh, UTF-8 BOM. Many apps can handle UTF-8 but will try to return those bytes as content; maybe ours in 2015 too
I was on the Power Query team when we were improving the encoding sniffing. An app can scan ahead i.e. 64kB, but ultimately the user needs to just say what the encoding is. All the Power Query data import dialogs should let you specify the encoding.
UTF-8 BOM is probably not a good idea for anything other than (maybe) plain text documents. For data, many (although not all) programs should not need to care about character encoding, and if they include something such as UTF-8 BOM then it will become necessary to consider the character encoding even though it shouldn't be necessary.
I have repeatedly seen people getting the spreadsheets altered by excel, and in general a lot of troubles due to localisation reasons. Sometimes these changes can be subtle and be hard to spot until somebody tries to troubleshoot what went wrong down the line.
It works better if you click to "import the data" instead of just opening the csv file with it, and if you then choose the right data types. But having to do this everytime to make it work is really annoying, esp when you have a lot of columns, plus people can easily get confused with the data types. I have never seen that much confusion eg with macos's numbers.
https://stackoverflow.com/questions/165042/stop-excel-from-a...
It's an ad hoc text format that is often abused and a last-chance format for interchange. While heuristics can frequently work at determining the structure, they can just as easily frequently fail. This is especially true when dealing with dates and times or other locale-specific formats. Then, people outright abuse it by embedding arrays or other such nonsense.
You can use CSV for interchange, but a duck db import script with the schema should accompany it.
"CSV" should die. The linked article makes critical ommisions and is wrong about some points. Goes to show just how awful "CSV" is.
For one thing, it talks about needing only to quote commas and newlines... qotes are usually fine... until they are on either side of the value. then you NEED to quote them as well.
Then there is the question about what exactly "text" is; with all the complications around Unicode, BOM markers, and LTR/RTL text.
CSV on the Web (CSVW) is a W3C standard designed to enable the description of CSV files in a machine-readable way.[1]
"Use the CSV on the Web (CSVW) standard to add metadata to describe the contents and structure of comma-separated values (CSV) data files." — UK Government Digital Service[2][3]
[1] https://www.w3.org/TR/tabular-data-primer/
[2] https://www.gov.uk/government/publications/recommended-open-...
[3] https://csvw.org/
One thing that has changed the game with how I work with CSVs is ClickHouse. It is trivially easy to run a local database, import CSV files into a table, and run blazing-fast queries on it. If you leave the data there, ClickHouse will gradually optimize the compression. It's pretty magical stuff if you work in data science.
I feel the same way about elastic.
That being said I noticed .parquet as an export format option on Shopify recently and an hopeful more providers offer the choice.
Simon W's https://datasette.io/ is also excellent.
Datasette is a wonderful tool that I've used before, and I have the highest admiration for its creator, but the underlying Sqlite3 database doesn't handle large datasets (i.e. hundreds of millions of rows) nearly as well as ClickHouse does.
It's worth noting that I only ran into this limitation when working with huge federal campaign finance datasets [1] and trying to do some compute-intensive querying. For 99% of use cases, datasette is a similarly magical piece of software for quickly exploring some CSV files.
1. https://www.fec.gov/data/browse-data/?tab=bulk-data
CSV is everywhere. I use manifold-csv[1] it’s amazing.
1. https://github.com/manifold-systems/manifold/tree/master/man...
How much easier would all of this be if whoever did CSV first had done the equivalent of "man ascii". There are all these wonderful codes there like FS, GS, RS, US that could have avoided all the hassle that quoting has brought generations of programmers and data users.
All hail TSV. Like CSV, but you're probably less likely to want tabs, than commas.
I wrote my own CSV parser in C++. I wasn't sure what to do in some edge cases, e.g. when character 1 is space and character 2 is a quote. So I tried importing the edge case CSV into both MS Excel and Apple Numbers. They parsed it differently!
Also CSV can be queried : https://til.simonwillison.net/sqlite/one-line-csv-operations
CSV is so deceptively simple that people don't care understanding it. I wasted countless hours working around services providing non-escaped data that off the shelf parsers could not parse.
Something I support completely - previously https://news.ycombinator.com/item?id=35418933#35438029
If CSV is indeed so horrible - and I do not deny that there can be an improvement - how about the clever data people spec out a format that
Does not require a bizarre C++ RPC struct definition library _both_ to write and to read
Does not invent a clever number encoding scheme that requires native code to decode at any normal speed
Does not use a fancy compression algorithm (or several!) that you need - again - native libraries to decompress
Does not, basically, require you be using C++, Java or Python to be able to do any meaningful work with it
It is not that hard, really - but CSV is better (even though it's terrible) exactly because it does not have all of these clever dependency requirements for clever features piled onto it. I do understand the utility of RLE, number encoding etc. I do not, and will not, understand the utility of Thrift/Avro, zstandard and brotli and whatnot over standard deflate, and custom integer encoding which requires you download half of Apache Commons and libboost to decode. Yes, those help the 5% to 10% of the use cases where massive savings can be realised. It absolutely ruins the experience for the other 90 to 95.
But they also give Parquet and its ilk a very high barrier of entry.
I have written a new database system that will convert CSV, JSON, and XML files into relational tables.
On of the biggest challenges to CSV files is the lack of data types on the header line that could help determine the schema for the table.
For example a file containing customer data might have a column for a Zip Code. Do you make the column type a number or a string? The first thousand rows might have just 5 digit numbers (e.g. 90210) but suddenly get to rows with the expanded format (e.g. 12345-1234) which can't be stored in an integer column.
csv does not stop you from making the first line be column headers, with implied data types, you just have to comma separate them!
I realize that. But when reading a header you have to imply the data types which might be wrong. I always thought it would have been great if the first line read something like: name:STRING,address:STRING,zip code:INTEGER,ID:BIG_INT,...
The worst thing about CSV is Excel using localization to choose the delimiter: https://answers.microsoft.com/en-us/msoffice/forum/all/csv-f...
The second worst thing is that the escape character cannot be determined safely from the document itself.
For everyone complaining about CSV/TSV, there's a scripting language called R.
It makes working with CSV/TSV files super simple.
It's as easy this:
# Import tidyverse after installing it with install.packages("tidyverse")
library(tidyverse)
# Import TSV
dataframe_tsv <- read_tsv("data/FileFullOfDataToBeRead.tsv")
# Import CSV
dataframe_csv <- read_csv("data/FileFullOfDataToBeRead.csv")
# Mangle your data with dplyr, regular expressions, search and replace, drop NA's, you name it.
<code to sanitize all your data>
Multiple libraries exist for R to move data around, change the names of entire columns, change values in every single row with regular expressions, drop any values that have no assigned value, it's the swiss army knife of data. There are also all sorts of things you can do with data in R, from mapping with GPS coordinates to complex scientific graphing with ggplot2 and others.
Here's an example for reading iButton temperature sensor data: https://github.com/hominidae/ibutton_tempsensors/
Notice that in the code you can do the following to skip leading lines by passing it as an argument: skip = 18
cf1h <- read_csv("data/Coldframe_01_High.csv", skip = 18)
I've found some use cases where CSV can be a good alternative to arrays for storage, search and retrieval. Storing and searching nested arrays in document databases tends to be complicated and require special queries (sometimes you don't want to create a separate collection/table when the arrays are short and 1D). Validating arrays is actually quite complicated; you have to impose limits not only on the number of elements in the array, but also on the type and size of elements within the array. Then it adds a ton of complexity if you need to pass around data because, at the end of the day, the transport protocol is either string or binary; so you need some way to indicate that something is an array if you serialize it to a string (hence why JSON exists).
Reminds me of how I built a simple query language which does not require quotation marks around strings, this means that you don't need to escape strings in user input anymore and it prevents a whole bunch of security vulnerabilities such as query injections. The only cost was to demand that each token in the query language be separated by a single space. Because if I type 2 spaces after an operator, then the second one will be treated as part of the string; meaning that the string begins with a space. If I see a quotation mark, it's just a normal quotation mark character which is part of the string; no need to escape. If you constrain user input based on its token position within a rigid query structure, you don't need special escape characters. It's amazing how much security has been sacrificed just to have programming languages which collapse space characters between tokens...
It's kind of crazy that we decided that quotation marks are OK to use as special characters within strings, but commas are totally out of bounds... That said, I think Tab Separated Values TSV are even more broadly applicable.
Funny how the "specification holds in a tweet" yet manages to miss at least three things: 1) character encoding, 2) BOM or not, 3) header or no header.
Always UTF-8. Never a BOM. Always a header
Great if you're the one producing the CSV yourself.
But if you're ingesting data from other organizations, they will, at one time or another, fuck up every single one of those (as well as the ones mentioned in TFA), no matter how clearly you specify them.
the simplicity is underappreciated because people don't realise how many dumb data engineers there are. i'm pretty sure most of them can't unpack an xml or json. people see a csv and think they can probably do it themselves, any other data format they think 'gee better buy some software with the integration for this'.
Excel hates CSV only if you don't use the "From text / csv" function (under the data tab).
For whatever reason, it flawlessly manages to import most CSV data using that functionality. It is the only way I can reliably import data to excel with datestamps / formats.
Just drag/dropping a CSV file onto a spreadsheet, or "open with excel" sucks.
Even "From Text / CSV" sucks:
It inserts an extra row at the top for its pivot table, with entries "Column1, Column2, ...".
So if you export to CSV again, you now have 2 header rows.
So Excel can't roundtrip CSVs, and the more often you roundtrip, the more header rows you get.
You need to remember to manually delete the added header row each time, otherwise software you export back to can't read it.
That's strange, I've never seen this behaviour. Loading a CSV this way (Data -> From Text/CSV) always parses the first record as the header for me.
Yea they seem to have added this about a year ago and it works pretty well, to be fair.
Now if they would just also allow pasting CSV data as “source” it would be great.
It's a carry over from powerbi actually, separate function entirely.
If this was really a love letter, it would have been in CSV format.
CSV is bad. Furthermore it’s unnecessary. ASCII has field and record separator characters that were for this purpose.
That would be great if keyboards had keys for those characters and there was a common way to display them on a screen, but they don't and there isn't.
I can’t remember the last time I or anyone else I know typed a CSV file out. It’s almost universally the lowest common denominator interchange format.
That's a feature, not a bug
> 4. CSV is streamable
This is what keeps me coming back.
…ndjson is streamable, too…
I like ndjson and jsonl just fine, but unless I need a more complicated structure, it's not worth the extra hassle of parsing JSON.
… what concrete language are we talking about, here?
In literally any language I can think of, hassle(json) < hassle(CSV), esp. since CSV received is usually "CSV, but I've screwed it up in a specific, annoying way"
I'm thinking mostly of the computational complexity.
But even ergonomically, in python, can read a csv like:
which imo is not less ergonomic thanJust don't write that love letter in French ... or any language that uses comma for decimals
The Italian version of Excel uses a custom CSV style with ; as a column separator. This breaks many applications that accept CSVs. It's super annoying.
You should look at its author's nationality ;)
The library quotes values like that in quotes
>This is so simple you might even invent it yourself without knowing it already exists while learning how to program.
This is a double-edged sword. The "you might even event it yourself" simplicity means that in practice lots of different people do end up just inventing their own version rather than standardizing to RFC-4180 or whatever when it comes to "quote values containing commas", values containing quotes, values containing newlines, etc. And the simplicity means these type of non-standard implementations can go completely undetectable until a problematic value happens to be used. Sometimes added complexity that forces paying more attention to standards and quickly surfaces a diversion from those standards is helpful.
Just last week I was bitten by a customer’s CSV that failed due to Windows‘ invisible BOM character that sometimes occurs at the beginning of unicode text files. The first column‘s title is not „First Title“ then but „&zwnbsp;First Title“. Imagine how long it takes before you catch that invisible character.
Aside from that: Yes, if CSV would be a intentional, defined format, most of us would do something different here and there. But it is not, it is more of a convention that came upon us. CSV „happened“, so to say. No need to defend it more passionate than the fact that we walk on two legs. It could have been much worse and it has surprising advantages against other things that were well thought out before we did it.
I wish the UTF8BOM was standardized. Encoding guessing usually works until it doesn't.
There was/is CSVY [0] which attempted to put column style and separator information in a standard header. It is supported by R lang.
I also asked W3C on theirGithub if there was any spec for CSV headers and they said there isn't [1]. Kind of defeats the point of the spec in my opinion.
0: https://github.com/leeper/csvy
1: https://github.com/w3c/csvw/issues/873
CSV is the bane of my existence. There is no reason to use it outside of legacy use-cases, when so many alternatives are not so brittle that they require endless defensive hacks to avoid erring as soon as exposed to the universe. CSV must die.
CVS isn't brittle, and I'm not sure what "hacks" you're referring to. If you or your parser just follow RFC4180 (particularly quote every field, and double quoting to cancel-quote), that will get you 90%+ compatibility.
Surely you’ve come across situations where line number 10,000,021 of a 60m line CSV fails to parse because there aren’t enough fields in that line of the file…? The issue is that you can’t definitively know which of the 50 fields is missing, so you have to fail the line or worse the file.
In my experience (perhaps more niche than yours since you mentioned it has been your day job), the lack of fall back options makes for brittle integrations. Failing entire files due to a borked row can be expensive in terms of time.
Having to ingest large CSV files from legacy systems has made me rethink the value of XML, lol. Types and schemas add complexity for sure, but you get options for dealing with variances in structure and content.
That is a problem, but it is also a problem with XML. Parsing the XML file to discover e.g. unmatching tags is far more CPU and memory expensive than correctly passing a CVS.
In both cases you'd fail the entire file rather than partial recovery.
/me laughs in legacese
RFC4180 is a late attempt at CSV standardization, merely codifying a bunch of sane practices. It also provides a nice specification for generating CSV. But anyone taking care to code from a specification might as well use a proper file format.
The real specification for CSV is as follows: "Valid CSV is whatever is designated as CSV by its emitter". I wish I was joking.
There is literally an infinity of ways CSV can be broken. The developer will bump his head on each as he encounters them, and add a specific fix. After a while, his code will be robust against the local strains of CSV... Until the next mutation is encountered after acquiring yet another company with a bunch of ERP way past their last extended maintainance era, a history of local adaptations and CSV as a message bus.
CSV is awesome for front-end webapps needing to fetch A LOT of data from a server in order to display an information-dense rendering. For that use-case, one controls both sides so the usual serialization issues aren't a problem.
Working in clinical trial data processing I receive data in 1 of 3 formats: csv, sas datasets, image scans of pdf pages showing spreadsheets
Of these 3 options sas datasets are my preference but I'll immediately convert to csv or excel, csv is a close 2nd once you confirm the quoting / seperator conventions it's very easy to parse. I understand why someone may find the csv format disagreeable but in my experience the alternatives can be so much worse I don't worry too much about csv files
TSV > CSV
Way easier to parse
pipe (|) separated or gtfo!
sqlite3 gets it right.
What do you like so much about the pipe?
I think pipe is better too.
Typical latin fonts divide characters into three heights: short like "e" or "m", tall like "l" or "P" and deep like "j" or "y". As you may notice, letters only use one or two of these three sections.
Pipe is unique in that it uses all three at the same time from the very top to the very bottom. No matter what latin character you put next to it, it remains distinct. This makes the separators relatively easy to spot.
Pipe is a particularly uncommon character in normal text while commas, spaces, semicolons, etc are quite common. This means you don't need to escape it very often. With an escapable pipe, an escapable newline, and unicode escapes ("\|", "\n", and "\uXXXX") you can handle pretty much everything tabular with minimal extra characters or parsing difficulty.
This in turn means that you can theoretically differentiate between different basic types of data stored within each entry without too much difficulty. You could even embed JSON inside it as long as you escape pipes and newlines.
Maybe someone should type this up into a .psv file format (maybe it already exists).I already prefer using pipe as separator in logging; now you're telling me there is a chance that my logs can be automatically ingested as tabular data? Sign me up for this branch of the multiverse :)
I forgot to add that you need a "\\" escape for when it appears before |, n, or u in text.
Visually similar to column separator in a spreadsheet.
Less likely to appear in normal data. Of course you have to escape it but at the very least the data looks less noisy.
"the controversial ex-post RFC 4180"
I looked at the RFC. What is controversial about it?
Look at how it handles escaping of special characters and particularly new lines (RFC 4180 doesn’t guarantee that a new line is a new record) and how it’s written in 2005 yet still doesn’t handle unicode other than via a comment about ”other character sets”.
"how it’s written in 2005 yet still doesn’t handle unicode other than via a throwaway comment about ”other character sets”"
Yeah, you are spot on with this one (cries in Czech, which used to be encoded in several various ways).
> RFC 4180 doesn’t guarantee that a new line is a new record
Correctly. A good parser should step through the line one column at a time, and shouldn't even consider newlines that are quoted.
If you're naively splitting the entire file via newline, that isn't 4180's fault, that is your fault for not following the standard or industry norms.
I'll happily concede the UNICODE point however; but I don't know if that makes it controversial.
You mean aside from the fact it's ex-post ...
Doesn't it make sense to have a common document in the usual format (RFC) which every newbie can consult when in doubt?
I much prefer that to any sort of "common institutional memory" that is nevertheless only talked about on random forums. People die, other people enter the field... hello subtle incompatibilities.
The fact that you can parse CSV in reverse is quite cool, but you can't necessarily use it for crash recovery (as suggested) because you can't be sure that the last thing written was a complete record.
Last field rather than last record. The first row will give you column count.
The column count doesn't help because you don't know where the last record starts because you don't know whether you're in a string or not. Unless you scan the entire file from the beginning, which defeats the object.
Unless newlines in strings are escaped
I love CSV for a number of reasons. Not the least of which it’s super easy to write a program (code) in C to directly output all kinds of things to CSV. You can also write simple middleware to go from just about any database or just general “thing” to CSV. Very easily. Then toss CSV into excel and do literally anything you want.
It’s sort of like, the computing dream when I was growing up.
+1 to ini files. I like you can mess around with them yourself in notepad. Wish there was a general outline / structure to those though.
I have been just splitting my head to parse data from from a erp database to csv and then from csv to erp database again using the programming language user by erp system.
The first part of converting data to csv works fine with help of ai coding assistant.
The reverse part of csv to database is getting challenging and even claude sonnet 3.7 is not able to escape newline correctly.
I am now implementation the data format in json which is much simpler.
Items #6 to #9 sound like genuine trolling to me; item #8, reversing bytes because of course no other text encodings than ASCII exist, is particularly horrible.
the reversing bytes part is encoding agnostic. you just feed the reversed bytes to the csv parser then re-reverse both the yielded rows and the cells bytes and get the original order of the bytes themselves.
Except for multibyte encodings.
CSV?
it's surely simple but..
- where is meta data? - how do you encode binary? - how do you index? - where are relationships? different files?
I think I understand the point being made, but all this reliance on text-based data means we require proper agreement on text encodings, etc. I don't think it's very useful for number-based data anyway, it's a massively bloated way to store float32s for instance and usually developers truncate the data losing about half of the precision in the process.
For numerical data, nothing beats packing floats into blobs.
I think binary formats have many advantages. Not only for numbers but other data as well, including data that contains text (to avoid needing escaping, etc; and to declare what character sets are being used if that is necessary), and other structures. (For some of my stuff I use a variant of DER, which adds a few new types such as key/value list type.)
Relevant discussion from a few years back
https://news.ycombinator.com/item?id=28221654
I can‘t really understand the love for the format. Yes it’s simple but also not defined in a common spec. Same story with markdown. Yes GitHub tried to push for a spec but it still feels more like a flavor. I mean there is nothing wrong with not having a spec. But certain guarantees are not given. Will the document exported by X work with Y.
I love CSV when it's only me creating/using the CSV. It's a very useful spreadsheet/table interchange format.
But god help you if you have to accept CSVs from random people/places, or there's even minor corruption. Now you need an ELT pipeline and manual fix-ups. A real standard is way better for working with disparate groups.
CSV has caused me a lot of problems due to the weak type system. If I save a Dataframe to CSV and reload it, there is no guarantee that I'll end up with an identical dataframe.
I can depend on parquet. The only real disadvantages with parquet are that they aren't human-readable or mutable, but I can live with that since I can easily load and resave them.
Quick question while we’re on the topic of CSV files: is there a command-line tool you’d recommend for handling CSV files that are malformed, corrupted, or use unexpected encodings?
My experience with CSVs is mostly limited to personal projects, and I generally find the format very convenient. That said, I occasionally (about once a year) run into issues that are tricky to resolve.
I'm in on the "shit on microsoft for hard to use formats train" but as someone who did a LOT of .docx parsing - it turned into zen when I realized that I can just convert my docs into the easily parsed .html5 using something like pandoc.
This is a good blog post and Xan is a really neat terminal tool.
Xan looks great. Miller is another great cli tool for transforming data among csv, tsv, json and other formats.
https://miller.readthedocs/
https://miller.readthedocs.io/
trimmed the version path off and went a bit too far! thanks.
> CSV is dynamically typed
No, CSV is dependently typed. Way cooler ;)
I wrote something about this https://github.com/Ericson2314/baccumulation/blob/main/datab...
Excel won't import ISO 8601 timestamps either, which is crazy these days where it's the universal standard, and there's no excuse to use anything else.
You have to replace the "T" separator with a space and also any trailing "Z" UTC suffix (and I think any other timezone/offset as well?) for Excel to be able to parse as a time/date.
I usually have the opposite problem. There's even a joke about it:
How is Excel like an Incel? Both of them think everything is a date.
Have you tried using the "from text/csv" importer under the data tab? Where it will import your data into a table. Because that one will import ISO 8601 timestamps just fine.
This, it's dumb but Excel handles csv way better if you 'import' it vs just opening it. I use excel to quickly preview csv files, but never to edit them unless I'm OK only ever using it in Excel afterwards.
Even in that case I'd be hesitant to open a CSV file in excel. The problem is that it will automatically apply whatever transformation it thinks is appropriate the moment you open the file. Have a digit string that isn't semantically a number? Too bad, it's a number now, and we're gonna go ahead and round it. You didn't really need _all_ of the digits of that insurance policy number, did you?
They did finally add options to turn off the common offenders, but I have a deeply ingrained distrust at this point.
I've noticed recently, they ask you about some of the transformations with a popup instead of automatically doing them when you open csv files.
Honestly I’m happier when Excel doesn’t try to convert anything. too much bugginess.
Especially gene names. It was so bad that the scientific community renamed the genes in question rather than suffering from the same horror endlessly.
[0] https://www.theverge.com/2020/8/6/21355674/human-genes-renam...
Just wrong!!
I don't get it - why the world, Excel can't just open the CSV, assume from the extension it's COMMA separated value and do the rest. It does work slightly better when importing, just a little.
Your comma isn't my comma. French systems use the comma as a decimal point for numbers and we use semicolons to separate fields in CSV files.
No, french systems also use comma to separate fields in CSV files. Excel uses semicolon to separate fields in France, meaning it generates semicolon-separated files rather than comma-separated files.
It's not the fault of CSV that Excel changes which file format it uses based on locale.
It's even worse than that. Office on my work computer is set to the English language, but my locale is French and so is my Windows language. It's saving semicolon-separated CSV files with the comma as a decimal point.
I need to uncheck File > Option Advanced > Use system separators and set the decimal separator to a dot to get Excel to generate English-style CSV files with semicolon-separated values. I can't be bothered to find out where Microsoft moved the CSV export dialog again in the latest version of Office to get it to spit out comma-separated fields.
Point is, CSV is a term for a bunch of loosely-related formats that depends among other things on the locale. In other words, it's a mess. Any sane file format either mandates a canonical textual representation for numbers independent of locale (like JSON) or uses binary (like BSON).
> It's saving semicolon-separated CSV files with the comma as a decimal point.
It's not though, is what I'm saying. It's saving semicolon-separated files, not CSV files. CSV files have commas separating the values. Saying that Excel saves "semicolon-separated CSV files" is nonsensical.
I can save binary data in a .txt file, that doesn't make it a "text file with binary data"; it's a binary file with a stupid name.
Sorry, but what Excel does is save to a file with a CSV extension. This format is well defined and includes ways to specify encoding and separator to be readable under different locales.
This format is not comma separated values. But Excel calls it CSV.
The headaches comes if people assume that a csv file must be comma separated.
I don't care what Excel calls it. As I said, if I name a file .txt but stuff it with binary data, it's not a text file.
That, bad specs, weird management/ timezone/ governance/ communications and random \n\r issues transformed a 2 day fun little project into a 4 week hell. I will never work with CSV in France ever again. Mostly because of Excel, normal CSV nice.
Most of the people most of the time aren't importing data from a different locale. A good assumption for defaults could be that the CSV file honors the current Windows regional settings.
If it only was that easy. Experience has shown that the only reliable way is to run heuristics against the first few lines of the file.
There are office programs that save CSV with the proper comma delimiter regardless of the locale.
There are people who run non-local locales for various good reasons.
There are technically savvy people who have to deal with CSV shenanigans and can and will send it with the proper comma delimiter.
It could, but it doesn't want to. The whole MS Office dominance came into being by making sure other tools can't properly open documents created by MS tools; plus being able to open standard formats but creating small incompatibilities all around, so that you share the document in MS format instead.
Probably Microsoft treats a pure-text, simply specified, human-readable and editable spreadsheet format that fosters interoperability with competing software as an existential threat.
This. I got burnt by the encoding and other issues with CSV in Excel back in the day, I've only used LibreOffice Calc (on Linux) for viewing / editing CSVs for many years now, it's almost always a trouble-free experience. Fortunately I don't deal much with CSVs that Excel-wielding non-devs also need to open these days - I assume that, for most folks, that's the source of most of their CSV woes.
I just wish Excel was a little less bad about copy-pasting CSVs as well. Every single time, without fail, it dumps them into a single column. Every single time I use "text to columns" it has insane defaults where it's fixed-width instead of delimited by, you know, commas. So I change that and finally it's fixed.
Then I go do it somewhere else and have to set it up all over again. Drives me nuts. How the default behavior isn't to just put them in the way you'd expect is mind-boggling.
Search and replace + text to columns after the fact works fine.
10. CSV doesn't need commas!
Use a different separator if you need to.
CSV is the Vim of formats. If you get a CSV from 1970 you can still load it.
I have to agree.
It was pretty straightforward (although tedious) to write custom CSV data exports in embedded C, with ZERO dependencies.
I know, I know, only old boomers care about removing pip from their code dev process, but, I'm an old boomer, so it was a great feature for me.
Straight out of libc I was able to dump data in real-time, that everyone on the latest malware OSes was able to import and analyze.
CSV is awesome!
Yeah CSV is easy to export, because its not really a file format, but more an idea. I'm not even sure there is such a thing as "invalid" CSV
The following are all valid CSV, and they should all mean the same thing, depending on your point of view:
1) foo, bar, foobar
2) "foo", "bar", "foobar"
3) "foo", bar, foobar
4) foo; bar; "foobar"
5) foo<TAB>bar<TAB>"foobar"
5) foo<EOL> bar<EOL foobar<EOL>
Have fun writing that parser!
Using <tab> makes it not csv but tsv.
Honestly if there is no comma to separate the values, then its not csv maybe Csv for character separate values or asv for anything separates values but you're right, this makes it hard how everyone is doing whatever. IMV supporting "" makes supporting anything else redundant.
Tell it to Microsoft
I hate CSV (but not as much as XML).
Most reasonably large CSV files will have issues parsing on another system.
It makes me a bit worried to read this thread, I would've thought its pretty common knowledge why CSV is horrible and widely agreed upon. I also have hard time taking anybody seriously who uses "specification" and "CSV" in the same sentence unironically.
I suspect its 1) people who worked with legacy systems AND LIKED IT, or 2) people who never worked with legacy systems before and need to rediscover painful old lessons for themselves.
It feels like trying to convince someone, why its a bad idea to store the year as a CHAR(2) in 1999, unsuccessfully.
The thread makes me worried for a different reason, especially how many avoid reading the RFC and even saying that the RFC doesn't matter, that there's no standard for csv. maybe there should be an extension for rfc-csv.
Maybe just read the love letter?
This might be the most passionate and well-argued defense of CSV I've read
So easy to get data in and out of an application, opens seamlessly in Excel or your favor DB for further inspection. The only issue is the comma rather than a less used separator like | that occasionally causes issues.
Any recommendations for CSV editors on OSX? I was just looking around for this today. The "Numbers" app is pretty awful and I couldn't find any superb substitutes, only ones that were just OK.
I’ve been using Easy CSV Editor. I especially like getting the min/max, unique, etc values in a given column.
I've said it before, CSV will still be used in 200 years. It's ugly, but it occupies an optimal niche between human readability, parsing simplicity, and universality.
With a nice ASIC parser to replace databases.
JSON, XML, YAML are tree describing languages while CSV defines a single table. This is why CSV still works for a lot of things (sure there is JSON lines format of course).
I just wish line breaks weren't allowed to be quoted. I would have preferred \n. Now I can't read line-by-line or stream line-by-line.
Using ascii 'US' Unit Separator and 'RS' Record Separator characters would be a far better implementation of a CSV file.
and of course you can do that if you wish, as many CSV libraries allow arbitrary separators and escapes (though they usually default to the "excel compatible" format)
but at least in my case, I would not like to use those characters because they are cumbersome to work with in a text editor. I like very much to be able to type out CSV columns and rows quickly, when I need to.
It’s all a pros and cons.. the benefit of those characters are they are not used anywhere else, hence you never have to worry about escaping/quoting strings. But obviously most of my csv usage is automated in/out.
I always feel like CSV gets a bad rap. It definitely has problems if you get into corner cases but for many situations it's just fine.
People that talk about readability: if you store using jsonl (one json per line) - you can get your csv by using the terminal command jq.
I wish CSV could have headers for meta data. And schemas would be awesome. And pipes as well to avoid the commas in strings problem.
CSV is too new and without a good standard quoting strategy. Better staying with the boring old fixed length column format :))
> Excel hates CSV
Does it though? Seems to be importing from and exporting to CSV just fine? Elaborate maybe.
9. Excel hates CSV: It clearly means CSV must be doing something right. <3<3<3
with quick and dirty bash stuff ive written the same csv parser so many times it lives in my head and i can write it from memory. no other format is like that. trying to parse json without jq or a library is much more difficult
I prefer Tab-Separated. Its problem though: No tabs allowed in your data.
CSV isn't dynamically typed. Everything is just a string.
I used to prefer csv. Then I started using parquet.
Never want to use sas7bdat again.
shout out to BurntSushis excellent xsv util
https://github.com/BurntSushi/xsv
Kudos for writing this, it's always worth flagging up the utility of a format that just is what it is, for the benefit of all. Commas can also create fun ambiguity, as that last sentence demonstrates. :P
CSV is lovely. It isn't trying to be cool or legendary. It works for the reasons the author proposes, but isn't trying to go further.
I work in a work of VERY low power devices and CSV sometimes is all you need for a good time.
If it doesn't need to be complicated, it shouldn't be. There are always times when I think to myself CSV fits and that is what makes it a legend. Are those times when I want to parallelise or deal with gigs of data in one sitting. Nope. There are more complex formats for that. CSV has a place in my heart too.
Thanks for reminding me of the beauty of this legendary format... :)
Because if there is anything we love in data exchange formats its ambiguity.
CSV is the PHP of fileformats
I'll repeat what I say every time I talk about CSV: I have never encountered a customer who insisted on integrating via CSV who was capable of producing valid CSV. Anybody who can reliably produce valid CSV will send you something else if you ask for it.
> CSV is not a binary format, can be opened with any text editor and does not require any specialized program to be read. This means, by extension, that it can both be read and edited by humans directly, somehow.
This is why you should run screaming when someone says they have to integrate via CSV. It's because they want to do this.
Nobody is "pretending CSV is dead." It'll never die, because some people insist on sending hand-edited, unvalidated data files to your system and not checking for the outcome until mid-morning the next day when they notice that the text selling their product is garbled. Then they will frantically demand that you fix it in the middle of the day, and they will demand that your system be "smarter" about processing their syntactically invalid files.
Seriously. I've worked on systems that took CSV files. I inherited a system in which close to twenty "enhancement requests" had been accepted, implemented, and deployed to production that were requests to ignore and fix up different syntactical errors, because the engineer who owned it was naive enough to take the customer complaints at face value. For one customer, he wrote code that guessed at where to insert a quote to make an invalid line valid. (This turned out to be a popular request, so it was enabled for multiple customers.) For another customer, he added code that ignored quoting on newlines. Seriously, if we encountered a properly quoted newline, we were supposed to ignore the quoting, interpret it as the end of the line, and implicitly append however many commas were required to make the number of fields correct. Since he actually was using a CSV parsing library, he did all of this in code that would pre-process each line, parse the line using the library, look at the error message, attempt to fix up the line, GOTO 10. All of these steps were heavily branched based on the customer id.
The first thing I did when I inherited that work was make it clear to my boss how much time we were spending on CSV parsing bullshit because customers were sending us invalid files and acting like we were responsible, and he started looking at how much revenue we were making from different companies and sending them ultimatums. No surprise, the customers who insisted on sending CSVs were mostly small-time, and the ones who decided to end their contracts rather than get their shit together were the least lucrative of all.
> column-oriented data formats ... are not able to stream files row by row
I'll let this one speak for itself.
I love CSV