I am on the receiving end: I have to parse CSV generated by various (very expensive, very complicated) eCAD software packages. And it's often garbage. Those expensive software packages trip on things like escaping quotes. There is no way to recover a CSV line that has an unescaped double quote.
I can't point to a strict spec and say "you are doing this wrong", because there is no strict spec.
Then there are the TSV and "semicolon-separated values" variants.
Did I mention that field quoting was optional?
And then there are banks, which take this to another level. My bank (mBank), which is known for levels of programmer incompetence never seen before (just try the mobile app) generates CSVs that are supposed to "look" like paper documents. So, the first 10 or so rows will be a "letterhead", with addresses and stuff in various random columns. Then there will be your data, but they will format currency values as prettified strings, for example "34 593,12 USD", instead of producing one column with a number and another with currency.
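(If anyone else is stuck with the same kind of export: normalising those prettified values back into something usable isn't hard, it's just absurd that you have to. A minimal sketch, assuming space-grouped thousands, a comma decimal mark and a trailing currency code:)

    import re
    from decimal import Decimal

    # Assumed format: "34 593,12 USD" -- space-grouped thousands (sometimes a
    # non-breaking space), comma as decimal mark, currency code at the end.
    def parse_pretty_amount(text):
        text = text.replace('\u00a0', ' ').strip()
        m = re.fullmatch(r'(?P<amount>[\d ]+(?:,\d+)?) *(?P<ccy>[A-Z]{3})', text)
        if not m:
            raise ValueError(f'unrecognised amount: {text!r}')
        number = m.group('amount').replace(' ', '').replace(',', '.')
        return Decimal(number), m.group('ccy')

    print(parse_pretty_amount('34 593,12 USD'))   # (Decimal('34593.12'), 'USD')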
I used to be a data analyst at a Big 4 management consultancy, so I've seen an awful lot of this kind of thing. One thing I never understood is the inverse correlation between "cost of product" and "ability to do serialisation properly".
Free database like Postgres? Perfect every time.
Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting, escaping or the difference between \n and \r and who thinks it's clever to use 0xFF as a delimiter, because in the Windows-1252 code page it looks like a weird rune and therefore "it won't be in the data".
"Enterprise software" has been defined as software that is purchased based on the decisions of people that will not use it. I think that explains a lot.
Yep, we had a constant tug of war between techies who wanted to use open-source tools that actually work (Linux, Postgres, Python, Go etc.) and bigwigs who wanted impressive-sounding things in Powerpoint decks and were trying to force "enterprise" platforms like Palantir and IBM BigInsights on us.
Any time we were allowed to actually test one of the "enterprise" platforms, we'd break it in a few minutes. And I don't mean by being pathologically abusive, I mean stuff like "let's see if it can correctly handle a UTF-8 BOM...oh no, it can't".
> Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting...
It's because above a certain size, system projects are captured by the large consultancy shops, who eat the majority of the price in profit and management overhead...
... and then send the coding work to a lowest-cost someone who has never heard of quoting, etc.
And it's a vicious cycle, because the developers in those shops that do learn and mature quickly leave for better pay and management.
(Yes, there's usually a shit hot tiger team somewhere in these orgs, but they spend all their time bailing out dumpster fires or landing T10 customers. The average customer isn't getting them.)
Just a nitpick about consultancy shops -- I've had the chance of working in one in Eastern Europe and noticed that its approach to quality was way better than the client's. It also helped that the client paid by the hour, so the consultancy was incentivized to spend more time on refactoring, improvements and testing (with constant pushback from the client).
So I don't buy the blanket sentiment about consultancy companies; it always boils down to engineers and incentives.
In my experience, smaller ones tend to align incentives better.
Once they grow past a certain size though, it's a labor arbitrage game. Bill client X, staff with resources costing Y (and over-represented), profit = X-Y, minimize Y to maximize profit.
PwC / IBM Global Services wasn't offering the best and brightest. (Outside of aforementioned tiger teams)
I agree with you in general, although my case was the other way around. My company was 10k+ people, but my client was probably the most technically advanced company at that time, with famously hard interviews for its own employees. My employer also didn't want to lose the client (it was the beginning of the collaboration), and since everyone wanted to work there (and move to the US/California), my shop applied a pretty strong filter to its own heads even before sending them to the client's vendor interview.
And the client was very, very happy with the quality, and with the fact that we didn't fight for promotions and could maintain very important but promotion-poor projects. Up to the point that the client trusted us enough to hand a couple of projects over completely to my shop. When you don't need to fight for promotions, code quality also improves.
Try to live in a country where "," is the decimal point. Of course this causes numerous interoperability issues or hidden mistakes in various data sets.
There would have been many better separators... but good idea to bring formatting into it as well...
There was a long period of my life when I thought .csv meant semicolon-separated, because all I saw were semicolon-separated files, and I had no idea of the pain.
Not sure if they still do this, but Klarna would send us ", "-separated files. If there wasn't a space after the comma then it was to be read as a decimal point. Most CSV parsers don't/didn't allow you to specify multi-character separators. In the end I just accepted that we had one field for kronor and one for öre, and that most fields would need a leading space removed.
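(These days I'd probably just throw pandas at it. If memory serves, a multi-character separator is treated as a regex once you fall back to the Python engine, so ", " only splits on comma-plus-space and leaves the decimal commas alone. Untested sketch with made-up data:)

    import io
    import pandas as pd

    raw = 'name, price\nwidget, 12,50\ngadget, 7,05\n'

    # A separator longer than one character is treated as a regular expression,
    # which requires the slower Python parsing engine.
    df = pd.read_csv(io.StringIO(raw), sep=', ', engine='python')

    # The prices arrive as strings like '12,50'; the decimal comma still has to
    # be dealt with separately.
    df['price'] = df['price'].str.replace(',', '.').astype(float)
    print(df)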
Microsoft did this very extensively. Many non-English versions of Excel save CSV files with a semicolon as the separator, and it was probably handled differently in normal Excel files too.
But it goes even further: it affects their scripting languages to this day, even newer ones like their BI script (I forget the name of the language). For example, parameters of function calls aren't separated by ',' anymore; ';' is used instead. But only in the localized versions.
That of course means that you have to translate these scripts depending on the locale set in your office suite, otherwise they are full of syntax errors...
There are better separators included in ASCII, but not used as often: 28 File Separator, 29 Group Separator, 30 Record Separator and 31 Unit Separator.
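(For the curious, a toy round-trip with the unit and record separators in Python. As others point out below, this only buys you anything if the data itself can never contain those bytes:)

    UNIT_SEP = '\x1f'    # ASCII 31, between fields
    RECORD_SEP = '\x1e'  # ASCII 30, between records

    def dump(rows):
        return ''.join(UNIT_SEP.join(fields) + RECORD_SEP for fields in rows)

    def load(text):
        return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP) if record]

    rows = [['id', 'note'], ['1', 'contains, a comma and "quotes"']]
    assert load(dump(rows)) == rows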
> Then there will be your data, but they will format currency values as prettified strings, for example "34 593,12 USD", instead of producing one column with a number and another with currency.
To be fair, that's not a problem with CSV but with the provider's lack of data literacy.
I worked in a web shop which had to produce spreadsheets which people wanted to look at in Excel. I gave them so many options, and told each client to experiment and choose the option which worked for them. In the end, we had (a) UTF-8 CSV, (b) UTF-8 CSV with BOM, (c) UTF-16 TSV, (d) UTF-8 HTML table with a .xlsx file extension and a lying Content-Type header which claimed it was an Excel spreadsheet.
Option a worked fine so long as none of the names in the spreadsheet had any non-ASCII characters.
Option d was by some measures the worst (and was definitely the largest file size), but it did seem to consistently work in Excel and Libre Office. In fact, they all worked without any issue in Libre Office.
I agree and as a result I have completely abandoned CSV.
I use the industry standard that everyone understands: ECMA-376, ISO/IEC 29500 aka .xlsx.
Nobody has any problems producing or ingesting .xlsx files. The only real problem is the confusion between numbers and numeric text that happens when people use excel manually. For machine to machine communication .xlsx has never failed me.
Now you might argue that ECMA-376 accounts for this, because it has a `date1904` flag, which has to be 0 for 1900-based dates and 1 for 1904-based dates. But what does that really accomplish if you can’t be sure that vendors understand subtleties like that if they produce or consume it? Last time I checked (maybe 8 years ago), spreadsheets created on Windows and opened on Mac still shifted dates by four years, and the bug was already over twenty years old at that time.
And the year-1904 issue is just the one example that I happen to know.
I have absolutely zero confidence in anything that has touched, or might have touched, MS Excel with anything short of a ten-foot pole.
Parsing Excel files in simple data interchange use cases that don't involve anyone manually using spreadsheets is an instance of unnecessary complexity. There are plenty of alternatives to CSV that remain plaintext, have much broader support, and are more rigorous than Excel in ensuring data consistency. You can use JSON, XML, ProtoBuf, among many other options.
If only there were character codes specifically meant to separate fields and records.... we wouldn't have to worry so much about quoted commas or quoted quotes.
That isn't solving anything, just changing the problem. If I want to store a string containing 0x1C - 0x1F in one of the columns then we're back in the exact same situation while also losing the human readable/manually typeable aspect people seem to love about CSV. The real solution is a strict spec with mandatory escaping.
Not for text data. Those values are not text characters like , or " are, and have only one meaning. It would be like arguing that 0x41 isn't always the letter "A".
For binary files, yeah but you don't see CSV used there anyway.
The idea that binary data doesn't go in CSVs is debatable; people do all sorts of weird stuff. Part of the robustness of a format is coping with abuse.
But putting that aside, if the control chars are not text, then you sacrifice human-readability and human-writability. In which case, you may as well just use a binary format.
True, but very few people compose or edit CSV data in Notepad. You can, but it's very error-prone. Most people will use a spreadsheet and save as CSV, so field and record separator characters are not anything they would ever deal with.
I've dealt with a few cases of CSVs including base64-encoded binary data. It's an unusual scenario, but the tools for working with CSVs are robust enough that it was never an issue.
So in addition to losing human readability, we are also throwing away the ability to nest (pseudo-)CSVs? With comma delimiters, I can take an entire CSV document and put it in 1 column, but with 0x1C-0x1F delimiters and banning non-text valid utf-8 in columns I no longer can. This continues to be a step backwards.
There are lots of 8-bit mostly-ASCII character sets that assign printable glyphs to some or all of the codepoints that ASCII assigns to control characters. TeX defined one, and the IBM PC's "code page 437" defined another.
There are several ways a control character might inadvertently end up inside a text corpus.
Given enough millions of lines, it’s bound to happen, and you absolutely don’t want it to trip up your whole export because of that one occurrence. So yes, you have to account for it in text data, too.
You can disallow all control characters (ASCII < 32) other than CR/LF/TAB, which is reasonable. I don't know of any data besides binary blobs which uses those. I've never heard of anyone inlining a binary file (like an image) into a "CSV" anyway.
If you disallow control characters so that you can use them as delimiters, then CSV itself becomes a "binary" data format - or to put it another way, you lose the ability to nest CSV.
It isn't good enough to say "but people don't/won't/shouldn't do that", because it will just happen regardless. I've seen nested CSV in real-life data.
Compare to the zero-terminated strings used by C, one legacy of which is that PostgreSQL doesn't quite support UTF-8 properly, because it can't handle a 0 byte in a string, because 0 is "special" in C.
Right, but the original point I was responding to is that control characters are disallowed in the data and therefore don't need to be escaped. If you're going to have an escaping mechanism then you can use "normal" characters like comma as delimiters, which is better because they can be read and written normally.
It's good for a delimiter to be uncommon in the data, so that you don't have to use your escaping mechanism too much.
This is a different thing altogether from using "disallowed" control characters, which is an attempt to avoid escaping altogether - an attempt which I was arguing is doomed to fail.
But it is terrible at that, because there is no widely adhered-to standard[1]; the sender and receiver often disagree on the details of what exactly a CSV is.
[1]: yes, I know about RFC 4180. But csvs in the wild often don't follow it.
The only times I hated CSV was when it came from another system I had no control over. For example Windows and their encodings, or some other proprietary BS.
But CSV under controlled circumstances is very simple.
And speaking of Wintendo, the bonus is often that you can go straight from CSV to Excel presentation for the middle management.
The post should at least mention in passing the major problem with CSV: it is a "no spec" family of de-facto formats, not a single thing (it is an example of "historically grown").
And the omission of that means I'm going to have to call this out for its bias (but then it is a love letter, and love is blind...).
Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files, and there are many flavours that are incompatible with each other, in the sense that a reader for one flavour would not be suitable for reading the other and vice versa. Quoting, escaping and UTF-8 support are particular problem areas, but also that you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess, but there are ambiguities in it that cannot be resolved by machine).
Having worked extensively with SGML for linguistic corpora, with XML for Web development and recently with JSON, I would say that programmatically JSON is the most convenient to use in client code, but its lack of types makes it less broadly useful than SGML, which is rightly used by e.g. airlines for technical documentation and by digital humanities researchers to encode/annotate historic documents, for which it is very suitable, but which programmatically puts more burden on developers. You can't have it all...
XML is simpler than SGML, has perhaps the broadest scope and a good software support stack (mostly FOSS), but it has been abused a lot (nod to Java coders: Eclipse, Apache UIMA), though I guess a format is not responsible for how people use or abuse it. As usual, the best developers know the pros and cons and make good-taste judgments about what to use each time, but some people go ideological.
(Waiting for someone to write a love letter to the infamous Windows INI file format...)
In fairness there are also several ambiguities with JSON. How do you handle multiple copies of the same key? Does the order of keys have semantic meaning?
jq supports several pseudo-JSON formats that are quite useful like record separator separated JSON, newline separated JSON. These are obviously out of spec, but useful enough that I've used them and sometimes piped them into a .json file for storage.
Also, encoding things like IEEE NaN/Infinity, and raw byte arrays has to be in proprietary ways.
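(Newline-delimited JSON in particular is trivial to produce and consume even without jq; a rough Python sketch, purely illustrative:)

    import json

    records = [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b,c'}]

    # Write one JSON document per line ("JSON Lines" / NDJSON).
    with open('records.jsonl', 'w') as f:
        for rec in records:
            f.write(json.dumps(rec) + '\n')

    # Read it back, streaming line by line instead of loading one huge array.
    with open('records.jsonl') as f:
        parsed = [json.loads(line) for line in f if line.strip()]

    assert parsed == records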
> The JSON syntax does not impose any restrictions on the strings used as names, does not require that name strings be unique, and does not assign any significance to the ordering of name/value pairs.
That IS unambiguous.
And for more justification:
> Meaningful data interchange requires agreement between a producer and consumer on the semantics attached to a particular use of the JSON syntax. What JSON does provide is the syntactic framework to which such semantics can be attached
> JSON is agnostic about the semantics of numbers. In any programming language, there can be a variety of number types of various capacities and complements, fixed or floating, binary or decimal.
> It is expected that other standards will refer to this one, strictly adhering to the JSON syntax, while imposing semantics interpretation and restrictions on various encoding details. Such standards may require specific behaviours. JSON itself specifies no behaviour.
It all makes sense when you understand JSON is just a specification for a grammar, not for behaviours.
> and does not assign any significance to the ordering of name/value pairs.
I think this is outdated? I believe that the order is preserved when parsing into a JavaScript Object. (Yes, Objects have a well-defined key order. Please don't actually rely on this...)
> Valid JSON text is a subset of the ECMAScript PrimaryExpression syntax. Step 2 verifies that jsonString conforms to that subset, and step 10 asserts that that parsing and evaluation returns a value of an appropriate type.
And in the algorithm
    c. Else,
        i. Let keys be ? EnumerableOwnProperties(val, KEY).
        ii. For each String P of keys, do
            1. Let newElement be ? InternalizeJSONProperty(val, P, reviver).
            2. If newElement is undefined, then
                a. Perform ? val.[[Delete]](P).
            3. Else,
                a. Perform ? CreateDataProperty(val, P, newElement).
If you theoretically (not practically) parse a JSON file into a normal JS AST then loop over it this way, because JS preserves key order, it seems like this would also wind up preserving key order. And because it would add those keys to the final JS object in that same order, the order would be preserved in the output.
> (Yes, Objects have a well-defined key order. Please don't actually rely on this...)
JS added this in 2009 (ES5) because browsers already did it and loads of code depended on it (accidentally or not).
There is theoretically a performance hit to using ordered hashtables. That doesn't seem like such a big deal with hidden classes except that `{a:1, b:2}` is a different inline cache entry than `{b:2, a:1}` which makes it easier to accidentally make your function polymorphic.
In any case, you are paying for it, you might as well use it if (IMO) it makes things easier. For example, `let copy = {...obj, updatedKey: 123}` is relying on the insertion order of `obj` to keep the same hidden class.
I-JSON (short for "Internet JSON") is a restricted profile of JSON designed to maximize interoperability and increase confidence that software can process it successfully with predictable results.
So it's not JSON, but a restricted version of it.
I wonder if use of these restrictions is popular. I had never heard of I-JSON.
I think it's rare for them to be explicitly stated, but common for them to be present in practice. I-JSON is just an explicit list of these common implicit limits. For any given tool/service that describes itself as accepting JSON, I would expect I-JSON documents to be more likely to work as expected than non-I-JSON.
> How do you handle multiple copies of the same key? Does the order of keys have semantic meaning?
This is also an issue, due to the way the order of keys works in JavaScript, too.
> record separator separated JSON, newline separated JSON.
There is also JSON with no separators, although that will not work very well if any of the top-level values are numbers.
> Also, encoding things like IEEE NaN/Infinity, and raw byte arrays has to be in proprietary ways.
Yes, as well as non-Unicode text (including (but not limited to) file names on some systems), and (depending on the implementation) 64-bit integers and big integers. Possibly also date/time.
I think DER avoids these problems. You can specify whether or not the order matters, you can store Unicode and non-Unicode text, NaN and Infinity, raw byte arrays, big integers, and date/time. (It avoids some other problems as well, including canonization (DER is already in canonical form) and other issues. Although, I have a variant of DER that avoids some of the excessive date/time types and adds a few additional types, but this does not affect the framing, which can still be parsed in the same way.)
A variant called "Multi-DER" could be made up, which is simply concatenating any number of DER files together. Converting Multi-DER to BER is easy just by adding a constant prefix and suffix. Converting Multi-DER to DER is almost as easy; you will need the length (in bytes) of the Multi-DER file and then add a prefix to specify the length. (In none of these cases does it require parsing or inspecting or modifying the data at all. However, converting the JSON variants into ordinary JSON does require inspecting the data in order to figure out where to add the commas.)
`JSON.parse` actually does give you that option via the `reviver` parameter, which gives you access to the original string of digits (to pass to `BigInt` or the number type of your choosing) – so per this conversation fits the "good parser" criteria.
To be specific (if anyone was curious), you can force BigInt with something like this:
    // MAX_SAFE_INTEGER is actually 9007199254740991, which is 16 digits.
    // You can instead check if exactly 16 and compare size one string digit at a time if absolute precision is desired.
    const bigIntReviver = (key, value, context) =>
        typeof value === 'number' && Math.floor(value) === value && context.source.length > 15
            ? BigInt(context.source)
            : value
    const jsonWithBigInt = x => JSON.parse(x, bigIntReviver)
Generally, I'd rather throw if a number is unexpectedly too big otherwise you will mess up the types throughout the system (the field may not be monomorphic) and will outright fail if you try to use math functions not available to BigInts.
Sorry, yes, I was thinking of the context object with the source parameter.
The issue it solves is a big one though, since without it JSON.parse cannot losslessly parse numbers larger than a 64-bit float can represent (e.g. bigints).
“No one owns CSV. It has no real specification (yes, I know about the controversial ex-post RFC 4180), just a set of rules everyone kinda agrees to respect implicitly. It is, and will forever remain, an open and free collective idea.”
They even seem to think it is a good thing. But I don't see how not having a bunch of implementations that can't agree on the specifics of a file/interchange format is a good thing. And being free and open is completely orthogonal. There are many proprietary formats that don't have a spec, and many open formats that do have a spec (like, say, json).
Waiting for someone to write a love letter to the infamous Windows INI file format
I actually miss that. It was nice when settings were stored right alongside your software, instead of being left behind all over a bloated registry. And the format was elegant, if crude.
I wrote my own library for encoding/writing/reading various datatypes and structure into ini's, in a couple different languages, and it served me well for years.
> instead of being left behind all over a bloated registry
Really? I think the idea of a central, generic, key-value pair database for all the setting on a system is probably the most elegant reasonable implementation there could be.
The initial implementation of Windows Registry wasn't good. It was overly simplistic and pretty slow. Though the "bloat" (whatever that means) of the registry hasn't been an actual issue in over 20 years. The only people invested in convincing you "it's an issue" are CCleaner type software that promise to "speed up your computer" if you just pay $6.99.
How many rows do you need in a sqlite database for it to be "bloated"?
> I feel like YAML is a spiritual successor to the .ini, since it shares a notable ideal of simple human readability/writability.
It doesn't feel that way to me: it's neither simple to read nor to write. I suppose that that's a builtin problem due to tree representation, which is something that INI files were never expected to represent.
TBH, I actually prefer the various tree representation workarounds used by INI files: using whitespace to indicate child nodes stops being readable once you have more than a screenful of children in a node.
YAML is readable? No way: there are too many ways to do the same thing, and nested structures are unclear to the untrained eye (what is a list? what is nested?). And indentation in large files is an issue, especially with the default 2-space standard so many people adhere to.
YAML simple? Its spec is larger than XML's... Parsing of numbers and strings is ambiguous; leading zeros are not strings but octal (implicit conversion...). Lists as keys? Oh ffs, and you said readable. And don't get me started about "Yes" being a boolean; it reminds me of the MS Access localizations which had different values for true and [the local variant of true] (1 vs -1).
Writable? Even worse. I think I have never been able to write a YAML file without errors. But that might just be me; XML is fine, though unreadable.
YAML 1.2 leaves data types ambiguous, merely making the "Norway problem" optional and at the mercy of the application rather than, in the words of https://yaml.org/type/ (which has not been marked as deprecated), "strongly recommended".
Those schemas aren't part of the core schema, and you may interpret them if you are aiming for full 1.1 compatibility. If you're aiming for 1.1 compatibility, then you accept the Norway problem.
As someone living in a country where , is used as the decimal separator, I cannot begin to describe the number of times CSV data has caused me grief. This becomes especially common in an office environment where Excel is the de facto only data handling tool that most people can and will use. Here the behavior of loading data becomes specific to the individual machine and changes over time (e.g. when IT suddenly forces a reset of MS Office application languages to the local one).
That said, I don't really know of any alternative that won't be handled even worse by my colleagues...
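(For the scripted side of it at least, the usual escape hatch is to be explicit about both the separator and the decimal mark when reading. A sketch with pandas, assuming the typical semicolon-separated, comma-decimal export:)

    import io
    import pandas as pd

    # Typical export from a comma-decimal locale: ';' between fields,
    # ',' as the decimal mark.
    raw = 'name;price\nwidget;12,50\ngadget;7,05\n'

    df = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')
    print(df.dtypes)   # price comes out as float64 instead of a string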
To be honest, I'm wondering why you are rating JSON higher than CSV.
> Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files,
There is, actually, RFC 4180 IIRC.
> there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa.
"There are many flavours that deviate from the spec" is a JSON problem too.
> you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess but there ambiguities in it that cannot be resolved by machine).
Also a problem in JSON
> Quoting, escaping, UTF-8 support are particular problem areas,
Sure, but they are no more nor no less a problem in JSON as well.
Have you had to work with csv files from the wild much? I'm not being snarky but what you're talking about is night and day to what I've experienced over the years.
There aren't vast numbers of different JSON formats. There's practically one and realistically maybe two.
Headers are in each line, utf8 has never been an issue for me and quoting and escaping are well defined and obeyed.
This is because for datasets, almost exclusively, the file is machine written and rarely messed with.
CSV files have all kinds of separators and quote characters; some parsers accept multi-line fields and some don't; people sort files, which mostly works until there's a multi-line field. All kinds of line endings, encodings and mixed encodings where people have combined files.
I tried using ASCII record separators after dealing with so many issues with commas, semicolons, pipes, tabs etc and still data in the wild had these jammed into random fields.
Lots of these things don't break when you hit the issue either, the parsers happily churn on with garbage data, leading to further broken datasets.
Also they're broken for clients if the first character is a capital I.
This might be my old “space and network cost savings” reflex, which is a lot less necessary these days, kicking in, but that feels inefficient. It also gives rise to not knowing the whole schema until you read the whole dataset (which might be multiple files), unless some form of external schema definition is provided.
Having said that, I accept that JSON has advantages over CSV, even if all that is done is translating a data-table into an array of objects representing one row each.
> utf8 has never been an issue for me
The main problem with UTF8 isn't with CSV generally, it is usually, much like the “first column is called ID” issue, due to Excel. Unfortunately a lot of people interact with CSVs primarily with Excel, so it gets tarred with that brush by association. Unless Excel sees the BOM sequence at the start of a CSV file, which the Unicode standards recommend against for UTF8, it assumes its characters are using the Win1252 encoding (almost, but not quite, ISO-8859-1).
> Csv files have all kinds of separators
I've taken to calling them Character Separated Value files, rather than Comma, for this reason.
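(On the Excel point: in Python the pragmatic fix is the 'utf-8-sig' codec, which writes the BOM Excel looks for and strips it transparently when reading. A small sketch:)

    import csv

    rows = [['name', 'city'], ['Zoë', 'Łódź']]

    # 'utf-8-sig' prepends the BOM that makes Excel pick UTF-8; newline='' is
    # what the csv module expects when it manages line endings itself.
    with open('export.csv', 'w', encoding='utf-8-sig', newline='') as f:
        csv.writer(f).writerows(rows)

    with open('export.csv', encoding='utf-8-sig', newline='') as f:
        assert list(csv.reader(f)) == rows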
Yes, it's not great. Space is annoying, though compression pretty much removes that as a concern (zstd is good for this, you can even have a custom dictionary). And yes, missing keys is annoying.
JSONL is handy, JSON that's in the form {data: [...hundred megs of lines]} is annoying for various parsers.
I'm quite a fan of parquet, but never expect to receive that from a client (alas).
Parquet should get the praise. It's simply awesome.
It's what I'd pick for tabular data exchange.
A recent problem I solved with it and duckdb allowed me to query and share a 3M record dataset. The size? 50M. And my queries all ran subsecond. You just aren't going to get that sort of compression and query-ability with a csv.
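(For anyone who hasn't tried it, the duckdb Python package makes this workflow almost embarrassingly easy. A rough sketch; the file and column names here are made up:)

    import duckdb

    # Convert a CSV into Parquet once...
    duckdb.execute("""
        COPY (SELECT * FROM read_csv_auto('records.csv'))
        TO 'records.parquet' (FORMAT PARQUET)
    """)

    # ...then query the compressed, columnar file directly.
    print(duckdb.sql("""
        SELECT customer, sum(amount) AS total
        FROM 'records.parquet'
        GROUP BY customer
        ORDER BY total DESC
        LIMIT 10
    """).df())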
I wonder if CSV is the trivial format, so you have many people picking it because they want the easiest, and still getting it wrong. JSON is harder, so very few people are going to roll their own serializer/deserializer, and those who do are more likely to focus on getting it right (or at least catching the really obvious bugs).
I've dealt with incorrect CSVs numerous times, never with incorrect JSON, but, of the times I know what was happening on the other system, each time the CSV was from some in house (or similar) implementation of dumping a SQL output (or similar) into a text file as an MVP. JSON was always using some library.
If so, that's all the more reason to love CSV as it stands guard for JSON. If CSV didn't exist, we would instead have broken JSON implementations. (JSON and XML would likely then share a similar relationship.)
Sometimes people interpret the term too generically and actually implement a high degree of non-trivial, very idiosyncratic complexity, while still calling it "CSV".
One project I worked on involved a vendor promising to send us data dumps in "CSV format". When we finally received their "CSV" we had to figure out how to deal with (a) global fields being defined in special rows above the header row, and (b) a two-level hierarchy of semicolon-delimited values nested within comma-delimited columns. We had to write a custom parser to complete the import.
Sure, I get your arguments and we're probably mostly in agreement, but in practice I see very few problems arising with using CSV.
I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.
I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.
And yet, they work.
The number of people uploading excel files to be processed or downloading excel files for local visualistation and processing would floor you. It's done multiple times a day, on multiple systems, in multiple companies.
And yet, they work.
I get your argument though - a JSON array of arrays can represent everything that CSV can, and is preferable to CSV, and is what I would choose when given the choice, but the issues with using that are not going to be fewer than issues with CSV using RFC 4180.
>but in practice I see very few problems arising with using CSV
That is not my experience at all. I've been processing CSV files from financial institutions for many years. The likelihood of brokenness must be around 40%. It's unbelievable.
The main reason for this is not necessarily the CSV format as such. I believe the reason is that it is often the least experienced developers who are tasked with writing export code. And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.
JSON is better but it doesn't help with things like getting dates right. XML can help with that but it has complexities that people get wrong all the time (such as entities), so I think JSON is the best compromise.
> And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.
Can't they?
    def excel_csv_of(rows):
        for row in rows:
            for i, field in enumerate(row):
                if i:
                    yield ','
                yield '"'
                for c in field:
                    yield '""' if c == '"' else c
                yield '"'
            yield '\n'
I haven't tested this, even to see if the code parses. What did I screw up?
If my experience reflects a relevant sample then the answer is that most can but a very significant minority fails at the job (under the given working conditions).
Whether or not _you_ can is a separate question. I don't see anything wrong with your code. It does of course assume that whatever is contained in rows is correct. It also assumes that the result is correctly written to a file without making any encoding mistakes or forgetting to flush the stream.
Not using name value pairs makes CSV more prone to mistakes such as incorrect ordering or number of values in some rows, a header row that doesn't correspond with the data rows, etc. Some export files are merged from multiple sources or go through many iterations over many years, which makes such mistakes far more likely.
I have also seen files that end abruptly somewhere in the middle. This isn't specific to CSV but it is specific to not using libraries and not using libraries appears to be more prevalent when people generate CSV.
You'd be surprised how many CSV files are out there where the developer tried to guess incorrectly whether or not a column would ever have to be escaped. Maybe they were right initially and it didn't have to be escaped but then years later something causes a change in number formats (internationalisation) and bang, silent data corruption.
Prioritising correctness and robustness over efficiency as you have done is the best choice in most situations. Using a well tested library is another option to get the same result.
This forces each field to be quoted, and it assumes that each row has the same fields in the same order. A library can handle the quoting issues and fields more reliably. Not sure why you went with a generator for this either.
Most people expect something like
`12,,213,3`
instead of
`"12","213","3"`
which yours might give.
Forcing each field to be quoted is always correct, isn't it? How could something be "more reliable" than something that is always correct?
With respect to "the same fields in the same order", no, although you may or may not feed the CSV to an application that has such an expectation. But if you apply it to data like [("Points",),(),("x","y"),("3","4"),("6","8","10")] it will successfully preserve that wonky structure in a file Excel can ingest reliably. (As reliably as Excel can ingest anything, anyway, since Excel has its own Norway problem.)
It's true that it's possible to produce more optimized output, but I didn't claim that the output was optimal, just correct.
Using generators is necessary to be able to correctly output individual fields that are many times larger than physical memory.
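(For completeness, streaming it to disk looks something like this, assuming all fields are already strings:)

    # excel_csv_of as defined above; rows reuses the wonky example structure.
    rows = [("Points",), (), ("x", "y"), ("3", "4"), ("6", "8", "10")]

    # newline='' so Python doesn't translate the '\n' characters we emit ourselves.
    with open('out.csv', 'w', newline='', encoding='utf-8') as f:
        f.writelines(excel_csv_of(rows))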
I'll preface this that I think we are mostly in agreement, so that's the friendly tone of reply, part of this is just having flashbacks.
It's massively used, but the lack of adherence to a proper spec causes huge issues. If you have two systems that happen to talk properly to each other, great, but if you are, as I was, an entry point for all kinds of user-generated files, it's a nightmare.
CSV is the standard, sure, but it's easy to write code that produces it that looks right at first glance but breaks with some edge case. Or someone has just chosen a different separator, or quote, so you need to try and detect those before parsing (I had a list that I'd go through, then look for the most commonly appearing non-letter character).
The big problem is that the resulting semantically broken csv files often look pretty OK to someone scanning them and permissive parsers. So one system reads it in, splits something on lines and assumes missing columns are blank and suddenly you have the wrong number of rows, then it exports it. Worse if it's been sorted before the export.
Of course then there are also the issues around a lack of types, so numbers and strings are not automatically distinguishable, leading to breakage where you do want leading zeros. Again, often not identified until later. Or auto type detection in a system breaks because it sees a lot of number-like things and assumes it's a number column. Without types there's no verification either.
So even properly formatted CSV files need a second place for metadata about what types there are in the file.
JSON has some of these problems too, it lacks dates, but far fewer.
> but the issues with using that are not going to be fewer than issues with CSV using RFC 4180.
My only disagreement here is that I've had to deal with many ingest endpoints that don't properly support that.
Fundamentally I think nobody uses CSV files because they're a good format. They've big, slow to parse, lack proper typing, lack columnar reading, lack fast jumping to a particular place, etc.
They are ubiquitous, just not good, and they're very easy to screw up in hard to identify or fix ways.
Finally, lots of this comes up because RFC4180 is only from *2005*.
Oh, and if I'm reading the spec correctly, RFC4180 doesn't support UTF8. There was a proposed update maybe in 2022 but I can't see it being accepted as an RFC.
> I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.
And there are constant issues arising from that. You basically need a small team to deal with them in every institution that is processing them.
> I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.
Salesmen of enterprise systems do not care about the issues programmers and clients have. They care about what they can sell to other businessmen. That teams on both sides then waste time and money on troubleshooting is of no concern to the salesman. And I am saying that as someone who worked on an enterprise system that consumed a lot of CSV. It does not work, and the process of handling it literally sometimes involved phone calls to the admins of other systems. More often than would be sane.
> The number of people uploading excel files to be processed or downloading excel files for local visualistation and processing would floor you.
That is perfectly fine as long as it is a manager downloading data so that he can manually analyze them. It is pretty horrible when those files are then uploaded to other systems.
In practice, I have never ever received CSV to process that complied with RFC 4180, and in most cases it was completely incoherent and needed incredibly special handling to handle all the various problems like lack of escaping.
SAP has been by far the worst. I never managed to get data out of it that were not completely garbage and needed hand crafted parsers.
Through a lot of often-painful manual intervention. I've seen it first-hand.
If an organization really needs something to work, it's going to work somehow—or the organization wouldn't be around any more—but that is a low bar.
In a past role, I switched some internal systems from using CSV/TSV to using Parquet and the difference was amazing both in performance and stability. But hey, the CSV version worked too! It just wasted a ton of people's time and attention. The Parquet version was far better operationally, even given the fact that you had to use parquet-tools instead of just opening files in a text editor.
> There aren't vast numbers of different JSON formats.
Independent variations I have seen:
* Trailing commas allowed or not
* Comments allowed or not
* Multiple kinds of date serialization conventions
* Divergent conventions about distinguishing floating point types from integers
* Duplicated key names tolerated or not
* Different string escaping policies, such as, but not limited to "\n" vs "\u000a"
The JSON spec does not allow trailing commas, although there are JSON supersets that do.
> Comments allowed or not
The JSON spec does not allow comments, although there are JSON supersets that do.
> Multiple kinds of date serialization conventions
The JSON spec doesn't say anything about dates. That is dependent on your application schema.
> Divergent conventions about distinguishing floating point types from integers
This is largely due to the divergent ways different programming languages handle numbers. I won't say JSON handles this the best, but any file format used across multiple languages will run into problems with differences in how numbers are represented. At least there is a well-defined difference between a number and a string, unlike CSV.
> Duplicated key names tolerated or not
According to the spec, they are tolerated, although the semantics of such keys is implementation defined.
> Different string escaping policies, such as, but not limited to "\n" vs "\u000a"
Both of those are interpreted as the same thing, at least according to the spec. That is an implementation detail of the serializer, not a different language.
There are always many, but in comparison to CSV I've run into almost none of these differences. JSON issues were rare, but with CSV it was common to have a brand new issue per client.
Typically the big difference is that there are different parsers that are less tolerant of in-spec values. ClickHouse had a more restrictive parser, and recently I've dealt with Matrix.
Maybe I've been lucky for json and unlucky for csv.
Basically, Excel uses the equivalent of ‘file’ (https://man7.org/linux/man-pages/man1/file.1.html), sees the magic “ID”, and decides a SYLK file, even though .csv files starting with “ID” have outnumbered .SYLK files by millions for decades.
Thanks. So I guess the easy compatible solution is to always quote the first item on the first line when writing CSV. Good to know. (Checking if the item starts with ID is more work. Possibly quote all items on the first line for simplicity.) (Reading SYLK is obviously irrelevant, so accepting unquoted ID when reading is the smarter way to go and will actually improve compatibility with writers that are not Excel. Also it takes no work.)
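(With Python's csv module, the lazy version of "quote everything on the first line" is to point two writers at the same file handle; a sketch:)

    import csv

    header = ['ID', 'Name']
    rows = [['1', 'Alice'], ['2', 'Bob']]

    with open('export.csv', 'w', newline='') as f:
        # Quote the whole header row so Excel never sees a bare leading "ID"...
        csv.writer(f, quoting=csv.QUOTE_ALL).writerow(header)
        # ...and quote the data rows only where actually needed.
        csv.writer(f).writerows(rows)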
The byte for a capital I is the same as the start of an odd file format, SYLK maybe? Excel has (or did, if they finally fixed it) for years decided this was enough to assume the file (called .csv) cannot possibly be CSV but must actually be SYLK. It then parses it as such, and is shocked to find your SYLK file is totally broken!
Yes but in practice CSV is defined by what Excel does.
As there is no standard to which Excel conforms as it predates standards and there would be an outcry if Excel started rejecting files that had worked for years.
There is a common misconception here. You can import CSV files into an excel sheet. You cannot open a CSV file with excel. That is a nonsense operation.
- “Each record is located on a separate line, delimited by a line break (CRLF)” ⇒ editing .csv files using the typical Unix text editor is complicated.
- “Spaces are considered part of a field and should not be ignored”
- “Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes” ⇒ fields containing lone carriage returns or new lines need not be enclosed in double quotes.
INI was for a long time the seemingly preferred configuration format in the Python community, as I recall.
I haven't been a full-time Python dev in some time though; it seems TOML has supplanted that, but I remember thinking how interesting it was that Python had a built-in INI parser and serializer.
While CSV isn't exactly grammared or standardised like XML, I think of it as more schema'd than JSON. There might be data corruption or consistency issues, but there is implicitly a schema: every line is exactly n fields, and the first line might contain field names.
When a JSON API turns out to have optional fields it usually shows through trial and error, and unlike CSV it's typically not considered a bug you can expect the API owner to fix. In CSV 'missing data' is an empty string rather than nulls or their cousins because missing fields aren't allowed, which is nice.
I also like that I can write my own ad hoc CSV encoder in most programming languages that can do string concatenation, and probably also a suitable decoder. It helps a lot in some ETL tasks and debugging. Decent CSV also maps straight to RDBMS tables, if the database for some reason fails at immediate import (e.g. too strict expectations) into a newly created table it's almost trivial to write an importer that does it.
JSON is not schema’d per se and intentionally so. There’s jsonschema which has better expressiveness than inference of a tabular schema, as it can reflect relationships.
I lived through the SOAP/WSDL horror, with its numerous standards and the lack of compatibility between stacks in different programming languages. Having seen both XML and CSV abused, CSV is preferable over XML to me. Human readability matters. Relative simplicity matters.
Even though JSON may also be interpreted differently by different tools, it is a good default choice for communicating between programs.
> lack of compatibility between stacks in different programming languages
Well, that sure beats OpenAPI lack of compatibility between stacks in the same programming language.
I think the fact that one can't randomly concatenate strings and call it "valid XML" is a huge bonus over the very common "join strings with comma and \r\n", non-RFC4180-compliant (therefore mostly unparseable without human/LLM interaction) garbage people often pretend is CSV.
There is no file format that works out of box under all extreme corner cases.
You would think that e.g. an XML-defined WSDL with an XSD schema is well battle-proven. Two years ago I encountered (and am still dealing with) a WSDL from a major banking vendor that is technically valid, but no open-source library in Java (of all languages) was able to parse it successfully or generate binding classes out of the box.
Heck, even flat files can end up with extreme cases; work enough with legacy banking or regulatory systems and you will see some proper shit.
The thing is, any sort of critical integration needs to be battle tested and continuously maintained, otherwise it will eventually go bad, even a decade after implementation and regular use without issues.
XML is a pretty good markup language. Using XML to store structured data is an abuse of it. All the features that are useful for markup are not useful for structured data and only add to the confusion.
> Waiting for someone to write a love letter to the infamous Windows INI file format...
Honestly, it’s fine. TOML is better if you can use it, but otherwise for simple applications, it’s fine. PgBouncer still uses INI, though that in particular makes me twitch a bit, due to discovering that if it fails to parse its config, it logs the failed line (reasonable), which can include passwords if it’s a DSN string.
Well, once you get over the fact that information in a TOML file can appear out of order in any place, written with any mix of 3 different key encodings, and broken down in any random way... then yes, the rest of TOML is good.
CSV is ever so elegant but it has one fatal flaw - quoting has "non-local" effects, i.e. an extra or missing quote at byte 1 can change the meaning of a comma at byte 1000000. This has (at least) two annoying consequences:
1. It's tricky to parallelise processing of CSV.
2. A small amount of data corruption can have a big impact on the readability of a file (one missing or extra quote can bugger the whole thing up).
So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
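(Concretely, something like the sketch below. The nice property is that a raw newline in the file is always a real record boundary, so corruption stays local and you can split work by line without parsing anything first:)

    def escape(field):
        return (field.replace('\\', '\\\\')
                     .replace(',', '\\,')
                     .replace('\n', '\\n'))

    def unescape(field):
        out, i = [], 0
        while i < len(field):
            if field[i] == '\\':
                nxt = field[i + 1]
                out.append('\n' if nxt == 'n' else nxt)
                i += 2
            else:
                out.append(field[i])
                i += 1
        return ''.join(out)

    def dump_row(fields):
        return ','.join(escape(f) for f in fields) + '\n'

    def parse_row(line):
        fields, cur, i = [], [], 0
        line = line.rstrip('\n')
        while i < len(line):
            c = line[i]
            if c == '\\':
                cur.append(line[i:i + 2])   # keep the escape pair together
                i += 2
            elif c == ',':
                fields.append(unescape(''.join(cur)))
                cur = []
                i += 1
            else:
                cur.append(c)
                i += 1
        fields.append(unescape(''.join(cur)))
        return fields

    row = ['a,b', 'line\nbreak', 'back\\slash']
    assert parse_row(dump_row(row)) == row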
> It’s that libraries and tools to parse CSV tend to suck. Whereas JSON is the lingua franca of data.
This isn't the case. An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity. Despite the lack of any sort of specification, it's easily the most widely supported data format in existence in terms of tools and language support.
> An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity.
Yea and it's still a partially-parseable shit show with guessed values. But we can and could have and should have done better by simply defining a format to use.
Well, Excel has a lot of common use-cases around processing numeric (and particularly financial) data. Since some locales use commas as decimal separators, using a character that's frequently present as a piece of data as a delimiter is a bit silly; it would be hard to think of a _worse_ character to use.
So, that means that Excel in those locales uses semicolons as separators rather than the more-frequently-used-in-data commas. Probably not the decision I'd make in retrospect, but not completely stupid.
Every time you cat a BSV file, your terminal beeps like it's throwing a tantrum. A record separator (RS) based file would be missing this feature! In other words, my previous comment was just a joke! :)
By the way, RS is decimal 30 (not octal '\030'). In octal, RS is '\036'. For example:
$ printf '\036' | xxd -p
1e
$ printf '\x1e' | xxd -p
1e
Can you point me to a language with any significant number of users that does NOT have a JSON library?
I went looking at some of the more niche languages like Prolog, COBOL, RPG, APL, Eiffel, Maple, MATLAB, tcl, and a few others. All of these and more had JSON libraries (most had one baked into the standard library).
The exceptions I found (though I didn't look too far) were: Bash (use jq with it), J (an APL variant), Scratch (not exposed to users, but scratch code itself is encoded in JSON), and Forth (I could find implementations, but it's very hard to pin down forth dialects).
CSV tooling has had to invest enormous amounts of effort to make a fragile, under-specified format half-useful. I would call it ubiquitous, I would call the tooling that we’ve built around it “impressive” but I would by no means call any of it “good”.
I do not miss dealing with csv files in the slightest.
> CSV tooling has had [...] to make a fragile, under-specified format half-useful
You have this backwards. Tabular structured data is ubiquitous. Text as a file format is also ubiquitous, because it is accessible. The only actual decisions are whether to encode your variables as rows or columns, what the delimiter is, and other rules such as escaping. Vars as columns makes sense because it makes appending easier. There is a bunch of stuff that can be used for delimiters, commas being the most common; none is perfect. But from that point onwards the decisions don't really matter much, and "CSV" basically covers everything from there on. "CSV" is what comes naturally when you have tabular datasets and want to store them as text. CSV tooling is developed because there is a need for this way of formatting data. Whether CSV is "good" or "ugly" or whatever is irrelevant; handling data is as complicated as the world itself. The alternatives are either not structuring/storing the data in a tabular manner, or non-text (e.g. binary) formats. These alternatives exist and are useful in their own right, but they don't solve the same problems.
I think the issue is that CSV parsing is really easy to screw up. You mentioned delimiter choice and escaping, and I’d add header presence/absence to that list.
There are at least 3 knobs to turn every time you want to parse a CSV file. There’s reasonably good tooling around this (for example, Python’s CSV module has 8 parser parameters that let you select stuff), but the fact that you have to worry about these details is itself a problem.
You said “handling data is complicated as much as the world itself is”, and I 100% agree. But the really hard part is understanding what the data means, what it describes. Every second spent on figuring out which CSV parsing option I have to change could be better spent actually thinking about the data.
I am kind of amazed at how people complain about having to parse what is practically a random file.
Whether there is a header or not should be specified up front, and one should not try to parse some unknown file, because that will always end in failure.
If you have your own serialization and your own parsing working, then yeah, it will simply work.
But not pushing errors back to the user and trying to deal with everything yourself is going to be frustrating, because the number of edge cases is almost infinite.
Handling random data is hard; calling it CSV and trying to support everything that comes with it is hard.
Microsoft Windows has had to invest enormous amounts...
Apple macOS has had to invest enormous amounts...
Pick your distro of Linux has had to invest enormous amounts...
None of them a perfect and any number of valid complaints can be said about any of them. None of the complaints make any of the things useless. Everyone has workarounds.
Hell, JSON has had to invest enormous amounts of effort...
I guess the point is that I can take a generic json parser and point it at just about any JSON I get my hands on, and have close to no issues parsing it.
Want to do the same with csv? Good luck. Delimiter? Configurable. Encoding? Configurable. Misplaced comma? No parse in JSON, in csv: might still parse, but is now semantically incorrect and you possibly won’t know until it’s too late, depending on your parser. The list goes on.
You claimed that CSV is "easily the most widely supported data format in existence in terms of tools and language support", which is a claim that CSV is better supported than JSON, which is a claim that JSON support is lacking.
Importing CSVs into Excel can be a huge pain due to how Excel handles localisation. It can basically alter your data if you are not mindful of that, and I have seen it happen too many times.
Depends on what you mean by "better". I would rather software not handle a piece of data at all than handle it erroneously, changing the data without me realising and thus causing all sorts of issues afterwards.
Before you dismiss it as 'not a language', people have argued that it is. And you can definitely program stuff in it, so that surely makes it a language.
Excel can import and parse JSON, it's under the "Get Data" header. It doesn't have a direct GUI way to export to JSON, but it takes just a few lines in Office Scripts. You can even use embedded TypeScript to call JSON.stringify.
> it's easily the most widely supported data format in existence in terms of tools and language support.
Even better, the majority of the time I write/read CSV these days I don't need to use a library or tools at all. It'd be overkill. CSV libraries are best saved for when you're dealing with random CSV files (especially from multiple sources) since the library will handle the minor differences/issues that can pop up in the wild.
It's just that people tend to use specialized tools for encoding and decoding it instead of like ",".join(row) and row.split(",")
I have seen people try to build up JSON strings like that too, and then you have all the same problems.
So there is no problem with CSV except that maybe it's too deceptively simple. We also see people trying to build things like URLs and query strings without using a proper library.
There is a clear standard, and it's usually written in an old Word 97 doc on a local file server. Using CSV means that you are the compatibility layer, and this is useful if you need firm control or understanding of your data.
If that sounds like a lot of edge-case work keep in mind that people have been doing this for more than half a century. Lots of examples and notes you can steal.
JSON has clearly defined standards: ISO/IEC 21778:2017, IETF RFC 7159, and ECMA-404. Additionally, Crockford has had a spec available on json.org since its creation in 2001.
Do you have any examples of Python, Java, or any of the other Tiobe top 40 languages breaking the JSON spec in their standard library?
In contrast, for the few of those that have CSV libraries, how many of those libraries will simply fail to parse a large number of the .csv variations out there?
You need more than a standard; that standard has to be complete and unambiguous. What you're looking for is https://github.com/nst/JSONTestSuite
EDIT: The readme's results are from 2016, but there are more recent results (last updated 5 years ago). Of the 54 parsers/versions tested, 7 always gave the expected result per the spec (disregarding cases where the spec does not define a result).
JSON doesn't fail for very large values because they are sent over the wire as strings. Only parsers may fail if they or their backing language doesn't account for BigInts or floats larger than f64, but these problems exist when parsing any string to a number.
And indeed this applies to CSV as well: it's just strings at the end of the day, and it's up to the parser to make sense of it into the data types one wants. There is nothing inherently stopping you from parsing a JSON string into a uint64: I've done so plenty!
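For example, a trivial sketch (the field name is made up):

```python
import json

# Large integers sent as JSON strings survive the trip untouched;
# it's up to the consumer to parse them into whatever width it wants.
doc = json.loads('{"id": "18446744073709551615"}')   # u64::MAX, as a string
value = int(doc["id"])                               # Python ints are arbitrary precision
assert value == 2**64 - 1
```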
Trailing commas and comments are plainly not standard JSON under any definition. There are standards that include them which extend JSON, sure, but I'm not aware of any JSON library that emits this kind of stuff by default.
> It's just that people tend to use specialized tools for encoding and decoding it instead of like ",".join(row) and row.split(",")
You really can't just split on commas for CSV. You need to handle quoted strings, since records can have commas inside a string, and you need to handle quoting itself, since you need to know where a string ends and that string may have internal quote characters. For either format, unless you know your data very well, you need to use a library.
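A tiny illustration, using Python's standard csv module for the comparison (the data is made up):

```python
import csv
import io

line = '1,"Smith, John","He said ""hi""",NY'

# Naive split: the embedded comma and the doubled quotes wreck the fields.
print(line.split(","))
# ['1', '"Smith', ' John"', '"He said ""hi"""', 'NY']

# A real CSV reader handles quoting and escaped quotes.
print(next(csv.reader(io.StringIO(line))))
# ['1', 'Smith, John', 'He said "hi"', 'NY']
```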
Until you have a large amount of data and need either random access or to work on multiple full columns at once. Duplicated key names mean it's very easy for data in jsonlines format to be orders of magnitude larger than the same data as CSV, which is incredibly annoying if your processing isn't amenable to streaming.
Space-wise, as long as you compress it, it's not going to make any difference. I suspect a JSON parser is a bit slower than a CSV parser, but the slight extra CPU usage is probably worth the benefits that come with JSON.
Eh, it really isn't. The format does not lend itself to tabular data, instead the most natural way of representing data involves duplicating the keys N times for each record.
Almost, except the way Excel-style quoting works with newlines sucks - you end up with rows that span multiple lines, so you can't split on newline to get individual rows.
With JSON those new lines are \n characters which are much easier to work with.
I ended up parsing the XML format instead of the CSV format when handling paste from Excel due to the newlines issue.
CSV seemed so simple but after numerous issues, a cell with both newline and " made me realize I should keep the little hair I had left and put in the work to parse the XML.
It's not great either, with all its weird tags, but at least it's possible to parse reliably.
This is the way. jsonl where each row is a json list. It has well-defined standard quoting.
Just like csv you don't actually need the header row either, as long as there's convention about field ordering. Similar to proto bufs, where the field names are not included in the file itself.
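A minimal sketch of that setup (the filename is made up; field order is fixed by convention rather than a header):

```python
import json

rows = [["alice", 34593.12, "USD"], ["bob", 120.50, "EUR"]]

# Write: one JSON array per line; quoting and escaping are the encoder's problem.
with open("data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read: split on newline, then parse each line independently.
with open("data.jsonl", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]

assert parsed == rows
```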
This misses the point of standardization imo because it’s not possible to know a priori that the first line represents the variable names, that all the rows are supposed to have the same number of elements and in general that this is supposed to represent a table. An arbitrary parser or person wouldn’t know to guess since it's not standard or expected. Of course it would be parsed fine but the default result would be a kind of structure or multi-array rather than tabular.
Types at the type layer are not the same as types at the semantic layer. Sure, every type at the JSON level has a "strong type", but the semantic meaning of the contents of e.g. a string is usually not expressible in pure JSON. So it is with CSV; you can think of every cell in CSV as containing a string (a series of bytes), with it being up to you to enforce the semantics atop those bytes. JSON gives you a couple of extra types, and if you can fit things into those types, great, but for most concrete, semantically meaningful data you won't be able to, and you'll end up in a similar world to CSVs.
I see an array of arrays. The first and second arrays have two strings each, the last one has a float and a string. All those types are concrete.
Let's say those "1.1" and 7.4 values are supposed to be version strings. If your code is only sometimes putting quotes around the version string, the bug is in your code. You're outputting a float sometimes, but a string in others. Fix your shit. It's not your serialization format that's the problem.
If you have "7.4" as a string, and your serialization library is saying "Huh, that looks like a float, I'm going to make it a float", then get a new library, because it has a bug.
You're missing my point: basically nothing spits out data in that format because it's not ergonomic to do so. JSON is designed to represent object hierarchies, not tabular data.
JSON is lists of lists of any length and groups of key/value pairs (basically lisp S-expressions with lots of unnecessary syntax). This makes it a superset of CSV's capabilities.
JSON fundamentally IS made to represent tabular data, but it's made to represent key-value groups too.
Why make it able to represent tabular data if that's not an intended use?
> JSON is lists of lists of any length and groups of key/value pairs
The "top-level" structure of JSON is usually an object, but it can be a list.
> JSON fundamentally IS made to represent tabular data
No, it's really not. It's made to represent objects consisting of a few primitive types and exactly two aggregate types: lists and objects. It's a textual representation of the JavaScript data model and even has "Object" in the name.
> Why make it able to represent tabular data if that's not an intended use?
It's mostly a question of specialization and ergonomics, which was my original point. You can represent tabular data using JSON (as you can in JavaScript), but it was not made for it. Anything that can represent """data""" and at least 2 nesting levels of arbitrary-length sequences can represent tabular data, which is basically every data format ever regardless of how awkward actually working with it may be.
The fact that json can represent a superset of tabular data structures that csv is specifically designed to represent can be rephrased into that csv is more specialised than json in representing tabular data. The fact that json can also represent tabular data does not mean it is a better or more efficient way to represent that data instead of a format like csv.
In the same way, there are hierarchically structured datasets that can be represented both by JSON in hierarchical form and by CSV in tabular form by repeating certain variables, but if using CSV would require repeating them too many times, it would be a bad idea to choose it instead of JSON. The fact that you can do something does not always make it a good idea to do it. The question imo is about which way would be more natural, easy or efficient.
> The fact that json can represent a superset of tabular data structures that csv is specifically designed to represent can be rephrased into that csv is more specialised than json in representing tabular data. The fact that json can also represent tabular data does not mean it is a better or more efficient way to represent that data instead of a format like csv.
The reverse is true as well: being more specialized is a description of goals, not advantages.
> This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file
But of course, CSV is the wild west and there's no guarantee that any two encoders will do the same thing (sometimes, there's not even a guarantee that the same encoder will do the same thing with two different inputs).
Ideally the header covers every column that contains data items, and every data item in a row has a header for its respective column, but real CSV files should be assumed to have incomplete or variable-length lines.
"Any JSON primitive" does add a few requirements not semantically comparable to CSV, like numbers that are numbers, and keywords true, false, none.
When these syntaxes are parsed into objects, either the type info has to be retained, or some kind of attribute tag, so they can be output back to the same form.
> make it so any consumer can parse it by splitting on newline and then ...
There is something like that called JSON-lines. It has a .org domain 'n' everything:
JSON was designed to represent any data. There's plenty of systems that spit out data in exact that format because it's the natural way to represent tabular data using JSON serialization. And clearly if you're the one building the system you can choose to use it.
JSON is designed to represent JavaScript objects with literal notation. Guess what, an array of strings or an array of numbers or even an array of mixed strings and numbers is a commonly encountered format in JavaScript.
The new line character in a JSON string would always be \n. The new line in the record itself as whitespace would not be acceptable as that breaks the one line record contract.
Remember that this does not allow arbitrary representation of serialized JSON data. But it allows for any and all JSON data as you can always roundtrip valid JSON to a compact one line representation without extra whitespace.
Actually, even whitespace-separated JSON would be a valid format, and if you forbid JSON documents from being a single integer or float, then even just concatenating JSON gives a valid format, as JSON is a prefix-free language.
That is[0], if a string s of length n is valid JSON, then no prefix s[0..i] for i < n is valid JSON.
So you could just consume as many bytes as you need to produce a JSON document and then start a new one when that one is complete. To handle malformed data you just need to throw out the partial data on a syntax error and start from the following byte (and likely throw away data a few more times if the error was in the middle of a document).
That is, [][]""[][]""[] is unambiguous to parse[1].
[0] again assuming that we restrict ourselves to string, null, boolean, array and objects at the root
[1] still this is not a good format as a single missing " can destroy the entire document.
In jsonl a modified chunk will lose you at most the removed lines and the two adjacent ones (unless the noise is randomly valid JSON); in particular, a single-byte edit can destroy at most 2 lines.
utf-8 is also similarly self-correcting and so is html and many media formats.
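For the concatenated-JSON idea above, the consume-and-resynchronize loop can be sketched with Python's json.JSONDecoder.raw_decode, which returns the parsed value plus the index where parsing stopped. A toy sketch, not a real format:

```python
import json

def iter_concatenated_json(text):
    """Yield successive JSON documents from a concatenated stream.

    On a syntax error, skip one character and try again, discarding the
    partial document -- the recovery strategy described above.
    """
    decoder = json.JSONDecoder()
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        try:
            obj, end = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            i += 1          # throw away the byte and resynchronize
            continue
        yield obj
        i = end

print(list(iter_concatenated_json('[][]""[][]""[]')))
# [[], [], '', [], [], '', []]
```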
My point was that in my made-up concatenated json format
[]"""[][][][][][][][][][][]"""[]
and
[]""[][][][][][][][][][][]""[]
are both valid and differ by only 2 bytes, yet have entirely different structures.
Also it is a made-up format nobody uses (if somebody were to want this they would likely disallow strings at the root level).
When you need to encode the newline character in your data, you say \n in the JSON. Unlike (the RFC dialect of) CSV, JSON has an escape sequence denoting a newline and in fact requires its use. The only reason to introduce newlines into JSON data is prettyprinting.
It's tricky, but simple enough: the RFC states that " must be used for quoting, and a literal " inside a quoted field is written as "". This makes knowing where a record ends difficult, since you must keep a variable accumulating the entire string.
How do you do this simply? You read each line, and if the accumulated text has an odd number of " characters, you have an incomplete record, so you keep appending lines until the count is even. After you have the full record, parsing the fields correctly is harder, but you can do it with regex, PEGs, or a disgusting state machine.
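Roughly, as a sketch (it assumes RFC-style "" escaping and no stray quotes; the filename is made up, and field splitting is left to the state machine mentioned above):

```python
def records(lines):
    """Group physical lines into logical CSV records by counting quotes.

    A record is complete when the accumulated text contains an even number
    of double quotes; otherwise a quoted field spans the line break.
    """
    buf = []
    for line in lines:
        buf.append(line.rstrip("\r\n"))
        if "".join(buf).count('"') % 2 == 0:
            yield "\n".join(buf)
            buf = []
    if buf:
        yield "\n".join(buf)   # trailing incomplete record, if any

with open("export.csv", encoding="utf-8") as f:   # hypothetical file
    for record in records(f):
        pass  # splitting the fields still needs proper quote handling
```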
(...and many more.) "This comes up every single time someone mentions CSV. Without fail." - top reply from burntsushi in that last link, and it remains as true today as in 2017 :D
You're not wrong though, we just need some major text editor to get the ball rolling and start making some attempts to understand these characters, and the rest will follow suit. We're kinda stuck at a local optimum which is clearly not ideal but also not troublesome enough to easily drum up wide support for ADSV (ASCII Delimiter Separated Values).
>we just need some major text editor to get the ball rolling and start making some attempts to understand these characters
Many text editors offer extensions APIs, including Vim, Emacs, Notepad++. But the ideal behavior would be to auto-align record separators and treat unit separators as a special kind of newline. That would allow the file to actually look like a table within the text editor. Input record separator as shift+space and unit separator as shift+enter.
Hahah, I came here to make the comment about ASCII's control characters, so I'm glad someone else beat me to it, and also that someone further pointed out that this topic comes up every time someone mentions CSV!
And that’s why I tend to use tab delimited files more… when viewed with invisible characters shown, it’s pretty clear to read/write separate fields and have an easier to parse format.
This, of course, assumes that your input doesn’t include tabs or newlines… because then you’re still stuck with the same problem, just with a different delimiter.
As soon as you give those characters magic meanings then suddenly people will have reason to want to use them— it'll be a CSV containing localization strings for tooltips that contain that character and bam, we'll be back to escaping.
Except the usages of that character will be rare and so potentially way more scary. At least with quotes and commas, the breakages are everywhere so you confront them sooner rather than later.
Graphical representations of the control characters begin at U+2400 in the "Control Pictures" Unicode block. Instead of the actual U+001E Record Separator, you put the U+241E Symbol for Record Separator in the help text.
.... with a note underneath urging readers not to copy and paste the character because it's only the graphical representation of it, not the thing itself.
Perhaps a more salient example might be CSV nested in CSV. This happens all the time with XML (hello junit) and even JSON— when you plug a USB drive into my LG TV, it creates a metadata file on it that contains {"INFO":"{ \"thing\": true, <etc> }"}
The entire argument against ASCII Delimited Text boils down to "No one bothered to support it in popular editors back in 1984. Because I grew up without it, it is impossible to imagine supporting it today."
You need 4 new keyboard shortcuts. Use ctrl+, ctrl+. ctrl+[ ctrl+] You need 4 new character symbols. You need a bit of new formatting rules. Pretty much page breaks decorated with the new symbols. It's really not that hard.
But, like many problems in tech, the popular advice is "Everyone recognizes the problem and the solution. But, the problematic way is already widely used and the solution is not. Therefore everyone doing anything new should invest in continuing to support the problem forever."
> The entire argument against ASCII Delimited Text boils down to "No one bothered to support it in popular editors back in 1984. Because I grew up without it, it is impossible to imagine supporting it today."
There's also the argument of "Now you have two byte values that cannot be allowed to appear in a record under any circumstances. (E.g., incoming data from uncontrolled sources MUST be sanitized to reject or replace those bytes.)" Unless you add an escaping mechanism, in which case the argument shifts to "Why switch from CSV/TSV if the alternative still needs an escaping mechanism?"
Length-delimited binary formats do not need escaping. But the usual "ASCII Delimited Text" proposal just uses two unprintable bytes as record and line separators, and the signalling is all in-band.
This means that records must not contain either of those two bytes, or else the format of the table will be corrupted. And unless you're producing the data yourself, this means you have to sanitize the data before adding it, and have a policy for how to respond to invalid data. But maintaining a proper sanitization layer has historically been finicky: just look at all the XSS vulnerabilities out there.
If you're creating a binary format, you can easily design it to hold arbitrary data without escaping. But just taking a text format and swapping out the delimiters does not achieve this goal.
At least you don't need these values in your data, unlike the comma, which shows up in human-written text.
If you do need these values in your data, then don't use them as delimiters.
Something the industry has stopped doing, but maybe should do again, is restricting characters that can appear in data. "The first name must not contain a record separator" is a quite reasonable restriction. Even Elon Musk's next kid won't be able to violate that restriction.
In Windows (and DOS EDIT.COM and a few other similarly ancient tools) there have existed Alt+028, Alt+029, Alt+030, and Alt+031 for a long time. I vaguely recall some file format I was working with in QBASIC used some or all of them and I was editing those files for some reason. That was not quite as far back as 1984, but sometime in the early 1990s for sure. I believe EDIT.COM had basic glyphs for them too, but I don't recall what they were, might have been random Wingdings like the playing card suits.
Having keyboard shortcuts doesn't necessarily solve why people don't want to use that format, either.
> I believe EDIT.COM had basic glyphs for them too, but I don't recall what they were, might have been random Wingdings like the playing card suits.
That is not specific to EDIT.COM; they are the PC characters with the same codes as the corresponding control characters, so they appear as graphic characters. (They can be used in any program that can use PC character set.)
However, in EDIT.COM and QBASIC you can also prefix a control character with CTRL+P in order to enter it directly into the file (and they appear as graphic characters, since I think the only control characters they will handle as control characters are tabs and line breaks).
Suits are PC characters 3 to 6; these are PC characters 28 to 31 which are other shapes.
The keys would be something other than those, though. They would be: CTRL+\ for file separator, CTRL+] for group separator, CTRL+^ for record separator, CTRL+_ for unit separator. Other than that, it would work like you described, I think.
> But, like many problems in tech, the popular advice is "Everyone recognizes the problem and the solution. But, the problematic way is already widely used and the solution is not
This is unfortunately common. However, what else happens too, is disagreement about what is the problem and the solution.
It's more like, "because the industry grew up without it, other approaches gained critical mass."
Path dependence is a thing. Things that experience network effects don't get changed unless the alternative is far superior, and ASCII Delimited Text is not that superior.
Ignoring that and pushing for it anyway will at most achieve an xkcd 927.
But when looking for a picture to back up my (likely flawed) memory, Google helpfully told me that you can get a record separator character by hitting Ctrl-^ (caret). Who knew?
You’ll still find that sequence in data; it’ll just be rare enough that it won’t rear its ugly head until your solution has been in production for a while.
If there were visible well known characters that could be printed for those and keys on a keyboard for inputting them we would probably have RSV files. Because they are buried down in the nonprintable section of the ASCII chart they are a pain for people to deal with. All it would have taken is one more key on the keyboard, maybe splitting the tab key in half.
> If there were visible well known characters that could be printed...
...There would be datasets that include those characters, and so they wouldn't be as useful for record separators. Look into your heart and know it to be true.
Can't type them on a keyboard I guess, or generally work with them in the usual text-oriented tools? Part of the appeal of CSV is you can just open it up in Notepad or something if you need to. Maybe that's more a critique of text tools than it is of ASCII record separator characters.
I was really excited when I learned of these characters, but ultimately if it’s in ASCII then it’s in-band and will eventually require escaping leading to the same problem.
> I don't get how this is not widely used as standard.
It requires bespoke tools for editing, and while CSV is absolute garbage, it can be ingested and produced by most spreadsheet software, as well as databases.
I couldn't find the story on it, but there was an instance of a config for some major service getting truncated, and since it was YAML it was more difficult to figure out that that was what had happened. I think it was at AWS, but I can't find the story, so I can't really remember.
And fully fair that you can have similar issues in other formats. I think the complaint here was that it was a bit harder, specifically because it did not trip up any of the loading code. With a big lesson learned that configs should probably either go pascal string style, where they have an expected number of items as the first part of the data, or xml style, where they have a closing tag.
Really, it is always amusing to find how many of the annoying parts of XML turned out to be somewhat more well thought out than people want to admit.
Really depends on how the CSV is generated/transferred. If the output of the faulting software is line-buffered then it's quite likely that a failure would terminate the file at a line break.
I don't understand why CSV became a thing when TSV existed, or a format using the nowadays-weird ASCII control characters like start/end of text, start of heading, horizontal/vertical tab, or the file/group/record/unit separators.
It seems many possible designs would've avoided the quoting chaos and made parsing sort of trivial.
Any time you have a character with a special meaning you have to handle that character turning up in the data you're encoding. It's inevitable. No matter what obscure character you choose, you'll have to deal with it
I understand this argument in general. But basically everyone has some sort of spreadsheet application installed that can read CSV.
In some alternate world where this "binary" format caught on, it would be a very minor issue that it isn't human readable, because everyone would have a tool that is better at reading it than humans are. (See the above-mentioned non-local property of quotes, where you may think you are reading rows but are actually inside a single cell.)
Makes me also wonder: if something like CBOR had caught on early enough, would we just be used to using something like `jq` to read it?
Except we have all these low ASCII characters specifically for this purpose that don't turn up in the data at all. But there is, of course, also an escape character specifically for escaping them if necessary.
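A minimal sketch of what that looks like with 0x1F (unit separator) and 0x1E (record separator); this version simply rejects the separator bytes in data rather than using the ASCII escape character:

```python
US, RS = "\x1f", "\x1e"   # ASCII unit separator, record separator

def encode(rows):
    # Commas and newlines in the data need no special treatment at all.
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("separator byte in data: escape or reject it")
        yield US.join(row) + RS

def decode(blob):
    return [record.split(US) for record in blob.split(RS) if record]

blob = "".join(encode([["a,b", "line1\nline2"], ["c", "d"]]))
assert decode(blob) == [["a,b", "line1\nline2"], ["c", "d"]]
```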
Even if you find a character that really is never in the data - your encoded data will contain it. And it's inevitable that someone encodes the encoded data again. Like putting CSV in a CSV value.
You can't type any of those on a typewriter, or see them in old or simple editors, or with no editor at all, like just catting to a tty.
If you say those are contrived examples that don't matter any more then you have missed the point and will probably never acknowledge the point and there is no purpose in continuing to try to communicate.
One can only ever remember and type out just so many examples, and one can always contrive some response to any single or finite number of examples, but they are actually infinite, open-ended.
Having a least common denominator that is extremely low that works in all the infinite situations you never even thought of, vs just pretty low and pretty easy to meet in most common situations, is all the difference in the world.
The difference is that the comma and newline characters are much more common in text than 0x1F and 0x1E, which, if you restrict your data to alphanumeric characters (which you really should), will never appear anywhere else.
Exactly. "Use a delimiter that's not in the data" is not real serialisation, it's fingers-crossed-hope-for-the-best stuff.
I have in the past done data extractions from systems which really can't serialise properly, where the only option is to concat all the fields with some "unlikely" string like @#~!$ as a separator, then pick it apart later. Ugh.
> Exactly. "Use a delimiter that's not in the data" is not real serialisation, it's fingers-crossed-hope-for-the-best stuff.
It's not doing just this, you pick something that's likely not in the data, and then escape things properly. When writing strings you can write a double quote within double quotes with \", and if you mean to type the designated escape character you just write it twice, \\.
The only reason you go for something likely not in the data is to keep things short and readable, but it's not impossible to deal with.
I agree that it's best to pick "unlikely" delimiters so that you don't have to pepper your data with escape chars.
But some people (plenty in this thread) really do think "pick a delimiter that won't be in the data" - and then forget quoting and/or escaping - is a viable solution.
The characters would likely be unique, maybe even by the spec.
Even if you wanted them, we use backslashes to escape strings in most common programming languages just fine. The problem with CSV is that commas aren't easy to recognize, because they might be within a single- or double-quoted string, or might just be a separator.
Can strings in CSV have newlines? I bet parsers disagree since there's no spec really.
In TSV as commonly implemented (for example, the default output format of Postgres and MySQL), tab and newline are escaped, not quoted. This makes processing the data much easier. For example, you can skip to a certain record or field just by skipping literal newlines or tabs.
It's a lot easier to type comma than control characters and it's a lot easier to view comma than a tab (which might look like a space).
For automated serialization, plain text formats won out, because they are easy to implement a minimal working solution (both import and export) and more importantly, almost all systems agree on what plain text is.
We don't really have Apple formatted text, that will show up as binary on windows. Especially if you are just transferring id's and numbers, those will fall within ascii and that will work even if you are expecting unicode.
What issues would TSV have? As commonly implemented (for example, the default output format of Postgres and MySQL), in TSV, tab and newline are escaped, not quoted.
I want to push Sqlite as a data interchange format! It has the benefit of being well defined, and can store binary data, like images for product pictures, inside the database. Not a good idea if you're trying to serve users behind a web app, but as interchange, better than a zip file with filenames that have to be "relinked".
For context: I have a LOT of experience of interchange formats, like "full time job, every day, all day, hundreds of formats, for 20-years" experience.
Based on that experience I have come to one key, but maybe, counter-intuitive truth about interchange formats:
- Too much freedom is bad.
Why? Generating interchange data is cheaper than consuming it, because the creator only needs to consider the stuff they want to include, whereas the consumer needs to consider every single possible edge case and or scenario the format itself can support.
This is why XML is WAY more costly to ingest than CSV, because in XML someone is going to use: attributes, CDATA, namespaces, comments, different declarations, includes, et al. In CSV they're going to use rows, a field separator, and quotes (with or without escaping). That's it. That's all it supports.
Sqlite as an interchange format is a HORRIFYING suggestion, because every single feature Sqlite supports may need to be supported by consumers. Even if you curtailed Sqlite's vast feature set, you've still created something vastly more expensive to consume than XML, which itself is obnoxious.
My favorite interchange formats are, in order:
- CSV, JSON (inc. NDJSON), YAML, XML, BSON (due to type system), MessagePack, Protobuf, [Giant Gap] Sqlite, Excel (xlsx, et al)
More features mean more cost, more edge cases, more failures, more complex consumers. Keep in mind, this is ONLY about interchange formats between two parties, I have wildly different opinions about what I would use for my own application where I am only ever the creator/consumer, I actually love Sqlite for THAT.
Oh, this is interesting. Are you tying different systems together? If so, do you use some preferred intermediate format? Do you have a giant library of * -> intermediate -> * converters that you sprinkle between everything? Or maybe the intermediate format is in memory?
Not the person you were replying to, but from my experience CSV is a good place to define data types coming into a system and a safe way to dump data as long as I write everything down.
So I might do things like have every step in a pipeline begin development by reading from and writing to CSV. This helps with parallel dev work and debugging, and is easy to load into any intermediate format.
> do you use some preferred intermediate format?
This is usually dictated by speed vs money calculations, weird context issues, and familiarity. I think it's useful to look at both "why isn't this a file" and "why isn't this all in memory" perspectives.
For tabular/time-series greater than 100k rows I personally feel like parquet cannot be beat. It's self-describing, strongly-typed, relatively compact, supports a bunch of io/decode skipping, and is quite fast.
Orc also looks good, but isn't well supported. I think parquet is optimal for now for most analytical use-cases that don't require human readability.
It is an interchange format, so it is inter-system by virtue of that. If I am a self-creator/consumer the format I use can be literally anything even binary memory dumps.
Interesting! I've dealt with file interchange between closed source (and a couple open source) programs, but that was a while ago. I've also had to deal with csvs and xslts between SaaS vendors for import export of customer's data. I've done a bunch of reverse engineering of proprietary formats so we could import the vendor's files, which had more information than they were willing to export in an interchange format. Sometimes they're encrypted and you have to break it.
What you say is fair. CSV is underspecified though; there's no company called CSV that's gonna sue for trademark enforcement, and there's no official CSV standard library that everyone uses. (Some exist, but there are so many naive implementations written from first principles, because how hard could it be? Output records and use a comma and a newline (of which there are three possible options).)
How often do you deal with multiple CSV files to represent multiple tables that are actually what's used by vendors internally, vs one giant flattened CSV with hundreds of columns and lots of empty cells? I don't have your level of experience with CSVs, but I've dealt with them being a mess, where the other side implements whatever they think is reasonable given the name "comma separated values".
With sqlite, we're in the Internet age, and so I presume this hypothetical developer would use the sqlite library and not implement their own library from scratch for funsies. This then leads to types, database normalization, multiple tables. I hear you that too many choices can be bad, and XML is a great example of this, but sqlite isn't XML and isn't CSV.
It's hard to have this discussion in the abstract so I'll be forthcoming about where I'm coming from, which is Csv import export between vendors for stores, think like Doordash to UberEATS. the biggest problem we have is images of the items, and how to deal with that. It's an ongoing issue how to get them, but the failure mode, which does happen, is that when moving vendor, they just have to redo a lot of work that they shouldn't have to.
Ultimately the North Star I want to push towards is moving beyond CSVs, because it'll let the people who currently have to hand-edit the CSV so every row imports properly stop having to do that. Problems would still exist, but they'd instead have to deal with, well, what you see with XML files, which has its shortcomings, as you mention, but at least once you understand how a vendor is using it, individual records are generally understandable.
I was moved so I don't deal with import export currently, but it's because sqlite is so nice to work with on personal projects where it's appropriate that I want to push the notion of moving to sqlite over csvs.
One very minor problem is that you max out storing blobs of 2GB(? I think, maybe 4GB). Granted few will hit this, but this limit did kill one previous data transfer idea of mine.
> not a good idea if you're trying to serve users behind a web app
I use Sqlite for a static site! Generating those static pages out to individual pages would involve millions of individual files. So instead I serve up a sqlite database over http, and use a sqlite wasm driver [0] to load (database) pages as needed. Good indexing cuts down on the number of pages it grabs, and I can even get full text search!
The only feature I'm missing is compression, which is complicated because popular extensions like sqlite-zstd are written in Rust.
CSV's actual problem is that there's no single CSV, and you don't know what type you have (or even if you have single consistent type through the whole file) without trying to parse the whole file and seeing what breaks. Is there quoting? Is that quoting used consistently? Do you have five-digit ZIP codes, or have the East Coast ones been truncated to four digits because they began with zero? Spin the wheel!
> So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
For my own parser, I made everything `\` escaped: outside of a quote or double-quote delimited string, any character prefixed with a `\` is read verbatim. There are no special exceptions resulting in `\,` producing a comma while `\a` produces `\a`. This makes it a good rule, because it is only one rule with no exceptions.
I considered this but then went the other way - a \ before anything other than a \, newline or comma is treated as an error. This leaves room for adding features, e.g. \N to signify a SQL NULL.
Regarding quoting and escaping, there are two options that make sense to me - either use quoting, in which case quotes are self-escaped and that's that; or use escaping, in which case quotes aren't necessary at all.
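A minimal sketch of that escaping-only scheme: comma, newline and backslash are the only escapable characters, and anything else after a backslash is an error, which is what leaves room for things like \N later.

```python
ESCAPES = {",": "\\,", "\n": "\\n", "\\": "\\\\"}
UNESCAPES = {",": ",", "n": "\n", "\\": "\\"}

def escape_field(s):
    return "".join(ESCAPES.get(ch, ch) for ch in s)

def parse_line(line):
    """Split one record on unescaped commas and decode the escapes."""
    fields, buf, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == "\\":
            nxt = line[i + 1] if i + 1 < len(line) else ""
            if nxt not in UNESCAPES:
                raise ValueError(f"bad escape at {i}")   # strict: keeps \N etc. free
            buf.append(UNESCAPES[nxt])
            i += 2
        elif ch == ",":
            fields.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(ch)
            i += 1
    fields.append("".join(buf))
    return fields

row = ["hello, world", "two\nlines", "back\\slash"]
line = ",".join(escape_field(f) for f in row)
assert parse_line(line) == row
```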
A good way to parallelize CSV processing is to split datasets into multiple files, kind of like manual sharding. xan has a parallel command able to perform a wide variety of map-reduce tasks on split files.
I don't. If I ever have a dataset that requires newlines in a string, I use another method to store it.
I don't know why so many people think every solution needs to be a perfect fit for every problem in order to be viable. CSV is good at certain things, so use it for those things! And for anything it's not good at, use something else!
True, but in that case I'm not the one choosing how to store it, until I ingest the data, and then I will store it in whatever format makes sense to me.
I don't think we do? It's more that a bunch of companies already have their data in CSV format and aren't willing to invest any effort in moving to a new format. Doesn't matter how much one extolls all the benefits, they know right? They're paying someone else to deal with it.
Of course. I’m not saying I roll my own parser for every project that uses a CSV file, I’m just describing my criteria for using CSV vs some other format when I have the option.
Eh, all you're really saying is "I'm not using CSV. Instead I'm using my CSV." Except that's all that anybody does.
CSV can just as easily support escaping as any other format, but there is no agreement for a CSV format.
After all, a missed escape can just as easily destroy a JSON or XML structure. And parallel processing of text is already a little sketchy simply because UTF-8 exists.
I'm not clear why quotes prevent parallel processing?
I mean, you don't usually parallelize reading a file in the first place, only processing what you've already read and parsed. So read each record in one process and then add it to a multiprocessing queue for multiple processes to handle.
And data corruption is data corruption. If a movie I'm watching has a corrupted bit I don't mind a visual glitch and I want it to keep playing. But with a CSV I want to fix the problem, not ignore a record.
Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!
> I'm not clear why quotes prevent parallel processing?
Because of the "non-local" effect of quotes, you can't just jump into the middle of a file and start reading it, because you can't tell whether you're inside a quoted section or not. If (big if) you know something about the structure of the data, you might be able to guess. So that's why I said "tricky" instead of "impossible".
Contrast to my escaping-only strategy, where you can jump into the middle of a file and fully understand your context by looking one char on either side.
> Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!
I used to be a data analyst at a management consultancy. A very common scenario would be that I'm handed a multi-gigabyte CSV and told to "import the data". No spec, no schema, no nothing. Data loss or corruption is totally unacceptable, because we were highly risk-sensitive. So step 1 is to go through the whole thing trying to determine field types by testing them. Does column 3 always parse as a timestamp? Great, we'll call it a timestamp. That kind of thing. In that case, it's great to be able to parallelise reading.
> And data corruption is data corruption
Agreed, but I prefer data corruption which messes up one field, not data corruption which makes my importer sit there for 5 minutes thinking the whole file is a 10GB string value and then throw "EOF in quoted field".
Doing sequential reading into a queue for workers to read is a lot more complicated than having a file format that supports parallel reading.
And the fix to allow parallel reading is pretty trivial: escape new lines so that you can just keep reading until the first unescaped new line and start at that record.
It is particularly helpful if you are distributing work across machines, but even in the single machine case, it's simpler to tell a bunch of workers their offset/limit in a file.
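A rough sketch of what that buys you, assuming newlines inside fields are escaped so every literal newline is a record boundary. Each worker skips the partial record it lands in and reads one record past its limit, so records straddling a boundary are handled exactly once:

```python
def read_chunk(path, offset, limit):
    """Yield complete records whose handling falls to the [offset, limit) worker.

    Assumes embedded newlines are escaped, so every literal b'\n' ends a record.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        if offset != 0:
            f.readline()            # discard the partial record we landed in
        while f.tell() <= limit:
            line = f.readline()
            if not line:
                break
            yield line.rstrip(b"\n")

# Hypothetical usage: carve the file into byte ranges and hand them to workers.
# ranges = [(i * chunk, min((i + 1) * chunk, size)) for i in range(n_workers)]
```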
Tab-Separated Value, as implemented by many databases, solves these problems, because tab, newline and other control characters are escaped. For example, the default text serialization format of Postgres (`COPY <table> TO '<file>'` without any options) is this way.
Anyone with a love of CSV hasn't been asked to deal with CSV-injection prevention in an enterprise setting, without breaking various customer data formats.
That mostly breaks down to "excel is intentionally stupid with csv files if you don't use the import function to open them" along with the normal "don't trust customer input without stripping or escaping it" concerns you'd have with any input.
That was my initial reaction as well – it's a vulnerability in MS software, not ours, not our problem. Unfortunately, reality quickly came to bear: our customers and employees ubiquitously use excel and other similar spreadsheet software, which exposes us and them to risk regardless where the issue lies. We're inherently vulnerable because of the environment we're operating in, by using CSV.
"don't trust customer input without stripping or escaping it" feels obvious, but I don't think it stands up to scrutiny. What exactly do you strip or escape when you're trying to prevent an unknown multitude of legacy spreadsheet clients that you don't control from mishandling data in an unknown variety of ways? How do you know you're not disrupting downstream customer data flows with your escaping? The core issue, as I understand it, stems from possible unintended formula execution – which can be prevented by prefixing certain cells with a space or some invisible character (mentioned in the linked post above). This _does_ modify customer data, but hopefully in a way that unobtrusive enough to be acceptable. All in all, it seems to be a problem without a perfect solution.
Hey, I'm the author of the linked article, cool to see this is still getting passed around.
Definitely agree there's no perfect solution. There's some escaping that seems to work ok, but that's going to break CSV-imports.
An imperfect solution is that applications should be designed with task-driven UIs so that they know the intended purpose of a CSV export and can make the decision to escape/not escape then. Libraries can help drive this by designing their interfaces in a similar manner. Something like `export_csv_for_eventual_import()`, `export_csv_for_spreadsheet_viewing()`.
Another imperfect solution would be to ... ugh...generate exports in Excel format rather than CSV. I know, I know, but it does solve the problem.
Or we could just get everyone in the world to switch to emacs csv-mode as a csv viewer. I'm down with that as well.
Appreciate your work! Your piece was pivotal in changing my mind about whether this should be considered in our purview to address.
The intention-based philosophy of all this makes a lot of sense, was eye opening, and I agree it should be the first approach. Unfortunately, after considering our use cases, we quickly realized that we'd have no way of knowing how customers intend to use the csv exports they've requested - we've talked to some of them and it's a mix. We could approach things case by case, but we really just want a setup which works well 99% of the time and mitigates known risk. We settled on the prefixing approach and have yet to receive any complaints about it, specifically using a space character, with the idea that something unobtrusive (eg. easily strippable) but also visible would be best, to avoid quirks stemming from something completely hidden.
Thanks again for your writing and thoughts; like I said above, I haven't found much else of quality on the topic.
I’ve almost always found the simple way around Excel users not knowing how to safely use CSV files is to just give the file another extension: I prefer .txt or .dat
Then the user doesn't have Excel as the default program for opening the file and has to jump through a couple of safety hoops.
If your customers and employees are using Excel then stop going against the grain with your niche software developer focused formats that need a lot of explanations.
I need to interface with a lot of non-technical people who exclusively use Excel. I give them .xlsx files. It's just as easy to export .xlsx as it is to export .CSV and my customers are happy.
How is .csv a niche dev-focused format? Our customers use our exports for a mix of purposes, some of them involving spreadsheet clients (not just excel) and some of them integrating with their own data pipelines. Csv conveniently works with these use cases across the board, without explanation, and is inconveniently saddled with these legacy security flaws in Excel (and probably other clients).
If xlsx works for all your use cases that's great, a much better solution that trying to sidestep these issues by lightly modifying the data. It's not an option for us, and (I'd imagine) a large contingent of export tools which can't make assumptions about downstream usage.
Someone filed a bug report on a project I work on, saying that it was a security vulnerability that we don't prefix cell values with a single quote (') when the cell content contains certain values like an equal sign (=). They said this can cause Excel to evaluate the content and potentially run unsafe code.
I responded that this was Excel's problem, not ours, and that nobody would assign a CVE to our product for such a "vulnerability". How naive I was! They forwarded me several such CVEs assigned to products that create CSVs that are "unsafe" for Excel.
I agree with the characterization ("security theater") of these bug reports. The problem is that the intentions of these reports don't make the potential risk less real, depending on the setting, and I worry that the "You're just looking for attention" reaction (a very fair one!) leads to a concerning downplaying of this issue across the web.
As a library author, I agree this very well may not be something that needs to be addressed. But as someone working in a company responsible for customers, employees, and their sensitive information, disregarding this issue disregards the reality of the tools these people will invariably use, downstream of software we _are_ responsible for. Aiming to make this downstream activity as safe as possible seems like a worthy goal.
I didn't know about the formula injection, I just knew that Excel and Sheets mangle my dates every time and it drives me bonkers. Why is that the default? It makes no sense.
I also love CSV for its simplicity. A key part of that love is that it comes from the perspective of me as a programmer.
Many of the criticisms of CSV I'm reading here boil down to something like: CSV has no authoritative standard, and everyone implements it differently, which makes it bad as a data interchange format.
I agree with those criticisms when I imagine them from the perspective of a user who is not also a programmer. If this user exports a CSV from one program, and then tries to load the CSV into a different program, but it fails, then what good is CSV to them?
But from the perspective of a programmer, CSV is great. If a client gives me data to load into some app I'm building for them, then I am very happy when it is in a CSV format, because I know I can quickly write a parser, not by reading some spec, but by looking at the actual CSV file.
Parsing CSV is quick and fun if you only care about parsing one specific file. And that's the key: It's so quick and fun, that it enables you to just parse anew each time you have to deal with some CSV file. It just doesn't take very long to look at the file, write a row-processing loop, and debug it against the file.
The beauty of CSV isn't that it's easy to write a General CSV Parser that parses every CSV file in the wild, but rather that it's easy to write specific CSV parsers on the spot.
Going back to our non-programmer user's problem, and revisiting it as a programmer, the situation is now different. If I, a programmer, export a CSV file from one program, and it fails to import into some other program, then as long as I have an example of the CSV format the importing program wants, I can quickly write a translator program to convert between the formats.
There's something so appealing to me about simple-to-parse-by-hand data formats. They are very empowering to a programmer.
Totally agree that its biggest strength is how approachable it is for quick, ad hoc tooling. Need to convert formats? Join two datasets? Normalize a weird export? CSV gives you just enough structure to work with and not so much that it gets in your way.
> I know I can quickly write a parser, not by reading some spec, but by looking at the actual CSV file
This is fine if you can hand-check all the data, or if you are okay if two offsetting errors happen to corrupt a portion of the data without affecting all of it.
Also I find it odd that you call it "easy" to write custom code to parse CSV files and translate between CSV formats. If somebody give you a JSON file that isn't valid JSON, you tell them it isn't valid, and they say "oh, sorry" and give you a new one. That's the standard for "easy." When there are many and diverse data formats that meet that standard, it seems perverse to use the word "easy" to talk about empirically discovering the quirks in various undocumented dialects and writing custom logic to accommodate them.
Like, I get that a farmer a couple hundred years ago would describe plowing a field with a horse as "easy," but given the emergence of alternatives, you wouldn't use the word in that context anymore.
> If somebody give you a JSON file that isn't valid JSON, you tell them it isn't valid, and they say "oh, sorry" and give you a new one. That's the standard for "easy."
But it isn't that reliably easy with JSON. Sometimes I have clients give me data that I just have to work with, as-is. Maybe it was invalid JSON spat out by some programmer or tool long ago. Maybe it's just from a different department than my contact, which might delay things for days before the bureaucracy gets me a (hopefully) valid JSON.
I consider CSV's level of "easy" more reliable.
And even valid JSON can be less easy. I've had experiences where writing the high-level parsing for some JSON file, in terms of a JSON library, was less easy and more time-consuming than writing a custom CSV parser.
Subjectively, I think programming a CSV parser from basic programming primitives is just more fun and appealing than programming in terms of a JSON library or XML library. And I find the CSV code is often simpler and quicker to write.
> When there are many and diverse data formats that meet that standard, it seems perverse to use the word "easy" to talk about empirically discovering the quirks in various undocumented dialects and writing custom logic to accommodate them.
But the premise of CSV is so simple, that there are only four quirks to empirically discover: cell delimiter, row delimiter, quote, escaped-quote.
I think it's "easy" to peek at the file and say, "Oh, they use semicolon cell delimiters."
And it's likewise "easy" to write the "custom logic", which is about as simple as parsing something directly from a text stream gets. I typically have to stop and think a minute about the quoting, but it's not that bad.
If a programmer is practiced at parsing from a text stream (a powerful, general skill that is worth exercising), then I think it is reasonable to think they might find parsing CSV by hand to be easier and quicker than parsing JSON (etc.) with a library.
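For what it's worth, the whole exercise really is about that small. A sketch assuming comma delimiters and Excel-style "" escaping; swap the two parameters for whatever the file at hand uses:

```python
def parse_csv(text, delim=",", quote='"'):
    """A tiny hand-rolled parser: delimiter, row delimiter, quote, escaped quote."""
    rows, row, field, in_quotes = [], [], [], False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == quote and text[i + 1:i + 2] == quote:
                field.append(quote); i += 2; continue    # escaped quote ("")
            if ch == quote:
                in_quotes = False; i += 1; continue
            field.append(ch); i += 1                     # embedded commas/newlines ok
        else:
            if ch == quote:
                in_quotes = True; i += 1
            elif ch == delim:
                row.append("".join(field)); field = []; i += 1
            elif ch == "\n":
                row.append("".join(field)); rows.append(row)
                field, row = [], []; i += 1
            elif ch == "\r":
                i += 1                                   # tolerate \r\n
            else:
                field.append(ch); i += 1
    if field or row:
        row.append("".join(field)); rows.append(row)     # no trailing newline
    return rows

print(parse_csv('a,"b,1"\n"say ""hi""",c\n'))
# [['a', 'b,1'], ['say "hi"', 'c']]
```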
The best part about CSV: anyone can write a parser in 30 minutes, meaning that I can take data from the early '90s and import it into a modern web service.
The worst part about CSV: anyone can write a parser in about 30 minutes, meaning that it's very easy to get incorrect implementations, incorrect data, and other strange undefined behaviors. But to be clear, JSON and YAML also have issues with everyone trying to reinvent the wheel constantly. XML is rather ugly, but it seems to be the most resilient.
until you find someone abusing XSD schemas, or someone designing a "dynamically typed" XML... or someone who sneaks extra data into comments - that happened to me way more often than it should have.
You know what grinds my gears about using XSD for message definitions? Namespaces. Namespaces are a good idea and were done well in XML, as far as I can see, but with XSD you run into this [problem][1]:
Namespaces are used to qualify tags and attributes in XML elements. But they're also used by XSD to qualify the names of types defined in the schema. A sequence element's type is indicated by the value of its "type" attribute. The attribute value is a string that is the namespace-qualified name of the type.
So, if you want to change the alias of an XML namespace in an XSD schema, you can't just use your XML library's facilities for namespace management. You also have to go find the "type" attributes (but not all of the "type" attributes), parse their values, and do the corresponding alias change in the type name.
Don't use a string for a thing that is not a string! I guess in XML attributes you have no choice. XAML improved on the situation a bit.
I like CSV for the same reasons I like INI files. It's simple, text based, and there's no typing encoded in the format, it's just strings. You don't need a library.
They're not without their drawbacks, like no official standards etc, but they do their job well.
Similarly I had once loved the schemaless datastorages. They are so much simpler!
Until I worked quite a bit with them and realized that there's always schema in the data, otherwise it's just random noise. The question is who maintains the schema, you or a dbms.
Re. formats -- the usefulness comes from features (like format enforcing). E.g. you may skip .ini at all and just go with lines on text files, but somewhere you still need to convert those lines to your data, there's no way around it, the question is who's going to do that (and report sane error messages).
Schemaless can be accomplished with well-formed formats like json, xml, yaml, toml, etc.; from the producer side these are roughly equivalent interfaces. There's zero upside to using CSVs except to comfort your customer. Or maybe you have centered your actual business on importing CSVs, in which case you should probably not exist.
My experience has indicated the exact opposite. CSVs are the only "structured" format nobody can claim to parse 100% (ok probably not true thinking about html etc, just take this as hyperbole.) Just use a well-specified format and save your brain-cells.
Occasionally, we must work with people who can only export to csv. This does not imply csv is a reasonable way to represent data compared to other options.
It's of quite large importance, and despite being difficult, it is well-specified, which is the point here. Importantly, there is also no competing HTML spec, either de facto or otherwise. CSV doesn't have anything of comparable authority.
> CSVs are the only "structured" format nobody can claim to parse 100%
You don't need to though since in most cases you just need to support whatever CSV format the tool you're handling, unless of course you're trying to write the next Excel/Google Sheets competitor.
CSV works because CSV is understood by non technical people who have to deal with some amount of technicality. CSV is the friendship bridge that prevents technical and non technical people from going to war.
I can tell an MBA guy to upload a CSV file and i'll take care of it. Imagine i tell him i need everything in a PARQUET file!!! I'm no longer a team player.
This is incorrect. Everyone uses Excel, not CSV. There are billions of people on this planet who know what to do with an .xlsx file.
Do the same with a .CSV file and you'll have to teach those people how to use the .CSV importer in Excel and also how to set up the data types for each column etc. It's a non trivial problem that forces you down to a few million people.
.CSV is a niche format for inexperienced software developers.
Indeed, my main use is that most financial services will output your records in CSV, although I mostly open that in Excel, which sometimes gets a bit confused.
Among the shit I have seen in CSV: no " around strings, including those containing a return char; innovative separators; odd date and number formats; no escaping of " within strings; rows belonging to the reporting tool used to export to CSV; etc.
True. But most of those problems are pretty easy for the non-technical person to see, understand, and (often) fix. Which strengthens the "friendship bridge".
(I'm assuming the technical person can easily write a basic parsing script for the CSV data - which can flag, if not fix, most of the format problems.)
For a dataset of any size, my experience is that most of the time & effort goes into handling records which do not comply with the non-technical person's beliefs about their data. Which data came from (say) an old customer database - and between bugs in the db software, and abuse by frustrated, lazy, or just ill-trained CSR's, there are all sorts of "interesting" things, which need cleaning up.
I wish this was a joke. I'm always trying to convince data scientists with a foot in the open source world that their life will be so much better if they use parquet or Stata or Excel or any other kind of file but CSV.
On top of all the problems people mention here involving the precise definition of the format and quoting, it's outright shocking how long it takes to parse ASCII numbers into floating point. One thing that stuck with me from grad school was that you could do a huge number of FLOPS on a matrix in the time it would take to serialize and deserialize it to ASCII.
Accurate data typing (never confuse a string with a number)
Maybe circular, but: it always loads correctly into Excel, and if you load it into a spreadsheet you can add text formatting and even formulas, checkboxes and stuff, which can be a lot of fun.
Excel does that type coercion if you import from CSV. If you export pandas data to XLSX it adds proper type information and then it imports properly into Excel and you avoid those problems.
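For what it's worth, a minimal sketch of that pandas route (it assumes pandas plus an Excel writer such as openpyxl is installed; the column names are made up):

import pandas as pd

df = pd.DataFrame({"policy_number": ["0012345", "0098765"],   # string column with leading zeros
                   "amount": [34593.12, 10.5]})               # float column

df.to_csv("report.csv", index=False)     # opening this in Excel coerces policy_number to a number
df.to_excel("report.xlsx", index=False)  # cells stay typed, so Excel keeps "0012345" as text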
I've recently been developing a Raspberry Pi based solution which works with telemetry logs. The first implementation used an SQLite database (with WAL log) – only to find it corrupted after just a couple of days of extensive power on/off cycles.
I've since started looking at parquet files – which turned out to not be friendly to append-only operations. I've ended up implementing writing events into ipc files which then periodically get "flushed" into the parquet files. It works and it's efficient – but man is it non-trivial to implement properly!
My point here is: for a regular developer – CSV (or jsonl) is still the king.
> First implementation used an SQLite database (with WAL log) – only to find it corrupted after just couple of days of extensive power on/off cycles.
Did you try setting `PRAGMA synchronous=FULL` on your connection? This forces fsync() after writes.
That should be all that's required if you're using an NVMe SSD.
But I believe most microSD cards do not even respect fsync() calls properly and so there's technically no way to handle power offs safely, regardless of what software you use.
I use SanDisk High Endurance SD cards because I believe (but have not fully tested) that they handle fsync() properly. But I think you have to buy "industrial" SD cards to get real power fail protection.
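Concretely, the pragma mentioned above is a one-liner on the connection; a minimal sketch with Python's built-in sqlite3 (the file and table names are made up):

import sqlite3

conn = sqlite3.connect("telemetry.db")
conn.execute("PRAGMA journal_mode=WAL")     # keep the WAL mode from the original setup
conn.execute("PRAGMA synchronous=FULL")     # fsync on every commit
conn.execute("CREATE TABLE IF NOT EXISTS events (ts REAL, payload TEXT)")

# Commit (and therefore fsync) per batch rather than per row to spare the card.
with conn:
    conn.executemany("INSERT INTO events VALUES (?, ?)",
                     [(0.0, "boot"), (1.5, "sample")])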
Raspberry Pi uses a microSD card. Just using fsync after every write would be a bit devastating, but batching might've worked OK in this case. Anyway, too late to check now.
There's definitely a place for it. I ran into the same problem with a battery powered event logger. Basically alternate between sleep-until-event and sample-until-event-over.
SQLite was fine until the realities of that environment hit.
0) I need to save the most data over time and my power budget is unpredictable due to environmentals.
1) When should I commit? SQLite commit per insert slows down, impacts battery life, impacts sample rate. Practically you could get away with batching all data for a small period.
2) SQLite is slow to repair databases. Partially written file would often take longer to repair than we had battery to run.
CSV based format filled that niche. First column was line-column count to support firmware upgrades. Last column is line-checksum. Another column indicating if this line was the last for an event. Parser skips corrupted lines/entries.
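A hypothetical sketch of that kind of line format (column count first, an is-last flag, and a checksum at the end); the CRC32 and the exact layout are my guesses at the scheme described, not the parent's actual code:

import zlib

def encode_line(fields, last_for_event):
    cells = [str(len(fields) + 2)] + list(fields) + ["1" if last_for_event else "0"]
    body = ",".join(cells)                       # assumes field values contain no commas
    return body + "," + format(zlib.crc32(body.encode()), "08x") + "\n"

def decode_line(line):
    body, _, checksum = line.rstrip("\n").rpartition(",")
    if format(zlib.crc32(body.encode()), "08x") != checksum:
        return None                              # corrupted line: parser skips it
    cells = body.split(",")
    if len(cells) != int(cells[0]):
        return None                              # truncated line: parser skips it
    return cells[1:-1], cells[-1] == "1"         # (fields, last_for_event)

line = encode_line(["1717000000", "23.5"], last_for_event=True)
assert decode_line(line) == (["1717000000", "23.5"], True)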
> I've since started looking at parquet files – which turned out to not be friendly to append-only operations. I've ended up implementing writing events into ipc files which then periodically get "flushed" into the parquet files. It works and it's efficient – but man is it non-trivial to implement properly!
I think the industry standard for supporting this is something like iceberg or delta, it's not very lightweight, but if you're doing anything non-trivial, it's the next logical move.
That's true; in recent years it's been less of a disaster, with lots of good CSV libraries for various languages. In the 90s CSV was a constant footgun, perhaps that's why they went crazy and came up with XML.
When I first started, installing packages which required compiling native code on either my work Windows machine and the old Unix servers was not easy.
So I largely stuck to the Python standard library where I could, and most of the operations I had at the time did not require data analysis on a server, that was mostly done in a database. Often the job was validating and transforming the data to then insert it into a database.
Even as the Python packaging ecosystem matured and I found I could easily use Pandas everywhere, it just wasn't the first thing I'd reach for. And occasionally it'd be very helpful to iterate through files with the csv module, taking only a few MBs of memory, vs. loading the entire dataset into memory with Pandas.
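E.g., a minimal sketch of the streaming pattern I mean (the file and column names are placeholders):

import csv

total = 0.0
with open("big_export.csv", newline="") as f:      # hypothetical file
    for row in csv.DictReader(f):                  # one row in memory at a time
        total += float(row["amount"] or 0)         # hypothetical column
print(total)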
The JSON version is only marginally bigger (just a few brackets), but those brackets represent the ability to be either simple or complex. This matters because you wind up with terrible ad-hoc nesting in CSV ranging from entries using query string syntax to some entirely custom arrangement.
And in these cases, JSON's objects are WAY better.
Because CSV is so simple, it's common for them to avoid using a parsing/encoding library. Over the years, I've run into this particular kind of issue a bunch.
["val1", "val2", "unexpected,comma", "valN"].join(',')
//outputs `val1,val2,unexpected,comma,valN` which has one too many items
JSON parsers will not only output the expected values every time, but your language likely uses one of the super-efficient SIMD-based parsers under the surface (probably faster than what you are doing with your custom CSV parser).
Another point is standardization. Does that .csv file use commas, spaces, semicolons, pipes, etc? Does it use CR,LF, or CRLF? Does it allow escaping quotations? Does it allow quotations to escape commas? Is it utf-8, UCS-2, or something different? JSON doesn't have these issues because these are all laid out in the spec.
JSON is typed. Sure, it's not a LOT of types, but 6 types is better than none.
While JSON isn't perfect (I'd love to see an official updated spec with some additional features), it's generally better than CSV in my experience.
While most people would prefer the second version, the first version is also valid JSON and will definitely see use when you want/need JSON but want to reduce data over the wire though you'd probably still see a wrapper object like:
Essentially nobody uses JSON without a library, but tons of people (maybe even most people) use CSV without a library.
Part of the problem here is standards. There's a TON of encoding variations all using the same .csv extension. Making a library that can accurately detect exactly which one is correct is a big problem once you leave the handful of most common variants. If you are doing subfield encoding, you are almost certainly on your own with decoding at least part of your system.
JSON has just one standard and everyone adheres to that standard which makes fast libraries possible.
If you want to stream large volumes of row-oriented data, you aren't reading yourself and you should be using a binary format which is going to be significantly smaller (especially for numeric data).
Yeah, that would be the next step in optimization. In the meantime, raw text CSV streaming (for data that isn't purely numeric) is still extremely fast and easy to set up.
I am annoyed that comma won out as the separator. Tab would have been a massively better choice. Especially for those of us who have discovered and embraced elastic tabstops. Any slightly large CSV is unreadable and uneditable because you can't easily see where the commas are, but with tabs and elastic tabstops, the whole thing is displayed as a nice table.
(That is, of course, assuming the file doesn't contain newlines or other tabs inside of fields. The format should use \t \n etc for those. What a missed opportunity.)
I wrote a web scraper for some county government data and went for tabs as well. It's nice how the columns lined up in my editor (some of these files had hundreds of thousands of lines).
And all kinds of other weirdness, right in ascii. Vertical tabs, LOL. Put those in filenames on someone else's computer if you want to fuck with them. Linux and its common file systems are terrifyingly permissive in the character set they allow for file names.
CSV has multiple different separators.
E.g. Excel defaults to different separators based on locale. In the Czech locale it uses a comma as the decimal separator instead of a dot, so CSV uses a semicolon as the default field separator.
They're completely skipping over the complications of header rows and front matter.
"8. Reverse CSV is still valid CSV" is not true if there are header rows for instance.
But really, whether or not CSV is a good format or not comes down to how much control you have over the input you'll be reading. If you have to deal with random CSV from "in the wild", it's pretty rough. If you have some sort of supplier agreement with someone that's providing the data, or you're always parsing data from the same source, it's pretty fine.
At various points in my career, I've had to oversee people creating data export features for research-focused apps. Eventually, I instituted a very simple rule:
As part of code review, the developer of the feature must be able to roundtrip export -> import a realistic test dataset using the same program and workflow that they expect a consumer of the data to use. They have up to one business day to accomplish this task, and are allowed to ask an end user for help. If they don't meet that goal, the PR is sent back to the developer.
What's fascinating about the exercise is that I've bounced as many "clever" hand-rolled CSV exporters (due to edge cases) as other more advanced file formats (due to total incompatibility with every COTS consuming program). All without having to say a word of judgment.
Data export is often a task anchored by humans at one end. Sometimes those humans can work with a better alternative, and it's always worth asking!
As someone who likes modern formats like parquet, when in doubt, I end up using CSV or JSONL (newline-delimited JSON). Mainly because they are plain-text (fast to find things with just `grep`) and can be streamed.
Most features listed in the document are also shared by JSONL, which is my favourite format. It compresses really well with gzip or zstd. Compression removes some plain-text advantages, but ripgrep can search compressed files too. Otherwise, you can:
zcat data.jsonl.gz | grep ...
Another advantage of JSONL is that it's easier to chunk into smaller files.
Too bad xz/lzma isn't supported in these formats. I often get pretty big improvements in compression ratio. It's slower, but it can be parallelized too.
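Streaming those files record by record from code is just as trivial; a minimal sketch (the file and field names are hypothetical):

import gzip, json

with gzip.open("data.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if record.get("status") == "error":     # hypothetical field
            print(record)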
JSONL is a replacement for CSV; you shouldn't be using CSV as a format for long-term storage or querying, it has so many downsides and nearly zero upsides.
When JSONL is compressed with zstd, most of the "expensive if large" concern disappears as well.
Generating and consuming JSONL can easily be in the GB/s range.
I mean on the querying side. Parquet's ability to skip rowgroups and even pages, and paired with iceberg or delta can make the difference between being able to run your queries at all versus needing to scale up dramatically.
I am saying JSONL is a lower bound format, if you can use something better you should. Data interchange, archiving, transmission, etc. It shouldn't be repeatedly queried.
Parquet, Arrow, sqlite, etc are all better formats.
One of my favorite tools. However, I don’t think that Visidata is a spreadsheet, even though it looks like one and is named after one. It is more spreadsheet adjacent. It is focused on row-based and column-based operations. It doesn’t support arbitrary inter-cell operation(s), like you get in Excel-like spreadsheets. It is great for “Tidy Data’, where each row represents a coherent set of information about an object or observation. This is very much like Awk, or other pipeline tools which are also line/row oriented.
CSV still quietly powers the majority of the world’s "data plumbing."
At any medium+ sized company, you’ll find huge amounts of CSVs being passed around, either stitched into ETL pipelines or sent manually between teams/departments.
It’s just so damn adaptable and easy to understand.
For example, one of the CSVs my company shovels around is our Azure billing data. There are several columns that I just have absolutely no idea what the data in them is. There are several columns we discovered are essentially nullable¹ The Hard Way, when we got a bill which included, e.g., a charge that I guess Azure doesn't know what day it occurred on? (Or almost anything else about it.)
(If this format is documented anywhere, well, I haven't found the docs.)
Values like "1/1/25" in a "date" column. I mean, I did say it was an Azure-generated CSV, so obviously the bar wasn't exactly high, but then it never is, because anyone wanting to build something with some modicum of reliability, or discoverability, is sending data in some higher-level format, like JSON or Protobuf or almost literally anything but CSV.
If I can never see the format "JSON-in-CSV-(but-we-fucked-up-the-CSV)" ever again, that would spark joy.
(¹after parsing, as CSV obviously lacks "null"; usually, "" is a serialized null.)
I work as a data engineer in the financial services industry, and I am still amazed that CSV remains the preferred delivery format for many of our customers. We're talking datasets that cost hundreds of thousands of dollars to subscribe to.
"You have a REST API? Parquet format available? Delivery via S3? Databricks, you say? No thanks, please send us daily files in zipped CSV format on FTP."
Requires AWS credentials (api access token and secret key? iam user console login? sso?), AWS SDK, manual text file configuration, custom tooling, etc. I guess with Cyberduck it's easier, but still...
> Databricks
I've never used it but I'm gonna say it's just as proprietary as AWS/S3 but worse.
Anybody with Windows XP can download, extract, and view a zipped CSV file over FTP, with just what comes with Windows. It's familiar, user-friendly, simple to use, portable to any system, compatible with any program. As an almost-normal human being, this is what I want out of computers. Yes the data you have is valuable; why does that mean it should be a pain in the ass?
I've recently written a library at work to run visitors on data models bound to data sets. One of these visitors is a CSV serializer that dumps a collection as a CSV document.
I've just checked and strings are escaped using the same mechanism as for JSON, with backslashes. I should've double-checked against RFC 4180, but thankfully that mechanism isn't currently triggered anywhere for CSV (it's used for log exportation, and none of the data there hits that code path). I've also checked the code from other teams and it's just handwritten C++ stream statements inside a loop that doesn't even try to escape data. It also happens to be fine for the same reason (log exportation).
I've also written serializers for JSON, BSON and YAML and they actually output spec-compliant documents, because there's only one spec to pay attention to. CSV isn't a specification, it's a bunch of loosely-related formats that look similar at a glance. There's a reason why fleshed-out CSV parsers usually have a ton of knobs to deal with all the dialects out there (and I've almost added my own by accident), that's simply not a thing for properly specified file formats.
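For reference, the RFC 4180 mechanism doubles the quote character instead of backslash-escaping it; a quick sketch of the difference in Python:

import csv, io, json

value = 'He said "hi",\nthen left'

print(json.dumps(value))        # backslashes: "He said \"hi\",\nthen left"

buf = io.StringIO()
csv.writer(buf).writerow([value])
print(buf.getvalue())           # doubled quotes: "He said ""hi"",
                                # then left"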
This is particularly funny because I just received a ticket saying that the CSV import in our product doesn't work. I asked for the CSV, and it uses a semicolon as a delimiter. That's just what their Excel produced, apparently. I'm taking their word for it because... Excel.
To me, CSV is one of the best examples of why Postel's Law is scary. Being a liberal recipient means your work never ends because senders will always find fun new ideas for interpreting the format creatively and keeping you on your toes.
Of course, because there are locales which use a comma as the decimal separator. So CSV in Excel then defaults to a semicolon.
Another piece of Microsoft BS; they should default to the English locale in CSV and do the translation in the background. And let the user choose if they want to save with a different separator. Excel in every part of the world should produce the same CSV by default.
Bunch of idiots.
Yes. CSV is a data interchange format, it's meant to be written on one computer and read by another. Making the representation of data dependent on the locale in use is stupid af. Locales are for interaction with the user, not for data interchange.
There is a lot not to like about CSV, for all the reasons given here. The only real positive is that you can easily create, read and edit CSV in an editor.
Personally I think we missed a trick by not using the ASCII US and RS characters:
Columns separated by \u001F (ASCII unit separator).
Rows separated by \u001E (ASCII record separator).
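A minimal sketch of what that would look like (Python just to make the control characters visible; the sample rows are made up):

US, RS = "\x1f", "\x1e"         # unit separator, record separator

rows = [["name", "note"], ["Ada", "likes, commas"], ["Bob", 'quotes "too"']]
blob = RS.join(US.join(cells) for cells in rows)

# No quoting or escaping rules are needed, as long as the data itself
# never contains these two control characters.
parsed = [record.split(US) for record in blob.split(RS)]
assert parsed == rows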
Good idea, but probably a non-starter due to no keyboard keys for those characters. Even | would've been a better character to use since it almost never appears in common data.
It is probably unrealistic to expect keyboard vendors to add new keys. But editors could support adding them through keyboard shortcuts (Ctrl + something).
Pipe can be useful as a field delimiter. But what do you use as the record delimiter?
In the abstract, CSV is great. In reality, it is a nightmare not because of CSV, but because of all the legacy tools that produce it in slightly different ways (different character encodings mostly: Excel still produces some variant of latin1, some tools drop a BOM in your UTF-8, etc.).
Unless you control the producer of the data you are stuck trying to infer the character encoding and transcoding to your destination, and there's no foolproof way of doing that.
100% agree. TSV is under-rated. Tabs don't naturally occur in data nearly as often as commas so tabs are a great delimiter. Copy paste into Excel also works much better with tabs.
Code editors may convert tabs to spaces but are you really editing and saving TSV data files in your code editor?
TSV's big advantage is that, as commonly implemented, the separators are escaped, not quoted. This means that a literal newline (ASCII 0x0A) is always a record separator and a literal tab (ASCII 0x09) is always a field separator. This is the format many databases -- including Postgres -- use for text export by default.
The problem with TSV is differing user configuration. For example, if I use vim then Tab might indeed be a '\t' character, but in TextEdit on Mac it might be something different, so editing the file in different programs can yield different formatting. Whereas ',' is a universal char, present on all keyboards and formatted in a single way.
The only situation I can think of where a tab is not a tab, is in a code editor that's been configured (possibly by default) to use spaces instead. But that's an easy enough configuration to change. And certainly wouldn't be a problem for something like TextEdit.
> IMO the delimiter should be a configurable option
It is. CSV has been character separated vs comma separated for probably decades now. Most tools you'd use to mess with them allow you to specify which separator character is being used.
Functionally the same. I'd prefer CSV if my content was likely to have whitespace in it and didn't want billion quotes. I'd prefer TSV if my content was unlikely to have whitespace, and more likely to contain commas.
The problem with TSV is what are you going to do about quotes. Some fields might contain them [1] or they might be needed to store fields with tabs inside them.
Because of this in order to read a plain simple TSV (fields separated by tabs, nothing more) with the Python csv module [2] you need to set the quote character to an improbable value, say € (using the euro sign because HN won't let me use U+1F40D), or just parse it by hand, e.g. row.split('\t').
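For what it's worth, the csv module also has an explicit switch for this; a minimal sketch of reading plain tab-separated fields with quote handling disabled (the file name is made up):

import csv

with open("data.tsv", newline="") as f:
    for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        print(row)      # quote characters pass through as literal text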
Agreed, much easier to work with, especially if you can guarantee no embedded tabs or newlines. Otherwise you end up with backslash escaping, but that's still usually easier than quotes.
Parsing escapes is easier than parsing quoted text with field and record separators embedded in it. Every literal newline or literal tab is a separator. One can jump to the thousandth record, for example, just by skipping lines, without looking into the contents.
If have to use a dedicated tabular data editing program, I may as well use a spreadsheet application. What do the options you propose do better than libreoffice calc?
It's pretty nice because you can inspect a file or dataset on a server (no X) before downloading, to see if you even want to bother. Or you can pass it along through a series of pipes to just get the rows you want.
I’m not sure where that quote is from but it is incorrect, CSVs aren’t easily viewed or modified in text editors in general (at least not in any way that takes advantage of their tabular nature).
There’s at least a slight chance that tab separated values will look ok in a text editor (although in general, nope).
There is at least a chance that the text fields will fit in a tab. If not, a tool like “column” on Linux can be used. There is no chance that a text field will fit inside the width of a comma.
I think for "untyped" files with records, using the ASCII file, (group) and record separators (hex 1C, 1D and 1E) work nicely. The only constraint is that the content cannot contain these characters, but I found that that is generally no problem in practice. Also the file is less human readable with a simple text editor.
For other use cases I would use newline-separated JSON. It has most of the benefits described in the article, except the uncompressed file size.
I agree that JSONL is the spiritual successor of CSV with most of the benefits and almost none of the drawbacks.
It has a downside though: wherever JSON itself is used, it tends to be a few kilobytes at least (from an API response, for example). If you collect those in a JSONL file the lines tend to get verrrry long and difficult to edit. CSV files are more compact.
JSONL files are a lot easier to work with though. Less headaches.
Honestly yes. If text editors would have supported these codes from the start, we might not even have XML, JSON or similar today. If these codes weren't "binary" and all scary, we would live in much different world.
I wonder how much we have been hindered ourselves by reinventing plain text human-readable formats over the years. CSV -> XML -> JSON -> YAML and that's just the top-level lineage, not counting all the branches everywhere out from these. And the unix folks will be able to name plenty of formats predating all of this.
I'm not really sure why "Excel hates CSV". I import into Excel all the time. I'm sure the functionality could be expanded, but it seems to work fine. The bit of the process I would like improved is nothing to do with CSV - it's that the exporting programs sometimes rearrange the order of fields, and you have to accommodate that in Excel after the import. But since you can have named columns in Excel (make the data in to a table), it's not a big deal.
One problem is that Excel uses locale settings for parsing CSV files (and, to be fair, other text files). So if you're in e.g. Europe and you've configured Excel to use commas as decimal separators, Excel imports numbers with decimals (with points as decimal separator) as text. Or it thinks the point is a thousands separator. I forgot exactly which one of those incorrect options it chooses.
I don't know what they were thinking, using a UI setting for parsing an interchange format.
There's a way around it, IIRC, with the "From text / csv" command, but that loses a lot of the convenience of double-clicking a CSV file in Explorer or whatever to open it in Excel.
It's really bad if your header row has fewer columns than the data rows. You really need to do the import vs just opening the file, because it's not even obvious that it's dropping data unless you know what to expect from your file.
Excel is halfway decent if you do the 'import' but not if you just doubleclick on them. It seems to have been programmed to intentionally do stupid stuff with them if you just doubleclick on them.
I agree that the default way Excel handles CSV files is terrible. Using Power Query to manage them is the way to go. But it's the general Microsoft approach to backwards compatibility so very unlikely to change now.
In the past I remember Excel not properly handling UTF-8 encoded text in a CSV. It would treat it as raw ASCII (or possibly code page 1252). So if you opened and saved a CSV, it would corrupt any Unicode text in the file. It's possible this has been fixed in newer versions, I haven't tried in a while.
It's related to how older versions of Windows/Office handled Unicode in general.
From what I have heard, it's still an issue with Excel, although I assume that Windows may handle plain text better these days (I haven't used it in a while)
You need to write an UTF-8 BOM at the beginning (0xEF, 0xBB, 0xBF), if you want to make sure it's recognized as UTF-8.
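In Python that's just a codec choice; a minimal sketch (the file name and values are made up):

import csv

# "utf-8-sig" writes the BOM (0xEF 0xBB 0xBF) before the first row.
with open("export.csv", "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerow(["Straße", "naïve", "日本語"])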
Ugh, UTF-8 BOM. Many apps can handle UTF-8 but will try to return those BOM bytes as content; maybe ours did in 2015 too.
I was on the Power Query team when we were improving the encoding sniffing. An app can scan ahead, e.g. 64 kB, but ultimately the user needs to just say what the encoding is. All the Power Query data import dialogs should let you specify the encoding.
UTF-8 BOM is probably not a good idea for anything other than (maybe) plain text documents. For data, many (although not all) programs should not need to care about character encoding, and if they include something such as UTF-8 BOM then it will become necessary to consider the character encoding even though it shouldn't be necessary.
I have repeatedly seen people getting the spreadsheets altered by excel, and in general a lot of troubles due to localisation reasons. Sometimes these changes can be subtle and be hard to spot until somebody tries to troubleshoot what went wrong down the line.
It works better if you click to "import the data" instead of just opening the csv file with it, and if you then choose the right data types. But having to do this every time to make it work is really annoying, especially when you have a lot of columns, plus people can easily get confused by the data types. I have never seen that much confusion with, e.g., macOS's Numbers.
It's an ad hoc text format that is often abused and a last-chance format for interchange. While heuristics can frequently work at determining the structure, they can just as easily frequently fail. This is especially true when dealing with dates and times or other locale-specific formats. Then, people outright abuse it by embedding arrays or other such nonsense.
You can use CSV for interchange, but a DuckDB import script with the schema should accompany it.
"CSV" should die. The linked article makes critical ommisions and is wrong about some points. Goes to show just how awful "CSV" is.
For one thing, it talks about needing only to quote commas and newlines... qotes are usually fine... until they are on either side of the value. then you NEED to quote them as well.
Then there is the question about what exactly "text" is; with all the complications around Unicode, BOM markers, and LTR/RTL text.
CSV on the Web (CSVW) is a W3C standard designed to enable the description of CSV files in a machine-readable way.[1]
"Use the CSV on the Web (CSVW) standard to add metadata to describe the contents and structure of comma-separated values (CSV) data files." — UK Government
Digital Service[2][3]
One thing that has changed the game with how I work with CSVs is ClickHouse. It is trivially easy to run a local database, import CSV files into a table, and run blazing-fast queries on it. If you leave the data there, ClickHouse will gradually optimize the compression. It's pretty magical stuff if you work in data science.
Datasette is a wonderful tool that I've used before, and I have the highest admiration for its creator, but the underlying Sqlite3 database doesn't handle large datasets (i.e. hundreds of millions of rows) nearly as well as ClickHouse does.
It's worth noting that I only ran into this limitation when working with huge federal campaign finance datasets [1] and trying to do some compute-intensive querying. For 99% of use cases, datasette is a similarly magical piece of software for quickly exploring some CSV files.
How much easier would all of this be if whoever did CSV first had done the equivalent of "man ascii". There are all these wonderful codes there like FS, GS, RS, US that could have avoided all the hassle that quoting has brought generations of programmers and data users.
I wrote my own CSV parser in C++. I wasn't sure what to do in some edge cases, e.g. when character 1 is space and character 2 is a quote. So I tried importing the edge case CSV into both MS Excel and Apple Numbers. They parsed it differently!
CSV is so deceptively simple that people don't care to understand it. I wasted countless hours working around services providing non-escaped data that off-the-shelf parsers could not parse.
If CSV is indeed so horrible - and I do not deny that there can be an improvement - how about the clever data people spec out a format that
Does not require a bizarre C++ RPC struct definition library _both_ to write and to read
Does not invent a clever number encoding scheme that requires native code to decode at any normal speed
Does not use a fancy compression algorithm (or several!) that you need - again - native libraries to decompress
Does not, basically, require you be using C++, Java or Python to be able to do any meaningful work with it
It is not that hard, really - but CSV is better (even though it's terrible) exactly because it does not have all of these clever dependency requirements for clever features piled onto it. I do understand the utility of RLE, number encoding etc. I do not, and will not, understand the utility of Thrift/Avro, zstandard and brotli and whatnot over standard deflate, and custom integer encoding which requires you download half of Apache Commons and libboost to decode. Yes, those help the 5% to 10% of the use cases where massive savings can be realised. It absolutely ruins the experience for the other 90 to 95.
But they also give Parquet and its ilk a very high barrier of entry.
I have written a new database system that will convert CSV, JSON, and XML files into relational tables.
One of the biggest challenges with CSV files is the lack of data types on the header line that could help determine the schema for the table.
For example a file containing customer data might have a column for a Zip Code. Do you make the column type a number or a string? The first thousand rows might have just 5 digit numbers (e.g. 90210) but suddenly get to rows with the expanded format (e.g. 12345-1234) which can't be stored in an integer column.
I realize that. But when reading a header you have to infer the data types, which might be wrong. I always thought it would have been great if the first line read something like:
name:STRING,address:STRING,zip code:INTEGER,ID:BIG_INT,...
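A hypothetical sketch of consuming such a typed header; the name:TYPE convention comes from the line above, while the type mapping and everything else is invented for illustration:

import csv

TYPES = {"STRING": str, "INTEGER": int, "BIG_INT": int}   # invented mapping

def read_typed_csv(path):
    with open(path, newline="") as f:
        rows = csv.reader(f)
        header = [col.rsplit(":", 1) for col in next(rows)]   # e.g. ["zip code", "INTEGER"]
        for row in rows:
            # An INTEGER zip code column would still blow up on "12345-1234",
            # which is exactly the point made above.
            yield {name: TYPES[typ](cell) for (name, typ), cell in zip(header, row)}

for rec in read_typed_csv("customers.csv"):    # hypothetical file
    print(rec)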
# Mangle your data with dplyr, regular expressions, search and replace, drop NA's, you name it.
<code to sanitize all your data>
Multiple libraries exist for R to move data around, change the names of entire columns, change values in every single row with regular expressions, drop any values that have no assigned value, it's the swiss army knife of data. There are also all sorts of things you can do with data in R, from mapping with GPS coordinates to complex scientific graphing with ggplot2 and others.
I've found some use cases where CSV can be a good alternative to arrays for storage, search and retrieval. Storing and searching nested arrays in document databases tends to be complicated and require special queries (sometimes you don't want to create a separate collection/table when the arrays are short and 1D). Validating arrays is actually quite complicated; you have to impose limits not only on the number of elements in the array, but also on the type and size of elements within the array. Then it adds a ton of complexity if you need to pass around data because, at the end of the day, the transport protocol is either string or binary; so you need some way to indicate that something is an array if you serialize it to a string (hence why JSON exists).
Reminds me of how I built a simple query language which does not require quotation marks around strings, this means that you don't need to escape strings in user input anymore and it prevents a whole bunch of security vulnerabilities such as query injections. The only cost was to demand that each token in the query language be separated by a single space. Because if I type 2 spaces after an operator, then the second one will be treated as part of the string; meaning that the string begins with a space. If I see a quotation mark, it's just a normal quotation mark character which is part of the string; no need to escape. If you constrain user input based on its token position within a rigid query structure, you don't need special escape characters. It's amazing how much security has been sacrificed just to have programming languages which collapse space characters between tokens...
It's kind of crazy that we decided that quotation marks are OK to use as special characters within strings, but commas are totally out of bounds... That said, I think Tab Separated Values TSV are even more broadly applicable.
Funny how the "specification holds in a tweet" yet manages to miss at least three things: 1) character encoding, 2) BOM or not, 3) header or no header.
Great if you're the one producing the CSV yourself.
But if you're ingesting data from other organizations, they will, at one time or another, fuck up every single one of those (as well as the ones mentioned in TFA), no matter how clearly you specify them.
the simplicity is underappreciated because people don't realise how many dumb data engineers there are. i'm pretty sure most of them can't unpack an xml or json. people see a csv and think they can probably do it themselves, any other data format they think 'gee better buy some software with the integration for this'.
Excel hates CSV only if you don't use the "From text / csv" function (under the data tab).
For whatever reason, it flawlessly manages to import most CSV data using that functionality. It is the only way I can reliably import data to excel with datestamps / formats.
Just drag/dropping a CSV file onto a spreadsheet, or "open with excel" sucks.
… what concrete language are we talking about, here?
In literally any language I can think of, hassle(json) < hassle(CSV), esp. since CSV received is usually "CSV, but I've screwed it up in a specific, annoying way"
The Italian version of Excel uses a custom CSV style with ; as a column separator. This breaks many applications that accept CSVs. It's super annoying.
>This is so simple you might even invent it yourself without knowing it already exists while learning how to program.
This is a double-edged sword. The "you might even invent it yourself" simplicity means that in practice lots of different people do end up just inventing their own version rather than standardizing to RFC-4180 or whatever when it comes to quoting values containing commas, values containing quotes, values containing newlines, etc. And the simplicity means these kinds of non-standard implementations can go completely undetected until a problematic value happens to be used. Sometimes added complexity that forces paying more attention to standards, and quickly surfaces a divergence from those standards, is helpful.
Just last week I was bitten by a customer's CSV that failed due to Windows' invisible BOM character (U+FEFF) that sometimes occurs at the beginning of Unicode text files. The first column's title is then not „First Title" but „<BOM>First Title". Imagine how long it takes before you catch that invisible character.
Aside from that: yes, if CSV were an intentional, well-defined format, most of us would do something different here and there. But it is not; it is more of a convention that came upon us. CSV „happened", so to say. No need to defend it more passionately than the fact that we walk on two legs. It could have been much worse, and it has surprising advantages over other things that were well thought out before we adopted them.
There was/is CSVY [0] which attempted to put column style and separator information in a standard header. It is supported by R lang.
I also asked W3C on their GitHub if there was any spec for CSV headers and they said there isn't [1].
Kind of defeats the point of the spec in my opinion.
CSV is the bane of my existence. There is no reason to use it outside of legacy use-cases, when so many alternatives are not so brittle that they require endless defensive hacks to avoid erring as soon as exposed to the universe. CSV must die.
CSV isn't brittle, and I'm not sure what "hacks" you're referring to. If you or your parser just follow RFC 4180 (particularly: quote every field, and double a quote character to escape it), that will get you 90%+ compatibility.
Surely you’ve come across situations where line number 10,000,021 of a 60m line CSV fails to parse because there aren’t enough fields in that line of the file…? The issue is that you can’t definitively know which of the 50 fields is missing, so you have to fail the line or worse the file.
In my experience (perhaps more niche than yours since you mentioned it has been your day job), the lack of fall back options makes for brittle integrations. Failing entire files due to a borked row can be expensive in terms of time.
Having to ingest large CSV files from legacy systems has made me rethink the value of XML, lol. Types and schemas add complexity for sure, but you get options for dealing with variances in structure and content.
That is a problem, but it is also a problem with XML. Parsing an XML file to discover e.g. unmatched tags is far more CPU- and memory-expensive than correctly parsing a CSV.
In both cases you'd fail the entire file rather than partial recovery.
RFC4180 is a late attempt at CSV standardization, merely codifying a bunch of sane practices. It also provides a nice specification for generating CSV. But anyone taking care to code from a specification might as well use a proper file format.
The real specification for CSV is as follows: "Valid CSV is whatever is designated as CSV by its emitter". I wish I was joking.
There is literally an infinity of ways CSV can be broken. The developer will bump his head on each as he encounters them, and add a specific fix. After a while, his code will be robust against the local strains of CSV... until the next mutation is encountered, after acquiring yet another company with a bunch of ERPs way past their last extended maintenance era, a history of local adaptations, and CSV as a message bus.
CSV is awesome for front-end webapps needing to fetch A LOT of data from a server in order to display an information-dense rendering. For that use-case, one controls both sides so the usual serialization issues aren't a problem.
Working in clinical trial data processing I receive data in 1 of 3 formats:
csv, sas datasets, image scans of pdf pages showing spreadsheets
Of these 3 options, SAS datasets are my preference, but I'll immediately convert to CSV or Excel. CSV is a close 2nd: once you confirm the quoting / separator conventions it's very easy to parse. I understand why someone may find the CSV format disagreeable, but in my experience the alternatives can be so much worse that I don't worry too much about CSV files.
Typical latin fonts divide characters into three heights: short like "e" or "m", tall like "l" or "P" and deep like "j" or "y". As you may notice, letters only use one or two of these three sections.
Pipe is unique in that it uses all three at the same time from the very top to the very bottom. No matter what latin character you put next to it, it remains distinct. This makes the separators relatively easy to spot.
Pipe is a particularly uncommon character in normal text while commas, spaces, semicolons, etc are quite common. This means you don't need to escape it very often. With an escapable pipe, an escapable newline, and unicode escapes ("\|", "\n", and "\uXXXX") you can handle pretty much everything tabular with minimal extra characters or parsing difficulty.
This in turn means that you can theoretically differentiate between different basic types of data stored within each entry without too much difficulty. You could even embed JSON inside it as long as you escape pipes and newlines.
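A hypothetical sketch of that escaping scheme (only \| and \n handled here, \uXXXX left out; the details are my own reading of the idea, not a spec):

def escape(field: str) -> str:
    return field.replace("\\", "\\\\").replace("|", "\\|").replace("\n", "\\n")

def split_record(line: str):
    fields, cur, chars = [], [], iter(line)
    for ch in chars:
        if ch == "\\":                       # escape introducer
            nxt = next(chars, "")
            cur.append("\n" if nxt == "n" else nxt)
        elif ch == "|":                      # unescaped pipe = field boundary
            fields.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    fields.append("".join(cur))
    return fields

row = ["plain", "has | pipe", "two\nlines"]
line = "|".join(escape(f) for f in row)
assert split_record(line) == row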
I already prefer using pipe as separator in logging; now you're telling me there is a chance that my logs can be automatically ingested as tabular data? Sign me up for this branch of the multiverse :)
Look at how it handles escaping of special characters and particularly new lines (RFC 4180 doesn’t guarantee that a new line is a new record) and how it’s written in 2005 yet still doesn’t handle unicode other than via a comment about ”other character sets”.
Doesn't it make sense to have a common document in the usual format (RFC) which every newbie can consult when in doubt?
I much prefer that to any sort of "common institutional memory" that is nevertheless only talked about on random forums. People die, other people enter the field... hello subtle incompatibilities.
The fact that you can parse CSV in reverse is quite cool, but you can't necessarily use it for crash recovery (as suggested) because you can't be sure that the last thing written was a complete record.
The column count doesn't help because you don't know where the last record starts because you don't know whether you're in a string or not. Unless you scan the entire file from the beginning, which defeats the object.
I love CSV for a number of reasons. Not the least of which it’s super easy to write a program (code) in C to directly output all kinds of things to CSV. You can also write simple middleware to go from just about any database or just general “thing” to CSV. Very easily. Then toss CSV into excel and do literally anything you want.
It’s sort of like, the computing dream when I was growing up.
+1 to ini files. I like you can mess around with them yourself in notepad. Wish there was a general outline / structure to those though.
I have been racking my brain trying to parse data from an ERP database to CSV and then from CSV back into the ERP database, using the programming language used by the ERP system.
The first part, converting data to CSV, works fine with the help of an AI coding assistant.
The reverse part, CSV to database, is proving challenging, and even Claude Sonnet 3.7 is not able to escape newlines correctly.
I am now implementing the data format in JSON, which is much simpler.
Items #6 to #9 sound like genuine trolling to me; item #8, reversing bytes because of course no other text encodings than ASCII exist, is particularly horrible.
The reversing-bytes part is encoding agnostic: you just feed the reversed bytes to the CSV parser, then re-reverse both the order of the yielded rows and cells and the bytes within each cell, and you get the original data back.
I think I understand the point being made, but all this reliance on text-based data means we require proper agreement on text encodings, etc. I don't think it's very useful for number-based data anyway, it's a massively bloated way to store float32s for instance and usually developers truncate the data losing about half of the precision in the process.
For numerical data, nothing beats packing floats into blobs.
I think binary formats have many advantages. Not only for numbers but other data as well, including data that contains text (to avoid needing escaping, etc; and to declare what character sets are being used if that is necessary), and other structures. (For some of my stuff I use a variant of DER, which adds a few new types such as key/value list type.)
I can't really understand the love for the format. Yes, it's simple, but it's also not defined in a common spec. Same story with Markdown: yes, GitHub tried to push for a spec, but it still feels more like a flavor. I mean, there is nothing wrong with not having a spec, but certain guarantees are not given. Will the document exported by X work with Y?
I love CSV when it's only me creating/using the CSV. It's a very useful spreadsheet/table interchange format.
But god help you if you have to accept CSVs from random people/places, or there's even minor corruption. Now you need an ELT pipeline and manual fix-ups. A real standard is way better for working with disparate groups.
CSV has caused me a lot of problems due to the weak type system. If I save a Dataframe to CSV and reload it, there is no guarantee that I'll end up with an identical dataframe.
I can depend on parquet. The only real disadvantages with parquet are that they aren't human-readable or mutable, but I can live with that since I can easily load and resave them.
Quick question while we’re on the topic of CSV files: is there a command-line tool you’d recommend for handling CSV files that are malformed, corrupted, or use unexpected encodings?
My experience with CSVs is mostly limited to personal projects, and I generally find the format very convenient. That said, I occasionally (about once a year) run into issues that are tricky to resolve.
I'm in on the "shit on microsoft for hard to use formats train" but as someone who did a LOT of .docx parsing - it turned into zen when I realized that I can just convert my docs into the easily parsed .html5 using something like pandoc.
This is a good blog post and Xan is a really neat terminal tool.
Excel won't import ISO 8601 timestamps either, which is crazy these days where it's the universal standard, and there's no excuse to use anything else.
You have to replace the "T" separator with a space and also any trailing "Z" UTC suffix (and I think any other timezone/offset as well?) for Excel to be able to parse as a time/date.
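A minimal sketch of that rewrite (it assumes the timestamp is UTC or carries no offset; anything fancier needs real date parsing):

def excel_friendly(ts: str) -> str:
    # "2024-05-01T13:45:00Z" -> "2024-05-01 13:45:00"
    return ts.replace("T", " ").removesuffix("Z")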
Have you tried using the "from text/csv" importer under the data tab? Where it will import your data into a table. Because that one will import ISO 8601 timestamps just fine.
This, it's dumb but Excel handles csv way better if you 'import' it vs just opening it. I use excel to quickly preview csv files, but never to edit them unless I'm OK only ever using it in Excel afterwards.
Even in that case I'd be hesitant to open a CSV file in excel. The problem is that it will automatically apply whatever transformation it thinks is appropriate the moment you open the file. Have a digit string that isn't semantically a number? Too bad, it's a number now, and we're gonna go ahead and round it. You didn't really need _all_ of the digits of that insurance policy number, did you?
They did finally add options to turn off the common offenders, but I have a deeply ingrained distrust at this point.
I don't get it - why in the world can't Excel just open the CSV, assume from the extension that it's COMMA separated values, and do the rest?
It does work slightly better when importing, just a little.
No, French systems also use the comma to separate fields in CSV files. Excel uses a semicolon to separate fields in France, meaning it generates semicolon-separated files rather than comma-separated files.
It's not the fault of CSV that Excel changes which file format it uses based on locale.
It's even worse than that. Office on my work computer is set to the English language, but my locale is French and so is my Windows language. It's saving semicolon-separated CSV files with the comma as a decimal point.
I need to uncheck File > Options > Advanced > Use system separators and set the decimal separator to a dot to get Excel to generate English-style CSV files with semicolon-separated values. I can't be bothered to find out where Microsoft moved the CSV export dialog again in the latest version of Office to get it to spit out comma-separated fields.
Point is, CSV is a term for a bunch of loosely-related formats that depends among other things on the locale. In other words, it's a mess. Any sane file format either mandates a canonical textual representation for numbers independent of locale (like JSON) or uses binary (like BSON).
> It's saving semicolon-separated CSV files with the comma as a decimal point.
It's not though, is what I'm saying. It's saving semicolon-separated files, not CSV files. CSV files have commas separating the values. Saying that Excel saves "semicolon-separated CSV files" is nonsensical.
I can save binary data in a .txt file, that doesn't make it a "text file with binary data"; it's a binary file with a stupid name.
Sorry, but what Excel does is save to a file with a CSV extension. This format is well defined and includes ways to specify encoding and separator to be readable under different locales.
This format is not comma separated values. But Excel calls it CSV.
The headaches come if people assume that a CSV file must be comma-separated.
That, bad specs, weird management/timezone/governance/communication issues, and random \n\r issues transformed a fun little 2-day project into 4 weeks of hell. I will never work with CSV in France ever again. Mostly because of Excel; normal CSV is fine.
Most of the people most of the time aren't importing data from a different locale. A good assumption for defaults could be that the CSV file honors the current Windows regional settings.
It could, but it doesn't want to. The whole MS Office dominance came into being by making sure other tools can't properly open documents created by MS tools; plus being able to open standard formats but creating small incompatibilities all around, so that you share the document in MS format instead.
Probably Microsoft treats a pure-text, simply specified, human-readable and editable spreadsheet format that fosters interoperability with competing software as an existential threat.
This. I got burnt by the encoding and other issues with CSV in Excel back in the day, I've only used LibreOffice Calc (on Linux) for viewing / editing CSVs for many years now, it's almost always a trouble-free experience. Fortunately I don't deal much with CSVs that Excel-wielding non-devs also need to open these days - I assume that, for most folks, that's the source of most of their CSV woes.
I just wish Excel was a little less bad about copy-pasting CSVs as well. Every single time, without fail, it dumps them into a single column. Every single time I use "text to columns" it has insane defaults where it's fixed-width instead of delimited by, you know, commas. So I change that and finally it's fixed.
Then I go do it somewhere else and have to set it up all over again. Drives me nuts. How the default behavior isn't to just put them in the way you'd expect is mind-boggling.
Honestly, if there is no comma to separate the values, then it's not CSV; maybe it should be CSV for character-separated values, or ASV for anything-separated values. But you're right, the fact that everyone is doing whatever makes it hard. IMV supporting "" quoting makes supporting anything else redundant.
It makes me a bit worried to read this thread; I would've thought it's pretty common knowledge why CSV is horrible, and widely agreed upon. I also have a hard time taking seriously anybody who uses "specification" and "CSV" in the same sentence unironically.
I suspect it's 1) people who worked with legacy systems AND LIKED IT, or 2) people who never worked with legacy systems before and need to rediscover painful old lessons for themselves.
It feels like trying, unsuccessfully, to convince someone in 1999 why it's a bad idea to store the year as a CHAR(2).
The thread makes me worried for a different reason, especially how many people avoid reading the RFC and even say that the RFC doesn't matter, that there's no standard for CSV. Maybe there should be a separate extension for RFC-compliant CSV.
So easy to get data in and out of an application, opens seamlessly in Excel or your favorite DB for further inspection. The only issue is the comma rather than a less-used separator like | that occasionally causes issues.
Any recommendations for CSV editors on OSX? I was just looking around for this today. The "Numbers" app is pretty awful and I couldn't find any superb substitutes, only ones that were just OK.
I've said it before, CSV will still be used in 200 years. It's ugly, but it occupies an optimal niche between human readability, parsing simplicity, and universality.
JSON, XML, YAML are tree describing languages while CSV defines a single table. This is why CSV still works for a lot of things (sure there is JSON lines format of course).
and of course you can do that if you wish, as many CSV libraries allow arbitrary separators and escapes (though they usually default to the "excel compatible" format)
but at least in my case, I would not like to use those characters because they are cumbersome to work with in a text editor. I like very much to be able to type out CSV columns and rows quickly, when I need to.
It’s all a pros and cons.. the benefit of those characters are they are not used anywhere else, hence you never have to worry about escaping/quoting strings. But obviously most of my csv usage is automated in/out.
with quick and dirty bash stuff ive written the same csv parser so many times it lives in my head and i can write it from memory. no other format is like that. trying to parse json without jq or a library is much more difficult
Kudos for writing this, it's always worth flagging up the utility of a format that just is what it is, for the benefit of all. Commas can also create fun ambiguity, as that last sentence demonstrates. :P
CSV is lovely. It isn't trying to be cool or legendary. It works for the reasons the author proposes, but isn't trying to go further.
I work in a world of VERY low power devices, and CSV sometimes is all you need for a good time.
If it doesn't need to be complicated, it shouldn't be. There are always times when I think to myself CSV fits and that is what makes it a legend. Are those times when I want to parallelise or deal with gigs of data in one sitting. Nope. There are more complex formats for that. CSV has a place in my heart too.
Thanks for reminding me of the beauty of this legendary format... :)
I'll repeat what I say every time I talk about CSV: I have never encountered a customer who insisted on integrating via CSV who was capable of producing valid CSV. Anybody who can reliably produce valid CSV will send you something else if you ask for it.
> CSV is not a binary format, can be opened with any text editor and does not require any specialized program to be read. This means, by extension, that it can both be read and edited by humans directly, somehow.
This is why you should run screaming when someone says they have to integrate via CSV. It's because they want to do this.
Nobody is "pretending CSV is dead." It'll never die, because some people insist on sending hand-edited, unvalidated data files to your system and not checking for the outcome until mid-morning the next day when they notice that the text selling their product is garbled. Then they will frantically demand that you fix it in the middle of the day, and they will demand that your system be "smarter" about processing their syntactically invalid files.
Seriously. I've worked on systems that took CSV files. I inherited a system in which close to twenty "enhancement requests" had been accepted, implemented, and deployed to production that were requests to ignore and fix up different syntactical errors, because the engineer who owned it was naive enough to take the customer complaints at face value. For one customer, he wrote code that guessed at where to insert a quote to make an invalid line valid. (This turned out to be a popular request, so it was enabled for multiple customers.) For another customer, he added code that ignored quoting on newlines. Seriously, if we encountered a properly quoted newline, we were supposed to ignore the quoting, interpret it as the end of the line, and implicitly append however many commas were required to make the number of fields correct. Since he actually was using a CSV parsing library, he did all of this in code that would pre-process each line, parse the line using the library, look at the error message, attempt to fix up the line, GOTO 10. All of these steps were heavily branched based on the customer id.
The first thing I did when I inherited that work was make it clear to my boss how much time we were spending on CSV parsing bullshit because customers were sending us invalid files and acting like we were responsible, and he started looking at how much revenue we were making from different companies and sending them ultimatums. No surprise, the customers who insisted on sending CSVs were mostly small-time, and the ones who decided to end their contracts rather than get their shit together were the least lucrative of all.
> column-oriented data formats ... are not able to stream files row by row
I so hate CSV.
I am on the receiving end: I have to parse CSV generated by various (very expensive, very complicated) eCAD software packages. And it's often garbage. Those expensive software packages trip on things like escaping quotes. There is no way to recover a CSV line that has an unescaped double quote.
I can't point to a strict spec and say "you are doing this wrong", because there is no strict spec.
Then there are the TSV and semicolon-Separated V variants.
Did I mention that field quoting was optional?
And then there are banks, which take this to another level. My bank (mBank), which is known for levels of programmer incompetence never seen before (just try the mobile app) generates CSVs that are supposed to "look" like paper documents. So, the first 10 or so rows will be a "letterhead", with addresses and stuff in various random columns. Then there will be your data, but they will format currency values as prettified strings, for example "34 593,12 USD", instead of producing one column with a number and another with currency.
I used to be a data analyst at a Big 4 management consultancy, so I've seen an awful lot of this kind of thing. One thing I never understood is the inverse correlation between "cost of product" and "ability to do serialisation properly".
Free database like Postgres? Perfect every time.
Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting, escaping or the difference between \n and \r and who thinks it's clever to use 0xFF as a delimiter, because in the Windows-1252 code page it looks like a weird rune and therefore "it won't be in the data".
"Enterprise software" has been defined as software that is purchased based on the decisions of people that will not use it. I think that explains a lot.
Yep, we had a constant tug of war between techies who wanted to use open-source tools that actually work (Linux, Postgres, Python, Go etc.) and bigwigs who wanted impressive-sounding things in Powerpoint decks and were trying to force "enterprise" platforms like Palantir and IBM BigInsights on us.
Any time we were allowed to actually test one of the "enterprise" platforms, we'd break it in a few minutes. And I don't mean by being pathologically abusive, I mean stuff like "let's see if it can correctly handle a UTF-8 BOM...oh no, it can't".
> Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting...
It's because about a certain size, system projects are captured by the large consultancy shops, who eat the majority of the price in profit and management overhead...
... and then send the coding work to a lowest-cost someone who has never heard of quoting, etc.
And it's a vicious cycle, because the developers in those shops that do learn and mature quickly leave for better pay and management.
(Yes, there's usually a shit hot tiger team somewhere in these orgs, but they spend all their time bailing out dumpster fires or landing T10 customers. The average customer isn't getting them.)
Just a nitpick about consultancy shops -- I've had a chance of working in one in eastern europe and noticed that it's approach to quality was way better than client's. It also helped that client paid by hours, so consultancy company was incentivized to spend more time on refactorings, improvals and testing (with constant pushback from client).
So I don't buy the consultancy company sentiment, it always boils down to engineers and incentives.
How big was the one you worked for?
In my experience, smaller ones tend to align incentives better.
Once they grow past a certain size though, it's a labor arbitrage game. Bill client X, staff with resources costing Y (and over-represented), profit = X-Y, minimize Y to maximize profit.
PwC / IBM Global Services wasn't offering the best and brightest. (Outside of aforementioned tiger teams)
I agree with you in general, although my case was the other way around. My company was 10k+ people, but my client was probably the most technically advanced company at that time, with famously hard interviews for their own employees. My employer also didn't want to lose the client (it was the beginning of the collaboration), and since everyone wanted to work there (and move to US+California), my shop applied a pretty strong filter to its own heads even before sending them to the client's vendor interview.
And the client was very, very happy with the quality, and with the fact that we didn't fight for promotions and could maintain very important but promotion-poor projects. Up to the point that the client trusted us enough to hand a couple of projects fully over to my shop. When you don't need to fight for promotions, code quality also improves.
Try to live in a country where "," is the decimal point. Of course this causes numerous interoperability issues or hidden mistakes in various data sets.
There would have been many better separators... but good idea to bring formatting into it as well...
There was a long period of my life that I thought .csv meant cemicolon separated because all I saw was cemicolon separated files and I had no idea of the pain.
Although it is spelled "semicolon," so that doesn't quite fit.
CMYK -- Cyan, Magenta, Yellow, blacK :)
(of course it originally stood for "key", but you don't see that much anymore)
Not sure if they still do this, but Klarna would send us ", "-separated files. If there wasn't a space after the comma then it was to be read as a decimal point. Most CSV parsers don't/didn't allow you to specify multi-character separators. In the end I just accepted that we had one field for krona and one for öre, and that most fields would need a leading space removed.
Microsoft did this very extensively. Many non-English versions of Excel save CSV files with a semicolon as the separator, and it was probably handled differently in normal Excel files too. But it goes even further: it affects their scripting languages to this day, even newer ones like their BI script (I forgot the name of the language). For example, parameters of function calls aren't separated by ',' anymore; ';' is used instead. But only in the localized versions.
That of course means that you have to translate these scripts depending on the locale set in your office suite, otherwise they are full of syntax errors...
Many English-origin programming languages use ';' to end statements instead of '.'.
Many European ones use '.' to end statements, Prolog (France) for example, but use ';' to separate arguments.
There are better separators included in ASCII, but not used as often: 28 File Separator, 29 Group Separator, 30 Record Separator and 31 Unit Separator.
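For the curious, here is a minimal Python sketch of what that looks like in practice. It only works if the data itself can never contain those control bytes, which is exactly the catch debated further down:

```python
# The ASCII codes designed for this job: file, group, record and unit separators.
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"

rows = [["id", "note"], ["1", 'contains, commas and "quotes"']]

# Serialize: fields joined by the unit separator, records by the record separator.
blob = RS.join(US.join(fields) for fields in rows)

# Parse it back with no quoting or escaping rules at all.
parsed = [record.split(US) for record in blob.split(RS)]
assert parsed == rows
```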
TSV should do it for you. Been there done that.
> Then there will be your data, but they will format currency values as prettified strings, for example "34 593,12 USD", instead of producing one column with a number and another with currency.
To be fair, that's not a problem with CSV but with the provider's lack of data literacy.
Yeah, you can also use Parquet/JSON/protobuf/XLSX and store numbers as strings in this format. CSV is just a container.
But somehow CSV is the PHP of serialization formats, attracts the wrong kind of developers and projects.
I definitely wouldn't say that. I saw a lot of weird stuff in Excel files, and there's the whole crowd only giving you data as PDFs.
I worked in a web shop which had to produce spreadsheets which people wanted to look at in Excel. I gave them so many options, and told each client to experiment and choose the option which worked for them. In the end, we had (a) UTF-8 CSV, (b) UTF-8 CSV with BOM, (c) UTF-16 TSV, (d) UTF-8 HTML table with a .xlsx file extension and a lying Content-Type header which claimed it was an Excel spreadsheet.
Option a worked fine so long as none of the names in the spreadsheet had any non-ASCII characters.
Option d was by some measures the worst (and was definitely the largest file size), but it did seem to consistently work in Excel and Libre Office. In fact, they all worked without any issue in Libre Office.
> I can't point to a strict spec and say "you are doing this wrong", because there is no strict spec.
Have you tried RFC 4180?
https://www.ietf.org/rfc/rfc4180.txt
I've written a commercial, point and click, data wrangling tool (Easy Data Transform) that can deal with a lot of these issues:
-different delimiters (comma, semi-colon, tab, pipe etc)
-different encodings (UTF8, UTF16 etc)
-different line endings (CR, LF, CR+LF)
-ragged rows
-splitting and merging columns
And much more besides.
However, if you have either:
-line feeds and/or carriage returns in data values, but no quoting
or
-quoting, but quotes in data values aren't properly handled
Then you are totally screwed and you have my sympathies!
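Some of that delimiter guessing is approachable even with the Python standard library; a rough sketch using csv.Sniffer (it only guesses the delimiter and quoting, so encodings, line endings and ragged rows still need their own handling):

```python
import csv

def sniff_and_read(path: str, encoding: str = "utf-8-sig"):
    with open(path, newline="", encoding=encoding) as f:
        sample = f.read(64 * 1024)
        f.seek(0)
        # Guess the dialect (delimiter, quote char) from a sample of the file.
        dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
        has_header = csv.Sniffer().has_header(sample)
        return has_header, list(csv.reader(f, dialect))

# Hypothetical usage (file name made up for illustration):
# header_present, rows = sniff_and_read("export_from_somewhere.csv")
```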
I agree and as a result I have completely abandoned CSV.
I use the industry standard that everyone understands: ECMA-376, ISO/IEC 29500 aka .xlsx.
Nobody has any problems producing or ingesting .xlsx files. The only real problem is the confusion between numbers and numeric text that happens when people use excel manually. For machine to machine communication .xlsx has never failed me.
Off the top of my head:
https://learn.microsoft.com/en-us/office/troubleshoot/excel/...
Now you might argue that ECMA-376 accounts for this, because it has a `date1904` flag, which has to be 0 for 1900-based dates and 1 for 1904-based dates. But what does that really accomplish if you can’t be sure that vendors understand subtleties like that if they produce or consume it? Last time I checked (maybe 8 years ago), spreadsheets created on Windows and opened on Mac still shifted dates by four years, and the bug was already over twenty years old at that time.
And the year-1904 issue is just the one example that I happen to know.
I have absolutely zero confidence in anything that has touched, or might have touched, MS Excel with anything short of a ten-foot pole.
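For reference, a common way to convert Excel's raw date serials in Python, assuming you have the serial number and the workbook's date1904 flag; the 1,462-day gap between the two epochs is exactly the four-year shift described above:

```python
from datetime import datetime, timedelta

def excel_serial_to_datetime(serial: float, date1904: bool = False) -> datetime:
    # The 1904 system counts days from 1904-01-01. The 1900 system is usually
    # anchored at 1899-12-30 to absorb the fictitious 1900-02-29 (serial 60),
    # so this is only reliable for serials after that point.
    epoch = datetime(1904, 1, 1) if date1904 else datetime(1899, 12, 30)
    return epoch + timedelta(days=serial)

print(excel_serial_to_datetime(45000))                  # a 1900-system date
print(excel_serial_to_datetime(45000, date1904=True))   # same serial, ~4 years later
```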
Parsing Excel files in simple data interchange use cases that don't involve anyone manually using spreadsheets is an instance of unnecessary complexity. There are plenty of alternatives to CSV that remain plaintext, have much broader support, and are more rigorous than Excel in ensuring data consistency. You can use JSON, XML, ProtoBuf, among many other options.
But everyone already has a GUI installed for editing xlsx files...
Which introduces even more problems when manually editing files is out of scope.
I was recently writing a parser for a weird CSV. It had multiple header column rows in it as well as other header rows indicating a folder.
This RFC maybe?
https://www.ietf.org/rfc/rfc4180.txt
The RFC explicitly does not define a standard
I hate CSV too. If I have to use it, I'll live with TSV or some other special-character-delimited format.
I'd much rather it be something that is neither used in normal* text/numbers nor whitespace, so non-printable delimiters win for me.
* Don't mind me extending 'normal' here to include human-written numbers with thousands separators.
If only there were character codes specifically meant to separate fields and records.... we wouldn't have to worry so much about quoted commas or quoted quotes.
That isn't solving anything, just changing the problem. If I want to store a string containing 0x1C - 0x1F in one of the columns then we're back in the exact same situation while also losing the human readable/manually typeable aspect people seem to love about CSV. The real solution is a strict spec with mandatory escaping.
Not for text data. Those values are not text characters like , or " are, and have only one meaning. It would be like arguing that 0x41 isn't always the letter "A".
For binary files, yeah but you don't see CSV used there anyway.
The idea that binary data doesn't go in CSVs is debatable; people do all sorts of weird stuff. Part of the robustness of a format is coping with abuse.
But putting that aside, if the control chars are not text, then you sacrifice human-readability and human-writability. In which case, you may as well just use a binary format.
True, but very few people compose or edit CSV data in Notepad. You can, but it's very error-prone. Most people will use a spreadsheet and save as CSV, so field and record separator characters are not anything they would ever deal with.
I've dealt with a few cases of CSVs including base64-encoded binary data. It's an unusual scenario, but the tools for working with CSVs are robust enough that it was never an issue.
So in addition to losing human readability, we are also throwing away the ability to nest (pseudo-)CSVs? With comma delimiters, I can take an entire CSV document and put it in 1 column, but with 0x1C-0x1F delimiters and banning non-text valid utf-8 in columns I no longer can. This continues to be a step backwards.
No reason you can't escape those special characters.
Then we're back to my original response of it doesn't solve anything: https://news.ycombinator.com/item?id=43495217
There are lots of 8-bit mostly-ASCII character sets that assign printable glyphs to some or all of the codepoints that ASCII assigns to control characters. TeX defined one, and the IBM PC's "code page 437" defined another.
There are several ways a control character might inadvertently end up inside a text corpus. Given enough millions of lines, it’s bound to happen, and you absolutely don’t want it to trip up your whole export because of that one occurrence. So yes, you have to account for it in text data, too.
Sure, but that's orders of magnitude less likely to happen than a comma ending up inside a text corpus.
There's just no such thing as a delimiter which won't find its way into the data. Quoting and escaping really are the only robust way.
You can disallow all control characters (ASCII < 32) other than CR/LF/TAB, which is reasonable. I don't know of any data besides binary blobs which uses those. I've never heard of anyone inlining a binary file (like an image) into a "CSV" anyway.
If you disallow control characters so that you can use them as delimiters, then CSV itself becomes a "binary" data format - or to put it another way, you lose the ability to nest CSV.
It isn't good enough to say "but people don't/won't/shouldn't do that", because it will just happen regardless. I've seen nested CSV in real-life data.
Compare to the zero-terminated strings used by C, one legacy of which is that PostgreSQL doesn't quite support UTF-8 properly, because it can't handle a 0 byte in a string, because 0 is "special" in C.
Nested CSVs as you've seen in real-life data are a good counterexample, thanks for providing it.
So have a way to escape those control characters.
Right, but the original point I was responding to is that control characters are disallowed in the data and therefore don't need to be escaped. If you're going to have an escaping mechanism then you can use "normal" characters like comma as delimiters, which is better because they can be read and written normally.
But a comma is much more likely to need to be escaped.
It's good for a delimiter to be uncommon in the data, so that you don't have to use your escaping mechanism too much.
This is a different thing altogether from using "disallowed" control characters, which is an attempt to avoid escaping altogether - an attempt which I was arguing is doomed to fail.
CSV is a data-exchange format.
But it is terrible at that because there is no widely adhered to standard[1], the sender and receiver often disagree on the details of what exactly a CSV is.
[1]: yes, I know about RFC 4180. But csvs in the wild often don't follow it.
Agreed that CSV construction isn't consistent, even with standards.
That footprint seems to leave dozens of variations to work with, or to find a library for?
CSVs are universal though, plain text kind of like Markdown, and that is my intended main point.
The only times I hated CSV was when it came from another system I had no control over. For example Windows and their encodings, or some other proprietary BS.
But CSV under controlled circumstances is very simple.
And speaking of Wintendo, the bonus is often that you can go straight from CSV to Excel presentation for the middle management.
The post should at least mention in passing the major problem with CSV: it is a "no spec" family of de-facto formats, not a single thing (it is an example of "historically grown"). And omission of that means I'm going to have to call this out for its bias (but then it is a love letter, and love makes blind...).
Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files, and there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa. Quoting, escaping, UTF-8 support are particular problem areas, but also that you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess but there ambiguities in it that cannot be resolved by machine).
Having worked extensively with SGML for linguistic corpora, with XML for Web development and recently with JSON, I would say that programmatically JSON is the most convenient to use regarding client code, but its lack of types makes it less broadly useful than SGML, which is rightly used by e.g. airlines for technical documentation and by digital humanities researchers to encode/annotate historic documents, for which it is very suitable, though it programmatically puts more burden on developers. You can't have it all...
XML is simpler than SGML, has perhaps the broadest scope and a good software support stack (mostly FOSS), but it has been abused a lot (nod to Java coders: Eclipse, Apache UIMA). I guess a format is not responsible for how people use or abuse it. As usual, the best developers know the pros and cons and make good-taste judgments about what to use each time, but some people go ideological.
(Waiting for someone to write a love letter to the infamous Windows INI file format...)
In fairness there are also several ambiguities with JSON. How do you handle multiple copies of the same key? Does the order of keys have semantic meaning?
jq supports several pseudo-JSON formats that are quite useful like record separator separated JSON, newline separated JSON. These are obviously out of spec, but useful enough that I've used them and sometimes piped them into a .json file for storage.
Also, encoding things like IEEE NaN/Infinity, and raw byte arrays has to be in proprietary ways.
JSON Lines is not JSON; it is built on top of it. The .jsonl extension can be used to make that clear: https://jsonlines.org/
Back in my day it was called NDJSON.
The industry is so chaotic now we keep giving the same patterns different names, adding to the chaos.
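Whatever the name, the newline-delimited variant is trivial to stream; a minimal Python sketch (the file name in the usage comment is hypothetical):

```python
import json

def read_jsonl(path: str):
    # One JSON document per line (JSON Lines / NDJSON); blank lines are skipped.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# for record in read_jsonl("events.jsonl"):
#     print(record)
```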
> How do you handle multiple copies of the same key
That’s unambiguously allowed by the JSON spec, because it’s just a grammar. The semantics are up to the implementation.
interestingly other people are answering the opposite in this thread.
They're wrong.
From ECMA-404[1] in section 6:
> The JSON syntax does not impose any restrictions on the strings used as names, does not require that name strings be unique, and does not assign any significance to the ordering of name/value pairs.
That IS unambiguous.
And for more justification:
> Meaningful data interchange requires agreement between a producer and consumer on the semantics attached to a particular use of the JSON syntax. What JSON does provide is the syntactic framework to which such semantics can be attached
> JSON is agnostic about the semantics of numbers. In any programming language, there can be a variety of number types of various capacities and complements, fixed or floating, binary or decimal.
> It is expected that other standards will refer to this one, strictly adhering to the JSON syntax, while imposing semantics interpretation and restrictions on various encoding details. Such standards may require specific behaviours. JSON itself specifies no behaviour.
It all makes sense when you understand JSON is just a specification for a grammar, not for behaviours.
[1]: https://ecma-international.org/wp-content/uploads/ECMA-404_2...
> and does not assign any significance to the ordering of name/value pairs.
I think this is outdated? I believe that the order is preserved when parsing into a JavaScript Object. (Yes, Objects have a well-defined key order. Please don't actually rely on this...)
In the JS spec, you'd be looking for 25.5.1
If I'm not mistaken, this is the primary point:
> Valid JSON text is a subset of the ECMAScript PrimaryExpression syntax. Step 2 verifies that jsonString conforms to that subset, and step 10 asserts that that parsing and evaluation returns a value of an appropriate type.
And in the algorithm
If you theoretically (not practically) parse a JSON file into a normal JS AST then loop over it this way, because JS preserves key order, it seems like this would also wind up preserving key order. And because it would add those keys to the final JS object in that same order, the order would be preserved in the output.
> (Yes, Objects have a well-defined key order. Please don't actually rely on this...)
JS added this in 2009 (ES5) because browsers already did it and loads of code depended on it (accidentally or not).
There is theoretically a performance hit to using ordered hashtables. That doesn't seem like such a big deal with hidden classes except that `{a:1, b:2}` is a different inline cache entry than `{b:2, a:1}` which makes it easier to accidentally make your function polymorphic.
In any case, you are paying for it, you might as well use it if (IMO) it makes things easier. For example, `let copy = {...obj, updatedKey: 123}` is relying on the insertion order of `obj` to keep the same hidden class.
In JS maybe (I don't know tbh), but that's irrelevant to the JSON spec. Other implementations could make a different decision.
Ah, I thought the quote was from the JS spec. I didn't realize that ECMA published their own copy of the JSON spec.
Internet JSON (RFC 7493) forbids objects to have members with duplicate names.
As it says:
I-JSON (short for "Internet JSON") is a restricted profile of JSON designed to maximize interoperability and increase confidence that software can process it successfully with predictable results.
So it's not JSON, but a restricted version of it.
I wonder if use of these restrictions is popular. I had never heard of I-JSON.
I think it's rare for them to be explicitly stated, but common for them to be present in practice. I-JSON is just an explicit list of these common implicit limits. For any given tool/service that describes itself as accepting JSON, I would expect I-JSON documents to be more likely to work as expected than non-I-JSON.
> How do you handle multiple copies of the same key? Does the order of keys have semantic meaning?
This is also an issue, due to the way key ordering works in JavaScript.
> record separator separated JSON, newline separated JSON.
There is also JSON with no separators, although that will not work very well if any of the top-level values are numbers.
> Also, encoding things like IEEE NaN/Infinity, and raw byte arrays has to be in proprietary ways.
Yes, as well as non-Unicode text (including (but not limited to) file names on some systems), and (depending on the implementation) 64-bit integers and big integers. Possibly also date/time.
I think DER avoids these problems. You can specify whether or not the order matters, you can store Unicode and non-Unicode text, NaN and Infinity, raw byte arrays, big integers, and date/time. (It avoids some other problems as well, including canonization (DER is already in canonical form) and other issues. Although, I have a variant of DER that avoids some of the excessive date/time types and adds a few additional types, but this does not affect the framing, which can still be parsed in the same way.)
A variant called "Multi-DER" could be made up, which is simply concatenating any number of DER files together. Converting Multi-DER to BER is easy just by adding a constant prefix and suffix. Converting Multi-DER to DER is almost as easy; you will need the length (in bytes) of the Multi-DER file and then add a prefix to specify the length. (In none of these cases does it require parsing or inspecting or modifying the data at all. However, converting the JSON variants into ordinary JSON does require inspecting the data in order to figure out where to add the commas.)
Plus the 64-bit integer problem (really 53-bit integers), due to JS not having an integer type.
JSON itself is not limited to neither 52 nor 64-bit integers.
https://json.org/
That’s a JavaScript problem, not JSON.
Most good parsers have an option to parse to integers or arbitrary precision decimals.
Agreed. Which means that Javascript does not have a good parser.
`JSON.parse` actually does give you that option via the `reviver` parameter, which gives you access to the original string of digits (to pass to `BigInt` or the number type of your choosing) – so per this conversation fits the "good parser" criteria.
To be specific (if anyone was curious), you can force BigInt with something like this:
Generally, I'd rather throw if a number is unexpectedly too big, otherwise you will mess up the types throughout the system (the field may not be monomorphic) and will outright fail if you try to use math functions not available to BigInts.
Huh, TIL!
https://caniuse.com/mdn-javascript_builtins_json_parse_reviv...
Absent in Safari though
Sadly the reviver parameter is a new invention only recently available in FF and Node, not at all in Safari.
Naturally not that hard to write a custom JSON parser but the need itself is a bad thing.
Just use the polyfill
https://github.com/zloirock/core-js#jsonparse-source-text-ac...
No it's been there for ages. Finalized as part of ecmascript 5
What you are probably thinking of is the context parameter of the reviver callback. That is relatively recent and mostly a qol improvement
Sorry yes, i was thinking of the context object with source parameter.
The issue it solves is a big one though, since without it JSON.parse cannot faithfully parse numbers larger than a 64-bit float can represent exactly (e.g. bigints).
bigint exists
They do specifically mention this:
“No one owns CSV. It has no real specification (yes, I know about the controversial ex-post RFC 4180), just a set of rules everyone kinda agrees to respect implicitly. It is, and will forever remain, an open and free collective idea.”
They even seem to think it is a good thing. But I don't see how having a bunch of implementations that can't agree on the specifics of a file/interchange format is a good thing. And being free and open is completely orthogonal. There are many proprietary formats that don't have a spec, and many open formats that do have a spec (like, say, JSON).
That's true of the vast majority of protocols that people use in real life to exchange data. We're using one of them right now, in fact.
Waiting for someone to write a love letter to the infamous Windows INI file format
I actually miss that. It was nice when settings were stored right alongside your software, instead of being left behind all over a bloated registry. And the format was elegant, if crude.
I wrote my own library for encoding/writing/reading various datatypes and structure into ini's, in a couple different languages, and it served me well for years.
TOML is nice like that... elegant like INI, only with lists.
> instead of being left behind all over a bloated registry
Really? I think the idea of a central, generic, key-value pair database for all the setting on a system is probably the most elegant reasonable implementation there could be.
The initial implementation of Windows Registry wasn't good. It was overly simplistic and pretty slow. Though the "bloat" (what ever that means) of registry hasn't been an actual issue in over 20 years. The only people invested in convincing you "it's an issue" are CCleaner type software that promise to "speed up your computer" if you just pay $6.99.
How many rows do you need in a sqlite database for it to be "bloated"?
I feel like YAML is a spiritual successor to the .ini, since it shares a notable ideal of simple human readability/writability.
Whenever I ask myself "should I use YAML?" I answer myself "Norway".
To be fair to YAML that's been solved in 1.2.
https://yaml.org/spec/1.2.2/#10212-boolean
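A quick illustration of the problem with a 1.1-era parser, assuming PyYAML (which still follows the old boolean resolution rules):

```python
import yaml  # PyYAML implements YAML 1.1 style resolution

doc = """
country: NO
newsletter: no
"""
print(yaml.safe_load(doc))
# Typically prints: {'country': False, 'newsletter': False}  -- the "Norway problem"
```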
> I feel like YAML is a spiritual successor to the .ini, since it shares a notable ideal of simple human readability/writability.
It doesn't feel that way to me: it's neither simple to read nor to write. I suppose that that's a builtin problem due to tree representation, which is something that INI files were never expected to represent.
TBH, I actually prefer the various tree representation workarounds used by INI files: using whitespace to indicate child nodes stops being readable once you have more than a screenful of children in a node.
Given how YAML does magic and sometimes accidental type conversions of potentially nested objects, I think TOML is the well-defined sucessor to .ini
YAML is readable? No way: there are too many ways to do the same thing, and nested structures are unclear to the untrained eye (what is a list? what is nested?), let alone indentation in large files being an issue, especially with the default 2-space unreadable standard so many people adhere to.
YAML simple? Its spec is larger than XML's... Parsing of numbers and strings is ambiguous, leading zeros are not strings but octal (implicit conversion...). Lists as keys? Oh ffs, and you said readable. And do not get me started about "Yes" being a boolean, which reminds me of the MS Access localizations that had different values for true and [local variant of true] (1 vs -1).
Writable? Even worse. I think I have never been able to write a YAML file without errors. But that might just be me; XML is fine though, while unreadable.
I agree that one could make wild YAML if you get into advanced stuff, but I make YAML files that look like this:
Just because you can use it to create a monstrosity doesn't prevent it from being useful for simple configuration. Basically, it's just prettier JSON.
Say "no" to YAML. As a string, if you can.
You can. YAML 1.2 is only 16 years old. Just old enough to drive. Norway problem has been solved for only 16 years.
YAML 1.2 leaves data types ambiguous, merely making the "Norway problem" optional and at the mercy of the application rather than, in the words of https://yaml.org/type/ (which has not been marked as deprecated), "strongly recommended".
Those schemas aren't part of the core schema, and you may interpret them if you are aiming for full 1.1 compatibility. If you're aiming for 1.1 compatibility, then you accept the Norway problem.
I've been looking in the specs and I can't find the link to the https://yaml.org/type/
I think GRON[1] would fit the bill better
[1] https://github.com/tomnomnom/gron
The post does mention it, as a positive:
https://github.com/medialab/xan/blob/master/docs/LOVE_LETTER...
People who say that CSV is "simpler" are talking about whatever format Excel exports.
Also these people have only ever had to deal with the American Excel localization.
So yeah, with the caveat of "only ever use Excel and only ever the American edition" CSV is pretty nice.
As someone living in a country where , is used as the decimal separator, I cannot begin to describe the number of times CSV data has caused me grief. This becomes especially common in an office environment where Excel is the de facto only data handling tool that most people can and will use. Here the behavior of loading data becomes specific to the individual machine and changes over time (e.g. when IT suddenly forces a reset of MS Office application languages to the local one).
That said, I don't really know of any alternative that won't be handled even worse by my colleagues...
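When stuck with such files, the pragmatic fix is usually to normalise the numbers yourself. A rough Python sketch that treats spaces, non-breaking spaces and dots as group separators and the comma as the decimal point, which is an assumption you would have to confirm per data source:

```python
from decimal import Decimal

def parse_decimal_comma(text: str) -> Decimal:
    # "34 593,12" -> Decimal("34593.12"); assumes comma is the decimal separator
    # and space / NBSP / dot are thousands separators.
    cleaned = (
        text.strip()
        .replace("\u00a0", "")
        .replace(" ", "")
        .replace(".", "")
        .replace(",", ".")
    )
    return Decimal(cleaned)

assert parse_decimal_comma("34 593,12") == Decimal("34593.12")
```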
Also keeping in mind all the locales where comma is the decimal point…tsv for the world.
And all the 'simple' formats start failing when dealing with blocks of text.
To be honest, I'm wondering why you are rating JSON higher than CSV.
> Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files,
There is, actually, RFC 4180 IIRC.
> there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa.
"There are many flavours that deviate from the spec" is a JSON problem too.
> you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess but there ambiguities in it that cannot be resolved by machine).
Also a problem in JSON
> Quoting, escaping, UTF-8 support are particular problem areas,
Sure, but they are no more nor no less a problem in JSON as well.
Have you had to work with csv files from the wild much? I'm not being snarky but what you're talking about is night and day to what I've experienced over the years.
There aren't vast numbers of different JSON formats. There's practically one and realistically maybe two.
Headers are in each line, utf8 has never been an issue for me and quoting and escaping are well defined and obeyed.
This is because for datasets, almost exclusively, the file is machine written and rarely messed with.
Csv files have all kinds of separators, quote characters, some parsers don't accept multi lines and some do, people sort files which mostly works until there's a multi line. All kinds of line endings, encodings and mixed encodings where people have combined files.
I tried using ASCII record separators after dealing with so many issues with commas, semicolons, pipes, tabs etc and still data in the wild had these jammed into random fields.
Lots of these things don't break when you hit the issue either, the parsers happily churn on with garbage data, leading to further broken datasets.
Also they're broken for clients if the first character is a capital I.
WRT JSON:
> Headers are in each line
This might be my old “space and network cost savings” reflex kicking in, which is a lot less necessary these days, but that feels inefficient. It also gives rise to not knowing the whole schema until you read the whole dataset (which might be multiple files), unless some form of external schema definition is provided.
Having said that, I accept that JSON has advantages over CSV, even if all that is done is translating a data-table into an array of objects representing one row each.
> utf8 has never been an issue for me
The main problem with UTF8 isn't with CSV generally, it is usually, much like the “first column is called ID” issue, due to Excel. Unfortunately a lot of people interact with CSVs primarily with Excel, so it gets tarred with that brush by association. Unless Excel sees the BOM sequence at the start of a CSV file, which the Unicode standards recommend against for UTF8, it assumes its characters are using the Win1252 encoding (almost, but not quite, ISO-8859-1).
> Csv files have all kinds of separators
I've taken to calling them Character Separated Value files, rather than Comma, for this reason.
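On the writing side, the usual workaround is to emit the BOM that Excel looks for; a small Python sketch using the "utf-8-sig" codec:

```python
import csv

rows = [["name", "city"], ["Åsa", "Göteborg"]]

# "utf-8-sig" writes a UTF-8 BOM first; most Excel versions then decode the
# file as UTF-8 instead of falling back to the Windows-1252 code page.
with open("out.csv", "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(rows)
```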
Yes, it's not great. Space is annoying, though compression pretty much removes that as a concern (zstd is good for this, you can even have a custom dictionary). And yes, missing keys is annoying.
JSONL is handy, JSON that's in the form {data: [...hundred megs of lines]} is annoying for various parsers.
I'm quite a fan of parquet, but never expect to receive that from a client (alas).
> JSON that's in the form {data: [...hundred megs of lines]} is annoying for various parsers.
One reason this became common was a simple protection against json hijacking: https://haacked.com/archive/2009/06/25/json-hijacking.aspx/
Parquet should get the praise. It's simply awesome.
It's what I'd pick for tabular data exchange.
A recent problem I solved with it and duckdb allowed me to query and share a 3M record dataset. The size? 50M. And my queries all ran subsecond. You just aren't going to get that sort of compression and query-ability with a csv.
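For anyone who hasn't tried it, the workflow is roughly this; the file and column names below are made up for illustration:

```python
import duckdb

# Hypothetical file and column names, for illustration only.
query = """
    SELECT category, count(*) AS n
    FROM 'records.parquet'
    GROUP BY category
    ORDER BY n DESC
"""

con = duckdb.connect()
# DuckDB scans the Parquet file in place; no separate import step needed.
print(con.execute(query).fetchall()[:5])
```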
I wonder if CSV is the trivial format, so you have many people picking it because they want the easiest, and still getting it wrong. JSON is harder, so very few people are going to roll their own serializer/deserializer, and those who do are more likely to focus on getting it right (or at least catching the really obvious bugs).
I've dealt with incorrect CSVs numerous times, never with incorrect JSON, but, of the times I know what was happening on the other system, each time the CSV was from some in house (or similar) implementation of dumping a SQL output (or similar) into a text file as an MVP. JSON was always using some library.
If so, that's all the more reason to love CSV as it stands guard for JSON. If CSV didn't exist, we would instead have broken JSON implementations. (JSON and XML would likely then share a similar relationship.)
Sometimes people interpret the term too generically and actually implement a high degree of non-trivial, very idiosyncratic complexity, while still calling it "CSV".
One project I worked on involved a vendor promising to send us data dumps in "CSV format". When we finally received their "CSV" we had to figure out how to deal with (a) global fields being defined in special rows above the header row, and (b) a two-level hierarchy of semicolon-delimited values nested within comma-delimited columns. We had to write a custom parser to complete the import.
Hi,
Yes, we chose ARFF format, which is idiosyncratic yet well-defined back in the old data mining days.
Sure, I get your arguments and we're probably mostly in agreement, but in practice I see very few problems arising with using CSV.
I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.
I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.
And yet, they work.
The number of people uploading excel files to be processed or downloading excel files for local visualistation and processing would floor you. It's done multiple times a day, on multiple systems, in multiple companies.
And yet, they work.
I get your argument though - a JSON array of arrays can represent everything that CSV can, and is preferable to CSV, and is what I would choose when given the choice, but the issues with using that are not going to be fewer than the issues with CSV using RFC 4180.
>but in practice I see very few problems arising with using CSV
That is not my experience at all. I've been processing CSV files from financial institutions for many years. The likelihood of brokenness must be around 40%. It's unbelievable.
The main reason for this is not necessarily the CSV format as such. I believe the reason is that it is often the least experienced developers who are tasked with writing export code. And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.
JSON is better but it doesn't help with things like getting dates right. XML can help with that but it has complexities that people get wrong all the time (such as entities), so I think JSON is the best compromise.
> And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.
Can't they?
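Presumably something along these lines: a Python generator that quotes every field and doubles embedded quotes, which is what the replies below are reacting to (a sketch, not the original snippet):

```python
def to_csv(rows):
    # Quote every field and double any embedded quotes, per RFC 4180.
    for row in rows:
        for i, field in enumerate(row):
            if i:
                yield ","
            yield '"' + str(field).replace('"', '""') + '"'
        yield "\r\n"

with open("out.csv", "w", newline="", encoding="utf-8") as f:
    f.writelines(to_csv([("x", "y"), ("3", "4")]))
```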
I haven't tested this, even to see if the code parses. What did I screw up?
> Can't they?
If my experience reflects a relevant sample then the answer is that most can but a very significant minority fails at the job (under the given working conditions).
Whether or not _you_ can is a separate question. I don't see anything wrong with your code. It does of course assume that whatever is contained in rows is correct. It also assumes that the result is correctly written to a file without making any encoding mistakes or forgetting to flush the stream.
Not using name value pairs makes CSV more prone to mistakes such as incorrect ordering or number of values in some rows, a header row that doesn't correspond with the data rows, etc. Some export files are merged from multiple sources or go through many iterations over many years, which makes such mistakes far more likely.
I have also seen files that end abruptly somewhere in the middle. This isn't specific to CSV but it is specific to not using libraries and not using libraries appears to be more prevalent when people generate CSV.
You'd be surprised how many CSV files are out there where the developer tried to guess incorrectly whether or not a column would ever have to be escaped. Maybe they were right initially and it didn't have to be escaped but then years later something causes a change in number formats (internationalisation) and bang, silent data corruption.
Prioritising correctness and robustness over efficiency as you have done is the best choice in most situations. Using a well tested library is another option to get the same result.
This forces each field to be quoted, and it assumes that each row has the same fields in the same order. A library can handle the quoting issues and fields more reliably. Not sure why you went with a generator for this either.
Most people expect something like `12,,213,3` instead of `"12","213","3"` which yours might give.
https://en.wikipedia.org/wiki/Comma-separated_values#Basic_r...
Forcing each field to be quoted is always correct, isn't it? How could something be "more reliable" than something that is always correct?
With respect to "the same fields in the same order", no, although you may or may not feed the CSV to an application that has such an expectation. But if you apply it to data like [("Points",),(),("x","y"),("3","4"),("6","8","10")] it will successfully preserve that wonky structure in a file Excel can ingest reliably. (As reliably as Excel can ingest anything, anyway, since Excel has its own Norway problem.)
It's true that it's possible to produce more optimized output, but I didn't claim that the output was optimal, just correct.
Using generators is necessary to be able to correctly output individual fields that are many times larger than physical memory.
I'll preface this that I think we are mostly in agreement, so that's the friendly tone of reply, part of this is just having flashbacks.
It's massively used, but the lack of adherence to a proper spec causes huge issues. If you have two systems that happen to talk properly to each other, great, but if you are as I was an entrypoint for all kinds of user generated files it's a nightmare.
CSV is the standard, sure, but it's easy to write code that produces it that looks right at first glance but breaks with some edge case. Or someone has just chosen a different separator, or quote, so you need to try and detect those before parsing (I had a list that I'd go through, then look for the most commonly appearing non-letter character).
The big problem is that the resulting semantically broken csv files often look pretty OK to someone scanning them and permissive parsers. So one system reads it in, splits something on lines and assumes missing columns are blank and suddenly you have the wrong number of rows, then it exports it. Worse if it's been sorted before the export.
Of course then there's also the issues around a lack of types, so numbers and strings are not distinguishable automatically leading to broken issues where you do want leading zeros. Again often not identified until later. Or auto type detection in a system breaking because it sees a lot of number-like things and assumes it's a number column. Without types there's no verification either.
So even properly formatted CSV files need a second place for metadata about what types there are in the file.
JSON has some of these problems too, it lacks dates, but far fewer.
> but the issues with using that are not going to be fewer than the issues with CSV using RFC 4180.
My only disagreement here is that I've had to deal with many ingest endpoints that don't properly support that.
Fundamentally I think nobody uses CSV files because they're a good format. They're big, slow to parse, lack proper typing, lack columnar reading, lack fast jumping to a particular place, etc.
They are ubiquitous, just not good, and they're very easy to screw up in hard to identify or fix ways.
Finally, lots of this comes up because RFC4180 is only from *2005*.
Oh, and if I'm reading the spec correctly, RFC4180 doesn't support UTF8. There was a proposed update maybe in 2022 but I can't see it being accepted as an RFC.
> I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.
And there are constant issues arising from that. You basically need a small team to deal with them in every institution that is processing them.
> I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.
Salesmen of enterprise systems do not care about the issues programmers and clients have. They care about what they can sell to other businessmen. That teams on both sides then waste time and money on troubleshooting is no concern to the salesman. And I am saying that as someone who worked on an enterprise system that consumed a lot of CSV. It does not work, and the process of handling the files literally sometimes involved phone calls to admins of other systems, more often than would be sane.
> The number of people uploading excel files to be processed or downloading excel files for local visualistation and processing would floor you.
That is perfectly fine as long as it is a manager downloading data so that he can manually analyze them. It is pretty horrible when those files are then uploaded to other systems.
In practice, I have never ever received CSV to process that complied with RFC 4180, and in most cases it was completely incoherent and needed special handling for all the various problems, like the lack of escaping.
SAP has been by far the worst. I never managed to get data out of it that were not completely garbage and needed hand crafted parsers.
SAP only has to be SAP and MS Excel compatible. The rest is not needed so in their eyes it is probably to spec.
European quality™.
> And yet, they work.
Through a lot of often-painful manual intervention. I've seen it first-hand.
If an organization really needs something to work, it's going to work somehow—or the organization wouldn't be around any more—but that is a low bar.
In a past role, I switched some internal systems from using CSV/TSV to using Parquet and the difference was amazing both in performance and stability. But hey, the CSV version worked too! It just wasted a ton of people's time and attention. The Parquet version was far better operationally, even given the fact that you had to use parquet-tools instead of just opening files in a text editor.
> There aren't vast numbers of different JSON formats.
Independent variations I have seen:
* Trailing commas allowed or not
* Comments allowed or not
* Multiple kinds of date serialization conventions
* Divergent conventions about distinguishing floating point types from integers
* Duplicated key names tolerated or not
* Different string escaping policies, such as, but not limited to, "\n" vs "\x0a"
There are bazillions of JSON variations.
> Trailing commas allowed or not
The JSON spec does not allow trailing commas. Although there are JSON supersets that do.
> Comments allowed or not
The JSON spec does not allow comments. Although there are JSON supersets that do.
> Multiple kinds of date serialization conventions
Json spec doesn't say anything about dates. That is dependent on your application schema.
> Divergent conventions about distinguishing floating point types from integers
This is largely due to divergent ways different programming languages handle numbers. I won't say JSON handles this the best, but any file format used across multiple languages will run into problems with differences in how numbers are represented. At least there is a well defined difference between a number and a string, unlike CSV.
> Duplicated key names tolerated or not
According to the spec, they are tolerated, although the semantics of such keys is implementation defined.
> Different string escaping policies, such as, but not limited to "\n" vs "\x0a"
Both of those are interpreted as the same thing, at least according to the spec. That is an implementation detail of the serializer, not a different language.
And CSV parsers and serializers compliant with RFC 4180 are similarly reliable.
But many, perhaps most, parsers and serializers for CSV are not compliant with RFC 4180.
RFC 4180 is not an official standard. The text of the RFC itself states:
> This memo provides information for the Internet community. It does not specify an Internet standard of any kind.
CSVs existed long before that RFC was written, and it is more a description of CSVs that are somewhat portable, not a definitive specification.
That RFC doesn't even support utf8.
It is, and accepts that it is, codifying best practices rather than defining an authoritative standard.
There are always many, but in comparison to csv I've received almost no differences. Json issues were rare but csv issues it was common to have a brand new issue per client.
Typically the big difference is there are different parsers that are less tolerant of in spec values. Clickhouse had a more restrictive parser, and recently I've dealt with matrix.
Maybe I've been lucky for json and unlucky for csv.
What's the problem with capital I?
https://superuser.com/questions/210027/why-does-excel-think-... says it's not capital I but “ID”.
Basically, Excel uses the equivalent of ‘file’ (https://man7.org/linux/man-pages/man1/file.1.html), sees the magic “ID”, and decides it is a SYLK file, even though .csv files starting with “ID” have outnumbered .SYLK files by millions for decades.
Thanks. So I guess the easy compatible solution is to always quote the first item on the first line when writing CSV. Good to know. (Checking if the item starts with ID is more work. Possibly quote all items on the first line for simplicity.) (Reading SYLK is obviously irrelevant, so accepting unquoted ID when reading is the smarter way to go and will actually improve compatibility with writers that are not Excel. Also it takes no work.)
The byte for a capital I is the same as the start of an odd file format, SYLK maybe? Excel has (or did, if they finally fixed it) for years decided this was enough to assume the file (called .csv) cannot possibly be CSV but must actually be SYLK. It then parses it as such, and is shocked to find your SYLK file is totally broken!
It sounds to me like as often the problem here is Excel, not CSV
Yes but in practice CSV is defined by what Excel does.
As there is no standard to which Excel conforms as it predates standards and there would be an outcry if Excel started rejecting files that had worked for years.
There is a common misconception here. You can import CSV files into an excel sheet. You cannot open a CSV file with excel. That is a nonsense operation.
Excel does not ask the user whether they want to import the file, and tells the user their file is broken.
Clients don't particularly make the distinction, and in a way nor should they - they can't open your file.
Probably referring to the "turkish i problem"
Not an unreasonable guess, but it turned out to be something different.
> There is, actually, RFC 4180 IIRC.
Does any software fully follow that spec (https://www.rfc-editor.org/rfc/rfc4180)? Some requirements that I doubt are commonly followed:
- “Each record is located on a separate line, delimited by a line break (CRLF)” ⇒ editing .csv files using the typical Unix text editor is complicated.
- “Spaces are considered part of a field and should not be ignored”
- “Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes” ⇒ fields containing lone carriage returns or new lines need not be enclosed in double quotes.
> Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files
There is such a document: RFC 4180. It may not be a good document, but it does exist.
INI was for a long time the seemingly preferred format for configuration in the Python community, as I recall it.
I haven't been a full-time Python dev in some time though; it seems TOML has supplanted it, but I remember thinking how interesting it was that Python had a built-in INI parser and serializer.
While CSV isn't exactly grammared or standardised like XML, I think of it as more schema'd than JSON. There might be data corruption or consistency issues, but there is implicitly a schema: every line is exactly n fields, and the first line might contain field names.
When a JSON API turns out to have optional fields it usually shows through trial and error, and unlike CSV it's typically not considered a bug you can expect the API owner to fix. In CSV 'missing data' is an empty string rather than nulls or their cousins because missing fields aren't allowed, which is nice.
I also like that I can write my own ad hoc CSV encoder in most programming languages that can do string concatenation, and probably also a suitable decoder. It helps a lot in some ETL tasks and debugging. Decent CSV also maps straight to RDBMS tables, if the database for some reason fails at immediate import (e.g. too strict expectations) into a newly created table it's almost trivial to write an importer that does it.
JSON is not schema’d per se and intentionally so. There’s jsonschema which has better expressiveness than inference of a tabular schema, as it can reflect relationships.
Sure. I have yet to come across a data source with JSON Schema, I'll develop an opinion of it when I do.
I lived through SOAP/WSDL horror with their numerous standards and the lack of compatibility between stacks in different programming languages. Having seen abused XML, CSV formats. CSV is preferable over XML. Human-readability matters. Relative simplicity matters.
Although JSON may also be interpreted differently by different tools, it is a good default choice for communicating between programs.
> lack of compatibility between stacks in different programming languages
Well, that sure beats OpenAPI lack of compatibility between stacks in the same programming language.
I think the fact one can't randomly concatenate strings and call it "valid XML" a huge bonus over the very common "join strings with comma and \r\n", non-rfc4180 compliant (therefore mostly unparseable without human/LLM interaction) garbage people often pretend is CSV.
There is no file format that works out of the box under all extreme corner cases.
You would think that, e.g., XML-defined WSDL with an XSD schema would be well battle-proven. Two years ago I encountered (and am still dealing with) a WSDL from a major banking vendor that is technically valid, but no open source library in Java (of all languages) was able to parse it successfully or generate binding classes out of the box.
Heck, flat files can end up with extreme cases, just work enough with legacy banking or regulatory systems and you will see some proper shit.
The thing is, any sort of critical integration needs to be battle tested and continuously maintained, otherwise it will eventually go bad, even a decade after implementation and regular use without issues.
What about S-expressions? Where do they break?
XML is a pretty good markup language. Using XML to store structured data is an abuse of it. All the features that are useful for markup are not useful for structured data and only add to the confusion.
> Waiting for someone to write a love letter to the infamous Windows INI file format...
Honestly, it’s fine. TOML is better if you can use it, but otherwise for simple applications, it’s fine. PgBouncer still uses INI, though that in particular makes me twitch a bit, due to discovering that if it fails to parse its config, it logs the failed line (reasonable), which can include passwords if it’s a DSN string.
Well, once you get over the fact that information in a TOML file can appear out of order anywhere, written in any mix of 3 different key syntaxes, and broken down in any random way... then yes, the rest of TOML is good.
I should write a love letter to JSON.
For as good as JSON is or is not, it's definitely not under-rated.
It does mention this. Point 2.
Why do you hate CSV, and not the program that is unable to properly create CSV?
CSV is ever so elegant but it has one fatal flaw - quoting has "non-local" effects, i.e. an extra or missing quote at byte 1 can change the meaning of a comma at byte 1000000. This has (at least) two annoying consequences:
1. It's tricky to parallelise processing of CSV.
2. A small amount of data corruption can have a big impact on the readability of a file (one missing or extra quote can bugger the whole thing up).
So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
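A minimal sketch of that scheme in Python (the helper names are mine, and I've chosen to write an escaped newline as a literal backslash followed by the letter n so each record stays on one physical line):

    # Plain escaping: backslash, comma and newline are the only special characters.
    def encode_field(s):
        return s.replace("\\", "\\\\").replace(",", "\\,").replace("\n", "\\n")

    def encode_record(fields):
        return ",".join(encode_field(f) for f in fields) + "\n"

    def decode_record(line):
        fields, cur, chars = [], [], iter(line.rstrip("\n"))
        for ch in chars:
            if ch == "\\":
                nxt = next(chars)
                cur.append("\n" if nxt == "n" else nxt)   # \n -> newline, \\ -> \, \, -> ,
            elif ch == ",":
                fields.append("".join(cur))
                cur = []
            else:
                cur.append(ch)
        fields.append("".join(cur))
        return fields

    row = ["a,b", "line1\nline2", "back\\slash"]
    assert decode_record(encode_record(row)) == row

A newline in the file is always a record boundary, so corruption stays local and you can split the file on newlines without tracking quote state.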
JSON serialized without extra white space with one line per record is superior to CSV.
If you want CSV-ish, enforce an array of strings for each record. Or go further with actual objects and non-string types.
You can even jump to an arbitrary point and then seek till you see an actual new line as it’s always a record boundary.
It’s not that CSV is an invalid format. It’s that libraries and tools to parse CSV tend to suck. Whereas JSON is the lingua franca of data.
> It’s that libraries and tools to parse CSV tend to suck. Whereas JSON is the lingua franca of data.
This isn't the case. An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity. Despite the lack of any sort of specification, it's easily the most widely supported data format in existence in terms of tools and language support.
> An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity.
Yea and it's still a partially-parseable shit show with guessed values. But we can and could have and should have done better by simply defining a format to use.
Meanwhile, Excel exports to CSV as “semicolon separated values” depending on your OS locale
Albeit for fairly justifiable reasons
Justifiable how?
Well, Excel has a lot of common use-cases around processing numeric (and particularly financial) data. Since some locales use commas as decimal separators, using a character that's frequently present as a piece of data as a delimiter is a bit silly; it would be hard to think of a _worse_ character to use.
So, that means that Excel in those locales uses semicolons as separators rather than the more-frequently-used-in-data commas. Probably not the decision I'd make in retrospect, but not completely stupid.
Decimal separators being commas in some locales?
They could have just ignored the locale altogether though. Put dots on the numbers when using csv, and assume it has dots when importing
This exactly. Numbers in XLS(X) are (hopefully) not locale-specific – why should they be in CSV?
CSV -> text/csv
Microsoft Excel -> application/vnd.ms-excel
CSV is a text format, xls[x], json, and (mostly) xml are not.
Commas are commonly used in text, too.
Clearly they should have gone with BEL as the delimiter.
I'm hoping no reasonable person would ever use BEL as punctuation or decimal separator. If one was going to use a non-printable character as a delimiter, why wouldn't they use the literal record separator "\030"?
Every time you cat a BSV file, your terminal beeps like it's throwing a tantrum. A record separator (RS) based file would be missing this feature! In other words, my previous comment was just a joke! :)
By the way, RS is decimal 30 (not octal '\030'). In octal, RS is '\036'. For example:
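(Checking in a Python REPL, say:)

    >>> chr(30)      # RS is decimal 30
    '\x1e'
    >>> '\036'       # octal 36 == decimal 30, so this is RS
    '\x1e'
    >>> '\030'       # octal 30 is decimal 24 (CAN), not RS
    '\x18'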
See also https://en.cppreference.com/w/cpp/language/ascii for confirmation.
On the off chance you're not being facetious, why not ASCII 0 as a delimiter? (This is a rhetorical question.)
ASCII has characters more or less designed for this
0x1C - File Separator
0x1D - Group Separator
0x1E - Record Separator
0x1F - Unit Separator
So I guess 1F would be the "comma" and 1E would be the "newline."
https://stackoverflow.com/questions/8695118/what-are-the-fil...
I am pretty sure you shifted the meaning: the decimal separator is part of the atomic data; it does not need a control character.
You would use 1F instead of the comma/semicolon/tab and 1E to split lines (record means line just like in SQL).
You could then use 1D to store multiple CSV tables in a single file.
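A sketch of that layout in Python, purely to make it concrete (illustrative only; 1F between fields, 1E between records, 1D between tables):

    US, RS, GS = "\x1f", "\x1e", "\x1d"   # unit, record and group separators

    def encode_table(rows):
        # No escaping: the format simply assumes these bytes never occur in the data.
        return RS.join(US.join(fields) for fields in rows)

    def encode_file(tables):
        return GS.join(encode_table(rows) for rows in tables)

    def decode_file(text):
        return [[record.split(US) for record in table.split(RS)]
                for table in text.split(GS)]

    tables = [[["id", "name"], ["1", "Widget, large"]]]
    assert decode_file(encode_file(tables)) == tables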
Yes but then the text is not human readable or editable in a plain text editor.
This would confuse most users of CSVs; they are not programmers, and at most they use text editors and Excel.
I am not proposing to do this, but if you were to use ASCII separators, this is how you would do it.
There are some decent arguments for BEL over NUL, however given you posed that as a rhetorical question I feel I can say little other than
ding! ding! ding! winner winner, chicken dinner!
Although BEL would drive me up the wall if I broke out any of my old TTY hardware.
...and excel macros
Sure, let's put quotation marks around all number values.
Oh wait.
lol
Can you point me to a language with any significant number of users that does NOT have a JSON library?
I went looking at some of the more niche languages like Prolog, COBOL, RPG, APL, Eiffel, Maple, MATLAB, tcl, and a few others. All of these and more had JSON libraries (most had one baked into the standard library).
The exceptions I found (though I didn't look too far) were: Bash (use jq with it), J (an APL variant), Scratch (not exposed to users, but scratch code itself is encoded in JSON), and Forth (I could find implementations, but it's very hard to pin down forth dialects).
I made no claim about JSON libraries. I contested the claim that "CSV libraries and tools suck". They do not.
CSV tooling has had to invest enormous amounts of effort to make a fragile, under-specified format half-useful. I would call it ubiquitous, I would call the tooling that we’ve built around it “impressive” but I would by no means call any of it “good”.
I do not miss dealing with csv files in the slightest.
> CSV tooling has had [...] to make a fragile, under-specified format half-useful
You have this backwards. Tabular structured data is ubiquitous. Text as a file format is also ubiquitous because it is accessible. The only actual decisions are whether to encode your variables as rows or columns, what the delimiter is, and other rules such as escaping etc. Vars as columns makes sense because it makes appending easier. There is a bunch of stuff that can be used for delimiters, commas being the most common; none is perfect. But from this point onwards the decisions do not really matter, and "CSV" basically covers everything from now on. "CSV" is basically what comes naturally when you have tabular datasets and want to store them as text. CSV tooling is developed because there is a need for this way of formatting data. Whether CSV is "good" or "ugly" or whatever is irrelevant; handling data is complicated, as much as the world itself is. The alternatives are either not structuring/storing the data in a tabular manner, or non-text (e.g. binary) formats. These alternatives exist and are useful in their own right, but they don't solve the same problems.
I think the issue is that CSV parsing is really easy to screw up. You mentioned delimiter choice and escaping, and I’d add header presence/absence to that list.
There are at least 3 knobs to turn every time you want to parse a CSV file. There’s reasonably good tooling around this (for example, Python’s CSV module has 8 parser parameters that let you select stuff), but the fact that you have to worry about these details is itself a problem.
You said “handling data is complicated as much as the world itself is”, and I 100% agree. But the really hard part is understanding what the data means, what it describes. Every second spent on figuring out which CSV parsing option I have to change could be better spent actually thinking about the data.
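For reference, those knobs in Python's csv module look roughly like this (the filename and every value chosen here are guesses you would have to make per producer):

    import csv

    with open("export.csv", newline="", encoding="utf-8-sig") as f:   # BOM or not? a guess
        reader = csv.reader(
            f,
            delimiter=";",          # or "," or "\t", depending on the producer's locale
            quotechar='"',
            escapechar=None,
            doublequote=True,       # "" inside quoted fields
            skipinitialspace=False,
            lineterminator="\r\n",  # ignored by the reader, but part of the dialect
            quoting=csv.QUOTE_MINIMAL,
            strict=False,
        )
        rows = list(reader)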
I am kind of amazed at how people complain about having to parse what is practically a random file.
Whether there is a header or not should be specified up front, and one should not parse some unknown file blind, because that will always end in failure.
If you have your own serialization and your own parsing working, then yeah, this will simply work.
But then not pushing errors back to the user, and trying to deal with everything, is going to be frustrating, because the number of edge cases is almost infinite.
Handling random data is hard; saying "it is a CSV" and trying to support everything that comes with it is hard.
Microsoft Windows has had to invest enormous amounts...
Apple macOS has had to invest enormous amounts...
Pick your distro of Linux has had to invest enormous amounts...
None of them is perfect, and any number of valid complaints can be made about any of them. None of the complaints makes any of them useless. Everyone has workarounds.
Hell, JSON has had to invest enormous amounts of effort...
I guess the point is that I can take a generic json parser and point it at just about any JSON I get my hands on, and have close to no issues parsing it.
Want to do the same with csv? Good luck. Delimiter? Configurable. Encoding? Configurable. Misplaced comma? No parse in JSON, in csv: might still parse, but is now semantically incorrect and you possibly won’t know until it’s too late, depending on your parser. The list goes on.
Here is a quick test
The table of contents points to a single JSON object that is 20-ish GB compressed
https://www.anthem.com/machine-readable-file/search/
All stock libs will fail
You claimed that CSV is "easily the most widely supported data format in existence in terms of tools and language support", which is a claim that CSV is better supported than JSON, which is a claim that JSON support is lacking.
Can you import .jsonl files into Google sheets or excel natively?
Importing csvs in excel can be a huge pain due to how excel handles localisation. It can basically alter your data if you are not mindful about that, and I have seen it happening too many times.
Excel dropping leading zeros (as in ZIP codes) was a crazy design decision that has certainly cost many lifetimes of person-hours.
And forcing 16+ digits to be floats, destroying information.
Yeah have had similar struggles with social security numbers.
For example:
Scientists rename genes because Microsoft Excel reads them as dates (2020)
https://www.reddit.com/r/programming/comments/i57czq/scienti...
I was so glad of that story. It gave me something to point to to get my boss off my back.
But it handles it better than JSON.
Depends on what you mean by "better". I would rather software not handle a piece of data at all than handle it erroneously, changing the data without me realising and causing all sorts of issues afterwards.
In practice, web browsers accept the tag soup that is sometimes called html and strict xml-based formats failed.
The browser is not a database (unlike Excel). Modifying data before showing it is reversible; modifying it before storing it is not.
Excel.
Before you dismiss it as 'not a language', people have argued that it is. And you can definitely program stuff in it, so that surely makes it a language.
Excel can import and parse JSON, it's under the "Get Data" header. It doesn't have a direct GUI way to export to JSON, but it takes just a few lines in Office Scripts. You can even use embedded TypeScript to call JSON.stringify.
I’ve found that the number of parsers that don’t handle multiline records is pretty high though.
It's widely, but inconsistently, supported. The behavior of importers varies a lot, which is generally not the case for JSON.
> Despite the lack of any sort of specification
People keep saying this but RFC 4180 exists.
> it's easily the most widely supported data format in existence in terms of tools and language support.
Even better, the majority of the time I write/read CSV these days I don't need to use a library or tools at all. It'd be overkill. CSV libraries are best saved for when you're dealing with random CSV files (especially from multiple sources) since the library will handle the minor differences/issues that can pop up in the wild.
JSON is a textual encoding no different than CSV.
It's just that people tend to use specialized tools for encoding and decoding it instead of like ",".join(row) and row.split(",")
I have seen people try to build up JSON strings like that too, and then you have all the same problems.
So there is no problem with CSV except that maybe it's too deceptively simple. We also see people trying to build things like URLs and query strings without using a proper library.
The problem with CSV is that there's no clear standard, so even if you do reach for a library to parse it, that doesn't ensure compatibility.
There is a clear standard, and it's usually written in an old Word '97 doc on a local file server. Using CSV means that you are the compatibility layer, and this is useful if you need firm control or understanding of your data.
If that sounds like a lot of edge-case work keep in mind that people have been doing this for more than half a century. Lots of examples and notes you can steal.
https://datatracker.ietf.org/doc/html/rfc4180 exists
And does Excel fully comply, and more importantly, tell you when the CSV file is wrong?
No. Excel's fault. Not CSV. There are plenty of busted CSV parsers (and serializers) too.
Same for JSON though. What Python considers a valid JSON might not be that if you ask a Java library.
JSON has clearly defined standards: ISO/IEC 21778:2017, IETF RFC 7159, and ECMA-404. Additionally, Crockford has had a spec available on json.org since its creation in 2001.
Do you have any examples of Python, Java, or any of the other Tiobe top 40 languages breaking the JSON spec in their standard library?
In contrast, for the few of those that have CSV libraries, how many of those libraries will simply fail to parse a large number of the .csv variations out there?
Not to mention that stuff like Excel loves to export CSV files in "locale specific ways".
Sometimes commas as the delimiter, sometimes semicolons; floating point values might have dots or commas separating the fraction digits.
Not to mention text encodings: ASCII, Western European character sets, or maybe UTF-8, or whatever...
It's a bloody mess.
That’s why we just email the sheets around like it’s 1999 :)
You need more than a standard; that standard has to be complete and unambiguous. What you're looking for is https://github.com/nst/JSONTestSuite
EDIT: The readme's results are from 2016, but there are more recent results (last updated 5 years ago). Of the 54 parsers/versions tested, 7 always gave the expected result per the spec (disregarding cases where the spec does not define a result).
It falls down under very large integers -- think large valid uint64_t values.
JSON doesn't fail for very large values because they are sent over the wire as strings. Only parsers may fail if they or their backing language doesn't account for BigInts or floats larger than f64, but these problems exist when parsing any string to a number.
And indeed applies to CSV as well: it's just strings at the end of the day, its up to the parser to make sense of it into the data types one wants. There is nothing inherently stopping you from parsing a JSON string into a uint64: I've done so plenty!
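Whether that survives the round trip is purely a parser property. For instance, Python's json keeps the full value, while a stock JavaScript JSON.parse would round it to the nearest double:

    import json

    doc = '{"id": 18446744073709551615}'          # the largest uint64
    assert json.loads(doc)["id"] == 2**64 - 1     # Python ints are arbitrary precision

    # In JavaScript, JSON.parse of the same document yields 18446744073709551616,
    # silently rounded to the nearest representable double.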
Example? I know there's some ambiguity over whether literals like false are valid JSON, but I can't think of anything else.
That _shouldn't_ be ambiguous, `false` is a valid JSON document according to specification, but not all parsers are compliant.
There's some interesting examples of ambiguities here: https://seriot.ch/projects/parsing_json.html
Trailing commas, comments, duplicate key names, for a few examples.
Trailing commas and comments are plainly not standard JSON under any definition. There are standards that include them which extend JSON, sure, but I'm not aware of any JSON library that emits this kind of stuff by default.
I'm not aware of any CSV library that doesn't follow RFC4180 by default, and yet... this whole thread.
> It's just that people tend to use specialized tools for encoding and decoding it instead of like ",".join(row) and row.split(",")
You really super can't just split on commas for csv. You need to handle the string encodings since records can have commas occur in a string, and you need to handle quoting since you need to know when a string ends and that string may have internal quote characters. For either format unless you know your data super well you need to use a library.
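A concrete illustration in Python, with a made-up row:

    import csv, io

    line = '"Doe, Jane","She said ""hi""",42\r\n'

    print(line.split(","))
    # ['"Doe', ' Jane"', '"She said ""hi"""', '42\r\n']   -- wrong field count, mangled quotes

    print(next(csv.reader(io.StringIO(line))))
    # ['Doe, Jane', 'She said "hi"', '42']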
Right, obviously.
Yes but people don't
Until you have a large amount of data & need either random access or to work on multiple full columns at once. Duplicated key names mean it's very easy for data in jsonlines format to be orders of magnitude larger than the same data as CSV, which is incredibly annoying if your processing for it isn't amenable to streaming.
> JSON [...] with one line per record
A couple of standards that I know of do this, primarily intended for logging:
https://jsonlines.org/
https://clef-json.org/
Really easy to work with in my experience.
Sure some space is usually wasted on keys but compression takes care of that.
You serialize the keys on every row which is a bit inefficient but it’s a text format anyway
Space-wise, as long as you compress it, it's not going to make any difference. I suspect a JSON parser is a bit slower than a CSV parser, but the slight extra CPU usage is probably worth the benefits that come with JSON.
This is simply not true; parsing JSON vs CSV is a difference of thousands of lines of code.
Eh, it really isn't. The format does not lend itself to tabular data, instead the most natural way of representing data involves duplicating the keys N times for each record.
You can easily represent it as an array:
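(A made-up record, just to illustrate:)

    [1, "Widget", 9.99]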
That’s as tabular as CSV but you now have optional types. You can even have lists of lists. Lists of objects. Lists of lists of objects…
Right - the JSON-newline equivalent of CSV can look like this:
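(Made-up values, just to illustrate the shape:)

    ["id", "name", "price"]
    [1, "Widget, large", 9.99]
    [2, "Gadget", 24.5]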
Remove the [] characters and you've invented CSV with excel style quoting.
Almost, except the way Excel-style quoting works with newlines sucks - you end up with rows that span multiple lines, so you can't split on newline to get individual rows.
With JSON those new lines are \n characters which are much easier to work with.
I ended up parsing the XML format instead of the CSV format when handling paste from Excel due to the newlines issue.
CSV seemed so simple but after numerous issues, a cell with both newline and " made me realize I should keep the little hair I had left and put in the work to parse the XML.
It's not great either, with all its weird tags, but at least it's possible to parse reliably.
This is the way. jsonl where each row is a json list. It has well-defined standard quoting.
Just like csv you don't actually need the header row either, as long as there's convention about field ordering. Similar to proto bufs, where the field names are not included in the file itself.
This misses the point of standardization imo because it’s not possible to know a priori that the first line represents the variable names, that all the rows are supposed to have the same number of elements and in general that this is supposed to represent a table. An arbitrary parser or person wouldn’t know to guess since it's not standard or expected. Of course it would be parsed fine but the default result would be a kind of structure or multi-array rather than tabular.
application/jsonl+table
Typing isn't optional in JSON, every value has a concrete type, always.
Types at the type layer are not the same as types at the semantic layer. Sure, every type at the JSON level has a "strong type", but the semantic meaning of the contents of e.g. a string is usually not expressible in pure JSON. So it is with CSV; you can think of every cell in CSV as containing a string (a series of bytes), with it being up to you to enforce the semantics atop those bytes. JSON gives you a couple of extra types, and if you can fit things into those types well, then that's great, but for most concrete, semantically meaningful data you won't be able to do that and you'll end up in a similar world to CSVs.
...and?
I see an array of arrays. The first and second arrays have two strings each, the last one has a float and a string. All those types are concrete.
Let's say those "1.1" and 7.4 values are supposed to be version strings. If your code is only sometimes putting quotes around the version string, the bug is in your code. You're outputting a float sometimes, but a string in others. Fix your shit. It's not your serialization format that's the problem.
If you have "7.4" as a string, and your serialization library is saying "Huh, that looks like a float, I'm going to make it a float", then get a new library, because it has a bug.
You're missing my point: basically nothing spits out data in that format because it's not ergonomic to do so. JSON is designed to represent object hierarchies, not tabular data.
CSV is lists of lists of fixed length.
JSON is lists of lists of any length and groups of key/value pairs (basically lisp S-expressions with lots of unnecessary syntax). This makes it a superset of CSV's capabilities.
JSON fundamentally IS made to represent tabular data, but it's made to represent key-value groups too.
Why make it able to represent tabular data if that's not an intended use?
> JSON is lists of lists of any length and groups of key/value pairs
The "top-level" structure of JSON is usually an object, but it can be a list.
> JSON fundamentally IS made to represent tabular data
No, it's really not. It's made to represent objects consisting of a few primitive types and exactly two aggregate types: lists and objects. It's a textual representation of the JavaScript data model and even has "Object" in the name.
> Why make it able to represent tabular data if that's not an intended use?
It's mostly a question of specialization and ergonomics, which was my original point. You can represent tabular data using JSON (as you can in JavaScript), but it was not made for it. Anything that can represent """data""" and at least 2 nesting levels of arbitrary-length sequences can represent tabular data, which is basically every data format ever regardless of how awkward actually working with it may be.
The fact that json can represent a superset of tabular data structures that csv is specifically designed to represent can be rephrased into that csv is more specialised than json in representing tabular data. The fact that json can also represent tabular data does not mean it is a better or more efficient way to represent that data instead of a format like csv.
In the same way, there are hierarchically structured datasets that can be represented by both json in hierarchical form and csv in tabular form by repeating certain variables, but if using csv would require repeating them too many times, it would be a bad idea to choose that instead of json. The fact that you can do something does not always make it a good idea to do it. The question imo is about which way would be more natural, easy or efficient.
> The fact that json can represent a superset of tabular data structures that csv is specifically designed to represent can be rephrased into that csv is more specialised than json in representing tabular data. The fact that json can also represent tabular data does not mean it is a better or more efficient way to represent that data instead of a format like csv.
The reverse is true as well: being more specialized is a description of goals, not advantages.
It's hardly a bad idea to do a list of lists in JSON...
The big advantage of JSON is that it's standardized and you can reuse the JSON infrastructure for more than just tabular data.
> CSV is lists of lists of fixed length.
I'd definitely put that in my list of falsehoods programmers believe about CSV files.
It seems to be indicated by RFC 4180 [0], which says
> This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file
But of course, CSV is the wild west and there's no guarantee that any two encoders will do the same thing (sometimes, there's not even a guarantee that the same encoder will do the same thing with two different inputs).
[0] https://www.ietf.org/rfc/rfc4180.txt
You should know that "should" isn't very binding.
Header rows should, as far as possible, have a field for every column that contains data items, and data items in a row should each have a header for their column, but real CSV files should be assumed to have incomplete or variable-length lines.
NOTHING is very binding about the CSV spec and that's the biggest problem with CSV.
CSV is a text file that might have commas in it
yeah if I had a cookie for every time I've had to deal with this I'd have maybe 10 cookies - it's not a lot but it's more than it should be.
A format consisting of newline-terminated records, each containing comma-separated JSON strings would be superior to CSV.
It could use backslash escapes to denote control characters and Unicode points.
Everyone would agree exactly on what the format is, in contrast to the zoo of CSV variants.
It wouldn't have pitfalls in it, like spaces that defeat quotes
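For instance, in CSV-style quoting (illustrative):

    x,"a,b",y
    x, "a,b",y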
oops; add an innocuous-looking space, and the quotes are now literal.
"Any JSON primitive" does add a few requirements not semantically comparable to CSV, like numbers that are numbers, and the keywords true, false, null.
When these syntaxes are parsed into objects, either the type info has to be retained, or some kind of attribute tag, so they can be output back in the same form.
> make it so any consumer can parse it by splitting on newline and then ...
There is something like that called JSON-lines. It has a .org domain 'n' everything:
https://jsonlines.org/
JSON was designed to represent any data. There are plenty of systems that spit out data in exactly that format, because it's the natural way to represent tabular data using JSON serialization. And clearly, if you're the one building the system, you can choose to use it.
JSON is designed to represent JavaScript objects with literal notation. Guess what, an array of strings or an array of numbers or even an array of mixed strings and numbers is a commonly encountered format in JavaScript.
What happens when you need to encode the newline character in your data? That makes splitting _either_ CSV or LDJSON files difficult.
The newline character in a JSON string would always be \n. A literal newline in the record itself, as whitespace, would not be acceptable, as that breaks the one-line-per-record contract.
Remember that this does not allow arbitrary representation of serialized JSON data. But it allows for any and all JSON data as you can always roundtrip valid JSON to a compact one line representation without extra whitespace.
Actually, even whitespace-separated JSON would be a valid format, and if you forbid JSON documents from being a bare integer or float, then even just concatenating JSON gives a valid format, as JSON is a prefix-free language.
That is[0], if a string s of length n is valid JSON, then no proper prefix s[0..i] for i < n is valid JSON.
So you could just consume as many bytes as you need to produce a JSON document and then start a new one when that one is complete (sketched after the footnotes below). To handle malformed data you just need to throw out the partial data on a syntax error and start from the following byte (and likely throw away data a few more times if the error was in the middle of a document).
That is, [][]""[][]""[] is unambiguous to parse[1]
[0] again assuming that we restrict ourselves to string, null, boolean, array and objects at the root
[1] still this is not a good format as a single missing " can destroy the entire document.
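That consume-and-restart loop is easy to sketch with Python's json.JSONDecoder.raw_decode, which parses one document and reports where it stopped (the error-recovery part is left out here):

    import json

    def iter_concatenated(text):
        dec, i = json.JSONDecoder(), 0
        while i < len(text):
            if text[i].isspace():                 # tolerate optional whitespace between documents
                i += 1
                continue
            value, i = dec.raw_decode(text, i)    # returns (document, index just past it)
            yield value

    print(list(iter_concatenated('[][]""[][]""[]')))
    # [[], [], '', [], [], '', []]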
« a single missing " can destroy the entire document » This is basically true for any data format, so really the worst argument ever...
In jsonl a modified chunk will lose you at most the removed lines and the two adjacent ones (unless the noise is randomly valid JSON); in particular, a single-byte edit can destroy at most 2 lines.
utf-8 is also similarly self-correcting and so is html and many media formats.
My point was that in my made-up concatenated json format
[]"""[][][][][][][][][][][]"""[]
and
[]""[][][][][][][][][][][]""[]
are both valid and differ by only 2 bytes, but have entirely different structures.
Also it is a made-up format nobody uses (if somebody were to want this they would likely disallow strings at the root level).
When you need to encode the newline character in your data, you say \n in the JSON. Unlike (the RFC dialect of) CSV, JSON has an escape sequence denoting a newline and in fact requires its use. The only reason to introduce newlines into JSON data is prettyprinting.
It's tricky, but simple enough: the RFC states that " must be used for quoting, and inserting a literal " is done with "". This makes knowing where a record ends harder, since you must keep a buffer holding the record read so far.
How do you do this simply? You read each line, and if the buffer contains an odd number of ", you have an incomplete record, so you keep appending lines until the total number of " is even. Once you have the full record, parsing the fields correctly is harder, but you can do it with regex or PEGs or a disgusting state machine.
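A rough version of that accumulation trick in Python (this assumes RFC-4180-style doubled quotes, i.e. no backslash escaping):

    import csv, io

    def records(lines):
        buf = ""
        for line in lines:
            buf += line
            if buf.count('"') % 2 == 0:      # even quote count: the record is complete
                yield next(csv.reader(io.StringIO(buf)))
                buf = ""

    physical = ['a,"multi\n', 'line",b\n', 'c,d\n']
    print(list(records(physical)))
    # [['a', 'multi\nline', 'b'], ['c', 'd']]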
That would be solved by using the ASCII control chars Record Separator / Unit Separator! I don't get how this is not widely used as standard.
I remembered seeing a comment like this before, and...
comment: https://news.ycombinator.com/item?id=26305052
comment: https://news.ycombinator.com/item?id=39679662
"ASCII Delimited Text – Not CSV or Tab Delimited Text" post [2014]: https://news.ycombinator.com/item?id=7474600
same post [2024]: https://news.ycombinator.com/item?id=42100499
comment: https://news.ycombinator.com/item?id=15440801
(...and many more.) "This comes up every single time someone mentions CSV. Without fail." - top reply from burntsushi in that last link, and it remains as true today as in 2017 :D
You're not wrong though, we just need some major text editor to get the ball rolling and start making some attempts to understand these characters, and the rest will follow suit. We're kinda stuck at a local optimum which is clearly not ideal but also not troublesome enough to easily drum up wide support for ADSV (ASCII Delimiter Separated Values).
>we just need some major text editor to get the ball rolling and start making some attempts to understand these characters
Many text editors offer extensions APIs, including Vim, Emacs, Notepad++. But the ideal behavior would be to auto-align record separators and treat unit separators as a special kind of newline. That would allow the file to actually look like a table within the text editor. Input record separator as shift+space and unit separator as shift+enter.
I think it would be enough for:
1. the field separator to be shown as a special character
2. the row separator to (optionally) be interpreted as a linefeed
IIRC 1) is true for Notepad++, but not 2).
Excel needs to default its export to this. Unfortunately excel is proprietary software and therefore fucked.
Hahah, I came here to make the comment about ASCII's control characters, so I'm glad someone else beat me to it, and also that someone further pointed out that this topic comes up every time someone mentions CSV!
The _entire_ point of a CSV file is that it's fully human readable and write-able.
The characters you mention could be used in a custom delimiter variant of the format, but at that point it's back to a binary machine format.
And that’s why I tend to use tab delimited files more… when viewed with invisible characters shown, it’s pretty clear to read/write separate fields and have an easier to parse format.
This, of course, assumes that your input doesn’t include tabs or newlines… because then you’re still stuck with the same problem, just with a different delimiter.
As soon as you give those characters magic meanings then suddenly people will have reason to want to use them— it'll be a CSV containing localization strings for tooltips that contain that character and bam, we'll be back to escaping.
Except the usages of that character will be rare and so potentially way more scary. At least with quotes and commas, the breakages are everywhere so you confront them sooner rather than later.
Graphical representations of the control characters begin at U+2400 in the "Control Pictures" Unicode block. Instead of the actual U+001E Record Separator, you put the U+241E Symbol for Record Separator in the help text.
.... with a note underneath urging readers not to copy and paste the character because it's only the graphical representation of it, not the thing itself.
Perhaps a more salient example might be CSV nested in CSV. This happens all the time with XML (hello junit) and even JSON— when you plug a USB drive into my LG TV, it creates a metadata file on it that contains {"INFO":"{ \"thing\": true, <etc> }"}
You wouldn't use this format for that. It's not a universal format, but an application-specific one. It works in some applications, not in others.
Best to add deliberate breakage to the spec, then.
What do you mean? I just push the Record Separator key on my keyboard.
/s in case :)
The entire argument against ASCII Delimited Text boils down to "No one bothered to support it in popular editors back in 1984. Because I grew up without it, it is impossible to imagine supporting it today."
You need 4 new keyboard shortcuts. Use ctrl+, ctrl+. ctrl+[ ctrl+] You need 4 new character symbols. You need a bit of new formatting rules. Pretty much page breaks decorated with the new symbols. It's really not that hard.
But, like many problems in tech, the popular advice is "Everyone recognizes the problem and the solution. But, the problematic way is already widely used and the solution is not. Therefore everyone doing anything new should invest in continuing to support the problem forever."
> The entire argument against ASCII Delimited Text boils down to "No one bothered to support it in popular editors back in 1984. Because I grew up without it, it is impossible to imagine supporting it today."
There's also the argument of "Now you have two byte values that cannot be allowed to appear in a record under any circumstances. (E.g., incoming data from uncontrolled sources MUST be sanitized to reject or replace those bytes.)" Unless you add an escaping mechanism, in which case the argument shifts to "Why switch from CSV/TSV if the alternative still needs an escaping mechanism?"
One benefit of binary formats is not needing the escaping.
Length-delimited binary formats do not need escaping. But the usual "ASCII Delimited Text" proposal just uses two unprintable bytes as record and line separators, and the signalling is all in-band.
This means that records must not contain either of those two bytes, or else the format of the table will be corrupted. And unless you're producing the data yourself, this means you have to sanitize the data before adding it, and have a policy for how to respond to invalid data. But maintaining a proper sanitization layer has historically been finicky: just look at all the XSS vulnerabilities out there.
If you're creating a binary format, you can easily design it to hold arbitrary data without escaping. But just taking a text format and swapping out the delimiters does not achieve this goal.
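A toy length-prefixed layout, just to show why no byte ever needs escaping (not any particular standard; the 4-byte big-endian lengths are an arbitrary choice here):

    import io, struct

    def write_record(out, fields):
        out.write(struct.pack(">I", len(fields)))          # number of fields
        for f in fields:
            out.write(struct.pack(">I", len(f)))           # field length in bytes
            out.write(f)                                   # raw bytes, anything allowed

    def read_record(inp):
        (n,) = struct.unpack(">I", inp.read(4))
        return [inp.read(struct.unpack(">I", inp.read(4))[0]) for _ in range(n)]

    buf = io.BytesIO()
    write_record(buf, [b"Doe, Jane", b"line1\nline2", b"\x1c\x1d\x1e\x1f"])
    buf.seek(0)
    print(read_record(buf))   # the delimiter-ish bytes come back untouched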
I did mean length-delimited binary formats (rather than ASCII formats).
At least you don't need these values in your data, unlike the comma, which shows up in human-written text.
If you do need these values in your data, then don't use them as delimiters.
Something the industry has stopped doing, but maybe should do again, is restricting characters that can appear in data. "The first name must not contain a record separator" is a quite reasonable restriction. Even Elon Musk's next kid won't be able to violate that restriction.
Hear hear! Why is all editing done with text-based editors where humans can make syntax errors. Is it about job security?
In Windows (and DOS EDIT.COM and a few other similarly ancient tools) there have existed Alt+028, Alt+029, Alt+030, and Alt+031 for a long time. I vaguely recall some file format I was working with in QBASIC used some or all of them and I was editing those files for some reason. That was not quite as far back as 1984, but sometime in the early 1990s for sure. I believe EDIT.COM had basic glyphs for them too, but I don't recall what they were, might have been random Wingdings like the playing card suits.
Having keyboard shortcuts doesn't necessarily solve why people don't want to use that format, either.
> I believe EDIT.COM had basic glyphs for them too, but I don't recall what they were, might have been random Wingdings like the playing card suits.
That is not specific to EDIT.COM; they are the PC characters with the same codes as the corresponding control characters, so they appear as graphic characters. (They can be used in any program that can use PC character set.)
However, in EDIT.COM and QBASIC you can also prefix a control character with CTRL+P in order to enter it directly into the file (and they appear as graphic characters, since I think the only control characters they will handle as control characters are tabs and line breaks).
Suits are PC characters 3 to 6; these are PC characters 28 to 31 which are other shapes.
The keys would be something other than those, though. They would be: CTRL+\ for file separator, CTRL+] for group separator, CTRL+^ for record separator, CTRL+_ for unit separator. Other than that, it would work like you described, I think.
> But, like many problems in tech, the popular advice is "Everyone recognizes the problem and the solution. But, the problematic way is already widely used and the solution is not
This is unfortunately common. However, what also happens is disagreement about what the problem and the solution actually are.
It's more like, "because the industry grew up without it, other approaches gained critical mass."
Path dependence is a thing. Things that experience network effects don't get changed unless the alternative is far superior, and ASCII Delimited Text is not that superior.
Ignoring that and pushing for it anyway will at most achieve an xkcd 927.
I’m pretty sure those used to exist.
But when looking for a picture to back up my (likely flawed) memory, Google helpfully told me that you can get a record separator character by hitting Ctrl-^ (caret). Who knew?
Can't there be some magic sequence (like two unescaped newlines) to start a new record?
You’ll still find that sequence in data; it’ll just be rare enough that it won’t rear its ugly head until your solution has been in production for a while.
If there were visible well known characters that could be printed for those and keys on a keyboard for inputting them we would probably have RSV files. Because they are buried down in the nonprintable section of the ASCII chart they are a pain for people to deal with. All it would have taken is one more key on the keyboard, maybe splitting the tab key in half.
> If there were visible well known characters that could be printed...
...There would be datasets that include those characters, and so they wouldn't be as useful for record separators. Look into your heart and know it to be true.
I wouldn't feel too bad about blindly scrubbing those characters out of inputs unlike commas, tabs, and quotes.
Can't type them on a keyboard I guess, or generally work with them in the usual text-oriented tools? Part of the appeal of CSV is you can just open it up in Notepad or something if you need to. Maybe that's more a critique of text tools than it is of ASCII record separator characters.
I was really excited when I learned of these characters, but ultimately if it’s in ASCII then it’s in-band and will eventually require escaping leading to the same problem.
But what if one of your columns contains arbitrary binary data?
You'd likely need to uuencode it or similar as CSV isn't designed for binary data.
Perhaps the popularity, or lack thereof? More often than not, the bad standard wins the long-term market.
> I don't get how this is not widely used as standard.
It requires bespoke tools for editing, and while CSV is absolute garbage, it can be ingested and produced by most spreadsheet software, as well as databases.
Reminds me of a fatal flaw of yaml. Turns out truncating a yaml file doesn't make it invalid. Which can lead to some rather non-obvious failures.
What is the failure mode where a yaml file gets truncated? They are normally config files in Git. Or uploaded to S3 or Kubernetes etc.
CSV has the same failure mode. As does HTML. (But not XML)
I couldn't find the story on it, but there was an instance of a config for some major service getting truncated, but since it was yaml it was more difficult to figure out that that was what happened. I think in AWS, but I can't find the story, so can't really remember.
And fully fair that you can have similar issues in other formats. I think the complaint here was that it was a bit harder, specifically because it did not trip up any of the loading code. With a big lesson learned that configs should probably either go pascal string style, where they have an expected number of items as the first part of the data, or xml style, where they have a closing tag.
Really, it is always amusing to find how many of the annoying parts of XML turned out to be somewhat more well thought out than people want to admit.
Bad merges.
Same is true of CSV/TSV.
I think you are a bit more likely to notice in a CSV/TSV, as it is unlikely to truncate at a newline?
Still, fair point. And is part of why I said it is a flaw, not the flaw. Plenty of other reasons to not like YAML, to me. :D
Not if it is split at a line e.g. if the source or target can only deal with a fixed number of lines.
Right, that is what I meant about that being unlikely? Most instances of truncated files that I have seen were because of size, not lines.
Still, a fair point.
Really depends on how the CSV is generated/transferred. If the output of the faulting software is line-buffered then it's quite likely that a failure would terminate the file at a line break.
I don't understand why CSV became a thing instead of TSV, or a format using the nowadays-weird ASCII control characters like start/end of text, start of heading, horizontal/vertical tab, or the file/group/record/unit separators.
It seems many possible designs would've avoided the quoting chaos and made parsing sort of trivial.
Any time you have a character with a special meaning you have to handle that character turning up in the data you're encoding. It's inevitable. No matter what obscure character you choose, you'll have to deal with it
It's evitable by stating the number of bytes in a field and then the field. No escaping needed and faster parsing.
But not human editable/readable
I understand this argument in general. But basically everyone has some sort of spreadsheet application that can read CSV installed.
In some alternate world where this "binary" format caught on, it would be a very minor issue that it isn't human readable, because everyone has a tool that is better at reading it than humans are. (See the above-mentioned non-local property of quotes, where you may think you are reading rows but are actually inside a single cell.)
Makes me also wonder if something like CBOR caught on early enough we would just be used to using something like `jq` to read it.
https://github.com/wader/fq is "jq for binary formats."
Except we have all these low ASCII characters specifically for this purpose that don't turn up in the data at all. But there is, of course, also an escape character specifically for escaping them if necessary.
Even if you find a character that really is never in the data - your encoded data will contain it. And it's inevitable that someone encodes the encoded data again. Like putting CSV in a CSV value.
You can't type any of those on a typewriter, or see them in old or simple editors, or with no editor at all, like just catting to a tty.
If you say those are contrived examples that don't matter any more then you have missed the point and will probably never acknowledge the point and there is no purpose in continuing to try to communicate.
One can only ever remember and type out just so many examples, and one can always contrive some response to any single or finite number of examples, but they are actually infinite, open-ended.
Having a least common denominator that is extremely low that works in all the infinite situations you never even thought of, vs just pretty low and pretty easy to meet in most common situations, is all the difference in the world.
The difference is that the comma and newline characters are much more common in text than 0x1F and 0x1E, which, if you restrict your data to alphanumeric characters (which you really should), will never appear anywhere else.
Exactly. "Use a delimiter that's not in the data" is not real serialisation, it's fingers-crossed-hope-for-the-best stuff.
I have in the past done data extractions from systems which really can't serialise properly, where the only option is to concat all the fields with some "unlikely" string like @#~!$ as a separator, then pick it apart later. Ugh.
> Exactly. "Use a delimiter that's not in the data" is not real serialisation, it's fingers-crossed-hope-for-the-best stuff.
It's not doing just this, you pick something that's likely not in the data, and then escape things properly. When writing strings you can write a double quote within double quotes with \", and if you mean to type the designated escape character you just write it twice, \\.
The only reason you go for something likely not in the data is to keep things short and readable, but it's not impossible to deal with.
I agree that it's best to pick "unlikely" delimiters so that you don't have to pepper your data with escape chars.
But some people (plenty in this thread) really do think "pick a delimiter that won't be in the data" - and then forget quoting and/or escaping - is a viable solution.
The characters would likely be unique, maybe even by the spec.
Even if you wanted them, we use backslashes to escape strings in most common programming languages just fine. The problem with CSV is that commas aren't easy to recognize, because they might be within a single- or double-quoted string, or might just be a separator.
Can strings in CSV have newlines? I bet parsers disagree since there's no spec really.
In TSV as commonly implemented (for example, the default output format of Postgres and MySQL), tab and newline are escaped, not quoted. This makes processing the data much easier. For example, you can skip to a certain record or field just by skipping literal newlines or tabs.
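Roughly what that escaping looks like (modelled on PostgreSQL's COPY text format, covering only the common cases):

    def escape_field(s):
        return (s.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

    def unescape_field(s):
        out, chars, special = [], iter(s), {"t": "\t", "n": "\n", "r": "\r"}
        for ch in chars:
            if ch == "\\":
                nxt = next(chars)
                out.append(special.get(nxt, nxt))
            else:
                out.append(ch)
        return "".join(out)

    row = ["Doe\tJane", "two\nlines"]
    line = "\t".join(escape_field(f) for f in row)          # literal tabs only ever separate fields
    assert [unescape_field(f) for f in line.split("\t")] == row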
It's a lot easier to type comma than control characters and it's a lot easier to view comma than a tab (which might look like a space).
For automated serialization, plain text formats won out, because it is easy to implement a minimal working solution (both import and export), and more importantly, almost all systems agree on what plain text is.
We don't really have Apple-formatted text that will show up as binary on Windows. Especially if you are just transferring IDs and numbers, those will fall within ASCII, and that will work even if you are expecting Unicode.
The ASCII control characters do not display well, and are not easily editable, in a plain text editor.
I did always use TSV and I think the original use of CSV could have used that.
But TSV would still have many issues.
What issues would TSV have? As commonly implemented (for example, the default output format of Postgres and MySQL), in TSV, tab and newline are escaped, not quoted.
I don't understand why CSV became a thing in the 70s when S-expressions existed since at least the 50s and are better in practically every way.
I want to push Sqlite as a data interchange format! It has the benefit of being well defined, and it can store binary data, like images for product pictures, inside the database. Not a good idea if you're trying to serve users behind a web app, but as interchange, better than a zip file with filenames that have to be "relinked".
For context: I have a LOT of experience of interchange formats, like "full time job, every day, all day, hundreds of formats, for 20-years" experience.
Based on that experience I have come to one key, but maybe, counter-intuitive truth about interchange formats:
- Too much freedom is bad.
Why? Generating interchange data is cheaper than consuming it, because the creator only needs to consider the stuff they want to include, whereas the consumer needs to consider every single possible edge case and or scenario the format itself can support.
This is why XML is WAY more costly to ingest than CSV, because in XML someone is going to use: attributes, CDATA, namespaces, comments, different declarations, includes, et al. In CSV they're going to use rows, a field separator, and quotes (with or without escaping). That's it. That's all it supports.
Sqlite as an interchange format is a HORRIFYING suggestion, because every single feature Sqlite supports may need to be supported by consumers. Even if you curtailed Sqlite's vast feature set, you've still created something vastly more expensive to consume than XML, which itself is obnoxious.
My favorite interchange formats are, in order:
- CSV, JSON (inc. NDJSON), YAML, XML, BSON (due to type system), MessagePack, Protobuf, [Giant Gap] Sqlite, Excel (xlsx, et al)
More features mean more cost, more edge cases, more failures, more complex consumers. Keep in mind, this is ONLY about interchange formats between two parties, I have wildly different opinions about what I would use for my own application where I am only ever the creator/consumer, I actually love Sqlite for THAT.
Oh, this is interesting. Are you tying different systems together? If so, do you use some preferred intermediate format? Do you have a giant library of * -> intermediate -> * converters that you sprinkle between everything? Or maybe the intermediate format is in memory?
What about Parquet and the like?
Not the person you were replying to, but from my experience CSV is a good place to define data types coming into a system and a safe way to dump data as long as I write everything down.
So I might do things like have every step in a pipeline begin development by reading from and writing to CSV. This helps with parallel dev work and debugging, and is easy to load into any intermediate format.
> do you use some preferred intermediate format?
This is usually dictated by speed vs money calculations, weird context issues, and familiarity. I think it's useful to look at both "why isn't this a file" and "why isn't this all in memory" perspectives.
For tabular/time-series greater than 100k rows I personally feel like parquet cannot be beat. It's self-describing, strongly-typed, relatively compact, supports a bunch of io/decode skipping, and is quite fast.
Orc also looks good, but isn't well supported. I think parquet is optimal for now for most analytical use-cases that don't require human readability.
It is an interchange format, so it is inter-system by virtue of that. If I am a self-creator/consumer the format I use can be literally anything even binary memory dumps.
I love the wisdom in this comment!
I’m not sure you need to support every SQLite feature. I’m unconvinced of binary formats, but the .dump output is text and simple SQL.
Interesting! I've dealt with file interchange between closed source (and a couple open source) programs, but that was a while ago. I've also had to deal with csvs and xslts between SaaS vendors for import export of customer's data. I've done a bunch of reverse engineering of proprietary formats so we could import the vendor's files, which had more information than they were willing to export in an interchange format. Sometimes they're encrypted and you have to break it.
What you say is fair. CSV is underspecified though: there's no company called CSV that's gonna sue for trademark enforcement, and there's no official CSV standard library that everyone uses. (Some do exist, but there are so many naive implementations written from first principles, because how hard could it be? Output records and use a comma and a newline (of which there are three possible options).)
How often do you deal with multiple CSV files representing multiple tables that are actually what's used by vendors internally, vs one giant flattened CSV with hundreds of columns and lots of empty cells? I don't have your level of experience with CSVs, but I've dealt with them being a mess, where the other side implements whatever they think is reasonable given the name "comma separated values".
With sqlite, we're in the Internet age, so I presume this hypothetical developer would use the sqlite library and not implement their own library from scratch for funsies. This then leads to types, database normalization, multiple tables. I hear you that too many choices can be bad, and XML is a great example of this, but sqlite isn't XML and isn't CSV.
It's hard to have this discussion in the abstract, so I'll be forthcoming about where I'm coming from, which is CSV import/export between vendors for stores, think Doordash to UberEATS. The biggest problem we have is images of the items, and how to deal with that. It's an ongoing issue how to get them, but the failure mode, which does happen, is that when moving vendor, they just have to redo a lot of work that they shouldn't have to.
Ultimately the North Star I want to push towards is moving beyond CSVs, because it'll let people who currently have to hand-edit the CSV so every row imports properly not have to do that. They'd still exist, but instead you'd have to deal with, well, what you see with XML files, which have their shortcomings, as you mention; but at least once how a vendor is using it is understood, individual records are generally understandable.
I was moved so I don't deal with import export currently, but it's because sqlite is so nice to work with on personal projects where it's appropriate that I want to push the notion of moving to sqlite over csvs.
One very minor problem is that you max out storing blobs of 2GB(? I think, maybe 4GB). Granted few will hit this, but this limit did kill one previous data transfer idea of mine.
> not a good idea if you're trying to serve users behind a web app
I use Sqlite for a static site! Generating those static pages out to individual pages would involve millions of individual files. So instead I serve up a sqlite database over http, and use a sqlite wasm driver [0] to load (database) pages as needed. Good indexing cuts down on the number of pages it grabs, and I can even get full text search!
Only feature I'm missing is compression, which is complicated because the popular extensions like sqlite-zstd are written in Rust.
[0] https://github.com/mmomtchev/sqlite-wasm-http
CSV's actual problem is that there's no single CSV, and you don't know what type you have (or even if you have single consistent type through the whole file) without trying to parse the whole file and seeing what breaks. Is there quoting? Is that quoting used consistently? Do you have five-digit ZIP codes, or have the East Coast ones been truncated to four digits because they began with zero? Spin the wheel!
https://www.ietf.org/rfc/rfc4180.txt
> So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
For my own parser, I made everything `\` escaped: outside of a quote or double-quote delimited string, any character prefixed with a `\` is read verbatim. There are no special exceptions resulting in `\,` producing a comma while `\a` produces `\a`. This makes it a good rule, because it is only one rule with no exceptions.
I considered this but then went the other way - a \ before anything other than a \, newline or comma is treated as an error. This leaves room for adding features, e.g. \N to signify a SQL NULL.
Regarding quoting and escaping, there are two options that make sense to me - either use quoting, in which case quotes are self-escaped and that's that; or use escaping, in which case quotes aren't necessary at all.
A good way to parallelize CSV processing is to split datasets into multiple files, kinda like manual sharding. xan has a parallel command able to perform a wide variety of map-reduce tasks on split files.
https://github.com/medialab/xan
nice .. xsv is also very handy for wrangling csv files generally
xan is a maintained fork of xsv
I always treat CSVs as comma separated values with new line delimiters. If it’s a new line, it’s a new row.
Do you ever have CSV data that has newlines within a string?
I don't. If I ever have a dataset that requires newlines in a string, I use another method to store it.
I don't know why so many people think every solution needs to be a perfect fit for every problem in order to be viable. CSV is good at certain things, so use it for those things! And for anything it's not good at, use something else!
> use something else
You don't always get to pick the format in which data is provided to you.
True, but in that case I'm not the one choosing how to store it, until I ingest the data, and then I will store it in whatever format makes sense to me.
I don't think we do? It's more that a bunch of companies already have their data in CSV format and aren't willing to invest any effort in moving to a new format. Doesn't matter how much one extolls all the benefits, they know right? They're paying someone else to deal with it.
No - that’s what I’m trying to say. If I have newlines I use something else.
Wouldn't work if CSV is used as an exchange format with external companies.
Of course. I’m not saying I roll my own parser for every project that uses a CSV file, I’m just describing my criteria for using CSV vs some other format when I have the option.
Eh, all you're really saying is "I'm not using CSV. Instead I'm using my CSV." Except that's all that anybody does.
CSV can just as easily support escaping as any other format, but there is no agreement for a CSV format.
After all, a missed escape can just as easily destroy a JSON or XML structure. And parallel processing of text is already a little sketchy simply because UTF-8 exists.
How is this not true for every format that includes quote marks?
It is true for everything that uses quoting, I didn't mean to imply otherwise.
I'm not clear why quotes prevent parallel processing?
I mean, you don't usually parallelize reading a file in the first place, only processing what you've already read and parsed. So read each record in one process and then add it to a multiprocessing queue for multiple processes to handle.
And data corruption is data corruption. If a movie I'm watching has a corrupted bit I don't mind a visual glitch and I want it to keep playing. But with a CSV I want to fix the problem, not ignore a record.
Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!
> I'm not clear why quotes prevent parallel processing?
Because of the "non-local" effect of quotes, you can't just jump into the middle of a file and start reading it, because you can't tell whether you're inside a quoted section or not. If (big if) you know something about the structure of the data, you might be able to guess. So that's why I said "tricky" instead of "impossible".
Contrast to my escaping-only strategy, where you can jump into the middle of a file and fully understand your context by looking one char on either side.
> Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!
I used to be a data analyst at a management consultancy. A very common scenario would be that I'm handed a multi-gigabyte CSV and told to "import the data". No spec, no schema, no nothing. Data loss or corruption is totally unacceptable, because we were highly risk-sensitive. So step 1 is to go through the whole thing trying to determine field types by testing them. Does column 3 always parse as a timestamp? Great, we'll call it a timestamp. That kind of thing. In that case, it's great to be able to parallelise reading.
> And data corruption is data corruption
Agreed, but I prefer data corruption which messes up one field, not data corruption which makes my importer sit there for 5 minutes thinking the whole file is a 10GB string value and then throw "EOF in quoted field".
Doing sequential reading into a queue for workers to read is a lot more complicated than having a file format that supports parallel reading.
And the fix to allow parallel reading is pretty trivial: escape new lines so that you can just keep reading until the first unescaped new line and start at that record.
It is particularly helpful if you are distributing work across machines, but even in the single machine case, it's simpler to tell a bunch of workers their offset/limit in a file.
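With that scheme (embedded newlines escaped, so a literal newline is always a record boundary), handing workers offset/limit pairs is only a few lines. A rough sketch, assuming the escaped format described above:

    # Each worker gets a byte range [start, end) of the file and processes every
    # record that *starts* inside its range. A record spanning the boundary is
    # finished by the worker that owns its first byte.

    def process_chunk(path, start, end, handle_record):
        with open(path, "rb") as f:
            f.seek(start)
            if start != 0:
                f.readline()            # partial record; the previous chunk owns it
            while True:
                pos = f.tell()
                if pos >= end:
                    break
                line = f.readline()
                if not line:
                    break
                handle_record(line.rstrip(b"\n"))

This is exactly what you can't do with RFC 4180 quoting, since you can't tell whether an arbitrary offset sits inside a quoted field without reading from the start.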
The practical solution is to generate several CSV files and distribute work at the granularity of files
Sure, now you need to do this statically ahead of time.
It's not unsolvable, but now you have a more complicated system.
A better file format would not have this problem.
The fix is also trivial (escape new lines into \n or similar) would also make the files easier to view with a text editor.
But in practice, you’ll receive a bag of similar-format CSVs.
Tab-Separated Value, as implemented by many databases, solves these problems, because tab, newline and other control characters are escaped. For example, the default text serialization format of Postgres (`COPY <table> TO '<file>'` without any options) is this way.
Anyone with a love of CSV hasn't been asked to deal with CSV-injection prevention in an enterprise setting, without breaking various customer data formats.
There's a dearth of good resources about this around the web, this is the best I've come across: https://georgemauer.net/2017/10/07/csv-injection.html
That mostly breaks down to "excel is intentionally stupid with csv files if you don't use the import function to open them" along with the normal "don't trust customer input without stripping or escaping it" concerns you'd have with any input.
That was my initial reaction as well – it's a vulnerability in MS software, not ours, not our problem. Unfortunately, reality quickly came to bear: our customers and employees ubiquitously use excel and other similar spreadsheet software, which exposes us and them to risk regardless where the issue lies. We're inherently vulnerable because of the environment we're operating in, by using CSV.
"don't trust customer input without stripping or escaping it" feels obvious, but I don't think it stands up to scrutiny. What exactly do you strip or escape when you're trying to prevent an unknown multitude of legacy spreadsheet clients that you don't control from mishandling data in an unknown variety of ways? How do you know you're not disrupting downstream customer data flows with your escaping? The core issue, as I understand it, stems from possible unintended formula execution – which can be prevented by prefixing certain cells with a space or some invisible character (mentioned in the linked post above). This _does_ modify customer data, but hopefully in a way that unobtrusive enough to be acceptable. All in all, it seems to be a problem without a perfect solution.
Hey, I'm the author of the linked article, cool to see this is still getting passed around.
Definitely agree there's no perfect solution. There's some escaping that seems to work ok, but that's going to break CSV-imports.
An imperfect solution is that applications should be designed with task-driven UIs so that they know the intended purpose of a CSV export and can make the decision to escape/not escape then. Libraries can help drive this by designing their interfaces in a similar manner. Something like `export_csv_for_eventual_import()`, `export_csv_for_spreadsheet_viewing()`.
Another imperfect solution would be to ... ugh...generate exports in Excel format rather than CSV. I know, I know, but it does solve the problem.
Or we could just get everyone in the world to switch to emacs csv-mode as a csv viewer. I'm down with that as well.
Appreciate your work! Your piece was pivotal in changing my mind about whether this should be considered in our purview to address.
The intention-based philosophy of all this makes a lot of sense, was eye-opening, and I agree it should be the first approach. Unfortunately, after considering our use cases, we quickly realized that we'd have no way of knowing how customers intend to use the CSV exports they've requested - we've talked to some of them and it's a mix. We could approach things case by case but we really just want a setup which works well 99% of the time and mitigates known risk. We settled on the prefixing approach and have yet to receive any complaints about it, specifically using a space character with the thinking that something unobtrusive (e.g. easily strippable) but also visible would be best - to avoid quirks stemming from something completely hidden.
Thanks again for your writing and thoughts; like I said above, I haven't found much else of quality on the topic.
>Another imperfect solution would be to ... ugh...generate exports in Excel format rather than CSV. I know, I know, but it does solve the problem.
Or you could just use the ISO standard .xlsx, which is a widely supported format that is not Excel specific but has first class support in Excel.
>Another imperfect solution would be to ... ugh...generate exports in Excel format rather than CSV. I know, I know, but it does solve the problem.
That's honestly a good solution and the end users prefer it anyway.
I’ve almost always found the simple way around Excel users not knowing how to safely use CSV files is to just give the file another extension: I prefer .txt or .dat
Then, the user doesn’t have Excel has the default program for opening the file and has to jump through a couple safety hoops
>Then, the user doesn’t have Excel has the default program for opening the file and has to jump through a couple safety hoops
Nah, they just quickly learn to rename it .csv or .xls and excel will open it
If your customers and employees are using Excel then stop going against the grain with your niche software developer focused formats that need a lot of explanations.
I need to interface with a lot of non-technical people who exclusively use Excel. I give them .xlsx files. It's just as easy to export .xlsx as it is to export .CSV and my customers are happy.
How is .csv a niche dev-focused format? Our customers use our exports for a mix of purposes, some of them involving spreadsheet clients (not just excel) and some of them integrating with their own data pipelines. Csv conveniently works with these use cases across the board, without explanation, and is inconveniently saddled with these legacy security flaws in Excel (and probably other clients).
If xlsx works for all your use cases that's great, a much better solution than trying to sidestep these issues by lightly modifying the data. It's not an option for us, and (I'd imagine) a large contingent of export tools which can't make assumptions about downstream usage.
Someone filed a bug report on a project I work on, saying that it was a security vulnerability that we don't prefix cell values with a single quote (') when the cell content contains certain values like an equal sign (=). They said this can cause Excel to evaluate the content and potentially run unsafe code.
I responded that this was Excel's problem, not ours, and that nobody would assign a CVE to our product for such a "vulnerability". How naive I was! They forwarded me several such CVEs assigned to products that create CSVs that are "unsafe" for Excel.
Terrible precedent. Ridiculous security theater.
There are a lot of these sorts of bug reports running around, to the point that Google's bug bounty program has classified them as invalid: https://bughunters.google.com/learn/invalid-reports/google-p...
I agree with the characterization ("security theater") of these bug reports. The problem is that the intentions of these reports don't make the potential risk less real, depending on the setting, and I worry that the "You're just looking for attention" reaction (a very fair one!) leads to a concerning downplaying of this issue across the web.
As a library author, I agree this very well may not be something that needs to be addressed. But as someone working in a company responsible for customers, employees, and their sensitive information, disregarding this issue disregards the reality of the tools these people will invariably use, downstream of software we _are_ responsible for. Aiming to make this downstream activity as safe as possible seems like a worthy goal.
The next version of CVSS needs to add a metric for these kind of bullshit non-vulnerabilities so that we can ignore them at source.
I didn't know about the formula injection, I just knew that Excel and Sheets mangle my dates every time and it drives me bonkers. Why is that the default? It makes no sense.
I also love CSV for its simplicity. A key part of that love is that it comes from the perspective of me as a programmer.
Many of the criticisms of CSV I'm reading here boil down to something like: CSV has no authoritative standard, and everyone implements it differently, which makes it bad as a data interchange format.
I agree with those criticisms when I imagine them from the perspective of a user who is not also a programmer. If this user exports a CSV from one program, and then tries to load the CSV into a different program, but it fails, then what good is CSV to them?
But from the perspective of a programmer, CSV is great. If a client gives me data to load into some app I'm building for them, then I am very happy when it is in a CSV format, because I know I can quickly write a parser, not by reading some spec, but by looking at the actual CSV file.
Parsing CSV is quick and fun if you only care about parsing one specific file. And that's the key: It's so quick and fun, that it enables you to just parse anew each time you have to deal with some CSV file. It just doesn't take very long to look at the file, write a row-processing loop, and debug it against the file.
The beauty of CSV isn't that it's easy to write a General CSV Parser that parses every CSV file in the wild, but rather that it's easy to write specific CSV parsers on the spot.
Going back to our non-programmer user's problem, and revisiting it as a programmer, the situation is now different. If I, a programmer, export a CSV file from one program, and it fails to import into some other program, then as long as I have an example of the CSV format the importing program wants, I can quickly write a translator program to convert between the formats.
There's something so appealing to me about simple-to-parse-by-hand data formats. They are very empowering to a programmer.
Totally agree that its biggest strength is how approachable it is for quick, ad hoc tooling. Need to convert formats? Join two datasets? Normalize a weird export? CSV gives you just enough structure to work with and not so much that it gets in your way.
> I know I can quickly write a parser, not by reading some spec, but by looking at the actual CSV file
This is fine if you can hand-check all the data, or if you are okay if two offsetting errors happen to corrupt a portion of the data without affecting all of it.
Also I find it odd that you call it "easy" to write custom code to parse CSV files and translate between CSV formats. If somebody gives you a JSON file that isn't valid JSON, you tell them it isn't valid, and they say "oh, sorry" and give you a new one. That's the standard for "easy." When there are many and diverse data formats that meet that standard, it seems perverse to use the word "easy" to talk about empirically discovering the quirks in various undocumented dialects and writing custom logic to accommodate them.
Like, I get that a farmer a couple hundred years ago would describe plowing a field with a horse as "easy," but given the emergence of alternatives, you wouldn't use the word in that context anymore.
> If somebody gives you a JSON file that isn't valid JSON, you tell them it isn't valid, and they say "oh, sorry" and give you a new one. That's the standard for "easy."
But it isn't that reliably easy with JSON. Sometimes I have clients give me data that I just have to work with, as-is. Maybe it was invalid JSON spat out by some programmer or tool long ago. Maybe it's just from a different department than my contact, which might delay things for days before the bureaucracy gets me a (hopefully) valid JSON.
I consider CSV's level of "easy" more reliable.
And even valid JSON can be less easy. I've had experiences where writing the high-level parsing for some JSON file, in terms of a JSON library, was less easy and more time-consuming than writing a custom CSV parser.
Subjectively, I think programming a CSV parser from basic programming primitives is just more fun and appealing than programming in terms of a JSON library or XML library. And I find the CSV code is often simpler and quicker to write.
> When there are many and diverse data formats that meet that standard, it seems perverse to use the word "easy" to talk about empirically discovering the quirks in various undocumented dialects and writing custom logic to accommodate them.
But the premise of CSV is so simple, that there are only four quirks to empirically discover: cell delimiter, row delimiter, quote, escaped-quote.
I think it's "easy" to peek at the file and say, "Oh, they use semicolon cell delimiters."
And it's likewise "easy" to write the "custom logic", which is about as simple as parsing something directly from a text stream gets. I typically have to stop and think a minute about the quoting, but it's not that bad.
If a programmer is practiced at parsing from a text stream (a powerful, general skill that is worth exercising), then I think it is reasonable to think they might find parsing CSV by hand to be easier and quicker than parsing JSON (etc.) with a library.
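If you'd rather not discover those four quirks by eyeballing, Python's standard library will even guess them for you; the Sniffer is heuristic and will happily guess wrong on odd files, and the filename here is made up for illustration:

    import csv

    with open("mystery.csv", newline="") as f:
        sample = f.read(64 * 1024)
        sniffer = csv.Sniffer()
        dialect = sniffer.sniff(sample)        # guesses delimiter and quote char
        has_header = sniffer.has_header(sample)
        f.seek(0)
        rows = list(csv.reader(f, dialect))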
The best part about CSV: anyone can write a parser in 30 minutes, meaning that I can take data from the early '90s and import it into a modern web service.
The worst part about CSV: anyone can write a parser in about 30 minutes, meaning that it's very easy to get incorrect implementations, incorrect data, and other strange undefined behaviors. But to be clear, JSON and YAML also have issues with everyone trying to reinvent the wheel constantly. XML is rather ugly, but it seems to be the most resilient.
Until you find someone abusing XSD schemas, or someone designing a "dynamically typed" XML... or someone sneaking extra data in via comments - that happened to me way more often than it should.
You know what grinds my gears about using XSD for message definitions? Namespaces. Namespaces are a good idea and were done well in XML, as far as I can see, but with XSD you run into this [problem][1]:
Namespaces are used to qualify tags and attributes in XML elements. But they're also used by XSD to qualify the names of types defined in the schema. A sequence element's type is indicated by the value of its "type" attribute. The attribute value is a string that is the namespace-qualified name of the type.
So, if you want to change the alias of an XML namespace in an XSD schema, you can't just use your XML library's facilities for namespace management. You also have to go find the "type" attributes (but not all of the "type" attributes), parse their values, and do the corresponding alias change in the type name.
Don't use a string for a thing that is not a string! I guess in XML attributes you have no choice. XAML improved on the situation a bit.
[1]: https://github.com/dgoffredo/stag/tree/master/src/stag/xsd-u...
My condolences. Any open standard runs the risk of this happening. It's not a problem I think we'll ever solve.
In principle, if you make your standard extensible enough, people should stop sneaking data into comments or strings.
... What makes the GP's problem so much more amusing. XML was the last place I'd expect to see it.
That's assuming they know how to use it properly.
Rest has this same issue.
I've seen this when trying to integrate with 3rd party apis.
Status Code 200 Body: Sorry bro, no data.
Even then, this is subject to debate. Should a 404 only be used when the endpoint doesn't exist ? When we have no data to return, etc.
I think that default REST is a bit problematic as it conflates transport protocol level errors with application logic ones.
At least unless you use application/problem+json or application/problem+xml MIME types but those are still just in draft stage
https://datatracker.ietf.org/doc/html/rfc9457#name-the-probl...
For some side projects, I would like to use XML, but cannot read the spec, as that costs money.
So I cannot trust XML in depth, and depend on using a library that bought the spec and hopefully adheres to it.
I like CSV for the same reasons I like INI files. It's simple, text based, and there's no typing encoded in the format, it's just strings. You don't need a library.
They're not without their drawbacks, like no official standards etc, but they do their job well.
I will be bookmarking this like I have the ini critique of toml: https://github.com/madmurphy/libconfini/wiki/An-INI-critique...
I think the first line of the toml critique applies to CSV: it's a federation of dialects.
Similarly, I had once loved schemaless data storages. They are so much simpler!
Until I worked quite a bit with them and realized that there's always schema in the data, otherwise it's just random noise. The question is who maintains the schema, you or a dbms.
Re. formats -- the usefulness comes from features (like format enforcement). E.g. you could skip .ini entirely and just go with lines in a text file, but somewhere you still need to convert those lines into your data; there's no way around it, the question is who's going to do that (and report sane error messages).
Schemaless can be accomplished with well-formed formats like json, xml, yaml, toml, etc.; from the producer side these are roughly equivalent interfaces. There's zero upside to using CSVs except to comfort your customer. Or maybe you've made importing CSVs central to your actual business, in which case you should probably not exist.
> It's simple
My experience has indicated the exact opposite. CSVs are the only "structured" format nobody can claim to parse 100% (ok probably not true thinking about html etc, just take this as hyperbole.) Just use a well-specified format and save your brain-cells.
Occasionally, we must work with people who can only export to csv. This does not imply csv is a reasonable way to represent data compared to other options.
The HTML 5 spec says exactly how you're supposed to deal with broken HTML files.
Yes, that is a single spec with correspondingly-small importance. Generally parsing html remains extremely difficult.
It's of quite large importance, and despite being difficult, it is well-specified, which is the point here. Importantly, there is also no competing HTML spec, either de facto or otherwise. CSV doesn't have anything of comparable authority.
> CSVs are the only "structured" format nobody can claim to parse 100%
You don't need to though, since in most cases you just need to support whatever CSV format the tool you're dealing with produces, unless of course you're trying to write the next Excel/Google Sheets competitor.
CSV works because CSV is understood by non technical people who have to deal with some amount of technicality. CSV is the friendship bridge that prevents technical and non technical people from going to war.
I can tell an MBA guy to upload a CSV file and I'll take care of it. Imagine I tell him I need everything in a PARQUET file!!! I'm no longer a team player.
This is incorrect. Everyone uses Excel, not CSV. There are billions of people on this planet who know what to do with an .xlsx file.
Do the same with a .CSV file and you'll have to teach those people how to use the .CSV importer in Excel and also how to set up the data types for each column etc. It's a non trivial problem that forces you down to a few million people.
.CSV is a niche format for inexperienced software developers.
Indeed, my main use is that most financial services will output your records in CSV, although I mostly open that in Excel, which sometimes gets a bit confused.
"Friendship bridge" is the perfect phrase
This is so relatable to all data eng people from SWE background!
Thanks
Among the shit I have seen in CSV: no " quoting for strings, including ones with a return char; innovative separators, date and number formats; no escape for " within strings; extra rows added by the reporting tool used to export to CSV, etc.
True. But most of those problems are pretty easy for the non-technical person to see, understand, and (often) fix. Which strengthens the "friendship bridge".
(I'm assuming the technical person can easily write a basic parsing script for the CSV data - which can flag, if not fix, most of the format problems.)
For a dataset of any size, my experience is that most of the time & effort goes into handling records which do not comply with the non-technical person's beliefs about their data. Which data came from (say) an old customer database - and between bugs in the db software, and abuse by frustrated, lazy, or just ill-trained CSR's, there are all sorts of "interesting" things, which need cleaning up.
I wish this was a joke. I'm always trying to convince data scientists with a foot in the open source world that their life will be so much better if they use parquet or Stata or Excel or any other kind of file but CSV.
On top of all the problems people mention here involving the precise definition of the format and quoting, it's outright shocking how long it takes to parse ASCII numbers into floating point. One thing that stuck with me from grad school was that you could do a huge number of FLOPS on a matrix in the time it would take to serialize and deserialize it to ASCII.
What advantages does excel give you over CSV?
Accurate data typing (never confuse a string with a number)
Maybe it's circular, but: it always loads correctly into Excel, and if you want to load into a spreadsheet you can add text formatting and even formulas, checkboxes and stuff, which can be a lot of fun.
That is very much not true, Excel does type coercion, especially around things that happen to look like dates: https://www.theverge.com/2020/8/6/21355674/human-genes-renam...
Excel does that type coercion if you import from CSV. If you export pandas data to XLSX it adds proper type information and then it imports properly into Excel and you avoid those problems.
I've recently been developing a raspberry pi based solution which works with telemetry logs. First implementation used an SQLite database (with WAL log) – only to find it corrupted after just a couple of days of extensive power on/off cycles.
I've since started looking at parquet files – which turned out to not be friendly to append-only operations. I've ended up implementing writing events into ipc files which then periodically get "flushed" into the parquet files. It works and it's efficient – but man is it non-trivial to implement properly!
My point here is: for a regular developer – CSV (or jsonl) is still the king.
> First implementation used an SQLite database (with WAL log) – only to find it corrupted after just a couple of days of extensive power on/off cycles.
Did you try setting `PRAGMA synchronous=FULL` on your connection? This forces fsync() after writes.
That should be all that's required if you're using an NVMe SSD.
But I believe most microSD cards do not even respect fsync() calls properly and so there's technically no way to handle power offs safely, regardless of what software you use.
I use SanDisk High Endurance SD cards because I believe (but have not fully tested) that they handle fsync() properly. But I think you have to buy "industrial" SD cards to get real power fail protection.
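For what it's worth, setting that up from Python's sqlite3 looks roughly like this (database and table names made up); whether the card actually honours the resulting fsync()s is, as noted, another matter:

    import sqlite3

    con = sqlite3.connect("telemetry.db")
    con.execute("PRAGMA journal_mode=WAL")
    con.execute("PRAGMA synchronous=FULL")   # fsync on every commit
    with con:                                # one transaction per batch of samples
        con.execute("CREATE TABLE IF NOT EXISTS samples (ts REAL, value REAL)")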
Raspberry Pi uses microSD card. Just using fsync after every write would be a bit devastating, but batching might've worked ok in this case. Anyways, too late to check now.
There's definitely a place for it. I ran into the same problem with a battery powered event logger. Basically alternate between sleep-until-event and sample-until-event-over.
SQLite was fine until the realities of that environment hit.
0) I need to save the most data over time and my power budget is unpredictable due to environmentals.
1) When should I commit? SQLite commit per insert slows down, impacts battery life, impacts sample rate. Practically you could get away with batching all data for a small period.
2) SQLite is slow to repair databases. Partially written file would often take longer to repair than we had battery to run.
CSV based format filled that niche. First column was line-column count to support firmware upgrades. Last column is line-checksum. Another column indicating if this line was the last for an event. Parser skips corrupted lines/entries.
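A sketch of that kind of line format, not the actual implementation: field count up front, CRC at the end, and a reader that simply drops anything that doesn't check out. It assumes the fields themselves contain no commas, which is reasonable for numeric telemetry:

    import zlib

    def encode_line(fields):
        body = ",".join([str(len(fields))] + [str(f) for f in fields])
        return f"{body},{zlib.crc32(body.encode()):08x}\n"

    def decode_line(line):
        """Return the fields, or None if the line is truncated or corrupt."""
        body, _, crc = line.rstrip("\n").rpartition(",")
        try:
            if int(crc, 16) != zlib.crc32(body.encode()):
                return None
            count, *fields = body.split(",")
            return fields if int(count) == len(fields) else None
        except ValueError:
            return None

    line = encode_line([1712345678, "22.5", "3.71"])
    assert decode_line(line) == ["1712345678", "22.5", "3.71"]
    assert decode_line(line[:-10] + "garbage\n") is None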
If sqlite ends up corrupted, why wouldn't a CSV? What happens if the system dies partway through a write?
It would, but it'd be very easy to skip corrupted lines. With SQLite, I ended up losing data since the corrupted entry.
> I've since started looking at parquet files – which turned out to not be friendly to append-only operations. I've ended up implementing writing events into ipc files which then periodically get "flushed" into the parquet files. It works and it's efficient – but man is it non-trivial to implement properly!
I think the industry standard for supporting this is something like iceberg or delta, it's not very lightweight, but if you're doing anything non-trivial, it's the next logical move.
What isn't fun about CSV is quickly written parsers and serializers repeatedly making the common mistake of not handling, or badly handling, quoting.
For a long time I was very wary of CSV until I learnt Python and started using its excellent csv standard library module.
That's true, in recent years it's been less of a disaster, with lots of good CSV libraries for various languages. In the 90s CSV was a constant footgun; perhaps that's why they went crazy and came up with XML.
Even widely used libraries that you might expect get it right, don't. (Like Spark, which uses Java style backslash escaping)
Why not Pandas, since you're working with tabular data anyway?
When I first started, installing packages which required compiling native code on either my work Windows machine and the old Unix servers was not easy.
So I largely stuck to the Python standard library where I could, and most of the operations I had at the time did not require data analysis on a server, that was mostly done in a database. Often the job was validating and transforming the data to then insert it into a database.
As the Python packaging ecosystem matured and I found I could easily use Pandas everywhere it just wasn't my first thing I'd reach to. And occasionally it'd be very helpful to iterate through them with the csv module, only taking a few MBs of memory, vs. loading the entire dataset into memory with Pandas.
Because maybe they’re not doing something column oriented? Because it has a notoriously finicky API? A dozen other reasons?
The argument against JSON isn't very compelling. Adding a name to every field as they do in their strawman example isn't necessary.
Compare a CSV table to the directly-equivalent JSON (an array of row arrays): the JSON version is only marginally bigger (just a few brackets), but those brackets represent the ability to be either simple or complex. This matters because you wind up with terrible ad-hoc nesting in CSV, ranging from entries using query-string syntax to some entirely custom arrangement. And in these cases, JSON's objects are WAY better. Because CSV is so simple, it's common for people to avoid using a parsing/encoding library. Over the years, I've run into this particular kind of issue a bunch.
JSON parsers will not only output the expected values every time, but your language likely uses one of the super-efficient SIMD-based parsers under the surface (probably faster than what you are doing with your custom CSV parser). Another point is standardization: does that .csv file use commas, spaces, semicolons, pipes, etc.? Does it use CR, LF, or CRLF? Does it allow escaping quotations? Does it allow quotations to escape commas? Is it utf-8, UCS-2, or something different? JSON doesn't have these issues because these are all laid out in the spec.
JSON is typed. Sure, it's not a LOT of types, but 6 types is better than none.
While JSON isn't perfect (I'd love to see an official updated spec with some additional features), it's generally better than CSV in my experience.
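Roughly the kind of comparison being made, the same rows serialised as CSV and as the "directly equivalent" array-of-arrays JSON (column names invented for illustration):

    import csv, io, json

    rows = [["id", "name", "city"],
            ["1", "Ada", "London"],
            ["2", "Grace", "New York"]]

    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    as_csv = buf.getvalue()
    as_json = json.dumps(rows)              # no per-field names needed

    print(len(as_csv), len(as_json))        # JSON is only modestly larger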
> the directly-equivalent JSON
I think it's a big stretch to use that JSON for comparison. In practice, one is much more likely to see a name on every field. While most people would prefer that more verbose version, the compact array-of-arrays version is also valid JSON and will definitely see use when you want/need JSON but want to reduce data over the wire, though you'd probably still see a wrapper object around it.
> Because CSV is so simple, it's common for them to avoid using a parsing/encoding library.
A bit unfair to compare CSV without a parser library to JSON with a library.
Essentially nobody uses JSON without a library, but tons of people (maybe even most people) use CSV without a library.
Part of the problem here is standards. There's a TON of encoding variations all using the same .csv extension. Making a library that can accurately detect exactly which one is correct is a big problem once you leave the handful of most common variants. If you are doing subfield encoding, you are almost certainly on your own with decoding at least part of your system.
JSON has just one standard and everyone adheres to that standard which makes fast libraries possible.
The flexibility of JSON is a downside when you just want to stream large volumes of row-oriented tabular data
If you want to stream large volumes of row-oriented data, you aren't reading yourself and you should be using a binary format which is going to be significantly smaller (especially for numeric data).
Yeah that would be the next step in optimization. In the meanwhile, raw text CSV streaming (for not purely numeric data) is still extremely fast and easy to set up
> JSON parsers will not only output the expected values every time
Unless you need appendability, but then you should probably just use NDJSON/JSONL for a lot of cases.
I am annoyed that comma won out as the separator. Tab would have been a massively better choice. Especially for those of us who have discovered and embraced elastic tabstops. Any slightly large CSV is unreadable and uneditable because you can't easily see where the commas are, but with tabs and elastic tabstops, the whole thing is displayed as a nice table.
(That is, of course, assuming the file doesn't contain newlines or other tabs inside of fields. The format should use \t \n etc for those. What a missed opportunity.)
I wrote a web scraper for some county government data and went for tabs as well. It's nice how the columns lined up in my editor (some of these files had hundreds of thousands of lines).
We have dedicated field separator characters :-/
And all kinds of other weirdness, right in ascii. Vertical tabs, LOL. Put those in filenames on someone else's computer if you want to fuck with them. Linux and its common file systems are terrifyingly permissive in the character set they allow for file names.
Nobody uses any of that stuff, though.
CSV has multiple different separators. E.g. Excel defaults to different separators based on locale: the CZ locale uses commas in numbers instead of a dot, so CSV uses a semicolon as the default separator.
> Excel defaults to different separators based on locale.
Which is absolutely awful for interop and does not deserve being hauled as a feature.
They're completely skipping over the complications of header rows and front matter.
"8. Reverse CSV is still valid CSV" is not true if there are header rows for instance.
But really, whether or not CSV is a good format or not comes down to how much control you have over the input you'll be reading. If you have to deal with random CSV from "in the wild", it's pretty rough. If you have some sort of supplier agreement with someone that's providing the data, or you're always parsing data from the same source, it's pretty fine.
At various points in my career, I've had to oversee people creating data export features for research-focused apps. Eventually, I instituted a very simple rule:
As part of code review, the developer of the feature must be able to roundtrip export -> import a realistic test dataset using the same program and workflow that they expect a consumer of the data to use. They have up to one business day to accomplish this task, and are allowed to ask an end user for help. If they don't meet that goal, the PR is sent back to the developer.
What's fascinating about the exercise is that I've bounced as many "clever" hand-rolled CSV exporters (due to edge cases) as other more advanced file formats (due to total incompatibility with every COTS consuming program). All without having to say a word of judgment.
Data export is often a task anchored by humans at one end. Sometimes those humans can work with a better alternative, and it's always worth asking!
As someone who likes modern formats like parquet, when in doubt, I end up using CSV or JSONL (newline-delimited JSON). Mainly because they are plain-text (fast to find things with just `grep`) and can be streamed.
Most features listed in the document are also shared by JSONL, which is my favourite format. It compresses really well with gzip or zstd. Compression removes some plain-text advantages, but ripgrep can search compressed files too; otherwise, you can decompress on the fly and pipe the result through the usual text tools.
Another advantage of JSONL is that it's easier to chunk into smaller files. Too bad xz/lzma isn't supported in these formats. I often get pretty big improvements in compression ratio. It's slower, but it can be parallelized too.
You can cat Parquet and other formats into grep just as easily.
I switched to JSONL over a decade ago and I would recommend everyone else to also have switched then.
This whole thread is an uninformed rehash of bad ideas.
I think that might make sense ingest side, but that's very expensive to deal with if you're doing anything remotely large.
I think sinking into something like delta-lake or iceberg probably makes sense at scale.
But yeah, I definitely agree that CSV is not great.
JSONL as a replacement for CSV, you shouldn't be using CSV as format for long term storage or querying, it has so many downsides and nearly zero upsides.
JSONL when compressed with zstd, most of "expensive if large" disappears as well.
Generating and consuming JSONL can easily be in the GB/s range.
I mean on the querying side. Parquet's ability to skip rowgroups and even pages, and paired with iceberg or delta can make the difference between being able to run your queries at all versus needing to scale up dramatically.
Totally agree.
I am saying JSONL is a lower bound format, if you can use something better you should. Data interchange, archiving, transmission, etc. It shouldn't be repeatedly queried.
Parquet, Arrow, sqlite, etc are all better formats.
Essential CSV shell tools:
csvtk: https://bioinf.shenwei.me/csvtk/
gawk: https://www.gnu.org/software/gawk/manual/html_node/Comma-Sep...
awk: https://github.com/onetrueawk/awk?tab=readme-ov-file#csv
Also VisiData is an excellent TUI spreadsheet.
One of my favorite tools. However, I don’t think that Visidata is a spreadsheet, even though it looks like one and is named after one. It is more spreadsheet adjacent. It is focused on row-based and column-based operations. It doesn’t support arbitrary inter-cell operation(s), like you get in Excel-like spreadsheets. It is great for “Tidy Data’, where each row represents a coherent set of information about an object or observation. This is very much like Awk, or other pipeline tools which are also line/row oriented.
For CLI tools, I’m also a big fan of Miller (https://github.com/johnkerl/miller) to filter/modify CSV and other data sources.
Forgive me for promoting this that I wrote:
csvquote: https://github.com/dbro/csvquote
Especially for use with existing shell text processing tools, eg. cut, sort, wc, etc.
I would add xan to this list: https://github.com/medialab/xan
But of course, I am partial ;)
aaand xsv : https://github.com/BurntSushi/xsv
xan is a maintained fork of xsv
CSV still quietly powers the majority of the world’s "data plumbing."
At any medium+ sized company, you’ll find huge amounts of CSVs being passed around, either stitched into ETL pipelines or sent manually between teams/departments.
It’s just so damn adaptable and easy to understand.
> It's just so damn adaptable
Like a rapidly mutating virus, yes.
> and easy to understand.
Gotta disagree there.
For example, one of the CSVs my company shovels around is our Azure billing data. There are several columns that I just have absolutely no idea what the data in them is. There are several columns we discovered are essentially nullable¹ The Hard Way, when we got a bill which, e.g., included a charge that I guess Azure doesn't know what day it occurred on? (Or almost anything else about it.)
(If this format is documented anywhere, well, I haven't found the docs.)
Values like "1/1/25" in a "date" column. I mean, I did say it was an Azure-generated CSV, so obviously the bar wasn't exactly high, but then it never is, because anyone wanting to build something with some modicum of reliability, or discoverability, is sending data in some higher-level format, like JSON or Protobuf or almost literally anything but CSV.
If I can never see the format "JSON-in-CSV-(but-we-fucked-up-the-CSV)" ever again, that would spark joy.
(¹after parsing, as CSV obviously lacks "null"; usually, "" is a serialized null.)
Insurance. One of the core pillars of insurance tech is the CSV format. You'll never escape it.
>You'll never escape it.
I see what you did there.
I work as a data engineer in the financial services industry, and I am still amazed that CSV remains the preferred delivery format for many of our customers. We're talking datasets that cost hundreds of thousands of dollar to subscribe to.
"You have a REST API? Parquet format available? Delivery via S3? Databricks, you say? No thanks, please send us daily files in zipped CSV format on FTP."
> REST API
Requires a programmer
> Parquet format
Requires a data engineer
> S3
Requires AWS credentials (api access token and secret key? iam user console login? sso?), AWS SDK, manual text file configuration, custom tooling, etc. I guess with Cyberduck it's easier, but still...
> Databricks
I've never used it but I'm gonna say it's just as proprietary as AWS/S3 but worse.
Anybody with Windows XP can download, extract, and view a zipped CSV file over FTP, with just what comes with Windows. It's familiar, user-friendly, simple to use, portable to any system, compatible with any program. As an almost-normal human being, this is what I want out of computers. Yes the data you have is valuable; why does that mean it should be a pain in the ass?
Yes because users can read the data themselves and don't need a programmer.
Financial users live in Excel. If you stick to one locale (unfortunately it will have to be US) then you are OKish.
I've recently written a library at work to run visitors on data models bound to data sets. One of these visitors is a CSV serializer that dumps a collection as a CSV document.
I've just checked and strings are escaped using the same mechanism as for JSON, with backslashes. I should've double-checked against RFC 4180, but thankfully that mechanism isn't currently triggered anywhere for CSV (it's used for log exportation and none of that data triggers the code path). I've also checked the code from other teams and it's just handwritten C++ stream statements inside a loop that doesn't even try to escape data. It also happens to be fine for the same reason (log exportation).
I've also written serializers for JSON, BSON and YAML and they actually output spec-compliant documents, because there's only one spec to pay attention to. CSV isn't a specification, it's a bunch of loosely-related formats that look similar at a glance. There's a reason why fleshed-out CSV parsers usually have a ton of knobs to deal with all the dialects out there (and I've almost added my own by accident), that's simply not a thing for properly specified file formats.
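For the record, the RFC 4180 convention being compared against is "wrap the field in double quotes and double any embedded quote", not backslash escaping. A minimal sketch:

    def rfc4180_field(value: str) -> str:
        """Quote a field per RFC 4180: quotes are doubled, not backslash-escaped."""
        if any(c in value for c in ',"\r\n'):
            return '"' + value.replace('"', '""') + '"'
        return value

    assert rfc4180_field('say "hi"') == '"say ""hi"""'
    assert rfc4180_field('plain') == 'plain'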
The joy of CSV is everyone knows roughly what you mean and the details can be communicated succinctly.
The python3 csv module basically does the job.
This is particularly funny because I just received a ticket saying that the CSV import in our product doesn't work. I asked for the CSV, and it uses a semicolon as a delimiter. That's just what their Excel produced, apparently. I'm taking their word for it because... Excel.
To me, CSV is one of the best examples of why Postel's Law is scary. Being a liberal recipient means your work never ends because senders will always find fun new ideas for interpreting the format creatively and keeping you on your toes.
Of course, because there are locales which use a comma as the decimal separator. So CSV in Excel then defaults to a semicolon.
Another bit of Microsoft BS: they should default to the English locale in CSV, do the translation in the background, and let the user choose if they want to save with a different separator. Excel in every part of the world should produce the same CSV by default. Bunch of idiots.
Yes. CSV is a data interchange format, it's meant to be written on one computer and read by another. Making the representation of data dependent on the locale in use is stupid af. Locales are for interaction with the user, not for data interchange.
There is a lot not to like about CSV, for all the reasons given here. The only real positive is that you can easily create, read and edit CSV in an editor.
Personally I think we missed a trick by not using the ASCII US and RS characters:
Columns separated by \u001F (ASCII unit separator).
Rows separated by \u001E (ASCII record separator).
No escaping needed.
More about this at:
https://successfulsoftware.net/2022/04/30/why-isnt-there-a-d...
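A sketch of what that looks like in practice: the two control characters do all the work, and no quoting or escaping is needed as long as the data can never contain those bytes:

    US, RS = "\x1f", "\x1e"   # ASCII unit separator, record separator

    def dumps(rows):
        return RS.join(US.join(fields) for fields in rows)

    def loads(text):
        return [record.split(US) for record in text.split(RS)] if text else []

    table = [["name", "note"], ["Ada", "likes, commas\nand newlines"]]
    assert loads(dumps(table)) == table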
Good idea, but probably a non-starter due to no keyboard keys for those characters. Even | would've been a better character to use since it almost never appears in common data.
It is probably unrealistic to expect keyboard vendors to add new keys. But editors could support adding them through keyboard shortcuts (Ctrl + something).
Pipe can be useful as a field delimiter. But what do you use as the record delimiter?
Welp, now I know my weekend project.
In the abstract CSV is great. In reality, it is a nightmare not because of CSV, but because of all the legacy tools that produce it in slightly different ways (different character encodings mostly - Excel still produces some variant of latin1, some tools drop a BOM in your UTF8, etc).
Unless you control the producer of the data you are stuck trying to infer the character encoding and transcoding to your destination, and there's no foolproof way of doing that.
I greatly prefer TSV over CSV. https://en.wikipedia.org/wiki/Tab-separated_values
100% agree. TSV is under-rated. Tabs don't naturally occur in data nearly as often as commas so tabs are a great delimiter. Copy paste into Excel also works much better with tabs.
Code editors may convert tabs to spaces but are you really editing and saving TSV data files in your code editor?
TSV's big advantage is that, as commonly implemented, the separators are escaped, not quoted. This means that a literal newline (ASCII 0x0A) is always a record separator and a literal tab (ASCII 0x09) is always a field separator. This is the format many databases -- including Postgres -- use for text export by default.
There are some notes I put together about TSV a few years ago that expand on these points: https://github.com/solidsnack/tsv?tab=readme-ov-file#motivat...
Unix and AWK will parse it on the spot.
The problem with using TSV is different user configuration. For ex. if I use vim then Tab might indeed be a '\t' character but in TextEdit on Mac it might be something different, so editing the file in different programs can yield different formatting. While ',' is a universal char present on all keyboards and formatted in a single way
Well, back in my day, we used tabs for tabs.
But then some folks came along about 15 years ago screaming about spaces, and they won, so now tabs are 2 or 4 spaces.
The law of unintended consequences strikes again!
Note: not meant to denigrate you space supporters out there.
Csv suffers from similar encoding and platform specific gotchas surrounding newlines and delimiter escaping.
The only situation I can think of where a tab is not a tab, is in a code editor that's been configured (possibly by default) to use spaces instead. But that's an easy enough configuration to change. And certainly wouldn't be a problem for something like TextEdit.
CSV is a pseudo-standard anyway, IMO the delimiter should be a configurable option (like it is in Unix cut and those kinds of tools).
> IMO the delimiter should be a configurable option
It is. CSV has been character separated vs comma separated for probably decades now. Most tools you'd use to mess with them allow you to specify which separator character is being used.
I agree although it seems to render the distinction made by GP moot, right?
Functionally the same. I'd prefer CSV if my content was likely to have whitespace in it and didn't want billion quotes. I'd prefer TSV if my content was unlikely to have whitespace, and more likely to contain commas.
The problem with TSV is what are you going to do about quotes. Some fields might contain them [1] or they might be needed to store fields with tabs inside them.
Because of this in order to read a plain simple TSV (fields separated by tabs, nothing more) with the Python csv module [2] you need to set the quote character to an improbable value, say € (using the euro sign because HN won't let me use U+1F40D), or just parse it by hand, e.g. row.split('\t').
[1]: https://github.com/wireservice/csvkit/issues/1194
[2]: https://docs.python.org/3/library/csv.html
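For what it's worth, another way to read plain tabs-and-newlines TSV with the stdlib csv module is to switch quoting off entirely instead of inventing an improbable quote character; roughly (filename made up):

    import csv

    with open("data.tsv", newline="") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            print(row)   # quote characters come through as ordinary characters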
Quotes are just characters in TSV. Actual tab characters are banned so this should be simple. (Of course poor implementations may behave differently.)
Agreed, much easier to work with, especially if you can guarantee no embedded tabs or newlines. Otherwise you end up with backslash escaping, but that's still usually easier than quotes.
It's not really easier than CSV if you can guarantee no commas or newlines.
Parsing escapes is easier than parsing quoted text with field and record separators embedded in it. Every literal newline or literal tab is a separator. One can jump to the thousandth record, for example, just by skipping lines, without looking into the contents.
Much easier to require fields not have tabs than not have commas, though.
TSV looks incredibly ugly when opened in a text editor unless all values are 7 characters or less.
There are some specialized text editors for editing files with tabs. https://en.wikipedia.org/wiki/Tab_stop#Dynamic_tab_stops , https://nick-gravgaard.com/elastic-tabstops/ , https://tibleiz.net/code-browser/
If have to use a dedicated tabular data editing program, I may as well use a spreadsheet application. What do the options you propose do better than libreoffice calc?
What part of "the value of CSV is to have something that can be easily viewed and modified in any text editor" do you not understand
I’m not sure where that quote is from but it is incorrect, CSVs aren’t easily viewed or modified in text editors in general (at least not in any way that takes advantage of their tabular nature).
There’s at least a slight chance that tab separated values will look ok in a text editor (although in general, nope).
Set a bigger tab size in the editor, e.g. `:set ts=32` in vim.
vartabstop can help with per-column width. I found
https://sharats.me/posts/automating-the-vim-workplace-3/#usi...
which greatly improved my TSV editing experience in vim. Couple that with autocmd TextChangedI and you get realtime recalculation and alignment.
Aaaand now I can fit like 3 fields next to each other before they wrap and the field with arbitrary length text still misaligns the field after it
There is at least a chance that the text fields will fit in a tab. If not, a tool like “column” on Linux can be used. There is no chance that a text field will fit inside the width of a comma.
I was about to mention that column [1] is part of util-linux, so perhaps it's Linux specific, but then I noticed this in the FreeBSD man page [2]:
> The column command appeared in 4.3BSD-Reno.
[1]: https://man7.org/linux/man-pages/man1/column.1.html
[2]: https://man.freebsd.org/cgi/man.cgi?query=column&sektion=1
How does CSV look when the fields are all of different widths?
Like a list of lists of strings
meh. The whitespace makes it easier to eyeball IMHO.
Thanks for your input.
I think for "untyped" files with records, using the ASCII file, (group) and record separators (hex 1C, 1D and 1E) work nicely. The only constraint is that the content cannot contain these characters, but I found that that is generally no problem in practice. Also the file is less human readable with a simple text editor.
For other use cases I would use newline-separated JSON. It has most of the benefits listed in the article, except the uncompressed file size.
I agree that JSONL is the spiritual successor of CSV with most of the benefits and almost none of the drawbacks.
It has a downside though: wherever JSON itself is used, it tends to be a few kilobytes at least (from an API response, for example). If you collect those in a JSONL file the lines tend to get verrrry long and difficult to edit. CSV files are more compact.
JSONL files are a lot easier to work with though. Less headaches.
The drawbacks are quite substantial actually – uses much more data per record. For many cases it's a no-go.
Honestly yes. If text editors had supported these codes from the start, we might not even have XML, JSON or similar today. If these codes weren't "binary" and all scary, we would live in a much different world.
I wonder how much we have been hindered ourselves by reinventing plain text human-readable formats over the years. CSV -> XML -> JSON -> YAML and that's just the top-level lineage, not counting all the branches everywhere out from these. And the unix folks will be able to name plenty of formats predating all of this.
I'm not really sure why "Excel hates CSV". I import into Excel all the time. I'm sure the functionality could be expanded, but it seems to work fine. The bit of the process I would like improved is nothing to do with CSV - it's that the exporting programs sometimes rearrange the order of fields, and you have to accommodate that in Excel after the import. But since you can have named columns in Excel (make the data in to a table), it's not a big deal.
One problem is that Excel uses locale settings for parsing CSV files (and, to be fair, other text files). So if you're in e.g. Europe and you've configured Excel to use commas as decimal separators, Excel imports numbers with decimals (with points as decimal separator) as text. Or it thinks the point is a thousands separator. I forgot exactly which one of those incorrect options it chooses.
I don't know what they were thinking, using a UI setting for parsing an interchange format.
There's a way around, IIRC, with the "From text / csv" command, but that loses a lot of the convenience of double-clicking a CSV file in Explorer or whatever to open it in Excel.
It used to silently transform data on import. It used to silently drop columns.
That's it, but it's really bad.
It's really bad if your header row has fewer columns than the data rows. You really need to do the import vs just opening the file, because it's not even obvious that it's dropping data unless you know what to expect from your file.
Excel is halfway decent if you do the 'import' but not if you just doubleclick on them. It seems to have been programmed to intentionally do stupid stuff with them if you just doubleclick on them.
I agree that the default way Excel handles CSV files is terrible. Using Power Query to manage them is the way to go. But it's the general Microsoft approach to backwards compatibility so very unlikely to change now.
I remember that in the past Excel did not properly handle UTF-8 encoded text in a CSV. It would treat it as raw ASCII (or possibly code page 1252). So if you opened and saved a CSV, it would corrupt any Unicode text in the file. It's possible this has been fixed in newer versions, I haven't tried in a while.
It's related to how older versions of Windows/Office handled Unicode in general.
From what I have heard, it's still an issue with Excel, although I assume that Windows may handle plain text better these days (I haven't used it in a while)
You need to write a UTF-8 BOM at the beginning (0xEF, 0xBB, 0xBF) if you want to make sure it's recognized as UTF-8.
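For example (a minimal Python sketch, hypothetical file name): the utf-8-sig codec writes exactly those three bytes before the content, which is enough for Excel to pick UTF-8 instead of the legacy ANSI code page.

import csv

with open("export.csv", "w", encoding="utf-8-sig", newline="") as f:  # writes EF BB BF first
    writer = csv.writer(f)
    writer.writerow(["name", "city"])
    writer.writerow(["Zoë", "Kraków"])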
Ugh, UTF-8 BOM. Many apps can handle UTF-8 but will try to return those bytes as content; maybe ours in 2015 too
I was on the Power Query team when we were improving the encoding sniffing. An app can scan ahead i.e. 64kB, but ultimately the user needs to just say what the encoding is. All the Power Query data import dialogs should let you specify the encoding.
UTF-8 BOM is probably not a good idea for anything other than (maybe) plain text documents. For data, many (although not all) programs should not need to care about character encoding, and if they include something such as UTF-8 BOM then it will become necessary to consider the character encoding even though it shouldn't be necessary.
I have repeatedly seen people getting the spreadsheets altered by excel, and in general a lot of troubles due to localisation reasons. Sometimes these changes can be subtle and be hard to spot until somebody tries to troubleshoot what went wrong down the line.
It works better if you click to "import the data" instead of just opening the csv file with it, and if you then choose the right data types. But having to do this every time to make it work is really annoying, especially when you have a lot of columns, plus people can easily get confused with the data types. I have never seen that much confusion e.g. with macOS's Numbers.
https://stackoverflow.com/questions/165042/stop-excel-from-a...
It's an ad hoc text format that is often abused and a last-chance format for interchange. While heuristics can frequently work at determining the structure, they can just as easily frequently fail. This is especially true when dealing with dates and times or other locale-specific formats. Then, people outright abuse it by embedding arrays or other such nonsense.
You can use CSV for interchange, but a DuckDB import script with the schema should accompany it.
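Something along these lines (a sketch with made-up file and column names, using the duckdb Python package), so the schema travels with the data instead of being guessed from it:

import duckdb  # pip install duckdb

duckdb.sql("""
    CREATE TABLE customers AS
    SELECT * FROM read_csv('customers.csv',
        header = true,
        columns = {'id': 'BIGINT', 'name': 'VARCHAR', 'zip': 'VARCHAR'})
""")
print(duckdb.sql("SELECT count(*) FROM customers").fetchall())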
"CSV" should die. The linked article makes critical ommisions and is wrong about some points. Goes to show just how awful "CSV" is.
For one thing, it talks about needing only to quote commas and newlines... quotes are usually fine... until they sit at either end of the value; then you NEED to quote them as well.
Then there is the question about what exactly "text" is; with all the complications around Unicode, BOM markers, and LTR/RTL text.
CSV on the Web (CSVW) is a W3C standard designed to enable the description of CSV files in a machine-readable way.[1]
"Use the CSV on the Web (CSVW) standard to add metadata to describe the contents and structure of comma-separated values (CSV) data files." — UK Government Digital Service[2][3]
[1] https://www.w3.org/TR/tabular-data-primer/
[2] https://www.gov.uk/government/publications/recommended-open-...
[3] https://csvw.org/
One thing that has changed the game with how I work with CSVs is ClickHouse. It is trivially easy to run a local database, import CSV files into a table, and run blazing-fast queries on it. If you leave the data there, ClickHouse will gradually optimize the compression. It's pretty magical stuff if you work in data science.
I feel the same way about elastic.
That being said, I noticed .parquet as an export format option on Shopify recently and am hopeful more providers offer the choice.
Simon W's https://datasette.io/ is also excellent.
Datasette is a wonderful tool that I've used before, and I have the highest admiration for its creator, but the underlying Sqlite3 database doesn't handle large datasets (i.e. hundreds of millions of rows) nearly as well as ClickHouse does.
It's worth noting that I only ran into this limitation when working with huge federal campaign finance datasets [1] and trying to do some compute-intensive querying. For 99% of use cases, datasette is a similarly magical piece of software for quickly exploring some CSV files.
1. https://www.fec.gov/data/browse-data/?tab=bulk-data
CSV is everywhere. I use manifold-csv[1] it’s amazing.
1. https://github.com/manifold-systems/manifold/tree/master/man...
How much easier would all of this be if whoever did CSV first had done the equivalent of "man ascii". There are all these wonderful codes there like FS, GS, RS, US that could have avoided all the hassle that quoting has brought generations of programmers and data users.
All hail TSV. Like CSV, but you're probably less likely to want tabs in your data than commas.
I wrote my own CSV parser in C++. I wasn't sure what to do in some edge cases, e.g. when character 1 is space and character 2 is a quote. So I tried importing the edge case CSV into both MS Excel and Apple Numbers. They parsed it differently!
Also CSV can be queried : https://til.simonwillison.net/sqlite/one-line-csv-operations
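The link uses the sqlite3 CLI one-liner; a rough plain-Python equivalent (hypothetical file name, header names assumed to be SQL-safe) looks like this:

import csv, sqlite3

con = sqlite3.connect(":memory:")
with open("orders.csv", newline="") as f:
    rows = list(csv.reader(f))
header, data = rows[0], rows[1:]
con.execute(f"CREATE TABLE orders ({', '.join(header)})")  # every column is TEXT
con.executemany(f"INSERT INTO orders VALUES ({', '.join('?' * len(header))})", data)
print(con.execute("SELECT count(*) FROM orders").fetchone())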
CSV is so deceptively simple that people don't care to understand it. I wasted countless hours working around services providing non-escaped data that off-the-shelf parsers could not parse.
Something I support completely - previously https://news.ycombinator.com/item?id=35418933#35438029
If CSV is indeed so horrible - and I do not deny that there can be an improvement - how about the clever data people spec out a format that
Does not require a bizarre C++ RPC struct definition library _both_ to write and to read
Does not invent a clever number encoding scheme that requires native code to decode at any normal speed
Does not use a fancy compression algorithm (or several!) that you need - again - native libraries to decompress
Does not, basically, require you be using C++, Java or Python to be able to do any meaningful work with it
It is not that hard, really - but CSV is better (even though it's terrible) exactly because it does not have all of these clever dependency requirements for clever features piled onto it. I do understand the utility of RLE, number encoding etc. I do not, and will not, understand the utility of Thrift/Avro, zstandard and brotli and whatnot over standard deflate, and custom integer encoding which requires you download half of Apache Commons and libboost to decode. Yes, those help the 5% to 10% of the use cases where massive savings can be realised. It absolutely ruins the experience for the other 90 to 95.
But they also give Parquet and its ilk a very high barrier of entry.
I have written a new database system that will convert CSV, JSON, and XML files into relational tables.
One of the biggest challenges with CSV files is the lack of data types on the header line that could help determine the schema for the table.
For example a file containing customer data might have a column for a Zip Code. Do you make the column type a number or a string? The first thousand rows might have just 5 digit numbers (e.g. 90210) but suddenly get to rows with the expanded format (e.g. 12345-1234) which can't be stored in an integer column.
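A hedged illustration of that failure mode, using pandas as a stand-in for any type-inferring importer:

import io
import pandas as pd

csv_text = "name,zip\nAlice,02134\nBob,90210\n"

inferred = pd.read_csv(io.StringIO(csv_text))                      # zip inferred as int64
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})  # zip kept as text

print(inferred["zip"].tolist())  # [2134, 90210]  <- leading zero silently lost
print(explicit["zip"].tolist())  # ['02134', '90210']
# and a later "12345-1234" row would break the integer assumption entirely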
csv does not stop you from making the first line be column headers, with implied data types, you just have to comma separate them!
I realize that. But when reading a header you have to infer the data types, which might be wrong. I always thought it would have been great if the first line read something like: name:STRING,address:STRING,zip code:INTEGER,ID:BIG_INT,...
The worst thing about CSV is Excel using localization to choose the delimiter: https://answers.microsoft.com/en-us/msoffice/forum/all/csv-f...
The second worst thing is that the escape character cannot be determined safely from the document itself.
For everyone complaining about CSV/TSV, there's a scripting language called R.
It makes working with CSV/TSV files super simple.
It's as easy as this:
# Import tidyverse after installing it with install.packages("tidyverse")
library(tidyverse)
# Import TSV
dataframe_tsv <- read_tsv("data/FileFullOfDataToBeRead.tsv")
# Import CSV
dataframe_csv <- read_csv("data/FileFullOfDataToBeRead.csv")
# Mangle your data with dplyr, regular expressions, search and replace, drop NA's, you name it.
<code to sanitize all your data>
Multiple libraries exist for R to move data around, change the names of entire columns, change values in every single row with regular expressions, drop any values that have no assigned value, it's the swiss army knife of data. There are also all sorts of things you can do with data in R, from mapping with GPS coordinates to complex scientific graphing with ggplot2 and others.
Here's an example for reading iButton temperature sensor data: https://github.com/hominidae/ibutton_tempsensors/
Notice that in the code you can do the following to skip leading lines by passing it as an argument: skip = 18
cf1h <- read_csv("data/Coldframe_01_High.csv", skip = 18)
I've found some use cases where CSV can be a good alternative to arrays for storage, search and retrieval. Storing and searching nested arrays in document databases tends to be complicated and require special queries (sometimes you don't want to create a separate collection/table when the arrays are short and 1D). Validating arrays is actually quite complicated; you have to impose limits not only on the number of elements in the array, but also on the type and size of elements within the array. Then it adds a ton of complexity if you need to pass around data because, at the end of the day, the transport protocol is either string or binary; so you need some way to indicate that something is an array if you serialize it to a string (hence why JSON exists).
Reminds me of how I built a simple query language which does not require quotation marks around strings, this means that you don't need to escape strings in user input anymore and it prevents a whole bunch of security vulnerabilities such as query injections. The only cost was to demand that each token in the query language be separated by a single space. Because if I type 2 spaces after an operator, then the second one will be treated as part of the string; meaning that the string begins with a space. If I see a quotation mark, it's just a normal quotation mark character which is part of the string; no need to escape. If you constrain user input based on its token position within a rigid query structure, you don't need special escape characters. It's amazing how much security has been sacrificed just to have programming languages which collapse space characters between tokens...
It's kind of crazy that we decided that quotation marks are OK to use as special characters within strings, but commas are totally out of bounds... That said, I think Tab Separated Values TSV are even more broadly applicable.
Funny how the "specification holds in a tweet" yet manages to miss at least three things: 1) character encoding, 2) BOM or not, 3) header or no header.
Always UTF-8. Never a BOM. Always a header
Great if you're the one producing the CSV yourself.
But if you're ingesting data from other organizations, they will, at one time or another, fuck up every single one of those (as well as the ones mentioned in TFA), no matter how clearly you specify them.
the simplicity is underappreciated because people don't realise how many dumb data engineers there are. i'm pretty sure most of them can't unpack an xml or json. people see a csv and think they can probably do it themselves, any other data format they think 'gee better buy some software with the integration for this'.
Excel hates CSV only if you don't use the "From text / csv" function (under the data tab).
For whatever reason, it flawlessly manages to import most CSV data using that functionality. It is the only way I can reliably import data to excel with datestamps / formats.
Just drag/dropping a CSV file onto a spreadsheet, or "open with excel" sucks.
Even "From Text / CSV" sucks:
It inserts an extra row at the top for its pivot table, with entries "Column1, Column2, ...".
So if you export to CSV again, you now have 2 header rows.
So Excel can't roundtrip CSVs, and the more often you roundtrip, the more header rows you get.
You need to remember to manually delete the added header row each time, otherwise software you export back to can't read it.
That's strange, I've never seen this behaviour. Loading a CSV this way (Data -> From Text/CSV) always parses the first record as the header for me.
Yea they seem to have added this about a year ago and it works pretty well, to be fair.
Now if they would just also allow pasting CSV data as “source” it would be great.
It's a carry over from powerbi actually, separate function entirely.
If this was really a love letter, it would have been in CSV format.
CSV is bad. Furthermore it’s unnecessary. ASCII has field and record separator characters that were for this purpose.
That would be great if keyboards had keys for those characters and there was a common way to display them on a screen, but they don't and there isn't.
I can’t remember the last time I or anyone else I know typed a CSV file out. It’s almost universally the lowest common denominator interchange format.
That's a feature, not a bug
> 4. CSV is streamable
This is what keeps me coming back.
…ndjson is streamable, too…
I like ndjson and jsonl just fine, but unless I need a more complicated structure, it's not worth the extra hassle of parsing JSON.
… what concrete language are we talking about, here?
In literally any language I can think of, hassle(json) < hassle(CSV), esp. since CSV received is usually "CSV, but I've screwed it up in a specific, annoying way"
I'm thinking mostly of the computational complexity.
But even ergonomically, in Python you can read a CSV in a couple of lines of the standard library, which imo is not less ergonomic than the jsonl equivalent.
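Roughly (hypothetical file names):

import csv, json

with open("data.csv", newline="") as f:   # CSV: one list of strings per row
    csv_rows = list(csv.reader(f))

with open("data.jsonl") as f:             # JSONL: one parsed object per line
    jsonl_rows = [json.loads(line) for line in f]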
Just don't write that love letter in French ... or any language that uses comma for decimals
The Italian version of Excel uses a custom CSV style with ; as a column separator. This breaks many applications that accept CSVs. It's super annoying.
You should look at its author's nationality ;)
The library wraps values like that in quotes
>This is so simple you might even invent it yourself without knowing it already exists while learning how to program.
This is a double-edged sword. The "you might even invent it yourself" simplicity means that in practice lots of different people do end up just inventing their own version rather than standardizing to RFC-4180 or whatever when it comes to "quote values containing commas", values containing quotes, values containing newlines, etc. And the simplicity means these type of non-standard implementations can go completely undetectable until a problematic value happens to be used. Sometimes added complexity that forces paying more attention to standards and quickly surfaces a diversion from those standards is helpful.
Just last week I was bitten by a customer's CSV that failed due to Windows' invisible BOM character that sometimes occurs at the beginning of Unicode text files. The first column's title is not "First Title" then but "&zwnbsp;First Title". Imagine how long it takes before you catch that invisible character.
Aside from that: Yes, if CSV were an intentional, defined format, most of us would do something different here and there. But it is not; it is more of a convention that came upon us. CSV "happened", so to say. No need to defend it more passionately than the fact that we walk on two legs. It could have been much worse, and it has surprising advantages over other things that were well thought out before we did it.
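For what it's worth, in Python the utf-8-sig codec is a cheap guard against exactly that: it strips a leading BOM if present and otherwise behaves like plain UTF-8 (file name made up):

import csv

with open("customer_export.csv", encoding="utf-8-sig", newline="") as f:
    header = next(csv.reader(f))
print(header)  # no invisible "\ufeff" glued to the first title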
I wish the UTF8BOM was standardized. Encoding guessing usually works until it doesn't.
There was/is CSVY [0] which attempted to put column style and separator information in a standard header. It is supported by R lang.
I also asked W3C on their GitHub if there was any spec for CSV headers and they said there isn't [1]. Kind of defeats the point of the spec in my opinion.
0: https://github.com/leeper/csvy
1: https://github.com/w3c/csvw/issues/873
CSV is the bane of my existence. There is no reason to use it outside of legacy use-cases, when so many alternatives are not so brittle that they require endless defensive hacks to avoid erring as soon as exposed to the universe. CSV must die.
CSV isn't brittle, and I'm not sure what "hacks" you're referring to. If you or your parser just follow RFC 4180 (particularly quote every field, and use doubled quotes to escape a quote), that will get you 90%+ compatibility.
Surely you’ve come across situations where line number 10,000,021 of a 60m line CSV fails to parse because there aren’t enough fields in that line of the file…? The issue is that you can’t definitively know which of the 50 fields is missing, so you have to fail the line or worse the file.
In my experience (perhaps more niche than yours since you mentioned it has been your day job), the lack of fall back options makes for brittle integrations. Failing entire files due to a borked row can be expensive in terms of time.
Having to ingest large CSV files from legacy systems has made me rethink the value of XML, lol. Types and schemas add complexity for sure, but you get options for dealing with variances in structure and content.
That is a problem, but it is also a problem with XML. Parsing the XML file to discover e.g. unmatching tags is far more CPU and memory expensive than correctly parsing a CSV.
In both cases you'd fail the entire file rather than partial recovery.
/me laughs in legacese
RFC4180 is a late attempt at CSV standardization, merely codifying a bunch of sane practices. It also provides a nice specification for generating CSV. But anyone taking care to code from a specification might as well use a proper file format.
The real specification for CSV is as follows: "Valid CSV is whatever is designated as CSV by its emitter". I wish I was joking.
There is literally an infinity of ways CSV can be broken. The developer will bump his head on each as he encounters them, and add a specific fix. After a while, his code will be robust against the local strains of CSV... until the next mutation is encountered after acquiring yet another company with a bunch of ERPs way past their last extended maintenance era, a history of local adaptations, and CSV as a message bus.
CSV is awesome for front-end webapps needing to fetch A LOT of data from a server in order to display an information-dense rendering. For that use-case, one controls both sides so the usual serialization issues aren't a problem.
Working in clinical trial data processing I receive data in 1 of 3 formats: csv, sas datasets, image scans of pdf pages showing spreadsheets
Of these 3 options sas datasets are my preference, but I'll immediately convert to csv or excel. csv is a close 2nd; once you confirm the quoting/separator conventions it's very easy to parse. I understand why someone may find the csv format disagreeable, but in my experience the alternatives can be so much worse that I don't worry too much about csv files.
TSV > CSV
Way easier to parse
pipe (|) separated or gtfo!
sqlite3 gets it right.
What do you like so much about the pipe?
I think pipe is better too.
Typical latin fonts divide characters into three heights: short like "e" or "m", tall like "l" or "P" and deep like "j" or "y". As you may notice, letters only use one or two of these three sections.
Pipe is unique in that it uses all three at the same time from the very top to the very bottom. No matter what latin character you put next to it, it remains distinct. This makes the separators relatively easy to spot.
Pipe is a particularly uncommon character in normal text while commas, spaces, semicolons, etc are quite common. This means you don't need to escape it very often. With an escapable pipe, an escapable newline, and unicode escapes ("\|", "\n", and "\uXXXX") you can handle pretty much everything tabular with minimal extra characters or parsing difficulty.
This in turn means that you can theoretically differentiate between different basic types of data stored within each entry without too much difficulty. You could even embed JSON inside it as long as you escape pipes and newlines.
Maybe someone should type this up into a .psv file format (maybe it already exists).
I already prefer using pipe as separator in logging; now you're telling me there is a chance that my logs can be automatically ingested as tabular data? Sign me up for this branch of the multiverse :)
I forgot to add that you need a "\\" escape for when it appears before |, n, or u in text.
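A rough sketch of that scheme in Python (escaping only backslash, pipe and newline; the \uXXXX escapes described above are left out for brevity):

def escape(field):
    # order matters: double the backslashes first
    return field.replace("\\", "\\\\").replace("|", "\\|").replace("\n", "\\n")

def write_row(fields):
    return "|".join(escape(f) for f in fields)

def parse_row(line):
    fields, cur, i = [], [], 0
    while i < len(line):
        c = line[i]
        if c == "\\" and i + 1 < len(line):   # escaped character
            nxt = line[i + 1]
            cur.append("\n" if nxt == "n" else nxt)
            i += 2
        elif c == "|":                        # unescaped pipe = field boundary
            fields.append("".join(cur))
            cur = []
            i += 1
        else:
            cur.append(c)
            i += 1
    fields.append("".join(cur))
    return fields

row = ["plain", "has | pipe", "multi\nline", "back\\slash"]
assert parse_row(write_row(row)) == row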
Visually similar to column separator in a spreadsheet.
Less likely to appear in normal data. Of course you have to escape it but at the very least the data looks less noisy.
"the controversial ex-post RFC 4180"
I looked at the RFC. What is controversial about it?
Look at how it handles escaping of special characters and particularly new lines (RFC 4180 doesn’t guarantee that a new line is a new record) and how it’s written in 2005 yet still doesn’t handle unicode other than via a comment about ”other character sets”.
"how it’s written in 2005 yet still doesn’t handle unicode other than via a throwaway comment about ”other character sets”"
Yeah, you are spot on with this one (cries in Czech, which used to be encoded in several various ways).
> RFC 4180 doesn’t guarantee that a new line is a new record
Correctly. A good parser should step through the line one column at a time, and shouldn't even consider newlines that are quoted.
If you're naively splitting the entire file via newline, that isn't 4180's fault, that is your fault for not following the standard or industry norms.
I'll happily concede the UNICODE point however; but I don't know if that makes it controversial.
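A quick way to see the difference, with Python's csv module standing in for "a good parser":

import csv, io

data = 'id,comment\r\n1,"first line\nsecond line"\r\n2,plain\r\n'

print(data.splitlines())              # naive newline split: four bogus "records"
print(list(csv.reader(io.StringIO(data))))
# [['id', 'comment'], ['1', 'first line\nsecond line'], ['2', 'plain']]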
You mean aside from the fact it's ex-post ...
Doesn't it make sense to have a common document in the usual format (RFC) which every newbie can consult when in doubt?
I much prefer that to any sort of "common institutional memory" that is nevertheless only talked about on random forums. People die, other people enter the field... hello subtle incompatibilities.
The fact that you can parse CSV in reverse is quite cool, but you can't necessarily use it for crash recovery (as suggested) because you can't be sure that the last thing written was a complete record.
Last field rather than last record. The first row will give you column count.
The column count doesn't help because you don't know where the last record starts because you don't know whether you're in a string or not. Unless you scan the entire file from the beginning, which defeats the object.
Unless newlines in strings are escaped
I love CSV for a number of reasons. Not the least of which it’s super easy to write a program (code) in C to directly output all kinds of things to CSV. You can also write simple middleware to go from just about any database or just general “thing” to CSV. Very easily. Then toss CSV into excel and do literally anything you want.
It’s sort of like, the computing dream when I was growing up.
+1 to ini files. I like you can mess around with them yourself in notepad. Wish there was a general outline / structure to those though.
I have been splitting my head trying to parse data from an ERP database to CSV and then from CSV back into the ERP database, using the programming language used by the ERP system.
The first part of converting data to CSV works fine with the help of an AI coding assistant.
The reverse part, CSV back to the database, is getting challenging, and even Claude Sonnet 3.7 is not able to escape newlines correctly.
I am now implementing the data format in JSON, which is much simpler.
Items #6 to #9 sound like genuine trolling to me; item #8, reversing bytes because of course no other text encodings than ASCII exist, is particularly horrible.
the reversing bytes part is encoding agnostic. you just feed the reversed bytes to the csv parser then re-reverse both the yielded rows and the cells bytes and get the original order of the bytes themselves.
Except for multibyte encodings.
CSV?
it's surely simple but..
- where is the metadata?
- how do you encode binary?
- how do you index?
- where are relationships? different files?
I think I understand the point being made, but all this reliance on text-based data means we require proper agreement on text encodings, etc. I don't think it's very useful for number-based data anyway, it's a massively bloated way to store float32s for instance and usually developers truncate the data losing about half of the precision in the process.
For numerical data, nothing beats packing floats into blobs.
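A back-of-the-envelope sketch of the size gap being described:

import struct

values = [0.1234567890123, 3.14159265358979, 2.718281828459045] * 1000

as_csv = ",".join(repr(v) for v in values).encode()  # full-precision text
as_blob = struct.pack(f"{len(values)}f", *values)    # raw float32

print(len(as_csv), len(as_blob))  # roughly 4-5x smaller as a float32 blob,
                                  # before any compression enters the picture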
I think binary formats have many advantages. Not only for numbers but other data as well, including data that contains text (to avoid needing escaping, etc; and to declare what character sets are being used if that is necessary), and other structures. (For some of my stuff I use a variant of DER, which adds a few new types such as key/value list type.)
Relevant discussion from a few years back
https://news.ycombinator.com/item?id=28221654
I can't really understand the love for the format. Yes, it's simple, but it's also not defined in a common spec. Same story with Markdown. Yes, GitHub tried to push for a spec but it still feels more like a flavor. I mean, there is nothing wrong with not having a spec. But certain guarantees are not given. Will the document exported by X work with Y?
I love CSV when it's only me creating/using the CSV. It's a very useful spreadsheet/table interchange format.
But god help you if you have to accept CSVs from random people/places, or there's even minor corruption. Now you need an ELT pipeline and manual fix-ups. A real standard is way better for working with disparate groups.
CSV has caused me a lot of problems due to the weak type system. If I save a Dataframe to CSV and reload it, there is no guarantee that I'll end up with an identical dataframe.
I can depend on parquet. The only real disadvantages with parquet are that they aren't human-readable or mutable, but I can live with that since I can easily load and resave them.
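A small example of the kind of drift being described (pandas; the parquet round trip needs pyarrow or fastparquet installed):

import io
import pandas as pd

df = pd.DataFrame({
    "policy": ["007", "042"],                              # strings that look numeric
    "when": pd.to_datetime(["2024-01-01", "2024-06-01"]),
})

back = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(back.dtypes)      # policy came back as int64, when as plain object
print(back.equals(df))  # False: not the frame that was saved
# df.to_parquet("df.parquet"); pd.read_parquet("df.parquet") preserves both dtypes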
Quick question while we’re on the topic of CSV files: is there a command-line tool you’d recommend for handling CSV files that are malformed, corrupted, or use unexpected encodings?
My experience with CSVs is mostly limited to personal projects, and I generally find the format very convenient. That said, I occasionally (about once a year) run into issues that are tricky to resolve.
I'm in on the "shit on microsoft for hard to use formats train" but as someone who did a LOT of .docx parsing - it turned into zen when I realized that I can just convert my docs into the easily parsed .html5 using something like pandoc.
This is a good blog post and Xan is a really neat terminal tool.
Xan looks great. Miller is another great cli tool for transforming data among csv, tsv, json and other formats.
https://miller.readthedocs/
https://miller.readthedocs.io/
trimmed the version path off and went a bit too far! thanks.
> CSV is dynamically typed
No, CSV is dependently typed. Way cooler ;)
I wrote something about this https://github.com/Ericson2314/baccumulation/blob/main/datab...
Excel won't import ISO 8601 timestamps either, which is crazy these days where it's the universal standard, and there's no excuse to use anything else.
You have to replace the "T" separator with a space and also any trailing "Z" UTC suffix (and I think any other timezone/offset as well?) for Excel to be able to parse as a time/date.
I usually have the opposite problem. There's even a joke about it:
How is Excel like an Incel? Both of them think everything is a date.
Have you tried using the "from text/csv" importer under the data tab? Where it will import your data into a table. Because that one will import ISO 8601 timestamps just fine.
This, it's dumb but Excel handles csv way better if you 'import' it vs just opening it. I use excel to quickly preview csv files, but never to edit them unless I'm OK only ever using it in Excel afterwards.
Even in that case I'd be hesitant to open a CSV file in excel. The problem is that it will automatically apply whatever transformation it thinks is appropriate the moment you open the file. Have a digit string that isn't semantically a number? Too bad, it's a number now, and we're gonna go ahead and round it. You didn't really need _all_ of the digits of that insurance policy number, did you?
They did finally add options to turn off the common offenders, but I have a deeply ingrained distrust at this point.
I've noticed recently, they ask you about some of the transformations with a popup instead of automatically doing them when you open csv files.
Honestly I’m happier when Excel doesn’t try to convert anything. too much bugginess.
Especially gene names. It was so bad that the scientific community renamed the genes in question rather than suffering from the same horror endlessly.
[0] https://www.theverge.com/2020/8/6/21355674/human-genes-renam...
Just wrong!!
I don't get it - why in the world can't Excel just open the CSV, assume from the extension that it's COMMA separated values, and do the rest? It does work slightly better when importing, just a little.
Your comma isn't my comma. French systems use the comma as a decimal point for numbers and we use semicolons to separate fields in CSV files.
No, french systems also use comma to separate fields in CSV files. Excel uses semicolon to separate fields in France, meaning it generates semicolon-separated files rather than comma-separated files.
It's not the fault of CSV that Excel changes which file format it uses based on locale.
It's even worse than that. Office on my work computer is set to the English language, but my locale is French and so is my Windows language. It's saving semicolon-separated CSV files with the comma as a decimal point.
I need to uncheck File > Option Advanced > Use system separators and set the decimal separator to a dot to get Excel to generate English-style CSV files with semicolon-separated values. I can't be bothered to find out where Microsoft moved the CSV export dialog again in the latest version of Office to get it to spit out comma-separated fields.
Point is, CSV is a term for a bunch of loosely-related formats that depends among other things on the locale. In other words, it's a mess. Any sane file format either mandates a canonical textual representation for numbers independent of locale (like JSON) or uses binary (like BSON).
> It's saving semicolon-separated CSV files with the comma as a decimal point.
It's not though, is what I'm saying. It's saving semicolon-separated files, not CSV files. CSV files have commas separating the values. Saying that Excel saves "semicolon-separated CSV files" is nonsensical.
I can save binary data in a .txt file, that doesn't make it a "text file with binary data"; it's a binary file with a stupid name.
Sorry, but what Excel does is save to a file with a CSV extension. This format is well defined and includes ways to specify encoding and separator to be readable under different locales.
This format is not comma separated values. But Excel calls it CSV.
The headaches comes if people assume that a csv file must be comma separated.
I don't care what Excel calls it. As I said, if I name a file .txt but stuff it with binary data, it's not a text file.
That, bad specs, weird management/timezone/governance/communication issues and random \n\r issues transformed a fun little 2-day project into 4 weeks of hell. I will never work with CSV in France ever again. Mostly because of Excel; normal CSV is nice.
Most of the people most of the time aren't importing data from a different locale. A good assumption for defaults could be that the CSV file honors the current Windows regional settings.
If it only was that easy. Experience has shown that the only reliable way is to run heuristics against the first few lines of the file.
There are office programs that save CSV with the proper comma delimiter regardless of the locale.
There are people who run non-local locales for various good reasons.
There are technically savvy people who have to deal with CSV shenanigans and can and will send it with the proper comma delimiter.
It could, but it doesn't want to. The whole MS Office dominance came into being by making sure other tools can't properly open documents created by MS tools; plus being able to open standard formats but creating small incompatibilities all around, so that you share the document in MS format instead.
Probably Microsoft treats a pure-text, simply specified, human-readable and editable spreadsheet format that fosters interoperability with competing software as an existential threat.
This. I got burnt by the encoding and other issues with CSV in Excel back in the day, I've only used LibreOffice Calc (on Linux) for viewing / editing CSVs for many years now, it's almost always a trouble-free experience. Fortunately I don't deal much with CSVs that Excel-wielding non-devs also need to open these days - I assume that, for most folks, that's the source of most of their CSV woes.
I just wish Excel was a little less bad about copy-pasting CSVs as well. Every single time, without fail, it dumps them into a single column. Every single time I use "text to columns" it has insane defaults where it's fixed-width instead of delimited by, you know, commas. So I change that and finally it's fixed.
Then I go do it somewhere else and have to set it up all over again. Drives me nuts. How the default behavior isn't to just put them in the way you'd expect is mind-boggling.
Search and replace + text to columns after the fact works fine.
10. CSV doesn't need commas!
Use a different separator if you need to.
CSV is the Vim of formats. If you get a CSV from 1970 you can still load it.
I have to agree.
It was pretty straightforward (although tedious) to write custom CSV data exports in embedded C, with ZERO dependencies.
I know, I know, only old boomers care about removing pip from their code dev process, but, I'm an old boomer, so it was a great feature for me.
Straight out of libc I was able to dump data in real-time, that everyone on the latest malware OSes was able to import and analyze.
CSV is awesome!
Yeah, CSV is easy to export, because it's not really a file format, but more an idea. I'm not even sure there is such a thing as "invalid" CSV
The following are all valid CSV, and they should all mean the same thing, depending on your point of view:
1) foo, bar, foobar
2) "foo", "bar", "foobar"
3) "foo", bar, foobar
4) foo; bar; "foobar"
5) foo<TAB>bar<TAB>"foobar"
6) foo<EOL>bar<EOL>foobar<EOL>
Have fun writing that parser!
Using <tab> makes it not csv but tsv.
Honestly, if there is no comma to separate the values, then it's not CSV; maybe CSV for "character separated values", or ASV for "anything separates values". But you're right, this makes it hard, with everyone doing whatever. IMV, supporting "" makes supporting anything else redundant.
Tell it to Microsoft
I hate CSV (but not as much as XML).
Most reasonably large CSV files will have issues parsing on another system.
It makes me a bit worried to read this thread; I would've thought it's pretty common knowledge why CSV is horrible, and widely agreed upon. I also have a hard time taking anybody seriously who uses "specification" and "CSV" in the same sentence unironically.
I suspect it's 1) people who worked with legacy systems AND LIKED IT, or 2) people who never worked with legacy systems before and need to rediscover painful old lessons for themselves.
It feels like trying to convince someone, why its a bad idea to store the year as a CHAR(2) in 1999, unsuccessfully.
The thread makes me worried for a different reason, especially how many avoid reading the RFC and even say that the RFC doesn't matter, that there's no standard for CSV. Maybe there should be an extension for rfc-csv.
Maybe just read the love letter?
This might be the most passionate and well-argued defense of CSV I've read
So easy to get data in and out of an application, opens seamlessly in Excel or your favorite DB for further inspection. The only issue is the comma rather than a less-used separator like | that occasionally causes issues.
Any recommendations for CSV editors on OSX? I was just looking around for this today. The "Numbers" app is pretty awful and I couldn't find any superb substitutes, only ones that were just OK.
I’ve been using Easy CSV Editor. I especially like getting the min/max, unique, etc values in a given column.
I've said it before, CSV will still be used in 200 years. It's ugly, but it occupies an optimal niche between human readability, parsing simplicity, and universality.
With a nice ASIC parser to replace databases.
JSON, XML, YAML are tree describing languages while CSV defines a single table. This is why CSV still works for a lot of things (sure there is JSON lines format of course).
I just wish line breaks weren't allowed to be quoted. I would have preferred \n. Now I can't read line-by-line or stream line-by-line.
Using ascii 'US' Unit Separator and 'RS' Record Separator characters would be a far better implementation of a CSV file.
and of course you can do that if you wish, as many CSV libraries allow arbitrary separators and escapes (though they usually default to the "excel compatible" format)
but at least in my case, I would not like to use those characters because they are cumbersome to work with in a text editor. I like very much to be able to type out CSV columns and rows quickly, when I need to.
It's all pros and cons. The benefit of those characters is that they are not used anywhere else, hence you never have to worry about escaping/quoting strings. But obviously most of my csv usage is automated in/out.
I always feel like CSV gets a bad rap. It definitely has problems if you get into corner cases but for many situations it's just fine.
People that talk about readability: if you store using jsonl (one json per line) - you can get your csv by using the terminal command jq.
I wish CSV could have headers for meta data. And schemas would be awesome. And pipes as well to avoid the commas in strings problem.
CSV is too new and without a good standard quoting strategy. Better to stay with the boring old fixed-length column format :))
> Excel hates CSV
Does it though? Seems to be importing from and exporting to CSV just fine? Elaborate maybe.
9. Excel hates CSV: It clearly means CSV must be doing something right. <3<3<3
With quick and dirty bash stuff I've written the same CSV parser so many times that it lives in my head and I can write it from memory. No other format is like that. Trying to parse JSON without jq or a library is much more difficult.
I prefer Tab-Separated. Its problem though: No tabs allowed in your data.
CSV isn't dynamically typed. Everything is just a string.
I used to prefer csv. Then I started using parquet.
Never want to use sas7bdat again.
shout out to BurntSushi's excellent xsv util
https://github.com/BurntSushi/xsv
Kudos for writing this, it's always worth flagging up the utility of a format that just is what it is, for the benefit of all. Commas can also create fun ambiguity, as that last sentence demonstrates. :P
CSV is lovely. It isn't trying to be cool or legendary. It works for the reasons the author proposes, but isn't trying to go further.
I work in a world of VERY low power devices and CSV sometimes is all you need for a good time.
If it doesn't need to be complicated, it shouldn't be. There are always times when I think to myself that CSV fits, and that is what makes it a legend. Are those the times when I want to parallelise or deal with gigs of data in one sitting? Nope. There are more complex formats for that. CSV has a place in my heart too.
Thanks for reminding me of the beauty of this legendary format... :)
Because if there is anything we love in data exchange formats its ambiguity.
CSV is the PHP of fileformats
I'll repeat what I say every time I talk about CSV: I have never encountered a customer who insisted on integrating via CSV who was capable of producing valid CSV. Anybody who can reliably produce valid CSV will send you something else if you ask for it.
> CSV is not a binary format, can be opened with any text editor and does not require any specialized program to be read. This means, by extension, that it can both be read and edited by humans directly, somehow.
This is why you should run screaming when someone says they have to integrate via CSV. It's because they want to do this.
Nobody is "pretending CSV is dead." It'll never die, because some people insist on sending hand-edited, unvalidated data files to your system and not checking for the outcome until mid-morning the next day when they notice that the text selling their product is garbled. Then they will frantically demand that you fix it in the middle of the day, and they will demand that your system be "smarter" about processing their syntactically invalid files.
Seriously. I've worked on systems that took CSV files. I inherited a system in which close to twenty "enhancement requests" had been accepted, implemented, and deployed to production that were requests to ignore and fix up different syntactical errors, because the engineer who owned it was naive enough to take the customer complaints at face value. For one customer, he wrote code that guessed at where to insert a quote to make an invalid line valid. (This turned out to be a popular request, so it was enabled for multiple customers.) For another customer, he added code that ignored quoting on newlines. Seriously, if we encountered a properly quoted newline, we were supposed to ignore the quoting, interpret it as the end of the line, and implicitly append however many commas were required to make the number of fields correct. Since he actually was using a CSV parsing library, he did all of this in code that would pre-process each line, parse the line using the library, look at the error message, attempt to fix up the line, GOTO 10. All of these steps were heavily branched based on the customer id.
The first thing I did when I inherited that work was make it clear to my boss how much time we were spending on CSV parsing bullshit because customers were sending us invalid files and acting like we were responsible, and he started looking at how much revenue we were making from different companies and sending them ultimatums. No surprise, the customers who insisted on sending CSVs were mostly small-time, and the ones who decided to end their contracts rather than get their shit together were the least lucrative of all.
> column-oriented data formats ... are not able to stream files row by row
I'll let this one speak for itself.
I love CSV