“Microsoft has incompatible versions of CSV files between its own applications, and in some cases between different versions of the same application (Excel being the obvious example here).”
We have all been there: you are assigned an exciting new data science project, you talk to your enthusiastic clients, generate grand ideas and visions, can’t wait to sit down and write the first lines of code and then it happens. The CSV file you got appears to be somehow broken and despite your best efforts you cannot seem to get the data loaded into a dataframe. This goes on for minutes then hours and finally the day is over and you go home, data unparsed, spirit lost.
Why does this keep happening and what can we do about it? In this post I will rant about the horrors of CSV files and why we should avoid them whenever we can. The hope is that next time someone answers your request to provide data in a proper file format with a “But why don’t we just use CSV?”, this blog post will help you change their mind.
So, what is wrong with CSV files? To answer this question I want to begin with the above quote by Eric S. Raymond, which is taken from his book The Art of Unix Programming. It beautifully demonstrates that the CSV file format is not a standard 1. Because it is not a standard, it is naturally underspecified leading to an infinite number of dialects and even though it is called “comma-separated values”, most of the files I get don’t even use commas as separators.
This underspecification leads to the first problem, which is:
CSV files mean a lot of work
At least for the receiving end that is. The people creating CSV files don’t break a sweat over the following problems.
That CSV files create a lot of work becomes very apparent when comparing the pandas API for reading a CSV file and the one for reading a Parquet file. As of version 0.25.1 of pandas you will either look at this
pandas.read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
or at this
pandas.read_parquet(path, engine='auto', columns=None, **kwargs)
Because the Parquet file format is fully specified, you provide
read_parquet with a path and it will readily load the file so you can get on with your life while
read_csv requires you to specify all the intricate ways in which the CSV file you received looks different from the pandas default. Some examples of the things you have to specify are
- the encoding of the file (people never know which encoding the CSV file they gave you has - unfortunately, it is not always utf8 - of course you can always just try out some encodings or simply use latin1 and break all special characters in the data)
- the character which has been used as the delimiter
- whether there is a header row (yes, sometimes you get a CSV file without a header leaving you without the names of the columns)
- the data types of the columns (int, float, date, …)
- the format of dates and timestamps (these might differ from column to column)
- number of non-data lines at the top and / or the bottom of the file (because why not, after all it is just a text file - so much freedom without restricting standards)
- number formats for numerical columns (dots and commas for thousands and decimal separators differ depending on your locale - “Of course you can’t use this file. You live in the wrong country!”)
The way you get this information is typically to open the file in a text editor and reverse engineer the format in a trial and error loop. Doing this for a 50 column CSV file can easily take an hour. Compare this to the mere seconds it takes you to write
df = pd.read_parquet(<your file path>).
This isn’t even necessarily the end of it. Sometimes your data is formatted in a way that the pandas CSV parser does not support. Then you have to parse the offending column with the data type
string and do some regex magic or similar processing to get it into the proper form for its data type (I encountered this situation with a date column).
The reason for all of this pain is twofold:
- The CSV file format is underspecified so the specification needs to be provided at parsing time. Expect every program that writes CSV files to have its own dialect of the format.
- CSV files don’t contain the schema of their data, such that you have to provide that at parsing time, too.
In many cases, the data that ends up in a CSV file is extracted from a strongly typed system like a relational database and is then loaded into another strongly typed system like a pandas dataframe. The thought that we throw away the existing knowledge about the data’s schema when serializing it only to later on reverse engineer the schema and put it back onto the data is truly terrible.
Alright, now you know what to do if you have too much time on your hands. Are you ready for more pain? Here you go!
CSV files are not safe
Whenever you receive a CSV file, there is a fair chance that the file is broken. Yes, broken - broken in the sense that you cannot parse the file even if you provide all the right format specifications and the schema.
How does this happen? Well, since CSV is underspecified, there are programs out there that write CSV files without taking all of the precautions necessary to produce a healthy, parsable file. Here are some things that can happen:
- The delimiter is contained in a field. If the string field is not properly embedded into quotation marks or if the parser doesn’t care about the quotation marks, it will think that there is one too many fields in the line. This can also happen for numerical data depending on your locale.
- The newline character is contained in a field. Same as above, only this time the parser will think that a new record begins at the newline and will find two records (the two parts of the broken line), both with too few fields. One time, I even found a partial newline in a field (only the carriage return part - cf. Wikipedia). I was quite surprised, but if you have millions of records with free-text fields, you should expect just about every possible character in your data.
- An unescaped quotation mark is in a quoted string field. The parser will now regard the rest of the file after the unescaped quotation mark to be a single string field, searching for the closing quote.
- A column contains data of multiple different data types, like integers and strings. This can happen if the data comes from a system that doesn’t enforce fixed data types per column (yes, I am looking at you, Excel).
Of course, since CSV is a text format you can easily open the file in your favorite editor and fix the above problems, right? The problem with this is that first, it is not always obvious where in the file the problem is and second, depending of the file size, you can end up manually fixing hundreds or even thousands of instances of the same problem. Of course, every time you fixed an instance of a problem, you have to rerun the parser to find the next problem. A fun activity for an otherwise idle afternoon.
Even if you gave your data provider all the rules they need to follow to create safe CSV files, they probably wouldn’t be able to act on it since there is a good chance that they are stuck with a proprietary, broken implementation of a CSV writer like the one that comes with the database in which the data resides.
CSV files are so regularly broken that people have written elaborate services to check your files and help you make them safe, like https://csvlint.io (I shouldn’t need to mention that you should of course NOT upload your confidential data to the site).
CSV files are expensive
As we have seen above, CSV files waste your valuable working time. On top of this, they negatively affect your cloud bill because
- CSV files are big compared to natively compressed file types like Parquet, which increases your storage demands.
- Reading data from a CSV file is slow compared to other file formats, which is due to the high computational effort of parsing data from formatted text as well as the fact that CSV is a row-based file format, forcing you to parse the whole file even if you are only interested in a couple of its columns.
You can find some exemplary numbers for both of these points on DZone.
What about frictionless data?
Frictionless data tries to make working with data, well, frictionless. They do this by writing specifications and software. One way they want to make working with data a better experience is by fixing some of the shortcomings of the CSV file format. Notably, they provide software to validate CSV files and a specification for a file format that can store metadata for CSV files (data schema, CSV dialect specification, etc.). The combination of a CSV file and its metadata file is called a data package.
I haven’t used their solutions, but as far as I understand, the process goes something like this:
- I export the data I want to share as a CSV file and hope that the program I use for the export has a safe implementation of CSV. Since I exported my data into a CSV file, I just naturally lost the data’s schema.
- I then run a data validator to test if my CSV file is safe. If this is not the case, I am probably out of luck unless I can find a different, safe way to write my data to a CSV file.
- Then I use the frictionless data software to infer the schema from the CSV file I produced. Sometimes,the inferred schema will be correct and sometimes it won’t, in which case I have to manually adjust the schema file.
- Then I probably pack up the CSV file and its metadata file into a zip file and send it on its way.
- Finally, I wonder why I didn’t just do
df.to_parquet(<your file path>)instead of the previous 4 steps.
I did find software to read data packages directly into pandas dataframes, but I didn’t find software to write data packages directly from pandas dataframes, which is why I imagine the process to go as outlined above.
The team at frictionless data has put a lot of work into adding compatibility with their data package format to many programming languages, databases and applications and if for some reason you really want a text format to transport your data, frictionless data’s data packages give you a vastly better experience than only using CSV files. Ultimately though, I would personally just skip trying to fix the mess that is CSV and simply use a proper file format.
Every seasoned data scientist has their own horror stories to tell about their work with CSV files. We already have to fight the problem that data is always dirty. Nobody needs the underspecified, unsafe approach to data serialization that goes by the name of CSV to make working with data worse. Especially, when this is completely unnecessary as is the case here. There is nothing special about the content of CSV files. All of the data could just as well be serialized in a standardized and safe format like Parquet.
Of course, not every CSV file is broken and not every CSV file requires you to reverse engineer its specification. Even for CSV files, sometimes everything just works. But you should be aware that choosing to work with CSV files in a project is like playing Russian roulette with your working time and data integrity.
So, what is your experience working with CSV files? Is CSV your best friend and has never let you down or have you gone through blood, sweat and tears trying to get that damned file to finally parse? I’d love to hear your stories! Also, if you liked this post, you should subscribe to my mailing list and I will notify you when the next one is ready.