Regular expressions (RegEx) is a really powerful tool when it comes to text manipulation. It is simple (!= easy), fully featured, extremely versatile, fast and is implemented in every possible platform imaginable like programming languages , unix shell programs, text editors etc. Recently I figured out an elegant way of cleaning csv files where there are unescaped quotes present in the data.
To cover the basics, a character separated values (csv) is a format to store tabular data in a text file where each data point (rows) is separated by new line character and each field (column) is separated by a delimiter – usually a comma. This works well until we encounter data with the comma present in them. for example if the name has to be formatted last name, first name we run into problems. To solve this we use a quote character which encloses every textual data. This works well with most of the data but there are instances where we encounter the quote character within the data. This is where things get messy. Though this can be solved by escaping the quote character with a backslash (\), it is not always possible to introduce escape characters when you are collecting textual data using less structured methods eg. User input via forms, sensors, etc. I recently collected data where unescaped quotes were present within the data and there was no way of cleaning them at source.
Let us consider a set of data where we have encrypted messages posted by users on a forum. The example csv look like this,
Since the data has already been collected, the problem here is to identify and escape only the problematic quotes (red ones) before reading the data as a csv. The particular technique in regex we use is called variable length negative lookahead and look-behind. This is implemented in vim as “@!” and “@<!” commands. The entire command for doing the cleaning is,
Detailed explanation of the command is,
: - start the vim command % - search the whole file s - search and replace / - start of search pattern \v - use simple regex syntax rather than vim style ( " - double quote ( ," - comma_quote | - or \n - new line ) - @! - not followed by ) & - and ( ( ", - quote_comma | - or ^ - beginning of line ) @<!- does not precedes " - double quote ) / - end of search pattern and start of replace pattern \" - replace with escaped quote / - end of replace pattern g - replace all the matches c - confirm with user for every replacement
Only case where this doesn’t work is when there is a quote-comma-quote (“,”) pattern inside the data. For now, I think that this cannot be fixed by regex and needs a different approach like counting number of “,” in each line and fixing lines where the number is greater than expected but would be really happy to be proved wrong.