Stupid CSVs...ideas anyone?

SkyHog

Touchdown! Greaser!
Joined
Feb 23, 2005
Messages
18,431
Location
Castle Rock, CO
Display Name

Display name:
Everything Offends Me
I am parsing through a CSV that I have no control over (can't change the format, period).

In this CSV, I pull data out of certain fields. Unfortunately, every once in a while, one of those fields has a comma in it, and it jacks the whole thing up. I can't change the format of it to ensure that each value is in quotes, nor change it to a tab delimited or "ASSAASDSADS" delimited file.

Ideas on how I can programmatically detect that a field contains commas and treat them all the same? An example is below:

08/06/2009,12:04:41 PM,PRINT,OBSCURRED,OBSCURRED,,Microsoft Word - Verizonlist809.doc,4,0,4,,1,,Lexmark X646e,791M8F4,LETTER,LETTER,TRUE,OBSCURRED,OBSCURRED,OBSCURRED,Lexmark X646e,
08/06/2009,12:38:04 PM,PRINT,OBSCURRED,OBSCURRED,,Microsoft Word - Roster 2009, 07.doc,15,0,15,,1,,Lexmark X646e,791M8F4,LETTER,LETTER,FALSE,OBSCURRED,OBSCURRED,OBSCURRED,Lexmark X646e,
 
Well, if there's only one field that could contain the comma, start from each end, pulling off those fields that cannot contain a comma, and whatever's left, including any commas must be the value of that remaining field. If there's more than one, you're hosed unless there's something unique about the fields that will allow you to identify them.
 
Well, if there's only one field that could contain the comma, start from each end, pulling off those fields that cannot contain a comma, and whatever's left, including any commas must be the value of that remaining field. If there's more than one, you're hosed unless there's something unique about the fields that will allow you to identify them.

The problem is that I can't really find any identifiable pattern to help identify it....the next field has to be a number, which helps somewhat, except in the example I posted, there's only one additional field, but there are some instances where the file name is something like "Information About Smith, John, Employee of XYC Corp, 2008.doc" which means there's like 5 or 6 extra fields.

Better then - anyone know of a simple way to make Lexmark Printers encapsulate the csv fields in quotes or possible to change the delimiter? I can't seem to find a way to do it.
 
You'd think lexmark would know better than that.

I like Grant's solution, it's elegant. start counting commas from the left and right, and that filename field is what's leftover.

I don't think there's a good way to scan the record in only one direction and get what you're after -- short of writing a ton of logic to anticipate a potentially borked field, and then tiptoeing over it. Seems like more work than it's worth. :)
 
Is the comma always in the same field? in that case you could always count the commas / line. if they are equal to the expected number of commas use the split method. If it isn't use Gran'ts method.
 
Nick--I need to see better data. Either not obscured or less obscured. You'll have to obscure it in a way that doesn't ruin the data. Once you've done that point out the parts of the data you want to capture.

Sometimes when you get lame data like this you have to start to think less about the field delimiters and more about the data you want out of each line. A few regexps in the right places can often do the job.

You can either put in here or PM me.
 
Last edited:
You know what I don't understand? How come software that generates these CSV files doesn't fail with an error when one of the fields contains a comma?
-harry
 
It appears to be only the source - filename field that has the potential error.

Both towards the middle seems liek the simplest plan of attack.
 
Thanks everyone. I kind of used Grant's idea, and it seems to have worked. I took the total number of expected fields vs the total number of given fields, and if it was after column 5 I added the difference to the field.

Stupid Lexmark.
 
Back
Top