.NN #4: TextFieldParser Class
In the last .Net Nugget I talked about streams. One of the most common uses of streams is to read files in or write files out. I’ve written many applications that needed to read data from physical files, either as a data input mechanism or to read configuration. Reading configuration these days is very nicely handled with the System.Configuration namespace objects, but there are many times we need to read in a file and parse its contents.
It’s easy when someone sends us an XML file with a defined schema. We simply load the file up and start parsing. What if we get a data file in from a client or consumer that doesn’t use XML (gasp!)? The common types we see are delimited or fixed width text files that outline some sort of known data structure. If you’re lucky you can get a file specification that tells you what the layout of the file is and what each field is, which is analogous to a schema file.
In the past this seemed pretty easy to deal with by reading in each line of the file one at a time, then parsing the line against the known format. We would be careful to make sure we got what we expected, etc., but it was all pretty straightforward. For delimited files we could use the Split method to cut the line up into an array. This worked the majority of the time, unless the delimiter was also a valid character to have in the string (for example, a comma or tab in a notes field). Regular expressions could also be used to parse the line into the correct fields, which gave us a little more control over the cases where delimiter characters could appear as a valid part of a field. For fixed width files we would carve up the line into chunks of the specified sizes and assign them to our input fields. This method was easier because you knew exactly how long the line was, but it also constrained your input to data that fit within the size limits. Basically, what always sounded easy to deal with could sometimes become quite complex.
Visual Basic .Net provides us with a helper class to make reading of text files a little easier. The class is called Microsoft.VisualBasic.FileIO.TextFieldParser. This class has a whole slew of methods to deal with both delimited and fixed width fields.
Below is a sample data file that is comma delimited. The first field is an ID, followed by a name, followed by a notes field. Note that both the name and at least one of the notes fields have commas in their values.
1, "Wood, Mike", "Notes go here."
2, "Man Chu, Fu", "This, space, for, rent"
If we were going to do this manually we would first open a file stream, read in the first line, then parse it. Since this is a comma separated file the first thought would be to call Split on the line, but that would not yield the result we are looking for because it would also split on the commas inside the quotes. So, next we could try a regular expression to parse the line on commas that are outside of quotes. That would work, but can you rattle that regular expression off the top of your head? (Posting the regular expression as a comment doesn’t prove that you can pull this off the top of your head, only that you can figure it out, which isn’t the point.) Some of you may be able to, but I can’t. I’d have to spend some time digging in the recesses of my mind to pull the syntax out (and then I’d Google or pull out my trusty RegEx pocket reference).
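To make the Split problem concrete, here is a minimal sketch that runs a naive split over the second line of the sample data. The module name and the bracketed output format are just my own choices for illustration.

```vbnet
Imports System

Module SplitDemo
    Sub Main()
        ' Second line of the sample data file.
        Dim line As String = "2, ""Man Chu, Fu"", ""This, space, for, rent"""

        ' A naive split treats every comma as a delimiter, quoted or not.
        Dim parts As String() = line.Split(","c)

        ' The line has six commas, so we get 7 pieces instead of the 3 fields we want.
        Console.WriteLine(parts.Length)
        For Each part As String In parts
            Console.WriteLine("[" & part.Trim() & "]")
        Next
    End Sub
End Module
```

The name and the notes both get chopped apart, and the quote characters are left stuck to the fragments, so there is cleanup to do either way.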
With the TextFieldParser class the code looks like this:
1  Using MyReader As New Microsoft.VisualBasic.FileIO.TextFieldParser("Data.txt")
2      MyReader.TextFieldType = FileIO.FieldType.Delimited
3      MyReader.SetDelimiters(",")
4      Dim currentRow As String()
5      While Not MyReader.EndOfData
6          Try
7              currentRow = MyReader.ReadFields()
8              Dim currentField As String
9              For Each currentField In currentRow
10                 MsgBox(currentField)
11             Next
12         Catch ex As Microsoft.VisualBasic.FileIO.MalformedLineException
13             MsgBox("Line " & ex.Message & _
14                 " is not valid and will be skipped.")
15         End Try
16     End While
17 End Using
This code block was taken pretty much verbatim from an MSDN sample (I changed the name of the input file). If you ran this against the data above you’d see that each field is correctly displayed in the message box and the commas are handled correctly for you (the HasFieldsEnclosedInQuotes property controls this and is true by default). Note that the line doing the reading is line 7. The ReadFields method parses the current line into an array of fields. This is the way it works even if you have a fixed width file. You can then take that array of strings and map it into anything you need to (such as an object or database parameters to be persisted).
You may be thinking, “Yeah, that’s great and all, but what if the file format changes in the middle of the file?” That’s a fair question, because I’ve dealt with that before as well. The PeekChars method on the class returns the number of characters you ask for from the start of the next line without moving the cursor. This should allow you to figure out the structure of the next line before you attempt to read it in.
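Here is a rough sketch of how PeekChars might be used for a mixed-format file. The file layout is made up for the example: I’m assuming lines that start with "H|" are pipe delimited header lines and everything else is a comma delimited detail row. The data is fed in through a StringReader just so the sample stands on its own.

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic.FileIO

Module PeekDemo
    Sub Main()
        ' Hypothetical mixed-format data: one "H|" header line, then detail rows.
        Dim data As String = "H|Batch 42|2 rows" & Environment.NewLine &
            "1, ""Wood, Mike"", ""Notes go here.""" & Environment.NewLine &
            "2, ""Man Chu, Fu"", ""This, space, for, rent"""

        Using parser As New TextFieldParser(New StringReader(data))
            parser.TextFieldType = FieldType.Delimited

            While Not parser.EndOfData
                ' Look at the start of the next line without consuming it.
                Dim prefix As String = parser.PeekChars(2)

                ' Pick the delimiter that matches the kind of line coming up.
                If prefix = "H|" Then
                    parser.SetDelimiters("|")
                Else
                    parser.SetDelimiters(",")
                End If

                Dim fields As String() = parser.ReadFields()
                Console.WriteLine(String.Join(" / ", fields))
            End While
        End Using
    End Sub
End Module
```

The same idea should carry over to switching between delimited and fixed width layouts by changing TextFieldType inside the loop.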
The class also has some great extras that you end up needing when you are doing this all by hand. The LineNumber property tells you the number of the line you’re parsing. The ErrorLine property gives you the text of the last line that caused a MalformedLineException, which is also very handy, and ErrorLineNumber gives you its line number. The MalformedLineException is thrown any time the parser can’t parse a line using the current format. You can even supply comment tokens through the CommentTokens property so that the parser automatically skips any line that starts with one of them.
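A small sketch of those extras working together follows. The "#" comment token and the deliberately broken third line (text after a closing quote) are my own inventions to trigger the behavior; the rest uses the properties described above.

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic.FileIO

Module ErrorDemo
    Sub Main()
        ' Hypothetical input: a comment line, a good row, and a row with
        ' stray text after a closing quote.
        Dim data As String = "# id, name, notes" & Environment.NewLine &
            "1, ""Wood, Mike"", ""Notes go here.""" & Environment.NewLine &
            "2, ""Man Chu"" Fu, x"

        Using parser As New TextFieldParser(New StringReader(data))
            parser.TextFieldType = FieldType.Delimited
            parser.SetDelimiters(",")

            ' Lines starting with "#" are skipped by the parser.
            parser.CommentTokens = New String() {"#"}

            While Not parser.EndOfData
                Try
                    Dim fields As String() = parser.ReadFields()
                    Console.WriteLine("{0} fields: {1}", fields.Length, String.Join(" / ", fields))
                Catch ex As MalformedLineException
                    ' ErrorLine and ErrorLineNumber describe the line that just failed.
                    Console.WriteLine("Skipping line {0}: {1}", parser.ErrorLineNumber, parser.ErrorLine)
                End Try
            End While
        End Using
    End Sub
End Module
```

Catching the exception inside the loop lets the parser carry on with the next line, which is the same pattern the MSDN sample uses.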
I’ve not had a chance to use this class in a live production project so I can’t say it will solve all your non-XML parsing needs, but it is definitely worth a look. Even if you are a C# developer/shop you can always just add a reference to the Microsoft.VisualBasic assembly to get at it. I didn’t talk much here about the fixed width capabilities, but the docs give lots of examples.
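For a quick taste of the fixed width side, here is a minimal sketch. The layout is an assumption for the example: a 5 character ID, a 10 character name, and a notes field that runs to the end of the line (a width of -1 on the last field marks it as variable width).

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic.FileIO

Module FixedWidthDemo
    Sub Main()
        ' Hypothetical layout: columns 1-5 ID, columns 6-15 name, the rest notes.
        Dim data As String = "00001Mike Wood Notes go here." & Environment.NewLine &
            "00002Fu Man ChuThis, space, for, rent"

        Using parser As New TextFieldParser(New StringReader(data))
            parser.TextFieldType = FieldType.FixedWidth

            ' Widths per field; -1 lets the final field take whatever is left.
            parser.SetFieldWidths(5, 10, -1)

            While Not parser.EndOfData
                ' Same ReadFields call as the delimited case.
                Dim fields As String() = parser.ReadFields()
                Console.WriteLine(String.Join(" / ", fields))
            End While
        End Using
    End Sub
End Module
```

Since commas are not delimiters in this mode, the notes field can hold them without any quoting.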
Oh, and by the way, one of the constructor overloads for TextFieldParser takes a stream as the input. Parsing really LARGE files? Try using the TextFieldParser in conjunction with the System.IO.BufferedStream class to get some performance gains.
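A sketch of the stream-based constructor is below. It writes the sample data to a temp file first so it can run on its own, and the 64 KB buffer size is an arbitrary starting point rather than a recommendation. FileStream already does some buffering itself, so I’d measure before assuming the extra BufferedStream layer helps.

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic.FileIO

Module StreamDemo
    Sub Main()
        ' Write the sample data to a temp file so the sketch is self-contained.
        Dim tempFile As String = Path.GetTempFileName()
        File.WriteAllText(tempFile,
            "1, ""Wood, Mike"", ""Notes go here.""" & Environment.NewLine &
            "2, ""Man Chu, Fu"", ""This, space, for, rent""")

        ' Buffer size is an assumption to tune, not a known good value.
        Using source As New FileStream(tempFile, FileMode.Open, FileAccess.Read)
            Using buffered As New BufferedStream(source, 64 * 1024)
                ' The parser reads from the stream instead of a file path.
                Using parser As New TextFieldParser(buffered)
                    parser.TextFieldType = FieldType.Delimited
                    parser.SetDelimiters(",")

                    While Not parser.EndOfData
                        Dim fields As String() = parser.ReadFields()
                        Console.WriteLine(String.Join(" / ", fields))
                    End While
                End Using
            End Using
        End Using

        File.Delete(tempFile)
    End Sub
End Module
```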