This is an old post from September 2012 that lost its images during a site upgrade. http://bidn.com/Blogs/userid/470/address-parse-transform-ndash-task-factory-vs-ssis
Pragmatic Works Task Factory is a collection of twenty-six high-performance SSIS components, one of which is the Address Parse Transform. This component parses unformatted address data into USPS standardized address data. It can also transform addresses from other countries, and it is extremely fast and efficient, as I will demonstrate in this blog entry. It is so easy to use, even a caveman could do it. No offense to any cavemen.
Cozy Roc also has a collection of SSIS components, called SSIS+, one of which also does address parsing. It has an odd limitation: your address has to be stored in a single column. Nonetheless, I thought it would be interesting to test the two and compare the differences.
The US data set contains 350,000 records and comes from the following website: http://www.briandunning.com/sample-data/. I am going to load this data set twice: once the way it comes, and once again using a slightly modified version where I have concatenated the address, city, state, and zip columns into one column. This will let me compare the two components side by side.
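The concatenation itself is trivial. Here is a minimal Python sketch of the idea, with the column names (address, city, state, zip) assumed to match the sample file and a couple of inline rows standing in for the real CSV:

```python
import csv
import io

# Two rows in the shape of the sample file (column names are assumed).
raw = io.StringIO(
    '"first_name","address","city","state","zip"\n'
    '"James","120 Jefferson St","Riverside","NJ","08075"\n'
    '"Josephine","9374 Allison Ave","Chattanooga","TN","37421"\n'
)

reader = csv.DictReader(raw)
rows = []
for row in reader:
    # Collapse the four address columns into a single full_address column.
    row["full_address"] = " ".join(
        row.pop(col) for col in ("address", "city", "state", "zip")
    )
    rows.append(row)

print(rows[0]["full_address"])  # 120 Jefferson St Riverside NJ 08075
```

In practice I did the equivalent in the source itself rather than in a script, but the transformation is the same either way.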
Before we begin, it is important to define exactly what it means for an address to conform to United States Postal Service address standards. The following excerpts were taken from Postal Addressing Standards (Publication 28) and provide a sample of their requirements.
Jointly developed by the Postal Service and mailing industry, standardized address information enhances the processing and delivery of mail, reduces undeliverable-as-addressed mail, and provides mutual cost reduction opportunities through improved efficiency.
A standardized address is one that is fully spelled out, abbreviated by using the Postal Service standard abbreviations (shown in this publication or as shown in the current Postal Service ZIP+4 file), and uses the proper format for the address style (shown in this publication).
Format all lines of the address with a uniform left margin. Uppercase letters are preferred on all lines of the address block. Lowercase letters in various type styles are acceptable provided they meet postal guidelines for OCR readability.
Secondary address unit designators, such as APARTMENT or SUITE, are preferred to be printed on the mail piece for address locations containing secondary unit designators. The preferred location is at the end of the Delivery Address Line. The pound sign (#) should not be used as a secondary unit designator if the correct designation, such as APT or STE, is known or is shown in the ZIP+4 file. If the pound sign (#) is used, there must be a space between the pound sign and the secondary number.
Spell city names in their entirety. When abbreviations must be used due to labeling constraints, use only the approved 13-character abbreviations provided in the City State file.
Format the Last Line with at least one space between the city name, two-character state abbreviation, and ZIP+4 Code.
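To make those rules concrete, here is a small Python sketch of my own (not part of Publication 28) applying two of them: uppercasing with standard suffix abbreviations on the Delivery Address Line, and single spaces between city, state, and ZIP+4 on the Last Line. The abbreviation table is a tiny assumed subset of the full USPS list:

```python
# Illustrative only: a tiny assumed subset of the USPS standard abbreviations.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD", "SUITE": "STE"}

def standardize_delivery_line(line: str) -> str:
    """Uppercase the Delivery Address Line and apply standard abbreviations."""
    return " ".join(SUFFIXES.get(w, w) for w in line.upper().split())

def format_last_line(city: str, state: str, zip4: str) -> str:
    """Last Line: at least one space between city, state abbreviation, and ZIP+4."""
    return f"{city.upper()} {state.upper()} {zip4}"

print(standardize_delivery_line("3202 W Charleston Boulevard"))  # 3202 W CHARLESTON BLVD
print(format_last_line("Las Vegas", "nv", "89102-1234"))         # LAS VEGAS NV 89102-1234
```

This is exactly the kind of normalization the components under test do for us, with a far more complete rule set behind them.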
Test One – Transform 350,000 US Addresses into USPS Standardized Address Format
First, I created a new solution in BIDS and created a new package. I added a data flow task and renamed it to Test One. I dragged the flat file source to my data flow canvas and linked it to the CSV I downloaded containing 350K rows of sample US addresses. I set double quotes as my text qualifier and checked the box indicating that the first row contains the column names.
Now comes the fun part! I dragged the Address Parse Transform component onto my data flow designer and linked the flat file source to the component. Double-clicking the component shows the setup screen below. For this test, I am going to use the second option: Address data is spread across multiple columns. The Cozy Roc SSIS+ implementation doesn't seem to allow you to do this. I mapped the columns to each field as shown below.
Next I clicked the Address Quality tab and checked the box to include only high quality parsed addresses. I decided to redirect errors to the non-parsable output, so I selected that option from the drop-down. Finally, I checked the box to replace 'Null' values with an empty string. According to the dialog, data from CSV or text files sometimes contains the literal text 'Null' where an actual NULL value used to be. That can't be good for anyone. I can imagine a letter carrier looking for someone named Null or a city named Null.
Now, I clicked the Parsed Output tab. I decided to replace the original columns where applicable, and I changed the Output Case Formatting. I selected Upper Case from the drop-down since this is what the USPS prefers. The other options are no formatting and proper case.
When I clicked OK, I got an error message!
I don't have a column for AddressLine2, and it doesn't look like I can ignore this field, so I went back to the Address Data Input tab and mapped Address to it as well. I'll just ignore the AddressLine2 column downstream.
This is a screenshot showing what my data flow looked like once I was finished adding my destination components for the parsed addresses and for the bad addresses.
Finally, for a proper test we need to record when the data flow starts and when it finishes. Sure, I could have manually modified the package to log that info to a table, but I have BIxPress, so I did it the easy way!
The first run failed.
The “output column “FirstName” (23)” failed because truncation occurred, and the truncation row disposition on “output column “FirstName” (23)” specifies failure on truncation.
Modifying my destination tables and explicitly setting the length of the string data types to 250, using the Advanced Editor for the source component on the Input and Output Properties tab, should fix that.
I still received errors, so I decided to redirect error rows from the source to an error table rather than fight them, and that's when I got a nice surprise from BIxPress!
Very cool! Of course I clicked Yes!
Next time it ran green. Let’s check the output in BIDS.
Check out what I get in BIxPress!
Not only does it tell me the row counts, but it also tells me the buffers used! Not to mention how long it took, which was almost five and a half minutes. Why can't BIDS tell you at least that?
Tests are being performed on an IBM Lenovo X201 Tablet running Windows 7 with an Intel i7 quad-core, 6 GB of RAM, and an SSD.
Looking at the good addresses, I see that it made everything uppercase like I wanted and used the USPS approved and recommended abbreviations. There are two new columns that the component made available to me: Quality and CSZQuality.
I was curious to see why one of the addresses was given a quality rating of medium so I went back to the source. That record had the following stored in the address field: 3202 W Charleston Blvd Bldg. So the ‘Bldg’ string was removed from the field and quality was set to medium. Neat!
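I can only guess at the component's internal logic, but the behavior observed above can be sketched in a few lines of Python: if the address ends in a dangling secondary-unit designator with no number after it, drop the designator and downgrade the quality. The designator list here is an assumed subset and this is my own hypothetical reimplementation, not the component's actual algorithm:

```python
# Hypothetical sketch of the observed behavior, not the component's real logic.
# The designator list is an assumed subset of the USPS secondary unit designators.
UNIT_DESIGNATORS = {"APT", "STE", "BLDG", "UNIT", "FL"}

def parse_with_quality(address: str):
    """Return (cleaned_address, quality). A unit designator dangling at the
    end of the address with no number following it is dropped, and the
    quality is downgraded from High to Medium."""
    words = address.upper().split()
    if words and words[-1] in UNIT_DESIGNATORS:
        return " ".join(words[:-1]), "Medium"
    return " ".join(words), "High"

print(parse_with_quality("3202 W Charleston Blvd Bldg"))
# ('3202 W CHARLESTON BLVD', 'Medium')
```

The real component obviously does far more than this, but it matches what I saw in the output for that record.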
Looking at examples where CSZQuality was medium, I discovered entries where the zip code was missing but all of the other elements were present, which would still make the address deliverable. Records with both columns set to medium were missing addresses and zip codes. Those should be discarded or investigated.
Looking at the records in the bad-addresses table, it looks like most of them have incomplete information in the address field.
Test Two – Transform 350,000 US Addresses into USPS Standardized Address Format From a One Column Source
This time I modified the source to concatenate all of the address columns into one column. Let’s see how Address Parse Transform does now!
Wow! BIxPress totally blows me away again!
Of course I do!
The final destination component now shows an error. I’ve got to correct the mappings.
I deleted the mappings to address, city, state, and zip. Then I mapped AddressLine1 to address, TF Address Parse Transform City to city, TF Address Parse Transform State to state, PostalCode to zip, Country to Country, Quality to Quality, and CSZQuality to CSZQuality. AddressLine2 can just remain unmapped.
It took just under six minutes. A couple thousand more rows went to the bad-address pipeline, as would be expected under the circumstances.
Next, I want to test the same functionality using Cozy Roc's SSIS+ Address Parse component. I was unable to test this component in the same fashion as test one of the Task Factory component, because this one only parses addresses that are contained in a single column.
The following screenshots document how I set the component up for the test.
First, the result in BIDS:
So it took about 30 seconds longer using Cozy Roc's component, and there were ~30K bad addresses compared to Pragmatic Works' ~15K.
Let's look at some of the good addresses. The first thing I notice is that the addresses are not uppercase, as the USPS prefers. Second, on the third line it chopped the NW off of the street address and put it in front of the city.
Now, let's look at those same addresses in the good-address table from Task Factory test two. It also seems to have a hard time with the cities, but the mail should still be deliverable. Notice the city RAPIDS, which should be GRAND RAPIDS, and PARK PITTSFORD, which should be just PITTSFORD. Nonetheless, I would still consider the mail deliverable, especially if we can assume the zip code is correct.
Nutshell: The Pragmatic Works Task Factory Address Parse component is slightly faster and produced half as many bad addresses as Cozy Roc's SSIS+ Address Parse component. The Task Factory component can also accept addresses that are already split into separate columns, which the SSIS+ component cannot. The Address Parse Transform component from Pragmatic Works should pay for itself in the long run by reducing the cost of returned mail.