SAS Tutorial: Loading Tab-Delimited Files
January 25, 2008 by: Blink 7
Imagine you are given a list of major Canadian political donators and wish to load it into SAS for further analysis. The list is in a text file with fields delimited by tab characters. SAS Enterprise Guide (EG) offers wizards to automatically import a variety of file formats, but using these GUI’s preempts the chance for automation or additional processing during import. Using code to load data offers several advantages:
- Ability to redirect data to multiple data sets
- Ability to clean and modify data during load time
- Ability to delete unwanted data
- Portability and reusability
Pre-Requisites
- SAS v9.x or SAS Enterprise Guide 4.x
- Basic knowledge of the SAS Data Step
- Basic knowledge of how to load and execute SAS programs
- Basic knowledge of numeric and character formatting
- Access to SAS’s read/write file space
Downloads
- Tab Project File (EG only, contains all programs)
- SAS Program: tab1
- SAS Program: tab2
- SAS Program: tab3
- SAS Program: tab4
- SAS Program: tab5
- Donor List (tab file)
Take 1: Basic Loader
Open the SAS program tab1.sas (those using SAS Enterprise Guide 4 can alternatively open the LoadTabDelimit project file and double-click on the tab1 code icon the SAS Project designer window).

As you can see in the screenshot above, SAS reads text files using the Data step. A minimum of two statements are needed to properly load a text file:
- INFILE – The infile statement specifies the text file to be loaded. The full path can be specified (as was done in this example) or a path relative to SAS’s default document directory can be used.
- INPUT – The input statement specifies the fields that should be read from the text file and how they should be read.
First let’s look at the INFILE statement. To make this example work you will probably have to change the path from ‘r:\namelist.txt’ to the location of the tab-delimited file. There are also two options included in the INFILE statement:
- DLM – The DLM option allows the user to set the delimiter between fields in a file. In this case we’ve specified ‘09’x as the delimiter, which is the hexagon representation of the tab character. Other common delimiters include the comma and the space. Delimiters should be specified in single quotes. If the DLM option is not included, SAS assumes the delimiter is a single space
- DSD – The DSD option instructs the DATA step to treat consecutive delimiters as an empty value for the current variable. Without this option, SAS would simply skip to the next item of field data and read it into the current data step variable, which could cause the file to misalign to the data variables and load incorrectly. DSD also removes surrounding quotes when reading data from text files.
Next, let’s look at the INPUT statement. Variable names are created to store each of the fields to be read sequentially into the data set. There are several variants of the INPUT syntax. The syntax in this example specifies exactly how each variable should be read. For now, the code is instructing SAS to read all data into text fields that are a maximum 16 characters long ($16.). There are many alternative ways to read these fields, some of which will be covered later.

Running the tab1 code should result in a data set identical to the screenshot above.
Take 2: Updating the Data Types
Tab1 loaded all of the fields as 16-character text, but that may not be appropriate for purely numeric fields. Firstly, suppose you want to filter out people who are younger than 18 (as we will do later). If the age is left in character format, you will filter out people aged 10-17 but not those aged 2-9, since the digits 2, 3, 4, …, 9 have a greater ASCII value than the number 1 in 18. You could remedy the issue by converting the text field to numeric before performing numeric comparisons, but it is just as easy to read the field directly into a numeric variable.
Secondly, someone may be tempted to list the party affiliation as “Conservative Party of Canada” rather than simply “Conservative”. The party’s long-form name is 25 characters long and will be truncated if read into a 16-character variable.

To solve these problems, change the age variable to read data as a 3-digit number (3.) and change the party variable to read data as a 32-character variable ($32.).

Run the tab2 code. The age column header should now contain a circular blue symbol, while the age column data is right-justified. This indicates that the column contains numeric data.
Take 3: Removing the Title Line
Incoming text files often contain the field names within the first line. This is unnecessary, as the variable names are defined within the INPUT statement of the DATA step. By default, the title line is read in as data, which is undesirable. There are several types of logic that can be used to remove title lines, but if you are certain that the title line will always be on the first line (i.e. no blank header lines) then the easiest approach is simply not to output the first line. This can be accomplished using _N_, SAS’s input record counter. When _N_=1, the data will not be output to the final data set.

A subsetting if statement was used to test the value of _N_. This statement takes the general form
IF
If the condition evaluates to false, the DATA step program flow returns to the top without executing any of the code beneath the subsetting if. In this case, the subsetting if statement will evaluate to true only if the input record is greater than one, meaning the implicit output statement will only be triggered if the record being read is NOT the first record in the input.

Run the tab3 code and ensure that the output does not contain the header names (see above).
Take 4: Removing bad Data
Eventually, a SAS programmer will encounter two types of undesirable data:
- Data that can be salvaged or at least needs to be flagged
- Data that is absolutely useless and can be discarded
Generally speaking, the latter category should be minimized to near-zero, as there are few instances where junking data without further analysis won’t affect the final product. But let’s suppose, for this exercise, that the donation information is simply not usable unless it contains both a first and last name. The tab3 output contains a Montreal resident with the last name “Smith” but no accompanying first name. We can get rid of this record and any record like it by combining conditional logic with the commands LENGTHN and DELETE.

LENGTHN returns the number of non-blank characters in a string. If the LENGTHN comparison evaluates to zero we can be assured that the field being evaluated is either null or consists of blank characters. The DELETE command prevents the current record from being written to the output data set.

Run the tab4 code and ensure that the output record contains seven records. None of records should contain blank first or last names (see above)
Take 5: Filtering Data
Finally, we must devise a way to deal with data that is not quite usable but can possibly be corrected. Looking at our data, it is unlikely that a minor would be a heavy financial contributor for any particular party, while it is equally unlikely for someone to surpass 120 years old. For now it may be best to separate such records from the other data for further evaluation. Luckily, SAS makes it easy by allowing processed data to be routed to multiple output data sets. The following modifications need to be made to the tab4 code
- The opening data statement needs to specify multiple data sets; one for clean data and one for data that needs to be investigated
- The OUTPUT command must be explicitly defined. Processed data needs to be directed to one of the data sets based on some condition that will determine whether the data requires further investigation

See the screenshot above. At the top of the code, the DATA statement now defines two data sets:
- tab_v5 will store records that do not require modification
- tab_v5_reject will store records the require further investigation
Now look at the bottom of the code. There are two explicit OUTPUT commands, each followed by name of the data set. The data will be written to tab_v5 if the age of the person in the record is between 18 and 120. Otherwise, the data will be written to tab_v5_reject. Execution of either OUTPUT statement assumes that the record has not already been deleted via the logic added in tab4.


Run the tab5 code. Two data sets should be produced – one containing 6 records with ages within the range 18-120 and the other containing a single record with an age outside this range.
There are computers software and demos for statistical analysis. For SAS software information and tutorials, get the help book online. The microsoft office software includes statistical analysis in excel but this software does wonders. When shopping for software, pick up antivirus firewall software. The antivirus that has all the latest virus web definitions is norton antivirus. In today’s phone development, it’s a must to invest in an internet phone handset.








