SAS Tutorial: Creating Categories with PROC FORMAT

February 28, 2008 by: Blink 7

When performing data analysis on a cohort population, it is often desirable to categorize characteristics that can take many values. Age, income level and credit card score are examples of population attributes that can be placed into “buckets” and easily analyzed based on broad categories (without having to resort to linear or logistic regression). While variables can be quickly encoded using IF statements, SAS’s FORMAT procedure provides a more elegant and portable solution.

Pre-Requisites

  • SAS v9.x or SAS Enterprise Guide 4.x
  • Basic knowledge of the SAS Data Step
  • Basic knowledge of how to load and execute SAS programs
  • Basic knowledge of how to load and save SAS data sets
  • Access to SAS’s read/write file space

The Scenario

tutorial004-01.jpg

The provided data set contains basic information about a random group of people. It has three columns: the first name, last name and age of each person. Suppose that some performance analysis will be carried out on this population; categorizing by age might be useful for discerning trends.

Take 1 – Categorization with the IF statement

Suppose you want to categorize the age by 10’s: 10-19, 20-29, 30-39 and 40+. The input data can be loaded into the SAS DATA step and extended by using the IF statement to create a categorized variable. Specifically, we can do the following:

  • Create a new data set, using the provided data as the source
  • Create a category variable
  • Categorize the age, placing the results in the category variable

Open the SAS program cat1.sas (those using SAS Enterprise Guide 4 can alternatively open the Categorization project file and double-click on the cat1 code icon in the SAS Project designer window).

tutorial004-02.jpg

As you can see in the screenshot above, the code creates a new data set called catv1 in the WORK library.

  • The default input is set to the nameage data set provided by this tutorial
  • A 16-character variable called age_category is created to hold the literal categorization of the age variable
  • An IF/ELSE IF conditional structure is used to assign the age bucket for the participant in each record

After the DATA step, a PROC PRINT is issued to display the contents of the catv1 data set.
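Because the cat1 code appears only as a screenshot, here is a minimal sketch of what such a program might look like. It is not the tutorial's actual file. The inline nameage rows and the variable names first_name and last_name are assumptions made for illustration; catv1, age and age_category come from the description above.

```sas
/* Hypothetical stand-in for the tutorial's nameage data set.
   The names and ages below are made up for illustration. */
data work.nameage;
    length first_name $ 12 last_name $ 12;
    input first_name $ last_name $ age;
    datalines;
Ann Lee 14
Bob Ray 27
Cid Fox 35
Dee Kim 52
;
run;

/* Build catv1 and bucket each age with an IF/ELSE IF chain */
data work.catv1;
    set work.nameage;
    length age_category $ 16;   /* 16-character category variable */
    if 10 <= age <= 19 then age_category = '10-19';
    else if 20 <= age <= 29 then age_category = '20-29';
    else if 30 <= age <= 39 then age_category = '30-39';
    else if age >= 40 then age_category = '40+';
run;

/* Display the resulting data set */
proc print data=work.catv1;
run;
```

In this sketch, ages below 10 and missing ages match none of the conditions, so age_category stays blank for those records.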

tutorial004-03.jpg

Running the cat1 code should produce the output shown above.

Take 2 – Categorization using Custom Formats

Using IF/ELSE IF statements is easy and probably sufficient if you only have to categorize one variable and you know that you will never reuse the same categorization in another piece of code or apply it to different variables. If you cannot guarantee these conditions, you will be stuck in the copy/paste coding trap: any change to the categorization has to be manually repeated for each variable in each piece of code, a time-consuming and error-prone process.

Luckily, SAS provides the FORMAT procedure, which can be used to create centralized categorizations for character or numeric input. Advantages of using the FORMAT procedure include the following:

  1. Formats can be applied to any number of variables in multiple SAS programs
  2. Changes to formats will be applied to each target variable the next time its assignment code is executed
  3. Formats can be embedded in program code or created in self-contained programs
  4. Formats can be stored in temporary or permanent SAS libraries

With that in mind, let’s rework this code by replacing the conditional structure with a FORMAT procedure. Open the SAS program cat2.sas to view the code.

tutorial004-04.jpg

The FORMAT procedure begins with a PROC FORMAT statement and ends with a RUN statement (case insensitive). Within this structure the VALUE statement can be used to define a format. Multiple VALUE statements can be issued within a FORMAT procedure.

In this case we defined a format called agecat. This format converts numeric input values into text-based categories. (Note: to define a format for character input values, place a $ in front of the format name, e.g. $agecat.)

Single input values or ranges of input values are assigned formatted values with simple assignments of the form range = 'label'. In this case, ranges of input values (e.g. 10-19) are each assigned a single text-based formatted value.
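As a rough sketch of the structure just described, the block below defines the agecat format with the age bands from Take 1, plus a second VALUE statement for a character format. The exact ranges and labels in cat2.sas are assumed from the text rather than copied from the file, and $yesno is a hypothetical format added only to show the $ prefix.

```sas
proc format;
    /* Numeric format: no $ prefix; each range maps to a text label */
    value agecat
        10 - 19   = '10-19'
        20 - 29   = '20-29'
        30 - 39   = '30-39'
        40 - high = '40+';

    /* Hypothetical character format: the name starts with $ */
    value $yesno
        'Y' = 'Yes'
        'N' = 'No';
run;
```

With no LIBRARY= option, both formats are stored in the temporary WORK library and last for the current SAS session. Values that match no range, such as an age of 5, are not relabelled by agecat.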

The keyword high signifies the largest possible value. It is essentially a catch-all for values at or above the start of the range. As applied to this user format, any input value of 40 or above will be assigned the formatted value ‘40+’. Similarly, the keyword low represents the lowest possible data value and can be used in a bottom-end catch-all category.
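To illustrate the low keyword, here is a small sketch of a variant format with a bottom-end catch-all. The format name agecat_low and the 'Under 10' label are hypothetical and not part of the tutorial files. The DATA _NULL_ step simply writes a few test values and their labels to the SAS log.

```sas
proc format;
    /* Hypothetical variant of agecat with catch-alls at both ends */
    value agecat_low
        low -< 10 = 'Under 10'   /* -< excludes the end value 10 */
        10 - 19   = '10-19'
        20 - 29   = '20-29'
        30 - 39   = '30-39'
        40 - high = '40+';
run;

/* Try the format on a few sample ages; results go to the log */
data _null_;
    do age = 4, 15, 40, 63;
        age_category = put(age, agecat_low.);
        put age= age_category=;
    end;
run;
```

Note that in a numeric format the low range does not include missing values, so a missing age would still need its own entry if it should receive a label.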

Our user-defined format can now be applied to any numeric variable via the PUT function. As a result, the DATA step no longer requires complicated conditional statements and can assign categorized values to age_category in a single line.

Just to demonstrate the flexibility of user-defined formats, a new variable called age_in_5 was also created. This variable shows what age category each person will belong to 5 years from now. Note that the assignment statement is identical to age_category except that 5 years has been added to the age before formatting.

After the DATA step, a PROC PRINT is issued to display the contents of the catv2 data set.
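Putting the pieces together, a program along the lines of cat2 might look like the sketch below. It is reconstructed from the description, not taken from the tutorial's file. The inline nameage rows, the variable names first_name and last_name, and the 16-character length for both category variables are assumptions.

```sas
/* Define the user format once */
proc format;
    value agecat
        10 - 19   = '10-19'
        20 - 29   = '20-29'
        30 - 39   = '30-39'
        40 - high = '40+';
run;

/* Hypothetical stand-in for the tutorial's nameage data set */
data work.nameage;
    length first_name $ 12 last_name $ 12;
    input first_name $ last_name $ age;
    datalines;
Ann Lee 14
Bob Ray 27
Cid Fox 35
Dee Kim 52
;
run;

/* The PUT function replaces the whole IF/ELSE IF chain */
data work.catv2;
    set work.nameage;
    length age_category age_in_5 $ 16;
    age_category = put(age, agecat.);       /* category today */
    age_in_5     = put(age + 5, agecat.);   /* category in 5 years */
run;

proc print data=work.catv2;
run;
```

If the age bands ever change, only the VALUE statement needs editing; both assignments pick up the new definition the next time the program runs.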

tutorial004-05.jpg

Running the cat2 code should produce the output shown above.

Filed under: Development, Tutorials

Comments

4 Responses to “SAS Tutorial: Creating Categories with PROC FORMAT”
  1. beginer says:

    THANK YOU, really good and helpfull

  2. Tobias says:

    Great! Just what I was looking for!
