SAS Tutorial: Creating Categories with PROC FORMAT
February 28, 2008 by: Blink 7
When performing data analysis on a cohort population, it is often desirable to categorize characteristics that can have many values. Age, income level and credit score are examples of population attributes that can be placed into “buckets” and easily analyzed based on broad categories (without having to resort to linear or logistic regression). While variables can be quickly encoded using IF statements, SAS’s FORMAT procedure provides a more elegant and portable solution.
Pre-Requisites
- SAS v9.x or SAS Enterprise Guide 4.x
- Basic knowledge of the SAS Data Step
- Basic knowledge of how to load and execute SAS programs
- Basic knowledge of how to load and save SAS data sets
- Access to SAS’s read/write file space
Downloads
- Categorization Project File (EG only, contains all programs)
- SAS Program: cat1
- SAS Program: cat2
- SAS Data Set: nameage (move this data set to your WORK directory)
The Scenario

The provided data set contains basic information about a random group of people. It has three columns: the first name, last name and age of each person. Suppose that some analysis will be performed on this population; categorizing by age might be useful for discerning trends.
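If the download is not at hand, a stand-in data set with the same three-column shape can be built with a short DATA step. This is only a sketch: the rows are invented for illustration, and the variable names first_name and last_name are assumptions (only age is named in this tutorial). Running it replaces any existing WORK.nameage, so skip it if you have already loaded the real data set.

```sas
/* Hypothetical stand-in for the tutorial's nameage data set.       */
/* The rows and the names first_name / last_name are invented;      */
/* only the variable name age comes from the tutorial text.         */
data work.nameage;
    length first_name $ 12 last_name $ 12;
    input first_name $ last_name $ age;
    datalines;
Ann Lee 14
Raj Patel 23
Maria Gomez 37
Tom Baker 41
Sue Wong 58
;
run;

/* Display the stand-in data to confirm the three columns. */
proc print data=work.nameage;
run;
```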
Take 1 – Categorization with the IF statement
Suppose you want to group ages into 10-year bands: 10-19, 20-29, 30-39 and 40+. The input data can be read into a SAS DATA step and extended by using the IF statement to create a categorized variable. Specifically, we can do the following:
- Create a new data set, using the provided data as the source
- Create a category variable
- Categorize the age, placing the results in the category variable
Open the SAS program cat1.sas (those using SAS Enterprise Guide 4 can alternatively open the Categorization project file and double-click on the cat1 code icon in the SAS Project Designer window).

As you can see in the screenshot above, the code creates a new data set called catv1 in the WORK library.
- The input is the nameage data set provided by this tutorial
- A 16-character variable called age_category is created to hold the literal categorization of the age variable
- An IF/ELSE IF conditional structure is used to work out the bucket for the age of the participant in each record
After the DATA step, a PROC PRINT is issued to display the contents of the catv1 data set.

Running the cat1 code should produce the output shown above.
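For readers without the program file, the approach described above might look roughly like the following. This is a minimal sketch reconstructed from the description, not the original cat1.sas; it assumes the nameage data set is already in the WORK library and that the age variable is numeric.

```sas
/* Sketch of the IF / ELSE IF approach; not the original cat1.sas. */
data work.catv1;
    set work.nameage;              /* input data set from the tutorial */
    length age_category $ 16;      /* 16-character category variable   */

    if 10 <= age <= 19 then age_category = '10-19';
    else if 20 <= age <= 29 then age_category = '20-29';
    else if 30 <= age <= 39 then age_category = '30-39';
    else if age >= 40 then age_category = '40+';
    /* Ages outside the listed ranges (for example below 10, or   */
    /* missing) keep a blank age_category.                        */
run;

/* Display the categorized data. */
proc print data=work.catv1;
run;
```

Every condition and label lives inside the DATA step, which is what makes this version hard to reuse elsewhere.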
Take 2 – Categorization using Custom Formats
Using IF/ELSE IF statements is easy and probably sufficient if you only have to categorize one variable and you know that you will never reuse the same categorization in another program or apply it to different variables. If you cannot guarantee these conditions, you will be stuck in the copy/paste coding trap: any change to the categorization will have to be made manually for each variable in each piece of code, a time-consuming and error-prone process.
Luckily, SAS provides the FORMAT procedure, which can be used to create centralized categorizations for character or numeric input. Advantages of using the FORMAT procedure include the following:
- Formats can be applied to any number of variables in multiple SAS programs
- Changes to formats will be applied to each target variable the next time its assignment code is executed
- Formats can be embedded in program code or created in self-contained programs
- Formats can be stored in temporary or permanent SAS libraries
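As an illustration of the last point, a format could be stored in a permanent library along the following lines. This is a sketch under assumptions: the libref myfmts and the folder path are hypothetical placeholders, and the ranges mirror the age bands used in this tutorial.

```sas
/* Hypothetical libref and path; point it at a folder you can write to. */
libname myfmts '/path/to/formats';

/* LIBRARY= stores the format in the MYFMTS.FORMATS catalog  */
/* instead of the temporary WORK library.                    */
proc format library=myfmts;
    value agecat
        10 - 19   = '10-19'
        20 - 29   = '20-29'
        30 - 39   = '30-39'
        40 - high = '40+';
run;

/* Tell SAS to search MYFMTS as well as WORK when it resolves format names. */
options fmtsearch=(myfmts work);
```

Any later program that assigns the same libref and sets FMTSEARCH can then use agecat without redefining it.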
With that in mind, let’s rework this code by replacing the conditional structure with a FORMAT procedure. Open the SAS program cat2.sas to view the code.

The FORMAT procedure begins with a PROC FORMAT statement and ends with a RUN statement (case insensitive). Within this structure the VALUE statement can be used to define a format. Multiple VALUE statements can be issued within a FORMAT procedure.
In this case we defined a format called agecat. This format will convert numeric input values into text-based categories. (Note: to convert text input values, place a $ in front of the format name, for example $agecat.)
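To make the note about character formats concrete, here is a small sketch. The format name $agegrp, its input values and its labels are all made up for illustration and are not part of the tutorial's programs.

```sas
/* Hypothetical character format: the $ prefix tells SAS that the */
/* input values are text rather than numbers.                     */
proc format;
    value $agegrp
        '10-19', '20-29' = 'Under 30'
        '30-39', '40+'   = '30 and over'
        other            = 'Unknown';
run;

/* Quick check: apply the format to a literal and write it to the log. */
data _null_;
    length grp $ 12;
    grp = put('20-29', $agegrp.);
    put grp=;    /* writes grp=Under 30 to the SAS log */
run;
```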
Single input values or ranges of input values are assigned formatted values with simple assignments of the form range = 'label'. In this case, each range of input values (e.g. 10-19) is assigned a single text label.
The keyword high stands for the largest possible value. It acts as a catch-all for values at or above the start of the range: in this user format, any input value of 40 or more is assigned the formatted value ‘40+’. Similarly, the keyword low represents the lowest possible data value and can be used in a bottom-end catch-all category.
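The format described above might be written roughly as follows. This is a sketch based on the description rather than a copy of cat2.sas, and the second format, agecatlow, is a hypothetical variant added only to show the low keyword.

```sas
proc format;
    /* Numeric format: no $ prefix on the name. */
    value agecat
        10 - 19   = '10-19'
        20 - 29   = '20-29'
        30 - 39   = '30-39'
        40 - high = '40+';       /* HIGH: catch-all for 40 and above */

    /* Hypothetical variant with a bottom-end catch-all using LOW. */
    /* For numeric formats, LOW does not include missing values.   */
    value agecatlow
        low - 19  = 'Under 20'
        20 - 29   = '20-29'
        30 - 39   = '30-39'
        40 - high = '40+';
run;
```

With agecat as sketched, a value that falls in none of the ranges (an age of 5, say) is not relabeled and is printed as the number itself.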
Our user-defined format can now be applied to any numeric variable via the PUT function. As a result, the DATA step no longer requires complicated conditional statements and can assign categorized values to age_category in a single line.
Just to demonstrate the flexibility of user-defined formats, a new variable called age_in_5 was also created. This variable shows what age category each person will belong to 5 years from now. Note that the assignment statement is identical to age_category except that 5 years has been added to the age before formatting.
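Put together, the format-based version could look something like this. It is a minimal sketch assuming the nameage data set is in the WORK library; it is not the original cat2.sas, and the format definition is repeated so the block runs on its own.

```sas
/* Define the format once (repeated here so the sketch is self-contained). */
proc format;
    value agecat
        10 - 19   = '10-19'
        20 - 29   = '20-29'
        30 - 39   = '30-39'
        40 - high = '40+';
run;

data work.catv2;
    set work.nameage;
    length age_category age_in_5 $ 16;

    /* PUT function: returns the formatted value as a character string. */
    age_category = put(age, agecat.);

    /* Same format, applied to the age five years from now. */
    age_in_5     = put(age + 5, agecat.);
run;

/* Display the categorized data. */
proc print data=work.catv2;
run;
```

Changing a boundary or a label now means editing only the VALUE statement; both assignments pick up the change the next time the code runs.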
After the DATA step, a PROC PRINT is issued to display the contents of the catv2 data set.

Running the cat2 code should produce the output shown above.








