Misconceptions about SAS: The MERGE statement

Recently, I used the MERGE statement to attach information about treatment group to the patient data. The patient data have more than one observation (or, to the non-SAS crowd out there, record) per patient while the additional information is patient-specific and therefore has only one observation per patient. Treatment was coded as 0=Placebo and 1=Verum. For the sake of clarity, let's assume the tables were set up like this:

USUBJID DATE VALUE
1001 20070520 43.8
1001 20070521 41.5
1001 20070522 39.8
1002 20070602 46.2
1002 20070603 41.7
1002 20070604 43.6

and

USUBJID ARMCD
1001 1
1002 0
1003 0

where ARMCD denotes the treatment group.

Then I did some analysis with repeated measurements per patient. I found out (for the umpteenth time) that SAS, when using categorical variables as exploratory variables, the last category will be used as a reference and all effects are described in comparison to this reference. this meant that Verum was the reference and the relative effect of Placebo was given. This is unintuitive. Not knowing if this behavour could be altered, instead I altered the treatment group assignment in place (Don't ever do that, always create a new variable for this unless you know what you are doing):

data new;
    merge patients values;
    by usubjid;
    armcd = armcd-1;

What does the result look like? For good measure, I already introduced the formatted version of the treatment here:

USUBJID ARMCD DATE VALUE
1001 Verum 20070520 43.8
1001 Placebo 20070521 41.5
1001 Verum 20070522 39.8
1002 Placebo 20070602 46.2
1002 Verum 20070603 41.7
1002 Placebo 20070604 43.6

Yikes! We get different treatment assignments to the same patient! What is happening here?

Time to read the documentation. In the language reference I am forwarded to the tome Language Reference: Concepts. In it you can find the following sentence which describes what the system does when match-merging:

SAS retains the values of all variables in the program data vector except those variables that were created by the DATA step; SAS sets those values to missing.

The operational word is "retains", or shall I say RETAINs?

SAS mostly works on data stored in tables; like Perl, it is very good to read and process one line at a time. Very few actions happen between lines or carry information from one line to the other. This is part of what makes SAS the number-crunching behemoth who is not afraid of these 20 million observations of telecommunication customers. This behaviour makes the system inflexible on the other hand, yet it is ingrained in so many users' minds (such as mine) that we act surprised when the system actually uses information from one line in the next one.

When programming, I had the following model in mind: "I assemble a new table by merging one table from the left, the other from the right. There is one variable in the new table which I would like to recode in place." The erroneous assumption is in the words "in the new table". Instead, SAS retains a copy of the variables to be merged in its internal memory until the last line of merge value matches has been reached in both data sets. This internal memory is called "program data vector". All operations are happening on the variable in the internal memory now. If the code armcd = 1 - armcd; is applied to it, the value of the variable will merrily hop to and fro, between 1 and 0, until the next USUBJID is reached, in which case it is set to the correct treatment group of the next patient (whose data lines will be accompanied with zebra stripes of toggling 1 and 0 values again).

Moral: Don't ever overwrite variables in place, create new variables instead.

Add a New Comment
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License