I have never liked SAS much, although I did blog about how to send email from SAS using your server instead of your email client. Today I ran into an issue that strongly reinforced that lack of respect for SAS.
One of the hats I wear at work is that of a Data Analyst. Much of the data we work with is large. Quite large actually. Usually I do this work in SQL Server and move the data to SAS after the data has been rolled up and minimized a smuch as possible. This time is different, the data ended up in SAS directly, and then ended up in my share, since our SAS person recently moved on.
So, SAS is an enterprise level tool, it should handle the data just fine. The workstation is a dual proc xeon box with 2 gigs of RAM and a 400 gig striped data drive. I am confident.
The first thing I decide I want to do is do a join, called a merge in SAS, between a 73 million record dataset and another smaller dataset. In SAS, before you can do a merge, you have to actuall *sort* the dataset. Otherwise SAS gets confused. Can anyone say 'Not a real relation database engine'? I knew you could. So I start the sort process and go to lunch. I expect it to take awhile, in SQL, with proper indexes built, it would still take a bit.
I get my hair cut and then go eat. My lunch is slightly longer than an hour and a half. SAS is not done. That is not a good sign.
I go work on some server things for an hour and come back. SAS is not done. I am no longer confident. The end result? SAS has some issues with the way it sorts. It filled up the whole RAID drive with its temporary data set and died. I am more lacking in confidence than anyone alive.
I call SAS support. I explain the issue and the immediate answer is 'Oh yeah, thats going to be ugly'. Hello? I think I know this already. They gave me a nice formula that tells me how much space is needed to do a sort.
To estimate the amount of free work disk space required for a proc sort, review the following formula:
free_bytes_required=(key_variable_length + observation_length) * number_of_observations * 4
Where:
key_variable_length: Length of sorting variable. Example: If sorting by a character variable with a length of 10, this would produce a key_variable_length of 10 bytes. A numeric variable would be 8 bytes.
obsevation_length: Total number of bytes in an observation. Add together the bytes for all variables in one observation.
number_of_observations: Total number of observations in a data set.
This does not help me solve my issue. The colums are 397 characters wide, most of the data is character, and I had enabled Unicode, which is off by default.
By my estimates, based off of the integers being 8 bytes and using 2 bytes for each unicode character, SAS will need a temp file of ~320 gigs. To sort a 28 gig file...
Fortunantly SAS allows you to build adhoc SQL code inside a PROC SQL block, unfortunantly the join has taken 2 hours so far and is still running...