*using the First Last varialbes of group;
*first.group=last.group means there is only one record within that group;
data single dup;
set temp;
by group;
if first.group and last.group then output single;
else output dup;
run;
My notes for Statistic related topics, such as SAS, STATA programming. Other topics software related topics such as Latex may also be included.
Saturday, August 14, 2010
find duplicated data and put them into a different file
output a data set to a txt file
Output a SAS data set
data _null_;
set lib.zjl;
file 'drive:\path\file.txt' notitles noprint;
put var1 var2 var3 ... varn;
run;
data _null_;
set lib.zjl;
file 'drive:\path\file.txt' DLM=',';
put var1 var2 var3 ... varn;
run;
data _null_;
set lib.zjl;
file 'drive:\path\file.txt' DLM='09'x;
put var1 var2 var3 ... varn;
run;
data _null_;
set lib.zjl;
file 'drive:\path\file.txt' notitles noprint;
put var1 var2 var3 ... varn;
run;
data _null_;
set lib.zjl;
file 'drive:\path\file.txt' DLM=',';
put var1 var2 var3 ... varn;
run;
data _null_;
set lib.zjl;
file 'drive:\path\file.txt' DLM='09'x;
put var1 var2 var3 ... varn;
run;
use subset of observations in SAS procedures
*Limiting Observations
* Read from a raw data file;
filename myfile 'c:\SASfiles\temp\myfile';
Data temp;
infile myfile obs=100;
input ...;
run;
*printing data;
proc print data=temp (obs=10);
proc print data=temp (firstobs=20 obs=50);
*copy to other data set;
data temp2;
set temp (obs=10);
... other statements...
run;
*frequency test;
proc freq data=temp (obs=50);
var ...your variables...
run;
* Read from a raw data file;
filename myfile 'c:\SASfiles\temp\myfile';
Data temp;
infile myfile obs=100;
input ...;
run;
*printing data;
proc print data=temp (obs=10);
proc print data=temp (firstobs=20 obs=50);
*copy to other data set;
data temp2;
set temp (obs=10);
... other statements...
run;
*frequency test;
proc freq data=temp (obs=50);
var ...your variables...
run;
missing variable handling in SAS procedures
Missing Variables handles in SAS procedures
* - proc means
For each variable, the number of non-missing values are used
* proc freq
By default, missing values are excluded and percentages are based on the number of non-missing values. If you use the missing option on the tables statement, the percentages are based on the total number of observations (non-missing and missing) and the percentage of missing values are reported in the table.
* proc corr
By default, correlations are computed based on the number of pairs with non-missing data (pairwise deletion of missing data). The nomiss option can be used on the proc corr statement to request that correlations be computed only for observations that have non-missing data for all variables on the var statement (listwise deletion of missing data).
* proc reg
If any of the variables on the model or var statement are missing, they are excluded from the analysis (i.e., listwise deletion of missing data)
* proc factor
Missing values are deleted listwise, i.e., observations with missing values on any of the variables in the analysis are omitted from the analysis.
* proc glm
The handling of missing values in proc glm can be complex to explain. If you have an analysis with just one variable on the left side of the model statement (just one outcome or dependent variable), observations are eliminated if any of the variables on the model statement are missing. Likewise, if you are performing a repeated measures ANOVA or a MANOVA, then observations are eliminated if any of the variables in the model statement are missing. For other situations, see the SAS/STAT manual about proc glm.
* For other procedures, see the SAS manual for information on how missing data are handled.
* - proc means
For each variable, the number of non-missing values are used
* proc freq
By default, missing values are excluded and percentages are based on the number of non-missing values. If you use the missing option on the tables statement, the percentages are based on the total number of observations (non-missing and missing) and the percentage of missing values are reported in the table.
* proc corr
By default, correlations are computed based on the number of pairs with non-missing data (pairwise deletion of missing data). The nomiss option can be used on the proc corr statement to request that correlations be computed only for observations that have non-missing data for all variables on the var statement (listwise deletion of missing data).
* proc reg
If any of the variables on the model or var statement are missing, they are excluded from the analysis (i.e., listwise deletion of missing data)
* proc factor
Missing values are deleted listwise, i.e., observations with missing values on any of the variables in the analysis are omitted from the analysis.
* proc glm
The handling of missing values in proc glm can be complex to explain. If you have an analysis with just one variable on the left side of the model statement (just one outcome or dependent variable), observations are eliminated if any of the variables on the model statement are missing. Likewise, if you are performing a repeated measures ANOVA or a MANOVA, then observations are eliminated if any of the variables in the model statement are missing. For other situations, see the SAS/STAT manual about proc glm.
* For other procedures, see the SAS manual for information on how missing data are handled.
Friday, August 13, 2010
Macro counting Missing Values in a data set
*Calculate all missing values for each variable in a database;
%macro miscount(dsn);
* Determine the number of misssing values
*of each variable in the submitted data set;
data temp_char(keep=_character_); *copy all character variables into a temp file;
set &dsn;
run;
data temp_num(keep=_numeric_); *copy all numeric variables into a temp file;
set &dsn;
run;
proc contents data= temp_char
out= cont noprint; *output the contents to a file;
run;
* make the variable names into Macro variables;
proc sql noprint;
select distinct name
into :varname1-:varname999
from cont;
quit;
*make character variables into numeric variables;
*then we can use the PROC MEANS to find the missing numbers;
*this is better then using the PROC FREQ for each variable;
data temp_char;
set temp_char;
%do i = 1 %to &sqlobs; *the &sqlobs tells you how many variables were read in previous SQL;
if compress(&&varname&i)='' then new&i=.;
else new&i=0;
*Drop and rename the variables;
drop &&varname&i;
rename new&i=&&varname&i;
%end;
*Use PROC MEANS to do the tables;
proc means data=temp_char n nmiss;
title 'Missing number table for Char Var of ' &dsn;
proc means data=temp_num n nmiss;
title 'Missing number table of Num Var of ' &dsn;
%mend miscount;
%macro miscount(dsn);
* Determine the number of misssing values
*of each variable in the submitted data set;
data temp_char(keep=_character_); *copy all character variables into a temp file;
set &dsn;
run;
data temp_num(keep=_numeric_); *copy all numeric variables into a temp file;
set &dsn;
run;
proc contents data= temp_char
out= cont noprint; *output the contents to a file;
run;
* make the variable names into Macro variables;
proc sql noprint;
select distinct name
into :varname1-:varname999
from cont;
quit;
*make character variables into numeric variables;
*then we can use the PROC MEANS to find the missing numbers;
*this is better then using the PROC FREQ for each variable;
data temp_char;
set temp_char;
%do i = 1 %to &sqlobs; *the &sqlobs tells you how many variables were read in previous SQL;
if compress(&&varname&i)='' then new&i=.;
else new&i=0;
*Drop and rename the variables;
drop &&varname&i;
rename new&i=&&varname&i;
%end;
*Use PROC MEANS to do the tables;
proc means data=temp_char n nmiss;
title 'Missing number table for Char Var of ' &dsn;
proc means data=temp_num n nmiss;
title 'Missing number table of Num Var of ' &dsn;
%mend miscount;
Missing Values in SAS
Find out missing value of a vairable
http://www2.sas.com/proceedings/sugi31/025-31.pdf
Character missing values are represented by a single blank enclosed in quotes (‘ ‘).
Numeric missing values are represented by a single period (.).
Special numeric missing values are represented by a single period followed by a single letter or an underscore (for example .A, .S, .Z, ._).
IF VALUE <=.Z THEN PUT “*** Value is missing”;
the order of missing variable:: negative number > .z > .(letters) > . > _
http://www2.sas.com/proceedings/sugi31/025-31.pdf
Character missing values are represented by a single blank enclosed in quotes (‘ ‘).
Numeric missing values are represented by a single period (.).
Special numeric missing values are represented by a single period followed by a single letter or an underscore (for example .A, .S, .Z, ._).
IF VALUE <=.Z THEN PUT “*** Value is missing”;
the order of missing variable:: negative number > .z > .(letters) > . > _
remove labels formats informats from a SAS data set
The following example of code removes all labels and formats for the dataset temp in the directory referenced by the LIBNAME WORK:
proc datasets lib=work nolist;
modify temp;
attrib _all_ label=''; *Remove labels;
format _all_; *Remove formats;
informat _all_; *Remove informats;
quit;
run;
It can also be done in the data step:
data temp;
set temp;
attrib _all_ label=''; *remove labels;
format _all_; *remove formats;
informat _all_; *remove informats;
run;
proc datasets lib=work nolist;
modify temp;
attrib _all_ label=''; *Remove labels;
format _all_; *Remove formats;
informat _all_; *Remove informats;
quit;
run;
It can also be done in the data step:
data temp;
set temp;
attrib _all_ label=''; *remove labels;
format _all_; *remove formats;
informat _all_; *remove informats;
run;
Thursday, August 12, 2010
SAS character to number or date
Character to number:: input( charvar, 6.0);
number to Character:: put(numvar, 4.0);
character to date:: input(charvar, mmddyy10.);
number to date:: mdy(mon, date, year);
remove specific character from a string:: compress(charvar, '()- '); *remove the (, ), -, and white space from the variable.
/* Program 1:: Convert num to char and char to num */
data temp1;
input numvar charvar $4.;
/* convert character to numeric */
numvar_n=input(charvar, 4.);
/* convert numeric to character */
char_n=put(numvar,4.0);
cards;
123 7691
789 0306
011 8729
;;
proc print data=temp1;
run;
/*program 2 character to date and number to date*/
data temp2;
input date_char mon day year;
/* convert character to date */
date1=input(date_char, mmddyy10.);
/* convert numeric to date */
date2=mdy(mon,day,year);
cards;
10/01/1999 10 01 1999
12/24/2003 12 24 2003
07/01/2001 7 1 2001
;;
proc print data=temp2;
run;
*program 3 remove the white space from a date variable;
data temp2;
input date_char mon day year;
/* convert character to date */
date1=input(compress(date_char), mmddyy10.);
*here the compress without the second parameter;
*by default it is the white space;
/* convert numeric to date */
date2=mdy(mon,day,year);
cards;
*instead of 10/01/1999, this format is 10/ 1/1999 using a space instead of 0;
10/ 1/1999 10 01 1999
12/24/2003 12 24 2003
7/ 1/2001 7 1 2001
;;
proc print data=temp2;
run;
number to Character:: put(numvar, 4.0);
character to date:: input(charvar, mmddyy10.);
number to date:: mdy(mon, date, year);
remove specific character from a string:: compress(charvar, '()- '); *remove the (, ), -, and white space from the variable.
/* Program 1:: Convert num to char and char to num */
data temp1;
input numvar charvar $4.;
/* convert character to numeric */
numvar_n=input(charvar, 4.);
/* convert numeric to character */
char_n=put(numvar,4.0);
cards;
123 7691
789 0306
011 8729
;;
proc print data=temp1;
run;
/*program 2 character to date and number to date*/
data temp2;
input date_char mon day year;
/* convert character to date */
date1=input(date_char, mmddyy10.);
/* convert numeric to date */
date2=mdy(mon,day,year);
cards;
10/01/1999 10 01 1999
12/24/2003 12 24 2003
07/01/2001 7 1 2001
;;
proc print data=temp2;
run;
*program 3 remove the white space from a date variable;
data temp2;
input date_char mon day year;
/* convert character to date */
date1=input(compress(date_char), mmddyy10.);
*here the compress without the second parameter;
*by default it is the white space;
/* convert numeric to date */
date2=mdy(mon,day,year);
cards;
*instead of 10/01/1999, this format is 10/ 1/1999 using a space instead of 0;
10/ 1/1999 10 01 1999
12/24/2003 12 24 2003
7/ 1/2001 7 1 2001
;;
proc print data=temp2;
run;
stat 复制结果为文本
虽然stata10已经可以把结果复制成图像,但是这样会使文件变大。但是复制结果为文字,特别是在复制表格的时候,如果粘贴到office上会变的很丑陋。如果你结果粘贴到notebook上或者emacs,vim上你会看到结果跟stata上的结果是一样的。原因是文本编辑软件都是用等宽字体(monospaced font)。微软的office里宋体是等宽字体,英文Consolas, 和Courier New都是等宽字体。粘贴后,选选择粘贴部分改用任一等宽字体,你就会看到跟stata一样的效果。
原因是等宽字体,顾名思义任意字符宽度相等,比如空格跟字母M的宽度是相等的。非等宽字符每个字符的宽度是不相等的。比如字母l跟m的宽度就不一样。 所以非等宽字体书写比较美观,但是用了粘贴这类表格就会很丑。。。需要对齐的地方都变得不对齐了。
原因是等宽字体,顾名思义任意字符宽度相等,比如空格跟字母M的宽度是相等的。非等宽字符每个字符的宽度是不相等的。比如字母l跟m的宽度就不一样。 所以非等宽字体书写比较美观,但是用了粘贴这类表格就会很丑。。。需要对齐的地方都变得不对齐了。
Subscribe to:
Posts (Atom)