*using the First Last varialbes of group;
*first.group=last.group means there is only one record within that group;
data single dup;
set temp;
by group;
if first.group and last.group then output single;
else output dup;
run;
My Notes
My notes for Statistic related topics, such as SAS, STATA programming. Other topics software related topics such as Latex may also be included.
Saturday, August 14, 2010
find duplicated data and put them into a different file
output a data set to a txt file
Output a SAS data set
data _null_;
set lib.zjl;
file 'drive:\path\file.txt' notitles noprint;
put var1 var2 var3 ... varn;
run;
data _null_;
set lib.zjl;
file 'drive:\path\file.txt' DLM=',';
put var1 var2 var3 ... varn;
run;
data _null_;
set lib.zjl;
file 'drive:\path\file.txt' DLM='09'x;
put var1 var2 var3 ... varn;
run;
data _null_;
set lib.zjl;
file 'drive:\path\file.txt' notitles noprint;
put var1 var2 var3 ... varn;
run;
data _null_;
set lib.zjl;
file 'drive:\path\file.txt' DLM=',';
put var1 var2 var3 ... varn;
run;
data _null_;
set lib.zjl;
file 'drive:\path\file.txt' DLM='09'x;
put var1 var2 var3 ... varn;
run;
use subset of observations in SAS procedures
*Limiting Observations
* Read from a raw data file;
filename myfile 'c:\SASfiles\temp\myfile';
Data temp;
infile myfile obs=100;
input ...;
run;
*printing data;
proc print data=temp (obs=10);
proc print data=temp (firstobs=20 obs=50);
*copy to other data set;
data temp2;
set temp (obs=10);
... other statements...
run;
*frequency test;
proc freq data=temp (obs=50);
var ...your variables...
run;
* Read from a raw data file;
filename myfile 'c:\SASfiles\temp\myfile';
Data temp;
infile myfile obs=100;
input ...;
run;
*printing data;
proc print data=temp (obs=10);
proc print data=temp (firstobs=20 obs=50);
*copy to other data set;
data temp2;
set temp (obs=10);
... other statements...
run;
*frequency test;
proc freq data=temp (obs=50);
var ...your variables...
run;
missing variable handling in SAS procedures
Missing Variables handles in SAS procedures
* - proc means
For each variable, the number of non-missing values are used
* proc freq
By default, missing values are excluded and percentages are based on the number of non-missing values. If you use the missing option on the tables statement, the percentages are based on the total number of observations (non-missing and missing) and the percentage of missing values are reported in the table.
* proc corr
By default, correlations are computed based on the number of pairs with non-missing data (pairwise deletion of missing data). The nomiss option can be used on the proc corr statement to request that correlations be computed only for observations that have non-missing data for all variables on the var statement (listwise deletion of missing data).
* proc reg
If any of the variables on the model or var statement are missing, they are excluded from the analysis (i.e., listwise deletion of missing data)
* proc factor
Missing values are deleted listwise, i.e., observations with missing values on any of the variables in the analysis are omitted from the analysis.
* proc glm
The handling of missing values in proc glm can be complex to explain. If you have an analysis with just one variable on the left side of the model statement (just one outcome or dependent variable), observations are eliminated if any of the variables on the model statement are missing. Likewise, if you are performing a repeated measures ANOVA or a MANOVA, then observations are eliminated if any of the variables in the model statement are missing. For other situations, see the SAS/STAT manual about proc glm.
* For other procedures, see the SAS manual for information on how missing data are handled.
* - proc means
For each variable, the number of non-missing values are used
* proc freq
By default, missing values are excluded and percentages are based on the number of non-missing values. If you use the missing option on the tables statement, the percentages are based on the total number of observations (non-missing and missing) and the percentage of missing values are reported in the table.
* proc corr
By default, correlations are computed based on the number of pairs with non-missing data (pairwise deletion of missing data). The nomiss option can be used on the proc corr statement to request that correlations be computed only for observations that have non-missing data for all variables on the var statement (listwise deletion of missing data).
* proc reg
If any of the variables on the model or var statement are missing, they are excluded from the analysis (i.e., listwise deletion of missing data)
* proc factor
Missing values are deleted listwise, i.e., observations with missing values on any of the variables in the analysis are omitted from the analysis.
* proc glm
The handling of missing values in proc glm can be complex to explain. If you have an analysis with just one variable on the left side of the model statement (just one outcome or dependent variable), observations are eliminated if any of the variables on the model statement are missing. Likewise, if you are performing a repeated measures ANOVA or a MANOVA, then observations are eliminated if any of the variables in the model statement are missing. For other situations, see the SAS/STAT manual about proc glm.
* For other procedures, see the SAS manual for information on how missing data are handled.
Friday, August 13, 2010
Macro counting Missing Values in a data set
*Calculate all missing values for each variable in a database;
%macro miscount(dsn);
* Determine the number of misssing values
*of each variable in the submitted data set;
data temp_char(keep=_character_); *copy all character variables into a temp file;
set &dsn;
run;
data temp_num(keep=_numeric_); *copy all numeric variables into a temp file;
set &dsn;
run;
proc contents data= temp_char
out= cont noprint; *output the contents to a file;
run;
* make the variable names into Macro variables;
proc sql noprint;
select distinct name
into :varname1-:varname999
from cont;
quit;
*make character variables into numeric variables;
*then we can use the PROC MEANS to find the missing numbers;
*this is better then using the PROC FREQ for each variable;
data temp_char;
set temp_char;
%do i = 1 %to &sqlobs; *the &sqlobs tells you how many variables were read in previous SQL;
if compress(&&varname&i)='' then new&i=.;
else new&i=0;
*Drop and rename the variables;
drop &&varname&i;
rename new&i=&&varname&i;
%end;
*Use PROC MEANS to do the tables;
proc means data=temp_char n nmiss;
title 'Missing number table for Char Var of ' &dsn;
proc means data=temp_num n nmiss;
title 'Missing number table of Num Var of ' &dsn;
%mend miscount;
%macro miscount(dsn);
* Determine the number of misssing values
*of each variable in the submitted data set;
data temp_char(keep=_character_); *copy all character variables into a temp file;
set &dsn;
run;
data temp_num(keep=_numeric_); *copy all numeric variables into a temp file;
set &dsn;
run;
proc contents data= temp_char
out= cont noprint; *output the contents to a file;
run;
* make the variable names into Macro variables;
proc sql noprint;
select distinct name
into :varname1-:varname999
from cont;
quit;
*make character variables into numeric variables;
*then we can use the PROC MEANS to find the missing numbers;
*this is better then using the PROC FREQ for each variable;
data temp_char;
set temp_char;
%do i = 1 %to &sqlobs; *the &sqlobs tells you how many variables were read in previous SQL;
if compress(&&varname&i)='' then new&i=.;
else new&i=0;
*Drop and rename the variables;
drop &&varname&i;
rename new&i=&&varname&i;
%end;
*Use PROC MEANS to do the tables;
proc means data=temp_char n nmiss;
title 'Missing number table for Char Var of ' &dsn;
proc means data=temp_num n nmiss;
title 'Missing number table of Num Var of ' &dsn;
%mend miscount;
Missing Values in SAS
Find out missing value of a vairable
http://www2.sas.com/proceedings/sugi31/025-31.pdf
Character missing values are represented by a single blank enclosed in quotes (‘ ‘).
Numeric missing values are represented by a single period (.).
Special numeric missing values are represented by a single period followed by a single letter or an underscore (for example .A, .S, .Z, ._).
IF VALUE <=.Z THEN PUT “*** Value is missing”;
the order of missing variable:: negative number > .z > .(letters) > . > _
http://www2.sas.com/proceedings/sugi31/025-31.pdf
Character missing values are represented by a single blank enclosed in quotes (‘ ‘).
Numeric missing values are represented by a single period (.).
Special numeric missing values are represented by a single period followed by a single letter or an underscore (for example .A, .S, .Z, ._).
IF VALUE <=.Z THEN PUT “*** Value is missing”;
the order of missing variable:: negative number > .z > .(letters) > . > _
remove labels formats informats from a SAS data set
The following example of code removes all labels and formats for the dataset temp in the directory referenced by the LIBNAME WORK:
proc datasets lib=work nolist;
modify temp;
attrib _all_ label=''; *Remove labels;
format _all_; *Remove formats;
informat _all_; *Remove informats;
quit;
run;
It can also be done in the data step:
data temp;
set temp;
attrib _all_ label=''; *remove labels;
format _all_; *remove formats;
informat _all_; *remove informats;
run;
proc datasets lib=work nolist;
modify temp;
attrib _all_ label=''; *Remove labels;
format _all_; *Remove formats;
informat _all_; *Remove informats;
quit;
run;
It can also be done in the data step:
data temp;
set temp;
attrib _all_ label=''; *remove labels;
format _all_; *remove formats;
informat _all_; *remove informats;
run;
Subscribe to:
Posts (Atom)