SESUG 2016 Conference Abstracts
Application/Macro Development
Creating Viable SAS® Data Sets From Survey Monkey® Transport Files
John R. Gerlach
Survey Monkey is an application that provides a means for creating online surveys. Unfortunately, the transport (Excel) file from this application requires a complete overhaul in order to do any serious data analysis. Besides having a peculiar structure and containing extraneous data points, the column headers become very problematic when importing the file into SAS. In fact, the initial SAS data set is virtually unusable. This paper explains a systematic approach for creating a viable SAS data set for doing serious analysis.
Universal File Flattener
David Vandenbroucke
This paper describes the construction of a program that converts any set of relational tables into a single flat file, using Base SAS. The program gets the information it needs from the data tables themselves, with a minimum of user configuration. It automatically detects one-to-many relationships and creates sets of replicate variables in the flat file. The program illustrates the use of macro programming and the SQL, DATASETS, and TRANSPOSE procedures, among others. The output is sent to a Microsoft Excel spreadsheet using the Output Delivery System (ODS).
%SUBMIT_R: A SAS® Macro to Interface SAS and R
Ross Bettinger
The purpose of the %SUBMIT_R macro is to facilitate communication between SAS and R under Windows. %SUBMIT_R uses SAS's unnamed pipe device type to invoke the R executable. SAS datasets may be converted into R data frames and vice versa in a manner similar to using the SAS/IML ExportDataSetToR and ImportDataSetFromR functions. R graphics are also supported, and are displayed in the SAS results viewer. Graphs may be saved in user-specified locations as image files. R scripts may be created using a SAS DATA _NULL_ step to write a file containing an R script, read from a user-specified .R input file, or created using the %R macro. Output of R execution may be directed to the SAS log file for inspection or to a user-specified .Rout file for later use. Keywords: SAS macro, R script, SAS program file, unnamed pipe, ODS, HTML.
New Game in Town: SAS® Proc Lua with Applications
Jiangtang Hu
Lua (pronounced LOO-ah) is a scripting language like Python. It is very popular in the gaming industry. In recent years it has also moved into the machine learning world through the widely used Torch 7 library. SAS introduced Lua into its Base module in version 9.4, which will definitely extend SAS programming functionality. Lua is also supported in Viya, the upcoming SAS next-generation high-performance analytics platform.
In this paper, several examples (dynamic programming as a replacement for SAS macros, reading external files like JSON, XML, and HDF5, and implementing machine learning algorithms like Naïve Bayes) will be presented to showcase how Proc Lua can add to SAS programmers' toolbox. All code will be available on GitHub at https://github.com/Jiangtang/SESUG/tree/master/2016.
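To give a flavor of the syntax (a minimal sketch independent of the paper's examples), PROC LUA wraps Lua code in SUBMIT/ENDSUBMIT blocks, and the sas.submit() function can generate and run SAS code dynamically:

```sas
proc lua;
submit;
  -- plain Lua running inside a SAS session
  local total = 0
  for i = 1, 10 do
    total = total + i
  end
  print("sum 1..10 = " .. total)
  -- generate and execute SAS code from Lua
  sas.submit([[ data _null_; put "hello from Lua"; run; ]])
endsubmit;
run;
```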
Breaking up (Axes) Isn't Hard to Do: A Macro for Choosing Axis Breaks
Alex Buck
SAS® 9.4 brought some wonderful new graphics options. One of the most exciting
is the addition of the RANGE option for SGPLOT. As the name suggests,
specifying ranges for a broken axis is controlled by the user. The only
question left is where to set the breaks and if a break is actually needed. That is what this macro is designed to do. The macro analyzes the specified
input parameter to create macro variables for the overall minimum and maximum,
as well as macro variables specifying values prior to and following the largest
difference that occurs between successive parameter values. The macro will also
create variables for suggested break values to ensure graphic items such as
markers are displayed in full. The user then utilizes these macro variables to
determine if an axis break is needed and where to set those breaks. With the
macro's dynamic nature, it can be incorporated into larger graphics macro
programs easily while making specific recommendations for each individual
parameter. A complete and intuitive graph is produced with every macro call.
Moving Data and Results Between SAS® and Microsoft Excel
Harry Droogendyk
Microsoft Excel spreadsheets are often the format of choice for our users, both
when supplying data to our processes and as a preferred means for receiving
processing results and data. SAS® offers a number of means to import Excel
data quickly and efficiently. There are equally flexible methods to move data
and results from SAS to Excel. This paper will outline the many techniques
available and identify useful tips for moving data and results between SAS and
Excel efficiently and painlessly.
Using PROC FCMP for Short Test Assembly
Tsung-hsun Tsai and Yung-chen Hsu
Assembling short test forms is a practical task in testing service
organizations. A short test form contains fewer test items than the full-length
test does, yet still preserves the essential test quality and captures the
required test statistical characteristics. We explore an instance selection
method for short test assembly task by using PROC FCMP to conduct low-level
array operations to implement minimum spanning tree clustering. PROC FCMP was
developed to allow SAS users to write their own functions and subroutines for
use in DATA step or SAS procedures. The purpose of this paper is to demonstrate
how we can use SAS as a functional programming language by utilizing PROC FCMP
to write reusable functions or subroutines to manage complex tasks.
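The general pattern the paper relies on, defining a reusable function with PROC FCMP and calling it from a DATA step, can be sketched as follows (a generic illustration, not the authors' clustering code; all names are placeholders):

```sas
/* define a reusable function and store it in a package */
proc fcmp outlib=work.funcs.demo;
  function euclid(x1, y1, x2, y2);
    return (sqrt((x1-x2)**2 + (y1-y2)**2));
  endsub;
run;

/* make the package visible to the DATA step */
options cmplib=work.funcs;

data distances;
  d = euclid(0, 0, 3, 4);   /* Euclidean distance: 5 */
  put d=;
run;
```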
A Waze App for Base SAS®: Automatically Routing around Locked Data Sets, Bottleneck Processes, and Other Traffic Congestion on the Data Superhighway
Troy Hughes
The Waze application, purchased by Google in 2013, alerts millions of users
about traffic congestion, collisions, construction, and other complexities of
the road that can stymie motorists' attempts to get from A to B. From
jackknifed rigs to jackalope carcasses, roads can be gnarled by gridlock or
littered with obstacles that impede traffic flow and efficiency. Waze
algorithms automatically reroute users to more efficient routes based on
user-reported events as well as historical norms that demonstrate typical road
conditions. Extract-transform-load (ETL) infrastructures often represent
serialized process flows that can mimic highways, and which can become
similarly snarled by locked data sets, slow processes, and other factors that
introduce inefficiency. The LOCKITDOWN SAS® macro, introduced at WUSS in
2014, detects and prevents data access collisions that occur when two or more
SAS processes or users simultaneously attempt to access the same SAS data set. Moreover, the LOCKANDTRACK macro, introduced at WUSS in 2015, provides
real-time tracking of and historical performance metrics for locked data sets
through a unified control table, enabling developers to hone processes to
optimize efficiency and data throughput. This text demonstrates the
implementation of LOCKSMART and its lock performance metrics to create
data-driven, fuzzy logic algorithms that preemptively reroute program flow
around inaccessible data sets. Thus, rather than needlessly waiting for a data
set to become available or a process to complete, the software actually
anticipates the wait time based on historical norms, performs other
(independent) functions, and returns to the original process when it becomes
available.
The Demystification of a Great Deal of Files
Chao-Ying Hsieh
Our input data are sometimes stored in external flat files rather than in a
traditional database environment. This creates tedious work if programmers need to read a large number of input files from multiple locations. This paper will address a solution to this issue that uses the SAS® macro facility and the DREAD function. Additionally, the paper will address further applications that use file name information to validate data.
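The DOPEN/DREAD idiom this kind of solution builds on looks like the following (a minimal sketch; the directory path is a placeholder):

```sas
/* list every file name in a directory into a data set */
filename mydir 'C:\data\incoming';   /* placeholder path */

data filelist;
  length fname $256;
  did = dopen('mydir');              /* open the directory */
  if did > 0 then do;
    do i = 1 to dnum(did);           /* loop over its members */
      fname = dread(did, i);         /* i-th file name */
      output;
    end;
    rc = dclose(did);
  end;
  drop did i rc;
run;
```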
Multiple Studies! DataDefinitionTracker, Made Easy!
Saritha Bathi
It is always hard to track changes to the data definitions in your project. It gets harder when you are dealing with multiple studies, and harder still when the team is big. But don't worry: our system/process/set of macros helps you track changes in data definitions efficiently and effectively. The main idea behind the process is to read the data definitions into a SAS dataset. When the data definitions change, read the changed/new definitions and create a new SAS dataset. Compare the new and old definitions, create a table [be creative in presenting the changes to the team], and email it to the team with both old and new definitions. This helps the programmer write efficient programs, as one can see the change in a definition clearly. Automate and schedule it for daily/weekly updates depending on project needs. Develop once and use it across all therapeutic areas and across the company!
Banking/Finance
Leads and Lags: Static and Dynamic Queues in the SAS® DATA STEP
Mark Keintz
From stock price histories to hospital stay records, analysis of time series
data often requires use of lagged (and occasionally lead) values of one or more
analysis variables. For the SAS® user, the central operational task is typically getting lagged (lead) values for each time point in the data set. While SAS has long provided a LAG function, it has no analogous lead function, an especially significant problem in the case of large data
series. This paper will (1) review the lag function, in particular the
powerful, but non-intuitive implications of its queue-oriented basis, (2)
demonstrate efficient ways to generate leads with the same flexibility as the
lag function, but without the common and expensive recourse to data re-sorting,
and (3) show how to dynamically generate leads and lags through use of the hash
object.
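The LAG function and one common lead idiom, a one-to-one MERGE of a data set with itself offset by one observation (one of several techniques in this space; data set and variable names are placeholders), can be sketched as:

```sas
data leads_lags;
  merge prices
        prices(firstobs=2 keep=price rename=(price=next_price));
  prev_price = lag(price);     /* queue-based lag of the current value */
  /* next_price holds the following row's value and is
     missing on the last observation; no re-sort needed */
run;
```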
Using SAS®/QC to Design Optimal Experimental Designs in Consumer Lending
Jonas Bilenas
We will look at how to design experiments using SAS®/QC PROC OPTEX. Examples
will come from simulated experiments testing consumer preference for direct
mail credit card offers. Discussion will focus on designs of experiments using
PROC OPTEX, sample size requirements, and response surface models using PROC LOGISTIC and
PROC GENMOD. We will then discuss what to do with design results after the
experiment is completed. Focus is on measuring consumer preferences for
consumer products but the methodology has application in many disciplines.
Using Regression Splines in SAS® STAT Procedures
Jonas Bilenas and Nish Herat
Regression splines are an added feature in many SAS® STAT procedures. They can be used in scatterplot smoothing to detect non-linearity, and also in generating model predictions. We will review how to code up nonparametric
splines in many procedures and how to score new populations. We will also
compare these splines with parametric Natural Splines when building regression
models. Applications will come from consumer credit examples but are also
applicable to other industries.
Computing Risk Measures for Cross Collateralized Loans Using Graph Algorithms
Chaoxian Cai
In commercial lending, multiple loans or commitments may be collateralized by
multiple properties, and this cross collateralization creates linked
loan-property networks. These networks are embedded in a loan-property table,
stored in a relational database. Some risk measures for these cross
collateralized loans or commitments are better evaluated in aggregated terms if
we can identify all cross linked properties, loans, and commitments. The
Union-Find algorithm is commonly used to find connected components in a graph. In this paper, I have implemented the primary operations of the Union-Find
algorithm in a SAS® macro program using only Base SAS DATA steps. The program
can be used to find all connected components in a SAS data set and separate
them into discrete groups. It can be applied to find all cross collateralized
loan-property networks and compute pooled loan-to-value (LTV) ratios. The
program can also be applied to identify main obligations and loan structures of
future commitments with takedown loans and hierarchical lines of credit. Risk
measures, such as exposure at default (EAD) and credit conversion factor (CCF),
are computed for these complicated loans and illustrated by examples.
Know your Interest Rate
Anirban Chakraborty
The Federal Reserve of the United States reported a drastic increase in
consumer debt over the past few years, reaching $3.5 trillion in May 2015. Credit card debt accounts for only 26% of total consumer debt; the remaining 74% derives from student loans, automobile loans, mortgages, etc. Lending has become an integral part of US consumers' everyday life. Have you ever wondered how lenders use various factors such as FICO score,
annual income, the loan amount approved, tenure, debt-to-income ratio etc. and
select your interest rates? The process, defined as risk-based pricing,
uses a sophisticated algorithm that leverages different determining factors of
a loan applicant. This research provides an approach to explore the factors
that significantly affect borrowers' fixed loan interest rates. For the purpose
of this research, data was collected from a publicly available data source, Lending Club, the largest peer-to-peer online credit marketplace. The
downloaded dataset has information about successful loan applications of 2015,
and includes 421,097 observations and 115 variables. Exploration of data shows
that debt consolidation is the primary reason for loan application, comprising
59% of all loan applications, followed by credit card bill payments. Further
analysis is warranted to determine factors that affect loan interest rate
significantly. Selection of significant factors will help develop a prediction
algorithm which can estimate loan interest rates based on clients' information. On one hand, knowing the factors will help consumers and borrowers
to increase their credit worthiness and place themselves in a better position
to negotiate for getting a lower interest rate. On the other hand, this will
help lending companies get an immediate fixed interest rate estimate based on clients' information.
By building various predictive models on diverse factors that might influence the interest rate, we attempt to answer the following problem statement: estimate the interest rate based on the various factors of a loan applicant.
Building Blocks
Fuzzy Matching: Where Is It Appropriate and How Is It Done? SAS® Can Help
Stephen Sloan and Dan Hoicowitz
When attempting to match names and addresses from different files, we often run
into a situation where the names are similar, but not exactly the same. Sometimes there are additional words in the names, sometimes there are
different spellings, and sometimes the businesses have the same name but are
located thousands of miles apart. The files that contain the names might have
numeric keys that cannot be matched. Therefore, we need to use a process called
fuzzy matching to match the names from different files. The SAS®
function COMPGED, combined with SAS® character-handling functions, provides a
straightforward method of applying business rules and testing for similarity.
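A common shape for this approach, scoring candidate pairs with COMPGED after normalizing the strings (a sketch with placeholder table names and a cutoff that must be tuned to the data):

```sas
/* score every candidate pair; lower COMPGED = closer match */
proc sql;
  create table possible_matches as
  select a.name as name_a, b.name as name_b,
         compged(upcase(compress(a.name, ',.')),
                 upcase(compress(b.name, ',.'))) as distance
  from file_a as a, file_b as b
  having calculated distance <= 200;   /* illustrative cutoff */
quit;
```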
True is not False: Evaluating Logical Expressions
Ronald Fehd
The SAS(R) software language provides methods to evaluate logical expressions
which then allow conditional execution of parts of programs. In cases where logical expressions contain combinations
of intersection (and), negation (not), and union (or),
later readers doing maintenance may
question whether the expression is correct.
The purpose of this paper is to provide
a truth table of Boole's rules,
De Morgan's laws, and SQL joins
for anyone writing complex conditional statements
in data steps, macros, or procedures with a where clause.
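De Morgan's laws can be checked directly in a data step, since SAS evaluates a true expression to 1 and a false one to 0:

```sas
data _null_;
  do a = 0, 1;
    do b = 0, 1;
      law1 = (not (a and b)) eq ((not a) or  (not b));
      law2 = (not (a or  b)) eq ((not a) and (not b));
      put a= b= law1= law2=;    /* both laws hold: law1=1 law2=1 */
    end;
  end;
run;
```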
AutoHotKey: an Editor-independent Alternative to SAS® Keyboard Abbreviations
Shane Rosanbalm
Have you ever been editing a SAS program in a text editor (UltraEdit,
NotePad++, etc.) and found yourself thinking, "I wish I had access to my SAS
keyboard abbreviations"? AutoHotKey (AHK) is free open-source macro creation
and automation software for Windows that allows users to automate repetitive
tasks. It's a lot like SAS keyboard abbreviations, but with one major perk: not
only does AHK work in your SAS editor, but it also works in text editors, word
processors, web browsers, and just about any other application you can think
of. If you like SAS keyboard abbreviations but wish they worked in other
applications, AHK is the missing piece of software you've been yearning for. In
this paper you will learn how to install AHK and get started creating your own
editor-independent abbreviations.
If you need these OBS and these VARS, then drop IF and keep WHERE
Jayanth Iyengar
Reading data effectively in the data step requires knowing the implications of
using various methods. The impact on efficiency is especially pronounced when
working with large data sets. Useful also is a working knowledge of data step
mechanics and constructs: the observation loop and the PDV. Individual
techniques for subsetting data have varying levels of efficiency and
implications for input/output time. Use of the WHERE statement/option to subset
observations consumes less resources than the subsetting IF statement. Also,
use of DROP and KEEP to select variables to include/exclude can be efficient
depending on how they're used.
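The contrast can be sketched as follows (data set and variable names are placeholders):

```sas
/* WHERE filters before rows enter the PDV; KEEP= limits I/O */
data subset;
  set big.claims(keep=id amount state);
  where state = 'NC';          /* applied at read time */
run;

/* subsetting IF reads every row into the PDV, then discards */
data subset2;
  set big.claims;
  if state = 'NC';             /* applied after the row is read */
run;
```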
An Introduction to SAS® Hash Programming Techniques
Kirk Paul Lafler
Beginning in Version 9, SAS software supports a DATA step programming technique
known as hash that enables faster table lookup, search, merge/join, and sort
operations. This presentation introduces what a hash object is, how it works,
and the syntax required. Essential programming techniques are illustrated to
define a simple key, sort data, search memory-resident data using a simple key,
match-merge (or join) two data sets, handle and resolve collision scenarios
where two distinct pieces of data have the same hash value, as well as more
complex programming techniques that use a composite key to search for multiple
values.
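A minimal table-lookup example of the technique (a sketch; the data set and variable names are placeholders):

```sas
data matched;
  /* load the lookup table into memory once */
  if _n_ = 1 then do;
    declare hash h(dataset: 'work.lookup');
    h.defineKey('id');
    h.defineData('label');
    h.defineDone();
  end;
  length label $40;
  call missing(label);
  set work.main;
  if h.find() = 0 then output;   /* 0 = key found; label is filled in */
run;
```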
Building a Better Dashboard Using Base-SAS® Software
Kirk Paul Lafler
Organizations around the world develop business intelligence dashboards to
display the current status of point-in-time metrics and key performance
indicators. Effectively designed dashboards often extract real-time data from
multiple sources for the purpose of highlighting important information,
numbers, tables, statistics, metrics, and other content on a single screen. This presentation introduces basic rules for good dashboard design and
the metrics frequently used in dashboards, to build a simple drill-down
dashboard using the DATA step, PROC FORMAT, PROC PRINT, PROC MEANS, ODS, ODS
Statistical Graphics, PROC SGPLOT and PROC SGPANEL in Base-SAS® software.
An Easy Route to a Missing Data Report with ODS+PROC FREQ+A Data Step
Mike Zdeb
A first step in analyzing data is making a decision on how to handle missing
values. That decision could be deletion of observations and/or variables with
excess missing data, substitution of imputed values for missing values, or
taking no action at all if the amount of missing data is insignificant and not
likely to affect the analysis.
This paper shows how to use an ODS OUTPUT
statement, PROC FREQ, and some data step programming to produce a missing data
report showing the percentage of missing data for each variable in a data set. Also shown is a method for identifying and dropping from a data set all variables with either all or a high percentage of missing values. The method
of producing the missing data report is less complicated and more compact than
several methods already proposed in other papers.
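One common shape of this technique (a sketch whose details may differ from the paper's exact code; `have` is a placeholder) maps every value to Missing/Not Missing with formats, captures PROC FREQ's output via ODS, and post-processes it in a data step:

```sas
proc format;
  value $missf ' ' = 'Missing' other = 'Not Missing';
  value  missf  .  = 'Missing' other = 'Not Missing';
run;

ods output OneWayFreqs=ow;
proc freq data=have;
  format _character_ $missf. _numeric_ missf.;
  tables _all_ / missing;
run;

/* keep only the 'Missing' rows: one line per variable */
data miss_report;
  set ow;
  length variable $32;
  variable = scan(Table, -1);    /* 'Table varname' -> varname */
  if coalescec(of F_:) = 'Missing';   /* the one populated F_ column */
  keep variable percent;
run;
```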
Array Programming Basics
John Cohen
Using arrays offers a wonderful extension to your SAS® programming toolkit. Whenever iterative processing is called for, they can make programming easier
and programs easier to maintain. You will need to learn some new syntax, but we
will explain several of the key components such as indexing and subscripts,
temporary and multi-dimensional arrays, determining array dimension, and a few
special tricks.
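A small example of the core idea, indexing a set of like-named columns (the data set and variables are placeholders):

```sas
/* convert 12 monthly revenue columns from dollars to thousands */
data thousands;
  set sales;                      /* assumed to contain rev1-rev12 */
  array rev(12) rev1-rev12;       /* subscripted view of the columns */
  do i = 1 to dim(rev);           /* dim() returns the array size */
    rev(i) = rev(i) / 1000;
  end;
  drop i;
run;
```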
Dynamically Changing Time Zones and Daylight Savings on Time Series Data
Chao-Ying Hsieh
SAS® programmers commonly deal with time zones and daylight saving time
changes, especially when working for a large corporation with multiple
subsidiaries located in different time zones. This paper presents a dynamic
way to convert GMT time to local time with daylight saving time changes
accounted for on time series data collected from smart meters. The technique
uses SAS® functions, formats, and macros to create a program that streamlines
the time conversion process.
Introduction to PROC REPORT
Kirk Paul Lafler
SAS users often need to create and deliver quality custom reports and
specialized output for management, end users, and customers. The SAS System
provides users with the REPORT PROCedure, a canned Base-SAS procedure,
for producing quick and formatted detail and summary results. This presentation
is designed for users who have no formal experience working with the REPORT
procedure. Attendees learn the basic PROC REPORT syntax using the COLUMN,
DEFINE, other optional statements, and procedure options to produce quality
output; explore basic syntax to produce basic reports; compute subtotals and
totals at the end of a report using a COMPUTE Block; calculate percentages;
produce statistics for analysis variables; apply conditional logic to
control summary output rows; and enhance the appearance of output results with
basic Output Delivery System (ODS) techniques.
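A minimal report of this kind, using the shipped SASHELP.CLASS data set (a sketch of the basic statements, not the presentation's exact code):

```sas
proc report data=sashelp.class nowd;
  column sex height weight;
  define sex    / group 'Sex';
  define height / analysis mean 'Avg Height' format=5.1;
  define weight / analysis mean 'Avg Weight' format=5.1;
  rbreak after / summarize;       /* grand-total row at the end */
run;
```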
The Power of PROC FORMAT
Jonas Bilenas and Kajal Tahiliani
The FORMAT procedure in SAS® is a very powerful and productive tool, yet many
beginning programmers rarely make use of it. The FORMAT procedure provides a
convenient way to do a table lookup in SAS. User generated FORMATS can be used
to assign descriptive labels to data values, create new variables and find
unexpected values. PROC FORMAT can also be used to generate data extracts and
used to merge data sets without having to sort large datasets. We will also look at generating user FORMATS using the PICTURE statement, which is sometimes confusing but powerful in modifying how numeric data is presented in reports. This paper will show you the POWER of PROC FORMAT for all
SAS users.
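A quick sketch of both uses, a VALUE format as a table lookup and a PICTURE format for display (data set and variable names are placeholders):

```sas
proc format;
  value agegrp                    /* table lookup: value -> label */
    low -< 13 = 'Child'
    13  -< 20 = 'Teen'
    20 - high = 'Adult';
  picture pay                     /* control how numbers display */
    low - high = '000,000,009.99' (prefix='$');
run;

data labeled;
  set people;
  length group $8;
  group = put(age, agegrp.);      /* create a new variable via PUT */
run;
```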
Divide & Conquer: Simple Sub-Datasets Creation with Call Execute
Dylan Holt and Wilhelmina Ross
When working with larger datasets, one often needs to isolate certain data
based on a range of criteria to disseminate sub-datasets, perform analyses, and
create reports. This task is simple when isolating just one or even a small
handful of sub-datasets from a larger one and may require substituting a few
values for the parameters each time one is creating a sub-dataset. However,
such a task can become cumbersome and redundant at best and error-ridden at
worst when attempting to create a large number of individual sub-datasets
manually. Moreover, consistent formatting and naming conventions are harder to
maintain when attempting this task one-by-one, and could lead to confusion when
visiting and using these sub-datasets at a later time. This paper demonstrates
the use of the Call Execute routine to systematically create sub-datasets
efficiently from a larger dataset while maintaining consistency in format and
naming conventions. In addition, the paper presents the flexibility of this
method when dividing the entire dataset into logical sub-datasets or selecting
only certain criteria to create sub-datasets. Lastly, the paper demonstrates
this method as a clear example of the powerful extension of the Macro facility
within the DATA Step as mediated by the Call Execute routine.
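The core pattern can be sketched as follows (table and variable names are placeholders): a driver data set supplies the values, and CALL EXECUTE generates one DATA step per value, which runs after the driver step finishes.

```sas
/* one sub-dataset per state, generated data-drivenly */
proc sql;
  create table states as
  select distinct state from work.master;
quit;

data _null_;
  set states;
  call execute(cats(
    'data work.sub_', state, ';',
    '  set work.master; where state = "', state, '";',
    'run;'));
run;
```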
PROC DATASETS; The Swiss Army Knife of SAS® Procedures
Michael Raithel
This paper highlights many of the major capabilities of PROC DATASETS. It
discusses how it can be used as a tool to update variable information in a SAS
data set; provide information on data set and catalog contents; delete data
sets, catalogs, and indexes; repair damaged SAS data sets; rename files; create
and manage audit trails; add, delete, and modify passwords; add and delete
integrity constraints; and more. The paper contains examples of the various
uses of PROC DATASETS that programmers can cut and paste into their own
programs as a starting point. After reading this paper, a SAS programmer will
have practical knowledge of the many different facets of this important SAS
procedure.
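A few of those capabilities in one sketch (library, data set, and variable names are placeholders):

```sas
proc datasets library=work nolist;
  /* rename a data set in place */
  change olddata = newdata;
  /* update variable attributes without rewriting the data */
  modify newdata;
    rename oldvar = newvar;
    label  newvar = 'Descriptive label';
    format newvar 8.2;
  /* remove what is no longer needed */
  delete scratch1 scratch2;
quit;
```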
SAS® Debugging 101
Kirk Paul Lafler
SAS® users are always surprised to discover their programs contain bugs (or
errors). In fact, when asked, users will emphatically stand by their programs
and logic by saying they are error free. But the vast number of experiences, along with the realities of writing code, says otherwise. Errors in program code can appear anywhere, often accidentally introduced by users as they write code. No matter where an error occurs, the overriding sentiment among most
users is that debugging SAS programs can be a daunting, and humbling, task. Attendees learn about the various error types, identification techniques, their
symptoms, and how to repair program code to work as intended.
Introduction to ODS Statistical Graphics
Kirk Paul Lafler
Delivering timely and quality looking reports, graphs and information to
management, end users, and customers is essential. This presentation provides
SAS® users with an introduction to ODS Statistical Graphics found in the
Base-SAS software. Attendees learn basic concepts, features and applications of
ODS statistical graphic procedures to create high-quality, production-ready
output; an introduction to the statistical graphic SGPLOT procedure, SGPANEL
procedure, and SGSCATTER procedure; and an illustration of plots and graphs
including histograms, vertical and horizontal bar charts, scatter plots, bubble
plots, vector plots, and waterfall charts.
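The simplest of these plots takes only a few lines with the shipped SASHELP.CLASS data set:

```sas
ods graphics on;
proc sgplot data=sashelp.class;
  histogram height;        /* binned distribution */
  density height;          /* overlaid normal density curve */
run;
```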
Creating Test Data Using SAS® Hash Tables
LaSelva Gwen
SAS® programmers sometimes need to simulate real data. Real data often cannot
be used because it contains confidential information or is unavailable. This
paper describes generating test data for the New York State Congenital
Malformations Registry. The data was required to be in a flat text file format
and to contain a medical record number, last name, first name, street address,
city, state, ZIP code, sex, two dates, and from one to 12 diagnosis codes with
descriptions. The data was created in a data step using SAS random number
generating functions to generate values directly or to select items from hash
tables. The selected items were written to a text file with PUT statements. The code was written in SAS® 9.4 under Windows 7.
Taming the Bear Make Your Programs Easier to Control and Monitor
Bob Bolen
SAS® programs that run for any extended period of time but encounter some
process or data error can turn a good day bad in a hurry. This paper will
examine some of the methods we have used to control and monitor these types of
jobs. It will look at the use of macros for breaking your code into manageable
blocks, variable validation for checking that a program runs correctly and
using email to give alert/status notifications.
Using PROC EXPAND to Easily Manipulate Longitudinal and Panel Data
Matthew Hoolsema
Working with longitudinal and panel data comes with many data structure
challenges. In order to merge data from multiple sources, analysts often have
to manipulate a longitudinal dataset to make the structure of the two datasets
match. Other times, a particular analysis may be more conducive to a wide
dataset than a long dataset, or vice versa. New SAS® users (or experienced
SAS® users new to analyzing longitudinal data) can find writing code to
manipulate their panel data and prepare an analysis dataset particularly
challenging.
This paper describes how PROC EXPAND can be used as a simple yet powerful tool
for manipulating longitudinal and panel data. Examples explore
calculating lags, leads, and moving averages of time varying observations within a panel,
as well as reshaping a dataset from long to wide without the need to use arrays
or macros. Also shown is how PROC EXPAND can be used in conjunction with other
SAS® procedures to easily calculate trends over time. Other examples
demonstrate using PROC EXPAND to collapse time periods (i.e. converting monthly
observations to quarterly observations), interpolate values between observed
time periods, and perform data transformations on variables.
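The lag/lead/moving-average case can be sketched as follows (a minimal example with placeholder names; PROC EXPAND requires SAS/ETS, and METHOD=NONE suppresses interpolation):

```sas
/* lag, lead, and 3-period moving average within each panel */
proc expand data=panel out=want method=none;
  by id;
  id date;
  convert y = y_lag1  / transformout=(lag 1);
  convert y = y_lead1 / transformout=(lead 1);
  convert y = y_ma3   / transformout=(movave 3);
run;
```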
Handling Numeric Representation SAS® Errors Caused by Simple Floating-Point Arithmetic Computation
Fuad Foty
Every SAS® programmer knows that the 8-byte maximum length of the numeric data type imposes a machine-specific limit on numbers and their computations. In fact, all numeric SAS® values are represented as 64-bit floating-point numbers. At the Census Bureau we deal with specific survey replicate weights
that can result in very large positive or negative numbers due to simple
calculations of various weights and factors. Rounding is one way to control
the numbers and correct errors that are due to iterative computations. Understanding how floating-point arithmetic works and being aware of its limit is important when heavy computations are involved in millions of survey
records. Adding and subtracting obvious numbers results in surprisingly
different values. In this paper, I will show ways to handle errors caused by
computations and how to avoid obvious issues dealing with large or small
numbers.
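The classic illustration, and the ROUND remedy the paper's approach rests on, fit in a few lines:

```sas
data _null_;
  x = 0.1 + 0.2;
  if x = 0.3 then put 'equal';
  else put 'NOT equal: ' x= hex16.;   /* show the binary representation */
  /* round to a sensible precision before comparing */
  if round(x, 1e-9) = round(0.3, 1e-9) then put 'equal after ROUND';
run;
```

In 64-bit floating point, 0.1 + 0.2 is not exactly 0.3, so the first test fails while the rounded comparison succeeds.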
Writing Code With Your Data: Basics of Data-Driven Programming Techniques
Joe Matise
In this presentation aimed at SAS® programmers who have limited experience
with data step programming, we discuss the basics of Data-Driven Programming,
first by defining Data-Driven Programming, and then by showing several easy to
learn techniques to get a novice or intermediate programmer started using
Data-Driven Programming in their own work. We discuss using PROC SQL SELECT
INTO to push information into macro variables; PROC CONTENTS and the dictionary
tables to query metadata; using an external file to drive logic; and generating
and applying formats and labels automatically.
Prior to attending this presentation, programmers should be familiar with the
basics of the data step; should be able to import data from external files;
basic understanding of formats and variable labels; and should be aware of both
what a macro variable is and what a macro is. Knowledge of macro programming
is not a prerequisite for understanding the concepts in this presentation.
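The first technique mentioned, SELECT INTO, can be sketched with the shipped SASHELP.CLASS data set:

```sas
/* push data values into a macro variable, then use them as code */
proc sql noprint;
  select distinct name
    into :namelist separated by ' '
    from sashelp.class
    where age > 14;
quit;
%put Older students: &namelist;
```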
All Aboard! Next Stop is the Destination Excel
William E Benjamin Jr.
Over the last few years both Microsoft Excel file formats and the SAS®
interfaces to those Excel formats have changed. SAS® has worked hard to make
the interface between the two systems easier to use, starting with Comma-Separated Values files and moving to PROC IMPORT and PROC EXPORT, LIBNAME processing, SQL processing, SAS® Enterprise Guide®, JMP®, and then on to HTML and XML tagsets like MSOFFICE2K and EXCELXP. Well, there is now a new
entry into the processes available for SAS users to send data directly to
Excel. This new entry into the ODS arena of data transfer to Excel is the ODS
destination called EXCEL. This process is included within SAS ODS and produces
native format Excel files for version 2007 of Excel and later. It was first
shipped as an experimental version with the first maintenance release of SAS®
9.4. This ODS destination has many features similar to the EXCELXP tagsets.
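In its simplest form (a sketch; the output path is a placeholder), the destination wraps any procedure output:

```sas
ods excel file='/tmp/report.xlsx'          /* placeholder path */
    options(sheet_name='Class' embedded_titles='yes');
title 'Heights and Weights';
proc print data=sashelp.class noobs;
run;
ods excel close;
```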
Coder's Corner
Using an Array to Examine Gastric and Colorectal Cancer Risk Factors
Verlin Joseph
Colorectal and gastric cancers are two of the most prevalent forms of cancer
worldwide. One of the key bacterial risk factors for gastrointestinal cancers
is Helicobacter Pylori (H. Pylori). This study seeks to examine the
association of H. Pylori in colorectal and gastric cancer. The analysis was
conducted using 2002-2012 Florida Agency for Health Care Administration (AHCA)
hospital discharge data. Each data set contained millions of records and up to
thirty diagnosis codes. Arrays were utilized to quickly locate the variables
of analysis. The purpose of this presentation is to illustrate how programmers
may use arrays to analyze big data sets.
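A hedged sketch of the array technique described above; the data set, variable names, and diagnosis code (ICD-9 041.86 for H. pylori) are illustrative:

```sas
/* Scan up to thirty diagnosis fields per discharge record */
data flagged;
   set discharges;
   array dx{30} $ diag1-diag30;
   hpylori = 0;
   do i = 1 to dim(dx);
      if dx{i} = '04186' then hpylori = 1;
   end;
   drop i;
run;
```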
What Do You Mean My CSV Doesn't Match My SAS® Dataset?
Patricia Guldin and Young Zhuge
Statistical programmers are responsible for delivering high quality and
reproducible analysis datasets to statisticians, modelers and other
quantitative scientists. Regardless of the format (e.g. SAS dataset or .csv),
the content of the datasets should be identical. Converting SAS datasets to
other formats can be easily accomplished in SAS, but the consistency between
the output files must be included in the quality control checks. An example
will be given where tables and figures created by a statistician (using the SAS
dataset) and those created by a modeler (using the .csv dataset) were
different. We will provide the results of our exploration into why the
inconsistencies occurred and the steps taken to ensure reliability for
subsequent data exports / format conversions.
Using SAS® to Get a List of File Names and Other Information from a Directory
Imelda Go
The ability to create a data set which contains the names of all the files in a
directory can be very useful. This paper goes through three possible uses of
this file information data set. The first is when you have many files with the
same input structure that need to be read from the same directory. Instead of
hard-coding the file name to create each of the many data sets, apply a SAS
macro to each file name in the file information data set to create each data
set. The second is when you are collecting data files and want to determine
whether all expected files are in a directory and, if not, which ones are
missing. The third is when you are running out of computer storage space and
need to identify files based on file size, age, etc. for possible deletion or
other action. An often
overlooked source of solutions is the SAS Knowledge Base online. Two SAS
samples (24820 and 41880) were used to address the above situations.
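The core of such a file information data set can be sketched with Base SAS directory functions (the directory path is illustrative):

```sas
/* Build a data set listing every file name in a directory */
data filelist;
   length fname $260;
   rc  = filename('dir', 'c:\mydata');   /* point a fileref at the folder */
   did = dopen('dir');                   /* open the directory            */
   do i = 1 to dnum(did);                /* loop over its members         */
      fname = dread(did, i);
      output;
   end;
   rc = dclose(did);
   keep fname;
run;
```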
PROC DOC III: Self-generating Codebooks Using SAS®
Louise Hadden
This paper will demonstrate how to use good documentation practices and SAS®
to easily produce attractive, camera-ready data codebooks (and accompanying
materials such as label statements, format assignment statements, etc.). Four
primary steps in the codebook production process will be explored: use of SAS
metadata to produce a master documentation spreadsheet for a file; review and
modification of the master documentation spreadsheet; import and manipulation
of the metadata in the master documentation spreadsheet to self-generate code
to be included to generate a codebook; and use of the documentation metadata to
self-generate other helpful code such as label statements. Full code for the
example shown (using the SASHELP.HEART data base) will be provided.
A Five-Step Quality Control Method: Checking for Unintended Changes to SAS® Datasets
Aaron Brown
A SAS® programmer may need to make many edits or changes to a given data set. There is always a risk that one's code will have unintended consequences, leading to unintended changes. This paper describes a five-step quality
control method that allows a programmer to quickly and systematically check for
any changes in the number of records, variables in a dataset, or values in a
dataset, in order to assure that the intended changes and only the intended
changes occurred. This quality control method utilizes the COMPARE and MEANS
procedures.
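The two procedures named above can be combined along these lines (data set names are illustrative):

```sas
/* Step check: did only the intended records, variables, and values change? */
proc compare base=before compare=after listall;
run;

/* Distributional check on the numeric variables */
proc means data=after n nmiss mean min max;
run;
```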
Talk to Me!
Elizabeth Axelrod
Wouldn't it be nice if your long-running program could tap you on the
shoulder and say, "Okay, I'm all done now"? It can! This quick tip will
show you how easy it is to have your SAS® program send you (or anyone else) an
email during program execution. Once you've got the simple basics down,
you'll come up with all sorts of uses for this great feature, and you'll
wonder how you ever lived without it.
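A minimal sketch, assuming your site has the EMAILSYS= and EMAILHOST= system options configured (the address is illustrative):

```sas
/* Point a fileref at an e-mail message */
filename notify email
   to='you@example.com'
   subject='Long job finished';

/* Writing to the fileref sends the message */
data _null_;
   file notify;
   put "The job completed at %sysfunc(datetime(), datetime20.).";
run;
```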
Extract Information from Large Database Using SAS® Array, PROC FREQ, and SAS Macro
Lifang Zhang
SAS® software offers the statistician and programmer many ways to extract
useful information from large SAS data sets for statistical analysis and
reporting. The techniques discussed use Base SAS software: the commonly used
PROC IMPORT, SAS arrays, PROC FREQ with OUTPUT, the SAS macro facility, and
PROC EXPORT.
This presentation provides a detailed blueprint for 1) using PROC IMPORT to
import Excel data; 2) using arrays to detect specific diseases (diagnostic
codes); 3) using PROC FREQ and OUTPUT to generate indicator variables and data
sets; 4) merging them together to produce one summary report or data set; 5)
using the SAS macro facility to repeat steps 2 through 4; and 6) exporting the
final data sets or tables to a defined destination.
The paper uses simulated diabetes clinic data from 2012 to 2014, with ICD-9
diabetes diagnostic codes in the range 250.00-250.80. The paper will show 1)
how to extract each diagnostic code by year for all visits; and 2) how to count
all diabetes diagnostic codes for each unique patient for each year, so that
each patient is counted only once per year.
The Excel tables will be created and exported to a defined destination at the
end of the program.
Tips for Pulling Data from Oracle® Using PROC SQL® Pass-Through
John Cohen
For many of us a substantial portion of our data reside outside of SAS®. Often
these are in a DBMS (Database Management System) such as Oracle, DB2, or MySQL. The great news is that the data will be available to us in an
already-structured format, likely requiring a minimum of reformatting
effort. Secondly, these DBMS come with an array of
manipulation tools of which we can take advantage. The not so good news is that
the syntax required for pulling these data may be somewhat unfamiliar to us.
We will offer several tips for making this process smoother for you, including
how to leverage a number of the DBMS tools. We will take advantage of the
robust DBMS engines to do a lot of the preliminary work for us, thereby
reducing the memory, work space, sort space, data storage, and CPU cycles
required of the SAS server, which is usually optimized for analytical work
while being relatively weaker (than the DBMS) at the heavy lifting required for
initial data selection and manipulation in an increasingly Big Data environment. Finally, we will make our SAS administrators happy by reducing some of the load in that environment.
Some _FILE_ Magic
Mike Zdeb
The use of the SAS® automatic variable _INFILE_ has been the subject of
several published papers. However, discussion of possible uses of the
automatic variable _FILE_ has been limited to postings on the SAS-L listserv
and on the SAS Support Communities web site. This paper shows several uses of
the variable _FILE_, including creating a new variable in a data set by
concatenating the formatted values of other variables and recoding variable
values.
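One such use, concatenating formatted values into a new variable, can be sketched as follows (a sketch of the general idea, not the paper's code):

```sas
data combined;
   set sashelp.class;
   length all $60;
   file log;           /* any FILE target gives the step an output buffer   */
   put name sex age @; /* format the values into the buffer, hold the line  */
   all = _file_;       /* capture the formatted buffer as a variable        */
   _file_ = ' ';       /* clear the buffer so nothing reaches the log       */
run;
```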
Personalized Birthday Wisher (PBW): An Indispensable SAS® Tool for Your Workplace
Jinson Erinjeri and Angela Soriano
Personalized Birthday Wisher (PBW) is a customizable birthday wishing tool
which operates on the input provided to it. The input to this tool includes
various attributes, based on which a birthday wish comprising an image and
text as its content is selected and delivered via email. This is an automated
tool which sends a birthday wish to a team member on his/her birthday based on
certain attributes, giving the wish a personalized touch. The attributes chosen
for this tool are entirely up to the decision maker, making this tool highly
customizable. This paper presents an application of the tool with simple
predetermined attributes that can be tailored as needed by the team. In
addition, the paper also points to various customization possibilities.
Macro Code to Test Existence of Various Objects
Ronald Fehd
SAS® software provides functions to check the existence of the objects it
manages: catalogs and data sets, as well as folders referred to by the FILENAME
and LIBNAME statements. The purpose of this paper is to provide a set of macro
statements for assertions of existence and to highlight the exceptions where
these functions return non-Boolean values.
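A minimal assertion macro built on one of these functions (EXIST for data sets; FILEEXIST and LIBREF play analogous roles for external files and librefs):

```sas
%macro assert_dataset(ds);
   %if %sysfunc(exist(&ds)) %then
      %put NOTE: data set &ds exists.;
   %else
      %put ERROR: data set &ds not found.;
%mend assert_dataset;

%assert_dataset(sashelp.class)
```

Note that LIBREF is one of the exceptions alluded to above: it returns zero when the libref is assigned, so a bare %IF test of its return value reads backwards.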
When PROPCASE Isn't Proper: A Macro Supplement for the SAS® Function
Alex Buck
Formatting is important for clarity when reporting descriptive data such as
country or subject status. In comes PROPCASE to save the day. "Indonesia"
and "Protocol Violation" are certainly easier to read than their UPCASE
counterparts. However, PROPCASE is limited. It will convert every word in the
argument without exception, creating phrases such as "United States Of
America" and "Withdrawal By Subject". These distractions can sometimes be
anticipated and fixed, but as new data are collected there is a risk of a
renegade "To", "And", or "Out" hiding somewhere. This paper
introduces a macro supplement for PROPCASE which contains a default list of
common lowercase words and the option to add or exclude words at the user's
discretion. With this macro, the user can specify lowercase words in an
efficient and dynamic way, allowing confidence in formatting across deliveries.
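The underlying fix can be sketched in a single step (the word handled here is illustrative; the macro generalizes this over a whole list):

```sas
data _null_;
   length s $60;
   s = propcase('UNITED STATES OF AMERICA');
   s = tranwrd(s, ' Of ', ' of ');  /* demote a connector word PROPCASE capitalized */
   put s=;                          /* s=United States of America */
run;
```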
Using CALL VNAME to Populate Missing Data from a Default Values Lookup Dataset
Raghav Adimulam
When SAS is used to read ASCII data files, there is sometimes a need to recode
missing data or to alter already existing data values. Often, the values to be
substituted are in an Excel spreadsheet, which can be imported to SAS as a
lookup table. This paper walks through the creation of a program that populates
default values in a column when that column's information is empty in the
input file. The relevant functions and routines used in this process are CALL
VNAME, CMISS, Arrays, and DO loops. The goal is to output specific data
patterns from the lookup datasets into the variables with missing values.
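A hedged sketch of the approach, assuming a format $DFLT. has been built (e.g., with PROC FORMAT and CNTLIN=) from the imported lookup table, mapping each variable name to its default value:

```sas
data filled;
   set raw;                         /* RAW is an illustrative input data set */
   array nums{*} _numeric_;
   length vname $32;
   do i = 1 to dim(nums);
      if missing(nums{i}) then do;
         call vname(nums{i}, vname);               /* which variable is empty? */
         nums{i} = input(put(vname, $dflt.), 32.); /* fetch its default        */
      end;
   end;
   drop i vname;
run;
```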
Simplifying the Use of Multidimensional Array in SAS®
Alec Zhixiao Lin
A look-up table usually contains two or more attributes. A multidimensional
array is a common and effective way to code such a table in SAS. However, many
users find this practice confusing and daunting. This paper
introduces a process that greatly simplifies the use of multidimensional arrays. Users only
need to define the dimensions of a look-up table and paste it into SAS, with no
need to understand the logic for programming multidimensional arrays in SAS. The macro will automatically kick off the process of assigning the given values to each record.
Flexible Programming with Hash Tables
Joe Matise
When developing a general application, it often pays to use flexible techniques
that will enable an application to handle a diverse set of inputs with a simple
and concise syntax, while efficiently producing a consistent final output. This paper will show intermediate to advanced SAS® programmers how to make use of hash tables to produce powerful applications that are easy to use.
The example application, a data cleaning program, will enable users to specify
connections between different datasets using a straightforward syntax through
the use of hash tables. We will also introduce the concept of hash tables for
users who are unfamiliar with hash tables and their use in SAS programming.
Before reading this paper, users should have a good understanding of the SAS
data step, should have some familiarity with basic macro programming, should be
familiar with data driven programming techniques, and should be comfortable
combining multiple tables.
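As a reminder of the basic pattern (data set and variable names are illustrative), a hash table loads a lookup once and serves keyed retrievals for every incoming row:

```sas
data matched;
   if _n_ = 1 then do;
      length label $40;
      call missing(label);                    /* host variable for retrieved data */
      declare hash h(dataset: 'work.lookup'); /* load the lookup into memory      */
      h.defineKey('id');
      h.defineData('label');
      h.defineDone();
   end;
   set work.main;
   if h.find() = 0;   /* keep only rows whose ID appears in the lookup */
run;
```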
Using SAS® Macros to Extract P-values from PROC FREQ
Rachel Straney
This paper shows how to leverage the SAS® Macro Facility with PROC FREQ to
obtain multiple chi-square test statistics and their associated p-values,
collecting them into one data set as a quick solution to the common variable
selection problem. The purpose of this paper is to provide a simplified macro
function that can be used to identify important factors in a study. Although
the use of PROC FREQ in this macro limits its use to categorical data,
references to other SAS papers will be summarized for readers to get a better
understanding of how this concept can be expanded upon.
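The heart of such a macro is routing PROC FREQ's chi-square table through ODS OUTPUT (data set and variable names are illustrative):

```sas
ods output ChiSq=chi_all;           /* capture the chi-square statistics */
proc freq data=study;
   tables outcome*(age_grp sex region) / chisq;
run;

proc print data=chi_all;
   where Statistic = 'Chi-Square';  /* one p-value per candidate variable */
   var Table Value Prob;
run;
```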
Dynamically Setting Decimal Precision Using PUTN
Brandon Welch
In the pharmaceutical industry, there are often rules for displaying the number
of decimal places in a report. This is particularly important when reporting
laboratory results in a table. For example, for a particular laboratory test,
if the laboratory result in the data is carried to the tenths place, the
displayed mean will be reported at the hundredths place. The common rule is
displaying the mean result one place higher than what is in the data. There are
similar rules for other statistics such as the median, standard deviation,
minimum, and maximum. This paper will illustrate a simple technique using the
PUTN function to dynamically set the decimal precision. The techniques
presented offer a good overview of basic SAS functions that will educate
programmers at all levels.
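A minimal sketch of the technique, assuming a variable DECIMALS records each laboratory test's collected precision:

```sas
data report;
   set stats;
   length mean_c $12;
   fmt    = cats('12.', decimals + 1); /* e.g. '12.2' when data carry tenths */
   mean_c = putn(mean, fmt);           /* format chosen at run time          */
run;
```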
Your Own SAS® Macros Are as Powerful as You Are Ingenious
Yinghua Shi
This article proposes, for user-written SAS macros, separate definitions for
function-style macros and routine-style macros. A function-style macro returns
a value, while a routine-style macro does not. Implementation of function-style
macros follows a set of rules that are not identical to the rules for
routine-style macros. In client code, the method
for invoking these two types of macros is not the same, either. Just as the SAS
language has a distinction between functions and CALL routines, it is natural
for macros to also have that distinction.
With that distinction in place, this article describes the proper approach and
rules to follow for writing function-style macros, and for writing
routine-style macros. Usage of each kind of macro is also provided to describe
the proper way of invoking the macros in client code.
This article also points out some common problems in writing and invoking
macros that appeared in published articles. Some of those problems can cause
errors in SAS programs, and therefore should be guarded against when designing
and coding macros.
For each key point discussed in this article, working sample SAS code is
provided to further illustrate what to do, what not to do, what to avoid, and
what good, or not-so-good, coding practices look like when writing user-written macros.
Cover Your Assumptions with Custom %str(W)ARNING Messages
Shane Rosanbalm
In clinical trial work we often have to write programs based on very little
(and mostly clean) data. But an experienced programmer's lizard brain is
constantly warning them that their clean log is an illusion, that there is
likely dirty data lurking just down the road. And sadly, the lizard brain is
usually right. What's a programmer to do?
In this paper we explore using custom WARNING messages in both the data step
(with PUT) and the macro language (with %PUT). This programming technique
allows us to write simple programs for the data we have while at the same time
protecting ourselves from the dirty data we fear.
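Both flavors can be sketched briefly (data set and variable names are illustrative). The literal is split as 'WARN' 'ING' so that scanning the program source for the word never fires; only a genuine run-time hit writes it to the log:

```sas
/* DATA step: flag unexpected values as they arrive */
data _null_;
   set lab;
   if units not in ('mg/dL', 'mmol/L') then
      put 'WARN' 'ING: unexpected units for ' subjid= units=;
run;

/* Macro language: guard an assumption before the real work starts */
%macro check_parm(ds);
   %if not %sysfunc(exist(&ds)) %then
      %put %str(W)ARNING: input data set &ds does not exist.;
%mend check_parm;
```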
From Professional Life to Personal Life: SAS® Makes It Easy
Nushrat Alam
You have been using SAS in your professional life, but how about your personal
life? Got a long to-do list for next week, with grocery shopping on it? Let SAS make your personal life easier by providing you with an up-to-date grocery
list. This paper describes an innovative way of applying SAS to make our
routine tasks more interesting and thus expand our creativity to a new level. In the process of generating a customized grocery list, this paper will show the
use of the SAS Output Delivery System along with PROC REPORT and its
options, traffic lighting, and SAS time functions. Let your SAS-siness play and
be amazed by your own creativity.
WAPTWAP, but remember TMTOWTDI
Jack Shoemaker
Writing a program that writes a program (WAPtWAP) is a technique that allows
one to solve computing problems in a flexible and dynamic fashion. Of course,
there is more than one way to do it (TMTOWTDI). This presentation will explore
three WAPtWAP methods available in the SAS system.
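One such method, CALL EXECUTE, can be sketched as follows: code is assembled from data values and queued for execution after the generating step ends:

```sas
proc sort data=sashelp.class(keep=sex) out=levels nodupkey;
   by sex;
run;

/* Generate one PROC FREQ step per distinct value of SEX */
data _null_;
   set levels;
   call execute(cats(
      'proc freq data=sashelp.class; where sex="', sex,
      '"; tables age; run;'));
run;
```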
Data Management/Big Data
A New Method To Deal With 2 Level Variables in Big Data Analysis
John Gao
In big data, variables often have multiple levels. Here we mainly study big
data with two levels, or layers (a child level and a parent level), such as
patient and hospital in the medical area. A patient's outcome is greatly
related to
- health condition at the individual patient level (child level)
- hospital and doctor service quality (parent level).
This paper presents a new approach for big data with two levels. The outcome
is binary at the child level. The first step is to develop a predictive model
at the child level for the binary outcome, such as patients or trialers for
re-admissions or conversions. The aggregation of the predicted probability of
the binary outcome for each parent is the expected, or natural, overall binary
outcome. The aggregation of actual outcomes across all children of a parent may
not be the same as the overall expected outcome; the difference between the
actual outcome and the expected outcome is due to the parent's
performance. Therefore, the second step is to identify the impact of the
parent's performance on the child outcome. In this step, the dependent variable
is continuous: the ratio of the aggregation of all actual outcomes over the
aggregation of all expected probabilities for all children. The second-step
analysis then identifies the drivers from the parents' information. We can then
combine the two model scores into one by normalizing the expected probability
of all children and the expected ratio outcome of all parents. The final binary
outcome is predicted by weighting the child's information and the parent's
information. Logistic and linear regression procedures in SAS were used for
this study. The result from this new approach has better prediction with the
parent's information.
Fuzzy Name-Matching Applications
Alan Dunham
Name-matching among two or more lists of names can be ambiguous and problematic
for several reasons. These problems are the same for corporations and
government organizations, leading to a variety of negative outcomes, such as
rejected applications, missed business customer opportunities, lost payment
vouchers, duplicate bills, and overlooking applicant criminal records.
Base SAS code is discussed that uses the Spedis function (available since SAS 6.0)
to significantly reduce the number of name mis-matches in a comprehensive,
generic manner. The code can be applied to treat Romanization and
transliteration spelling variances, differing use of diacritical marks in
multiple lists, inconsistent use of titles, sub-names appearing in different
order (such as reversal of surname and given name), incomplete names, trailing
or leading blanks, mixed use of punctuation, and varying use of capitalization. This paper shows the application and the effectiveness of this SAS solution to a problem common to many organizational settings.
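As a minimal sketch of SPEDIS-based scoring (the list names and the cutoff of 15 are illustrative; smaller scores mean closer matches):

```sas
proc sql;
   create table candidates as
   select a.name as name_a, b.name as name_b,
          spedis(upcase(a.name), upcase(b.name)) as dist
   from list_a as a, list_b as b   /* score all candidate pairs */
   where calculated dist <= 15;    /* keep only the near matches */
quit;
```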
Using Proc FCMP To Improve Fuzzy Matching
Christine Warner
This paper will address how to utilize Proc FCMP and user-defined functions to
enhance fuzzy matching techniques. In addition to utilizing COMPLEV and
COMPGED, I have developed several user-defined functions that improve the
accuracy and completeness of fuzzy matching. "Fuzzy matching" is a term known
to many programmers as matching on non-exact character strings; these could be
names, addresses, invoice numbers, or any piece of data. Several edit-distance
functions such as Jaro-Winkler, General Edit Distance (GED), Levenshtein, and
SoundEx make fuzzy matching easier, but they each have their shortcomings. Many
functions, such as Jaro-Winkler, are great for comparing one word with another
word, but not as useful for comparing entire phrases. This abstract will
explain in detail how to harness the power of existing and new user-defined
functions to develop the best possible fuzzy matching plan.
List of Experience Fuzzy-Matching:
- Currently the Lead Architect to match Medicaid Providers with Medicare Providers for the Country
- Experience Matching Healthcare Providers against the List of Excluded Individuals and Entities (LEIE)
- Experience Matching Financial Data Against the Terrorist Watch List
- Experience Matching Student Names in an Accounts Receivable Database, in the absence of a reliable Student ID
- Creator of the PCTTRIGRAM Function.
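A hedged sketch of a PROC FCMP user-defined function in this spirit: a normalized Levenshtein similarity over whole strings (this is illustrative, not the author's PCTTRIGRAM function):

```sas
proc fcmp outlib=work.funcs.match;
   function namesim(a $, b $);
      la = lengthn(strip(a));
      lb = lengthn(strip(b));
      d  = complev(upcase(strip(a)), upcase(strip(b)));
      return (1 - d / max(la, lb, 1)); /* 1 = identical, 0 = fully disjoint */
   endsub;
run;

options cmplib=work.funcs;  /* make the function visible to later steps */
```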
Super Boost Data Transpose Puzzle
Ahmed Al-Attar
This paper compares different solutions to a data transpose puzzle presented to
the SAS User Group at the US Census Bureau (CenSAS). The presented solutions
ranged from a SAS 101 multi-step solution to an advanced solution utilizing
less widely known techniques, yielding 85% run-time savings.
Using SAS® Hash Object to Speed and Simplify Survey Cell Collapsing Process
Ahmed Al-Attar
This paper introduces an extremely fast and simple implementation of the Survey
Cell Collapsing Process. Prior implementations had either utilized several SQL
queries or numerous data step arrays, with multiple data reads. This new
approach utilizes a single hash object with a maximum of two data reads. The
Hash object provides an efficient and convenient mechanism for quick data
storage and retrieval (sub-second total run time).
Removal of PII
Stanley Legum
At the end of a project, IRBs require project directors to certify that no
personally identifiable information (PII) is retained. This paper briefly
reviews what information is considered PII and explores how to identify
variables containing PII in a given project. It then shows a comprehensive way
to ensure that all SAS variables containing PII have their values set to NULL
and how to use SAS to document that this has been done.
What to Expect When You Need to Make a Data Delivery: Helpful Tips and Techniques
Tom McCall and Louise Hadden
Making a data delivery to a client is a complicated endeavor. There are many
aspects that must be carefully considered and planned for: de-identification,
public use versus restricted access, documentation, ancillary files such as
programs, formats, and so on, and methods of data transfer, among others. This
paper provides a blueprint for planning and executing your data delivery, and
will walk you through the questions you should ask, the resources you should
check, the resources you should create, and the SAS® tools that will help you
along the way. In essence, we'll travel back to the future to find out
exactly what you need to be doing from the very start of your project to ensure
a successful data delivery.
Sorting a Bajillion Records: Conquering Scalability in a Big Data World
Troy Hughes
"Big data" is often distinguished as encompassing high volume, velocity, or
variability of data. While big data can signal big business intelligence and
big business value, it also can wreak havoc on systems and software
ill-prepared for its profundity. Scalability describes the ability of a system
or software to adequately meet the needs of additional users or its ability to
utilize additional processors or resources to fulfill those added requirements. Scalability also describes the adequate and efficient response of a system to increased data throughput. Because sorting data is one of the most common as
well as resource-intensive operations in any software language, inefficiencies
or failures caused by big data often are first observed during sorting
routines. Much SAS® literature has been dedicated to optimizing big data sorts
for efficiency, including minimizing execution time and, to a lesser extent,
minimizing resource usage (i.e., memory and storage consumption). Less
attention has been paid, however, to implementing big data sorting that is
reliable and robust even when confronted with resource limitations. To that
end, this text introduces the SAFESORT macro that facilitates a priori
exception handling routines (which detect environmental and data set attributes
that could cause process failure) and post hoc exception handling routines
(which detect actual failed sorting routines). If exception handling is
triggered, SAFESORT automatically reroutes program flow from the default sort
routine to a less resource-intensive routine, thus sacrificing execution speed
for reliability. However, because SAFESORT does not exhaust system resources
like default SAS sorting routines, in some cases it performs more than 200
times faster than default SAS sorting methods. Macro modularity moreover allows
developers to select their favorite sorting routine and, for data-driven
disciples, to build fuzzy logic routines that dynamically select a sort
algorithm based on environmental and data set attributes.
Spawning SAS® Sleeper Cells and Calling Them into Action: Implementing Distributed Parallel Processing in the SAS University Edition Using Commodity Computing To Maximize Performance
Troy Hughes
With the 2014 launch of the SAS® University Edition, the reach of SAS was
greatly expanded to educators, students, researchers, and non-profits who could
for the first time utilize a full version of Base SAS software for free,
enabling SAS to better compete with open source solutions such as Python and R. Because the SAS University Edition allows a maximum of two CPUs, however,
performance is curtailed sharply from more substantial SAS environments that
can benefit from parallel and distributed processing, such as designs that
implement SAS Grid Manager, Teradata, and Hadoop solutions. Even when comparing
performance of the SAS University Edition against the most straightforward
implementation of SAS Display Manager (running on the same computer), SAS
Display Manager demonstrates significantly greater performance. With parallel
processing and distributed computing (including programmatic and
non-programmatic methods) becoming the status quo in SAS production software,
the SAS University Edition will unfortunately continue to fall behind its SAS
counterparts if it cannot harness parallel processing best practices. To curb
this performance disparity, this text introduces groundbreaking programmatic
methods that enable commodity hardware to be networked so that multiple
instances of the SAS University Edition can communicate and work collectively
to divide and conquer complex tasks. With parallel processing enabled, a SAS
practitioner can now easily harness an endless number of computers to produce
blitzkrieg solutions with SAS University Edition that rival the performance of
those produced on costly, complex infrastructure.
Your database can do complex string manipulation too!
Harry Droogendyk
Since databases often lacked the extensive string handling capabilities
available in SAS, SAS users were often forced to extract complex character data
from the database into SAS for string manipulation. As database vendors make
regular expression functionality more widely available for use in SQL, the need
to move data into SAS for pattern matching, string replacement and character
extraction is no longer (as) necessary.
This paper will cover enough regular expression patterns to make you dangerous,
demonstrate the various regexp SQL functions and provide practical applications
for each.
Stress Testing and Supplanting the SAS® LOCK Statement: Implementing Mutex Semaphores To Provide Reliable File Locking in Multi-User Environments To Enable and Synchronize Parallel Processing
Troy Hughes
The SAS® LOCK Statement was introduced in SAS version 7 with great pomp and
circumstance, as it enabled SAS software to lock data sets exclusively. In a
multi-user or networked environment, an exclusive file lock prevents other
users or processes from accessing and accidentally corrupting a data set while
it is in use. Moreover, because file lock status can be tested programmatically
with the LOCK statement return code (&SYSLCKRC), data set accessibility can be
validated before attempted access, thus preventing file access collisions and
facilitating more reliable, robust software. Notwithstanding the intent of the
LOCK statement, stress testing demonstrated in this text illustrates
vulnerabilities in the LOCK statement that render its use inadvisable due to
its inability to reliably lock data sets, which is its only purpose. To overcome this
limitation and enable reliable data set locking, a methodology is demonstrated
that utilizes dichotomous semaphores, or flags, that indicate whether a data
set is available or in use, and mutually exclusive (mutex) semaphores that
restrict data set access to a single process at one time. With Base SAS file
locking capabilities now restored, this text further demonstrates control table
locking to support process synchronization and parallel processing. The SAS
macro LOCKITDOWN is included and demonstrates busy-waiting (or spinlock) cycles
that repeatedly test data set availability until file access is achieved or a
process times out.
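For reference, the basic LOCK-and-test pattern that the paper stress-tests looks like this (PERM.MASTER is illustrative):

```sas
%macro guarded_append;
   lock perm.master;                /* request an exclusive lock       */
   %if &syslckrc = 0 %then %do;     /* zero return code: lock acquired */
      proc append base=perm.master data=work.new;
      run;
      lock perm.master clear;       /* release the lock                */
   %end;
   %else %put %str(W)ARNING: perm.master is in use (SYSLCKRC=&syslckrc).;
%mend guarded_append;
```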
e-Posters
Using SAS® to Examine Relationships among Leadership Styles of College of Nursing Deans and Faculty Job Satisfaction Levels in Research Intensive Institutions
Abbas Tavakoli and Karen Worthy
Leadership and job satisfaction are two factors that have been regarded as
fundamental for organizational success. It has been shown that employees with
high job satisfaction levels are likely to be more productive and exert more
effort in pursuing organizational interests. The purpose of this study was to
identify perceived leadership styles of nursing deans to determine whether they
correlate with nursing faculty job satisfaction. The study used many SAS®
procedures to analyze descriptive, correlational data. The sample for this
national study consisted of 303 (out of 1,626 recruited) full-time nursing
faculty members from 24 public, research universities with very high research
activity in the United States.
The results show a significant positive linear relationship of both
transformational leadership style and transactional leadership style with job
satisfaction. The results also indicated a significant negative relationship
between passive leadership style and job satisfaction (r = -.43). The results of multiple regression indicated that the different leadership styles (transformational, transactional, and passive) are related to job
satisfaction after controlling for interaction with the dean. The R-square
values for transformational, transactional, and passive leadership on job
satisfaction were .38, .13, and .26, respectively. SAS® is powerful software
for analyzing many types of data.
Using SAS® to Examine the Relationship between Primary Caregivers' Adverse Childhood Experiences (ACE) and Child Abuse Allegations
Abbas Tavakoli, Katherine Chappell and Senna Dejardins
Child maltreatment affected nearly 700,000 children in the United States in
2012. Child maltreatment is broken down into four main divisions according to
the Centers for Disease Control (CDC): physical abuse, sexual abuse, emotional
abuse, and neglect. South Carolina's ranking for child well-being is among
the poorest at 45th in the nation. Earlier recognition and intervention for a
child victim of abuse allegations could result in a positive impact on their
future health and well-being. The purpose of this paper is to use SAS® to
examine the relationship between primary caregivers' adverse childhood
experiences scores and child abuse allegations in the family. Adverse childhood
experience (ACE) scores were used for this study. There were 10 items with
possible responses of "no" or "yes". The total scale was created by summing
responses for 10 items. The data collection was conducted at the Child
Advocacy Center (CAC) of Aiken and Dickerson Center for Children where families
with allegations of child abuse bring children for services. Each participant
completed an ACE survey and a demographic questionnaire. PROC MEANS and PROC
FREQ were used to describe the data. PROC CORR was used to examine the linear
relationship of total ACE score to ordinal and continuous variables. PROC
TTEST, NPAR1WAY, and GLM were used to examine differences in mean ACE score
across selected variables. Male caregivers had a slightly higher average ACE
score (8.28) than female caregivers (7.75). The average total ACE score was
similar by site, race, and marital status. The ACE score was higher for
physical abuse than for other types of abuse. The results of the t-test,
nonparametric, and GLM analyses did not reveal significant differences between
ACE score and the above variables, with all P values greater than .05.
Patients with Morbid Obesity and Congestive Heart Failure Have Longer Operative Time and Room Time in Total Hip Arthroplasty
Yubo Gao
More and more patients with total hip arthroplasty have obesity, and previous
studies have shown a positive correlation between obesity and increased
operative time in total hip arthroplasty. However, those studies shared the
limitation of small sample sizes. Decreasing operative time and room time is
essential to meeting the increased demand for total hip arthroplasty, and
factors that influence these metrics should be quantified to allow for targeted
reductions in time and adjusted reimbursement models. This study used a multivariate
approach to identify which factors increase operative time and room time in
total hip arthroplasty. For the purposes of this study, the American College of
Surgeons National Surgical Quality Improvement Program database was used to
identify a cohort of over thirty thousand patients having total hip
arthroplasty between 2006 and 2012. Patient demographics, comorbidities
including body mass index, and anesthesia type were used to create generalized
linear models identifying independent predictors of increased operative time
and room time. The results showed that morbid obesity (body mass index >40)
independently increased operative time by 13 minutes and room time by 18
minutes. Congestive heart failure led to the greatest increase in overall room
time, resulting in a 20-minute increase. Anesthesia method further influenced
room time, with general anesthesia resulting in an increased room time of 18
minutes compared with spinal or regional anesthesia. Obesity is the major
driver of increased operative time in total hip arthroplasty. Congestive heart
failure, general anesthesia, and morbid obesity each lead to substantial
increases in overall room time, with congestive heart failure leading to the
greatest increase in overall room time. All analyses were conducted in SAS®
9.4 (SAS Institute Inc., Cary, NC).
The Power of Interleaving Data
Yu Feng
Have you ever had the experience of writing multiple SAS® data steps to
accomplish a task but felt that there should be more efficient SAS code which
could do the job? Interleaving a dataset with itself can help you fulfill the
task in the examples enumerated in this paper. The intent of this paper is to
introduce the fundamental understanding of the interleaving process and discuss
a few examples showing the power of interleaving data with itself. The examples
included in this paper are commonly used in outcomes research. This paper will
focus on interleaving a dataset with itself, not on interleaving two or more
different datasets.
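
As a hedged illustration of the technique (not the paper's actual code; dataset and variable names are hypothetical), interleaving a dataset with itself lets one pass compute a BY-group statistic that a second pass then applies, all in a single DATA step:

```sas
/* Hypothetical sketch: SET with BY delivers all of the first copy's     */
/* records for a BY group before the second copy's, so pass 1 can       */
/* accumulate a group statistic that pass 2 then applies.               */
proc sort data=visits; by patient_id; run;

data want;
   set visits(in=pass1) visits(in=pass2);  /* the dataset interleaved with itself */
   by patient_id;
   retain grp_max;
   if first.patient_id then grp_max = .;
   if pass1 then grp_max = max(grp_max, lab_value);  /* pass 1: accumulate */
   else do;                                          /* pass 2: apply      */
      pct_of_max = lab_value / grp_max;
      output;
   end;
   drop grp_max;
run;
```

Within a BY group, observations from the first data set listed are delivered before those from the second, which is what makes the single-step pattern work.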
Formula 1: Analytics behind the tracks to the podium
Pallabi Deb and Piyush Lashkare
SPEED, POWER, INNOVATION, PERFORMANCE, FACTS, and STATISTICS are the words that
distinguish Formula 1 racing, also known as The Pinnacle of Motor Sport. It generates a yearly income of 1.2 billion dollars and involves many teams competing with their roaring turbocharged engines for the spectators. With an
average team budget of 300 million dollars, teams need to show not only
engineering excellence but also a winning strategy. During a race, most drivers
have an average heart rate of 170 beats per minute, cars often exceed a maximum
speed of 150 mph, and a difference of milliseconds separates winners from
losers. These characteristics make it different from most other sports.
A Formula One car deploys about 150 sensors measuring all sorts of variables
around the car, with 500 different parameters across the system measuring
nearly 13,000 health parameters and events, resulting in 750 million numbers. A
single race generates a staggering 3 terabytes of data. This data is further
analyzed with the objective of gaining a competitive advantage; in short,
applying analytics for winning business insights.
Today, Formula 1 racing is not only about flawless aerodynamics and legendary
driving but is also fueled by DATA to accelerate the analysis. Currently, we
have extracted the data for the entire 2011 Grand Prix season across all the
circuits. For each circuit, we have data for five sessions (FP1, FP2, FP3,
Qualifying, and the Final race) for all the constructor teams. Because the
volume of the data is huge, we plan to analyze the data for one circuit and the
top five constructor teams to create a model that illustrates significant
attributes such as track conditions, tire types, fuel, lap time, pit stops, and
stint length to predict a driver's probable position in the final standings.
This model will predict the following:
- What would a driver's probable position be on a particular circuit?
- How can qualifying positions influence race results?
- Do pit stop timings really make a difference in winning?
- What is the probability of a driver taking pole position?
- Which constructor has the highest probability of bagging the maximum points?
Privacy Protection using Base SAS®: Purging Sensitive Information from Free Text Emergency Room Data
Michelle White, Thomas Schroeder, Li Hui Chen and Jean Mah
Federal agencies must balance privacy protection concerns with the competing
priorities of accessibility and usability of open data. Free text data often
provides detail and qualitative value not offered by coded data. However, free
text narratives are more likely to contain personally identifiable or sensitive
information. This paper describes how U.S. Consumer Product Safety Commission
(CPSC) staff identifies sensitive information in emergency department (ED)
narratives using macros, Perl regular expressions and the PRXMATCH function in
Base SAS Version 9.
CPSC's National Electronic Injury Surveillance System (NEISS) is a national
probability sample of hospitals with EDs in the U.S. and its territories. The
NEISS collects information for about 400,000 product-related ED visits
annually. Each NEISS record includes coded variables and a brief text
narrative. This narrative may contain sensitive information (e.g., patient
names, product brands) that must be purged before the NEISS data is publicly
released. About 65 percent of the narratives are immediately reviewed and
purged, if necessary, by contract reviewers. The data is then input into a SAS
program to identify potentially sensitive words in the remaining narratives and
to verify the contractor review. A macro compares each narrative to a SAS
dataset where each observation contains a purge word or Perl regular
expression, which can encompass misspellings or indicate a numerical identifier
(e.g., social security number, birthdate). If a purge word or expression
is contained in the narrative, the case is output for review by CPSC staff.
In this way, CPSC staff reviews only narratives containing potentially
sensitive information. Previously un-reviewed narratives may be purged, and
purges done on previously reviewed narratives are marked as having been missed
by the contractors. New purge terms are periodically identified by
comparing the terms actually purged from reviewed narratives with those in the
SAS dataset.
Using tools available in Base SAS, CPSC staff has semi-automated the purging of
sensitive information from ED narratives. A similar process could be applied to
other data containing free text, such as electronic medical records and death
certificates.
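
A minimal sketch of the kind of screening step described above (the purge terms, dataset, and variable names here are illustrative, not CPSC's production code):

```sas
/* Hypothetical sketch: flag narratives that match a Perl regular       */
/* expression or contain a purge word, and route them to staff review.  */
data flag_for_review;
   set neiss_narratives;                 /* one ED narrative per record  */
   retain rx_ssn;
   if _n_ = 1 then
      rx_ssn = prxparse('/\b\d{3}-\d{2}-\d{4}\b/');  /* SSN-like pattern */
   hit = (prxmatch(rx_ssn, narrative) > 0) or
         (findw(upcase(narrative), 'SMITH') > 0);    /* example purge word */
   if hit then output;                   /* case goes to CPSC staff review */
run;
```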
Extracting Email Domains and Geo-Processing IP Addresses in SAS®
Alec Zhixiao Lin
Web data has become a very important source for analytics in the current era of
social media and booming ecommerce. Email domain and IP address are two
important attributes potentially useful for market sizing, detection of online
statistical anomaly and fraud prevention. This paper introduces a few methods
in SAS that extract email domains and process IP addresses to prepare data for
subsequent analyses.
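
One simple way to extract the pieces the abstract mentions (illustrative names; not necessarily the paper's method):

```sas
/* Hypothetical sketch: split an email address at the "@" with SCAN. */
data domains;
   set emails;                                /* assumes a variable EMAIL */
   length domain tld $64;
   domain = lowcase(scan(email, 2, '@'));     /* text after the "@"       */
   tld    = scan(domain, -1, '.');            /* last dot-delimited piece */
run;
```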
The Orange Lifestyle
Sangar Rane and Mohit Singhi
For a freshman at a large university, life can be fun as well as stressful. The
choices a freshman makes while in college may affect his or her overall
health. To examine the overall health and behaviors of students at Oklahoma
State University, a survey was conducted among freshman students. The survey
focused on capturing psychological, environmental, diet, exercise, and alcohol
and drug use factors among students. A total of 790 out of 1,036 freshman
students completed the survey, which included around 270 questions or items
covering the range of issues mentioned above. An exploratory factor analysis
identified 34 possible factors. For example, two factors that relate to the
behavior of students under stress are eating and relaxing. Analysis is
continuing, and we hope the results will give us deep insight into the lives of
students and thereby help improve the health and
lifestyle of students at Oklahoma State University in future years.
Using SAS® to create a Build Combinations Tool to Support Modularity
Stephen Sloan
With SAS PROC SQL we can use a combination of a manufacturing Bill of Materials
and a sales specification document to calculate the total number of
configurations of a product that are potentially available for sale. This will
allow the organization to increase modularity with maximum efficiency.
Since some options might require or preclude other options, the result is more
complex than a straight multiplication of the numbers of available options. Through judicious use of PROC SQL, we can maintain accuracy while reducing the time, space, and complexity involved in the calculations.
Strike a pose! Quick and Easy Camera Ready Reporting with SAS®
Nancy McGarry
Getting data out to non-programmer staff in a clear non-technical format can
present a challenge. Often all the analyst wants to see are numbers presented
in a clear and concise manner. Enter PROC REPORT!
This ePoster presents a simple SAS® program that takes processed data and
produces camera-ready report results.
A Failure to EXIST: Why Testing for Data Set Existence with the EXIST Function Alone Is Inadequate for Serious Software Development in Asynchronous, Multi-User, and Parallel Processing Environments
Troy Hughes
The Base SAS® EXIST function demonstrates the existence (or lack thereof) of a
data set. Conditional logic routines commonly rely on EXIST to validate data
set existence or absence before subsequent processes can be dynamically
executed, circumvented, or terminated based on business logic. In synchronous
software design where data sets cannot be accessed by other processes or users,
EXIST is both a sufficient and reliable solution. However, because EXIST
captures only a split-second snapshot of the file state, it provides no
guarantee of file state persistence. Thus, in asynchronous, multi-user, and
parallel processing environments, data set existence can be assessed by one
process but instantaneously modified (by creating or deleting the data set)
thereafter by a concurrent process, leading to a race condition that causes
failure. Due to this vulnerability, most classic implementations of the EXIST
function within SAS literature are insufficient for testing data set existence
in these complex environments. This text demonstrates more reliable and secure
methods to test SAS data set existence and perform subsequent, conditional
tasks in asynchronous, multi-user, and parallel processing environments.
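
One hedged way to close the race window the abstract describes (a sketch, not necessarily the author's solution) is to pair EXIST with the LOCK statement, so the data set cannot be created or deleted by another process between the test and the use:

```sas
/* Hypothetical macro: EXIST alone is a snapshot; holding a lock makes  */
/* the file state persist for the duration of the dependent step.       */
%macro use_if_exists(dsn);
   %if %sysfunc(exist(&dsn)) %then %do;
      lock &dsn;                        /* request an exclusive lock     */
      %if &syslckrc = 0 %then %do;      /* lock held: state is stable    */
         proc print data=&dsn(obs=5); run;
         lock &dsn clear;               /* release for other processes   */
      %end;
      %else %put NOTE: &dsn is in use elsewhere - skipping.;
   %end;
%mend use_if_exists;
```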
An Analysis of the Repetitiveness of Lyrics in Predicting a Song's Popularity
Drew Doyle
In the interest of understanding whether or not there is a correlation between
the repetitiveness of a song's lyrics and its popularity, the top ten songs
from the year-end Billboard Hot 100 Songs chart from 2002 to 2015 were
collected. These songs then had their lyrics assessed to determine the counts
of the top ten words used. These word counts were then used to predict the
number of weeks the song was on the chart. The prediction model was analyzed to
determine the quality of the model and whether word count is a significant
predictor of a song's popularity. To investigate whether song lyrics are
becoming more simplistic over time, several tests were completed to see if the
average word counts have been changing over the years. All analysis was
completed in SAS® using various PROCs.
SAS® Macro for Automated Model Selection Involving PROC GLIMMIX and PROC MIXED
Fan Pan and Jin Liu
Generalized linear mixed model (GLMM) and linear mixed model (MIXED) deal with
the situation where the mean conditional on the normal random effects is
linearly related to the model effects. Assumptions for the MIXED are similar to
GLMM including normally distributed random effects and residuals and
independence among the random effects and model errors. GLMMs allow analysis of
both normally distributed and certain types of non-normally distributed
dependent variables when random effects are present, whereas MIXED is used for
the normal distribution. Both PROC GLIMMIX and PROC MIXED can be used on the
same dataset, but different results may be obtained from the two procedures, so
it is of interest to compare the final model results between them. This paper
details a group of macros performing separate automated model selection using
PROC GLIMMIX and PROC MIXED. The macros use the Output Delivery System (ODS) to
save the resulting statistics from the model fittings. Model selection indices,
such as AICC (corrected Akaike Information Criterion) and chi-square values,
are calculated based on the resulting statistics in all-possible-model
selection. Final GLMM and MIXED model selection includes data exploration,
influence diagnostics, and checks for model violations using the experimental
ODS GRAPHICS option. The macros also graphically compare the summaries of the
two best models selected by GLIMMIX and MIXED and suggest which procedure is
better suited for the dataset. Two examples using a real dataset show the
application of the macros, which will be of significant value to researchers
interested in the application of mixed models.
Predicting student success based on interactions with virtual learning environment
Vivek Doijode and Neha Singh
Online learning can be called the millennial sister of classroom learning; tech
savvy, always connected, and flexible. These features offer a convenient
alternative to students with constraints and working professionals to learn on
demand. According to National Center for Education Statistics, over 5 million
students are currently enrolled in distance education courses. The growing
trend and popularity of MOOCs (Massive Open Online Courses) and distance
learning makes it an interesting area of research. We plan to work on OULA
(Open University Learning Analytics) dataset. Learning analytics provides many
insights on the learning pattern of students and on module assessments. These
insights may be researched to enhance participants' learning experience. In
this paper, we predict students' success in an online course using
regression, clustering and classification methods. We have a mix of categorical
and numeric inputs present in the OULA datasets that are in csv file formats
and contain information for more than 30,000 students pertaining to 7 distance
learning courses, student demographics, course assessments and student
interaction with virtual learning environment. We have merged tables together
using unique identifiers. We will first explore the merged data using SAS® to
generate insights and then build appropriate predictive models.
Applying SAS® to Explore the Utilization and Impact of Sensitive Clinical Indicators on a Heart Failure Unit
Jametta Magwood-Golston, Abbas Tavakoli, Nurses of Moultrie Heart Failure Unit at Palmetto Health Richland, Christina Payne, Harmony Robinson, Veronica Deas and Forrest Fortier
Heart failure (HF) is a chief cause of hospitalization and contributor to
health care costs in the United States. While there has been a significant
reduction in HF patients' hospital length of stay, because of the disease's
complexity nearly 25% of patients hospitalized with heart failure are
readmitted within 30 days, primarily due to co-morbidities. In an effort to
reduce the potential for 30-day readmission and to continue to reduce HF
patient length of stay, our hospital implemented a care intervention involving
interdisciplinary rounding at each patient's bedside. Each
clinical discipline involved in patient care is present at the bedside to
discuss clinical, quality and harm indicators (i.e. pressure ulcers, central
lines, foleys, medication and dosage, patient discharge, length of stay, etc.)
and their potential influence on patient length of stay. The purpose of the
study was to evaluate the impact of clinical culture change after the
introduction of the Accountable Care Unit™ care model. Specific aims of the
study were (1) to measure change in the length of stay of patients discharged
from the heart failure accountable care unit; (2) to enhance team situational
awareness to achieve unit-level quality enhancements; (3) to eliminate
unnecessary medical waste. This study used a descriptive, comparative design to
analyze clinical metrics before and after the care model was introduced on the
heart failure unit. The results revealed a significant difference in the length
of stay for patients discharged from the heart failure unit. There were
statistically significant differences for all included minority patients
(Black, Hispanic, and American Indian), with the exception of Asian patients.
There was also a statistically significant difference in the utilization of
high-risk medications; however, there were no statistically significant
differences by patient gender.
Text mining and sentiment analysis on video game user reviews using SAS® Enterprise Miner™ and SAS® Sentiment Analysis Studio
Mukesh Kumar Singh
Digital gaming has a history of more than 50 years. The industry started in the
late 1960s when the game titles such as Pong, Centipede and Odyssey were
introduced to consumer markets. Digital gaming is now a widespread phenomenon,
and at least 70% of US and European households say that they play video games
on platforms such as PC, Xbox, PS4, and Nintendo consoles. It is reported
that in 2011, the total revenue of the industry amounted to about 17 billion
USD. Each game is reviewed and rated on the internet by users who played the
game and the reviews are often contrasting based on the sentiments expressed by
the user. Analyzing those reviews and ratings to describe the positive and
negative factors of a specific game could help consumers make a more informed
decision about the game.
In this paper, we will analyze 10,000 reviews and ratings on a scale (1-10) of
200 games culled from two sites: metacritic.com and gamespot.com. We will then
build predictive models to classify the reviews into positive, negative, and
mixed based on the sentiments of users and develop a score which defines the
overall performance of the game so that users get all the required information
about a game before purchasing a copy.
Text Analysis of American Airline Reviews
Rajesh Tolety
According to a TripAdvisor survey report, about 43% of airline passengers rely
on online reviews of different airlines before booking a ticket. Therefore, the
nature and tone of the reviews are important metrics for airlines to track and
manage. We plan to do text analysis of online reviews of American Airlines,
which runs about 945 flights to 350 destinations. The analysis would help
American Airlines understand what their passengers are talking about and
perhaps take action to improve their service. The extracted dataset includes
customers' ratings (on a scale of 1-5), date of review, detailed comments, and
location. We plan to do text analysis as well as
supervised sentiment analysis on this dataset.
Hands On Workshop
A Tutorial on the SAS® Macro Language
John Cohen
The SAS Macro language is another language that rests on top of regular SAS
code. If used properly, it can make programming easier and more fun. However,
not every program is improved by using macros. Furthermore, it is another
language syntax to learn, and can create problems in debugging programs that
are even more entertaining than those offered by regular SAS.
We will discuss using macros as code generators, saving repetitive and tedious
effort, for passing parameters through a program to avoid hard coding values,
and to pass code fragments, thereby making certain tasks easier than using
regular SAS alone. Macros facilitate conditional execution and can be used to
create program modules that can be standardized and re-used throughout your
organization. Finally, macros can help us create interactive systems in the
absence of SAS/AF.
When we are done, you will know the difference between a macro, a macro
variable, a macro statement, and a macro function. We will introduce
interaction between macros and regular SAS language, offer tips on debugging
macros, and discuss SAS macro options.
Fundamentals of the SAS® Hash Object
Paul Dorfman
Starting with the basics and progressing to some less-used features, this
workshop is designed to show how the SAS® hash object really works. The main
emphasis will be not on amassing as many pieces of template code as possible,
but rather on the fundamental things a hash object programmer must understand
in order to use it in creative ways. The aim is not so much about tasting
already cooked hash dishes (though there will be plenty of chances to do that,
too), but about cooking properly based on the fundamental properties of the
ingredients and their interactions. Cuisine is so much more than just following
a recipe, and the same is true for programming with the hash object! In
particular, we'll learn what the DATA step compiler sees when it encounters
hash object references and what it must have seen - and done - to make the hash
object work when its run-time turn comes. We'll goof - intentionally - to learn
from errors reported in the SAS log. We'll see how the variables stored in the
hash object talk to their host counterparts in the PDV, and which hash methods
make them affect each other and how, including those related to the hash
iterator. While focusing on these underlying workings, we'll learn many other
useful things that together ought to form a solid basis for making the hash
object a valuable SAS programming tool.
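
A minimal sketch of the compile-time/run-time interplay the workshop covers (the lookup table and key names here are hypothetical):

```sas
/* Hypothetical sketch: load a lookup table into a hash object and use  */
/* FIND to retrieve values into the PDV host variables.                 */
data matched;
   /* compile time: the 0-then-SET puts LOOKUP's variables into the PDV */
   if 0 then set lookup;
   if _n_ = 1 then do;
      declare hash h(dataset:'lookup');  /* run time: load the table once */
      h.defineKey('id');
      h.defineData('label');
      h.defineDone();
   end;
   set transactions;
   /* FIND returns 0 on success and copies LABEL into its host variable */
   if h.find() = 0 then output;
run;
```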
Making Sense of PROC TABULATE
Jonas Bilenas and Kajal Tahiliani
The TABULATE procedure in SAS® provides a flexible platform to generate
tabular reports. Many beginning SAS programmers have a difficult time
understanding the syntax of PROC TABULATE and tend to avoid using the
procedure. This tutorial will explain the syntax of PROC TABULATE and, with
examples, show how to grasp the power of PROC TABULATE. The data used in this
paper represents simulated consumer credit card usage data and the code was
developed using SAS 9.2.
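
For readers new to the procedure, the basic row-comma-column syntax looks like this (the simulated credit-card variable names are illustrative):

```sas
/* Hypothetical sketch of PROC TABULATE's TABLE statement syntax.       */
proc tabulate data=cardusage;
   class region cardtype;          /* categorical dimensions             */
   var spend;                      /* analysis variable                  */
   table region all,               /* rows: each region plus a total     */
         cardtype*spend*(n mean);  /* columns: stats nested under VAR    */
run;
```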
New for SAS® 9.4: A Technique for Including Text and Graphics in Your Microsoft Excel Workbooks, Part 1
Vince DelGobbo
A new ODS destination for creating Microsoft Excel workbooks is available
starting in the third maintenance release of SAS® 9.4. This destination
creates native Microsoft Excel XLSX files, supports graphic images, and offers
other advantages over the older ExcelXP tagset. In this presentation you learn
step-by-step techniques for quickly and easily creating attractive multi-sheet
Excel workbooks that contain your SAS® output. The techniques can be used
regardless of the platform on which SAS software is installed. You can even
use them on a mainframe! Creating and delivering your workbooks on-demand and
in real time using SAS server technology is discussed. Although the title is
similar to previous presentations by this author, this presentation contains
new and revised material not previously presented.
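
A hedged sketch of the new destination in action (the file path and chosen options are illustrative):

```sas
/* Hypothetical sketch: route procedure output to a native .xlsx file.  */
ods excel file='/tmp/report.xlsx'
          options(sheet_name='Summary' embedded_titles='yes');
title 'Class Summary';
proc means data=sashelp.class n mean maxdec=1;
   class sex;
   var height weight;
run;
ods excel close;   /* the workbook is written when the destination closes */
```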
Introduction to Data Simulation
Jason Brinkley
Creating synthetic data via simulation can often be a powerful tool for a wide
variety of analyses. The purpose of this workshop is to provide a basic
overview of simulating data for a variety of purposes. Examples will
include power calculations, sensitivity analysis, and exploring nonstandard analyses. The workshop is designed for the mid-level analyst who has basic knowledge of
data management, visualizations and basic statistical analyses such as
correlations and t-tests.
A Short Introduction to Longitudinal and Repeated Measures Data Analyses
Leanne Goldstein
Longitudinal and repeated measures data are seen in nearly all fields of
analysis. Examples of this data include weekly lab test results of patients or
test scores by children from the same class. Statistics students and analysts
alike may be overwhelmed when it comes to repeated measures or longitudinal
data analyses. They may try to educate themselves by diving into text books or
taking semester long or intensive weekend courses resulting in even more
confusion. Some may try to ignore the repeated nature of data and take short
cuts such as analyzing all data as independent observations or analyzing
summary statistics such as averages or changes from first to last points and
ignoring all the data in-between. This Hands-On presentation will introduce
longitudinal and repeated measures analyses without heavy emphasis on theory. Students in the workshop will have the opportunity to get hands-on experience
graphing longitudinal and repeated measures data. They will learn how to
approach these analyses with tools like PROC MIXED and PROC GENMOD. Emphasis
will be on continuous outcomes but categorical outcomes will briefly be
covered.
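
As a taste of the tools mentioned, a repeated-measures model in PROC MIXED might look like this (dataset and variable names are hypothetical):

```sas
/* Hypothetical sketch: weekly lab results modeled with a compound-     */
/* symmetry covariance structure for the within-patient correlation.    */
proc mixed data=weekly_labs;
   class patient treatment week;
   model result = treatment week treatment*week / solution;
   repeated week / subject=patient type=cs;  /* correlated within patient */
run;
```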
Quick Results with SAS® Enterprise Guide®
Kirk Paul Lafler, Mira Shapiro and Ryan Paul Lafler
SAS® Enterprise Guide® empowers organizations, programmers, business
analysts, statisticians and end-users with all the capabilities that SAS has to
offer. This hands-on workshop presents the Enterprise Guide graphical user
interface (GUI), access to multi-platform enterprise data sources, various data
manipulation techniques without the need to learn complex coding constructs,
built-in wizards for performing reporting and analytical tasks, the delivery of
data and results to a variety of mediums and outlets, and support for data
management and documentation requirements. Attendees learn how to use the
graphical user interface to access SAS data sets, tab-delimited and Excel input
files; subset and summarize data; join (or merge) two tables together; flexibly
export results to HTML, PDF and Excel; and visually manage projects using flow
diagrams.
Life Sciences/Healthcare/Insurance
A Novel Approach to Calculating Medicare Hospital Readmissions for the SAS® Novice
Karen Wallace
Hospital Medicare readmission rate has become a key indicator for measuring the
quality of healthcare in the US, currently adopted by major healthcare
stakeholders including the Centers for Medicare and Medicaid Services (CMS),
the Agency for Healthcare Research and Quality (AHRQ), and the National
Committee for Quality Assurance (NCQA) (Fan and Sarfarazi, 2014).
Although many papers have been written about how to calculate readmissions (as
referenced), this paper offers a novel, basic and comprehensive approach using
the options of the SAS DATA Step as well as PROC SQL for: 1) de-identifying
patient data, 2) calculating sequential admissions and 3) filtering out
criteria required for reporting CMS 30-day readmissions. Additionally, it
demonstrates: 1) using ODS to create a labeled and de-identified data set, 2) a
macro to examine data quality, and 3) summary statistics useful for reporting
and analysis.
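
The sequential-admissions step might be sketched as follows (a hedged illustration with hypothetical names, not the paper's code):

```sas
/* Hypothetical sketch: compare each admission date with the same        */
/* patient's prior discharge date to flag 30-day readmissions.           */
proc sort data=claims; by patient_id admit_date; run;

data readmits;
   set claims;
   by patient_id;
   prior_disch = lag(disch_date);           /* previous row's discharge  */
   if first.patient_id then prior_disch = .;
   days_between = admit_date - prior_disch; /* gap between stays         */
   readmit_30 = (0 <= days_between <= 30);  /* CMS 30-day window flag    */
   format prior_disch date9.;
run;
```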
Protecting the Innocent (and your data)
Stanley Legum
A recurring problem with large research databases containing sensitive health,
financial, and personal information on individuals is how to make meaningful
extracts available to qualified researchers without compromising the privacy of
the individuals whose data are in the database. This problem is exacerbated
when a large number of extracts must be made from the database. In addition to
employing statistical disclosure control methods, this paper recommends
limiting the variables included in each extract to the minimum needed and
implementing a secure, self-documenting method of assigning request-specific
randomized IDs to each extract.
Sankey Diagram with Incomplete Data From a Medical Research Perspective
Yichen Zhong
The Sankey diagram is widely used in the energy industry but is relatively rare
in medical research. It is an innovative tool for visualizing patient flow in
longitudinal data. A SAS® macro for generating Sankey diagrams with bar charts
was published in 2015. However, it did not provide an intuitive solution for
missing data or terminated data, which are common in medical research. This
paper presents a modification of this macro that allows subjects with
incomplete data to appear on the Sankey diagram. In addition, examples of using
Sankey diagram in medical research are also provided.
How is Healthcare Cost Data Distributed? Using Proc Univariate to Draw Conclusions about Millions of Different Customers
Aran Canes
Modelling health care cost and utilization data has received substantial
attention from the academic community. Different methodological
approaches have been used, tested, and contrasted on various health care
datasets. Some of the simpler approaches include ordinary least squares,
generalized linear models, and taking the log transform of the dependent
variable, while more sophisticated non-parametric methods have also been
proposed and tested. The
conclusion of most researchers is that, while some questions remain unresolved,
different approaches are recommended for different datasets.
Despite this plethora of methodological comparisons there is a paucity of
papers researching how health care cost data is actually distributed. This is
probably because of two major factors: difficulties accessing data because of
privacy concerns and because there seems to be a tacit assumption that
different slices of cost will have different distributions.
While this paper does not try to falsify the hypothesis that, in some
instances, different slices of healthcare cost data may be distributed
differently, it reaches a surprising conclusion regarding a wide range of
healthcare cost data slices among customers of a major insurer: all are
distributed approximately log-normally once the substantial part of the
population that is zero-cost is excluded. This is true whether one looks at
pharmacy or medical cost, different ways of purchasing insurance among
customers or comparing customers who stayed eligible for a full calendar year
versus customers who may have only temporarily had coverage. These results are
reached using the histogram and statistical significance tests available to all
SAS users in PROC UNIVARIATE.
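
The kind of check described can be reproduced in a single step (variable names are illustrative):

```sas
/* Hypothetical sketch: fit a lognormal curve to positive costs and     */
/* request the goodness-of-fit tests PROC UNIVARIATE produces.          */
proc univariate data=claims;
   where cost > 0;              /* exclude the zero-cost population     */
   var cost;
   histogram cost / lognormal;  /* overlay the fitted lognormal curve   */
run;
```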
Since there is a noted lack of empirical results on how the cost data of large
customer populations is distributed, this paper should be a significant help in
assessing the validity of different methodological approaches. If the results
are confirmed in other populations, methodological discussions could proceed
with the underlying knowledge that, if a healthcare cost dataset covers a
substantial number of customers, the probability density function that best
matches the data will be the lognormal.
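As a sketch of the kind of check the abstract describes (fitting a lognormal curve to nonzero costs with PROC UNIVARIATE), the following assumes a hypothetical dataset WORK.COSTS with a TOTAL_COST variable:

```sas
/* Exclude the zero-cost population, then fit a lognormal curve.      */
/* WORK.COSTS and TOTAL_COST are hypothetical names.                  */
data nonzero;
   set work.costs;
   if total_cost > 0;
run;

proc univariate data=nonzero;
   var total_cost;
   /* HISTOGRAM with the LOGNORMAL option overlays the fitted curve   */
   /* and prints goodness-of-fit statistics for the fit.              */
   histogram total_cost / lognormal;
run;
```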
Are You Sure That Is Correct?: An Overview Of Good Practices For Dataset And Output Validation
Gregory Weller and Alex Buck
In the world of statistical programming, it is imperative to ensure that the
source data are accurately represented in the datasets, tables, listings, and
figures submitted to regulatory agencies and academic journals. The
gold-standard for ensuring correctness is double-independent programming where
a production programmer and validation programmer aim to produce the same
output. SAS® PROC COMPARE is the most commonly-used tool to compare the
outputs and determine differences. While double-independent programming using
PROC COMPARE can be a very useful process, it is important to recognize its
limitations. In certain cases, it may not be the most efficient or complete
method of validation. Further, failure to establish and follow sound
procedures and practices for validation can lead to disaster.
In this article, we will discuss good practices for double-independent
programming, the proper use of PROC COMPARE, common pitfalls, and explore the
different techniques needed for validating datasets, tables, listings, and
figures. While there will never be one true and final answer for validation
best practice, it is the hope of the authors to provide a starting point for
discussion.
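A minimal sketch of the double-programming comparison discussed above; the libraries, dataset, and ID variable are hypothetical:

```sas
/* Compare the production and validation versions of a dataset.       */
proc compare base=prod.demog compare=valid.demog listall criterion=1e-8;
   id subjid;                    /* match observations by subject ID  */
run;

/* After PROC COMPARE, &SYSINFO encodes the differences found;        */
/* a value of 0 means the datasets matched exactly.                   */
%put Comparison return code: &sysinfo;
```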
Evaluating Sociodemographic and Geographic Disparities of Hypertension in Florida using SAS®
Desiree Jonas and Shamarial Roberson
Hypertension is one of the leading risk factors for chronic disease. Chronic
conditions such as heart disease, stroke and diabetes are associated with
hypertension. In 2013, the prevalence of hypertension among adults was 34.6% in
Florida. As age increases, the risk of hypertension increases, placing older
populations at greater risk for developing chronic conditions. Florida has the
second largest elderly population in the United States which places an
increased burden on the healthcare system. This paper will demonstrate the use
of SAS® to evaluate and map the influence of socio-demographic factors such as
sex, race/ethnicity, age, education and income on hypertension prevalence in
Florida using Behavioral Risk Factor Surveillance System (BRFSS) data.
Additionally, PROC MAPIMPORT, PROC GMAP, and PROC SURVEYLOGISTIC will be used
to assess the burden of hypertension in Florida.
A General SAS® Macro to Implement Optimal N:1 Propensity Score Matching Within a Maximum Radius
Kathy Fraeman
A propensity score is the probability that an individual will be assigned to a
condition or group, given a set of covariates when the assignment is made. For
example, the type of drug treatment given to a patient in a real-world setting
may be non-randomly based on the patient's age, gender, geographic location,
overall health, and/or socioeconomic status when the drug is prescribed.
Propensity scores are used in observational studies to reduce selection bias by
matching different groups based on these propensity score probabilities, rather
than matching patients on the values of the individual covariates. Although the
underlying statistical theory behind propensity score matching is complex,
implementing propensity score matching with SAS® is relatively
straightforward. An output data set of each patient's propensity score can be
generated with SAS using PROC LOGISTIC, and a generalized SAS macro can do
optimized N:1 propensity score matching of patients assigned to different
groups. This paper gives the general PROC LOGISTIC syntax to generate
propensity scores, and provides the SAS macro for optimized propensity score
matching. A published example comparing unmatched and propensity score matched
patient groups, implemented with the SAS programming techniques described in
this paper, is also presented.
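The general shape of the propensity-score step the abstract mentions can be sketched as follows (dataset and covariate names are hypothetical; the paper supplies the full syntax and the matching macro):

```sas
/* Model the probability of treatment from baseline covariates and    */
/* save each patient's predicted probability (propensity score).      */
proc logistic data=patients descending;
   class gender region / param=ref;
   model treated = age gender region health_score;
   output out=ps_scores prob=pscore;
run;
```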
I See de Codes: Using SAS® to Process and Analyze ICD-9 and ICD-10 Diagnosis Codes Found in Administrative Healthcare Data
Kathy Fraeman
Administrative healthcare data including insurance claims data, electronic
medical records (EMR) data, and hospitalization data contain standardized
diagnosis codes used to identify diseases and other medical conditions. These
codes take their short-form name, ICD, from the International Classification
of Diseases. Much of the currently available healthcare
data contain the 9th version of these codes, referred to as ICD-9, while the
more recent 10th version, ICD-10, is becoming more common in healthcare data.
These diagnosis codes are typically saved as character variables, often stored
in arrays of multiple codes representing primary and secondary diagnoses, and
can be associated with either outpatient medical visits or inpatient
hospitalizations. SAS text processing functions, array processing, and the SAS
colon modifier can be used to analyze the text of these codes and identify
similar codes, or ranges of ICD codes. In epidemiologic analyses, groups of
multiple ICD diagnosis codes are typically used to define more general
comorbidities or medical outcomes. These disease definitions based on multiple
ICD diagnosis codes, also known as coding algorithms, can either be
hard-coded within a SAS program, or defined externally from the
programming. When coding algorithm definitions based on ICD codes are stored
externally, the definitions can be read into SAS, transformed to SAS format,
and dynamically converted into SAS programming statements required to identify
patients with the comorbidities and outcomes of interest.
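A small sketch of the colon-modifier technique described above, using hypothetical claim data with a DX1-DX10 diagnosis array (ICD-9 code 250 denotes diabetes mellitus):

```sas
data diabetes_flag;
   set claims;
   array dx {*} $ dx1-dx10;      /* primary and secondary diagnoses   */
   has_diabetes = 0;
   do i = 1 to dim(dx);
      /* The =: operator truncates the longer value before comparing, */
      /* so '25000', '25002', etc. all match the prefix '250'.        */
      if dx{i} =: '250' then has_diabetes = 1;
   end;
   drop i;
run;
```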
SDTM What? ADaM Who? A Programmer's Introduction to CDISC
Venita DePuy
Most programmers in the pharmaceutical industry have at least heard of CDISC,
but may not be familiar with the overall data structure, naming conventions,
and variable requirements for SDTM and ADaM datasets. This overview will
provide a general introduction to CDISC from a programming standpoint, including
the creation of the standard SDTM domains and supplemental datasets, and
subsequent creation of ADaM datasets. Time permitting, we will also discuss
when it might be preferable to create a CDISC-like dataset instead of a
dataset that fully conforms to CDISC standards.
Sample Size Estimation with PROC FREQ and PROC POWER
Adeline Wilcox
Under contract to the Centers for Medicare & Medicaid Services, The Joint
Commission specifies sample sizes for healthcare quality measurement. Their
sample size specifications pay no heed to established methods for sample size
estimation. SAS PROC POWER can be used to compute sample size estimates with
precision.
Most healthcare quality measures are dichotomous. From healthcare measurement
data I used as pilot samples, I computed upper and lower confidence limits. To
do this, I used the BINOMIAL and CL options on the PROC FREQ TABLES statement.
After examining these results, I chose input values for the PROC POWER
HALFWIDTH and PROBWIDTH options. More statistics textbooks cover sample size
estimation for hypothesis testing than for estimation.
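The two steps described above might look like this in outline (dataset, variable, and input values are hypothetical):

```sas
/* Step 1: binomial confidence limits from the pilot sample.          */
proc freq data=pilot;
   tables passed / binomial cl;
run;

/* Step 2: sample size for a confidence interval of given precision.  */
proc power;
   onesamplefreq ci=wilson
      proportion = 0.85    /* anticipated pass rate from the pilot    */
      halfwidth  = 0.05    /* desired CI half-width                   */
      probwidth  = 0.95    /* probability of achieving that width     */
      ntotal     = .;      /* solve for the required sample size      */
run;
```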
Building Efficiencies in Standard Macro Library using Polymorphism
Binoy Varghese and Sagar Rana
It is common practice in the bio-pharma industry to develop and maintain a
library of standard SAS® macros to facilitate expedited data analysis &
reporting, directed at ensuring compliance, consistency, quality and
reusability. As requirements change over time, new macros are created or
existing macros are updated. In both scenarios, there is a tradeoff. If new
macros are created, programs calling earlier versions of these macros have to
be modified before being used on a new project, thereby affecting the
portability of the programs. If existing macros are updated, programs from
older projects may cease to execute as originally intended, impeding backward
compatibility. Both these issues can be addressed by using polymorphism.
Polymorphism is an object-oriented programming concept that refers to the
ability to manage methods that bear the same name but exhibit different
behaviors. In the context of SAS programming, this may be translated as
having the capability of calling homonymous macros that may accept identical or
different parameters to perform a diverse set of tasks. In this paper, we
examine the concept of polymorphism from a SAS macro programming perspective
and present an implementation of this technique.
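SAS does not allow two compiled macros to share a name, so one way to realize the polymorphic behavior described above is a single macro whose behavior depends on which parameters are supplied; a hypothetical sketch:

```sas
%macro report(data=, type=SUMMARY, byvar=);
   %if %upcase(&type) = SUMMARY %then %do;
      proc means data=&data n mean std;
      run;
   %end;
   %else %if %upcase(&type) = DETAIL %then %do;
      proc print data=&data;
         %if %length(&byvar) %then %do;
            by &byvar;
         %end;
      run;
   %end;
%mend report;

%report(data=sashelp.class)              /* summary behavior          */
%report(data=sashelp.class, type=DETAIL) /* detail behavior           */
```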
Planning/Support/Administration
What's Hot: Skills for SAS® Professionals
Kirk Paul Lafler
As a new generation of SAS® user emerges, current and prior generations of
users have an extensive array of procedures, programming tools, approaches and
techniques to choose from. This presentation identifies and explores the areas
that are hot in the world of the professional SAS user. Topics include
Enterprise Guide, PROC SQL, PROC REPORT, Output Delivery System (ODS), Macro
Language, DATA step programming techniques such as arrays and hash objects, SAS
University Edition software, technical support at support.sas.com, wiki-content
on sasCommunity.org®, published white papers on LexJansen.com, and other
venues.
Take a SPA Day with the SAS® Performance Assessment (SPA): Baselining Software Performance across Diverse Environments To Elucidate Performance Placement and Performance Drivers
Troy Hughes
Software performance is often measured through program execution time with
higher performing software executing more rapidly than lower performing
software. Intrinsic factors affecting software performance can include the use
of efficient coding techniques, other software development best practices, and
SAS® system options. Factors extrinsic to software that affect performance can
include SAS configuration and infrastructure, SAS add-on modules, third-party
software, and hardware and network infrastructure. The variability in data
processed by SAS software also heavily drives execution time, and these
combined and commingled factors make it difficult to compare performance of one
SAS environment to another. Moreover, many SAS users may work in only one or a
few SAS environments, giving them limited to no insight into how performance of
their SAS environment compares to other SAS environments. The SAS Performance
Assessment (SPA) project, launched at SAS Global Forum in 2016, examines
FULLSTIMER performance metrics from diverse organizations with equally diverse
infrastructures. By running standardized software that manipulates standardized
data sets, the relative performance of unique environments can for the first
time be compared. Moreover, as the number and variability of SPA participants
continues to increase, the role that individual extrinsic factors play in
software performance will continue to be disentangled and better understood,
enabling SAS users not only to identify how their SAS environment compares to
other environments, but also to identify specific modifications that could be
implemented to increase performance levels.
Your Local Fire Engine Has an Apparatus Inventory Sheet and So Should Your Software: Automatically Generating Software Use and Reuse Libraries and Catalogs from Standardized SAS® Code
Troy Hughes
Fire and rescue services are required to maintain inventory sheets that
describe the specific tools, devices, and other equipment located on each
emergency vehicle. From the location of fire extinguishers to the make, model,
and location of power tools, inventory sheets ensure that firefighters and
rescue personnel know exactly where to find equipment during an emergency, when
restocking an apparatus, or when auditing an apparatus inventory. At the
department level, inventory sheets can also facilitate immediate identification
of equipment in the event of a product recall or the need to upgrade to newer
equipment. Software should be similarly monitored within a production
environment, first and foremost to describe and organize code
modules (typically SAS® macros) so they can be discovered and located when
needed. When code is reused throughout an organization, a reuse library and
reuse catalog should be established to demonstrate where reuse occurs and to
ensure that only the most recent, tested, validated versions of code modules are
reused. This text introduces Python code that automatically parses a directory
structure, parses all SAS program files therein (including SAS programs and SAS
Enterprise Guide project files), and automatically builds reuse libraries and
reuse catalogs from standardized comments within code. Reuse libraries and
reuse catalogs not only encourage code reuse but also facilitate backward
compatibility when modules must be modified because all implementations of
specific modules are identified and tracked.
Tales from the Help Desk 7: Solutions to Common SAS® Tasks
Bruce Gilsen
In 30 years as a SAS® consultant at the Federal Reserve Board, questions
about some common SAS tasks seem to surface again and again. This paper
collects some of these common questions, and provides code to resolve them.
The following tasks are reviewed.
- Save and restore SAS option values.
- Pad multiple non-continuous time series with missing values to make
continuous time series.
- Read a SAS data set backwards (last observation to first).
- Create a data set containing the last 5 observations of an existing data set.
- Drop the last n observations in each BY group.
- Write data to multiple external files in a DATA step, determining file names
dynamically from data values.
- Compare observations in the same data set.
In the context of discussing these tasks, the paper provides details about SAS
system processing that can help users employ the SAS system more effectively.
This paper is the seventh of its type; see the references for six previous
papers.
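As one example of the tasks listed, the common POINT= idiom for reading a data set backwards (the paper's own solution may differ):

```sas
/* Read SASHELP.CLASS from the last observation to the first.         */
data backwards;
   do pt = nobs to 1 by -1;
      set sashelp.class point=pt nobs=nobs;
      output;
   end;
   stop;   /* required: POINT= access never reaches end-of-file       */
run;
```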
Downloading, Configuring, and Using the Free SAS® University Edition Software
Kirk Paul Lafler
The announcement of SAS Institute's free SAS University Edition is an
exciting development for SAS users and learners around the world! The software
bundle includes Base SAS, SAS/STAT, SAS/IML, SAS Studio (user interface),
and SAS/ACCESS for Windows, with all the popular features found in the licensed
SAS versions. This is an incredible opportunity for users, statisticians, data
analysts, scientists, programmers, students, and academics everywhere to use
(and learn) for career opportunities and advancement. Capabilities include
data manipulation, data management, comprehensive programming language,
powerful analytics, high quality graphics, world-renowned statistical analysis
capabilities, and many other exciting features.
This presentation discusses and illustrates the process of downloading and
configuring the SAS University Edition. Additional topics include the process
of downloading the required applications, key configuration strategies to
run the SAS University Edition on your computer, and the demonstration of
a few powerful features found in this exciting software bundle.
How To Win Friends and Influence People: A Programmer's Perspective on Effective Human Relationships
Priscilla Gathoni
Dealing with people has become a task and an art that every person has to
master in the work and home environment. This paper explores 10 different ways
that a programmer can use to win friends and influence people. It displays the
steps leading to a positive, warm, and enthusiastic balanced work and life
environment. The ability to think and to do things in their order of
importance is a key ingredient for successful career growth. Programmers who
want to grow beyond just programming should enhance their people skills in
order to move up to the management level. However, for this to be a reality, a
programmer must have good technical skills, possess the ability to arouse
enthusiasm among peers, and be able to assume leadership. It is the programmer
that embraces non-judgment, non-resistance, and non-attachment as the core
mantras that will succeed in the complex and high paced work environment that
we are in. Avoiding arguments, being a good listener, respecting the other
person's point of view, and recalling people's names will increase your
earning power and ability to influence people to your way of thinking. The
ability to enjoy your work, be friendly, and be enthusiastic tends to bring you
goodwill. This eventually leads to creating good relationships in the office
and the power to influence those around you in a positive way.
Divide and Conquer: Writing Parallel SAS® Code to Speed Up Your SAS Program
Doug Haigh
Being able to split SAS® processing over multiple SAS processors on a single
machine or over multiple machines running SAS, as in the case of SAS® Grid
Manager, enables you to get more done in less time. This paper looks at the
methods of using SAS/CONNECT® to process SAS code in parallel, including the
SAS statements, macros, and PROCs available to make this processing easier for
the SAS programmer. SAS products that automatically generate parallel code are
also highlighted.
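In outline, the RSUBMIT/WAITFOR pattern the paper covers looks like this (dataset names are hypothetical; AUTOSIGNON and INHERITLIB let the remote sessions start on demand and see the parent's WORK library):

```sas
options autosignon sascmd="!sascmd";

/* Launch two independent sorts in parallel remote sessions.          */
rsubmit task1 wait=no inheritlib=(work=pwork);
   proc sort data=pwork.big1; by id; run;
endrsubmit;

rsubmit task2 wait=no inheritlib=(work=pwork);
   proc sort data=pwork.big2; by id; run;
endrsubmit;

waitfor _all_ task1 task2;   /* block until both sessions finish      */
signoff _all_;
```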
Reporting/Visualization/JMP
Proc Report, the Graph Template Language, and ODS Layouts: Used in Unison to Create Dynamic, Customer-Ready PowerPoints
Amber Carlson, Amelia Stein and Xiaobin Zhou
Twice a year, we create PowerPoint decks and supplemental tables for over 100
customers to present data on their system performance to help inform their
decision-making. We use one SAS program to create PowerPoint slides that
incorporate the corporate template and include dynamic-editable tables, charts,
titles, footnotes, and embedded hyperlinks that open additional drill-down data
tables in either PDF or Excel format. These additional data tables are saved in
automatically created categorized folders.
In this SAS program we first employ SAS styles, ODS Layout, and ODS PowerPoint
to format the slides and automate creation. Macros and X Command are also
utilized to create categorized folders for organization. Finally, the Graph
Template Language, Proc Report, and ODS PDF are utilized to create the
customer-specific charts and tables for the main deck and the supplemental
tables that are linked by hyperlinks on the corresponding slides of the
PowerPoint. This program starts from the raw data source and the output is a
complete customer deck that is ready for presentation.
In this paper we share examples of how to create a completely customized
PowerPoint deck using SAS styles and ODS Layout. We also share tips and tricks
that we have learned regarding what works and what does not in the ODS
PowerPoint destination. In addition, we demonstrate the program flow to
highlight each type of functionality required to create a multi-level custom
report.
Building Interactive Microsoft Excel Worksheets with SAS® Office Analytics
Tim Beese
Microsoft Office has over 1 billion users worldwide, making it one of the most
successful pieces of software on the market today. Imagine combining the
familiarity and functionality of Microsoft Office with the power of SAS® to
include SAS content in a Microsoft Office document. By using SAS® Office
Analytics, you can create Microsoft Excel worksheets that are not just static
reports, but interactive documents. This paper looks at opening, filtering, and
editing data in an Excel worksheet. It shows how to create an interactive
experience in Excel by leveraging Visual Basic for Applications using SAS data
and stored processes. Finally this paper shows how to open SAS® Visual
Analytics reports into Excel, so the interactive benefits of SAS Visual
Analytics are combined with the familiar interface of an Excel worksheet. All
of these interactions with SAS content are possible without leaving Microsoft
Excel.
Make the jump from Business User to Data Analyst in SAS® Visual Analytics
Ryan Kumpfmiller
SAS® Visual Analytics is effective in empowering the business user with the
skills to build reports and dashboards. The tool is easy to use and navigate,
but it also has capabilities that go beyond just presenting data. There are
additional data analysis features, such as forecasting, fit lines, and
correlations, which can give those business users better insight into their
data. This paper is going to go into what each of those features are, how to
interpret them, and what objects they are used with in SAS® Visual Analytics.
Success Takes Balance, Don't Fall Over With Your SAS® Visual Analytics Implementation
Ryan Kumpfmiller and Craig Willis
When deploying SAS Visual Analytics, companies want to set up a system that
will be effective in supporting their organization. When it comes to building
anything, it is key to set a solid foundation on the most important areas. In a
SAS Visual Analytics implementation, those are technology, people, culture, and
process. In this paper, you will learn how to structure those areas so that you
can put your system in a position to succeed.
Annotating the ODS Graphics Way!
Dan Heath
For some users, having an annotation facility is an integral part of creating
polished graphics for their work. To meet that need, we created a new
annotation facility for the ODS Graphics procedures in SAS® 9.3. Now, with
SAS® 9.4, the Graph Template Language (GTL) supports annotation as well! In
fact, the GTL annotation facility has some unique features not available in the
ODS Graphics procedures, such as using multiple sets of annotation in the same
graph and the ability to bind annotation to a particular cell in the graph.
This presentation covers some basic concepts of annotating that are common to
both GTL and the ODS Graphics procedures. I apply those concepts to demonstrate
some unique abilities of GTL annotation. Come see annotation in action!
Mapping Roanoke Island: from 1585 to present
Barbara Okerson
One of the first maps of the present United States was John White's 1585 map of
the Albemarle Sound and Roanoke Island, the site of the Lost Colony and of my
present home. This presentation looks at advances in mapping through the ages,
from the early surveys and hand-painted maps, through lithographic and
photochemical processes, to digitization and computerization. Inherent
difficulties in including small pieces of coastal land (often removed from map
boundary files and data sets to smooth a boundary) are also discussed. The
presentation concludes with several current maps of Roanoke Island created with
SAS®.
Creating a Publication Quality Graphic with SAS®
Charlotte Baker
Graphics are an excellent way to display results from multiple statistical
analyses and get a visual message across to the correct audience. The
combination of SG procedures, such as PROC SGPLOT, and ODS statements in SAS®
allow for the creation of custom graphics that meet the expectations of
scientific journals and are excellent for research presentations or handouts.
While these are excellent tools, first-time users may experience difficulties
when attempting to utilize them. This paper will describe two methods for
creating a publication quality graphic in SAS® 9.4 and, more specifically,
solutions for some issues encountered when doing so.
SAS® Formats: Effective and Efficient
Harry Droogendyk
SAS® formats, whether they be the vanilla variety supplied with the SAS
system, or fancy ones you create yourself, will increase your coding and
program efficiency. (In)Formats can be used effectively for data conversion,
data presentation and data summarization, resulting in efficient, data-driven
code that's less work to maintain. Creation and use of user-defined formats,
including picture formats, are also included in this paper.
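A small sketch of the user-defined and picture formats discussed above (the AGEGRP grouping and MONEY picture are illustrative):

```sas
proc format;
   value agegrp               /* grouping format for summarization    */
      low - 17   = '<18'
      18  - 64   = '18-64'
      65  - high = '65+';
   picture money              /* picture format for presentation      */
      low - high = '000,000,009.99' (prefix='$');
run;

/* Applying the format groups rows without recoding the data.         */
proc freq data=sashelp.heart;
   tables ageatstart;
   format ageatstart agegrp.;
run;
```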
A Real World Example: Using the SAS® ODS Report Writing Interface to revamp South Carolina's School District Special Education Data Profiles
Fred Edora
Creating complex reports can be a daunting task, especially when using multiple
data sources. The SAS ODS Report Writing Interface can provide a significant
amount of flexibility for your reports. Prior to the use of SAS, the South
Carolina Department of Education's 88 special education district data
profiles were created using Microsoft Excel and Word and were required to be
produced annually. With limited staff, this process took weeks to build the
reports and was not flexible to meet changing data needs. This paper will
illustrate how the SAS ODS Report Writing Interface helped revamp these annual
district reports to help staff save time and effort while also including
additional features and data (such as conditional processing and color coding)
to give the reports a more professional appearance.
UCF Stored Process Conversion for Current STEM Retention Reports
Carlos Piemonti and Geoffrey Wical
At the University of Central Florida, in order to reduce the proliferation of
similar reports in our SAS® Information Delivery Portal, we have been tasked
with generating multiple different reports from one stored process. Users will
be prompted to select multiple criteria which will determine the report to be
output, as opposed to having multiple stored processes and running them
separately.
CMISS: the SAS® Function You May Have Been MISSING
Mira Shapiro
Those of us who have been using SAS for more than a few years often rely on our
tried- and-true techniques for standard operations like assessing missing
values. Even though the old techniques still work, we often miss some of the
new functionality added to SAS that would make our lives much easier. In an
effort to ascertain how many people skipped questions on a survey, and what
percentage of people answered each question, I searched past conference papers
and came across a function that was introduced in SAS 9.2: CMISS. By using a
combination of CMISS and PROC TRANSPOSE, a full missing-value assessment can be
done in a concise program. This paper will demonstrate how CMISS makes
assessing survey completeness an easy task.
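The core of the technique can be sketched in a few lines (the survey dataset and Q1-Q20 items are hypothetical):

```sas
data survey_miss;
   set work.survey;
   /* CMISS counts missing values across character and numeric        */
   /* variables alike, so mixed-type questionnaires need no special   */
   /* handling.                                                       */
   n_skipped    = cmiss(of q1-q20);
   pct_answered = 100 * (20 - n_skipped) / 20;
run;
```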
Time-to-Degree Issue Solution using Academic Analytics
Sivakumar Jaganathan, Thanuja Sakruti and Abhishek Uppalapati
Analyzing academic data and arriving at administrative decisions is a crucial
process at any university and is made easier using SAS® Business Intelligence
creating informative applications. Excessive time-to-degree is a worsening
problem causing losses to both students and universities. Studying trends helps
identify the factors that could decrease time-to-degree, which in turn helps
advance the institution.
Our main objective is to analyze the data pertaining to time-to-degree trend
and assess the factors that constitute this trend. These reports are developed
in SAS® Visual Analytics using the dataset extraction and processing done in
the SAS® Enterprise Guide. This data can be observed for different departments
and can be used to get a bird's-eye view of the time-to-degree trend in
different fields. The other objective is to provide analytical insight on
time-to-degree at UConn in order to avoid or decrease the losses it incurs:
added direct costs to students, reduced ROI and increased debt, admission seats
lost because students occupy them for a long time, and a poor success rate that
affects the reputation of the university and decreases future revenue. Factors
can be further studied using hypothesis tests and various statistical models to
identify relations between time-to-degree and variables obtained from Academic
Analytics datasets. Maintaining research standards is of utmost importance at a
university, and the key quality metrics will be monitored using Academic
Analytics.
Higher Education institutions need reliable and consistent quantitative and
qualitative information on productivity and accountability. Efficient research
and analysis using visualization and modeling tools support the planning and
crucial decisions.
Enhanced Swimmer Plots: Tell More Sophisticated Graphic Stories in Oncology Studies
Ilya Krivelevich, Andrea Dobrindt, Simon Lin and Xiaomin He
In oncology studies, investigators are often interested in the relationship
among a subject's various evaluations, including treatment exposure, response
timepoint, start and end of adverse events etc. One of the ways to achieve it
is a swimmer plot showing multiple pieces of a subject's response "story" in
one glance (Stacey 2014). A traditional swimmer plot, providing single-cell
graphs, might be over-simplified because it cannot provide sufficient
information to the investigators; oncology studies often have more complicated
scenarios, such as multi-therapies and dose titration during the study. This
paper proposes enhanced swimmer plots which extend the use of swimmer plots
to more sophisticated cases with two examples. One example is to investigate
the adverse events occurrence during the course of a clinical trial, and the
other is to show a subject's tumor response status during the multi-therapy treatment
phase of the study. The paper provides detailed SAS code and
statistical/clinical explanation at each step when creating plots, so readers
could understand the "story" behind the study more thoroughly.
Utilize SAS® 9 SGPLOT to Create Genome Wide Association Studies Plots
Huei-Ling Chen and Jialin Xu
In clinical studies, there is increasing research interest in the associations
between genetic variants and quantitative or categorical traits of clinical
interest, including disease status and treatment response to a specific
medication, especially as the cost of the molecular technology drops
dramatically. For example, one may examine whether cancer patients with a
certain genetic variant can respond better to a particular oncology drug
compared to cancer patients without that type of genetic variant. A major
methodology to test that genetic association is by Genome Wide Association
Study (GWAS) Analysis.
The GWAS analysis consists of three main plots used to conduct
the research step by step: the Q-Q Plot, the Manhattan Plot, and the Regional
Association Plot. The Q-Q Plot assesses the distribution of the observed
p-values for the genetic variants. The Manhattan Plot and the Regional
Association Plot identify which particular region in the genome is associated
with the clinical trait of interest. This paper introduces the plots and
utilizes the SAS version 9 SGPLOT procedure to develop a macro for each of the
three GWAS plots.
Exploring JMP®'s Image Visualization Tools in Medical Diagnostic Applications
Melvin Alexander
Among JMP®'s powerful, graphical functionalities is the ability to open,
display, and analyze data from images. This capability converts pixel values
from images into data tables and matrices. Once data are in data tables,
additional analyses and data visualizations may be performed.
This presentation will demonstrate these capabilities using examples from
medical computer tomographic (CT) images that are viewed for signs to diagnose
patients with specific injuries and medical conditions. Results of the image
analyses help radiologists and clinicians decide the medical treatment options
to give to patients (e.g., surgery, non-operative management, targeted
radiation-beam therapies).
Combining the statistical, graphical and data analytic functionality of JMP®
extends information visualization beyond what can be seen with standard image
interpretation. Ways to replicate these capabilities in SAS® will also be
discussed.
Diabetes Self-Management Education Services in Florida
Adetosoye Oladokun and Rahel Dawit
Identifying areas and populations without access to diabetes self-management
education services is important to the health of Florida residents with
diabetes. This paper will demonstrate how Base SAS can be used as a tool for
mapping health data in order to address issues such as this. It will also
discuss how to compare the results of the created maps. The procedures
utilized for this project include PROC MAPIMPORT, PROC CONTENTS, and PROC
GMAP.
Using JMP to Apply Decision Trees and Random Forests as Screening Tools for Limiting Candidate Predictors in Regression Models
Jason Brinkley
There are many techniques for evaluating candidate predictors in regression
models. Some, such as stepwise regression, have been well studied and have
known limitations. Others, such as shrinkage techniques (e.g., the LASSO), have
become increasingly popular and are showing true potential in helping to
provide quality regression models for estimating effects in a multiple variable
environment. Many of these techniques become difficult in the world of big
data, especially if that data is long (many columns or variables) and short
(fewer numbers of observations or rows). Tree based methods are a good
alternative as a framework for data exploration and identification of variables
to be used in building a quality regression model. Indeed, tree methods provide
an entirely different framework for model building that can oftentimes yield
better predictions. Even when the main goal is still effect estimation, however,
tree methods can be a very useful screening tool. This work examines the
practice based on several examples using options very commonly found in both
JMP and JMP Pro software. The focus is on using both single classification
trees as well as so-called 'Random' or 'Bootstrap' Forests.
5 Secrets for Building Fierce Dashboards
Tricia Aanderud
Are your dashboards or web reports lifeless, unappealing, or ignored? A fierce
dashboard is not an accident; it is the result of careful planning, design
knowledge, and the right data. In this paper, you will learn the techniques
professionals use for creating dashboards that are engaging, beautiful, and
functional. This paper uses SAS Visual Analytics as the example, but the tasks
shown could be accomplished with other SAS tools.
Data Visualization Through 3-D Graphs Using SAS® Graph Template Language (GTL)
Venu Perla
Certain types of data are better visualized through 3-D graphs. The SAS graph
procedures available prior to SAS 9.2 are not user friendly; they are hard to
master and require quite a bit of time to implement. Since the future of
graphing in SAS is centered on the Graph Template Language (GTL), the objective
of this paper is to create 3-D graphs using SAS GTL. This paper also explains
how initial SAS GTL code can be obtained, and how the STATGRAPH template (aka
GTL template) and SGRENDER procedure in the obtained code are modified with
different options and GTL statements to help beginners create 3-D graphs.
Color Speaks Louder than Words
Jaime Thompson
What if color-blind people saw real colors and it was everyone else who was
color blind? Imagine finishing a Rubik's cube while wondering why people find
it so difficult, or relying on the position, rather than the color, of a
traffic light to determine when to stop and go. Color matters! Color enhances
our perspective and can change how we feel, but it also varies culturally. In
Western countries, red and white carry symbolic meanings opposite to those in
Eastern countries, so a misused palette can send the wrong message. Color can
fundamentally change a report; therefore, finding the right color palettes for
data visualizations is
essential. In this paper, I will cover the significance of color and how to
pick a palette for your next SAS Visual Analytics report.
Geospatial Analysis with SAS®
Mike Jadoo
Geospatial analysis produces some of the finest data visualization products
today, extracting the maximum amount of information from statistical accounts
data. Join us on an adventure, whether you are a seasoned practitioner or an
exploring novice, as we explore the world of heat maps. An in-depth look will
be taken at how to make choropleth (heat) maps in SAS®. This review will cover
the different types of maps that can be made, importing data, and the data
structure needed to create a map.
Statistics/Data Analysis
Testing the Gateway Hypothesis from Waterpipe to Cigarette Smoking among Youth Using Dichotomous Grouped-Time Survival Analysis (DGTSA) with Shared frailty in SAS®
Rana Jaber
Dichotomous grouped-time survival analysis is a combination of the grouped Cox
model (D'Agostino et al., 1990), the discrete time-hazard model (Singer and
Willett, 1993), and the dichotomous approach (Hedeker et al., 2000). Items
measured from wave 1 through wave 4 were used as time-dependent covariates
linking the predictors to the risk of waterpipe smoking progression at the
subsequent student interview. This analysis allows for maximum data use,
permits inclusion of time-dependent covariates and relaxation of the
proportional hazards assumption, and takes into consideration the
interval-censored nature of the data (i.e., the event occurred during a
certain known interval, such as one year, but the exact time at which it
occurred cannot be specified). The aim of this paper is to provide a new
method of analyzing panel data where the outcome is binary, with some
explanation of the SAS® code. Examples using PROC PHREG are drawn from data
that was recently published in the International Journal of Tuberculosis and
Lung Disease (IJTLD).
Mixture Priors 101: Using SAS® to Obtain Powerful Frequentist Inferences with Bayesian Methods
Tyler Hicks and Jeffrey Kromrey
Using informed priors, background information about parameters can be included
in statistical analysis. Standard frequentist procedures purposefully omit
priors. Bayesian methods can thus yield more powerful hypothesis tests with
informed priors, even when judged by frequentist criteria. However, a
misspecified informed prior can wreak havoc on Bayesian inferences. Mixture
priors may be used to make Bayesian methods more robust to a possibly
misspecified informed prior. The purpose of this paper is to show how easy
specifying mixture priors is in PROC MCMC. Researchers can thus use PROC MCMC
to implement mixture priors in their own research to obtain more powerful and
robust frequentist statistical inferences using Bayesian methods. This paper
provides a rationale for mixture priors, presents annotated PROC MCMC code,
and concludes with an executed example.
Finding the Area of a Polygon... On a Sphere!
Seth Hoffman
While a picture may be worth a thousand words, sometimes a few numbers are even
better, especially if you want to do some statistics. One of the basic ways to
describe a region such as a state, county, or zip code is its area. There are
several methods one can use to find the area of such polygons, at least when
they lie on a flat Cartesian plane. This paper presents a method and the math
to calculate that area if a polygon is drawn on a spherical surface, such as
Earth.
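The abstract does not say which method the paper derives, so the following Python sketch shows just one standard approach (not necessarily the author's): fan the polygon into spherical triangles from its first vertex and sum the solid angles given by the Van Oosterom-Strackee formula.

```python
import math

def _vec(lat_deg, lon_deg):
    """Unit vector on the sphere for a latitude/longitude in degrees."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def spherical_polygon_area(vertices, radius=6371.0):
    """Area of a geodesic polygon given as [(lat, lon), ...] in degrees.
    Each fan triangle's solid angle E satisfies
    tan(E/2) = [a,b,c] / (1 + a.b + b.c + a.c) for unit vectors a, b, c."""
    pts = [_vec(lat, lon) for lat, lon in vertices]
    a = pts[0]
    total = 0.0
    for b, c in zip(pts[1:], pts[2:]):
        triple = (a[0] * (b[1] * c[2] - b[2] * c[1])
                  + a[1] * (b[2] * c[0] - b[0] * c[2])
                  + a[2] * (b[0] * c[1] - b[1] * c[0]))
        denom = 1.0 + _dot(a, b) + _dot(b, c) + _dot(a, c)
        total += 2.0 * math.atan2(triple, denom)
    return abs(total) * radius ** 2

# Sanity check: one octant of a unit sphere has area 4*pi/8 = pi/2.
area = spherical_polygon_area([(0, 0), (0, 90), (90, 0)], radius=1.0)
```

The same fan sum works for any simple geodesic polygon; passing the Earth's radius in kilometers returns the area in square kilometers.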
Factors of Multiple Fatalities in Car Crashes
Bill Bentley
Road safety is a major concern for all United States citizens. According to the National Highway Traffic Safety Administration, 30,000 deaths
are caused by automobile accidents annually. Oftentimes fatalities occur due to
a number of factors such as driver carelessness, speed of operation, impairment
due to alcohol or drugs, and road environment. Some studies suggest that car
crashes are solely due to driver factors, while other studies suggest car
crashes are due to a combination of roadway and driver factors. However, other
factors not mentioned in previous studies may be contributing to automobile
accident fatalities. The objective of this project was to identify the
significant factors that lead to multiple fatalities in the event of a car crash.
In the Pursuit of Balanced Groups: A SAS® Macro for an Adaptive Randomization Test with continuous covariates
Tyler Hicks and Jeffrey Kromrey
In adaptive randomized experiments, researchers can verify that random
allocation yielded equivalent groups on continuous covariates before proceeding
with the study. Given a precise definition of balanced groups (e.g., Cohen's
d < 0.5), researchers may plan to keep reshuffling participants until group
balance is achieved. However, reshuffling can invalidate the p-values of
standard parametric tests, such as t-tests, F-tests, and chi-square tests. This
paper describes a non-parametric analog of the t-test of independent means,
called an adaptive randomization test, which can preserve Type I error control
in adaptive randomized experiments. Although such tests are well established in
mainstream statistics (Morgan & Rubin, 2015), they are not readily available in
SAS, making them difficult to implement. This paper provides an overview of
adaptive randomization tests, presents a macro for doing such a test in base
SAS, and concludes with two executed examples of the macro.
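The macro in the paper is written in Base SAS; purely as a hedged illustration of the underlying procedure (the function and variable names below are invented for the example), a Monte Carlo version of such a test conditions each reshuffle on the same balance criterion the experimenter used at design time:

```python
import random
import statistics

def cohens_d(x, y):
    """Standardized mean difference with a pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled = (((nx - 1) * statistics.variance(x)
               + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / pooled

def adaptive_randomization_test(covariate, outcome, d_max=0.5,
                                reps=2000, seed=1):
    """Randomization test that conditions on covariate balance: only
    reshuffles with |Cohen's d| < d_max on the covariate are kept,
    mirroring the re-randomizations the experimenter would have accepted.
    The observed assignment is taken to be first half vs. second half."""
    n = len(outcome)
    half = n // 2
    obs = abs(statistics.mean(outcome[:half]) - statistics.mean(outcome[half:]))
    rng = random.Random(seed)
    idx = list(range(n))
    hits = kept = 0
    while kept < reps:
        rng.shuffle(idx)
        g1, g2 = idx[:half], idx[half:]
        if abs(cohens_d([covariate[i] for i in g1],
                        [covariate[i] for i in g2])) >= d_max:
            continue  # this shuffle would have been rejected at design time
        kept += 1
        diff = abs(statistics.mean([outcome[i] for i in g1])
                   - statistics.mean([outcome[i] for i in g2]))
        if diff >= obs:
            hits += 1
    return hits / reps

cov = list(range(20))             # toy covariate
out = [i % 5 for i in range(20)]  # toy outcome with no group effect
p_value = adaptive_randomization_test(cov, out, reps=200)
```

Restricting the reference distribution to balanced shuffles is what preserves the Type I error rate that an unconditional permutation test would lose.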
Empowering Self-Service Capabilities with Agile Analytics
Tho Nguyen and Bob Matsey
Business and IT users are struggling to know what version of the data is valid,
where should they get the data from, and how to combine and aggregate all the
data sources to apply analytics and deliver results in a timely manner. In
addition, once they start trying to join and aggregate all the different types
of data, the manual coding can be very complicated and tedious that demand
extraneous resources, processing and negatively impact the overhead on the
system. By enabling agile analytics in a data lab, it can alleviate many of
these issues, increase productivity and deliver an effective self-service
environment for all users. This self-service environment using SAS® analytics
in Teradata has decreased time to prepare the data, develop the statistical
data model and deliver faster results in minutes compared to days or even
weeks. This session will discuss how you can enable agile analytics in a
data lab, leverage SAS® Analytics in Teradata to increase performance and learn how
hundreds of organizations have adopted this concept to deliver self-service
capabilities in a streamlined process.
Identifying Gaps in Time Series Data
Bruce Gilsen
Missing observations are a common issue with time series data. For a small
amount of data, you can print the data set and inspect manually, but this is
not realistic for large data sets, especially if there are multiple BY groups.
With the TIMEID procedure, which is part of SAS/ETS® software, you can easily
check a data set to determine if observations are missing, and print a simple
report that shows the location and quantity of missing observations.
PROC TIMEID can be used with SAS® date or datetime variables, and for any
frequency. The examples in this paper use SAS dates with frequency WEEKDAY,
which is common at the Federal Reserve Board.
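PROC TIMEID does this directly in SAS/ETS; purely to illustrate what "finding gaps at frequency WEEKDAY" means, a rough Python analog might look like this (the function name is invented for the example):

```python
from datetime import date, timedelta

def weekday_gaps(dates):
    """Return the business days (Mon-Fri) missing between the first and
    last date in a series -- a rough analog of what PROC TIMEID reports
    for INTERVAL=WEEKDAY."""
    have = set(dates)
    gaps = []
    d, last = min(have), max(have)
    while d <= last:
        if d.weekday() < 5 and d not in have:  # weekday() 0-4 = Mon-Fri
            gaps.append(d)
        d += timedelta(days=1)
    return gaps

# Mon 2016-10-03 through Fri 2016-10-07 with Wednesday missing:
series = [date(2016, 10, 3), date(2016, 10, 4),
          date(2016, 10, 6), date(2016, 10, 7)]
missing = weekday_gaps(series)  # [date(2016, 10, 5)]
```

Running one such check per BY group reproduces the spirit of the PROC TIMEID report: where the gaps are and how many observations each gap spans.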
A Simple SAS® Macro to Perform Blinder-Oaxaca Decomposition
Taylor Lewis and Stanislas Ezoua
Blinder-Oaxaca decomposition is a straightforward statistical method that
emerged in the econometrics literature as a way to explain differences observed
between groups with respect to the mean of a continuous variable. To give a
few examples, the procedure has been used to investigate potential pay
differentials between males and females, or whether health disparities exist
amongst individuals of varying socioeconomic statuses. In this paper, we
provide background on the fundamental concepts and objectives behind
Blinder-Oaxaca decomposition, and present a general-purpose macro that analysts
interested in conducting the technique might find helpful.
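To make the mechanics concrete: the paper's macro is in SAS and handles the general multi-covariate case, while this single-predictor Python sketch (names invented for the illustration) only shows the arithmetic of the two-fold decomposition.

```python
import statistics

def _ols(x, y):
    """Simple least-squares fit of y = b0 + b1 * x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    return my - b1 * mx, b1

def oaxaca_twofold(xa, ya, xb, yb):
    """Two-fold Blinder-Oaxaca decomposition of mean(ya) - mean(yb) with one
    predictor, using group B's coefficients as the reference structure:
    explained   = (xbar_a - xbar_b) * b1_b  (endowment differences)
    unexplained = gap - explained           (coefficient differences)."""
    _, b1_b = _ols(xb, yb)
    gap = statistics.mean(ya) - statistics.mean(yb)
    explained = (statistics.mean(xa) - statistics.mean(xb)) * b1_b
    return gap, explained, gap - explained

# Groups with identical mean x, so the whole gap is "unexplained":
gap, expl, unexpl = oaxaca_twofold([1, 2, 3, 4], [3, 5, 7, 9],
                                   [1, 2, 3, 4], [2, 3, 4, 5])
```

In a pay-gap application, "explained" would be the part of the wage difference attributable to different covariate means (education, tenure, and so on) and "unexplained" the part attributable to different returns on those covariates.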
PROC LOGISTIC: Traps for the unwary
Peter Flom
This paper covers some gotchas in SAS PROC LOGISTIC. A gotcha is a
mistake that isn't obviously a mistake: the program runs, there may be a
note or a warning, but no errors. Output appears. But it's the wrong output. This is not a primer on PROC LOGISTIC, much less on logistic regression. There are many good books on logistic regression; one such is Hosmer
and Lemeshow [2000].
Each section has several subsections. First, I identify the gotcha. Then
I give an example. Third, I show what evidence you have that it occurs. Fourth,
I show how to fix it in some cases, referring to other resources. In some cases, I offer an
explanation between the evidence and the solution.
Missing Data and Complex Sample Surveys Using SAS®: The Impact of Listwise Deletion vs. Multiple Imputation on Point and Interval Estimates when Data are MCAR and MAR
Anh Kellermann, DeAnn Trevathan and Jeffrey Kromrey
Social scientists from many fields use secondary data analysis of complex
sample surveys to answer research questions and test hypotheses. Despite great
care taken to obtain the data needed, missing data are frequently found in such
samples. Even though missing data are a ubiquitous problem, the methodological
literature has provided little guidance on the appropriate treatment for
such missingness. This Monte Carlo study used SAS to investigate the impact of
missing data treatment (multiple imputation versus listwise deletion) when
data are MCAR and MAR. By varying the proportion of missing data from 10% to
70% (along with complete-sample conditions as a reference point for the
interpretation of results),
the research focused on the parameter estimates in multiple regression analysis
in complex sample data. Results are presented in terms of statistical bias in
the parameter estimates and both confidence interval coverage and width.
The HPSUMMARY® Procedure: A Younger (and Brawnier) Cousin to an Old SAS® Friend
Anh Kellermann and Jeffrey Kromrey
The HPSUMMARY procedure provides data summarization tools to compute basic
descriptive statistics for variables in a SAS dataset. It is a high-performance
version of the SUMMARY procedure in Base SAS. Although PROC SUMMARY is
popular with data analysts, PROC HPSUMMARY is still a new kid on the
block. This paper provides an introduction to PROC HPSUMMARY by comparing it
with its well-known counterpart, PROC SUMMARY. General syntax differences as
well as performance in terms of processing time and memory utilization of the
two procedures were examined. Simulated data of different sizes were used to
observe the performance of the two procedures. Experimental results indicate
that there was no clear difference in real time between PROC SUMMARY and its
high-performance counterpart. The HP version utilized more memory but provided
better memory management in a limited-memory environment than the legacy
version did. The impact of multi-core computing on HPSUMMARY's processing
time for different data volumes was also examined.
A Macro for Calculating Kendalls Tau-b Correlation on Left Censored Environmental Data
Dennis Beal
Kendall's tau-b correlation is a nonparametric method for calculating the
monotonic correlation between two variables with multiple censoring levels. While SAS does calculate Kendall's tau-b as an option in PROC CORR
within Base SAS, its calculation assumes all values are known and
uncensored. It does not incorporate the censoring that is common with chemical or
environmental data. Environmental data often are reported from the analytical
laboratory as left censored, meaning the actual concentration for a given
contaminant was not detected above the method detection limit. Therefore, the
true concentration is known only to be between 0 and the reporting limit. Kendall's tau-b can be used on left-censored data with multiple reporting
limits under minimal assumptions. This paper presents a SAS macro that
calculates Kendall's tau-b by incorporating the additional ties that occur
when comparing detected concentrations with non-detects. This paper is for
intermediate users of Base SAS.
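The paper's macro is in SAS; to illustrate the "additional ties" idea in a language-neutral way (the data representation and names below are invented for the example), each pairwise comparison involving a non-detect can be scored as a tie unless the ordering is unambiguous:

```python
import math

def _sign(v1, det1, v2, det2):
    """Compare two possibly left-censored results. A result is (value,
    detected); a non-detect means the true value lies in [0, value)."""
    if det1 and det2:
        return (v1 > v2) - (v1 < v2)
    if det1 and not det2:               # result 2 censored below its limit
        return 1 if v1 >= v2 else 0     # unambiguous only if v1 >= limit
    if det2 and not det1:
        return -1 if v2 >= v1 else 0
    return 0                            # two non-detects: always a tie

def censored_tau_b(x, y):
    """Kendall's tau-b for paired samples with left-censored values,
    counting every ambiguous comparison as a tie.
    x and y are lists of (value, detected_bool) tuples."""
    n = len(x)
    conc = disc = tie_x = tie_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            sx = _sign(*x[i], *x[j])
            sy = _sign(*y[i], *y[j])
            if sx == 0:
                tie_x += 1
            if sy == 0:
                tie_y += 1
            if sx * sy > 0:
                conc += 1
            elif sx * sy < 0:
                disc += 1
    n0 = n * (n - 1) // 2
    return (conc - disc) / math.sqrt((n0 - tie_x) * (n0 - tie_y))

x = [(0.5, False), (1.0, True), (2.0, True)]   # one non-detect, limit 0.5
y = [(1.0, True), (2.0, True), (3.0, True)]
tau = censored_tau_b(x, y)
```

The tie counts enter the tau-b denominator, which is why extra ties from censoring shrink the magnitude of the correlation rather than biasing its sign.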
A Data Mining Approach to Predict Student-at-risk
Youyou Zheng, Sivakumar Jaganathan, Thanuja Sakruti and Abhishek Uppalapati
Data mining is an analysis process to obtain useful information from large data
sets and unveil their hidden patterns (Mehmed 2003, Tan 2005). It has been
successfully applied in the business areas like fraud detection and customer
retention for decades. With the increasing amount of educational
data, educational data mining has become more and more important to uncover the
hidden patterns within the institutional data, so as to support institutional
decision making (Luan 2012). However, only very limited studies have been done
on educational data mining for institutional decision support. The
institutional researchers from Western Kentucky University built a model to
help increase yield and retention at the University (Bogard 2013). The
researcher from the University of California also proposed to apply data mining
technique in the college recruitment process to achieve enrollment goals (Chang
2009). Both of the institutions used SAS Enterprise Miner as their
data mining tool. In this study, we are going to use SAS Enterprise Miner to build up the
student-at-risk model. At the University of Connecticut, organic
chemistry is a required course for undergraduate students in a STEM discipline. It has a
very high DFW rate (D=Drop, F=Failure, W=Withdraw). Taking Fall 2014 as an
example, the average DFW rate is 24% at UCONN, with over 1,200 students
enrolled in this class. In this study, Fall 2009-2013 student enrollment
data is used to build the model, and Fall 2014 data is used to test it. The SEMMA (Sample, Explore, Modify, Model and Assess) method introduced by SAS
Institute Inc. is applied to develop the predictive model. The freshmen SAT
scores, campus, semester GPA, financial aid, first generation information and
other factors are used to predict students' performance in this course. In
the predictive modeling process, several different modeling techniques
(decision tree, neural network, ensemble models, and logistic regression) are
compared with each other in order to find the optimal one for our institution. The purpose of this study is to predict student success in their future studies
so as to improve the quality of education at our institution.
Surviving the Interim: Insights Into Interim Survival Analyses
Venita DePuy
Pre-specified interim analyses may be performed to evaluate whether a clinical
trial can be halted prematurely for overwhelming efficacy and/or futility. This approach is somewhat more complex when the primary endpoint is a survival analysis. This paper provides an overview of performing the initial
calculations and actual interim analyses using SAS's PROC SEQDESIGN and PROC
SEQTEST, EAST software, and PASS software.
Designing and Analyzing Surveys with SAS/STAT® Software
Pushpal Mukhopadhyay
Designing probability-based sample surveys usually requires the use of
strategies such as stratification, clustering and unequal weighting. Analyzing the resulting data
requires specialized techniques that take these strategies into account in
order to produce statistically valid inferences. This requires specialized
software. This tutorial shows you how to use the SAS/STAT software
specifically designed for selecting and analyzing probability samples for
survey data. You will learn how to:
- Select probability samples according to various designs with the
SURVEYSELECT procedure.
- Impute missing values in your sample with the SURVEYIMPUTE procedure.
- Produce descriptive statistics with the SURVEYMEANS and SURVEYFREQ
procedures.
- Build statistical models with the SURVEYREG, SURVEYLOGISTIC, and
SURVEYPHREG procedures.
The tutorial also discusses the characteristics of different variance
estimation techniques, including both Taylor series and replication methods.
The course is intended for a broad audience of statisticians who are interested
in analyzing sample survey data. Familiarity with basic statistics, including
regression analysis, is strongly recommended.
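The sample-selection step in the first bullet can be pictured with a toy sketch. The fragment below is not SAS and covers only the simplest design PROC SURVEYSELECT supports (stratified simple random sampling with proportional allocation and base weights N_h/n_h); the names are invented for the illustration.

```python
import random

def stratified_sample(frame, strata_key, n_total, seed=7):
    """Proportional-allocation stratified simple random sample.
    Returns (unit, weight) pairs, where weight = N_h / n_h so that
    the weights sum back to the frame size."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(strata_key(unit), []).append(unit)
    N = len(frame)
    sample = []
    for units in strata.values():
        n_h = max(1, round(n_total * len(units) / N))  # proportional share
        for unit in rng.sample(units, n_h):
            sample.append((unit, len(units) / n_h))
    return sample

frame = [("A", i) for i in range(60)] + [("B", i) for i in range(40)]
sample = stratified_sample(frame, lambda u: u[0], n_total=10)
```

The weights are what the SURVEYMEANS-style analysis procedures consume: each sampled unit stands in for N_h/n_h units of its stratum, which is exactly the "unequal weighting" the tutorial's variance-estimation discussion must account for.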
A prediction model for which country will win the highest number of Golds in 2016 Olympics
Antarlina Sen and Gaurang Margaj
The Olympic Games were started to establish healthy friendship between countries. Given their history and value, http://www.olympic.org/ , www.iaaf.org ,
http://www.rio2016.com and other websites contain records of 300+ events where
10,000+ athletes (men and women) from 204 countries compete. These data are
available from 1896 to 2012. The goal of this paper is to predict which country
will win the maximum number of Golds in the Rio Olympics of 2016 using
past trends of countries' performances and the statistics of players listed for
2016. Before building this prediction model, exploration and descriptive
analytics will be carried out to answer questions such as the following. Does the
current form of a player impact his/her performance in the Olympics? Has a
specific country performed better with respect to the others in the last two
years in a particular sport? If a player wins a medal in one Olympics, what
are the chances he/she will win a Gold in the next one? Does a more
experienced player have a better chance of winning? Insights generated from
answering these questions will be used to build the predictive model.
Analyzing non-normal data with categorical response variables
Niloofar Ramezani and Ali Ramezani
In many applications, the response variable is not normally distributed. If the
response variable is categorical with two or more possible responses, it makes
no sense to model the outcome as normal. When dealing with categorical outcome
variables, the relationship between the outcome and predictors is not linear
anymore, hence more advanced models than general linear models need to be used
to appropriately model this relationship. Binary, multinomial, and ordinal
logistic regression models are some examples of the robust predictive methods
to use for modeling the relationship between non-normal discrete response and
the predictors.
This study looks at several methods of modeling binary, categorical and ordinal
correlated response variables within regression models. Starting with the
simplest case of binary responses, through ordinal response variables, this
study discusses different modeling options within SAS. At the end, some
missing-data handling techniques are suggested to appropriately account for the
high percentages of missing observations that occur frequently in practice when
fitting these statistical models.
Various statistical techniques such as logistic and probit models, generalized
linear mixed models (GLMM), log-linear models, and generalized estimating
equations (GEE) are among the models discussed and applied to
real data with categorical outcome variables in this study. This paper
discusses different options within SAS 9.4 for the aforementioned models. These
procedures include PROC LOGISTIC, PROC PROBIT, PROC GENMOD, PROC GLIMMIX, PROC
NLMIXED, PROC MIXED, PROC CATMOD and PROC GEE.
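As a concrete picture of what the simplest of these procedures fits: for a binary response with one continuous predictor, PROC LOGISTIC maximizes the Bernoulli log-likelihood, which a hand-rolled Newton-Raphson sketch (illustrative only, not production code) can reproduce:

```python
import math

def logistic_fit(x, y, iters=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson on the
    log-likelihood -- the same model PROC LOGISTIC fits for a binary
    response with one continuous predictor."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p           # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w               # observed information (negative Hessian)
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: H^{-1} g
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 0, 1, 1]
b0, b1 = logistic_fit(x, y)
```

At the maximum-likelihood solution the fitted probabilities sum to the number of observed events (the intercept gradient is zero), a useful sanity check on any logistic fit. The multinomial, ordinal, mixed, and GEE extensions the paper surveys all generalize this same likelihood machinery.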
Statistical Model Building for Large, Complex Data: Five New Directions in SAS/STAT® Software
Robert Rodriguez
The increasing size and complexity of data encountered in business analytics
require analysts to apply a growing set of tools for building statistical
models. In response to this need, SAS/STAT software continues to add new
methods. This presentation takes you on a high-level tour of four major
enhancements: new effect selection methods for ordinary regression models in
the GLMSELECT procedure; model selection for generalized linear models with the
HPGENSELECT procedure; model selection for quantile regression with the
HPQUANTSELECT procedure; and construction of generalized additive models with
the GAMPL procedure. For each of these approaches, the presentation explains
the concepts, illustrates the benefits with a basic example, and guides you to
information that will help you get started.