SESUG 2016 Conference Abstracts

Application/Macro Development

Creating Viable SAS® Data Sets From Survey Monkey® Transport Files
John R. Gerlach

Survey Monkey is an application that provides a means for creating online surveys.  Unfortunately, the transport (Excel) file from this application requires a complete overhaul in order to do any serious data analysis.  Besides the file's peculiar structure and extraneous data points, the column headers become very problematic when the file is imported into SAS.  In fact, the initial SAS data set is virtually unusable.  This paper explains a systematic approach for creating a viable SAS data set for doing serious analysis.


Universal File Flattener
David Vandenbroucke

This paper describes the construction of a program that converts any set of relational tables into a single flat file, using Base SAS®.  The program gets the information it needs from the data tables themselves, with a minimum of user configuration.  It automatically detects one-to-many relationships and creates sets of replicate variables in the flat file.  The program illustrates the use of macro programming and the SQL, DATASETS, and TRANSPOSE procedures, among others.  The output is sent to a Microsoft Excel spreadsheet using the Output Delivery System (ODS).


%SUBMIT_R: A SAS® Macro to Interface SAS and R
Ross Bettinger

The purpose of the %SUBMIT_R macro is to facilitate communication between SAS and R under Windows.  %SUBMIT_R uses SAS's unnamed pipe device type to invoke the R executable.  SAS datasets may be converted into R data frames and vice versa in a manner similar to using the SAS/IML ExportDataSetToR and ImportDataSetFromR functions.  R graphics are also supported, and are displayed in the SAS results viewer.  Graphs may be saved in user-specified locations as image files.  R scripts may be created using a SAS DATA _NULL_ step to write a file containing an R script, read from a user-specified .R input file, or by using the %R macro.  Output of R execution may be directed to the SAS log file for inspection or to a user-specified .Rout file for later use.  Keywords: SAS macro, R script, SAS program file, unnamed pipe, ODS, HTML.
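As a minimal sketch of the unnamed-pipe idea (not the %SUBMIT_R macro itself; the Rscript path and script name are assumptions):

  filename rjob pipe '"C:\Program Files\R\R-3.3.1\bin\Rscript.exe" C:\temp\myscript.R 2>&1';

  data r_output;
    infile rjob truncover;
    input line $char256.;
    put line=;             /* echo the R output to the SAS log */
  run;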


New Game in Town: SAS® Proc Lua with Applications
Jiangtang Hu

Lua (pronounced LOO-ah) is a scripting language like Python.  It is very popular in the gaming industry, and in recent years it has also moved into the machine learning world through the widely used Torch 7 library.  SAS introduced Lua to its Base module in version 9.4, which definitely extends SAS programming functionality.  Lua is also supported in Viya, the upcoming SAS next-generation high-performance analytics platform.

In this paper, several examples (dynamic programming as a replacement for SAS macros, reading external files like JSON, XML, and HDF5, and implementing a machine learning algorithm like Naïve Bayes) will be presented to showcase how Proc Lua can add to a SAS programmer's toolbox.  All code will be available on GitHub at https://github.com/Jiangtang/SESUG/tree/master/2016.
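As a flavor of the dynamic-programming examples, a minimal PROC LUA sketch (the variable list is only illustrative):

  proc lua restart;
  submit;
    -- loop over analysis variables and submit one SAS step per variable
    local vars = {"Height", "Weight"}
    for i, v in ipairs(vars) do
      sas.submit([[proc means data=sashelp.class; var @v@; run;]], {v=v})
    end
  endsubmit;
  run;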


Breaking up (Axes) Isn't Hard to Do: A Macro for Choosing Axis Breaks
Alex Buck

SAS® 9.4 brought some wonderful new graphics options.  One of the most exciting is the addition of the RANGES option for SGPLOT axes.  As the name suggests, specifying ranges for a broken axis is controlled by the user.  The only question left is where to set the breaks and if a break is actually needed.  That is what this macro is designed to do.  The macro analyzes the specified input parameter to create macro variables for the overall minimum and maximum, as well as macro variables specifying values prior to and following the largest difference that occurs between successive parameter values.  The macro will also create variables for suggested break values to ensure graphic items such as markers are displayed in full.  The user then utilizes these macro variables to determine if an axis break is needed and where to set those breaks.  With the macro's dynamic nature, it can be incorporated into larger graphics macro programs easily while making specific recommendations for each individual parameter.  A complete and intuitive graph is produced with every macro call.
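A minimal sketch of the underlying SGPLOT feature the macro feeds (the data set and break values here are assumptions):

  proc sgplot data=lab_results;          /* assumed input data set        */
    scatter x=visit y=result;
    yaxis ranges=(0-20 180-220);         /* break the axis across the gap */
  run;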


Moving Data and Results Between SAS® and Microsoft Excel
Harry Droogendyk

Microsoft Excel spreadsheets are often the format of choice for our users, both when supplying data to our processes and as a preferred means for receiving processing results and data.  SAS® offers a number of means to import Excel data quickly and efficiently.  There are equally flexible methods to move data and results from SAS to Excel.  This paper will outline the many techniques available and identify useful tips for moving data and results between SAS and Excel efficiently and painlessly.
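Two of the techniques covered, sketched with assumed file paths and sheet names:

  /* read a worksheet into SAS */
  proc import datafile="C:\data\budget.xlsx" out=work.budget
              dbms=xlsx replace;
    sheet="FY2016";
  run;

  /* send summarized results back to Excel */
  ods excel file="C:\data\budget_summary.xlsx" options(sheet_name="Summary");
  proc means data=work.budget sum mean;
  run;
  ods excel close;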


Using PROC FCMP for Short Test Assembly
Tsung-hsun Tsai and Yung-chen Hsu

Assembling short test forms is a practical task in testing service organizations.  A short test form contains fewer test items than the full-length test does, yet still preserves the essential test quality and captures the required test statistical characteristics.  We explore an instance selection method for the short test assembly task by using PROC FCMP to conduct low-level array operations that implement minimum spanning tree clustering.  PROC FCMP was developed to allow SAS users to write their own functions and subroutines for use in the DATA step or SAS procedures.  The purpose of this paper is to demonstrate how we can use SAS as a functional programming language by utilizing PROC FCMP to write reusable functions or subroutines to manage complex tasks.
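A small sketch of the PROC FCMP pattern (the function name and data set names are illustrative, not the authors' clustering code):

  proc fcmp outlib=work.funcs.testassembly;
    function pct_diff(a, b);
      if b = 0 then return(.);
      return(100 * (a - b) / b);
    endsub;
  run;

  options cmplib=work.funcs;

  data item_stats2;
    set item_stats;                          /* assumed input data set */
    diff_from_target = pct_diff(item_stat, target_stat);
  run;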


A Waze App for Base SAS®: Automatically Routing around Locked Data Sets, Bottleneck Processes, and Other Traffic Congestion on the Data Superhighway
Troy Hughes

The Waze application, purchased by Google in 2013, alerts millions of users about traffic congestion, collisions, construction, and other complexities of the road that can stymie motorists' attempts to get from A to B.  From jackknifed rigs to jackalope carcasses, roads can be gnarled by gridlock or littered with obstacles that impede traffic flow and efficiency.  Waze algorithms automatically reroute users to more efficient routes based on user-reported events as well as historical norms that demonstrate typical road conditions.  Extract-transform-load (ETL) infrastructures often represent serialized process flows that can mimic highways, and which can become similarly snarled by locked data sets, slow processes, and other factors that introduce inefficiency.  The LOCKITDOWN SAS® macro, introduced at WUSS in 2014, detects and prevents data access collisions that occur when two or more SAS processes or users simultaneously attempt to access the same SAS data set.  Moreover, the LOCKANDTRACK macro, introduced at WUSS in 2015, provides real-time tracking of and historical performance metrics for locked data sets through a unified control table, enabling developers to hone processes to optimize efficiency and data throughput.  This text demonstrates the implementation of LOCKSMART and its lock performance metrics to create data-driven, fuzzy logic algorithms that preemptively reroute program flow around inaccessible data sets.  Thus, rather than needlessly waiting for a data set to become available or a process to complete, the software actually anticipates the wait time based on historical norms, performs other (independent) functions, and returns to the original process when it becomes available.


The Demystification of a Great Deal of Files
Chao-Ying Hsieh

Our input data are sometimes stored in external flat files rather than in a traditional database environment.  This creates tedious work if programmers need to read a large number of input files from multiple locations.  This paper will address a solution to this issue that uses the SAS® macro facility and the DREAD function.  Additionally, the paper will also address further applications that use file name information to validate data.
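A minimal sketch of the DOPEN/DREAD approach (the directory path is an assumption):

  data file_list;
    length fname $256;
    rc  = filename('dirref', 'C:\input_files');
    did = dopen('dirref');
    do i = 1 to dnum(did);
      fname = dread(did, i);       /* one row per file in the directory */
      output;
    end;
    rc = dclose(did);
    keep fname;
  run;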


Multiple Studies! DataDefinitionTracker, Made Easy!
Saritha Bathi

It is always hard to track changes to the data definitions in your project.  It gets harder when you are dealing with multiple studies, and harder still when the team is big.  But don't worry: our system/process/set of macros helps you track changes in data definitions efficiently and effectively.  The main idea behind the process is to read the data definitions into a SAS data set.  When the data definitions change, read the changed/new data definitions and create a new SAS data set.  Compare the new and old data definitions, create a table (be creative in presenting the changes to the team), and send it through email to the team with both the old and new data definitions.  This helps the programmer write efficient programs, as one can see the change in the definition clearly.  Automate and schedule it as daily/weekly updates depending on the project needs.  Develop once and use it across all the therapeutic areas and across the company!



Banking/Finance

Leads and Lags: Static and Dynamic Queues in the SAS® DATA STEP
Mark Keintz

From stock price histories to hospital stay records, analysis of time series data often requires use of lagged (and occasionally lead) values of one or more analysis variables.  For the SAS® user, the central operational task is typically getting lagged (lead) values for each time point in the data set.  While SAS has long provided a LAG function, it has no analogous lead function, an especially significant problem in the case of large data series.  This paper will (1) review the LAG function, in particular the powerful, but non-intuitive implications of its queue-oriented basis, (2) demonstrate efficient ways to generate leads with the same flexibility as the LAG function, but without the common and expensive recourse to data re-sorting, and (3) show how to dynamically generate leads and lags through use of the hash object.
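A minimal sketch of a re-sort-free lead alongside the LAG function (a single data set PRICES with no BY groups is assumed; the lead is missing on the last observation):

  data with_lead_lag;
    merge prices
          prices(firstobs=2 keep=price rename=(price=price_lead));
    price_lag = lag(price);     /* queue-based: previous observation's value */
  run;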


Using SAS®/QC to Design Optimal Experimental Designs in Consumer Lending
Jonas Bilenas

We will look at how to design experiments using SAS®/QC PROC OPTEX.  Examples will come from simulated experiments testing consumer preference for direct mail credit card offers.  Discussion will focus on designs of experiments using PROC OPTEX, sample size requirements, and response surface models using PROC LOGISTIC and PROC GENMOD.  We will then discuss what to do with design results after the experiment is completed.  Focus is on measuring consumer preferences for consumer products but the methodology has application in many disciplines.


Using Regression Splines in SAS® STAT Procedures
Jonas Bilenas and Nish Herat

Regression splines are an added feature in many SAS® STAT procedures.  These can be used in scatterplot smoothing to detect non-linearity, and also in generating model predictions.  We will review how to code nonparametric splines in many procedures and how to score new populations.  We will also compare these splines with parametric natural splines when building regression models.  Applications will come from consumer credit examples but are also applicable to other industries.


Computing Risk Measures for Cross Collateralized Loans Using Graph Algorithms
Chaoxian Cai

In commercial lending, multiple loans or commitments may be collateralized by multiple properties, and this cross collateralization creates linked loan-property networks.  These networks are embedded in a loan-property table, stored in a relational database.  Some risk measures for these cross collateralized loans or commitments are better evaluated in aggregated terms if we can identify all cross linked properties, loans, and commitments.  The Union-Find algorithm is commonly used to find connected components in a graph.  In this paper, I have implemented the primary operations of the Union-Find algorithm in a SAS® macro program using only Base SAS DATA steps.  The program can be used to find all connected components in a SAS data set and separate them into discrete groups.  It can be applied to find all cross collateralized loan-property networks and compute pooled loan-to-value (LTV) ratios.  The program can also be applied to identify main obligations and loan structures of future commitments with takedown loans and hierarchical lines of credit.  Risk measures, such as exposure at default (EAD) and credit conversion factor (CCF), are computed for these complicated loans and illustrated by examples.


Know your Interest Rate
Anirban Chakraborty

The Federal Reserve of the United States reported a drastic increase in consumer debt over the past few years, reaching $3.5 trillion in May 2015.  Credit card debt accounts for only 26% of total consumer debt; the remaining 74% is derived from student loans, automobile loans, mortgages, etc.  Lending has become an integral part of US consumers' everyday lives.  Have you ever wondered how lenders use various factors such as FICO score, annual income, the loan amount approved, tenure, and debt-to-income ratio to select your interest rate?  The process, defined as risk-based pricing, uses a sophisticated algorithm that leverages different determining factors of a loan applicant.  This research provides an approach to explore the factors that significantly affect borrowers' fixed loan interest rates.  For the purpose of this research, data was collected from a publicly available data source, Lending Club, which is the largest peer-to-peer online credit marketplace.  The downloaded dataset has information about successful loan applications of 2015, and includes 421,097 observations and 115 variables.  Exploration of the data shows that debt consolidation is the primary reason for loan applications, comprising 59% of all loan applications, followed by credit card bill payments.  Further analysis is warranted to determine factors that significantly affect the loan interest rate.  Selection of significant factors will help develop a prediction algorithm which can estimate loan interest rates based on client information.  On one hand, knowing the factors will help consumers and borrowers increase their creditworthiness and place themselves in a better position to negotiate for a lower interest rate.  On the other hand, this will help lending companies get an immediate fixed interest rate estimation based on client information.

By building various predictive models on diverse factors that might influence the interest rate, we attempt to answer the following problem statement: estimate the interest rate based on various factors of a loan applicant.



Building Blocks

Fuzzy Matching: Where Is It Appropriate and How Is It Done? SAS® Can Help
Stephen Sloan and Dan Hoicowitz

When attempting to match names and addresses from different files, we often run into a situation where the names are similar, but not exactly the same.  Sometimes there are additional words in the names, sometimes there are different spellings, and sometimes the businesses have the same name but are located thousands of miles apart.  The files that contain the names might have numeric keys that cannot be matched.  Therefore, we need to use a process called fuzzy matching to match the names from different files.  The SAS® function COMPGED, combined with SAS® character-handling functions, provides a straightforward method of applying business rules and testing for similarity.
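A hedged sketch of the core comparison (the pair data set, cleanup characters, and cutoffs are illustrative):

  data candidate_matches;
    set name_pairs;                                    /* one row per candidate pair   */
    clean_a = upcase(compbl(compress(name_a, '.,&-')));
    clean_b = upcase(compbl(compress(name_b, '.,&-')));
    ged     = compged(clean_a, clean_b, 500);          /* stop scoring above 500       */
    if ged <= 100;                                     /* tune the cutoff to your data */
  run;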


True is not False: Evaluating Logical Expressions
Ronald Fehd

The SAS(R) software language provides methods to evaluate logical expressions, which then allow conditional execution of parts of programs.  In cases where logical expressions contain combinations of intersection (and), negation (not), and union (or), later readers doing maintenance may question whether the expression is correct.

The purpose of this paper is to provide a truth table of Boole's rules, De Morgan's laws, and SQL joins for anyone writing complex conditional statements in data steps, macros, or procedures with a WHERE clause.
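For example, the two WHERE clauses below select the same rows by De Morgan's law, not (A and B) = (not A) or (not B):

  proc sql;
    select count(*) from sashelp.heart
      where not (status = 'Dead' and sex = 'Male');

    select count(*) from sashelp.heart
      where status ne 'Dead' or sex ne 'Male';
  quit;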


AutoHotKey: an Editor-independent Alternative to SAS® Keyboard Abbreviations
Shane Rosanbalm

Have you ever been editing a SAS program in a text editor (UltraEdit, NotePad++, etc.) and found yourself thinking, "I wish I had access to my SAS keyboard abbreviations"?  AutoHotKey (AHK) is free open-source macro creation and automation software for Windows that allows users to automate repetitive tasks.  It's a lot like SAS keyboard abbreviations, but with one major perk: not only does AHK work in your SAS editor, but it also works in text editors, word processors, web browsers, and just about any other application you can think of.  If you like SAS keyboard abbreviations but wish they worked in other applications, AHK is the missing piece of software you've been yearning for.  In this paper you will learn how to install AHK and get started creating your own editor-independent abbreviations.


If you need these OBS and these VARS, then drop IF and keep WHERE
Jayanth Iyengar

Reading data effectively in the DATA step requires knowing the implications of using various methods.  The impact on efficiency is especially pronounced when working with large data sets.  Also useful is a working knowledge of DATA step mechanics and constructs: the observation loop and the PDV.  Individual techniques for subsetting data have varying levels of efficiency and implications for input/output time.  Use of the WHERE statement/option to subset observations consumes fewer resources than the subsetting IF statement.  Also, use of DROP and KEEP to select variables to include/exclude can be efficient depending on how they're used.
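A short sketch of the contrast (library, data set, and variable names are assumptions):

  /* WHERE filters observations as they are read, before the PDV is loaded */
  data high_value;
    set claims.detail(keep=claim_id claim_dt amount);
    where amount > 10000;
  run;

  /* the subsetting IF reads every observation, then discards rows */
  data high_value_if;
    set claims.detail(keep=claim_id claim_dt amount);
    if amount > 10000;
  run;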


An Introduction to SAS® Hash Programming Techniques
Kirk Paul Lafler

Beginning in Version 9, SAS software supports a DATA step programming technique known as hash that enables faster table lookup, search, merge/join, and sort operations.  This presentation introduces what a hash object is, how it works, and the syntax required.  Essential programming techniques are illustrated to define a simple key, sort data, search memory-resident data using a simple key, match-merge (or join) two data sets, handle and resolve collision scenarios where two distinct pieces of data have the same hash value, as well as more complex programming techniques that use a composite key to search for multiple values.
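A minimal sketch of the table-lookup pattern (the data sets, key ITEM_ID, and numeric data variable UNIT_PRICE are assumptions):

  data priced;
    if _n_ = 1 then do;
      declare hash h(dataset: 'work.prices');   /* lookup table keyed by item_id      */
      h.defineKey('item_id');
      h.defineData('unit_price');
      h.defineDone();
      call missing(unit_price);                 /* host variable for retrieved data   */
    end;
    set work.orders;
    if h.find() = 0 then line_total = qty * unit_price;
    else line_total = .;
  run;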


Building a Better Dashboard Using Base-SAS® Software
Kirk Paul Lafler

Organizations around the world develop business intelligence dashboards to display the current status of point-in-time metrics and key performance indicators.  Effectively designed dashboards often extract real-time data from multiple sources for the purpose of highlighting important information, numbers, tables, statistics, metrics, and other content on a single screen.  This presentation introduces basic rules for good dashboard design and the metrics frequently used in dashboards, to build a simple drill-down dashboard using the DATA step, PROC FORMAT, PROC PRINT, PROC MEANS, ODS, ODS Statistical Graphics, PROC SGPLOT and PROC SGPANEL in Base-SAS® software.


An Easy Route to a Missing Data Report with ODS+PROC FREQ+A Data Step
Mike Zdeb

A first step in analyzing data is making a decision on how to handle missing values.  That decision could be deletion of observations and/or variables with excess missing data, substitution of imputed values for missing values, or taking no action at all if the amount of missing data is insignificant and not likely to affect the analysis.

This paper shows how to use an ODS OUTPUT statement, PROC FREQ, and some data step programming to produce a missing data report showing the percentage of missing data for each variable in a data set.  Also shown is a method for identifying and dropping from a data set all variables with either all or a high percentage of missing values.  The method of producing the missing data report is less complicated and more compact than several methods already proposed in other papers.
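A hedged sketch of the approach, using SASHELP.HEART (input variable names that themselves begin with F_ would need extra care):

  proc format;
    value $missf ' ' = 'Missing' other = 'Present';
    value  missf  .  = 'Missing' other = 'Present';
  run;

  ods output onewayfreqs=freqs;
  proc freq data=sashelp.heart;
    format _character_ $missf. _numeric_ missf.;
    tables _all_ / missing;
  run;

  data missing_report;
    set freqs;
    length variable $32;
    variable = scan(table, -1, ' ');       /* "Table Status" -> "Status"       */
    if coalescec(of f_:) = 'Missing';      /* keep only the missing-value rows */
    keep variable frequency percent;
  run;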


Array Programming Basics
John Cohen


Using arrays offers a wonderful extension to your SAS® programming toolkit.  Whenever iterative processing is called for, they can make programming easier and programs easier to maintain.  You will need to learn some new syntax, but we will explain several of the key components such as indexing and subscripts, temporary and multi-dimensional arrays, determining array dimension, and a few special tricks.
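A minimal example of the iterative-processing pattern (the SURVEY data set and the 99 "not answered" code are assumptions):

  data recoded;
    set survey;
    array q{*} q1-q10;
    do i = 1 to dim(q);
      if q{i} = 99 then q{i} = .;   /* recode the sentinel value to missing */
    end;
    drop i;
  run;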


Dynamically Changing Time Zones and Daylight Savings on Time Series Data
Chao-Ying Hsieh


SAS® programmers commonly deal with time zones and daylight saving time changes, especially when working for a large corporation with multiple subsidiaries located in different time zones.  This paper presents a dynamic way to convert GMT time to local time with daylight saving time changes accounted for on time series data collected from smart meters.  The technique uses SAS® functions, formats, and macros to create a program that streamlines the time conversion process.


Introduction to PROC REPORT
Kirk Paul Lafler

SAS users often need to create and deliver quality custom reports and specialized output for management, end users, and customers.  The SAS System provides users with the REPORT PROCedure, a canned Base-SAS procedure, for producing quick and formatted detail and summary results.  This presentation is designed for users who have no formal experience working with the REPORT procedure.  Attendees learn the basic PROC REPORT syntax using the COLUMN, DEFINE, and other optional statements, and procedure options to produce quality output; explore basic syntax to produce basic reports; compute subtotals and totals at the end of a report using a COMPUTE block; calculate percentages; produce statistics for analysis variables; apply conditional logic to control summary output rows; and enhance the appearance of output results with basic Output Delivery System (ODS) techniques.
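A minimal PROC REPORT example of the kind the presentation builds up to, using SASHELP.CARS:

  proc report data=sashelp.cars nowd;
    column type make msrp;
    define type / group;
    define make / group;
    define msrp / analysis mean format=dollar10. 'Average MSRP';
    break after type / summarize;
  run;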


The Power of PROC FORMAT
Jonas Bilenas and Kajal Tahiliani

The FORMAT procedure in SAS® is a very powerful and productive tool, yet many beginning programmers rarely make use of it.  The FORMAT procedure provides a convenient way to do a table lookup in SAS.  User-generated formats can be used to assign descriptive labels to data values, create new variables, and find unexpected values.  PROC FORMAT can also be used to generate data extracts and to merge data sets without having to sort large datasets.  We will also look at generating user formats with the PICTURE statement, which is sometimes confusing but powerful in modifying how numeric data are presented in reports.  This paper will show you the POWER of PROC FORMAT for all SAS users.
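Two small sketches: a VALUE format used as a table lookup and a PICTURE format for display (the ranges, picture, and MEMBERS data set are illustrative):

  proc format;
    value agegrp  low-<18 = 'Minor'
                  18-64   = 'Adult'
                  65-high = 'Senior';
    picture pctpic low-high = '009.9%';
  run;

  proc freq data=members;
    tables age;
    format age agegrp.;
  run;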


Divide & Conquer: Simple Sub-Datasets Creation with Call Execute
Dylan Holt and Wilhelmina Ross

When working with larger datasets, one often needs to isolate certain data based on a range of criteria to disseminate sub-datasets, perform analyses, and create reports.  This task is simple when isolating just one or even a small handful of sub-datasets from a larger one and may require substituting a few values for the parameters each time one is creating a sub-dataset.  However, such a task can become cumbersome and redundant at best and error-ridden at worst when attempting to create a large number of individual sub-datasets manually.  Moreover, consistent formatting and naming conventions are harder to maintain when attempting this task one-by-one, and could lead to confusion when visiting and using these sub-datasets at a later time.  This paper demonstrates the use of the Call Execute routine to systematically create sub-datasets efficiently from a larger dataset while maintaining consistency in format and naming conventions.  In addition, the paper presents the flexibility of this method when dividing the entire dataset into logical sub-datasets or selecting only certain criteria to create sub-datasets.  Lastly, the paper demonstrates this method as a clear example of the powerful extension of the Macro facility within the DATA Step as mediated by the Call Execute routine.
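A hedged sketch of the pattern (library and table names are placeholders, and REGION values are assumed to be valid data set name suffixes):

  proc sql;
    create table regions as
      select distinct region from master.sales;
  quit;

  data _null_;
    set regions;
    call execute(cats(
      'data out.sales_', region,
      '; set master.sales; where region = "', region, '"; run;'));
  run;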


PROC DATASETS; The Swiss Army Knife of SAS® Procedures
Michael Raithel

This paper highlights many of the major capabilities of PROC DATASETS.  It discusses how it can be used as a tool to update variable information in a SAS data set; provide information on data set and catalog contents; delete data sets, catalogs, and indexes; repair damaged SAS data sets; rename files; create and manage audit trails; add, delete, and modify passwords; add and delete integrity constraints; and more.  The paper contains examples of the various uses of PROC DATASETS that programmers can cut and paste into their own programs as a starting point.  After reading this paper, a SAS programmer will have practical knowledge of the many different facets of this important SAS procedure.
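A brief sketch of several capabilities in one step (member and variable names are illustrative):

  proc datasets library=work nolist;
    modify claims;
      label  paid_amt = 'Amount Paid (USD)';
      format paid_amt dollar12.2;
      rename clm_id = claim_id;
    change claims = claims_2016;      /* rename the data set itself */
    delete temp1 temp2;
  quit;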


SAS® Debugging 101
Kirk Paul Lafler

SAS® users are always surprised to discover their programs contain bugs (or errors).  In fact, when asked, users will emphatically stand by their programs and logic by saying they are error free.  But the vast number of experiences, along with the realities of writing code, says otherwise.  Errors in program code can appear anywhere, often introduced accidentally by users as they write code.  No matter where an error occurs, the overriding sentiment among most users is that debugging SAS programs can be a daunting, and humbling, task.  Attendees learn about the various error types, identification techniques, their symptoms, and how to repair program code to work as intended.


Introduction to ODS Statistical Graphics
Kirk Paul Lafler

Delivering timely and quality looking reports, graphs and information to management, end users, and customers is essential.  This presentation provides SAS® users with an introduction to ODS Statistical Graphics found in the Base-SAS software.  Attendees learn basic concepts, features and applications of ODS statistical graphic procedures to create high-quality, production-ready output; an introduction to the statistical graphic SGPLOT procedure, SGPANEL procedure, and SGSCATTER procedure; and an illustration of plots and graphs including histograms, vertical and horizontal bar charts, scatter plots, bubble plots, vector plots, and waterfall charts.


Creating Test Data Using SAS® Hash Tables
LaSelva Gwen

SAS® programmers sometimes need to simulate real data.  Real data often cannot be used because it contains confidential information or is unavailable.  This paper describes generating test data for the New York State Congenital Malformations Registry.  The data was required to be in a flat text file format and to contain a medical record number, last name, first name, street address, city, state, ZIP code, sex, two dates, and from one to 12 diagnosis codes with descriptions.  The data was created in a data step using SAS random number generating functions to generate values directly or to select items from hash tables.  The selected items were written to a text file with PUT statements.  The code was written in SAS® 9.4 under Windows 7.


Taming the Bear: Make Your Programs Easier to Control and Monitor
Bob Bolen

SAS® programs that run for any extended period of time but encounter some process or data error can turn a good day bad in a hurry.  This paper will examine some of the methods we have used to control and monitor these types of jobs.  It will look at the use of macros for breaking your code into manageable blocks, variable validation for checking that a program runs correctly and using email to give alert/status notifications.


Using PROC EXPAND to Easily Manipulate Longitudinal and Panel Data
Matthew Hoolsema

Working with longitudinal and panel data comes with many data structure challenges.  In order to merge data from multiple sources, analysts often have to manipulate a longitudinal dataset to make the structure of the two datasets match.  Other times, a particular analysis may be more conducive to a wide dataset than a long dataset, or vice versa.  New SAS® users (or experienced SAS® users new to analyzing longitudinal data) can find writing code to manipulate their panel data and prepare an analysis dataset particularly challenging.

This paper describes how PROC EXPAND can be used as a simple yet powerful tool for manipulating longitudinal and panel data.  Examples explore calculating lags, leads, and moving averages of time varying observations within a panel, as well as reshaping a dataset from long to wide without the need to use arrays or macros.  Also shown is how PROC EXPAND can be used in conjunction with other SAS® procedures to easily calculate trends over time.  Other examples demonstrate using PROC EXPAND to collapse time periods (i.e. converting monthly observations to quarterly observations), interpolate values between observed time periods, and perform data transformations on variables.
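A minimal sketch of the lag/lead/moving-average conversions (requires SAS/ETS; the PANEL data set and its ID, DATE, and SALES variables are assumptions):

  proc expand data=panel out=panel2 method=none;
    by id;
    id date;
    convert sales = sales_lag1  / transformout=(lag 1);
    convert sales = sales_lead1 / transformout=(lead 1);
    convert sales = sales_ma3   / transformout=(movave 3);
  run;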


Handling Numeric Representation SAS® Errors Caused by Simple Floating-Point Arithmetic Computation
Fuad Foty

Every SAS® programmer knows that the 8-byte maximum length of the numeric data type imposes a machine-specific limit on numbers and their computations.  In fact, all numeric SAS® values are represented as 64-bit floating point numbers.  At the Census Bureau we deal with specific survey replicate weights that can result in very large positive or negative numbers due to simple calculations of various weights and factors.  Rounding is one way to control the numbers and correct errors that are due to iterative computations.  Understanding how floating-point arithmetic works and being aware of its limits is important when heavy computations are involved across millions of survey records.  Adding and subtracting seemingly obvious numbers can produce surprisingly different values.  In this paper, I will show ways to handle errors caused by computations and how to avoid obvious issues when dealing with large or small numbers.
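A small demonstration of the kind of surprise discussed, and one way rounding controls it:

  data _null_;
    x = 0.1 + 0.2;
    if x = 0.3 then put 'Exact match';
    else put 'No exact match; x is stored as ' x hex16.;
    if round(x, 1e-9) = 0.3 then put 'Match after rounding';
  run;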


Writing Code With Your Data: Basics of Data-Driven Programming Techniques
Joe Matise

In this presentation aimed at SAS® programmers who have limited experience with data step programming, we discuss the basics of Data-Driven Programming, first by defining Data-Driven Programming, and then by showing several easy to learn techniques to get a novice or intermediate programmer started using Data-Driven Programming in their own work.  We discuss using PROC SQL SELECT INTO to push information into macro variables; PROC CONTENTS and the dictionary tables to query metadata; using an external file to drive logic; and generating and applying formats and labels automatically.

Prior to attending this presentation, programmers should be familiar with the basics of the data step; should be able to import data from external files; should have a basic understanding of formats and variable labels; and should be aware of both what a macro variable is and what a macro is.  Knowledge of macro programming is not a prerequisite for understanding the concepts in this presentation.
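For example, a minimal SELECT INTO sketch (the table and column names are assumptions):

  proc sql noprint;
    select distinct product
      into :prod_list separated by ' '
      from work.sales;
  quit;

  %put NOTE: products found: &prod_list;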


All Aboard! Next Stop is the Destination Excel
William E Benjamin Jr.

Over the last few years both Microsoft Excel file formats and the SAS® interfaces to those Excel formats have changed.  SAS® has worked hard to make the interface between the two systems easier to use, starting with comma-separated values files and moving to PROC IMPORT and PROC EXPORT, LIBNAME processing, SQL processing, SAS® Enterprise Guide®, JMP®, and then on to the HTML and XML tagsets like MSOFFICE2K and EXCELXP.  Well, there is now a new entry into the processes available for SAS users to send data directly to Excel.  This new entry into the ODS arena of data transfer to Excel is the ODS destination called EXCEL.  This process is included within SAS ODS and produces native-format Excel files for version 2007 of Excel and later.  It was first shipped as an experimental version with the first maintenance release of SAS® 9.4.  This ODS destination has many features similar to the EXCELXP tagsets.
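A short example of the new destination (the file path is an assumption):

  ods excel file="C:\reports\class_summary.xlsx"
            options(sheet_name='Summary' embedded_titles='yes');
  title 'Height and Weight by Sex';
  proc means data=sashelp.class mean min max;
    class sex;
    var height weight;
  run;
  ods excel close;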



Coder's Corner

Using an Array to Examine Gastric and Colorectal Cancer Risk Factors
Verlin Joseph

Colorectal and gastric cancers are two of the most prevalent forms of cancer worldwide.  One of the key bacterial risk factors for gastrointestinal cancers is Helicobacter pylori (H. pylori).  This study seeks to examine the association of H. pylori with colorectal and gastric cancer.  The analysis was conducted using 2002-2012 Florida Agency for Health Care Administration (AHCA) hospital discharge data.  Each data set contained millions of records and up to thirty diagnosis codes.  Arrays were utilized to quickly locate the variables of analysis.  The purpose of this presentation is to illustrate how programmers may use arrays to analyze big data sets.


What Do You Mean My CSV Doesn't Match My SAS® Dataset?
Patricia Guldin and Young Zhuge

Statistical programmers are responsible for delivering high quality and reproducible analysis datasets to statisticians, modelers and other quantitative scientists.  Regardless of the format (e.g. SAS dataset or .csv), the content of the datasets should be identical.  Converting SAS datasets to other formats can be easily accomplished in SAS, but the consistency between the output files must be included in the quality control checks.  An example will be given where tables and figures created by a statistician (using the SAS dataset) and those created by a modeler (using the .csv dataset) were different.  We will provide the results of our exploration into why the inconsistencies occurred and the steps taken to ensure reliability for subsequent data exports / format conversions.


Using SAS® to Get a List of File Names and Other Information from a Directory
Imelda Go

The ability to create a data set which contains the names of all the files in a directory can be very useful.  This paper goes through three possible uses of this file information data set.  The first is you have many files with the same input structure that need to be read from the same directory.  Instead of hard-coding the file name to create each of the many data sets, apply a SAS macro on each file name in the file information data set to create each data set.  The second is you are collecting data files and want to determine if all expected files are in a directory and if not, which ones are missing.  The third is you are running out of computer storage space and need to identify files based on file size, age, etc. for possible deletion or other action.  An often overlooked source of solutions is the SAS Knowledge Base online. Two SAS samples (24820 and 41880) were used to address the above situations.


PROC DOC III: Self-generating Codebooks Using SAS(R)
Louise Hadden

This paper will demonstrate how to use good documentation practices and SAS(R) to easily produce attractive, camera-ready data codebooks (and accompanying materials such as label statements, format assignment statements, etc.)  Four primary steps in the codebook production process will be explored: use of SAS metadata to produce a master documentation spreadsheet for a file; review and modification of the master documentation spreadsheet; import and manipulation of the metadata in the master documentation spreadsheet to self-generate code to be included to generate a codebook; and use of the documentation metadata to self-generate other helpful code such as label statements.  Full code for the example shown (using the SASHELP.HEART data base) will be provided.


A Five-Step Quality Control Method: Checking for Unintended Changes to SAS® Datasets
Aaron Brown

A SAS® programmer may need to make many edits or changes to a given data set.  There is always a risk that one's code will have unintended consequences, leading to unintended changes.  This paper describes a five-step quality control method that allows a programmer to quickly and systematically check for any changes in the number of records, variables in a dataset, or values in a dataset, in order to assure that the intended changes and only the intended changes occurred.  This quality control method utilizes the COMPARE and MEANS procedures.
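For instance, the checks at the heart of such a method might look like this (the BEFORE and AFTER data set names are assumptions):

  proc means data=after n nmiss;      /* record and missing-value counts after the edit */
  run;

  proc compare base=before compare=after listall;
  run;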


Talk to Me!
Elizabeth Axelrod

Wouldn't it be nice if your long-running program could tap you on the shoulder and say, "Okay, I'm all done now"?  It can!  This quick tip will show you how easy it is to have your SAS® program send you (or anyone else) an email during program execution.  Once you've got the simple basics down, you'll come up with all sorts of uses for this great feature, and you'll wonder how you ever lived without it.
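A minimal sketch (the address is a placeholder, and the EMAILSYS/EMAILHOST system options must already be configured at your site):

  filename done email
    to='you@example.com'
    subject='Long-running job finished';

  data _null_;
    file done;
    finished = datetime();
    put 'The program completed at ' finished datetime20. '.';
  run;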


Extract Information from Large Database Using SAS® Array, PROC FREQ, and SAS Macro
Lifang Zhang

SAS® software offers the statistician and programmer many ways to extract useful information from large SAS data sets for statistical analysis and reporting.  The techniques discussed include the use of Base SAS software: the commonly used PROC IMPORT, SAS arrays, PROC FREQ with OUTPUT, the SAS macro facility, and PROC EXPORT to extract useful information for future analysis and reporting.

This presentation provides a detailed blueprint for 1) using PROC IMPORT to import Excel data; 2) using an ARRAY to detect specific diseases (diagnostic codes); 3) using PROC FREQ and OUTPUT to generate indicator variables and data sets; 4) merging them together to produce one summary report or dataset; 5) using a SAS macro to repeat steps 2 to 4; and 6) exporting the final data sets or tables to a defined destination.

The paper uses mimicked diabetes clinic data from 2012 to 2014, with ICD-9 diabetes diagnostic codes in the range 250.00-250.80.  The paper will show 1) how to extract each diagnostic code by year for all visits, and 2) how to count all diabetes diagnostic codes for each unique patient for each year; in other words, each patient is counted only once per year.

The Excel tables will be created and exported to a defined destination at the end of the program.


Tips for Pulling Data from Oracle® Using PROC SQL® Pass-Through
John Cohen

For many of us a substantial portion of our data reside outside of SAS®.  Often these are in a DBMS (database management system) such as Oracle, DB2, or MySQL.  The great news is that the data will be available to us in an already-structured format, likely with a minimum of reformatting effort required.  Secondly, these DBMS come with an array of manipulation tools of which we can take advantage.  The not-so-good news is that the syntax required for pulling these data may be somewhat unfamiliar to us.

We will offer several tips for making this process smoother for you, including how to leverage a number of the DBMS tools.  We will take advantage of the robust DBMS engines to do a lot of the preliminary work for us, thereby reducing the memory, work space, sort space, data storage, and CPU cycles required of the SAS server, which is usually optimized for analytical work while being relatively weaker (than the DBMS) at the heavy lifting required for initial data selection and manipulation in an increasingly Big Data environment.  Finally, we will make our SAS Administrators happy by reducing some of the load in that environment.
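A hedged sketch of explicit pass-through (requires SAS/ACCESS to Oracle; the connection options, table, and column names are placeholders):

  proc sql;
    connect to oracle as ora (user=&orauser password=&orapw path='proddb');
    create table work.claim_totals as
      select * from connection to ora (
        select claim_id, sum(paid_amt) as total_paid
        from   claims
        where  claim_dt >= date '2016-01-01'
        group by claim_id
      );
    disconnect from ora;
  quit;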


Some _FILE_ Magic
Mike Zdeb

The use of the SAS® automatic variable _INFILE_ has been the subject of several published papers.  However, discussion of possible uses of the automatic variable _FILE_ has been limited to postings on the SAS-L listserv and on the SAS Support Communities web site.  This paper shows several uses of the variable _FILE_, including creating a new variable in a data set by concatenating the formatted values of other variables and recoding variable values.
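One of the uses shown, sketched here with SASHELP.CLASS: grab the formatted output buffer as a new variable.

  filename scratch temp;

  data combined;
    set sashelp.class;
    length one_line $40;
    file scratch;
    put name $10. +1 age 3. +1 height 6.1 @;   /* build the buffer, hold the line */
    one_line = _file_;                         /* copy the formatted buffer       */
    put;                                       /* release the line                */
  run;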


Personalized Birthday Wisher (PBW): An Indispensable SAS® Tool for Your Workplace
Jinson Erinjeri and Angela Soriano

Personalized Birthday Wisher (PBW) is a customizable birthday wishing tool which operates on the input provided to it.  The input to this tool includes various attributes, based on which a birthday wish comprising an image and text is selected and delivered via email.  This is an automated tool which sends a birthday wish to a team member on his/her birthday based on certain attributes, giving the wish a personalized touch.  The attributes chosen for this tool are entirely up to the decision maker, making the tool highly customizable.  This paper presents an application of the tool with simple predetermined attributes that can be tailored as needed by the team.  In addition, the paper also points to various customization possibilities.


Macro Code to Test Existence of Various Objects
Ronald Fehd

SAS(R) software provides functions to check the existence of the objects it manages: catalogs and data sets, as well as folders referred to by the FILENAME and LIBNAME statements.

The purpose of this paper is to provide a set of macro statements for assertions of existence and to highlight the exceptions where these functions return non-boolean choices.
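A few minimal examples of the functions involved (the paths and names are placeholders):

  %macro assert_dataset(ds);
    %if %sysfunc(exist(&ds)) %then %put NOTE: data set &ds exists.;
    %else %put %str(ERR)OR: data set &ds was not found.;
  %mend assert_dataset;

  %assert_dataset(sashelp.class)

  %put External file found? %sysfunc(fileexist(C:\data\input.csv));
  %put Catalog found?       %sysfunc(cexist(work.formats));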


When PROPCASE Isn't Proper: A Macro Supplement for the SAS® Function
Alex Buck

Formatting is important for clarity when reporting descriptive data such as country or subject status.  In comes PROPCASE to save the day.  "Indonesia" and "Protocol Violation" are certainly easier to read than their UPCASE counterparts.  However, PROPCASE is limited.  It will convert every word in the argument without exception, creating phrases such as "United States Of America" and "Withdrawal By Subject."  These distractions can sometimes be anticipated and fixed, but as new data are collected there is a risk of a renegade "To," "And," or "Out" hiding somewhere.  This paper introduces a macro supplement for PROPCASE which contains a default list of common lowercase words and the option to add or exclude words at the user's discretion.  With this macro, the user will specify lowercase words in an efficient and dynamic way, allowing confidence in formatting across deliveries.
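The underlying fix, in miniature (the exception word handled here is only illustrative, not the macro's full list):

  data _null_;
    length country $60;
    country = propcase('UNITED STATES OF AMERICA');
    country = tranwrd(country, ' Of ', ' of ');   /* undo the unwanted capitalization */
    put country=;                                 /* United States of America         */
  run;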


Using CALL VNAME to Populate Missing Data from a Default Values Lookup Dataset
Raghav Adimulam

When SAS is used to read ASCII data files, there is sometimes a need to recode missing data or to alter already existing data values.  Often, the values to be substituted are in an Excel spreadsheet, which can be imported to SAS as a lookup table.  This paper walks through the creation of a program that populates default values in a column when that column's information is empty in the input file.  The relevant functions and routines used in this process are CALL VNAME, CMISS, arrays, and DO loops.  The goal is to output specific data patterns from the lookup datasets into the variables with missing values.


Simplifying the Use of Multidimensional Array in SAS®
Alec Zhixiao Lin

A look-up table usually contains two or more attributes.  A multidimensional array is a common and effective way to code the table in SAS.  However, many users find this practice to be confusing and daunting.  This paper introduces a process that greatly simplifies the use of multidimensional arrays.  Users only need to define the dimensions of a look-up table and paste it into SAS, with no need to understand the logic for programming multidimensional arrays in SAS.  The macro will automatically kick off the process of assigning the given values to each record.
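For instance, a 2 x 3 rate table coded as a temporary two-dimensional array (the data set, index variables, and rates are assumptions):

  data priced;
    set applications;                      /* assumed: tier in 1-2, term_grp in 1-3 */
    array rate{2,3} _temporary_ (0.059 0.064 0.071
                                 0.082 0.089 0.097);
    int_rate = rate{tier, term_grp};
  run;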


Flexible Programming with Hash Tables
Joe Matise

When developing a general application, it often pays to use flexible techniques that will enable an application to handle a diverse set of inputs with a simple and concise syntax, while efficiently producing a consistent final output.  This paper will show intermediate to advanced SAS® programmers how to make use of hash tables to produce powerful applications that are easy to use.

The example application, a data cleaning program, will enable users to specify connections between different datasets using a straightforward syntax through the use of hash tables.  We will also introduce the concept of hash tables for users who are unfamiliar with hash tables and their use in SAS programming.

Before reading this paper, users should have a good understanding of the SAS data step, should have some familiarity with basic macro programming, should be familiar with data driven programming techniques, and should be comfortable combining multiple tables.


Using SAS® Macros to Extract P-values from PROC FREQ
Rachel Straney

This paper shows how to leverage the SAS® Macro Facility with PROC FREQ to gather multiple chi-square test statistics and their associated p-values into one data set, achieving a quick solution to the common variable selection problem.  The purpose of this paper is to provide a simplified macro function that can be used to identify important factors in a study.  Although the use of PROC FREQ in this macro limits its use to categorical data, references to other SAS papers will be summarized for readers to get a better understanding of how this concept can be expanded upon.


Dynamically Setting Decimal Precision Using PUTN
Brandon Welch

In the pharmaceutical industry, there are often rules for displaying the number of decimal places in a report.  This is particularly important when reporting laboratory results in a table.  For example, for a particular laboratory test, if the laboratory result in the data is carried to the tenths place, the displayed mean will be reported at the hundredths place.  The common rule is to display the mean result one place higher than what is in the data.  There are similar rules for other statistics such as the median, standard deviation, minimum, and maximum.  This paper will illustrate a simple technique using the PUTN function to dynamically set the decimal precision.  The techniques presented offer a good overview of basic SAS functions that will educate programmers at all levels.
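The core idea in miniature (MEAN and DP, the number of decimals in the raw data, are assumed variables in the input):

  data display;
    set stats;
    length mean_c $12;
    mean_c = putn(mean, cats('12.', dp + 1));   /* one decimal deeper than the data */
  run;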


Your Own SAS® Macros Are as Powerful as You Are Ingenious
Yinghua Shi

This article proposes, for user-written SAS macros, separate definitions for function-style macros and routine-style macros.  A function-style macro returns a value, while a routine-style macro does not.  Implementation of function-style macros follows a set of rules that are not identical to the rules for routine-style macros.  In client code, the method for invoking those two types of macros is not the same, either.  Just as the SAS language distinguishes between functions and CALL routines, it is natural for macros to also have that distinction.

With that distinction in place, this article describes the proper approach and rules to follow for writing function-style macros, and for writing routine-style macros.  Usage of each kind of macro is also provided to describe the proper way of invoking the macros in client code.

This article also points out some common problems in writing and invoking macros that appeared in published articles.  Some of those problems can cause errors in SAS programs, and therefore should be guarded against when designing and coding macros.

For each key point discussed in this article, working sample SAS code is provided to further illustrate what to do, what not to do, what to avoid, and what are good, or not-so-good coding practices in writing user-written macros.
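As one hedged illustration of the function-style convention (not code taken from the article), a classic observation-count macro that resolves to a value and can be used inside other statements:

  %macro nobs(ds);
    %local dsid n rc;
    %let dsid = %sysfunc(open(&ds));
    %let n    = %sysfunc(attrn(&dsid, nlobs));
    %let rc   = %sysfunc(close(&dsid));
    &n
  %mend nobs;

  %put NOTE: sashelp.class has %nobs(sashelp.class) observations.;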


Cover Your Assumptions with Custom %str(W)ARNING Messages
Shane Rosanbalm

In clinical trial work we often have to write programs based on very little (and mostly clean) data.  But an experienced programmer's lizard brain is constantly warning them that their clean log is an illusion, that there is likely dirty data lurking just down the road.  And sadly, the lizard brain is usually right.  What's a programmer to do?

In this paper we explore using custom WARNING messages in both the data step (with PUT) and the macro language (with %PUT).  This programming technique allows us to write simple programs for the data we have while at the same time protecting ourselves from the dirty data we fear.
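Two hedged examples of the technique, one DATA step and one macro, with the keyword split so the program source itself never contains it (the data set and check are assumptions):

  data lab_checked;
    set lab_results;
    if result < 0 then
      put 'WARN' 'ING: negative result for ' usubjid= visit= result=;
  run;

  %put %str(WARN)ING: the assumption of one record per subject was violated.;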


From Professional Life to Personal Life: SAS® Makes It Easy
Nushrat Alam

You have been using SAS in your professional life, but how about your personal life?  Got a long to-do list for the next week, and grocery shopping is one of them?  Let SAS make your personal life easier by providing you an up-to-date grocery list.  This paper describes an innovative way of applying SAS to make our routine tasks more interesting and thus expand our creativity to a new level.  In the process of generating a customized grocery list, this paper will show the use of the SAS Output Delivery System along with PROC REPORT and its options, traffic lighting, and SAS time functions.  Let your SAS-siness play and be amazed by your own creativity.


WAPTWAP, but remember TMTOWTDI
Jack Shoemaker

Writing a program that writes a program (WAPtWAP) is a technique that allows one to solve computing problems in a flexible and dynamic fashion. Of course, there is more than one way to do it (TMTOWTDI).  This presentation will explore three WAPtWAP methods available in the SAS system.



Data Management/Big Data

A New Method To Deal With 2 Level Variables in Big Data Analysis
John Gao

In big data, variables often have multiple levels.  Here we mainly study big data with two levels or layers (a child level and a parent level), such as patients and hospitals in the medical area, where the patient's outcome is greatly related to the hospital.  This paper presents a new approach for big data with two levels in which the outcome is binary at the child level.  The first step is to develop a predictive model at the child level for the binary outcome, such as patients or trialers for re-admissions or conversions.  The aggregation of the predicted probability of the binary outcome for each parent is the expected, or natural, overall binary outcome.  The aggregation of actual outcomes over all children for each parent may not be the same as the overall expected outcome; the difference between the actual and expected outcome is due to the parent's performance.  Therefore, the second step is to identify the impact of the parent's performance on the child outcome.  In this step, the dependent variable is continuous: the ratio of the aggregation of all actual outcomes over the aggregation of all expected probabilities across all children.  The second-step analysis then identifies the drivers from the parent's information.  We can then combine the two model scores into one by normalizing the expected probability of all children and the expected ratio outcome of all parents, and the final binary outcome is predicted by weighting the child's information and the parent's information.  The logistic and linear regression procedures in SAS were used for this study.  The result from this new approach gives better prediction with the parent's information included.


Fuzzy Name-Matching Applications
Alan Dunham

Name-matching among two or more lists of names can be ambiguous and problematic for several reasons.  These problems are the same for corporations and government organizations, leading to a variety of negative outcomes, such as rejected applications, missed business customer opportunities, lost payment vouchers, duplicate bills, and overlooking applicant criminal records.

Base SAS code is discussed that uses the Spedis function (available since SAS 6.0) to significantly reduce the number of name mis-matches in a comprehensive, generic manner.  The code can be applied to treat Romanization and transliteration spelling variances, differing use of diacritical marks in multiple lists, inconsistent use of titles, sub-names appearing in different order (such as reversal of surname and given name), incomplete names, trailing or leading blanks, mixed use of punctuation, and varying use of capitalization.  This paper shows the application and the effectiveness of this SAS solution to a problem common to many organizational settings.
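The heart of the comparison, sketched with assumed variable names and cutoff (not the paper's full treatment of spelling variances):

  data possible_matches;
    set name_pairs;
    dist = spedis(upcase(strip(name_a)), upcase(strip(name_b)));
    if dist <= 15;      /* smaller is closer; tune the cutoff to your lists */
  run;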


Using Proc FCMP To Improve Fuzzy Matching
Christine Warner

This paper will address how to utilize Proc FCMP and user-defined functions to enhance fuzzy matching techniques.  In addition to utilizing COMPLEV and COMPGED, I have developed several user-defined functions that improve the accuracy and completeness of fuzzy matching.  "Fuzzy matching" is a term known to many programmers as matching on non-exact character strings; they could be names, addresses, invoice numbers, or any other piece of data.  Several edit-distance functions such as Jaro-Winkler, General Edit Distance (GED), Levenshtein, and SoundEx make fuzzy matching easier, but they each have their shortcomings.  Many functions, such as Jaro-Winkler, are great for comparing one word with another word, but not as useful for comparing entire phrases.  This paper will explain in detail how to harness the power of existing and new user-defined functions to develop the best possible fuzzy matching plan.


Super Boost Data Transpose Puzzle
Ahmed Al-Attar

This paper compares different solutions to a data transpose puzzle presented to the SAS User Group at the US Census Bureau (CenSAS).  The presented solutions ranged from a SAS 101 multi-step solution to an advanced solution utilizing not widely known techniques yielding 85% run time savings!


Using SAS® Hash Object to Speed and Simplify Survey Cell Collapsing Process
Ahmed Al-Attar

This paper introduces an extremely fast and simple implementation of the survey cell collapsing process.  Prior implementations either utilized several SQL queries or numerous data step arrays with multiple data reads.  This new approach utilizes a single hash object with a maximum of two data reads.  The hash object provides an efficient and convenient mechanism for quick data storage and retrieval (sub-second total run time).


Removal of PII
Stanley Legum

At the end of a project, the IRBs require project directors to certify that no personally identifiable information (PII) is retained.  This paper briefly reviews what information is considered PII and explores how to identify variables containing PII in a given project.  It then shows a comprehensive way to ensure that all SAS variables containing PII have their values set to NULL and how to use SAS to document that this has been done.


What to Expect When You Need to Make a Data Delivery: Helpful Tips and Techniques
Tom McCall and Louise Hadden

Making a data delivery to a client is a complicated endeavor.  There are many aspects that must be carefully considered and planned for: de-identification, public use versus restricted access, documentation, ancillary files such as programs, formats, and so on, and methods of data transfer, among others.  This paper provides a blueprint for planning and executing your data delivery, and will walk you through the questions you should ask, the resources you should check, the resources you should create, and the SAS® tools that will help you along the way.  In essence, we'll travel back to the future to find out exactly what you need to be doing from the very start of your project to ensure a successful data delivery.


Sorting a Bajillion Records: Conquering Scalability in a Big Data World
Troy Hughes

"Big data" is often distinguished as encompassing high volume, velocity, or variability of data.  While big data can signal big business intelligence and big business value, it also can wreak havoc on systems and software ill-prepared for its profundity.  Scalability describes the ability of a system or software to adequately meet the needs of additional users or its ability to utilize additional processors or resources to fulfill those added requirements.  Scalability also describes the adequate and efficient response of a system to increased data throughput.  Because sorting data is one of the most common as well as resource-intensive operations in any software language, inefficiencies or failures caused by big data often are first observed during sorting routines.  Much SAS® literature has been dedicated to optimizing big data sorts for efficiency, including minimizing execution time and, to a lesser extent, minimizing resource usage (i.e., memory and storage consumption).  Less attention has been paid, however, to implementing big data sorting that is reliable and robust even when confronted with resource limitations.  To that end, this text introduces the SAFESORT macro that facilitates a priori exception handling routines (which detect environmental and data set attributes that could cause process failure) and post hoc exception handling routines (which detect actual failed sorting routines).  If exception handling is triggered, SAFESORT automatically reroutes program flow from the default sort routine to a less resource-intensive routine, thus sacrificing execution speed for reliability.  However, because SAFESORT does not exhaust system resources like default SAS sorting routines, in some cases it performs more than 200 times faster than default SAS sorting methods.  Macro modularity moreover allows developers to select their favorite sorting routine and, for data-driven disciples, to build fuzzy logic routines that dynamically select a sort algorithm based on environmental and data set attributes.


Spawning SAS® Sleeper Cells and Calling Them into Action: Implementing Distributed Parallel Processing in the SAS University Edition Using Commodity Computing To Maximize Performance
Troy Hughes

With the 2014 launch of the SAS® University Edition, the reach of SAS was greatly expanded to educators, students, researchers, and non-profits who could for the first time utilize a full version of Base SAS software for free, enabling SAS to better compete with open source solutions such as Python and R.  Because the SAS University Edition allows a maximum of two CPUs, however, performance is curtailed sharply from more substantial SAS environments that can benefit from parallel and distributed processing, such as designs that implement SAS Grid Manager, Teradata, and Hadoop solutions.  Even when comparing performance of the SAS University Edition against the most straightforward implementation of SAS Display Manager (running on the same computer), SAS Display Manager demonstrates significantly greater performance.  With parallel processing and distributed computing (including programmatic and non-programmatic methods) becoming the status quo in SAS production software, the SAS University Edition will unfortunately continue to fall behind its SAS counterparts if it cannot harness parallel processing best practices.  To curb this performance disparity, this text introduces groundbreaking programmatic methods that enable commodity hardware to be networked so that multiple instances of the SAS University Edition can communicate and work collectively to divide and conquer complex tasks.  With parallel processing enabled, a SAS practitioner can now easily harness an endless number of computers to produce blitzkrieg solutions with SAS University Edition that rival the performance of those produced on costly, complex infrastructure.


Your database can do complex string manipulation too!
Harry Droogendyk

Since databases have often lacked the extensive string-handling capabilities available in SAS, SAS users have frequently been forced to extract complex character data from the database into SAS for string manipulation.  As database vendors make regular expression functionality more widely available for use in SQL, moving data into SAS for pattern matching, string replacement, and character extraction is no longer (as) necessary.

This paper will cover enough regular expression patterns to make you dangerous, demonstrate the various regexp SQL functions and provide practical applications for each.
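As a hedged illustration of the general idea (not an example from the paper), explicit SQL pass-through lets the database do the pattern matching itself; the Oracle connection details, table, and column names below are placeholders, and REGEXP_SUBSTR is Oracle's regular expression extraction function:

proc sql;
   connect to oracle as ora (user=scott password=XXXXXXXX path=orapath);
   create table digits as
   select * from connection to ora (
      select id,
             regexp_substr(note_txt, '[0-9]+') as first_number   /* first run of digits */
      from   customer_notes
   );
   disconnect from ora;
quit;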


Stress Testing and Supplanting the SAS® LOCK Statement: Implementing Mutex Semaphores To Provide Reliable File Locking in Multi-User Environments To Enable and Synchronize Parallel Processing
Troy Hughes

The SAS® LOCK Statement was introduced in SAS version 7 with great pomp and circumstance, as it enabled SAS software to lock data sets exclusively.  In a multi-user or networked environment, an exclusive file lock prevents other users or processes from accessing and accidentally corrupting a data set while it is in use.  Moreover, because file lock status can be tested programmatically with the LOCK statement return code (&SYSLCKRC), data set accessibility can be validated before attempted access, thus preventing file access collisions and facilitating more reliable, robust software.  Notwithstanding the intent of the LOCK statement, stress testing demonstrated in this text illustrates vulnerabilities in the LOCK statement that render its use inadvisable due to its inability to reliably lock data setsits only purpose.  To overcome this limitation and enable reliable data set locking, a methodology is demonstrated that utilizes dichotomous semaphoresor flagsthat indicate whether a data set is available or is in use, and mutually exclusive (mutex) semaphores that restrict data set access to a single process at one time.  With Base SAS file locking capabilities now restored, this text further demonstrates control table locking to support process synchronization and parallel processing.  The SAS macro LOCKITDOWN is included and demonstrates busy-waiting (or spinlock) cycles that repeatedly test data set availability until file access is achieved or a process times out.
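For reference, the classic pattern under discussion (and, per the paper, the one that proves fallible under stress) tests &SYSLCKRC before touching the data set; MYLIB.MASTER and the update step below are placeholders:

%macro try_update;
   lock mylib.master;                             /* request an exclusive lock           */
   %if &syslckrc = 0 %then %do;                   /* lock obtained: safe to update       */
      proc append base=mylib.master data=work.new_rows; run;
      lock mylib.master clear;                    /* release the lock when finished      */
   %end;
   %else %put NOTE: mylib.master is in use - update skipped.;
%mend try_update;
%try_update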



e-Posters

Using SAS® to Examine Relationships among Leadership Styles of College of Nursing Deans and Faculty Job Satisfaction Levels in Research Intensive Institutions
Abbas Tavakoli and Karen Worthy

Leadership and job satisfaction are two factors that have been regarded as fundamental for organizational success.  It has been shown that employees with high job satisfaction levels are likely to be more productive and exert more effort in pursuing organizational interests.  The purpose of this study was to identify perceived leadership styles of nursing deans to determine whether they correlate with nursing faculty job satisfaction.  The study used many SAS® procedures to analyze descriptive, correlational data.  The sample for this national study consisted of 303 full-time nursing faculty members, out of 1,626 recruited, from 24 public research universities with very high research activity in the United States.

The results show a significant positive linear relationship between both transformational and transactional leadership styles and job satisfaction.  The results also indicate a significant negative relationship between passive leadership style and job satisfaction (r = -.43).  Multiple regression indicated that the different leadership styles (transformational, transactional, and passive) are related to job satisfaction after controlling for interaction with the dean.  The R-square values for transformational, transactional, and passive leadership on job satisfaction were .38, .13, and .26, respectively.  SAS® is powerful software for analyzing many types of data.


Using SAS® to Examine the Relationship between Primary Caregivers' Adverse Childhood Experiences (ACE) and Child Abuse Allegations
Abbas Tavakoli, Katherine Chappell and Senna Dejardins

Child maltreatment affected nearly 700,000 children in the United States in 2012.  Child maltreatment is broken down into four main divisions according to the Centers for Disease Control and Prevention (CDC): physical abuse, sexual abuse, emotional abuse, and neglect.  South Carolina's ranking for child well-being is among the poorest, at 45th in the nation.  Earlier recognition and intervention for a child victim of abuse allegations could have a positive impact on their future health and well-being.  The purpose of this paper is to use SAS® to examine the relationship between primary caregivers' adverse childhood experiences scores and child abuse allegations in the family.  Adverse childhood experiences (ACE) scores were used for this study.  There were 10 items with possible responses of no or yes.  The total scale was created by summing responses for the 10 items.  Data collection was conducted at the Child Advocacy Center (CAC) of Aiken and the Dickerson Center for Children, where families with allegations of child abuse bring children for services.  Each participant completed an ACE survey and a demographic questionnaire.  PROC MEANS and PROC FREQ were used to describe the data.  PROC CORR was used to examine the linear relationship of total ACE score to ordinal and continuous variables.  PROC TTEST, PROC NPAR1WAY, and PROC GLM were used to examine differences in mean ACE score across selected variables.  Male caregivers had a slightly higher mean ACE score (8.28) than female caregivers (7.75).  The average total ACE score was similar by site, race, and marital status.  The ACE score was higher for physical abuse compared to other types of abuse.  The results of the t-tests, nonparametric tests, and GLM did not reveal significant differences in ACE score across the above variables (all p-values greater than .05).


Patients with Morbid Obesity and Congestive Heart Failure Have Longer Operative Time and Room Time in Total Hip Arthroplasty
Yubo Gao

More and more patients undergoing total hip arthroplasty have obesity, and previous studies have shown a positive correlation between obesity and increased operative time in total hip arthroplasty.  But those studies shared the limitation of small sample sizes.  Decreasing operative time and room time is essential to meeting the increased demand for total hip arthroplasty, and factors that influence these metrics should be quantified to allow for targeted reduction in time and adjusted reimbursement models.  This study intends to use a multivariate approach to identify which factors increase operative time and room time in total hip arthroplasty.  For the purposes of this study, the American College of Surgeons National Surgical Quality Improvement Program database was used to identify a cohort of over thirty thousand patients having total hip arthroplasty between 2006 and 2012.  Patient demographics, comorbidities including body mass index, and anesthesia type were used to create generalized linear models identifying independent predictors of increased operative time and room time.  The results showed that morbid obesity (body mass index >40) independently increased operative time by 13 minutes and room time by 18 minutes.  Congestive heart failure led to the greatest increase in overall room time, resulting in a 20-minute increase.  Anesthesia method further influenced room time, with general anesthesia resulting in an increased room time of 18 minutes compared with spinal or regional anesthesia.  Obesity is the major driver of increased operative time in total hip arthroplasty.  Congestive heart failure, general anesthesia, and morbid obesity each lead to substantial increases in overall room time, with congestive heart failure leading to the greatest increase.  All analyses were conducted using SAS® 9.4 (SAS Institute, Cary, NC).


The Power of Interleaving Data
Yu Feng

Have you ever had the experience of writing multiple SAS® data steps to accomplish a task but felt that there should be more efficient SAS code which could do the job?  Interleaving a dataset with itself can help you fulfill the task in the examples enumerated in this paper.  The intent of this paper is to introduce the fundamental understanding of the interleaving process and discuss a few examples showing the power of interleaving data with itself.  The examples included in this paper are commonly used in outcomes research.  This paper will focus on interleaving a dataset with itself, not on interleaving two or more different datasets.
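As a minimal sketch of the mechanics only (not an example from the paper), listing the same data set twice on a SET statement with a BY statement interleaves it with itself; the IN= flags distinguish the two copies, so, for instance, the earliest and latest visit per patient can be handled in one step (VISITS, PATID, and VISIT_DATE are assumed names):

proc sort data=visits; by patid visit_date; run;

data first_last;
   set visits(in=a) visits(in=b);     /* the same data set listed twice = interleave */
   by patid;
   if a and first.patid then output;  /* earliest visit, taken from the first copy   */
   if b and last.patid  then output;  /* latest visit, taken from the second copy    */
run;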


Formula 1: Analytics behind the tracks to the podium
Pallabi Deb and Piyush Lashkare

SPEED, POWER, INNOVATION, PERFORMANCE, FACTS, and STATISTICS are the words that distinguish Formula 1 racing, also known as The Pinnacle of Motor Sport.  It generates a yearly income of 1.2 billion dollars and involves many teams competing with their roaring turbocharged engines for the spectators.  With an average team budget of 300 million dollars, teams need not only to show their engineering excellence but also a winning strategy.  During a race, most drivers have an average heart rate of 170 beats per minute, cars often exceed 150 mph, and a difference of milliseconds separates winners from losers.  These characteristics make it different from most other sports.  A Formula 1 car deploys about 150 sensors measuring all sorts of variables around the car, with 500 different parameters across the system measuring nearly 13,000 health parameters and events, resulting in 750 million numbers.  A single race generates over 3 terabytes of staggering data.  This data is further analyzed with the objective of gaining a competitive advantage; in short, applying analytics for winning business insights.  Today, Formula 1 racing is not only about flawless aerodynamics and legendary driving but is also fueled by DATA to accelerate the analysis.  Currently, we have extracted the data for the entire 2011 Grand Prix season across all the circuits.  For each circuit, we have data for 5 sessions (FP1, FP2, FP3, Qualifying, and Final) for all the constructor teams.  Because the volume of data is so large, we plan to analyze the data for one circuit and the top five constructor teams to create a model that illustrates how significant attributes such as track conditions, tire types, fuel lap time, pit stops, and stint length predict a probable position of a driver in the final standing charts.  This model will predict the following -

Privacy Protection using Base SAS®: Purging Sensitive Information from Free Text Emergency Room Data
Michelle White, Thomas Schroeder, Li Hui Chen and Jean Mah

Federal agencies must balance privacy protection concerns with the competing priorities of accessibility and usability of open data.  Free text data often provides detail and qualitative value not offered by coded data.  However, free text narratives are more likely to contain personally identifiable or sensitive information.  This paper describes how U.S. Consumer Product Safety Commission (CPSC) staff identifies sensitive information in emergency department (ED) narratives using macros, Perl regular expressions and the PRXMATCH function in Base SAS Version 9.

CPSC's National Electronic Injury Surveillance System (NEISS) is a national probability sample of hospitals with EDs in the U.S. and its territories.  The NEISS collects information for about 400,000 product-related ED visits annually.  Each NEISS record includes coded variables and a brief text narrative.  This narrative may contain sensitive information (e.g., patient names, product brands) that must be purged before the NEISS data is publicly released.  About 65 percent of the narratives are immediately reviewed and purged, if necessary, by contract reviewers.  The data is then input into a SAS program to identify potentially sensitive words in the remaining narratives and to verify the contractor review.  A macro compares each narrative to a SAS dataset where each observation contains a purge word or Perl regular expression, which can encompass misspellings or indicate a numerical identifier (e.g., social security number, birthdate).  If a purge word or expression is contained in the narrative, the case is output for review by CPSC staff.

In this way, CPSC staff reviews only narratives containing potentially sensitive information.  Previously un-reviewed narratives may be purged, and purges done on previously reviewed narratives are marked as having been missed by the contractors.  New purge terms are periodically identified by comparing the terms actually purged from reviewed narratives with those in the SAS dataset.

Using tools available in Base SAS, CPSC staff has semi-automated the purging of sensitive information from ED narratives.  A similar process could be applied to other data containing free text, such as electronic medical records and death certificates.
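An illustrative sketch of the flagging step only (not CPSC code): compile each pattern once, then scan every narrative against the full pattern list.  The data sets PURGE_TERMS (with variable PATTERN holding a Perl regular expression) and NARRATIVES (with variable NARRATIVE), and the array size, are assumptions:

data flagged;
   array rx{1000} _temporary_;                  /* room for up to 1,000 compiled patterns */
   if _n_ = 1 then do i = 1 to npat;
      set purge_terms nobs=npat point=i;        /* PATTERN holds one Perl regex           */
      rx{i} = prxparse(pattern);                /* compile each pattern exactly once      */
   end;
   set narratives;                              /* NARRATIVE holds the ED free text       */
   flag = 0;
   do i = 1 to npat while (flag = 0);
      if prxmatch(rx{i}, narrative) > 0 then flag = 1;
   end;
   if flag then output;                         /* route the case for manual review       */
   drop i pattern;
run;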


Extracting Email Domains and Geo-Processing IP Addresses in SAS®
Alec Zhixiao Lin

Web data has become a very important source for analytics in the current era of social media and booming e-commerce.  Email domain and IP address are two important attributes potentially useful for market sizing, detection of online statistical anomalies, and fraud prevention.  This paper introduces a few methods in SAS that extract email domains and process IP addresses to prepare data for subsequent analyses.
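As a hedged sketch of the kind of preparation involved (the input data set CONTACTS and its variables EMAIL and IP are assumed, and converting a dotted-quad address to a single number is one common way to support geo-IP range lookups):

data web_prep;
   set contacts;
   length domain $60;
   domain = lowcase(scan(email, 2, '@'));        /* everything after the @ sign          */
   ip_num = input(scan(ip,1,'.'), 8.)*256**3     /* dotted quad -> one number, handy for */
          + input(scan(ip,2,'.'), 8.)*256**2     /* joining to geo-IP range tables       */
          + input(scan(ip,3,'.'), 8.)*256
          + input(scan(ip,4,'.'), 8.);
run;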


The Orange Lifestyle
Sangar Rane and Mohit Singhi

For a freshman at a large university, life can be fun as well as stressful.  The choices a freshman makes while in college may impact his or her overall health.  In order to examine the overall health and different behaviors of students at Oklahoma State University, a survey was conducted among freshman students.  The survey focused on capturing psychological, environmental, diet, exercise, and alcohol and drug use factors among students.  A total of 790 out of 1,036 freshman students completed the survey, which included around 270 questions or items covering the range of issues mentioned above.  An exploratory factor analysis identified 34 possible factors.  For example, two factors that relate to the behavior of students under stress are eating and relaxing.  Analysis is currently continuing, and we hope the results will give us deep insights into the lives of students and thereby help improve the health and lifestyle of students at Oklahoma State University in future years.


Using SAS® to create a Build Combinations Tool to Support Modularity
Stephen Sloan

With SAS PROC SQL we can use a combination of a manufacturing Bill of Materials and a sales specification document to calculate the total number of configurations of a product that are potentially available for sale.  This will allow the organization to increase modularity with maximum efficiency.

Since some options might require or preclude other options, the result is more complex than a straight multiplication of the numbers of available options.  Through judicious use of PROC SQL, we can maintain accuracy while reducing the time, space, and complexity involved in the calculations.
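A hedged sketch of the idea with made-up tables: OPTIONS_TBL lists each available option by feature group, and EXCLUSIONS lists pairs of options that cannot co-exist; a join builds candidate combinations and the NOT EXISTS clause removes combinations that violate an exclusion rule, so the final count reflects only buildable configurations:

proc sql;
   create table valid_combos as
   select a.option_id as engine, b.option_id as trim
   from   options_tbl a, options_tbl b
   where  a.feature = 'ENGINE' and b.feature = 'TRIM'
     and  not exists (select 1 from exclusions e
                      where e.opt1 = a.option_id and e.opt2 = b.option_id);

   select count(*) as n_configurations from valid_combos;   /* sellable configurations */
quit;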


Strike a pose! Quick and Easy Camera Ready Reporting with SAS®
Nancy McGarry

Getting data out to non-programmer staff in a clear non-technical format can present a challenge.  Often all the analyst wants to see are numbers presented in a clear and concise manner. Enter PROC REPORT!

This ePoster presents a simple SAS® program which takes processed data and produces camera-ready report results.
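As a minimal sketch of what such a program can look like (the data set SUMMARY and its variables SITE, N, and PCT are assumptions, and RTF is just one possible destination):

ods rtf file='summary_report.rtf' style=journal;
title 'Site Enrollment Summary';
proc report data=summary nowd headline;
   columns site n pct;
   define site / group 'Site';
   define n    / analysis sum 'Enrolled (N)';
   define pct  / analysis mean 'Percent of Total' format=percent8.1;
run;
ods rtf close;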


A Failure to EXIST: Why Testing for Data Set Existence with the EXIST Function Alone Is Inadequate for Serious Software Development in Asynchronous, Multi-User, and Parallel Processing Environments
Troy Hughes

The Base SAS® EXIST function demonstrates the existence (or lack thereof) of a data set.  Conditional logic routines commonly rely on EXIST to validate data set existence or absence before subsequent processes can be dynamically executed, circumvented, or terminated based on business logic.  In synchronous software design where data sets cannot be accessed by other processes or users, EXIST is both a sufficient and reliable solution.  However, because EXIST captures only a split-second snapshot of the file state, it provides no guarantee of file state persistence.  Thus, in asynchronous, multi-user, and parallel processing environments, data set existence can be assessed by one process but instantaneously modified (by creating or deleting the data set) thereafter by a concurrent process, leading to a race condition that causes failure.  Due to this vulnerability, most classic implementations of the EXIST function within SAS literature are insufficient for testing data set existence in these complex environments.  This text demonstrates more reliable and secure methods to test SAS data set existence and perform subsequent, conditional tasks in asynchronous, multi-user, and parallel processing environments.
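For reference, the classic pattern in question (sufficient in synchronous designs, but vulnerable to the race condition described above) looks like this sketch, where MYLIB.DAILY and the conditional step are placeholders:

%macro run_if_exists;
   %if %sysfunc(exist(mylib.daily)) %then %do;    /* snapshot of the file state */
      proc means data=mylib.daily; run;
   %end;
   %else %put NOTE: mylib.daily not found - step skipped.;
%mend run_if_exists;
%run_if_exists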


An Analysis of the Repetitiveness of Lyrics in Predicting a Song's Popularity
Drew Doyle

In the interest of understanding whether there is a correlation between the repetitiveness of a song's lyrics and its popularity, the top ten songs from the year-end Billboard Hot 100 Songs chart from 2002 to 2015 were collected.  These songs then had their lyrics assessed to determine the counts of the top ten words used.  These word counts were then used to predict the number of weeks the song was on the chart.  The prediction model was analyzed to determine the quality of the model and whether word count is a significant predictor of a song's popularity.  To investigate whether song lyrics are becoming more simplistic over time, several tests were completed to see if the average word counts have been changing over the years.  All analysis was completed in SAS® using various PROCs.


SAS® Macro for Automated Model Selection Involving PROC GLIMMIX and PROC MIXED
Fan Pan and Jin Liu

The generalized linear mixed model (GLMM) and the linear mixed model (MIXED) deal with situations where the mean, conditional on normally distributed random effects, is linearly related to the model effects.  Assumptions for MIXED are similar to those for the GLMM, including normally distributed random effects and residuals and independence among the random effects and model errors.  GLMMs allow analysis of both normally distributed and certain types of non-normally distributed dependent variables when random effects are present, whereas MIXED is used for the normal distribution.  Both PROC GLIMMIX and PROC MIXED can be used on the same dataset, but different results may be obtained from the different procedures, so it is of interest to compare the final model results between the two.  This paper details a group of macros performing separate automated model selection using PROC GLIMMIX and PROC MIXED.  The macros use the Output Delivery System (ODS) to save the resulting statistics from the model fittings.  Model selection indices, such as AICC (corrected Akaike Information Criterion) and chi-square values, are calculated based on the resulting statistics during all-possible-model selection.  Final GLMM and MIXED models include data exploration, influence diagnostics, and checking for model violations with the experimental ODS GRAPHICS option.  The macros also graphically compare the summaries of the best models selected by GLIMMIX and MIXED and suggest which procedure is better suited for the dataset.  Two examples will be provided using a real dataset to show the application of the macros.  The macros will be significant for researchers who are interested in the application of mixed models.
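One building block of that approach, sketched here with placeholder data and model effects, is capturing fit statistics through ODS OUTPUT so that AICC can later be compared across candidate models:

ods output FitStatistics=fit_mixed;        /* capture AICC, AIC, BIC, -2 log likelihood */
proc mixed data=mydata;
   class block;
   model y = x1 x2;
   random block;
run;
ods output close;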


Predicting student success based on interactions with virtual learning environment
Vivek Doijode and Neha Singh

Online learning can be called the millennial sister of classroom learning; tech savvy, always connected, and flexible.  These features offer a convenient alternative to students with constraints and working professionals to learn on demand.  According to the National Center for Education Statistics, over 5 million students are currently enrolled in distance education courses.  The growing trend and popularity of MOOCs (Massive Open Online Courses) and distance learning makes this an interesting area of research.  We plan to work on the OULA (Open University Learning Analytics) dataset.  Learning analytics provides many insights on the learning pattern of students and on module assessments.  These insights may be researched to enhance participants' learning experience.  In this paper, we predict students' success in an online course using regression, clustering, and classification methods.  We have a mix of categorical and numeric inputs present in the OULA datasets, which are in csv file formats and contain information for more than 30,000 students pertaining to 7 distance learning courses, student demographics, course assessments, and student interaction with the virtual learning environment.  We have merged tables together using unique identifiers.  We will first explore the merged data using SAS® to generate insights and then build appropriate predictive models.


Applying SAS® to Explore the Utilization and Impact of Sensitive Clinical Indicators on a Heart Failure Unit
Jametta Magwood-Golston, Abbas Tavakoli, Nurses of Moultrie Heart Failure Unit at Palmetto Health Richland, Christina Payne, Harmony Robinson, Veronica Deas and Forrest Fortier

Heart failure (HF) is a chief cause of hospitalization and a contributor to health care costs in the United States.  While there has been a significant reduction in HF patients' hospital length of stay, because of the disease's complexity nearly 25% of patients hospitalized with heart failure are readmitted within 30 days, primarily due to co-morbidities.  In an effort to reduce the potential for 30-day readmission and continue to reduce HF patient length of stay, our hospital implemented a care intervention involving interdisciplinary rounding at each patient's bedside.  Each clinical discipline involved in patient care is present at the bedside to discuss clinical, quality, and harm indicators (i.e., pressure ulcers, central lines, Foley catheters, medication and dosage, patient discharge, length of stay, etc.) and their potential influence on patient length of stay.  The purpose of the study was to evaluate the impact of clinical culture change after the introduction of the Accountable Care Unit™ care model.  Specific aims of the study were (1) to measure change in the length of stay of patients discharged from the heart failure accountable care unit; (2) to enhance team situational awareness to achieve unit-level quality enhancements; and (3) to eliminate unnecessary medical waste.  This study used a descriptive, comparative design to analyze clinical metrics before and after the care model was introduced on the heart failure unit.  The results did reveal a significant difference in the length of stay for patients discharged from the heart failure unit.  There were statistically significant differences for all included minority patients (Black, Hispanic, and American Indian), with the exception of Asian patients.  There was also statistical significance in the utilization of high-risk medications; however, there were no statistical differences by patient gender.


Text mining and sentiment analysis on video game user reviews using SAS® Enterprise Miner™ and SAS® Sentiment Analysis Studio
Mukesh Kumar Singh

Digital gaming has a history of more than 50 years.  The industry started in the late 1960s when game titles such as Pong, Centipede, and Odyssey were introduced to consumer markets.  Digital gaming is now a widespread phenomenon, and at least 70% of US and European households say that they play video games using different consoles such as PC, Xbox, PS4, and Nintendo.  It is reported that in 2011 the total revenue of the industry amounted to about 17 billion USD.  Each game is reviewed and rated on the internet by users who played the game, and the reviews often contrast based on the sentiments expressed by the users.  Analyzing those reviews and ratings to describe the positive and negative factors of a specific game could help consumers make a more informed decision about the game.

In this paper, we will analyze 10,000 reviews and ratings on a scale of 1-10 for 200 games culled from two sites: metacritic.com and gamespot.com.  We will then build predictive models to classify the reviews into positive, negative, and mixed based on the sentiments of users, and develop a score which defines the overall performance of the game so that users get all the required information about a game before purchasing a copy.


Text Analysis of American Airline Reviews
Rajesh Tolety

According to a TripAdvisor survey report, about 43% of airline passengers rely on online reviews of different airlines before booking a ticket.  Therefore the nature and tone of the reviews are important metrics for airlines to track and manage.  We plan to do text analysis of online reviews of American Airlines, which runs about 945 flights across 350 destinations.  The analysis would help American Airlines understand what their passengers are talking about and perhaps take actions to improve their service.  The extracted dataset includes customers' ratings (on a scale of 1-5), date of review, detailed comments, and location.  We plan to do text analysis as well as supervised sentiment analysis on this dataset.



Hands On Workshop

A Tutorial on the SAS® Macro Language
John Cohen

The SAS Macro language is another language that rests on top of regular SAS code.  If used properly, it can make programming easier and more fun.  However, not every program is improved by using macros.  Furthermore, it is another language syntax to learn, and can create problems in debugging programs that are even more entertaining than those offered by regular SAS.

We will discuss using macros as code generators, saving repetitive and tedious effort, for passing parameters through a program to avoid hard-coding values, and to pass code fragments, thereby making certain tasks easier than using regular SAS alone.  Macros facilitate conditional execution and can be used to create program modules that can be standardized and re-used throughout your organization.  Finally, macros can help us create interactive systems in the absence of SAS/AF.

When we are done, you will know the difference between a macro, a macro variable, a macro statement, and a macro function.  We will introduce interaction between macros and regular SAS language, offer tips on debugging macros, and discuss SAS macro options.
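As a preview of the code-generator idea, even a macro as small as the following saves repetitive typing; the macro name, parameters, and example calls are illustrative only:

%macro freqit(data=, var=);
   proc freq data=&data;            /* the macro generates the same step        */
      tables &var / missing;        /* for whatever data set and variable       */
   run;                             /* are passed in as parameters              */
%mend freqit;

%freqit(data=sashelp.class, var=sex)
%freqit(data=sashelp.cars,  var=origin)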


Fundamentals of the SAS® Hash Object
Paul Dorfman


Starting with the basics and progressing to some less-used features, this workshop is designed to show how the SAS® hash object really works.  The main emphasis will be not on amassing as many pieces of template code as possible, but rather on the fundamental things a hash object programmer must understand in order to use it in creative ways.  The aim is not so much about tasting already cooked hash dishes (though there will be plenty of chances to do that, too), but about cooking properly based on the fundamental properties of the ingredients and their interactions.  Cuisine is so much more than just following a recipe, and the same is true for programming with the hash object!  In particular, we'll learn what the DATA step compiler sees when it encounters hash object references and what it must have seen - and done - to make the hash object work when its run-time turn comes.  We'll goof - intentionally - to learn from errors reported in the SAS log.  We'll see how the variables stored in the hash object talk to their host counterparts in the PDV, and which hash methods make them affect each other and how, including those related to the hash iterator.  While focusing on these underlying works, we'll learn many other utile things that together ought to form a solid basis for making the hash object a valuable SAS programming tool.
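As a taste of the PDV/hash interplay mentioned above, here is a bare-bones lookup; the data sets WORK.PRICES and WORK.ORDERS and their variables ITEM, UNIT_PRICE, and QTY are assumed for illustration:

data priced;
   if _n_ = 1 then do;
      declare hash h(dataset: 'work.prices');   /* load the lookup table once           */
      h.defineKey('item');
      h.defineData('unit_price');
      h.defineDone();
      call missing(unit_price);                 /* create the host variable in the PDV  */
   end;
   set work.orders;
   if h.find() = 0 then amount = qty * unit_price;  /* FIND() fills the host variable   */
   else amount = .;                                 /* no match in the lookup table     */
run;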


Making Sense of PROC TABULATE
Jonas Bilenas and Kajal Tahiliani

The TABULATE procedure in SAS® provides a flexible platform to generate tabular reports.  Many beginning SAS programmers have a difficult time understanding the syntax of PROC TABULATE and tend to avoid using the procedure.  This tutorial will explain the syntax of PROC TABULATE and, with examples, show how to grasp the power of PROC TABULATE.  The data used in this paper represent simulated consumer credit card usage data, and the code was developed using SAS 9.2.
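A small example in the same spirit, using a SASHELP data set rather than the authors' simulated credit card data:

proc tabulate data=sashelp.class;
   class sex;
   var height weight;
   table sex all,                          /* rows: each sex plus an overall line       */
         (height weight)*(n mean*f=8.1);   /* columns: N and mean for each analysis var */
run;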


New for SAS® 9.4: A Technique for Including Text and Graphics in Your Microsoft Excel Workbooks, Part 1
Vince DelGobbo

A new ODS destination for creating Microsoft Excel workbooks is available starting in the third maintenance release of SAS® 9.4.  This destination creates native Microsoft Excel XLSX files, supports graphic images, and offers other advantages over the older ExcelXP tagset.  In this presentation you learn step-by-step techniques for quickly and easily creating attractive multi-sheet Excel workbooks that contain your SAS® output.  The techniques can be used regardless of the platform on which SAS software is installed.  You can even use them on a mainframe!  Creating and delivering your workbooks on-demand and in real time using SAS server technology is discussed.  Although the title is similar to previous presentations by this author, this presentation contains new and revised material not previously presented.
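A minimal sketch of the destination itself (the file path, sheet names, and data sets are placeholders):

ods excel file='class_report.xlsx'
          options(sheet_name='Listing' embedded_titles='yes');
title 'Heights and Weights';
proc print data=sashelp.class noobs; run;

ods excel options(sheet_name='Summary');   /* start a new worksheet in the same workbook */
proc means data=sashelp.class mean min max;
   class sex;
   var height weight;
run;
ods excel close;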


Introduction to Data Simulation
Jason Brinkley

Creating synthetic data via simulation can often be a powerful tool for a wide variety of analyses.  The purpose of this workshop is to provide a basic overview of simulating data for a variety of purposes.  Examples will include power calculations, sensitivity analysis, and exploring nonstandard analyses.  The workshop is designed for the mid-level analyst who has basic knowledge of data management, visualizations and basic statistical analyses such as correlations and t-tests.
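A flavor of the kind of example covered, using only Base SAS; the seed, effect size, and sample size below are arbitrary choices for illustration:

data sim;
   call streaminit(20160901);                   /* fix the seed for reproducibility   */
   do i = 1 to 1000;
      group = rand('bernoulli', 0.5);           /* 0/1 group assignment               */
      y     = rand('normal', 0, 1) + 0.3*group; /* shift the mean for group 1         */
      output;
   end;
run;

proc ttest data=sim;    /* see how often the simulated effect is detected */
   class group;
   var y;
run;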


A Short Introduction to Longitudinal and Repeated Measures Data Analyses
Leanne Goldstein

Longitudinal and repeated measures data are seen in nearly all fields of analysis.  Examples of such data include weekly lab test results of patients or test scores by children from the same class.  Statistics students and analysts alike may be overwhelmed when it comes to repeated measures or longitudinal data analyses.  They may try to educate themselves by diving into textbooks or taking semester-long or intensive weekend courses, resulting in even more confusion.  Some may try to ignore the repeated nature of the data and take shortcuts such as analyzing all data as independent observations or analyzing summary statistics such as averages or changes from first to last points, ignoring all the data in between.  This hands-on presentation will introduce longitudinal and repeated measures analyses without heavy emphasis on theory.  Students in the workshop will have the opportunity to get hands-on experience graphing longitudinal and repeated measures data.  They will learn how to approach these analyses with tools like PROC MIXED and PROC GENMOD.  Emphasis will be on continuous outcomes, but categorical outcomes will briefly be covered.
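A first-pass model of the type covered might look like the sketch below; ID, VISIT, and SCORE are assumed variable names, and compound symmetry is only one of the covariance structures that could be discussed:

proc mixed data=long_scores;
   class id visit;
   model score = visit / solution;
   repeated visit / subject=id type=cs;   /* within-subject correlation structure */
run;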


Quick Results with SAS® Enterprise Guide®
Kirk Paul Lafler, Mira Shapiro and Ryan Paul Lafler

SAS® Enterprise Guide® empowers organizations, programmers, business analysts, statisticians and end-users with all the capabilities that SAS has to offer.  This hands-on workshop presents the Enterprise Guide graphical user interface (GUI), access to multi-platform enterprise data sources, various data manipulation techniques without the need to learn complex coding constructs, built-in wizards for performing reporting and analytical tasks, the delivery of data and results to a variety of mediums and outlets, and support for data management and documentation requirements.  Attendees learn how to use the graphical user interface to access SAS data sets, tab-delimited and Excel input files; subset and summarize data; join (or merge) two tables together; flexibly export results to HTML, PDF and Excel; and visually manage projects using flow diagrams.



Life Sciences/Healthcare/Insurance

A Novel Approach to Calculating Medicare Hospital Readmissions for the SAS® Novice
Karen Wallace

The hospital Medicare readmission rate has become a key indicator for measuring the quality of healthcare in the US, and it is currently adopted by major healthcare stakeholders including the Centers for Medicare and Medicaid Services (CMS), the Agency for Healthcare Research and Quality (AHRQ), and the National Committee for Quality Assurance (NCQA) (Fan and Sarfarazi, 2014).

Although many papers have been written about how to calculate readmissions (as referenced), this paper offers a novel, basic, and comprehensive approach using the options of the SAS DATA step as well as PROC SQL for: 1) de-identifying patient data, 2) calculating sequential admissions, and 3) filtering out the criteria required to report CMS 30-day readmissions.  Additionally, it demonstrates: 1) using ODS to create a labeled and de-identified data set, 2) a macro to examine data quality, and 3) summary statistics useful for reporting and analysis.
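A hedged sketch of the "sequential admissions" step only, with assumed variables PATID, ADMIT_DT, and DISCH_DT: flag admissions that begin within 30 days of the same patient's prior discharge.

proc sort data=admits; by patid admit_dt; run;

data readmit_flags;
   set admits;
   by patid;
   prior_disch = lag(disch_dt);
   if first.patid then prior_disch = .;     /* no prior stay for this patient          */
   days_since  = admit_dt - prior_disch;    /* SAS dates, so the difference is in days */
   readmit_30  = (0 <= days_since <= 30);   /* candidate 30-day readmission            */
   format prior_disch date9.;
run;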


Protecting the Innocent (and your data)
Stanley Legum

A recurring problem with large research databases containing sensitive information on individuals' health, finances, and personal characteristics is how to make meaningful extracts available to qualified researchers without compromising the privacy of the individuals whose data are in the database.  This problem is exacerbated when a large number of extracts need to be made from the database.  In addition to employing statistical disclosure control methods, this paper recommends limiting the variables included in each extract to the minimum needed and implementing a method of assigning request-specific randomized IDs to each extract that is secure and self-documenting.


Sankey Diagram with Incomplete Data From a Medical Research Perspective
Yichen Zhong

The Sankey diagram is widely used in the energy industry but is relatively rare in medical research, where it is an innovative tool for visualizing patient flow in longitudinal data.  A SAS® macro for generating Sankey diagrams with bar charts was published in 2015.  However, it did not provide an intuitive solution for missing or terminated data, which are common in medical research.  This paper presents a modification of this macro that allows subjects with incomplete data to appear on the Sankey diagram.  In addition, examples of using Sankey diagrams in medical research are provided.


How is Healthcare Cost Data Distributed? Using Proc Univariate to Draw Conclusions about Millions of Different Customers
Aran Canes

Modelling health care cost and utilization data has received substantial attention from the academic community.  Different methodological approaches have been used, tested, and contrasted on various health care datasets.  Some of the simpler approaches include Ordinary Least Squares, Generalized Linear Models, and taking the log transform of the dependent variable, while more sophisticated non-parametric methods have also been proposed and tested.  The conclusion of most researchers is that, while some questions remain unresolved, different approaches are recommended for different datasets.

Despite this plethora of methodological comparisons, there is a paucity of papers researching how health care cost data is actually distributed.  This is probably because of two major factors: difficulties accessing data because of privacy concerns, and a tacit assumption that different slices of cost will have different distributions.

While this paper does not try to falsify the hypothesis that, in some instances, different slices of healthcare cost data may be distributed differently, it reaches a surprising conclusion regarding a wide range of healthcare cost data slices among customers of a major insurer: all are distributed approximately log-normally once the substantial part of the population that is zero-cost is excluded.  This is true whether one looks at pharmacy or medical cost, different ways of purchasing insurance among customers or comparing customers who stayed eligible for a full calendar year versus customers who may have only temporarily had coverage.  These results are reached using the histogram and statistical significance tests available to all SAS users in PROC UNIVARIATE.

Since there is a noted lack of empirical results regarding how the cost data of large customer populations is distributed, this paper should be a significant help in assessing the validity of different methodological approaches.  If the results are confirmed in other populations, methodological discussions could be conducted with the underlying knowledge that, for healthcare cost data covering a substantial number of customers, the probability density function that best matches the data will be the lognormal.
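The distributional check itself can be reproduced in a few lines; the data set and variable names below are placeholders:

proc univariate data=member_costs;
   where total_cost > 0;                /* exclude the zero-cost members first          */
   var total_cost;
   histogram total_cost / lognormal;    /* overlay a fitted lognormal curve and request
                                           the corresponding goodness-of-fit tests      */
run;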


Are You Sure That Is Correct?: An Overview Of Good Practices For Dataset And Output Validation
Gregory Weller and Alex Buck

In the world of statistical programming, it is imperative to ensure that the source data are accurately represented in the datasets, tables, listings, and figures submitted to regulatory agencies and academic journals.  The gold-standard for ensuring correctness is double-independent programming where a production programmer and validation programmer aim to produce the same output.  SAS® PROC COMPARE is the most commonly-used tool to compare the outputs and determine differences.  While double-independent programming using PROC COMPARE can be a very useful process, it is important to recognize its limitations.  In certain cases, this may not be the most efficient or complete method of validation required.  Further, failure to establish and follow sound procedures and practices for validation can lead to disaster.

In this article, we will discuss good practices for double-independent programming, the proper use of PROC COMPARE, common pitfalls, and explore the different techniques needed for validating datasets, tables, listings, and figures.  While there will never be one true and final answer for validation best practice, it is the hope of the authors to provide a starting point for discussion.
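For reference, a typical double-programming comparison looks like the sketch below (the library and data set names and the ID variable are assumptions); the paper's point is that a clean PROC COMPARE of this kind is necessary but not always sufficient:

proc compare base=prod.adsl compare=qc.adsl listall criterion=1e-12;
   id usubjid;      /* both data sets sorted by the subject identifier */
run;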


Evaluating Sociodemographic and Geographic Disparities of Hypertension in Florida using SAS®
Desiree Jonas and Shamarial Roberson

Hypertension is one of the leading risk factors for chronic disease.  Chronic conditions such as heart disease, stroke, and diabetes are associated with hypertension.  In 2013 the prevalence of hypertension among adults in Florida was 34.6%.  As age increases, the risk of hypertension increases, placing older populations at greater risk for developing chronic conditions.  Florida has the second largest elderly population in the United States, which places an increased burden on the healthcare system.  This paper will demonstrate the use of SAS® to evaluate and map the influence of socio-demographic factors such as sex, race/ethnicity, age, education, and income on hypertension prevalence in Florida using Behavioral Risk Factor Surveillance System (BRFSS) data.  Additionally, PROC MAPIMPORT, PROC GMAP, and PROC SURVEYLOGISTIC will be used to assess the burden of hypertension in Florida.


A General SAS® Macro to Implement Optimal N:1 Propensity Score Matching Within a Maximum Radius
Kathy Fraeman

A propensity score is the probability that an individual will be assigned to a condition or group, given a set of covariates when the assignment is made.  For example, the type of drug treatment given to a patient in a real-world setting may be non-randomly based on the patient's age, gender, geographic location, overall health, and/or socioeconomic status when the drug is prescribed.  Propensity scores are used in observational studies to reduce selection bias by matching different groups based on these propensity score probabilities, rather than matching patients on the values of the individual covariates.  Although the underlying statistical theory behind propensity score matching is complex, implementing propensity score matching with SAS® is relatively straightforward.  An output data set of each patient's propensity score can be generated with SAS using PROC LOGISTIC, and a generalized SAS macro can do optimized N:1 propensity score matching of patients assigned to different groups.  This paper gives the general PROC LOGISTIC syntax to generate propensity scores, and provides the SAS macro for optimized propensity score matching.  A published example of the effect of comparing unmatched and propensity score matched patient groups using the SAS programming techniques described in this paper is presented.
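The scoring step itself can be as simple as the following sketch (the treatment indicator and covariates are placeholders, not the paper's variables); the macro then performs the matching on the output variable PSCORE:

proc logistic data=cohort;
   class gender region / param=ref;
   model treated(event='1') = age gender region baseline_risk;
   output out=ps_scores p=pscore;   /* predicted probability of treatment = propensity score */
run;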


I See de Codes: Using SAS® to Process and Analyze ICD-9 and ICD-10 Diagnosis Codes Found in Administrative Healthcare Data
Kathy Fraeman

Administrative healthcare data, including insurance claims data, electronic medical records (EMR) data, and hospitalization data, contain standardized diagnosis codes used to identify diseases and other medical conditions.  These codes go by the short-form name of the International Classification of Diseases, also known as ICD.  Much of the currently available healthcare data contains the 9th version of these codes, referred to as ICD-9, while codes from the more recent 10th version, ICD-10, are becoming more common in healthcare data.  These diagnosis codes are typically saved as character variables, often stored in arrays of multiple codes representing primary and secondary diagnoses, and can be associated with either outpatient medical visits or inpatient hospitalizations.  SAS text processing functions, array processing, and the SAS colon modifier can be used to analyze the text of these codes and identify similar codes, or ranges of ICD codes.  In epidemiologic analyses, groups of multiple ICD diagnosis codes are typically used to define more general comorbidities or medical outcomes.  These disease definitions based on multiple ICD diagnosis codes, also known as coding algorithms, can either be hard-coded within a SAS program or defined externally from the programming.  When coding algorithm definitions based on ICD codes are stored externally, the definitions can be read into SAS, transformed to SAS format, and dynamically converted into SAS programming statements required to identify patients with the comorbidities and outcomes of interest.
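For example, the colon modifier lets a short list of code prefixes stand in for whole ranges of ICD-9 codes; the CLAIMS data set, the DX1-DX10 array, and the 410/411 prefixes below are illustrative, not taken from the paper:

data flagged_mi;
   set claims;
   array dx{10} $ dx1-dx10;                        /* primary and secondary diagnosis codes */
   mi_dx = 0;
   do i = 1 to dim(dx);
      if dx{i} in: ('410', '411') then mi_dx = 1;  /* any code beginning with 410 or 411    */
   end;
   drop i;
run;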


SDTM What? ADaM Who? A Programmer's Introduction to CDISC
Venita DePuy

Most programmers in the pharmaceutical industry have at least heard of CDISC, but may not be familiar with the overall data structure, naming conventions, and variable requirements for SDTM and ADaM datasets.  This overview will provide a general introduction to CDISC from a programming standpoint, including the creation of the standard SDTM domains and supplemental datasets, and the subsequent creation of ADaM datasets.  Time permitting, we will also discuss when it might be preferable to create a CDISC-like dataset instead of a dataset that fully conforms to CDISC standards.


Sample Size Estimation with PROC FREQ and PROC POWER
Adeline Wilcox

Under contract to the Centers for Medicare & Medicaid Services, The Joint Commission specifies sample sizes for healthcare quality measurement.  Their sample size specifications pay no heed to established methods for sample size estimation.  SAS PROC POWER can be used to compute sample size estimates with precision.

Most healthcare quality measures are dichotomous.  From healthcare measurement data I used as pilot samples, I computed upper and lower confidence limits.  To do this, I used the BINOMIAL and CL options on the PROC FREQ TABLES statement.  After examining these results, I chose input values for the PROC POWER HALFWIDTH and PROBWIDTH options.  More statistics textbooks cover sample size estimation for hypothesis testing than for estimation.
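A hedged sketch of the two steps described; the data set, variable, and planning proportion are placeholders:

/* Confidence limits for a dichotomous quality measure from the pilot sample. */
proc freq data=pilot_cases;
   tables met_measure / binomial;          /* asymptotic and exact CIs for the proportion */
run;

/* Sample size needed for a Wilson interval of the desired half-width. */
proc power;
   onesamplefreq ci=wilson
      proportion = 0.85                    /* planning value taken from the pilot sample  */
      halfwidth  = 0.05
      probwidth  = 0.90
      ntotal     = .;                      /* solve for the required sample size          */
run;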


Building Efficiencies in Standard Macro Library using Polymorphism
Binoy Varghese and Sagar Rana

It is common practice in the bio-pharma industry to develop and maintain a library of standard SAS® macros to facilitate expedited data analysis and reporting, directed at ensuring compliance, consistency, quality, and reusability.  As requirements change over time, new macros are created or existing macros are updated.  In both scenarios, there is a tradeoff.  If new macros are created, programs calling earlier versions of these macros have to be modified before being used on a new project, thereby affecting the portability of the programs.  If existing macros are updated, programs from older projects may cease to execute as originally intended, impeding backward compatibility.  Both of these issues can be addressed by using polymorphism.  Polymorphism is an object-oriented programming concept that refers to the ability to manage methods bearing the same name but exhibiting different behaviors.  In the context of SAS programming, this may be translated as having the capability of calling homonymous macros that accept identical or different parameters to perform a diverse set of tasks.  In this paper, we examine the concept of polymorphism from a SAS macro programming perspective and present an implementation of this technique.
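A toy illustration of the idea (not the authors' implementation): a single macro name whose behavior depends on which parameters the caller supplies, so older calls keep working while newer calls get the extended behavior:

%macro summarize(data=, var=, class=);
   %if %length(&class) %then %do;          /* extended behavior when CLASS= is supplied */
      proc means data=&data mean std;
         class &class;
         var &var;
      run;
   %end;
   %else %do;                              /* original behavior when CLASS= is omitted  */
      proc means data=&data mean std;
         var &var;
      run;
   %end;
%mend summarize;

%summarize(data=sashelp.class, var=height)
%summarize(data=sashelp.class, var=height, class=sex)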



Planning/Support/Administration

What's Hot: Skills for SAS® Professionals
Kirk Paul Lafler

As a new generation of SAS® user emerges, current and prior generations of users have an extensive array of procedures, programming tools, approaches and techniques to choose from.  This presentation identifies and explores the areas that are hot in the world of the professional SAS user.  Topics include Enterprise Guide, PROC SQL, PROC REPORT, Output Delivery System (ODS), Macro Language, DATA step programming techniques such as arrays and hash objects, SAS University Edition software, technical support at support.sas.com, wiki-content on sasCommunity.org®, published white papers on LexJansen.com, and other venues.


Take a SPA Day with the SAS® Performance Assessment (SPA): Baselining Software Performance across Diverse Environments To Elucidate Performance Placement and Performance Drivers
Troy Hughes

Software performance is often measured through program execution time with higher performing software executing more rapidly than lower performing software.  Intrinsic factors affecting software performance can include the use of efficient coding techniques, other software development best practices, and SAS® system options.  Factors extrinsic to software that affect performance can include SAS configuration and infrastructure, SAS add-on modules, third-party software, and hardware and network infrastructure.  The variability in data processed by SAS software also heavily drives execution time, and these combined and commingled factors make it difficult to compare performance of one SAS environment to another.  Moreover, many SAS users may work in only one or a few SAS environments, giving them limited to no insight into how performance of their SAS environment compares to other SAS environments.  The SAS Performance Assessment (SPA) project, launched at SAS Global Forum in 2016, examines FULLSTIMER performance metrics from diverse organizations with equally diverse infrastructures.  By running standardized software that manipulates standardized data sets, the relative performance of unique environments can for the first time be compared.  Moreover, as the number and variability of SPA participants continues to increase, the role that individual extrinsic factors play in software performance will continue to be disentangled and better understood, enabling SAS users not only to identify how their SAS environment compares to other environments, but also to identify specific modifications that could be implemented to increase performance levels.


Your Local Fire Engine Has an Apparatus Inventory Sheet and So Should Your Software: Automatically Generating Software Use and Reuse Libraries and Catalogs from Standardized SAS® Code
Troy Hughes

Fire and rescue services are required to maintain inventory sheets that describe the specific tools, devices, and other equipment located on each emergency vehicle.  From the location of fire extinguishers to the make, model, and location of power tools, inventory sheets ensure that firefighters and rescue personnel know exactly where to find equipment during an emergency, when restocking an apparatus, or when auditing an apparatus inventory.  At the department level, inventory sheets can also facilitate immediate identification of equipment in the event of a product recall or the need to upgrade to newer equipment.  Software should be similarly monitored within a production environment, first and foremost to describe and organize code modules, typically SAS® macros, so they can be discovered and located when needed.  When code is reused throughout an organization, a reuse library and reuse catalog should be established that demonstrate where reuse occurs and ensure that only the most recent, tested, and validated versions of code modules are reused.  This text introduces Python code that automatically parses a directory structure, parses all SAS program files therein (including SAS programs and SAS Enterprise Guide project files), and automatically builds reuse libraries and reuse catalogs from standardized comments within code.  Reuse libraries and reuse catalogs not only encourage code reuse but also facilitate backward compatibility when modules must be modified, because all implementations of specific modules are identified and tracked.


Tales from the Help Desk 7: Solutions to Common SAS® Tasks
Bruce Gilsen

In 30 years as a SAS® consultant at the Federal Reserve Board, questions about some common SAS tasks seem to surface again and again.  This paper collects some of these common questions and provides code to resolve them.

The following tasks are reviewed.
  1. Save and restore SAS option values.
  2. Pad multiple non-continuous time series with missing values to make continuous time series.
  3. Read a SAS data set backwards (last observation to first); see the sketch below.
  4. Create a data set containing the last 5 observations of an existing data set.
  5. Drop the last n observations in each BY group.
  6. Write data to multiple external files in a DATA step, determining file names dynamically from data values.
  7. Compare observations in the same data set.
In the context of discussing these tasks, the paper provides details about SAS system processing that can help users employ the SAS system more effectively.  This paper is the seventh of its type; see the references for six previous papers.
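For instance, task 3 can be handled with the POINT= option, reading observations by observation number from last to first; SASHELP.CLASS stands in here for any data set:

data backwards;
   do _p = nobs to 1 by -1;
      set sashelp.class point=_p nobs=nobs;
      output;
   end;
   stop;      /* POINT= access never reaches end-of-file, so stop the step explicitly */
run;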


Downloading, Configuring, and Using the Free SAS® University Edition Software
Kirk Paul Lafler

The announcement of SAS Institute's free SAS University Edition is an exciting development for SAS users and learners around the world!  The software bundle includes Base SAS, SAS/STAT, SAS/IML, SAS Studio (user interface), and SAS/ACCESS for Windows, with all the popular features found in the licensed SAS versions.  This is an incredible opportunity for users, statisticians, data analysts, scientists, programmers, students, and academics everywhere to use (and learn) SAS for career opportunities and advancement.  Capabilities include data manipulation, data management, a comprehensive programming language, powerful analytics, high-quality graphics, world-renowned statistical analysis capabilities, and many other exciting features.

This presentation discusses and illustrates the process of downloading and configuring the SAS University Edition.  Additional topics include the process of downloading the required applications, key configuration strategies to run the SAS University Edition on your computer, and the demonstration of a few powerful features found in this exciting software bundle.


How To Win Friends and Influence People: A Programmer's Perspective on Effective Human Relationships
Priscilla Gathoni

Dealing with people has become a task and an art that every person has to master in the work and home environment.  This paper explores 10 different ways that a programmer can use to win friends and influence people.  It displays the steps leading to a positive, warm, and enthusiastic balanced work and life environment.  The ability to think and to do things in their order of importance is a key ingredient for successful career growth.  Programmers who want to grow beyond just programming should enhance their people skills in order to move up to the management level.  However, for this to be a reality, a programmer must have good technical skills, possess the ability to arouse enthusiasm among peers, and be able to assume leadership.  It is the programmer who embraces non-judgment, non-resistance, and non-attachment as core mantras who will succeed in the complex and fast-paced work environment that we are in.  Avoiding arguments, being a good listener, respecting the other person's point of view, and recalling people's names will increase your earning power and ability to influence people to your way of thinking.  The ability to enjoy your work, be friendly, and be enthusiastic tends to bring you goodwill.  This eventually leads to creating good relationships in the office and the power to influence those around you in a positive way.


Divide and Conquer: Writing Parallel SAS® Code to Speed Up Your SAS Program
Doug Haigh

Being able to split SAS® processing over multiple SAS processors on a single machine, or over multiple machines running SAS, as in the case of SAS® Grid Manager, enables you to get more done in less time.  This paper looks at the methods of using SAS/CONNECT® to process SAS code in parallel, including the SAS statements, macros, and PROCs available to make this processing easier for the SAS programmer.  SAS products that automatically generate parallel code are also highlighted.
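A skeletal illustration of the SAS/CONNECT approach on a single machine (session names, data set names, and the SASCMD value are placeholders): spawn two additional SAS sessions, run a step in each asynchronously, then wait for both to finish.

options autosignon sascmd='!sascmd';

rsubmit task1 wait=no inheritlib=(work=locwork);
   proc sort data=locwork.claims_2015 out=locwork.claims_2015_s; by id; run;
endrsubmit;

rsubmit task2 wait=no inheritlib=(work=locwork);
   proc sort data=locwork.claims_2016 out=locwork.claims_2016_s; by id; run;
endrsubmit;

waitfor _all_ task1 task2;    /* block until both remote sessions have finished */
signoff _all_;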



Reporting/Visualization/JMP

Proc Report, the Graph Template Language, and ODS Layouts: Used in Unison to Create Dynamic, Customer-Ready PowerPoints
Amber Carlson, Amelia Stein and Xiaobin Zhou

Twice a year, we create PowerPoint decks and supplemental tables for over 100 customers to present data on their system performance to help inform their decision-making.  We use one SAS program to create PowerPoint slides that incorporate the corporate template and include dynamic, editable tables, charts, titles, footnotes, and embedded hyperlinks that open additional drill-down data tables in either PDF or Excel format.  These additional data tables are saved in automatically created, categorized folders.

In this SAS program we first employ SAS styles, ODS Layout, and ODS PowerPoint to format the slides and automate creation.  Macros and X Command are also utilized to create categorized folders for organization.  Finally, the Graph Template Language, Proc Report, and ODS PDF are utilized to create the customer-specific charts and tables for the main deck and the supplemental tables that are linked by hyperlinks on the corresponding slides of the PowerPoint.  This program starts from the raw data source and the output is a complete customer deck that is ready for presentation.

In this paper we share examples of how to create a completely customized PowerPoint deck using SAS styles and ODS Layout.  We also share tips and tricks that we have learned regarding what works and what does not in the ODS PowerPoint destination.  In addition, we demonstrate the program flow to highlight each type of functionality required to create a multi-level custom report.
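
As a rough illustration of the combination described above (the file name, style, and data are placeholders, not the authors' production program):

ods powerpoint file="customer_deck.pptx" style=PowerPointLight;
ods layout gridded columns=2;
ods region;
   proc print data=sashelp.class(obs=5) noobs; run;
ods region;
   proc sgplot data=sashelp.class;
      vbar age;
   run;
ods layout end;
ods powerpoint close;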


Building Interactive Microsoft Excel Worksheets with SAS® Office Analytics
Tim Beese

Microsoft Office has over 1 billion users worldwide, making it one of the most successful pieces of software on the market today.  Imagine combining the familiarity and functionality of Microsoft Office with the power of SAS® to include SAS content in a Microsoft Office document.  By using SAS® Office Analytics, you can create Microsoft Excel worksheets that are not just static reports, but interactive documents.  This paper looks at opening, filtering, and editing data in an Excel worksheet.  It shows how to create an interactive experience in Excel by leveraging Visual Basic for Applications using SAS data and stored processes.  Finally, this paper shows how to open SAS® Visual Analytics reports in Excel, so that the interactive benefits of SAS Visual Analytics are combined with the familiar interface of an Excel worksheet.  All of these interactions with SAS content are possible without leaving Microsoft Excel.


Make the jump from Business User to Data Analyst in SAS® Visual Analytics
Ryan Kumpfmiller

SAS® Visual Analytics is effective in empowering the business user with the skills to build reports and dashboards.  The tool is easy to use and navigate, but it also has capabilities that go beyond just presenting data.  There are additional data analysis features, such as forecasting, fit lines, and correlations, which can give those business users better insight into their data.  This paper explains what each of these features is, how to interpret them, and which objects they are used with in SAS® Visual Analytics.


Success Takes Balance, Don't Fall Over With Your SAS® Visual Analytics Implementation
Ryan Kumpfmiller and Craig Willis

When deploying SAS Visual Analytics, companies want to set up a system that will be effective in supporting their organization.  When it comes to building anything, it is key to set a solid foundation in the most important areas.  In a SAS Visual Analytics implementation, those are technology, people, culture, and process.  In this paper, you will learn how to structure those areas so that you can put your system in a position to succeed.


Annotating the ODS Graphics Way!
Dan Heath

For some users, having an annotation facility is an integral part of creating polished graphics for their work.  To meet that need, we created a new annotation facility for the ODS Graphics procedures in SAS® 9.3.  Now, with SAS® 9.4, the Graph Template Language (GTL) supports annotation as well!  In fact, the GTL annotation facility has some unique features not available in the ODS Graphics procedures, such as using multiple sets of annotation in the same graph and the ability to bind annotation to a particular cell in the graph.  This presentation covers some basic concepts of annotating that are common to both GTL and the ODS Graphics procedures.  I apply those concepts to demonstrate some unique abilities of GTL annotation.  Come see annotation in action!
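
For readers new to the facility, a minimal SGPLOT example of the annotation data set approach (the annotation text, positions, and plot are illustrative only):

data anno;
   length function $10 label $40 drawspace $15;
   function='text'; label='Outlier noted here';
   x1=70; y1=90; drawspace='graphpercent'; textsize=10;
   output;
run;

proc sgplot data=sashelp.class sganno=anno;
   scatter x=height y=weight;
run;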


Mapping Roanoke Island: from 1585 to present
Barbara Okerson

One of the first maps of the present United States was John White's 1585 map of the Albemarle Sound and Roanoke Island, the site of the Lost Colony and of my present home.  This presentation looks at advances in mapping through the ages, from the early surveys and hand-painted maps, through lithographic and photochemical processes, to digitization and computerization.  Inherent difficulties in including small pieces of coastal land - often removed from map boundary files and data sets to smooth a boundary - are also discussed.  The presentation concludes with several current maps of Roanoke Island created with SAS®.


Creating a Publication Quality Graphic with SAS®
Charlotte Baker

Graphics are an excellent way to display results from multiple statistical analyses and get a visual message across to the correct audience.  The combination of SG procedures, such as PROC SGPLOT, and ODS statements in SAS® allows for the creation of custom graphics that meet the expectations of scientific journals and are excellent for research presentations or handouts.  While these are excellent tools, first-time users may experience difficulties when attempting to utilize them.  This paper will describe two methods for creating a publication quality graphic in SAS® 9.4 and, more specifically, solutions for some issues encountered when doing so.
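
One common pattern for journal-ready output is shown in the sketch below (the output path, DPI, and plot are illustrative assumptions, not the paper's specific solutions):

ods listing gpath="/project/figures" image_dpi=300;
ods graphics / reset width=5in height=4in imagename="figure1" imagefmt=tiff;

proc sgplot data=sashelp.heart;
   vbox cholesterol / category=bp_status;
   yaxis label="Cholesterol (mg/dL)";
run;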


SAS® Formats: Effective and Efficient
Harry Droogendyk

SAS® formats, whether they are the vanilla variety supplied with the SAS system or fancy ones you create yourself, will increase your coding and program efficiency.  (In)Formats can be used effectively for data conversion, data presentation, and data summarization, resulting in efficient, data-driven code that's less work to maintain.  Creation and use of user-defined formats, including picture formats, are also covered in this paper.
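
A small sketch of the kinds of user-defined and picture formats discussed (the format names and ranges are illustrative):

proc format;
   value agegrp  low-<13 = 'Child'
                 13-19   = 'Teen'
                 20-high = 'Adult';
   picture pctfmt low-high = '009.9%';
run;

proc freq data=sashelp.class;
   tables age;
   format age agegrp.;
run;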


A Real World Example: Using the SAS® ODS Report Writing Interface to revamp South Carolina's School District Special Education Data Profiles
Fred Edora

Creating complex reports can be a daunting task, especially when using multiple data sources.  The SAS ODS Report Writing Interface can provide a significant amount of flexibility for your reports.  Prior to the use of SAS, the South Carolina Department of Education's 88 special education district data profiles were created using Microsoft Excel and Word and were required to be produced annually.  With limited staff, this process took weeks and was not flexible enough to meet changing data needs.  This paper will illustrate how the SAS ODS Report Writing Interface helped revamp these annual district reports to help staff save time and effort while also including additional features and data (such as conditional processing and color coding) to give the reports a more professional appearance.


UCF Stored Process Conversion for Current STEM Retention Reports
Carlos Piemonti and Geoffrey Wical

At the University of Central Florida, in order to reduce the proliferation of similar reports in our SAS® Information Delivery Portal, we have been tasked with generating multiple different reports from one stored process.  Users are prompted to select multiple criteria that determine which report is output, as opposed to having multiple stored processes and running them separately.


CMISS, the SAS® Function You May Have Been MISSING
Mira Shapiro

Those of us who have been using SAS for more than a few years often rely on our tried-and-true techniques for standard operations like assessing missing values.  Even though the old techniques still work, we often miss some of the new functionality added to SAS that would make our lives much easier.  In an effort to ascertain how many people skipped questions on a survey and what percentage of people answered each question, I searched past conference papers and came across a function that was introduced in SAS 9.2 -- CMISS.  By using a combination of CMISS and PROC TRANSPOSE, a full missing-data assessment can be done in a concise program.  This paper will demonstrate how CMISS makes assessing survey completeness an easy task.
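
A minimal sketch of the idea, assuming a survey data set with items Q1-Q20 (the data set and variable names are hypothetical):

data completeness;
   set survey;
   n_missing = cmiss(of q1-q20);   /* count of unanswered items per respondent */
run;

proc freq data=completeness;
   tables n_missing;
run;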


Time-to-Degree Issue Solution using Academic Analytics
Sivakumar Jaganathan, Thanuja Sakruti and Abhishek Uppalapati

Analyzing academic data and arriving at administrative decisions is a crucial process at any university, and it is made easier by using SAS® Business Intelligence to create informative applications.  Excessive time-to-degree is a worsening problem causing losses to both students and universities.  Studying trends helps identify the factors that could decrease time-to-degree, which in turn helps the institution advance.

Our main objective is to analyze the data pertaining to the time-to-degree trend and assess the factors that constitute this trend.  These reports are developed in SAS® Visual Analytics, with the dataset extraction and processing done in SAS® Enterprise Guide.  The data can be viewed by department to provide a bird's-eye view of time-to-degree trends in different fields.  A further objective is to provide analytical insight into time-to-degree at UConn in order to avoid or reduce the losses it incurs: added direct costs to students, reduced ROI and increased debt, new admission seats lost because students occupy them longer, and a poor success rate that damages the university's reputation and future revenue.  Factors can be further studied using hypothesis tests and various statistical models to identify the relation between time-to-degree and variables obtained from Academic Analytics datasets.  Maintaining research standards is of utmost importance at a university, and the key quality metrics will be monitored using Academic Analytics.

Higher education institutions need reliable and consistent quantitative and qualitative information on productivity and accountability.  Efficient research and analysis using visualization and modeling tools supports planning and crucial decisions.


Enhanced Swimmer Plots: Tell More Sophisticated Graphic Stories in Oncology Studies
Ilya Krivelevich, Andrea Dobrindt, Simon Lin and Xiaomin He

In oncology studies, investigators are often interested in the relationship among a subject's various evaluations, including treatment exposure, response timepoints, and the start and end of adverse events.  One way to achieve this is a swimmer plot, which shows multiple pieces of a subject's response "story" at a glance (Stacey 2014).  A traditional swimmer plot with a single-cell graph can be over-simplified and may not provide sufficient information to the investigators, because oncology studies often have more complicated scenarios, such as multiple therapies and dose titration during the study.  This paper proposes enhanced swimmer plots that extend the technique to more sophisticated cases, with two examples.  One example investigates the occurrence of adverse events during the course of a clinical trial, and the other shows a subject's tumor response status during a multi-therapy treatment phase of the study.  The paper provides detailed SAS code and statistical/clinical explanation at each step of creating the plots, so that readers can understand the "story" behind the study more thoroughly.
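
For orientation, a bare-bones swimmer plot skeleton using the HIGHLOW plot (the variable names are hypothetical; the paper's enhanced versions add considerably more detail):

proc sgplot data=swim;
   highlow y=subjid low=trt_start high=trt_end / type=bar barwidth=0.4;
   scatter y=subjid x=response_day / markerattrs=(symbol=circlefilled);
   xaxis label="Study Day";
   yaxis label="Subject";
run;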


Utilize SAS® 9 SGPLOT to Create Genome Wide Association Studies Plots
Huei-Ling Chen and Jialin Xu

In clinical studies, there is increasing research interest in the associations between genetic variants and quantitative or categorical traits of clinical interest, such as disease status or treatment response to a specific medication, especially as the cost of molecular technology drops dramatically.  For example, one may examine whether cancer patients with a certain genetic variant respond better to a particular oncology drug than cancer patients without that type of genetic variant.  A major methodology for testing such genetic associations is the Genome-Wide Association Study (GWAS).

A GWAS analysis relies on three main plots, used step by step: the Q-Q Plot, the Manhattan Plot, and the Regional Association Plot.  The Q-Q Plot assesses the distribution of the observed p-values across genetic variants.  The Manhattan Plot and the Regional Association Plot are used to identify which particular region of the genome is associated with the trait of clinical interest.  This paper introduces the plots and uses the SAS version 9 SGPLOT procedure to develop three macros, one for each of the three GWAS plots.
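
As a simple illustration of one of these displays, a skeletal Manhattan plot (the variable names and the reference line at -log10(5E-8), a conventional genome-wide significance threshold, are assumptions):

data gwas_plot;
   set gwas;
   neglog10p = -log10(pvalue);
run;

proc sgplot data=gwas_plot;
   scatter x=position y=neglog10p / group=chromosome markerattrs=(size=4);
   refline 7.3 / axis=y lineattrs=(pattern=dash);   /* approx. -log10(5E-8) */
   yaxis label="-log10(p-value)";
run;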


Exploring JMP®'s Image Visualization Tools in Medical Diagnostic Applications
Melvin Alexander

Among JMP®'s powerful, graphical functionalities is the ability to open, display, and analyze data from images.  This capability converts pixel values from images into data tables and matrices.  Once data are in data tables, additional analyses and data visualizations may be performed.

This presentation will demonstrate these capabilities using examples from medical computed tomographic (CT) images that are examined for signs of specific injuries and medical conditions.  Results of the image analyses help radiologists and clinicians decide the medical treatment options to give to patients (e.g., surgery, non-operative management, targeted radiation-beam therapies).

Combining the statistical, graphical and data analytic functionality of JMP® extends information visualization beyond what can be seen with standard image interpretation.  Ways to replicate these capabilities in SAS® will also be discussed.


Diabetes Self-Management Education Services in Florida
Adetosoye Oladokun and Rahel Dawit

Identifying areas and populations without access to diabetes self-management education services is important to the health of Florida residents with diabetes.  This paper will demonstrate how Base SAS can be used as a tool for mapping health data in order to address issues such as this.  It will also discuss how to compare the results of the created maps.  The procedures utilized for this project include PROC MAPIMPORT, PROC CONTENTS, and PROC GMAP.
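
A minimal sketch of that workflow (the shapefile path, data set, and variable names are placeholders):

proc mapimport datafile="/gis/fl_counties.shp" out=work.flcounties;
run;

proc gmap data=diabetes_services map=work.flcounties;
   id county;
   choro provider_rate / levels=5;
run;
quit;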


Using JMP to Apply Decision Trees and Random Forests as Screening Tools for Limiting Candidate Predictors in Regression Models
Jason Brinkley

There are many techniques for evaluating candidate predictors in regression models.  Some, such as stepwise regression, have been well studied and have known limitations.  Others, such as shrinkage techniques (e.g., LASSO), have become increasingly popular and are showing true potential in helping to provide quality regression models for estimating effects in a multiple-variable environment.  Many of these techniques become difficult in the world of big data, especially if that data is long (many columns or variables) and short (fewer observations or rows).  Tree-based methods are a good alternative as a framework for data exploration and identification of variables to be used in building a quality regression model.  Indeed, tree methods provide an entirely different framework for model building that can oftentimes provide better predictions.  However, when the main goal is still effect estimation, tree methods can be a very useful screening tool.  This work examines the practice based on several examples using options commonly found in both JMP and JMP Pro software.  The focus is on using both single classification trees as well as so-called 'Random' or 'Bootstrap' Forests.


5 Secrets for Building Fierce Dashboards
Tricia Aanderud

Are your dashboards or web reports lifeless, unappealing, or ignored?  A fierce dashboard is not an accident; it is the result of careful planning, design knowledge, and the right data.  In this paper, you will learn the techniques professionals use for creating dashboards that are engaging, beautiful, and functional.  This paper uses SAS Visual Analytics as the example, but the tasks shown could be accomplished with other SAS tools.


Data Visualization Through 3-D Graphs Using SAS® Graph Template Language (GTL)
Venu Perla

Certain types of data are better visualized through 3-D graphs.  The SAS graph procedures available prior to SAS 9.2 are not user-friendly; they are hard to learn and require quite a bit of time to implement.  Since the future of graphs in SAS is centered on the Graph Template Language (GTL), the objective of this paper is to create 3-D graphs using SAS GTL.  This paper also explains how the initial SAS GTL code can be obtained, and how the STATGRAPH template (also known as the GTL template) and the SGRENDER procedure in the obtained code are modified with different options and GTL statements to create 3-D graphs, for the benefit of beginners.
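
For readers who want a starting point, a minimal GTL surface template rendered with PROC SGRENDER (SASHELP.LAKE is used here only as a convenient gridded example; the rotation and tilt values are arbitrary):

proc template;
   define statgraph surf3d;
      begingraph;
         entrytitle "3-D Surface with GTL";
         layout overlay3d / cube=false rotate=30 tilt=60;
            surfaceplotparm x=length y=width z=depth;
         endlayout;
      endgraph;
   end;
run;

proc sgrender data=sashelp.lake template=surf3d;
run;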


Color Speaks Louder than Words
Jaime Thompson

What if color-blind people saw real colors and it was everyone else who is color blind?  Imagine finishing a Rubik's cube and wondering why people find it so difficult, or relying on the position, rather than the color, of a traffic light to determine when to stop and go.  Color matters!  Color enhances our perspective and can change how we feel, but it also varies culturally.  In Western countries, red and white have symbolic meanings opposite to those in Eastern countries, which can send the wrong message if misused.  Color can fundamentally change a report; therefore, finding the right color palettes for data visualizations is essential.  In this paper, I will cover the significance of color and how to pick a palette for your next SAS Visual Analytics report.


Geospatial Analysis with SAS®
Mike Jadoo

Geospatial analysis yields some of the finest data visualization products today, drawing the maximum amount of information out of statistical data.  Join us on an adventure, whether you are a seasoned practitioner or an exploring novice, as we explore the world of heat maps.  An in-depth look at how to make choropleth (heat) maps in SAS® will be presented, covering the different types of maps that can be made, importing data, and the data structure needed to create a map.


Statistics/Data Analysis

Testing the Gateway Hypothesis from Waterpipe to Cigarette Smoking among Youth Using Dichotomous Grouped-Time Survival Analysis (DGTSA) with Shared frailty in SAS®
Rana Jaber

Dichotomous grouped-time survival analysis is a combination of the grouped-Cox model (D'Agostino et al., 1990), the discrete-time hazard model (Singer and Willett, 1993), and the dichotomous approach (Hedeker et al., 2000).  Items measured from wave 1 through wave 4 were used as time-dependent covariates linking the predictors to the risk of waterpipe smoking progression at the subsequent student interview.  This analysis allows for maximum data use, inclusion of time-dependent covariates, and relaxation of the proportional hazards assumption, and it takes into consideration the interval-censored nature of the data (i.e., the event occurred during a certain known interval, such as one year, but the exact time at which it occurred cannot be specified).  The aim of this paper is to provide a new method of analyzing panel data where the outcome is binary, with some explanation of the SAS® code.  Examples using the PHREG procedure are drawn from data that were recently published in the International Journal of Tuberculosis and Lung Disease (IJTLD).
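
As a rough orientation only (not the authors' exact model), a grouped-time Cox setup with a shared frailty might look like the following sketch; the data set, covariates, and clustering unit are hypothetical:

proc phreg data=waves;
   class school;
   model wave*progressed(0) = waterpipe_ever peer_smoking / ties=discrete;
   random school;   /* shared frailty for students clustered within schools */
run;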


Mixture Priors 101: Using SAS® to Obtain Powerful Frequentist Inferences with Bayesian Methods
Tyler Hicks and Jeffrey Kromrey

Using informed priors, background information about parameters can be included in a statistical analysis.  Standard frequentist procedures purposefully omit priors.  Bayesian methods can thus yield more powerful hypothesis tests with informed priors, even when judged by frequentist criteria.  However, a misspecified informed prior can wreak havoc on Bayesian inferences.  Mixture priors may be used to make Bayesian methods more robust to a possibly misspecified informed prior.  The purpose of this paper is to show how easy it is to specify mixture priors in PROC MCMC.  Researchers can thus use PROC MCMC to implement mixture priors in their own research and obtain more powerful and robust frequentist statistical inferences using Bayesian methods.  This paper provides a rationale for mixture priors, presents annotated PROC MCMC code, and concludes with an executed example.
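
A minimal sketch of a mixture prior specified through the GENERAL function in PROC MCMC (the data set, mixing weights, and hyperparameters are illustrative assumptions, not the paper's example):

proc mcmc data=trial nmc=20000 seed=1234 outpost=post;
   parms mu 0;
   /* 80% informative N(2, sd=1) component, 20% diffuse N(0, sd=10) component */
   lprior = log(0.8*pdf('normal', mu, 2, 1) + 0.2*pdf('normal', mu, 0, 10));
   prior mu ~ general(lprior);
   model y ~ normal(mu, var=4);
run;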


Finding the Area of a Polygon... On a Sphere!
Seth Hoffman

While a picture may be worth a thousand words, sometimes a few numbers are even better, especially if you want to do some statistics.  One of the basic ways to describe a region such as a state, county, or ZIP code is its area.  There are several methods one can use to find the area of such polygons - at least when they lie on a flat Cartesian plane.  This paper presents a method, and the math behind it, to calculate the area of a polygon drawn on a spherical surface, such as the Earth.
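
For context, the classical spherical-excess result that such a method builds on (stated here in LaTeX notation; the paper develops the full computation from the boundary coordinates): for a polygon with n geodesic edges and interior angles \theta_i (in radians) on a sphere of radius R,

A = R^2 \left( \sum_{i=1}^{n} \theta_i - (n-2)\pi \right)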


Factors of Multiple Fatalities in Car Crashes
Bill Bentley

Road safety is a major concern for all citizens of the United States.  According to the National Highway Traffic Safety Administration, 30,000 deaths are caused by automobile accidents annually.  Oftentimes, fatalities occur due to a number of factors such as driver carelessness, speed of operation, impairment due to alcohol or drugs, and road environment.  Some studies suggest that car crashes are solely due to driver factors, while other studies suggest car crashes are due to a combination of roadway and driver factors.  However, other factors not mentioned in previous studies may be contributing to automobile accident fatalities.  The objective of this project was to identify the significant factors that lead to multiple fatalities in the event of a car crash.


In the Pursuit of Balanced Groups: A SAS® Macro for an Adaptive Randomization Test with continuous covariates
Tyler Hicks and Jeffrey Kromrey

In adaptive randomized experiments, researchers can verify that random allocation yielded equivalent groups on continuous covariates before proceeding with the study.  Given a precise definition of balanced groups (e.g., Cohen's d < 0.5), researchers may plan to keep reshuffling participants until group balance is achieved.  However, reshuffling can invalidate the p-values of standard parametric tests, such as t-tests, F-tests, and χ²-tests.  This paper describes a non-parametric analog of the t-test of independent means, called an adaptive randomization test, which can preserve Type I error control in adaptive randomized experiments.  Although such tests are well established in mainstream statistics (Morgan & Rubin, 2015), they are not readily available in SAS, making them difficult to implement.  This paper provides an overview of adaptive randomization tests, presents a macro for performing such a test in Base SAS, and concludes with two executed examples of the macro.


Empowering Self-Service Capabilities with Agile Analytics
Tho Nguyen and Bob Matsey

Business and IT users struggle to know which version of the data is valid, where they should get the data, and how to combine and aggregate all the data sources to apply analytics and deliver results in a timely manner.  In addition, once they start trying to join and aggregate all the different types of data, the manual coding can become so complicated and tedious that it demands extra resources and processing and adds overhead to the system.  Enabling agile analytics in a data lab can alleviate many of these issues, increase productivity, and deliver an effective self-service environment for all users.  This self-service environment, using SAS® analytics in Teradata, has decreased the time needed to prepare the data and develop the statistical model, delivering results in minutes compared to days or even weeks.  This session discusses how you can enable agile analytics in a data lab, leverage SAS® Analytics in Teradata to increase performance, and learn how hundreds of organizations have adopted this concept to deliver self-service capabilities in a streamlined process.


Identifying Gaps in Time Series Data
Bruce Gilsen

Missing observations are a common issue with time series data.  For a small amount of data, you can print the data set and inspect it manually, but this is not realistic for large data sets, especially if there are multiple BY groups.

With the TIMEID procedure, which is part of SAS/ETS® software, you can easily check a data set to determine whether observations are missing, and print a simple report that shows the location and quantity of missing observations.

PROC TIMEID can be used with SAS® date or datetime variables, and for any frequency.  The examples in this paper use SAS dates with frequency WEEKDAY, which is common at the Federal Reserve Board.
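
A minimal call, assuming a daily price data set with one series per TICKER (the data set and variable names are hypothetical):

proc timeid data=prices;
   id date interval=weekday;   /* flag gaps in the weekday sequence */
   by ticker;
run;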


A Simple SAS® Macro to Perform Blinder-Oaxaca Decomposition
Taylor Lewis and Stanislas Ezoua

Blinder-Oaxaca decomposition is a straightforward statistical method that emerged in the econometrics literature as a way to explain differences observed between groups with respect to the mean of a continuous variable.  To give a few examples, the procedure has been used to investigate potential pay differentials between males and females, or whether health disparities exist amongst individuals of varying socioeconomic statuses.  In this paper, we provide background on the fundamental concepts and objectives behind Blinder-Oaxaca decomposition, and present a general-purpose macro that analysts interested in conducting the technique might find helpful.
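
For orientation, the standard threefold identity behind the technique, stated in LaTeX notation with group B as the reference (the macro's implementation details are in the paper):

\bar{Y}_A - \bar{Y}_B = \underbrace{(\bar{X}_A - \bar{X}_B)'\hat{\beta}_B}_{\text{endowments}} + \underbrace{\bar{X}_B'(\hat{\beta}_A - \hat{\beta}_B)}_{\text{coefficients}} + \underbrace{(\bar{X}_A - \bar{X}_B)'(\hat{\beta}_A - \hat{\beta}_B)}_{\text{interaction}}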


PROC LOGISTIC: Traps for the unwary
Peter Flom

This paper covers some gotchas in SAS PROC LOGISTIC.  A gotcha is a mistake that isn't obviously a mistake: the program runs, there may be a note or a warning, but no errors.  Output appears, but it's the wrong output.  This is not a primer on PROC LOGISTIC, much less on logistic regression.  There are many good books on logistic regression; one such is Hosmer and Lemeshow [2000].

Each section has several subsections.  First, I identify the gotcha.  Then I give an example.  Third, I show what evidence you have that it occurs.  Fourth, I show how to fix it in some cases, referring to other resources.  In some cases, I offer an explanation between the evidence and the solution.
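
One classic example of such a gotcha (not necessarily one of the paper's) is the default event level: PROC LOGISTIC models the probability of the lowest ordered response value, so a 0/1 outcome is modeled as P(y=0) unless you say otherwise.  The data set and predictors below are hypothetical:

proc logistic data=mydata;
   model y(event='1') = age income;   /* explicitly model P(y=1) */
run;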


Missing Data and Complex Sample Surveys Using SAS®: The Impact of Listwise Deletion vs. Multiple Imputation on Point and Interval Estimates when Data are MCAR and MAR
Anh Kellermann, DeAnn Trevathan and Jeffrey Kromrey

Social scientists from many fields use secondary data analysis of complex sample surveys to answer research questions and test hypotheses.  Despite great care taken to obtain the data needed, missing data are frequently found in such samples.  Even though missing data are a ubiquitous problem, the methodological literature has provided little guidance to inform the appropriate treatment of such missingness.  This Monte Carlo study used SAS to investigate the impact of missing data treatment (multiple imputation versus listwise deletion) when data are MCAR and MAR.  Using 10% to 70% missing data (along with complete-sample conditions as a reference point for interpretation of results), the research focused on the parameter estimates in multiple regression analysis of complex sample data.  Results are presented in terms of statistical bias in the parameter estimates and both confidence interval coverage and width.
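
As a rough sketch of the multiple-imputation arm of such a comparison (the variables, design variables, and number of imputations are illustrative assumptions):

proc mi data=survey nimpute=5 seed=2016 out=mi_out;
   var x1 x2 y;
run;

proc surveyreg data=mi_out;
   by _imputation_;
   cluster psu;
   weight wgt;
   model y = x1 x2;
   ods output ParameterEstimates=parms;
run;

proc mianalyze parms=parms;
   modeleffects Intercept x1 x2;
run;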


The HPSUMMARY® Procedure: A Younger (and Brawnier) Cousin to an Old SAS® Friend
Anh Kellermann and Jeffrey Kromrey

The HPSUMMARY procedure provides data summarization tools to compute basic descriptive statistics for variables in a SAS dataset.  It is a high-performance version of the SUMMARY procedure in Base SAS.  Although PROC SUMMARY is popular with data analysts, PROC HPSUMMARY is still the new kid on the block.  This paper provides an introduction to PROC HPSUMMARY by comparing it with its well-known counterpart, PROC SUMMARY.  General syntax differences, as well as performance in terms of processing time and memory utilization, were examined for the two procedures.  Simulated data of different sizes were used to observe the performance of the two procedures.  Experiment results indicate that there was no clear difference in real time between PROC SUMMARY and its high-performance counterpart.  The HP version utilized more memory but provided better memory management in a limited-memory environment than the legacy version did.  The impact of multi-core computing on HPSUMMARY's processing time for different data volumes was also examined.
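
A minimal flavor of the syntax (the data set and variables are hypothetical); the PERFORMANCE statement is what distinguishes the high-performance call:

proc hpsummary data=bigdata;
   class region;
   var revenue;
   output out=hp_stats mean=avg_revenue std=sd_revenue;
   performance nthreads=4 details;
run;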


A Macro for Calculating Kendall's Tau-b Correlation on Left-Censored Environmental Data
Dennis Beal

Kendall's tau-b correlation is a nonparametric method for calculating monotonic correlation between two variables with multiple censoring levels.  While SAS does calculate Kendall's tau-b as an option in the CORR procedure within SAS/BASE, that calculation assumes all values are known and uncensored.  It does not incorporate the censoring that is common with chemical or environmental data.  Environmental data often are reported from the analytical laboratory as left censored, meaning the actual concentration for a given contaminant was not detected above the method detection limit.  Therefore, the true concentration is known only to be between 0 and the reporting limit.  Kendall's tau-b can be used on left-censored data with multiple reporting limits with minimal assumptions.  This paper presents a SAS macro that calculates Kendall's tau-b by incorporating the additional ties that occur when comparing detected concentrations with non-detects.  This paper is for intermediate users of SAS/BASE.
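
For contrast, the uncensored calculation that the macro extends is a one-liner in PROC CORR (the variable names are hypothetical):

proc corr data=chem kendall;
   var concentration_a concentration_b;
run;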


A Data Mining Approach to Predict Student-at-risk
Youyou Zheng, Sivakumar Jaganathan, Thanuja Sakruti and Abhishek Uppalapati

Data mining is an analysis process used to obtain useful information from large data sets and unveil their hidden patterns (Mehmed 2003, Tan 2005).  It has been successfully applied in business areas like fraud detection and customer retention for decades.  With the increasing amount of educational data, educational data mining has become more and more important for uncovering the hidden patterns within institutional data, so as to support institutional decision making (Luan 2012).  However, only very limited studies have been done on educational data mining for institutional decision support.  The institutional researchers at Western Kentucky University built a model to help increase yield and retention at the university (Bogard 2013).  A researcher at the University of California also proposed applying data mining techniques in the college recruitment process to achieve enrollment goals (Chang 2009).  Both of these institutions used SAS Enterprise Miner as their data mining tool.  In this study, we use SAS Enterprise Miner to build a student-at-risk model.

At the University of Connecticut, organic chemistry is a required course for undergraduate students in a STEM discipline.  It has a very high DFW rate (D=Drop, F=Failure, W=Withdraw).  Taking Fall 2014 as an example, the average DFW rate is 24% at UConn, with over 1,200 students enrolled in this class.  In this study, Fall 2009-2013 student enrollment data are used to build the model, and Fall 2014 data are used to test it.  The SEMMA (Sample, Explore, Modify, Model, and Assess) method introduced by SAS Institute Inc. is applied to develop the predictive model.  Freshman SAT scores, campus, semester GPA, financial aid, first-generation status, and other factors are used to predict students' performance in this course.  In the predictive modeling process, several different modeling techniques (decision tree, neural network, ensemble models, and logistic regression) are compared with each other in order to find an optimal one for our institution.  The purpose of this study is to predict student success so as to improve the quality of education at our institution.


Surviving the Interim: Insights Into Interim Survival Analyses
Venita DePuy

Pre-specified interim analyses may be performed to evaluate whether a clinical trial can be halted prematurely for overwhelming efficacy and/or futility.  This approach is somewhat more complex when the primary endpoint is analyzed with survival methods.  This paper provides an overview of performing the initial calculations and the actual interim analyses using SAS's PROC SEQDESIGN and PROC SEQTEST, EAST software, and PASS software.
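
A minimal PROC SEQDESIGN call for a two-stage O'Brien-Fleming design (the reference alternative and error rates are illustrative; the paper also covers PROC SEQTEST and the survival-specific calculations):

proc seqdesign altref=0.70;
   TwoStageOBF: design nstages=2 method=obf alt=twosided stop=reject
                alpha=0.05 beta=0.10;
run;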


Designing and Analyzing Surveys with SAS/STAT® Software
Pushpal Mukhopadhyay

Designing probability-based sample surveys usually requires the use of strategies such as stratification, clustering, and unequal weighting.  Analyzing the resulting data requires specialized techniques that take these strategies into account in order to produce statistically valid inferences, and this in turn requires specialized software.  This tutorial shows you how to use the SAS/STAT software specifically designed for selecting and analyzing probability samples of survey data.  The tutorial also discusses the characteristics of different variance estimation techniques, including both Taylor series and replication methods.

The course is intended for a broad audience of statisticians who are interested in analyzing sample survey data.  Familiarity with basic statistics, including regression analysis, is strongly recommended.
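
For example, a design-based analysis with stratification, clustering, and weights might look like the following sketch (the data set and variable names are hypothetical):

proc surveymeans data=hhsurvey mean stderr;
   stratum region;
   cluster psu;
   weight samplingweight;
   var income;
run;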


A prediction model for which country will win the highest number of Golds in 2016 Olympics
Antarlina Sen and Gaurang Margaj

The Olympic Games were started to establish healthy friendship between countries.  Given their history and value, http://www.olympic.org/, www.iaaf.org, http://www.rio2016.com, and other websites contain records of 300+ events in which 10,000+ athletes (men and women) from 204 countries compete.  These data are available from 1896 to 2012.  The goal of this paper is to predict which country will score the maximum number of Golds in the Rio Olympics of 2016, using past trends of countries' performances and the statistics of players listed for 2016.  Before building this prediction model, exploration and descriptive analytics will be carried out to answer questions such as the following.  Does the current form of a player impact his/her performance in the Olympics?  Has a specific country performed better than the others in the last two years in a particular sport?  If a player wins one medal in one Olympics, what are the chances he/she will win a Gold in the next one?  Does a more experienced player have a better chance of winning?  Insights generated from answering these questions will be used to build the predictive model.


Analyzing non-normal data with categorical response variables
Niloofar Ramezani and Ali Ramezani

In many applications, the response variable is not normally distributed.  If the response variable is categorical with two or more possible responses, it makes no sense to model the outcome as normal.  When dealing with categorical outcome variables, the relationship between the outcome and the predictors is no longer linear, so models more advanced than general linear models are needed to model this relationship appropriately.  Binary, multinomial, and ordinal logistic regression models are examples of robust predictive methods for modeling the relationship between a non-normal discrete response and the predictors.

This study looks at several methods of modeling binary, categorical, and ordinal correlated response variables within regression models.  Starting with the simplest case of binary responses and moving through ordinal response variables, this study discusses different modeling options within SAS.  At the end, some missing data handling techniques are suggested to appropriately account for the high percentages of missing observations that occur frequently in practice when fitting these models.

Various statistical techniques, such as logistic and probit models, generalized linear mixed models (GLMM), log-linear models, and generalized estimating equations (GEE), are among the models discussed and applied to real data with categorical outcome variables in this study.  This paper discusses the different options within SAS 9.4 for the aforementioned models.  These procedures include PROC LOGISTIC, PROC PROBIT, PROC GENMOD, PROC GLIMMIX, PROC NLMIXED, PROC MIXED, PROC CATMOD, and PROC GEE.
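
As one small illustration from this list, a GEE model for a correlated binary outcome fit with PROC GENMOD (the data set and variables are hypothetical; PROC GEE accepts very similar syntax):

proc genmod data=longit descending;
   class id treatment;
   model outcome = treatment visit / dist=bin link=logit;
   repeated subject=id / type=exch;
run;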


Statistical Model Building for Large, Complex Data: Five New Directions in SAS/STAT® Software
Robert Rodriguez

The increasing size and complexity of data encountered in business analytics require analysts to apply a growing set of tools for building statistical models.  In response to this need, SAS/STAT software continues to add new methods.  This presentation takes you on a high-level tour of four major enhancements: new effect selection methods for ordinary regression models in the GLMSELECT procedure; model selection for generalized linear models with the HPGENSELECT procedure; model selection for quantile regression with the HPQUANTSELECT procedure; and construction of generalized additive models with the GAMPL procedure.  For each of these approaches, the presentation explains the concepts, illustrates the benefits with a basic example, and guides you to information that will help you get started.
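
For instance, a LASSO selection with a cross-validated choice of the final model in PROC GLMSELECT looks like this minimal sketch (the data set and variables are hypothetical):

proc glmselect data=train plots=coefficients;
   model y = x1-x50 / selection=lasso(choose=cv stop=none);
run;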