Preface
Financial research relies extensively on data. Mastering a statistical tool capable of handling huge quantities of financial data is a necessary skill for every financial empiricist. There are many such tools, for example, Matlab, Stata, R, SQL, and Python. As a veteran in the field of empirical financial research, I have been using SAS for quite a long time. Over these long years of experience, I have gradually come to appreciate SAS's powerful yet smart ability to help me navigate the ocean of financial data. I should admit that I have very limited, though not zero, knowledge of other statistical software. Nevertheless, I still feel obliged to compare SAS with other software in the context of financial research. The most distinctive advantage of SAS over other software is its ability to handle big data. This advantage is all the more meaningful when it comes to financial data. Let me explain why.
Big data analysis has become trendy during the past five to ten years, largely due to the rapid development of information technology. To analyze data, we need to find the data in the first place. In this regard, financial data have long been collected, compiled, and distributed in a systematic, thorough, and scientific way. Most financial data originate from exchanges and from companies' legally required periodic reports. In many countries, financial information follows a uniform format. These features make financial data especially easy to collect and convert into commercial databases. Many providers offer such databases, such as WRDS, Thomson Reuters, CSMAR, and WIND. The Nobel laureate Professor Eugene Fama once mentioned that he had been using the WRDS database to conduct research since the 1970s. In this sense, big data has been in place in financial research for about half a century, long ahead of the recent boom in big data analysis. To some degree, financial data define the research topics, methodologies, and even sub-disciplines of today's financial research.
In my personal observation, researchers' choice of statistical software in business and economics schools varies from school to school. Interestingly, those in finance departments are more likely to choose SAS, while those in economics and econometrics are more likely to choose Stata. This pattern has a reason. SAS treats data as a table, which it stores and processes line by line. Therefore, SAS can in theory deal with any number of lines, although processing necessarily takes a long time once the data become prohibitively large. In contrast, Stata treats data as a matrix, which requires, at least in theory, reading all the data into the computer's memory before processing starts. As a result, Stata's ability to handle data is only as powerful as the computer's memory size. However, Stata's treatment of data as a matrix has a deeper rationale: most modern econometric models are expressed in matrices, so processing data in matrix form is more natural in the context of econometric research. Partly for this reason, when my students ask me why I choose SAS over Stata, I often answer, half seriously and half jokingly: "That is because I am not an econometrician."
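The contrast between line-by-line and all-in-memory processing can be illustrated with a minimal sketch. It is written in Python purely for illustration and makes no claim about how SAS or Stata are implemented internally; the function names, the column name price, and the toy data are all hypothetical.

```python
import csv
import io


def streaming_mean(lines, column):
    """Average one column while holding only a single row in memory."""
    total = 0.0
    count = 0
    for row in csv.DictReader(lines):  # rows are read one at a time
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


def in_memory_mean(lines, column):
    """Average one column after loading every row into a list first."""
    rows = list(csv.DictReader(lines))  # memory use grows with the row count
    values = [float(r[column]) for r in rows]
    return sum(values) / len(values) if values else float("nan")


# Hypothetical toy data standing in for a large price file.
SAMPLE = "date,price\n2020-01-02,10.0\n2020-01-03,11.0\n2020-01-06,12.0\n"

if __name__ == "__main__":
    print(streaming_mean(io.StringIO(SAMPLE), "price"))
    print(in_memory_mean(io.StringIO(SAMPLE), "price"))
```

Both functions return the same average. The difference is that the first never needs more memory than one row, so the number of lines it can handle is bounded by time and not by memory size.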
I can add another interesting observation to corroborate my argument that econometricians and those with strong econometric backgrounds tend to choose Stata. A similar pattern shows up among students whose econometrics courses are taught by professors with different backgrounds. Empirical research methods have become compulsory courses in many business, economics, and finance programs at the undergraduate, graduate, and doctoral levels in many universities. However, the professors teaching econometrics have different backgrounds. In many schools, it is an econometrics professor who teaches students the concepts and methodologies of empirical research, regardless of whether the students major in econometrics. Needless to say, econometricians teaching econometrics can provide students with the most advanced, thorough, and rigorous knowledge of the field. But when it comes to applying empirical research methods to a specific economics or finance question, researchers in that particular field typically prefer certain econometric methods. Top-tier economics and finance journals often reject papers whose main contribution is merely to apply a better econometric method to an old research question. In other words, high-quality economics and finance research puts more weight on ideas than on econometric techniques. Therefore, a new norm has gradually emerged in which the econometrics course is taught by an economics or finance professor who does not specialize in econometrics, statistics, or math, but in a specific research area. Over the past years, I have taught and worked with many students. It seems that those who learn econometrics from professors specializing in econometrics tend to choose Stata, while those who learn econometrics from professors specializing in finance tend to choose SAS.
Given that SAS does not treat data as matrices, it seems to lose to Stata in terms of promptly incorporating the newest statistical methodologies into the software. On the flip side, the nearly unlimited ability to process any number of lines of data makes SAS more suitable for financial research and, in certain studies, the only choice. Those who have no experience in handling financial data may not fully appreciate how massive financial data are. Let me put it in perspective. Compared to the U.S. financial market, the Chinese financial market has a very short history. We have just celebrated the thirtieth anniversary of the Chinese stock market as I finish writing this book. Yet there are already more than 47 million trading-day observations for all Chinese bonds, and more than 11 million trading-day observations for all Chinese stocks. For microstructure (tick-by-tick) data, the number of observations is thousands of times larger than for daily data. The exceedingly large size of financial data poses a series of challenges. For example, sorting data is a routine process, yet most personal computers will have difficulty sorting a data set with hundreds of millions of observations by several variables. Matching data is another simple yet challenging task for big data. The typical way of matching data is to first form a Cartesian product and then eliminate the pairs that do not match. A Cartesian product of two tables with x and y lines generates a temporary table with x times y lines. Imagine how daunting the task could be if you are trying to match two tables that each have 10 million lines. In these scenarios, we need to find smarter ways to conduct the data processing. Fortunately, SAS provides many handy tools for this purpose.
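To make the cost of Cartesian-product matching concrete, here is a minimal sketch in Python (used only for illustration; it is not SAS code and not necessarily the approach taken later in this book). It contrasts the match-by-Cartesian-product idea described above with one common "smarter" alternative, a lookup table keyed on the matching variable. The function names, the field name code, and the toy records are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def cartesian_match(left, right, key):
    """Examine every (left, right) pair and keep those whose keys agree.

    The number of pairs examined is len(left) * len(right).
    """
    return [(l, r) for l, r in product(left, right) if l[key] == r[key]]


def hash_match(left, right, key):
    """Index the right table by key once, then look up each left row.

    The work is roughly len(left) + len(right) plus the number of matches.
    """
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [(l, r) for l in left for r in index.get(l[key], [])]


# Hypothetical toy tables: returns on one side, firm size on the other.
returns = [{"code": "A", "ret": 0.01}, {"code": "B", "ret": -0.02},
           {"code": "C", "ret": 0.00}]
sizes = [{"code": "B", "size": 500}, {"code": "A", "size": 120},
         {"code": "D", "size": 75}]

if __name__ == "__main__":
    print(cartesian_match(returns, sizes, "code"))
    print(hash_match(returns, sizes, "code"))
```

Both functions return the same matched pairs. With two tables of 10 million lines each, however, the first would examine 10^14 pairs, while the second touches each line only about once.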
This book summarizes my experience in using SAS to conduct financial research on big data. One of my research areas is market microstructure, which studies tick-by-tick data on trades and quotes. As mentioned above, market microstructure data are especially large and hence more demanding for researchers. I often wait a whole day to get one result. Although often frustrated and disappointed by the tedious calculation process, I have learned a lot from these studies. Most of all, I have gradually mastered the skills that enable me to decipher the regularities hidden in the tremendous amount of data. There are many books discussing how to use SAS, and I do not intend to add to this long list. This book focuses instead on how to use SAS to conduct sophisticated financial research, especially in the context of big data. I assume that readers already understand the basics of SAS. The topics of this book mainly cover advanced research and coding issues that are seldom discussed in general-purpose and introductory SAS books. I hope the experience shared in this book can help you conduct high-quality financial research.
Han Yan was supported by the National Natural Science Foundation of China under grant number 71772013.