Software issues

Software is a major component of the analysis of data for new product development. Some issues with software are provided in the following subsections.

Downplaying spreadsheets

The days are long past when spreadsheets were the sole tool for data analysis, whether for new product development or not. In fact, they never had the capabilities required for the extensive data analysis and the statistical and econometric modeling that are now commonly used. Spreadsheets have well-known and well-documented major issues. Seven reasons for not using spreadsheets for Business Intelligence and Business Analytics are:

  1. They are not database managers.
  2. They were not designed to handle large data sets (i.e., “Big Data”).
  3. They make it difficult to identify data.
  4. The data are often spread across several worksheets in a workbook.
  5. They are inadequate for complex data structures such as panel data.
  6. They lack:
    • data wrangling operations: joining, splitting, and stacking;
    • programming capabilities, except Visual Basic for Applications (VBA), which is not a statistical programming language; and
    • sophisticated statistical operations beyond arithmetic operations and simple regression analysis.
  7. Data visualization is limited: pie/bar charts and non-scientific visuals. Visualization is more infographic than scientific.
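
The data wrangling operations named in item 6 (joining, splitting, and stacking) can be illustrated with a short Python sketch using the Pandas package, which is discussed in the Python subsection. The data, column names, and values are hypothetical and made up purely for illustration; this is one possible way to express these operations, not the only one.

```python
import pandas as pd

# Hypothetical panel data: two products observed over two quarters.
sales = pd.DataFrame({
    "product": ["A", "A", "B", "B"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "units": [100, 120, 80, 95],
})
prices = pd.DataFrame({
    "product": ["A", "B"],
    "price": [9.99, 14.50],
})

# Joining: attach each product's price to its sales rows.
joined = sales.merge(prices, on="product", how="left")

# Splitting: break the data into one table per product.
by_product = {name: group for name, group in joined.groupby("product")}

# Stacking: start from a wide layout (one column per quarter) and
# return to a long layout (one row per product-quarter pair).
wide = sales.pivot(index="product", columns="quarter", values="units")
long = wide.stack().rename("units").reset_index()

print(joined)
print(long)
```

Each operation is a single line of code here, whereas in a spreadsheet the same steps typically require manual copying or lookup formulas spread across worksheets.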

Other, more powerful software is required. I classify these software packages into Open Source and Commercial. The former are generally free while the latter are not. In addition, my brief comments on software are restricted to statistical and econometric packages and general programming languages.

Open source software

There are five open source software packages I will mention here because they are either the major ones available or have the potential to become major players. These are R, Python, Julia, Stan, and KNIME.

R

R is the de facto standard for statistical analysis, although it does have issues that hinder its use for Big Data problems. The major problem is its memory handling: all data and objects created in R (e.g., functions) are stored in memory. R has no problems with small data sets that can easily be stored in memory, but its memory management becomes an issue with large data sets.

Another issue is its steep learning curve. It is not an easy language to learn. Finally, everything you want to do requires programming. If you are not a sophisticated analyst with programming capabilities, then R will prove a challenge. Aside from these issues, R is strong in the statistics domain and will remain in that position for some time to come.

An advantage of R is its package structure. Aside from the base R, which has many great statistical and graphing capabilities, R relies on user developed and contributed packages for all its major statistical and modeling functions. There is an extensive collection of packages that cover areas such as:

  • Bayesian Inference
  • Cluster Analysis and Finite Mixture Models
  • Probability Distributions
  • Econometrics
  • Design of Experiments (DOE) and Analysis of Experimental Data
  • Graphic Displays, Dynamic Graphics, Graphic Devices, and Visualization
  • High-Performance and Parallel Computing
  • Machine Learning and Statistical Learning
  • Multivariate Statistics
  • Natural Language Processing

to mention a few. For a complete list of topical areas, see https://cran.r-project.org/web/views/. For an extensive listing of packages, see https://cran.r-project.org/web/packages/available_packages_by_date.html. These packages give R an advantage but one which also comes at a cost. The advantage is that new packages with state-of-the-art methods and cutting-edge techniques can be easily and quickly made available. There is a wide and diverse community of developers who contribute these packages so a user can be almost guaranteed a needed capability will be found in R.

The cost associated with the package paradigm is threefold. First, not all developers are highly skilled programmers, and a function may not be programmed correctly. Although the source code is openly available for anyone to examine and audit, the sheer volume of packages and their complexities may make this impossible, or at least a daunting challenge. Second, there is no guarantee that the person or persons who developed a package will continue to maintain it. This can become problematic if a package is relied upon but the developer simply disappears and no one steps forward to become a new maintainer. Third, some packages have functions with similar capabilities but different syntax, which makes their use confusing.

R can be downloaded at www.r-project.org/.

Python

Python is a challenger to R that in many respects has surpassed R. Python also has memory management issues, but its easier syntax and the Pythonic way of writing code make it much simpler to use and read. It also has a package structure that is simpler to access and use. The Pandas package is excellent for data manipulation, with an intuitive syntax for data access, management, and manipulation. The Statsmodels and scikit-learn packages provide many excellent functions and capabilities for statistics and machine learning, respectively. For an introduction to Pandas, see McKinney [2018].
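
As a rough illustration of the Pandas syntax for access, management, and manipulation, consider the following minimal sketch. The survey data, column names, and the "top box" cutoff of 8 are all hypothetical assumptions made for this example.

```python
import pandas as pd

# Hypothetical concept-test ratings; all values are made up.
ratings = pd.DataFrame({
    "respondent": [1, 2, 3, 4, 5, 6],
    "segment": ["young", "young", "young", "older", "older", "older"],
    "rating": [8, 7, 9, 5, 6, 4],
    "price_shown": [10.0, 12.0, 10.0, 12.0, 10.0, 12.0],
})

# Access: select rows by condition and columns by label.
high = ratings.loc[ratings["rating"] >= 7, ["respondent", "segment", "rating"]]

# Management: add a derived column without modifying the original data.
flagged = ratings.assign(top_box=ratings["rating"] >= 8)

# Manipulation: summarize ratings by segment.
summary = flagged.groupby("segment").agg(
    mean_rating=("rating", "mean"),
    top_box_share=("top_box", "mean"),
)
print(summary)
```

The resulting summary table could then be passed to a modeling package such as Statsmodels or scikit-learn for further analysis.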

An interesting and useful side-by-side comparison of R and Python is available on the Dataquest website.5 This comparison shows that in some instances, R is simpler to use while in others Python is simpler. The bottom line is that R and Python complement one another rather than being competitive. They each have strengths and weaknesses depending on applications. Dataquest further concludes that “Ultimately, you may end up wanting to learn Python and R so that you can make use of both languages’ strengths, choosing one or the other on a per-project basis depending on your needs.”

Python can be downloaded at www.python.org/downloads/.

Julia

Julia is a newer language developed at MIT. It is fast becoming a popular programming and statistical analysis language, but it still needs further development before it is a complete rival to R and Python. There are major plans for it in the data science area.

Julia can be downloaded at https://julialang.org/downloads/.

Stan

Stan is also a new language, but one that differs from R, Python, and Julia in that it focuses on Bayesian analysis. Bayesian analysis has become more accepted and useful in statistical analysis, especially since the development of Markov Chain Monte Carlo (MCMC) methods.
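
To give a feel for what an MCMC method does, the following is a minimal Python sketch of a random-walk Metropolis sampler, one of the simplest MCMC algorithms. It is not Stan code and not the algorithm Stan uses (Stan relies on more sophisticated Hamiltonian Monte Carlo methods). The example problem, estimating a purchase-intent proportion from 7 successes in 10 trials under a uniform prior, and all function names and tuning values are assumptions made for illustration.

```python
import math
import random


def log_posterior(theta, successes, trials):
    """Log of the unnormalized posterior: binomial likelihood, uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return (successes * math.log(theta)
            + (trials - successes) * math.log(1.0 - theta))


def metropolis(successes, trials, n_draws=20000, step=0.1, seed=42):
    """Random-walk Metropolis sampler for a single proportion."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, successes, trials)
    draws = []
    for _ in range(n_draws):
        # Propose a small random move away from the current value.
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, successes, trials)
        # Accept with probability min(1, posterior ratio).
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            theta, current = proposal, candidate
        draws.append(theta)
    return draws


draws = metropolis(successes=7, trials=10)
kept = draws[2000:]  # discard early draws as burn-in
posterior_mean = sum(kept) / len(kept)
print(round(posterior_mean, 2))
```

For this simple problem the exact posterior is a Beta(8, 4) distribution with mean 8/12, so the sampled mean can be checked against a known answer. The value of MCMC lies in models where no such closed form exists.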

Stan can be downloaded at https://mc-stan.org/.

KNIME Analytics Platform

KNIME is an open source software package specifically designed for data science applications. It has a graphical user paradigm that allows users to drag and drop icons on a canvas to paint a picture of the process they want to follow. This gives KNIME an intuitive appeal because the user can “see” how data flows from one application to another.

KNIME can be downloaded at www.knime.com/knime-software/knime-analytics-platform.

Commercial software

There are three commercial software packages I will mention, although two come from the same parent company, the SAS Institute: SAS and JMP. The third is Stata.

SAS

SAS is probably the granddaddy of all statistical software packages, whether commercial or open source. It was originally developed in the 1970s and has dominated the statistical software industry ever since, even though it has a hefty price tag. It is probably safe to say that any major statistical, data management, and data visualization capability that an analyst will need is in SAS. There is an extensive development structure behind the software at the SAS Institute that almost guarantees a high quality and powerful product. This structure is also a drawback, however, because it makes the Institute slow in adding the latest statistical innovations.

I mentioned that all objects that R creates are maintained in memory. This makes using R with Big Data more than a challenge because there are memory limits. SAS, on the other hand, dynamically manages memory and is thus more efficient. Also, SAS has an extensive library of functions (called PROCs), each containing many sophisticated options. In fact, the depth of options far exceeds that of any other software package.

SAS is available at www.sas.com.

JMP

JMP is an interesting and powerful product also from the SAS Institute that has an intuitive graphical interface that makes working with data much simpler than in any other software package. It has many powerful and complete statistical options, although not as many as SAS and without all the depth SAS has on any one function. A major advantage of JMP is its close connection to SAS. If you have SAS installed on the same system as JMP, then JMP can be used as a front-end to SAS, easily passing data to SAS and retrieving data (and graphs) back as JMP objects and reports. In addition, JMP has interfaces to R and Python that greatly magnify its capabilities. Unfortunately, JMP also has a high price tag, although not as high as the one for SAS. See Paczkowski [2016] for a discussion of using JMP to analyze market data.

JMP is available at www.jmp.com.

Stata

Stata is a powerful statistical and econometric package, although it is mostly targeted to econometrics. It has an intuitive graphical user interface (GUI) and a powerful programming component. However, I have found the programming language difficult to use.

Stata is available at www.stata.com/.

 