SQL: A must-know language
All data analysts must have a working knowledge of the Structured Query Language (SQL) because accessing, manipulating, and querying large databases is a routine function all analysts have to perform. SQL, and its many dialects, is the major query language used by all database systems. It is the foundational language for accessing, and, by default, organizing vast quantities of data. It was introduced and standardized in the 1980s. It also enabled the development of data stores, data warehouses, data marts, and data lakes that are so common and prevalent in our current business environment.
Overall software recommendation
All data analysts need to know R, Python, and SQL. A working knowledge of SAS would also be beneficial but its availability may be hindered by its price tag.
The open source software packages listed above can be used in one ecosystem: Jupyter. An ecosystem will handle multiple languages: Python, R, Julia, and SAS and many more.6 In fact, about 120! Incidentally, a language is referred to as a kernel in Jupyter nomenclature. Jupyter allows you to document your workflow for reproducibility which is important in large and complex data science tasks. Documentation and reproducibility are two separate but connected tasks that are important in any research project. They should not be an afterthought. Trying to recall what you did is itself a daunting task usually subordinated to the other daunting tasks of research. Researchers typically document after the work is done but at this point the documentation is incomplete at best and error prone at worse. Documentation is the logging of steps in the research process and includes data sources, transformations, and steps to arrive at an answer to management or a client’s question. Reproducibility is the ability to rerun an analysis. Quite often, analysts produce a report only to have management or the client call, even months later, requesting either a clarification of what was done, perhaps for a legal reason, or to request further analysis. This means you must recall exactly what you did, not to mention what data you used. This is where reproducibility comes in.
Jupyter uses a notebook paradigm which allows documentation and reproducibility through the use of a cell-orientation approach. There are two cells:
- 1. a code cell where code is entered; and
- 2. a “markdown” cell where text is entered.
Jupyter Lab is the next evolutionary step in the Jupyter notebook paradigm that allows you to have several notebooks open at once in tabs. You can drag and drop from one notebook to another, resize tabs, move tabs, and do much more.7
The best way to obtain Jupyter is by downloading and installing the Anaconda Navigator which comes with a large number of packages. Anaconda will load Python by default, but kernels for R, Julia, and SAS can be easily installed. Once installed, you can specify which language you want to use in a notebook. Anaconda is available at www.anaconda.com/.
- 1 “Automotive Industry Trends Point to Shorter Product Development Cycles” available at www.jabil.com/insights/blog-main/automotive-industry-trends-point-to- shorter-product-development-cycles.html. Last access on April 8, 2019. The survey was conducted in December, 2018 by Jabil, a consulting company.
- 2 Ibid.
- 3 See the Wikipedia article on Least Squares at https://en.wikipedia.org/wiki/Least_ squares#History. Last accessed April 4, 2019.
- 4 See the Artificial Intelligence article at https://en.wikipedia.org/wiki/Artificial_ intelligence. Last accessed April 4, 2019.
- 5 www.dataquest.io/blog/python-vs-r/
- 6 “Jupyter” is a contraction of Julia, Python, and R.
- 7 As of the time of writing this book, Jupyter Lab is still in development.