50 Years of Data Science

Drawing on work by Tukey, Chambers, Breiman and Cleveland, Stanford statistics professor David Donoho presents a vision of data science based on the activities of people who are ‘learning from data’.

  • John Tukey’s The Future of Data Analysis asserts that statistics must become concerned with the handling and processing of data, its size, and its visualization.
  • John Chambers’s S language, the predecessor of R, is the forerunner of the “notebook” concept, where an academic paper can be made reproducible, scripted, and shareable (e.g. Jupyter Notebook).
  • Leo Breiman’s Two Cultures notes that a strict focus on prediction accuracy is different from inference about models, and that the former is under-represented in academia but prevalent in industry, where it has turned into “machine learning.”
  • William S. Cleveland’s 2001 paper Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics addressed academic statistics departments and proposed a plan to reorient their work.

Donoho’s paper reviews the recent hype about data science in the popular media, and the debate over whether and how data science really differs from statistics.

He also describes an academic field dedicated to improving that activity, learning from data, in an evidence-based manner. His premise is that this new field is a better academic enlargement of statistics and machine learning than today’s Data Science Initiatives, while being able to accommodate the same short-term goals.

He proposes calling this would-be field “Greater Data Science”, made up of the following collection of activities:

1. Data Exploration and Preparation
2. Data Representation and Transformation
3. Computing with Data
4. Data Modeling
5. Data Visualization and Presentation
6. Science about Data Science

He contends that information technology skills are at a premium, but that scientific understanding and statistical insight should be firmly in the driver’s seat.

Check out the thoughtful essay by Stanford statistics professor David Donoho, titled “50 Years of Data Science.”