211.CityReads│Learning Statistical Thinking for the 21st Century
“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”
- H.G. Wells
Russell A. Poldrack. Statistical Thinking for the 21st Century (Draft: 2018-11-29)
Source: http://statsthinking21.org
Professor Russell A. Poldrack, an American psychologist and neuroscientist, began teaching an undergraduate statistics course at Stanford in 2018. He wanted to bring a number of new ideas and approaches to the class, in particular the approaches that are increasingly used in real statistical practice in the 21st century. As Brad Efron and Trevor Hastie laid out so nicely in their book “Computer Age Statistical Inference: Algorithms, Evidence, and Data Science”, these methods take advantage of today’s increased computing power to solve statistical problems in ways that go far beyond the more standard methods usually taught in undergraduate statistics courses for psychology students.
At first, Professor Poldrack used Andy Field’s amazing graphic novel statistics book, “An Adventure in Statistics”, as the textbook. There are many good things about this book, especially the way that it frames statistical practice around the building of models and treats null hypothesis testing with sufficient caution. Unfortunately, most of the students hated the book, primarily because it involved wading through a lot of story to get to the statistical knowledge. In addition, the book does not cover machine learning. So Professor Poldrack started writing down his lectures as a set of computational notebooks that ultimately became this book. The outline of his book follows roughly that of Field’s book, but the content is substantially different (and also much less fun and clever).
Statistical Thinking for the 21st Century is not a conventional statistics textbook. The full text is available at http://statsthinking21.org. It is an open-source book, and its source is available online at https://github.com/poldrack/psych10-book. If you find any errors in the book or want to make a suggestion for how to improve it, please open an issue on the GitHub site.
What is statistical thinking?
Statistical thinking is a way of understanding a complex world by describing it in relatively simple terms that nonetheless capture essential aspects of its structure, and that also provide us some idea of how uncertain we are about our knowledge. The foundations of statistical thinking come primarily from mathematics and statistics, but also from computer science, psychology, and other fields of study.
We can distinguish statistical thinking from other forms of thinking that are less likely to describe the world accurately. In particular, human intuition often tries to answer the same questions that we can answer using statistical thinking, but often gets the answer wrong. For example, in recent years most Americans have reported believing that violent crime was worse than in the previous year (Pew Research Center). However, a statistical analysis of the actual crime data shows that violent crime has in fact steadily decreased since the 1990s. Intuition fails us because we rely upon best guesses (which psychologists refer to as heuristics) that often get things wrong.
For example, humans often judge the prevalence of some event (like violent crime) using an availability heuristic – that is, by how easily we can think of an example of violent crime. For this reason, our judgments of increasing crime rates may reflect increasing news coverage rather than the actual, decreasing crime rate. Statistical thinking provides us with the tools to understand the world more accurately and to overcome the fallibility of human intuition.
What can statistics do for us?
There are three major things that we can do with statistics:
Describe: The world is complex and we often need to describe it in a simplified way that we can understand.
Decide: We often need to make decisions based on data, usually in the face of uncertainty.
Predict: We often wish to make predictions about new situations based on our knowledge of previous situations.
Fundamental concepts of statistics
There are a number of very basic ideas that cut across nearly all aspects of statistical thinking. Several of these are outlined by Stigler (2016) in his outstanding book “The Seven Pillars of Statistical Wisdom”: aggregation, information, likelihood, intercomparison, regression, design, and residual. Here Professor Poldrack discusses four key concepts of statistics.
1 Learning from data
One way to think of statistics is as a set of tools that enable us to learn from data. In any situation, we start with a set of ideas or hypotheses about what might be the case. Statistics provides us with a way to describe how new data can be best used to update our beliefs, and in this way there are deep links between statistics and psychology. In fact, many theories of human and animal learning from psychology are closely aligned with ideas from the new field of machine learning.
Machine learning is a field at the interface of statistics and computer science that focuses on how to build computer algorithms that can learn from experience. While statistics and machine learning often try to solve the same problems, researchers from these fields often take very different approaches; the famous statistician Leo Breiman once referred to them as “The Two Cultures” to reflect how different their approaches can be (Breiman 2001). In his book, Poldrack tries to blend the two cultures, because both approaches provide useful tools for thinking about data.
2 Aggregation
Another way to think of statistics is as “the science of throwing away data”: we often summarize a large set of observations with a single number, such as the average, deliberately discarding the details of the individual observations. This kind of aggregation is one of the most important concepts in statistics. When it was first advanced, it was revolutionary: if we throw out all of the details about every one of the participants, then how can we be sure that we aren’t missing something important?
Statistics provides us ways to characterize the structure of aggregates of data, and with theoretical foundations that explain why this usually works well. However, it’s also important to keep in mind that aggregation can go too far, and later we will encounter cases where a summary can provide a misleading picture of the data being summarized.
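To make this concrete, here is a minimal sketch in Python (the income figures are made up for illustration): two small datasets share exactly the same mean, yet the single summary number hides how different they really are.

```python
# A minimal sketch (made-up numbers) of how an aggregate can hide structure:
# two small datasets share the same mean but describe very different situations.
import statistics

incomes_town_a = [40, 42, 38, 41, 39]    # everyone earns roughly the same
incomes_town_b = [10, 10, 10, 10, 160]   # one extreme value dominates

print(statistics.mean(incomes_town_a))   # 40
print(statistics.mean(incomes_town_b))   # 40  -- identical summary
print(statistics.stdev(incomes_town_a))  # ~1.6  -- small spread
print(statistics.stdev(incomes_town_b))  # ~67.1 -- large spread that the mean alone conceals
```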
3 Uncertainty
The world is an uncertain place. We now know that cigarette smoking causes lung cancer, but this causation is probabilistic: A 68-year-old man who smoked two packs a day for the past 50 years and continues to smoke has a 15% (1 out of 7) risk of getting lung cancer, which is much higher than the chance of lung cancer in a nonsmoker. However, it also means that there will be many people who smoke their entire lives and never get lung cancer. Statistics provides us with the tools to characterize uncertainty, to make decisions under uncertainty, and to make predictions whose uncertainty we can quantify.
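Statistics not only acknowledges uncertainty, it quantifies it. Here is a minimal sketch in Python (the study counts are hypothetical, chosen only to match the 15% figure above) of how one might attach a 95% confidence interval to an estimated risk, using the normal approximation to the binomial:

```python
# A minimal sketch (hypothetical counts) of quantifying uncertainty in an estimated risk:
# a 95% confidence interval for a proportion via the normal approximation.
import math

cases = 105   # hypothetical: heavy smokers in a study who developed lung cancer
n = 700       # hypothetical: total heavy smokers followed

p_hat = cases / n                          # estimated risk: 0.15
se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error of the proportion
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"estimated risk: {p_hat:.3f}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")   # the interval expresses our uncertainty
```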
One often sees journalists write that scientific researchers have “proven” some hypothesis. But statistical analysis can never “prove” a hypothesis, in the sense of demonstrating that it must be true (as one would in a mathematical proof). Statistics can provide us with evidence, but it’s always tentative and subject to the uncertainty that is always present in the real world.
4 Sampling
The concept of aggregation implies that we can gain useful insights by collapsing across data – but how much data do we need? The idea of sampling says that we can summarize an entire population based on just a small number of samples from the population, as long as those samples are obtained in the right way. The way that the study sample is obtained is critical, as it determines how broadly we can generalize the results. Another fundamental insight from statistics about sampling is that while larger samples are always better (in terms of their ability to accurately represent the entire population), there are diminishing returns as the sample gets larger. In fact, the benefit of a larger sample follows a simple mathematical rule: the precision of our estimates grows only with the square root of the sample size, so quadrupling the sample size merely doubles the precision.
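A minimal simulation sketch in Python illustrates this square-root rule (assuming, purely for illustration, a normally distributed population with a standard deviation of 10): the standard error of the sample mean shrinks as 1/√n, so each quadrupling of the sample size only halves the error.

```python
# A minimal simulation sketch of diminishing returns from larger samples:
# the standard error of the sample mean shrinks as 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
population_sd = 10.0   # assumed population standard deviation (illustrative)

for n in [25, 100, 400, 1600]:
    # draw 5,000 samples of size n and see how much their means vary
    sample_means = rng.normal(loc=50, scale=population_sd, size=(5_000, n)).mean(axis=1)
    print(f"n={n:5d}  simulated SE={sample_means.std():.3f}  "
          f"theory={population_sd / np.sqrt(n):.3f}")
```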
Related CityReads
06.CityReads│Life in the City Is Essentially One Giant Math Problem
11.CityReads│Why So Many Emerging Megacities Remain So Poor?
12.CityReads│How economists study cities?
23.CityReads│How to Tell Lies with Maps?
44.CityReads│How Could Humanity Escape Poverty?
49.CityReads│1800: A Year of Significance
82.CityReads│The End of Growth in the Standard of Living?
91.CityReads│Income inequality in Latin America in the 2010s
105.CityReads│Winners and Losers of Globalization
117.CityReads│Remembering Edutainer Hans Rosling, Who Made Data Dance and Taught us Fact-based Worldview
127.CityReads│Everybody Lies: How the Internet Reveals Who We Are
144.CityReads│Everyone Can Excel at Math & Science, If You Learn How to Learn
145.CityReads│Can Food Production Keep Pace with Population Growth?
148.CityReads│A New Way of Learning Economics to Understand Real World
159.CityReads│Children in China: Evidence from the 2015 Mini-Census
165.CityReads│Scale: Simple Law of Organisms, Cities and Companies
170.CityReads│Why GDP Is Not Enough to Measure Development
171.CityReads│Free Online Course CitiesX Teaches Everything about Urban Life
175.CityReads│What Is the Best Way to Learn Statistics?
176.CityReads│Sal Khan: the Man Who Makes Education Free to All
(Click the title, or reply with the post number in our WeChat menu, to read it)
"CityReads", a subscription account on WeChat,
posts our notes on city reads weekly.
Please follow us by searching "CityReads" or long-pressing the QR code above.