
CityReads | How to Think Like A Data Scientist?

Wainer, Howard 城读 2022-07-13

317
How to Think Like A Data Scientist?
The best of a billion is almost surely better than the best of a million.

Wainer, Howard. 2015. Truth or truthiness: Distinguishing fact from fiction by learning to think like a data scientist. New York: Cambridge University Press.


Let's say, you have won a lottery and you can choose between one of two prizes. You can opt for either:
 
1. $10,000 every day for a month, or
2. One penny on the first day of the month, two on the second, four on the third, and continued doubling every day thereafter for the entire month.
 
Which option would you prefer?
 
Some back-of-the-envelope calculations show that after ten days option (1) has already yielded $100,000, whereas option (2) has yielded only $10.23. The choice seems clear, but we continue with some more arithmetic, and after twenty days option (1) has ballooned to $200,000 while option (2) has yielded $10,485.75. Is there any way that over the remainder of the month the tortoise of option (2) can possibly overtake the hare of option (1)?
 
Quietly, however, even after twenty days the exponential momentum has become inexorable, for by day twenty-one it is $20,971, by day twenty-two it is $41,943, and so by day twenty-five, even though option (1) has reached its laudable, but linear, total of $250,000, option (2) has passed it, reaching $335,544, and is sprinting away toward the end-of-the-month finish line.
 
If the month was a non–leap year February, option (2) would yield $2,684,354, almost ten times option (1)'s total. But with the single extra day of a leap year it would double to $5,368,709. And, if you were fortunate enough to have the month chosen being one lasting thirty-one days, the penny-a-day doubling would have accumulated to $21,474,836.47, almost seventy times the penurious $10,000/day’s total.
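The arithmetic in this story is easy to check by machine. The following is a minimal sketch, not anything from Wainer's book: the function names are made up for illustration, and all amounts are kept in whole cents so that no rounding creeps in. It relies on the fact that the doubling prize pays 1 + 2 + 4 + ... + 2^(n-1) = 2^n - 1 cents over n days.

```python
def linear_total_cents(days, daily_cents=1_000_000):
    """Cumulative payout of option (1): $10,000 (1,000,000 cents) per day."""
    return days * daily_cents


def doubling_total_cents(days):
    """Cumulative payout of option (2): 1 + 2 + 4 + ... + 2**(days - 1) cents."""
    return 2**days - 1


def first_day_doubling_leads(max_days=31):
    """First day on which option (2)'s running total exceeds option (1)'s."""
    for day in range(1, max_days + 1):
        if doubling_total_cents(day) > linear_total_cents(day):
            return day
    return None  # option (2) never catches up within max_days


# Print both running totals on the days mentioned in the story.
for day in (10, 20, 21, 22, 25, 28, 29, 31):
    linear = linear_total_cents(day) / 100
    doubling = doubling_total_cents(day) / 100
    print(f"day {day:2d}: option (1) ${linear:>13,.2f}   option (2) ${doubling:>13,.2f}")

print("option (2) first leads on day", first_day_doubling_leads())
```

Working in integer cents keeps every figure exact; the division by 100 happens only at print time.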
 
As we can now see, the decision of which option to choose is not even close. Yet, even though the choice of option (2) was a slam dunk, how many of us could have foreseen it?
 
Compound interest yields exponential growth, so financial planners emphasize the importance of starting to save for retirement as early as possible. And yet the result of the exponential growth yielded by compound interest is hard to grasp, deep in one’s soul. To aid our intuition a variety of rules of thumb have been developed. One of the best known, and the oldest, is the "Rule of 72," described in detail by Luca Pacioli (1445–1514) in 1494.
 
In brief, the Rule of 72 gives a good approximation as to how long it will take for your money to double at any given compound interest rate. The doubling time is derived by dividing the interest rate into seventy-two. So at 6 percent your money will double in twelve years, at 9 percent in eight years, and so forth. Although this approximation is easy to compute in your head, it is surprisingly accurate (see Figure 1.1).
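To see how close the approximation is, one can compare it with the exact doubling time under annual compounding, ln 2 / ln(1 + r/100). The sketch below is only an illustration: the function names and the sample rates are assumptions chosen for this example, and they are not taken from the book or its Figure 1.1.

```python
from math import log


def rule_of_72_years(rate_percent):
    """Approximate doubling time: 72 divided by the annual rate in percent."""
    return 72 / rate_percent


def exact_doubling_years(rate_percent):
    """Exact doubling time, assuming interest is compounded once a year."""
    return log(2) / log(1 + rate_percent / 100)


# Illustrative rates only; any positive rate can be used.
for rate in (1, 3, 6, 9, 12):
    approx = rule_of_72_years(rate)
    exact = exact_doubling_years(rate)
    print(f"{rate:2d}%: rule of 72 = {approx:5.1f} years, exact = {exact:6.2f} years")
```

At 6 and 9 percent the rule's answers of twelve and eight years differ from the exact values by only about a tenth of a year.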

But exponential growth happens in many circumstances outside of finance. When I was a graduate student, the remarkable John Tukey advised me that to succeed in my career, I would have to work harder than my competitors, but "not a lot harder, for if you work just 10% harder in just 7 years you will know twice as much as they do." Stated in that way, it seems that at a cost of just forty-eight minutes a day you can garner huge dividends.
 
Now that we have widened our view of the breadth of application of the Rule of 72, we can easily see other avenues where it can provide clarity. For example, I recently attended a fiftieth high school reunion and was dismayed at the state of my fellow graduates. But then I realized that those who had allowed their weight to creep upward at even the modest rate of 1.44 percent a year would, at the fiftieth reunion, be double the size I remembered from their yearbook portrait.
 
Of course, this rule also provides insight into how effective various kinds of plans for world domination can be. One possible way for a culture to dominate all others is for its population to grow faster than its competitors'. But not a great deal faster; for again, if its growth rate is just 6 percent greater, its population will double relative to theirs in just twelve years. Here I join with Mark Twain (1883) in that what we both like best about science is that "one gets such wholesale returns of conjecture out of such a trifling investment of fact."
 
Here is another story about the four-minute mile.
 
The world record for running the mile has steadily improved by almost four-tenths of a second a year for the past century. When the twentieth century began the record was 4:13. It took almost fifty years until Roger Bannister collapsed in exhaustion after completing a mile in just less than four minutes. In a little more than a decade his record was being surpassed by high school runners. And, by the end of the twentieth century, Hicham El Guerrouj broke the tape at 3:43.
 

Roger Bannister, 1954
 

Hicham El Guerrouj, 3:43.13, 1999
 
What happened? How could the capacity of humans to run improve so drastically in such a relatively short time? Humans have been running for a very long time, and in the more distant past, the ability to run quickly was far more important for survival than it is today. A clue toward an answer lies in the names of the record holders. In the early part of the century the record was held by Scandinavians – Paavo Nurmi, Gunder Haag, and Arne Andersson. Then mid-century came runners from Britain and the Commonwealth: Roger Bannister, John Landy, Herb Elliott, Peter Snell, and later Steve Ovett and Sebastian Coe. And in the last decades of the century the Africans arrived; first Filbert Bayi, then Noureddine Morceli and Hicham El Guerrouj. As elite competition began to include a wider range of runners, times improved. A runner who wins a race that is the culmination of events that winnowed the competition from a thousand to a single person is likely to be slower than one who is the best of a million.
 
A simple statistical model, proposed and tested in 2002 by Scott Berry, captures this idea. It posits that human running ability has not changed over the past century: in both 1900 and 2000 the distribution of running ability of the human race is well characterized by a normal curve with the same average and the same variability. What has changed is how many people live under that curve. And so in 1900 the best miler in the world (as far as we know) was the best of a billion; in 2000 he was the best of six billion. It turns out that this simple model can accurately describe the improvements in performance of all athletic contests for which there is an objective criterion. The best of a billion is almost surely better than the best of a million.
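The intuition behind "best of a billion beats best of a million" can be sketched with a little order-statistics arithmetic. What follows is an illustration under stated assumptions, not Berry's actual model: ability is treated as a standard normal score where higher means faster, the draws are independent, and the function name `median_best_z` is invented for this example. If the best of n draws falls below x with probability Φ(x)^n, then the median of the best score solves Φ(x)^n = 0.5.

```python
from math import expm1, log
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def median_best_z(pool_size):
    """Median z-score of the best of `pool_size` independent standard-normal draws."""
    # The maximum of n draws is below x with probability cdf(x)**n.
    # Solving cdf(x)**n = 0.5 gives an upper-tail probability of
    # 1 - 0.5**(1/n); expm1 keeps that tiny number accurate for large n.
    upper_tail = -expm1(log(0.5) / pool_size)
    # By symmetry, the point with upper tail t is minus the point with lower tail t.
    return -STANDARD_NORMAL.inv_cdf(upper_tail)


# The bell curve is identical in every row; only the number of people under it changes.
for pool in (10**3, 10**6, 10**9, 6 * 10**9):
    print(f"best of {pool:>13,}: typically about {median_best_z(pool):.2f} standard deviations above average")
```

The typical champion's score keeps rising as the pool grows, even though nobody's ability has changed, which is the mechanism the model relies on.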
 
If you find the above two stories of data analysis illuminating, you'll enjoy statistician Howard Wainer's book, Truth or truthiness: Distinguishing fact from fiction by learning to think like a data scientist. Data science is a relatively recent term coined by Peter Naur in 1960. Data science is the study of the generalizable extraction of knowledge from data. The core of data science is, in fact, science, and the scientific method with its emphasis on only what is observable and replicable provides its very soul.
 
This book has three parts: thinking like a data scientist, communicating like a data scientist, and applying data science tools to education. While this work will not make anyone a data scientist, it might encourage readers to be more attentive to the evidence behind statements that they hear or read, more likely to ask questions, and perhaps more skeptical overall. When a claim is made, the first question that we ought to ask ourselves is "how can anyone know this?" And, if the answer isn't obvious, we must ask the person who made the claim, "what evidence do you have to support it?" Wainer’s approach is to provide some general guidance and a lot of examples in the form of loosely connected case studies.
 
Each chapter is designed to suggest something of the way a data scientist thinks, and how to begin to approach what appear to be very challenging questions. Underlying the whole book are some critical ideas about evidence and its role in science. Wainer calls out some essential components. These include making hypotheses explicit, developing sound evidence to test these hypotheses, and ensuring reproducibility. Most of the knotty problems discussed in this book are unraveled using little more than three of the essential parts of scientific investigations: (1) Some carefully gathered data, combined with (2) Clear thinking and (3) Graphical displays that permit the results of the first two steps to be made visible.
 

Related CityReads

6.CityReads│Life in the City Is Essentially One Giant Math Problem

23.CityReads│How to Lie With Maps

31.CityReads│How Jogging Became A Habit?

35.CityReads│The Joy of Stats

54.CityReads│What The Limits to Growth Got Right and Wrong?

107.CityReads│My All-Time Favorite Running Book

117.CityReads│Remembering Edutainer Hans Rosling, Who Made Data Dance

124.CityReads│How Marathon Has Become A National Sport in Japan?

127.CityReads│Everybody Lies: How the Internet Reveals Who We Are

144.CityReads│Everyone Can Excel at Math & Science

148.CityReads│A New Way of Learning Economics to Understand World

165.CityReads│Scale: Simple Law of organisms, Cities and Companies

169.CityReads│Dollar Street shows how people live by photos.

170.CityReads│Why GDP Is Not Enough to Measure Development

175.CityReads│What Is the Best Way to Learn Statistics?

204.CityReads│All You Need to Know About the Global Inequality

211.CityReads│Learning Statistical Thinking for the 21st Century

213.CityReads│When Words Meet Numbers: What It Reveals about Writing

235.CityReads│How to Spot Chart Lies?

236.CityReads│Using Big Data to Solve Economic and Social Problems

237.CityReads│Ten Rules of Factful Thinking to Learn about the World

252.CityReads│How To Improve Your Data Literacy?

262.CityReads│How do the Kalenjin Become World’s Great Runners?

264.CityReads│Visualizing complexity

