Fun (and Fraud Detection) with Benford's Law | Data and the Web

Benford's law is one of those things your high school math teacher would break out on a slow, rainy day when the students' attention span was even lower than usual.

He'd start out by asking the class to look at a list of numbers and predict how often each digit from 1 to 9 would appear as the leading digit. The students would make some guesses and eventually come to the consensus that the digits would be about equally likely, roughly 11% each.

Then, the teacher would just sit back, smile, and gently shake his head at his simple-minded pupils. He would then go on to explain Benford's law, which would blow everyone's mind — at least through lunchtime.

[Video: Benford's law demonstration, also available as an embeddable YouTube version]

Per Wikipedia:

Benford's law, also called the first-digit law, states that in lists of numbers from many real-life sources of data, the leading digit is distributed in a specific, non-uniform way.

Specifically, in this way:

Leading Digit     Probability
      1              30.1%
      2              17.6%
      3              12.5%
      4               9.7%
      5               7.9%
      6               6.7%
      7               5.8%
      8               5.1%
      9               4.6%
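The percentages in the table come from the formula P(d) = log10(1 + 1/d), where d is the leading digit. Here is a minimal Python sketch that reproduces them; the function name is just for illustration:

```python
import math

def benford_probability(digit: int) -> float:
    """Expected share of numbers whose leading digit is `digit` (1 through 9)."""
    return math.log10(1 + 1 / digit)

# Print the same table as above, one row per leading digit.
for d in range(1, 10):
    print(f"{d}  {benford_probability(d):6.1%}")
```

The nine probabilities add up to exactly 1, because the sum telescopes to log10(10).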

Again, from Wikipedia:

This counter-intuitive result applies to a wide variety of figures, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature).

Boiling it down, this means that for almost any naturally occurring data set, the number 1 will be the leading digit about 30% of the time. Here, “naturally occurring” can mean check amounts, stock prices, or website statistics. Non-naturally occurring data would be pre-assigned numbers like postal codes or UPC numbers.

Besides being fun to play with, Benford's law is used in the accounting profession to detect fraud. Because data like tax returns and check registers tend to follow Benford's law, auditors can use it as a high-level check of a data set. If there are anomalies, the data may be worth investigating more closely for potential fraud.
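To give a feel for what such a high-level check might look like, here is a rough Python sketch. It is not how any particular audit tool works: the helper names, the sample amounts, and the 5-point cutoff are all made up for illustration.

```python
import math
from collections import Counter

def leading_digit(value: float) -> int:
    """First significant digit, ignoring sign and scale (0.042 -> 4)."""
    # Scientific notation always puts the first significant digit first.
    return int(f"{abs(value):.15e}"[0])

def benford_deviations(values):
    """Return {digit: observed share - expected share} for the nonzero values."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {
        d: counts[d] / len(digits) - math.log10(1 + 1 / d)
        for d in range(1, 10)
    }

# Made-up check amounts, only to show the call; a real test needs far more rows.
sample_amounts = [1250.00, 87.15, 1900.40, 23.99, 310.00, 1420.75, 96.10, 175.30]
THRESHOLD = 0.05  # arbitrary cutoff for this sketch, not an auditing standard

for digit, gap in benford_deviations(sample_amounts).items():
    flag = "  <-- look closer" if abs(gap) > THRESHOLD else ""
    print(f"{digit}: {gap:+.3f}{flag}")
```

A flagged digit is only a hint about where to dig, not evidence of fraud by itself.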

If you're interested in further information about fraud detection using Benford's, definitely give these two articles by Malcolm W. Browne and Mark J. Nigrini a read.

Try It Out for Yourself

Take a look at the demonstration video above to see Benford's law in action with data sets from the web. If you'd like to play with it yourself, just install the Benford's Law extension for Kirix Strata™ and have fun.

Also, please note that I used the following data sets in the video, if you'd like to give those a spin:

Wikipedia List of Lakes in Minnesota
US Census Data Sets
Social Blade - Digg Statistics

And here are a few other worthy ones that didn't make it in the video:

NASDAQ Historical Stock Price
Wikipedia List of Countries by Population
And plenty more at Delicious here…

Enjoy!

46 Responses to “Fun (and Fraud Detection) with Benford's Law”

  1. Geoff Hotchkiss says:

    This interested me so I wondered whether or not this would work with random numbers. I decided to test this with the Random class in Java and I calculated the percentages. I tried it with just the numbers 1-9 and then with three-digit numbers 100-999 and both came to be about 11% for each leading digit. I guess that's because of the way the Random class generates the numbers and it was an unexpected result that I thought was kind of interesting.
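The experiment described here is easy to repeat. The sketch below uses Python's random module rather than Java's Random class, with a fixed seed so the run is repeatable. Each leading digit covers exactly 100 of the 900 three-digit numbers, so uniform draws should land near 1/9 (about 11.1%) per digit.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

# Uniform three-digit integers, as in the 100-999 test described above.
draws = [random.randint(100, 999) for _ in range(90_000)]
counts = Counter(int(str(n)[0]) for n in draws)

for digit in range(1, 10):
    print(f"{digit}: {counts[digit] / len(draws):.1%}")
```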

  2. Ken Kaczmarek says:

    Hi Geoff,

    Yeah, I agree, it's really interesting. When I was prepping for the video, I actually did the same thing, but because the video was getting a little long, I had to cut it. If you've downloaded Strata, you can try this out pretty quickly:
    1. Open up a data set, then right-click on the field name and select “Insert calculated field”.
    2. Type in rand(), which will populate your calculated field with random numbers; you can play with the number of digits by doing something like rand() * 100; then hit OK.
    3. Run your Benford's test. (One note: the rand() function is dynamic, so the values change each time you do something with the field, and you can keep hitting “Graph” to get new random graphs each time. To make the values static, right-click on the field header and select “Convert to fixed fields”.)

    Overall, it is pretty amazing what Benford's does apply to, stock market data for instance. Another thing I cut from the video was taking 200 random stock quotes and running Benford's on price and volume… amazingly, it worked like a charm. Not only that, you can multiply the stock price by currency conversions (say Yen or Euro) and Benford's still holds. :)

    Thanks for the comment!

  3. brian fantana says:

    you should be glad your Random class generates with 11% distribution. otherwise it's broken ;)

  4. Geoff Hotchkiss says:

    Haha yeah… I was kind of thinking that anything but 11% would mean it's broken but was still curious. I also modified my program to do 1-9999 instead of just checking numbers with the same number of digits and still 11%. It does make sense that it should be 11% but I still had to try.

    Ken, I'm going to have to download Strata and play around with it, although I do wish there was an OS X version. I'll just have to install it on my VM. It looks like a very fun program!

  5. dtret says:

    Wow. Just amazing.

    I love that the founder of Digg has a 97% chance of having his choices land on the front page.

  6. Chatham Harrison says:

    The whole point is that the numbers that this Law is referring to are not truly random, which means that they're predictable (every statistician's favorite quality). This is simple probability. It's basically gambling, if you were going all in every round. Let's say you have a 20% chance of winning any hand. You have a 20% chance of simply winning the first hand, but since you're all in, winning the second hand is predicated on you winning the first, therefore your chance of winning the second hand is only 4%. The result will fit a logarithmic curve, as in Benford's Law. In any situation where there are multiple data points, all of them moving away from zero, and there is a constant probability of further movement less than one, the resulting distribution will be logarithmic. And forgive me if that sleep-induced, hurried definition is lacking; I'm a PoliSci major, not a mathematician.

  7. Eloy says:

    Can Benford's Law be used in detecting fraud within an evaluation survey? For example: 1 through 5, where 1 represents the best and 5 the worst. Suppose someone alters the scores after the evaluation is given. Thanks.

  8. Nuno Lagoa says:

    This law doesn't apply to all numbers. You have to be selective. It only applies to quantities that change a little over time.
    This is why it can be used on stock prices and tax returns: it is extremely unlikely that either jump n-fold; most of the time they change by small percentages.

    This is better illustrated with an example. Say a product starts out costing $1000. With inflation at, say, 3%, the price increases to $1030 after the first year, then $1060.90 ($1000*1.03*1.03), then $1092.727 ($1000*1.03*1.03*1.03) and so on. As you can see, the leading digit stays 1 for a long time. If you wait even longer, eventually the price will reach $2000, so 1 is not the leading digit any more. But now the price will stay in the $2000-2999.99 bracket for a shorter period of time, because even if inflation were to stay at 3%, 3% of $2000 makes the price increase by increments of $60, not $30 as when the price started at $1000.
    If you extend this argument you will see that as the price increases, it will stay for shorter periods of time in the same bracket.
    Well, eventually that price will reach $10000. At this point the price will stay in the $10000-$19999.99 bracket for as long as it did in the $1000-$1999.99 bracket, which is much longer than it spent in any of the higher-digit brackets in between.
    Get it? It's beautiful!
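The inflation argument can be checked with a few lines of Python. This is only a sketch of the example above: it assumes a constant 3% yearly increase, starts at $1000, and counts how many yearly price readings begin with each digit over 1000 years.

```python
from collections import Counter

price = 1000.0
years_with_digit = Counter()

for _ in range(1000):                       # one price reading per year
    leading = int(f"{price:.15e}"[0])       # first significant digit of the price
    years_with_digit[leading] += 1
    price *= 1.03                           # assumed constant 3% inflation

share_of_ones = years_with_digit[1] / 1000
for digit in range(1, 10):
    print(f"{digit}: {years_with_digit[digit]} years")
print(f"share of years with a leading 1: {share_of_ones:.1%}")
```

The price spends roughly the same number of years in every 1-to-2 bracket ($1000 to $2000, $10000 to $20000, and so on), because doubling at a fixed growth rate always takes the same time.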

  9. moronmark says:

    A long time ago I read that the proportion of digits (1 to 9) from a book of log tables would match the proportions of those digits on a slide rule.
    Is this the same law?

  10. Ken Kaczmarek says:

    @ Geoff — re: Mac version, it is on the feature list for sure (we get this feature request more than any other). Send us a quick email to support -at- kirix -dot-com and we'll contact you when the beta is available.

    @Eloy: as Nuno described with his very nice example, it would not apply to your survey. One of the other things to note about Benford's is that it doesn't work with data sets limited to certain categories, even if the original set followed Benford. For instance, if the data is filtered to see “only ages 30 to 40” or “all lakes less than 50 feet deep”, Benford will not apply.

    @moronmark — I'm not sure whether it is the same, but it sounds good. ;) However, interesting to note the legend surrounding the discovery of this law. Benford had noticed that the first pages of a logarithm book he was using were more worn than the latter pages. Based on this observation, he began to investigate why this was the case, which led him to his discovery (historical note: some claim Benford first observed this, others claim that an astronomer Simon Newcomb first observed it).

  11. Weaver says:

    Is there a proof of this law? I would love to read it.

  12. Ken Kaczmarek says:

    Hi Weaver, these may be helpful to you:

    http://www.mathpages.com/home/kmath302/kmath302.htm
    http://mathworld.wolfram.com/BenfordsLaw.html
    http://plus.maths.org/issue9/features/benford/

  13. Dean says:

    I just ran several “naturally occurring” datasets, and I found that about half of the curves resembled the Benford's curve with some imagination. Some were total opposites. It appears to be random. Sorry, I was excited about it too! By the way, the check writing story is not naturally occurring data either. My phone bill usually starts with 8; does that mean Verizon calculated it wrong?

  14. Ken Kaczmarek says:

    Hi Dean,

    It can only mean one thing: FRAUD! ;) Just kidding… it's hard to say without looking at the data. Are you using data from the web and, if so, can you post some links? A few other related notes:

    1. Your phone bill may just need a much larger sample (say, all phone bills that the phone company gives out). The sample size of an individual phone bill probably won't be enough to prove anything one way or another. As mentioned above, this might fall into a “category” of a larger sample (“my phone bill only”), whereas if you took the full set, it would show a Benford's distribution.

    2. As for checks, in the example, I only used the checks that I got from the case (listed in one of the articles I linked to), not the entire data set. An auditor would basically use Benford's as a thumbnail estimate to see whether or not something is worth investigating further. He may run Benford's on the entire set and see that there are blips in 8s and 9s and then dig down into the data to pinpoint the issue. Benford's is definitely used in this area; we've run into it in the accounts payable industry on numerous occasions. However, I'm not fully knowledgeable as to the exact steps the auditors use when applying Benford's. At some point you need to have a decent set of data or else it is going to be skewed.

    3. If you've got a naturally occurring data set that should follow Benford's but doesn't, it may not be fraud but something else. For example, if an employee can submit expenses up to $25 without authorization but needs manager approval for anything $25 and above, Benford's (particularly 2-digit resolution) may show you a huge spike in the number 24… this likely has less to do with fraud and more to do with people just not wanting to bother to get approval.
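The “2-digit resolution” mentioned in point 3 uses the same formula as the single-digit table, applied to the first two digits: P(n) = log10(1 + 1/n) for n from 10 to 99. Below is a small Python sketch of the $25 approval scenario. The claim amounts are hypothetical, invented only to show what a spike at 24 would look like.

```python
import math
from collections import Counter

def first_two_digits(amount: float) -> int:
    """First two significant digits as an integer from 10 to 99."""
    mantissa = f"{abs(amount):.15e}"      # e.g. '2.499000000000000e+01'
    return int(mantissa[0] + mantissa[2])

def expected_share(two_digits: int) -> float:
    """Benford's expected share for a given first-two-digits value (10-99)."""
    return math.log10(1 + 1 / two_digits)

# Hypothetical expense claims clustered just under a $25 approval limit.
claims = [24.99, 24.50, 24.00, 24.75, 12.10, 31.40, 18.25, 24.95, 47.00, 24.10]
counts = Counter(first_two_digits(c) for c in claims)

observed = counts[24] / len(claims)
print(f"24: observed {observed:.0%}, expected {expected_share(24):.1%}")
```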

    Anyway, if you have more info on the data sets you are looking at, lemme know and I'll be happy to take a look at ‘em.

  15. sumati says:

    nice analysis, thanks for sharing.

  16. pgr says:

    Phone bills probably do not follow Benford's Law, unless you take them from a properly sampled set in numerous currencies. In the United States, a basic phone bill is typically $25 per month per line, plus taxes and long distance. Most people aren't going to spend enough on long distance to push it over $100, so there will be no bills between $10 and $19.99, and very few over $100, so there will be few that start with 1. Other countries will have similar biases, but not at the same number. For example, however many Euros a typical French phone bill comes to, the price will have a floor that applies everywhere. So if you took every phone bill in the world, without converting the local currency, you probably would get a Benford distribution.

    I doubt that individual stock prices will follow the “law”, either, because companies usually want their stock price to be about $30 to $60, and will use splits or reverse splits to keep it there, so again it's not naturally occurring.

    Things like baseball batting averages are also going to be selected away. It's impossible to hit 1.000, and anyone hitting under 0.200 will get sent to the minors, so there are a lot of 2s, a few 3s, the very rare 4, and no more. On the other hand, number of hits will have a lot of 1s. Anyone not getting 100 hits in a season won't be around long, and few players get more than 200.

    I can't figure out all the conditions that make it work, but it certainly will if it grows exponentially, as discussed by Nuno Lagoa, or when it's something that becomes progressively rarer but the distribution covers at least a full order of magnitude, like lake depth. I bet the number of bytes in each file in a random collection would work, too.

  17. Roundup Thursday for the Week of 7/20/08 says:

    […] has a video illustrating Benford's Law against Digg post submissions. They also use the law to show how a woman in Arizona was making fake payments to a fictional […]

  18. David says:

    I didn't bother watching the movie, but we did an experiment like this in a statistics class in college. Everybody looked at a random address in the phone book and we wrote down the first number. Plotted it, and sure enough we got this curve.

  19. Joe Mayer says:

    RE: Telephone bills and sports statistics: I agree with the folks who said that these would not be regular “natural” statistics, but I bet you could easily get them to follow the same sort of formula. For phone bills, for example, take that theoretical minimum of around $25 per month and subtract it from each phone bill value. You should then have a range extending from zero to some upper limit, and that range would probably follow a Benford curve. Similarly, if you were to take all of the batting averages and subtract .200 from them, you would wind up with a set of values from zero to probably around .150 or so, with a few odd points out there. Actually, now that I think about it, I would probably try that one by finding the “average average,” so to speak, and plotting the difference from that mean point. I'd be willing to bet, although I might not wager that much, that the plot of deviation from the mean batting average would probably follow a Benford curve, too. Anyway, just thinking rambling thoughts at 2:30 in the morning. Feel free to disregard…

  20. Primordial Ooze : Benford's Law says:

    […] Fun (and Fraud Detection) with Benford's Law | Data and the Web […]

  21. Mesothelioma says:

    But why does this happen?

  22. Ken Kaczmarek says:

    Here's a pretty good “practical” explanation from http://www.rexswain.com/benford.html using stock prices as an example:

    ====
    Dow Illustrates Benford's Law
    To illustrate Benford's Law, Dr. Mark J. Nigrini offered this example:

    “If we think of the Dow Jones stock average as 1,000, our first digit would be 1.

    “To get to a Dow Jones average with a first digit of 2, the average must increase to 2,000, and getting from 1,000 to 2,000 is a 100 percent increase.

    “Let's say that the Dow goes up at a rate of about 20 percent a year. That means that it would take five years to get from 1 to 2 as a first digit.

    “But suppose we start with a first digit 5. It only requires a 20 percent increase to get from 5,000 to 6,000, and that is achieved in one year.

    “When the Dow reaches 9,000, it takes only an 11 percent increase and just seven months to reach the 10,000 mark, which starts with the number 1. At that point you start over with the first digit a 1, once again. Once again, you must double the number — 10,000 — to 20,000 before reaching 2 as the first digit.

    “As you can see, the number 1 predominates at every step of the progression, as it does in logarithmic sequences.”
    ====

    For the actual mathematics, here are a couple links you can investigate:
    http://www.mathpages.com/home/kmath302/kmath302.htm
    http://mathworld.wolfram.com/BenfordsLaw.html

  23. jody says:

    This explains why I can't seem to break 100 playing golf!

  24. Carlos says:

    Damn, dude! Get a life…

  25. danielmadv says:

    At the next link, Mark Nigrini explains Benford's Law; most interesting is his commentary on the Enron data and how Benford's Law could have warned of that fraud.
    http://fraudit.blogspot.com/2009/01/nigrini-y-ley-de-benford.html

    But be careful, because in statistics there are Type I and Type II errors; an explanation of these and their implications for Benford's Law is at the next link:
    http://fraudit.blogspot.com/2009/01/nigrini-y-ley-de-benford.html

  26. Tom says:

    How can I use this to play lotto?

  27. Ken Kaczmarek says:

    As far as lotto… bottom line, you're out of luck. ;) The lottery is based on random numbers (number 10 has the same chance of appearing as the number 50). Benford works on “naturally occurring”/logarithmic amounts.

  28. Benford's Law and Fraud Detection Analysis | Strata Extensions says:

    […] The Benford's Law and Fraud Detection Analysis enables you to graph a data set against a Benford's law curve to find abnormalities within the data. This enables you to quickly ascertain the accuracy of the data, which is particularly helpful for detecting fraud in various business data such as check payment amounts. See a video of this extension here. […]

  29. Dusan Marjanov says:

    But if we switch from decimal to binary numbers, every binary number will start with 1! Thus, the first digit is 1 in 100% of cases!
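The binary observation agrees with the general form of the law: in base b, the expected share for leading digit d is log_b(1 + 1/d), for d from 1 to b-1. A short Python sketch (the function name and the choice of base 8 as a second example are just for illustration):

```python
import math

def benford_probability(digit: int, base: int = 10) -> float:
    """Expected share of leading digit `digit` when numbers are written in `base`."""
    return math.log(1 + 1 / digit, base)

# Base 2 has only one possible leading digit, so its share is the whole distribution.
print(benford_probability(1, base=2))

# Base 8 for comparison: seven possible leading digits, still skewed toward 1.
print([round(benford_probability(d, base=8), 3) for d in range(1, 8)])
```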

  30. Eric says:

    Wow! I am an auditor and we use “judgemental sampling” in my department … in other words we look at a set of records (disbursement checks, for example) and pick “x” number of them to test. Usually, that leads to samples based on interesting vendors or something like that. I decided to look into a better way to pick samples, so I tried using Benford's Law. Amazing! Out of a sample set of about 14,000, 5, 7, and 9 were off. The rest were within 5% of the predicted value. So, I used some other statistical analysis on the 5, 7, and 9 data sets and have, thus far, uncovered six fraudulent schemes. I love numbers!

  31. Ken Kaczmarek says:

    Eric, that's great! Thanks for sharing. By the way — we've got a couple other internal auditor extensions that we've been wanting to build (statistical sampling, stratification, etc.), so shoot us a support email (support@kirix.com) if you'd be interested in knowing when those get released for Strata too.

  32. Luc says:

    This isn't just interesting; this is fascinating! I did the same as the first poster before I even read the reply, but with another random generator, which generated ‘random’ numbers from one to a million. At first it was close to Benford's law, but after about 500 thousand it drifted further off. The end result was this:
    1: 241098 = 24.11%
    2: 183346 = 18.33%
    3: 145647 = 14.56%
    4: 117081 = 11.71%
    5: 94570 = 9.46%
    6: 77008 = 7.70%
    7: 60312 = 6.03%
    8: 46829 = 4.68%
    9: 34097 = 3.41%
    (the first number is how often it was chosen out of a million times)

    Anyway, thanks for making the video. I've enjoyed it :)

  33. learned something new | davehamel.com says:

    […] cool part is, forensic investigators use it when the think some one is cooking the books! Check out this cool video on it and actual data sets. It's kind of freaky […]

  34. Benford's Law : How Credit Cards Detect Frauds « The Gifted's Hub says:

    […] Leading Digit Probability 1 30.1% 2 17.6% 3 12.5% 4 9.7% 5 7.9% 6 6.7% 7 5.8% 8 5.1% 9 4.6% Now let's see in more details how this works. The following video is taken from the website called Kirix. […]

  35. Mark says:

    This actually makes perfect sense, and none of you should be surprised. The people who are assuming 1-9 are the original digits and using rand() are thinking about it the wrong way. In naturally occurring data sets, there is no 1-10 limit. It's orders of magnitude… Let's do 9 sets… the first one contains just 1, so the chance of picking a 1 is 100%; the second set contains 1 and 2, so the chance is 50%; the third set contains 1 through 3, so the chance of a 1 is 33%. When you add these percentages up through nine, then divide by nine, guess what you get… 31.4%… and this is a very coarse discretization.
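The arithmetic in this argument is quick to verify. The sketch below assumes nine sets, where set n holds the whole numbers 1 through n, and averages the chance of drawing a 1 from each:

```python
# Set n contains 1..n, so the chance of drawing a 1 from it is 1/n.
upper_limits = range(1, 10)
chance_of_one = [1 / n for n in upper_limits]

average = sum(chance_of_one) / len(chance_of_one)
print(f"{average:.1%}")  # prints 31.4%
```

This lands close to Benford's 30.1% for a leading 1, though it is only a rough heuristic and not a derivation of the law.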

  36. Mike Blakley says:

    There are also quite a number of good articles on the subject at Dr. Mark Nigrini's site, nigrini dot com. He is an accounting professor who wrote his PhD thesis on Benford's law.

  37. me says:

    It is seriously cool, as it is completely scale invariant. Measure it in whatever unit you like and it will always match. Then again, why should nature change its rules depending on how we measure it?
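Scale invariance is easy to see on synthetic data. The Python sketch below does not use real prices: it generates values spread evenly on a log scale over six orders of magnitude (which follow Benford's law by construction) and then multiplies them by 0.62, an arbitrary made-up conversion rate.

```python
import math
import random
from collections import Counter

def digit_shares(values):
    """Share of values starting with each digit 1-9, in order."""
    counts = Counter(int(f"{v:.15e}"[0]) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]

random.seed(1)  # fixed seed so the run is repeatable
prices = [10 ** random.uniform(0, 6) for _ in range(100_000)]
converted = [p * 0.62 for p in prices]   # 0.62 is an arbitrary, made-up rate

rows = zip(digit_shares(prices), digit_shares(converted))
for d, (before, after) in enumerate(rows, start=1):
    print(f"{d}: {before:.1%} vs {after:.1%} (Benford {math.log10(1 + 1 / d):.1%})")
```

Multiplying by a constant just shifts every value by the same amount on the log scale, so the leading-digit shares stay put.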

  38. Irrational rotations of the circle and Benford's law « Division by Zero says:

    […] do not follow Benford's law. For more information you may want to read other accounts of Benford's law on the […]

  39. Nudging scientists to share data more « ceptional says:

    […] when humans do this they often leave tell-tale signs that indicate the data were tampered with. See Benford's Law for one example. I know, I know, perhaps only the stupid scientists wouldn't be able to […]

  40. Chris says:

    Perhaps I don't understand this aspect of the opener video fully, but aren't the amounts paid to a vendor pre-assigned in every case? Wouldn't they not follow Benford's Law because they instead correlate to a fixed amount of goods/services that this company would be supposedly providing?

  41. Ken Kaczmarek says:

    @Chris: a vendor number (e.g., 0341993) would be pre-assigned, but any invoice price can vary from one to the next (10 widget A @ $15; 4 widget B @ $22, etc.). The “random” nature of the total price would tend to follow Benford.

  42. Ron says:

    I've just learned about Benford's law and have decided to apply it to the data from a potentially fraudulent scientific paper. But it's kind of hard to learn all the ins and outs. I decided to use only numbers that are part of the statistical analysis, such as values for correlation coefficients, t-tests, and so on. But I excluded things that I wouldn't expect to be fudged, like sample size (and of course not dates and not numbers cited from other people's work). I would like to know what others think about the applicability of Benford's Law in this case. I'm still extracting numbers (by hand, because I have to be able to judge what is a statistic and what isn't). The distribution so far with n = 155 is as follows:

    1 16.1%
    2 29.0%
    3 9.0%
    4 17.4%
    5 12.3%
    6 7.1%
    7 5.2%
    8 1.9%
    9 1.9%

    Not at all what Benford's law predicts, but there are so many unanswered questions: What is an adequate sample size? Does the fact that some numbers keep recurring in the data mean that I should edit it? For example the power test always comes out around d = 2.xx and the hit rates for different conditions are all in the range of 50-60%. How can I learn a lot more about the forensic use of Benford's Law, i.e. all the relevant considerations? I've been reading tons so far but nothing to answer the above.
    ron
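One common way to put a number on “not at all what Benford's law predicts” is a chi-square goodness-of-fit test. The Python sketch below applies it to the percentages listed in this comment, converting them back to counts with n = 155. It is only an illustration of the mechanics, and the 5% significance level is a conventional choice, not something taken from the comment.

```python
import math

n = 155
# Leading-digit percentages for digits 1-9, as listed in the comment above.
observed_pct = [16.1, 29.0, 9.0, 17.4, 12.3, 7.1, 5.2, 1.9, 1.9]

observed = [round(n * p / 100) for p in observed_pct]            # back to counts
expected = [n * math.log10(1 + 1 / d) for d in range(1, 10)]     # Benford counts

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Critical value of the chi-square distribution, 8 degrees of freedom, 5% level.
CRITICAL_5PCT_8DF = 15.507
print(f"chi-square = {chi_square:.1f} (critical value {CRITICAL_5PCT_8DF})")
```

A statistic above the critical value only says the digits differ from Benford's distribution. It cannot say whether these kinds of numbers (bounded correlation coefficients, hit rates clustered at 50-60%) should be expected to follow Benford's law in the first place.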

  43. anonymous says:

    Does anyone here recognize the correlation of this principle and the golden ratio?
    *Gather data sets into separate bins
    *Analyze mean data of all bins into one set of “0-9” data points.
    *Total of all figures = total ‘area’ of a perfect rectangle.
    *Divide total by each number, highest to lowest.
    The result should be a close representation of the ratios of the areas of golden triangles in descending order.
    Maybe a derivative of the Fibonacci sequence the same way Pareto's Law is, but interesting just the same.

  44. Bill says:

    I wonder how this applies to white noise static? How about to breaking numeric codes? At one point we had an algorithm that required a random number set to simulate background noise so we took random numbers from a calculator to provide the data. Based on Benford's Law, I wonder if the resulting algorithm was accurate or not…

  45. hplc says:

    That is some crazy math. Never even heard of Benford's law before.

  46. Mark J. Nigrini says:

    Earlier this year I published a book on the topic (“Benford's Law: Applications for forensic accounting, auditing, and fraud detection”, Wiley, 2012). The first few chapters review the maths (the effect of multiplying the numbers by a constant, changing the base, and so on). In the book there are many applications including the census numbers, election results, stream flow numbers, and tax return numbers. The companion site http://www.nigrini.com/benfordslaw.htm has free Excel templates, data sets, photos, and other interesting items. Enjoy.

About

Data and the Web is a blog by Kirix about accessing and working with data, wherever it is located. We have a particular fondness for data usability, ad hoc analysis, mashups, web APIs and, of course, playing around with our data browser.