Data and the Web

Fun (and Fraud Detection) with Benford’s Law

Benford’s law is one of those things your high school math teacher would break out on a slow, rainy day when the students’ attention span was even lower than usual.

He’d start out by asking the class to look at a list of numbers and predict how often each digit from 1 to 9 would show up as the leading digit.  The students would make some guesses and eventually come to the consensus that the probabilities would be pretty close to equal, about 11% each.

Then, the teacher would just sit back, smile, and gently shake his head at his simple-minded pupils.  He would then go on to explain Benford’s law, which would blow everyone’s mind — at least through lunchtime.

Play Benford’s Law Video

Per Wikipedia:

Benford’s law, also called the first-digit law, states that in lists of numbers from many real-life sources of data, the leading digit is distributed in a specific, non-uniform way.

Specifically, in this way:

Leading Digit     Probability
      1              30.1%
      2              17.6%
      3              12.5%
      4               9.7%
      5               7.9%
      6               6.7%
      7               5.8%
      8               5.1%
      9               4.6%
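
The percentages in this table come from a simple formula: the expected share of numbers with leading digit d is log10(1 + 1/d). Here is a minimal Python sketch that reproduces the table; the function name `benford_probability` is just an illustrative choice, not something from the Strata extension.

```python
import math


def benford_probability(digit: int) -> float:
    """Expected share of numbers whose leading digit is `digit` (1-9)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Print one row per leading digit, formatted as a percentage.
for d in range(1, 10):
    print(f"{d}  {benford_probability(d):6.1%}")
```

The nine probabilities add up to 1, because the terms log10((d + 1) / d) telescope to log10(10).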

Again, from Wikipedia:

This counter-intuitive result applies to a wide variety of figures, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature).

Boiling it down, this means that for almost any naturally occurring data set, the number 1 will appear as the leading digit about 30% of the time.  Here, “naturally occurring” can mean check amounts, stock prices or website statistics.  Non-naturally occurring data would be pre-assigned numbers like postal codes or UPC numbers.

Besides being fun to play with, Benford’s law is used in the accounting profession to detect fraud.  Because data like tax returns and check registers tend to follow Benford’s, auditors can use it as a high-level check of a data set.  If there are anomalies, they may be worth investigating more closely as potential fraud.
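
To give a rough idea of what such a high-level check could look like, here is a small Python sketch. It is only one possible approach, not necessarily what auditors or the Strata extension actually do: it counts leading digits and computes a chi-square statistic against the Benford percentages. The helper names and both sample data sets are invented for illustration; 15.507 is the standard 5% critical value for a chi-square test with 8 degrees of freedom.

```python
import math
from collections import Counter

# 5% critical value of the chi-square distribution with 8 degrees of freedom
CRITICAL_5_PERCENT_8_DF = 15.507


def leading_digit(value: float) -> int:
    """First non-zero digit of a number, ignoring sign and decimal point."""
    value = abs(value)
    if value == 0:
        raise ValueError("zero has no leading digit")
    # Scientific notation puts the leading digit first, e.g. '4.25...e+02'.
    return int(f"{value:.16e}"[0])


def benford_chi_square(values) -> float:
    """Chi-square statistic of observed leading digits vs. Benford's law."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    total = len(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic


# Invented examples: powers of 2 are known to track Benford closely,
# while a batch of amounts that all start with 8 clearly does not.
natural = [2 ** n for n in range(100)]
suspicious = [8000 + 7 * n for n in range(100)]

for name, data in (("natural", natural), ("suspicious", suspicious)):
    stat = benford_chi_square(data)
    verdict = "worth a closer look" if stat > CRITICAL_5_PERCENT_8_DF else "looks fine"
    print(f"{name}: chi-square = {stat:.2f} ({verdict})")
```

A large statistic only says the digits do not look like Benford’s; as with any screening test, it is a reason to dig into the data, not proof of fraud.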

If you’re interested in further information about fraud detection using Benford’s, definitely give these two articles by Malcolm W. Browne and Mark J. Nigrini a read.

Try It Out for Yourself

Take a look at the demonstration video above to see Benford’s law in action with data sets from the web.  If you’d like to play with it yourself, just install the Benford’s Law extension for Kirix Strata and have fun.

Also, please note that I used the following data sets in the video, if you’d like to give those a spin:

Wikipedia List of Lakes in Minnesota
US Census Data Sets
Social Blade - Digg Statistics

And here are a few other worthy ones that didn’t make it in the video:

NASDAQ Historical Stock Price
Wikipedia List of Countries by Population
And plenty more at Delicious here…

Enjoy!

24 Responses to “Fun (and Fraud Detection) with Benford’s Law”

  1. Geoff Hotchkiss says:

    This interested me so I wondered whether or not this would work with random numbers. I decided to test this with the Random class in Java and I calculated the percentages. I tried it with just the numbers 1-9 and then with three-digit numbers 100-999 and both came to be about 11% for each leading digit. I guess that’s because of the way the Random class generates the numbers and it was an unexpected result that I thought was kind of interesting.
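
The experiment described in this comment is easy to repeat. The sketch below uses Python’s `random` module rather than the Java Random class mentioned above; the sample size and seed are arbitrary choices. Uniformly drawn three-digit numbers should give each leading digit a share of roughly 1/9, or about 11%.

```python
import random
from collections import Counter


def leading_digit_shares(numbers):
    """Share of each leading digit (1-9) among positive integers."""
    counts = Counter(int(str(n)[0]) for n in numbers)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


rng = random.Random(42)  # fixed seed so the run is repeatable
sample = [rng.randint(100, 999) for _ in range(90_000)]
shares = leading_digit_shares(sample)

for d, share in shares.items():
    print(f"{d}: {share:.1%}")
```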

  2. Ken Kaczmarek says:

    Hi Geoff,

    Yeah, I agree, it’s really interesting. When I was prepping for the video, I actually did the same thing, but because the video was getting a little long, I had to cut it. If you’ve downloaded Strata, you can try this out pretty quickly:
    1. Open up a data set, then right-click on the field name and select “Insert calculated field”.
    2. Type in rand(), which will populate your calculated field with random numbers; you can play with the number of digits by doing something like rand() * 100; then hit OK.
    3. Run your Benford’s test. (One note: the rand() function is dynamic, so the values change each time you do something with the field; you can keep hitting “Graph” and you’ll get new random graphs each time. To set this as static, you’d right-click on the field header and select “Convert to fixed fields”.)

    Overall, it is pretty amazing what Benford’s does apply to: stock market data, for instance. Another thing I cut from the video was taking 200 random stock quotes and running Benford’s on price and volume… amazingly, it worked like a charm. Not only that, you can multiply the stock price by currency conversions (say Yen or Euro) and Benford’s still holds. :)

    Thanks for the comment!
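
The currency-conversion point can be tried out with a short Python sketch. Note the assumptions: the “prices” here are synthetic values spread evenly on a log scale over four orders of magnitude (a stand-in for real quotes, which are not included here), and the exchange rate of 151.3 is a made-up number.

```python
import math
import random
from collections import Counter


def leading_digit_shares(values):
    """Share of each leading digit (1-9) among positive numbers."""
    counts = Counter(int(f"{v:.16e}"[0]) for v in values)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


rng = random.Random(7)  # fixed seed so the run is repeatable
# Synthetic prices between 1 and 10,000, uniform on a logarithmic scale.
prices = [10 ** rng.uniform(0, 4) for _ in range(50_000)]
converted = [p * 151.3 for p in prices]  # hypothetical exchange rate

before = leading_digit_shares(prices)
after = leading_digit_shares(converted)

print("digit  Benford  original  converted")
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)
    print(f"{d:>5}  {expected:7.1%}  {before[d]:8.1%}  {after[d]:9.1%}")
```

Multiplying by a constant just shifts every value by the same amount on the log scale, which is why the leading-digit shares should stay close to the Benford column.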

  3. brian fantana says:

    you should be glad your Random class generates with 11% distribution. otherwise it’s broken ;)

  4. Geoff Hotchkiss says:

    Haha yeah… I was kind of thinking that anything but 11% would mean it’s broken but was still curious. I also modified my program to do 1-9999 instead of just checking numbers with the same number of digits and still 11%. It does make sense that it should be 11% but I still had to try.

    Ken, I’m going to have to download Strata and play around with it, although I do wish there was an OS X version. I’ll just have to install it on my VM; it looks like a very fun program!

  5. dtret says:

    Wow. Just amazing.

    I love that the founder of Digg has a 97% chance of having his choices land on the front page.

  6. Chatham Harrison says:

    The whole point is that the numbers that this Law is referring to are not truly random, which means that they’re predictable (every statistician’s favorite quality). This is simple probability. It’s basically gambling, if you were going all in every round. Let’s say you have a 20% chance of winning any hand. You have a 20% chance of simply winning the first hand, but since you’re all in, winning the second hand is predicated on you winning the first, so your chance of winning both hands is only 4%. The result will fit a logarithmic curve, as in Benford’s Law. In any situation where there are multiple data points, all of them moving away from zero, and there is a constant probability of further movement less than one, the resulting distribution will be logarithmic. And forgive me if that sleep-induced, hurried definition is lacking; I’m a PoliSci major, not a mathematician.

  7. Eloy says:

    Can Benford’s Law be used in detecting fraud within an evaluation survey? For example: 1 through 5, where 1 represents the best and 5 the worst. Suppose someone alters the scores after the evaluation is given. Thanks.

  8. Nuno Lagoa says:

    This law doesn’t apply to all numbers. You have to be selective. It only applies to quantities that change a little over time.
    This is why it can be used on stock prices and tax returns: it is extremely unlikely that either jump n-fold; most of the time they change by small percentages.

    This is better illustrated with an example. Say a product starts out costing $1000. With inflation at, say, 3%, the price increases to $1030 after the first year, then $1060.90 ($1000*1.03*1.03), then $1092.73 ($1000*1.03*1.03*1.03) and so on. As you can see, the leading digit stays at 1 for quite a while. If you wait even longer, eventually the price will reach $2000, so 1 is not the leading digit any more. But now the price will stay in the $2000-2999.99 bracket for a shorter period of time, because even if inflation were to stay at 3%, 3% of $2000 makes that price increase by increments of $60, not $30 as when the price started at $1000.
    If you extend this argument you will see that as the price increases, it will stay for shorter periods of time in the same bracket.
    Well, eventually that price will reach $10000. At this point the leading digit is 1 again, and the price will stay in the $10000-$19999.99 bracket for about as long as it stayed in the $1000-$1999.99 bracket, which is far longer than it spent in the $9000s.
    Get it? It’s beautiful!
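
The inflation argument in this comment can be checked with a few lines of Python. This is only a sketch under the comment’s own assumptions (a $1000 starting price and a steady 3% yearly increase); the 1000-year horizon is an arbitrary choice so that the price passes through many orders of magnitude. At 3% a year, getting from $1000 to $2000 takes about 23 years, while getting from $2000 to $3000 takes only about 14.

```python
import math
from collections import Counter


def years_per_leading_digit(start=1000.0, rate=0.03, years=1000):
    """Count how many years the price spends on each leading digit."""
    counts = Counter()
    price = start
    for _ in range(years):
        counts[int(f"{price:.16e}"[0])] += 1
        price *= 1 + rate  # one year of steady growth
    return counts


counts = years_per_leading_digit()
total = sum(counts.values())

print("digit  share of years  Benford")
for d in range(1, 10):
    print(f"{d:>5}  {counts[d] / total:14.1%}  {math.log10(1 + 1 / d):7.1%}")
```

The shares of years should land close to the Benford percentages, though not exactly on them, since 1000 years does not cover a whole number of powers of ten.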

  9. moronmark says:

    A long time ago I read that the proportion of digits (1 to 9) from a book of log tables would match the proportions of those digits on a slide rule.
    Is this the same law?

  10. Ken Kaczmarek says:

    @Geoff, re: Mac version: it is on the feature list for sure (we get this feature request more than any other). Send us a quick email to support -at- kirix -dot-com and we’ll contact you when the beta is available.

    @Eloy, as Nuno described with his very nice example, it would not apply to your survey. One of the other things to note about Benford’s is that it doesn’t work with data sets limited to certain categories, even if the original set followed Benford’s. For instance, if the data is filtered to see “only ages 30 to 40” or “all lakes less than 50 feet deep”, Benford’s will not apply.

    @moronmark, I’m not sure whether it is the same, but it sounds good. ;) However, it’s interesting to note the legend surrounding the discovery of this law. Benford had noticed that the first pages of a logarithm book he was using were more worn than the latter pages. Based on this observation, he began to investigate why this was the case, which led him to his discovery (historical note: some claim Benford first observed this, others claim that the astronomer Simon Newcomb first observed it).

  11. Weaver says:

    Is there a proof of this law? I would love to read it.

  12. Ken Kaczmarek says:

    Hi Weaver, these may be helpful to you:

    http://www.mathpages.com/home/kmath302/kmath302.htm
    http://mathworld.wolfram.com/BenfordsLaw.html
    http://plus.maths.org/issue9/features/benford/

  13. Dean says:

    I just ran several “naturally occurring” datasets, and I found that about half of the curves resembled the Benford’s curve with some imagination. Some were total opposites. It appears to be random. Sorry, I was excited about it too! By the way, the check writing story is not naturally occurring data either. My phone bill usually starts with 8; does that mean Verizon calculated it wrong?

  14. Ken Kaczmarek says:

    Hi Dean,

    It can only mean one thing: FRAUD! ;) Just kidding… it’s hard to say without looking at the data. Are you using data from the web and, if so, can you post some links? A few other related notes:

    1. Your phone bill may just need to have a much larger sample (say, all phone bills that the phone company gives out). The sample size of an individual phone bill probably won’t be enough to prove anything one way or another. As mentioned above, this might fall into a “category” of a larger sample (“my phone bill only”), whereas if you took the full set, it would show a Benford’s distribution.

    2. As for checks, in the example, I only used the checks that I got from the case (listed in one of the articles I linked to), not the entire data set. An auditor would basically use Benford’s as a thumbnail estimate to see whether or not something is worth investigating further. He may run Benford’s on the entire set and see that there are blips in 8s and 9s and then dig down into the data to pinpoint the issue. Benford’s is definitely used in this area; we’ve run into it in the accounts payable industry on numerous occasions. However, I’m not fully knowledgeable as to the exact steps the auditors use when applying Benford’s. At some point you need to have a decent set of data or else it is going to be skewed.

    3. If you’ve got a naturally occurring data set that should follow Benford’s but doesn’t, it may not be fraud but something else. For example, if an employee can submit expenses up to $25 without authorization but needs manager approval for anything $25 and above, Benford’s (particularly 2-digit resolution) may show you a huge spike in the number 24… this likely has less to do with fraud and more to do with people just not wanting to bother to get approval.
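
The “2-digit resolution” idea can be sketched in Python as follows. The expected share of numbers starting with the two digits dd (10 through 99) is log10(1 + 1/dd). Everything about the data here is invented for illustration: a baseline of log-uniform expense amounts plus an extra cluster of claims just under a hypothetical $25 approval limit.

```python
import math
import random
from collections import Counter


def first_two_digits(value: float) -> int:
    """First two significant digits of a number, e.g. 24.95 -> 24."""
    s = f"{abs(value):.16e}"  # e.g. '2.4950000000000000e+01'
    return int(s[0] + s[2])


def two_digit_excess(values):
    """Observed minus expected share for each leading digit pair 10-99."""
    counts = Counter(first_two_digits(v) for v in values if v != 0)
    total = sum(counts.values())
    return {
        dd: counts[dd] / total - math.log10(1 + 1 / dd)
        for dd in range(10, 100)
    }


rng = random.Random(1)  # fixed seed so the run is repeatable
# Invented data: ordinary expenses plus a pile-up just below $25.
amounts = [10 ** rng.uniform(0, 3) for _ in range(5000)]
amounts += [rng.uniform(24.0, 24.99) for _ in range(500)]

excess = two_digit_excess(amounts)
worst = max(excess, key=excess.get)
print(f"largest spike at leading digits {worst}: +{excess[worst]:.1%}")
```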

    Anyway, if you have more info on the data sets you are looking at, lemme know and I’ll be happy to take a look at ‘em.

  15. sumati says:

    nice analysis, thanks for sharing.

  16. pgr says:

    Phone bills probably do not follow Benford’s Law, unless you take them from a properly sampled set in numerous currencies. In the United States, a basic phone bill is typically $25 per month per line, plus taxes and long distance. Most people aren’t going to spend enough on long distance to push it over $100, so there will be no bills between $10 and $19.99 and very few over $100, which means few will start with 1. Other countries will have similar biases, but not at the same number. For example, however many euros a typical French phone bill comes to, the price will have a floor that applies everywhere. So if you took every phone bill in the world, without converting the local currency, you probably would get a Benford distribution.

    I doubt that individual stock prices will follow the “law”, either, because companies usually want their stock price to be about $30 to $60, and will use splits or reverse splits to keep it there, so again it’s not naturally occurring.

    Things like baseball batting averages are also going to be selected away. It’s impossible to hit 1.000, and anyone hitting under 0.200 will get sent to the minors, so there are a lot of 2s, a few 3s, the very rare 4 and no more. On the other hand, number of hits will have a lot of 1s. Anyone not getting 100 hits in a season won’t be around long, and few players get more than 200.

    I can’t figure out all the conditions that make it work, but it certainly will if it grows exponentially, as discussed by Nuno Lagoa, or when it’s something that becomes progressively rarer but the distribution covers at least a full order of magnitude, like lake depth. I bet the number of bytes in each file in a random collection would work, too.

  17. Roundup Thursday for the Week of 7/20/08 says:

    […] has a video illustrating Benford’s Law against Digg post submissions. They also use the law to show how a woman in Arizona was making fake payments to a fictional […]

  18. David says:

    I didn’t bother watching the movie, but we did an experiment like this in a statistics class in college. Everybody looked at a random address in the phone book and we wrote down the first number. Plotted it, and sure enough we got this curve.

  19. Joe Mayer says:

    RE: Telephone bills and sports statistics: I agree with the folks who said that these would not be regular “natural” statistics, but I bet you could easily get them to follow the same sort of formula. For phone bills, for example, take that theoretical minimum of around $25 per month and subtract it from each phone bill value. You should then have a range extending from zero to some upper limit, and that range would probably follow a Benford curve. Similarly, if you were to take all of the batting averages and subtract .200 from them, you would wind up with a set of values from zero to probably around .150 or so, with a few odd points out there. Actually, now that I think about it, I would probably try that one by finding the “average average,” so to speak, and plotting the difference from that mean point. I’d be willing to bet, although I might not wager that much, that the plot of deviation from the mean batting average would probably follow a Benford curve, too. Anyway, just thinking rambling thoughts at 2:30 in the morning. Feel free to disregard…

  20. Primordial Ooze : Benford’s Law says:

    […] Fun (and Fraud Detection) with Benford’s Law | Data and the Web […]

  21. Mesothelioma says:

    But why does this happen?

  22. Ken Kaczmarek says:

    Here’s a pretty good “practical” explanation from http://www.rexswain.com/benford.html using stock prices as an example:

    ====
    Dow Illustrates Benford’s Law
    To illustrate Benford’s Law, Dr. Mark J. Nigrini offered this example:

    “If we think of the Dow Jones stock average as 1,000, our first digit would be 1.

    “To get to a Dow Jones average with a first digit of 2, the average must increase to 2,000, and getting from 1,000 to 2,000 is a 100 percent increase.

    “Let’s say that the Dow goes up at a rate of about 20 percent a year. That means that it would take five years to get from 1 to 2 as a first digit.

    “But suppose we start with a first digit 5. It only requires a 20 percent increase to get from 5,000 to 6,000, and that is achieved in one year.

    “When the Dow reaches 9,000, it takes only an 11 percent increase and just seven months to reach the 10,000 mark, which starts with the number 1. At that point you start over with the first digit a 1, once again. Once again, you must double the number — 10,000 — to 20,000 before reaching 2 as the first digit.

    “As you can see, the number 1 predominates at every step of the progression, as it does in logarithmic sequences.”
    ====

    For the actual mathematics, here are a couple links you can investigate:
    http://www.mathpages.com/home/kmath302/kmath302.htm
    http://mathworld.wolfram.com/BenfordsLaw.html

  23. jody says:

    This explains why I can’t seem to break 100 playing golf!

  24. Carlos says:

    Damn, dude! Get a life…

About

Data and the Web is a blog by Kirix about accessing and working with data, wherever it is located. We have a particular fondness for data usability, ad hoc analysis, mashups, web APIs and, of course, playing around with our data browser.