Using Spreadsheets and Benford’s Law to Test Accounting Data 

 

Accounting systems are popular targets of financial frauds because, in the words of bank-robber Willie Sutton, “that’s where the money is.” One common method thieves use to commit such fraud is to create fictitious accounting entities, e.g., bogus employee records or vendor payments, and then manipulate the fictitious records to their advantage. The success or failure of such scams rests in the ability to blend the bogus entries with legitimate data and, therefore, escape the notice of supervisors and auditors.

An interesting observation here is that most people are not very good at creating “natural data,” making it possible for good auditors to apply fairly simple statistical tools to reveal such inabilities. One such test is to see how well the data follow Benford’s Law.

Benford’s Law involves the distribution of lead digits in “naturally occurring numbers,” e.g., vendor payments, customer invoices, and similar financial values that occur in the normal course of business. For example, the lead digit in the vendor payment $123.45 is 1, the lead digit in a customer invoice amount of $4,231.55 is 4, and so forth. (All figures are provided in US dollars.)

What Frank Benford discovered is that these lead digits are not uniformly distributed, as one might surmise. Rather, the number 1 is by far the most likely to occur, followed by 2, 3 and so forth. To apply Benford’s Law, therefore, an accountant must count the number of times a 1 appears as the lead digit in the data values, the number of times a 2 appears, etc., and then examine the resulting frequency distribution. The distribution is “natural” if it follows Benford’s distribution, and suspect otherwise.
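Readers who want to verify these expected frequencies outside of Excel can compute them directly. The following Python sketch (the function name is illustrative, not from the article) evaluates the Benford probability log10(1 + 1/d) for each lead digit:

```python
import math

def benford_probability(d: int) -> float:
    """Expected relative frequency of lead digit d (1-9) under Benford's Law."""
    return math.log10(1 + 1 / d)

# Lead digit 1 is expected roughly 30.1% of the time; digit 9 only about 4.6%.
probabilities = {d: benford_probability(d) for d in range(1, 10)}
```

Because the nine probabilities sum to one, they form a complete distribution over the possible lead digits.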

Several professional accounting journals have published articles about Benford’s Law (see the suggested readings at the end of this article). But, most of these papers have been theoretical in nature or required users to download additional software to perform the requisite statistical tests. For Excel users, such add-ins are unnecessary.

This paper explains how to perform the tasks required to apply Benford’s Law with simple Excel formulas.

Testing Lead Digits Using Benford’s Law

Take the case of someone who wants to evaluate corporate purchase invoices—a popular target of corporate abuse. Although the amount of information contained in each invoice is likely to be considerable, this example will focus only on the purchase amounts. The goal of applying Benford’s Law here is to know how “natural” such transactions are. Figure 1 illustrates the required steps, which are further discussed in the sections that follow.

Figure 1

Step 1: Select the Sample Data
The first task is to obtain sample test data and store them in an Excel spreadsheet—the more observations included, the better. Using the data for a complete year is best, but if the number of items is large, smaller samples are permissible. For statistical reasons, however, there should be at least 100 observations. (Figure 1 shows an example with fewer than 100 observations for illustrative purposes only.)

Step 2: Parse the Lead Digit
As noted previously, Benford’s Law focuses on the lead digit in sets of naturally occurring numbers. The actual magnitude of the data (i.e., whether an amount is $10, $100 or $1,000) is unimportant. In a spreadsheet, one can select or “parse” the lead digit for each dollar amount (see figure 1), using Excel’s LEFT formula. The general form of this formula is:

=LEFT(Data Item, Number of Characters)

Here, the term “Data Item” is a cell reference and “Number of Characters” indicates how many characters to parse (starting from the left side of the name or number). If Number of Characters is 2, for example, Excel will parse the two left-most digits from the cell indicated, and if this value is 3, it will parse the three left-most digits. Only one character should be parsed for the task at hand, so the formula for cell D3, which parses the amount stored in cell C3, is:

=LEFT(C3, 1)

Because the value in cell C3 is “432.65,” the result is “4.” The result is not a dollar sign ($) because that character is simply a formatting symbol, which Excel ignores when performing the parsing task required. Once the formula for the first cell has been created in the spreadsheet, it can be copied to the subsequent cells in the column. Figure 1 shows the results in column D.
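The same parsing step can be expressed outside of Excel. This Python sketch (the function name is my own) extracts the first significant digit of an amount, skipping any leading zeros for amounts under one dollar:

```python
def lead_digit(amount: float) -> str:
    """Return the first significant digit of an amount, as a string."""
    # Keep only the digit characters, then drop leading zeros so that,
    # e.g., 0.45 yields "4" rather than "0".
    digits = "".join(ch for ch in str(abs(amount)) if ch.isdigit()).lstrip("0")
    return digits[0] if digits else "0"

assert lead_digit(432.65) == "4"   # matches the article's LEFT(C3, 1) example
```

As in the spreadsheet, the magnitude of the amount is irrelevant; only the leading digit survives.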

Step 3: Create a Frequency Distribution
The next step is to create a frequency distribution of the lead digits that have been parsed from the sample data. To do this, the headings of the table shown on the right side of figure 1 should be created, and the numbers "1," "2," ... , "9" should be stored in the first column under the heading “Digit.”

Now one can count how many invoice amounts start with each of these nine digits. (Zeros are ignored because amounts beginning with zero can be reduced to numbers beginning with the digits 1 through 9.) Although it is possible to use Excel’s FREQUENCY formula for this task, it is just as easy to use the COUNTIF formula, which counts the number of elements in a data range that match a specific value. The general form of this formula is:

=COUNTIF(Data Range, Criteria)

In this formula, the Data Range refers to the set of data one wishes to evaluate, and the Criteria parameter is typically either a literal value or a cell reference to such a value. For example, the formula COUNTIF(Z1:Z100, “Smith”) would return the number of cells that contain the name “Smith” in the data range Z1:Z100 while the formula COUNTIF(Z1:Z100, X3) would return the number of cells matching whatever value is stored in cell X3.

For the illustration at hand, the desire is to know how many times each of the nine lead digits appears in the test data. Thus, the formula for cell H4—the first cell in the column with heading “Actual”—is:

=COUNTIF($D$3:$D$32, G4)

In this formula, the Data Range identifies the cells in column D, i.e., the column containing the lead digits. It is important to note that the formula uses absolute references— $D$3:$D$32—instead of D3:D32. This makes it possible to copy this formula to the other cells in the table. The Criteria for this formula is the reference to cell G4, which contains the value “1.” Thus, the COUNTIF formula returns “4”—the number of cells in column D that contain this value. This can be verified by inspecting the data in column D.

Once the COUNTIF formula has been created for the first digit, this formula can be copied to the remaining cells in the table. Figure 1 illustrates the results. Thus, for this example, the number 1 appeared four times as the lead digit in the sample data, the number 2 appeared three times as the lead digit in the sample data, and so forth.
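The COUNTIF tally has a direct analogue in most languages. The sketch below uses Python's `Counter`; the sample amounts are hypothetical stand-ins, not the actual data from figure 1:

```python
from collections import Counter

# Hypothetical invoice amounts (illustrative only).
amounts = [432.65, 123.10, 88.20, 150.00, 912.45, 47.99]

# First significant digit of each amount, as a string.
lead_digits = [str(a).lstrip("0.")[0] for a in amounts]

# Equivalent of copying COUNTIF down the "Actual" column:
actual = Counter(lead_digits)
```

Each entry of `actual` corresponds to one row of the “Actual” column in figure 1.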

Step 4: Compute the Expected Distribution
What would be expected for the distribution of these lead digits? Benford’s Law predicts that approximately 30.1 percent of lead digits will be a 1, 17.6 percent of the lead digits will be a 2, and so forth. Column J of figure 1 shows the complete list of such percentages, which come from a logarithmic distribution and are explained more fully in the Nigrini article (see Suggested Readings).

From the percentages shown in column J of figure 1, one can work backward and compute the number of observations one would expect to find in the sample of 30. For each lead digit, the expected number is the percentage times the sample size. For example, for the lead digit of 1, the expected number of observations is 30.1 percent times 30 observations, or 9.03. Because cell H14 stores the total number of observations, the formula for the first number in column I (I4) is:

=J4*$H$14

In this formula, cell J4 contains the percentage amount (i.e., 30.1 percent for the first item) and cell H14 contains the total number of elements in the sample—in this case, 30. If one uses an absolute cell reference for cell H14 (i.e., $H$14), one can copy this formula to the other cells in column I. The results are as shown, e.g., an expected value of 5.28 for a lead digit of 2, an expected value of 3.75 for a lead digit of 3, and so forth. Of course, it is impossible ever to observe exactly 9.03 invoices with a lead digit of 1, or 5.28 invoices with a lead digit of 2. As with any other average, what is being computed here is what might be found if one conducted this experiment over and over, using different sample data each time.
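The expected counts in column I can be reproduced directly from Benford's formula. The following Python sketch multiplies each Benford probability by the article's sample size of 30:

```python
import math

n = 30  # sample size, as in figure 1

# Expected count for each lead digit: Benford probability times sample size.
expected = {d: n * math.log10(1 + 1 / d) for d in range(1, 10)}
# expected[1] is about 9.03, expected[2] about 5.28, expected[3] about 3.75,
# matching the values shown in column I of figure 1.
```

These fractional expected counts are averages over hypothetical repeated samples, never quantities one would observe exactly.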

Step 5: Plot the Results
Now there are two sets of values—the actual distribution of lead digits from the sample and the theoretical distribution of such digits as dictated by Benford’s Law. What one wants to know is how well these distributions match.

One way to answer this question is to plot these two sets of data and observe the results. To perform this task one can use Excel’s charting tools and create a bar graph like the one in the inset portion of figure 1. The expected values show the pattern one would expect to see according to Benford’s Law—an exponential decay pattern with a lead digit of 1 the most likely, a lead digit of 2 the next most likely, and so forth. The actual values show the distribution of lead digits actually found in the sample.

This charting work performs two useful tasks. First, the graph provides a visual answer to the question “how well do the sample data match the expected values?” For the example at hand, the answer is “not very well.” One sees, for example, that the lead digit of 1 occurred only about half as often as is expected, while the lead digit of 4 appeared much more often.

Second, the graph provides pictorial evidence of data spikes—in this case, for digit 4. Data spikes do not necessarily signal underlying problems such as fraud, but they do alert the accountant to the possibility of such problems. If the invoices illustrated here were for corporate purchases, for example, the data spike for a lead digit of 4 might be especially important if purchasing agents had a spending limit of $500. In one situation, auditors found that department heads were writing multiple checks for just under $5,000 to avoid a mandatory bidding process for items costing $5,000 or more.

Step 6: Perform a Chi-square Test
Although the sample data do not graphically match the expected values very well, the question remains “how far off are they?” To answer this question statistically, auditors can use Excel’s CHITEST function, which performs a chi-square test and returns its p-value, to provide some guidance.

The chi-square test is a “goodness-of-fit” test, i.e., a statistical test that measures how well the data distribution from a sample matches a hypothetical distribution dictated by theory. For the example at hand, one wants to know how well the sample data in column H of figure 1 match the expected values of Benford’s distribution in column I of the figure. Excel’s CHITEST has the general form:

=CHITEST(Data Range of Actual Values, Data Range of Expected Values)

In this formula, the Data Range of Actual Values reflects the values derived from a sample, while the Data Range of Expected Values shows the expected values dictated by the theoretical distribution. The values required for this test have been computed in columns H and I of the table. Thus, the formula for cell J15, which computes the chi-square p-value, is:

=CHITEST(H4:H12,I4:I12)
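For readers curious about what CHITEST computes under the hood, the sketch below reproduces the calculation in Python: the chi-square statistic is the sum of (observed − expected)²/expected, and its p-value for the 8 degrees of freedom of a nine-digit test can be obtained from a closed-form survival function (valid for even degrees of freedom). The function name is my own, not Excel's:

```python
import math

def chi_square_p_value(observed, expected):
    """P-value of a chi-square goodness-of-fit test with len(observed) - 1
    degrees of freedom (8 for the nine lead digits).

    Uses the closed form for even degrees of freedom df:
    P(X > x) = exp(-x/2) * sum_{i < df/2} (x/2)**i / i!
    """
    x = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

# Perfect agreement between observed and expected counts yields a p-value of 1.
```

Feeding in the actual and expected counts from columns H and I of figure 1 would reproduce the CHITEST result in cell J15.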

Step 7: Reach a Conclusion; Are the Data “Natural?”
The value that Excel’s CHITEST returns is a p-value: it indicates the likelihood that the actual values in the sample follow the prescribed (Benford) distribution. High values such as 93 percent indicate a good match between actual and expected distributions, while small values such as 3 percent indicate a poor match. If one enters the test data shown in figure 1 into a spreadsheet and changes some invoice amounts so that the actual bars come closer in pattern to the expected bars, the value for the CHITEST formula will increase accordingly.

As shown in cell J15 of figure 1, the chi-square p-value for the sample data, formatted as a percentage, is 7.89 percent—a relatively small value. Does this value signal fraud? Not necessarily. But, generally speaking, values of less than 5 percent suggest that there is little likelihood that the data match the hypothesized (Benford) distribution, while values of 10 percent or less suggest that there is at least a 90 percent probability that the data are unnatural.

What to conclude? The low value for the chi-square test computed here suggests that the data in this sample are artificial. Before reaching this conclusion, however, there is another option: repeat the experiment using new sample data. This is one advantage of using a spreadsheet model for this work—one can overlay new data in columns B and C and the spreadsheet will perform every computation shown in figure 1 automatically and immediately.

What if a chi-square test on the new data again results in a small value? This would be particularly meaningful because the results are multiplicative. If the chi-square test statistic for both samples were 10 percent, for example, the probability that the underlying data are “natural” would be (.10 × .10 = .01) only 1 percent. Such a result signals a strong need for further investigation.

Benford’s Law Do’s and Don’ts

The idea that the lead digits of “naturally occurring data” are not uniformly distributed is counterintuitive to many people. After all, if the digits 1 through 9 were painted on a perfectly balanced spinning wheel, each digit would have an equal chance of occurring. But natural accounting data are not comparable to the numbers on a spinning wheel because they are not limited to specific ranges of values. Think of it this way: as a bank balance grows, for example, from a few hundred US dollars to more than a thousand dollars, which lead digits appear first in the new balance? The answer is first “1” (for a thousand dollars), then “2” (for two thousand dollars) and then “3” (for three thousand dollars). Thus, each time values increase by an order of magnitude, the number 1 appears first, followed by 2 and then 3. Benford’s Law says exactly this, which is the reason why the probabilities for lead digits 1, 2 and 3 in Benford’s distribution collectively account for more than 60 percent of the total probability distribution—not 30 percent as one might think (refer back to column J in figure 1 to verify this larger percentage).
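The bank-balance intuition above can be demonstrated with a short simulation. The sketch below (growth rate and iteration count are arbitrary choices for the illustration, not from the article) compounds a balance by 3 percent per period and tallies the lead digits it passes through:

```python
from collections import Counter

balance = 100.0
leads = Counter()
for _ in range(500):
    leads[str(balance)[0]] += 1  # first digit of the current balance
    balance *= 1.03              # grow the balance by 3% per period

# The balance spends far more periods with a lead digit of 1 than of 9,
# because moving from 1,000 to 2,000 takes many more 3% steps than
# moving from 9,000 to 10,000.
```

Any quantity that grows multiplicatively spends its time across lead digits in roughly Benford proportions, which is why the law fits so much financial data.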

This explanation also suggests some important considerations when performing investigations using Benford’s Law. One is that the law applies only to naturally occurring data. Purchase amounts, payment amounts, stock prices, accounts payable data, inventory prices and customer refunds are all good examples of such data. So are baseball statistics, areas of lakes, and the populations of towns—all of which Benford examined in his research but which are usually of less interest to accountants. The law does not apply to assigned values, e.g., telephone numbers, lottery tickets, sequential customer numbers or check numbers, all of which are allocated rather than naturally generated.

Second, it is important to avoid using financial data that are not natural. For example, the purchase amounts at a discount store might not lend themselves to Benford analysis, because there often is a single price point per item. Similarly, values with upper limits, such as airline passenger counts per plane or employee days worked per year, do not lend themselves to such analyses.

Third, it is important to sample “fairly” when selecting a set of data for analysis. For example, limiting a sample of invoices to values between US $100 and US $999 defeats the tests described here, because the data are limited to a narrow range. For small companies, using the complete data for an entire month or for a random day of each month is a better option.

Fourth, it is useful to know that Frank Benford did not limit his study to the lead digits of naturally occurring numbers. He also developed frequency distributions for secondary digits, i.e., the second or third digits in such numbers. Further analysis similar to the one in this article can be performed using Benford’s distribution for such secondary digits and Excel’s MID function to parse them from the numbers one wants to test.
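The second-digit distribution follows from the same logarithm: the probability of second digit d is the sum, over all possible first digits f, of log10(1 + 1/(10f + d)). This Python sketch (the function name is my own) computes it:

```python
import math

def second_digit_probability(d: int) -> float:
    """Benford probability that the second significant digit is d (0-9)."""
    return sum(math.log10(1 + 1 / (10 * f + d)) for f in range(1, 10))

# Second digits are much closer to uniform than lead digits:
# a second digit of 0 occurs about 12.0% of the time, a 9 about 8.5%.
```

Because the second-digit distribution is flatter, second-digit tests are a weaker but still useful complement to the lead-digit test described in this article.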

Finally, as a technical matter, it is important to obtain a set of test data that is large enough to obtain useful statistical results. The rule for chi-square tests is that the expected number of observations for each cell should be at least five. Because the smallest percentage in the Benford distribution is 4.6 percent, this requires a sample size of at least 100 observations. (Again, the reason a smaller set of observations was used in figure 1 was to enable readers to see all the data tested.)

Conclusion

Benford’s Law provides a powerful tool with which to determine how “natural” a given set of financial data is likely to be. The tests are both straightforward and easily implemented on spreadsheets without the need of add-in or supplemental software. But, it is also important to remember that not all financial data lend themselves to such tests and that care must be exercised when performing the analysis.

Suggested Readings

  • Benford, Frank; “The Law of Anomalous Numbers,” Proceedings of the American Philosophical Society, vol. 78, 1938, p. 551-572
  • Browne, Malcolm W.; “Following Benford’s Law, or Looking Out for No. 1,” New York Times, 4 August 1998
  • Cleary, Richard; Jay C. Thibodeau; “Applying Digital Analysis to Benford’s Law to Detect Fraud: The Dangers of Type I Errors,” Auditing, vol. 24, no. 1, May 2005, p. 77-81
  • Hill, T.P.; “The First-Digit Phenomenon,” American Scientist, vol. 86, no. 4, July-August 1998, p. 358-364
  • Johnson, Peter; “Fraud Detection with Benford’s Law,” Accountancy Ireland, vol. 37, no. 4, August 2005, p. 16-17
  • Nigrini, Mark; “I’ve Got Your Number,” Journal of Accountancy, vol. 187, no. 5, May 1999, p. 79-83
  • Rodriguez, Ricardo; “Reducing False Alarms in the Detection of Human Influence on Data,” Journal of Accounting, Auditing, and Finance, vol. 19, no. 2, 2004, p. 141-159
  • Rose, Anna M.; Jacob M. Rose; “Turn Excel Into a Financial Sleuth,” Journal of Accountancy, vol. 176, no. 2, August 2003, p. 58-60
  • Stone, Amey; “Using Software to Sniff Out Fraud,” Business Week Online, 30 September 2003
  • Williamson, Duncan; “Vital Statistics,” Accountancy, vol. 133, no. 1327, March 2004, p. 108-110

Mark G. Simkin, Ph.D.
is a professor of information systems at the University of Nevada (USA). He can be reached at markgsimkin@yahoo.com.



The ISACA Journal is published by ISACA. Membership in the association, a voluntary organization serving IT governance professionals, entitles one to receive an annual subscription to the ISACA Journal.

Opinions expressed in the ISACA Journal represent the views of the authors and advertisers. They may differ from policies and official statements of ISACA and/or the IT Governance Institute® and their committees, and from opinions endorsed by authors’ employers, or the editors of this Journal. ISACA Journal does not attest to the originality of authors’ content.

© 2010 ISACA. All rights reserved.

Instructors are permitted to photocopy isolated articles for noncommercial classroom use without fee. For other copying, reprint or republication, permission must be obtained in writing from the association. Where necessary, permission is granted by the copyright owners for those registered with the Copyright Clearance Center (CCC), 27 Congress St., Salem, MA 01970, to photocopy articles owned by ISACA, for a flat fee of US $2.50 per article plus 25¢ per page. Send payment to the CCC stating the ISSN (1526-7407), date, volume, and first and last page number of each article. Copying for other than personal use or internal reference, or of articles or columns not owned by the association without express permission of the association or the copyright owner is expressly prohibited.