Information Systems Control Journal, Volume 6, 2002
Assessing Data Authenticity with Benford's Law
By Bassam Hasan, Ph.D.
The increased power and speed with which computers can process large volumes of data have revived interest in a century-old mathematical model known as Benford's law as a means to assess the authenticity and accuracy of numerical data. Benford's law maintains that naturally occurring numbers pertaining to the same phenomenon or event are related to each other and are most likely to begin with 1, 2 or 3. This law presents a valuable framework for assessing the authenticity of data values and recognizing manipulated data sets. A review of Benford's law provides a deeper understanding of this important model through a description of its origin and development, and a demonstration of its usefulness in evaluating the authenticity of numerical data.
The Development of Benford's Law
In 1881, a mathematician named Simon Newcomb noticed that a logarithms book was more grubby and worn at the front pages than at the back pages. Newcomb believed that researchers were using the first pages more often than the last pages. Because the first pages contained numbers beginning with low digits, Newcomb inferred that low digits were looked up and used more frequently than higher digits. He published his observation in a short article in the American Journal of Mathematics.1
Because they lacked empirical evidence and logical explanations, Newcomb's remarks did not attract significant attention at that time and went mostly unnoticed for several years. However, almost half a century later, at the General Electric Company laboratories, a physicist named Frank Benford, unaware of Newcomb's earlier work, made the same observation. He noticed that the first pages of logarithms books showing low digits were more worn than the last pages showing high digits. He believed that researchers were looking up numbers beginning with low digits.
Benford believed that researchers did not have any special preference for low digits, but that most numbers begin with low digits. To validate his assumption, Benford collected 20,000 numerical data values from diverse and dissimilar datasets, such as lengths of rivers, populations of US counties and street addresses. In 1938, Benford tabulated the frequencies of leading digits in the datasets he had collected and published his findings in a detailed article in the Proceedings of the American Philosophical Society. In that article, Benford argued that numbers describing a phenomenon or an event, such as a company's sales, are interrelated and are more likely to begin with a 1, 2 or 3 than with a 7, 8 or 9.
Benford's Law of Leading Digits
Benford's law also is referred to as the law of leading digits frequencies, law of anomalous numbers, significant digit law, and, more recently, digital frequencies analysis (Nigrini, 1999).2 Contrary to what most people believe, Benford's law states that the digits 1 through 9 are not equally likely to appear as a leading digit in multi-digit numbers resulting from the same phenomenon, and that their leading-digit distribution is neither random nor uniform. That is, numbers that occur as a result of an underlying phenomenon are related to each other. Benford derived a mathematical equation3 to estimate the likelihood of any digit to be a leading digit in a given number. The mathematical and statistical properties of Benford's equation of leading digits were examined by Benford4 and later studied and validated by Pinkham5 and Hill.6
Benford's distribution of first- and second-digit frequencies is presented in table 1. As can be seen in table 1, the distribution is skewed toward the lower digits. More precisely, there is a 30.1 percent probability that a multi-digit number starts with the digit 1. Similarly, there is a 17.6 percent chance that the leading digit will be 2. Nine is the least likely digit to appear as a leading digit, with a probability of 4.6 percent. It is important to note that Benford's law excludes the digit 0 as a leading digit because a number cannot begin with a 0.
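The probabilities quoted above follow directly from Benford's equation, p(n) = log10(1 + 1/n). The Python sketch below reproduces the first-digit values and, assuming the standard generalization in which the second-digit probability is summed over every possible first digit, the second-digit values as well. The function names are illustrative and do not come from the original article.

```python
import math


def first_digit_probs():
    """Benford's expected probability for each leading digit 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def second_digit_probs():
    """Expected probability for each second digit 0..9.

    Assumption: the usual generalization of Benford's equation, summing
    log10(1 + 1/(10*d1 + d2)) over all first digits d1 = 1..9.
    """
    return {
        d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
        for d2 in range(0, 10)
    }


if __name__ == "__main__":
    for digit, p in first_digit_probs().items():
        print(f"first digit {digit}: {p:.1%}")
    for digit, p in second_digit_probs().items():
        print(f"second digit {digit}: {p:.1%}")
```

The second-digit distribution is much flatter than the first-digit distribution, which is why the first digit is usually the more informative of the two.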
The traditional example that has been used frequently to illustrate Benford's law is population growth.7 Suppose that a US county has a population of 10,000 and that the population grows at an annual rate of 2 percent. It will take the county about 36 years to double its population to 20,000 (table 2). That is, the population number will begin with a 1 for about 36 years. The next change in the leading digit will occur when the population reaches 30,000, but it will take only about 20 years for the leading digit to change from 2 to 3. Likewise, the leading digit will be a 3 for about 14 years. Finally, it will take only 6 years for the leading digit to change from an 8 to a 9. As this example illustrates, the change in the leading digit that takes the longest time is from 1 to 2. Furthermore, the county's population number is several times more likely to begin with 1 than with 9 (roughly 6.5 times, using the probabilities of 30.1 and 4.6 percent given above). Table 2 shows the leading digits and the total number of years that the population number will begin with each digit.
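The arithmetic behind the population example can be sketched in a few lines of Python. This is a minimal illustration that assumes smooth growth at a constant rate, so the durations come out fractional and can differ by about a year from a table that counts whole calendar years. The function names and the default 2 percent rate are choices made for this sketch.

```python
import math


def years_with_leading_digit(digit, annual_rate=0.02):
    """Years a steadily growing quantity spends with the given leading digit.

    Assumes smooth growth at a constant annual rate, so the answer is the
    time needed to grow from digit * 10^k to (digit + 1) * 10^k.
    """
    return math.log((digit + 1) / digit) / math.log(1 + annual_rate)


def share_of_time(digit, annual_rate=0.02):
    """Fraction of one full tenfold cycle spent with the given leading digit."""
    cycle = math.log(10) / math.log(1 + annual_rate)
    return years_with_leading_digit(digit, annual_rate) / cycle


if __name__ == "__main__":
    for d in range(1, 10):
        years = years_with_leading_digit(d)
        print(f"digit {d}: {years:5.1f} years, {share_of_time(d):.1%} of the time")
```

The share of time spent on each leading digit works out to log10(1 + 1/d) whatever the growth rate, which is exactly Benford's expected frequency for that digit.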
To enhance the reliability of the results, datasets to be analyzed by Benford's law must satisfy several conditions. Benford8 and Nigrini9 suggest that candidate datasets must possess the following characteristics:
- The data must be numeric. Benford's law estimates the expected frequencies of leading digits in numerical datasets.
- The numbers (data) must be related in some way and pertain to the same phenomenon (e.g., stock prices). In other words, there must be underlying causes (i.e., a phenomenon or event) for the numbers to occur. For instance, stock prices are influenced by competing economic and financial forces.
- The numbers must not be restricted by maximum or minimum values, as hourly wage rates are, for example. Such limits exclude certain numbers and, as a result, skew the distribution of leading-digit frequencies.
- The numbers must occur naturally rather than be invented or assigned, as telephone numbers or identification numbers are. Assigned numbers can be allocated in any predetermined order. Consequently, the distribution of leading digits in assigned numbers will be biased toward certain digits.
- The numbers must have at least four digits. However, if the numbers have fewer than four digits, frequencies of the second digit can be used.
Data Accuracy
Data accuracy represents a fundamental component of data quality10 and is critical to the overall success of information systems (Hirji, 2001).11 Information systems may incorporate various built-in functions to thwart unauthorized access to and editing of data. But there are limits to what such security measures can accomplish. For example, security features may prevent unauthorized access to data, enforce read/write permissions and allow only authorized users to gain access to stored data. But, once an authorized user accesses the data, these security checks cannot prevent intentional manipulation or accidental modification of data.
Benford's law offers a potential solution for the problem described above. A major application of Benford's law has been to discover inconsistent and fraudulent data values in datasets. According to Hill,12 one can distinguish between original and fake numbers in two ways:
- Identify the values that are highly likely to occur in a valid numerical dataset.
- Identify the values that are highly unlikely to appear in a distorted dataset.
To demonstrate these approaches, Hill asked his students either to flip a coin 200 times and record the sequence of heads and tails, or to fake the results without flipping the coin. The next day, students were surprised when he was able to separate the original from the fraudulent results. Hill points out that a run of six consecutive heads or tails is very likely to occur somewhere in 200 flips. But to a person who is not aware of that fact and who tries to fake the data, such a run seems implausible and is therefore left out.
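Hill's observation about runs can be checked without flipping any coins. The sketch below is not taken from Hill's work; it is a small dynamic-programming calculation, assuming a fair coin and independent flips, of the probability that a sequence contains at least one run of a given length of identical outcomes. The function name is hypothetical.

```python
def prob_run_at_least(n_flips, run_len):
    """Probability that n_flips fair coin flips contain a run of at least
    run_len identical outcomes (heads or tails).

    Assumes n_flips >= 1 and run_len >= 2.
    alive[j] is the probability that no such run has occurred so far and
    the current run of identical outcomes has length j + 1.
    """
    alive = [0.0] * (run_len - 1)
    alive[0] = 1.0  # after the first flip, the current run has length 1
    for _ in range(n_flips - 1):
        nxt = [0.0] * (run_len - 1)
        nxt[0] = 0.5 * sum(alive)  # next flip differs: run restarts at 1
        for j in range(run_len - 2):
            nxt[j + 1] = 0.5 * alive[j]  # next flip matches: run grows by 1
        # Probability mass whose run would reach run_len is dropped here.
        alive = nxt
    return 1.0 - sum(alive)


if __name__ == "__main__":
    print(f"P(run of 6 or more in 200 flips) = {prob_run_at_least(200, 6):.3f}")
```

Calling `prob_run_at_least(200, 6)` gives the chance that an honest record of 200 flips contains the kind of run that fabricated records tend to lack.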
In essence, Benford's law identifies the probabilities of highly likely and unlikely values and, as a result, could be used to recognize likely as well as unlikely frequencies of numbers in datasets. The law of leading-digit frequencies indicates that leading digits in original, naturally occurring numerical datasets are related to each other and are distributed logarithmically. Thus, distorted or intentionally modified numbers are not likely to exhibit the distribution pattern expected under Benford's law.
To obscure their alterations, individuals who falsify numbers and who are not familiar with Benford's law try to come up with numbers that look random. Therefore, the leading digits of manipulated numbers are most likely to exhibit a uniform or random rather than a logarithmic distribution. Accordingly, the leading digits of a numerical dataset can be tabulated, and their frequencies calculated and compared with Benford's expected frequencies. If the observed and expected frequencies are inconsistent, the validity of the numbers in the dataset must be reexamined.
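The comparison described above can be sketched as follows. The article does not prescribe a particular test statistic, so this example assumes a Pearson chi-square goodness-of-fit statistic with 8 degrees of freedom, for which the 5 percent critical value is 15.507. The function names are illustrative, and the helper assumes finite numeric values.

```python
import math
from collections import Counter

# 5 percent critical value of the chi-square distribution, 8 degrees of freedom.
CHI2_CRITICAL_5PCT_8DF = 15.507


def leading_digit(x):
    """Return the first nonzero digit of a finite number, or None for zero."""
    s = str(abs(x)).lstrip("0.")
    return int(s[0]) if s else None


def benford_chi_square(values):
    """Chi-square statistic of observed leading digits against Benford's law."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    n = len(digits)
    if n == 0:
        raise ValueError("no nonzero values to analyze")
    counts = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # Benford's expected count
        stat += (counts[d] - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    amounts = [2 ** k for k in range(1000)]  # stand-in data, not real records
    stat = benford_chi_square(amounts)
    verdict = "reexamine" if stat > CHI2_CRITICAL_5PCT_8DF else "consistent"
    print(f"chi-square = {stat:.2f}: {verdict}")
```

A statistic above the critical value only signals that the dataset deserves a closer look; it does not by itself establish that the numbers were altered.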
Random Numbers Experiment
The main premises of Benford's law of leading digits are that: (1) true numbers that result from a given phenomenon are interrelated, and (2) the distribution of their leading digits is not uniform or random like the frequencies of leading digits in unrelated or random numbers. The optimal way to empirically examine these arguments would be in organizational settings using real business data. However, organizations are often reluctant to allow external access to their databases due to competition and privacy concerns.13 Therefore, to empirically examine Benford's law, the author of this article created a simulation program to generate unrelated and random numbers. The simulation program generates three sets of 500 numbers each. The numbers in each set were drawn from a different numeric range (i.e., 1 to 999, 1 to 9,999 and 1 to 99,999). As such, a total of 1,500 random numbers were generated and used.
The program also computes the averages of frequencies of the leading digits in each set of numbers in each range category. The average frequencies are summarized and presented in table 3. Further, table 3 presents the expected leading digits frequencies as suggested by Benford's law. The leading digit frequencies presented in table 3 highlight two main conclusions. First, the leading digit frequencies of all truly random numbers in all range categories do not match Benford's expected frequencies. Second, the digit frequencies in table 3 indicate that distribution of leading digits of random numbers is almost equal and uniform. Consequently, these results are consistent with Benford's law in that, unlike naturally occurring numbers, truly random and unrelated numbers are more likely to reveal equal distribution of leading digits.
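The article does not describe how the simulation program was written. The sketch below is one possible reconstruction, assuming uniformly distributed random integers drawn with Python's standard random module; the function names and the seed value are arbitrary choices made for this illustration.

```python
import random
from collections import Counter


def leading_digit(n):
    """First digit of a positive integer."""
    return int(str(n)[0])


def simulate(upper, size=500, seed=None):
    """Draw `size` uniform random integers from 1..upper and return the
    relative frequency of each leading digit 1..9."""
    rng = random.Random(seed)
    numbers = [rng.randint(1, upper) for _ in range(size)]
    counts = Counter(leading_digit(n) for n in numbers)
    return {d: counts[d] / size for d in range(1, 10)}


if __name__ == "__main__":
    # Same three ranges as in the experiment described above.
    for upper in (999, 9_999, 99_999):
        freqs = simulate(upper, seed=2002)
        print(upper, {d: round(f, 3) for d, f in freqs.items()})
```

Under this assumption the near-uniform outcome is expected: among the integers 1 to 999, for example, each leading digit occurs exactly 111 times, so each digit should lead about 11.1 percent of the draws.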
Analysis of leading-digit frequencies has been applied in various organizational settings. For example, the Wall Street Journal14 reported that the district attorney's office in New York applied Benford's law in examining financial data. The amounts of 784 checks issued by seven companies were tabulated, and the frequencies of their digits were compared with Benford's expected frequencies. The amounts on 103 checks did not match the expected patterns. Subsequent investigation revealed that these amounts were indeed not authentic, but had been purposefully altered by bookkeepers and clerks. Nigrini15 presents additional examples and cases where leading-digit frequency analysis uncovered inconsistencies and irregularities in datasets.
Conclusion
Benford's law indicates that the distribution of leading digits frequencies in a numerical dataset describing a given phenomenon is not equal or random. Benford derived a mathematical equation to estimate the expected probability of any digit to appear as a leading digit. According to Benford's law, a multi-digit number taken from a dataset at random is most likely to begin with a 1 and least likely to begin with a 9.
This paper reviewed Benford's law of leading digits and demonstrated its practical implications for data security and accuracy. Using 1,500 truly random, unrelated numbers, this paper demonstrated the usefulness of Benford's law in recognizing unrelated, independent and random numbers. The frequencies of leading digits in random numbers exhibited a noticeable mismatch with Benford's expected frequencies for interrelated numbers. When such a mismatch occurs in a dataset, it suggests that numbers in the dataset may not be original and may have been modified.
Data authenticity and accuracy are critical to many vital organizational processes. Accurate data lead to correct information and knowledge that are vital for decision-making, problem-solving, planning and customer service. On the other hand, inconsistent and manipulated data produce erroneous knowledge that could result in performance and financial losses. Benford's law offers an effective and simple approach to enhance data quality by examining the authenticity of data and uncovering distorted datasets. Application of Benford's law in business settings will not only help preserve data integrity and quality, but also uncover fraudulent business and managerial activities.
Endnotes
1 Newcomb, S., "Note on the frequency of use of the different digits in natural numbers," American Journal of Mathematics, pp. 4, 39-40, 1881
2 Nigrini, M., "The peculiar patterns of first digits," IEEE Potentials, pp. 18, 24-27, 1999
3 p(n) = log10(1 + 1/n), n = 1, 2, 3, ..., 9
4 Benford, F., "The law of anomalous numbers," Proceedings of the American Philosophical Society, pp. 78, 551-572, 1938
5 Pinkham, R., "On the distribution of the first significant digits," Annals of Mathematical Statistics, pp. 32, 1223-1230, 1961
6 Hill, T., "The difficulty of faking data," Chance, 26, 8-13, 1999; "A statistical derivation of the significant-digit law," Statistical Science, pp. 10, 354-363, 1996
7 Nigrini
8 Benford
9 Nigrini
10 O'Leary, D. E., "The impact of data accuracy on system learning," Journal of Management Information Systems, pp. 4, 83-98, 1993
11 Hirji, K. K., "Exploring data mining implementation," Communications of the ACM, pp. 44, 87-93, 2001
12 Hill
13 Hirji
14 "He's got their numbers: scholar uses math to foil financial fraud," The Wall Street Journal, p. B1, 10 July 1995
15 Nigrini
Bassam Hasan, Ph.D.
is an assistant professor of information systems at the University of Toledo. His research interests include computer learning and training, end-user management, and database management and mining. He can be reached at bhasan@utnet.utoledo.edu.