
The False Positive Problem in Audit Anomaly Detection


A detection system that flags 40% of a journal entry population hasn't found 40% more risk than one that flags 3%. It's found 40% noise. The auditor who receives a list of 4,000 flagged items from a 10,000-entry population is not better informed — they're worse off than if the system had returned nothing, because now they have to process 4,000 entries to find the 15 that actually matter.

False positive rate is the variable that determines whether an anomaly detection tool is operationally useful or just statistically interesting. Getting it wrong in either direction — too high or too low — produces a tool that auditors stop trusting. This is the problem we spent the most time on when building AuditPulsar's scoring engine, and it's worth describing how we approached it.

The Two Ways to Fail

Anomaly detection can fail at two ends of the sensitivity spectrum.

High sensitivity, low specificity: the system flags many items, catches most real anomalies, but also flags many benign entries. In audit terms, this means a long review list with a low conversion rate to actual findings. Auditors initially engage with such a system, spend considerable time clearing entries that turn out to be fine, and eventually apply a higher internal threshold ("I'll only look at items scoring above 85") that isn't documented in the workpaper and may exclude real findings.

Low sensitivity, high specificity: the system flags few items, and most of what it flags is genuinely suspicious, but it misses a meaningful fraction of the actual anomalies in the population. In audit terms, this is the more dangerous failure. The auditor trusts the system, documents that the complete population was screened, and proceeds — but there are material entries that the scoring didn't surface. This is the failure mode that produces restatements.

The calibration target is a system that finds substantially all material anomalies while producing a flagged-items list small enough for complete review by the audit team in a reasonable time window. For most engagements, that means the flagged population should represent 0.5% to 3% of total journal entries. Anything higher than that isn't useful for triage purposes.

Why Simple Threshold Rules Produce Too Many False Positives

The standard JE testing filters — entries above materiality, after-hours entries, entries without descriptions, entries by unusual preparers — are threshold rules. An entry either clears the threshold or doesn't. There's no weighting, no accounting for the entry's context, and no way to distinguish between an after-hours entry that's suspicious and an after-hours entry that's the CFO posting a routine monthly close accrual they always post after hours.

The false positive problem with threshold rules is structural: they have no way to incorporate historical context. An after-hours posting by a user who posts 40% of their entries after hours is a normal event. An after-hours posting by a user who has posted nothing after hours in the prior 18 months is an anomaly. A threshold rule sees both as the same. A model trained on the user's historical posting pattern sees them very differently.
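The distinction can be made concrete. Here is a minimal sketch, with hypothetical function names and an assumed 8:00-18:00 business window, contrasting a pure threshold rule with a rule that conditions on the user's own after-hours posting rate:

```python
from datetime import time

# Pure threshold rule: flags any after-hours posting, regardless of who posted it.
def threshold_flag(posted_at: time) -> bool:
    business_start, business_end = time(8, 0), time(18, 0)
    return not (business_start <= posted_at <= business_end)

# Context-aware variant: flags an after-hours posting only when it is rare
# for this specific user relative to their own history.
def contextual_flag(posted_at: time, user_after_hours_rate: float,
                    rarity_cutoff: float = 0.05) -> bool:
    if not threshold_flag(posted_at):
        return False
    # A user who routinely posts 40% of entries after hours makes this event
    # unremarkable; a near-zero historical rate makes it an anomaly.
    return user_after_hours_rate < rarity_cutoff

late_night = time(23, 15)
print(threshold_flag(late_night))                               # True for both users
print(contextual_flag(late_night, user_after_hours_rate=0.40))  # routine user: False
print(contextual_flag(late_night, user_after_hours_rate=0.00))  # first-time user: True
```

The `rarity_cutoff` value is an illustrative assumption; the point is that the same posting event produces different flags depending on the user's baseline.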

Population-level statistical tests like Benford's Law have the opposite problem: they produce a distribution comparison, and anything that deviates from the expected distribution is flagged regardless of whether the deviation is material. On a 20,000-entry population with genuinely clean data, Benford analysis might still identify 200 to 400 entries as digitally anomalous — more than can be reviewed usefully, and mostly not representing actual problems.
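For readers unfamiliar with the mechanics, a minimal first-digit Benford comparison looks like this (a sketch, not AuditPulsar's implementation): every leading digit whose observed frequency exceeds the Benford expectation contributes flagged entries, whether or not the excess reflects anything material.

```python
import math
from collections import Counter

def benford_expected(d: int) -> float:
    # Expected proportion of amounts whose leading digit is d under Benford's Law.
    return math.log10(1 + 1 / d)

def first_digit(amount: float) -> int:
    # Strip leading zeros and the decimal point to find the first significant digit.
    return int(f"{abs(amount):.2f}".lstrip("0.")[0])

def digit_deviations(amounts):
    # Observed leading-digit frequency minus the Benford expectation, per digit.
    counts = Counter(first_digit(a) for a in amounts)
    n = len(amounts)
    return {d: counts.get(d, 0) / n - benford_expected(d) for d in range(1, 10)}

# Even a clean population deviates somewhere: entries whose leading digit is
# over-represented get flagged, material or not.
amounts = [120.50, 1300.00, 1450.25, 190.10, 2100.00, 310.75, 9800.00, 1125.00]
over_represented = [d for d, dev in digit_deviations(amounts).items() if dev > 0]
```

On real populations the deviation is tested per digit across thousands of entries, which is why even clean data yields hundreds of "digitally anomalous" items.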

The Scoring Architecture We Built

AuditPulsar's scoring engine combines five signal types, each weighted based on its historical predictive value from the training dataset:

Statistical deviation from population patterns. How much does this entry's amount, account combination, and timing deviate from the statistical distribution of this population? This signal draws on Benford's analysis, amount distribution modeling, and posting-time distribution. Weight in the final score: approximately 20%.

Statistical deviation from historical patterns. How does this entry compare to the same account combination's history in this specific client's ledger? An entry to an account combination that has appeared 300 times in prior periods is a different risk than the same combination appearing for the first time. This signal requires prior-period data — it improves substantially in the second year of platform use on a given engagement. Weight: approximately 25%.

User behavior deviation. How does this entry's characteristics compare to the posting history of this specific user? Posting time, account range, transaction size distribution, approval pattern. This is the signal that catches direct entry bypass by authorized users — their entries look normal by population standards but deviate from their own history. Weight: approximately 30%.

Account combination risk model. Certain account combinations are inherently higher risk than others regardless of other factors — revenue and reserve accounts, intercompany clearing accounts, accounts that have historically been associated with misstatements in the training data. This is a static lookup table derived from the training dataset, not a client-specific model. Weight: approximately 15%.

Rule-based flags. Explicit rule triggers that override score modifiers: entries that bypass approval workflow entirely, entries posted after the period is closed, entries that exactly match a previously cleared anomaly from a prior period. These are binary flags that add 15-25 points to the score when triggered. Weight: approximately 10% of scores in aggregate, but high contribution to individual scores when triggered.
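Putting the five weights together, the blending step can be sketched as follows. The signal names, the 0-100 normalization of each signal, and the specific bonus value are illustrative assumptions; only the weights come from the description above.

```python
# Weights from the five signal types described above. Signal values are
# assumed to arrive normalized to a 0-100 scale from upstream models
# (a hypothetical interface, not AuditPulsar's actual one).
WEIGHTS = {
    "population_deviation": 0.20,
    "historical_deviation": 0.25,
    "user_behavior_deviation": 0.30,
    "account_risk": 0.15,
    "rule_flags": 0.10,
}

def composite_score(signals: dict, rule_bonus: float = 0.0) -> float:
    # Weighted blend of the five signals, capped at 100. Triggered rules
    # contribute both through their weighted channel and through a flat
    # 15-25 point bonus (here passed in as rule_bonus).
    base = sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)
    return min(100.0, base + rule_bonus)

entry = {
    "population_deviation": 40,
    "historical_deviation": 70,
    "user_behavior_deviation": 85,
    "account_risk": 30,
    "rule_flags": 100,   # e.g. entry posted after the period is closed
}
score = composite_score(entry, rule_bonus=20)   # 65.5 weighted + 20 bonus = 85.5
```

This structure makes the rule flags behave as described: a small share of aggregate score mass, but a large contribution to any individual entry that trips one.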

What 97.3% Detection Accuracy Actually Means

AuditPulsar's published accuracy figure — 97.3% — is the true positive rate on the held-out test set: the fraction of labeled anomalies in the test data that the scoring engine identified as anomalous (score above 50). What that number doesn't tell you is the corresponding false positive rate, which is the operationally relevant metric.

On the same test set, the false positive rate at the 50-point threshold is approximately 8.4%. That means 8.4% of non-anomalous entries receive scores above 50. On a 10,000-entry population with a realistic 2% base rate of meaningful anomalies (200 true anomalies), the 50-point threshold would flag approximately 195 of the 200 true anomalies plus roughly 823 false positives (8.4% of the 9,800 benign entries), for a total flagged population of about 1,018 entries, or roughly 10% of the total.

At the 70-point threshold — which is AuditPulsar's default recommendation for the items that go to detailed review — the false positive rate drops to 1.6%, producing a flagged population of approximately 182 true anomalies plus 157 false positives, or roughly 339 items. That's operationally manageable for most engagements. The trade-off is that the true positive rate at the 70-point threshold drops to approximately 91%, meaning about 9% of true anomalies (roughly 18 entries in this example) receive scores below 70 and go into the "review as needed" lower-priority bucket.
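The arithmetic behind these flagged-population figures is straightforward to reproduce. A small helper (hypothetical, for illustration only) applied to the 10,000-entry, 2%-base-rate example:

```python
def expected_flags(population: int, base_rate: float, tpr: float, fpr: float):
    # Expected counts of flagged entries given a base rate of true anomalies,
    # a true positive rate, and a false positive rate.
    true_anomalies = population * base_rate
    benign = population - true_anomalies
    tp = tpr * true_anomalies          # true anomalies that get flagged
    fp = fpr * benign                  # benign entries that get flagged
    return round(tp), round(fp), round(tp + fp)

# 50-point threshold: TPR 97.3%, FPR 8.4%
print(expected_flags(10_000, 0.02, tpr=0.973, fpr=0.084))
# 70-point threshold: TPR 91%, FPR 1.6%
print(expected_flags(10_000, 0.02, tpr=0.91, fpr=0.016))
```

Running the same computation at other base rates is a quick way to sanity-check whether a proposed threshold produces a reviewable list on a given engagement.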

The calibration decision — where to set the working threshold — belongs to the auditor. The platform's default of 70 reflects a reasonable balance for most engagement types, but high-risk engagements (fraud risk indicators present, prior-year restatements, management tone concerns) warrant a lower threshold and higher false positive tolerance.

Client-Specific Calibration Over Time

The most valuable lever for reducing false positives is client-specific calibration. Every entry that an auditor reviews and clears in the platform becomes a labeled data point that improves the scoring for that client in subsequent periods. After two to three years of use on the same engagement, the false positive rate for that specific client is typically 40 to 60% lower than on a fresh engagement, because the model has learned what normal looks like for this client's accounting patterns.

This means the value of the platform compounds over time. The first year, firms see good coverage with a manageable false positive rate. By year three on an engagement, the flagged items list is substantially cleaner, and the auditor's review time per engagement has dropped considerably. The efficiency gain is not just from faster population scanning — it's from a scoring model that has learned this client's specific accounting patterns and stopped flagging the things that are unusual by generic population standards but normal for this client.

What to Do If the False Positive Rate Seems High

If more than 5% of your total journal entry population is scoring above 50 on a first-year engagement, the most likely explanations are: the client has an unusual account structure that the generic models haven't seen, the client uses a large number of intercompany clearing accounts that generate normal-looking unusual pairings, or the data import has a mapping error that's causing the model to misclassify transaction types.

The fastest diagnostic: look at the account distribution of the high-scoring entries. If 60% of them are to a small set of accounts, those accounts are likely triggering the account combination risk model with a high base weight. Checking whether those accounts are genuinely high-risk for this client versus structurally unusual but operationally routine will tell you whether the threshold needs adjusting or the account risk weights need to be overridden for this engagement.
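The concentration check itself is a one-liner over the flagged list. A minimal sketch, with invented account names and scores:

```python
from collections import Counter

def account_concentration(flagged_entries, top_n=3):
    # flagged_entries: (account, score) pairs for entries above the threshold.
    # Returns the top accounts by flag count and their combined share of flags.
    counts = Counter(acct for acct, _ in flagged_entries)
    total = len(flagged_entries)
    top = counts.most_common(top_n)
    share = sum(c for _, c in top) / total
    return top, share

flagged = [("IC-clearing", 92), ("IC-clearing", 88), ("IC-clearing", 71),
           ("IC-clearing", 77), ("revenue", 81), ("accruals", 66),
           ("payroll", 58)]
top, share = account_concentration(flagged)
# If a small set of accounts carries most of the flags, inspect those
# accounts' risk weights before lowering the global threshold.
```

A high share concentrated in a few accounts points at the account risk weights; a flat distribution points at the threshold or a data-mapping problem.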

The platform allows per-account threshold adjustments and the ability to designate specific account combinations as low-risk overrides for a given client. Those adjustments are logged in the configuration file and are reportable in the methodology documentation if an auditor or inspector asks why certain accounts were deprioritized in the scoring.