Choosing a non‑obvious PIN

There is very little research data on PIN prevalence available, so analysis of a new dataset of 4-digit passcodes can't be ignored.

Long-time readers of this blog may recall that I’ve talked here and in papers and articles about password and PIN selection strategies, with much of that work based on PIN prevalence data provided by Daniel Amitay. So I was interested in a Security FAQs post that led me to somewhat similar research summarized in a DataGenetics post: PIN analysis.

I have my doubts about the recent craze for reproducing list after list of the top X passwords, even though I’ve commented on or reproduced research along those lines myself. Not only does the law of diminishing returns apply (how many slightly different lists of bad passwords does anyone need?), but it encourages the dangerous belief that all you have to do is avoid passwords that feature in one of those (normally very short) lists. If you really want to take that password blacklisting approach, at least look at a substantial list like one of Mark Burnett’s. However, in general I’d personally prefer to work on the principles behind common selection patterns, so as to be able to offer advice on using better strategies. Still, the DataGenetics post is highly readable and prevalence data on PIN use are very rare, so since it uses a different dataset to the Amitay data that I’ve been using, it’s worth taking some time to look at it.

Whereas Amitay’s data consists of harvested (anonymized) PIN data, DataGenetics went back to password lists and extracted the passcodes that consisted only of four digits, assuming that approach will reflect real-world use of 4-digit PINs. I’m not totally convinced by that: as I’ve discussed at some length elsewhere, PIN selection is likely to be significantly influenced by the type of keyboard or keypad of the device in use – I’ll come back to that in a future post – so the prevalence data is probably compromised by the fact that so many different types of input device may have contributed to the data set. Amitay’s data comes entirely from iGadgets, probably mostly iPhones, so the prevalence data may be a little more reliable. (I may check that hypothesis with him later.) Certainly, while there is significant correlation between the most-used patterns in the two datasets, the prevalence statistics differ dramatically in some respects.
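
To make that extraction approach concrete, here is a minimal sketch of how four-digit passcodes might be filtered out of a password list and ranked by prevalence. It is only an illustration of the general idea, not DataGenetics’ actual method: the function name pin_prevalence and the sample list are hypothetical, and a real analysis would be working with a far larger leaked-password corpus.

```python
import re
from collections import Counter

# Exactly four ASCII digits and nothing else (so "12345" and "abc1" are rejected).
FOUR_DIGITS = re.compile(r"[0-9]{4}")


def pin_prevalence(passwords):
    """Return (pin, share) pairs, most common first.

    Only entries consisting solely of four digits are counted; the share
    is relative to the number of four-digit entries, not the whole list.
    """
    pins = [p for p in passwords if FOUR_DIGITS.fullmatch(p)]
    counts = Counter(pins)
    total = len(pins)
    return [(pin, n / total) for pin, n in counts.most_common()]


# Hypothetical sample data, purely for illustration.
sample = ["1234", "password", "1234", "0000", "12345", "2580", "abc1"]
result = pin_prevalence(sample)
```

Note that nothing in a filter like this knows what kind of keyboard or keypad each passcode was typed on, which is exactly why I’m cautious about treating the resulting figures as representative of PIN use on any one type of device.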

Nonetheless, it’s interesting reading. And the writer does pick up on one instance of the ergonomic aspects of passcode selection (i.e. input device layout sometimes determines the memorability and therefore selection of particular key entry patterns).

ESET Senior Research Fellow
