Posted by Mark Liberman
https://languagelog.ldc.upenn.edu/nll/?p=73311&utm_source=rss&utm_medium=rss&utm_campaign=distribution-of-acronym-lengths
https://languagelog.ldc.upenn.edu/nll/?p=73311
Or maybe "initialism lengths"? Wiktionary defines initialism as "a term formed from the initial letters of several words or parts of words, which is itself pronounced letter by letter"; while some (fussy) people argue that the term acronym should be reserved for words like laser (= "Light Amplification by Stimulated Emission of Radiation") or NATO (= "North Atlantic Treaty Organization").
Acronyms/Initialisms are (mostly) words, under any reasonable definition. But this category has the special property that most items have multiple specific and distinct senses, generally known to small groups and/or used in very special circumstances.
For example, American linguists know that LSA stands for "The Linguistic Society of America" — but the LSA didn't act in time to lock up https://lsa.org, which belongs to the "Louisiana Sheriffs' Association". And Acronym Finder gives 123 interpretations for LSA, including the linguists but (curiously) not the sheriffs.
Mark Davies' NOW ("News on the Web") Corpus has 3,680 hits for the string LSA. Quickly checking a few of them (literally) at random gives us references to the Liangmai Sports Association's Badminton team; the Law Students Association at McGill; a recipe's abbreviation for a mix of ground linseed, sunflower seeds and almonds; Lifesaving South Africa; the Law Society of Alberta; and so forth. In that corpus, the Linguistic Society of America gets 55 hits, and the Louisiana Sheriffs' Association has 6.
Someday it would be fun to run an acronym-finding script over that dataset, or a similar one. But this morning, as a crude approximation to the (non-frequency-weighted) distribution of initialism length, I checked the entry counts for probes of Acronym Finder with random letter-string samples of different lengths, generated by this simple R script.
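The sampling step can be sketched roughly as follows. This is a Python approximation, not the R script mentioned above; the function name `random_probes` and its parameters are made up for illustration. It assumes letters are drawn uniformly and with replacement from A to Z, so repeated probes are possible.

```python
import random
import string


def random_probes(length, count=10, seed=None):
    """Return `count` random uppercase letter strings of the given length.

    Letters are drawn uniformly with replacement, so the same probe
    string can appear more than once in a sample.
    """
    rng = random.Random(seed)  # a seed makes a sample reproducible
    return [
        "".join(rng.choices(string.ascii_uppercase, k=length))
        for _ in range(count)
    ]


# One sample of ten probe strings for each length from 1 to 5.
for n in range(1, 6):
    print(n, random_probes(n))
```

Each probe string would then be looked up by hand (or by some other means) in Acronym Finder, and the number of entries recorded.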
A sample of 10 random single letters yielded a mean of 65.5 hits and a median of 64.5:
G 66
V 65
Y 31
E 77
L 64
W 60
H 64
V 65
X 48
D 115
A two-letter sample yielded a mean of 58.1 and a median of 25.5:
ZZ 13
BO 85
UO 26
ND 82
OY 10
WY 8
MM 248
JR 25
YI 6
SK 78
A three-letter sample has a mean of 47.7 and a median of 41:
KXS 2
WRK 4
DCL 63
KNU 6
NPN 37
IPE 60
PVP 45
CCB 154
BJH 4
MCM 102
A four-letter sample has a mean of 1.4 and a median of 0:
EKCK 0
EPRL 6
BLUE 6
WIXI 0
QLCS 1
DZCZ 0
YJGM 0
BTDW 1
CWJI 0
FVOE 0
(Though Acronym Finder's "acronym attic" has one unverified entry for EKCK as "Embassy in Kuwait City Kuwait".)
And a five-letter sample has a mean and median of 0, though ARKEM has one "unvalidated" entry in Acronym Finder's attic, listed as "alarm remote keyless entry module":
RDZCI 0
LPEYZ 0
TUWRX 0
WMHXQ 0
ARKEM 0
VCEGP 0
MZMKH 0
WTFAY 0
RDITH 0
DBRBY 0
If we believed the unreliable estimates derived from those mean values, we'd estimate 65.5*26=1703 single-letter entries, 58.1*26^2=39276 two-letter entries, 47.7*26^3=838375 three-letter entries, and 1.4*26^4=639766 four-letter entries. These are implausible estimates, but they still confirm my prejudice that three-letter initialisms are the most commonly used.
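As a rough check, the sample statistics and the extrapolation can be recomputed from the counts listed above. This is a minimal Python sketch, not the script used for the post; the names `samples` and `extrapolate` are made up for illustration. The scaling assumes that the mean number of entries per probed string applies to all 26^n possible letter strings of length n, which ten probes per length can only suggest.

```python
from statistics import mean, median

# Entry counts copied from the samples listed above, keyed by string length.
samples = {
    1: [66, 65, 31, 77, 64, 60, 64, 65, 48, 115],
    2: [13, 85, 26, 82, 10, 8, 248, 25, 6, 78],
    3: [2, 4, 63, 6, 37, 60, 45, 154, 4, 102],
    4: [0, 6, 6, 0, 1, 0, 0, 1, 0, 0],
    5: [0] * 10,
}


def extrapolate(counts, length, alphabet_size=26):
    """Scale the mean entries per probe by the number of possible strings."""
    return mean(counts) * alphabet_size ** length


for length, counts in samples.items():
    print(
        f"length {length}: mean={mean(counts)}, median={median(counts)}, "
        f"estimated entries={round(extrapolate(counts, length))}"
    )

# Length with the largest extrapolated number of entries.
print(max(samples, key=lambda n: extrapolate(samples[n], n)))
```

With samples this small the totals are very noisy (a single probe like MM, with 248 entries, moves the two-letter mean a lot), so only the rough ordering of lengths is worth reading off.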
For sequence lengths of six and above, traditional initialisms or acronyms are increasingly unlikely, though "backronyms" like DREAM and PATRIOT buck the trend. And social-media and email names sometimes involve initialisms combined with abbreviations, like @FmrRepMTG.
The longest example I've ever seen is MMIWG2SLGBTQQIA+. For an explanation and motivation of all 16 characters in that one, see Lezard, Percy, Noe Prefontaine, Dawn-Marie Cederwall, Corrina Sparrow, Sylvia Maracle, Albert Beck, and Albert McCleod. "2SLGBTQQIA+ Sub-Working Group MMIWG2SLGBTQQIA+ National Action Plan Final report." (2021).