Index of Coincidence (crypto/text/data analysis)

Post by Keya

This function returns a value from 0.0 to 1.0 (with Normalize=#False). The result tends to cluster around a characteristic value for each type of data: natural languages (English, Chinese, Russian, etc.), cryptographically random bytes (0-255), small alphabets such as "A-Z", and so on.

You can use it, for example, to help filter out noise, detect text, or break simple ciphers.

https://en.wikipedia.org/wiki/Index_of_coincidence
https://www.dcode.fr/index-coincidence#q6
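
For reference, here is the quantity the procedure below computes when Normalize=#False, as I read the code against the Wikipedia definition linked above. The symbols are my own notation, not names from the code: n_i is the number of times byte value i occurs (Cnt(i) in the code) and N is the total number of bytes (len). The second line works through one of the sample strings listed at the end, the alphabet repeated twice, where every letter has n_i = 2 and N = 52.

```latex
\[
\mathrm{IC} = \frac{\sum_{i=0}^{255} n_i\,(n_i - 1)}{N\,(N - 1)},
\qquad N = \sum_{i=0}^{255} n_i
\]
\[
\text{alphabet} \times 2:\quad
\mathrm{IC} = \frac{26 \cdot 2 \cdot 1}{52 \cdot 51} = \frac{1}{51} \approx 0.0196
\]
```

Repeating the alphabet k times gives (k - 1) / (26k - 1), which approaches 1/26 (about 0.0385) as k grows. With Normalize=#True the code multiplies the IC by the number of distinct byte values that actually occur in the buffer.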

Code:

Procedure.f CoincidenceIndex(*buf, len, Normalize=#False)
  ;Returns the IC, a value between 0.0 and 1.0 (if Normalize=#False), or -1 if invalid input
  
  ;Need a valid pointer and at least 2 bytes; with len = 1 the denominator below would be 0
  If (len < 2) Or (Not *buf): ProcedureReturn -1: EndIf
  
  ;Get byte counts (distribution)
  Dim Cnt(255)   ;indices 0..255, one counter per possible byte value
  Protected *nextchar.Ascii
  For *nextchar = *buf To *buf + (len-1)
    Cnt(*nextchar\a) + 1
  Next
  
  ;Calculate IC
  Protected num.f, den.f, coefficient.i, i.i
  For i = 0 To 255
    If Cnt(i) 
      coefficient + 1
      num + ( Cnt(i) * (Cnt(i)-1) )
      den + Cnt(i)
    EndIf
  Next i
  
  Protected IC.f = (num / ( den * (den - 1) ) )  
  If Normalize = #True: IC * coefficient: EndIf    ;this is also known as returning the 'kappa-plaintext' instead of the IC
  ProcedureReturn IC
EndProcedure



;TEST ... (note: this is meant to be compiled in ASCII mode, not Unicode; in Unicode mode @buf$ points at 2-byte characters, so the byte counts would be wrong)
 
;buf$ = "QPWKALVRXCQZIKGRBPFAEOMFLJMSDZVDHXCXJYEBIMTRQWNMEAIZRVKCVKVLXNEICFZPZCZZHKMLVZVZIZRRQWDKECHOSNYXXLSPMYKVQXJTDCIOMEEXDQVSRXLRLKZHOV"
buf$ = "This is a really long English sentence. I dont really have much to discuss here other than making up long boring sentences. For example, I like sports and fruit and music and guitars and all things that are interesting and I love science and rockets and the universe and quantum physics and various youtube videos"
IdxOfCoinc.f = CoincidenceIndex(@buf$, Len(buf$))
Debug StrF(IdxOfCoinc)
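
If you are on a Unicode build (as far as I know, the only mode in recent PureBasic versions), one way to run the same test is to convert the string to a single-byte buffer first. This is only a minimal sketch: it assumes the CoincidenceIndex() procedure above is already defined, that the text is plain ASCII so Len() equals the byte count, and the variable names text$, *ascii and ic are mine.

```purebasic
; Sketch for Unicode builds: analyse an ASCII copy of the string.
; Assumes CoincidenceIndex() from the listing above is already defined.
text$ = "This is a really long English sentence. I dont really have much to discuss here."

*ascii = Ascii(text$)     ; allocates a null-terminated ASCII copy of the string
If *ascii
  ; For plain ASCII text, Len() is also the number of bytes in the buffer
  ic.f = CoincidenceIndex(*ascii, Len(text$))
  Debug StrF(ic, 4)
  FreeMemory(*ascii)      ; buffers returned by Ascii() must be freed
EndIf
```

For text with characters outside ASCII you would need to decide which byte encoding you want to measure, since the IC here is computed over bytes, not characters.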

SAMPLE RESULTS (all values are with Normalize=#False):

English 0.0667, French 0.0778, German 0.0762, Spanish 0.0770, Italian 0.0738, Russian 0.0529
Random "A-Z" is around 0.038 (close to 1/26)
Ciphertext (e.g. a uniformly random distribution over all byte values 0-255) is around 0.00389 (close to 1/256)
"AAAAAAAAAAA" = 1.0
"ABCDEFGHIJKLMNOPQRSTUVWXYZ" = 0.0
"ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ" = 0.0196078438
"ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ" = 0.0259740259