The program uses a very simplified idea of a word. It discards any token made up entirely of digits, discards any token containing non-ASCII characters, and breaks at any character that is not alphanumeric.

The "words" files contain single words and bigram words. The bigram words are built with a sliding window over the last "valid" word and the current word, joined by a single space, giving entries of the form "last current".

We also ignore a short list of 313 stop words, so they are not included in any of the lists.

#################
NOTE: All UTF8 characters are first converted to ASCII before Words are identified.
NOTE: All HTML Encodings are converted to ASCII before Words are identified.
#################

The frequency-ordered files were produced with:

sort -t'|' -k1,1nr -k2,2 singleWords > singleWords.s
sort -t'|' -k1,1nr -k2,2 bigramWords > bigramWords.s

These words and counts were computed over all of the 2016 Medline baseline citations.

singleWords   - Single words in alpha order.
singleWords.s - Single words in frequency of occurrence order.
bigramWords   - Bigram words in alpha order.
bigramWords.s - Bigram words in frequency of occurrence order.

Files all have the same format:

Frequency Count|Text|

---------------------------
Summary:

Files Created: March 17, 2017
Found 26,759,399 citations
Number of Unique Words: 3,938,049
Number of Unique Bigram Words: 61,604,020
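
The word and bigram rules described above can be sketched in Python as follows. This is only an illustration, not the program that produced the files. The function names, the tiny stand-in stop word set, the case-insensitive stop word test, and the choice to restart the sliding window at each citation are all assumptions. The input is assumed to have already gone through the UTF8 and HTML conversion to ASCII.

```python
import re
from collections import Counter

# Hypothetical stand-in: the real 313-entry stop word list is not reproduced here.
STOP_WORDS = {"the", "of", "and", "in", "a", "to"}


def valid_words(text, stop_words=STOP_WORDS):
    """Yield the "valid" words of one citation's text."""
    # Break at anything that is not alphanumeric.
    for token in re.split(r"[\W_]+", text):
        if not token:
            continue
        if not token.isascii():   # discard tokens with non-ASCII characters
            continue
        if token.isdigit():       # discard tokens that are all numbers
            continue
        # Assumption: stop words are matched case-insensitively.
        if token.lower() in stop_words:
            continue
        yield token


def count_words(texts, stop_words=STOP_WORDS):
    """Count single words and "last current" bigrams over many texts."""
    singles, bigrams = Counter(), Counter()
    for text in texts:
        last = None  # assumption: the window restarts for each citation
        for word in valid_words(text, stop_words):
            singles[word] += 1
            if last is not None:
                bigrams[last + " " + word] += 1
            last = word
    return singles, bigrams


def format_lines(counts):
    """Render "Frequency Count|Text|" lines, highest count first.

    Mirrors sort -t'|' -k1,1nr -k2,2, except that ties are ordered by
    code point, which can differ from a locale-aware sort.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{count}|{word}|" for word, count in ordered]
```

Under these assumptions, count_words(["protein kinase 42 of the cell"]) yields the bigrams "protein kinase" and "kinase cell": the number is discarded, "of" and "the" are skipped as stop words, and the window bridges the gap because it pairs the last valid word with the current one.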