Language coordinates

You don't need to know either HTML or Java to tell that files written in either language are very different. Files in HTML are all full of angle brackets (try a View Source from the browser on this page, if you don't believe me) and Java files are all full of curly brackets. They just look different!

In this project, we are going to write a program which acts on that idea to try and categorize files by language. We're going to write a character-counter program.

Now, you could use your knowledge of HTML and Java to decide whether a file was one or the other based solely on counts of angle brackets and curly brackets, but I'm going to ask you to do something more complicated. If you gather up counts of all 256 possible 8-bit characters, and figure out what fraction of the file each character accounts for, you'll have 256 numbers, which sum to one. The numbers can be considered to describe a point in 256-dimensional space. (Since the numbers must sum to one, that is, they are limited by a linear equation, the locus of possible points forms a hyperplane in the 256-dimensional space. But we don't really care about that.)

It makes sense (although I'm not really sure how true it is, since I've only experimented a little bit) that the points describing Java files will be closer to each other than to points describing HTML files, and that in fact, you ought to be able to distinguish Python files from either one. (Python is yet another computer language, notable for using indentation instead of curly brackets to mark block structure.) I expect, although I haven't tested it, that you could also recognize the difference between English and Italian.

Luckily the algorithm for finding the distance between two points in n-dimensional space is fairly simple.

Distance = square-root(sum (over all n coordinates, square(difference of corresponding coordinate values)))
You may recognize the two-dimensional version of this formula: D = sqrt((x1 - x2)^2 + (y1 - y2)^2)
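As a rough illustration, the distance formula might be written in Java along the following lines. The class name Distance and the method name distance are my own choices for this sketch, not names the assignment requires.

```java
// Hypothetical helper class; the names here are illustrative only.
class Distance {
    // Euclidean distance between two points that have the same number of coordinates.
    static double distance(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("points must have the same dimension");
        }
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double diff = p[i] - q[i];
            sum += diff * diff;          // square of the coordinate difference
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // The familiar 3-4-5 right triangle in two dimensions.
        System.out.println(distance(new double[] {0, 0}, new double[] {3, 4})); // prints 5.0
    }
}
```

The same method works unchanged for 256 coordinates, since it only loops over however many entries the arrays have.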

Nitty Gritty

We haven't yet discussed how to do file input in Java, but here is a program skeleton which does it. The program reads a file named "By.the" (not "By.the.txt", which is an easy name to create by accident). This program reads bytes, and doesn't check for full Unicode characters, so if you were reading a file written in Arabic or Hindi, it wouldn't work correctly, but files consisting of only ASCII characters will read fine.

    import java.io.FileInputStream;

    public class R {
        public static void main (String [] purlple)
                throws java.io.FileNotFoundException, java.io.IOException {
            FileInputStream f = new FileInputStream("By.the");
            while (true) {
                int c = f.read();        // one byte (0 to 255), or -1 at end of file
                if (c == -1) return;
                char d = (char)c;
                System.out.println(d);
            }
        }
    }

So how do you store the 256 coordinates, anyway?
In an array. I used an array of doubles, because once I've finished reading the file, I want to convert the counts in the array to fractions of the total. But you could also gather the counts in an array of ints, and convert them into a different array (which would still have to be doubles or floats, since the fractions are not whole numbers).
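Here is a minimal sketch of the second approach, counting into an array of ints and then converting to an array of doubles. The class and method names (CharFractions, count, toFractions) are made up for this example, and it reads from any InputStream rather than a particular file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names here are illustrative only.
class CharFractions {
    // Counts how often each byte value (0..255) occurs in the stream.
    static int[] count(InputStream in) throws IOException {
        int[] counts = new int[256];
        int c;
        while ((c = in.read()) != -1) {
            counts[c]++;                 // read() returns 0..255, so c is a valid index
        }
        return counts;
    }

    // Converts counts to fractions of the total; they sum to 1 for non-empty input.
    static double[] toFractions(int[] counts) {
        long total = 0;
        for (int n : counts) {
            total += n;
        }
        double[] fractions = new double[counts.length];
        if (total == 0) {
            return fractions;            // empty input: leave every entry at zero
        }
        for (int i = 0; i < counts.length; i++) {
            fractions[i] = (double) counts[i] / total;
        }
        return fractions;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("a{b}".getBytes(StandardCharsets.US_ASCII));
        double[] f = toFractions(count(in));
        System.out.println(f['{']);      // one of four characters: prints 0.25
    }
}
```

Note the cast to double before dividing; without it, Java would do integer division and nearly every entry would come out as zero.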

Demo

Here is a demo of me running my version of the program:
    C:\hw1> java Counter print < Counter.java
    HTML0.329351698754067
    Java0.015070606566663133
    double [] cset = {
    0.00000000, /* 0 : ? */
    0.00000000, /* 1 : ? */
    0.00000000, /* 2 : ? */
    0.00000000, /* 3 : ? */
    0.00000000, /* 4 : ? */
    0.00000000, /* 5 : ? */
    0.00000000, /* 6 : ? */
    0.00000000, /* 7 : ? */
    0.00000000, /* 8 : ? */
    0.00135898, /* 9 : ? */
    0.03204056, /* 10 : ? */
    0.00000000, /* 11 : ? */
    0.00000000, /* 12 : ? */
    0.03204056, /* 13 : ? */
    0.00000000, /* 14 : ? */
    0.00000000, /* 15 : ? */
    0.00000000, /* 16 : ? */
    0.00000000, /* 17 : ? */
    0.00000000, /* 18 : ? */
    0.00000000, /* 19 : ? */
    0.00000000, /* 20 : ? */
    0.00000000, /* 21 : ? */
    0.00000000, /* 22 : ? */
    0.00000000, /* 23 : ? */
    0.00000000, /* 24 : ? */
    0.00000000, /* 25 : ? */
    0.00000000, /* 26 : ? */
    0.00000000, /* 27 : ? */
    0.00000000, /* 28 : ? */
    0.00000000, /* 29 : ? */
    0.00000000, /* 30 : ? */
    0.00000000, /* 31 : ? */
    0.32510976, /* 32 : */
    0.00020907, /* 33 : ! */
    0.00073176, /* 34 : " */
    0.00010454, /* 35 : # */
    0.00010454, /* 36 : $ */
    0.00026134, /* 37 : % */
    0.00010454, /* 38 : & */
    0.00026134, /* 39 : ' */
    0.00188166, /* 40 : ( */
    0.00188166, /* 41 : ) */
    0.05498641, /* 42 : * */
    0.00282250, /* 43 : + */
    0.02717959, /* 44 : , */
    0.00141125, /* 45 : - */
    0.02848631, /* 46 : . */
    0.05451599, /* 47 : / */
    0.20050178, /* 48 : 0 */
    0.02487978, /* 49 : 1 */
    0.01730086, /* 50 : 2 */
    0.00966966, /* 51 : 3 */
    0.00862429, /* 52 : 4 */
    0.01390341, /* 53 : 5 */
    0.00930378, /* 54 : 6 */
    0.00888564, /* 55 : 7 */
    0.01097637, /* 56 : 8 */
    0.00815388, /* 57 : 9 */
    0.02702279, /* 58 : : */
    0.00224754, /* 59 : ; */
    0.00026134, /* 60 : < */
    0.00182940, /* 61 : = */
    0.00026134, /* 62 : > */
    0.00689944, /* 63 : ? */
    0.00010454, /* 64 : @ */
    0.00020907, /* 65 : A */
    0.00010454, /* 66 : B */
    0.00062722, /* 67 : C */
    0.00026134, /* 68 : D */
    0.00015681, /* 69 : E */
    0.00020907, /* 70 : F */
    0.00020907, /* 71 : G */
    0.00031361, /* 72 : H */
    0.00010454, /* 73 : I */
    0.00031361, /* 74 : J */
    0.00010454, /* 75 : K */
    0.00031361, /* 76 : L */
    0.00036588, /* 77 : M */
    0.00026134, /* 78 : N */
    0.00020907, /* 79 : O */
    0.00057495, /* 80 : P */
    0.00010454, /* 81 : Q */
    0.00010454, /* 82 : R */
    0.00078403, /* 83 : S */
    0.00047042, /* 84 : T */
    0.00010454, /* 85 : U */
    0.00010454, /* 86 : V */
    0.00010454, /* 87 : W */
    0.00010454, /* 88 : X */
    0.00010454, /* 89 : Y */
    0.00010454, /* 90 : Z */
    0.00073176, /* 91 : [ */
    0.00015681, /* 92 : \ */
    0.00073176, /* 93 : ] */
    0.00010454, /* 94 : ^ */
    0.00020907, /* 95 : _ */
    0.00010454, /* 96 : ` */
    0.00454736, /* 97 : a */
    0.00094083, /* 98 : b */
    0.00318838, /* 99 : c */
    0.00277023, /* 100 : d */
    0.00653356, /* 101 : e */
    0.00182940, /* 102 : f */
    0.00156805, /* 103 : g */
    0.00135898, /* 104 : h */
    0.00606314, /* 105 : i */
    0.00010454, /* 106 : j */
    0.00015681, /* 107 : k */
    0.00292703, /* 108 : l */
    0.00141125, /* 109 : m */
    0.00616768, /* 110 : n */
    0.00611541, /* 111 : o */
    0.00193393, /* 112 : p */
    0.00031361, /* 113 : q */
    0.00449509, /* 114 : r */
    0.00449509, /* 115 : s */
    0.00642902, /* 116 : t */
    0.00219527, /* 117 : u */
    0.00062722, /* 118 : v */
    0.00057495, /* 119 : w */
    0.00036588, /* 120 : x */
    0.00062722, /* 121 : y */
    0.00026134, /* 122 : z */
    0.00120217, /* 123 : { */
    0.00010454, /* 124 : | */
    0.00120217, /* 125 : } */
    0.00010454, /* 126 : ~ */
    0.00000000, /* 127 : */
    0.00000000, /* 128 : ? */
    0.00000000, /* 129 : ? */
    0.00000000, /* 130 : ? */
    0.00000000, /* 131 : ? */
    0.00000000, /* 132 : ? */
    0.00000000, /* 133 : ? */
    0.00000000, /* 134 : ? */
    0.00000000, /* 135 : ? */
    0.00000000, /* 136 : ? */
    0.00000000, /* 137 : ? */
    0.00000000, /* 138 : ? */
    0.00000000, /* 139 : ? */
    0.00000000, /* 140 : ? */
    0.00000000, /* 141 : ? */
    0.00000000, /* 142 : ? */
    0.00000000, /* 143 : ? */
    0.00000000, /* 144 : ? */
    0.00000000, /* 145 : ? */
    0.00000000, /* 146 : ? */
    0.00000000, /* 147 : ? */
    0.00000000, /* 148 : ? */
    0.00000000, /* 149 : ? */
    0.00000000, /* 150 : ? */
    0.00000000, /* 151 : ? */
    0.00010454, /* 152 : ? */
    0.00000000, /* 153 : ? */
    0.00000000, /* 154 : ? */
    0.00000000, /* 155 : ? */
    0.00000000, /* 156 : ? */
    0.00000000, /* 157 : ? */
    0.00000000, /* 158 : ? */
    0.00000000, /* 159 : ? */
    0.00010454, /* 160 : */
    0.00010454, /* 161 : */
    0.00000000, /* 162 : */
    0.00000000, /* 163 : */
    0.00000000, /* 164 : */
    0.00000000, /* 165 : */
    0.00177713, /* 166 : */
    0.00000000, /* 167 : */
    0.00000000, /* 168 : */
    0.00000000, /* 169 : */
    0.00010454, /* 170 : */
    0.00010454, /* 171 : */
    0.00020907, /* 172 : */
    0.00000000, /* 173 : */
    0.00000000, /* 174 : */
    0.00010454, /* 175 : */
    0.00010454, /* 176 : */
    0.00010454, /* 177 : */
    0.00010454, /* 178 : */
    0.00000000, /* 179 : */
    0.00000000, /* 180 : */
    0.00010454, /* 181 : */
    0.00000000, /* 182 : */
    0.00020907, /* 183 : + */
    0.00000000, /* 184 : + */
    0.00000000, /* 185 : */
    0.00010454, /* 186 : */
    0.00010454, /* 187 : + */
    0.00010454, /* 188 : + */
    0.00010454, /* 189 : + */
    0.00000000, /* 190 : + */
    0.00010454, /* 191 : + */
    0.00000000, /* 192 : + */
    0.00000000, /* 193 : - */
    0.00000000, /* 194 : - */
    0.00000000, /* 195 : + */
    0.00000000, /* 196 : - */
    0.00000000, /* 197 : + */
    0.00000000, /* 198 : */
    0.00000000, /* 199 : */
    0.00000000, /* 200 : + */
    0.00000000, /* 201 : + */
    0.00000000, /* 202 : - */
    0.00000000, /* 203 : - */
    0.00000000, /* 204 : */
    0.00000000, /* 205 : - */
    0.00000000, /* 206 : + */
    0.00000000, /* 207 : - */
    0.00000000, /* 208 : - */
    0.00010454, /* 209 : - */
    0.00000000, /* 210 : - */
    0.00000000, /* 211 : + */
    0.00000000, /* 212 : + */
    0.00000000, /* 213 : + */
    0.00000000, /* 214 : + */
    0.00000000, /* 215 : + */
    0.00000000, /* 216 : + */
    0.00000000, /* 217 : + */
    0.00000000, /* 218 : + */
    0.00000000, /* 219 : */
    0.00000000, /* 220 : _ */
    0.00000000, /* 221 : */
    0.00000000, /* 222 : */
    0.00010454, /* 223 : */
    0.00000000, /* 224 : a */
    0.00010454, /* 225 : */
    0.00000000, /* 226 : G */
    0.00000000, /* 227 : p */
    0.00000000, /* 228 : S */
    0.00000000, /* 229 : s */
    0.00000000, /* 230 : */
    0.00000000, /* 231 : t */
    0.00000000, /* 232 : F */
    0.00000000, /* 233 : T */
    0.00000000, /* 234 : O */
    0.00000000, /* 235 : d */
    0.00000000, /* 236 : 8 */
    0.00010454, /* 237 : f */
    0.00000000, /* 238 : e */
    0.00000000, /* 239 : n */
    0.00000000, /* 240 : = */
    0.00010454, /* 241 : */
    0.00000000, /* 242 : = */
    0.00010454, /* 243 : = */
    0.00000000, /* 244 : ( */
    0.00000000, /* 245 : ) */
    0.00000000, /* 246 : */
    0.00010454, /* 247 : */
    0.00000000, /* 248 : */
    0.00000000, /* 249 : */
    0.00010454, /* 250 : */
    0.00000000, /* 251 : v */
    0.00000000, /* 252 : n */
    0.00000000, /* 253 : */
    0.00000000, /* 254 : */
    0.00000000, /* 255 : */
    };
    C:\hw1>

My main program is in the Counter class, and I'm running my program from the command line, using the option "print", and redirecting standard input from the handy Java file Counter.java, which happens to be the only Java file in the directory I'm developing in.
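The skeleton earlier reads from a named file, while this demo reads from redirected standard input and looks at a command-line option. One possible way to handle both is sketched here; CounterSketch, hasOption, and countBytes are hypothetical names, and this is not a listing of my actual Counter class.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch; not the actual Counter program.
class CounterSketch {
    // True if the given option word appears anywhere on the command line.
    static boolean hasOption(String[] args, String option) {
        for (String a : args) {
            if (a.equals(option)) {
                return true;
            }
        }
        return false;
    }

    // Reads the stream to the end and returns how many bytes it held.
    static long countBytes(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        boolean print = hasOption(args, "print");
        // System.in is whatever was redirected with "<" on the command line.
        long n = countBytes(System.in);
        System.out.println("read " + n + " bytes; print option " + (print ? "on" : "off"));
    }
}
```

Run as "java CounterSketch print < SomeFile.java", the shell hands the file's contents to System.in, so the program never needs to know the file's name.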

You'll notice from the output that my program contains sample tables for a Java program and an HTML file, and that it begins by printing the distance from the input to each of them. With the print option on the command line, it also prints the input's table in a form that can easily be cut and pasted into a Java source file (which is how I got those tables to put into my program).
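To go from distances to a guess at the language, one simple approach is to pick whichever sample table is closest. The sketch below assumes that approach; the class name NearestTable is hypothetical, and the two tiny tables are invented for illustration, not measured from real files.

```java
// Hypothetical sketch of choosing the closest sample table.
class NearestTable {
    // Euclidean distance between two points of equal dimension.
    static double distance(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double diff = p[i] - q[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Index of the reference table closest to the sample, or -1 if there are none.
    static int nearest(double[] sample, double[][] references) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < references.length; i++) {
            double d = distance(sample, references[i]);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] names = {"HTML", "Java"};
        double[][] tables = new double[2][256];
        // Invented values for illustration only; real tables come from real files.
        tables[0]['<'] = 0.5;
        tables[0]['>'] = 0.5;
        tables[1]['{'] = 0.5;
        tables[1]['}'] = 0.5;

        double[] sample = new double[256];
        sample['{'] = 0.4;
        sample['}'] = 0.4;
        sample['<'] = 0.2;

        for (int i = 0; i < tables.length; i++) {
            System.out.println(names[i] + distance(sample, tables[i]));
        }
        System.out.println("closest: " + names[nearest(sample, tables)]);
    }
}
```

With these made-up numbers the sample is closer to the "Java" table, because most of its weight sits on the curly brackets.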

My most frustrating bug while developing my program resulted from junk characters in those comments, which is why all the characters less than 32 are printed out as "?" instead of smiley-face, etc.
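A sketch of how the table might be printed with safe comments follows. It is stricter than what my demo output shows: it replaces everything outside printable ASCII (32 through 126) with "?", not just the characters below 32. The names TablePrinter, printable, and entry are hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch of printing a table as paste-ready Java source.
class TablePrinter {
    // Returns the character itself if it is printable ASCII (32..126), otherwise '?'.
    static char printable(int c) {
        return (c >= 32 && c <= 126) ? (char) c : '?';
    }

    // Formats one table entry: the value, then a comment naming the character.
    // For example, entry(123, 0.5) gives "0.50000000, /* 123 : { */".
    static String entry(int index, double value) {
        return String.format(Locale.ROOT, "%.8f, /* %d : %c */", value, index, printable(index));
    }

    // Prints the whole table as an array initializer.
    static void print(double[] fractions) {
        System.out.println("double [] cset = {");
        for (int i = 0; i < fractions.length; i++) {
            System.out.println(entry(i, fractions[i]));
        }
        System.out.println("};");
    }

    public static void main(String[] args) {
        double[] fractions = new double[256];
        fractions['{'] = 0.5;
        fractions['}'] = 0.5;
        print(fractions);
    }
}
```

Locale.ROOT keeps the decimal separator a period regardless of the machine's regional settings, which matters if the output is going to be pasted back into Java source.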

You will turn in:

  1. A development diary describing the process of getting your program to work, who you talked to, what web sites you looked at, what blind alleys you went down, etc.
  2. A listing of all the code that you wrote.
  3. A printout of a sample run, showing your program actually working.