Formatting  data:  After you click the button format  data you will get the below shown window. This program will convert your  data to the format in which the training program will accept.

 

 

Your  data should be in the following format.

 

Identifier           s1         s2         s3         s4         s5         s6

Target              1          0          0          1          1          0

G1                   3          8          2          9          1          2

G2                   9          4          2          9          6          5

G3                   3          2          7          9          3          2

 

First row and column should contain the word Identifier followed by the names of identifier.

Second row first column should contain the word Target. Following this should contain the expected output of that sample 0 for one class and 1 for the other class. This row is essential for the training data. However it is optional for the testing data. If this row is present the option “Second column contains target” should be checked else unchecked. Target can be in the second row even in the testing file. In such case it will give a comparative result with the expected output.  But then don’t forget to check the option “Second column contains target” in this module and also in the testing module.

 

The following rows should contain gene identifier followed by expression values. All the expression values should be numeric. There should be no missing values or NAN in the dataset.

 

Input file: Select the file to be formatted.

 

Output file: Specify the output file name. This will be used for training or testing.

 

Divide values by: The program trains best when the data range is in between 0 and 1.  So if your data is of higher range you can divide the values by a particular number so as the values falls between 0 and 1. For cDNA arrays where we use ratio or log ratio, there is no need for changing the default. If you are using Affymetrix raw values, then divide the values by 10000 in order to get optimum results.

 

Randomize samples: Randomize samples is always helpful for the training. However you can uncheck this option while preparing data for testing. Basically it randomizes the order of training examples fed to the program. It happens that if all positive examples are shown at one time and all the negative examples after that, then the system keeps on forgetting the previous class.

 

Second row contains target: It is a must for the training file. The second row should contain the known outcomes of individual samples as shown in the previous example. However it’s optional for the testing set. If you do provide the results also for the testing set, then you get a comparative result of the actual and the predicted output.

 

On clicking on Format, you will see a dos window as shown below and the output file will be placed in the working folder.