4.1 Importing data
R provides a variety of tools for reading in data from a file. One of the more useful ones for bioinformatics data is the read.table function, which we will use to read in the first test dataset. This file contains microarray gene expression results from the widely used Golub et al. (1999) paper. The data set contains calls from around 7,000 gene probes on samples from 38 leukemia patients. Twenty-seven patients are diagnosed as having acute lymphoblastic leukemia (ALL) and eleven as having acute myeloid leukemia (AML).
Read in the datafile
The data file is in the folder /data/Data_interpretation_with_R and is called data_set_ALL_AML_train.tsv.
Note: The data file is located using a relative path, so the correct path depends on the directory from which you are running the Markdown document. If you are running the document from the newly created Data_Workshop folder, use the relative path ../data/Data_Interpretation/data_set_ALL_AML_train.tsv to read the file. The .. notation refers to the parent directory, so this path first goes up one directory.
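As a quick check before reading anything, the relative path can be built and tested from R. This is only a sketch: the variable name data_path is an assumption made for illustration, and the path components are copied from the note above.

```r
# Show the directory that relative paths are resolved against
getwd()

# Build the relative path from the note above (data_path is a hypothetical name)
data_path <- file.path("..", "data", "Data_Interpretation",
                       "data_set_ALL_AML_train.tsv")

# TRUE only if the file is reachable from the current working directory
file.exists(data_path)

# Expand the relative path to see where R is actually looking
normalizePath(data_path, mustWork = FALSE)
```

If file.exists() returns FALSE, the working directory is probably not the one the path was written for.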
- Use the head() and readLines() functions to look at the structure of the file
- With the read.table function, read the file contents into a data frame
- Use the second column (labelled Gene Accession Number, actually the microarray probe ID) as the row names
## [1] "Gene Description\tGene Accession Number\t1\tcall\t2\tcall\t3\tcall\t4\tcall\t5\tcall\t6\tcall\t7\tcall\t8\tcall\t9\tcall\t10\tcall\t11\tcall\t12\tcall\t13\tcall\t14\tcall\t15\tcall\t16\tcall\t17\tcall\t18\tcall\t19\tcall\t20\tcall\t21\tcall\t22\tcall\t23\tcall\t24\tcall\t25\tcall\t26\tcall\t27\tcall\t34\tcall\t35\tcall\t36\tcall\t37\tcall\t38\tcall\t28\tcall\t29\tcall\t30\tcall\t31\tcall\t32\tcall\t33\tcall"
## [2] "AFFX-BioB-5_at (endogenous control)\tAFFX-BioB-5_at\t-214\tA\t-139\tA\t-76\tA\t-135\tA\t-106\tA\t-138\tA\t-72\tA\t-413\tA\t5\tA\t-88\tA\t-165\tA\t-67\tA\t-92\tA\t-113\tA\t-107\tA\t-117\tA\t-476\tA\t-81\tA\t-44\tA\t17\tA\t-144\tA\t-247\tA\t-74\tA\t-120\tA\t-81\tA\t-112\tA\t-273\tA\t-20\tA\t7\tA\t-213\tA\t-25\tA\t-72\tA\t-4\tA\t15\tA\t-318\tA\t-32\tA\t-124\tA\t-135\tA"
## [3] "AFFX-BioB-M_at (endogenous control)\tAFFX-BioB-M_at\t-153\tA\t-73\tA\t-49\tA\t-114\tA\t-125\tA\t-85\tA\t-144\tA\t-260\tA\t-127\tA\t-105\tA\t-155\tA\t-93\tA\t-119\tA\t-147\tA\t-72\tA\t-219\tA\t-213\tA\t-150\tA\t-51\tA\t-229\tA\t-199\tA\t-90\tA\t-321\tA\t-263\tA\t-150\tA\t-233\tA\t-327\tA\t-207\tA\t-100\tA\t-252\tA\t-20\tA\t-139\tA\t-116\tA\t-114\tA\t-192\tA\t-49\tA\t-79\tA\t-186\tA"
## [4] "AFFX-BioB-3_at (endogenous control)\tAFFX-BioB-3_at\t-58\tA\t-1\tA\t-307\tA\t265\tA\t-76\tA\t215\tA\t238\tA\t7\tA\t106\tA\t42\tA\t-71\tA\t84\tA\t-31\tA\t-118\tA\t-126\tA\t-50\tA\t-18\tA\t-119\tA\t100\tA\t79\tA\t-157\tA\t-168\tA\t-11\tA\t-114\tA\t-85\tA\t-78\tA\t-76\tA\t-50\tA\t-57\tA\t136\tA\t124\tA\t-1\tA\t-125\tA\t2\tA\t-95\tA\t49\tA\t-37\tA\t-70\tA"
## [5] "AFFX-BioC-5_at (endogenous control)\tAFFX-BioC-5_at\t88\tA\t283\tA\t309\tA\t12\tA\t168\tA\t71\tA\t55\tA\t-2\tA\t268\tA\t219\tM\t82\tA\t25\tA\t173\tA\t243\tM\t149\tA\t257\tA\t301\tA\t78\tA\t207\tA\t218\tA\t132\tA\t-24\tA\t-36\tA\t255\tA\t316\tA\t54\tA\t81\tA\t101\tA\t132\tA\t318\tA\t325\tA\t392\tP\t241\tA\t193\tA\t312\tA\t230\tP\t330\tA\t337\tA"
## [6] "AFFX-BioC-3_at (endogenous control)\tAFFX-BioC-3_at\t-295\tA\t-264\tA\t-376\tA\t-419\tA\t-230\tA\t-272\tA\t-399\tA\t-541\tA\t-210\tA\t-178\tA\t-163\tA\t-179\tA\t-233\tA\t-127\tA\t-205\tA\t-218\tA\t-403\tA\t-152\tA\t-146\tA\t-262\tA\t-151\tA\t-308\tA\t-317\tA\t-342\tA\t-418\tA\t-244\tA\t-439\tA\t-369\tA\t-377\tA\t-209\tA\t-396\tA\t-324\tA\t-191\tA\t-51\tA\t-139\tA\t-367\tA\t-188\tA\t-407\tA"
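The three steps listed above can be sketched as follows. To keep the example runnable anywhere, it writes a small stand-in file containing the header and three probe rows from the preview, cut down to the first two samples. The names demo_file and golub are hypothetical, and the quote = "" and comment.char = "" settings are a precaution assumed here (gene descriptions can contain apostrophes or # characters), not something the original exercise prescribes.

```r
# Stand-in for data_set_ALL_AML_train.tsv: header plus three probe rows,
# first two samples only (values copied from the preview above)
demo_lines <- c(
  "Gene Description\tGene Accession Number\t1\tcall\t2\tcall",
  "AFFX-BioB-5_at (endogenous control)\tAFFX-BioB-5_at\t-214\tA\t-139\tA",
  "AFFX-BioB-M_at (endogenous control)\tAFFX-BioB-M_at\t-153\tA\t-73\tA",
  "AFFX-BioB-3_at (endogenous control)\tAFFX-BioB-3_at\t-58\tA\t-1\tA"
)
demo_file <- tempfile(fileext = ".tsv")
writeLines(demo_lines, demo_file)

# Step 1: look at the raw lines to confirm the file is tab-separated
# and has a header row
head(readLines(demo_file), n = 2)

# Steps 2 and 3: parse into a data frame, using column 2
# (Gene Accession Number, i.e. the probe ID) as the row names
golub <- read.table(
  demo_file,
  header = TRUE,
  sep = "\t",
  row.names = 2,
  quote = "",          # assumption: treat quote characters as plain text
  comment.char = ""    # assumption: do not treat # as a comment marker
)

dim(golub)       # rows are probes; the probe ID column is now the row names
rownames(golub)
colnames(golub)  # default check.names = TRUE renames "1" to "X1", etc.
```

For the real data, replace demo_file with the relative path to data_set_ALL_AML_train.tsv. Because read.table makes column names syntactically valid and unique by default, the numeric sample labels gain an X prefix and the repeated call columns are numbered.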