[Bioclusters] A representation of gene expression rate data over spaced time intervals

Tue, 27 May 2003 07:00:06 -0700

Sounds like curve clustering with nonlinear (or functional data)
regression models. Simple to fit in R, http://www.r-project.org/.

best,
-tony

"MyungHo Kim" <bio_front@hotmail.com> writes:

> Currently DNA micro-array experiments are being done extensively in genomic
> research, and analyzing those data is a challenging problem. Here we would
> like to suggest a representation method for the data of a single gene's
> expression over a certain period. This is a summary of the paper; the full
> text is available at www.biofront.biz or http://arxiv.org/abs/cs.CC/0305008 .
>
> Note: DNA micro-array techniques convert the expression rates into densities
> of stained images, which may be recorded as a series of numbers. All the
> experiments and phenomena suggest that the numbers fluctuate over time.
>
> Step 1. Function representation fitting with data
>
> Since we need a periodic-looking, fluctuating function, it is natural to
> start with sin(t) or cos(t), while, for increasing and decreasing effects,
> the exponential function is a feasible choice. Consequently, a
> possible function for representing the changes of the expression rate would
> be of the form exp(kt)sin(mt) or exp(kt)cos(mt), where k and m are real
> constants. Exponential functions have their share in science,
> especially in modeling problems and theories, so it is not surprising that
> they make an appearance here. Although we are
> comfortable and familiar with the function, once in a while one might ask
> the following question: why do exponential functions appear so often?
> Although the answer to this question is not obvious, I would like to justify
> my choice of the exponential function here. The clue could be in a
> profound experimental fact, i.e., that radioactive decay is measured in
> terms of half-life, the number of years required for half of the atoms in a
> sample of radioactive material to decay.
>
> Mathematically this is expressed as y' = ky
>
> Here y represents the mass and k is the rate constant. Then the general type
> of solution looks like y = C exp(kt), where t represents time and C is a
> constant. This might be extended to describe some sort of life expectancy of
> a certain phenomenon or behavior.
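
A minimal sketch of the decay model above, in Python. It evaluates
y = C exp(kt), checks y' = ky with a finite difference, and computes the
half-life as ln(2)/|k| for k < 0. The function names and numeric values
are illustrative assumptions, not taken from the paper.

```python
import math

def decay(C, k, t):
    """Solution of y' = k*y with y(0) = C."""
    return C * math.exp(k * t)

def half_life(k):
    """Time for y to halve; only defined for a decaying process (k < 0)."""
    if k >= 0:
        raise ValueError("half-life needs k < 0")
    return math.log(2) / -k

# Illustrative values (assumptions): initial mass 10, rate constant -0.25.
C, k = 10.0, -0.25
t_half = half_life(k)

# Central finite-difference check that y' is close to k*y at t = 1.
h = 1e-6
slope = (decay(C, k, 1.0 + h) - decay(C, k, 1.0 - h)) / (2 * h)

print(t_half)                        # ln(2) / 0.25
print(decay(C, k, t_half))           # close to C / 2
print(slope, k * decay(C, k, 1.0))   # the two values should nearly agree
```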
>
> Step 2. Determining coefficients of the functional representation
>
> Once we fix a candidate function, it remains to determine the coefficients,
> the C's and k's etc., for each set of data. This may be achieved by using
> the least square sum principle, to an accuracy set by our own standard.
> Commercial software, such as SAS and SPSS, is available for such
> calculations and reports the quality of the fit, namely R-squared. The
> least square sum method, the most popular one for fitting a
> curve/function to experimental data, finds the coefficients of a function
> of a given type by minimizing the sum of squares of the errors, or
> deviations. More precisely, given a set of data points (x1, y1), (x2,
> y2), ..., (xn, yn) and a candidate function f with undetermined
> coefficients, the unknowns in f are determined so that the sum of the
> squares of the errors, i.e., the differences between f(xi) and yi, is
> minimized. Note that two is the smallest exponent that is convenient for
> further manipulation, i.e., we can use many tools from calculus, including
> differentiation, unlike with the absolute value function, | |.
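
A rough sketch of such a least-squares fit for y = C exp(kt) sin(mt),
using only the Python standard library. This is not the procedure from
the paper or from SAS/SPSS: as a simplifying assumption it does a plain
grid search over (k, m) and, for each pair, uses the closed-form
least-squares value of C (the model is linear in C). The grids, function
names and synthetic data are all made up for illustration.

```python
import math

def fit(ts, ys, k_grid, m_grid):
    """Grid search over (k, m); C is solved in closed form for each pair."""
    best = None
    for k in k_grid:
        for m in m_grid:
            g = [math.exp(k * t) * math.sin(m * t) for t in ts]
            gg = sum(v * v for v in g)
            if gg == 0.0:
                continue  # basis vanishes at every sample point
            C = sum(y * v for y, v in zip(ys, g)) / gg
            sse = sum((y - C * v) ** 2 for y, v in zip(ys, g))
            if best is None or sse < best[0]:
                best = (sse, C, k, m)
    sse, C, k, m = best
    return C, k, m, sse

def r_squared(ys, sse):
    """R-squared = 1 - SSE / SST, with SST taken about the mean of ys."""
    mean = sum(ys) / len(ys)
    sst = sum((y - mean) ** 2 for y in ys)
    return 1.0 - sse / sst

# Synthetic, noise-free data from assumed true values C=2, k=-0.3, m=1.5.
ts = [0.5 * i for i in range(20)]
ys = [2.0 * math.exp(-0.3 * t) * math.sin(1.5 * t) for t in ts]

k_grid = [i / 10 for i in range(-10, 11)]   # -1.0 ... 1.0
m_grid = [i / 10 for i in range(1, 31)]     #  0.1 ... 3.0

C, k, m, sse = fit(ts, ys, k_grid, m_grid)
r2 = r_squared(ys, sse)
print(C, k, m, r2)
```

With real, noisy data a finer grid or a proper nonlinear optimizer would
be needed; the grid search only recovers (k, m) up to the grid spacing.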
>
> Step 3. Vector representation for machine learning method
>
> From the first two steps, we have obtained a "functional" representation
> for observed data, i.e., a function fitted to the data. Consequently,
> with respect to the fixed type, each function is represented as the set of
> coefficients calculated in step two. Suppose the data fit well with the
> model y = C exp(kt) sin(mt),
> where y is the expression rate. Then we can say that the set of numbers (C,
> k, m) represents the object, A. In other words, the object may be identified
> with the triple (C, k, m), analogous to students and their corresponding ID
> numbers. For the general case, i.e., the gene expression rates of n
> genes, we will get (C1, k1, m1), (C2, k2, m2), ..., (Cn, kn, mn), which
> together form a vector.
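
One way to read "form a vector" is to concatenate the per-gene triples
into a single flat list of 3n numbers, which is the kind of fixed-length
input an SVM expects. A small sketch under that assumption; the function
name and the coefficient values are hypothetical.

```python
def to_feature_vector(triples):
    """Concatenate (C, k, m) triples, one per gene, into one flat list."""
    vector = []
    for C, k, m in triples:
        vector.extend([C, k, m])
    return vector

# Hypothetical fitted coefficients for three genes (made-up numbers).
fitted = [(2.0, -0.3, 1.5), (0.8, 0.1, 0.7), (1.2, -0.05, 2.1)]
vector = to_feature_vector(fitted)
print(len(vector))  # 3 coefficients for each of 3 genes
```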
>
> Conclusion: Once we have the vector representation, by using SVMs, we would
> get a criterion, which may be applied for diagnosis of a disease etc.
>

-- 
A.J. Rossini  /  rossini@u.washington.edu  /  rossini@scharp.org
Biomedical/Health Informatics and Biostatistics, University of Washington.
Biostatistics, HVTN/SCHARP, Fred Hutchinson Cancer Research Center.
FHCRC: 206-667-7025 (fax=4812)|Voicemail is pretty sketchy/use Email
