Chemometrics 711
If we add additional variables into our model — say a third polynomial order — R² will always increase. At some point it will reach 1: that happens when we have as many variables as we have data points. So R² is not a very good parameter for deciding which model to choose; it is a good parameter only for comparing models with the same number of variables. And we can see here: if we make predictions with those two or three models on the slides, there is a difference between the linear and the quadratic fit, but the quadratic and cubic models give pretty much the same predictions — the values do not differ much. So adding the second order definitely improves the fit, and if you add more orders, at some point the model will fit the data perfectly, but those extra parameters do not necessarily help. If you have to select a model, you should use different parameters to make that choice.

To repeat: R² will always increase if you add more variables — your model fits the training data better and better — and since it is always increasing, it is not good for model selection; it is useful for comparing models with the same number of variables. If you want a fit statistic for model selection, use adjusted R² instead. What adjusted R² does is take the number of objects and the number of variables into the equation as well, so it lets you compare models that contain different numbers of variables. If we add more variables than the data justify, adjusted R² starts decreasing. Going back to the example: for the second-order model, the adjusted R² increased compared to the linear one, so the quadratic term explains more; however, when the third order is added into the model, the adjusted R² starts decreasing, and it decreases even more as further parameters are added. Based on that, a second-order model is enough to describe this relationship.
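As a small illustration (my own sketch, not the lecture's data set), here is how R² and adjusted R² behave as the polynomial order grows. The data are generated from a quadratic, so plain R² should keep climbing while adjusted R² should peak around order 2:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2 + 1.5 * x - 0.3 * x**2 + rng.normal(0, 1, x.size)  # quadratic + noise

n = x.size
for order in (1, 2, 3, 5):
    coeffs = np.polyfit(x, y, order)
    y_hat = np.polyval(coeffs, x)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    # order = number of predictors (powers of x) in the model
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - order - 1)
    print(f"order {order}: R2 = {r2:.4f}  adjusted R2 = {r2_adj:.4f}")
```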
Depending on what kind of relationship we have in our data, we can add different orders to the equation. I will give you this same data set in the seminar, and you can see for yourself how the linear fit goes through those data points, how the second-order curve goes through them, and the third; you can go up to the tenth order or however far you want, to see how the fit develops.
Now, multiple linear regression. The previous case was polynomial regression; however, there we used only one X variable, just raised to different powers. In multiple linear regression we have one y variable and many X variables, we add them all to the linear regression model, and we seek the relationship between one dependent variable and more than one independent variable. Why do we do that? The relationship between y and any single X variable might be weak, but several X variables taken together might explain y much better than any of them alone. The multiple linear regression equation looks like this: an intercept, plus one coefficient times the first X variable, plus another coefficient times the second, and so on. The result is again a linear latent variable, and its property is that it gives the maximum Pearson correlation between y and the predicted ŷ.
When we use univariate linear regression — one y and one x — we try to find the best line through the data points. With two X variables, we try to find the best plane through those data points, and with three, four, or any larger number of variables, we try to find the best hyperplane through them.
Here is an artificial example: we have y and three X variables. Looking at the relationship between y and each X variable separately, X1 and X2 each explain about 48% of the variance in y, while X3 explains almost nothing. If we put X1 and X2 together into one equation, we can explain 88.8% of the variance in y. Adding X3 on top — as I said, R² always increases, so it increases a little bit more — but I would still choose the two-variable model, because there is no reason to make the model more complicated than necessary.
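A minimal sketch of the same idea on synthetic data (my own example, so the numbers will not match the lecture's 48% and 88.8%): two informative, uncorrelated predictors, one uninformative one, and the R² of each model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic illustration: two useful, uncorrelated predictors and one noise variable
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)                        # unrelated to y
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(0, 1, n)

for name, X in [("x1 alone", x1[:, None]),
                ("x2 alone", x2[:, None]),
                ("x3 alone", x3[:, None]),
                ("x1 + x2", np.column_stack([x1, x2])),
                ("x1 + x2 + x3", np.column_stack([x1, x2, x3]))]:
    r2 = LinearRegression().fit(X, y).score(X, y)   # score() returns R²
    print(f"{name:14s} R2 = {r2:.3f}")
```

Adding x3 will nudge R² up a tiny amount on the training data, which is exactly why R² alone should not decide the model size.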
Such linear regression models are quite often used in quantitative structure–activity relationships (QSAR). QSAR is an approach where we try to find the relationship between chemical structures and their biological activity — toxicity or the bioconcentration factor, for example. The X variables in these cases are theoretical molecular descriptors, which describe the chemical structure numerically. The main idea behind quantitative structure–activity relationships is that chemicals with similar structures are more likely to have similar chemical or biological properties.

The main workflow is to compile a data set from public databases or from the literature — we can also produce the data ourselves, say, measure some biological activity for a few hundred chemicals — and then try to build a model that can predict those activities for compounds that are not in the data set. Why use linear regression models for this? Because they are easy to explain. If you want to use a model for regulatory purposes, a simple linear regression model will quite often be chosen over a complicated neural network, because people can understand what goes on inside a linear model, while what goes on inside a neural network nobody really knows. And building the model by itself does not tell you its performance: if you add enough variables into the model, you can fit your training set almost perfectly. But the point is that the model must be able to predict unknown compounds, so its performance has to be estimated separately.
Usually, when we build multiple linear regression models of this kind, we have one y property and hundreds of variables, and we want to end up with something compact: we select five or six variables out of those hundreds, fit the model, and then we have to test it against unknown data. The usual way to get this "unknown" data is to split the data set before any modeling, because if we use all the data for fitting, we only learn how well the variables describe the data we already have — and that is not enough.

Depending on how much data we have to begin with, there are different strategies for splitting. If we have lots of data, we can split it into three different sets: the training set, the data used to build the model; the validation set, used to validate and control the model during development; and finally, once we have chosen the final model using training and validation, the test set, which serves as an additional check of whether — and how well — the model works on unknown objects. If we have a bit less data, we can use just a calibration (training) set and a test set: model optimization on the calibration set is then done with cross-validation or the bootstrap, leaving one or a few objects out multiple times to see how the model performs on the left-out objects. We will talk about cross-validation and the bootstrap next week. The test set is then used to estimate the final predictive performance of the model. If we have very little data, we can use only a training set, with cross-validation doing double duty for both optimization and evaluation.
However, that last approach is not recommended. It was widely used some thirty years ago, when there simply was not much data to begin with; if you go to the literature and look at the models from thirty years ago, that is basically how they were built, and it was accepted then — but today nobody will publish something like that. So those methods are not recommended, especially the variant where you use all the data as the training set and say, OK, that's it. That may be acceptable for a calibration where you already have all the data you will ever see, but if you build a model that is supposed to predict some biological activity, it is not enough. Now, how do we choose which objects go into the training and test sets — how do we split?
One way is to order your data according to the y values, from highest to lowest, and pick every 2nd, 3rd, or whatever observation into the test set; all the other objects go into the training set. This way you cover the whole range of values present in your data. You can also do random sampling: randomly pick, say, 20% of the objects and put them into the test set. Or you can use clustering: clustering groups similar objects together, so you cluster the data, obtain, say, six different clusters in which the objects are similar to each other, and then randomly pick some objects from each cluster into the test set. What you should not do, if you have ordered your data, is take just the upper or lower part as the test set, because that way you cannot estimate the prediction power of your model correctly: you want to see that your model works over the whole range of your data.
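Here is a minimal sketch of the first strategy — order by the response and send every n-th object to the test set (my own helper, assuming numpy arrays; `every=4` gives roughly a 25% test set):

```python
import numpy as np

def ordered_split(X, y, every=4):
    """Sort objects by y, send every `every`-th one to the test set.

    Sketch of the 'order by response value' splitting strategy:
    the test set then spans the whole range of y.
    """
    order = np.argsort(y)
    test_idx = order[::every]
    train_idx = np.setdiff1d(order, test_idx)
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Tiny usage example with made-up data
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
X_tr, X_te, y_tr, y_te = ordered_split(X, y)
print(len(y_tr), "training objects,", len(y_te), "test objects")
print("test set spans y from", y_te.min().round(2), "to", y_te.max().round(2))
```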
Now suppose we start building a model and we have, say, 100 objects but 400 variables. That will be your situation if you do calibration with spectral data: you will not have many objects, but you will have far more wavelengths — in a spectrum you can easily have thousands of variables. How to deal with spectral data we will discuss next week; today the question is: why not simply use all the X variables in the model? The first reason is the multicollinearity problem. So what is multicollinearity? Let's start with what is not multicollinearity.
No multicollinearity is the case where the X variables are not related to each other — basically, they do not correlate. If you go back to last week and think about the assumptions of linear regression models, one assumption was exactly that: the X variables are independent and not correlated. Usually that is not the case, but here is an example: we have a y variable and two X variables to explain it. X1 explains a small portion of the variance in y, X2 a little bit more, and the two are not correlated with each other. That makes them very easy to interpret: if a variable changes by one unit, y changes by the value of its coefficient. And because the X variables are uncorrelated, adding an additional variable to the model leaves the existing coefficients unchanged.
I illustrated this with an example: I fitted a linear model using y and X1 and got some coefficients — an intercept and a coefficient for X1. When I add the second variable, you can see that the coefficient for the first one stays exactly the same as in the univariate model. This happens only in textbook demonstrations; in real life it does not work like that — your X variables are always at least a little correlated. I could do it here because I used principal components as the X variables, and principal components, as you may remember, are uncorrelated. Incidentally, R² for the first model was about 0.002 — X1 alone explained essentially nothing — while with the second principal component added, R² for the model rose to 0.88.
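A small sketch of that demonstration on my own synthetic data: start from two correlated variables, take the PCA scores (which are uncorrelated), and check that the coefficient for PC1 does not move when PC2 enters the model:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(0, 0.5, 200)            # correlated with x1
X_raw = np.column_stack([x1, x2])
scores = PCA(n_components=2).fit_transform(X_raw)  # PC scores are uncorrelated
y = 1.0 + 0.5 * scores[:, 0] + 2.0 * scores[:, 1] + rng.normal(0, 0.3, 200)

m1 = LinearRegression().fit(scores[:, :1], y)      # PC1 only
m2 = LinearRegression().fit(scores, y)             # PC1 + PC2
print("coef for PC1, alone:   ", m1.coef_[0].round(4))
print("coef for PC1, with PC2:", m2.coef_[0].round(4))  # essentially identical
```

With raw, correlated variables in place of the PC scores, the first coefficient would shift as soon as the second variable is added.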
Usually, though, we have a situation like this: y and two X variables that are somewhat correlated. We can still see the unique effect of each X variable — that part we can interpret — but the coefficients no longer show the whole picture: they cannot tell you what is happening in the shared, overlapping part. We can still interpret the coefficients if we keep in mind that the X variables are slightly correlated. However, if the X variables are highly correlated — and that will be the case if you use spectral data — then the least-squares estimation runs into problems: there is not enough unique information in X1 and X2, the coefficients become imprecise, and if you take a different sample from your data set, the coefficients for X1 and X2 can even change sign. So you cannot really interpret those coefficients; that shared part stays vague.
If you have a perfect correlation between X1 and X2, you actually cannot do ordinary least-squares regression at all, because you cannot take the inverse of the covariance matrix — the matrix is singular. The same thing happens if you have more variables than observations. Here is a small example: as we keep adding variables one by one, R² keeps increasing, and when there are as many fitted parameters as objects, the fit becomes perfect — in effect, each coefficient describes one data point. You get a perfect correlation on the training data, but if you apply the model to unknown data points, it most likely cannot predict them at all: that is overfitting. And if you try to fit a model with six variables on such a small data set, you can see that the software cannot even estimate the coefficient for the sixth one — it reports only the intercept and the first coefficients. If you have spectral data and you try to put the whole spectrum into a multiple linear regression model, this is exactly what will happen.
So if too many variables go into the model, we usually end up overfitting; without the right number of variables, the model goes wrong in one direction or the other. If we have too few variables, we cannot capture the main features of the data — for example, we try to fit a straight line to curved data — and if the model is too complicated, the fitted function starts capturing the noise in the data. You want to find the optimal model in between, and that is not an easy thing to do. The complexity of regression models increases as we add variables — or, in PCR and PLS, which we will discuss next week, as we increase the number of components. So when we do modeling, we usually build models with different numbers of variables: we start with two, add another one, and another, and another. The calibration fit on the training set always improves — R² goes higher, the error goes lower, and so on. However, if we test the model on held-out data, at some point the prediction error (or the corresponding R²) starts getting worse again. We try to find the optimal point where the prediction error is minimal.
To find the best model, we have to compare our models somehow, and there are multiple statistics for that; we should actually look at them all together, not just at one. The first is the mean squared error. It is used while fitting and optimizing the model, but it is not in the same units as the predicted property — everything is squared — so usually its square root, the RMSE, is reported instead. Depending on which data set it is computed on, the literature uses different notations: RMSEC for calibration, RMSECV for cross-validation, RMSEP for prediction, and so on. There are also the information criteria, AIC and BIC, which are useful for model selection. They, too, are based on the residuals of the model, but they additionally take into account the number of objects and the number of parameters in the model. The values themselves are meaningless — you cannot interpret them in any way — but they are used to compare different models: between two models you pick the one where the criterion is smaller. BIC puts a heavier penalty on the number of parameters, so it will usually favor smaller models. And then there is adjusted R², which I have already mentioned. So if you have to select a model, look at multiple statistics, not just one.
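A sketch of these statistics in code (my own example; note that several slightly different AIC/BIC formulas circulate — this uses the common Gaussian-likelihood form, up to an additive constant):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_stats(X, y):
    """RMSE, AIC, BIC for an OLS fit (Gaussian-likelihood convention)."""
    n, p = X.shape
    model = LinearRegression().fit(X, y)
    resid = y - model.predict(X)
    rss = np.sum(resid ** 2)
    rmse = np.sqrt(rss / n)
    k = p + 1                                    # coefficients + intercept
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)    # heavier penalty for large n
    return rmse, aic, bic

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 5))
y = 1 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 50)  # only 2 real predictors

for p in range(1, 6):
    rmse, aic, bic = fit_stats(X[:, :p], y)
    print(f"{p} variables: RMSE={rmse:.3f}  AIC={aic:.1f}  BIC={bic:.1f}")
```

RMSE decreases monotonically as variables are added, while AIC and BIC should bottom out near the true model size — which is exactly why they are the better selection tools.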
If they all say the same thing, good. And usually you should select a group of model candidates rather than jump at a single one: say you built 10 or 100 models — you keep, say, the five best of them. First you look at the statistics; if the statistics are all pretty similar, then you look at the variables. If you have worked with such data before, you may know that a certain variable explains your property particularly well, or that it has some mechanistic interpretation. From that group of candidate models, I suggest taking the one you can interpret — the one you can put a mechanistic interpretation behind — because such models are much easier to publish, and also easier to use, since you know why something is happening.
Now comes the difficult part: selecting the final model from a set of already trained models is much easier than selecting which variables go into the model in the first place. If you have ten variables, you can go through all the combinations they can form; that does not take long on a computer — there are around a thousand combinations. Add another ten variables and you have about a million combinations; another ten and it is about a billion. If you have a thousand variables — guess what the number of combinations is: if you start going through all of them, most likely the sun, or even the universe, will die before the search finishes. And even if you restrict the search, it is still a very big number and can take a lot of time. Two weeks ago I thought: OK, good idea — I have 470 variables, I will go through all the three-variable combinations to find the best model out of those. My code was a bit sloppy: it did not save anything during the run and kept everything in memory — and after two weeks the computer decided it was a good time to do an update and restart, and everything was lost. So I learned my lesson. Even with a small number of variables, going through all the combinations can take a lot of time.
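The arithmetic is easy to check (a quick sketch; `comb` is from Python's standard library):

```python
from math import comb

# Number of possible variable subsets: 2**n - 1 non-empty subsets,
# or comb(n, k) if you fix the subset size.
for n in (10, 20, 30, 1000):
    print(f"{n:5d} variables -> {2**n - 1:.3e} non-empty subsets")

# The lecture's anecdote: all 3-variable models from 470 candidates
print("C(470, 3) =", comb(470, 3))   # ~17.2 million models to fit
```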
So how do we select the variables for the model? If you have some prior knowledge, it is easier: put those variables in, then try to find additional variables that complement them. If your data set is small enough, you can do an exhaustive search over all the combinations, but often that is not feasible. What can you do to make it tractable? I will talk a little about forward selection, backward elimination, and genetic algorithms, but there is also simulated annealing and various other feature selection methods you can use. Before we start selecting variables for a model, though, we can also simply reduce the number of variables we have to work with.
Go through the variables and find, for example, variables with many missing values — perhaps, for some reason, those values could not be calculated for some chemical compounds; variables with many missing values you can simply exclude from the analysis. The same goes for variables whose variance is low or zero: if you remember from principal component analysis, variance is information — the model cannot learn anything if all the values are the same. If a variable contains many severe outliers, you can also exclude it. Highly correlated variables: as I said, for multiple linear regression, highly correlated variables cause problems, so you can remove one variable out of each highly correlated pair and keep the ones that are useful. When you compare variances, remember that variables on very different scales may appear to have high variance that is not really informative, so scale them first. You can also keep the variables that have a high correlation coefficient with the y variable: they already explain quite a lot of the information you have in y. But variables with low correlation to y can be important too, if they capture variance that is not captured by the other variables. Going back to the earlier example: we have one X variable that explains quite a lot of y, and then there is another one that is not correlated with X1 and X2 but can still explain variance that X1 and X2 do not capture. So do not discard a variable just because its correlation with y is small — together with the others it can still be important. And if you know that certain variables must be in the model, put them in the model.
Now a couple of automatic methods: forward selection, backward elimination, and stepwise regression. Forward selection starts by picking the variable that correlates best with the y variable; then it keeps adding variables, each time the next best one, and so on. Note that you cannot select the next variable based on R² values alone, since R² always increases; you need some other criterion to optimize — for example an F value, a p value, or an information criterion. Backward elimination goes the other way around: we put all the variables into the model and start eliminating them until the model starts getting worse. As you can see, the catch with backward elimination is that you cannot have more variables than objects, because in that case you cannot fit the initial model with all the variables in it: with, say, five objects and ten variables, you cannot do backward elimination, because you cannot fit the model with all the variables.
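Here is a compact sketch of forward selection (my own implementation; I score candidates by cross-validated R², though an F- or p-value criterion, as mentioned above, would slot in the same way):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_vars=5):
    """Greedy forward selection scored by 5-fold cross-validated R²."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_vars:
        # Try adding each remaining variable; keep the best candidate
        scores = [(np.mean(cross_val_score(LinearRegression(),
                                           X[:, selected + [j]], y, cv=5)), j)
                  for j in remaining]
        score, j = max(scores)
        if score <= best_score:        # no improvement -> stop
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 10))
y = 2 * X[:, 3] - X[:, 7] + rng.normal(0, 0.5, 60)
print(forward_selection(X, y))        # should recover columns 3 and 7
```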
Stepwise regression is a combination of the two: on its way to the best model it can both add and exclude variables. Keep in mind that with forward selection and backward elimination, the final model is not guaranteed to be optimal; they basically hand you a single final model, although most likely several nearly equivalent models exist, and important variables may be eliminated along the way (or unimportant ones kept). Also, none of these methods considers your knowledge about the predictors — although you can explicitly force a particular variable to stay in the model and let the method build around it, which is a kind of compromise. In my experience, these methods work well as long as you do not have too many variables; once I go above roughly a hundred candidate variables, I have not gotten any good models out of them. With a smaller set of variables, though, they work well.
There is also a variant that combines ideas from these stepwise methods: it first builds all the two-parameter models — for two parameters there are not that many combinations — and keeps the best ones, say the best 400, or whatever number you give it. Then it adds a third descriptor to each of those 400 models and again selects the best 400, and so on, until it reaches a plateau — that is, until the models stop getting better.
Then we have the genetic algorithm. The idea comes from biology — it is an evolution-inspired procedure. Basically, we work with bit strings of ones and zeros, and the length of the bit string is the number of variables you have. We start from random points, randomly assigning ones and zeros to each position: a one means the variable goes into the model, a zero means it does not. For each such model we calculate some fitness criterion — an F value, a p value, adjusted R², whatever — and we start with, say, 100 different random models, each with a different set of variables selected. The bad ones, the ones without any prediction power, are deleted; from the good ones you try to breed better ones: you cross the chromosomes, combining them, in the hope of finding a better solution to your problem. At some point you also add mutations — randomly switching a variable on or off — and you run this for hundreds or thousands of cycles until you converge on an optimal region. It does not give you one model but a bunch of candidate models, and you then select the final model from those. With a multiple linear regression model this can take minutes to hours, because fitting a single linear model is fast.
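A toy genetic algorithm for variable selection might look like this (entirely my own sketch: the population size, mutation rate, and cross-validated-R² fitness are illustrative choices, not the lecture's settings):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)

def fitness(bits, X, y):
    """Cross-validated R² of the model using the switched-on variables."""
    if bits.sum() == 0:
        return -np.inf
    return np.mean(cross_val_score(LinearRegression(),
                                   X[:, bits.astype(bool)], y, cv=5))

def ga_select(X, y, pop=40, gens=30, p_mut=0.02):
    n_vars = X.shape[1]
    population = rng.integers(0, 2, size=(pop, n_vars))    # random bit strings
    for _ in range(gens):
        scores = np.array([fitness(b, X, y) for b in population])
        keep = population[np.argsort(scores)[-pop // 2:]]  # survivors
        children = []
        while len(children) < pop - len(keep):
            a, b = keep[rng.integers(len(keep), size=2)]
            cut = rng.integers(1, n_vars)                  # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_vars) < p_mut            # mutation: flip bits
            children.append(child)
        population = np.vstack([keep, children])
    best = max(population, key=lambda b: fitness(b, X, y))
    return np.flatnonzero(best)

X = rng.normal(size=(80, 12))
y = X[:, 2] - 2 * X[:, 9] + rng.normal(0, 0.5, 80)
print("selected variables:", ga_select(X, y))              # ideally [2, 9]
```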
About ten years ago, during my studies, I was working with neural networks, and training one neural network took something like 15 minutes. I tried to find the best variable subset — I had around 400 parameters — and it took me two weeks to reduce the number from 400 parameters down to thirty. So it can take a lot of time if the model you are trying to optimize is slow to build; in the worst case you will wait practically forever. Another option would be simulated annealing — I am not going to talk about that, you can look it up — but it works similarly: it iterates over cycles of candidate models and tries to find the best subset of variables for your model.
Now let's say we have finally picked our multiple linear regression model and we want to know how good it is. Usually the first thing everybody looks at is the R² value — how much of the variance is described by the model; basically, it is the squared correlation between the experimental y and the predicted y. Then we have the overall F-test of the regression, which we covered last time; it checks whether our model is better than an intercept-only model. If its p value is below your significance level — 0.05 in most cases — you can conclude that your model provides a better fit than the intercept-only model. We can also run tests for the individual coefficients: the null hypothesis is that a coefficient equals zero, which basically says that this particular coefficient has no effect in the model. You can use those p values to help decide which variables to keep: if a coefficient's p value is greater than 0.05, it is likely that the variable has no effect in your model, and if you remove it, most likely nothing will happen. As you can see in the example here, the p value is about 0.03, so that variable stays.
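In practice these tests come for free from any regression package; for instance, with statsmodels (a sketch on made-up data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 3))
y = 1 + 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 40)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())      # overall F-test, plus a t-test and p value per coefficient
print("R2 =", round(model.rsquared, 3), " overall F p-value =", model.f_pvalue)
```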
We can also check the multicollinearity of the model, and for that two related measures are used: the tolerance and the variance inflation factor (VIF). The tolerance of each variable is calculated as 1 minus the R² of an auxiliary model: you take one of your X variables — say X1 — treat it as if it were the y variable, and build a model that tries to explain X1 using all the remaining X variables; 1 minus the R² of that model is the tolerance of X1. You repeat this for all your X variables: next, X2 becomes the "y", explained by X1, X3, X4, and so on. If the tolerance is below 0.2 — or 0.1, depending on which book you read — it indicates that you might have a multicollinearity problem: some of those variables are highly intercorrelated. The variance inflation factor is simply one divided by the tolerance, and VIF values above roughly 5 or 10 point to the same problem.
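A sketch of the tolerance/VIF computation (my own helper on synthetic data, where the third variable is nearly a copy of the first):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def tolerance_and_vif(X):
    """Tolerance = 1 - R² of regressing each column on all the others;
    VIF = 1 / tolerance."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        tol = 1 - r2
        out.append((tol, 1 / tol))
    return out

rng = np.random.default_rng(8)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + rng.normal(0, 0.2, 100)          # nearly a copy of x1
X = np.column_stack([x1, x2, x3])
for j, (tol, vif) in enumerate(tolerance_and_vif(X), start=1):
    print(f"x{j}: tolerance = {tol:.3f}, VIF = {vif:.1f}")
# x1 and x3 should show low tolerance / high VIF
```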
Let's look at the example here. This data set was tweaked a little so that at least one of the variables would be problematic. We can see that X4 is not significant in that model, and if you calculate the tolerance and VIF values for it, you can see that X3 and X4 in particular are highly collinear — their VIFs are well above the cutoff. If you look at the correlation matrix, you can see that the two are indeed highly correlated, so they basically describe the same thing in our model. And this is the tricky part: when variables are that highly correlated, it is difficult to tell from the coefficient tests whether a particular variable is genuinely significant for the model or not — a variable can look extremely significant even though it largely duplicates another.
So before we throw a variable out of the model, we should look a bit deeper. First we calculate the VIF or tolerance values and check whether we have a multicollinearity problem; the correlation matrix also shows where the problem sits. Here I removed X4 — but maybe I should have removed the other one instead? In fact, probably nothing much would change if I removed X3 instead: I would get a similar model, because the two variables carry largely the same information and explain y to a similar degree.
Then we have the regression coefficients, which tell us how the X variables relate to y. If a coefficient is positive, then as that X variable increases, the predicted y value increases as well; if it is negative, y decreases. And if the X variable changes by one unit, y will increase or decrease by the value of the coefficient. But be careful: you should not use the raw coefficients to rank the variables by importance, saying "this one has the highest absolute value, so it is the most important." It might be — but you do not know. For that, you have to standardize your coefficients: standardized regression coefficients show the relative importance of the variables. There are basically two ways to get them: either you run the regression analysis on standardized (autoscaled) data, in which case all the coefficients come out standardized, or you take each unstandardized coefficient and multiply it by the standard deviation of its X variable divided by the standard deviation of y. Once the coefficients are on the same scale, the higher the absolute value of a standardized coefficient, the relatively more important that variable is to your model — it explains more of the variance in y.
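A quick sketch of that scaling trick (my own example; the large-scale variable looks unimportant from its raw coefficient but is actually the stronger predictor):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(9)
x1 = rng.normal(0, 1, 200)          # small scale
x2 = rng.normal(0, 100, 200)        # large scale
y = 3 * x1 + 0.05 * x2 + rng.normal(0, 1, 200)

model = LinearRegression().fit(np.column_stack([x1, x2]), y)
b = model.coef_
print("raw coefficients:         ", b.round(3))       # x1 looks 'bigger'
# Standardize: b_j * std(x_j) / std(y) puts them on a common scale
b_std = b * np.array([x1.std(), x2.std()]) / y.std()
print("standardized coefficients:", b_std.round(3))   # x2 actually matters more
```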
Suppose now that you have your final model: everything is nice, you have picked your final variables — you should also try to interpret them. Here is my five-parameter model for the bioconcentration factor, the ratio of the concentration of a chemical in an organism to its concentration in the surrounding environment. Measuring this particular activity experimentally takes six months or so and requires exposing a couple of hundred fish to the chemical; already ten years ago such a test cost something like $100,000 — so a good model can reduce costs and spare both the animals and the environment. I interpreted those five descriptors; whether my interpretation is 100% correct, I am not sure, but I would say it is a plausible one, and the reviewers agreed — this work was published. If you can interpret the variables, people will understand why your model works.
And finally: if your model fits the training data well, that does not guarantee that it can actually predict — it only shows that it can capture the patterns in that particular data set. You really want to challenge it with unknown data to see whether it can truly predict or not. If possible, the best option is experimental validation: a training set and a validation set are nice and all, but if you can find additional data from new experiments — data that were not used in the modeling at all — and the model still predicts them, then your model is very good. A few closing remarks: go through your data and look for irregularities; find the outliers and try to understand why they are outliers; think carefully about the problem you have; and definitely avoid overly complex models for small data sets — you do not need a complex neural network to predict a property for ten compounds.