diff --git a/.vscode/settings.json b/.vscode/settings.json index cf42b25..e81dabd 100644 --- a/.vscode/settings.json +++ b/.vscode/settings.json @@ -1,9 +1,10 @@ { "cSpell.words": [ - "beefblup", "BLUP", - "dairyblup", "EBVs", "EPDs", + "autob", + "beefblup", + "dairyblup" ] -} \ No newline at end of file +} diff --git a/docs/make.jl b/docs/make.jl index 5a8bec7..bdc2e32 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -15,6 +15,8 @@ makedocs(; ), pages=[ "Home" => "index.md", + "How to Calculate EPDs" => "how-to-calculate-epds.md", + "CLI Reference (WIP)" => "beefblup-cli.md" ], ) diff --git a/docs/src/beefblup-cli.md b/docs/src/beefblup-cli.md index 85b1fb6..daa6529 100644 --- a/docs/src/beefblup-cli.md +++ b/docs/src/beefblup-cli.md @@ -1,7 +1,8 @@ ```@meta CurrentModule = BeefBLUP ``` -beefblup Command Line Interface (CLI) documentation + +# beefblup Command Line Interface (CLI) documentation > _A work in progress_ diff --git a/docs/src/how-to-calculate-epds.md b/docs/src/how-to-calculate-epds.md index f460b60..04f92c7 100644 --- a/docs/src/how-to-calculate-epds.md +++ b/docs/src/how-to-calculate-epds.md @@ -15,7 +15,7 @@ Since I'm mostly talking to American beef producers, though, we'll stick with EPDs for most of this discussion. Expected Breeding Values (EBVs) (which are more often halved and published as -Expected Progeny Differences [EPDs] or Predicted Transmitting Abilities [PTAs] +Expected Progeny Differences (EPDs) or Predicted Transmitting Abilities (PTAs) in the United States) are generally found using Charles Henderson's linear mixed-model equations. Great, you say, what is that? I'm glad you asked... @@ -29,15 +29,17 @@ P = G + E Where: -- _P_ = phenotype -- _G_ = genotype (think: breeding value) -- _E_ = environmental factors +- ``P`` = phenotype +- ``G`` = genotype (think: breeding value) +- ``E`` = environmental factors Now, we can't identify _every_ environmental factor that affects phenotype, but -we can identify some of them, so let's substitute _E_ with some absolutes. A +we can identify some of them, so let's substitute ``E`` with some absolutes. A good place to start is the "contemporary group" listings for the trait of -interest in the [BIF Guidelines], though for the purposes of this example, I'm -only going to consider sex, and birth year. +interest in the +[BIF Guidelines](https://beefimprovement.org/wp-content/uploads/2018/03/BIFGuidelinesFinal_updated0318.pdf), +though for the purposes of this example, I'm only going to consider sex, and +birth year. ```math P = G + E_{year} + E_{sex} @@ -45,61 +47,61 @@ P = G + E_{year} + E_{sex} Where: -- _En_ is the effect of _n_ on the phenotype +- ``E_n`` is the effect of ``n`` on the phenotype -Now let's say I want to find the weaning weight breeding value (_G_) of my +Now let's say I want to find the weaning weight breeding value (``G``) of my favorite herd bull. I compile his stats, and then plug them into the equation -and solve for G, right? Let's try that. +and solve for ``G``, right? Let's try that. ### Calf Records -ID | Birth Year | Sex | YW (kg) --- | - | - | - -1 | 1990 | Male | 354 +| ID | Birth Year | Sex | YW (kg) | +|:-- | :--------- | :----- |:------- | +| 1 | 1990 | Male | 354 | ```math -354 \textup{kg} &= G_1 + E_{1990} + E_{male} +354 \ \textup{kg} = G_1 + E_{1990} + E_{male} ``` -Hmm. I just realized I don't know any of those _E_ values. Come to think of it, +Hmm. I just realized I don't know any of those ``E`` values. Come to think of it, I remember from math class that I will need as many equations as I have unknowns, so I will add equations for other animals that I have records for. ### Calf Records -ID | Birth Year | Sex | YW (kg) --- | - | - | - -1 | 1990 | Male | 354 -2 | 1990 | Female | 251 -3 | 1991 | Male | 327 -4 | 1991 | Female | 328 -5 | 1991 | Male | 301 -6 | 1991 | Female | 270 -7 | 1992 | Male | 330 +| ID | Birth Year | Sex | YW (kg) | +|:--- |:---------- |:------ |:------- | +| 1 | 1990 | Male | 354 | +| 2 | 1990 | Female | 251 | +| 3 | 1991 | Male | 327 | +| 4 | 1991 | Female | 328 | +| 5 | 1991 | Male | 301 | +| 6 | 1991 | Female | 270 | +| 7 | 1992 | Male | 330 | ```math \begin{aligned} -251 \textup{kg} &= G_2 + E_{1990} + E_{female} \\ -327 \textup{kg} &= G_3 + E_{1991} + E_{male} \\ -328 \textup{kg} &= G_4 + E_{1991} + E_{female} \\ -301 \textup{kg} &= G_5 + E_{1991} + E_{male} \\ -270 \textup{kg} &= G_6 + E_{1991} + E_{female} \\ -330 \textup{kg} &= G_7 + E_{1992} + E_{male} +251 \ \textup{kg} &= G_2 + E_{1990} + E_{female} \\ +327 \ \textup{kg} &= G_3 + E_{1991} + E_{male} \\ +328 \ \textup{kg} &= G_4 + E_{1991} + E_{female} \\ +301 \ \textup{kg} &= G_5 + E_{1991} + E_{male} \\ +270 \ \textup{kg} &= G_6 + E_{1991} + E_{female} \\ +330 \ \textup{kg} &= G_7 + E_{1992} + E_{male} \end{aligned} ``` Drat! Every animal I added brings more variables into the system than it eliminates! In fact, since each cow brings in _at least_ one term -(_Gn_), I will never be able to write enough equations to solve for -_G_ numerically. I will have to use a different approach. +(``G_n``), I will never be able to write enough equations to solve for +``G`` numerically. I will have to use a different approach. ## The statistical model: the setup -Since I can never solve for _G_ directly, I will have to find some way to -estimate it. I can switch to a statistical model and solve for _G_ that way. The +Since I can never solve for ``G`` directly, I will have to find some way to +estimate it. I can switch to a statistical model and solve for ``G`` that way. The caveat with a statistical model is that there will be some level of error, but so long as we know and can control the level of error, that will be better than -not knowing _G_ at all. +not knowing ``G`` at all. Since we're switching into a statistical space, we should also switch the variables we're using. I'll rewrite the first equation as @@ -110,12 +112,12 @@ y = b + u + e Where: -- _y_ = Phenotype -- _b_ = Environment -- _u_ = Genotype -- _e_ = Error +- ``y`` = Phenotype +- ``b`` = Environment +- ``u`` = Genotype +- ``e`` = Error -It's not as easy as simply substituting _b_ for every _E_ that we had above, +It's not as easy as simply substituting ``b`` for every ``E`` that we had above, however. The reason for that is that we must make the assumption that environment is a **fixed effect** and that genotype is a **random effect**. I'll go over why that is later, but for now, understand that we need to transform the @@ -126,20 +128,20 @@ We'll start with the environment terms. ## The statistical model: environment as fixed effects To properly transform the equations, I will have to introduce -_bmean_ terms in each animal's equation. This is part of the fixed +``b_{mean}`` terms in each animal's equation. This is part of the fixed effect statistical assumption, and it will let us obtain a solution. Here are the transformed equations: ```math \begin{aligned} -354 \textup{kg} &= u_1 + b_{mean} + b_{1990} + b_{male} + e_1 \\ -251 \textup{kg} &= u_2 + b_{mean} + b_{1990} + b_{female} + e_2 \\ -327 \textup{kg} &= u_3 + b_{mean} + b_{1991} + b_{male} + e_3 \\ -328 \textup{kg} &= u_4 + b_{mean} + b_{1991} + b_{female} +e_4 \\ -301 \textup{kg} &= u_5 + b_{mean} + b_{1991} + b_{male} + e_5 \\ -270 \textup{kg} &= u_6 + b_{mean} + b_{1991} + b_{female} + e_6 \\ -330 \textup{kg} &= u_7 + b_{mean} + b_{1992} + b_{male} + e_7 +354 \ \textup{kg} &= u_1 + b_{mean} + b_{1990} + b_{male} + e_1 \\ +251 \ \textup{kg} &= u_2 + b_{mean} + b_{1990} + b_{female} + e_2 \\ +327 \ \textup{kg} &= u_3 + b_{mean} + b_{1991} + b_{male} + e_3 \\ +328 \ \textup{kg} &= u_4 + b_{mean} + b_{1991} + b_{female} +e_4 \\ +301 \ \textup{kg} &= u_5 + b_{mean} + b_{1991} + b_{male} + e_5 \\ +270 \ \textup{kg} &= u_6 + b_{mean} + b_{1991} + b_{female} + e_6 \\ +330 \ \textup{kg} &= u_7 + b_{mean} + b_{1992} + b_{male} + e_7 \end{aligned} ``` @@ -148,13 +150,13 @@ equations above to a single matrix equation that means the exact same thing. ```math \begin{bmatrix} -354 \textup{kg} \\ -251 \textup{kg} \\ -327 \textup{kg} \\ -328 \textup{kg} \\ -301 \textup{kg} \\ -270 \textup{kg} \\ -330 \textup{kg} +354 \ \textup{kg} \\ +251 \ \textup{kg} \\ +327 \ \textup{kg} \\ +328 \ \textup{kg} \\ +301 \ \textup{kg} \\ +270 \ \textup{kg} \\ +330 \ \textup{kg} \end{bmatrix} = \begin{bmatrix} @@ -200,9 +202,10 @@ e_7 \end{bmatrix} ``` -That's a nice equation, but now my hand is getting tired writing all those _b_ -terms over and over again, so I'm going to use [the dot product] to condense -this down. +That's a nice equation, but now my hand is getting tired writing all those ``b`` +terms over and over again, so I'm going to use +[the dot product](https://www.khanacademy.org/math/precalculus/x9e81a4f98389efdf:matrices/x9e81a4f98389efdf:multiplying-matrices-by-matrices/v/matrix-multiplication-intro) +to condense this down. ```math \begin{bmatrix} @@ -233,7 +236,6 @@ u_7 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 1 & 0 \end{bmatrix} -+ \begin{bmatrix} b_{mean} \\ b_{1990} \\ @@ -257,26 +259,27 @@ e_7 That matrix in the middle with all the zeros and ones is called the **incidence matrix**, and essentially reads like a table with each row corresponding to an animal, and each column corresponding to a fixed effect. For brevity, we'll just -call it _**X**_, though. One indicates that the animal and effect go together, +call it ``X``, though. One indicates that the animal and effect go together, and zero means they don't. For our record, we could write a table to go with -_**X**_, and it would look like this: - -Animal | mean | 1990 | 1991 | 1992 | male | female --- | - | - | - | - | - | - -1 | yes | yes | no | no | yes | no -2 | yes | yes | no | no | no | yes -3 | yes | no | yes | no | yes | no -4 | yes | no | yes | no | no | yes -5 | yes | no | yes | no | yes | no -6 | yes | no | yes | no | no | yes -7 | yes | no | no | yes | yes | no - -Now that we have _**X**_, we have the ability to start making changes to allow -us to solve for _u_. Immediately, we see that _**X**_ is **singular**, meaning +``X``, and it would look like this: + +| Animal | mean | 1990 | 1991 | 1992 | male | female | +|:------ |:---- |:---- |:---- |:---- |:---- |:------ | +| 1 | yes | yes | no | no | yes | no | +| 2 | yes | yes | no | no | no | yes | +| 3 | yes | no | yes | no | yes | no | +| 4 | yes | no | yes | no | no | yes | +| 5 | yes | no | yes | no | yes | no | +| 6 | yes | no | yes | no | no | yes | +| 7 | yes | no | no | yes | yes | no | + +Now that we have ``X``, we have the ability to start making changes to allow +us to solve for ``u``. Immediately, we see that ``X`` is **singular**, meaning it can't be solved directly. We kind of already knew that, but now we can -quantify it. We calculate the [rank of _**X**_], and find that there is only -enough information contained in it to solve for 4 variables, which means we need -to eliminate two columns. +quantify it. We calculate the +[rank of ``X``](https://math.stackexchange.com/a/2080577), +and find that there is only enough information contained in it to solve for 4 +variables, which means we need to eliminate two columns. There are several ways to effectively eliminate fixed effects in this type of system, but one of the simplest and the most common methods is to declare a @@ -288,27 +291,22 @@ last occuring form of each variable. ### Base population -