What does it mean if my structural equation model or a part of my model is unidentified? What are some rules for identification of my model?
Identification is demonstrated by showing that the unknown parameters in your model are functions only of identified parameters and that these functions lead to unique solutions (Bollen, 1989). In this context, identified parameters refer to stable sample estimates of population parameters such as variances and covariances. In other words, for a statistical model to be identified, we need to have information available which indicates that there is one best value for each parameter in the model whose value is not known. Identification is made possible based on information about the distribution of the observed x and y variables.
This FAQ explains some conceptual and statistical methods which relate to identification and assumes that the user has knowledge of structural equation modeling and the matrix algebra used in structural equation models.
In order to understand some of the concepts behind identification, it is necessary to understand some basic definitions and the equations which are used to solve for parameter estimates (e.g., we will be using some matrix algebra to statistically examine some rules of identification). The rules that follow are for models with observed variables.
Some definitions
Unknown parameters: parameters whose identification status is unknown.
Known parameters: parameters that are known to be identified--the variances and covariances of these parameters have consistent sample estimators which are uniquely estimable.
An underidentified model occurs when at least one parameter cannot be identified.
The following equation, drawn from Rigdon (1997) may help make this more clear:
x + 2y = 7
In the above equation, there are an infinite number of solutions for x and y (e.g., x = 7 and y = 0 or x = 5 and y = 1). These values are therefore "underidentified" because there are fewer "knowns" than "unknowns".
A just identified model is one in which there are as many knowns as unknowns.
x + 2y = 7
3x - y = 7
For this pair of equations, there are just as many knowns as unknowns, and thus, there is one best pair of values (x = 3, y = 2). For researchers, models which are just identified yield a perfect fit, which is not really meaningful and thus makes the test of this model's fit uninteresting.
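The contrast between the single equation and the pair of equations can be checked numerically. The following is a minimal sketch using Cramer's rule; the function name solve_2x2 and the second "redundant" system are illustrative assumptions, not part of Rigdon's example.

```python
from fractions import Fraction


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule.

    Returns (x, y) as exact fractions, or None when the determinant is
    zero, i.e. when the two equations do not pin down a unique solution.
    """
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    x = Fraction(c1 * b2 - c2 * b1, det)
    y = Fraction(a1 * c2 - a2 * c1, det)
    return x, y


# The just identified pair from the text: x + 2y = 7 and 3x - y = 7.
just_identified = solve_2x2(1, 2, 7, 3, -1, 7)

# Hypothetical second equation that only repeats the first (2x + 4y = 14):
# it adds no new information, so x and y remain underidentified.
redundant = solve_2x2(1, 2, 7, 2, 4, 14)

print(just_identified, redundant)
```

The second call illustrates that what matters is the number of independent pieces of information, not simply the number of equations written down.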
An overidentified model occurs when every parameter is identified and at least one parameter is overidentified (e.g., it can be solved for in more than one way: instead of solving for this parameter with one equation, more than one equation will generate this parameter estimate). Typically, most people who use structural equation modeling prefer to work with models that are overidentified. An overidentified model has positive degrees of freedom and may not fit as well as a model which is just identified. Imposing restrictions on the model when we have an overidentified model provides us with a test of our hypotheses, which can then be evaluated using the Chi-square statistic and fit indices. The positive degrees of freedom associated with an overidentified model allow the model to be falsified with a statistical test. When an overidentified model does fit well, then the researcher typically considers the model to be an adequate fit for the data.
Example
The following example of a structural equation model should help make the concept of identification more clear. This example appears in Rigdon (1997). It can be visually represented as a confirmatory factor analysis model with a single latent variable, ξ1, and separate error variance estimates, δ1 and δ2, for each of the two observed variables, x1 and x2. The following equations algebraically represent the figure:
x1 = ξ1 + δ1
x2 = ξ1 + δ2

This model is not identified. As stated earlier, the knowns in structural equation modeling are based on information from the distribution of the x and y variables, or the variances and covariances of the measured variables, while the unknowns consist of model parameters. If we count the number of known, identified variables, we have two variances (the variances of X1 and X2), and one covariance (Cov[X1,X2]). So, we have three known pieces of information.
How many unknown parameters are we trying to estimate using these three known pieces of information? The model has two error variances (δ1 and δ2), two factor loading paths (X1-ξ1 and X2-ξ1), and one factor variance (ξ1). This means that the model has five unknown parameters to estimate based on three known pieces of information. Therefore, the model is not identified.
To move the model from an underidentified state to an identified condition, it is necessary to impose additional constraints on the model. If we set the scale of the latent variable ξ1 by fixing its variance to 1.00 and set the two factor loading paths to be equal to each other, the model now has three unknown parameters: the two error variances and the common factor loading parameter. The following figure represents these changes:
Since the number of known pieces of information now equals the number of unknown parameters we wish to solve for, the new model is just identified. When you impose identifying constraints on your model, the constraints should be consistent with your theoretical predictions.
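The just identified version of this example can be worked out by hand. The sketch below assumes the symbol λ for the common factor loading (so that each indicator is λξ1 plus its error) and assumes that the errors are uncorrelated with ξ1 and with each other; these labels and assumptions are added for illustration and are not part of the original example. With the variance of ξ1 fixed at 1.00, the three knowns map onto the three unknowns as follows:

$$
\begin{aligned}
\mathrm{Var}(x_1) &= \lambda^2 + \mathrm{Var}(\delta_1) \\
\mathrm{Var}(x_2) &= \lambda^2 + \mathrm{Var}(\delta_2) \\
\mathrm{Cov}(x_1, x_2) &= \lambda^2
\end{aligned}
\qquad\Longrightarrow\qquad
\begin{aligned}
\lambda^2 &= \mathrm{Cov}(x_1, x_2) \\
\mathrm{Var}(\delta_1) &= \mathrm{Var}(x_1) - \mathrm{Cov}(x_1, x_2) \\
\mathrm{Var}(\delta_2) &= \mathrm{Var}(x_2) - \mathrm{Cov}(x_1, x_2)
\end{aligned}
$$

Each unknown is written as a function of the known variances and covariance alone, which is what identification requires. Note that only λ squared is determined here, so the sign of λ has to be fixed by convention.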
Algebraic Representation of Structural Equation Modeling
The following equation is a general representation of structural equations with observed variables taken from Bollen (1989, pp. 80-81); it will be used to explain some of the rules of identification that follow:
y = By + Γx + ζ
where:
B = m × m coefficient matrix (m = the number of latent, endogenous variables)
Γ = m × n coefficient matrix (m = number of latent, endogenous variables, n = number of latent, exogenous variables)
y = p × 1 vector of endogenous variables (p = the number of manifest y variables)
x = q × 1 vector of exogenous variables (q = the number of manifest x variables)
ζ = p × 1 vector of errors in the equations (p = the number of manifest y variables)
The ζ term represents random errors in the relationships between the x's and y's and is sometimes referred to as errors in the equations. The standard assumption is that the errors (ζ) are uncorrelated with x.
The measurement model for structural equations with observed variables is the following:
y = η
x = ξ
where y = p × 1 vector of manifest (observed) variables and x = q × 1 vector of manifest (observed) variables. Thus, for models with observed variables, x and y are assumed to exactly represent the latent η and ξ variables and therefore only one indicator is used for each variable. So, the number of y variables equals the number of η variables (p = m) and the number of x variables equals the number of ξ variables (q = n).
Conceptual and Statistical Theory for Identification
For the equation Σ = Σ(θ), Σ (sigma) is the population covariance matrix of observed variables, θ (theta) is a vector that contains the model parameters, and Σ(θ) is the covariance matrix written as a function of θ. The parameters whose identification status is unknown are in θ, where θ contains the t free and (nonredundant) constrained parameters of B (beta), Γ (gamma), Φ (phi), and Ψ (psi).
If an unknown parameter in θ can be written as a function of one or more elements in Σ, then that parameter is identified. If all of the unknown parameters in θ are identified, then the model is identified.
Identification is not related to your sample size. For example, a model is not considered underidentified because one doesn't have enough cases. The population covariance matrix is the source of identified information and the parameters refer to the population, not to sample values. So, no matter how big your sample size is, an unidentified parameter still remains unidentified.
Model identification is achieved by placing restrictions on your model parameters. For example, if a researcher were to free all of the elements in B, Γ, Φ, and Ψ to see what relations were significant, this model would not run because it would be underidentified.
For our data to be meaningful and tell us about associations, we must restrict certain parameters and free others. Most commonly, people set elements in the B, Γ, Φ, or Ψ matrices to zero. Others may impose equality or inequality constraints on the parameters. These restrictions should be consistent with your theory.
Two restrictions necessary for identification are already imposed, although they may not be obvious. Recall the equation y = By + Γx + ζ.
First, the main diagonal of B is fixed at zero, otherwise each endogenous variable would be shown as having a direct effect on itself. The standard convention is that the diagonal of B is set to zero so that the dependent variable of each equation appears on the left-hand side with an implicit coefficient of one (Bollen, 1989). This is sometimes referred to as the normalization convention, and without it, models would be underidentified.
Second, it is often taken for granted that the coefficient matrix for ζ in the above equation is an identity matrix. This means that the error term associated with each endogenous variable appears in only one equation with a coefficient of one. This also helps to identify your model.
These two identification constraints are so common that many researchers may not even realize that they have imposed them when they are estimating their models.
These constraints are not, however, always sufficient to identify multiequation models, and other information must therefore be used to specify the model properly. Table 1 summarizes some possible rules and requirements for identification.
| Identification Rule | Evaluates | Requirements | Necessary Condition | Sufficient Condition |
|---|---|---|---|---|
| t-Rule | model | t ≤ (1/2)(p + q)(p + q + 1) | yes | no |
| Null B Rule | model | B = 0 | no | yes |
| Recursive Rule | model | B triangular, Ψ diagonal | no | yes |
| Order Condition | equation | restrictions ≥ p - 1, Ψ free | yes¹ | no |
| Rank Condition | equation | rank(Ci) = p - 1, Ψ free | yes¹ | yes¹ |

¹ This characterization of the rank and order conditions assumes that all elements in Ψ are free. (Table taken from Bollen, K.A. (1989). Structural equations with latent variables. New York: John Wiley & Sons).
t-Rule
The easiest test to use is a necessary but not sufficient condition of identification. The t-rule for identification is that the number of nonredundant elements in the covariance matrix of the observed variables must be greater than or equal to the number of unknown parameters in θ:
t ≤ (1/2)(p + q)(p + q + 1)
where p + q is the number of observed variables and t is the number of free parameters in θ. The right-hand side of the inequality is the number of nonredundant elements in Σ, and each of these variances or covariances is known to be identified. If the number of unknowns (t) exceeds the number of equations [(1/2)(p + q)(p + q + 1)], then the identification of θ is not possible.
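The counting behind the t-rule is simple enough to automate. The following is a minimal sketch; the function name t_rule and the returned dictionary keys are illustrative choices, not part of any SEM package. It is applied to the two-indicator example from earlier in this FAQ (two observed x variables, no y variables).

```python
def t_rule(n_free_params, p, q):
    """Apply the t-rule: t <= (1/2)(p + q)(p + q + 1).

    n_free_params is t, the number of free parameters; p and q are the
    numbers of observed y and x variables. Passing the rule is necessary
    for identification but does not guarantee it.
    """
    n_observed = p + q
    # Number of nonredundant variances and covariances.
    n_knowns = n_observed * (n_observed + 1) // 2
    return {
        "knowns": n_knowns,
        "unknowns": n_free_params,
        "df": n_knowns - n_free_params,
        "passes": n_free_params <= n_knowns,
    }


# Two-indicator example: five free parameters, then three after constraints.
unconstrained = t_rule(n_free_params=5, p=0, q=2)
constrained = t_rule(n_free_params=3, p=0, q=2)

print(unconstrained)
print(constrained)
```

A negative value of "df" signals that the model cannot be identified; a value of zero corresponds to the just identified case, provided the model is in fact identified.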
Null B Rule
When you have a multiequation model in which no endogenous variable affects any other endogenous variable, the B matrix is zero.
We can therefore show that the unknown parameters in Γ, Φ, and Ψ are functions of the identified parameters of Σ. Substituting B = 0 into the equation y = By + Γx + ζ and partitioning Σ into four parts leads to the following equations:
Equation 1
$$
\Sigma =
\begin{bmatrix}
\Sigma_{yy} & \Sigma_{yx} \\
\Sigma_{xy} & \Sigma_{xx}
\end{bmatrix}
=
\begin{bmatrix}
\Gamma \Phi \Gamma' + \Psi & \Gamma \Phi \\
\Phi \Gamma' & \Phi
\end{bmatrix}
= \Sigma(\theta)
$$
The lower-right quadrant indicates that Φ = Σxx, so that Φ is identified. Using the lower-left quadrant results in the following equation:
Equation 2
$$
\begin{aligned}
\Phi \Gamma' &= \Sigma_{xy} \\
\Sigma_{xx} \Gamma' &= \Sigma_{xy} \\
\Gamma' &= \Sigma_{xx}^{-1} \Sigma_{xy}
\end{aligned}
$$
The second step of the above equation follows from substituting Σxx in for Φ, and the last step occurs by premultiplying both sides by the inverse of Σxx, where Σxx must be nonsingular. The bottom line indicates that Γ is a function of known-to-be-identified covariance matrices, and is therefore identified.
Solving for Ψ in the upper-left quadrant of Equation 1 creates the following set of equations:
$$
\begin{aligned}
\Psi &= \Sigma_{yy} - \Gamma \Phi \Gamma' \\
     &= \Sigma_{yy} - \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xx} \Sigma_{xx}^{-1} \Sigma_{xy} \\
     &= \Sigma_{yy} - \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy}
\end{aligned}
$$
Thus, when B = 0, Φ, Γ, and Ψ can each be written as functions of the identified covariance matrices of the observed variables, and they are therefore identified.
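The Null B derivation can be checked numerically: pick parameter matrices, build the covariance blocks they imply, and confirm that the formulas recover the parameters. The following is a minimal sketch with two y and two x variables; the parameter values and helper names are made up for illustration, and exact fractions are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction as F


def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def inv2(a):
    """Inverse of a 2 x 2 matrix; fails if the matrix is singular."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


# Hypothetical population parameters for a model with B = 0.
gamma = [[F(1, 2), F(0)], [F(1, 4), F(3, 4)]]
phi = [[F(2), F(1, 2)], [F(1, 2), F(1)]]
psi = [[F(1), F(1, 5)], [F(1, 5), F(3, 2)]]

# Covariance blocks implied by the model (the quadrants of Equation 1).
sigma_xx = phi
sigma_xy = matmul(phi, transpose(gamma))
sigma_yx = transpose(sigma_xy)
sigma_yy = add(matmul(matmul(gamma, phi), transpose(gamma)), psi)

# Recover the parameters from the covariance blocks alone.
sxx_inv = inv2(sigma_xx)
phi_hat = sigma_xx
gamma_hat = transpose(matmul(sxx_inv, sigma_xy))
psi_hat = sub(sigma_yy, matmul(matmul(sigma_yx, sxx_inv), sigma_xy))

print(gamma_hat == gamma, phi_hat == phi, psi_hat == psi)
```

Because every parameter is recovered from the covariance blocks without reference to the values used to generate them, the sketch mirrors the argument that the parameters are functions of identified quantities.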
Recursive Rule
The Recursive Rule, like the Null B Rule, is a sufficient condition for model identification, but not a necessary one. The Recursive Rule does not require that B = 0. Instead, the B matrix must be written as a lower triangular matrix, and the Ψ matrix must be diagonal. If both conditions hold, then the model is identified. In addition, a property of all recursive models is that the error term is uncorrelated with the explanatory variables. If all explanatory variables are uncorrelated with the error, then it is like a standard regression equation and such equations are identified. Thus, recursive models are always identified. See Rigdon (1995) for further details.
In addition, Bollen (1989) demonstrates the Recursive Rule extensively in his book, solving for B, Γ, Φ, and Ψ in a number of equations. If you are interested in the actual equations, you can find this information on pages 96-98.
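The two conditions of the Recursive Rule are pattern checks, so they are easy to sketch in code. In the minimal example below, each matrix is encoded as a 0/1 pattern (1 = free parameter, 0 = fixed to zero); this encoding, the function names, and the three-equation models are assumptions made for illustration. The check only examines the variables in the order given, whereas a model may also qualify as recursive after its equations are reordered.

```python
def is_strictly_lower_triangular(free):
    """True if no free element sits on or above the main diagonal."""
    n = len(free)
    return all(not free[i][j] for i in range(n) for j in range(i, n))


def is_diagonal(free):
    """True if no free element sits off the main diagonal."""
    n = len(free)
    return all(not free[i][j]
               for i in range(n) for j in range(n) if i != j)


def satisfies_recursive_rule(free_b, free_psi):
    """B lower triangular (with a zero diagonal) and Psi diagonal."""
    return is_strictly_lower_triangular(free_b) and is_diagonal(free_psi)


# Hypothetical model: y1 affects y2 and y3, and y2 affects y3.
recursive_b = [[0, 0, 0],
               [1, 0, 0],
               [1, 1, 0]]
# Hypothetical model with a feedback loop between y1 and y2.
feedback_b = [[0, 1, 0],
              [1, 0, 0],
              [1, 1, 0]]
uncorrelated_psi = [[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1]]
correlated_psi = [[1, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]]

print(satisfies_recursive_rule(recursive_b, uncorrelated_psi))
print(satisfies_recursive_rule(feedback_b, uncorrelated_psi))
print(satisfies_recursive_rule(recursive_b, correlated_psi))
```

Since the rule is sufficient but not necessary, a False result does not mean the model is unidentified; it only means that identification has to be established another way.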
As these past few rules demonstrate, the parameters of a structural equation model are generally considered identified if the researcher can solve the covariance structure equations for the unknown parameters. As the complexity of your model grows, however, using these algebraic equations can become quite monotonous and the possibility of making an error also increases. In addition, when you solve for unknown parameters, you must be aware of dependencies that may be concealed within the solution.
Rank and Order Conditions
Except for the t-rule, the other identification rules place restrictions on either B or Ψ. Nonrecursive models, however, do not satisfy these restrictions and therefore must have identification established in another way. Recall that a nonrecursive model has systems of equations in which there may be reciprocal causation or feedback loops.
Similar to the Null B Rule and the Recursive Rule, the rank and order conditions of identification are for models that assume that all exogenous variables (x) are uncorrelated with the errors (ζ). These rules differ from the previous rules in two ways. First, B can assume any form as long as (I - B) is nonsingular (I = identity matrix and B = beta matrix). A singular matrix has a determinant of zero, which indicates that its rows (or columns) are linearly dependent. Thus, a nonsingular matrix indicates that they are NOT linearly dependent.
Second, rank and order rules help identify one equation at a time. For the Null B Rule and the Recursive Rule, the whole model is identified if the conditions are met. In contrast, for the rank and order conditions, each equation must meet the conditions in order for the model to be identified. In addition, the rank and order conditions assume that Ψ does not have any restrictions. Thus, no element of this matrix may be constrained to a fixed value (e.g., zero) or have any other constraint. This means that the disturbance terms are allowed to correlate. If all of the equations in the model meet these conditions, then we know that all the elements in the Ψ matrix are identified and must be estimated. The disadvantage is that if we want to restrict certain elements of Ψ or know that elements should be restricted, we cannot utilize the rank and order rules for model identification.
The order condition is stated by Bollen (1989) as: "A necessary condition for an equation to be identified is that the number of variables excluded from the equation be at least p-1."
The following matrix equation, which represents a structural equation model, will more fully illustrate the order condition:
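$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}
=
\begin{bmatrix}
0 & \beta_{12} & 0 \\
\beta_{21} & 0 & 0 \\
\beta_{31} & \beta_{32} & 0
\end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}
+
\begin{bmatrix}
\gamma_{11} & 0 \\
0 & \gamma_{22} \\
0 & 0
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
+
\begin{bmatrix} \zeta_1 \\ \zeta_2 \\ \zeta_3 \end{bmatrix}
$$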
One way to check the order condition for the equations in a model is to form a matrix, say, C, which is [(I - B) | -Γ].
We get the C matrix by subtracting B from an identity matrix I, which has all zero elements and ones down the diagonal, and then appending -Γ as additional columns. For a model with three endogenous variables, I is:
$$
I =
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
$$
For each row in the C matrix, you count the number of zero elements. If a row has (p - 1) or more zeros, it meets the order condition, where p = the number of endogenous variables (and therefore the number of equations). In this example, each row must have at least two elements that are fixed to zero, since the number of endogenous variables is 3 and 3 - 1 = 2. Thus for the equation,
C = [(I - B) | -Γ]
the C matrix is the following:
$$
C =
\begin{bmatrix}
1 & -\beta_{12} & 0 & -\gamma_{11} & 0 \\
-\beta_{21} & 1 & 0 & 0 & -\gamma_{22} \\
-\beta_{31} & -\beta_{32} & 1 & 0 & 0
\end{bmatrix}
$$
Each of the equations above satisfies the order condition because it has (p -1) or 2 exclusions. For example, in the first row, two parameters are fixed to zero, thus, there are at least 2 values that are fixed for this first equation, which satisfies the order condition.
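The zero-counting step of the order condition can be sketched in a few lines. The example below encodes the C matrix of this example as a 0/1 pattern (1 = nonzero element, 0 = element fixed to zero); the encoding and the function name are assumptions made for illustration.

```python
def order_condition(c_pattern):
    """Check the order condition for each equation (row) of C.

    c_pattern is a list of rows with 1 for a nonzero element and 0 for an
    element fixed to zero. An equation passes when its row contains at
    least p - 1 zeros, where p is the number of equations.
    """
    p = len(c_pattern)
    return [row.count(0) >= p - 1 for row in c_pattern]


# Pattern of C = [(I - B) | -Gamma] for the three-equation example.
C_PATTERN = [
    [1, 1, 0, 1, 0],  # equation for y1
    [1, 1, 0, 0, 1],  # equation for y2
    [1, 1, 1, 0, 0],  # equation for y3
]

order_results = order_condition(C_PATTERN)
print(order_results)
```

Because the order condition is only necessary, passing it for every equation does not by itself establish identification.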
The Rank Rule states that a necessary and sufficient condition for the identification of the ith equation is that the rank of Ci equals (p - 1). The rank rule also starts with the C matrix which we used above to describe the order rule. The rank of a matrix is the number of linearly independent rows (or, equivalently, columns) that it contains. We will use the C matrix again to demonstrate the rank rule.
$$
C =
\begin{bmatrix}
1 & -\beta_{12} & 0 & -\gamma_{11} & 0 \\
-\beta_{21} & 1 & 0 & 0 & -\gamma_{22} \\
-\beta_{31} & -\beta_{32} & 1 & 0 & 0
\end{bmatrix}
$$
First, to check the first equation, we keep only the columns of C that have a zero in the first row. Thus, we get the following C1 matrix, which does not include columns 1, 2, or 4. Then, we must determine the rank of C1, that is, how many of its rows and columns are linearly independent. Unless γ22 = 0, the C1 matrix has two independent columns and rows, so its rank is two. The rank required by the rule is the number of endogenous variables (3) minus one. Thus, 3 - 1 = 2 and the rank condition is satisfied for the first equation.
$$
C_1 =
\begin{bmatrix}
0 & 0 \\
0 & -\gamma_{22} \\
1 & 0
\end{bmatrix}
$$
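The same column-selection and rank calculation can be repeated for every equation. The following is a minimal sketch that substitutes arbitrary nonzero numbers for the free parameters of C and computes each rank by Gaussian elimination with exact fractions. The numeric values and function names are illustrative assumptions, and checking the rank at one set of values is not the same as a symbolic proof.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below position `rank` with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def rank_condition(c):
    """For each equation i, keep the columns of C that are zero in row i
    and test whether the resulting matrix C_i has rank p - 1."""
    p = len(c)
    results = []
    for row in c:
        keep = [j for j, value in enumerate(row) if value == 0]
        c_i = [[r[j] for j in keep] for r in c]
        results.append(matrix_rank(c_i) == p - 1)
    return results


# Arbitrary nonzero stand-ins for the free parameters (hypothetical values).
b12, b21, b31, b32, g11, g22 = 2, 3, 5, 7, 11, 13
C = [
    [1, -b12, 0, -g11, 0],
    [-b21, 1, 0, 0, -g22],
    [-b31, -b32, 1, 0, 0],
]
rank_results = rank_condition(C)

# Setting g22 to zero reproduces the exception noted for C1 in the text.
C_g22_zero = [
    [1, -b12, 0, -g11, 0],
    [-b21, 1, 0, 0, 0],
    [-b31, -b32, 1, 0, 0],
]
rank_results_g22_zero = rank_condition(C_g22_zero)

print(rank_results, rank_results_g22_zero)
```

With these stand-in values every equation meets the rank condition, while fixing γ22 at zero causes the first equation to fail, as described above.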
Information and Jacobian Matrix Techniques
Two other approaches to assessing identification are by using information matrix techniques and the augmented Jacobian matrix. These techniques are discussed more fully in Rigdon (1997).
Briefly, the information matrix relates to all the free parameters in a model. It is the matrix of second order derivatives of the fit or discrepancy function with respect to these free parameters. If the parameters are all identified, the information matrix rank and the number of free parameters in the model will be equal. If all of the parameters are not identified, however, then the rank will be deficient.
Identifying your structural equation model in this way is comparable to the approach used to check for multicollinearity in regression. For regression, the rank of the predictor covariance matrix is evaluated and multicollinearity is assessed among the variables. Some structural equation modeling programs will provide you with identification problem information (e.g., EQS) when this type of problem (e.g., multicollinearity) has been detected. For example, you might receive information in your printout which indicates that one parameter in the model is "linearly dependent on" some other parameter(s).
It is important to know that the information matrix approach has a few shortcomings. First, evaluation of the rank of the information matrix occurs only after parameters have been estimated, and this evaluation only applies at that point in parameter space. So, the problem is that the model may only be identified at a local level versus a global level (McDonald, 1982). Thus, the model may be identified at one point in space but it may be unidentified at other points.
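The idea of a rank check that holds only at one point in parameter space can be illustrated with the two-indicator example from earlier in this FAQ. The sketch below is not the information matrix computed by any SEM program; as a simplifying assumption, it evaluates the rank of the matrix of first derivatives of the model-implied variances and covariance with respect to the free parameters, a closely related local check. The loading symbols (l1, l2, lam), the factor variance phi, and the error variances e1 and e2 are illustrative names.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def jacobian_unconstrained(l1, l2, phi):
    """Derivatives of Var(x1) = l1^2*phi + e1, Var(x2) = l2^2*phi + e2 and
    Cov(x1, x2) = l1*l2*phi with respect to (l1, l2, phi, e1, e2)."""
    return [
        [2 * l1 * phi, 0, l1 * l1, 1, 0],
        [0, 2 * l2 * phi, l2 * l2, 0, 1],
        [l2 * phi, l1 * phi, l1 * l2, 0, 0],
    ]


def jacobian_constrained(lam):
    """Same moments with phi fixed at 1 and a common loading lam;
    derivatives are taken with respect to (lam, e1, e2)."""
    return [
        [2 * lam, 1, 0],
        [2 * lam, 0, 1],
        [2 * lam, 0, 0],
    ]


# Five free parameters but rank 3: deficient, so not identified.
rank_unconstrained = matrix_rank(jacobian_unconstrained(1, 1, 1))
# Three free parameters and rank 3 at lam = 1: locally identified there.
rank_constrained = matrix_rank(jacobian_constrained(1))
# At lam = 0 the rank drops to 2: the same model fails at this point.
rank_at_zero = matrix_rank(jacobian_constrained(0))

print(rank_unconstrained, rank_constrained, rank_at_zero)
```

The last two results show why a rank evaluated at one set of parameter values speaks only to that point: the constrained model passes at a nonzero loading and fails when the loading is zero.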
Second, structural equation modeling programs implement this technique by evaluating the rank of the information matrix sequentially. The program thus begins with one row and column (which represents one parameter), then goes on to the next row and column (so now it is examining two parameters), and continues in this fashion until the whole matrix is assessed or there is a rank deficiency. If there is a problem, the program may report an error which indicates that the corresponding parameter is a problem. This message does not, however, clearly indicate whether other parameters are also part of this problem. It is therefore difficult to diagnose what the true problem might be. Examination of the information matrix or whether large standard errors or very high correlations have occurred might provide some insight. It is difficult to assess, however, based only on the value of these numbers whether the issue is identification, bad model fit, or whether there is even a problem to diagnose.
Another approach to assessing identification is evaluating the augmented Jacobian matrix. This approach was published in Bekker, Merckens, and Wansbeek (1994). Similar to the information matrix approach, the augmented Jacobian matrix also relates to the free parameters in a model. This matrix is different, however, from the information matrix in that it involves evaluation of the matrix of first order derivatives of the discrepancy function with respect to the parameters (versus evaluation of the matrix of second order derivatives). This matrix is augmented because it includes equations which represent restrictions on the model, such as equality constraints.
To quote Ed Rigdon's explanation of this matrix (Rigdon [1997]): "Using modern algebra techniques, Bekker, Merckens, and Wansbeek (1994) show that identification of the model can be assessed by evaluating the rank of a subset of this augmented Jacobian matrix, and that this evaluation can be conducted symbolically, before the parameters are estimated, and thus independently of any particular set of parameter values. In other words, this procedure tests global, rather than local, identification. Furthermore, the output of this procedure is a report on the identification status of every model parameter. This means that the researcher has a complete list of all problem parameters, which makes it more likely that the problem will be properly understood."
The current structural equation modeling programs do not yet implement this approach. The authors do provide a disk, however, which implements their technique (it comes with their book).
Empirical Underidentification
Finally, we should briefly touch on empirical underidentification. This issue is different from structural identification and was introduced by Kenny (1979). This term refers to those situations in which the model should be identified based on its structure, but it is NOT identified due to the sample data being analyzed.
Specifically, empirical underidentification typically occurs when a parameter that should have a non-zero value has an estimate that approaches or equals zero. For instance, in a confirmatory factor analysis model with two factors and four observed variables, a model with two variables loading onto each factor is only identified if the two factors are correlated. However, if the interfactor correlation is zero, or close to zero, the model may become empirically underidentified (Rigdon, 1997).
If you have further questions about identification, please refer to the following references.
References
Bekker, P.A., Merckens, A., & Wansbeek, T. (1994). Identification, equivalent models, and computer algebra. Boston: Academic Press.
Bollen, K.A. (1989). Structural equations with latent variables. New York: John Wiley & Sons.
Kenny, D.A. (1979). Correlation and causality. New York: Wiley.
McDonald, R.P. (1982). A note on the investigation of local and global identifiability. Psychometrika, 47 (1), 101-103.
Rigdon, E. (1997). Approaches to testing identification. http://www.gsu.edu/~mkteer/identifi.html.
Rigdon, E. E. (1995). A necessary and sufficient identification rule for structural models estimated in practice. Multivariate Behavioral Research, 30(3), 359-383.
If you have further questions, send E-mail to stats@ssc.utexas.edu.