Basic issues and challenges of statistical network analysis
Communications for Statistical Applications and Methods 2025;32:107-123
Published online January 31, 2025
© 2025 Korean Statistical Society.

Jaehee Kim1,a

aDepartment of Information Statistics, Duksung Women’s University, Korea
Correspondence to: Department of Statistics, Duksung Women’s University, 144 Gil, 33 Samyang-ro Ssangmun-Dong, Dobong-Gu, Seoul 01369, Korea. E-mail: jaehee@duksung.ac.kr
Received September 30, 2024; Revised November 3, 2024; Accepted November 4, 2024.
 Abstract
Extracting information from large graphs or networks has become an essential statistical problem since network data are ubiquitous in various fields, including the social and natural sciences. The analysis of network data is a rapidly growing area, both within and outside the discipline of statistics. This paper aims to review many standard classes of network models in some detail and to point the way to further reading. We investigate the most natural statistical questions for network models, including basic random graph models, community detection in the stochastic block model, preferential attachment trees, and temporal network processes. We also cover some interesting topics related to statistical network models for neuroimaging data. We emphasize formal model descriptions and pay special attention to the statistical models and the many challenges of statistical network analysis (SNA).
Keywords: network analysis, random graph, stochastic block model, TERGM
1. Introduction

The analysis of network data has become an essential component of doing research in many fields including social and friendship networks, food webs, protein interaction and regulatory networks in genomics, neuronal connectivity networks, the World Wide Web, and computer networks. On the algorithmic side, many algorithms for identifying important network structures, such as communities, have been proposed, mainly by computer scientists and physicists. On the mathematical side, various probability models for random graphs have been studied. However, there has only been a limited amount of research on statistical inference for networks and learning the network features by fitting statistical models to data.

Given some underlying parameters, the models assume a likelihood function for the network data. Estimates for these parameters are then inferred from the data. The models, intriguingly, range in complexity from having a single scalar parameter for link probability to increasingly elaborate models constructed on counts of network summary statistics. Kolaczyk (2009, 2017) and Crane (2018) provide complementary overviews of statistical and probabilistic network analysis. Popular books on networks and network analysis began to appear two decades ago (Barabási, 2002; Watts, 1999). Kolaczyk and Csárdi (2014) present statistical network analysis with R code. Rácz and Bubeck (2017) cover interesting topics such as community detection and preferential attachment for statistical network analysis.

A graph G(V, L) consists of a set of N nodes (or vertices) V = {n1, n2, . . . , nN} and a set L = {l1, l2, . . . , lL} of L edges (or links) between the nodes. An adjacency or socio-matrix Y, of dimension N × N, can also be used to represent G, with

$$y_{ij} = \begin{cases} 1, & \text{if an edge exists from node } n_i \text{ to node } n_j,\\ 0, & \text{otherwise}. \end{cases}$$

Note that this applies only for a binary network. Weighted (or valued) networks are described using non-negative integer values for the entries in Y.

There are two main types of networks: directed and undirected. In an undirected network, relationships are non-directional by their nature. Their edges indicate a two-way relationship having no specific direction, and the source and target nodes are interchangeable. In a directed network, connections between nodes are directional. Their edges indicate a one-way relationship.

In this paper, we consider undirected graph networks and related models. We present a succinct review of statistical network modelling up to the current state of the art. Section 2 provides an introduction to the terminology and basic statistics for network analysis. Static network models are introduced in Section 3 and temporal network models in Section 4. The network modeling procedure, including popular software, is discussed in Section 5. In Section 6, as an important exemplary application of network data, human brain neuroimaging data analyses are discussed. Final comments are given in Section 7.

2. Network descriptive statistics

We briefly review the common basic measures used as network summary statistics. We may be interested in how nodes interface at a local level. This behavior within the graph can be summarised using various statistics. Quantitatively and qualitatively highlighting nodes of importance to the network is also a task of statistical network analysis. Barabási (2016) describes network measures comprehensively and extensively.

2.1. Degree and degree distribution

The degree ki of a node ni is the number of edges connecting it to other nodes and indicates the relative importance of the node ni:

$$k_i = \sum_{j=1}^{N} y_{ij}.$$

The nodes with a high degree can be regarded as important points in the network (so-called hubs) (Newman, 2018; Caldarelli, 2007). The average degree can be computed for the whole network as

$$d = \frac{1}{N} \sum_{i=1}^{N} k_i.$$

In addition, the degree distribution P(k) reflects the probability that a randomly chosen node will have degree k. A power-law degree distribution, in which a few vertices have an exceptionally high degree while most others have a low degree, indicates the presence of hubs.
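As a concrete illustration (not part of the original analysis), the following minimal Python sketch uses the NetworkX package mentioned in Section 5 to compute node degrees, the average degree d, and the empirical degree distribution P(k) on a small example graph bundled with the library.

```python
# A minimal sketch: node degrees, average degree, and empirical P(k).
import networkx as nx
from collections import Counter

G = nx.karate_club_graph()                                  # example graph bundled with NetworkX

degrees = dict(G.degree())                                  # k_i for each node n_i
avg_degree = sum(degrees.values()) / G.number_of_nodes()    # d = (1/N) * sum_i k_i

counts = Counter(degrees.values())
P_k = {k: c / G.number_of_nodes() for k, c in sorted(counts.items())}   # empirical P(k)

print(avg_degree)
print(P_k)
```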

2.2. Centrality

As a network characterization, the goal is to have measures that allow an informative description of parts of the network. The importance of a node can be quantified using centrality measures. Betweenness centrality (BC) is defined as the number of shortest paths going through a node or edge. The BC is high when the node or edge is used by many shortest paths, and it can gauge a node’s influence by considering its role in linking other nodes together in the network. This measure can be normalized by dividing it by its maximum value (the total number of shortest paths in the network), giving

$$BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}},$$

where σst is the total number of shortest paths from node s to node t and σst(v) is the number of those paths passing through node v.

Closeness centrality (CC) is defined as the inverse of the average geodesic distance lij between node v = ni and the other nodes nj,

$$CC(v) = \frac{1}{\langle l(v) \rangle},$$

where $\langle l(v) \rangle = \frac{1}{N(N-1)} \sum_{i \neq j} l_{ij}$.

Eigenvector centrality (EC), a different type of node centrality measure, allocates a value to each node in the network in such a way that a node receives a large value if it has strong connections with many other nodes in central positions (Lohmann et al., 2010). In other words, connections to important nodes count more, making nodes with relatively few edges to critical nodes also important. The advantages of eigenvector centrality are that the quality of the connections is taken into account and that the computation time is relatively short. The EC is a powerful tool for understanding influence and emphasizing the importance of nodes within a network, defined as

$$EC(v) = \alpha \sum_{u \in V} y_{vu}\, c_{Ei}(u),$$

where the vector cEi = (cEi(1), . . . , cEi(N)) satisfies $Y c_{Ei} = \alpha^{-1} c_{Ei}$, with $\alpha^{-1}$ the largest eigenvalue of the adjacency matrix Y.
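For reference, a minimal Python sketch with NetworkX computes the three centralities of this subsection; note that the library’s normalization conventions may differ slightly from the formulas above.

```python
# A minimal sketch: betweenness, closeness, and eigenvector centrality.
import networkx as nx

G = nx.karate_club_graph()

bc = nx.betweenness_centrality(G)    # shortest-path betweenness BC(v)
cc = nx.closeness_centrality(G)      # inverse average geodesic distance CC(v)
ec = nx.eigenvector_centrality(G)    # leading-eigenvector centrality EC(v)

for name, scores in [("BC", bc), ("CC", cc), ("EC", ec)]:
    top = sorted(scores, key=scores.get, reverse=True)[:3]
    print(name, "top-3 nodes:", top)
```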

2.3. Motif and modularity

A motif is a small group of nodes forming a simple subgraph pattern. Triangles of three interconnected nodes are examples of such motifs. The simple motif of a triangle underlies the important property of clustering. Motifs provide insight into the network’s structure and function, as well as the interactions between the nodes.

When larger groups of nodes tend to be more connected within groups than between groups, this group is called a module. These modules, also known as network communities or clusters, represent cohesive substructures and play a significant role in the network organization and function.

2.4. Clustering coefficient and path length

The most basic and frequently used cohesive measure is the clustering coefficient, which measures local connectivity. The clustering coefficient C is defined as the extent to which the neighbors of a node are themselves connected to one another.

The shortest path length is the smallest number of edges between two nodes. The average shortest path length is a measure of global efficiency and is defined as the mean number of steps along the shortest paths between all possible pairs of network nodes. Intuitively, when the average path length is short, it is easy to travel quickly from one point in the network to any other point.
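A minimal Python sketch of these cohesion measures, again using NetworkX (the example graph is connected, as the average shortest path length requires):

```python
# A minimal sketch: clustering coefficients and average shortest path length.
import networkx as nx

G = nx.karate_club_graph()

local_C = nx.clustering(G)                       # clustering coefficient per node
avg_C = nx.average_clustering(G)                 # network-level average C
avg_L = nx.average_shortest_path_length(G)       # mean geodesic distance between all node pairs

print(avg_C, avg_L)
```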

2.5. Small world networks

Watts and Strogatz (1998) and Watts (1999) have shown that graphs with many local connections and a few random long-distance connections are characterized by a high clustering coefficient and a short path length. Such near-optimal networks are designated as ‘small-world’ networks. Since then, many types of real networks have been shown to have small-world features. For example, patterns of anatomical connectivity in neuronal networks are particularly characterized by high clustering and a small path length (Watts and Strogatz, 1998).

3. Static network models

The existing statistical network models may be organized along several principal axes, such as static vs. dynamic models. Static network models concentrate on explaining the observed links in a single network snapshot. In contrast, dynamic network models are concerned with the mechanisms governing changes in the network over time. Static network models were the main focus of early research. However, with the emergence of dynamic temporal networks in recent years, there has been growing interest in dynamic modeling.

3.1. Random graphs

The Erdös-Rényi model (Erdös and Rényi, 1960) is the most basic probabilistic network model. This model assumes that the presence and absence of edges between all pairs of nodes are iid, where yij = 1 with probability θ and yij = 0 with probability 1 − θ. Hence, the probability of a particular network is given by

$$P(Y \mid \theta) = \prod_{i,j} \theta^{y_{ij}} (1-\theta)^{1-y_{ij}},$$

where the product is over all node pairs ij if the graph is directed, and i < j if the graph is undirected. In particular, the asymptotic properties of the model, as N → ∞, have been studied in detail.

However, due to the assumption of independent edges and equal probability of connectivity between pairs of nodes, this model is not appropriate for modelling many real world networks. Instead, it serves as a null model, one in which there exists no structure.
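As a small illustration, the following Python sketch simulates an Erdös-Rényi graph, evaluates the Bernoulli log-likelihood above for the undirected case, and recovers the MLE of θ, which is simply the observed edge density.

```python
# A minimal sketch: simulate G(N, theta) and evaluate the Bernoulli log-likelihood.
import math
import networkx as nx

N, theta = 50, 0.1
G = nx.gnp_random_graph(N, theta, seed=1)

L_edges = G.number_of_edges()
n_pairs = N * (N - 1) // 2                 # number of unordered node pairs (i < j)
loglik = L_edges * math.log(theta) + (n_pairs - L_edges) * math.log(1 - theta)

theta_mle = L_edges / n_pairs              # MLE of theta = observed edge density
print(loglik, theta_mle)
```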

3.2. The p1 and p2 models

The p1 and related models are logistic regression models for the network dyads. They are a step up in complexity from the Erdös-Rényi model, which models all ties as equally probable given a fixed probability of an individual tie for the network. Due to the assumption of the independence of dyads in the p1 model, these models cannot capture common features of networks involving more than two nodes, such as transitivity and clustering.

This model for directed graphs was introduced by Holland and Leinhardt (1981). There are four possible states of linkage between two nodes ni and nj : (i) No link (ii) ni links to nj only (iii) nj links to ni only (iv) ni links to nj and nj links to ni. These four types of relationship are modelled respectively by the four terms below, with the probability of the entire network given by the exponent of the weighted sum:

$$P(Y = y \mid \theta) \propto \exp\Big(\theta \sum_{i,j} y_{ij} + \sum_{i} \alpha_i \sum_{j} y_{ij} + \sum_{j} \beta_j \sum_{i} y_{ij} + \rho m\Big),$$

where {yij} is the observed adjacency matrix and $m = \sum_{i<j} y_{ij} y_{ji}$ is the number of mutual links.

The p1 model contains three additional sets of parameters compared to the Erdös-Rényi model. There is still the network-wide base rate of link probability, θ. The other parameters are productivity α, attractiveness β, and mutuality ρ; α and β are vectors with a separate value for each node. When ρ and all α and β are zero, the model reduces to the Erdös-Rényi model. If ρ is zero, the link probability of a dyad depends solely on the degrees of the nodes involved. Allowing a different value of ρ for each dyad leads to identifiability problems for the parameters. However, the model can allow the effect of reciprocity to depend linearly on the two actors in a dyad, with a dyad-specific ρij = ρ + ρi + ρj where the ρi are normalised to sum to zero. The p1 model can be expressed as a GLM as follows:

$$\begin{aligned} P(y_{ij}=y_1, y_{ji}=y_2) &= P(y_{ij}=y_1) \cdot P(y_{ji}=y_2 \mid y_{ij}=y_1)\\ &= \frac{\exp\big(y_1(\theta+\alpha_i+\beta_j)\big) + \exp\big(y_1(\theta+\alpha_i+\beta_j+\rho) + \theta+\alpha_j+\beta_i\big)}{k_{ij}^{(1)}} \times \frac{\exp\big(y_2(\theta+\alpha_j+\beta_i+y_1\rho)\big)}{k_{ji}^{(2)}}, \end{aligned}$$

where $k_{ij}^{(1)}$ and $k_{ji}^{(2)}$ are normalising constants with

$$\begin{aligned} k_{ij}^{(1)} &= 1 + \exp(\theta+\alpha_i+\beta_j) + \exp(\theta+\alpha_j+\beta_i) + \exp(2\theta+\alpha_i+\beta_j+\alpha_j+\beta_i+\rho),\\ k_{ji}^{(2)} &= 1 + \exp(\theta+\alpha_j+\beta_i+y_1\rho). \end{aligned}$$

This formulation can be used for model fitting for the p1 model as a generalized linear model (GLM) or logistic regression (Salter-Townshend et al., 2012).
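As an illustration of this GLM view (a simplified sketch, not the full Holland-Leinhardt fitting procedure), the following Python code fits the special case ρ = 0, in which the directed ties become independent Bernoulli variables with sender (α) and receiver (β) effects; one node is used as the reference category for identifiability.

```python
# A minimal sketch: p1 model with rho = 0 fitted as a logistic GLM.
import networkx as nx
import numpy as np
import statsmodels.api as sm

G = nx.gnp_random_graph(20, 0.3, directed=True, seed=3)   # illustrative directed network
A = nx.to_numpy_array(G)
N = A.shape[0]

rows, y = [], []
for i in range(N):
    for j in range(N):
        if i == j:
            continue
        x = np.zeros(1 + 2 * (N - 1))
        x[0] = 1.0                      # theta: network-wide base rate
        if i < N - 1:
            x[1 + i] = 1.0              # alpha_i: productivity of sender i (node N-1 is reference)
        if j < N - 1:
            x[N + j] = 1.0              # beta_j: attractiveness of receiver j (node N-1 is reference)
        rows.append(x)
        y.append(A[i, j])

fit = sm.GLM(np.asarray(y), np.asarray(rows), family=sm.families.Binomial()).fit()
print(fit.params[:5])                   # theta and the first few alpha estimates
```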

3.3. Exponential random graph models

Exponential (family) random graph models (ERGMs), also known as p* models, are a family of models that address the issues arising from the assumption of dyadic independence in the p1 and p2 models. They are an extension of the Markov graphs (Frank and Strauss, 1986), which add triangle statistics to the p1 model. Wasserman and Pattison (1996) generalized the model to include arbitrary network statistics. Krivitsky et al. (2011) propose a further extension of the ERGM with the addition of an offset term to adjust for network size. Robins et al. (2007) and Robins et al. (2006) provide good introductions and summaries of recent developments in ERGM modeling.

The p1 and p2 models cannot capture structure in networks involving more than node pairs (e.g., transitivity). ERGMs do not assume dyadic independence and model the whole network as a single realization arising from a distribution summarised by a collection of sufficient network statistics. Specifically, the probability of the observed network y is in exponential family form, proportional to the exponent of the sum of the network sufficient statistics times some unknown parameters:

$$P(Y = y) = \exp\big(\theta^{\top} S(y) - \gamma(\theta)\big),$$

where θ are the parameters of the model, S (y) are the network sufficient statistics and γ(θ) is a normalising constant.

Fitting ERGMs necessarily involves approximating this normalizing constant, which is often very difficult to obtain. The network summary statistics are chosen by the analyst and may be calculated on interactions of order higher than two. Typical choices include the number of edges, the number of triangles, and the number of k-stars for various values of k. Chatterjee et al. (2011) examine ERGMs with the degree sequence as a sufficient statistic. Fitting the ERGM then involves finding estimates of the parameters for each of the network statistic terms in the model. There are several approaches to fitting ERGM models without recourse to summing over all possible networks; these include maximum pseudolikelihood estimation (Strauss and Ikeda, 1990), Monte Carlo MLE (van Duijn et al., 2009) and MCMC (Caimo and Friel, 2011; Snijders, 2002).

Maximum pseudolikelihood fitting involves computing change statistics for the network and finding a pseudo-MLE for the parameters using a pseudolikelihood approximation of the likelihood. The pseudolikelihood approximation is computed as the product of the full conditionals for each dyad given the rest of the observed network, where the log odds of a particular dyad being linked equal θ times the change in the observed network sufficient statistics when that dyad is switched from unlinked to linked.

Following Robins et al. (2007), denote by dij(y) the change in the network statistics when Yij is switched from 0 to 1; the log-odds of Yij being one, given the rest of the network, is then

$$\log\left(\frac{P(Y_{ij}=1 \mid y_{ij}^{c})}{P(Y_{ij}=0 \mid y_{ij}^{c})}\right) = \theta^{\top} d_{ij}(y),$$

where $y_{ij}^{c}$ denotes all entries of the adjacency matrix except the (i, j) entry.

Estimation of the parameters θ then proceeds via maximum likelihood estimation in a manner similar to logistic regression. However, logistic regression assumes independent observations; this assumption does not hold for network data, potentially leading to biased parameter estimates and standard errors that may be too small (Robins et al., 2007). For these reasons, Monte-Carlo-based inference is preferred, where possible. Bayesian estimation of ERGMs has also been addressed (Caimo and Friel, 2011). Frequentist MCMC methods for ERGMs (MCMC-MLE) refine approximate parameter estimates by comparing the observed network against a set of networks simulated under a given parameter configuration (Hunter and Handcock, 2006).
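The following Python sketch illustrates the maximum pseudolikelihood idea for an ERGM with edge and triangle statistics; it is for illustration only, since, as noted above, Monte-Carlo-based methods (e.g., the R ergm package) are preferred in practice.

```python
# A minimal sketch: maximum pseudolikelihood estimation via change statistics.
import itertools
import networkx as nx
import numpy as np
import statsmodels.api as sm

G = nx.karate_club_graph()

rows, y = [], []
for i, j in itertools.combinations(G.nodes(), 2):
    shared = len(list(nx.common_neighbors(G, i, j)))  # change in triangle count when toggling (i, j)
    rows.append([1.0, shared])                        # change in edge count is always 1
    y.append(1.0 if G.has_edge(i, j) else 0.0)

mple = sm.Logit(np.asarray(y), np.asarray(rows)).fit(disp=0)
print(mple.params)    # pseudo-MLE of (theta_edges, theta_triangles)
```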

In both inferential approaches, model degeneracy may be an issue. The degeneracy refers to the situation in which only a few networks have appreciable probability, given the model. If degeneracy or near degeneracy occurs, then the parameter estimates may not converge, and maximum pseudolikelihood estimators will yield misleading results in these cases. Therefore, new specifications for ERGMs have emerged, and network summary statistics have been specifically chosen to address degeneracy problems. These include conditioning on the number of edges or the inclusion of alternating k-stars and alternating k-triangles, geometrically weighted degrees, edgewise shared partner, and dyad-wise shared partner statistics. Hunter et al. (2008) discuss the goodness of fit for ERGM models. They propose a comparison of observed network statistics with the sample distributions of other statistics calculated on networks simulated from the fitted model.

3.4. Stochastic block models and latent variable models

Community detection is a fundamental problem in many sciences, such as sociology (e.g., finding closely tied groups in social networks), biology (e.g., detecting protein complexes), and neuroscience (e.g., finding groups of connected neurons). Given its importance, a plethora of algorithms have been proposed to detect communities. However, how can we test whether an algorithm performs well? What are the fundamental limits of any community detection algorithm?

Probabilistic generative models can be used to model real networks. They can act as benchmarks for comparing different clustering algorithms. One of the most widely studied generative models that exhibits community structure is the stochastic block model (SBM). Holland et al. (1983) first introduced the SBM in sociology. It was then studied in several different scientific communities, including statistics (Bickel and Chen, 2009). The SBM gives a distribution on graphs with a hidden partition of the nodes into K communities (for example, Figure 1). The parameters of the general SBM are the relative sizes of the communities and the edge densities connecting communities. The statistical inference problem is to recover the community structure as well as possible, given a realization of the graph without knowing any community labels.

A random graph from the general stochastic block model SBM(n, p, Q) is defined as follows:

  • n is the number of vertices.

  • K is the number of communities, assuming that K is a fixed constant. A probability distribution p = (p1, . . . , pK) describes the relative sizes of the communities.

  • Q ∈ [0, 1]^{K×K} is a symmetric K × K matrix that describes the probabilities with which two given vertices are connected, depending on which communities they belong to.

  • The vertex set of the graph is V = {1, . . . , n}.

  • Every vertex vV is independently assigned a (hidden) label hv ∈ [K] from the probability distribution p on [K]. That is, P(hv = i) = pi for every i ∈ [K].

  • Given the labels of the vertices, each (unordered) pair of vertices (nu, nv) ∈ V × V is connected independently with probability Qhu,hv.

Write G ~ SBM(n, p, Q) for a graph generated according to the SBM without the hidden vertex labels. The goal of a statistical inference algorithm is to recover as many labels as possible using only the underlying graph as an observation.

  • Weak recovery (detection). An algorithm is said to weakly recover or detect the communities if it outputs a partition of the nodes which is positively correlated with the true partition, with high probability (whp).

  • Partial recovery. How much can be recovered about the communities? An algorithm is said to recover communities with an accuracy of α ∈ [0, 1] if it outputs a labelling of the nodes which agrees with the true labelling on a fraction α of the nodes whp. An important special case is when only o(n) vertices are allowed to be misclassified whp, known as weak consistency or almost exact recovery.

  • Exact recovery (recovery or strong consistency). The strongest notion of reconstruction is to recover the labels of all vertices exactly whp. When this is not possible, it can still be of interest to understand which communities can be exactly recovered, if not all; this is sometimes known as ‘partial-exact-recovery’.
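The following Python sketch (an illustration under assumed parameter values) simulates a graph from an SBM with three communities using NetworkX, which takes fixed block sizes rather than sampling labels from p, and then attempts recovery by spectral clustering of the adjacency matrix; agreement with the hidden labels, up to relabeling, can be checked against the notions above.

```python
# A minimal sketch: simulate an SBM and recover communities by spectral clustering.
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans

sizes = [40, 30, 30]                               # fixed community sizes (roughly n * p)
Q = [[0.30, 0.05, 0.05],
     [0.05, 0.30, 0.05],
     [0.05, 0.05, 0.30]]                           # within- vs between-community probabilities

G = nx.stochastic_block_model(sizes, Q, seed=1)
A = nx.to_numpy_array(G)
K = len(sizes)

vals, vecs = np.linalg.eigh(A)
U = vecs[:, np.argsort(np.abs(vals))[-K:]]         # K leading eigenvectors of the adjacency matrix
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(U)

truth = np.repeat(np.arange(K), sizes)             # nodes are numbered block by block
print(labels[:10], truth[:10])                     # compare up to relabeling of communities
```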

By introducing the indicator variable

$$d_{ij}^{kl} = \begin{cases} 1, & \text{if nodes } n_i, n_j \text{ belong to blocks } B_k, B_l,\\ 0, & \text{otherwise}, \end{cases}$$

and introducing λ, a B × B matrix of parameters describing the block-model interactions, the p1 model then becomes

$$P(Y=y) \propto \exp\Big(\rho m + \theta \sum_{i,j} y_{ij} + \sum_{k}\sum_{l} \lambda_{k,l} \sum_{i,j} y_{ij} d_{ij}^{kl} + \sum_{i} \alpha_i \sum_{j} y_{ij} + \sum_{j} \beta_j \sum_{i} y_{ij}\Big).$$

In the more general case where attribute data are unknown, block memberships must be assigned with the use of latent variables.

Lei (2016) developed a goodness-of-fit test for the stochastic block model, a significant advancement in network analysis and community detection. The test statistic is based on the largest singular value of a residual matrix obtained by subtracting the estimated block mean effect from the adjacency matrix. The asymptotic null distribution is obtained using random matrix theory (RMT).

In many applications, communities are not disjoint but overlapping. Detecting overlapping communities is a complex challenge that requires further exploration: how well can we do it, and what are the fundamental limits? To understand the utility of the SBM, we must also study how robust the obtained results are to changes in the model.

Hoff et al. (2002) proposed the latent space model, in which the latent variable associated with node v is denoted by Zv. Handcock et al. (2007) proposed an extension, the latent position cluster model, which places a prior distribution on the latent positions. For each node v, Zv is assumed to be drawn from a mixture of K Gaussian distributions, each with a different mean and covariance matrix representing a different group or cluster. In latent feature models, the number of possible combinations of latent features grows rapidly with the number of nodes. Stochastic gradient MCMC algorithms can reduce the computational burden for this model.

3.5. Likelihood for networks

Many network data models rely on the information contained in the degrees associated with the nodes of the corresponding graphs. Assume the model is described by a possibly vector-valued parameter θ. Let Gi be an observation from the model. The likelihood, as a function of θ, is

$$L(\theta, G_i) = \frac{1}{T} \sum_{v \in R(G_i)} w(\theta, G_i, v)\, L(\theta, \delta(G_i, v)),$$

where w(θ, Gi, v) = Pθ(Gi) and R(Gi) is the random graph associated with Gi.

We need to formulate conditions that guarantee that the maximum likelihood estimator (MLE) exists with probability tending to one as the number of nodes increases. Rinaldo et al. (2011) studied maximum likelihood estimation for the random graph models in which the degree sequences are minimal sufficient statistics.

Given Z and C, assume that the edges of the undirected graph G are Bernoulli distributed conditional on the group memberships, where Z denotes the community memberships and C the matrix of between-block connection probabilities. The likelihood is written as

$$L(Y \mid Z, C) = \prod_{p<q} P(Y_{pq} \mid Z, C) = \prod_{p<q} \left[ (Z_p^{\top} C Z_q)^{Y_{pq}} (1 - Z_p^{\top} C Z_q)^{1-Y_{pq}} \right].$$
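A minimal Python sketch evaluating this Bernoulli block-model log-likelihood for given one-hot membership vectors Z and block probability matrix C (a toy example, with values chosen only for illustration):

```python
# A minimal sketch: Bernoulli block-model log-likelihood for given Z and C.
import itertools
import numpy as np

def block_loglik(Y, Z, C):
    """Y: N x N binary adjacency, Z: N x K one-hot memberships, C: K x K probabilities."""
    P = Z @ C @ Z.T                       # P[p, q] = Z_p^T C Z_q
    ll = 0.0
    N = Y.shape[0]
    for p, q in itertools.combinations(range(N), 2):
        ll += Y[p, q] * np.log(P[p, q]) + (1 - Y[p, q]) * np.log(1 - P[p, q])
    return ll

# toy example with 4 nodes in 2 blocks
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
C = np.array([[0.8, 0.1], [0.1, 0.8]])
Y = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
print(block_loglik(Y, Z, C))
```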

Karrer and Newman (2011) worked with undirected valued graphs using Poisson and degree-corrected block models. Let Ypq be the number of edges for the dyad (p, q), following a Poisson distribution, and let C contain the expected numbers of edges between a node in one group and a node in another. The likelihood can be written as

$$L(Y \mid Z, C) = \prod_{p<q} (Y_{pq}!)^{-1} \exp(-Z_p^{\top} C Z_q) (Z_p^{\top} C Z_q)^{Y_{pq}}.$$

Tallberg (2005) proposed a model in which the group memberships Z depend on the covariates x. Nu et al. (2013) considered the edge probability as

$$P(Y_{pq} \mid Z, C, x) \propto \exp\big[\psi f(x, Y_{pq}) + (Z_p^{\top} C Z_q)\, g(x, Y_{pq})\big],$$

where x are the covariates, and f and g are functions that may depend on x. The parameter ψ is constant with respect to both x and the groups to which p and q belong.

The likelihood of network data reflects the data situation and requires its own computational algorithm. If the likelihood is complicated, algorithms are needed for detecting the nonexistence of the MLE and for identifying non-estimable parameters. The EM (expectation-maximization) algorithm is one alternative, and VEM (variational EM) methods (Chang and Blei, 2010; Corneli et al., 2018) are another.

3.6. Preferential attachment trees

A preferential attachment (PA) tree is a type of network structure in which a preferential attachment mechanism governs the formation of connections between nodes in a growing network, such as a collaboration network. In the context of a tree structure, preferential attachment influences how new nodes are connected to existing nodes, describing a rich-get-richer effect in network formation (Barabási, 2016; Barabási and Albert, 1999).

If the corresponding degrees are {di(t)}, where i ranges over [t], the set of nodes present at time t, then the node vt+1 connects to the existing node vi ∈ V, for a given function f : ℕ+ → ℝ+, with probability

$$\frac{f(d_i(t))}{\sum_{j=1}^{t} f(d_j(t))}.$$

Here f is referred to as the preferential attachment function and f(k) as the preference for a node of degree k. The denominator is the total preference at time t. The model can evolve to connect any number of nodes, yielding a dynamic graph that changes over time.
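A minimal Python sketch simulating a preferential attachment tree under an affine preference function f(k) = k + δ (an illustrative choice of f, not one prescribed in the text):

```python
# A minimal sketch: grow a preferential attachment tree with f(k) = k + delta.
import random

def pa_tree(T, delta=0.0, seed=1):
    rng = random.Random(seed)
    degree = {1: 1, 2: 1}                              # start from a single edge between nodes 1 and 2
    edges = [(1, 2)]
    for t in range(3, T + 1):
        nodes = list(degree)
        weights = [degree[v] + delta for v in nodes]   # f(d_i(t)) = d_i(t) + delta
        target = rng.choices(nodes, weights=weights, k=1)[0]
        edges.append((t, target))                      # node t attaches to the sampled target
        degree[target] += 1
        degree[t] = 1
    return edges

tree = pa_tree(1000, delta=0.5)
print(len(tree))                                       # a tree on 1000 nodes has 999 edges
```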

Gao and van der Vaart (2021) studied MLEs for parametric preference functions and established their consistency and asymptotic normality.

4. Temporal network models

A dynamic network model is a statistical framework describing how nodes and edges change, or how networks evolve, over time. These dynamic or temporal network models explicitly incorporate time into their structure. Different mathematical and statistical tools are used to describe dynamic networks, such as stochastic processes, differential equations, agent-based modeling, or graph theory, in order to capture the complexity of network dynamics. Modeling dynamic networks is a formidable task, primarily due to the complexity of real-world data, the need to balance fidelity to the relational system with simplicity, and the computational demands of dealing with large-scale networks.

Dynamic network modeling is a crucial tool in various interdisciplinary fields such as social science, biology, neuroscience, computer science, and economics. Networks in these fields, representing relationships between entities, are rarely static; they change and evolve due to various processes. Brain networks, whose edges represent the transmission of an electrical charge between neurons, change depending on brain function and activity. Social networks, whose edges represent friendships, often change over time due to increased or decreased social activity.

The edges associated with different vertices evolve on different time scales (for example, Figure 2). There is a need to model a dynamic network whose vertex set stays fixed while its edges vary over time. A dynamic network is one whose entire structure varies with time. Thus a dynamic network is a collection Y = (Y(t))_{t∈T} indexed by a set of times T, with each Y(t) representing a network on a (fixed) vertex set. The components of a dynamic network are related to one another through their association over time.

The temporal exponential random graph model (TERGM) (Doreian and Stokman, 1997; Leifeld et al., 2018) has been the most widely studied statistical model for dynamic networks. Let {0, 1}^{n×n} be the state space for binary relational data of size n. Let Θ be a parameter space. Define T : {0, 1}^{n×n} × {0, 1}^{n×n} → ℝ^d, where d ≥ 1 is the length of the sufficient statistic vector T = (T1, . . . , Td).

The transition probabilities of the TERGM are given by

$$\Pr\big(Y^{(t+1)} = y' \mid Y^{(t)} = y;\ \theta, T\big) \propto \exp\{\eta(\theta) \cdot T(y', y)\}, \qquad y, y' \in \{0,1\}^{n \times n},$$

where η(θ) is the natural parameter of the exponential family and $\eta(\theta) \cdot T(y', y) = \sum_{i=1}^{d} \eta_i(\theta) T_i(y', y)$. Note that the transition probabilities of the TERGM incorporate temporal (i.e., Markovian) dependence into the ERGM. The main difference between the ERGM and the TERGM is the joint dependence of the sufficient statistic T on both y and y′.
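As a small illustration of statistics that depend jointly on y and y′ (not a full TERGM fit; the btergm package of Leifeld et al. (2018) implements estimation), the following Python sketch computes an edge count and a dyadic stability count for two consecutive network snapshots.

```python
# A minimal sketch: TERGM-style transition statistics T(y', y) for two snapshots.
import itertools
import numpy as np

def tergm_stats(y_prev, y_curr):
    """y_prev, y_curr: n x n binary adjacency matrices of consecutive snapshots."""
    n = y_curr.shape[0]
    edges, stability = 0, 0
    for i, j in itertools.combinations(range(n), 2):
        edges += y_curr[i, j]                           # edge count of the current network
        stability += int(y_curr[i, j] == y_prev[i, j])  # dyads that keep their previous state
    return np.array([edges, stability])

rng = np.random.default_rng(0)
y0 = np.triu((rng.random((20, 20)) < 0.1).astype(int), 1)
y0 = y0 + y0.T                                          # a symmetric starting network
y1 = y0.copy()
y1[0, 1] = y1[1, 0] = 1 - y1[0, 1]                      # toggle one dyad between time points
print(tergm_stats(y0, y1))
```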

Zhu et al. (2014) proposed a novel spatially varying coefficient model (SVCM) to capture the varying association between imaging measures in a three-dimensional volume (or two-dimensional surface) and a set of covariates. As an alternative, Lee et al. (2020) proposed varying-coefficient exponential random graph models (VCERGMs), which reflect temporal heterogeneity and characterize the evolution of network topology through smoothly varying parameters. Zhu et al. (2022) proposed a network functional varying coefficient (NFVC) model that captures both the interindividual dependence and the within-individual dynamic correlation, characterizing each individual’s response by a linear combination of the responses from its connected nodes and exogenous covariates.

There is a significant and growing interest in examining the temporal dynamics of the brain’s network activity. This interest underscores the importance of understanding fluctuations in brain connectivity, a key area that has seen limited exploration using temporal network theory. Thompson et al. (2017) presented temporal centrality measures and discussed the move from static network analysis to dynamic temporal network analysis.

5. Statistical network modeling procedure

Quantitative methods of model assessment currently fall into five overlapping categories:

  • Comparison with ground truth/nodal attributes: comparing aspects of the fitted model to observed nodal attributes. Any community detection algorithm that separates the nodes into clusters that are consistent with this split is deemed to be a good model.

  • Link prediction: comparison of the predictive probabilities under the fitted model for all possible links with the observed links/non-links. This method is used for generative models, and inspection of the fit is often performed visually.

  • Network process modeling: explaining and optimizing network performance and predicting how changes in some parts of network might affect others.

  • Goodness-of-fit diagnostics: the method involves simulation of networks from the fitted model; summary statistics (such as the degree distribution, shared partner distribution, geodesic distances, etc.) derived from these simulated networks are then compared with the corresponding observed values.

  • Model comparison: comparison of competing models under an information criterion such as BIC or assessment of both models’ goodness-of-fit. Comparison of parameter selection using AIC can be used with selection based on statistics derived from the goodness-of-fit diagnostics.

In order to perform inference on a network, we can implement the models, or variational approximations to them, using available software. Several software packages are available for the visualization and analysis of network data. The R package igraph allows the efficient manipulation and decomposition of directed and undirected graphs. The statnet suite of R packages, including sna, ergm, and network, provides tools to obtain network summary statistics and both node- and graph-level indices for a variety of centrality measures. The RSiena package contains visualization tools using standard layout algorithms, and sna can be used to fit stochastic block models and estimate block structure in network datasets.

Python, a versatile language, offers a range of graph libraries and packages specifically designed for complex network analysis. For example, the Python package NetworkX is a powerful tool for conducting network analysis.
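A minimal end-to-end sketch of such a workflow with NetworkX, reading a (hypothetical) edge-list file and reporting the basic summary statistics of Section 2:

```python
# A minimal sketch: load an edge list and report basic network summaries.
import networkx as nx

G = nx.read_edgelist("network.edgelist")     # hypothetical whitespace-delimited edge-list file
print(G.number_of_nodes(), G.number_of_edges())
print(nx.density(G))                          # observed tie density
print(nx.transitivity(G))                     # global clustering coefficient
```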

6. Exemplary network analysis of brain diseases

The network structure of the human brain, as a representation of brain function, has become an increasingly exciting topic for the neuroscientific community, because it has the potential to illuminate human cognition, to characterize changes over development and aging, and to identify or distinguish disease or injury (Liu et al., 2017; Luppi and Stamatakis, 2021; van Straaten and Stam, 2013). The use of modern network theory in neuroscience is becoming more prevalent for analyzing the brains of both healthy and diseased individuals. Such studies utilize graph theory to describe networks mathematically and apply probability theory and statistical mechanics to address the stochastic characteristics of large networks. Over the past decade, attempts to characterize these data have been made with new, multidisciplinary approaches to the study of complex systems (Strogatz, 2001; Boccaletti et al., 2006; Newman, 2018). Based on mathematical graph theory, complex network analysis describes important properties of complex systems by quantifying the topological characteristics of their respective networks. However, unlike classical graph theory, complex network analysis primarily deals with large and complex real-life networks that are neither uniformly random nor ordered.

Brain connectivity datasets comprise networks of brain regions connected by anatomical tracts or functional associations. Brain networks are invariably complex; they are represented as complex networks and characterized via complex network methods (Bassett and Bullmore, 2006; Bullmore and Sporns, 2009; Telesford et al., 2011; Sporns, 2016; Rubinov and Sporns, 2010).

First, complex network analysis promises to quantify brain networks with some neurobiologically meaningful and easily computable measures (Deuker et al., 2009) and reliably computable measures (Sporns and Zwi, 2004). Second, by explicitly defining anatomical and functional connections on the same map of brain regions, network analysis may be a valuable setting for exploring structural-functional connectivity relationships (Honey et al., 2009). Third, comparisons of functional network topologies between subject populations appear to reveal presumed connectivity abnormalities in neurological and psychiatric disorders (Stam, 2007).

An increasing number of studies report disturbances of the optimal organization of brain function in brain disorders. These results show that even localized processes impact the global topology due to the integrative nature of the system (Sporns, 2016). Some patterns of network changes have been found in brain disorders (Shin and Kim, 2023; Lee and Kim, 2023; Han and Kim, 2023; Shin et al., 2023).

We provide some network analysis results with pattern interpretations for brain disorders. Alzheimer’s disease (AD) is a neurodegenerative condition affecting cognitive function, with memory deficits. Functional networks in AD lose their normal small-world structure and regress toward a more random architecture (Mandal et al., 2018; Stam, 2007; Stam et al., 2006). The patterns found in functional network analysis have been linked to amyloid deposition influenced by synaptic activity.

Parkinson’s disease (PD) is a neurodegenerative disorder with prominent motor signs. Skidmore et al. (2013) showed that the network efficiency in PD is reduced, which might unravel the associations of network changes with the pathological process.

Brain tumors have an impact on anatomical structures, leading to loss of function and epilepsy. Bartolomei et al. (2006) showed extensive global changes in the brain networks of brain tumor patients, especially loss of connectivity independent of the location of the lesion. Douw et al. (2010) showed that these changes were also related to cognitive disturbances and seizures. These studies implied that global cognitive dysfunction is associated with the disruption of the network structure.

Epilepsy results from a hyperexcitable state of the brain and is characterized by abnormal synchronized firing activity of the neurons involved in a seizure (Lehnertz et al., 2009). Even though the underlying mechanisms are not precisely understood, what happens to functional networks before, during, and after a seizure might provide insight into the dynamical processes involved. Netoff et al. (2004) reported that hippocampal neuronal networks are in a small-world configuration that increases connectivity between the neurons (Ponten et al., 2007). An fMRI (functional magnetic resonance imaging) study on the dynamics of the functional networks showed that the networks are near a threshold of order/disorder transition with some change-points (Kim, 2019; Kim et al., 2021). The randomness of resting-state networks in epilepsy subjects was also found in an MEG (magnetoencephalography) study (Horstmann et al., 2010). Ortega et al. (2008) reported that pathologically active hubs contribute to epilepsy. Wilke et al. (2011) conducted a study to identify critical nodes in the networks of patients undergoing epilepsy surgery. This research has the potential to significantly advance our understanding of epilepsy and its impact on brain networks.

Schizophrenia is regarded as a disorder of abnormal brain functional connectivity, with reports of decreased structural and functional connectivity and reduced global integration of specific brain areas, especially frontal hubs. Changes in functional networks might be detectable before the disease is clinically manifest. Such signs could identify subjects at particular risk of developing the disease (Bullmore and Sporns, 2009).

Healthy structural and functional brain networks are characterized by a cost-effective architecture that optimally balances local and global connectivity and has a hierarchical modular structure (Stam, 2014). Normal brain network organization arises during development under genetic control and is related to cognitive function. Local brain lesions cause widespread network changes, whereas global brain disorders preferentially affect highly connected hub regions. The discoveries of modern network science have unveiled key characteristics of typical brain-network structure, including small-world and scale-free patterns, hierarchical modularity, hubs, and rich clubs. The next step is to harness the power of this information and knowledge to significantly advance our understanding of brain disorders, offering hope for improved treatments and outcomes.

The comparison of the brain networks of healthy subjects to those of patients with neurological or psychiatric disease has added to our knowledge of the mechanisms involved in brain diseases. One of the challenges is to establish the relationship between network changes and pathological processes for other diseases and to relate functional networks to therapeutic strategies. Another challenge is developing additional network models capturing and describing the brain’s functional structure.

7. Conclusion

We have presented a concise summary of network models and methods for the statistical analysis of network data. We explored classical models in which the presence of a link between two nodes depends on the network graph structure, and latent variable models in which the presence of a link depends on latent variables. In recent times, network data have proliferated. Modern data collection methods allow network data to be collected more easily over time. A common approach to analyzing temporal network data is to aggregate across time, or to consider single-time snapshots of the network and analyze them using static network techniques. Therefore, proper dynamic methods for network analysis are growing rapidly. There is much scope for modeling innovations for such temporal network data.

Many challenges remain in the field of statistical analysis of network data. There are many open problems and further directions for network data analysis. For instance, can the bounds on the size of the optimal confidence set be improved and ultimately tightened? What about other tree growth models? What happens when we lose the tree structure and consider general graphs? Moreover, real network data can motivate many interesting research topics. This review concentrates on the statistical modeling of network data rather than other computational approaches. We emphasize that statisticians can develop advanced statistical network models and should contribute to network data analysis.

Figures
Fig. 1. A schematic of the stochastic block model.
Fig. 2. International relation change during three years.
References
  1. Barabási AL (2002). The New Science of Networks, Cambridge, MA, Perseus.
  2. Barabási AL (2016). Network Science, London, Cambridge University Press.
  3. Barabási AL and Albert R (1999). Emergence of scaling in random networks. Science, 286, 509-512.
  4. Bartolomei F, Bosma I, and Klein M, et al. (2006). How do brain tumors alter functional connectivity? A magnetoencephalography study. Annals of Neurology, 59, 128-138.
  5. Bassett DS and Bullmore ED (2006). Small-world brain networks. The Neuroscientist, 12, 512-523.
  6. Bickel PJ and Chen A (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106, 21068-21073.
  7. Boccaletti S, Latora V, Moreno Y, Chavez M, and Hwang DU (2006). Complex networks: Structure and dynamics. Physics Reports, 424, 175-308.
  8. Bullmore E and Sporns O (2009). Complex brain networks: Graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience, 10, 186-198.
  9. Caimo A and Friel N (2011). Bayesian inference for exponential random graph models. Social Networks, 33, 41-55.
  10. Caldarelli G (2007). Scale-Free Networks: Complex Webs in Nature and Technology, London, Oxford University Press.
  11. Chang J and Blei DM (2010). Hierarchical relational models for document networks. Annals of Applied Statistics, 4, 124-150.
  12. Chatterjee S, Diaconis P, and Sly A (2011). Random graphs with a given degree sequence. Annals of Applied Probability, 21, 1400-1435.
  13. Corneli M, Bouveyron C, Latouche P, and Rossi F (2018). The dynamic stochastic topic block model for dynamic networks with textual edges. Statistics and Computing, 29, 677-695.
  14. Crane H (2018). Probabilistic Foundations of Statistical Network Analysis, New York, Chapman and Hall.
  15. Deuker L, Bullmore ET, Smith M, Christensen S, Nathan PJ, Rockstroh B, and Bassett DS (2009). Reproducibility of graph metrics of human brain functional networks. Neuroimage, 47, 1460-1468.
  16. Doreian P and Stokman FN (1997). Evolution of Social Networks, New Jersey, Routledge.
  17. Douw L, De Groot M, Van Dellen E, Heimans JJ, Ronner HE, Stam CJ, and Reijneveld JC (2010). Functional connectivity is a sensitive predictor of epilepsy diagnosis after the first seizure. PloS One, 5, e10839.
  18. Erdös P and Rényi A (1960). On the evolution of random graphs. Publication of the Mathematical Institute of the Hungarian Academy of Science, 5, 17-61.
  19. Frank O and Strauss D (1986). Markov graphs. Journal of the American Statistical Association, 81, 832-842.
  20. Gao F and van der Vaart A (2021). Statistical inference in parametric preferential attachment trees.
  21. Han Y and Kim J (2023). Comparison of TERGM and SAOM : Statistical analysis of student network data. Applied Statistics, 36, 1-19.
  22. Handcock MS, Raftery AE, and Tantrum JM (2007). Model-based clustering for social networks. Journal of the Royal Statistical Society, Series A, 170, 301-354.
  23. Hoff PD, Raftery AE, and Handcock MS (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association, 97, 1090-1098.
  24. Holland PW, Laskey KB, and Leinhardt S (1983). Stochastic blockmodels: First steps. Social Networks, 5, 109-137.
  25. Holland PW and Leinhardt S (1981). An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association, 76, 33-50.
  26. Honey CJ, Sporns O, Cammoun L, Gigandet X, Thiran JP, Meuli R, and Hagmann P (2009). Predicting human resting-state functional connectivity from structural connectivity. Proceedings of the National Academy of Sciences, 106, 2035-2040.
  27. Horstmann MT, Bialonski S, Noennig N, Mai H, Prusseit J, Wellmer J, and Lehnertz K (2010). State dependent properties of epileptic brain networks: Comparative graph–theoretical analyses of simultaneously recorded EEG and MEG. Clinical Neurophysiology, 121, 172-185.
  28. Hunter DH, Goodreau SM, and Handcock MS (2008). Goodness of fit of social network models. Journal of the American Statistical Association, 103, 248-258.
  29. Hunter DH and Handcock MS (2006). Inference in curved exponential family models for networks. Journal of Computational and Graphical Statistics, 15, 565-583.
  30. Karrer B and Newman MEJ (2011). Stochastic blockmodels and community structure in networks. Physical Review E, 83, 016107.
  31. Kim J (2019). Change-point estimation and testing for brain functional connectivity networks. Proceedings of 2019 11th International Conference on Knowledge and Smart Technology (KST) Phuket. , 226-231.
  32. Kim J, Jeong W, and Chung CK (2021). Dynamic functional connectivity change-point detection with random matrix theory inference. Frontiers in Neuroscience, 15, 565029.
  33. Kolaczyk ED (2009). Statistical Analysis of Network Models, New York, Springer.
  34. Kolaczyk ED (2017). Topics at the Frontier of Statistics and Network Analysis:(re) visiting the Foundations, London, Cambridge University Press.
  35. Kolaczyk ED and Csárdi G (2014). Statistical Analysis of Network Data with R, New York, Springer.
  36. Krivitsky PN, Handcock MS, and Morris M (2011). Adjusting for network size and composition effects in exponential-family random graph models. Statistical Methodology, 8, 319-339.
  37. Lee H and Kim J (2023). Statistical network analysis for epilepsy MEG data. Communications for Statistical Applications and Methods, 30, 561-575.
  38. Lee J, Li G, and Wilson JD (2020). Varying-coefficient models for dynamic networks. Computational Statistics & Data Analysis, 152, 107052.
  39. Lehnertz K, Bialonski S, Horstmann MT, Krug D, Rothkegel A, Staniek M, and Wagner T (2009). Synchronization phenomena in human epileptic brain networks. Journal of Neuroscience Methods, 183, 42-48.
  40. Leifeld P, Cranmer SJ, and Desmarais BA (2018). Temporal exponential random graph models with btergm: Estimation and bootstrap confidence intervals. Journal of Statistical Software, 83, 1-36.
  41. Liu J, Li M, Pan Y, Lan W, Zheng R, Wu FX, and Wang J (2017). Complex brain network analysis and its applications to brain disorders: A survey. Complexity, 1, 8362741.
  42. Lohmann G, Margulies DS, and Horstmann A, et al. (2010). Eigenvector centrality mapping for analyzing connectivity patterns in fMRI data of the human brain. PloS One, 5, e10232.
  43. Luppi AI and Stamatakis EA (2021). Combining network topology and information theory to construct representative brain networks. Network Neuroscience, 5, 96-124.
  44. Mandal PK, Banerjee A, Tripathi M, and Sharma A (2018). A comprehensive review of magnetoencephalography (MEG) studies for brain functionality in healthy aging and Alzheimer’s disease (AD). Frontiers in Computational Neuroscience, 12, 60.
  45. Netoff TI, Clewley R, Arno S, Keck T, and White JA (2004). Epilepsy in small-world networks. Journal of Neuroscience, 24, 8075-8083.
  46. Newman M (2018). Networks, London, Oxford University Press.
  47. Ortega GJ, Menendez de la Prida L, Sola RG, and Pastor J (2008). Synchronization clusters of interictal activity in the lateral temporal cortex of epileptic patients: Intraoperative electrocorticographic analysis. Epilepsia, 49, 269-280.
  48. Ponten SC, Bartolomei F, and Stam CJ (2007). Small-world networks and epilepsy: Graph theoretical analysis of intracerebrally recorded mesial temporal lobe seizures. Clinical Neurophysiology, 118, 918-927.
  49. Rácz MZ and Bubeck S (2017). Basic models and questions in statistical network analysis. Statistics Surveys, 11, 1-47.
  50. Rinaldo A, Petrovic S, and Fienberg SE (2011). Maximum likelihood estimation in network models, 829-834.
    Available from: arXiv:1105.6145
  51. Robins G, Pattison P, Kalish Y, and Lusher D (2007). An introduction to exponential random graph (p*) models for social networks. Social Networks, 29, 173-191.
  52. Robins G, Snijders TAB, Wang P, and Handcock MS (2006). Recent developments in exponential random graph (p*) models for social networks. Social Networks, 29, 192-215.
  53. Rubinov M and Sporns O (2010). Complex network measures of brain connectivity: Uses and interpretations. Neuroimage, 52, 1059-1069.
  54. Salter-Townshend M, White A, Gollini I, and Murphy TB (2012). Review of statistical network analysis: Models, algorithms, and software. Statistical Analysis and Data Mining, 5, 243.
  55. Shin S and Kim J (2023). Review of complex network analysis for MEG. The Korean Journal of Applied Statistics, 36, 361-380.
  56. Shin S, Chung CK, and Kim J (2023). Minimum spanning tree analysis for epilepsy magnetoencephalography (MEG) data. Exploration of Neuroprotective Therapy, 3, 446-456.
  57. Skidmore FM, Yang M, Baxter L, Von Deneen K, Collingwood J, He G, and Liu Y (2013). Apathy, depression, and motor symptoms have distinct and separable resting activity patterns in idiopathic Parkinson disease. Neuroimage, 81, 484-495.
  58. Snijders TAB (2002). Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure, 3, 1-40.
  59. Sporns O (2016). Networks of the Brain, Boston, MIT Press.
  60. Sporns O and Zwi JD (2004). The small world of the cerebral cortex. Neuroinformatics, 2, 145-162.
  61. Stam CJ (2014). Modern network science of neurological disorders. Nature Reviews Neuroscience, 15, 683-695.
  62. Stam R (2007). PTSD and stress sensitisation: A tale of brain and body: Part 1: Human studies. Neuroscience and Biobehavioral Reviews, 31, 530-557.
  63. Stam CJ, Jones BF, Manshanden I, Van Walsum AVC, Montez T, Verbunt JP, and Scheltens P (2006). Magnetoencephalographic evaluation of resting-state functional connectivity in Alzheimer’s disease. Neuroimage, 32, 1335-1344.
  64. Stam CJ, Jones BF, Nolte G, Breakspear M, and Scheltens P (2007). Small-world networks and functional connectivity in Alzheimer’s disease. Cerebral Cortex, 17, 92-99.
  65. Strauss D and Ikeda M (1990). Pseudolikelihood estimation for social networks. Journal of the American Statistical Association, 85, 204-212.
  66. Strogatz SH (2001). Exploring complex networks. Nature, 410, 268-276.
  67. Tallberg C (2005). A Bayesian approach to modeling stochastic blockstructures with covariates. J Mathematical Sociology, 29, 1-23.
  68. Telesford QK, Simpson SL, Burdette JH, Hayasaka S, and Laurienti PJ (2011). The brain as a complex system: Using network science as a tool for understanding the brain. Brain Connectivity, 1, 295-308.
  69. Thompson WH, Brantefors P, and Fransson P (2017). From static to temporal network theory: Applications to functional brain connectivity. Network Neuroscience, 1, 69-99.
  70. van Duijn MAJ, Gile KJ, and Handcock MS (2009). A framework for the comparison of maximum pseudo-likelihood and maximum likelihood estimation of exponential family random graph models. Social Networks, 31, 52-62.
  71. van Straaten EC and Stam CJ (2013). Structure out of chaos: Functional brain network analysis with EEG, MEG, and functional MRI. European Neuropsychopharmacology, 23, 7-18.
  72. Wasserman S and Pattison P (1996). Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*. Psychometrika, 61, 401-425.
  73. Watts DJ (1999). Small Worlds: The Dynamics of Networks between Order and Randomness, New Jersey, Princeton University Press.
  74. Watts DJ and Strogatz SH (1998). Collective dynamics of ‘small-world’ networks. Nature, 393, 440-442.
  75. Wilke M, Pieper T, Lindner K, Dushe T, Staudt M, Grodd W, and Krägeloh-Mann I (2011). Clinical functional MRI of the language domain in children with epilepsy. Human Brain Mapping, 32, 1882-1893.
  76. Zhu H, Fan J, and Kong L (2014). Spatially varying coefficient model for neuroimaging data with jump discontinuities. Journal of the American Statistical Association, 109, 1084-1098.
  77. Zhu X, Cai Z, and Ma Y (2022). Network functional varying coefficient model. Journal of the American Statistical Association, 117, 2074-2085.