
The analysis of network data has become an essential component of doing research in many fields including social and friendship networks, food webs, protein interaction and regulatory networks in genomics, neuronal connectivity networks, the World Wide Web, and computer networks. On the algorithmic side, many algorithms for identifying important network structures, such as communities, have been proposed, mainly by computer scientists and physicists. On the mathematical side, various probability models for random graphs have been studied. However, there has only been a limited amount of research on statistical inference for networks and learning the network features by fitting statistical models to data.
Given some underlying parameters, the models assume a likelihood function for the network data. Estimates for these parameters are then inferred from the data. The models, intriguingly, range in complexity from having a single scalar parameter for link probability to increasingly elaborate models constructed on counts of network summary statistics. Kolaczyk (2009, 2017) and Crane (2018) provide a complementary statistical overview for the statistical and probabilistic network analysis. Popular books on networks and network analysis began to appear two decades ago (Barabasi, 2002; Watt, 1999). Kolaczyk and Csárdi (2014) provide statistical network analysis with R codes. Rȧcz and Bubeck (2017) cover interesting topics such as community detection and preferential attachment for statistical network analysis.
A graph
Note that this applies only for a binary network. Weighted (or valued) networks are described using non-negative integer values for the entries in
There are two main types of networks: directed and undirected. In an undirected network, relationships are non-directional by their nature. Their edges indicate a two-way relationship having no specific direction, and the source and target nodes are interchangeable. In a directed network, connections between nodes are directional. Their edges indicate a one-way relationship.
In this paper, we consider the undirected graph networks and related models. We present a succinct review in statistical network modelling up to the current state of the art. Section 2 provides an introduction to the terminology and basic statistics for the network analysis. Static network models are introducted in Section 3 and tempral network models in Section 4. Network modeling procedure is discussed in Section 5 including popular softwares. In Section 6, as an important examplary application of network data, human brain neuroimaginge data analyses are discussed. And final comments are given in Section 7.
We briefly review the common basic measures for the network summary statistics. We may be interested in how nodes interface at a local level. This behavior within the graph can be summarised using various statistics. Highlighting nodes of importance to the network is also a task of statistical network analysis quantitatively and qualitatively. Barabási (2016) describes the network measures comprehensively and extensively.
The degree
The nodes with a high degree can be regarded as important points in the network (so-called hubs) (Newman, 2018; Caldarelli, 2007). An average of degrees can be computed for the whole network as
In addition, the degree distribution
As a network characterization, the goal is to have some measures that allow an informative description of the part of the network. The importance of a node can be quantified using centrality measures. Betweenness centrality (BC) is defined as the number of shortest paths going through a node or edge. The BC is high when the node or edge is used for many shortest paths, and it can gauge a node’s influence by considering its role in linking other nodes together in the network. This measure can be normalized by dividing it by its maximum value (the total amount of shortest paths in the network)
where
Closeness centrality (CC) measure is defined as the inverse of the average of the geodesic distance
where <
Eigenvector centrality (EC), a different type of measure of node centrality, allocates a value to each node in the network in such a way that the node receives a large value if it has strong connections with many other nodes in a central position (Lohmann
where vector
A Motif is grouped nodes in a simple subgraph. Triangles of three interconnected nodes are examples of such motifs. The simple motif of a triangle underlies the important property of clustering. It provides insight into the network’s structure and function, as well as the interactions between the nodes.
When larger groups of nodes tend to be more connected within groups than between groups, this group is called a module. These modules, also known as network communities or clusters, represent cohesive substructures and play a significant role in the network organization and function.
The most basic and frequently used cohesive measure is the clustering coefficient, which measures local connectivity. The clustering coefficient
The shortest path length is the smallest number of edges between two nodes. The average shortest path length is a measure of global efficiency and defined as the mean number of steps along the shortest paths between all possible pairs of network nodes. Intuitively, when the average path length is short, it is easy to travel fast from one point in the network to a random other point.
Watts and Strogatz (1998) and Watts (1999) have shown that graphs with many local connections and a few random long-distance connections are characterized by a high clustering coefficient and a short path length. Such near-optimal networks are designated as ‘small-world’ networks. Since then, many types of real networks have been shown to have small-world features. For example, patterns of anatomical connectivity in neuronal networks are particularly characterized by high clustering and a small path length (Watts and Strogatz, 1998).
The existing statistical network models may be organized along several principal axes, such as static vs. dynamic models. Static network models concentrate on explaining the observed links on a single network snapshot. In contrast, dynamic network models are concerned with the mechanisms to govern changes in the network over time. Static network models have been the main focus of research firstly. However, with the emergence of dynamic temporal networks in recent years, there has been growing interest in dynamic modeling.
The Erdös-Rényi model (1960, 1981) is the most basic probabilistic network model. This model assumes that the presence and absence of edges between all pairs of nodes are iid, where
where the product is over all node pairs
However, due to the assumption of independent edges and equal probability of connectivity between pairs of nodes, this model is not appropriate for modelling many real world networks. Instead, it serves as a null model, one in which there exists no structure.
The
This model for directed graphs was introduced by Holland and Leinhardt (1981). There are four possible states of linkage between two nodes
where {
The
where
This formulation can be used for model fitting for the
Exponential (family) random graph models (ERGM) are a family of models (
The
where
Fitting ERGMs necessarily involves approximating this normalizing constant which is often very difficult to obatin. The network summary statistics are chosen by the analyst and may be calculated on higher order than two interactions. Typical choices include the number of edges, the number of triangles and the number of
Maximum pseudolikelihood fitting involves computing change statistics for the network and finding a pseudo MLE for the parameters using a pseudolikelihood approximation of the likelihood. The pseudolikelihood approximation is computed as the product of the full conditionals for each dyad given the rest of the observed network with the log odds of a particular dyad linked as
Following Robins
where
Estimation of the parameters
In both inferential approaches, model degeneracy may be an issue. The degeneracy refers to the situation in which only a few networks have appreciable probability, given the model. If degeneracy or near degeneracy occurs, then the parameter estimates may not converge, and maximum pseudolikelihood estimators will yield misleading results in these cases. Therefore, new specifications for ERGMs have emerged, and network summary statistics have been specifically chosen to address degeneracy problems. These include conditioning on the number of edges or the inclusion of alternating k-stars and alternating k-triangles, geometrically weighted degrees, edgewise shared partner, and dyad-wise shared partner statistics. Hunter
Community detection is a fundamental problem in many sciences, such as sociology (e.g., finding tie groups in social networks), biology (e.g., detecting protein complexes), and neuroscience (e.g., searching connected neurons). Given its importance, there have been a plethora of algorithms to detect communities. However, how can we test whether an algorithm performs well? What are the fundamental limits to any community detection algorithm?
Probabilistic generative models can be used to model real networks. They can act as benchmarks for comparing different clustering algorithms. One of the most widely studied generative models that exhibit community structure is the stochastic block model (SBM). Holland (1983) first introduced the SBM in sociology. Moreover, it was then studied in several different scientific communities, including statistics (Bickel and Chen, 2009). The SBM gives a distribution on graphs with a hidden partition of the nodes into
A random graph from the general stochastic block model SBM(
The vertex set of the graph is
Every vertex
Given the labels of the vertices, each (unordered) pair of vertices (
Write
Weak recovery (detection). An algorithm is said to weakly recover or detect the communities if it outputs a partition of the nodes which is positively correlated with the true partition, with high probability (whp).
Partial recovery. How much can be recovered about the communities? An algorithm is said to recover communities with an accuracy of
Exact recovery (recovery or strong consistency). The strongest notion of reconstruction is to recover the labels of all vertices exactly whp. When this is not possible, it can still be of interest to understand which communities can be exactly recovered, if not all; this is sometimes known as ‘partial-exact-recovery’.
By introducing the indicator variable as
and introducing
In the more general case where attribute data is unknown, block membership assignation must be assigned with the use of latent variables.
Lei’s (2016) groundbreaking work in developing a goodness-of-fit test for the stochastic block model is a significant advancement in our understanding of network analysis and community detection. The test statistic is based on the largest singular value of a residual matrix obtained by subtracting the estimated block mean effect from the adjacency matrix. Asymptotic null distribution is obtained using random matrix theory (RMT).
In many applications, communities are not disjoint but overlapping. Detecting overlapping communities is a complex and intriguing challenge that requires further exploration. How well can we do it? This question sparks curiosity and invites further discussion. What are the fundamental limits? To understand the utility of the SBM, we must also study how robust the obtained results are changes in the model.
Hoff
All network data models rely on the information contained in the degrees associated with the nodes of the corresponding graphs. Assume the model is described by a possibly vector-valued parameter
where
We need to formulate conditions that guarantee that the maximum likelihood estimator (MLE) exists with probability tending to one as the number of nodes increases. Rinaldo
Given
Karrer and Newman (2011) worked with undirected valued graphs with Poisson and degree-corrected block models. Let
Tallberg (2005) proposed a model in which the group memberships
where
The likelihood of network data reflects on the data situation, and needs its computational algorithm. If the likelihood is complicated, the usable algorithms for detecting the nonexistence of the MLE and for identifying non-estimable parameters. EM (expectation-maximization) algorithm is an alternative, and VEM (variational EM) (Chang and Blei, 2010; Corneli
A preferential attachment (PA) tree is a type of network structure with a preferential attachment mechanism that governs the formation of connections between nodes in a growing network, such as collaboration networks. In the context of a tree structure, preferential attachment can influence how new nodes are connected to existing nodes, describing a rich-get-richer effect in network transformation (Barabási, 2016; Barabási and Albert, 1999).
If the corresponding degrees are {
Here
Gao and van der Vaart (2021) proposed the study of the MLEs of the parametric preference function with their consistency and asymptotic normality.
A dynamic network model is a statistical framework describing how nodes and edges change or how networks evolve over time. These dynamic or temporal network models explicitly incorporate time into their structure. Different mathematical/statistical tools are used to describe dynamic networks, such as stochastic processes, differential equations, agent-based modeling, or graph theory in order to capture the complexity of network dynamics. Modeling dynamic networks is a formidable task, primarily due to the complexity of real-world data, the need to balance relational system with simplicity, and the computational demands when dealing with large-scale networks.
Dynamic network modeling is a crucial tool in various interdisciplinary fields such as social science, biology, neuroscience, computer science, and economics. Networks in these fields, representing relationships between entities, are rarely static; they change and evolve due to various processes. Brain networks, whose edges represent the transmission of an electrical charge between neurons change depending on brain function and activity. Social networks, whose edges represent friendships, often change over time due to increased/decreased social activity.
The edges associated with different vertices evolve on different time scales (for example, Figure 2). There is a need to model a dynamic network whose vertex set stays fixed while its edges vary over time. A dynamic network is one whose entire structure varies concerning time. Thus a dynamic network is a collection
The temporal exponential random graph model (TERGM) (Doreian and Stokman, 1997; Leifeld
The transition probabilities of the TERGM are given by
where
Zhu
There is a significant and growing interest in examining the temporal dynamics of the brain’s network activity. This interest underscores the importance of understanding the fluctuations in brain connectivity, a key area that has seen limited exploration using temporal network theory. Thompson
Quantitative methods of model assessment currently fall into four overlapping categories;
Comparison with ground truth/nodal attributes: comparing aspects of the fitted model to observed nodal attributes. Any community detection algorithm that separates the nodes into clusters that are consistent with this split is deemed to be a good model.
Link prediction: comparison of the predictive probabilities under the fitted model for all possible links with the observed links/non-links. This method is used for generative models, and inspection of the fit is often performed visually.
Network process modeling: explaining and optimizing network performance and predicting how changes in some parts of network might affect others.
Goodness-of-fit diagnostics: the method involves simulation of networks from the fitted model; summary statistics (such as degree distribution, shared partners distribution, geodesic distance and etc.) derived from these simulated networks are then compared with the corresponding observed values.
Model comparison: comparison of competing models under an information criterion such as BIC or assessment of both models’ goodness-of-fit. Comparison of parameter selection using AIC can be used with selection based on statistics derived from the goodness-of-fit diagnostics.
In order to perform inference on a network, we can implement a variational approximation to the model using some software. There are some software packages available for visualization and analysis of network data. The R package igraph allows the efficient manipulation and decomposition of directed and undirected graphs. The statnet suite of R packages, including sna, ergm, and network, provides tools to obtain network summary statistics and both node and graph-level indices for a variety of centrality measures. The RSiena contains visualization tools using the standard layout algorithms and sna can be used to fit stochastic block models and estimate block structure in network datasets.
Python, a versatile language, offers a range of graph libraries and packages specifically designed for complex network analysis. For example, the Python package NetworkX is a powerful tool for conducting network analysis.
The network structure of the human brain by representing brain function has become an increasingly exciting feature to the neuroscientific community, because it has the potential to illuminate human cognition, to characterize over-development and aging, and identify or distinguish disease or injury (Liu
Brain connectivity datasets comprise networks of brain regions connected by anatomical tracts or functional associations. Brain networks are invariably complex and represented using complex networks, and characterized via complex network methods (Bassett and Bullmore, 2006; Bullmore and Sporns, 2009; Telesford
First, complex network analysis promises to quantify brain networks with some neurobiologically meaningful and easily computable measures (Deuker
An increasing number of studies provide disturbances of the optimal organization of brain function in brain disorders. These results show that even localized processes impact on the global topology due to the integrative nature of the system (Sporns, 2016). Some patterns of network changes are found in brain disorders (Shin and Kim, 2023; Lee and Kim, 2023; Han and Kim, 2023; Shin
We provide some network analysis results with the pattern interpretation for brain disorders. Alzheimer’s disease (AD) is a neurodegenerative condition affecting cognitive function, with memory deficits. Functional networks in AD lose their normal small-world structure, and regress toward a more random architecture (Mandal
Parkinson’s disease (PD) is a neurodegenerative disorder with prominent motor signs. Skidmore
Brain tumors have an impact on anatomical structures, leading to loss of function and epilepsy. Bartolomei
Epilepsy is referred to as a result of a hyperexcitable state of the brain and characterized by an abnormal synchronized firing activity and the neuron involved seizure (Lehnertz
Schizophrenia is regarded as a disorder of abnormal brain functional connectivity and reported as decreased structural functional connectivity and reduced global integration of specific brain areas, especially frontal hubs. Changes in functional networks might be detectable before the disease is clinically manifest. These signs could identify subjects at a particular risk of developing the disease (Bullmore and Sporns, 2009).
Healthy structural and functional brain networks are characterized by a cost-effective architecture that optimally balances local and global connectivity and has a hierarchical modular structure (Stam, 2014). Normal brain network organization arises during development under genetic control and is related to cognitive function. Local brain lesions cause widespread network changes, whereas global brain disorders preferentially affect highly connected hub regions. The discoveries of modern network science have unveiled key characteristics of typical brain-network structure, including small-world and scale-free patterns, hierarchical modularity, hubs, and rich clubs. The next step is to harness the power of this information and knowledge to significantly advance our understanding of brain disorders, offering hope for improved treatments and outcomes.
The comparison of the brain networks of healthy subjects to patients of neurological or psychiatrical disease has added to knowledge of the mechanisms involved in brain diseases. One of the challenges is to establish the relationship between network changes and pathological processes for other diseases and to relate functional networks to therapeutic strategies. Another challenge is developing additional network models capturing and describing the brain’s functional structure.
We have presented a concise summary of network models and methods for the statistical analysis of network data. We explored classical models in which the presence of a link between two nodes depends on the network graph structure and latent variable models in which the presence of a link in the network depends on the presence of a latent variable. In recent times, network data has become proliferative. Modern data collection methods allow network data to be collected more easily over time. When analyzing temporal network data, it needs to aggregate across time or to consider single-time snapshots of the network and analyze these data using static network techniques. Therefore, proper dynamic methods for network analysis are rapidly growing. There is much scope for modeling innovations for such temporal network data.
Many challenges remain in the field of statistical analysis of network data. There are many open problems and further directions for network data analysis. For instance, can the bounds on the size of the optimal confidence set be improved and ultimately tightened? What about other tree growth models? What happens when we lose the tree structure and consider general graphs? Moreover, real network data can motivate many interesting research topics. This review concentrates on the statistical modeling of network data rather than other computational approaches. We emphasize that statisticians can develop advanced statistical network models and should contribute to network data analysis.