流网络分析

来自集智百科
跳转到: 导航搜索

目录

Introduction

Network science provides a new perspective to analyze the flow of collective attention. By tracing the flow of individual attention, we can build up the flow network for collective attention. In the constructed network, nodes are the information resources visited by users, and edges represent the switching of collective attention from one node to another. Different from the other networks, the edges are both directed and weighted, and more importantly, the inflow must equal the outflow for each node in attention flow network. Therefore, we have to add two external nodes (“source” and “sink”) to balance the flow of the nodes. The balanced flow network is also regarded as “open flow network”, since it considers the flow from and to the environment (Guo et al., 2015).

The characteristics of the attention flow networks discussed above calls for a new approach of analysis. The research on flow network analysis supplies a systematic way to understand the flow of collective attention. By reviewing current research on flow network analysis, we outline our analytical framework, with particular focus on three aspects of the attention flow network, namely flow impact, flow distance, and flow dissipation. To be specific, we frame our analysis in the perspective metabolism, and the information system (e.g., websites) are viewed as “virtual living organisms that grow at the expense of users’ attention” (L. Wu et al., 2014, pp. 1-2). The number of unique visitors (UV) can be viewed as the “body mass”, and the number of page views (PV) are viewed as the energy or nutrients. By absorbing users’ attention into a website and dissipate part of the attention out, the website produces information, and interaction between users and information resources gives rise to the allowmetric growth,


 PV \sim  UV^{\eta}                                (1)

Flow Dissipation

For any node i in the attention flow network, the amount of attention flowing from i to sink is defined as its dissipation, D_i. We can also capture the other similar measurements of the flow network. For example, the flow passing through node i is denoted as Ai. The flow coming directly from source is denoted as Si, and the flow out of node i which does not flow directly to the sink is denoted as Fi. For a balanced attention flow network at a given time point t,

UV_t  = \sum_i^N D_i ,                                                                (2)
PV_t  = \sum_i^N A_i .                                                                 (3)

Similar to the allowmetric scaling relationship between PV and UV, there also exists a power law relationship between D_i and A_i,

 D_i \sim A_i^{\alpha} ,                                                                     (4)

where the scaling exponent α denotes the efficiency of dissipation. Eq (12) is also regarded as “the dissipation law” (Cheng-Jun Wang & Wu, 2016; Jiang Zhang & Wu, 2013) On the contrary, 1/α can be used as an indicator of the attractiveness of the information resource (Cheng-Jun Wang & Wu, 2016).

If the dissipation exponent α is larger than 1, the flow dissipation from node i increases faster than the flow through node i. The nodes with a larger flow tend to dissipate more flow to sink, which corresponds to the star-like flow network topology (Lingfei Wu & Zhang, 2013; Jiang Zhang & Wu, 2013). On the contrary, if the dissipation exponent α is smaller than 1, the nodes with a larger amount of traffic tend to dissipate less flow to the sink, which corresponds to the chain-like flow network topology (Lingfei Wu & Zhang, 2013; Jiang Zhang & Wu, 2013).

Flow Impact

The flow impact Ci of a node in attention flow network measures its capability in controlling the flow circulation of the network. The idea of flow impact comes from the study of Kleiber’s law named after Max Kleiber (1947) who found that animals’ metabolic rates scale to the 3/4 power of their body weights. Let A denote the metabolism rate or the total flow from source to the network, and let C represent the total mass or the summation of all individual flow rates in the network,

A ≈ C^γ,                                                                      (5)

and γ is the allometric exponent. Interestingly, later research showed the allometric scaling relationship between A and C holds for the river networks, and γ = 2/3 in this case (Banavar, Maritan, & Rinaldo, 1999). Researchers tried to explain the allometric scaling law following the assumption that the most efficient transportation network was embedded in a d dimensional space, for example, the river network was embedded in a 2d space and the vascular network was embedded in a 3d space, γ = d/(d+1) (Banavar et al., 1999; Dreyer, 2001; Jiang Zhang & Wu, 2013). Dreyer (2001) further created a 1d space and ran water over permeable tissue. The relationship between the length of tissue wet by water A and the total mass of water C also follows the allowmetric scaling law, and the exponent γ = 1/2, which was in good agreement with prior findings.

West et al. (1997, 1999) proposed that the Kleiber’s law origins from the fractal structure of transportation system within living system, and they developed a model of spacefilling hierarchical networks to explain the observed quarter-law allometric scaling. Banavar et al. (1999) proposed a method to deal with the discrete transportation network, especially the river networks. To be specific, they defined the size of the drainage basin of a river as A and the total amount of water contained in the water as C. For a sub-basin X of the river,

A_(X )= ∑_(Z∈nn(X))▒A_Z +1,                                                        (6)

where nn(X) are the nearest neighbours that drains into X, and

C_(X )= ∑_(Z∈γ)▒A_Z ,                                                                 (7)

in which γ is “the collection of all sets connected to X through drainage directions” (Banavar et al., 1999, p. 132). Dreyer (2001) proposed a continuous approach similar to the method of Banavar et al. (1999). Inspired by the model of Banavar et al. (1999), Garlaschelli et al. (2003) proposed the method to compute A and C by converting the food web to a minimal spanning tree. For the minimal spanning tree, Ai of node i is the throughflow of the node, and Ci is the total throughflow of the nodes rooted from node i. The method of Garlaschelli et al. (2003) has obvious shortcomings for flow network analysis: first, when some edges are removed, the information about the flow is lost; second, it can only be used for binary network rather than weighted networks.

Zhang and Guo (2010) proposed a systematic method to solve the problem: Given a flow network G with N nodes, two artificial nodes “source” and “sink” need to be added into the network to balance the inflow and the outflow of each node. Node 0 is the “source”, and node N+1 is the “sink”. The balanced flow network can be expressed as a flow matrix F. The element fij in F denotes the weight of flow from node i to j. The flow matrix F can be normalized by row to get the transition matrix M, and the element m_ij of M satisfies

m_ij = f_ij⁄(∑_(k=1)^(N+1)▒f_ik ). (8)

The fundamental matrix U can be derived from M,

U=I+M+M^2+⋯=∑_(i=0)^∞▒M^i =〖(I-M)〗^(-1),                                       (9)

in which I is the identity matrix. The element u_ij in the fundamental matrix U represents the total flow from i to j along all possible pathways. Using the flow matrix F, we can obtain the total flow through any given node i,

A_i  = ∑_(j=1)^(N+1)▒f_ij .                                                                 (10)

Accordingly, the flow impact C_i can be defined as the following to capture the total direct and indirect flow stored in the sub-network rooted from node i,

C_(i )=∑_(k=1)^N▒∑_(j=1)^N▒〖(f_0j u_ji/u_ii ) u_ik 〗. (11)

According to Zhang and Guo (2010), both A_i and C_i in the weighted food webs follow the power law distribution, and the relationship between A_i and C_i also follows a power law relationship,

C_(i )=〖A_i〗^η. (12)

Garlaschelli et al. (2003) argues that for the binary network, C_(i )represents the cost of the transportation, and the scaling exponent η represent the efficiency of transportation. η ranges from 1 to 2, where 1 denotes the most efficient network (star-like network), and 2 denotes the least efficient network (chain-like network). While Zhang and Guo (2010) interpret the meaning of C_(i )and η differently. C_(i )is the total influence of node i on the other nodes in the flow network, and η represent the capability of the network to store flow within the system. The flow network with larger η can store more flow “by means of cycling the flows in the network” (J. Zhang & Guo, 2010, p. 765). Additionally, the range of η is not exactly [1, 2], and the range [1, 2] only applies to the minimal spanning tree (J. Zhang & Guo, 2010).

It is also necessary to note that η also reflects the degree of centralization (Jiang Zhang & Wu, 2013). For an attention flow network whose flow passing through node i is A_i, with the increasing of η, the distribution of C_(i )can become much more uneven. The flow network with η>1 is centralized, while the flow network with η<1 is decentralized. Employing this method, it is found that most trade networks are centralized (P. T. Shi, Luo, Wang, & Zhang, 2013). In previous research, the flow impact exponent η and dissipation exponent α were usually negatively correlated (L. Wu et al., 2014; Jiang Zhang & Wu, 2013) .

Wu and Zhang (2013) analyzed the flow structure of the clickstream on the Web. Clickstream is an ordered sequence of webpages visited by users. Using clicksteam as a proxy of the flow of collective attention, they confirmed the power law relationship between C_(i )and A_i, but the scaling exponent η is smaller than 1, implying a decentralized flow structure of the World Wide Web (WWW). The overall decentralized flow structure for the whole WWW is natural, since the people with the same language tend to visit the websites within the sub-community of the entire WWW. However, they found that the scaling exponents η for the sub-communities of different languages are regularly smaller than 1. By conducting network permutations, they further found that the decentralized flow structure for the collective attention is related to “the topological structure of the clickstream network rather than the distribution of weights on clickstreams” (Lingfei Wu & Zhang, 2013, p. 5).

Flow Distance

The flow distance of a node measures how many steps a random walker takes to reach the node from source in the attention network (Guo et al., 2015; C. J. Wang, Wu, Zhang, & Janssen, 2016). We can assume that there are many random walkers or particles randomly walking on the flow network. The first-passage flow ϕ_ij from i to j is defined as the number of random walkers that reach j at each time step for the first time after they have visited i. Accordingly, the average number of steps that these random walkers have taken is defined as the first-passage flow distance l_ij. Similarly, the total flow ρ_ij from i to j is defined as the number of random walkers that arrived at node j at each time step (no matter it is the first time or not) after they have visited i. And the average number of steps these random walkers take is defined as the total flow distance t_ij.

Let p_ij^k denote the probability that random walkers transfer from i to j after k steps. Then the total flow distance t_ij from i to j along all possible pathways can be obtained,

t_ij=∑_(k=1)^∞▒〖kp_ij^k 〗. (13)

Given the flow from i to j after k steps is ϕ_0i 〖(M^k)〗_ij and the total flow from i to j is ρ_ij, we have p_ij^k = (ϕ_0i 〖(M^k)〗_ij)⁄ρ_ij . (14)

Based on Eq. (13) and (14), Guo et al. (2015) get the expression for t_ij:

t_ij=∑_(k=1)^∞▒k (ϕ_0i (M^k )_ij)/ρ_ij = (ϕ_0i (〖∑_(k=1)^∞▒k M〗^k )_ij)/ρ_ij =(ϕ_0i (MU^2 )_ij)/ρ_ij =(ϕ_0i (MU^2 )_ij)/(ϕ_0i u_ij ) = (MU^2 )_ij/u_ij , (15)

where M is the transition matrix, and U is the aforementioned fundamental matrix, and u_ij is an element of U.

Similarly, the expression for the first-passage flow distance can be obtained. Let q_ij^k denote the probability that random walkers jump from i to j after k steps for the first time. Then the first-passage flow distance l_ij from i to j along all possible pathways can be expressed,

l_ij=∑_(k=1)^∞▒〖kq_ij^k 〗. (16)

Given the expression for the probability q_ij^k = (ϕ_0i 〖(M_(-j)^k)〗_ij)⁄ϕ_ij , Guo et al. (2015) showed that l_ij can further be expressed,

l_ij=  t_ij-t_jj=(MU^2 )_ij/u_ij -(MU^2 )_jj/u_jj .                                                                (17)                                    

Based on first-passage flow distance l_ij and l_ji, the symmetric flow distance c_ij can be defined,

c_ij  =  (2l_ij l_ji)⁄(〖(l〗_ij+l_ji )).                                                                (18)

Flow network can be embeded into a Euclidean space in which the distance between any two nodes equals their flow distance. Flow distance reflects the nature about how random walkers move on the flow network, thus it connects the network topology with flow dynamics. By embedding flow network into a proper geometric space, our understanding about the underlying dynamics may be simplified. For example, using the effective distance instead of conventional geometric distance, Brockmann and Helbing (2013) captured the hidden geometry of the air-traffic-mediated epidemics, and found a very good linear relationship between the arrival times for cities and the effective distance. Shi et al. (2015) further specified the network embeding method based on symmetric flow distance c_ij, and applied the method to the clicksteam network of Indiana Unversity. Their findings indicated that websites can be grouped intro three layers (P. T. Shi et al., 2013). The most popular websites, such as google.com, are in the central of the embedded geometric space featured by a large amount of attention flow and a small fraction of dissipation. A large number of websites with a small amount of attention flow and a large fraction of dissipation lies between the core and the periphery. And the small websites with few attention flow and a small fraction of dissipation stays in the periphery.

Flow distance is important for us to understand the lifecycle of news. For example, Wang et al. (2016) measures the flow distance for the news on Digg—a social news website, and they found that attention flow network of news stories preserved a stable tree-like structure over time which led to the scaling between the number of users and clicks.

flownetwork

The python package flownetwork[1] has been developed for the data analysis of flow networks

相关页面

Word2Vec与流网络 - 流网络根据流距离嵌入空间

参考文献

  1. Cheng-Jun Wang, 2017 flownetwork: A python package for flow network analysis. from https://pypi.python.org/pypi/flownetwork
个人工具
名字空间
操作
导航
工具箱