Measuring scientific collaboration in co-authorship networks

Received Oct 23, 2020 Revised Aug 25, 2021 Accepted Sep 10, 2021 Scientific research is currently considered one of the key factors in the development of our life. It plays a significant role in managing our business, study, and work more conveniently. One of the important aspects when it comes to scientific research is the level of collaboration among researchers/disciplines. The collaboration between two different disciplines contributes to obtaining more reliable solutions for our everyday issues. Therefore, it is needed to understand the collaboration patterns among researchers and come up with convenient strategies for strengthening this kind of collaboration. In this work, we aim at investigating the patterns of scientific collaboration among researchers across disciplines. To this end, we generate a co-authorship network for several disciplines. The generated network reveals many interesting facts regarding the collaboration patterns among researchers who work in the same/different disciplines. We involve several measurements in this study that evaluate different aspects, which is of interest to the research communities since most of the studies in the literature measure specific aspects. Moreover, we propose a novel metric for measuring scientific collaboration in a research community and use it to benchmark the collaboration among disciplines. Finally, we use the obtained results/facts in providing recommendations for scientific communities.


Figure 1. An Example of generating a co-authorship network
Recently, the area of co-authorship networks has attracted research communities due to its role in improving and strengthening the collaboration patterns among researchers. Revealing the patterns of this kind of collaboration has been studied in the literature such as the distinguished studies of Newman [3] and Girvan-Newman [8], they used three bibliographic datasets for three fields of study; biology, mathematics, and physics. The goal of these studies was to investigate the collaboration patterns among the authors who work in the same area of study as well as among the three mentioned areas. They used statistical tools in the analysis approach. One of the interesting results that were obtained, the biological scientists have a strong tendency to co-author papers with authors from the same field, and this tendency is significantly decreased for mathematicians or physicists. These results were also confirmed in the study of Coccia and Bozeman [9]. In another study by Newman [10], they generated three networks for three areas of research; Computer Science, Physics, and Biomedical Research. They investigated and studied these networks and found the best-connected scientists in terms of the strength of collaboration. In the analysis, they used network centrality measurements. In 2013, Divakarmurthy and Menezes [11] investigated the collaboration patterns and citation patterns among the authors of the association for computing and machinery (ACM). They generated two collaboration networks; the first one was based on the citation of articles and the second was based on publications only without considering articles' citations. They used several network measurements such as degree centrality, betweenness centrality, closeness centrality, and the characteristics of community structure in evaluating authors. They also compared the results of the two generated networks to rank authors. Furthermore, the characteristics of Social Networks can also be considered as a useful tool in understanding the collaboration patterns among researchers as presented in the distinguished work of Barabasi et al. [12]. They considered the evolution of the social network of scientific collaboration in deeply understanding what is driving the collaboration patterns that exist among researchers.
The main problem in the literature is the lack of providing a deep evaluation of the performance of research communities/researchers in terms of collaboration. Most of the works perform the evaluation based on fixed factors (e.g., the number of published articles and citations) or the evaluation is performed using a single network measurement that reflects one fact on a network/author. Furthermore, the majority of works in the literature were performed using specific repositories. This case might force the evaluation to be biased to Most of the works in the literature considered few fields of research from a particular repository (e.g., ACM, IEEE). In fact, investigating the relations/collaboration among disciplines need to include as many disciplines as possible and varying the repositories aiming at enriching the dataset with fruitful items and eventually provides more reliable results.  Most of the proposed approaches use a particular measurement that evaluates and reflects one fact on a discipline/author. In this case, the evaluation process looks at the problem from one angle, which is not sufficient when seeking a comprehensive analysis.
In this article, we try to fill the aforementioned gaps and come up with results and recommendations for research communities. Hence, the contributions of this work can be summarized as:  Generate a co-authorship network that includes mainly 6 major disciplines (and subdisciplines). Many repositories are used and the data collection is performed on Google Scholar. This makes the dataset colorful of items and the analysis to be more comprehensive and reliable.  Propose a novel metric that can evaluate the strength of collaboration in a research community (discipline). This metric is based on several network measurements, which gives a view from different angles (utilizing different features) and obtain a more accurate evaluation. We believe that filling the aforementioned gaps will make the difference between our work and the works in the literature. There are many web applications (e.g., Publons, Scopus, Google Scholar) that evaluate the scientific status of an institution/researcher using well-known indicators such as h-index, Cite-Score in Scopus, or Impact Factor (SCIE and SSCI) in Clarivate. However, our approach evaluates the scientific collaboration based on the relations among disciplines/researchers and this feature is not available in current web applications. The main advantage of our work is that; it can be used (or integrated) to implement a web application that provides recommendations for a research community in real-time, which is a new service that can be shared.
This article is organized as; the next section includes our research methodology including the dataset collection and network measurements. Section 3 contains the results obtained and a discussion on the generated co-authorship network and the proposed metric. Finally, we conclude our article in section 4.

RESEARCH METHOD 2.1. Data collection
The dataset of this work was collected from worldwide authors. The collection process was performed on Google Scholar. We designed a special-purpose crawler for retrieving the data using the R language. The collected dataset included information on researchers and the articles they have authored/coauthored. We generated a co-authorship network using the collected dataset that includes researcher name, discipline, affiliation, number of articles published, articles' titles, number of co-authors in each article and co-authors names, journal/conferences name for each article, and publishing year. The total number of authors in our dataset was about 3444 authors after removing the noisy data. The strategy that was used in dealing with the co-authorship network was based on classifying authors into groups of disciplines. The fields of chemistry, physics, and biology were formed as a science group (SC). In the same way, engineering group (EN) including architectural, civil, mechanical, and electrical fields; computing group (CO) including computer science, mathematics, statistics, and operations research; education group (EDU); agriculture group (AG); and business and administration group (BA) including marketing, finance, accounting, and business administration. The dataset contains 6 main disciplines, which are our targeted disciplines in this work. It should be mentioned that the number of articles used in this work was 12,532. Dividing this number on the number of authors taken in this work (3444) leads to having an average of 3.6 papers per author in our coauthorship network, taking into considerations there are single-author papers with no collaborators. The data collection had the issue of author name disambiguation. To solve this issue, we used a particular strategy, which states that distinguishing authors can be performed through their ORCID numbers. However, in case of this identifier is not available for a particular author, we used his/her name along with his/her affiliation as the primary key in the database. Furthermore, we took into considerations that this issue cannot exist with the authors from the same institution. As mentioned, our co-authorship network contains international authors from worldwide institutions. During the data collection, our crawler went through two depths. More precisely, in depth-1 the crawler started randomly with a publication and took its main author and co-authors (if available). Then, the crawler looked for those co-authors if they appear as the main authors/co-authors in other publications and extract their information, which is depth-2. We strongly believed that this strategy further enriched our dataset with colorful patterns of collaboration, which is our purpose in this work. The generated network included 3444 nodes and 4240 edges.

Network measurements
The analysis of this work was based on many network measurements, each of which has the ability to reveal a particular fact on the co-authorship network. The main reason behind using these indicators was to enable readers to understand each measurement and what can evaluate in a co-authorship network. These measurements can be either used at the node (author) level or network (community or discipline) level. Now, we consider our network graph G=(N, E), where N represents network nodes and E represents network edges. The measurements we have used in this work are:  Clustering Coefficient (C): it reflects the tendency of network nodes to cluster together [13]. The value of C depends on the number of triangles that are formed by a particular node (3 nodes connected to each other). In a co-authorship network, C measures the tendency of authors in co-authoring articles and can be local clustering coefficient (Ci) or global clustering coefficient (C G ). The former can be defined for each author as: where is an article between the authors and . is the total network authors and is the neighbors' authors in the network. On the other hand, the average clustering coefficient (global) ( ) of a network can be defined as: where is defined in (1) and is the number of network authors.  Average Path Length ( ): for all possible pairs of authors in the network, it is defined as the average number of paths (steps) for all the shortest paths among the pairs of authors [13]. It shows the average shortest distance among authors and can be defined as: where is the length between author and author .  Diameter ( ): for a network, it is the longest path among all the shortest paths [14]. In our work, it calculates the distance between the farthest authors in the network.  Density ( ): it is the proportion of the number of network edges to the number of potential (possible) edges in that network, which means how close a network to be a fully-connected [14]. It shows the collaboration density among authors as well as the potential collaboration among authors and can be defined as:  Communities ( ): refers to the groups of nodes in a network that are densely connected with each other. In co-authorship networks, it reveals the research groups that have articles in common (collaborative groups). In our work, we used the Girvan-Newman Clustering algorithm [8] to find the number of research communities in the network and for each discipline. This algorithm detects the links (edges) that connect network communities then removes these links and leaves only the communities themselves. This technique uses a centrality measurement called betweenness.  Betweenness Centrality ( ): it shows how many times a node appears in the shortest path of network pairs [14]. It reveals the importance of a particular node in the flow of information within a network. In other words, it represents the importance of an author in a research community. In this work, shows how influential an author in a research community and within a discipline.
of the author can be defined: where is the shortest path between the authors and . ( ) is the number of paths that pass through author .  Degree Centrality ( ): it reflects the number of connections that a particular node has in a network [14]. In the context of this work, it reflects the actual number of papers an author has published.  Closeness Centrality ( ): it represents the reciprocal of the sum of all the shortest paths of a node to other network nodes [14]. It shows how close an author to other authors in a research community and can be described as: where ( ) is the distance between the authors and . This measurement will be further used in the proposed metric.

Co-authorship network
As mentioned, our co-authorship network (CAN) was generated based on a particular strategy. It states that when two authors have participated in co-authoring an article, a link is established between them. This strategy was also followed in [8], [10], [11], [15]. CAN network consisted of 3444 authors and the number of edges was 4240 that connecting the authors. The visualization of the network is shown in Figure 2, which shows the dense level of co-authoring articles among authors from different disciplines. In the figure, the size of nodes reflects the frequency of collaboration of a particular author in the CAN network. Moreover, according to the Girvan-Newman Clustering algorithm [8], CAN include 486 potential research communities with a strong modularity level of 0.875 as shown in Figure 3. Also, all the potential communities include authors from different disciplines, which is a positive indicator of the future of scientific collaboration among disciplines. It should be mentioned that the number of communities is dynamic and changed over time. Figure 4 depicts the degree distribution of the CAN network, which followed a power-law. According to [3], [12], co-authorship networks follow this kind of distribution since there exist few authors with a high degree (authored/co-authored large number of articles), while a large number of authors with low degree. This phenomenon reflects one of the most important features in co-authorship networks. Based on the distinguished work of Barabasi and Bonabeau [16], when the degree distribution of a network follows a power-law, the network is considered to be Scale-Free Network. Therefore, the nodes of CAN network evolution are governed by the preferential attachment feature according to [12], [17]. This feature reflects important facts on a network. One of these facts is that the newly connected authors prefer to attach to the highly connected authors within the network. It can be inferred, fresh researchers should try to consider senior researchers for their future collaborations because this leads to improve their positions in the research community and eventually improve the quality of their researches. Furthermore, we benchmarked the CAN network with other co-authorship networks in the literature. The goal of this comparison is to see how our network performs compared to other networks. Table 1 presents a comparison between CAN and 4 other co-authorship networks, namely, ACM, Biology, and Physics networks [11], and Engineering network [18]. According to the aforementioned table, the C value of CAN reflected a weak tendency of authors to collaborate. Also, the tendency (C) of authors from the same discipline is stronger than in authors from different disciplines. The value was higher compared to the benchmarking networks. This means the shortest distance among CAN authors is longer than what was obtained from other networks. Moreover, CAN is less dense compared to the benchmarking (Density (D)=0.001), this is due to the number of disciplines (and subdisciplines) compared to the other networks. Finally, the Diameter (O) is 19 in CAN, which is higher than the other networks.
The above-mentioned results confirm our first claim in this work. The performance of CAN in terms of network measurements (C, l, D, O) underperformed the benchmarking networks. This means evaluating a coauthorship network needs to have a dataset that is colorful in disciplines and repositories.

Disciplines network
In this section, we extracted 6 sub-networks from CAN, each of which represents a particular discipline group. Practically, it is needed to perform a comparison among these disciplines in terms of the measurements mentioned in section 2.2. Table 2 presents the disciplines with the corresponding values of measurements. It can be observed that the SC discipline group has the highest number of communities, which reflects the highest level of collaboration among the other disciplines. In fact, a high number of communities in a network reflects the strong tendency of their authors to collaborate. Furthermore, when observing the number of communities in a network, the number of authors in that network should also be observed and taken into considerations. Therefore, we see that the performance of a network in terms of forming research communities should be measured according to (as we propose) the ratio ( ) of the number of scientific communities (or research groups) ( ) to the actual number of authors (number of nodes) as: where denotes a discipline. According to , most of the groups have almost close levels of forming communities as shown in Table 2. This leads to a better evaluation when investigating the scientific collaboration among authors in a co-authorship network. Furthermore, in the CAN network, the largest number of scientific communities has existed in the SC group. It also has the second-highest tendency of authors to cluster and collaborate with each other. It is inferred that increasing the number of research groups plays a significant role in increasing the research productivity of that group. Therefore, we found that about 35% of the research articles in our network belong to the SC group. This result was also verified in other works in the literature such as [3], [10], [12], [19], [20]. These works found that the authors in the field of Science have strong tendencies to collaborate. On the other hand, although its small number of communities, the engineering discipline (EN) reflects the strong relations (C=0.347) of its authors to collaborate and coauthor articles with a high dense level of collaboration (D=0.011) compared to the other disciplines. Yet, there is a difference between the number of communities and the strengths of these communities (the strength of authors' connections in a community). Furthermore, when comparing our results with the results of [11], the worldwide researchers of the Engineering discipline reflect approximately the same pattern of in terms of their tendency to collaborate. As can be seen, each network measurement can be considered as an indicator to reflect a specific fact on network communities and the collaboration patterns in disciplines.

CAN best connected authors
As mentioned, network measurements can be used in two levels (network and node levels). In this section, the CAN network is analyzed based on node-level measurements. We aim at using centrality measurements for evaluating authors from different disciplines. The goal of this analysis is to reveal the disciplines that have the best-connected authors based on authors' relations. The other reason for this analysis is to have an indicator for our further discussions. Table 3 ranks CAN authors based on the values of betweenness centrality ( ) measurement and the frequency of collaboration with other authors. In the context of co-authorship networks, reveals how influential an author in a research community. It shows the number of times an author appears in the shortest paths of network pairs. The results show that the science discipline (SC) has 4 authors out of the top 10 best-connected authors. Furthermore, in addition to the frequency of collaborations, the position of an author in a research community is another important factor that can be used in assessing an author. For instance, the CAN network has authors with more than 122 published articles but their positions in the network do not make them influential. This means the positions of authors in a research community play a crucial role in the level of collaboration in the whole community. In section 3.1, we showed that the CAN network is scale-free and the concept of preferential attachment [21] is applicable. Therefore, increasing the level of collaboration among CAN authors can also be obtained when the authors tend to be connected and attached (collaborate) to best-connected authors within the network. This specific case leads to an increase in the number of triangles in the network (3 authors are connected and collaborated), which eventually increases the global clustering coefficient of the network. The concept of preferential attachment is based on the concepts of clustering coefficient and degree centrality. This means when an author has a strong tendency to collaborate and he/she has frequently collaborated with other authors, the probability of preferential attachment is also increased. The results show that the probability of preferential attachment in Science and Engineering disciplines is the highest compared to other disciplines (0.758 and 0.701 respectively), while all the other disciplines have less than 0.3.

CAN best connected authors
This section shows the current trends in collaboration among disciplines. The analysis of this section is important insofar as it investigates the integration of the disciplines that are considered in this work. Figure 5 depicts how CAN disciplines are connected and collaborated with each other in terms of coauthoring articles. In this figure, each discipline group is represented as a node, and the edge between the two disciplines is formed if there was a collaboration between them. Usually, the collaboration among disciplines is almost performed when inspiring theories from a discipline and involving them in another one. The figure also depicts the level of collaboration between each pair of disciplines, which is represented by the weight of edges The size of nodes reflects the actual size of the disciplines in terms of the number of co-authored articles. It is clear that some pairs of disciplines reflect a high level of collaboration such as the pairs (CO-EDU, BA-CO, EN-CO, and SC-AG). The integration among disciplines leads the value of to be 0.583, which is acceptable, and among CAN pairs equals 1.58, which is long compared to the size of this graph. Moreover, the highest value of was gained by the engineering discipline (EN), which reflects the strong tendency of the authors in this discipline to collaborate and contribute more to other disciplines. Practically, the integration of a particular field of research with other fields opens the horizon to the authors of both fields to come up with new contributions that will significantly improve the quality of research and share the knowledge.

The proposed approach
In the previous sections, we measured the level of collaboration among authors/disciplines. The way we followed was based on standard network measurements. In this section, we describe the proposed metric and how it is used for measuring scientific collaboration in disciplines.
In developing the proposed metric, we were inspired by the concept of social capital in sociology [20]- [22]. In social communities, social capital can be defined as the collectively shared values, cooperation, and reciprocity among individuals [20]. Social capital is not a measurement for network nodes, instead; it evaluates the relations among nodes [23]. Therefore, we propose to incorporate centrality measurements in developing a novel metric for measuring the global level of collaboration in academic communities. Furthermore, clustering coefficient measurement is an expression of the authors' tendency to collaborate with others. Therefore, we propose to incorporate it into the proposed metric. Since we are investigating scientific collaboration among academic communities, we also propose to use the term collaboration capital (CC). This term is used in business and economic contexts [24] referring to the collaborative processes that improve the outcomes. The value of CC for a particular author represents the collective centralities and the clustering values of that author. On the other hand, the value of CC for a community (discipline) represents the average values of all authors in a discipline (means CC). The selected measurements give our proposed metric the ability to deeply investigate the relations among authors as well as their positions in CAN structure. We believe this way is efficient in evaluating the collaboration capital in a research community. Moreover, our metric looks at the network from many different points of view because it combines the characteristics of many measurements (features) in one strong metric.
The value of for a particular author in a scientific community ( ) can be defined as: where is the clustering coefficient of author in a community . represents the collective value of the centrality measurements of a particular author and can be formalized as: where , , and are betweenness as shown in (5), closeness as shown in (6), and degree centralities for author respectively. The power ( ) in (9) reflects the frequency of collaborations of an author. This will contribute to increasing because the frequency of collaboration is considered as an important factor that should be given its actual role when measuring scientific collaboration.
According to the aforementioned description, the CC of a particular community or discipline represents the average collaboration capital values of the authors in that discipline as: In the experimental results, the value of of a discipline follows a power-law distribution (longtail distribution). All the values were normalized to be in the range of 0 and 1. The of each discipline is shown in Table 4. It can be observed that the CO and EN disciplines outperform the other disciplines in terms of collaboration capital as well as the strong ability of their authors to collaborate. On the other hand, SC discipline does not reflect a high performance of CC. Although four of the top 10 best-connected authors in the CAN network are from the SC discipline, the results show that its collaboration capital underperformed the aforementioned two disciplines. We believe this is due to First: the high level of variations in the clustering coefficient of authors, which is reasonable because the degree distribution of the CAN network follows a power-law. Second, the creation of links among network authors is controlled by the concept of preferential attachment as stated in [25].
According to the performance of the proposed metric (collaboration capital), we see that the results show a different perspective from what we have analyzed in the previous sections. For instance, the performance of the Science discipline is changed when it comes to collaboration capital. Therefore, using one measurement in evaluating scientific collaboration is insufficient and not reliable. This result confirms our second claim in this work. The evaluation should be performed using a metric that looks at the problem from many different angles aiming at having a clear view and coming up with an accurate and reliable assessment.

Recommendations
Based on the obtained results, we summarize our recommendations by the following:  The concepts of complex networks can be considered as powerful tools in understanding the collaboration patterns in co-authorship networks and this is due to the ability of each concept in analyzing network relations from a different perspective. Therefore, developers can benefit from these concepts when designing academic assessment tools.  The preferential attachment feature in coauthorship networks plays a significant role in improving research quality. Therefore, it is of benefit for the institutions to encourage this kind of feature within their academic settings.  The process of measuring the level of collaboration does not depend on a specific metric, it depends on the aspect we are investigating. For instance, one can measure the tendency of a researcher/community to collaborate with others; the clustering coefficient can be involved in this case. In the same context, when exploring the most influential authors in a community; betweenness centrality measurement can be used.  The number of authors in a discipline is not an important factor for strengthening the collaboration level. Instead, the number of research groups (communities) is an effective indicator that contributes to increasing the level of collaboration among authors. This feature also enriches the quality of research with colorful experiences and provides more trusted and flexible solutions for our life issues.  Improving the collaboration level is not only about the quantity of the co-authored articles, it is also about with whom the authors are connected (collaborated).  The concept of social capital in sociology is applicable in co-authorship networks and can be in the form of CC, which is very useful when it comes to scientific collaboration assessment.  According to the obtained results, there are some facts on the scientific collaboration in disciplines such that; the authors in Science and Agriculture disciplines reflected high performance in terms of coauthoring articles, the Engineering discipline authors have strong tendencies to collaborate, and Computer Science discipline outperforms the other disciplines in terms of collaboration capital.

CONCLUSION
In this work, we investigated and analyzed the collaboration patterns among authors from different disciplines considering 6 of them, namely; science, engineering, computing, education, administration and economics, and agriculture. We generated a co-authorship network called CAN network containing all the aforementioned disciplines (and subdisciplines). The dataset in this work was collected from Google Scholar including 3444 authors and 12,532 articles retrieved from many different repositories. We also proposed a novel metric to measure what we called CC. The analysis of this work showed that; accurate evaluating of coauthorship networks can be obtained when including more disciplines and involve different repositories in the dataset. The results also showed that measuring the scientific collaboration in a research community needs to adopt a metric that can capture most of the possible features in that community aiming at producing a more precise evaluation. In future work, we plan to investigate our coauthorship network using more measurements (e.g., Eigencentrality and bridging centrality). We also plan to develop a social-driven approach using some social theories (e.g., assortativity) in exploring research communities and predict future collaboration among disciplines.