A global database for conducting systematic reviews and meta-analyses in innovation and quality management


Database construction was carried out in three phases: (1) collecting data from the Web of Science (WoS), (2) cleaning and data mining (preprocessing) and (3) creating the citation network node and edge tables. Fig. 1 shows the database construction framework.

Fig. 1 Database construction framework.

Data collection

Data collection was carried out following the PRISMA methodology proposed in [20]. This method provides guidance to scholars conducting systematic literature reviews through four steps: (1) identification, (2) screening, (3) eligibility and (4) included. The PRISMA methodology was chosen as the framework for data collection and filtering because of the following advantages: it provides a comprehensive and transparent process; it is applicable in any research field; and it strongly supports the reproducibility of the review. Furthermore, a process flow diagram (as also contained in PRISMA) helps readers to better understand the overall process and the boundaries of the study and can improve the quality of the literature review [21]. The search was carried out separately for the two areas of interest, and the results were combined to provide the complete bibliometric dataset. Fig. 2 shows the data collection process.

Fig. 2 Data collection process.

A keyword search was applied for both topics on the WoS platform. The search was restricted to article titles to minimize the inclusion of nonrelevant papers that only mention terms related to innovation or quality management within the abstract. The entire time range was analyzed, from the first available year in WoS (1975) until the date of data collection, 22 September 2021. Regarding the language of the documents, only English was considered, as it is the lingua franca of scientific publication [22]. In the case of the innovation topic (left side of Fig. 2), 61,930 records were found after the keyword search in the specified time period. This number was reduced to 38,630 after applying the abovementioned filtering rules by excluding 23,570 papers. In the case of quality management (right side of Fig. 2), 36,390 papers were found by the keyword search, and this number was reduced during the screening phase to 20,517 by excluding 15,873 papers, as they were non-English documents or were of types other than articles. In this paper, the “Eligibility” step is not relevant because the number of resulting data records does not make it possible to manually read and evaluate all the screened papers. After applying the steps proposed by the PRISMA methodology, 59,231 screened papers were collected into the database.

Cleaning and data mining

The aim of this step was to extend the dataset with additional valuable variables such as the institute of the first author, country of the first author, ISO3 country code, COVID-19 content, geographical coordinates and topic indicator (innovation or quality management). The additional variables were derived as follows:

Institute of the first author (Institute)

The column “Affiliations” provided by WoS was used to extract the first author’s affiliation. These data were originally stored as a continuous string including all the authors’ names and affiliations. An additional problem was that authors from the same affiliation were handled as one entity within the string, and their affiliations did not follow the same format and structure in all cases. Due to this unstructured nature, text cleaning and text mining had to be used to extract the required information. To return the first author’s affiliation, first, regular expressions were used to remove the unnecessary substrings; second, abbreviated terms were replaced by their full forms (such as “University” instead of “Univ.” or “Department” instead of “Dept.”). Finally, the cleaned string was tokenized to separate the specific parts of the affiliation, such as the institute name, city, street address, and country. These steps were carried out using the Python programming language.
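The three cleaning steps described above (remove substrings, expand abbreviations, tokenize) can be sketched as follows. This is a minimal illustration, not the authors' code: the bracketed author-group layout of the input string, the semicolon separator between affiliations, the function name `first_author_affiliation` and the contents of `ABBREVIATIONS` are all assumptions made for the example.

```python
import re
from typing import List

# Hypothetical abbreviation table; the text names only "Univ." and "Dept." as examples.
ABBREVIATIONS = {
    "Univ": "University",
    "Dept": "Department",
}


def first_author_affiliation(raw: str) -> List[str]:
    """Return the tokenized affiliation of the first author group.

    Assumes a layout such as
    "[Author A; Author B] Univ. X, Dept. Y, City, Country; [Author C] ..."
    """
    # Step 1: remove the bracketed author names with a regular expression.
    without_authors = re.sub(r"\[[^\]]*\]", "", raw)
    # Keep only the first affiliation entry (entries assumed to be ';'-separated).
    first_entry = without_authors.split(";")[0]
    # Step 2: expand abbreviations, with or without a trailing period.
    for short, full in ABBREVIATIONS.items():
        first_entry = re.sub(rf"\b{short}\b\.?", full, first_entry)
    # Step 3: tokenize on commas and drop empty tokens.
    return [token.strip() for token in first_entry.split(",") if token.strip()]


example = "[Smith, J; Doe, A] Univ. Szeged, Dept. Econ, Szeged, Hungary; [Lee, B] MIT, Cambridge, USA"
tokens = first_author_affiliation(example)
```

Real WoS affiliation strings vary more than this example, which is the reason the text describes the data as unstructured; a production version would need additional patterns.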

Country-related columns (Country, ISO3)

Using the preprocessed column “Institute”, the country of the first author’s affiliation was extracted as part of the cleaned string. Not only were country names extracted, but they were also mapped to their ISO3 codes. Since several statistical programs and packages (such as R) identify countries based on ISO codes, this step allows straightforward identification in statistical software without further mapping effort by the researcher.
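A minimal sketch of the country extraction and ISO3 mapping is given below. It assumes that the country is the last token of the tokenized affiliation and uses a deliberately tiny, hand-written lookup table; the names `ISO3_BY_COUNTRY` and `country_and_iso3` are hypothetical, and a full table would cover all ISO 3166-1 alpha-3 codes.

```python
# Illustrative lookup only (assumption): a handful of country spellings and their ISO3 codes.
ISO3_BY_COUNTRY = {
    "hungary": "HUN",
    "germany": "DEU",
    "usa": "USA",
    "united states": "USA",
    "china": "CHN",
}


def country_and_iso3(affiliation_tokens):
    """Return (country, iso3) taken from the last token of a tokenized affiliation.

    The ISO3 code is None when the country is not in the lookup table.
    """
    if not affiliation_tokens:
        return None, None
    # Assumption: the country is the final comma-separated token.
    country = affiliation_tokens[-1].strip().rstrip(".")
    return country, ISO3_BY_COUNTRY.get(country.lower())


country, iso3 = country_and_iso3(["University Szeged", "Szeged", "Hungary"])
```

Returning `None` for unknown spellings keeps unmapped records visible so that they can be reviewed instead of being silently mislabeled.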

COVID-19 content (Covid19Content)

To highlight whether a paper was written in the context of COVID-19, a keyword search was used, and the value of the column was set to 1 if the title, the keywords or the abstract contained at least one of the following keywords: “COVID”, “coronavirus”, “pandemic”, or “SARSCoV2”. Otherwise, its value was set to zero.
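The flagging rule above can be expressed in a few lines. The sketch assumes case-insensitive substring matching, which the text does not specify, and matches the four keywords literally; the function name `covid19_content` is hypothetical.

```python
# The four keywords listed in the text, lowercased for case-insensitive matching (assumption).
COVID_KEYWORDS = ("covid", "coronavirus", "pandemic", "sarscov2")


def covid19_content(title, keywords, abstract):
    """Return 1 if any keyword occurs in the title, keywords or abstract, else 0."""
    # Missing fields (None or empty) are skipped before joining.
    text = " ".join(part for part in (title, keywords, abstract) if part).lower()
    return int(any(keyword in text for keyword in COVID_KEYWORDS))


flag = covid19_content("Innovation during the COVID-19 crisis", "resilience", None)
```

Because the match is literal, a spelling such as "SARS-CoV-2" would be caught only if hyphens were removed beforehand; whether that was done is not stated in the text.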

Geographical coordinates (Lat, Lon)

Latitude and longitude values related to the first author’s affiliation were retrieved by geocoding using the GeoPy Python package. Geocoding was carried out using the extracted and tokenized column “Institute” as input values.
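A hedged sketch of the geocoding step follows. The text names only the GeoPy package, so the choice of the Nominatim service, the one-second rate limit, the `user_agent` value and both function names are assumptions. The lookup function accepts any geocoder callable, which lets the logic be tried without network access.

```python
def geocode_institute(institute, geocode):
    """Return (lat, lon) for an institute string, or (None, None) if it cannot be resolved.

    `geocode` is any callable that takes a query string and returns an object
    with `latitude` and `longitude` attributes, or None when nothing is found.
    """
    if not institute:
        return None, None
    location = geocode(institute)
    if location is None:
        return None, None
    return location.latitude, location.longitude


def make_nominatim_geocoder(user_agent="example-bibliometrics"):
    """Build a rate-limited GeoPy geocoder (assumes Nominatim; needs network access)."""
    # Imported lazily so the lookup function above works without GeoPy installed.
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # Pause at least one second between requests to respect the public service.
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)
```

With GeoPy installed, a call could look like `geocode_institute("University of Szeged, Szeged, Hungary", make_nominatim_geocoder())`; the returned pair would fill the Lat and Lon columns.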

Search category indicator (InnovQMCateg)

This column was manually added when combining the results from both searches as described in Fig. 2. If a specific paper was collected only by the innovation-related search, its value was set to “innovation”, and the category name “quality” indicates that the paper can be found only in the quality management search results. Finally, the intersection was denoted by the category name “both”. In this case, duplicated records were removed from the data table.
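The labeling rule can be illustrated with plain dictionaries. The text states that the column was added manually, so the following is only a sketch of the same rule; it assumes each record carries a unique identifier (for example a DOI) used as the dictionary key, and the function name `combine_searches` is hypothetical.

```python
def combine_searches(innovation_records, quality_records):
    """Merge two {record_id: record} dicts and tag each record with InnovQMCateg.

    Records found by both searches appear once, labeled "both".
    """
    combined = {}
    # The union of the keys ensures that each paper is kept exactly once.
    for record_id in innovation_records.keys() | quality_records.keys():
        in_innovation = record_id in innovation_records
        in_quality = record_id in quality_records
        if in_innovation and in_quality:
            category = "both"
        elif in_innovation:
            category = "innovation"
        else:
            category = "quality"
        source = innovation_records if in_innovation else quality_records
        record = dict(source[record_id])  # copy so the inputs stay unchanged
        record["InnovQMCateg"] = category
        combined[record_id] = record
    return combined


innovation = {"A": {"title": "Open innovation"}, "B": {"title": "Quality and innovation"}}
quality = {"B": {"title": "Quality and innovation"}, "C": {"title": "Total quality management"}}
merged = combine_searches(innovation, quality)
```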

Citation network construction

The node and edge tables were generated using the collected and further processed dataset from WoS. In the core dataset, the cited articles were stored in a single column in string format. To construct the edge list format from the string-type input variable, RegEx (regular expression) commands were applied to find all the DOI numbers appearing within the long text. After extracting the cited DOI numbers, a list format was constructed. The edge list construction process can be described as follows:

  1. Select paper i.

  2. Extract all DOI numbers from the string of cited references using regular expressions.

  3. For all the extracted DOI numbers: add DOI_i – DOI_j pairs to the edge list (where j is the jth element of the extracted cited DOIs for paper i).

  4. Iterate steps 1–3 through all the papers (DOIs) within the core dataset.
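The four steps above can be sketched with the standard `re` module. The DOI pattern, the layout of the cited-reference string in the example and the function name `build_edge_list` are assumptions made for illustration; the authors' exact regular expression is not given in the text.

```python
import re

# Assumed DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix
# that runs until whitespace or a separator character.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s;,\]]+")


def build_edge_list(papers):
    """Build (citing DOI, cited DOI) pairs.

    `papers` is an iterable of (doi, cited_references_string) tuples.
    """
    edges = []
    for doi_i, cited_references in papers:  # steps 1 and 4: visit each paper i
        # Step 2: extract every DOI from the cited-reference string.
        for doi_j in DOI_PATTERN.findall(cited_references or ""):
            # Step 3: add the DOI_i - DOI_j pair to the edge list.
            edges.append((doi_i, doi_j.rstrip(".")))
    return edges


papers = [
    ("10.1000/a1", "Smith J, 2010, J INNOV, V1, P1, DOI 10.1000/b2; Lee K, 2015, QUAL J, V3, P9, DOI 10.2000/c3"),
    ("10.1000/b2", "Doe A, 2001, BOOK TITLE"),
]
edges = build_edge_list(papers)
```

Cited references without a DOI, such as the second paper's only reference in the example, produce no edge, so the resulting network covers only DOI-identified citations.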

The construction was carried out in the Python programming language using the re, NLTK, NumPy and pandas packages.

