Research Portal Denmark
About Data & Documentation

Technical Documentation - Publications

For Research Portal Denmark, a data pipeline and system architecture have been created that collect, enrich and link data from various sources. The portal also offers a range of services to the research portal’s end users.

 

The ‘High Level Architecture’ diagram above illustrates the four main processes involved:

  1. Harvesting and pre-processing: Data is gathered from commercial and local data providers and stored.
  2. Data Enhancement: The data is augmented and enhanced with additional or improved data elements, in collaboration between the main pipelines and the NORA team’s data analysts.
  3. Data Consolidation: The data is clustered to identify and link identical publication records from different providers.
  4. Dissemination: The data is made available through a web/search interface and analytical overviews.
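The four stages above can be sketched as a chain of functions. This is a purely illustrative toy, assuming hard-coded example records and invented field names rather than the portal's actual code:

```python
# Hypothetical sketch of the four pipeline stages chained together;
# function names, providers and record shapes are invented for illustration.

def harvest():
    # Stage 1: gather raw records from (here, hard-coded) providers.
    return [
        {"provider": "global_a", "doi": "10.1000/x", "title": "Paper X"},
        {"provider": "local_b", "doi": "10.1000/X ", "title": "Paper X"},
    ]

def enhance(records):
    # Stage 2: augment each record, e.g. with a normalised DOI.
    for r in records:
        r["doi_norm"] = r["doi"].lower().strip()
    return records

def consolidate(records):
    # Stage 3: cluster records that share a normalised DOI.
    clusters = {}
    for r in records:
        clusters.setdefault(r["doi_norm"], []).append(r)
    return clusters

def disseminate(clusters):
    # Stage 4: expose a simplified view per cluster.
    return [
        {"doi": doi, "n_records": len(recs), "title": recs[0]["title"]}
        for doi, recs in clusters.items()
    ]

result = disseminate(consolidate(enhance(harvest())))
```

Here the two provider records share a DOI (after normalisation) and end up as one consolidated entry.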

 

Under ‘In More Details’, each of these data pipelines and processes is elaborated further.

In More Details

Harvesting and pre-processing

There are two main pipelines that share some steps of the overall data pipeline: the Global pipeline and the Local pipeline. 

Global Data is stored in raw MongoDB collections that standardize the heterogeneous inputs into more comparable JSON structures, while preserving all the original fields and nested structures.
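The standardization step can be sketched as a mapping from each provider's own field layout onto a shared top-level structure, with the original document kept intact underneath. The provider names, field names and `raw` key below are assumptions for illustration, not the portal's actual schema:

```python
# Illustrative sketch of pre-processing heterogeneous provider records into
# more comparable JSON structures; field names are invented, not the real schema.

def standardize(record, provider):
    """Map a provider-specific record onto a shared top-level layout,
    keeping the original document (all fields, nested structures) under 'raw'."""
    mappings = {
        # assumed provider-specific field names
        "provider_a": {"title": "ArticleTitle", "doi": "DOI"},
        "provider_b": {"title": "title", "doi": "externalIds.DOI"},
    }
    fields = mappings[provider]

    def get(doc, path):
        # walk a dotted path through nested dicts
        for part in path.split("."):
            doc = doc.get(part, {})
        return doc or None

    return {
        "provider": provider,
        "title": get(record, fields["title"]),
        "doi": get(record, fields["doi"]),
        "raw": record,  # original structure preserved for later steps
    }

a = standardize({"ArticleTitle": "On X", "DOI": "10.1/x"}, "provider_a")
b = standardize({"title": "On X", "externalIds": {"DOI": "10.1/x"}}, "provider_b")
```

Both records now expose the same top-level keys, which is what makes downstream comparison across providers tractable.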

 

Local Data is harvested by querying all the local repositories for a full dataset in XML format, which is stored raw in an SQLite database. The data is then transformed into JSON format and also stored in MongoDB.
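The XML-to-JSON transformation step might look like the following minimal sketch. The element names are invented for illustration; real repository records carry far richer metadata:

```python
# Minimal sketch of the local-pipeline transformation: an XML record is parsed
# and converted into a JSON-serialisable dict. Element names are hypothetical.
import json
import xml.etree.ElementTree as ET

xml_record = """
<record>
  <title>Example publication</title>
  <doi>10.1000/example</doi>
  <author>Jensen, A.</author>
  <author>Nielsen, B.</author>
</record>
"""

root = ET.fromstring(xml_record)
doc = {
    "title": root.findtext("title"),
    "doi": root.findtext("doi"),
    "authors": [a.text for a in root.findall("author")],
}
as_json = json.dumps(doc)  # this JSON form is what would be stored in MongoDB
```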

 

As a rule, all data resides on a single MongoDB server, in one or more collections. The objective is to have a single source of truth that creates a common reference point and facilitates the technical management of the data flows going forward.

Data Enhancement

The next data process step, common to both pipelines, is name variant extraction and enhancement, which involves:

  • Extracting and counting the unique affiliation IDs and organisation name variants to feed into the affiliation mappings.
  • Extracting and matching DOIs within and between data providers and creating a new collection with the results.
  • Creating more uniform (parsed) data structures that feed processes further down the line.

This extraction is crucial for the main service of NORA Data Enhancement, also common to both pipelines. The tools used here are Neo4j and Google Sheets, with the latter storing the master mapping tables of organisations, countries and subject classifications.
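The extraction-and-counting idea can be sketched with a simple tally over the affiliation strings found in the harvested records. The records and field name below are invented for illustration:

```python
# Hypothetical sketch of name-variant extraction: collect affiliation strings
# from records and count occurrences of each unique variant, the kind of tally
# that feeds the affiliation mapping tables.
from collections import Counter

records = [
    {"affiliations": ["Aarhus University", "Aarhus Universitet"]},
    {"affiliations": ["Aarhus University"]},
    {"affiliations": ["University of Copenhagen"]},
]

variant_counts = Counter(
    name for rec in records for name in rec["affiliations"]
)
```

Note that “Aarhus University” and “Aarhus Universitet” are counted as distinct variants here; mapping such variants to one organisation is exactly what the mapping tables are for.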

 

When a name variant is extracted, it is compared against the mappings to see if there is a match, meaning that this variant has been identified and handled before. If it is already mapped, the number of its occurrences in all the records gets updated in the mapping sheet. If the variant is not already mapped, a new code is proposed in the Google Sheet for subsequent manual validation.
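The match-or-propose logic described above can be sketched as follows. A plain dict stands in for the Google Sheet, and the codes and counts are invented:

```python
# Sketch of the mapping check: a known variant gets its occurrence count
# refreshed; an unknown variant gets a proposed placeholder code, flagged as
# awaiting manual validation. The mapping dict stands in for the Google Sheet.

mapping = {
    "Aarhus University": {"code": "AU", "occurrences": 120, "validated": True},
}

def handle_variant(name, count):
    if name in mapping:
        # already mapped: just update the occurrence count in the sheet
        mapping[name]["occurrences"] = count
    else:
        # not mapped yet: propose a new code for subsequent manual validation
        mapping[name] = {
            "code": f"NEW-{len(mapping) + 1}",  # placeholder proposal
            "occurrences": count,
            "validated": False,
        }

handle_variant("Aarhus University", 125)   # known: count refreshed
handle_variant("Aarhus Universitet", 7)    # unknown: proposal created
```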

 

Details about the exact enhancement process and rules can be found here.

Data Consolidation

The next main data process is data consolidation among the different data providers, both global and local, and across them:

  • The clustering of the global data – identified by GOI (Global Object Identifier). GOIs are clusters of the same records across the three global providers and are created within Neo4j.
  • The clustering of local data – identified by LOI (Local Object Identifier). LOIs are clusters of the same records across all Danish data providers. These clusters are created in a different way due to the type and quality of the data. As the diagram depicts, the LOI clusters are nevertheless also imported into Neo4j to be used for the final step of the universal clustering (NOI).
  • The clustering and matching between global and local data – identified by NOI (NORA Object Identifier). NOIs are clusters of the same records across GOI and LOI clusters.
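The core clustering idea can be sketched with a small union-find structure: records that share a matching identifier (here, simply a DOI) are linked into one cluster and take a common cluster representative, analogous to a GOI/LOI/NOI identifier. The record IDs are invented, and the real matching rules are far richer than DOI equality:

```python
# Simplified union-find sketch of the clustering step; the actual algorithms
# and rules are documented separately. Record IDs and DOIs are invented.

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

records = [
    ("global_a:1", "10.1/a"),
    ("global_b:9", "10.1/a"),
    ("local:42", "10.1/b"),
]

# link records that share a DOI
by_doi = {}
for rec_id, doi in records:
    if doi in by_doi:
        union(rec_id, by_doi[doi])
    else:
        by_doi[doi] = rec_id
        find(rec_id)  # register the record

# group records by their cluster representative
clusters = {}
for rec_id, _ in records:
    clusters.setdefault(find(rec_id), []).append(rec_id)
```

The two global records sharing a DOI fall into one cluster; the local record forms its own.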

Detailed information about the algorithms and the rules used for clustering can be found here.

The information generated about the enhanced name variants and the clusters is stored back in MongoDB and, via the FastAPI, becomes available to Research Portal Denmark’s databases.

Dissemination

The various publication metadata is presented through various search interfaces, where users can perform simple and advanced text searches and use the advanced filters that are possible thanks to the enhanced and consolidated data.
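To make the connection between the enhanced data and the advanced filters concrete, the following sketches the kind of query body a search interface backed by an index like Elasticsearch might build: a full-text match combined with a filter on an enhanced field. The field names (`title`, `organisation_code`) are assumptions for illustration:

```python
# Illustrative sketch of a search query body combining free text with a filter
# on an enhanced field. Field names are hypothetical, not the portal's schema.

def build_query(text, org_code=None):
    query = {
        "bool": {
            "must": [{"match": {"title": text}}],
            "filter": [],
        }
    }
    if org_code:
        # filters like this are only possible because name variants have been
        # mapped to stable organisation codes during enhancement
        query["bool"]["filter"].append({"term": {"organisation_code": org_code}})
    return {"query": query}

body = build_query("machine learning", org_code="AU")
```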

 

The well-structured nature of the Global Data allows the use of VIVO for this purpose, an open-source semantic web tool for research discovery. There is one VIVO instance for every data provider. The clustered data (GOI clusters) is derived from these instances and presented in a further VIVO instance, ‘Across All Data’, where the user can search across data from all three global commercial data providers.

 

The search interface for Local Data is based on an Elasticsearch index, and special efforts are made to create a consolidated metadata presentation of the clustered records.

 

More information about the merging display rules can be found here.
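One simple way a consolidated presentation of a cluster could work is a field-precedence merge: for each display field, take the first non-empty value in provider-priority order. This is a hedged sketch only; the provider names and priority order are invented, and the actual display rules are documented separately:

```python
# Hypothetical field-precedence merge for a cluster of records: per field,
# take the first non-empty value in an assumed provider-priority order.

PRIORITY = ["provider_a", "provider_b", "local"]

def merge_cluster(records):
    merged = {}
    for field in ("title", "doi", "abstract"):
        for prov in PRIORITY:
            value = records.get(prov, {}).get(field)
            if value:  # skip missing providers and empty values
                merged[field] = value
                break
    return merged

cluster = {
    "provider_b": {"title": "On X", "doi": "10.1/x", "abstract": ""},
    "local": {"title": "On X (local)", "doi": "10.1/x", "abstract": "An abstract."},
}
merged = merge_cluster(cluster)
```

Here the title comes from the higher-priority provider, while the abstract falls through to the local record because the preferred source left it empty.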