Recap of OpenUP’s Pilot Studies

From May 2017 to August 2018, the OpenUP project carried out seven pilot studies involving various research communities from the Arts and Humanities, Social Sciences, Life Sciences and Energy areas.

The seven OpenUP pilot studies were successful and collected valuable input from the communities involved. Key stakeholders involved in the pilots were researchers, publishers, data providers, institutions, research projects, and general public stakeholders.

All pilots run under OpenUP contributed to raising awareness and increasing skills related to the tested open science approaches among the involved communities. We were also able to generate lessons learned and an evidence base on various aspects of the tested approaches, in particular their applicability to specific fields and contexts, and their impact on distinct research communities and researchers (considering gender, career stage, country, and ethnicity).

With this post we want to give you an overview of the key findings and lessons learned from each pilot. The final evaluation report is available on the OpenUP website.

Pilot 1: Open Peer Review for Conferences

In this pilot we tested the practicability and impact of Open Peer Review at conferences. The first venue for Pilot 1 was the Second European Machine Vision Forum 2017 (EMVA 2017) and the second venue was the eHealth2018 Master Student Competition. The conference organisers agreed to test the four OPR principles “Open Identity”, “Open Participation”, “Open Report” and “Open final-version comments” for papers submitted through the submission system.

The most important changes to the traditional workflow of paper submission, reviewing and voting included:

  • All participants can see all submissions after the submission deadline and can discuss the submission with the authors
  • Instead of strict reviews, only shorter comments of about a paragraph are used to summarize the individual opinion and suggestions for improvements
  • All identities connected to comments and submissions are visible to all participants
  • For the final voting, all participants have four votes; project committee members get 10 votes
  • The final result, based on the sum of all votes from all members, is accepted as the final ruling
  • All discussions can continue in the CMS interface after the conference has finished
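The voting rule in the list above can be illustrated with a minimal Python sketch. It assumes details the pilot description does not state: that a voter may place several votes on one submission, that unused votes are simply dropped, and that a ballot exceeding the voter's budget is rejected. The names `tally_votes`, `ballots` and `committee` are hypothetical and are not taken from the HotCRP-based system used in the pilot.

```python
from collections import Counter

# Vote budgets as described in the text; everything else is an assumption.
PARTICIPANT_VOTES = 4
COMMITTEE_VOTES = 10


def tally_votes(ballots, committee):
    """Sum all votes per submission.

    ballots:   dict mapping voter name -> list of submission ids,
               one list entry per vote cast (repeats allowed by assumption).
    committee: set of voter names who are committee members.
    """
    totals = Counter()
    for voter, choices in ballots.items():
        budget = COMMITTEE_VOTES if voter in committee else PARTICIPANT_VOTES
        if len(choices) > budget:
            raise ValueError(f"{voter} cast {len(choices)} votes, budget is {budget}")
        totals.update(choices)
    return totals


# Hypothetical example data.
ballots = {
    "alice": ["paper-1", "paper-2", "paper-2"],  # ordinary participant, 3 of 4 votes used
    "bob": ["paper-1"] * 6 + ["paper-3"] * 4,    # committee member, all 10 votes used
}
totals = tally_votes(ballots, committee={"bob"})
ranking = totals.most_common()  # highest vote sum first
print(ranking)
```

Because the ruling is a plain sum, a single committee member carries as much weight as two and a half ordinary participants, which is worth keeping in mind when choosing the vote budgets.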

For the eHealth2018 student competition, we again tested three OPR principles “Open Identity”, “Open Participation”, and “Open Report” but with a different setup:

  • After the submission deadline, all submissions stayed hidden. However, each participant of the challenge had to write two ‘lay-man’s reviews’ (“Open Participation”). These initially double-blind reviews were augmented by traditional ‘expert reviews’ done by externally assigned reviewers.
  • The rebuttal phase allowed each participant to withdraw their submission based on all the reviews. In that case the contribution remained hidden and everyone involved (reviewers and authors) stayed anonymous. All participants still in the race at the end of the rebuttal phase moved to an “Open Identity” status: all submissions, the reviews (“Open Report”), and the reviewers’ and authors’ names became visible to all conference visitors.
  • The program committee used all available reviews (lay-man and expert) to decide on the final winner of the competition.

To support the specific mix of new OPR features needed we had to create and adapt our own CMS solution based on the popular HotCRP. The resulting source code has been released to the public under an open source license at the HotCRP GitHub repository linked above.

Feedback from the researchers involved in the OPR process at the EMVA conference and eHealth 2018 student competition was positive. Overall, the participants expressed a strong acceptance of the proposed OPR process and would support it again. The participants’ greatest fears associated with OPR included: biased/whitewashed reviews due to non-anonymity; backlash for bad reviewing (e.g. over other channels/private email); and added effort and risk for reviews outside one’s own expertise (lay-man reviews). Also, the conference organisers of the EMVA are willing to continue applying the OPR approach for the next conference.

Pilot 2: Open Peer Review for Research Data in Social Sciences

OpenUP’s Pilot 2 has investigated the applicability of (open) peer review to research data in the scientific community of the Human Mortality Database (HMD). Similarly to peer review of publications, data peer review is a quality assessment process of a dataset performed by experts in the field. Data quality assessment is a complex process that has to consider the different phases of the data lifecycle, starting from the development of a data management plan (DMP) at the initial stage of a scientific project to the publication of its results.

The methodology adopted for this pilot study comprised two parallel activities. The first was a set of interviews with HMD managers and country specialists, who are responsible for validating the data coming from the contributing countries. These interviews provided us with important information on the procedures used to assess the quality of data in the pre-publishing phase, as well as on the strong and weak points connected with data sharing. The interviews explored the origin, motivations and organisational features of HMD, the goal and main features of the database, its data quality assessment process, and the interviewees’ opinions on Open Access to data.

The second activity, carried out in collaboration with HMD management, was the development of a questionnaire and its submission to HMD users. The survey focused on HMD users’ practices and attitudes in data access and use, which can be considered a proxy indicator of the post-publishing appreciation of the quality of the database. In particular, we asked HMD users how often and for how long they have used HMD data; which geographic areas they are most interested in; how and which types of dataset they download; how they use the datasets; and how they perceive HMD compared to other sources of information.

Some important indications emerged from the analysis of the interviews that can drive the adoption of data quality assessment, and hence peer review, as well as some principles that can incentivize other scientific communities to share their research data. As stated by the HMD interviewees, the guiding principles for creating an open access database were comparability, flexibility, accessibility and reproducibility. Comparability was reached by using a uniform, scientific methodology to calculate the various statistics of the 39 countries included in the database. Flexibility was achieved by using a uniform set of procedures for each population in the analysis of results, while giving significant attention to each population’s history and socio-political development. This is also reflected in the available formats of output data series. It is made possible by the experience and knowledge of the country specialists, who are in charge of collecting data from a specific number of countries, interact with statistical offices, check data consistency, and provide population statistics together with a country report that explains the specificity and motivation of the analysis. Accessibility was guaranteed from the beginning by free-of-charge access to the data, as well as by the provision of data in an open, non-proprietary format. Reproducibility is provided by the reconstruction of the data lifecycle, which includes the availability of raw data, the method applied, the related results, and the explanatory documentation.

One of the main successful features of HMD is its transparent way of managing and sharing data, which has two central phases of data validation. The first is carried out by the country specialists (CSs), who analyse the raw data according to a common predefined checklist that verifies the consistency and plausibility of the data. The second is carried out collaboratively within the HMD team, which validates the statistics before their publication each time the database is updated.

Another successful component of HMD was its collaborative approach, which is based on a strong scientific interest in the field as well as on trust within the involved community, which only recently formally signed a Memorandum of Understanding.

The interviews also confirmed some concerns already mentioned in other surveys. Interviewees stressed the importance of a strong commitment by the organization to supporting the development of data infrastructures. This concerns several aspects: long-term financial support (beyond the project duration), a policy endorsement of open data, and formal recognition of scientists for their efforts in data curation and quality assurance.

The survey results confirm HMD’s main strengths from the users’ perspective, in particular the accurate and well-documented data quality assessment, which makes the process transparent and facilitates the reproducibility of the analysis. Users do not point out evident weak points; rather, they suggest improvements, mainly tools that facilitate the import of data into statistical packages. This may also be related to the simple style of the interface, where some links could be better highlighted. One user’s comment summarises this aspect well: “the format of the website could be more aesthetically appealing, but as it is the site is very functional and suits the needs of the users”. Moreover, the variety of user profiles, which span research as well as the private sector and reflect different needs, indicates the importance of data sharing and reinforces Open Science principles. From an OPR perspective, a straightforward transposition of the procedures adopted for scientific journals seems hard to apply. However, some traits of OPR, such as transparency in the quality assessment process, are features that should be promoted at a larger scale for open data. The same applies to open participation, which in the case of open data implies a more common use of data citations by end users as well as the implementation of additional tools to track data re-use. Further research is needed to explore practices of data sharing and management, not only in the Social Sciences, in order to take the necessary steps to support and improve high-quality data sharing.

Pilot 3: A data journal for the Arts and Humanities

Pilot 3 had a dual purpose: on the one hand, it described the data sharing practices within Humanities research, and on the other hand, it evaluated how quality assessment and (open) peer review can be applied to research data within this field of study. Based on existing e-Infrastructures and the practices of Humanities research groups, the pilot analysed and demonstrated the feasibility of a basic workflow that combines the publication of data with commenting and reviewing systems. The research setting was provided by DARIAH-EU and DARIAH-DE, their extended network of Humanities research groups, and the research groups related to the Campus Labor at the University of Göttingen. The study builds on desk research based on reports and surveys carried out by DARIAH projects. Other inputs for the study were provided by workshop results: 1. the OpenUP workshop on open peer review “Open Peer Review hands on: alternative methods of evaluation in scholarly publishing” at the DARIAH annual event on 23 May 2018 in Paris, France, and 2. the FOSTER/OpenUP joint training day on open peer review on 20 June 2018 in Göttingen, Germany.

Researchers encounter numerous barriers in data exchange processes. One of the main obstacles is the generally closed world of scientific discourse in the Humanities. Much of the work in European Humanities research is not visible. The dispersed research communities often fail to connect to one another because of language barriers. Humanities scholars very often publish in their national languages, and the trend is to continue doing so in the future. Europe lacks an integrated database of published journals in various national languages. A database of this kind could be a sort of 'who's who' within a particular field of research.

Another barrier relates to the actual research data. Due to the lack of standards and common guidelines in data management, it is very difficult to connect data. There are initiatives at EU level that work toward a more unified Humanities research landscape. CLARIN, the Common Language Resources and Technology Infrastructure, is focused on integrating language data across Europe. DARIAH, the Digital Research Infrastructure for the Arts and the Humanities, is more focused on increasing the visibility at the European level of national research related to cultural heritage, digital arts, etc. These two projects provide a positive direction of development in this field. Both try to fill in the gaps where no data exists and to connect data where it does exist but lives a life of its own in an unconnected place.

Within the context of this pilot, we have examined projects that can serve as best practices for publishing data and building an infrastructure for advancing data sharing. There are initiatives focusing on widening the access to data through the development of digital archives that are reusable in an open access framework. Since the EC supports research infrastructure (RI) developments in the Humanities with special attention to the field of Digital Humanities, there are several projects, such as DARIAH ERIC, DIGILAB, KPLEX projects, that have the agenda of creating RIs, including the development of networks of facilities and resources and services offered to research communities to support their work.

The development of a data journal framework involves a description of the communication flow and a breakdown of this process into single steps. The data journal framework should include the following attributes:

  • the assignment of persistent identifiers (PIDs) to datasets,
  • peer review of data
  • metadata information and technical check
  • links to related outputs (journal articles)
  • facilitation of data citation
  • standards compliance
  • discoverability (indexing of the data).
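The attribute list above can be read as a completeness checklist for each dataset submitted to a data journal. The following Python sketch shows one possible way to represent it. The class `DataPaperRecord`, its field names and the example identifier are hypothetical illustrations, not part of any DARIAH or OpenUP system.

```python
from dataclasses import dataclass, field


@dataclass
class DataPaperRecord:
    """Hypothetical record for one dataset submitted to a data journal."""
    pid: str = ""                                         # persistent identifier
    peer_reviewed: bool = False                           # data peer review completed
    metadata_checked: bool = False                        # metadata and technical check passed
    related_outputs: list = field(default_factory=list)   # e.g. linked journal articles
    citation: str = ""                                    # recommended data citation
    standards: list = field(default_factory=list)         # standards the dataset complies with
    indexed: bool = False                                 # discoverable through an index


def missing_attributes(record):
    """Return the names of framework attributes the record does not yet satisfy."""
    checks = {
        "pid": bool(record.pid),
        "peer_reviewed": record.peer_reviewed,
        "metadata_checked": record.metadata_checked,
        "related_outputs": bool(record.related_outputs),
        "citation": bool(record.citation),
        "standards": bool(record.standards),
        "indexed": record.indexed,
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical example: a reviewed dataset that still lacks several attributes.
record = DataPaperRecord(
    pid="doi:10.0000/example",  # placeholder identifier, not a real DOI
    peer_reviewed=True,
    citation="Example Author (2018). Example dataset.",
)
print(missing_attributes(record))
```

A checklist of this kind makes each step of the communication flow explicit, so that a submission can be tracked from deposit to indexing.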

During the small group discussion section of the workshop, the topic of data sharing and data availability was examined in more depth. Participants were given a poster on which they could record examples of good practice, barriers and challenges to implementation, and (based on these barriers) what actions should be taken, and by whom. The groups rotated so that each group moved on to evaluate and validate the findings of the other groups. They had the option of adding any points they felt were not covered by the previous group. We received valuable input from researchers and publishers.

Participants listed the following challenges and barriers that hinder the uptake of data sharing and data publishing:

  • some disciplines are less willing to share materials than others,
  • unclear intellectual property rules and licensing,
  • data ownership issues,
  • technical aspects of linking research outputs,
  • lack of incentives to do the extra work (reformatting, anonymizing, making datasets platform ready).

The actions needed to solve these issues recommended by the workshop participants are the following:

  • raising awareness of licensing options, data ownership issues, intellectual property issues,
  • developing and implementing data documentation processes,
  • including steps on data curation in the regular research workflow.

Current practices demonstrate the lack of standardized workflows for data curation, sharing and publishing. Humanities data management practices at the University of Göttingen present a varied picture, with varying degrees of openness in archiving and sharing data within the research groups and with external researchers. Humanities projects and departments could take advantage of the institutional repository infrastructure or the developing DARIAH data repository services, where standardized data templates, workflows and added quality assurance tools could provide a more consistent view on data publishing across the different Humanities disciplines. Implementing standards and guidelines for managing research data would support a more common view on data sharing and data availability within Humanities projects. In many cases the tools for data publishing already exist (e.g. psycholinguists use a data analysis platform that allows a description of the data set, a data paper, to be published at the push of a button); however, awareness of the benefits and value of sharing research data is not yet part of the research flow. Humanities data publishing will become more prominent as researchers’ awareness of data management and data discoverability issues increases.

Pilot 4: Transferring the research lifecycle to the web (Open Online Research)

In this pilot, we addressed the question of whether data analysis and data collection in qualitative research can be transferred to open online groups, in which potentially both academics and non-academics can participate. A special focus lay on mechanisms to reach out to and engage citizens in qualitative research processes. Our aim was to transfer the data analysis and data collection parts of the research lifecycle to the web. To this end, we further developed dedicated software called OpenOnlineResearch (OOR). In particular, we tested the applicability of this online solution for involving citizens in qualitative research. The goal was to gain further insight into working practices and to address current challenges and gaps in open online collaboration approaches applied to qualitative research.

OOR builds on prototypes developed earlier at the University of Amsterdam. At that point, the prototypes were geared to academic participants who already had some experience with similar tools. In OpenUP we worked to make the software easier to use, which allowed us to involve citizens who were not experienced in research or in using the tool. We performed three tests with team members and non-team members (uninitiated and untrained students from different disciplines plus non-academic and academic members of our network). In 2018, following further improvement of the tool, we ran a software test with the extended team. In this test we assessed the workings of the functionalities, the specific needs for (a minimum of) instructions, and the CMS and output capacities so far. Five users went from log-in to interpretation to stacking repeatedly. All steps were analysed, and potential improvements were discussed. The test revealed no major flaws.

In September we interactively demonstrated and tested the tool at the final OpenUP conference. We invited conference participants to use the interpretation features of OOR, observed their use of it, and asked them about usability, the instructions and the meaning of the interpretations given. Six participants went through the software and performed the main operations. It turned out that users understand the flow and meaning of the software swiftly and are able to produce meaningful content. It also turned out that we had omitted a small number of instructions. The test also prompted participants to think about the applicability of the tool. For example, one participant suggested trying the tool for peer review.

At the end of 2018 we will perform a further test in Zooniverse. The test will focus on scaling up and the pitfalls we encounter when doing so.

Pilot 4 demonstrated that open online interpretation of qualitative data is feasible and that yet unused parts of the research cycle can be opened to wider ranges of collaborators both within and outside academia. The results of the testing confirmed that the Open Online Research (OOR) tool enables online collaborative interpretation. We learned that a simple tool, without the need for detailed instruction, is feasible. We also learned that the input of scientists is still needed for the formulation of sound research questions and instructions. We have also seen that online collaboration needs moderation (either technically or by humans) to settle differences. However, we found that conflicts were rare and that participants were willing to collaborate in most cases.

The outcomes of this pilot are feeding into future developments in open science in two ways. First, the collaboration with Zooniverse is continuing and might lead either to an integration of the OOR methodology in Zooniverse or to a strengthening of OOR. Second, continuity is safeguarded by new funding allocated by the University of Amsterdam (UvA). Based on the progress made within OpenUP, the UvA was willing to invest in the development of a service package for OOR. While OOR is designed to be open and freely available, certain users might have more elaborate needs when using the tool. The funding is being used to assess the need for such services. In short, OOR is evolving beyond OpenUP.

Pilot 5: Addressing & reaching businesses and the public with research output

The goal of the fifth OpenUP pilot study was to analyse and test how disseminated research results can be made more interesting, appealing, and usable for target audiences beyond the research community. In this pilot we particularly addressed dissemination to businesses and the general public.

In a first step, we interviewed seven science communication experts to define the requirements and expectations of these target audiences. In addition, we consulted one of the community contacts of the previously involved SmarterTogether project, who was responsible for project communication. Based upon the feedback gathered, we created guidelines and recommendations for researchers who want to communicate their research to target audiences beyond academia.

In a second step these guidelines were tested by a research project in the Energy research area (ReFlex, a European smart grids project). Based on the provided recommendations and guidelines, the research community re-shaped and evaluated their dissemination strategy and produced targeted dissemination content tailored to its stakeholders. Feedback was collected in an informal discussion and an interview with the involved dissemination team.

The feedback from the ReFlex project gave us very valuable input to improve the guidelines. In particular we added one additional step regarding monitoring and implementing the dissemination strategy during the project runtime. The final version of the guidelines is available on the OpenUP Hub.

Part of the Pilot 5 evaluation consisted of a quantitative analysis of the impact metrics achieved by the project’s Twitter channel and a qualitative analysis of the target groups reached. The goal was to explore whether Altmetrics can be used as a meaningful indicator for assessing impact in specific stakeholder groups. In particular, we wanted to test whether Altmetrics can yield additional information about the target groups reached, in order to determine whether the alternative dissemination methodology helped make the research outputs more interesting, appealing, and re-usable. The qualitative analysis of the target groups reached was done by manually looking at the profile picture, the short text (incl. hashtags) in the Twitter profiles, and the tweet history of the individual accounts from which the re-tweets and likes were made.

Conclusions from this analysis: By looking at the Twitter profiles and the tweets of individual accounts, it is not always evident to which target group the reached individual belongs. For instance, even if the accounts clearly included references to interests in research topics in the profile description text, the tweets and re-tweets from their Twitter history did also refer to other topics such as politics. Accounts of individuals can be used very personally, professionally or both. This makes it difficult to draw conclusions about the stakeholder or target group an individual belongs to.

Fake accounts (in fact, one of the individual accounts was marked by Twitter as restricted due to suspicious activities) and private accounts without a public profile pose an additional difficulty. Determining the audience reached depends heavily on contextual information. If this information is not provided or is restricted, it is not possible to make any deductions about the target groups reached.

In summary, the guidelines have proven useful for shaping and defining a communication strategy for a research project targeting these two large audiences. However, they do not give enough information and guidance for composing the final communication message as such. What could be re-evaluated and expanded is the chosen terminology and the defined scope and target groups (e.g. to include trans-disciplinary questions or guidance for addressing further target groups). For future research it would be relevant to explore other ways to structure the guidelines and their content, in order to provide additional guidance on the points where our guidelines fail to provide substantial support.

Our pilot did not provide enough evidence that impact on the targeted audiences can be measured by analysing likes and re-tweets by Twitter users. Our results suggest that it is not straightforward to draw conclusions about the kind of target group Twitter users belong to. It would, however, be interesting to analyse this further with a larger dataset.

Pilot 6: Reflexivity of metrics on medical research and dissemination practices

The goal of this pilot study was to explore how biomedical research communities deal with opening up their research enterprise and how reflexive engagement with research practices at their facilities might help to develop metrics and incentives for research organization. Our guiding stance towards our partners throughout this activity was to focus on what this biomedical community needs in order to reach its targets in the realm of open science.

For this pilot, we cooperated with the Berlin Institute of Health (BIH). The BIH seeks to develop new practices, processes and guidelines that help to create bridges between what is called laboratory research on the one hand and patient-oriented research on the other. To achieve this target, the BIH wants to spur Open Science practices at both facilities by providing new funding instruments and better infrastructures.

In order to deal with the topic of Open Data in biomedicine, we analysed debates in editorials of major biomedical journals (N=144). Based on this analysis and the discussions we had with members of the BIH, we found that there may be different cultures of dealing with data in the biomedical sciences. Clinical, pre-clinical, and lab-oriented research have developed different practices and stances towards handling, governing and acknowledging research. We therefore agreed to focus on data use and data stewardship at different stages of the biomedical enterprise (clinical and pre-clinical research). On the basis of field work and exploratory analysis, we aimed to provide recommendations for specific metrics appropriate to the needs of our community. Our goal was therefore to identify barriers, enablers and constraints for Open Data in the biomedical research field. Based on this, we developed four criteria for the evaluation of our pilot.

  1. The extent to which we identified problems related to the provision of open data in biomedicine
  2. The extent to which we identified community needs for the use of open data
  3. The extent to which we found field-specific practices related to open data
  4. The extent to which we provided input on how to govern and incentivize the use and provision of open data in a biomedical research facility

In order to meet this goal, we developed a strategy with three elements. First, we assisted the process of monitoring the Open Data publication output in order to gain an overview of publication practices in the field and to select cases for exploratory case studies. Second, we aimed at constructing a search strategy that identifies the different ways in which biomedical researchers mention or link open data in publications. Third, we aimed at conducting field studies at BIH research facilities, both at the Charité and the MDC, in order to explore current field-specific data practices and potential institutional or social barriers to open data. In summary, our research design and focus allowed us to carry out targeted activities that respond to the needs of the community and that put us in a position to derive balanced and field-specific recommendations on how novel forms of metrics can be established.

The preliminary results of the interviews confirmed the aforementioned different data cultures: there are in fact very different cultures in biomedicine regarding the handling and provision of data. We found rather different accounts of problems, but also different solutions to the problem of disseminating data in biomedicine. The results of the interviews will be collected and discussed with the partners at BIH. Our main aim is to carve out the differences between the fields, e.g. clinical research on the one hand and laboratory research (molecular biology) on the other. With regard to our questions, our main goal is to explore field-specific problem perceptions and research practices that may hinder the provision of research data. In addition, we identified arguments from persons who take more critical stances. The main argument, which was made repeatedly, is that, particularly in clinical research, the provision of data should not be promoted freely; instead, regulation should be clearer on who should share the data, for what purposes, and to what extent the data should be provided and shared with other researchers, in order to prevent abuse. Currently, these problems only appear in clinical research, but they may also become relevant for laboratory research as it becomes more strongly oriented towards the personalization of therapies.

Our pilot study has shown that Open Data is a relevant task for members of the biomedical research community. The results show that to answer these questions, one really has to engage with the members of the community. These are our main findings:

  • The pilot has shown that providing incentives for Open Data provision is difficult. There are different regulations and guidelines which need to be addressed.
  • Incentives for providing Open Data need to be field specific. Even in the biomedical realm, there is an enormous variety of different data cultures, that is, different stances, ways of reflecting, handling and valuing data which make a unified framework difficult. Thus, metrics for incentives should be field specific and reflect the respective epistemic practices.
  • There is a lack of regulation regarding the governance of data usage. Particularly in the realm of clinical research, there is a need for a governance framework specifying who can access, use and alter the data, for what purpose, and at what point in time.
  • To a certain degree, there is still a lack of institutional and organizational support, such as guidance and technical advice in Open Data principles, technologies, and practices. Mentorships, technical or infrastructural advice might benefit the provision of Open Data.

Pilot 7: Piratical demand as a form of impact indicator and reaching unexpected audiences

In this pilot study we conducted a quantitative, statistical and econometric analysis of large-scale datasets on the supply of and demand for scholarly works on illegal platforms such as Sci-Hub and Library Genesis. The study provided insights into the underground circulation of scholarly books. Using datasets provided to us in 2012 and in 2015 by one of the administrators of a prominent shadow library, we mapped both the supply of and the demand for academic monographs, textbooks and other learning material via piratical shadow libraries. Our primary findings suggest that scholarly book piracy is a ubiquitous global phenomenon, with no apparent end in sight. If that is indeed true, we must ask what the consequences might be for the status quo in scholarly publishing.

The study is based on a number of data sources to analyse the supply of and demand for pirated scholarly publications. The analysis of supply is based on the catalogue of Library Genesis. The shadow library publishes its catalogue with basic bibliographic metadata as daily database dumps. Usage data is based on two sets of access log data provided to us by the administrators of one of the mirror services that distribute the titles in Library Genesis. The datasets were detailed enough to link the download of catalogue items to geographic locations. To conduct extra research into legal availability we occasionally queried other data sources for price and legal availability, and WorldCat for library availability.

In 2012 the Library Genesis catalogue contained 836,479 records. Three years later, in 2015, the catalogue had grown by more than half, to 1,317,424 records, and at the time of writing in 2018, Library Genesis hosts more than 2,237,940 documents, almost all scholarly publications. In addition, there is an extensive collection of literary works, comics, and of course the 100 million journal articles archived through Sci-Hub.
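To put the catalogue figures above into perspective, the following minimal Python sketch computes the overall growth factor and the implied average annual growth rate between the three snapshots. It uses only the record counts quoted in the text; the function name growth is our own, and the 2018 value is a lower bound ("more than"), so the rates involving 2018 are lower bounds as well.

```python
# Catalogue sizes quoted in the text (number of records in Library Genesis).
# The 2018 figure is reported as "more than", so treat it as a lower bound.
catalogue = {2012: 836_479, 2015: 1_317_424, 2018: 2_237_940}


def growth(start_year, end_year, sizes=catalogue):
    """Return (growth factor, average annual growth rate) between two years."""
    ratio = sizes[end_year] / sizes[start_year]
    years = end_year - start_year
    annual_rate = ratio ** (1 / years) - 1  # compound annual growth rate
    return ratio, annual_rate


if __name__ == "__main__":
    for start, end in [(2012, 2015), (2015, 2018), (2012, 2018)]:
        ratio, annual_rate = growth(start, end)
        print(f"{start}-{end}: x{ratio:.2f} overall, {annual_rate:.1%} per year")
```

On these numbers the catalogue grew by a factor of roughly 1.6 between 2012 and 2015, and by a somewhat larger factor between 2015 and 2018.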

Post-Soviet republics, which in 2012 were heavy users of this particular mirror, seem to have migrated to other services, and the traffic from these countries declined. Countries and regions that account for the bulk of the usage (US, India, China, Europe) show average growth. On the other hand, there has been staggering growth in Latin America, which in 2012 was hardly using (this particular mirror of) Library Genesis at all, but by 2015 had discovered LibGen and become one of the most intensive users of the library. Our results show that the biggest per capita users are the high-income North American and European countries. In fact, just a handful of countries, the United States (11.66%), India (8.58%), Germany (5.23%), the UK (4.10%), Iran (3.68%), China (3.67%), Italy (3.30%), Canada (2.36%), Indonesia (2.29%), Spain (2.28%), Turkey (2.24%), and Brazil (2.11%), account for more than half of all the downloads.
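The "more than half" claim can be checked directly from the twelve shares listed above. The following short Python sketch sums them and counts how many of the top countries are needed to pass a given cumulative share. The percentages are taken from the text; the helper name countries_to_reach is purely illustrative, and no other countries are considered because the text does not report them.

```python
from itertools import accumulate

# Share of all downloads per country, in percent, as listed in the text.
shares = {
    "United States": 11.66, "India": 8.58, "Germany": 5.23, "UK": 4.10,
    "Iran": 3.68, "China": 3.67, "Italy": 3.30, "Canada": 2.36,
    "Indonesia": 2.29, "Spain": 2.28, "Turkey": 2.24, "Brazil": 2.11,
}


def countries_to_reach(threshold, shares):
    """Number of top-ranked countries whose cumulative share reaches
    `threshold` percent, or None if the listed countries never reach it."""
    ranked = sorted(shares.values(), reverse=True)
    for count, cumulative in enumerate(accumulate(ranked), start=1):
        if cumulative >= threshold:
            return count
    return None


if __name__ == "__main__":
    total = sum(shares.values())
    print(f"Combined share of the listed countries: {total:.2f}%")
    print("Countries needed to pass 50%:", countries_to_reach(50, shares))
```

The twelve listed countries together account for about 51.5% of downloads, and all twelve are needed to cross the 50% mark.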

The top ten Dewey content categories suggest a strong science and technology focus of the library, since these two categories enjoy the highest demand, and also since works in them have the highest download volume per title (26.8). The social sciences, on the other hand, see the second-lowest number of downloads per title (13), even though this section is large, both in terms of supply and in terms of the number of titles downloaded.

What kind of impact might shadow libraries have on the current system of scholarly publishing? It seems that the scholarly publishing industry has understood that it is close to impossible to fight scholarly piracy efficiently. Gigapedia, the predecessor of LibGen, was relatively easy to shut down, as it relied on a centralized database and a centralized document repository. LibGen and Sci-Hub are much more difficult to eliminate, as they are both radically decentralized, and already exist in multiple copies all over the internet. That might also explain why there is only one court case against these services (Elsevier Inc. et al v. Sci-Hub et al Case No. 1:15-cv-04282-RWS 2015).

Under such conditions academic publishers have to ask themselves if the copyright and exclusivity-based business models are sustainable. For a number of reasons, the answer might still be in the affirmative. Both the US and the EU have mandated open access publishing for their publicly funded research, creating a lucrative revenue stream for publishers in the form of article processing fees, which are not threatened by piracy. The fact that the scholarly pirates (the scholars themselves) are not those who must pay for the materials (the ones paying are the academic institutions, libraries, and in some cases government agencies) may ultimately mitigate the negative effects of piracy, where illegal consumption substitutes for sales. One illegally downloaded scholarly monograph, already priced for the library market, does not diminish sales to individuals but may generate a purchase by the library at the request of the researcher who had a free sample copy through the shadow library. The net effect may well be positive for publishers.

Ultimately, the increasing use of shadow libraries has major consequences for the current systems of producing and interpreting academic indicators. Our pilot shows that shadow libraries are now an integral part of the systems of scholarly communication. They are part of the everyday routine of scholars in both developed and developing countries. However, there is no reliable, systematic insight into the use of these resources. Consequently, our academic indicators only give an incomplete and biased picture of the circulation of scholarly works. The copyright-infringing nature of shadow libraries only allows ad-hoc and fragmented insight into the circulation of works through them. Their modus operandi, on the other hand, is certain to introduce an unknown level of bias into our currently accepted set of indicators. For example, since Sci-Hub uses leaked or shared academic credentials to provide access to pay-walled materials, the traffic as measured at the point of access, that is, at the library through which the unauthorized access takes place, will not provide an accurate picture of who uses the library resources and for what reasons. Since it is not reliably known to what extent Sci-Hub serves subsequent requests for an article from its own archive as opposed to getting it again through a library, its impact on library usage metrics is also unknown. It is certain, however, that any newly published article behind a paywall is requested at least once through a library, and that inflates library usage statistics.

On the other hand, the high usage numbers from developed countries suggest that at least some of the shadow library traffic is generated by users who otherwise could have had legal access through their institutions. That applies both to articles for which users go to Sci-Hub rather than their own institutional repositories, and to books for which users visit LibGen rather than their library's print or e-book collections. Our statistical models (not reported here) seem to suggest that in North America and Western Europe we cannot explain the high usage by serious access limitations. Instead we suspect that in these territories the convenient one-click access that shadow libraries provide to full digital copies plays a role. This of course means that the official usage statistics of those resources that are also available through the shadow libraries will underreport the actual demand for key library resources.

Shadow libraries do not just introduce noise into the current indicators that measure the circulation and use of scholarly resources. Given their size, the intensity and growth of their use, the omission of the traffic through these libraries threatens to falsify these indicators.

Unless otherwise indicated, content hosted on OpenUP Hub is licensed under an Attribution 4.0 International (CC BY 4.0).