From cf2bfc9382fe1c934f2e11562c5c95b86fac5114 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Fri, 3 Apr 2020 16:38:59 -0700 Subject: README, about page, sources page --- README.md | 97 +++++++++++++++++++++++++++++--- fatcat_covid19/templates/about_en.html | 59 ++++++++++++++++++- fatcat_covid19/templates/sources_en.html | 60 +++++++++++++++++++- 3 files changed, 206 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 650d7eb..0875eae 100644 --- a/README.md +++ b/README.md @@ -2,24 +2,107 @@ [covid19.fatcat.wiki](https://covid19.fatcat.wiki) ====================================================== -**Work in Progress!** +**Not Medical Advice for General Public or Clinical Use!** -**Not Medical Advice for Clinical or General Public!** - -This repository contains scripts and a web search front-end for a corpus of -research publications and datasets relating to the COVID-19 pandemic. +This repository contains a web search front-end and data munging pipeline for a +corpus of research publications and datasets relating to the COVID-19 pandemic. The main dataset is the ["CORD-19"](https://pages.semanticscholar.org/coronavirus-research) (sic) paper set from Semantic Scholar, enriched with additional metadata and web archive fulltext from [fatcat.wiki](https://fatcat.wiki). -Major acknowledgements (not complete): +Visit the live site ["about"](https://covid19.fatcat.wiki/about) and +["sources"](https://covid19.fatcat.wiki/sources) pages for more context about +this project. In particular, note several **DISCLAIMERS** about quality, +content, and service reliability, and licensing context about paper content and +bibliographic metadata. 
+ + +## Technical Overview + +A crude python data preparation pipeline runs through the following stages: + +- ``parse``: source metadata into JSON rows, one per paper +- ``enrich-fatcat``: queries fatcat API for full metadata and links to fulltext PDFs +- commands and shell scripts under `bin/` are run to download PDF copies and + make "derivative" files (like thumbnails, extracting text) +- ``derivatives``: add derivative file paths and full text to JSON rows +- ``transform-es``: convert from full JSON fulltext rows to elasticsearch schema +- load into elasticsearch cluster using `esbulk` tool + +Currently, only documents with a fatcat release ident are indexed into +elasticsearch, and that ident is used as the document key. This means that the +index can be reloaded to update documents without creating duplicate entries. + +A stateless web interface (implemented in Python with Flask) provides a search +front-end to the elasticsearch index. The web interface uses the Babel library +to provide language localization, but additional work will be needed to make +the interface actually usable across languages. + + +## Elasticsearch API Access + +The fulltext search index is currently world-readable in the native +elasticsearch 6.8 API at: + + https://search.fatcat.wiki/covid19_fatcat_fulltext + +An index of native fatcat release schema for just the papers in this corpus is +also available at: + + https://search.fatcat.wiki/covid19_fatcat_release +Accessing both of these indices from your own software, or from browsers +directly via cross-site requests, should mostly work fine. + +## Development Environment + +This software is developed and deployed on GNU/Linux (Debian family) and hasn't +been tested elsewhere. 
Software dependencies include: + +- python 3.7 (locked to this minor version) +- [pipenv](https://github.com/pypa/pipenv) +- elasticsearch 6.x (7.x may or may not work fine) +- [esbulk](https://github.com/miku/esbulk) +- [ripgrep](https://github.com/BurntSushi/ripgrep) (`rg`) +- [`fd`](https://github.com/sharkdp/fd) + +To run the web interface in local/debug mode, with search queries sent to +the public search index by default: + + cp example.env .env + pipenv install --dev --deploy + pipenv shell + ./covid19_tool.py webface --debug + + # output will include a localhost URL to open + +## Acknowledgements + +For content and bibliographic metadata (partial list): + +- Allen Institute's CORD-19 dataset - PubMed catalog and PMC repository +- World Health Organization - Wanfang Data - CNKI - biorxiv and medrxiv pre-print repositories -- publishers large and small, from around the world, making additional content available +- publishers large and small, from around the world, making this research + accessible (in some cases temporarily) - research authors - hospital workers and other emergency responders around the world + +## Contact, Contributions, Licensing + +General inquiries should go to +[webservices@archive.org](mailto:webservices@archive.org). Take-down requests +and legal inquiries to [info@archive.org](mailto:info@archive.org). Bryan's +contact information is available [on his website](https://bnewbold.net/about/). + +Contributions are welcome! Development is currently on GitHub and technical +issues (bugs, feature requests) can be filed there: + + +The software in this repository is licensed under a combination of MIT and +AGPLv3 licenses. See `LICENSE.md` and `CONTRIBUTORS.md` for details. diff --git a/fatcat_covid19/templates/about_en.html b/fatcat_covid19/templates/about_en.html index 8db4a6f..95a272d 100644 --- a/fatcat_covid19/templates/about_en.html +++ b/fatcat_covid19/templates/about_en.html @@ -6,6 +6,63 @@

About Fatcat COVID-19 Paper Search

-TODO +

+This is a prototype full text search index of papers, reports, datasets, and +other research resources related to the COVID-19 crisis, including public +health responses to influenza pandemics more generally. The curation of content +to be included is based on efforts like the "CORD-19" dataset and efforts by +authorities such as the WHO and NIH PubMed. Metadata and content come from the +existing open fatcat catalog of research +outputs. +See "Sources" for details. + +

+It is hoped that with additional care and development this resource may be +useful to anybody keeping up with research in this area, and particularly folks +working on systematic reviews, bibliometrics, or metaresearch. However, at the time +of writing, this is at best a technology demonstration, not a robust piece of +knowledge infrastructure. + +

+We encourage folks to consider the following more authoritative and +well-supported tools for research discovery: + +

+ +

+Feedback and queries can be directed to webservices@archive.org. + +

Service Disclaimers

+ +

+This is not a production-supported service of the Internet Archive. The website +and search API may become unavailable due to resource load, operator +availability, etc. If you would like to depend on this service, please contact +us. + +

+Some content available in this index may not be "perpetually accessible" after +the COVID-19 crisis ends, due to temporary content licenses. The service itself +(covid19.fatcat.wiki) may also not be operated after the crisis, though all of +the source code and upstream metadata should be "perpetually accessible". + +

Additional Resources

+ +

+Source code is available on GitHub, and bugs can be reported there as issues: +https://github.com/bnewbold/covid19-fatcat-wiki + +

An elasticsearch API is available; see the above repo README for details. + +

+Bulk exports of metadata and derived content are available on the Internet +Archive at: +https://archive.org/details/fatcat_covid19 {% endblock %} diff --git a/fatcat_covid19/templates/sources_en.html b/fatcat_covid19/templates/sources_en.html index d46ac77..bca32a7 100644 --- a/fatcat_covid19/templates/sources_en.html +++ b/fatcat_covid19/templates/sources_en.html @@ -4,8 +4,64 @@ {% block body %} -

{{ _("Sources of Content and Metadata") }}

+

Curated COVID-19 Sources

-TODO +Works are tagged with the source of their inclusion in this COVID-19 corpus: + + + +To clarify use of the CORD-19 corpus in particular, the corpus is used only to +identify papers for inclusion in this index (e.g., by DOI or PMCID). +Bibliographic metadata and content are then fetched from the existing Fatcat +catalog of open metadata, and full-text content is indexed from copies found on +the public web, repositories, and publisher websites. + +

Disclaimers

+ +

+The fatcat catalog is intended to be a "universal" preservation and access +archive, not a narrow curated collection of only the highest quality research +content. This means that not all content has undergone peer review, and some +may have been uploaded to services like academic social networks (e.g., +researchgate) or institutional repositories with absolutely no human editorial +review or filtering. + +

+The catalog intends to capture metadata such as publication stage (draft, +published, retracted), venue, and medium (journal article, web post, +encyclopedia entry, frontmatter) to help filter through this content. But in +some cases this metadata is incomplete or may be inaccurate. For example, +pre-print PDF files may be incorrectly associated with the final published +version of a work, or vice versa. + + +

Sources of Metadata

+ +The source of all bibliographic information is recorded in edit history +metadata, which allows the provenance of all records to be reconstructed. A few +major sources are worth highlighting here: + + + +Many thanks for the hard work of all these projects, institutions, and individuals! {% endblock %} -- cgit v1.2.3
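
As an illustration of the "Elasticsearch API Access" section of the README in this patch, the following is a minimal sketch of querying the fulltext index from Python. It assumes the index answers the standard Elasticsearch URI search (`_search` with `q` and `size` parameters) and that the service is reachable; the helper names `build_search_url` and `search` are hypothetical and not part of this repository. No field names inside the indexed documents are assumed, since the README does not list them.

```python
import json
import urllib.parse
import urllib.request

# Index URL quoted in the README. The "/_search" suffix and the "q"/"size"
# parameters are the generic Elasticsearch URI search interface, assumed
# (not confirmed by the patch) to be enabled on this endpoint.
FULLTEXT_INDEX = "https://search.fatcat.wiki/covid19_fatcat_fulltext"


def build_search_url(query, size=5, index_url=FULLTEXT_INDEX):
    """Return a URI-search URL for a free-text query (hypothetical helper)."""
    params = urllib.parse.urlencode({"q": query, "size": size})
    return f"{index_url}/_search?{params}"


def search(query, size=5, timeout=10):
    """Run the query and return the raw list of hits; needs network access."""
    url = build_search_url(query, size)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    # Elasticsearch search responses nest matching documents under hits.hits.
    return body["hits"]["hits"]
```

Calling `search("remdesivir")` would return the matching documents as dictionaries; per the README, each document key (`_id`) should be a fatcat release ident. Only the standard library is used so that the sketch does not add to the dependency list above.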