covid19.fatcat.wiki
Not Medical Advice for General Public or Clinical Use!
This repository contains a web search front-end and data munging pipeline for a corpus of research publications and datasets relating to the COVID-19 pandemic.
The main dataset is the "CORD-19" (sic) paper set from Semantic Scholar, enriched with additional metadata and web archive fulltext from fatcat.wiki.
Visit the live site "about" and "sources" pages for more context about this project. In particular, note several DISCLAIMERS about quality, content, and service reliability, and licensing context about paper content and bibliographic metadata.
Technical Overview
A crude Python data preparation pipeline runs through the following stages:
- parse: source metadata into JSON rows, one per paper
- enrich-fatcat: queries fatcat API for full metadata and links to fulltext PDFs
- commands and shell scripts under bin/ are run to download PDF copies and make "derivative" files (like thumbnails and extracted text)
- derivatives: add derivative file paths and full text to JSON rows
- transform-es: convert from full JSON fulltext rows to elasticsearch schema
- load into elasticsearch cluster using the esbulk tool
Currently, only documents with a fatcat release ident are indexed into elasticsearch, and that ident is used as the document key. This means that the index can be reloaded to update documents without creating duplicate entries.
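As a rough illustration of why keying documents on the release ident makes reloads safe, the sketch below turns JSON rows into Elasticsearch bulk-format lines with an explicit _id. It is not the repository's code: the field name release_id and the helper to_bulk_lines are hypothetical, and the actual pipeline loads data with esbulk rather than building bulk lines by hand.

```python
import json
from typing import Iterable, Iterator


def to_bulk_lines(rows: Iterable[dict], key_field: str = "release_id") -> Iterator[str]:
    """Yield bulk-format line pairs (action line, then document line).

    `key_field` is a hypothetical name for the field holding the fatcat
    release ident. Rows without an ident are skipped, mirroring the rule
    that only documents with a release ident are indexed.
    """
    for row in rows:
        ident = row.get(key_field)
        if not ident:
            continue
        # An explicit _id means a later reload overwrites the same
        # document instead of adding a second copy.
        yield json.dumps({"index": {"_id": ident}})
        yield json.dumps(row)


# Made-up rows for illustration only.
rows = [
    {"release_id": "aaaa", "title": "Example paper"},
    {"title": "Paper with no fatcat match"},
]
bulk_lines = list(to_bulk_lines(rows))
```

Running the same rows through twice produces identical action lines, so sending them to an index a second time would update the existing document rather than duplicate it.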
A stateless web interface (implemented in Python with Flask) provides a search front-end to the elasticsearch index. The web interface uses the Babel library to provide language localization, but additional work will be needed to make the interface actually usable across languages.
Elasticsearch API Access
The fulltext search index is currently world-readable in the native elasticsearch 6.8 API at:
https://search.fatcat.wiki/covid19_fatcat_fulltext
An index of native fatcat release schema for just the papers in this corpus is also available at:
https://search.fatcat.wiki/covid19_fatcat_release
Accessing both of these indices from your own software, or from browsers directly via cross-site requests, should mostly work fine.
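A minimal sketch of querying the fulltext index from Python, using only the standard library and Elasticsearch's URI search (the q and size parameters of _search). The helper names build_search_url and search are made up for this example, and it assumes the public endpoint permits plain GET requests to _search; the structure of each returned document is not described here, so the sketch only reads the standard hits.hits list.

```python
import json
import urllib.parse
import urllib.request

FULLTEXT_INDEX = "https://search.fatcat.wiki/covid19_fatcat_fulltext"


def build_search_url(query: str, size: int = 5, index_url: str = FULLTEXT_INDEX) -> str:
    """Build a URI-search URL for the given query string."""
    params = urllib.parse.urlencode({"q": query, "size": size})
    return f"{index_url}/_search?{params}"


def search(query: str, size: int = 5) -> list:
    """Run the query and return the raw list of hits.

    Assumes the endpoint is reachable and answers with a standard
    Elasticsearch search response.
    """
    with urllib.request.urlopen(build_search_url(query, size), timeout=30) as resp:
        result = json.load(resp)
    return result["hits"]["hits"]


# Example usage (performs a network request):
#     for hit in search("coronavirus vaccine", size=3):
#         print(hit["_id"])
```

The same helper works for the release index by passing index_url="https://search.fatcat.wiki/covid19_fatcat_release".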
Development Environment
This software is developed and deployed on GNU/Linux (Debian family) and hasn't been tested elsewhere. Software dependencies include:
- python 3.7 (locked to this minor version)
- pipenv
- poppler-utils
- elasticsearch 6.x (7.x may or may not work fine)
- esbulk
- ripgrep (rg)
- fd
- pv
- parallel
To run the web interface in local/debug mode, with search queries sent to the public search index by default:
cp example.env .env
pipenv install --dev --deploy
pipenv shell
./covid19_tool.py webface --debug
# output will include a localhost URL to open
Translations
Update the .pot file and translation files:
pybabel extract -F extra/i18n/babel.cfg -o extra/i18n/web_interface.pot fatcat_covid19/
pybabel update -i extra/i18n/web_interface.pot -d fatcat_covid19/translations
Compile translated messages together:
pybabel compile -d fatcat_covid19/translations
Create initial .po file for a new language translation (then run the above update/compile after doing initial translations):
pybabel init -i extra/i18n/web_interface.pot -d fatcat_covid19/translations -l de
Acknowledgements
For content and bibliographic metadata (partial list):
- Allen Institute's CORD-19 dataset
- PubMed catalog and PMC repository
- World Health Organization
- Wanfang Data
- CNKI
- biorxiv and medrxiv pre-print repositories
- publishers large and small, from around the world, making this research accessible (in some cases temporarily)
- research authors
- hospital workers and other emergency responders around the world
Contact, Contributions, Licensing
General inquiries should go to webservices@archive.org. Take-down requests and legal inquiries should go to info@archive.org. Bryan's contact information is available on his website.
Contributions are welcome! Development is currently on GitHub, and technical issues (bugs, feature requests) can be filed there: https://github.com/bnewbold/covid19-fatcat-wiki
The software in this repository is licensed under a combination of MIT and AGPLv3 licenses. See LICENSE.md and CONTRIBUTORS.md for details.