[covid19.fatcat.wiki](https://covid19.fatcat.wiki)
======================================================

**Not Medical Advice for General Public or Clinical Use!**

This repository contains a web search front-end and data munging pipeline for a
corpus of research publications and datasets relating to the COVID-19 pandemic.

The main dataset is the
["CORD-19"](https://pages.semanticscholar.org/coronavirus-research) (sic) paper
set from Semantic Scholar, enriched with additional metadata and web archive
fulltext from [fatcat.wiki](https://fatcat.wiki).

Visit the live site's ["about"](https://covid19.fatcat.wiki/about) and
["sources"](https://covid19.fatcat.wiki/sources) pages for more context about
this project. In particular, note the several **DISCLAIMERS** about quality,
content, and service reliability, as well as the licensing context for paper
content and bibliographic metadata.


## Technical Overview

A crude Python data preparation pipeline runs through the following stages
(see the sketch after this list):

- ``parse``: source metadata into JSON rows, one per paper
- ``enrich-fatcat``: queries fatcat API for full metadata and links to fulltext PDFs
- commands and shell scripts under `bin/` download PDF copies and generate
  "derivative" files (thumbnails, extracted text)
- ``derivatives``: add derivative file paths and fulltext to JSON rows
- ``transform-es``: convert from full JSON fulltext rows to elasticsearch schema
- load into elasticsearch cluster using `esbulk` tool
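
For orientation, here is a hedged sketch of chaining the metadata stages. The
subcommand names match the stages above, but the arguments and file names are
hypothetical and may not match the real `covid19_tool.py` interface:

    # Illustrative only: stage names come from the list above, but the
    # arguments and file names are guesses, not the actual CLI
    ./covid19_tool.py parse metadata.csv > rows.json
    ./covid19_tool.py enrich-fatcat rows.json > rows.enriched.json
    # (run the bin/ scripts here to fetch PDFs and generate derivatives)
    ./covid19_tool.py derivatives rows.enriched.json > rows.derivatives.json
    ./covid19_tool.py transform-es rows.derivatives.json > fulltext.es.json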

Currently, only documents with a fatcat release ident are indexed into
elasticsearch, using that ident as the document key. This means the index can
be reloaded to update documents without creating duplicate entries.
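
As a rough sketch of that final load step (the index name matches the public
index described below, but the input file name and the `fatcat_ident` field
name are assumptions; check `esbulk -h` for the current flags):

    # -index names the target index; -id picks the JSON field used as the
    # elasticsearch document _id, so re-running the load updates documents
    # in place rather than creating duplicates
    esbulk -index covid19_fatcat_fulltext -id fatcat_ident fulltext.es.json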

A stateless web interface (implemented in Python with Flask) provides a search
front-end to the elasticsearch index. The web interface uses the Babel library
to provide language localization, but additional work will be needed to make
the interface actually usable across languages.


## Elasticsearch API Access

The fulltext search index is currently world-readable via the native
elasticsearch 6.8 API at:

    https://search.fatcat.wiki/covid19_fatcat_fulltext

An index using the native fatcat release schema, covering just the papers in
this corpus, is also available at:

    https://search.fatcat.wiki/covid19_fatcat_release

Accessing both of these indices from your own software, or from browsers
directly via cross-site requests, should mostly work fine.
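
For example, a minimal fulltext query with `curl` (the request body is standard
elasticsearch query DSL; the search term is arbitrary):

    # fetch the top 3 matches for a simple query string
    curl -s 'https://search.fatcat.wiki/covid19_fatcat_fulltext/_search' \
        -H 'Content-Type: application/json' \
        -d '{"query": {"query_string": {"query": "chloroquine"}}, "size": 3}'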

## Development Environment

This software is developed and deployed on GNU/Linux (Debian family) and hasn't
been tested elsewhere. Software dependencies include the following (a sample
install command follows the list):

- python 3.7 (locked to this minor version)
- [pipenv](https://github.com/pypa/pipenv)
- `poppler-utils`
- elasticsearch 6.x (7.x may or may not work fine)
- [esbulk](https://github.com/miku/esbulk)
- [ripgrep](https://github.com/BurntSushi/ripgrep) (`rg`)
- [`fd`](https://github.com/sharkdp/fd)
- `pv`
- `parallel`
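
On a Debian-family system, most of these can be installed with `apt`. The
package names below are approximate and vary by release (`fd` is packaged as
`fd-find`); elasticsearch and `esbulk` are typically installed from their
upstream releases rather than from the distribution:

    # approximate package names; adjust for your Debian/Ubuntu release
    sudo apt install python3.7 poppler-utils pv parallel ripgrep fd-find
    pip install --user pipenv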

To run the web interface in local/debug mode, with search queries sent to the
public search index by default:

    cp example.env .env
    pipenv install --dev --deploy
    pipenv shell
    ./covid19_tool.py webface --debug

    # output will include a localhost URL to open


## Translations

Update the .pot file and translation files:

    pybabel extract -F extra/i18n/babel.cfg -o extra/i18n/web_interface.pot fatcat_covid19/
    pybabel update -i extra/i18n/web_interface.pot -d fatcat_covid19/translations

Compile translated messages together:

    pybabel compile -d fatcat_covid19/translations

Create an initial .po file for a new language (then run the update and compile
commands above after doing the initial translation):

    pybabel init -i extra/i18n/web_interface.pot -d fatcat_covid19/translations -l de


## Acknowledgements

For content and bibliographic metadata (partial list):

- Allen Institute's CORD-19 dataset
- PubMed catalog and PMC repository
- World Health Organization
- Wanfang Data
- CNKI
- bioRxiv and medRxiv preprint repositories
- publishers large and small, from around the world, making this research
  accessible (in some cases temporarily)
- research authors
- hospital workers and other emergency responders around the world

## Contact, Contributions, Licensing

General inquiries should go to
[webservices@archive.org](mailto:webservices@archive.org). Take-down requests
and legal inquiries should go to [info@archive.org](mailto:info@archive.org).
Bryan's contact information is available [on his website](https://bnewbold.net/about/).

Contributions are welcome! Development currently happens on GitHub, where
technical issues (bugs, feature requests) can be filed:
<https://github.com/bnewbold/covid19-fatcat-wiki>

The software in this repository is licensed under a combination of MIT and
AGPLv3 licenses. See `LICENSE.md` and `CONTRIBUTORS.md` for details.