README, about page, sources page

author: Bryan Newbold <bnewbold@archive.org> 2020-04-03 16:38:59 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-04-03 16:38:59 -0700
commit: cf2bfc9382fe1c934f2e11562c5c95b86fac5114 (patch)
tree: 50b1b3150696ed08a3af3b80ece9fbd81718e24e
parent: fb767adb9472ff85b46b5a383f3986950b12dd27 (diff)
3 files changed, 206 insertions, 10 deletions
diff --git a/README.md b/README.md
index 650d7eb..0875eae 100644
--- a/README.md
+++ b/README.md
@@ -2,24 +2,107 @@
 [covid19.fatcat.wiki](https://covid19.fatcat.wiki)
 ======================================================
 
-**Work in Progress!**
+**Not Medical Advice for General Public or Clinical Use!**
 
-**Not Medical Advice for Clinical or General Public!**
-
-This repository contains scripts and a web search front-end for a corpus of
-research publications and datasets relating to the COVID-19 pandemic.
+This repository contains a web search front-end and data munging pipeline for a
+corpus of research publications and datasets relating to the COVID-19 pandemic.
 
 The main dataset is the
 ["CORD-19"](https://pages.semanticscholar.org/coronavirus-research) (sic) paper
 set from Semantic Scholar, enriched with additional metadata and web archive
 fulltext from [fatcat.wiki](https://fatcat.wiki).
 
-Major acknowledgements (not complete):
+Visit the live site ["about"](https://covid19.fatcat.wiki/about) and
+["sources"](https://covid19.fatcat.wiki/sources) pages for more context about
+this project. In particular, note several **DISCLAIMERS** about quality,
+content, and service reliability, and licensing context about paper content and
+bibliographic metadata.
+
+
+## Technical Overview
+
+A crude Python data preparation pipeline runs through the following stages:
+
+- ``parse``: source metadata into JSON rows, one per paper
+- ``enrich-fatcat``: queries fatcat API for full metadata and links to fulltext PDFs
+- commands and shell scripts under `bin/` are run to download PDF copies and
+  make "derivative" files (like thumbnails, extracting text)
+- ``derivatives``: add derivative file paths and full text to JSON rows
+- ``transform-es``: convert from full JSON fulltext rows to elasticsearch schema
+- load into elasticsearch cluster using `esbulk` tool
+
+Currently, only documents with a fatcat release ident are indexed into
+elasticsearch, and that ident is used as the document key. This means that the
+index can be reloaded to update documents without creating duplicate entries.
+
+A stateless web interface (implemented in Python with Flask) provides a search
+front-end to the elasticsearch index. The web interface uses the Babel library
+to provide language localization, but additional work will be needed to make
+the interface actually usable across languages.
+
+
+## Elasticsearch API Access
+
+The fulltext search index is currently world-readable in the native
+elasticsearch 6.8 API at:
+
+    https://search.fatcat.wiki/covid19_fatcat_fulltext
+
+An index of native fatcat release schema for just the papers in this corpus is
+also available at:
+
+    https://search.fatcat.wiki/covid19_fatcat_release
 
+Accessing both of these indices from your own software, or from browsers
+directly via cross-site requests, should mostly work fine.
+
+## Development Environment
+
+This software is developed and deployed on GNU/Linux (Debian family) and hasn't
+been tested elsewhere. Software dependencies include:
+
+- python 3.7 (locked to this minor version)
+- [pipenv](https://github.com/pypa/pipenv)
+- elasticsearch 6.x (7.x may or may not work fine)
+- [esbulk](https://github.com/miku/esbulk)
+- [ripgrep](https://github.com/BurntSushi/ripgrep) (`rg`)
+- [`fd`](https://github.com/sharkdp/fd)
+
+To run the web interface in local/debug mode, with search queries sent to
+the public search index by default:
+
+    cp example.env .env
+    pipenv install --dev --deploy
+    pipenv shell
+    ./covid19_tool.py webface --debug
+
+    # output will include a localhost URL to open
+
+## Acknowledgements
+
+For content and bibliographic metadata (partial list):
+
+- Allen Institute's CORD-19 dataset
 - PubMed catalog and PMC repository
+- World Health Organization
 - Wanfang Data
 - CNKI
 - biorxiv and medrxiv pre-print repositories
-- publishers large and small, from around the world, making additional content available
+- publishers large and small, from around the world, making this research
+  accessible (in some cases temporarily)
 - research authors
 - hospital workers and other emergency responders around the world
+
+## Contact, Contributions, Licensing
+
+General inquiries should go to
+[webservices@archive.org](mailto:webservices@archive.org). Take-down requests
+and legal inquiries to [info@archive.org](mailto:info@archive.org). Bryan's
+contact information is available [on his website](https://bnewbold.net/about/).
+
+Contributions are welcome! Development is currently on GitHub and technical
+issues (bugs, feature requests) can be filed there:
+<https://github.com/bnewbold/covid19-fatcat-wiki>
+
+The software in this repository is licensed under a combination of MIT and
+AGPLv3 licenses. See `LICENSE.md` and `CONTRIBUTORS.md` for details.
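
Note on the README's "Technical Overview" above: it states that the fatcat release ident is used as the elasticsearch document key, so reloading the index updates documents instead of duplicating them. A minimal sketch of that idea follows, using a plain dict as a stand-in for the index. The field name `release_id` and the function `load_rows` are hypothetical, chosen for illustration, and are not taken from this repository.

```python
def load_rows(index, rows, key_field="release_id"):
    """Upsert JSON rows into `index`, keyed by a release ident.

    `index` is a plain dict standing in for an elasticsearch index.
    `key_field` is a hypothetical field name, not the project's schema.
    Returns the number of rows skipped because they had no ident.
    """
    skipped = 0
    for row in rows:
        ident = row.get(key_field)
        if not ident:
            # Rows without a release ident are not indexed at all.
            skipped += 1
            continue
        # Same key on a reload overwrites the old document (no duplicate).
        index[ident] = row
    return skipped


rows = [
    {"release_id": "aaa", "title": "First paper"},
    {"release_id": "bbb", "title": "Second paper"},
    {"title": "No ident, so not indexed"},
]
index = {}
first_skipped = load_rows(index, rows)
second_skipped = load_rows(index, rows)  # reload the same batch
```

After both loads the stand-in index still holds two documents, which is the property the README describes for reloads.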
diff --git a/fatcat_covid19/templates/about_en.html b/fatcat_covid19/templates/about_en.html
index 8db4a6f..95a272d 100644
--- a/fatcat_covid19/templates/about_en.html
+++ b/fatcat_covid19/templates/about_en.html
@@ -6,6 +6,63 @@
 
 <h1>About Fatcat COVID-19 Paper Search</h1>
 
-TODO
+<p>
+This is a prototype full text search index of papers, reports, datasets, and
+other research resources related to the COVID-19 crisis, including public
+health responses to influenza pandemics more generally. The curation of content
+to be included is based on efforts like the "CORD-19" dataset and efforts by
+authorities such as the WHO and NIH PubMed. Metadata and content come from the
+existing open <a href="https://fatcat.wiki">fatcat</a> catalog of research
+outputs.
+See <a href="{{ url_for("search.page_sources") }}">"Sources"</a> for details.
+
+<p>
+It is hoped that with additional care and development this resource may be
+useful to anybody keeping up with research in this area, and particularly folks
+working on systematic reviews, bibliometrics, or metaresearch. However, at time
+of writing, this is at best a technology demonstration, not a robust piece of
+knowledge infrastructure.
+
+<p>
+We encourage folks to consider the following more authoritative and
+well-supported tools for research discovery:
+
+<ul>
+  <li><a href="https://pubmed.gov">Pubmed</a> for biomedical research in
+  general, and the subject-specific <a href="https://www.ncbi.nlm.nih.gov/research/coronavirus/">LitCovid</a>
+  index for COVID-19.
+  <li><a href="https://www.semanticscholar.org/">Semantic Scholar</a>
+  <li><a href="https://scholar.google.com">Google Scholar</a>
+</ul>
+
+<p>
+Feedback and queries can be directed to <b><a href="mailto:webservices@archive.org">webservices@archive.org</a></b>.
+
+<h2>Service Disclaimers</h2>
+
+<p>
+This is not a production-supported service of the Internet Archive. The website
+and search API may become unavailable due to resource load, operator
+availability, etc. If you would like to depend on this service, please contact
+us.
+
+<p>
+Some content available in this index may not be "perpetually accessible" after
+the COVID-19 crisis ends, due to temporary content licenses. The service itself
+(covid19.fatcat.wiki) may also not be operated after the crisis, though all of
+the source code and upstream metadata should be "perpetually accessible".
+
+<h2>Additional Resources</h2>
+
+<p>
+Source code is available on GitHub, and bugs can be reported there as issues:
+<a href="https://github.com/bnewbold/covid19-fatcat-wiki">https://github.com/bnewbold/covid19-fatcat-wiki</a>
+
+<p>An elasticsearch API is available; see the above repo README for details.
+
+<p>
+Bulk exports of metadata and derived content are available on the Internet
+Archive at:
+<a href="https://archive.org/details/fatcat_covid19">https://archive.org/details/fatcat_covid19</a>
 
 {% endblock %}
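
Note on the elasticsearch API mentioned in the about page and README above: the README gives the world-readable index URL `https://search.fatcat.wiki/covid19_fatcat_fulltext` and says it speaks the native elasticsearch 6.8 API. A hedged sketch of building a `_search` request against it is below. The helper name `build_search_request` is hypothetical, and the query uses the generic `query_string` form because the index's field names are not described in this commit. The block only builds the request and does not contact the service.

```python
import json

# Host and index name come from the README in this commit.
SEARCH_HOST = "https://search.fatcat.wiki"
FULLTEXT_INDEX = "covid19_fatcat_fulltext"


def build_search_request(query, limit=10, index=FULLTEXT_INDEX):
    """Return (url, body) for an elasticsearch `_search` request.

    Hypothetical helper: it only assembles the URL and JSON body.
    """
    url = f"{SEARCH_HOST}/{index}/_search"
    body = {
        # `query_string` searches across fields without naming any,
        # since the index schema is not documented here.
        "query": {"query_string": {"query": query, "default_operator": "AND"}},
        "size": limit,
    }
    return url, json.dumps(body)


url, body = build_search_request("coronavirus transmission", limit=5)
```

The resulting `url` and `body` can be sent with any HTTP client as a POST with a `Content-Type: application/json` header. Given the service disclaimers in the about page, callers should expect the endpoint to be unavailable at times.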
diff --git a/fatcat_covid19/templates/sources_en.html b/fatcat_covid19/templates/sources_en.html
index d46ac77..bca32a7 100644
--- a/fatcat_covid19/templates/sources_en.html
+++ b/fatcat_covid19/templates/sources_en.html
@@ -4,8 +4,64 @@
 
 {% block body %}
 
-<h1>{{ _("Sources of Content and Metadata") }}</h1>
+<h2>Curated COVID-19 Sources</h2>
 
-TODO
+Works are tagged with the source of their inclusion in this COVID-19 corpus:
+
+<ul>
+  <li><a href="https://pages.semanticscholar.org/coronavirus-research">Allen Institute for AI CORD-19 corpus</a>
+  <li><a href="https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov">WHO Database of publications on coronavirus disease (COVID-19)</a>
+  <li><a href="http://subject.med.wanfangdata.com.cn/Channel/7">Wanfang corpus of Chinese COVID-19 papers</a>
+  <li><a href="http://en.gzbd.cnki.net/GZBT/brief/Default.aspx">CNKI corpus of Chinese COVID-19 papers</a>
+  <li><a href="https://fatcat.wiki">Fatcat</a> (based on keyword queries against the full catalog)
+</ul>
+
+To clarify use of the CORD-19 corpus in particular, the corpus is used only to
+identify papers for inclusion in this index (eg, by DOI or PMCID).
+Bibliographic metadata and content is then fetched from the existing Fatcat
+catalog of open metadata, and full-text content is indexed from copies found on
+the public web, repositories, and publisher websites.
+
+<h2>Disclaimers</h2>
+
+<p>
+The fatcat catalog is intended to be a "universal" preservation and access
+archive, not a narrow curated collection of only the highest quality research
+content. This means that not all content has undergone peer-review, and some
+may have been uploaded to services like academic social networks (eg,
+researchgate) or institutional repositories with absolutely no human editorial
+review or filtering.
+
+<p>
+The catalog intends to capture metadata such as publication stage (draft,
+published, retracted), venue, and medium (journal article, web post,
+encyclopedia entry, frontmatter) to help filter through this content. But in
+some cases this metadata is incomplete or may be inaccurate. For example,
+pre-print PDF files may be incorrectly associated with the final published
+version of a work, or vice versa.
+
+
+<h2>Sources of Metadata</h2>
+
+The source of all bibliographic information is recorded in edit history
+metadata, which allows the provenance of all records to be reconstructed. A few
+major sources are worth highlighting here:
+
+<ul>
+ <li>Release metadata from <b>Crossref</b>, via their public
+ <a href="https://github.com/CrossRef/rest-api-doc">REST API</a>
+ <li>Release metadata and linked full-text content from NIH <b>Pubmed</b> and <b><a href="https://arxiv.org">arXiv.org</a></b>
+ <li>Release metadata and linked public domain full-text content from the <b>JSTOR</b> Early Journal Content collection
+ <li>Creator names and de-duplication from <b>ORCID</b>, via their annual public data releases
+ <li>Journal title metadata from <b>DOAJ</b>, <b>ISSN ROAD</b>, and <b>SHERPA/RoMEO</b>
+ <li>Full-text URL lists from <b><a href="https://core.ac.uk">CORE</a></b>,
+ <b><a href="http://unpaywall.org">Unpaywall</a></b>,
+ <b><a href="https://www.semanticscholar.org">Semantic Scholar</a></b>,
+ <b><a href="https://citeseerx.ist.psu.edu">CiteseerX</a></b>,
+ and <b><a href="https://www.microsoft.com/en-us/research/project/academic">Microsoft Academic Graph</a></b>.
+ <li><a href="https://guide.fatcat.wiki/sources.html">The Fatcat Guide</a> lists more major sources
+</ul>
+
+Many thanks for the hard work of all these projects, institutions, and individuals!
 
 {% endblock %}