author | Bryan Newbold <bnewbold@archive.org> | 2020-04-03 16:38:59 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-04-03 16:38:59 -0700 |
commit | cf2bfc9382fe1c934f2e11562c5c95b86fac5114 (patch) | |
tree | 50b1b3150696ed08a3af3b80ece9fbd81718e24e | |
parent | fb767adb9472ff85b46b5a383f3986950b12dd27 (diff) | |
README, about page, sources page
-rw-r--r-- | README.md | 97 |
-rw-r--r-- | fatcat_covid19/templates/about_en.html | 59 |
-rw-r--r-- | fatcat_covid19/templates/sources_en.html | 60 |
3 files changed, 206 insertions, 10 deletions
@@ -2,24 +2,107 @@
 [covid19.fatcat.wiki](https://covid19.fatcat.wiki)
 ======================================================

-**Work in Progress!**
+**Not Medical Advice for General Public or Clinical Use!**

-**Not Medical Advice for Clinical or General Public!**
-
-This repository contains scripts and a web search front-end for a corpus of
-research publications and datasets relating to the COVID-19 pandemic.
+This repository contains a web search front-end and data munging pipeline for a
+corpus of research publications and datasets relating to the COVID-19 pandemic.

 The main dataset is the
 ["CORD-19"](https://pages.semanticscholar.org/coronavirus-research)
 (sic) paper set from Semantic Scholar, enriched with additional metadata and
 web archive fulltext from [fatcat.wiki](https://fatcat.wiki).

-Major acknowledgements (not complete):
+Visit the live site ["about"](https://covid19.fatcat.wiki/about) and
+["sources"](https://covid19.fatcat.wiki/sources) pages for more context about
+this project. In particular, note several **DISCLAIMERS** about quality,
+content, and service reliability, and licensing context about paper content and
+bibliographic metadata.
+
+
+## Technical Overview
+
+A crude python data preparation pipeline runs through the following stages:
+
+- ``parse``: source metadata into JSON rows, one per paper
+- ``enrich-fatcat``: queries fatcat API for full metadata and links to fulltext PDFs
+- commands and shell scripts under `bin/` are run to download PDF copies and
+  make "derivative" files (like thumbnails, extracting text)
+- ``derivatives``: add derivative file paths and full text to JSON rows
+- ``transform-es``: convert from full JSON fulltext rows to elasticsearch schema
+- load into elasticsearch cluster using `esbulk` tool
+
+Currently, only documents with a fatcat release ident are indexed into
+elasticsearch, and use that ident as the document key. This means that the
+index can be reloaded to update documents without creating duplicate entries.
+
+A stateless web interface (implemented in Python with Flask) provides a search
+front-end to the elasticsearch index. The web interface uses the Babel library
+to provide language localization, but additional work will be needed to make
+the interface actually usable across languages.
+
+
+## Elasticsearch API Access
+
+The fulltext search index is currently world-readable in the native
+elasticsearch 6.8 API at:
+
+    https://search.fatcat.wiki/covid19_fatcat_fulltext
+
+An index of native fatcat release schema for just the papers in this corpus is
+also available at:
+
+    https://search.fatcat.wiki/covid19_fatcat_release

+Accessing both of these indices from your own software, or from browsers
+directly via cross-site requests, should mostly work fine.
+
+## Development Environment
+
+This software is developed and deployed on GNU/Linux (Debian family) and hasn't
+been tested elsewhere. Software dependencies include:
+
+- python 3.7 (locked to this minor version)
+- [pipenv](https://github.com/pypa/pipenv)
+- elasticsearch 6.x (7.x may or may not work fine)
+- [esbulk](https://github.com/miku/esbulk)
+- [ripgrep](https://github.com/BurntSushi/ripgrep) (`rg`)
+- [`fd`](https://github.com/sharkdp/fd)
+
+To run the web interface in local/debug mode, with search queries sent to
+public search index by default:
+
+    cp example.env .env
+    pipenv install --dev --deploy
+    pipenv shell
+    ./covid19_tool.py webface --debug
+
+    # output will include a localhost URL to open
+
+## Acknowledgements
+
+For content and bibliographic metadata (partial list):
+
+- Allen Institute's CORD-19 dataset
 - PubMed catalog and PMC repository
+- World Health Organization
 - Wanfang Data
 - CNKI
 - biorxiv and medrxiv pre-print repositories
-- publishers large and small, from around the world, making additional content available
+- publishers large and small, from around the world, making this research
+  accessible (in some cases temporarily)
 - research authors
 - hospital workers and other emergency responders around the world
+
+## Contact, Contributions, Licensing
+
+General inquiries should go to
+[webservices@archive.org](mailto:webservices@archive.org). Take-down requests
+and legal inquiries to [info@archive.org](mailto:info@archive.org). Bryan's
+contact information is available [on his website](https://bnewbold.net/about/).
+
+Contributions are welcome! Development is currently on Github and technical
+issues (bugs, feature requests) can be filed there:
+<https://github.com/bnewbold/covid19-fatcat-wiki>
+
+The software in this repository is licensed under a combination of MIT and
+AGPLv3 licenses. See `LICENSE.md` and `CONTRIBUTORS.md` for details.
diff --git a/fatcat_covid19/templates/about_en.html b/fatcat_covid19/templates/about_en.html
index 8db4a6f..95a272d 100644
--- a/fatcat_covid19/templates/about_en.html
+++ b/fatcat_covid19/templates/about_en.html
@@ -6,6 +6,63 @@

 <h1>About Fatcat COVID-19 Paper Search</h1>

-TODO
+<p>
+This is a prototype full text search index of papers, reports, datasets, and
+other research resources related to the COVID-19 crisis, including public
+health responses to influenza pandemics more generally. The curation of content
+to be included is based on efforts like the "CORD-19" dataset and efforts by
+authorities such as the WHO and NIH Pubmed. Metadata and content comes from the
+existing open <a href="https://fatcat.wiki">fatcat</a> catalog of research
+outputs.
+See <a href="{{ url_for("search.page_sources") }}">"Sources"</a> for details.
+
+<p>
+It is hoped that with additional care and development this resource may be
+useful to anybody keeping up with research in this area, and particularly folks
+working on systematic reviews, bibliometrics, or metaresearch. However, at time
+of writing, this is at best a technology demonstration, not a robust piece of
+knowledge infrastructure.
+
+<p>
+We encourage folks to consider the following more authoritative and
+well-supported tools for research discovery:
+
+<ul>
+  <li><a href="https://pubmed.gov">Pubmed</a> for biomedical research in
+    general, and the subject-specific <a href="https://www.ncbi.nlm.nih.gov/research/coronavirus/">LitCovid</a>
+    index for COVID-19.
+  <li><a href="https://www.semanticscholar.org/">Semantic Scholar</a>
+  <li><a href="https://scholar.google.com">Google Scholar</a>
+</ul>
+
+<p>
+Feedback and queries can be directed to <b><a href="mailto:webservices@archive.org">webservices@archive.org</a></b>.
+
+<h2>Service Disclaimers</h2>
+
+<p>
+This is not a production-supported service of the Internet Archive. The website
+and search API may become unavailable due to resource load, operator
+availability, etc. If you would like to depend on this service, please contact
+us.
+
+<p>
+Some content available in this index may not be "perpetually accessible" after
+the COVID-19 crisis ends, due to temporary content licenses. The service itself
+(covid19.fatcat.wiki) may also not be operated after the crisis, though all of
+the source code and upstream metadata should be "perpetually accessible".
+
+<h2>Additional Resources</h2>
+
+<p>
+Source code is available on Github, and bugs can be reported there as issues:
+<a href="https://github.com/bnewbold/covid19-fatcat-wiki">https://github.com/bnewbold/covid19-fatcat-wiki</a>
+
+<p>An elasticsearch API is available; see the above repo README for details.
+
+<p>
+Bulk exports of metadata and derived content are available on the Internet
+Archive at:
+<a href="https://archive.org/details/fatcat_covid19">https://archive.org/details/fatcat_covid19</a>

 {% endblock %}
diff --git a/fatcat_covid19/templates/sources_en.html b/fatcat_covid19/templates/sources_en.html
index d46ac77..bca32a7 100644
--- a/fatcat_covid19/templates/sources_en.html
+++ b/fatcat_covid19/templates/sources_en.html
@@ -4,8 +4,64 @@

 {% block body %}

-<h1>{{ _("Sources of Content and Metadata") }}</h1>
+<h2>Curated COVID-19 Sources</h2>

-TODO
+Works are tagged with the source of their inclusion in this COVID-19 corpus:
+
+<ul>
+  <li><a href="https://pages.semanticscholar.org/coronavirus-research">Allen Institute for AI CORD-19 corpus</a>
+  <li><a href="https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov">WHO Database of publications on coronavirus disease (COVID-19)</a>
+  <li><a href="http://subject.med.wanfangdata.com.cn/Channel/7">Wanfang corpus of Chinese COVID-19 papers</a>
+  <li><a href="http://en.gzbd.cnki.net/GZBT/brief/Default.aspx">CNKI corpus of Chinese COVID-19 papers</a>
+  <li><a href="https://fatcat.wiki">Fatcat</a> (based on keyword queries against the full catalog)
+</ul>
+
+To clarify use of the CORD-19 corpus in particular, the corpus is used only to
+identify papers for inclusion in this index (eg, by DOI or PMCID).
+Bibliographic metadata and content is then fetched from the existing Fatcat
+catalog of open metadata, and full-text content is indexed from copies found on
+the public web, repositories, and publisher websites.
+
+<h2>Disclaimers</h2>
+
+<p>
+The fatcat catalog is intended to be a "universal" preservation and access
+archive, not a narrow curated collection of only the highest quality research
+content. This means that not all content has undergone peer-review, and some
+may have been uploaded to services like academic social networks (eg,
+researchgate) or institutional repositories with absolutely no human editorial
+review or filtering.
+
+<p>
+The catalog intends to capture metadata such as publication stage (draft,
+published, retracted), venue, and medium (journal article, web post,
+encyclopedia entry, frontmatter) to help filter through this content. But in
+some cases this metadata is incomplete or may be inaccurate. For example,
+pre-print PDF files may be incorrectly associated with the final published
+version of a work, or vice versa.
+
+
+<h2>Sources of Metadata</h2>
+
+The source of all bibliographic information is recorded in edit history
+metadata, which allows the provenance of all records to be reconstructed. A few
+major sources are worth highlighting here:
+
+<ul>
+  <li>Release metadata from <b>Crossref</b>, via their public
+    <a href="https://github.com/CrossRef/rest-api-doc">REST API</a>
+  <li>Release metadata and linked full-text content from NIH <b>Pubmed</b> and <b><a href="https://arxiv.org">arXiv.org</a></b>
+  <li>Release metadata and linked public domain full-text content from the <b>JSTOR</b> Early Journal Content collection
+  <li>Creator names and de-duplication from <b>ORCID</b>, via their annual public data releases
+  <li>Journal title metadata from <b>DOAJ</b>, <b>ISSN ROAD</b>, and <b>SHERPA/RoMEO</b>
+  <li>Full-text URL lists from <b><a href="https://core.ac.uk">CORE</a></b>,
+    <b><a href="http://unpaywall.org">Unpaywall</a></b>,
+    <b><a href="https://www.semanticscholar.org">Semantic Scholar</a></b>,
+    <b><a href="https://citeseerx.ist.psu.edu">CiteseerX</a></b>,
+    and <b><a href="https://www.microsoft.com/en-us/research/project/academic">Microsoft Academic Graph</a></b>.
+  <li><a href="https://guide.fatcat.wiki/sources.html">The Fatcat Guide</a> lists more major sources
+</ul>
+
+Many thanks for the hard work of all these projects, institutions, and individuals!

 {% endblock %}
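
The README added in this commit states that the fulltext index is world-readable through the native elasticsearch 6.8 API and that documents are keyed by their fatcat release ident. A minimal sketch of how a client might prepare a query against that index is shown below. It only builds the request and parses a response, without sending anything. The helper names `build_search_request` and `hit_ids` are hypothetical, the choice of a `query_string` query is an assumption (the commit does not document the index schema or field names), and the endpoint may no longer be online.

```python
import json
import urllib.request

# Index URL quoted from the README text in this commit.
FULLTEXT_INDEX = "https://search.fatcat.wiki/covid19_fatcat_fulltext"


def build_search_request(query, size=5, index_url=FULLTEXT_INDEX):
    """Build (but do not send) an Elasticsearch `_search` request.

    Uses a generic `query_string` query so that no field names have to be
    assumed; the real index schema is not described in the commit.
    """
    body = {
        "query": {"query_string": {"query": query}},
        "size": size,
    }
    return urllib.request.Request(
        index_url + "/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def hit_ids(response_body):
    """Return the document keys from a parsed `_search` response.

    Per the README, each key should be a fatcat release ident.
    """
    return [hit["_id"] for hit in response_body["hits"]["hits"]]
```

To actually run a search, the request object could be passed to `urllib.request.urlopen` and the decoded JSON response handed to `hit_ids`. Because the README describes this as a prototype service without production support, callers should expect timeouts and errors and handle them.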