From 4b152e02d1a0d0d7a9a391ed211ecd6f304d6962 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 28 Apr 2021 20:49:54 -0700
Subject: proposal for landing pages, access redirects, and sitemaps

---
 proposals/2021-04-28_indexability.md | 106 +++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 proposals/2021-04-28_indexability.md

diff --git a/proposals/2021-04-28_indexability.md b/proposals/2021-04-28_indexability.md
new file mode 100644
index 0000000..cfa928f
--- /dev/null
+++ b/proposals/2021-04-28_indexability.md
@@ -0,0 +1,106 @@
+
+subject: work-in-progress
+
+Persistent Landing Pages, Access URLs, Sitemaps
+===============================================
+
+This is a proposal to start hosting persistent landing pages (for "work"
+entities) and access redirect URLs on the scholar.archive.org domain. It also
+proposes creating bulk sitemaps linking to these resources.
+
+The initial motivation for these features is to enable indexing, aggregation,
+and linking to preservation content by large academic search engines (like
+Google Scholar, Microsoft Academic, and lens.org), as well as simple linking
+from platforms like Wikipedia. All of these may, for various reasons, prefer
+linking directly to PDF files, and to have links on a *.archive.org domain (as
+opposed to fatcat.wiki).
+
+
+# Work Landing Pages
+
+New URL endpoint:
+
+    /work/
+
+Work landing pages will summarize bibliographic metadata about the work, list
+versions ("releases") and list access options. The landing pages will include
+bibliographic metadata summarized in HTML tags for the "primary" release.
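As a rough illustration of the metadata embedding described above, the sketch below renders a list of name/value pairs as HTML `<meta>` tags. It assumes the Highwire Press style `citation_*` tag names commonly read by academic search engines; only `citation_pdf_url` is named in this proposal, so the other tag names, the function name, and the example values are hypothetical.

```python
from html import escape
from typing import List, Tuple


def render_citation_meta_tags(fields: List[Tuple[str, str]]) -> str:
    """Render (name, value) pairs as one <meta> tag per line.

    A list of pairs is used (rather than a dict) so that a tag name such as
    citation_author can be repeated. Names and values are HTML-escaped.
    """
    lines = []
    for name, value in fields:
        lines.append(
            '<meta name="{}" content="{}">'.format(
                escape(name, quote=True), escape(value, quote=True)
            )
        )
    return "\n".join(lines)


# Hypothetical metadata for the "primary" release of a work.
example_fields = [
    ("citation_title", 'Tom & "Jerry": A Study'),
    ("citation_author", "Doe, Jane"),
    ("citation_author", "Roe, Richard"),
    ("citation_pdf_url", "https://scholar.archive.org/access-redirect/EXAMPLE.pdf"),
]
example_html = render_citation_meta_tags(example_fields)
```

The output would be placed in the `<head>` of the landing page. Escaping matters here because titles routinely contain quotes and ampersands.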
+
+
+In the future, these pages could host additional features or sub-pages
+(new endpoints), such as:
+
+- citation graph lists or visualizations
+- alternative reading interface
+- content previews
+
+
+## Design Notes
+
+The `citation_pdf_url` metadata tag will only link to a PDF file hosted on a
+*.archive.org domain (aka, archive.org files or web.archive.org web-archived
+files), via the access redirect URLs mentioned below. At least initially, only
+PDF files which correspond to the "primary" version of the work will be
+included. Eg, if there is a published release, a file manifestation of that
+release will be linked, not earlier pre-print or accepted manuscript versions.
+This behavior may change at some point to include "green" access links from the
+"work" landing page.
+
+Alternatively, we could have landing pages only for "releases" (versions), like
+those that already exist on fatcat.wiki. This would make the decision about
+which files to link to simpler.
+
+However, to date scholar.archive.org as a product/service has taken the
+approach of simplifying the fatcat.wiki data model in the interest of
+usability, and will probably continue with that approach here.
+
+Landing pages will be rendered from a simple, single "GET" request to the same
+elasticsearch backend index; no new backing services (eg, api.fatcat.wiki) are
+introduced.
+
+# Access Redirect URLs
+
+Some academic search engines require a `citation_pdf_url` link from the same
+domain as the landing page, with an optional HTTP redirect.
+
+New URL endpoint:
+
+    /access-redirect/.pdf
+
+Requests to such URLs will redirect (HTTP 302) to an *.archive.org access
+location of the exact file (by sha1), if known.
+
+It is likely that in the future `.xml` and `.epub` access redirect links would
+be added in the same format. It is unclear what the scheme would be for HTML
+content (SHA-1 of the "primary" HTML document? or wayback timestamp and URL?).
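The redirect behavior described above could be sketched roughly as follows. This is an illustration only: it assumes the path carries a 40-character lowercase hex SHA-1 before the `.pdf` suffix, uses a plain dict as a hypothetical stand-in for the backend lookup, and returns 404 when the file is not known (the proposal only says "if known"). All function and variable names are made up for the example.

```python
import re
from typing import Dict, Optional, Tuple
from urllib.parse import urlparse

# Assumption: the identifier in the path is a lowercase hex SHA-1 digest.
ACCESS_PATH_RE = re.compile(r"/access-redirect/([0-9a-f]{40})\.pdf")


def is_archive_org_url(url: str) -> bool:
    """True if the URL's host is archive.org or a subdomain of it."""
    host = urlparse(url).hostname or ""
    return host == "archive.org" or host.endswith(".archive.org")


def resolve_access_redirect(
    path: str, known_access_urls: Dict[str, str]
) -> Tuple[int, Optional[str]]:
    """Map a request path to (HTTP status, redirect location).

    known_access_urls is a hypothetical mapping from SHA-1 to access URL,
    standing in for whatever lookup the real service performs.
    """
    match = ACCESS_PATH_RE.fullmatch(path)
    if match is None:
        return 404, None
    access_url = known_access_urls.get(match.group(1))
    # Only redirect to *.archive.org locations, as the proposal specifies.
    if access_url is None or not is_archive_org_url(access_url):
        return 404, None
    return 302, access_url


# Example with made-up data.
example_sha1 = "a" * 40
example_index = {
    example_sha1: "https://web.archive.org/web/2021/https://example.com/paper.pdf",
    "b" * 40: "https://example.com/not-archived.pdf",
}
example_result = resolve_access_redirect(
    "/access-redirect/" + example_sha1 + ".pdf", example_index
)
```

Keeping the resolution logic as a pure function, separate from any web framework, makes the status and location decisions easy to test in isolation.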
+
+This URL structure was chosen to reduce confusion that the file might be served
+from scholar.archive.org itself ("redirect"); to indicate the filetype
+expected; and to encode information about which resource is being linked to
+("content addressable").
+
+
+## Design Notes
+
+An alternative would have been to scope the URL below the work itself, eg:
+
+    /work//access-redirect/.pdf
+
+Such URLs would be quite long.
+
+These redirects are implemented by querying the same scholar elasticsearch
+backend index for fulltext access with the matching file SHA-1, and using the
+`access_url` returned.
+
+
+# Sitemap
+
+The sitemap setup will be copied from fatcat.wiki. The new resources (URLs)
+will include:
+
+    /robots.txt - updated to include sitemap references
+    /sitemap.xml - basic generic list of pages (homepage, about, userguide)
+    /sitemap-index-works.xml - XML file pointing to many sub-sitemap files; includes lastmod metadata
+    /sitemap-works-YYYY-MM-DD-NNNNN.txt.gz - series of timestamped "simple" sitemaps (URL list files)
+
+Only works for which there is an appropriate fulltext access URL
--
cgit v1.2.3