diff options
Diffstat (limited to 'proposals/2021-04-02_crawlability.md')
-rw-r--r-- | proposals/2021-04-02_crawlability.md | 76 |
1 files changed, 76 insertions, 0 deletions
diff --git a/proposals/2021-04-02_crawlability.md b/proposals/2021-04-02_crawlability.md new file mode 100644 index 00000000..ee9f3c5b --- /dev/null +++ b/proposals/2021-04-02_crawlability.md @@ -0,0 +1,76 @@ + +status: not-implemented + +Crawlability Improvements +-------------------------- + +NOTE: After some back and forth on this topic, we have decided for now to focus +on having scholar.archive.org indexed, not fatcat.wiki. This proposal document +document is being kept as documentation of that decision. + + +## Original Intro + +We are interested in making the fatcat corpus more crawlable/indexable by +aggregators and academic search enginges. For example, CiteseerX, Google +Scholar, or Microsoft Academic (when themselves get used by other projects). + +Some open questions: + +- is the web.archive.org iframe for PDFs ok, or should we redirect to PDFs with `id_` in the datetime? + + + +## Redirect URLs and `citation_pdf_url` + +We suspect that some crawlers do not like that fatcat.wiki landing pages have +`citation_pdf_url` fields that point to a different registered domain. +`www.semanticscholar.org -> pdfs.semanticscholar.org` is presumably ok, but +maybe `fatcat.wiki -> web.archive.org` and `fatcat.wiki -> archive.org` are +not. + +Google Scholar docs also request the PDF link be "in the same subdirectory" +(though this obviously isn't true on, eg, semanticscholar.org): + +> If this page shows only the abstract of the paper and you have the full text +> in a separate file, e.g., in the PDF format, please specify the locations of +> all full text versions using citation_pdf_url or DC.identifier tags. The +> content of the tag is the absolute URL of the PDF file; for security reasons, +> it must refer to a file in the same subdirectory as the HTML abstract. + +Also suspect that a redirect is probably find. If a journal links from a +landing page to a `.pdf` URL on the same domain, often there is an HTTP +redirect to, eg, amazon AWS. These seem to get indexed fine. + +So, potentially have `citation_pdf_url` point to something on `fatcat.wiki`, +which then redirects to `web.archive.org` or `archive.org`, would be +sufficient. This would also be a reasonable URL for external services to point +to, in that which specific access mechanism is redirected would vary as the +catalog is improved. + +So, proposing two new web fatcat.wiki endpoints: + + /release/<ident>/access-redirect + /file/<ident>/access-redirect + +Both of these would use an HTTP 302 "temporary" redirect to the "best" archival +fulltext copy. + +If somebody wants to link to a specific file (by hash), they should use the +file link. If they want to link to any fulltext access copy, then should use +the release link. + +Open questions: + +- should the redirect only ever go to archive.org properties? +- for releases, should the file type and access type be filtered? maybe with a + query parameter, or a `.pdf` suffix? + + +## "Browsable" Site + +Another improvement would be to make the site more "browsable". To start, an +index of journals (by first letter, publisher, country, or similar), then +organize papers under the journal by volume, year, etc. This would give +crawlers a way to spider all papers in the index. + |