start notes/proposal about 'crawlability' improvements

author: Bryan Newbold <bnewbold@robocracy.org> 2021-04-02 17:09:59 -0700
committer: Bryan Newbold <bnewbold@robocracy.org> 2021-11-10 17:08:07 -0800
commit: 92e2b302192f5d8c1f0a4e1918bc705bb35ea9b1 (patch)
tree: 534452e20265f237ab331eff1921f2071177a628
parent: 13b107d8397c47a3159800a34307063f8e2b509e (diff)
download: fatcat-92e2b302192f5d8c1f0a4e1918bc705bb35ea9b1.tar.gz
fatcat-92e2b302192f5d8c1f0a4e1918bc705bb35ea9b1.zip
1 files changed, 68 insertions, 0 deletions
diff --git a/proposals/2021-04-02_crawlability.md b/proposals/2021-04-02_crawlability.md
new file mode 100644
index 00000000..6b9ef66c
--- /dev/null
+++ b/proposals/2021-04-02_crawlability.md
@@ -0,0 +1,68 @@
+
+status: wip
+
+Crawlability Improvements
+--------------------------
+
+We are interested in making the fatcat corpus more crawlable/indexable by
+aggregators and academic search engines. For example, CiteseerX, Google
+Scholar, or Microsoft Academic (which themselves get used by other projects).
+
+Some open questions:
+
+- is the web.archive.org iframe for PDFs ok, or should we redirect to PDFs with `id_` in the datetime?
+
+
+## Redirect URLs and `citation_pdf_url`
+
+We suspect that some crawlers do not like that fatcat.wiki landing pages have
+`citation_pdf_url` fields that point to a different registered domain.
+`www.semanticscholar.org -> pdfs.semanticscholar.org` is presumably ok, but
+maybe `fatcat.wiki -> web.archive.org` and `fatcat.wiki -> archive.org` are
+not.
+
+Google Scholar docs also request the PDF link be "in the same subdirectory"
+(though this obviously isn't true on, eg, semanticscholar.org):
+
+> If this page shows only the abstract of the paper and you have the full text
+> in a separate file, e.g., in the PDF format, please specify the locations of
+> all full text versions using citation_pdf_url or DC.identifier tags. The
+> content of the tag is the absolute URL of the PDF file; for security reasons,
+> it must refer to a file in the same subdirectory as the HTML abstract.
+
+We also suspect that a redirect is probably fine. If a journal links from a
+landing page to a `.pdf` URL on the same domain, often there is an HTTP
+redirect to, eg, Amazon AWS. These seem to get indexed fine.
+
+So, potentially, having `citation_pdf_url` point to something on `fatcat.wiki`,
+which then redirects to `web.archive.org` or `archive.org`, would be
+sufficient. This would also be a reasonable URL for external services to point
+to, since the specific access mechanism it redirects to could change as the
+catalog is improved.
+
+So, proposing two new fatcat.wiki web endpoints:
+
+    /release/<ident>/access-redirect
+    /file/<ident>/access-redirect
+
+Both of these would use an HTTP 302 "temporary" redirect to the "best" archival
+fulltext copy.
+
+If somebody wants to link to a specific file (by hash), they should use the
+file link. If they want to link to any fulltext access copy, they should use
+the release link.
+
+Open questions:
+
+- should the redirect only ever go to archive.org properties?
+- for releases, should the file type and access type be filtered? maybe with a
+  query parameter, or a `.pdf` suffix?
+
+
+## "Browsable" Site
+
+Another improvement would be to make the site more "browsable". To start, add
+an index of journals (by first letter, publisher, country, or similar), then
+organize papers under each journal by volume, year, etc. This would give
+crawlers a way to spider all papers in the index.
+
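
Note on the proposed `access-redirect` endpoints: the selection step could be sketched roughly as below. This is a minimal illustration, not the fatcat implementation. The names `PREFERRED_HOSTS`, `pick_access_url`, and `access_redirect` are hypothetical, and the host preference order is an assumption, since the proposal leaves open whether redirects should only go to archive.org properties.

```python
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlparse

# Assumed preference order (hypothetical): wayback replay first, then
# archive.org items. The proposal does not fix this ordering.
PREFERRED_HOSTS = ["web.archive.org", "archive.org"]


def pick_access_url(urls: List[str]) -> Optional[str]:
    """Return the first URL on the most-preferred host, or None if no match."""
    for host in PREFERRED_HOSTS:
        for url in urls:
            if urlparse(url).hostname == host:
                return url
    return None


def access_redirect(urls: List[str]) -> Tuple[int, Dict[str, str]]:
    """Build a (status, headers) pair: 302 to the chosen copy, else 404."""
    target = pick_access_url(urls)
    if target is None:
        return 404, {}
    # 302 "temporary", as proposed, so the target can change over time.
    return 302, {"Location": target}
```

A web handler for `/release/<ident>/access-redirect` would presumably gather the URLs of all files attached to the release and pass them to `access_redirect`, while the `/file/<ident>/access-redirect` variant would pass only that file's URLs.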