-rw-r--r-- extra/sitemap/README.md 2
-rwxr-xr-x extra/sitemap/container_url_lists.sh 1
-rw-r--r-- proposals/2021-04-02_crawlability.md 76
-rw-r--r-- python/fatcat_tools/importers/pubmed.py 7
4 files changed, 84 insertions, 2 deletions
diff --git a/extra/sitemap/README.md b/extra/sitemap/README.md
index f72893cd..581ee9f3 100644
--- a/extra/sitemap/README.md
+++ b/extra/sitemap/README.md
@@ -8,7 +8,7 @@ After a container dump, as `fatcat` user on prod server:
/srv/fatcat/src/extra/sitemap/container_url_lists.sh $DATE /srv/fatcat/snapshots/container_export.json.gz
/srv/fatcat/src/extra/sitemap/release_url_lists.sh $DATE /srv/fatcat/snapshots/release_export_expanded.json.gz
# delete old sitemap url lists
- /srv/fatcat/src/extra/sitemap/generate_sitemap_indices.py
+ python3.8 /srv/fatcat/src/extra/sitemap/generate_sitemap_indices.py
## Background
diff --git a/extra/sitemap/container_url_lists.sh b/extra/sitemap/container_url_lists.sh
index fcc0f4b6..1a37c220 100755
--- a/extra/sitemap/container_url_lists.sh
+++ b/extra/sitemap/container_url_lists.sh
@@ -15,6 +15,7 @@ DATE="$1"
# eg, container_export.json.gz
EXPORT_FILE_GZ="$2"
+# TODO: remove stubs? only if we have releases?
zcat $EXPORT_FILE_GZ \
| jq .ident -r \
| awk '{print "https://fatcat.wiki/container/" $1 }' \
diff --git a/proposals/2021-04-02_crawlability.md b/proposals/2021-04-02_crawlability.md
new file mode 100644
index 00000000..ee9f3c5b
--- /dev/null
+++ b/proposals/2021-04-02_crawlability.md
@@ -0,0 +1,76 @@
+
+status: not-implemented
+
+Crawlability Improvements
+--------------------------
+
+NOTE: After some back and forth on this topic, we have decided for now to focus
+on having scholar.archive.org indexed, not fatcat.wiki. This proposal document
+is being kept as documentation of that decision.
+
+
+## Original Intro
+
+We are interested in making the fatcat corpus more crawlable/indexable by
+aggregators and academic search engines. For example, CiteseerX, Google
+Scholar, or Microsoft Academic (which themselves get used by other projects).
+
+Some open questions:
+
+- is the web.archive.org iframe for PDFs ok, or should we redirect to PDFs with `id_` in the datetime?
+
+
+
+## Redirect URLs and `citation_pdf_url`
+
+We suspect that some crawlers do not like that fatcat.wiki landing pages have
+`citation_pdf_url` fields that point to a different registered domain.
+`www.semanticscholar.org -> pdfs.semanticscholar.org` is presumably ok, but
+maybe `fatcat.wiki -> web.archive.org` and `fatcat.wiki -> archive.org` are
+not.
+
+Google Scholar docs also request the PDF link be "in the same subdirectory"
+(though this obviously isn't true on, eg, semanticscholar.org):
+
+> If this page shows only the abstract of the paper and you have the full text
+> in a separate file, e.g., in the PDF format, please specify the locations of
+> all full text versions using citation_pdf_url or DC.identifier tags. The
+> content of the tag is the absolute URL of the PDF file; for security reasons,
+> it must refer to a file in the same subdirectory as the HTML abstract.
+
+We also suspect that a redirect is probably fine. If a journal links from a
+landing page to a `.pdf` URL on the same domain, often there is an HTTP
+redirect to, eg, Amazon AWS. These seem to get indexed fine.
+
+So, potentially, having `citation_pdf_url` point to something on `fatcat.wiki`,
+which then redirects to `web.archive.org` or `archive.org`, would be
+sufficient. This would also be a reasonable URL for external services to point
+to, in that the specific access mechanism it redirects to could vary as the
+catalog is improved.
+
+So, proposing two new fatcat.wiki web endpoints:
+
+ /release/<ident>/access-redirect
+ /file/<ident>/access-redirect
+
+Both of these would use an HTTP 302 "temporary" redirect to the "best" archival
+fulltext copy.
+
+If somebody wants to link to a specific file (by hash), they should use the
+file link. If they want to link to any fulltext access copy, they should use
+the release link.
+
+Open questions:
+
+- should the redirect only ever go to archive.org properties?
+- for releases, should the file type and access type be filtered? maybe with a
+ query parameter, or a `.pdf` suffix?
+
+
+## "Browsable" Site
+
+Another improvement would be to make the site more "browsable". To start, an
+index of journals (by first letter, publisher, country, or similar), then
+organize papers under the journal by volume, year, etc. This would give
+crawlers a way to spider all papers in the index.
+
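The `access-redirect` behavior described in the proposal above could be sketched roughly as follows. This is a minimal illustration only, not code from the commit: the function names (`pick_access_url`, `access_redirect`), the host preference order, and the 404 fallback are all assumptions, and it takes the "only archive.org properties" answer to the proposal's open question as a given.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlparse

# Assumed preference order for the "best" archival copy (hypothetical;
# the proposal leaves this as an open question).
ARCHIVE_HOSTS = ("web.archive.org", "archive.org")


def pick_access_url(urls: List[str]) -> Optional[str]:
    """Return the first URL on the most-preferred archive host, if any."""
    for host in ARCHIVE_HOSTS:
        for url in urls:
            if urlparse(url).hostname == host:
                return url
    return None


def access_redirect(urls: List[str]) -> Tuple[int, Optional[str]]:
    """Return (status, location): a 302 "temporary" redirect to the chosen
    copy, or 404 (an assumed fallback) when no archival copy is known."""
    target = pick_access_url(urls)
    if target is None:
        return (404, None)
    return (302, target)
```

A web handler for `/release/<ident>/access-redirect` or `/file/<ident>/access-redirect` would look up the entity's URLs and turn the returned tuple into an HTTP response.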
diff --git a/python/fatcat_tools/importers/pubmed.py b/python/fatcat_tools/importers/pubmed.py
index d32fcefa..1cdb450b 100644
--- a/python/fatcat_tools/importers/pubmed.py
+++ b/python/fatcat_tools/importers/pubmed.py
@@ -768,7 +768,12 @@ class PubmedImporter(EntityImporter):
self.counts["exists"] += 1
return False
- if existing and existing.ext_ids.pmid and (existing.refs or not re.refs):
+ if (
+ existing
+ and existing.ext_ids.pmid
+ and (existing.ext_ids.pmcid or not re.ext_ids.pmcid)
+ and (existing.refs or not re.refs)
+ ):
# TODO: any other reasons to do an update?
# don't update if it already has PMID
self.counts["exists"] += 1
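The effect of the new condition in the `pubmed.py` hunk can be seen in isolation with a small stand-alone sketch. The helper names `should_skip_update` and `entity` are hypothetical, the entities are plain `SimpleNamespace` stand-ins rather than real fatcat release entities, and `incoming` corresponds to `re` in the diff.

```python
from types import SimpleNamespace


def entity(pmid, pmcid, refs):
    """Build a stand-in release entity with only the fields the check reads."""
    return SimpleNamespace(
        ext_ids=SimpleNamespace(pmid=pmid, pmcid=pmcid),
        refs=refs,
    )


def should_skip_update(existing, incoming):
    """Mirror the condition from the diff: skip the update only when the
    existing release already has a PMID and the incoming record would add
    neither a PMCID nor references."""
    return bool(
        existing
        and existing.ext_ids.pmid
        and (existing.ext_ids.pmcid or not incoming.ext_ids.pmcid)
        and (existing.refs or not incoming.refs)
    )
```

With the old condition, an existing release that had a PMID and references but no PMCID would have been skipped even when the incoming record carried a PMCID; with the added clause it is updated instead.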