-rw-r--r-- | extra/sitemap/README.md | 2
-rwxr-xr-x | extra/sitemap/container_url_lists.sh | 1
-rw-r--r-- | proposals/2021-04-02_crawlability.md | 76
-rw-r--r-- | python/fatcat_tools/importers/pubmed.py | 7
4 files changed, 84 insertions, 2 deletions
diff --git a/extra/sitemap/README.md b/extra/sitemap/README.md
index f72893cd..581ee9f3 100644
--- a/extra/sitemap/README.md
+++ b/extra/sitemap/README.md
@@ -8,7 +8,7 @@ After a container dump, as `fatcat` user on prod server:
     /srv/fatcat/src/extra/sitemap/container_url_lists.sh $DATE /srv/fatcat/snapshots/container_export.json.gz
     /srv/fatcat/src/extra/sitemap/release_url_lists.sh $DATE /srv/fatcat/snapshots/release_export_expanded.json.gz
     # delete old sitemap url lists
-    /srv/fatcat/src/extra/sitemap/generate_sitemap_indices.py
+    python3.8 /srv/fatcat/src/extra/sitemap/generate_sitemap_indices.py
 
 ## Background
 
diff --git a/extra/sitemap/container_url_lists.sh b/extra/sitemap/container_url_lists.sh
index fcc0f4b6..1a37c220 100755
--- a/extra/sitemap/container_url_lists.sh
+++ b/extra/sitemap/container_url_lists.sh
@@ -15,6 +15,7 @@ DATE="$1"
 # eg, container_export.json.gz
 EXPORT_FILE_GZ="$2"
 
+# TODO: remove stubs? only if we have releases?
 zcat $EXPORT_FILE_GZ \
     | jq .ident -r \
     | awk '{print "https://fatcat.wiki/container/" $1 }' \
diff --git a/proposals/2021-04-02_crawlability.md b/proposals/2021-04-02_crawlability.md
new file mode 100644
index 00000000..ee9f3c5b
--- /dev/null
+++ b/proposals/2021-04-02_crawlability.md
@@ -0,0 +1,76 @@
+
+status: not-implemented
+
+Crawlability Improvements
+--------------------------
+
+NOTE: After some back and forth on this topic, we have decided for now to focus
+on having scholar.archive.org indexed, not fatcat.wiki. This proposal document
+document is being kept as documentation of that decision.
+
+
+## Original Intro
+
+We are interested in making the fatcat corpus more crawlable/indexable by
+aggregators and academic search enginges. For example, CiteseerX, Google
+Scholar, or Microsoft Academic (when themselves get used by other projects).
+
+Some open questions:
+
+- is the web.archive.org iframe for PDFs ok, or should we redirect to PDFs with `id_` in the datetime?
+
+
+
+## Redirect URLs and `citation_pdf_url`
+
+We suspect that some crawlers do not like that fatcat.wiki landing pages have
+`citation_pdf_url` fields that point to a different registered domain.
+`www.semanticscholar.org -> pdfs.semanticscholar.org` is presumably ok, but
+maybe `fatcat.wiki -> web.archive.org` and `fatcat.wiki -> archive.org` are
+not.
+
+Google Scholar docs also request the PDF link be "in the same subdirectory"
+(though this obviously isn't true on, eg, semanticscholar.org):
+
+> If this page shows only the abstract of the paper and you have the full text
+> in a separate file, e.g., in the PDF format, please specify the locations of
+> all full text versions using citation_pdf_url or DC.identifier tags. The
+> content of the tag is the absolute URL of the PDF file; for security reasons,
+> it must refer to a file in the same subdirectory as the HTML abstract.
+
+Also suspect that a redirect is probably find. If a journal links from a
+landing page to a `.pdf` URL on the same domain, often there is an HTTP
+redirect to, eg, amazon AWS. These seem to get indexed fine.
+
+So, potentially have `citation_pdf_url` point to something on `fatcat.wiki`,
+which then redirects to `web.archive.org` or `archive.org`, would be
+sufficient. This would also be a reasonable URL for external services to point
+to, in that which specific access mechanism is redirected would vary as the
+catalog is improved.
+
+So, proposing two new web fatcat.wiki endpoints:
+
+    /release/<ident>/access-redirect
+    /file/<ident>/access-redirect
+
+Both of these would use an HTTP 302 "temporary" redirect to the "best" archival
+fulltext copy.
+
+If somebody wants to link to a specific file (by hash), they should use the
+file link. If they want to link to any fulltext access copy, then should use
+the release link.
+
+Open questions:
+
+- should the redirect only ever go to archive.org properties?
+- for releases, should the file type and access type be filtered? maybe with a
+  query parameter, or a `.pdf` suffix?
+
+
+## "Browsable" Site
+
+Another improvement would be to make the site more "browsable". To start, an
+index of journals (by first letter, publisher, country, or similar), then
+organize papers under the journal by volume, year, etc. This would give
+crawlers a way to spider all papers in the index.
+
diff --git a/python/fatcat_tools/importers/pubmed.py b/python/fatcat_tools/importers/pubmed.py
index d32fcefa..1cdb450b 100644
--- a/python/fatcat_tools/importers/pubmed.py
+++ b/python/fatcat_tools/importers/pubmed.py
@@ -768,7 +768,12 @@ class PubmedImporter(EntityImporter):
             self.counts["exists"] += 1
             return False
 
-        if existing and existing.ext_ids.pmid and (existing.refs or not re.refs):
+        if (
+            existing
+            and existing.ext_ids.pmid
+            and (existing.ext_ids.pmcid or not re.ext_ids.pmcid)
+            and (existing.refs or not re.refs)
+        ):
             # TODO: any other reasons to do an update?
             # don't update if it already has PMID
             self.counts["exists"] += 1
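
The new skip condition in the `pubmed.py` hunk can be exercised as a standalone predicate. This is a minimal sketch, not the importer's code: `should_skip_update`, `make_release`, and the `SimpleNamespace` stand-ins are hypothetical names introduced here, and only the three fields the condition reads (`ext_ids.pmid`, `ext_ids.pmcid`, `refs`) are modelled. The incoming entity is called `new_release` here instead of `re` to avoid confusion with the regex module.

```python
from types import SimpleNamespace


def make_release(pmid=None, pmcid=None, refs=None):
    # Hypothetical stand-in for a release entity; the real entity has many
    # more fields, but the skip condition only reads these three.
    return SimpleNamespace(
        ext_ids=SimpleNamespace(pmid=pmid, pmcid=pmcid),
        refs=refs or [],
    )


def should_skip_update(existing, new_release):
    """Return True when the existing release would be counted as "exists".

    Same boolean structure as the condition in the hunk: skip only if the
    existing release has a PMID, is not missing a PMCID that the incoming
    record supplies, and is not missing refs that the incoming record
    supplies.
    """
    return bool(
        existing
        and existing.ext_ids.pmid
        and (existing.ext_ids.pmcid or not new_release.ext_ids.pmcid)
        and (existing.refs or not new_release.refs)
    )


if __name__ == "__main__":
    existing = make_release(pmid="1")
    incoming = make_release(pmid="1", pmcid="PMC1")
    # The existing release lacks the PMCID that the incoming record carries,
    # so the skip condition does not hold.
    print(should_skip_update(existing, incoming))
```

Under the previous condition, an existing release with a PMID but no PMCID would have been skipped even when the incoming PubMed record carried a PMCID; with the added clause that case falls through the skip branch.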