author    bnewbold <bnewbold@archive.org>  2020-08-20 21:17:59 +0000
committer bnewbold <bnewbold@archive.org>  2020-08-20 21:17:59 +0000
commit    daf91b137483b7345448b597289c78f8fb3f9969 (patch)
tree      712c27d902235d8d007763b512c57eaecd8045ad /extra/sitemap/README.md
parent    5007ee299ce07b31db6d48cd4ab2587f87af53ab (diff)
parent    2a98d10be1cc1368f9510745bff07c343974d4a7 (diff)
Merge branch 'bnewbold-sitemap' into 'master'

basic sitemap setup

See merge request webgroup/fatcat!79
Diffstat (limited to 'extra/sitemap/README.md')
-rw-r--r--  extra/sitemap/README.md  69
1 file changed, 69 insertions, 0 deletions
diff --git a/extra/sitemap/README.md b/extra/sitemap/README.md
new file mode 100644
index 00000000..f72893cd
--- /dev/null
+++ b/extra/sitemap/README.md
@@ -0,0 +1,69 @@
+
+## HOWTO: Update
+
+After a container and release dump, as the `fatcat` user on the prod server:
+
+    cd /srv/fatcat/sitemap
+    export DATE=`date --iso-8601` # or whatever
+    /srv/fatcat/src/extra/sitemap/container_url_lists.sh $DATE /srv/fatcat/snapshots/container_export.json.gz
+    /srv/fatcat/src/extra/sitemap/release_url_lists.sh $DATE /srv/fatcat/snapshots/release_export_expanded.json.gz
+    # delete old sitemap url lists
+    /srv/fatcat/src/extra/sitemap/generate_sitemap_indices.py
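+
+Optionally, sanity-check one of the resulting shards against the size limits
+described under "Background" below (the shard file name here is illustrative):
+
+    # each text sitemap shard should stay under the 20k URL limit
+    zcat sitemap-releases-$DATE-00000.txt.gz | wc -l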
+
+## Background
+
+Google has a limit of 50k lines / 10 MB for text sitemap files, and 50k lines
+/ 50 MB for XML sitemap files. Google Scholar has indicated a smaller limit of
+20k URLs / 5 MB.
+
+For the time being, we will include only a subset of fatcat entities and pages
+in our sitemaps.
+
+- homepage, "about" pages
+- all container landing pages (~150k)
+- "best" release landing page for each work with fulltext (~25 million)
+
+In the short term, calculating "best" is tricky, so let's just take the first
+release with fulltext per work, as in the sketch below.
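+
+A minimal sketch of that selection, assuming each line of the expanded release
+dump carries `work_id`, `ident`, and a `files` array (those field names are
+assumptions about the export format, not verified):
+
+    # sketch: keep the first release per work that has any file attached
+    zcat release_export_expanded.json.gz \
+        | jq -r 'select((.files | length) > 0) | [.work_id, .ident] | @tsv' \
+        | awk -F'\t' '!seen[$1]++ {print $2}'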
+
+In tree form:
+
+- `/robots.txt`: static file (in web app)
+    - `/sitemap.xml`: about page, etc. static file (in web app)
+    - `/sitemap-containers-index.xml`: points to .txt URL lists; generated by scripts
+        - `/sitemap-containers-<date>-<shard>.txt`
+    - `/sitemap-releases-index.xml`: same as above
+        - `/sitemap-releases-<date>-<shard>.txt`
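+
+For reference, a sitemap index in the sitemaps.org XML format looks roughly
+like this (hostname and shard name illustrative):
+
+    <?xml version="1.0" encoding="UTF-8"?>
+    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+      <sitemap>
+        <loc>https://fatcat.wiki/sitemap-containers-2020-08-20-00000.txt</loc>
+      </sitemap>
+      <!-- ...one <sitemap> entry per shard... -->
+    </sitemapindex>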
+
+Workflow:
+
+- run bash script over the container dump, outputting compressed, sharded container sitemaps (sketched below)
+- run bash script over the work-grouped release dump, outputting compressed, sharded release sitemaps
+- run python script to output the XML sitemap index files
+- `scp` all of this into place
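+
+As a minimal sketch of the sharding step inside those bash scripts (the real
+scripts live in `extra/sitemap/`; the `jq` field name and URL pattern here are
+assumptions):
+
+    # sketch: turn a container export into sharded, compressed URL lists
+    zcat container_export.json.gz \
+        | jq -r '.ident' \
+        | awk '{print "https://fatcat.wiki/container/" $1}' \
+        | split --lines 20000 --numeric-suffixes --suffix-length 5 \
+            --additional-suffix .txt - sitemap-containers-$DATE-
+    gzip sitemap-containers-$DATE-*.txt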
+
+To make this work, we will configure an nginx rule to route all requests
+matching `/sitemap-*` to the directory `/srv/fatcat/sitemap/`, and will
+collect output there.
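+
+A sketch of such a rule, assumed to sit inside the existing fatcat nginx
+server block (not the deployed config):
+
+    location ~ ^/sitemap- {
+        root /srv/fatcat/sitemap;
+        try_files $uri =404;
+    }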
+
+## Ideas on Huge (complete) Index
+
+With a baseline of 100 million entities, that requires an index file pointing
+to at least 2,000 individual sitemaps (at 50k URLs each). 3 hex characters is
+12 bits, or 4096 shards; that seems like an ok granularity to start with.
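+
+A sketch of how URLs might be bucketed into those 4096 shards (hashing is one
+option, since fatcat identifiers are base32 rather than hex; this is an
+assumption, not existing code):
+
+    # sketch: route each URL into one of 4096 shard files by hash prefix
+    while read -r url; do
+        shard=$(printf '%s' "$url" | sha1sum | cut -c1-3)
+        echo "$url" >> "sitemap-releases-$shard.txt"
+    done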
+
+Should look into what archive.org does to generate its sitemap.xml; it seems
+simple, and comes in batches of exactly 50k.
+
+## Text Sitemaps
+
+Should be possible to create simple text-style sitemaps, one URL per line, and
+link to these from a sitemap index. This is appealing because the sitemaps can
+be generated very quickly from identifier SQL dump files run through UNIX
+commands (eg, to split and turn into URLs). Some script to create an XML
+sitemap index pointing at all the sitemaps would still be needed, though.
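+
+A sketch of such an index-generating script (shell here for illustration; the
+real `generate_sitemap_indices.py` is python, and the hostname and paths are
+assumptions):
+
+    # sketch: emit an XML sitemap index pointing at every release shard
+    {
+        echo '<?xml version="1.0" encoding="UTF-8"?>'
+        echo '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
+        for f in /srv/fatcat/sitemap/sitemap-releases-*.txt.gz; do
+            echo "  <sitemap><loc>https://fatcat.wiki/$(basename "$f")</loc></sitemap>"
+        done
+        echo '</sitemapindex>'
+    } > sitemap-releases-index.xml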
+
+
+## Resources
+
+Google sitemap verifier: https://support.google.com/webmasters/answer/7451001