Diffstat (limited to 'extra/sitemap/README.md')
-rw-r--r-- extra/sitemap/README.md 37
1 files changed, 36 insertions, 1 deletions
diff --git a/extra/sitemap/README.md b/extra/sitemap/README.md
index 6963bb1f..735ac925 100644
--- a/extra/sitemap/README.md
+++ b/extra/sitemap/README.md
@@ -1,6 +1,41 @@
+## Background
+
Google has a limit of 50k lines / 10 MByte for text sitemap files, and 50K
-lines / 50 MByte for XML site map files.
+lines / 50 MByte for XML site map files. Google Scholar has indicated a smaller
+20k URL / 5 MB limit.
+
+For the time being, we will include only a subset of fatcat entities and pages
+in our sitemaps.
+
+- homepage, "about" pages
+- all container landing pages (~150k)
+- "best" release landing page for each work with fulltext (~25 million)
+
+In the short term, calculating "best" is tricky so let's just take the first
+release with fulltext per work.
+
+In tree form:
+
+- `/robots.txt`: static file (in web app)
+ - `/sitemap.xml`: about page, etc.; static file (in web app)
+ - `/sitemap-containers-index.xml`: points to .txt URL lists; generated by scripts
+ - `/sitemap-containers-<date>-<shard>.txt`
+ - `/sitemap-releases-index.xml`: same as above
+ - `/sitemap-releases-<date>-<shard>.txt`
+
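As a rough illustration of what an index file such as `/sitemap-containers-index.xml` could contain, here is a minimal Python sketch. It is not the project's actual script: the function name `build_sitemap_index` and the `example.org` URLs are hypothetical, and only the `sitemapindex` / `sitemap` / `loc` elements of the standard sitemaps.org schema are used.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(shard_urls: List[str]) -> str:
    """Return a sitemap index document with one <sitemap> entry per shard URL.

    `shard_urls` are absolute URLs of the sharded .txt URL lists
    (hypothetical values in the usage below).
    """
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in shard_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical host, date, and shard numbering, for illustration only.
    print(build_sitemap_index([
        "https://example.org/sitemap-containers-2020-08-19-00000.txt",
        "https://example.org/sitemap-containers-2020-08-19-00001.txt",
    ]))
```

The same function would serve for the releases index, given the release shard URLs instead.
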
+Workflow:
+
+- run bash script over container dump, outputting compressed, sharded container sitemaps
+- run bash script over work-grouped releases, outputting compressed, sharded release sitemaps
+- run python script to output top-level `sitemap.xml`
+- `scp` all of this into place
+
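The sharding step in the workflow above can be sketched in Python (the README itself uses bash scripts for this; the version below is only an illustration). It assumes the smaller Google Scholar limit of 20k URLs per file, does not check the 5 MB size limit, and leaves compression out. The names `shard_urls` and `shard_filename`, and the zero-padded shard number, are hypothetical choices, not taken from the repository.

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Assumption: use the smaller Google Scholar limit mentioned in Background.
MAX_URLS_PER_SHARD = 20_000


def shard_urls(urls: Iterable[str],
               max_urls: int = MAX_URLS_PER_SHARD) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `max_urls` URLs."""
    it = iter(urls)
    while True:
        chunk = list(islice(it, max_urls))
        if not chunk:
            return
        yield chunk


def shard_filename(kind: str, date: str, shard: int) -> str:
    """Build a name following the `sitemap-<kind>-<date>-<shard>.txt` pattern.

    The five-digit zero padding is an assumption for illustration.
    """
    return f"sitemap-{kind}-{date}-{shard:05d}.txt"


if __name__ == "__main__":
    # Hypothetical URLs; a real run would read them from an entity dump.
    urls = (f"https://example.org/container/{i}" for i in range(45_000))
    for i, chunk in enumerate(shard_urls(urls)):
        print(shard_filename("containers", "2020-08-19", i), len(chunk))
```

Streaming through an iterator keeps memory use bounded by one shard, which matters for the roughly 25 million release URLs.
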
+To make this work, we will configure an nginx rule to point all requests like
+`/sitemap-*` to the directory `/srv/fatcat/sitemap/`, and will collect output
+there.
+
+## Ideas on Huge (complete) Index
With a baseline of 100 million entities, that requires an index file pointing
to at least 2000x individual sitemaps. 3 hex characters is 12 bits, or 4096