iterate on sitemap generation

author: Bryan Newbold <bnewbold@robocracy.org> 2020-08-19 22:55:05 -0700
committer: Bryan Newbold <bnewbold@robocracy.org> 2020-08-19 22:56:31 -0700
commit: 5f282a6267182214080ca36bcec4da1755589b46 (patch)
tree: c6f9bd5a84da9e1c2e53aa6af2931df6f9110e60
parent: 88a99387e09c7c43803129e72215ef3f6b4cafc6 (diff)
download: fatcat-5f282a6267182214080ca36bcec4da1755589b46.tar.gz
fatcat-5f282a6267182214080ca36bcec4da1755589b46.zip
6 files changed, 119 insertions, 7 deletions
diff --git a/extra/sitemap/.gitignore b/extra/sitemap/.gitignore
new file mode 100644
index 00000000..5dd7dadc
--- /dev/null
+++ b/extra/sitemap/.gitignore
@@ -0,0 +1,3 @@
+*.txt.gz
+*.xml
+*.json.gz
diff --git a/extra/sitemap/README.md b/extra/sitemap/README.md
index 6963bb1f..735ac925 100644
--- a/extra/sitemap/README.md
+++ b/extra/sitemap/README.md
@@ -1,6 +1,41 @@
 
+## Background
+
 Google has a limit of 50k lines / 10 MByte for text sitemap files, and 50K
-lines / 50 MByte for XML site map files.
+lines / 50 MByte for XML site map files. Google Scholar has indicated a smaller
+20k URL / 5 MB limit.
+
+For the time being, we will include only a subset of fatcat entities and pages
+in our sitemaps.
+
+- homepage, "about" pages
+- all container landing pages (~150k)
+- "best" release landing page for each work with fulltext (~25 million)
+
+In the short term, calculating "best" is tricky so let's just take the first
+release with fulltext per work.
+
+In tree form:
+
+- `/robots.txt`: static file (in web app)
+  - `/sitemap.xml`: about page, etc. static file (in web app)
+  - `/sitemap-index-containers.xml`: points to .txt.gz URL lists; generated by scripts
+    - `/sitemap-containers-<date>-<shard>.txt.gz`
+  - `/sitemap-index-releases.xml`: same as above
+    - `/sitemap-releases-<date>-<shard>.txt.gz`
+
+Workflow:
+
+- run bash script over container dump, outputting compressed, sharded container sitemaps
+- run bash script over work-grouped release dump, outputting compressed, sharded release sitemaps
+- run python script to output the per-entity sitemap index XML files
+- `scp` all of this into place
+
+To make this work, we will configure an nginx rule to point all requests like
+`/sitemap-*` to the directory `/srv/fatcat/sitemap/`, and will collect output
+there.
+
+## Ideas on Huge (complete) Index
 
 With a baseline of 100 million entities, that requires an index file pointing
 to at least 2000x individual sitemaps. 3 hex characters is 12 bits, or 4096
diff --git a/extra/sitemap/container_url_lists.sh b/extra/sitemap/container_url_lists.sh
new file mode 100755
index 00000000..fcc0f4b6
--- /dev/null
+++ b/extra/sitemap/container_url_lists.sh
@@ -0,0 +1,23 @@
+#!/usr/bin/env bash
+
+set -e              # fail on error
+set -u              # fail if variable not set in substitution
+set -o pipefail     # fail if part of a '|' command fails
+
+: ${1?' You did not supply a date argument'}
+: ${2?' You did not supply an input file (JSON gzip)'}
+if [ ! -f "$2" ] ; then
+  echo "Input file not found: $2" && exit 1;
+fi
+
+# eg, 2020-08-19
+DATE="$1"
+# eg, container_export.json.gz
+EXPORT_FILE_GZ="$2"
+
+zcat "$EXPORT_FILE_GZ" \
+    | jq .ident -r \
+    | awk '{print "https://fatcat.wiki/container/" $1 }' \
+    | split --lines 20000 - sitemap-containers-$DATE- -d -a 5 --additional-suffix .txt
+
+gzip sitemap-containers-*.txt
diff --git a/extra/sitemap/generate_sitemap_indices.py b/extra/sitemap/generate_sitemap_indices.py
new file mode 100755
index 00000000..9766ac1f
--- /dev/null
+++ b/extra/sitemap/generate_sitemap_indices.py
@@ -0,0 +1,28 @@
+#!/usr/bin/env python3
+
+import sys
+import glob
+import datetime
+
+def index_entity(entity_type, output):
+
+    now = datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="seconds")
+    print("""<?xml version="1.0" encoding="UTF-8"?>""", file=output)
+    print("""<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">""", file=output)
+
+    for filename in sorted(glob.glob(f"sitemap-{entity_type}-*.txt.gz")):
+        print("  <sitemap>", file=output)
+        print(f"    <loc>https://fatcat.wiki/{filename}</loc>", file=output)
+        print(f"    <lastmod>{now}</lastmod>", file=output)
+        print("  </sitemap>", file=output)
+
+    print("</sitemapindex>", file=output)
+
+def main():
+    with open('sitemap-index-containers.xml', 'w') as output:
+        index_entity("containers", output)
+    with open('sitemap-index-releases.xml', 'w') as output:
+        index_entity("releases", output)
+
+if __name__=="__main__":
+    main()
diff --git a/extra/sitemap/release_url_lists.sh b/extra/sitemap/release_url_lists.sh
new file mode 100755
index 00000000..4190011f
--- /dev/null
+++ b/extra/sitemap/release_url_lists.sh
@@ -0,0 +1,29 @@
+#!/usr/bin/env bash
+
+set -e              # fail on error
+set -u              # fail if variable not set in substitution
+set -o pipefail     # fail if part of a '|' command fails
+
+: ${1?' You did not supply a date argument'}
+: ${2?' You did not supply an input file (JSON gzip)'}
+if [ ! -f "$2" ] ; then
+  echo "Input file not found: $2" && exit 1;
+fi
+
+# eg, 2020-08-19
+DATE="$1"
+# eg, release_export_expanded.json.gz
+EXPORT_FILE_GZ="$2"
+
+# filter to fulltext releases only, then filter to only one hit per work
+zcat "$EXPORT_FILE_GZ" \
+    | rg '"release_ids"' \
+    | rg 'archive.org/' \
+    | rg -v '"stub"' \
+    | jq -r '[.work_id, .ident] | @tsv' \
+    | uniq -w 26 \
+    | cut -f 2 \
+    | awk '{print "https://fatcat.wiki/release/" $1 }' \
+    | split --lines 20000 - sitemap-releases-$DATE- -d -a 5 --additional-suffix .txt
+
+gzip sitemap-releases-*.txt
diff --git a/extra/sitemap/sitemap.xml b/extra/sitemap/sitemap.xml
deleted file mode 100644
index 4404bdc2..00000000
--- a/extra/sitemap/sitemap.xml
+++ /dev/null
@@ -1,6 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
-    <url>
-        <loc>{{page[0]|safe}}</loc>
-    </url>
-</urlset>
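
Note on the sharding step: both bash scripts end by piping URLs through `split --lines 20000 ... -d -a 5 --additional-suffix .txt` and then running `gzip`, which yields files named `sitemap-<type>-<date>-00000.txt.gz`, `...-00001.txt.gz`, and so on. The sketch below is a rough Python equivalent of that one step, useful mainly for checking shard counts and file naming. It is not part of this commit; the function name `write_shards` and the sample identifiers are made up for illustration.

```python
import gzip
import itertools
import os
import tempfile


def write_shards(urls, prefix, out_dir, shard_size=20000):
    """Write URLs into gzip-compressed text shards of at most shard_size lines.

    Shards get a 5-digit numeric suffix starting at 00000, mirroring
    `split -d -a 5 --additional-suffix .txt` followed by `gzip`.
    Returns the list of file names written, in order.
    """
    names = []
    it = iter(urls)
    for shard in itertools.count():
        # Pull the next shard_size URLs; an empty chunk means we are done.
        chunk = list(itertools.islice(it, shard_size))
        if not chunk:
            break
        name = f"{prefix}{shard:05d}.txt.gz"
        with gzip.open(os.path.join(out_dir, name), "wt", encoding="utf-8") as f:
            for url in chunk:
                f.write(url + "\n")
        names.append(name)
    return names


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    urls = (f"https://fatcat.wiki/container/example{i}" for i in range(45000))
    with tempfile.TemporaryDirectory() as tmp:
        for name in write_shards(urls, "sitemap-containers-2020-08-19-", tmp):
            print(name)
```

At 20,000 URLs per shard, the README's estimates of ~150k containers and ~25 million releases come to roughly 8 and 1,250 shard files respectively, so the 5-digit suffix leaves plenty of room.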