From b362abd38ad4a6624bc056c58eb90ae235c63f00 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Mon, 23 Sep 2019 23:00:23 -0700
Subject: rename postgrest directory sql

---
 postgrest/README.md | 126 --------------------------------------------------
 1 file changed, 126 deletions(-)
 delete mode 100644 postgrest/README.md

diff --git a/postgrest/README.md b/postgrest/README.md
deleted file mode 100644
index b171614..0000000
--- a/postgrest/README.md
+++ /dev/null
@@ -1,126 +0,0 @@

TL;DR: replace HBase with PostgreSQL tables behind a REST API (http://postgrest.org)

No primary storage of anything in these tables. Everything should be rapidly
re-creatable from dumps, kafka topics (compressed), CDX, petabox metadata, etc.
This is a secondary view on all of that.

## Schema

    schema/database name is 'sandcrawler'

    cdx: include revisits or not?
        id: int64, PK
        sha1hex: string, not null, index
        cdx_sha1hex: string
        url: string, not null
        datetime: ISO 8601:1988 (string?), not null
        mimetype: string
        warc_path: string (item and filename)
        warc_offset: i64
        created: datetime, index (?)
        ?crawl: string
        ?domain: string

    file_meta
        sha1hex: string, PK
        md5hex: string
        sha256hex: string
        size_bytes: i64
        mime: string (verifying file status; optional for now?)

    fatcat_file
        sha1hex: string, PK
        file_ident: string, index?
        release_ident: ?

    petabox
        id: int64, PK
        sha1hex: string, not null, index
        item: string, not null
        path: string, not null (TODO: URL encoded? separate sub-archive path?)

    grobid
        sha1hex: string, PK
        updated: datetime
        grobid_version: string
        status_code: i32
        status: string (JSONB?), only if status != 200
        metadata: JSONB; title, first author, year (not for now?)
        glutton_fatcat_release: string, index

    shadow
        sha1hex: string, PK
        shadow_corpus: string, PK
        shadow_id: string
        doi: string
        pmid: string
        isbn13: string

Alternatively, to be more like the existing system, we could have "one big
table", or multiple tables all with the same key (sha1b32) and a UNIQUE
constraint. As-is, every sha1 PK column is 40 bytes of both index and data,
or 8+ GByte (combined) for each table with 100 million rows. Using raw bytes
could help, but would make all code/queries much trickier.

Should we have "created" or "updated" timestamps on all these tables to
enable kafka tailing?

TODO:
- how to indicate CDX sha1 vs. true sha1 mis-match? pretty rare. recrawl and
  delete row from `gwb_cdx`?
- only most recent GROBID? or keep multiple versions? here and in minio

## Existing Stuff Sizes (estimates)

     78.5 G  /user/bnewbold/journal_crawl_cdx
     19.7 G  /user/bnewbold/sandcrawler/output-prod/2018-12-14-1737.00-dumpfilemeta
      2.7 G  file_hashes.tsv
    228.5 G  /user/bnewbold/sandcrawler/output-prod/2018-09-23-0405.30-dumpgrobidmetainsertable

## Use Cases

The core goal here is to mostly kill hbase/hadoop. What jobs are actually
used there? (Hedged SQL sketches for a few of these follow the list.)

- backfill: load in-scope (fulltext) crawl results from CDX
  => bulk (many-line) inserts
- rowcount: "how many unique PDFs crawled?"
  => trivial SQL query
- status code count: "how much GROBID progress?"
  => trivial SQL query
- dumpungrobided: "what files still need to be processed?"
  => SQL join with a "first" on the CDX side
- dumpgrobidxml: "merge CDX/file info with extracted XML, for those that
  were successful"
  => SQL dump or row scan, then minio fetches

This table is generally "single file raw fulltext metadata".
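For reference, a minimal sketch of the SQL these jobs might reduce to, using
the tentative table/column names from the schema section above (nothing here
is settled):

    -- rowcount: "how many unique PDFs crawled?"
    SELECT COUNT(DISTINCT sha1hex) FROM cdx;

    -- status code count: "how much GROBID progress?"
    SELECT status_code, COUNT(*) FROM grobid GROUP BY status_code;

    -- dumpungrobided: a "first" CDX row for each file with no grobid row yet
    SELECT DISTINCT ON (cdx.sha1hex) cdx.*
    FROM cdx
    LEFT JOIN grobid ON grobid.sha1hex = cdx.sha1hex
    WHERE grobid.sha1hex IS NULL
    ORDER BY cdx.sha1hex, cdx.created;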
- -"Enrichment" jobs: - -- GROBID -- glutton (if not GROBID) -- extra file metadata -- match newly enriched files to fatcat - -What else? - -- track additional raw file metadata -- dump all basic GROBID metadata (title, authors, year) to attempt merge/match - -Questions we might want to answer - -- total size of PDF corpus (terabytes) -- unqiue files hit per domain - -## Prototype Plan - -- backfill all CDX crawl files (TSV transform?) -- load full GROBID XML (both into minio and into SQL) -- load full fatcat file dump (TSV transform) -- load dumpfilemeta - -## Example Useful Lookups - - - http get :3030/cdx?url=eq.https://coleccionables.mercadolibre.com.ar/arduino-pdf_Installments_NoInterest_BestSellers_YES - http get :3030/file_meta?sha1hex=eq.120582c855a7cc3c70a8527c560d7f27e6027278 - -- cgit v1.2.3