Questions in the context of fuzzy matching.
The input file is a single large JSON document that maps names to lists of ISSNs.
{
"Acta Orientalia.": [
"0001-6438"
],
"Acta Orientalia (København)": [
"0001-6438"
],
..
import json
import pandas as pd
with open("../data/name_to_issn.json") as f:
    mapping = json.load(f)
We have about 3M keys.
len(mapping)
df = pd.DataFrame(((k, len(v)) for k, v in mapping.items()), columns=["name", "issn_count"])
len(df)
df.head()
unique_names = df[df.issn_count == 1]
repeated_names = df[df.issn_count > 1]
len(repeated_names)
len(repeated_names) / len(df)
About 6% of the names (194241) are repeated, that is, they map to more than one ISSN.
repeated_names.describe()
Which name is shared by over 8000 ISSNs?
repeated_names.iloc[repeated_names.issn_count.argmax()] # Annual report.
It is "Annual report."
mapping["Annual report."][:10]
On average, a repeated name points to 3 ISSNs. About 24k names point to more than 3 ISSNs.
len(repeated_names[repeated_names.issn_count > 3])
repeated_names[repeated_names.issn_count > 3].sample(n=10)
mapping["Philosophica."]
repeated_names[repeated_names.issn_count > 3].issn_count.hist(bins=20)
repeated_names[(repeated_names.issn_count > 3) & (repeated_names.issn_count < 50)].issn_count.hist(bins=10)
repeated_names[(repeated_names.issn_count > 3) & (repeated_names.issn_count < 20)].issn_count.hist(bins=10)
repeated_names[(repeated_names.issn_count > 3) & (repeated_names.issn_count < 8)].issn_count.hist()
repeated_names[repeated_names.issn_count > 1000].issn_count.hist(bins=10)
repeated_names[repeated_names.issn_count > 1000]
repeated_names[repeated_names.issn_count > 500]
repeated_names[repeated_names.issn_count > 200]
repeated_names[repeated_names.issn_count > 100]
repeated_names
If a name matches a repeated name, either exactly or through fuzzy matching, and no other information is available, the match status must be ambiguous.
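The rule above can be sketched as a small function. This is only an illustration under assumptions: the status labels, the `match_status` helper and its `cutoff` parameter are hypothetical, `difflib.get_close_matches` stands in for whatever fuzzy matcher is actually used, and the ISSNs under "Annual report." in the toy mapping are placeholders, not real data.

```python
import difflib

# Hypothetical status labels; the notebook does not define any.
UNIQUE, AMBIGUOUS, NO_MATCH = "unique", "ambiguous", "no_match"


def match_status(name, mapping, cutoff=0.9):
    """Return (status, issns) for a name, using only the name-to-ISSN mapping."""
    if name in mapping:
        # Exact match on the name.
        candidates = [name]
    else:
        # Fuzzy match: best candidate at or above the similarity cutoff, if any.
        candidates = difflib.get_close_matches(name, list(mapping), n=1, cutoff=cutoff)
    if not candidates:
        return NO_MATCH, []
    issns = mapping[candidates[0]]
    # A name shared by several ISSNs cannot be resolved without extra information.
    status = AMBIGUOUS if len(issns) > 1 else UNIQUE
    return status, issns


# Toy mapping; the "Annual report." ISSNs are placeholders.
sample = {
    "Acta Orientalia.": ["0001-6438"],
    "Annual report.": ["XXXX-0001", "XXXX-0002"],
}

print(match_status("Acta Orientalia.", sample))
print(match_status("Annual report", sample))
print(match_status("Zeitschrift", sample))
```

Scanning all 3M keys with `difflib` for every query would likely be too slow in practice; the sketch is only meant to show where the ambiguous status comes from.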