# Where in the World is “11 H 1-2”?

## Aligning Public Datasets With Nazca

---

<img src=https://www.pycon.fr/2019/static/images/partners/Logilab.svg width=800 height=200 style="border:none;box-shadow:none">
<img src=https://francearchives.fr/data/61d267173410a0069808bf843ea18985/images/FranceArchive_Signature.svg width=300 style="border:none;box-shadow:none">
<img src=https://www.mnhn.fr/sites/mnhn.fr/files/museum-national-d-histoire-naturelle_2.png width=300 style="border:none;box-shadow:none">

---

## Motivation

----

<img src=https://www.mnhn.fr/sites/mnhn.fr/files/museum-national-d-histoire-naturelle_2.png width=600 style="border:none;box-shadow:none">
<img src=https://extranet.logilab.fr/upload/file.php?h=R85521724a489642dd88c078f16f68821 style="border:none;box-shadow:none">

https://www.mnhn.fr/ (The National Natural History Museum (NNHM))

----

* The National Natural History Museum (NNHM) holds a large amount of collection data.
* Part of it is digitized (~7 million specimens).
* *Who* did *what* and *when*?

----

<img src=https://upload.wikimedia.org/wikipedia/commons/2/23/Fr%C3%A9d%C3%A9ric_Cuvier_by_Ambroise_Tardieu.jpg width=375 style="border:none;box-shadow:none">

Frédéric Cuvier (1773 - 1838), French *zoologist* and *paleontologist*.

----

<img src=https://i.imgur.com/TZwStRJ.png width=800 style="border:none;box-shadow:none">

(a full name and a date! Aren't we lucky?)

----

<img src=https://i.imgur.com/dGbaU4d.png width=300 style="border:none;box-shadow:none">
<img src=https://i.imgur.com/9U7ZJnK.png width=300 style="border:none;box-shadow:none">

----

* Do you know Georges Cuvier (1769-1832), Frédéric's brother?
* Do you know that Frédéric Cuvier's full name is “Cuvier, Georges, Frédéric” (sometimes abbreviated “Cuvier G.” …)?

----

:::info
:mega: Our goal is to make connections between *actors* and *activities*, with an estimated probability.
:::

----

<img src=https://francearchives.fr/data/61d267173410a0069808bf843ea18985/images/FranceArchive_Signature.svg width=600 style="border:none;box-shadow:none">
<img src=https://extranet.logilab.fr/upload/file.php?h=R30151822e1a56284b5ce010154ae9ebd style="border:none;box-shadow:none">

https://francearchives.fr/

----

<img src=https://extranet.logilab.fr/upload/file.php?h=R996b983dbeb9d4c679731d1c323f4941 width=900 style="border:none;box-shadow:none">

----

* archival records are organized with the help of so-called finding aids
* finding aids contain a document's "metadata" such as its title, publisher, ...
* but also additional information such as the location a document refers to

----

<img src=https://extranet.logilab.fr/upload/file.php?h=R9a4813b9cdffba70e45135ed598b0244 width=900 style="border:none; box-shadow:none">

:::info
:mega: Our goal is to enrich this information with external data, to provide as much information as possible
:::

---

## Nazca

----

### Naive approach

* reference set $R$ of length $n$
* target set $T$ of length $m$
* distance function $\delta$
* calculate $\delta(r, t)$ for all elements $r \in R, t \in T$ and, for every $r_i$, choose the pair $(r_i, t_j)$ minimizing $\delta$
* $|R|\times|T|$ distance matrix

----

### In theory ...

* given reference set $R = \{r_1, r_2, r_3, r_4\}$ and target set $T = \{t_1, t_2, t_3\}$

$$R \times T =
\begin{array}{|c|c|c|}
\hline
\delta(r_1, t_1) & \delta(r_1, t_2) & \delta(r_1, t_3) \\
\hline
\delta(r_2, t_1) & \delta(r_2, t_2) & \delta(r_2, t_3) \\
\hline
\delta(r_3, t_1) & \delta(r_3, t_2) & \delta(r_3, t_3) \\
\hline
\delta(r_4, t_1) & \delta(r_4, t_2) & \delta(r_4, t_3) \\
\hline
\end{array}$$

* doesn't look that bad

----

### ... in reality

* $522.658$ labels in FranceArchives referring to locations
* $111.571$ entries in GeoNames related to France
* the distance matrix would have $522.658 \times 111.571 = 58.313.475.718$ entries

----

###### `a naive example`

```python
>>> refset = get_refset()
>>> targetset = get_targetset()
>>> print(refset)
[
    ('0', 'Marilyn Reilly'),
    ('1', 'Vicki Morales'),
    […]
]
>>> len(refset)
1014
>>> len(targetset)
10000
>>> from nazca.utils.distances import DifflibProcessing
>>> processings = (
...     DifflibProcessing(        # a function to estimate
...         ref_attr_index=1,     # distances between strings
...         target_attr_index=1,
...     ),
... )
>>> from nazca.rl.aligner import BaseAligner
>>> aligner = BaseAligner(
...     threshold=0.2,            # take all the pairs where the distance < 0.2
...     processings=processings,  # the distances
...     unique=True,              # take the best pairs only
... )
>>> for pair in aligner.get_aligned_pairs(refset, targetset):
...     print(pair)
(('0', 0), ('2038', 2038), 0.0)
(('1', 1), ('160', 160), 0.0)
[…]
```

----

```python
>>> %timeit list(aligner.get_aligned_pairs(refset, targetset))
39.4 s ± 998 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

:::info
40s on average! That is quite long for such a small dataset!

**How to improve that?!**

-> by *quickly* estimating the distance
:::

----

```graphviz
digraph structs {
    ref[label="refset",shape=cylinder];
    target[label="targetset",shape=cylinder];
    r0[label="<f0>|<f1>|<f2>",shape=record]
    r1[label="<f0>|<f1>|<f2>",shape=record]
    ref->r0:f1
    target->r1:f1
    b0[label="{{<f0>|<f1>}|block}",shape=record];
    b1[label="{{<f0>|<f1>}|block}",shape=record];
    b2[label="{{<f0>|<f1>}|block}",shape=record];
    r0:f0 -> b0:f0
    r0:f1 -> b1:f0
    r0:f2 -> b2:f0
    r1:f0 -> b0:f1
    r1:f1 -> b1:f1
    r1:f2 -> b2:f1
    a0[label="aligned",shape=record]
    b0,b2 -> a0
    b1 -> a0
}
```

----

```python
>>> from nazca.rl.blocking import MinHashingBlocking
>>> aligner = BaseAligner(
...     threshold=0.2,
...     processings=processings,
...     unique=True,
... )
>>> minhashing_blocking = MinHashingBlocking(
...     refset_attr_index=1,
...     targetset_attr_index=1,
...     threshold=0.3,
... )
>>> aligner.register_blocking(minhashing_blocking)
>>> %timeit list(aligner.get_aligned_pairs(refset, targetset))
1.09 s ± 9.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

:::success
:tada: From 40s to 1s!

**How does it work?**
:::

----

#### Implemented blockers

| Blocking | Type |
| -------- | -------- |
| `KeyBlocking` | f(x) = f(y) |
| `SoundexBlocking` | pronunciation |
| `NGramBlocking` | similar substring |
| `KmeansBlocking` | same cluster |
| `KdTreeBlocking` | coordinates |
| `MinHashingBlocking` | Jaccard distance |

---

## Applications

----

<img src=https://francearchives.fr/data/61d267173410a0069808bf843ea18985/images/FranceArchive_Signature.svg width=600 style="border:none;box-shadow:none">

----

<img src=https://extranet.logilab.fr/upload/file.php?h=Rd04e0051cfcb383977ce62fbeaac12b6 width=1000 style="border:none;box-shadow:none">

----

* to enrich finding aids with additional information, we align to external knowledge bases
* persons are aligned to [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) and [data.bnf.fr](https://data.bnf.fr/)
* locations are aligned to [GeoNames](http://www.geonames.org/) and [BANO](https://wiki.openstreetmap.org/wiki/FR:France/WikiProject_Base_Adresses_Nationale_Ouverte_(BANO))
* this presentation only covers the alignment to GeoNames

----

* reminder
    * $522.658$ labels in FranceArchives need to be aligned to
    * $111.571$ entries in GeoNames
* many of these labels refer to the same location
    * only one of these labels needs to be aligned
    * down to $293.389$ labels

----

```graphviz
digraph structs{
    FranceArchives[shape=cylinder]
    FranceArchives -> index0,index1,index2
    index0[label="BORDEAUX",shape=record]
    index1[label="Bordeaux (Gironde, France)",shape=record]
    index2[label="bordeaux (gironde)",shape=record]
    location0[label="Bordeaux (Gironde, France)",shape=record,color=dodgerblue3]
    index0,index1,index2 -> location0
    target0[label="Bordeaux, France",shape=record,color=darkgreen]
    location0 -> target0[label="align to"]
    GeoNames[shape=cylinder]
    GeoNames -> target0
    label="Location authorities in FranceArchives"
}
```

----

* labels contain additional information e.g.
$\textit{Bordeaux (Gironde, France)}$
* use this additional information to construct different reference sets
* these are aligned to different target sets containing
    * France
    * archive's department
    * foreign countries
    * topographic features
* smaller sets, but still quite large
* at this point blocking comes in

----

* illustrate the alignment to the target set containing France
* alignment pipeline consisting of $3$ sequential steps
    * each composed of $2$ blocking steps and $1$ alignment step
    * blocking steps are crucial
    * the alignment step is a simple distance function

----

```graphviz
digraph structs{
    f[label="FranceArchives",shape=cylinder]
    g[label="GeoNames",shape=cylinder]
    r0[label="<f0>|<f1>|<f2>|<f3>",shape=record,color="dodgerblue3"]
    r1[label="<f0>|<f1>|<f2>|<f3>",shape=record,color="darkgreen"]
    f -> r0
    g -> r1
    b0[label="<f0>|<f1>",shape=record]
    b1[label="...", shape=plain]
    r0:f1 -> b0:f0
    r0:f2 -> b1
    r1:f1 -> b0
    r1:f2 -> b1[label=" KeyBlocking (department, city)"]
    b2[label="{{<f0>|<f1>}|<f2>block to align}",shape=record]
    b3[label="...",shape=plain]
    b0:f0 -> b2:f0
    b0:f0 -> b2:f1
    b0:f1 -> b3[label=" MinHashingBlocking"]
    a0[label="aligned",shape=record]
    a1[label="non-aligned",shape=record]
    b2:f2 -> a0
    b2:f2 -> a1[label=" alignment"]
    label="Step 1: Blocking pipeline using department-city key pair"
}
```

----

```graphviz
digraph structs{
    s0[label="...",shape=plain]
    r0[label="<f0>|<f1>|<f2>|<f3>",shape=record,color="dodgerblue3"]
    s0 -> r0[label="non-aligned step 1"]
    g[label="GeoNames",shape=cylinder]
    r1[label="<f0>|<f1>|<f2>|<f3>",shape=record,color=darkgreen]
    g -> r1
    b0[label="<f0>|<f1>",shape=record]
    b1[label="...", shape=plain]
    r0:f1 -> b0:f0
    r0:f2 -> b1
    r1:f1 -> b0:f1
    r1:f2 -> b1[label=" KeyBlocking (department,)"]
    b2[label="{{<f0>|<f1>}|<f2>block to align}",shape=record]
    b3[label="...",shape=plain]
    b0:f0 -> b2:f0
    b0:f0 -> b2:f1
    b0:f1 -> b3[label=" MinHashingBlocking"]
    a0[label="aligned",shape=record]
    a1[label="non-aligned",shape=record]
    b2:f2 -> a0
    b2:f2 -> a1[label=" alignment"]
    label="Step 2: Blocking pipeline using department key"
}
```

----

```graphviz
digraph structs{
    s0[label="...",shape=plain]
    r0[label="<f0>|<f1>|<f2>|<f3>",shape=record,color="dodgerblue3"]
    s0 -> r0[label="non-aligned step 2"]
    g[label="GeoNames",shape=cylinder]
    r1[label="<f0>|<f1>|<f2>|<f3>",shape=record,color=darkgreen]
    g -> r1
    b0[label="<f0>|<f1>",shape=record]
    b1[label="...", shape=plain]
    r0:f1 -> b0:f0
    r1:f1 -> b0:f1
    r1 -> b1[label=" NGramBlocking"]
    b2[label="{{<f0>|<f1>}|<f3>block to align}",shape=record]
    b3[label="...",shape=plain]
    b0:f0 -> b2:f0
    b0:f0 -> b2:f1
    b0 -> b3[label=" MinHashingBlocking"]
    a0[label="aligned",shape=record]
    a1[label="non-aligned",shape=record]
    b2:f3 -> a0
    b2:f3 -> a1[label=" alignment"]
    label="Step 3: Blocking pipeline using NGramBlocking"
}
```

----

## Results

* development in stages
* latest addition: topographic features
* as of today, $60.604$ authorities could be aligned
* a small fraction has been aligned by hand

----

### Additional information

<img src=https://extranet.logilab.fr/upload/file.php?h=Rbdd942291a700c9193e59bd326d3745d style="border:none;box-shadow:none" height=100>

----

### Interactive Map

<img src=https://extranet.logilab.fr/upload/file.php?h=R79c067f43a11689d13410e0dd4acbdab style="border:none;box-shadow:none" height=400>
<img src=https://extranet.logilab.fr/upload/file.php?h=Rac596d363f4947a1c0c6cf40b9500a36 style="border:none;box-shadow:none">

https://francearchives.fr/carte-inventaires

----

### Data(POC) MNHN

----

* POC: Proof Of Concept
* 482 naturalists
* Data from about **ten** different sources gathered into a single place.
* More than *one million* alignments generated
    * collection
    * determination
    * taxon creation
    * publication
    * …

----

<img src=https://i.imgur.com/RtzNpi2.png style="border:none;box-shadow:none">

The information in this picture comes from multiple alignments (toward *idref* and *wikidata*, and *MNHN*'s library for the picture)

----

## Lamarck, Jean-Baptiste

<img src=https://i.imgur.com/SmylMxf.png style="border:none;box-shadow:none">

:::info
We can explain the **score** to the *user*, and on which **basis** the alignment has been made.
:::

----

<img src=https://i.imgur.com/Q1w2CFI.png style="border:none;box-shadow:none">

:::info
The naturalist's abbreviation is obtained **from** an alignment made toward **wikidata**!
:::

----

<img src=https://i.imgur.com/nINNJgG.png style="border:none;box-shadow:none">

:::info
Known identifiers can be used to **strengthen trust** in an alignment
:::

----

###### `open data`

<img src=https://i.imgur.com/cJAyYnn.png style="border:none;box-shadow:none">

:::success
names, wikidata and viaf identifiers of people who collected a *Fabaceae* during the 17th century
:::

----

<img src=https://i.imgur.com/WkZ2F2o.png style="border:none;box-shadow:none" height=600>

---

## Try it!

<img src=https://extranet.logilab.fr/upload/file.php?h=Rbc572be7101c1a6cabb7b245d9d8f6fa style="border:none;box-shadow:none" height=300>

https://forge.extranet.logilab.fr/open-source/nazca

---

## Thank you for your attention!

Questions?

---

## Contributors (in alphabetical order)

**FranceArchives**

* Juliette Belin
* Tanguy Le Carrour
* Carine Dengler
* Adrien Di Mascio
* David Douard

----

(cont.)

**FranceArchives**

* Arthur Lutz
* Philippe Pepiot
* Katia Saurfelt
* Sylvain Thénault
* Samuel Trégouët
* Guillaume Vandevelde

----

(cont.)

**DataPOC-MNHN**

* Fabien Amarger
* Simon Chabot
* Pierre Choffé
* Élouan Martinet
* Laurent Wouters

---
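
## Backup: blocking in a nutshell

The blocking idea from the Nazca section (only compare records that land in the same block) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Nazca's implementation: `distance`, `key_blocks`, `align` and the first-letter key are hypothetical names chosen for this slide, and the three-record target set is made up.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def distance(a, b):
    """String distance in [0, 1]; 0.0 means the strings are identical."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def key_blocks(refset, targetset, block_key):
    """Group records by block_key(label); yield (refs, targets) sharing a key."""
    blocks = defaultdict(lambda: ([], []))
    for record in refset:
        blocks[block_key(record[1])][0].append(record)
    for record in targetset:
        blocks[block_key(record[1])][1].append(record)
    for refs, targets in blocks.values():
        if refs and targets:  # a block without both sides yields no pair
            yield refs, targets


def align(refset, targetset, block_key, threshold=0.2):
    """Compare records only inside a block; keep the best target per reference."""
    pairs = []
    for refs, targets in key_blocks(refset, targetset, block_key):
        for ref in refs:
            best = min(targets, key=lambda t: distance(ref[1], t[1]))
            d = distance(ref[1], best[1])
            if d < threshold:
                pairs.append((ref[0], best[0], d))
    return pairs


# Hypothetical toy data: (identifier, label) records.
refset = [("0", "Marilyn Reilly"), ("1", "Vicki Morales")]
targetset = [("a", "Vicky Morales"), ("b", "Marilyn Reilly"), ("c", "Bob Stone")]
pairs = align(refset, targetset, block_key=lambda label: label[0].lower())
```

With a first-letter key, each reference here is compared with a single target instead of three. The blockers listed in the deck follow the same pattern with less naive grouping criteria.

---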
{"title":"Nazca / Pycon","tags":"presentation,nazca","slideOptions":{"transition":"fade","theme":"white","allottedMinutes":1}}