Tim Sherratt
@wragge.hcommons.social.ap.brid.gy
I’m a historian and hacker who researches the possibilities and politics of digital cultural collections.

My main project at the moment is the #GLAMWorkbench […]

🌉 bridged from https://hcommons.social/@wragge on the fediverse by https://fed.brid.gy/
As mentioned in my post yesterday, I'm going to run a workshop on the GLAM Workbench while I'm in Melbourne in a couple of weeks. It's aimed at GLAM organisations who want to encourage people to do cool stuff with their data […]
Original post on hcommons.social
hcommons.social
November 20, 2025 at 5:13 AM
If you're searching the Sands and Mac directories for addresses, be aware that suburb names are often abbreviated. I've started adding a table of abbreviations to the database so you can check […]
Original post on hcommons.social
hcommons.social
November 20, 2025 at 12:21 AM
My residency at State Library Victoria LAB is wrapping up in a few weeks! Here's a blog post that pulls together links to the things I've been working on, and gives a peek at what's coming. https://updates.timsherratt.org/2025/11/19/counting-down-to-the-end.html #glam #libraries #openglam […]
Original post on hcommons.social
hcommons.social
November 19, 2025 at 5:10 AM
Reposted by Tim Sherratt
We are still looking for a UI designer who can help us shape beautiful tools around Wikimedia Commons.

#getfedihired #wikimediacommons #glamwiki

https://urmyt.se/jobs/
Jobs - Urmyt
urmyt.se
November 17, 2025 at 11:59 AM
A few thousand newspapers from SLV geolocated by place of publication/distribution. This includes digitised and non-digitised titles.
November 18, 2025 at 4:34 AM
Reposted by Tim Sherratt
Want a guaranteed interesting start to your day? Load this up and click the first link that piques your interest. https://elvery.net/prototypes/wikipedia-edits/ #wikipedia
sw'as
elvery.net
November 16, 2025 at 11:15 PM
A short post on using ALTO and @IIIF to create image snippets of Sands & Mac entries. https://updates.timsherratt.org/2025/11/16/some-sands-mac-tweaks-thanks.html
I posted recently about my new fully-searchable version of the Sands & MacDougall directories. I’ve now moved on to try and pull together a number of the State Library of Victoria’s place-based collections into a new discovery interface. It’s going to be a busy couple of weeks as my residency ends in early December!

I wanted to incorporate Sands & Mac search results into the new interface. Getting the data was easy because Datasette has a JSON API baked in. But what about the images? I could just display a thumbnail of the whole page, but it would be better to show a snippet of the actual entry. Thanks to IIIF and ALTO, I now can.

IIIF makes it easy to cut small sections out of a larger image. You just put the coordinates of the desired section in the IIIF url. As I noted in my previous post, the ALTO files that contain the OCR data from Sands & Mac include the coordinates of every line, and every word. I just had to bring the two together.

All I did was update the code that extracts the data from the ALTO files to save the results as newline-delimited JSON instead of a plain text file. Each line in each volume of Sands & Mac is now saved as a JSON object that contains the text, as well as the height, width, vertical position, and horizontal position of the line within the page image. When I load up the SQLite database, I add the values for `h`, `w`, `x`, and `y` as well as the text for each line.

What does this make possible?

1. When you go to an individual entry, the page image now automatically pans and zooms so that the current entry is at the centre of the image viewer. I just updated the OpenSeadragon code to focus on the entry’s position.
2. If you share an entry on social media, a snipped-out section of the page image showing the selected entry is displayed, as there’s now an image `meta` tag that points to an IIIF url.
3. You can retrieve entries via the API and use the coordinates to request snipped-out images of them via IIIF.

Nice image snippets thanks to IIIF and ALTO (and a sneak preview of what's coming...)
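To make the url construction concrete, here’s a minimal sketch. It assumes a line record with the `h`, `w`, `x`, and `y` values described above; the image service base url, the padding, and the pixel values are placeholders rather than the actual implementation.

```python
# Build an IIIF Image API url that crops out a single entry, using the
# x, y, w, h values stored for a line of text.

def iiif_snippet_url(image_base, x, y, w, h, padding=10):
    """Return an IIIF url for the region around one line of text."""
    region = f"{x - padding},{y - padding},{w + 2 * padding},{h + 2 * padding}"
    # region/size/rotation/quality.format -- 'full' is 'max' in Image API 3.0
    return f"{image_base}/{region}/full/0/default.jpg"

# Example line record as stored in the database (values are illustrative only)
line = {"x": 350, "y": 1200, "w": 980, "h": 42}

print(iiif_snippet_url("https://example.org/iiif/IMAGE_ID", **line))
# https://example.org/iiif/IMAGE_ID/340,1190,1000,62/full/0/default.jpg
```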
updates.timsherratt.org
November 16, 2025 at 4:20 AM
Sneaking up on 500 maps from the SLV georeferenced – in just over 2 weeks. With another 2 weeks to go before my #slvresidency ends, we might actually make it to 1,000! That would be amazing. https://wragge.github.io/slv-allmaps/dashboard.html
allmaps_dashboard
wragge.github.io
November 16, 2025 at 3:32 AM
If you're interested in hearing more about what I've been doing during my #slvresidency, I'll be giving a talk at 1pm on 3 December in the Library's Create Quarter.
November 16, 2025 at 12:22 AM
Starting to bring various bits of my #slvresidency together in a new prototype...
November 15, 2025 at 6:36 AM
Just updated Sands & Mac. I've added the line coordinates from the ALTO files, so now when you view an entry the image will automatically pan/zoom to show the selected line!
November 14, 2025 at 12:31 PM
Good to hear today that my new Sands & Mac is already being used by front-of-house librarians at the SLV to help people with their family history queries. https://updates.timsherratt.org/2025/11/12/a-new-way-of-searching.html
In the fortnight I spent onsite at the State Library of Victoria, ‘Sands & Mac’ was mentioned many times. And no wonder. Sands & McDougall’s directories are a goldmine for anyone researching family, local, or social history. They list thousands of names and addresses, enabling you to find individuals, and explore changing land use over time. When people ask the SLV’s librarians, ‘What can you tell me about the history of my house?’, Sands & Mac is one of the first resources consulted.

The SLV has digitised 24 volumes of Sands & Mac, one every five years from 1860 to 1974. You can browse the contents of each volume in the SLV image viewer, using the partial contents listing to help you find your way to sections of interest. To search the full text content you need to use the PDF version, either in the built-in viewer, or by downloading the PDF. There’s a handy guide to using Sands & Mac that explains the options. **However, there’s currently no way of searching across all 24 volumes, so as part of my residency at the SLV LAB, I thought I’d make one!**

**Try it now!**

My new Sands & Mac database follows the pattern I’ve used previously to create fully-searchable versions of the NSW Post Office directories, Sydney telephone directories, and Tasmanian Post Office directories. Every line of text is saved to a database, so a single query searches for entries across all volumes. You can also use advanced search features like wildcards and boolean operators.

Search across all 24 volumes!

Once you’ve found a relevant entry you can view it in context, alongside a zoomable image of the page. You can even use Zotero to save individual entries to your own research database. This blog post from the Everyday Heritage project describes how the Tasmanian directories have been used to map Tasmania’s Chinese population.

View each entry in context! (Here's my Dad building his first house in Beaumaris in the 1950s.)

There are still a few things I’d like to try, such as making use of the table of contents information for each volume. I’d also like to create some additional entry points to take users directly to listings for individual suburbs (maybe even streets!). Each volume has a directory of suburbs, so it would be a matter of extracting and cleaning the data and linking the entries to digitised pages. Certainly possible, but I don’t think I’ll have time to get it all done before the end of my residency. Perhaps I’ll try to get at least one volume done to demonstrate how it might work, and the value it would add. As I was writing this blog post I also realised there’s a dataset of businesses extracted from the Sands & Mac, so I need to think about how I can use that as well!

## Technical information follows…

I’ve documented the process I used to create fully-searchable versions of the Tasmanian and NSW directories in the GLAM Workbench. I followed a similar method for Sands and Mac, though with a few dead-ends and discoveries along the way.

### Downloading the PDFs

I assumed that it would be easiest to work from the PDF versions of each volume, as I’d done for Tasmania. So I set about finding a way to download them all. There are only 24 volumes, so I _could_ have downloaded them manually, but where’s the fun in that? I started with a CSV file listing the Sands & Mac volumes that I downloaded from the catalogue. This gave me the Alma identifiers for each volume.

To download the PDFs I needed two more identifiers: the `IE` identifier assigned to each digitised item, and a file identifier that points to the PDF version of the item. The `IE` identifier can be extracted from the item’s MARC record, as I described in my post on exploring urls. The PDF file identifier was a bit more difficult to track down. The PDF links in the image viewer are generated dynamically, so the data had to be coming from somewhere. Eventually I found that the viewer loaded a JSON file with all sorts of useful metadata in it! The url to download the JSON file is: `https://viewerapi.slv.vic.gov.au/?entity=[IE identifier]&dc_arrays=1`. In the `summary` section I found identifiers for `small_pdf` and `master_pdf`. I could then use these identifiers to construct urls to download the PDFs themselves: `https://rosetta.slv.vic.gov.au/delivery/DeliveryManagerServlet?dps_func=stream&dps_pid=[PDF id]`
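Here’s a rough sketch of that lookup in Python, using the url patterns above. The assumption that the PDF identifiers sit directly under the `small_pdf` and `master_pdf` keys of the `summary` section is mine, and the `IE` identifier in the usage line is just a placeholder.

```python
import requests

# Fetch the viewer metadata for a digitised item and build a url for its PDF.
# The exact JSON structure is assumed: the `summary` section is taken to hold
# the `small_pdf` and `master_pdf` identifiers directly.

VIEWER_API = "https://viewerapi.slv.vic.gov.au/"
DELIVERY = "https://rosetta.slv.vic.gov.au/delivery/DeliveryManagerServlet"

def get_pdf_url(ie_id, which="master_pdf"):
    """Return a download url for the small or master PDF of a digitised item."""
    response = requests.get(VIEWER_API, params={"entity": ie_id, "dc_arrays": 1})
    response.raise_for_status()
    metadata = response.json()
    pdf_id = metadata["summary"][which]
    return f"{DELIVERY}?dps_func=stream&dps_pid={pdf_id}"

# Usage -- 'IE1234567' is a placeholder IE identifier
print(get_pdf_url("IE1234567", which="small_pdf"))
```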
Once I had the PDFs I used PyMuPDF to extract all the text and images. As I suspected, the text wasn’t really fit for purpose. The OCR was ok, but the column structures were a mess. Because I wanted to index each entry individually, it was important to try and get the columns represented as accurately as possible. The images in the small PDFs were already bitonal, so I started feeding them to Tesseract to see if I could get better results. After a bit of tweaking, things were looking pretty good. But when I came to compile all the data, I realised there was a potential problem matching the PDF pages to the images available through IIIF. I found one case where some pages were missing from the PDF, and another couple where the page order was different.

As I was looking around for a solution, I realised that those JSON files I downloaded to get the PDF identifiers also included links to ALTO XML files that contain all the original OCR data (before it got mangled by the PDF formatting). There was one ALTO file for every page. Even better, the JSON linked the identifiers for the text and the image together – no more page mismatches!

### Downloading the ALTO files

Let’s start this again, shall we? After wasting several days futzing about with the PDFs, I decided to download all the ALTO files and extract the text from them. As I downloaded each XML file, I also grabbed the corresponding image identifier from the JSON and included both identifiers in the file name for safekeeping.

The ALTO files break the text down by block, line, and word. To extract the text, I just looped through every line, joining the words back together as a string, and writing the result to a new text file – one for each page. It’s worth noting that the ALTO files include _all_ the positional data generated by the OCR process, so you have the size and position of every word on every page. I just pulled out the text, but there are many more interesting things you could do…
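As a rough illustration of that extraction step, here’s a minimal sketch of reading one page’s ALTO file. It keeps each line’s coordinates as well as its text and writes newline-delimited JSON (rather than the plain text files described above). It assumes the standard ALTO `TextLine` and `String` elements with their `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT` attributes; the namespace version and file names are assumptions.

```python
import json
from pathlib import Path
from xml.etree import ElementTree as ET

# The ALTO namespace varies between versions -- adjust to match the SLV files.
NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

def alto_lines(alto_file):
    """Yield the text and coordinates of each TextLine in an ALTO file."""
    root = ET.parse(alto_file).getroot()
    for line in root.findall(".//alto:TextLine", NS):
        words = [s.get("CONTENT", "") for s in line.findall("alto:String", NS)]
        yield {
            "text": " ".join(words),
            "x": int(float(line.get("HPOS", 0))),
            "y": int(float(line.get("VPOS", 0))),
            "w": int(float(line.get("WIDTH", 0))),
            "h": int(float(line.get("HEIGHT", 0))),
        }

def save_page(alto_file, output="lines.ndjson"):
    """Append one JSON object per line of text to a newline-delimited JSON file."""
    with Path(output).open("a") as out:
        for record in alto_lines(alto_file):
            out.write(json.dumps(record) + "\n")

# Usage -- the file name is a placeholder
save_page("volume-1900-page-0001.alto.xml")
```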
### Assembling and publishing the database

From here on everything pretty much followed the pattern of the NSW and Tasmanian directories. I looped through each volume, page, and line of text, adding the text and metadata to a SQLite database using sqlite_utils. I then indexed the text for full-text searching. At the same time I populated a metadata file with titles, urls, and a few configuration details. The metadata file is used by Datasette to fill in parts of the interface.

I made some minor changes to the Datasette template I used for the other directories. In particular, I had to update the urls that loaded the IIIF images into the OpenSeadragon viewer. But it mostly just worked. It’s so nice to be able to reuse existing patterns! Finally, I used Datasette’s `publish` command to push everything to Google Cloud Run.

The final database contains details of more than 50,000 pages, and over 19 million lines of text! It weighs in at about 1.7GB. The Cloud Run service will ‘scale to zero’ when not in use. This saves some money and resources, but means it can take a little while to spin up. Once it’s loaded, it’s very fast. My original post on the Tasmanian directories included a little note on costs, if you’re interested.

## More information

The notebooks I used are on GitHub:

* Download Sands and Mac PDFs and OCR text
* Load data from the Sands and Mac directories into an SQLite database (for use with Datasette)

Here are some posts about the NSW and Tasmanian directories:

* Making NSW Postal Directories (and other digitised directories) easier to search with the GLAM Workbench and Datasette (September 2022)
* From 48 PDFs to one searchable database – opening up the Tasmanian Post Office Directories with the GLAM Workbench (September 2022)
* Where’s 1920? Missing volume added to Tasmanian Post Office Directories! (September 2024)
* Six more volumes added to the searchable database of Tasmanian Post Office Directories! (November 2024)
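Relating to the ‘Assembling and publishing the database’ step above, here’s a minimal sketch of loading the extracted lines with sqlite_utils and enabling full-text search. The table and column names, file names, and batch size are illustrative assumptions, not the published database’s actual schema.

```python
import json
from pathlib import Path

import sqlite_utils

# Load newline-delimited JSON lines into SQLite and index the text for
# full-text searching. Table and column names are illustrative only.

db = sqlite_utils.Database("sandsmac.db")

def load_volume(ndjson_file, volume_year):
    """Insert every line from one volume's NDJSON file into the 'lines' table."""
    with Path(ndjson_file).open() as f:
        records = ({"volume": volume_year, **json.loads(line)} for line in f)
        db["lines"].insert_all(records, batch_size=1000)

load_volume("1900-lines.ndjson", 1900)  # placeholder file name

# Enable SQLite full-text search on the text column
db["lines"].enable_fts(["text"], create_triggers=True)
```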
updates.timsherratt.org
November 14, 2025 at 7:12 AM
A little bit terrified of all the things I want to get done before the end of my #slvresidency on 5 December...
November 13, 2025 at 11:36 PM
There's a new version of the AllMaps editor out, which means I'll need to update my SLV georeferencing documentation! It works the same, but some things have been moved or added. It now includes a 'search' option in the georeferencing panel that should make it easier to match maps.
November 12, 2025 at 11:57 PM
Reposted by Tim Sherratt
A few weeks back, as part of my #slvresidency, I created a big, fully-searchable database with all the content from the 24 volumes of Sands & MacDougall's directories digitised by the SLV. I've finally written up the details: https://updates.timsherratt.org/2025/11/12/a-new-way-of-searching.html […]
Original post on hcommons.social
hcommons.social
November 12, 2025 at 11:38 AM
Gippsland map lovers have joined the party. https://wragge.github.io/slv-allmaps/dashboard.html
November 12, 2025 at 10:43 AM
Today I gave a presentation to the @IIIF workshop at Sydney Uni. I talked about some of my experiments using IIIF, including my current work at the State Library of Victoria. Here are the slides: https://slides.com/wragge/iiif-workshop-2025 #glam #maps #digitalhumanities
Presentation to IIIF workshop, November 2025
A presentation created with Slides.
slides.com
November 10, 2025 at 1:40 AM
More than 200 maps from the State Library of Victoria have been georeferenced in the past week! You can chart progress using this dashboard, which is updated every 2 hours: https://wragge.github.io/slv-allmaps/dashboard.html
allmaps_dashboard
wragge.github.io
November 7, 2025 at 6:24 AM
Reposted by Tim Sherratt
Data rescue for World Digital Preservation Day 2025
Today, Thursday 6 November 2025 (if I actually manage to finish and publish this today), is World Digital Preservation Day, so I thought I would try and get a blog post out about some work I’ve been doing to rescue at-risk data. I’ve briefly mentioned this in my post about Library of Congress Subject Headings but not in much detail. The project is Safeguarding Research & Culture and I got involved back in March or April when Henrik reached out on social media looking for someone with library & metadata experience to contribute. I said that I wasn’t a Real Librarian but I’d love to help if I could, and now here we are.

The concept is simple: download public datasets that are at risk of being lost, and replicate them as widely as possible to make them hard to destroy, though obviously there’s a lot of complexity buried in that statement. When the Trump administration first took power, there were a lot of people around the world worried about this issue and wanting to help, so while there are a number of institutions & better resourced groups doing similar things, we aim to complement them by mobilising grassroots volunteers.

Downloading data isn’t always straightforward. It may be necessary to crawl an entire website, or query a poorly-documented API, or work within the constraints of rate-limiting so as not to overload an under-resourced server. That takes knowledge and skill, so part of the work is guiding and mentoring new contributors and fostering a community that can share what they learn and proactively find and try out new tools.

We also need people to be able to find and access the data, and volunteers to be able to contribute their storage to the network. We distribute data via the venerable BitTorrent protocol, which is very good at defeating censorship and getting data out to as many peers as possible as quickly as possible. To make those torrents discoverable, our dev team led by the incredible Jonny have built a catalogue of dataset torrents, playfully named SciOp. That’s built on well-established linked data standards like DCAT, the Data Catalogue Vocabulary, so the metadata is standardised and interoperable, and there’s a public API and a developing commandline client to make it even easier to process and upload datasets. There are RSS and RDF feeds of datasets by tag, size, threat status or number of seeds (copies) in the network that you can plug into your favourite BitTorrent client to automatically start downloading newly published datasets. There are even exciting plans in the works to make it federated via ActivityPub, to give us a network of catalogues instead of just a single one.

We’re accidentally finding ourselves needing to push the state of the art in BitTorrent client implementations. If you’re familiar with the history of BitTorrent as a favoured tool for _ahem_ less-than-legal media sharing, it probably won’t surprise you that most current BitTorrent clients are optimised for working with single audio-visual streams of about 1 to 2½ hours in length. Our scientific & cultural data is much more diverse than that, and the most popular clients can struggle for various reasons. In many cases there are BEPs (BitTorrent Enhancement Proposals) to extend the protocol to improve things, but these are optional features that most clients don’t implement. The collection of BEPs that make up “BitTorrent v2” is a good example: most clients don’t support v2 well, so most people don’t bother making v2-compatible torrents, but that means there’s no demand to implement v2 in the clients. We are planning to make a scientific-grade BitTorrent client as a test-bed for these and other new ideas.

Myself, I’m running one of a small number of “super” nodes in the swarm, with much more storage available than the average laptop or desktop, and often much better bandwidth too. That’s good, because some of our datasets run to multiple terabytes, plus, to ensure new nodes can get started quickly, we need to have some always-on nodes with most of the data available to others. Since BitTorrent is truly peer-to-peer, it doesn’t matter how many people have a copy of a given dataset: if none of them are online, no-one else can access it.

This is all very technically interesting, but communications, community, governance, policy, documentation, and funding are also vitally important, and for us these are all works in progress. We need volunteers to help with all of this, but especially those less-technical aspects. If you’re interested in helping, please drop us a line at contact@safeguar.de, or join our community forum and introduce yourself and your interests.

If you want to contribute but don’t feel you have the time or skills, well, to start with we’re more than happy to show you the ropes and help you get started, but as an alternative, I’m running one of those “super” nodes and you can contribute to my storage costs via GoFundMe: even a few quid helps. I currently have 3x 6TB hard drives with no space to mount them, so I’m in need of a drive cage to hold them and plug them into my server.

Special shout-out also to our sibling project, the Data Rescue Project, who are doing amazing work on this and often send us requests for websites or complex datasets for our community to save. I’ve barely scratched the surface here, but I _really_ want to actually get this post out for WDPD so I’m going to stop here and hopefully continue soon!
erambler.co.uk
November 6, 2025 at 9:32 PM
On this #worlddigitalpreservationday I'm reflecting on the fact that my attempts to preserve data documenting the development of online collections was thwarted this year by GLAM institutions who are today posting about #wdpd2025. Oh the irony...

Anyway, here's what I was trying to do […]
Original post on hcommons.social
hcommons.social
November 6, 2025 at 7:50 AM
196 SLV maps georeferenced. When will we hit 200? https://wragge.github.io/slv-allmaps/dashboard.html
allmaps_dashboard
wragge.github.io
November 5, 2025 at 10:35 AM
I'll be Zooming into this @IIIF workshop at Sydney Uni on Monday to talk about some of my experiments, including the latest work on SLV maps. It's free and there's still a few places left, so come along if you're IIIF-curious […]
Original post on hcommons.social
hcommons.social
November 5, 2025 at 6:28 AM