Lightnews — Scholar-powered news

Travis Reid

@treid803.bsky.social

Goel et al.’s crawler (Jawa) removes some non-deterministic JavaScript code so that the replay of an archived web page does not change if different users replay it. Removing this code resulted in storage savings of 41% and improved the crawling throughput by 39%.

June 11, 2025 at 11:34 PM

Travis Reid

@treid803.bsky.social

While working on the Saving Ads project, we encountered similar problems that involved JavaScript code generating URLs with random values that differed during crawl time and replay time. A Google SafeFrame URL is an example where a random value in the URL caused a replay problem.

June 11, 2025 at 11:34 PM

Travis Reid

@treid803.bsky.social

When non-determinism causes a resource's URL to vary between crawl and replay, the replay-time request fails, which prevents the resource from loading.

Goel et al. matched a requested URL with a crawled URL by removing the query string (querystrip) and using Levenshtein distance (fuzzy matching).

June 11, 2025 at 11:34 PM
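The two matching strategies named in the post above (querystrip and Levenshtein-based fuzzy matching) can be sketched roughly as follows. This is only an illustration, not Goel et al.'s implementation: the function names, the order in which the two strategies are tried, the `max_distance` threshold, and the example URLs are all assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def querystrip(url: str) -> str:
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def match_url(requested, crawled_urls, max_distance=10):
    """Return the crawled URL that best matches the requested one, or None.

    Assumed order: try querystrip equality first, then fall back to the
    closest crawled URL within max_distance edits (hypothetical threshold).
    """
    stripped = querystrip(requested)
    for candidate in crawled_urls:
        if querystrip(candidate) == stripped:
            return candidate

    best, best_dist = None, max_distance + 1
    for candidate in crawled_urls:
        dist = levenshtein(requested, candidate)
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best


# Hypothetical crawl: the random query value differs at replay time.
crawled = [
    "https://ads.example.com/frame.html?rnd=12345",
    "https://cdn.example.com/lib.js",
]
print(match_url("https://ads.example.com/frame.html?rnd=98765", crawled))
```

Trying querystrip first keeps the cheap comparison on the common path, where only a random query value changed; the edit-distance fallback only runs when no stripped URL matches.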

Travis Reid

@treid803.bsky.social

Sources of non-determinism identified by Goel et al.:
> Server-side state
> Client-side state
> Client characteristics
> JavaScript's Date, Random, and Performance APIs

June 11, 2025 at 11:34 PM

Travis Reid

@treid803.bsky.social

Our tech report provides additional details not covered in the blog posts:
bsky.app/profile/trei...

(7/7)

Travis Reid @treid803.bsky.social · Feb 6

Our new tech report (arxiv.org/abs/2502.01525): "Archiving and Replaying Current Web Advertisements…" describes the archiving and replay problems we encountered while creating a dataset of 279 archived ads.

#WebArchiveWednesday @webscidl.bsky.social
@machawk1 @phonedudemln @weiglemc

(1/10)

February 12, 2025 at 8:58 PM

Travis Reid

@treid803.bsky.social

To learn more about the replay problems identified while creating this dataset you can read this blog post:
ws-dl.blogspot.com/2024/12/2024...

(6/7)

2024-12-04: Problems With Replaying Ads That Use iframes

The Web Science and Digital Libraries Research Group at Old Dominion University.

ws-dl.blogspot.com

February 12, 2025 at 8:58 PM

Travis Reid

@treid803.bsky.social

We also created a web page that allows us to view all of the information from the dataset including the replay of the archived ads:
savingads.github.io/themed_ad_co...

(5/7)

February 12, 2025 at 8:58 PM

Travis Reid

@treid803.bsky.social

To identify ads that could not be replayed inside the web page that originally loaded them during the crawling session, we used ReplayWeb.page’s URL search feature (replayweb.page/docs/user-guide/exploring/) and our Display Archived Ads tool (github.com/savingads/Display-Archived-Ads).

(4/7)

February 12, 2025 at 8:58 PM

Travis Reid

@treid803.bsky.social

When archiving these web pages, we utilized:
> Web archiving services
>> Internet Archive's Save Page Now
>> Arquivo.pt
>> archive.today
>> Conifer

> Browser-based tools
>> ArchiveWeb.page
>> Browsertrix Crawler
>> Brozzler

(3/7)

February 12, 2025 at 8:58 PM

Travis Reid

@treid803.bsky.social

Our dataset of 279 ads was created by archiving 17 web pages from SimilarWeb's top websites worldwide.

Dataset: github.com/savingads/Re...

(2/7)

February 12, 2025 at 8:58 PM

Travis Reid

@treid803.bsky.social

We were able to create a dataset of 279 ads by archiving 17 web pages from SimilarWeb’s top websites worldwide.

Dataset of 279 archived ads: github.com/savingads/Re...

We also created a web page to display ads from our dataset: savingads.github.io/themed_ad_co...

(10/10)

February 6, 2025 at 4:27 AM

Travis Reid

@treid803.bsky.social

To identify ads that could not be replayed inside the web page that originally loaded them during the crawling session, we used ReplayWeb.page’s URL search feature (replayweb.page/docs/user-guide/exploring) and our Display Archived Ads tool (github.com/savingads/Display-Archived-Ads).

(9/10)

February 6, 2025 at 4:27 AM

Travis Reid

@treid803.bsky.social

5) Successful replay of ads loaded in iframes with the src attribute of "about:blank" depended upon a given browser's service worker implementation. A Chromium bug stopped service workers from accessing resources inside of this type of iframe, which prevented replay.

(8/10)

February 6, 2025 at 4:27 AM

Travis Reid

@treid803.bsky.social

4) When loading Flashtalking web page ads outside of ad iframes, the ad script requested a non-existent URL, which prevented the replay of ad resources.

(7/10)

February 6, 2025 at 4:27 AM

Travis Reid

@treid803.bsky.social

We created an example web page that used ad code from Google’s pubads_impl_2023020201.js script to determine how the random values were generated for a Google SafeFrame.

Demo web page for generating random numbers and Google SafeFrames: treid003.github.io/random_Value...

(6/10)

February 6, 2025 at 4:27 AM

Travis Reid

@treid803.bsky.social

3) During crawling and replay sessions, Google's and Amazon's ad scripts generated URLs with different random values, because the random number generator’s seed is not the same during the crawl and replay sessions. This prevented the archived ads from replaying.

(5/10)

February 6, 2025 at 4:27 AM
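The effect described in the post above can be reproduced in a few lines. The sketch below is only an analogy: Python's `random.Random` stands in for the ad script's random number generator, and the URL, the `rnd` parameter name, and the seeds are hypothetical, not taken from Google's or Amazon's code.

```python
import random


def build_ad_url(rng: random.Random) -> str:
    """Build a hypothetical ad frame URL that embeds a random value."""
    token = rng.randrange(10**9)
    return f"https://ads.example.com/safeframe.html?rnd={token}"


# Crawl session: the generator starts from one seed and the URL is archived.
crawl_url = build_ad_url(random.Random(1))
archive = {crawl_url}

# Replay session with a different seed: the script asks for a URL that was
# never archived, so an exact-match lookup misses.
replay_url = build_ad_url(random.Random(2))
print(replay_url in archive)

# Replay session with the crawl-time seed: the same URL is regenerated,
# so the lookup succeeds.
seeded_replay_url = build_ad_url(random.Random(1))
print(seeded_replay_url in archive)
```

This suggests two broad ways to handle the mismatch: make the generator behave the same way at crawl and replay time, or make the archive lookup tolerant of the differing value.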

Travis Reid

@treid803.bsky.social

2) During 2023, Brozzler was incompatible with versions of Chrome released after version 111.0.5563.110, which prevented ads from being archived.

This thread describes the problem:
x.com/TReid803/sta...

(4/10)

February 6, 2025 at 4:27 AM

Travis Reid

@treid803.bsky.social

1) Before August 2023, Internet Archive's Save Page Now (SPN) excluded ad services' ads & URLs with ad-related file and directory names. After August 2023, SPN still excluded ads loaded on a web page & only allowed ad resources if the user directly archived the ad's URL(s).

(3/10)

February 6, 2025 at 4:27 AM

Travis Reid

@treid803.bsky.social

Five problems with archiving & replaying ads during 2023:
1. IA's Save Page Now excluded ads
2. Brozzler's incompatibility with Chrome
3. Google & Amazon ad URLs with random values
4. Flashtalking ads requested unarchived URL
5. Replay of ads differed depending on browser

(2/10)

February 6, 2025 at 4:27 AM
