claromes.com

Major Release: v1.0 of Wayback Tweets

A toolkit for retrieving archived tweets


by claromes in releases, en_us, OSINT, SOCMINT, open source


Last year I announced on this blog that the project would move to the command line and that new features would come with that shift. In addition to the CLI, there is now an API that can be imported independently as a module, which allows more flexible use of the tool.

The web app prototype will not receive every update from the package, but it still supports downloading archived tweets and benefits from 35 GiB of resources kindly provided by the Streamlit team. The legacy version of the web app is no longer maintained, but it remains available.

The tool will continue to evolve and gradually become more robust.

Features

Here are the main features of the Python package in the stable version 1.0.

Archived Tweet Retrieval

Fetches data from the Wayback Machine CDX API based on a Twitter/X username and supports query continuation using a resumption_key.
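
To make the idea of query continuation concrete, here is a minimal sketch of how a request to the Wayback Machine CDX API could be assembled by hand. It is not taken from the Wayback Tweets source code: the helper name build_cdx_url and the URL pattern are assumptions for illustration, and the mapping of resumption_key to the CDX resumeKey parameter is also an assumption. No network request is made.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(username, resume_key=None):
    """Build a CDX query URL for a user's archived tweets.

    Hypothetical helper: the package may build its requests differently.
    """
    params = {
        # Match every capture under the user's /status/ path (assumed pattern)
        "url": f"twitter.com/{username}/status/*",
        "output": "json",
        # Ask the CDX server to return a key for continuing the query later
        "showResumeKey": "true",
    }
    if resume_key:
        # Continue a previous query from where it stopped
        params["resumeKey"] = resume_key
    # Keep "/" and "*" unescaped so the URL stays readable
    return f"{CDX_ENDPOINT}?{urlencode(params, safe='/*')}"


first_page = build_cdx_url("jack")
next_page = build_cdx_url("jack", resume_key="abc")
```

The second call only differs by the extra resumeKey parameter, which is what lets a long query be split across several requests.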

Date Filtering and Result Grouping

The tool allows filtering results by date using the --from and --to parameters in the YYYYmmdd format. You can also limit the number of results with --limit. To avoid duplicate entries (e.g., multiple captures of the same tweet), you can group results using the --collapse option, with fields such as urlkey, digest, or timestamp.
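
The sketch below illustrates the two ideas in this section locally, without calling the tool: checking that a date is in YYYYmmdd form, and what collapsing by a field means. The functions valid_date and collapse are hypothetical names, and the captures are made-up rows. The collapse logic mirrors the CDX server's behavior of dropping adjacent captures that share the same field value; it is not the package's implementation.

```python
from datetime import datetime
from itertools import groupby


def valid_date(value):
    """Return True if value is a real calendar date written as YYYYmmdd."""
    if len(value) != 8 or not value.isdigit():
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


def collapse(rows, field):
    """Keep the first row of each run of adjacent rows sharing the same field value."""
    return [next(group) for _, group in groupby(rows, key=lambda row: row[field])]


# Made-up captures: the first two share a digest, so they hold identical content
captures = [
    {"timestamp": "20200101120000", "digest": "AAA"},
    {"timestamp": "20200305080000", "digest": "AAA"},
    {"timestamp": "20210610093000", "digest": "BBB"},
]
unique = collapse(captures, "digest")
```

Here unique keeps the first and third captures, which is the effect --collapse digest aims for when the same tweet was captured several times.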

Comprehensive Parsing of Archived Data

Beyond fetching tweets, Wayback Tweets performs detailed parsing of archived captures, extracting fields such as:

  • Timestamp of the capture (in a human-readable format).
  • Original and archived URLs (including parsing of legacy URL structures).
  • Tweet text still available from the source.
  • Retweet identification.
  • Archived content type.
  • HTTP status (when available), content size, and hash digest of the original data.

Read more about the Field Options.
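
As a rough illustration of the parsing step, the following sketch turns one CDX row into a few of the fields listed above. It assumes the default CDX column order and uses a made-up row; parse_capture is a hypothetical helper, the readable timestamp format is an assumption, and the real parser handles many more cases (legacy URL structures, tweet text, retweets).

```python
from datetime import datetime

# Default column order of a CDX API response
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"]


def parse_capture(row):
    """Map one CDX row to a dict of parsed fields (illustrative subset)."""
    capture = dict(zip(CDX_FIELDS, row))
    timestamp = capture["timestamp"]
    # CDX timestamps are 14 digits: YYYYmmddHHMMSS
    readable = datetime.strptime(timestamp, "%Y%m%d%H%M%S").strftime("%Y-%m-%d %H:%M:%S")
    return {
        "archived_timestamp": timestamp,
        "parsed_archived_timestamp": readable,
        "original_tweet_url": capture["original"],
        # Wayback Machine capture URLs combine the timestamp and the original URL
        "archived_tweet_url": f"https://web.archive.org/web/{timestamp}/{capture['original']}",
        "mimetype": capture["mimetype"],
        "statuscode": capture["statuscode"],
        "digest": capture["digest"],
        "length": capture["length"],
    }


# Made-up row for illustration only
row = [
    "com,twitter)/jack/status/20",
    "20060321205014",
    "https://twitter.com/jack/status/20",
    "text/html",
    "200",
    "ABC123",
    "1234",
]
parsed = parse_capture(row)
```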

Export in Multiple Formats

  • HTML: Ideal for viewing in a browser using <iframe>.
  • CSV: For spreadsheet analysis.
  • JSON: For programmatic use.
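
For readers who want to see what the CSV and JSON outputs amount to, here is a small standard-library sketch that serializes parsed rows restricted to a chosen list of fields. The functions to_csv and to_json are hypothetical and are not the package's exporter; they only show the general shape of the data.

```python
import csv
import io
import json


def to_csv(rows, fields):
    """Serialize rows to CSV text, keeping only the requested fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows, fields):
    """Serialize rows to JSON text, keeping only the requested fields."""
    return json.dumps([{field: row.get(field) for field in fields} for row in rows], indent=2)


# Made-up parsed row for illustration only
rows = [
    {
        "archived_timestamp": "20060321205014",
        "original_tweet_url": "https://twitter.com/jack/status/20",
        "digest": "ABC123",
    }
]
fields = ["archived_timestamp", "original_tweet_url"]

csv_text = to_csv(rows, fields)
json_text = to_json(rows, fields)
```

Fields that are not requested (digest in this example) are simply left out of both outputs.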

Practical Examples

Using the CLI:

waybacktweets [OPTIONS] USERNAME
waybacktweets --from 20200101 --to 20231231 --limit 500 --collapse digest jack

Using the package as a module:

from waybacktweets import WaybackTweets, TweetsParser, TweetsExporter

USERNAME = "jack"

# Fields to include in the parsed and exported output
FIELD_OPTIONS = [
    "archived_timestamp",
    "parsed_archived_timestamp",
    "archived_tweet_url",
    "parsed_archived_tweet_url",
    "original_tweet_url",
    "parsed_tweet_url",
]

# Fetch the archived captures for the user
api = WaybackTweets(USERNAME)
archived_tweets = api.get()

# Parse and export only if the request returned data
if archived_tweets:
    parser = TweetsParser(archived_tweets, USERNAME, FIELD_OPTIONS)
    parsed_tweets = parser.parse()

    exporter = TweetsExporter(parsed_tweets, USERNAME, FIELD_OPTIONS)
    exporter.save_to_csv()

The CLI is the simpler and more practical option, offering direct commands to retrieve archived tweets. The API, on the other hand, is more flexible: it lets you integrate data access into other projects and customize both the queries and the tweet processing.

Highlights

Following the initial alpha and release candidate versions of 1.0, the project received some recognition...

It was featured in Henk van Ess' Deleted Tweet Finder project, where Wayback Tweets is one of the search options ("User Search"). There is also a demonstration in Spanish on Jey Zeta's YouTube channel and a great tutorial in English written by CyberRaya. It was mentioned in the 5th edition of a guide published by the Institute of the Security Service of Ukraine at Yaroslav Mudryi National Law University (p. 41), and also in the 4th edition of the book Manual de Investigação Digital by Guilherme Caselli (pp. 425–426), published in Brazil by Editora Juspodivm.

Visit

Below are the links to each part of the toolkit, along with everything that has been implemented or modified in the project since the alpha versions.

What's Changed

v1.0

  • Added resumption_key option
  • Updated documentation
  • Fixed JSON generation
  • Updated HTML visualization
  • Updated Streamlit Web App

v1.0rc1

  • Streamlit App/Docs: Improved description
  • Streamlit App: Fixed IndexError

v1.0rc0

  • Added Pandas as a package dependency
  • Checked field_options in the Viz module
  • Fixed accordions not opening in Firefox
  • Adjusted the Streamlit app to allow search without date filter

v1.0a7

  • Streamlit App: Added tabs to show results (HTML, CSV, JSON)
  • Streamlit Legacy App: Updated descriptions
  • Module: Updated CLI help text
  • Added Donate button
  • Added Hands-On Examples to documentation
  • Updated Installation documentation

v1.0a6

  • Streamlit Web App: Fixed width title
  • Streamlit Web App: Set anchor headers to False
  • Streamlit Web App: Added username query param
  • Added Citation file

v1.0a5

  • Fixed visualize module
  • Updated Streamlit Web App

v1.0a4 (released alongside v1.0a5)

  • Updated Streamlit Web App
  • Updated documentation
  • Updated CLI print messages
  • Added pagination to the generated HTML
  • Added "Outputs" to the documentation

v1.0a3 (released alongside v1.0a5)

  • Updated Streamlit version to 1.36
  • Updated Streamlit Web App UI
  • Added legacy Streamlit Web App (v0.4.3)
  • Updated visualize and export modules
  • Fixed request module

v1.0a2 (released alongside v1.0a5)

  • Added streamlit only for dev group in Poetry
  • Added Python 3.10 as a dependency
  • Added accordion on generated HTML
  • Added parsed_archived_timestamp as a Field Option
  • Reviewed tweet URL parser

v1.0a1 (released alongside v1.0a5)

  • Updated the base code
  • Downloaded the archived tweets CDX data
  • Parsed available tweets
  • Parsed JSON archived tweets (not implemented in the API or CLI, only in the Web App)
  • Added HTML generator
  • Added docstrings
  • Added Poetry for package management
  • Added Black, Flake8, isort, and pre-commit for development
  • Added documentation with Sphinx (initially tested with MkDocs, but decided to use Sphinx with the Pallets/Flask theme)
  • Added CLI with the click package
  • Updated the Streamlit Web App:
      • To use the waybacktweets package (not yet implemented on Streamlit Cloud)
      • Updated Streamlit version (1.35.0)
      • Added a calendar interface
  • Updated README and LICENSE
  • Added automatic documentation deployment with Actions
  • Added verbose flag in the CLI and global configuration for verbose mode
  • Published version 1.0 alpha on PyPI
  • Added basic OpenGraph tags / General template for all documentation pages
