Bitcoin research with a transaction graph dataset

March 8, 2025

Graph construction

Raw data extraction

All transactions since the inception of the Bitcoin economy are stored in a public ledger called the Bitcoin blockchain¹⁵. The blockchain is maintained through a decentralized network of peers¹⁶. A new block of transactions is appended approximately every ten minutes. After installing Bitcoin Core (https://bitcoin.org/en/bitcoin-core) version 24.0, we set up a Bitcoin node with the standard configuration. The node allowed us to connect to the network of peers and download the complete transaction ledger. The entire transaction history was saved in the local blockchain data directory, specifically in the ‘blkXXXXX.dat’ files located in the ‘blocks’ folder created by the node. We then parsed these files to extract all transaction details. In this work, we considered the transactions contained within the first 700,000 blocks of the blockchain, as this range precedes the activation of the Taproot upgrade¹⁷. Taproot introduced significant modifications to Bitcoin’s transaction structure, which would require additional adaptations of the data processing methods. By focusing on blocks prior to this upgrade, the analysis remains consistent and avoids potential methodological complications.
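As an illustration of the parsing step, the following minimal sketch iterates over the blocks stored in the bytes of one ‘blkXXXXX.dat’ file and decodes the fixed-size block header. It assumes the files contain plain serialized mainnet blocks, each preceded by the 4-byte network magic and a 4-byte little-endian length. The function names `iter_raw_blocks` and `parse_header` are illustrative and are not taken from our code base; transaction decoding is omitted.

```python
import struct
from typing import Iterator, Tuple

MAINNET_MAGIC = bytes.fromhex("f9beb4d9")  # network magic preceding each stored block
HEADER_FORMAT = "<i32s32sIII"              # version, prev hash, merkle root, time, bits, nonce


def iter_raw_blocks(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, raw_block) for each block found in the bytes of a blk file."""
    pos = 0
    while pos + 8 <= len(data):
        if data[pos:pos + 4] != MAINNET_MAGIC:
            break  # zero padding or unknown content: stop scanning
        (size,) = struct.unpack_from("<I", data, pos + 4)
        start = pos + 8
        if start + size > len(data):
            break  # truncated record
        yield start, data[start:start + size]
        pos = start + size


def parse_header(raw_block: bytes) -> dict:
    """Decode the 80-byte block header (all integers are little-endian)."""
    version, prev_hash, merkle_root, timestamp, bits, nonce = struct.unpack_from(
        HEADER_FORMAT, raw_block, 0
    )
    return {
        "version": version,
        "prev_hash": prev_hash[::-1].hex(),      # hashes are displayed byte-reversed
        "merkle_root": merkle_root[::-1].hex(),
        "timestamp": timestamp,
        "bits": bits,
        "nonce": nonce,
    }
```

In practice, `data` would be obtained by reading a file in binary mode, and the `prev_hash` field makes it possible to restore the chain order of the blocks.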

Definition of nodes

All circulating bitcoins are allocated within unspent TXOs, each safeguarded by a locking script. Several TXOs can be locked by the same script, and thus can be spent by the same address or group of addresses. Furthermore, a transaction can be conceptualized as a transfer of value from one set of scripts to another. In this sense, scripts can be regarded as the owners of the bitcoins they lock. Consequently, locking scripts naturally emerge as candidates for the nodes within the graph dataset. All TXOs and scripts that have existed can be inferred from transaction data. In our analysis, TXOs with a zero value have been excluded, and we have identified over 874 million scripts.

A script is derived from a set of private keys held by one or more entities, thus making these entities the de facto owners of the bitcoins protected by the script. Typically, a user may possess multiple private keys for purposes of management, security, or privacy¹⁴. Additionally, the derivation of a locking script from a set of private keys is not unique¹⁸. As a result, a user generally owns or has owned bitcoins within TXOs protected by different scripts. For a more effective study of Bitcoin, it is preferable to analyze value transfers between real entities or users rather than scripts. This approach has been adopted in most previous research⁶,⁷,⁸. Therefore, it is necessary to identify and cluster scripts that likely belong to a single entity, which will then represent a node in the graph.

To achieve this, we employed heuristics developed in previous research¹⁹. These heuristics leverage established behavioral patterns and habits of Bitcoin users, along with known human biases, to establish links between scripts appearing in the same transaction. Consequently, the nodes in our graph represent clusters of scripts, with approximately 252 million clusters identified. Each cluster is identified by a unique integer alias. Henceforth, a TXO will be characterized by a value v and the alias a representing the cluster of its locking script. Since the locking scripts are also derived from private keys, we will refer to them as addresses hereafter.

Edges

A transaction is the transformation of a set of input TXOs Δ_in into a new set of output TXOs Δ_out. Nothing prevents an alias from being present in both the inputs and the outputs. This situation is common, for example, when receiving change from a payment, as all input TXOs are fully consumed regardless of the payment amount. Since an alias can appear in both the inputs and the outputs of a transaction, it must be determined whether this alias sends or receives value during the transaction. We define the net value received by an alias a using equation (1). This value is simply the difference between the value received in the outputs and the value spent in the inputs.

$$v_{\Delta}(a)=\sum_{(v',a')\in \Delta_{\mathrm{out}},\, a'=a} v' \;-\; \sum_{(v',a')\in \Delta_{\mathrm{in}},\, a'=a} v' \tag{1}$$

Consequently, the alias $a$ is classified as a recipient if its net received value $v_{\Delta}(a)$ is positive; otherwise, $a$ is a sender. The quantity transmitted from sender $a$ to recipient $a'$ is defined as the proportion of the total input value provided by $a$ times the value received by $a'$. Finally, an edge is drawn from each sender to each recipient of the transactions considered.
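The computation of equation (1) and of the resulting edges can be sketched as follows. This is a minimal illustration under stated assumptions rather than our exact implementation: TXOs are given as (value, alias) pairs, the contribution of a sender is taken to be its net spent value, and aliases with a zero net value are ignored. The function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

TXO = Tuple[int, int]  # (value in satoshis, cluster alias)


def net_values(inputs: Iterable[TXO], outputs: Iterable[TXO]) -> Dict[int, int]:
    """Equation (1): value received in the outputs minus value spent in the inputs."""
    net = defaultdict(int)
    for value, alias in outputs:
        net[alias] += value
    for value, alias in inputs:
        net[alias] -= value
    return dict(net)


def transaction_edges(inputs: Iterable[TXO], outputs: Iterable[TXO]) -> Dict[Tuple[int, int], float]:
    """Return {(sender, recipient): transferred value} for one transaction."""
    net = net_values(inputs, outputs)
    # Assumption: a sender's contribution is its net spent value (-v), and
    # aliases whose net value is exactly zero produce no edge.
    senders = {a: -v for a, v in net.items() if v < 0}
    recipients = {a: v for a, v in net.items() if v > 0}
    total_sent = sum(senders.values())
    edges = {}
    for sender, sent in senders.items():
        for recipient, received in recipients.items():
            edges[(sender, recipient)] = sent / total_sent * received
    return edges
```

For a payment with change, such as inputs `[(10, 1)]` and outputs `[(7, 2), (2, 1)]`, alias 1 has a net value of -8 and alias 2 a net value of +7, so a single edge from 1 to 2 carrying 7 is produced; the remaining unit is the transaction fee.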

CoinJoin and colored coin transactions

CoinJoin refers to a specific type of transaction that adds a layer of privacy to Bitcoin²⁰. The transaction ledger is public, and each transaction can be analyzed, making it relatively simple to follow value flows. Specifically, for a given user, it is straightforward to trace their wealth, from whom they receive value, and to whom they send it, which seriously compromises their privacy. CoinJoin works by combining multiple individual transactions from different users into a single large transaction. Each participant contributes input TXOs and specifies output TXOs without revealing which inputs belong to which outputs. As a result, it obscures the origin and destination of transactions, making it harder for external observers to trace the transfer flow of a particular user. This type of transaction also helps to thwart certain clustering heuristics by leading them to cluster together scripts that do not belong to the same user. For all these reasons, we have decided to exclude these transactions from (1) the construction of script clusters and (2) the addition of edges in the graphs. The construction of this type of transaction is facilitated by specialized software, with several implementations available, including Wasabi (https://docs.wasabiwallet.io) and Whirlpool/Samourai (https://github.com/Samourai-Wallet/Whirlpool). We used heuristics developed in previous work to detect these transactions²⁰. These heuristics were developed by analyzing the open-source implementations of these software implementations to identify recognizable patterns in such transactions.

Colored coin transactions are utilized to transfer value in forms other than bitcoin, including other cryptocurrencies and tangible assets²¹. These transactions embed additional information, such as the type of asset or the quantity being transferred, within the transaction’s locking script, making them relatively straightforward to identify. To accurately detect these transactions, we devised heuristics based on established protocols like Open Asset (https://github.com/OpenAssets), Omni Layer (https://github.com/OmniLayer), and EPOBC (https://github.com/chromaway/ngcccbase/wiki/EPOBC_simple). Consequently, we exclude these transactions from our graph construction to maintain the integrity of our analysis by focusing on standard Bitcoin transactions.
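To illustrate how protocol data embedded in a script can be recognized, the sketch below inspects a locking script for an OP_RETURN output whose payload starts with a known marker. It is only an illustration and not the heuristics used in this work: the marker prefixes are assumptions taken from the public Open Assets and Omni Layer protocol descriptions, only the single-push OP_RETURN form is handled, and EPOBC is not covered.

```python
from typing import Optional

OP_RETURN = 0x6A
OP_PUSHDATA1 = 0x4C

# Assumed payload prefixes, based on the public protocol descriptions.
MARKERS = {
    b"OA\x01\x00": "open_assets",  # Open Assets marker followed by version 1.0
    b"omni": "omni_layer",         # Omni Layer payload embedded in OP_RETURN
}


def op_return_payload(script: bytes) -> Optional[bytes]:
    """Return the data pushed after OP_RETURN, or None if the script has another form."""
    if len(script) < 2 or script[0] != OP_RETURN:
        return None
    op = script[1]
    if op <= 75:  # direct push of `op` bytes
        payload = script[2:2 + op]
        return payload if len(payload) == op else None
    if op == OP_PUSHDATA1 and len(script) >= 3:
        length = script[2]
        payload = script[3:3 + length]
        return payload if len(payload) == length else None
    return None


def classify_script(script: bytes) -> Optional[str]:
    """Return the name of the detected protocol, or None for an ordinary script."""
    payload = op_return_payload(script)
    if payload is None:
        return None
    for prefix, name in MARKERS.items():
        if payload.startswith(prefix):
            return name
    return None
```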

Attributes

Attributes attached to edges represent the aggregate characteristics of the directed value transfers. Attributes attached to nodes are primarily derived from the edges involving those nodes, providing insights into the nodes’ transactional behavior. The different attributes are described in Tables 1 and 2. The blockchain is a chain of blocks, each containing a sequence of transactions. When defining the attributes, the block index of a transaction refers to the index of the block containing the transaction within the blockchain. Consequently, the block index can be regarded as a timestamp.

Table 1 Attributes of the edges.

Table 2 Attributes of the nodes.

Overview of the dataset construction

The graph dataset consists of two tables: one table containing the nodes and their attributes, and another table containing the edges and their attributes. The dataset construction process required multiple steps and involved the creation of several intermediate tables. The construction process is illustrated in Fig. 2. All code is written in Python and is publicly available to ensure reproducibility. We use PostgreSQL (https://www.postgresql.org) for data storage in a database and the Python package Psycopg2 (https://www.psycopg.org) to query the database.

The code is organized as follows:

1. Block indexing: The transaction blocks downloaded with the Bitcoin node are stored in several thousand binary files. These blocks are organized in the order they are received from peers rather than their chronological position in the blockchain. To facilitate data reading, we create a table that contains the location of each block along with various metadata.
2. Transaction processing: The blocks and their transactions are read in chronological order. For each transaction, we store the TXOs created in the transaction in the ‘CreatedTXO’ table. Similarly, we store the spent TXOs in the ‘SpentTXO’ table. This allows for efficient retrieval of transaction-related data. During this step, we identify CoinJoin and colored coin transactions to exclude them from future analyses. In parallel, we list each encountered locking script (address) in a table.
3. Address clustering: In this step, we construct address clusters. Initially, each address forms its own cluster. As we process transactions and apply clustering heuristics, clusters merge until they form the final clusters, which represent the nodes in our graph. We use a disjoint-set data structure to efficiently store and merge clusters. The final clusters are stored in the ‘Alias’ table in the form of (address, cluster).
4. Inter-cluster edges: At this stage, we have identified the node corresponding to each address found in the transaction data. This allows us to determine the owning node of each TXO. We then use equation (1) to calculate value transfers within a transaction. As we process transactions, we add directed edges representing value transfers between nodes to the ‘TransactionEdge’ table. If an edge between two nodes already exists, its attributes are updated accordingly. From the ‘TransactionEdge’ table, we also construct the ‘UndirectedTransactionEdge’ table, which contains undirected edges between nodes. This table is useful for calculating the node degrees.
5. Intra-cluster edges: By repeating the previous step but staying at the level of TXOs owned by addresses, we construct the ‘ClusterTransactionEdge’ table, which contains undirected edges representing value transfers occurring within the same address cluster. This table is useful for computing certain node features.
6. Node attribute computation: The attributes of the nodes are computed by reading the tables constructed in steps 3, 4, and 5, using simple counting operations as well as additional operations such as summation, minimum, and maximum calculations.
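The address clustering step relies on a disjoint-set (union-find) structure, which can be sketched as follows. The merging rule used here, linking all addresses that fund the same transaction, is only one commonly used heuristic chosen for illustration; the actual heuristics are those of the cited work. The names `DisjointSet` and `cluster_addresses` are illustrative.

```python
from typing import Dict, Hashable, Iterable


class DisjointSet:
    """Union-find with path compression and union by size."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}
        self.size: Dict[Hashable, int] = {}

    def find(self, x: Hashable) -> Hashable:
        if x not in self.parent:  # a new element starts as its own cluster
            self.parent[x] = x
            self.size[x] = 1
            return x
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a: Hashable, b: Hashable) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller tree under the larger one
        self.size[ra] += self.size[rb]


def cluster_addresses(transactions: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Map each address to an integer alias, given the input addresses of each transaction."""
    ds = DisjointSet()
    for input_addresses in transactions:
        first = None
        for address in input_addresses:
            ds.find(address)  # register the address
            if first is None:
                first = address
            else:
                ds.union(first, address)  # illustrative heuristic: co-spent inputs
    alias_of_root: Dict[Hashable, int] = {}
    alias: Dict[str, int] = {}
    for address in list(ds.parent):
        root = ds.find(address)
        alias[address] = alias_of_root.setdefault(root, len(alias_of_root))
    return alias
```

The returned mapping plays the role of the (address, cluster) pairs stored in the ‘Alias’ table.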

Node labels

Various real-world entities with distinct motivations utilize Bitcoin, including individuals, government organizations, corporations, service providers, and criminal organizations. Extensive research efforts in Bitcoin are dedicated to examining the behavior and dynamics of value transfers among these diverse entities. These studies are invaluable for providing insights into the purposes and motivations behind Bitcoin usage. Bitcoin users are identified by randomly generated addresses. Information from the blockchain alone is insufficient to ascertain the true identity or nature of the entity represented by an alias. At no stage does the labeling process, nor the final labels, disclose any information pertaining to individual humans, thereby eliminating any privacy concerns.

Entities that have often been studied in prior research include single individuals and those connected to malicious or criminal activities, such as Ponzi schemes, ransomware operators, or mixers⁶,⁹,²². Other analyzed entities include participants in Bitcoin’s economic activity, such as miners, exchanges, marketplaces, and faucets, as well as those linked to entertainment, like sports betting and gambling platforms⁷,⁸,²³. Given the expansion of the crypto-economy beyond Bitcoin, we propose studying ‘bridges,’ which facilitate value transfers between different crypto-economies. Therefore, we aim to focus our investigation on the following categories of entities:

Individual
Mining: individual or entity that validates and confirms transactions on the Bitcoin network.
Exchange: online platform that facilitates the buying, selling, and trading of cryptocurrencies and fiat currencies.
Marketplace: online platform where users can buy and sell goods or services using bitcoin as the primary form of payment.
Gambling: online platform where users can wager and play casino games, sports betting, and participate in lotteries using Bitcoin.
Bet: address created by a gambling service specifically for receiving funds related to a unique bet.
Faucet: promotional tool that rewards users with small amounts of bitcoin for completing tasks or viewing advertisements.
Mixer: service that enhances the privacy and anonymity of transactions by making it more difficult to trace transactions on the blockchain.
Ponzi: a financial scheme promising high returns to investors by using funds from new investors to pay returns to earlier investors.
Ransomware: malicious software that encrypts files on a victim’s computer, demanding payment to decrypt and restore access.
Bridge: protocol that facilitates the exchange of assets between Bitcoin and different blockchain networks (e.g. Ethereum).

These entities were selected due to their relevance and prevalence within the cryptocurrency ecosystem, providing a comprehensive overview of the diverse actors within the Bitcoin ecosystem.

Summary

In this experimental framework for Bitcoin research, we leverage BitcoinTalk (https://bitcointalk.org), a prominent online forum, to extract and analyze Bitcoin-related data. Using a Python-based scraper with Selenium (https://www.selenium.dev), we systematically collected 14 million messages from 546,000 threads, focusing on posts mentioning Bitcoin addresses. These addresses were then associated with entities (e.g., services, companies) using ChatGPT (https://openai.com/chatgpt), a large language model fine-tuned for contextual understanding. ChatGPT was prompted to identify deposit addresses, hot/cold wallets, and withdrawal transactions based on post content, transaction IDs, and USD amounts converted using the Bitstamp exchange rate. The full labeling pipeline is illustrated in Fig. 3. This approach enabled the labeling of 34,000 nodes and 100,000 Bitcoin addresses with entity types, such as ransomware operators or Ponzi schemes, by mapping forum discussions to predefined categories. However, the dataset has limitations, including potential inaccuracies from user-generated content, biases toward English-speaking entities, and challenges in extracting precise information from unstructured text. Despite these constraints, ChatGPT demonstrated high accuracy (83-96%) in extracting relevant details from forum posts. This methodology provides a scalable, automated pipeline for constructing a large-scale, labeled Bitcoin transaction graph, facilitating advanced research into transaction patterns, entity behaviors, and malicious activities within the Bitcoin ecosystem. The integration of forum data and AI-driven labeling offers a novel approach to overcoming the scarcity of curated datasets in blockchain research.

BitcoinTalk

BitcoinTalk is an online forum dedicated to Bitcoin, and it remains one of the most active forums on the subject. The forum is divided into several sections, subsections, and threads. A thread is a sequence of messages or posts, which should be related to the thread’s topic or title. Bitcoin addresses are often mentioned in posts, and the context of the thread can sometimes help to assign these addresses to entities, such as services or companies. We developed a Python-based scraper utilizing the Selenium (https://www.selenium.dev) package to systematically collect posts from the English-speaking section of the forum. Threads in this section often comprise multiple messages distributed across several pages. The scraper initiates its operation by accessing the first page of a thread to extract its posts and then proceeds to navigate through subsequent pages to retrieve all remaining posts. For each post, the scraper captures the textual content, a unique author identifier, and the publication date. We collected the data at the end of 2023. In total, we collected 14,067,713 messages from 546,440 threads.

ChatGPT

Addresses were assigned to entity names using ChatGPT, an artificial intelligence assistant developed by OpenAI based on the GPT²⁴ foundation models. ChatGPT is designed to engage in human-like automated conversations with users. The GPT models have been fine-tuned using supervised learning and reinforcement learning from human feedback. A conversation consists of a sequence of user prompts and assistant responses. ChatGPT has demonstrated impressive results in various tasks, including following instructions and solving logic problems²⁵. We utilized ChatGPT (model ‘gpt-4o-mini’) via API calls (https://platform.openai.com) for this purpose.

Deposit addresses, hot and cold wallets

We concentrated on addresses owned by organizations, especially those providing services in exchange for bitcoins. Transactions between user addresses and service addresses are relatively common across various services²⁶. To access these services, users deposit funds by transferring bitcoins from their personal addresses to addresses managed by the organization, known as deposit addresses²⁷. Organizations typically generate unique deposit addresses for each client, simplifying the monitoring of client deposits. However, these addresses remain under the organization’s control. Once funds are deposited from personal addresses to the deposit addresses, users can utilize the service in exchange for the bitcoins they have deposited. Users can withdraw their remaining bitcoins to personal addresses after they have finished using the service. Typically, funds from deposit addresses are consolidated into two types of addresses: hot wallets and cold wallets²⁸. Hot wallets are internet-connected and hold sufficient funds for routine operations like user withdrawals. Conversely, cold wallets are generally offline to protect against online threats and store the bulk of the service’s funds and user deposits. Based on this typical interaction pattern and observations from the collected posts, we designed instructions for ChatGPT to extract information about the addresses mentioned in the posts.

Prompts

We have formulated several prompts to guide ChatGPT in associating addresses mentioned in a post with an entity name, provided that the context (the post and the thread title) allows for it. In addition to the textual information within the post, we incorporate supplementary data. Certain posts include transaction IDs, which serve as unique identifiers of transactions on the blockchain. This enables readers to retrieve detailed transaction information from the blockchain, thereby providing additional context. Since this detailed transaction information can also help ChatGPT, we included transaction details (senders, recipients, amounts) of the mentioned transaction IDs in the prompt. Although all amounts on the blockchain are denominated in satoshis, posters frequently refer to amounts in USD in their posts. To assist ChatGPT in matching amounts in bitcoins in transactions with the USD amounts mentioned in the posts, we added the converted USD amounts using the BTC/USD exchange rate from the date of the post. We obtained the daily conversion rate from the Bitstamp exchange using their official API (https://www.bitstamp.net/api).
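The conversion added to the prompts amounts to a simple lookup and multiplication, sketched below. The sketch assumes that the daily BTC/USD rates have already been retrieved and stored in a dictionary keyed by date; the function name and the rate used in the usage example are hypothetical.

```python
from datetime import date
from decimal import Decimal
from typing import Dict

SATOSHIS_PER_BTC = Decimal(100_000_000)


def satoshis_to_usd(amount_sat: int, post_date: date, daily_rates: Dict[date, Decimal]) -> Decimal:
    """Convert an on-chain amount to USD using the rate of the day the post was published."""
    rate = daily_rates[post_date]  # raises KeyError if no rate is known for that day
    usd = Decimal(amount_sat) / SATOSHIS_PER_BTC * rate
    return usd.quantize(Decimal("0.01"))
```

For instance, with a hypothetical rate of 30,000 USD per bitcoin, an output of 50,000,000 satoshis is reported to the model as 15,000.00 USD.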

All prompts are available in the code directory (see ‘Code availability’). The first prompt is designed to detect bitcoin deposits from customers to a service. Posters usually mention deposit addresses or transaction IDs when they encounter issues during their deposit, such as their account not being credited with the correct amount of bitcoin. This prompt associates mentioned deposit addresses with entity names if the context allows. The second prompt targets users’ withdrawals. When users want to withdraw their funds, they provide a recipient address to the service. The service then creates the transaction and communicates the transaction ID once it is incorporated into the blockchain. Withdrawal transactions are typically funded by the service’s funds, likely controlled by hot wallets. For this reason, we assume that the sending addresses of these transactions are owned by the identified service. The third prompt is quite similar; it also attempts to detect withdrawals and aims to identify the involved entity, the address used by the user, and the amount withdrawn. We reduce this to the previous case by searching the blockchain for the corresponding withdrawal transaction around the date of the post (± three days), with the withdrawal address as an output receiving the indicated amount. If a unique transaction matches these characteristics, we assume it is the withdrawal transaction, and the addresses funding this transaction belong to the detected entity. Finally, the last prompt identifies hot and cold addresses under various circumstances.
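The search for a unique withdrawal transaction can be sketched as follows. This is a simplified illustration: transactions are held in memory as `Tx` records (a hypothetical structure), dates are compared at day granularity, and the amount is assumed to match exactly, whereas an actual search would query the database and could tolerate rounding.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Optional, Tuple


@dataclass
class Tx:
    txid: str
    day: date
    input_addresses: List[str]
    outputs: List[Tuple[str, int]]  # (address, value in satoshis)


def find_withdrawal(
    txs: Iterable[Tx],
    address: str,
    amount_sat: int,
    post_day: date,
    window_days: int = 3,
) -> Optional[Tx]:
    """Return the transaction paying `amount_sat` to `address` near `post_day`, if unique."""
    window = timedelta(days=window_days)
    matches = [
        tx
        for tx in txs
        if abs(tx.day - post_day) <= window and (address, amount_sat) in tx.outputs
    ]
    # Zero or several candidates are both treated as "no match".
    return matches[0] if len(matches) == 1 else None
```

When a transaction is returned, its `input_addresses` are the addresses attributed to the detected entity.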

Labeling

Prompts have been carefully crafted to ensure the assistant returns a succinct reasoning along with an entity name and Bitcoin addresses or transaction IDs. A mapping between the returned entity names and the predefined entity categories has been established. This mapping process utilized threads from BitcoinTalk where entity names were mentioned and the Internet Wayback Machine (https://web.archive.org). In cases where the name of an entity was unknown, often because services had ceased operations or failed to achieve prominence, we manually navigated BitcoinTalk forum threads, focusing on those referencing the entity, to infer the type of service provided. When posts included URLs, we attempted to access the corresponding websites to gather additional context. For websites that were no longer operational, we employed the Internet Wayback Machine to retrieve historical snapshots of the webpages. If the entity type could not be determined following these procedures, the address-entity pair was excluded from the dataset to maintain the integrity and reliability of the labels. For each labeled address, we identified the locking scripts that can be unlocked by the address and gave them the same label as the address. Subsequently, for each labeled script, we assigned the corresponding cluster/node the same label. In instances where a cluster contained multiple conflicting labels, no label was assigned.
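The propagation of labels from addresses to clusters, including the handling of conflicts, can be sketched as follows. For brevity, this sketch assumes a direct mapping from each address to its cluster alias and skips the intermediate step through locking scripts; the function name is illustrative.

```python
from typing import Dict, Set


def label_clusters(address_labels: Dict[str, str], alias: Dict[str, int]) -> Dict[int, str]:
    """Assign each cluster the label of its addresses, unless those labels conflict."""
    labels_seen: Dict[int, Set[str]] = {}
    for address, label in address_labels.items():
        cluster = alias.get(address)
        if cluster is None:
            continue  # address never observed in the transaction data
        labels_seen.setdefault(cluster, set()).add(label)
    # A cluster carrying more than one distinct label is left unlabeled.
    return {
        cluster: next(iter(labels))
        for cluster, labels in labels_seen.items()
        if len(labels) == 1
    }
```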

Limits

Posts retrieved from BitcoinTalk may contain inaccuracies, misinformation, or deliberate falsehoods posted by users. Users may misrepresent the context of a transaction or intentionally provide misleading information about the ownership or purpose of certain addresses. These inaccuracies may stem from the pseudonymous nature of the forum users, which can make it challenging to verify the authenticity of claims. Furthermore, the dataset may reflect biases due to its reliance on self-reported or community-shared information, which could disproportionately highlight certain types of entities or behaviors while omitting others.

Additionally, since we only fetched posts from the English-speaking part of the forum, the constructed dataset of labels may not be representative of all entities globally. This language constraint could lead to an underrepresentation of non-English-speaking users and entities, potentially skewing the node labels toward regions and communities more active in English-language discussions.

Moreover, despite advancements in large language models such as ChatGPT and others, extracting information from short, unstructured text remains a significant challenge. Online discussions often lack sufficient contextual clarity, contain ambiguous references, or exhibit inconsistent formatting, hindering precise information retrieval. To evaluate the model’s capability in extracting the desired information, we conducted an assessment using samples drawn from all available posts. Specifically, for each prompt, we selected 150 posts containing at least one Bitcoin address or transaction identifier. The evaluation metric was the proportion of posts in which all relevant details were correctly identified and extracted. For the deposit address extraction prompt, we filtered posts containing the keywords ‘deposit’, ‘deposited’, ‘transfer’, and ‘transferred’ to maximize the presence of relevant addresses. Our analysis showed that ChatGPT successfully extracted complete information from 138 out of 150 posts (92%). For the withdrawal address and transaction prompts, we selected posts mentioning ‘withdraw’, ‘withdrew’, ‘withdrawn’, and ‘withdrawal’. The results indicate high extraction accuracy, with 96% for withdrawal transactions and 86% for withdrawal addresses. For the hot and cold wallet address extraction prompt, we selected posts containing the terms ‘hot’ and ‘cold’. ChatGPT correctly extracted information from 83% of the posts. These findings demonstrate that ChatGPT effectively extracts information from forum posts.

Other sources

To enrich our dataset, we incorporated additional data from diverse sources.

We have incorporated wallet addresses from various cryptocurrency exchanges, as provided by the exchanges themselves. Several exchanges publicly disclose the addresses of their hot and cold wallets to enhance transparency and demonstrate their custody of customer funds. We manually collected these addresses directly from the official websites of the respective exchanges. The URLs where these addresses were found are listed in Table 3. The addresses are directly available on the link’s page or are contained in a file that can be downloaded from the page.

Table 3 Sources of cryptocurrency exchange wallet addresses.
We collected 29 addresses from DefiLlama (https://defillama.com/cexs), specifically those associated with the exchanges Coinsquare, Gate.io, Swissborg, Latoken, Woo, and Cakedefi. These addresses can be accessed by selecting the exchange on the DefiLlama CEX page, then clicking on the Wallet Addresses button, which redirects to a GitHub page containing the respective addresses.
Our collection of ransomware addresses was expanded by incorporating addresses identified in previous research papers, including the two Padua ransomware datasets (https://spritz.math.unipd.it/projects/btcransomware/)⁶ and the Montréal ransomware dataset (https://doi.org/10.5281/zenodo.1238041)¹⁰,²⁹. These addresses were compiled from public sources such as security reports, academic publications, and online databases that list Bitcoin addresses linked to illegal activities.
We also integrated addresses from the Specially Designated Nationals (SDN) List (https://sanctionslist.ofac.treas.gov/Home/SdnList) maintained by the U.S. Department of Treasury. The addresses are available in the ‘SDN.XML’ file. The selected entities from this list include ‘Suex’, ‘Chatex’, and ‘Garantex’ (all labeled as exchanges), ‘Hydra’ (marketplace), and ‘Blender.io’ (mixer).
Additionally, we included addresses holding bitcoins related to the bridge Wrapped Bitcoin (https://wbtc.network/dashboard/audit), categorizing them under ‘bridge’.
Bitcoin addresses were also extracted from user profiles on BitcoinTalk. Forum users often include their personal Bitcoin addresses in their profiles or signatures, displayed below their posts. We scraped the profiles of all forum posters from previously collected messages, labeling each identified address as ‘individual’.
Furthermore, we incorporated addresses associated with mining entities. Miners earn rewards for incorporating new transactions into the blockchain through special transactions known as Coinbase transactions. Miners have the ability to include a message in these transactions; some mining companies may embed their name or a distinctive pattern. These messages allow us to identify and categorize these addresses under ‘mining’.
Lastly, we included addresses related to betting and gambling platforms. Some platforms, such as Fairlay and DirectBet, enable customers to participate in bets by sending bitcoin to specific addresses created for each bet. Users can share their bets on forums like BitcoinTalk via URLs that redirect to the bet, which include the Bitcoin deposit addresses. We developed regex patterns to detect these URLs, extract the addresses, and classify them as ‘bet’.
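As an illustration of pattern-based address extraction, the sketch below finds candidate Bitcoin addresses in a piece of text such as a shared URL. The pattern is a generic assumption covering Base58 addresses starting with 1 or 3 and Bech32 addresses starting with bc1; it is not the set of platform-specific URL patterns used in this work, and it does not verify checksums, so matches remain candidates until validated.

```python
import re
from typing import List

# Generic candidate pattern (assumption): Base58 or Bech32 mainnet addresses.
ADDRESS_RE = re.compile(
    r"\b(?:[13][1-9A-HJ-NP-Za-km-z]{25,34}|bc1[02-9ac-hj-np-z]{11,71})\b"
)


def extract_addresses(text: str) -> List[str]:
    """Return the distinct candidate addresses found in `text`, in order of appearance."""
    return list(dict.fromkeys(ADDRESS_RE.findall(text)))
```

Applied to a hypothetical link such as `https://example.invalid/bet/1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa`, the function returns the single address embedded in the path.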

All addresses obtained in this subsection, except for the 29 addresses from DefiLlama, are included in the dataset (see ‘Data Records’). However, these excluded addresses can still be accessed through the provided links.
