An angry paper

Written on , 5-minute read

Hot opinion
The blog post you may be about to read deals with some things that are beyond unacceptable to me.

It's salty. Most of what I've written in this post is my opinion on the matter. If you agree, fine. If you disagree, fine too. But don't start yelling at me pretending to have the ultimate knowledge.

The context

A research paper, written by some students of the University of Milan and published by the AAAI in its proceedings, blew up quite a bit on Mastodon. Reading the replies to the post advertising the paper, I got a little bit mad at how my data had been used without my agreement, and at the neglect of some aspects of said data.

The paper is named "Mastodon Content Warnings: Inappropriate Contents in a Microblogging Platform" and, in a nutshell, explains how the authors collected about 6M toots and trained some AI to classify toots as "appropriate" or "inappropriate". In short, they made a tool that automatically sorts content, like Facebook and others do.

With some plots here and there and stuff. If you want to read it in detail, you are free to do so, even though there are some elements I'm skeptical about or that I completely or partially disagree with.

The interesting part

Now to the interesting part, where I'm going to express my opinion on the subject. In the replies to the toot announcing the paper, there's a link to the list of scraped instances on Pastebin. I was able to find the instance where my main account lives.

Identifying information

Examining the toots scraped without my consent (more on that below), which were previously published and have since been deaccessioned, I was able to find my toots with a very simple command line: grep 'users/l4p1n' timeline_*.jsonl | less. In total, I found 2010 toots from me with cat timeline_*.jsonl | jq .uri | grep 'users/l4p1n'. The issue is that the students wrote in the published paper:

Since the Mastodon user may be unaware of their data being public and reusable for research purposes we disposed of the information about the users and we fully [emphasis added] anonymized them by hashing the Mastodon user identifier.

I have just found some identifying information about me: my username, the direct URI of the toot, its contents, when the toot was sent, and everything else that the Mastodon API spits out, all in the previously published timeline dataset. All of this without my consent, thus potentially violating GDPR. To illustrate, this is the kind of line found in the gigabytes of JSON data (pretty printed):

    "id": "1[REDACTED]",
    "created_at": "2018-06-00T00:00:00.000Z",
    "sensitive": false,
    "spoiler_text": "",
    "language": "fr",
    "uri": "[REDACTED]",
    "instance": "",
    "content": "<p>[REDACTED]</p>",
    "account_id": "000",
    "tag_list": [],
    "media_attachments": [],
    "emojis": [],
    "mentions": []

Some of the data predates the day GDPR became enforceable, so I have no idea whether the violation also applies there.

Worse, rereading their paper later (more or less carefully), I spotted a screenshot of the API response with... the toot URI... insert cricket noises... cricket noises intensify... Okay, Africa may not (yet) be a part of Europe, but still!
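To make the URI problem concrete, here is a minimal Python sketch of what my grep does. It assumes toot URIs contain a 'users/<username>' segment, which is the pattern I searched for; the instance name, the username "alice" and the function name are made up for the example. The point is that an opaque account_id does not help when the URI sits right next to it.

```python
import json
import re
from typing import Iterable, Iterator, Tuple

# Assumption: the URI contains a "/users/<username>/" segment,
# which is what the grep in this post matches on.
USER_IN_URI = re.compile(r"/users/([^/]+)/")


def usernames_from_jsonl(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (account_id, username) for every record whose URI names a user."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        match = USER_IN_URI.search(record.get("uri") or "")
        if match:
            yield record.get("account_id", ""), match.group(1)


# Two made-up records: the account_id is opaque, the URI is not.
sample = [
    '{"account_id": "000", "uri": "https://example.social/users/alice/statuses/1"}',
    '{"account_id": "000", "uri": "https://example.social/users/alice/statuses/2"}',
]

recovered = list(usernames_from_jsonl(sample))
print(recovered)
```

Every record gives the username straight back, whatever was done to the account identifier.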

Answer: No, I did not give them my explicit consent. They may have respected what the instance tells robots in robots.txt, as the authors wrote:

[...] we have also respected the limitations imposed by the robots.txt files of the different instances.
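I don't know how the authors actually checked this. As a rough sketch only, a robots.txt check can be done with Python's standard urllib.robotparser. The rules, the user agent name and the URLs below are invented for the example and are not taken from any real instance.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots_txt(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules, not the robots.txt of a real instance.
example_rules = """User-agent: *
Disallow: /users/
"""

profile_ok = allowed_by_robots_txt(
    example_rules, "research-bot", "https://example.social/users/alice"
)
about_ok = allowed_by_robots_txt(
    example_rules, "research-bot", "https://example.social/about"
)
print(profile_ok, about_ok)
```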

But there's not only a robots.txt file to respect. In the <head> of the HTML page, there's also another tag to watch out for. Guess what? The <meta> tag. One special tag that was and still is in the page head of my Mastodon profile!

<meta content='noindex, noarchive' name='robots'>

Not only did they pick up my data here and there for a study without my authorization, they also left some identifying information in it, and I am outraged. Negligently? I'll let readers make up their own minds about that.

After a quick look at the API documentation and a little query, the API does not say much about whether I have opted out of indexing. Can I still make the point about the meta tag? Maybe. That won't change the fact that they didn't have my permission to use my data for their research.
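For what it's worth, looking for that meta tag in a fetched profile page is not hard. Here is a minimal sketch using Python's standard html.parser. The class and function names are mine, and the sample page is a stripped-down stand-in for a real profile page.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives declared in the page's meta tags."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Stand-in page carrying the same tag as my profile.
page = (
    "<html><head>"
    "<meta content='noindex, noarchive' name='robots'>"
    "</head><body></body></html>"
)

found = robots_directives(page)
print(sorted(found))
```

A scraper that finds noindex or noarchive in that set has been told, by the page itself, not to index or archive it.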

"Yeah the toots you're talking about are public"

Now what? A toot being public and available to anyone who wants to view it does not grant anyone implicit consent to exploit it, in my opinion, except for the purpose it was made for: viewing. Nothing else. I could say the very same to you if you receive spam from dodgy companies: you left your email address somewhere, it's public, and people do whatever they want with it even though your email address is part of your data.

Many other issues

There are certainly some issues I didn't mention, such as ethical considerations, terms of service violations ( is quite explicit about data collection for research purposes), and potential copyright issues. But here's a summary of what I've written, a.k.a. not much:

  • Probable if not certain violation of GDPR;
    • Failure to de-identify personal data properly;
    • Absence of consent.

There's also an open letter (link to the toot mentioning the letter) being written in the wake of this research paper.