Platforms vs. PhDs: How tech giants court and crush the people who study them

Published June 30, 2024

A legal standoff between NYU researchers and Facebook sheds light on the increasingly fraught dynamic between tech companies and academics.

Laura Edelson was in a state of panic, and I knew I was a little bit to blame for her freakout.

“If this is true,” Edelson told me this week during one of several frantic phone calls, “this is my nightmare.”

I had reached out to Edelson, a Ph.D. candidate at New York University’s Tandon School of Engineering, to ask her about something Facebook had told me about why the company recently served Edelson and her colleagues with a highly publicized cease-and-desist notice. Edelson is the co-creator of Ad Observer, a browser extension that collects data on who political advertisers are targeting on Facebook. Facebook told me that one reason it was ordering Edelson to shut down Ad Observer was that it had violated Facebook’s policies by scraping data from users who had never consented to have their information collected. Facebook said the Ad Observer team was publishing that information, too, for anyone to download.

This was news to Edelson, who, as a cybersecurity researcher, had tried to keep people’s private information out of her data set. To her knowledge, the only information Ad Observer collected and published came from people who had installed the browser extension and voluntarily submitted both the content of the political ads they saw on Facebook and why they had been targeted with those ads.

Though she vehemently denied Facebook’s claims, the idea that anyone else’s private information might be lurking in Edelson’s data still left her hands shaking. She instantly turned off the ability for anyone else to download the data while she and her team began an emergency privacy audit.

But Edelson’s panic turned out to be premature. When Facebook said Ad Observer was collecting data from users who had not authorized her to do so, the company wasn’t referring to private users’ accounts. It was referring to advertisers’ accounts, including the names and profile pictures of public Pages that run political ads and the contents of those ads.

When I relayed this information to Edelson, she let out a laugh and a long sigh. Of course she was collecting advertisers’ data. That data is already public on Facebook, and it’s the crux of what Ad Observer is all about. “The core premise of this is that the public deserves some transparency into who is trying to influence their vote and the political conversation,” she said. “That’s sort of the point of the project.”

But to Facebook, advertisers are users too, and important ones at that. Scraping advertiser data, even data Facebook makes public, and publishing it without the advertiser’s consent is a violation of Facebook’s rules all the same. And, after all, those are rules only Facebook gets to write.

Facebook’s crackdown on Ad Observer, which has yet to be resolved, is just one of the most extreme examples of the increasingly fraught relationship between platforms and the people who study them. Over the last few years, amid mounting scrutiny of Silicon Valley, tech platforms have made overtures to the research community, opening up previously inaccessible data sets that academics can use to study how tech platforms impact society. Twitter, for one, recently launched a free API for pre-approved academics to gain access to its full back catalog of tweets. Facebook, meanwhile, has made a huge trove of Facebook data available to researchers through its Facebook Open Research and Transparency project and is currently working with a team of more than a dozen researchers to study the impact of the platform on the 2020 election.

But even as this work progresses, tech companies are simultaneously cracking down on academics whose methods break their rules. As topics like online disinformation, ad targeting and algorithmic bias have emerged as core fields of study, researchers have relied on APIs and social analytics tools, dummy accounts and scraped data to figure out, say, whether online housing ads are discriminating against Black people or whether fake news gets more engagement than real news online. Often, those methods violate companies’ carefully crafted terms of service forbidding data scraping, data sharing and fake accounts, among other things. And so, to protect their users’ privacy — and their own reputations — tech giants have at times used those terms of service as a cudgel to shut down even well-meaning research projects.

Ad Observer is one of those projects. In October, just weeks before the presidential election, as political ads flooded Facebook, the company ordered Edelson and her collaborator, NYU associate professor Damon McCoy, not only to shut down the browser extension by Nov. 30, but to delete all the data they had collected or face “additional enforcement action.” Months later, the NYU team and Facebook have yet to reach an agreement.

“The issue was not what NYU was trying to accomplish. It was how they were trying to accomplish it,” said Steve Satterfield, Facebook’s director of privacy and public policy. Satterfield noted that when people install the Ad Observer browser extension, they give the NYU researchers access to everything they can view from their browser. Scraping any of that data directly violates Facebook’s terms. “We’re open to partnership,” Satterfield said, “but there are certain areas where we’re not going to make compromises, and this was one of them.”

Some of the debates about the relationship between researchers and Big Tech are philosophical. Others are legal. Indeed, the Supreme Court is currently mulling a case that academics fear could make research methods that violate companies’ terms of service illegal. But they’re all part of the delicate dance between researchers and social media platforms, taking place at a time when the world arguably needs that research most.

A cat-and-mouse game

This isn’t the first time Edelson and McCoy have been caught up in Facebook’s fight against data scraping.

Shortly after Facebook launched its repository of political ads, now called Ad Library, in 2018, the two researchers built a tool to scrape the archive so that they could parse the data more easily. At first, sifting through the Ad Library required using keyword searches, making it hard for researchers and journalists to analyze the data set if they didn’t know what they were looking for.

“It looked like it had a lot of interesting juicy data, but it looked like it was hastily implemented and not horribly useful,” McCoy remembered.

McCoy wanted a bird’s-eye view of the political ad landscape on Facebook, so he tapped Edelson, who had worked at Palantir before beginning her Ph.D. program, to build a scraper that would give them just that. The scraper unlocked a trove of important insights about political advertising before the 2018 midterms — most notably, that former President Trump was spending more on political ads than any other advertiser. The New York Times wrote a story about that finding in July 2018, complete with a glowing quote from Facebook’s then-director of product management, Rob Leathern, saying the NYU report was “exactly how we hoped the tool would be used.”

Soon after the story ran, Facebook broke the tool anyway.

The Times story happened to hit just four months after the Cambridge Analytica scandal broke — a privacy debacle for Facebook that traced back to a single professor at Cambridge University, who had built a tool to scrape unwitting Facebook users’ data. Facebook soon began anti-scraping efforts, making technical changes that effectively cut off Edelson and McCoy’s exploit. Facebook said these anti-scraping efforts, which have been underway since 2018, haven’t targeted any specific tool. But McCoy and Edelson were affected all the same. “It gets into a cat-and-mouse game,” McCoy said. “We get around it, they erect new barriers.”

The Times story was also just two months after Europe’s General Data Protection Regulation went into effect, creating new hurdles for data sharing and new consent requirements for data collection.

Suddenly tech companies had a series of official arguments for withholding data from researchers who were turning up unflattering findings anyway. “It was a bit of a PR decision, but also a policy decision,” said Nu Wexler, who worked in policy communications at Facebook during the Cambridge Analytica scandal and later worked at Google.

That skirmish with Facebook, at least, was short-lived. Shortly after breaking NYU’s scraper, Facebook released an API for its political ad archive and invited the NYU team to be early testers. Using the API, Edelson and McCoy began studying the spread of disinformation and misinformation through political ads and quickly realized that the dataset had one glaring gap: It didn’t include any data on who the ads were targeting, something they viewed as key to understanding advertisers’ malintent. For example, last year, the Trump campaign ran an ad envisioning a dystopian post-Biden presidency, where the world is burning and no one answers 911 calls due to the “defunding of the police department.” That ad, Edelson found, had been targeted specifically to married women in the suburbs. “I think that’s relevant context to understanding that ad,” Edelson said.

But Facebook was unwilling to share targeting data publicly. According to Satterfield, that could make it too easy to reverse-engineer a person’s interests and other personal information. If, for instance, a person likes or comments on a given ad, it wouldn’t be too hard to check the targeting data on that ad, if it were public, and deduce that that person meets those targeting criteria. “If you combine those two data sets, you could potentially learn things about the people who engaged with the ad,” Satterfield said.

For that reason, Facebook only shows ad-targeting data to users when they have personally been shown an ad. Knowing this, Edelson and McCoy built Ad Observer as a workaround. Users could install the browser extension and voluntarily submit their own ad-targeting data as they browsed Facebook. Sensitive to concerns about privacy, McCoy and Edelson designed the tool to strip out data that might contain personally identifiable information and released the tool in May 2020. Eventually, more than 15,000 people installed Ad Observer.

The notion that Facebook users might inadvertently give other people’s data away to third-party researchers, again, set off immediate alarm bells inside Facebook. The company has previously taken legal action against other companies caught scraping data. Its approach toward the NYU team, the company argues, was just an extension of that work. “It was that action, which was encouraging people to install extensions that scrape data, that led to the action we took,” Facebook’s Satterfield said.

A global ricochet

Facebook’s cease-and-desist notice was aimed at the NYU team, but it ricocheted around the global research community. Mark Ledwich, an Australian researcher who studies YouTube, said the confrontation has already put a chill on other, similar transparency projects, like his.

Ledwich is the co-creator of transparency.tube, a website that mines YouTube’s top English-language channels and categorizes them based on characteristics like their political slant or whether they spread conspiracies. He’s also co-founder of a new online forensics startup called Pendulum. Initially, transparency.tube was powered by YouTube’s API, which places certain requirements on data storage and how many API calls a developer can make in one day. Abiding by those terms would have made Ledwich’s work impossible, he said.

“There is no way to perform research like ours and not breach their terms of service,” Ledwich said. To get around those terms, Ledwich began using multiple accounts to gather data, but when he came clean about that to YouTube, he was blocked from the API entirely.

Instead, Ledwich began scraping publicly available YouTube data. That’s given him new insights into what’s happening on the platform, but he says the method has spooked some potential research partners. George Washington University’s Institute for Data, Democracy and Politics, for one, recently turned down Ledwich’s request for funding and cited NYU’s issues with Facebook in its rationale. “The reason given was they were worried about the risk after the legal threat from Facebook against the NYU Ad Observatory,” Ledwich said.

Rebekah Tromble, director of the Institute, said she personally believes researchers should be given safe harbor under the law to scrape data if they do it in a way that respects users’ privacy, as she feels the NYU researchers do. And she’s been pushing for regulations in the U.S. and Europe that would create rules of the road for such data sharing. “However, when it comes to providing financial support for building tools based on scraping, our obligations to the university and our own funders mean we have to proceed with caution,” Tromble said. “I am very happy to advocate for transparency.tube and other projects that scrape data, but our institute cannot fund them at this time.”

Unlike content on Facebook and Twitter, YouTube content is primarily video-based, which makes it more difficult for researchers to analyze in bulk. None of the researchers or former Google employees Protocol spoke to for this story were aware of any transparency projects YouTube has created for outside researchers, and YouTube couldn’t point to any specific research projects either. But spokesperson Elena Hernandez said, “YouTube regularly works with researchers from around the world on a range of critical topics.”

Google spokesperson Lara Levin, meanwhile, said that while the search team doesn’t have any research projects to share, “Search is quite distinct from social networks and feed/recommendation-based products, in that it’s responding to user queries, and it is an open platform that researchers [and] media can (and do) study.”

The limits on the API mean that some of the most illuminating transparency projects related to YouTube come from researchers who scrape publicly available data. That’s how Guillaume Chaslot, founder of algotransparency.org and a former Googler, built his tool to monitor YouTube recommendations.

YouTube prohibits scraping, except from search engines or in cases where it’s given prior written approval. But so far, Chaslot says, YouTube hasn’t tried to stop him. “If they cut me off, it looks really bad for them. It looks like they really don’t want people to see what they’re doing,” he said. In fact, the company has invited Chaslot to its offices to discuss how it’s addressing some of the issues he’s flagged.

That doesn’t mean YouTube has been fully supportive of his work, though. When Chaslot first told his story to The Guardian in 2018 and outlined findings that suggested, among other things, that YouTube recommendations had nudged viewers toward pro-Trump content before the 2016 election, YouTube tried to undermine Chaslot’s methodology. “The sample of 8,000 videos they evaluated does not paint an accurate picture of what videos were recommended on YouTube over a year ago in the run-up to the US presidential election,” a spokesperson told The Guardian at the time.

“They say I just collect a small part of the data, so I don’t see the whole picture,” Chaslot said. YouTube provided Protocol with a similar explanation of why the company feels AlgoTransparency’s conclusions are inaccurate.

All of this makes YouTube one of the least-studied social media platforms. One recent analysis of papers submitted to the International Communication Association’s annual conference last year found that only 28 had Google or YouTube in the name, compared to 41 for Facebook. Twitter beat them both, appearing in the name of 51 papers. In another analysis of research on hate speech, racism and social media, researchers found Twitter data appeared in more than half of the papers they studied, while YouTube data appeared in just under 9%.

That’s a testament to Twitter’s inherently public nature and its concerted efforts to work with researchers. But even that research hasn’t been based on the complete picture of activity on Twitter. Until recently, only business developers who paid for access to the API could see a full history of tweets. The free version most researchers used offered just a subset. Earlier this year, Twitter announced it was changing that, creating a free API with access to the complete archive, but only for pre-approved researchers.

“Access to Twitter data provides researchers with helpful insight into what’s happening across the globe, but their successes have largely been in spite of Twitter, not because of us,” Twitter staff product manager Adam Tornes said on a recent call with reporters. “For too long we’ve ignored their unique circumstances and differentiated needs and capabilities. We’ve also failed to fully appreciate and embrace the value and impact of scholarly work using Twitter data.”

Still, even in this update, there’s a trade-off for researchers. Now, they’ll have access to more historical data on the platform, but Twitter is simultaneously reducing the amount of data researchers can collect, capping it at 10 million tweets per month. “We have all these projects ongoing that are based on collecting a lot more data than we’re going to be permitted to collect,” said Josh Tucker, another NYU professor who co-directs the school’s Social Media and Political Participation lab. “Someone at Twitter may alter the future trajectory of our research agenda, having nothing to do with us.”

Twitter said it plans to release additional tiers of access for researchers that don’t include that data cap and is open to evolving the product based on researchers’ needs.

Opening up a trove of historical data in this way could also create other thorny ethical dilemmas, said Casey Fiesler, an assistant professor at University of Colorado Boulder, who studies tech research ethics. In a recent survey of Twitter users, Fiesler found the majority of people who responded had no idea their tweets were being studied. “A lot of people fundamentally object to the idea of being experimented on without their consent, regardless of what it was for,” Fiesler said.

That’s one reason why Fiesler understands why Facebook might be worried about the NYU researchers’ work. After all, the Cambridge Analytica scandal ended up costing Facebook $5 billion in fines to the Federal Trade Commission, which found that the company had failed to secure user data in violation of an earlier consent decree. “I do not envy Facebook’s position here. Look at the massive amounts of privacy scandals they’ve dealt with,” she said.

At the same time, Fiesler understands the NYU researchers’ motivations as well and feels it’s important for tech companies not to be the only ones allowed to grade their homework. “The fundamental thing is: This is hard,” Fiesler said. “These are value tensions, and in some cases, it’s a no-win scenario.”

‘Scraping is not a crime’

For all of the concerns from researchers, Facebook has made significant gestures toward transparency in recent years. Last year, following nearly two years of tortured negotiations with another cohort of academics, Facebook released a trove of 38 million URLs that pre-approved researchers could access. Later in the year, the company announced it was working with 17 researchers to study Facebook’s impact on political polarization, voter participation, trust in democracy and the spread of misinformation during the 2020 election.

More recently, in January, Facebook announced it was releasing targeting data on 1.3 million political ads that ran before the 2020 election. The catch: It’s sharing that data in a closed environment where researchers are unable to export the raw data. “What we’re trying to do is mitigate the risks of the misuse of the information,” Satterfield said.

Over the course of negotiations, Facebook offered the NYU team access to this targeting data, but Edelson said the privacy restrictions would make the type of research she and McCoy do impossible. “We aggregate data and join it to other data sets. We train models. None of that would be possible in this system,” she said.

That said, Edelson applauds Facebook’s efforts to make more data available, both to academics and to the public through tools like the Ad Library and CrowdTangle, a Facebook-owned social media analytics tool that mostly gets Facebook in trouble for showing how popular far-right propaganda is in the U.S. It was CrowdTangle data that recently led Edelson and a number of colleagues to find that far-right misinformation received more engagement than other types of political news on Facebook before and immediately after the 2020 election.

“Credit where credit is due. Facebook has made a lot more data available than anyone else,” Edelson said. And she believes there are good reasons why certain research projects need to happen in closed environments. “There’s a place for researcher-platform partnerships that do cover private data that do need to be pretty closely held. I think it’s great those partnerships are happening,” she said.

Tucker, the NYU politics professor who was concerned about his lab’s Twitter research projects, is also one of the lead researchers on Facebook’s 2020 election study. Prior to joining, Tucker wrote extensively about the trade-offs researchers face both when they choose to collaborate with a platform and when they go it alone. He says he went into the Facebook partnership “eyes wide open.”

Conducting research without a platform’s blessing, Tucker explains, “has the advantage of total independence, but it subjects you to all sorts of limitations to access data, and it also subjects you to the arbitrary nature of the platform.”

But working hand-in-hand with a company like Facebook, as he’s now doing, has its flaws too. “One: You have to get the companies to agree to do it with you,” he said. “Then, there’s the question about how do you maintain the integrity of the research when you’re working with people who are paid employees of the company?”

In negotiating the terms of the project, Tucker and his co-lead, Talia Stroud of the University of Texas at Austin, demanded that the outside researchers have full control over any papers that result from the project, and they pre-registered their plans, effectively telling the public exactly what they would inevitably share before they collected any data. Facebook came to the table with terms of its own: Only Facebook employees would have access to the raw data, and Facebook would get to review papers to ensure they’re not violating legal restrictions. Facebook would cover the cost of data collection and Facebook employees’ time, but none of the researchers would take money from Facebook.

So far, Tucker said the work has “not gone without its hiccups.” Specifically, Tucker’s team wants to share as much of the underlying data they collect as possible, and Facebook, well, doesn’t. Tucker said those negotiations are still ongoing.

Despite those points of contention, though, he still believes collaborative projects like this are critical and hopes this one will serve as a model for other companies. “This is probably the most important election of the post-war era, in the heart of the social media era, with these enormous moves by these platforms in the middle of the process,” Tucker said. “We have a study that’s measuring its impact.” If nothing else, he argues, that’s an important contribution.

But he and Edelson agree that doesn’t eliminate the need for research that happens outside of Facebook’s terms. Tech companies’ self-imposed transparency efforts continue to raise questions about the limits of what they’re actually sharing and their failure to imagine new, safe ways to share it. The truth is, tech giants have little motivation to hold researchers’ hands while they map out the companies’ mistakes. “There aren’t a whole lot of researchers out there who are going to write super positive papers about social media and data right now,” Wexler said.

Tech companies might not have much control over what boundaries researchers can and cannot cross for long. Those questions are increasingly being answered by courts and government bodies. Last spring, in a case called Sandvig v. Barr, a federal court ruled that researchers who use fake job profiles to study algorithmic discrimination in violation of companies’ terms of service wouldn’t be violating the Computer Fraud and Abuse Act, which vaguely forbids unauthorized access of computer systems. This summer, the Supreme Court is set to rule in another case, Van Buren v. United States, which deals with similar questions around what constitutes unauthorized access. Researchers worry that the court’s decision in that case could dramatically broaden the scope of CFAA and have sweeping implications for people like Edelson.

The day the court heard oral arguments in Van Buren last fall, Edelson was following along virtually, wearing a T-shirt that read, “Scraping is not a crime.”

Meanwhile, in Europe, regulators are trying to find a middle ground under GDPR that would allow for some data sharing with researchers. Article 31 of the Digital Services Act, which was introduced last December by the European Commission, would enable pre-vetted researchers affiliated with academic institutions to get access to data from “very large online platforms,” provided that access doesn’t create significant security vulnerabilities or violate trade secrets.

“We do acknowledge some of this data that researchers or oversight bodies are interested in could have very significant privacy effects. It could give people’s political opinions or sexual preference,” said Mathias Vermeulen, public policy director of the U.K. data rights agency AWO, who has been advocating for more data sharing with researchers. “To a large degree I think these concerns can be solved by sitting down together and clarifying some of the respective obligations.”

For now, Facebook and NYU’s negotiations are at an impasse. Facebook wouldn’t comment on whether it plans to follow through with a lawsuit or other enforcement action against Edelson and McCoy. If it does, the whole research community will be watching what happens next. “If Facebook wins that case, I think it will stop a lot of institutional funding going into unsanctioned social media research, even YouTube,” Ledwich said.

Whatever happens next, Edelson says her work will continue. In fact, it’s expanding. She and McCoy recently changed the name of their research group from the Online Political Transparency Project to Cybersecurity for Democracy in hopes of broadening its scope. Under that banner, not only have they kept Ad Observer going, but in the last month they began using it to collect data from YouTube as well. “It is my job as a security researcher to test to see whether systems on the internet are safe. We know there are problems in the ad delivery networks. They’re vectors for social engineering attacks, and that means it’s my job to study it,” she said. “I am not going to stop doing that.”