Newsletter Subject

Is Google reading this email to train its AI?

From

vox.com

Email Address

newsletter@vox.com

Sent On

Wed, Jul 26, 2023 10:36 PM

Email Preheader Text

The tricky truth about how generative AI uses your data. AI systems train on your data. What can you

The tricky truth about how generative AI uses your data. AI systems train on your data. What can you do about it?

When the White House revealed its list of voluntary safety and societal commitments signed by seven AI companies, one thing was noticeably missing: anything related to the data these AI systems collect and use to train this powerful technology. Including, very likely, yours.

There are many concerns about the potential harm that sophisticated generative AI systems have unleashed on the public. What they do with our data is one of them. We know very little about where these models get the petabytes of data they need, how that data is being used, and what protections, if any, are in place when it comes to sensitive information. The companies that make these systems aren’t telling us much, and may not even know themselves.

You may be okay with all of this, or think the good that generative AI can do far outweighs whatever bad went into building it. But a lot of other people aren’t. Two weeks ago, a viral tweet accused Google of scraping Google Docs for data on which to train its AI tools. In a follow-up, its author claimed that Google “used docs and emails to train their AI for years.” The initial tweet has nearly 10 million views, and it’s been retweeted thousands of times. The fact that this may not even be true is almost beside the point. (Google says it doesn’t use data from its free or enterprise Workspace products — that includes Gmail and Docs — to train its generative AI models unless it has user permission, though it does train some Workspace AI features like spellcheck and Smart Compose using anonymized data.)

“Up until this point, tech companies have not done what they’re doing now with generative AI, which is to take everyone’s information and feed it into a product that can then contribute to people’s professional obsolescence and totally decimate their privacy in ways previously unimaginable,” said Ryan Clarkson, whose law firm is behind class action lawsuits against OpenAI and Microsoft, and against Google.

Google’s general counsel, Halimah DeLaine Prado, said in a statement that the company has been clear that it uses data from public sources, adding that “American law supports using public information to create new beneficial uses, and we look forward to refuting these baseless claims.”

Exactly what rights we may have over our own information, however, is still being worked out in lawsuits, worker strikes, regulator probes, executive orders, and possibly new laws. Those might take care of your data in the future, but what can you do about what these companies already took, used, and profited from? The answer is probably not a whole lot.

Generative AI companies are hungry for your data. Here’s how they get it.

Simply put, generative AI systems need as much data as possible to train on. The more they get, the better they can generate approximations of how humans sound, look, talk, and write. The internet provides massive amounts of data that’s relatively easy to gobble up through web scraping tools and APIs. But that gobbling process doesn’t distinguish between copyrighted works and personal data; if it’s out there, it takes it.
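To make that gobbling concrete, here is a minimal, hypothetical sketch of the kind of indiscriminate scraper described above, assuming Python with the third-party requests and beautifulsoup4 packages. The seed URL is a placeholder; real training-data crawls run at the scale of billions of pages.

```python
# A minimal sketch of indiscriminate web scraping. The seed URL is a
# placeholder, and real crawls span billions of pages. Note that nothing
# here distinguishes copyrighted works or personal data from anything else.
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> tuple[str, list[str]]:
    """Fetch one page; return its visible text and its outbound links."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    text = soup.get_text(separator=" ", strip=True)  # articles, names, emails: everything
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return text, links

if __name__ == "__main__":
    corpus: list[str] = []
    frontier = ["https://example.com"]  # hypothetical seed
    seen: set[str] = set()
    while frontier and len(corpus) < 100:  # tiny cap for the sketch
        url = frontier.pop()
        if url in seen or not url.startswith("http"):
            continue
        seen.add(url)
        try:
            text, links = scrape_page(url)
        except requests.RequestException:
            continue  # skip unreachable pages and keep crawling
        corpus.append(text)  # no filtering before the text joins the training pile
        frontier.extend(links)
    print(f"collected {len(corpus)} pages")
```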
“In the absence of meaningful privacy regulations, that means that people can scrape really widely all over the internet, take anything that is ‘publicly available’ — that top layer of the internet, for lack of a better term — and just use it in their product,” said Ben Winters, who leads the Electronic Privacy Information Center’s AI and Human Rights Project and co-authored its report on generative AI harms.

Which means that, unbeknownst to you and, apparently, several of the companies whose sites were being scraped, some startup may be taking and using your data to power a technology you had no idea was possible. That data may have been posted on the internet years before these companies existed. It may not have been posted by you at all. Or you may have thought you were giving a company your data for one purpose that you were fine with, but now you’re afraid it was used for something else. Many companies’ privacy policies, which are updated and changed all the time, may let them do exactly that. They often say something along the lines of how your data may be used to improve their existing products or develop new ones. Conceivably, that includes generative AI systems.

Not helping matters is how cagey generative AI companies have been about revealing their data sources, often simply saying that they’re “publicly available.” Even Meta’s more detailed list of sources for its first LLaMA model refers to things like Common Crawl, an openly available archive of raw web crawl data covering much of the internet, as well as sites like GitHub, Wikipedia, and Stack Exchange, which are also enormous repositories of information. (Meta hasn’t been as forthcoming about the data used for the just-released Llama 2.) All of these sources may contain personal information; a sketch below shows how easily anyone can look inside one of them.

OpenAI admits that it uses personal data to train its models, but says it comes across that data “incidentally” and only uses it to make “our models better,” as opposed to building profiles of people to sell ads to them. Google and Meta have vast troves of personal user data they say they don’t use to train their language models now, but we have no guarantee they won’t do so in the future, especially if it means gaining a competitive advantage.

We know that Google scanned users’ emails for years in order to target ads (the company says it no longer does this). Meta had a major scandal and a $5 billion fine when it shared data with third parties, including Cambridge Analytica, which then misused it. The fact is, these companies have given users plenty of reasons not to take their assurances about data privacy, or their commitments to produce safe systems, at face value. “The voluntary commitments by big tech require a level of trust that they don’t deserve and they have not earned,” Clarkson said.

Copyrights, privacy laws, and “publicly available” data

For creators — writers, musicians, and actors, for instance — copyrights and image rights are a major issue, and it’s pretty obvious why. Generative AI models have been trained on their work and could put them out of work in the future. That’s why comedian Sarah Silverman is suing OpenAI and Meta as part of a class action lawsuit. She alleges that the two companies trained their models on her written work by using datasets that contained text from her book, The Bedwetter. There are also lawsuits over image rights and the use of open source computer code.
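Part of what these lawsuits turn on is what “publicly available” means in practice, and, as referenced above, those archives are easy for anyone to inspect. Here is a sketch that queries Common Crawl’s public index to see whether a given site’s pages were captured; the endpoint format follows Common Crawl’s documented index server, but the specific crawl ID is an assumption and may need updating to a current snapshot.

```python
# A sketch of querying Common Crawl's public CDX index to see whether a
# site's pages were captured. The crawl ID below names one 2023 snapshot
# and is an assumption; newer snapshots have different IDs.
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-23-index"

def find_captures(url_pattern: str) -> list[dict]:
    """Return index records for captured pages matching url_pattern."""
    resp = requests.get(
        INDEX,
        params={"url": url_pattern, "output": "json"},
        timeout=30,
    )
    if resp.status_code == 404:  # the index answers 404 when nothing matches
        return []
    resp.raise_for_status()
    # The index returns one JSON object per line, one per captured page.
    return [json.loads(line) for line in resp.text.splitlines()]

if __name__ == "__main__":
    for record in find_captures("vox.com/*")[:5]:
        print(record.get("timestamp"), record.get("url"))
```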
The use of generative AI is also one of the reasons why writers and actors are on strike, with both of their unions, the WGA and SAG-AFTRA, fearing that studios will train AI models on artists’ words and images and simply generate new content without compensating the original human creators.

But you, the average person, might not have intellectual property to protect, or at least your livelihood may not depend on it. So your concerns might be more about how companies like OpenAI are protecting your privacy when their systems scoop it up, remix it, and spit it back out.

Regulators, lawmakers, and lawyers are wondering about this, too. Italy, which has stronger privacy laws than the US, even temporarily banned ChatGPT over privacy issues. Other European countries are looking into doing their own probes of ChatGPT. The Federal Trade Commission has also set its sights on OpenAI, investigating it for possible violations of consumer protection laws. The agency has also made it clear that it will keep a close eye on generative AI tools.

But the FTC can only enforce what the laws allow it to. President Biden has encouraged Congress to pass AI-related bills, and many members of Congress have said they want to do the same. Congress is notoriously slow-moving, however, and has done little to regulate or protect consumers from social media platforms. Lawmakers may learn a lesson from this and act faster when it comes to AI, or they may repeat their mistake. The fact that there is interest in doing something relatively soon after generative AI’s introduction to the general public is promising. “The pace at which people have introduced legislation and said they want to do something about [AI] is, like, 9 million times faster than it was with any of these other issues,” Winters said.

But it’s also hard to imagine Congress acting on data privacy. The US doesn’t have a federal consumer online privacy law. Children under 13 do get some privacy protections, as do residents of states that have passed their own privacy laws. Some types of data are protected, too. That leaves a lot of adults across the country with very little by way of data privacy rights.

We will likely be looking to the courts to figure out how generative AI fits with the laws we already have, which is where people like Clarkson come in. “This is a chance for the people to have their voice heard, through these lawsuits,” he said. “And I think that they’re going to demand action on some of these issues that we haven’t made much progress through the other channels thus far. Transparency, the ability to opt out, compensation, ethical sourcing of data — those kinds of things.”

In some instances, Clarkson and Tim Giordano, a partner at Clarkson Law Firm who is also working on these cases, said there’s existing law that doesn’t explicitly cover people’s rights with generative AI but that a judge can interpret to apply there. In others, there are things like California’s privacy law, which requires companies that share or sell people’s data to give them a way to opt out and to delete their information. “There’s currently no way for these models to delete the personal information that they’ve learned about us, so we think that that’s a clear example of a privacy violation,” Giordano said. ChatGPT’s opt-out and data deletion tools, for example, only cover data collected from people while they use the ChatGPT service.
It does have a way for people in “certain jurisdictions” to opt out of having their data processed by OpenAI’s models now, but it doesn’t guarantee that it will do so, and it requires that you provide evidence that your data was processed in the first place.

Although OpenAI recently changed its policy and has stopped training models on data provided by its own customers, another set of privacy concerns crops up with how these models use the data you give them when you use them, and with the information they release into the wild. “Customers clearly want us not to train on their data,” Sam Altman, CEO of OpenAI, told CNBC, an indicator that people aren’t comfortable with their data being used to train AI systems, though only some are given the chance to opt out of it, and in limited circumstances. Meanwhile, OpenAI has been sued for defamation over a ChatGPT response that falsely claimed that someone had defrauded and stolen money from a nonprofit. And this isn’t the only time a ChatGPT response has levied false accusations against someone.

So what can you currently do about any of this? That’s what’s so tricky here. A lot of the privacy issues now are the result of a failure to pass real, meaningful privacy laws in the past that could have protected your data before these datasets and technologies existed. You can always try to minimize the data you put out there now, but you can’t do much about what’s already been scraped and used. You’d need a time machine for that, and not even generative AI has been able to invent one yet.

—Sara Morrison, senior reporter

[UPS workers gather around a large inflatable pig holding a bag of money, preparing picket lines with a variety of signs supporting the Teamsters.] Timothy A. Clary/AFP via Getty Images
A UPS strike would have been worse than you think
Our reliance on delivery gave the Teamsters union a lot more leverage in UPS negotiations.

[The blue bird logo of Twitter is large against a stone exterior wall, seen from the street below.] Tayfun Coskun/Anadolu Agency via Getty Images
The weird sorrow of losing Twitter
Grieving a loss, when the loss is the hell-bird site you weren’t supposed to love.

[Elon Musk in the Twitter headquarters holding a sink.] Twitter account of Elon Musk/AFP via Getty Images
Welcome to X, the wannabe “super app” formerly known as Twitter
Elon Musk’s favorite letter wants to be the center of your app universe.

[An illustration on a peach-colored background of three women, all yawning in different contexts — one is on the phone, another is holding a cup, while the third is just waking up.] Getty Images/iStockphoto
Don’t schedule meetings after 4 pm
People are redefining the 9-to-5 and that’s a good thing.

[President Biden at a speech in Philadelphia on July 20, 2023.] Fatih Aktas/Anadolu Agency via Getty Images
Biden sure seems serious about not letting AI get out of control
Some AI companies have made safety commitments. Is that enough?

Support our work
Vox Technology is free for all, thanks in part to financial support from our readers. Will you join them by making a gift today? Give

Listen to This
Strikes! AI! And a Steven Soderbergh show he’s selling himself. Hollywood is reeling from two different strikes. Disney CEO Bob Iger has hung a For Sale sign on parts of his company. And Steven Soderbergh just made a TV series and is selling it directly to consumers, like it’s 2012 or something.
Listen to Apple Podcasts

This is cool
Anyone want Apple shoes for $50K?
