3.5 billion Instagram photos and their corresponding hashtags were used to train Facebook’s machine learning algorithms, the social network has disclosed.
The images and 17,000 corresponding hashtags, as appended by Instagram users, were used in an experiment to train Facebook’s algorithms to categorise images for themselves. The cache of photos is more than 10 times the size of the training data set for image algorithms that was revealed by Google last year.
This latest revelation comes in the wake of Facebook’s announcement this week that it is introducing a browser-style ‘Clear History’ option on its platform, to enable users to – according to the company – permanently erase historic data to ease their privacy concerns.
When CEO Mark Zuckerberg appeared before Congress last month in the US, congressman Ben Luján said to him, “You’ve said everyone controls their data, but you’re collecting data on people who are not even Facebook users, who never signed a consent or privacy agreement and you’re collecting their data.”
Having access to Instagram data avoided the time and cost of paying human teams to manually label photos, said the company. Having so many training images helped Facebook’s team set a new record on a test that challenges software to assign photos to 1,000 categories, including ‘cat’, ‘car wheel’, and ‘Christmas stocking’.
Facebook says that algorithms trained on Instagram images correctly identified 85.4 percent of photos on the test, known as ImageNet; the previous best was 83.1 percent, set by Google earlier this year using the smaller training set.
While the statistics are clearly improving for computer vision systems, they still reveal an uncomfortable fact: such AI systems are – at best – still wrong nearly 15 percent of the time (14.6 percent on this test), despite access to vast data sets. This is a challenge if comparable AI tools are tasked with making decisions that may affect people’s lives.
This is also why China will be the emerging power in artificial intelligence. Its lack of data privacy regulations comparable to GDPR, and its compulsory social ratings system – which is coming online in 2020 – mean that 1.4 billion Chinese citizens constitute a live training data set that even Facebook is unable to rival.
Policing the platform
As social networks grow, they are under increasing pressure to better police their platforms. With 2.2 billion registered users, nearly 1.5 billion of whom are active daily, Facebook is the top priority for lawmakers, not least because it includes Instagram, WhatsApp, and Messenger in its broadening portfolio of assets.
Aside from concerns about its effective weaponisation as a political tool, questions have been raised over Facebook’s handling of problems such as terrorist propaganda, personal abuse, and revenge pornography. Not to mention Facebook News’ famous inability to distinguish fact from satire.
The obvious way to deal with such problems is for human teams to check the reports submitted by users on a daily basis – which itself would raise privacy concerns. But the company believes that the smarter and more sustainable method in the long run is to train its own AI algorithms to spot sensitive material (assuming that they are not importing biases from the training data sets).
To an extent, the technology is in place: Facebook already uses computer vision algorithms to spot nudity and violence in images, but the system is far from foolproof, as the latest test statistics, which largely concern simple images, imply. And so the quest to improve the technology is underway, with help from Instagram’s endless #Cats, #Dogs, and more.
Plus: Instagram unveils native payments
In related news, Instagram has added a native payments feature for some users this week. The limited rollout could add new revenue streams, as Instagrammers will be able to shop without leaving the app, using linked payment cards. The development opens up the potential for ‘shoppable’ image tags in future.
Internet of Business says
This week it was revealed that Cambridge Analytica is closing amid bankruptcy proceedings. Ironically, for a company built on collecting data for media manipulation, it blamed media attacks for damaging its business. However, another entity, Emerdata, formed last year, is thought to be taking its place with the same investors and political affiliations, according to reports in The Guardian and the FT.
The lesson from these scandals remains, and – unlikely as it seems – is intrinsically linked to stories such as the revelation about Instagram training data. For data-based companies such as Facebook and Google, their users aren’t their customers; they’re their product. And any company will look for ways to sell as many of its products as possible.
Additional reporting: Chris Middleton