Fake news spreads as fast as a forest fire across various media: social networks, news sites, and word of mouth. This is a serious and alarming problem, since it causes great damage to both society and the state. The Internet, as we all know, is one of the greatest inventions in history, and many people use it for all kinds of purposes. There are many social media platforms where users can post and share news that is most often unverified. Such fake news is mainly spread to damage someone's reputation, start rumors, provoke disputes, and so on.
As a result, a machine learning classifier is needed for the rapid assessment and recognition of fake news, since manual checking carries a high probability of human error. Over the past years, many studies and efforts have gone into identifying fake news. In this article, we will look at how to use a machine learning model to determine whether a piece of news is fake. We will use a very large dataset so that the model can find relationships in the headlines or titles of fake news stories and use them to detect fake news.
It should also be emphasized that this is not a universal solution, and it is far from the only method for identifying fake news with Python. Other approaches include the PassiveAggressiveClassifier and algorithms such as Random Forests, K-Nearest Neighbors (KNN), Naive Bayes, Decision Trees, and Support Vector Machines. To display the results of analyzing the dataset with different approaches, you can use a confusion matrix. Therefore, I strongly advise you to do your own research, gather information, and then choose the option that works best for you.
Naive Bayes classifiers are a popular statistical tool for filtering email. They were one of the first attempts to solve the spam filtering problem when they appeared in the mid-1990s. Naive Bayes classifiers use the bag-of-words strategy, common in text classification, to identify spam in email. Spam and non-spam messages differ in their tokens (usually words, but sometimes other syntactic or non-syntactic components), and Bayes' theorem is used to determine whether a message is spam or not.
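Roughly speaking, the classifier applies Bayes' theorem to a message's tokens:

P(spam | tokens) = P(tokens | spam) · P(spam) / P(tokens)

together with the "naive" assumption that tokens are conditionally independent given the class, so P(tokens | spam) factors into a product of per-token probabilities; the message is flagged as spam when P(spam | tokens) is high enough.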
Types of fake news
1. Clickbait: clickbait is a popular method of spreading fake news (and other kinds of content). These are fabricated stories used to drive site traffic and advertising revenue.
2. Misleading headlines: articles and materials that are not completely fake can still be distorted through the use of sensational or deceptive headlines.
3. Satire or parody: content that often parodies news shows and uses comedic techniques to connect with the audience; it is intended to be harmless but can still deceive.
4. False content: this is mainly when fake, false, and fabricated sources or publications are passed off as genuine or trustworthy ones.
Prerequisites:
Install scikit-learn:
Enter this command in the terminal or command line to install scikit-learn.
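For example, with pip:

```
pip install scikit-learn
```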
Install Jupyter Notebook:
Copy and paste this command into the terminal or command line to install Jupyter Notebook.
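With pip, the notebook can be installed like this:

```
pip install notebook
```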
Launch it with this command:
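```
jupyter notebook
```

This starts the notebook server and opens it in your browser.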
The dataset we use to detect fake news includes the news title, its content, and a Label column that indicates whether the story is fake or not.
Now let's import all the libraries necessary for this lesson.
We will need pandas to import the data and efficiently process the large dataset. We will also import other libraries, such as NumPy and scikit-learn.
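A plausible set of imports for this lesson; the exact list depends on the original code, which is not shown, so the model and metric imports here anticipate the steps that follow:

```python
import pandas as pd
import numpy as np

# scikit-learn utilities used later in the tutorial:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
```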
Import the dataset:
At this stage, we will import the downloaded dataset. Look at the first 5 rows to get a quick idea of what the dataset looks like.
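A sketch of the loading step; the file name 'news.csv' is an assumption, since the article does not name the downloaded file:

```python
# Load the CSV into a DataFrame ('news.csv' is a placeholder name).
dataset = pd.read_csv('news.csv')

# Show the first 5 rows to get a feel for the data.
print(dataset.head())
```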
Let's get some more information about the dataset.
The point of inspecting the structure of our dataset is to get the dimensions of the DataFrame; the output shows the number of rows and columns.
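For example, with pandas:

```python
# .info() summarizes the columns, their dtypes and non-null counts;
# .shape returns the (rows, columns) dimensions of the DataFrame.
dataset.info()
print(dataset.shape)
```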
Checking for missing values in the dataset:
Missing data are values for a variable that were not stored (or are absent) in the given dataset. They are often represented by NaN or an empty cell, and can be caused by earlier data being corrupted through improper handling, or by a user deliberately providing incorrect information. If missing values are not handled properly, you can end up with a biased machine learning model that produces erroneous results; missing data can also lead to inaccuracies in statistical analysis. To count the missing values in each column, you can use dataset.isnull().sum().
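For example:

```python
# Count the missing values in each column; all zeros means
# nothing needs to be imputed or dropped.
print(dataset.isnull().sum())
```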
As you can see, there are no missing values in the dataset.
Let's check which columns we have so we can clean the dataset.
This is a step I personally recommend when preparing data, since it helps remove unnecessary columns from the dataset.
We do not need the 'Unnamed: 0' column in our model, so let's drop it, as shown in the sketch below.
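A sketch of both steps, first listing the columns and then dropping the unneeded one:

```python
# List all column names to spot anything that should be removed.
print(dataset.columns)

# 'Unnamed: 0' is typically a leftover index column from the CSV export;
# dropping it keeps only the columns the model needs.
dataset = dataset.drop(columns=['Unnamed: 0'])
```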
The Label column will be used as the target to predict, and the Title column for training the machine learning model.
Let's split the dataset into training and test sets, and then train the fake news detection model using a naive Bayes classifier. We choose naive Bayes because, in comparisons across different models, it detected spammers with 70% accuracy and scammers with 71.2%, while the other models reached only mediocre accuracy at separating spammers from non-spammers across the different approaches to identifying fake news.
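A minimal sketch of this step, assuming the headline text lives in a 'title' column and the target in a 'label' column; the real column names, vectorizer, and split ratio may differ from the original code, which is not shown:

```python
# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    dataset['title'], dataset['label'], test_size=0.2, random_state=42
)

# Turn headlines into numeric features; TF-IDF is one common choice.
vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Multinomial naive Bayes works well with word-count/TF-IDF features.
model = MultinomialNB()
model.fit(X_train_vec, y_train)
```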
Normalization is an important data cleaning step before using machine learning for text categorization. The results showed that the naive Bayes classifier achieves 96.08% accuracy in detecting fake messages.
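A sketch of how such an accuracy figure could be computed on the held-out test set, using the model and vectorizer from the previous step (the article does not show the evaluation code):

```python
# Predict on the test set and report accuracy plus the confusion matrix.
y_pred = model.predict(X_test_vec)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```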