non-uniqueness - a value should be unique, but is repeated in several different records. For example, two people have the same passport number;
duplication - some object is described twice, for example, there are two absolutely identical users in the database;
contradictions - data about one object differs in different places, for example, a person’s last name is written sometimes as Balashev, sometimes as Balashov;
incorrect links - the links between features in one record are broken, for example, the first and last names are swapped.
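The record-level problems above can be detected with a few lines of code. What follows is only a minimal sketch: the sample records, the field names (id, last_name, passport) and both helper functions are assumptions made for illustration, not part of any particular tool.

```python
from collections import Counter, defaultdict

# Hypothetical sample; field names and values are invented for illustration.
records = [
    {"id": 1, "first_name": "Ivan", "last_name": "Balashev", "passport": "4010 123456"},
    {"id": 2, "first_name": "Petr", "last_name": "Sidorov", "passport": "4010 123456"},
    {"id": 3, "first_name": "Anna", "last_name": "Orlova", "passport": "4511 654321"},
    {"id": 3, "first_name": "Anna", "last_name": "Orlova", "passport": "4511 654321"},
    {"id": 1, "first_name": "Ivan", "last_name": "Balashov", "passport": "4010 123456"},
]


def exact_duplicates(rows):
    """Return each record that occurs more than once, field for field."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(items) for items, n in counts.items() if n > 1]


def conflicting_groups(rows, group_by, field):
    """Group rows by `group_by` and keep the groups that hold
    more than one distinct value of `field`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[group_by]].add(row[field])
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


# Duplication: the same object described twice.
print(exact_duplicates(records))
# Non-uniqueness: one passport number shared by different people.
print(conflicting_groups(records, "passport", "id"))
# Contradiction: one person with differently spelled last names.
print(conflicting_groups(records, "id", "last_name"))
```

The same grouping helper covers both non-uniqueness and contradictions, because each of them comes down to one key being paired with more than one distinct value.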
Errors in features. There are more possible errors at the level of one feature - here are some of the main types:
absence of value - a cell is left empty or holds a placeholder - for example, a person has no name, or a string of zeros stands in for a phone number;
incorrect value - there is information in the cell, but it does not fit the format - say, the age is written as "twenty" instead of 20;
spelling errors - a word is written incorrectly, for example "Sanktpeterburg" instead of "Saint Petersburg";
synonymy - the same meaning is expressed differently in different places - for example, "nurse" and "nursing sister";
anomalous values - the information in the feature cannot be real - for example, the age of a living person is given as 271 years, or the date as March 34;
word reversal - the words in a value appear in a different order in different places - for example, "building material" and "material, building";
nesting of values - one feature contains several values - say, the city "Perm, Penza".
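Several of the feature-level errors listed above can be caught with simple validation functions. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the upper age limit of 120 is an arbitrary threshold chosen for the example.

```python
from datetime import date

MAX_AGE = 120  # assumed upper bound for the age of a living person


def is_blank(value):
    """Absence of value: None or a string with nothing but whitespace."""
    return value is None or str(value).strip() == ""


def is_zero_placeholder(phone):
    """Absence of value in disguise: a phone number made only of zeros."""
    digits = [ch for ch in str(phone) if ch.isdigit()]
    return bool(digits) and all(ch == "0" for ch in digits)


def parse_age(value):
    """Incorrect value: return the age as an int, or None if it is not a number."""
    try:
        return int(str(value).strip())
    except ValueError:
        return None


def is_plausible_age(age):
    """Anomalous value: reject ages outside the assumed range."""
    return 0 <= age <= MAX_AGE


def is_real_date(year, month, day):
    """Anomalous value: reject calendar dates that do not exist."""
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


def split_nested(value):
    """Nesting of values: split 'Perm, Penza' into separate values."""
    return [part.strip() for part in value.split(",") if part.strip()]


print(parse_age("twenty"), is_plausible_age(271), is_real_date(2024, 3, 34))
print(split_nested("Perm, Penza"))
```

Checks like these only flag suspicious cells. Whether to drop, correct or fill in a flagged value still depends on the task.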
Data recorded by devices also contains noise, that is, interference such as rustling on an audio track or stripes in a video. And if the information was collected from different sources, the formats may not match: in one place the date is written as April 7, and in another as 07.04.
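Mismatched formats such as April 7 and 07.04 are usually brought to one representation before analysis. Here is a minimal sketch that handles only these two notations. The function name is hypothetical, and since neither notation carries a year, the year is an assumption that the caller has to supply.

```python
from datetime import date

MONTHS = (
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
)


def parse_day_month(text, year):
    """Parse 'April 7' or '07.04' (day.month) into a date in the given year.

    Raises ValueError for anything that is not a real date in one of
    these two notations.
    """
    text = text.strip()
    if "." in text:
        day, month = (int(part) for part in text.split("."))
    else:
        name, day_text = text.split()
        month = MONTHS.index(name.lower()) + 1
        day = int(day_text)
    return date(year, month, day)


# Both spellings end up as the same value in ISO format.
print(parse_day_month("April 7", 2024).isoformat())
print(parse_day_month("07.04", 2024).isoformat())
```

Because the result is built with date(), an impossible value such as March 34 raises an error instead of passing silently into the cleaned data.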
If errors remain in the sample, the model may learn them as if they were correct and produce wrong answers later. For example, it may treat "Sanktpeterburg" as a separate city, unrelated to Saint Petersburg. Or it may remember that March has 34 days.
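One way to keep a misspelled city from becoming a separate category is to map each raw value to the closest name in a reference list. The sketch below uses get_close_matches from Python's standard difflib module. The reference list, the function name and the 0.75 similarity cutoff are assumptions chosen for the example, and fuzzy matching of this kind can also produce false matches, so the results are worth reviewing.

```python
from difflib import get_close_matches

# Hypothetical reference list of correct city names.
KNOWN_CITIES = ["Saint Petersburg", "Moscow", "Perm", "Penza"]


def canonical_city(raw, known=KNOWN_CITIES, cutoff=0.75):
    """Return the closest known city name, or None if nothing is similar enough."""
    by_lower = {name.lower(): name for name in known}
    hits = get_close_matches(raw.strip().lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None


print(canonical_city("Sanktpeterburg"))
print(canonical_city("Berlin"))
```

Values that come back as None are left for manual review rather than forced onto the nearest name.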
How data cleaning works