Sorting and classifying means organizing or grouping items according to defining characteristics or criteria. This makes patterns easier to see, makes complex data sets easier to read, and ultimately yields an organized data set to analyze.
Key Concepts:
- Attributes/Criteria:
- An attribute is a specific characteristic used as the basis for classifying data, such as colour, size, shape, range, or numerical value. For example, fruit can be classified by colour (red, green, or yellow) or by category (citrus, berry, or tropical).
- Methods of Sorting & Classifying:
- 1. Manual sorting involves sorting items physically or visually into different categories or groups. An example of this is organizing email messages into folders.
- 2. Algorithmic sorting uses software and mathematical techniques to sort and categorize items (e.g., the Excel filter feature, the SQL GROUP BY clause, or machine learning clustering algorithms).
- 3. Rule-based classification assigns items to groups using predefined rules. For example, customers can be grouped according to their previous buying activity: those who have made substantial purchases in the past are labelled “high value” customers.
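
The rule-based and algorithmic methods above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the order data, the `classify_customers` helper, and the 1000.0 threshold are all assumptions made up for the example, since the notes do not say what counts as a "substantial" purchase.

```python
from collections import defaultdict

# Hypothetical purchase history: (customer, amount spent on one order).
orders = [
    ("alice", 620.0),
    ("bob", 45.0),
    ("alice", 480.0),
    ("carol", 150.0),
    ("bob", 30.0),
]

# Assumed cut-off for a "high value" customer; not taken from the notes.
HIGH_VALUE_THRESHOLD = 1000.0


def classify_customers(orders, threshold=HIGH_VALUE_THRESHOLD):
    """Rule-based classification: label each customer by total past spending."""
    totals = defaultdict(float)
    for customer, amount in orders:
        totals[customer] += amount
    return {
        customer: "high value" if total >= threshold else "standard"
        for customer, total in totals.items()
    }


labels = classify_customers(orders)

# Algorithmic sorting: order customers by label first, then by name.
ranked = sorted(labels.items(), key=lambda item: (item[1], item[0]))

for customer, label in ranked:
    print(f"{customer}: {label}")
```

Keeping the rule in one small function makes the criterion explicit and easy to change, which matters because the choice of threshold directly determines who ends up in each group.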
- Applications in Data Analysis:
* Data cleansing: grouping similar data points together helps identify and remove duplicates and discrepancies.
* Customer segmentation: customers can be segmented demographically so that targeted ads reach specific groups.
* Inventory management: product inventories can be grouped by category, demand level, or expiration date.
* Scientific studies: species, compounds, or experimental results can be grouped according to shared parameters.
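
As a rough sketch of the data-cleansing use case, the snippet below groups records that share a normalized key and treats any group with more than one member as a set of duplicates. The sample records, the `normalize_email` rule (ignore case and surrounding spaces), and the `group_by_key` helper are hypothetical choices for illustration; real data usually needs a more careful matching rule.

```python
from collections import defaultdict

# Hypothetical customer records with inconsistent formatting.
records = [
    {"name": "Ana Lopez", "email": "Ana.Lopez@Example.com "},
    {"name": "ana lopez", "email": "ana.lopez@example.com"},
    {"name": "Ben Wu", "email": "ben.wu@example.com"},
]


def normalize_email(email):
    """Assumed matching rule: compare emails ignoring case and outer spaces."""
    return email.strip().lower()


def group_by_key(records, key_func):
    """Group records that share the same key into lists."""
    groups = defaultdict(list)
    for record in records:
        groups[key_func(record)].append(record)
    return dict(groups)


groups = group_by_key(records, lambda r: normalize_email(r["email"]))

# Any group with more than one record is a candidate set of duplicates.
duplicates = {key: rows for key, rows in groups.items() if len(rows) > 1}

# Simple policy for the sketch: keep the first record seen in each group.
deduplicated = [rows[0] for rows in groups.values()]
```

The same `group_by_key` helper could be reused for segmentation or inventory grouping by passing a different key function, for example one that returns a product's category.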
- Techniques & Tools:
* Spreadsheet software (e.g., Excel): spreadsheet programs typically provide sorting and filtering, pivot tables, and conditional formatting.
* Programming languages (Python and R): data-frame libraries handle grouping and sorting. In Python, for example, pandas offers groupby() and sort_values(), and scikit-learn offers clustering algorithms.
* Database management systems: in SQL (Structured Query Language), the ORDER BY and GROUP BY clauses sort and group rows, and CASE expressions assign rows to categories.
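
To make the database approach concrete, here is a small sketch that runs SQL from Python using the standard-library `sqlite3` module and an in-memory database. The `products` table, its rows, and the stock thresholds are invented for the example. The `CASE` expression is included as one common way to assign labels in SQL.

```python
import sqlite3

# In-memory database with a hypothetical product inventory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, stock INTEGER)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("lemon", "citrus", 40),
        ("orange", "citrus", 5),
        ("strawberry", "berry", 12),
        ("mango", "tropical", 0),
    ],
)

# GROUP BY totals the stock per category; ORDER BY sorts the categories.
stock_by_category = conn.execute(
    "SELECT category, SUM(stock) FROM products "
    "GROUP BY category ORDER BY category"
).fetchall()

# CASE assigns each row a label based on assumed stock thresholds.
stock_levels = conn.execute(
    """
    SELECT name,
           CASE WHEN stock = 0 THEN 'out of stock'
                WHEN stock < 10 THEN 'low'
                ELSE 'ok'
           END AS level
    FROM products
    ORDER BY name
    """
).fetchall()

conn.close()
```

Here `stock_by_category` holds one row per category and `stock_levels` holds one labelled row per product, so the two queries show grouping and rule-based classification side by side.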
- Challenges:
* Subjective or irrelevant criteria: choosing irrelevant features, often as a result of personal bias, can lead to inaccurate classification.
* Overlapping classifications: some classes may overlap, so an item may fall into more than one class (e.g., a mixed product).
* Non-scalable manual classification: with large datasets, classifying data and maintaining the categories by hand becomes infeasible.
Applications




