Data Sourcing and Collection

The next step in creating an AI model is deciding what data we will use to build it. Data is the most important ingredient in model building, because No Data = No Model and Bad Data = Bad Model. A common saying sums this up:

Garbage in, garbage out.

This means that if the data we feed into our AI model is flawed—whether due to bias, inconsistency, or incompleteness—then the predictions or insights generated by the model will also be unreliable. To ensure a robust AI model, we need to carefully consider the data collection process, its sources, and how we handle data quality issues.
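
As a rough illustration of catching some of these problems early, here is a minimal sketch that counts missing values and exact duplicate rows before any modeling starts. The function name `quality_report`, the field names, and the sample records are all hypothetical, made up for this example. It covers only incompleteness and duplication; detecting bias needs domain-specific checks that a generic script cannot provide.

```python
from collections import Counter


def quality_report(rows, required_fields):
    """Summarize missing values and exact duplicate rows in a list of dicts."""
    missing = Counter()
    for row in rows:
        for field in required_fields:
            value = row.get(field)
            # Treat None and empty or whitespace-only strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1

    # Two rows count as duplicates when every required field matches.
    seen = Counter(tuple(row.get(f) for f in required_fields) for row in rows)
    duplicates = sum(count - 1 for count in seen.values() if count > 1)

    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicates": duplicates,
    }


# Hypothetical sample records, invented for illustration only.
sample = [
    {"customer_id": 1, "income": 52000, "region": "north"},
    {"customer_id": 2, "income": None, "region": "south"},
    {"customer_id": 2, "income": None, "region": "south"},
    {"customer_id": 3, "income": 61000, "region": ""},
]

report = quality_report(sample, ["customer_id", "income", "region"])
print(report)
```

A report like this does not fix anything by itself, but it gives a quick signal about whether a source is worth using before more effort goes into it.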

Types of Data Sources

When sourcing data, we typically look at three broad categories:

  1. First-Party Data (Proprietary Data)

    • Data collected directly by an organization, such as customer interactions, internal databases, or transaction records.

    • Example: A bank using customer transaction histories to build a credit risk model.

  2. Second-Party Data (Shared Data)

    • Another organization's first-party data, shared directly with its partners or buyers.

    • Example: Retailers sharing purchasing behavior with manufacturers to optimize supply chain decisions.

  3. Third-Party Data (External Data)

    • Data sourced from external vendors, open datasets, or publicly available sources like government portals.

    • Example: A hedge fund using alternative data, such as satellite imagery, to predict agricultural yields.

Methods of Data Collection

  1. Manual Data Collection

    • Surveys, interviews, and forms.

    • Useful for obtaining qualitative insights but time-consuming.

  2. Web Scraping

    • Extracting data from websites using automated tools.

    • Example: Collecting real estate pricing data from property listing websites.

  3. APIs & Data Feeds

    • Directly fetching data from APIs provided by companies or organizations.

    • Example: Financial market data from Bloomberg or Alpha Vantage APIs.

  4. IoT & Sensor Data

    • Data from connected devices such as wearables, smart homes, and industrial sensors.

    • Example: Smart meters collecting energy consumption data for predictive analytics.

  5. Crowdsourcing & Public Datasets

    • Leveraging open datasets from platforms like Kaggle, Google Dataset Search, and government portals.

    • Example: Using the UCI Machine Learning Repository for training models on classic datasets.
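
To make two of the methods above more concrete, here is a minimal sketch of web scraping and of reading an API response, using only the Python standard library. Both inputs are inline strings standing in for a fetched page and a fetched response, so nothing here touches the network. The HTML layout, the `price` class name, the JSON shape, and the helper names `scrape_prices` and `parse_quotes` are assumptions made for this example; they do not reflect the format of any real listing site or of the Bloomberg or Alpha Vantage APIs.

```python
import json
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price and data.strip():
            # Strip currency formatting, e.g. "$450,000" becomes 450000.
            cleaned = data.strip().lstrip("$").replace(",", "")
            self.prices.append(int(cleaned))


def scrape_prices(html):
    """Web scraping: pull listing prices out of raw HTML."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


def parse_quotes(payload):
    """API data: turn a JSON response into a {symbol: price} mapping."""
    records = json.loads(payload)["quotes"]
    return {r["symbol"]: float(r["price"]) for r in records}


# Hypothetical stand-ins for a downloaded page and an API response.
LISTING_HTML = """
<ul>
  <li>2-bed flat <span class="price">$450,000</span></li>
  <li>3-bed house <span class="price">$725,500</span></li>
</ul>
"""
API_RESPONSE = (
    '{"quotes": [{"symbol": "AAA", "price": "101.5"},'
    ' {"symbol": "BBB", "price": "42"}]}'
)

prices = scrape_prices(LISTING_HTML)
quotes = parse_quotes(API_RESPONSE)
print(prices)
print(quotes)
```

In practice the strings would come from an HTTP request, and an API is usually the more stable choice when one exists, since scraped page layouts can change without notice.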

Final Thoughts

Data sourcing and collection lay the foundation for AI models. No matter how sophisticated an algorithm is, its performance is ultimately limited by the quality of the data it learns from. The next step after data collection is data preprocessing and cleaning, where we refine the data before feeding it into the model.

In the next post, we’ll explore how to clean and preprocess raw data, ensuring it’s in the best possible shape for training an AI model.
