Data helps organizations make better and smarter decisions. This includes not only the data from within your organization but also the growing number of data sources outside it, such as weather data, customer demographic data, and social media data from Twitter, Facebook, Instagram, etc.
However, we have to be careful about accuracy. Real-world data is messy, so it needs to be processed and checked before it is ready to be used.
A dataset is a collection of records, for example a collection of bank transactions, purchase transactions, phone call records, test results, or service records.
Each record has a set of attributes.
For example, the record for a bank transaction:
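As a minimal sketch, a record can be modelled as a typed set of attributes, and a dataset as a list of such records. The class name, attribute names, and values below are hypothetical and chosen only for illustration; a real bank transaction record may carry different attributes.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class BankTransaction:
    """One record; every field is one attribute (all names are hypothetical)."""
    transaction_id: str
    account_number: str
    transaction_date: date
    amount: float
    transaction_type: str


# A dataset is simply a collection of records (values are made up).
dataset = [
    BankTransaction("T0001", "ACC-123", date(2024, 1, 15), 250.0, "deposit"),
    BankTransaction("T0002", "ACC-456", date(2024, 1, 16), 75.5, "withdrawal"),
]

# List the attributes that every record in the dataset shares.
attribute_names = [f.name for f in fields(BankTransaction)]
print(attribute_names)
```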
A dataset can be:
- Structured (database records, spreadsheets)
- Unstructured (Facebook posts, blogs, newspaper articles)
- Semi-structured (emails)
Structured data refers to any data that is recorded and organized neatly in a relational database. It is easily searchable and queryable, and it can be entered, stored, and analyzed easily.
Unstructured data refers to any data that has not been organized.
Semi-structured data lies somewhere between these two.
Knowledge Discovery from Data:
- Look for relationships that have a logical explanation and relate to the real world.
For example: the relationship between age and high blood pressure.
Relationships between two attributes:
- If X occurs then Y occurs
- If X occurs then Y does not occur
- If X increases (decreases) then Y increases (decreases)
- If X increases (decreases) then Y decreases (increases)
Relationships among multiple attributes:
- If X occurs and Y decreases then Z occurs
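The "if ... then ..." patterns above can be checked against a dataset by counting how often the outcome holds among the records that satisfy the condition. The following is a minimal sketch under stated assumptions: the function name `rule_hit_rate`, the attribute keys, and the toy records are all hypothetical, and a high hit rate on a sample does not by itself prove a real-world relationship.

```python
def rule_hit_rate(records, condition, outcome):
    """Return the fraction of records matching `condition` that also match `outcome`.

    Returns None when no record matches the condition.
    """
    matching = [r for r in records if condition(r)]
    if not matching:
        return None
    return sum(1 for r in matching if outcome(r)) / len(matching)


# Toy records (made up): whether X, Y, Z occurred, and how Y changed.
records = [
    {"x": True, "y": True, "y_change": -2, "z": True},
    {"x": True, "y": True, "y_change": 1, "z": False},
    {"x": True, "y": False, "y_change": -1, "z": True},
    {"x": False, "y": False, "y_change": -3, "z": False},
]

# Two attributes: "if X occurs then Y occurs".
two_attr = rule_hit_rate(records, lambda r: r["x"], lambda r: r["y"])

# Multiple attributes: "if X occurs and Y decreases then Z occurs".
multi_attr = rule_hit_rate(
    records,
    lambda r: r["x"] and r["y_change"] < 0,
    lambda r: r["z"],
)

print(two_attr, multi_attr)
```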
Correlation: a measure of the statistical relationship between the two variables in bivariate data.
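One common way to quantify correlation is the Pearson correlation coefficient, which ranges from -1 (Y decreases as X increases) to +1 (Y increases as X increases). The notes do not name a specific measure, so choosing Pearson here is an assumption. The sketch below uses made-up age and blood pressure values purely for illustration; they are not real measurements.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Made-up values for illustration only.
ages = [25, 35, 45, 55, 65]
systolic = [115, 120, 128, 135, 142]

r = pearson_r(ages, systolic)
print(round(r, 3))
```

A value of `r` close to +1 corresponds to the "if X increases then Y increases" pattern, and a value close to -1 to the "if X increases then Y decreases" pattern. Correlation alone does not supply the logical explanation for the relationship.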
To learn the relationships among attributes:
- Know the domain
- Understand the attributes in the domain
- Collect the data
- Build a model from the data that captures the key attributes and their relationships
- Make a prediction
- A complex relationship will lead to a complex model
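The "collect data, build a model, predict" steps above can be sketched with the simplest possible model, a straight line fitted by least squares. This is only one illustrative choice of model and is not prescribed by the notes. The function names `fit_line` and `predict` are hypothetical, and the data values are made up.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new x value."""
    slope, intercept = model
    return intercept + slope * x


# Step "collect the data": made-up values for illustration only.
ages = [25, 35, 45, 55, 65]
systolic = [115, 120, 128, 135, 142]

# Step "build a model": the line captures how the second attribute moves with the first.
model = fit_line(ages, systolic)

# Step "make a prediction" for an unseen value.
predicted = predict(model, 50)
print(model, predicted)
```

A straight line has only two parameters. A more complex relationship, such as one involving several attributes or a non-linear pattern, would need a model with more parameters.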