Handling Big Data for ML Applications
Below are six ways to handle huge amounts of data while training an ML model.
Abstract the Data: A key concept underlying deep learning methods is the distributed representation of data, in which a very large number of configurations of abstract input features can be expressed compactly. In the same spirit, you can reduce the number of features with dimensionality-reduction techniques such as PCA. This gives a compact representation of each sample and often improves generalization.
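As a concrete illustration, here is a minimal sketch of dimensionality reduction with scikit-learn's PCA. The matrix shape and the 95% variance threshold are arbitrary choices for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: 10,000 samples with 500 features each.
X = np.random.rand(10_000, 500)

# Keep just enough principal components to explain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("components kept:", pca.n_components_)
```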
Work with a Smaller Sample: Are you sure you need to work with all of the data? Take a random sample of your data, say 1,000 or 100,000 rows, rather than simply the first rows of the file. Use this smaller sample to work through your problem before fitting a final model on all of your data (for example, with the progressive loading techniques described below).
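A minimal sketch of random sampling at load time with Pandas; the file name and sample size are placeholders for this example:

```python
import random
import pandas as pd

# Hypothetical file name and sample size; swap in your own dataset.
CSV_PATH = "big_dataset.csv"
SAMPLE_ROWS = 100_000

# Count the data rows once (a line scan, nothing is loaded into memory).
with open(CSV_PATH) as f:
    n_rows = sum(1 for _ in f) - 1  # minus the header line

# Randomly choose which rows to skip so that only SAMPLE_ROWS rows remain.
skip = sorted(random.sample(range(1, n_rows + 1), n_rows - SAMPLE_ROWS))
sample = pd.read_csv(CSV_PATH, skiprows=skip)
print(sample.shape)
```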
Change the Data Format: Is your data stored in raw ASCII text, such as a CSV file? Perhaps you can speed up data loading and use less memory by switching to another data format. Good examples are binary formats such as GRIB, NetCDF, or HDF. There are many command-line tools that can transform one data format into another without loading the entire dataset into memory.
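For instance, a one-off conversion from CSV to HDF5 can be sketched with Pandas as below. The paths and chunk size are placeholders, and the HDF5 backend requires the PyTables package:

```python
import pandas as pd

# Hypothetical paths; adjust to your own files. Requires PyTables (pip install tables).
CSV_PATH = "big_dataset.csv"
HDF_PATH = "big_dataset.h5"

# One-off conversion: stream the CSV in chunks and append to an HDF5 table,
# so the whole file never has to fit in memory at once.
with pd.HDFStore(HDF_PATH, mode="w") as store:
    for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
        store.append("data", chunk)

# Subsequent loads are binary reads, usually much faster than re-parsing text.
df = pd.read_hdf(HDF_PATH, "data")
print(df.shape)
```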
Stream Data or Use Progressive Loading: Does all of the data need to be in memory at the same time? Perhaps you can use code or a library to stream or progressively load data into memory as needed during training. This may require algorithms that can learn iteratively using optimization techniques such as stochastic gradient descent, instead of algorithms that need all of the data in memory to perform matrix operations, as some implementations of linear and logistic regression do. For example, the Keras deep learning library offers this capability for progressively loading image files through its flow_from_directory method. Another example is the Pandas library, which can load large CSV files in chunks, as sketched below.
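Here is a minimal sketch of progressive loading combined with incremental learning, using Pandas chunks to feed scikit-learn's SGDClassifier via partial_fit. The file name and column names are assumptions for the example:

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Hypothetical CSV with numeric feature columns plus a binary "label" column.
CSV_PATH = "big_dataset.csv"

# Logistic regression trained by SGD ("log_loss" is named "log" in older scikit-learn).
model = SGDClassifier(loss="log_loss")
classes = [0, 1]  # all classes must be declared up front for partial_fit

# Stream the file in 100k-row chunks and update the model incrementally,
# so only one chunk is ever held in memory.
for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
    y = chunk.pop("label")
    model.partial_fit(chunk, y, classes=classes)
```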
Use a Big Data Platform: In some cases the data is so large that none of the previous options will cut it, and you may need to resort to a big data platform, that is, a platform designed for handling very large datasets that lets you run data transforms and machine learning algorithms on top of it. Two good examples are Hadoop with the Mahout machine learning library and Spark with the MLlib library.
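A rough sketch of what this looks like with Spark's DataFrame-based MLlib API in PySpark; the file path and column names are assumptions, and a real job would also need train/test handling and cluster configuration:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("big-data-ml").getOrCreate()

# Hypothetical dataset with numeric feature columns and a numeric "label" column.
df = spark.read.csv("hdfs:///data/big_dataset.csv", header=True, inferSchema=True)

# Spark ML expects the features packed into a single vector column.
feature_cols = [c for c in df.columns if c != "label"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_df = assembler.transform(df)

# Training is distributed across the cluster's executors.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_df)
```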
Use a Relational Database: Relational databases provide a standard way of storing and accessing very large datasets. Internally, the data is stored on disk, can be progressively loaded in batches, and can be queried using a standard query language (SQL). Free open-source databases such as MySQL or Postgres can be used, and most programming languages and many machine learning tools can connect to them directly. You can also use a lightweight option such as SQLite.
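As a sketch, the standard-library sqlite3 module plus Pandas can stream query results in batches; the database, table, and column names and the process() helper are hypothetical:

```python
import sqlite3
import pandas as pd

# Hypothetical SQLite database file and table.
conn = sqlite3.connect("big_dataset.db")

# Pull only the rows and columns you need, in manageable batches.
query = "SELECT feature_1, feature_2, label FROM training_data WHERE label IS NOT NULL"
for batch in pd.read_sql_query(query, conn, chunksize=50_000):
    process(batch)  # hypothetical helper, e.g. feed each batch to partial_fit as above

conn.close()
```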
I hope you found the blog interesting and helpful. Please subscribe for more such content.