Cognitive Computing and Big Data Analytics

Table of Contents

• Introduction
• Cognitive Computing and Big Data Framework
• Definition
• Computing with Heterogeneous Data
• Applications of Cognitive Computing Including Big Data
• For Transportation Systems
• For the Environment
• Urban Computing for Urban Energy Consumption
• Urban Computing for the Economy
• Urban Computing for Public Safety and Security
• Typical Technology
• Urban Data Management Techniques
• Techniques for Dealing with Data Sparsity
• Big Data Visualization
• Optimization Techniques
• Information Security
• Future Directions
• Conclusion

There is a problem with big data: there is too much information and not enough talent to manage it. The supply of analysts and data scientists cannot keep up with the ever-increasing demand for this kind of talent. The shortage matters because even the most advanced data platforms are useless without experienced professionals to operate and manage them. How can we solve this problem? More training and better academic programs? Maybe. But what if there were another solution? What if we trained computers to do the work for us, or at least to make data tools easier to manage? Advances in cognitive computing are making this a near reality.

Introduction

Sensing technologies and large-scale computing infrastructures have produced a variety of big data in urban spaces (e.g., human mobility, air quality, traffic patterns, and geographic data). Big data involves rich knowledge about a population or organization and, when used correctly, can help address the major challenges that cities face. Motivated by the opportunity to build smarter cities, we can propose a vision for urban computing that aims to unleash the power of the knowledge hidden in the large, heterogeneous data collected in urban spaces and to apply those insights to the biggest challenges facing our cities today. In short, the goal is to tackle the big challenges of big cities using big data.

Cognitive computing will bring a high level of fluidity to analytics. Data processing, normally essential for analytics functions to operate properly, will allow staff who are not as well versed in the language of data to interact with programs and platforms in much the same way that humans interact with one another. Platforms built with AI technology could therefore translate ordinary speech and requests into data queries, accept simple commands expressed in natural language, and return responses in the same form in which they were received. With a capability like this, it would be far easier for anyone to work in the data field.

Cognitive Computing and Big Data Framework

Definition

Cognitive computing is a process of acquiring, integrating, and analyzing the large, heterogeneous data generated by diverse sources in urban spaces, such as sensors, devices, vehicles, buildings, and people, in order to tackle the major problems that cities face (e.g., air pollution, rising energy consumption, and traffic congestion). It connects unobtrusive, ubiquitous sensing technologies, advanced data management and analytics models, and novel visualization methods to create win-win solutions that improve the urban environment, human quality of life, and city operating systems. Cognitive computing also helps us understand the nature of urban phenomena and even predict the future. It is an interdisciplinary field that merges computer science with traditional fields such as transportation, civil engineering, economics, ecology, and sociology in the context of urban spaces.
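To make the acquire-integrate-analyze pipeline in this definition slightly more concrete, here is a minimal, hypothetical sketch in Python; every name in it (Reading, acquire, integrate, analyze) is an illustrative assumption rather than part of any existing framework.

```python
# A minimal, hypothetical sketch of the acquire -> integrate -> analyze loop
# described above. All names are illustrative; a real urban-computing system
# would plug in sensor feeds, databases, and learned models here.
from dataclasses import dataclass
from collections import defaultdict
from statistics import mean

@dataclass
class Reading:
    source: str      # e.g. "air_quality", "traffic", "meteorology"
    region: str      # a coarse spatial cell
    hour: int        # a coarse temporal cell
    value: float

def acquire() -> list[Reading]:
    # Stand-in for heterogeneous feeds (sensors, vehicles, buildings, people).
    return [
        Reading("air_quality", "r1", 8, 63.0),
        Reading("traffic", "r1", 8, 0.82),
        Reading("air_quality", "r2", 8, 41.0),
    ]

def integrate(readings: list[Reading]):
    # Align every source onto the same (region, hour) key so later analysis
    # can reason over all sources jointly.
    grid = defaultdict(lambda: defaultdict(list))
    for r in readings:
        grid[(r.region, r.hour)][r.source].append(r.value)
    return grid

def analyze(grid):
    # Toy "analytics": per-cell averages; a real system would run learned models.
    return {cell: {src: mean(vals) for src, vals in sources.items()}
            for cell, sources in grid.items()}

if __name__ == "__main__":
    print(analyze(integrate(acquire())))
```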
Computing with Heterogeneous Data

Learning mutually reinforcing knowledge from heterogeneous data: Solving urban challenges involves a wide range of factors (e.g., studying air pollution means simultaneously studying traffic, meteorology, and land use). However, existing data mining and machine learning techniques typically process only one type of data: computer vision handles images, for instance, while natural language processing works on text. Treating features extracted from different data sources identically (e.g., simply concatenating them into a single feature vector and throwing that vector into a classification model; see the sketch at the end of this subsection) does not yield the best performance. In addition, using multiple data sources in an application leads to a high-dimensional feature space, which generally worsens the data sparsity problem. If not managed properly, additional data sources can even hurt a model's performance. This calls for advanced analytics models capable of extracting mutually reinforcing knowledge from the heterogeneous data generated by different sources, including sensors, people, vehicles, and buildings.

Effective and efficient learning capability: Many urban computing scenarios (e.g., traffic anomaly detection and air quality monitoring) require instant responses. Beyond simply adding machines to speed up computation, we need to bring data management, data mining, and machine learning algorithms together in a single computational framework that provides knowledge discovery that is both effective and efficient. In addition, traditional data management techniques are typically designed for a single data modality; an advanced management methodology that can properly organize multimodal data (such as streaming, geospatial, and text data) is still lacking. Computing with multiple heterogeneous data sources is therefore a fusion of data and algorithms.

Visualization: Big data brings a huge amount of information that demands better presentation. A good visualization of the raw data can inspire new ideas for solving a problem, while visualizing computational results can reveal knowledge intuitively and aid decision-making. Data visualization can also suggest correlation, or even causation, between different factors. The multimodal data in urban computing scenarios leads to views with many dimensions, such as space, time, and social structure. How to relate different types of data across different views and detect patterns and trends is challenging. Moreover, when faced with many types and huge volumes of data, it becomes even harder to see how exploratory visualization can give users an interactive way to generate new hypotheses. This requires integrating instant data mining techniques into a visualization framework, something urban computing still lacks.
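As a concrete illustration of the naive fusion described above, the sketch below (synthetic data; the sources and feature names are illustrative assumptions) concatenates features extracted from three hypothetical sources into one flat vector and feeds them to a single classifier, which is exactly the baseline that paragraph argues is usually not the best-performing option.

```python
# A minimal sketch of naive feature concatenation across heterogeneous urban
# data sources (synthetic data; feature names are illustrative assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200  # number of (region, hour) cells in this toy example

# Features extracted independently from three hypothetical sources.
traffic = rng.random((n, 4))      # e.g. speed, volume, congestion indices
meteorology = rng.random((n, 3))  # e.g. temperature, humidity, wind speed
poi = rng.random((n, 5))          # e.g. density of POI categories

# Naive fusion: concatenate everything into one flat feature vector.
X = np.hstack([traffic, meteorology, poi])
y = rng.integers(0, 2, size=n)    # toy label, e.g. "high pollution" yes/no

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))
# The point made above: treating all sources identically like this ignores
# their different natures and can underperform models that fuse them more
# carefully (e.g. the multi-view co-training approach discussed later).
```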
Applications of Cognitive Computing Including Big Data

For Transportation Systems

Finding fast driving routes saves both the driver's time and energy, since idling in traffic jams wastes a great deal of fuel. Extensive studies have been carried out to learn historical traffic patterns, estimate real-time traffic flows, and predict future traffic conditions on individual road segments using floating car data such as the GPS trajectories of vehicles, WiFi signals, and GSM signals. However, work that models traffic patterns at the scale of an entire city is still rare.

Taxis are an important travel mode between public and private transport, providing almost door-to-door service. In big cities like New York and Beijing, people often wait a significant amount of time before catching a vacant taxi, while taxi drivers are eager to find passengers. Efficiently connecting passengers with vacant taxis is of great importance for reducing people's waiting time, increasing taxi drivers' profits, and cutting unnecessary traffic and energy consumption.

By 2050, 70% of the world's population is expected to live in cities. Municipal planners will face an increasingly urbanized and polluted world in which cities everywhere suffer from over-stretched road networks. Building more efficient public transport systems, as alternatives to private vehicles, has therefore become an urgent priority, both to provide a good quality of life and a cleaner environment and to remain economically attractive to potential investors and employees. Public transport systems, combined with integrated fare management and advanced passenger information systems, are considered essential tools for managing mobility.

For the Environment

Without effective and adaptive planning, rapid urbanization becomes a potential threat to the environment of cities. In recent years we have witnessed an increasing trend of pollution in different aspects of the environment, such as air quality, noise, and waste, all over the world. Protecting the environment while modernizing people's lives is of the utmost importance in urban computing.

Urban Computing for Urban Energy Consumption

The rapid progress of urbanization consumes more and more energy, which calls for technologies that can sense energy costs at city scale, improve energy infrastructure, and ultimately reduce energy consumption.

Urban Computing for the Economy

The dynamics of a city (for example, human mobility and the number of changes in a POI category) can indicate the trend of the city's economy. For instance, the number of movie theaters in Beijing kept increasing between 2008 and 2012, reaching 260, which suggests that more and more people living in Beijing want to watch films in theaters. Conversely, a POI category disappearing from a city signals a slowdown in that line of business. Likewise, human mobility can indicate the unemployment rate of major cities and help predict the trend of a stock market.

Urban Computing for Public Safety and Security

Major events, pandemics, severe accidents, environmental disasters, and terrorist attacks pose additional threats to public safety and order. The broad availability of different types of urban data allows us, on the one hand, to learn from history how to deal properly with such threats and, on the other, to detect them in time or even predict them in advance.

Typical Technology

Urban Data Management Techniques

Data generated in urban spaces is usually associated with a spatial or spatio-temporal property. For example, road networks and POIs are spatial data frequently used in urban applications, while weather data, surveillance video, and power consumption are time-based data (also called time series or streams). Other data sources, such as traffic flows and human mobility, possess spatial and temporal properties simultaneously. Sometimes temporal data is also associated with a location, thereby becoming a kind of spatio-temporal data (for example, the temperature of a region or the electricity consumption of a building). Good urban data management techniques should therefore be able to process spatial and spatio-temporal data efficiently.
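As a small illustration of what efficient spatio-temporal processing can look like, here is a minimal sketch of a grid-plus-time-bucket index; the cell size, time slot, and class name are assumptions made for the example and do not describe any particular system.

```python
# A minimal sketch of a grid-based spatio-temporal index: points are bucketed
# by a coarse spatial cell and an hourly time slot so that lookups touch a
# handful of buckets instead of scanning every record.
from collections import defaultdict

CELL_DEG = 0.01        # roughly 1 km grid cell (assumed resolution)
SLOT_SECONDS = 3600    # hourly time buckets (assumed resolution)

class GridIndex:
    def __init__(self):
        self.buckets = defaultdict(list)  # (cell_x, cell_y, slot) -> records

    def _key(self, lat, lon, ts):
        return (int(lat / CELL_DEG), int(lon / CELL_DEG), int(ts // SLOT_SECONDS))

    def insert(self, lat, lon, ts, record):
        self.buckets[self._key(lat, lon, ts)].append((lat, lon, ts, record))

    def query(self, lat, lon, ts):
        """Return records in the same spatial cell and hour as (lat, lon, ts)."""
        return self.buckets.get(self._key(lat, lon, ts), [])

idx = GridIndex()
idx.insert(39.9042, 116.4074, 1_700_000_000, {"pm25": 80})
print(idx.query(39.9041, 116.4070, 1_700_000_500))
```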
In addition, an urban computing system generally has to exploit a variety of heterogeneous data, and in many cases it must respond quickly to instant user queries (e.g., predicting traffic conditions or air pollution). Without data management techniques that can organize multiple heterogeneous data sources, it becomes impossible for the subsequent data mining process to acquire knowledge from those sources quickly. For example, without an efficient spatio-temporal indexing structure that organizes POIs, road networks, traffic, and human mobility data in advance, the feature extraction process of the U-Air project alone would take a few hours, and this delay would prevent the application from informing people about a city's air quality every hour.

Techniques for Dealing with Data Sparsity

There are many reasons for missing data. For example, a user checks in at only a few locations in a location-based social networking service, and some locations may receive no visits at all. If we place users and locations in a matrix in which each entry indicates the number of visits a user has made to a location, the matrix is very sparse; that is, many entries have no value. If we further consider the activities (such as shopping, dining, and sports) that a user can perform at a place as a third dimension, a tensor can be formulated, and the tensor is of course even sparser. Data sparsity is a general challenge that has been studied for years in many computing tasks.

Big Data Visualization

When talking about data visualization, many people think only of visualizing raw data and presenting the results generated by data mining processes. The former can reveal correlations between different factors, thereby suggesting features for a machine learning model. As mentioned earlier, spatio-temporal data is widely used in urban computing. For a comprehensive analysis, such data must be considered from two complementary perspectives: as spatial distributions changing over time, and as profiles of local temporal variation distributed over space. However, data visualization is not just about displaying raw data and presenting results; exploratory visualization becomes even more important in urban computing.

Semi-Supervised Learning and Transfer Learning

Semi-supervised learning is a class of supervised learning tasks and techniques that also makes use of unlabeled data for training, typically a small amount of labeled data together with a large amount of unlabeled data. Many machine learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce a considerable improvement in learning accuracy. There are several semi-supervised learning methods, such as generative models, graph-based methods, and co-training. Co-training, in particular, is a semi-supervised learning technique that requires two views of the data. It assumes that each example is described by two different sets of features that provide different, complementary information about the instance. Ideally, the two feature sets of each instance are conditionally independent given the class, and the class of an instance can be accurately predicted from each view alone. Co-training can produce a better inference result because one classifier correctly labels data that the other classifier previously misclassified.
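The following is a minimal co-training sketch on synthetic data, assuming two feature views and scikit-learn classifiers; it is only meant to show the label-exchange loop described above, not a production implementation.

```python
# A minimal co-training sketch: two classifiers, each trained on its own view,
# repeatedly label the unlabeled pool for each other (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, n_labeled = 400, 40
y_true = rng.integers(0, 2, size=n)
# Two "views" of each instance, both weakly correlated with the label.
view_a = y_true[:, None] + rng.normal(0, 1.0, size=(n, 3))
view_b = y_true[:, None] + rng.normal(0, 1.0, size=(n, 4))

labels = {i: y_true[i] for i in range(n_labeled)}   # small labeled seed set
unlabeled = set(range(n_labeled, n))

for _ in range(5):  # a few co-training rounds
    idx = sorted(labels)
    clf_a = LogisticRegression().fit(view_a[idx], [labels[i] for i in idx])
    clf_b = LogisticRegression().fit(view_b[idx], [labels[i] for i in idx])
    for clf, view in ((clf_a, view_a), (clf_b, view_b)):
        pool = sorted(unlabeled)
        proba = clf.predict_proba(view[pool])
        # Move the single most confident prediction per class to the labeled set.
        for cls in (0, 1):
            best = pool[int(np.argmax(proba[:, cls]))]
            if best in unlabeled:
                labels[best] = cls
                unlabeled.remove(best)

print("labeled examples after co-training:", len(labels))
```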
Transfer learning: A major assumption in many machine learning and data mining algorithms is that the training data and the future data must be in the same feature space and have the same distribution. In many real-world applications, however, this assumption does not hold. For example, we sometimes have a classification task in one domain of interest but sufficient training data only in another domain, where the latter data may lie in a different feature space or follow a different distribution. Unlike semi-supervised learning, which assumes that the distributions of the labeled and unlabeled data are the same, transfer learning allows the domains, tasks, and distributions used in training and testing to be different. We see many examples of transfer learning in the real world; learning to recognize tables, for instance, can help in recognizing chairs.

Optimization Techniques

First, many data mining tasks can be solved by optimization methods such as matrix factorization and tensor decomposition; examples include location and activity recommendation and research on inferring refueling behavior. Second, the training process of many machine learning models is itself based on optimization and approximation algorithms, for example maximum likelihood, gradient descent, and EM (expectation maximization). Third, results from operations research can be applied to urban computing tasks when combined with other techniques, such as database algorithms. For example, the ridesharing problem has been studied for many years in operations research. It turns out to be NP-hard if we want to minimize the total travel distance of a group of people who wish to share rides, so it is very difficult to apply such solutions to a large group of users, especially in an online application. The dynamic taxi ridesharing system T-Share combines spatio-temporal database techniques with optimization algorithms to sharply reduce the number of taxis that need to be checked, so the service can be provided online to answer instant queries from millions of users. Another example combines a PCA-based anomaly detection algorithm with L1-minimization techniques to diagnose the traffic flows that lead to a traffic anomaly. The spatio-temporal properties and dynamics of urban computing applications also pose new challenges to current operations research.
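To connect the first two points, here is a minimal sketch of matrix factorization fitted by plain gradient descent on a sparse user-by-location visit matrix like the one in the data sparsity discussion; the data is synthetic and every hyperparameter is an assumption made for illustration.

```python
# A minimal sketch of matrix factorization by gradient descent: approximate a
# sparse user x location visit matrix as U @ V.T, learning only from the
# observed entries (synthetic data; all hyperparameters are assumptions).
import numpy as np

rng = np.random.default_rng(2)
n_users, n_locs, k = 30, 40, 5
observed = [(rng.integers(n_users), rng.integers(n_locs), float(rng.integers(1, 6)))
            for _ in range(150)]   # (user, location, visit count) triples

U = 0.1 * rng.standard_normal((n_users, k))
V = 0.1 * rng.standard_normal((n_locs, k))
lr, reg = 0.01, 0.02

for epoch in range(200):
    for u, i, r in observed:
        err = r - U[u] @ V[i]
        u_row, v_row = U[u].copy(), V[i].copy()
        # Gradient step on the squared error of this observed entry.
        U[u] += lr * (err * v_row - reg * u_row)
        V[i] += lr * (err * u_row - reg * v_row)

rmse = np.sqrt(np.mean([(r - U[u] @ V[i]) ** 2 for u, i, r in observed]))
print("RMSE on observed entries:", round(rmse, 3))
# Unobserved entries can now be estimated as U[u] @ V[i], which is how such
# factorizations fill in sparse check-in data for recommendation tasks.
```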
Information Security

Information security is also non-trivial for an urban computing system that collects data from many different sources and communicates with millions of devices and users. Common issues include data security (e.g., ensuring that received data has integrity, freshness, and non-repudiation), authentication between different sources and clients, and intrusion detection in a hybrid system that connects the digital and physical worlds.

Future Directions

Although many research projects on urban computing have been carried out in recent years, many technologies are still missing or not well studied.

Balanced crowd sensing: The data generated by a crowd-sensing approach is not evenly distributed across geographical and temporal space. In some places we may have far more data than we actually need, and an undersampling method (e.g., compressive sensing) could help reduce a system's communication load. Conversely, where we do not have enough data, if any at all, incentives that motivate users to contribute data should be considered. Given a limited budget, how to configure those incentives across locations and time periods so as to maximize the quality of the received data (e.g., its coverage or accuracy) for a specific application remains to be explored.

Skewed data distribution: In many cases we obtain only a sample of the urban data, and its distribution may be skewed compared with the full dataset; obtaining the entire dataset may still be impossible for an urban computing system. Some information is transferable from the partial data to the whole. For example, the speed of taxis travelling on a road can be transferred to the other vehicles travelling on the same road segment, and the waiting time of a taxi at a gas station can likewise be used to infer the waiting time of other vehicles. Other information, however, cannot be transferred directly: the traffic volume of taxis on a road may differ from that of private vehicles, so observing more taxis on a road segment does not always imply more vehicles of other kinds.

Managing and indexing multimodal data sources: Different types of index structures have been proposed to handle individual data types, while hybrid indexes that can handle multiple data types simultaneously (e.g., spatial, temporal, and social media data) still need to be studied. Such a hybrid index would provide a foundation for effective and efficient learning from multiple heterogeneous data sources.

Knowledge fusion: models that can fuse knowledge from multiple heterogeneous data sources, rather than merely combining their features, also remain to be developed.