This lecture is excerpted from the speech "Frontier Technologies in Social Network Data Mining", given by Yu Shilun, Dean of the Institute of Data Science at Tsinghua University, at the "Social Relationship Networks and Big Data Technology" session of the Tsinghua RONGv2.0 forum series on December 23, 2015.
First of all, I would like to thank all the guests for taking part in the Tsinghua RONGv2.0 seminar on "Social Relationship Networks and Big Data Technology". Professor Deng has just explained why this topic matters, and now I will share my own research in this area.
We all know that big data has four "Vs". The data is large in volume and generated at high velocity. More importantly, big data has variety: like a kaleidoscope, it comes in all kinds of forms. To make good use of varied data, we must fuse the different types together. That is the theme of this series, RONG. Only by fusing different data can we obtain results that are more accurate and richer in content. Variety also brings a problem: not all data can be fused directly, and if the fusion is handled badly it will contaminate the good data. Once the different types of data have been fused effectively, we must still be able to extract value from them.
Social networks are a typical example. They are very large: Facebook in the United States has billions of network nodes, and the number of network nodes in China is also very large. People constantly express opinions and share photos or videos on social networks, which generates information in many forms, such as text, images, links, and communities. The information on social networks is huge in scale but sparse in value, and how to obtain value from it is the problem to be solved.
Today I will mainly talk about two issues: first, how to integrate different types of data, and second, how to deal with junk data.
Although we speak of "social networks", there is not just one. In the United States there are many. The most familiar is Facebook, but it is not the only one. Twitter is also widely used, although it carries only short messages. Foursquare is mainly a location-based social network centered on shops and venues: when you or your friends visit a store, the account records that visit, and gradually a network forms that reflects the preferences and interests of friends. On LinkedIn, almost everyone posts their work history, so if you want to hire someone you can look there for a candidate who fits your needs. On YouTube you can post short videos you have shot.
In short, there are many kinds of social networks. Because they differ in character and focus, people usually take part in several of them, and each person shows different characteristics on different networks. If we can integrate multiple social networks, we obtain more information about each user. If you run a new social network, you can also draw on information from other, established networks.
How to integrate these networks is therefore a challenge, and it involves two problems. First, the name a person uses on different social networks may not be the same, so it is difficult to know that an account in social network A and an account in social network B belong to the same person.
Second, even if you know that account A on Facebook and account B on Twitter belong to the same person, how to help that person make better use of their Facebook information on Twitter is another challenge.
The purpose of a social network is to connect people, so what we usually want to do on a social network is recommend friends, much as an e-commerce site recommends products. The most important thing on a social network is social interaction, that is, finding out who is friends with whom, and every social network keeps recommending more friends to you. But how should it recommend them?
A social network usually holds several kinds of information. First, we know that some people are already friends, because they established those connections themselves. We may also know where these people are. And because people post messages, we can learn what each person is interested in and when. So in a typical social network we know who is interested in what, where, and approximately when.
Now suppose we want to connect two social networks. Sometimes this is easy: Foursquare, for example, sometimes points to a user's Twitter account, which links the two accounts directly. In this way we can link some users one by one, but for most users we do not know the corresponding account in the other network. In general, then, when we obtain a pair of networks, only part of the users on the two sides have known correspondences.
In our research, friend prediction therefore works like this: starting from the anchor links we already know (links between accounts on the two networks that belong to the same person), we train a model to infer more anchor links, use them to map social links from one network onto the other, and then return to the original network to make the prediction. Why connect to another social network?
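A small sketch may make the role of anchor links concrete. It is an illustrative simplification under assumed toy data, not the trained model described in the talk: the function names, account names, and the two tiny networks are all hypothetical. It carries a user's friendships from a second network over to the first through known anchor links, then scores a candidate pair by how many friends they share.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # user -> set of friends (undirected friendships)


def enriched_friends(user: str, target: Graph, source: Graph,
                     anchors: Dict[str, str]) -> Set[str]:
    """Friends of `user` in the target network, plus friends carried over
    from the source network through known anchor links.

    `anchors` maps a target-network account to the source-network account
    of the same person. It is only known for some users.
    """
    friends = set(target.get(user, set()))
    source_account = anchors.get(user)
    if source_account is not None:
        # Invert the anchor links so source accounts map back to target accounts.
        to_target = {src: tgt for tgt, src in anchors.items()}
        for source_friend in source.get(source_account, set()):
            target_friend = to_target.get(source_friend)
            if target_friend is not None:  # this friend also has a known anchor link
                friends.add(target_friend)
    friends.discard(user)
    return friends


def common_friend_score(a: str, b: str, target: Graph, source: Graph,
                        anchors: Dict[str, str]) -> int:
    """Number of friends shared by `a` and `b` after enrichment."""
    friends_a = enriched_friends(a, target, source, anchors)
    friends_b = enriched_friends(b, target, source, anchors)
    return len((friends_a & friends_b) - {a, b})


# Hypothetical toy data. In the target network only A and C are connected;
# in the source network the accounts "a2" and "b2" are connected.
target = {"A": {"C"}, "B": set(), "C": {"A"}}
source = {"a2": {"b2"}, "b2": {"a2"}}
anchors = {"A": "a2", "B": "b2"}  # A is a2, B is b2

score_without = common_friend_score("B", "C", target, source, {})
score_with = common_friend_score("B", "C", target, source, anchors)
print(score_without, score_with)
```

With no anchor links, B has no friends in the target network and the pair gets no support. With the two anchor links, B's friendship with A is carried over and A becomes a shared friend of B and C. A real system would learn from many features rather than count shared friends alone.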
For example, suppose we want to predict whether A and B are friends. We look at whether A's friends and B's friends overlap. If A and B have many friends in common, we consider them likely to become friends and can make a recommendation. If there is no overlap, it is hard to make a guess, and if the network is not dense enough we may well find no connection between A and B at all. However, if we know A's corresponding account in another social network, we also know A's friends there, and that information is very helpful for recommending friends.
Let's look at another example with two social networks. In the first network, C is connected to A, and in the other network, B is connected to A. Once the two accounts for A are matched, B and C turn out to be friends of a friend. If we combine these data sources effectively, our predictions will be more accurate than with a single source.
The next question is how to deal with spam in social networks. When we go out to eat, for example, we are used to checking the restaurant's reviews on Dianping.com. If everyone says the restaurant is good, we go. The problem is that these reviews often contain a lot of spam: the food may not really be good, but someone deliberately posts good reviews to trick you, or someone who dislikes the restaurant next door posts bad reviews about it. So the reviews you see are sometimes fabricated. If you want to know what a restaurant is really like, you have to remove the spam, otherwise the reviews are not credible. But when you read a single review, it is hard to decide whether it is spam, because a fake review can be written with just as much feeling, for example "I came here and it was delicious", even though it does not match the facts.
So it is not enough to just read the text. We cannot say that a well-written review is not spam. Nor can we say that a badly written one is spam, since it may simply have been typed on a mobile phone and contain typos. This is a very difficult problem.
Fortunately, we do not have just one review but many, and not just one restaurant but many. We can use this large amount of data to help solve the problem, so the sheer size of big data is also a great benefit. Generally, a reviewer reviews many different restaurants, and each restaurant receives many reviews, so we look for the relationships among them. If the reviews a reviewer writes are very credible, we say the reviewer is honest. Conversely, if a review agrees with what other honest reviewers wrote, we say the review is credible. And if most honest reviewers say a restaurant is good, we say the restaurant is reliable, and vice versa. Finally, we disregard reviews written by dishonest reviewers.
To sum up, how do we judge whether a review is honest? It depends on two things. First, if the review agrees with the opinions of trustworthy people, it is honest. Second, if the review disagrees with the opinions of dishonest people, that also gives us some evidence.
Finally, consider the scoring behavior. If a store is clearly very good and you give it a bad review, that weighs heavily against your honesty as a reviewer. But if some people like the store and some do not, your rating matters little and does not affect your integrity. Ultimately, when we judge whether a store is good, if all honest reviewers say it is good, then it is good, and if honest reviewers say it is not good, we can trust that too. In this way the reviews themselves are enough to tell whether a store is good.
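The mutual dependence described above (honest reviewers, credible reviews, reliable stores) can be sketched as a small iterative computation. This is a deliberately simplified illustration under stated assumptions, not the algorithm from the research: ratings are reduced to +1 (good) or -1 (bad), the update formulas are chosen only for clarity, and every name and data point below is hypothetical.

```python
from collections import defaultdict


def score_reviews(reviews, iterations=10):
    """Iteratively estimate reviewer honesty, store reliability and
    review credibility.

    `reviews` is a list of (reviewer, store, rating), rating in {+1, -1}.
    Honesty and credibility lie in [0, 1]; reliability lies in [-1, 1].
    """
    reviewers = {reviewer for reviewer, _, _ in reviews}
    honesty = {reviewer: 1.0 for reviewer in reviewers}  # trust everyone at first
    reliability = {}

    for _ in range(iterations):
        # Store reliability: honesty-weighted average of the ratings it received.
        weighted_sum = defaultdict(float)
        weight = defaultdict(float)
        for reviewer, store, rating in reviews:
            weighted_sum[store] += honesty[reviewer] * rating
            weight[store] += honesty[reviewer]
        reliability = {
            store: (weighted_sum[store] / weight[store] if weight[store] > 0 else 0.0)
            for store in weight
        }

        # Reviewer honesty: average agreement with the honest consensus.
        # A store with mixed opinions has reliability near 0, so rating it
        # either way moves honesty very little.
        total = defaultdict(float)
        count = defaultdict(int)
        for reviewer, store, rating in reviews:
            agreement = rating * reliability[store]  # in [-1, 1]
            total[reviewer] += (1 + agreement) / 2   # rescaled to [0, 1]
            count[reviewer] += 1
        honesty = {reviewer: total[reviewer] / count[reviewer] for reviewer in reviewers}

    # Review credibility: how well each review matches the final consensus.
    credibility = {
        (reviewer, store): (1 + rating * reliability[store]) / 2
        for reviewer, store, rating in reviews
    }
    return honesty, reliability, credibility


# Hypothetical toy data: three reviewers agree, one contradicts them everywhere.
reviews = [
    ("r1", "S1", +1), ("r2", "S1", +1), ("r3", "S1", +1), ("x", "S1", -1),
    ("r1", "S2", -1), ("r2", "S2", -1), ("r3", "S2", -1), ("x", "S2", +1),
]
honesty, reliability, credibility = score_reviews(reviews)
print(honesty, reliability)
```

In this toy run the contradicting reviewer's honesty falls with each round while the others' rises, so that reviewer's ratings count for less and less in the store scores. A practical detector would also need to resist groups of colluding spammers, which this sketch does not attempt.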
For example, if we compare the Resellerating scores of these stores, broadly speaking, the higher the Resellerating score, the better the store's reviews. But sometimes, as with a store like CCI, the BBB score is high while the Resellerating score is very low. We judged that this store was not good, investigated further, and found that there was indeed something wrong with it.
To conclude: in the era of big data, everyone has come to understand that data is becoming more and more important. Social networks hold a great deal of data, and we need to try to extract the gold from it. If we can extract value from this big data, and not just from any single kind of data, it will give us new opportunities. This is a disruptive technology. Traditional industries that do not pay attention to data may be overthrown, and that also opens many new opportunities for setting up new companies.
The last disruptive technology was the Internet. When the Internet arrived, many new companies sprang up. In China, there was Alibaba. Traditional industries, such as much of the retail industry in the United States, ran into big problems. Bookstores were the first. A book is the same wherever you buy it, and an online store can ship it to you at the same price, so it is easier to buy it online. The same is true for electronics. So we can see that traditional industries that do not pay attention will run into trouble. The simplest example is taxis: the traditional taxi industry is struggling, because it is now more convenient to call a Didi from your mobile phone.
This is a great opportunity for our students. If you can capture big data, you may be able to start a new company. Our Prime Minister has also said that this is a big opportunity. Finding gold in data is not an easy task. As you have seen from what we discussed today, it requires technology, whether statistical techniques, computing techniques, or others. Only with this knowledge can you start a business.
In any case, big data will certainly have a great impact on many industries and on each of our lives. So I hope everyone takes an active part in today's symposium and comes away with some useful knowledge. Thank you!
Source: Data View