GMSCM Student Demonstrates a New Solution to Private Cloud Computation

PublishDate: 11 Oct 2017

Oct. 11, 2017 Hangzhou - At the private cloud session of the YunQi Conference, I introduced ApsaraStackInsight (a lightweight private cloud big data solution) and BigdataCloud Console (a brand-new cluster operation tool). Because presentation time was limited, I gave only a brief talk, but our engineers have distilled many valuable ideas and lessons from real-world projects, which I would like to share with readers who want to study this area further.

During the Session

I have worked as a system engineer since I joined Alibaba, and three years ago I took on another role in the private cloud area. The business was new and challenges came from all sides, but in short they came down to one major question: how do we sell our products along with our professional skills in the new private cloud market? Simply adding more product features would not solve every issue. We also need to dive into the customer's system architecture, understand their business, and on that basis make technical decisions quickly and execute them. One example is disaster recovery, where network latency plays an important role. In an enclosed environment, latency is defined and optimized by us, but in a customer's environment it is constrained to a much narrower range, or even has to become self-adaptive. Fortunately, my previous Ops experience enabled me to solve this problem well.

In the beginning, our ability to deliver productivity and stability in large-scale environments stood out in the market, but then came demand for medium-scale and even small-scale production environments. Our previous experience did help, but we needed to study further. First of all, we needed to figure out what exactly a lightweight big data solution is. People might think big data is only for "big" scenarios, and there is nothing wrong with that: when a business has already grown big enough, its scenarios are most likely plain and consistent (e.g. e-commerce, online gaming). But where should smaller businesses, which do not have that scale, find a suitable product? Think of the early days of personal computers 30 years ago, when people used to buy parts from different sources and build machines themselves. After that, notebooks showed up, followed by all-in-one computers, and then Macs. In the self-build era you might not have thought about Mac OS (price aside), but in the all-in-one era Mac OS became outstanding, moving from the particular to the common. Likewise, ApsaraStackInsight is our answer for enterprise big data HW/SW integration.

The second question is: which stage are the customers who need a lightweight solution in? In the private cloud market, Alibaba has always been the technical leader, while some competitors have tried an "open source software + cheap hardware" strategy to catch up, or even to overtake us in the market. Customers also sometimes do not know exactly what they want or which solution is best. This is where ApsaraStackInsight fits: 1. It represents Alibaba's big data HW/SW integration solution; 2. With earlier market demands fulfilled, ApsaraStackInsight has begun to evolve by solving problems in different projects. Finally, as needs move from customized to common, ApsaraStackInsight may not be the only answer, and similar concepts will surely show up. HW/SW integration is not new, and some long-established IT service companies have similar solutions; the difference is that our lightweight cloud can connect with the public cloud. To summarize, ApsaraStackInsight is a general technical solution that customers already use, and the revolution is just beginning.

So why does ApsaraStackInsight have the potential to succeed?

1. From the beginning of our study, we started from existing customer scenarios and mature environments, so the solution has kept its product functionality and specialities, which means ApsaraStackInsight still works for ordinary big data businesses. Meanwhile, at the final delivery stage, we give customers and partners a free choice of product combinations, which makes it easy for them to join in and seek market opportunities.

Architectural Design

2. To understand the architecture of ApsaraStackInsight, which is a mature and stable solution, imagine the blue rectangle above as a server rack. The inside and outside networks are isolated and independent, so changes in inside network traffic have no impact on the customer's environment. Our solution contains a complete infrastructure and can communicate with the outside on demand. The minimum size starts from 10+ x86 servers, which substantially helps customers cut their budget. We also provide strong operational features, such as faster boot-up, graceful shutdown, and automatic recovery on any service node or computing node. All serving modules in the cluster run in "up-up" mode, which allows hot switching at any time during runtime, so a physical server outage is guaranteed not to affect the services running above it. Storage nodes follow classic distributed system design guidelines, so data survives even when any two physical nodes disappear.
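To illustrate the last point, here is a minimal sketch of why data can survive the loss of any two physical nodes. It assumes each data block is kept as three replicas on distinct nodes; the replica count, the round-robin placement and the names place_replicas and survives_failures are assumptions made for this example, not a description of the actual ApsaraStackInsight storage layer.

```python
from itertools import combinations


def place_replicas(block_id: int, nodes: list[str], replicas: int = 3) -> list[str]:
    """Pick `replicas` distinct nodes for a block (hypothetical round-robin placement)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


def survives_failures(placement: dict[int, list[str]], nodes: list[str], failures: int) -> bool:
    """Return True if every block keeps at least one replica for every
    possible set of `failures` lost nodes."""
    for lost in combinations(nodes, failures):
        lost_set = set(lost)
        for holders in placement.values():
            if all(holder in lost_set for holder in holders):
                return False  # this block lost all of its replicas
    return True


# Assumed cluster: 10 storage nodes, 100 blocks, 3 replicas per block.
nodes = [f"node-{i}" for i in range(10)]
placement = {block: place_replicas(block, nodes) for block in range(100)}

print(survives_failures(placement, nodes, failures=2))  # any two nodes may disappear
print(survives_failures(placement, nodes, failures=3))  # three losses can wipe out a block
```

With three replicas on distinct nodes, removing any two nodes always leaves at least one copy of every block, while a third simultaneous failure can remove all copies of some block.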

3. The products integrated in ApsaraStackInsight are all leaders in their respective areas in terms of functionality, performance and community activity.

Comparison with Others

To conclude, the abilities offered by ApsaraStackInsight, such as system stability in batch production scenarios, significantly lower cost, friendliness to new partners, and integration of replicable data applications, make me believe that ApsaraStackInsight will take the leading position in the market by replicating its HW/SW integration solution from particular customers to the overwhelming majority, just like the all-in-one story.

So, what is the relationship between BCC (BigdataCloudConsole) and ApsaraStackInsight? Operation management is critically important, and we realized the necessity of productizing our operation management skills. This is where BCC comes in.


We already had some powerful operation tools, but the question was: are they ready to be delivered to outside customers? We tried, and it was not that easy. Customers are different. Our engineers are more professional, while private cloud customers normally do not have a full-time person focused on this area, or are not even concerned with system operation. In this case, system operation must be part of our overall solution, delivered as a separate product. This product has the same scope and solves the same problems, but uses different methods. On the other side, Alibaba started building its own big data platform in 2009, with an entire team working on it, aiming from the start to automate operations over clusters sized 100K+ and gathering daily information using our own platforms. Tesla is our internal operation system, designed to replace traditional human operations (the oil engine) with green, intelligent and data-driven operations (the electric engine). In recent years, the computing platform that supports Alibaba's fast-growing cluster capacity and business size company-wide has benefited greatly from Tesla's efficient support, as well as from our product-capable SREs. The team has grown far more slowly than the business and is still able to handle the volume and complexity of the businesses, which was once thought impossible.

On the other hand, BCC is the best practice of Tesla in the private cloud market. It also aims to help customers do basic system operation and advanced product operation without human intervention. Although the products are the same (both batch and real time), the practice is different: we have our own SRE team, which grows with our businesses and products, but we cannot ask for a similar SRE team in customers' companies. Thus, BCC is also trying to solve the people problem. As an analogy, Tesla is like manual driving in a sophisticated scenario, while BCC is fully self-driving in a simpler scenario.

To conclude, our internal experience in large cluster management and performance optimization helped us greatly when developing BCC: more than 50% of the functions were fully tested and then migrated into BCC. The main difference is the user experience. BCC was delivered to the market along with other big data products at a very early phase, and after several years of iteration it can be regarded as designed exclusively for private cloud: it solves complicated operations in an elegant and unobtrusive way. It abstracts a single service layer from the managed products, which provides various automation features such as automatic service recovery. Most importantly, all products can be managed with a consistent user experience. How did we achieve this?

Improved UI

First, what kind of problems does BCC solve? The question sounds naive, but in fact we researched various operation products, and none of them is fully friendly to end users. Most are function-driven, with each entry point standing for a separate function. For instance, when you want to check product health, you have to find a health-related button and then click another button to investigate further. This might be fine in our internal environment, where users are often professional Ops engineers or product developers, but the same investigation path may not work as expected in a private cloud. Users there are unlikely to know all the details of product functions and execution results, which leads them to rely heavily on manuals or training courses, and a product that leans too much on manuals or training only gets worse. Take another example: everyone uses a smartphone every day, but how many people read the manual first? In that case, the problem lies in the design, not the users. The same holds for ApsaraStackInsight system operations. The challenge is to extract a consistent operation experience from products with different architectures and user scenarios. So when designing BCC, we abandoned the traditional user entry points and abstracted different layers from the managed products and computing engines, considering both physical and logical abstractions. In the picture above, each region and cluster stands for the physical location of a product, while products and services represent the logical relations. Each piece of physical and logical information is presented as an object in the tree menu, and each object has its own properties, including charts, operations and statuses. All users need to do is follow the guidance and hints on the web pages, and they will gain a clear understanding of the managed product structures and operation paths. Even users who are new to these products will not find it difficult to get started.
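The tree menu described here can be pictured with a small data structure. The sketch below is only an illustration under assumed names (MenuObject, unhealthy_paths, region-a, batch-engine and so on are hypothetical and do not come from BCC itself): every physical or logical item is one object with a status, operations and charts, and the objects nest from region down to service.

```python
from dataclasses import dataclass, field


@dataclass
class MenuObject:
    """One entry in the tree menu (hypothetical model, not the BCC schema)."""
    name: str
    kind: str  # "region", "cluster", "product" or "service"
    status: str = "healthy"
    operations: list[str] = field(default_factory=list)
    charts: list[str] = field(default_factory=list)
    children: list["MenuObject"] = field(default_factory=list)

    def add(self, child: "MenuObject") -> "MenuObject":
        """Attach a child object and return it so calls can be chained."""
        self.children.append(child)
        return child

    def walk(self, path: tuple[str, ...] = ()):
        """Yield (path, object) for this object and everything below it."""
        here = path + (self.name,)
        yield here, self
        for child in self.children:
            yield from child.walk(here)


def unhealthy_paths(root: MenuObject) -> list[str]:
    """Return the menu path of every object whose status is not healthy."""
    return ["/".join(path) for path, obj in root.walk() if obj.status != "healthy"]


# Physical layers (region, cluster) contain logical layers (product, service).
region = MenuObject("region-a", "region")
cluster = region.add(MenuObject("cluster-1", "cluster", charts=["cpu", "disk"]))
product = cluster.add(MenuObject("batch-engine", "product", operations=["restart"]))
product.add(MenuObject("scheduler", "service", status="degraded", operations=["restart"]))
product.add(MenuObject("worker", "service"))

print(unhealthy_paths(region))  # ['region-a/cluster-1/batch-engine/scheduler']
```

Because every layer uses the same object shape, a user can drill down the same way regardless of which product is being managed, which is the consistency the paragraph above describes.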


Second, to reduce how often users need to visit BCC and let them focus on development, we added hundreds of health checks, application configurations and periodic checks that run by themselves rather than waiting to be triggered, so failures are detected automatically and systems heal themselves. All the information is transparent to users: when they select any object at a product's service level, all the related information and configurations show up on one page, so users only need to check on demand and focus on the advice given. Moreover, for skilled operation engineers, BCC makes these check items configurable, so they can easily carry out advanced and customized operations.
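A self-executed check with self-healing could look roughly like the following sketch. The HealthCheck structure, the run_checks function and the in-memory services dictionary are assumptions made for illustration; the real BCC checks and their configuration format are not described in this article.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthCheck:
    """A configurable check item (hypothetical structure)."""
    name: str
    probe: Callable[[], bool]   # returns True when the target is healthy
    heal: Callable[[], None]    # remediation to run when the probe fails
    enabled: bool = True        # operators can switch a check on or off


def run_checks(checks: list[HealthCheck]) -> list[str]:
    """Run every enabled check once, heal failures, and return advice for users."""
    advice = []
    for check in checks:
        if not check.enabled or check.probe():
            continue
        check.heal()
        if check.probe():
            advice.append(f"{check.name}: failure detected and healed automatically")
        else:
            advice.append(f"{check.name}: still failing, manual action advised")
    return advice


# Simulated service state standing in for a real cluster.
services = {"scheduler": "down", "storage": "ok"}


def restart(name: str) -> None:
    services[name] = "ok"


checks = [
    HealthCheck("scheduler", lambda: services["scheduler"] == "ok", lambda: restart("scheduler")),
    HealthCheck("storage", lambda: services["storage"] == "ok", lambda: restart("storage")),
]

advice = run_checks(checks)
print(advice)
```

In a real deployment a scheduler would call something like run_checks on a fixed interval, so that users only see the resulting advice rather than having to trigger checks themselves.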

Third, system operation itself generates a huge amount of data, any operation or change to a product's service levels and configurations can potentially affect the whole system, and all of this data is changing all the time. It is impossible to expect every user to know everything about the products and systems, especially when a system outage happens. So how can we collect all the relevant information with a single keyword and show all related products and services compactly? The global search feature in BCC solves this. It offers a search-like entry point, but does more than just search: it lists all highly relevant data on one page.
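One common way to answer a single keyword with everything related to it is an inverted index over all operational records. The sketch below uses that technique as an assumption; the article does not say how BCC's global search is implemented, and the record fields (kind, name, detail) and function names are hypothetical.

```python
from collections import defaultdict


def build_index(records: list[dict]) -> dict[str, set[int]]:
    """Map every lower-cased word in every field to the records containing it."""
    index = defaultdict(set)
    for position, record in enumerate(records):
        for value in record.values():
            for token in str(value).lower().split():
                index[token].add(position)
    return index


def global_search(keyword: str, records: list[dict], index: dict[str, set[int]]) -> dict[str, list[str]]:
    """Return the names of all matching records, grouped by record kind."""
    grouped = defaultdict(list)
    for position in sorted(index.get(keyword.lower(), set())):
        record = records[position]
        grouped[record["kind"]].append(record["name"])
    return dict(grouped)


# Hypothetical operational data: services, alerts and configuration items.
records = [
    {"kind": "service", "name": "scheduler", "detail": "batch scheduler on cluster-1"},
    {"kind": "alert", "name": "scheduler-down", "detail": "scheduler not responding"},
    {"kind": "config", "name": "scheduler.max_jobs", "detail": "limit for scheduler queue"},
    {"kind": "service", "name": "storage", "detail": "storage on cluster-1"},
]
index = build_index(records)

print(global_search("scheduler", records, index))
print(global_search("cluster-1", records, index))
```

Grouping the hits by kind is what lets one page show the affected service, its alerts and its configuration side by side during an outage.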

To sum up, BCC has focused on the private cloud market from design through build, and targets automated operation of a batch-produced HW/SW integration product, using the innovative methods and features above.

Rome wasn't built in a day. ApsaraStackInsight and BCC are the first step in our data infrastructure team's transformation into an architecture/development/product team. We are experiencing the pain brought by this change while continuing to explore unknown territory. Throughout this time, everyone has moved forward together: the design and testing phase fell during Chinese New Year, but we kept working; the refactoring of BCC brought tons of extra workload to the team, but all of us made it. Now we can see the light of a new age, and all the predictions are becoming real. Our team is eagerly looking for more talented people with product sense, development skills and an operations background to join us. The team will help you grow as well.

We are hiring. Contact me in person:

The data infrastructure team is a development team focusing on big data operation automation, user product development, operation platform development and private cloud support.

(GMSCM 2016 LIU Yi)