BBC Russian

Kubernetes 1.31 is now available on GKE, just one week after Open Source Release!

Wednesday, August 28, 2024

Kubernetes 1.31 is now available in the Google Kubernetes Engine (GKE) Rapid Channel, just one week after the OSS release! For more information about the content of Kubernetes 1.31, read the official Kubernetes 1.31 Release Notes and the specific GKE 1.31 Release Notes.

This release consists of 45 enhancements. Of those enhancements, 11 have graduated to Stable, 22 are entering Beta, and 12 have graduated to Alpha.

Kubernetes 1.31: Key Features

Field Selectors for Custom Resources

Kubernetes 1.31 makes it possible to use field selectors with custom resources. JSONPath expressions may now be added to the spec.versions[].selectableFields field in CustomResourceDefinitions to declare which fields may be used by field selectors. For example, if a custom resource has a spec.environment field, and the field is included in the selectableFields of the CustomResourceDefinition, then it is possible to filter by environment using a field selector like spec.environment=production. The filtering is performed on the server and can be used for both list and watch requests.

SPDY / Websockets migration

Kubernetes exposes an HTTP/REST interface, but a small subset of these HTTP/REST calls are upgraded to streaming connections. For example, both kubectl exec and kubectl port-forward use streaming connections. But the streaming protocol Kubernetes originally used (SPDY) has been deprecated for eight years. Users may notice this if they use a proxy or gateway in front of their cluster. If the proxy or gateway does not support the old, deprecated SPDY streaming protocol, then these streaming kubectl calls will not work. With this release, we have modernized the protocol for the streaming connections from SPDY to WebSockets. Proxies and gateways will now interact better with Kubernetes clusters.

Consistent Reads

Kubernetes 1.31 introduces a significant performance and reliability boost with the beta release of "Consistent Reads from Cache." This feature leverages etcd's progress notifications to allow Kubernetes to intelligently serve consistent reads directly from its watch cache, improving performance particularly for requests using label or field selectors that return only a small subset of a larger resource. For example, when a Kubelet requests a list of pods scheduled on its node, this feature can significantly reduce the overhead associated with filtering the entire list of pods in the cluster. Additionally, serving reads from the cache leads to more predictable request costs, enhancing overall cluster reliability.

Traffic Distribution for Services

The .spec.trafficDistribution field provides another way to influence traffic routing within a Kubernetes Service. While traffic policies focus on strict semantic guarantees, traffic distribution allows you to express preferences (such as routing to topologically closer endpoints). This can help optimize for performance, cost, or reliability.

Multiple Service CIDRs

Services IP ranges are defined during the cluster creation and can not be modified during the cluster lifetime. GKE also allocates the Service IP space from the VPC. When dealing with IP exhaustion problems, cluster admins needed to expand the assigned Service CIDR range. This new beta feature in Kubernetes 1.31 allows users to dynamically add Service CIDR ranges with zero downtime.

Acknowledgements

As always, we want to thank all the Googlers that provide their time, passion, talent and leadership to keep making Kubernetes the best container orchestration platform. From the features mentioned in this blog, we would like to mention especially Googlers Joe Betz, Jordan Liggitt, Sean Sullivan, Tim Hockin, Antonio Ojea, Marek Siarkowicz, Wojciech Tyczynski, Rob Scott, Gaurav Ghildiyal.

By Federico Bongiovanni – Google Kubernetes Engine

Fluent Bit WriteAPI Connector: Lowering the barrier to streaming data

Wednesday, August 21, 2024

Automating ingestion processes is crucial for modern businesses that handle vast amounts of data daily. In today's fast-paced digital landscape, the ability to seamlessly collect, process, and analyze data can make the difference between staying ahead of the competition and falling behind. To simplify ingestion, tools such as Fluent Bit enable customers to route data between pluggable sources and sinks without needing to write a single line of code. Instead, data routing is managed via a config file. The Fluent Bit WriteAPI Connector is a pluggable sink built on top of the BigQuery Storage Write API that enables organizations to rapidly develop a data ingestion pipeline.

What are the BigQuery Storage Write API and Fluent Bit?

The BigQuery Storage Write API is a high-performance data-ingestion API for BigQuery. It leverages both batching and streaming methods to ingest records into BigQuery in real-time. The WriteAPI offers features such as ability to scale and provides exactly-once delivery to guarantee that data is not duplicated. Using the Write API directly typically requires technical expertise, as users must navigate one of the client SDKs. This can create a high barrier to entry for some customers to stream data into BigQuery.

Fluent Bit is a widely-used open-source observability agent known for its lightweight design, speed, and flexibility. It operates by collecting logs, traces and metrics through various inputs such as local or network files, filtering and buffering them, and then routing them to designated outputs. Fluent Bit's high-performance parsing capabilities allow for data to be processed according to user specifications. The output component is a configurable plugin that directs data to different destinations, such as various tables in BigQuery. There can be multiple WriteAPI outputs and each output can be independently configured to use a specific write mode, enabling seamless data streaming into BigQuery based on tag/match pairs.

Why Use the Fluent Bit WriteAPI Connector?

Our solution to the technical challenges posed by using the WriteAPI is the Fluent Bit WriteAPI Connector. This connector automates the data ingestion process, eliminating the need for customers to write any code. The entire pipeline is managed through a single configuration file, making it easy to use. The flow of data is depicted in the diagram below.

Example Use Case

Say we wish to monitor a log file containing JSON data, and we would like to ingest this data into a BigQuery table that has a single column titled “Text” of type String. A line from the log file looks like this:

{"Text": "Hello, World"}

Setup Process

Setting Up Fluent Bit:

install

Cloning the Google Git Repository:

Google Git Repository

/usr/local/fluentbit-bigquery-writeapi-sink

plugins.conf

writeapi

[PLUGINS]

Path /usr/local/fluentbit-bigquery-writeapi-sink/out_writeapi.so

Setting Up BigQuery Tables:

BigQuery tables

Text

STRING

myProject.myDataset.myTable.

click to enlarge

Prepare the input file:

/usr/local/logfile.log

touch /usr/local/logfile.log

Configuring the Plugin:

/usr/local

demo.conf

format

This routes the data from /usr/local/logfile.log to the BigQuery table at myProject.myDataset.myTable. There are additional configurable fields that control the stream, such as chunking, asynchronous response queue, and also the type of stream. These fields let you control how your data is streamed.

To run the pipeline, use the command:

fluent-bit -c /usr/local/demo.conf

As the log file is updated new lines will automatically appear in the BigQuery table. For example, to populate the log file you can run the following command:

echo "{\"Text\": \"Hello, world\"}" >> /usr/local/logfile.log

Note that the default flush interval in Fluent Bit is 1 minute, so it might take a minute before the log file is flushed. The BigQuery table will now be updated as follows:

click to enlarge

Key Features

The connector supports a wide variety of features including multi-instancing, dynamic scaling, exactly-once delivery, and automatic retry.

1. Multi-Instancing

The multi-instancing feature of the Fluent Bit WriteAPI Connector is designed to offer flexibility in routing data. Specifically, users can configure the connector to handle multiple data inputs and outputs in various combinations. This feature also supports more complex configurations, such as multiple inputs feeding into multiple outputs, allowing data to be aggregated or distributed as needed. An input connector is labeled with a tag field. In our example, this has value log1. Data is routed to an output connector based on the value of its match field. In our example, this also has value log1, meaning there is a 1-to-1 correspondence between the input and output connector. The match field is a regex so it can be used to connect with multiple inputs. For example, if this was set to * then data from all inputs would flow to this output.

2. Dynamic Scaling

Handling large volumes of data efficiently is crucial for modern pipelines. The dynamic scaling feature addresses the issue of potential overloads in the Write API. As data is streamed into BigQuery, there may be times when the API queue becomes full—by default, it can hold up to 1000 pending responses. When this limit is reached, no new data can be appended until some of the pending responses are processed, which can create back pressure in the system. To manage this, the connector automatically scales up its capacity by creating an additional network connection when it detects that the number of pending responses has reached the threshold.

3. Exactly-Once

The "exactly-once" feature ensures that each piece of data is sent and recorded in BigQuery exactly once. This feature ensures no data is duplicated. If the connector encounters an intermittent issue while sending a specific piece of data, it will synchronously retry sending it until it is successful. This ensures data is delivered correctly.

4. Retry Functionality

The retry functionality allows the connector to handle temporary failures gracefully. The retry mechanism is configurable, meaning users can set how many times the system should attempt to resend the data before giving up. By default, the connector will retry sending failed data up to four times. In the default stream mode, if a row of data fails to send, it is retried while other rows continue to be processed. However, in the "exactly once" mode, the retry process is synchronous, meaning the system will wait for the failed row to be successfully sent before moving on to subsequent rows.

5. Error Handling

Error handling in the connector is designed to catch and manage issues that may arise during data transmission. The connector will continue processing incoming data even if earlier data had a failure. Any permanent issues that are encountered are logged to the console.

Conclusion

The ability to efficiently collect, process, and analyze data is a critical factor for business success. The Fluent Bit WriteAPI Connector stands out as a powerful solution that simplifies and automates the data ingestion process, bridging the gap between Fluent Bit's versatile data collection capabilities and Google BigQuery's robust analytics platform.

By eliminating the need for complex coding and manual data management, the Fluent Bit WriteAPI Connector lowers the barrier to entry for businesses of all sizes. Whether you're a small startup or a large enterprise, this tool allows you to effortlessly set up and manage your data pipelines with a single configuration file. Its features like multi-instancing, dynamic scaling, exactly-once delivery, and error handling ensure that your data is ingested accurately, reliably, and in real-time.

The straightforward setup process, combined with the flexibility and scalability of the connector, make it a valuable asset for any organization looking to harness the power of their data. By automating the ingestion process, businesses can focus on what truly matters: deriving actionable insights from their data to drive growth and innovation.

By Tanishqa Puhan, BigQuery WriteAPI

2023 Open Source Contributions: A Year in Review

Tuesday, August 13, 2024

At Alphabet, open source remains a critical component of our business and internal systems. We depend on thousands of upstream projects and communities to run our infrastructure, products, and services. Within the Open Source Programs Office (OSPO), we continue to focus on investing in the sustainability of open source communities and expanding access to open source opportunities for contributors around the world. As participants in this global ecosystem, our goal with this report is to provide transparency and to report our work within and around open source communities.

In 2023 roughly 10% of Alphabet’s full-time workforce actively contributed to open source projects. This percentage has remained roughly consistent over the last five years, indicating that our open source contribution has remained proportional to the size of Alphabet over time. Over the last 5 years, Google has released more than 7,000 open source elements, representing a mix of new projects, features, libraries, SDKs, datasets, sample code, and more.

Most open source projects we contribute to are outside of Alphabet

In 2023, employees from Alphabet interacted with more than 70,000 public repositories on GitHub. Over the last five years, more than 70% of the non-personal GitHub repositories receiving Alphabet contributions were outside of Google-managed organizations. Our top external projects (by number of unique contributors at Alphabet) include both Google-initiated projects such as Kubernetes, Apache Beam, and gRPC as well as community-led projects such as LLVM, Envoy, and web-platform-tests.

In addition to Alphabet employees supporting external projects, in 2023 Alphabet-led projects received contributions from more than 180,000 non-Alphabet employees (unique GitHub accounts not affiliated with Alphabet).

Open source remains vital to industry collaboration and innovation

As the technology industry turns to focus on novel AI and machine learning technologies, open source communities have continued to serve as a shared resource and avenue for collaboration on new frameworks and emerging standards. In addition to launching new projects such as Project Open Se Cura (an open-source framework to accelerate the development of secure, scalable, transparent and efficient AI systems), we also collaborated with AI/ML industry leaders including Alibaba, Amazon Web Services, AMD, Anyscale, Apple, Arm, Cerebras, Graphcore, Hugging Face, Intel, Meta, NVIDIA, and SiFive to release OpenXLA to the public for use and contribution. OpenXLA is an open source ML compiler enabling developers to train and serve highly-optimized models from all leading ML frameworks on all major ML hardware. In addition to technology development, Google’s OSPO has been supporting the OSI's Open Source AI definition initiative, which aims to clearly define 'Open Source AI' by the end of 2024.

Investing in the next generation of open source contributors

As a longstanding consumer and contributor to open source projects, we believe it is vital to continue funding both established communities as well as invest in the next generation of contributors to ensure the sustainability of open source ecosystems. In 2023, OSPO provided $2.4M in sponsorships and membership fees to more than 60 open source projects and organizations. Note that this value only represents OSPO's financial contribution; other teams across Alphabet also directly fund open source work. In addition, we continue to support our longstanding programs:

In its 19th year, Google Summer of Code (GSoC) enabled more than 900 individuals to contribute to 168 organizations. Over the lifetime of this program, more than 20,000 individuals from 116 countries have contributed to more than 1,000 open source organizations across the globe.

In its fifth year, Google Season of Docs provided direct grants to 13 open source projects to improve open source project documentation. Each organization also created a case study to help other open source projects learn from their experience.

In 2023, the Google Open Source Peer Bonus Program gave awards to 163 non-Alphabet contributors from the broader open source community representing 35 different countries.

A map of the world with highlighting every country that has had Google Summer of Code participants

Securing our shared supply chain remains a priority

We continue to invest in improving the security posture of open source projects and ecosystems. Since launching in 2016, Google's free OSS-Fuzz code testing service has helped discover and get over 10000 vulnerabilities and 34,000 bugs fixed across more than 1200 projects. In 2023, we added features, expanded our OSS-Fuzz Rewards Program, and continued our support for academic fuzzing research. In 2023, we also applied the generative power of LLMs to improve fuzz testing. In addition to this project we’ve been:

Collaborating on resources and frameworks: We’ve continued to support and participate in the Open Source Security Foundation (OpenSSF), which announced the release of SLSA v1.0, a framework that provides specifications for software supply chain security. We also supported the development of the malicious packages repository, the first open source system for collecting and publishing cross-ecosystem reports of malicious packages, and added GitLab support in v4.12 of Scorecard. The OSV Vulnerability Schema has continued to see strong organic adoption, with 8 new ecosystems adopting the format and publishing records for inclusion in OSV.dev.

Investing in tooling to increase trust and transparency: In 2023, Google’s Open Source Security Team (GOSST) helped to fund the launch of Trusted Publishing for PyPI and supported the rollout of 2FA enforcement across PyPI. GOSST also supported the launch of Sigstore-powered provenance in npm and other sigstore clients like Python to milestone releases. We also released Capslock, a capability analysis CLI tool that informs users of privileged operations (like network access and arbitrary code execution) in a given package and its dependencies, quantum resilient security keys as part of OpenSK (our open source security key firmware), and increased free public access to package dependency data through deps.dev’s new API.

Helping more projects adopt security best practices as well as identify and remediate vulnerabilities: Over the last year, the upstream team has proposed security improvements to more than 181 critical open source projects including widely-used projects such as NumPy, etcd, XGBoost, Ruby, TypeScript, LLVM, curl, Docker, and more. In addition to this work, GOSST continues to support OSV-Scanner to help projects find existing vulnerabilities in their dependencies, and enable comprehensive detection and remediation by providing commit-level vulnerability detail for over 30,000 existing CVE records from the NVD.

Our open source work will continue to grow and evolve to support the changing needs of our communities. Thank you to our colleagues and community members who continue to dedicate personal and professional time supporting the open source ecosystem. Follow our work at opensource.google.

Appendix: About this data

This report features metrics provided by many teams and programs across Alphabet. In regards to the code and code-adjacent activities data, we wanted to share more details about the derivation of those metrics.

Data sources: These data represent the activities of Alphabet employees on public repositories hosted on GitHub and our internal production Git service Git-on-Borg. These sources represent a subset of open source activity currently tracked by Google OSPO.

GitHub: We continue to use GitHub Archive as the primary source for GitHub data, which is available as a public dataset on BigQuery. Alphabet activity within GitHub is identified by self-registered accounts, which we estimate underreports actual activity.

Git-on-Borg: This is a Google managed git service which hosts some of our larger, long running open source projects such as Android and Chromium. While we continue to develop on this platform, most of our open source activity has moved to GitHub to increase exposure and encourage community growth.

Driven by humans: We have created many automated bots and systems that can propose changes on various hosting platforms. We have intentionally filtered these data to focus on human-initiated activities.

Business and personal: Activity on GitHub reflects a mixture of Alphabet projects, third-party projects, experimental efforts, and personal projects. Our metrics report on all of the above unless otherwise specified.

Alphabet contributors: Please note that unless additional detail is specified, activity counts attributed to Alphabet open source contributors will include our full-time employees as well as our extended Alphabet community (temps, vendors, contractors, and interns). In 2023, full time employees at Alphabet represented more than 95% of our open source contributors.

GitHub Accounts: For counts of GitHub accounts not affiliated with Alphabet, we cannot assume that one account is equivalent to one person, as multiple accounts could be tied to one individual or bot account.

^*Active counts: Where possible, we will show ‘active users’ defined by logged activity (excluding ‘WatchEvent’) within a specified timeframe (a month, year, etc.) and ‘active repositories’ and ‘active projects’ as those that have enough activity to meet our internal active-project criteria and have not been archived.

By Sophia Vargas – Analyst and Researcher, OSPO

opensource.google.com

Google Open Source Blog